Word clouds: No doubt you've seen these graphic representations of the most commonly used words in a body of text floating around the internet. They are especially popular after big political speeches. Thanks to IBM researcher Jonathan Feinberg's web site Wordle.net, word clouds are easy for anyone to create. To draw a loose analogy: Wordle has been to text analysis what Blogger or Facebook has been to online publishing - a great tool for democratizing what used to be an elite skill. Are there substantial limitations to this word cloud format that need to be taken into consideration, though?
New York University political science PhD student Drew Conway thinks there are. Conway hosted an interesting debate on his blog this week about one of the key concerns with word clouds, and he offered an alternative model for understanding bodies of text. He calls word clouds "spatial visualization wherein space is meaningless." That's hard to argue with. Check out one of the models he proposes as a possible next step in the word cloud's evolution.
Above, Conway's visualization of words used by both President Obama and Sarah Palin in their speeches about the shooting of Congresswoman Giffords in Tucson. On the left are words used by both but more by Palin; on the right, words used by both but more by Obama.
The big picture here is that both the x and the y axes are opportunities to convey something important.
Conway explains what he did:
To understand how these speeches compared I first needed to create a term-frequency matrix, which contained only words used in both speeches. After removing common English stop words and the word 'applause' (Obama's speech was in front of a live audience), and retaining only words contained in both speeches at least once, I was left with 103 words to visualize.
To show how the two speeches contrasted, I decided to use the x-axis position to pull words used more by one politician closer to either the left or right of the plot. Words used more by Palin are to the left, and likewise words to the right were used more by Obama. The color reinforces this information, making words Palin words darker red, and Obama darker blue...
While this is a very simple extension of the traditional word cloud, much more can be learned from it. For example, both politicians used the words "congresswomen," and "america" equally but also frequently. While the word "tragedy" is used often in both speeches, but slightly more by Obama. The edges are most interesting. Palin repeated the shared terms "ideas," "debate," "victims," "values," and "strength," while Obama focused on "people," "lives," and "life."
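For readers who want to try something similar, here is a minimal sketch of the steps Conway describes: count terms, drop stop words, keep only words found in both speeches, and derive a left/right position from who used each word more. This is not Conway's code. The function names, the tiny stop-word list and the particular x-position formula (the difference in counts divided by the combined count) are assumptions made for illustration, and raw counts are used without adjusting for speech length.

```python
import re
from collections import Counter

# Illustrative stop-word list only (an assumption); a real analysis
# would use a fuller list of common English words.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is",
              "we", "our", "that", "applause"}

def term_counts(text):
    """Lowercase the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def shared_term_positions(text_a, text_b):
    """Return {word: (x, total)} for words that appear in both texts.

    x is negative when speaker A used the word more and positive when
    speaker B did; total is the combined count, which could set font size.
    """
    a, b = term_counts(text_a), term_counts(text_b)
    positions = {}
    for word in a.keys() & b.keys():       # keep only shared words
        total = a[word] + b[word]
        x = (b[word] - a[word]) / total    # hypothetical position formula
        positions[word] = (x, total)
    return positions

if __name__ == "__main__":
    speech_a = "Debate and ideas. Debate strengthens America."
    speech_b = "People and lives. America mourns. America debate."
    for word, (x, total) in sorted(shared_term_positions(speech_a, speech_b).items()):
        print(f"{word}: x={x:+.2f}, total={total}")
```

Feeding the resulting positions to any plotting library, with x on the horizontal axis and the combined count controlling text size, would give a rough version of the layout described above. Words unique to one speech drop out at the intersection step.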
Conway acknowledges that this visualization has its own limitations, specifically that the words not intersecting between both speeches aren't represented here. He posts the Wordle charts for both speeches as well in his blog post.
Nielsen's Keith Stewart points, for example, to that company's word association visualizations using concentric circles to represent associations across Twitter, blogs and Google search results. Stewart says that accurate sentiment analysis is a painstaking process with high overhead. This kind of association analysis seems much more scalable to me, though, if cruder.
What are your thoughts, readers, on how visualization of text analysis could evolve to become more effective?
Disclosure: IBM, the company where Wordle was in large part developed, is a ReadWriteWeb sponsor. We think that's pretty cool.