I like to follow developments in unstructured data and text mining. Advances in these areas can mean big things for PR and social media marketing. The din is only growing in social media chatter and online content. Those with the best tools will be better equipped to glean insight and turn it into action.
One area in the field that I feel has much potential is topic modeling. It came to my attention during a series of conversations with Stanford University researcher Jure Leskovec (I had been speaking with him about how info spreads online; see this post, which describes the findings of Jure and his team).
I had wanted to learn about technologies and research that can help understand online content, and answer questions like “what topics are being shared, discussed and trending?”
On the surface, this might sound simple. There are many monitoring tools, trending reports and social media dashboards that claim to do just that. But they might not do such a good job when different words are used to describe the same topic, when slotting content into very granular “buckets”, or when spotting totally new trends, keywords or topics.
Topic modeling is an example of unsupervised machine learning. This means that its algorithms can identify the topics in content without being told what to look for. One of the most promising methods is Latent Dirichlet Allocation (LDA).
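To make "without being told what to look for" a little more concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, one common way to estimate the model. Everything in it is an assumption for illustration: the function name `lda_gibbs`, the tiny made-up corpus, and the values for `alpha`, `beta` and the number of topics. Real tools use far more careful and faster implementations.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start by giving every token a random topic.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # A topic is likely if it is common in this document
                # and if this word is common in that topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions for each document.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # Most frequent words assigned to each topic.
    top_words = [
        [w for w, c in tw.most_common(3) if c > 0] for tw in topic_word
    ]
    return theta, top_words

# Made-up corpus: no labels or keyword lists are supplied.
docs = [
    "brand launch campaign press".split(),
    "press release campaign media".split(),
    "media brand launch press".split(),
    "game score team win".split(),
    "team win season score".split(),
    "season game team score".split(),
]
theta, top_words = lda_gibbs(docs)
for k, words in enumerate(top_words):
    print("topic", k, words)
```

The only inputs are the documents and the number of topics. The groupings come from which words tend to appear together, which is why the output still needs a human to decide what each "topic" means.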
It can get very complicated, which is why I was thrilled to run across the Stone and the Shell blog, which is run by Ted Underwood of the University of Illinois. His post Topic Modeling Just Simple Enough offered a great overview of the subject.
It whetted my appetite, and I wanted to learn more – so I reached out to Ted, and he graciously consented to an email interview, summarized below.
I will follow this up as I learn more; meanwhile, I wish to thank Ted Underwood for helping to shed light on topic modeling and its implications for PR.
I hope my blog visitors find the information helpful and interesting. Thanks for reading.
Can you use LDA to:
(A) Categorize short-form content such as tweets by topic?
TU: Yes, LDA does work on short-form content, but tweets are short enough that you may lose some conceptual connections that would be visible in longer forms. (LDA will only see the connection if ‘both parts’ of it are contained in a single document, so very short documents become a limitation). Some researchers have recommended aggregating all the tweets of one author, in order to make those connections more visible.
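The aggregation idea Ted mentions can be sketched in a few lines. This is only an illustration: the function name `pool_by_author`, the `(author, text)` input format and the whitespace tokenizing are my assumptions, not something taken from the research he refers to.

```python
from collections import defaultdict

def pool_by_author(tweets):
    """Merge each author's tweets into one longer pseudo-document.

    tweets is a list of (author, text) pairs; the result maps each
    author to a single list of lowercased tokens.
    """
    pooled = defaultdict(list)
    for author, text in tweets:
        pooled[author].extend(text.lower().split())
    return dict(pooled)

# Made-up example data.
tweets = [
    ("ann", "Topic models are neat"),
    ("bob", "Great game tonight"),
    ("ann", "Trying LDA on tweets"),
]
pooled = pool_by_author(tweets)
print(pooled["ann"])
```

Each pooled list would then be handed to the topic model as one document, so that words from different tweets by the same author can end up in the same topic.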
(B) Discover rising topics on Twitter?
TU: Yes, potentially, although I think in reality you might be better off just looking for words or short phrases on the rise. The “topics” produced by LDA are diffuse enough that they can often be a little tricky to interpret. This makes them interesting, but it’s not necessarily what users would want for a “trending topic.” If you wanted to do this, you would probably also want to select a topic-modeling algorithm that’s designed to identify topics with a particular temporal profile: something like “Topics over Time” could be tuned specifically to reveal topics that are on the rise. Otherwise, any topic model could reveal a lot of topics that are just, e.g., “youthful slang” or “scientific jargon” (kinds of language linked by demographic patterns rather than trends).
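The simpler alternative, looking for words on the rise, might look something like the sketch below. The function name `rising_terms`, the two-window comparison, the add-one smoothing and the `min_count` threshold are all assumptions chosen for illustration. A real system would at least normalize for overall volume and handle phrases as well as single words.

```python
from collections import Counter

def rising_terms(earlier_docs, later_docs, min_count=2, top_n=5):
    """Rank words by how much more often they appear in the later window."""
    before = Counter(w for d in earlier_docs for w in d.lower().split())
    after = Counter(w for d in later_docs for w in d.lower().split())
    # Add-one smoothing keeps brand-new words from dividing by zero.
    scores = {
        w: (c + 1) / (before[w] + 1)
        for w, c in after.items()
        if c >= min_count
    }
    # Highest growth first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

# Made-up example data for two time windows.
earlier = ["new phone launch today", "phone review"]
later = ["data breach reported", "breach affects phone users", "breach update"]
rising = rising_terms(earlier, later)
print(rising)  # ['breach']
```

Note that this only catches a trend when people use the same word for it, which is exactly the limitation a topic model is meant to get around.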
(C) Identify favored topics of influencers by analyzing article content and social media updates?
TU: This is an example of a place where I think predictive analytics (supervised learning algorithms) would likely perform better than an unsupervised method like LDA. Unsupervised models can be startling because they’re able to find patterns without being told what to find. But if you actually already know what you want to find (e.g., if you want to know how a particular tweeter, or a particular influential group of them, differs from others) there are usually simpler and more direct ways to model that boundary.
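As a rough illustration of the supervised route, here is a tiny naive Bayes text classifier, which is just one simple example of a supervised method and not something Ted specifically recommended. The labels, the sample posts and the function names `train_nb` and `predict_nb` are all hypothetical.

```python
import math
from collections import Counter

def train_nb(labeled_docs):
    """Count words per label from a list of (label, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest (log) naive Bayes score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(word_counts):
        counts = word_counts[label]
        # Add-one smoothing over the vocabulary seen in training.
        denom = sum(counts.values()) + len(vocab)
        score = math.log(doc_counts[label] / total_docs)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data: we already know which posts came from whom.
training = [
    ("influencer", "machine learning research models"),
    ("influencer", "new paper on topic models"),
    ("others", "lunch was great today"),
    ("others", "great game last night"),
]
model = train_nb(training)
guess = predict_nb(model, "topic models research")
print(guess)  # influencer
```

The difference from LDA is that the groups are supplied up front, so the model only has to learn what separates them rather than discover groupings on its own.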
(D) Do all this in near real time (assuming you have access to article text and the Twitter “fire hose”)?
TU: Here you’d really need to talk to someone with more CS or business background than I have, because this becomes a question about optimizing the performance of really large systems. My historical data mining sometimes gets big (a million volumes), but I’m never required to do that on the fly in real time as text is produced. In principle, I’m sure it’s doable; I know people have worked on ways to make topic models “updatable” so you don’t have to re-run the whole thing every time you get more data. For instance Hoffman and Blei have this article. But there are going to be challenges, and I wouldn’t know exactly how severe they are in practice.
Also, does topic modeling take a semantic approach, i.e. identify the words and content that belong to a topic, when the words used to describe the topic may vary?
TU: Yes, this is its great strength, and it’s the exception to what I said above about mere word-charting probably being better for “trending topics.” If a topic could be described in lots of different ways, LDA might actually be better at revealing it. (On the other hand, this flexibility also means that LDA may reveal things we don’t think of as “topics”, e.g., patterns that are really just the typical diction of particular demographic groups.)