Aug 14, 2015

Topic Analysis Exploration

I've been experimenting with Natural Language Processing, and I'm keenly interested in unsupervised techniques such as LDA and LSI. Unsupervised methods in general, like clustering and neural networks, fascinate me because they can provide meaning without preconceived influence. The basic steps to set up LSI or LDA analysis are covered in the Gensim tutorials. If you don't know Gensim, it's a pretty sweet set of libraries for topic analysis, and there's even a port of Google's Word2Vec to Python with some key performance improvements. I appreciate the focus on performance here, something I think is rare in academically oriented libraries.

My current knowledge of NLP is still pretty elementary, but I've focused on seeing a) what's possible and b) what has tutorials/libraries to get going. For the purposes of this blog I'll stick to topic analysis, which Gensim does well. Roughly speaking, here was my R&D process, which wasn't rigorous or scientific by any means; lots of trial and error. (A rough code sketch follows the list.)

  1. Pull blogs from Elasticsearch by country
  2. Filter stop words, and perform lemmatization/stemming
  3. Create a corpus and dictionary
  4. Run that corpus through LDA or LSI
  5. 'Read the tea leaves' (a topic is just a collection of words, which can be difficult to interpret and requires some insight)
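
Here's a minimal sketch of steps 2 through 4 in Gensim, assuming docs is a list of blog-post strings already pulled from Elasticsearch. The tokenization, stop word handling, and parameter values are placeholders rather than my exact setup:

    from gensim import corpora, models
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer

    stop = set(stopwords.words('english'))   # or a custom stop word list
    stemmer = SnowballStemmer('english')

    # Step 2: tokenize, drop stop words, stem
    texts = [[stemmer.stem(w) for w in doc.lower().split() if w not in stop]
             for doc in docs]

    # Step 3: dictionary and bag-of-words corpus
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Step 4: LDA (swap in models.LsiModel for LSI)
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=50, passes=10)
    for topic in lda.show_topics(num_topics=5, num_words=10):
        print(topic)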

First I tried LSI, and at times you have to really investigate what the topic is about. However, the sample below (from LDA analysis) is a bit more straightforward; it comes from an extraction on Uganda travel blogs. The print format is a bit confusing: probability1*word1 + probability2*word2... A collection of words is listed with the corresponding 'strength' of each word. This is a topic on safaris in Queen Elizabeth National Park:

topic #3 (0.010): 0.010*bwindi + 0.010*lions + 0.008*queen + 0.008*elephants + 0.008*elizabeth + 0.007*tracking + 0.007*impenetrable + 0.007*gorillas. + 0.005*elephant + 0.005*park

This one is a bit tougher; take a look. Goats and beneficiaries? WTF? Type those two words into africa.wanderight.com and filter by 'Uganda'. You'll see a few blogs related to Vets without Borders (VWB). Pretty cool, huh?

topic #24 (0.010): 0.012*goats + 0.005*goats, + 0.005*beneficiaries + 0.003*pens + 0.003*disabled + 0.002*chuck + 0.002*background + 0.002*vaccinate + 0.002*tracked + 0.002*right?

Having a fast search engine on hand to pair words together has been super helpful for figuring out what a topic really is. But it's hit or miss. I have no idea what the one below is about. Maybe you can figure it out.

topic #34 (0.010): 0.004*learned + 0.004*played + 0.004*stories + 0.004*tents + 0.004*resort + 0.004*dance + 0.004*grateful + 0.003*treat + 0.003*exhausted + 0.003*medicine

Here are some things I've tried to get better topics.

  • Improved stop words. Originally I used the NLTK list, but then I just used this list
  • I've recently played with stemming and I think the results have improved slightly (Why didn't I use the raw Snowball field within Elasticsearch? In short, I couldn't find anything with 5 minutes of googling, but really I liked having more control over the data, like stopwords)
  • With LDA, I tried more passes, which improves results at the cost of performance. That's fine for my exercise. Similarly, I tried more iterations with LSI.
  • Varied the number of topics. With some of these countries I don't have a ton of blogs; for Uganda I have just a bit over 400. I haven't nailed this down yet, but 50 seems to do ok.
  • Just now I tried a bigram method (see the sketch after this list) since some of these blog posts are so long. The results aren't as strong, and I can see that it's just clustering random things from single blogs. But there are still some telling word pairs: 'white water', 'health center', 'gorilla tracking'.
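
For the curious, the bigram attempt was built on Gensim's Phrases model over the tokenized texts from the sketch above; again, a sketch rather than my exact code:

    from gensim.models import Phrases

    # Learn frequent word pairs, then rewrite each document so pairs like
    # 'white', 'water' become a single token 'white_water'
    bigram = Phrases(texts)
    bigram_texts = [bigram[text] for text in texts]

    # Rebuild the dictionary and corpus from bigram_texts and rerun LDA as before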

I probably should have used a more rigorous method for optimizing the inputs, but I talked myself into thinking these are subjective enough anyway. Since I ran so many quick trials, I knew what should show up if I varied something. Another method I used to gauge the efficacy of topic analysis in general was to see if I could find the 'Things to Do' listed in TripAdvisor. Getting those topics, I presume, would just get my foot in the door. The reality is that I hope to find things that are difficult to track down in TripAdvisor or a general Google search. Like Vets without Borders. Volunteering as a means of travel is totally legit, but not a money-making adventure. That's probably why you can't find it within travel channels, which is why I think online travel planning sucks.

My overall impression of LDA and LSI is lukewarm at best. I can find some interesting things fast, but there are only a few of those things in 50 or 100 topics; for the rest you have to dig a bit. So I might have half a dozen topics that are solid. Perhaps part of the problem stems from certain blogs that are really many blogs combined into one. Words and documents are the key pieces in these methodologies that provide boundaries for the model to learn from, so perhaps to get better data I could make each paragraph its own document. Something worth trying. But for now I'd be willing to bet I've progressed to the edge of the 80/20 rule. Any future gains would be incrementalish.
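
If I do try the paragraph-as-document idea, the change is small. Something like this, assuming paragraphs in the raw blog text are separated by blank lines:

    # Split each blog into paragraphs and treat each paragraph as its own document
    para_docs = [p.strip() for doc in docs
                 for p in doc.split('\n\n') if p.strip()]
    # Then run para_docs through the same tokenize/dictionary/LDA pipeline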

Back to my TripAdvisor hypothesis. It turns out topic analysis combined with manual interpretation can't match the 'things to do' in TripAdvisor. My sense is that there is a collection of techniques that will get me there. I'm trying some of those things, and I've still got a lot to learn. If you're interested in diving a bit deeper into topic analysis, check this out.