I'm headed to Croatia soon, so I decided to take another run at topic analysis, this time with more than a cursory look. One of the challenges I've had with various topic analyses is how to validate the result set. Recently I found a tool called LDAVis which does just that for LDA topic analysis. I won't go into LDA too much here, but Ted Underwood does *the* best job at explaining LDA in an intuitive way. LDAVis lets you examine the condition of each topic through its visual relation to other topics, how distinct topics are from one another, and term frequency/saliency within topics. That sounds pretty heady, but I'll walk through how I used the tool across many topic extraction/data cleaning runs to ultimately end up with some interesting results.
My dataset includes blogs scraped from travelblog.org and travelpod.com, where I kept the body and title of each post. I have this dataset locally on my computer in ElasticSearch, and it includes 1500 blogs on Croatia alone. ElasticSearch isn't strictly necessary, but since I have a GUI, Wanderight, that searches blogs, I used a customized version to look up words I didn't recognize within topics. Tangentially, this quick loop of term discovery and blog review helped me identify activities and regions to travel for my upcoming trip. Nerd heaven. Below is a sample screenshot from my "Croatian" version of Wanderight. You can see that I searched "organ", which showed up in one of the major topics but struck me as odd. It turns out there is a pipe organ in the city of Zadar called the Sea Organ (described here, 2nd paragraph from the bottom). It plays music by way of sea waves underneath a large set of marble steps. Cool!
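For the curious, the lookup behind that screenshot is nothing fancy: a full-text match query against the local ElasticSearch index. Here's a hypothetical sketch; the index name and field name are my own placeholders, not Wanderight's actual schema:

```python
# Hypothetical sketch of a term lookup against the blog index.
# The index name ("blogs") and field name ("body") are placeholders.
def term_query(term, size=10):
    """Build an ElasticSearch match query for a term seen in a topic."""
    return {
        "size": size,
        "query": {"match": {"body": term}},
    }

# With the elasticsearch-py client this would run as something like:
# es.search(index="blogs", body=term_query("organ"))
```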
If you looked at that last blog, it's a typical example of a good travel blog: descriptive, narrative, lots of pictures, and focused on a distinct region. It's important to note that these blogs are mostly journal-style catalogs, not 'Top 10' posts. The cool thing is that, in theory, whatever you find in this data reflects real experiences people have had. As you can tell, I'm a fan of this dataset, but it can be pretty messy to analyze. Some things I've had to account for:
- filtering difficult HTML (even using BeautifulSoup to isolate the main body of the post)
- keeping only English blogs (using a package called langid)
- removing stop words ('a', 'the', 'an', ...); I used a much more expansive list than the out-of-the-box NLTK one
- removing people's names (more on this later)
- removing words that appear infrequently
Here is the simple source code
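The filtering steps above can be sketched in plain Python. This is only an illustration of the idea, not the actual source; the stop word and name lists here are tiny stand-ins for the much larger lists used in the real runs:

```python
import re
from collections import Counter

STOPWORDS = {"a", "the", "an", "of", "to", "in", "is", "was"}  # tiny stand-in
FIRST_NAMES = {"squirt"}  # stand-in; the real list is far more expansive
MIN_COUNT = 4             # drop terms appearing fewer than 4 times

def tokenize(text):
    """Lowercase and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def clean_corpus(docs):
    """Tokenize, strip stop words and names, then drop infrequent terms."""
    tokenized = [
        [t for t in tokenize(d) if t not in STOPWORDS and t not in FIRST_NAMES]
        for d in docs
    ]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= MIN_COUNT] for doc in tokenized]
```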
You can see from the code that n-grams can be a simple and powerful way to do some discovery on your dataset. If you take a look you can see 'Game Thrones' in the bigrams list. 'Of' was filtered out with the stop words, but a good deal of the filming for Game of Thrones takes place in Croatia; I'll leave you to look up the specifics elsewhere ;). A lot of national parks and towns are in these lists as well. Putting these terms into my search engine yields some interesting discoveries. I'm a bit interested in the white water rafting.
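Bigram discovery itself is just counting adjacent word pairs after stop-word removal, which is exactly why 'Game Thrones' surfaces once 'of' is gone. A minimal sketch:

```python
from collections import Counter

def top_bigrams(token_docs, n=5):
    """Count adjacent word pairs across already-cleaned token lists.
    Because stop words were removed first, pairs like ('game', 'thrones')
    appear even though the original text read 'Game of Thrones'."""
    pairs = Counter()
    for doc in token_docs:
        pairs.update(zip(doc, doc[1:]))
    return pairs.most_common(n)
```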
Below is a sample of the LDAVis UI for topic analysis. I debated whether I should write this post in IPython (where I did the LDAVis analysis). Ultimately, I like having my posts in a single place, but IPython (now Jupyter) is a fantastic way for analysts to share their work. You can now run these notebooks off of GitHub, and I stuck the best of my data runs and data in this Jupyter Notebook. After this post I'll commit my code + Jupyter notebook to GitHub, but it needs a fair amount of cleanup. What this interactive chart says, and the process of getting there, are worth some note. As you can see, there isn't as much distinction between the topics as I would like. There is some amount of overlap, but the results still provide a good bit of insight. Topic #2 contains 'zadar, water, beach, sea, organ' as some of its most frequent+salient terms. Now what does salient mean? In a nutshell, the word 'organ' doesn't appear much in the corpus, but when it does it belongs almost exclusively to this topic. Lambda is essentially a balance of frequency and saliency that you can control; when lambda is .5, you tend to get a good mix of popular words and distinctive words. I found that some topics were awash in terms that made it difficult to discern the topic; using lambda you can see, perhaps, what is defining about the topic. Topic #2 is what gave me pause to investigate "organ" and discover the Sea Organ in Zadar.
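For the curious, the lambda slider implements what the LDAVis paper calls "relevance": a weighted blend of a term's probability within the topic and its "lift" over the whole corpus. A small sketch of that formula:

```python
import math

def relevance(p_term_topic, p_term_corpus, lam):
    """LDAVis relevance: lam * log p(w|t) + (1 - lam) * log lift,
    where lift = p(w|t) / p(w). At lam=1 terms rank purely by in-topic
    frequency; at lam=0 purely by how exclusive the term is to the topic."""
    lift = p_term_topic / p_term_corpus
    return lam * math.log(p_term_topic) + (1 - lam) * math.log(lift)
```

A word like 'organ', with low corpus-wide probability but decent in-topic probability, has high lift, so it climbs the ranking as you slide lambda down toward 0.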
I used Gensim again to perform the LDA analysis. This library goes to great lengths to scale and stay performant, which is important so I don't have to switch to a new library to perform large analyses. Here is the sample code used to produce my corpus, dictionary, and LDA. The general process is as follows:
- Get blog text from ElasticSearch
- Get an expansive list of first names and stop words to filter out
- Only include terms that appear 4 or more times
- Generate Dictionary, Corpus, and LDA model
Let's take a look at a few more topics listed in the notebook. Be sure to set the slider on the right graph to .5 (lambda) and pick the topic on the left graph. For example, Topic #4 is a bit confusing. Turn lambda down to .33 and you get bus, hostel, squirt (a nickname I should have filtered); cheaper and backpacks stand out to me. I think this topic holds a lot of budget-traveler information. Tagging these blogs this way could be very compelling for would-be travelers searching for things to do. I'm lost on Topics #1 and #3. Topic #5 is obviously about Dubrovnik. Topic #6 is about some of the popular islands off the coast. Topic #7 is about some of the history and architecture. Topic #8 is about camping and motorhomes. Topic #9 is about some of the national parks. I think you get the idea. To reiterate, I often move back and forth between LDAVis and my search to check out terms in the context of the blog. That's how I knew I had missed a nickname in Topic #4: it's always the same author, and the first few results were about their son.
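That nickname catch can also be scripted as a quick sanity check: if a salient term comes almost entirely from one author, it's probably a name rather than a topic word. A hypothetical sketch, assuming each blog record carries an author field:

```python
from collections import Counter

def authors_for_term(term, blogs):
    """blogs: list of dicts with an 'author' string and a 'tokens' set.
    A term whose author counter is dominated by a single author is a
    likely name/nickname to add to the filter list."""
    return Counter(b["author"] for b in blogs if term in b["tokens"])
```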
That about summarizes this post. I didn't go too in depth on LDA, but you should really read the previously mentioned blog by Ted Underwood. The intro to LDA written by one of the original authors, David Blei, is here. The paper behind LDAVis is interesting too, but their site is the best place to build intuition for their methodology of LDA analysis. Finally, were I a grad student, I think I might be interested in improving LDA for blogs by adding another dimension for blog metadata like author, and perhaps custom treatment for the title as well. If anyone is interested in running their own analysis with the search engine + code that I used, just ping me and I'll post instructions and cleaner code in my GitHub repo.