I'm headed to Croatia soon, so I decided to take another, more-than-cursory run at topic analysis. One of the challenges I've had with past topic analyses was validating the result set, and I recently discovered LDAVis, which does exactly that. I won't go into LDA too much here; Ted Underwood does *the* best job of explaining LDA in an intuitive way. LDAVis lets you inspect the condition of each topic through its visual relation to other topics, how distinct the topics are from one another, and term frequency/saliency within each topic. Here I'll walk through how I used the tool across many topic-extraction and data-cleaning runs to surface interesting insights into Croatian travel.
The dataset consists of blogs scraped from travelblog.org and travelpod.com, from which I kept the body and title of each post. I store the dataset locally in ElasticSearch, and it includes 1500 blogs on Croatia alone. ElasticSearch isn't strictly necessary for the analysis itself, but it was convenient for document storage, term discovery, and blog review. Below is a sample screenshot from my "Croatian" index using a custom search tool. You can see that I searched "organ", which showed up in one of the major topics and struck me as odd. It turns out there is a pipe organ in the city of Zadar called the Sea Organ (described here, 2nd paragraph from the bottom). It plays music by way of sea waves moving underneath a large set of marble steps.
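For reference, pulling the documents out of ElasticSearch is straightforward with the elasticsearch-py client. This is only a minimal sketch of what a getblogs() helper might look like; the index name croatia_blogs and the fields title and body are placeholders for illustration, not my actual schema.

```python
from elasticsearch import Elasticsearch, helpers

def getblogs():
    """Return the text (title + body) of every Croatia blog in the local index.

    Hypothetical index/field names -- adjust to your own mapping."""
    es = Elasticsearch()  # local node on localhost:9200
    docs = helpers.scan(es, index='croatia_blogs',
                        query={"query": {"match_all": {}}})
    return [hit['_source']['title'] + ' ' + hit['_source']['body']
            for hit in docs]
```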
If you looked at that last blog, it's a good representative of a strong travel blog: descriptive, narrative, lots of pictures, and focused on a distinct region. It's important to note that these blogs are mostly journal-style accounts, not 'Top 10' posts. The compelling notion is that, in theory, whatever you find in this data reflects real experiences people have had. As you can tell, I'm a fan of this dataset, but it can be pretty messy to analyze. Some things I've had to account for (a rough sketch of the HTML and language filtering follows the list):
- hard-to-filter HTML (even using BeautifulSoup to pull out the main body of the post)
- keeping only English blogs (using a package called langid)
- removing stop words ('a', 'the', 'an', ...). I used a much more expansive list than the out-of-the-box NLTK one.
- removing people's names (more on this later)
- removing words that appear infrequently
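The HTML and language steps are the fiddliest of these. Here's a minimal sketch of how they might look, assuming BeautifulSoup and the langid package; clean_blog() is a hypothetical helper for illustration, not code from my pipeline.

```python
import langid
from bs4 import BeautifulSoup

def clean_blog(raw_html):
    """Strip markup and drop non-English posts. Returns plain text or None."""
    # BeautifulSoup copes with malformed markup better than a regex would
    text = BeautifulSoup(raw_html, 'html.parser').get_text(separator=' ')
    lang, score = langid.classify(text)  # e.g. ('en', -123.4)
    if lang != 'en':
        return None  # skip non-English blogs
    return text
```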
Here is the simple source code for the n-gram discovery:
```python
import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from gensim import utils

def ngrams():
    # build one big corpus string out of all the blog bodies
    corpus = ''
    for blogtext in getblogs():
        corpus += blogtext
    stops = longstopwords()
    prime_words = [word for word in utils.simple_preprocess(corpus)
                   if word.lower() not in stops]
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(prime_words)
    finderII = TrigramCollocationFinder.from_words(prime_words)
    # only n-grams that appear 3+ times
    finder.apply_freq_filter(3)
    finderII.apply_freq_filter(3)
    print "TOP UNIGRAMS:"
    for t in nltk.FreqDist(prime_words).most_common(100):
        print '<li>' + t[0] + '</li>'
    print '-----------------'
    print "TOP BIGRAMS:"
    for bigram in finder.score_ngrams(bigram_measures.raw_freq)[:100]:
        print '<li>' + " ".join(bigram[0]) + '</li>'
    print '-----------------'
    print "TOP TRIGRAMS:"
    for trigram in finderII.score_ngrams(trigram_measures.raw_freq)[:40]:
        print '<li>' + " ".join(trigram[0]) + '</li>'
```
You can see from the code that n-grams are a simple but powerful way to do some discovery on your dataset. If you take a look, you'll see 'Game Thrones' in the bigrams list; 'of' was filtered out as a stop word, but a good deal of the filming for Game of Thrones takes place in Croatia. Numerous national parks and towns show up in these lists as well. Searching for these terms in my search engine yields some interesting finds, and I've already turned up an activity of interest: white water rafting.
Below is a sample of the LDAVis UI for topic analysis. I debated whether I should write this post in IPython (where I did the LDAVis analysis); ultimately, I like having my posts in a single place, but IPython (now Jupyter) is a convenient way for analysts to share their work. You can now run these notebooks off of GitHub, and I stuck the best of my data runs, along with the data, in this Jupyter Notebook. How I arrived at this result and what the interactive chart says are both worth some discussion. As you can see, there isn't as much distinction between the topics as I would like, as shown by the overlap in the topic circles. However, the results still provide a good bit of insight. Topic #2 contains 'zadar, water, beach, sea, organ' as some of its most frequent and salient terms. The word 'organ' doesn't appear much in the corpus, but when it does it belongs almost exclusively to this topic. When the lambda slider is set to .5, we see a good mix of popular words vs. distinctive words; lambda is essentially a balance of frequency and saliency that you can control. Some topics were awash in ambiguous terms, but manipulating lambda can provide clarity. Topic #2 is what prompted me to investigate "organ" and discover the Sea Organ in Zadar.
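If you want to reproduce an interactive chart like this from a Gensim model, the pyLDAvis package can prepare it directly from an LDA model, its corpus, and its dictionary (those three objects are built in the code further down). This is just a sketch of that wiring, not the exact cell from my notebook; per the LDAvis paper, the lambda slider blends a term's in-topic probability with its lift (its probability within the topic relative to its probability overall).

```python
import pyLDAvis
import pyLDAvis.gensim  # in newer pyLDAvis releases this module is pyLDAvis.gensim_models

# lda, corpus, and dictionary come from the Gensim step shown below
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)

pyLDAvis.display(vis)                            # render inline in a Jupyter notebook
pyLDAvis.save_html(vis, 'croatia_topics.html')   # or save standalone HTML
```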
I used Gensim again to perform the LDA analysis. The library goes to great lengths to scale well and stay performant, which matters because I don't want to switch libraries when the analyses get large. Here is the sample code used to produce my corpus, dictionary, and LDA model. The general process is as follows:
- Get blog text from ElasticSearch
- Get an expansive list of first names and stopwords to filter out
- Only include terms that appear 4 or more times
- Generate Dictionary, Corpus, and LDA model
```python
import logging
from collections import defaultdict
from gensim import corpora, models, utils

logger = logging.getLogger(__name__)

def Dictionary_Corpus_LDA_Withoutnames():
    # get array of blogs stored in ElasticSearch
    documents = getblogs()
    # get a list of first names to filter out
    # (gathered from US Census + half a dozen foreign names + nicknames)
    firsts = getfirstnames()
    # get a long list of stopwords to filter out
    stops = longstopwords()
    words_to_exclude = stops + firsts.keys()

    logger.info("Removing stop words and names")
    documents = [[word for word in utils.simple_preprocess(doc)
                  if word not in words_to_exclude]
                 for doc in documents]

    logger.info("Removing infrequent words")
    token_frequency = defaultdict(int)
    for doc in documents:
        for token in doc:
            token_frequency[token] += 1
    # remove words that occur 3 or fewer times
    documents = [[token for token in doc if token_frequency[token] > 3]
                 for doc in documents]

    logger.info("Saving Corpus and Dictionary")
    dictionary = corpora.Dictionary(documents)
    dictionary.save('CroatiaTravelBlogs_NoNames.dict')
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    corpora.MmCorpus.serialize('CroatiaTravelBlogs_NoNames.mm', corpus)

    topics = 50
    passes = 100
    lda = models.LdaModel(corpus, id2word=dictionary,
                          num_topics=topics, passes=passes)
    lda.print_topics(topics)
    lda.save("Croatia_Topic_No_Names-{0}_Passes-{1}.lda".format(topics, passes))
```
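One nice side effect of saving everything to disk is that later runs (including the LDAVis step above) can reload the model without recomputing it. A quick sketch, using the same filenames the function writes out:

```python
from gensim import corpora, models

# reload the artifacts saved by Dictionary_Corpus_LDA_Withoutnames()
dictionary = corpora.Dictionary.load('CroatiaTravelBlogs_NoNames.dict')
corpus = corpora.MmCorpus('CroatiaTravelBlogs_NoNames.mm')
lda = models.LdaModel.load('Croatia_Topic_No_Names-50_Passes-100.lda')
```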
Take a look at a few more topics listed in the notebook. Set the lambda slider on the right graph to .5 and choose a topic number on the left graph. For example, Topic #4 is a bit confusing. Toggle lambda down to .33 and you'll get 'bus', 'hostel', 'squirt' (a nickname I should have filtered), 'cheaper', and 'backpacks'. I think this topic contains blogs from budget travelers. Tagging these blogs for would-be travelers could be a very compelling feature of the search engine. I'm unclear on Topics #1 and #3. Topic #5 is about Dubrovnik. Topic #6 relates to the popular islands just off the coast. Topic #7 is about history and architecture. Topic #8 is about camping and motorhomes. Topic #9 is about some of the national parks. I think you get the idea. To reiterate, I often move back and forth between LDAVis and my search engine to check terms in the context of the blogs. That's how I knew I had missed a nickname in Topic #4: the term always came from the same author, and the first few results were about their son.
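Outside of the visualization, Gensim itself gives a quick way to spot-check a topic's top terms before chasing them down in the search engine. A small sketch of that back-and-forth, reusing the hypothetical index and field names from earlier (note that Gensim's topic ids are 0-based and don't necessarily match the 1-based numbering LDAVis displays):

```python
from elasticsearch import Elasticsearch

# print the strongest weighted terms for one Gensim topic
print(lda.print_topic(3, topn=15))

# then pull a few blogs mentioning a suspicious term to read it in context
es = Elasticsearch()
hits = es.search(index='croatia_blogs',
                 body={"query": {"match": {"body": "squirt"}}, "size": 3})
for hit in hits['hits']['hits']:
    print(hit['_source']['title'])
```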
I didn't go too in-depth on LDA here, but it's worth reading the blog post by Ted Underwood mentioned above. The intro to LDA written by one of the original authors, David Blei, is here. The paper behind LDAVis is interesting too, and its website is the best place to build intuition for its approach to exploring LDA output. Finally, if I were to take this exercise further academically, I'd adapt LDA specifically for blogs by adding another dimension for the author and giving the title custom treatment.