Sep 23, 2015

Topic Analysis In Depth - Croatia Travel Blogs

I'm headed to Croatia soon, so I decided to take another run at topic analysis, this time going beyond a cursory look. One of the challenges I've had with past topic analyses was validating the result set. Recently I discovered LDAVis, which does just that for a topic analysis. I won't go into LDA too much here, but Ted Underwood does *the* best job of explaining LDA in an intuitive way. LDAVis provides the means to assess the condition of your topics through their visual relation to other topics, how distinct topics are from one another, and term frequency/saliency within topics. Here I'll walk through how I used the tool across many different topic extraction/data cleaning runs to reveal interesting insights into Croatian travel.

The dataset consists of scraped blogs, from which I included information from the body and title. I have this dataset locally on my computer in ElasticSearch, which holds 1500 blogs on Croatia alone. ElasticSearch isn't really necessary for the analysis itself, but it was convenient for document storage, term discovery, and blog review. Below is a sample screenshot from my "Croatian" version using a custom search tool. You can see that I searched "organ", which showed up in one of the major topics, and that struck me as odd. It turns out that there is a pipe organ in the city of Zadar called the Sea Organ (described here, 2nd paragraph from the bottom). It plays music by way of sea waves underneath a large set of marble steps.
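
That search is nothing fancy; under the hood it boils down to a single full-text match query. Here is a minimal sketch, where the index name ("croatia_blogs") and field name ("body") are illustrative assumptions rather than the actual names from my tool:

```python
# Hypothetical shape of the "organ" search from the screenshot above.
# "croatia_blogs" and "body" are assumed names, for illustration only.
query = {
    "query": {
        "match": {
            "body": "organ"   # full-text match against the blog body
        }
    },
    "size": 10,  # return the top 10 matching blogs
}

# With the elasticsearch-py client, this would run as something like:
#   es = Elasticsearch()
#   hits ="croatia_blogs", body=query)
```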

If you looked at that last blog, it is a typical representation of a good travel blog: descriptive, narrative, lots of pictures, and focused on a distinct region. It's important to note that these blogs are mostly journal-style catalogs, not 'Top 10' posts. The compelling notion is that, theoretically, whatever you find in this data is a real experience someone has had. As you can tell, I'm a fan of this dataset, but it can be pretty messy to analyze. Some things I've had to account for:

  • filtering difficult HTML (even using BeautifulSoup to isolate the main body of the post)
  • keeping only English blogs (using a package called langid)
  • removing stop words ('a', 'the', 'an', ...). I used a much more expansive list than the out-of-the-box NLTK one.
  • removing people's names (more on this later)
  • removing words that appear infrequently
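
The language filter from the list above can be sketched as a simple predicate. langid's `classify` returns a `(language_code, score)` pair, so here it is stubbed out with a fake classifier to keep the example self-contained:

```python
def keep_english(blogs, classify):
    """Keep only English blogs.

    `classify` is any function returning a (language_code, score)
    pair for a text; langid.classify has exactly this shape.
    """
    return [blog for blog in blogs if classify(blog)[0] == "en"]

# Stub classifier for illustration: pretend anything containing
# "the" is English and everything else is Croatian.
def fake_classify(text):
    return ("en", 0.99) if "the" in text else ("hr", 0.99)

blogs = ["We walked the city walls of Dubrovnik.", "Dobar dan iz Zagreba!"]
english_only = keep_english(blogs, fake_classify)
```

In the real pipeline, `fake_classify` is replaced with `langid.classify` and the list of blogs comes out of ElasticSearch.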

Before diving into the topic analysis, a great way to discover some features (and bad data) is to look at different n-grams: basically, the most common single words, word pairs, and word triples.
Popular Unigrams (starting with most popular)
  • day
  • town
  • time
  • city
  • croatia
  • water
  • dubrovnik
  • people
  • split
  • good
  • bus
  • night
  • boat
  • island
  • walk
  • beautiful
  • well
  • small
  • great
  • walked
  • place
  • dinner
  • will
  • bit
  • park
  • sea
  • nice
  • trip
  • beach
  • decided
  • going
  • morning
  • tour
  • croatian
  • lunch
  • zagreb
  • today
  • hotel
  • road
  • headed
  • didn
  • find
  • left
  • long
  • hours
  • restaurant
  • walls
  • arrived
  • area
  • hour
  • car
  • local
  • couple
  • top
  • hvar
  • pretty
  • side
  • days
  • wine
  • walking
  • early
  • lovely
  • apartment
  • coast
  • lot
  • ferry
  • room
  • lakes
  • food
  • breakfast
  • head
  • big
  • ve
  • view
  • palace
  • amazing
  • group
  • ride
  • pm
  • streets
  • sun
  • don
  • lots
  • minutes
  • bar
  • best
  • started
  • stopped
  • main
  • set
  • called
  • full
  • large
  • views
  • finally
  • train
  • adriatic
  • tomorrow
  • told
  • swim
Popular Bigrams
  • national park
  • city walls
  • ice cream
  • plitvice lakes
  • adriatic sea
  • diocletian palace
  • bus station
  • cable car
  • game thrones
  • walking tour
  • walked town
  • walk town
  • stari grad
  • years ago
  • early morning
  • crystal clear
  • couple hours
  • tour guide
  • walled city
  • cruise ships
  • bell tower
  • half hour
  • train station
  • hvar island
  • bus ride
  • marco polo
  • cruise ship
  • long time
  • small town
  • hvar town
  • main square
  • day dubrovnik
  • island hvar
  • full day
  • narrow streets
  • decided head
  • pretty good
  • town dubrovnik
  • dalmatian coast
  • minute walk
  • olive oil
  • plitvice national
  • split croatia
  • unesco heritage
  • lakes national
  • rest day
  • bosnia herzegovina
  • day split
  • day trip
  • decided walk
  • heritage site
  • boat ride
  • city dubrovnik
  • walk city
  • upper town
  • white wine
  • arrived dubrovnik
  • broken relationships
  • caught bus
  • dubrovnik croatia
  • great day
  • spent time
  • early night
  • long day
  • top hill
  • well worth
  • headed town
  • main street
  • museum broken
  • people watching
  • rental car
  • blue water
  • couple days
  • grocery store
  • red wine
  • roman emperor
  • tour group
  • beautiful city
  • good time
  • shops restaurants
  • top deck
  • air conditioning
  • clear water
  • sail croatia
  • side road
  • blue cave
  • day day
  • emperor diocletian
  • views city
  • amazing views
  • beautiful place
  • glass wine
  • great time
  • pile gate
  • upper lakes
  • bus town
  • day croatia
  • island brac
  • lokrum island
  • side island
Popular Trigrams
  • lakes national park
  • plitvice national park
  • plitvice lakes national
  • museum broken relationships
  • unesco heritage site
  • krka national park
  • roman emperor diocletian
  • crystal clear water
  • walk city walls
  • clear blue water
  • water crystal clear
  • game thrones filmed
  • marco polo born
  • walking tour town
  • bad blue boys
  • cable car top
  • main bus station
  • second largest city
  • zagreb capital croatia
  • hour bus ride
  • largest city croatia
  • caught local bus
  • good nights sleep
  • st mark church
  • crystal clear waters
  • hula hula bar
  • national park croatia
  • town stari grad
  • walked city walls
  • white water rafting
  • austro hungarian empire
  • built roman emperor
  • cruise ship passengers
  • decided head apartment
  • eating ice cream
  • ice cream shop
  • mljet national park
  • national park plitvice
  • spent couple hours
  • town hvar island

Here is the simple source code:

import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from gensim import utils

def ngrams():

    # build one big corpus string from the blogs stored in ElasticSearch
    corpus = ''
    for blogtext in getblogs():
        corpus += blogtext

    stops = longstopwords()

    # tokenize/normalize and drop stop words
    prime_words = [word for word in utils.simple_preprocess(corpus) if word.lower() not in stops]

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()

    finder = BigramCollocationFinder.from_words(prime_words)
    finderII = TrigramCollocationFinder.from_words(prime_words)

    # only n-grams that appear 3+ times
    finder.apply_freq_filter(3)
    finderII.apply_freq_filter(3)

    print "TOP UNIGRAMS:"
    for t in nltk.FreqDist(prime_words).most_common(100):
        print '<li>' + t[0] + '</li>'

    print '-----------------'

    print "TOP BIGRAMS:"
    for bigram in finder.score_ngrams(bigram_measures.raw_freq)[:100]:
        print '<li>' + " ".join(bigram[0]) + '</li>'

    print '-----------------'

    print "TOP TRIGRAMS:"
    for trigram in finderII.score_ngrams(trigram_measures.raw_freq)[:40]:
        print '<li>' + " ".join(trigram[0]) + '</li>'

You can see from the code that n-grams are a simple and powerful way to do some discovery on your dataset. If you take a look, you can see 'game thrones' in the bigrams list. 'Of' was filtered out with the stop words, but a good deal of Game of Thrones filming takes place in Croatia. Numerous national parks and towns appear in these lists as well. Plugging these terms into my search engine yields some interesting discoveries, and I've already found an activity of interest: white water rafting.
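
The 'game thrones' artifact is easy to reproduce: once stop words like 'of' are stripped before counting, the surviving words become adjacent and get paired up. A toy version with no NLTK dependency:

```python
from collections import Counter

stops = {"a", "an", "the", "of", "in", "for"}

def bigrams_after_stopwords(text):
    # drop stop words first, then pair up whatever words survive
    words = [w for w in text.lower().split() if w not in stops]
    return Counter(zip(words, words[1:]))

counts = bigrams_after_stopwords(
    "A good deal of filming for Game of Thrones takes place in Croatia"
)
# ("game", "thrones") now shows up as a bigram even though those two
# words are never adjacent in the original sentence.
```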

Below is a sample of the LDAVis UI for topic analysis. I debated whether I should write this post in IPython (where I did the LDAVis analysis). Ultimately, I like having my posts in a single place, but IPython (now Jupyter) is a convenient way for analysts to share their work. You can now run these notebooks off of GitHub, and I stuck the best of my data runs and data in this Jupyter Notebook. How I arrived at this result, and what the interactive chart says, are worth some note. As you can see, there isn't as much distinction between the topics as I would like, as shown by the overlap in the topic circles. However, the results still provide a good bit of insight. Topic #2 contains 'zadar', 'water', 'beach', 'sea', and 'organ' as some of the most frequent+salient terms. Basically, the word 'organ' doesn't appear much in the corpus, but when it does, it belongs almost exclusively to this topic. When the lambda slider is set to .5, we see a good mix of popular words vs. strong words; lambda is essentially a balance of frequency and saliency that you can control. Some topics were awash in ambiguous terms, but manipulating lambda can provide clarity. Topic #2 is what gave me pause to investigate "organ" and discover the Sea Organ in Zadar.
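
That lambda slider implements the "relevance" measure from the LDAVis paper: relevance(w, t) = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), where the second term is the "lift" of a word within a topic. A small sketch with made-up probabilities (the numbers are illustrative, not from my model):

```python
import math

def relevance(p_word_topic, p_word_overall, lam):
    """LDAvis relevance: lam * log p(w|t) + (1 - lam) * log lift."""
    lift = p_word_topic / p_word_overall
    return lam * math.log(p_word_topic) + (1 - lam) * math.log(lift)

# Two illustrative words in a topic: "water" is frequent everywhere,
# while "organ" is rare overall but concentrated in this topic.
#                      p(w|topic)  p(w) overall
water_rel = relevance(0.030,      0.025,  0.5)
organ_rel = relevance(0.010,      0.0005, 0.5)
```

At lambda = 1 the ranking is pure frequency (so "water" wins), at lambda near 0 it is pure lift (so "organ" wins), and a middle value like .5 surfaces words like "organ" that define the topic without being globally common.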

I used Gensim again to perform the LDA analysis. This library goes to great lengths to scale and be performant, which is important so I don't have to switch to a new library to perform large analyses. Here is the sample code used to produce my corpus, dictionary, and LDA model. The general process is as follows:

  1. Get blog text from ElasticSearch
  2. Get an expansive list of firstnames and stopwords to filter out
  3. Only include terms that appear 4 or more times
  4. Generate Dictionary, Corpus, and LDA model
import logging
from collections import defaultdict
from gensim import corpora, models, utils

logging.basicConfig(level=logging.INFO)

def Dictionary_Corpus_LDA_Withoutnames():
    #get array of blogs stored in ElasticSearch
    documents = getblogs()

    #get a list of first names to filter out (gathered from US Census + half a dozen foreign names + nicknames)
    firsts = getfirstnames()

    #get a long list of stopwords to filter out
    stops = longstopwords()
    words_to_exclude = stops + firsts.keys()"Removing stop words and names")
    documents = [[word for word in utils.simple_preprocess(doc) if word not in words_to_exclude]
                 for doc in documents]"Removing infrequent words")
    token_frequency = defaultdict(int)
    for doc in documents:
        for token in doc:
            token_frequency[token] += 1

    #remove words that occur 3 or fewer times
    documents = [[token for token in doc if token_frequency[token] > 3]
                 for doc in documents]"Saving Corpus and Dictionary")
    dictionary = corpora.Dictionary(documents)'CroatiaTravelBlogs_NoNames.dict')
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    corpora.MmCorpus.serialize('CroatiaTravelBlogs_NoNames.mm', corpus)

    topics = 50
    passes = 100
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=topics, passes=passes)
    lda.print_topics(topics)"Croatia_Topic_No_Names-{0}_Passes-{1}.lda".format(topics, passes))
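
Under the hood, the Dictionary/doc2bow step above is just a token-to-id map plus per-document counts. Here is a dependency-free sketch of what the saved corpus ends up holding (the real `corpora.Dictionary` does more, such as pruning and persistence):

```python
from collections import Counter

def build_dictionary(documents):
    """Assign an integer id to each unique token, like corpora.Dictionary."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Bag-of-words: sorted (token_id, count) pairs, like Dictionary.doc2bow."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

docs = [["zadar", "sea", "organ", "sea"], ["dubrovnik", "walls", "sea"]]
token2id = build_dictionary(docs)   # zadar=0, sea=1, organ=2, dubrovnik=3, walls=4
corpus = [doc2bow(d, token2id) for d in docs]
# corpus[0] is [(0, 1), (1, 2), (2, 1)]: "sea" appears twice in the first doc
```

This sparse (id, count) representation is exactly what `MmCorpus.serialize` writes to disk and what `LdaModel` consumes.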

Take a look at a few more topics listed in the notebook. Set the Lambda slider on the right graph to .5 and choose a topic number on the left graph. For example, Topic #4 is a bit confusing. Toggle the Lambda down to .33 and you'll get 'bus', 'hostel', 'squirt' (a nickname I should have filtered), 'cheaper', and 'backpacks'. I think this topic contains blogs by budget travelers. Tagging these blogs for would-be travelers could be a very compelling feature for the search engine. I'm unclear on Topics #1 and #3. Topic #5 is about Dubrovnik. Topic #6 relates to the popular islands just off the coast. Topic #7 is about history and architecture. Topic #8 is about camping and motorhomes. Topic #9 is about some of the national parks. I think you get the idea. To reiterate, I often move back and forth between LDAVis and my search tool to check out terms in the context of the blogs. That's how I knew I had missed a nickname in Topic #4: it's always the same author, and the first few results were about their son.

I didn't go too in depth on LDA here, but it's worth reading the blog by Ted Underwood mentioned above. The intro to LDA written by one of the original authors, David Blei, is here. The paper behind LDAVis is interesting too; the website is the best place to get a better intuition for their methodology of LDA analysis. Finally, if I were to take the exercise further academically, I'd improve LDA specifically for blogs by adding another dimension for author and custom treatment for the title.