Sep 23, 2015

Topic Analysis In Depth - Croatia Travel Blogs

I'm headed to Croatia soon, so I decided to take another run at topic analysis, this time going beyond a cursory look. One of the challenges I've had with past topic analyses was validating the result set. Recently I discovered LDAVis, which does just that for a topic analysis. I won't go into LDA too much here, but Ted Underwood does *the* best job of explaining LDA in an intuitive way. LDAVis provides the means to assess the condition of your topics through their visual relation to other topics, how distinct topics are from one another, and term frequency/saliency within topics. Here I'll walk through how I used the tool across many different topic extraction/data cleaning runs to reveal interesting insights into Croatian travel.

The dataset consists of scraped blogs, from which I included information from the body and title. I have this dataset locally on my computer in ElasticSearch, which holds 1500 blogs on Croatia alone. ElasticSearch isn't really necessary for the analysis itself, but it was convenient for document storage, term discovery, and blog review. Below is a sample screenshot from my "Croatian" version using a custom search tool. You can see that I searched "organ", which showed up in one of the major topics, and that struck me as odd. It turns out that there is a pipe organ in the city of Zadar called the Sea Organ (described here, 2nd paragraph from the bottom). It plays music by way of sea waves underneath a large set of marble steps.
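
That search is nothing fancy; under the hood it boils down to a single full-text match query. Here is a minimal sketch, where the index name ("croatia_blogs") and field name ("body") are illustrative assumptions rather than the actual names from my tool:

```python
# Hypothetical shape of the "organ" search from the screenshot above.
# "croatia_blogs" and "body" are assumed names, for illustration only.
query = {
    "query": {
        "match": {
            "body": "organ"   # full-text match against the blog body
        }
    },
    "size": 10,  # return the top 10 matching blogs
}

# With the elasticsearch-py client, this would run as something like:
#   es = Elasticsearch()
#   hits ="croatia_blogs", body=query)
```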

If you looked at that last blog, it is a typical representation of a good travel blog: descriptive, narrative, lots of pictures, and focused on a distinct region. It's important to note that these blogs are mostly journal-style catalogs, not 'Top 10' posts. The compelling notion is that, theoretically, whatever you find in this data is a real experience someone has had. As you can tell, I'm a fan of this dataset, but it can be pretty messy to analyze. Some things I've had to account for:

  • filtering difficult HTML (even using BeautifulSoup to isolate the main body of the post)
  • keeping only English blogs (using a package called langid)
  • removing stop words ('a', 'the', 'an', ...). I used a much more expansive list than the out-of-the-box NLTK one.
  • removing people's names (more on this later)
  • removing words that appear infrequently
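
The language filter from the list above can be sketched as a simple predicate. langid's `classify` returns a `(language_code, score)` pair, so here it is stubbed out with a fake classifier to keep the example self-contained:

```python
def keep_english(blogs, classify):
    """Keep only English blogs.

    `classify` is any function returning a (language_code, score)
    pair for a text; langid.classify has exactly this shape.
    """
    return [blog for blog in blogs if classify(blog)[0] == "en"]

# Stub classifier for illustration: pretend anything containing
# "the" is English and everything else is Croatian.
def fake_classify(text):
    return ("en", 0.99) if "the" in text else ("hr", 0.99)

blogs = ["We walked the city walls of Dubrovnik.", "Dobar dan iz Zagreba!"]
english_only = keep_english(blogs, fake_classify)
```

In the real pipeline, `fake_classify` is replaced with `langid.classify` and the list of blogs comes out of ElasticSearch.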

Before diving into the topic analysis, a great way to discover some features (and bad data) is to look at different n-grams: basically, the most common single words, word pairs, and word triples.
Popular Unigrams (starting with most popular)
  • day
  • town
  • time
  • city
  • croatia
  • water
  • dubrovnik
  • people
  • split
  • good
  • bus
  • night
  • boat
  • island
  • walk
  • beautiful
  • well
  • small
  • great
  • walked
  • place
  • dinner
  • will
  • bit
  • park
  • sea
  • nice
  • trip
  • beach
  • decided
  • going
  • morning
  • tour
  • croatian
  • lunch
  • zagreb
  • today
  • hotel
  • road
  • headed
  • didn
  • find
  • left
  • long
  • hours
  • restaurant
  • walls
  • arrived
  • area
  • hour
  • car
  • local
  • couple
  • top
  • hvar
  • pretty
  • side
  • days
  • wine
  • walking
  • early
  • lovely
  • apartment
  • coast
  • lot
  • ferry
  • room
  • lakes
  • food
  • breakfast
  • head
  • big
  • ve
  • view
  • palace
  • amazing
  • group
  • ride
  • pm
  • streets
  • sun
  • don
  • lots
  • minutes
  • bar
  • best
  • started
  • stopped
  • main
  • set
  • called
  • full
  • large
  • views
  • finally
  • train
  • adriatic
  • tomorrow
  • told
  • swim
Popular Bigrams
  • national park
  • city walls
  • ice cream
  • plitvice lakes
  • adriatic sea
  • diocletian palace
  • bus station
  • cable car
  • game thrones
  • walking tour
  • walked town
  • walk town
  • stari grad
  • years ago
  • early morning
  • crystal clear
  • couple hours
  • tour guide
  • walled city
  • cruise ships
  • bell tower
  • half hour
  • train station
  • hvar island
  • bus ride
  • marco polo
  • cruise ship
  • long time
  • small town
  • hvar town
  • main square
  • day dubrovnik
  • island hvar
  • full day
  • narrow streets
  • decided head
  • pretty good
  • town dubrovnik
  • dalmatian coast
  • minute walk
  • olive oil
  • plitvice national
  • split croatia
  • unesco heritage
  • lakes national
  • rest day
  • bosnia herzegovina
  • day split
  • day trip
  • decided walk
  • heritage site
  • boat ride
  • city dubrovnik
  • walk city
  • upper town
  • white wine
  • arrived dubrovnik
  • broken relationships
  • caught bus
  • dubrovnik croatia
  • great day
  • spent time
  • early night
  • long day
  • top hill
  • well worth
  • headed town
  • main street
  • museum broken
  • people watching
  • rental car
  • blue water
  • couple days
  • grocery store
  • red wine
  • roman emperor
  • tour group
  • beautiful city
  • good time
  • shops restaurants
  • top deck
  • air conditioning
  • clear water
  • sail croatia
  • side road
  • blue cave
  • day day
  • emperor diocletian
  • views city
  • amazing views
  • beautiful place
  • glass wine
  • great time
  • pile gate
  • upper lakes
  • bus town
  • day croatia
  • island brac
  • lokrum island
  • side island
Popular Trigrams
  • lakes national park
  • plitvice national park
  • plitvice lakes national
  • museum broken relationships
  • unesco heritage site
  • krka national park
  • roman emperor diocletian
  • crystal clear water
  • walk city walls
  • clear blue water
  • water crystal clear
  • game thrones filmed
  • marco polo born
  • walking tour town
  • bad blue boys
  • cable car top
  • main bus station
  • second largest city
  • zagreb capital croatia
  • hour bus ride
  • largest city croatia
  • caught local bus
  • good nights sleep
  • st mark church
  • crystal clear waters
  • hula hula bar
  • national park croatia
  • town stari grad
  • walked city walls
  • white water rafting
  • austro hungarian empire
  • built roman emperor
  • cruise ship passengers
  • decided head apartment
  • eating ice cream
  • ice cream shop
  • mljet national park
  • national park plitvice
  • spent couple hours
  • town hvar island

Here is the simple source code:

import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from gensim import utils

def ngrams():

    # build one big corpus string from the blogs stored in ElasticSearch
    corpus = ''
    for blogtext in getblogs():
        corpus += blogtext

    stops = longstopwords()

    # tokenize/normalize and drop stop words
    prime_words = [word for word in utils.simple_preprocess(corpus) if word.lower() not in stops]

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()

    finder = BigramCollocationFinder.from_words(prime_words)
    finderII = TrigramCollocationFinder.from_words(prime_words)

    # only n-grams that appear 3+ times
    finder.apply_freq_filter(3)
    finderII.apply_freq_filter(3)

    print "TOP UNIGRAMS:"
    for t in nltk.FreqDist(prime_words).most_common(100):
        print '<li>' + t[0] + '</li>'

    print '-----------------'

    print "TOP BIGRAMS:"
    for bigram in finder.score_ngrams(bigram_measures.raw_freq)[:100]:
        print '<li>' + " ".join(bigram[0]) + '</li>'

    print '-----------------'

    print "TOP TRIGRAMS:"
    for trigram in finderII.score_ngrams(trigram_measures.raw_freq)[:40]:
        print '<li>' + " ".join(trigram[0]) + '</li>'

You can see from the code that n-grams are a simple and powerful way to do some discovery on your dataset. If you take a look, you can see 'game thrones' in the bigrams list. 'Of' was filtered out with the stop words, but a good deal of Game of Thrones filming takes place in Croatia. Numerous national parks and towns appear in these lists as well. Plugging these terms into my search engine yields some interesting discoveries, and I've already found an activity of interest: white water rafting.
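
The 'game thrones' artifact is easy to reproduce: once stop words like 'of' are stripped before counting, the surviving words become adjacent and get paired up. A toy version with no NLTK dependency:

```python
from collections import Counter

stops = {"a", "an", "the", "of", "in", "for"}

def bigrams_after_stopwords(text):
    # drop stop words first, then pair up whatever words survive
    words = [w for w in text.lower().split() if w not in stops]
    return Counter(zip(words, words[1:]))

counts = bigrams_after_stopwords(
    "A good deal of filming for Game of Thrones takes place in Croatia"
)
# ("game", "thrones") now shows up as a bigram even though those two
# words are never adjacent in the original sentence.
```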

Below is a sample of the LDAVis UI for topic analysis. I debated whether I should write this post in IPython (where I did the LDAVis analysis). Ultimately, I like having my posts in a single place, but IPython (now Jupyter) is a convenient way for analysts to share their work. You can now run these notebooks off of GitHub, and I stuck the best of my data runs and data in this Jupyter Notebook. How I arrived at this result, and what the interactive chart says, are worth some note. As you can see, there isn't as much distinction between the topics as I would like, as shown by the overlap in the topic circles. However, the results still provide a good bit of insight. Topic #2 contains 'zadar', 'water', 'beach', 'sea', and 'organ' as some of the most frequent+salient terms. Basically, the word 'organ' doesn't appear much in the corpus, but when it does, it belongs almost exclusively to this topic. When the lambda slider is set to .5, we see a good mix of popular words vs. strong words; lambda is essentially a balance of frequency and saliency that you can control. Some topics were awash in ambiguous terms, but manipulating lambda can provide clarity. Topic #2 is what gave me pause to investigate "organ" and discover the Sea Organ in Zadar.
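
That lambda slider implements the "relevance" measure from the LDAVis paper: relevance(w, t) = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), where the second term is the "lift" of a word within a topic. A small sketch with made-up probabilities (the numbers are illustrative, not from my model):

```python
import math

def relevance(p_word_topic, p_word_overall, lam):
    """LDAvis relevance: lam * log p(w|t) + (1 - lam) * log lift."""
    lift = p_word_topic / p_word_overall
    return lam * math.log(p_word_topic) + (1 - lam) * math.log(lift)

# Two illustrative words in a topic: "water" is frequent everywhere,
# while "organ" is rare overall but concentrated in this topic.
#                      p(w|topic)  p(w) overall
water_rel = relevance(0.030,      0.025,  0.5)
organ_rel = relevance(0.010,      0.0005, 0.5)
```

At lambda = 1 the ranking is pure frequency (so "water" wins), at lambda near 0 it is pure lift (so "organ" wins), and a middle value like .5 surfaces words like "organ" that define the topic without being globally common.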

I used Gensim again to perform the LDA analysis. This library goes to great lengths to scale and be performant, which is important so I don't have to switch to a new library to perform large analyses. Here is the sample code used to produce my corpus, dictionary, and LDA model. The general process is as follows:

  1. Get blog text from ElasticSearch
  2. Get an expansive list of firstnames and stopwords to filter out
  3. Only include terms that appear 4 or more times
  4. Generate Dictionary, Corpus, and LDA model
import logging
from collections import defaultdict
from gensim import corpora, models, utils

logging.basicConfig(level=logging.INFO)

def Dictionary_Corpus_LDA_Withoutnames():
    #get array of blogs stored in ElasticSearch
    documents = getblogs()

    #get a list of first names to filter out (gathered from US Census + half a dozen foreign names + nicknames)
    firsts = getfirstnames()

    #get a long list of stopwords to filter out
    stops = longstopwords()
    words_to_exclude = stops + firsts.keys()"Removing stop words and names")
    documents = [[word for word in utils.simple_preprocess(doc) if word not in words_to_exclude]
                 for doc in documents]"Removing infrequent words")
    token_frequency = defaultdict(int)
    for doc in documents:
        for token in doc:
            token_frequency[token] += 1

    #remove words that occur 3 or fewer times
    documents = [[token for token in doc if token_frequency[token] > 3]
                 for doc in documents]"Saving Corpus and Dictionary")
    dictionary = corpora.Dictionary(documents)'CroatiaTravelBlogs_NoNames.dict')
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    corpora.MmCorpus.serialize('CroatiaTravelBlogs_NoNames.mm', corpus)

    topics = 50
    passes = 100
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=topics, passes=passes)
    lda.print_topics(topics)"Croatia_Topic_No_Names-{0}_Passes-{1}.lda".format(topics, passes))
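
Under the hood, the Dictionary/doc2bow step above is just a token-to-id map plus per-document counts. Here is a dependency-free sketch of what the saved corpus ends up holding (the real `corpora.Dictionary` does more, such as pruning and persistence):

```python
from collections import Counter

def build_dictionary(documents):
    """Assign an integer id to each unique token, like corpora.Dictionary."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Bag-of-words: sorted (token_id, count) pairs, like Dictionary.doc2bow."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

docs = [["zadar", "sea", "organ", "sea"], ["dubrovnik", "walls", "sea"]]
token2id = build_dictionary(docs)   # zadar=0, sea=1, organ=2, dubrovnik=3, walls=4
corpus = [doc2bow(d, token2id) for d in docs]
# corpus[0] is [(0, 1), (1, 2), (2, 1)]: "sea" appears twice in the first doc
```

This sparse (id, count) representation is exactly what `MmCorpus.serialize` writes to disk and what `LdaModel` consumes.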

Take a look at a few more topics listed in the notebook. Set the Lambda slider on the right graph to .5 and choose a topic number on the left graph. For example, Topic #4 is a bit confusing. Toggle the Lambda down to .33 and you'll get 'bus', 'hostel', 'squirt' (a nickname I should have filtered), 'cheaper', and 'backpacks'. I think this topic contains blogs by budget travelers. Tagging these blogs for would-be travelers could be a very compelling feature for the search engine. I'm unclear on Topics #1 and #3. Topic #5 is about Dubrovnik. Topic #6 relates to the popular islands just off the coast. Topic #7 is about history and architecture. Topic #8 is about camping and motorhomes. Topic #9 is about some of the national parks. I think you get the idea. To reiterate, I often move back and forth between LDAVis and my search tool to check out terms in the context of the blogs. That's how I knew I had missed a nickname in Topic #4: it's always the same author, and the first few results were about their son.

I didn't go too in depth on LDA here, but it's worth reading the blog by Ted Underwood mentioned above. The intro to LDA written by one of the original authors, David Blei, is here. The paper behind LDAVis is interesting too; the website is the best place to get a better intuition for their methodology of LDA analysis. Finally, if I were to take the exercise further academically, I'd improve LDA specifically for blogs by adding another dimension for author and custom treatment for the title.