May 31, 2016

An Original Work

For the last couple of years I’ve been curious about the nature of original work, or invention, insofar as software is concerned. I’ve had the fortune of speaking with inventors, patent holders, fresh PhD graduates, and entrepreneurs. Each time I meet these people, I feel engaged. Engagement is that notion of true interest and focus that I’ve lost perspective on over the years. Work turns into momentum, and somewhere along the way I lost my engagement. I can't remember meeting someone who holds a PhD who wasn’t engaging. Perhaps for 2 reasons: first, at some point they opted out of the comfortable but traditional path of private industry and, second, they’ve spent considerable time devoted to more graduated and critical thinking. More than once I’ve considered going back to school. I even took the GRE last fall. Ultimately, I couldn’t settle on a field of study that I was excited enough about to devote a period of my life to. My favorite summary of why to get a PhD: "Pursuing a Ph.D. is the only way to spend 4 to 8 years being paid to work on something that the market does not directly value in the short term." I haven't found anything yet that I'd like to make that long of a bet on, but there are a few questions that could be good candidates: how can machines interface directly with the physical brain to expand cognition, can machines ask good questions, can nuclear energy be safe and affordable. I believe all of these are in the realm of "could be now or in 50 years". I’ve had some conversations with PhD students, over many beers, who were disillusioned with the process of choosing a focus for their studies. My limited impression is that you apply to school thinking you’d love to study X. But then you realize that there are only Y professors taking on new candidates, even fewer with whom you have chemistry, and even fewer who have funding to support your work. A lot of work to not be doing what you’d like to be doing.
Additionally, you have to enjoy the process, and having been a pragmatist most of my working life, the idea of research is much different than engineering/development: your work is not a means to a specific end.

For now I’ve put my academic ambitions on hold. I've turned my attention towards inventors and/or entrepreneurs. I believe there are 2 broad notions of producing an original work: academic or hacking. I’ve been a hack for some time. Make it work, make it right, make it fast. The lack of discipline in that approach can have its own pernicious effects. In fact, if you watch this fantastic talk on the background behind the recent resurgence in neural networks, you can draw a line to Geoffrey Hinton, who can draw a line to many other researchers going back to the 1970s. Geoffrey makes a strong argument for the case of pure research. But the very nature of how he came upon this discovery shows that his research could easily have not produced something so applicable outside academia. For decades neural networks were a mostly academic exercise until he found a couple of key revelations. The previously mentioned article on research vs tinkering strikes a chord and makes me want to be more disciplined in my hackery. Not too long ago I did some of my own hackery on LDA as it relates to travel blogs. Likely, somebody has already done similar work, but I did little research to build on previous work. I also was quite sloppy in execution. However, my aim was to address a basic premise: was this problem solvable? To that end, I enjoyed the process very much. It was a personal challenge of sorts, similar to the reason I do long-distance triathlons, speak publicly, or venture into surf outside my comfort zone: I aim to push myself and grow.

There are innumerable ways for me to produce an original work. I've decided to focus on 3 tenets: the work should be applicable at some point in the near future (less than 2 years); the work must be distinct enough from other works, which can include how it was executed; and, finally, I’d prefer to leverage my existing talents that exploit knowledge across unique domains. My current expression of that original work is through a new startup that aims to disintermediate traditional retailers and advertising by connecting brands more directly with consumers. If you are one of the 2 followers of my blog, you'll note this is a departure from my consulting adventures. With good reason: I need to scratch an itch to create something from nothing. There are many interesting machine learning challenges to tackle in this new startup adventure: recommendation systems, identifying unique products and their variations, and reinforcement learning. Within that space, I aim to do some exciting work that will, no doubt, leverage lots of others’ work. Perhaps I might be able to leverage the work of an academic whose efforts haven't yet seen production. A marriage of academic and entrepreneurial endeavors seems a fitting compromise.

Mar 6, 2016

Investor CTO

For the past 6 months I've contributed my time to 2 startups. In that time, I've become keenly aware of two things: 1) there are a lot of startups that need technical and management help and 2) my background in software development and executive leadership enables me to provide valuable advice and hands-on implementation to help those startups. The model is simple: I take a reduced compensation rate in addition to an equity stake in the company that has an attached pre-defined equity maturation event. Now my time is limited, so 2 or 3 startups constitutes a full book at the moment. It pays to be selective like most investors are.

I've taken a lot of meetings with founders and would-be founders. Building a pipeline of potential clients is as much work for me as it is for the startups I work for. I see founders that fit in 2 camps. First, there are those that have deep expertise in a domain and decide to take a run at making a business out of it. We'll call them domain founders. Then there are those that have deep experience working for companies and are trying to make a go at a market that is new to them. We'll call them experienced founders. With domain founders, it's important to be able to work with someone who knows that they can't do it all and is open to "coaching up". I tend to provide input well beyond technical aspects and more into how to scale or growth-hack various parts of the business. Experienced founders typically underestimate how the lack of knowledge in a new market will hinder their path. I call them experienced founders because they know how to scale at a certain point, but they lack that zero to one growth experience. With these founders it's key to emphasize how risk is viewed differently in a startup. Getting to market fast, cutting scalability corners, homing in on your hypothesis, and testing the market in various ways are key.

My most selective criterion is whether I can work with the team. I'll take an initial meeting and a first engagement at no cost. It's worth it to invest my time upfront for free if there is an opportunity. What makes a good team? Proper role definition and open-mindedness go a long way in my book. They say you need a hacker and a hustler. That's a good attempt at initial role definition. Oftentimes, I'll come into a company that doesn't fit that mold, and you have people stepping on each other's toes. That's fine, but it needs to be sorted quickly. And that process is never ending. Being open-minded is a must-have. Nobody can predict the future; as Daniel Kahneman states, "experts are no better at predicting the future than dart throwing monkeys". If there is contention, state how you intend to objectively move through that disagreement and what risk is associated with you being wrong. I've been hugely wrong. But in the end, I was glad to be proven wrong, because that was a big win for the business. Third, I'm a fan of the no asshole rule. I've met founders that are bullies and run over a conversation. They interrupt their people and generally think they are God's gift to free market enterprise. You spend more time working with people than you do with your family, so my investment is more than fiscal. I need to enjoy working with these people, and assholes just won't cut it. Finally, I need to work with competent people. Both intellectually and emotionally competent people. There is a famous TED talk that says "people don't buy what you do, but why you do it". Often that 'why' can get distorted and convoluted. I would restate it as "people buy who you do it with first, over what you do and why you do it". Team is number 1 in my book. And it seems that I'm not the only one. In a recent survey of software devs, culture and colleagues rank above building something significant.

Second to the team is knowing how my skills can help the business succeed. If you're working with hardware, I'm not a great fit. If you're working with enterprise software, I'll pass. If you need to build an MVP, I'm interested. If you need help managing your development team, I'm all ears. If you're trying to raise capital, I can help you scope your efforts and be an experienced asset on the team. It pays to tell people 'no' and be specific about my value-add. After all, I'm expecting these fledgling companies to give me a longer term stake in their business and I have to feel that I can provide value.

As the team and product grow and scale, I get to see who isn't pulling their weight. I don't have time for passive equity holders. A new startup will go through phases: building an MVP, getting early adopter traction, scaling past $1 million, raising capital... Maybe there are more technical needs in the beginning, and marketing needs to hold off. This is why role definition (which changes over time too) is important. Accountability needs to be clear. I like to have a good sense for who is going to contribute and how much. If I see passive equity holders along the way, I will voice my concerns. After all, this is an investment for me, and I don't need dead weight pulling down my investment. Warren Buffett's model is simple: "estimate an investment’s intrinsic value, handicap its risk, buy using margin of safety, concentrate, stay in the circle of competence, let it roll as compounding did the work." I believe that as a partial investor in a company I can provide much more value for the long term and decrease the odds of failure substantially.

One aspect of this position that I enjoy more than I ever thought I would is the creative thinking you can apply to building a business. Tech is full of business models that never existed. In fact, the concept of "business model" didn't really exist until the dot-com tech bubble. You are adjusting to a constantly changing field of information (competitors, market) that is full of opportunities to exploit. If you can't see the opportunity in things, startups are the wrong game for you. As John Maynard Keynes states, "A large proportion of our positive activities depend on spontaneous optimism rather than on a mathematical expectation". I enjoy the possibilities of the game. And that is likely the biggest draw for me. I could make more money in a corner office, but I much prefer the wild west of startups. As Warren Buffett states, "we enjoy the process far more than the proceeds".

Being an investor-CTO is as much a new adventure for me as the startups are for those I consult for. I still have a lot to learn. But Boulder is a burgeoning town for startups. And I'll never close the door on joining one of these teams full time.

Oct 17, 2015

Short term travel for the vagabonder wannabe

Traveling is best when there is an open-ended expanse of time. But what if that's what you don't have? For the last decade, my wife and I have longed to head out on that open-ended wanderlusting journey, but career ambitions have gotten the better of us. Thus, we’ve been crafting the art of short term (1-3 weeks) international travel. What follows is an outline of the mindset and approach needed to make the most of your limited time. Before you read on, I assume that you:

  • like to experience a trip instead of ‘checking-out’
  • will have more to spend (particularly on logistics) than the long-term vagabonder
  • see the value in a flexible itinerary
  • would like to avoid the traditional tourist line when you can
If that’s not your cup of tea, the suggestions here may not be a good fit for you. If you typically book cruises, close this browser immediately.


First and foremost, be researched, but not regimented. There are young travelers and newbs who need to have an itinerary, but a flexible plan is better. What if the weather sucks? What if you want to stay longer or shorter? How do you stay flexible? It’s so much easier now than it used to be. On our trip to Italy 10 years ago, we’d start calling down the list in Lonely Planet’s accommodation section the day before, but more often the morning of, our intended arrival. Now, Airbnb, cheap international data plans, and a few handy smartphone apps make flexible travel much easier.

What does it mean to be researched? Logistics, areas of interest, and knowing what needs to be planned. Let's start with areas of interest. I'm assuming you know where you want to go. But let’s be more specific. What is there to do in a region? What are the highlights? Food, adventure, sights, culture... Use TripAdvisor and Google to get a lay of the land. Read travel blogs to find compelling experiences. You are looking for depth here: not the postcard stuff, but what's underneath. Time of year, touristy rating, cost, time commitment... That's the depth you are looking for. Be judicious in determining whether the content was paid for by an interest other than travel (like a hotel chain sponsoring a blogger or a country-promoted post). Not that sponsored content is useless, but the intentions of the author can be misleading and often outright deceptive. This scathing review of a TripAdvisor post is just the tip of the iceberg.


Now if you have strung together a few places of interest, start fleshing out logistics. How do people get around: train, bus, car, motorhome, walking, scooter, bike... Travel forums and travel blogs are great at digging this stuff out. Renting a car buys you the ultimate freedom, but knowing the bus/ferry/train schedules can work well too. Don't be intimidated, but read up on others' impressions of those transportation modes relative to your frame of reference. For instance, do you drive on the other side of the road? Do trains tend to run late? Is English common? In Croatia, for example, the ferry schedule changes with the season. Finding accurate info online can be a challenge. This is why I created a search engine based on narrative travel blogs.

Commit at the last responsible moment

What needs to be planned and what can wait? Do you need to book that tour in advance because it sells out? Obviously you need plane tickets there and back. What about inter-region flights? Only a few places to stay in a particular area? Book it. The more you can whittle this down the better. Here is the key: there are lots of things to do in a region, and holding off on your commitments until the last possible instant increases your chance of finding that key piece of information that might totally change your mind. For instance, we did a zip line tour that we booked hours earlier. Most people book weeks or months in advance. For us, it turned out the tour was on the way between cities we were traveling to. Lots of people recommended we visit Plitvice Lakes in Croatia because of the expansive beauty and unique landscapes. Well, the weather was crap when we would have passed by this area, and another park was convenient for our itinerary and the weather there was great. You could also go swimming there and it was less crowded. We decided to go there the morning of. It’s these last minute plans that seem to have the best effect on perceived travel experience. Serendipity can bring a lot of joy to a trip. Working to find serendipity is a bit like hammering water to make it flow, but you can create the conditions that improve the chance of making it happen.

Sample Itineraries

You need to put it all together by knowing where to start. What direction should you head? Are there sample itineraries? What are the what-if scenarios on that itinerary? There are some sample itinerary websites out there. You can get a sense from ’trip blogs’ of where to head. Q&A travel sites can also help with itinerary generation. I posted this on Trippy and got 3 pretty decent answers. In the end, we didn't pick any of these specifically, but the information provided was helpful.

On-the-ground planning

Lastly, I relish the experience of piecing things together on the ground. You can pick up information from locals or other travelers. Once you are on the ground use Google Maps, TripAdvisor, saved information from your travel investigations, travel books, and, importantly, Airbnb. Lots of times the host will provide you some key information for making the most of your time. On-the-ground intel, adjusting, re-routing, last-minute decisions... all help to seed those moments of serendipity.

Sep 23, 2015

Topic Analysis In Depth - Croatia Travel Blogs

I'm headed to Croatia soon, so I decided to take another run at topic analysis, this time with more than a cursory look. One of the challenges I've had with various topic analyses was validating the result set. Recently I've discovered LDAVis, which helps with just that. I won't go into LDA too much here, but Ted Underwood does *the* best job at explaining LDA in an intuitive way. LDAVis lets you examine a topic through its visual relation to other topics, how distinct topics are from one another, and term frequency/saliency within topics. Here I'll walk through how I used the tool to execute many different topic extraction/data cleaning runs to reveal interesting insights into Croatian travel.

The dataset consists of scraped travel blogs, from which I used the body and title. I have this dataset locally on my computer in ElasticSearch, which includes 1500 blogs on Croatia alone. ElasticSearch isn't really necessary for the analysis itself, but it was convenient for document storage, term discovery, and blog review. Below is a sample screenshot from my "Croatian" version using a custom search tool. You can see that I searched "organ", which showed up in one of the major topics, and that struck me as odd. It turns out that there is a pipe organ in the city of Zadar called the Sea Organ (described here, 2nd paragraph from the bottom). It plays music by way of sea waves underneath a large set of marble steps.

If you looked at that last blog, that is a typical representation of a good travel blog: descriptive, narrative, lots of pictures, and focused on a distinct region. It's important to note that these blogs are mostly journal-style catalogs, not 'Top 10' posts. The compelling notion is that, theoretically, whatever you find in this data are real experiences that people have had. As you can tell, I'm a fan of this dataset, but it can be pretty messy to analyze. Some things I've had to account for:

  • filtering HTML, which is difficult (even using BeautifulSoup to pull the main body of the post)
  • keeping only English blogs (using a package called LangId)
  • removing stop words ('a', 'the', 'an'...). I used a much more expansive list than the out of the box NLTK one.
  • removing people's names (more on this later)
  • removing words that appear infrequently
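
As a rough illustration of the cleanup steps listed above, here is a minimal, dependency-free sketch. It is not the code used for this post: it swaps in the standard library's html.parser for BeautifulSoup, skips language detection entirely, and the function names, stop word set, and name set are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    # Lowercase runs of 2+ letters, so "didn't" becomes "didn"
    # (similar to the 'didn' and 've' entries in the unigram list below).
    return re.findall(r"[a-z]{2,}", text.lower())


def clean_corpus(html_docs, stop_words, names, min_count=2):
    """Strip HTML, drop stop words and names, then drop infrequent tokens."""
    docs = [
        [tok for tok in tokenize(strip_html(html))
         if tok not in stop_words and tok not in names]
        for html in html_docs
    ]
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]
```

The frequency filter runs last on purpose, so that counts are taken over the already-cleaned tokens rather than over raw markup.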

Before diving into the topic analysis, a great way to discover some features (and bad data) is by looking at some different n-grams. Basically, what are the most common single words, word pairs, and word triples?
Popular Unigrams (starting with most popular)
  • day
  • town
  • time
  • city
  • croatia
  • water
  • dubrovnik
  • people
  • split
  • good
  • bus
  • night
  • boat
  • island
  • walk
  • beautiful
  • well
  • small
  • great
  • walked
  • place
  • dinner
  • will
  • bit
  • park
  • sea
  • nice
  • trip
  • beach
  • decided
  • going
  • morning
  • tour
  • croatian
  • lunch
  • zagreb
  • today
  • hotel
  • road
  • headed
  • didn
  • find
  • left
  • long
  • hours
  • restaurant
  • walls
  • arrived
  • area
  • hour
  • car
  • local
  • couple
  • top
  • hvar
  • pretty
  • side
  • days
  • wine
  • walking
  • early
  • lovely
  • apartment
  • coast
  • lot
  • ferry
  • room
  • lakes
  • food
  • breakfast
  • head
  • big
  • ve
  • view
  • palace
  • amazing
  • group
  • ride
  • pm
  • streets
  • sun
  • don
  • lots
  • minutes
  • bar
  • best
  • started
  • stopped
  • main
  • set
  • called
  • full
  • large
  • views
  • finally
  • train
  • adriatic
  • tomorrow
  • told
  • swim
Popular Bigrams
  • national park
  • city walls
  • ice cream
  • plitvice lakes
  • adriatic sea
  • diocletian palace
  • bus station
  • cable car
  • game thrones
  • walking tour
  • walked town
  • walk town
  • stari grad
  • years ago
  • early morning
  • crystal clear
  • couple hours
  • tour guide
  • walled city
  • cruise ships
  • bell tower
  • half hour
  • train station
  • hvar island
  • bus ride
  • marco polo
  • cruise ship
  • long time
  • small town
  • hvar town
  • main square
  • day dubrovnik
  • island hvar
  • full day
  • narrow streets
  • decided head
  • pretty good
  • town dubrovnik
  • dalmatian coast
  • minute walk
  • olive oil
  • plitvice national
  • split croatia
  • unesco heritage
  • lakes national
  • rest day
  • bosnia herzegovina
  • day split
  • day trip
  • decided walk
  • heritage site
  • boat ride
  • city dubrovnik
  • walk city
  • upper town
  • white wine
  • arrived dubrovnik
  • broken relationships
  • caught bus
  • dubrovnik croatia
  • great day
  • spent time
  • early night
  • long day
  • top hill
  • well worth
  • headed town
  • main street
  • museum broken
  • people watching
  • rental car
  • blue water
  • couple days
  • grocery store
  • red wine
  • roman emperor
  • tour group
  • beautiful city
  • good time
  • shops restaurants
  • top deck
  • air conditioning
  • clear water
  • sail croatia
  • side road
  • blue cave
  • day day
  • emperor diocletian
  • views city
  • amazing views
  • beautiful place
  • glass wine
  • great time
  • pile gate
  • upper lakes
  • bus town
  • day croatia
  • island brac
  • lokrum island
  • side island
Popular Trigrams
  • lakes national park
  • plitvice national park
  • plitvice lakes national
  • museum broken relationships
  • unesco heritage site
  • krka national park
  • roman emperor diocletian
  • crystal clear water
  • walk city walls
  • clear blue water
  • water crystal clear
  • game thrones filmed
  • marco polo born
  • walking tour town
  • bad blue boys
  • cable car top
  • main bus station
  • second largest city
  • zagreb capital croatia
  • hour bus ride
  • largest city croatia
  • caught local bus
  • good nights sleep
  • st mark church
  • crystal clear waters
  • hula hula bar
  • national park croatia
  • town stari grad
  • walked city walls
  • white water rafting
  • austro hungarian empire
  • built roman emperor
  • cruise ship passengers
  • decided head apartment
  • eating ice cream
  • ice cream shop
  • mljet national park
  • national park plitvice
  • spent couple hours
  • town hvar island

Here is the simple source code
import nltk
from gensim import utils
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder


def ngrams():
    # getblogs() and longstopwords() are helpers defined elsewhere (not shown):
    # one yields the text of each blog, the other returns the long stop word list.
    # Join with a space so words at blog boundaries don't run together.
    corpus = ' '.join(getblogs())

    stops = set(longstopwords())

    # simple_preprocess already lowercases and tokenizes
    prime_words = [word for word in utils.simple_preprocess(corpus) if word not in stops]

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()

    finder = BigramCollocationFinder.from_words(prime_words)
    finderII = TrigramCollocationFinder.from_words(prime_words)

    # to keep only bigrams that appear 3+ times, uncomment:
    # finder.apply_freq_filter(3)

    print("TOP UNIGRAMS:")
    for t in nltk.FreqDist(prime_words).most_common(100):
        print('<li>' + t[0] + '</li>')

    print('-----------------')

    print("TOP BIGRAMS:")
    for bigram in finder.score_ngrams(bigram_measures.raw_freq)[:100]:
        print('<li>' + " ".join(bigram[0]) + '</li>')

    print('-----------------')

    print("TOP TRIGRAMS:")
    for trigram in finderII.score_ngrams(trigram_measures.raw_freq)[:40]:
        print('<li>' + " ".join(trigram[0]) + '</li>')

You can see from the code that n-grams can be a simple and powerful way to do some discovery on your dataset. If you take a look you can see 'Game Thrones' in the Bigrams list. 'Of' was filtered out with the stop words, but a good deal of filming for Game of Thrones takes place in Croatia. Numerous national parks and towns are in these lists as well. Plugging these terms into my search engine yields some interesting discoveries, and I've already found an activity of interest: white water rafting.
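
To make the 'Game Thrones' effect concrete, here is a small standard-library sketch of n-gram counting. It is an illustration rather than the NLTK-based code above, and the sample sentence and stop word set are made up. Because stop words are dropped before neighbours are paired up, 'game' and 'thrones' end up adjacent.

```python
from collections import Counter


def ngram_counts(tokens, n, stop_words=frozenset()):
    """Count n-grams over the tokens that survive stop word removal."""
    kept = [t for t in tokens if t not in stop_words]
    # zip over n staggered copies of the list to form sliding windows
    return Counter(zip(*(kept[i:] for i in range(n))))


# Hypothetical sample text and stop words, for illustration only.
tokens = "we walked the city walls where game of thrones was filmed".split()
stops = {"we", "the", "where", "of", "was"}

bigrams = ngram_counts(tokens, 2, stops)
trigrams = ngram_counts(tokens, 3, stops)
```

Here bigrams contains ('game', 'thrones') and trigrams contains ('game', 'thrones', 'filmed'), mirroring entries in the lists above.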

Below is a sample of the LDAVis UI for topic analysis. I debated whether I should write this post in IPython (where I did the LDAVis analysis). Ultimately, I like having my posts in a single place. But IPython (now Jupyter) is a convenient way for analysts to share their work. You can now run these notebooks off of GitHub, and I stuck the best of my data runs and data in this Jupyter Notebook. The process of getting to this result and what this interactive chart says are both worth some note. As you can see, there isn't as much distinction between the topics as I would like, as noted by the overlap in topic circles. However, the results still provide a good bit of insight. Topic #2 contains 'zadar, water, beach, sea, organ' as some of the most frequent+salient terms. Basically, the word 'organ' doesn't appear much in the corpus, but when it does it belongs exclusively to this topic. When the lambda slider is set to .5, we see a good mix of popular words vs strong words. Lambda is essentially a balance you can control between how frequent a term is within the topic and how distinctive it is to that topic. Some topics were awash in ambiguous terms, but manipulating Lambda can provide clarity. Topic #2 is what gave me pause to investigate "organ" and discover the Sea Organ in Zadar.
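
For a feel of what the lambda slider does, here is a small sketch of the term relevance score described in the LDAVis paper: relevance = lambda * log p(term | topic) + (1 - lambda) * log(p(term | topic) / p(term)). The probabilities below are invented for illustration and are not taken from my model; the function names are hypothetical too.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """LDAVis-style relevance: lam weights in-topic frequency, (1 - lam) weights lift."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


def rank_terms(topic_probs, corpus_probs, lam):
    """Return terms sorted from most to least relevant for a given lambda."""
    return sorted(
        topic_probs,
        key=lambda term: relevance(topic_probs[term], corpus_probs[term], lam),
        reverse=True,
    )


# Made-up numbers: 'water' is common everywhere, 'organ' is rare but
# concentrated in this one topic.
topic_probs = {"water": 0.05, "organ": 0.01}
corpus_probs = {"water": 0.04, "organ": 0.001}

by_frequency = rank_terms(topic_probs, corpus_probs, lam=1.0)
balanced = rank_terms(topic_probs, corpus_probs, lam=0.5)
```

With lambda at 1 the ranking is purely by frequency within the topic, so 'water' comes first. At 0.5 the rare but distinctive 'organ' moves ahead of it, which is the kind of shift that surfaced the Sea Organ.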

I used Gensim again to perform the LDA analysis. This library goes to great lengths to scale and be performant, which is important so I don't have to switch to a new library to perform large analyses. Here is the sample code used to produce my corpus, dictionary, and LDA. The general process is as follows:

  1. Get blog text from ElasticSearch
  2. Get an expansive list of firstnames and stopwords to filter out
  3. Only include terms that appear 4 or more times
  4. Generate Dictionary, Corpus, and LDA model

import logging
from collections import defaultdict

from gensim import corpora, models, utils


def Dictionary_Corpus_LDA_Withoutnames():
    # get array of blogs stored in ElasticSearch
    documents = getblogs()

    # get first names to filter out (gathered from US Census + half a dozen foreign names + nicknames)
    firsts = getfirstnames()

    # get a long list of stopwords to filter out
    stops = longstopwords()
    # a set keeps the membership test below fast
    words_to_exclude = set(stops) | set(firsts.keys())

    logging.info("Removing stop words and names")
    documents = [[word for word in utils.simple_preprocess(doc) if word not in words_to_exclude]
                 for doc in documents]

    logging.info("Removing words that occur 3 or fewer times")
    token_frequency = defaultdict(int)
    for doc in documents:
        for token in doc:
            token_frequency[token] += 1

    # remove words that occur 3 or fewer times
    documents = [[token for token in doc if token_frequency[token] > 3]
                 for doc in documents]

    logging.info("Saving Corpus and Dictionary")
    dictionary = corpora.Dictionary(documents)
    dictionary.save('CroatiaTravelBlogs_NoNames.dict')
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    # the corpus file name is blank in this listing; supply a path before running
    corpora.MmCorpus.serialize('', corpus)

    topics = 50
    passes = 100
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=topics, passes=passes)
    lda.print_topics(topics)
    lda.save("Croatia_Topic_No_Names-{0}_Passes-{1}.lda".format(topics, passes))
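
If the dictionary and corpus step above feels opaque, this plain-Python sketch shows the idea behind it: each distinct token gets an integer id, and each document becomes a sparse list of (id, count) pairs. It is a simplified stand-in for what Gensim's Dictionary and doc2bow produce, not their implementation, and Gensim may assign ids in a different order. The two tiny documents are made up.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in documents:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def doc_to_bow(doc, token_to_id):
    """Convert one tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())


# Hypothetical, already-cleaned documents.
documents = [
    ["zadar", "sea", "organ", "sea"],
    ["dubrovnik", "walls", "sea"],
]
token_to_id = build_dictionary(documents)
corpus = [doc_to_bow(doc, token_to_id) for doc in documents]
```

Word order is thrown away at this point, which is why LDA sees each blog as a bag of words rather than a narrative.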

Take a look at a few more topics listed in the notebook. Set the Lambda slider on the right graph to .5 and choose a topic number on the left graph. For example, Topic #4 is a bit confusing. Toggle the Lambda down to .33 and you'll get 'bus', 'hostel', 'squirt' (a nickname I should have filtered), 'cheaper' and 'backpacks'. I think this topic contains blogs for budget travelers. Tagging these blogs for would-be travelers could be a very compelling feature for the search engine. I'm unclear on Topics #1 and #3. Topic #5 is about Dubrovnik. Topic #6 relates to the popular islands just off the coast. Topic #7 is about the history and architecture. Topic #8 is about camping and motorhomes. Topic #9 is about some of the national parks. I think you get the idea. To reiterate, I often move back and forth from LDAVis to my search engine to check out terms in the context of the blog. That's how I knew that I missed a nickname in Topic #4, since it's always the same author and the first few results were about their son.

I didn't go too in depth on LDA here, but it's worth reading the blog by Ted Underwood mentioned above. The intro to LDA written by one of the original authors, David Blei, is here. The paper behind LDAVis is interesting too. The website is best to get a better intuition for their methodology of LDA analysis. Finally, if I were to take the exercise further academically, I'd improve LDA specifically for blogs by adding another dimension for author and custom treatment for title.

Aug 14, 2015

Topic Analysis Exploration

I've been experimenting with Natural Language Processing, and I'm keenly interested in unsupervised techniques such as LDA and LSI. I have a fascination with unsupervised techniques, like clustering and neural networks, that have the ability to provide meaning without preconceived influence. The basic steps to set up LSI or LDA analysis are covered in the Gensim tutorials. If you don't know Gensim, it's a pretty sweet set of libraries for topic analysis and there's even a port of Google's Word2Vec to Python with some key performance improvements. I appreciate the focus on performance here, something that I think is rare in academic-like libraries.

My current knowledge of NLP is still pretty elementary, but I've focused on seeing a) what's possible and b) what has tutorials/libraries to get going. For purposes of this blog I'll stick to topic analysis, which Gensim does well. Roughly speaking, here was my R&D process, which wasn't rigorous or scientific by any means. Lots of trial and error.

  1. Pull blogs from Elasticsearch by country
  2. Filter stop words, and perform lemmatization/stemming
  3. Create a corpus and dictionary
  4. Run that corpus through LDA or LSI
  5. 'Read the tea leaves' (a topic is just a collection of words, so interpreting it can be difficult and require some insight)

First I tried LSI, and at times you have to really investigate what the topic is about. However, the sample below (from an LDA analysis of Uganda travel blogs) is a bit more straightforward. The print format is a bit confusing: probability*word + probability2*word2... Each word in the collection is listed with its corresponding 'strength'. This is a topic on safaris in Queen Elizabeth National Park:

topic #3 (0.010): 0.010*bwindi + 0.010*lions + 0.008*queen + 0.008*elephants + 0.008*elizabeth + 0.007*tracking + 0.007*impenetrable + 0.007*gorillas. + 0.005*elephant + 0.005*park
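
If you want to work with these printed topics programmatically, a few lines of string handling will do. This is a minimal sketch that assumes the exact "weight*word + weight*word" layout shown above; parse_topic is a hypothetical helper name, not part of Gensim.

```python
def parse_topic(line):
    """Turn 'topic #N (x): 0.010*word + ...' into a list of (word, weight) pairs."""
    # Drop the "topic #N (x): " prefix if it is present.
    _, _, body = line.rpartition("): ")
    pairs = []
    for term in body.split(" + "):
        weight, _, word = term.partition("*")
        pairs.append((word, float(weight)))
    return pairs


sample = "topic #3 (0.010): 0.010*bwindi + 0.010*lions + 0.008*queen"
parsed = parse_topic(sample)
```

Punctuation stuck to a word (like 'gorillas.' above) is kept as-is, which makes leftover tokenization problems easy to spot.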

This one is a bit tougher, take a look. Goats and beneficiaries? WTF? Type those 2 words into my search engine and filter by 'Uganda'. You'll see a few blogs related to Vets without Borders (VWB). Pretty cool, huh?

topic #24 (0.010): 0.012*goats + 0.005*goats, + 0.005*beneficiaries + 0.003*pens + 0.003*disabled + 0.002*chuck + 0.002*background + 0.002*vaccinate + 0.002*tracked + 0.002*right?

Having a fast search engine on hand to pair words together has been super helpful at figuring what a topic really is. But it's hit or miss. I have no idea what the one below is about. Maybe you can figure it out.

topic #34 (0.010): 0.004*learned + 0.004*played + 0.004*stories + 0.004*tents + 0.004*resort + 0.004*dance + 0.004*grateful + 0.003*treat + 0.003*exhausted + 0.003*medicine

Here are some things I've tried to get better topics.

  • Improved stop words. Originally I used the NLTK list, but then I just used this list
  • I've recently played with stemming and I think the results have improved slightly (Why didn't I use the raw Snowball field within Elasticsearch? In short, I couldn't find anything with 5 minutes of googling, but really I liked having more control over the data, like stopwords)
  • With LDA, I tried more passes which improves results at the cost of performance. Fine for my exercise. Similarly, I tried more iterations with LSI.
  • Vary the number of topics. With some of these countries I don't have a ton of blogs. For Uganda I have just a bit over 400. I haven't nailed this down yet, but 50 seems to do ok.
  • Just now I tried a bigram method since some of these blog posts are so long. The results aren't as strong, and I can see that it's just clustering random things from single blogs. But still some telling words in pairs. 'white water', 'health center', 'gorilla tracking'.
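
The post doesn't say which bigram method was used, so as an illustration only, here is the simplest possible version: pair each word with the one after it and keep the pairs that repeat. The sample sentence and the threshold of 2 are invented.

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

# Invented sample text, already lowercased and split into tokens.
tokens = "gorilla tracking permits sell out so book gorilla tracking early".split()
counts = Counter(bigrams(tokens))

# Keep only pairs that occur at least twice.
common = [pair for pair, n in counts.items() if n >= 2]
```

With a single long post as input, repeated pairs like this mostly reflect that one author's phrasing, which fits the 'clustering random things from single blogs' effect described above.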

I probably should have used a more rigorous method for optimizing the inputs, but I talked myself into thinking these are subjective enough anyway. Since I ran so many quick trials, I was able to know what should show up if I varied something. Another method I used to gauge the efficacy of topic analysis in general was to see if I could find the 'Things to Do' listed in TripAdvisor. Getting those topics, I presume, would just get my foot in the door. The reality is that I hope to find things that are difficult to track down in TripAdvisor or a general Google search. Like Vets without Borders. Volunteering as a means of travel is totally legit, but not a money-making adventure. That's probably why you can't find it within travel channels, which is why I think online travel planning sucks.

My overall impression of LDA and LSI is lukewarm at best. I can find some interesting things fast, but there are only a few of those things in 50 or 100 topics. For the rest you have to dig a bit. So I might have half a dozen topics that are solid. Perhaps part of this stems from certain blogs that are really many blogs combined into one. Words and documents are key pieces in these methodologies that provide boundaries for the model to learn from. Perhaps to get better data I could make a document a paragraph. Something worth trying. But for now I'd be willing to bet I've progressed to the edge of the 80/20 rule. Any future gains would be incrementalish.
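
Treating a paragraph as a document is easy to try. A minimal sketch, assuming paragraphs are separated by blank lines in the cleaned text; the `min_words` cutoff is an arbitrary choice to drop one-word fragments:

```python
import re

def split_paragraphs(post, min_words=5):
    """Split on blank lines; drop fragments shorter than min_words."""
    chunks = re.split(r"\n\s*\n", post)
    return [c.strip() for c in chunks if len(c.split()) >= min_words]

# Invented post with two real paragraphs and one throwaway line.
post = (
    "Day one we drove to Queen Elizabeth park.\n\n"
    "Lunch.\n\n"
    "Day two was gorilla tracking in Bwindi forest."
)
docs = split_paragraphs(post)
```

Each returned paragraph would then go through the same tokenizing and corpus steps as a whole post, at the cost of far more, far shorter documents.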

Back to my TripAdvisor hypothesis. It turns out topic analysis combined with manual interpretation can't match the 'things to do' in TripAdvisor. My sense is that there is a collection of techniques that will get me there. Some of those things I'm trying, so I've still got a lot to learn. If you're interested in diving a bit deeper on topic analysis, check this out.

Jul 24, 2015

Wanderight, a technical journey

While I've been vague about my idea and project, things are coming into focus. In this time of entrepreneurial exploration I've had time to get reacquainted with developing software. Every technology I've employed so far is new to me: operating system, cloud hosting platform, programming language, text-based search platform... But like most other software devs I enjoy learning new things. The pragmatist in me knows that many of these tools are the best for the job. Python is good at manipulating text, data analysis, and building things quickly. Elasticsearch is great at encapsulating the complexity and learning curve of running Lucene or Solr. Linux is the best platform to run and operate Python and Elasticsearch. Azure is the best platform for giving me free hosting, which is awesome at keeping my burn rate low.

I'd like to focus on the technology aspects of what I'm working on from the perspective of a beginner on many fronts. But I also have the belief that I'll be able to do some things with technologies that I barely understand as of yet. That's where the fun comes in. Without explaining the product and the opportunity/market...yada yada, I'll focus on the operational learnings and a bit about the motive for the focus. So first, a bit of a primer: I believe that many textual sources have a wealth of information that is mostly untapped. If you think Google News can aggregate well and Google Search can find things well, I'd hope to do something similar with travel. As a person that's traveled and explored in over a dozen countries, I hate the over-sponsored, highly-advertised world of online travel. There are people out there giving authentic information on travel blogs, forums... you just need to tease it out. Make the content easy to find and relatable to the end user, and be diligent about spotting and identifying that which is 'sponsored'. The easiest way I could think to start is with narrative travel blogs. Which brings me to my tech. Parse, clean, and analyze travel blogs. What I eventually want to have is a site that will serve up authentic travel information to the international planning traveler. As far as I can tell, there are 3 distinct categories.

  1. Build a 'basic' search engine based on a semi-curated content feed. By semi-curated I mean that I'll only be putting in content that I presume isn't sponsored. Some of that authentic content is apparent, some of it is tricky, so start with the obvious.
  2. Discover your data set. Yay data science. Understand key characteristics about it. Get a sense for what is 'good' and define those as hypotheses in your product journey. Clean your dataset. Don't even allow things in that are in a foreign language, don't meet a certain content length, or aren't based in the region of focus. Not to mention the cleaning and data janitorial work that needs to be done before it can be searched. Luckily Elasticsearch does a lot of that lifting for you: stemming, stopwords... Obviously I need to pick out the main content, key in on metadata, and clean HTML while parsing (covered in step 1).
  3. Generally speaking, find more intelligent ways to serve content up to the user. Things like region discovery through topic analysis. Or categorization to filter content for a traveler archetype. Summarize long posts to make searching faster. Extract places to enhance your search and discover capability. Recommend blogs to read based on blogs of interest. I believe that information extraction, natural language processing, and machine learning have become much more accessible to the non-academic. That being said, this step isn't plug and play, but there is a wealth of books, platforms, and sample code to get this autodidact well on my way. If nothing else, this is waaaaay better than going back to school for a degree in what I can learn with the power of the internets.
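
The 'clean HTML while parsing' part of steps 1 and 2 can be sketched with nothing but the standard library. This is an illustration under assumptions, not the actual scraper: the sample page is invented, and a real blog would need extra rules to find the main content and skip navigation, comments, and sidebars.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside script/style.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Invented page standing in for a scraped blog post.
page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Uganda</h1><p>We tracked <b>gorillas</b> today.</p>"
    "<script>track()</script></body></html>"
)
text = extract_text(page)
```

The resulting plain text is what would get length-checked, language-checked, and indexed.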

Now the details and my reflection 6 weeks in with so much new technology. To address step 1) I chose 2 key technologies: Python and Elasticsearch. Python for its text parsing prowess and Elasticsearch for the heavy lifting of Lucene. BeautifulSoup is an excellent library for finding information in blogs and scraping out what you want. Learning Python was fairly straightforward. The 2 languages I'm most familiar with are C# and JavaScript. Python is a lot more like JavaScript. The key difference building software this time around is that I have a veteran (read: cantankerous) view on technology. Everything in its place. I used to get wrapped around the axle with proper coding technique, architecture, scalability...but that doesn't have much place here. Those things matter most when you have different problems to solve than trying to see if other people give a shit about what you're building. This graphic is telling of the different development mindsets. My nature is to be a 'Settler', but I've longed to be a 'Pioneer'. Which basically means now I'm a giant fucking hack. You could drive a Mack truck through my hacks. It's fun and awesome. But I think the key difference is (and this is a big *if*) I would know how to scale things when the time comes. Whether I'd need to burn it to the ground and rewrite or build up what's there, I can figure it out because I have in the past. That's a comfort that not many others have and I consider it a key personal asset.

A long time ago I worked with Lucene. Like 10 years ago. What Elasticsearch has done for text search since that time is awesome. Getting going out of the box is easy. The tutorials and the amount of StackOverflow help make adoption so much less intimidating than trying to learn what TF-IDF is when you just want to get shit working. Get it working first, dive deep later. Elasticsearch is awesome at that. I don't need to care that it runs on 5 shards or that I need to use a Snowball analyzer for proper stemming. I can get going and find lots of information on the internets when I run into issues. That being said, the run-into-walls approach only works so long. Eventually you need to dive deep, like yesterday. I spent at least half the day understanding how custom query scoring works, how to avoid using TF-IDF on certain fields that are short, and looking at very complex 'explain' query chains. I went deep into the rabbit hole that day, my friends, but my search is so much better. Being able to find terms in a statistically relevant way just gets your foot in the door. Boosting blogs that are more recent, longer in length, and contain more pictures really makes a difference in the quality of your first few results.
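
One way to express that kind of boosting is an Elasticsearch `function_score` query, built here as a plain Python dict. This is a sketch, not the query actually in use: the field names (`body`, `published`, `word_count`, `image_count`), the one-year decay scale, and the choice of `sum`/`multiply` are all assumptions picked for illustration.

```python
import json

def build_query(text):
    """Search body that reweights text relevance by freshness, length, and photos."""
    return {
        "query": {
            "function_score": {
                # Plain full-text match supplies the base relevance score.
                "query": {"match": {"body": text}},
                "functions": [
                    # Newer posts score higher; the boost decays with age.
                    {"gauss": {"published": {"origin": "now", "scale": "365d"}}},
                    # Longer posts and posts with more pictures get a log-damped boost.
                    {"field_value_factor": {"field": "word_count", "modifier": "log1p"}},
                    {"field_value_factor": {"field": "image_count", "modifier": "log1p"}},
                ],
                # Add the three function values, then multiply by the text score.
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        }
    }

body = build_query("gorilla tracking")
payload = json.dumps(body)  # JSON string that would be sent as the search request body
```

The log damping matters: without it, one enormous post with hundreds of photos would outrank everything regardless of how well it matches the search terms.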

For hosting I originally started with AWS. Every major cloud platform has some sort of offer for startups. AWS really raises the bar for getting anything free out of them. Like for $1k/year, I'd need to take an online MIT course on entrepreneurship. No thanks. I bought the professor's book on audio, time better spent. Azure has a BizSpark program that lowers the bar to entry and is awesome for people like me who are bootstrapping an MVP. 2 weeks after I applied, I got in. I've got a basic Web App hosting a static single page that uses Angular and other JavaScript goodies to talk to the backend. I've got another Ubuntu VM running Elasticsearch (and an experimental Django API). That's it, and my shit is pretty fast, but I've only got like 4k blogs at the moment. It's hard to admit this, but I've been a Windows guy the majority of my career. I was a self-deprecating Windows user and I knew that Linux was a better platform for building most software (save C#); I'd just never taken the time to learn it. I'm a newish Mac user and now I've refreshed my memory on vi, I 'grep' shit, write bash scripts, and in general still have little idea what I'm doing. But the advantage of working in the command line becomes clear when you get the shortcuts down, especially installing software. pip install is my friend, but don't get me started on how I don't have a virtual environment for my Python libraries. Did you know there are 2 locations for libraries when using Python? It sucks.
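
The post doesn't say which 2 locations are meant. One plausible reading, and it is only a guess, is the interpreter-wide site-packages directory versus the per-user one. Python can report both:

```python
import site
import sysconfig

# Where packages land for this interpreter (or the active virtual environment).
env_packages = sysconfig.get_paths()["purelib"]

# Where "pip install --user" puts packages for the current user.
user_packages = site.getusersitepackages()

print(env_packages)
print(user_packages)
```

Inside a virtual environment the first path points into the environment itself, which is the main reason to use one: every project gets its own set of libraries.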

Right now I have a domain where you can go search narrative travel blogs in Africa. Eventually, I'll tell the world about it, but I have a few more big ticket items to address. That being said, I'm looking for early adopters! Free internet scouring if you are traveling to Africa!

Jun 22, 2015

Entrepreneurial Beginnings

It's been some time since my last post on quitting. Since then I've been endeavoring on my entrepreneurial journey, while admittedly soaking up summertime in Boulder. I wanted to recap, if for no other reason than to keep a running catalog of the course of events.

The first week after I quit, I jumped right into Boulder Startup Week. This event has come and gone with my attention for years, so my timing this year wasn't all that coincidental. I went to a number of activities and got to catch up with those I've known professionally at some point in my 8 year tenure living in Boulder. In general, TechStars and its attitude pervade the startup culture here. There are other players in town, but TechStars helped to pioneer the culture as it is. One thing I've felt personally and hear from others in the community is how Boulder is a 'give first' community. That's straight from Brad Feld and his notions on Startup Communities. The most helpful events were the TechStars mentor sessions and startup demo presentations. The Founder Stories were great too. These sessions reminded me of a book I read years back called Founders at Work. Super helpful for newbs like me. One thing I wish there were more of is sessions hosted by the bootstrapped. I've generally worked for bootstrapped software companies for the last 10 years, and there is something to be said for not taking money and making your own way. There was one excellent and informal session put on by SnapEngage. I learned some about Open Book Management and liked 3 things: they are super diligent about who they bring in, have a great operational model where workers can meaningfully contribute to company metrics, and everyone shares *equally* in the company's success. Seems like a great company to work for and something I'd hope to model my business operations after.

In the general search for more information on how to start my adventures, I'm keenly aware of the need for other cofounders. Not only does TechStars lean heavily on this, I've always enjoyed, as many do, the company you keep at work. Not to mention the numerous benefits of having someone else to do some lifting, execute where you are weak, and balance out each other's life/mental roller coaster. Finding a cofounder isn't easy, but nothing sends a message that you are serious like quitting a well paying job in the prime of your career. I'm lucky that I have a good background and narrative that seems to help when generating interest from others. I've been trying out CoFounder's Lab. Basically it's a social network for those looking for startup opportunities and building a team. It's great to meet others who can help give you feedback on your idea, network, or form partnerships. The forum of networking needs a lot of love, and the random meets don't produce much return. You've got to talk to 20 people to find something of interest. I believe firmly that these things need to happen organically over time. It's obviously much more advantageous to have a cofounder that you've worked with in the past or even someone that's in your network, but my list has been tested already. Networking and planting seeds now is key; you just need to have a long term mindset.

I've had to put my ego and introversion aside. I'm cold contacting lots of folks in the area and in the industry. I'd guess I have an 80% response rate so far, and persistence pays off for those high value players. Not knowing the industry I'm entering into, I give priority to those who know the industry or those who've started startups. It's important to know what the prevailing opinions are. You can't carve a niche out of an industry if you don't first know how it works. In the last 2 weeks a travel-focused incubator, TravelPort Labs, has opened up in Denver, which seems to have a lot of pieces that might help me. I met with them 2 weeks ago and I like that they are starting their own initiative for the first time like me, that I'd get a decent amount of attention, and that they have experienced mentors and helpful UX/UI resources. My reservations are whether we'd have conflicting operational approaches, the 1.25 hr drive each way, and whether I'd be limited to the travel industry in case the technology takes me elsewhere. In any case, I'll apply because I had a great conversation with 2 guys from the program.

Not knowing the industry and never having done market research, I was generally a bit disorganized about doing this on my own. But I found a few footholds with large market research firms, competitor analysis, and a gem of a site, Tnooz, which focuses on the latest trends in travel, especially startups. I have a sense of the challenges in the travel planning space, its size, and the behaviors of the target demographic. The space is so crowded and generally everybody is faced with the challenge of 1 or 2 engagements per year and a high CAC. There's been a number of times I've gone to a travel planning site and been discouraged at the fact that a) it exists and b) it seems well done from a UI standpoint. Then I test it and find the confidence to keep moving forward and carving out a niche. I'm the sort of entrepreneur that is a builder and has a personal need for the product; the question is whether others agree and are willing to pay for it. Nonetheless, it's at best a journey and a reasonable place to start.

So far I've taken a stab at the modern version of a business plan, the Lean Kanban board. I've identified my hypotheses and have strategies to get there. When you talk to the serial entrepreneurs around here it's "get out of the building", test, pivot, product/market fit. But in this "process", I think you need to have some core narrative or principles you are revolving around. For me it's technology, operational narratives, and an industry that you can give a shit about. I'd like to have all 3, but the buck will stop somewhere; that's what the hypotheses are for. From here on out when starting a business, you'll get tested on where to go and where to start. You need to really have your head on straight about who you are and why you are doing this.

That mostly sums up the activities of the first few weeks. Now I'm onto writing code and building a product, a subject for the next blog post.