Jul 24, 2015

Wanderight, a technical journey

While I've been vague about my idea and project, things are coming into focus. In this time of entrepreneurial exploration I've had time to get reacquainted with developing software. Every technology I've employed so far is new to me: operating system, cloud hosting platform, programming language, text-based search platform... But like most other software devs I enjoy learning new things. The pragmatist in me knows that many of these tools are the best for the job. Python is good at manipulating text, analyzing data, and building things quickly. Elasticsearch is great at encapsulating the complexity and learning curve of running Lucene or Solr. Linux is the best platform to run and operate Python and Elasticsearch. Azure is the best platform for giving me free hosting, which is awesome for keeping my burn rate low.

I'd like to focus on the technology aspects of what I'm working on, from the perspective of a beginner on many fronts. But I also believe that I'll be able to do some things with technologies that I barely understand as of yet. That's where the fun comes in. Without explaining the product and the opportunity/market...yada yada, I'll focus on the operational learnings and a bit about the motive for the focus.

So first, a bit of a primer: I believe that many textual sources have a wealth of information that is mostly untapped. If you think Google News can aggregate well and Google Search can find things well, I'd hope to do something similar with travel. As a person who has traveled and explored in over a dozen countries, I hate the over-sponsored, highly-advertised world of online travel. There are people out there giving authentic information on travel blogs, forums... you just need to tease it out. Make the content easy to find and relatable to the end user, and be diligent about spotting and identifying that which is 'sponsored'. The easiest way I could think to start is with narrative travel blogs. Which brings me to my tech: parse, clean, and analyze travel blogs. What I eventually want to have is a site that will serve up authentic travel information to the traveler planning an international trip. As far as I can tell, the work falls into 3 distinct categories.

  1. Build a 'basic' search engine based on a semi-curated content feed. By semi-curated I mean that I'll only be putting in content that I presume isn't sponsored. Some of that authentic content is apparent, some of it is tricky, so start with the obvious.
  2. Discover your dataset. Yay data science. Understand key characteristics about it. Get a sense for what is 'good' and define those as hypotheses in your product journey. Clean your dataset. Don't even allow things in that are in a foreign language, aren't of a certain content length, or aren't based in the region of focus. Not to mention the cleaning and data janitorial work that needs to be done before it can be searched. Luckily Elasticsearch does a lot of that lifting for you: stemming, stopwords... Obviously I need to pick out the main content, key in on metadata, and clean HTML while parsing (covered in step 1).
  3. Generally speaking, find more intelligent ways to serve content up to the user. Things like region discovery through topic analysis. Or categorization to filter content for a traveler archetype. Summarize long posts to make searching faster. Extract places to enhance your search and discovery capability. Recommend blogs to read based on blogs of interest. I believe that information extraction, natural language processing, and machine learning have become much more accessible to the non-academic. That being said, this step isn't plug and play, but there is a wealth of books, platforms, and sample code to get this autodidact well on my way. If nothing else, this is waaaaay better than going back to school for a degree in what I can learn with the power of the internets.
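
The cleaning gate described in step 2 could look roughly like the sketch below. It is only an illustration, not the actual Wanderight code: the thresholds, the stopword list, the country list, and the function names are all made up for the example, and the "is it English?" check is a crude stopword-ratio heuristic rather than real language detection.

```python
import re

# Hypothetical thresholds and word lists; tune them against your own data.
MIN_WORDS = 150
MIN_ENGLISH_RATIO = 0.15
ENGLISH_STOPWORDS = {
    "the", "and", "of", "to", "a", "in", "is", "it", "that", "was",
    "for", "on", "with", "as", "we", "i", "at", "this", "but", "our",
}
FOCUS_COUNTRIES = {"kenya", "tanzania", "namibia", "morocco"}  # example region list


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def looks_english(tokens):
    """Crude language check: share of tokens that are common English stopwords."""
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in ENGLISH_STOPWORDS)
    return hits / len(tokens) >= MIN_ENGLISH_RATIO


def keep_post(text):
    """Return True if a post is long enough, looks English, and mentions the region."""
    tokens = tokenize(text)
    if len(tokens) < MIN_WORDS:
        return False
    if not looks_english(tokens):
        return False
    return any(t in FOCUS_COUNTRIES for t in tokens)
```

Running every scraped post through something like `keep_post` before indexing keeps the obvious junk out of the search engine, and the cheap checks (length first) run before the slightly more expensive ones.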

Now the details and my reflection 6 weeks in with so much new technology. To address step 1, I chose 2 key technologies: Python and Elasticsearch. Python for its text parsing prowess and Elasticsearch for the heavy lifting of Lucene. BeautifulSoup is an excellent library for finding information in blogs and scraping out what you want. Learning Python was fairly straightforward. The 2 languages I'm most familiar with are C# and JavaScript. Python is a lot more like JavaScript.

The key difference building software this time around is that I have a veteran (read: cantankerous) view on technology. Everything in its place. I used to get wrapped around the axle with proper coding technique, architecture, scalability... but that doesn't have much place here. Those things matter most when you have different problems to solve than trying to see if other people give a shit about what you're building. This graphic is telling of the different development mindsets. My nature is to be a 'Settler', but I've longed to be a 'Pioneer'. Which basically means now I'm a giant fucking hack. You could drive a Mack truck through my hacks. It's fun and awesome. But I think the key difference is that (and this is a big *if*) I would know how to scale things when the time comes. Whether I'd need to burn it to the ground and rewrite or build up what's there, I can figure it out because I have in the past. That's a comfort that not many others have and I consider it a key personal asset.
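
To give a feel for the parsing side of step 1, here is a minimal sketch of pulling the main text and an image count out of a blog page. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so it runs without installing anything; with BeautifulSoup the same idea is usually a few lines shorter. The class and function names are invented for the example, and it assumes the post body lives inside an `<article>` tag, which plenty of real blogs don't do.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect visible text inside <article>, skipping script/style blocks."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.article_depth = 0   # > 0 while inside an <article> element
        self.skip_depth = 0      # > 0 while inside <script> or <style>
        self.image_count = 0     # images found inside the article
        self.chunks = []         # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "img" and self.article_depth:
            self.image_count += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside the article and outside script/style.
        if self.article_depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_post(html):
    """Return the post's main text and the number of images it contains."""
    parser = PostTextExtractor()
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.chunks), "image_count": parser.image_count}
```

The navigation, footer, and inline scripts fall away because they sit outside `<article>` or inside a skipped tag; the image count is the kind of metadata that can later feed search ranking.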

A long time ago I worked with Lucene. Like 10 years ago. What Elasticsearch has done for text search since that time is awesome. Getting going out of the box is easy. The tutorials and the amount of Stack Overflow help make adoption so much less intimidating than trying to learn what TF-IDF is when you just want to get shit working. Get it working first, dive deep later. Elasticsearch is awesome at that. I don't need to care that it runs on 5 shards or that I need to use a Snowball analyzer for proper stemming. I can get going and find lots of information on the internets when I run into issues. That being said, the run-into-walls approach only works so long. Eventually you need to dive deep, like I did yesterday. I spent at least half the day understanding how custom query scoring works, how to avoid using TF-IDF on certain fields that are short, and looking at very complex 'explain' query chains. I went deep into the rabbit hole that day my friends, but my search is so much better. Being able to find terms in a statistically relevant way just gets your foot in the door. Boosting blogs that are more recent, longer in length, and contain more pictures really makes a difference in the quality of your first few results.
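
For the curious, the kind of boosting described above can be expressed with Elasticsearch's `function_score` query. The sketch below only builds the request body as a plain Python dict, so it runs without a cluster. The field names (`title`, `body`, `published`, `word_count`, `image_count`), the one-year decay scale, and the helper name are assumptions for illustration, not the actual mapping or weights used here.

```python
import json


def build_search_body(text, size=10):
    """Build a function_score query blending text relevance with post quality signals."""
    return {
        "size": size,
        "query": {
            "function_score": {
                # Plain relevance match on the (hypothetical) title and body fields.
                "query": {
                    "multi_match": {
                        "query": text,
                        "fields": ["title^2", "body"],
                    }
                },
                "functions": [
                    # Newer posts score higher; for date fields the origin defaults to now.
                    {"gauss": {"published": {"scale": "365d", "decay": 0.5}}},
                    # Longer posts and posts with more pictures get a dampened bonus.
                    {"field_value_factor": {"field": "word_count", "modifier": "log1p"}},
                    {"field_value_factor": {"field": "image_count", "modifier": "log1p"}},
                ],
                # Add the three signals together, then multiply into the text score.
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("gorilla trekking"), indent=2))
```

Summing the functions rather than multiplying them is a deliberate choice in this sketch: a post with zero pictures contributes a zero from `log1p`, and multiplying would wipe out its whole score. Adding `"explain": true` to the request is how you get the scoring breakdown to check what each piece contributes.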

For hosting I originally started with AWS. Every major cloud platform has some sort of offer for startups. AWS really raises the bar for getting anything free out of them. Like for $1k/year, I'd need to take an online MIT course on entrepreneurship. No thanks. I bought the professor's book on audio, time better spent. Azure has a BizSpark program that lowers the bar to entry and is awesome for people like me who are bootstrapping an MVP. 2 weeks after I applied, I got in. I've got a basic Web App hosting a static single page that uses Angular and other JavaScript goodies to talk to the backend. I've also got an Ubuntu VM running Elasticsearch (and an experimental Django API). That's it, and my shit is pretty fast, but I've only got like 4k blogs at the moment.

It's hard to admit this, but I've been a Windows guy the majority of my career. I was a self-deprecating Windows user and I knew that Linux was a better platform for building most software (save C#), I'd just never taken the time to learn it. I'm a newish Mac user and now I've refreshed my memory on vi, I 'grep' shit, write bash scripts, and in general still have little idea what I'm doing. But the advantage of working in the command line becomes clear once you get the shortcuts down, especially for installing software. pip install is my friend, but don't get me started on how I don't have a virtual environment for my Python libraries. Did you know there are 2 locations for libraries when using Python? It sucks.
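
On the "2 locations" gripe: the post doesn't say which two, so this is a guess that it means the system-wide packages directory versus the per-user one. A small standard-library script like the sketch below (the function name is made up) shows which interpreter is running, whether it lives in a virtual environment, and where it will look for installed packages.

```python
import site
import sys


def describe_python_environment():
    """Report which interpreter is running and where it looks for installed packages."""
    # sys.base_prefix differs from sys.prefix inside a venv; older virtualenv
    # releases set sys.real_prefix instead, so check for both.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    in_virtualenv = hasattr(sys, "real_prefix") or sys.prefix != base_prefix
    return {
        "executable": sys.executable,
        "in_virtualenv": in_virtualenv,
        "user_site": site.getusersitepackages(),  # where `pip install --user` puts things
        "search_path": list(sys.path),            # every directory searched on import
    }


if __name__ == "__main__":
    for key, value in describe_python_environment().items():
        print(key, "=", value)
```

Creating a virtual environment per project (for example with `python -m venv` on Python 3.3 or later) sidesteps most of the confusion, since `pip install` then writes to a single project-local directory.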

Right now I have a domain where you can go search narrative travel blogs in Africa. Eventually, I'll tell the world about it, but I have a few more big-ticket items to address. That being said, I'm looking for early adopters! Free internet scouring if you are traveling to Africa!