While I've been vague about my idea and project, things are coming into focus. In this time of entrepreneurial exploration I've had time to get reacquainted with developing software. Every technology I've employed so far is new to me: operating system, cloud hosting platform, programming language, text-based search platform... But like most other software devs I enjoy learning new things. The pragmatist in me knows that many of these tools are the best for the job. Python is good at manipulating text, analyzing data, and building things quickly. Elasticsearch is great at hiding the complexity and learning curve of running Lucene or Solr yourself. Linux is the best platform to run and operate Python and Elasticsearch. Azure is the best platform for giving me free hosting, which is awesome for keeping my burn rate low.
I'd like to focus on the technology aspects of what I'm working on from the perspective of a beginner on many fronts. But I also believe that I'll be able to do some things with technologies that I barely understand as of yet. That's where the fun comes in. Without explaining the product and the opportunity/market...yada yada, I'll focus on the operational learnings and a bit about the motive for the focus.
So first, a bit of a primer: I believe that many textual sources have a wealth of information that is mostly untapped. If you think Google News can aggregate well and Google Search can find things well, I'd hope to do something similar with travel. As a person who's traveled and explored in over a dozen countries, I hate the over-sponsored, highly-advertised world of online travel. There are people out there giving authentic information on travel blogs, forums... you just need to tease it out. Make the content easy to find and relatable to the end user, and be diligent about spotting and identifying that which is 'sponsored'. The easiest way I could think to start is with narrative travel blogs.
Which brings me to my tech: parse, clean, and analyze travel blogs. What I eventually want to have is a site that will serve up authentic travel information to the traveler planning an international trip. As far as I can tell, the work falls into 3 distinct categories.
- Build a 'basic' search engine based on a semi-curated content feed. By semi-curated I mean that I'll only be putting in content that I presume isn't sponsored. Some of that authentic content is apparent, some of it is tricky, so start with the obvious.
- Discover your data set. Yay data science. Understand key characteristics about it. Get a sense for what is 'good' and define those as hypotheses in your product journey. Clean your dataset. Don't even allow in posts that are in a foreign language, fall short of a certain content length, or aren't based in the region of focus. Not to mention the cleaning and data janitorial work that needs to be done before it can be searched. Luckily Elasticsearch does a lot of that lifting for you: stemming, stopwords... Obviously I need to pick out the main content, key in on metadata, and clean HTML while parsing (covered in step 1).
- Generally speaking, find more intelligent ways to serve content up to the user. Things like region discovery through topic analysis. Or categorization to filter content for a traveler archetype. Summarize long posts to make searching faster. Extract places to enhance your search and discovery capability. Recommend blogs to read based on blogs of interest. I believe that information extraction, natural language processing, and machine learning have become much more accessible to the non-academic. That being said, this step isn't plug and play, but there is a wealth of books, platforms, and sample code to get this autodidact well on my way. If nothing else, this is waaaaay better than going back to school for a degree in what I can learn with the power of the internets.
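To make the cleaning step in the second bullet a bit more concrete, here is a minimal sketch using only the Python standard library. It is not my actual pipeline: the function names, the 300-word threshold, and the region keyword list are all assumptions for illustration, and it leaves out language detection entirely, which would need a dedicated library.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped tag.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def keep_post(html, region_terms, min_words=300):
    """Keep a post only if it is long enough and mentions the region.

    min_words and region_terms are placeholder assumptions; tune them
    against your own data set.
    """
    text = html_to_text(html).lower()
    if len(text.split()) < min_words:
        return False
    return any(term.lower() in text for term in region_terms)
```

Something like `keep_post(raw_html, ["kenya", "tanzania"])` would then act as a gate in front of indexing. A plain keyword match is a crude stand-in for real region detection, but it is enough to start with the obvious.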
A long time ago I worked with Lucene. Like 10 years ago. What Elasticsearch has done for text search since that time is awesome. Getting going out of the box is easy. The tutorials and the amount of Stack Overflow help make adoption so much less intimidating than trying to learn what TF-IDF is when you just want to get shit working. Get it working first, dive deep later. Elasticsearch is awesome at that. I don't need to care that it runs on 5 shards or that I need to use a Snowball analyzer for proper stemming. I can get going and find lots of information on the internets when I run into issues. That being said, the run-into-walls approach only works for so long. Eventually you need to dive deep, like I did yesterday. I spent at least half the day understanding how custom query scoring works, how to avoid using TF-IDF on certain fields that are short, and looking at very complex 'explain' query chains. I went deep into the rabbit hole that day, my friends, but my search is so much better. Being able to find terms in a statistically relevant way just gets your foot in the door. Boosting blogs that are more recent, longer in length, and contain more pictures really makes a difference in the quality of your first few results.
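For the curious, boosting along those lines can be expressed with Elasticsearch's `function_score` query. What follows is a rough sketch of the shape of such a request body, not the query I actually run: the field names (`title`, `body`, `published`, `word_count`, `image_count`) and all the numbers are hypothetical, and the body is built as a plain Python dict so it doesn't depend on any particular client library.

```python
def build_search_body(user_query):
    """Build a function_score request body (field names are assumptions)."""
    return {
        "query": {
            "function_score": {
                # The base text match that produces the relevance score.
                "query": {
                    "multi_match": {
                        "query": user_query,
                        "fields": ["title", "body"],
                    }
                },
                "functions": [
                    # Favor recent posts: the score decays as the
                    # (hypothetical) 'published' date gets older.
                    {
                        "gauss": {
                            "published": {
                                "origin": "now",
                                "scale": "365d",
                                "decay": 0.5,
                            }
                        }
                    },
                    # Favor longer posts, dampened with log1p.
                    {
                        "field_value_factor": {
                            "field": "word_count",
                            "modifier": "log1p",
                            "missing": 0,
                        }
                    },
                    # Favor posts with more pictures, dampened the same way.
                    {
                        "field_value_factor": {
                            "field": "image_count",
                            "modifier": "log1p",
                            "missing": 0,
                        }
                    },
                ],
                # Add the three function values together, then multiply
                # the text relevance score by that sum.
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        }
    }
```

Summing the functions rather than multiplying them means a post with zero pictures isn't wiped out of the results entirely. Expect to tweak the scale and modifiers a lot, with the 'explain' output open in another window.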
Right now I have a domain where you can go search narrative travel blogs about Africa. Eventually, I'll tell the world about it, but I have a few more big-ticket items to address. That being said, I'm looking for early adopters! Free internet scouring if you are traveling to Africa!