Nov 28, 2009

Adventures in Information Extraction (Part I)

I'm currently embarking on a personal project to parse and index content from certain specific blogs.  Before this adventure, my knowledge of the relevant body of information science amounted to a single term, Natural Language Processing (NLP), that I didn't really understand, and still don't.  Now that I've done some googling, and with a bit of helpful clarification from a friend of mine enrolled in grad school, I've hopefully started to narrow my initial effort down to some more specific technologies and fields of study.  Nonetheless, a lot of this will change as time goes on.

First, let me describe my problem in general terms, and let me preface by saying this is currently my 10,000 ft view of the approach.  I want to target specific blogs and extract from them strictly the blog content, minus the useless parts of the page (navigation, footer, header, etc.).  Initially, I hope to target two sites that each host lots of other blogs.  That way, I should be able to extract a general template (a common body element, for example) that contains the content of interest.

With content in hand, I'd like to store it in an index for quick retrieval using Lucene.Net.  Exactly what I'm going to extract remains a bit muddy at the moment, as I try to explore the relationship between certain Natural Language Processing techniques and Lucene.  For instance, Lucene has the ability to tokenize (basically, find words), perform stemming (map variations such as 'traveled' and 'traveling' to the root 'travel'), and filter stopwords ('the', 'a').  These things seem foundational to search in general, but what I really want to index are place names like 'The Taj Mahal', 'Pismo Beach', or 'Moe's Tavern'.  My research has pointed me to Named Entity Extraction, a subfield of Information Extraction.  The field of IE starts to get a bit murky for me here, and that theme will continue as I go further down the rabbit hole.

I first tried to find a good API and an accompanying tutorial for these things.  Many roads pointed to NLTK and its thorough, and free, online book.  The library is built with Python, and I've traditionally been brewing C# and .NET for a while, but the idea of learning a new language is welcome.  My initial opinion of Python is that it's much easier to learn and use than a more regimented language like C# (though .NET has been making impressive headway into functional programming with F#).  Additionally, its syntax and libraries are more adept at parsing text.  The plan so far is to work my way through the online book to Chapter 7, which goes into exactly what I'm looking for.

All this stuff is like drinking from a fire hose: a new language, two new APIs, and new subject matter, but I'm hoping I'll be able to put together a prototype in my spare time.  So far, I've been able to do some fun things with Python: building a frequency distribution (how many times a word appears in a text), finding a word in context (a concordance), and picking out proper nouns using POS tagging, a small step closer to getting my place names.  For now, I'll keep inching toward Named Entity Extraction.  Once I have a working prototype for getting place names, I can circle back on some of the other things, like feeding content into a simple Lucene index and figuring out the relationship between NLTK and Lucene.  Until I have some automatic extraction methods, I'm using Dapper to get me some data to play with.  Some rough sketches of the pieces described above follow.
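To make the "keep the post, strip the chrome" idea concrete, here's a minimal sketch of what the extraction step could look like.  It assumes the target blogs wrap their post bodies in some common element; the 'post-body' class name is purely a placeholder for whatever the real templates turn out to use, and BeautifulSoup stands in for whichever HTML parser I actually land on (for now, Dapper is doing this job for me).

```python
# Rough sketch: fetch a page and keep only the post body.
# Assumes a shared template element; 'post-body' is hypothetical.
import urllib.request
from bs4 import BeautifulSoup

def extract_post_text(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Throw away the chrome we don't care about.
    for tag in soup(['script', 'style', 'nav', 'header', 'footer']):
        tag.decompose()
    body = soup.find('div', class_='post-body')  # hypothetical template hook
    return body.get_text(' ', strip=True) if body else None
```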
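The tokenize/stem/stopword trio is easy to see in miniature.  This isn't Lucene itself, just the same three steps reproduced with NLTK so I can poke at each stage; it assumes the 'punkt' and 'stopwords' data sets have been fetched with nltk.download().

```python
import nltk
from nltk.corpus import stopwords

text = "I traveled to Pismo Beach, and I will be traveling there again."
tokens = nltk.word_tokenize(text)           # tokenize: find the words
stop = set(stopwords.words('english'))      # stopwords: 'the', 'a', 'I', ...
content = [t for t in tokens if t.isalpha() and t.lower() not in stop]
stemmer = nltk.PorterStemmer()              # stem: 'traveled'/'traveling' -> 'travel'
print([stemmer.stem(t) for t in content])
```

Lucene's analyzers bundle these same steps, so whatever I do on the NLTK side will eventually have to agree with how the index analyzes text.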
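And here, roughly, are the toys I've been playing with from the early chapters of the book.  The 'post.txt' file is just a stand-in for whatever blog text I've scraped, and the POS tagger model also comes via nltk.download().

```python
import nltk

tokens = nltk.word_tokenize(open('post.txt').read())  # 'post.txt' is a stand-in

# Frequency distribution: how many times each word appears.
fd = nltk.FreqDist(t.lower() for t in tokens if t.isalpha())
print(fd.most_common(10))

# A word in its surrounding context (a concordance).
nltk.Text(tokens).concordance('beach')

# POS tagging, then keep just the proper nouns (NNP/NNPS).
proper = [w for w, tag in nltk.pos_tag(tokens) if tag in ('NNP', 'NNPS')]
print(proper)
```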
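Finally, a sketch of where I'm inching: NLTK ships an off-the-shelf named entity chunker.  I haven't wired this into anything yet, so treat it as a guess at the shape of the solution; the labels ('GPE', 'LOCATION', 'FACILITY') come from the model NLTK bundles, which is also fetched with nltk.download().

```python
import nltk

sentence = "We drove past the Taj Mahal on the way to Pismo Beach."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Pull out the subtrees the chunker tagged as place-like entities.
places = [' '.join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees()
          if subtree.label() in ('GPE', 'LOCATION', 'FACILITY')]
print(places)   # candidate place names
```

If that pans out, the strings it returns are exactly the sort of thing I'd feed into a Lucene field.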

It seems funny writing all this up, because tomorrow the direction will change.  But I'll stay flexible, and hopefully writing about my experience will help others avoid my stupid mistakes.
