Nov 28, 2009

Adventures in Information Extraction (Part I)

I'm currently embarking on a personal project to parse and index content from certain specific blogs.  Before this adventure, my knowledge of the relevant body of information science was merely a term, Natural Language Processing (NLP), that I had no understanding of, and still don't.  Now that I've done some google-jerking, and with a bit of helpful clarification from a friend of mine enrolled grad school, I've hopefully started to define my initial effort into some more specific technologies and fields of study.  Nonetheless, alot of this will change as time goes on.

First let me describe, in general, my problem set and let me preface by saying, this is currently my 10,000 ft view of the problem approach.  I wish to target specific blogs and extract from them strictly blog content, minus the useless parts of the page (navigation, footer, header, ect).  Now, I hope to initially target two sites that contain lots of other blogs.  In doing this, I should be able to extract a general template (a common body for example) that contains the content of interest.  With content in hand, I would like to store this stuff in an index for quick retrieval using Lucene.Net.  Now what I am going to extract remains a bit muddy at the moment, as I try to explore the relationship between certain Natural Language Processing techniques and Lucene.  For instance, Lucene has the ability to tokenize (basically, find words), perform stemming (map variations such as 'traveled', 'traveling', and 'travel' to the root 'travel'), and filter stopwords ('the', 'a').  These things seem foundational to search in general, but what I really would like to index specifically are place names like 'The Taj Mahal', 'Pismo Beach', or 'Moe's Tavern'.  My research has pointed me to Named Entity Extraction, a subfield of Information Extraction.  Now the field of  IE starts to get a bit murky for me, and this theme will perpetuate as I go further down the rabbit hole.  I first started to try and find a good api and accompanying tutorial on these things.  Many roads pointed to NLTK and it's quite descriptive, and free, online book.  The library is built with Python, and I've traditionally been brewing C# and .NET for awhile.  But the idea of  learning a new language is welcoming to me.  My initial opinion of Python, is that its much easier to learn and use than a regimented procedural language like C# (though .NET has been making impressive headway into functional programming with F#).  Additionally, the syntax and libraries are more adept at parsing text.  The plan so far is to work my way through the online book to Chapter 7, which goes into exactly what I'm looking for.  Anyway, all this stuff is like drinking from a fire hose: new language, two new api's, and new subject matter, but I'm hoping I'll be able to put together a prototype in my spare time.  So far, I've been able to do some fun things with Python using Frequency Distribution (how many times a word appears in text), finding a word in context (collocation), and filter out some proper nouns using POS Tagging, a small step closer to getting my place names.  So for now, I'll keep inching towards Named Entity Extraction.  Once I have a working prototype for getting place names, I can circle back on some of the other things like feeding content into a simple Lucene index and figuring out the relationship between NLTK and Lucene.  Until I have some automatic extraction methods, I'm using Dapper to get me some data to play with.

It seems funny writing all this up, because tomorrow the direction will change.  But I'll stay flexible, and hopefully writing about my experience with help others avoid my stupid mistakes.

Nov 16, 2009

NHibernate 2.1 Performance Issues with Flush and LinFu

We recently discovered our nightly processes were taking much longer (50 - 80%) and wondered why. My focus was on the upgrade from straight to 2.1.1. First thing I tried was to change the LinFu Proxy Factory to Castle. This had a surprising boost. For this particular operation some object properties were being accessed 100,000's of times. Redgate Profiler showed that, in aggregate, proxied object access can add up. 3300 objects loaded, with 3 many-to-one objects eagerly loaded, had some some properties accessed 350,000 times. One particular property being accessed took roughly 2 seconds over the life of the process, and some of the others took roughly a second. All adding up to roughly 12 seconds on a 100 second process. Whoa. Keep in mind that's with Redgate's deepest profiling on. Switching to Castle seemed to cut that to 5 seconds. Anyone know what's going on here?

But the big surprise of the day was what happened with flushing from 2.0.1 to 2.1+. A basic flush on this process on 2.0.1 took .42 seconds, but on 2.1.1 it took .79 seconds. Really? What the hell happened in between those versions? Just to be sure I checked 2.1 and 2.0.1, both were the same. I wonder if abstracting dynamic proxying from NH had some unintended consequences.

Needless to say we are back on 2.0.1 for the time being. This process that was taking 1 min and 11 seconds now takes 47 seconds. Why? A couple of reasons: first the process calls embedded business logic many times that ends up doing "safety" flushing when it's a mostly readonly session.  The next refactor is to fix that problem, but in all the flush increase difference accounted for 10 - 15 seconds.  Second, was the proxies which accounted for the remainder of the difference. I didn't do anymore benchmarks on 2.0.1 vs Castle on 2.1.1, but I'm sure there are differences there from all the testing that was done. In either case there's not anything in 2.1+ that's worth these performance tradeoffs we experienced. NH.Linq seemed like it would have been a good move, but it's not yet ready for production use. So for now I'll take the perf benefits of 2.0.1 for the minor upgrades of 2.1+.