Step 2: Growing the Data

Once I had a program that could generate a map of relationships between words in "Alice in Wonderland", I needed to run that on other books. And, to make the data as accurate as possible, I needed to run it on a LOT of other books. And where can you get the most free books online? (All out of copyright, of course!) Project Gutenberg!

Project Gutenberg is an online repository of thousands of out-of-copyright books (and other things) that anyone can access for free. Why pay for a copy of Sherlock Holmes when you can just download one? For my purposes, the fact that most of the books are 50+ years old was just fine.

But as I started running through this mountain of text, I came to a terrible realization: NLTK was slow, and I wasn't smart enough to optimize it. It was taking around 8 minutes to parse every megabyte of text. At that rate, getting even a portion of the way through Project Gutenberg (let alone more recent sources such as Wikipedia) would take more than 40 days. If I could have optimized the NLTK tagger, it might have gone faster, but since I couldn't, what was I left to do?
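The back-of-envelope math behind that 40-day figure looks like this (the corpus size here is my own rough assumption, not a number from the original measurements):

```python
# Rough sanity check of the single-machine estimate.
# corpus_mb is an assumed corpus size (~7.5 GB of plain text),
# chosen for illustration -- not a measured figure.

minutes_per_mb = 8       # observed NLTK parsing rate
corpus_mb = 7_500        # assumed amount of text to process

total_minutes = minutes_per_mb * corpus_mb
total_days = total_minutes / (60 * 24)

print(f"{total_days:.1f} days")  # ~41.7 days on one machine
```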

Go massively parallel.

This problem is what you might call "embarrassingly parallel." The parsing of one sentence doesn't affect any other sentence, and you can do them in any order. What you DO need to do is, at the end, consolidate all of these associations into a weighted graph or tree. Fortunately, that's mostly an arithmetic problem. This type of problem maps very well onto cluster programming: you can split all of the text processing into one phase, generating a list of words and their contexts, and then have a second phase where all of the usages of every word are tabulated together. (You also need a zeroth step where you split the data into chunkable sentences.)
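The two phases can be sketched without any cluster machinery at all. This is a toy version under my own assumptions: the "map" step just pairs each word with the other words in its sentence (a real run would use NLTK's tagging and a smarter notion of context), and the "reduce" step tabulates the pairs into weighted edges:

```python
from collections import Counter
from itertools import permutations

# Phase 1 (map): for each sentence, emit (word, context_word) pairs.
# Toy stand-in for the real NLTK-based extraction.
def map_sentence(sentence):
    words = sentence.lower().split()
    return list(permutations(words, 2))

# Phase 2 (reduce): tabulate all emitted pairs into a weighted
# edge list -- this is the "mostly arithmetic" consolidation step.
def reduce_pairs(all_pairs):
    graph = Counter()
    for pair in all_pairs:
        graph[pair] += 1
    return graph

sentences = ["the cat sat", "the cat ran"]   # output of the zeroth, sentence-splitting step
pairs = [p for s in sentences for p in map_sentence(s)]
graph = reduce_pairs(pairs)
print(graph[("the", "cat")])  # 2 -- "the" and "cat" co-occur in both sentences
```

Because the map step is independent per sentence and the reduce step is pure counting, the sentences can be scattered across any number of machines and the partial counts summed afterward.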

Through the use of Hadoop, an open source clustering system, plus StarCluster, an open source cluster management solution, I was able to launch (at one point) 85 servers to simultaneously cycle through this repository of knowledge. By making this process parallel, I could cut the time the work takes to almost 1/85th of the original. (It scales almost linearly.)

Of course, Amazon resources cost money, and this project is still not cost-effective. The normal cost of 85 nodes of "medium" strength on Amazon is around $30 per hour. Because I was using spot instances, I could get the cost down quite a bit lower -- but this type and volume of data just takes a while. Because the time scales linearly, it would be just as expensive to run it on half as many nodes for twice the amount of time, so I felt good about launching this at scale.
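Why the cluster size doesn't change the bill: under linear scaling, the total node-hours for a job are fixed, and the bill is just node-hours times the per-node rate. A quick sketch using the rough $30/hour figure for 85 nodes (these are the post's ballpark numbers, not exact AWS prices):

```python
# Node-hours are constant under linear scaling, so cost is too.
cluster_cost_per_hour = 30.0   # ~85 "medium" nodes, rough on-demand rate
nodes = 85
per_node_per_hour = cluster_cost_per_hour / nodes

def job_cost(node_count, hours):
    """Total bill for running node_count nodes for hours."""
    return node_count * hours * per_node_per_hour

# A hypothetical 10-hour job on the full cluster costs the same as
# the equivalent 20-hour job on half the cluster.
print(f"${job_cost(85, 10):.2f}")    # $300.00
print(f"${job_cost(42.5, 20):.2f}")  # $300.00
```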

But, there were challenges...