Step 1: Figuring Out The Lingo

For a computer program, figuring out the English language is tricky. The trouble starts with even simple questions like "How do I break this paragraph into sentences?" A naive implementation might say that a sentence is anything that ends in a period, an exclamation point, or a question mark. But English isn't that simple: we use periods for abbreviations, and question marks and exclamation points can mean other things in some types of technical text.
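To make the problem concrete, here is a minimal sketch of that naive approach. The function name `naive_split` and the sample text are made up for illustration; this is not code from the project.

```python
import re

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Hypothetical sample: a human reader sees three sentences here.
sample = "Dr. Smith arrived at 5 p.m. sharp. Was he late? No!"

for piece in naive_split(sample):
    print(repr(piece))
```

The naive splitter returns five pieces instead of three, because it treats the periods after "Dr." and "p.m." as sentence endings.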

But that's only the beginning of the challenge. The real fun comes when you want to figure out what the structure of a sentence is. For example, in the sentence "It was a dark and stormy night." we know that "it" is a third-person neuter pronoun, "was" is the past-tense form of the verb "to be", "a" is an article, "dark" and "stormy" are adjectives, "and" is a conjunction, and finally, "night" is a noun. We know this (I hope!) because we can read and parse English without even thinking about it. But getting a computer algorithm to do the same is very complex.

This is where NLTK, the Natural Language Toolkit, comes in. This is a Python library that has many features. Two of the features that I needed for this project were the ability to split texts into sentences, while doing intelligent things with punctuation, and the ability to guess the parts of speech of the words in a sentence. I'm saying "guess" because NLTK doesn't speak English either. It uses complex probability systems and pre-tagged texts to "learn" what a sentence is and what different types of words are, in roughly the same way that a spam filter "learns" what email you want and what email you don't want. And since you know how well your spam filter works, you should be able to guess how well NLTK works. But with these tools, and a little of my own engineering, I was able to build a system that took a novel or other work, divided it into sentences, and then tagged those sentences with parts of speech.

From there, I went off on my own and added heuristics to figure out even more information about a sentence. It wasn't enough to know that "dark" was an adjective and that "night" was a noun; I wanted to know that "dark" referred to "night". I did this by adding many of my own rules. For example, adjectives and adverbs can be chained (separated by commas or conjunctions), but while adverbs can come either before or after a verb ("The boy ran quickly." is roughly the same as "The boy quickly ran."), adjectives always come before their nouns.
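As a rough illustration of the adjective rule, here is a small sketch. It is not the project's actual code: the function name `link_adjectives` is hypothetical, and the input is tagged by hand, using Penn Treebank tags in the same list-of-(word, tag)-pairs shape that NLTK's `pos_tag` returns.

```python
def link_adjectives(tagged):
    """Attach each chain of adjectives to the first noun that follows it.

    `tagged` is a list of (word, tag) pairs with Penn Treebank tags:
    JJ* = adjective, NN* = noun, CC = conjunction, "," = comma.
    """
    links = []    # (adjective, noun) pairs found so far
    pending = []  # adjectives still waiting for their noun
    for word, tag in tagged:
        if tag.startswith("JJ"):
            pending.append(word)
        elif tag.startswith("NN"):
            links.extend((adj, word) for adj in pending)
            pending = []
        elif tag in ("CC", ",") and pending:
            continue  # a conjunction or comma keeps the chain alive
        else:
            pending = []  # anything else breaks the chain
    return links

# Hand-tagged for illustration; a real tagger's output may differ.
sentence = [("It", "PRP"), ("was", "VBD"), ("a", "DT"), ("dark", "JJ"),
            ("and", "CC"), ("stormy", "JJ"), ("night", "NN"), (".", ".")]

print(link_adjectives(sentence))
```

With this input, both "dark" and "stormy" end up linked to "night". A fuller version would need a similar pass for adverbs, looking on both sides of the verb.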

From there, a quick tag job of "Alice in Wonderland" was easy! But a small book doesn't represent the whole of the English language, and in order to come to real conclusions I needed to think bigger.

Continue on! How we made the data set large...