In the end, I was never able to run the full Project Gutenberg corpus. With more time and more money, I could have done it, but I exceeded what I was willing to spend on the class. What stopped me?

I call them "Sentences of Doom".

Somewhere in the portion of the Project Gutenberg corpus I was running (I had narrowed it down to around 6% of the text) there was a sentence, or sentences, that caused my program to explode in a particularly nasty way, without warning and without leaving anything in the log. The crash was absolutely reproducible, but even 6% of the text would have taken around three days on a single machine to track down, and I didn't have the time to do it. I still want to know what that sentence is and defeat the bug, but that will have to wait for another day.
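For what it's worth, one way I could attack this later is bisection: instead of scanning sentences linearly, repeatedly run the parser on half the remaining batch and keep whichever half still crashes, which needs only about log₂(N) runs. This is just a sketch under assumptions not in the original, with a hypothetical `crashes` predicate standing in for actually running the real parser on a batch:

```python
# Hypothetical sketch: bisect a corpus to isolate a single "Sentence of Doom".
# Assumes exactly one culprit sentence, and that `crashes(batch)` returns True
# if and only if the culprit is somewhere in `batch`. In practice, `crashes`
# would invoke the real parser on the batch (e.g. in a subprocess) and report
# whether it died.

def find_culprit(sentences, crashes):
    """Binary-search `sentences` for the one that makes `crashes` return True."""
    lo, hi = 0, len(sentences)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(sentences[lo:mid]):
            hi = mid            # culprit is in the first half
        else:
            lo = mid            # otherwise it must be in the second half
    return sentences[lo]

# Toy demonstration: pretend any sentence containing "DOOM" kills the parser.
corpus = ["a fine sentence", "also fine", "the DOOM sentence", "harmless"]
culprit = find_culprit(corpus, lambda batch: any("DOOM" in s for s in batch))
print(culprit)  # → the DOOM sentence
```

With a real corpus, each `crashes` call costs one parser run over half the current batch, so the total work is roughly two full passes over the 6% slice rather than a sentence-by-sentence scan.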

But even without all of the data parsed, I was still able to run this on an exceptionally large data set and draw conclusions (and graphs) from the data.

And what I have now...