README for Joe Pranevivich's CS264 (CSCI E-292) Final Project
Copyright 2010, Joe Pranevich
"Graphing the English Language (For Fun And Profit)"
Part 1: Describe Code and Application Files
-------------------------------------------
This directory contains the files necessary to run the application.
* starcluster.sample.config
This sample config file should be used with your StarCluster
installation to correctly configure NLTK.
* starcluster_plugins/nltk.py
Plugin for StarCluster (written by me - very basic) that will
install the relevant libraries and data files for NLTK
usage. Place in your "plugins" folder.
* sentences.py
The first stage sentence tokenizer. Written using the NLTK
libraries, it uses the Punkt tokenizer to divide an arbitrary
text into sentences while handling abbreviations and similar
edge cases intelligently. This is required because the base
unit of the other NLTK tokenizers is a sentence.
./sentences.py <data file> > data.sentences
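The shape of this stage can be sketched in plain Python. Note this is only a naive stand-in so the example runs without NLTK installed; the actual sentences.py uses NLTK's trained Punkt model, which additionally learns abbreviations ("Dr.", "e.g.") so it does not split on them:

```python
import re

def split_sentences(text):
    """Naive stand-in for NLTK's Punkt tokenizer: split after
    sentence-ending punctuation that is followed by whitespace
    and a capital letter. Unlike Punkt, this will wrongly split
    after abbreviations such as "Dr."."""
    # Collapse whitespace so sentences broken across lines are rejoined.
    text = re.sub(r'\s+', ' ', text).strip()
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

for s in split_sentences("It was dark. The dog barked! Who knew?"):
    print(s)
```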
* word_tag_map.py
Second stage part-of-speech tagger. Takes the output of the
first stage and emits it in a map/reduce-able format:
<key>\t<context>
This is the workhorse of the project and will take most of the
execution time.
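A sketch of the map stage's output shape. The key and context layout below is illustrative only (the real word_tag_map.py defines its own), and a trivial suffix tagger stands in for NLTK's pos_tag so the example is self-contained:

```python
def toy_tag(word):
    """Trivial suffix-based tagger standing in for nltk.pos_tag
    so this sketch runs without NLTK; the tags are illustrative."""
    w = word.lower()
    if w.endswith('ly'):
        return 'ADV'
    if w.endswith('ed'):
        return 'VERB'
    return 'NOUN'

def map_sentence(sentence):
    """Emit one "<tag>/<word>\t<context>" line per word, using the
    whole tagged sentence as the context. The real script defines
    its own key and context layout."""
    tagged = [(w, toy_tag(w)) for w in sentence.split()]
    context = ' '.join('%s/%s' % (w, t) for w, t in tagged)
    return ['%s/%s\t%s' % (t, w, context) for w, t in tagged]

for line in map_sentence("the dog barked loudly"):
    print(line)
```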
* word_tag_map_profile.py
Same as the above, but uses cProfile to show why it takes so
long. It's "pos_tag", unfortunately.
* tab-sort.pl
Quick and dirty script to sort stdin based on the first tab-delimited
field. Replicates what Hadoop map/reduce would do in the sort phase.
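The behavior tab-sort.pl replicates can be sketched in a few lines of Python (this is not the actual script, just the idea):

```python
def tab_sort(lines):
    """Sort lines by their first tab-delimited field only, mimicking
    the shuffle/sort phase of Hadoop streaming. Python's sort is
    stable, so lines sharing a key keep their input order."""
    return sorted(lines, key=lambda line: line.split('\t', 1)[0])
```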
* dictionary_reduce.py
Takes in the context andanalyzes it for the word associations.
Unlike the previous steps which exclusively used NLTK to determine
parts of speech, this script uses a different (my own, more simple)
method of knowing what words are associated with others. This will
make it not language-portable, but is necessary until NLTK can
become more robust.
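The streaming-reduce pattern this step relies on can be sketched as follows. The actual dictionary_reduce.py applies its own word-association heuristics; this sketch only counts contexts per key to show the single-pass structure:

```python
from itertools import groupby

def reduce_counts(sorted_lines):
    """Streaming reduce: because input arrives sorted by key (from
    the sort phase), consecutive runs of the same key can be
    aggregated in one pass with groupby."""
    result = []
    for key, group in groupby(sorted_lines, key=lambda l: l.split('\t', 1)[0]):
        result.append('%s\t%d' % (key, sum(1 for _ in group)))
    return result
```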
* commands.sh
A wrapper script to do a full run of a data directory on Hadoop.
./commands.sh <project dir>
Output will appear in <project dir>-out
* analyze_output.py
Quick script to output statistics on words and a very simple
text-based graph. (Unfortunately, the graph is based on the
full count of the word's uses and not the top 15, but it's good
enough.)
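The style of graph involved can be sketched like this; the function name, scaling, and layout here are assumptions, not the actual analyze_output.py code:

```python
def text_bar_graph(counts, width=40):
    """Render word counts as a text bar chart scaled to the
    largest count, most frequent word first. A sketch of the
    kind of output a simple text-based grapher produces."""
    if not counts:
        return []
    peak = max(counts.values())
    return ['%-12s %s %d' % (word, '#' * max(1, count * width // peak), count)
            for word, count in sorted(counts.items(), key=lambda kv: -kv[1])]

for row in text_bar_graph({'dog': 4, 'cat': 2, 'eel': 1}):
    print(row)
```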
Part 2: How Your Program Should Be Run
--------------------------------------
Step 1: Configure StarCluster
If running on Amazon / StarCluster, move the nltk.py plugin into your
StarCluster plugins folder (.starcluster/plugins) and merge
starcluster.sample.config into your own StarCluster configuration.
It is assumed that you have the AMIs and the StarCluster development
version available, as used in Homework 4.
This is required as the nltk plugin will need to install nltk data files
and packages at EC2 startup.
If running locally, you will need to ensure manually that nltk-python is
installed. From there, run "python -m nltk.downloader all" to preload
the corpus materials. (This is required for the Punkt tokenizer,
for example.) You may need to move /root/nltk_data to /usr/lib after this
command completes. (I needed to do so on the AMI, but not on the Cloudera
image.)
Step 2: Gather Source Data
I have included some example Project Gutenberg data, although you may
download other texts from Gutenberg at http://gutenberg.org
If you want to go all out, my data was gathered from here:
http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/
However, I pruned that data significantly to remove bilingual texts,
dictionaries, and duplicated data files. Some bugs were also
discovered, so see my project documentation.
Step 3: Generate Sentences
Run "sentences.py" on each data file you want to process and place the
results in a directory for EC2 processing. The script writes to standard out.
./sentences.py <input file>
You may want to use the "split" command to divide the data into units of
10-50K lines for easier map/reducing. Using map/reduce's default block
size of 64MB results in poor scaling across multiple nodes, so work
around it by using many smaller files and accepting the filesystem
blocksize overhead.
Step 4: Run the batch
./commands.sh <directory name>
This will trigger all of the relevant commands to start the batch on EC2
and download the result files. You may want to tune the numReducers
parameter in this file before running.
If you want to do a test by hand:
cat <sentence data> | ./word_tag_map.py | ./tab-sort.pl | ./dictionary_reduce.py
will effectively simulate an EC2 run.
Step 5: Analyze data
Modify the "analyze_output.py" script to use your resultant output. It's
not smart, so just cat all of your "part" files together.
After that, just run:
./analyze_output.py <pos> <word>
For example:
./analyze_output.py noun dog