Mahout’s Naïve Bayes Classification algorithm executes in two phases:
- Train Phase: Trains a model using pre-processed train data
- Test Phase: Classifies pre-processed documents with the help of the trained model
This blog post provides a code-level understanding of the training phase of the algorithm; the test phase is covered in the next post. The Mahout command "trainnb" is used to train a Naïve Bayes model.
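To make the two phases concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier in plain Python. It is not Mahout's implementation: the function names `train` and `classify`, the toy documents, and the additive smoothing value `alpha` are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train(docs, alpha=1.0):
    """Train phase: turn labelled token lists into smoothed log-probabilities.

    `docs` is a list of (label, tokens) pairs; `alpha` is a hypothetical
    additive smoothing value, used here only for illustration.
    """
    term_counts = defaultdict(Counter)  # label -> term -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for label, tokens in docs:
        doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)

    total_docs = sum(doc_counts.values())
    model = {"log_prior": {}, "log_likelihood": {}}
    for label in doc_counts:
        model["log_prior"][label] = math.log(doc_counts[label] / total_docs)
        # Smoothed denominator: all term occurrences for the label
        # plus alpha for every vocabulary entry.
        denom = sum(term_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            term: math.log((term_counts[label][term] + alpha) / denom)
            for term in vocab
        }
    return model


def classify(model, tokens):
    """Test phase: return the label with the highest log-posterior score.

    Terms that never appeared in the train data are ignored.
    """
    scores = {}
    for label, log_prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        scores[label] = log_prior + sum(
            likelihood[t] for t in tokens if t in likelihood
        )
    return max(scores, key=scores.get)


# Toy, made-up train data: (label, tokens) pairs.
train_docs = [
    ("sports", ["ball", "goal", "team"]),
    ("sports", ["team", "win"]),
    ("tech", ["code", "bug", "team"]),
    ("tech", ["code", "compile"]),
]
model = train(train_docs)
print(classify(model, ["goal", "win"]))
```

The split mirrors the two phases: `train` only reads labelled data and produces a model, while `classify` only reads the model and an unlabelled document.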
"trainnb" command in Mahout
For reference, the train data directory specified as an input has a structure similar to the one used in data pre-processing.
Command Line options for “trainnb”
-archives <paths>: comma separated archives to be unarchived on the compute machines.
-conf <configuration file>: specify an application configuration file
-D <property=value>: use value for given property
-files <paths>: comma separated files to be copied to the map reduce cluster
-fs <local|namenode:port>: specify a namenode
-jt <local|jobtracker:port>: specify a job tracker
-libjars <paths>: comma separated jar files to include in the classpath.
-tokenCacheFile <tokensFile>: name of the file with the tokens
--input (-i) input: Path to job input directory.
--output (-o) output: The directory pathname for output.
--labels (-l) labels: comma-separated list of labels to include in training
--extractLabels (-el): Extract the labels from the input
--alphaI (-a) alphaI: smoothing parameter
--trainComplementary (-c): train complementary?
--labelIndex (-li) labelIndex: The path to store the label index in
--overwrite (-ow): If present, overwrite the output directory before running job
--help (-h):Print out help
--tempDir tempDir: Intermediate output directory
--startPhase startPhase: First phase to run
--endPhase endPhase: Last phase to run
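As a rough illustration of how the options above fit together, the sketch below assembles a "trainnb" command line in Python without running it. The helper `build_trainnb_command`, the directory names, and the assumption that the launcher script is called `mahout` and is on the PATH are all hypothetical; only the short option names come from the list above.

```python
import shlex


def build_trainnb_command(input_dir, output_dir, label_index,
                          extract_labels=True, complementary=False,
                          overwrite=False, alpha_i=None):
    """Assemble (but do not run) a trainnb invocation as an argument list.

    Assumes the launcher script is named `mahout`; adjust for your setup.
    """
    argv = ["mahout", "trainnb",
            "-i", input_dir,      # --input: job input directory
            "-o", output_dir,     # --output: directory for the model
            "-li", label_index]   # --labelIndex: where to store the label index
    if extract_labels:
        argv.append("-el")        # --extractLabels
    if complementary:
        argv.append("-c")         # --trainComplementary
    if overwrite:
        argv.append("-ow")        # --overwrite
    if alpha_i is not None:
        argv += ["-a", str(alpha_i)]  # --alphaI: smoothing parameter
    return argv


# Hypothetical directory names, chosen only for this example.
cmd = build_trainnb_command("train-vectors", "model", "labelindex",
                            overwrite=True)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` later, and `shlex.join` gives a copy-pasteable string for a terminal.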
Flow of execution of the "trainnb" command
Hope it helped!