Starting with version 4.2, Lucene provides a document classification function. In this article, we will use the same corpus to perform document classification with both Lucene and Mahout and compare the results.
Lucene implements Naive Bayes and k-nearest neighbor (k-NN) classifiers. The trunk, which will become Lucene 5, the next major release, additionally implements a boolean (two-class) perceptron classifier. We use Lucene 4.6.1, the most recent version at the time of writing, to perform document classification with Naive Bayes and the k-NN rule.
Meanwhile, we use Mahout to perform document classification with Naive Bayes and Random Forest on the same corpus.
Overview of Lucene Document Classification
Lucene’s document classifier is defined by the Classifier interface.
public interface Classifier<T> {

  /**
   * Assign a class (with score) to the given text String
   * @param text a String containing text to be classified
   * @return a {@link ClassificationResult} holding assigned class of type <code>T</code> and score
   * @throws IOException If there is a low-level I/O error.
   */
  public ClassificationResult<T> assignClass(String text) throws IOException;

  /**
   * Train the classifier using the underlying Lucene index
   * @param atomicReader the reader to use to access the Lucene index
   * @param textFieldName the name of the field used to compare documents
   * @param classFieldName the name of the field containing the class assigned to documents
   * @param analyzer the analyzer used to tokenize / filter the unseen text
   * @param query the query to filter which documents use for training
   * @throws IOException If there is a low-level I/O error.
   */
  public void train(AtomicReader atomicReader, String textFieldName, String classFieldName, Analyzer analyzer, Query query)
      throws IOException;
}
Because Classifier uses an index as its training data, you need to open an IndexReader on a prepared index and pass it as the first argument of the train() method. The second argument is the name of the Lucene field holding the text, which is tokenized and indexed, and the third argument is the name of the Lucene field holding the document category. Likewise, pass an Analyzer as the fourth argument and a Query as the fifth. The Analyzer is the one used to tokenize the unseen document to be classified (in my personal opinion, this is a bit convoluted; it would be better passed to the assignClass() method described below). The Query is used to narrow down the documents used for training; pass null if there is no need to do so. The train() method has two more overloads with different arguments, but I will skip them here.
After calling train(), pass an unknown document as a String to the assignClass() method to obtain the classification result. Classifier is a generic interface, and assignClass() returns a ClassificationResult parameterized with the type variable T.
public class ClassificationResult<T> {

  private final T assignedClass;
  private final double score;

  /**
   * Constructor
   * @param assignedClass the class <code>T</code> assigned by a {@link Classifier}
   * @param score the score for the assignedClass as a <code>double</code>
   */
  public ClassificationResult(T assignedClass, double score) {
    this.assignedClass = assignedClass;
    this.score = score;
  }

  /**
   * retrieve the result class
   * @return a <code>T</code> representing an assigned class
   */
  public T getAssignedClass() {
    return assignedClass;
  }

  /**
   * retrieve the result score
   * @return a <code>double</code> representing a result score
   */
  public double getScore() {
    return score;
  }
}
Calling the getAssignedClass() method of ClassificationResult gives you the classification result of type T.
Note that Lucene’s classifier is unusual in that the train() method does little work while assignClass() does most of it. This is where it differs greatly from other commonly used machine learning software. In the learning phase of typical machine learning software, a model file is created by learning the corpus according to the selected algorithm (this is where most of the time and effort goes; since Mahout is based on Hadoop, it uses MapReduce to reduce the time required here). In the classification phase, an unknown document is classified by referring to the previously created model file, which usually requires few resources.
Since Lucene uses the index itself as the model file, the train() method, i.e. the learning phase, does almost nothing (learning is complete as soon as the index is created). Lucene’s index, however, is optimized for high-speed keyword search and is not an ideal format for a document classification model. Therefore, the assignClass() method, i.e. the classification phase, performs classification by searching the index. Contrary to common machine learning software, Lucene’s classifier thus requires its computing power in the classification phase. For sites focused mainly on search, this document classification function should be appealing, since it reuses indexes they build anyway at no additional cost.
Now, let’s quickly look at how the two implementation classes of the Classifier interface classify documents, and actually call them from a program.
Using Lucene SimpleNaiveBayesClassifier
SimpleNaiveBayesClassifier is the first implementation class of the Classifier interface. As the name suggests, it is a Naive Bayes classifier. Naive Bayes classification finds the class c that maximizes the conditional probability P(c|d), the probability that document d belongs to class c. Applying Bayes’ theorem to transform P(c|d), finding the most probable class amounts to finding the c that maximizes P(c)P(d|c). To avoid underflow, the calculation is usually done with logarithms; the assignClass() method of SimpleNaiveBayesClassifier repeats this calculation once per class and picks the class with the maximum likelihood.
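The scoring just described can be sketched in plain Java. The following is a minimal, self-contained illustration with a toy corpus and hypothetical names, not Lucene’s actual implementation: for each class it computes log P(c) plus the sum of Laplace-smoothed log P(w|c) terms, and the caller picks the class with the highest score.

```java
import java.util.*;

// A minimal multinomial Naive Bayes scoring sketch (toy data; NOT Lucene's code).
// For each class c it computes log P(c) + sum over words w of log P(w|c),
// with Laplace smoothing, and the caller picks the highest-scoring class.
public class NaiveBayesSketch {

  // word counts per class and training documents per class (toy corpus)
  static final Map<String, Map<String, Integer>> WORD_COUNTS = new HashMap<>();
  static final Map<String, Integer> DOCS_PER_CLASS = new HashMap<>();
  static {
    WORD_COUNTS.put("sports", Map.of("game", 5, "ball", 3));
    WORD_COUNTS.put("tech", Map.of("cpu", 4, "game", 1));
    DOCS_PER_CLASS.put("sports", 2);
    DOCS_PER_CLASS.put("tech", 2);
  }

  static double logProb(String clazz, List<String> words, int vocabSize, int totalDocs) {
    Map<String, Integer> wc = WORD_COUNTS.get(clazz);
    int totalWords = wc.values().stream().mapToInt(Integer::intValue).sum();
    double score = Math.log((double) DOCS_PER_CLASS.get(clazz) / totalDocs); // log prior P(c)
    for (String w : words) {
      // Laplace-smoothed log likelihood log P(w|c)
      score += Math.log((wc.getOrDefault(w, 0) + 1.0) / (totalWords + vocabSize));
    }
    return score;
  }

  public static void main(String[] args) {
    List<String> unknownDoc = List.of("game", "ball");
    double sports = logProb("sports", unknownDoc, 4, 4);
    double tech = logProb("tech", unknownDoc, 4, 4);
    System.out.println(sports > tech ? "sports" : "tech"); // prints "sports"
  }
}
```

Working in log space turns the product P(c)P(w1|c)P(w2|c)… into a sum, which is why repeated multiplication of tiny probabilities does not underflow.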
Before we can use SimpleNaiveBayesClassifier, we need to prepare training data in an index. We use the livedoor news corpus as our corpus. Let’s add the livedoor news corpus to an index using Solr with the following schema definition.
<?xml version="1.0" encoding="UTF-8"?>
<schema name="example" version="1.5">
  <fields>
    <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="cat" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="title" type="text_ja" indexed="true" stored="true" multiValued="false"/>
    <field name="body" type="text_ja" indexed="true" stored="true" multiValued="true"/>
    <field name="date" type="date" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>url</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
Note that the cat field holds the classification class while the body field is the training text field. First, start Solr with the above schema.xml and add the livedoor news corpus. You can stop Solr as soon as you finish adding the corpus.
Next, we need a Java program that uses SimpleNaiveBayesClassifier. To keep things simple, we will classify the very same documents we used for training. The program looks as follows.
public final class TestLuceneIndexClassifier {

  public static final String INDEX = "solr2/collection1/data/index";
  public static final String[] CATEGORIES = {
    "dokujo-tsushin",
    "it-life-hack",
    "kaden-channel",
    "livedoor-homme",
    "movie-enter",
    "peachy",
    "smax",
    "sports-watch",
    "topic-news"
  };
  private static int[][] counts;
  private static Map<String, Integer> catindex;

  public static void main(String[] args) throws Exception {
    init();
    final long startTime = System.currentTimeMillis();
    SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
    IndexReader reader = DirectoryReader.open(dir());
    AtomicReader ar = SlowCompositeReaderWrapper.wrap(reader);
    classifier.train(ar, "body", "cat", new JapaneseAnalyzer(Version.LUCENE_46));
    final int maxdoc = reader.maxDoc();
    for (int i = 0; i < maxdoc; i++) {
      Document doc = ar.document(i);
      String correctAnswer = doc.get("cat");
      final int cai = idx(correctAnswer);
      ClassificationResult<BytesRef> result = classifier.assignClass(doc.get("body"));
      String classified = result.getAssignedClass().utf8ToString();
      final int cli = idx(classified);
      counts[cai][cli]++;
    }
    final long endTime = System.currentTimeMillis();
    final int elapse = (int)(endTime - startTime) / 1000;

    // print results
    int fc = 0, tc = 0;
    for (int i = 0; i < CATEGORIES.length; i++) {
      for (int j = 0; j < CATEGORIES.length; j++) {
        System.out.printf(" %3d ", counts[i][j]);
        if (i == j) {
          tc += counts[i][j];
        } else {
          fc += counts[i][j];
        }
      }
      System.out.println();
    }
    float accrate = (float)tc / (float)(tc + fc);
    float errrate = (float)fc / (float)(tc + fc);
    System.out.printf("\n\n*** accuracy rate = %f, error rate = %f; time = %d (sec); %d docs\n",
        accrate, errrate, elapse, maxdoc);
    reader.close();
  }

  static Directory dir() throws IOException {
    return FSDirectory.open(new File(INDEX));
  }

  static void init() {
    counts = new int[CATEGORIES.length][CATEGORIES.length];
    catindex = new HashMap<String, Integer>();
    for (int i = 0; i < CATEGORIES.length; i++) {
      catindex.put(CATEGORIES[i], i);
    }
  }

  static int idx(String cat) {
    return catindex.get(cat);
  }
}
Here we specify JapaneseAnalyzer as the Analyzer (note a slight mismatch: the index itself was created with the JapaneseTokenizer and related TokenFilters configured in Solr). The string array CATEGORIES hard-codes the document categories. Executing this program prints a confusion matrix, as Mahout does; the rows and columns of the matrix follow the same order as the hard-coded category array.
Executing this program displays the following.
760 0 4 23 37 37 2 2 5
40 656 7 44 25 4 90 1 3
87 57 392 102 68 24 113 5 16
40 15 6 391 33 8 16 2 0
14 2 0 5 845 2 0 1 1
134 2 2 26 107 549 19 3 0
43 36 13 17 26 36 693 5 1
6 0 0 23 35 0 1 829 6
10 9 9 25 66 6 5 45 595
*** accuracy rate = 0.775078, error rate = 0.224922; time = 67 (sec); 7367 docs
The classification accuracy rate is about 77%.
Using Lucene KNearestNeighborClassifier
The other implementation class of Classifier is KNearestNeighborClassifier. KNearestNeighborClassifier takes k, which must be at least 1, as a constructor argument. You can use exactly the same program as for SimpleNaiveBayesClassifier; all you need to do is replace the line that creates the SimpleNaiveBayesClassifier instance with a KNearestNeighborClassifier.
As described before, the assignClass() method does all the work for KNearestNeighborClassifier as well, but one interesting point is that it uses Lucene’s MoreLikeThis. MoreLikeThis is a tool that turns a reference document into a query and performs a search with it, letting you find documents similar to the reference. KNearestNeighborClassifier uses MoreLikeThis to find the k documents most similar to the unknown document passed to the assignClass() method, then applies majority rule over those k documents to determine the category of the unknown document.
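The majority-rule step can be sketched in isolation as follows (an illustrative, self-contained example with hypothetical names; in the real class the neighbor labels come from a MoreLikeThis search over the index):

```java
import java.util.*;

// The majority-vote step of a k-NN classifier, sketched in isolation
// (illustrative code; class and method names are hypothetical, not Lucene's).
public class MajorityVote {

  // given the class labels of the k most similar documents,
  // return the most frequent label
  static String vote(List<String> neighborClasses) {
    Map<String, Integer> tally = new HashMap<>();
    for (String c : neighborClasses) {
      tally.merge(c, 1, Integer::sum);
    }
    return Collections.max(tally.entrySet(), Map.Entry.comparingByValue()).getKey();
  }

  public static void main(String[] args) {
    // with k = 3, two "sports-watch" neighbors outvote one "topic-news" neighbor
    System.out.println(vote(List.of("sports-watch", "topic-news", "sports-watch")));
    // prints "sports-watch"
  }
}
```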
Executing the same program with KNearestNeighborClassifier displays the following when k=1.
724 14 28 22 6 30 8 18 20
121 630 41 13 2 9 35 6 13
165 28 582 10 5 16 26 7 25
229 15 15 213 6 14 6 2 11
134 37 15 8 603 12 19 7 35
266 38 39 24 14 412 22 9 18
810 16 1 3 2 3 32 1 2
316 18 14 12 5 7 8 439 81
362 17 29 10 1 7 7 16 321
*** accuracy rate = 0.536989, error rate = 0.463011; time = 13 (sec); 7367 docs
Now the accuracy rate is 53%. Moreover, with k=3 the accuracy rate goes down to 48%.
652 5 78 3 7 40 13 38 34
127 540 82 15 1 10 58 23 14
169 34 553 3 7 16 38 15 29
242 10 32 156 12 13 15 10 21
136 30 21 9 592 11 19 15 37
309 34 58 5 23 318 40 28 27
810 8 3 1 0 10 37 1 0
312 8 44 7 5 2 13 442 67
362 11 45 5 6 10 16 34 281
*** accuracy rate = 0.484729, error rate = 0.515271; time = 9 (sec); 7367 docs
Document Classification by NLP4L and Mahout
If you want to use a Lucene index as input data for Mahout, there is a handy command available. However, since the goal here is supervised classification, you also need to output the field that specifies the class in addition to the document vectors.
The tools that make this easy are MSDDumper and TermsDumper from NLP4L, which we developed. NLP4L stands for Natural Language Processing for Lucene and is a natural language processing tool set that treats Lucene indexes as corpora.
Depending on their settings, MSDDumper and TermsDumper select and extract important words from a Lucene field according to measures such as tf*idf, and output them in a format that Mahout commands can easily read. Let’s use this to select the 2,000 most important words from the body field of the index and run the Mahout classifiers.
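To illustrate the kind of selection involved, here is a self-contained sketch that picks the top terms by tf*idf over a toy corpus (the exact scoring formula here is my assumption for illustration, not necessarily what MSDDumper computes):

```java
import java.util.*;

// A sketch of tf*idf-based feature selection, similar in spirit to what
// MSDDumper does (toy corpus; scoring details here are an assumption).
public class TopTermsSketch {

  // return the n terms with the highest tf * log(N / df) over the whole corpus
  static List<String> topTerms(List<List<String>> docs, int n) {
    int numDocs = docs.size();
    Map<String, Integer> tf = new HashMap<>(); // total term frequency
    Map<String, Integer> df = new HashMap<>(); // document frequency
    for (List<String> doc : docs) {
      for (String w : doc) tf.merge(w, 1, Integer::sum);
      for (String w : new HashSet<>(doc)) df.merge(w, 1, Integer::sum);
    }
    Map<String, Double> score = new HashMap<>();
    for (String w : tf.keySet()) {
      score.put(w, tf.get(w) * Math.log((double) numDocs / df.get(w)));
    }
    List<String> terms = new ArrayList<>(score.keySet());
    terms.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
    return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
  }

  public static void main(String[] args) {
    List<List<String>> docs = List.of(
        List.of("lucene", "index", "search"),
        List.of("mahout", "forest", "mahout"),
        List.of("lucene", "search", "query"));
    // "mahout" occurs twice but in only one document, so it scores highest
    System.out.println(topTerms(docs, 2).get(0)); // prints "mahout"
  }
}
```

Terms that are frequent overall but concentrated in few documents get high scores, which is why such terms make discriminative features for a classifier.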
Looking only at the result, Mahout Naive Bayes achieves an accuracy rate of 96%.
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 7128 96.7689%
Incorrectly Classified Instances : 238 3.2311%
Total Classified Instances : 7366
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i <--Classified as
823 1 1 6 12 19 2 4 2 | 870 a = dokujo-tsushin
1 848 2 1 0 1 11 4 2 | 870 b = it-life-hack
5 6 830 1 1 0 3 1 17 | 864 c = kaden-channel
2 6 6 486 3 1 6 0 0 | 510 d = livedoor-homme
0 0 1 1 865 1 0 1 1 | 870 e = movie-enter
31 3 6 12 14 762 6 4 4 | 842 f = peachy
0 0 2 0 0 1 867 0 0 | 870 g = smax
0 0 0 1 0 0 0 897 2 | 900 h = sports-watch
2 4 1 1 0 0 0 12 750 | 770 i = topic-news
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.955
Accuracy 96.7689%
Reliability 87.0076%
Reliability (standard deviation) 0.307
Also, Mahout Random Forest achieves an accuracy rate of 97%.
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 7156 97.1359%
Incorrectly Classified Instances : 211 2.8641%
Total Classified Instances : 7367
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i <--Classified as
838 5 2 6 3 7 2 0 1 | 864 a = kaden-channel
0 895 0 1 4 0 0 0 0 | 900 b = sports-watch
0 0 869 0 0 1 0 0 0 | 870 c = smax
0 2 0 839 1 0 14 2 12 | 870 d = dokujo-tsushin
1 17 0 0 748 0 2 0 2 | 770 e = topic-news
1 5 0 1 5 855 2 0 1 | 870 f = it-life-hack
0 1 0 23 0 0 793 1 24 | 842 g = peachy
0 11 0 14 1 2 18 454 11 | 511 h = livedoor-homme
0 1 0 2 0 0 2 0 865 | 870 i = movie-enter
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.9608
Accuracy 97.1359%
Reliability 87.0627%
Reliability (standard deviation) 0.3076
Summary
In this article, we used the same corpus to perform document classification with both Lucene and Mahout and compared their results. The accuracy rate appears higher for Mahout but, as stated above, its training data contained not all words but only the top 2,000 important words from the body field. On the other hand, Lucene’s classifier, whose accuracy rate stayed in the 70% range, uses all the words in the body field. Lucene should be able to exceed 90% accuracy if you add a field holding only words selected specifically for document classification. It may also be a good idea to create another Classifier implementation class whose train() method performs such feature selection.
I should add that the accuracy rate drops to around 80% when you do not test on the training data but instead test on genuinely unknown data.
I hope this article will help you all in some way.
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html