Generate a tf-idf based feature vector from a document.
The tf and idf calculations can be varied with the "tfMode" and "idfMode" parameters. See http://nlp.stanford.edu/IR-book/html/htmledition/variant-tf-idf-functions-1.html for the theoretical background.
By default, when tfMode and idfMode are not given, the weight for each term is given by the basic tf-idf formula
(tf) * log(N / df)
where tf is the frequency of the term in the given document, N is the total number of documents, and df is the document frequency of the term.
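As a rough illustration of the default formula, the weight could be computed as below. This is a minimal sketch, not the library's code: the object name `BasicTfIdf`, the `weight` function, and the choice of the natural logarithm are assumptions made for the example.

```scala
object BasicTfIdf {
  /** Basic tf-idf weight: tf * log(N / df).
    * The natural logarithm is an assumption; the log base only rescales the weights.
    */
  def weight(tf: Double, numDocs: Long, df: Long): Double = {
    // df must be positive and cannot exceed the number of documents
    require(df > 0 && df <= numDocs, "df must be in 1..N")
    tf * math.log(numDocs.toDouble / df)
  }
}
```

For instance, a term that occurs 3 times in a document and appears in 10 of 1000 documents gets 3 * log(100), roughly 13.8 with the natural logarithm. A term that appears in every document gets a weight of 0.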
the IReader instance
the field name for counting words
the Lucene document id
the set of words (terms) considered as features. If an empty set is given, all words (terms) are taken as features.
tf calculation mode. Expected values are "n" (normal), "l" (logarithm), "m" (maximum normalization), "b" (boolean), "L" (log ave), "w" (sublinear weighted). The default value is "n".
the smoothing term for tfMode "m". The default value is 0.4.
idf calculation mode. Expected values are "n" (no), "t" (idf), "p" (prob idf). The default value is "t".
the pair of the Vector of words and the feature vector
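The mode letters above can be read against the variant table in the linked IR-book chapter. The sketch below follows the textbook definitions and is not taken from the library source: the object name `TfIdfModes`, the function names, the extra arguments `maxTf` and `aveTf`, and the natural logarithm are all assumptions. The "w" mode is left out because its exact definition is not given here.

```scala
object TfIdfModes {
  /** tf variants. maxTf and aveTf are the maximum and average term frequency
    * in the document; a is the smoothing term used by mode "m".
    */
  def tfWeight(mode: String, tf: Double, maxTf: Double, aveTf: Double, a: Double = 0.4): Double =
    mode match {
      case "n" => tf                                               // raw term frequency
      case "l" => if (tf > 0) 1 + math.log(tf) else 0.0            // logarithm
      case "m" => if (maxTf > 0) a + (1 - a) * tf / maxTf else 0.0 // maximum normalization
      case "b" => if (tf > 0) 1.0 else 0.0                         // boolean
      case "L" =>                                                  // log average
        if (tf > 0) (1 + math.log(tf)) / (1 + math.log(aveTf)) else 0.0
      case other =>
        throw new IllegalArgumentException(s"tfMode not covered by this sketch: $other")
    }

  /** idf variants. numDocs is N and df is the document frequency of the term. */
  def idfWeight(mode: String, numDocs: Long, df: Long): Double = {
    require(df > 0 && df <= numDocs, "df must be in 1..N")
    mode match {
      case "n" => 1.0                                              // no idf
      case "t" => math.log(numDocs.toDouble / df)                  // idf
      case "p" => math.max(0.0, math.log((numDocs - df).toDouble / df)) // prob idf
      case other =>
        throw new IllegalArgumentException(s"idfMode not covered by this sketch: $other")
    }
  }
}
```

Under these assumptions the final weight of a term would be `tfWeight(...) * idfWeight(...)`, which reduces to the basic formula when the modes are "n" and "t".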
Generate tf-idf based feature vectors from specified documents.
See also the documentation for tfIdfVector().
the IReader instance
the field name for counting words
the list of Lucene document ids
the set of words (terms) considered as features. If an empty set is given, all words (terms) are taken as features.
tf calculation mode. The default value is "n".
the smoothing term for tfMode "m". The default value is 0.4.
idf calculation mode. The default value is "t".
the pair of words and the feature vectors
Generate a simple tf based feature vector from a document.
the IReader instance
the field name for counting words
the Lucene document id
the set of words (terms) considered as features. If an empty set is given, all words (terms) are taken as features.
the pair of the Vector of words and the feature vector
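The effect of the word set on the returned pair can be illustrated without an index. The sketch below counts terms in an in-memory token list instead of reading from an IReader, so it is only an approximation of the behaviour described above. The object name `SimpleTf`, the `tokens` argument, and the alphabetical ordering of the features are assumptions.

```scala
object SimpleTf {
  /** Returns (words, term frequencies) for one tokenized document.
    * If `words` is empty, every term found in `tokens` becomes a feature.
    */
  def tfVector(tokens: Seq[String], words: Set[String] = Set.empty): (Vector[String], Vector[Long]) = {
    // Count occurrences of each token in the document
    val counts: Map[String, Long] =
      tokens.groupBy(identity).map { case (w, occ) => w -> occ.size.toLong }
    // An empty set selects all terms; sorting is only for a stable order in this sketch
    val features = (if (words.isEmpty) counts.keySet else words).toVector.sorted
    // Terms requested as features but absent from the document get frequency 0
    (features, features.map(w => counts.getOrElse(w, 0L)))
  }
}
```

With the tokens `a b a` and an empty word set, this yields the words `a, b` with frequencies `2, 1`. With the word set `{b, c}` it yields `b, c` with frequencies `1, 0`.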
Generate simple tf based feature vectors from specified documents.
the IReader instance
the field name for counting words
the list of Lucene document ids
the set of words (terms) considered as features. If an empty set is given, all words (terms) are taken as features.
the pair of words and the feature vectors
Utility object to generate feature vectors representing documents/corpus weighted by tf-idf.