def get_features_by_wordbag():
global max_features
x_train, x_test, y_train, y_test=load_all_files()
vectorizer = CountVectorizer(
decode_error='ignore',
strip_accents='ascii',
max_features=max_features,
stop_words='english',
max_df=1.0,
min_df=1 )
print vectorizer
x_train=vectorizer.fit_transform(x_train)
x_train=x_train.toarray()
vocabulary=vectorizer.vocabulary_
vectorizer = CountVectorizer(
decode_error='ignore',
strip_accents='ascii',
vocabulary=vocabulary,
stop_words='english',
max_df=1.0,
min_df=1 )
print vectorizer
x_test=vectorizer.fit_transform(x_test)
x_test=x_test.toarray()
return x_train, x_test, y_train, y_test
词袋模型示例:
>>> corpus = [
... 'This is the first document.',
... 'This is the second second document.',
... 'And the third one.',
... 'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
with 19 stored elements in Compressed Sparse ... format>
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:
>>>
>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
... ['this', 'is', 'text', 'document', 'to', 'analyze'])
True
Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:
>>>
>>> vectorizer.get_feature_names() == (
... ['and', 'document', 'first', 'is', 'one',
... 'second', 'the', 'third', 'this'])
True
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]]...)