freediscovery.feature_weighting.SmartTfidfTransformer

class freediscovery.feature_weighting.SmartTfidfTransformer(weighting='nsc', norm_alpha=0.75, norm_pivot=None, compute_df=False, copy=True)
TFIDF weighting and normalization with the SMART IR notation.
This class is similar to sklearn.feature_extraction.text.TfidfTransformer but supports a larger number of TFIDF weighting and normalization schemes. It should be fitted on the document-term matrix computed by sklearn.feature_extraction.text.CountVectorizer.

The TFIDF transform consists of three successive operations, determined by the weighting parameter:

  Term frequency weighting: natural (n), log (l), augmented (a), boolean (b), log average (L)
  Document frequency weighting: none (n), idf (t), smoothed idf (s), probabilistic (p), smoothed probabilistic (d)
  Document normalization: none (n), cosine (c), length (l), unique (u)
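The three operations can be sketched in plain Python on a toy count matrix. This is only an illustration, assuming textbook definitions (natural term frequency, idf as log(N / df), cosine normalization); the helper names and data are hypothetical, and the formulas the library applies for each code may differ.

```python
import math

# Toy document-term count matrix: 3 documents x 4 terms (illustrative data).
counts = [
    [2, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 3],
]


def term_frequency(row):
    """Step 1, natural term frequency: keep the raw counts unchanged."""
    return [float(c) for c in row]


def document_frequency_weights(matrix):
    """Step 2, textbook idf (assumed form): log(N / df) for each term."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # df[j] = number of documents in which term j occurs at least once.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [math.log(n_docs / d) if d > 0 else 0.0 for d in df]


def cosine_normalize(row):
    """Step 3, cosine normalization: divide by the Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)


idf = document_frequency_weights(counts)
weighted = [
    cosine_normalize([tf * w for tf, w in zip(term_frequency(row), idf)])
    for row in counts
]
```

After the last step every non-empty row of `weighted` has unit Euclidean length, and terms that occur in fewer documents receive larger weights.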
Following the SMART IR notation, the weighting parameter is written as the concatenation of three characters describing each processing step. In addition, pivoted normalization can be enabled with a fourth character, p.
See the TFIDF schemes documentation section for more details.
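As a rough illustration of how such a string decomposes, the sketch below splits a weighting value into its steps. `parse_smart_notation` is a hypothetical helper, not part of the library, and the accepted characters are an assumption taken from the three lists in the class description; the library's own validation may accept a different set.

```python
import re

# Character codes assumed from the lists in the class description;
# the library's own validation may differ.
_SMART_PATTERN = re.compile(
    r"(?P<tf>[nlabL])(?P<df>[ntspd])(?P<norm>[nclu])(?P<pivot>p?)"
)


def parse_smart_notation(weighting):
    """Split a SMART weighting string into its steps (hypothetical helper)."""
    match = _SMART_PATTERN.fullmatch(weighting)
    if match is None:
        raise ValueError(f"invalid SMART weighting: {weighting!r}")
    return {
        "term_frequency": match.group("tf"),
        "document_frequency": match.group("df"),
        "normalization": match.group("norm"),
        "pivoted": match.group("pivot") == "p",
    }


default_scheme = parse_smart_notation("nsc")
pivoted_scheme = parse_smart_notation("lncp")
```

Here `default_scheme` describes natural term frequency, smoothed idf and cosine normalization, while `pivoted_scheme` additionally has the pivoted flag set.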
Parameters:
  weighting (str, default='nsc') – the SMART notation for document and term weighting and normalization, in the form [nlabL][ntspd][ncb][p].
  norm_alpha (float, default=0.75) – the α parameter in the pivoted normalization. This parameter is only used when weighting='???p'.
  norm_pivot (float, default=None) – the pivot value used for the normalization. If not provided, it is computed as the mean of norm(tf*idf). This parameter is only used when weighting='???p'.
  compute_df (bool, default=False) – compute the document frequency (df_ attribute) even when it is not explicitly required by the weighting scheme.
  copy (boolean, default=True) – whether to copy the input array and operate on the copy, or perform in-place operations in fit and transform.
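To make the roles of norm_alpha and norm_pivot concrete, here is a minimal sketch of pivoted normalization. It assumes the form from Singhal et al., where the pivoted norm is (1 - α) * pivot + α * norm, and it assumes a Euclidean norm; `pivoted_norms` is a hypothetical helper and the library's exact computation may differ.

```python
import math


def pivoted_norms(rows, norm_alpha=0.75, norm_pivot=None):
    """Return the pivoted norm of each row (hypothetical helper).

    Assumed form: (1 - norm_alpha) * norm_pivot + norm_alpha * norm.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in rows]
    if norm_pivot is None:
        # Default described above: the mean of the document norms.
        norm_pivot = sum(norms) / len(norms)
    return [(1.0 - norm_alpha) * norm_pivot + norm_alpha * n for n in norms]


# Illustrative tf*idf rows with Euclidean norms 5 and 10.
tfidf_rows = [[3.0, 4.0], [6.0, 8.0]]
pivoted = pivoted_norms(tfidf_rows)
normalized = [[v / p for v in row] for row, p in zip(tfidf_rows, pivoted)]
```

With norm_alpha=1 the pivoted norm reduces to the plain norm; smaller values pull every document's norm toward the pivot, which reduces the advantage short documents get under plain cosine normalization.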
References
[Manning2008] C.D. Manning, P. Raghavan, H. Schütze, “Document and query weighting schemes”, 2008.
[Singhal1996] A. Singhal, C. Buckley, and M. Mitra, “Pivoted document length normalization”, 1996.
fit(X, y=None)
Learn the document length and document frequency vector (if necessary).
Parameters: X (sparse matrix, [n_samples, n_features]) – a matrix of term/token counts

fit_transform(X, y=None)
Apply document term weighting and normalization to text features.
Parameters: X (sparse matrix, [n_samples, n_features]) – a matrix of term/token counts

get_params(deep=True)
Get parameters for this estimator.
Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns: self
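The <component>__<parameter> convention can be illustrated with a small pure-Python sketch that groups keyword arguments by component. `split_nested_params` is a hypothetical helper and `tfidf` is an assumed pipeline step name; this is not how the estimator itself is implemented.

```python
def split_nested_params(params):
    """Group keyword arguments by component (hypothetical helper)."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only.
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"copy": False, "tfidf__weighting": "lnc", "tfidf__norm_alpha": 0.5}
)
```

Keys without a double underscore apply to the object itself, while the remaining keys are routed to the named component.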