This is what ForwardIndex does. The simplest method, currently, is probably to load up the documents into a Dataset like so:
fidx = metapy.index.make_forward_index('config.toml')
dset = metapy.learn.Dataset(fidx)
(Note that this will load all of the documents into memory. If that's a problem, load things in batches by providing a list of doc_ids as the second argument to the Dataset constructor.)
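A rough sketch of the batching idea, in case it helps. The helper name doc_id_batches is made up for illustration, and it assumes doc ids are consecutive integers starting at 0; the metapy calls only appear in comments and aren't run here.

```python
def doc_id_batches(num_docs, batch_size):
    """Yield consecutive lists of doc ids covering 0 .. num_docs - 1.

    Hypothetical helper: assumes doc ids are the integers 0 .. num_docs - 1.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, num_docs, batch_size):
        yield list(range(start, min(start + batch_size, num_docs)))


# Intended use with metapy (assumption, not executed here):
#   for ids in doc_id_batches(fidx.num_docs(), 1000):
#       batch = metapy.learn.Dataset(fidx, ids)
#       ... work on batch, then let it go out of scope ...
```

That way only one batch of documents needs to sit in memory at a time.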
The things in dset are all Instance objects, which have a weights property: a sparse vector (FeatureVector) that contains all of the non-zero feature counts (which will be term counts if you're just using unigram words).
You can then modify these vectors to transform the raw term frequency information into tf-idf weights. I've also just pushed a new version of metapy that includes bindings for a helper function present in the C++ code, which you can probably use instead of doing this yourself:
iidx = metapy.index.make_inverted_index('config.toml')
metapy.learn.tfidf_transform(dset, iidx, metapy.index.OkapiBM25()) # or any other ranker
That should replace the weights in the weight vectors for all of the instances in dset with the tf-idf weights (according to OkapiBM25's TF and IDF formulas).
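For comparison, doing the transform by hand could look roughly like the following. This is only a sketch: it works on plain Python lists of (feature_id, count) pairs rather than on metapy objects, the name tfidf_by_hand is made up, and it uses the textbook tf * log(N / df) weighting, not OkapiBM25's formulas and not whatever the C++ helper does internally.

```python
import math
from collections import Counter


def tfidf_by_hand(docs):
    """Turn raw counts into classic tf-idf weights.

    docs: list of sparse vectors, each a list of (feature_id, count) pairs.
    Returns a new list in the same layout with weight = count * log(N / df).
    """
    num_docs = len(docs)

    # Document frequency: in how many documents does each feature occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({fid for fid, count in doc if count != 0})

    return [
        [(fid, count * math.log(num_docs / doc_freq[fid]))
         for fid, count in doc if count != 0]
        for doc in docs
    ]
```

Note that with this plain formula a feature that occurs in every document gets weight zero.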
The vectors in any Dataset will already be represented in a sparse way (internally, as vectors of pairs for non-zero weights).
FeatureVector should support
If you need to use scipy's stuff for other reasons, though, it would definitely be wise to store things using some similarly sparse format. You'll need to do the conversion from FeatureVector to that sparse representation yourself, as there isn't a tight integration yet. It shouldn't be too hard, though.
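The conversion might be sketched like this. It assumes that each weight vector can be read out as (feature_id, weight) pairs; how exactly you get those pairs out of a FeatureVector is not shown here, and to_csr_arrays is a made-up name. The function builds the three flat arrays of the CSR (compressed sparse row) layout using nothing but the standard library.

```python
def to_csr_arrays(vectors):
    """Flatten sparse row vectors into CSR-style arrays.

    vectors: iterable of rows, each an iterable of (feature_id, weight) pairs.
    Returns (data, indices, indptr), where row i occupies the slice
    indptr[i]:indptr[i + 1] of data and indices.
    """
    data, indices, indptr = [], [], [0]
    for vec in vectors:
        # Sort by feature id so column indices are ordered within each row.
        for fid, weight in sorted(vec):
            indices.append(fid)
            data.append(weight)
        indptr.append(len(data))
    return data, indices, indptr
```

The result can then be handed to scipy as scipy.sparse.csr_matrix((data, indices, indptr), shape=(n_rows, n_features)); the number of columns has to come from somewhere else, such as the size of the index's vocabulary.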