Metapy: How to get Term Frequency vectors for each document?


#1

I am wondering how to get the term frequency vector for each document in the form of a list of (term_id, count) tuples, or something similar. I imagine that I would have to extend the metapy.index.RankingFunction class and override the score_one method to return that tuple, then “score” all of the documents with a dummy query somehow. Does anyone have any ideas?

Another concern is storing the vectors and using them in computations. I have 100,000 documents with a total vocabulary of a few million terms, so a dense matrix representation of the TF vectors would be prohibitively costly. Would scipy’s sparse matrices work well? Or is there another sparse representation that works well with numpy and scipy functions?


#2

This is what ForwardIndex does. The simplest method, currently, is probably to load up the documents into a Dataset like so:

import metapy
fidx = metapy.index.make_forward_index('config.toml')
dset = metapy.learn.Dataset(fidx)

(Note that this will load all of the documents into memory. If that’s a problem, load things in batches by providing a list of doc_ids as the second argument to the Dataset constructor.)
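
For example, here’s a minimal sketch of batched loading, assuming doc ids are the consecutive integers 0 through fidx.num_docs() - 1 (which is how MeTA assigns them):

batch_size = 10000  # arbitrary; tune to your memory budget
for start in range(0, fidx.num_docs(), batch_size):
    doc_ids = list(range(start, min(start + batch_size, fidx.num_docs())))
    batch = metapy.learn.Dataset(fidx, doc_ids)
    # ... process this batch, then let it go out of scope ...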

The things in dset are all Instance objects, each of which has a weights property: a sparse vector (FeatureVector) containing all of the non-zero feature counts (which will be term counts if you’re just using unigram words).
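
So, to directly get the list of (term_id, count) tuples you asked about, something like this sketch should work (assuming iterating a FeatureVector yields (feature_id, weight) pairs; check this against your metapy version):

for i, inst in enumerate(dset):
    # inst.weights stores only the non-zero entries
    tf_pairs = [(term_id, count) for term_id, count in inst.weights]
    print(i, tf_pairs[:5])  # peek at the first few counts for document i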

You can then modify these vectors to transform the raw term frequency information into tf-idf weights. I’ve also just pushed a new version of metapy that includes bindings for a helper function present in the C++ code, which you can probably use instead of doing this yourself:

iidx = metapy.index.make_inverted_index('config.toml')
metapy.learn.tfidf_transform(dset, iidx, metapy.index.OkapiBM25()) # or any other ranker

That should replace the weights for all of the Instances in dset with tf-idf weights (according to OkapiBM25’s TF and IDF formulas).

The vectors in any Dataset will already be represented in a sparse way (internally, as vectors of pairs for non-zero weights). FeatureVector should support dot() and cosine().
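
For instance, a hypothetical comparison of the first two documents (assuming Dataset supports integer indexing):

v0 = dset[0].weights
v1 = dset[1].weights
print(v0.dot(v1), v0.cosine(v1))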

If you need to use scipy’s functionality for other reasons, though, it would definitely be wise to store things using a similarly sparse format. You’ll need to do the conversion from FeatureVector to that sparse representation yourself, as there isn’t a tight integration yet. It shouldn’t be too hard, though; see the sketch below.
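
Here’s a minimal sketch of that conversion into a scipy CSR matrix, under the same FeatureVector-iteration assumption as above (unique_terms() gives the vocabulary size):

from scipy.sparse import csr_matrix

rows, cols, vals = [], [], []
for i, inst in enumerate(dset):
    for term_id, weight in inst.weights:
        rows.append(i)
        cols.append(term_id)
        vals.append(weight)

# one row per document, one column per term in the vocabulary
tf_matrix = csr_matrix((vals, (rows, cols)),
                       shape=(len(dset), fidx.unique_terms()))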