I’m currently implementing the Divergence Minimization Model and the Relevance Model in MeTA, and I noticed some unexpected behavior. To construct the new (expanded) query, I’m using the document’s “increment” function to assign probabilities to words. This function accepts values of type double, so it allows assigning probabilities and not just integer counts.
The problem appears when I search using the expanded query. In ranker.cpp, each word’s weight (which can be a double) is saved into the query_term_count field of a score_data instance, but query_term_count is an unsigned integer. Since the word weights are probabilities in the range [0,1], converting them to integers truncates every weight below 1 to zero, so in practice all the words get a zero weight and all documents get a score of 0.
One simple fix that worked for me is to change the type of query_term_count in score_data.h from uint64_t to double. I think it would also be a good idea to rename the variable to something like query_term_weight, to indicate that it is not restricted to counts (i.e. it can be a probability or some other kind of weight).
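
To make the truncation concrete, here is a minimal sketch of the effect described above. The structs and helper functions are hypothetical stand-ins written for illustration; they are not MeTA's actual score_data or ranker code, and only the field names query_term_count and query_term_weight come from the discussion.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the current layout: an integer field.
struct count_score_data
{
    uint64_t query_term_count;
};

// Hypothetical stand-in for the proposed layout: a floating-point field.
struct weight_score_data
{
    double query_term_weight;
};

// Assumed representation of an expanded query: term -> probability.
using term_weights = std::unordered_map<std::string, double>;

// Sums the query weights after storing each one in the integer field.
// Any weight below 1 is truncated to 0 by the conversion.
double total_stored_as_counts(const term_weights& query)
{
    double total = 0.0;
    for (const auto& term : query)
    {
        count_score_data sd{static_cast<uint64_t>(term.second)};
        total += static_cast<double>(sd.query_term_count);
    }
    return total;
}

// Sums the query weights after storing each one in the double field.
// The weights are preserved as given.
double total_stored_as_weights(const term_weights& query)
{
    double total = 0.0;
    for (const auto& term : query)
    {
        weight_score_data sd{term.second};
        total += sd.query_term_weight;
    }
    return total;
}
```

With a query such as {"retrieval": 0.6, "model": 0.4}, the first function sees every term as 0, while the second keeps the full probability mass. That is the difference the type change is meant to fix.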
Ah, yeah, I think this is an oddity that’s a result of how the code developed over time…
I think it’s fine to do the refactoring you suggest for the score_data objects. We ought to be able to support term weights on queries rather than just raw term counts. The index itself, however, will always be count-based since we use integer-based compression techniques.
@hussein: Do you want to do this refactoring since you found the issue? You can target a pull request at either the master branch, depending on which one you’re using as the base to develop your models (which I’d love to have in the codebase when you’re done, btw!).
Going forward (in a separate refactoring that we might want to do later), I wonder whether it’s a good idea to actually make a separate query class instead of re-using document as a query? We could then have document use integer counts only and query be the class that supports general term weights. Right now, the only place where document is used outside of query time (I think, right @smassung?) is as the return value out of the corpus classes as input to the tokenization process in inverted index creation. Since the inverted index itself must be count-based for compression reasons (and no analyzer being run on raw text will output non-integer counts at the moment), there’s really no reason for document to have doubles in it. It’d also clear up this confusion a bit, since at the moment document sometimes has floating point values, but most of the time doesn’t.
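
A rough sketch of what that split might look like, assuming a simple map-backed representation. The class names count_document and weighted_query, and all of their members, are hypothetical and chosen only for illustration; they are not MeTA's existing document interface.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical count-only document: what the indexer would consume.
class count_document
{
  public:
    void increment(const std::string& term, uint64_t amount)
    {
        counts_[term] += amount;
    }

    uint64_t count(const std::string& term) const
    {
        auto it = counts_.find(term);
        return it == counts_.end() ? 0 : it->second;
    }

    const std::unordered_map<std::string, uint64_t>& counts() const
    {
        return counts_;
    }

  private:
    std::unordered_map<std::string, uint64_t> counts_;
};

// Hypothetical query type: the only place real-valued weights live.
class weighted_query
{
  public:
    weighted_query() = default;

    // A plain text query starts from integer counts, widened to double.
    explicit weighted_query(const count_document& doc)
    {
        for (const auto& entry : doc.counts())
            weights_[entry.first] = static_cast<double>(entry.second);
    }

    // Expanded queries can assign arbitrary weights, e.g. probabilities.
    void set_weight(const std::string& term, double weight)
    {
        weights_[term] = weight;
    }

    double weight(const std::string& term) const
    {
        auto it = weights_.find(term);
        return it == weights_.end() ? 0.0 : it->second;
    }

  private:
    std::unordered_map<std::string, double> weights_;
};
```

The idea is that the indexing path only ever sees integer counts, while a feedback method can build a weighted_query from tokenized text and then overwrite or add fractional weights.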
Thanks Chase! I opened a pull request.
I think it makes sense to create a separate class for queries, since they require real-valued term weights. However, changing the type of term weights in the document class to integer might create problems for other modules. For instance, it would prevent us from using arbitrary real-valued features when doing classification (one solution is to create a special document class for classification).
BTW, I will add the Divergence Minimization Model and the Relevance Model when I finish testing, probably next week.