Document Length Type


#1

I noticed that document::length is being stored as size_t and that document::length() returns uint64_t. Although this might be expected for a document, it is creating problems for updated queries in feedback. Also, this creates a minor inconsistency; the increment function accepts a double, but the length, which is an integer, will get an incorrect value due to casting:

void document::increment(const std::string& term, double amount)
{
    counts_[term] += amount;
    length_ += amount;
}

For feedback, the problem happens when using the initial_score function in lm_ranker:

double language_model_ranker::initial_score(const score_data& sd) const
{
    return sd.query.length() * std::log(doc_constant(sd));
}

sd.query.length() will return a value of zero for feedback queries, which cancels out documentation length normalization thus leading to poor performance. A simple fix that worked for me is to change document::length and document::length() to double (this significantly increased the performance of DMM and RM). Should I go ahead and do the changes?

BTW, my implementation of the feedback module is almost complete. I will add the code very soon!


#2

Yeah, this is another instance where there really ought to be a difference between a document and a query. In terms of indexing, document can’t have a non-integer length, but in terms of querying, it’s reasonable to want to have a non-integer length.

I think instead of changing document::length_ and document::length() to be double, we should instead actually break out a query class from document and put it in the index namespace, making it clear that corpus::document only supports integer counts and is used during indexing only, and index::query supports floating point values for term weights as well as “length” (which is maybe better termed total_weight).

This would mean we’d need to change the references to corpus::document to index::query in e.g. the rankers and other related code, but I think this change is important to make at this point since it’s getting a little muddy about where doubles are actually supported and where they are not. We should use a separate class to make that distinction clear (and avoid other similar errors like this).

Awesome to hear about the feedback module! Are you targeting master or develop? We’re gearing up for a 2.0.0 release (significant breaking changes have been made in develop) and have diverged a bit from master… We typically try to only merge things into master when we’re ready to make a release (so that master is always the “stable” branch. We could do a better job of making that clear in e.g. the README.md in the repo…

If you need help rebasing the changes against the develop branch, I’d be willing to help out. I’m a bit busy with running my two classes at the moment, but I should be able to carve out some time to make the changes apply nicely. The biggest thing I could see causing issues is that the default ranker base class has switched from using a term-at-a-time approach to a document-at-a-time approach, so ranker::score() has changed quite a bit.


#3

Yes I think creating a separate class for queries is necessary. In addition to preventing ambiguities, it will allow customizing queries and documents to support different functionalities. For the time being I will continue developing based on the temporary workaround of changing the length to double.

I was actually targeting the master branch. I didn’t have a look at the develop yet, but I’m relying on ranker::score() as a black-box, i.e. the feedback method will call one of the built-in rankers, extract the updated query from the top documents, and then call the ranker again. As long as the output of ranker::score() is (docID,score) pairs, I think the code should work. I will have a look at the develop branch and let you know if I need help in the conversion. Thanks! :smiley:


#4

Oh, awesome. You shouldn’t have any issue then: the public part of the API is still the same, just the implementation is different.


#5

@skystrife and @smassung, the code I’m currently working on defines a base class called lm_feedback which can be used to create derived classes of feedback methods like dmm and rm. In lm_feedback, I’m trying to use the same API as the ranker class, i.e. it has a score() function that takes the same arguments as ranker::score() and returns the same output (<docID,score> pairs), however, the implementation of lm_feedback::score() is different from ranker::score(). On a high level, lm_feedback::score() will first call one of the LM-based rankers (like dirichlet-prior), then run the feedback method on the top documents to get the updated query, and finally call the LM-based ranker again to produce the final list of ranked documents.

Conceptually, lm_feedback can be thought of as a special type of rankers since it takes the same input (a query) and produces the same output (ranked documents). So it would be nice to allow the derived classes of lm_feedback to be created just like the other conventional rankers (such bm25, dirichlet-prior,…) using the ranker_factory. However, one thing that prevents integrating lm_feedback into ranker_factory is that the former cannot be considered a derived class of ranker due to the different implementations of score(). I was thinking about creating a more general class for rankers, for example called base_ranker, that keeps the function score virtual. The current ranker and lm_ranker classes can be made derived classes of base_ranker which can be used in ranker_factory as the base class of all types of rankers (whether those based on feedback or not). This will allow any ranking function to be declared in ranker_factory without having to be limited by the implementation of ranker::score(). I think that this will not only allow better support for feedback but also enable non-conventional types of rankers to be used. I’m not sure if this is the best way to do it, so let me know if this makes sense to you or if you have other suggestions.


#6

That strategy seems fine to me. Probably the easiest way to do this is to rename ranker to standard_ranker (or something—I can’t think of a better name at the moment, but ideally it would be something that conveys that this is what you’d subclass for a standard retrieval function) and then make ranker become the base class you mention with a pure-virtual score(). Feedback methods would then directly subclasss ranker, while the standard retrieval methods would subclass standard_ranker.