Currently, our vocabulary mapping strategy is sort of “dumb”: we have a pseudo-B±tree dictionary format that looks up the vocab ids given strings and a disk vector mapping ids to seek positions within the leaf nodes of that B±tree-like structure.
This works fine, but it’s not exactly efficient. We store the keys multiple times instead of just once, it’s not compressed at all, and requires a logarithmic number of disk seeks for looking up any word.
Instead, I want to replace this with a deterministic minimal perfect hashing scheme using something a bit more intelligent. The most promising direction appears to be using finite state automata/transducers, but there could be others. I plan on using this topic to keep track of my findings and implementation status. It might even make sense to implement a few of these and allow them to be pluggable.
The idea here is to create a finite state automata which accepts the words of the vocabulary and, in doing so, maps each one uniquely to a value $v \in [0, |V|)$. There are algorithms that can (relatively) efficiently build a minimal FSA from pre-sorted input, which is what we’ll have during the final vocab-building step of the indexer anyway.
Lookup for the vocab id for a word is just a walk through the automaton and should be linear. It also seems that, in practice, the vocab for even large indexes can fit in memory, so this should be a really fast lookup solution. We can
mmap the file in the case where we don’t want to just load it all into memory (or even just ignore this case for now, since it seems to compress very well).
- http://stevehanov.ca/blog/index.php?id=115 (blog post about the FSA building algorithm)
- http://www.aclweb.org/anthology/J00-1002.pdf (COLING paper containing the algorithm)
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.5272&rep=rep1&type=pdf (discusses how to map to vocab ids in addition to just recognizing the vocab, as well as how to do the “inverse mapping” from id to string)
- https://issues.apache.org/jira/browse/LUCENE-2792 (Apache Lucene ticket for when this was first implemented there)
64-bit “hash and pray”
This appears to be the technique used by KenLM by default: hash the string to a 64-bit hash value using e.g. MurmurHash, and then just ignore the possibility of hash collisions for the unigrams. Then, this 64-bit intermediate number can be used as an index into a hash table containing the vocab id.
An alternative is to store a huge vector of the 64-bit hash values for every word, and then the mapping is the index into this vector where a word’s hash value is located. This can apparently be optimized using interpolation search to be $O(\log \log |V|)$ instead of $O(\log |V|)$.
Apparently this hasn’t caused people problems, at least in the MT community, since pretty much everyone is using KenLM instead of e.g. SRILM or IRSTLM. Still, it makes me uneasy to just ignore the collision possibility.