Calculating probabilities from an .arpa file


I’m working on loading .arpa files so that a language model can use them. I just want to make sure I understand how probabilities are calculated from a premade .arpa file.

See the example here, slide 3. What’s on it makes perfect sense, but what happens with OOV words? For example, suppose we want to score logPr(! | hello people), where people is OOV. We can’t just use a backoff weight of 0 for hello people: in log space that means no penalty at all, which essentially assigns the OOV context a very high probability.

Do we assume that there are <unk> tokens somewhere in the file?


In KenLM at least, it seems like <unk> is required.
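
To make the role of <unk> concrete, here is a minimal sketch of the backoff lookup as it is usually described for ARPA models. It is not KenLM's actual implementation. The toy table, its numbers, and the names map_oov, log_prob and score are all made up for illustration. Values are log10, as stored in an .arpa file, and a context with no entry is assumed to contribute a log backoff weight of 0.

```python
# Toy model: n-gram tuple -> (log10 probability, log10 backoff weight).
# All numbers are invented for illustration, not taken from a real .arpa file.
TOY_LM = {
    ("<unk>",): (-2.0, 0.0),
    ("hello",): (-1.0, -0.5),
    ("!",): (-1.2, -0.3),
    ("hello", "<unk>"): (-1.5, -0.4),
    ("<unk>", "!"): (-0.7, 0.0),
}


def map_oov(words, lm):
    """Replace every word that has no unigram entry with <unk>."""
    return tuple(w if (w,) in lm else "<unk>" for w in words)


def log_prob(word, context, lm):
    """Return log10 P(word | context) via the usual backoff recursion."""
    ngram = tuple(context) + (word,)
    if ngram in lm:
        return lm[ngram][0]
    if not context:
        # Nothing left to back off to: the model has no <unk> unigram.
        raise KeyError(f"no unigram entry for {word!r}; the model needs <unk>")
    # Assumption: a context that is not listed contributes a log backoff of 0.
    backoff = lm.get(tuple(context), (0.0, 0.0))[1]
    return backoff + log_prob(word, context[1:], lm)


def score(word, context, lm=TOY_LM):
    """Map OOV words to <unk>, then score the last word given the rest."""
    mapped = map_oov(tuple(context) + (word,), lm)
    return log_prob(mapped[-1], mapped[:-1], lm)


if __name__ == "__main__":
    # "people" is OOV, so the query becomes logPr(! | hello <unk>).
    print(score("!", ("hello", "people")))
```

In this toy table, people is mapped to <unk>, so the lookup uses the backoff weight of hello <unk> plus the log probability of <unk> !, rather than falling back on an unpenalized missing context. If the table had no <unk> entry, the sketch raises an error instead of guessing a probability.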