I’m working on loading .arpa files and having the language model use them. I just want to make sure I am understanding the calculation correctly based on a premade .arpa file.
See the example here, slide 3. What's on that slide makes perfect sense, but what happens with OOV words? For example, suppose we want to score logPr(! | hello people), where "people" is OOV. We can't just use a backoff weight of 0 for "hello people", since a log backoff of 0 means no penalty at all, which essentially assigns it a very high probability.
Do we assume that there is an <unk> token somewhere in the file?
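For what it's worth, here is a minimal sketch of how the standard ARPA backoff recursion is usually implemented, with OOV words mapped to <unk> (this assumes the model was built with an <unk> entry, which is exactly the question). The n-gram tables and their log10 values here are made up purely for illustration, not taken from any real model:

```python
# Toy ARPA-style tables: log10 probability and log10 backoff weight
# per n-gram tuple. All values are invented for illustration.
logprob = {
    ("<unk>",): -2.0,
    ("hello",): -1.2,
    ("people",): -1.5,
    ("hello", "people"): -0.8,
}
backoff = {
    ("hello",): -0.3,
    ("people",): -0.4,
    ("hello", "people"): -0.2,
}
VOCAB = frozenset({"hello", "people"})

def score(word, context):
    """Standard ARPA backoff recursion in log10 space.

    OOV words are mapped to <unk>. If a context n-gram has no listed
    backoff weight, its log backoff is taken as 0.0 (a factor of 1),
    which is the usual convention for contexts absent from the file.
    """
    if word not in VOCAB:
        word = "<unk>"
    ngram = context + (word,)
    if ngram in logprob:
        return logprob[ngram]
    if not context:
        # Unigram fallback; with an <unk> entry this always succeeds.
        return logprob[(word,)]
    # Back off: add the context's backoff weight and shorten the context.
    bow = backoff.get(context, 0.0)
    return bow + score(word, context[1:])
```

With these toy numbers, score("dog", ("hello",)) maps "dog" to <unk>, fails to find the bigram, and returns backoff[("hello",)] + logprob[("<unk>",)] = -0.3 + -2.0 = -2.3. Without an <unk> unigram in the file, the recursion would bottom out with no probability to return, which is why toolkits either require <unk> or refuse to score OOV words.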