Sentence Segmentation Tags


I noticed that the default filter chain automatically adds the tags <s> and </s> to the tokenized text, which can cause problems when evaluating IR systems since the tags change the ranking of documents in some cases. Another problem is that every query ends up matching every document, since these tags are added to all queries and documents alike, which can also hurt performance. It might be a good idea to create a special filter chain for retrieval tasks that does not add these tags.


Yeah, this is a good point. The <s> and </s> tags are there specifically for the case where we have $n$-grams with $n > 1$ so we can match BOS and EOS cases properly, but totally don’t matter for the unigram case (well, unless you want the number of sentences to implicitly be a feature in classification, which I guess might be useful?).
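To make the distinction concrete, here's a minimal Python sketch of why the padding matters (this is just an illustration of the idea, not the toolkit's actual implementation):

```python
def word_ngrams(tokens, n):
    """Pad a token stream with <s>/</s> and emit its n-grams."""
    padded = ["<s>"] + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

# With bigrams, the padding produces distinct sentence-boundary features
# like ("<s>", "the") and ("sat", "</s>"):
print(word_ngrams(["the", "cat", "sat"], 2))

# With unigrams, the tags just become two extra tokens that appear in
# every document and every query:
print(word_ngrams(["the", "cat", "sat"], 1))
```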

There are two ways I can see of fixing this:

  • One is to have an option in e.g. icu_tokenizer to omit the generation of the sentence start/end tags, and have the default be to omit them for the unigram case.

  • Another is to just use the existing filter framework and have a list_filter to remove any <s> or </s> that occurs in the token stream.

The first one is probably better, but the second one is very easy to do (and you can even fix this with just a config.toml now since the filter configuration is pretty flexible). For now I’ll probably just fix this (using option 1) in develop; if you’re using the latest released version it should be pretty easy to set up a list_filter in your filter chain.
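For reference, the list_filter workaround might look roughly like this in config.toml. This is only a sketch: I'm assuming the `[[analyzers.filter]]` table syntax and that the list filter is registered as "list" with `file` and `method` keys, so double-check the option names against the filter documentation; sentence-tags.txt is a hypothetical file you'd create containing just the <s> and </s> lines.

```toml
[[analyzers]]
method = "ngram-word"
ngram = 1
[[analyzers.filter]]
type = "icu-tokenizer"
# reject any token appearing in the word list (here, the two sentence tags)
[[analyzers.filter]]
type = "list"
file = "sentence-tags.txt"
method = "reject"
```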


OK, so this will be fixed in 1.3.7 which I’m releasing shortly. :smile:

You should be able to specify the following in your config file to suppress the <s> and </s> tokens now:

method = "ngram-word"
ngram = 1
filter = "default-unigram-chain"

If you’re manually configuring icu_tokenizer, you can suppress the tags with a configuration option, e.g.

method = "ngram-word"
ngram = 1
    type = "icu-tokenizer"
    suppress-tags = true