Blacklists and whitelists

A blacklist is a list of “bad” words (e.g., pornographic terms). If a document contains more than a certain number of types (distinct words) or tokens (total word occurrences) from this list, it is discarded.

The “Types” and “Tokens” parameters set these two limits: how many types and how many tokens from the blacklist are enough to cause a document to be discarded.
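
The blacklist check can be illustrated with a minimal Python sketch. This is not BootCaT's actual implementation: the function names, the lowercase `\w+` tokenization, and the strict "more than" comparison are all assumptions made for the example.

```python
import re


def count_list_hits(text, word_list):
    """Return (types, tokens) for the words of `text` found in `word_list`.

    Tokenization is an assumption: lowercase the text and split on runs of
    word characters. `word_list` is expected to hold lowercase words.
    """
    words = re.findall(r"\w+", text.lower())
    hits = [w for w in words if w in word_list]
    # Types are distinct matching words; tokens are all matching occurrences.
    return len(set(hits)), len(hits)


def fails_blacklist(text, blacklist, max_types, max_tokens):
    """Return True if the document should be discarded.

    Assumption: a document is discarded when it contains MORE THAN
    `max_types` blacklist types OR more than `max_tokens` blacklist tokens.
    """
    types, tokens = count_list_hits(text, blacklist)
    return types > max_types or tokens > max_tokens
```

For example, with the hypothetical blacklist `{"spam", "eggs"}`, the text "spam spam eggs ham" contains 2 blacklist types and 3 blacklist tokens, so it is discarded when `max_types` is 1 but kept when `max_types` is 2 and `max_tokens` is 3.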

A whitelist is a list of “good” words (e.g., function words). A document is included in the corpus only if it contains at least a certain number of types and tokens from this list, and if the ratio of whitelist tokens to total tokens in the document is above a certain threshold.

The “Types”, “Tokens” and “Ratio” parameters set these three thresholds: the minimum number of whitelist types, the minimum number of whitelist tokens, and the minimum ratio of whitelist tokens to total tokens that a document must have to be included.
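
The whitelist check can be sketched in the same way. Again, this is only an illustration under stated assumptions, not BootCaT's code: the function name, the lowercase `\w+` tokenization, the treatment of each threshold as an inclusive minimum, and the rejection of empty documents are all choices made for the example.

```python
import re


def passes_whitelist(text, whitelist, min_types, min_tokens, min_ratio):
    """Return True if the document should be included in the corpus.

    Assumptions: words are lowercase runs of word characters, `whitelist`
    holds lowercase words, all three thresholds are inclusive minimums,
    and a document with no words at all is rejected.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return False
    hits = [w for w in words if w in whitelist]
    types = len(set(hits))            # distinct whitelist words
    tokens = len(hits)                # all whitelist occurrences
    ratio = tokens / len(words)       # whitelist tokens / total tokens
    # All three conditions must hold for the document to be kept.
    return types >= min_types and tokens >= min_tokens and ratio >= min_ratio
```

For example, with the hypothetical whitelist `{"the", "on", "of"}`, the text "the cat sat on the mat" has 2 whitelist types, 3 whitelist tokens and a ratio of 3/6 = 0.5. It passes with thresholds (2, 3, 0.5) but fails if the minimum number of types is raised to 3 or the minimum ratio to 0.6.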

  • Last modified: 2012/05/30 13:24
  • by eros