bootcat:help:lists


bootcat:help:lists [2012/05/30 13:24] (current) – created eros
Line 1: Line 1:
 +====== Blacklists and whitelists ======
  
 +==== Blacklists ====
 +
 +A blacklist is a list of "bad" words (e.g., pornographic terms). If a document contains more than a certain number of [[http://en.wikipedia.org/wiki/Type-token_distinction|types]] or [[http://en.wikipedia.org/wiki/Token_(parser)#Token|tokens]] from this list, it will be discarded.
 +
 +The "Types" and "Tokens" parameters set these thresholds: the number of distinct blacklisted words (types) and the number of occurrences of blacklisted words (tokens) sufficient to cause a document to be discarded.
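 +
 +The blacklist check can be sketched as follows. This is a minimal illustration, not BootCaT's actual code: the function names, the regex tokenizer, and the use of "at least" (rather than strictly "more than") for the threshold comparison are all assumptions made for the example.

```python
import re


def tokenize(text):
    # Hypothetical tokenizer (an assumption): lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def blacklist_counts(text, blacklist):
    # Collect every occurrence of a blacklisted word in the document.
    hits = [tok for tok in tokenize(text) if tok in blacklist]
    # Types = distinct blacklisted words; tokens = total occurrences.
    return len(set(hits)), len(hits)


def is_discarded(text, blacklist, types_threshold, tokens_threshold):
    # Assumption: reaching either threshold is enough to discard the document.
    n_types, n_tokens = blacklist_counts(text, blacklist)
    return n_types >= types_threshold or n_tokens >= tokens_threshold


blacklist = {"spam", "scam"}
doc = "Spam spam scam offer"
print(blacklist_counts(doc, blacklist))
print(is_discarded(doc, blacklist, types_threshold=3, tokens_threshold=3))
```

 +In this sketch, a word repeated several times raises the token count but adds only one type, which is why the two parameters can be set independently.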
 +
 +==== Whitelists ====
 +
 +A whitelist is a list of "good" words (e.g., function words). A document is included in the corpus only if it contains at least a certain number of [[http://en.wikipedia.org/wiki/Type-token_distinction|types]] and [[http://en.wikipedia.org/wiki/Token_(parser)#Token|tokens]] from this list, and if the ratio of tokens from the list to total tokens is above a certain threshold.
 +
 +The "Types", "Tokens" and "Ratio" parameters set the three conditions a document must meet to be included: the minimum number of types from the whitelist, the minimum number of tokens from the whitelist, and the minimum ratio of whitelist tokens to total tokens in the document.
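 +
 +The whitelist check can be sketched in the same way. Again, this is only an illustration under stated assumptions: the function names and the regex tokenizer are hypothetical, and the ratio is compared with "greater than or equal to", which may differ from how BootCaT treats a document exactly at the threshold.

```python
import re


def tokenize(text):
    # Hypothetical tokenizer (an assumption): lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def whitelist_stats(text, whitelist):
    tokens = tokenize(text)
    hits = [tok for tok in tokens if tok in whitelist]
    # Ratio of whitelist tokens to all tokens; an empty document gets 0.0.
    ratio = len(hits) / len(tokens) if tokens else 0.0
    return len(set(hits)), len(hits), ratio


def is_included(text, whitelist, min_types, min_tokens, min_ratio):
    # All three conditions must hold for the document to be kept.
    n_types, n_tokens, ratio = whitelist_stats(text, whitelist)
    return n_types >= min_types and n_tokens >= min_tokens and ratio >= min_ratio


whitelist = {"the", "of", "and"}
doc = "The cat and the dog"
print(whitelist_stats(doc, whitelist))
print(is_included(doc, whitelist, min_types=2, min_tokens=3, min_ratio=0.5))
```

 +Unlike the blacklist sketch, where meeting either threshold discards the document, all three whitelist conditions are combined with "and", following the wording of the description above.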
  • bootcat/help/lists.txt
  • Last modified: 2012/05/30 13:24
  • by eros