bootcat:help:lists [2012/05/30 15:24] (current)
eros created
Line 1: Line 1:
 +====== Blacklists and whitelists ======
  
 +==== Blacklists ====
 +
 +A blacklist is a list of "bad" words (e.g., pornographic terms). If a document contains more than a certain number of [[http://en.wikipedia.org/wiki/Type-token_distinction|types]] or [[http://en.wikipedia.org/wiki/Token_(parser)#Token|tokens]] from this list, it will be discarded.
 +
 +The "Types" and "Tokens" parameters set how many types and how many tokens from the blacklist are enough to cause a document to be discarded.
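
The blacklist check described above can be sketched roughly as follows. This is only an illustration, not BootCaT's actual code: the function names, the regex-based tokenization, and the choice to discard only when a count strictly exceeds its limit are all assumptions.

```python
import re


def tokenize(text):
    # Assumption: tokens are lowercased runs of word characters.
    # BootCaT's own tokenization may differ.
    return re.findall(r"\w+", text.lower())


def fails_blacklist(text, blacklist, max_types, max_tokens):
    """Return True if the document should be discarded (hypothetical helper).

    max_types and max_tokens stand in for the "Types" and "Tokens"
    parameters; a document is discarded when either count is exceeded.
    """
    hits = [tok for tok in tokenize(text) if tok in blacklist]
    hit_tokens = len(hits)       # every occurrence counts as a token
    hit_types = len(set(hits))   # each distinct word counts once as a type
    return hit_types > max_types or hit_tokens > max_tokens
```

For example, with the blacklist `{"spam", "junk"}`, the text "Spam spam junk and eggs" contains three blacklisted tokens but only two blacklisted types, so it is discarded with `max_types=1` and kept with `max_types=2, max_tokens=3`.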
 +
 +==== Whitelists ====
 +
 +A whitelist is a list of "good" words (e.g., function words). A document is included in the corpus only if it contains a certain number of [[http://en.wikipedia.org/wiki/Type-token_distinction|types]] and [[http://en.wikipedia.org/wiki/Token_(parser)#Token|tokens]] from this list, and if the ratio of tokens from the list to total tokens is above a certain threshold.
 +
 +The "Types" and "Tokens" parameters set the minimum number of whitelist types and tokens that a document must contain to be included. The "Ratio" parameter sets the minimum ratio of whitelist tokens to total tokens in the document.
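
The whitelist check can be sketched in the same spirit. Again this is a hedged illustration rather than BootCaT's implementation: the function names, the tokenization, the treatment of empty documents, and the use of inclusive minimums (`>=`) are assumptions.

```python
import re


def tokenize(text):
    # Assumption: tokens are lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())


def passes_whitelist(text, whitelist, min_types, min_tokens, min_ratio):
    """Return True if the document should be kept (hypothetical helper).

    min_types, min_tokens and min_ratio stand in for the "Types",
    "Tokens" and "Ratio" parameters; all three conditions must hold.
    """
    tokens = tokenize(text)
    if not tokens:
        # Assumption: an empty document never passes.
        return False
    hits = [tok for tok in tokens if tok in whitelist]
    hit_tokens = len(hits)
    hit_types = len(set(hits))
    ratio = hit_tokens / len(tokens)  # whitelist tokens / total tokens
    return (hit_types >= min_types
            and hit_tokens >= min_tokens
            and ratio >= min_ratio)
```

With the whitelist `{"the", "of", "and"}`, the text "The cat and the dog of the house" has 8 tokens, of which 5 are whitelist tokens covering 3 types, giving a ratio of 0.625. It passes with `min_types=3, min_tokens=5, min_ratio=0.5` and fails if the ratio threshold is raised to 0.7.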