help:lists

Differences

This shows you the differences between two versions of the page.

help:lists [2010/06/02 17:41] – created eros
help:lists [2012/02/17 19:54] eros
Line 1:
-====== Whitelists and blacklists ======
+====== Blacklists and whitelists ======
  
 ==== Blacklists ====
  
-A blacklist is a list of "bad" words (e.g., pornographic terms). If a document contains more than a certain number of types or tokens from this list, it will be discarded.
+A blacklist is a list of "bad" words (e.g., pornographic terms). If a document contains more than a certain number of [[http://en.wikipedia.org/wiki/Type-token_distinction|types]] or [[http://en.wikipedia.org/wiki/Token_(parser)#Token|tokens]] from this list, it will be discarded.
  
 The "Types" and "Tokens" parameters let you specify the number of types and tokens from the "bad" word list sufficient to cause a document to be discarded.
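
The blacklist rule described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function names, the regex tokenizer, the lowercasing, and the choice to discard when a count reaches the threshold (rather than strictly exceeds it) are all assumptions.

```python
import re

def blacklist_counts(text, blacklist):
    """Return (types, tokens): distinct and total blacklisted words in text."""
    # Assumption: simple lowercase word tokenization; the real tool may differ.
    words = re.findall(r"\w+", text.lower())
    hits = [w for w in words if w in blacklist]
    return len(set(hits)), len(hits)

def is_discarded(text, blacklist, types_limit, tokens_limit):
    """Discard if either the type count or the token count reaches its limit."""
    # Assumption: "sufficient to cause a document to be discarded" is read as >=.
    types, tokens = blacklist_counts(text, blacklist)
    return types >= types_limit or tokens >= tokens_limit

blacklist = {"spam", "scam"}
doc = "Spam spam and more spam, no scam here"
print(blacklist_counts(doc, blacklist))    # (2, 4)
print(is_discarded(doc, blacklist, 3, 4))  # True: 4 tokens reach the token limit
```

Here "spam" occurs three times and "scam" once, so the document has 2 blacklisted types and 4 blacklisted tokens.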
Line 9:
 ==== Whitelists ====
  
-A whitelist is a list of "good" words (e.g., function words). A document is included in the corpus only if it contains a certain number of types and tokens from this list, and if the ratio of tokens from the list to total tokens is above a certain threshold.
+A whitelist is a list of "good" words (e.g., function words). A document is included in the corpus only if it contains a certain number of [[http://en.wikipedia.org/wiki/Type-token_distinction|types]] and [[http://en.wikipedia.org/wiki/Token_(parser)#Token|tokens]] from this list, and if the ratio of tokens from the list to total tokens is above a certain threshold.
  
 The "Types", "Tokens" and "Ratio" parameters let you specify the minimum number of types and tokens from the whitelist that a document must contain to be included, and the minimum ratio of tokens from the list to total tokens in the document that a document must have to be included.
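
The whitelist rule can be sketched the same way. Again this is a hypothetical illustration under stated assumptions: the function names, the regex tokenizer, and treating each parameter as an inclusive minimum are not taken from the tool itself.

```python
import re

def whitelist_stats(text, whitelist):
    """Return (types, tokens, ratio) of whitelisted words in text."""
    # Assumption: simple lowercase word tokenization; the real tool may differ.
    words = re.findall(r"\w+", text.lower())
    hits = [w for w in words if w in whitelist]
    ratio = len(hits) / len(words) if words else 0.0
    return len(set(hits)), len(hits), ratio

def is_included(text, whitelist, min_types, min_tokens, min_ratio):
    """Include only if all three minimums are met."""
    # Assumption: each threshold is inclusive (>=).
    types, tokens, ratio = whitelist_stats(text, whitelist)
    return types >= min_types and tokens >= min_tokens and ratio >= min_ratio

whitelist = {"the", "of", "and", "a"}
doc = "The cat and the dog sat on a mat"
types, tokens, ratio = whitelist_stats(doc, whitelist)
print(types, tokens, round(ratio, 3))         # 3 4 0.444
print(is_included(doc, whitelist, 3, 4, 0.4)) # True
```

The sample document has 9 tokens, of which 4 are on the whitelist ("the" twice, "and", "a"), giving 3 types, 4 tokens, and a ratio of 4/9. Unlike the blacklist check, all conditions must hold at once for the document to be kept.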