help:lists

Differences

This shows you the differences between two versions of the page.

help:lists [2010/06/16 00:19] eros
help:lists [2012/05/30 15:24] (current) – removed eros
Line 1: Line 1:
-====== Blacklists and whitelists ====== 
  
-==== Blacklists ==== 
- 
-A blacklist is a list of "bad" words (e.g., pornographic terms). If a document contains more than a certain number of [[wp>types]] or [[wp>tokens]] from this list, it will be discarded. 
- 
-The "Types" and "Tokens" parameters set how many types and tokens from the blacklist are enough for a document to be discarded. 
- 
-==== Whitelists ==== 
- 
-A whitelist is a list of "good" words (e.g., function words). A document is included in the corpus only if it contains a certain number of [[wp>types]] and [[wp>tokens]] from this list, and if the ratio of tokens from the list to total tokens is above a certain threshold. 
- 
-The "Types", "Tokens" and "Ratio" parameters set three thresholds that a document must reach to be included: the minimum number of whitelist types, the minimum number of whitelist tokens, and the minimum ratio of whitelist tokens to total tokens in the document. 
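
The blacklist and whitelist rules described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function names (`count_hits`, `passes_blacklist`, `passes_whitelist`) and parameter names are hypothetical, the document is assumed to be already tokenized and case-normalized, and the choice of strict (`>`) versus inclusive (`>=`) comparisons is an assumption, since the page does not spell it out.

```python
from collections import Counter


def count_hits(tokens, wordlist):
    """Return (n_types, n_tokens): distinct and total list words found in tokens."""
    hits = Counter(t for t in tokens if t in wordlist)
    return len(hits), sum(hits.values())


def passes_blacklist(tokens, blacklist, max_types, max_tokens):
    """Keep a document unless it exceeds either blacklist limit.

    Assumption: "more than a certain number of types or tokens" is read
    as a strict comparison, and either count alone is enough to discard.
    """
    n_types, n_tokens = count_hits(tokens, blacklist)
    return not (n_types > max_types or n_tokens > max_tokens)


def passes_whitelist(tokens, whitelist, min_types, min_tokens, min_ratio):
    """Keep a document only if it meets all three whitelist thresholds.

    Assumption: the thresholds are inclusive minimums.
    """
    if not tokens:
        return False  # an empty document has no ratio to compare
    n_types, n_tokens = count_hits(tokens, whitelist)
    ratio = n_tokens / len(tokens)
    return n_types >= min_types and n_tokens >= min_tokens and ratio >= min_ratio


doc = "the cat sat on the mat".split()
function_words = {"the", "on", "of"}

# Two whitelist types ("the", "on"), three whitelist tokens, ratio 3/6.
keep = passes_whitelist(doc, function_words, min_types=2, min_tokens=3, min_ratio=0.5)
```

Counting types and tokens separately matters because a single repeated word raises the token count without raising the type count, so the two parameters catch different kinds of documents.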
  • help/lists.1276640368.txt.gz
  • Last modified: 2010/06/16 00:19
  • by eros