help:lists

Differences

This shows you the differences between two versions of the page.

help:lists [2010/06/16 00:19] eros
help:lists [2012/02/17 19:54] eros
Line 3:
 ==== Blacklists ====
 
-A blacklist is a list of "bad" words (e.g., pornographic terms). If a document contains more than a certain number of [[wp>types]] or [[wp>tokens]] from this list, it will be discarded.
+A blacklist is a list of "bad" words (e.g., pornographic terms). If a document contains more than a certain number of [[http://en.wikipedia.org/wiki/Type-token_distinction|types]] or [[http://en.wikipedia.org/wiki/Token_(parser)#Token|tokens]] from this list, it will be discarded.
 
 The "Types" and "Tokens" parameters let you specify the number of types and tokens from the "bad" word list sufficient to cause a document to be discarded.
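
The blacklist rule described above can be sketched roughly as follows. This is only an illustration, not the tool's actual implementation: the function names, the lowercase whitespace tokenization, and the choice to treat each parameter as a threshold that triggers on being reached (`>=`) are all assumptions.

```python
def blacklist_hits(tokens, blacklist):
    """Return (types, tokens) counts of blacklisted words in a token list."""
    hits = [t for t in tokens if t in blacklist]
    # Types are distinct words; tokens are individual occurrences.
    return len(set(hits)), len(hits)


def is_discarded(text, blacklist, types_limit, tokens_limit):
    """Hypothetical check: discard if either limit is reached."""
    # Assumption: naive lowercase whitespace tokenization.
    tokens = text.lower().split()
    n_types, n_tokens = blacklist_hits(tokens, blacklist)
    # Either condition alone is enough to discard the document.
    return n_types >= types_limit or n_tokens >= tokens_limit
```

For instance, with a blacklist of two words and limits of 2 types and 3 tokens, a document that repeats one blacklisted word twice would be kept under these assumptions, while one that contains both words once would be discarded.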
Line 9:
 ==== Whitelists ====
 
-A whitelist is a list of "good" words (e.g., function words). A document is included in the corpus only if it contains a certain number of [[wp>types]] and [[wp>tokens]] from this list, and if the ratio of tokens from the list to total tokens is above a certain threshold.
+A whitelist is a list of "good" words (e.g., function words). A document is included in the corpus only if it contains a certain number of [[http://en.wikipedia.org/wiki/Type-token_distinction|types]] and [[http://en.wikipedia.org/wiki/Token_(parser)#Token|tokens]] from this list, and if the ratio of tokens from the list to total tokens is above a certain threshold.
 
 The "Types", "Tokens" and "Ratio" parameters let you specify the minimum number of types and tokens from the whitelist that a document must contain to be included, and the minimum ratio of tokens from the list to total tokens in the document that a document must have to be included.
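
A minimal sketch of the whitelist rule, under the same caveats: the function name, the lowercase whitespace tokenization, the handling of empty documents, and the use of `>=` for all three minimums are assumptions made for illustration, not the tool's documented behaviour.

```python
def passes_whitelist(text, whitelist, min_types, min_tokens, min_ratio):
    """Hypothetical check: keep a document only if all three minimums are met."""
    # Assumption: naive lowercase whitespace tokenization.
    tokens = text.lower().split()
    if not tokens:
        # Assumption: an empty document cannot satisfy the ratio test.
        return False
    hits = [t for t in tokens if t in whitelist]
    n_types = len(set(hits))          # distinct whitelisted words
    n_tokens = len(hits)              # occurrences of whitelisted words
    ratio = n_tokens / len(tokens)    # whitelisted tokens over total tokens
    return n_types >= min_types and n_tokens >= min_tokens and ratio >= min_ratio
```

Unlike the blacklist check, all three conditions have to hold at once, so a long document with many function words can still be rejected if those words make up too small a share of its tokens.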