bootcat:help:corpus_creation_mode

Differences

This shows you the differences between two versions of the page.

bootcat:help:corpus_creation_mode [2013/06/15 13:33] – created eros
bootcat:help:corpus_creation_mode [2019/11/05 12:55] eros
Line 1: Line 1:
 ====== Corpus creation mode ====== ====== Corpus creation mode ======
  
-Version 0.71 of the BootCaT frontend introduced the possibility of skipping some of the steps involved in the corpus creation procedure.+You can choose between the following creation "modes":
  
-You can now choose between the following creation "modes":+  * [[bootcat:help:corpus_creation_mode#simple_mode_recommended|Simple mode]] (recommended) 
 +  * [[bootcat:help:corpus_creation_mode#custom_tuples_advanced|Custom tuples]] (advanced) 
 +  * [[bootcat:help:corpus_creation_mode#custom_urls_advanced|Custom URLs]] (advanced) 
 +  * [[bootcat:help:corpus_creation_mode#local_files|Local files]] (advanced) 
 +  * [[bootcat:help:corpus_creation_mode#local_queries|Local queries]] (advanced)
  
-  * Simple mode (recommended) +{{:bootcat:help:corpus_creation_modes.png?nolink|}}
-  * Custom tuples (advanced) +
-  * Custom URLs (advanced)+
  
 ===== Simple mode (recommended) ===== ===== Simple mode (recommended) =====
Line 13: Line 15:
 This is the standard method for creating a BootCaT corpus: you choose seeds, build random tuples, collect URLs and finally build the corpus. This is the standard method for creating a BootCaT corpus: you choose seeds, build random tuples, collect URLs and finally build the corpus.
  
-If you're a novice user this is the mode you should use (see the [[bootcat:tutorials:basic_1|tutorial]] for more info on this).+If you're a novice user this is the mode you should use (see the [[bootcat:tutorials:basic_1|tutorial]] for more info on how to build a corpus).
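As an illustration of the "build random tuples" step described above, here is a minimal Python sketch. It is not BootCaT's own code: the function name ''build_random_tuples'', its parameters and the default values are hypothetical, and quoting multi-word seeds is an assumption made so that each tuple can be sent as a single search query.

```python
import random


def build_random_tuples(seeds, tuple_size=3, n_tuples=10, rng=None):
    """Return up to n_tuples distinct random seed combinations as query strings.

    Hypothetical helper for illustration only; BootCaT's real procedure
    may differ in how it samples and formats tuples.
    """
    rng = rng or random.Random()
    if tuple_size > len(seeds):
        raise ValueError("tuple_size cannot exceed the number of seeds")

    seen = set()
    queries = []
    # Bound the number of attempts so the loop ends even when fewer than
    # n_tuples distinct combinations exist.
    for _ in range(n_tuples * 50):
        if len(queries) == n_tuples:
            break
        combo = tuple(sorted(rng.sample(seeds, tuple_size)))
        if combo in seen:
            continue  # skip combinations that were already produced
        seen.add(combo)
        # Assumption: multi-word seeds are quoted so they are searched as phrases.
        queries.append(" ".join(f'"{s}"' if " " in s else s for s in combo))
    return queries


if __name__ == "__main__":
    example_seeds = ["corpus", "web crawler", "linguistics", "seed"]
    for query in build_random_tuples(example_seeds, tuple_size=2, n_tuples=3):
        print(query)
```

Each returned string corresponds to one line of a tuple list, i.e. one query.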
  
 ===== Custom tuples (advanced) ===== ===== Custom tuples (advanced) =====
  
-In this mode you skip the seed selection steps and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.+In this mode you skip the seed selection step and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.
  
 Remember that each line will become a single query to the search engine, therefore phrases should be enclosed in quotes. Your tuples should look like this: Remember that each line will become a single query to the search engine, therefore phrases should be enclosed in quotes. Your tuples should look like this:
Line 33: Line 35:
 ===== Custom URLs (advanced) ===== ===== Custom URLs (advanced) =====
  
-In this mode you'll skip directly to the final stap: you'll be asked to provide a list of URLs in a text file which will have to contain one URL per line.+In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs).
  
-You'll have to edit the list separately with a text editor (like Notepad for Windows or TextEdit for Mac) and save it in ''txt'' format.+You'll be asked to provide a text file containing one URL per line. 
 + 
 +You'll have to edit the list separately using a text editor (like Notepad for Windows or TextEdit for Mac) and save it in ''txt'' format.
  
 The text file should look like this: The text file should look like this:
Line 45: Line 49:
 ... ...
 </file> </file>
 +
 +**N.B.**: only URLs pointing to HTML files will be downloaded (typical extensions for such files are ''.htm'', ''.html'', ''.php'', ''.asp''). If the list you provide contains URLs ending in PDF, DOC, DOCX etc., BootCaT will display an error and refuse to proceed. To continue, you'll have to remove the links to unsupported file formats from the list.
 +
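Since a URL list containing unsupported formats stops the procedure, it can help to check the list before loading it. The following Python sketch separates the URLs to keep from those to remove. It is only an illustration: the function name ''split_url_list'' is hypothetical, and the set of rejected extensions contains just the three formats named above, because the full list BootCaT rejects is not given here.

```python
import posixpath
from urllib.parse import urlparse

# Assumption: only the formats explicitly named in the note above.
UNSUPPORTED_EXTENSIONS = {".pdf", ".doc", ".docx"}


def split_url_list(lines):
    """Split an iterable of URL lines into (kept, rejected) lists.

    The extension is taken from the path part of the URL, so query
    strings such as ?id=3 do not affect the check.
    """
    kept, rejected = [], []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        extension = posixpath.splitext(urlparse(url).path)[1].lower()
        if extension in UNSUPPORTED_EXTENSIONS:
            rejected.append(url)
        else:
            kept.append(url)
    return kept, rejected


if __name__ == "__main__":
    sample = [
        "http://example.com/index.html",
        "http://example.com/paper.PDF",
        "http://example.com/page.php?id=3",
    ]
    kept, rejected = split_url_list(sample)
    print("keep:", kept)
    print("remove:", rejected)
```

The kept URLs can then be saved, one per line, in a ''txt'' file.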
 +===== Local files (advanced) =====
 +
 +Using this mode BootCaT will process all files contained in a folder (and its subfolders) on your computer. Files will be cleaned and a single text file will be created.
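To give a rough idea of what processing a folder and its subfolders into a single text file involves, here is a minimal Python sketch. It is not how BootCaT implements this mode: the function name ''build_corpus_from_folder'' is hypothetical, only ''.txt'' files are read, and the "cleaning" is reduced to whitespace normalisation as a stand-in for whatever BootCaT actually does.

```python
import re
from pathlib import Path


def build_corpus_from_folder(folder, output_file):
    """Concatenate all .txt files under folder (recursively) into output_file.

    Returns the number of non-empty files written. Hypothetical helper:
    the file selection and cleaning rules are assumptions.
    """
    folder = Path(folder)
    output_file = Path(output_file)
    written = 0
    with output_file.open("w", encoding="utf-8") as out:
        # rglob walks the folder and all of its subfolders.
        for path in sorted(folder.rglob("*.txt")):
            if path.resolve() == output_file.resolve():
                continue  # never read the corpus file being written
            text = path.read_text(encoding="utf-8", errors="replace")
            # Stand-in cleaning step: collapse runs of whitespace.
            cleaned = re.sub(r"\s+", " ", text).strip()
            if cleaned:
                out.write(cleaned + "\n")
                written += 1
    return written
```

For example, ''build_corpus_from_folder("my_texts", "corpus.txt")'' would write one cleaned line per source file; both names are placeholders.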
  • bootcat/help/corpus_creation_mode.txt
  • Last modified: 2023/04/19 10:56
  • by eros