bootcat:help:corpus_creation_mode

Differences

This shows you the differences between two versions of the page.

bootcat:help:corpus_creation_mode [2013/06/15 13:33] – created erosbootcat:help:corpus_creation_mode [2023/04/19 10:56] (current) eros
Line 1: Line 1:
 ====== Corpus creation mode ====== ====== Corpus creation mode ======
  
-Version 0.71 of the BootCaT frontend introduced the possibility of skipping some of the steps involved in the corpus creation procedure.+You can choose between the following creation "modes":
  
-You can now choose between the following creation "modes":+  * [[bootcat:help:corpus_creation_mode#simple_mode_recommended|Simple mode]] 
 +  * [[bootcat:help:corpus_creation_mode#custom_tuples_advanced|Custom tuples]] 
 +  * [[bootcat:help:corpus_creation_mode#custom_urls_advanced|Custom URLs]] 
 +  * [[bootcat:help:corpus_creation_mode#local_files|Local files]] 
 +  * [[bootcat:help:corpus_creation_mode#local_queries|Local queries]]
  
-  * Simple mode (recommended) +{{:bootcat:help:corpus_creation_modes.png?nolink|}}
-  * Custom tuples (advanced) +
-  * Custom URLs (advanced)+
  
 ===== Simple mode (recommended) ===== ===== Simple mode (recommended) =====
Line 13: Line 15:
 This is the standard method for creating a BootCaT corpus: you choose seeds, build random tuples, collect URLs and finally build the corpus. This is the standard method for creating a BootCaT corpus: you choose seeds, build random tuples, collect URLs and finally build the corpus.
  
-If you're a novice user this is the mode you should use (see the [[bootcat:tutorials:basic_1|tutorial]] for more info on this).+If you're a novice user this is the mode you should use (see the [[bootcat:tutorials:basic_1|tutorial]] for more info on how to build a corpus).
  
 ===== Custom tuples (advanced) ===== ===== Custom tuples (advanced) =====
  
-In this mode you skip the seed selection steps and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.+In this mode you skip the seed selection step and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.
  
 Remember that each line will become a single query to the search engine, therefore phrases should be enclosed in quotes. Your tuples should look like this: Remember that each line will become a single query to the search engine, therefore phrases should be enclosed in quotes. Your tuples should look like this:
Line 33: Line 35:
 ===== Custom URLs (advanced) ===== ===== Custom URLs (advanced) =====
  
-In this mode you'll skip directly to the final stap: you'll be asked to provide a list of URLs in a text file which will have to contain one URL per line.+In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs).
  
-You'll have to edit the list separately with a text editor (like Notepad for Windows or TextEdit for Mac) and save it in ''txt'' format.+You'll be asked to provide a text file containing one **valid** URL per line, i.e. each line must begin with ''http:%%//%%'' or ''https:%%//%%''.
 + 
 +You'll have to edit the list separately using a text editor (like [[https://notepad-plus-plus.org/|Notepad++]] for Windows or TextEdit for Mac) and save it in ''txt'' format.
  
 The text file should look like this: The text file should look like this:
Line 41: Line 45:
 <file> <file>
 http://foo.com/bar.htm http://foo.com/bar.htm
-http://bar.com/foo.php+https://example.com/report.pdf 
 +https://bar.com/foo.php
 http://some.site.com/index.html http://some.site.com/index.html
-...+http://random.docs.org/thesis.docx
 </file> </file>
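
The one-URL-per-line rule can be checked before the list is handed to BootCaT. The following is a minimal Python sketch, not part of BootCaT itself: the helper name ''find_invalid_urls'' is hypothetical, and skipping blank lines is an assumption made for illustration.

```python
def find_invalid_urls(lines):
    """Return (line_number, text) pairs for lines that are not valid URLs.

    A line counts as valid here only if it begins with http:// or https://.
    Blank lines are skipped (an assumption, not documented BootCaT behaviour).
    """
    invalid = []
    for number, line in enumerate(lines, start=1):
        url = line.strip()
        if not url:
            continue  # assumption: ignore empty lines
        if not url.startswith(("http://", "https://")):
            invalid.append((number, url))
    return invalid


if __name__ == "__main__":
    sample = [
        "http://foo.com/bar.htm",
        "https://example.com/report.pdf",
        "www.missing-scheme.org/page.html",  # would be reported
    ]
    for number, url in find_invalid_urls(sample):
        print(f"line {number}: {url}")
```

To check a real list, the lines of the text file can be passed in, for example ''find_invalid_urls(open("urls.txt", encoding="utf-8"))'', where ''urls.txt'' is a placeholder file name.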
 +
 +NB: up to version 1.21, BootCaT does not accept URL lists encoded as "UTF8 **with BOM**". Please make sure your URL list is saved as "UTF8" (**without BOM**); the issue will be solved in future versions of BootCaT.
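
If a text editor has already saved the list with a byte order mark, the mark can be removed afterwards. The sketch below is one possible way to do this in Python and is not a BootCaT feature; the function names ''strip_utf8_bom'' and ''rewrite_without_bom'' are hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def rewrite_without_bom(path) -> bool:
    """Rewrite the file at `path` without a UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file
    was left untouched.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False
    with open(path, "wb") as handle:
        handle.write(cleaned)
    return True
```

Working on raw bytes leaves the rest of the file exactly as it was saved, so the URLs themselves are not re-encoded.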
 +===== Local files (advanced) =====
 +
 +Using this mode BootCaT will process all files contained in a folder on your computer. Files will be cleaned and the corpus files will be created.
 +
 +Most common file formats are supported, including ''html'', ''pdf'' and ''doc'' files.
 +
 +===== Local queries (advanced) =====
 +
 +Using this mode, you can query Google normally using a web browser and save the result pages to a folder. Then you can tell BootCaT where this folder is and it will extract the URLs from the queries you saved.
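
To give a rough idea of what extracting URLs from a saved result page involves, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not BootCaT's actual extraction logic: the names ''LinkCollector'' and ''extract_urls'' are hypothetical, and the sketch simply keeps every absolute ''http''/''https'' link it finds, whereas a real tool would also need to filter out the search engine's own links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that holds an absolute URL."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value can be None for attributes written without a value
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.urls.append(value)


def extract_urls(html_text):
    """Return the absolute URLs found in html_text, in order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html_text)
    seen = set()
    unique = []
    for url in collector.urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

Running ''extract_urls'' over the text of each saved page in the folder and writing the results one per line would yield a list in the same format used by the Custom URLs mode.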
 +
  • Last modified: 2013/06/15 13:33
  • by eros