bootcat:help:corpus_creation_mode

Differences

This shows you the differences between two versions of the page.

bootcat:help:corpus_creation_mode [2019/11/05 13:01] – [Custom URLs (advanced)] eros
bootcat:help:corpus_creation_mode [2023/04/19 10:56] (current) eros
Line 3: Line 3:
 You can choose between the following creation "modes": You can choose between the following creation "modes":
  
-  * [[bootcat:help:corpus_creation_mode#simple_mode_recommended|Simple mode]] (recommended) +  * [[bootcat:help:corpus_creation_mode#simple_mode_recommended|Simple mode]] 
-  * [[bootcat:help:corpus_creation_mode#custom_tuples_advanced|Custom tuples]] (advanced) +  * [[bootcat:help:corpus_creation_mode#custom_tuples_advanced|Custom tuples]] 
-  * [[bootcat:help:corpus_creation_mode#custom_urls_advanced|Custom URLs]] (advanced) +  * [[bootcat:help:corpus_creation_mode#custom_urls_advanced|Custom URLs]] 
-  * [[bootcat:help:corpus_creation_mode#local_files|Local files]] (advanced) +  * [[bootcat:help:corpus_creation_mode#local_files|Local files]] 
-  * [[bootcat:help:corpus_creation_mode#local_queries|Local queries]] (advanced)+  * [[bootcat:help:corpus_creation_mode#local_queries|Local queries]]
  
 {{:bootcat:help:corpus_creation_modes.png?nolink|}} {{:bootcat:help:corpus_creation_modes.png?nolink|}}
Line 37: Line 37:
 In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs).  In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs). 
  
-You'll be asked to provide a text file containing one URL per line.+You'll be asked to provide a text file containing one **valid** URL per line, i.e. each line must begin with ''http:%%//%%'' or ''https:%%//%%''.
  
-You'll have to edit the list separately using a text editor (like Notepad for Windows or TextEdit for Mac) and save it in ''txt'' format.+You'll have to edit the list separately using a text editor (like [[https://notepad-plus-plus.org/|Notepad++]] for Windows or TextEdit for Mac) and save it in ''txt'' format.
  
 The text file should look like this: The text file should look like this:
Line 49: Line 49:
 http://some.site.com/index.html http://some.site.com/index.html
 http://random.docs.org/thesis.docx http://random.docs.org/thesis.docx
-... 
 </file> </file>
  
-**N.B.**: you need to provide a list of valid URLs, i.e. each line must begin with ''http://'' or ''https://'' +NB: up to version 1.21, BootCaT does not accept URL lists encoded as "UTF8 **with BOM**"; please make sure your URL list is saved as "UTF8" (**without BOM**). The issue will be solved in future versions of BootCaT.
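
If you want to check a URL list before loading it, the two requirements above (every line begins with ''http:%%//%%'' or ''https:%%//%%'', and the file is UTF-8 without BOM) can be tested with a short script. The following is only a sketch and is not part of BootCaT: the function names ''check_url_list'' and ''write_url_list'' are made up for this example, and skipping blank lines is an assumption.

```python
import codecs
from pathlib import Path


def check_url_list(path):
    """Return (urls, problems) for a text file with one URL per line."""
    raw = Path(path).read_bytes()
    problems = []
    # A UTF-8 byte order mark is the three bytes EF BB BF at the start of the file.
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[len(codecs.BOM_UTF8):]
    urls = []
    for number, line in enumerate(raw.decode("utf-8").splitlines(), start=1):
        url = line.strip()
        if not url:
            continue  # assumption: blank lines are simply skipped
        if url.startswith(("http://", "https://")):
            urls.append(url)
        else:
            problems.append(f"line {number}: does not begin with http:// or https://: {url}")
    return urls, problems


def write_url_list(path, urls):
    """Write the URLs back as UTF-8 without a BOM, one per line."""
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")
```

Calling ''check_url_list("urls.txt")'' returns the accepted URLs together with a list of problems. Passing the accepted URLs to ''write_url_list'' saves a clean copy, since Python's ''utf-8'' codec does not write a BOM.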
 ===== Local files (advanced) ===== ===== Local files (advanced) =====
  
-Using this mode BootCaT will process all files contained in a folder (and its subfolders) on your computer. Files will be cleaned and a single text file will be created.+Using this mode BootCaT will process all files contained in a folder on your computer. Files will be cleaned and the corpus files will be created. 
 + 
 +Most common file formats are supported, including ''html'', ''pdf'' and ''doc'' files. 
 + 
 +===== Local queries (advanced) ===== 
 + 
 +Using this mode, you can query Google normally using a web browser and save the result pages to a folder. Then you can tell BootCaT where this folder is and it will extract the URLs from the queries you saved. 
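
To give a rough idea of what "extracting the URLs" from saved result pages involves, here is a minimal sketch using only the Python standard library. It is not BootCaT's actual implementation: the names ''LinkCollector'' and ''urls_from_saved_pages'' are hypothetical, and the sketch assumes the saved pages are ''.htm''/''.html'' files and simply keeps every absolute ''http''/''https'' link it finds, without filtering out links that point back to the search engine itself.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect absolute http(s) link targets from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def urls_from_saved_pages(folder):
    """Return the unique links found in the saved pages, in first-seen order."""
    seen = {}
    for page in sorted(Path(folder).glob("*.htm*")):
        parser = LinkCollector()
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        for link in parser.links:
            seen.setdefault(link, None)  # dict keys keep insertion order
    return list(seen)
```

The resulting list can be saved one URL per line, which is the same format the Custom URLs mode expects.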
  • Last modified: 2019/11/05 13:01
  • by eros