bootcat:help:corpus_creation_mode

Differences

This shows you the differences between two versions of the page.

bootcat:help:corpus_creation_mode [2019/11/05 11:55] eros
bootcat:help:corpus_creation_mode [2023/04/19 08:56] (current) eros
Line 3: Line 3:
 You can choose between the following creation "modes": You can choose between the following creation "modes":
  
-  * [[bootcat:help:corpus_creation_mode#simple_mode_recommended|Simple mode]] (recommended) +  * [[bootcat:help:corpus_creation_mode#simple_mode_recommended|Simple mode]] 
-  * [[bootcat:help:corpus_creation_mode#custom_tuples_advanced|Custom tuples]] (advanced) +  * [[bootcat:help:corpus_creation_mode#custom_tuples_advanced|Custom tuples]] 
-  * [[bootcat:help:corpus_creation_mode#custom_urls_advanced|Custom URLs]] (advanced) +  * [[bootcat:help:corpus_creation_mode#custom_urls_advanced|Custom URLs]] 
-  * [[bootcat:help:corpus_creation_mode#local_files|Local files]] (advanced) +  * [[bootcat:help:corpus_creation_mode#local_files|Local files]] 
-  * [[bootcat:help:corpus_creation_mode#local_queries|Local queries]] (advanced)+  * [[bootcat:help:corpus_creation_mode#local_queries|Local queries]]
  
 {{:bootcat:help:corpus_creation_modes.png?nolink|}} {{:bootcat:help:corpus_creation_modes.png?nolink|}}
Line 37: Line 37:
 In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs).  In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs). 
  
-You'll be asked to provide a text file containing one URL per line.+You'll be asked to provide a text file containing one **valid** URL per line, i.e. each line must begin with ''http:%%//%%'' or ''https:%%//%%''.
  
-You'll have to edit the list separately using a text editor (like Notepad for Windows or TextEdit for Mac) and save it in ''txt'' format.+You'll have to edit the list separately using a text editor (like [[https://notepad-plus-plus.org/|Notepad++]] for Windows or TextEdit for Mac) and save it in ''txt'' format.
  
 The text file should look like this: The text file should look like this:
Line 45: Line 45:
 <file> <file>
 http://foo.com/bar.htm http://foo.com/bar.htm
-http://bar.com/foo.php+https://example.com/report.pdf 
 +https://bar.com/foo.php
 http://some.site.com/index.html http://some.site.com/index.html
-...+http://random.docs.org/thesis.docx
 </file> </file>
  
-**N.B.**: only URLs pointing to HTML files will be downloaded (typical extensions for such files are ''.htm'', ''.html'', ''.php'', ''.asp''), if the list you provide contains URLs ending in PDF, DOC, DOCX etc. BootCaT will display an error and will refuse to proceed. In order to continue you'll have to remove the links to unsupported file formats from the list. +NB: up to version 1.21, BootCaT does not accept URL lists encoded as "UTF8 **with BOM**". Please make sure your URL list is saved as "UTF8" (**without BOM**). The issue will be solved in future versions of BootCaT.
 ===== Local files (advanced) ===== ===== Local files (advanced) =====
  
-Using this mode BootCaT will process all files contained in a folder (and its subfolders) on your computer. Files will be cleaned and a single text file will be created.+Using this mode BootCaT will process all files contained in a folder on your computer. Files will be cleaned and the corpus files will be created. 
 + 
 +Most common file formats are supported, including ''html'', ''pdf'' and ''doc'' files. 
 + 
 +===== Local queries (advanced) ===== 
 + 
 +Using this mode, you can query Google normally using a web browser and save the result pages to a folder. Then you can tell BootCaT where this folder is and it will extract the URLs from the queries you saved. 
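
The requirements the newer revision places on a Custom URLs list (one URL per line, each starting with ''http:%%//%%'' or ''https:%%//%%'', saved as UTF-8 without BOM) can be checked before the file is handed to BootCaT. The following is a minimal Python sketch, not part of BootCaT itself: the function name ''check_url_list'' is hypothetical, and skipping blank lines is an assumption, since the page does not say how BootCaT treats them.

```python
# Hypothetical pre-flight check for a BootCaT "Custom URLs" list.
# This is an illustration only; it does not reproduce BootCaT's own validation.

UTF8_BOM = b"\xef\xbb\xbf"  # the three bytes that mark "UTF-8 with BOM"


def check_url_list(raw: bytes):
    """Inspect the raw bytes of a URL list file.

    Returns (has_bom, valid_urls, problems), where problems is a list of
    (line_number, text) pairs for lines that do not start with
    http:// or https://.
    """
    has_bom = raw.startswith(UTF8_BOM)
    body = raw[len(UTF8_BOM):] if has_bom else raw

    valid_urls, problems = [], []
    for lineno, line in enumerate(body.decode("utf-8").splitlines(), start=1):
        url = line.strip()
        if not url:
            continue  # assumption: blank lines are ignored rather than reported
        if url.startswith(("http://", "https://")):
            valid_urls.append(url)
        else:
            problems.append((lineno, url))
    return has_bom, valid_urls, problems


if __name__ == "__main__":
    sample = (
        UTF8_BOM
        + b"http://foo.com/bar.htm\n"
        + b"https://example.com/report.pdf\n"
        + b"bar.com/foo.php\n"
    )
    has_bom, valid_urls, problems = check_url_list(sample)
    if has_bom:
        print("File starts with a BOM; re-save it as UTF-8 without BOM.")
    for lineno, text in problems:
        print(f"Line {lineno} is not a valid URL: {text}")
    print(f"{len(valid_urls)} usable URL(s) found.")
```

To check a real list, read it in binary mode (for example ''open("urls.txt", "rb").read()'', where ''urls.txt'' is a placeholder file name) so that a leading BOM is still visible to the check.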
  • bootcat/help/corpus_creation_mode.1572954912.txt.gz
  • Last modified: 2019/11/05 11:55
  • by eros