Differences

This shows you the differences between two versions of the page.

bootcat:help:corpus_creation_mode [2016/11/14 12:29]
eros
bootcat:help:corpus_creation_mode [2019/11/08 09:12] (current)
eros [Local files (advanced)]
Line 1: Line 1:
 ====== Corpus creation mode ====== ====== Corpus creation mode ======
  
-Version 0.71 of the BootCaT frontend introduced the possibility of skipping some of the steps involved in the corpus creation procedure. +You can choose between the following creation "​modes":​
- +
-You can now choose between the following creation "​modes":​+
  
   * [[bootcat:​help:​corpus_creation_mode#​simple_mode_recommended|Simple mode]] (recommended)   * [[bootcat:​help:​corpus_creation_mode#​simple_mode_recommended|Simple mode]] (recommended)
Line 9: Line 7:
   * [[bootcat:​help:​corpus_creation_mode#​custom_urls_advanced|Custom URLs]] (advanced)   * [[bootcat:​help:​corpus_creation_mode#​custom_urls_advanced|Custom URLs]] (advanced)
   * [[bootcat:​help:​corpus_creation_mode#​local_files|Local files]] (advanced)   * [[bootcat:​help:​corpus_creation_mode#​local_files|Local files]] (advanced)
 +  * [[bootcat:​help:​corpus_creation_mode#​local_queries|Local queries]] (advanced)
  
 {{:​bootcat:​help:​corpus_creation_modes.png?​nolink|}} {{:​bootcat:​help:​corpus_creation_modes.png?​nolink|}}
Line 38: Line 37:
 In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs). ​ In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs). ​
  
-You'll be asked to provide a text file containing one URL per line.+You'll be asked to provide a text file containing one **valid** ​URL per line, i.e. each line must begin with ''​http:​%%//​%%''​ or ''​https:​%%//​%%''​.
  
-You'll have to edit the list separately using a text editor (like Notepad for Windows or TextEdit for Mac) and save it in ''​txt''​ format.+You'll have to edit the list separately using a text editor (like [[https://​notepad-plus-plus.org/​|Notepad++]] for Windows or TextEdit for Mac) and save it in ''​txt''​ format.
  
 The text file should look like this: The text file should look like this:
Line 46: Line 45:
 <​file>​ <​file>​
 http://​foo.com/​bar.htm http://​foo.com/​bar.htm
-http://​bar.com/​foo.php+https://​example.com/​report.pdf 
 +https://​bar.com/​foo.php
 http://​some.site.com/​index.html http://​some.site.com/​index.html
-...+http://​random.docs.org/thesis.docx
 </​file>​ </​file>​
  
-**N.B.**: only URLs pointing to HTML files will be downloaded (typical extensions for such files are ''.htm'', ''.html'', ''.php'', ''.asp''), if the list you provide contains URLs ending in PDF, DOC, DOCX etc. BootCaT will display an error and will refuse to proceed. In order to continue you'll have to remove the links to unsupported file formats from the list. +NB: up to version 1.21, BootCaT does not accept URL lists encoded as "UTF8 **with BOM**". Please make sure your URL list is saved as "UTF8" (**without BOM**). The issue will be solved in future versions of BootCaT.
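
The two requirements in the newer revision (every line starts with ''http://'' or ''https://'', and the file is UTF-8 without a BOM) can be checked before handing the list to BootCaT. The following is a minimal sketch, not part of BootCaT itself; the function name ''check_url_list'' and the wording of the messages are assumptions made for illustration.

```python
import codecs
from pathlib import Path

# Prefixes the help page requires at the start of every line.
ALLOWED_PREFIXES = ("http://", "https://")


def check_url_list(path):
    """Return (urls, problems) for a one-URL-per-line text file.

    Hypothetical helper: it only mirrors the rules stated on the help page
    and does not reproduce BootCaT's own validation.
    """
    raw = Path(path).read_bytes()
    problems = []

    # A UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM; re-save as UTF-8 without BOM")
        raw = raw[len(codecs.BOM_UTF8):]

    urls = []
    for number, line in enumerate(raw.decode("utf-8").splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped in this sketch
        if line.startswith(ALLOWED_PREFIXES):
            urls.append(line)
        else:
            problems.append(f"line {number}: does not start with http:// or https://")
    return urls, problems
```

Calling ''check_url_list("urls.txt")'' on a list like the one above would return the accepted URLs together with a list of human-readable problems, which is empty when the file meets both requirements.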
 ===== Local files (advanced) ===== ===== Local files (advanced) =====
  
-Using this mode BootCaT will process all files contained in a folder (and its subfolders) on your computer. Files will be cleaned and a single text file will be created.+Using this mode BootCaT will process all files contained in a folder (and its subfolders) on your computer. Files will be cleaned and the corpus files will be created. 
 + 
 +Most common file formats are supported, including ''​html'',​ ''​pdf''​ and ''​doc''​ files. 
 + 
 +===== Local queries (advanced) ===== 
 + 
 +Using this mode, you can query Google normally using a web browser and save the result pages to a folder. Then you can tell BootCaT where this folder is, and it will extract the URLs from the result pages you saved.
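
To give a rough idea of what extracting URLs from saved result pages involves, here is a minimal sketch using only the Python standard library. It is not BootCaT's implementation: the names ''LinkCollector'' and ''collect_urls'' are hypothetical, and the sketch keeps every absolute link it finds, whereas a real tool would presumably filter out the search engine's own navigation links.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from the <a href="..."> tags of one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def collect_urls(folder):
    """Return de-duplicated links from every .htm/.html file directly in `folder`."""
    seen = []
    for page in sorted(Path(folder).glob("*.htm*")):
        parser = LinkCollector()
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        for link in parser.links:
            if link not in seen:
                seen.append(link)  # keep first-seen order
    return seen
```

The returned list could then be written out one URL per line, which is the format the Custom URLs mode expects.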
  • bootcat/help/corpus_creation_mode.1479122945.txt.gz
  • Last modified: 2016/11/14 12:29
  • by eros