bootcat:help:corpus_creation_mode

Differences

This shows you the differences between two versions of the page.

bootcat:help:corpus_creation_mode [2013/06/15 15:38] – [Custom URLs (advanced)] eros
bootcat:help:corpus_creation_mode [2023/04/19 08:56] (current) eros
Line 1: Line 1:
 ====== Corpus creation mode ======

-Version 0.71 of the BootCaT frontend introduced the possibility of skipping some of the steps involved in the corpus creation procedure.
+You can choose between the following creation "modes":

-You can now choose between the following creation "modes":
-
-  * [[bootcat:help:corpus_creation_mode#simple_mode_recommended|Simple mode]] (recommended)
-  * [[bootcat:help:corpus_creation_mode#custom_tuples_advanced|Custom tuples]] (advanced)
-  * [[bootcat:help:corpus_creation_mode#custom_urls_advanced|Custom URLs]] (advanced)
+  * [[bootcat:help:corpus_creation_mode#simple_mode_recommended|Simple mode]]
+  * [[bootcat:help:corpus_creation_mode#custom_tuples_advanced|Custom tuples]]
+  * [[bootcat:help:corpus_creation_mode#custom_urls_advanced|Custom URLs]]
+  * [[bootcat:help:corpus_creation_mode#local_files|Local files]]
+  * [[bootcat:help:corpus_creation_mode#local_queries|Local queries]]

 {{:bootcat:help:corpus_creation_modes.png?nolink|}}
Line 19: Line 19:
 ===== Custom tuples (advanced) =====

-In this mode you skip the seed selection steps and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.
+In this mode you skip the seed selection step and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.

 Remember that each line will become a single query to the search engine, so phrases should be enclosed in quotes. Your tuples should look like this:
Line 35: Line 35:
 ===== Custom URLs (advanced) =====

-In this mode you'll skip directly to the final stap: you'll be asked to provide a list of URLs in a text file which will have to contain one URL per line.
+In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs).

-You'll have to edit the list separately with a text editor (like Notepad for Windows or TextEdit for Mac) and save it in ''txt'' format.
+You'll be asked to provide a text file containing one **valid** URL per line, i.e. each line must begin with ''http:%%//%%'' or ''https:%%//%%''.
+
+You'll have to edit the list separately using a text editor (like [[https://notepad-plus-plus.org/|Notepad++]] for Windows or TextEdit for Mac) and save it in ''txt'' format.

 The text file should look like this:
Line 43: Line 45:
 <file>
 http://foo.com/bar.htm
-http://bar.com/foo.php
+https://example.com/report.pdf
+https://bar.com/foo.php
 http://some.site.com/index.html
-...
+http://random.docs.org/thesis.docx
 </file>
  
-**N.B.**: only URLs pointing to HTML files will be downloaded (typical extensions for such files are ''.htm'', ''.html'', ''.php'', ''.asp''), whereas PDF and DOC files will be ignored.
+NB: up to version 1.21, BootCaT does not accept URL lists encoded as "UTF8 **with BOM**". Please make sure your URL list is saved as "UTF8" (**without BOM**); the issue will be solved in future versions of BootCaT.
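
The two requirements above (every line starts with ''http:%%//%%'' or ''https:%%//%%'', and the file is UTF8 without BOM) can be checked before handing the list to BootCaT. The following is a minimal sketch in Python, not part of BootCaT; the function name ''clean_url_list'' is hypothetical, and the check that a host name follows the prefix is an extra assumption about what counts as a valid URL.

```python
import codecs
from typing import List, Tuple
from urllib.parse import urlsplit


def clean_url_list(raw: bytes) -> Tuple[List[str], List[str]]:
    """Split the raw bytes of a URL list into (valid, rejected) lines.

    Hypothetical helper, not part of BootCaT.
    """
    # Drop a UTF-8 byte order mark if the text editor added one.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]

    valid: List[str] = []
    rejected: List[str] = []
    for line in raw.decode("utf-8").splitlines():
        url = line.strip()
        if not url:
            continue  # ignore empty lines
        # Each line must begin with http:// or https://; requiring a host
        # name after the prefix is an assumption made for this sketch.
        if url.startswith(("http://", "https://")) and urlsplit(url).netloc:
            valid.append(url)
        else:
            rejected.append(url)
    return valid, rejected
```

To use it, read the list with ''open(path, "rb").read()'', pass the bytes to ''clean_url_list'', and write the valid lines back out with ''encoding="utf-8"'', which does not add a BOM.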
+===== Local files (advanced) =====
+
+Using this mode, BootCaT will process all files contained in a folder on your computer. Files will be cleaned and the corpus files will be created.
+
+Most common file formats are supported, including ''html'', ''pdf'' and ''doc'' files.
+
+===== Local queries (advanced) =====
+
+Using this mode, you can query Google normally using a web browser and save the result pages to a folder. Then you can tell BootCaT where this folder is and it will extract the URLs from the queries you saved.
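
How BootCaT extracts URLs from the saved pages is not described here. As a rough illustration of the idea only, the Python sketch below collects every absolute ''http''/''https'' link from the HTML files in a folder. The names ''LinkCollector'', ''extract_urls'' and ''urls_from_folder'' are hypothetical, and the sketch assumes the pages were saved as ''.htm'' or ''.html'' files; it does not filter out links that point back to the search engine itself.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                if value not in self.urls:  # keep first occurrence only
                    self.urls.append(value)


def extract_urls(html_text: str) -> List[str]:
    """Return the unique absolute links found in one saved page."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls


def urls_from_folder(folder: str) -> List[str]:
    """Gather links from every .htm/.html file in a folder (assumed layout)."""
    urls: List[str] = []
    for page in sorted(Path(folder).glob("*.htm*")):
        text = page.read_text(encoding="utf-8", errors="replace")
        for url in extract_urls(text):
            if url not in urls:
                urls.append(url)
    return urls
```

Writing the returned list to a text file, one URL per line, gives a file in the format expected by the Custom URLs mode.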
  • Last modified: 2013/06/15 15:38
  • by eros