====== Corpus creation mode ======
You can choose between the following creation "modes":
* [[bootcat:help:corpus_creation_mode#simple_mode_recommended|Simple mode]]
* [[bootcat:help:corpus_creation_mode#custom_tuples_advanced|Custom tuples]]
* [[bootcat:help:corpus_creation_mode#custom_urls_advanced|Custom URLs]]
* [[bootcat:help:corpus_creation_mode#local_files|Local files]]
* [[bootcat:help:corpus_creation_mode#local_queries|Local queries]]
{{:bootcat:help:corpus_creation_modes.png?nolink|}}
===== Simple mode (recommended) =====
This is the standard method for creating a BootCaT corpus: you choose seeds, build random tuples, collect URLs and finally build the corpus.
If you're a novice user this is the mode you should use (see the [[bootcat:tutorials:basic_1|tutorial]] for more info on how to build a corpus).
===== Custom tuples (advanced) =====
In this mode you skip the seed selection step and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.
Remember that each line will become a single query to the search engine, therefore phrases should be enclosed in quotes. You tuples should look like this:
dog Fido "food hygiene"
leash Fido dog
breeds "food hygiene" leash
pet leash Fido
...
After providing the tuples you will proceed normally: you'll collect URLs and then build the corpus.
===== Custom URLs (advanced) =====
In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs).
You'll be asked to provide a text file containing one **valid** URL per line, i.e. each line must begin with ''http:%%//%%'' or ''https:%%//%%''.
You'll have to edit the list separately using a text editor (like [[https://notepad-plus-plus.org/|Notepad++]] for Windows or TextEdit for Mac) and save it in ''txt'' format.
The text file should look like this:
http://foo.com/bar.htm
https://example.com/report.pdf
https://bar.com/foo.php
http://some.site.com/index.html
http://random.docs.org/thesis.docx
NB: up to version 1.21, BootCaT does not accept URLs lists encoded as "UTF8 **with BOM**", please make sure your URL list is saved as "UTF8" (**without BOM**), the issue will be solved in future versions of BootCaT.
===== Local files (advanced) =====
Using this mode BootCaT will process all files contained in a folder on your computer. Files will be cleaned and the corpus files will be created.
Most common file formats are supported, including ''html'', ''pdf'' and ''doc'' files.
===== Local queries (advanced) =====
Using this mode, you can query Google normally using a web browser and save the result pages to a folder. Then you can tell BootCaT where this folder is and it will extract the URLs from the queries you saved.