====== Corpus creation mode ======

You can choose between the following creation "modes":

  * [[bootcat:help:corpus_creation_mode#simple_mode_recommended|Simple mode]]
  * [[bootcat:help:corpus_creation_mode#custom_tuples_advanced|Custom tuples]]
  * [[bootcat:help:corpus_creation_mode#custom_urls_advanced|Custom URLs]]
  * [[bootcat:help:corpus_creation_mode#local_files|Local files]]
  * [[bootcat:help:corpus_creation_mode#local_queries|Local queries]]

{{:bootcat:help:corpus_creation_modes.png?nolink|}}

===== Simple mode (recommended) =====

This is the standard method for creating a BootCaT corpus: you choose seeds, build random tuples, collect URLs and finally build the corpus. If you're a novice user, this is the mode you should use (see the [[bootcat:tutorials:basic_1|tutorial]] for more info on how to build a corpus).

===== Custom tuples (advanced) =====

In this mode you skip the seed selection step and directly provide a list of tuples: a window will open and you'll be able to type in the tuples. Remember that each line will become a single query to the search engine, so phrases should be enclosed in quotes. Your tuples should look like this:

  dog Fido "food hygiene"
  leash Fido dog
  breeds "food hygiene" leash
  pet leash Fido
  ...

After providing the tuples you will proceed normally: you'll collect URLs and then build the corpus.

===== Custom URLs (advanced) =====

In this mode you skip directly to the final step, the one where the corpus is built from a list of Internet addresses (URLs). You'll be asked to provide a text file containing one **valid** URL per line, i.e. each line must begin with ''http:%%//%%'' or ''https:%%//%%''. You'll have to edit the list separately using a text editor (like [[https://notepad-plus-plus.org/|Notepad++]] for Windows or TextEdit for Mac) and save it in ''txt'' format.
The text file should look like this:

  http://foo.com/bar.htm
  https://example.com/report.pdf
  https://bar.com/foo.php
  http://some.site.com/index.html
  http://random.docs.org/thesis.docx

NB: up to version 1.21, BootCaT does not accept URL lists encoded as "UTF8 **with BOM**". Please make sure your URL list is saved as "UTF8" (**without BOM**). The issue will be solved in future versions of BootCaT.

===== Local files (advanced) =====

In this mode, BootCaT processes all files contained in a folder on your computer. Files will be cleaned and the corpus files will be created. Most common file formats are supported, including ''html'', ''pdf'' and ''doc'' files.

===== Local queries (advanced) =====

In this mode, you can query Google normally using a web browser and save the result pages to a folder. You then tell BootCaT where this folder is, and it will extract the URLs from the result pages you saved.
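
If you prepare a URL list for the Custom URLs mode, you may want to check it before loading it into BootCaT. The Python sketch below is not part of BootCaT; it only illustrates the two requirements stated for that mode (each line begins with ''http:%%//%%'' or ''https:%%//%%'', and the file is saved as UTF-8 without BOM). The function names ''clean_url_list'' and ''rewrite_url_file'' are hypothetical, and the sketch assumes the input file is UTF-8 encoded.

```python
import codecs
from pathlib import Path

# Prefixes that the Custom URLs mode requires at the start of each line.
VALID_PREFIXES = ("http://", "https://")


def clean_url_list(raw: bytes):
    """Return (valid_urls, rejected_lines) from the raw bytes of a URL list.

    Hypothetical helper, assuming the file is UTF-8 encoded.
    """
    # Drop a leading UTF-8 byte order mark, if present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]

    valid, rejected = [], []
    for line in raw.decode("utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(VALID_PREFIXES):
            valid.append(line)
        else:
            rejected.append(line)
    return valid, rejected


def rewrite_url_file(src: str, dst: str):
    """Write the valid URLs from src to dst and return the rejected lines."""
    valid, rejected = clean_url_list(Path(src).read_bytes())
    # encoding="utf-8" (unlike "utf-8-sig") writes no BOM.
    Path(dst).write_text("\n".join(valid) + "\n", encoding="utf-8")
    return rejected
```

For example, ''rewrite_url_file("urls.txt", "urls_clean.txt")'' would write the cleaned list to a new file and return the lines that did not start with an accepted prefix, so you can fix them by hand. Both file names are placeholders.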