Corpus creation mode
Version 0.71 of the BootCaT frontend introduced the possibility of skipping some of the steps involved in the corpus creation procedure.
You can now choose between the following creation “modes”:
- Simple mode (recommended)
- Custom tuples (advanced)
- Custom URLs (advanced)
Simple mode (recommended)
This is the standard method for creating a BootCaT corpus: you choose seeds, build random tuples, collect URLs and finally build the corpus.
If you're a novice user this is the mode you should use (see the tutorial for more info on how to build a corpus).
Custom tuples (advanced)
In this mode you skip the seed selection steps and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.
Remember that each line will become a single query to the search engine, so multi-word phrases should be enclosed in quotes. Your tuples should look like this:
dog Fido "food hygiene"
leash Fido dog
breeds "food hygiene" leash
pet leash Fido
...
After providing the tuples you will proceed normally: you'll collect URLs and then build the corpus.
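If you would rather prepare the tuple list outside the frontend, the following Python snippet is a minimal sketch of one way to do it. It is not part of BootCaT: the function names (`format_tuple`, `build_tuples`), the tuple size of 3 and the seed list are all assumptions made for illustration. It draws random combinations of seeds and wraps every multi-word seed in quotes, so that each printed line can be pasted into the tuples window as one query.

```python
import random


def format_tuple(terms):
    """Join terms into one query line, quoting multi-word phrases."""
    return " ".join(f'"{term}"' if " " in term else term for term in terms)


def build_tuples(seeds, tuple_size=3, count=4, rng=None):
    """Return `count` query lines, each made of `tuple_size` distinct seeds.

    Hypothetical helper: the frontend has its own tuple builder in simple mode.
    The same combination may be drawn more than once.
    """
    rng = rng or random.Random()
    return [format_tuple(rng.sample(seeds, tuple_size)) for _ in range(count)]


if __name__ == "__main__":
    # Example seeds only; replace them with your own.
    seeds = ["dog", "Fido", "food hygiene", "leash", "breeds", "pet"]
    for line in build_tuples(seeds):
        print(line)
```

Because the draw is random, the output changes on every run; pass `rng=random.Random(0)` (or any fixed seed) if you need a reproducible list.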
Custom URLs (advanced)
In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs).
You'll be asked to provide a text file containing one URL per line.
You'll have to edit the list separately using a text editor (like Notepad for Windows or TextEdit for Mac) and save it in txt format.
The text file should look like this:
http://foo.com/bar.htm
http://bar.com/foo.php
http://some.site.com/index.html
...
N.B.: only URLs pointing to HTML files will be downloaded (typical extensions for such files are .htm, .html, .php, .asp), whereas PDF and DOC files will be ignored.
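If your URLs come from another tool, a short script can write the one-URL-per-line file for you. The Python sketch below is only an illustration and not part of BootCaT: the function names and the output filename are hypothetical, and it assumes that PDF and DOC links can be recognised by their file extension, which may not match how the frontend itself decides what to skip. It drops those links in advance and writes the rest to a plain text file.

```python
from urllib.parse import urlparse

# Assumption: files the frontend ignores are recognisable by extension.
SKIPPED_EXTENSIONS = (".pdf", ".doc")


def filter_urls(urls):
    """Return the non-empty URLs whose path does not end in .pdf or .doc."""
    kept = []
    for url in urls:
        url = url.strip()
        if not url:
            continue
        path = urlparse(url).path.lower()
        if not path.endswith(SKIPPED_EXTENSIONS):
            kept.append(url)
    return kept


def write_url_list(urls, filename):
    """Write the filtered URLs to `filename`, one per line."""
    kept = filter_urls(urls)
    with open(filename, "w", encoding="utf-8") as handle:
        for url in kept:
            handle.write(url + "\n")
    return kept


if __name__ == "__main__":
    candidates = [
        "http://foo.com/bar.htm",
        "http://bar.com/foo.php",
        "http://some.site.com/report.pdf",
    ]
    # "urls.txt" is an arbitrary example filename.
    write_url_list(candidates, "urls.txt")
```

Filtering beforehand is optional, since unsupported files are ignored anyway, but it keeps the list short and makes it easier to see how many pages will actually be downloaded.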