This is an old revision of the document!
Corpus creation mode
You can choose between the following creation “modes”:
- Simple mode (recommended)
- Custom tuples (advanced)
- Custom URLs (advanced)
- Local files (advanced)
- Local queries (advanced)
Simple mode (recommended)
This is the standard method for creating a BootCaT corpus: you choose seeds, build random tuples, collect URLs and finally build the corpus.
If you're a novice user this is the mode you should use (see the tutorial for more info on how to build a corpus).
Custom tuples (advanced)
In this mode you skip the seed selection step and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.
Remember that each line will become a single query to the search engine, therefore phrases should be enclosed in quotes. You tuples should look like this:
dog Fido "food hygiene" leash Fido dog breeds "food hygiene" leash pet leash Fido ...
After providing the tuples you will proceed normally: you'll collect URLs and then build the corpus.
Custom URLs (advanced)
In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs).
You'll be asked to provide a text file containing one URL per line.
You'll have to edit the list separately using a text editor (like Notepad++ for Windows or TextEdit for Mac) and save it in txt
format.
The text file should look like this:
http://foo.com/bar.htm https://example.com/report.pdf https://bar.com/foo.php http://some.site.com/index.html http://random.docs.org/thesis.docx ...
N.B.: you need to provide a list of valid URLs, i.e. each line must begin with http:
or
https:
Local files (advanced)
Using this mode BootCaT will process all files contained in a folder (and its subfolders) on your computer. Files will be cleaned and a single text file will be created.