bootcat:help:corpus_creation_mode

Corpus creation mode

You can choose between the following creation “modes”:

This is the standard method for creating a BootCaT corpus: you choose seeds, build random tuples, collect URLs and finally build the corpus.

If you're a novice user this is the mode you should use (see the tutorial for more info on how to build a corpus).

In this mode you skip the seed selection step and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.

Remember that each line will become a single query to the search engine, therefore phrases should be enclosed in quotes. You tuples should look like this:

dog Fido "food hygiene"
leash Fido dog
breeds "food hygiene" leash
pet leash Fido
...

After providing the tuples you will proceed normally: you'll collect URLs and then build the corpus.

In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs).

You'll be asked to provide a text file containing one valid URL per line, i.e. each line must begin with http:// or https://.

You'll have to edit the list separately using a text editor (like Notepad++ for Windows or TextEdit for Mac) and save it in txt format.

The text file should look like this:

http://foo.com/bar.htm
https://example.com/report.pdf
https://bar.com/foo.php
http://some.site.com/index.html
http://random.docs.org/thesis.docx

NB: up to version 1.21, BootCaT does not accept URLs lists encoded as “UTF8 with BOM”, please make sure your URL list is saved as “UTF8” (without BOM), the issue will be solved in future versions of BootCaT.

Using this mode BootCaT will process all files contained in a folder on your computer. Files will be cleaned and the corpus files will be created.

Most common file formats are supported, including html, pdf and doc files.

Using this mode, you can query Google normally using a web browser and save the result pages to a folder. Then you can tell BootCaT where this folder is and it will extract the URLs from the queries you saved.

  • bootcat/help/corpus_creation_mode.txt
  • Last modified: 2023/04/19 08:56
  • by eros