bootcat:help:corpus_creation_mode

This is an old revision of the document!


Corpus creation mode

You can choose between the following creation “modes”:

This is the standard method for creating a BootCaT corpus: you choose seeds, build random tuples, collect URLs and finally build the corpus.

If you're a novice user this is the mode you should use (see the tutorial for more info on how to build a corpus).

In this mode you skip the seed selection step and directly provide a list of tuples: a window will open and you'll be able to type in the tuples.

Remember that each line will become a single query to the search engine, therefore phrases should be enclosed in quotes. You tuples should look like this:

dog Fido "food hygiene"
leash Fido dog
breeds "food hygiene" leash
pet leash Fido
...

After providing the tuples you will proceed normally: you'll collect URLs and then build the corpus.

In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs).

You'll be asked to provide a text file containing one valid URL per line, i.e. each line must begin with http:// or https://.

You'll have to edit the list separately using a text editor (like Notepad++ for Windows or TextEdit for Mac) and save it in txt format.

The text file should look like this:

http://foo.com/bar.htm
https://example.com/report.pdf
https://bar.com/foo.php
http://some.site.com/index.html
http://random.docs.org/thesis.docx
...

Using this mode BootCaT will process all files contained in a folder (and its subfolders) on your computer. Files will be cleaned and a single text file will be created.

Most common file formats are supported, including html, pdf and doc files.

Using this mode, you can query Google normally using a web browser and save the result pages to a folder. Then you can tell BootCaT where this folder is and it will extract the URLs from the queries you saved.

  • bootcat/help/corpus_creation_mode.1572955971.txt.gz
  • Last modified: 2019/11/05 13:12
  • by eros