BootCaT front-end tutorial - Part 2

Providing your chosen seeds

This is the most important step in the corpus creation process: here you provide the seeds used to generate the queries that will be submitted to the search engine.

Type (or copy and paste) your chosen seeds into the text box, one seed per line. A multi-word seed goes on a single line, and quotes are not necessary, as in the example provided:

food hygiene

You must provide at least 5 seeds; for the purposes of illustration, we used 6 here.
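The input rules above (one seed per line, multi-word seeds kept whole, at least 5 seeds) can be sketched in a few lines of Python. This is only an illustration of the rules, not BootCaT's own code: the function name `parse_seeds` and every seed other than “food hygiene” are placeholders invented for the example.

```python
MIN_SEEDS = 5  # minimum number of seeds stated in this tutorial


def parse_seeds(text):
    """Split text-box content into seeds, one per non-empty line.

    A multi-word seed stays a single seed because the split is done
    on line breaks, not on spaces. (Hypothetical helper.)
    """
    seeds = [line.strip() for line in text.splitlines() if line.strip()]
    if len(seeds) < MIN_SEEDS:
        raise ValueError(
            f"at least {MIN_SEEDS} seeds are required, got {len(seeds)}"
        )
    return seeds


# Placeholder input: only "food hygiene" comes from the tutorial.
text_box = "food hygiene\nseed_b\nseed_c\nseed_d\nseed_e\nseed_f\n"
seeds = parse_seeds(text_box)
print(len(seeds), "seeds:", seeds)
```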

Once you have provided the seeds of your choice, check the “I'm done editing seeds” box and click “Next”.

Tuple generation

The seeds you provided in the previous step will be randomly grouped to form tuples (a variety of combinations of your seeds). These tuples will be submitted as queries to the search engine.

You can choose the number of tuples to be generated. The number of possible combinations is finite and depends on how many seeds you provided and on the tuple length; the maximum number of tuples you can generate is shown in parentheses. Since we provided 6 seeds and use a tuple length of 3, we can generate a maximum of 20 tuples. We chose to generate 15 tuples.
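The maximum of 20 is consistent with counting the unordered combinations of 3 distinct seeds drawn from 6. The sketch below reproduces that arithmetic; it assumes the front-end counts tuples this way, which the tutorial does not state explicitly, and `max_tuples` is a name made up for the example.

```python
from math import comb  # available in Python 3.8 and later


def max_tuples(n_seeds, tuple_length):
    """Number of distinct unordered tuples of the given length."""
    return comb(n_seeds, tuple_length)


print(max_tuples(6, 3))  # 20, the maximum shown for 6 seeds and length 3
print(max_tuples(6, 2))  # 15, the same seeds with a tuple length of 2
```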

You can also alter the length of the tuple (i.e. the number of seeds forming it); typical values for this option are:

  • 3 if you want to build a specialized corpus
  • 2 if you are creating a general language corpus and are using general language words

We'll use a length of 3 and recommend that you do the same.

Once you have finished setting the options, click on “Generate tuples”.

Once the tuples have been generated, you can also deselect individual tuples that you think will not yield interesting results.

Note: “food hygiene” has been automatically surrounded with quotes. The tuples in which this seed appears are 4 words long but only 3 seeds long, since “food hygiene” counts as a single seed.
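As a rough illustration of the behaviour described in this section, the following Python sketch draws random tuples from a list of seeds and wraps multi-word seeds in quotes. It is an approximation under stated assumptions, not the front-end's actual implementation: the function names are hypothetical, the seeds other than “food hygiene” are placeholders, and sampling without repetition from all possible combinations is an assumption.

```python
import random
from itertools import combinations


def format_seed(seed):
    """Quote a multi-word seed so it is treated as one search term."""
    return f'"{seed}"' if " " in seed else seed


def generate_tuples(seeds, n_tuples, tuple_length=3, rng=None):
    """Pick n_tuples distinct random combinations of seeds (hypothetical helper)."""
    rng = rng or random.Random()
    all_combos = list(combinations(seeds, tuple_length))
    if n_tuples > len(all_combos):
        raise ValueError(
            f"only {len(all_combos)} tuples are possible with these settings"
        )
    chosen = rng.sample(all_combos, n_tuples)
    # Each tuple becomes one query string: seeds separated by spaces.
    return [" ".join(format_seed(s) for s in combo) for combo in chosen]


# Placeholder seeds: only "food hygiene" comes from the tutorial.
example_seeds = ["food hygiene", "seed_b", "seed_c", "seed_d", "seed_e", "seed_f"]
for query in generate_tuples(example_seeds, n_tuples=15, tuple_length=3):
    print(query)
```

A tuple containing “food hygiene” prints as four words, yet it is still built from three seeds, which matches the note above.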

Click “Next” to proceed to the next step.

