BootCaT front-end tutorial - Part 3

Generating queries

It's time to use the tuples we generated earlier to build the queries that will be sent to the search engine (i.e. Google). In the next step, these queries will be opened in a browser so that the results can be saved.

A number of parameters can be specified here, but, for the purposes of this tutorial, we'll just accept the default values and click on “Generate Queries”.
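To make the idea of "one tuple, one query" concrete, here is a minimal Python sketch of how a tuple of seeds could be turned into a search URL. It is only an illustration, not the front-end's actual implementation: the function name `build_query_url`, the example tuples, and the choice to quote multi-word seeds are all assumptions made for this example.

```python
from urllib.parse import urlencode


def build_query_url(tuple_terms, base_url="https://www.google.com/search"):
    """Turn one tuple of seed terms into a search URL (illustrative only)."""
    # Assumption: multi-word seeds are quoted so they are searched as phrases.
    quoted = [f'"{term}"' if " " in term else term for term in tuple_terms]
    # urlencode takes care of escaping spaces and quotation marks.
    return base_url + "?" + urlencode({"q": " ".join(quoted)})


# Hypothetical tuples for the "dogs" corpus used in this tutorial.
tuples = [
    ("dog", "leash", "puppy"),
    ("dog food", "kennel", "breed"),
]

queries = [build_query_url(t) for t in tuples]
for query in queries:
    print(query)
```

Each tuple yields exactly one URL, which is why the number of queries in the next step matches the number of tuples generated earlier.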

Saving query results

In this step, each query generated in the previous step is opened in a web browser. Each tuple (a combination of our seeds) generated earlier becomes one query. This method lets us identify texts that are relevant to the domain we are interested in; how specialized or general the resulting corpus is depends on how specialized or general the seeds are.

Click on “Open in browser”. A message will appear explaining what's about to happen and showing the folder where you'll need to save the results page. You can also open that folder by clicking on “Open folder”.

Once you click on “OK”, your default Web browser will open and show the results of the query. The page will look something like this:

Now you need to save the page using the “Save page” function of your browser (on Windows you can just press CTRL-S, on macOS press CMD-S). A dialog box will appear asking you where you want to save the page. Select the folder 'BootCaT Corpora → dogs → queries'.

NB: make sure you're saving the queries as “Web page, Complete”, “Web page, HTML only” or “Page Source”. In other words, you need to save them in HTML format (and not in MHTML or some other compressed format).
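If you are unsure whether a saved page is in the right format, a quick check like the following Python sketch can help. It is a rough heuristic and not part of BootCaT: the function name `looks_like_html` is made up for this example, and the check assumes that MHTML archives start with MIME headers (such as `MIME-Version:`) while plain HTML files start with a doctype or an `<html>` tag.

```python
from pathlib import Path


def looks_like_html(path):
    """Heuristic: True if the file looks like plain HTML rather than MHTML."""
    # Only the beginning of the file is needed to tell the formats apart.
    head = Path(path).read_bytes()[:2048].lower()
    # Assumption: MHTML archives begin with MIME headers.
    if b"mime-version:" in head or b"multipart/related" in head:
        return False
    return b"<!doctype html" in head or b"<html" in head
```

You could call `looks_like_html` on each file in the queries folder and save again any page for which it returns `False`.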

Collecting URLs

Once you're done saving the results of all queries, click on “Collect URLs” and you'll be taken to the next step:

:!: You can choose to click on “Open All in Browser” to send all queries to the browser with a single click, but this sometimes results in Google blocking the operation.

:!: In this step we only collect the URLs (i.e. the Internet addresses) of pages; the actual pages will be downloaded in a later step.
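For readers curious about what collecting URLs from a saved results page might involve, here is a minimal Python sketch using only the standard library. It is not the front-end's actual code: the names `LinkCollector` and `collect_urls` and the `skip_hosts` filter (used to drop links that point back to the search engine itself) are assumptions made for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links found in <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Relative links (e.g. "/search?q=...") are ignored.
        if href and href.startswith(("http://", "https://")):
            self.urls.append(href)


def collect_urls(html_text, skip_hosts=("google.",)):
    """Return unique result URLs in order, skipping the search engine's own links."""
    parser = LinkCollector()
    parser.feed(html_text)
    seen, result = set(), []
    for url in parser.urls:
        host = urlsplit(url).netloc
        if any(fragment in host for fragment in skip_hosts):
            continue
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Tiny made-up results page, for illustration only.
sample = (
    '<a href="https://example.org/dogs">A</a>'
    '<a href="https://www.google.com/preferences">Settings</a>'
    '<a href="/search?q=dogs">More</a>'
    '<a href="https://example.org/dogs">A again</a>'
    '<a href="http://example.com/puppies">B</a>'
)
print(collect_urls(sample))
```

Only the addresses are gathered here; nothing is downloaded, which mirrors the separation between this step and the later download step.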