BootCaT front-end tutorial - Part 3
Collect URLs
It's time to query the search engine using the tuples we generated earlier. The search engine will return only a limited number of pages for each query (i.e. tuple) we submit; the default value is 10 URLs per query and we won't change it.
Increasing the number of pages will result in a larger corpus, but its contents will tend to become less relevant.
Click “Collect URLs” to start collecting URLs from the search engine. This might take a while, depending on Internet traffic and the speed of your connection.
In this step we only collect the URLs (i.e. the Internet addresses) of pages; the actual pages will be downloaded in a later step.
Edit the URL list
Now you can manually edit the list of URLs. We won't do it here, but if you want to give it a try just de-select the box “I'm done editing URLs” to activate the text box and start editing (i.e. deleting the line of text with each URL that you would like to remove). Check the box again when you're done.
Notice how the total number of collected URLs appears to be wrong: we generated 15 queries and instructed BootCaT to retrieve 10 URLs per query, so the total should be 150. What happened then? Simple: quite a few URLs were retrieved more than once (remember that the queries can be very similar to one another) and the duplicates were automatically discarded by BootCaT.
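To see why the total falls short of the expected figure, here is a minimal Python sketch of duplicate removal. It is not BootCaT's actual code: the function name `deduplicate_urls` and the example.org addresses are made up for illustration, and we assume two URLs count as duplicates only when they are exactly identical.

```python
def deduplicate_urls(urls):
    """Return the URLs in first-seen order, dropping exact duplicates."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


# Hypothetical results for three similar queries (made-up URLs).
results_per_query = [
    ["http://example.org/a", "http://example.org/b"],
    ["http://example.org/b", "http://example.org/c"],
    ["http://example.org/a", "http://example.org/c"],
]

# Flatten the per-query lists into one list, then remove repeats.
all_urls = [url for result in results_per_query for url in result]
unique_urls = deduplicate_urls(all_urls)

print(len(all_urls), "retrieved,", len(unique_urls), "unique")
```

In this toy example six URLs are retrieved but only three survive, which is the same effect you see in the URL list.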
Click “Next”.
Build corpus
This is the final step.
Not only will the pages be downloaded, they will also be automatically cleaned:
- HTML code will be removed
- boilerplate (i.e. things like menus, ads, disclaimers) will be stripped
Since it's automated, the cleaning process is far from perfect and some unwanted elements will still be present in the corpus.
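To give an idea of what removing HTML code involves, here is a minimal Python sketch based on the standard library's `html.parser` module. It is not the method BootCaT uses: the names `TextExtractor` and `strip_html` are made up for illustration, the set of block-level tags is an arbitrary assumption, and the sketch makes no attempt at boilerplate stripping, which is a much harder problem.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, ignoring <script> and <style> content."""

    SKIPPED = {"script", "style"}
    # Assumed set of tags that should start a new line in the output.
    BLOCK = {"p", "div", "li", "br", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Normalise whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>corpus</b>!</p><p>Second paragraph.</p></body></html>"
)
print(strip_html(sample))
```

Even on this tiny sample you can see the kind of decisions an automatic cleaner has to make (which tags break a line, which content to skip), which is why some unwanted elements slip through on real pages.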
Click on “Build corpus” to start the corpus creation process. This will take a while, depending on Internet traffic, connection speed and number of URLs to download.
Once the download is complete, click “Open corpus folder”.
The folder containing the corpus data will be displayed.
What now?
Congratulations, you have created your first web corpus!
Now you can use your favourite corpus analysis tools to work on your corpus; here's a list of programs you might find useful.
If you want to manually inspect the corpus you just created, there are a number of text editors you can use. If you're on Mac or Linux you already have everything you need; if you're on Windows we strongly recommend the free Notepad++, since the default Windows Notepad will not display the corpus correctly.
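If you prefer a quick programmatic check over opening files one by one, a short Python sketch like the following can count the files and words in a folder. It rests on assumptions you should verify against your own corpus folder: that the texts are stored as `.txt` files and that they are encoded in UTF-8. The function name `corpus_summary` is made up for illustration.

```python
from pathlib import Path


def corpus_summary(folder):
    """Return (number of .txt files, number of whitespace-separated tokens)."""
    files = sorted(Path(folder).glob("*.txt"))
    tokens = 0
    for path in files:
        # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        tokens += len(text.split())
    return len(files), tokens
```

For example, `corpus_summary("path/to/your/corpus")` returns a pair such as the number of files and the total token count, where the path is a placeholder for wherever your corpus folder is located.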