
BootCaT front-end tutorial - Part 3

It's time to query the search engine using the tuples we generated earlier. The search engine returns only a limited number of pages for each query (i.e. tuple) we submit; the default is 10 URLs per query, and we won't change it.

:!: Increasing the number of pages per query will result in a larger corpus, but the additional content will be progressively less relevant.

Click “Collect URLs” to start collecting URLs from the search engine. This might take a while, depending on Internet traffic and the speed of your connection.

:!: In this step we only collect the URLs (i.e. the Internet addresses) of pages; the actual pages will be downloaded in a later step.

Now you can manually edit the list of URLs. We won't do it here, but if you want to give it a try just de-select the box “I'm done editing URLs” to activate the text box and start editing. Check the box again when you're done.

:!: Notice how the total number of collected URLs appears to be wrong: we generated 15 queries and instructed BootCaT to retrieve 10 URLs per query, so the total should be 150. What happened? Simple: quite a few URLs were retrieved more than once (remember that the queries can be very similar to one another), and the duplicates were automatically discarded by BootCaT.
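
To see why the total falls short of queries × URLs per query, here is a minimal Python sketch of merging per-query results while dropping duplicates. It is only an illustration, not BootCaT's actual code: the function name `collect_unique_urls` and the example URLs are made up for this example.

```python
def collect_unique_urls(results_per_query):
    """Merge per-query URL lists, keeping only the first occurrence of each URL."""
    seen = set()
    unique = []
    for urls in results_per_query:
        for url in urls:
            if url not in seen:
                seen.add(url)
                unique.append(url)
    return unique


# Hypothetical results: three queries returning three URLs each (9 in total).
results = [
    ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    ["http://example.com/b", "http://example.com/d", "http://example.com/a"],
    ["http://example.com/e", "http://example.com/c", "http://example.com/f"],
]

unique_urls = collect_unique_urls(results)
print(len(unique_urls))  # 6 rather than 9: three URLs were duplicates
```

The more similar the queries are to one another, the more their result lists overlap, and the further the final count drops below the theoretical maximum.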

Click “Next”.

This is the final step: click “Build corpus” to start the corpus creation process.

Not only will the pages be downloaded, they will also be automatically cleaned:

  • HTML code will be removed
  • boilerplate (i.e. things like menus, ads, disclaimers) will be stripped

Since it's automated, the cleaning process is far from perfect and some unwanted elements will still be present in the corpus.
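
As a rough idea of what such a cleaning step involves, the sketch below strips HTML tags and skips the content of a few elements that often hold boilerplate. It is a simplified illustration built on Python's standard `html.parser` module, not the method BootCaT uses: the class `TextExtractor`, the function `clean_page`, and the list of skipped tags are all assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring elements that usually hold boilerplate."""

    # Assumed list of tags to skip; real pages mark up boilerplate in many other ways.
    SKIPPED = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def clean_page(html):
    """Return the text of an HTML page, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<p>Corpus text we want to keep.</p>"
    "<footer>Disclaimer</footer></body></html>"
)
cleaned = clean_page(sample)
print(cleaned)  # Corpus text we want to keep.
```

A tag-based heuristic like this one misses any menu, ad or disclaimer that is not wrapped in one of the listed elements, which is one reason automated cleaning leaves unwanted material behind.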

  • tutorial/basic_3.1271159109.txt.gz
  • Last modified: 2010/04/13 13:45
  • by eros