Differences
This shows you the differences between two versions of the page.
tutorial:basic_3 [2010/04/13 13:52] – eros | tutorial:basic_3 [2010/04/13 16:19] (current) – removed eros | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== BootCaT front-end tutorial - Part 3 ====== | ||
- | [[tutorials: | ||
- | |||
- | ==== Collect URL ==== | ||
- | |||
- | It's time to query the search engine using the tuples we generated earlier. The search engine returns only a limited number of pages for each query (i.e. tuple) we submit; the default is 10 URLs per query, and we won't change it. | ||
- | |||
- | {{: | ||
- | |||
- | :!: Increasing the number of pages will result in a larger corpus, but the content will be increasingly less relevant. | ||
- | |||
- | Click " | ||
- | |||
- | {{: | ||
- | |||
- | :!: In this step we only collect the URLs (i.e. the Internet addresses) of pages; the actual pages will be downloaded in a later step. | ||
- | |||
- | ==== Edit the URL list ==== | ||
- | |||
- | Now you can manually edit the list of URLs. We won't do it here, but if you want to give it a try just de-select the box " | ||
- | |||
- | {{: | ||
- | |||
- | :!: Notice how the total number of collected URLs appears to be wrong: we generated 15 queries and instructed BootCaT to retrieve 10 URLs per query, so the total should be 15 x 10 = 150. What happened, then? Simple: quite a few URLs were retrieved more than once (remember that the queries can be very similar to one another), and the duplicates were automatically discarded by BootCaT. | ||
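The effect described above can be illustrated with a minimal Python sketch. It is not BootCaT's own code: the function name `merge_url_lists` and the sample URLs are hypothetical, and the only assumption is that duplicates are dropped while the first occurrence of each URL is kept.

```python
from itertools import chain


def merge_url_lists(results_per_query):
    """Flatten per-query URL lists, keeping the first occurrence of each URL.

    Hypothetical helper for illustration only; not part of BootCaT.
    """
    seen = set()
    unique = []
    for url in chain.from_iterable(results_per_query):
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


# Made-up data: 3 queries returning 2 URLs each (6 results in total).
results = [
    ["http://example.org/a", "http://example.org/b"],
    ["http://example.org/b", "http://example.org/c"],
    ["http://example.org/a", "http://example.org/c"],
]

urls = merge_url_lists(results)
print(len(urls))  # 3 unique URLs, not 6
```

The same reasoning explains why 15 queries at 10 URLs each can yield fewer than 150 URLs: similar queries tend to return overlapping results.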
- | |||
- | Click " | ||
- | ==== Build corpus ==== | ||
- | |||
- | This is the final step. | ||
- | |||
- | Not only will the pages be downloaded, they will also be automatically cleaned: | ||
- | |||
- | * HTML code will be removed | ||
- | * boilerplate (i.e. things like menus, ads, disclaimers) will be stripped | ||
- | |||
- | Since it's automated, the cleaning process is far from perfect and some unwanted elements will still be present in the corpus. | ||
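To give a rough idea of what "removing HTML code" involves, here is a minimal Python sketch based on the standard library's `html.parser`. It is an assumption-laden illustration, not the cleaning routine BootCaT actually uses: the names `TextExtractor` and `html_to_text` are hypothetical, and boilerplate stripping (menus, ads, disclaimers) is not attempted because it needs heuristics that go well beyond tag removal.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>corpus</b>!</p><script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

Even on this tiny sample the output is split into awkward fragments, which hints at why automated cleaning always leaves some unwanted material behind.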
- | |||
- | {{: | ||
- | |||
- | Click on "Build corpus" | ||
- | |||
- | {{: | ||
- | |||
- | Once the download is complete, click "Open corpus folder" | ||
- | |||
- | {{: | ||
- | |||
- | The folder containing the corpus data will be displayed. | ||
- | |||
- | {{: |