Differences
This shows you the differences between two versions of the page.
| tutorial:basic_3 [2010/04/13 14:02] – eros | tutorial:basic_3 [2010/04/13 16:19] (current) – removed eros | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ====== BootCaT front-end tutorial - Part 3 ====== | ||
| - | [[tutorials: | ||
| - | |||
| - | ==== Collect URL ==== | ||
| - | |||
| - | It's time to query the search engine using the tuples we generated earlier. The search engine returns only a limited number of pages for each query (i.e. tuple) we submit; the default is 10 URLs per query, and we won't change it. | ||
| - | |||
| - | {{: | ||
| - | |||
| - | :!: Increasing the number of pages will result in a larger corpus, but the content will be increasingly less relevant. | ||
| - | |||
| - | Click " | ||
| - | |||
| - | {{: | ||
| - | |||
| - | :!: In this step we only collect the URLs (i.e. the Internet addresses) of the pages; the pages themselves will be downloaded in a later step. | ||
| - | |||
| - | ==== Edit the URL list ==== | ||
| - | |||
| - | Now you can manually edit the list of URLs. We won't do it here, but if you want to give it a try just de-select the box " | ||
| - | |||
| - | {{: | ||
| - | |||
| - | :!: Notice how the total number of collected URLs appears to be wrong: we generated 15 queries and instructed BootCaT to retrieve 10 URLs per query, so the total should be 150. What happened? Quite a few URLs were retrieved more than once (remember that the queries can be very similar to one another), and BootCaT automatically discarded the duplicates. | ||
| - | |||
| - | Click " | ||
| - | ==== Build corpus ==== | ||
| - | |||
| - | This is the final step. | ||
| - | |||
| - | Not only will the pages be downloaded, they will also be automatically cleaned: | ||
| - | |||
| - | * HTML code will be removed | ||
| - | * boilerplate (i.e. things like menus, ads, disclaimers) will be stripped | ||
| - | |||
| - | Since it's automated, the cleaning process is far from perfect and some unwanted elements will still be present in the corpus. | ||
| - | |||
| - | {{: | ||
| - | |||
| - | Click on "Build corpus" | ||
| - | |||
| - | {{: | ||
| - | |||
| - | Once the download is complete, click "Open corpus folder" | ||
| - | |||
| - | {{: | ||
| - | |||
| - | The folder containing the corpus data will be displayed. | ||
| - | |||
| - | {{: | ||
| - | |||
| - | ==== What now? ==== | ||
| - | |||
| - | Congratulations, | ||
| - | |||
| - | Now you can use your favourite corpus analysis tools to work on your corpus; here's a [[http:// | ||
| - | |||
| - | If you want to manually inspect the corpus you just created, there' | ||
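
The note in "Edit the URL list" explains why 15 queries at 10 URLs each can yield fewer than 150 URLs: similar queries return overlapping results, and the repeats are dropped. The following is a minimal sketch of that kind of duplicate removal, not BootCaT's actual code; the function name `dedupe_urls` and the example URLs are hypothetical, and it assumes that only exact string matches count as duplicates.

```python
from typing import Iterable, List


def dedupe_urls(urls: Iterable[str]) -> List[str]:
    """Return URLs in first-seen order, dropping exact repeats."""
    seen = set()
    unique = []
    for url in urls:
        key = url.strip()  # ignore surrounding whitespace only
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical data: 3 queries x 2 URLs each, with two URLs returned twice.
results_per_query = [
    ["http://example.org/a", "http://example.org/b"],
    ["http://example.org/b", "http://example.org/c"],
    ["http://example.org/a", "http://example.org/d"],
]
all_urls = [u for batch in results_per_query for u in batch]
unique_urls = dedupe_urls(all_urls)
print(len(all_urls), len(unique_urls))
```

Here the expected total is 3 x 2 = 6 URLs, but only 4 remain after the repeats are discarded, which mirrors the gap between 150 and the number shown in the tutorial. A real tool might also normalise URLs (for example, trailing slashes) before comparing them; this sketch does not.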