Differences
This shows you the differences between two versions of the page.
tutorial:basic_3 [2010/04/13 14:58] – federico | tutorial:basic_3 [2010/04/13 16:19] (current) – removed eros | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== BootCaT front-end tutorial - Part 3 ====== | ||
- | [[tutorials: | ||
- | ==== Collect URL ==== | ||
- | |||
- | It's time to query the search engine using the tuples we generated earlier. The search engine will return only a limited number of pages for each query (i.e. tuple) we submit; the default value is 10 URLs per query and we won't change it. | ||
- | |||
- | {{: | ||
- | |||
- | :!: Increasing the number of pages will result in a larger corpus, but its contents will tend to become less relevant. | ||
- | |||
- | Click " | ||
- | |||
- | {{: | ||
- | |||
- | :!: In this step we only collect the URLs (i.e. the Internet addresses) of pages; the actual pages will be downloaded in a later step. | ||
- | ==== Edit the URL list ==== | ||
- | |||
- | Now you can manually edit the list of URLs. We won't do it here, but if you want to give it a try just de-select the box " | ||
- | |||
- | {{: | ||
- | |||
- | :!: Notice how the total number of collected URLs appears to be wrong: we generated 15 queries and instructed BootCaT to retrieve 10 URLs per query, so the total should be 150. What happened then? Simple: quite a few URLs were retrieved more than once (remember that the queries can be very similar to one another), and the duplicates were automatically discarded by BootCaT. | ||
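The arithmetic behind this note can be sketched in Python. This is only an illustration, not BootCaT's actual code: the function name `collect_unique_urls` and the example addresses are hypothetical, and it assumes duplicates are detected by exact string match.

```python
def collect_unique_urls(results_per_query):
    """Merge per-query URL lists, dropping duplicates and keeping first-seen order."""
    seen = set()
    unique = []
    for urls in results_per_query:
        for url in urls:
            if url not in seen:
                seen.add(url)
                unique.append(url)
    return unique


# Hypothetical results for three similar queries (made-up addresses).
results = [
    ["http://example.org/a", "http://example.org/b"],
    ["http://example.org/b", "http://example.org/c"],
    ["http://example.org/a", "http://example.org/c"],
]

requested = sum(len(urls) for urls in results)  # queries x URLs per query
unique_urls = collect_unique_urls(results)
print(requested, len(unique_urls))  # prints "6 3"
```

In the same way, 15 queries at 10 URLs each request 150 addresses, but the list shown after deduplication can be considerably shorter.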
- | |||
- | Click " | ||
- | ==== Build corpus ==== | ||
- | |||
- | This is the final step. | ||
- | |||
- | Not only will the pages be downloaded, they will also be automatically cleaned: | ||
- | |||
- | * HTML code will be removed | ||
- | * boilerplate (i.e. things like menus, navigation bars, ads, disclaimers, | ||
- | |||
- | The purpose of this stage is to get rid of elements that are part of the downloaded web pages but are very unlikely to be of interest to corpus users. However, since the process is automated, the cleaning is far from perfect and some unwanted elements will still be present in the corpus. | ||
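To give a rough idea of what removing HTML code involves, here is a minimal Python sketch built on the standard library's `html.parser`. It is an assumption-laden illustration, not the cleaning routine BootCaT uses: the names `TextExtractor` and `strip_html` are hypothetical, and the sketch only strips markup; it makes no attempt at boilerplate removal, which is the harder part of the job.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_html(html):
    """Return the text of an HTML string, one text fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample page.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>corpus</b> users.</p></body></html>"
)
print(strip_html(sample))
```

Menus, navigation bars and ads are ordinary text as far as this sketch is concerned, so they would survive it; this is why a separate boilerplate-removal step is needed and why its results are imperfect.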
- | |||
- | {{: | ||
- | |||
- | Click on "Build corpus" | ||
- | |||
- | {{: | ||
- | |||
- | Once the download is complete, click "Open corpus folder" | ||
- | |||
- | {{: | ||
- | |||
- | The contents of the folder where the corpus data is stored will be displayed. | ||
- | |||
- | {{: | ||
- | |||
- | ==== What now? ==== | ||
- | |||
- | Congratulations, | ||
- | |||
- | Now you can use your favourite corpus analysis tools to work on your corpus; here's a [[http:// | ||
- | |||
- | If you want to manually inspect the corpus you just created, there' |