Differences
This shows you the differences between two versions of the page.
tutorials:basic_3 [2010/06/15 16:13] – eros | tutorials:basic_3 [2012/05/30 15:22] (current) – removed eros | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== BootCaT front-end tutorial - Part 3 ====== | ||
- | [[tutorials: | ||
- | ==== Collect URLs ==== | ||
- | |||
- | It's time to query the search engine using the tuples we generated earlier. The search engine looks for web pages that contain the tuples (combinations of our seeds) generated in the previous step. This identifies texts that are relevant to the domain we are interested in; how specific or general the resulting corpus is depends on how specialized or general the seeds are. | ||
- | |||
- | The search engine will return only a limited number of pages for each query (i.e. tuple) we submit; the default value is 10 URLs per query and we won't change it. | ||
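
To make the "one query per tuple, at most 10 URLs per query" idea concrete, here is a minimal sketch in Python. It is not BootCaT's own code: the `collect_urls` function, the `fake_search` stand-in, the seed words, and the example.org addresses are all hypothetical and only illustrate the logic.

```python
def collect_urls(tuples, search, urls_per_query=10):
    """Run one query per tuple and keep at most urls_per_query results each.

    `search` is a hypothetical callable that takes a query string and
    returns a list of URLs; it stands in for a real search engine.
    """
    collected = []
    for seed_tuple in tuples:
        query = " ".join(seed_tuple)  # a tuple is a combination of seeds
        results = search(query)
        collected.extend(results[:urls_per_query])
    return collected


def fake_search(query):
    # Stand-in for a search engine: returns 25 made-up URLs per query.
    slug = query.replace(" ", "-")
    return [f"http://example.org/{slug}/{i}" for i in range(25)]


# Hypothetical tuples, as they might come out of the previous step.
tuples = [("dog", "leash", "kennel"), ("dog", "vet", "puppy")]
urls = collect_urls(tuples, fake_search)
print(len(urls))
```

With two tuples and the default limit, at most 20 URLs are collected, even though the stand-in search returns 25 results per query.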
- | |||
- | {{: | ||
- | |||
- | :!: Increasing the number of pages will result in a larger corpus, but its contents will tend to become less relevant. | ||
- | |||
- | Click " | ||
- | |||
- | {{: | ||
- | |||
- | :!: In this step we only collect the URLs (i.e. the Internet addresses) of pages; the actual pages will be downloaded in a later step. | ||
- | ==== Edit the URL list ==== | ||
- | |||
- | In this step you can remove URLs that you think might not be interesting. Just for fun, try unchecking the box next to a couple of URLs: notice how the number of " | ||
- | |||
- | {{: | ||
- | |||
- | :!: Notice how the number of "Total URLs" appears to be wrong: we generated 15 queries and instructed BootCaT to retrieve 10 URLs per query, so the total should be 150. What happened? Quite a few URLs were retrieved more than once (the queries can be very similar to one another, because the tuples overlap to a large extent), and BootCaT automatically eliminated the duplicates. | ||
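
The arithmetic behind this note can be sketched in a few lines of Python. This is only an illustration of order-preserving de-duplication, not BootCaT's implementation; the `deduplicate` helper and the example.org URLs are made up.

```python
def deduplicate(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    return list(dict.fromkeys(urls))


queries = 15
urls_per_query = 10
upper_bound = queries * urls_per_query  # 150: the most URLs we could get

# Made-up results from overlapping queries: the first URL comes back twice.
retrieved = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/a",
]
unique = deduplicate(retrieved)
print(upper_bound, len(retrieved), len(unique))
```

The total shown after de-duplication can therefore be anywhere up to 150, and it gets smaller the more the queries overlap.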
- | |||
- | Click " | ||
- | |||
- | ==== Build corpus ==== | ||
- | |||
- | This is the final step. | ||
- | |||
- | Not only will the pages be downloaded, they will also be automatically cleaned: | ||
- | |||
- | * HTML code will be removed | ||
- | * boilerplate (i.e. things like menus, navigation bars, ads, disclaimers, | ||
- | |||
- | The purpose of this stage is to get rid of elements that are part of the downloaded web pages but are very unlikely to be of interest to corpus users. However, since the cleaning process is automated, it is far from perfect, so be aware that some unwanted elements will still be present in the corpus. | ||
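
As a rough illustration of what such a cleaning step involves, the sketch below strips HTML tags and skips the contents of a few elements that usually hold boilerplate. It uses only Python's standard `html.parser` module and is an assumption-laden simplification, not the method BootCaT uses; the `SKIPPED_TAGS` set, `TextExtractor` class, and `clean_html` function are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical choice of elements treated as boilerplate in this sketch.
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def clean_html(html):
    """Return the text of an HTML page, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<p>Corpus text.</p><script>var x = 1;</script></body></html>"
)
print(clean_html(page))
```

A tag-based rule like this misses boilerplate that sits in ordinary `<div>` or `<p>` elements, which is one reason automated cleaning leaves some unwanted material behind.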
- | |||
- | {{: | ||
- | |||
- | Click on "Build corpus" | ||
- | |||
- | {{: | ||
- | |||
- | Once the download is complete click "Open corpus folder" | ||
- | |||
- | {{: | ||
- | |||
- | The contents of the folder where the corpus data is stored will be displayed. | ||
- | |||
- | {{: | ||
- | |||
- | [[tutorials: |