Differences
This shows you the differences between two versions of the page.
tutorials:basic_3 [2011/08/23 17:08] – eros | tutorials:basic_3 [2012/05/30 15:22] (current) – removed eros | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== BootCaT front-end tutorial - Part 3 ====== | ||
- | [[tutorials: | ||
- | ==== Collect URLs ==== | ||
- | |||
- | It's time to query the search engine (i.e. Yahoo) using the tuples we generated earlier. We search the web through the search engine for pages that contain the tuples (combinations of our seeds) generated in the previous step. This identifies texts relevant to the domain we are interested in; how specialized or general the resulting corpus is depends on how specialized or general the seeds are. | ||
- | |||
- | The search engine will return only a limited number of pages for each query (i.e. tuple) we submit; the default value is 10 URLs per query and we won't change it. | ||
- | |||
- | {{: | ||
- | |||
- | :!: Increasing the number of pages will result in a larger corpus, but its contents will tend to become less relevant. | ||
- | |||
- | Click " | ||
- | |||
- | {{: | ||
- | |||
- | :!: In this step we only collect the URLs (i.e. the Internet addresses) of pages; the actual pages will be downloaded in a later step. | ||
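
The URL collection step described above can be sketched roughly as follows. This is only an illustration, not the BootCaT front-end's actual code: `collect_urls`, `fake_search`, and the `search` callback are hypothetical names, the real search-engine client is not shown, and removing duplicate URLs is an assumption rather than documented behaviour.

```python
from typing import Callable, Iterable, List, Sequence

# Hypothetical signature: takes a query string and a result limit,
# returns a list of URLs. Stands in for a real search-engine client.
SearchFn = Callable[[str, int], List[str]]


def collect_urls(
    tuples: Iterable[Sequence[str]],
    search: SearchFn,
    max_urls_per_query: int = 10,  # the tutorial's default of 10 URLs per query
) -> List[str]:
    """Submit one query per tuple and gather the returned URLs.

    Only the addresses are collected; no page is downloaded here.
    """
    seen = set()
    urls: List[str] = []
    for seed_tuple in tuples:
        # Quote multi-word seeds so they are searched as phrases.
        query = " ".join(f'"{s}"' if " " in s else s for s in seed_tuple)
        results = search(query, max_urls_per_query)[:max_urls_per_query]
        for url in results:
            # Assumption: keep each URL once, in first-seen order.
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def fake_search(query: str, limit: int) -> List[str]:
    """Stand-in search function that fabricates URLs for demonstration."""
    slug = query.replace(" ", "_")
    return [f"http://example.org/{slug}/{i}" for i in range(limit)]


if __name__ == "__main__":
    demo_tuples = [("dog", "cat", "bird"), ("dog", "fish", "bird")]
    for url in collect_urls(demo_tuples, fake_search, max_urls_per_query=2):
        print(url)
```

Raising `max_urls_per_query` in this sketch corresponds to increasing the number of pages per query: the list grows, but lower-ranked results for each tuple are more likely to be off-topic.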
- | |||
- | |||
- | |||
- | [[tutorials: |