tutorial:basic_3

tutorial:basic_3 [2010/04/13 14:58] federicotutorial:basic_3 [2010/04/13 16:19] (current) – removed eros
-====== BootCaT front-end tutorial - Part 3 ====== 
  
-[[tutorials:basic_2|Back to part 2 of the tutorial]] 
-==== Collect URLs ====
- 
-It's time to query the search engine using the tuples we generated earlier. The search engine will return only a limited number of pages for each query (i.e. tuple) we submit; the default value is 10 URLs per query and we won't change it. 
- 
-{{:tutorials:basic_steps:09.jpg|}} 
- 
-:!: Increasing the number of pages will result in a larger corpus, but its contents will tend to become less relevant. 
- 
-Click "Collect URLs" to start collecting **URLs** from the search engine. This might take a while, depending on Internet traffic and the speed of your connection.
- 
-{{:tutorials:basic_steps:10.jpg|}} 
- 
-:!: In this step we only collect the URLs (i.e. the Internet addresses) of pages; the actual pages will be downloaded in a later step.
-==== Edit the URL list ==== 
- 
-Now you can manually edit the list of URLs. We won't do it here, but if you want to give it a try, just deselect the box "I'm done editing URLs" to activate the text box and start editing (i.e. delete the line for each URL that you would like to remove). Check the box again when you're done.
- 
-{{:tutorials:basic_steps:12.jpg|}} 
- 
-:!: Notice how the total number of collected URLs appears to be wrong: we generated 15 queries and instructed BootCaT to retrieve 10 URLs per query, so the total should be 150. What happened then? Simple: quite a few URLs were retrieved more than once (remember that the queries can be very similar to one another), and the duplicates were automatically discarded by BootCaT.
- 
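The arithmetic in the note above can be illustrated with a minimal Python sketch. It is not BootCaT's own code: the function name ''collect_unique_urls'' and the ''example.org'' URLs are hypothetical, and the overlap between queries is simulated so that 15 queries of 10 URLs each produce fewer than 150 distinct addresses.

```python
def collect_unique_urls(results_per_query):
    """Merge per-query URL lists, keeping only the first occurrence of each URL."""
    seen = set()
    unique = []
    for urls in results_per_query:
        for url in urls:
            if url not in seen:
                seen.add(url)
                unique.append(url)
    return unique


# Hypothetical search results: 15 queries, 10 URLs each.
# Neighbouring queries share pages, mimicking tuples that are similar to one another.
results = [
    [f"http://example.org/page{(q * 4 + i) % 60}" for i in range(10)]
    for q in range(15)
]

total = sum(len(urls) for urls in results)  # URLs returned by the search engine
unique = collect_unique_urls(results)       # URLs left after discarding duplicates

print(f"{total} URLs retrieved, {len(unique)} unique")
```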
-Click "Next". 
-==== Build corpus ==== 
- 
-This is the final step.  
- 
-Not only will the pages be downloaded, they will also be automatically cleaned: 
- 
-  * HTML code will be removed 
-  * boilerplate (i.e. things like menus, navigation bars, ads, disclaimers, automatic error messages) will be stripped 
- 
-The purpose of this stage is to get rid of elements that are part of the downloaded web pages but are very unlikely to be of interest to corpus users. However, since the cleaning is automated, it is far from perfect, and some unwanted elements will still be present in the corpus.
- 
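To give a rough idea of what this kind of cleaning involves, here is a minimal Python sketch built on the standard library's ''html.parser''. It is an illustration only and does not reproduce BootCaT's actual cleaning method: the class ''TextExtractor'', the function ''clean_page'' and the set of tags treated as boilerplate are all assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping tags assumed to hold boilerplate."""

    # Assumption: these containers rarely hold text of interest to corpus users.
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def clean_page(html):
    """Return the text of an HTML page with markup and skipped containers removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<p>Corpus linguistics studies language use.</p>"
    "<footer>Disclaimer</footer></body></html>"
)
cleaned = clean_page(sample)
print(cleaned)
```

A tag-based rule like this misses boilerplate that sits in ordinary paragraphs, which is one reason automated cleaning leaves some unwanted elements behind.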
-{{:tutorials:basic_steps:13.jpg|}} 
- 
-Click on "Build corpus" to start the corpus creation process. This will take a while, depending on Internet traffic, connection speed and number of URLs to download. 
- 
-{{:tutorials:basic_steps:14.jpg|}} 
- 
-Once the download is complete, click "Open corpus folder".
- 
-{{:tutorials:basic_steps:15.jpg|}} 
- 
-The contents of the folder where the corpus data is stored will be displayed. 
- 
-{{:tutorials:basic_steps:16.jpg|}} 
- 
-==== What now? ==== 
- 
-Congratulations, you have created your first web corpus! 
- 
-Now you can use your favourite corpus analysis tools to work on your corpus. Here's a [[http://sslmit.unibo.it/~eros/teaching_software.php#concordancers|list of programs]] you might find useful.
- 
-If you want to manually inspect the corpus you just created, there are a number of text editors you can use. If you're on Mac or Linux, you already have everything you need. If you're on Windows, we strongly recommend the free [[http://notepad-plus.sourceforge.net|Notepad++]], since the default Windows Notepad will not display the corpus correctly.
  • tutorial/basic_3.1271163522.txt.gz
  • Last modified: 2010/04/13 14:58
  • by federico