tutorial:basic_3

tutorial:basic_3 [2010/04/13 14:02] eros
tutorial:basic_3 [2010/04/13 16:19] (current) – removed eros
Line 1: Line 1:
-====== BootCaT front-end tutorial - Part 3 ====== 
  
-[[tutorials:basic_2|Back to part 2 of the tutorial]] 
- 
-==== Collect URLs ====
- 
-It's time to query the search engine using the tuples we generated earlier. The search engine will return only a limited number of pages for each query (i.e. tuple) we submit. The default value is 10 URLs per query, and we won't change it.
- 
-{{:tutorials:basic_steps:09.jpg|}} 
- 
-:!: Increasing the number of pages per query will result in a larger corpus, but the additional pages will tend to be less relevant.
- 
-Click "Collect URLs" to start collecting **URLs** from the search engine. This might take a while, depending on Internet traffic and the speed of your connection.
- 
-{{:tutorials:basic_steps:10.jpg|}} 
- 
-:!: In this step we only collect the URLs (i.e. the Internet addresses) of pages. The actual pages will be downloaded in a later step.
- 
-==== Edit the URL list ==== 
- 
-Now you can manually edit the list of URLs. We won't do it here, but if you want to give it a try, just uncheck the box "I'm done editing URLs" to activate the text box and start editing. Check the box again when you're done.
- 
-{{:tutorials:basic_steps:12.jpg|}} 
- 
-:!: Notice how the total number of collected URLs appears to be wrong: we generated 15 queries and instructed BootCaT to retrieve 10 URLs per query, so the total should be 150. What happened then? Simple: quite a few URLs were retrieved more than once (remember that the queries can be very similar to one another), and the duplicates were automatically discarded by BootCaT.
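
To see why the total falls short of queries times URLs per query, here is a minimal Python sketch of duplicate removal. It is only an illustration and not BootCaT's actual code: the function name ''collect_unique_urls'' and the example.org addresses are made up for this example.

```python
from typing import Iterable, List


def collect_unique_urls(results_per_query: Iterable[Iterable[str]]) -> List[str]:
    """Merge per-query URL lists, keeping only the first occurrence of each URL."""
    seen = set()
    unique = []
    for urls in results_per_query:
        for url in urls:
            if url not in seen:
                seen.add(url)
                unique.append(url)
    return unique


# Hypothetical results for three similar queries (made-up addresses).
results = [
    ["http://example.org/a", "http://example.org/b"],
    ["http://example.org/b", "http://example.org/c"],
    ["http://example.org/a", "http://example.org/c"],
]

unique_urls = collect_unique_urls(results)
total_returned = sum(len(urls) for urls in results)
print(f"{total_returned} URLs returned, {len(unique_urls)} unique")
```

The more the queries overlap, the larger the gap between the number of URLs returned and the number of unique URLs kept.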
- 
-Click "Next".
-
-==== Build corpus ====
- 
-This is the final step.  
- 
-Not only will the pages be downloaded, they will also be automatically cleaned: 
- 
-  * HTML code will be removed 
-  * boilerplate (i.e. things like menus, ads, disclaimers) will be stripped 
- 
-Since it's automated, the cleaning process is far from perfect and some unwanted elements will still be present in the corpus. 
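
The two cleaning steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the method BootCaT uses: the class ''TextExtractor'', the helpers ''strip_html'' and ''drop_short_lines'', and the "fewer than 5 words means boilerplate" rule are all invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the text fragments of an HTML page, one per text node."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts


def drop_short_lines(lines, min_words=5):
    """Naive boilerplate filter (an assumption): keep lines with enough words."""
    return [line for line in lines if len(line.split()) >= min_words]


page = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<ul><li>Home</li><li>Contact us</li></ul>"
    "<p>Corpus linguistics studies language through large text samples.</p>"
    "</body></html>"
)
text_lines = strip_html(page)
clean_lines = drop_short_lines(text_lines)
print(clean_lines)
```

A length-based filter like this one drops short menu entries but also discards short legitimate sentences, which is one reason automated cleaning leaves unwanted elements behind or removes wanted ones.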
- 
-{{:tutorials:basic_steps:13.jpg|}} 
- 
-Click on "Build corpus" to start the corpus creation process. This will take a while, depending on Internet traffic, connection speed and number of URLs to download. 
- 
-{{:tutorials:basic_steps:14.jpg|}} 
- 
-Once the download is complete, click "Open corpus folder".
- 
-{{:tutorials:basic_steps:15.jpg|}} 
- 
-The folder containing the corpus data will be displayed. 
- 
-{{:tutorials:basic_steps:16.jpg|}} 
- 
-==== What now? ==== 
- 
-Congratulations, you have created your first web corpus! 
- 
-Now you can use your favourite corpus analysis tools to work on your corpus. Here's a [[http://sslmit.unibo.it/~eros/teaching_software.php#concordancers|list of programs]] you might find useful.
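
If you are curious about what a concordancer does, here is a minimal keyword-in-context (KWIC) sketch in Python. It is a toy illustration, not a substitute for the programs listed above: the function name ''kwic'', the word-based tokenisation and the sample sentence are assumptions made for this example.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit, in tokens."""
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Made-up sample text; in practice you would read a file from the corpus folder.
sample = "A web corpus is a corpus built from web pages."
hits = kwic(sample, "corpus", width=2)
for left, word, right in hits:
    print(f"{left:>12} [{word}] {right}")
```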
- 
-If you want to manually inspect the corpus you just created, there are a number of text editors you can use. If you're on Mac or Linux, you already have everything you need. If you're on Windows, we strongly recommend the free [[http://notepad-plus.sourceforge.net|Notepad++]], since the default Windows Notepad will not display the corpus correctly.
Last modified: 2010/04/13 14:02 by eros