Differences

This shows you the differences between two versions of the page.

--- tutorial:basic_3 [2010/04/13 14:02] – eros
+++ tutorial:basic_3 [2010/04/13 16:19] (current) – removed eros
@@ Line 1: / Line 1: @@
-====== BootCaT front-end tutorial - Part 3 ======
-[[tutorials:basic_2|Back to part 2 of the tutorial]]
-==== Collect URLs ====
-It's time to query the search engine using the tuples we generated earlier. The search engine will return only a limited number of pages for each query (i.e. tuple) we submit; the default value is 10 URLs per query, and we won't change it.
-{{:tutorials:basic_steps:09.jpg|}}
-:!: Increasing the number of pages per query will result in a larger corpus, but the additional pages tend to be less relevant.
-Click "Collect URLs" to start collecting **URLs** from the search engine. This might take a while, depending on Internet traffic and the speed of your connection.
-{{:tutorials:basic_steps:10.jpg|}}
-:!: In this step we only collect the URLs (i.e. the Internet addresses) of pages; the actual pages will be downloaded in a later step.
-==== Edit the URL list ====
-Now you can manually edit the list of URLs. We won't do it here, but if you want to give it a try, just deselect the "I'm done editing URLs" box to activate the text box and start editing. Check the box again when you're done.
-{{:tutorials:basic_steps:12.jpg|}}
-:!: Notice how the total number of collected URLs appears to be wrong: we generated 15 queries and instructed BootCaT to retrieve 10 URLs per query, so the total should be 150. What happened? Quite a few URLs were retrieved more than once (remember that the queries can be very similar to one another), and the duplicates were automatically discarded by BootCaT.
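The effect described in the note above can be illustrated with a minimal Python sketch. This is not BootCaT's own code: the function name ''dedupe_urls'' and the example URLs are hypothetical, and exact string comparison is an assumption about what counts as a duplicate.

```python
from typing import Iterable, List


def dedupe_urls(urls: Iterable[str]) -> List[str]:
    """Return the URLs in first-seen order, dropping exact duplicates."""
    seen = set()
    unique = []
    for url in urls:
        url = url.strip()
        # Skip blank entries and anything already collected.
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


# Hypothetical results: three similar queries that share some hits.
results_per_query = [
    ["http://example.org/a", "http://example.org/b"],
    ["http://example.org/b", "http://example.org/c"],
    ["http://example.org/a", "http://example.org/c"],
]

# Flatten the per-query result lists into one list, then deduplicate.
all_urls = [url for hits in results_per_query for url in hits]
unique_urls = dedupe_urls(all_urls)
print(len(all_urls), len(unique_urls))  # 6 retrieved, 3 kept
```

The more the queries overlap, the further the final count falls below queries × URLs per query.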
-Click "Next".
-==== Build corpus ====
-This is the final step.
-Not only will the pages be downloaded, they will also be automatically cleaned:
-  * HTML code will be removed
-  * boilerplate (i.e. things like menus, ads, disclaimers) will be stripped
-Since it's automated, the cleaning process is far from perfect and some unwanted elements will still be present in the corpus.
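To give a rough idea of what "removing HTML code" involves, here is a minimal sketch built on Python's standard ''html.parser'' module. It is an illustration under stated assumptions, not the cleaning routine BootCaT uses: the names ''TextExtractor'' and ''html_to_text'' are hypothetical, and the sketch makes no attempt at boilerplate stripping (menus, ads and disclaimers that appear as ordinary text are kept).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags and script/style contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(html_to_text(page))
```

Telling real content apart from boilerplate needs heuristics on top of this, which is one reason automated cleaning leaves some unwanted elements behind.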
-{{:tutorials:basic_steps:13.jpg|}}
-Click on "Build corpus" to start the corpus creation process. This will take a while, depending on Internet traffic, connection speed and number of URLs to download.
-{{:tutorials:basic_steps:14.jpg|}}
-Once the download is complete, click "Open corpus folder".
-{{:tutorials:basic_steps:15.jpg|}}
-The folder containing the corpus data will be displayed.
-{{:tutorials:basic_steps:16.jpg|}}
-==== What now? ====
-Congratulations, you have created your first web corpus!
-Now you can use your favourite corpus analysis tools to work on your corpus. Here's a [[http://sslmit.unibo.it/~eros/teaching_software.php#concordancers|list of programs]] you might find useful.
-If you want to manually inspect the corpus you just created, there are a number of text editors you can use. If you're on Mac or Linux, you already have everything you need. If you're on Windows, we strongly recommend the free [[http://notepad-plus.sourceforge.net|Notepad++]], since the default Windows Notepad will not display the corpus correctly.