bootcat:tutorials:basic_3

Differences

This shows you the differences between two versions of the page.

bootcat:tutorials:basic_3 [2016/11/14 11:32] eros
bootcat:tutorials:basic_3 [2022/10/07 10:58] (current) – [Saving query results] eros
Line 4:
 ====== BootCaT front-end tutorial - Part 3 ======
  
-==== Search Engine Key ====
+==== Generating queries ====
  
-Before we can query the search engine, we need to provide BootCaT with a Bing Search Engine Key (see [[bootcat:help:search_engine_key|this page]] for more information on Account Keys).
+It's time to generate the queries that will be sent to the search engine (i.e. Google) using the tuples we generated earlier. The queries we generate here will be used in the next step to open a browser and save the results.
  
-Once you have obtained your Search Engine Key, paste it in the box and click "Next".
+A number of parameters can be specified here, but for the purposes of this tutorial we'll just accept the default values and click on "Generate Queries".
  
-{{bootcat:tutorials:basic_steps:007.png?nolink|}}
+{{ bootcat:tutorials:basic_steps:0065.png?nolink |}}
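To make the idea of "one query per tuple" concrete, here is a minimal Python sketch of how a tuple of seeds could be turned into a search URL. It is only an illustration, not BootCaT's own code: the seed words, the function names ''tuple_to_query'' and ''query_to_url'', and the exact URL layout are assumptions made for this example.

```python
from urllib.parse import urlencode


def tuple_to_query(seed_tuple):
    """Join the seeds of one tuple into a single query string.

    Multi-word seeds are wrapped in double quotes so that the search
    engine treats them as phrases (an assumption of this sketch).
    """
    return " ".join(f'"{seed}"' if " " in seed else seed for seed in seed_tuple)


def query_to_url(query, base="https://www.google.com/search"):
    """Build a search URL for a query (hypothetical URL layout)."""
    return f"{base}?{urlencode({'q': query})}"


# Hypothetical tuples for a corpus about dogs.
tuples = [
    ("dog", "leash", "puppy"),
    ("dog food", "kennel", "breed"),
]

queries = [tuple_to_query(t) for t in tuples]
urls = [query_to_url(q) for q in queries]

for url in urls:
    print(url)
```

Each tuple yields exactly one query, so the number of result pages to save in the next step equals the number of tuples.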
  
-:!: If you want BootCaT to remember your Account Key the next time you use it, leave the relevant box checked (doing this is not recommended if you're using a public or shared computer).
+==== Saving query results ====
  
-==== Collect URLs ====
-
-It's time to query the search engine (i.e. Bing) using the tuples we generated earlier. What happens here is that we search the web via the search engine, looking for pages that contain the tuples (combinations of our seeds) that were generated in the previous step. This identifies texts that are relevant to the more or less specific corpus (domain) in which we are interested, based on how specialized or general the seeds are.
+What happens here is that we open each of the queries generated in the previous step in a web browser. Each tuple (a combination of our seeds) generated earlier becomes a query. This method allows us to identify texts that are relevant to the more or less specific corpus (domain) in which we are interested, based on how specialized or general the seeds are.
 
-The search engine will return only a limited number of pages for each query (i.e. tuple) we submit; the default value is 10 URLs per query and we won't change it.
+{{ bootcat:tutorials:basic_steps:008.png?nolink |}}
 + 
 +Click on "Open in browser"; a message will appear explaining what's about to happen and showing the folder where you'll need to save the results page. You can also open the folder by clicking on "Open folder".
 + 
 +{{ bootcat:tutorials:basic_steps:0085.png?nolink |}} 
 + 
 +Once you click on "OK", your default Web browser will open and you'll see the results of the query. The page will look something like this:
 + 
 +{{ bootcat:tutorials:basic_steps:0087.png?nolink |}} 
 + 
 +Now you need to save the page by using the "Save page" function of your browser (on Windows you can just press CTRL-S, on macOS press CMD-S). A dialog box will appear asking you where you want to save the page. You need to select the folder '''BootCaT Corpora -> dogs -> queries'''.
 + 
 +**NB**: make sure you're saving the queries as "Web page, Complete", "Web page, HTML only" or "Page Source"; basically, you need to save them in HTML format (and not in MHTML or some other compressed format).
  
-{{bootcat:tutorials:basic_steps:008.png?nolink|}}
+{{ bootcat:tutorials:basic_steps:0088.png?nolink |}}
  
-:!: Increasing the number of pages will result in a larger corpus, but its contents will tend to become less relevant.
+==== Collecting URLs ====
  
-Some advanced options are available on this step, but we won't discuss them here; for now just click "Collect URLs" to start collecting **URLs** from the search engine.
+Once you're done saving the results of all queries, click on "Collect URLs" and you'll be taken to the next step:
  
-This might take a while, depending on the number of tuples, Internet traffic and speed of your connection. In the lower text area you can see the URLs that are being collected from the search engine.
+{{ bootcat:tutorials:basic_steps:0089.png?nolink |}}
  
-{{bootcat:tutorials:basic_steps:009.png?nolink|}}
+:!: You can choose to click on "Open All in Browser" to send all queries to the browser with a single click, but this sometimes results in Google blocking the operation.
  
 :!: In this step we only collect the URLs (i.e. the Internet addresses) of pages; the actual pages will be downloaded in a later step.
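To give a rough idea of what "collecting URLs" from the saved results pages involves, here is a minimal Python sketch that reads saved HTML files and pulls out the external links. It is not BootCaT's actual implementation: the names ''LinkExtractor'', ''extract_urls'' and ''collect_urls'', the rule for skipping the search engine's own links, and the folder layout are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_urls(html_text, skip_domains=("google.",)):
    """Return the unique absolute http(s) links found in one saved page.

    Links whose host contains one of skip_domains are dropped, on the
    assumption that they belong to the search engine's own interface.
    """
    parser = LinkExtractor()
    parser.feed(html_text)
    urls = []
    for link in parser.links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            continue  # relative links, javascript:, mailto:, etc.
        if any(domain in parsed.netloc for domain in skip_domains):
            continue
        if link not in urls:
            urls.append(link)
    return urls


def collect_urls(folder):
    """Gather URLs from every .htm/.html file saved in the given folder."""
    urls = []
    for path in sorted(Path(folder).glob("*.htm*")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for url in extract_urls(text):
            if url not in urls:
                urls.append(url)
    return urls
```

With the folder used in this tutorial, a call might look like ''collect_urls("BootCaT Corpora/dogs/queries")'' (the path is hypothetical and depends on where your corpora are stored). Pages saved as MHTML would not be parsed correctly by this sketch, which is one reason the HTML format matters.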
  • bootcat/tutorials/basic_3.1479123131.txt.gz
  • Last modified: 2016/11/14 11:32
  • by eros