Differences
This shows you the differences between two versions of the page.
bootcat:help:corpus_creation_mode [2013/06/15 14:37] – [Simple mode (recommended)] eros | bootcat:help:corpus_creation_mode [2023/04/19 10:56] (current) – eros | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== Corpus creation mode ====== | ====== Corpus creation mode ====== | ||
- | Version 0.71 of the BootCaT frontend introduced the possibility of skipping some of the steps involved in the corpus | + | You can choose between |
- | You can now choose between the following creation " | + | * [[bootcat:help: |
+ | * [[bootcat: | ||
+ | * [[bootcat: | ||
+ | * [[bootcat: | ||
+ | * [[bootcat: | ||
- | * Simple mode (recommended) | + | {{: |
- | * Custom tuples (advanced) | + | |
- | * Custom URLs (advanced) | + | |
===== Simple mode (recommended) ===== | ===== Simple mode (recommended) ===== | ||
Line 17: | Line 19: | ||
===== Custom tuples (advanced) ===== | ===== Custom tuples (advanced) ===== | ||
- | In this mode you skip the seed selection | + | In this mode you skip the seed selection |
Remember that each line will become a single query to the search engine, therefore phrases should be enclosed in quotes. Your tuples should look like this: | Remember that each line will become a single query to the search engine, therefore phrases should be enclosed in quotes. Your tuples should look like this: | ||
Line 33: | Line 35: | ||
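For readers who build their tuples file with a script, the quoting rule described above (one query per line, multi-word phrases enclosed in quotes) can be sketched as follows. This is only an illustration: the function names `format_tuple` and `write_tuples` and the output file name are hypothetical, and the sketch assumes that the seeds on a line are separated by single spaces.

```python
def format_tuple(seeds):
    """Join seeds into one query line, quoting multi-word phrases.

    Assumption: seeds on a line are separated by a single space.
    """
    parts = []
    for seed in seeds:
        seed = seed.strip()
        if not seed:
            continue  # ignore empty seeds
        already_quoted = len(seed) > 1 and seed.startswith('"') and seed.endswith('"')
        # A seed containing whitespace is a phrase and must be quoted,
        # otherwise the search engine would treat its words separately.
        if any(ch.isspace() for ch in seed) and not already_quoted:
            seed = f'"{seed}"'
        parts.append(seed)
    return " ".join(parts)


def write_tuples(tuples, path):
    """Write one tuple (one query) per line to a plain-text file."""
    # Python's "utf-8" codec does not write a byte order mark.
    with open(path, "w", encoding="utf-8") as handle:
        for seeds in tuples:
            handle.write(format_tuple(seeds) + "\n")


if __name__ == "__main__":
    # Hypothetical seeds, for illustration only.
    example = [
        ["corpus linguistics", "web", "crawler"],
        ["seed", "search engine", "query"],
    ]
    write_tuples(example, "tuples.txt")
```

Each inner list becomes one line of the file, and therefore one query.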
===== Custom URLs (advanced) ===== | ===== Custom URLs (advanced) ===== | ||
- | In this mode you'll skip directly to the final stap: you'll be asked to provide | + | In this mode you'll skip directly to the final step, the one where the corpus is built using a list of Internet addresses (or URLs). |
- | You'll have to edit the list separately | + | You'll be asked to provide a text file containing one **valid** URL per line, i.e. each line must begin with '' |
+ | |||
+ | You'll have to edit the list separately | ||
The text file should look like this: | The text file should look like this: | ||
Line 41: | Line 45: | ||
< | < | ||
http:// | http:// | ||
- | http:// | + | https:// |
+ | https:// | ||
http:// | http:// | ||
- | ... | + | http:// |
</ | </ | ||
+ | |||
+ | NB: up to version 1.21, BootCaT does not accept URL lists encoded as "UTF8 **with BOM**", | ||
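Before loading a URL list, it may help to check it against the two constraints mentioned above: one valid URL per line, and no byte order mark at the start of the file. The following is a minimal sketch, not part of BootCaT. The accepted prefixes (`http://` and `https://`) are an assumption based on the sample list shown above, and the names `check_url_list` and `strip_bom` are hypothetical.

```python
import codecs

# Assumption: these are the accepted prefixes, inferred from the sample list.
ALLOWED_PREFIXES = ("http://", "https://")


def check_url_list(raw):
    """Inspect the raw bytes of a URL list file.

    Returns (has_bom, bad_lines), where bad_lines is a list of
    (line_number, content) pairs for lines that do not look like a URL.
    """
    has_bom = raw.startswith(codecs.BOM_UTF8)
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    text = raw.decode("utf-8-sig")
    bad_lines = []
    for number, line in enumerate(text.splitlines(), start=1):
        url = line.strip()
        # Blank lines are reported too, since every line should hold a URL.
        if not url.startswith(ALLOWED_PREFIXES):
            bad_lines.append((number, url))
    return has_bom, bad_lines


def strip_bom(raw):
    """Return the same bytes without a leading UTF-8 byte order mark."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + b"http://example.com/a\nexample.com/b\n"
    print(check_url_list(sample))
    print(strip_bom(sample))
```

If `has_bom` is true, writing the output of `strip_bom` back to disk yields a file without a BOM. Lines listed in `bad_lines` need to be fixed by hand.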
+ | ===== Local files (advanced) ===== | ||
+ | |||
+ | Using this mode, BootCaT will process all files contained in a folder on your computer. The files will be cleaned and the corpus files will be created. | ||
+ | |||
+ | Most common file formats are supported, including '' | ||
+ | |||
+ | ===== Local queries (advanced) ===== | ||
+ | |||
+ | Using this mode, you can query Google normally using a web browser and save the result pages to a folder. Then you can tell BootCaT where this folder is and it will extract the URLs from the queries you saved. | ||
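To give an idea of what extracting URLs from saved result pages involves, here is a rough sketch using only the Python standard library. It is not BootCaT's own extraction logic, which is not described here. The names `LinkCollector`, `extract_links` and `links_from_folder` are hypothetical, and the sketch assumes the pages were saved as `.htm` or `.html` files. Saved result pages also contain links to the search engine's own pages, which a real tool would need to filter out.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links found in <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(html_text):
    """Return the absolute links of one saved page, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


def links_from_folder(folder):
    """Gather links from every saved page in a folder, without duplicates."""
    seen = []
    # Assumption: result pages were saved with an .htm or .html extension.
    for path in sorted(Path(folder).glob("*.htm*")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for url in extract_links(text):
            if url not in seen:
                seen.append(url)
    return seen
```

Calling `links_from_folder` on the folder of saved pages returns a list that could then be written out one URL per line.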
+ |