Using an external downloader
When you download a very long list of URLs, sometimes BootCaT will crash. We're trying to fix the problem, but for now here's a handy workaround.
Even if BootCaT crashed while downloading files, you can find a file called urls_list_final.txt in the folder created for the failed attempt at building your corpus: that's the list of all the URLs you collected in the first stage of the corpus creation process.
You can simply try again using the Custom URLs corpus creation mode.
Another solution is to download the files with an external program and then have BootCaT clean them in the Local files corpus creation mode.
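Before choosing between the two options, it can help to check how long the list actually is. The Python sketch below is only an illustration: it assumes that urls_list_final.txt holds one URL per line (the file layout is not documented here), and the function name load_urls is hypothetical.

```python
from pathlib import Path


def load_urls(list_path):
    """Return the unique, non-empty lines of a URL list, in original order.

    Assumes one URL per line; this is an assumption about the file layout.
    """
    seen = set()
    urls = []
    text = Path(list_path).read_text(encoding="utf-8", errors="replace")
    for line in text.splitlines():
        url = line.strip()
        # Skip blank lines and URLs that already appeared earlier in the file.
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Calling len(load_urls(r"C:\Users\john\Desktop\urls_list_final.txt")) on your own copy of the file gives the number of distinct URLs in the list.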
Here's a step-by-step guide to downloading files using the freeware external downloader WinWget and then turning them into a corpus with BootCaT.
Download and configure WinWget
- Visit the WinWget site at https://www.astatix.com/tools/winwget.php and download the WinWget zip file
- Unzip the WinWget.zip file
- Download Wget for Windows from here https://eternallybored.org/misc/wget/1.20.3/32/wget.exe and move it to the WinWget folder
- Double-click on WinWget to start the application, then click on Tools → Options
- Click on browse and select the wget.exe file you downloaded earlier
- Then select the folder where you want to download the web pages
- Click OK
Downloading URLs
- Create a new download job
- Select the urls_list_final.txt file
- Add a double quote character (") at the beginning and at the end of the file path. It should look something like "C:\Users\john\Desktop\urls_list_final.txt"; the important part is that there must be double quotes at the beginning and at the end of the line
- Click OK. You'll see that the job is ready to run; click on the Run button
- After some time (from a few seconds to several minutes, depending on the number of URLs), you'll see that the job is complete
- You can close WinWget and move on to the corpus creation process
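If you would rather script this step, the download job above can be approximated in Python. This is a minimal sketch under stated assumptions, not what WinWget or BootCaT does internally: it assumes one URL per line in the list file, it uses only the standard library, and the names fetch_url and download_all as well as the numbered .html file names are hypothetical. URLs that fail are skipped and reported, so one dead link does not stop the whole run.

```python
import urllib.request
from pathlib import Path


def fetch_url(url, timeout=30):
    """Fetch one URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(list_path, out_dir, fetch=fetch_url):
    """Save every URL listed in list_path into out_dir.

    Returns a list of (url, error message) pairs for the URLs that failed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    text = Path(list_path).read_text(encoding="utf-8", errors="replace")
    # Assumption: one URL per line; blank lines are ignored.
    urls = [line.strip() for line in text.splitlines() if line.strip()]
    failed = []
    for number, url in enumerate(urls, start=1):
        try:
            data = fetch(url)
        except Exception as error:  # keep going after a dead or slow link
            failed.append((url, str(error)))
            continue
        # Hypothetical naming scheme: page_00001.html, page_00002.html, ...
        (out / f"page_{number:05d}.html").write_bytes(data)
    return failed
```

For example, download_all(r"C:\Users\john\Desktop\urls_list_final.txt", r"C:\Users\john\Desktop\pages") would fill the second folder (a made-up location), which you can then select in the Local files corpus creation mode.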
Creating the corpus
- Start BootCaT, choose a name and a language for the corpus as usual and, when BootCaT asks you how you want to proceed, select Local files
- BootCaT will ask you to select the folder containing the downloaded pages: click once on the folder and then click on Open
- You'll be taken to the corpus creation page: click on “Build corpus” and you're done!