Using an external downloader

When you download a very long list of URLs, sometimes BootCaT will crash. We're trying to fix the problem, but for now here's a handy workaround.

Even if BootCaT crashed while downloading files, you can find a file called urls_list_final.txt in the folder created for the failed attempt at building your corpus: that's the list of all the URLs you collected in the first stage of the corpus creation process.

You can simply try again using the Custom URLs corpus creation mode.

Another solution is downloading the files using an external program and then using BootCaT to clean them using the Local files corpus creation mode.

Here's a step-by-step guide to downloading files using the freeware external downloader WinWget and then turning them into a corpus with BootCaT.

  • Double-click on WinWget to start the application, then click on Tools → Options

  • Click on browse and select the wget.exe file you downloaded earlier
  • Then select the folder where you want to download the web pages

  • Click OK
  • Create a new download job

  • Select the urls_list_final.txt file

  • Add a double quote character (") at the beginning and at the end of the file path; it should look something like "C:\Users\john\Desktop\urls_list_final.txt". The important part is that the line must begin and end with a double quote

  • Click OK; you'll see that the job is ready to run. Click on the Run button

  • After some time (from a few seconds to several minutes, depending on the number of URLs), you'll see that the job is complete

  • You can close WinWget and move on to the corpus creation process
  • Start BootCaT, choose a name and a language for the corpus as usual and when BootCaT asks you how you want to proceed, select Local files

  • BootCaT will ask you to select the folder containing the downloaded pages: click once on the folder and then click on Open

  • You'll be taken to the corpus creation page: click on “Build corpus” and you're done!
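
If you'd rather not install WinWget, the download step above can in principle be done by any program that reads a list of URLs and saves each page to a folder. Here is a minimal sketch in Python that uses only the standard library. It is not part of BootCaT. It assumes that urls_list_final.txt contains one URL per line and is UTF-8 encoded; the function names and the page_00001.html naming scheme are made up for this example, and we haven't verified how the Local files mode treats files saved this way.

```python
import http.client
import urllib.request
from pathlib import Path


def read_urls(list_path):
    """Return the non-empty lines of the URL list, stripped of whitespace.

    Assumption: the list holds one URL per line and is UTF-8 encoded.
    """
    text = Path(list_path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def download_all(urls, out_dir, timeout=30):
    """Download each URL into out_dir.

    Returns a pair (saved_paths, failed_urls). The output file names
    (page_00001.html, page_00002.html, ...) are a hypothetical scheme
    chosen for this sketch.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for i, url in enumerate(urls, start=1):
        target = out / f"page_{i:05d}.html"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                target.write_bytes(resp.read())
            saved.append(target)
        except (OSError, ValueError, http.client.HTTPException):
            # A single unreachable or malformed URL should not stop the run;
            # remember it so it can be retried or inspected later.
            failed.append(url)
    return saved, failed
```

For example, calling download_all(read_urls("urls_list_final.txt"), "downloaded_pages") would save the pages into a folder named downloaded_pages (a name chosen here for illustration) and return the list of URLs that could not be fetched. You could then point the Local files mode at that folder as described in the steps above.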

  • bootcat/help/use_external_downloader.txt
  • Last modified: 2021/02/10 12:14
  • by eros