Version 1.21
- NEW (feature): Windows and Mac users no longer need to install Java for BootCaT to work, because Java is now included in the distribution package;
- NEW (feature): pseudo-XML versions of the extracted plain text files are now created in the xml_corpus folder; a singlecorpus.xml file is also created, containing the merged version of the pseudo-XML corpus; the XML version of the corpus contains more metadata than the plain text version: id, a unique identifier for the document consisting of the corpus name followed by a number; filename of the downloaded file (basically, the id plus the file extension); uri, the URI of the original file; content_type of the original file;
- NEW (feature): in the “Project Definition” step, you can now add up to three user-defined XML attributes to the XML version of the corpus;
- NEW (feature): the name of the corpus is now prepended to the names of downloaded files, individual corpus text files and XML corpus files; this makes it possible to easily merge different corpora in the same folder; files are still progressively numbered;
- BUGFIX: fixed a bug that prevented the download timeout from working properly, which caused BootCaT to wait forever for certain URLs to download
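
As an illustration of the per-document metadata described for the XML corpus in this version, here is a minimal Python sketch of how the four attributes (id, filename, uri, content_type) relate to each other. It is not BootCaT's actual code or output format: the "_" separator between corpus name and number, the unpadded number, the "doc" element name and the helper names doc_attributes and open_tag are all assumptions made for the example.

```python
from xml.sax.saxutils import quoteattr


def doc_attributes(corpus_name, number, extension, uri, content_type):
    """Build the four metadata attributes listed in the changelog.

    Assumption: the id joins corpus name and number with "_" and no padding;
    the changelog only says "the corpus name followed by a number".
    """
    doc_id = f"{corpus_name}_{number}"
    return {
        "id": doc_id,
        # The filename is described as the id plus the file extension.
        "filename": f"{doc_id}.{extension}",
        "uri": uri,
        "content_type": content_type,
    }


def open_tag(attrs, element="doc"):
    """Render a pseudo-XML opening tag; the element name is hypothetical."""
    rendered = " ".join(f"{key}={quoteattr(value)}" for key, value in attrs.items())
    return f"<{element} {rendered}>"


attrs = doc_attributes("mycorpus", 1, "html", "http://example.com/page", "text/html")
print(open_tag(attrs))
```

Because every id starts with the corpus name, files from different corpora can sit in the same folder without name clashes, which is the point of the file-naming change listed above.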