xml_corpus folder; a single corpus.xml file is also created, containing the merged version of the pseudo-XML corpus; the XML version of the corpus contains more metadata than the plain text version:
id, a unique identifier for the document, consisting of the corpus name followed by a number,
filename of the downloaded file (basically, the id plus the file extension),
uri, the URI of the original file,
content_type of the original file;
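As a rough illustration of how the metadata listed above could be read back, here is a minimal Python sketch. It is not the tool's own code. It assumes, hypothetically, that each document in corpus.xml is wrapped in a doc element whose attributes are id, filename, uri and content_type; the tag name, the attribute layout and the sample values are all assumptions made for this example. Because the corpus is pseudo-XML and may not be well-formed, the sketch uses regular expressions rather than an XML parser.

```python
import re

# Hypothetical sample: the <doc> tag name, the attribute layout and all
# values below are assumptions, not the tool's documented output.
SAMPLE = '''<doc id="mycorpus_1" filename="mycorpus_1.html" uri="http://example.org/a" content_type="text/html">
First document text.
</doc>
<doc id="mycorpus_2" filename="mycorpus_2.pdf" uri="http://example.org/b.pdf" content_type="application/pdf">
Second document text.
</doc>
'''

# Match one document: capture the attribute string and the body text.
DOC_RE = re.compile(r'<doc\b([^>]*)>(.*?)</doc>', re.DOTALL)
# Match name="value" pairs inside the opening tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def read_documents(text):
    """Return one dict per document, holding its metadata and its text."""
    docs = []
    for attrs, body in DOC_RE.findall(text):
        meta = dict(ATTR_RE.findall(attrs))
        meta["text"] = body.strip()
        docs.append(meta)
    return docs


docs = read_documents(SAMPLE)
for d in docs:
    print(d["id"], d["content_type"])
```

To try this on a real corpus, replace SAMPLE with the contents of corpus.xml and adjust the two patterns to whatever element and attribute names the file actually uses.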