BootCaTExtractor.jar performs the same task as retrieve_and_clean_pages_from_url_list.pl but, unlike the Perl script, supports UTF-8, language filtering and document size filtering;
UrlCollector.jar no longer requires the “market” parameter;