
BootCaT front-end tutorial - Part 5

Congratulations, you have created your first web corpus!

Now you can use your favourite corpus analysis tools to work with your corpus; here's a list of programs you might find useful.

If you want to manually inspect the corpus you just created, there are a number of text editors you can use. If you're on Mac or Linux, you already have everything you need. If you're on Windows, we strongly recommend the free Notepad++, since the default Windows Notepad will not display the corpus correctly.

If you're happy with the corpus you have created, go ahead and have fun using it! Otherwise, if the semi-automatically built corpus does not meet your requirements, repeat the procedure with a different set of seeds (e.g. more seeds to make the corpus more specific and focussed) and/or with different parameters for generating the tuples.

Whether you believe in the old adage that “more data is better data” or simply want to experiment some more, you might want to build a larger corpus. The easiest way to do so is to repeat the process with more seeds: more seeds let you generate more tuples (queries), which in turn yield more URLs and more documents.
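
To get a rough feel for how quickly the number of possible queries grows with the number of seeds, here is a small Python sketch. It assumes that a tuple is an unordered combination of distinct seeds and that only a sample of those combinations ends up being used as queries; the function name and the seed counts are purely illustrative and not part of BootCaT.

```python
from math import comb


def max_distinct_tuples(num_seeds: int, tuple_length: int) -> int:
    """Upper bound on distinct tuples, assuming each tuple is an
    unordered combination of distinct seeds (an assumption of this sketch)."""
    return comb(num_seeds, tuple_length)


# Compare a few hypothetical seed-list sizes for tuples of length 3.
for n in (5, 10, 20):
    print(n, "seeds ->", max_distinct_tuples(n, 3), "possible 3-seed tuples")
```

Under these assumptions, doubling the seed list from 10 to 20 raises the ceiling from 120 to 1140 possible three-seed tuples, so there is far more room to generate varied queries.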

Use AntConc or WordSmith Tools (or any other tool you have) to generate a list of keywords from your new corpus. You can then use the most relevant keywords as seeds for a new and improved version of your web corpus.
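
If you would rather script this step, the idea behind a keyword list can be sketched in a few lines of Python. This is only a minimal illustration, not what the tools above actually compute: it ranks words by a smoothed ratio of their relative frequency in your corpus to their relative frequency in a reference text, whereas dedicated tools offer more robust keyness statistics. The function names and the toy texts are hypothetical; in practice you would read your corpus and a general-language reference corpus from files.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"\w+", text.lower())


def keywords(corpus_text: str, reference_text: str, top_n: int = 10) -> list[str]:
    """Rank corpus words by a smoothed relative-frequency ratio
    against the reference text (a simplification for illustration)."""
    corpus = Counter(tokenize(corpus_text))
    reference = Counter(tokenize(reference_text))
    corpus_total = sum(corpus.values())
    reference_total = sum(reference.values())

    scores = {}
    for word, freq in corpus.items():
        # Add-one smoothing so words absent from the reference do not divide by zero.
        rel_corpus = (freq + 1) / (corpus_total + 1)
        rel_reference = (reference.get(word, 0) + 1) / (reference_total + 1)
        scores[word] = rel_corpus / rel_reference

    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy example: words typical of the corpus but rare in the reference rank first.
print(keywords("solar panel solar energy the the", "the cat and the dog the", top_n=3))
```

Words that come out on top are candidates for the next round of seeds; function words such as "the" sink to the bottom because they are just as frequent in the reference text.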

  • Last modified: 2012/05/30 13:24
  • by eros