The core of any Text Categorization (TC) experiment is the final accuracy and the possibility of comparing it against previous work. The Reuters corpus offers this possibility, as it has been widely used in TC research. Unfortunately, it is not easy to go from its downloadable format to the several versions used in the literature: Apte' split, Apte' split 90 categories, Apte' split 115 (or 135) categories, Apte' split 10 categories, Reuters-22173, and the Reuters Yang preparation (Reuters3). An attempt to describe all Reuters versions was made in [Sebastiani, 2002], although it disagrees with [Yang, 1999] on the numbers of training and test documents in Reuters3. Another critical point is following the Apte' split preparation accurately: obtaining the exact number of documents for each category and for the final split usually requires a lot of time.
To help researchers who are approaching Text Categorization, we make the standard Apte' split available in an easy-to-process format. Each category is represented by a directory, which stores the files (one per document) associated with that category. Since Reuters contains unlabeled documents, we stored all of them in the directory unknown. The document file names are increasing numbers (starting from 0) across all categories, which enables fast document indexing. The training/testing split is provided by means of two separate main directories (test and training).
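As a minimal sketch of how this layout might be read, the Python function below walks one of the two main directories and collects the documents of each category. The function name load_split, the root path argument and the latin-1 encoding are assumptions made for illustration only; they are not part of the distributed corpora.

```python
from pathlib import Path


def load_split(root, split):
    """Return {category: {file_name: text}} for root/<split>/<category>/<file>.

    `split` is expected to be "training" or "test", following the layout
    described above. The latin-1 encoding is an assumption of this sketch.
    """
    corpus = {}
    for cat_dir in sorted(Path(root, split).iterdir()):
        if not cat_dir.is_dir():
            continue  # ignore stray files next to the category directories
        docs = {}
        for doc_path in cat_dir.iterdir():
            if doc_path.is_file():
                # File names are the numeric document identifiers.
                docs[doc_path.name] = doc_path.read_text(encoding="latin-1")
        corpus[cat_dir.name] = docs
    return corpus
```

With such a function, load_split(root, "training") and load_split(root, "test") would give the two halves of the split, and the unlabeled documents would appear under the key "unknown".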
The same annoying corpus preparation problems also affect two other well-known corpora, Ohsumed and 20NewsGroups (see [Moschitti and Basili, 2004; Moschitti, 2003a; Moschitti, 2003b]), so we provide them in the same format as well. The corpus descriptions and download links follow:
Reuters-21578 collection Apte' split (available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html).
It includes 12,902 documents for 90 classes, with a fixed split between test and training data (3,299 vs. 9,603). This is the most widely used version, as confirmed by Table VI on page 38 of [Sebastiani, 2002]. To obtain the Reuters 10 categories Apte' split from it, it is enough to select the 10 largest categories, i.e. Earn, Acquisition, Money-fx, Grain, Crude, Trade, Interest, Ship, Wheat and Corn.
Download Here
- 90 categories: according to the literature, e.g. [Joachims, 1997], these are the categories with at least 1 training and 1 test document. After the category selection, the exact number of training documents decreases to 9,598.
- 115 categories: according to the literature, e.g. [Sebastiani, 2002], these are the categories with at least 1 training document.
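The selection rules above can be sketched in Python as follows. The sketch assumes that each split is a mapping from category name to a collection of document identifiers, that a document filed under several categories keeps the same identifier in each, and that "top-sized" means largest by number of training documents. All function names are hypothetical.

```python
def categories_with_train_and_test(train, test, skip=("unknown",)):
    """Categories with at least 1 training and 1 test document."""
    return sorted(c for c in train
                  if c not in skip and train[c] and test.get(c))


def categories_with_train(train, skip=("unknown",)):
    """Categories with at least 1 training document."""
    return sorted(c for c in train if c not in skip and train[c])


def top_categories(train, k=10, skip=("unknown",)):
    """The k largest categories by training size (assumed criterion)."""
    sizes = {c: len(docs) for c, docs in train.items() if c not in skip}
    # Sort by decreasing size, breaking ties alphabetically.
    return sorted(sizes, key=lambda c: (-sizes[c], c))[:k]


def unique_documents(split, categories):
    """Distinct document identifiers that belong to the given categories."""
    return set().union(*(split[c] for c in categories))
```

Counting unique_documents over the selected categories is one way to check document totals after a category selection, since documents that belong only to discarded categories drop out of the count.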
Ohsumed collection (available at ftp://medir.ohsu.edu/pub/ohsumed):
it includes medical abstracts from the MeSH categories of the year 1991. In [Joachims, 1997], the first 20,000 documents were used, divided into 10,000 for training and 10,000 for testing. The specific task was to categorize the 23 cardiovascular disease categories. After selecting this category subset, the number of unique abstracts becomes 13,929 (6,286 for training and 7,643 for testing). As current computers can easily manage a larger number of documents, we make available all 34,389 cardiovascular disease abstracts out of the 50,216 medical abstracts of the year 1991.
Download Here
- Cardiovascular diseases abstracts (in the first 20,000 abstracts of the year 1991)
- All Cardiovascular diseases abstracts (in all 50,216 abstracts of the year 1991)
20Newsgroups corpus (available at http://www.ai.mit.edu/people/jrennie/20Newsgroups/):
it contains 19,997 articles for 20 categories taken from the Usenet newsgroups collection. We used only the subject and the body of each message. Some of the newsgroups are very closely related to each other (e.g. IBM computer system hardware / Macintosh computer system hardware), while others are highly unrelated (e.g. misc forsale / social religion and christian). This corpus differs from the previous ones because it has a larger vocabulary and its words typically have more meanings. Moreover, the writing style (e-mail dialogues) is very distant from that of the other, more technical collections.
Download Here
- All 20,000 documents (there is no fixed split in the literature; the corpus has usually been used with cross-validation techniques)
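Since no fixed split exists for this corpus, a cross-validation partition has to be generated by the experimenter. The Python sketch below shows one simple way to do it: a seeded shuffle followed by k folds of nearly equal size. The function name, the default k=10 and the seed are assumptions for illustration; the literature does not prescribe them.

```python
import random


def k_fold_splits(doc_ids, k=10, seed=0):
    """Yield (train_ids, test_ids) pairs for k-fold cross validation."""
    ids = sorted(doc_ids)
    # A fixed seed keeps the partition reproducible across runs.
    random.Random(seed).shuffle(ids)
    # Taking every k-th identifier gives folds whose sizes differ by at most 1.
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test
```

Each document appears in exactly one test fold, so averaging the accuracy over the k runs uses every document for testing once. This sketch does not stratify by category; a stratified variant would apply the same procedure within each newsgroup.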
Additional information, as well as accuracy evaluations on the above corpora, can be found below.
[Moschitti and Basili, 2004] Alessandro Moschitti and Roberto Basili. Complex Linguistic Features for Text Classification: a comprehensive study. In Proceedings of the 26th European Conference on Information Retrieval Research (ECIR 2004), Sunderland, U.K., 2004.
[Moschitti, 2003b] Alessandro Moschitti. A study on optimal parameter tuning for Rocchio text classifier. In Proceedings of the 25th European Conference on Information Retrieval Research (ECIR 2003), Pisa, Italy, April 2003.
[Joachims, 1997] Thorsten Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. LS8-Report 23, Universität Dortmund, LS VIII-Report, 1997.
[Sebastiani, 2002] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.
[Yang, 1999] Yiming Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, Vol. 1, No. 1/2, pp. 67-88, 1999.
Maintained by Alessandro Moschitti
moschitti[at]dit.unitn.it