TREC Data Guide
This page provides a link to the TREC 2002 corpora available to members
of the Cognitive Computation Group. This data is copyright-protected, so
it is your responsibility to use it accordingly.
The TREC 2002 data set contains articles from the New York Times,
Associated Press Worldstream, and Xinhua News, from 1998-2000.
To access the TREC AQUAINT data (directory listings), follow
this link.
From this page, you can access the original contents of the TREC CD sets from
2002, plus some annotated versions of these corpora:
column format data
This gzipped tarball contains column format annotated versions of the apw,
nyt, and xie articles in the TREC02 distribution. The annotation
includes Part of Speech, Shallow Parse, and Named Entity information.
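As a rough illustration of how column format data of this kind is commonly read, here is a minimal Python sketch. It assumes one token per line, whitespace-separated columns, and a blank line between sentences; these are assumptions, not the documented layout of this tarball, so check the files themselves before relying on it. The function names and the sample column order (word, POS, chunk, NE) are hypothetical.

```python
import io
import tarfile


def read_column_sentences(lines):
    """Group column-format lines into sentences.

    Assumption: one token per line, whitespace-separated columns,
    and a blank line between sentences. The real layout may differ.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split())
    # Emit a trailing sentence that is not followed by a blank line.
    if sentence:
        yield sentence


def iter_tarball_sentences(path):
    """Yield (member_name, sentence) for each regular file in a .tar.gz."""
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            raw = archive.extractfile(member)
            # The encoding is an assumption; undecodable bytes are replaced.
            text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
            for sentence in read_column_sentences(text):
                yield member.name, sentence


if __name__ == "__main__":
    # Hypothetical sample with columns: word, POS, chunk, NE.
    sample = [
        "The DT B-NP O\n",
        "AP NNP I-NP B-ORG\n",
        "\n",
        "reported VBD B-VP O\n",
    ]
    for tokens in read_column_sentences(sample):
        print([columns[0] for columns in tokens])
```

Reading directly from the tarball avoids unpacking the whole corpus to disk, which may be convenient given its size.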
minipar column format
This corpus contains articles from apw and nyt, tagged with minipar.
A readme is included that explains the format.
question-answer pairs (column format)
This is a QA corpus containing pairs of sentences comprising a question
rewritten in statement form and a candidate answer returned by a text
retrieval system. There are 2 to 4 pairs for each question: up to 2 pairs
in which the answer is correct, and 2 in which the answer is incorrect.
(For some questions, there are fewer than 2 correct candidate answers.)
There is an associated .info file which contains the truth value for each
pair.
This column format data has POS, NE, shallow parse, full parse (Collins) and
SRL (older version) annotation. Also included is a file that explains
the format of the annotation.
Mark Sammons
Last modified: Thu Dec 1 09:51:52 CST 2005