TREC Data Guide

This page provides a link to the TREC 2002 corpora available to members of the Cognitive Computation Group. This data is copyright-protected, so it is your responsibility to use it accordingly. The TREC 2002 data set contains articles from the New York Times, Associated Press Worldstream, and Xinhua News, published between 1998 and 2000.

To access the TREC AQUAINT data (directory listings), follow this link.

From this page, you can access the original contents of the TREC CD sets from 2002, plus some annotated versions of these corpora:


column format data

This gzipped tarball contains column format annotated versions of the apw, nyt, and xie articles in the TREC02 distribution. The annotation includes Part of Speech, Shallow Parse, and Named Entity information.
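
As a rough illustration of how a column format tarball like this might be read, the sketch below walks a gzipped tarball and groups lines into sentences. It is not the group's own tooling. It assumes one token per line, whitespace-separated columns, and a blank line between sentences; the actual column layout of this distribution is not described on this page, so check the files themselves before relying on it. The function names and the archive path are hypothetical.

```python
import io
import tarfile


def read_column_sentences(lines):
    """Group column-format lines into sentences.

    Assumption (not confirmed by this page): one token per line,
    whitespace-separated columns, blank line between sentences.
    """
    sentence = []
    for line in lines:
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split())
    if sentence:
        yield sentence


def iter_tarball_sentences(path):
    """Yield (member_name, sentence) for every regular file in a .tar.gz."""
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            raw = archive.extractfile(member)
            # Decode leniently, since the encoding of the files is an assumption.
            text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
            for sentence in read_column_sentences(text):
                yield member.name, sentence
```

A call such as `for name, sentence in iter_tarball_sentences("trec02_column.tar.gz")` (a placeholder filename) would then give each sentence as a list of per-token column lists, which can be mapped to POS, shallow parse, and NE fields once the column order is known.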

minipar column format

This corpus contains articles from apw and nyt, tagged with MINIPAR. An included readme explains the format.

question-answer pairs (column format)

This is a QA corpus containing pairs of sentences, each comprising a question rewritten in statement form and a candidate answer returned by a text retrieval system. There are 2 to 4 pairs for each question: up to 2 pairs in which the answer is correct, and 2 in which the answer is incorrect. (For some questions, there are fewer than 2 correct candidate answers.) An associated .info file contains the truth value for each pair.

This column format data has POS, NE, shallow parse, full parse (Collins) and SRL (older version) annotation. Also included is a file that explains the format of the annotation.



Mark Sammons
Last modified: Thu Dec 1 09:51:52 CST 2005