The Wikilinks Rare Entity Prediction Dataset

The Wikilinks Rare Entity Prediction Dataset is a heavily processed version of the Wikilinks dataset, which was originally created for the task of cross-document coreference resolution. The corpus is formed by parsing the HTML of the crawled webpages. Entity descriptions are extracted from Freebase and serve as the external knowledge. The dataset was created for the following paper:

Teng Long, Emmanuel Bengio, Ryan Lowe, Jackie Chi Kit Cheung, Doina Precup; "World Knowledge for Reading Comprehension: Rare Entity Prediction with Hierarchical LSTMs Using External Descriptions", EMNLP 2017. [ACL anthology link]

The dataset was created for the task of rare entity prediction, a reading comprehension problem proposed in the above paper. The task is to predict which entities are missing from a given document. It is made more difficult by the relatively few appearances of each entity within the corpus: the corpus contains nearly 270 000 documents and 245 000 unique entities, and 92.8% of entities appear 10 or fewer times. With so few appearances, document context alone is not enough to complete the task, so external knowledge sources are also used to train the models. The external knowledge included in the dataset consists of the descriptions extracted from Freebase for each unique entity.
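As a rough illustration of how a rarity statistic like the 92.8% figure can be computed, the following sketch counts entity mentions and reports the share of unique entities that appear at most a given number of times. The function name `rare_entity_fraction`, the flat list-of-mentions input, and the toy counts are assumptions made for this example; they are not part of the dataset's tooling or file format.

```python
from collections import Counter


def rare_entity_fraction(mentions, max_count=10):
    """Return the share of unique entities that occur at most max_count times.

    `mentions` is a hypothetical flat list with one entity name per mention.
    """
    counts = Counter(mentions)
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c <= max_count)
    return rare / len(counts)


# Toy data (made up for illustration): one frequent entity, two rare ones.
toy_mentions = (
    ["William Blake"] * 12
    + ["Peter Ackroyd"] * 3
    + ["Emmanuel Swedenborg"]
)
fraction = rare_entity_fraction(toy_mentions)
print(f"{fraction:.1%} of entities appear 10 or fewer times")
```

On the toy data, two of the three entities fall at or below the threshold.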

The following is an example from the dataset:

Context: The local context for which the entity must be predicted

[...] ***blank***, who lived from 1757 to 1827, was admired by a small group of intellectuals and artists in his day, but never gained general recognition as either a poet or painter. [...]

Candidate Entities: Candidate entities appear with their Freebase description

Peter Ackroyd: Peter Ackroyd is an English biographer, novelist and critic with a particular interest in the history and culture of London. [...]
William Blake: William Blake was an English poet, painter, and printmaker. [...]
Emmanuel Swedenborg: Emmanuel Swedenborg was a Swedish scientist, philosopher, theologian, revelator, and mystic. [...]
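To make the structure of this example concrete, here is a minimal sketch that holds it in memory and picks a candidate by word overlap between the context and each description. This is a naive baseline written for illustration only, not the hierarchical LSTM model from the paper. The `Example` class, the `predict` function, and the stopword list are all assumptions, and the dataset's actual file format may differ.

```python
import re
from dataclasses import dataclass

# Small hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {
    "a", "an", "and", "as", "but", "by", "from", "his", "in", "is",
    "of", "or", "the", "to", "was", "who", "with",
}


@dataclass
class Example:
    context: str      # local context containing the blank
    candidates: dict  # entity name -> Freebase description


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def predict(example):
    """Return the candidate whose description shares the most words with the context."""
    ctx = content_words(example.context)
    return max(
        example.candidates,
        key=lambda name: len(ctx & content_words(example.candidates[name])),
    )


example = Example(
    context=(
        "[...] ***blank***, who lived from 1757 to 1827, was admired by a small "
        "group of intellectuals and artists in his day, but never gained general "
        "recognition as either a poet or painter. [...]"
    ),
    candidates={
        "Peter Ackroyd": "Peter Ackroyd is an English biographer, novelist and critic "
                         "with a particular interest in the history and culture of London. [...]",
        "William Blake": "William Blake was an English poet, painter, and printmaker. [...]",
        "Emmanuel Swedenborg": "Emmanuel Swedenborg was a Swedish scientist, philosopher, "
                               "theologian, revelator, and mystic. [...]",
    },
)
prediction = predict(example)
print(prediction)
```

Here the overlap on "poet" and "painter" selects William Blake. The example also shows why the descriptions matter: the context alone never names the entity.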