The Wikilinks Rare Entity Prediction Dataset
Teng Long, Emmanuel Bengio, Ryan Lowe, Jackie Chi Kit Cheung, Doina Precup; "World Knowledge for Reading Comprehension: Rare Entity Prediction with Hierarchical LSTMs Using External Descriptions", EMNLP 2017. [ACL anthology link]
The dataset was created for rare entity prediction, a reading comprehension task proposed in the above paper. The task involves predicting which entities are missing from a given document. It is made more difficult by how rarely each entity appears in the corpus: across nearly 270 000 documents and 245 000 unique entities, 92.8% of entities appear 10 or fewer times. With so few appearances, the models are trained with external knowledge sources rather than relying solely on document context. The external knowledge included in the dataset consists of the descriptions extracted from Freebase for each unique entity.
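To make the rarity statistic concrete, here is a minimal sketch of how such a figure could be computed from a flat list of entity mentions. The function name `rare_entity_fraction`, the input layout, and the toy entity names are assumptions for illustration only; they do not reflect the dataset's actual file format or the authors' own scripts.

```python
from collections import Counter


def rare_entity_fraction(mentions, threshold=10):
    """Return the fraction of unique entities mentioned at most `threshold` times.

    `mentions` is a hypothetical flat list with one entity name per mention.
    """
    counts = Counter(mentions)
    if not counts:
        return 0.0
    rare = sum(1 for count in counts.values() if count <= threshold)
    return rare / len(counts)


# Toy input, not real dataset counts: one frequent entity and two rare ones.
toy_mentions = ["entity_a"] * 12 + ["entity_b"] * 3 + ["entity_c"]
print(rare_entity_fraction(toy_mentions))  # 2 of the 3 unique entities are rare
```

Applied to the full corpus with the default threshold of 10, a count of this kind is what the 92.8% figure describes.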
The following is an example from the dataset:

[...] ***blank***, who lived from 1757 to 1827, was admired by a small group of intellectuals and artists in his day, but never gained general recognition as either a poet or painter. [...]

Candidate Entities (each shown with its Freebase description):
Peter Ackroyd: Peter Ackroyd is an English biographer, novelist and critic with a particular interest in the history and culture of London. [...]
William Blake: William Blake was an English poet, painter, and printmaker. [...]
Emanuel Swedenborg: Emanuel Swedenborg was a Swedish scientist, philosopher, theologian, revelator, and mystic. [...]
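The example above shows why entity descriptions help: the context mentions a "poet" and "painter", and only one description shares those words. The sketch below is a simple word-overlap baseline, not the hierarchical LSTM model from the paper. The helper names (`tokenize`, `pick_candidate`), the small stopword list, and the dictionary layout are all assumptions made for illustration, and the descriptions are cut off where the example elides them.

```python
import re

# Small illustrative stopword list (an assumption, not part of the dataset).
STOPWORDS = {"a", "an", "and", "as", "but", "by", "from", "his", "in", "is",
             "of", "or", "the", "to", "was", "who", "with"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic, non-stopword tokens."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def pick_candidate(context, candidates):
    """Return the candidate whose description shares the most words with the context."""
    context_words = tokenize(context)
    return max(candidates,
               key=lambda name: len(context_words & tokenize(candidates[name])))


context = ("***blank***, who lived from 1757 to 1827, was admired by a small "
           "group of intellectuals and artists in his day, but never gained "
           "general recognition as either a poet or painter.")

# Descriptions are truncated exactly where the example above elides them.
candidates = {
    "Peter Ackroyd": "Peter Ackroyd is an English biographer, novelist and critic "
                     "with a particular interest in the history and culture of London.",
    "William Blake": "William Blake was an English poet, painter, and printmaker.",
    "Emanuel Swedenborg": "Emanuel Swedenborg was a Swedish scientist, philosopher, "
                          "theologian, revelator, and mystic.",
}

print(pick_candidate(context, candidates))  # William Blake
```

Here the overlap on "poet" and "painter" is enough to select the right entity. In general such a baseline is brittle, since contexts and descriptions often share no content words, which is one motivation for learned models.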