The Wikilinks Rare Entity Prediction Dataset

The dataset is contained in this single compressed file:
rare_entity_dataset.tar.gz (1042563276 bytes, MD5: 39b816112052038dd06e4fc8928df218)
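
A download can be checked against the size and MD5 listed above before unpacking. The following is a minimal sketch in Python using only the standard library; the function names md5_of_file and archive_is_intact are illustrative and not part of the dataset.

```python
import hashlib
import os

# Values copied from the listing above.
EXPECTED_SIZE = 1042563276
EXPECTED_MD5 = "39b816112052038dd06e4fc8928df218"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the ~1 GB archive never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path):
    """True if the file matches both the published size and MD5."""
    if os.path.getsize(path) != EXPECTED_SIZE:
        return False
    return md5_of_file(path) == EXPECTED_MD5
```

Calling archive_is_intact("rare_entity_dataset.tar.gz") should return True for a complete, uncorrupted download.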

This archive contains 3 files:

corpus.txt: This file contains the content of parsed web pages. Each line represents a single document, with entity mentions replaced by their corresponding freebase_ids.
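
Because corpus.txt stores one document per line, it can be read lazily rather than loaded whole. A minimal sketch, assuming the file is UTF-8 encoded (the encoding is not stated here) and using a hypothetical helper name iter_documents:

```python
def iter_documents(path, encoding="utf-8"):
    """Yield one document (one line of corpus.txt) at a time.

    The encoding default is an assumption, not something the
    dataset description specifies.
    """
    with open(path, encoding=encoding) as handle:
        for line in handle:
            # Drop only the line terminator; the document text is untouched.
            yield line.rstrip("\r\n")
```

For example, sum(1 for _ in iter_documents("corpus.txt")) counts the documents without holding the corpus in memory.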

entities.txt: This file contains information about the entities that appear in the corpus. Each line consists of five tab-separated fields: the freebase_id, anchor_text, wiki_url, freebase_name, and description of an entity.

The third file is a readme.
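
The five tab-separated fields of each entities.txt line map naturally onto a small record type. A minimal sketch, assuming UTF-8 encoding; the names Entity, parse_entity_line, and read_entities are illustrative and not part of the dataset:

```python
from typing import Iterator, NamedTuple


class Entity(NamedTuple):
    # Field order follows the description of entities.txt.
    freebase_id: str
    anchor_text: str
    wiki_url: str
    freebase_name: str
    description: str


def parse_entity_line(line: str) -> Entity:
    """Split one entities.txt line into its five fields."""
    # maxsplit=4 keeps any tab inside the final description field intact.
    fields = line.rstrip("\r\n").split("\t", 4)
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    return Entity(*fields)


def read_entities(path: str, encoding: str = "utf-8") -> Iterator[Entity]:
    """Yield an Entity for every non-blank line (encoding is an assumption)."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if line.strip():
                yield parse_entity_line(line)
```

A lookup table keyed by identifier can then be built with {e.freebase_id: e for e in read_entities("entities.txt")}, which is convenient for resolving the freebase_ids that appear in corpus.txt.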