Adverbial Presupposition Triggering Dataset: Corpora

Adverbial Presupposition Triggering Dataset

We extract datasets from two corpora: the Penn Treebank (PTB) corpus (Marcus et al. 1993) and a subset (sections 000-760) of the third edition of the English Gigaword corpus (Graff et al. 2007). For the PTB dataset, we use sections 22 and 23 for testing. For the Gigaword corpus, we use sections 700-760 for testing. Of the remaining data, we randomly select 10% for development and use the other 90% for training.
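The split described above can be sketched as follows. This is a minimal illustration, not the extraction code used for the datasets: the function name `split_dataset`, the `(section_id, sample)` pair representation, and the fixed random seed are assumptions made for the example, and the text does not say whether the 10% development split is drawn at the level of samples or of sections (samples are assumed here).

```python
import random


def split_dataset(samples, test_sections, dev_fraction=0.1, seed=0):
    """Hold out whole sections for testing, then split the rest at random.

    `samples` is a list of (section_id, sample) pairs (a hypothetical format).
    Returns (train, dev, test) lists of samples.
    """
    test = [s for sec, s in samples if sec in test_sections]
    rest = [s for sec, s in samples if sec not in test_sections]

    # Shuffle a copy of the non-test data with a fixed seed for reproducibility.
    rng = random.Random(seed)
    rng.shuffle(rest)

    # The first dev_fraction of the shuffled data becomes the development set.
    n_dev = round(len(rest) * dev_fraction)
    return rest[n_dev:], rest[:n_dev], test


# Toy stand-in for Gigaword: one dummy sample per section 000-760.
toy_samples = [(sec, f"sample-{sec:03d}") for sec in range(761)]
train, dev, test = split_dataset(toy_samples, test_sections=set(range(700, 761)))
```

For PTB, the same function would be called with `test_sections={22, 23}`.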

For each dataset, we consider a set of five target adverbs: too, again, also, still, and yet. We choose these five because they are commonly used adverbs that trigger presupposition. Since we are interested in the capacity of attentional deep neural networks to predict presuppositional effects in general, we frame the learning problem as binary classification: predicting the presence of an adverbial presupposition, as opposed to the identity of the adverb.
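To make the binary framing concrete, the sketch below assigns a label to a tokenized sentence. It is a deliberate simplification: it matches surface tokens only, whereas words such as "still" and "yet" also occur in non-adverbial uses, so a real pipeline would presumably rely on part-of-speech tags. The names `TARGET_ADVERBS` and `has_trigger` are hypothetical, and the procedure for selecting negative samples is not specified in this section.

```python
# The five target adverbs named in the text.
TARGET_ADVERBS = frozenset({"too", "again", "also", "still", "yet"})


def has_trigger(tokens, targets=TARGET_ADVERBS):
    """Return 1 if any token is a target adverb (case-insensitive), else 0.

    Assumption: plain token matching, with no part-of-speech filtering.
    """
    return int(any(tok.lower() in targets for tok in tokens))


positive = has_trigger("She called him again".split())
negative = has_trigger("She called him".split())
```

The label records only whether a trigger is present, not which adverb it is.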

On the Gigaword corpus, we consider each adverb separately, resulting in five binary classification tasks. This was not feasible for PTB because of its small size.

Finally, because of the commonalities between the adverbs in presupposing similar events, we create a dataset that unifies all instances of the five adverbs found in the Gigaword corpus, with a label "1" indicating the presence of any of these adverbs.
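The relationship between the five per-adverb tasks and the unified dataset can be sketched as below. As before, this is an illustration under stated assumptions rather than the actual dataset construction: the helper names `per_adverb_labels` and `unified_label` are hypothetical, and matching is done on surface tokens without part-of-speech filtering.

```python
# The five target adverbs named in the text, in a fixed order.
TARGET_ADVERBS = ("too", "again", "also", "still", "yet")


def per_adverb_labels(tokens):
    """One binary label per adverb, mirroring the five separate tasks."""
    lowered = {tok.lower() for tok in tokens}
    return {adv: int(adv in lowered) for adv in TARGET_ADVERBS}


def unified_label(tokens):
    """Label 1 if any of the five adverbs is present, else 0."""
    return int(any(per_adverb_labels(tokens).values()))


tokens = "They are still waiting".split()
separate = per_adverb_labels(tokens)
unified = unified_label(tokens)
```

Under this sketch, the unified label is the logical OR of the five per-adverb labels.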