Raw dialogue files (two-way conversations, no pre-processing):
Ubuntu dialogues (527M)
Files containing the data for the response classification task described in the paper.
The data is split into train/validation/test sets. Each example is a triple: (context, response, flag). A small loading example is given below the download list.
Standard (readable) CSV. The archive is very large, so it is split into several files to facilitate downloading.
ubuntu_dataset.tgz.aa (954M)
ubuntu_dataset.tgz.ab (954M)
ubuntu_dataset.tgz.ac (954M)
ubuntu_dataset.tgz.ad (954M)
ubuntu_dataset.tgz.ae (690M)
The files can be joined and extracted with the command `cat ubuntu_dataset.tgz.a* | tar xz`
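The following is a minimal Python sketch (not part of the release) for iterating over the extracted CSV examples as (context, response, flag) triples. The file name "train.csv", the presence of a header row, and the column order are assumptions based on the description above; adjust them to match the contents of the extracted archive.

import csv

# Dialogue contexts can be long; raise the default field size limit as a precaution.
csv.field_size_limit(10**7)

def read_triples(path, skip_header=True):
    """Yield (context, response, flag) triples from one CSV split."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        if skip_header:
            next(reader, None)  # skip a header row if the split has one
        for row in reader:
            context, response, flag = row[0], row[1], row[2]
            yield context, response, flag  # flag is the 0/1 relevance label

if __name__ == "__main__":
    # Print the first few examples as a sanity check.
    for i, (context, response, flag) in enumerate(read_triples("train.csv")):
        print(flag, response[:60])
        if i >= 4:
            break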
The same data in a single binarized CSV:
Ubuntu blobs (513M)
The Ubuntu blobs are in the format required by our code to train the neural architecture described in the paper:
Ubuntu neural dialogue model
For further information, contact: Dr. Joelle Pineau