This file gives an overview of the datasets described and used in:

Multilingual Sentiment Analysis on Social Media
E. Tromp

This folder contains 7 subfolders. Each of these
subfolders contains its own readme with more
information on the specific dataset. We next
describe each folder briefly.

Ground Truth - Contains manually labeled social
media messages. In the thesis this dataset is
called the Ground Truth Set and it is used to
align with the traditional survey data.

LI - Contains all data used for the language
identification experiments. This includes both
training and validation data.

Survey - Contains all survey responses and the
manually labeled samples 'Sample 1' and 'Sample 2'
as described in the thesis.

Test Set - Contains the manually labeled test set,
which comprises 120 messages in total.

Training Sentiment - Contains the training data
used for the subjectivity and polarity detection
algorithms. For language identification, the data
in the LI folder is used to train upon instead.

Validation Set - Contains the validation set, which
was extracted from the crawled data and then manually
labeled for each social medium separately.

Wordlists - Contains the word lists used as
additional features in the AdaBoost algorithm for
subjectivity detection.