Bullying Traces v1.0

*****************************************************************************
As per the Twitter Terms of Service, the original tweets are not included in
this dataset. Instead, we release processed data files only. Specifically,
we release the vocabulary, feature vectors, and labels. It is impossible to
fully reconstruct the original tweets from this data. However, we believe
the dataset is still useful for the scientific study of bullying.
*****************************************************************************

This dataset contains data collected from the Twitter stream API and labeled
by experienced annotators for the study of bullying traces in social media.

We collected tweets using the public Twitter stream API, such that each
tweet contains at least one of the following keywords: "bully", "bullied",
"bullying". We further removed re-tweets by excluding tweets containing the
acronym "RT". Our annotators labeled 1762 tweets uniformly sampled from the
ones collected by the above procedure on August 6, 2011.

The tweets are case-folded and tokenized, but without any stemming or
stopword removal. Any user mention preceded by "@" was replaced by the
anonymized user name "@USERNAME". Any URL starting with "http" was replaced
by the token "HTTPLINK". Hashtags (compound words following "#") were not
split and were treated as a single token. Emoticons, such as ":)" or ":D",
were also included as tokens. Our features include both unigrams and bigrams
that appear at least twice in the 1762 tweets.

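The preprocessing described above can be approximated with the minimal
Python sketch below. It is not the script used to build the dataset. The
regular expressions, the function names, the choice to leave emoticons and
placeholder tokens uncased, the space-joined bigram representation, and the
reading of "at least twice" as total occurrences are all assumptions made
for illustration.

```python
import re
from collections import Counter

# All patterns below are illustrative assumptions, not the original tokenizer.
URL = re.compile(r"http\S+")            # any URL starting with "http"
MENTION = re.compile(r"@\w+")           # user mentions preceded by "@"
RETWEET = re.compile(r"\bRT\b")         # the acronym "RT"
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")
TOKEN = re.compile(
    r"@USERNAME|HTTPLINK|#\w+|[:;=][-']?[)(DPp]|\w+(?:'\w+)?|[^\w\s]"
)

def is_retweet(text):
    """Return True if the tweet contains the acronym RT."""
    return RETWEET.search(text) is not None

def tokenize(text):
    """Anonymize mentions and URLs, tokenize, and case-fold ordinary tokens."""
    text = URL.sub("HTTPLINK", text)
    text = MENTION.sub("@USERNAME", text)
    tokens = []
    for tok in TOKEN.findall(text):
        if tok in ("@USERNAME", "HTTPLINK") or EMOTICON.fullmatch(tok):
            tokens.append(tok)          # keep placeholders and emoticons as-is
        else:
            tokens.append(tok.lower())  # case-fold words and hashtags
    return tokens

def ngram_features(tokens):
    """Return unigrams followed by bigrams (bigrams joined by one space)."""
    bigrams = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

def build_vocab(tweets, min_count=2):
    """Keep features that occur at least min_count times over all tweets."""
    counts = Counter()
    for text in tweets:
        if not is_retweet(text):
            counts.update(ngram_features(tokenize(text)))
    return sorted(f for f, c in counts.items() if c >= min_count)
```

Because the released files already contain the final vocabulary and feature
vectors, a sketch like this is only useful for mapping new text into a
similar token space; it will not necessarily reproduce the released
vocabulary exactly.
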
To cite this dataset:

  Learning from bullying traces in social media
  Jun-Ming Xu, Kwang-Sung Jun, Xiaojin Zhu, and Amy Bellmore
  In North American Chapter of the Association for Computational
  Linguistics - Human Language Technologies (NAACL HLT)
  Montreal, Canada, 2012

Contact: Jun-Ming Xu (xujm@cs.wisc.edu), Xiaojin Zhu (jerryzhu@cs.wisc.edu)
April 2012

----------------------------------Format-----------------------------------

The feature vectors are written in a sparse vector format as commonly used
in SVM-light. Each line corresponds to one tweet and has the format:

  Label featureIndex:value featureIndex:value featureIndex:value ...

Only features with a nonzero value are listed. Each vector is normalized to
have norm 1.

----------------------------------Content----------------------------------

vocab
  The vocabulary file for all features. Each line is a token and the index
  is the line number, starting from 1.

tweetType
  This corresponds to "NLP Task A: Text Categorization" in the paper. It
  contains 1762 feature vectors for all the labeled tweets.
  Labels:
     1  bullying trace
    -1  not bullying trace

authorRole
  This corresponds to "NLP Task B: Role Labeling / Author's Roles". It
  contains the 684 feature vectors for the bullying traces only.
  Labels:
    1  Accuser
    2  Bully
    3  Reporter
    4  Victim
    5  Other

teasing
  This corresponds to "NLP Task C: Sentiment Analysis" in the paper. It
  contains the 684 feature vectors for the bullying traces only.
  Labels:
     1  Teasing
    -1  Not teasing
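
The file format described above can be read with a few lines of Python. The
sketch below is a minimal, unofficial reader. The function names are
hypothetical, the file paths passed to them are whatever the files are
called on disk, integer labels are assumed, and the norm check assumes the
Euclidean (L2) norm, which the description above does not state explicitly.

```python
import math

# Label names for the authorRole file, as listed in this README.
ROLE_NAMES = {1: "Accuser", 2: "Bully", 3: "Reporter", 4: "Victim", 5: "Other"}

def parse_line(line):
    """Parse 'Label idx:value idx:value ...' into (label, {idx: value})."""
    parts = line.split()
    label = int(parts[0])               # assumes integer labels such as 1 or -1
    vector = {}
    for item in parts[1:]:
        index, value = item.split(":")
        vector[int(index)] = float(value)
    return label, vector

def load_vectors(path):
    """Read one (label, sparse vector) pair per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [parse_line(line) for line in f if line.strip()]

def load_vocab(path):
    """Read the vocabulary; entry i (0-based) has feature index i + 1."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\r\n") for line in f]

def feature_names(vector, vocab):
    """Map 1-based feature indices back to vocabulary tokens."""
    return {vocab[index - 1]: value for index, value in vector.items()}

def l2_norm(vector):
    """Euclidean norm of a sparse vector (assumed normalization)."""
    return math.sqrt(sum(v * v for v in vector.values()))
```

For example, assuming the files sit in the current directory under the names
listed above, `load_vectors("tweetType")` would return 1762 pairs, and
`feature_names(vector, load_vocab("vocab"))` would show which tokens are
active in a given tweet.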