Subset of 6 Corpora for the SentiStrength Benchmark — corpus_senti

A dataset containing six corpora used in the SentiStrength Benchmark for judging the performance of automated sentiment analysis techniques. The corpus contains a random sample of 4,044 comments from online social networks collected circa 2010. The original corpora include: BBC Forums (1,000), Digg (1,077), MySpace (1,041), Runners World (1,046), Twitter (4,242), and YouTube (3,407). Each comment was hand-coded for sentiment polarity. Three coders rated every corpus except Runners World (two coders), Twitter (one coder), and YouTube (one coder); all coders relied on a common codebook. Data were prepared on January 9th, 2021.

Usage

data(corpus_senti_bench4k)

Format

A data frame with 4044 rows and 6 variables.

Variables

doc_id. Unique identifier for each comment
pos_mean. Mean positivity score
neg_mean. Mean negativity score
polarity. Positivity minus negativity (pos_mean minus neg_mean)
source. One of the six online social networks
text. Text of the comment
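
Example

The relationship between the score variables can be sketched in R as follows. This is a minimal illustration on a toy data frame, not the real corpus: the column names follow the list above, but every value (scores, sources, texts) is invented, and the assumption that polarity equals pos_mean minus neg_mean is read off the variable descriptions rather than confirmed against the data.

```r
# Toy stand-in for corpus_senti_bench4k: same column names, invented values.
toy <- data.frame(
  doc_id   = c("d1", "d2", "d3", "d4"),
  pos_mean = c(3.0, 1.0, 2.5, 1.5),
  neg_mean = c(1.0, 4.0, 2.5, 1.0),
  source   = c("Twitter", "Twitter", "Digg", "Digg"),
  text     = c("great run today", "this is awful", "meh", "pretty nice"),
  stringsAsFactors = FALSE
)

# Assumption: polarity is positivity minus negativity, per the description.
toy$polarity <- toy$pos_mean - toy$neg_mean

# Average polarity per source (one row per source, sorted by source name).
by_source <- aggregate(polarity ~ source, data = toy, FUN = mean)
print(by_source)
```

Once the real data are loaded with data(corpus_senti_bench4k), the same aggregate() call should apply to that data frame, assuming its columns are named as listed under Variables.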