A dataset containing the six corpora used in the SentiStrength Benchmark for judging the performance of automated sentiment analysis techniques. Together, the corpora contain 11,557 comments from online social networks, collected circa 2010: BBC Forums (1,000), Digg (1,077), MySpace (1,041), Runners World (1,046), Twitter (4,242), and YouTube (3,407). Each comment was hand-coded for sentiment polarity. Three coders rated each comment, except for Runners World (two coders), Twitter (one coder), and YouTube (one coder); all coders relied on a common codebook. Data were prepared on January 9th, 2021.

data(corpus_senti_bench)

Format

A data frame with 11557 rows and 6 variables.

Variables

  • doc_id. Unique identifier for each comment

  • pos_mean. Mean positivity score

  • neg_mean. Mean negativity score

  • polarity. Positivity score minus negativity score

  • source. One of the six online social networks

  • text. Text of the comment
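
Examples

A minimal sketch of how the variables above might be summarised per source, using base R only. The helper summarise_by_source() and the toy data frame are hypothetical and not part of the package; the toy rows are invented for illustration and assume that polarity equals pos_mean minus neg_mean, with negativity stored as a positive magnitude.

```r
# Toy stand-in with the six documented columns. The rows are invented for
# illustration; they are NOT taken from corpus_senti_bench.
toy <- data.frame(
  doc_id   = c("d1", "d2", "d3", "d4"),
  pos_mean = c(3, 1, 2, 4),
  neg_mean = c(1, 4, 2, 1),
  polarity = c(2, -3, 0, 3),  # assumed here to be pos_mean - neg_mean
  source   = c("Twitter", "Twitter", "Digg", "Digg"),
  text     = c("great run today", "this is awful", "it is a video", "love this"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of comments and mean polarity per source.
summarise_by_source <- function(df) {
  # Count comments per source
  counts <- as.data.frame(table(source = df$source),
                          responseName = "n_comments",
                          stringsAsFactors = FALSE)
  # Average the polarity column within each source
  means <- aggregate(polarity ~ source, data = df, FUN = mean)
  names(means)[2] <- "mean_polarity"
  # One row per source, sorted by source name
  merge(counts, means, by = "source")
}

summary_toy <- summarise_by_source(toy)
print(summary_toy)

# With the package that ships the dataset attached, the same helper could be
# applied to the real data:
# data(corpus_senti_bench)
# summarise_by_source(corpus_senti_bench)
```

The helper relies only on the source and polarity columns, so it should apply unchanged to the full data frame once it is loaded.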