text2map.dictionaries is an R package of datasets for text analysis, including word frequencies, ranks, and norms for several languages (English, Spanish, French, German, Italian, and Portuguese). See also text2map.

Installation

Because this is primarily a dataset package, we do not plan to submit it to CRAN. You can install the latest version from GitLab:

library(remotes)
install_gitlab("culturalcartography/text2map.dictionaries")

library(text2map.dictionaries)

Core Dictionaries (Installed with Package)

The following 4 dictionaries are installed with the package by default:

Dictionary Description Rows Cols
concreteness Lancaster Concreteness Scores (39,954 terms) 39,954 8
english_normalization_rules Text normalization rules (form → replacement) 19,376 6
sensorimotor Lancaster Sensorimotor Norms (39,707 terms) 39,707 45
unicode_normalization Unicode normalization rules (form → replacement) 635 5

On-Demand Dictionaries (Downloaded as Needed)

The following 65 dictionaries are downloaded on-demand from the repository when you first request them:

Dictionary Description Rows
affordance_norms Affordance production norms for 2,825 concrete nouns 2,825
bgb_pleasantness Bellezza et al. pleasantness and imagery ratings 399
bootstrap_mrc Bootstrapped MRC psycholinguistic features 85,942
british_american_spelling British↔︎American spelling differences by pattern 283
callsigns US FCC broadcast station callsigns and owners 43,366
category_typicality Category typicality ratings from Banks & Connell and THINGSplus 3,033
chemicals Chemical names, formulas, and identifiers 77,975
demonyms Demonyms and adjectivals for places 1,553
diseases MalaCards Human Disease Database 37,991
elp_lexical English Lexicon Project lexical and behavioral data 79,672
emfd_norms Extended Moral Foundations Dictionary norms 3,270
english_abbreviations Abbreviations, acronyms, honorifics, initialisms, and political entity abbreviations 1,585
english_action_verbs English action verbs 1,566
english_adverbs English adverbs 13,397
english_antonyms WordNet antonym pairs 3,627
english_apostrophe_words Apostrophe-containing words skip-list 2,431
english_archaic Archaic/dialectal spellings with modern equivalents 166
english_bigrams English bigram frequencies from Google Web Trillion Word Corpus 286,357
english_colors CSS4 named colors with hex and RGB values 148
english_compounds Compound words (hyphenated + closed) 62,480
english_contractions English contractions and expansions (basic + extended) 550
english_discourse_markers Discourse markers categorized by type and semantics 97
english_emoticons Text emoticons and Unicode emoji with names and categories 467
english_freqs English word frequencies across four corpora 371,938
english_function_words English function words and prepositions 460
english_fusing_rules Historical bigram-to-compound fusing rules 259
english_grady GradyAugmented English word list 122,806
english_hedging Epistemic hedging and stance markers 247
english_homophones Commonly confused homophone pairs and groups 290
english_interjections Interjections with translations and definitions 168
english_internet_slang Internet slang, abbreviations, acronyms, and leetspeak 242
english_irregular_verbs English irregular verbs with conjugation forms 188
english_legal_jargon Legal terminology with plain English translations 349
english_log_freq Log-frequency word values 811
english_medical Medical subject headings from NLM MeSH 2026 31,110
english_misspellings Common misspellings with corrections (codespell, Birkbeck, Aspell, Wikipedia, Holbrook) 93,488
english_numerics Cardinal, ordinal, and Roman numeral words 4,101
english_personal_names Personal names (skip list for OCR correction) 4,644
english_phonaesthemes English phonaesthetic patterns (onset + rime) 640
english_phrasal_verbs English phrasal verbs with simplified meanings 1,085
english_place_names Place names (skip list for OCR correction) 604
english_professions Occupation titles from O*NET 30.2 (canonical + alternates) 58,556
english_syllables English words with syllable counts 20,137
english_synonyms WordNet synonym pairs 266,155
english_verb_embodiment Embodiment and semantic norms for English verbs 2,938
english_word_recognition Age-of-acquisition and familiarity ratings for English words 31,124
french_freqs French word frequencies (Wikipedia, news, subtitles) 415,863
german_freqs German word frequencies (Wikipedia, news, subtitles) 743,573
global_surnames Global surname prevalence by country 10,607,198
humor_norms Humor ratings for 4,997 English words 4,997
iconicity Iconicity ratings for 14,776 English words 14,776
italian_freqs Italian word frequencies (Wikipedia, news, subtitles) 456,804
kte_survey Kozlowski et al. cultural associations 59
latin_phrases Latin phrases with English translations 2,757
mft_anchors Moral Foundations Theory anchor words 365
nrc_vad NRC Valence, Arousal, Dominance 20,007
occupation_prestige Occupational prestige scores across multiple coding systems (SIOPS, ISEI, CAMSIS, GSS, NS-SEC) 10,674
organisms Scientific organism names from ITIS and UniProt 4,478,229
portuguese_freqs Portuguese word frequencies (Wikipedia, news, subtitles) 385,435
semantic_density Semantic density ratings from McRae and Buchanan norms 4,436
spanish_freqs Spanish word frequencies (Wikipedia, news, subtitles) 455,683
subtlexus_freqs SUBTLEXus word frequencies 74,286
us_ssa_names US SSA baby name frequencies 2,085,158
us_ssa_surnames US SSA surname data by race/ethnicity 162,254
wkb_vad Warriner et al. VAD scores with group breakdowns 97,398

Usage

Listing and Inspecting Dictionaries

# List all available dictionaries
list_dictionaries()

# List only installed dictionaries
list_dictionaries(installed_only = TRUE)

# Filter by status, pattern, or minimum rows
list_dictionaries(status = "ondemand", pattern = "^english", min_rows = 1000)

# Get detailed metadata for a single dictionary
dictionary_info("sensorimotor")
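
The filters accepted by list_dictionaries() can be pictured as ordinary data-frame subsetting. The sketch below is not the package's implementation: it builds a small stand-in catalog by hand, with rows copied from the tables above. The column names (name, status, rows) and the "core" status label are assumptions for illustration, as is the idea that the three filters combine with a logical AND.

```r
# Toy catalog standing in for the package's dictionary listing.
# Column names and the "core" status label are assumptions.
catalog <- data.frame(
  name   = c("sensorimotor", "english_adverbs", "english_colors",
             "english_freqs", "french_freqs"),
  status = c("core", "ondemand", "ondemand", "ondemand", "ondemand"),
  rows   = c(39707, 13397, 148, 371938, 415863),
  stringsAsFactors = FALSE
)

# Keep on-demand dictionaries whose name starts with "english"
# and that have at least 1,000 rows.
keep <- catalog$status == "ondemand" &
  grepl("^english", catalog$name) &
  catalog$rows >= 1000

filtered <- catalog[keep, ]
print(filtered$name)
```

Here english_colors drops out on the row threshold, and french_freqs and sensorimotor drop out on the name pattern and status respectively.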

Loading Dictionaries

The main function is load_dictionary(), which auto-downloads on-demand dictionaries:

# Load a core dictionary (already installed)
sensorimotor <- load_dictionary("sensorimotor")

# Load an on-demand dictionary (auto-downloads on first use)
global_surnames <- load_dictionary("global_surnames", large = TRUE)

# Load with column name unification (adds a "term" column)
abbrevs <- load_dictionary("english_abbreviations", unify = TRUE)

# Load multiple dictionaries at once
dicts <- load_dictionaries(c("nrc_vad", "wkb_vad", "humor_norms"))

# Take a random sample from a large dictionary
surname_sample <- sample_dictionary("global_surnames", n = 100, seed = 42, large = TRUE)
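
To make the role of the n and seed arguments concrete, here is a minimal base-R sketch of seeded row sampling. The helper sample_rows() is hypothetical and only illustrates the general technique; how sample_dictionary() draws its sample internally is not documented here.

```r
# Hypothetical helper: draw n rows from a data frame, reproducibly
# when a seed is supplied.
sample_rows <- function(df, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- min(n, nrow(df))  # never ask for more rows than exist
  df[sample.int(nrow(df), n), , drop = FALSE]
}

# Toy data standing in for a large dictionary.
toy <- data.frame(id = 1:1000, stringsAsFactors = FALSE)

first  <- sample_rows(toy, n = 100, seed = 42)
second <- sample_rows(toy, n = 100, seed = 42)

# The same seed yields the same rows in the same order.
print(identical(first, second))
```

Fixing the seed is what lets a sampled subset be regenerated exactly in a later session or by another researcher.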

Downloading and Updating

# Pre-download an on-demand dictionary
download_dictionary("english_freqs")

# Download to a custom location
download_dictionary("chemicals", path = "/my/data")

# Check for updates
check_updates()

# Update all outdated dictionaries
update_dictionaries()

Cache Management

Loaded dictionaries are cached for faster subsequent access:

# Clear all cached dictionaries
clear_dictionary_cache()

# Remove a single cached dictionary
remove_cached_dictionary("global_surnames")

# Force re-cache after an update
load_dictionary("nrc_vad", force_rebuild = TRUE)
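
The caching behaviour can be illustrated with a small in-memory cache keyed by dictionary name. This is a sketch under stated assumptions, not the package's mechanism: whether the real cache lives in memory or on disk is not specified here, and slow_read() and cached_load() are hypothetical names.

```r
# In-memory cache keyed by dictionary name (illustrative only).
.cache <- new.env(parent = emptyenv())
loads <- 0L  # counts how often the slow path runs

# Hypothetical stand-in for reading a dictionary from disk.
slow_read <- function(name) {
  loads <<- loads + 1L
  data.frame(term = c("a", "b"), source = name, stringsAsFactors = FALSE)
}

# Return the cached copy unless it is missing or a rebuild is forced.
cached_load <- function(name, force_rebuild = FALSE) {
  if (force_rebuild || !exists(name, envir = .cache, inherits = FALSE)) {
    assign(name, slow_read(name), envir = .cache)
  }
  get(name, envir = .cache, inherits = FALSE)
}

d1 <- cached_load("nrc_vad")                        # slow path
d2 <- cached_load("nrc_vad")                        # served from the cache
d3 <- cached_load("nrc_vad", force_rebuild = TRUE)  # slow path again

# Drop one entry, then everything that remains.
rm(list = "nrc_vad", envir = .cache)
rm(list = ls(.cache, all.names = TRUE), envir = .cache)
```

The second call never touches slow_read(), which is where the speed-up on repeated loads comes from.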

Version and Integrity

# Check a dictionary's version
dict_version("sensorimotor")

# Verify a dictionary's integrity (MD5 checksum)
verify_dictionary("sensorimotor")
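
An MD5 integrity check amounts to comparing a file's current checksum with one recorded earlier. The base-R sketch below uses tools::md5sum() on a temporary file; file_is_intact() is a hypothetical helper, and where verify_dictionary() obtains its reference checksum is not covered here.

```r
# Write a small file to stand in for a downloaded dictionary.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(term = c("alpha", "beta")), path, row.names = FALSE)

# Record the checksum at "download" time.
expected <- unname(tools::md5sum(path))

# Hypothetical helper: TRUE when the file still matches the recorded checksum.
file_is_intact <- function(path, expected_md5) {
  identical(unname(tools::md5sum(path)), expected_md5)
}

ok_before <- file_is_intact(path, expected)

# Simulate corruption by appending a line, then check again.
cat("gamma\n", file = path, append = TRUE)
ok_after <- file_is_intact(path, expected)

print(c(before = ok_before, after = ok_after))
```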

Unifying Column Names

Dictionaries use different names for their primary identifier column (term, word, form, name, etc.). Use unify_dictionary() to normalize them:

# Unify by name
surnames <- unify_dictionary("global_surnames")
head(surnames$term)

# Unify a data frame directly
df <- load_dictionary("english_numerics")
unified <- unify_dictionary(df)
head(unified$term)
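
Conceptually, unification finds whichever identifier column a dictionary uses and exposes it under the common name term. The base-R sketch below shows that idea on a toy data frame. The helper add_term_column() and its candidate list are assumptions drawn from the column names mentioned above, not the package's actual lookup order.

```r
# Hypothetical helper: add a "term" column copied from the first
# recognised identifier column found in the data frame.
add_term_column <- function(df, candidates = c("term", "word", "form", "name")) {
  hit <- intersect(candidates, names(df))  # keeps the order of `candidates`
  if (length(hit) == 0) {
    stop("no identifier column found among: ",
         paste(candidates, collapse = ", "))
  }
  if (!"term" %in% names(df)) {
    df$term <- df[[hit[1]]]
  }
  df
}

# Toy dictionary whose identifier column is called "word".
words <- data.frame(word = c("cat", "dog"), n = c(2L, 5L),
                    stringsAsFactors = FALSE)
unified_words <- add_term_column(words)
print(names(unified_words))
```

With a shared term column, dictionaries from different sources can be merged or joined to a token list using the same key.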

File Formats

The package supports two file formats:

  • .qs2 (default): Smaller file size, faster loading. Uses ZSTD compression (level 22).
  • .rda: Standard R data format for compatibility.
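
For readers unfamiliar with the two formats, the sketch below round-trips a toy data frame through each. It is independent of this package: the .rda part uses only base R, and the .qs2 part assumes the separate qs2 package and is skipped when that package is not installed.

```r
toy <- data.frame(term = c("alpha", "beta"), score = c(0.1, 0.9),
                  stringsAsFactors = FALSE)

# .rda: base R format; objects are stored and restored by name.
rda_path <- tempfile(fileext = ".rda")
save(toy, file = rda_path)
restored <- new.env()
load(rda_path, envir = restored)
from_rda <- get("toy", envir = restored)

# .qs2: needs the qs2 package; compress_level = 22 is the ZSTD maximum.
if (requireNamespace("qs2", quietly = TRUE)) {
  qs2_path <- tempfile(fileext = ".qs2")
  qs2::qs_save(toy, qs2_path, compress_level = 22L)
  from_qs2 <- qs2::qs_read(qs2_path)
}

print(identical(from_rda, toy))
```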

Related Packages

There are four related packages hosted on GitLab: text2map, text2map.theme, text2map.corpora, and text2map.pretrained.

The above packages can be installed using the following:

install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.theme")
install_gitlab("culturalcartography/text2map.corpora")
install_gitlab("culturalcartography/text2map.pretrained")

Contributions and Support

We welcome new dictionaries, especially old or rare ones! If you have a dictionary you would like to make easily available to other researchers, send us an email (maintainers [at] textmapping.com) or submit a pull request.

Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.dictionaries/-/issues