
This is an R package of datasets for text analysis, including word frequencies, ranks, and norms for several languages (English, Spanish, French, German, Italian, Portuguese). See also text2map.

Installation

This is primarily a dataset package, so it will not be submitted to CRAN. You can install the latest version from GitLab:

library(remotes)
install_gitlab("culturalcartography/text2map.dictionaries")

library(text2map.dictionaries)

Core Dictionaries (Installed with Package)

The following 11 dictionaries are installed with the package by default:

Dictionary | Description | Rows | Cols
concreteness | Lancaster Concreteness Scores (39,954 terms) | 39,954 | 8
english_abbreviations | Abbreviations, acronyms, honorifics, initialisms, and firearm calibers | 952 | 5
english_emoticons | Text emoticons and their meanings | 81 | 3
english_fusing_rules | Historical bigram-to-compound fusing rules | 259 | 5
english_interjections | Interjections with translations and definitions | 168 | 6
english_irregular_verbs | English irregular verbs with conjugation forms | 188 | 5
english_normalization_rules | Text normalization rules (form → replacement) | 19,376 | 6
english_numerics | Cardinal, ordinal, and Roman numeral words | 4,101 | 4
english_personal_names | Personal names (skip list for OCR correction) | 4,644 | 2
sensorimotor | Lancaster Sensorimotor Norms (39,707 terms) | 39,707 | 45
unicode_normalization | Unicode normalization rules (form → replacement) | 635 | 5

On-Demand Dictionaries (Downloaded as Needed)

The following 54 dictionaries are downloaded from the repository the first time you request them:

Dictionary | Description | Rows
bgb_pleasantness | Bellezza et al. pleasantness and imagery ratings | 399
bootstrap_mrc | Bootstrapped MRC psycholinguistic features | 85,942
british_american_spelling | British↔︎American spelling differences by pattern | 283
callsigns | US FCC broadcast station callsigns and owners | 43,366
chemicals | Chemical names, formulas, and identifiers | 77,975
demonyms | Demonyms and adjectivals for places | 1,553
diseases | MalaCards Human Disease Database | 37,991
elp_lexical | English Lexicon Project lexical and behavioral data | 79,672
emfd_norms | Extended Moral Foundations Dictionary norms | 3,270
english_action_verbs | English action verbs | 1,566
english_adverbs | English adverbs | 13,397
english_antonyms | WordNet antonym pairs | 3,627
english_apostrophe_words | Apostrophe-containing words skip-list | 2,431
english_archaic | Archaic/dialectal spellings with modern equivalents | 166
english_bigrams | English bigram frequencies from Google Web Trillion Word Corpus | 286,357
english_colors | CSS4 named colors with hex and RGB values | 148
english_compounds | Compound words (hyphenated + closed) | 62,480
english_contractions | English contractions and expansions (basic + extended) | 550
english_discourse_markers | Discourse markers categorized by type and semantics | 97
english_emoji | Unicode emoji with English names and categories | 386
english_freqs | English word frequencies across four corpora | 371,939
english_function_words | English function words | 350
english_grady | GradyAugmented English word list | 122,806
english_hedging | Epistemic hedging and stance markers | 247
english_homophones | Commonly confused homophone pairs and groups | 290
english_internet_slang | Internet slang, abbreviations, acronyms, and leetspeak | 242
english_legal_jargon | Legal terminology with plain English translations | 349
english_medical | Medical subject headings from NLM MeSH 2026 (terms, synonyms, categories, scope notes) | 31,110
english_log_freq | Log-frequency word values | 811
english_misspellings | Common misspellings with corrections | 40,299
english_phrasal_verbs | English phrasal verbs with simplified meanings | 1,085
english_place_names | Place names (skip list for OCR correction) | 604
english_political_abbreviations | Political entity abbreviations (ISO, USPS, AP, Canada Post) | 642
english_prepositions | English prepositions | 162
english_professions | Occupation titles from O*NET 30.2 (canonical + alternates) | 58,556
english_syllables | English words with syllable counts | 20,137
english_synonyms | WordNet synonym pairs | 266,155
french_freqs | French word frequencies (Wikipedia, news, subtitles) | 415,864
german_freqs | German word frequencies (Wikipedia, news, subtitles) | 743,574
global_surnames | Global surname prevalence by country | 10,607,198
humor_norms | Humor ratings for 4,997 English words | 4,997
iconicity | Iconicity ratings for 14,776 English words | 14,776
italian_freqs | Italian word frequencies (Wikipedia, news, subtitles) | 456,805
kte_survey | Kozlowski et al. cultural associations | 59
latin_phrases | Latin phrases with English translations | 2,757
mft_anchors | Moral Foundations Theory anchor words | 365
nrc_vad | NRC Valence, Arousal, Dominance | 20,007
organisms | Scientific organism names from ITIS and UniProt | 3,478,229
portuguese_freqs | Portuguese word frequencies (Wikipedia, news, subtitles) | 385,436
spanish_freqs | Spanish word frequencies (Wikipedia, news, subtitles) | 455,684
subtlexus_freqs | SUBTLEXus word frequencies | 74,286
us_ssa_names | US SSA baby name frequencies | 2,085,158
us_ssa_surnames | US SSA surname data by race/ethnicity | 162,254
wkb_vad | Warriner et al. VAD scores with group breakdowns | 97,398

Usage

Listing and Inspecting Dictionaries

# List all available dictionaries
list_dictionaries()

# List only installed dictionaries
list_dictionaries(installed_only = TRUE)

# Filter by status, pattern, or minimum rows
list_dictionaries(status = "ondemand", pattern = "^english", min_rows = 1000)

# Get detailed metadata for a single dictionary
dictionary_info("sensorimotor")

Loading Dictionaries

The main function is load_dictionary(), which auto-downloads on-demand dictionaries:

# Load a core dictionary (already installed)
sensorimotor <- load_dictionary("sensorimotor")

# Load an on-demand dictionary (auto-downloads on first use)
global_surnames <- load_dictionary("global_surnames", large = TRUE)

# Load with column name unification (adds a "term" column)
abbrevs <- load_dictionary("english_abbreviations", unify = TRUE)

# Load multiple dictionaries at once
dicts <- load_dictionaries(c("nrc_vad", "wkb_vad", "humor_norms"))

# Take a random sample from a large dictionary
# (named surname_sample to avoid masking base::sample)
surname_sample <- sample_dictionary("global_surnames", n = 100, seed = 42, large = TRUE)
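
Once a dictionary is loaded, a typical next step is to look up tokens from a text in it. The following is a minimal sketch of that step in base R. It uses a small toy data frame in place of a real dictionary, and the column names `term` and `valence` are assumptions for illustration only; check the actual column names of the dictionary you load.

```r
# Toy stand-in for a loaded dictionary. The column names `term` and
# `valence` are hypothetical, not the package's documented schema.
toy_dict <- data.frame(
  term    = c("happy", "sad", "table"),
  valence = c(0.9, 0.1, 0.5),
  stringsAsFactors = FALSE
)

# Tokenize a short text: lowercase, then split on runs of non-letters.
text   <- "Happy people sat at the table"
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Look up each token; tokens missing from the dictionary get NA.
scores <- toy_dict$valence[match(tokens, toy_dict$term)]

# Average over the tokens the dictionary covers, and report coverage.
mean_valence <- mean(scores, na.rm = TRUE)
coverage     <- mean(!is.na(scores))
```

Reporting coverage alongside the score is a deliberate choice here: a mean computed over a small share of the tokens can be misleading.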

Downloading and Updating

# Pre-download an on-demand dictionary
download_dictionary("english_freqs")

# Download to a custom location
download_dictionary("chemicals", path = "/my/data")

# Check for updates
check_updates()

# Update all outdated dictionaries
update_dictionaries()

Cache Management

Loaded dictionaries are cached for faster subsequent access:

# Clear all cached dictionaries
clear_dictionary_cache()

# Remove a single cached dictionary
remove_cached_dictionary("global_surnames")

# Force re-cache after an update
load_dictionary("nrc_vad", force_rebuild = TRUE)
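
To illustrate the general idea of caching with a forced rebuild, here is a minimal base R sketch. It is not the package's implementation: the cache environment, `cached_load()`, and `slow_loader()` are hypothetical names, and this README does not say whether the package caches in memory or on disk.

```r
# A hypothetical in-memory cache; not the package's implementation.
cache_env <- new.env(parent = emptyenv())

cached_load <- function(name, loader, force_rebuild = FALSE) {
  # Call the loader only on a cache miss or when a rebuild is forced.
  if (force_rebuild || !exists(name, envir = cache_env, inherits = FALSE)) {
    assign(name, loader(name), envir = cache_env)
  }
  get(name, envir = cache_env, inherits = FALSE)
}

# Stand-in loader that counts how often it is actually called.
n_calls <- 0
slow_loader <- function(name) {
  n_calls <<- n_calls + 1
  data.frame(term = c("a", "b"), stringsAsFactors = FALSE)
}

d1 <- cached_load("toy", slow_loader)                        # calls the loader
d2 <- cached_load("toy", slow_loader)                        # served from the cache
d3 <- cached_load("toy", slow_loader, force_rebuild = TRUE)  # reloads
```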

Version and Integrity

# Check a dictionary's version
dict_version("sensorimotor")

# Verify a dictionary's integrity (MD5 checksum)
verify_dictionary("sensorimotor")
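
For readers unfamiliar with checksum verification, the sketch below shows the general technique in base R using `tools::md5sum()`. The temporary file and the reference checksum are stand-ins created for the example; how `verify_dictionary()` obtains its reference values is not described here.

```r
# Write a small file to stand in for a downloaded dictionary file.
path <- tempfile(fileext = ".csv")
writeLines(c("term,score", "happy,0.9"), path)

# Record a reference checksum (hypothetical "published" value).
expected_md5 <- unname(tools::md5sum(path))

# Later: recompute the checksum and compare it to the reference.
is_intact <- identical(unname(tools::md5sum(path)), expected_md5)

# Simulate corruption by appending a line, then check again.
cat("sad,0.1\n", file = path, append = TRUE)
is_intact_after_change <- identical(unname(tools::md5sum(path)), expected_md5)
```

Here `is_intact` is TRUE before the file is modified and `is_intact_after_change` is FALSE afterwards.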

Unifying Column Names

Dictionaries use different names for their primary identifier column (term, word, form, name, etc.). Use unify_dictionary() to normalize them:

# Unify by name
surnames <- unify_dictionary("global_surnames")
head(surnames$term)

# Unify a data frame directly
df <- load_dictionary("english_numerics")
unified <- unify_dictionary(df)
head(unified$term)
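
The idea behind unification can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code behind unify_dictionary(): the function name `unify_sketch()` and the priority order of candidate columns are hypothetical.

```r
# Candidate identifier columns in an assumed priority order, based on
# the names mentioned above.
id_candidates <- c("term", "word", "form", "name")

unify_sketch <- function(df, candidates = id_candidates) {
  # intersect() keeps the order of `candidates`, so hit[1] is the
  # highest-priority identifier column present in df.
  hit <- intersect(candidates, names(df))
  if (length(hit) == 0) stop("no identifier column found")
  # Add a leading "term" column only if one does not already exist.
  if (!"term" %in% names(df)) {
    df <- data.frame(term = df[[hit[1]]], df,
                     stringsAsFactors = FALSE, check.names = FALSE)
  }
  df
}

toy <- data.frame(word = c("one", "two"), value = 1:2,
                  stringsAsFactors = FALSE)
unified_toy <- unify_sketch(toy)
names(unified_toy)
```

With a common `term` column, dictionaries that use different identifier names can be joined or matched with the same code.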

File Formats

The package supports two file formats:

  • .qs2 (default): Smaller file size, faster loading. Uses ZSTD compression (level 22).
  • .rda: Standard R data format for compatibility.
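
The difference between the two formats can be seen in a small round trip. This sketch uses a toy data frame, not a package dictionary. The .rda part is base R; the .qs2 part assumes the qs2 package is installed and that `qs2::qs_save()` accepts a `compress_level` argument, so it is skipped when qs2 is unavailable.

```r
toy <- data.frame(term = c("a", "b"), score = c(1, 2))

# .rda: base R. load() restores the object under its saved name,
# so load into a separate environment to avoid overwriting `toy`.
rda_path <- tempfile(fileext = ".rda")
save(toy, file = rda_path)
restored <- new.env()
load(rda_path, envir = restored)
same_rda <- identical(restored$toy, toy)

# .qs2: requires the qs2 package; skipped if it is not installed.
if (requireNamespace("qs2", quietly = TRUE)) {
  qs_path <- tempfile(fileext = ".qs2")
  qs2::qs_save(toy, qs_path, compress_level = 22)
  same_qs2 <- identical(qs2::qs_read(qs_path), toy)
}
```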

There are four related packages hosted on GitLab: text2map, text2map.theme, text2map.corpora, and text2map.pretrained.

text2map can be installed from CRAN, and the other three from GitLab:

install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.theme")
install_gitlab("culturalcartography/text2map.corpora")
install_gitlab("culturalcartography/text2map.pretrained")

Contributions and Support

We welcome new dictionaries, especially old or rare ones! If you have a dictionary you would like to make easily available to other researchers, send us an email (maintainers [at] textmapping.com) or submit a pull request.

Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.dictionaries/-/issues