
text2map.dictionaries: Dictionaries for Text Analysis
An R package of datasets for text analysis, including word frequencies, ranks, and norms for several languages (English, Spanish, French, German, Italian, Portuguese). See also text2map.
Installation
Because this is primarily a dataset package, we do not plan to submit it to CRAN. You can install the latest version from GitLab:
library(remotes)
install_gitlab("culturalcartography/text2map.dictionaries")
library(text2map.dictionaries)
Core Dictionaries (Installed with Package)
The following 11 dictionaries are installed with the package by default:
| Dictionary | Description | Rows | Cols |
|---|---|---|---|
| concreteness | Lancaster Concreteness Scores (39,954 terms) | 39,954 | 8 |
| english_abbreviations | Abbreviations, acronyms, honorifics, initialisms, and firearm calibers | 952 | 5 |
| english_emoticons | Text emoticons and their meanings | 81 | 3 |
| english_fusing_rules | Historical bigram-to-compound fusing rules | 259 | 5 |
| english_interjections | Interjections with translations and definitions | 168 | 6 |
| english_irregular_verbs | English irregular verbs with conjugation forms | 188 | 5 |
| english_normalization_rules | Text normalization rules (form → replacement) | 19,376 | 6 |
| english_numerics | Cardinal, ordinal, and Roman numeral words | 4,101 | 4 |
| english_personal_names | Personal names (skip list for OCR correction) | 4,644 | 2 |
| sensorimotor | Lancaster Sensorimotor Norms (39,707 terms) | 39,707 | 45 |
| unicode_normalization | Unicode normalization rules (form → replacement) | 635 | 5 |
On-Demand Dictionaries (Downloaded as Needed)
The following 54 dictionaries are downloaded on demand from the repository the first time you request them:
| Dictionary | Description | Rows |
|---|---|---|
| bgb_pleasantness | Bellezza et al. pleasantness and imagery ratings | 399 |
| bootstrap_mrc | Bootstrapped MRC psycholinguistic features | 85,942 |
| british_american_spelling | British↔American spelling differences by pattern | 283 |
| callsigns | US FCC broadcast station callsigns and owners | 43,366 |
| chemicals | Chemical names, formulas, and identifiers | 77,975 |
| demonyms | Demonyms and adjectivals for places | 1,553 |
| diseases | MalaCards Human Disease Database | 37,991 |
| elp_lexical | English Lexicon Project lexical and behavioral data | 79,672 |
| emfd_norms | Extended Moral Foundations Dictionary norms | 3,270 |
| english_action_verbs | English action verbs | 1,566 |
| english_adverbs | English adverbs | 13,397 |
| english_antonyms | WordNet antonym pairs | 3,627 |
| english_apostrophe_words | Apostrophe-containing words skip-list | 2,431 |
| english_archaic | Archaic/dialectal spellings with modern equivalents | 166 |
| english_bigrams | English bigram frequencies from Google Web Trillion Word Corpus | 286,357 |
| english_colors | CSS4 named colors with hex and RGB values | 148 |
| english_compounds | Compound words (hyphenated + closed) | 62,480 |
| english_contractions | English contractions and expansions (basic + extended) | 550 |
| english_discourse_markers | Discourse markers categorized by type and semantics | 97 |
| english_emoji | Unicode emoji with English names and categories | 386 |
| english_freqs | English word frequencies across four corpora | 371,939 |
| english_function_words | English function words | 350 |
| english_grady | GradyAugmented English word list | 122,806 |
| english_hedging | Epistemic hedging and stance markers | 247 |
| english_homophones | Commonly confused homophone pairs and groups | 290 |
| english_internet_slang | Internet slang, abbreviations, acronyms, and leetspeak | 242 |
| english_legal_jargon | Legal terminology with plain English translations | 349 |
| english_medical | Medical subject headings from NLM MeSH 2026 (terms, synonyms, categories, scope notes) | 31,110 |
| english_log_freq | Log-frequency word values | 811 |
| english_misspellings | Common misspellings with corrections | 40,299 |
| english_phrasal_verbs | English phrasal verbs with simplified meanings | 1,085 |
| english_place_names | Place names (skip list for OCR correction) | 604 |
| english_political_abbreviations | Political entity abbreviations (ISO, USPS, AP, Canada Post) | 642 |
| english_prepositions | English prepositions | 162 |
| english_professions | Occupation titles from O*NET 30.2 (canonical + alternates) | 58,556 |
| english_syllables | English words with syllable counts | 20,137 |
| english_synonyms | WordNet synonym pairs | 266,155 |
| french_freqs | French word frequencies (Wikipedia, news, subtitles) | 415,864 |
| german_freqs | German word frequencies (Wikipedia, news, subtitles) | 743,574 |
| global_surnames | Global surname prevalence by country | 10,607,198 |
| humor_norms | Humor ratings for 4,997 English words | 4,997 |
| iconicity | Iconicity ratings for 14,776 English words | 14,776 |
| italian_freqs | Italian word frequencies (Wikipedia, news, subtitles) | 456,805 |
| kte_survey | Kozlowski et al. cultural associations | 59 |
| latin_phrases | Latin phrases with English translations | 2,757 |
| mft_anchors | Moral Foundations Theory anchor words | 365 |
| nrc_vad | NRC Valence, Arousal, Dominance | 20,007 |
| organisms | Scientific organism names from ITIS and UniProt | 3,478,229 |
| portuguese_freqs | Portuguese word frequencies (Wikipedia, news, subtitles) | 385,436 |
| spanish_freqs | Spanish word frequencies (Wikipedia, news, subtitles) | 455,684 |
| subtlexus_freqs | SUBTLEXus word frequencies | 74,286 |
| us_ssa_names | US SSA baby name frequencies | 2,085,158 |
| us_ssa_surnames | US SSA surname data by race/ethnicity | 162,254 |
| wkb_vad | Warriner et al. VAD scores with group breakdowns | 97,398 |
Usage
Listing and Inspecting Dictionaries
# List all available dictionaries
list_dictionaries()
# List only installed dictionaries
list_dictionaries(installed_only = TRUE)
# Filter by status, pattern, or minimum rows
list_dictionaries(status = "ondemand", pattern = "^english", min_rows = 1000)
# Get detailed metadata for a single dictionary
dictionary_info("sensorimotor")
Loading Dictionaries
The main function is load_dictionary(), which auto-downloads on-demand dictionaries:
# Load a core dictionary (already installed)
sensorimotor <- load_dictionary("sensorimotor")
# Load an on-demand dictionary (auto-downloads on first use)
global_surnames <- load_dictionary("global_surnames", large = TRUE)
# Load with column name unification (adds a "term" column)
abbrevs <- load_dictionary("english_abbreviations", unify = TRUE)
# Load multiple dictionaries at once
dicts <- load_dictionaries(c("nrc_vad", "wkb_vad", "humor_norms"))
# Take a random sample from a large dictionary
sample <- sample_dictionary("global_surnames", n = 100, seed = 42, large = TRUE)
Downloading and Updating
# Pre-download an on-demand dictionary
download_dictionary("english_freqs")
# Download to a custom location
download_dictionary("chemicals", path = "/my/data")
# Check for updates
check_updates()
# Update all outdated dictionaries
update_dictionaries()
Cache Management
Loaded dictionaries are cached for faster subsequent access:
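As a rough picture of what caching a loaded dictionary can mean, the base R sketch below keeps loaded objects in an environment keyed by name and only calls the loader again when asked to rebuild. This is an illustration under assumptions, not the package's implementation: `cached_load`, `clear_cache`, `toy_loader`, and the `.dict_cache` environment are hypothetical names, and the `force_rebuild` argument merely mirrors the option of the same name used in this README.

```r
# Hypothetical in-memory cache keyed by dictionary name.
.dict_cache <- new.env(parent = emptyenv())

# Return the cached object, calling the loader only on a miss or a forced rebuild.
cached_load <- function(name, loader, force_rebuild = FALSE) {
  if (force_rebuild || !exists(name, envir = .dict_cache, inherits = FALSE)) {
    assign(name, loader(name), envir = .dict_cache)
  }
  get(name, envir = .dict_cache, inherits = FALSE)
}

# Drop every cached object.
clear_cache <- function() {
  rm(list = ls(envir = .dict_cache, all.names = TRUE), envir = .dict_cache)
}

# Toy loader that counts how often it runs; it stands in for reading a file.
load_count <- 0
toy_loader <- function(name) {
  load_count <<- load_count + 1
  data.frame(term = c("a", "b"), source = name)
}

first <- cached_load("toy", toy_loader)                        # runs the loader
second <- cached_load("toy", toy_loader)                       # served from the cache
third <- cached_load("toy", toy_loader, force_rebuild = TRUE)  # runs the loader again
clear_cache()
```

The package exposes the following cache controls: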
# Clear all cached dictionaries
clear_dictionary_cache()
# Remove a single cached dictionary
remove_cached_dictionary("global_surnames")
# Force re-cache after an update
load_dictionary("nrc_vad", force_rebuild = TRUE)
Version and Integrity
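The integrity check in this section is described as an MD5 checksum. As a generic illustration of that kind of check, the base R sketch below compares a file's current checksum against a recorded one using `tools::md5sum()`. The helper `checksum_matches` and the temporary file are hypothetical and are not part of the package; how the package stores and compares its checksums is not shown here.

```r
# Write a small file to stand in for a downloaded dictionary.
dict_path <- tempfile(fileext = ".csv")
writeLines(c("term,score", "apple,4.9"), dict_path)

# Checksum recorded when the file was published (computed here for the demo).
expected_md5 <- unname(tools::md5sum(dict_path))

# Hypothetical check: TRUE when the file on disk still matches the record.
checksum_matches <- function(path, expected) {
  identical(unname(tools::md5sum(path)), expected)
}

intact <- checksum_matches(dict_path, expected_md5)

# Simulate corruption by appending a line, then check again.
cat("pear,4.8\n", file = dict_path, append = TRUE)
corrupted <- checksum_matches(dict_path, expected_md5)
```

The package's own version and integrity helpers are: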
# Check a dictionary's version
dict_version("sensorimotor")
# Verify a dictionary's integrity (MD5 checksum)
verify_dictionary("sensorimotor")
Unifying Column Names
Dictionaries use different names for their primary identifier column (term, word, form, name, etc.). Use unify_dictionary() to normalize them:
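Conceptually, this means locating whichever identifier column a dictionary uses and exposing it under a common `term` name. The base R sketch below illustrates the idea on toy data frames. It is an assumption about the general approach rather than the package's code: `unify_term_column`, its `candidates` argument, and the toy columns are hypothetical.

```r
# Hypothetical helper: copy the first recognized identifier column into "term".
unify_term_column <- function(df, candidates = c("term", "word", "form", "name")) {
  hit <- intersect(candidates, names(df))
  if (length(hit) == 0) {
    stop("No recognized identifier column found.")
  }
  # Leave the data frame unchanged when a "term" column already exists.
  if (!"term" %in% names(df)) {
    df$term <- df[[hit[1]]]
  }
  df
}

# Toy dictionaries with different identifier column names.
words <- data.frame(word = c("one", "two"), value = c(1, 2))
forms <- data.frame(form = c("colour", "centre"),
                    replacement = c("color", "center"))

unified_words <- unify_term_column(words)
unified_forms <- unify_term_column(forms)
```

With the package installed, the corresponding calls are: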
# Unify by name
surnames <- unify_dictionary("global_surnames")
head(surnames$term)
# Unify a data frame directly
df <- load_dictionary("english_numerics")
unified <- unify_dictionary(df)
head(unified$term)
File Formats
The package supports two file formats:
- .qs2 (default): Smaller file size, faster loading. Uses ZSTD compression (level 22).
- .rda: Standard R data format for compatibility.
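To make the two formats concrete, the sketch below round-trips a toy data frame through `.rda` with base R and, only if the qs2 package happens to be installed, through `.qs2` as well. The toy data and file paths are made up for illustration, and the compression level is left at the qs2 default rather than the level mentioned above. This shows the formats in general, not how the package reads or writes its own files.

```r
# Toy stand-in for a dictionary; the columns are made up for illustration.
toy_dict <- data.frame(
  term = c("apple", "river", "justice"),
  score = c(4.9, 4.7, 1.5),
  stringsAsFactors = FALSE
)

# .rda round trip with base R: save() stores the object under its own name,
# and load() restores it into the chosen environment.
rda_path <- tempfile(fileext = ".rda")
save(toy_dict, file = rda_path)
restored_env <- new.env()
load(rda_path, envir = restored_env)
rda_ok <- identical(restored_env$toy_dict, toy_dict)

# .qs2 round trip, only attempted when the qs2 package is available.
qs2_ok <- NA
if (requireNamespace("qs2", quietly = TRUE)) {
  qs2_path <- tempfile(fileext = ".qs2")
  qs2::qs_save(toy_dict, qs2_path)
  qs2_ok <- isTRUE(all.equal(qs2::qs_read(qs2_path), toy_dict))
}
```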
Related Packages
There are four related packages hosted on GitLab:
- text2map: text analysis functions
- text2map.corpora: 13+ text datasets
- text2map.pretrained: pretrained embeddings and topic models
- text2map.theme: changes ggplot2 aesthetics and loads the viridis color scheme as default
The above packages can be installed using the following:
install.packages("text2map")
library(remotes)
install_gitlab("culturalcartography/text2map.theme")
install_gitlab("culturalcartography/text2map.corpora")
install_gitlab("culturalcartography/text2map.pretrained")
Contributions and Support
We welcome new dictionaries, especially old or rare ones! If you have a dictionary you would like to make easily available to other researchers, send us an email (maintainers [at] textmapping.com) or submit a pull request.
Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.dictionaries/-/issues