English Normalization Rules Dictionary — english_normalization

A comprehensive dictionary of 19,376 word normalization rules for historical and dialectal English text, covering compound resolution, syncope expansion, archaic spelling modernization, irregular verb conjugation, long-S correction, OCR character substitution, HTML entity decoding, malformed HTML cleanup, and dialect normalization. Rules are sourced from the textnorm package, the ECHNAE Project, qdapDictionaries, Wikipedia, and curated entries. Note: the contraction and dialect_contraction categories were moved to english_contractions in v1.2.0.

Format

A data frame with 19,376 rows and 6 variables.

Source

textnorm/ECHNAE (MIT), qdapDictionaries (GPL-2), Wikipedia (CC BY-SA 4.0), Google Books Ngrams (CC BY 3.0), curated entries

Variables

form. The historical, dialectal, or erroneous form
replacement. The modern/normalized equivalent (NA for 3 legacy entries whose expansion is stored in subcategory instead)
category. Type of normalization rule: compound_fuse, compound_split, compound_hyphen, compound, syncope, irregular_verb, archaic_spelling, dialect, legitimate, long_s_correction, ocr_substitution, other_normalization, html_entity, malformed_html
subcategory. Finer-grained classification within the category
frequency. Corpus frequency from Google Books Ngrams where available
source. Data source attribution
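
The variables above suggest a simple lookup-based use of the rules. The following is a minimal sketch in Python, not the package's own implementation: the rows in RULES are made-up examples that only mirror the six documented columns (they are not taken from the dataset), and the helper names build_lookup and normalize are hypothetical. It assumes rules are matched case-insensitively on whole tokens and that rows with a missing replacement are skipped.

```python
import re

# Hypothetical rows mirroring the documented columns; NOT actual dataset entries.
RULES = [
    {"form": "o'er", "replacement": "over", "category": "syncope",
     "subcategory": None, "frequency": None, "source": "illustrative"},
    {"form": "fuch", "replacement": "such", "category": "long_s_correction",
     "subcategory": None, "frequency": None, "source": "illustrative"},
    {"form": "to-day", "replacement": "today", "category": "compound_fuse",
     "subcategory": None, "frequency": None, "source": "illustrative"},
    # Stands in for a legacy entry whose replacement is missing (NA in R).
    {"form": "legacyform", "replacement": None, "category": "other_normalization",
     "subcategory": "expansion stored here", "frequency": None,
     "source": "illustrative"},
]

# A token is a run of letters, optionally joined by apostrophes or hyphens.
TOKEN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def build_lookup(rules, categories=None):
    """Map lowercased form -> replacement, optionally limited to some categories."""
    lookup = {}
    for rule in rules:
        if categories is not None and rule["category"] not in categories:
            continue
        if rule["replacement"] is None:
            # Skip entries without a usable replacement.
            continue
        lookup[rule["form"].lower()] = rule["replacement"]
    return lookup


def normalize(text, lookup):
    """Replace each token found in the lookup; leave all other tokens unchanged.

    The replacement is inserted as stored, so the original casing of a
    matched token is not preserved.
    """
    return TOKEN.sub(lambda m: lookup.get(m.group(0).lower(), m.group(0)), text)


all_rules = build_lookup(RULES)
syncope_only = build_lookup(RULES, categories={"syncope"})
```

With these illustrative rows, normalize("Fly o'er fuch seas to-day", all_rules) applies all three rules, while passing syncope_only changes only "o'er". Filtering on category in this way lets a caller apply, for example, OCR fixes without also modernizing archaic spellings.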