A comprehensive dictionary of 19,376 word-normalization rules for historical and dialectal English text, covering compound resolution, syncope expansion, archaic spelling modernization, irregular verb conjugation, long-S correction, OCR character substitution, HTML entity decoding, malformed HTML cleanup, and dialect normalization. Rules are sourced from the textnorm package, the ECHNAE Project, qdapDictionaries, Wikipedia, and curated entries. Note: the contraction and dialect_contraction categories were moved to english_contractions in v1.2.0.

Format

A data frame with 19,376 rows and 6 variables.

Source

textnorm/ECHNAE (MIT), qdapDictionaries (GPL-2), Wikipedia (CC BY-SA 4.0), Google Books Ngrams (CC BY 3.0), curated entries

Variables

  • form. The historical, dialectal, or erroneous form

  • replacement. The modern/normalized equivalent (NA for 3 legacy entries whose expansion is stored in subcategory instead)

  • category. Type of normalization rule: compound_fuse, compound_split, compound_hyphen, compound, syncope, irregular_verb, archaic_spelling, dialect, legitimate, long_s_correction, ocr_substitution, other_normalization, html_entity, malformed_html

  • subcategory. Finer-grained classification within the category

  • frequency. Corpus frequency from Google Books Ngrams where available

  • source. Data source attribution
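
The columns above can be used as a simple lookup table. The following is a minimal sketch, not the package's own API: it builds a toy data frame with the same six columns (the rows, their category assignments, and the names rules and normalize_tokens are invented for illustration) and replaces tokens that exactly match form, skipping rules whose replacement is NA.

```r
# Toy stand-in for the dataset: same six columns, invented rows.
# Category assignments here are illustrative guesses, not taken from the data.
rules <- data.frame(
  form        = c("shew", "o'er", "to-day", "&amp;", "legacy_form"),
  replacement = c("show", "over", "today", "&", NA),
  category    = c("archaic_spelling", "syncope", "compound_hyphen",
                  "html_entity", "other_normalization"),
  subcategory = NA_character_,
  frequency   = NA_real_,
  source      = "curated",
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact-match token replacement driven by the rules table.
normalize_tokens <- function(tokens, rules, categories = NULL) {
  # Drop rules without a usable replacement (the NA legacy entries).
  usable <- rules[!is.na(rules$replacement), ]
  # Optionally restrict to selected rule categories.
  if (!is.null(categories)) {
    usable <- usable[usable$category %in% categories, ]
  }
  # Named vector: names are the forms, values are the replacements.
  lookup <- setNames(usable$replacement, usable$form)
  hit <- tokens %in% names(lookup)
  # If a form appears more than once, the first matching rule is used.
  tokens[hit] <- unname(lookup[tokens[hit]])
  tokens
}

tokens <- c("they", "shew", "it", "o'er", "to-day", "legacy_form")
normalized  <- normalize_tokens(tokens, rules)
by_category <- normalize_tokens(tokens, rules, categories = "syncope")
```

Matching here is case-sensitive and token-level. Filtering on category first lets a pipeline apply, say, only syncope expansion while leaving archaic spellings untouched.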