A comprehensive dictionary of 19,376 word normalization rules for
historical and dialectal English text, covering compound resolution,
syncope expansion, archaic spelling modernization,
irregular verb conjugation, long-S correction, OCR character
substitution, HTML entity decoding, malformed HTML cleanup, and
dialect normalization. Rules are sourced from the textnorm package, the
ECHNAE Project, qdapDictionaries, Wikipedia, and curated entries.
Note: contraction and dialect_contraction categories were moved to
english_contractions in v1.2.0.
Source
textnorm/ECHNAE (MIT), qdapDictionaries (GPL-2), Wikipedia (CC-BY-SA-4.0), Google Books Ngrams (CC BY 3.0), curated
Variables
form. The historical, dialectal, or erroneous form
replacement. The modern/normalized equivalent (NA for 3 legacy entries where the expansion is in subcategory instead)
category. Type of normalization rule: compound_fuse, compound_split, compound_hyphen, compound, syncope, irregular_verb, archaic_spelling, dialect, legitimate, long_s_correction, ocr_substitution, other_normalization, html_entity, malformed_html
subcategory. Finer-grained classification within the category
frequency. Corpus frequency from Google Books Ngrams where available
source. Data source attribution
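
The variables above suggest a simple form-to-replacement lookup. The following is a minimal Python sketch of that idea, not the package's own interface. The rows in SAMPLE_RULES are invented for illustration and are not taken from the dataset, and the names build_lookup and normalize are hypothetical. It assumes that an NA replacement (None here) falls back to the subcategory column, as described for the 3 legacy entries.

```python
# Invented rows that mirror the documented columns; NOT real dataset entries.
SAMPLE_RULES = [
    {"form": "ev'ry", "replacement": "every",
     "category": "syncope", "subcategory": None},
    {"form": "shew", "replacement": "show",
     "category": "archaic_spelling", "subcategory": None},
    # Stand-in for a legacy entry: replacement is NA, expansion sits in subcategory.
    {"form": "oldform", "replacement": None,
     "category": "other_normalization", "subcategory": "new form"},
]


def build_lookup(rows, categories=None):
    """Map lowercased form -> replacement, optionally restricted to some categories."""
    lookup = {}
    for row in rows:
        if categories is not None and row["category"] not in categories:
            continue
        target = row["replacement"]
        if target is None:
            # Assumed fallback for entries whose replacement is NA.
            target = row["subcategory"]
        if target is not None:
            lookup[row["form"].lower()] = target
    return lookup


def normalize(text, lookup):
    """Replace whitespace-separated tokens that match a rule; keep the rest as-is."""
    return " ".join(lookup.get(tok.lower(), tok) for tok in text.split())


if __name__ == "__main__":
    print(normalize("Shew me ev'ry book", build_lookup(SAMPLE_RULES)))
    # prints: show me every book
```

Passing a set such as {"syncope"} as categories restricts the lookup to one rule type. The sketch matches whole whitespace-delimited tokens only and does not preserve the original capitalization of a replaced token, so real use would need tokenization and case handling suited to the corpus.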
