A comprehensive dictionary of 635 Unicode character normalization rules for English text preprocessing, covering smart quotes, ligatures, dash variants, whitespace normalization, diacritic stripping, punctuation normalization, control/invisible characters, math symbols, fullwidth characters, superscripts/subscripts, fraction characters, Roman numerals, box drawing, block elements, enclosed alphanumerics, and OCR confusables.
Variables
form. The Unicode character to be normalized
replacement. The ASCII or normalized replacement string
category. Type of normalization rule; one of: smart_quotes, ligature, dash, dash_punctuation, whitespace, whitespace_nonbreaking, zero_width, diacritic, punctuation, punctuation_bullet, punctuation_list, punctuation_prime, punctuation_guillemet, punctuation_symbol, punctuation_math, punctuation_math_comparison, punctuation_arrow, punctuation_double_arrow, punctuation_dagger, punctuation_currency, punctuation_inverted, punctuation_interrobang, punctuation_variant, dashed_overline, dashed_underline, control_character, bidi_control, invisible_formatting, math_operator, math_comparison, math_logic, math_set, math_geometry, math_ellipsis, math_greek, math_greek_lower, math_greek_upper, fullwidth_latin, fullwidth_digit, superscript_digit, superscript_symbol, superscript_letter, subscript_digit, subscript_symbol, fraction, roman_numeral, typographic, typographic_symbol, modifier_letter, ocr_confusable, box_drawing, block_element, enclosed_alphanumeric
description. Brief description of the normalization
source. Data source attribution
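
As a rough illustration of how the form, replacement, and category variables could be used together, the sketch below builds a lookup table from a handful of rules and applies it character by character. The sample rows, the list-of-dicts layout, and the helper names build_table and normalize are assumptions made for this example; they are not taken from the dataset, which may store or expose its 635 rules differently. The sketch also assumes every form is a single code point.

```python
# Illustrative rows only (assumed, not copied from the dataset).
# Each row mirrors the documented variables: form, replacement, category.
SAMPLE_RULES = [
    {"form": "\u201c", "replacement": '"', "category": "smart_quotes"},
    {"form": "\u201d", "replacement": '"', "category": "smart_quotes"},
    {"form": "\ufb01", "replacement": "fi", "category": "ligature"},
    {"form": "\u2013", "replacement": "-", "category": "dash"},
    {"form": "\u00a0", "replacement": " ", "category": "whitespace_nonbreaking"},
    {"form": "\u200b", "replacement": "", "category": "zero_width"},
]


def build_table(rules, categories=None):
    """Map each form to its replacement.

    If categories is given, only rules whose category is in that
    collection are kept, so a caller can opt in to a subset of rules.
    """
    return {
        rule["form"]: rule["replacement"]
        for rule in rules
        if categories is None or rule["category"] in categories
    }


def normalize(text, table):
    """Replace every character that has a rule; leave the rest untouched.

    Assumes each form is one code point, so a per-character lookup suffices.
    """
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    table = build_table(SAMPLE_RULES)
    print(normalize("\u201c\ufb01ne\u201d \u2013 ok", table))
```

Filtering by category before building the table is one way to apply only part of the rule set, for example keeping diacritics intact while still normalizing quotes and whitespace.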
