A comprehensive dictionary of 635 Unicode character normalization rules for English text preprocessing, covering smart quotes, ligatures, dash variants, whitespace normalization, diacritic stripping, punctuation normalization, control/invisible characters, math symbols, fullwidth characters, superscripts/subscripts, fraction characters, Roman numerals, box drawing, block elements, enclosed alphanumerics, and OCR confusables.

Format

A data frame with 635 rows and 5 variables.

Source

Unicode Standard / W3C / common text preprocessing

Variables

  • form. The Unicode character to be normalized

  • replacement. The ASCII or normalized replacement string

  • category. Type of normalization rule: smart_quotes, ligature, dash, dash_punctuation, whitespace, whitespace_nonbreaking, zero_width, diacritic, punctuation, punctuation_bullet, punctuation_list, punctuation_prime, punctuation_guillemet, punctuation_symbol, punctuation_math, punctuation_math_comparison, punctuation_arrow, punctuation_double_arrow, punctuation_dagger, punctuation_currency, punctuation_inverted, punctuation_interrobang, punctuation_variant, dashed_overline, dashed_underline, control_character, bidi_control, invisible_formatting, math_operator, math_comparison, math_logic, math_set, math_geometry, math_ellipsis, math_greek, math_greek_lower, math_greek_upper, fullwidth_latin, fullwidth_digit, superscript_digit, superscript_symbol, superscript_letter, subscript_digit, subscript_symbol, fraction, roman_numeral, typographic, typographic_symbol, modifier_letter, ocr_confusable, box_drawing, block_element, enclosed_alphanumeric

  • description. Brief description of the normalization

  • source. Data source attribution
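
The form and replacement columns together act as a lookup table, and category lets a pipeline apply only some rule families. The following is a minimal Python sketch of that idea, not the package's own implementation. It assumes the rules have been exported to a list of records and that every form is a single code point. The seven rows in `RULES` and the helper names `build_table` and `normalize` are hypothetical and chosen only for illustration; the replacement values in the real dataset may differ.

```python
# Hypothetical excerpt of the rule table. The real dataset has 635 rows;
# these rows and their replacement values are illustrative assumptions.
RULES = [
    {"form": "\u201c", "replacement": '"', "category": "smart_quotes"},
    {"form": "\u201d", "replacement": '"', "category": "smart_quotes"},
    {"form": "\u2019", "replacement": "'", "category": "smart_quotes"},
    {"form": "\ufb01", "replacement": "fi", "category": "ligature"},
    {"form": "\u2014", "replacement": "-", "category": "dash"},
    {"form": "\u00a0", "replacement": " ", "category": "whitespace_nonbreaking"},
    {"form": "\u200b", "replacement": "", "category": "zero_width"},
]


def build_table(rules, categories=None):
    """Build a str.translate table from the rules.

    Assumes each form is exactly one code point. If categories is given,
    only rules whose category is in that collection are included.
    """
    mapping = {
        rule["form"]: rule["replacement"]
        for rule in rules
        if categories is None or rule["category"] in categories
    }
    return str.maketrans(mapping)


def normalize(text, rules=RULES, categories=None):
    """Replace every character that has a matching rule; leave the rest as-is."""
    return text.translate(build_table(rules, categories))


if __name__ == "__main__":
    sample = "\u201cof\ufb01ce\u201d\u00a0\u2014\u200bok"
    print(normalize(sample))                           # all rules in the excerpt
    print(normalize(sample, categories={"ligature"}))  # ligature rules only
```

A translation table handles all single-character rules in one pass over the text, so the cost does not grow with the number of rules. If some forms turn out to span more than one code point (for example a base letter followed by a combining mark), this approach would need a different matching strategy, such as sequential string replacement.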