Getting Characters Standardized for Digital Texts
by Deborah Anderson
Digital humanists who work with text can have trouble displaying, storing, or sending certain letters and symbols electronically because the characters are not in Unicode, the international character encoding standard. When a character is not in Unicode, the text may appear with a square box in place of the missing character (1, below), a question mark (2), or a nonsense character (3). If an entire script is not in Unicode, the text may be completely garbled, an effect called mojibake (from the Japanese ‘moji’, character + ‘bake’, transform) (4).
A character that is missing from the Unicode Standard also makes it harder to copy and paste, search, run OCR on, and store such text over the long term. To help digital humanists overcome this problem of unencoded characters, the Script Encoding Initiative (SEI) was started in the Department of Linguistics at UC Berkeley. The goal of SEI is to work with users to get eligible characters and scripts into the Unicode Standard.
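As a rough illustration of how such problem characters might be spotted in an existing digital text, here is a minimal Python sketch. It is not SEI's method, and the function name `flag_suspect_characters` is hypothetical. It assumes that unencoded characters tend to leave one of three traces: a private-use code point (a common workaround), a code point that is unassigned in the Unicode version bundled with the Python interpreter, or the replacement character U+FFFD left behind by a failed conversion.

```python
import unicodedata

# Hypothetical helper: the three heuristics below are assumptions for
# illustration, not an official test for "not in Unicode".
def flag_suspect_characters(text):
    """Return (index, code point, reason) tuples for suspect characters."""
    reasons = {
        "Cn": "unassigned in this Unicode version",
        "Co": "private-use code point",
    }
    flagged = []
    for index, char in enumerate(text):
        # U+FFFD is an assigned symbol, so it needs its own check.
        if char == "\ufffd":
            flagged.append((index, "U+FFFD", "replacement character"))
            continue
        # General category tells us whether the code point is assigned.
        category = unicodedata.category(char)
        if category in reasons:
            flagged.append((index, f"U+{ord(char):04X}", reasons[category]))
    return flagged


if __name__ == "__main__":
    sample = "text with a private-use glyph \ue000 and a lost character \ufffd"
    for index, code_point, reason in flag_suspect_characters(sample):
        print(index, code_point, reason)
```

The results depend on the interpreter's Unicode data (see `unicodedata.unidata_version`), so a character encoded in a recent version of the standard may still be reported as unassigned by an older Python.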