Getting Characters Standardized for Digital Texts
by Deborah Anderson
Digital humanists who work with text can have trouble getting certain letters and symbols to display, be stored, or be sent electronically because the characters are not in Unicode, the international character encoding standard. When a character is not in Unicode, the text may appear with a square box in place of the missing character (1, below), a question mark (2), or a nonsense character (3). If an entire script is not in Unicode, the text may be completely garbled, an effect called mojibake (from the Japanese ‘moji’, character + ‘bake’, transform) (4).
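The symptoms described above can be reproduced in a few lines of Python. This is only a minimal sketch: the sample strings and variable names are illustrative assumptions, and the garbling here comes from an encoding mismatch, which is one common way such output arises, not necessarily the cause in any particular text.

```python
# Illustrative sample: "naive" with a diaeresis, plus RUNIC LETTER FEHU (U+16A0).
text = "na\u00efve \u16a0"

# Question marks: the target character set (ASCII here) has no slot for
# some characters, so each one is replaced with "?".
as_ascii = text.encode("ascii", errors="replace").decode("ascii")

# Nonsense characters: bytes written as UTF-8 are read back as Latin-1.
garbled = "\u00e9".encode("utf-8").decode("latin-1")

# Replacement character: bytes that are not valid UTF-8 decode to U+FFFD,
# which many fonts draw as a box or a question mark in a diamond.
replaced = b"caf\xe9".decode("utf-8", errors="replace")

print(as_ascii)
print(garbled)
print(replaced)
```

In each case the original character is lost or distorted at the point where the text is stored or exchanged, which is why it cannot later be searched for or copied reliably.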
Text containing characters that are not in the Unicode Standard also runs into problems with copying and pasting, searching, OCR, and long-term storage. To help digital humanists overcome this problem of unencoded characters, the Script Encoding Initiative (SEI) project was started in the Department of Linguistics at UC Berkeley. The goal of SEI is to work with users to get eligible characters and scripts into the Unicode Standard.
Since its inception in 2002, the project has assisted in getting more than 70 historical and modern scripts into Unicode, including Egyptian hieroglyphs (Gardiner set), Linear B, Javanese, and Balinese. However, over 100 scripts are known to be missing. One historic script not yet in Unicode is the north-east Iberian script, which the LITTERA group at the University of Barcelona is working on.
Other scripts not yet in Unicode include:
- Mayan hieroglyphs (photo by James Gaither)
- Garay, a modern script used for Wolof in Senegal
Some characters are also known to be missing from scripts that are already in Unicode. These include several Runic characters, Medieval and Late Latin characters, alchemical symbols (beyond those used by Newton), Arabic math characters, and Ptolemaic Egyptian hieroglyphs.
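A rough way to see whether a particular code point is encoded is to ask a Unicode character database. The sketch below uses Python's standard `unicodedata` module; the helper names `is_assigned` and `is_private_use` are hypothetical, and the answer reflects only the Unicode version bundled with the Python build in use, not the latest version of the standard.

```python
import unicodedata

def is_assigned(ch: str) -> bool:
    """True if the bundled Unicode database assigns this code point.

    General category "Cn" means "unassigned" (it also covers noncharacters).
    """
    return unicodedata.category(ch) != "Cn"

def is_private_use(ch: str) -> bool:
    """True for private-use code points (category "Co").

    These are sometimes used as a stopgap for unencoded characters,
    but their meaning is not standardized.
    """
    return unicodedata.category(ch) == "Co"

# Unicode version this Python build knows about.
print("Unicode data version:", unicodedata.unidata_version)

for ch in ["A", "\u16a0", "\u0378", "\ue000"]:
    name = unicodedata.name(ch, "<no name>")
    print(f"U+{ord(ch):04X} assigned={is_assigned(ch)} "
          f"private_use={is_private_use(ch)} name={name}")
```

A character that exists only as a private-use code point or an image is, for the purposes of interchange, still unencoded.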
Getting characters into Unicode is a multi-step and lengthy process:
- A proposal for the characters must be written. (SEI can assist digital humanists who want to write a proposal themselves, or can find a Unicode proposal author to write it.)
- The proposal must be approved by two standards committees, the Unicode Technical Committee and ISO/IEC JTC 1/SC 2/WG 2. (SEI can present proposals to the committees and help track them through the entire encoding process.)
- Once the characters have been approved by both committees and are published in the Unicode Standard, standardized fonts need to be created and rendering engines updated to support the characters.
Work on unencoded characters (and scripts) is challenging, for several reasons:
- Because remaining unencoded scripts (and characters) are not well known, it is difficult to find experts and/or collect enough material for a complete proposal.
- The technical details of Unicode can be hard to grasp, especially for newcomers.
- The approval process is lengthy (2+ years), and requires patience and long-term commitment.
Despite these challenges, the effort to get needed characters into Unicode is worthwhile. Ultimately, it will provide access to historical texts and texts in lesser-known languages, thereby building up the global digital repository of our literary, cultural, and historical documents. For users of modern minority languages, it opens the door to participating in the digital world in their preferred script.
Deborah Anderson is a researcher at the Department of Linguistics of UC Berkeley. She participated in the recent Berkeley Digital Humanities Faire.