The Starling encoding

On the Moscow sites starling.rinet.ru and newstar.rinet.ru, and also elsewhere, one finds linguistic databases. This page documents my current understanding of the character set encoding used in these databases.

Single byte encoding

The code is a single-byte code, 8 bits per symbol. Text in the Latin alphabet is in ASCII, Cyrillic text in CP 866. This covers the ranges 0x20-0x7e and 0x80-0xaf, 0xe0-0xef.

  0123 4567 89ab cdef
20   ! " # $ % & ' ( ) * + , - . /
30 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
40 @ A B C D E F G H I J K L M N O
50 P Q R S T U V W X Y Z [ \ ]   ̂ _
60 ` a b c d e f g h i j k l m n o
70 p q r s t u v w x y z { | }   ̃

  0123 4567 89ab cdef
80 А Б В Г Д Е Ж З И Й К Л М Н О П
90 Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
a0 а б в г д е ж з и й к л м н о п
e0 р с т у ф х ц ч ш щ ъ ы ь э ю я
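As a rough illustration of the two tables above, the following sketch decodes only the ASCII and CP 866 Cyrillic ranges and marks every other byte as unknown. The function name decode_basic and the use of U+FFFD as a placeholder are choices made for this example, not part of Starling.

```python
def decode_basic(data: bytes) -> str:
    """Decode only the ASCII and CP 866 Cyrillic ranges of a byte string."""
    out = []
    for b in data:
        if 0x20 <= b <= 0x7E:
            # Printable ASCII maps to itself.
            out.append(chr(b))
        elif 0x80 <= b <= 0xAF or 0xE0 <= b <= 0xEF:
            # Rows 80, 90, a0 and e0 follow CP 866.
            out.append(bytes([b]).decode("cp866"))
        else:
            # Control bytes and the other symbol ranges are not handled here.
            out.append("\ufffd")
    return "".join(out)


print(decode_basic(b"abc " + bytes([0x80, 0xA0, 0xE0])))
```

Bytes in the remaining ranges need the tables that follow.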

Various other symbols are encoded in the ranges 0xb0-0xdf, 0xf0-0xff.

  0123 4567 89ab cdef
b0 ā   ́ ä   ̨ ǟ č č̣ δ ē   ̇   ̈ ɛ ʡ   ̯
c0 ç ɣ ʁ ħ   ̄ ī ɨ ɨ̄   ̊   ̥ ʎ ƛ - ƛ̣ ɫ
d0 ŋ ō ö ȫ ɔ ɔ̄ ß ~   ̀   ̣ š   ̆
f0 ϑ ū ü ǖ ə ə̄   ̌ ʷ ɦ χ ʒ ǯ ž ʔ ʕ ʌ

The entries shown without a base letter (the yellow fields in the table) are combining accents that follow the symbol to be accented.
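Since Unicode combining marks also follow their base character, such accents can be mapped one to one and then composed. The sketch below assumes the code points U+0301, U+032F, U+0304 and U+0300 for the bytes 0xb1, 0xbf, 0xc4 and 0xdb, as read from the table; the names COMBINING and decode_with_accents are made up for this example, and only ASCII base letters are handled.

```python
import unicodedata

# Partial map, read from the table above (an assumption of this sketch).
COMBINING = {
    0xB1: "\u0301",  # combining acute
    0xBF: "\u032f",  # combining inverted breve below
    0xC4: "\u0304",  # combining macron
    0xDB: "\u0300",  # combining grave
}


def decode_with_accents(data: bytes) -> str:
    out = []
    for b in data:
        if b in COMBINING:
            # The accent follows its base symbol, as in Unicode.
            out.append(COMBINING[b])
        elif b < 0x80:
            out.append(chr(b))
        else:
            out.append("\ufffd")  # not covered by this partial map
    # NFC composes base + accent into a single code point where one exists.
    return unicodedata.normalize("NFC", "".join(out))


print(decode_with_accents(b"a\xb1 e\xc4"))
```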

There remains the low ASCII range 0-0x1f. The value 0x0a is used for newline. Sometimes the DOS convention is used, with 0x0d 0x0a for a newline. Sometimes 0x15 is used to start a new paragraph. For 0x1d and 0x01, see below.
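A small sketch of how these control bytes might be handled before any character decoding: it splits raw data into paragraphs at 0x15 and into lines at 0x0a, tolerating a 0x0d on either side of the line feed. The function name split_paragraphs is hypothetical.

```python
def split_paragraphs(data: bytes) -> list:
    """Return a list of paragraphs, each a list of raw lines."""
    # Drop a carriage return next to a line feed, whichever side it is on.
    data = data.replace(b"\r\n", b"\n").replace(b"\n\r", b"\n")
    # 0x15 starts a new paragraph; 0x0a ends a line.
    return [para.split(b"\n") for para in data.split(b"\x15")]


print(split_paragraphs(b"one\r\ntwo\x15three"))
```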

Escaped single bytes

The symbol 0x1d (^]) is an escape and indicates that the following symbol has a special meaning. The table below gives the symbols S and the corresponding meaning of the combination 0x1d S.

S abcd egh ijos tAD T ^
^]-S æƀɕð œǥ ıȷøʃ þӔđ ŧ   ̑

hex   a5  a7  ab  ad  ae  b1  bf  db  ed  ef  f8  fa
S     е   з   л   н   о    ́    ̯    ̀   э   я   ɦ   ʒ
^]-S  ѧ   ѕ   љ   њ   ѫ    ̋    ̮    ̏   є   ž   ƕ   ђ

(Here in the Cyrillic range also the hex value is given, since there is often no visual distinction between e.g. ASCII e and Cyrillic е.)
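The escape mechanism can be sketched as a lookup keyed by the byte that follows 0x1d. The table below is deliberately partial and contains only a few pairs taken from the tables above; the names ESCAPED and decode_escapes, the base callback, and the U+FFFD placeholder are assumptions of this example.

```python
ESC = 0x1D

# Partial table: byte after the escape -> resulting character.
ESCAPED = {
    ord("a"): "\u00e6",  # ae ligature
    ord("d"): "\u00f0",  # eth
    ord("o"): "\u00f8",  # o with stroke
    ord("t"): "\u00fe",  # thorn
    0xA5: "\u0467",      # little yus
    0xA7: "\u0455",      # dze
    0xAB: "\u0459",      # lje
    0xAD: "\u045a",      # nje
    0xAE: "\u046b",      # big yus
    0xED: "\u0454",      # Ukrainian ie
}


def decode_escapes(data: bytes, base=None) -> str:
    """Resolve 0x1d escapes; other bytes go through the base decoder."""
    if base is None:
        # Stand-in base decoder: ASCII only.
        base = lambda b: chr(b) if b < 0x80 else "\ufffd"
    out = []
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data):
            out.append(ESCAPED.get(data[i + 1], "\ufffd"))
            i += 2
        else:
            # Includes a lone escape at the very end of the data.
            out.append(base(data[i]))
            i += 1
    return "".join(out)


print(decode_escapes(b"\x1da \x1d\xa5"))
```

Keying the table by byte value rather than by glyph avoids the confusion between ASCII and Cyrillic look-alikes.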

Double byte encoding

The symbol 0x01 (^A) is an escape and indicates that the following text uses double byte encoding. The double byte encoding used has the property that the first byte of each pair is outside the ASCII range. Double byte encoding mode ends when a byte in the ASCII range 0-0x7f follows in the position of a first byte. (If necessary, a 0x7f (DEL) is inserted.)

(The details vary. Single-byte combining accents like ~, 0xbf, 0xc4, 0xdf sometimes occur in 2-byte coding. They may be preceded by DEL = 0x7f or not. They may be followed by ^A = 0x01 or not. Sometimes 2-byte coding just starts without being announced by ^A. This means that the coding near 0x83 or 0x85 = Cyrillic Г or Е may be ambiguous.)
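The mode switching can be sketched as a splitter that cuts the data into single-byte and double-byte runs. It assumes that the ASCII test applies to the first byte of each pair (Big5 second bytes such as 0x67 lie in the ASCII range), that a DEL directly after a double-byte run is only a terminator, and that double-byte text is always announced by ^A, which the note above says is not always true. The name split_modes is hypothetical.

```python
def split_modes(data: bytes) -> list:
    """Split data into ("single", bytes) and ("double", bytes) runs."""
    runs = []
    i, n = 0, len(data)
    while i < n:
        if data[i] == 0x01:
            # ^A: consume byte pairs while the first byte is non-ASCII.
            i += 1
            start = i
            while i + 1 < n and data[i] >= 0x80:
                i += 2
            if i > start:
                runs.append(("double", data[start:i]))
            # A DEL inserted only to end the mode is dropped.
            if i < n and data[i] == 0x7F:
                i += 1
        else:
            # Single-byte text up to the next ^A.
            start = i
            while i < n and data[i] != 0x01:
                i += 1
            runs.append(("single", data[start:i]))
    return runs


print(split_modes(b"ab\x01\xa4\x67\x83\xc2 cd"))
```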

The code used is the Chinese Big5 code. It is a big-endian 16-bit code, so that the bytes 0xa4 0x67 spell the 16-bit value 0xa467, which stands for the Chinese character 土. Similarly, e.g. 0xa75a, 0xabd2, 0xb3c2, 0xc5d6 stand for 卵, 帝, 麻, 纖. The Big5 code defines (Chinese) characters for codes with first byte in the range 0xa1-0xf9. The rest of the range 0x81-0xfe is user-defined. Starling uses the user-defined area to encode e.g. Greek and OCS. (So, the Greek code points of Big5 are not used.)
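For the standard part of the code, an off-the-shelf Big5 decoder should suffice. The sketch below uses Python's built-in "big5" codec as a stand-in and separates out the user-defined first bytes, which that codec does not know; the function names are invented for this example.

```python
def big5_value(lead: int, trail: int) -> int:
    """Combine two bytes into the big-endian 16-bit code."""
    return (lead << 8) | trail


def is_user_defined(lead: int) -> bool:
    """First bytes in 0x81-0xfe but outside the standard range 0xa1-0xf9."""
    return 0x81 <= lead <= 0xFE and not (0xA1 <= lead <= 0xF9)


def decode_big5_pair(lead: int, trail: int) -> str:
    if is_user_defined(lead):
        # Greek, OCS etc. need Starling's own tables.
        raise ValueError("user-defined code 0x%04x" % big5_value(lead, trail))
    return bytes([lead, trail]).decode("big5")


print(hex(big5_value(0xA4, 0x67)), decode_big5_pair(0xA4, 0x67))
```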

Greek

Greek text is coded with byte pairs in the user-defined area of Big5. The first byte is either 0x83 or 0x85. The value 0x85 only occurs in the combination 0x85 0xaf, which codes ϝ (digamma). In all other cases the first byte is 0x83, and the second is given in the table below.

  0123 4567 89ab cdef
90 α̈́ ἅ ἄ ά             α̈̀ ἃ ἂ ἇ ·  
a0 ἁ ἀ   α̈ Α Β Χ Δ Ε Φ Γ Η Ι ῳ Κ Λ
b0 Μ Ν Ο Π Θ Ρ Σ Τ Υ ῃ Ω Ξ Ψ Ζ    
c0 ἆ ὰ α β χ δ ε φ γ η ι ς κ λ μ ν
d0 ο π θ ρ σ τ υ ᾳ ω ξ ψ ζ ᾶ      

(Here the fourteen cases of accented α represent just the combining accent, without the alpha.)
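A partial decoder for the Greek pairs might look as follows. It covers only the unaccented lowercase letters from rows c0 and d0 and the digamma, and it assumes that the cell positions have been read correctly from the table (α at 0xc2, ο at 0xd0). The names GREEK and decode_greek are hypothetical.

```python
# Second byte -> letter, built from two runs of consecutive table cells.
GREEK_ROWS = {0xC2: "αβχδεφγηιςκλμν", 0xD0: "οπθρστυ"}
GREEK = {
    start + i: ch
    for start, row in GREEK_ROWS.items()
    for i, ch in enumerate(row)
}


def decode_greek(data: bytes) -> str:
    """Decode a run of byte pairs; unknown pairs become U+FFFD."""
    out = []
    for i in range(0, len(data) - 1, 2):
        lead, trail = data[i], data[i + 1]
        if (lead, trail) == (0x85, 0xAF):
            out.append("\u03dd")  # digamma
        elif lead == 0x83 and trail in GREEK:
            out.append(GREEK[trail])
        else:
            out.append("\ufffd")
    return "".join(out)


print(decode_greek(bytes.fromhex("83cd83d083c883d083cb")))
```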

OCS

Text in OCS is coded with byte pairs in the user-defined area of Big5. The first byte is either 0x87 or 0x88. The value 0x88 only occurs in the combinations 0x88 0x81 for ѵ and 0x88 0x83 for ѧ. In all other cases the first byte is 0x87, and the second is given in the table below.

  0123 4567 89ab cdef
80       г̑ б   ю                  
90       ж     Ю       И С     А П
a0 Р   О Л Д       З   К   Е Г М  
b0   Н   х         ъ ф и с в у а п
c0 р ш о л д ь т   з   к ы е г м ц
d0 ч н я Х   ѣ       ѧ          
e0               ѫ Е e            
f0     ѡ Ѭ ѭ Ѥ ѥ                  
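The OCS pairs can be handled in the same way, here with a table keyed by the full byte pair so that the 0x88 cases fit in naturally. Only part of rows b0 and c0 is included, empty cells are written as spaces, and the cell positions are an assumption read from the table above; OCS and decode_ocs are names chosen for this example.

```python
OCS = {
    (0x88, 0x81): "\u0475",  # izhitsa
    (0x88, 0x83): "\u0467",  # little yus
}
# Cells 0xb8-0xbf and 0xc0-0xcf of the table; a space marks an empty cell.
for start, row in ((0xB8, "ъфисвуап"), (0xC0, "ршолдьт з кыегмц")):
    for i, ch in enumerate(row):
        if ch != " ":
            OCS[(0x87, start + i)] = ch


def decode_ocs(data: bytes) -> str:
    """Decode a run of byte pairs; unknown pairs become U+FFFD."""
    pairs = zip(data[0::2], data[1::2])
    return "".join(OCS.get(pair, "\ufffd") for pair in pairs)


print(decode_ocs(bytes.fromhex("87bb87c387c287bc87c2")))
```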

Send corrections and additions to aeb@cwi.nl