5.1. Encoding About

5.1.1. Problem

>>> key = 'a'
>>>
>>> bin('a')
Traceback (most recent call last):
TypeError: 'str' object cannot be interpreted as an integer

5.1.2. Solution

>>> key = 'a'
>>>
>>> ord(key)
97
>>>
>>> bin(97)
'0b1100001'

5.1.3. Encoding

  • ASCII

  • ISO-8859-2

  • CP1250

  • UTF-8

5.1.4. ASCII vs. ASCII-Extended Encoding

  • ASCII - American Standard Code for Information Interchange

  • ASCII - 7-bit encoding - from 0 to 127 (0b0000000 to 0b1111111)

  • ASCII-Extended - 8-bit encoding - 0 to 255 (from 0b00000000 to 0b11111111)

../../_images/encoding-ascii%2Bextended.png

5.1.5. ISO Encoding

  • ISO - International Organization for Standardization

  • ISO-8859 - character encoding standard

  • ISO-8859-1 - Western European (Latin-1)

  • ISO-8859-2 - Central European (Latin-2)

  • ISO-8859-3 - South European (Latin-3)

  • ISO-8859-4 - North European (Latin-4)

  • ISO-8859-5 - Latin/Cyrillic

  • ISO-8859-6 - Latin/Arabic

  • ISO-8859-7 - Latin/Greek

  • ISO-8859-8 - Latin/Hebrew

  • ISO-8859-9 - Turkish (Latin-5)

  • ISO-8859-10 - Nordic (Latin-6)

  • ISO-8859-11 - Latin/Thai

  • ISO-8859-12 - Latin/Devanagari (abandoned)

  • ISO-8859-13 - Baltic Rim (Latin-7)

  • ISO-8859-14 - Celtic (Latin-8)

  • ISO-8859-15 - Latin-9 - A revision of 8859-1 with removed little-used symbols, replacing them with the euro sign € and the letters Š, š, Ž, ž, Œ, œ, and Ÿ)

  • ISO-8859-16 - South-Eastern European (Latin-10) - Intended for Albanian, Croatian, Hungarian, Italian, Polish, Romanian and Slovene, but also Finnish, French, German and Irish Gaelic

>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt', encoding='iso-8859-2') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: ISO-8859 text
$ cat /tmp/myfile.txt
cze��

5.1.6. Windows Encoding

  • Windows is registered trademark of Microsoft

  • windows-1250 is called cp1250

  • CP - Code Page

  • cp1250 - Central European languages

  • cp1251 - Cyrillic script

  • cp1252 - Western European languages

  • cp1253 - Greek script

  • cp1254 - Turkish script

  • cp1255 - Hebrew script

  • cp1256 - Arabic script

>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt', encoding='cp1250') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: Non-ISO extended-ASCII text
$ cat /tmp/myfile.txt
cze��

5.1.7. Unicode Encoding

  • Unicode - character encoding standard

  • UTF-8 - Unicode Transformation Format - ASCII compatible

  • UTF-16 - Unicode Transformation Format - uF600

  • UTF-32 - Unicode Transformation Format - U0001F600

  • Little Endian (LE) vs. Big Endian (BE)

>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt', encoding='utf-8') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: Unicode text, UTF-8 text
$ cat /tmp/myfile.txt
cześć

5.1.8. Default

  • UTF-8

>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: Unicode text, UTF-8 text
$ cat /tmp/myfile.txt
cześć