6.2. Locale Encoding
ASCII
ASCII Extended
ISO-8859
Windows Encoding
Unicode Encoding
6.2.1. ASCII
ASCII - American Standard Code for Information Interchange
7-bit encoding
From 0b0000000 to 0b1111111 (0 to 127)
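A minimal sketch of the 7-bit range, using only the built-in `ord`, `chr` and `str.encode`. The sample strings are illustrative and not part of any standard.

```python
# Every ASCII character has a code point between 0 and 127.
codes = [ord(char) for char in 'Hello']
highest = chr(0b1111111)      # code point 127, the DEL control character

# Pure-ASCII text encodes to exactly one byte per character.
ascii_bytes = 'Hello'.encode('ascii')

# Characters outside the 7-bit range cannot be encoded as ASCII.
try:
    'cześć'.encode('ascii')
except UnicodeEncodeError as error:
    failed_at = error.start   # index of the first offending character

print(codes, ascii_bytes, failed_at)
```

Here `failed_at` points at `ś`, the first character of the word that has no ASCII code.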
6.2.2. ASCII Extended
8-bit encoding
From 0b00000000 to 0b11111111 (0 to 255)
Extended ASCII is a repertoire of character encodings that include (most of) the original 96-character ASCII set, plus up to 128 additional characters. There is no formal definition of "extended ASCII", and even use of the term is sometimes criticized, because it can be mistakenly interpreted to mean that the American National Standards Institute (ANSI) had updated its ANSI X3.4-1986 standard to include more characters, or that the term identifies a single unambiguous encoding. Neither is the case. [1]
There are several different variations of the 8-bit ASCII table. A common one is Windows-1252 (CP-1252), which is a superset of ISO 8859-1 (also called ISO Latin-1) in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 128 to 159 range [3].
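The difference in the 128 to 159 range can be checked directly. This is a small sketch that decodes the same two bytes with both codecs; the byte values are chosen only for illustration.

```python
raw = bytes([0x80, 0xE9])

# Windows-1252 assigns a printable character to 0x80 (the euro sign).
as_cp1252 = raw.decode('cp1252')

# ISO-8859-1 treats 0x80 as a C1 control character.
as_latin1 = raw.decode('iso-8859-1')

# Bytes below 128 mean the same thing in both encodings.
ascii_part_matches = b'abc'.decode('cp1252') == b'abc'.decode('iso-8859-1')

print(repr(as_cp1252), repr(as_latin1), ascii_part_matches)
```

Both results agree on `é` (0xE9) and disagree only on the byte in the 128 to 159 range.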
6.2.3. ISO-8859
ISO - International Organization for Standardization
ISO-8859 - character encoding standard
ISO-8859-1 - Western European (Latin-1)
ISO-8859-2 - Central European (Latin-2)
ISO-8859-3 - South European (Latin-3)
ISO-8859-4 - North European (Latin-4)
ISO-8859-5 - Latin/Cyrillic
ISO-8859-6 - Latin/Arabic
ISO-8859-7 - Latin/Greek
ISO-8859-8 - Latin/Hebrew
ISO-8859-9 - Turkish (Latin-5)
ISO-8859-10 - Nordic (Latin-6)
ISO-8859-11 - Latin/Thai
ISO-8859-12 - Latin/Devanagari (abandoned)
ISO-8859-13 - Baltic Rim (Latin-7)
ISO-8859-14 - Celtic (Latin-8)
ISO-8859-15 - Latin-9 - A revision of 8859-1 that removes some little-used symbols, replacing them with the euro sign € and the letters Š, š, Ž, ž, Œ, œ, and Ÿ
ISO-8859-16 - South-Eastern European (Latin-10) - Intended for Albanian, Croatian, Hungarian, Italian, Polish, Romanian and Slovene, but also Finnish, French, German and Irish Gaelic
>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt', encoding='iso-8859-2') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: ISO-8859 text
$ cat /tmp/myfile.txt
cze��
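The unreadable characters printed by `cat` can be reproduced in Python. The sketch below assumes the terminal interprets the file as UTF-8, which is common but not guaranteed, and it works on the encoded bytes directly instead of the file.

```python
text = 'cześć'
data = text.encode('iso-8859-2')

# Decoding with the wrong part of ISO-8859 succeeds, but yields wrong letters.
wrong_part = data.decode('iso-8859-1')

# The bytes are not valid UTF-8: 0xb6 cannot start a sequence and 0xe6 is
# an unfinished multi-byte sequence. errors='replace' substitutes U+FFFD.
replaced = data.decode('utf-8', errors='replace')

# Only the original encoding restores the text.
restored = data.decode('iso-8859-2')

print(data, wrong_part, replaced, restored)
```

An 8-bit file carries no information about which ISO-8859 part was used, so the reader has to know the encoding in advance.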
6.2.4. Windows Encoding
Windows is a registered trademark of Microsoft
CP - Code Page
windows-1250 is also called cp1250
cp42 – Windows Symbol
cp874 – Windows Thai
cp1250 – Windows Central Europe
cp1251 – Windows Cyrillic
cp1252 – Windows Western
cp1253 – Windows Greek
cp1254 – Windows Turkish
cp1255 – Windows Hebrew
cp1256 – Windows Arabic
cp1257 – Windows Baltic
cp1258 – Windows Vietnamese
These code pages are used by Microsoft in its own Windows operating system. Microsoft defined a number of code pages known as the ANSI code pages (as the first one, 1252 was based on an apocryphal ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes from ISO 6429 mentioned by ISO 8859-1. Some of the others are based in part on other parts of ISO 8859 but often rearranged to make them closer to 1252 [2].
Microsoft recommends new applications use UTF-8 or UCS-2/UTF-16 instead of these code pages [2].
>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt', encoding='cp1250') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: Non-ISO extended-ASCII text
$ cat /tmp/myfile.txt
cze��
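Although cp1250 and ISO-8859-2 both target Central European languages, they are not interchangeable. A short sketch comparing the bytes each one produces for the same word:

```python
text = 'cześć'

as_cp1250 = text.encode('cp1250')
as_iso = text.encode('iso-8859-2')

# ś is 0x9c in cp1250 but 0xb6 in ISO-8859-2; ć is 0xe6 in both.
same_bytes = as_cp1250 == as_iso

# Reading cp1250 bytes as ISO-8859-2 turns ś into a C1 control character.
mixed_up = as_cp1250.decode('iso-8859-2')

print(as_cp1250, as_iso, same_bytes, repr(mixed_up))
```

This is why `file` reports the two outputs differently: the cp1250 file contains a byte in the 0x80-0x9F range, which ISO-8859 reserves for control codes.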
6.2.5. Unicode Encoding
Unicode - character encoding standard
UTF-8 - Unicode Transformation Format - ASCII compatible
UTF-16 - Unicode Transformation Format - \uF600
UTF-32 - Unicode Transformation Format - \U0001F600
>>> text = 'cześć'
>>> text.encode()
b'cze\xc5\x9b\xc4\x87'
>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt', encoding='utf-8') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: Unicode text, UTF-8 text
$ cat /tmp/myfile.txt
cześć
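Unicode assigns each character a code point, and the UTF encodings are different ways of storing those code points as bytes. The sketch below lists the code points of the sample word and the size of each encoded form. The `-le` codec variants are used so that no byte order mark is added to the count.

```python
text = 'cześć'

# Code points are independent of any byte encoding.
code_points = [f'U+{ord(char):04X}' for char in text]

# The same five code points take a different number of bytes in each encoding.
sizes = {
    name: len(text.encode(name))
    for name in ('utf-8', 'utf-16-le', 'utf-32-le')
}

print(code_points)
print(sizes)
```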
6.2.6. UTF-32
Fixed-length encoding
4 bytes per character
Supports all Unicode characters
>>> text = 'cześć'
>>> text.encode('utf-32')
b'\xff\xfe\x00\x00c\x00\x00\x00z\x00\x00\x00e\x00\x00\x00[\x01\x00\x00\x07\x01\x00\x00'
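The first four bytes in the output above are a byte order mark (BOM), not a character. A brief sketch of how the BOM and byte order affect the result; the plain `utf-32` codec uses the machine's native byte order, so only its length and prefix are inspected here.

```python
import codecs

text = 'cześć'

with_bom = text.encode('utf-32')       # BOM followed by native byte order
little = text.encode('utf-32-le')      # explicit little-endian, no BOM
big = text.encode('utf-32-be')         # explicit big-endian, no BOM

# The BOM tells a decoder which byte order follows.
has_bom = with_bom[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

print(len(with_bom), len(little), len(big), has_bom)
```

Five characters take 20 bytes, plus 4 for the BOM when the byte order is not named explicitly.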
6.2.7. UTF-16
Variable-length encoding
2 or 4 bytes per character (4 bytes are stored as a surrogate pair)
Supports all Unicode characters
>>> text = 'cześć'
>>> text.encode('utf-16')
b'\xff\xfec\x00z\x00e\x00[\x01\x07\x01'
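Two bytes are enough only for characters in the Basic Multilingual Plane. Characters above U+FFFF are stored as two 16-bit code units, a surrogate pair. A minimal sketch, using an emoji as an illustrative character outside that plane:

```python
bmp = 'ś'.encode('utf-16-le')              # one 16-bit code unit
emoji = '\U0001F600'.encode('utf-16-le')   # two 16-bit code units

# Split the emoji's bytes into its two code units (little-endian).
high, low = (int.from_bytes(emoji[i:i + 2], 'little') for i in (0, 2))

print(len(bmp), len(emoji), hex(high), hex(low))
```

The two values fall in the reserved surrogate ranges 0xD800-0xDBFF and 0xDC00-0xDFFF, which is how a decoder recognizes the pair.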
6.2.8. UTF-8
Variable-length encoding
1 to 4 bytes per character
Supports all Unicode characters
Most common encoding for web pages
Compatible with ASCII
>>> text = 'cześć'
>>> text.encode('utf-8')
b'cze\xc5\x9b\xc4\x87'
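The variable length and the ASCII compatibility can both be observed with a few sample characters. This is a sketch; the characters are picked only because they need one, two, three and four bytes respectively.

```python
samples = ['a', 'ś', '€', '\U0001F600']

# Number of bytes UTF-8 needs for each character.
lengths = [len(char.encode('utf-8')) for char in samples]

# For pure-ASCII text, UTF-8 and ASCII produce identical bytes.
ascii_compatible = 'hello'.encode('utf-8') == 'hello'.encode('ascii')

print(lengths, ascii_compatible)
```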
6.2.9. Default
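UTF-8 is the default for str.encode() and bytes.decode()
open() without an encoding argument uses the locale's preferred encoding: UTF-8 on most Linux and macOS systems, but it can be a legacy code page such as cp1252 on Windows, unless Python's UTF-8 mode is enabled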
>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: Unicode text, UTF-8 text
$ cat /tmp/myfile.txt
cześć
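The encoding that `open()` picks when none is given depends on the environment, so the result above may differ on another machine. A hedged sketch that inspects the locale's preferred encoding and writes the file with an explicit one instead; the temporary directory stands in for `/tmp` and is an assumption made for portability.

```python
import locale
import os
import tempfile

text = 'cześć'

# str.encode() defaults to UTF-8 regardless of the locale.
default_bytes = text.encode()

# Encoding used for text files when no encoding argument is passed.
preferred = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'myfile.txt')

    # Naming the encoding makes the output identical on every platform.
    with open(path, mode='wt', encoding='utf-8') as file:
        file.write(text + '\n')

    with open(path, mode='rb') as file:
        raw = file.read()

print(preferred, default_bytes, raw)
```

Passing `encoding='utf-8'` explicitly is the safer habit when files move between systems.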