Strings and Encoding
Files (and networks, and …) contain arbitrary bytes
Files don’t have an idea of their content
⟶ Content can be anything
Plain 7-bit ASCII
One of 2156 Chinese (multibyte) character sets
One of 1375 Japanese (multibyte) character sets
UTF-8, UTF-16, UTF-32
Many many more …
Unicode: one character set to rule them all
Internally, Python strings are sequences of Unicode code points
Where does the data come from and go to?
Programmer has to know what the source contains, and act accordingly
Raw bytes ⟶ create
Strings ⟶ which encoding?
Email: MIME headers (⟶
File object creation (⟶ later)
Otherwise: read byte data and convert to string objects
This is the programmer’s responsibility!
It has always been the programmer’s responsibility
Python 3 just doesn’t let you mix bytes and strings
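The two routes from bytes to strings, and the refusal to mix the two types, can be sketched as follows. This is a minimal illustration, not part of the original slides: the scratch file and its contents are made up for the example, and the encoding (ISO 8859-1) is assumed to be known in advance.

```python
import os
import tempfile

# Hypothetical sample data: a scratch file holding Latin-1 encoded bytes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'J\xf6rg\n')
    path = f.name

# Route 1: read raw bytes, then decode them explicitly.
with open(path, 'rb') as f:
    raw = f.read()
text_from_bytes = raw.decode('iso-8859-1')

# Route 2: name the encoding when the file object is created,
# so that read() returns str right away.
with open(path, encoding='iso-8859-1') as f:
    text_from_file = f.read()

os.remove(path)

# Python 3 refuses to combine bytes and str implicitly.
try:
    raw + 'suffix'
    mixing_error = None
except TypeError as e:
    mixing_error = e
```

Both routes yield the same `str`; the only difference is where the knowledge about the encoding is applied.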
Pre-Unicode: ISO/IEC 8859-1 (“Latin-1”) for Western European alphabets
>>> joerg_raw = b'J\xf6rg'
>>> type(joerg_raw)
<class 'bytes'>
File happens to be Latin-1 encoded
\xf6 is “ö” in Latin-1
… but that information isn’t there ⟶ binary
Transformation to string should be done as early as possible
Everything’s clear once one knows what the bytes contain
⟶ Transformation to Unicode (rules them all)
⟶ Nobody has to know anymore which encoding the source used
>>> joerg = str(joerg_raw, encoding='iso-8859-1')
>>> type(joerg)
<class 'str'>
>>> joerg
'Jörg'
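Why the programmer has to know the encoding can be illustrated with a small sketch. The choice of UTF-8 and CP437 as "wrong guesses" is an assumption made for this example only; any encoding other than the real one would serve.

```python
joerg_raw = b'J\xf6rg'

# bytes.decode() does the same job as str(..., encoding=...).
joerg = joerg_raw.decode('iso-8859-1')
same = str(joerg_raw, encoding='iso-8859-1')

# A wrong guess may fail loudly: 0xf6 is not valid UTF-8 here ...
try:
    joerg_raw.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# ... or, worse, succeed silently and produce a different character.
wrong = joerg_raw.decode('cp437')
```

The silent case is the dangerous one: `wrong` is a perfectly valid `str`, it just does not say what the author of the file meant.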
Internal string representation is Unicode
No-one cares (has to care)
Unicode is a set of numbers, not a concrete encoding
>>> joerg.encode('utf-8') b'J\xc3\xb6rg'
>>> joerg.encode('big5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5' codec can't encode ....
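The distinction between code points and encodings can be made concrete with a short sketch. ASCII stands in here for a codec that cannot represent “ö”, and `errors='replace'` is just one of several standard error handlers; neither appears in the original example.

```python
joerg = 'J\xf6rg'  # 'Jörg'

# Code points are plain integers, independent of any encoding.
code_points = [ord(c) for c in joerg]        # [74, 246, 114, 103]

# The same code points map to different byte sequences.
as_latin1 = joerg.encode('iso-8859-1')       # b'J\xf6rg'
as_utf8 = joerg.encode('utf-8')              # b'J\xc3\xb6rg'
as_utf16 = joerg.encode('utf-16-le')         # two bytes per character here

# A codec that lacks a character raises UnicodeEncodeError by default.
try:
    joerg.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# An explicit error handler trades correctness for robustness.
replaced = joerg.encode('ascii', errors='replace')   # b'J?rg'
```

Whether replacing unencodable characters is acceptable depends on the consumer of the bytes; failing loudly is usually the safer default.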