Strings and Encoding

Character Encodings

Problem …

Solution …

Where does the data come from and go to?

The programmer has to know what the source contains and act accordingly
Raw bytes ⟶ create bytes objects
Strings ⟶ which encoding?
- Email: MIME headers (⟶ email module)
- Files: specify encoding parameter at file object creation (⟶ later)
- Otherwise: read byte data and convert to string objects

This is the programmer’s responsibility!
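
The two string cases above (file with an encoding parameter, or raw bytes converted by hand) can be sketched roughly as follows. The temporary file and its content are made up for illustration, and ISO 8859-1 is assumed to be the right encoding; in real code that knowledge has to come from outside the data.

```python
import os
import tempfile

# Hypothetical sample data: "Jörg" plus newline, encoded as ISO 8859-1.
raw = b'J\xf6rg\n'

# Write the raw bytes to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(raw)
    path = f.name

try:
    # Variant 1: the file object decodes while reading.
    with open(path, encoding='iso-8859-1') as f:
        text_from_file = f.read()

    # Variant 2: read raw bytes, then convert to a string explicitly.
    with open(path, 'rb') as f:
        text_from_bytes = f.read().decode('iso-8859-1')
finally:
    os.remove(path)

print(text_from_file == text_from_bytes)
```

Both variants give the same str object; the first is usually more convenient for text files, the second is what remains when the bytes do not come from a file.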

Pre-Unicode: ISO/IEC 8859-1 (“Latin-1”) for Western European languages

Jörg, as read from a file with unknown encoding

>>> joerg_raw = b'J\xf6rg'
>>> type(joerg_raw)
<class 'bytes'>

Bytes should be decoded into strings as early as possible

Transfer raw bytes into string

>>> joerg = str(joerg_raw, encoding='iso-8859-1')
>>> type(joerg)
<class 'str'>
>>> joerg
'Jörg'
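
Why the encoding has to be known becomes visible when the wrong one is used. The following is a small sketch, not a recommendation: cp437 is picked arbitrarily as an example of a wrong single-byte codec.

```python
joerg_raw = b'J\xf6rg'

# As UTF-8 the data is invalid: the byte 0xf6 cannot start a UTF-8 sequence.
try:
    joerg_raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A wrong single-byte codec is worse: decoding succeeds silently,
# but the resulting string is not the intended one.
wrong = joerg_raw.decode('cp437')
right = joerg_raw.decode('iso-8859-1')

print(utf8_ok)
print(wrong == right)
```

The first failure is loud and easy to find. The second goes unnoticed until somebody looks at the output.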

Internally, a string is a sequence of Unicode code points

“ö” takes two bytes in UTF-8

>>> joerg.encode('utf-8')
b'J\xc3\xb6rg'
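
The difference between characters and bytes can be made explicit by comparing lengths. A minimal sketch; the string is written with an escape so that it does not depend on how this snippet itself is stored.

```python
joerg = 'J\xf6rg'                              # same string as 'Jörg'

n_chars = len(joerg)                           # counts code points
n_utf8 = len(joerg.encode('utf-8'))            # counts bytes in UTF-8
n_latin1 = len(joerg.encode('iso-8859-1'))     # counts bytes in Latin-1
codepoint = hex(ord(joerg[1]))                 # code point of "ö"

print(n_chars, n_utf8, n_latin1, codepoint)    # 4 5 4 0xf6
```

Latin-1 uses one byte per character, which is why its byte value for “ö” coincides with the Unicode code point U+00F6.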

“ö” cannot be encoded in Big5 (Traditional Chinese)

>>> joerg.encode('big5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5' codec can't encode ....
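
If such an error must not abort the program, encode() accepts an errors parameter. A sketch of three of the standard error handlers; which one is appropriate (if any) depends on what the receiver of the bytes can cope with.

```python
joerg = 'J\xf6rg'   # same string as 'Jörg'

# Default behaviour (errors='strict'): an exception that can be inspected.
try:
    joerg.encode('big5')
    offending = None
except UnicodeEncodeError as e:
    offending = e.object[e.start:e.end]   # the character(s) that failed

# Lossy: replace what cannot be encoded with "?".
replaced = joerg.encode('big5', errors='replace')

# Lossless but ugly: keep the code point as an escape sequence.
escaped = joerg.encode('big5', errors='backslashreplace')

# Useful for HTML/XML output: numeric character reference.
xmlref = joerg.encode('big5', errors='xmlcharrefreplace')

print(offending, replaced, escaped, xmlref)
```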

Question: how are string literals encoded?

Explicit source encoding

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# (UTF-8 is the default source encoding in Python 3 anyway)

print('Jörg')
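
A string literal can also be kept independent of the source file's encoding by using escapes, so that the file contains only ASCII. A short sketch showing three equivalent spellings of the same string:

```python
a = 'J\xf6rg'                                      # hex escape
b = 'J\u00f6rg'                                    # 16-bit Unicode escape
c = 'J\N{LATIN SMALL LETTER O WITH DIAERESIS}rg'   # Unicode character name

print(a == b == c)          # True
print(a.encode('utf-8'))    # b'J\xc3\xb6rg'
```

This is mostly useful where editors or tools with differing default encodings touch the same file; otherwise the literal 'Jörg' in a UTF-8 source file is more readable.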