Unicode and character encodings

There are dozens of character encodings. For an overview of Python’s encodings, see Encodings and Unicode.

The string module

Python’s string module distinguishes the following string constants, all of which fall into the ASCII character set:

# Some strings for ctype-style character classification
whitespace = ' \t\n\r\v\f'
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + ascii_letters + punctuation + whitespace

Most of these constants should be self-explanatory in their identifier names. hexdigits and octdigits refer to the hexadecimal and octal values respectively. You can use these constants for everyday string manipulation:

>>> import string
>>> hepy = "Hello Pythonistas!"
>>> hepy.rstrip(string.punctuation)
'Hello Pythonistas'

However, the string module works with Unicode by default, which is represented as binary data (bytes).

Unicode

It is obvious that the ASCII character set is not nearly large enough to cover all languages, dialects, symbols and glyphs; it is not even large enough for English.

While ASCII is a complete subset of Unicode – the first 128 characters in the Unicode table correspond exactly to ASCII characters – Unicode encompasses a much larger set of characters. Unicode itself is not an encoding but is implemented by various character encodings, with UTF-8 probably being the most commonly used encoding scheme.

Note

The Python help documentation has an entry for Unicode: enter help() and then UNICODE. The various options for creating Python strings are described in detail.

Unicode and UTF-8

While Unicode is an abstract encoding standard, UTF-8 is a concrete encoding scheme. The Unicode standard is a mapping of characters to code points and defines several different encodings from a single character set. UTF-8 is an encoding scheme for representing Unicode characters as binary data with one or more bytes per character.

Encoding and decoding in Python 3

The str type is intended for the representation of human-readable text and can contain all Unicode characters. The bytes type, on the other hand, represents binary data that is not inherently encoded. str.encode() and bytes.decode() are the methods of transition from one to the other:

>>> "You’re welcome!".encode("utf-8")
b'You\xe2\x80\x99re welcome!'
>>> b"You\xe2\x80\x99re welcome!".decode("utf-8")
'You’re welcome!'

The result of str.encode() is a bytes object. Both byte literals (such as b'You\xe2\x80\x99re welcome!') and representations of bytes only allow ASCII characters. For this reason, when calling "You’re welcome!".encode("utf-8"), the ASCII-compatible 'You' may be represented as it is, but the becomes '\xe2\x80\x99'. This chaotic looking sequence represents three bytes, e2, 80 and 99 as hexadecimal values.

Tip

In .encode() and .decode(), the encoding parameter is "utf-8" by default; however, it is recommended to specify it explicitly.

With bytes.fromhex() you can convert the hexadecimal values into bytes:

>>> bytes.fromhex('e2 80 99')
b'\xe2\x80\x99'

UTF-16 and UTF-32

The difference between these and UTF-8 is considerable in practice. In the following, I would like to show you only briefly by means of an example that a round-trip conversion can simply fail here:

>>> hepy = "Hello Pythonistas!"
>>> hepy.encode("utf-8")
b'Hello Pythonistas!'
>>> len(hepy.encode("utf-8"))
18
>>> hepy.encode("utf-8").decode("utf-16")
'效汬\u206f祐桴湯獩慴ⅳ'
>>> len(hepy.encode("utf-8").decode("utf-16"))
9

Encoding Latin letters in UTF-8 and then decoding them in UTF-16 resulted in a text that also contains characters from the Chinese, Japanese or Korean language areas as well as Roman numerals. Decoding the same byte object can lead to results that are not even in the same language or contain the same number of characters.

Python 3 and Unicode

Python 3 relies fully on Unicode and specifically on UTF-8:

  • Python 3 source code is assumed to be UTF-8 by default.

  • Texts (str) are Unicode by default. Encoded Unicode text is represented as binary data (Bytes) dargestellt.

  • Python 3 accepts many Unicode code points in identifiers.

  • Python’s re module uses the re.UNICODE flag by default, not re.ASCII. This means that, for example, r"\w" matches Unicode word characters, not just ASCII letters.

  • The default encoding in str.encode() and bytes.decode() is UTF-8.

The only exception could be open(), which is platform dependent and therefore depends on the value of locale.getpreferredencoding():

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

Built-in Python Functions

Python has a number of built-in functions that relate to character encodings in some way:

ascii(), bin(), hex(), oct()

output a string.

bytes, str, int

are class constructors for their respective types, converting the input to the desired type.

ord(), chr()

are inverses of each other in that the Python function ord() converts an str character to its base=10 code point, while chr() does the opposite.

Below is a more detailed look at each of these nine functions:

Function

RReturn type

Description

ascii()

str

ASCII representation of an object, escaping non-ASCII characters.

bin()

str

binary representation of an integer with the prefix 0b

hex()

str

hexadecimal representation of an integer with the prefix 0x

oct()

str

octal representation of an integer with the prefix 0o

bytes

bytes

converts the input to bytes type

str

str

converts the input to str type

int

int

converts the input to int type

ord()

int

converts a single Unicode character to its integer code point

chr()

str

converts an integer code point into a single Unicode character