Strings

The processing of character strings is one of Python’s strengths. There are many options for delimiting character strings:

"A string in double quotes can contain 'single quotes'."
'A string in single quotes can contain "double quotes"'
'''\tA string that starts with a tab and ends with a newline character.\n'''
"""This is a string in triple double quotes, the only string that contains
real line breaks."""

Strings can be delimited by single (' '), double (" "), triple single (''' ''') or triple double (""" """) quotes and can contain tab (\t) and newline (\n) characters. In general, the backslash \ can be used as an escape character. For example, \\ stands for a single backslash and \' for a single quote that does not end the string:

"You don't need a backslash here."
'However, this wouldn\'t work without a backslash.'

Here are other characters you can get with the escape character:

Escape sequence   Output   Description
\\                \        Backslash
\'                '        Single quote character
\"                "        Double quote character
\b                         Backspace (BS)
\n                         ASCII Linefeed (LF)
\r                         ASCII Carriage Return (CR)
\t                         Tabulator (TAB)
\u00B5            µ        Unicode 16 bit
\U000000B5        µ        Unicode 32 bit
\N{SNAKE}         🐍       Unicode Emoji name
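The Unicode escapes from the table can be checked with a short sketch. The variable names are chosen for illustration only, and \N{MICRO SIGN} is used here as an additional example of a character name:

```python
import unicodedata

# Three spellings of the same character: 16 bit, 32 bit and by Unicode name.
micro_16 = "\u00B5"
micro_32 = "\U000000B5"
micro_named = "\N{MICRO SIGN}"
snake = "\N{SNAKE}"

print(micro_16 == micro_32 == micro_named)  # all three denote one character
print(len("\t"), len("\\"))  # each escape sequence yields a single character
print(unicodedata.name(snake))
```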

A normal string cannot be split into multiple lines. The following code will not work:

"This is an incorrect attempt to insert a newline into a string without
using \n."

However, Python provides strings in triple quotes (""") that allow this and can contain single and double quotes without backslashes.

Strings are also immutable. The operators and functions that work with them return new strings derived from the original. The operators (in, + and *) and built-in functions (len, max and min) work with strings in the same way as with lists and tuples.

>>> welcome = "Hello pythonistas!\n"
>>> 2 * welcome
'Hello pythonistas!\nHello pythonistas!\n'
>>> welcome + welcome
'Hello pythonistas!\nHello pythonistas!\n'
>>> 'python' in welcome
True
>>> max(welcome)
'y'
>>> min(welcome)
'\n'

The index and slice notation works in the same way to obtain elements or slices:

>>> welcome[0:5]
'Hello'
>>> welcome[6:-1]
'pythonistas!'

However, the index and slice notation cannot be used to add, remove or replace elements:

>>> welcome[6:-1] = 'everybody!'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
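Since strings are immutable, a changed text has to be assembled as a new string. One possible sketch, in which the name everybody is made up for this example, combines slices of the original with the new part:

```python
welcome = "Hello pythonistas!\n"

# Slices are read-only, so build a new string from the parts to keep.
everybody = welcome[:6] + "everybody!" + welcome[-1:]

print(repr(everybody))
print(repr(welcome))  # the original string is unchanged
```

For simple cases such as this one, str.replace() is usually the more readable choice.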

string

Python's built-in str type offers several methods for working with the content of strings, including str.split(), str.replace() and str.strip():

>>> welcome = "hello pythonistas!\n"
>>> welcome.isupper()
False
>>> welcome.isalpha()
False
>>> welcome[0:5].isalpha()
True
>>> welcome.capitalize()
'Hello pythonistas!\n'
>>> welcome.title()
'Hello Pythonistas!\n'
>>> welcome.strip()
'hello pythonistas!'
>>> welcome.split(' ')
['hello', 'pythonistas!\n']
>>> chunks = [snippet.strip() for snippet in welcome.split(' ')]
>>> chunks
['hello', 'pythonistas!']
>>> ' '.join(chunks)
'hello pythonistas!'
>>> welcome.replace('\n', '')
'hello pythonistas!'

Below you will find an overview of the most common string methods:

Method

Description

str.count()

returns the number of non-overlapping occurrences of the substring.

str.endswith()

returns True if the string ends with the suffix.

str.startswith()

returns True if the string starts with the prefix.

str.join()

uses the string as a delimiter for concatenating a sequence of other strings.

str.index()

returns the position of the first character of the first occurrence of the substring in the string; raises a ValueError if it was not found.

str.find()

returns the position of the first character of the first occurrence of the substring in the string; like index, but returns -1 if nothing was found.

str.rfind()

returns the position of the first character of the last occurrence of the substring in the string; returns -1 if nothing was found.

str.replace()

replaces occurrences of a substring with another string.

str.strip(), str.rstrip(), str.lstrip()

strip whitespace, including line breaks, from both ends, the right end or the left end.

str.split()

splits a string into a list of substrings using the passed separator.

str.lower()

converts alphabetic characters to lower case.

str.upper()

converts alphabetic characters to upper case.

str.casefold()

converts characters to lower case and converts all region-specific variable character combinations to a common comparable form.

str.ljust(), str.rjust()

left-aligned or right-aligned; fills the opposite side of the string with spaces (or another filler character) in order to obtain a character string with a minimum width.

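str.removeprefix(), str.removesuffix()

remove the given prefix or suffix if the string starts or ends with it; available since Python 3.9 and useful, for example, for stripping a file extension from a file name.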
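Some of the methods from the table can be tried out as follows. This is only a sketch: the file name in path is hypothetical, and removeprefix() and removesuffix() assume Python 3.9 or later:

```python
path = "reports/2023_summary.txt"  # hypothetical file name

print(path.count("r"))   # occurrences of the substring "r"
print(path.find("/"))    # position of the first "/"
print(path.rfind("."))   # position of the last "."
print(path.find("?"))    # -1, because "?" does not occur

try:
    path.index("?")
except ValueError as error:
    # index() raises an exception where find() returns -1
    print("index() raised:", error)

# removeprefix() and removesuffix() require Python 3.9 or later.
name = path.removeprefix("reports/").removesuffix(".txt")
print(name)

print("7".rjust(3, "0"), "id".ljust(5, "."))
print("Straße".casefold() == "STRASSE".casefold())
```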

In addition, there are several methods with which the property of a character string can be checked:

Method              [!#$%…]   [a-zA-Z]   [¼½¾]   [¹²³]   [0-9]
str.isprintable()   yes       yes        yes     yes     yes
str.isalnum()       no        yes        yes     yes     yes
str.isnumeric()     no        no         yes     yes     yes
str.isdigit()       no        no         no      yes     yes
str.isdecimal()     no        no         no      no      yes

str.isspace() checks for spaces: [ \t\n\r\f\v\x1c-\x1f\x85\xa0\u1680…].
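The differences between these checks can be made visible with one sample character per class. The following sketch uses arbitrarily chosen sample characters and collects the results in a dictionary named table, a name introduced only for this example:

```python
# One sample per character class: punctuation, letter, fraction,
# superscript digit and decimal digit.
samples = ["!", "a", "½", "²", "7"]
checks = ["isprintable", "isalnum", "isnumeric", "isdigit", "isdecimal"]

# Call every check on every sample and store one row per method.
table = {name: [getattr(char, name)() for char in samples] for name in checks}

for name, row in table.items():
    print(f"{name:<12}", row)

# isspace() also accepts whitespace beyond the plain space character.
print("\t\n\xa0".isspace())
```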

re

The re module of the Python standard library also contains functions for working with strings. It offers more sophisticated options for pattern extraction and replacement than the string methods.

>>> import re
>>> re.sub('\n', '', welcome)
'hello pythonistas!'

Here, the regular expression is first compiled and then its re.Pattern.sub() method is called for the passed text. You can compile the expression itself with re.compile() to create a reusable regex object that reduces CPU cycles when applied to different strings:

>>> regex = re.compile('\n')
>>> regex.sub('', welcome)
'hello pythonistas!'

If you want to get a list of all substrings that match the pattern instead, you can use the re.Pattern.findall() method:

>>> regex.findall(welcome)
['\n']

Note

To avoid the awkward escaping with \ in a regular expression, you can use raw string literals such as r'C:\PATH\TO\FILE' instead of the corresponding 'C:\\PATH\\TO\\FILE'.
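A brief sketch of the difference between the two spellings; the variable names escaped and raw are chosen for this example only:

```python
import re

escaped = "C:\\PATH\\TO\\FILE"
raw = r"C:\PATH\TO\FILE"
print(escaped == raw)  # both literals describe the same string

# In a raw string, \n stays two characters instead of becoming a newline.
print(len(r"\n"), len("\n"))

# Inside a pattern, a literal backslash must still be escaped for the
# regex engine; the raw string only saves the second level of escaping.
backslashes = re.findall(r"\\", raw)
print(len(backslashes))
```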

re.Pattern.match() and re.Pattern.search() are closely related to re.Pattern.findall(). While findall returns all matches in a string, search only returns the first match and match only returns matches at the beginning of the string. As a less trivial example, consider a block of text and a regular expression that can identify most email addresses:

>>> addresses = """Veit <veit@cusy.io>
... Veit Schiele <veit.schiele@cusy.io>
... cusy GmbH <info@cusy.io>
... """
>>> pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
>>> regex.findall(addresses)
['veit@cusy.io', 'veit.schiele@cusy.io', 'info@cusy.io']
>>> regex.search(addresses)
<re.Match object; span=(6, 18), match='veit@cusy.io'>
>>> print(regex.match(addresses))
None

regex.match returns None, as the pattern only matches if it is at the beginning of the string.

Suppose you want to find email addresses and at the same time split each address into its three components:

  1. user name

  2. domain name

  3. domain suffix

To do this, you first place round brackets () around the parts of the pattern to be segmented:

>>> pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
>>> match = regex.match('veit@cusy.io')
>>> match.groups()
('veit', 'cusy', 'io')

re.Match.groups() returns a tuple that contains all subgroups of the match.

re.Pattern.findall() returns a list of tuples if the pattern contains groups:

>>> regex.findall(addresses)
[('veit', 'cusy', 'io'), ('veit.schiele', 'cusy', 'io'), ('info', 'cusy', 'io')]

Groups can also be used in re.Pattern.sub() where \1 stands for the first matching group, \2 for the second and so on:

>>> regex.findall(addresses)
[('veit', 'cusy', 'io'), ('veit.schiele', 'cusy', 'io'), ('info', 'cusy', 'io')]
>>> print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', addresses))
Veit <Username: veit, Domain: cusy, Suffix: io>
Veit Schiele <Username: veit.schiele, Domain: cusy, Suffix: io>
cusy GmbH <Username: info, Domain: cusy, Suffix: io>

The following table contains a brief overview of methods for regular expressions:

Method

Description

re.findall()

returns all non-overlapping matching patterns in a string as a list.

re.finditer()

like findall, but returns an iterator.

re.match()

matches the pattern at the beginning of the string and optionally segments the pattern components into groups; if the pattern matches, a match object is returned, otherwise None.

re.search()

searches the string for a match to the pattern and returns a match object if one is found, otherwise None; unlike match, the match can be anywhere in the string and not just at the beginning.

re.split()

splits the string into parts each time the pattern occurs.

re.sub(), re.subn()

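replace all occurrences of the pattern in the string with a replacement expression, or at most count occurrences if the optional count argument is given; subn additionally returns the number of replacements made; both use the symbols \1, \2, … to refer to the elements of the match group.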
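The methods from the table that have not been shown so far can be sketched with the address block from above. The replacement text "***" and the sample string passed to re.split() are made up for this example:

```python
import re

addresses = """Veit <veit@cusy.io>
Veit Schiele <veit.schiele@cusy.io>
cusy GmbH <info@cusy.io>
"""
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
regex = re.compile(pattern, flags=re.IGNORECASE)

# finditer yields match objects, so the positions are available as well.
spans = [(match.group(), match.span()) for match in regex.finditer(addresses)]
print(spans)

# subn returns the new string together with the number of replacements.
redacted, count = regex.subn("***", addresses)
print(count)
print(redacted)

# split cuts the string wherever the pattern (here: whitespace) occurs.
words = re.split(r"\s+", "Veit  Schiele\tcusy GmbH")
print(words)
```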

print()

The function print() outputs character strings, whereby other Python data types can easily be converted into strings and formatted, for example:

>>> import math
>>> pi = math.pi
>>> d = 28
>>> u = pi * d
>>> print("Pi is", pi, "and the circumference with a diameter of", d, "inches is", u, "inches.")
Pi is 3.141592653589793 and the circumference with a diameter of 28 inches is 87.96459430051421 inches.

F-Strings

F-strings can be used to shorten numbers that are too detailed for a text:

>>> print(f"The value of Pi is {pi:.3f}.")
The value of Pi is 3.142.

In {pi:.3f}, the precision .3 together with the format type f rounds the number Pi to three decimal places.

In A/B test scenarios, you often want to display the percentage change in a key figure. F-strings can be used to present it in an understandable way:

>>> metrics = 0.814172
>>> print(f"The AUC has increased to {metrics:=+7.2%}")
The AUC has increased to +81.42%

In this example, the variable metrics is formatted with = as the alignment, so that any fill characters are inserted after the sign and before the digits. + displays the sign for both positive and negative numbers, and 7 sets a minimum width of seven characters including the sign, the digits and the percent sign. .2 provides two decimal places, while the % symbol converts the decimal value into a percentage. For example, 0.514172 is converted to +51.42%.
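The effect of = only becomes visible when the field is wider than its content. The following sketch varies the width and the value; the names narrow, wide, negative and right are introduced only for this illustration:

```python
change = 0.814172  # same value as the metric in the example above

narrow = f"{change:=+7.2%}"    # sign and digits already fill seven characters
wide = f"{change:=+10.2%}"     # extra width: padding between sign and digits
negative = f"{-0.05:=+7.2%}"   # + also shows the sign of negative values
right = f"{change:>+10.2%}"    # for comparison: padding in front of the sign

for text in (narrow, wide, negative, right):
    print(repr(text))
```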

Values can also be converted into binary and hexadecimal values:

>>> block_size = 192
>>> print(f"Binary block size: {block_size:b}")
Binary block size: 11000000
>>> print(f"Hex block size: {block_size:x}")
Hex block size: c0

There are also formatting specifications that are ideally suited for CLI output, for example:

>>> data_types = [(7, "Data types", 19), (7.1, "Numbers", 19), (7.2, "Lists", 23)]
>>> for n, title, page in data_types:
...     print(f"{n:.1f} {title:.<25} {page: >3}")
...
7.0 Data types...............  19
7.1 Numbers..................  19
7.2 Lists....................  23

In general, the format is as follows, whereby all information in square brackets is optional:

:[[FILL]ALIGN][SIGN][0b|0o|0x|d|n][0][WIDTH][GROUPING]["." PRECISION][TYPE]

The following table lists the fields for character string formatting and their meaning:

Field

Meaning

FILL

Character used to fill in ALIGN. The default value is a space.

ALIGN

Text alignment and fill character:

<: left-aligned
>: right-aligned
^: centred
=: Fill character after SIGN

SIGN

Display sign:

+: Display sign for positive and negative numbers
-: Default value, sign only for negative numbers
space: Leading space for positive numbers, minus sign for negative numbers

0b|0o|0x|d|n

Sign for integers:

0b: Binary numbers
0o: Octal numbers
0x: Hexadecimal numbers
d: Default value, decimal integer with base 10
n: uses the current locale setting to insert the corresponding number separators

0

fills with zeros

WIDTH

Minimum field width

GROUPING

Number separator:

,: comma as thousands separator
_: underscore for thousands separator

.PRECISION

For floating point numbers, the number of digits after the point
For non-numeric values, the maximum length

TYPE

Output format as number type or string

… for integers:

b: binary format
c: converts the integer to the corresponding Unicode character
d: default value, decimal integer
n: same as d, with the difference that it uses the current locale setting to insert the corresponding number separators
o: octal format
x: Hexadecimal format in base 16, using lowercase letters for the digits above 9
X: Hexadecimal format in base 16, using capital letters for the digits above 9

… for floating point numbers:

e: Exponent notation with e as separator between coefficient and exponent
E: Exponent notation with E as separator between coefficient and exponent
f: Fixed-point notation with PRECISION digits after the point
g: General format; rounds to PRECISION significant digits and chooses between fixed-point and exponent notation depending on the magnitude of the number
G: Like g, but changes to E if the number becomes too large. The representations of infinity and NaN are also written in capital letters
n: Like g with the difference that it uses the current locale setting to insert the corresponding number separators
%: Percentage. Multiplies the number by 100 and displays it in the fixed format f followed by a percent sign
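A few of the fields from the table in combination, as a sketch; the values and the dictionary keys are arbitrary and only serve as labels:

```python
examples = {
    "grouping": f"{1234567.891:,.2f}",   # comma as thousands separator
    "underscore": f"{1234567:_}",        # underscore as thousands separator
    "centred": f"{'text':*^10}",         # fill character * and centred
    "zero_padded": f"{42:08.3f}",        # zeros up to a width of eight
    "hex_upper": f"{255:X}",             # hexadecimal with capital letters
    "character": f"{65:c}",              # integer to Unicode character
    "exponent": f"{0.000012345:e}",      # exponent notation
    "general": f"{1e10:g}",              # general format
    "truncated": f"{'truncate me':.8}",  # maximum length for strings
}

for label, text in examples.items():
    print(f"{label:<12} {text!r}")
```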

Tip

A good source for F-strings is the help function:

>>> help()
help> FORMATTING
...

You can browse through the help here and find many examples.

You can exit the help function again with :q and the Enter key.

Debugging F-Strings

In Python 3.8, a specifier was introduced to help with debugging F-string variables. By adding an equals sign = after an expression, the text of the expression is included in the output, followed by its value:

>>> uid = "veit"
>>> print(f"My name is {uid.capitalize()=}")
My name is uid.capitalize()='Veit'
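The equals sign can also be combined with whitespace, a conversion or a format specification. A short sketch, assuming Python 3.8 or later; the variable names on the left are made up for this example:

```python
import math

uid = "veit"
pi = math.pi

plain = f"{uid=}"          # the value is shown with repr() by default
spaced = f"{uid = }"       # whitespace around = is kept in the output
as_text = f"{uid=!s}"      # !s switches from repr() to str()
formatted = f"{pi=:.3f}"   # a format specification follows after the =

for text in (plain, spaced, as_text, formatted):
    print(text)
```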

Formatting date and time formats and IP addresses

datetime objects support format specifications in F-strings that use the same syntax as their strftime method.

>>> import datetime
>>> today = datetime.date.today()
>>> print(f"Today is {today:%d %B %Y}.")
Today is 26 November 2023.

The ipaddress module of Python also supports the formatting of IPv4Address and IPv6Address objects.
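As a sketch of what this looks like, assuming Python 3.9 or later and an arbitrarily chosen private address:

```python
import ipaddress

ip = ipaddress.IPv4Address("192.168.0.1")  # arbitrary example address

binary = f"{ip:b}"        # zero-padded binary representation
hexadecimal = f"{ip:x}"   # lowercase hexadecimal representation
prefixed = f"{ip:#x}"     # the # option adds the 0x prefix

print(binary)
print(hexadecimal)
print(prefixed)
```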

Finally, third-party libraries can also add their own support for formatting strings by adding a __format__ method to their objects.
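How such a __format__ method might look is sketched below. The Temperature class and its convention of a trailing K for Kelvin are hypothetical and not taken from any library:

```python
class Temperature:
    """Hypothetical class that only illustrates __format__."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        # Made-up convention: a trailing "K" selects Kelvin; the rest of
        # the specification is passed on to the float formatting.
        if spec.endswith("K"):
            value, unit, spec = self.celsius + 273.15, "K", spec[:-1]
        else:
            value, unit = self.celsius, "°C"
        return format(value, spec or ".1f") + " " + unit


room = Temperature(21.5)
print(f"{room}")
print(f"{room:.2f}")
print(f"{room:.0fK}")
```

Everything after the colon in the replacement field is handed to __format__ as a string, so the object itself decides how to interpret it.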

Built-in modules for strings

The Python standard library contains a number of built-in modules that you can use to manage strings:

Module

Description

string

provides constants such as string.digits or string.whitespace for comparisons

re

searches and replaces text with regular expressions

struct

interprets bytes as packed binary data

difflib

helps to calculate deltas, find differences between strings or sequences and create patches and diff files

textwrap

wraps and fills text, formats text with line breaks or spaces
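One small use of each of the modules not covered above, as a sketch with made-up sample values:

```python
import difflib
import string
import struct
import textwrap

# string: check characters against a predefined constant.
is_number = all(char in string.digits for char in "2023")

# struct: pack an integer into two little-endian bytes and read them back.
packed = struct.pack("<H", 258)
low, high = struct.unpack("<BB", packed)

# difflib: similarity of two strings as a value between 0 and 1.
ratio = difflib.SequenceMatcher(None, "pythonista", "pythonistas").ratio()

# textwrap: break a long line at word boundaries.
text = "The processing of character strings is one of Python's strengths."
wrapped = textwrap.fill(text, width=30)

print(is_number, packed, (low, high), round(ratio, 2))
print(wrapped)
```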