Strings
The processing of character strings is one of Python’s strengths. There are many options for delimiting character strings:
"A string in double quotes can contain 'single quotes'."
'A string in single quotes can contain "double quotes"'
'''\tA string that starts with a tab and ends with a newline character.\n'''
"""This is a string in triple double quotes, the only string that contains
real line breaks."""
Strings can be delimited by single (' '), double (" "), triple single (''' ''') or triple double (""" """) quotes and can contain tab (\t) and newline (\n) characters. In general, the backslash \ can be used as an escape character. For example, \\ produces a single backslash and \' a single quote that does not end the string:
"You don't need a backslash here."
'However, this wouldn\'t work without a backslash.'
Here are other characters you can get with the escape character:
Escape sequence | Output | Description |
---|---|---|
\\ | \ | Backslash |
\' | ' | Single quote character |
\" | " | Double quote character |
\b | | Backspace (BS) |
\n | | ASCII Linefeed (LF) |
\r | | ASCII Carriage Return (CR) |
\t | | Tabulator (TAB) |
\uXXXX | | Unicode 16 bit |
\UXXXXXXXX | | Unicode 32 bit |
\N{name} | | Unicode character by name, for example an emoji |
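The following is a minimal sketch of how some of these escape sequences behave; the variable names are only illustrative. It shows that each escape sequence stands for exactly one character in the resulting string.

```python
# Each escape sequence below represents exactly one character.
backslash = "\\"
quote = '\''
tab = "\t"
micro = "\u00b5"      # 16-bit Unicode escape for the micro sign
snake = "\N{SNAKE}"   # character looked up by its Unicode name

for char in (backslash, quote, tab, micro, snake):
    # repr() shows the escaped form, len() confirms the single character.
    print(repr(char), len(char))
```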
A normal string cannot be split into multiple lines. The following code will not work:
"This is an incorrect attempt to insert a newline into a string without
using \n."
However, Python provides strings in triple quotes (""") that allow this and can contain single and double quotes without backslashes.
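As a small sketch, the snippet below contrasts a triple-quoted string, which keeps its real line break, with adjacent string literals in parentheses. The latter is an additional technique not covered in the text above: it spreads a single-line string over several source lines. The variable names are assumptions for illustration.

```python
# A triple-quoted string keeps real line breaks and needs no backslashes
# for the quotes inside it.
quote = """She said: "It's fine."
This is the second line."""

# Adjacent literals inside parentheses are joined into one string, so a long
# single-line string can be spread over several source lines.
single_line = (
    "This string is written on two source lines "
    "but contains no newline."
)

print(quote.count("\n"))
print("\n" in single_line)
```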
Strings are also immutable. The operators and functions that work with them return new strings derived from the original. The operators (in, + and *) and built-in functions (len, max and min) work with strings in the same way as with lists and tuples.
>>> welcome = "Hello pythonistas!\n"
>>> 2 * welcome
'Hello pythonistas!\nHello pythonistas!\n'
>>> welcome + welcome
'Hello pythonistas!\nHello pythonistas!\n'
>>> 'python' in welcome
True
>>> max(welcome)
'y'
>>> min(welcome)
'\n'
The index and slice notation works in the same way to obtain elements or slices:
>>> welcome[0:5]
'Hello'
>>> welcome[6:-1]
'pythonistas!'
However, the index and slice notation cannot be used to add, remove or replace elements:
>>> welcome[6:-1] = 'everybody!'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
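Because strings are immutable, a changed version has to be built as a new string. The following sketch shows two possible ways of doing that, by combining slices and with str.replace(); the variable names are only illustrative.

```python
welcome = "Hello pythonistas!\n"

# Assemble a new string from slices of the original.
greeting = welcome[:6] + "everybody!" + welcome[-1:]

# str.replace() also returns a new string and leaves the original untouched.
replaced = welcome.replace("pythonistas!", "everybody!")

print(repr(greeting))
print(repr(welcome))
```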
string
For strings, Python provides several methods on the built-in str type for working with their content, including str.split(), str.replace() and str.strip():
>>> welcome = "hello pythonistas!\n"
>>> welcome.isupper()
False
>>> welcome.isalpha()
False
>>> welcome[0:5].isalpha()
True
>>> welcome.capitalize()
'Hello pythonistas!\n'
>>> welcome.title()
'Hello Pythonistas!\n'
>>> welcome.strip()
'hello pythonistas!'
>>> welcome.split(' ')
['hello', 'pythonistas!\n']
>>> chunks = [snippet.strip() for snippet in welcome.split(' ')]
>>> chunks
['hello', 'pythonistas!']
>>> ' '.join(chunks)
'hello pythonistas!'
>>> welcome.replace('\n', '')
'hello pythonistas!'
Below you will find an overview of the most common string methods:
Method | Description |
---|---|
str.count() | returns the number of non-overlapping occurrences of the substring. |
str.endswith() | returns True if the string ends with the suffix. |
str.startswith() | returns True if the string starts with the prefix. |
str.join() | uses the string as a delimiter for concatenating a sequence of other strings. |
str.index() | returns the position of the first character of the substring if it was found in the string; raises a ValueError if it was not found. |
str.find() | returns the position of the first character of the first occurrence of the substring in the string; like str.index(), but returns -1 if nothing was found. |
str.rfind() | returns the position of the first character of the last occurrence of the substring in the string; returns -1 if nothing was found. |
str.replace() | replaces occurrences of a string with another string. |
str.strip(), str.rstrip(), str.lstrip() | strip whitespace, including line breaks. |
str.split() | splits a string into a list of substrings using the passed separator. |
str.lower() | converts alphabetic characters to lower case. |
str.upper() | converts alphabetic characters to upper case. |
str.casefold() | converts characters to lower case and converts all region-specific variable character combinations to a common comparable form. |
str.ljust(), str.rjust() | left-aligned or right-aligned; fills the opposite side of the string with spaces (or another filler character) in order to obtain a character string with a minimum width. |
str.removeprefix(), str.removesuffix() | available since Python 3.9; remove a given prefix or suffix, for example to strip the file extension from a file name. |
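To illustrate some of the methods from this overview, here is a short sketch; the sample text and variable names are assumptions chosen for illustration, and str.removesuffix() requires Python 3.9 or later.

```python
text = "hello pythonistas, hello python!"

occurrences = text.count("hello")   # non-overlapping occurrences
first = text.find("python")         # position of the first occurrence
last = text.rfind("python")         # position of the last occurrence
not_found = text.find("java")       # -1 because nothing was found

# str.index() raises a ValueError instead of returning -1.
try:
    text.index("java")
    raised = False
except ValueError:
    raised = True

folded = "Straße".casefold()                 # comparable form of "ß"
stem = "report.txt".removesuffix(".txt")     # Python 3.9 or later
padded = "7.1".ljust(6, ".")                 # fill up to a minimum width

print(occurrences, first, last, not_found, raised)
print(folded, stem, padded)
```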
In addition, there are several methods with which the property of a character string can be checked:
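Method | [!#$%…] | [a-zA-Z] | [¼½¾] | [¹²³] | [0-9] |
---|---|---|---|---|---|
str.isprintable() | ✅ | ✅ | ✅ | ✅ | ✅ |
str.isalnum() | ❌ | ✅ | ✅ | ✅ | ✅ |
str.isnumeric() | ❌ | ❌ | ✅ | ✅ | ✅ |
str.isdigit() | ❌ | ❌ | ❌ | ✅ | ✅ |
str.isdecimal() | ❌ | ❌ | ❌ | ❌ | ✅ |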
str.isspace() checks for whitespace characters: [ \t\n\r\f\v\x1c-\x1f\x85\xa0\u1680…].
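The differences between these checks can be made visible with a small sketch that applies each method to one sample character per class. The sample characters are assumptions chosen for illustration.

```python
# One sample character for punctuation, letters, fractions,
# superscripts and decimal digits.
samples = ["!", "a", "½", "²", "5"]
checks = ["isprintable", "isalnum", "isnumeric", "isdigit", "isdecimal"]

results = {}
for name in checks:
    # Look up the method by name and call it on every sample character.
    results[name] = [getattr(char, name)() for char in samples]
    print(f"{name:<12}", results[name])

print(" \t\n".isspace())
```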
re
The Python standard library re also contains functions for working with strings. However, re offers more sophisticated options for pattern extraction and replacement than the string methods.
>>> import re
>>> re.sub('\n', '', welcome)
'hello pythonistas!'
Here, the regular expression is first compiled and then its re.Pattern.sub() method is called for the passed text. You can compile the expression yourself with re.compile() to create a reusable regex object that saves CPU cycles when it is applied to many different strings:
>>> regex = re.compile('\n')
>>> regex.sub('', welcome)
'hello pythonistas!'
If you want to get a list of all substrings that match the regex object instead, you can use the re.Pattern.findall() method:
>>> regex.findall(welcome)
['\n']
Note
To avoid the awkward escaping with \ in a regular expression, you can use raw string literals such as r'C:\PATH\TO\FILE' instead of the corresponding 'C:\\PATH\\TO\\FILE'.
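A brief sketch of the effect of raw string literals; the pattern and sample text are assumptions for illustration. In a raw string the backslash stays an ordinary character, so the pattern can be written exactly as the regular expression engine should see it.

```python
import re

# "\n" is one newline character, r"\n" is a backslash followed by "n".
print(len("\n"), len(r"\n"))

# Both spellings produce the same pattern text.
same = r"\d+\.\d+" == "\\d+\\.\\d+"

# The raw string keeps the pattern readable.
numbers = re.findall(r"\d+\.\d+", "Python 3.12 and 3.13")
print(same, numbers)
```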
re.Pattern.match() and re.Pattern.search() are closely related to re.Pattern.findall(). While findall returns all matches in a string, search only returns the first match and match only returns matches at the beginning of the string. As a less trivial example, consider a block of text and a regular expression that can identify most email addresses:
>>> addresses = """Veit <veit@cusy.io>
... Veit Schiele <veit.schiele@cusy.io>
... cusy GmbH <info@cusy.io>
... """
>>> pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
>>> regex.findall(addresses)
['veit@cusy.io', 'veit.schiele@cusy.io', 'info@cusy.io']
>>> regex.search(addresses)
<re.Match object; span=(6, 18), match='veit@cusy.io'>
>>> print(regex.match(addresses))
None
regex.match returns None, as the pattern only matches if it is at the beginning of the string.
Suppose you want to find email addresses and at the same time split each address into its three components: user name, domain name and domain suffix. To do this, you first place round brackets () around the parts of the pattern to be segmented:
>>> pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
>>> match = regex.match('veit@cusy.io')
>>> match.groups()
('veit', 'cusy', 'io')
re.Match.groups() returns a tuple that contains all subgroups of the match.
re.Pattern.findall() returns a list of tuples if the pattern contains groups:
>>> regex.findall(addresses)
[('veit', 'cusy', 'io'), ('veit.schiele', 'cusy', 'io'), ('info', 'cusy', 'io')]
Groups can also be used in re.Pattern.sub() where \1 stands for the first matching group, \2 for the second and so on:
>>> regex.findall(addresses)
[('veit', 'cusy', 'io'), ('veit.schiele', 'cusy', 'io'), ('info', 'cusy', 'io')]
>>> print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', addresses))
Veit <Username: veit, Domain: cusy, Suffix: io>
Veit Schiele <Username: veit.schiele, Domain: cusy, Suffix: io>
cusy GmbH <Username: info, Domain: cusy, Suffix: io>
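As a possible variation that goes slightly beyond the text above, groups can also be given names with the (?P<name>…) syntax. The following sketch assumes the group names user, domain and suffix, which are purely illustrative, and uses re.Match.groupdict() and the \g<name> reference in the replacement text.

```python
import re

# The same email pattern as above, but with named groups.
pattern = (
    r"(?P<user>[A-Z0-9._%+-]+)@"
    r"(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})"
)
regex = re.compile(pattern, flags=re.IGNORECASE)

match = regex.search("Veit Schiele <veit.schiele@cusy.io>")
parts = match.groupdict()   # maps group names to the matched text
print(parts)

# \g<name> refers to a named group in the replacement text.
masked = regex.sub(r"***@\g<domain>.\g<suffix>", "veit@cusy.io")
print(masked)
```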
The following table contains a brief overview of methods for regular expressions:
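Method | Description |
---|---|
re.findall() | returns all non-overlapping matching patterns in a string as a list. |
re.finditer() | like re.findall(), but returns an iterator. |
re.match() | matches the pattern at the beginning of the string and optionally segments the pattern components into groups; if the pattern matches, a match object is returned, otherwise None. |
re.search() | searches the string for matches to the pattern; in this case, returns a match object, otherwise None; unlike re.match(), the match can be anywhere in the string, not just at the beginning. |
re.split() | splits the string into parts each time the pattern occurs. |
re.sub(), re.subn() | replaces all (sub) occurrences of the pattern in the string with a replacement expression; subn additionally returns the number of replacements made; the symbols \1, \2, … refer to the elements of the match group. |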
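The methods from this overview that have not been shown so far could be used as in the following sketch. The sample text and patterns are assumptions for illustration.

```python
import re

text = "one, two;three  four"

# Split wherever a run of commas, semicolons or whitespace occurs.
words = re.split(r"[,;\s]+", text)

# finditer yields match objects, which also carry the match position.
spans = [(m.group(), m.start()) for m in re.finditer(r"t\w+", text)]

# subn returns the new string together with the number of replacements.
replaced, count = re.subn(r"\s+", " ", text)

print(words)
print(spans)
print(replaced, count)
```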
print()
The function print() outputs character strings, whereby other Python data types can easily be converted into strings and formatted, for example:
>>> import math
>>> pi = math.pi
>>> d = 28
>>> u = pi * d
>>> print("Pi is", pi, "and the circumference with a diameter of", d, "inches is", u, "inches.")
Pi is 3.141592653589793 and the circumference with a diameter of 28 inches is 87.96459430051421 inches.
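By default, print() separates its arguments with a space and ends the output with a newline. As a small sketch, the keyword arguments sep, end and file change this behaviour; the io.StringIO buffer is used here only to make the result inspectable.

```python
import io

buffer = io.StringIO()

# sep replaces the space between arguments, end replaces the final newline,
# and file redirects the output away from the terminal.
print("Pi", "is", 3.14, sep="-", end="!\n", file=buffer)

output = buffer.getvalue()
print(repr(output))
```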
F-Strings
F-strings can be used to shorten numbers that are too detailed for a text:
>>> print(f"The value of Pi is {pi:.3f}.")
The value of Pi is 3.142.
In {pi:.3f}, the format specification .3f rounds the number Pi to three decimal places.
In A/B test scenarios, you often want to display the percentage change in a key figure. F-strings can be used to formulate them in an understandable way:
>>> metrics = 0.814172
>>> print(f"The AUC has increased to {metrics:=+7.2%}")
The AUC has increased to +81.42%
In this example, the variable metrics is formatted with the specification =+7.2%: + always displays the sign, = places any fill characters between the sign and the digits, and 7 sets a minimum width of seven characters including the plus or minus sign, the digits and the percent sign. .2 provides two decimal places, while the % symbol multiplies the value by 100 and appends a percent sign. For example, 0.514172 is converted to +51.42%.
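The effect of = becomes visible once the value is shorter than the minimum width. The following sketch uses a width of 8 and a few made-up values, both of which are assumptions for illustration.

```python
# Illustrative values, not taken from a real experiment.
values = [0.814172, 0.012, -0.05]

# "=" puts the padding between the sign and the digits,
# so the signs line up in the first column.
formatted = [f"{value:=+8.2%}" for value in values]

for line in formatted:
    print(line)
```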
Values can also be converted into binary and hexadecimal values:
>>> block_size = 192
>>> print(f"Binary block size: {block_size:b}")
Binary block size: 11000000
>>> print(f"Hex block size: {block_size:x}")
Hex block size: c0
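As a short sketch of related options, the # flag adds the 0b or 0x prefix, a leading 0 in front of the width pads with zeros, and X produces capital hexadecimal digits. The variable names are only illustrative.

```python
block_size = 192

with_binary_prefix = f"{block_size:#b}"   # adds the 0b prefix
with_hex_prefix = f"{block_size:#x}"      # adds the 0x prefix
padded_hex = f"{block_size:#06x}"         # width 6 includes the prefix
padded_binary = f"{block_size:012b}"      # zero-padded to twelve digits
upper_hex = f"{block_size:X}"             # capital letters for digits above 9

print(with_binary_prefix, with_hex_prefix, padded_hex, padded_binary, upper_hex)

# int() with an explicit base converts such strings back to a number.
print(int("c0", 16))
```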
There are also formatting specifications that are ideally suited for CLI output, for example:
>>> data_types = [(7, "Data types", 19), (7.1, "Numbers", 19), (7.2, "Lists", 23)]
>>> for n, title, page in data_types:
...     print(f"{n:.1f} {title:.<25} {page: >3}")
...
7.0 Data types...............  19
7.1 Numbers..................  19
7.2 Lists....................  23
In general, the format is as follows, whereby all information in square brackets is optional:
:[[FILL]ALIGN][SIGN][0b|0o|0x|d|n][0][WIDTH][GROUPING]["." PRECISION][TYPE]
The following table lists the fields for character string formatting and their meaning:
Field | Meaning |
---|---|
FILL | Character used to fill in ALIGN |
ALIGN | Text alignment and fill character: <: left-aligned; >: right-aligned; ^: centred; =: fill character after SIGN |
SIGN | Display sign: +: display sign for positive and negative numbers; -: default value, sign only for negative numbers; space: leading space for positive numbers and - for negative numbers |
0b, 0o, 0x, d, n | Notation for integers: 0b: binary numbers; 0o: octal numbers; 0x: hexadecimal numbers; d: default value, decimal integer with base 10; n: uses the current locale setting to insert the corresponding number separators |
0 | fills with zeros |
WIDTH | Minimum field width |
GROUPING | Number separator: ,: comma as thousands separator; _: underscore as thousands separator |
.PRECISION | For floating point numbers, the number of digits after the point; for non-numeric values, the maximum length |
TYPE | Output format as number type or string. For integers: b: binary format; c: converts the integer to the corresponding Unicode character; d: default value, decimal integer; n: same as d, with the difference that it uses the current locale setting to insert the corresponding number separators; o: octal format; x: hexadecimal format in base 16, using lowercase letters for the digits above 9; X: hexadecimal format in base 16, using capital letters for the digits above 9. For floating point numbers: e: exponent notation with e as separator between coefficient and exponent; E: exponent notation with E as separator between coefficient and exponent; f: fixed-point notation; g: general format, which switches between fixed-point and exponent notation depending on the magnitude of the number; G: like g, but changes to E if the number becomes too large, and the representations of infinity and NaN are also written in capital letters; n: like g, with the difference that it uses the current locale setting to insert the corresponding number separators; %: percentage, multiplies the number by 100 and displays it in the fixed format f followed by a percent sign |
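A few of these fields in combination, as a minimal sketch; the dictionary keys and sample values are assumptions for illustration.

```python
examples = {
    "grouping with comma": f"{1234567.891:,.2f}",
    "grouping with underscore": f"{1234567:_d}",
    "centred with fill character": f"{'py':*^8}",
    "maximum length for text": f"{'pythonistas':.6}",
    "zero padding": f"{42:06d}",
    "exponent notation": f"{1234.5:.2e}",
    "integer as character": f"{960:c}",
}

for description, result in examples.items():
    # Left-align the description in a column of 30 characters.
    print(f"{description:<30}{result}")
```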
Tip
A good source for F-strings is the help function:
>>> help()
help> FORMATTING
...
You can browse through the help here and find many examples.
You can leave the pager with q and exit the help function again with quit and ⏎.
Debugging F-Strings
In Python 3.8, a specifier was introduced to help with debugging F-string variables. By adding an equals sign =, the expression itself is included in the output together with its value:
>>> uid = "veit"
>>> print(f"My name is {uid.capitalize()=}")
My name is uid.capitalize()='Veit'
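The equals sign can be combined with whitespace, a conversion or a format specification. The following sketch shows a few such combinations; the variable names are only illustrative.

```python
uid = "veit"
pi = 3.141592653589793

plain = f"{uid=}"          # the value is shown with repr() by default
spaced = f"{uid = }"       # whitespace around "=" is preserved
as_string = f"{uid=!s}"    # !s switches to str() instead of repr()
rounded = f"{pi=:.3f}"     # a format specification may follow the "="

print(plain, spaced, as_string, rounded, sep="\n")
```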
Formatting date and time formats and IP addresses
datetime supports the formatting of strings using the same syntax as the strftime method for these objects.
>>> import datetime
>>> today = datetime.date.today()
>>> print(f"Today is {today:%d %B %Y}.")
Today is 26 November 2023.
The ipaddress module of Python also supports the formatting of IPv4Address and IPv6Address objects.
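A short sketch of what this can look like, assuming Python 3.9 or later, where the address objects accept format specifications such as b and x; the address itself is only an example.

```python
import ipaddress

address = ipaddress.IPv4Address("192.168.0.1")

as_text = f"{address}"       # the usual dotted notation
as_binary = f"{address:#b}"  # 32 binary digits with the 0b prefix
as_hex = f"{address:x}"      # eight hexadecimal digits

print(as_text, as_binary, as_hex, sep="\n")
```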
Finally, third-party libraries can also add their own support for formatting strings by adding a __format__ method to their objects.
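How such a __format__ method might look is sketched below. The Temperature class and its format codes C and F are hypothetical and not part of any library.

```python
class Temperature:
    """Hypothetical example class with its own format codes."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        # An empty specification or "C" selects degrees Celsius.
        if spec in ("", "C"):
            return f"{self.celsius:.1f} °C"
        # "F" converts the value to degrees Fahrenheit.
        if spec == "F":
            return f"{self.celsius * 9 / 5 + 32:.1f} °F"
        raise ValueError(f"unknown format specification {spec!r}")


room = Temperature(21.5)
print(f"{room}")
print(f"{room:F}")
```

The f-string passes everything after the colon to __format__ as a plain string, so the object decides what each code means.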
Built-in modules for strings
The Python standard library contains a number of built-in modules that you can use to manage strings:
Module | Description |
---|---|
string | compares with constants such as string.ascii_letters or string.digits |
re | searches and replaces text with regular expressions |
struct | interprets bytes as packed binary data |
difflib | helps to calculate deltas, find differences between strings or sequences and create patches and diff files |
textwrap | wraps and fills text, formats text with line breaks or spaces |
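To give an impression of these modules, here is a minimal sketch with one call per module; the sample values are assumptions for illustration.

```python
import difflib
import string
import struct
import textwrap

# string: compare characters with a predefined constant.
digits_only = all(char in string.digits for char in "2023")

# struct: pack the integer 258 as a big-endian unsigned 16-bit value.
packed = struct.pack(">H", 258)

# difflib: similarity of two sequences as a value between 0 and 1.
ratio = difflib.SequenceMatcher(None, "abcd", "bcde").ratio()

# textwrap: wrap a long sentence to a maximum line width.
sentence = "The processing of character strings is one of Python's strengths."
wrapped = textwrap.fill(sentence, width=30)

print(digits_only, packed, ratio)
print(wrapped)
```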