Python developers often trust their applications to be secure because they rely on standard libraries and common frameworks. However, Python, like any other programming language, has features that can be misleading or easily misused. Often a minor subtlety or detail is enough to make a developer slip and add a severe security vulnerability to the code base.
In this blog post, we share 10 security pitfalls we encountered in real-world Python projects. We chose pitfalls that we believe are less known in the developer community. By explaining each issue and its impact we hope to raise awareness and sharpen your security mindset. If you are using any of these features, make sure to check your Python code!
Python offers the ability to execute code in an optimized way (for example, by running the interpreter with the -O flag). This allows the code to run faster and with less memory. It is especially effective when the application is used on a large scale or when there are few resources available. Some pre-packaged Python applications are provided with optimized bytecode. However, when code is optimized, all assert statements are ignored. Developers sometimes use these statements to check certain conditions within the code. If an assert is used, for example, as part of an authentication check, this can lead to a security bypass.
In this example, the assert statement in line 2 would be ignored and every non-superuser could reach the following lines of code. Using assert statements for security-related checks is not recommended, but we do see them in real-world applications.
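A minimal sketch of this pattern, assuming a hypothetical `User` class and two illustrative function names, might look as follows. The first function relies on `assert` and loses its check under `python -O`; the second uses an explicit check that survives optimization.

```python
class User:
    """Hypothetical user object; only the is_superuser flag matters here."""

    def __init__(self, name, is_superuser=False):
        self.name = name
        self.is_superuser = is_superuser


def admin_action_with_assert(user):
    # Under `python -O` this assert is removed, so the check silently disappears.
    assert user.is_superuser, "superuser required"
    return "admin action performed"


def admin_action_with_check(user):
    # An explicit check is ordinary control flow and is kept under optimization.
    if not user.is_superuser:
        raise PermissionError("superuser required")
    return "admin action performed"
```

Raising an exception keeps the authorization decision in regular code, independent of how the interpreter is started.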
os.makedirs creates one or more folders in the file system. Its second parameter, mode, is used to specify the default permission of the created folders. In line 2 of the following code snippet, the folders A/B/C are created with rwx------ (0o700) permission. This implies that only the current user (owner) has read, write and execute rights for these folders.
In Python < 3.7, the folders A, B and C are each created with permission 700. However, in Python >= 3.7, only the last folder C has permission 700, while the intermediate folders A and B are created with the default permission (typically 755, depending on the umask). So, from Python 3.7 on, the function os.makedirs has the same properties as the Linux command mkdir -m 700 -p A/B/C. Some developers are unaware of the difference between the versions, and it has already led to a permission escalation vulnerability in Django (CVE-2020-24583) and, in a very similar way, to a hardening bypass in WordPress.
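One possible way to keep every level of a new directory tree private is to tighten the umask while the folders are created. The sketch below assumes a POSIX system; the helper names `temporary_umask`, `makedirs_private` and `mode_of` are hypothetical and not part of the standard library.

```python
import os
import stat
from contextlib import contextmanager


@contextmanager
def temporary_umask(mask):
    # os.umask is process-wide, so this is not safe to use from several threads.
    previous = os.umask(mask)
    try:
        yield
    finally:
        os.umask(previous)


def makedirs_private(path):
    # With umask 0o077, intermediate folders (created with the default mode)
    # and the leaf folder (created with mode 0o700) are all owner-only.
    with temporary_umask(0o077):
        os.makedirs(path, mode=0o700, exist_ok=True)


def mode_of(path):
    # Return only the permission bits of the given path.
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling `makedirs_private("A/B/C")` and then inspecting `mode_of("A")` is a quick way to confirm that the intermediate folders carry no group or world permissions.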
The os.path.join(path, *paths) function is used to join multiple file path components into a combined file path. The first parameter usually contains the basepath, while each further parameter is appended to the basepath as a component. However, the function has a peculiarity that some developers are not aware of. If one of the appended components starts with a /, all previous components, including the basepath, are removed and this component is treated as an absolute path. The following example shows this possible pitfall for developers.
In line 3, the resulting path is constructed from the user-controlled input filename using the os.path.join function. In line 4, the resulting path is checked to see if it contains a . to prevent a path traversal vulnerability. However, if the attacker passes the filename parameter /a/b/c.txt then the resulting variable file_path in line 3 is an absolute file path. The var/lib components including the basepath are now ignored by os.path.join and an attacker can read any file without using a single . character. Although this behavior is described in the os.path.join documentation, it has led to numerous vulnerabilities in the past (Cuckoo Sandbox Evasion, CVE-2020-35736).
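The behavior can be reproduced with `posixpath.join`, which is what `os.path.join` resolves to on POSIX systems. The `safe_join` helper below is a hypothetical sketch of one possible mitigation: it resolves the combined path and rejects anything that ends up outside the base directory. It assumes that `base` is already an absolute, resolved path.

```python
import os
import posixpath

# An absolute component discards everything that came before it.
joined = posixpath.join("var", "lib", "/a/b/c.txt")  # "/a/b/c.txt"


def safe_join(base, user_path):
    # Resolve symlinks and ".." first, then verify the result is still below base.
    candidate = os.path.realpath(os.path.join(base, user_path))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("path escapes the base directory")
    return candidate
```

Checking the resolved path, rather than scanning the input for suspicious characters, covers absolute components and `..` sequences with the same test.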
The tempfile.NamedTemporaryFile function is used to create temporary files with a specific name. However, the prefix and suffix parameters are vulnerable to a path traversal attack (Issue 35278). If an attacker controls one of these parameters, they can create a temporary file at an arbitrary location in the file system. The following example shows a possible pitfall for developers.
In line 3, the user input id is used as a prefix for the temporary file. If an attacker passes the payload /../var/www/test as the id parameter, the following tmp file is created: /var/www/test_zdllj17. This may sound harmless at first glance, but it gives an attacker a basis for exploiting more complex vulnerabilities.
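A hedged sketch of one way to guard against this is to accept only prefixes that are plain file name fragments. The function name `named_temp_file_for` is hypothetical, and the validation rule is an assumption about what an application would consider acceptable.

```python
import os
import tempfile


def named_temp_file_for(user_prefix):
    # Reject empty values, "." and "..", and anything containing a path separator.
    if user_prefix in ("", ".", "..") or os.path.basename(user_prefix) != user_prefix:
        raise ValueError("prefix must be a plain file name fragment")
    # The file is created in the default temporary directory.
    return tempfile.NamedTemporaryFile(prefix=user_prefix + "_")
```

With this check, a value such as `/../var/www/test` is refused before it ever reaches `tempfile`.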
Extracting uploaded file archives is a common feature in web applications. In Python, the functions TarFile.extractall and TarFile.extract are known to be vulnerable to a Zip Slip attack, in which an attacker tampers with the file names inside an archive so that they contain path traversal (../) characters. For this reason, archive entries should always be considered untrusted. The zipfile.extractall and zipfile.extract functions sanitize zip entries and thus prevent such path traversal vulnerabilities. However, this does not mean that a path traversal vulnerability can't occur when using the zipfile library. The following example shows code for extracting zip files.
In line 3, a ZipFile handler is created from the temporary path of the uploaded user file. In lines 4 - 8, all zip entries ending with .html are extracted. The name returned by zf.namelist, used in line 7, is the name of an entry within the zip file. Note that only the zipfile.extract and zipfile.extractall functions sanitize the entries, not any of the other functions. In this case an attacker can create an entry with a name such as ../../../var/www/html and arbitrary content. The contents of the malicious file are read in line 6 and written to the attacker-controlled path in lines 7-8. As a result, an attacker can create arbitrary HTML files anywhere on the server.
As mentioned above, entries inside an archive should be considered untrusted. If you don't use zipfile.extractall or zipfile.extract, you should always sanitize the names of the zip entries, e.g. by using os.path.basename. Otherwise it could lead to a critical security vulnerability like the one found in NLTK Downloader (CVE-2019-14751).
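The following sketch illustrates the os.path.basename approach for manual extraction. The function `extract_html`, its parameters, and the demo archive are hypothetical; the archive is built in memory with a traversal-style entry name to show that the file still lands inside the destination folder.

```python
import io
import os
import tempfile
import zipfile


def extract_html(zip_bytes, dest_dir):
    """Write all .html entries into dest_dir, ignoring any directory part."""
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for entry in zf.namelist():
            if not entry.endswith(".html"):
                continue
            safe_name = os.path.basename(entry)  # drops "../" and folder names
            if not safe_name:
                continue
            target = os.path.join(dest_dir, safe_name)
            with open(target, "wb") as fp:
                fp.write(zf.read(entry))
            written.append(target)
    return written


# Demo: an archive whose entry name tries to leave the destination folder.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as demo:
    demo.writestr("../../evil.html", b"<p>hello</p>")
    demo.writestr("notes.txt", b"ignored")

destination = tempfile.mkdtemp()
extracted = extract_html(archive.getvalue(), destination)
```

Flattening names this way means entries with the same file name in different folders overwrite each other, so it only fits cases where the folder structure does not matter.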
Regular expressions (regex) are an integral part of most web applications. We commonly see them used by custom Web Application Firewalls (WAF) for input validation, e.g. to detect malicious strings. In Python, there is a subtle difference between re.match and re.search that we would like to demonstrate in the following code snippet.
In line 2, a pattern is defined that matches SQL keywords such as select to detect a possible SQL injection. This is a terrible idea, as you can often bypass these blacklists, but we've seen it in real-world applications. In line 4 the function re.match is used with the previously defined pattern to check if the user input name in line 3 contains any of these malicious values. However, unlike the re.search function, the re.match function only matches at the beginning of the string, and by default the . metacharacter does not match a newline, so the pattern never reaches text that follows a line break. For example, if an attacker submitted the value aaaaaa \n union select, the user input would not match the regex. As a result, the check can be bypassed and does not provide any protection. Overall, we do not recommend using a regex deny list for any security checks.
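The difference can be observed with a small experiment. The pattern below is an assumed example of such a deny list, not a recommended filter.

```python
import re

# Assumed deny-list pattern for illustration only.
pattern = re.compile(r".*(union|select).*")
payload = "aaaaaa \n union select"

# re.match anchors at the start of the string, and "." stops at the newline,
# so the keywords on the second line are never reached.
match_result = re.match(pattern, payload)

# re.search tries every starting position and finds the second line.
search_result = re.search(pattern, payload)
```

Here `match_result` is `None` while `search_result` is a match object, which is exactly the gap an attacker would use.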
Unicode allows characters to be used in multiple representations and maps these characters to codepoints. In the Unicode standard, four normalizations are defined for different Unicode characters. An application can use these normalizations to store data, such as a user name, in a uniform way independent of the human language. However, an attacker can exploit these normalizations, and that has already led to a vulnerability in Python's urllib (CVE-2019-9636). The following code snippet demonstrates a Cross-Site Scripting (XSS) vulnerability based on the NFKC normalization.
In line 6, the user input is sanitized by Django's escape function to prevent an XSS vulnerability. In line 7, the sanitized input is normalized via the NFKC algorithm so that it is correctly rendered in lines 8-9 through the test.html template.
Within the template test.html, the variable my_input in line 4 is marked as safe because the developer expects special characters and assumes that the variable has already been sanitized by the escape function. By using the keyword safe, the variable is not sanitized additionally by Django. However, due to normalization in line 7 (view.py), the character %EF%B9%A4 is transformed to < and the character %EF%B9%A5 is transformed to >. This allows an attacker to inject arbitrary HTML tags and to trigger an XSS vulnerability. To prevent this vulnerability, user input should always be sanitized at the very last step, after it has been normalized.
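The effect of the ordering can be shown with the standard library alone. In this sketch, `html.escape` stands in for Django's escape function (an assumption made to keep the example self-contained), and the payload uses the small less-than and greater-than signs U+FE64 and U+FE65, whose UTF-8 encodings are the percent-encoded bytes mentioned above.

```python
import html
import unicodedata

payload = "\ufe64script\ufe65alert(1)\ufe64/script\ufe65"

# Escaping first leaves the look-alike characters untouched; normalizing
# afterwards turns them into real angle brackets.
escaped_then_normalized = unicodedata.normalize("NFKC", html.escape(payload))

# Normalizing first exposes the angle brackets to the escape function.
normalized_then_escaped = html.escape(unicodedata.normalize("NFKC", payload))
```

Only the second ordering yields output that is safe to place into an HTML page.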
As mentioned above, Unicode characters are mapped to codepoints. However, there are many different human languages and Unicode tries to unify them. This also means that there is a high probability that different characters have the same "layout". For example, the lowercase Turkish ı (without a dot) character is I in uppercase. In Latin-based alphabets, the character i is also I in uppercase. In Unicode terms, the two different characters are mapped to the same codepoint in uppercase. This behavior is exploitable and has already led to a critical vulnerability in Django (CVE-2019-19844). Let's have a look at the following code example of a password reset feature.
In line 6 the user input email is received. The lookup for a matching user is case-insensitive: the email is transformed with the upper function first. For the attack, we assume that a user with the email foo@mix.com exists in the database. An attacker can now simply pass foo@mıx.com as the email in line 6, where the i is replaced with the Turkish ı. In line 7 the email is then transformed to uppercase, which results in FOO@MIX.COM. This means that a user has been found and a password reset email is sent. However, the email is sent to the untransformed email address from line 6 and therefore still contains the Turkish ı. In other words, the password reset email for another user's account is sent to the attacker-controlled email address. To prevent this vulnerability, the recipient in line 10 can be replaced with the user's email from the database. Even if a collision occurs, an attacker has no benefit from it in this context.
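The collision and the suggested fix can be sketched without a database. The dictionary `USERS_BY_UPPER_EMAIL` and the function `reset_recipient` are hypothetical stand-ins for a case-insensitive user lookup.

```python
# Hypothetical user store, keyed by the uppercased email address.
USERS_BY_UPPER_EMAIL = {"FOO@MIX.COM": "foo@mix.com"}


def reset_recipient(submitted_email):
    # Return the address stored for the account, never the submitted one,
    # so a case-mapping collision cannot redirect the reset email.
    return USERS_BY_UPPER_EMAIL.get(submitted_email.upper())


attacker_input = "foo@m\u0131x.com"            # contains the dotless Turkish i
collides = attacker_input.upper() == "FOO@MIX.COM"
recipient = reset_recipient(attacker_input)
```

The lookup still finds the account because of the collision, but the reset email goes to the stored address `foo@mix.com` instead of the attacker's mailbox.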
In Python versions before 3.9.5 (and before 3.8.12 on the 3.8 branch), IP addresses are normalized by the ipaddress library so that leading zeros are removed. This behavior might look harmless at first glance, but it has already led to a high-severity vulnerability in Django (CVE-2021-33571). An attacker can exploit the normalization to bypass potential validators for Server-Side Request Forgery (SSRF) attacks. The following code snippet shows how such a validator can be bypassed.
In line 5, an IP address is given by a user, and in line 7, a denylist is used to check if the IP is a local address in order to prevent a possible SSRF vulnerability. The denylist is not complete and is only used as an example. In line 9 the code checks whether the provided IP is an IPv4 address, and at the same time the IP is normalized. The actual request to the provided IP is performed in line 12 after all validations. However, an attacker could pass 127.0.00.1 as the IP address, which is not found in the denylist in line 7. Afterward, in line 9, the IP is normalized to 127.0.0.1 by ipaddress.IPv4Address. As a consequence, the attacker is able to bypass the SSRF validator and send requests to local network addresses.
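A more robust ordering is to parse the address first and run the checks on the parsed object rather than on the raw string. The function `is_allowed_target` below is a hypothetical sketch, and treating `is_global` as the acceptance criterion is an assumption that may be too strict or too loose for a given application.

```python
import ipaddress


def is_allowed_target(ip_text):
    # Parse first: interpreters that reject leading zeros raise ValueError here,
    # older ones return the normalized address, which is then checked below.
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        return False
    # Loopback, private and link-local ranges are not global addresses.
    return address.is_global
```

Because the decision is made on the parsed value, `127.0.00.1` is refused regardless of whether the interpreter normalizes or rejects it.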
In Python versions before 3.10 (and before the security releases 3.6.13, 3.7.10, 3.8.8 and 3.9.2), the function urllib.parse.parse_qsl allows the use of both the ; and & characters as separators for URL query variables. What's interesting here is that the ; character is not recognized as a separator by other languages. In the following example, we would like to show why this behavior could lead to a vulnerability. Let's assume that we are running an infrastructure where the frontend is a PHP application and there is another internal Python application.
An attacker sends a GET request with the query string a=1;b=2 to the PHP frontend.
The PHP frontend recognizes only one query variable: a with the content 1;b=2. PHP does not treat ; characters as separators for query variables. Now the frontend forwards the attacker's request to an internal Python application with the query variable a=1;b=2. If urllib.parse.parse_qsl is used, the Python application processes two query variables: a=1 and b=2. This difference in the parsing of query variables can lead to fatal security vulnerabilities, like the web cache poisoning vulnerability in Django (CVE-2021-23336).
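The two interpretations can be compared on a patched interpreter, where `parse_qsl` accepts a `separator` argument and splits on `&` by default. Passing `separator=";"` is used here only to emulate how the same query string would be split on `;`.

```python
from urllib.parse import parse_qsl

query = "a=1;b=2"

# Patched interpreters split on "&" only, matching the PHP frontend's view.
default_pairs = parse_qsl(query)

# Splitting on ";" yields the two variables described in the text.
semicolon_pairs = parse_qsl(query, separator=";")
```

`default_pairs` contains a single variable `a` with the value `1;b=2`, while `semicolon_pairs` contains `a=1` and `b=2`, which is the mismatch a cache or frontend can be tricked by.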
In this blog post, we introduced 10 Python security pitfalls that we believe are less known among developers. Each subtle pitfall can be easily overlooked and has led to security vulnerabilities in real-world applications in the past.
We have seen that pitfalls can occur in all kinds of operations, from processing files, directories, archives, URLs, and IPs to simple strings. A common pattern is the use of library functions that can behave in unexpected ways. This reminds us to always upgrade to the latest version and to carefully read the documentation. At SonarSource, we research these pitfalls to continuously improve our code analyzers.