Recent Performance Improvements in Function Calls in CPython
Recent CPython updates have improved function call performance, reducing overhead in loops and built-in functions, with notable speed increases, making Python more efficient for developers.
Recent updates to CPython have significantly improved the performance of function calls and built-in functions, addressing long-standing concerns about their efficiency in Python. The analysis highlights three benchmarks that measure the impact of these optimizations. The first benchmark evaluates the overhead of executing simple instructions in a loop, showing that the introduction of super instructions in CPython 3.13 has reduced the number of bytecode instructions executed, enhancing speed. The second benchmark focuses on the cost of calling built-in functions such as `min`, showing that optimizations like the specialized `LOAD_GLOBAL_BUILTIN` instruction and the switch to the vectorcall convention have drastically improved performance, with some operations seeing up to a 200% speed increase. The third benchmark assesses the overhead of Python-to-Python function calls, where the inlining of such calls in CPython 3.11 has streamlined frame handling, resulting in notable gains. Overall, these enhancements indicate that function calls in Python are becoming less costly, making the language more efficient for developers.
- Recent CPython releases have improved function call performance significantly.
- Super instructions and instruction specialization have reduced overhead in executing loops.
- Built-in functions like `min` have seen performance improvements due to optimized calling conventions.
- Python-to-Python function calls have become faster with inlining introduced in CPython 3.11.
- Overall, these changes enhance Python's efficiency, addressing previous performance bottlenecks.
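The Python-to-Python inlining point can be observed with a small micro-benchmark (a hypothetical sketch, not from the article): since CPython 3.11 no longer pushes a full C-level frame for pure-Python calls, the gap between a loop that calls a function and a manually inlined loop has narrowed.

```python
import timeit

def work(x):
    return x + 1

def call_in_loop(n=100_000):
    # Pure Python-to-Python calls: cheaper since CPython 3.11
    # inlined the frame setup for such calls.
    total = 0
    for i in range(n):
        total += work(i)
    return total

def inline_loop(n=100_000):
    # Same arithmetic with the call manually inlined.
    total = 0
    for i in range(n):
        total += i + 1
    return total

assert call_in_loop() == inline_loop()
print("call:  ", timeit.timeit(call_in_loop, number=20))
print("inline:", timeit.timeit(inline_loop, number=20))
```

Comparing the two timings across interpreter versions (3.10 vs 3.11+) is the simplest way to see the inlining work.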
Related
Summary of Major Changes Between Python Versions
The article details Python updates from versions 3.7 to 3.12, highlighting async/await, Walrus operator, Type hints, F-strings, Assignment expressions, Typing enhancements, Structural Pattern Matching, Tomllib, and useful tools.
Free-threaded CPython is ready to experiment with
CPython 3.13 introduces free-threading to enhance performance by allowing parallel threads without the GIL. Challenges like thread-safety and ABI compatibility are being addressed for future adoption as the default build.
Mining JIT traces for missing optimizations with Z3
Using Z3, PyPy's JIT traces are analyzed to pinpoint inefficient integer operations for further optimization. By translating operations into Z3 formulas, redundancies are identified to enhance PyPy's JIT compiler efficiently.
Fast Multidimensional Matrix Multiplication on CPU from Scratch
The article examines multidimensional matrix multiplication performance on CPUs using Numpy and C++. It discusses optimization techniques and challenges in replicating Numpy's efficiency, emphasizing the importance of memory access patterns.
Python extensions should be lazy
Python's `ast.parse` function is slow due to memory management issues. A Rust extension improved AST processing speed by 16x, suggesting lazy loading strategies for better performance in Python extensions.
- Many commenters acknowledge that while Python's performance has improved, it still lags behind languages like Go and PHP in certain benchmarks.
- Real-world examples highlight the trade-offs between Python's rich library ecosystem and its performance limitations, especially in data processing tasks.
- Some users express skepticism about Python's performance, suggesting that it is often not a priority compared to code readability and flexibility.
- There are suggestions for optimizing Python code, such as minimizing function calls within loops to enhance performance.
- Overall, the community seems to appreciate ongoing performance enhancements but remains realistic about Python's inherent trade-offs.
Interestingly, Lua 5.4 did the opposite. Its implementation introduced C-level function calls for performance reasons [1] (although this change was reverted in 5.4.2 [2]).
[0] https://bugs.python.org/issue45256
[1] https://github.com/lua/lua/commit/196c87c9cecfacf978f37de4ec...
[2] https://github.com/lua/lua/commit/5d8ce05b3f6fad79e37ed21c10...
Here’s a real world example. I recently did some work implementing DSP data pipeline. We have a lot of code in Go, which I like generally. I looked at the library ecosystem in Go and there is no sensible set of standard filtering functions in any one library. I needed all the standards for my use case - Butterworth, Chebyshev, etc. and what I found was that they were all over the place, some libraries had one or another, but none had everything and they all had different interfaces. So I could have done it in Go, or I could have kept that part in Python and used SciPy. To me that’s an obvious choice because I and the business care more about getting something finished and working in a reasonable time, and in any case all the numeric work is in C anyway. In a couple of years, maybe that ecosystem for DSP will be better in Go, but right now it’s just not ready. This is the case with most of our algorithm/ML work. The orchestration ends up being in Go but almost everything scientific ends up in Python as the ecosystem is much much more mature.
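For readers unfamiliar with the SciPy side of that trade-off, a low-pass Butterworth filter is a few lines with `scipy.signal` (illustrative parameters; this is not the commenter's actual pipeline):

```python
import numpy as np
from scipy.signal import butter, lfilter

# Design a 4th-order low-pass Butterworth filter.
# fs and cutoff are made-up example values.
fs = 1000.0      # sampling rate, Hz
cutoff = 50.0    # cutoff frequency, Hz
b, a = butter(4, cutoff / (fs / 2), btype="low")

# Apply it to a noisy 5 Hz sine wave.
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)
y = lfilter(b, a, x)

print(b.shape, a.shape)  # (5,) (5,) — order 4 gives 5 coefficients each
```

Chebyshev variants are the analogous `scipy.signal.cheby1`/`cheby2` calls with the same interface, which is exactly the consistency the Go ecosystem was missing.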
The deviation in go's performance is still large, but far less so than Python's. Making the "wrong" choice for a single function call (bearing in mind that this is 10k iterations so we're still in the realm of scales even a moderate app can hit) in python is catastrophic, making the wrong choice for go is a significant slowdown but still 5x faster than doing it in Python. That sort of mental overhead is going to be everywhere, and it certainly doesn't encourage me to want to use python for a project.
[0] https://www.online-python.com/9gcpKLe458 [1] https://go.dev/play/p/zYKE0oZMFF4?v=goprev [2] https://news.ycombinator.com/item?id=41196915
Don’t waste time being surprised that you can do better than the default implementation. Just assume that and do so when you’ve measured that it matters.
That being said, (mostly) free performance is always nice. I’m glad they’re working on performance improvements where they can do it without sacrificing much.
min.py:

i = 10_000_000
r = 0
while i > 0:
    i = i - 1
    r += min(i, 500)
print(r)
min.php:

<?php
$i = 10_000_000;
$r = 0;
while ($i > 0) {
    $i = $i - 1;
    $r += min($i, 500);
}
print($r);
The results:

$ time python3 min.py
4999874750
real    0m2.523s

$ time php min.php
4999874750
real    0m0.333s
Looks like Python is still about 8x slower than PHP. Pretty significant. I ran it with Python 3.11.2 and PHP 8.2.18.
(It sure doesn't demonstrate the improvements between interpreter versions, but that's the classic, Python way of optimizing: let builtins do all the looping)
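That "let builtins do the looping" approach can be sketched for the benchmark above: `map` iterates in C and feeds pairs straight to the builtin `min`, so no Python-level bytecode runs per element, yet the answer is identical.

```python
from itertools import repeat

n = 10_000_000
# map() pairs each i from range(n) with 500 and calls min(i, 500)
# entirely inside the interpreter's C loops.
r = sum(map(min, range(n), repeat(500)))
print(r)  # 4999874750
```

On the same machine this style typically closes much of the gap with the PHP version, since the per-iteration bytecode dispatch disappears.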
import time
start = time.time()
import os
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
print("time elapsed (s):", "%0.3f"%(time.time() - start))
on my Windows machine is:

time elapsed (s): 2.630
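A common mitigation for that startup cost is to defer heavy imports into the functions that need them, so the price is paid on first use rather than at program start (a generic pattern; `json` stands in here for a heavy dependency):

```python
def load_config(text):
    # Deferred import: the module is loaded (and cached in
    # sys.modules) on the first call, not at program startup.
    import json
    return json.loads(text)

print(load_config('{"threads": 4}'))  # {'threads': 4}
```

Subsequent calls hit the `sys.modules` cache, so the deferred import costs essentially nothing after the first use.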
But sometimes you can improve on built-in functions. I found that a custom (but simple) int-to-string function in Go is a bit quicker than strconv.FormatInt() for decimal numbers.
So, there's that.
Python isn't really supposed to be geared towards performance. I like the language, but only see articles like this as resume fodder.
https://github.com/svilendobrev/transit-python3/blob/master/...
(comment out the import transit.* and the two checks after it as they are specific. Takes ~25 seconds to finish)
Results are below. Most make sense after thinking more deeply about them, but some are weird.
One thing stays axiomatic though: no way of doing something is faster than not doing it at all. Lesson: measure before assuming anything "from-not-so-fresh-experience".
btw, unrelated, probably will be looking for work next month. Have fun.
$ python timing-probi.py
...
:::: c_pfx_check ::::
f_tuple 0.10805996900126047
f_list 0.10568888399939169
f_tuple_global 0.10741564899944933
f_list_global 0.10980218799886643
f_dict_global 0.09630626599937386
f_tuple_global20 0.6103107449998788
f_list_global20 0.6878404369999771
f_one_by_one 0.05088467200039304
:::: c_func_glob_vs_staticmethod ::::
f_glob_func 0.08005491699987033
f_staticmethd 0.10022392999962904
:::: c_for_loop_vs_gen_for ::::
for_loop 0.2255296620005538
gen_loop 0.29973782500019297
:::: c_dictget_vs_dict_subscr ::::
dictgetattr_get 0.05093873599980725
dictfuncget 0.048424991000501905
dictin_dictsubscr 0.04722780499832879
dictsubscr 0.04069488099958107
:::: c_listget_vs_list_subscr ::::
listgetattr_get 0.0779018819994235
listfuncget 0.07271830799982126
listsubscr 0.057812218999970355
:::: c_property_vs_funccall ::::
property 0.08194994600125938
funccall 0.08422214100028214
:::: c_a_in_abc_vs_a_eq_b_or ::::
x_in_str 0.04265176899934886
x_in_tuple 0.0530087259994616
x_in_tuple_global 0.05479079300130252
x_eq_a_or 0.049807468998551485
:::: c_tuple_vs_slots ::::
slots 0.17708088200015482
plain 0.18551878399921407
tuple 0.07675717399979476
dict 0.14878148099887767
namedtuple 0.2637523979992693
dataclass 0.18731526199917425
dataclassfrozen 0.38634534500124573
:::: c_dictcomp_vs_dict_gen_tuples_vs_loop ::::
dictcomp 4.476536423999278
dict_gen_tuples 7.045798945999195
dict_listcomp 6.461099333000675
dictloop 4.889642943000581
:::: c_listcomp_vs_list_gen_vs_loop ::::
listcomp 1.4269603859993367
list_gen 2.6340354429994477
listloop1 1.856299590001072
listloop2 2.1041324060006446
:::: c_funccall_args_vs_kargs ::::
args 0.14374798999961058
kargs 0.1689707850000559
kargs_ignored 0.2658924890001799
kargs_default 0.26562809300048684
inline_min = min
while expr:
    if inline_min(blah):
        ...
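Made runnable, the idea above is to cache the builtin lookup in a local variable, trading a per-iteration global/builtin lookup (`LOAD_GLOBAL`) for a cheap local one (`LOAD_FAST`); the specialization work described in the article shrinks exactly this gap. Names here are illustrative:

```python
import timeit

def use_global(n=100_000):
    total = 0
    for i in range(n):
        total += min(i, 500)        # builtin lookup on every pass
    return total

def use_local(n=100_000):
    inline_min = min                # cached once as a local
    total = 0
    for i in range(n):
        total += inline_min(i, 500)
    return total

assert use_global() == use_local()
print("global:", timeit.timeit(use_global, number=20))
print("local: ", timeit.timeit(use_local, number=20))
```

On 3.12+ the two timings should be close, which is the article's point: this classic micro-optimization is becoming unnecessary.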