Recent Performance Improvements in Function Calls in CPython
Recent CPython updates have improved function call performance, reducing overhead in loops and built-in functions, with notable speed increases, making Python more efficient for developers.
Recent updates to CPython have significantly improved the performance of function calls and built-in functions, addressing long-standing concerns about their efficiency in Python. The analysis highlights three benchmarks that measure the impact of these optimizations. The first benchmark evaluates the overhead of executing simple instructions in a loop, showing that the introduction of super instructions in CPython 3.13 has reduced the number of bytecode instructions executed, enhancing speed. The second benchmark focuses on the cost of calling built-in functions such as `min`, showing that optimizations like the specialized `LOAD_GLOBAL_BUILTIN` instruction and the switch to the vectorcall convention have drastically improved performance, with some operations seeing up to a 200% speed increase. The third benchmark assesses the overhead of Python-to-Python function calls, where the inlining of such calls in CPython 3.11 has streamlined frame handling, resulting in notable gains. Overall, these enhancements indicate that function calls in Python are becoming less costly, making the language more efficient for developers.
- Recent CPython releases have improved function call performance significantly.
- Super instructions and instruction specialization have reduced overhead in executing loops.
- Built-in functions like `min` have seen performance improvements due to optimized calling conventions.
- Python-to-Python function calls have become faster with inlining introduced in CPython 3.11.
- Overall, these changes enhance Python's efficiency, addressing previous performance bottlenecks.
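The Python-to-Python inlining point can be observed with a small micro-benchmark (a hypothetical sketch, not from the article): since CPython 3.11 no longer pushes a full C-level frame for pure-Python calls, the gap between a loop that calls a function and a manually inlined loop has narrowed.

```python
import timeit

def work(x):
    return x + 1

def call_in_loop(n=100_000):
    # Pure Python-to-Python calls: cheaper since CPython 3.11
    # inlined the frame setup for such calls.
    total = 0
    for i in range(n):
        total += work(i)
    return total

def inline_loop(n=100_000):
    # Same arithmetic with the call manually inlined.
    total = 0
    for i in range(n):
        total += i + 1
    return total

assert call_in_loop() == inline_loop()
print("call:  ", timeit.timeit(call_in_loop, number=20))
print("inline:", timeit.timeit(inline_loop, number=20))
```

Comparing the two timings across interpreter versions (3.10 vs 3.11+) is the simplest way to see the inlining work.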
Related
Summary of Major Changes Between Python Versions
The article details Python updates from versions 3.7 to 3.12, highlighting async/await, Walrus operator, Type hints, F-strings, Assignment expressions, Typing enhancements, Structural Pattern Matching, Tomllib, and useful tools.
Free-threaded CPython is ready to experiment with
CPython 3.13 introduces free-threading to enhance performance by allowing parallel threads without the GIL. Challenges like thread-safety and ABI compatibility are being addressed for future adoption as the default build.
Mining JIT traces for missing optimizations with Z3
Using Z3, PyPy's JIT traces are analyzed to pinpoint inefficient integer operations for further optimization. By translating operations into Z3 formulas, redundancies are identified to enhance PyPy's JIT compiler efficiently.
Fast Multidimensional Matrix Multiplication on CPU from Scratch
The article examines multidimensional matrix multiplication performance on CPUs using Numpy and C++. It discusses optimization techniques and challenges in replicating Numpy's efficiency, emphasizing the importance of memory access patterns.
Python extensions should be lazy
Python's `ast.parse` function is slow due to memory management issues. A Rust extension improved AST processing speed by 16x, suggesting lazy loading strategies for better performance in Python extensions.
- Many commenters acknowledge that while Python's performance has improved, it still lags behind languages like Go and PHP in certain benchmarks.
- Real-world examples highlight the trade-offs between Python's rich library ecosystem and its performance limitations, especially in data processing tasks.
- Some users express skepticism about Python's performance, suggesting that it is often not a priority compared to code readability and flexibility.
- There are suggestions for optimizing Python code, such as minimizing function calls within loops to enhance performance.
- Overall, the community seems to appreciate ongoing performance enhancements but remains realistic about Python's inherent trade-offs.
Interestingly, Lua 5.4 did the opposite. Its implementation introduced C-level function calls for performance reasons [1] (although this change was reverted in 5.4.2 [2]).
[0] https://bugs.python.org/issue45256
[1] https://github.com/lua/lua/commit/196c87c9cecfacf978f37de4ec...
[2] https://github.com/lua/lua/commit/5d8ce05b3f6fad79e37ed21c10...
Here’s a real world example. I recently did some work implementing DSP data pipeline. We have a lot of code in Go, which I like generally. I looked at the library ecosystem in Go and there is no sensible set of standard filtering functions in any one library. I needed all the standards for my use case - Butterworth, Chebyshev, etc. and what I found was that they were all over the place, some libraries had one or another, but none had everything and they all had different interfaces. So I could have done it in Go, or I could have kept that part in Python and used SciPy. To me that’s an obvious choice because I and the business care more about getting something finished and working in a reasonable time, and in any case all the numeric work is in C anyway. In a couple of years, maybe that ecosystem for DSP will be better in Go, but right now it’s just not ready. This is the case with most of our algorithm/ML work. The orchestration ends up being in Go but almost everything scientific ends up in Python as the ecosystem is much much more mature.
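For readers unfamiliar with the SciPy side of that trade-off, a low-pass Butterworth filter is a few lines with `scipy.signal` (illustrative parameters; this is not the commenter's actual pipeline):

```python
import numpy as np
from scipy.signal import butter, lfilter

# Design a 4th-order low-pass Butterworth filter.
# fs and cutoff are made-up example values.
fs = 1000.0      # sampling rate, Hz
cutoff = 50.0    # cutoff frequency, Hz
b, a = butter(4, cutoff / (fs / 2), btype="low")

# Apply it to a noisy 5 Hz sine wave.
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)
y = lfilter(b, a, x)

print(b.shape, a.shape)  # (5,) (5,) — order 4 gives 5 coefficients each
```

Chebyshev variants are the analogous `scipy.signal.cheby1`/`cheby2` calls with the same interface, which is exactly the consistency the Go ecosystem was missing.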
The deviation in go's performance is still large, but far less so than Python's. Making the "wrong" choice for a single function call (bearing in mind that this is 10k iterations so we're still in the realm of scales even a moderate app can hit) in python is catastrophic, making the wrong choice for go is a significant slowdown but still 5x faster than doing it in Python. That sort of mental overhead is going to be everywhere, and it certainly doesn't encourage me to want to use python for a project.
[0] https://www.online-python.com/9gcpKLe458 [1] https://go.dev/play/p/zYKE0oZMFF4?v=goprev [2] https://news.ycombinator.com/item?id=41196915
Don’t waste time being surprised that you can do better than the default implementation. Just assume that and do so when you’ve measured that it matters.
That being said, (mostly) free performance is always nice. I’m glad they’re working on performance improvements where they can do it without sacrificing much.
min.py:

i = 10_000_000
r = 0
while i > 0:
    i = i - 1
    r += min(i, 500)
print(r)
min.php:

<?php
$i = 10_000_000;
$r = 0;
while ($i > 0) {
    $i = $i - 1;
    $r += min($i, 500);
}
print($r);
The results:

$ time python3 min.py
4999874750
real    0m2.523s

$ time php min.php
4999874750
real    0m0.333s
Looks like Python is still about 8x slower than PHP. Pretty significant. I ran it with Python 3.11.2 and PHP 8.2.18.
(It sure doesn't demonstrate the improvements between interpreter versions, but that's the classic, Python way of optimizing: let builtins do all the looping)
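That "let builtins do the looping" approach can be sketched for the benchmark above: `map` iterates in C and feeds pairs straight to the builtin `min`, so no Python-level bytecode runs per element, yet the answer is identical.

```python
from itertools import repeat

n = 10_000_000
# map() pairs each i from range(n) with 500 and calls min(i, 500)
# entirely inside the interpreter's C loops.
r = sum(map(min, range(n), repeat(500)))
print(r)  # 4999874750
```

On the same machine this style typically closes much of the gap with the PHP version, since the per-iteration bytecode dispatch disappears.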
import time
start = time.time()
import os
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
print("time elapsed (s):", "%0.3f"%(time.time() - start))
on my Windows machine is:

time elapsed (s): 2.630
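A common mitigation for that startup cost is to defer heavy imports into the functions that need them, so the price is paid on first use rather than at program start (a generic pattern; `json` stands in here for a heavy dependency):

```python
def load_config(text):
    # Deferred import: the module is loaded (and cached in
    # sys.modules) on the first call, not at program startup.
    import json
    return json.loads(text)

print(load_config('{"threads": 4}'))  # {'threads': 4}
```

Subsequent calls hit the `sys.modules` cache, so the deferred import costs essentially nothing after the first use.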
But sometimes you can improve on built-in functions. I found that a custom (but simple) int-to-string function in Go is a bit quicker than strconv.FormatInt() for decimal numbers.
So, there's that.
Python isn't really supposed to be geared towards performance. I like the language, but only see articles like this as resume fodder.
https://github.com/svilendobrev/transit-python3/blob/master/...
(comment out the import transit.* and the two checks after it as they are specific. Takes ~25 seconds to finish)
Results are below. Most make sense after thinking more deeply about them, but some are weird.
One thing stays axiomatic though: no way of doing something is faster than not doing it at all. Lesson: measure before assuming anything "from-not-so-fresh-experience".
btw, unrelated, probably will be looking for work next month. Have fun.
$ python timing-probi.py
...
:::: c_pfx_check ::::
f_tuple 0.10805996900126047
f_list 0.10568888399939169
f_tuple_global 0.10741564899944933
f_list_global 0.10980218799886643
f_dict_global 0.09630626599937386
f_tuple_global20 0.6103107449998788
f_list_global20 0.6878404369999771
f_one_by_one 0.05088467200039304
:::: c_func_glob_vs_staticmethod ::::
f_glob_func 0.08005491699987033
f_staticmethd 0.10022392999962904
:::: c_for_loop_vs_gen_for ::::
for_loop 0.2255296620005538
gen_loop 0.29973782500019297
:::: c_dictget_vs_dict_subscr ::::
dictgetattr_get 0.05093873599980725
dictfuncget 0.048424991000501905
dictin_dictsubscr 0.04722780499832879
dictsubscr 0.04069488099958107
:::: c_listget_vs_list_subscr ::::
listgetattr_get 0.0779018819994235
listfuncget 0.07271830799982126
listsubscr 0.057812218999970355
:::: c_property_vs_funccall ::::
property 0.08194994600125938
funccall 0.08422214100028214
:::: c_a_in_abc_vs_a_eq_b_or ::::
x_in_str 0.04265176899934886
x_in_tuple 0.0530087259994616
x_in_tuple_global 0.05479079300130252
x_eq_a_or 0.049807468998551485
:::: c_tuple_vs_slots ::::
slots 0.17708088200015482
plain 0.18551878399921407
tuple 0.07675717399979476
dict 0.14878148099887767
namedtuple 0.2637523979992693
dataclass 0.18731526199917425
dataclassfrozen 0.38634534500124573
:::: c_dictcomp_vs_dict_gen_tuples_vs_loop ::::
dictcomp 4.476536423999278
dict_gen_tuples 7.045798945999195
dict_listcomp 6.461099333000675
dictloop 4.889642943000581
:::: c_listcomp_vs_list_gen_vs_loop ::::
listcomp 1.4269603859993367
list_gen 2.6340354429994477
listloop1 1.856299590001072
listloop2 2.1041324060006446
:::: c_funccall_args_vs_kargs ::::
args 0.14374798999961058
kargs 0.1689707850000559
kargs_ignored 0.2658924890001799
kargs_default 0.26562809300048684
inline_min = min
while expr:
    if inline_min(blah):
        ...
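Made runnable, the idea above is to cache the builtin lookup in a local variable, trading a per-iteration global/builtin lookup (`LOAD_GLOBAL`) for a cheap local one (`LOAD_FAST`); the specialization work described in the article shrinks exactly this gap. Names here are illustrative:

```python
import timeit

def use_global(n=100_000):
    total = 0
    for i in range(n):
        total += min(i, 500)        # builtin lookup on every pass
    return total

def use_local(n=100_000):
    inline_min = min                # cached once as a local
    total = 0
    for i in range(n):
        total += inline_min(i, 500)
    return total

assert use_global() == use_local()
print("global:", timeit.timeit(use_global, number=20))
print("local: ", timeit.timeit(use_local, number=20))
```

On 3.12+ the two timings should be close, which is the article's point: this classic micro-optimization is becoming unnecessary.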