September 5th, 2024

Giving C++ std::regex a C makeover

A C interface for C++ `std::regex` is proposed to simplify usage from C, with an arena handling memory management. The approach can improve performance, particularly with MSVC, but it lacks Unicode support, and `std::regex` itself remains inherently slow.

This article discusses the process of creating a C interface for the C++ standard library's regex functionality, specifically `std::regex`. The author highlights the challenges of using C++ libraries in a C environment and proposes a solution that wraps `std::regex` in a C-friendly interface. The new interface, defined in `regex.h`, utilizes structures for string and memory management, allowing for regex operations without exposing C++ complexities. The implementation avoids memory allocation issues by using an arena for memory management, which simplifies the cleanup process. The article provides example usage of the new interface, demonstrating how to create regex objects and match strings. The author notes that while this approach can improve performance, especially in MSVC, it has limitations, such as lack of Unicode support and the inherent slowness of `std::regex`. The article concludes by acknowledging the trade-offs involved in using this method, including the size of the resulting DLL and the potential need for alternative regex libraries.

- A C interface for C++ `std::regex` is proposed to simplify usage in C environments.

- The implementation uses an arena for memory management, avoiding individual memory deallocations.

- Example code demonstrates how to create regex objects and perform string matching.

- The approach can lead to performance improvements, particularly in MSVC.

- Limitations include lack of Unicode support and the overall slowness of `std::regex`.
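The general shape of such a wrapper can be sketched as follows. This is a minimal illustration, not the article's actual `regex.h`: the names `str_view`, `re_handle`, `re_compile`, `re_search`, and `re_free` are hypothetical, and for brevity it uses ordinary `new`/`delete` rather than the arena described above. It shows the two essentials: an opaque handle hides the C++ type from C callers, and no exception is allowed to cross the C boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>

extern "C" {

// Hypothetical pointer+length string type visible to C callers.
typedef struct {
    const char    *data;
    std::ptrdiff_t len;
} str_view;

// Opaque handle: C code only ever sees a pointer to it.
typedef struct re_handle re_handle;

re_handle *re_compile(str_view pattern);
int        re_search(const re_handle *re, str_view subject, str_view *out);
void       re_free(re_handle *re);

}  // extern "C"

// The C++ side knows what the handle really contains.
struct re_handle {
    std::regex impl;
};

extern "C" re_handle *re_compile(str_view pattern)
{
    try {
        return new re_handle{std::regex(pattern.data, pattern.data + pattern.len)};
    } catch (...) {
        return nullptr;  // invalid pattern or out of memory
    }
}

// Returns 1 on match (filling *out with the matched span), 0 on no match,
// and -1 if the engine threw.
extern "C" int re_search(const re_handle *re, str_view subject, str_view *out)
{
    assert(re != nullptr && out != nullptr);
    try {
        std::cmatch m;
        const char *end = subject.data + subject.len;
        if (!std::regex_search(subject.data, end, m, re->impl)) {
            return 0;
        }
        out->data = m[0].first;     // points into the caller's subject buffer
        out->len  = m[0].length();
        return 1;
    } catch (...) {
        return -1;
    }
}

extern "C" void re_free(re_handle *re)
{
    delete re;
}
```

A C caller would include only the declarations inside the `extern "C"` block, written with C's `ptrdiff_t` from `<stddef.h>`, and link against the compiled C++ object. An arena-based variant, as the article describes, would drop `re_free` and release everything at once.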

AI: What people are saying
The comments on the article reflect a mix of opinions regarding the proposed C interface for C++ `std::regex`.
  • Many commenters express concerns about the performance of `std::regex`, with some suggesting that it is too slow for practical use.
  • There is a discussion about memory management, with some praising the arena allocation approach while others find it confusing for C programmers.
  • Several users mention the desire for a simpler, full C implementation without C++ complexities.
  • Some commenters appreciate the author's innovative approach, while others feel he should have addressed the limitations of existing C regex libraries more openly.
  • Overall, there is a shared interest in improving regex functionality in C, but skepticism about the proposed solution's effectiveness.
13 comments
By @lieks - 8 months
For context: the author (see his other posts) is exploring the possibilities of writing C with no C runtime to avoid having to deal with it on Windows. He began to kind of treat it as a new language, with the string type, arenas and such, which help avoid memory bugs (and from my experience, are very useful).

This is a pretty cool hack. Makes me want to write a regex library again.

By @judah - 8 months
The article was interesting, but even more so was his link to arena allocation in C: https://www.rfleury.com/p/untangling-lifetimes-the-arena-all...

This comprehensive article goes over the problems of memory allocation, how programmers and educators have been trained to think about the problem wrongly, and how the concept of arenas solves it.

As someone who spends most of his time in garbage collected languages, this was wildly fascinating to me.

By @jklowden - 8 months
So bad is the performance of gcc std::regex that I reimplemented part of it using regex(3). Of course, I didn’t discover the problem until I’d committed to the interface, so I put mine in namespace dts, just in case one day the supplied implementation becomes useful.

As it stands, std::regex should come with a warning label. It’s fine for occasional use. As part of a parser, it’s not. Slow is better than broken, until slow is broken.
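For readers unfamiliar with regex(3), a minimal sketch of calling the POSIX API from C++ might look like this. The helper name `posix_first_match` is hypothetical and is not the commenter's `dts` code; it only illustrates the `regcomp`/`regexec`/`regfree` sequence on a POSIX system.

```cpp
#include <cassert>
#include <regex.h>   // POSIX regex(3), not the C++ <regex> header
#include <string>

// Hypothetical helper: finds the first match of the extended regular
// expression `pattern` in `text`. Returns true and stores the matched
// substring in `out` on success; returns false on no match or bad pattern.
bool posix_first_match(const char *pattern, const std::string &text,
                       std::string &out)
{
    assert(pattern != nullptr);

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return false;  // pattern failed to compile
    }

    regmatch_t m[1];  // m[0] receives the offsets of the whole match
    bool found = regexec(&re, text.c_str(), 1, m, 0) == 0;
    if (found) {
        out = text.substr(m[0].rm_so, m[0].rm_eo - m[0].rm_so);
    }

    regfree(&re);  // release memory owned by the compiled pattern
    return found;
}
```

Note that `regexec` works on NUL-terminated strings, so this interface cannot search text containing embedded NUL bytes.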

By @yosefk - 8 months
Around 30 years ago, STL introduced an allocator template parameter everywhere to let you control allocation. Here in 2024 we read about making use of the, erm, strange semantics of dynamic linking to force standard C++ code to allocate your way.
By @lelanthran - 8 months
I can't say that I like this very much.

Problematic macro in the header, custom string type compatible with nothing else in C, and I have no idea where the arena type comes from.

Having it magically deallocate memory is nice, but will confuse C programmers reading the caller.

Honestly, adding -lre to the linker is just much easier, and that library comes with docs too.

By @unwind - 8 months
This is fun and impressive, but I feel the author kind of misses out on explaining in the intro why it would be wrong to just ... use C's regex library [1]?

I guess the entire post could be seen as an exercise in wrapping C++ to C with nice memory-handling properties and so on, but it would also be fine to be open and upfront about that, in my opinion.

1: https://www.man7.org/linux/man-pages/man3/regex.3.html

By @malkia - 8 months
Back in the old days of console game programming, most SDKs would come with something like:

my_audio_sdk_init(&arena, sizeof(arena)); // char arena[65536]; // or something like this
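What a library might do with such a caller-supplied buffer can be sketched as a bump allocator. The names `Arena`, `arena_init`, and `arena_alloc` are hypothetical, and this is a generic illustration rather than any particular SDK's or the article's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical arena: a region [beg, end) of caller-owned memory.
struct Arena {
    char *beg;
    char *end;
};

// Wrap a caller-supplied buffer; the library never calls malloc.
Arena arena_init(void *buf, std::size_t size)
{
    char *p = static_cast<char *>(buf);
    return Arena{p, p + size};
}

// Carve `size` bytes aligned to `align` (a power of two) off the front
// of the arena. Returns nullptr when the arena is exhausted.
void *arena_alloc(Arena *a, std::size_t size, std::size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    std::size_t avail = static_cast<std::size_t>(a->end - a->beg);
    // Bytes needed to round beg up to the next multiple of align.
    std::size_t pad =
        (0 - reinterpret_cast<std::uintptr_t>(a->beg)) & (align - 1);
    if (pad > avail || size > avail - pad) {
        return nullptr;
    }

    void *p = a->beg + pad;
    a->beg += pad + size;
    return p;
}
```

There is no per-object free: the caller reclaims everything at once by discarding or reusing the buffer, which is the property that simplifies cleanup.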

By @WalterBright - 8 months
> The regex engine allocates everything in the arena, including all temporary working memory used while compiling, matching, etc.

I do something quite different. I design the API so any data returned by the library function is allocated by the caller. This means the caller has full control over what style of memory management works best.

For example, you can then choose to use stack allocation, RAII, malloc/free, the GC, static allocation, etc.

For a primitive example, snprintf.
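A small sketch of that caller-allocates style, built directly on `snprintf`: the function `format_size` is a hypothetical example, not something from the comment or the article. The caller owns the buffer, and the return value reports how much space the full result needs, so the caller can size a buffer by whatever means it prefers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical library function: formats "WIDTHxHEIGHT" into the caller's
// buffer. Like snprintf, it writes at most cap bytes (including the NUL)
// and returns the length the complete string would have.
int format_size(char *out, std::size_t cap, int width, int height)
{
    return std::snprintf(out, cap, "%dx%d", width, height);
}
```

Calling it with a null buffer and a capacity of zero returns the required length without writing anything, after which the caller can allocate that many bytes plus one on the stack, from `malloc`, or from an arena.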

By @chrsw - 8 months
This guy is brilliant. He tries to simplify things when so many are going the other way.
By @qalmakka - 8 months
std::regex has such horrible performance that it's probably better not to use it even in C++.
By @D4ckard - 8 months
I’d really like to see a full C implementation of this interface. Remove all the C++ complexity in the back end
By @rurban - 8 months
Why wrap the extremely poor and slow std::regex when you have pcre2?
By @tbe-stream - 8 months
This seems like a bad idea, if only because std::regex (performance) is horrible.