June 20th, 2024

Wc2: Investigates optimizing 'wc', the Unix word count program

The GitHub project "wc2" presents an innovative algorithm for the `wc` program, focusing on asynchronous state-machine parsing in C and JavaScript. It enhances efficiency, scalability, and speed compared to traditional `wc` implementations.


The GitHub project "wc2 - asynchronous state machine parsing" introduces a novel algorithm for the classic `wc` program, emphasizing an asynchronous state-machine parsing technique. It offers versions in C and JavaScript, including `wc2o.c` as a simplified version, `wc2.c` supporting Unicode, and `wc2.js` for JavaScript. The algorithm processes input byte by byte through a state-machine for parsing, showcasing improved efficiency and scalability. Benchmarking results compare its performance with traditional `wc` programs on various input files. The project provides tools like `wctool` for test file generation, `wcdiff` for comparison, and `wcstream` for input file fragmentation. It delves into pointer arithmetic in C, offering an `-P` option in `wc2.c` to evaluate speed differences. Overall, the project highlights the advantages of the asynchronous state-machine parsing algorithm in terms of speed, scalability, and efficiency over conventional `wc` programs.

Related

X debut 40 years ago (1984)

Robert W. Scheifler introduced the X window system in June 1984 for the VS100 Unix server, offering improved performance over W. The system was stable, with the Laboratory for Computer Science already transitioning to X and developing applications. Scheifler encouraged experimentation and welcomed volunteers for documentation contributions.

Let's write a video game from scratch like it's 1987

Philippe Gaultier created Minesweeper in 1987 using X11 without libraries, resulting in a compact executable. The article explores X authentication, Odin language usage, and minimalistic GUI development, offering insights into X11 intricacies. Source code is on GitHub.

Eight million pixels and counting: improving texture atlas allocation in Firefox (2021)

Improving texture atlas allocation in WebRender with the guillotiere crate reduces texture memory usage. The guillotine algorithm was replaced due to fragmentation issues, leading to a more efficient allocator. Visualizing the atlas in SVG aids debugging. Rust's simplicity and Cargo fuzz testing are praised for code development and robustness. Enhancements in draw call batching and texture upload aim to boost performance on low-end Intel GPUs by optimizing texture atlases.

Exposition of Front End Build Systems

Frontend build systems are crucial in web development, involving transpilation, bundling, and minification steps. Tools like Babel and Webpack optimize code for performance and developer experience. Various bundlers like Webpack, Rollup, Parcel, esbuild, and Turbopack are compared for features and performance.

Gren 0.4: New Foundations

Gren 0.4 updates its functional language with enhanced core packages, a new compiler, revamped FileSystem API, improved functions, and a community shift to Discord. These updates aim to boost usability and community engagement.

32 comments
By @jmholla - 4 months
> No, you aren't suppose to be able to see how the word-count works by looking at this code. The complexity happens elsewhere, setting up the state-machine.

This is like most of the reason I opened this article. I wish they'd spend more time talking about this. In fact, most of the README covers details that are important but irrelevant to their algorithmic improvements.

Maybe they expect you to open up the code, but even there the comments are largely describing what things do and not why, so I feel you need to have some understanding of the algorithm in question. For example, the parts where the state table gets constructed have rather unhelpful comments for the uninformed:

    /**
     * Build an ASCII row. This configures low-order 7-bits, which should
     * be roughly the same for all states
     */
    static void
    build_basic(unsigned char *row, unsigned char default_state, unsigned char ubase)
    
    ...
    
    /**
     * This function compiles a DFA-style state-machine for parsing UTF-8
     * variable-length byte sequences.
     */
    static void
    compile_utf8_statemachine(int is_multibyte)
Even `wc2o.c` doesn't delve into the magic numbers it has chosen. I was hoping this repo would be more educational and explain how the state table works and why it's constructed the way it is.

Does anyone have a good resource for learning more about asynchronous state-machine parsers, one that could also help explain why this is better than just iterating over characters? I'm guessing maybe it's the lack of branching?

By @lesuorac - 4 months
State machines are an underrated approach.

Just remember: if you're ever debugging something and some values in a Class/Object/Group of them are set and others are unset, a state machine avoids that issue. When you go to transition between states, check that the future state will be valid and, if not, go to an error state with that information. The amount of time I've seen spent debugging where spurious `null`s came from completely dwarfs the time it would've taken just to write that class/algorithm as a state machine.

Kinda surprised WC wasn't a state machine to begin with. Isn't it effectively a special-character counter where, if you see a character followed by a space, you bump up the word count? Judging from the repo's comment ("The real programs spend most of their time in functions like mbrtowc() to parse multi-byte characters and iswspace() to test if they are spaces -- which re-implementations of wc skip"), the big improvement seems to be removing the unnecessary work in those functions. mbrtowc [1] appears to re-create the provided substring, which isn't necessary for counting.

[1]: https://en.cppreference.com/w/cpp/string/multibyte/mbrtowc
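
For reference, here is a rough sketch (mine, not coreutils' code) of the conventional locale-aware loop the README is contrasting against; each character is decoded and classified one at a time, which is where the mbrtowc()/iswspace() time goes (here libc does the equivalent decoding inside getwchar()):

    /*
     * Rough sketch of a locale-aware counting loop (not coreutils' wc).
     * Counts lines, words, and characters (wc -m style), one character at a time.
     */
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>
    #include <wctype.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");          /* honour the user's locale, e.g. UTF-8 */
        unsigned long lines = 0, words = 0, chars = 0;
        int in_word = 0;
        wint_t wc;
        while ((wc = getwchar()) != WEOF) {
            chars++;                       /* characters, not bytes */
            if (wc == L'\n')
                lines++;
            if (iswspace(wc))
                in_word = 0;
            else {
                words += !in_word;         /* count space -> non-space transitions */
                in_word = 1;
            }
        }
        printf("%lu %lu %lu\n", lines, words, chars);
        return 0;
    }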

By @btown - 4 months
State machines are great for complex situations, but when it comes to performance, it's not at all clear to me that they're the most scalable approach with modern systems.

The data dependency between loop iterations, one per character, might be pipelined really well when executed, and we can assume a large enough L1/L2 cache for our lookup tables. But we're still doing at least one lookup per character.

Projects like https://github.com/simdjson/simdjson?tab=readme-ov-file#abou... are truly fascinating, because they're based on SIMD instructions that can process 64 or more bytes with a single instruction. Very much worth checking out the papers at that link.
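
As a concrete taste of that (a hedged, standalone kernel of my own, not how wc2 or simdjson are actually structured): with AVX2 you can classify 32 bytes per instruction. Newline counting is the easy case; full word counting would additionally need a shifted comparison to find space/non-space boundaries. Compile with gcc or clang and -mavx2.

    /* Illustrative AVX2 kernel: count '\n' bytes 32 at a time. */
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    static size_t count_newlines_avx2(const uint8_t *buf, size_t len)
    {
        const __m256i nl = _mm256_set1_epi8('\n');
        size_t count = 0, i = 0;
        for (; i + 32 <= len; i += 32) {
            __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
            __m256i eq = _mm256_cmpeq_epi8(chunk, nl);           /* 0xFF where byte == '\n' */
            unsigned mask = (unsigned)_mm256_movemask_epi8(eq);  /* one bit per byte */
            count += (size_t)__builtin_popcount(mask);
        }
        for (; i < len; i++)                                     /* scalar tail */
            count += (buf[i] == '\n');
        return count;
    }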

By @GrantMoyer - 4 months
> The algorithm is known as an "asynchronous state-machine parser". It's a technique for parsing that you don't learn in college.

The mapping from regular and context free languages to state machines and their table-based implementations was covered in depth in the Compilers course I took. A table based state machine is literally the textbook algorithm for parsing these classes of languages, and implementing them has been almost entirely automated by tools like Flex and Bison since the 80s.

By @mananaysiempre - 4 months
> This is analogous to NFA and DFA regular-expressions. If you use the NFA approach, you need to buffer the entire chunk of data, so that the regex can backtrack. Using the DFA approach, input can be provided as a stream

Sorry, no, this is completely wrong. There are other reasons NFAs or DFAs might be preferable, and there are other performance considerations as far as backtracking is concerned, but NFA implementations do not backtrack. I mean, they could, but they don’t, and in common usage a “NFA regex implementation” means a non-backtracking one (that goes through all input once with a bag of states in hand and advances each one of them on each step), whereas a “backtracking regex implementation” means one that doesn’t use automata at all (and is usually worst-case exponential in the input).

What this is actually talking about is a streaming parser. Just call it like it is, it’s an OK term that sees plenty of usage.

[What you could mention here is PEG parsers, because even linear PEG implementations may need to keep O(input) state and thus are essentially incapable of streaming, unlike LL or LR ones; and GLR et al. can be even worse, for what it’s worth. But this is a much more obscure topic.]
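
A toy illustration of the "bag of states" point (my own sketch, not any regex library's code): the set of live NFA states sits in a bitmask and every state is advanced on each input byte, in a single left-to-right pass with no backtracking. The hard-coded NFA below matches the pattern "ab*c" against a whole string.

    #include <stdbool.h>
    #include <stdio.h>

    /* States: bit 0 = start, bit 1 = saw 'a' (loops on 'b'), bit 2 = accept. */
    static unsigned step(unsigned states, char ch)
    {
        unsigned next = 0;
        if ((states & 1u) && ch == 'a') next |= 2u;   /* start --a--> middle  */
        if ((states & 2u) && ch == 'b') next |= 2u;   /* middle --b--> middle */
        if ((states & 2u) && ch == 'c') next |= 4u;   /* middle --c--> accept */
        return next;
    }

    static bool match(const char *s)
    {
        unsigned states = 1u;                 /* only the start state is live */
        for (; *s; s++)
            states = step(states, *s);        /* advance the whole bag of states */
        return (states & 4u) != 0;            /* is the accept state live? */
    }

    int main(void)
    {
        printf("%d %d %d\n", match("abbbc"), match("ac"), match("abd"));  /* 1 1 0 */
        return 0;
    }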

By @moody__ - 4 months
The "asynchronous state machine" name here is a bit strange, when searching for this term used elsewhere I couldn't find any formal definition what it is. Reading further in the README it looks like the author implies that it really just means a DFA? Not entirely sure.

I'd also like to add the Plan 9 implementation[0], which also uses the properties of utf8 as part of its state machine and anecdotally has been quite performant when I've used it.

[0] http://git.9front.org/plan9front/plan9front/107a7ba9717429ae...

By @Maro - 4 months
I wrote a clean C++ version of `wc` for fun, and successfully beat the system version by 10-20% iirc. My goal was to make the code much more readable than the GNU/C version, which is a big spaghetti loop, while maintaining the same semantics (ie. return the same counts).

Blog posts:

https://bytepawn.com/tag/wc.html

Code:

https://github.com/mtrencseni/wcpp/

By @im3w1l - 4 months
I'm surprised there is no mention of simd. Like I'm sure this is "fast enough" but if you want to make a really fast wc for fun wouldn't that be a natural direction?
By @nitwit005 - 4 months
The primary reason their code is faster is that they've made incorrect code that assumes you always have a UTF-8 locale. The normal isspace does not assume this:

> The real programs spend most of their time in functions like mbrtowc() to parse multi-byte characters and iswspace() to test if they are spaces

This will produce bad results in the real world. People have previously posted about bugs related to this on Hacker News: https://news.ycombinator.com/item?id=36216389

By @YesThatTom2 - 4 months
I love state machines and every time I use one my workers think I invented it because they’ve never seen them before.

The data for the state machine in this article might be best prepared by generating it with a program. That generator program doesn't need to care (too much) about performance since it is run during the build process. I like the idea of doing a lot of work now to save time in the future.
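
A hedged sketch of that build-time idea (file and array names are made up for illustration): a throwaway generator prints the transition table as C source, and the real tool just compiles the generated file in, so the table-construction work happens at build time rather than at run time. For this ASCII-only example every row comes out identical; a UTF-8-aware table would differ per state.

    /* gen_table.c (hypothetical): emits the transition table as C source. */
    #include <stdio.h>

    enum { ST_SPACE, ST_NEWLINE, ST_WORD, NSTATES };

    int main(void)
    {
        printf("/* generated file -- do not edit */\n");
        printf("static const unsigned char wc_table[%d][256] = {\n", NSTATES);
        for (int st = 0; st < NSTATES; st++) {
            printf("    {");
            for (int c = 0; c < 256; c++) {
                int next = (c == '\n') ? ST_NEWLINE
                         : (c == ' ' || c == '\t' || c == '\r') ? ST_SPACE
                         : ST_WORD;
                printf("%d,", next);
            }
            printf("},\n");
        }
        printf("};\n");
        return 0;
    }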

By @bruce343434 - 4 months
What about this is "asynchronous"?
By @JonChesterfield - 4 months
Recognising that wc is a lexer problem and implementing it using re2c would be my first choice here. That'll be building a similar state machine to what is done by hand here, especially since the hand rolled one is still byte at a time (the main drawback to re2c's generated output).
By @JackSlateur - 4 months
Reading the code to help me understand how things are

I got the wc_lines from coreutils: https://github.com/coreutils/coreutils/blob/master/src/wc.c#...

And I thought "damn, I understand nothing about the state-machine stuff, how did they made this faster ?"

Truth is: they did not. Of course, this is just the "count lines" part. Other parts are indeed faster.

Coreutils 9.4:

    0.67 [jack:/tmp] /usr/bin/time -v wc -l debian-12.2.0-amd64-netinst.iso
    2524855 debian-12.2.0-amd64-netinst.iso
    Command being timed: "wc -l debian-12.2.0-amd64-netinst.iso"
    User time (seconds): 0.00
    System time (seconds): 0.07
    Percent of CPU this job got: 98%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 2176
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 99
    Voluntary context switches: 1
    Involuntary context switches: 0
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
And wc2 (from gcc -O3 -o wc2 wc2.c):

    0.47 [jack:/tmp] /usr/bin/time -v ./wc2 -l debian-12.2.0-amd64-netinst.iso
    2524855 debian-12.2.0-amd64-netinst.iso
    Command being timed: "./wc2 -l debian-12.2.0-amd64-netinst.iso"
    User time (seconds): 0.97
    System time (seconds): 0.05
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.03
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 2176
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 146
    Voluntary context switches: 1
    Involuntary context switches: 1
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
Gotta keep reading
By @throwaway55533 - 4 months
From reading the article, how is this more efficient? Doesn't any word counting algorithm have to iterate through all the characters and count spaces?

What makes this better than the standard algorithm of

    wc = 0
    prev = ' '   # treat start-of-stream as whitespace so the first word is counted
    for ch in charStream:
        if !isSpace(ch) and isSpace(prev):
            wc += 1
        prev = ch
By @rep_movsd - 4 months
Remember "The Zen of Code Optimization" by Michael Abrash in the 90s?

This word count challenge was the first one he described. The table-driven approach was discussed, but was not necessarily the fastest because the large table needed would overflow the cache on 90s processors.

I used the same example in a chapter I wrote for a C++ book

By @smaudet - 4 months
I was a bit surprised by the macos/linux speed differential here:

> The difference in macOS and Linux speed is actually the difference in clang and gcc speed. The LLVM clang compiler is doing better optimizations for x86 processors here.

I know GCC has been getting much, much more aggressive as of late, but there have been some complaints that it is now much, much less safe (e.g. checks intended for security getting optimized away).

I wonder whether, if you were to go back 5 years, you'd see that the Linux code was generally safer than the speedy macOS LLVM code...

By @1vuio0pswjnm7 - 4 months
"The wc program included with macOS and Linux are completely different. Therefore, the following table shows them benchmarked against each other on the same hardware."

Appears BSD wc ("the macOS program") is much faster than the GNU coreutils wc ("the Linux program").

It certainly illustrates a point about people rewriting wc in newer languages to demonstrate alleged speed against C.

Which wc are they using for comparison?

Perhaps they should use wc2 or wc2o.

By @kazinator - 4 months
The target audience, Unix graybeards programming shell scripts using commands like wc, is extremely hard to convince to use new tools. If you don't integrate these optimizations into, say, the GNU Core Utils wc program, the impact will be nil.
By @apitman - 4 months
I didn't know you were allowed to use a language other than Rust for something like this.
By @rakoo - 4 months
> This state machine approach always results in the same speed, regardless of input.

This is strange to me; shouldn't it be O(N) in the size of the input? Or maybe in this benchmark the inputs are small enough that the difference isn't measurable?

By @fragmede - 4 months
In the bringing a tank to a knife fight kind of way, could this be optimized to run on a GPU? Load the contents then do an "and" across the whole contents in parallel, and then sum the whitespaces?
By @aftbit - 4 months
It would have been nice for the author to label the states and explain how and why they chose the states that they did. Otherwise, it's hard to understand how to apply this sort of thing to larger problems.
By @mjamil - 4 months
"Many programmers think pointer-arithmetic is faster." Don't modern compilers make this statement false (i.e., both approaches are implemented via the same machine instructions)?
By @markstos - 4 months
Was this ever submitted as an update to the `wc` shipped on systems?
By @1vuio0pswjnm7 - 4 months
By @nikau - 4 months
Good to see effort to speed up wc, it's been taking way too many cycles away from my 1.5GiB Electron chat program
By @kristianpaul - 4 months
Isn't Fedora working on improving some GNU tools?
By @NunoSempere - 4 months
From a previous project, https://nunosempere.com/blog/2023/09/15/wc/, which looked not at speed but at simplicity, here is a comparison of different historical versions of wc:

---

The [version of wc.c](https://git.nunosempere.com/personal/wc/src/branch/master/sr...) in this repository sits at 44 lines. It decides to read from stdin if the number of arguments fed to it is zero, and uses the standard C function getc to read character by character. It doesn't have flags; instead, there are further utilities in the src/extra/ folder for counting characters and lines, sitting at 32 and 35 lines of code, respectively. This version also has little error checking.

[Here](https://github.com/dspinellis/unix-history-repo/blob/Researc...) is a version of wc from UNIX V7, at 86 lines. It allows for counting characters, words and lines. I couldn't find a version in UNIX V6, so I'm guessing this is one of the earliest versions of this program. It decides to read from stdin if the number of arguments fed to it is zero, and reads character by character using the standard C getc function.

The busybox version ([git.busybox.net](https://git.busybox.net/busybox/tree/coreutils/wc.c)) of wc sits at 257 lines (162 with comments stripped), while striving to be [POSIX-compliant](https://pubs.opengroup.org/onlinepubs/9699919799/), meaning it has a fair number of flags and a bit of complexity. It reads character by character by using the standard getc function, and decides to read from stdin or not using its own fopen_or_warn_stdin function. It uses two GOTOs to get around, and has some incomplete Unicode support.

The [plan9](https://9p.io/sources/plan9/sys/src/cmd/wc.c) version implements some sort of table method in 331 lines. It uses plan9 rather than Unix libraries and methods, and seems to read from stdin if the number of args is 0.

The plan9port version of wc ([github](https://github.com/9fans/plan9port/blob/master/src/cmd/wc.c)) also implements some sort of table method, in 352 lines. It reads from stdin if the number of args is 0, and uses the Linux read function to read character by character.

The [OpenBSD](https://github.com/openbsd/src/blob/master/usr.bin/wc/wc.c) version is just *nice*. It reads from stdin by default, and uses a bit of buffering using read to speed things up. It defaults to using fstat when counting characters. It is generally pleasantly understandable, nice to read. I'm actually surprised at how pleasant it is to read.

The [FreeBSD version](https://cgit.freebsd.org/src/tree/usr.bin/wc/wc.c) sits at 367 lines. It has enough new things that I can't parse all that it's doing: in lines 137-143, what is capabilities mode? what is casper?, but otherwise it decides whether to read from stdin by the number of arguments, in line 157. It uses a combination of fstat and read, depending on the type of file.

Finally, the GNU utils version ([github](https://github.com/coreutils/coreutils/tree/master/src/wc.c), [savannah](http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f...)) is a bit over 1K lines of C. It does many things and checks many possible failure modes. I think it detects whether it should be reading from stdin using some very wrapped fstat, and it reads character by character using its own custom function.

So this utility started out reasonably small, then started getting more and more complex. [The POSIX committee](https://pubs.opengroup.org/onlinepubs/9699919799/) ended up codifying that complexity, and now we are stuck with it because even implementations like busybox which strive to be quite small try to keep to POSIX.

By @fsckboy - 4 months
while you are in there,

  wc --unicode
  wc --bom
By @cryptos - 4 months
Now we need another round of the language battle, because now this approach needs to be implemented in Rust, Go, Zig, Nim ... you name it!
By @aargh_aargh - 4 months
I wanted to take a peek at the state diagram (ASCII version). But I was feeling lazy, so apologies for having ChatGPT do the work for me. I wasn't going for correctness and didn't check it. If you're like me, here you go:

https://dreampuf.github.io/GraphvizOnline/#digraph%20StateMa...

    digraph StateMachine {
        rankdir=LR;
        size="8,5";
        node [shape = circle];

        LookingForWord -> InsideWord [label="Non-space"];
        LookingForWord -> LookingForWord [label="Space"];
        LookingForWord -> Newline [label="Newline"];
        LookingForWord -> LookingForWord [label="Other"];

        Newline -> InsideWord [label="Non-space"];
        Newline -> LookingForWord [label="Space"];
        Newline -> Newline [label="Newline"];
        Newline -> LookingForWord [label="Other"];

        InsideWord -> ContinueWord [label="Non-space"];
        InsideWord -> LookingForWord [label="Space"];
        InsideWord -> Newline [label="Newline"];
        InsideWord -> LookingForWord [label="Other"];

        ContinueWord -> ContinueWord [label="Non-space"];
        ContinueWord -> LookingForWord [label="Space"];
        ContinueWord -> Newline [label="Newline"];
        ContinueWord -> LookingForWord [label="Other"];
    }

EDIT: Under peer pressure, I checked it and it correctly reflects the code apart from being designed for one specific line ending sequence (as it should, being an example optimized for brevity).

As for the replies, as opposed to my approach, I'm sure when you're browsing research papers, you're doing a full reproducibility study for each one. I'm sorry I commented, I should have waited for you to post your results.