Wc2: Investigates optimizing 'wc', the Unix word count program
The GitHub project "wc2" presents an innovative algorithm for the `wc` program, focusing on asynchronous state-machine parsing in C and JavaScript. It enhances efficiency, scalability, and speed compared to traditional `wc` implementations.
The GitHub project "wc2 - asynchronous state machine parsing" introduces a novel algorithm for the classic `wc` program, emphasizing an asynchronous state-machine parsing technique. It offers versions in C and JavaScript, including `wc2o.c` as a simplified version, `wc2.c` supporting Unicode, and `wc2.js` for JavaScript. The algorithm processes input byte by byte through a state machine, showcasing improved efficiency and scalability. Benchmarking results compare its performance with traditional `wc` programs on various input files. The project provides tools like `wctool` for test file generation, `wcdiff` for comparison, and `wcstream` for input file fragmentation. It delves into pointer arithmetic in C, offering a `-P` option in `wc2.c` to evaluate speed differences. Overall, the project highlights the advantages of the asynchronous state-machine parsing algorithm in terms of speed, scalability, and efficiency over conventional `wc` programs.
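For readers new to the technique, here is a minimal sketch of what a byte-at-a-time, table-driven counting loop can look like. This is an illustration only, not the project's actual code; the state names and the rule that counters are bumped on particular transitions are assumptions made for the example.

```c
#include <stddef.h>

/* Three illustrative states; a real table (e.g. with UTF-8 support) has more. */
enum { OUTSIDE_WORD, INSIDE_WORD, SAW_NEWLINE, STATE_MAX };

/* table[state][byte] gives the next state; filled in elsewhere. */
static unsigned char table[STATE_MAX][256];

void count(const unsigned char *buf, size_t len,
           unsigned long *words, unsigned long *lines)
{
    unsigned state = OUTSIDE_WORD;
    for (size_t i = 0; i < len; i++) {
        unsigned next = table[state][buf[i]];
        /* A word starts whenever we enter INSIDE_WORD from another state. */
        *words += (state != INSIDE_WORD) & (next == INSIDE_WORD);
        /* A line ends whenever a byte lands us in SAW_NEWLINE. */
        *lines += (next == SAW_NEWLINE);
        state = next;
    }
}
```

All per-character classification lives in the table, so the loop body is identical for every byte.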
Related
X debut 40 years ago (1984)
Robert W. Scheifler introduced the X window system in June 1984 for the VS100 Unix server, offering improved performance over W. The system was stable, with the Laboratory for Computer Science already transitioning to X and developing applications. Scheifler encouraged experimentation and welcomed volunteers for documentation contributions.
Let's write a video game from scratch like it's 1987
Philippe Gaultier created Minesweeper in 1987 using X11 without libraries, resulting in a compact executable. The article explores X authentication, Odin language usage, and minimalistic GUI development, offering insights into X11 intricacies. Source code is on GitHub.
Eight million pixels and counting: improving texture atlas allocation in Firefox (2021)
Improving texture atlas allocation in WebRender with the guillotiere crate reduces texture memory usage. The guillotine algorithm was replaced due to fragmentation issues, leading to a more efficient allocator. Visualizing the atlas in SVG aids debugging. Rust's simplicity and Cargo fuzz testing are praised for code development and robustness. Enhancements in draw call batching and texture upload aim to boost performance on low-end Intel GPUs by optimizing texture atlases.
Exposition of Front End Build Systems
Frontend build systems are crucial in web development, involving transpilation, bundling, and minification steps. Tools like Babel and Webpack optimize code for performance and developer experience. Various bundlers like Webpack, Rollup, Parcel, esbuild, and Turbopack are compared for features and performance.
Gren 0.4: New Foundations
Gren 0.4 updates its functional language with enhanced core packages, a new compiler, revamped FileSystem API, improved functions, and a community shift to Discord. These updates aim to boost usability and community engagement.
This is like most of the reason I opened this article. I wish they'd spend more time talking about this. In fact, most of the README covers details that are important but irrelevant to their algorithmic improvements.
Maybe they expect you to open up the code, but even there the comments largely describe what things do and not why, so I feel you need to have some understanding of the algorithm in question already. For example, the parts where the state table gets constructed have rather unhelpful comments for the uninformed:
/**
* Build an ASCII row. This configures low-order 7-bits, which should
* be roughly the same for all states
*/
static void
build_basic(unsigned char *row, unsigned char default_state, unsigned char ubase)
...
/**
* This function compiles a DFA-style state-machine for parsing UTF-8
* variable-length byte sequences.
*/
static void
compile_utf8_statemachine(int is_multibyte)
Even `wc2o.c` doesn't delve into the magic numbers it has chosen. I was hoping this repo would be more educational and explain how the state table works and why it's constructed the way it is.
Does anyone have a good resource for learning more about asynchronous state-machine parsers, one that could also help explain why this is better than just iterating over characters? I'm guessing maybe it's the lack of branching?
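For what it's worth, here is one way such a table could be filled in for the plain-ASCII case. This is a simplification for illustration, not the repo's build_basic(): every (state, byte) pair maps to a next state, so the character-class decisions are made once when the table is built rather than once per input byte.

```c
#include <string.h>

enum { OUTSIDE_WORD, INSIDE_WORD, SAW_NEWLINE, STATE_MAX };
static unsigned char table[STATE_MAX][256];

void build_ascii_table(void)
{
    for (int state = 0; state < STATE_MAX; state++) {
        /* Default: any byte continues (or starts) a word. */
        memset(table[state], INSIDE_WORD, 256);
        /* Whitespace takes us out of a word; newline gets its own state
         * so the line counter can be driven from the same table. */
        table[state][' ']  = OUTSIDE_WORD;
        table[state]['\t'] = OUTSIDE_WORD;
        table[state]['\r'] = OUTSIDE_WORD;
        table[state]['\n'] = SAW_NEWLINE;
    }
}
```

With a table like this the inner loop is a pure lookup, which is presumably where the "lack of branching" intuition comes from: the only branch left is the loop condition itself, not per-character isspace()-style tests.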
Just remember, if you're ever debugging something and some values in a Class/Object/Group of them are set and others are unset the state machine avoids that issue. When you go to transition between states check that the future state will be valid and if-not go to an error state with that information. The amount of time I've seen spent debugging where spurious `null`s came from completely dwarfs the time it would've taken just to write that class/algorithm as a state machine.
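A minimal sketch of that pattern, with entirely made-up states and events:

```c
/* Hypothetical handshake state machine. Every transition goes through
 * step(), so an impossible (state, event) pair lands in ST_ERROR with a
 * reason attached, instead of leaving the object half-initialized. */
enum state { ST_IDLE, ST_CONNECTING, ST_CONNECTED, ST_ERROR };
enum event { EV_CONNECT, EV_ACK, EV_CLOSE };

struct machine {
    enum state  state;
    const char *error_reason;
};

void step(struct machine *m, enum event ev)
{
    switch (m->state) {
    case ST_IDLE:       if (ev == EV_CONNECT) { m->state = ST_CONNECTING; return; } break;
    case ST_CONNECTING: if (ev == EV_ACK)     { m->state = ST_CONNECTED;  return; } break;
    case ST_CONNECTED:  if (ev == EV_CLOSE)   { m->state = ST_IDLE;       return; } break;
    case ST_ERROR:      return; /* stay in the error state once reached */
    }
    m->state = ST_ERROR;
    m->error_reason = "invalid event for current state";
}
```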
Kinda surprised wc wasn't a state machine to begin with. Isn't it effectively a special character counter where, if you see a character followed by a space, you bump up the word count? Judging by the repo's comment that "The real programs spend most of their time in functions like mbrtowc() to parse multi-byte characters and iswspace() to test if they are spaces -- which re-implementations of wc skip", the big improvement is removing the unnecessary work done in those functions. mbrtowc [1] appears to re-create the provided substring, which isn't necessary just to count.
[1]: https://en.cppreference.com/w/cpp/string/multibyte/mbrtowc
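For context, the loop the README is contrasting against looks roughly like this. This is a sketch of the traditional locale-aware approach, not any particular implementation's code, and it only counts words; it shows the per-character mbrtowc()/iswspace() work that the state-machine version folds into its table.

```c
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    setlocale(LC_CTYPE, "");            /* honor the user's multi-byte encoding */

    char buf[65536];
    size_t n;
    mbstate_t mbs;
    memset(&mbs, 0, sizeof mbs);

    unsigned long words = 0;
    int in_word = 0;

    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
        size_t i = 0;
        while (i < n) {
            wchar_t wc;
            size_t len = mbrtowc(&wc, buf + i, n - i, &mbs);
            if (len == (size_t)-2) {    /* incomplete sequence: state carries into next read */
                i = n;
                continue;
            }
            if (len == (size_t)-1) {    /* invalid byte: resync (simplified handling) */
                memset(&mbs, 0, sizeof mbs);
                wc = L'?';
                len = 1;
            } else if (len == 0) {
                len = 1;                /* embedded NUL */
            }
            if (iswspace((wint_t)wc)) {
                in_word = 0;
            } else if (!in_word) {
                in_word = 1;
                words++;
            }
            i += len;
        }
    }
    printf("%lu\n", words);
    return 0;
}
```

Every character goes through a (potentially locale-dependent) function call here, which is the per-character overhead the table-driven version avoids.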
The data dependency between loop iterations for each character might be pipelined really well when executed, and we can assume the L1/L2 caches are large enough for our lookup tables. But we're still doing at least one lookup per character.
Projects like https://github.com/simdjson/simdjson?tab=readme-ov-file#abou... are truly fascinating, because they're based on SIMD instructions that can process 64 or more bytes with a single instruction. Very much worth checking out the papers at that link.
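As a small taste of that style, here is an SSE2 sketch that counts newlines 16 bytes per iteration; simdjson itself is far more elaborate, and the function name here is made up:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

size_t count_newlines_sse2(const unsigned char *buf, size_t len)
{
    size_t count = 0, i = 0;
    const __m128i nl = _mm_set1_epi8('\n');

    /* Compare 16 bytes at a time against '\n' and popcount the match mask. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq    = _mm_cmpeq_epi8(chunk, nl);
        count += (size_t)__builtin_popcount((unsigned)_mm_movemask_epi8(eq));
    }
    for (; i < len; i++)                 /* scalar tail */
        count += (buf[i] == '\n');
    return count;
}
```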
The mapping from regular and context free languages to state machines and their table-based implementations was covered in depth in the Compilers course I took. A table based state machine is literally the textbook algorithm for parsing these classes of languages, and implementing them has been almost entirely automated by tools like Flex and Bison since the 80s.
Sorry, no, this is completely wrong. There are other reasons NFAs or DFAs might be preferable, and there are other performance considerations as far as backtracking is concerned, but NFA implementations do not backtrack. I mean, they could, but they don’t, and in common usage a “NFA regex implementation” means a non-backtracking one (that goes through all input once with a bag of states in hand and advances each one of them on each step), whereas a “backtracking regex implementation” means one that doesn’t use automata at all (and is usually worst-case exponential in the input).
What this is actually talking about is a streaming parser. Just call it like it is, it’s an OK term that sees plenty of usage.
[What you could mention here is PEG parsers, because even linear PEG implementations may need to keep O(input) state and thus are essentially incapable of streaming, unlike LL or LR ones; and GLR et al. can be even worse, for what it’s worth. But this is a much more obscure topic.]
I'd also like to add the Plan 9 implementation[0], which also uses the properties of utf8 as part of its state machine and anecdotally has been quite performant when I've used it.
[0] http://git.9front.org/plan9front/plan9front/107a7ba9717429ae...
Blog posts:
https://bytepawn.com/tag/wc.html
Code:
> The real programs spend most of their time in functions like mbrtowc() to parse multi-byte characters and iswspace() to test if they are spaces
This will produce bad results in the real world. People have previously posted about bugs related to this on Hacker News: https://news.ycombinator.com/item?id=36216389
The data for the state machine in this article might be best prepared by generating it from a program. That generator program doesn't need to care (too much) about performance, since it is run during the build process. I like the idea of doing a lot of work now to save time in the future.
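Something along these lines, for example; this is a hypothetical generator, not part of the project, and it only emits the simple ASCII table from further up the thread:

```c
/* gen_table.c -- build-time generator: prints the transition table as a C
 * array, which the build then compiles into the real program. */
#include <stdio.h>

enum { OUTSIDE_WORD, INSIDE_WORD, SAW_NEWLINE, STATE_MAX };

int main(void)
{
    printf("/* generated by gen_table.c -- do not edit */\n");
    printf("const unsigned char table[%d][256] = {\n", STATE_MAX);
    for (int state = 0; state < STATE_MAX; state++) {
        printf("    {");
        for (int byte = 0; byte < 256; byte++) {
            int next;
            if (byte == '\n')
                next = SAW_NEWLINE;
            else if (byte == ' ' || byte == '\t' || byte == '\r')
                next = OUTSIDE_WORD;
            else
                next = INSIDE_WORD;
            printf("%d,", next);
        }
        printf("},\n");
    }
    printf("};\n");
    return 0;
}
```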
I got the wc_lines from coreutils: https://github.com/coreutils/coreutils/blob/master/src/wc.c#...
And I thought "damn, I understand nothing about the state-machine stuff, how did they make this faster?"
Truth is: they did not. Of course, this is just the "count lines" part; other parts are indeed faster.
Coreutils 9.4:
0.67 [jack:/tmp] /usr/bin/time -v wc -l debian-12.2.0-amd64-netinst.iso
2524855 debian-12.2.0-amd64-netinst.iso
Command being timed: "wc -l debian-12.2.0-amd64-netinst.iso"
User time (seconds): 0.00
System time (seconds): 0.07
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2176
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 99
Voluntary context switches: 1
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
And wc2 (built with gcc -O3 -o wc2 wc2.c):
0.47 [jack:/tmp] /usr/bin/time -v ./wc2 -l debian-12.2.0-amd64-netinst.iso
2524855 debian-12.2.0-amd64-netinst.iso
Command being timed: "./wc2 -l debian-12.2.0-amd64-netinst.iso"
User time (seconds): 0.97
System time (seconds): 0.05
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.03
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2176
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 146
Voluntary context switches: 1
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Gotta keep reading.
What makes this better than the standard algorithm of
wc = 0
prev = ' '    # treat start-of-stream as whitespace so the first word is counted
for ch in charStream:
    if !isSpace(ch) and isSpace(prev):
        wc += 1
    prev = ch
This word count challenge was the first one he described. The table-driven approach was discussed, but was not necessarily the fastest, due to the large table needed overflowing the caches of 90s processors.
I used the same example in a chapter I wrote for a C++ book.
> The difference in macOS and Linux speed is actually the difference in clang and gcc speed. The LLVM clang compiler is doing better optimizations for x86 processors here.
I know GCC has been getting much, much more aggressive as of late, but there have been some complaints that it is now much, much less safe (e.g., checks intended for security getting optimized away).
I wonder whether, if you were to go back 5 years, you'd see that the Linux code was generally safer than the speedy macOS LLVM code...
Appears BSD wc ("the macOS program") is much faster than the GNU coreutils wc ("the Linux program").
It certainly illustrates a point about people rewriting wc in newer languages to demonstrate alleged speed against C.
Which wc are they using for comparison?
Perhaps they should use wc2 or wc2o.
This is strange to me, shouldn't it be O(N) with the size of the input? Or maybe in this benchmark the inputs are small enough that the difference isn't measurable?
---
The [version of wc.c](https://git.nunosempere.com/personal/wc/src/branch/master/sr...) in this repository sits at 44 lines. It decides to read from stdin if the number of arguments fed to it is otherwise zero, and uses the standard C function getc to read character by character. It doesn't have flags, instead, there are further utilities in the src/extra/ folder for counting characters and lines, sitting at 32 and 35 lines of code, respectively. This version also has little error checking.
[Here](https://github.com/dspinellis/unix-history-repo/blob/Researc...) is a version of wc from UNIX V7, at 86 lines. It allows for counting characters, words and lines. I couldn't find a version in UNIX V6, so I'm guessing this is one of the earliest versions of this program. It decides to read from stdin if the number of arguments fed to it is zero, and reads character by character using the standard C getc function.
The busybox version ([git.busybox.net](https://git.busybox.net/busybox/tree/coreutils/wc.c)) of wc sits at 257 lines (162 with comments stripped), while striving to be [POSIX-compliant](https://pubs.opengroup.org/onlinepubs/9699919799/), meaning it has a fair number of flags and a bit of complexity. It reads character by character by using the standard getc function, and decides to read from stdin or not using its own fopen_or_warn_stdin function. It uses two GOTOs to get around, and has some incomplete Unicode support.
The [plan9](https://9p.io/sources/plan9/sys/src/cmd/wc.c) version implements some sort of table method in 331 lines. It uses plan9 rather than Unix libraries and methods, and seems to read from stdin if the number of args is 0.
The plan9port version of wc ([github](https://github.com/9fans/plan9port/blob/master/src/cmd/wc.c)) also implements some sort of table method, in 352 lines. It reads from stdin if the number of args is 0, and uses the Linux read function to read character by character.
The [OpenBSD](https://github.com/openbsd/src/blob/master/usr.bin/wc/wc.c) version is just *nice*. It reads from stdin by default, and uses a bit of buffering using read to speed things up. It defaults to using fstat when counting characters. It is generally pleasantly understandable, nice to read. I'm actually surprised at how pleasant it is to read.
The [FreeBSD version](https://cgit.freebsd.org/src/tree/usr.bin/wc/wc.c) sits at 367 lines. It has enough new things that I can't parse all that it's doing (in lines 137-143, what is capabilities mode? what is casper?), but otherwise it decides whether to read from stdin by the number of arguments, in line 157. It uses a combination of fstat and read, depending on the type of file.
Finally, the GNU utils version ([github](https://github.com/coreutils/coreutils/tree/master/src/wc.c), [savannah](http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f...)) is a bit over 1K lines of C. It does many things and checks many possible failure modes. I think it detects whether it should be reading from stdin using some very wrapped fstat, and it reads character by character using its own custom function.
So this utility started out reasonably small, then started getting more and more complex. [The POSIX committee](https://pubs.opengroup.org/onlinepubs/9699919799/) ended up codifying that complexity, and now we are stuck with it because even implementations like busybox which strive to be quite small try to keep to POSIX.
wc --unicode
wc --bom
https://dreampuf.github.io/GraphvizOnline/#digraph%20StateMa...
digraph StateMachine { rankdir=LR; size="8,5"; node [shape = circle];
LookingForWord -> InsideWord [label="Non-space"];
LookingForWord -> LookingForWord [label="Space"];
LookingForWord -> Newline [label="Newline"];
LookingForWord -> LookingForWord [label="Other"];
Newline -> InsideWord [label="Non-space"];
Newline -> LookingForWord [label="Space"];
Newline -> Newline [label="Newline"];
Newline -> LookingForWord [label="Other"];
InsideWord -> ContinueWord [label="Non-space"];
InsideWord -> LookingForWord [label="Space"];
InsideWord -> Newline [label="Newline"];
InsideWord -> LookingForWord [label="Other"];
ContinueWord -> ContinueWord [label="Non-space"];
ContinueWord -> LookingForWord [label="Space"];
ContinueWord -> Newline [label="Newline"];
ContinueWord -> LookingForWord [label="Other"];
}
EDIT: Under peer pressure, I checked it and it correctly reflects the code apart from being designed for one specific line ending sequence (as it should, being an example optimized for brevity).
As for the replies, as opposed to my approach, I'm sure when you're browsing research papers, you're doing a full reproducibility study for each one. I'm sorry I commented, I should have waited for you to post your results.