November 8th, 2024

Why I love Rust for tokenising and parsing

The author develops sqleibniz, a Rust-based static analysis tool for SQL, focusing on syntax checks and validation. Future plans include creating a Language Server Protocol server for SQL.


The blog post discusses the author's experience using Rust for developing a static analysis tool for SQL, named sqleibniz, specifically targeting the SQLite dialect. The tool aims to perform syntax checks and validate the existence of tables, columns, and functions in SQL input. The author highlights the importance of creating a tokenizer and parser for SQL, emphasizing Rust's features that facilitate code deduplication through macros. The post details the implementation of an Abstract Syntax Tree (AST) and the use of traits to define nodes, which helps in managing code complexity. The author also shares insights on testing strategies, particularly the use of table-driven tests, which are adapted from Go to Rust. The testing framework allows for organized and efficient validation of the lexer and parser functionalities, ensuring that both valid and invalid SQL inputs are handled correctly. The author expresses enthusiasm for Rust's capabilities in building robust software and hints at future developments, including the creation of a Language Server Protocol (LSP) server for SQL.
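As a rough illustration of the trait-based node style the summary describes (a minimal sketch with hypothetical names, not the article's actual code):

```rust
use std::fmt;

// A common interface every AST node implements, in the spirit of the
// article's trait-based node definitions (names here are hypothetical).
trait Node: fmt::Debug {
    /// Human-readable name of the node, e.g. for diagnostics.
    fn name(&self) -> &'static str;
}

#[derive(Debug)]
struct Literal {
    value: String,
}

impl Node for Literal {
    fn name(&self) -> &'static str {
        "Literal"
    }
}

#[derive(Debug)]
struct Select {
    // Child nodes are stored behind trait objects.
    columns: Vec<Box<dyn Node>>,
}

impl Node for Select {
    fn name(&self) -> &'static str {
        "Select"
    }
}

fn main() {
    let stmt = Select {
        columns: vec![Box::new(Literal { value: "1".into() })],
    };
    println!("{} with {} column(s)", stmt.name(), stmt.columns.len());
}
```

Defining one trait impl per node is exactly the repetitive surface that the article's macros are meant to shrink.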

- The author is developing a static analysis tool for SQL called sqleibniz using Rust.

- The tool focuses on syntax checks and validating SQL constructs against SQLite documentation.

- Rust's macros are utilized for code deduplication in defining AST nodes and traits.

- The author adapts table-driven testing from Go to Rust for validating lexer and parser functionalities.

- Future plans include developing a Language Server Protocol (LSP) server for SQL.
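The table-driven testing style mentioned above, ported from Go to Rust, might be sketched like this (hypothetical cases against a stand-in lexer, not sqleibniz's actual suite):

```rust
// A minimal lexer stand-in: splits on whitespace and counts tokens.
// The real sqleibniz lexer is far richer; this only illustrates the
// table-driven test shape ported from Go.
fn token_count(input: &str) -> usize {
    input.split_whitespace().count()
}

fn main() {
    // Each table entry pairs an input with its expected result, so a
    // new case is a one-line addition -- the core appeal of the style.
    let table: &[(&str, usize)] = &[
        ("SELECT 1", 2),
        ("SELECT * FROM users", 4),
        ("", 0),
    ];

    for (input, expected) in table {
        let got = token_count(input);
        assert_eq!(got, *expected, "lexing {:?}", input);
    }
    println!("all {} cases passed", table.len());
}
```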

21 comments
By @miningape - 6 months
About 2 months ago I would have said the same as the author, but I kept running up against the hard edges of Rust: the borrow checker. I realised that while I really liked using algebraic data types (e.g. enums) and pattern matching, the borrow checker and the low-level memory concerns meant I spent a lot of time fighting the borrow checker instead of fighting the PL issues at the heart of my project. So while tokenising/parsing was nice, interpreting and typechecking became the bane of my existence.

With that realisation I started looking for another, more suitable language. I knew the FP aspects of Rust were what I was looking for, so at first I considered something like F#, but I didn't like that it's tied to Microsoft/.NET. Looking a bit further, I could have gone with something like Zig/C, but then I'd lose the FP niceness I'm looking for. I also spent a fair amount of time looking at Go, but eventually decided against it because 1. I wanted a fair amount of syntactic sugar, and 2. Go is a server-side language; a lot of its features and libraries are geared towards that use case.

Finally I found OCaml. What really convinced me was seeing that the syntax is like a friendly version of Haskell, or like Rust without lifetimes. In fact, the first Rust compiler was written in OCaml, and OCaml is well known in the programming-language space. I'm still learning OCaml, so I'm not sure I can give a fair review yet, but so far it's exactly what I was looking for.

By @noelwelsh - 6 months
This is, to me, an odd way to approach parsing. I get the impression the author is relatively inexperienced with Rust and the PL ideas it builds on.

A few notes:

* The AST would, I believe, be much simpler defined as an algebraic data type. It's not as if the sqlite grammar is going to randomly grow new nodes that require the extensibility their convoluted encoding provides. The encoding they use looks like what someone familiar with OO, but not algebraic data types, would come up with.

* "Macros work different in most languages. However they are used for mostly the same reasons: code deduplication and less repetition." That could be said of any abstraction mechanism, e.g. functions. The defining feature of macros is that they run at compile time.

* The work on parser combinators would be a good place to start to see how to structure parsing in a clean way.
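The first note's suggestion, an AST as a plain algebraic data type, is only a few lines of Rust (a minimal sketch, not the article's encoding):

```rust
// A tiny SQL-flavoured AST as a single enum: each variant is a node
// kind, and pattern matching replaces trait-object dispatch.
#[derive(Debug, PartialEq)]
enum Expr {
    Number(i64),
    Column(String),
    Binary {
        op: char,
        lhs: Box<Expr>,
        rhs: Box<Expr>,
    },
}

// Walking the tree is an exhaustive match; the compiler flags any
// variant a new pass forgets to handle.
fn eval(expr: &Expr) -> Option<i64> {
    match expr {
        Expr::Number(n) => Some(*n),
        Expr::Column(_) => None, // needs a row context to evaluate
        Expr::Binary { op, lhs, rhs } => {
            let (l, r) = (eval(lhs)?, eval(rhs)?);
            match op {
                '+' => Some(l + r),
                '*' => Some(l * r),
                _ => None,
            }
        }
    }
}

fn main() {
    let e = Expr::Binary {
        op: '+',
        lhs: Box::new(Expr::Number(1)),
        rhs: Box::new(Expr::Number(2)),
    };
    assert_eq!(eval(&e), Some(3));
    assert_eq!(eval(&Expr::Column("id".into())), None);
}
```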

By @ryandv - 6 months
I don't know. Having written a small parser [0] for Forsyth-Edwards chess notation [1], I'd say Haskell takes the cake here in terms of simplicity and legibility; it reads almost as clearly as BNF, and there is very little technical ceremony involved, letting you focus on the actual grammar of whatever it is you are trying to parse.

[0] https://github.com/ryandv/chesskell/blob/master/src/Chess/Fa...

[1] https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notati...

By @tptacek - 6 months
So, just to kick this off: I wrote an eBPF disassembler and (half-hearted) emulator in Rust and I also found it a pleasant language to do parsing-type stuff in. But: I think the author cuts against their argument when they manage to necessitate a macro less than 1/6th of the way into their case study. A macro isn't quite code-gen, but it also doesn't quite feel like working idiomatically within the language, either.

Again: not throwing shade. I think this is a place where Rust is genuinely quite strong.

By @gritzko - 6 months
I have experience writing parsers (lexers) in Ragel, using Go, Java, C++, and C. I must say, once you have some boilerplate generator in place, raw C is as good as the Rust code the author describes. Maybe even better, because of its simplicity. For example, this is most of the code necessary for a JSON parser: https://github.com/gritzko/librdx/blob/master/JSON.lex

In fact, that eBNF only produces the lexer. The parser part is not that impressive either: 120 LoC and quite repetitive. https://github.com/gritzko/librdx/blob/master/JSON.c

So, I believe, a parser infrastructure evolves until it only needs eBNF to produce a parser. That is the saturation point.

By @hu3 - 6 months
Related: I love Rob Pike's 2011 talk "Lexical Scanning in Go".

Educational and elegant approach.

https://www.youtube.com/watch?v=HxaD_trXwRE

By @sshine - 6 months
One mind-blowing experience for me:

I can take my parser combinator library that I use for high-level compiler parsers, and use that same library in a no-std setting and compile it to a micro-controller, and deploy that as a high-performance protocol parser in an embedded environment. Exact same library! Just with fewer String and more &'static str.

So toying around with compilers translates my skill-set rather well into doing embedded protocol parsers.
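A flavour of why this works: combinators over borrowed `&str` slices need nothing beyond `core`, so the same shape compiles with or without `std` (a toy sketch, not the commenter's actual library):

```rust
// A parser is any function from input to an optional (rest, output)
// pair. Borrowed &str slices mean no allocation, so this shape works
// unchanged in no_std builds (only `core` is required).
type PResult<'a, T> = Option<(&'a str, T)>;

// Match a fixed keyword at the front of the input.
fn tag<'a>(kw: &'static str, input: &'a str) -> PResult<'a, &'static str> {
    input.strip_prefix(kw).map(|rest| (rest, kw))
}

// Match one or more ASCII digits and parse them as a number.
fn number(input: &str) -> PResult<'_, u32> {
    let end = input.bytes().take_while(|b| b.is_ascii_digit()).count();
    if end == 0 {
        return None;
    }
    let n = input[..end].parse().ok()?;
    Some((&input[end..], n))
}

fn main() {
    // Chain the two parsers by threading the rest of the input through.
    let (rest, _) = tag("LIMIT ", "LIMIT 10").unwrap();
    let (rest, n) = number(rest).unwrap();
    assert_eq!((rest, n), ("", 10));
}
```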

By @brundolf - 6 months
Something that was hard when I wrote a full AST parser in Rust was representing a hierarchy of concrete AST types, with upcasting and downcasting. I was able to figure out a way, but it required some really weird type shenanigans (e.g. PhantomData) and some macros. Looks like they had to do crazy macros here too.

Curious what the rest of the prior art looks like
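One stdlib route for the downcasting part is `std::any::Any` (a sketch of the general technique, not the parent commenter's actual solution):

```rust
use std::any::Any;

// Nodes expose themselves as `&dyn Any` so callers can attempt a
// downcast to a concrete node type at runtime.
trait Node: Any {
    fn as_any(&self) -> &dyn Any;
}

struct Literal {
    value: i64,
}

impl Node for Literal {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

struct Ident {
    name: String,
}

impl Node for Ident {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

fn main() {
    let nodes: Vec<Box<dyn Node>> = vec![
        Box::new(Literal { value: 42 }),
        Box::new(Ident { name: "users".into() }),
    ];
    // Downcast succeeds only for the matching concrete type.
    let lit = nodes[0].as_any().downcast_ref::<Literal>().unwrap();
    assert_eq!(lit.value, 42);
    assert!(nodes[1].as_any().downcast_ref::<Literal>().is_none());
    let id = nodes[1].as_any().downcast_ref::<Ident>().unwrap();
    assert_eq!(id.name, "users");
}
```

The `Any` supertrait forces every node to be `'static`, which is one reason people reach for the PhantomData-and-macro shenanigans instead when nodes borrow from the source text.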

By @ketzo - 6 months
So how do you debug code written with macros like this, or come into it as a new user of the codebase?

I’m imagining seeing the node! macro used, and seeing the macro definition, but still having a tough time knowing exactly what code is produced.

Do I just use the Example and see what type hints I get from it? Can I hover over it in my IDE and see an expanded version? Do I need to reference the compiled code to be sure?

(I do all my work in JS/TS so I don’t touch any macros; just curious about the workflow here!)
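For `macro_rules!`-style macros, rust-analyzer can expand the macro under the cursor in the IDE, and the third-party `cargo expand` subcommand prints a whole crate's expansion. As a sketch of what such an expansion looks like, here is a hypothetical `node!`-style macro (not the article's actual macro) with its expansion written out in comments:

```rust
// A hypothetical node!-style macro: it expands to a struct definition
// plus a constructor, deduplicating the per-node boilerplate.
macro_rules! node {
    ($name:ident { $($field:ident : $ty:ty),* $(,)? }) => {
        #[derive(Debug, PartialEq)]
        struct $name {
            $($field: $ty,)*
        }

        impl $name {
            fn new($($field: $ty),*) -> Self {
                Self { $($field),* }
            }
        }
    };
}

// The invocation below expands to roughly:
//   #[derive(Debug, PartialEq)]
//   struct Literal { value: i64 }
//   impl Literal { fn new(value: i64) -> Self { Self { value } } }
node!(Literal { value: i64 });

fn main() {
    let lit = Literal::new(7);
    assert_eq!(lit, Literal { value: 7 });
}
```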

By @nhatcher - 6 months
Well, good luck parsing sqlite syntax! I had to write a parser for a (fairly small) subset of sqlite for work a couple of years ago. I really like sqlite; it's always a source of inspiration.

The railroad diagrams are tremendously useful:

https://www.sqlite.org/syntaxdiagrams.html

I don't think the lemon parser generator gets enough credit:

https://sqlite.org/src/doc/trunk/doc/lemon.html

With respect to the choice of language, any language with algebraic data types would work great. Even TypeScript would be great for this.

FWIW I wrote a small introduction to writing parsers by hand in Rust a while ago:

https://www.nhatcher.com/post/a-rustic-invitation-to-parsing...

By @ForHackernews - 6 months
I'll throw in a plug for https://pest.rs/ a PEG-based parser-generator library in Rust. Delightful to work with and removes so much of the boilerplate involved in a parser.
By @djfobbz - 6 months
Sorry, my OCD is kicking in, but "Asterisk" is misspelled as "Asteriks" throughout your sample code.
By @WiSaGaN - 6 months
I think that, except for macros, most of these features are ML-family language features as well. Rust stands out because it can implement them in an efficient, zero-overhead-abstraction way.
By @kldx - 6 months
I find Megaparsec in Haskell quite expressive, based on my limited experience using nom in Rust.
By @samsartor - 6 months
Imperative rust is really good for parsing, but you can also get a long way with regexes. Especially if you are just prototyping or doing Advent of Code.

I do still like declarative parsing over imperative, so I wrote https://docs.rs/inpt on top of the regex crate. But Andrew Gallant gets all the credit; the regex crate is overpowered.

By @sksxihve - 6 months
I've found that the logos crate is really nice for writing lexers in Rust.

https://docs.rs/logos/0.14.2/logos/

By @omani - 6 months
This is the third day in a row this article has been posted here.

This time it got traction. Funny how HN works.

https://news.ycombinator.com/item?id=42055954

https://news.ycombinator.com/item?id=42058920

By @James_K - 6 months
Mentioning macros as a reason to love Rust goes against my experience with them.
By @jamra - 6 months
Does anyone have a good EBNF notation for SQLite? I tried to make a tree-sitter grammar, which produces C code and great Rust bindings for it. But SQLite uses the Lemon parser generator, and I'm not sure how to read the grammar from that.
By @nicoco - 6 months
Every rust article: "Look how great this rust feature is and how clean and concise the resulting code is!"

Me: "How can a programming language be so damn complex? Am I just dumb?"

By @jurschreuder - 6 months
I cannot agree less, C++ is the best and always will be. You youngsters made up this new dialect that can also compile with the C++ compiler. This is like people putting VS Code in dark mode thinking they're now also working in the Terminal like the Gods of Binary.