August 3rd, 2024

Parsing Protobuf Definitions with Tree-sitter

The article discusses using Tree-sitter to parse Protocol Buffers definitions, addressing limitations of existing tools and offering a practical guide for developers who work with protobuf.

The article discusses the use of Tree-sitter, a parsing library, to extract information from Protocol Buffers (protobuf) definitions, which are widely used for schema definition and event serialization. The author, Karl Matthias, highlights the limitations of existing tools like protoc and protoc-gen-gotemplate, which do not support complex logic or custom workflows. He emphasizes the need for a more automated and repeatable way to handle protobuf definitions, particularly for the backend system at Mozi, which relies on these definitions across its architecture.

The article provides a detailed example of a protobuf message definition and explains how Tree-sitter can be utilized to parse this definition effectively. It outlines the process of constructing queries to extract relevant data such as message names, enum names, and field types. The author demonstrates how to visualize the abstract syntax tree (AST) in Neovim, making it easier to write and test queries.

The implementation involves creating a Go data structure to store parsed information and functions to read and parse protobuf files using Tree-sitter. The article concludes by noting the effectiveness of this approach in generating bindings and suggests that Tree-sitter can be applied to other parsing challenges in the future. Overall, the article serves as a practical guide for developers looking to enhance their workflow with protobuf definitions using Tree-sitter.
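As a rough illustration of the kind of Go data structure the summary describes, here is a minimal sketch. The type names (`ParsedProto`, `Message`, `Field`) and the `FindMessage` helper are hypothetical and are not taken from the article; the Tree-sitter step that would populate the structure is omitted because it depends on external bindings.

```go
package main

import "fmt"

// Field, Message and ParsedProto are hypothetical names chosen for this
// sketch; the article's own types may look different.
type Field struct {
	Name   string
	Type   string
	Number int
}

type Message struct {
	Name   string
	Fields []Field
}

// ParsedProto holds what a parser pass might collect from one .proto file.
type ParsedProto struct {
	Package  string
	Messages []Message
	Enums    []string
}

// FindMessage returns the message with the given name, if present.
func (p *ParsedProto) FindMessage(name string) (*Message, bool) {
	for i := range p.Messages {
		if p.Messages[i].Name == name {
			return &p.Messages[i], true
		}
	}
	return nil, false
}

func main() {
	// Filled in by hand here; a real tool would fill it from query matches.
	parsed := ParsedProto{
		Package: "example",
		Messages: []Message{
			{Name: "Event", Fields: []Field{{Name: "id", Type: "string", Number: 1}}},
		},
		Enums: []string{"Kind"},
	}
	if m, ok := parsed.FindMessage("Event"); ok {
		for _, f := range m.Fields {
			fmt.Println(m.Name, f.Name, f.Type, f.Number)
		}
	}
}
```

Keeping the parsed result in plain structs like these means later code generation does not need to know which parser produced the data.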

7 comments
By @MathMonkeyMan - 6 months
I need to get around to playing with tree-sitter. The approach in this article is neat.

Here's another approach. The AST of a .proto file is itself a protobuf. That's how the codegen plugins work. Protobuf also has a canonical mapping to JSON, so...

What you can do is use protoc to parse the .proto file, spit it out as JSON, and then process that data using your favorite pattern matching language. I wrote a [tool][1] that helps with that. For example, here's some [js code][2] that translates protobuf message definitions into "types" for use in an ORM.

[1]: https://github.com/dgoffredo/protojson

[2]: https://github.com/dgoffredo/okra/blob/master/lib/proto2type...
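The JSON route described in this comment can be sketched in Go with only the standard library. This assumes the JSON follows the canonical protobuf JSON mapping of a FileDescriptorSet (lowerCamelCase keys such as `messageType` and `typeName`); the exact output of the linked tool may differ. The struct names, the `fieldTypes` helper, and the sample input are all made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of a FileDescriptorSet in its JSON form. Only the keys
// used below are declared; everything else is ignored by encoding/json.
type descriptorSet struct {
	File []fileDescriptor `json:"file"`
}

type fileDescriptor struct {
	Name        string              `json:"name"`
	Package     string              `json:"package"`
	MessageType []messageDescriptor `json:"messageType"`
}

type messageDescriptor struct {
	Name  string            `json:"name"`
	Field []fieldDescriptor `json:"field"`
}

type fieldDescriptor struct {
	Name     string `json:"name"`
	Number   int    `json:"number"`
	Type     string `json:"type"`
	TypeName string `json:"typeName"`
}

// sampleJSON is hand-written illustrative input, not real tool output.
const sampleJSON = `{
  "file": [{
    "name": "event.proto",
    "package": "example",
    "messageType": [{
      "name": "Event",
      "field": [
        {"name": "id", "number": 1, "type": "TYPE_STRING"},
        {"name": "note", "number": 2, "type": "TYPE_MESSAGE",
         "typeName": ".google.protobuf.StringValue"}
      ]
    }]
  }]
}`

// fieldTypes maps "package.Message.field" to its type. Message and enum
// fields carry a typeName, which is preferred over the generic type tag.
func fieldTypes(data []byte) (map[string]string, error) {
	var set descriptorSet
	if err := json.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	types := make(map[string]string)
	for _, f := range set.File {
		for _, m := range f.MessageType {
			for _, fld := range m.Field {
				t := fld.Type
				if fld.TypeName != "" {
					t = fld.TypeName
				}
				types[f.Package+"."+m.Name+"."+fld.Name] = t
			}
		}
	}
	return types, nil
}

func main() {
	types, err := fieldTypes([]byte(sampleJSON))
	if err != nil {
		panic(err)
	}
	for name, t := range types {
		fmt.Println(name, "->", t)
	}
}
```

Nested messages and enum definitions are left out to keep the sketch short; they would need extra struct fields and a recursive walk.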

By @grumbles - 6 months
Huh. tree-sitter seems neat, but I don’t really get why the author thinks processing the descriptor set is so hard. Seems equally difficult to learn a bunch of new abstractions in the form of tree-sitter vs just learning protobuf’s own ones.

Also, if you’re parsing .proto files directly, you have to deal with a bunch of annoying issues like include paths, how you package sets of them to move around, etc. Descriptor sets seem like a better solution to me.

By @pcj-github - 6 months
From the docs: "The protocol compiler can output a FileDescriptorSet containing the .proto files it parses." (https://github.com/protocolbuffers/protobuf/blob/main/src/go...)

I don't understand the point of using tree-sitter to repeat that work (almost certainly having bugs doing so). Am I missing something?

By @cyberax - 6 months
I don't get it. Why not just use a better Protobuf model? Go's serialization format for protobufs is not the most brilliant one, but it's reasonable.

E.g. just use `string` instead of `StringValue`.

By @Arainach - 6 months
Like others, I don't understand the author's issues getting the stock proto reflection behavior to extract this information.

I'm not as familiar with the Go reflection tools, but getting the information the author wants is trivial in Java reflection.

By @danenania - 6 months
tree-sitter is an incredible tool. I wonder if there's been a dedicated discussion for it on the HN front page at some point—will have to check on the HN algolia search.

I'm using it for syntax checking across 30+ languages in Plandex[1], an LLM coding tool. tree-sitter runs in single digit milliseconds on typical files and is highly accurate. When it encounters a syntax problem, it can pinpoint the exact location/expression in the file, and it's fault-tolerant so it can keep going and identify multiple issues rather than stopping on the first one. These results can be sent back to the LLM, which can then often fix its own errors. I was able to reduce syntax issues by roughly 90% with gpt-4o using this approach.

Afaik, there's no other viable option for a use case like this. You'd need a menagerie of language specific linters, compilers, and/or language servers to get anywhere close, and many of those are way too slow to run inline.

1 - https://github.com/plandex-ai/plandex

By @Cloudef - 6 months
I wrote a proto parser using ragel <https://www.colm.net/open-source/ragel/> for work way back as well, and it was surprisingly painless. I think this was back when protobuf was transitioning to proto3.