August 3rd, 2024

Parsing Protobuf Definitions with Tree-sitter

The article discusses using Tree-sitter to parse Protocol Buffers definitions, addressing limitations of existing tools and providing a practical guide for developers to enhance workflows in software development.

The article discusses the use of Tree-sitter, a parsing library, to extract information from Protocol Buffers (protobuf) definitions, which are crucial for schema definitions and event serialization in software development. The author, Karl Matthias, highlights the limitations of existing tools like protoc and protoc-gen-gotemplate, which do not support complex logic or custom workflows. He emphasizes the need for a more automated and repeatable solution for handling protobuf definitions, particularly in the context of a backend system at Mozi that relies on these definitions across its architecture.

The article provides a detailed example of a protobuf message definition and explains how Tree-sitter can be utilized to parse this definition effectively. It outlines the process of constructing queries to extract relevant data such as message names, enum names, and field types. The author demonstrates how to visualize the abstract syntax tree (AST) in Neovim, making it easier to write and test queries.

The implementation involves creating a Go data structure to store parsed information and functions to read and parse protobuf files using Tree-sitter. The article concludes by noting the effectiveness of this approach in generating bindings and suggests that Tree-sitter can be applied to other parsing challenges in the future. Overall, the article serves as a practical guide for developers looking to enhance their workflow with protobuf definitions using Tree-sitter.

Modern Emacs TypeScript Web Config

Setting up a modern Emacs config for TypeScript web dev includes lsp-mode, Tree-sitter, Tailwind, TSX support, multiple LSP servers, Corfu completion, flycheck linter, eslint, Tailwind LSP, lsp-doctor, and Emacs LSP Booster.

Parse, Don't Validate

The article explores type-driven design in programming, emphasizing "Parse, don’t validate" in Haskell. It showcases using types for robust code, avoiding errors, and enhancing input parsing efficiency in various tasks.

How I Use Git Worktrees

The author advocates for using Git worktrees to manage multiple coding tasks concurrently, highlighting their benefits over branches for context switching and productivity in software development.

TreeSeg: Hierarchical Topic Segmentation of Large Transcripts

Augmend is creating a platform to automate tribal knowledge for development teams, featuring the TreeSeg algorithm, which segments session data into chapters by analyzing audio transcriptions and semantic actions.

Why I Prefer RST to Markdown

The author prefers reStructuredText (rST) over Markdown for technical documentation, citing its complex structure, custom directives, and better management of content across formats, despite its less attractive syntax.

7 comments

By @MathMonkeyMan - 9 months

I need to get around to playing with tree-sitter. The approach in this article is neat.

Here's another approach. The AST of a .proto file is itself a protobuf. That's how the codegen plugins work. Protobuf also has a canonical mapping to JSON, so...

What you can do is use protoc to parse the .proto file, spit it out as JSON, and then process that data using your favorite pattern matching language. I wrote a [tool][1] that helps with that. For example, here's some [js code][2] that translates protobuf message definitions into "types" for use in an ORM.

[1]: https://github.com/dgoffredo/protojson

[2]: https://github.com/dgoffredo/okra/blob/master/lib/proto2type...

By @grumbles - 9 months

Huh. tree-sitter seems neat, but I don’t really get why the author thinks processing the descriptor set is so hard. Seems equally difficult to learn a bunch of new abstractions in the form of tree-sitter vs just learning protobuf’s own ones.

Also, if you’re parsing .proto files directly, you have to deal with a bunch of annoying issues like include paths, how you package sets of them to move around, etc. Descriptor sets seem like a better solution to me.

By @pcj-github - 9 months

From the docs: "The protocol compiler can output a FileDescriptorSet containing the .proto files it parses." (https://github.com/protocolbuffers/protobuf/blob/main/src/go...)

I don't understand the point of using tree-sitter to repeat that work (almost certainly having bugs doing so). Am I missing something?

By @cyberax - 9 months

I don't get it. Why not just use a better Protobuf model? Go's serialization format for protobufs is not the most brilliant one, but it's reasonable.

E.g. just use `string` instead of `StringValue`.

By @Arainach - 9 months

Like others, I don't understand the author's issues getting the stock proto reflection behavior to extract this information.

I'm not as familiar with the Go reflection tools, but getting the information the author wants is trivial in Java reflection.

By @danenania - 9 months

tree-sitter is an incredible tool. I wonder if there's been a dedicated discussion for it on the HN front page at some point—will have to check on the HN algolia search.

I'm using it for syntax checking across 30+ languages in Plandex[1], an LLM coding tool. tree-sitter runs in single digit milliseconds on typical files and is highly accurate. When it encounters a syntax problem, it can pinpoint the exact location/expression in the file, and it's fault-tolerant so it can keep going and identify multiple issues rather than stopping on the first one. These results can be sent back to the LLM, which can then often fix its own errors. I was able to reduce syntax issues by roughly 90% with gpt-4o using this approach.

Afaik, there's no other viable option for a use case like this. You'd need a menagerie of language specific linters, compilers, and/or language servers to get anywhere close, and many of those are way too slow to run inline.

1 - https://github.com/plandex-ai/plandex

By @Cloudef - 9 months

I also wrote a proto parser using ragel <https://www.colm.net/open-source/ragel/> for work way back; it was surprisingly painless. I think this was back when protobuf was transitioning to proto3.