Parsing Protobuf Definitions with Tree-sitter
The article discusses using Tree-sitter to parse Protocol Buffers definitions, addressing limitations of existing tools and providing a practical guide for developers to enhance workflows in software development.
The article discusses the use of Tree-sitter, a parsing library, to extract information from Protocol Buffers (protobuf) definitions, which are crucial for schema definitions and event serialization in software development. The author, Karl Matthias, highlights the limitations of existing tools like protoc and protoc-gen-gotemplate, which do not support complex logic or custom workflows. He emphasizes the need for a more automated and repeatable solution for handling protobuf definitions, particularly in the context of a backend system at Mozi that relies on these definitions across its architecture.
The article provides a detailed example of a protobuf message definition and explains how Tree-sitter can be utilized to parse this definition effectively. It outlines the process of constructing queries to extract relevant data such as message names, enum names, and field types. The author demonstrates how to visualize the abstract syntax tree (AST) in Neovim, making it easier to write and test queries.
The implementation involves creating a Go data structure to store parsed information and functions to read and parse protobuf files using Tree-sitter. The article concludes by noting the effectiveness of this approach in generating bindings and suggests that Tree-sitter can be applied to other parsing challenges in the future. Overall, the article serves as a practical guide for developers looking to enhance their workflow with protobuf definitions using Tree-sitter.
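As a rough illustration of the kind of Go data structure the summary describes, here is a minimal sketch. All type names, field names, and the `Kind` helper are hypothetical and are not taken from the original article; the Tree-sitter parsing step that would populate the structure is left out.

```go
package main

import "fmt"

// Field is one field of a message: its name, declared type, and field number.
type Field struct {
	Name   string
	Type   string
	Number int
}

// Message is a protobuf message and its fields.
type Message struct {
	Name   string
	Fields []Field
}

// Enum is a protobuf enum and the names of its values.
type Enum struct {
	Name   string
	Values []string
}

// ParsedProto (hypothetical name) collects what was extracted from one
// .proto file: message names, enum names, and field types.
type ParsedProto struct {
	Messages []Message
	Enums    []Enum
}

// Kind reports whether typeName refers to a message or enum declared in
// the same file. Anything else (scalars, imported types) is "other".
func (p ParsedProto) Kind(typeName string) string {
	for _, m := range p.Messages {
		if m.Name == typeName {
			return "message"
		}
	}
	for _, e := range p.Enums {
		if e.Name == typeName {
			return "enum"
		}
	}
	return "other"
}

func main() {
	// Hand-built sample data standing in for the output of a parser.
	p := ParsedProto{
		Messages: []Message{
			{Name: "Event", Fields: []Field{
				{Name: "id", Type: "string", Number: 1},
				{Name: "status", Type: "Status", Number: 2},
			}},
		},
		Enums: []Enum{{Name: "Status", Values: []string{"UNKNOWN", "ACTIVE"}}},
	}
	for _, f := range p.Messages[0].Fields {
		fmt.Printf("%s %s (%s)\n", f.Name, f.Type, p.Kind(f.Type))
	}
}
```

A flat structure like this is usually enough for a binding generator, which mostly needs to know whether a field's type is a local message, a local enum, or something else.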
Related
Modern Emacs TypeScript Web Config
Setting up modern Emacs config for TypeScript web dev includes lsp-mode, Treesitter, Tailwind, TSX support, multiple LSP servers, Corfu completion, flycheck linter, eslint, Tailwind LSP, lsp-doctor, and Emacs LSP Booster.
Parse, Don't Validate
The article explores type-driven design in programming, emphasizing "Parse, don’t validate" in Haskell. It showcases using types for robust code, avoiding errors, and enhancing input parsing efficiency in various tasks.
How I Use Git Worktrees
The author advocates for using Git worktrees to manage multiple coding tasks concurrently, highlighting their benefits over branches for context switching and productivity in software development.
TreeSeg: Hierarchical Topic Segmentation of Large Transcripts
Augmend is creating a platform to automate tribal knowledge for development teams, featuring the TreeSeg algorithm, which segments session data into chapters by analyzing audio transcriptions and semantic actions.
Why I Prefer RST to Markdown
The author prefers reStructured Text (rST) over Markdown for technical documentation, citing its complex structure, custom directives, and better management of content across formats, despite its less attractive syntax.
Here's another approach. The AST of a .proto file is itself a protobuf. That's how the codegen plugins work. Protobuf also has a canonical mapping to JSON, so...
What you can do is use protoc to parse the .proto file, spit it out as JSON, and then process that data using your favorite pattern matching language. I wrote a [tool][1] that helps with that. For example, here's some [js code][2] that translates protobuf message definitions into "types" for use in an ORM.
[1]: https://github.com/dgoffredo/protojson
[2]: https://github.com/dgoffredo/okra/blob/master/lib/proto2type...
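A rough sketch of the "process that data" step in Go, under some assumptions: the input is a `FileDescriptorSet` (protoc can write one in binary form with `--descriptor_set_out`) that has already been converted to protobuf's canonical JSON mapping, so field names are lowerCamelCase and enum values are strings such as `TYPE_STRING`. The exact output shape of the linked tool is not verified here, the sample JSON is hand-written, and the Go type and function names are made up for illustration. Only top-level messages are handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// The structs below model a small subset of descriptor.proto as it
// appears in JSON. Unknown keys are ignored by encoding/json.
type descriptorSet struct {
	File []fileDescriptor `json:"file"`
}

type fileDescriptor struct {
	Name        string              `json:"name"`
	MessageType []messageDescriptor `json:"messageType"`
}

type messageDescriptor struct {
	Name  string            `json:"name"`
	Field []fieldDescriptor `json:"field"`
}

type fieldDescriptor struct {
	Name     string `json:"name"`
	Type     string `json:"type"`     // e.g. "TYPE_STRING", "TYPE_MESSAGE"
	TypeName string `json:"typeName"` // set for message and enum fields
}

// typeLabel returns the referenced type name for message/enum fields,
// or a lowercase scalar name such as "string" or "int64" otherwise.
func typeLabel(f fieldDescriptor) string {
	if f.TypeName != "" {
		return strings.TrimPrefix(f.TypeName, ".")
	}
	return strings.ToLower(strings.TrimPrefix(f.Type, "TYPE_"))
}

// fieldTypes maps "Message.field" to a type label for every top-level
// message in the descriptor set. Nested messages are not visited.
func fieldTypes(descriptorJSON []byte) (map[string]string, error) {
	var set descriptorSet
	if err := json.Unmarshal(descriptorJSON, &set); err != nil {
		return nil, fmt.Errorf("decoding descriptor set: %w", err)
	}
	out := make(map[string]string)
	for _, file := range set.File {
		for _, msg := range file.MessageType {
			for _, fld := range msg.Field {
				out[msg.Name+"."+fld.Name] = typeLabel(fld)
			}
		}
	}
	return out, nil
}

func main() {
	// Hand-written sample; real input would come from protoc.
	sample := []byte(`{"file":[{"name":"event.proto","messageType":[
	  {"name":"Event","field":[
	    {"name":"id","type":"TYPE_STRING"},
	    {"name":"created","type":"TYPE_MESSAGE","typeName":".google.protobuf.Timestamp"}
	  ]}]}]}`)

	types, err := fieldTypes(sample)
	if err != nil {
		panic(err)
	}
	keys := make([]string, 0, len(types))
	for k := range types {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map order is random, so sort for stable output
	for _, k := range keys {
		fmt.Printf("%s: %s\n", k, types[k])
	}
}
```

Because the descriptor is produced by protoc itself, imports and include paths are already resolved by the time this code sees the data.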
Also, if you’re parsing .proto files directly, you have to deal with a bunch of annoying issues like include paths, how you package sets of them to move around, etc. Descriptor sets seem like a better solution to me.
I don't understand the point of using tree-sitter to repeat that work (almost certainly having bugs doing so). Am I missing something?
E.g. just use `string` instead of `StringValue`.
I'm not as familiar with the Go reflection tools, but getting the information the author wants is trivial in Java reflection.
I'm using it for syntax checking across 30+ languages in Plandex[1], an LLM coding tool. tree-sitter runs in single digit milliseconds on typical files and is highly accurate. When it encounters a syntax problem, it can pinpoint the exact location/expression in the file, and it's fault-tolerant so it can keep going and identify multiple issues rather than stopping on the first one. These results can be sent back to the LLM, which can then often fix its own errors. I was able to reduce syntax issues by roughly 90% with gpt-4o using this approach.
Afaik, there's no other viable option for a use case like this. You'd need a menagerie of language specific linters, compilers, and/or language servers to get anywhere close, and many of those are way too slow to run inline.