August 7th, 2024

Thoughts on Canonical S-Expressions (2019)

Canonical S-Expressions (csexp) efficiently handle binary data without base64 encoding but lack associative array support, complicating complex data serialization. Alternatives like Bencoding and MessagePack may offer better solutions.

Canonical S-Expressions (csexp) are a data format used by Datashards, notable for how efficiently they handle binary data. Unlike traditional S-Expressions, csexp represents every atom as a length-prefixed byte object, so binary values are stored compactly without the base64 encoding that formats like JSON commonly require. The format is simple enough that implementing a reader for it is straightforward. However, it lacks support for associative arrays, which complicates the serialization and deserialization of more complex data structures. Users must devise their own conventions for such structures, which can lead to ambiguity when interpreting the data. While csexp avoids the overhead of XML and does not impose type conversions, it still requires a reader to convert parsed data into application-specific formats. The author suggests that type hints could simplify the reader's task, much as JSON-LD does. Alternatives like Bencoding, MessagePack, and Preserves offer varying degrees of expressiveness and ease of use, with Bencoding providing type support. Overall, csexp is a suitable choice for straightforward applications, but its limitations may call for a different format when data needs are more complex.

- Canonical S-Expressions are efficient for binary data storage without base64 encoding.

- They lack support for associative arrays, complicating complex data handling.

- A reader is necessary to convert parsed data into application-specific formats.

- Incorporating type hints could enhance usability and clarity.

- Alternatives like Bencoding and MessagePack may offer additional benefits for complex data structures.
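
To make the "easy to parse" point concrete, here is a minimal sketch of a csexp reader in Python. It is an illustration, not the article's implementation: the function names `parse_csexp` and `_parse` are made up for this example, atoms are returned as raw `bytes`, lists become Python lists, and display hints are deliberately not handled.

```python
def parse_csexp(data: bytes):
    """Parse exactly one canonical S-expression from `data`.

    Atoms come back as bytes, lists as Python lists.
    Display hints ([...]) are not supported in this sketch.
    """
    value, pos = _parse(data, 0)
    if pos != len(data):
        raise ValueError("trailing bytes after expression")
    return value


def _parse(data: bytes, pos: int):
    """Return (value, next_position) for the expression starting at `pos`."""
    if pos >= len(data):
        raise ValueError("unexpected end of input")

    if data[pos:pos + 1] == b"(":
        pos += 1
        items = []
        while True:
            if pos >= len(data):
                raise ValueError("unterminated list")
            if data[pos:pos + 1] == b")":
                return items, pos + 1
            item, pos = _parse(data, pos)
            items.append(item)

    # Otherwise expect an atom: <decimal length> ':' <that many raw bytes>
    colon = data.find(b":", pos)
    if colon == -1 or not data[pos:colon].isdigit():
        raise ValueError(f"expected a length prefix at offset {pos}")
    length = int(data[pos:colon])
    start = colon + 1
    end = start + length
    if end > len(data):
        raise ValueError("atom is longer than the remaining input")
    return data[start:end], end


if __name__ == "__main__":
    print(parse_csexp(b"(9:groceries(4:milk5:bread))"))
```

Because each atom announces its length up front, the reader never scans the payload for delimiters, which is why arbitrary binary bytes (including `:` or `)`) can sit inside an atom unescaped.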

8 comments
By @papaver-somnamb - 5 months
The primary intended use case of this, in contrast to say Extensible Data Notation (EDN), seems to be faster machine processing. The requirement to prefix each datum with its length (Pascal-string style) is the clue. So an advantage here is that it is much easier to place a hard bound on memory and CPU when reading this format, which confers security properties such as a systematically reduced risk of buffer overflows. Good for hard real-time, for example guidance systems that do all allocation only at startup.

Beyond lists and string atoms (or whatever the actual list is), this format also makes an affordance for custom types, but as TFA points out, you still have to roll your own other / higher order data types. Data types that you almost definitely have on hand. Now we are talking about needing to do additional processing on the decoded output, just to interpret common data structures like associative arrays and sets. And as a machine-first serialization format, if you are interchanging with other people or with yourself in the future, sure hope you have full agreement on those custom types.

So what do you do: Add libs? Roll your own? Well, competing alternatives already offer that complete picture as mature, battle-tested solutions. So I'm inclined to view Canonical S-Expressions merely as a way-point on our path of technological evolution, worthy of fleeting, mild curiosity.

By @shalabhc - 5 months
I would suggest the author also look at Amazon Ion:

* It can be used as schema-less

* allows attaching metadata tags to values (which can serve as type hints[1]), and

* encodes blobs efficiently

I have not used it, but in the space of flexible formats it appears to have other interesting properties. For instance it can encode a symbol table making symbols really compact in the rest of the message. Symbol tables can be shared out of band.

[1] https://amazon-ion.github.io/ion-docs/docs/spec.html#annot

By @amiga386 - 5 months
Canonical S-expressions seem remarkably similar to bencoding as used in BitTorrent files. They both use length prefixes written in ASCII digits followed by a colon.

    Canonical S-Expression: (9:groceries(4:milk5:bread))
    Bencoding:              l9:groceriesl4:milk5:breadee

Bencoding also manages to specify dictionaries, and yet still have a canonical encoding, by requiring dictionaries be sorted by key (and keys be unique).

It doesn't have the option for arbitrary type names, it just has actual types: integer, bytestring, list and dictionary.

FTA:

> Bencoding offers many of the same benefits of CSEXP, but because it also supports types, is a bit easier to work with.

Hmm, well there you go.
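
For readers who want to see the comparison above in code, here is a minimal sketch of a bencode encoder in Python. The function name `bencode` is just a label for this example, not a reference to any particular library. It covers the four bencoding types (integer, bytestring, list, dictionary) and sorts dictionary keys as raw bytes, which is what makes the output canonical.

```python
def bencode(value) -> bytes:
    """Encode ints, bytes, lists and dicts (with bytes keys) as bencoding."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it to avoid surprises.
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        if not all(isinstance(key, bytes) for key in value):
            raise TypeError("dictionary keys must be bytes")
        parts = [b"d"]
        # Sorting keys as raw bytes gives every dictionary a single encoding.
        for key in sorted(value):
            parts.append(bencode(key))
            parts.append(bencode(value[key]))
        parts.append(b"e")
        return b"".join(parts)
    raise TypeError(f"cannot bencode {type(value).__name__}")


if __name__ == "__main__":
    print(bencode([b"groceries", [b"milk", b"bread"]]))
    print(bencode({b"b": 1, b"a": 2}))
```

Two dictionaries with the same entries inserted in a different order encode to the same bytes, which is the property csexp leaves to the application.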

By @bsima - 5 months
> One thought that I keep having while I'm using csexp is to use the type hints to store information such as the data type.

This is exactly what edn does. It seems the author would like edn but doesn't mention it.

https://github.com/edn-format/edn

By @dang - 5 months
We changed the url from https://en.wikipedia.org/wiki/Canonical_S-expressions to a non-Wikipedia article. (Wikipedia submissions are fine but if there's a good third-party source, those are usually preferred because they're less generic.)

Readers may want to look at both of course!

By @floren - 5 months
> In some cases this conversion is easy, 3:100 becomes the integer 100.

Huh, I'd have said it should become the string "100", based on earlier examples such as 5:hello
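
The ambiguity raised here can be sketched in a few lines of Python. On the wire `3:100` is only the three bytes `100`; whether that becomes an integer or a string is decided by the reader. One option the article floats is to carry the type in a hint, which csexp could express as a bracketed display hint such as `[3:int]3:100`. The hint names `int` and `str` and the helper `convert_atom` below are hypothetical, chosen only for this illustration.

```python
from typing import Optional

# Hypothetical mapping from hint names to converters; not part of any spec.
CONVERTERS = {
    b"int": lambda raw: int(raw.decode("ascii")),
    b"str": lambda raw: raw.decode("utf-8"),
}


def convert_atom(raw: bytes, hint: Optional[bytes] = None):
    """Convert a parsed atom using its hint; without a hint, keep raw bytes."""
    if hint is None:
        return raw
    try:
        converter = CONVERTERS[hint]
    except KeyError:
        raise ValueError(f"unknown type hint: {hint!r}") from None
    return converter(raw)


if __name__ == "__main__":
    print(convert_atom(b"100"))            # no hint: stays as bytes
    print(convert_atom(b"100", b"int"))    # hinted: becomes an integer
    print(convert_atom(b"hello", b"str"))  # hinted: becomes a string
```

Without a hint, the safest default is to leave the atom as bytes, which matches the reading that `3:100` is just a three-byte value until the application says otherwise.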

By @djaouen - 5 months
Imagine if we had this instead of shitty ass C or asm lol