September 22nd, 2024

The sorry state of Java deserialization

The article examines Java deserialization challenges in reading large datasets, highlighting performance issues with various methods. Benchmark tests show modern tools outperform traditional ones, emphasizing the need for optimization and custom serialization.

Read original article

The article discusses the challenges of Java deserialization, particularly in the context of efficiently reading large datasets from disk for search engine applications. The author highlights the performance issues associated with various data reading methods, including Java's traditional InputStreams, JDBC, and Protobuf. A benchmark is conducted using a dataset of 1 billion temperature measurements, comparing the performance of different data formats and reading techniques, such as Parquet, Protobuf, and custom serialization methods. The results indicate that while some methods, like DuckDB, perform well, others, particularly those relying on older Java APIs, are significantly slower. The author emphasizes the importance of optimizing data reading processes, especially when dealing with large volumes of data, as even minor performance improvements can lead to substantial time savings. The article concludes with a discussion on custom serialization strategies that can enhance performance by reducing memory allocations and improving data locality.

- Java deserialization poses significant performance challenges, especially with large datasets.

- Benchmark tests reveal that modern tools like DuckDB outperform traditional Java I/O methods.

- Optimizing data reading techniques can lead to substantial performance improvements.

- Custom serialization methods can enhance efficiency by minimizing memory allocations (a minimal sketch follows this list).

- The choice of data format and reading strategy is crucial for performance in data-intensive applications.
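The last two takeaways are easiest to see in code. Below is a minimal sketch, not the author's implementation, of a reader for a hypothetical fixed-layout binary format: each record is a 4-byte city id plus an 8-byte temperature, and everything streams through one reused buffer, so the hot loop allocates no per-record objects and stays sequential on disk. The file name, record layout, and field names are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only: a hypothetical fixed-layout format, not the article's actual code.
public class FixedLayoutReader {
    // Assumed record layout: 4-byte city id + 8-byte temperature.
    private static final int RECORD_SIZE = Integer.BYTES + Double.BYTES;

    public static void main(String[] args) throws IOException {
        Path file = Path.of("measurements.bin"); // hypothetical file name
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // One large direct buffer, reused for every read: no per-record allocation.
            ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
            double sum = 0;
            long count = 0;
            while (channel.read(buf) != -1) {
                buf.flip();
                while (buf.remaining() >= RECORD_SIZE) {
                    int cityId = buf.getInt();         // dictionary-encoded city name
                    double temperature = buf.getDouble();
                    sum += temperature;
                    count++;
                }
                buf.compact(); // carry any partial record over to the next read
            }
            System.out.printf("read %d records, mean temperature %.2f%n", count, sum / count);
        }
    }
}
```

The same shape of loop works with a memory-mapped buffer; the point is simply that the layout is known up front and nothing needs to be allocated per record.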

Related

Optimizing JavaScript for Fun and for Profit

Optimizing JavaScript code for performance involves benchmarking and avoiding unnecessary work, expensive string comparisons, and diverse object shapes. JavaScript engines optimize based on object shapes, which affects array/object methods and the cost of indirection. Creating objects with the same shape improves optimization, and the article cautions against slower functional programming patterns. Costs of indirection such as proxy objects and function calls also affect performance. Code examples and benchmarks demonstrate the variance between approaches.

Memory Efficient Data Streaming to Parquet Files

Estuary Flow has developed a 2-pass write method for streaming data into Apache Parquet files, minimizing memory usage while maintaining performance, suitable for real-time data integration and analytics.

Boosting Jackson's Serialization Performance with Quarkus

Quarkus enhances application performance by shifting tasks to build time, improving Jackson's serialization by reducing reflection, and enabling automatic custom serializer generation with Jandex and Gizmo for efficiency.

Immutable Data Structures in Qdrant

Qdrant's article highlights the benefits of immutable data structures in vector databases, improving performance and memory efficiency, while addressing challenges through mutable segments and techniques like perfect hashing and defragmentation.

When Bloom filters don't bloom (2020)

The author explored Bloom filters for IP spoofing but faced performance issues due to memory access times. A simpler hash table approach ultimately provided better performance for large datasets.

11 comments
By @charleslmunger - 7 months
I noticed:

1. Small read buffers. No reason to sequentially read and parse gigabytes only 4kb at a time.

2. parseDelimitedFrom created a new CodedInputStream on every message, which has its own internal buffer; that's why you don't see a buffered stream wrapper in the examples. Every iteration of the loop is allocating fresh 4kb byte[]s.

3. The nio protobuf code creates wrappers for the allocated ByteBuffer on every iteration of the loop.

But the real sin with the protobuf code is serializing the same city names over and over, then reading, parsing, and hashing them on every row. Making a header that maps each city string to an integer would dramatically shrink the file and speed up parsing. If that were done, your cost would essentially be the cost of decoding varints.
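A rough sketch of that header idea, using a plain DataOutputStream rather than Protobuf so it stays self-contained; the class name, file layout, and buffer size are assumptions, not code from the article or the comment. Each distinct city is written once up front, and each measurement then carries only a small integer id:

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of dictionary-encoding city names into a header (not Protobuf itself).
public class DictionaryEncodedWriter {
    public static void write(List<String> cities, List<Double> temps, File out) throws IOException {
        // Assign a stable integer id to each distinct city, in first-seen order.
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String city : cities) ids.putIfAbsent(city, ids.size());

        // Note the large write buffer, which also addresses point 1 above.
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(out), 1 << 20))) {
            dos.writeInt(ids.size());                    // header: distinct city count
            for (String city : ids.keySet()) dos.writeUTF(city);
            dos.writeLong(cities.size());                // body: (cityId, temperature) pairs
            for (int i = 0; i < cities.size(); i++) {
                dos.writeInt(ids.get(cities.get(i)));
                dos.writeDouble(temps.get(i));
            }
        }
    }
}
```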

By @mynegation - 7 months
“I admit I don’t understand these results. There’s clearly nothing in the runtime itself that prevents these types of speeds.” Oh, there is. Default Java serialization is a bit like the “pickle” module in Python, if you are familiar with it. It will deal with pretty much anything you throw at it, figuring out the data structures and offsets to serialize or parse at runtime. More efficient methods trade universality for speed: the offsets and the calls to read or write the parts of the structure are determined in advance. Also, it's hard to say without the source code, but there is a high chance that even the more efficient methods like Protobuf create a lot of Java objects, and that kills cache locality. With Java you have to go out of your way to maintain good cache locality, because you give up control over memory layout in exchange for automatic memory management.
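That trade-off can be made concrete with a small, hypothetical comparison; this is a sketch, not the article's benchmark code, and the record type and method names are invented. The ObjectInputStream path discovers the structure reflectively and allocates a Measurement per row, while the hand-written reader knows the layout in advance and allocates only the transient city string:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.Serializable;

// Hypothetical comparison of the two approaches described above.
public class ReadersCompared {
    record Measurement(String city, double temperature) implements Serializable {}

    // Universal but slow: structure is discovered at runtime, one object per record.
    static double sumWithObjectStream(InputStream in, long n) throws IOException, ClassNotFoundException {
        double sum = 0;
        try (ObjectInputStream ois = new ObjectInputStream(in)) {
            for (long i = 0; i < n; i++) {
                sum += ((Measurement) ois.readObject()).temperature();
            }
        }
        return sum;
    }

    // Layout fixed in advance: no reflection, no Measurement objects at all.
    static double sumWithDataStream(InputStream in, long n) throws IOException {
        double sum = 0;
        try (DataInputStream dis = new DataInputStream(new BufferedInputStream(in, 1 << 20))) {
            for (long i = 0; i < n; i++) {
                dis.readUTF();               // city name, skipped here
                sum += dis.readDouble();     // temperature
            }
        }
        return sum;
    }
}
```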
By @garblegarble - 7 months
I know it's bad form to comment on style instead of content, but telling smartphone enjoyers they will want to switch to horizontal mode for this article, when the code samples barely fit on desktop and the article text column shrinks to less than a third of the horizontal space, just feels disrespectful.
By @splix - 7 months
I don't think that Java serialization is designed for such a small object with just two fields. It's designed for large and complex objects. Obviously it would be slower and much larger in size than a columnar implementation designed and heavily optimized for this scenario. It's not a fair comparison, and too far from a real use case.

Try it with nested objects and at least a dozen fields across the hierarchy, and a different structure for each row. It's still not a use case for Java serialization, but at least it's closer to what real code would do.

Same for Protobuf, I guess. And the JSON serialization plays more or less the same role.

Maybe something like Avro Data Files is better for a comparison with columnar formats.
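For the sake of illustration, a hierarchy along the lines this comment suggests might look like the sketch below; the classes and fields are invented, not taken from the article. This is the kind of object graph, with nesting, shared references, and collections, that java.io.Serializable was actually built to handle:

```java
import java.io.Serializable;
import java.time.Instant;
import java.util.List;

// Invented example of a nested, many-field hierarchy,
// closer to Java serialization's intended use case.
class Station implements Serializable {
    String id, name, country, region, timezone;
    double latitude, longitude, elevation;
}

class Reading implements Serializable {
    Instant timestamp;
    double temperature, humidity, pressure;
    Station station;            // shared references and cycles are preserved by ObjectOutputStream
    List<String> qualityFlags;
}
```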

By @mike_hearn - 7 months
It'd be interesting to see Cap'n Proto profiled here, as the whole idea of that format is to eliminate deserialization overhead entirely.
By @marginalia_nu - 7 months
A repo with benchmark code is up now:

https://github.com/vlofgren/Serialization1BRCBenchmark/

By @SirYwell - 7 months
Why doesn't it mention the Java version used? A few flame graphs would be interesting as well.
By @twoodfin - 7 months
I’m probably missing something obvious, but what’s wrong with Apache parquet-java for this use case?
By @krackers - 7 months
What's the reason why reading 3GB via duckdb is _faster_ than raw HDD speed? Is it compression/caching?
By @pestatije - 7 months
Are the benchmarks done properly? What's the actual test code?