The sorry state of Java deserialization
The article examines Java deserialization challenges in reading large datasets, highlighting performance issues with various methods. Benchmark tests show modern tools outperform traditional ones, emphasizing the need for optimization and custom serialization.
The article discusses the challenges of Java deserialization, particularly in the context of efficiently reading large datasets from disk for search engine applications. The author highlights the performance issues associated with various data reading methods, including Java's traditional InputStreams, JDBC, and Protobuf. A benchmark is conducted using a dataset of 1 billion temperature measurements, comparing the performance of different data formats and reading techniques, such as Parquet, Protobuf, and custom serialization methods. The results indicate that while some tools, like DuckDB, perform well, others, particularly those relying on older Java APIs, are significantly slower. The author emphasizes the importance of optimizing data reading processes, especially when dealing with large volumes of data, as even minor performance improvements can lead to substantial time savings. The article concludes with a discussion of custom serialization strategies that can improve performance by reducing memory allocations and improving data locality.
- Java deserialization poses significant performance challenges, especially with large datasets.
- Benchmark tests reveal that modern tools like DuckDB outperform traditional Java I/O methods.
- Optimizing data reading techniques can lead to substantial performance improvements.
- Custom serialization methods can enhance efficiency by minimizing memory allocations.
- The choice of data format and reading strategy is crucial for performance in data-intensive applications.
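The "minimize allocations, improve locality" point can be illustrated with a common Java pattern (a hedged sketch, not the article's actual code): store fixed-width binary records and scan them through a single reused `ByteBuffer` rather than allocating an object or buffer per row. The record layout here (station id as `int`, temperature in tenths of a degree as `short`) is an assumption for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FixedRecordScan {
    // Each record: stationId (int, 4 bytes) + temperature in tenths (short, 2 bytes).
    static final int RECORD = Integer.BYTES + Short.BYTES;

    // Write a small sample file of fixed-width records.
    static Path writeSample(int rows) throws IOException {
        Path p = Files.createTempFile("measurements", ".bin");
        ByteBuffer buf = ByteBuffer.allocate(rows * RECORD);
        for (int i = 0; i < rows; i++) {
            buf.putInt(i % 100);            // station id
            buf.putShort((short) (i % 50)); // temperature * 10
        }
        Files.write(p, buf.array());
        return p;
    }

    // Scan the file with one large reused buffer; no per-row allocation.
    static long sumTenths(Path p) throws IOException {
        long sum = 0;
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(1 << 20); // 1 MiB, reused across reads
            while (ch.read(buf) != -1) {
                buf.flip();
                while (buf.remaining() >= RECORD) {
                    buf.getInt();           // station id (ignored in this scan)
                    sum += buf.getShort();  // temperature in tenths
                }
                buf.compact();              // carry any partial trailing record forward
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path p = writeSample(1000);
        System.out.println(sumTenths(p));
        Files.delete(p);
    }
}
```

The key property is that the hot loop touches only the one buffer, so the garbage collector sees no per-row garbage and reads stay sequential.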
Related
Optimizing JavaScript for Fun and for Profit
Optimizing JavaScript code for performance involves benchmarking, avoiding unnecessary work, string comparisons, and diverse object shapes. JavaScript engines optimize based on object shapes, impacting array/object methods and indirection. Creating objects with the same shape improves optimization, cautioning against slower functional programming methods. Costs of indirection like proxy objects and function calls affect performance. Code examples and benchmarks demonstrate optimization variances.
Memory Efficient Data Streaming to Parquet Files
Estuary Flow has developed a 2-pass write method for streaming data into Apache Parquet files, minimizing memory usage while maintaining performance, suitable for real-time data integration and analytics.
Boosting Jackson's Serialization Performance with Quarkus
Quarkus enhances application performance by shifting tasks to build time, improving Jackson's serialization by reducing reflection, and enabling automatic custom serializer generation with Jandex and Gizmo for efficiency.
Immutable Data Structures in Qdrant
Qdrant's article highlights the benefits of immutable data structures in vector databases, improving performance and memory efficiency, while addressing challenges through mutable segments and techniques like perfect hashing and defragmentation.
When Bloom filters don't bloom (2020)
The author explored Bloom filters for IP spoofing but faced performance issues due to memory access times. A simpler hash table approach ultimately provided better performance for large datasets.
1. Small read buffers. There's no reason to sequentially read and parse gigabytes of data only 4 KB at a time.
2. parseDelimitedFrom created a new CodedInputStream on every message, which has its own internal buffer; that's why you don't see a buffered stream wrapper in the examples. Every iteration of the loop is allocating fresh 4kb byte[]s.
3. The nio protobuf code creates wrappers for the allocated ByteBuffer on every iteration of the loop.
But the real sin in the protobuf code is serializing the same city names over and over, then reading, parsing, and hashing them on every row. Writing a header that maps each city string to an integer would dramatically shrink the file and speed up parsing. With that done, the cost would essentially be the cost of decoding varints.
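The dictionary idea above can be sketched in plain Java without Protobuf (an illustrative format, not the commenter's code): write each distinct city name once in a header, then emit every row as a small integer id plus a fixed-point temperature. All class, method, and field names here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionaryEncodingDemo {
    // Header: dictionary of distinct city names. Body: (cityId, temperature*10) pairs.
    static byte[] write(List<Map.Entry<String, Double>> rows) throws IOException {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (var r : rows) ids.putIfAbsent(r.getKey(), ids.size());
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(ids.size());
        for (String city : ids.keySet()) out.writeUTF(city); // each name stored once
        out.writeInt(rows.size());
        for (var r : rows) {
            out.writeInt(ids.get(r.getKey()));                       // small id, not the string
            out.writeShort((short) Math.round(r.getValue() * 10));   // fixed-point tenths
        }
        out.flush();
        return bos.toByteArray();
    }

    // Reading never re-parses a city string: ids index straight into arrays.
    static Map<String, Double> sumByCity(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        String[] dict = new String[in.readInt()];
        for (int i = 0; i < dict.length; i++) dict[i] = in.readUTF();
        long[] tenths = new long[dict.length];
        int n = in.readInt();
        for (int i = 0; i < n; i++) tenths[in.readInt()] += in.readShort();
        Map<String, Double> sums = new LinkedHashMap<>();
        for (int i = 0; i < dict.length; i++) sums.put(dict[i], tenths[i] / 10.0);
        return sums;
    }

    public static void main(String[] args) throws IOException {
        var rows = List.of(
            Map.entry("Hamburg", 12.3), Map.entry("Hamburg", 8.1),
            Map.entry("Oslo", -2.0));
        System.out.println(sumByCity(write(rows)));
    }
}
```

Per row the file now carries two small integers instead of a repeated string, and the reader aggregates by array index rather than by hashing strings.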
Try it with nested objects and at least a dozen fields across the hierarchy, with a different structure for each row. It's still not a use case for Java serialization, but it's closer to what real code would do.
The same goes for Protobuf, I guess, and JSON serialization plays much the same role.
Maybe something like Avro Data Files would make a better comparison with the columnar formats.