How to implement a hash table in C (2021)
This article explains implementing a hash table in C, covering linear/binary search, hash table design, simple hash function, collision handling, resizing, and API design. Code snippets and GitHub repository link provided.
Read original articleThis article explains how to implement a hash table data structure in C, focusing on linear and binary search methods before delving into hash table design and implementation. The author emphasizes the absence of a built-in hash table in C's standard library and presents options like using linear or binary search, utilizing existing implementations, or creating a custom hash table. The hash table implementation discussed involves using a simple hash function and linear probing to handle collisions efficiently. The article covers key concepts such as hash functions, collision resolution strategies like linear probing, resizing the hash table, and the importance of a good hash function for optimal performance. The author provides code snippets for both linear and binary search algorithms, as well as the hash table implementation. Additionally, the article touches on API design considerations for the hash table structure. The article concludes by pointing to the author's GitHub repository for the hash table implementation code and acknowledges feedback received for improving the implementation.
Related
Group Actions and Hashing Unordered Multisets
Group actions are used to analyze hash functions for unordered sets and multisets, ensuring order-agnostic hashing. By leveraging group theory, particularly abelian groups, hash functions' structure is explored, emphasizing efficient and order-independent hashing techniques.
The Byte Order Fiasco
Handling endianness in C/C++ programming poses challenges, emphasizing correct integer deserialization to prevent undefined behavior. Adherence to the C standard is crucial to avoid unexpected compiler optimizations. Code examples demonstrate proper deserialization techniques using masking and shifting for system compatibility. Mastery of these concepts is vital for robust C code, despite available APIs for byte swapping.
Benchmarking Perfect Hashing in C++
Benchmarking perfect hashing functions in C++ using clang++-19 and g++-13 reveals mph as the fastest with limitations. Various hash function implementations are compared for lookup time, build time, and size, aiding system optimization.
I _____ hate arrays in C++
The article explores challenges in using arrays in C++, focusing on array-to-pointer conversion pitfalls, differences from pointers, and practical examples of errors. Caution and awareness are advised for C++ developers.
Th64: Tiny Hash Function in C
A compact hash function "th64" on GitHub excels in performance, passing SMhasher3 tests. It achieves 43 cycles per hash for small keys and 2.61 bytes per cycle for larger keys, operating at 8.52 GiB per second. The function's C code demonstrates efficient data processing with a versatile design for different key sizes. It is available under public domain or MIT-0 license for flexible usage.
Unless I really didn't want to introduce dependencies, or reduce code size, I think I'd use an off the shelf hash table implementation these days. It's still a fun exercise building your own though.
Also, in terms of this table, if you add the computed key hash value to the stored hash entry, you can check that the value of the hashes match before you do the full strcmp. If you have a weak hash you might also get a benefit from checking that the first characters match before calling the full strcmp.
It would also make rehashing easier since you already have the full key available to you and don't have to use your internal set function to move entries into the new table. In the posted implementation the big-O semantics of a rehash are worst case.
Anyways.. "man hsearch.3" if you want something cheap and easy.
$ echo abcdef |tr abc ghi
ghidef
For an eight-bit character set, I found that building an array to map every character improved on linear search, even for short replacement strings and relatively short input strings.There isn't as easy a win for Unicode, so I played with some simple hash tables. Although the conventional wisdom is to use the high-order bits of a hash function's output, FNV-1a is not so good for short inputs of one or two bytes. (I used the 32-bit variant.) It was better just to use the low-order bits.
I know size_t wraps around on overflow so this would actually work, but this still makes me do a double take.
I am fairly certain that in C it's actually impossible to have an object whose size is larger than half of the range of size_t unless ptrdiff_t is wider than size_t, which normally isn't. Unless, of course, C standard decided to make subtracting two valid pointers into the same array (or one past the end) a potential UB, just because.
How to implement a hash table in C - https://news.ycombinator.com/item?id=26590234 - March 2021 (156 comments)
or even better, a full fledged linear probing hash lookup instruction.
"void* ht_get(...)"
Wait. What? A void pointer? Interesting... I have no clue.
I like articles like these. For someone not familiar with C it's a perfect level. In terms of explanation and the code itself.
Related
Group Actions and Hashing Unordered Multisets
Group actions are used to analyze hash functions for unordered sets and multisets, ensuring order-agnostic hashing. By leveraging group theory, particularly abelian groups, hash functions' structure is explored, emphasizing efficient and order-independent hashing techniques.
The Byte Order Fiasco
Handling endianness in C/C++ programming poses challenges, emphasizing correct integer deserialization to prevent undefined behavior. Adherence to the C standard is crucial to avoid unexpected compiler optimizations. Code examples demonstrate proper deserialization techniques using masking and shifting for system compatibility. Mastery of these concepts is vital for robust C code, despite available APIs for byte swapping.
Benchmarking Perfect Hashing in C++
Benchmarking perfect hashing functions in C++ using clang++-19 and g++-13 reveals mph as the fastest with limitations. Various hash function implementations are compared for lookup time, build time, and size, aiding system optimization.
I _____ hate arrays in C++
The article explores challenges in using arrays in C++, focusing on array-to-pointer conversion pitfalls, differences from pointers, and practical examples of errors. Caution and awareness are advised for C++ developers.
Th64: Tiny Hash Function in C
A compact hash function "th64" on GitHub excels in performance, passing SMhasher3 tests. It achieves 43 cycles per hash for small keys and 2.61 bytes per cycle for larger keys, operating at 8.52 GiB per second. The function's C code demonstrates efficient data processing with a versatile design for different key sizes. It is available under public domain or MIT-0 license for flexible usage.