AVX Bitwise ternary logic instruction busted
The blog post examines the AVX-512 vpternlogd instruction for complex Boolean logic operations, comparing it to the Amiga blitter chip and providing methods for calculating minterm values.
The blog post discusses the AVX-512 instruction set architecture, specifically focusing on the vpternlogd instruction, which performs bitwise ternary logic operations using three input sources. This instruction allows for complex Boolean logic to be executed in a single command, processing 512 bits at once. The author draws a parallel between this modern instruction and the 1985 Amiga blitter chip, which also utilized an 8-bit value to control logical operations among three bitmap sources. The post highlights the challenges programmers faced in calculating the minterm values for the Amiga blitter, often relying on common values rather than understanding the underlying logic. The author provides a method for calculating these values, which can also be applied to the vpternlogd instruction, making it easier for programmers to define complex logical functions. The post concludes with a humorous observation about the potential influence of retro computing on modern Intel documentation, particularly regarding the choice of example values.
- The vpternlogd instruction in AVX-512 allows complex Boolean logic operations using three inputs.
- The instruction operates on 512-bit registers, applying the selected Boolean function to all 512 bits at once.
- The author compares the vpternlogd instruction to the Amiga blitter chip, which also used an 8-bit value for logical operations.
- A method for calculating minterm values is provided, applicable to both the Amiga blitter and modern AVX instructions.
- The post humorously suggests a retro influence in Intel's documentation choices.
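As a rough illustration of the minterm calculation mentioned above, the 8-bit control value can be read as a truth table with one bit per combination of the three inputs. The sketch below assumes the bit ordering in which input A is the most significant index bit and C the least significant. The names truth_table_imm8, bool3_fn, example_select and example_xor3 are hypothetical helpers for this sketch, not taken from the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: a Boolean function of three 1-bit inputs. */
typedef unsigned (*bool3_fn)(unsigned a, unsigned b, unsigned c);

/* Build the 8-bit truth table: bit (a<<2 | b<<1 | c) holds f(a, b, c).
 * Assumed ordering: A is the top index bit, C the bottom one. */
static uint8_t truth_table_imm8(bool3_fn f) {
    assert(f != NULL);
    uint8_t imm = 0;
    for (unsigned idx = 0; idx < 8; ++idx) {
        unsigned a = (idx >> 2) & 1u;
        unsigned b = (idx >> 1) & 1u;
        unsigned c = idx & 1u;
        if (f(a, b, c) & 1u) {
            imm |= (uint8_t)(1u << idx);
        }
    }
    return imm;
}

/* Example function: select B where A is 1, otherwise C. */
static unsigned example_select(unsigned a, unsigned b, unsigned c) {
    return a ? b : c;
}

/* Example function: three-way XOR. */
static unsigned example_xor3(unsigned a, unsigned b, unsigned c) {
    return a ^ b ^ c;
}
```

Under these assumptions, truth_table_imm8(example_select) gives 0xCA and truth_table_imm8(example_xor3) gives 0x96.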
Related
Do not taunt happy fun branch predictor
The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.
Weird things I learned while writing an x86 emulator
The article explores writing an x86 and amd64 emulator for Time Travel Debugging, emphasizing x86 encoding, prefixes, flag behaviors, shift instructions, segment overrides, FS and GS segments, TEB structures, CPU configuration, and segment handling nuances in 32-bit and 64-bit modes.
tolower() with AVX-512
Tony Finch's blog post details the implementation of the tolower() function using AVX-512-BW SIMD instructions, optimizing string processing and outperforming standard methods, particularly for short strings.
Zen5's AVX512 Teardown and More
AMD's Zen5 architecture enhances AVX512 capabilities with native implementation, achieving 4 x 512-bit throughput, while facing thermal throttling challenges. It shows significant performance gains, especially in high-performance computing.
Zen5's AVX512 Teardown and More (Without Redacted Content)
AMD's Zen5 architecture enhances AVX512 capabilities with full 512-bit execution paths, but faces memory bandwidth limitations affecting high-performance computing. IPC improvements vary, with some workloads achieving up to 98% gains.
- Many commenters appreciate the connection between the AVX instruction and historical hardware like the Amiga blitter chip, sharing personal experiences and nostalgia.
- There is a discussion about the practicality and implementation of the instruction in compilers, with some questioning whether compilers can effectively utilize it.
- Several users highlight the concept of using lookup tables for Boolean operations, drawing parallels to FPGAs and other technologies.
- Some commenters clarify the terminology around "ternary logic," noting that it typically refers to three truth values, while the instruction handles binary logic with three inputs.
- Overall, the article is well-received, with many expressing gratitude for the informative content.
(NOT A) OR ((NOT B) XOR (C AND A))
then you simply write ~_MM_TERNLOG_A | (~_MM_TERNLOG_B ^ (_MM_TERNLOG_C & _MM_TERNLOG_A))
Literally the expression you want to calculate. It evaluates to the immediate from the _MM_TERNLOG_A/B/C constants defined in the intrinsic headers, at least for gcc & clang:
typedef enum {
_MM_TERNLOG_A = 0xF0,
_MM_TERNLOG_B = 0xCC,
_MM_TERNLOG_C = 0xAA
} _MM_TERNLOG_ENUM;
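A minimal sketch of the constant trick, checked against a scalar model rather than real AVX-512 hardware. The names TERNLOG_A/B/C, EXAMPLE_IMM, ternlog32 and example_direct are hypothetical stand-ins defined here so the sketch compiles without the intrinsic headers. The mask with 0xFF is an addition of this sketch: in C the ~ operator produces a full-width int, so the result needs trimming to 8 bits. With AVX-512F available, the resulting value would be passed as the last argument of _mm512_ternarylogic_epi32.

```c
#include <assert.h>
#include <stdint.h>

/* Locally defined stand-ins (hypothetical names) for _MM_TERNLOG_A/B/C. */
enum { TERNLOG_A = 0xF0, TERNLOG_B = 0xCC, TERNLOG_C = 0xAA };

/* (NOT A) OR ((NOT B) XOR (C AND A)), masked to 8 bits because ~ widens to int. */
enum {
    EXAMPLE_IMM = (~TERNLOG_A | (~TERNLOG_B ^ (TERNLOG_C & TERNLOG_A))) & 0xFF
};

static_assert(EXAMPLE_IMM == 0x9F, "immediate for the example expression");

/* Scalar reference model of one 32-bit lane: for every bit position the
 * three input bits form an index into the 8-bit truth table imm. */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm) {
    uint32_t r = 0;
    for (int bit = 0; bit < 32; ++bit) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        r |= (uint32_t)((imm >> idx) & 1u) << bit;
    }
    return r;
}

/* The same expression evaluated directly with ordinary bitwise operators. */
static uint32_t example_direct(uint32_t a, uint32_t b, uint32_t c) {
    return ~a | (~b ^ (c & a));
}
```

For any inputs, ternlog32(a, b, c, EXAMPLE_IMM) should match example_direct(a, b, c), which is a cheap way to sanity-check an immediate before using it in vector code.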
For MSVC you define them yourself.
In the end I did what pretty much everyone else did: found the BLTCON0 values for Bobs and straight copies and then pretended I never saw the thing.
I did however get an A+ in computational logic at university years later, so maybe some of the trauma turned out to be beneficial.
I remember there are names for some of the codes like BLACKNESS for producing black whatever the inputs are, COPY (or something like that) to just copy the source to the destination etc. I always thought BLACKNESS and WHITENESS had a kind of poetic ring to them.
As far as I know (I think this is from Petzold), it's implemented in software, but the opcode is actually converted to custom assembly inside the function when you call it, a rare example of self-modifying code in the Windows operating system.
The page in Mapping the Amiga: https://archive.org/details/1993-thomson-randy-rhett-anderso...
To take a related concept further, it would be nice if there were totally unportable, chip-superspecific ways of feeding uops directly, particularly with raw access to the unrenamed register file.
Say you have an inner loop, and a chip is popular. Let your compiler take a swing at it. If it's way faster than the ISA translation, add a special case to the fat binary for a single function.
Alas, it will probably never happen due to security, integrity, and testing costs.
In a weird sense it kind of helped me feel that, yes, I would probably understand stuff better if I tried re-learning the Amiga hardware today and also like I got a bit of it for free already! Is there such a thing as being protected from a nerd snipe? "This article was my nerd trench" ... or something. Thanks! :)
movei (%r1),(%r2),(%r3),value
Move the contents of memory pointed to by r1, to the contents of memory pointed to by r2, applying the boolean operator <value>, with the memory pointed to by r3. Then increment all three registers by 4 to point to the next word. There was something similar to this in the Intel 82786 graphics chip which had a sort of minimal cpu part that could run simple "programs".
And yeah, I really enjoyed the blitter on the Amiga. It was a really cool bit of hardware.
So many super-clever instructions are next to impossible for compilers to automatically use.
That is super normal logical calculus that any worthwhile CS degree teaches about.
Granted, probably not what a teenager without access to a BBS, or Aminet, would be able to figure out.
Come on, vpternlog* is not obscure. It subsumes _all_ bitwise instructions, even loading the constant (-1) into a register.
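To make the "subsumes all bitwise instructions" remark concrete, here is a small sketch that expands an immediate into a sum of minterms and lists the immediates for a few familiar operations. It is a scalar model of a single 32-bit lane, assuming the A=0xF0, B=0xCC, C=0xAA convention quoted earlier in the thread. The function name ternlog32_minterms and the IMM_* constants are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Immediates derived from A = 0xF0, B = 0xCC, C = 0xAA (hypothetical names). */
enum {
    IMM_ZERO   = 0x00,              /* constant 0, inputs ignored      */
    IMM_ONES   = 0xFF,              /* constant -1 (all bits set)      */
    IMM_NOT_A  = (~0xF0) & 0xFF,    /* 0x0F                            */
    IMM_AND_AB = 0xF0 & 0xCC,       /* 0xC0                            */
    IMM_OR_AB  = 0xF0 | 0xCC,       /* 0xFC                            */
    IMM_XOR_AB = 0xF0 ^ 0xCC        /* 0x3C                            */
};

static_assert(IMM_NOT_A == 0x0F, "NOT A immediate");

/* Sum-of-minterms evaluation: every set bit idx in imm contributes the
 * AND of the three inputs, each inverted where its index bit is 0. */
static uint32_t ternlog32_minterms(uint32_t a, uint32_t b, uint32_t c,
                                   uint8_t imm) {
    uint32_t r = 0;
    for (unsigned idx = 0; idx < 8; ++idx) {
        if (!((imm >> idx) & 1u)) {
            continue;
        }
        uint32_t ta = (idx & 4u) ? a : ~a;
        uint32_t tb = (idx & 2u) ? b : ~b;
        uint32_t tc = (idx & 1u) ? c : ~c;
        r |= ta & tb & tc;
    }
    return r;
}
```

With IMM_ONES every minterm is enabled, so the result is all ones whatever the inputs hold, which is the "load -1" case the comment mentions.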