October 6th, 2024

AVX Bitwise ternary logic instruction busted

The blog post examines the AVX-512 vpternlogd instruction for complex Boolean logic operations, comparing it to the Amiga blitter chip and providing methods for calculating minterm values.

The blog post discusses the AVX-512 instruction set extension, focusing on the vpternlogd instruction, which performs a bitwise logic operation on three input sources. An 8-bit immediate selects which of the 256 possible three-input Boolean functions is applied, so complex Boolean logic can be executed in a single instruction across up to 512 bits at once. The author draws a parallel between this modern instruction and the 1985 Amiga blitter chip, which also used an 8-bit value to control logical operations among three bitmap sources. The post highlights how hard programmers found it to calculate minterm values for the Amiga blitter; many copied commonly used values rather than working out the underlying logic. The author provides a method for calculating these values that applies equally to vpternlogd, making it easier to define complex logical functions. The post concludes with a humorous observation about the possible influence of retro computing on modern Intel documentation, particularly in the choice of example values.

- The vpternlogd instruction in AVX-512 allows complex Boolean logic operations using three inputs.

- The instruction operates on registers up to 512 bits wide, applying the chosen function to every bit position in parallel.

- The author compares the vpternlogd instruction to the Amiga blitter chip, which also used an 8-bit value for logical operations.

- A method for calculating minterm values is provided, applicable to both the Amiga blitter and modern AVX instructions.

- The post humorously suggests a retro influence in Intel's documentation choices.

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

Weird things I learned while writing an x86 emulator

The article explores writing an x86 and amd64 emulator for Time Travel Debugging, emphasizing x86 encoding, prefixes, flag behaviors, shift instructions, segment overrides, FS and GS segments, TEB structures, CPU configuration, and segment handling nuances in 32-bit and 64-bit modes.

tolower() with AVX-512

Tony Finch's blog post details the implementation of the tolower() function using AVX-512-BW SIMD instructions, optimizing string processing and outperforming standard methods, particularly for short strings.

Zen5's AVX512 Teardown and More

AMD's Zen5 architecture enhances AVX512 capabilities with native implementation, achieving 4 x 512-bit throughput, while facing thermal throttling challenges. It shows significant performance gains, especially in high-performance computing.

Zen5's AVX512 Teardown and More (Without Redacted Content)

AMD's Zen5 architecture enhances AVX512 capabilities with full 512-bit execution paths, but faces memory bandwidth limitations affecting high-performance computing. IPC improvements vary, with some workloads achieving up to 98% gains.

AI: What people are saying

The comments on the blog post about the AVX-512 vpternlogd instruction reveal several key themes and insights from readers.

- Many commenters appreciate the connection between the AVX instruction and historical hardware like the Amiga blitter chip, sharing personal experiences and nostalgia.

- There is a discussion about the practicality and implementation of the instruction in compilers, with some questioning whether compilers can effectively utilize it.

- Several users highlight the concept of using lookup tables for Boolean operations, drawing parallels to FPGAs and other technologies.

- Some commenters clarify the terminology around "ternary logic," noting that it typically refers to three truth values, while the instruction handles binary logic with three inputs.

- Overall, the article is well-received, with many expressing gratitude for the informative content.

27 comments

By @mmozeiko - 3 months

There is a simple way to get that immediate from the expression you want to calculate. For example, if you want to calculate the following expression:

    (NOT A) OR ((NOT B) XOR (C AND A))

then you simply write

    ~_MM_TERNLOG_A | (~_MM_TERNLOG_B ^ (_MM_TERNLOG_C & _MM_TERNLOG_A))

Literally the expression you want to calculate. It evaluates to the immediate using the _MM_TERNLOG_A/B/C constants defined in the intrinsic headers, at least for gcc & clang:

    typedef enum {
      _MM_TERNLOG_A = 0xF0,
      _MM_TERNLOG_B = 0xCC,
      _MM_TERNLOG_C = 0xAA
    } _MM_TERNLOG_ENUM;

For MSVC you define them yourself.
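A minimal sketch of why this works, assuming only the three constant values quoted above: each constant is the truth-table column of its input (bit i of 0xF0 is the A bit of row i, and so on), so applying the expression to the constants evaluates all eight rows at once. The names TL_A, TL_B, TL_C, TL_IMM and expr_direct are local to this example, not part of any header. Because `~` promotes to `int` in C, the result is masked back to 8 bits here; whether a given compiler's intrinsic accepts the unmasked value is not something this sketch checks.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as the _MM_TERNLOG_* constants quoted above; local names so
   the sketch does not depend on any intrinsic header. */
enum { TL_A = 0xF0, TL_B = 0xCC, TL_C = 0xAA };

/* (NOT A) OR ((NOT B) XOR (C AND A)), written directly on the constants.
   ~ yields a negative int, so keep only the low 8 bits of the result. */
enum { TL_IMM = (~TL_A | (~TL_B ^ (TL_C & TL_A))) & 0xFF };

static_assert(TL_IMM == 0x9F, "immediate for the example expression");

/* Scalar reference for one 32-bit lane: the same expression on real data. */
static uint32_t expr_direct(uint32_t a, uint32_t b, uint32_t c) {
    return ~a | (~b ^ (c & a));
}
```

With the real intrinsics, TL_IMM would be the value passed as the immediate argument.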

By @Sniffnoy - 3 months

Oh, I thought the title was saying that the instruction doesn't work properly! (The article actually just explains how it works.)

By @Lerc - 3 months

My teenage self did not write "CRAP!" on that page of the hardware manual, but I stared at it for so long trying to figure it out.

In the end I did what pretty much everyone else did: found the BLTCON0 for Bobs and straight copies and then pretended I never saw the thing.

I did however get an A+ in computational logic at university years later, so maybe some of the trauma turned out to be beneficial.

By @cubefox - 3 months

About the title: "Ternary logic" usually means "logic with three truth values". But this piece covers a CPU instruction which handles all binary logic gates with three inputs.

By @red_admiral - 3 months

Is this similar to the Windows (since at least 3.1 I think?) BitBlt function, that takes an `op` parameter to decide how to combine the source, destination and mask?

I remember there are names for some of the codes like BLACKNESS for producing black whatever the inputs are, COPY (or something like that) to just copy the source to the destination etc. I always thought BLACKNESS and WHITENESS had a kind of poetic ring to them.

As far as I know (I think this is from Petzold), it's implemented in software, but the opcode is actually converted to custom assembly inside the function when you call it, a rare example of self-modifying code in the Windows operating system.

By @kens - 3 months

I'll point out that this is the same way that FPGAs implement arbitrary logic functions, as lookup tables (LUTs).
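The lookup-table view can be sketched in scalar C. This is a slow, bit-by-bit emulation of a single 32-bit lane for illustration, not how the hardware or any library implements it; the function name ternlog32 is made up for this example. For every bit position, the three input bits form a 3-bit index (A as the high bit, matching the 0xF0/0xCC/0xAA constants) that selects one bit of the 8-bit table.

```c
#include <assert.h>
#include <stdint.h>

/* Emulate one 32-bit lane of a three-input bitwise LUT. */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8) {
    uint32_t out = 0;
    for (int i = 0; i < 32; i++) {
        /* Build the table index from bit i of each input: A B C -> 0..7. */
        unsigned idx = (((a >> i) & 1u) << 2)
                     | (((b >> i) & 1u) << 1)
                     |  ((c >> i) & 1u);
        /* Copy the selected table bit into bit i of the result. */
        out |= (uint32_t)((imm8 >> idx) & 1u) << i;
    }
    return out;
}
```

For example, a table of 0x96 gives the three-way XOR and 0x80 the three-way AND.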

By @anon2024user - 3 months

Head over to https://www.sandpile.org, and find VPTERNLOG on the 3-byte opcode page https://www.sandpile.org/x86/opc_3.htm and you will not only see Intel's apparent past plan for the variants with byte and word masking (AVX512BITALG2), but also the links from the Ib operand to the ternary logic table page https://www.sandpile.org/x86/ternlog.htm with all 256 cases.

By @abecedarius - 3 months

Re the choice of function "E2" for the example in the docs: it's sort of the most basic, canonical boolean function on 3 inputs, named mux: A if B else C. It's universal -- you don't need to be an Amiga fan to pick it (though for all I know they might've been).
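As a quick check of that reading, the mux can be written on the 0xF0/0xCC/0xAA truth-table columns and does come out as 0xE2. The sketch below uses names local to this example (TL_MUX, bit_select) and also shows the same select on whole words, where a mask plays the role of B.

```c
#include <assert.h>
#include <stdint.h>

/* Truth-table columns for the three inputs. */
enum { TL_A = 0xF0, TL_B = 0xCC, TL_C = 0xAA };

/* "A if B else C": take A where B is 1 and C where B is 0. */
enum { TL_MUX = ((TL_B & TL_A) | (~TL_B & TL_C)) & 0xFF };

static_assert(TL_MUX == 0xE2, "mux truth table");

/* The same select applied to 32-bit words; mask stands in for B. */
static uint32_t bit_select(uint32_t a, uint32_t mask, uint32_t c) {
    return (mask & a) | (~mask & c);
}
```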

By @fallingsquirrel - 3 months

Another example of packing bitwise ops into an integer is win32's GDI ROP codes: https://learn.microsoft.com/en-us/windows/win32/gdi/ternary-...

By @Findecanor - 3 months

I didn't have the official Amiga hardware manual, but instead the book "Mapping the Amiga". It said the same thing in a slightly more verbose way. I don't remember which minterms I used back then but I think I managed to work things out from this book to do shadebobs, bobs, XOR 3D line drawing and other things.

The page in Mapping the Amiga: https://archive.org/details/1993-thomson-randy-rhett-anderso...

By @leogao - 3 months

Nvidia SASS has a similar instruction too (LOP3.LUT)

By @worstspotgain - 3 months

It's nice that they're finally starting to "compress" the instruction space.

To take a related concept further, it would be nice if there were totally unportable, chip-superspecific ways of feeding uops directly, particularly with raw access to the unrenamed register file.

Say you have an inner loop, and a chip is popular. Let your compiler take a swing at it. If it's way faster than the ISA translation, add a special case to the fat binary for a single function.

Alas, it will probably never happen due to security, integrity, and testing costs.

By @unwind - 3 months

As someone who fits the description rather too well (although neither my teenage nor current self would ever use a marker in the Hardware Reference, omg) this was really nice and satisfying to read.

In a weird sense it kind of helped me feel that, yes, I would probably understand stuff better if I tried re-learning the Amiga hardware today and also like I got a bit of it for free already! Is there such a thing as being protected from a nerd snipe? "This article was my nerd trench" ... or something. Thanks! :)

By @pwrrr - 3 months

Holy cow. I remember reading that page in the Amiga reference manual, thinking it was utter crap and made up my own way of calculating the value (which worked, lol).

By @makapuf - 3 months

In fact that means that there is a dedicated AVX instruction for Elementary cellular automaton (https://en.wikipedia.org/wiki/Elementary_cellular_automaton).
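The correspondence can be sketched in scalar C: the Wolfram rule number is an 8-bit table indexed by (left, center, right), which is the same layout as a three-input table indexed by (A, B, C). The code below is a bit-by-bit emulation on a 32-cell ring, not real AVX-512 code; eca_step is a name invented for this example, and it assumes cells are drawn most significant bit first, so the left neighbour of bit i is bit i+1. On a real vector register the shifted neighbour rows would still have to be produced across lanes, which this sketch does not address.

```c
#include <assert.h>
#include <stdint.h>

/* One generation of an elementary cellular automaton on a 32-cell ring. */
static uint32_t eca_step(uint32_t cells, uint8_t rule) {
    /* Rotate so that bit i of each word holds the neighbour of cell i. */
    uint32_t left  = (cells >> 1) | (cells << 31);  /* old bit i+1 */
    uint32_t right = (cells << 1) | (cells >> 31);  /* old bit i-1 */
    uint32_t next = 0;
    for (int i = 0; i < 32; i++) {
        /* Index the rule table with (left, center, right), left highest. */
        unsigned idx = (((left  >> i) & 1u) << 2)
                     | (((cells >> i) & 1u) << 1)
                     |  ((right >> i) & 1u);
        next |= (uint32_t)((rule >> idx) & 1u) << i;
    }
    return next;
}
```

Starting from a single live cell, rule 90 (left XOR right) lights the two neighbours, and rule 30 produces three live cells in a row.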

By @ChuckMcM - 3 months

This is an instruction I would like to implement in RISC-V if it isn't already, (which yeah, I know, isn't very RISC like)

   movei (%r1),(%r2),(%r3),value

Move the contents of memory pointed to by r1, to the contents of memory pointed to by r2, applying the boolean operator <value>, with the memory pointed to by r3. Then increment all three registers by 4 to point to the next word. There was something similar to this in the Intel 82786 graphics chip which had a sort of minimal cpu part that could run simple "programs".

And yeah, I really enjoyed the blitter on the Amiga. It was a really cool bit of hardware.

By @notfed - 3 months

Couldn't every Boolean operation be "busted" as a lookup table?

By @londons_explore - 3 months

Do compilers actually output this instruction?

So many super-clever instructions are next to impossible for compilers to automatically use.

By @pjmlp - 3 months

> The Amiga blitter user manual didn’t help much either. The “Amiga Hardware Reference Manual” from 1989 tried to explain minterm calculation using confusing symbols, which frustrated many young demo makers at the time.

That is super normal logical calculus that any worthwhile CS degree teaches about.

Granted, probably not what a teenager without access to a BBS, or Aminet, would be able to figure out.

By @transfire - 3 months

Great little article! Thank you.

By @ggerules - 3 months

It looks like someone paid attention in their undergraduate Discrete Math class.

By @stevefan1999 - 3 months

If you want to calculate the minterms why don't you just get a K-Map?

By @486sx33 - 3 months

it’s fundamentally just a lookup table

By @hvenev - 3 months

> an obscure instruction

Come on, vpternlog* is not obscure. It subsumes _all_ bitwise instructions, even loading the constant (-1) into a register.
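To illustrate the "subsumes all bitwise instructions" point, here is a small table of immediates derived from the 0xF0/0xCC/0xAA truth-table columns. The enumerator names are made up for this example; the values follow directly from applying each operation to the columns. Two-input operations simply ignore C, and the constant tables ignore all three inputs.

```c
#include <assert.h>

/* Truth-table columns for the three inputs. */
enum { TL_A = 0xF0, TL_B = 0xCC, TL_C = 0xAA };

enum {
    IMM_ZERO     = 0x00,                      /* always 0                */
    IMM_ONES     = 0xFF,                      /* always 1, i.e. -1       */
    IMM_NOT_A    = ~TL_A & 0xFF,              /* bitwise NOT of A        */
    IMM_AND_AB   = TL_A & TL_B,               /* A AND B, C ignored      */
    IMM_OR_AB    = TL_A | TL_B,               /* A OR B, C ignored       */
    IMM_XOR_AB   = TL_A ^ TL_B,               /* A XOR B, C ignored      */
    IMM_ANDN_AB  = (~TL_A & 0xFF) & TL_B,     /* (NOT A) AND B           */
    IMM_XOR3     = TL_A ^ TL_B ^ TL_C         /* three-way XOR           */
};

static_assert(IMM_NOT_A   == 0x0F, "NOT A");
static_assert(IMM_AND_AB  == 0xC0, "A AND B");
static_assert(IMM_OR_AB   == 0xFC, "A OR B");
static_assert(IMM_XOR_AB  == 0x3C, "A XOR B");
static_assert(IMM_ANDN_AB == 0x0C, "(NOT A) AND B");
static_assert(IMM_XOR3    == 0x96, "A XOR B XOR C");
```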
