Gareth wrote:
As Henry Baker says, Knuth has proposed something very much along these lines: see http://mmix.cs.hm.edu/doc/instructions-en.html#Bit_Operations and look in the section headed "Mix and Match". His MXOR operation is an 8x8 matrix multiplication over GF(2), and his MOR operation is the same with OR instead of XOR for the addition.
It's matrix * matrix rather than matrix * vector, and the size is chosen so that the whole matrix (rather than just one row or column) fits in a (64-bit) machine word.
Since MXOR doesn't exist, what's the most efficient way to emulate it using existing instructions (representing matrices as uint64s)? I tried the obvious place http://www.hackersdelight.org/hdcode.htm to no avail. That contains a nice implementation to transpose such a matrix, but not to multiply matrices. A.B^T is easier to calculate than A.B, I believe. In particular, you can take the eight uint64s, where >>> indicates cyclic shift: C_0 = A & B C_1 = A & (B >>> 8) C_2 = A & (B >>> 16) ... C_7 = A & (B >>> 56) and apply the following operations in parallel (using SSE or AVX): C_i ^= (C_i >> 1) C_i ^= (C_i >> 2) C_i ^= (C_i >> 4) C_i &= 1 which will result in 64 bytes, each of which contains either 0 or 1. We want to pack these bits into a single uint64, like so: C = C_0 | (C_1 << 1) | (C_2 << 2) | ... | (C_7 << 7) and possibly do some annoying permutation on these bits. Just out of interest, *why* isn't MXOR in the Intel instruction set? It doesn't seem too complicated to implement, and it's useful both in cryptography and data analysis (see 'persistent homology'). Best wishes, Adam P. Goucher