[math-fun] bitboard dihedral group (crazy hackers please?)
Since mathfunners have impressed me with their crazy-bit-twiddling capabilities, here's a challenge. In chess programming, they often employ 8x8 "bitboards" -- i.e. 64-bit machine words. A bitboard question which I think has never been fully answered, is this. There are 8 ways, given a 8x8 chessboard, to rotate and flip it (dihedral group D4). For each of the 7 non-identity ways, find the fastest algorithm you can that performs that bit-permutation on a 64-bit word. Henry Warren's book "Hacker's delight" already discussed the 8x8 matrix transpose, which is one of these 7 problems. He ended up with the following incredible algorithm: uint64 TransposeBitBd(BitBd x){ //Adapted from Warren:: Hacker's delight. uint64 t; t = (x ^ (x >> 7)) & U64(0x00AA00AA00AA00AA); x = x ^ t ^ (t << 7); t = (x ^ (x >> 14)) & U64(0x0000CCCC0000CCCC); x = x ^ t ^ (t << 14); t = (x ^ (x >> 28)) & U64(0x00000000F0F0F0F0); x = x ^ t ^ (t << 28); return(x); } Good luck figuring it out. Another pointed out by Jorg Arndt is: the single machine instruction "bswap" (on many processors) is available which permutes the 8 bytes into reverse order -- this causes the board to flip vertically (or horizontally, depending how you'd ordered its bits in a raster). Another (180 degree board rotate) is the bit-reversal permutation, for which an extremely fast and mysterious algorithm is given by Don Knuth and you can go even faster on machines with bswap, we discussed this problem in earlier posts. For 90 and 270 degree board rotations with or without accompanying flips, I can do it in O(N) operations on an NxN bitboard in various fairly obvious ways that involve multiplications and maskings. Is it possible to achieve o(N) somehow? If the squares of the board were numbered 0..63 in a peculiar manner, e.g. the top left quadrant is numbered 0..15 then the other quadrants are numbered 16..31, 32..47, 48..63 in symmetric (rotated) ways, then you could perform a 90, 180, or 270 degree rotation trivially by simply doing a single circular-shift-by-16 (32, 48) hops on the 64-bit word. Probably this strange kind of numbering would be unfriendly for most computer chess purposes (?), but for this purpose it's great. More precisely, I believe there is an ordering of the 64 chessboard squares 0..63 (and there are sub-orderings of the 4x4 board 0..15 and 2x2 board 0..3) with these properties: 90 rotate: add 16 mod 64, aka circular shift by 16 180 rotate: add 32 mod 64 270 rotate: subtract 16 mod 64 For the 4x4: 90 rotate: add 4 mod 16, aka circular shift by 4 180 rotate: add 8 mod 16 270 rotate: subtract 4 mod 16 For the 2x2 where the ordering is 32 01: 90 rotate: add 1 mod 4, aka circular shift by 1 180 rotate: add 2 mod 4 270 rotate: subtract 1 mod 4 flip horizontally: XOR with 01 flip vertically: XOR with 11 And back on the full 8x8: flip board: permute the four 16-bit subwords as ABCD -> BADC or ABCD -> CDAB, and then within each 16-bit subword, perform a 4x4 flip. This in turn is accomplished recursively by the same operation on its four 4-bit subwords, then within each do a 2x2 flip as described above. So, my head is hurting, but it seems as though by means of this kind of numbering scheme, it becomes possible on an NxN bitboard (N=power of 2) to implement the 4 rotation operations in O(1) steps each, and the flip operation in O(logN) steps. I'll only believe the latter when I see more details, so for now regard this as a sketch/conjecture. It might be you cannot "parallelize" it and hence cannot achieve logN, instead getting something sadder like N or sqrt(N) steps. Anyhow, the original dihedral-7 problem is clearly not a triviality. -- Warren D. Smith http://RangeVoting.org
For a computer with N-bit word operations, any permutation of the bits can be constructed as 2 ceiling(log2(N)) - 1 steps. Each step pairs the bits, and conditionally swaps some of the pairs. The swaps are controlled by nuisance- level-hard-to-compute bit masks. The swapping steps use few-instruction sequences like people have been discussing. For a 32-bit word, the pair separations can be 1,2,4,8,16, 8,4,2,1. This would apply to the problem of rotating an 8x8 bit matrix by 90 degrees, though I think there's likely something better for such a regular rearrangement. (Rearrange quarters, recurse.) Two related problems: a) Instead of a permutation, suppose we want a map which makes extra copies of some bits, while losing others. What's a good way to do this? Can it be built similarly to the scheme above? That above scheme uses (compiled conditional) two-bit swap operations, with a small set of bit separations. Suppose we add an additional 2bit->2bit primitive, which copies one input bit into both outputs? b) Telephone crossbar problem: The masks for controlling the bit permutation depend in a complicated way on the desired permutation. A small change in the permutation can require completely different masks. This isn't a big deal if we pre-compile the masks for a particular permutation, but ... Suppose we want to be able to change our permutation dynamically on the fly? Perhaps by adding an extra swap at the beginning or end, or by marking a couple of inputs as 'disconnected' and then reconnecting them differently. Is there a good scheme with this flexibility? Rich ----- Quoting Warren D Smith <warren.wds@gmail.com>:
Since mathfunners have impressed me with their crazy-bit-twiddling capabilities, here's a challenge.
In chess programming, they often employ 8x8 "bitboards" -- i.e. 64-bit machine words.
A bitboard question which I think has never been fully answered, is this. There are 8 ways, given a 8x8 chessboard, to rotate and flip it (dihedral group D4). For each of the 7 non-identity ways, find the fastest algorithm you can that performs that bit-permutation on a 64-bit word.
Henry Warren's book "Hacker's delight" already discussed the 8x8 matrix transpose, which is one of these 7 problems. He ended up with the following incredible algorithm:
uint64 TransposeBitBd(BitBd x){ //Adapted from Warren:: Hacker's delight. uint64 t; t = (x ^ (x >> 7)) & U64(0x00AA00AA00AA00AA); x = x ^ t ^ (t << 7); t = (x ^ (x >> 14)) & U64(0x0000CCCC0000CCCC); x = x ^ t ^ (t << 14); t = (x ^ (x >> 28)) & U64(0x00000000F0F0F0F0); x = x ^ t ^ (t << 28); return(x); }
Good luck figuring it out.
Another pointed out by Jorg Arndt is: the single machine instruction "bswap" (on many processors) is available which permutes the 8 bytes into reverse order -- this causes the board to flip vertically (or horizontally, depending how you'd ordered its bits in a raster).
Another (180 degree board rotate) is the bit-reversal permutation, for which an extremely fast and mysterious algorithm is given by Don Knuth and you can go even faster on machines with bswap, we discussed this problem in earlier posts.
For 90 and 270 degree board rotations with or without accompanying flips, I can do it in O(N) operations on an NxN bitboard in various fairly obvious ways that involve multiplications and maskings. Is it possible to achieve o(N) somehow?
If the squares of the board were numbered 0..63 in a peculiar manner, e.g. the top left quadrant is numbered 0..15 then the other quadrants are numbered 16..31, 32..47, 48..63 in symmetric (rotated) ways, then you could perform a 90, 180, or 270 degree rotation trivially by simply doing a single circular-shift-by-16 (32, 48) hops on the 64-bit word. Probably this strange kind of numbering would be unfriendly for most computer chess purposes (?), but for this purpose it's great.
More precisely, I believe there is an ordering of the 64 chessboard squares 0..63 (and there are sub-orderings of the 4x4 board 0..15 and 2x2 board 0..3) with these properties: 90 rotate: add 16 mod 64, aka circular shift by 16 180 rotate: add 32 mod 64 270 rotate: subtract 16 mod 64
For the 4x4: 90 rotate: add 4 mod 16, aka circular shift by 4 180 rotate: add 8 mod 16 270 rotate: subtract 4 mod 16
For the 2x2 where the ordering is 32 01: 90 rotate: add 1 mod 4, aka circular shift by 1 180 rotate: add 2 mod 4 270 rotate: subtract 1 mod 4 flip horizontally: XOR with 01 flip vertically: XOR with 11
And back on the full 8x8: flip board: permute the four 16-bit subwords as ABCD -> BADC or ABCD -> CDAB, and then within each 16-bit subword, perform a 4x4 flip. This in turn is accomplished recursively by the same operation on its four 4-bit subwords, then within each do a 2x2 flip as described above.
So, my head is hurting, but it seems as though by means of this kind of numbering scheme, it becomes possible on an NxN bitboard (N=power of 2) to implement the 4 rotation operations in O(1) steps each, and the flip operation in O(logN) steps. I'll only believe the latter when I see more details, so for now regard this as a sketch/conjecture. It might be you cannot "parallelize" it and hence cannot achieve logN, instead getting something sadder like N or sqrt(N) steps.
Anyhow, the original dihedral-7 problem is clearly not a triviality.
-- Warren D. Smith http://RangeVoting.org
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com http://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
* Warren D Smith <warren.wds@gmail.com> [Apr 18. 2014 18:50]:
[...]
Henry Warren's book "Hacker's delight" already discussed the 8x8 matrix transpose, which is one of these 7 problems. He ended up with the following incredible algorithm:
uint64 TransposeBitBd(BitBd x){ //Adapted from Warren:: Hacker's delight. uint64 t; t = (x ^ (x >> 7)) & U64(0x00AA00AA00AA00AA); x = x ^ t ^ (t << 7); t = (x ^ (x >> 14)) & U64(0x0000CCCC0000CCCC); x = x ^ t ^ (t << 14); t = (x ^ (x >> 28)) & U64(0x00000000F0F0F0F0); x = x ^ t ^ (t << 28); return(x); }
Good luck figuring it out.
Hint: ? n=8; M=matrix(n,n,r,c,n*(r-1)+c-1) [ 0 1 2 3 4 5 6 7] [ 8 9 10 11 12 13 14 15] [16 17 18 19 20 21 22 23] [24 25 26 27 28 29 30 31] [32 33 34 35 36 37 38 39] [40 41 42 43 44 45 46 47] [48 49 50 51 52 53 54 55] [56 57 58 59 60 61 62 63] ? M - M~ \\ tilde is "transpose" in Pari [ 0 -7 -14 -21 -28 -35 -42 -49] [ 7 0 -7 -14 -21 -28 -35 -42] [14 7 0 -7 -14 -21 -28 -35] [21 14 7 0 -7 -14 -21 -28] [28 21 14 7 0 -7 -14 -21] [35 28 21 14 7 0 -7 -14] [42 35 28 21 14 7 0 -7] [49 42 35 28 21 14 7 0] Which suggests (DO see https://oeis.org/draft/A241216 !) the partition [7, 14, 28] right away. Exercise: Transpose 4 x 4 in two steps: ? n=4; M=matrix(n,n,r,c,n*(r-1)+c-1) [ 0 1 2 3] [ 4 5 6 7] [ 8 9 10 11] [12 13 14 15] ? M - M~ [0 -3 -6 -9] [3 0 -3 -6] [6 3 0 -3] [9 6 3 0] How many steps does one apparently need for n x n (using min-step m, and 2*m, 4*m, 8*m, ... as far as needed) ?
[...]
Best, jj
* Joerg Arndt <arndt@jjj.de> [Apr 19. 2014 07:37]:
* Warren D Smith <warren.wds@gmail.com> [Apr 18. 2014 18:50]:
[...]
Henry Warren's book "Hacker's delight" already discussed the 8x8 matrix transpose, which is one of these 7 problems. He ended up with the following incredible algorithm:
uint64 TransposeBitBd(BitBd x){ //Adapted from Warren:: Hacker's delight. uint64 t; t = (x ^ (x >> 7)) & U64(0x00AA00AA00AA00AA); x = x ^ t ^ (t << 7); t = (x ^ (x >> 14)) & U64(0x0000CCCC0000CCCC); x = x ^ t ^ (t << 14); t = (x ^ (x >> 28)) & U64(0x00000000F0F0F0F0); x = x ^ t ^ (t << 28); return(x); }
Good luck figuring it out.
[...]
It is also a good idea to look at the binary bit masks: ulong transpose_bit_board(ulong x) { // from "Hacker's delight": transpose8rS64(...) from file transpose8.c ulong t; t = (x ^ (x >> 7)) & 0x00aa00aa00aa00aaUL; x = x ^ t ^ (t << 7); // mask = 0b0000000010101010000000001010101000000000101010100000000010101010UL t = (x ^ (x >> 14)) & 0x0000cccc0000ccccUL; x = x ^ t ^ (t << 14); // mask = 0b0000000000000000110011001100110000000000000000001100110011001100UL t = (x ^ (x >> 28)) & 0x00000000f0f0f0f0UL; x = x ^ t ^ (t << 28); // mask = 0b0000000000000000000000000000000011110000111100001111000011110000UL return x; } Now break the masks up into boards (dots for zeros) ........ 1.1.1.1. ........ 1.1.1.1. ........ 1.1.1.1. ........ 1.1.1.1. ........ ........ 11..11.. 11..11.. ........ ........ 11..11.. 11..11.. ........ ........ ........ ........ 1111.... 1111.... 1111.... 1111.... And that should make things evident. Use a quarter of the first two masks as 16 bit words to get the routine I asked for in my prior message. Btw. the largest n such that an n x n matrix can be transposed in three steps (assuming words of n^2 bits) should be 14. 16 x 16 (256 bit words) sadly need 4. Best, jj
What if Knuth's MMIX instructions MOR and MXOR were available? http://mmix.cs.hm.edu/doc/instructions-en.html#MOR What if we also had similar operations on entire bytes ("BMOR" and "BMXOR")? Just a small matter of wires and gates (and doubtless << IEEE 754 support). We accept quaint instruction sets much as we do the quaint rules of chess. But needn't always, and, I hope, shan't someday.
participants (4)
-
Joerg Arndt -
Marc LeBrun -
rcs@xmission.com -
Warren D Smith