I've investigated Warren's permutation problem in several different guises. I studied how to implement arbitrary permutations in APL using the APL transpose & restructure operations. If we have an n-dimensional array in APL, transpose can exchange indices; if we have a vector in APL, restructure can give it a new interpretation as an n-dimensional array. I also studied how to implement arbitrary permutations on the top k elements of a stack (in a stack machine). And then there's also Warren's bit-permutation problem. The basic problem in all of these permutation problems is straight-forward information theory: how many bits does it take to specify which permutation you're talking about in the first place. If all permutations are equally likely than we have log(n!) ~ O(n*log(n)) bits. So even if you wanted to specify an arbitrary permutation of the 64 bits in a 64-bit word, you would need 296 bits (= log2(64!)). At 10:33 AM 6/8/2012, Warren Smith wrote:
It is not so easy to permute bits, actually.
Say you have a W-bit word (W=64?). You want to permute the lower-significance K bits in some arbitrary known manner. How do you do it? Here is one way:
Step 1: multiply by 100001000010000... where the period is K. Carryless or regular multiply both work here and below.
Step 2: AND with some magic bit mask with one 1-bit per K-chunk chosen to get exactly one copy of each original bit, one per chunk, ordered in arbitrary known order.
Step 3: multiply by the correct magic sparse-1's number so that the correct permuted K bits magically appear in the most-signif K bits of the product word. It is possible here that if you are using regular (not carryless) multiply, you will need about log(K) buffer bits between the top K bits used for output, and the bottom K*K bits used for input.
Step 4: shift right by W-K+BuffSize hops to reposition in original K bit locations.
This works if BuffSize+K*K<=W, so if W=64 we can permute 8 bits with carryless, and maybe somewhat fewer with regular, multiply.
That sucks. I mean, to permute 8 bits I could've just used a 256-byte lookup table.
Looking in the FxtBook, http://www.jjj.de/fxt/fxtbook.pdf section 1.29 fails to answer the question though it does suggest future machine instructions that would.