Re: [math-fun] Unusual ways to count thru all w-bit numbers
From: Mike Stay <metaweta@gmail.com>

Klimov and Shamir have a few papers on what they call "T-functions" and describe cheap ones with maximal period.
"In particular, we show that for any n the mapping x ? x + (x? ? C) (mod 2?) is a permutation with a single cycle of length 2? iff both the least significant bit and the third least significant bit in the constant C are 1."
A. Klimov + A. Shamir (their Theorem 3): the mapping x --> x + ((x^2) OR C), using mod-2^w arithmetic, is a permutation of the w-bit words consisting of a single cycle of length 2^w, iff the least-significant and third-least-significant bits of the constant C are both 1. The foremost example would be C=5, and they tried that with W=64-bit words, finding that the top 32 bits of each word passed the NIST pseudorandomness test suite. Nice theorem.

However, this function still features "unidirectional info flow" among its bits (each output bit depends only on input bits of equal or lower significance), hence is a poor pseudo-random number generator, despite whatever Klimov+Shamir say to the contrary. And the NIST suite has only very basic tests; it is not a serious hurdle like the TestU01 suite. But I do not need any suite at all to see that this violates my own obvious randomness test, since the lower-significance bits have lower periods.

On the other hand, the unidirectional info flow property makes it far easier to prove theorems like theirs -- their immense verbiage obscures this simple idea, which lies behind all the proofs by both them and many others. (Furthermore, the same idea, but thinking mixed-radix rather than binary, enables proving most of the other full-period theorems out there that work mod M for moduli M other than powers of 2.)

Another paper by Klimov+Shamir remarks that x --> x + 1 + 4*x*x (mod 2^W) yields a single cycle and may be computed using only 2 machine instructions: one to compute y=x*x, the other to compute x+1+4*y using the x86 "lea" (load effective address) instruction. Bet your compiler won't realize that. The x + ((x^2) OR C) map took 3 instructions, which your compiler probably will realize.

They say a polynomial map x --> FixedPolynomial(x) (mod 2^W) for W>3 yields a single cycle iff it has a single cycle modulo 8, i.e. works for W=3. A lovely theorem. They claim a fraction 1/64 of random polynomials obey that.

They then move on to iterations with vector state, i.e. not just one word. For example, they say the map on 4-vectors of W-bit machine words

  [x0, x1, x2, x3] --> [x0 XOR s XOR (2*x1*x2),
                        x1 XOR (s AND a0) XOR (2*x2*x3),
                        x2 XOR (s AND a1) XOR (2*x3*x0),
                        x3 XOR (s AND a2) XOR (2*x0*x1)]

where a0 = x0, a1 = a0 AND x1, a2 = a1 AND x2, a3 = a2 AND x3, s = (a3 + C) XOR a3, and C is any odd constant, will yield a single cycle of full length 16^W. They have several other examples, and these can be implemented to run at very high speeds by using some SSE machine-language tricks.

It is possible to somewhat disguise the poor low-period behavior of the low-significance bits produced by these and other such iterations by not using their outputs X directly, but rather using f(X) for some appropriately-chosen bijective "clean-up function" f. A disguise based on XORing various bits, for example, could be used, but good randomness tests such as bit-matrix rank tests should be able to see through such a disguise. A harder disguise to penetrate might be something like "reverse the order of X's bits to get Y, then output a bijective quadratic function of Y."
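To make the low-bit objection concrete, here is a minimal C sketch (my own illustration, not code from the Klimov+Shamir papers; the word size W=16 and constant C=5 are arbitrary small choices) that checks the single-cycle claim exhaustively and prints the period of the k low-order bits:

#include <stdio.h>
#include <stdint.h>

#define W     16                 /* small word size so the full cycle fits in a loop */
#define MASK  ((1u << W) - 1u)
#define C     5u                 /* bits 0 and 2 set, as the theorem requires */

/* One step of x --> x + ((x^2) OR C)  (mod 2^W). */
static uint32_t step(uint32_t x)
{
    return (x + ((x * x) | C)) & MASK;
}

int main(void)
{
    /* Exhaustive check: the orbit of 0 should have length exactly 2^W. */
    uint32_t x = 0, len = 0;
    do { x = step(x); len++; } while (x != 0);
    printf("cycle length = %u (2^%d = %u)\n", len, W, 1u << W);

    /* Because this is a T-function, the k low-order bits of the output
       depend only on the k low-order bits of the input, so the low-k-bit
       stream evolves as the same map reduced mod 2^k and repeats with
       period just 2^k, far below the full period 2^W. */
    for (int k = 1; k <= 8; k++) {
        uint32_t lowmask = (1u << k) - 1u;
        uint32_t y = 0, t = 0;
        do { y = step(y); t++; } while ((y & lowmask) != 0);
        printf("low %2d bit(s): period %u\n", k, t);
    }
    return 0;
}

With these parameters it reports cycle length 65536, while the k low-order bits repeat with period 2^k -- exactly the "lower-significance bits have lower periods" failure.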
* Warren D Smith <warren.wds@gmail.com> [Apr 08. 2015 17:38]:
[...]
Another paper by Klimov+Shamir remarks that x --> x + 1 + 4*x*x (mod 2^W) yields a single cycle and may be computed using only 2 machine instructions, one to compute y=x*x, the other to compute x+1+4*y using the x86 "lea" (load effective address) instruction. Bet your compiler won't realize that.
cat foo.cc

unsigned foo(unsigned x)
{
    unsigned y = x*x;
    return x + 1 + 4*y;
}

g++ -fverbose-asm -S -g -O2 -Wall -Werror -c -o asm.s foo.cc
as -alhnd asm.s > asm.lst
rm asm.s
less asm.lst

[...]
   9:foo.cc **** unsigned foo(unsigned x)
  10:foo.cc **** {
  64                .loc 1 10 0
  65                .cfi_startproc
  66                .LVL0:
  67                .LBB2:
  11:foo.cc ****     unsigned y = x*x;
  68                .loc 1 11 0
  69 0000 89F8      movl  %edi, %eax    # x, y
  70 0002 0FAFC7    imull %edi, %eax    # x, y
  71                .LVL1:
  12:foo.cc ****     return x + 1 + 4*y;
  72                .loc 1 12 0
  73 0005 8D448701  leal  1(%rdi,%rax,4), %eax   #, D.2236
  74                .LVL2:
  75                .LBE2:
  13:foo.cc **** }
[...]

And compilers do MUCH more sophisticated things than this.

Best regards, jj
[...]
I wouldn't confuse machine instruction encoding with machine instruction execution. There are some machines that "compile" standardly-encoded instructions on-the-fly into the instruction cache as very different RISC/VLIW instructions.* Thus, a lot of compiler cleverness (e.g., LEA) is wasted, because both the clever and the non-clever instruction encodings end up as the same bit patterns (and execution speeds) in the instruction cache. So unless the instructions are never already in the instruction cache by the time they are actually executed, there isn't much penalty for the non-clever encoding.

Note also that many machines do relatively aggressive pre-fetch on instructions, because the penalty for extra speculation on instructions isn't very high. On newer Intel processors with execute-only pages, the processor is free to pre-fetch like crazy.

Thus, the only savings for such clever compiler encodings is in the size of the binary file, which -- in these days of 100MB-1GB applications -- is pretty insignificant.

* This is yet another reason for separate instruction & data caches, which can cause all sorts of mischief when they get out of sync -- e.g., you can hide malicious code in plain sight (well, in the instruction cache), while the data cache shows the non-malicious code. But this is a discussion for another day.

At 08:53 AM 4/8/2015, Joerg Arndt wrote:
[...]
And compilers do MUCH more sophisticated things than this.
* Henry Baker <hbaker1@pipeline.com> [Apr 10. 2015 14:07]:
I wouldn't confuse machine instruction encoding with machine instruction execution.
I don't.
There are some machines that "compile" standardly-encoded instructions on-the-fly into the instruction cache as very different RISC/VLIW instructions.* Thus, a lot of compiler cleverness (e.g., LEA) is wasted because both the clever and the non-clever instruction encodings end up as the same bit patterns (and execution speeds) in the instruction cache.
Special instructions like lea are there for a reason. Even if lea led to equivalent micro-ops, faster decoding may lead to faster execution. Without even reading up: lea is obviously a very important instruction and will get special care with regard to speed of execution.

With good compilers you can learn which instructions are especially fast by looking at the generated assembler code: it's those instructions that the compiler chooses to use.
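For instance (a tiny experiment of my own; typical g++ -O2 output on x86-64, register allocation may differ):

unsigned mul5(unsigned x) { return 5 * x; }

compiles to

mul5:
    leal  (%rdi,%rdi,4), %eax   # eax = rdi + 4*rdi = 5*x, one instruction
    ret

The multiply never happens; the compiler strength-reduces it to lea precisely because lea is cheap.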
So unless the instructions are never already in the instruction cache by the time they are actually executed, there isn't much penalty for the non-clever encoding.
Note also that many machines do relatively aggressive pre-fetch on instructions, because the penalty for extra speculation on instructions isn't very high. On newer Intel processors with execute-only pages, the processor is free to pre-fetch like crazy.
Not sure I understand this, but "pre-fetching makes other optimizations unimportant" is completely wrong.
Thus, the only savings for such clever compiler encodings is in the size of the binary file, which -- in these days of 100MB-1GB applications -- is pretty insignificant.
*Gasp*, noooo! There are different compiler switches for (small) size and for speed; that's there for a reason. Unless you want to sacrifice performance by a factor of anywhere from 3 to 20, just read any of:

Advanced Micro Devices (AMD) Inc.: Software Optimization Guide for AMD64 Processors, Publication no. 25112, Revision 3.06, September 2005.

Advanced Micro Devices (AMD) Inc.: Software Optimization Guide for AMD Family 10h Processors, Publication no. 40546, Revision 3.1, May 2009. (There is a newer version.)

Intel Corporation: Intel 64 and IA-32 Architectures Optimization Reference Manual, November 2007. (There is a newer version.)

... or whatever the maker of your favorite CPU has to offer.

I made the "20 fold" up; the greatest ratio I have seen was about 10 thousand (clueless big program versus tiny and clever).
* This is yet another reason for separate instruction & data caches,
* which can cause all sorts of mischief when they get out of sync --
* e.g., you can hide malicious code in plain sight (well, in the
* instruction cache), while the data cache shows the non-malicious
* code. But this is a discussion for another day.
I always assumed there are mechanisms for cache coherency; am I wrong?

Even if there are none: "executable" and "writable" should be mutually exclusive for any memory page (leaving security aside, self-modifying code tends to cripple performance).

Best regards, jj
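PS: For illustration, a minimal sketch of that W^X discipline (my own example, assuming Linux/x86-64 with POSIX mmap/mprotect): the page holding the code is writable while the bytes are copied in, and only then flipped to read+execute, so it is never writable and executable at the same time.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* x86-64 machine code for "return 42":  mov eax, 42 ; ret */
    static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    /* Map one page readable+writable (but NOT executable) and copy the code in. */
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return 1;
    memcpy(page, code, sizeof code);

    /* Flip the page to read+execute before it is ever called.  (x86 keeps
       instruction and data caches coherent in hardware; other architectures
       may additionally need an explicit cache flush here.) */
    if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0) return 1;

    int (*fn)(void) = (int (*)(void))page;   /* POSIX-style cast, not ISO C */
    printf("%d\n", fn());                    /* prints 42 */

    munmap(page, 4096);
    return 0;
}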
[...]
participants (3)

- Henry Baker
- Joerg Arndt
- Warren D Smith