* Henry Baker <hbaker1@pipeline.com> [Apr 10. 2015 14:07]:
> I wouldn't confuse machine instruction encoding with machine instruction execution.
I don't.
> There are some machines that "compile" standardly-encoded instructions on-the-fly into the instruction cache as very different RISC/VLIW instructions.* Thus, a lot of compiler cleverness (e.g., LEA) is wasted because both the clever and the non-clever instruction encodings end up as the same bit patterns (and execution speeds) in the instruction cache.
Special instructions like lea are there for a reason. Even if lea leads to equivalent micro-ops, faster decoding may lead to faster execution. Without even reading up: lea is obviously a very important instruction and will get special care with regard to speed of execution. With good compilers you can learn which instructions are especially fast by looking at the generated assembler code: it's those instructions that the compiler chooses to use.
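To make this concrete, here is a minimal sketch (the function name mul5 is just illustrative): gcc and clang at -O2 typically compile the multiply below into a single lea, using the scaled-index addressing mode to do arithmetic without touching the flags.

```c
/* Sketch: a small multiply that compilers typically fold into one LEA.
 * On x86-64, gcc/clang at -O2 emit something like:
 *     lea eax, [rdi + rdi*4]
 * i.e. x + 4*x computed by the address-generation unit, no imul needed. */
unsigned mul5(unsigned x) {
    return x * 5;
}
```

Compiling with -S (or looking at a disassembly) shows the chosen instruction; that is exactly the "read the generated assembler" trick described above.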
> So unless the instructions are never already in the instruction cache at the time that the instruction is actually executed, there isn't much penalty to the non-clever encoding.
> Note also that many machines do relatively aggressive pre-fetch on instructions, because the penalty for extra speculation on instructions isn't very high. On newer Intel processors with execute-only pages, the processor is free to pre-fetch like crazy.
Not sure I understand this, but "pre-fetching makes other optimizations unimportant" is completely wrong.
> Thus, the only savings for such clever compiler encodings is in the size of the binary file, which -- in these days of 100MB-1GB applications -- is pretty insignificant.
*Gasp*, noooo! There are different compiler switches for (small) size and for speed, and that's for a reason. Unless you want to sacrifice performance by a factor of anywhere from 3 to 20, just read any of:

Advanced Micro Devices (AMD) Inc.: Software Optimization Guide for AMD64 Processors, Publication no. 25112, Revision 3.06, September 2005.
Advanced Micro Devices (AMD) Inc.: Software Optimization Guide for AMD Family 10h Processors, Publication no. 40546, Revision 3.1, May 2009. (there is a newer version)
Intel Corporation: Intel 64 and IA-32 architectures optimization reference manual, November 2007. (there is a newer version)

... or whatever the maker of your favorite CPU has to offer. I made the "20 fold" up; the greatest ratio I have seen was about 10 thousand (clueless big program versus tiny and clever).
> * This is yet another reason for separate instruction & data caches, which can cause all sorts of mischief when they get out of sync -- e.g., you can hide malicious code in plain sight (well, in the instruction cache), while the data cache shows the non-malicious code. But this is a discussion for another day.
I always assumed there are mechanisms for cache coherency; am I wrong? Even if there are none: "executable" and "writable" should be mutually exclusive for any memory page (leaving security aside, self-modifying code tends to cripple performance). Best regards, jj
[...]