* Henry Baker <hbaker1@pipeline.com> [Apr 10. 2015 14:07]:
> I wouldn't confuse machine instruction encoding with machine instruction execution.
I don't.
> There are some machines that "compile" standardly-encoded instructions on-the-fly into the instruction cache as very different RISC/VLIW instructions.* Thus, a lot of compiler cleverness (e.g., LEA) is wasted because both the clever and the non-clever instruction encodings end up as the same bit patterns (and execution speeds) in the instruction cache.
Special instructions like lea are there for a reason. Even if lea leads to equivalent micro-ops, faster decoding may lead to faster execution. Without even reading up: lea is obviously a very important instruction and will get special care with regard to speed of execution. With good compilers you can learn which instructions are especially fast by looking at the generated assembler code: it's those instructions that the compiler chooses to use.
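To make this concrete, here is a minimal sketch (the function name mul5 is just illustrative): gcc and clang at -O2 typically compile the multiply below into a single lea, using the scaled-index addressing mode to do arithmetic without touching the flags.

```c
/* Sketch: a small multiply that compilers typically fold into one LEA.
 * On x86-64, gcc/clang at -O2 emit something like:
 *     lea eax, [rdi + rdi*4]
 * i.e. x + 4*x computed by the address-generation unit, no imul needed. */
unsigned mul5(unsigned x) {
    return x * 5;
}
```

Compiling with -S (or looking at a disassembly) shows the chosen instruction; that is exactly the "read the generated assembler" trick described above.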
> So unless the instructions are never already in the instruction cache at the time that the instruction is actually executed, there isn't much penalty to the non-clever encoding.
> Note also that many machines do relatively aggressive pre-fetch on instructions, because the penalty for extra speculation on instructions isn't very high. On newer Intel processors with execute-only pages, the processor is free to pre-fetch like crazy.
Not sure I understand this, but "pre-fetching makes other optimizations unimportant" is completely wrong.
> Thus, the only savings for such clever compiler encodings is in the size of the binary file, which -- in these days of 100MB-1GB applications -- is pretty insignificant.
*Gasp*, noooo! There are different compiler switches for (small) size and for speed, and that's for a reason. Unless you want to sacrifice performance by a factor of anywhere from 3 to 20, just read any of:

Advanced Micro Devices (AMD) Inc.: Software Optimization Guide for AMD64 Processors, Publication no. 25112, Revision 3.06, September 2005.
Advanced Micro Devices (AMD) Inc.: Software Optimization Guide for AMD Family 10h Processors, Publication no. 40546, Revision 3.1, May 2009. (there is a newer version)
Intel Corporation: Intel 64 and IA-32 architectures optimization reference manual, November 2007. (there is a newer version)

... or whatever the maker of your favorite CPU has to offer. I made the "20 fold" up; the greatest ratio I have seen was about 10 thousand (clueless big program versus tiny and clever).
> * This is yet another reason for separate instruction & data caches, which can cause all sorts of mischief when they get out of sync -- e.g., you can hide malicious code in plain sight (well, in the instruction cache), while the data cache shows the non-malicious code. But this is a discussion for another day.
I always assumed there are mechanisms for cache coherency; am I wrong? Even if there are none: "executable" and "writable" should be mutually exclusive for any memory page (leaving security aside, self-modifying code tends to cripple performance). Best regards, jj
[...]