* Warren D Smith <warren.wds@gmail.com> [Apr 20. 2014 07:41]:
[...]
As hardware technology advanced, we had prefetching, simultaneous instruction execution, pipelining, and so on. The nail in the coffin for me (or should it be the stake through the heart?) was when I was told by someone I trusted that modern compilers can produce faster executing code than all but the most expert hand coding of assembly. At that point, I changed over completely, and now I just couldn't care less about instruction sets.
That was me? Background: I wrote a simple 3D graphics engine some years ago in C++. The machine code (AMD64 system) was just 44 _kilo_bytes (texture mapping, transparency, colored light, ..., all that included). Looking at the assembler code was pretty much a revelation: few humans could possibly do better and, more importantly, few would be willing to.
[...]
Warren, what is this better architecture that Intel ran up a flagpole? What should we search for to look it up?
--itanium.
Have you ever worked with an Itanium system? No? I thought so.
And by the way, looking into this, it seems ARM and itanium have made some business progress,
Itanium is dead. AMD(64) has forced Intel to make much better CPUs for us unwashed masses.

At work, I have a system with an
  Intel(R) Xeon(R) CPU E3-1275 V2 @ 3.50GHz
The performance is beyond awesome.

Just one detail: the CPU has what Intel calls a "loop stream detector" (LSD). When a tight loop is executed, this is detected and the instructions are executed without going through the (pre-)decoder. As a result, the performance of some of my combinatorial generators is simply doubled, sometimes needing only a few CPU cycles per generated object.

Cherry-picked examples:
------------------------------------------------------------
// output of demo/comb/composition-nz-subset-lex-demo.cc:
// Description:
//% Compositions of n into positive parts, subset-lex order.
-----
args=32 0
COMPOSITION_NZ_SUBSET_LEX_FIXARRAYS defined.
forward:
  ct=2147483648
./bin 32 0  2.08s user 0.00s system 99% cpu 2.079 total
 ==> 2147483648/2.08 == 1032.444061 [M per second];  1/rate == 3.39 [cycles]
------------------------------------------------------------
// output of demo/comb/mixedradix-subset-lexrev-demo.cc:
// Description:
//% Mixed radix numbers in reversed subset-lexicographic order.
-----
args=8 16 1
MIXEDRADIX_SUBSET_LEXREV_FIXARRAYS is defined.
backward:
  ct=4294967296
./bin 8 16 1  2.47s user 0.00s system 99% cpu 2.476 total
 ==> 4294967296/2.47 == 1738.853156 [M per second];  1/rate == 2.01 [cycles]
------------------------------------------------------------
End cherry pick.

Most simple generators of this kind need fewer than 10 cycles per object (>= 350 M objects per second, with 500 M/sec quite usual); 20 cycles is almost "lame" by this CPU's standards.

Currently AMD is falling behind at the performance end, so Intel is under much less pressure both to innovate and to keep prices reasonable. This is very bad.
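For anyone who wants to redo the arithmetic behind the "M per second" and "cycles" figures in the benchmark output above, here is a minimal sketch. The function names are made up for illustration (they are not part of the demo programs), and the clock rate is an assumption: the nominal 3.5 GHz from the CPU name, ignoring turbo boost and frequency scaling.

```cpp
#include <cassert>
#include <cstdint>

// Objects generated per second, from the total count and the measured
// user time in seconds (as reported by the shell's time command).
double objects_per_second(std::uint64_t count, double seconds)
{
    assert(seconds > 0.0);
    return static_cast<double>(count) / seconds;
}

// Average number of CPU cycles spent per generated object.
// clock_hz is an assumption: the nominal clock rate (3.5e9 here),
// not the frequency the core actually ran at during the measurement.
double cycles_per_object(std::uint64_t count, double seconds, double clock_hz)
{
    return clock_hz / objects_per_second(count, seconds);
}
```

With count = 2147483648, seconds = 2.08 and clock_hz = 3.5e9 this gives roughly 1032 M objects per second and about 3.39 cycles per object, matching the first example. The "10 cycles" threshold is the same calculation run backwards: 3.5e9 / 10 = 350 M objects per second.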
[...]
Best, jj