| Copyright 1996, 1997, 1999, 2000, 2001, 2002, 2003, 2004, 2005 Free Software |
| Foundation, Inc. |
| |
| This file is part of the GNU MP Library. |
| |
| The GNU MP Library is free software; you can redistribute it and/or modify it |
| under the terms of the GNU Lesser General Public License as published by the |
| Free Software Foundation; either version 3 of the License, or (at your |
| option) any later version. |
| |
| The GNU MP Library is distributed in the hope that it will be useful, but |
| WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or |
| FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License |
| for more details. |
| |
| You should have received a copy of the GNU Lesser General Public License along |
| with the GNU MP Library. If not, see http://www.gnu.org/licenses/. |
| |
| |
| |
| |
| |
| This directory contains mpn functions optimized for DEC Alpha processors. |
| |
| ALPHA ASSEMBLY RULES AND REGULATIONS |
| |
| The `.prologue N' pseudo op marks the end of the instructions that need |
| special handling by unwinding. It also says whether $27 is really needed for |
| computing the gp. The `.mask M' pseudo op says which registers are saved on |
| the stack, and at what offset in the frame. |
| |
| Cray T3 code is very very different... |
| |
| "$6" / "$f6" etc is the usual syntax for registers, but on Unicos instead "r6" |
| / "f6" is required. We use the "r6" / "f6" forms, and have m4 defines expand |
| them to "$6" or "$f6" where necessary. |
| |
| "0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is |
| required. The X() macro accommodates this difference. |
| |
| "cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will |
| accept either. We use cvttqc and have an m4 define expand to cvttq/c where |
| necessary. |
| |
| "not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not |
| the Unicos assembler. The full "ornot" must be used. |
| |
| "unop" is not available in Unicos. We make an m4 define to the usual "ldq_u |
| r31,0(r30)", and in fact use that define on all systems since it comes out the |
| same. |
| |
| "!literal!123" etc explicit relocations as per Tru64 4.0 are apparently not |
| available in older alpha assemblers (including gas prior to 2.12), according to |
| the GCC manual, so the assembler macro forms must be used (eg. ldgp). |
| |
| |
| |
| RELEVANT OPTIMIZATION ISSUES |
| |
| EV4 |
| |
| 1. This chip has very limited store bandwidth. The on-chip L1 cache is |
| write-through, and a cache line is transferred from the store buffer to the |
| off-chip L2 in as many as 15 cycles on most systems. This delay hurts |
| mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift. |
| |
| 2. Pairing is possible between memory instructions and integer arithmetic |
| instructions. |
| |
| 3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these |
| cycles are pipelined. Thus, multiply instructions can be issued at a rate |
| of one each 21st cycle. |
| |
| EV5 |
| |
| 1. The memory bandwidth of this chip is good, both for loads and stores. The |
| L1 cache can handle two loads or one store per cycle, but two cycles after a |
| store, no ld can issue. |
| |
| 2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle. |
| umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle. |
| (Note that published documentation gets these numbers slightly wrong.) |
| |
| 3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, whereof 12 |
| are memory operations. This will take at least |
| ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles |
| We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data |
| cache cycles, which should be completely hidden in the 19 issue cycles. |
| The computation is inherently serial, with these dependencies: |
| |
| ldq ldq |
| \ /\ |
| (or) addq | |
| |\ / \ | |
| | addq cmpult |
| \ | | |
| cmpult | |
| \ / |
| or |
| |
| I.e., 3 operations are needed between carry-in and carry-out, making 12 |
| cycles the absolute minimum for the 4 limbs. We could replace the `or' with |
| a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that |
| might waste a cycle on EV4. The total depth remains unaffected, since cmov |
| has a latency of 2 cycles. |
| |
| addq |
| / \ |
| addq cmpult |
| | \ |
| cmpult -> cmovne |
| |
| Montgomery has a slightly different way of computing carry that requires one |
| less instruction, but has depth 4 (instead of the current 3). Since the code |
| is currently instruction issue bound, Montgomery's idea should save us 1/2 |
| cycle per limb, or bring us down to a total of 17 cycles or 4.25 cycles/limb. |
| Unfortunately, this method will not be good for the EV6. |
| |
| 4. addmul_1 and friends: We previously had a scheme for splitting the |
| single-limb operand into 21-bit chunks and the multi-limb operand into 32-bit |
| chunks, and then using FP operations for every other multiply and integer |
| operations for the remaining ones. |
| |
| But it seems much better to split the single-limb operand into 16-bit chunks, |
| since we save many integer shifts and adds that way. See powerpc64/README |
| for some more details. |
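
As a rough illustration of the mpn_add_n carry recurrence from item 3 above,
here is a minimal C sketch (not the actual assembly). The names limb_t and
add_n_sketch are made up for this example. The comments mark which operations
sit on the path from carry-in to carry-out.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t limb_t;   /* hypothetical name for one 64-bit limb */

/* Add {up,n} and {vp,n}, store the sum at {rp,n}, return carry-out (0 or 1). */
static limb_t add_n_sketch(limb_t *rp, const limb_t *up, const limb_t *vp,
                           size_t n)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i], v = vp[i];   /* ldq, ldq */
        limb_t s = u + v;              /* addq:   independent of carry-in */
        limb_t c1 = s < v;             /* cmpult: independent of carry-in */
        limb_t r = s + cy;             /* addq:   carry path, operation 1 */
        limb_t c2 = r < cy;            /* cmpult: carry path, operation 2 */
        cy = c1 | c2;                  /* or:     carry path, operation 3 */
        rp[i] = r;                     /* stq */
    }
    return cy;
}
```

Only the last three operations depend on the incoming carry, which is where
the depth of 3 per limb comes from. c1 and c2 can never both be set, so a
plain `or' is enough to combine them.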
| |
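
For item 4 above, the following C sketch shows why 16-bit chunks suit FP
multiplies: a 16-bit by 32-bit product is below 2^48, so it is exact in the
53-bit mantissa of a double. This assumes the multi-limb operand is still
split into 32-bit chunks. To stay short it computes only the low limb of the
product, and the name mul_lo_fp_sketch is hypothetical.

```c
#include <stdint.h>

/* Low 64 bits of u*v, built from exact double-precision partial products.
   Assumption: u is split into 32-bit halves, v into 16-bit chunks. */
static uint64_t mul_lo_fp_sketch(uint64_t u, uint64_t v)
{
    uint64_t acc = 0;
    for (int j = 0; j < 2; j++) {
        double uj = (double)((u >> (32 * j)) & 0xffffffffu);
        for (int i = 0; i < 4; i++) {
            int shift = 32 * j + 16 * i;
            if (shift >= 64)
                continue;              /* contributes only to the high limb */
            double vi = (double)((v >> (16 * i)) & 0xffffu);
            /* 32-bit x 16-bit product < 2^48: no rounding in the multiply */
            uint64_t p = (uint64_t)(uj * vi);
            acc += p << shift;         /* wraps modulo 2^64 */
        }
    }
    return acc;
}
```

No shifts or adds are needed to recombine a partial product before it is
added in, since each one fits a single double exactly.
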
| EV6 |
| |
| Here we have a really parallel pipeline, capable of issuing up to 4 integer |
| instructions per cycle. In actual practice, it is never possible to sustain |
| more than 3.5 integer insns/cycle due to rename register shortage. One integer |
| multiply instruction can issue each cycle. To get optimal speed, we need to |
| pretend we are vectorizing the code, i.e., minimize the depth of recurrences. |
| |
| There are two dependencies to watch out for. 1) Address arithmetic |
| dependencies, and 2) carry propagation dependencies. |
| |
| We can avoid serializing due to address arithmetic by unrolling loops, so that |
| addresses don't depend heavily on an index variable. Avoiding serializing |
| because of carry propagation is trickier; the ultimate performance of the code |
| will be determined by the number of latency cycles it takes from accepting |
| carry-in to a vector point until we can generate carry-out. |
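
The two points above can be sketched in C, under the assumption that n is a
multiple of 4 (add_n_unrolled_sketch is a hypothetical name, not a GMP
function). Each iteration uses fixed offsets from pointers that are bumped
once, and does all carry-independent work before entering the carry chain.

```c
#include <stdint.h>
#include <stddef.h>

/* 4-fold unrolled add; n is assumed to be a multiple of 4.
   Returns carry-out (0 or 1). */
static uint64_t add_n_unrolled_sketch(uint64_t *rp, const uint64_t *up,
                                      const uint64_t *vp, size_t n)
{
    uint64_t cy = 0;
    for (; n >= 4; n -= 4, rp += 4, up += 4, vp += 4) {
        /* Carry-independent: four sums and their carries, fixed offsets. */
        uint64_t s0 = up[0] + vp[0], c0 = s0 < vp[0];
        uint64_t s1 = up[1] + vp[1], c1 = s1 < vp[1];
        uint64_t s2 = up[2] + vp[2], c2 = s2 < vp[2];
        uint64_t s3 = up[3] + vp[3], c3 = s3 < vp[3];

        /* Carry-dependent recurrence: the only serial part. */
        uint64_t r0 = s0 + cy;  cy = c0 | (r0 < cy);
        uint64_t r1 = s1 + cy;  cy = c1 | (r1 < cy);
        uint64_t r2 = s2 + cy;  cy = c2 | (r2 < cy);
        uint64_t r3 = s3 + cy;  cy = c3 | (r3 < cy);

        rp[0] = r0;  rp[1] = r1;  rp[2] = r2;  rp[3] = r3;
    }
    return cy;
}
```

A C compiler is free to reschedule all of this, so the sketch only shows the
shape of the dependencies, not a tuned loop.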
| |
| Most integer instructions can execute in either the L0, U0, L1, or U1 |
| pipelines. Shifts only execute in U0 and U1, and multiply only in U1. |
| |
| CMOV instructions split into two internal instructions, CMOV1 and CMOV2. CMOV |
| splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting that CMOV |
| should always be placed as the last instruction of an aligned 4-instruction |
| block, or perhaps simply avoided. |
| |
| Perhaps the most important issue is the latency between the L0/U0 and L1/U1 |
| clusters; a result obtained on either cluster has an extra cycle of latency for |
| consumers in the opposite cluster. Because of the dynamic nature of the |
| implementation, it is hard to predict where an instruction will execute. |
| |
| |
| |
| REFERENCES |
| |
| "Alpha Architecture Handbook", version 4, Compaq, October 1998, order number |
| EC-QD2KC-TE. |
| |
| "Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998, |
| order number EC-QP99C-TE. |
| |
| "Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4, |
| Compaq, September 2000, order number DS-0028B-TE. |
| |
| "Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number |
| EC-RJ66A-TE. |
| |
| All of the above are available online from |
| |
| http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html |
| ftp://ftp.compaq.com/pub/products/alphaCPUdocs |
| |
| "Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part |
| number AA-PS31D-TE. |
| |
| "Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp, |
| March 1996, part number AA-PY8AC-TE. |
| |
| The above are available online, |
| |
| http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM |
| |
| (Dunno what h30097 means in this URL, but if it moves try searching for "tru64 |
| online documentation" from the main www.hp.com page.) |
| |
| |
| |
| ---------------- |
| Local variables: |
| mode: text |
| fill-column: 79 |
| End: |