| Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc. |
| |
| This file is part of the GNU MP Library. |
| |
| The GNU MP Library is free software; you can redistribute it and/or modify |
| it under the terms of the GNU Lesser General Public License as published by |
| the Free Software Foundation; either version 3 of the License, or (at your |
| option) any later version. |
| |
| The GNU MP Library is distributed in the hope that it will be useful, but |
| WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY |
| or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public |
| License for more details. |
| |
| You should have received a copy of the GNU Lesser General Public License |
| along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. |
| |
| |
| |
| |
| |
| This directory contains mpn functions for 64-bit V9 SPARC |
| |
| RELEVANT OPTIMIZATION ISSUES |
| |
| Notation: |
| IANY = shift/add/sub/logical/sethi |
| IADDLOG = add/sub/logical/sethi |
| MEM = ld*/st* |
| FA = fadd*/fsub*/f*to*/fmov* |
| FM = fmul* |
| |
| UltraSPARC can issue four instructions per cycle, with these restrictions: |
| * Two IANY instructions, but only one of these may be a shift. If there is a |
| shift and an IANY instruction, the shift must precede the IANY instruction. |
| * One FA. |
| * One FM. |
| * One branch. |
| * One MEM. |
| * IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle. Taken branches |
| should not be in slot 4, since that makes the delay insn come from separate |
| bundle. |
| * If two IANY/IADDLOG instructions are to be executed in the same cycle and one |
| of these is setting the condition codes, that instruction must be the second |
| one. |
| |
| To summarize, ignoring branches, these are the bundles that can reach the peak |
| execution speed: |
| |
| insn1 iany iany mem iany iany mem iany iany mem |
| insn2 iaddlog mem iany mem iaddlog iany mem iaddlog iany |
| insn3 mem iaddlog iaddlog fa fa fa fm fm fm |
| insn4 fa/fm fa/fm fa/fm fm fm fm fa fa fa |
| |
| The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles, |
| depending on the position of the most significant bit of the first source |
| operand. When used for 32x32->64 multiplication, it needs 20 cycles. |
| Furthermore, it stalls the processor while executing. We stay away from that |
| instruction, and instead use floating-point operations. |
| |
| Floating-point add and multiply units are fully pipelined. The latency for |
| UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles. |
| |
| Integer conditional move instructions cannot dual-issue with other integer |
| instructions. No conditional move can issue 1-5 cycles after a load. (This |
| might have been fixed for UltraSPARC-3.) |
| |
| The UltraSPARC-3 pipeline is very simular to he one of UltraSPARC-1/2 , but is |
| somewhat slower. Branches execute slower, and there may be other new stalls. |
| But integer multiply doesn't stall the entire CPU and also has a much lower |
| latency. But it's still not pipelined, and thus useless for our needs. |
| |
| STATUS |
| |
| * mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on |
| UltraSPARC-1/2 and 2.65 on UltraSPARC-3. For UltraSPARC-1/2, the IEU0 |
| functional unit is saturated with shifts. |
| |
| * mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on |
| UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3. The 4 instruction |
| recurrency is the speed limiter. |
| |
| * mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on |
| UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3. On UltraSPARC-1/2, the |
| code sustains 4 instructions/cycle. It might be possible to invent a better |
| way of summing the intermediate 49-bit operands, but it is unlikely that it |
| will save enough instructions to save an entire cycle. |
| |
| The load-use of the u operand is not enough scheduled for good L2 cache |
| performance. The UltraSPARC-1/2 L1 cache is direct mapped, and since we use |
| temporary stack slots that will conflict with the u and r operands, we miss |
| to L2 very often. The load-use of the std/ldx pairs via the stack are |
| perhaps over-scheduled. |
| |
| It would be possible to save two instructions: (1) The mov could be avoided |
| if the std/ldx were less scheduled. (2) The ldx of the r operand could be |
| split into two ld instructions, saving the shifts/masks. |
| |
| It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp |
| operations where rescheduled for this processor's 4-cycle latency. |
| |
| * mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1 |
| code. It would be possible to shave one or two cycles from it, with some |
| labour. |
| |
| * mpn_submul_1: Simpleminded code just calling mpn_mul_1 + mpn_sub_n. This |
| means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23 cycles/limb on |
| UltraSPARC-3. It would be possible to either match the mpn_addmul_1 |
| performance, or in the worst case use one more instruction group. |
| |
| * US1/US2 cache conflict resolving. The direct mapped L1 date cache of US1/US2 |
| is a problem for mul_1, addmul_1 (and a prospective submul_1). We should |
| allocate a larger cache area, and put the stack temp area in a place that |
| doesn't cause cache conflicts. |