| Copyright 2000, 2001 Free Software Foundation, Inc. |
| |
| This file is part of the GNU MP Library. |
| |
| The GNU MP Library is free software; you can redistribute it and/or modify |
| it under the terms of the GNU Lesser General Public License as published by |
| the Free Software Foundation; either version 3 of the License, or (at your |
| option) any later version. |
| |
| The GNU MP Library is distributed in the hope that it will be useful, but |
| WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY |
| or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public |
| License for more details. |
| |
| You should have received a copy of the GNU Lesser General Public License |
| along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. |
| |
| |
| |
| |
| AMD K7 MPN SUBROUTINES |
| |
| |
| This directory contains code optimized for the AMD Athlon CPU. |
| |
| The mmx subdirectory has routines using MMX instructions. All Athlons have |
| MMX, the separate directory is just so that configure can omit it if the |
| assembler doesn't support MMX. |
| |
| |
| |
| STATUS |
| |
| Times for the loops, with all code and data in L1 cache. |
| |
| cycles/limb |
| mpn_add/sub_n 1.6 |
| |
| mpn_copyi 0.75 or 1.0 \ varying with data alignment |
| mpn_copyd 0.75 or 1.0 / |
| |
| mpn_divrem_1 17.0 integer part, 15.0 fractional part |
| mpn_mod_1 17.0 |
| mpn_divexact_by3 8.0 |
| |
| mpn_l/rshift 1.2 |
| |
| mpn_mul_1 3.4 |
| mpn_addmul/submul_1 3.9 |
| |
| mpn_mul_basecase 4.42 cycles/crossproduct (approx) |
| mpn_sqr_basecase 2.3 cycles/crossproduct (approx) |
| or 4.55 cycles/triangleproduct (approx) |
| |
| Prefetching of sources hasn't yet been tried. |
| |
| |
| |
| NOTES |
| |
| cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available. |
| |
| Write-allocate L1 data cache means prefetching of destinations is unnecessary. |
| |
| Floating point multiplications can be done in parallel with integer |
| multiplications, but there doesn't seem to be any way to make use of this. |
| |
| Unsigned "mul"s can be issued every 3 cycles. This suggests 3 is a limit on |
| the speed of the multiplication routines. The documentation shows mul |
| executing in IEU0 (or maybe in IEU0 and IEU1 together), so it might be that, |
| to get near 3 cycles code has to be arranged so that nothing else is issued |
| to IEU0. A busy IEU0 could explain why some code takes 4 cycles and other |
| apparently equivalent code takes 5. |
| |
| |
| |
| OPTIMIZATIONS |
| |
| Unrolled loops are used to reduce looping overhead. The unrolling is |
| configurable up to 32 limbs/loop for most routines and up to 64 for some. |
| The K7 has 64k L1 code cache so quite big unrolling is allowable. |
| |
| Computed jumps into the unrolling are used to handle sizes not a multiple of |
| the unrolling. An attractive feature of this is that times increase |
| smoothly with operand size, but it may be that some routines should just |
| have simple loops to finish up, especially when PIC adds between 2 and 16 |
| cycles to get %eip. |
| |
| Position independent code is implemented using a call to get %eip for the |
| computed jumps and a ret is always done, rather than an addl $4,%esp or a |
| popl, so the CPU return address branch prediction stack stays synchronised |
| with the actual stack in memory. |
| |
| Branch prediction, in absence of any history, will guess forward jumps are |
| not taken and backward jumps are taken. Where possible it's arranged that |
| the less likely or less important case is under a taken forward jump. |
| |
| |
| |
| CODING |
| |
| Instructions in general code have been shown grouped if they can execute |
| together, which means up to three direct-path instructions which have no |
| successive dependencies. K7 always decodes three and has out-of-order |
| execution, but the groupings show what slots might be available and what |
| dependency chains exist. |
| |
| When there's vector-path instructions an effort is made to get triplets of |
| direct-path instructions in between them, even if there's dependencies, |
| since this maximizes decoding throughput and might save a cycle or two if |
| decoding is the limiting factor. |
| |
| |
| |
| INSTRUCTIONS |
| |
| adcl direct |
| divl 39 cycles back-to-back |
| lodsl,etc vector |
| loop 1 cycle vector (decl/jnz opens up one decode slot) |
| movd reg vector |
| movd mem direct |
| mull issue every 3 cycles, latency 4 cycles low word, 6 cycles high word |
| popl vector (use movl for more than one pop) |
| pushl direct, will pair with a load |
| shrdl %cl vector, 3 cycles, seems to be 3 decode too |
| xorl r,r false read dependency recognised |
| |
| |
| |
| REFERENCES |
| |
| "AMD Athlon Processor X86 Code Optimization Guide", AMD publication number |
| 22007, revision K, February 2002. Available on-line, |
| |
| http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf |
| |
| "3DNow Technology Manual", AMD publication number 21928G/0-March 2000. |
| This describes the femms and prefetch instructions. Available on-line, |
| |
| http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf |
| |
| "AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD |
| publication number 22466, revision D, March 2000. This describes |
| instructions added in the Athlon processor, such as pswapd and the extra |
| prefetch forms. Available on-line, |
| |
| http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf |
| |
| "3DNow Instruction Porting Guide", AMD publication number 22621, revision B, |
| August 1999. This has some notes on general Athlon optimizations as well as |
| 3DNow. Available on-line, |
| |
| http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf |
| |
| |
| |
| |
| ---------------- |
| Local variables: |
| mode: text |
| fill-column: 76 |
| End: |