gcc/gmp/mpn/ia64/README - native_client/nacl-toolchain - Git at Google

 Copyright 2000, 2001, 2002, 2003, 2004, 2005 Free Software Foundation, Inc.

 This file is part of the GNU MP Library.

 The GNU MP Library is free software; you can redistribute it and/or modify
 it under the terms of the GNU Lesser General Public License as published by
 the Free Software Foundation; either version 3 of the License, or (at your
 option) any later version.

 The GNU MP Library is distributed in the hope that it will be useful, but
 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
 License for more details.

 You should have received a copy of the GNU Lesser General Public License
 along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.


                       IA-64 MPN SUBROUTINES


 This directory contains mpn functions for the IA-64 architecture.


 CODE ORGANIZATION

 	mpn/ia64          itanium-2, and generic ia64

 The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
 chips were ever sold, and Itanium 2 is more powerful, so the latter is what
 we concentrate on.


 CHIP NOTES

 The IA-64 ISA keeps instructions three and three in 128 bit bundles.
 Programmers/compilers need to put explicit breaks `;;' when there are WAW or
 RAW dependencies, with some notable exceptions.  Such "breaks" are typically
 at the end of a bundle, but can be put between operations within some bundle
 types too.

 The Itanium 1 and Itanium 2 implementations can under ideal conditions
 execute two bundles per cycle.  The Itanium 1 allows 4 of these instructions
 to do integer operations, while the Itanium 2 allows all 6 to be integer
 operations.

 Taken cloop branches seem to insert a bubble into the pipeline most of the
 time on Itanium 1.

 Loads to the fp registers bypass the L1 cache and thus get extremely long
 latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

 The software pipeline stuff using br.ctop instruction causes delays, since
 many issue slots are taken up by instructions with zero predicates, and
 since many extra instructions are needed to set things up.  These features
 are clearly designed for code density, not speed.

 Misc pipeline limitations (Itanium 1):
 * The getf.sig instruction can only execute in M0.
 * At most four integer instructions/cycle.
 * Nops take up resources like any plain instructions.

 Misc pipeline limitations (Itanium 2):
 * The getf.sig instruction can only execute in M0.
 * Nops take up resources like any plain instructions.


 ASSEMBLY SYNTAX

 .align pads with nops in a text segment, but gas 2.14 and earlier
 incorrectly byte-swaps its nop bundle in big endian mode (eg. hpux), making
 it come out as break instructions.  We use the ALIGN() macro in
 mpn/ia64/ia64-defs.m4 when it might be executed across.  That macro
 suppresses any .align if the problem is detected by configure.  Lack of
 alignment might hurt performance but will at least be correct.

 foo:: to create a global symbol is not accepted by gas.  Use separate
 ".global foo" and "foo:" instead.

 .global is the standard global directive.  gas accepts .globl, but hpux "as"
 doesn't.

 .proc / .endp generates the appropriate .type and .size information for ELF,
 so the latter directives don't need to be given explicitly.

 .pred.rel "mutex"... is standard for annotating predicate register
 relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

 .pred directives can't be put on a line with a label, like
 ".Lfoo: .pred ...", the HP assembler on HP-UX 11.23 rejects that.
 gas is happy with it, and past versions of HP had seemed ok.

 // is the standard comment sequence, but we prefer "C" since it inhibits m4
 macro expansion.  See comments in ia64-defs.m4.


 REGISTER USAGE

 Special:
    r0: constant 0
    r1: global pointer (gp)
    r8: return value
    r12: stack pointer (sp)
    r13: thread pointer (tp)
 Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
 Caller-saves but rotating: r32-


 ================================================================
 mpn_add_n, mpn_sub_n:

 The current code runs at 1.25 c/l on Itanium 2.

 ================================================================
 mpn_mul_1:

 The current code runs at 2 c/l on Itanium 2.

 Using a blocked approach, working off of 4 separate places in the operands,
 one could make use of the xma accumulation, and approach 1 c/l.

 	ldf8 [up]
 	xma.l
 	xma.hu
 	stf8  [wrp]

 ================================================================
 mpn_addmul_1:

 The current code runs at 2 c/l on Itanium 2.

 It seems possible to use a blocked approach, as with mpn_mul_1.  We should
 read rp[] to integer registers, allowing for just one getf.sig per cycle.

 	ld8  [rp]
 	ldf8 [up]
 	xma.l
 	xma.hu
 	getf.sig
 	add+add+cmp+cmp
 	st8  [wrp]

 These 10 instructions can be scheduled to approach 1.667 cycles, and with
 the 4 cycle latency of xma, this means we need at least 3 blocks.  Using
 ldfp8 we could approach 1.583 c/l.

 ================================================================
 mpn_submul_1:

 The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
 ldfp8 with all alignment headache that implies.

 ================================================================
 mpn_addmul_N

 For best speed, we need to give up using mpn_addmul_1 as the main multiply
 building block, and instead take multiple v limbs per loop.  For the Itanium
 1, we need to take about 8 limbs at a time for full speed.  For the Itanium
 2, something like mpn_addmul_4 should be enough.

 The add+cmp+cmp+add we use on the other codes is optimal for shortening
 recurrencies (1 cycle) but the sequence takes up 4 execution slots.  When
 recurrency depth is not critical, a more standard 3-cycle add+cmp+add is
 better.

 /* First load the 8 values from v */
 	ldfp8		v0, v1 = [r35], 16;;
 	ldfp8		v2, v3 = [r35], 16;;
 	ldfp8		v4, v5 = [r35], 16;;
 	ldfp8		v6, v7 = [r35], 16;;

 /* In the inner loop, get a new U limb and store a result limb. */
 	mov		lc = un
 Loop:	ldf8		u0 = [r33], 8
 	ld8		r0 = [r32]
 	xma.l		lp0 = v0, u0, hp0
 	xma.hu		hp0 = v0, u0, hp0
 	xma.l		lp1 = v1, u0, hp1
 	xma.hu		hp1 = v1, u0, hp1
 	xma.l		lp2 = v2, u0, hp2
 	xma.hu		hp2 = v2, u0, hp2
 	xma.l		lp3 = v3, u0, hp3
 	xma.hu		hp3 = v3, u0, hp3
 	xma.l		lp4 = v4, u0, hp4
 	xma.hu		hp4 = v4, u0, hp4
 	xma.l		lp5 = v5, u0, hp5
 	xma.hu		hp5 = v5, u0, hp5
 	xma.l		lp6 = v6, u0, hp6
 	xma.hu		hp6 = v6, u0, hp6
 	xma.l		lp7 = v7, u0, hp7
 	xma.hu		hp7 = v7, u0, hp7
 	getf.sig	l0 = lp0
 	getf.sig	l1 = lp1
 	getf.sig	l2 = lp2
 	getf.sig	l3 = lp3
 	getf.sig	l4 = lp4
 	getf.sig	l5 = lp5
 	getf.sig	l6 = lp6
 	add+cmp+add	xx, l0, r0
 	add+cmp+add	acc0, acc1, l1
 	add+cmp+add	acc1, acc2, l2
 	add+cmp+add	acc2, acc3, l3
 	add+cmp+add	acc3, acc4, l4
 	add+cmp+add	acc4, acc5, l5
 	add+cmp+add	acc5, acc6, l6
 	getf.sig	acc6 = lp7
 	st8		[r32] = xx, 8
 	br.cloop Loop

 	49 insn at max 6 insn/cycle:		8.167 cycles/limb8
 	11 memops at max 2 memops/cycle:	5.5 cycles/limb8
 	16 fpops at max 2 fpops/cycle:		8 cycles/limb8
 	21 intops at max 4 intops/cycle:	5.25 cycles/limb8
 	11+21 memops+intops at max 4/cycle	8 cycles/limb8

 ================================================================
 mpn_lshift, mpn_rshift

 The current code runs at 1 cycle/limb on Itanium 2.

 Using 63 separate loops, we could use the double-word shrp instruction.
 That instruction has a plain single-cycle latency.  We need 63 loops since
 this instruction only accept immediate count.  That would lead to a somewhat
 silly code size, but the speed would be 0.75 c/l on Itanium 2 (by using shrp
 each cycle plus shl/shr going down I1 for a further limb every second
 cycle).

 ================================================================
 mpn_copyi, mpn_copyd

 The current code runs at 0.5 c/l on Itanium 2.  But that is just for L1
 cache hit.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
 scheduling isn't great.  It might be best to actually use modulo scheduled
 loops, since that will allow us to do better load-use scheduling without too
 much unrolling.

 Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
 2, according to tests/devel/try.  Cache bank conflicts?


 REFERENCES

 Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
 Intel document 245317-004, 245318-004, 245319-004 October 2002.  Volume 1
 includes an Itanium optimization guide.

 Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
 document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
 etc.

 Intel Itanium Architecture Assembly Language Reference Guide, Intel document
 248801-004, 2000-2002.  Describes assembly instruction syntax and other
 directives.

 Itanium Software Conventions and Runtime Architecture Guide, Intel document
 245358-003, May 2001.  Describes calling conventions, including stack
 unwinding requirements.

 Intel Itanium Processor Reference Manual for Software Optimization, Intel
 document 245473-003, November 2001.

 Intel Itanium-2 Processor Reference Manual for Software Development and
 Optimization, Intel document 251110-003, May 2004.

 All the above documents can be found online at

     http://developer.intel.com/design/itanium/manuals.htm


 ----------------
 Local variables:
 mode: text
 fill-column: 76
 End:
	Copyright 2000, 2001, 2002, 2003, 2004, 2005 Free Software Foundation, Inc.

	This file is part of the GNU MP Library.

	The GNU MP Library is free software; you can redistribute it and/or modify
	it under the terms of the GNU Lesser General Public License as published by
	the Free Software Foundation; either version 3 of the License, or (at your
	option) any later version.

	The GNU MP Library is distributed in the hope that it will be useful, but
	WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
	or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
	License for more details.

	You should have received a copy of the GNU Lesser General Public License
	along with the GNU MP Library. If not, see http://www.gnu.org/licenses/.



	IA-64 MPN SUBROUTINES


	This directory contains mpn functions for the IA-64 architecture.


	CODE ORGANIZATION

	mpn/ia64 itanium-2, and generic ia64

	The code here has been optimized primarily for Itanium 2. Very few Itanium 1
	chips were ever sold, and Itanium 2 is more powerful, so the latter is what
	we concentrate on.



	CHIP NOTES

	The IA-64 ISA keeps instructions three and three in 128 bit bundles.
	Programmers/compilers need to put explicit breaks `;;' when there are WAW or
	RAW dependencies, with some notable exceptions. Such "breaks" are typically
	at the end of a bundle, but can be put between operations within some bundle
	types too.

	The Itanium 1 and Itanium 2 implementations can under ideal conditions
	execute two bundles per cycle. The Itanium 1 allows 4 of these instructions
	to do integer operations, while the Itanium 2 allows all 6 to be integer
	operations.

	Taken cloop branches seem to insert a bubble into the pipeline most of the
	time on Itanium 1.

	Loads to the fp registers bypass the L1 cache and thus get extremely long
	latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

	The software pipeline stuff using br.ctop instruction causes delays, since
	many issue slots are taken up by instructions with zero predicates, and
	since many extra instructions are needed to set things up. These features
	are clearly designed for code density, not speed.

	Misc pipeline limitations (Itanium 1):
	* The getf.sig instruction can only execute in M0.
	* At most four integer instructions/cycle.
	* Nops take up resources like any plain instructions.

	Misc pipeline limitations (Itanium 2):
	* The getf.sig instruction can only execute in M0.
	* Nops take up resources like any plain instructions.


	ASSEMBLY SYNTAX

	.align pads with nops in a text segment, but gas 2.14 and earlier
	incorrectly byte-swaps its nop bundle in big endian mode (eg. hpux), making
	it come out as break instructions. We use the ALIGN() macro in
	mpn/ia64/ia64-defs.m4 when it might be executed across. That macro
	suppresses any .align if the problem is detected by configure. Lack of
	alignment might hurt performance but will at least be correct.

	foo:: to create a global symbol is not accepted by gas. Use separate
	".global foo" and "foo:" instead.

	.global is the standard global directive. gas accepts .globl, but hpux "as"
	doesn't.

	.proc / .endp generates the appropriate .type and .size information for ELF,
	so the latter directives don't need to be given explicitly.

	.pred.rel "mutex"... is standard for annotating predicate register
	relationships. gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

	.pred directives can't be put on a line with a label, like
	".Lfoo: .pred ...", the HP assembler on HP-UX 11.23 rejects that.
	gas is happy with it, and past versions of HP had seemed ok.

	// is the standard comment sequence, but we prefer "C" since it inhibits m4
	macro expansion. See comments in ia64-defs.m4.


	REGISTER USAGE

	Special:
	r0: constant 0
	r1: global pointer (gp)
	r8: return value
	r12: stack pointer (sp)
	r13: thread pointer (tp)
	Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
	Caller-saves but rotating: r32-


	================================================================
	mpn_add_n, mpn_sub_n:

	The current code runs at 1.25 c/l on Itanium 2.

	================================================================
	mpn_mul_1:

	The current code runs at 2 c/l on Itanium 2.

	Using a blocked approach, working off of 4 separate places in the operands,
	one could make use of the xma accumulation, and approach 1 c/l.

	ldf8 [up]
	xma.l
	xma.hu
	stf8 [wrp]

	================================================================
	mpn_addmul_1:

	The current code runs at 2 c/l on Itanium 2.

	It seems possible to use a blocked approach, as with mpn_mul_1. We should
	read rp[] to integer registers, allowing for just one getf.sig per cycle.

	ld8 [rp]
	ldf8 [up]
	xma.l
	xma.hu
	getf.sig
	add+add+cmp+cmp
	st8 [wrp]

	These 10 instructions can be scheduled to approach 1.667 cycles, and with
	the 4 cycle latency of xma, this means we need at least 3 blocks. Using
	ldfp8 we could approach 1.583 c/l.

	================================================================
	mpn_submul_1:

	The current code runs at 2.25 c/l on Itanium 2. Getting to 2 c/l requires
	ldfp8 with all alignment headache that implies.

	================================================================
	mpn_addmul_N

	For best speed, we need to give up using mpn_addmul_1 as the main multiply
	building block, and instead take multiple v limbs per loop. For the Itanium
	1, we need to take about 8 limbs at a time for full speed. For the Itanium
	2, something like mpn_addmul_4 should be enough.

	The add+cmp+cmp+add we use on the other codes is optimal for shortening
	recurrencies (1 cycle) but the sequence takes up 4 execution slots. When
	recurrency depth is not critical, a more standard 3-cycle add+cmp+add is
	better.

	/* First load the 8 values from v */
	ldfp8 v0, v1 = [r35], 16;;
	ldfp8 v2, v3 = [r35], 16;;
	ldfp8 v4, v5 = [r35], 16;;
	ldfp8 v6, v7 = [r35], 16;;

	/* In the inner loop, get a new U limb and store a result limb. */
	mov lc = un
	Loop: ldf8 u0 = [r33], 8
	ld8 r0 = [r32]
	xma.l lp0 = v0, u0, hp0
	xma.hu hp0 = v0, u0, hp0
	xma.l lp1 = v1, u0, hp1
	xma.hu hp1 = v1, u0, hp1
	xma.l lp2 = v2, u0, hp2
	xma.hu hp2 = v2, u0, hp2
	xma.l lp3 = v3, u0, hp3
	xma.hu hp3 = v3, u0, hp3
	xma.l lp4 = v4, u0, hp4
	xma.hu hp4 = v4, u0, hp4
	xma.l lp5 = v5, u0, hp5
	xma.hu hp5 = v5, u0, hp5
	xma.l lp6 = v6, u0, hp6
	xma.hu hp6 = v6, u0, hp6
	xma.l lp7 = v7, u0, hp7
	xma.hu hp7 = v7, u0, hp7
	getf.sig l0 = lp0
	getf.sig l1 = lp1
	getf.sig l2 = lp2
	getf.sig l3 = lp3
	getf.sig l4 = lp4
	getf.sig l5 = lp5
	getf.sig l6 = lp6
	add+cmp+add xx, l0, r0
	add+cmp+add acc0, acc1, l1
	add+cmp+add acc1, acc2, l2
	add+cmp+add acc2, acc3, l3
	add+cmp+add acc3, acc4, l4
	add+cmp+add acc4, acc5, l5
	add+cmp+add acc5, acc6, l6
	getf.sig acc6 = lp7
	st8 [r32] = xx, 8
	br.cloop Loop

	49 insn at max 6 insn/cycle: 8.167 cycles/limb8
	11 memops at max 2 memops/cycle: 5.5 cycles/limb8
	16 fpops at max 2 fpops/cycle: 8 cycles/limb8
	21 intops at max 4 intops/cycle: 5.25 cycles/limb8
	11+21 memops+intops at max 4/cycle 8 cycles/limb8

	================================================================
	mpn_lshift, mpn_rshift

	The current code runs at 1 cycle/limb on Itanium 2.

	Using 63 separate loops, we could use the double-word shrp instruction.
	That instruction has a plain single-cycle latency. We need 63 loops since
	this instruction only accept immediate count. That would lead to a somewhat
	silly code size, but the speed would be 0.75 c/l on Itanium 2 (by using shrp
	each cycle plus shl/shr going down I1 for a further limb every second
	cycle).

	================================================================
	mpn_copyi, mpn_copyd

	The current code runs at 0.5 c/l on Itanium 2. But that is just for L1
	cache hit. The 4-way unrolled loop takes just 2 cycles, and thus load-use
	scheduling isn't great. It might be best to actually use modulo scheduled
	loops, since that will allow us to do better load-use scheduling without too
	much unrolling.

	Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
	2, according to tests/devel/try. Cache bank conflicts?



	REFERENCES

	Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
	Intel document 245317-004, 245318-004, 245319-004 October 2002. Volume 1
	includes an Itanium optimization guide.

	Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
	document 245370-003, May 2001. Describes C type sizes, dynamic linking,
	etc.

	Intel Itanium Architecture Assembly Language Reference Guide, Intel document
	248801-004, 2000-2002. Describes assembly instruction syntax and other
	directives.

	Itanium Software Conventions and Runtime Architecture Guide, Intel document
	245358-003, May 2001. Describes calling conventions, including stack
	unwinding requirements.

	Intel Itanium Processor Reference Manual for Software Optimization, Intel
	document 245473-003, November 2001.

	Intel Itanium-2 Processor Reference Manual for Software Development and
	Optimization, Intel document 251110-003, May 2004.

	All the above documents can be found online at

	http://developer.intel.com/design/itanium/manuals.htm


	----------------
	Local variables:
	mode: text
	fill-column: 76
	End: