chromium / external / github.com / python / cpython / 6e62eb2e70a9f2a8099735989a58e8c8cfb4a2f2 / . / Objects / listsort.txt

Intro | |

----- | |

This describes an adaptive, stable, natural mergesort, modestly called | |

timsort (hey, I earned it <wink>). It has supernatural performance on many | |

kinds of partially ordered arrays (less than lg(N!) comparisons needed, and | |

as few as N-1), yet as fast as Python's previous highly tuned samplesort | |

hybrid on random arrays. | |

In a nutshell, the main routine marches over the array once, left to right, | |

alternately identifying the next run, then merging it into the previous | |

runs "intelligently". Everything else is complication for speed, and some | |

hard-won measure of memory efficiency. | |

Comparison with Python's Samplesort Hybrid | |

------------------------------------------ | |

+ timsort can require a temp array containing as many as N//2 pointers, | |

which means as many as 2*N extra bytes on 32-bit boxes. It can be | |

expected to require a temp array this large when sorting random data; on | |

data with significant structure, it may get away without using any extra | |

heap memory. This appears to be the strongest argument against it, but | |

compared to the size of an object, 2 temp bytes worst-case (also expected- | |

case for random data) doesn't scare me much. | |

It turns out that Perl is moving to a stable mergesort, and the code for | |

that appears always to require a temp array with room for at least N | |

pointers. (Note that I wouldn't want to do that even if space weren't an | |

issue; I believe its efforts at memory frugality also save timsort | |

significant pointer-copying costs, and allow it to have a smaller working | |

set.) | |

+ Across about four hours of generating random arrays, and sorting them | |

under both methods, samplesort required about 1.5% more comparisons | |

(the program is at the end of this file). | |

+ In real life, this may be faster or slower on random arrays than | |

samplesort was, depending on platform quirks. Since it does fewer | |

comparisons on average, it can be expected to do better the more | |

expensive a comparison function is. OTOH, it does more data movement | |

(pointer copying) than samplesort, and that may negate its small | |

comparison advantage (depending on platform quirks) unless comparison | |

is very expensive. | |

+ On arrays with many kinds of pre-existing order, this blows samplesort out | |

of the water. It's significantly faster than samplesort even on some | |

cases samplesort was special-casing the snot out of. I believe that lists | |

very often do have exploitable partial order in real life, and this is the | |

strongest argument in favor of timsort (indeed, samplesort's special cases | |

for extreme partial order are appreciated by real users, and timsort goes | |

much deeper than those, in particular naturally covering every case where | |

someone has suggested "and it would be cool if list.sort() had a special | |

case for this too ... and for that ..."). | |

+ Here are exact comparison counts across all the tests in sortperf.py, | |

when run with arguments "15 20 1". | |

Column Key: | |

*sort: random data | |

\sort: descending data | |

/sort: ascending data | |

3sort: ascending, then 3 random exchanges | |

+sort: ascending, then 10 random at the end | |

%sort: ascending, then randomly replace 1% of elements w/ random values | |

~sort: many duplicates | |

=sort: all equal | |

!sort: worst case scenario | |

First the trivial cases, trivial for samplesort because it special-cased | |

them, and trivial for timsort because it naturally works on runs. Within | |

an "n" block, the first line gives the # of compares done by samplesort, | |

the second line by timsort, and the third line is the percentage by | |

which the samplesort count exceeds the timsort count: | |

n \sort /sort =sort | |

------- ------ ------ ------ | |

32768 32768 32767 32767 samplesort | |

32767 32767 32767 timsort | |

0.00% 0.00% 0.00% (samplesort - timsort) / timsort | |

65536 65536 65535 65535 | |

65535 65535 65535 | |

0.00% 0.00% 0.00% | |

131072 131072 131071 131071 | |

131071 131071 131071 | |

0.00% 0.00% 0.00% | |

262144 262144 262143 262143 | |

262143 262143 262143 | |

0.00% 0.00% 0.00% | |

524288 524288 524287 524287 | |

524287 524287 524287 | |

0.00% 0.00% 0.00% | |

1048576 1048576 1048575 1048575 | |

1048575 1048575 1048575 | |

0.00% 0.00% 0.00% | |

The algorithms are effectively identical in these cases, except that | |

timsort does one less compare in \sort. | |

Now for the more interesting cases. Where lg(x) is the logarithm of x to | |

the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for | |

the best any comparison-based sorting algorithm can do on average (across | |

all permutations). When a method gets significantly below that, it's | |

either astronomically lucky, or is finding exploitable structure in the | |

data. | |

n lg(n!) *sort 3sort +sort %sort ~sort !sort | |

------- ------- ------ ------- ------- ------ ------- -------- | |

32768 444255 453096 453614 32908 452871 130491 469141 old | |

448885 33016 33007 50426 182083 65534 new | |

0.94% 1273.92% -0.30% 798.09% -28.33% 615.87% %ch from new | |

65536 954037 972699 981940 65686 973104 260029 1004607 | |

962991 65821 65808 101667 364341 131070 | |

1.01% 1391.83% -0.19% 857.15% -28.63% 666.47% | |

131072 2039137 2101881 2091491 131232 2092894 554790 2161379 | |

2057533 131410 131361 206193 728871 262142 | |

2.16% 1491.58% -0.10% 915.02% -23.88% 724.51% | |

262144 4340409 4464460 4403233 262314 4445884 1107842 4584560 | |

4377402 262437 262459 416347 1457945 524286 | |

1.99% 1577.82% -0.06% 967.83% -24.01% 774.44% | |

524288 9205096 9453356 9408463 524468 9441930 2218577 9692015 | |

9278734 524580 524633 837947 2916107 1048574 | |

1.88% 1693.52% -0.03% 1026.79% -23.92% 824.30% | |

1048576 19458756 19950272 19838588 1048766 19912134 4430649 20434212 | |

19606028 1048958 1048941 1694896 5832445 2097150 | |

1.76% 1791.27% -0.02% 1074.83% -24.03% 874.38% | |

Discussion of cases: | |

*sort: There's no structure in random data to exploit, so the theoretical | |

limit is lg(n!). Both methods get close to that, and timsort is hugging | |

it (indeed, in a *marginal* sense, it's a spectacular improvement -- | |

there's only about 1% left before hitting the wall, and timsort knows | |

darned well it's doing compares that won't pay on random data -- but so | |

does the samplesort hybrid). For contrast, Hoare's original random-pivot | |

quicksort does about 39% more compares than the limit, and the median-of-3 | |

variant about 19% more. | |

3sort, %sort, and !sort: No contest; there's structure in this data, but | |

not of the specific kinds samplesort special-cases. Note that structure | |

in !sort wasn't put there on purpose -- it was crafted as a worst case for | |

a previous quicksort implementation. That timsort nails it came as a | |

surprise to me (although it's obvious in retrospect). | |

+sort: samplesort special-cases this data, and does a few less compares | |

than timsort. However, timsort runs this case significantly faster on all | |

boxes we have timings for, because timsort is in the business of merging | |

runs efficiently, while samplesort does much more data movement in this | |

(for it) special case. | |

~sort: samplesort's special cases for large masses of equal elements are | |

extremely effective on ~sort's specific data pattern, and timsort just | |

isn't going to get close to that, despite that it's clearly getting a | |

great deal of benefit out of the duplicates (the # of compares is much less | |

than lg(n!)). ~sort has a perfectly uniform distribution of just 4 | |

distinct values, and as the distribution gets more skewed, samplesort's | |

equal-element gimmicks become less effective, while timsort's adaptive | |

strategies find more to exploit; in a database supplied by Kevin Altis, a | |

sort on its highly skewed "on which stock exchange does this company's | |

stock trade?" field ran over twice as fast under timsort. | |

However, despite that timsort does many more comparisons on ~sort, and | |

that on several platforms ~sort runs highly significantly slower under | |

timsort, on other platforms ~sort runs highly significantly faster under | |

timsort. No other kind of data has shown this wild x-platform behavior, | |

and we don't have an explanation for it. The only thing I can think of | |

that could transform what "should be" highly significant slowdowns into | |

highly significant speedups on some boxes are catastrophic cache effects | |

in samplesort. | |

But timsort "should be" slower than samplesort on ~sort, so it's hard | |

to count that it isn't on some boxes as a strike against it <wink>. | |

+ Here's the highwater mark for the number of heap-based temp slots (4 | |

bytes each on this box) needed by each test, again with arguments | |

"15 20 1": | |

2**i *sort \sort /sort 3sort +sort %sort ~sort =sort !sort | |

32768 16384 0 0 6256 0 10821 12288 0 16383 | |

65536 32766 0 0 21652 0 31276 24576 0 32767 | |

131072 65534 0 0 17258 0 58112 49152 0 65535 | |

262144 131072 0 0 35660 0 123561 98304 0 131071 | |

524288 262142 0 0 31302 0 212057 196608 0 262143 | |

1048576 524286 0 0 312438 0 484942 393216 0 524287 | |

Discussion: The tests that end up doing (close to) perfectly balanced | |

merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort | |

also ends up doing balanced merges, but systematically benefits a lot from | |

the preliminary pre-merge searches described under "Merge Memory" later. | |

%sort approaches having a balanced merge at the end because the random | |

selection of elements to replace is expected to produce an out-of-order | |

element near the midpoint. \sort, /sort, =sort are the trivial one-run | |

cases, needing no merging at all. +sort ends up having one very long run | |

and one very short, and so gets all the temp space it needs from the small | |

temparray member of the MergeState struct (note that the same would be | |

true if the new random elements were prefixed to the sorted list instead, | |

but not if they appeared "in the middle"). 3sort approaches N//3 temp | |

slots twice, but the run lengths that remain after 3 random exchanges | |

clearly has very high variance. | |

A detailed description of timsort follows. | |

Runs | |

---- | |

count_run() returns the # of elements in the next run. A run is either | |

"ascending", which means non-decreasing: | |

a0 <= a1 <= a2 <= ... | |

or "descending", which means strictly decreasing: | |

a0 > a1 > a2 > ... | |

Note that a run is always at least 2 long, unless we start at the array's | |

last element. | |

The definition of descending is strict, because the main routine reverses | |

a descending run in-place, transforming a descending run into an ascending | |

run. Reversal is done via the obvious fast "swap elements starting at each | |

end, and converge at the middle" method, and that can violate stability if | |

the slice contains any equal elements. Using a strict definition of | |

descending ensures that a descending run contains distinct elements. | |

If an array is random, it's very unlikely we'll see long runs. If a natural | |

run contains less than minrun elements (see next section), the main loop | |

artificially boosts it to minrun elements, via a stable binary insertion sort | |

applied to the right number of array elements following the short natural | |

run. In a random array, *all* runs are likely to be minrun long as a | |

result. This has two primary good effects: | |

1. Random data strongly tends then toward perfectly balanced (both runs have | |

the same length) merges, which is the most efficient way to proceed when | |

data is random. | |

2. Because runs are never very short, the rest of the code doesn't make | |

heroic efforts to shave a few cycles off per-merge overheads. For | |

example, reasonable use of function calls is made, rather than trying to | |

inline everything. Since there are no more than N/minrun runs to begin | |

with, a few "extra" function calls per merge is barely measurable. | |

Computing minrun | |

---------------- | |

If N < 64, minrun is N. IOW, binary insertion sort is used for the whole | |

array then; it's hard to beat that given the overheads of trying something | |

fancier (see note BINSORT). | |

When N is a power of 2, testing on random data showed that minrun values of | |

16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost | |

in binary insertion sort clearly hurt, and at 8 the increase in the number | |

of function calls clearly hurt. Picking *some* power of 2 is important | |

here, so that the merges end up perfectly balanced (see next section). We | |

pick 32 as a good value in the sweet range; picking a value at the low end | |

allows the adaptive gimmicks more opportunity to exploit shorter natural | |

runs. | |

Because sortperf.py only tries powers of 2, it took a long time to notice | |

that 32 isn't a good choice for the general case! Consider N=2112: | |

>>> divmod(2112, 32) | |

(66, 0) | |

>>> | |

If the data is randomly ordered, we're very likely to end up with 66 runs | |

each of length 32. The first 64 of these trigger a sequence of perfectly | |

balanced merges (see next section), leaving runs of lengths 2048 and 64 to | |

merge at the end. The adaptive gimmicks can do that with fewer than 2048+64 | |

compares, but it's still more compares than necessary, and-- mergesort's | |

bugaboo relative to samplesort --a lot more data movement (O(N) copies just | |

to get 64 elements into place). | |

If we take minrun=33 in this case, then we're very likely to end up with 64 | |

runs each of length 33, and then all merges are perfectly balanced. Better! | |

What we want to avoid is picking minrun such that in | |

q, r = divmod(N, minrun) | |

q is a power of 2 and r>0 (then the last merge only gets r elements into | |

place, and r < minrun is small compared to N), or q a little larger than a | |

power of 2 regardless of r (then we've got a case similar to "2112", again | |

leaving too little work for the last merge to do). | |

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a | |

power of 2, or if that isn't possible, is close to, but strictly less than, | |

a power of 2. This is easier to do than it may sound: take the first 6 | |

bits of N, and add 1 if any of the remaining bits are set. In fact, that | |

rule covers every case in this section, including small N and exact powers | |

of 2; merge_compute_minrun() is a deceptively simple function. | |

The Merge Pattern | |

----------------- | |

In order to exploit regularities in the data, we're merging on natural | |

run lengths, and they can become wildly unbalanced. That's a Good Thing | |

for this sort! It means we have to find a way to manage an assortment of | |

potentially very different run lengths, though. | |

Stability constrains permissible merging patterns. For example, if we have | |

3 consecutive runs of lengths | |

A:10000 B:20000 C:10000 | |

we dare not merge A with C first, because if A, B and C happen to contain | |

a common element, it would get out of order wrt its occurrence(s) in B. The | |

merging must be done as (A+B)+C or A+(B+C) instead. | |

So merging is always done on two consecutive runs at a time, and in-place, | |

although this may require some temp memory (more on that later). | |

When a run is identified, its length is passed to found_new_run() to | |

potentially merge runs on a stack of pending runs. We would like to delay | |

merging as long as possible in order to exploit patterns that may come up | |

later, but we like even more to do merging as soon as possible to exploit | |

that the run just found is still high in the memory hierarchy. We also can't | |

delay merging "too long" because it consumes memory to remember the runs that | |

are still unmerged, and the stack has a fixed size. | |

The original version of this code used the first thing I made up that didn't | |

obviously suck ;-) It was loosely based on invariants involving the Fibonacci | |

sequence. | |

It worked OK, but it was hard to reason about, and was subtle enough that the | |

intended invariants weren't actually preserved. Researchers discovered that | |

when trying to complete a computer-generated correctness proof. That was | |

easily-enough repaired, but the discovery spurred quite a bit of academic | |

interest in truly good ways to manage incremental merging on the fly. | |

At least a dozen different approaches were developed, some provably having | |

near-optimal worst case behavior with respect to the entropy of the | |

distribution of run lengths. Some details can be found in bpo-34561. | |

The code now uses the "powersort" merge strategy from: | |

"Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods | |

That Optimally Adapt to Existing Runs" | |

J. Ian Munro and Sebastian Wild | |

The code is pretty simple, but the justification is quite involved, as it's | |

based on fast approximations to optimal binary search trees, which are | |

substantial topics on their own. | |

Here we'll just cover some pragmatic details: | |

The `powerloop()` function computes a run's "power". Say two adjacent runs | |

begin at index s1. The first run has length n1, and the second run (starting | |

at index s1+n1, called "s2" below) has length n2. The list has total length n. | |

The "power" of the first run is a small integer, the depth of the node | |

connecting the two runs in an ideal binary merge tree, where power 1 is the | |

root node, and the power increases by 1 for each level deeper in the tree. | |

The power is the least integer L such that the "midpoint interval" contains | |

a rational number of the form J/2**L. The midpoint interval is the semi- | |

closed interval: | |

((s1 + n1/2)/n, (s2 + n2/2)/n] | |

Yes, that's brain-busting at first ;-) Concretely, if (s1 + n1/2)/n and | |

(s2 + n2/2)/n are computed to infinite precision in binary, the power L is | |

the first position at which the 2**-L bit differs between the expansions. | |

Since the left end of the interval is less than the right end, the first | |

differing bit must be a 0 bit in the left quotient and a 1 bit in the right | |

quotient. | |

`powerloop()` emulates these divisions, 1 bit at a time, using comparisons, | |

subtractions, and shifts in a loop. | |

You'll notice the paper uses an O(1) method instead, but that relies on two | |

things we don't have: | |

- An O(1) "count leading zeroes" primitive. We can find such a thing as a C | |

extension on most platforms, but not all, and there's no uniform spelling | |

on the platforms that support it. | |

- Integer division on an integer type twice as wide as needed to hold the | |

list length. But the latter is Py_ssize_t for us, and is typically the | |

widest native signed integer type the platform supports. | |

But since runs in our algorithm are almost never very short, the once-per-run | |

overhead of `powerloop()` seems lost in the noise. | |

Detail: why is Py_ssize_t "wide enough" in `powerloop()`? We do, after all, | |

shift integers of that width left by 1. How do we know that won't spill into | |

the sign bit? The trick is that we have some slop. `n` (the total list | |

length) is the number of list elements, which is at most 4 times (on a 32-box, | |

with 4-byte pointers) smaller than than the largest size_t. So at least the | |

leading two bits of the integers we're using are clear. | |

Since we can't compute a run's power before seeing the run that follows it, | |

the most-recently identified run is never merged by `found_new_run()`. | |

Instead a new run is only used to compute the 2nd-most-recent run's power. | |

Then adjacent runs are merged so long as their saved power (tree depth) is | |

greater than that newly computed power. When found_new_run() returns, only | |

then is a new run pushed on to the stack of pending runs. | |

A key invariant is that powers on the run stack are strictly decreasing | |

(starting from the run at the top of the stack). | |

Note that even powersort's strategy isn't always truly optimal. It can't be. | |

Computing an optimal merge sequence can be done in time quadratic in the | |

number of runs, which is very much slower, and also requires finding & | |

remembering _all_ the runs' lengths (of which there may be billions) in | |

advance. It's remarkable, though, how close to optimal this strategy gets. | |

Curious factoid: of all the alternatives I've seen in the literature, | |

powersort's is the only one that's always truly optimal for a collection of 3 | |

run lengths (for three lengths A B C, it's always optimal to first merge the | |

shorter of A and C with B). | |

Merge Memory | |

------------ | |

Merging adjacent runs of lengths A and B in-place, and in linear time, is | |

difficult. Theoretical constructions are known that can do it, but they're | |

too difficult and slow for practical use. But if we have temp memory equal | |

to min(A, B), it's easy. | |

If A is smaller (function merge_lo), copy A to a temp array, leave B alone, | |

and then we can do the obvious merge algorithm left to right, from the temp | |

area and B, starting the stores into where A used to live. There's always a | |

free area in the original area comprising a number of elements equal to the | |

number not yet merged from the temp array (trivially true at the start; | |

proceed by induction). The only tricky bit is that if a comparison raises an | |

exception, we have to remember to copy the remaining elements back in from | |

the temp area, lest the array end up with duplicate entries from B. But | |

that's exactly the same thing we need to do if we reach the end of B first, | |

so the exit code is pleasantly common to both the normal and error cases. | |

If B is smaller (function merge_hi, which is merge_lo's "mirror image"), | |

much the same, except that we need to merge right to left, copying B into a | |

temp array and starting the stores at the right end of where B used to live. | |

A refinement: When we're about to merge adjacent runs A and B, we first do | |

a form of binary search (more on that later) to see where B[0] should end up | |

in A. Elements in A preceding that point are already in their final | |

positions, effectively shrinking the size of A. Likewise we also search to | |

see where A[-1] should end up in B, and elements of B after that point can | |

also be ignored. This cuts the amount of temp memory needed by the same | |

amount. | |

These preliminary searches may not pay off, and can be expected *not* to | |

repay their cost if the data is random. But they can win huge in all of | |

time, copying, and memory savings when they do pay, so this is one of the | |

"per-merge overheads" mentioned above that we're happy to endure because | |

there is at most one very short run. It's generally true in this algorithm | |

that we're willing to gamble a little to win a lot, even though the net | |

expectation is negative for random data. | |

Merge Algorithms | |

---------------- | |

merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo | |

deals with runs where A <= B, and merge_hi where A > B. They don't know | |

whether the data is clustered or uniform, but a lovely thing about merging | |

is that many kinds of clustering "reveal themselves" by how many times in a | |

row the winning merge element comes from the same run. We'll only discuss | |

merge_lo here; merge_hi is exactly analogous. | |

Merging begins in the usual, obvious way, comparing the first element of A | |

to the first of B, and moving B[0] to the merge area if it's less than A[0], | |

else moving A[0] to the merge area. Call that the "one pair at a time" | |

mode. The only twist here is keeping track of how many times in a row "the | |

winner" comes from the same run. | |

If that count reaches MIN_GALLOP, we switch to "galloping mode". Here | |

we *search* B for where A[0] belongs, and move over all the B's before | |

that point in one chunk to the merge area, then move A[0] to the merge | |

area. Then we search A for where B[0] belongs, and similarly move a | |

slice of A in one chunk. Then back to searching B for where A[0] belongs, | |

etc. We stay in galloping mode until both searches find slices to copy | |

less than MIN_GALLOP elements long, at which point we go back to one-pair- | |

at-a-time mode. | |

A refinement: The MergeState struct contains the value of min_gallop that | |

controls when we enter galloping mode, initialized to MIN_GALLOP. | |

merge_lo() and merge_hi() adjust this higher when galloping isn't paying | |

off, and lower when it is. | |

Galloping | |

--------- | |

Still without loss of generality, assume A is the shorter run. In galloping | |

mode, we first look for A[0] in B. We do this via "galloping", comparing | |

A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding | |

the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most | |

roughly lg(B) comparisons, and, unlike a straight binary search, favors | |

finding the right spot early in B (more on that later). | |

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1 | |

consecutive elements, and a straight binary search requires exactly k-1 | |

additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we | |

copy all the B's up to that point in one chunk, and then copy A[0]. Note | |

that no matter where A[0] belongs in B, the combination of galloping + binary | |

search finds it in no more than about 2*lg(B) comparisons. | |

If we did a straight binary search, we could find it in no more than | |

ceiling(lg(B+1)) comparisons -- but straight binary search takes that many | |

comparisons no matter where A[0] belongs. Straight binary search thus loses | |

to galloping unless the run is quite long, and we simply can't guess | |

whether it is in advance. | |

If data is random and runs have the same length, A[0] belongs at B[0] half | |

the time, at B[1] a quarter of the time, and so on: a consecutive winning | |

sub-run in B of length k occurs with probability 1/2**(k+1). So long | |

winning sub-runs are extremely unlikely in random data, and guessing that a | |

winning sub-run is going to be long is a dangerous game. | |

OTOH, if data is lopsided or lumpy or contains many duplicates, long | |

stretches of winning sub-runs are very likely, and cutting the number of | |

comparisons needed to find one from O(B) to O(log B) is a huge win. | |

Galloping compromises by getting out fast if there isn't a long winning | |

sub-run, yet finding such very efficiently when they exist. | |

I first learned about the galloping strategy in a related context; see: | |

"Adaptive Set Intersections, Unions, and Differences" (2000) | |

Erik D. Demaine, Alejandro LÃ³pez-Ortiz, J. Ian Munro | |

and its followup(s). An earlier paper called the same strategy | |

"exponential search": | |

"Optimistic Sorting and Information Theoretic Complexity" | |

Peter McIlroy | |

SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp | |

467-474, Austin, Texas, 25-27 January 1993. | |

and it probably dates back to an earlier paper by Bentley and Yao. The | |

McIlroy paper in particular has good analysis of a mergesort that's | |

probably strongly related to this one in its galloping strategy. | |

Galloping with a Broken Leg | |

--------------------------- | |

So why don't we always gallop? Because it can lose, on two counts: | |

1. While we're willing to endure small per-merge overheads, per-comparison | |

overheads are a different story. Calling Yet Another Function per | |

comparison is expensive, and gallop_left() and gallop_right() are | |

too long-winded for sane inlining. | |

2. Galloping can-- alas --require more comparisons than linear one-at-time | |

search, depending on the data. | |

#2 requires details. If A[0] belongs before B[0], galloping requires 1 | |

compare to determine that, same as linear search, except it costs more | |

to call the gallop function. If A[0] belongs right before B[1], galloping | |

requires 2 compares, again same as linear search. On the third compare, | |

galloping checks A[0] against B[3], and if it's <=, requires one more | |

compare to determine whether A[0] belongs at B[2] or B[3]. That's a total | |

of 4 compares, but if A[0] does belong at B[2], linear search would have | |

discovered that in only 3 compares, and that's a huge loss! Really. It's | |

an increase of 33% in the number of compares needed, and comparisons are | |

expensive in Python. | |

index in B where # compares linear # gallop # binary gallop | |

A[0] belongs search needs compares compares total | |

---------------- ----------------- -------- -------- ------ | |

0 1 1 0 1 | |

1 2 2 0 2 | |

2 3 3 1 4 | |

3 4 3 1 4 | |

4 5 4 2 6 | |

5 6 4 2 6 | |

6 7 4 2 6 | |

7 8 4 2 6 | |

8 9 5 3 8 | |

9 10 5 3 8 | |

10 11 5 3 8 | |

11 12 5 3 8 | |

... | |

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons | |

to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons. | |

The advantage of galloping is unbounded as i grows, but it doesn't win at | |

all until i=6. Before then, it loses twice (at i=2 and i=4), and ties | |

at the other values. At and after i=6, galloping always wins. | |

We can't guess in advance when it's going to win, though, so we do one pair | |

at a time until the evidence seems strong that galloping may pay. MIN_GALLOP | |

is 7, and that's pretty strong evidence. However, if the data is random, it | |

simply will trigger galloping mode purely by luck every now and again, and | |

it's quite likely to hit one of the losing cases next. On the other hand, | |

in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it | |

"should be" then. So the MergeState struct keeps a min_gallop variable | |

that merge_lo and merge_hi adjust: the longer we stay in galloping mode, | |

the smaller min_gallop gets, making it easier to transition back to | |

galloping mode (if we ever leave it in the current merge, and at the | |

start of the next merge). But whenever the gallop loop doesn't pay, | |

min_gallop is increased by one, making it harder to transition back | |

to galloping mode (and again both within a merge and across merges). For | |

random data, this all but eliminates the gallop penalty: min_gallop grows | |

large enough that we almost never get into galloping mode. And for cases | |

like ~sort, min_gallop can fall to as low as 1. This seems to work well, | |

but in all it's a minor improvement over using a fixed MIN_GALLOP value. | |

Galloping Complication | |

---------------------- | |

The description above was for merge_lo. merge_hi has to merge "from the | |

other end", and really needs to gallop starting at the last element in a run | |

instead of the first. Galloping from the first still works, but does more | |

comparisons than it should (this is significant -- I timed it both ways). For | |

this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT) | |

functions have a "hint" argument, which is the index at which galloping | |

should begin. So galloping can actually start at any index, and proceed at | |

offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index. | |

In the code as I type it's always called with either 0 or n-1 (where n is | |

the # of elements in a run). It's tempting to try to do something fancier, | |

melding galloping with some form of interpolation search; for example, if | |

we're merging a run of length 1 with a run of length 10000, index 5000 is | |

probably a better guess at the final result than either 0 or 9999. But | |

it's unclear how to generalize that intuition usefully, and merging of | |

wildly unbalanced runs already enjoys excellent performance. | |

~sort is a good example of when balanced runs could benefit from a better | |

hint value: to the extent possible, this would like to use a starting | |

offset equal to the previous value of acount/bcount. Doing so saves about | |

10% of the compares in ~sort. However, doing so is also a mixed bag, | |

hurting other cases. | |

Comparing Average # of Compares on Random Arrays | |

------------------------------------------------ | |

[NOTE: This was done when the new algorithm used about 0.1% more compares | |

on random data than does its current incarnation.] | |

Here list.sort() is samplesort, and list.msort() this sort: | |

""" | |

import random | |

from time import clock as now | |

def fill(n): | |

from random import random | |

return [random() for i in range(n)] | |

def mycmp(x, y): | |

global ncmp | |

ncmp += 1 | |

return cmp(x, y) | |

def timeit(values, method): | |

global ncmp | |

X = values[:] | |

bound = getattr(X, method) | |

ncmp = 0 | |

t1 = now() | |

bound(mycmp) | |

t2 = now() | |

return t2-t1, ncmp | |

format = "%5s %9.2f %11d" | |

f2 = "%5s %9.2f %11.2f" | |

def drive(): | |

count = sst = sscmp = mst = mscmp = nelts = 0 | |

while True: | |

n = random.randrange(100000) | |

nelts += n | |

x = fill(n) | |

t, c = timeit(x, 'sort') | |

sst += t | |

sscmp += c | |

t, c = timeit(x, 'msort') | |

mst += t | |

mscmp += c | |

count += 1 | |

if count % 10: | |

continue | |

print "count", count, "nelts", nelts | |

print format % ("sort", sst, sscmp) | |

print format % ("msort", mst, mscmp) | |

print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp) | |

drive() | |

""" | |

I ran this on Windows and kept using the computer lightly while it was | |

running. time.clock() is wall-clock time on Windows, with better than | |

microsecond resolution. samplesort started with a 1.52% #-of-comparisons | |

disadvantage, fell quickly to 1.48%, and then fluctuated within that small | |

range. Here's the last chunk of output before I killed the job: | |

count 2630 nelts 130906543 | |

sort 6110.80 1937887573 | |

msort 6002.78 1909389381 | |

1.80 1.49 | |

We've done nearly 2 billion comparisons apiece at Python speed there, and | |

that's enough <wink>. | |

For random arrays of size 2 (yes, there are only 2 interesting ones), | |

samplesort has a 50%(!) comparison disadvantage. This is a consequence of | |

samplesort special-casing at most one ascending run at the start, then | |

falling back to the general case if it doesn't find an ascending run | |

immediately. The consequence is that it ends up using two compares to sort | |

[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be | |

taught how to deal with mixtures of ascending and descending runs | |

efficiently in all cases. | |

NOTES | |

----- | |

BINSORT | |

A "binary insertion sort" is just like a textbook insertion sort, but instead | |

of locating the correct position of the next item via linear (one at a time) | |

search, an equivalent to Python's bisect.bisect_right is used to find the | |

correct position in logarithmic time. Most texts don't mention this | |

variation, and those that do usually say it's not worth the bother: insertion | |

sort remains quadratic (expected and worst cases) either way. Speeding the | |

search doesn't reduce the quadratic data movement costs. | |

But in CPython's case, comparisons are extraordinarily expensive compared to | |

moving data, and the details matter. Moving objects is just copying | |

pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary | |

user-supplied Python code), but even in simple cases (like 3 < 4) _all_ | |

decisions are made at runtime: what's the type of the left comparand? the | |

type of the right? do they need to be coerced to a common type? where's the | |

code to compare these types? And so on. Even the simplest Python comparison | |

triggers a large pile of C-level pointer dereferences, conditionals, and | |

function calls. | |

So cutting the number of compares is almost always measurably helpful in | |

CPython, and the savings swamp the quadratic-time data movement costs for | |

reasonable minrun values. | |

LEFT OR RIGHT | |

gallop_left() and gallop_right() are akin to the Python bisect module's | |

bisect_left() and bisect_right(): they're the same unless the slice they're | |

searching contains a (at least one) value equal to the value being searched | |

for. In that case, gallop_left() returns the position immediately before the | |

leftmost equal value, and gallop_right() the position immediately after the | |

rightmost equal value. The distinction is needed to preserve stability. In | |

general, when merging adjacent runs A and B, gallop_left is used to search | |

thru B for where an element from A belongs, and gallop_right to search thru A | |

for where an element from B belongs. | |

REGION OF UNCERTAINTY | |

Two kinds of confusion seem to be common about the claim that after finding | |

a k such that | |

B[2**(k-1) - 1] < A[0] <= B[2**k - 1] | |

then a binary search requires exactly k-1 tries to find A[0]'s proper | |

location. For concreteness, say k=3, so B[3] < A[0] <= B[7]. | |

The first confusion takes the form "OK, then the region of uncertainty is at | |

indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 = | |

3"; or the region is viewed as a Python slice and the objection is "but that's | |

the slice B[3:7], so has 7-3 = 4 elements". Resolution: we've already | |

compared A[0] against B[3] and against B[7], so A[0]'s correct location is | |

already known wrt _both_ endpoints. What remains is to find A[0]'s correct | |

location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the | |

slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1 | |

inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has | |

(2**k-1)-1 - 2**(k-1) + 1 = | |

2**k-1 - 2**(k-1) = | |

2*2**(k-1)-1 - 2**(k-1) = | |

(2-1)*2**(k-1) - 1 = | |

2**(k-1) - 1 | |

elements. | |

The second confusion: "k-1 = 2 binary searches can find the correct location | |

among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we | |

could make this more efficient by arranging for the region of uncertainty to | |

span 2**(k-1) elements." Resolution: that confuses "elements" with | |

"locations". In a slice with N elements, there are N+1 _locations_. In the | |

example, with the region of uncertainty B[4], B[5], B[6], there are 4 | |

locations: before B[4], between B[4] and B[5], between B[5] and B[6], and | |

after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1) | |

locations. That's why k-1 binary searches are necessary and sufficient. | |

OPTIMIZATION OF INDIVIDUAL COMPARISONS | |

As noted above, even the simplest Python comparison triggers a large pile of | |

C-level pointer dereferences, conditionals, and function calls. This can be | |

partially mitigated by pre-scanning the data to determine whether the data is | |

homogeneous with respect to type. If so, it is sometimes possible to | |

substitute faster type-specific comparisons for the slower, generic | |

PyObject_RichCompareBool. |