From Pickwiki
Jump to navigationJump to search

A simple treatise on sorting

The first rule of sorting in a multi-value database is "let the database do it". The second rule is "if you have to do it yourself, don't use dynamic arrays". Given the increasing power of machines today, it isn't actually that much of a bind to sort yourself, but it always makes life a lot easier if you can either sort on @ID or declare an index and sort on that. That way, you can either ask the database for a sorted keylist, or ask for a keylist and sort it yourself. Either way, you don't hammer the system. But assuming you do have to sort it yourself, what's the best way?

Bubble Sort

The simplest sort of all is that group of techniques commonly known as "the bubble sort". Actually, there's quite a variety of these, but they all rely on comparing neighbouring elements, and swap them if necessary. Although regularly slated as "inefficient", the best of them are extremely fast if used in appropriate circumstances. Those circumstances are, primarily, when the data is pretty much sorted already. Given this they are regularly known to outperform a quick-sort!

The most basic of bubble sorts simply starts at the top, and compares each element with the one below it until it hits the bottom. If there are N elements, it does the entire scan N times, by which means you can mathematically prove the resulting array must be sorted.

The first, most obvious, optimisation is to add a "low water mark". On each pass, you know that the next biggest item has sunk to its final position, so there's no point scanning beyond the last swap of the previous pass. But that's no help if the smallest item is initially at the bottom. So the next optimisation is to alternate downward and upward scans, and introduce a "high water mark". An alternative optimisation is to do an upwards scan every time your downward scan finds an item that is out of position to move it to its correct place. There's not much more that you can do to the bubble sort. Given that your data is mostly in sorted order to begin with, this is one of the best sorts to use. Most MV'ers, however, probably tend to use an insertion sort, which is a slightly different technique to achieve the same effect as the last optimisation above.

Insertion Sort

The insertion sort is uniquely easy to use in MV. But it relies on dynamic arrays, and rapidly becomes expensive if your array grows too big. An insertion sort can either be done "in place", or by building a new array. You step through the old array item by item, doing a sorted locate in the new array and inserting it in the correct place. If you're doing an "in place" sort, you know the new place can never be greater than the old place. Something like

         DEL ARRAY<II>

Shell Sort

The shell sort (my favourite) is another variant on the same theme. But rather than compare neighbouring items, you start by comparing items as far apart as practicable, progressively narrowing the gap and making finer and finer corrections. Each pass starts at the top, and on hitting an out-of-place element, makes a nested reverse pass to bubble it up to its correct position. So although basically very similar to a bubble sort, it makes fewer passes, and each pass is shorter.

Simple variants use an interval of N/x for the first pass, dividing by x on each pass where N is the total number of elements, and x is either 2 or 3. More complicated variants seek to optimise this by calculating x^n or x^n-1 where n is as large as possible as their first interval before dividing by x on each pass.

Quick Sort

Similarly, the quick sort (reputedly the fastest general purpose sort) also uses a "divide and conquer" approach. A pivot value is chosen - it may be the element in the middle of the array, or the average of the first and last elements, or whatever. The sort then proceeds to make sure that all the values above the pivot are smaller, and all the values below it are larger. One way to accomplish this is to start from the top of the array, and when you find a value larger than the pivot, you swap them. Then you start from the bottom looking for a value that is smaller, and swap them. Then you continue from where you left off coming down ...

When you meet in the middle you now know that your pivot is in the correct position, and all the other elements are correct relative to it. So you recursively call the sort again for those elements above the pivot, and for those elements below.

Heap Sort

The last sort in general use nowadays is the heap sort. This could actually be well adapted to "grab the next highest value" or whatever and give a rapid response to a calling program without having to presort the entire array first. It's actually quite a weird sort at first glance, because it bubbles the biggest element to the top ...

For the first pass down the array, each element N is compared in position N/2, and if the lower element is larger, they swap and the element now in N/2 is compared with the element in (N/2)/2 ... The end result is that the largest element is at the top.

This is now swapped for the element at the bottom. The bottom element is now sorted and is ignored for the rest of the sort. The element at the top (N=1) is now totally out of position. The two elements 2N and 2N+1 are compared, and the element at N is swapped with the larger. It is then compared with the new elements 2N and 2N+1 and swapped again, until it can't go any further. Once again, the biggest element is now at the top, and can be swapped for the new bottom element (the one above the last bottom) to complete the cycle until the array is completely sorted.

This sort is unlikely now to be typically used as a sort, but if you want to grab the smallest (or largest) item and process it, then grab the next and so on, it is very much the best way to get a quick response time, responding with the first item in a single pass through the array, and then needing only a partial pass for each further item. An implementation of this, which returns one item per call, is here