Sort

### sort

Computer science literature is packed full of sorting algorithms, and all of them seem to operate on arrays. Everybody knows the Sorting Facts Of Life:

* Bubblesort, Insertion Sort and Selection Sort are bad;
* Shellsort is better but nowhere near the theoretical O(N log N) limit;
* Quicksort is great when it works, but unreliable;
* Heapsort is reliably good, but unstable, and also about a factor of 4 slower than Quicksort's best case.
* Mergesort is reliably good but requires O(N) auxiliary space;

Nobody tells you what to do if you want to sort something other than an array. Binary trees and their ilk are all ready-sorted, but what about linked lists?

It turns out that Mergesort works even better on linked lists than it does on arrays. It avoids the need for the auxiliary space, and becomes a simple, reliably O(N log N) sorting algorithm. And as an added bonus, it's stable too.

#### k-way merge sort

External Sort/Merge Algorithms

This chapter discusses external sort/merge algorithms used for sorting large data files that could not fit into the main memory of the computer. The algorithms presented are: two-way sort/merge, the balanced two-way sort/merge, the balanced k-way sort/merge, and the polyphase sort/merge. The advantages of sorting records include the following: sorting transactions to match the key order of the master file can reduce the time required to locate a matching master record; sorted transactions can improve the maintenance of master files with other types of file organization; and many reports generated from files require that the file be sorted, as it is more presentable and readable when sorted.

Two-Way Sort/Merge

A sort/merge algorithm involves two steps:

1. The records in the file to be sorted are divided into several groups, called run, and each run fits into main memory. An internal sort is applied to each run, and the resulting sorted runs are distributed to two external files.
2. One run at a time from each of the external files created in step 1 merge into larger runs of sorted records. The result is stored in a third external file. The data are distributed from the third file back into the first two files, and the merge continues until all records are in one large run.

Suppose we have a file of records with the following keys to be sorted in ascending order:

50 110 95 10 100 36 153 40 120 60 70 130 22 140 80

Let us assume that the size of the run is 3 records/run. By using step 1, we can divide the keys into groups of 3 as follows: (50,110,95), (10,100,36), (153,40,120), (60,70,130), (22,140,80). The number of groups indicates that we have 5 runs. These 5 runs are then loaded into 2 files: File 1 contains run 1(50,110,95), run 3(153,40,120), and run 5(22,140,80); File 2 contains run 2(10,100,36) and run 4(60,70,130). All in sorted order. The distribution of runs in each file is alternated in order to have the same amount of runs in each file, at worst case one file will have one more run than the other, just like our example above.

In step 2 we will be merging the two files together one run at a time into a third file, say file 3. Run 1 in file 1 is merged with run 2 in file 2 to produce the first run in file 3 with 6 sorted records. Run 3 in file 1 is merged with run 4 in file 2 to produce the second run in file 3 also with 6 sorted records. Now file 2 is empty and file 1 contains run 5. Run 5 is merged into file 3 to make the third run of 3 sorted records.

Now we are going repeat step 1 by distributing each run in file 3 into file 1 and file 2, and then merge the two files into file 3 again. These steps are repeated until we have only one run in file 3. The number of passes through the two-way sort/merge is the ceiling of lg NR, where NR means number of runs and lg means log base 2. Please see figure 6.2 and 6.3 on page 168 and 169 respectively to see how the algorithm works graphically. The algorithm is presented on page 169 to 171.

Balanced Two-Way Sort/Merge

Balanced two-way sort/merge algorithm is nothing more than an improvement to the two-way sort/merge algorithm. By increasing the number files used, the balanced two-way sort/merge algorithm merges two input files and stores the merged runs on two output files alternately. As a result, the algorithm eliminates the need to redistribute the runs to the two input files before it can start merging. This algorithm merges runs from file 1 and file 2 and stores the merged runs into file 3 and file 4, it then merges the runs in file 3 and file 4 and stores the merged runs into file 1 and file 2, again the algorithm continues until all the records are merged into one run. Please see Figure 6.4 and 6.5 on page 172 and 173. The number of passes through this algorithm is the same as the number of passes through the two-way sort/merge algorithm, that is the ceiling of lg NR.

Balanced k-Way Sort/Merge

This also is an improvement on the previous algorithm. By increasing the number of input and output files, the number of runs in each file is decreased, with the result that fewer merges have to be performed. However, the merging phase is more complicated with k input files because a k-way merge needs to distribute merged runs into k output files.

Suppose we have the same file that was described above with 15 records and we were to sort the file using the balanced k-way sort/merge algorithm. The distribution of the records into runs is similar to the one used in the two-way sort/merge algorithm. However the merge phase will be merging into 3 files, that is, run 1 from file 1 and run 2 from file 2 and run 3 from file 3 are merged into file 4 to produce 1 run of sorted records. Run 4 from file 1 and run 5 from 2 are then merged into file 5 to produce 1 run of sorted records. Now, the two output files (file 4 and 5) will now become input files (just like in balanced two-way sort/merge) and the two runs will be merged into file 1 to produce 1 run of sorted records and the sort/merge is done. The example just described can be called the Balanced 3-Way Sort/Merge as we have 3 input files and 3 output files, even though file 6 was never used, as we did not have enough runs to use file 6. Please see Figure 6.6 on page 177 for the graphical representation of this algorithm. The advantage of this algorithm is that the sort/merge phase is more faster than the algorithm that were described above, but the disadvantage is the excess of use of files (disk space). The number of passes required in this algorithm is the ceiling lgk NR of k is the number of files to be merged and it is also the number of output files.

#### Given some x computers (modern computers with good RAM, CPU)

There are several numbers on each computer (may be in billions)

Find the Median of this.

I seen now that the correct solution would minimise calls across the network because that impacts performance more than the work done by each individual node.

I would suggest the following algorithm:

1. Sort the arrays

2. Poll each machine for max and min values and determine the global MAX and MIN values

4. Set m = 1/2(MAX + MIN)

5. Compare partitions in the manner you describe.

6. If the partitions are equal then m is the median, else if the median is greater than m then set MIN = m (else set MAX = m) and goto step 4.

A less efficient but simpler algorithm than Manu's is to use a distributed binary search.

Choose one computer to be the master. The rest are slaves:
1. Master computer sends a message to all slaves, "What are your min and max elements?" It stores the global min and max (the min of the mins, and the max of the maxes).
2. Master computer picks a number x. To start with, x = (max+min)/2
3. Master asks all slaves how many of its array elements are (a) less than x and (b) greater than x. It sums these up to get the total of all numbers less than/greater than x in all arrays. Call them less_x and more_x.
If less_x < more_x, then set x = (x + max) / 2, min = old x;
Else set x = (x + min) / 2; max = old x;
4. Repeat step 3 until max <= min.

The process will terminate after log(n) rounds, since (like normal binary search) each round eliminates half the possible range of values. Each round takes n time, since the slaves have to scan all their values, so total time complexity is O(n*log(n)) compared to Manu's O(n).

http://discuss.techinterview.org/default.asp?interview.11.444752.8

An interesting extension: What if the timing is asynchronous? We've been assuming that when the master sends a message to all the slaves, they all respond at about the same time. But in practice, there may be "stragglers": slaves that respond much more slowly than others because they're overloaded, on a slow network link, on the other side of the data center, etc. So how can we modify the algorithm to deal with responses that stream in continuously instead of arriving all at once?

#### merge k sorted lists efficiently

Take the first element of each list and create a heap (of size k). Pop the smallest element. Find the array from the element came (let's say it came from list number i). Take the next element from list i and push it in heap. For each element that go in the merged list, we spent log(k) time. So the time complexity is O(N*logk) where N is total number of the elements in all the K lists.

#### Suppose we have N companies, and we want to eventually merge them into

one big company. How many ways are there to merge?

waystomerger(N) = (N 2) * ways to merge(N-1);

since there are (N 2) or Nc2 ways of choosing any 2 out of the N companies to merge
if we follow the suit till 2 companies are left. we get

waystomerge(N) = (N 2) * (N-1 2) * … (2 ,2) = (n! * (n-1)! )/ 2^n

Picking 2 from x has x! / (2!((x-2)!)) ways of happening
Which equals x * x -1 / 2
int ways_of_picking(int num)
{
if(num==2)
{
return 1;
}
else
{
return num * (num - 1) * ways_of_picking(x-1) / 2
}
}