CONVEY

When More Parallelism != More Performance

2022 January 11 22:38 stuartscott

After reading @Kostadin's Why Amdahl's Law is Important article, I wanted to explore the concepts in more detail by parallelizing a serial algorithm.

Bubble Sort is a serial sorting algorithm that iterates through an array comparing each adjacent pair of elements; if a pair is out of order, the two elements are swapped. This is repeated until the entire array is sorted. Bubble Sort cannot take advantage of multiple cores, but it holds the amusing title of fastest sorting algorithm, provided the data is already sorted.

Here is Bubble Sort in Go:


package main

import "log"

func main() {
	array := []int{1, 5, 2, 6, 3, 7, 4, 9, 8, 0}
	log.Println(array)
	BubbleSort(array, func(a, b int) bool {
		return array[a] < array[b]
	})
	log.Println(array)
}

func BubbleSort(array []int, less func(int, int) bool) {
	limit := len(array)
	// Repeats until no swaps occur.
	for swapped := true; swapped; {
		swapped = false
		// Iterate through all adjacent pairs.
		for i := 0; i < limit-1; {
			j := i + 1
			// Swap only if the pair is strictly out of order,
			// otherwise equal elements would swap forever.
			if less(j, i) {
				array[i], array[j] = array[j], array[i]
				swapped = true
			}
			i = j
		}
	}
}

If the hardware running the Bubble Sort had two CPU cores, couldn't we modify the algorithm so that the first core checks one pair of adjacent elements while the second core checks a different pair?

Foam Sort is an attempt to parallelize Bubble Sort - just as a foam consists of many bubbles, Foam Sort consists of multiple Bubble Sorts executing in parallel. Well, that was the idea anyway. Note: this is not a new idea; it was originally presented by Habermann in 1972 and has the descriptive, though uninspiring, name Odd-Even Transposition Sort.
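Before going parallel, the odd-even phase structure is easier to see in a serial sketch (a minimal illustration I've written for this purpose, not code from the original project): even-indexed pairs are compared in one phase, odd-indexed pairs in the next, and the phases repeat until nothing moves.

```go
package main

import "fmt"

// OddEvenSort is a serial sketch of the phase structure that
// Foam Sort parallelizes: compare even-indexed pairs (0,1), (2,3), ...
// in one phase, odd-indexed pairs (1,2), (3,4), ... in the next,
// and repeat both phases until no swaps occur.
func OddEvenSort(array []int) {
	for swapped := true; swapped; {
		swapped = false
		for start := 0; start <= 1; start++ {
			for i := start; i < len(array)-1; i += 2 {
				if array[i+1] < array[i] {
					array[i], array[i+1] = array[i+1], array[i]
					swapped = true
				}
			}
		}
	}
}

func main() {
	array := []int{1, 5, 2, 6, 3, 7, 4, 9, 8, 0}
	OddEvenSort(array)
	fmt.Println(array)
}
```

The key property is that pairs within a single phase never overlap, which is what makes the phase safe to execute in parallel.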

Here is Foam Sort in Go:


package main

import (
	"log"
	"runtime"
)

func main() {
	array := []int{1, 5, 2, 6, 3, 7, 4, 9, 8, 0}
	log.Println(array)
	FoamSort(array, func(a, b int) bool {
		return array[a] < array[b]
	})
	log.Println(array)
}

func FoamSort(array []int, less func(int, int) bool) {
	limit := len(array)
	jobs := make(chan int, limit/2)
	results := make(chan bool, limit/2)
	// Repeatedly takes an index from job queue.
	// Compares, and optionally swaps, with adjacent index.
	work := func() {
		for j := range jobs {
			a := j
			b := j + 1
			// Swap only if the pair is strictly out of order,
			// otherwise equal elements would swap forever.
			if less(b, a) {
				array[a], array[b] = array[b], array[a]
				results <- true
			} else {
				results <- false
			}
		}
	}
	// Spawns a worker goroutine for each available processor.
	for w := 0; w < runtime.GOMAXPROCS(0); w++ {
		go work()
	}
	// Adds indices of non-overlapping pairs to job queue.
	// Waits for all results.
	sort := func(start int) (result bool) {
		for i := start; i < limit-1; i += 2 {
			jobs <- i
		}
		for i := start; i < limit-1; i += 2 {
			result = <-results || result
		}
		return
	}
	// Sorts even, then odd, pairs.
	// Repeats until no swaps occur.
	for swapped := true; swapped; {
		swapped = sort(0)
		swapped = sort(1) || swapped
	}
	// Closes queues so worker threads can terminate.
	close(jobs)
	close(results)
}

Is Foam Sort the best new sorter on the block?

Short answer: no.

To understand the longer answer, let's compare Foam Sort with Bubble Sort and Standard Sort (the optimized Quick Sort implemented in the Go standard library) in each of three cases: Best - the data is already sorted; Worst - the data is sorted in reverse order; and Random - the data is unsorted.

The results below show how the algorithms compare when sorting an array of 1000 integers on a MacBook Pro with 4 worker threads.


$ go test -bench=. ./...
goos: darwin
goarch: amd64
pkg: github.com/stuartmscott/foamsort
cpu: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
BenchmarkFoamSort_Best_1K-4         	    4484	    227880 ns/op
BenchmarkBubbleSort_Best_1K-4       	  169204	      6207 ns/op
BenchmarkStandardSort_Best_1K-4     	   29431	     40083 ns/op
BenchmarkFoamSort_Worst_1K-4        	      10	 110654902 ns/op
BenchmarkBubbleSort_Worst_1K-4      	     318	   3710267 ns/op
BenchmarkStandardSort_Worst_1K-4    	   28496	     41871 ns/op
BenchmarkFoamSort_Random_1K-4       	      10	 102962330 ns/op
BenchmarkBubbleSort_Random_1K-4     	     294	   4333471 ns/op
BenchmarkStandardSort_Random_1K-4   	    8998	    132668 ns/op
PASS
ok  	github.com/stuartmscott/foamsort	13.392s

The results above show that in the Best case Foam Sort is ~37x slower than Bubble Sort, in the Worst case ~30x slower, and in the Random case ~24x slower!

How could this happen?! The key issue is that Foam Sort parallelizes the comparison operation of the sort, which is not the most expensive part. The main cost of Bubble Sort comes from the sheer number of comparisons it must perform - N in the best case and N^2 in the worst. Reducing the number of comparisons does more for performance than reducing the cost of each individual comparison, which is why Standard Sort is so much quicker in most cases - it simply does fewer comparisons.

What if the comparison operations were more expensive?

We can simulate this by adding a small sleep to the comparison operation:


func(a, b int) bool {
	time.Sleep(100 * time.Nanosecond)
	return array[a] < array[b]
}

$ go test -bench=. ./...
goos: darwin
goarch: amd64
pkg: github.com/stuartmscott/foamsort
cpu: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
BenchmarkFoamSort_Best_1K-4         	    3717	    293859 ns/op
BenchmarkBubbleSort_Best_1K-4       	    1878	    577005 ns/op
BenchmarkStandardSort_Best_1K-4     	     277	   4195029 ns/op
BenchmarkFoamSort_Worst_1K-4        	       8	 139885513 ns/op
BenchmarkBubbleSort_Worst_1K-4      	       3	 482107601 ns/op
BenchmarkStandardSort_Worst_1K-4    	     272	   4268562 ns/op
BenchmarkFoamSort_Random_1K-4       	       8	 137650085 ns/op
BenchmarkBubbleSort_Random_1K-4     	       3	 485127224 ns/op
BenchmarkStandardSort_Random_1K-4   	     229	   5092066 ns/op
PASS
ok  	github.com/stuartmscott/foamsort	19.632s

The results above are slightly more in Foam Sort's favour: in the Best case it is ~2x faster than Bubble Sort, in the Worst case ~3.4x faster, and in the Random case ~3.5x faster! If you had expected the speed-up to be 4x, due to the number of worker threads, then I invite you to re-read @Kostadin's Why Amdahl's Law is Important article.
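As a refresher, Amdahl's Law bounds the speed-up from n workers when only a fraction p of the work can run in parallel - the serial remainder (1 - p) never gets faster. A quick sketch:

```go
package main

import "fmt"

// speedup applies Amdahl's Law: the serial fraction (1-p) runs at
// full cost no matter how many workers n share the parallel fraction p.
func speedup(p float64, n int) float64 {
	return 1 / ((1 - p) + p/float64(n))
}

func main() {
	fmt.Printf("p=1.00, n=4:  %.2fx\n", speedup(1.00, 4))  // 4.00x
	fmt.Printf("p=0.95, n=4:  %.2fx\n", speedup(0.95, 4))  // 3.48x
	fmt.Printf("p=0.95, n=32: %.2fx\n", speedup(0.95, 32)) // 12.55x
}
```

Even with 95% of the work parallelized, four workers top out at ~3.5x - which is roughly what the measurements above show.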

What happens if we just throw more cores at the problem?

I went to my cloud provider of choice, Digital Ocean, and purchased their top-of-the-line droplet with 32 dedicated CPU cores!

The results were poor.


$ go test -bench=. ./...
goos: linux
goarch: amd64
pkg: github.com/stuartmscott/foamsort
cpu: Intel Xeon Processor (Cascadelake)
BenchmarkFoamSort_Best_1K-32          	    2278	    506672 ns/op
BenchmarkBubbleSort_Best_1K-32        	  219459	      5479 ns/op
BenchmarkStandardSort_Best_1K-32      	   37087	     32299 ns/op
BenchmarkFoamSort_Worst_1K-32         	       6	 195969930 ns/op
BenchmarkBubbleSort_Worst_1K-32       	     415	   2901585 ns/op
BenchmarkStandardSort_Worst_1K-32     	   35242	     33854 ns/op
BenchmarkFoamSort_Random_1K-32        	       6	 197362338 ns/op
BenchmarkBubbleSort_Random_1K-32      	     406	   2949955 ns/op
BenchmarkStandardSort_Random_1K-32    	   10000	    107531 ns/op
PASS
ok  	github.com/stuartmscott/foamsort	16.011s

When using the normal comparison function, Foam Sort was on average ~75x slower than Bubble Sort. Even with the 'expensive' comparison function, Foam Sort was ~2x slower than Bubble Sort in the Best case, and only ~1.1x faster in the Worst and Random cases.


$ go test -bench=. ./...
goos: linux
goarch: amd64
pkg: github.com/stuartmscott/foamsort
cpu: Intel Xeon Processor (Cascadelake)
BenchmarkFoamSort_Best_1K-32          	    1672	    718890 ns/op
BenchmarkBubbleSort_Best_1K-32        	    3628	    335208 ns/op
BenchmarkStandardSort_Best_1K-32      	     441	   2683244 ns/op
BenchmarkFoamSort_Worst_1K-32         	       4	 278975445 ns/op
BenchmarkBubbleSort_Worst_1K-32       	       4	 303722158 ns/op
BenchmarkStandardSort_Worst_1K-32     	     452	   2721583 ns/op
BenchmarkFoamSort_Random_1K-32        	       4	 273567180 ns/op
BenchmarkBubbleSort_Random_1K-32      	       4	 306366861 ns/op
BenchmarkStandardSort_Random_1K-32    	     379	   3126928 ns/op
PASS
ok  	github.com/stuartmscott/foamsort	16.490s

The ability of an algorithm to scale and make use of the available hardware resources depends on many factors. The parallelism of the algorithm is critical - a serial algorithm will never speed up no matter how many cores there are. But the data, and how it is shared by the algorithm, is also important due to the physical constraints of the hardware.

Main memory operates at a much lower frequency than the CPU, so reading and writing main memory is avoided as much as possible by holding copies of the data in layers of increasingly smaller and faster caches. Some of the larger caches may be shared by multiple CPU cores, while the smallest are typically used by a single core. Regardless, the data in these caches needs to be kept consistent across all of them, which is where Cache Coherency comes in.

The caches communicate and coordinate with each other to ensure they hold the correct data, and that changes made by one CPU core are visible to all other cores. In the case of sorting, many pieces of data are compared and swapped with each other, so every core needs access to every piece of data. This means the caches have to work very hard to maintain consistency across the cores, and if one core tries to read a value that another core has just updated, the reader must wait for the caches to synchronize.

In the case of Foam Sort, any potential speed up from executing the comparisons in parallel is dwarfed by the overhead of spawning, scheduling, and coordinating multiple worker threads and keeping the caches consistent while they operate.

In other situations like computer graphics, each operation typically works in isolation on its own subset of the data and so each cache only needs to hold that subset and won't need to coordinate with the other caches as much. This is why you see so much parallelism in GPU design - where a typical CPU has 2-64 cores, a GPU can have hundreds and sometimes even thousands.

While this adventure in parallelism did not yield a new super-fast sorting algorithm, it was fun. I hope you enjoyed it too. If you're interested in exploring the code or learning more about the benchmarks, check out the project on GitHub.
