设计 / 图形架构高级初学者中级开发 C++Linux

如何使用 Intel® MPI 库实现并行“稳定”排序并将其部署到多节点计算集群。

Intel

5.00/5 (1投票)

2020年5月8日

CPOL

13分钟阅读

5109

在本文中，我将基于之前讨论过的并行“稳定”排序，重点介绍使用 Intel® MPI 库交付现代并行代码的几个方面，该库可以提供更优的性能加速和效率。

下载源代码 - 140 KB

引言

在本文中，我将基于之前讨论过的并行“稳定”排序，重点介绍使用 Intel® MPI 库交付现代并行代码的几个方面，该库可以提供更优的性能加速和效率。

具体来说，我将讨论如何使用 Intel® MPI 库来扩展并行“稳定”排序工作负载的执行，使其跨越一组并行运行的并发进程，从而极大地提高排序的整体性能。

本文的读者将了解到一种算法，该算法可以将整个排序过程在多个并行启动的进程之间进行工作分配，这些进程可以在单个计算节点内（例如，节点内）运行，也可以跨越一个由多个节点组成的计算集群运行，每个节点都消耗其自己的 CPU 和系统内存资源。以下算法主要基于结合著名的快速排序和归并排序算法的思想，对被分割成子数组的大数据执行并行排序，每个进程独立地对子数组进行排序，然后将这些已排序的子数组合并成一个有序的数据数组。

除了解释具体算法外，我还将讨论使用 Intel® MPI 库在实际并行排序大数据时提供进程间互连和集体通信的方面。具体来说，我将深入解释如何使用 MPI 库的 MPI_Bcast、MPI_Scatter 和 MPI_Gather 例程在运行的进程之间发送特定数据，以及执行各种数据操作任务，例如将数据拆分成由每个进程并行处理的块。

稍后，在本文中，我将讨论如何使用 John McCalpin (Blackbelt) 在他的博客中提出并深入讨论的方法来评估执行 MPI 程序并行多节点排序的执行时间。在此，我将解释如何收集执行的挂钟时间数据，并使用 MPI 的单边通信 (RMA) 在进程之间共享这些数据，以估算程序的总执行时间。

最后，我将讨论如何在 Intel® DevCloud 上构建和运行以下 MPI 程序，以及如何将其性能加速与传统 std::sort(…) 和 qsort(…) 函数的性能进行比较。

跨多个 MPI 进程扩展的并行“稳定”排序

执行跨多个并行进程扩展的并行“稳定”排序的整个想法可以表述为以下算法：

生成一个包含 N 个待排序对象的数组
将整个数组分割成 k 个对象子集
并行排序每个对象子集
将所有 k 个子集合并成一个有序的对象数组

根据上述算法，由秩为 0 的根进程生成一个对象数组。之后，根进程将整个数组分割成大小相等的子数组。然后，根进程（秩 0）将所有子数组发送给一组并行启动的进程。反过来，组中的每个进程接收其自己的子数组，并执行并行“稳定”排序。由于排序是由一组进程完成的，每个进程将有序的子数组发送回根进程，根进程最终将所有子数组合并成一个已排序的对象数组。在这种情况下，子数组的数量等于正在运行的进程数，显然，每个子数组中的对象数量可以评估为整个数组的大小除以正在排序的实际子数组的数量。

为了合并所有子数组，我们将使用一个与著名的归并排序非常相似的算法。与经典归并排序不同，以下算法将用于合并多个（例如，两个以上）子数组到一个完整的对象数组中。

以下算法可以表述为：

1.	Let A[0..N] – an array of objects, 
        S[0..N] – a resultant array of sorted objects, N – the overall number of objects;
2.	Initialize the variable _First as the index of the first object in the array S;
3.	Initialize the variable _Last as the index of the last object in the array S;
4.	Initialize the variable i as the index of the second sub-array (i = 1);
5.	Copy the first sub-array (i = 0) to the resultant array S;
6.	For each i-th sub-array of objects in A[0..N], do the following:
        1. Compute the indices {xs, xe} of the first 
           and last objects in the i-th sub-array;
        2. Compute the size of the merged array S: L = (xe - xs) + (_Last - _First);
        3. Merge the arrays A[xs..xe] and S[_First.._Last], assigning the result to 
            the resultant array S of the new size L, as follows: Merge(A[xs..xe], 
            S[_First.._Last]) -> S;
        4. Update the index variable _Last with the value of the merged array size L 
           (_Last = L);

根据上述算法，我们首先需要用第一个子数组 A[0..N] 初始化已排序对象的 S 结果数组。之后，我们还必须初始化两个索引变量，分别称为 _First 和 _Last。在这种情况下，变量 _First 被赋值为结果数组 S 中第一个对象的索引（_First = 0），变量 _Last 被赋值为数组 S 中最后一个对象的索引。通常，最后一个对象的索引等于第一个子数组的大小（例如，_Last = chunk_size）。由于结果数组 S 和特定索引已初始化，我们执行一个循环，在每次迭代中，该循环维护当前子数组 A[xs..xe] 中第一个和最后一个对象的索引 xs 和 xe。此外，它计算结果数组的新大小 L，作为第 i 个子数组的大小与结果数组 S 的大小之和。最后，它将这两个子数组 A[xs..xe] 和 S[_First..Last] 合并，并将结果累积到目标数组 S 中，以便新大小为 L 的数组 S 将包含这两个子数组中所有对象的有序序列。正如您可能已经注意到的，在合并每个子数组和结果数组 S 的过程中，数组 S 的总大小正在增加。在每次循环迭代中，我们需要重新计算结果数组 S 中最后一个对象的索引，该索引等于先前获得的新的总体大小 L 的值（_Last = L）。此过程将继续进行，直到 A[0..N] 中的所有子数组都合并到已排序对象的最终结果数组中。

不幸的是，由于数据流依赖性问题，此过程无法并行运行。通常，合并过程的总开销微不足道，因为它基本上只用于合并数量非常少的子数组（等于正在运行的进程数）。这反过来又不会影响排序本身。合并多个子数组的过程还需要额外的 O(N) 空间，其中 N 是被排序数组中的总对象数。合并过程由秩为 0 的根进程完全执行，前提是所有子数组都已由组中的每个进程独立排序。

下面的代码实现了正在讨论的子数组合并算法。

		sorted = (gen::ITEM*)std::malloc(size * sizeof(gen::ITEM));
		std::memcpy((void*)sorted, (const void*)array_target, sizeof(gen::ITEM) * chunk_size);

		std::size_t _First = 0, _Last = chunk_size;
		for (int rank = 1; rank < world_size; rank++)
		{
			std::size_t xs = rank * (size / world_size);
			std::size_t xe = (rank + 1) * (size / world_size);

			std::size_t merged_size = (xe - xs) + (_Last - _First);

			gen::ITEM* merged = \
				(gen::ITEM*)std::malloc(sizeof(gen::ITEM) * merged_size);

			parallel_sort_mpi_impl::merge(sorted, array_target, \
				merged, _First, _Last - 1, xs, xe - 1,
				[&](const gen::ITEM& item1, const gen::ITEM& item2) {
					return (item1.key < item2.key); },
				[&](const gen::ITEM& item1, const gen::ITEM& item2) {
						return (item1.key == item2.key) && (item1.value < item2.value); });

			std::memcpy((void*)sorted, (const void*)merged, \
				sizeof(gen::ITEM) * merged_size);

			std::free(merged);

			_Last = merged_size;
		}

作为上述算法的一部分，我们将使用另一种算法来合并两个子数组到迄今为止的结果对象数组中。在这种情况下，我们将实际使用标准的经典归并排序算法来实现此目的。

以下算法可以表述为：

1.	Let A1[0..N], A2[0..M] – the first and second sub-arrays to be merged, 
        M[0..N+M] – the resultant array of objects;
2.	Initialize the variable pos as the index of the first object in the resultant array M;
3.	Initialize the variable i as the index of the first object in the sub-array A1;
4.	Initialize the variable j as the index of the first object in the sub-array A2;
5.	Compute the overall size L of the two sub-arrays being merged (L = N + M);
6.	During each k-th iteration of the loop k = [0..L], do the following:
        1. If A1[i].key < A2[j].key or A1[i].value < A2[j].value, assign the A1[i] 
            object at current position of the resultant array M (M[pos] = A1[i]) 
            and increment the indices pos and i by 1;
        2. If A1[i].key >= A2[j].key or A1[i].value >= A2[j].value, assign the A2[j] 
            object at the current position of the resultant array M (M[pos] = A2[j]) 
            and increment the indices pos and j by 1;
7.	Append all objects in A1[i..N] to the resultant array M;
8.	Append all objects in A2[j..M] to the resultant array M;

要合并两个数组，我们需要初始化目标数组 M 中第一个对象的索引（pos = 0），以及子数组 A1 和 A2 中第一个对象的索引（i = 0）和（j = 0）。此外，我们必须计算结果数组 M 的大小 L，即子数组 A1 和 A2 的大小之和。然后，它执行正好 k = [0..L] 次循环迭代，在每次迭代中，它都会评估一个条件，将子数组 A1 中的第 i 个对象的键与子数组 A2 中的第 j 个对象的键进行比较（A1[i].key < A2[j].key）。类似地，它还比较这些对象的值，例如（A1[i].value < A2[j].value）。如果 A1[i] 对象的键小于 A2[j] 对象的键且条件为“true”，则将 A1[i] 对象分配到结果数组 M 中的 pos 索引处（M[pos] = A[i]）。否则，它会将 A2[j] 对象分配到数组 M 中的 pos 索引处，而不是（M[pos] = A[j]）。最后，在循环执行结束时，它将 A1 中所有剩余的对象附加到结果数组 M。然后，它对 A2 中所有剩余的对象执行完全相同的操作，将它们附加到结果数组 M，以便在合并过程结束时，我们将获得一个合并后的数组，其中包含来自子数组 A1 和 A2 的已排序对象，分别。

下面列出了实现归并排序算法的 C++11 代码。

	template<class _CompKeys, class _CompVals>
	void merge(const gen::ITEM* array1, const gen::ITEM* array2, \
		gen::ITEM*& merged, std::size_t _First1, std::size_t _Last1, \
		std::size_t _First2, std::size_t _Last2, _CompKeys comp_keys, _CompVals comp_vals)
	{
		if ((_Last1 <= _First1) ||
			(_Last2 <= _First2)) return;

		std::size_t pos = 0L;
		std::size_t i = _First1, j = _First2;

		std::size_t merged_size = \
			((_Last1 - _First1) + 1) + ((_Last2 - _First2) + 1);

		for (std::size_t index = 0; index < merged_size; index++)
		{
			if ((i <= _Last1) &&
				((comp_keys(array1[i], array2[j])) ||
				(comp_vals(array1[i], array2[j])))) {
				merged[pos++] = array1[i++];
			}
			else if ((j <= _Last2) &&
				((!comp_keys(array1[i], array2[j])) &&
				(!comp_vals(array1[i], array2[j])))) {
				merged[pos++] = array2[j++];
			}
		}

		while (i <= _Last1)
			merged[pos++] = array1[i++];
		while (j <= _Last2)
			merged[pos++] = array2[j++];
	}

与之前讨论的算法一样，合并两个子数组的过程无法并行运行，因为每次循环迭代的结果在很大程度上取决于所有先前迭代结果，这是一个数据流依赖性问题。此外，以下算法需要额外的 O(N+M) 空间来将合并后的数组存储在单独的临时缓冲区中。

使用 Intel® MPI 库实现并行“稳定”排序

任何 MPI 程序的执行都始于 MPI 执行环境的初始化。这通常通过调用特定的 MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided) 例程来完成，该例程由每个启动的进程集体执行。此例程实际上将每个正在运行的进程添加到 MPI 执行上下文中。此路由接受参数数量，如指向向量参数的指针及其计数，以及提供的线程支持级别。

	// Initialize the MPI environment
	MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

此外，此例程还会执行几项后台任务，例如初始化 MPI 通信器，以便每个正在运行的进程能够从组中的其他进程发送和接收消息，从而在全球范围内共享正在并行启动的所有进程中的数据。

执行初始化后，每个进程调用 MPI_Comm_size(MPI_COMM_WORLD, &world_size) 和 MPI_Comm_rank(MPI_COMM_WORLD, &world_rank) 例程来获取正在运行的进程数以及每个进程的秩参数值。

	// Get the number of processes
	int world_size;
	MPI_Comm_size(MPI_COMM_WORLD, &world_size);

	// Get the rank of the process
	int world_rank;
	MPI_Comm_rank(MPI_COMM_WORLD, &world_rank)

接下来，每个进程集体分配一个内存缓冲区来存储待排序对象的数组，使用 MPI_Alloc_mem(size * sizeof(gen::ITEM), MPI_INFO_NULL, &array) 例程。此外，它创建两个 MPI 窗口用于运行进程之间的单边通信（RMA）。具体来说，这些窗口用于远程访问每个进程访问时间戳数组。在这种情况下，这些数组仅用于存储每个正在运行的进程的执行挂钟时间开始和结束的时间戳。下面将讨论一种估算 MPI 程序执行时间的方法。

	gen::ITEM* array = nullptr;
	std::double_t* p_timestamps_start = nullptr;
	std::double_t* p_timestamps_finish = nullptr;

	MPI_Alloc_mem(size * sizeof(gen::ITEM), MPI_INFO_NULL, &array);

	MPI_Alloc_mem(world_size * sizeof(std::double_t), MPI_INFO_NULL, &p_timestamps_start);
	MPI_Alloc_mem(world_size * sizeof(std::double_t), MPI_INFO_NULL, &p_timestamps_finish);

	MPI_Win_create(p_timestamps_start, world_size * sizeof(std::double_t), 
    sizeof(std::double_t), MPI_INFO_NULL, MPI_COMM_WORLD, &win_ts1);
	MPI_Win_create(p_timestamps_finish, world_size * sizeof(std::double_t), 
    sizeof(std::double_t), MPI_INFO_NULL, MPI_COMM_WORLD, &win_ts2);

	std::memset((void*)p_timestamps_start, 0x00, sizeof(std::double_t) * world_size);
	std::memset((void*)p_timestamps_finish, 0x00, sizeof(std::double_t) * world_size);

分配特定缓冲区后，秩为 0 的根进程生成一个待排序的对象数组。

	if (world_rank == 0) {

		std::string logo = "\nParallel Stable Sort v.3.00 (MPI Multi-Node Cluster) 
                            by Arthur V. Ratz";
		std::cout << logo << "\n\n";

		std::cout << "Device : " << 
        device_queue.get_device().get_info<info::device::name>() << std::endl << std::endl;

		std::cout << "Generating an array of objects... (size = " << size << ")\n";

		gen::generate_objects(array, size);

		std::vector<gen::ITEM> array_copy;
		array_copy.assign(array, array + size);

		std::cout << "sorting an array... (sort type: array of objects)\n";

		// Obtain the value of walltime prior to performing the sequential sorting
		time_s = MPI_Wtime();

		// Perform the sequental sort by using std::sort function
		internal::sequential_stable_sort(array_copy.begin(), array_copy.end(),
			[&](const gen::ITEM& item1, const gen::ITEM& item2) 
                                        { return item1.key < item2.key;  },
			[&](const gen::ITEM& item1, const gen::ITEM& item2) 
                                        { return item1.value < item2.value;  });

		// Obtain the value of walltime after to performing the sequential sorting
		time_f = MPI_Wtime();

		// Compute the overall execution walltime
		std_sort_time_elapsed = time_f - time_s;

		array_copy.clear();

		std::cout << std::setiosflags(std::ios::fixed) << std::setprecision(4)
			<< "execution time (std::sort): " << std_sort_time_elapsed << " ms ";
	}

由于待排序的对象数组已生成，根进程通过调用 MPI_Bcast(&size, 1, MPI_UNSIGNED_LONG_LONG, 0, MPI_COMM_WORLD); 将变量 size 的值广播给所有其他进程。这基本上是为了确定每个进程排序的子数组的大小。然后，每个进程分配一个本地缓冲区来存储要排序的子数组。然后，进程集体调用 MPI_Scatter(&array[0], chunk_size, dt_items, &array_chunk[0], chunk_size, dt_items, 0, MPI_COMM_WORLD)，该函数将根进程生成的整个对象数组分割开，并将每个子数组分散到并行运行的特定进程。

最后，每个进程调用 internal::parallel_stable_sort(…) 函数来执行其自身子数组的排序。由于每个子数组的排序都是并行执行的，我们必须提供一个隐式屏障来同步组中所有进程的执行。这通常通过调用 MPI_Barrier(MPI_COMM_WORLD) 例程来完成，该例程会阻止所有进程，直到每个进程完成排序过程并到达屏障。

由于每个子数组已由特定进程排序，每个进程集体调用 MPI_Gather(&array_chunk[0], chunk_size, dt_items, &array_target[0], chunk_size, dt_items, 0, MPI_COMM_WORLD) 将所有独立排序的子数组收集到一个对象数组中，并将它们发送回秩为 0 的根进程。

	MPI_Bcast(&size, 1, MPI_UNSIGNED_LONG_LONG, 0, MPI_COMM_WORLD);

	std::size_t chunk_size = size / world_size;

	gen::ITEM* array_chunk = \
		(gen::ITEM*)std::malloc(chunk_size * sizeof(gen::ITEM));

	MPI_Scatter(&array[0], chunk_size, dt_items, &array_chunk[0],
		chunk_size, dt_items, 0, MPI_COMM_WORLD);

	std::vector<gen::ITEM> chunk_v;
	chunk_v.assign(array_chunk, array_chunk + chunk_size);

	p_timestamps_start[world_rank] = MPI_Wtime();

	internal::parallel_stable_sort(chunk_v, device_queue,
		[&](const gen::ITEM& item1, const gen::ITEM& item2) 
                                    { return item1.key < item2.key;  },
		[&](const gen::ITEM& item1, const gen::ITEM& item2) 
                                    { return item1.value < item2.value;  });

	p_timestamps_finish[world_rank] = MPI_Wtime();
	
	std::memcpy((void*)array_chunk, (const void*)& chunk_v[0], \
		sizeof(gen::ITEM) * chunk_size);

	MPI_Barrier(MPI_COMM_WORLD);

	gen::ITEM* array_target = nullptr;
	if (world_rank == 0)
	{
		array_target = \
			 (gen::ITEM*)std::malloc(size * sizeof(gen::ITEM));
	}

	MPI_Gather(&array_chunk[0], chunk_size, dt_items, \
		&array_target[0], chunk_size, dt_items, 0, MPI_COMM_WORLD);

在此代码中，每个并行进程调用 internal::parallel_stable_sort(...) 函数以并行方式对其自身对象子数组进行排序。这反过来又调用了嵌套并行。此参考还提到了并行快速排序或归并排序作为具有嵌套并行性的例程。在这种情况下，实际排序是由并行执行的一组进程在外部级别完成的。同时，每个进程在内部级别并行地执行特定子数组的排序。使用和实现嵌套并行性概念可以极大地提高排序本身的并行性能。

在此 MPI 程序执行结束时，秩为 0 的根进程通过执行下面列出的以下代码片段，将所有子数组合并成一个已排序对象的完整数组。

	if (world_rank == 0)
	{
		time_s = MPI_Wtime();

		sorted = (gen::ITEM*)std::malloc(size * sizeof(gen::ITEM));
		std::memcpy((void*)sorted, (const void*)array_target, 
                                    sizeof(gen::ITEM) * chunk_size);

		std::size_t _First = 0, _Last = chunk_size;
		for (int rank = 1; rank < world_size; rank++)
		{
			std::size_t xs = rank * (size / world_size);
			std::size_t xe = (rank + 1) * (size / world_size);

			std::size_t merged_size = (xe - xs) + (_Last - _First);

			gen::ITEM* merged = \
				(gen::ITEM*)std::malloc(sizeof(gen::ITEM) * merged_size);

			parallel_sort_mpi_impl::merge(sorted, array_target, \
				merged, _First, _Last - 1, xs, xe - 1,
				[&](const gen::ITEM& item1, const gen::ITEM& item2) {
					return (item1.key < item2.key); },
				[&](const gen::ITEM& item1, const gen::ITEM& item2) {
						return (item1.key == item2.key) && 
                               (item1.value < item2.value); });

			std::memcpy((void*)sorted, (const void*)merged, \
				sizeof(gen::ITEM) * merged_size);

			std::free(merged);

			_Last = merged_size;
		}

		time_f = MPI_Wtime();
		merge_time_diff = time_f - time_s;
	}

估算 MPI 程序执行时间

为了估算 MPI 程序的执行时间，我们将使用 John McCalpin (Intel Blackbelt) 最初提出的方法，根据该方法，我们分配两个大小等于正在运行的进程数的缓冲区。第一个缓冲区将用于存储组中每个进程的执行挂钟时间，第二个缓冲区用于存储执行结束挂钟时间。在每个进程执行期间，它会调用 MPI_Wtime() 例程以在执行执行实际并行排序的代码之前和之后获取当前的挂钟时间值。然后，它将这些值分别存储到第一个和第二个缓冲区中，其位置等于进程秩。

p_timestamps_start[world_rank] = MPI_Wtime();
// Sort the sub-array of objects
p_timestamps_finish[world_rank] = MPI_Wtime();

请注意，这两个缓冲区通过 MPI 的单边通信 (RMA) 在所有进程之间共享，使用先前初始化的特定 MPI 窗口。在 p_timestamps_start 和 p_timestamps_finish 缓冲区中的相应位置分配开始和结束挂钟执行时间值后，我们会更新特定的 MPI 窗口，以便存储在这些缓冲区中的数据可供所有进程访问，包括秩为 0 的根进程。

 MPI_Win_lock_all(0, win_ts1);
for (int rank = 0; rank < world_size; rank++)
  MPI_Put(&p_timestamps_start[world_rank], 1, MPI_DOUBLE, rank, 
          world_rank, 1, MPI_DOUBLE, win_ts1);
MPI_Win_flush_local(world_rank, win_ts1);
MPI_Win_unlock_all(win_ts1);

 MPI_Win_lock_all(0, win_ts2);
for (int rank = 0; rank < world_size; rank++)
  MPI_Put(&p_timestamps_finish[world_rank], 1, MPI_DOUBLE, rank, 
          world_rank, 1, MPI_DOUBLE, win_ts2);
MPI_Win_flush_local(world_rank, win_ts2);
MPI_Win_unlock_all(win_ts2);

在 MPI 程序执行结束时，根进程使用以下算法计算总执行时间。它通常在 p_timestamps_start 缓冲区中存储的所有进程的开始执行时间的最小值。类似地，它在 p_timestamps_finish 缓冲区中存储的结束执行时间的最大值。最后，它计算这些最小开始和最大结束挂钟时间值的跨度，这就是 MPI 程序的总执行时间。

	if (world_rank == 0)
	{
		std::vector<std::double_t> pv_timestamps_start;
		pv_timestamps_start.assign(p_timestamps_start, p_timestamps_start + world_size);

		std::vector<std::double_t> pv_timestamps_finish;
		pv_timestamps_finish.assign(p_timestamps_finish, p_timestamps_finish + world_size);

		std::double_t mpi_time_start = \
			* std::min_element(dpstd::execution::par_unseq,
				pv_timestamps_start.begin(), pv_timestamps_start.end());
		std::double_t mpi_time_finish = \
			* std::max_element(dpstd::execution::par_unseq,
				pv_timestamps_finish.begin(), pv_timestamps_finish.end());

		std::double_t mpi_time_diff = mpi_time_finish - mpi_time_start;
		mpi_sort_time_elapsed = mpi_time_diff + merge_time_diff;
	}

在 Intel® DevCloud 中构建和运行示例 MPI 程序

要构建和运行示例 MPI 程序，我们需要从本文页面底部提供的链接下载项目存档。然后，我们必须使用 Jupyter Notebook* 将存档上传到 Intel® DevCloud，并使用 Linux* 终端中的以下命令提取其内容。

tar -xvf parallel_stable_sort_mpi.tar.gz

之后，我们必须使用以下 bash 脚本来构建和运行项目。

./build_run.sh

或者，我们可以使用以下命令手动构建项目。

mpiicpc -o parallel_stable_sort_mpi  -cxx=dpcpp -fsycl -lOpenCL 
        -lsycl -ltbb -lmpi parallel_stable_sort_mpi.cpp

最后，我们可以使用下面列出的命令运行 MPI 程序可执行文件。

mpirun -np 4 ./parallel_stable_sort_mpi

此命令通过启动 np = 4 个并行进程在单个节点内运行 MPI 程序。

此外，无需构建项目，因为下载的存档包含一个预先构建的 MPI 程序可执行文件，可以使用上述命令轻松运行。

评估性能

我在 Intel DevCloud 多节点集群中测试了并行排序 MPI 程序的性能。根据性能评估结果，本文讨论的 MPI 程序执行的并行排序比传统的 std::sort(...) 函数的性能要快得多。以下是本文讨论的 MPI 程序执行的并行排序性能的一些事实：

对象大小 >= 10⁷	快 2x-8x 倍
对象大小 >= 10⁹	快 8x-11x 倍

然而，在多于一个计算节点（例如，多节点集群）上通过一组启动的进程执行并行排序可能会导致显著的性能下降，因为实际通信 fabric 的带宽，如 OpenFabrics Interfaces* (OFI)-capable 网络、Intel® True Scale Fabric、Intel® Omni-Path Architecture、InfiniBand* 和 Ethernet（通过 OFI API），比单个计算节点内的系统内存带宽慢很多倍。

事实上，这就是为什么强烈建议使用现代 Intel® 10gbE 网络 fabric 在多节点计算集群中执行大数据排序的原因。

历史

2020 年 5 月 8 日 - 本文的最终修订版已发布。