Best way to parallelize a non-IO task using many physical cores: Equal work per thread, or maximum number of threads?

huangapple 未分类评论67阅读模式
标题翻译

Best way to parallelize a non-IO task using many physical cores: Equal work per thread, or maximum number of threads?

问题

我们有多种任务,可以分为相等工作单元并且不涉及磁盘或网络I/O。

假设我们要处理70张图像。我们可以独立地处理每张图像。过去我们会这样做:

  • 工作单元数量:70
  • 核心数量:16
  • 每个线程的工作量 = ceil(70/16) = 5
  • 线程数量 = 70 / 5 = 14个线程。

你会注意到这个算法在如此高的核心数量机器上似乎“浪费”了两个核心。所以有人提出了可能会更好的方案:

  • 线程数量 = 核心数量 = 16
  • 每个线程的工作量 = 5(对于6个线程)或者4(对于10个线程)

经过进一步的分析,这似乎可能没有好处:

  1. 如果任务受限于CPU,运行所有内容的墙钟时间仅取决于具有最多工作的线程,仍然是5。
  2. 如果任务受限于内存,添加更多线程只会增加争用。更糟糕的是,最终只会剩下5个线程在运行,这可能无法再饱和内存总线。
  3. 算法2不会为操作系统或后台工作留下任何空余核心。

请注意,我们不想为每个图像分配一个任务对象。虽然在这里无关紧要,但在其他情况下,我们可能有很多图像,不想为每个图像都支付分配一个对象的费用。在这两种算法中,我们只需要为每个线程分配一个捕获两个整数的可运行对象。我们在固定大小的ThreadPoolExecutor中执行可运行对象的实际执行。

不过我想知道在这种情况下其他实现,如ForkJoinPool或Grand Central Dispatch会做什么。

#问题:#

  • 有理由偏好第二种算法吗?它们是什么?

  • 是否有共识认为其中一种“平均而言”比另一种更好?为什么?

我们正在使用Java进行编程,但我认为这个问题适用于任何具有共享内存线程的编程语言。

英文翻译

We have a variety of tasks which are separable into units of equal work and do not involve disk or network I/O.

Imagine we're trying to process 70 images. We can process each image independently. In the past we'd do that like this:

  • Number of work units: 70
  • Number of cores: 16
  • Work per thread = ceil(70/16) = 5
  • Number of threads = 70 / 5 = 14 threads.

You'll notice that this algorithm appears to "waste" two cores on such a high core count machine. So it was proposed that this might be better:

  • Number of threads = Number of cores = 16
  • Work per thread = 5 (for 6 threads) or 4 (for 10 threads)

Upon further analysis, this seems like it might not be beneficial:

  1. If the tasks are CPU bound, the wall-time to run everything only
    depends on the thread with the most work, which is still 5.
  2. If the tasks are memory bound, adding more threads will just increase contention. Worse, we'll be left with only 5 threads running at the end, which might not be able to saturate the memory bus anymore.
  3. Algorithm 2 doesn't leave any spare cores for OS or background work.

Note that we do not want to allocate a task object for each image. It doesn't matter here, but in other cases we may have many images and don't want to have to pay to allocate an object for every one. In either of these two algorithms, we only have to allocate a Runnable that captures two integers per thread. We perform the actual execution of the Runnable in a fixed-size ThreadPoolExecutor.

I do wonder what other implementations like ForkJoinPool or Grand Central Dispatch might do in this case, though.

Questions:

  • Are there reasons to prefer the second algorithm over the first? What are they?

  • Is there a consensus that one is better than the other "on average"? Why?

We're programming in Java but I believe this question applies to any language with shared memory threads.

huangapple
  • 本文由 发表于 2020年3月4日 07:29:32
  • 转载请务必保留本文链接:https://java.coder-hub.com/60517017.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定