How to restore the performance of a Hadoop MapReduce job after migrating from Hadoop 1 to Hadoop 2

Hadoop MapReduce job performance (time to execute the job) degraded (5 min -> 15 min) after migrating from Hadoop 1.0.3 to Hadoop 2.8.5.

Details below:

I have a Hadoop MapReduce job executing in an AWS EMR environment.

Hadoop 1.0.3 Environment details:
AMI Version: 2.4.11
Hadoop Version: 1.0.3

Step 1 (the only step) of the EMR job takes 5 minutes to run on a test cluster consisting of 1 master and 1 core node (AWS terminology). In the Hadoop dashboard I have my application consisting of a single job.

  • Number of Mapper tasks in job: 524
  • Number of Reducer tasks in job: 7
  • Machine config (R3.2xlarge: 8 vCPU, 61 GiB RAM, 160 GB SSD)

Hadoop 2.8.5 Environment details:

In the Hadoop 2.8.5 environment, the same MapReduce job takes ~15 minutes to run with all the same configs (1 master, 1 core node).

  • Number of Mapper tasks in job: 524
  • Number of Reducer tasks in job: {3, 7} // tried with both 3 and 7 reducers
  • Machine config (R5.2xlarge: 8 vCPU, 64 GiB RAM, 350 GB EBS)

Config values

  • yarn.scheduler.minimum-allocation-mb = 32

  • yarn.scheduler.maximum-allocation-mb = 57344

  • Other info about the Hadoop 2.8.5 MR job run

    • Elapsed: 15mins, 5sec
    • Diagnostics:
    • Average Map Time 7sec
    • Average Shuffle Time 10mins, 51sec
    • Average Merge Time 0sec
    • Average Reduce Time 0sec
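For reference, the scheduler limits above correspond to the standard Hadoop 2.x yarn-site.xml properties; on EMR they are usually set through a cluster configuration JSON rather than by editing the file. A hand-edited equivalent would be:

```xml
<!-- yarn-site.xml (values from the cluster above) -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>32</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>57344</value>
</property>
```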

What I have tried:
Tweaked the following settings, but the time to execute the job did not change in any scenario. Here are the values from one of the tested scenarios:

  • mapreduce.map.java.opts = -Xmx5734m
  • mapreduce.reduce.java.opts = -Xmx11468m

Below are the different combinations tried:

  • mapreduce.map.memory.mb = {4163, 9163, 7163}
  • mapreduce.reduce.memory.mb = {2584, 6584, 3584}
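Assuming the job driver goes through ToolRunner/GenericOptionsParser (standard for Hadoop jobs; the jar, class, and path names below are placeholders), one such combination can be passed per run without editing site files. Note that as a rule of thumb the `-Xmx` heap is kept at roughly 80% of the corresponding `*.memory.mb` container size, so a heap larger than the container (as in some combinations above) would normally get the container killed by YARN:

```sh
hadoop jar myjob.jar com.example.MyDriver \
  -D mapreduce.map.memory.mb=4163 \
  -D mapreduce.map.java.opts=-Xmx3300m \
  -D mapreduce.reduce.memory.mb=2584 \
  -D mapreduce.reduce.java.opts=-Xmx2000m \
  input/ output/
```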

Since there is a ResourceManager architectural change in Hadoop 2, I experimented around it, but is there anything I may be missing? My proficiency level in Hadoop: beginner.

Answer 1

Score: 0

The issue was the small-files problem in Hadoop MapReduce. In Hadoop 1.0.3, this problem was being masked by JVM container reuse (mapred.job.reuse.jvm.num.tasks).

However, in Hadoop 2, reusing JVM containers is not allowed. Uber mode is also not viable, as it runs all the map tasks sequentially in the ApplicationMaster container.
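For completeness, uber mode (ruled out above because it serializes the map tasks in the AM) is controlled by the standard mapred-site.xml properties below; the 524-map job would also far exceed the default maxmaps threshold:

```xml
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value> <!-- default: jobs with more maps are never uberized -->
</property>
```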

Using CombineTextInputFormat.setMaxInputSplitSize(job, bytes) solved the small-files problem, as it creates logical splits based on the byte limit provided as an argument.
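The effect on task count can be illustrated with a stdlib-only sketch of the packing that CombineFileInputFormat performs (simplified: real split placement also considers block and rack locality):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitPacking {
    // Greedily pack file sizes (bytes) into combined logical splits of at
    // most maxSplitSize bytes each -- a simplified model of how
    // CombineFileInputFormat groups many small files into few splits.
    static List<Long> combineSplits(long[] fileSizes, long maxSplitSize) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        for (long size : fileSizes) {
            if (current > 0 && current + size > maxSplitSize) {
                splits.add(current); // close the current split
                current = 0;
            }
            current += size;
        }
        if (current > 0) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        // 524 small files of 1 MiB each, packed into 128 MiB logical splits:
        long[] files = new long[524];
        Arrays.fill(files, 1L << 20);                          // 1 MiB per file
        List<Long> splits = combineSplits(files, 128L << 20);  // 128 MiB cap
        // 524 map tasks collapse to ceil(524/128) = 5 splits.
        System.out.println(splits.size()); // prints 5
    }
}
```

With one map task per file, 524 JVMs are started for 7 seconds of average map work each, which is where the Hadoop 1 JVM-reuse advantage came from; packing restores a comparable per-task workload.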

https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html
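A minimal driver sketch of the fix (requires the Hadoop 2.x client libraries on the classpath; the class name, paths, and the 128 MiB limit are illustrative, not from the original post):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // One map task per ~128 MiB of input instead of one per small file:
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```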

Posted by huangapple on 2020-04-04 21:49:31.
Original link (please retain when reposting): https://java.coder-hub.com/61029085.html