How to restore the performance of a Hadoop MapReduce job after migrating from Hadoop 1 to Hadoop 2

Hadoop MapReduce job performance (time to execute the job) degraded (5 min -> 15 min) after migrating from Hadoop 1.0.3 to Hadoop 2.8.5.

Details below:

I have a Hadoop MapReduce job executing in an AWS EMR environment.

Hadoop 1.0.3 Environment details:
AMI Version: 2.4.11
Hadoop Version: 1.0.3

Step 1 (the only step) of the EMR job takes 5 minutes to run on a test cluster consisting of 1 master and 1 core node (AWS terminology). In the Hadoop dashboard I have my application consisting of a single job.

  • Number of Mapper tasks in job: 524
  • Number of Reducer tasks in job: 7
  • Machine config (R3.2xlarge: 8 vCPU, 61 GiB RAM, 160 GB SSD)

Hadoop 2.8.5 Environment details:

In the Hadoop 2.8.5 environment, the same MapReduce job takes ~15 minutes to run with all the same configs (1 master, 1 core node).

  • Number of Mapper tasks in job: 524
  • Number of Reducer tasks in job: {3, 7} // tried with both 3 and 7 reducers
  • Machine config (R5.2xlarge: 8 vCPU, 64 GiB RAM, 350 GB EBS)

Config values

  • yarn.scheduler.minimum-allocation-mb = 32

  • yarn.scheduler.maximum-allocation-mb = 57344

  • Other info about the Hadoop 2.8.5 MR job run

    • Elapsed: 15mins, 5sec
    • Diagnostics:
    • Average Map Time 7sec
    • Average Shuffle Time 10mins, 51sec
    • Average Merge Time 0sec
    • Average Reduce Time 0sec
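For reference, the scheduler limits above correspond to the standard Hadoop 2.x yarn-site.xml properties; on EMR they are usually set through a cluster configuration JSON rather than by editing the file. A hand-edited equivalent would be:

```xml
<!-- yarn-site.xml (values from the cluster above) -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>32</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>57344</value>
</property>
```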

What I have tried:
Tweaked the following settings, but the time to execute the job did not change in any scenario. Here are the values from one of the tested scenarios:

  • mapreduce.map.java.opts = -Xmx5734m
  • mapreduce.reduce.java.opts = -Xmx11468m

Below are the different combinations tried:

  • mapreduce.map.memory.mb = {4163, 9163, 7163}
  • mapreduce.reduce.memory.mb = {2584, 6584, 3584}
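Assuming the job driver goes through ToolRunner/GenericOptionsParser (standard for Hadoop jobs; the jar, class, and path names below are placeholders), one such combination can be passed per run without editing site files. Note that as a rule of thumb the `-Xmx` heap is kept at roughly 80% of the corresponding `*.memory.mb` container size, so a heap larger than the container (as in some combinations above) would normally get the container killed by YARN:

```sh
hadoop jar myjob.jar com.example.MyDriver \
  -D mapreduce.map.memory.mb=4163 \
  -D mapreduce.map.java.opts=-Xmx3300m \
  -D mapreduce.reduce.memory.mb=2584 \
  -D mapreduce.reduce.java.opts=-Xmx2000m \
  input/ output/
```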

Since there is a ResourceManager architectural change in Hadoop 2, I experimented around it, but is there anything I may be missing? My proficiency level in Hadoop: beginner.

Answer 1

Score: 0

The issue was the small-files problem in Hadoop MapReduce. In Hadoop 1.0.3, this problem was being masked by JVM container reuse (mapred.job.reuse.jvm.num.tasks).

However, in Hadoop 2, reusing JVM containers is not allowed. Uber mode is also not viable, as it runs all the map tasks sequentially in the ApplicationMaster container.
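For completeness, uber mode (ruled out above because it serializes the map tasks in the AM) is controlled by the standard mapred-site.xml properties below; the 524-map job would also far exceed the default maxmaps threshold:

```xml
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value> <!-- default: jobs with more maps are never uberized -->
</property>
```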

Using CombineTextInputFormat.setMaxInputSplitSize(job, bytes) solved the small-files problem, as it creates logical splits based on the byte limit provided as an argument.
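The effect on task count can be illustrated with a stdlib-only sketch of the packing that CombineFileInputFormat performs (simplified: real split placement also considers block and rack locality):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitPacking {
    // Greedily pack file sizes (bytes) into combined logical splits of at
    // most maxSplitSize bytes each -- a simplified model of how
    // CombineFileInputFormat groups many small files into few splits.
    static List<Long> combineSplits(long[] fileSizes, long maxSplitSize) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        for (long size : fileSizes) {
            if (current > 0 && current + size > maxSplitSize) {
                splits.add(current); // close the current split
                current = 0;
            }
            current += size;
        }
        if (current > 0) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        // 524 small files of 1 MiB each, packed into 128 MiB logical splits:
        long[] files = new long[524];
        Arrays.fill(files, 1L << 20);                          // 1 MiB per file
        List<Long> splits = combineSplits(files, 128L << 20);  // 128 MiB cap
        // 524 map tasks collapse to ceil(524/128) = 5 splits.
        System.out.println(splits.size()); // prints 5
    }
}
```

With one map task per file, 524 JVMs are started for 7 seconds of average map work each, which is where the Hadoop 1 JVM-reuse advantage came from; packing restores a comparable per-task workload.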

https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html
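A minimal driver sketch of the fix (requires the Hadoop 2.x client libraries on the classpath; the class name, paths, and the 128 MiB limit are illustrative, not from the original post):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // One map task per ~128 MiB of input instead of one per small file:
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```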

Posted by huangapple on 2020-04-04 21:49:31.
Original link (please retain when reposting): https://java.coder-hub.com/61029085.html