Question

I have a few (5-10) config files in S3, each under 5 KB. These files can be read either with the AWS S3 SDK or through Spark as RDDs. With the RDD approach, 10 files means 10 RDD objects are created, and collect() is called on each one to turn its contents into a list.

Since an RDD is distributed, is it advisable to read these files with the aws-s3 Java SDK instead of an RDD?

Answer 1

Score: 0

You should always prefer passing the config files to the Spark driver and then reading them there, either with Python's built-in open() or with Java if you are using AWS Glue.
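As a rough sketch of the open() route: assume the files have already reached the driver's local disk (for example through spark-submit's --files option) and that they contain JSON, which the question does not state. The name load_local_configs is made up for illustration.

```python
import json
from pathlib import Path


def load_local_configs(paths):
    """Read small JSON config files on the driver.

    Returns a dict mapping each file name to its parsed content.
    Assumption: the configs are JSON; swap the parser for other formats.
    """
    configs = {}
    for path in paths:
        # Plain file I/O on the driver: no RDD is created and no Spark job runs.
        with open(path, encoding="utf-8") as handle:
            configs[Path(path).name] = json.load(handle)
    return configs
```

The result is an ordinary dict on the driver, so there is nothing to collect() afterwards.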

If you are using EMR or your native cluster, you can use boto3 to read the file and then either pass it to the driver or process it as needed.
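A minimal sketch of the boto3 route, under the same assumption that the configs are JSON. The function name read_s3_configs, the bucket, and the keys are hypothetical; the client is passed in so the function can be exercised without AWS access.

```python
import json


def read_s3_configs(s3_client, bucket, keys):
    """Fetch each small object with one GetObject call and parse it as JSON.

    s3_client is expected to behave like boto3.client("s3").
    Assumption: the objects are UTF-8 encoded JSON documents.
    """
    configs = {}
    for key in keys:
        response = s3_client.get_object(Bucket=bucket, Key=key)
        # response["Body"] is a stream; read() returns the object's raw bytes.
        raw = response["Body"].read()
        configs[key] = json.loads(raw.decode("utf-8"))
    return configs


# Usage on the driver (hypothetical bucket and key, needs boto3 and credentials):
#   import boto3
#   configs = read_s3_configs(boto3.client("s3"), "my-config-bucket", ["conf/app.json"])
```

For 5-10 files of a few kilobytes each, this amounts to a handful of small requests made directly from the driver, with no RDD created per file and no collect() needed to bring the data back.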
