Read Parquet File with illegal characters (Apache-Avro)
I have some Parquet files written in Python using PyArrow. Now I want to read them using a Java program. I tried the following, using Apache Avro:
import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class Main {
    private static Path path = new Path("D:\\pathToFile\\review.parquet");

    public static void main(String[] args) throws IllegalArgumentException {
        try {
            Configuration conf = new Configuration();
            Schema schema = SchemaBuilder.record("lineitem")
                    .fields()
                    .name("reviewID")
                    .aliases("review_id$str")
                    .type().stringType()
                    .noDefault()
                    .endRecord();
            conf.set(AvroReadSupport.AVRO_REQUESTED_PROJECTION, schema.toString());
            ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path)
                    .withConf(conf)
                    .build();
            GenericRecord r;
            while (null != (r = reader.read())) {
                r.getSchema().getField("reviewID").addAlias("review_id$str");
                Object review_id = r.get("review_id$str");
                String review_id_str = review_id != null ? ("'" + review_id.toString() + "'") : "-";
                System.out.println("review_id: " + review_id_str);
            }
        } catch (IOException e) {
            System.out.println("Error reading parquet file.");
            e.printStackTrace();
        }
    }
}
My Parquet file contains columns whose names contain the symbols [, ], ., \ and $. (In this case, the Parquet file contains a column review_id$str, whose values I want to read.) However, these characters are invalid in Avro names (see: https://avro.apache.org/docs/current/spec.html#names). Therefore, I tried to use aliases (see: http://avro.apache.org/docs/current/spec.html#Aliases). Although I no longer get any "invalid character" errors, I am still unable to get the values: only the placeholder "-" is printed for every row, even though the column contains values.
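To make the naming restriction concrete, here is a minimal sketch in plain Java (no Avro or Parquet dependency) that checks a column name against the rule in the Avro specification: a name must start with [A-Za-z_] and continue with only [A-Za-z0-9_]. The class name AvroNameCheck and the helper sanitize are hypothetical names introduced for illustration only; they are not part of Avro or parquet-avro, and this sketch does not by itself explain or fix the alias behaviour.

```java
import java.util.regex.Pattern;

final class AvroNameCheck {
    // Rule from the Avro spec: first character [A-Za-z_], remaining characters [A-Za-z0-9_].
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Returns true if the whole string satisfies the Avro name rule.
    static boolean isValidAvroName(String name) {
        return AVRO_NAME.matcher(name).matches();
    }

    // Hypothetical helper (an assumption, not an Avro API): replaces every
    // disallowed character with '_' and prefixes '_' when the result is empty
    // or starts with a digit.
    static String sanitize(String name) {
        String cleaned = name.replaceAll("[^A-Za-z0-9_]", "_");
        if (cleaned.isEmpty() || Character.isDigit(cleaned.charAt(0))) {
            cleaned = "_" + cleaned;
        }
        return cleaned;
    }

    public static void main(String[] args) {
        String column = "review_id$str";
        System.out.println(isValidAvroName(column));           // false: '$' is not allowed
        System.out.println(sanitize(column));                  // review_id_str
        System.out.println(isValidAvroName(sanitize(column))); // true
    }
}
```

Note that a mapping like this is lossy: distinct column names such as a$b and a.b would both become a_b, so a real renaming scheme would need to handle collisions.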
It only prints:
review_id: -
review_id: -
review_id: -
review_id: -
...
The expected output would be:
review_id: Q1sbwvVQXV2734tPgoKj4Q
review_id: GJXCdrto3ASJOqKeVWPi6Q
review_id: 2TzJjDVDEuAW6MR5Vuc1ug
review_id: yi0R0Ugj_xUx_Nek0-_Qig
...
Am I using the aliases incorrectly? Is it even possible to use aliases in this situation? If so, please explain how I can fix it. Thank you.
Update 2021:
In the end, I decided not to use Java for this task. I stuck with my Python solution using PyArrow, which works perfectly fine.