Reading Parquet files with illegal characters (Apache Avro)

Question

I have some Parquet files written in Python using PyArrow. Now I want to read them from a Java program. I tried the following, using Apache Avro:

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class Main {
    private static Path path = new Path("D:\\pathToFile\\review.parquet");

    public static void main(String[] args) throws IllegalArgumentException {
        try {
            Configuration conf = new Configuration();

            Schema schema = SchemaBuilder.record("lineitem")
                    .fields()
                        .name("reviewID")
                        .aliases("review_id$str")
                        .type().stringType()
                        .noDefault()
                    .endRecord();
            conf.set(AvroReadSupport.AVRO_REQUESTED_PROJECTION, schema.toString());

            ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path)
                .withConf(conf)
                .build();

            GenericRecord r;
            while (null != (r = reader.read())) {
                r.getSchema().getField("reviewID").addAlias("review_id$str");

                Object review_id = r.get("review_id$str");
                String review_id_str = review_id != null ? ("'" + review_id.toString() + "'") : "-";
                System.out.println("review_id: " + review_id_str);
            }
        } catch (IOException e) {
            System.out.println("Error reading parquet file.");
            e.printStackTrace();
        }
    }
}

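My Parquet file contains columns whose names contain the symbols [, ], ., \ and $. (In this case, the file contains a column named review_id$str, whose values I want to read.) However, these characters are invalid in Avro names (see: https://avro.apache.org/docs/current/spec.html#names). Therefore, I tried to use aliases (see: http://avro.apache.org/docs/current/spec.html#Aliases). Although I no longer get any "invalid character" errors, I still cannot get the values: only the "-" placeholder is printed for every row, even though the column contains values.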

It only prints:

review_id: -
review_id: -
review_id: -
review_id: -
...

The expected output would be:

review_id: Q1sbwvVQXV2734tPgoKj4Q
review_id: GJXCdrto3ASJOqKeVWPi6Q
review_id: 2TzJjDVDEuAW6MR5Vuc1ug
review_id: yi0R0Ugj_xUx_Nek0-_Qig
...

Am I using aliases incorrectly? Is it even possible to use aliases in this situation? If so, please explain how I can fix it. Thank you.
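
The naming restriction referred to in this question can be made concrete with a small check. The following is a minimal sketch, assuming the rule from the Avro specification that a name must start with [A-Za-z_] and may continue only with [A-Za-z0-9_]. The class AvroNameCheck and the method isValidAvroName are hypothetical helpers written for illustration; they are not part of Avro or Parquet, and the sketch does not read any Parquet file. It only shows which column names Avro would reject.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Avro or Parquet: it mirrors the naming
// rule from the Avro specification so column names can be checked up front.
final class AvroNameCheck {

    // Avro names start with [A-Za-z_] and continue with [A-Za-z0-9_] only.
    private static final Pattern AVRO_NAME =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Returns true if the whole string satisfies the Avro naming rule.
    static boolean isValidAvroName(String name) {
        return name != null && AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Sample names: the field name and column name from the question,
        // plus two made-up names using other characters it mentions.
        String[] columns = {"reviewID", "review_id$str", "a.b", "x[0]"};
        for (String column : columns) {
            System.out.println(column + " -> valid Avro name: "
                    + isValidAvroName(column));
        }
    }
}
```

A check like this could be run over the column names of a file before building an Avro schema, to see in advance which columns would need a different name or a different reading approach.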

Update 2021:
In the end, I decided not to use Java for this task. I stuck with my Python solution using PyArrow, which works perfectly fine.
