Read Parquet File with illegal characters (Apache-Avro)

Question

I have some Parquet files written in Python using PyArrow. Now I want to read them using a Java program. I tried the following, using Apache Avro:

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class Main {

    private static Path path = new Path("D:\\pathToFile\\review.parquet");

    public static void main(String[] args) throws IllegalArgumentException {
        try {
            Configuration conf = new Configuration();

            Schema schema = SchemaBuilder.record("lineitem")
                    .fields()
                        .name("reviewID")
                        .aliases("review_id$str")
                        .type().stringType()
                        .noDefault()
                    .endRecord();
            conf.set(AvroReadSupport.AVRO_REQUESTED_PROJECTION, schema.toString());

            ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path)
                .withConf(conf)
                .build();

            GenericRecord r;
            while (null != (r = reader.read())) {

                r.getSchema().getField("reviewID").addAlias("review_id$str");

                Object review_id = r.get("review_id$str");
                String review_id_str = review_id != null ? ("'" + review_id.toString() + "'") : "-";
                System.out.println("review_id: " + review_id_str);

            }
        } catch (IOException e) {
            System.out.println("Error reading parquet file.");
            e.printStackTrace();
        }
    }
}

My Parquet files contain columns whose names include the characters [, ], ., \ and $. (In this case, the file contains a column named review_id$str, whose values I want to read.) However, these characters are invalid in Avro names (see: https://avro.apache.org/docs/current/spec.html#names). I therefore tried to use aliases (see: http://avro.apache.org/docs/current/spec.html#Aliases). Although I no longer get "invalid character" errors, I still cannot read the values: every row prints the "-" placeholder, even though the column contains values.
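To make the naming restriction concrete, here is a minimal sketch of the rule from the Avro specification: a name must start with [A-Za-z_] and may continue only with [A-Za-z0-9_]. The class AvroNameCheck and the method isValidAvroName are hypothetical helpers written for illustration. They are not part of the Avro library and only approximate the check it performs.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Avro library: it re-implements the
// naming rule from the Avro specification for illustration only.
class AvroNameCheck {

    // Avro spec: a name starts with [A-Za-z_] and continues with [A-Za-z0-9_].
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Returns true if the whole string matches the Avro name pattern.
    static boolean isValidAvroName(String name) {
        return name != null && AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // The last three names contain characters the rule does not allow.
        String[] names = {"reviewID", "review_id", "review_id$str", "a.b", "x[0]"};
        for (String name : names) {
            System.out.println(name + " -> " + (isValidAvroName(name) ? "valid" : "invalid"));
        }
    }
}
```

Under this rule, reviewID passes while review_id$str fails because of the $ character, which is why the schema above declares the field under a different name.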

It only prints:

review_id: -
review_id: -
review_id: -
review_id: -
...

The expected output is:

review_id: Q1sbwvVQXV2734tPgoKj4Q
review_id: GJXCdrto3ASJOqKeVWPi6Q
review_id: 2TzJjDVDEuAW6MR5Vuc1ug
review_id: yi0R0Ugj_xUx_Nek0-_Qig
...

Am I using aliases incorrectly? Is it even possible to use aliases in this situation? If so, please explain how I can fix it. Thank you.

Update 2021:
In the end, I decided not to use Java for this task. I stuck with my Python solution using PyArrow, which works perfectly fine.

huangapple
  • Published on May 29, 2020 at 19:49:09
  • When reposting, please keep the link to this article: https://java.coder-hub.com/62085293.html