Using the parquet-avro library in Java to read a parquet file written using pyarrow

Question

I am writing a dataframe to a parquet file using pyarrow in Python.

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
df = pd.DataFrame(
        {
         "numbers": [1, 2, 3],
         "colors": ["red", "white", "blue"],
         "dates":["2019-12-16", "2019-12-16", "2019-12-16"],
         "codes": [None, None, None]
        }
    )
table = pa.Table.from_pandas(df)
pq.write_table(table, "filename")

The parquet file, when read back in Java (or in Sublime Text configured with parquet-tools), comes out as:

{"numbers":1,"colors":"red","dates":"2019-12-16"}
{"numbers":2,"colors":"white","dates":"2019-12-16"}
{"numbers":3,"colors":"blue","dates":"2019-12-16"}

The code I am using to read the parquet file is:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroup;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

class Parquet {
  private final List<SimpleGroup> data;
  private final List<Type> schema;

  public Parquet(List<SimpleGroup> data, List<Type> schema) {
    this.data = data;
    this.schema = schema;
  }

  public List<SimpleGroup> getData() {
    return data;
  }

  public List<Type> getSchema() {
    return schema;
  }

  public static Parquet getParquetData(String filePath) throws IOException {
    List<SimpleGroup> simpleGroups = new ArrayList<>();
    ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(new Path(filePath), new Configuration()));
    // The footer schema lists every column declared in the file, including the all-null one.
    MessageType schema = reader.getFooter().getFileMetaData().getSchema();
    List<Type> fields = schema.getFields();
    PageReadStore pages;
    while ((pages = reader.readNextRowGroup()) != null) {
      long rows = pages.getRowCount();
      MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
      RecordReader<Group> recordReader =
          columnIO.getRecordReader(pages, new GroupRecordConverter(schema));

      // Materialize each row of the row group as a SimpleGroup.
      for (int i = 0; i < rows; i++) {
        SimpleGroup simpleGroup = (SimpleGroup) recordReader.read();
        simpleGroups.add(simpleGroup);
      }
    }
    reader.close();
    return new Parquet(simpleGroups, fields);
  }
}

While debugging, I found that although the schema has all the columns, the data contains only the non-null columns.
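To make the finding concrete, here is a minimal sketch (using only the example API shown above; the helper name is mine) of what a per-row null check has to look like: a SimpleGroup stores no value at all for a null cell, so the field's repetition count must be tested before reading it.

// Hypothetical helper: a column that was null in the source row has a
// repetition count of 0, and reading its value directly would throw.
static void printRow(SimpleGroup group, MessageType schema) {
  for (int field = 0; field < schema.getFieldCount(); field++) {
    String name = schema.getFieldName(field);
    if (group.getFieldRepetitionCount(field) == 0) {
      System.out.println(name + " = <null>");  // e.g. the all-null "codes" column
    } else {
      System.out.println(name + " = " + group.getValueToString(field, 0));
    }
  }
}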

Has anybody seen this behavior? Is there any option in the parquet-avro library to turn this 'optimization' off?
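For comparison, here is a minimal sketch of reading the same file through parquet-avro's AvroParquetReader (assuming a parquet-avro version recent enough to have the InputFile-based builder). Each row comes back as an Avro GenericRecord, in which a null column is still present as a field, just with a null value:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Sketch: rows materialize as GenericRecords; null cells stay visible as null fields.
static void dumpWithAvro(String filePath) throws java.io.IOException {
  try (ParquetReader<GenericRecord> reader = AvroParquetReader
      .<GenericRecord>builder(HadoopInputFile.fromPath(new Path(filePath), new Configuration()))
      .build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
      // e.g. {"numbers": 1, "colors": "red", "dates": "2019-12-16", "codes": null}
      System.out.println(record);
    }
  }
}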

Thanks


Posted by huangapple on 2020-04-09 19:45:17. Source: https://java.coder-hub.com/61120411.html