Using the parquet-avro library in Java to read a parquet file written using pyarrow

Question

I am writing a dataframe to a parquet file using pyarrow in Python.

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
df = pd.DataFrame(
        {
         "numbers": [1, 2, 3],
         "colors": ["red", "white", "blue"],
         "dates":["2019-12-16", "2019-12-16", "2019-12-16"],
         "codes": [None, None, None]
        }
    )
table = pa.Table.from_pandas(df)
pq.write_table(table, "filename")

The parquet file, when read back in Java (or in Sublime Text configured with parquet-tools), comes out as:

{"numbers":1,"colors":"red","dates":"2019-12-16"}
{"numbers":2,"colors":"white","dates":"2019-12-16"}
{"numbers":3,"colors":"blue","dates":"2019-12-16"}

The code I am using to read the parquet file is:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroup;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

class Parquet {
  private final List<SimpleGroup> data;
  private final List<Type> schema;

  public Parquet(List<SimpleGroup> data, List<Type> schema) {
    this.data = data;
    this.schema = schema;
  }

  public List<SimpleGroup> getData() {
    return data;
  }

  public List<Type> getSchema() {
    return schema;
  }

  public static Parquet getParquetData(String filePath) throws IOException {
    List<SimpleGroup> simpleGroups = new ArrayList<>();
    ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(new Path(filePath), new Configuration()));
    // The footer schema lists every column declared in the file, including the all-null one.
    MessageType schema = reader.getFooter().getFileMetaData().getSchema();
    List<Type> fields = schema.getFields();
    PageReadStore pages;
    while ((pages = reader.readNextRowGroup()) != null) {
      long rows = pages.getRowCount();
      MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
      RecordReader<Group> recordReader =
          columnIO.getRecordReader(pages, new GroupRecordConverter(schema));

      // Materialize each row of the row group as a SimpleGroup.
      for (int i = 0; i < rows; i++) {
        SimpleGroup simpleGroup = (SimpleGroup) recordReader.read();
        simpleGroups.add(simpleGroup);
      }
    }
    reader.close();
    return new Parquet(simpleGroups, fields);
  }
}

While debugging, I found that although the schema has all the columns, the data contains only the non-null columns.
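To make the finding concrete, here is a minimal sketch (using only the example API shown above; the helper name is mine) of what a per-row null check has to look like: a SimpleGroup stores no value at all for a null cell, so the field's repetition count must be tested before reading it.

// Hypothetical helper: a column that was null in the source row has a
// repetition count of 0, and reading its value directly would throw.
static void printRow(SimpleGroup group, MessageType schema) {
  for (int field = 0; field < schema.getFieldCount(); field++) {
    String name = schema.getFieldName(field);
    if (group.getFieldRepetitionCount(field) == 0) {
      System.out.println(name + " = <null>");  // e.g. the all-null "codes" column
    } else {
      System.out.println(name + " = " + group.getValueToString(field, 0));
    }
  }
}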

Has anybody seen this behavior? Is there any option in the parquet-avro library to turn this 'optimization' off?
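For comparison, here is a minimal sketch of reading the same file through parquet-avro's AvroParquetReader (assuming a parquet-avro version recent enough to have the InputFile-based builder). Each row comes back as an Avro GenericRecord, in which a null column is still present as a field, just with a null value:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Sketch: rows materialize as GenericRecords; null cells stay visible as null fields.
static void dumpWithAvro(String filePath) throws java.io.IOException {
  try (ParquetReader<GenericRecord> reader = AvroParquetReader
      .<GenericRecord>builder(HadoopInputFile.fromPath(new Path(filePath), new Configuration()))
      .build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
      // e.g. {"numbers": 1, "colors": "red", "dates": "2019-12-16", "codes": null}
      System.out.println(record);
    }
  }
}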

Thanks


Posted by huangapple on 2020-04-09 19:45:17. Source: https://java.coder-hub.com/61120411.html