July 24, 2020, 18:54:19

How can I skip sync markers when comparing two Avro files filled with similar data?

Question

Could somebody please suggest how I can compare two Avro files that contain identical data?
My application serializes DB data (which is presumably static) to Avro on a daily basis. The intention is to compare newly generated files with their previous versions.
This is all done in Java. Currently I compare the files row by row, which suits my needs almost perfectly. The only problem is that Avro Object Container Files contain a 16-byte sync marker at the end of the file header and at the end of each data block. These sync markers are generated automatically for each new Avro file.
An example of an Avro file taken from the web is below:

Objavro.codecnullavro.schemaò{"type":"record","name":"twitter_schema","namespace":"com.miguno.avro","fields":[{"name":"username","type":"string","doc":"Name of the user account on Twitter.com"},{"name":"tweet","type":"string","doc":"The content of the user's Twitter message"},{"name":"timestamp","type":"long","doc":"Unix epoch time in milliseconds"}],"doc:":"A basic schema for storing Twitter messages"}ì7ê,Hz[ÅìÈÈmigunoFRock: Nerf paper, scissors is fine.²žî
BlizzardCSFWorks as intended.  Terran is IMBA.âóî
ì7ê,Hz[ÅìÈ

As can be seen, ì7ê,Hz[ÅìÈ is the sync marker, and it causes problems for my logic.
Because of it, two Avro files created from the same data are not identical.

Answer 1

Score: 0

When writing Avro files with DataFileWriter, you can specify the sync marker yourself by passing a 16-byte array to the create overload that accepts a sync argument. If your application uses the same fixed sync marker on every run, the files should be identical as long as the objects haven't changed.
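
If the files have already been written with random markers, another option is to skip the markers at comparison time. The sketch below is not part of the original answer. It uses only the Java standard library and walks the Object Container File layout as described in the Avro specification (4-byte magic, metadata map, 16-byte sync marker, then blocks of object count, byte size, data, and sync marker), returning the file bytes with every sync marker left out. The class and method names (AvroSyncStripper, stripSync, sample) are hypothetical, and the code assumes well-formed files that fit in memory.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: copies an Avro container file while omitting its sync markers. */
final class AvroSyncStripper {
    private static final int SYNC_SIZE = 16;

    /** Reads one Avro long (base-128 varint, then zig-zag decoding). */
    static long readLong(ByteBuffer in) {
        long raw = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            raw |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1);
    }

    /** Advances past one length-prefixed string or bytes value. */
    static void skipBytes(ByteBuffer in) {
        int len = (int) readLong(in);
        in.position(in.position() + len);
    }

    /** Returns the header and block contents of the file, without any sync marker. */
    static byte[] stripSync(byte[] file) {
        if (file.length < 4 || file[0] != 'O' || file[1] != 'b' || file[2] != 'j' || file[3] != 1) {
            throw new IllegalArgumentException("not an Avro object container file");
        }
        ByteBuffer in = ByteBuffer.wrap(file);
        in.position(4);
        long n;
        while ((n = readLong(in)) != 0) {          // metadata map, encoded in blocks
            if (n < 0) { n = -n; readLong(in); }   // negative count is followed by a byte size
            for (long i = 0; i < n; i++) { skipBytes(in); skipBytes(in); }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(file, 0, in.position());         // header up to, but excluding, the marker
        in.position(in.position() + SYNC_SIZE);
        while (in.hasRemaining()) {                // data blocks
            int start = in.position();
            readLong(in);                          // object count
            int size = (int) readLong(in);         // byte size of the serialized objects
            in.position(in.position() + size);
            out.write(file, start, in.position() - start);
            in.position(in.position() + SYNC_SIZE);
        }
        return out.toByteArray();
    }

    /** Writes one Avro long; only used to build the demo file below. */
    static void writeLong(ByteArrayOutputStream out, long n) {
        long raw = (n << 1) ^ (n >> 63);
        while ((raw & ~0x7FL) != 0) {
            out.write((int) ((raw & 0x7F) | 0x80));
            raw >>>= 7;
        }
        out.write((int) raw);
    }

    static void writeBytes(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, b.length);
        out.write(b, 0, b.length);
    }

    /** Builds a tiny one-block file whose sync marker is 16 copies of syncFill. */
    static byte[] sample(byte syncFill) {
        byte[] sync = new byte[SYNC_SIZE];
        Arrays.fill(sync, syncFill);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(new byte[] {'O', 'b', 'j', 1}, 0, 4);
        writeLong(out, 2);                         // two metadata entries
        writeBytes(out, "avro.schema"); writeBytes(out, "\"string\"");
        writeBytes(out, "avro.codec");  writeBytes(out, "null");
        writeLong(out, 0);                         // end of the metadata map
        out.write(sync, 0, SYNC_SIZE);
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        writeBytes(data, "hello");                 // one record of schema "string"
        writeLong(out, 1);                         // object count
        writeLong(out, data.size());               // block size in bytes
        out.write(data.toByteArray(), 0, data.size());
        out.write(sync, 0, SYNC_SIZE);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = sample((byte) 0x11);
        byte[] b = sample((byte) 0x22);
        System.out.println("raw bytes equal: " + Arrays.equals(a, b));
        System.out.println("equal ignoring sync: " + Arrays.equals(stripSync(a), stripSync(b)));
    }
}
```

Running main should report that the raw bytes differ while the stripped bytes match. For real files, Files.readAllBytes could supply the input. Note that the stripped bytes only match if both files also share the same schema text, codec, and block boundaries.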
