How can I skip sync markers when comparing two Avro files filled with similar data?

Question

Could somebody please suggest how I can compare two Avro files that contain identical data?
My application serializes DB data (which is presumably static) to Avro on a daily basis. The intention is to compare newly generated files with their previous versions.
This is driven by Java. Currently I'm comparing the files row by row, which suits my needs almost perfectly. The only problem is that Avro Object Container Files contain a 16-byte sync marker at the end of the file header and after every file data block. These sync markers are generated automatically for each new Avro file.
An example of an Avro file taken from the web is below:

Objavro.codecnullavro.schemaò{"type":"record","name":"twitter_schema","namespace":"com.miguno.avro","fields":[{"name":"username","type":"string","doc":"Name of the user account on Twitter.com"},{"name":"tweet","type":"string","doc":"The content of the user's Twitter message"},{"name":"timestamp","type":"long","doc":"Unix epoch time in milliseconds"}],"doc:":"A basic schema for storing Twitter messages"}ì7ê,Hz[ÅìÈÈmigunoFRock: Nerf paper, scissors is fine.²žî
BlizzardCSFWorks as intended.  Terran is IMBA.âóî
ì7ê,Hz[ÅìÈ

As can be seen, ì7ê,Hz[ÅìÈ is the sync marker, and it breaks my logic.
As a result, two Avro files created from the same data are not identical.
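
One way to make a raw comparison tolerant of the markers is to drop them before comparing. The sketch below is not from the original post: it walks the container layout by hand (magic bytes, metadata map, 16-byte header marker, then blocks of object count, byte size, data, and marker) using only the JDK, and the class and method names are hypothetical. It assumes both files were produced by the same writer with the same schema and codec, so that everything except the markers is byte-identical when the data is identical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class AvroSyncAgnosticCompare {
    private static final int SYNC_SIZE = 16;

    /** Reads one Avro long (zigzag-encoded, variable-length). */
    static long readLong(ByteBuffer buf) {
        long raw = 0;
        int shift = 0;
        byte b;
        do {
            b = buf.get();
            raw |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1);
    }

    /** Advances the buffer position by n bytes. */
    static void skip(ByteBuffer buf, long n) {
        buf.position(buf.position() + Math.toIntExact(n));
    }

    /** Returns the file bytes with the header marker and every block marker removed. */
    static byte[] stripSyncMarkers(byte[] file) {
        if (file.length < 4 || file[0] != 'O' || file[1] != 'b' || file[2] != 'j' || file[3] != 1) {
            throw new IllegalArgumentException("Not an Avro object container file");
        }
        ByteBuffer buf = ByteBuffer.wrap(file);
        buf.position(4);
        // Header metadata is an Avro map: blocks of key/value pairs, ended by a zero count.
        for (long count = readLong(buf); count != 0; count = readLong(buf)) {
            if (count < 0) {      // a negative count is followed by the block's byte size
                count = -count;
                readLong(buf);
            }
            for (long i = 0; i < count; i++) {
                skip(buf, readLong(buf)); // key (string)
                skip(buf, readLong(buf)); // value (bytes)
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(file.length);
        out.write(file, 0, buf.position()); // magic + metadata
        skip(buf, SYNC_SIZE);               // header sync marker
        while (buf.hasRemaining()) {
            int start = buf.position();
            readLong(buf);                  // number of objects in the block
            skip(buf, readLong(buf));       // serialized objects
            out.write(file, start, buf.position() - start);
            skip(buf, SYNC_SIZE);           // block sync marker
        }
        return out.toByteArray();
    }

    static boolean sameIgnoringSync(byte[] a, byte[] b) {
        return Arrays.equals(stripSyncMarkers(a), stripSyncMarkers(b));
    }

    static boolean sameIgnoringSync(Path a, Path b) throws IOException {
        return sameIgnoringSync(Files.readAllBytes(a), Files.readAllBytes(b));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java AvroSyncAgnosticCompare old.avro new.avro
        System.out.println(sameIgnoringSync(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

This reads each file fully into memory, which is fine for modest daily exports; for large files, or when the writer settings may differ between runs, comparing decoded records through Avro's own reader is likely the safer route.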

Answer 1

Score: 0

When writing Avro files with DataFileWriter, you can manually specify a sync marker in the create method. If your application uses the same fixed sync marker on every run, the files should be identical as long as the objects haven't changed.
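
A fixed marker has to be exactly 16 bytes long. As a rough sketch (the class name, method name, and label are made up for illustration), one option is to derive it from an application-chosen string with MD5, whose digests are always 16 bytes. The Avro call itself is left as a comment because it needs the Avro library on the classpath; it assumes the Avro version in use provides the `create(Schema, OutputStream, byte[])` overload on `DataFileWriter`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FixedSyncMarker {
    /** Derives a stable 16-byte marker from a label; the same label always gives the same bytes. */
    static byte[] fromLabel(String label) {
        try {
            return MessageDigest.getInstance("MD5").digest(label.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform is required to provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] marker = fromLabel("my-daily-export"); // hypothetical label
        System.out.println("marker length: " + marker.length);
        // Assumed use with Avro (not compiled here; schema, datumWriter and outputStream are yours):
        //   DataFileWriter<GenericRecord> writer = new DataFileWriter<>(datumWriter);
        //   writer.create(schema, outputStream, marker);
    }
}
```

A hard-coded 16-byte array constant would work just as well; the only requirement is that the same bytes are reused on every run.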

huangapple
  • Posted on July 24, 2020 at 18:54:19
  • When reposting, please keep the link to this article: https://java.coder-hub.com/63072149.html