How to extract a PRStream from a PDF having built-in encoding using iText?

Question

I need to replace the text in an original PDF and create a new one. For that I am using the iText library in Java. Until now I only had PDFs with ANSI encoding, so I would run the following lines:

    PdfReader reader = new PdfReader(SOURCE_PDF);
    PdfDictionary page = reader.getPageN(1);
    byte[] pageContentInput = ContentByteUtils.getContentBytesForPage(reader, 1);
    String dd = new String(pageContentInput, BaseFont.CP1252);

BaseFont.CP1252 helped me decode the bytes, and I would get the text in the string "dd". For a PDF with built-in encoding, however, decoding with CP1252 yields something like <$##!*??!$$>Tj, whereas in the ANSI case I get <somecharactersabcd>Tj.

Also, I need not only the text from the page but also the whole formatting, i.e. the Tj, Tf, etc. operators, so that I can create a new PDF with the same formatting. That is why I am using getContentBytesForPage.

How can I get the PDF text stream out of a PDF having built-in encoding?

Answer 1

Score: 0

As already mentioned in comments, you cannot use a single encoding to decode the whole byte array because each string object therein can be encoded differently.
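To illustrate why a single encoding fails, here is a minimal, library-independent sketch. It does not use iText; the class `SingleEncodingPitfall` and its font table are invented for illustration and are not read from a real PDF. It decodes the same three bytes once through a hypothetical subset font table and once as CP1252, as the code in the question does.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Illustration only: the font table is invented, not read from a real PDF.
class SingleEncodingPitfall {

    // Hypothetical subset font whose codes 0x24, 0x23, 0x21 draw "a", "b", "c".
    static Map<Integer, String> subsetFont() {
        Map<Integer, String> font = new HashMap<>();
        font.put(0x24, "a");
        font.put(0x23, "b");
        font.put(0x21, "c");
        return font;
    }

    // Decode with the table of the font that is active for this string.
    static String decodeWithFont(byte[] codes, Map<Integer, String> font) {
        StringBuilder text = new StringBuilder();
        for (byte b : codes) {
            // U+FFFD marks a code the font table does not cover.
            text.append(font.getOrDefault(b & 0xFF, "\uFFFD"));
        }
        return text.toString();
    }

    // Decode the same bytes as the question's code does, ignoring the font.
    static String decodeAsCp1252(byte[] codes) {
        return new String(codes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        byte[] operand = {0x24, 0x23, 0x21};
        System.out.println(decodeWithFont(operand, subsetFont())); // abc
        System.out.println(decodeAsCp1252(operand));               // $#!
    }
}
```

The bytes are identical in both calls; only the knowledge about the font differs, which is why the CP1252 result looks like random punctuation.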

You have to parse the byte array instruction by instruction, keep track of which font is currently selected, and when you encounter a text drawing instruction, decode its string arguments according to the properties of that current font.

The properties to use may be its Encoding, its ToUnicode map, information from the underlying font file, and so on, depending on which font type it is and which optional information is given.
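The approach described above can be sketched roughly as follows. This is a strongly simplified, library-independent illustration and not iText code: the class `FontAwareDecoder`, the method names, and the sample font tables are assumptions made up for this example. It only understands whitespace-separated tokens, hex strings with one-byte codes, and the Tf and Tj operators, whereas a real content stream needs a proper parser and the font's actual Encoding or ToUnicode data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch: tracks the current font and decodes Tj operands with it.
class FontAwareDecoder {

    // fonts maps a resource name such as "/F1" to a code-to-text table.
    static List<String> decodeShownText(String content,
                                        Map<String, Map<Integer, String>> fonts) {
        List<String> shown = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        Map<Integer, String> currentFont = null;
        for (String token : content.trim().split("\\s+")) {
            if (token.equals("Tf")) {
                // Operands are the font name and the font size.
                currentFont = fonts.get(operands.get(0));
                operands.clear();
            } else if (token.equals("Tj")) {
                shown.add(decodeHex(operands.get(0), currentFont));
                operands.clear();
            } else if (isOperand(token)) {
                operands.add(token);
            } else {
                operands.clear(); // some other operator, ignored here
            }
        }
        return shown;
    }

    // Names, hex strings and numbers are operands; everything else is an operator.
    static boolean isOperand(String token) {
        return token.startsWith("/") || token.startsWith("<")
                || token.matches("[-+]?[0-9]*\\.?[0-9]+");
    }

    // Decode a hex string such as <0102>, assuming one byte per character code.
    static String decodeHex(String hexToken, Map<Integer, String> font) {
        if (font == null) {
            throw new IllegalStateException("text shown before any font was selected");
        }
        String hex = hexToken.substring(1, hexToken.length() - 1);
        if (hex.length() % 2 != 0) {
            hex += "0"; // an odd number of digits is padded with a trailing zero
        }
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < hex.length(); i += 2) {
            int code = Integer.parseInt(hex.substring(i, i + 2), 16);
            text.append(font.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    // Two invented fonts that assign different text to the same codes.
    static Map<String, Map<Integer, String>> sampleFonts() {
        Map<Integer, String> f1 = new HashMap<>();
        f1.put(1, "H");
        f1.put(2, "i");
        Map<Integer, String> f2 = new HashMap<>();
        f2.put(1, "o");
        f2.put(2, "k");
        Map<String, Map<Integer, String>> fonts = new HashMap<>();
        fonts.put("/F1", f1);
        fonts.put("/F2", f2);
        return fonts;
    }

    public static void main(String[] args) {
        String content = "BT /F1 12 Tf <0102> Tj /F2 12 Tf <0102> Tj ET";
        System.out.println(decodeShownText(content, sampleFonts())); // [Hi, ok]
    }
}
```

Note that the operand <0102> appears twice in the sample stream but decodes to different text each time because a different font is selected.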

But even after doing so, you cannot simply replace the text in the original PDF. This answer (to a similar question in the context of the PDFBox library) illustrates a number of hindrances, in particular fonts (which may be subset-embedded only) not containing the glyphs you need, and unclear layout considerations.


To get an idea how to address the former issues, have a look at the following answers:

  • This answer provides PdfContentStreamEditor classes for Java and C# which can serve as base classes to edit content stream instructions; these classes in particular also keep track of the graphics state including the current text state parameters.
  • This answer (the OP unfortunately deleted the question, so you need some reputation to have permission to read the answer) uses that PdfContentStreamEditor Java class to implement a text remover for text in a specific font and another one for text with a large font size.
  • This answer uses that PdfContentStreamEditor C# class to implement a BigTextRemover which recognizes text by its font size and removes it.
  • This answer describes what to do to prevent PdfContentStreamEditor issues with rotated documents.
  • This answer also describes what to do to prevent PdfContentStreamEditor issues with rotated documents and additionally fixes a bug in the PdfContentStreamEditor.
  • This answer uses that PdfContentStreamEditor Java class to implement an editor that changes the color of black text to green.
  • This answer provides a port of the PdfContentStreamEditor to iText 7 / Java as PdfCanvasEditor and shows example usages removing text by font name or font size and re-coloring black text to green.
  • This answer uses that PdfContentStreamEditor C# class to implement a TextRemover removing all text drawing instructions.
  • This answer uses that PdfContentStreamEditor Java class to implement a SimpleTextRemover which recognizes a search text in text drawing instructions, removes it, and returns the positions at which the text was removed (under some restrictions explained there). At those positions one then can draw new text.

By studying the PdfContentStreamEditor from the first answer (with the fix from the fifth answer) and the SimpleTextRemover, you can get an idea of how to find text. The other answers might be interesting in general if you want to edit PDFs in different ways.

As far as replacing goes, consider that fonts may be incomplete and you, therefore, in general cannot simply replace the contents of the string arguments of text drawing instructions but instead may have to add a new font and switch fonts for the replacement text drawing instruction.
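As a rough illustration of that check, the following library-independent sketch tries to encode a replacement text with the codes an existing font offers and reports when a glyph is missing. The class `ReplacementEncoder`, its methods, and the sample font table are hypothetical; with a real PDF the table would have to be derived from the font's Encoding, ToUnicode map, or font file.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: can the replacement text be written with the codes of an existing font?
class ReplacementEncoder {

    // Turn a code-to-text table into a text-to-code table.
    static Map<String, Integer> invert(Map<Integer, String> font) {
        Map<String, Integer> reverse = new HashMap<>();
        for (Map.Entry<Integer, String> entry : font.entrySet()) {
            reverse.put(entry.getValue(), entry.getKey());
        }
        return reverse;
    }

    // Returns a hex string operand for the text, or null if any character
    // has no code in this font (so a different font would be needed).
    static String encodeOrNull(String text, Map<Integer, String> font) {
        Map<String, Integer> reverse = invert(font);
        StringBuilder hex = new StringBuilder("<");
        for (int i = 0; i < text.length(); i++) {
            Integer code = reverse.get(String.valueOf(text.charAt(i)));
            if (code == null) {
                return null;
            }
            hex.append(String.format("%02X", code));
        }
        return hex.append('>').toString();
    }

    // Invented subset font that only contains the glyphs for "H" and "i".
    static Map<Integer, String> sampleFont() {
        Map<Integer, String> font = new HashMap<>();
        font.put(1, "H");
        font.put(2, "i");
        return font;
    }

    public static void main(String[] args) {
        System.out.println(encodeOrNull("iH", sampleFont())); // <0201>
        System.out.println(encodeOrNull("Ho", sampleFont())); // null
    }
}
```

When the result is null, the replacement text cannot be drawn with the existing font, and the replacement instruction would have to select a newly added font instead.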

huangapple
  • This article was published on March 16, 2020, 15:54:22
  • When reposting, please be sure to keep the link to this article: https://java.coder-hub.com/60702152.html