Extract text from non-selectable PDF content using Java PDFBox

Question

I can easily extract the content of most PDF files, but I have some files whose text is not selectable when I open them. My existing code, shown below, cannot extract that text:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFBoxExample {
    public static void main(String[] args) {
        File file = new File("C:\\pdf\\pdf_result.pdf");
        // PDFBox 2.x API; try-with-resources closes the document
        try (PDDocument document = PDDocument.load(file)) {
            if (!document.isEncrypted()) {
                // Extracts only the text stored in the page content streams
                PDFTextStripper tStripper = new PDFTextStripper();
                String content = tStripper.getText(document);
                System.out.println(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

My PDF file is available at the following link:
https://1drv.ms/b/s!AmRKaLhGJhJphvMOUBGADveatrx0hA?e=a0seG7

Could you suggest a solution?
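
A possible first diagnostic step, offered as a sketch rather than a confirmed answer: one common reason text cannot be selected is that the pages are scanned images with no text layer. That is an assumption here, since the linked file was not inspected. In that case PDFTextStripper usually returns only whitespace, and an OCR step would be needed instead. The class TextLayerCheck and the method looksLikeNoTextLayer below are hypothetical names, and the sample strings stand in for the value returned by tStripper.getText(document).

```java
public class TextLayerCheck {

    // Hypothetical helper: returns true when the extracted text contains no
    // letter or digit at all, which suggests the stripper found no text layer.
    static boolean looksLikeNoTextLayer(String extractedText) {
        if (extractedText == null) {
            return true;
        }
        return extractedText.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        // Stand-ins (assumed values) for the result of tStripper.getText(document).
        String fromImageOnlyPdf = "\r\n \r\n";
        String fromNormalPdf = "Invoice 2020-06\r\n";

        System.out.println(looksLikeNoTextLayer(fromImageOnlyPdf));
        System.out.println(looksLikeNoTextLayer(fromNormalPdf));
    }
}
```

If the check reports no text layer for the problem file, the text is probably not stored as text in the PDF at all, so no PDFTextStripper setting would recover it.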

huangapple
  • Posted on June 29, 2020 at 14:39:15