Question

I'm using Java to scrape data from PDFs. I use Tesseract for image PDFs and PDFBox for non-image (text-based) PDFs. Manually, you can tell the two apart by trying to select text in a viewer: if no text can be highlighted, it is an image PDF.
Is there a way in Java to determine whether a PDF is an image PDF or a non-image PDF?

Answer 1

Score: 0

You can use PDFBox to extract the text from the PDF. If there is little or no text, or the extracted text is gibberish, it is probably an image PDF.
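The check described above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the class name `ImagePdfHeuristic`, the method `looksLikeImagePdf`, and both thresholds are assumptions made up for this example and would need tuning on real documents. The sketch covers only the decision step and takes the already-extracted text as input, so it runs without PDFBox on the classpath; the comments note how the text and page count would typically be obtained.

```java
// Hypothetical helper: the class name, method name, and thresholds are
// illustrative assumptions, not part of PDFBox or Tesseract.
final class ImagePdfHeuristic {

    // Assumed cutoffs; tune them against your own documents.
    static final int MIN_CHARS_PER_PAGE = 20;
    static final double MIN_READABLE_RATIO = 0.6;

    // extractedText: what a text extractor returned for the whole document.
    // With PDFBox this would typically come from
    //   new PDFTextStripper().getText(document)
    // and pageCount from document.getNumberOfPages(), where document is a
    // PDDocument (PDDocument.load(file) in 2.x, Loader.loadPDF(file) in 3.x).
    static boolean looksLikeImagePdf(String extractedText, int pageCount) {
        String text = extractedText == null ? "" : extractedText;
        int pages = Math.max(pageCount, 1);

        // "Not much text": too few non-whitespace characters per page.
        long nonWhitespace = text.chars()
                .filter(c -> !Character.isWhitespace(c))
                .count();
        if (nonWhitespace < (long) MIN_CHARS_PER_PAGE * pages) {
            return true;
        }

        // "Gibberish": a low share of letters and digits among those characters.
        long readable = text.chars()
                .filter(c -> Character.isLetterOrDigit(c))
                .count();
        return (double) readable / nonWhitespace < MIN_READABLE_RATIO;
    }

    public static void main(String[] args) {
        String normal = "Invoice number 12345 issued to Example Corp for consulting services.";
        System.out.println(looksLikeImagePdf("", 3));      // true: no text at all
        System.out.println(looksLikeImagePdf(normal, 1));  // false: plenty of readable text
    }
}
```

A `true` result would route the file to Tesseract and a `false` result to PDFBox. A PDF can mix scanned and text pages, so running the same check on the text of each page separately may be more reliable than one document-wide decision.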


