Read PDF from a URL using Selenium-WebDriver and PDF-Box

Question

I'm trying to read the text of a PDF using Selenium WebDriver and the PDFBox API. If possible, I don't want to download the file; I only want to read the PDF from the web and get its text into a string. The code I'm using is below, but I can't make it work.

I've found code examples that download the PDF and compare it using the downloaded file, but no working example that extracts the text of the PDF directly from the URL.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class PDFextract {

    public static void main(String[] args) throws Exception {
        System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get("THE URL OF SITE I CAN'T SHARE"); // placeholder: I can't share the real URL
        System.out.println(driver.getTitle());

        // Collect the href of every "Click to open file" link on the page.
        // Note: the hrefs are concatenated, so this only yields a usable URL when one link matches.
        List<WebElement> links = driver.findElements(By.xpath("//a[@title='Click to open file']"));
        String fLinks = "";
        for (WebElement link : links) {
            fLinks = fLinks + link.getAttribute("href");
        }
        fLinks = fLinks.trim();
        System.out.println(fLinks); // up to here the code works fine; I get a valid URL

        // The code below doesn't work
        URL url = new URL(fLinks);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        InputStream is = connection.getInputStream();
        PDDocument pdd = PDDocument.load(is);
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(pdd);
        pdd.close();
        is.close();
        System.out.println(text);
    }
}

I get the error:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 500 for URL: ***AS TOLD ABOVE, I CAN'T SHARE THE URL***
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
    at PDFextract.main(PDFextract.java:106)
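
One possible explanation for the 500, offered as an assumption rather than something the trace confirms: the link works inside Chrome because the browser sends its session cookies, while a fresh HttpURLConnection sends none. A minimal sketch of forwarding cookies follows. The class name SessionCookies and its methods are hypothetical helpers, not part of Selenium or PDFBox.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: carries browser cookies over to a plain HTTP connection.
final class SessionCookies {

    // Joins name/value pairs into one Cookie request header value, e.g. "a=1; b=2".
    static String toHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            joiner.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return joiner.toString();
    }

    // Prepares a connection with the Cookie header set.
    // No request is sent until the caller reads from the connection.
    static HttpURLConnection open(String fileUrl, Map<String, String> cookies) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(fileUrl).openConnection();
        if (!cookies.isEmpty()) {
            connection.setRequestProperty("Cookie", toHeader(cookies));
        }
        return connection;
    }
}
```

With Selenium, the map could be filled from driver.manage().getCookies(), using each cookie's getName() and getValue(); the stream from the returned connection's getInputStream() would then go to PDDocument.load as before. If the server still answers 500, reading connection.getErrorStream() may show the server's own error message.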

Edit (07.05.2020):
@TilmanHausherr, I've done more research, and this tutorial helped with the first part, how to read a PDF from a link: "Selenium Tutorial: Read PDF Content using Selenium WebDriver".

This method works:

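// Imports needed: java.io.BufferedInputStream, java.io.InputStream, java.net.URL,
// org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper

// Usage: read the PDF that the browser is currently showing
String pdfContent = readPDFContent(driver.getCurrentUrl());

public String readPDFContent(String appUrl) throws Exception {
    URL url = new URL(appUrl);
    InputStream is = url.openStream();
    BufferedInputStream fileToParse = new BufferedInputStream(is);
    PDDocument document = null;
    String output = null;
    try {
        // Parse the PDF straight from the network stream; nothing is written to disk
        document = PDDocument.load(fileToParse);
        output = new PDFTextStripper().getText(document);
        System.out.println(output);
    } finally {
        // Release the document and the streams even if parsing fails
        if (document != null) {
            document.close();
        }
        fileToParse.close();
        is.close();
    }
    return output;
}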
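
Because a URL that requires a session may answer with an HTML login or error page instead of a PDF, it can help to peek at the first bytes before handing the stream to PDFBox. The sketch below assumes the response is wrapped in a BufferedInputStream, as in readPDFContent, and relies on PDF files normally starting with the bytes %PDF-. PdfSniffer and looksLikePdf are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: checks the PDF signature without consuming the stream.
final class PdfSniffer {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the stream starts with "%PDF-".
    // mark/reset rewinds the stream, so the caller can still parse it from byte 0.
    static boolean looksLikePdf(BufferedInputStream in) throws IOException {
        in.mark(MAGIC.length);
        try {
            byte[] head = new byte[MAGIC.length];
            int filled = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            return filled == MAGIC.length && Arrays.equals(head, MAGIC);
        } finally {
            in.reset();
        }
    }
}
```

In readPDFContent this check could run on fileToParse just before PDDocument.load, failing early with a clearer message when the server did not return a PDF.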

It seems my problem is the link itself: the HTML element is an <embed>, and in my case it also has a 'stream-URL' attribute:

<embed id="plugin" type="application/x-google-chrome-pdf"
src="https://"SITE I CAN'T TELL"/file.do?_tr=4d51599fead209bc4ef42c6e5c4839c9bebc2fc46addb11a"
stream-URL="chrome-extension://mhjfbmdgcfjojefgiehjai/6958a80-4342-43fc-838a-1dbd07fa2fc1"
headers="accept-ranges: bytes
content-disposition: inline;filename=&quot;online.pdf&quot;
content-length: 71488
content-security-policy: frame-ancestors 'self' https://*"SITE I CAN'T TELL" https://*"DOMAIN I CAN'T TELL".net
content-type: application/pdf

I found these related questions:
1. Download the File which has stream-url is the chrome extension in the embed tag using selenium
2. Handling contents of Embed tag in selenium python

But I still haven't managed to read the PDF with PDFBox, because the element is an <embed> and I might have to access the stream-URL.
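
The stream-URL uses the chrome-extension scheme, which belongs to Chrome's built-in PDF viewer; the standard JDK ships no URL handler for it, so the https src attribute is the more likely candidate to fetch. A small sketch, assuming both attribute values have already been read from the <embed> element (for example with WebElement.getAttribute). EmbedSource and its methods are hypothetical names.

```java
import java.net.URI;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: picks the attribute value that plain Java HTTP code can fetch.
final class EmbedSource {

    // True only for syntactically valid http or https URLs.
    static boolean isHttpUrl(String candidate) {
        if (candidate == null) {
            return false;
        }
        try {
            String scheme = URI.create(candidate.trim()).getScheme();
            return "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        } catch (IllegalArgumentException e) {
            // URI.create rejects malformed input such as strings containing spaces
            return false;
        }
    }

    // Returns the first candidate that is an http(s) URL, in the order given.
    static Optional<String> firstFetchable(List<String> candidates) {
        for (String candidate : candidates) {
            if (isHttpUrl(candidate)) {
                return Optional.of(candidate.trim());
            }
        }
        return Optional.empty();
    }
}
```

Passing the stream-URL and src values in that order would skip the chrome-extension entry and return the src. Whether the server then accepts the request outside the browser session is a separate question.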

huangapple
  • Posted on May 3, 2020, 23:59:19
  • When reposting, please keep the link to this article: https://java.coder-hub.com/61577502.html