How can I apply an ad blocker to HTML content fetched with a Java HttpClient GET request and parsed with Jsoup?

I am going to crawl newspaper sites and articles, but I don't want the ads. I want to apply ad blocking on top of my request, similar to browsing the web manually with an ad blocker enabled and then saving the HTML page without ads.

    // Apache HttpClient 4.x; proxy, config, sslContext, cm, url and content are defined elsewhere.
    DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);
    CloseableHttpClient httpClient = HttpClientBuilder.create()
            .setDefaultRequestConfig(this.config)
            .setRoutePlanner(routePlanner)
            .setSSLContext(sslContext) // note: can be overridden by the explicit connection manager below
            .setConnectionManager(cm)
            .setConnectionManagerShared(true)
            .build();

    HttpGet getRequest = new HttpGet(url);
    getRequest.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    try (CloseableHttpResponse response = httpClient.execute(getRequest)) {
        // getFirstHeader returns null when the header is absent, so check before calling getValue()
        Header contentTypeHeader = response.getFirstHeader("Content-Type");
        String headerContentType = contentTypeHeader != null ? contentTypeHeader.getValue() : null;
        if (headerContentType != null && headerContentType.contains("text/html")) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // Read the whole body as text, falling back to UTF-8 if no charset is declared
                content = EntityUtils.toString(entity, "utf-8");
                EntityUtils.consume(entity);
            }
        } else {
            // log fail event here
        }
    }

Now I have the HTML in the String content.
I parse it with org.jsoup.Jsoup.

    // Parse the fetched HTML and join the text of every <p> element
    Document contentDoc = Jsoup.parse(content);
    String contentstr = contentDoc.body()
            .getElementsByTag("p")
            .text();
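
The post itself contains no answer. One possible direction, sketched here under stated assumptions rather than as a definitive solution: because the page is already parsed with Jsoup, ad-related elements could be removed from the Document before the text is extracted. The helper below is hypothetical and uses only the JDK. It checks whether a resource URL belongs to a blocklisted ad host. The class name AdHostFilter, the method isAdUrl, and the two host names are illustrative assumptions; a real blocklist would more likely be loaded from a maintained filter list.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: decides whether a resource URL points at a blocklisted ad host. */
final class AdHostFilter {

    // Illustrative entries only (assumption); a real list would come from a maintained filter list.
    static final Set<String> BLOCKED_HOSTS = Set.of("doubleclick.net", "googlesyndication.com");

    /** Returns true if the URL's host equals a blocked host or is a subdomain of one. */
    static boolean isAdUrl(String url, Set<String> blockedHosts) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // unparseable URL: treat as not an ad
        }
        if (host == null) {
            return false; // relative URL or no authority part
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String blocked : blockedHosts) {
            if (host.equals(blocked) || host.endsWith("." + blocked)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isAdUrl("https://ad.doubleclick.net/banner.js", BLOCKED_HOSTS));
        System.out.println(isAdUrl("https://example.com/article.html", BLOCKED_HOSTS));
    }
}
```

With Jsoup, such a check might be applied by parsing with a base URI (Jsoup.parse(content, url)), iterating over contentDoc.select("[src]"), and calling remove() on each element whose absUrl("src") is reported as an ad URL, before calling getElementsByTag("p"). This only drops elements loaded from known ad hosts; ads served inline from the article's own host would need selector-based rules instead.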

huangapple
  • Posted on March 16, 2020 at 10:57:33
  • When reposting, please keep the link to this article: https://java.coder-hub.com/60699809.html