How to apply an ad blocker to HTML content fetched with a Java HttpClient GET request and parsed with Jsoup?


Question

I am going to crawl newspapers and articles, but I don't want the ads. I want to apply ad blocking on top of my request (similar to browsing the web manually with an ad blocker on, then saving the HTML page without ads).

    // Apache HttpClient 4.x; proxy, config, sslContext, cm, url and content are defined elsewhere.
    DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);
    CloseableHttpClient httpClient = HttpClientBuilder.create()
            .setDefaultRequestConfig(this.config)
            .setRoutePlanner(routePlanner)
            .setSSLContext(sslContext)
            .setConnectionManager(cm)
            .setConnectionManagerShared(true)
            .build();

    HttpGet getRequest = new HttpGet(url);
    getRequest.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    try (CloseableHttpResponse response = httpClient.execute(getRequest)) {
        // getFirstHeader returns null when the header is absent, so check it before calling getValue()
        Header contentTypeHeader = response.getFirstHeader("Content-Type");
        String headerContentType = contentTypeHeader != null ? contentTypeHeader.getValue() : null;
        if (headerContentType != null && headerContentType.contains("text/html")) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // Read the whole response body as UTF-8 text
                content = EntityUtils.toString(entity, "utf-8");
                EntityUtils.consume(entity);
            }
        } else {
            // log fail event here
        }
    }

Now I have the HTML as a String in content.
I parse it with org.jsoup.Jsoup and extract the text of the paragraph elements.

    Document contentDoc = Jsoup.parse(
            content
    );
    String contentstr = contentDoc.body()
            .getElementsByTag("p")
            .text();
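
Browser ad blockers largely work by matching resource URLs against filter lists, and that part can be approximated outside a browser. The following is a minimal sketch of a host-based check using only the JDK. The class name AdHostFilter and the two blocklist entries are hypothetical placeholders, not part of the original code; a real setup would presumably load the domains from a maintained filter list.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

/** Sketch of host-based ad filtering; the blocklist entries are placeholders. */
final class AdHostFilter {

    // Hypothetical blocklist; a real one would be loaded from a maintained filter list.
    private static final Set<String> BLOCKED_DOMAINS =
            Set.of("ads.example.com", "tracker.example.net");

    private AdHostFilter() {}

    /** Returns true if the URL's host is a blocked domain or a subdomain of one. */
    static boolean isBlocked(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: do not treat it as an ad
        }
        if (host == null) {
            return false; // relative URL or URL without a host
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : BLOCKED_DOMAINS) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isBlocked("https://ads.example.com/banner.js"));
        System.out.println(isBlocked("https://cdn.ads.example.com/a.gif"));
        System.out.println(isBlocked("https://news.example.org/story.html"));
    }
}
```

One possible way to combine this with the Jsoup code above: parse with Jsoup.parse(content, url) so that relative links resolve, then call remove() on each element matched by select("script[src], iframe[src], img[src]") whose absUrl("src") is blocked, before extracting the paragraph text. Since HttpClient does not execute JavaScript, ads injected by scripts never appear in the fetched HTML in the first place; only ad markup present in the static page would need to be stripped this way.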

huangapple
  • Posted on March 16, 2020, 10:57:33
  • When reposting, please keep the link to this article: https://java.coder-hub.com/60699809-2.html