Title
How do I apply an ad blocker to HTML content fetched with a Java HttpClient GET request and parsed with Jsoup?
Question
I am going to crawl newspapers and articles, but I don't want the ads. I want to apply an ad blocker on top of my request, similar to browsing the web manually with an ad blocker enabled and then saving the HTML page without ads.
// Route all requests through the configured proxy.
DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);
CloseableHttpClient httpClient = HttpClientBuilder.create()
        .setDefaultRequestConfig(this.config)
        .setRoutePlanner(routePlanner)
        .setSSLContext(sslContext)
        .setConnectionManager(cm)
        .setConnectionManagerShared(true)
        .build();

HttpGet getRequest = new HttpGet(url);
getRequest.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

try (CloseableHttpResponse response = httpClient.execute(getRequest)) {
    // getFirstHeader returns null when the header is absent, so check before calling getValue().
    // Header here is org.apache.http.Header.
    Header contentTypeHeader = response.getFirstHeader("Content-Type");
    String headerContentType = contentTypeHeader != null ? contentTypeHeader.getValue() : null;
    if (headerContentType != null && headerContentType.contains("text/html")) {
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            // toString reads the entity to the end and closes its stream,
            // so a separate EntityUtils.consume(entity) call is not needed.
            content = EntityUtils.toString(entity, "utf-8");
        }
    } else {
        // log fail event here
    }
}
Now I have String content holding the HTML.
I parse the content with org.jsoup.Jsoup.
Document contentDoc = Jsoup.parse(content);
String contentstr = contentDoc.body()
        .getElementsByTag("p")
        .text();
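For illustration, one building block of what I mean by "ad block" might be a host blocklist check like the sketch below. This is only a minimal sketch under assumptions: the class name AdHostFilter, the method isBlocked, and the example hosts are all hypothetical, and a real ad blocker uses much richer filter lists (URL patterns, element-hiding rules) than a plain set of host names.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides whether a URL points at a blocklisted host.
public class AdHostFilter {
    // Assumed to hold lower-case host names such as "ads.example.com".
    private final Set<String> blockedHosts;

    public AdHostFilter(Set<String> blockedHosts) {
        this.blockedHosts = blockedHosts;
    }

    // Returns true if the URL's host, or any parent domain of it, is on the blocklist.
    public boolean isBlocked(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // unparsable URL: leave the element alone
        }
        if (host == null) {
            return false; // relative URL or no host component
        }
        host = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (blockedHosts.contains(host)) {
                return true;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return false;
            }
            host = host.substring(dot + 1); // try the parent domain next
        }
    }

    public static void main(String[] args) {
        // Example hosts are made up for the demonstration.
        AdHostFilter filter = new AdHostFilter(Set.of("ads.example.com", "tracker.example.net"));
        System.out.println(filter.isBlocked("https://cdn.ads.example.com/banner.js"));
        System.out.println(filter.isBlocked("https://news.example.org/story.html"));
    }
}
```

With Jsoup, such a check could presumably be applied before extracting the text: parse with a base URI via Jsoup.parse(content, url), select elements such as script[src], iframe[src], img[src], resolve each with absUrl("src"), and call remove() on the ones where isBlocked returns true. That would not cover ad markup identified only by class or id names.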