2020-04-06 10:37:31

JSOUP missing tag when converting html row

Question

I'm having a problem with jsoup. I want to parse a table row that I will later insert into another HTML document, but when I inspect the parsed result (shown after the code), the <tr> and <td> tags are missing. How can I solve this?

String htmlcontent = "<tr><td colspan=\"2\"><div class=\"content-wrapper\"><p><strong><span class=\"CLASS 1 CLASS 2 CLASS 3\">123</span></strong><br /><strong>DATA 1</strong></p></td><td></td><td></td><td></td><td></td></tr>";

Document docnewinput = Jsoup.parse(htmlcontent, "UTF-8");

[<html>
 <head></head>
 <body>
  <div class="content-wrapper">
   <p><strong><span class="CLASS 1 CLASS 2 CLASS 3">123</span></strong><br><strong>DATA 1</strong></p>
  </div>
 </body>
</html>]

Answer 1

Score: 0

You have a fragment of body HTML (e.g. a div containing a couple of p tags; as opposed to a full HTML document) that you want to parse.

Use the Jsoup.parseBodyFragment(String html) method, and wrap the row in a <table> element first:

String html = "<table><tr><td colspan=\"2\"><div class=\"content-wrapper\"><p><strong><span class=\"CLASS 1 CLASS 2 CLASS 3\">123</span></strong><br /><strong>DATA 1</strong></p></td><td></td><td></td><td></td><td></td></tr></table>";
Document doc = Jsoup.parseBodyFragment(html);

The parseBodyFragment method creates an empty shell document and inserts the parsed HTML into the body element. The normal Jsoup.parse(String html) method generally gives the same result, but explicitly treating the input as a body fragment ensures that any malformed HTML provided by the user is parsed into the body element.

The parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It handles:
unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
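implicit tags (e.g. a <tr> placed directly inside a <table> gets an implied <tbody> parent; note that a naked <tr> or <td> outside a <table> is dropped by the HTML parser, which is why the snippet above wraps the row in <table>)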
reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
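
The behaviors listed above can be checked with a small sketch. The class name ParserCleanupDemo, its helper methods, and the sample strings are illustrative assumptions and not part of the original answer; only Jsoup.parseBodyFragment and the select calls come from the jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParserCleanupDemo {

    // Counts <p> elements after parsing markup whose <p> tags are never closed.
    static int countParagraphs(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.select("p").size();
    }

    // Reports whether a <tr> written directly inside <table> ends up under an implied <tbody>.
    static boolean hasImpliedTbody(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return !doc.select("table > tbody > tr").isEmpty();
    }

    public static void main(String[] args) {
        // Unclosed tags: each <p> becomes its own element.
        System.out.println(countParagraphs("<p>Lorem <p>Ipsum"));

        // Implicit tags: the parser supplies the missing <tbody>.
        System.out.println(hasImpliedTbody("<table><tr><td>Table data</td></tr></table>"));

        // Document structure: the shell document has both a head and a body.
        Document shell = Jsoup.parseBodyFragment("<p>Lorem");
        System.out.println(shell.head() != null && shell.body() != null);
    }
}
```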

EDIT:

By using Jsoup.parse():

String html = "<table><tr><td colspan=\"2\"><div class=\"content-wrapper\"><p><strong><span class=\"CLASS 1 CLASS 2 CLASS 3\">123</span></strong><br /><strong>DATA 1</strong></p></td><td></td><td></td><td></td><td></td></tr></table>";
Document doc = Jsoup.parse(html);

Working Demo: https://try.jsoup.org/~EdJSrHl_biDcQkyhL2BLH5ZNnck
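
Since the question is about inserting the row into another document, the following is a minimal sketch of that last step, assuming the row arrives as a bare <tr>...</tr> string. The class RowFragmentDemo, the parseRow helper, and the target table with id "target" are hypothetical names chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class RowFragmentDemo {

    // Wraps a bare <tr>...</tr> string in <table> so the HTML parser keeps
    // the row, then returns the parsed <tr> element.
    static Element parseRow(String rowHtml) {
        Document doc = Jsoup.parseBodyFragment("<table>" + rowHtml + "</table>");
        return doc.selectFirst("tr");
    }

    public static void main(String[] args) {
        String rowHtml = "<tr><td colspan=\"2\"><p><strong>DATA 1</strong></p></td><td></td></tr>";
        Element row = parseRow(rowHtml);

        // Hypothetical target document that should receive the row.
        Document target = Jsoup.parse("<table id=\"target\"><tbody></tbody></table>");

        // appendChild moves the element out of the temporary document
        // and into the target table body.
        target.selectFirst("#target tbody").appendChild(row);

        System.out.println(target.body().html());
    }
}
```

The temporary <table> wrapper exists only to give the parser a valid context for the row; it is discarded once the <tr> has been moved.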

Answer 2

Score: 0

Use Parser.xmlParser() so that jsoup reads the string as it is, without restructuring it into a full HTML document.
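
A minimal sketch of this approach, assuming the three-argument Jsoup.parse(String html, String baseUri, Parser parser) overload with an empty base URI. The class XmlParserRowDemo and the parseAsXml helper are hypothetical names for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

class XmlParserRowDemo {

    // Parses markup with the XML parser, which does not apply HTML table
    // rules and does not add an html/head/body shell around the input.
    static Document parseAsXml(String markup) {
        return Jsoup.parse(markup, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        String row = "<tr><td colspan=\"2\">DATA 1</td><td></td></tr>";
        Document doc = parseAsXml(row);

        // The <tr> and <td> elements are kept as written.
        System.out.println(doc.select("tr").size());
        System.out.println(doc.select("td").size());
        System.out.println(doc.outerHtml());
    }
}
```

One trade-off to keep in mind: the XML parser also skips the HTML-specific cleanup described in the first answer, so this route suits input that is already well-formed.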
