Jsoup Code Walkthrough, Part 3: Document Output

Published: 2023/12/3

Reposted from: Jsoup Code Walkthrough, Part 3: Document Output

The official Jsoup documentation lists ***output tidy HTML*** as one of its key features. This post looks at how Jsoup produces that output.

HTML background

Before reading the code, it is worth asking what "tidy HTML" actually involves:

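Line breaks: block-level tags conventionally get a line of their own
Indentation: the leading indent of a line depends on how deeply the tag is nested
Strict tag closing: a tag that is allowed to self-close and has no content is written in self-closed form
Escaping of HTML entities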

Some background on HTML tags is needed here. HTML tags fall into two classes, block and inline. For the definitions of inline and block, see http://www.w3schools.com/html/html_blocks.asp. Jsoup's Tag class is also excellent study material for Java developers.

    // internal static initialisers:
    // prepped from http://www.w3.org/TR/REC-html40/sgml/dtd.html and other sources

    // block tags: each starts on a new line
    private static final String[] blockTags = {
            "html", "head", "body", "frameset", "script", "noscript", "style", "meta", "link", "title", "frame",
            "noframes", "section", "nav", "aside", "hgroup", "header", "footer", "p", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "pre", "div", "blockquote", "hr", "address", "figure", "figcaption", "form", "fieldset", "ins",
            "del", "s", "dl", "dt", "dd", "li", "table", "caption", "thead", "tfoot", "tbody", "colgroup", "col", "tr", "th",
            "td", "video", "audio", "canvas", "details", "menu", "plaintext"
    };

    // inline tags: no line break needed
    private static final String[] inlineTags = {
            "object", "base", "font", "tt", "i", "b", "u", "big", "small", "em", "strong", "dfn", "code", "samp", "kbd",
            "var", "cite", "abbr", "time", "acronym", "mark", "ruby", "rt", "rp", "a", "img", "br", "wbr", "map", "q",
            "sub", "sup", "bdo", "iframe", "embed", "span", "input", "select", "textarea", "label", "button", "optgroup",
            "option", "legend", "datalist", "keygen", "output", "progress", "meter", "area", "param", "source", "track",
            "summary", "command", "device"
    };

    // emptyTags cannot have content; all of them may be self-closed
    private static final String[] emptyTags = {
            "meta", "link", "base", "frame", "img", "br", "wbr", "embed", "hr", "input", "keygen", "col", "command",
            "device"
    };

    private static final String[] formatAsInlineTags = {
            "title", "a", "p", "h1", "h2", "h3", "h4", "h5", "h6", "pre", "address", "li", "th", "td", "script", "style",
            "ins", "del", "s"
    };

    // whitespace must be preserved inside these tags
    private static final String[] preserveWhitespaceTags = {
            "pre", "plaintext", "title", "textarea"
    };
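To make the role of these arrays more concrete, here is a minimal sketch of a lookup built from a few of the entries listed above. The class `TagTraits` and its method names are hypothetical and chosen for illustration only; this is not Jsoup's actual Tag API, and each set holds just a subset of the corresponding list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration: not jsoup's Tag class, and only a subset of each list.
class TagTraits {
    private static final Set<String> BLOCK = new HashSet<>(Arrays.asList(
            "html", "head", "body", "div", "p", "ul", "li", "table", "hr"));
    private static final Set<String> EMPTY = new HashSet<>(Arrays.asList(
            "meta", "link", "img", "br", "hr", "input"));
    private static final Set<String> PRESERVE_WHITESPACE = new HashSet<>(Arrays.asList(
            "pre", "plaintext", "title", "textarea"));

    /** Block tags start on a new line when the output is pretty-printed. */
    static boolean isBlock(String tagName) {
        return BLOCK.contains(tagName.toLowerCase(Locale.ROOT));
    }

    /** Empty tags cannot hold content, so they may be written self-closed. */
    static boolean isEmpty(String tagName) {
        return EMPTY.contains(tagName.toLowerCase(Locale.ROOT));
    }

    /** Whitespace inside these tags must be emitted unchanged. */
    static boolean preservesWhitespace(String tagName) {
        return PRESERVE_WHITESPACE.contains(tagName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"div", "span", "br", "pre"}) {
            System.out.println(name + ": block=" + isBlock(name)
                    + " empty=" + isEmpty(name)
                    + " preserveWhitespace=" + preservesWhitespace(name));
        }
    }
}
```

A tag can carry several traits at once: "hr" appears in both the block list and the empty list, so it gets its own line and is also written self-closed.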

In addition, Jsoup's Entities class holds the logic for escaping HTML entities. The mapping data lives in entities-full.properties and entities-base.properties.
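As a rough idea of what entity escaping means, the sketch below replaces four characters with their named entities. It is an assumption-laden illustration: `EntityEscaper` is a hypothetical class, the table is hard-coded rather than loaded from the properties files, and it does not reproduce how the Entities class itself works.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: four hard-coded mappings instead of the properties files.
class EntityEscaper {
    private static final Map<Character, String> BASE = new HashMap<>();
    static {
        BASE.put('&', "&amp;");
        BASE.put('<', "&lt;");
        BASE.put('>', "&gt;");
        BASE.put('"', "&quot;");
    }

    /** Returns text with each mapped character replaced by its entity. */
    static String escape(String text) {
        StringBuilder accum = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String entity = BASE.get(c);
            if (entity != null)
                accum.append(entity);
            else
                accum.append(c);
        }
        return accum.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b && \"c\""));
    }
}
```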

How Jsoup implements formatting

In Jsoup, calling Document.toString() (inherited from Element) outputs the document. OutputSettings controls the output format, mainly through prettyPrint (whether to reformat the output), outline (whether to force every tag onto a new line), and indentAmount (the indent width).

The inheritance and call relationships are slightly involved. Roughly:

Document.toString() => Document.outerHtml() => Element.html(). Element.html() then loops over all child nodes, calls outerHtml() on each, and concatenates the results as the output.

    private void html(StringBuilder accum) {
        for (Node node : childNodes)
            node.outerHtml(accum);
    }

outerHtml(), in turn, uses an OuterHtmlVisitor to traverse all child nodes and assemble the result.

    protected void outerHtml(StringBuilder accum) {
        new NodeTraversor(new OuterHtmlVisitor(accum, getOutputSettings())).traverse(this);
    }

OuterHtmlVisitor visits every node in that traversal and calls two methods on it: node.outerHtmlHead() and node.outerHtmlTail().

    private static class OuterHtmlVisitor implements NodeVisitor {
        private StringBuilder accum;
        private Document.OutputSettings out;

        // (the constructor that stores accum and out is not shown in this excerpt)

        public void head(Node node, int depth) {
            node.outerHtmlHead(accum, depth, out);
        }

        public void tail(Node node, int depth) {
            if (!node.nodeName().equals("#text")) // saves a void hit.
                node.outerHtmlTail(accum, depth, out);
        }
    }
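The head/tail pattern can be tried out in isolation. The sketch below is a hypothetical miniature, not Jsoup's classes: `Node`, `Visitor` and `traverse` are stand-ins invented for this example, and the walk is recursive for brevity, which is an assumption rather than a statement about how NodeTraversor is written. It shows that head() fires before a node's children are visited and tail() after them, which is what lets opening and closing tags wrap the children.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the head/tail visitor pattern; not jsoup's classes.
class TraversalSketch {
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    interface Visitor {
        void head(Node node, int depth);
        void tail(Node node, int depth);
    }

    // Depth-first walk: head() before the children, tail() after them.
    static void traverse(Node node, int depth, Visitor visitor) {
        visitor.head(node, depth);
        for (Node child : node.children)
            traverse(child, depth + 1, visitor);
        visitor.tail(node, depth);
    }

    // Builds markup by appending an opening tag in head() and a closing tag in tail().
    static String render(Node root) {
        final StringBuilder accum = new StringBuilder();
        traverse(root, 0, new Visitor() {
            @Override
            public void head(Node node, int depth) {
                accum.append('<').append(node.name).append('>');
            }

            @Override
            public void tail(Node node, int depth) {
                accum.append("</").append(node.name).append('>');
            }
        });
        return accum.toString();
    }

    public static void main(String[] args) {
        Node root = new Node("div").add(new Node("p")).add(new Node("span"));
        System.out.println(render(root)); // <div><p></p><span></span></div>
    }
}
```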

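This finally brings us to the code that does the real work: node.outerHtmlHead() and node.outerHtmlTail(). Each Node type in Jsoup is output in its own way; of the two main node types, Element and TextNode, Element is the main target of formatting. Its two methods are: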

    void outerHtmlHead(StringBuilder accum, int depth, Document.OutputSettings out) {
        if (accum.length() > 0 && out.prettyPrint()
                && (tag.formatAsBlock() || (parent() != null && parent().tag().formatAsBlock()) || out.outline()))
            // start a new line and adjust the indentation
            indent(accum, depth, out);
        accum.append("<").append(tagName());
        attributes.html(accum, out);

        if (childNodes.isEmpty() && tag.isSelfClosing())
            accum.append(" />");
        else
            accum.append(">");
    }

    void outerHtmlTail(StringBuilder accum, int depth, Document.OutputSettings out) {
        if (!(childNodes.isEmpty() && tag.isSelfClosing())) {
            if (out.prettyPrint() && (!childNodes.isEmpty() && (tag.formatAsBlock()
                    || (out.outline() && (childNodes.size() > 1
                            || (childNodes.size() == 1 && !(childNodes.get(0) instanceof TextNode)))))))
                // start a new line and adjust the indentation
                indent(accum, depth, out);
            accum.append("</").append(tagName()).append(">");
        }
    }
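To see these rules produce actual output, here is a stripped-down sketch under stated assumptions. `MiniElement` is a hypothetical class: attributes, text nodes, the outline option and the prettyPrint switch are all left out, the indent is fixed at one space per level, and the `block` and `selfClosing` flags merely stand in for tag.formatAsBlock() and tag.isSelfClosing().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the element rules; attributes, text nodes and outline mode are omitted.
class MiniElement {
    final String name;
    final boolean block;       // stands in for tag.formatAsBlock()
    final boolean selfClosing; // stands in for tag.isSelfClosing()
    final List<MiniElement> children = new ArrayList<>();
    MiniElement parent;

    MiniElement(String name, boolean block, boolean selfClosing) {
        this.name = name;
        this.block = block;
        this.selfClosing = selfClosing;
    }

    MiniElement add(MiniElement child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    // New line followed by one space per nesting level.
    private static void indent(StringBuilder accum, int depth) {
        accum.append('\n');
        for (int i = 0; i < depth; i++)
            accum.append(' ');
    }

    void outerHtml(StringBuilder accum, int depth) {
        // Head: break the line if this element or its parent is a block.
        if (accum.length() > 0 && (block || (parent != null && parent.block)))
            indent(accum, depth);
        boolean selfClose = children.isEmpty() && selfClosing;
        accum.append('<').append(name).append(selfClose ? " />" : ">");

        for (MiniElement child : children)
            child.outerHtml(accum, depth + 1);

        // Tail: a self-closed element has no closing tag.
        if (!selfClose) {
            if (!children.isEmpty() && block)
                indent(accum, depth);
            accum.append("</").append(name).append('>');
        }
    }

    static String demo() {
        MiniElement div = new MiniElement("div", true, false)
                .add(new MiniElement("p", true, false))
                .add(new MiniElement("br", false, true));
        StringBuilder accum = new StringBuilder();
        div.outerHtml(accum, 0);
        return accum.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo, the br element is not a block itself but still lands on its own line because its parent is one, and it is written as `<br />` with no closing tag.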

The body of the indent method is a single line:

    protected void indent(StringBuilder accum, int depth, Document.OutputSettings out) {
        // out.indentAmount() is the indent width; the default is 1
        accum.append("\n").append(StringUtil.padding(depth * out.indentAmount()));
    }

The code is straightforward and needs little comment. One detail is worth noting: to cut down on string creation, StringUtil.padding() keeps the commonly used indents in an array.
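The caching idea can be sketched as follows. This is a guess at the general shape, not the actual StringUtil source: `PaddingCache` is a hypothetical class and the cache size of 11 is an arbitrary assumption.

```java
import java.util.Arrays;

// Hypothetical sketch of caching common indents; the cache size is an assumption.
class PaddingCache {
    private static final int CACHED = 11;
    private static final String[] PADDING = new String[CACHED];
    static {
        for (int i = 0; i < CACHED; i++)
            PADDING[i] = spaces(i);
    }

    // Builds a fresh string of the given number of spaces.
    private static String spaces(int width) {
        char[] out = new char[width];
        Arrays.fill(out, ' ');
        return new String(out);
    }

    /** Returns a cached string for small widths and builds a new one otherwise. */
    static String padding(int width) {
        if (width < 0)
            throw new IllegalArgumentException("width must be >= 0");
        return width < PADDING.length ? PADDING[width] : spaces(width);
    }

    public static void main(String[] args) {
        System.out.println("[" + padding(4) + "]");
        System.out.println(padding(4) == padding(4)); // same cached instance
    }
}
```

Because shallow nesting is by far the most common case, almost every call to indent hits the array and allocates nothing.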

That wraps up a fairly light post. The next one covers the parser, which is more technically demanding.

One more takeaway from this section: name your StringBuilder accum, not sb.
