Jsoup - extracting text - java

I need to extract text from a node like this:
<div>
Some text <b>with tags</b> might go here.
<p>Also there are paragraphs</p>
More text can go without paragraphs<br/>
</div>
And I need to build:
Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs
Element.text() returns the entire text content of the div, and Element.ownText() returns only the text that is not inside child elements. Neither is what I need, and iterating through children() skips text nodes.
Is there a way to iterate over the contents of an element so that text nodes are included as well? E.g.
Text node - Some text
Node <b> - with tags
Text node - might go here.
Node <p> - Also there are paragraphs
Text node - More text can go without paragraphs
Node <br> - <empty>

Element.children() returns an Elements object - a list of Element objects. Looking at the parent class, Node, you'll see methods to give you access to arbitrary nodes, not just Elements, such as Node.childNodes().
public static void main(String[] args) throws IOException {
String str = "<div>" +
" Some text <b>with tags</b> might go here." +
" <p>Also there are paragraphs</p>" +
" More text can go without paragraphs<br/>" +
"</div>";
Document doc = Jsoup.parse(str);
Element div = doc.select("div").first();
int i = 0;
for (Node node : div.childNodes()) {
i++;
System.out.println(String.format("%d %s %s",
i,
node.getClass().getSimpleName(),
node.toString()));
}
}
Result:
1 TextNode
Some text
2 Element <b>with tags</b>
3 TextNode might go here.
4 Element <p>Also there are paragraphs</p>
5 TextNode More text can go without paragraphs
6 Element <br/>
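
Building on childNodes(), the lines asked for in the question can be assembled by collecting text nodes and inline elements until a block element or a <br> ends the current line. The following is only a minimal sketch, not part of the original answer: the class name BlockLines and its helper methods are made up for illustration, and it assumes the input contains at least one <div>.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper for illustration; not part of jsoup.
final class BlockLines {

    /** Returns one entry per run of inline content or block child of the first div. */
    static List<String> fromHtml(String html) {
        // Assumption: the input contains at least one <div>.
        Element div = Jsoup.parse(html).select("div").first();
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Node node : div.childNodes()) {
            if (node instanceof TextNode) {
                // text() collapses runs of whitespace into single spaces
                current.append(((TextNode) node).text());
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.isBlock() || el.tagName().equals("br")) {
                    flush(current, lines);           // a block or <br> ends the current line
                    addIfNotBlank(el.html(), lines); // a block's content becomes its own line
                } else {
                    current.append(el.outerHtml());  // inline element: keep its tags
                }
            }
        }
        flush(current, lines);
        return lines;
    }

    private static void flush(StringBuilder current, List<String> lines) {
        addIfNotBlank(current.toString(), lines);
        current.setLength(0);
    }

    private static void addIfNotBlank(String s, List<String> lines) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            lines.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String html = "<div> Some text <b>with tags</b> might go here."
                + " <p>Also there are paragraphs</p>"
                + " More text can go without paragraphs<br/></div>";
        fromHtml(html).forEach(System.out::println);
    }
}
```

For the div from the question this should give the three requested lines. Block children are emitted with their inner HTML unchanged, so nested blocks would need further handling.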

for (Element el : doc.select("body").select("*")) {
    for (TextNode node : el.textNodes()) {
        System.out.println(node.text());
    }
}

Assuming you want text only (no tags) my solution is below.
Output is:
Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs
public static void main(String[] args) throws IOException {
String str =
"<div>"
+ " Some text <b>with tags</b> might go here."
+ " <p>Also there are paragraphs.</p>"
+ " More text can go without paragraphs<br/>"
+ "</div>";
Document doc = Jsoup.parse(str);
Element div = doc.select("div").first();
StringBuilder builder = new StringBuilder();
stripTags(builder, div.childNodes());
System.out.println("Text without tags: " + builder.toString());
}
/**
* Strip tags from a List of type <code>Node</code>
* @param builder StringBuilder : input and output
* @param nodesList List of type <code>Node</code>
*/
public static void stripTags (StringBuilder builder, List<Node> nodesList) {
for (Node node : nodesList) {
String nodeName = node.nodeName();
if (nodeName.equalsIgnoreCase("#text")) {
builder.append(node.toString());
} else {
// recurse
stripTags(builder, node.childNodes());
}
}
}

You can use TextNode for this purpose:
List<TextNode> bodyTextNode = doc.getElementById("content").textNodes();
String html = "";
for(TextNode txNode:bodyTextNode){
html+=txNode.text();
}
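
Note that textNodes() returns only the direct text children of an element, so text inside nested elements such as <b> or <p> is left out, much like with ownText(). A small sketch to compare the two results on the markup from the question; the class name TextVariants and its methods are illustrative only:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

// Hypothetical comparison helper; not part of jsoup.
final class TextVariants {

    private static Element firstDiv(String html) {
        // Assumption: the input contains at least one <div>.
        return Jsoup.parse(html).select("div").first();
    }

    /** Joins only the direct text nodes of the first div, as the loop above does. */
    static String directText(String html) {
        StringBuilder sb = new StringBuilder();
        for (TextNode textNode : firstDiv(html).textNodes()) {
            sb.append(textNode.text());
        }
        return sb.toString();
    }

    /** Returns the combined text of the first div and all of its descendants. */
    static String allText(String html) {
        return firstDiv(html).text();
    }

    public static void main(String[] args) {
        String html = "<div> Some text <b>with tags</b> might go here."
                + " <p>Also there are paragraphs</p>"
                + " More text can go without paragraphs<br/></div>";
        System.out.println("textNodes(): " + directText(html));
        System.out.println("text():      " + allText(html));
    }
}
```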

Related

Parsing html in Jsoup

I am trying to parse HTML tags using jsoup, which I am new to. Basically, I need to get the text inside each tag and apply the style named in its class attribute.
I am building a SpannableStringBuilder so that I can create substrings, apply styles to them, and append them together with the text that has no style.
String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
SpannableStringBuilder text = new SpannableStringBuilder();
if (value.contains("</span>")) {
Document document = Jsoup.parse(value);
Elements elements = document.getElementsByTag("span");
if (elements != null) {
int i = 0;
int start = 0;
for (Element ele : elements) {
String styleName = type + "." + ele.attr("class");
text.append(ele.text());
int style = context.getResources().getIdentifier(styleName, "style", context.getPackageName());
text.setSpan(new TextAppearanceSpan(context, style), start, text.length(), Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
text.append(ele.nextSibling().toString());
start = text.length();
i++;
}
}
return text;
}
I am not sure how to get the strings that are not inside any tag, such as "There are" and "workers from the".
I need output such as:
- There are
- <span class='newStyle'> two </span>
- workers from the
- <span class='oldStyle'>Front of House</span>
Full answer: you can get the text outside of the tags by calling childNodes(), which returns a List<Node>. Note that I'm selecting body because your HTML fragment has no parent element, and jsoup adds <html> and <body> automatically when it parses a fragment.
If a Node contains only text, it is a TextNode and you can get its content using toString().
Otherwise you can cast it to Element and get the text using element.text().
String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
Document doc = Jsoup.parse(str);
Element body = doc.selectFirst("body");
List<Node> childNodes = body.childNodes();
for (int i = 0; i < childNodes.size(); i++) {
Node node = childNodes.get(i);
if (node instanceof TextNode) {
System.out.println(i + " -> " + node.toString());
} else {
Element element = (Element) node;
System.out.println(i + " -> " + element.text());
}
}
output:
0 ->
There are
1 -> two
2 -> workers from the
3 -> Front of House
By the way: I don't know how to get rid of the first line break before There are.
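
The leading line break appears to come from toString(), which returns the node's pretty-printed outer HTML. Reading the content through TextNode.getWholeText() (or text(), which also collapses whitespace) should avoid it. A minimal sketch under that assumption; the class name Segments and the "text|class" output format are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper for illustration; not part of jsoup.
final class Segments {

    /** Returns "text|cssClass" for each child of body; the class part is empty for plain text. */
    static List<String> split(String html) {
        List<String> parts = new ArrayList<>();
        for (Node node : Jsoup.parse(html).body().childNodes()) {
            if (node instanceof TextNode) {
                // getWholeText() returns the raw text, without pretty-print line breaks
                parts.add(((TextNode) node).getWholeText() + "|");
            } else if (node instanceof Element) {
                Element el = (Element) node;
                parts.add(el.text() + "|" + el.className());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String str = "There are <span class='newStyle'> two </span> workers from the "
                + "<span class='oldStyle'>Front of House</span>";
        split(str).forEach(System.out::println);
    }
}
```

Each entry keeps the class name next to its text, so the styled and unstyled pieces can be appended to a SpannableStringBuilder in document order.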

JSoup search by attribute and class

You can do:
Elements links = doc.select("a[href]");
to find all "a" elements with an href attribute.
And you can do:
doc.getElementsByClass("title")
to get all elements with a class that is called "title"
But how can I do both? (I.e. search for an "a" element with an "href" attribute that also has the class "title".)
You can simply have
Elements links = doc.select("a[href].title");
This selects every <a> element that has an href attribute and the class title. The class is specified by prefixing it with a dot:
Selector combinations
Any combination, e.g. a[href].highlight
Full example:
public static void main(String[] args) {
Document doc = Jsoup.parse(""
+ "<div>"
+ " <a href='link1' class='title another'>Link 1</a>"
+ " <a href='link2' class='another'>Link 2</a>"
+ " <a href='link3'>Link 3</a>"
+ "</div>");
Elements links = doc.select("a[href].title");
System.out.println(links); // prints only the <a> element for "Link 1"
}

Jsoup not selector not returning result

I am trying to use a jsoup selector to select everything in a div with class 'content' while excluding any divs with class 'social' or 'media'. I know I could do a simple select and loop, but I expected the :not function to work for this. Perhaps my selector syntax is wrong.
public static void main(String args[]) throws ParseException {
    String html = "<html>\n" +
            "<body>\n" +
            "<div class=\"content\">\n" +
            "\t<p>some paragraph</p>\n" +
            "\t<div class=\"social media\">\n" +
            "\tfind us on facebook\n" +
            "\t</div>\n" +
            "</div>\n" +
            "</body>\n" +
            "</html>";
    Document doc = Jsoup.parse(html);
    Elements elements = doc.select("div.content div:not(.social)");
    System.out.println(elements.text());
}
Expected result: "some paragraph"
Actual result: null
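Your selector, as written, matches div elements that do not have the class social and are descendants of the div with class content. The only nested div has that class, and the <p> is not a div, so nothing matches. To get the expected outcome, use this: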
Elements elements = doc.select("div.content :not(.social)");
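Or filter an already selected list with Elements.not(), which removes matching elements from that list, so the descendants have to be selected first:
Elements elements = doc.select("div.content *").not(".social");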
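
To check what each variant matches, here is a small sketch (the class name NotSelectorDemo and its method names are illustrative). It relies on Elements.not(query) removing matching elements from the list it is called on, which is why calling it directly on the outer div leaves that div, and all of its text, in place.

```java
import org.jsoup.Jsoup;

// Hypothetical demo class; not part of jsoup.
final class NotSelectorDemo {

    static final String HTML =
            "<div class=\"content\">"
          + "<p>some paragraph</p>"
          + "<div class=\"social media\">find us on facebook</div>"
          + "</div>";

    /** The selector from the question: nested divs without the class social. */
    static String original(String html) {
        return Jsoup.parse(html).select("div.content div:not(.social)").text();
    }

    /** Any descendant of div.content that does not have the class social. */
    static String viaPseudo(String html) {
        return Jsoup.parse(html).select("div.content :not(.social)").text();
    }

    /** Select all descendants first, then filter the list. */
    static String viaFilter(String html) {
        return Jsoup.parse(html).select("div.content *").not(".social").text();
    }

    /** Filtering the outer div itself: it does not match .social, so it stays. */
    static String wholeDiv(String html) {
        return Jsoup.parse(html).select("div.content").not(".social").text();
    }

    public static void main(String[] args) {
        System.out.println("original:  [" + original(HTML) + "]");
        System.out.println("viaPseudo: [" + viaPseudo(HTML) + "]");
        System.out.println("viaFilter: [" + viaFilter(HTML) + "]");
        System.out.println("wholeDiv:  [" + wholeDiv(HTML) + "]");
    }
}
```

Note that Elements.text() returns an empty string, not null, when nothing matches.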

Extract HTML from <!-- --> comment to a closing tag using jsoup java

I have some HTML that looks like
<!-- start content -->
<p>Blah...</p>
<dl><dd>blah</dd></dl>
I need to extract the HTML from the comment to a closing dl tag. The closing dl is the first one after the comment (there may be more after it, but there is never one before it). The HTML between the two is variable in length and content and doesn't have any good identifiers.
I see that comments themselves can be selected using #comment nodes, but how would I get the HTML starting from a comment and ending with an HTML close tag as I've described?
Here's what I've come up with. It works, but it is obviously not the most efficient.
String myDirectoryPath = "D:\\Path";
File dir = new File(myDirectoryPath);
Pattern p = Pattern.compile("<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");
for (File child : dir.listFiles()) {
    System.out.println(child.getAbsolutePath());
    String charSet = "UTF-8";
    String innerHtml = Jsoup.parse(child, charSet).select("body").html();
    Matcher m = p.matcher(innerHtml);
    if (m.find()) {
        Document doc = Jsoup.parse(m.group(1));
        String myText = doc.text();
        // try-with-resources closes the writer even if writing fails
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("D:\\Path\\combined.txt", true)))) {
            out.println(myText);
        } catch (IOException e) {
            // error
        }
    }
}
If you want to use a regex, something simple like this may do:
# "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>"
<!-- \s* start \s* content \s* -->
([\S\s]*?)
</ \s* dl \s* >
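
In Java the expanded form can be kept as written by compiling with Pattern.COMMENTS, which ignores whitespace and # comments inside the pattern. A minimal sketch using only java.util.regex; the class name CommentToDl is made up, and like any regex over HTML it assumes the markup is regular enough for the pattern to match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
final class CommentToDl {

    // With Pattern.COMMENTS, whitespace and "# ..." comments in the pattern are ignored.
    private static final Pattern BLOCK = Pattern.compile(
            "<!-- \\s* start \\s* content \\s* -->   # the marker comment\n"
          + "([\\S\\s]*?)                            # anything, as little as possible\n"
          + "</ \\s* dl \\s* >                       # the first closing dl tag\n",
            Pattern.COMMENTS);

    /** Returns the HTML between the marker comment and the first closing dl tag, or null. */
    static String extract(String html) {
        Matcher m = BLOCK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p>abc</p><!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl><p>def</p>";
        System.out.println(extract(html));
    }
}
```

The captured group stops before the closing dl tag, so that tag has to be appended again if the result must be well-formed.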
Here's some example code - it may need further improvements - depending on what you want to do.
final String html = "<p>abc</p>" // Additional tag before the comment
+ "<!-- start content -->\n"
+ "<p>Blah...</p>\n"
+ "<dl><dd>blah</dd></dl>"
+ "<p>def</p>"; // Additional tag after the comment
// Since it's not a full HTML document (head / body), you can use the XML parser
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for( Node node : doc.childNodes() ) // Iterate over the top-level nodes of the document
{
if( node.nodeName().equals("#comment") ) // if it's a comment we do something
{
// Some output for testing ...
System.out.println("=== Comment =======");
System.out.println(node.toString().trim()); // 'toString().trim()' is only there to tidy up the output
System.out.println("=== Childs ========");
// Get the siblings of the comment (the comment itself is not included) --> the following nodes are among them
final List<Node> childNodes = node.siblingNodes();
// Start- and endindex for the sublist - this is used to skip tags before the actual comment node
final int startIdx = node.siblingIndex(); // Start index - start after (!) the comment node
final int endIdx = childNodes.size(); // End index - the last following node
// Iterate over all nodes, following after the comment
for( Node child : childNodes.subList(startIdx, endIdx) )
{
/*
* Do whatever you have to do with the nodes here ...
* In this example, they are only used as Element's (Html Tags)
*/
if( child instanceof Element )
{
Element element = (Element) child;
/*
* Do something with your elements / nodes here ...
*
* You can skip e.g. 'p'-tag by checking tagnames.
*/
System.out.println(element);
// Stop after processing 'dl'-tag (= closing 'dl'-tag)
if( element.tagName().equals("dl") )
{
System.out.println("=== END ===========");
break;
}
}
}
}
}
The code is deliberately detailed to make it easier to follow; you can shorten it in places.
And finally, here's the output of this example:
=== Comment =======
<!-- start content -->
=== Childs ========
<p>Blah...</p>
<dl>
<dd>
blah
</dd>
</dl>
=== END ===========
Btw. to get the text of the comment, just cast it to Comment:
String commentText = ((Comment) node).getData();
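
A shorter variant of the same idea walks the siblings with nextSibling(). This is only a sketch under a few assumptions: it uses the regular HTML parser, it only looks for the marker comment among the direct children of body, and the class name AfterComment is made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// Hypothetical helper for illustration; not part of jsoup.
final class AfterComment {

    /**
     * Returns the outer HTML of every node after the marker comment, up to and
     * including the first dl element, or null if the marker is not found.
     */
    static String htmlAfter(String html, String marker) {
        for (Node node : Jsoup.parse(html).body().childNodes()) {
            if (node instanceof Comment && ((Comment) node).getData().trim().equals(marker)) {
                StringBuilder out = new StringBuilder();
                for (Node next = node.nextSibling(); next != null; next = next.nextSibling()) {
                    out.append(next.outerHtml());
                    if (next instanceof Element && ((Element) next).tagName().equals("dl")) {
                        break; // stop after the first dl that follows the comment
                    }
                }
                return out.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<p>abc</p><!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl><p>def</p>";
        System.out.println(htmlAfter(html, "start content"));
    }
}
```

If only the text is needed, the result can be passed through Jsoup.parse(...).text(), as in the question.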

JEditorPane and source of HTML element

I (still) have problems with HTMLEditorKit and HTMLDocument in Java. I can only set the inner HTML of an element, but I cannot get it. Is there some way to get the underlying HTML code of an element?
My problem is that the HTML support is quite poor and badly written. The API does not offer basic, expected functions. I need to change the colspan or rowspan attribute of a <td>. The Java developers have closed off the straightforward way: the attribute set of an element is immutable. A workaround could be to take the code of the element (e.g. <td colspan="2">Hi <u>world</u></td>) and replace it with new content (e.g. <td colspan="3">Hi <u>world</u></td>). This way seems to be closed too. (Bonus question: What is HTMLEditorKit good for?)
You can get the HTML of the selected Element. Use the write() method of the kit, passing it the start offset and length of the Element. Note that the output will be wrapped in the surrounding tags, such as <html> and <body>.
Thanks for the hint, Stanislav. Here is my solution:
/**
* The method gets inner HTML of given element. If the element is named <code>p-implied</code>
* or <code>content</code>, it returns null.
* @param e element
* @param d document containing given element
* @return the inner HTML of a HTML tag or null, if e is not a valid HTML tag
* @throws IOException
* @throws BadLocationException
*/
public String getInnerHtmlOfTag(Element e, Document d) throws IOException, BadLocationException {
if (e.getName().equals("p-implied") || e.getName().equals("content"))
return null;
CharArrayWriter caw = new CharArrayWriter();
int i;
final String startTag = "<" + e.getName();
final String endTag = "</" + e.getName() + ">";
final int startTagLength = startTag.length();
final int endTagLength = endTag.length();
write(caw, d, e.getStartOffset(), e.getEndOffset() - e.getStartOffset());
//we have the element but wrapped as full standalone HTML code beginning with HTML start tag
//thus we need unpack our element
StringBuffer str = new StringBuffer(caw.toString());
while (str.length() >= startTagLength) {
if (str.charAt(0) != '<')
str.deleteCharAt(0);
else if (!str.substring(0, startTagLength).equals(startTag))
str.deleteCharAt(0); //advance one character at a time so a shorter tag in front cannot hide the match
else
break;
}
//we've found the beginning of the tag
for (i = 0; i < str.length(); i++) { //skip it...
if (str.charAt(i) == '>')
break; //we've found end position of our start tag
}
str.delete(0, i + 1); //...and eat it
//skip the content
for (i = 0; i < str.length(); i++) {
if (str.charAt(i) == '<' && i + endTagLength <= str.length() && str.substring(i, i + endTagLength).equals(endTag))
break; //we've found the end position of inner HTML of our tag
}
str.delete(i, str.length()); //now just remove all from i position to the end
return str.toString().trim();
}
This method can easily be modified to get the outer HTML (that is, the code including the entire tag).
