Parsing html in Jsoup - java

I am trying to parse HTML tags using jsoup, which I am new to. I need to get the text inside each tag and apply the style named in its class attribute.
I am building a SpannableStringBuilder so that I can create substrings, apply styles to them, and append them together with the text that has no style.
String value = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
SpannableStringBuilder text = new SpannableStringBuilder();
if (value.contains("</span>")) {
    Document document = Jsoup.parse(value);
    Elements elements = document.getElementsByTag("span");
    if (elements != null) {
        int start = 0;
        for (Element ele : elements) {
            // Resolve the style resource named by the span's class attribute
            String styleName = type + "." + ele.attr("class");
            text.append(ele.text());
            int style = context.getResources().getIdentifier(styleName, "style", context.getPackageName());
            text.setSpan(new TextAppearanceSpan(context, style), start, text.length(), Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
            // Append the unstyled text that follows the span
            text.append(ele.nextSibling().toString());
            start = text.length();
        }
    }
    return text;
}
I am not sure how I can parse the strings that are not between any tags, such as "There are" and "workers from the".
I need output such as:
- There are
- <span class='newStyle'> two </span>
- workers from the
- <span class='oldStyle'>Front of House</span>

Full answer: you can get the text outside the tags by calling childNodes(), which returns a List<Node>. Note that I select body because your HTML fragment has no parent element, and jsoup adds <html> and <body> automatically when it parses a fragment.
If a Node contains only text, it is of type TextNode and you can get its content using toString().
Otherwise you can cast it to Element and get the text using element.text().
String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
Document doc = Jsoup.parse(str);
Element body = doc.selectFirst("body");
List<Node> childNodes = body.childNodes();
for (int i = 0; i < childNodes.size(); i++) {
    Node node = childNodes.get(i);
    if (node instanceof TextNode) {
        System.out.println(i + " -> " + node.toString());
    } else {
        Element element = (Element) node;
        System.out.println(i + " -> " + element.text());
    }
}
output:
0 ->
There are
1 -> two
2 -> workers from the
3 -> Front of House
By the way: I don't know how to get rid of the first line break before There are.

Related

How to get anchor tag href and anchor tag text inside a div using Selenium in Java

My HTML code consists of multiple divs. Inside each div is a list of anchor tags. I need to fetch the href values and text values of the anchor tags that are in the sub-container div. I'm using Selenium to get the HTML code of the webpage.
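HTML code:
<body>
  <div id="main-container">
    <a href="www.one.com">One</a>
    <a href="www.two.com">Two</a>
    <a href="www.three.com">Three</a>
    <div id="sub-container">
      <a href="www.abc.com">Abc</a>
      <a href="www.xyz.com">Xyz</a>
      <a href="www.pqr.com">Pqr</a>
    </div>
  </div>
</body>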
Java code:
List<WebElement> list = driver.findElements(By.xpath("//*[@href]"));
for (WebElement element : list) {
    String link = element.getAttribute("href");
    System.out.println(element.getTagName() + "=" + link);
}
Output:
a=www.one.com
a=www.two.com
a=www.three.com
a=www.abc.com
a=www.xyz.com
a=www.pqr.com
Output I need:
a=www.abc.com , Abc
a=www.xyz.com , Xyz
a=www.pqr.com , Pqr
Try this:
List<WebElement> list = driver.findElements(By.xpath("//div[@id='sub-container']/*[@href]"));
for (WebElement element : list) {
    String link = element.getAttribute("href");
    System.out.println(element.getTagName() + "=" + link + ", " + element.getText());
}
You can use element.getText() to get the link text.
If you only want to select the links in the sub-container, you can adjust your XPath:
//*[@id="sub-container"]/a
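The XPath expression can be tried out without a browser. The following is only a minimal sketch: it swaps Selenium for the JDK's javax.xml.xpath, assumes the markup is well-formed enough to parse as XML, and assumes the anchors carry the hrefs shown in the question's output. The class name XPathCheck and the method links are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathCheck {
    // Returns "href , text" for every element matched by the expression.
    // Assumes the expression selects elements (here: anchor tags).
    static List<String> links(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element a = (Element) nodes.item(i);
                result.add(a.getAttribute("href") + " , " + a.getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or an invalid expression ends up here
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the page from the question
        String xml = "<body><div id=\"main-container\">"
                + "<a href=\"www.one.com\">One</a>"
                + "<div id=\"sub-container\">"
                + "<a href=\"www.abc.com\">Abc</a>"
                + "<a href=\"www.xyz.com\">Xyz</a>"
                + "</div></div></body>";
        for (String link : links(xml, "//*[@id='sub-container']/a")) {
            System.out.println("a=" + link);
        }
    }
}
```

Only the two anchors that are direct children of the sub-container div are matched; the anchor in main-container is skipped because the path step /a starts from the element whose id is sub-container.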
Pretty simple, try as below:
List<WebElement> list = driver.findElements(By.xpath("//div[@id='sub-container']/a"));
for (WebElement element : list) {
    String link = element.getAttribute("href");
    String text = element.getText();
    System.out.println(element.getTagName() + "=" + link + ", " + text);
}
If the id sub-container is unique, you can simply use the line below:
driver.findElements(By.cssSelector("div#sub-container>a"));
thanks

How to use JSoup to get hyperlink href?

I have the following jsFiddle
http://jsfiddle.net/B5zvV/
I am trying to use JSoup to obtain the value of the hyperlink's href string on Line 238:
<a href="/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450">
Hence, the desired result would be to obtain a String with a value of:
/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450
Here's my code:
Document doc = Jsoup.connect("http://myapp.example.com/fizz.html").get();
Elements elems = doc.getElementsByAttributeValueContaining("href", "repositoryId");
When I run this, the value of elems is empty: why, and what do I need to do to get the desired String?
The getElementsByAttributeValueContaining() method will return multiple elements in this case, because many hrefs contain repositoryId. If you specifically want the one on line 238, that a is enclosed in an li with the classes item and item-default. There is only one such li, with two a tags inside it, so just take the first one:
String html = "<li class=\"item item-default\" data-item-id=\"28049450\" id=\"item-28049450\">"
+ "<a href=\"/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450\">"
+ "<h3 class=\"item-title\">MCAppRepo <span class=\"item-default-marker grey\">(default)</span></h3>"
+ "</a>"
+ "<a href=\"/chain/admin/config/confirmDeleteRepository.action?planKey=AB-CSD&repositoryId=28049450\" class=\"delete\" title=\"Remove repository\">"
+ "<span class=\"assistive\">Delete</span>"
+ "</a>"
+ "</li>";
Document doc = Jsoup.parse(html);
Elements elems = doc.select("li.item.item-default > a");
System.out.println(elems.first().attr("href"));

Extract HTML from <!-- --> comment to a closing tag using jsoup java

I have some HTML that looks like
<!-- start content -->
<p>Blah...</p>
<dl><dd>blah</dd></dl>
I need to extract the HTML from the comment to a closing dl tag. The closing dl is the first one after the comment (not sure if there could be more after, but never is one before). The HTML between the two is variable in length and content and doesn't have any good identifiers.
I see that comments themselves can be selected using #comment nodes, but how would I get the HTML starting from a comment and ending with an HTML close tag as I've described?
Here's what I've come up with. It works, but it is obviously not the most efficient.
String myDirectoryPath = "D:\\Path";
File dir = new File(myDirectoryPath);
Pattern p = Pattern.compile("<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");
for (File child : dir.listFiles()) {
    System.out.println(child.getAbsolutePath());
    String charSet = "UTF-8";
    String innerHtml = Jsoup.parse(child, charSet).select("body").html();
    Matcher m = p.matcher(innerHtml);
    if (m.find()) {
        // group(1) is everything between the comment and the closing dl tag
        Document doc = Jsoup.parse(m.group(1));
        String myText = doc.text();
        // try-with-resources closes the writer even if println fails
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("D:\\Path\\combined.txt", true)))) {
            out.println(myText);
        } catch (IOException e) {
            // error
        }
    }
}
If you want to use a regex, something simple like this may do:
# "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>"
<!-- \s* start \s* content \s* -->
([\S\s]*?)
</ \s* dl \s* >
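As a rough illustration of that pattern in plain Java, here is a sketch that assumes the markup always contains the start content comment followed by a closing dl tag. The class name CommentToDl and the method extract are invented for this example, and a regex like this does not understand nested or malformed HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentToDl {
    // Lazily capture everything between the comment and the first closing dl tag.
    private static final Pattern BLOCK = Pattern.compile(
            "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");

    // Returns the captured HTML, or null when the pattern does not match.
    static String extract(String html) {
        Matcher m = BLOCK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p>abc</p><!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl><p>def</p>";
        System.out.println(extract(html));
    }
}
```

Group 1 stops just before the closing dl tag, so the captured fragment has an unclosed dl. If the comment and the closing tag should be part of the result, m.group() returns the whole match instead.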
Here's some example code - it may need further improvements - depending on what you want to do.
final String html = "<p>abc</p>" // Additional tag before the comment
        + "<!-- start content -->\n"
        + "<p>Blah...</p>\n"
        + "<dl><dd>blah</dd></dl>"
        + "<p>def</p>"; // Additional tag after the comment
// Since it's not a full HTML document (header / body), you may use the XML parser
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for( Node node : doc.childNodes() ) // Iterate over all top-level nodes in the document
{
    if( node.nodeName().equals("#comment") ) // If it's a comment we do something
    {
        // Some output for testing ...
        System.out.println("=== Comment =======");
        System.out.println(node.toString().trim()); // 'toString().trim()' is only there to beautify the output
        System.out.println("=== Childs ========");
        // Get the siblings of the comment --> includes the following nodes
        final List<Node> childNodes = node.siblingNodes();
        // Start and end index for the sublist - this is used to skip tags before the actual comment node
        final int startIdx = node.siblingIndex(); // Start index - start after (!) the comment node
        final int endIdx = childNodes.size(); // End index - the last following node
        // Iterate over all nodes following the comment
        for( Node child : childNodes.subList(startIdx, endIdx) )
        {
            /*
             * Do whatever you have to do with the nodes here ...
             * In this example, they are only used as Elements (HTML tags)
             */
            if( child instanceof Element )
            {
                Element element = (Element) child;
                /*
                 * Do something with your elements / nodes here ...
                 *
                 * You can skip e.g. the 'p' tag by checking tag names.
                 */
                System.out.println(element);
                // Stop after processing the 'dl' tag (= closing 'dl' tag)
                if( element.tagName().equals("dl") )
                {
                    System.out.println("=== END ===========");
                    break;
                }
            }
        }
    }
}
The code is deliberately detailed to make it easier to understand; you can shorten it at some points.
And finally, here's the output of this example:
=== Comment =======
<!-- start content -->
=== Childs ========
<p>Blah...</p>
<dl>
<dd>
blah
</dd>
</dl>
=== END ===========
Btw. to get the text of the comment, just cast it to Comment:
String commentText = ((Comment) node).getData();

Search Function in HTML

How can I search for text in an HTMLDocument and return the start and end index of the matching word or sentence, ignoring tags while searching?
Searching: stackoverflow
html: <p class="red">stack<b>overflow</b></p>
this should return index 15 and 31.
Just like in browsers when searching in webpages.
If you want to do that in Java, here is a rough example using Jsoup. Of course, you would still have to work out the details so that the code handles any given HTML properly.
String html = "<html><head><title>First parse</title></head>"
        + "<body><p class=\"red\">stack<b>overflow</b></p></body></html>";
String search = "stackoverflow";
Document doc = Jsoup.parse(html);
String pPlainText = doc.body().getElementsByTag("p").first().text(); // will return stackoverflow
if (search.equals(pPlainText)) { // equals() rather than matches(), which would treat the text as a regex
    System.out.println("text found in html");
    String pElementString = doc.body().html(); // this will return <p class="red">stack<b>overflow</b></p>
    String firstWord = doc.body().getElementsByTag("p").first().ownText(); // "stack"
    String secondWord = doc.body().getElementsByTag("p").first().children().first().ownText(); // "overflow"
    // search the text in pElementString
    int start = pElementString.indexOf(firstWord); // 15
    int end = pElementString.lastIndexOf(secondWord) + secondWord.length(); // 31
    System.out.println(start + " >> " + end);
} else {
    System.out.println("cannot find searched text");
}
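A more general way to get the two indexes is to build the tag-free text while remembering where each of its characters sits in the original markup. The sketch below does that with plain Java instead of Jsoup. It assumes simple markup in which every < starts a tag and every > ends one, so it ignores entities, comments and scripts. The class name TagAwareSearch and the method find are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TagAwareSearch {
    // Returns {start, end} offsets into html (end exclusive), or null if not found.
    static int[] find(String html, String search) {
        StringBuilder plain = new StringBuilder();
        // offsets.get(i) is the index in html of the i-th character of plain
        List<Integer> offsets = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                plain.append(c);
                offsets.add(i);
            }
        }
        int pos = plain.indexOf(search);
        if (pos < 0 || search.isEmpty()) {
            return null;
        }
        int start = offsets.get(pos);
        // Map the last matched character back and add one for an exclusive end
        int end = offsets.get(pos + search.length() - 1) + 1;
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int[] range = find("<p class=\"red\">stack<b>overflow</b></p>", "stackoverflow");
        System.out.println(range[0] + " >> " + range[1]); // 15 >> 31
    }
}
```

Because the match is located in the tag-free text first, it does not matter how many tags split the searched phrase, which is the part the ownText() approach has to hard-code.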

Jsoup - extracting text

I need to extract text from a node like this:
<div>
Some text <b>with tags</b> might go here.
<p>Also there are paragraphs</p>
More text can go without paragraphs<br/>
</div>
And I need to build:
Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs
Element.text returns just all content of the div. Element.ownText - everything that is not inside children elements. Both are wrong. Iterating through children ignores text nodes.
Is there a way to iterate over the contents of an element so that text nodes are returned as well? For example:
Text node - Some text
Node <b> - with tags
Text node - might go here.
Node <p> - Also there are paragraphs
Text node - More text can go without paragraphs
Node <br> - <empty>
Element.children() returns an Elements object - a list of Element objects. Looking at the parent class, Node, you'll see methods to give you access to arbitrary nodes, not just Elements, such as Node.childNodes().
public static void main(String[] args) throws IOException {
    String str = "<div>" +
            " Some text <b>with tags</b> might go here." +
            " <p>Also there are paragraphs</p>" +
            " More text can go without paragraphs<br/>" +
            "</div>";
    Document doc = Jsoup.parse(str);
    Element div = doc.select("div").first();
    int i = 0;
    for (Node node : div.childNodes()) {
        i++;
        System.out.println(String.format("%d %s %s",
                i,
                node.getClass().getSimpleName(),
                node.toString()));
    }
}
Result:
1 TextNode
Some text
2 Element <b>with tags</b>
3 TextNode might go here.
4 Element <p>Also there are paragraphs</p>
5 TextNode More text can go without paragraphs
6 Element <br/>
for (Element el : doc.select("body").select("*")) {
    for (TextNode node : el.textNodes()) {
        System.out.println(node.text());
    }
}
Assuming you want text only (no tags) my solution is below.
Output is:
Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs
public static void main(String[] args) throws IOException {
    String str =
            "<div>"
            + " Some text <b>with tags</b> might go here."
            + " <p>Also there are paragraphs.</p>"
            + " More text can go without paragraphs<br/>"
            + "</div>";
    Document doc = Jsoup.parse(str);
    Element div = doc.select("div").first();
    StringBuilder builder = new StringBuilder();
    stripTags(builder, div.childNodes());
    System.out.println("Text without tags: " + builder.toString());
}
/**
 * Strip tags from a List of type <code>Node</code>
 * @param builder StringBuilder : input and output
 * @param nodesList List of type <code>Node</code>
 */
public static void stripTags (StringBuilder builder, List<Node> nodesList) {
    for (Node node : nodesList) {
        String nodeName = node.nodeName();
        if (nodeName.equalsIgnoreCase("#text")) {
            builder.append(node.toString());
        } else {
            // recurse
            stripTags(builder, node.childNodes());
        }
    }
}
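You can use TextNode for this purpose. Note that textNodes() returns only the element's direct text children, so text inside nested tags is not included:
List<TextNode> bodyTextNode = doc.getElementById("content").textNodes();
String html = "";
for (TextNode txNode : bodyTextNode) {
    html += txNode.text();
}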
