Unable to read <p> text under <h3> using selenium - java

I have HTML something like this:
<h3> Some Heading </h3>
<p> Some String </p>
<p> more string </p>
<h3> Other heading</h3>
<p> some text </p>
I am trying to access Some String, more string and some text. Using Java, I am trying to access them like this:
List<WebElement> h3Tags = driver.findElements(By.tagName("h3"));
List<WebElement> para = null;
WebElement bagInfo = h3Tags.get(0); //reads first h3
if (bagInfo.getText().contains("carry-on") || bagInfo.getText().contains("Carry-on")) {
    para = AutoUtils.findElementsByTagName(bagInfo, "p");
    System.out.println(para.get(0).getText()); //Null pointer here
}
bagInfo = h3Tags.get(1);
if (bagInfo.getText().contains("checked") || bagInfo.getText().contains("Checked")) {
    para = AutoUtils.findElementsByTagName(bagInfo, "p");
    System.out.println(para.get(0).getText()); //Null pointer here too
}
Tried xpath like "h3['/p']" but still no luck. What is the best way to access those <p> strings?

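Try the XPath //h3/following-sibling::p to match all 3 paragraphs. The p elements are siblings of the h3 elements, not children, which is why searching inside an h3 finds nothing.
Also note that your XPath h3['/p'] cannot work. It selects an h3 that is a direct child of the context node, so from the document root it matches only if h3 is the root element. The predicate ['/p'] filters nothing, because a non-empty string literal ('/p' in your case) always evaluates to true.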
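The behaviour of both expressions can be checked outside a browser. This is a minimal sketch, not Selenium code: it assumes the question's markup is wrapped in a hypothetical <div> root so that it parses as well-formed XML, and it uses the JDK's javax.xml.xpath engine. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SiblingParagraphs {
    // The question's markup, wrapped in a <div> (an assumption) so it is well-formed XML.
    static final String HTML =
        "<div><h3> Some Heading </h3><p> Some String </p><p> more string </p>"
        + "<h3> Other heading</h3><p> some text </p></div>";

    // Returns the trimmed text of every node matched by the expression, in document order.
    static List<String> texts(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(HTML)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // All three paragraphs: each is a following sibling of at least one h3.
        System.out.println(texts("//h3/following-sibling::p"));
        // The string-literal predicate is always true, so only the h3 nodes are matched.
        System.out.println(texts("//h3['/p']"));
    }
}
```

In Selenium the same first expression would be passed to By.xpath and the text read with getText().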

To access Some String, more string and some text, you can use the following locator strategies:
To access the node with the text Some String:
By.xpath("//h3[normalize-space()='Some Heading']/following::p[1]")
To access the node with the text more string:
By.xpath("//h3[normalize-space()='Some Heading']/following::p[2]")
To access the node with the text some text:
By.xpath("//h3[normalize-space()='Other heading']/following::p[1]")
Once you have located those elements, you can call getText() (or getAttribute("innerHTML")) to extract the text within the nodes.
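A small sketch of how heading-based, positional lookups behave, again using the JDK's XPath engine instead of a browser. It assumes the markup is wrapped in a hypothetical <div> root, uses following-sibling rather than following (equivalent for this flat markup), and the helper name paragraphAfter is invented for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class ParagraphByHeading {
    // The question's markup, wrapped in a <div> (an assumption) so it is well-formed XML.
    static final String HTML =
        "<div><h3> Some Heading </h3><p> Some String </p><p> more string </p>"
        + "<h3> Other heading</h3><p> some text </p></div>";

    // Returns the trimmed text of the n-th <p> sibling after the <h3> with the given heading,
    // or an empty string if there is no such paragraph.
    // The heading is assumed not to contain a single quote.
    static String paragraphAfter(String heading, int n) {
        String expression =
            "//h3[normalize-space()='" + heading + "']/following-sibling::p[" + n + "]";
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(HTML)))
                .trim();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(paragraphAfter("Some Heading", 1));
        System.out.println(paragraphAfter("Some Heading", 2));
        System.out.println(paragraphAfter("Other heading", 1));
    }
}
```

One caveat worth knowing: the axis does not stop at the next h3, so asking for the third paragraph after Some Heading returns some text, which visually belongs to Other heading.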

Related

Xpath from a span text in break

I am trying to validate the $40 with an assertion, but I am unable to work out the XPath. Any suggestions?
<div class="MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12"><div class="vetting-details"><h3 class="paragraph text-center"><span>Vetting Price: </span>$40 <br> <span>Estimated Time for Vetting:</span> 30 seconds</h3></div></div>
That 40 is basically a text node. You can retrieve it using :
WebElement e = driver.findElement(By.xpath("//h3[contains(@class,'paragraph text-center')]"));
String el = (String) ((JavascriptExecutor) driver).executeScript("return arguments[0].textContent;", e);
String s = el.split(" ")[2].trim(); // tokens: "Vetting", "Price:", "$40", ...
System.out.println(s);
Explanation:
$40 is not wrapped in its own element; it is a bare text node under the h3. We use JavaScript to target the h3 tag and read all of its text content, then split the string to pick out the desired token.
We can also use the code below (without JavascriptExecutor):
String value = driver.findElement(By.tagName("h3")).getText();
String s1 = value.split(":")[1].trim().split("\\s+")[0]; // keep only the token right after the first colon
System.out.println(s1);
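The string handling can be tried without a browser. This sketch assumes the heading text arrives either as "Vetting Price: $40" followed by a line break (roughly what getText() would give) or with plain spaces (roughly what textContent would give); both sample strings are assumptions, and the class and method names are illustrative. It swaps the positional split for a regular expression, which does not depend on how many spaces precede the price.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VettingPrice {
    // Captures the first run of non-whitespace characters after the "Vetting Price:" label.
    private static final Pattern PRICE = Pattern.compile("Vetting Price:\\s*(\\S+)");

    // Returns the price token, or an empty string when the label is not present.
    static String extractPrice(String headingText) {
        Matcher matcher = PRICE.matcher(headingText);
        return matcher.find() ? matcher.group(1) : "";
    }

    public static void main(String[] args) {
        // Assumed shape of the heading text when the <br> is rendered as a newline.
        System.out.println(extractPrice("Vetting Price: $40\nEstimated Time for Vetting: 30 seconds"));
    }
}
```

The returned value can then be compared against "$40" in whichever assertion library the test uses.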

JSoup - How to parse nested texts?

I'm parsing html of a website with JSoup. I want to parse this part:
<td class="lastpost">
This is a text 1<br>
<a href="post/13594">Website Page - 1</a>
</td>
I want like this:
String text = "This is a text 1";
String textNo = "Website Page - 1";
String link = "post/13594";
How can I get the parts like this?
Your code would only get all the text that is in the td elements that you are selecting. If you want to store the text in separate variables, you should grab the parts separately like the following code. Extra comments added so you can understand how/why it is getting each piece.
// Get the first td element that has class="lastpost"
Element lastPost = document.select("td.lastpost").first();
// Get the first a element that is a child of the td
Element linkElement = lastPost.getElementsByTag("a").first();
// This text is the first text node directly inside the td; text() returns its content
String text = lastPost.textNodes().get(0).text().trim();
// This is the text within the a (link) element
String textNo = linkElement.text();
// This text is the href attribute value of the a (link) element
String link = linkElement.attr("href");

Get class name Jsoup

I am trying to parse some HTML for an Android app, but I can't get the value of the data-id attribute.
Here's the html code
<div class="popup event-popup Predavanja" style="display: none;" data-id="246274" data-position="bottom" >
How can I parse the 246274 value?
If you have the Element object of the div tag, then this code will work:
String attr = element.attr("data-id"); // get the value of the 'data-id' attribute
int dataID = Integer.parseInt(attr); // convert it to an int
Optionally, if you want to check first if the attribute even exists, use this:
if (element.hasAttr("data-id")) // etc.
I think you can do it like this:
Document doc = Jsoup.connect("Url").get(); // "Url" stands for the page address
Element divElement = doc.select("div.popup.event-popup.Predavanja").first(); // div with all three class names
String dataId = divElement.attr("data-id");
See the selector syntax reference: https://jsoup.org/cookbook/extracting-data/selector-syntax

Convert String to arraylist using split

Is it possible to convert below String content to an arraylist using split, so that you get something like in point A?
<a class="postlink" href="http://test.site/i7xt1.htm">http://test.site/i7xt1.htm<br/>
</a>
<br/>Mirror:<br/>
<a class="postlink" href="http://information.com/qokp076wulpw">http://information.com/qokp076wulpw<br/>
</a>
<br/>Additional:<br/>
<a class="postlink" href="http://additional.com/qokdsfsdwulpw">http://additional.com/qokdsfsdwulpw<br/>
</a>
Point A (desired arraylist content):
http://test.site/i7xt1.htm
Mirror:
http://information.com/qokp076wulpw
Additional:
http://additional.com/qokdsfsdwulpw
I am currently using the code below, but it doesn't produce the desired output (for instance, "Mirror" is added multiple times).
Document doc = Jsoup.parse(string);
Elements links = doc.select("a[href]");
for (Element link : links) {
    Node previousSibling = link.previousSibling();
    while (!(previousSibling.nodeName().equals("u") || previousSibling.nodeName().equals("#text"))) {
        previousSibling = previousSibling.previousSibling();
    }
    String identifier = previousSibling.toString();
    if (identifier.contains("Mirror")) {
        totalUrls.add("MIRROR(s):");
    }
    totalUrls.add(link.attr("href"));
}
Fix your links first. As cricket_007 mentioned, having proper HTML would make this a lot easier.
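String html = yourHtml.replaceAll("<br/>\\s*</a>", "</a>"); // get rid of bad HTML, allowing a line break before </a>
String[] lines = html.split("<br/>");
List<String> result = new ArrayList<>();
for (String str : lines) {
    String text = Jsoup.parse(str).text(); // strip the remaining tags
    if (!text.isEmpty()) {
        result.add(text); // you can go further here, e.g. check whether the piece has a link or is a label ending in a colon
    }
}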
Now that the errant <br> tags are out of the links, you can split the string on the <br> tags that remain and print out your html result. It's not pretty, but it should work.
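For a quick check without Jsoup on the classpath, the same clean-then-split idea can be sketched with the standard library only. Note the swap: the tag removal here is a naive regular expression standing in for Jsoup.parse(str).text(), which is adequate for this simple sample but not for arbitrary HTML. The class name, method name and SAMPLE constant are invented for the example; the sample reproduces the question's markup with line breaks where the question shows them.

```java
import java.util.ArrayList;
import java.util.List;

class LinkLines {
    // The question's markup, including the line break between <br/> and </a>.
    static final String SAMPLE =
        "<a class=\"postlink\" href=\"http://test.site/i7xt1.htm\">http://test.site/i7xt1.htm<br/>\n</a>\n"
        + "<br/>Mirror:<br/>\n"
        + "<a class=\"postlink\" href=\"http://information.com/qokp076wulpw\">http://information.com/qokp076wulpw<br/>\n</a>\n"
        + "<br/>Additional:<br/>\n"
        + "<a class=\"postlink\" href=\"http://additional.com/qokdsfsdwulpw\">http://additional.com/qokdsfsdwulpw<br/>\n</a>";

    // Splits the markup on <br/> and returns the non-empty text of each piece.
    static List<String> toLines(String html) {
        // Drop the stray <br/> that sits just before each closing </a>.
        String cleaned = html.replaceAll("<br/>\\s*</a>", "</a>");
        List<String> lines = new ArrayList<>();
        for (String piece : cleaned.split("<br/>")) {
            // Naive tag removal (assumption: no '>' inside attribute values).
            String text = piece.replaceAll("<[^>]*>", "").trim();
            if (!text.isEmpty()) {
                lines.add(text);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        toLines(SAMPLE).forEach(System.out::println);
    }
}
```

Run against the sample, this yields the five entries listed under Point A, in the same order.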

Extract text from html: looking for a good sax-like parser or advices with a dom parser

I have an html document formatted this way:
<p>
some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
just some plain text
</p>
<p>
<strong>strong text </p> followed by plain, <a>with a link at the end!</a>
</p>
I'd like to extract the text. With dom like parsers I could extract each paragraph , but the problem is inside: I'd have to extract text from inner tags too and have a resulting string with the same order, in the example above, first paragraph, I want to extract:
some plain text some emphatized text, some strong text
For this purpose I guess a SAX-like parser would be better than a DOM one, given that I can't know the number or sequence of inner tags: a paragraph can have zero or more inner tags, of different types.
You can use dom parsers, get the text inside of the p tags (including child html elements) in to a string variable and use some other functionality to strip all the html tags out of the resulting string. This should leave you with all of the content between the p tags without any of the child element tags.
Example
<p>
some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
just some plain text
</p>
<p>
<strong>strong text </p> followed by plain, <a>with a link at the end!</a>
</p>
Use some dom parser to extract the p tags to strings, you would then have a string like so:
String content = "some plain text <em>some emphatized text</em>, <strong> some strong text</strong>";
content = stripHtmlTags( content );
println( content ); // some plain text some emphatized text, some strong text
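The stripHtmlTags call above is not a library function. One possible shape for it, as a rough sketch only: a regular expression that deletes anything that looks like a tag and then collapses runs of whitespace. This is an assumption about what the helper could do, and it is not a robust HTML sanitizer (comments, scripts, entities and a '>' inside an attribute value would all trip it up). The class name is illustrative.

```java
class TagStripper {
    // Hypothetical helper: removes tag-like spans, then collapses whitespace.
    static String stripHtmlTags(String html) {
        return html
            .replaceAll("<[^>]*>", "") // delete anything between '<' and the next '>'
            .replaceAll("\\s+", " ")   // collapse the gaps the tags leave behind
            .trim();
    }

    public static void main(String[] args) {
        String content =
            "some plain text <em>some emphatized text</em>, <strong> some strong text</strong>";
        System.out.println(stripHtmlTags(content));
    }
}
```

For anything beyond simple, trusted markup, reading the text through a real parser is the safer route.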
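String extractedText = Html.fromHtml(yourHtmlString).toString();
This gives you the extracted text (Html here is Android's android.text.Html class).
I hope this helps.
If the text sits inside CDATA sections, add a check for CDATA nodes when reading with a DOM parser:
childNode.getNodeType() == Node.CDATA_SECTION_NODE
If you are using XMLUtils, modify getNodeValue like this: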
public static String getNodeValue(Node node) {
    node.normalize();
    String response = node.getNodeValue();
    if (response != null) {
        return response;
    } else {
        NodeList list = node.getChildNodes();
        int size = list == null ? 0 : list.getLength();
        for (int j = 0; j < size; j++) {
            Node childNode = list.item(j);
            if (childNode.getNodeType() == Node.TEXT_NODE
                    || childNode.getNodeType() == Node.CDATA_SECTION_NODE) {
                response = childNode.getNodeValue();
                return response;
            }
        }
    }
    return "";
}
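To see the CDATA check in action, here is a self-contained sketch using the JDK's DOM parser. It assumes the default DocumentBuilderFactory settings, under which CDATA sections are kept as separate CDATA nodes rather than merged into text nodes. The class name, method name and sample XML are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class CdataDemo {
    // Parses the XML and returns the value of the root element's first text or CDATA child,
    // or an empty string if it has neither.
    static String firstTextOrCdata(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Node root = doc.getDocumentElement();
            for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
                short type = child.getNodeType();
                if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                    return child.getNodeValue();
                }
            }
            return "";
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Without the CDATA_SECTION_NODE check this element would yield an empty string.
        System.out.println(firstTextOrCdata("<note><![CDATA[a < b]]></note>"));
        System.out.println(firstTextOrCdata("<note>plain text</note>"));
    }
}
```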
