Select href from HTML table using Jsoup - java

I have HTML table:
<table class="table_class" id="table_id">
<tbody>
<tr>...</tr>
<tr>
<td>...</td>
<td>
...
</td>
<td>...</td>
</tr>
<tr>...</tr>
</tbody>
</table>
I need to get all the hrefs from one column of this table.
I tried
Elements links = table.select("a[href]");
System.out.println(links);
but it returns the hrefs of the a tags on the whole page, not just the table.

Maybe this will work:
String url = "...";
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("#table_id a[href]");

Related

How to extract HTML's <td> tag data using regex in Java?

I am trying to read a username and password from an email using Java.
The mail content is returned as HTML, and I just want to extract the username and password, which sit inside <td> tags. Below is my HTML snippet:
<table width="200">
<tbody>
<tr>
<td colspan="2">Your Account Details:</td>
</tr>
<tr>
<td>EmailId:</td>
<td><a class="moz-txt-link-abbreviated" href="mailto:jainish.m.kapadia@trimantra.net">jainish.m.kapadia@trimantra.net</a></td>
</tr>
<tr>
<td>Password:</td>
<td>C3mRXh+|n#1J</td>
</tr>
</tbody>
</table>
How do I achieve this?
Please don't try to parse HTML with regex;
for a detailed explanation of why you shouldn't, see this SO answer.
You can use jsoup to parse your HTML strings like this:
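import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
// This sample has no element with id "content", so getElementById("content")
// would return null here; search the whole body instead.
Element content = doc.body();
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href"); // value of the href attribute
    String linkText = link.text();       // visible text of the link
}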
jsoup also offers methods for hierarchical navigation like
siblingElements();
nextElementSibling();
and so on.
You can use below code snippet:
String str = "your html";
Pattern pattern = Pattern.compile("(<td>(.*?)<\\/td>)");
Matcher matcher = pattern.matcher(str);
This matches every plain <td>...</td> pair. You can then loop through the matcher with find() and read group(2) to get the content of each cell.
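The matcher loop mentioned above could look roughly like the following sketch. It uses only java.util.regex; the class name TdRegexSketch and the helper methods cellContents and valueAfter are made up for illustration. Note the limits of the pattern: it only matches a bare <td> (so <td colspan="2"> is skipped), "." does not cross line breaks, and a cell that wraps its value in another tag (like the <a> around the email address) comes back with that markup included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TdRegexSketch {
    // Same pattern as in the answer; group 2 is the text between <td> and </td>.
    private static final Pattern TD = Pattern.compile("(<td>(.*?)<\\/td>)");

    // Collects the raw content of every plain <td>...</td> pair, in order.
    static List<String> cellContents(String html) {
        List<String> cells = new ArrayList<>();
        Matcher matcher = TD.matcher(html);
        while (matcher.find()) {
            cells.add(matcher.group(2));
        }
        return cells;
    }

    // Hypothetical helper: returns the cell that follows the given label cell, or null.
    static String valueAfter(List<String> cells, String label) {
        int i = cells.indexOf(label);
        return (i >= 0 && i + 1 < cells.size()) ? cells.get(i + 1) : null;
    }

    public static void main(String[] args) {
        String html = "<tr><td>Password:</td><td>C3mRXh+|n#1J</td></tr>";
        List<String> cells = cellContents(html);
        System.out.println(valueAfter(cells, "Password:")); // prints C3mRXh+|n#1J
    }
}
```

Pairing each value with the label cell before it relies on the two-column layout of the mail in the question; a different layout would need a different lookup.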

Detecting same tags pattern in web page using jsoup java

I am writing code to detect repeating tag patterns in a web page. Here is an example.
<body>
<table width="200" border="1">
<tr>
<td>Name</td>
<td>Place</td>
<td>Animal</td>
</tr>
<p>hello World</p>
<tr>
<td>Jack</td>
<td>New york</td>
<td>Lion</td>
</tr>
<b>Code Works</b>
<tr>
<td>George</td>
<td>Sydney</td>
<td>Tiger</td>
</tr>
<tr>
<td>Tina</td>
<td>Delhi</td>
<td>Cat</td>
</tr>
</table>
<table>
<tbody>
<tr>
<td> </td>
<td>
1
2
3
4
5
</td>
</tr>
</tbody>
</table>
</body>
For the tag pattern above, I need to find the tags that occur repeatedly and discard those that are not part of the pattern, like the b and p tags. In the first table, the tr and td tags repeat. In the second table, the 'a' tag repeats.
This is what I have done so far:
Parsed the page into a DOM tree using Jsoup.
Then used a node visitor class to traverse the tree. With the head and tail methods, I can detect entering and leaving each tag.
But I am confused about how to proceed further.
Note: The tag patterns are not fixed. They will vary depending on the structure of the web page. Any kind of help will be appreciated.
"But I am confused about how to proceed further."
Your confusion is spreading and has reached us too. However, I'll try to give you a hint.
You can count the tags in your HTML code. If the count for a tag reaches a certain threshold, you can consider this tag as "repeatedly occurring".
// Load document
String html = ...
Document doc = Jsoup.parse(html);
// Count tags by name
String tagsSelector = "*";
Map<String, Integer> tagsCountByType = new HashMap<>();
for (Element e : doc.select(tagsSelector)) {
    // Key on the tag name rather than the Element, so that all elements
    // sharing a tag are counted together.
    tagsCountByType.merge(e.tagName(), 1, Integer::sum);
}
// Find tag with a count greater than a given threshold
// ...
I didn't test the code. Just take it as an idea, some sort of inspiration.
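The threshold step left open above could be sketched as follows. To keep it runnable without jsoup, it works on a plain list of tag names, which is assumed to have been collected already (for example from e.tagName() during the traversal). The class name RepeatedTagsSketch, the method names, and the threshold of 2 are all illustrative choices, not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class RepeatedTagsSketch {
    // Counts how often each tag name occurs.
    static Map<String, Integer> countTags(List<String> tagNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : tagNames) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    // Keeps only the tag names whose count reaches the threshold.
    static Set<String> repeatedTags(Map<String, Integer> counts, int threshold) {
        Set<String> repeated = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= threshold) {
                repeated.add(entry.getKey());
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        // Tag names roughly as they appear in the first rows of the question's first table.
        List<String> tagNames = Arrays.asList(
                "table", "tr", "td", "td", "td", "p", "tr", "td", "td", "td", "b");
        System.out.println(repeatedTags(countTags(tagNames), 2)); // prints [td, tr]
    }
}
```

A fixed threshold is the simplest choice; since the question says the patterns vary from page to page, the threshold would probably need tuning per page or per container.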
Another idea, you can narrow down the tagsSelector. For example:
// All elements (tags) under any table directly under body.
String tagsSelector = "body > table *";

Detect the innermost (nested) web element in Selenium

I want to get the innermost web element in a page when there are similar nested WebElements in the page.
Consider the example below:
<body>
<table id="level1">
<tr>
<td>
<table id="level2">
<tr>
<td>
<table id="level3">
<tr>
<td>
<p>Test</p>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
<table id="level1_table2">
<tr>
<td>
<table id="level2_table2">
<tr>
<td></td>
</tr>
</table>
</td>
</tr>
</table>
</body>
So when I search the page with Driver.findElements by tag "table", looking for tables which have some text such as "Test",
I get 5 WebElements in total, namely "level1", "level2", "level3", "level1_table2" and "level2_table2".
What I want is a list of only the innermost (nested) elements that satisfy my search criteria.
So the list should contain only 2 WebElements, namely "level3" and "level2_table2".
I am looking for something along the lines of recursion, perhaps. Can somebody help me out?
You don't need recursion - everything you need is the proper XPath expression:
driver.findElements(By.xpath("//table[not(.//table)]"))
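The idea of selecting tables that contain no further table can be checked without a browser. The sketch below evaluates the absolute form of the expression, //table[not(.//table)], with the JDK's built-in XPath engine against the markup from the question, assuming that markup is well-formed XML. The class name InnermostTableSketch and the method innermostTableIds are hypothetical; with Selenium the expression would be passed to By.xpath instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InnermostTableSketch {
    // Returns the id of every table that has no table among its descendants.
    static List<String> innermostTableIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList tables = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//table[not(.//table)]", doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < tables.getLength(); i++) {
                ids.add(((Element) tables.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // The markup from the question, written on fewer lines.
        String xml = "<body>"
                + "<table id=\"level1\"><tr><td>"
                + "<table id=\"level2\"><tr><td>"
                + "<table id=\"level3\"><tr><td><p>Test</p></td></tr></table>"
                + "</td></tr></table>"
                + "</td></tr></table>"
                + "<table id=\"level1_table2\"><tr><td>"
                + "<table id=\"level2_table2\"><tr><td></td></tr></table>"
                + "</td></tr></table>"
                + "</body>";
        System.out.println(innermostTableIds(xml)); // prints [level3, level2_table2]
    }
}
```

The expression only filters on nesting; restricting the result to tables that also contain some text would need an additional predicate.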
I would use this strategy:
Search for the WebElements containing the text Test.
For each WebElement, walk up to the first ancestor whose tag name is table.
Here it is in Java:
List<WebElement> elementsWithTest = driver.findElements(By.xpath("//*[contains(text(),'Test')]"));
List<WebElement> result = new ArrayList<>();
for(WebElement element : elementsWithTest) {
WebElement parent = element.findElement(By.xpath(".."));
while (! "table".equals(parent.getTagName())) {
parent = parent.findElement(By.xpath(".."));
}
if ("table".equals(parent.getTagName())) {
result.add(parent);
}
}
System.out.println(result);
Hope that helps.

How to get unformatted html from Jsoup

String testCases[] = {
"<table><tbody><tr><td><div><inline>Normal Line Text</inline><br/></div></td></tr></tbody></table>",
};
for (String testString : testCases) {
Document doc = Jsoup.parse(testString,"", Parser.xmlParser());
Elements elements = doc.select("table");
for (Element ele : elements) {
System.out.println("===============================================");
System.out.println(ele.html()); //Formatted
System.out.println("-----------------------------------------------");
System.out.println(ele.html().trim().replace("\n","").replace("\r","")); //Notice the Difference
}
}
Output:
===============================================
<tbody>
<tr>
<td>
<div>
<inline>
Normal Line Text
</inline>
<br />
</div></td>
</tr>
</tbody>
-----------------------------------------------
<tbody> <tr> <td> <div> <inline> Normal Line Text </inline> <br /> </div></td> </tr></tbody>
Because of the formatting done by jsoup, the values of the text nodes change to include newlines.
Changing <inline> to <span> in the test case seems to work fine, but unfortunately, we have legacy data/html containing <inline> tags generated by redactor.
Try this:
Document doc = Jsoup.parse(testString,"", Parser.xmlParser());
doc.outputSettings().prettyPrint(false);
Hope it helps.
Taken from https://stackoverflow.com/a/19602313/3324704

jsoup retrieving a specific table from the header

I have been working on this for a while and just can't work out how to get the table that corresponds to a given header. The tables are split up into sections, which I can retrieve, but inside each section is a header with the title of the table. I need to find the section whose header matches a string and then pull the data from it. I'm fine with getting the data out of the table; the problem is just getting the correct section for the table.
HTML extract of the section:
<section class="blueTab">
<header><h2>Energy</h2></header> //<----- THE HEADER I NEED TO MATCH TO
<table class="infoTable">
<tr><th>Model</th><th>0-60 mph</th><th>Top Speed</th><th>BHP</th><th></th></tr>
<tr>
<td><p>1.4i 16V Energy 5d</p></td>
<td><p>12.8 secs</p></td>
<td><p>111 mph</p></td>
<td><p>88 bhp</p></td>
</tr>
<tr class="alternate">
<td><p>1.6i 16V Energy 5d</p></td>
<td><p>11.5 secs</p></td>
<td><p>115 mph</p></td>
<td><p>103 bhp</p></td>
</tr>
<tr>
<td><p>1.8i VVT Energy 5d Auto</p></td>
<td><p>10.7 secs</p></td>
<td><p>117 mph</p></td>
<td><p>138 bhp</p></td>
</tr>
<tr class="alternate">
<td><p>1.3 CDTi 16V Energy 5d</p></td>
<td><p>12.8 secs</p></td>
<td><p>107 mph</p></td>
<td><p>88 bhp</p></td>
</tr>
</table>
<div class="fr topMargin">
<div id="ctl00_contentHolder_topFullWidthContent" class="modelEnquiry">
<div id="ctl00_contentHolder_topFullWidthContent" class="buttonLinks">
</div>
<div class="cb"><!----></div>
</div>
</div>
<div class="cb"><!----></div>
</section>
I'm guessing I will have to use doc.getElementsByClass("blueTab") in a for loop and, for each element, check whether the h2 equals the string I'm looking for. I am just not sure how to implement this.
This should solve your problem
// input is assumed to be a java.io.File here; for an HTML String, use Jsoup.parse(html) instead
Document doc = Jsoup.parse(input, "UTF-8");
Elements headings = doc.select(".blueTab header h2");
for (Element heading : headings) {
    if (heading.text().equals("Energy")) { // your comparison text
        // h2 -> parent <header> -> its next sibling is the <table class="infoTable">
        Element tableElement = heading.parent().nextElementSibling();
    }
}
