How to get unformatted html from Jsoup - java

String testCases[] = {
"<table><tbody><tr><td><div><inline>Normal Line Text</inline><br/></div></td></tr></tbody></table>",
};
for (String testString : testCases) {
Document doc = Jsoup.parse(testString,"", Parser.xmlParser());
Elements elements = doc.select("table");
for (Element ele : elements) {
System.out.println("===============================================");
System.out.println(ele.html()); //Formatted
System.out.println("-----------------------------------------------");
System.out.println(ele.html().trim().replace("\n","").replace("\r","")); //Notice the Difference
}
}
Output:
===============================================
<tbody>
<tr>
<td>
<div>
<inline>
Normal Line Text
</inline>
<br />
</div></td>
</tr>
</tbody>
-----------------------------------------------
<tbody> <tr> <td> <div> <inline> Normal Line Text </inline> <br /> </div></td> </tr></tbody>
Due to the formatting done by JSoup, the value of textNodes change to include newlines.
Changing <inline> to <span> in the test case seems to work fine, but unfortunately, we have legacy data/html containing <inline> tags generated by redactor.

Try this:
Document doc = Jsoup.parse(testString,"", Parser.xmlParser());
doc.outputSettings().prettyPrint(false);
Hope it helps.
Taken from https://stackoverflow.com/a/19602313/3324704

Related

Select href from HTML table using Jsoup

I have HTML table:
<table class="table_class" id="table_id"
<tbody>
<tr>...</tr>
<tr>
<td>...</td>
<td>
...
</td>
<td>...</td>
</tr>
<tr>...</tr>
</tbody>
And need to get all such hrefs from 1 column in table.
I tried to use
Elements links = table.select("a[href]");
System.out.println(links);
but it parse hrefs from a tags on complete page.
Maybe this will work:
String url = "...";
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("#table_id a[href]");

Trouble parsing table class with href in Jsoup

I am very new to JSOUP, have only been using it for a couple days, learning mostly from this website. Now I'm trying to get some information from the below HTML:
<td class="day no-repetition">Sun</td>
<td class="full-date" nowrap="nowrap">17/05/15</td>
<td class="competition">PRL</td>
<td class="team team-a ">
<a href="/teams/england/sunderland-association-football-club/683/" title="Sunderland">
Sunderland
</a>
</td>
<td class="score-time score">
<a href="/matches/2015/05/16/england/premier-league/sunderland-association-football-club/leicester-city-fc/1704225/" class="result-draw">
0 - 0
</a>
</td>
<td class="team team-b ">
<a href="/teams/england/leicester-city-fc/682/" title="Leicester City">
Leicester City
</a>
</td>
<td class="events-button button first-occur">
</td>
<td class="info-button button">
More info
</td>
I need to extract the Home team, score and the Away Team from the above however I am currently having issues with this. I need both the link and the text itself. Below is the code I have:
try {
Document doc = Jsoup.connect(URL).get();
Element table = doc.select("table[class=matches]").first();
Elements rows = table.select("tr");
for (int i=0; i<rows.size(); i++){
Element row = rows.get(i);
Elements data = row.select("td[class=team.team-a]");
System.out.println(data.text());
}
} catch (IOException e) {
e.printStackTrace();
}
This hasn't worked so far. I tried 'team.team-a', 'team.team.a' and all other variations of it. I managed to get the data that's in the 'competition' class, which works when I just replace ("td[class=team.team=a]") with (td[class=competition]) however this doesn't work with any of the classes that have a link.
Assistance would be highly appreciated!
Just separate multiple classes with a dot:
td.team.team-a > a // first team
td.team.team-b > a // second team
td.score > a // score

Detect innermost web element in (nested) in selenium

I am looking for getting the inner most web element in a page, when there are similar nested Webelements in a page.
Consider the example below:
<body>
<table id="level1">
<tr>
<td>
<table id="level2">
<tr>
<td>
<table id="level3">
<tr>
<td>
<p>Test</p>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
<table id="level1_table2">
<tr>
<td>
<table id="level2_table2">
<tr>
<td></td>
</tr>
</table>
</td>
</tr>
</table>
</body>
So when I do a search on the page by Driver.findElements by tag "table" and which have some text - "Test",
I will get 5 WebElements in total, namely - "level1", "level3" , "level1_table2" , "level2_table2"
What I want to achieve is to have a list of innermost(nested) elements which satisfy my search criteria .
So the List I should get should only have 2 WebElements namely - "level3" and "level2_table2".
I am looking something probably on the lines of recursion. Can somebody help me out.
You don't need recursion - everything you need is the proper XPath expression:
driver.findElements(By.xpath("table[not(.//table)]"))
I would use this strategy:
Search WebElements containing text Test
For each WebElement search for the first parent which match tag name is table
Here is in Java:
List<WebElement> elementsWithTest = driver.findElements(By.xpath("//*[contains(text(),'Test')]"));
List<WebElement> result = new ArrayList<>();
for(WebElement element : elementsWithTest) {
WebElement parent = element.findElement(By.xpath(".."));
while (! "table".equals(parent.getTagName())) {
parent = parent.findElement(By.xpath(".."));
}
if ("table".equals(parent.getTagName())) {
result.add(parent);
}
}
System.out.println(result);
Hope that helps.

Modifying HTML using java

I am trying to read a HTML file and add link to some of the texts :
for example :
I want to add link to "Campaign0" text. :
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">101</span></p></td>
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">Campaign0</span>
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">unknown</span></p></td>
Link to be added:
<a href="Second.html">
I need a JAVA program that modify html to add hyperlink over "Campaign0" .
How i do this with Jsoup ?
I tried this with JSoup :
File input = new File("D://First.html");
Document doc = Jsoup.parse(input, "UTF-8", "");
Element span = doc.select("span").first(); <-- this is only for first span tag :(
span.wrap("");
Is this correct ?? It's not working :(
In short : is there anything like-->
if find <span>Campaign0</span>
then replace by <span>Campaign0</span>
using JSoup or any technology inside JAVA code??
Your code seems pretty much correct. To find the span elements with "Campaign0", "Campaign1", etc., you can use the JSoup selector "span:containsOwn(Campaign0)". See additional documentation for JSoup selectors at jsoup.org.
After finding the elements and wrapping them with the link, calling doc.html() should return the modified HTML code. Here's a working sample:
input.html:
<table>
<tr>
<td><p><span>101</span></p></td>
<td><p><span>Campaign0</span></p></td>
<td><p><span>unknown</span></p></td>
</tr>
<tr>
<td><p><span>101</span></p></td>
<td><p><span>Campaign1</span></p></td>
<td><p><span>unknown</span></p></td>
</tr>
</table>
Code:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "");
Element span = doc.select("span:containsOwn(Campaign0)").first();
span.wrap("");
span = doc.select("span:containsOwn(Campaign1)").first();
span.wrap("");
String html = doc.html();
BufferedWriter htmlWriter =
new BufferedWriter(new OutputStreamWriter(new FileOutputStream("output.html"), "UTF-8"));
htmlWriter.write(html);
htmlWriter.close();
output:
<html>
<head></head>
<body>
<table>
<tbody>
<tr>
<td><p><span>101</span></p></td>
<td><p><span>Campaign0</span></p></td>
<td><p><span>unknown</span></p></td>
</tr>
<tr>
<td><p><span>101</span></p></td>
<td><p><span>Campaign1</span></p></td>
<td><p><span>unknown</span></p></td>
</tr>
</tbody>
</table>
</body>
</html>

how to extract data inside a specific td in html table using java

I have:
<table class="cast_list">
<tr><td colspan="4" class="castlist_label"></td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000209/?ref_=ttfc_fc_cl_i1" ><img height="44" width="32" alt="Tim Robbins" title="Tim Robbins"src="http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._V379389446_.png"class="loadlate hidden " loadlate="http://ia.media-imdb.com/images/M/MV5BMTI1OTYxNzAxOF5BMl5BanBnXkFtZTYwNTE5ODI4._V1_SY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td class="itemprop" itemprop="actor" itemscope itemtype="http://schema.org/Person">
<a href="/name/nm0000209/?ref_=ttfc_fc_cl_t1" itemprop='url'> <span class="itemprop" itemprop="name">Tim Robbins</span>
</a> </td>
<td class="ellipsis">
...
</td>
how can I get only the information inside the second td class? (td class= itemprop). I want to get "/name/nm0000209/?ref_=ttfc_fc_cl_t1" and "Tim Robbins".
This is my code:
Elements elms = doc.getElementsByClass("cast_list").first().getElementsByTag("table");
Elements tds = elms.select("td");
for(Element td : tds){
if(td.attr("class").contains("itemprop")){
Elements links = tds.select("a[href]");
for(Element link : links){
if(link.attr("href").contains("name/nm"))
{
String castname = link.text();
String castImdbId = link.attr("href");
System.out.println("CastName:" + castname + "\n");
System.out.println("CastImdbID:" + castImdbId + "\n");
}
but it also returns the text of the link inside td class="primary_phptp" which is null, this is part of my output:
CastName:
CastImdbID:/name/nm0000209/?ref_=ttfc_fc_cl_i1
CastName:Tim Robbins
CastImdbID:/name/nm0000209/?ref_=ttfc_fc_cl_t1
CastName:
......
Could someone please let me know where is my problem? I think the condition if(td.attr("class").contains("itemprop")) does not work at all.
Thanks,
Use a different css selector instead of td. Since the right <td> is identified be the class, why not use it:
td.itemprop
Your java code then would start like this instead
Elements tds = elms.select("td.itemprop");

Categories