How to get unformatted html from Jsoup

String testCases[] = {
"<table><tbody><tr><td><div><inline>Normal Line Text</inline><br/></div></td></tr></tbody></table>",
};
for (String testString : testCases) {
Document doc = Jsoup.parse(testString,"", Parser.xmlParser());
Elements elements = doc.select("table");
for (Element ele : elements) {
System.out.println("===============================================");
System.out.println(ele.html()); //Formatted
System.out.println("-----------------------------------------------");
System.out.println(ele.html().trim().replace("\n","").replace("\r","")); //Notice the Difference
}
}
Output:
===============================================
<tbody>
<tr>
<td>
<div>
<inline>
Normal Line Text
</inline>
<br />
</div></td>
</tr>
</tbody>
-----------------------------------------------
<tbody> <tr> <td> <div> <inline> Normal Line Text </inline> <br /> </div></td> </tr></tbody>
Because of the formatting Jsoup applies, the values of the text nodes change to include newlines.
Changing <inline> to <span> in the test case works fine, but unfortunately we have legacy data/HTML containing <inline> tags generated by Redactor.

Try turning off pretty printing in the document's output settings:
Document doc = Jsoup.parse(testString,"", Parser.xmlParser());
doc.outputSettings().prettyPrint(false);
Hope it helps.
Taken from https://stackoverflow.com/a/19602313/3324704
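
A minimal sketch of the suggestion above, assuming jsoup is on the classpath. The class name UnformattedHtml and the helper innerHtml are made up for illustration. It parses the question's test string with the XML parser, disables pretty printing, and returns the table's inner HTML without the inserted line breaks.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class UnformattedHtml {

    // Hypothetical helper: inner HTML of the first <table>, serialized without pretty printing.
    static String innerHtml(String markup) {
        Document doc = Jsoup.parse(markup, "", Parser.xmlParser());
        // Pretty printing is what adds the newlines and indentation, so switch it off.
        doc.outputSettings().prettyPrint(false);
        Element table = doc.select("table").first();
        return table == null ? "" : table.html();
    }

    public static void main(String[] args) {
        String testString = "<table><tbody><tr><td><div><inline>Normal Line Text</inline>"
                + "<br/></div></td></tr></tbody></table>";
        System.out.println(innerHtml(testString));
    }
}
```

The setting lives on the Document, so it applies to every element serialized from that document and has to be set again for each newly parsed document.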

Related

Select href from HTML table using Jsoup

I have HTML table:
<table class="table_class" id="table_id">
<tbody>
<tr>...</tr>
<tr>
<td>...</td>
<td>
...
</td>
<td>...</td>
</tr>
<tr>...</tr>
</tbody>
I need to get all the hrefs from one column of the table.
I tried to use
Elements links = table.select("a[href]");
System.out.println(links);
but it picks up the hrefs from a tags on the whole page.

Maybe this will work. It scopes the selector to the table's id, so only links inside that table match:
String url = "...";
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("#table_id a[href]");
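
If only one column is wanted, the selector can be narrowed further. The sketch below is one possible way, assuming jsoup is available. The markup, the column number and the names ColumnLinks and hrefsInColumn are invented for illustration, because the question elides the real table contents.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ColumnLinks {

    // Collects href values from links in one column (1-based) of the table with the given id.
    static List<String> hrefsInColumn(Document doc, String tableId, int column) {
        String query = "#" + tableId + " tr > td:nth-child(" + column + ") a[href]";
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select(query)) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the elided table in the question.
        String html = "<a href='/outside'>outside</a>"
                + "<table class='table_class' id='table_id'><tbody>"
                + "<tr><td><a href='/a1'>a1</a></td><td><a href='/b1'>b1</a></td></tr>"
                + "<tr><td><a href='/a2'>a2</a></td><td><a href='/b2'>b2</a></td></tr>"
                + "</tbody></table>";
        Document doc = Jsoup.parse(html);
        // Links from the second column only; the link outside the table is ignored.
        System.out.println(hrefsInColumn(doc, "table_id", 2));
    }
}
```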

Trouble parsing table class with href in Jsoup

I am very new to Jsoup and have only been using it for a couple of days, learning mostly from this website. Now I'm trying to get some information from the HTML below:
<td class="day no-repetition">Sun</td>
<td class="full-date" nowrap="nowrap">17/05/15</td>
<td class="competition">PRL</td>
<td class="team team-a ">
<a href="/teams/england/sunderland-association-football-club/683/" title="Sunderland">
Sunderland
</a>
</td>
<td class="score-time score">
<a href="/matches/2015/05/16/england/premier-league/sunderland-association-football-club/leicester-city-fc/1704225/" class="result-draw">
0 - 0
</a>
</td>
<td class="team team-b ">
<a href="/teams/england/leicester-city-fc/682/" title="Leicester City">
Leicester City
</a>
</td>
<td class="events-button button first-occur">
</td>
<td class="info-button button">
More info
</td>
I need to extract the home team, the score and the away team from the above, but I am having trouble with this. I need both the link and the text itself. Below is the code I have:
try {
Document doc = Jsoup.connect(URL).get();
Element table = doc.select("table[class=matches]").first();
Elements rows = table.select("tr");
for (int i=0; i<rows.size(); i++){
Element row = rows.get(i);
Elements data = row.select("td[class=team.team-a]");
System.out.println(data.text());
}
} catch (IOException e) {
e.printStackTrace();
}
This hasn't worked so far. I tried 'team.team-a', 'team.team.a' and other variations of it. I managed to get the data in the 'competition' class, which works when I replace ("td[class=team.team-a]") with ("td[class=competition]"), but this doesn't work with any of the classes that have a link.
Assistance would be highly appreciated!

Just separate multiple classes with a dot:
td.team.team-a > a // first team
td.team.team-b > a // second team
td.score > a // score
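
To show how these selectors return both the link and the text, here is a small sketch that assumes jsoup is on the classpath. The row is a trimmed copy of the question's markup, and the names MatchRow, ROW_HTML and describe are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MatchRow {

    // Trimmed copy of the row from the question, wrapped in <table><tr> so that
    // jsoup's HTML parser keeps the <td> elements.
    static final String ROW_HTML = "<table class=\"matches\"><tr>"
            + "<td class=\"team team-a \"><a href=\"/teams/england/sunderland-association-football-club/683/\""
            + " title=\"Sunderland\"> Sunderland </a></td>"
            + "<td class=\"score-time score\"><a href=\"/matches/2015/05/16/england/premier-league/"
            + "sunderland-association-football-club/leicester-city-fc/1704225/\" class=\"result-draw\"> 0 - 0 </a></td>"
            + "<td class=\"team team-b \"><a href=\"/teams/england/leicester-city-fc/682/\""
            + " title=\"Leicester City\"> Leicester City </a></td>"
            + "</tr></table>";

    // Returns "text -> href" for the first link matching the query, or "" when nothing matches.
    static String describe(String html, String query) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select(query).first();
        return link == null ? "" : link.text() + " -> " + link.attr("href");
    }

    public static void main(String[] args) {
        System.out.println(describe(ROW_HTML, "td.team.team-a > a")); // home team
        System.out.println(describe(ROW_HTML, "td.score > a"));       // score
        System.out.println(describe(ROW_HTML, "td.team.team-b > a")); // away team
    }
}
```

The attribute form td[class=...] compares against the whole class attribute string, here "team team-a " including the spaces, which is why the dotted variations inside the brackets never matched.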

Detect innermost (nested) web element in Selenium

I am looking for a way to get the innermost web element in a page when there are similar nested WebElements in the page.
Consider the example below:
<body>
<table id="level1">
<tr>
<td>
<table id="level2">
<tr>
<td>
<table id="level3">
<tr>
<td>
<p>Test</p>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
<table id="level1_table2">
<tr>
<td>
<table id="level2_table2">
<tr>
<td></td>
</tr>
</table>
</td>
</tr>
</table>
</body>
So when I search the page with Driver.findElements by the tag "table" (and some text, such as "Test"),
I get 5 WebElements in total, namely "level1", "level2", "level3", "level1_table2" and "level2_table2".
What I want is a list of only the innermost (nested) elements that satisfy my search criteria.
So the list should contain only 2 WebElements, namely "level3" and "level2_table2".
I am looking for something along the lines of recursion. Can somebody help me out?

You don't need recursion; all you need is the proper XPath expression, one that matches every table with no table nested inside it:
driver.findElements(By.xpath("//table[not(.//table)]"))
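
The expression can be checked without a browser. The sketch below swaps Selenium for the JDK's own javax.xml.xpath API, which is an assumption made only to keep the example self-contained. It runs the expression against a condensed copy of the question's markup, and the names InnermostTables, SAMPLE and innermostTableIds are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class InnermostTables {

    // Condensed copy of the question's markup (it is well-formed, so an XML parser accepts it).
    static final String SAMPLE = "<body>"
            + "<table id=\"level1\"><tr><td>"
            + "<table id=\"level2\"><tr><td>"
            + "<table id=\"level3\"><tr><td><p>Test</p></td></tr></table>"
            + "</td></tr></table>"
            + "</td></tr></table>"
            + "<table id=\"level1_table2\"><tr><td>"
            + "<table id=\"level2_table2\"><tr><td></td></tr></table>"
            + "</td></tr></table>"
            + "</body>";

    // Returns the ids of all <table> elements that have no <table> descendant.
    static List<String> innermostTableIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//table[not(.//table)]", doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(innermostTableIds(SAMPLE)); // [level3, level2_table2]
    }
}
```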

I would use this strategy:
Search for the WebElements containing the text Test
For each WebElement, walk up to the first ancestor whose tag name is table
Here it is in Java:
List<WebElement> elementsWithTest = driver.findElements(By.xpath("//*[contains(text(),'Test')]"));
List<WebElement> result = new ArrayList<>();
for(WebElement element : elementsWithTest) {
WebElement parent = element.findElement(By.xpath(".."));
while (! "table".equals(parent.getTagName())) {
parent = parent.findElement(By.xpath(".."));
}
if ("table".equals(parent.getTagName())) {
result.add(parent);
}
}
System.out.println(result);
Hope that helps.

Modifying HTML using java

I am trying to read an HTML file and add a link to some of the text.
For example, I want to add a link to the "Campaign0" text:
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">101</span></p></td>
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">Campaign0</span></p></td>
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">unknown</span></p></td>
Link to be added:
<a href="Second.html">
I need a Java program that modifies the HTML to add a hyperlink over "Campaign0".
How do I do this with Jsoup?
I tried this with Jsoup:
File input = new File("D://First.html");
Document doc = Jsoup.parse(input, "UTF-8", "");
Element span = doc.select("span").first(); // this only gets the first span tag :(
span.wrap("<a href=\"Second.html\"></a>");
Is this correct? It's not working :(
In short, is there anything like:
if find <span>Campaign0</span>
then replace by <a href="Second.html"><span>Campaign0</span></a>
using Jsoup or any other technology inside Java code?

Your code seems pretty much correct. To find the span elements with "Campaign0", "Campaign1", etc., you can use the JSoup selector "span:containsOwn(Campaign0)". See additional documentation for JSoup selectors at jsoup.org.
After finding the elements and wrapping them with the link, calling doc.html() should return the modified HTML code. Here's a working sample:
input.html:
<table>
<tr>
<td><p><span>101</span></p></td>
<td><p><span>Campaign0</span></p></td>
<td><p><span>unknown</span></p></td>
</tr>
<tr>
<td><p><span>101</span></p></td>
<td><p><span>Campaign1</span></p></td>
<td><p><span>unknown</span></p></td>
</tr>
</table>
Code:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "");
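Element span = doc.select("span:containsOwn(Campaign0)").first();
span.wrap("<a href=\"Second.html\"></a>"); // link target taken from the question
span = doc.select("span:containsOwn(Campaign1)").first();
span.wrap("<a href=\"Second.html\"></a>"); // link target assumed; use the page Campaign1 should point to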
String html = doc.html();
BufferedWriter htmlWriter =
new BufferedWriter(new OutputStreamWriter(new FileOutputStream("output.html"), "UTF-8"));
htmlWriter.write(html);
htmlWriter.close();
output:
<html>
<head></head>
<body>
<table>
<tbody>
<tr>
<td><p><span>101</span></p></td>
<td><p><a href="Second.html"><span>Campaign0</span></a></p></td>
<td><p><span>unknown</span></p></td>
</tr>
<tr>
<td><p><span>101</span></p></td>
<td><p><a href="Second.html"><span>Campaign1</span></a></p></td>
<td><p><span>unknown</span></p></td>
</tr>
</tbody>
</table>
</body>
</html>
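
When the number of campaign cells is not known in advance, the same idea can be applied in a loop. This is a rough sketch, assuming jsoup is available and that every matching span should point to the same page. The names WrapCampaignLinks and linkCampaigns are hypothetical, and the single shared target is an assumption rather than something the question specifies.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WrapCampaignLinks {

    // Wraps every <span> whose own text contains "Campaign" in a link to the given target.
    static String linkCampaigns(String html, String target) {
        Document doc = Jsoup.parse(html);
        // Keep the serialized markup exactly as built, with no added whitespace.
        doc.outputSettings().prettyPrint(false);
        for (Element span : doc.select("span:containsOwn(Campaign)")) {
            span.wrap("<a href=\"" + target + "\"></a>");
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><td><p><span>101</span></p></td><td><p><span>Campaign0</span></p></td></tr>"
                + "<tr><td><p><span>101</span></p></td><td><p><span>Campaign1</span></p></td></tr>"
                + "</table>";
        System.out.println(linkCampaigns(html, "Second.html"));
    }
}
```

Note that containsOwn matches on a substring and ignores case, so a stricter selector may be needed if other spans also mention "Campaign".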

how to extract data inside a specific td in html table using java

I have:
<table class="cast_list">
<tr><td colspan="4" class="castlist_label"></td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000209/?ref_=ttfc_fc_cl_i1" ><img height="44" width="32" alt="Tim Robbins" title="Tim Robbins"src="http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._V379389446_.png"class="loadlate hidden " loadlate="http://ia.media-imdb.com/images/M/MV5BMTI1OTYxNzAxOF5BMl5BanBnXkFtZTYwNTE5ODI4._V1_SY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td class="itemprop" itemprop="actor" itemscope itemtype="http://schema.org/Person">
<a href="/name/nm0000209/?ref_=ttfc_fc_cl_t1" itemprop='url'> <span class="itemprop" itemprop="name">Tim Robbins</span>
</a> </td>
<td class="ellipsis">
...
</td>
How can I get only the information inside the second td (td class="itemprop")? I want to get "/name/nm0000209/?ref_=ttfc_fc_cl_t1" and "Tim Robbins".
This is my code:
Elements elms = doc.getElementsByClass("cast_list").first().getElementsByTag("table");
Elements tds = elms.select("td");
for(Element td : tds){
if(td.attr("class").contains("itemprop")){
Elements links = tds.select("a[href]");
for(Element link : links){
if(link.attr("href").contains("name/nm"))
{
String castname = link.text();
String castImdbId = link.attr("href");
System.out.println("CastName:" + castname + "\n");
System.out.println("CastImdbID:" + castImdbId + "\n");
}
}
}
}
but it also returns the text of the link inside td class="primary_photo", which is empty. This is part of my output:
CastName:
CastImdbID:/name/nm0000209/?ref_=ttfc_fc_cl_i1
CastName:Tim Robbins
CastImdbID:/name/nm0000209/?ref_=ttfc_fc_cl_t1
CastName:
......
Could someone please let me know where my problem is? I think the condition if(td.attr("class").contains("itemprop")) does not work at all.
Thanks,

Use a different CSS selector instead of td. Since the right <td> is identified by its class, why not use it:
td.itemprop
Your Java code would then start like this instead:
Elements tds = elms.select("td.itemprop");
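
Put together, the class-based selector could look like the sketch below. It assumes jsoup is on the classpath and uses a trimmed copy of the question's markup. The names CastList, SAMPLE and cast are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CastList {

    // Trimmed copy of the question's table: one photo cell and one name cell.
    static final String SAMPLE = "<table class=\"cast_list\">"
            + "<tr class=\"odd\">"
            + "<td class=\"primary_photo\"><a href=\"/name/nm0000209/?ref_=ttfc_fc_cl_i1\">"
            + "<img height=\"44\" width=\"32\" alt=\"Tim Robbins\" title=\"Tim Robbins\" /></a></td>"
            + "<td class=\"itemprop\" itemprop=\"actor\">"
            + "<a href=\"/name/nm0000209/?ref_=ttfc_fc_cl_t1\" itemprop=\"url\"> "
            + "<span class=\"itemprop\" itemprop=\"name\">Tim Robbins</span></a></td>"
            + "</tr></table>";

    // Returns "name | href" for each link that sits inside a td with class itemprop.
    static List<String> cast(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("table.cast_list td.itemprop a[href]")) {
            result.add(link.text() + " | " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String entry : cast(SAMPLE)) {
            System.out.println(entry);
        }
    }
}
```

Because the selector starts from td.itemprop, the link in the primary_photo cell is never visited, so no empty cast name is produced.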
