JSOUP - Extract data directly in a specific format from a page - java

I am currently experimenting with jsoup and my goal is to extract data from this retail website, in the form of:
Title: blabl
Link: foba
Grösse: 9999
KP: FALSE
Miete: TRUE
Preis: 1923,23
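So far I have written this test program:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class jsoup_test {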
public static void main(String[] args) throws IOException {
String url = "http://derstandard.at/anzeiger/immoweb/Suchergebnis.aspx?Regionen=9&Bezirke=&Arten=&AngebotTyp=&timestamp=1363305908912";
print("Fetching %s...", url);
Document doc = Jsoup.connect(url).get();
Elements price = doc.select("tr.topangebot");
Elements price1 = doc.select("tr.white");
System.out.println("--------------------------------");
System.out.println(price);
System.out.println("--------------------------------");
System.out.println(price1);
}
private static void print(String msg, Object... args) {
System.out.println(String.format(msg, args));
}
}
However, this program prints my data like this:
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl02_InseratInfoTR" class="topangebot">
<td class="BildTD" rowspan="2"> <img border="0" src="http://images.derstandard.at/t/22/upload/imagesanzeiger/immoupload/2013/02/27/277515f7-f935-4a13-83fb-dbe3af930e28.jpg" alt="" /> </td>
<td class="TitleTD" rowspan="2"> <span class="neu">TOP!</span> <strong>Gehobene Qualität, Design und exquisite Ausführung: Dachausbau mit Weitblick und 100 m² Terrasse</strong><br />Wien 16.,Ottakring, Dachgeschoss<br /><span style="color: gray">Erstbezug, Küche, Parkettboden, Hauptmiete, Terrasse, Lift, Keller, Altbau, Kabel/Sat-TV, Barrierefrei</span> </td>
<td class="GroessenTD" rowspan="2"> <span class="strong">125 m²</span><br /><span class="strong">4 </span>Zimmer </td>
<td class="PreisTD" style="border:none;"> <span class="light">Miete</span> 2.190 <br /> </td>
</tr>
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl02_MerklisteTR" class="topangebot">
<td class="merkliste"> </td>
</tr>
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl03_InseratInfoTR" class="topangebot">
<td class="BildTD" rowspan="2"> <img border="0" src="http://images.derstandard.at/t/22/upload/imagesanzeiger/immoimporte/justimmo2/files.justimmo.at/public/pic/big/AEs_YegpKC.JPG" alt="" /> </td>
<td class="TitleTD" rowspan="2"> <span class="neu">TOP!</span> <strong>HS-IMMO: 14. PREISSENSATION Eckzinshaus 1414m² Leerstand - Gesamtnutzfläche 1670m² + Rohdachboden ca. 700m² erzielbar ( Baubescheid ) € 1555.-/m² NFL</strong><br />Wien 14.,Penzing, Zinshaus<br /><span style="color: gray">Parkettboden, Altbau, Kabel/Sat-TV</span> </td>
<td class="GroessenTD" rowspan="2"> <span class="strong">1.670 m²</span><br /> </td>
<td class="PreisTD" style="border:none;"> <span class="light">KP</span> 2.590.000 <br /> </td>
</tr>...
This is not in a human-readable format. So my question is: how can I get jsoup to extract the data directly in the format I want?
Thanks for your replies!

For example, to select the title you can do something like this:
String title = doc.select("tr.topangebot > td.TitleTD").first().text();
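To get the whole "Key: value" layout from the question, the same idea can be repeated per row. The following is only a sketch: the class names (TitleTD, GroessenTD, PreisTD, light, strong) are taken from the markup printed in the question, while ListingFormatter and format are made-up names, and the live page may be structured differently. The printed excerpt contains no anchor, so the Link line is left out here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup.
class ListingFormatter {

    /** Builds one text block per listing row in the requested "Key: value" layout. */
    static List<String> format(Document doc) {
        List<String> blocks = new ArrayList<>();
        // Rows without a title cell (the "MerklisteTR" rows) are not listings, so they are skipped.
        for (Element row : doc.select("tr:has(td.TitleTD)")) {
            String title = row.select("td.TitleTD > strong").text();

            // The first span.strong holds the area; a second one, if present, holds the room count.
            Elements sizeSpans = row.select("td.GroessenTD > span.strong");
            String size = sizeSpans.isEmpty() ? "" : sizeSpans.first().text();

            // span.light holds "Miete" or "KP"; the amount is the cell's own text.
            String priceType = row.select("td.PreisTD > span.light").text();
            Elements priceCells = row.select("td.PreisTD");
            String price = priceCells.isEmpty() ? "" : priceCells.first().ownText();

            blocks.add(String.join("\n",
                    "Title: " + title,
                    "Grösse: " + size,
                    "KP: " + (priceType.equals("KP") ? "TRUE" : "FALSE"),
                    "Miete: " + (priceType.equals("Miete") ? "TRUE" : "FALSE"),
                    "Preis: " + price));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Shortened copy of the markup from the question, wrapped in <table> so the parser keeps the rows.
        String html = "<table>"
                + "<tr class=\"topangebot\">"
                + "<td class=\"TitleTD\"><span class=\"neu\">TOP!</span> <strong>Dachausbau</strong><br>Wien 16.</td>"
                + "<td class=\"GroessenTD\"><span class=\"strong\">125 m²</span><br><span class=\"strong\">4 </span>Zimmer</td>"
                + "<td class=\"PreisTD\"><span class=\"light\">Miete</span> 2.190 <br></td>"
                + "</tr>"
                + "<tr class=\"topangebot\"><td class=\"merkliste\"> </td></tr>"
                + "</table>";
        for (String block : format(Jsoup.parse(html))) {
            System.out.println(block);
            System.out.println();
        }
    }
}
```

For the real page you would pass Jsoup.connect(url).get() to format instead of the parsed string. The price is kept as the text shown on the page; turning it into a number would need locale-aware parsing.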

You can navigate the page using the DOM if you know the page structure:
http://jsoup.org/cookbook/extracting-data/dom-navigation
This question lists a number of good web scrapers:
Web scraping with Java

I like to use Jsoup because its methods were built for DOM traversal. So, if you are good at HTML, CSS, and jQuery, this library was built for you. Yes, the Jsoup approach may be too fast. Yes, it may not suit your needs. But when it comes to gathering any type of information from any type of website, Jsoup is flexible enough to meet your needs.

Related

Replace GET request by POST without form in JSP

So I have a JSP page that shows data from a table and builds a GET request to render more data on another page when one of the table rows is clicked.
The problem is that I have to change it to a POST request, so that the information does not appear in the request URL.
I know how to use POST with a form, but here I have to take the data from a table row and not from a form.
Any idea how to do that? I'm new to JSP, so I don't know how to do it.
<table border=0 bgcolor=#92ADC2 cellspacing=1 cellpadding=3 width=95% align=center>
<tr class=entete>
<td class=texte8 align=center> <spring:message code="nom"/></td>
<td class=texte8 align=center> <spring:message code="date_naissance"/></td>
<td class=texte8 align=center> <spring:message code="numero"/></td>
</tr>
<%
String v_Person = "";
String v_date = "";
String v_numero = "";
for (int i = 0; i < PersonListeBean.getPerson(); i++)
{
Gen_rechBean cb = PersonListeBean.getPerson(i);
v_Person = cb.getname();
v_date=cb.getdate();
v_numero=cb.getNumero();
}
%>
<tr class="<%=class_cell%>" onMouseOver="this.className='over';" onMouseOut="this.className='<%=class_cell%>';" onclick="javascript:parent['gauche'].document.location='ResultServlet?name=<%=v_Person%>&numero=<%=v_numero%>&date_naissance=<%=v_date%>">
<td class=texte7 align=left > <%=cb.getname()%></td>
<td class=texte7 align=left > <%=cb.getdate()%></td>
<td class=texte7 align=left > <%=cb.getNumero()%></td>
</tr>
</table>
<br>
<table width="95%" align="center" border="0" cellspacing="0" cellpadding="0">
<tr>
<td align="right">
<a target="corps" href="rechResult.jsp" class="rub2" </a>
</td>
</tr>
</table>
I see what you are trying to do.
The easiest way is still to use a form. You can call a JavaScript function when the row you want is clicked:
<tr onclick="myMethod()">
The function fills a hidden form and submits it. That way you are redirected without sending the data in your URL.
A basic example could be:
(Assuming these elements are printed server-side)
<tr onclick="myMethod('<%=getName()%>', '<%=getDate()%>', '<%=getNumero()%>')">
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<form id="myForm" action="targetFile.jsp" method="post">
<!-- Hidden inputs, so users do not see or edit these fields -->
<input type="hidden" name="name" id="data1">
<input type="hidden" name="date" id="data2">
<input type="hidden" name="numero" id="data3">
</form>
<script>
function myMethod(data1, data2, data3){
// This uses jQuery, a JavaScript library that makes this shorter to write
// Fill the form
$("#data1").val(data1);
$("#data2").val(data2);
$("#data3").val(data3);
// Submit it (an id selector needs the leading #)
$("#myForm").submit();
}
</script>
Let me know if it was helpful.
c:

Regex replacement to remove whitespace between html tags

I'm currently working with HTML built from a mustache/handlebars template.
The goal is to take the text after handlebars generates it and reduce its size by removing unnecessary whitespace characters, but keeping attribute values and content of tags intact.
Consider the following as an example:
</p> </td> </tr> <tr> <td>
should become:
</p></td></tr><tr><td>
And:
<p align="left"> Untouchable text </p> </td> </tr>
should become:
<p align="left"> Untouchable text </p></td></tr>
You can use replaceAll(">\\s+<", "><") as shown below:
public class Main {
public static void main(String[] args) {
String s = "<p align=\"left\"> Untouchable text </p> </td> </tr>";
System.out.println(s.replaceAll(">\\s+<", "><"));
}
}
Output:
<p align="left"> Untouchable text </p></td></tr>
Note:
See the documentation of String::replaceAll to learn more.
The regex \\s+ (as written in a Java string literal) matches one or more whitespace characters.
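If the replacement runs on many templates, the same regex can be compiled once and reused. This is a small sketch of that variant; HtmlWhitespace and collapse are made-up names, and the pattern is the one from the answer above. Note that it also removes whitespace between tags inside elements such as pre, where whitespace can matter.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; only java.util.regex from the standard library is used.
class HtmlWhitespace {

    // Compiled once so repeated calls do not re-parse the regex.
    private static final Pattern BETWEEN_TAGS = Pattern.compile(">\\s+<");

    /** Removes whitespace that sits directly between a closing '>' and the next '<'. */
    static String collapse(String html) {
        return BETWEEN_TAGS.matcher(html).replaceAll("><");
    }

    public static void main(String[] args) {
        System.out.println(collapse("</p> </td> </tr> <tr> <td>"));
        // prints: </p></td></tr><tr><td>
        System.out.println(collapse("<p align=\"left\"> Untouchable text </p> </td> </tr>"));
        // prints: <p align="left"> Untouchable text </p></td></tr>
    }
}
```

Text inside a tag, such as " Untouchable text ", is left alone because the whitespace there is not enclosed by ">" on one side and "<" on the other.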

Trouble parsing table class with href in Jsoup

I am very new to JSOUP, have only been using it for a couple days, learning mostly from this website. Now I'm trying to get some information from the below HTML:
<td class="day no-repetition">Sun</td>
<td class="full-date" nowrap="nowrap">17/05/15</td>
<td class="competition">PRL</td>
<td class="team team-a ">
<a href="/teams/england/sunderland-association-football-club/683/" title="Sunderland">
Sunderland
</a>
</td>
<td class="score-time score">
<a href="/matches/2015/05/16/england/premier-league/sunderland-association-football-club/leicester-city-fc/1704225/" class="result-draw">
0 - 0
</a>
</td>
<td class="team team-b ">
<a href="/teams/england/leicester-city-fc/682/" title="Leicester City">
Leicester City
</a>
</td>
<td class="events-button button first-occur">
</td>
<td class="info-button button">
More info
</td>
I need to extract the Home team, score and the Away Team from the above however I am currently having issues with this. I need both the link and the text itself. Below is the code I have:
try {
Document doc = Jsoup.connect(URL).get();
Element table = doc.select("table[class=matches]").first();
Elements rows = table.select("tr");
for (int i=0; i<rows.size(); i++){
Element row = rows.get(i);
Elements data = row.select("td[class=team.team-a]");
System.out.println(data.text());
}
} catch (IOException e) {
e.printStackTrace();
}
This hasn't worked so far. I tried 'team.team-a', 'team.team.a' and other variations of it. I can get the data in the 'competition' class by replacing "td[class=team.team-a]" with "td[class=competition]", but this doesn't work with any of the cells that contain a link.
Assistance would be highly appreciated!
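Just separate multiple classes with a dot. An attribute selector such as td[class=team.team-a] compares the whole value of the class attribute, so it does not match a cell that carries several classes:
td.team.team-a > a // first team
td.team.team-b > a // second team
td.score > a // score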
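Plugged into a loop like the one in the question, those selectors could be used roughly as follows. This is a sketch under assumptions: table.matches comes from the question's code, the cell markup is a shortened copy of the posted HTML, and MatchRowParser and describe are made-up names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup.
class MatchRowParser {

    /** Returns one line per match row: "home score away [homeHref, scoreHref, awayHref]". */
    static List<String> describe(Document doc) {
        List<String> lines = new ArrayList<>();
        for (Element row : doc.select("table.matches tr")) {
            Element home = row.select("td.team.team-a > a").first();
            Element score = row.select("td.score > a").first();
            Element away = row.select("td.team.team-b > a").first();
            if (home == null || score == null || away == null) {
                continue; // header or spacer row without the three links
            }
            // text() gives the visible text, attr("href") the link exactly as written in the page.
            lines.add(home.text() + " " + score.text() + " " + away.text()
                    + " [" + home.attr("href") + ", " + score.attr("href") + ", " + away.attr("href") + "]");
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<table class=\"matches\"><tr>"
                + "<td class=\"team team-a \"><a href=\"/teams/england/sunderland-association-football-club/683/\"> Sunderland </a></td>"
                + "<td class=\"score-time score\"><a href=\"/matches/2015/05/16/england/premier-league/sunderland-association-football-club/leicester-city-fc/1704225/\" class=\"result-draw\"> 0 - 0 </a></td>"
                + "<td class=\"team team-b \"><a href=\"/teams/england/leicester-city-fc/682/\"> Leicester City </a></td>"
                + "</tr></table>";
        describe(Jsoup.parse(html)).forEach(System.out::println);
    }
}
```

The hrefs in this markup are relative. When the document is fetched with Jsoup.connect(URL).get(), absUrl("href") can be used instead of attr("href") to resolve them against the page URL.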

Selecting a checkbox based on a string in Selenium

After entering a string into a table with a checkbox next to it, I would like to click on the checkbox. In Selenium, how can I iterate through the table, search for a particular text, and then check the checkbox next to it?
Here's the html of the table:
<tbody>
<tr class="keyword-list-item">
<td width="75%">
<input class="keyword-selection-checkbox" type="checkbox" data-id="gw_78669090303"/>
<span>+spatspatulalas</span>
</td>
<td width="25%" style="text-align: right; padding-right: 4px;">
<span class="icon iconGoogle"/>
</td>
</tr>
<tr class="keyword-list-item">
<td width="75%">
<input class="keyword-selection-checkbox" type="checkbox" data-id="gw_102731166303"/>
<span>12.10 test post</span>
</td>
<td width="25%" style="text-align: right; padding-right: 4px;">
<span class="icon iconGoogle"/>
</td>
</tr>
You can use XPath for this; you just need to be a little careful about how you write the expression. The following XPath finds the checkbox through the text of the span next to it:
String text = "12.10 test post";
By xpath = By.xpath("//span[contains(text(),'" + text + "')]/../input");
WebElement element = driver.findElement(xpath);
// Tick the checkbox only if it is not already selected
if (!element.isSelected()) {
element.click();
}
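One caveat with building the expression by string concatenation: if the keyword itself contains an apostrophe, the XPath becomes invalid, because XPath 1.0 string literals have no escape character. A common workaround is to pick the other quote type or fall back to concat(). The helper below is only a sketch with made-up names (XPathLiteral, quote, checkboxXPath) and uses no Selenium classes; its result would be passed to By.xpath.

```java
// Hypothetical helper; plain string handling only.
class XPathLiteral {

    /** Quotes an arbitrary string as an XPath 1.0 string literal. */
    static String quote(String s) {
        if (!s.contains("'")) {
            return "'" + s + "'";
        }
        if (!s.contains("\"")) {
            return "\"" + s + "\"";
        }
        // Both quote types occur: split on ' and glue the pieces together with concat().
        StringBuilder sb = new StringBuilder("concat(");
        String[] parts = s.split("'", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(", \"'\", ");
            }
            sb.append("'").append(parts[i]).append("'");
        }
        return sb.append(")").toString();
    }

    /** Builds the same expression as in the answer, with the text safely quoted. */
    static String checkboxXPath(String text) {
        return "//span[contains(text()," + quote(text) + ")]/../input";
    }

    public static void main(String[] args) {
        System.out.println(checkboxXPath("12.10 test post"));
        System.out.println(checkboxXPath("it's a test"));
    }
}
```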

get table span class content using jsoup

I have a website that contains a table that looks similar to this one (the real one is bigger):
</table>
<tr>
<td>
<table width="100%" cellspacing="-1" cellpadding="0" border="0" dir="rtl" style="padding-top: 25px;">
<tr>
<td align="right" style="padding-right: 25px;">
<span class="artist_name_txt">
name
<p class="diccografia">subname</p>
</span>
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>
<table width="100%" border="0" cellspacing="0" cellpadding="0" dir="rtl" style="padding-right: 25px; padding-left: 25px">
<tr>
<td class="songs" align="right">
number1
</td>
</tr>
<tr>
<td class="songs" align="right">
number2
.......
</td>
</tr>
</table>
and I need an idea of how I can parse the website and extract this table into two arrays:
one will be something like names{number1, number2}
and the second will be links{number1link, number2link}
I tried a lot of ways and nothing really helped.
You should read the JSoup Cookbook - especially the Selector syntax is very powerful.
Here's an example:
final String html = ...
// use Jsoup.connect(url).get() instead if you fetch the page from a website
Document doc = Jsoup.parse(html);
List<String> names = new ArrayList<>();
List<String> links = new ArrayList<>();
for( Element element : doc.select("a.artist_player_songlist") )
{
names.add(element.text());
links.add(element.attr("href"));
}
System.out.println("Names: " + names);
System.out.println("Links: " + links);
Output:
Names: [number1, number2]
Links: [/number1link, /number2link]