Android using JSOUP for HTML - java

I am completely lost and confused when using JSOUP to parse this HTML document...
I don't mean to just ask for straight-up code, but if someone has the time or can get me started, that would be great...
Here is the document:
http://radar.weather.gov/ridge/RadarImg/N0R/ILN/
If you view the source, I am trying to fetch these lines:
<tr><td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td><td><a href="ILN_20140112_0021_N0R.gif">ILN_20140112_0021_N0R.gif</a></td><td align="right">12-Jan-2014 00:23 </td><td align="right">2.2K</td><td> </td></tr>
As you can see, there are many of these... I need the value of the href attribute in each
<a href=
tag, and I only need it for the first ten of those lines...
As I said, if anyone has the time to help me out, it would be greatly appreciated!

First you need to store the contents of the HTML into a Document (explained more here):
String url = "http://radar.weather.gov/ridge/RadarImg/N0R/ILN/";
Document doc = Jsoup.connect(url).get();
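Next select the Elements from the Document that you want (see here). The following line selects all <a> elements that have an href attribute and whose text contains the String "gif". Note that :contains matches the element's text, not the attribute value; it works here because each link's text is the file name: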
Elements links = doc.select("a[href]:contains(gif)");
Then to print out the value from the first ten, you could just use a loop. The attr() method allows you to extract only the value of a certain attribute, rather than the complete HTML or its text:
// Stop early if the page has fewer than ten matching links
for (int i = 0; i < Math.min(10, links.size()); i++) {
    System.out.println(links.get(i).attr("href"));
}
The output is:
ILN_20140112_0221_N0R.gif
ILN_20140112_0227_N0R.gif
ILN_20140112_0232_N0R.gif
ILN_20140112_0237_N0R.gif
ILN_20140112_0242_N0R.gif
ILN_20140112_0248_N0R.gif
ILN_20140112_0253_N0R.gif
ILN_20140112_0258_N0R.gif
ILN_20140112_0303_N0R.gif
ILN_20140112_0308_N0R.gif
This is essentially the basic methodology for most of the parsing you will do in Jsoup. You should have a go at extracting some other Elements from the page (use this page for reference).
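If you want to match on the href value itself rather than the link text, an attribute selector is one option. The following is a minimal sketch, assuming the listing looks like the sample row from the question; the class name, the helper firstGifHrefs, and the inline HTML are made up for illustration, and the page is parsed from a String so the example runs without a network connection:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FirstGifLinks {

    // Hypothetical helper: returns at most `limit` href values of links ending in ".gif".
    static List<String> firstGifHrefs(String html, int limit) {
        Document doc = Jsoup.parse(html);
        // [href$=.gif] tests the attribute value; :contains(gif) would test the link text
        Elements links = doc.select("a[href$=.gif]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            if (hrefs.size() == limit) {
                break;
            }
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample markup standing in for the directory listing
        String html = "<table>"
                + "<tr><td><a href=\"/ridge/RadarImg/N0R/\">Parent Directory</a></td></tr>"
                + "<tr><td><a href=\"ILN_20140112_0221_N0R.gif\">ILN_20140112_0221_N0R.gif</a></td></tr>"
                + "<tr><td><a href=\"ILN_20140112_0227_N0R.gif\">ILN_20140112_0227_N0R.gif</a></td></tr>"
                + "</table>";
        for (String href : firstGifHrefs(html, 10)) {
            System.out.println(href);
        }
    }
}
```

Because the loop stops at the limit or at the end of the list, it does not throw when the page has fewer than ten links.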

Try this:
String html = "<tr><td><img src='/icons/image2.gif' alt='[IMG]'></td><td><a href='ILN_20140112_0021_N0R.gif'>ILN_20140112_0021_N0R.gif</a></td><td align='right'>12-Jan-2014 00:23</td><td align='right'>2.2K</td><td> </td></tr>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
// value will be "ILN_20140112_0021_N0R.gif"
String value = link.attr("href");

Edit: Refer to @ashatte's solution instead.
// Or whatever your link is; 3000 is the timeout in milliseconds
Document doc = Jsoup.parse(
        new URL("http://radar.weather.gov/ridge/RadarImg/N0R/ILN/"), 3000);
// Counter used to ignore the top 2 rows
int ignoreCount = 0;
for (Element item : doc.select("tr")) {
    // Selects the <tr> elements, so item is a single <tr>
    if (ignoreCount > 1) {
        // Selects the first <a> element in the row
        Element link = item.select("a").first();
        if (link != null && link.hasAttr("href")) {
            // Fetches the href attribute from the selected <a>
            String href = link.attr("href");
        }
    }
    ignoreCount++;
}
This is just one way to do it among many. I strongly suggest you read the jsoup cookbook.

Related

How to select an element with specific text from the HTML a element using JSoup

I have an a element with several attributes. One of them is data-product-id, and that is the one I want.
For example, with data-product-id="002212" I am interested in the number "002212".
My problem is that there can be several a elements like this one.
Here is what the link looks like:
<a href="something.com" title="test tile" class="title-product" data-jsevent="obj:title--product" data-product-name="test" data-product-id="002212" ddata-product-price="1.99" data-product-brand="test" data-product-quantity="1">
I did something like this:
Elements links = document.select("a.title-product");
This gives me every a element with class title-product. How can I now get, from the returned HTML, the element whose data-product-id is my number, 002212?
I can't parse links to a String.
I also tried something like this:
if(links.contains("data-product-id=\"002212\"")){
System.out.println("it works");
} else {
System.out.println("nothing");
}
But links.contains always returns false, even though this number is there.
I also tried this, which works, but I only get the first element (for example, number 002211 instead of 002212):
String linktext = a.attr("data-product-id");
and this, which is null:
String linktext = a.attr("data-product-id=\"002212\"");
Solved: the line below did it.
Elements links = document.select("a[data-product-id=\"002212\"]");
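For anyone landing here, a minimal sketch of both approaches follows: reading data-product-id from every matched link, and selecting one link directly with an attribute selector. The class name, the helper findByProductId, and the sample markup are assumptions made up for illustration. It also hints at why links.contains(...) was always false: Elements is a list of Element objects, so comparing it against a String never matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ProductLinkLookup {

    // Hypothetical helper: first a.title-product whose data-product-id equals productId, or null.
    static Element findByProductId(Document doc, String productId) {
        return doc.select("a.title-product[data-product-id=\"" + productId + "\"]").first();
    }

    public static void main(String[] args) {
        // Sample markup, made up for illustration
        String html =
                "<a href=\"something.com\" class=\"title-product\" data-product-name=\"first\" data-product-id=\"002211\">First</a>"
              + "<a href=\"something.com\" class=\"title-product\" data-product-name=\"second\" data-product-id=\"002212\">Second</a>";
        Document doc = Jsoup.parse(html);

        // Read the attribute from each matched element in turn
        Elements links = doc.select("a.title-product");
        for (Element a : links) {
            System.out.println(a.attr("data-product-id"));
        }

        // Or let the selector pick the one element with the wanted id
        Element match = findByProductId(doc, "002212");
        System.out.println(match != null ? match.attr("data-product-name") : "nothing");
    }
}
```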

How would I obtain only the texts of a webpage that contain necessary keyword using JSoup?

I came up with something like this which didn't work out. I am trying to extract the texts that contain the keyword alone and not the entire text of the webpage just because the webpage has that keyword.
String pconcat = "";
for (int i = 0; i < urls.length; i++) {
    Document doc = Jsoup.connect(urls[i]).ignoreContentType(true).timeout(60 * 1000).get();
    for (int x = 0; x < keyWords.length; x++) {
        if (doc.body().text().toLowerCase().contains(keyWords[x].toLowerCase())) {
            Elements e = doc.select("body:contains(" + keyWords[x] + ")");
            for (Element element : e) {
                pconcat += element.text();
                System.out.println("pconcat" + pconcat);
            }
        }
    }
}
Consider example.com: if the keyword I look for is "documents", I need the output to be "This domain is established to be used for illustrative examples in documents." and nothing else.
You don't need to lowercase the body text in order to use the :contains selector; it is case insensitive. From the documentation:
"elements that contains the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants."
select() is only going to return elements if it finds a match. Again from the documentation:
"elements that match the query (empty if none match)"
You don't need an if statement to check for "documents"; just use CSS selectors to select any element that matches, then do something with the results.
Document doc = Jsoup
        .connect(url)
        .ignoreContentType(true)
        .timeout(60 * 1000)
        .get();
for (String keyword : keywords) {
    String selector = String.format(
            "p:contains(%s)",
            keyword.toLowerCase());
    String content = doc
            .select(selector)
            .text();
    System.out.println(content);
}
Output
This domain is established to be used for illustrative examples in
documents. You may use this domain in examples without prior
coordination or asking for permission.
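To see the case-insensitive matching in isolation, here is a small sketch that runs against an inline HTML string instead of a live page. The class name, the helper paragraphsContaining, and the sample markup are assumptions for illustration. It also assumes the keyword contains no characters that are special in selector syntax, such as parentheses.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class KeywordParagraphs {

    // Hypothetical helper: text of every <p> whose text (or a descendant's text) contains keyword.
    static List<String> paragraphsContaining(String html, String keyword) {
        Document doc = Jsoup.parse(html);
        List<String> matches = new ArrayList<>();
        // :contains is case insensitive, so no toLowerCase() call is needed
        for (Element p : doc.select("p:contains(" + keyword + ")")) {
            matches.add(p.text());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample markup, made up for illustration
        String html = "<body><h1>Example Domain</h1>"
                + "<p>This domain is established to be used for illustrative examples in documents.</p>"
                + "<p><a href=\"#\">More information...</a></p></body>";
        // An upper-case keyword still matches the lower-case word in the page
        for (String text : paragraphsContaining(html, "DOCUMENTS")) {
            System.out.println(text);
        }
    }
}
```

Selecting on p rather than body keeps the result to the matching paragraphs, since body:contains(...) matches the whole body as soon as any descendant contains the keyword.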

How to parse Aria-Label from a div tag attribute?

I am using JSoup to parse some HTML information, and I would like to parse the aria-label value of a specific div attribute. The line I am trying to parse is the following:
<div class="tiny-star star-rating-non-editable-container" aria-label=" Rated 5 stars out of five stars ">
I have used the following:
Document document = Jsoup.connect(url).get();
Elements stars= document.select("div.tiny-star star-rating-non-editable-container[aria-label]");
String value = stars.text();
System.out.println("The rating is " + value);
However, the String value comes back blank. Why is this?
That selector expression won't give you what you expect. It's treated as a two-part selector
div.tiny-star - find a div element with class tiny-star
star-rating-non-editable-container[aria-label] - then look for a descendant star-rating-non-editable-container element which has an aria-label attribute
Try something more like
// select() returns Elements, so take the first match (null if nothing matched)
Element divWithStars = document.select(
        "div.tiny-star.star-rating-non-editable-container[aria-label]").first();
String ariaLabel = divWithStars.attr("aria-label");
Note the dot rather than space between tiny-star and star-rating-..., and also the fact that select returns the element that hosts the aria-label attribute, not the attribute itself - you have to use attr to extract the attribute value.
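The difference between the two selectors can be checked against the div from the question. This is only a sketch: the class name and the helpers ariaLabel and descendantSelectorMatches are made up for illustration, and the HTML is parsed from a String rather than fetched.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AriaLabelDemo {

    // Hypothetical helper: trimmed aria-label of the first div carrying both classes, or "" if none.
    static String ariaLabel(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.select(
                "div.tiny-star.star-rating-non-editable-container[aria-label]").first();
        return div == null ? "" : div.attr("aria-label").trim();
    }

    // Hypothetical helper: number of matches for the selector with a space (descendant combinator).
    static int descendantSelectorMatches(String html) {
        return Jsoup.parse(html)
                .select("div.tiny-star star-rating-non-editable-container[aria-label]")
                .size();
    }

    public static void main(String[] args) {
        String html = "<div class=\"tiny-star star-rating-non-editable-container\""
                + " aria-label=\" Rated 5 stars out of five stars \"></div>";
        System.out.println("The rating is " + ariaLabel(html));
        System.out.println("Matches with a space in the selector: " + descendantSelectorMatches(html));
    }
}
```

The null check matters because first() returns null when nothing matches, and the trim() removes the padding spaces that are part of the attribute value.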

Jsoup element text to textview

I want to display my element's text in a TextView.
Code:
Document doc = Jsoup.parse(myURL);
Elements name = doc.getElementsByClass(".lNameHeader");
for (Element nametext : name) {
    String text = nametext.text();
    tabel1.setText(text);
}
But it displays nothing.
(The site I am parsing: http://roosters.gepro-osi.nl/roosters/rooster.php?leerling=120777&type=Leerlingrooster&afdeling=12-13_OVERIG&tabblad=2&school=905)
Your previous question shows that myURL is a String. In that case you are calling Jsoup.parse(String html), which parses the string itself as HTML instead of fetching the page.
You need the one that takes a URL to make the connection:
Document doc = Jsoup.parse(new URL(myURL), 2000);
Elements name = doc.getElementsByClass("lNameHeader");
Also drop the leading . character from the class name. If you don't wish to specify a timeout you can simply use:
Document doc = Jsoup.connect(myURL).get();
Actually, the class for it is:
lNameHeader
Note that the first letter is not 1 (one); it's l (lowercase letter L).
So it should be:
Elements name = doc.getElementsByClass("lNameHeader");
Note also that jsoup's getElementsByClass method doesn't work like a CSS selector, so the . must be omitted.
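A small sketch of the class-name difference, run against an inline HTML string; the class name, the helper countByClass, and the sample markup are assumptions for illustration and are not taken from the real page:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ClassLookupDemo {

    // Hypothetical helper: how many elements getElementsByClass finds for the given name.
    static int countByClass(String html, String className) {
        Document doc = Jsoup.parse(html);
        return doc.getElementsByClass(className).size();
    }

    public static void main(String[] args) {
        // Sample markup, made up for illustration
        String html = "<span class=\"lNameHeader\">Sample Name</span>";
        Document doc = Jsoup.parse(html);

        // getElementsByClass takes the bare class name
        System.out.println(doc.getElementsByClass("lNameHeader").text());
        // select takes a CSS selector, so the leading dot belongs there
        System.out.println(doc.select(".lNameHeader").text());
        // With the dot, getElementsByClass looks for a class literally named ".lNameHeader"
        System.out.println(countByClass(html, ".lNameHeader"));
    }
}
```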

Jsoup get comment before element

Say I have this html:
<!-- some comment -->
<div class="someDiv">
... other html
</div>
<!-- some comment 2 -->
<div class="someDiv">
... other html
</div>
I'm currently getting all divs where class == someDiv and scraping them for information. To do that I'm simply doing this:
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select(".someDiv");
for (Element element : elements) {
//scrape stuff
}
Within the for loop, is there any way to get the comment tag found before the particular div.someDiv element I'm on?
If this isn't possible, should I go about parsing this html structure differently with this requirement?
Thanks for any advice.
Though this question is a few months old, here is my answer for completeness. How about using previousSibling to get the preceding Node? Of course, in real code you probably want to check whether you really get a Comment there.
String html = "<!-- some comment --><div class=\"someDiv\">... other html</div><!-- some comment 2 --><div class=\"someDiv\">... other html</div>";
Document doc = Jsoup.parseBodyFragment(html);
Elements elements = doc.select(".someDiv");
for (Element element : elements) {
System.out.println(((Comment) element.previousSibling()).getData());
}
This produces:
some comment
some comment 2
(tested with jsoup 1.6.1 and 1.6.3)
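The check mentioned above could look roughly like this. It is a sketch, not a definitive implementation: the class name and the helper precedingComment are made up for illustration, and it assumes the only thing that can sit between the comment and the div is whitespace, which shows up as a blank TextNode when the HTML has line breaks.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class PrecedingCommentDemo {

    // Hypothetical helper: text of the comment directly before element, skipping
    // whitespace-only text nodes; null if the nearest sibling is not a comment.
    static String precedingComment(Element element) {
        Node node = element.previousSibling();
        while (node instanceof TextNode && ((TextNode) node).isBlank()) {
            node = node.previousSibling();
        }
        return (node instanceof Comment) ? ((Comment) node).getData().trim() : null;
    }

    public static void main(String[] args) {
        String html = "<!-- some comment -->\n<div class=\"someDiv\">... other html</div>\n"
                + "<!-- some comment 2 -->\n<div class=\"someDiv\">... other html</div>";
        Document doc = Jsoup.parseBodyFragment(html);
        for (Element element : doc.select(".someDiv")) {
            System.out.println(precedingComment(element));
        }
    }
}
```

The instanceof test replaces the unchecked cast, so a div without a comment in front of it yields null instead of a ClassCastException or NullPointerException.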
Try something like this: iterate over all comments and check whether their sibling is the div you are after.
for (int i = 0; i < doc.childNodes().size(); i++) {
    Node child = doc.childNode(i);
    if (child.nodeName().equals("#comment")) {
        // Do some checking on child.nextSibling(), such as hasAttr or attr, to figure out whether it is the div you were expecting
    }
}
Take a look at the jsoup Node docs
Elements elements = doc.select("div.someDiv");
http://jsoup.org/cookbook/
