Using Jsoup to select classes and id - java

I am using this as an example
http://www.shopping.com/digital-camera/products?CLT=SCH&KW=digital+camera
In the page linked above there is an element with this class:
<span class="numTotalResults">
Results 1 - 40 of 1500+
</span>
I got it using
Document query_result = Jsoup.connect("http://www.shopping.com")
.data("CLT", "digital camera")
.post();
but when I run
System.out.println(query_result.select(".numTotalResults"));
System.out.println(query_result.select("#quickLookItem-1"));
System.out.println(query_result.select("[name=D0]"));
nothing is printed,
while
System.out.println(query_result);
System.out.println(query_result.select("span"));
clearly print the values.
The selector seems to work only with div, span and anchor tags, but I can't select by class or by id.
Can someone help me?
Thanks
Edit:
It seems like the post did not go through. I don't quite understand why it didn't.

Instead of a POST request, try a GET request:
Document query_result = Jsoup.connect("http://www.shopping.com/digital-camera/products?CLT=SCH&KW=digital+camera")
.get();
Take a look at how this search works. It doesn't use the POST method; it keeps all search parameters in the query string. After this small change your first select example will work.
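As a rough sketch of what "keeping the search parameters in the query string" means, the following builds the same URL from its parts using only the JDK. The class and method names (SearchUrl, encode, withQuery) are made up for illustration; the parameter names CLT and KW are taken from the URL in the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Jsoup: builds a GET URL with an encoded query string.
final class SearchUrl {

    // URL-encodes one key or value; a space becomes '+', as in the question's URL.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Appends the parameters to baseUrl in the map's iteration order.
    static String withQuery(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            query.add(encode(p.getKey()) + "=" + encode(p.getValue()));
        }
        return baseUrl + "?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("CLT", "SCH");
        params.put("KW", "digital camera");
        // Prints the URL that is passed to Jsoup.connect(...).get() above.
        System.out.println(withQuery("http://www.shopping.com/digital-camera/products", params));
    }
}
```

The resulting string can be handed to Jsoup.connect(...).get(). Chaining .data("KW", "digital camera") and ending with .get() instead of .post() should have the same effect, since for GET requests the data ends up in the query string.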

Related

Going to next page on an aspx form with JSoup

I'm trying to go to the next page on an aspx form using JSoup.
I can find the next button itself. I just don't know what to do with it.
The idea is that, for that particular form, if the next button exists, we would simulate a click and go to the next page. But any other solution other than simulating a click would be fine, as long as we get to the next page.
I also need to update the results once we go to the next page.
// Connecting, entering the data and making the first request
...
// Submitting the form
Document searchResults = form.submit().cookies(resp.cookies()).post();
// reading the data. Everything up to this point works as expected
...
// finding the next button (this part also works as expected)
Element nextBtn = searchResults.getElementById("ctl00_MainContent_btnNext");
if (nextBtn != null) {
// click? I don't know what to do here.
searchResults = ??? // updating the search results to include the results from the second page
}
The page itself is www.somePage.com/someForm.aspx, so I can't use the solution stated here:
Android jsoup, how to select item and go to next page
I was unable to find any other suggestions.
Any ideas? What am I missing? Is simulating a click even possible with JSoup? The documentation says nothing about it. But I'm sure people are able to navigate this type of form.
Also, I'm working with Android, so I can't use HtmlUnit, as stated here:
importing HtmlUnit to Android project
Thank you.
This is not a job for Jsoup! Jsoup is a parser with a nice DOM API that allows you to deal with wild HTML as if it were well-formed and not riddled with errors and nonsense.
In your specific case you may be able to scrape the target site directly from your app by finding links and retrieving HTML pages recursively. Something like
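// Needs java.io.IOException plus org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.nodes.Element
private void scrape(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    // Analyze the current document's content here...
    // Then continue. The button in the question is identified by id, so select it
    // with '#'; this only helps if it is a plain link with an href attribute.
    for (Element link : doc.select("#ctl00_MainContent_btnNext[href]")) {
        // absUrl resolves a relative href against the page URL
        scrape(link.absUrl("href"));
    }
}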
But in the general case what you want to do requires far more functionality than Jsoup provides: a user agent capable of interpreting HTML, CSS and JavaScript, with a scriptable API that you can call from your app to simulate a click. For example, Selenium:
WebDriver driver = new FirefoxDriver();
driver.findElement(By.name("next_page")).click();
Selenium can't be bundled in an Android app, so I suggest you put your Selenium code on a server and make it accessible with some REST API.
Pagination on ASPX can be a pain. The best thing you can do is to use your browser to see the data parameters it sends to the server, then try to emulate this in code.
I've written a detailed tutorial on how to handle it here but it uses the univocity HTML parser (which is commercial closed source) instead of JSoup.
In short, you should try to get a <form> element with id="aspnetForm", and read the form elements to generate a POST request for the next page. The form data usually comes out with stuff such as this:
__EVENTTARGET =
__EVENTARGUMENT =
__VIEWSTATE = /wEPDwUKMTU0OTkzNjExNg8WBB4JU29ydE9yZ ... a very long string
__VIEWSTATEGENERATOR = 32423F7A
... and other gibberish
Then you need to look at each one of these and compare it with what your browser sends. Sometimes you need to get values from other elements of the page to generate a similar POST request. You may have to REMOVE some of the parameters you get. Again, make your code behave exactly the same as your browser.
After some (frustrating) trial and error you will get it working. The server should return a pipe-delimited result, which you can break down and parse. Something like:
25081|updatePanel|ctl00_ContentPlaceHolder1_pnlgrdSearchResult|
<div>
<div style="font-weight: bold;">
... more stuff
|__EVENTARGUMENT||343908|hiddenField|__VIEWSTATE|/wEPDwU... another very long string ...1Pni|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A| other gibberish
From THAT sort of response you need to generate new POST requests for the subsequent pages, for example:
String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");
Then:
request.setDataParameter("__VIEWSTATE", viewState);
There will be more data parameters to get from each response, but a lot depends on the site you are targeting.
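The substringBetween call above is not defined in the answer. A minimal JDK-only sketch of such a helper could look like this; the class name ViewStateExtractor and the shortened sample response are made up for illustration, and a real response will be far longer.

```java
// Hypothetical helper: pulls hidden-field values out of a pipe-delimited response.
final class ViewStateExtractor {

    // Returns the text between the first occurrence of 'open' and the next 'close'
    // after it, or null if either marker is missing.
    static String substringBetween(String text, String open, String close) {
        int start = text.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = text.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Shortened, invented sample in the same shape as the response shown above.
        String ajaxResponse = "0|hiddenField|__EVENTARGUMENT||12|hiddenField|__VIEWSTATE|"
                + "/wEPDwUKMTU0|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A|";
        System.out.println(substringBetween(ajaxResponse, "__VIEWSTATE|", "|"));
        System.out.println(substringBetween(ajaxResponse, "__VIEWSTATEGENERATOR|", "|"));
    }
}
```

Including the trailing pipe in the opening marker matters: "__VIEWSTATE|" cannot match inside "__VIEWSTATEGENERATOR|", so the two fields stay distinct. This simple search assumes the value itself contains no pipe character.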
Hope this helps a little.

Using jsoup for extracting attributes from "a" inside "span" inside "class" for sports software

I've been reading all the questions I could find regarding jsoup and attributes, classes, spans and so on, but none could help me get this data from this website.
I am working on some sports software and retrieve match data from the site soccer24.com,
and now I want to get more data from specific match pages (win-lose history),
so I need either the last scores or, even better, the "win" or "lose" result.
The scores are written like this:
<td class="" style="cursor: pointer;"><span class="score"><strong>2 : 1</strong></span></td>
Here I could work with the "2 : 1".
This is what I'm trying right now:
Elements wl =docl.select("span.score");
System.out.println(wl);
for(Element w :wl){
System.out.println(w.ownText());
}
The result is written like this:
<td class="winLose" style="cursor: pointer;"><span class="winLoseIcon"><a title="Win" class="form-bg-last form-w"><span></span></a></span></td>
Here I would need the "Win" from the a title.
I've really tried everything but can't extract it. I would be really grateful for any help. And before I make it another question: I would also need the odds movement.
I get the final odds, but the movements are written like this:
<span class="up" alt="1.73[u]1.75">1.75</span>
so I need the "alt" attribute.
If I could get all these things it would be awesome. I know it's not a big deal for you, but I've been trying for hours now and this is really my last resort.
Thanks in advance :)
If I understand your question correctly, you want to extract an attribute from an element? If so, select the element and call attr(...) on it with the attribute's name.
EDIT:
Now it seems your real issue is not JSOUP parsing, but getting the content.
The link contains #h2h;overall, which means the data is not in the initial response from the server. The page makes an AJAX request after it loads, to another URL (http://d.soccer24.com/x/feed/d_hh_K2AUJ0ih_en_2).
When I checked the response, I found that the page repeatedly calls the server and updates the result. Both the request and the response are encrypted. The following updated code should display the correct results.
// ** Test Data
//Document doc = Jsoup.parse("<html><body><h1></h1><table>"
// + "<td class=\"winLose\" style=\"cursor: pointer;\"><span class=\"winLoseIcon\"><a title=\"Win\" class=\"form-bg-last form-w\"><span></span></a></span></td>"
// + "<span class=\"up\" alt=\"1.73[u]1.75\">1.75</span>" + "</table></body></html>");
//
Connection con = Jsoup.connect("http://d.soccer24.com/x/feed/d_hh_K2AUJ0ih_en_2");
con.header("X-Fsign", "SW9D1eZo");
Document doc = con.get();
//Your code
Elements elems=doc.select("td.winLose > span.winLoseIcon > a[title]");
for(Element elem:elems){
System.out.println(elem.attr("title"));
}
Similarly for odds:
Elements odds = doc.select("span.up[alt]");
for (Element elem : odds) {
    System.out.println(elem.attr("alt"));
}
RESULT:
..Lots of lines Win | Lose | Draw..
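Once elem.attr("alt") returns a value such as 1.73[u]1.75, it still has to be split into numbers. The sketch below does that with a regular expression. It assumes the format is "previous odds, a one-letter direction marker in square brackets, current odds"; that reading is a guess from the single sample in the question, and the class name OddsMovement is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for alt values such as "1.73[u]1.75" (format assumed, see above).
final class OddsMovement {

    // Group 1: previous odds, group 2: direction marker, group 3: current odds.
    private static final Pattern ALT =
            Pattern.compile("([0-9.]+)\\[([a-z])\\]([0-9.]+)");

    // Returns {previous, current}; throws if the value does not match the assumed format.
    static double[] parse(String alt) {
        Matcher m = ALT.matcher(alt);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected alt format: " + alt);
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(3))
        };
    }

    public static void main(String[] args) {
        double[] odds = parse("1.73[u]1.75");
        System.out.println("previous=" + odds[0] + ", current=" + odds[1]);
    }
}
```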

Getting Started With Android & JSOUP

I am currently attempting to make an Android application and have come to the conclusion that I must use JSOUP to finish it. I am using JSOUP to extract data from the Internet and then post it on my app.
What I am trying to figure out is how to extract multiple bits of data from the url and then use each one of them inside of their own XML String TextView (If that is correct?)
Here is a snippet of the HTML I am trying to extract.
<a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m12296&MI=122&RN=BACON TURKEY SLICED" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m12296&MI=122&RN=BACON TURKEY SLICED', 'RDA_window', 'width=450, height=600, scrollbars=no, toolbar=no, directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">BACON TURKEY SLICED</a>
I am trying to extract the words BACON TURKEY SLICED
The problem is I do not understand JSOUP at all. Like I have an idea about it but I can't seem to practically use it and all that. I was wondering if someone could try and give me a push in the right direction.
Also, I have tried reading the cookbook, to no avail.
If anyone could help, thank you so much!
EDIT
Here are two more. I believe they are the exact same thing.
<a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m4903&MI=122&RN=STATION OMELET" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m4903&MI=122&RN=STATION OMELET', 'RDA_window', 'width=450, height=600, scrollbars=no, toolbar=no, directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">STATION OMELET</a>
<a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m784&MI=122&RN=CEREAL HOT GRITS" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m784&MI=122&RN=CEREAL HOT GRITS', 'RDA_window', 'width=450, height=600, scrollbars=no, toolbar=no, directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">CEREAL HOT GRITS</a>
So, this answer is going to assume that you are interested in:
<a href=".." >TEXT YOU WANT</a>
All these <a> tags have the class "recipeLink".
Given your example, here as a String:
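// The link from the question, with the OnClick attribute left out for brevity
String tastyTurkeySandwich = "<a href=\"http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m12296&MI=122&RN=BACON TURKEY SLICED\" class=\"recipeLink\">BACON TURKEY SLICED</a>";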
You can extract the (first) text with the following code:
Document doc = Jsoup.parse(tastyTurkeySandwich);
Elements links = doc.select("a[href].recipeLink");
// This will just print the text in the first one
System.out.println(links.first().text());
To iterate over an Elements (which implements the Iterable interface) instance:
for (Element link : links) {
// Calling link.text() will return BACON TURKEY SLICED etc. etc.
System.out.println(link.text());
}
In short:
a[href] will match all the <a> tags that have an href attribute.
The .recipeLink part will filter that selection to only include links that have the recipeLink class.

JSoup - Select only one listobject

I'm trying to extract some certain data from a website using JSoup and Java. So far I've been successful in what I'm trying to achieve.
<ul class="beverageFacts">
<li><span>Årgång</span><strong>**2009** </strong></li>
I want to extract what is inside the ** in the above HTML. I can do this by using the code that follows in JSoup:
doc.select("ul.beverageFacts li:lt(1) strong");
I'm using the lt(1) because there are several more list items following that I want to omit.
Now to my problem; there's an optional information tab on the site I'm extracting data from, and it also has a class called "beverageFacts". My code will at the moment extract that data too, which I don't want it to do.
The code is further down in the source of the website, and I've tried to use the indexer :lt(1) here as well, but it won't work.
<div id="beverageMoreFacts" style="display: block">
<ul class="beverageFacts"><li class="half">
<span> Färg</span><strong> Ljusgul färg.</strong>
My overall result is that I extract "2009 Ljusgul färg." instead of only "2009". How can I write my code so it will only extract the first part, which it successfully does, and omit the rest?
EDIT:
I get the same result using:
doc.select("ul.beverageFacts li:eq(0) strong");
Thanks,
Z
You are qualifying only one part, whereas you should qualify both. Try this:
doc.select("ul.beverageFacts:eq(0) li:eq(0) strong");
What you are saying is: give me the first list item of each list of beverages. What you need to say instead is: Give me the first item of the first list of beverages.

A weird problem happened when parsing a html page using HTMLParser

I was parsing a web page using HTMLParser in Java, I met a weird problem when using class HasAttributeFilter.
The element I wanna parse in the page is <span style="vertical-align: middle;"></span>, so the expression should be HasAttributeFilter filter = new HasAttributeFilter("style", "vertical-align: middle;");, right? Yeah, I used this exp, but it DIDN'T WORK! BUT I am sure there IS the node in the page
After that, I applied some other exp, such as HasAttributeFilter filter = new HasAttributeFilter("class", "singlecolumnminwidth"); to the same page, and also, the node is there, something weird happened, this expression WORKED!
Has anyone met this problem before? Help me ...
Thanks in advance!
The page's link.
What do you get if you fetch the value of this attribute and print it to the screen?
Do you maybe have to escape some characters, like the space or the minus? I think it could have problems with the space in between.
Does vertical-align:middle; work?
Or maybe test whether it's the minus that causes the error.
