I am currently attempting to make an Android application and have come to the conclusion that I must use JSOUP to finish it. I am using JSOUP to extract data from the Internet and then post it on my app.
What I am trying to figure out is how to extract multiple pieces of data from the URL and then show each one in its own TextView defined in XML (if that is the right approach).
Here is a snippet of the HTML I am trying to extract.
a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m12296&MI=122&RN=BACON TURKEY SLICED" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m12296&MI=122&RN=BACON TURKEY SLICED', 'RDA_window', 'width=450, height=600, scrollbars=no, toolbar=no, directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">BACON TURKEY SLICED
I am trying to extract the words BACON TURKEY SLICED
The problem is that I do not really understand JSOUP. I have a rough idea of how it works, but I can't seem to put it into practice. I was wondering if someone could give me a push in the right direction.
Also, I have tried reading the cookbook, to no avail.
If anyone could help, thank you so much!
EDIT
Here are two more. I believe they are the exact same thing.
a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m4903&MI=122&RN=STATION OMELET" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m4903&MI=122&RN=STATION OMELET', 'RDA_window', 'width=450, height=600, scrollbars=no, toolbar=no, directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">STATION OMELET
a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m784&MI=122&RN=CEREAL HOT GRITS" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m784&MI=122&RN=CEREAL HOT GRITS', 'RDA_window', 'width=450, height=600, scrollbars=no, toolbar=no, directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">CEREAL HOT GRITS
So, this answer is going to assume that you are interested in:
<a href=".." >TEXT YOU WANT</a>
All these <a> tags have the class attribute "recipeLink".
Given your example, here as a String:
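String tastyTurkeySandwich = "<a href=\"http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m12296&MI=122&RN=BACON TURKEY SLICED\" class=\"recipeLink\">BACON TURKEY SLICED</a>";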
You can extract the (first) text with the following code:
Document doc = Jsoup.parse(tastyTurkeySandwich);
Elements links = doc.select("a[href].recipeLink");
// This will just print the text in the first one
System.out.println(links.first().text());
To iterate over an Elements (which implements the Iterable interface) instance:
for (Element link : links) {
    // Calling link.text() will return BACON TURKEY SLICED etc.
    System.out.println(link.text());
}
In short:
a[href] matches all the <a> tags that have an href attribute.
The .recipeLink part narrows that selection to links whose class attribute includes recipeLink.
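The selector steps above can be put together as a small runnable sketch. It assumes jsoup is on the classpath; the class name RecipeLinks, the method recipeNames, and the shortened href values are made up for illustration and are not taken from the real page.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class RecipeLinks {

    // Returns the visible text of every <a> that has an href attribute
    // and carries the class "recipeLink".
    static List<String> recipeNames(String html) {
        Document doc = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element link : doc.select("a[href].recipeLink")) {
            names.add(link.text());
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical markup modelled on the links in the question.
        String html =
              "<a href=\"rda.aspx?RN=BACON\" class=\"recipeLink\">BACON TURKEY SLICED</a>"
            + "<a href=\"rda.aspx?RN=OMELET\" class=\"recipeLink\">STATION OMELET</a>"
            + "<a href=\"other.aspx\">NOT A RECIPE</a>";
        for (String name : recipeNames(html)) {
            System.out.println(name);
        }
    }
}
```

On Android, each returned string could then be handed to its own TextView with setText(), ideally after fetching and parsing off the main thread.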
My problem is that I am trying to get the hrefs from this site with Jsoup:
https://www.amazon.de/s?k=kissen&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2
but it does not work.
I tried to select the links by their class like this:
Elements elements = documentMainSite.select(".a-link-normal");
and after that I tried to extract the hrefs with the following piece of code:
for (Element element : elements) {
    String href = element.attributes().get("href");
}
but unfortunately it gives me nothing.
Can someone tell me where my mistake is, please?
I don't just connect to the website; I also save the hrefs in a String by extracting them with
String href = element.attributes().get("href");
After that I print the href String, but it is empty.
On the other hand, the code works with another CSS selector, so the problem is not the code itself. It is probably just the CSS selector (.a-link-normal) that is wrong.
You won't get anything by simply connecting to the url via Jsoup.
Document document = Jsoup.connect(yourUrl).get();
String bodyText = document.getElementsByTag("body").get(0).text();
Here is a translation of the body text that I got from the code above:
Enter the characters below We ask for your understanding and want to
be sure that you are not a bot. For best results, please use a browser
that accepts cookies. Type the characters you see in the image: Enter
characters Try another image Continue shopping Terms & Conditions
Privacy Policy © 1996-2015, Amazon.com, Inc. or its affiliates
You need to either get past the CAPTCHA or emulate a browser, for example with Selenium.
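Once the real results page is available, the hrefs themselves can be read with attr("href") or, for absolute URLs, absUrl("href"). Here is a minimal sketch on a hard-coded fragment, assuming jsoup is on the classpath. The sample markup, the /dp/EXAMPLE1 path, and the names ResultLinks and hrefs are hypothetical and do not reflect Amazon's actual HTML.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ResultLinks {

    // Collects absolute URLs from every <a class="a-link-normal"> that has an href.
    // The base URI is needed so that relative hrefs can be resolved.
    static List<String> hrefs(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> out = new ArrayList<>();
        for (Element link : doc.select("a.a-link-normal[href]")) {
            out.add(link.absUrl("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical fragment; a live page may return a CAPTCHA instead.
        String html = "<a class=\"a-link-normal\" href=\"/dp/EXAMPLE1\">Kissen</a>"
                    + "<a class=\"a-link-normal\">no href here</a>";
        System.out.println(hrefs(html, "https://www.amazon.de/"));
    }
}
```

If the list comes back empty for a live page, printing the body text as shown above is a quick way to see whether a CAPTCHA page was returned instead of the search results.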
I've been reading all the questions I could find regarding jsoup and attributes, classes, spans and so on, but none could help me get this data from this website.
I am working on some sports software and retrieve match data from the site soccer24.com.
Now I want to get more data from specific match pages (the win/lose history),
so I need either the last scores or, even better, the "win" or "lose" result.
The scores are written like this:
<td class="" style="cursor: pointer;"><span class="score"><strong>2 : 1</strong></span></td>
Here I could work with the "2 : 1".
This is what I am trying right now:
Elements wl =docl.select("span.score");
System.out.println(wl);
for (Element w : wl) {
    System.out.println(w.ownText());
}
The result is written like this:
<td class="winLose" style="cursor: pointer;"><span class="winLoseIcon"><a title="Win" class="form-bg-last form-w"><span></span></a></span></td>
Here I would need the "Win" from the a title.
I've really tried everything but can't extract it. I would be really grateful for any help. And before I make it another question: I would also need the odds movement.
I get the final odds, but the movements are written like this:
<span class="up" alt="1.73[u]1.75">1.75</span>
so I need the "alt" attribute.
It would be awesome if I could get all these things. I know it's not a big deal for you, but I've been trying for hours now and this is really my last resort.
Thanks in advance :)
If I understand your question correctly, you want to extract an attribute from an element? If so, attr() does that, as in the code below.
EDIT:
Now it seems your real issue is not JSOUP parsing, but getting the content.
The link contains the fragment #h2h;overall, which means the data is not part of the initial server response; after the page loads, it makes an Ajax request to another URL (http://d.soccer24.com/x/feed/d_hh_K2AUJ0ih_en_2).
When I checked the response, I found that the page repeatedly calls the server and updates the result. Both the request and the response are encrypted. The following updated code should display the correct results.
// ** Test Data
//Document doc = Jsoup.parse("<html><body><h1></h1><table>"
// + "<td class=\"winLose\" style=\"cursor: pointer;\"><span class=\"winLoseIcon\"><a title=\"Win\" class=\"form-bg-last form-w\"><span></span></a></span></td>"
// + "<span class=\"up\" alt=\"1.73[u]1.75\">1.75</span>" + "</table></body></html>");
//
Connection con = Jsoup.connect("http://d.soccer24.com/x/feed/d_hh_K2AUJ0ih_en_2");
con.header("X-Fsign", "SW9D1eZo");
Document doc = con.get();
//Your code
Elements elems = doc.select("td.winLose > span.winLoseIcon > a[title]");
for (Element elem : elems) {
    System.out.println(elem.attr("title"));
}
Similarly for odds:
Elements oddsElems = doc.select("span.up[alt]");
for (Element elem : oddsElems) System.out.println(elem.attr("alt"));
RESULT:
..Lots of lines Win | Lose | Draw..
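To try the selectors without depending on the live feed, the same queries can be run against a hard-coded fragment. This is only a sketch, assuming jsoup is on the classpath. The table markup is assembled from the snippets in the question, and the class MatchForm and helper attrs are hypothetical names. It also reads the score with text(), because the score sits inside a nested <strong> and ownText() on the span would therefore be empty.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class MatchForm {

    // Collects the given attribute from every element matched by the CSS query.
    static List<String> attrs(Document doc, String cssQuery, String attribute) {
        List<String> values = new ArrayList<>();
        for (Element e : doc.select(cssQuery)) {
            values.add(e.attr(attribute));
        }
        return values;
    }

    public static void main(String[] args) {
        // Test markup built from the fragments quoted in the question.
        String html = "<table><tr>"
            + "<td><span class=\"score\"><strong>2 : 1</strong></span></td>"
            + "<td class=\"winLose\"><span class=\"winLoseIcon\">"
            + "<a title=\"Win\" class=\"form-bg-last form-w\"><span></span></a></span></td>"
            + "<td><span class=\"up\" alt=\"1.73[u]1.75\">1.75</span></td>"
            + "</tr></table>";
        Document doc = Jsoup.parse(html);

        // Win/lose result from the title attribute of the link.
        System.out.println(attrs(doc, "td.winLose > span.winLoseIcon > a[title]", "title"));
        // Odds movement from the alt attribute.
        System.out.println(attrs(doc, "span.up[alt]", "alt"));
        // text() includes the text of the nested <strong>.
        System.out.println(doc.select("span.score").text());
    }
}
```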
I don't really know much about HTML parsers (I'm currently using Jsoup). I have tried many times and cannot get it to work because of my poor understanding of it, so please bear that in mind.
Anyway I am trying to grab certain parts of an HTML document. This is what I want to be extracted:
<div class="detName">
<a class="detLink" title="Details for Hock part3">Hock part3</a></div>
Obviously the HTML document has multiple [div class="detName"] elements, and I want to extract all the text in each detName div. I would greatly appreciate any help.
Thank you for your time.
You can use a selector for this:
Document doc = ...; // parse your document here or connect to a website
for( Element element : doc.select("div.detName") )
{
    System.out.println(element.text()); // Print the text of that element
}
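For a version that can be run as-is, here is a small sketch that parses a hard-coded string instead of a live page. It assumes jsoup is on the classpath and uses the class name detName as given in the question's description. DetNames and names are hypothetical helper names.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DetNames {

    // Returns the combined text of each <div class="detName">, one entry per div.
    static List<String> names(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element div : doc.select("div.detName")) {
            out.add(div.text());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div class=\"detName\">"
            + "<a class=\"detLink\" title=\"Details for Hock part3\">Hock part3</a></div>";
        System.out.println(names(html));
    }
}
```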
I'm a novice Java programmer, and am just now beginning to expand into the world of libraries, APIs, and the like. I'm at the point where I have an idea that is relatively simple, and can be my pet project when I'm not working on homework.
I'm interested in scraping html from a few different sites, and building strings that look like " Artist - "Track Name" ". I've got one site working the way I want, but I feel it could be done a lot more smoothly... Here's the rundown on what I do for Site A:
I have JSoup create Elements for everything that is of the class plrow like so:
<p class="plrow"><b>Artist</b> “Title” (<span class="sn_ld">Label</span>) <SMALL><b>N </b></SMALL></p></td></tr><tr class="ev"><td><a name="98069"></a><p class="pltime">Time</p>
From there, I create a String array of lines that are split after the last </p>, then use the following code to process the array:
for (int i = 0; i < tracks.length; i++) {
    tracks[i] = Jsoup.parse(tracks[i]).text();
    tracks[i] = tracks[i].split("”")[0];
    tracks[i] = tracks[i].toString() + "”";
}
Which is a pretty hackish way to get Artist "Title" the way I want, but the result is fine for me.
Site B is a little bit different.
I've determined that the Artists and Titles are all contained like this:
<span class="artist" property="foaf:name">Artist Name</span> </a> </span> <span class="title" property="dc:title">Title</span>
along with more information, all inside of <li id="segmentevent-random" class="segment track" typeof="po:MusicSegment" about="/url"> song info </li>
I was trying to go through and snag all of the artists first, and then the titles and then merge the two, but I was having trouble with that because the "dc:title" property used to display the track title is used for other non music things, so I can't directly match up the artist with a track.
I have spent the lion's share of this weekend trying to get this working by viewing countless questions tagged with Jsoup, and spending a lot of time reading the Jsoup cookbook and API guide. I have a feeling that part of my trouble could also stem from my relatively limited knowledge of how web pages are coded, though that may mostly be my trouble with my understanding of how to plug these bits of code into Jsoup.
I appreciate any help or guidance, and I've got to say, it's really nice to ask a non-homework question here (though I find quite a few hints from what others have asked! ;) )
Common:
If you have several different websites whose content you want to parse, it's a good idea to distinguish between them. For example, you can decide whether to parse Page A or Page B based on the URL.
Example:
if( urlToPage.contains("pagea.com") )
{
    // Call the parse method for Page A or create a parser class
}
else if( urlToPage.contains("pageb.com") )
{
    // Call the parse method for Page B or create a parser class
}
// ...
else
{
    // E.g. throw an exception because there's no parser available
}
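As a variation on the contains() check, the decision can be based on the URL's host, so that a site name appearing in a path or query string does not match by accident. This is only a sketch using java.net.URI from the standard library. ParserDispatch, Site, and siteFor are hypothetical names, and pagea.com / pageb.com are the placeholder domains from the example above.

```java
import java.net.URI;

class ParserDispatch {

    enum Site { PAGE_A, PAGE_B }

    // Picks a site by the URL's host rather than by substring search.
    // Throws IllegalArgumentException when no parser is known for the URL.
    static Site siteFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + url);
        }
        host = host.toLowerCase();
        if (host.equals("pagea.com") || host.endsWith(".pagea.com")) {
            return Site.PAGE_A;
        }
        if (host.equals("pageb.com") || host.endsWith(".pageb.com")) {
            return Site.PAGE_B;
        }
        throw new IllegalArgumentException("No parser available for " + url);
    }

    public static void main(String[] args) {
        System.out.println(siteFor("http://www.pagea.com/playlist"));
    }
}
```

The returned enum value can then drive a switch that calls the matching parse method.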
You can connect and parse each page into a document with a single line of code:
// Note: the protocol (http) is required here
Document doc = Jsoup.connect("http://pagewhatever.com").get();
Without knowing the HTML or the structure of each page, here are some basic approaches:
Page A:
for( Element element : doc.select("p.plrow") )
{
    String title = element.ownText(); // Title - output: '“Title” ()' (you have to strip the quotes and the () here)
    String artist = element.select("b").first().text(); // Artist (the first <b> in the snippet from the question)
    String label = element.select("span.sn_ld").first().text(); // Label
    // etc.
}
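A runnable sketch of the Page A approach, based only on the single plrow snippet quoted in the question (where the artist is in the first <b> element) and assuming jsoup is on the classpath. PageATrack and describe are hypothetical names, and the typographic quotes are written as Unicode escapes to keep the source ASCII-only. The title is pulled out of ownText() with a regular expression instead of manual replacing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class PageATrack {

    // Matches the text between a left and a right typographic double quote.
    private static final Pattern QUOTED = Pattern.compile("\u201C([^\u201D]*)\u201D");

    // Builds a string of the form: Artist, space, quoted title,
    // from one <p class="plrow"> element. Assumes the row contains a <b>.
    static String describe(Element row) {
        String artist = row.select("b").first().text();
        Matcher m = QUOTED.matcher(row.ownText());
        String title = m.find() ? m.group(1) : "";
        return artist + " \u201C" + title + "\u201D";
    }

    public static void main(String[] args) {
        String html = "<p class=\"plrow\"><b>Artist</b> \u201CTitle\u201D "
            + "(<span class=\"sn_ld\">Label</span>) <small><b>N </b></small></p>";
        for (Element row : Jsoup.parse(html).select("p.plrow")) {
            System.out.println(describe(row));
        }
    }
}
```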
Page B:
Similar to Page A, artist and title can be selected like this:
String artist = doc.select("span.artist").first().text();
String title = doc.select("span.title").first().text();
Here's a good overview of the Jsoup Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax
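Because the dc:title property on Page B is also used for non-music items, one way to keep artists and titles paired is to select each track's <li> first and then look for the artist and title inside it. The sketch below assumes jsoup is on the classpath and that every track sits in an li with the classes "segment track", as described in the question. PageBTracks and tracks are hypothetical names, and the second list item in the sample is an invented non-music entry.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class PageBTracks {

    // Returns one 'Artist - "Title"' string per <li class="segment track">.
    // Items missing either an artist or a title are skipped.
    static List<String> tracks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element segment : doc.select("li.segment.track")) {
            Element artist = segment.select("span.artist").first();
            Element title = segment.select("span.title").first();
            if (artist != null && title != null) {
                out.add(artist.text() + " - \"" + title.text() + "\"");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<ul>"
            + "<li class=\"segment track\"><span class=\"artist\" property=\"foaf:name\">Artist Name</span>"
            + " <span class=\"title\" property=\"dc:title\">Title</span></li>"
            + "<li class=\"segment speech\"><span class=\"title\" property=\"dc:title\">News</span></li>"
            + "</ul>";
        System.out.println(tracks(html));
    }
}
```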
I'm trying to extract some certain data from a website using JSoup and Java. So far I've been successful in what I'm trying to achieve.
<ul class="beverageFacts">
<li><span>Årgång</span><strong>**2009** </strong></li>
I want to extract what is inside the ** in the above HTML. I can do this by using the code that follows in JSoup:
doc.select("ul.beverageFacts li:lt(1) strong");
I'm using the lt(1) because there are several more list items following that I want to omit.
Now to my problem; there's an optional information tab on the site I'm extracting data from, and it also has a class called "beverageFacts". My code will at the moment extract that data too, which I don't want it to do.
The code is further down in the source of the website, and I've tried to use the :lt(1) index selector here as well, but it won't work.
<div id="beverageMoreFacts" style="display: block">
<ul class="beverageFacts"><li class="half">
<span> Färg</span><strong> Ljusgul färg.</strong>
My overall result is that I extract "2009 Ljusgul färg." instead of only "2009". How can I write my code so it will only extract the first part, which it successfully does, and omit the rest?
EDIT:
I get the same result using:
doc.select("ul.beverageFacts li:eq(0) strong");
Thanks,
Z
You are qualifying only one part, whereas you should qualify both. Try this:
doc.select("ul.beverageFacts:eq(0) li:eq(0) strong");
What you are saying is: give me the first list item of each list of beverages. What you need to say instead is: Give me the first item of the first list of beverages.
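One caveat: in jsoup, :eq(n) compares an element's index among its siblings, not its position in the list of matches, so whether ul.beverageFacts:eq(0) excludes the second list depends on where each <ul> sits under its parent. A hedged alternative is to take the first matched list in Java and then select inside it. The sketch below assumes jsoup is on the classpath and that the wanted list is the first ul.beverageFacts in document order. FirstFact and firstFact are hypothetical names, and the sample labels are simplified ASCII stand-ins for the page's text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FirstFact {

    // Returns the <strong> text of the first <li> in the first
    // <ul class="beverageFacts"> of the document, or "" if there is none.
    static String firstFact(String html) {
        Document doc = Jsoup.parse(html);
        Element firstList = doc.select("ul.beverageFacts").first();
        if (firstList == null) {
            return "";
        }
        return firstList.select("li:eq(0) strong").text();
    }

    public static void main(String[] args) {
        String html =
              "<ul class=\"beverageFacts\">"
            + "<li><span>Year</span><strong>2009 </strong></li>"
            + "<li><span>Other</span><strong>ignored</strong></li></ul>"
            + "<div id=\"beverageMoreFacts\" style=\"display: block\">"
            + "<ul class=\"beverageFacts\"><li class=\"half\">"
            + "<span>Color</span><strong>Light yellow.</strong></li></ul></div>";
        System.out.println(firstFact(html));
    }
}
```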