Building Strings by scraping html with JSoup


I'm a novice Java programmer, and am just now beginning to expand into the world of libraries, APIs, and the like. I'm at the point where I have an idea that is relatively simple, and can be my pet project when I'm not working on homework.
I'm interested in scraping html from a few different sites, and building strings that look like " Artist - "Track Name" ". I've got one site working the way I want, but I feel it could be done a lot more smoothly... Here's the rundown on what I do for Site A:
I have JSoup create Elements for everything that is of the class plrow like so:
<p class="plrow"><b>Artist</b> “Title” (<span class="sn_ld">Label</span>) <SMALL><b>N </b></SMALL></p></td></tr><tr class="ev"><td><a name="98069"></a><p class="pltime">Time</p>
From there, I create a String array of lines that are split after the last </p>, then use the following code to process the array:
for (int i = 0; i < tracks.length; i++){
tracks[i] = Jsoup.parse(tracks[i]).text();
tracks[i] = tracks[i].split("”")[0];
tracks[i] = tracks[i].toString()+ "”";
}
Which is a pretty hackish way to get Artist "Title" the way I want, but the result is fine for me.
Site B is a little bit different.
I've determined that the Artists and Titles are all contained like this:
<span class="artist" property="foaf:name">Artist Name</span> </a> </span> <span class="title" property="dc:title">Title</span>
along with more information, all inside of <li id="segmentevent-random" class="segment track" typeof="po:MusicSegment" about="/url"> song info </li>
I was trying to go through and snag all of the artists first, and then the titles and then merge the two, but I was having trouble with that because the "dc:title" property used to display the track title is used for other non music things, so I can't directly match up the artist with a track.
I have spent the lion's share of this weekend trying to get this working by viewing countless questions tagged with Jsoup, and spending a lot of time reading the Jsoup cookbook and API guide. I have a feeling that part of my trouble could also stem from my relatively limited knowledge of how web pages are coded, though that may mostly be my trouble with my understanding of how to plug these bits of code into Jsoup.
I appreciate any help or guidance, and I've got to say, it's really nice to ask a non-homework question here (though I find quite a few hints from what others have asked! ;) )

Common:
If you want to parse content from several different websites, it's a good idea to distinguish between them. For example, you can decide whether to parse Page A or Page B based on the URL.
Example:
if( urlToPage.contains("pagea.com") )
{
// Call parsemethod for Page A or create parserclass
}
else if( urlToPage.contains("pageb.com") )
{
// Call parsemethod for Page B or create parserclass
}
// ...
else
{
// Eg. throw Exception because there's no parser available
}
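A substring test such as contains("pagea.com") also matches when that text appears in a path or query string, so comparing the host can be a little more robust. The following is only a minimal sketch: the class name SiteParsers is made up for illustration, the hosts pagea.com and pageb.com are the placeholders from the example above, and the parser bodies are stand-ins for real Jsoup-based parsing.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: picks a parse routine by the host part of the URL.
class SiteParsers {
    // Host -> parser. The lambdas are placeholders for real parsing code.
    private static final Map<String, Function<String, String>> PARSERS = Map.of(
            "pagea.com", url -> "parsed with parser A",
            "pageb.com", url -> "parsed with parser B");

    static String parse(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            // Happens, for example, when the protocol is missing.
            throw new IllegalArgumentException("No host in URL: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        Function<String, String> parser = PARSERS.get(host);
        if (parser == null) {
            throw new IllegalArgumentException("No parser available for " + host);
        }
        return parser.apply(url);
    }

    public static void main(String[] args) {
        System.out.println(parse("http://www.pagea.com/playlist"));
        // The query string mentions pagea.com, but the host decides.
        System.out.println(parse("http://pageb.com/tracks?from=pagea.com"));
    }
}
```

Adding a new site then only means adding one more entry to the map.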
You can connect and parse each page into a document with a single line of code:
// Note: the protocol (http) is required here
Document doc = Jsoup.connect("http://pagewhaterver.com").get();
Without knowing the full HTML structure of each page, here are some basic approaches:
Page A:
for( Element element : doc.select("p.plrow") )
{
String title = element.ownText(); // Title - output: '“Title” ()' (you still have to strip the quotes and the "()" here)
String artist = element.select("b").first().text(); // Artist (the first <b> inside the row; the sample HTML has no <a> there)
String label = element.select("span.sn_ld").first().text(); // Label
// etc.
}
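The cleanup of the ownText() result mentioned in the comment above can be done with plain string operations. This is a sketch under the assumption that the title is always wrapped in curly quotes (“ and ”) as in the sample row; the class and method names are hypothetical.

```java
// Hypothetical helper for tidying the ownText() result from Page A.
class TitleCleaner {
    private static final char OPEN_QUOTE = '\u201C';  // left curly double quote
    private static final char CLOSE_QUOTE = '\u201D'; // right curly double quote

    // Returns the text between the first opening curly quote and the next
    // closing one, or the trimmed input if no such pair is found.
    static String extractTitle(String ownText) {
        int open = ownText.indexOf(OPEN_QUOTE);
        int close = ownText.indexOf(CLOSE_QUOTE, open + 1);
        if (open < 0 || close < 0) {
            return ownText.trim();
        }
        return ownText.substring(open + 1, close).trim();
    }

    // Builds a string in the form: Artist - "Track Name"
    static String format(String artist, String ownText) {
        return artist.trim() + " - \"" + extractTitle(ownText) + "\"";
    }

    public static void main(String[] args) {
        // Same shape as the ownText() output quoted above.
        System.out.println(format("Artist", "\u201CTitle\u201D ()")); // prints: Artist - "Title"
    }
}
```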
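Page B:
Similar to Page A, but select inside each track's <li> so that every artist is paired with the title from the same entry (this also skips "dc:title" spans that belong to non-music items):
for( Element track : doc.select("li.segment.track") ) { // one <li> per song
String artist = track.select("span.artist").text();
String title = track.select("span.title").text();
}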
Here's a good overview of the Jsoup Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax

Related

Using jsoup for extracting attributes from "a" inside "span" inside "class" for sports software

I've been reading all the questions I could find regarding jsoup and attributes, classes, spans and so on, but none could help me get this data from this website.
I am working on some sports software and retrieve match data from the site soccer24.com,
and now I want to get more data from specific match pages (win/lose history),
so I need either the last scores or, even better, the "win" or "lose" result.
The scores are written like this:
<td class="" style="cursor: pointer;"><span class="score"><strong>2 : 1</strong></span></td>
Here I could work with the "2 : 1".
This is what I try right now:
Elements wl =docl.select("span.score");
System.out.println(wl);
for(Element w :wl){
System.out.println(w.ownText());
}
The result is written like this:
<td class="winLose" style="cursor: pointer;"><span class="winLoseIcon"><a title="Win" class="form-bg-last form-w"><span></span></a></span></td>
Here I would need the "Win" from the a title.
I've really tried everything but can't extract it. I would be really grateful for any help. And before I make it another question: I would also need the odds movement.
I get the final odds, but the movements are written like this:
<span class="up" alt="1.73[u]1.75">1.75</span>
so I need the "alt" attribute.
It would be awesome if I could get all these things. I know it's not a big deal for you, but I've been trying for hours now and this is really my last resort.
Thanks in advance :)

If I understand your question correctly, you want to extract an attribute from an element. If so, select the element and read the attribute with attr(), as in the code further down.
EDIT:
It now seems your real issue is not Jsoup parsing but getting the content.
The link contains #h2h;overall, which means the data is not part of the initial server response; the page makes an AJAX request after it loads, to another URL (http://d.soccer24.com/x/feed/d_hh_K2AUJ0ih_en_2).
When I checked the response, I found that the page repeatedly calls the server and updates the result. Both the request and the response are encrypted. The following updated code should display the correct results.
// ** Test Data
//Document doc = Jsoup.parse("<html><body><h1></h1><table>"
// + "<td class=\"winLose\" style=\"cursor: pointer;\"><span class=\"winLoseIcon\"><a title=\"Win\" class=\"form-bg-last form-w\"><span></span></a></span></td>"
// + "<span class=\"up\" alt=\"1.73[u]1.75\">1.75</span>" + "</table></body></html>");
//
Connection con = Jsoup.connect("http://d.soccer24.com/x/feed/d_hh_K2AUJ0ih_en_2");
con.header("X-Fsign", "SW9D1eZo");
Document doc = con.get();
//Your code
Elements elems=doc.select("td.winLose > span.winLoseIcon > a[title]");
for(Element elem:elems){
System.out.println(elem.attr("title"));
}
Similarly for odds:
Elements odds = doc.select("span.up[alt]"); // a new name, so it does not clash with elems above
for(Element elem : odds) System.out.println(elem.attr("alt"));
RESULT:
..Lots of lines Win | Lose | Draw..
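Once the alt value from the odds selector is available as a string, it can be split into its parts. The sketch below assumes the format seen in the single sample, "1.73[u]1.75", that is, a number, a one-letter flag in square brackets, and another number. Reading the first number as the previous odds and "u" as "up" is a guess, and the class name OddsMovement is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for one odds movement read from the alt attribute.
class OddsMovement {
    // Assumed format: <previous odds>[<one-letter flag>]<current odds>
    private static final Pattern ALT_FORMAT =
            Pattern.compile("(\\d+(?:\\.\\d+)?)\\[(\\w)\\](\\d+(?:\\.\\d+)?)");

    final double previous;
    final String flag; // "u" in the sample; presumably "up", which is a guess
    final double current;

    private OddsMovement(double previous, String flag, double current) {
        this.previous = previous;
        this.flag = flag;
        this.current = current;
    }

    static OddsMovement parse(String alt) {
        Matcher m = ALT_FORMAT.matcher(alt.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected alt format: " + alt);
        }
        return new OddsMovement(
                Double.parseDouble(m.group(1)), m.group(2), Double.parseDouble(m.group(3)));
    }

    public static void main(String[] args) {
        OddsMovement move = parse("1.73[u]1.75");
        System.out.println(move.previous + " -> " + move.current + " (" + move.flag + ")");
    }
}
```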

Using Jsoup to find elements troubles

I don't really know much about HTML parsers (I'm using Jsoup currently). I have tried many times and cannot get it to work due to my poor understanding of it, so please bear that in mind.
Anyway I am trying to grab certain parts of an HTML document. This is what I want to be extracted:
<div class="detName">
<a class="detLink" title="Details for Hock part3">Hock part3</a></div>
Obviously the HTML document has multiple [div class="detName"] and I want to extract all text in each detName div class. I would greatly appreciate it.
Thank you for your time.

You can use a selector for this:
Document doc = // parse your document here or connect to a website
for( Element element : doc.select("div.detName") )
{
System.out.println(element.text()); // Print the text of that element
}

Getting Started With Android & JSOUP

I am currently attempting to make an Android application and have come to the conclusion that I must use JSOUP to finish it. I am using JSOUP to extract data from the Internet and then post it on my app.
What I am trying to figure out is how to extract multiple bits of data from the url and then use each one of them inside of their own XML String TextView (If that is correct?)
Here is a snippet of the HTML I am trying to extract.
a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m12296&MI=122&RN=BACON TURKEY SLICED" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m12296&MI=122&RN=BACON TURKEY SLICED', 'RDA_window', 'width=450, height=600, scrollbars=no, toolbar=no, directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">BACON TURKEY SLICED
I am trying to extract the words BACON TURKEY SLICED
The problem is I do not understand JSOUP at all. Like I have an idea about it but I can't seem to practically use it and all that. I was wondering if someone could try and give me a push in the right direction.
Also, I have tried reading the cookbook to no avail.
If anyone could help, thank you so much!
EDIT
Here are two more. I believe they are the exact same thing.
a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m4903&MI=122&RN=STATION OMELET" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m4903&MI=122&RN=STATION OMELET', 'RDA_window', 'width=450, height=600, scrollbars=no, toolbar=no, directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">STATION OMELET
a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m784&MI=122&RN=CEREAL HOT GRITS" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m784&MI=122&RN=CEREAL HOT GRITS', 'RDA_window', 'width=450, height=600, scrollbars=no, toolbar=no, directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">CEREAL HOT GRITS

So, this answer is going to assume that you are interested in:
<a href=".." >TEXT YOU WANT</a>
All these <a> tags have the class attribute "recipeLink".
Given your example, here as a String (href shortened to ".." as above):
String tastyTurkeySandwich = "<a href=\"..\" class=\"recipeLink\">BACON TURKEY SLICED</a>";
You can extract the (first) text with the following code:
Document doc = Jsoup.parse(tastyTurkeySandwich);
Elements links = doc.select("a[href].recipeLink");
// This will just print the text in the first one
System.out.println(links.first().text());
To iterate over an Elements (which implements the Iterable interface) instance:
for (Element link : links) {
// Calling link.text() will return BACON TURKEY SLICED etc. etc.
System.out.println(link.text());
}
In short:
a[href] will match all the <a> tags that have an href attribute.
The .recipeLink part will filter that selection to only include links that have the recipeLink class.

How do I parse an HTML document with JSoup to get a list of links?

I am trying to parse http://www.craigslist.org/about/sites to build a set of text/links to load a program dynamically with this information. So far I have done this:
Document doc = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements elms = doc.select("div.colmask"); // gets 7 countries
Below this tag are the doc.select("div.state_delimiter,ul") tags I am trying to get. I set up my iterator, go into a while loop, and call iterator.next().outerHtml();. I see all the tags for each country.
How can I step through each div.state_delimiter, pull that text then go down until
there is a </ul> which defines the end of the states individual counties/cities links/text?
I was playing around with this and can do it by setting outerHtml() to a String and then parsing the string manually, but I am sure there is an easier way to do this. I have tried text() and also tried attr("div.state_delimiter"), but I think I am messing up the pattern/routine to do this properly. Was wondering if someone could help me out here and show me how to get the div.state_delimiter into a text field and then the <ul><li></li></ul> I want all the <li></li> under the <ul></ul> for each state. Looking to grab the http:// && html that goes along with it as easy as possible.

The <ul> containing the cities is the next sibling of the <div class="state_delimiter">. You can use Element#nextElementSibling() to grab it from that div on. Here's a kickoff example:
Document document = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements countries = document.select("div.colmask");
for (Element country : countries) {
System.out.println("Country: " + country.select("h1.continent_header").text());
Elements states = country.select("div.state_delimiter");
for (Element state : states) {
System.out.println("\tState: " + state.text());
Elements cities = state.nextElementSibling().select("li");
for (Element city : cities) {
System.out.println("\t\tCity: " + city.text());
}
}
}
The doc.select("div.state_delimiter,ul") doesn't do what you want. It returns all <div class="state_delimiter"> and <ul> elements of the document. Manually parsing it with string functions makes no sense if you already have an HTML parser at hand.

Creating Javascript History like Object in Coldfusion

I want to create a History object for my website. It should be similar to the JavaScript history object.
I want to use this object to create Previous/Next navigational links for my website.
The problem with the JavaScript history object is that it is based on the window object and is not website-specific, so clicking a "Previous" link built on it can take the user away from my website.
What would be the best way to create a History object in ColdFusion?

Looking at the broader question, which sounds like "How can I make surfing my site better for the user?", I can see three important answers:
1. Obvious navigation, a.k.a. the main menu, especially when you have a lot of links there. Solutions may vary: plain links, tabs, drop-down menus etc.
2. Breadcrumbs. People should be able to go one level up (though that is not always a "Back" action).
3. History. Implementing a custom history can be useful, say for an e-shop, to show previously viewed items in a reliable and handy way.
Please note that history is task #3, not #1 or #2. The reason for explaining all this is that your History should not serve for #1 (definitely) or #2 (though it sometimes can).
Basically, history can be stored in two ways: for the current session only (for any user) and between sessions (typically for logged-in users).
The simplest way to implement the first option is to use ColdFusion sessions. When the session is created (in onSessionStart() if you use Application.cfc), initialize the container; I would use an array.
Consider the following samples:
<cfscript>
session.history = [];
</cfscript>
When the user opens a new page (even in a new tab, which starts a new browser history), push the page information into the container (a page should contain at least a link and some kind of label):
<cfscript>
page = {};
page.link = "/index.cfm?product=100";
page.label = "Product Foo";
ArrayAppend(session.history, page);
</cfscript>
Finally, somewhere in the page template, loop over this array and display the links:
<cfoutput>
<cfloop array="#session.history#" index="page">
<div><a href="#HTMLEditFormat(page.link)#">#HTMLEditFormat(page.label)#</a></div>
</cfloop>
</cfoutput>
Obviously, if you want to show Previous/Next links, you should modify the way the history is stored, for example by also keeping the current page's position in the array so that you can pick the previous and next elements. As a user, though, I would not find such a feature very useful.
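The idea of keeping the current position next to the array is language-independent. Here is a minimal sketch of it in Java rather than ColdFusion, with hypothetical names throughout; dropping the "forward" entries when a new page is visited mimics browser behaviour and is an assumption, not something the question requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical session-scoped history: a list of links plus the current position.
class PageHistory {
    private final List<String> links = new ArrayList<>();
    private int current = -1; // index of the page being viewed, -1 when empty

    // Records a newly opened page, first dropping any entries after the current one.
    void visit(String link) {
        while (links.size() > current + 1) {
            links.remove(links.size() - 1);
        }
        links.add(link);
        current = links.size() - 1;
    }

    // Target for a "Previous" link, if there is one.
    Optional<String> previous() {
        return current > 0 ? Optional.of(links.get(current - 1)) : Optional.empty();
    }

    // Target for a "Next" link, if there is one.
    Optional<String> next() {
        return current < links.size() - 1 ? Optional.of(links.get(current + 1)) : Optional.empty();
    }

    // Moves the position one step back; returns false at the start of the history.
    boolean back() {
        if (current <= 0) {
            return false;
        }
        current--;
        return true;
    }

    public static void main(String[] args) {
        PageHistory history = new PageHistory();
        history.visit("/index.cfm?product=100");
        history.visit("/index.cfm?product=200");
        history.back();
        System.out.println(history.previous().orElse("(none)"));
        System.out.println(history.next().orElse("(none)"));
    }
}
```

Because previous() is empty at the first page of the site's own history, a "Previous" link built this way never leads off the site.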
Finally, if you want to store the history between sessions, simply write this dataset to the database, identified by user id (fk), and restore it when the user logs in.
Please remember that it is highly recommended to use locking when reading and writing this shared data.

Preventing people from leaving your site is a lame and annoying thing to do. Do you have a good reason for doing it?

Are you using a link/button on your site to act as the back button?
If so, you could use javascript to hide the "Previous" link/button in your site if the history-1 doesn't contain your domain name.
edit - you can't use the history object because of security but you can use document.referrer.
<head>
<script language="javascript">
function showBackLink(){
var ref = document.referrer;
var fromThisDomain = ref.indexOf("yourdomain.com");
if(fromThisDomain !== -1){ // indexOf returns -1 when not found; your domain was found, show the link
document.getElementById("backLink").style.display = "";
}else{ // your domain was not found, hide the link
document.getElementById("backLink").style.display = "none";
}
}
</script>
</head>
<body onload="showBackLink();">
<a ID = "backLink" href = "javascript: history.go(-1);">prev</a>
</body>
