Getting a block of text using Jsoup

I'm trying to build a URL from a song and artist, load the page with that song's lyrics, and then extract the lyrics. I'm new to Jsoup, and so far I can't figure out the correct way to select the lyrics. I've tried taking the first "div" after the "b" element, but it doesn't work the way I expected.
public static void search() throws MalformedURLException {
    Scanner search = new Scanner(System.in);
    String artist;
    String song;
    artist = search.nextLine();
    artist = artist.toLowerCase();
    System.out.println("Artist saved");
    song = search.nextLine();
    song = song.toLowerCase();
    System.out.println("Song saved");
    artist = artist.replaceAll(" ", "");
    System.out.println(artist);
    song = song.replaceAll(" ", "");
    System.out.println(song);
    try {
        Document doc;
        doc = Jsoup.connect("http://www.azlyrics.com/lyrics/"+artist+"/"+song+".html").get();
        System.out.println(doc.title());
        for (Element element : doc.select("div")) {
            if (element.hasText()) {
                System.out.println(element.text());
                break;
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

I don't know whether this is consistent across all song pages, but on the page you have shown, the lyrics sit in a div whose style attribute starts with "margin". If that holds everywhere, you could try something like this:
Elements eles = doc.select("div[style^=margin]");
System.out.println(eles.html());
Or, if the lyrics are always in the 6th div element, you could select by position (indices are zero-based, so the 6th element is at index 5):
Elements eles = doc.select("div");
if (eles.size() >= 6) {
    System.out.println(eles.get(5).html());
}
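Separately from picking the right div, the URL-building step in the question can be pulled into a small helper so it can be checked without a network call. The sketch below only mirrors what the question's code does (lower-case the input, strip spaces). The class and method names are hypothetical, and the question does not establish whether the site also expects punctuation or accents to be removed.

```java
import java.util.Locale;

public class LyricsUrl {
    // Base URL taken from the question's code.
    private static final String BASE = "http://www.azlyrics.com/lyrics/";

    // Lower-cases the input and removes spaces, as the question does.
    static String slug(String s) {
        return s.toLowerCase(Locale.ROOT).replace(" ", "");
    }

    // Joins the two slugs into the page URL used in the question.
    static String build(String artist, String song) {
        return BASE + slug(artist) + "/" + slug(song) + ".html";
    }

    public static void main(String[] args) {
        System.out.println(build("Some Artist", "Some Song"));
    }
}
```

The resulting string can then be passed to Jsoup.connect(...) exactly as in the question.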

Related

Having trouble webscraping Premier League results in Java with JSoup

I am a complete beginner at web scraping. I have followed a couple of tutorials online, but I can't seem to get it to work with Premier League results.
Here is the exact link I've tried scraping from: https://www.premierleague.com/results
My goal is to read all the home and away teams as well as their results (1-1 etc.). If anyone could help, I would really appreciate it! Below is the code I've tried so far:
First attempt
String element = doc.select("div.fixtures__matches-list span.competitionLabel1").first().text();
Second attempt
Elements elements = doc.select("div.fixtures__matches-list");
Elements matches = doc.getElementsByClass("matchList");
Element ULElement = matches.get(0);
Elements childElements = ULElement.children();
for (Element e : childElements) {
    String first = e.select("ul.matchList").select("li.matchFixtureContainer data-home").text();
    System.out.println(e.text());
}
Third attempt
Elements test = doc.getElementsByClass("fixtures");
Element firstE = test.get(0);
System.out.println(firstE.text());
for (Element e : test) {
    System.out.println(e.text());
}
Fourth attempt
Elements names = doc.select("data-home");
for (Element name : names) {
    System.out.println(name.text());
}
Fifth attempt
String webUrl = "https://www.premierleague.com/results";
Document doc = null;
try {
    doc = Jsoup.connect(webUrl).timeout(6000).get();
}
catch (IOException e) {
    e.printStackTrace();
}
Elements body = doc.select("div.tabbedContent");
for (Element e : body) {
    String data = e.select("div.col-12 section.fixtures div.fixtures__matches-list ul.matchList").text();
}
I really can't figure it out.

Java jsoup link extracting

I am trying to extract the links within a given element in jsoup. Here is what I have done, but it's not working:
Document doc = Jsoup.connect(url).get();
Elements element = doc.select("section.row");
Element s = element.first();
Elements se = s.getElementsByTag("article");
for (Element link : se) {
    System.out.println("link :" + link.select("href"));
}
Here is the html:
What I am trying to do is get all the links within the article elements. I thought I should first select the section with class="row" and then somehow derive the links from the article elements, but I could not make it work.

Try this:
Document doc = Jsoup.connect(url).get();
Elements section = doc.select("#main"); // select the section with id = main
Elements allArtTags = section.select("article"); // select all article tags in that section
for (Element artTag : allArtTags) {
    Elements atags = artTag.select("a"); // select all a tags in each article tag
    for (Element atag : atags) {
        System.out.println(atag.text()); // print the link text, or
        System.out.println(atag.attr("href")); // print the link target
    }
}

I'm using this in one of my projects:
final Elements elements = doc.select("div.item_list_section.item_description");
You'll first have to select the elements you want to extract links from.
private static ... inspectElement(Element e) {
    try {
        final String name = getAttr(e, "a[href]");
        final String link = e.select("a").first().attr("href");
        //final String price = getAttr(e, "span.item_price");
        //final String category = getAttr(e, "span.item_category");
        //final String spec = getAttr(e, "span.item_specs");
        //final String datetime = e.select("time").attr("datetime");
        ...
    }
    catch (Exception ex) { return null; }
}
// Returns the text of the first match, or "" if nothing matches.
private static String getAttr(Element e, String what) {
    try {
        return e.select(what).first().text();
    }
    catch (Exception ex) { return ""; }
}

How to store a web table in a Java collection (HashMap, HashSet, or ArrayList)?

In my application, the user's profile page shows:
Name: XYZ
Age: ##
Address: st.XYZ
and so on...
When an element is missing (for example, age), another row takes its place, so I can't hardcode the XPath of the elements. What I want:
I want to extract (and print) the entire table data and compare it with the actual values.
When I ask for "Name" as the key, I should get the cell value next to it as that key's value.
What I tried:
I was able to get the text of the tr elements while keeping the td index fixed. But for another user, when some row is missing, it fails or returns the wrong value.
for (int i = 2; i < 58; i++) {
    String actor_name = new WebDriverWait(driver, 30).until(ExpectedConditions
            .elementToBeClickable(By.xpath(first_part+i+part_two))).getText();
    System.out.print("\n"+"S.no. "+(i-1)+" "+actor_name);
    try {
        driver.findElement(By.xpath(first_part+i+part_two)).click();
        new WebDriverWait(driver, 30).until(ExpectedConditions
                .elementToBeClickable(By.partialLinkText("bio"))).click();
        //driver.findElement(By.partialLinkText("bio")).click();
    } catch (Exception e) {
        // TODO: handle exception
        System.out.println("Not a link");
    }
    Thread.sleep(5000);
    System.out.print(" "+driver.findElement(By.xpath("//*[@id='overviewTable']/tbody/tr[3]/td[2]")).getText());
    driver.get("http://www.imdb.com/title/tt2310332/fullcredits?ref_=tt_cl_sm#cast");
}
The code above works for the top 3 actors on this page but fails for the 4th, because that actor's bio page is missing one row.
The bio page has a table with two columns: one holds the attribute and the other its value. I want to build a collection of key-value pairs, with the attribute (left column) as the key and the right column as its value, so that I can fetch a value by naming its attribute.
I am writing the scripts in Java.

Try the following code and let me know if you have any concerns:
driver.get("http://www.imdb.com/title/tt2310332/fullcredits?ref_=tt_cl_sm#cast");
String height = "";
String actorName = "";
driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
List<WebElement> lstUrls = driver.findElements(By
        .xpath("//span[@itemprop='name']/..")); // all a tags
List<String> urls = new ArrayList<>();
for (WebElement webElement : lstUrls) {
    urls.add(webElement.getAttribute("href")); // save the href of each a tag
}
Map<String, String> actorHeightData = new HashMap<String, String>();
for (String string : urls) {
    driver.get(string);
    actorName = driver.findElement(
            By.xpath(".//*[@id='overview-top']/h1/span")).getText(); // get the actor's name
    driver.findElement(By.xpath("//a[text()='Biography']")).click(); // open the Biography page
    try {
        height = driver.findElement(
                By.xpath("//td[.='Height']/following-sibling::td"))
                .getText(); // get the height by its label, not by row index
    } catch (NoSuchElementException nsee) {
        height = ""; // height row not present
    }
    actorHeightData.put(actorName, height); // add to the map
}
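For the key/value collection the question asks about, one option is to read every row of the two-column table into a Map keyed by the left cell, so a missing row simply means a missing key rather than a shifted index. The sketch below shows only the collection logic with plain strings. In a real test the rows would come from the driver (for example, the text of the two td cells in each tr); the class name, method name, and sample values here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TableToMap {
    // Each row is expected to hold {label, value}; rows with fewer than two cells are skipped.
    static Map<String, String> toMap(List<String[]> rows) {
        Map<String, String> data = new LinkedHashMap<>(); // keeps the order of the page
        for (String[] row : rows) {
            if (row.length >= 2) {
                data.put(row[0].trim(), row[1].trim());
            }
        }
        return data;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Name", "XYZ"});
        rows.add(new String[] {"Address", "st.XYZ"}); // this user has no "Age" row
        Map<String, String> profile = toMap(rows);
        System.out.println(profile.get("Name"));
        System.out.println(profile.getOrDefault("Age", "(not present)"));
    }
}
```

Using getOrDefault keeps the lookup safe when a profile lacks a row, which is the case that broke the index-based XPath.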

You can create a PersonData class whose fields may all be null but whose getters never return null. For example:
class PersonData {
    private String name;

    public String getName() {
        if (name == null)
            return "";
        return name;
    }
}
Then store all persons in a List.
On your page you can ask a person for any field and always get something for the table cell.

JSoup parsing data from within a tag

I am managing to parse most of the data I need, except for one value: it is contained within the a href tag, and I need the number that appears after "mmsi=".
<a href="/showship.php?mmsi=235083844">Sunsail 4013</a>
My current parser, shown below, fetches all the other data I need. I tried a few things; the code that is commented out occasionally returns "unspecified" for an entry. Is there any way to extend my code so that the number "235083844" is returned before the name "Sunsail 4013"?
try {
    File input = new File("shipMove.txt");
    Document doc = Jsoup.parse(input, null);
    Elements tables = doc.select("table.shipInfo");
    for (Element element : tables) {
        Elements tdTags = element.select("td");
        //Elements mmsi = element.select("a[href*=/showship.php?mmsi=]");
        // Iterate over all 'td' tags found
        for (Element td : tdTags) {
            // Print its text if not empty
            final String text = td.text();
            if (text.isEmpty() == false) {
                System.out.println(td.text());
            }
        }
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Example of data parsed and html file here

You can use attr on an Element object to retrieve a particular attribute's value.
If the string pattern is consistent, use substring to get the required value.
Code:
// Using just your anchor html tag
String html = "<a href=\"/showship.php?mmsi=235083844\">Sunsail 4013</a>";
Document doc = Jsoup.parse(html);
// Just selecting the anchor tag, for your implementation use a generic one
Element link = doc.select("a").first();
// Get the attribute value
String url = link.attr("href");
// Check for nulls here and take the substring from '=' onwards
String id = url.substring(url.indexOf('=') + 1);
System.out.println(id + " " + link.text());
Gives,
235083844 Sunsail 4013
Modified condition in the for loop from your code:
...
for (Element td : tdTags) {
    // Print its text if not empty
    final String text = td.text();
    if (text.isEmpty() == false) {
        if (td.getElementsByTag("a").first() != null) {
            // Get the attribute value
            String url = td.getElementsByTag("a").first().attr("href");
            // Check for nulls here and take the substring from '=' onwards
            String id = url.substring(url.indexOf('=') + 1);
            System.out.println(id + " " + td.text());
        }
        else {
            System.out.println(td.text());
        }
    }
}
...
The above code would print the desired output.
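The substring approach above takes everything after the first '=', which is fine as long as mmsi is the only query parameter. If the links ever carry additional parameters, a pattern match on the parameter name is a little more forgiving. This is a minimal sketch using only the standard library; the class and method names are hypothetical, and the sample hrefs are assumptions modelled on the selector in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MmsiExtractor {
    // Matches "mmsi=<digits>" whether it is the first or a later query parameter.
    private static final Pattern MMSI = Pattern.compile("[?&]mmsi=(\\d+)");

    // Returns the digits after "mmsi=", or an empty Optional if the href has none.
    static Optional<String> mmsiFrom(String href) {
        if (href == null) {
            return Optional.empty();
        }
        Matcher m = MMSI.matcher(href);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(mmsiFrom("/showship.php?mmsi=235083844").orElse("unspecified"));
        System.out.println(mmsiFrom("/showship.php?lang=en&mmsi=235083844&tab=1").orElse("unspecified"));
        System.out.println(mmsiFrom("/about.php").orElse("unspecified"));
    }
}
```

In the loop above, the value passed in would be the result of attr("href") on the anchor inside each td.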

If you need the value of an attribute, use the attr() method:
for (Element td : tdTags) {
    Elements aList = td.select("a");
    for (Element a : aList) {
        String val = a.attr("href");
        // StringUtils here is assumed to be the Apache Commons Lang class
        if (StringUtils.isNotBlank(val)) {
            String yourId = val.substring(val.indexOf("=") + 1);
        }
    }
}

jSoup extract Text out of DIV tag to String

I want to extract some text from a website and store it in a String.
<div class="textclass" id="textid" itemprop="itemtext">I want to get this Text</div>
What goes into the question marks?
protected Void doInBackground(Void... params) {
    try {
        Document document = Jsoup.connect(url).get();
        Elements text = document.select("???");
        desc = text.attr("???");
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

Use the following:
Elements text = document.select("div");
String desc = text.text();
Log.i(".........", desc);
The log output when I tried it on my end:
01-31 04:45:15.272: I/.........(1233): I want to get this Text
Edit:
You can use
Elements text = document.select("div[class=textclass]");
or using id
Elements text = document.select("div[id=textid]");
or
Elements text = document.select("div[itemprop=itemtext]");

You can try this:
Document doc1 = Jsoup.connect(url).get();
Element contentDiv = doc1.select("div[id=textid]").first();
String text=contentDiv.getElementsByTag("div").text();
System.out.println(text); // The result
This stores the text of the div with the id "textid" in the variable "text".
