Jsoup imdb rating - java

I wrote a program that reads the names and ratings of the top 250 movies on IMDb and prints the mean rating. This is the program:
import java.io.IOException;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class da {
/**
* @param args
*/
public static void main(String[] args) {
try {
Document doc=Jsoup.connect("http://www.imdb.com/chart/top").get();
Elements e=doc.getElementsByClass("titleColumn");
Elements t=doc.getElementsByClass("imdbRating");
float suma=0;
for(int i=0;i<e.size();i++)
suma=suma+Float.parseFloat(t.get(i).text());
System.out.println(suma/250);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
My question is why 't' needs only "imdbRating": when I look at the HTML of the page, the element that holds the rating has the class "ratingColumn imdbRating". (I wrote the program this way by accident, and I don't know why it works this way and not the other way.)

You don't need the element e in this program. The titleColumn cells on the page contain only the movie titles, and since you only need the ratings, they are unnecessary. You can use just the t element, which I renamed to ratings; I also cleaned up your code a little bit:
import java.io.IOException;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class da {
/**
* @param args
*/
public static void main(String[] args) {
try {
Document doc = Jsoup.connect("http://www.imdb.com/chart/top").get();
Elements ratings = doc.select(".ratingColumn.imdbRating");
float suma = 0;
for(int i = 0; i < ratings.size(); i++)
suma = suma + Float.parseFloat(ratings.get(i).child(0).text());
System.out.println(suma / ratings.size());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
EDIT: To select elements that must have several classes at once, use doc.select and pass it a CSS query like the one above.

nicholas79171 has a good answer, but I would just like to point out that you can use CSS Selectors to target the ratings directly, without all of the dom traversal methods.
Document doc = Jsoup.connect("http://www.imdb.com/chart/top").get();
float ratingSum = 0;
Elements ratings = doc.select("td.ratingColumn.imdbRating > strong");
for (Element rating : ratings)
ratingSum += Float.parseFloat(rating.ownText());
System.out.println(ratingSum / ratings.size());

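getElementsByClass takes a single class name. It matches every element whose class attribute includes that name, which is why "imdbRating" alone finds the cells marked "ratingColumn imdbRating". You can't pass it several class names at once; if you want to require multiple classes, use select on your Document. The jsoup selector documentation describes how select works.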
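As a rough illustration of the point above, the sketch below parses a small made-up HTML fragment (not the real IMDb markup, which may differ and change over time) and compares getElementsByClass with a multi-class select. The class name ClassMatchDemo, the HTML constant and the meanRating helper are hypothetical names used only for this example, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ClassMatchDemo {
    // Made-up fragment shaped like the table described in the question.
    public static final String HTML =
        "<table><tr>"
        + "<td class=\"titleColumn\">Movie A</td>"
        + "<td class=\"ratingColumn imdbRating\"><strong>9.0</strong></td>"
        + "</tr><tr>"
        + "<td class=\"titleColumn\">Movie B</td>"
        + "<td class=\"ratingColumn imdbRating\"><strong>8.0</strong></td>"
        + "</tr></table>";

    /** Averages the ratings found in cells that carry BOTH classes. */
    public static double meanRating(String html) {
        Document doc = Jsoup.parse(html);
        Elements ratings = doc.select("td.ratingColumn.imdbRating > strong");
        double sum = 0;
        for (Element rating : ratings) {
            sum += Double.parseDouble(rating.text());
        }
        // Divide by the number of ratings actually found, not a fixed 250.
        return ratings.isEmpty() ? Double.NaN : sum / ratings.size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // One class name: matches cells that have it among their classes.
        System.out.println(doc.getElementsByClass("imdbRating").size());
        // Two class names chained with dots: both must be present.
        System.out.println(doc.select(".ratingColumn.imdbRating").size());
        System.out.println(meanRating(HTML));
    }
}
```

Both lookups find the same two cells here; the select form only becomes necessary when a single class name would match too much.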

Related

how can i do web scraping in this case?

I am trying to scrape text from https://in-the-sky.org/data/object.php?id=A216&day=17&month=6&year=2022
so I wrote this code:
import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Main {
public static void main(String args[]) {
int num = 216;
int day = 17;
int month = 6;
int year = 2022;
String url ="https://in-the-sky.org/data/object.php?id=A"+Integer.toString(num)+"&day="+Integer.toString(day)+"&month="+Integer.toString(month)+"&year="+Integer.toString(year);
System.out.println(url);
Document doc = null;
try {
doc = Jsoup.connect(url).get();
} catch (Exception e) {
// TODO: handle exception
e.printStackTrace();
}
System.out.println("=======================================================");
Elements element = doc.select("div.col-md-6 col-md-pull-6");
String output = element.select("p").text();
System.out.println(output);
System.out.println("=======================================================");
}
}
But it doesn't work well. Could someone please help me?
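I believe that you can use Elements element = doc.select("div.col-md-6 > p"); to get your desired output. In a CSS query a space is the descendant combinator, so "div.col-md-6 col-md-pull-6" looks for a <col-md-pull-6> element inside the div rather than for a second class. To require both classes, chain them without spaces: "div.col-md-6.col-md-pull-6".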
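To see how the space changes the meaning of the query, here is a minimal sketch against a made-up fragment (the real page's markup is not reproduced here, so the paragraph texts are invented). CombinatorDemo, HTML and count are hypothetical names, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;

public class CombinatorDemo {
    // Made-up fragment: one div with both classes, one with only col-md-6.
    public static final String HTML =
        "<div class=\"col-md-6 col-md-pull-6\"><p>Left column text</p></div>"
        + "<div class=\"col-md-6\"><p>Right column text</p></div>";

    /** Returns how many elements the CSS query matches in the fragment. */
    public static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Space = descendant combinator: a <col-md-pull-6> tag inside the div.
        System.out.println(count("div.col-md-6 col-md-pull-6"));
        // Dots chained without spaces: a div that has both classes.
        System.out.println(count("div.col-md-6.col-md-pull-6 > p"));
        // Only one class required: paragraphs of both divs match.
        System.out.println(count("div.col-md-6 > p"));
    }
}
```

Note that "div.col-md-6 > p" is the broader query; if only the pulled column is wanted, the chained form is the narrower choice.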

Jsoup How do I parse this span for its text?

<span class="c-city__hrMin" data-bind="{attr:{id:'p'+id()}}" id="p64">10:52</span>
How do I get this to print out just 10:52?
So far I have tried:
import java.io.IOException;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.w3c.dom.Node;
import org.jsoup.select.*;
public class Main
{
public static void main(String [] args) {
Document doc = null;
try {
doc = Jsoup.connect("https://www.timeanddate.com/worldclock/personal.html").get();
String title = doc.title();
Elements elements = doc.select(".c-city__hrMin");
System.out.println("Website : " + title + elements.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
From this, the output is just Website : The Personal World Clock, but there isn't any syntax error.
Simply
doc.select(".c-city__hrMin") should work.
But if the class c-city__hrMin is present on other elements too, then try
doc.select("span[class=c-city__hrMin]"). It selects only span elements whose class attribute is exactly c-city__hrMin.
NB: For more details on jsoup CSS selectors, see the jsoup selector documentation. You can also try selectors against a document online.
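The difference between the two selectors can be checked on a small fragment. The sketch below is only an illustration: the second span and its extra class are invented, and SpanTextDemo and textOf are hypothetical names. jsoup is assumed to be on the classpath. Keep in mind that jsoup does not run JavaScript, so if the live page fills the span in with a script (the data-bind attribute hints at that), the fetched HTML may not contain the time at all and no selector will find it.

```java
import org.jsoup.Jsoup;

public class SpanTextDemo {
    // First span copied from the question (without data-bind); second is invented.
    public static final String HTML =
        "<span class=\"c-city__hrMin\" id=\"p64\">10:52</span>"
        + "<span class=\"c-city__hrMin extra\" id=\"p65\">11:52</span>";

    /** Returns the combined text of every element the query matches. */
    public static String textOf(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).text();
    }

    public static void main(String[] args) {
        // Class selector: matches both spans, texts joined with a space.
        System.out.println(textOf(".c-city__hrMin"));
        // Attribute selector: class attribute must equal the value exactly.
        System.out.println(textOf("span[class=c-city__hrMin]"));
    }
}
```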

Search address by name link - Jsoup

How can I get the web address of a link, not by its title but by its visible text (in this case "następna strona", which means "next page"), from the HTML?
More specifically, I want to extract the address of the link whose text is
następna strona
package outerDictionary;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class adressWWW {
public static void main(String[] args) {
Document doc;
List<String> wikiWords = new ArrayList<String>();
String addresWWW="http://pl.wiktionary.org/w/index.php?title=Kategoria:angielski_(indeks)&pagefrom=abducent#mw-pages";
try {
doc = Jsoup .connect(addresWWW).get();
String title = doc.title();
System.out.println(title);
//Element inDiv = doc.select("a[title=Kategoria:angielski (indeks)]").first();
Element inDiv = doc.select("a[title=Kategoria:angielski (indeks)]następna strona").first();
System.out.println(inDiv);
String row = inDiv.attr("abs:href");
System.out.println("xxx "+row);
// System.out.println(row.text());}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
for (String x : wikiWords)
System.out.println(x);
System.out.println(wikiWords.size());
}}
You can test the text of each link:
Document doc = Jsoup.connect("http://pl.wiktionary.org/w/index.php?title=Kategoria:angielski_(indeks)&pagefrom=abducent#mw-pages").get();
for( Element element : doc.select("a") )
{
if( element.text().equalsIgnoreCase("następna strona") )
{
System.out.println(element);
}
}
Or using the selector syntax:
// ...
for( Element element : doc.select("a:contains(następna strona)") )
{
System.out.println(element);
}
In both cases, the result is:
następna strona
następna strona
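Since the question asks for the address rather than the element itself, a possible next step is to read the link's href as an absolute URL. The sketch below assumes a made-up anchor (the href value is invented, not taken from the real page) and passes a base URI to Jsoup.parse so that relative links can be resolved; when the document comes from Jsoup.connect(...).get(), the base URI is set from the fetched location. LinkByTextDemo and nextPageUrl are hypothetical names, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkByTextDemo {
    public static final String BASE = "http://pl.wiktionary.org/w/index.php";
    // Invented markup: only the link text matches the question.
    public static final String HTML =
        "<a href=\"/w/index.php?title=X&amp;pagefrom=abc\">następna strona</a>"
        + "<a href=\"/other\">inna strona</a>";

    /** Returns the absolute URL of the first link with the given own text, or null. */
    public static String nextPageUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a:containsOwn(następna strona)").first();
        // absUrl resolves the relative href against the document's base URI.
        return link == null ? null : link.absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(nextPageUrl(HTML, BASE));
    }
}
```

The null check matters in a crawl loop: the last page of a category would have no such link, and first() returns null in that case.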

Using Jsoup to extract single value from page source

I need to extract just a single value from a web page. This value is a random number which is generated each time the page is visited. I won't post the full page source but the string that contains the value is:
<span class="label label-info pull-right">Expecting 937117</span>
The "937117" is the value I'm after here. Thanks
Update
Here is what I've got so far:
Document doc = Jsoup.connect("www.mywebsite.com").get();
Elements value = doc.select("*what do I put in here?*");
System.out.println(value);
Everything is shown in the following snippet. I created an HTML file with a similar SPAN tag inside. Use Document.select() to select the elements with the class names you want.
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.select.Elements;
// The class name is arbitrary; the methods need an enclosing class to compile.
public class SpanValueTest {
public static void main(String[] args) {
String sourceDir = "C:/Users/admin/Desktop/test.html";
test(sourceDir);
}
private static void test(String htmlFile) {
File input = null;
Document doc = null;
Elements classEles = null;
try {
input = new File(htmlFile);
doc = Jsoup.parse(input, "ASCII", "");
doc.outputSettings().charset("ASCII");
doc.outputSettings().escapeMode(EscapeMode.base);
/** Find all SPAN elements that carry all three CLASS names **/
classEles = doc.select("span.label.label-info.pull-right");
if (classEles.size() > 0) {
// text() returns "Expecting 937117"; drop the non-digits to keep only the number
String number = classEles.get(0).text().replaceAll("\\D", "");
System.out.println("number: " + number);
}
else {
System.out.println("No SPAN element found with class label label-info pull-right.");
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
Can you not use a regular expression? If you know the element you are interested in, extract its text as a string $stuff with jsoup, then (in JavaScript regular expression syntax) just do
$stuff.match( /Expecting (\d*)/ )[1]
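The same idea can be written in Java with java.util.regex, which may be more convenient since the rest of the code is Java. This is a sketch under the assumption that the span text always has the form "Expecting <digits>"; ExpectingNumber and extract are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExpectingNumber {
    // One capture group for the run of digits after the word "Expecting".
    private static final Pattern EXPECTING = Pattern.compile("Expecting (\\d+)");

    /** Returns the digits after "Expecting", or null if the text does not match. */
    public static String extract(String text) {
        Matcher m = EXPECTING.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The input would normally come from element.text() in jsoup.
        System.out.println(extract("Expecting 937117")); // prints 937117
    }
}
```

Using \d+ instead of \d* avoids matching an empty string when the number is missing.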
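public void yourMethod() {
try {
Document doc = Jsoup.connect("http://google.com").userAgent("Mozilla").get();
// Chain the class names with dots; spaces would be read as descendant combinators
Elements value = doc.select("span.label.label-info.pull-right");
System.out.println(value.text());
} catch (IOException e) {
e.printStackTrace();
}
}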

ArrayIndexOutOfBoundsException gui jlist

Hello, I am trying to make a program that lets me load a file and add its name to a list. When I select a file name in the list, the program should go through that file and put each line in the corresponding JTextField. But when I load a second file and select it, I get an ArrayIndexOutOfBoundsException. Can someone please explain what I'm doing wrong? I am using NetBeans.
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
package prog24178.assignment4;
import java.awt.event.KeyEvent;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.swing.JFileChooser;
public class CustomerView extends javax.swing.JFrame {
/**
* Creates new form CustomerView
*/
private Application ass4App = new Application();
public ArrayList<Customer> customer = new ArrayList<Customer>();
public ArrayList<String> names = new ArrayList<String>();
public String fileName;
public Customer customers = new Customer();
public int i;
public void setApplication(Application customerApp) {
this.ass4App = customerApp;
}
public CustomerView() {
initComponents();
}
/**
* This method is called from within the constructor to initialize the form.
* WARNING: Do NOT modify this code. The content of this method is always
* regenerated by the Form Editor.
*/
private void jExitItemActionPerformed(java.awt.event.ActionEvent evt) {
// TODO add your handling code here:
System.exit(0);
}
private void jOpenCusItemActionPerformed(java.awt.event.ActionEvent evt) {
// TODO add your handling code here:
String currentPath = System.getProperty("user.dir");
JFileChooser fc = new JFileChooser();
fc.setMultiSelectionEnabled(true);
fc.setFileSelectionMode(JFileChooser.FILES_ONLY);
if (fc.showOpenDialog(null) == JFileChooser.APPROVE_OPTION) {
File[] file = fc.getSelectedFiles();
for (int i = 0; i < file.length; i++) {
try {
customers.constructCustomer(file[i]);
} catch (FileNotFoundException ex) {
Logger.getLogger(CustomerView.class.getName()).log(Level.SEVERE, null, ex);
}
customer.add(customers);
names.add(customer.get(i).getName());
}
jCustomerList.setListData(names.toArray());
}
}
private void jCustomerListValueChanged(javax.swing.event.ListSelectionEvent evt) {
// TODO add your handling code here:
jCusNameField.setText((String) customer.get(jCustomerList.getSelectedIndex()).getName());
jAddressField.setText((String) customer.get(jCustomerList.getSelectedIndex()).getAddress());
jCityField.setText((String) customer.get(jCustomerList.getSelectedIndex()).getCity());
jProvinceField.setText((String) customer.get(jCustomerList.getSelectedIndex()).getProvince());
jPostalCodeField.setText((String) customer.get(jCustomerList.getSelectedIndex()).getPostalCode());
jEmailAddressField.setText((String) customer.get(jCustomerList.getSelectedIndex()).getEmailAddress());
jPhoneNumberField.setText((String) customer.get(jCustomerList.getSelectedIndex()).getPhoneNumber());
}
I fixed the problem. I realized that I was just adding the variable customers to customer without giving it a proper value.
customer.add(customers.constructCustomer(file[i]));
I don't know what customers.constructCustomer(file[i]); or customer.add(customers); do, exactly -- we don't have enough code to know -- but you are using i both to iterate through the array of File objects and to obtain a customer (customer.get(i)). The loop counter restarts at 0 on every load while the customer list keeps growing, so the two only line up on the first load. That's the second place I'd look.
The FIRST place I'd look is at the error message; it tells you the line on which the array index was out of bounds, the value of the index, and the upper bound on the array.
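One way to act on the advice above is sketched below without any Swing code. It is an assumption-laden illustration, not the original program: Customer is a hypothetical stand-in that only stores a name, load simulates one "Open" action, and nameAt plays the role of the selection handler. The sketch takes the name from the object just created instead of indexing the growing list with the loop counter, and it guards against an index of -1, which JList.getSelectedIndex() returns when nothing is selected (for example right after the list data is replaced).

```java
import java.util.ArrayList;
import java.util.List;

public class CustomerListSketch {
    /** Hypothetical stand-in for the question's Customer class. */
    static class Customer {
        private final String name;
        Customer(String name) { this.name = name; }
        String getName() { return name; }
    }

    private final List<Customer> customers = new ArrayList<>();
    private final List<String> names = new ArrayList<>();

    /** Simulates one "Open" action; real code would parse each file instead. */
    public void load(String... fileNames) {
        for (String fileName : fileNames) {
            Customer c = new Customer(fileName); // a new object per file
            customers.add(c);
            names.add(c.getName());              // no index into the growing list
        }
    }

    /** Mirrors the selection handler: returns null when nothing valid is selected. */
    public String nameAt(int selectedIndex) {
        if (selectedIndex < 0 || selectedIndex >= customers.size()) {
            return null;
        }
        return customers.get(selectedIndex).getName();
    }

    public static void main(String[] args) {
        CustomerListSketch view = new CustomerListSketch();
        view.load("first.txt");
        view.load("second.txt");
        System.out.println(view.nameAt(1));  // prints second.txt
        System.out.println(view.nameAt(-1)); // prints null
    }
}
```

In the real form, the same guard would go at the top of jCustomerListValueChanged, reading getSelectedIndex() once into a local variable instead of calling it for every field.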
