How to get data from a URL in new lines using JSOUP? - java

I'm scraping the IMDb Top 250 chart. I want to store each movie name in its own array element, but all the movie names end up in the first index, i.e. mString[0], and I don't know why.
Below is my code.
Can anyone please help me out? I have to complete another project, and this is the main thing it needs.
If you can direct me to any website or tutorial, I'll be very thankful.
try {
    Document doc = Jsoup.connect("http://www.imdb.com/chart/top").userAgent("Mozilla").get();
    int counterVariable = 0;
    for (Element el : doc.select(".lister-list")) {
        mString[counterVariable] = el.select(".titleColumn").text();
        totalNumberOfLines++;
        counterVariable++;
    }
} catch (Exception e) {
    System.out.println("Sorry website couldn't be opened");
    System.out.println(e);
}
System.out.println(mString[0]); // It's putting all the names into this index

The problem is that only one element matches the selector .lister-list, so iterating over it does not make much sense. When you call el.select(".titleColumn").text(), Jsoup concatenates the text of all matching elements, which is why you get all the results in one array element. Instead, you can select all td tags with class titleColumn that are children of a tr element that is itself a child of .lister-list:
for (Element el : doc.select(".lister-list > tr > td.titleColumn")) {
    mString[counterVariable] = el.text();
    totalNumberOfLines++;
    counterVariable++;
}
You can learn more about jsoup CSS selectors in the jsoup selector-syntax documentation.
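To see the difference without hitting the live site, here is a minimal sketch that parses a tiny inline table. The HTML string is a made-up stand-in for the real chart markup, and the class and method names (TitleColumnDemo, titles) are hypothetical. It assumes the jsoup library is on the classpath and collects the titles in a List instead of a fixed-size array.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class TitleColumnDemo {
    // Hypothetical, cut-down stand-in for the chart markup (not the real page).
    static final String HTML =
        "<table><tbody class=\"lister-list\">"
        + "<tr><td class=\"titleColumn\">1. Movie A</td></tr>"
        + "<tr><td class=\"titleColumn\">2. Movie B</td></tr>"
        + "</tbody></table>";

    /** Collects the text of every title cell as a separate list entry. */
    static List<String> titles(Document doc) {
        List<String> result = new ArrayList<>();
        for (Element cell : doc.select(".lister-list > tr > td.titleColumn")) {
            result.add(cell.text());
        }
        return result;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // Only one element carries the class, so a loop over it runs once.
        System.out.println(doc.select(".lister-list").size());
        // text() on a multi-element selection joins the text of all matches.
        System.out.println(doc.select(".lister-list").first().select(".titleColumn").text());
        // Iterating over the cells themselves yields one entry per movie.
        System.out.println(titles(doc));
    }
}
```

If an array is still required afterwards, the list can be converted with toArray(new String[0]).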

Related

Fetching data from a webpage

I need to fetch data from the website "https://www.arbatunity.com/index.php". I want the value at the top right of the page that says current market profit.
I need this as a string value that can be updated.
With the jsoup library, this is easy:
Document doc = Jsoup.connect("https://www.arbatunity.com/index.php").get();
Elements elements = doc.select("#id_profit b");
String percent = "";
for (Element e : elements) {
    // If several elements match, the last one wins
    percent = e.html();
}
// percent now holds the String you're looking for

JSoup parsing a text file containing a html table with Java

I am really unsure how I can get the information I need to place into a database, the code below just prints the whole file.
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
System.out.println(doc.toString());
My HTML is here from line 61. I need to get the items under the column headings, but also grab the MMSI number, which is not under a column heading but in the href attribute. I haven't used JSoup other than to get the HTML from the web page. The only tutorials I can find use PHP, and I'd rather not use it.
To get that information, the best way is to use Jsoup's selector API. Using selectors, your code will look something like this (pseudocode!):
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
Elements matches = doc.select("<your selector here>");
for (Element element : matches) {
    // do something with found elements
}
Good documentation is available here: Use selector-syntax to find elements. If you still get stuck, please describe your problem.
Here are some hints for the selector you can use:
// Select the table with class 'shipinfo'
Elements tables = doc.select("table.shipinfo");
// Iterate over all tables found (since it's only one, you can use first() instead)
for (Element element : tables) {
    // Select all 'td' tags of that table
    Elements tdTags = element.select("td");
    // Iterate over all 'td' tags found
    for (Element td : tdTags) {
        // Print its text if not empty
        final String text = td.text();
        if (!text.isEmpty()) {
            System.out.println(text);
        }
    }
}

How to use Jsoup to find element by ID?

I am trying to scrape the Top Stories section in Google News for all the titles. To get only the titles in the Top Stories section, I must narrow down to this tag:
<div class="section top-stories-section" id=":2r">..</div>
This is the code I use (in Eclipse):
public static void main(String[] args) throws IOException {
    // fetches & parses HTML
    String url = "http://news.google.com";
    Document document = Jsoup.connect(url).get();
    // Extract data
    Element topStories = document.getElementById(":2r");
    Elements titles = topStories.select("span.titletext");
    // Output data
    for (Element title : titles) {
        System.out.println("Title: " + title.text());
    }
}
I always seem to be getting a NullPointerException. It doesn't work either, when I try to reach the Top Stories like this:
Element topStories = document.select("#:2r").first();
Am I missing something? Shouldn't this be working? I am relatively new to this, please help and thank you!
Judging from the error message (and from actually looking at the page), that div tag doesn't contain an id attribute. Instead, you could select based on the CSS classes:
Element topStories = document.select("div.section.top-stories-section").first();

Finding a specific file on a site using jsoup

So I'm trying to create a little program that updates a World of Warcraft addon for me. I'm using jsoup to get a list of links on a specific site. How do I ignore files/links that don't end in .zip?
This is my link list so far; as you can see, it prints a list of all the links on the site. The goal is to find only the .zip files (there are only two) and then download one of them. The direct download link changes every time they update the addon, so I can't just download a specific link. I need to find the latest version every time.
public static void LinkList() {
    Document doc;
    try {
        doc = Jsoup.connect("http://www.tukui.org/dl.php").get();
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println("\nlink : " + link.attr("href"));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
You can use the [attr$=value] selector to check whether an attribute ends with value:
Elements links = doc.select("a[href$=zip]");
Demo:
Document doc = Jsoup.connect("http://www.tukui.org/dl.php").get();
Elements links = doc.select("a[href$=zip]");
List<String> list = new ArrayList<>();
for (Element link : links) {
    System.out.println("link : " + link.attr("href"));
    list.add(link.attr("href"));
}
String[] arr = list.toArray(new String[list.size()]);
System.out.println("array content:" + Arrays.toString(arr));
Output:
link : http://www.tukui.org/downloads/tukui-15.79.zip
link : http://www.tukui.org/downloads/elvui-6.82.zip
link : /client/win/tc2430.zip
array content:[http://www.tukui.org/downloads/tukui-15.79.zip, http://www.tukui.org/downloads/elvui-6.82.zip, /client/win/tc2430.zip]
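Note that the output above mixes absolute URLs with a relative one (/client/win/tc2430.zip), and the question needs just one specific download. The following is a minimal sketch of both steps using only the standard library; the class and method names (ZipLinkPicker, resolveAll, firstWithFileNamePrefix) are hypothetical, and the hrefs are copied from the output above. With jsoup itself, link.absUrl("href") should give you the resolved URL directly.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class ZipLinkPicker {
    /** Resolves each href against the page URL so relative links become absolute. */
    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            absolute.add(base.resolve(href).toString());
        }
        return absolute;
    }

    /** Returns the first URL whose file name starts with the prefix, ignoring case. */
    static Optional<String> firstWithFileNamePrefix(List<String> urls, String prefix) {
        for (String url : urls) {
            String fileName = url.substring(url.lastIndexOf('/') + 1);
            if (fileName.regionMatches(true, 0, prefix, 0, prefix.length())) {
                return Optional.of(url);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of(
            "http://www.tukui.org/downloads/tukui-15.79.zip",
            "http://www.tukui.org/downloads/elvui-6.82.zip",
            "/client/win/tc2430.zip");
        List<String> urls = resolveAll("http://www.tukui.org/dl.php", hrefs);
        System.out.println(urls);
        System.out.println(firstWithFileNamePrefix(urls, "elvui").orElse("not found"));
    }
}
```

Matching on the file-name prefix rather than the full link means the lookup keeps working when the version number in the link changes.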

HTML table id and class id

How can I find the table id of the large table at the following URL: http://en.wikipedia.org/wiki/States_and_territories_of_India
I was able to see the classes wikitable sortable jquery-tablesorter.
This is the table that lists the states of India. I was able to confirm in Firebug that the table with the classes wikitable sortable jquery-tablesorter contains the list of states. How can I get the ID of that table?
What is the CSS equivalent to get all the names in that table?
I want to get only the states... the first column. I am using jsoup.
If this is still a pending issue, here is how you can get the list of states in India:
public static void main(String[] args) throws IOException
{
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/States_and_territories_of_India").get();
    Elements tables = doc.select("table");
    for (Element table : tables) {
        Element tableCaption = table.getElementsByTag("big").first();
        if (tableCaption != null && tableCaption.text().equals("States of India")) {
            Document statesDoc = Jsoup.parse(table.toString());
            Elements states = statesDoc.select("tr td:eq(0)");
            for (Element state : states) {
                System.out.println(state.text().replaceAll("\\[\\d\\]", ""));
            }
        }
    }
}
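One detail in the code above: the pattern \\[\\d\\] matches exactly one digit between the brackets, so a footnote marker such as [12] would be left in the output. A minimal sketch of a more tolerant version follows; the class name FootnoteStripper and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

final class FootnoteStripper {
    // One or more digits between square brackets, e.g. "[3]" or "[12]".
    private static final Pattern FOOTNOTE = Pattern.compile("\\[\\d+\\]");

    /** Removes footnote markers and surrounding whitespace from a cell's text. */
    static String strip(String cellText) {
        return FOOTNOTE.matcher(cellText).replaceAll("").trim();
    }

    public static void main(String[] args) {
        // Multi-digit marker is removed by the \d+ pattern.
        System.out.println(strip("Example State[12]"));
        // The single-digit pattern from the answer leaves it untouched.
        System.out.println("Example State[12]".replaceAll("\\[\\d\\]", ""));
    }
}
```

Precompiling the pattern once avoids recompiling it for every table cell.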
There is no ID on that table. If you want to get the content of the table that has the class "wikitable", use Jsoup with this code:
package com.main;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Main {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("http://en.wikipedia.org/wiki/States_and_territories_of_India").get();
            Elements newsHeadlines = doc.select("table.wikitable").get(0).select("td:eq(0) a");
            System.out.println(newsHeadlines.html());
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
So it looks like you're trying to screen-scrape this table.
The answer to your question is that there is no id on that particular <table>.
The HTML that starts the table is:
<table class="wikitable sortable jquery-tablesorter" style="width:70%;">
As you can see there is no id attribute for that element.
What libraries are you using to parse the HTML? In JavaScript you could use document.getElementsByClassName('wikitable')[0] and find that uniquely on the page. But the syntax you would use will depend on what kind of HTML DOM traversal is available to you.
The id attribute is optional; not every element on a page will have one. This table doesn't.
Using jQuery, you want the first table with the classes wikitable sortable jquery-tablesorter:
$(".wikitable.sortable.jquery-tablesorter").first()
However, the CSS classes could change at any time, so I wouldn't rely on them. It might be worth asking someone who can edit the wiki page to add an id to all the tables.
