jsoup: How to extract correct data from this website

jsoup: How to extract correct data from this website - java

I am trying to extract data from a Spanish dictionary using jsoup. Essentially, the user will input words he wants to define as command line arguments and the program will return a formatted list of definitions. Here is what I have done so far:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Main {
public static void main(String[] args) {
String[] urls = new String[args.length];
for(int i=0; i<args.length; i++) {
urls[i] = "http://www.diccionarios.com/detalle.php?palabra="
+ args[i]
+ "&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on";
try{
Document doc = Jsoup.connect(urls[i]).get();
Elements htmly = doc.getElementsByTag("html");
String untokenized = htmly.text();
System.out.println(untokenized);
}catch (Exception e) {
System.out.println("EXCEPTION: Word is probably not in this dictionary.");
}
}
}
}
That url array gives the correct urls where the information for the definition is.
Now, what I'm expecting to be returned is what you would get if you went to the try.jsoup website and used (for example) this : http://www.diccionarios.com/detalle.php?palabra=libro&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on
as the link and typed in html as the CSS Query. I need that data so I can tokenize the definition from that.
So I guess my question is, what method would I use to obtain the same data that you can see on the try.jsoup website. Thanks a lot!
Edit: This is about interpreting the data from the url. The end result data I want (in this example) is "Conjunto de hojas escritas unidas o cosidas por uno de sus lados y cubiertas por tapas de cartón u otro material." That is the definition on the website. However, I noticed that on that try.jsoup website that if I put the html text in the CSS Query box then the result was a huge bunch of text. My assumption was that the following 2 lines of code would capture this huge bunch of text and save it as a string:
Elements htmly = doc.getElementsByTag("html");
String untokenized = htmly.text();
However, the output for when I print untokenized is instead this: "Usuario Clave ¿Olvidaste tu clave? Condiciones Privacidad Versión completa © 2011 Larousse Editorial, SL." So my question is, how to I obtain the string data for that huge bunch of text found on the try.jsoup website?
EDIT: I followed the advice of the question here: Jsoup - CSS Query selector issue (?) and it worked great.

Related

Scrape information from Web Pages with Java?

I'm trying to extract data from a webpage, for example, lets say I wish to fetch information from chess.org.
I know the player's ID is 25022, which means I can request
http://www.chess.org.il/Players/Player.aspx?Id=25022
In that page I can see that this player's fide ID = 2821109.
From that, I can request this page:
http://ratings.fide.com/card.phtml?event=2821109
And from that I can see that stdRating=1602.
How can I get the "stdRating" output from a given "localID" input in Java?
(localID, fideID and stdRating are aid parameters that I use to clarify the question)

You could try the univocity-html-parser, which is very easy to use and avoids a lot of spaghetti code.
To get the standard rating for example you can use this code:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://ratings.fide.com/card.phtml?event={EVENT}");
url.getRequest().setUrlParameter("EVENT", 2821109);
HtmlElement doc = HtmlParser.parseTree(url);
String rating = doc.query()
.match("small").withText("std.")
.match("br").getFollowingText()
.getValue();
System.out.println(rating);
}
Which produces the value 1602.
But getting data by querying individual nodes and trying to stitch all pieces together is not exactly easy.
I expanded the code to illustrate how you can use the parser to get more information into records. Here I created records for the player and her rank details which are available in the table of the second page. It took me less than 1h to get this done:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://www.chess.org.il/Players/Player.aspx?Id={PLAYER_ID}");
url.getRequest().setUrlParameter("PLAYER_ID", 25022);
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings player = entities.configureEntity("player");
player.addField("id").match("b").withExactText("מספר שחקן").getFollowingText().transform(s -> s.replaceAll(": ", ""));
player.addField("name").match("h1").followedImmediatelyBy("b").withExactText("מספר שחקן").getText();
player.addField("date_of_birth").match("b").withExactText("תאריך לידה:").getFollowingText();
player.addField("fide_id").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getText();
HtmlLinkFollower playerCard = player.addField("fide_card_url").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getAttribute("href").followLink();
playerCard.addField("rating_std").match("small").withText("std.").match("br").getFollowingText();
playerCard.addField("rating_rapid").match("small").withExactText("rapid").match("br").getFollowingText();
playerCard.addField("rating_blitz").match("small").withExactText("blitz").match("br").getFollowingText();
playerCard.setNesting(Nesting.REPLACE_JOIN);
HtmlEntitySettings ratings = playerCard.addEntity("ratings");
configureRatingsBetween(ratings, "World Rank", "National Rank ISR", "world");
configureRatingsBetween(ratings, "National Rank ISR", "Continent Rank Europe", "country");
configureRatingsBetween(ratings, "Continent Rank Europe", "Rating Chart", "continent");
Results<HtmlParserResult> results = new HtmlParser(entities).parse(url);
HtmlParserResult playerData = results.get("player");
String[] playerFields = playerData.getHeaders();
for(HtmlRecord playerRecord : playerData.iterateRecords()){
for(int i = 0; i < playerFields.length; i++){
System.out.print(playerFields[i] + ": " + playerRecord.getString(playerFields[i]) +"; ");
}
System.out.println();
HtmlParserResult ratingData = playerRecord.getLinkedEntityData().get("ratings");
for(HtmlRecord ratingRecord : ratingData.iterateRecords()){
System.out.print(" * " + ratingRecord.getString("rank_type") + ": ");
System.out.println(ratingRecord.fillFieldMap(new LinkedHashMap<>(), "all_players", "active_players", "female", "u16", "female_u16"));
}
}
}
private static void configureRatingsBetween(HtmlEntitySettings ratings, String startingHeader, String endingHeader, String rankType) {
Group group = ratings.newGroup()
.startAt("table").match("b").withExactText(startingHeader)
.endAt("b").withExactText(endingHeader);
group.addField("rank_type", rankType);
group.addField("all_players").match("tr").withText("World (all", "National (all", "Rank (all").match("td", 2).getText();
group.addField("active_players").match("tr").followedImmediatelyBy("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("female").match("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("u16").match("tr").withText("U-16 Rank (active players):").match("td", 2).getText();
group.addField("female_u16").match("tr").withText("Female U-16 Rank (active players):").match("td", 2).getText();
}
The output will be:
id: 25022; name: יעל כהן; date_of_birth: 02/02/2003; fide_id: 2821109; rating_std: 1602; rating_rapid: 1422; rating_blitz: 1526;
* world: {all_players=195907, active_players=94013, female=5490, u16=3824, female_u16=586}
* country: {all_players=1595, active_players=1024, female=44, u16=51, female_u16=3}
* continent: {all_players=139963, active_players=71160, female=3757, u16=2582, female_u16=372}
Hope it helps
Disclosure: I'm the author of this library. It's commercial closed source but it can save you a lot of development time.

As #Alex R pointed out, you'll need a Web Scraping library for this.
The one he recommended, JSoup, is quite robust and is pretty commonly used for this task in Java, at least in my experience.
You'd first need to construct a document that fetches your page, eg:
int localID = 25022; //your player's ID.
Document doc = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localID).get();
From this Document Object, you can fetch a lot of information, for example the FIDE ID you requested, unfortunately the web page you linked inst very simple to scrape, and you'll need to basically go through every link on the page to find the relevant link, for example:
Elements fidelinks = doc.select("a[href*=fide.com]");
This Elements object should give you a list of all links that link to anything containing the text fide.com, but you probably only want the first one, eg:
Element fideurl = doc.selectFirst("a[href=*=fide.com]");
From that point on, I don't want to write all the code for you, but hopefully this answer serves as a good starting point!
You can get the ID alone by calling the text() method on your Element object, but You can also get the link itself by just calling Element.attr('href')
The css selector you can use to get the other value is
div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type, which will get you the std score specifically, at least with standard css, so this should work with jsoup as well.

jsoup : How to search for date text from a webpage

Simply this is what I am trying to do :
(I want to use jsoup)
pass only one url to parse
search for date(s) which are mentioned inside the contents of web page
Extracts at least one date from the each page contents
convert that date into standard format
So, Point #1
What I have now :
String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
Document document = Jsoup.connect(url).get();
Now here I want to understand what kind of format is "Document", is it parsed already from html or any type of web page type or what?
Then Point #2 What I have now:
Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = document.getElementsMatchingOwnText(p);
Here, I am trying to match a date regex to search for dates in the page and store in a string for later use(Point #3), but I am sure i am no near it, need help here.
I have done point #4.
So please anyone who can help me to understand and take me to the right direction how can I achieve those 4 points I mentioned above.
Thanks in Advance !
Updated :
So here how I want :
public static void main(String[] args){
try {
// using USER AGENT for giving information to the server that I am a browser not a bot
final String USER_AGENT =
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
// My only one url which I want to parse
String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
// Creating a jsoup.Connection to connect the url with USER AGENT
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
// retrieving the parsed document
Document htmlDocument = connection.get();
/* Now till this part, I have A parsed document of the url page which is in plain-text format right?
* If not, in which type or in which format it is stored in the variable 'htmlDocument'
* */
/* Now, If 'htmlDocument' holds the text format of the web page
* Why do i need elements to find dates, because dates can be normal text in a web page,
* So, how I am going to find an element tag for that?
* As an example, If i wanted to collect text from <p> paragraph tag,
* I would use this :
*/
// I am not sure is it correct or not
//***************************************************/
Elements paragraph = htmlDocument.getElementsByTag("p");
for(Element src: paragraph){
System.out.println("text"+src.attr("abs:p"));
}
//***************************************************//
/* But I do not want any elements to find to gather dates on the page
* I just want to search the whole text document for date
* So, I need a regex formatted date string which will be passed as a input for a search method
* this search mechanism should be on text formatted page as we have parsed document in 'htmlDocument'
*/
// At the end we will use only one date from our search result and format it in a standard form
/*
* That is it.
*/
/*
* I was trying something like this
*/
//final Elements elements = document.getElementsMatchingOwnText("\\d{4}-\\d{2}-\\d{2}");
Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = htmlDocument.getElementsMatchingOwnText(p);
for(Element e: elements){
System.out.println("element = [" + e + "]");
}
} catch (IOException e) {
e.printStackTrace();
}
}

Here is one possible solution i found:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
/**
* Created by ruben.alfarodiaz on 21/12/2016.
*/
#RunWith(JUnit4.class)
public class StackTest {
#Test
public void findDates() {
final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
try {
String url = "http://stackoverflow.com/questions/51224/regular-expression-to-match-valid-dates";
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
//with this pattern we can find all dates with regex dd/mm/yyyy if we need cover extra formats we should create N more patterns
Pattern pattern = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");
//Here we find all document elements which have some element with the searched pattern
Elements elements = htmlDocument.getElementsMatchingText(pattern);
//in this loop we are going to filter from all original elements to find only the leaf elements
List<Element> finalElements = elements.stream().filter(elem -> isLastElem(elem, pattern)).collect(Collectors.toList());
finalElements.stream().forEach(elem ->
System.out.println("Node: " + elem.html())
);
}catch(Exception ex){
ex.printStackTrace();
}
}
//Method to decide if the current element is a leaf or contains others dates inside
private boolean isLastElem(Element elem, Pattern pattern) {
return elem.getElementsMatchingText(pattern).size() <= 1;
}
}
The point should be added as many patterns as need because I think would be complex find a single pattern which matches all posibilities
Edit: The most important is that the library give you a hierarchy of elements so you need to itarete over them to find the final leaf. For instance
<html>
<body>
<div>
20/11/2017
</div>
</body>
</html>
If we find for the pattern dd/mm/yyyy the library will return 3 elements
html, body and div, but we are just interested in div

How to extract the next tag elements in jsoup

I am using jsoup to parse html document . I need the value of P tag just after the SPAN tag which contains id attribute.
I am trying with the following code
Elements spanList = body.select("span");
if (spanList != null) {
for (Element element1 : spanList) {
if (element1.attr("id").contains("midArticle")) {
Element element = element1.after("<p>"); // This line is wrong
if (element != null) {
String text = element.text();
if (text != null && !text.isEmpty()) {
out.println(text);
}
}
}
}
}
The html sample code
<span id="midArticle_9"></span>
<p>"The Director owes it to the American people to immediately provide the full details of what he is now examining," Podesta said in a statement. "We are confident this will not produce any conclusions different from the one the FBI reached in July." </p>
<span id="midArticle_10"></span>
<p>Clinton has repeatedly apologized for using the private email server in her home instead of a government email account for her work as secretary of state from 2009 to 2013. She has said she did not knowingly send or receive classified information.</p>

i hope this resolves your issue...
public static void main(String[] args) {
String html = "<span id=\"midArticle_9\"></span><p>\"The Director owes it to the American people to immediately provide the full details of what he is now examining,\" Podesta said in a statement. \"We are confident this will not produce any conclusions different from the one the FBI reached in July.\" </p><span id=\"midArticle_10\"></span><p>Clinton has repeatedly apologized for using the private email server in her home instead of a government email account for her work as secretary of state from 2009 to 2013. She has said she did not knowingly send or receive classified information.</p>";
Document document = Jsoup.parse(html);
Elements elements = document.getElementsByTag("span");
for (Element element : elements) {
System.out.println(element.nextElementSibling().text());
}
}

Java - Getting HTTP error 503 when querying google

I'm trying to code a little program in Java, with a small UI, that lets you use some google search's keyword to improve your search.
I have 2 text field (one for the site and one for the keywords) and 2 date pickers to let the user select the date range for the searching result .
When I press the search button it will connect to the following url:
"https://www.google.it/search?q=" + site + Keywords + daterange
site = "site:SITE_MAIN_URL"
keywords are the keywords i am looking for
daterange = "daterange:JULIAN_DATE_1 - JULIAN_DATE_2"
after all this I fetch the first 10 result, but here's the problem...
If I select no dates I can easily fetch the links
If I set the daterange I get the HTTP 503 error that is the one for service unavailable (if I paste the generated URL on my web browser everything works fine)
(the User Agent is set to mozilla 5.0)
EDIT: didn't post any code :P
//here i generate the site
site = "site:" + website_field.getText();
//here i convert the dates using a class found on the net
d1 = (int) DateLabelFormatter.dateToJulian(date1);
d2 = (int) DateLabelFormatter.dateToJulian(date2);
daterange += "+daterange:" + d1 + "-" + d2;
//here i generate the keywords
keywords = keyword_field.getText();
String[] keyword = keywords.split(" ");
for (int i = 0; i < keyword.length; i++) {
tempKeyword += "+" + keyword[i];
}
//the query
query = "https://www.google.it/search?q=" + site + tempKeyword + daterange;
//the connection (wrapped in a try-catch)
Document jSoupDoc = Jsoup.connect(query).userAgent("Mozilla/5.0").timeout(5000).get();
//fetching the links
Elements links = jSoupDoc.select("a[href]");
Element link;
for (int i = 0; i < links.size(); i++) {
link = links.get(i);
String temp = link.attr("href");
// filtering the first 10 google links
if (temp.contains("url")) //donothing
if (temp.contains("webcache")) { //donothing
} else {
String[] splitTemp = temp.split("=");
String[] splitTemp2 = splitTemp[1].split("&sa");
System.out.println(splitTemp2[0]);
}
}
After executing all this (NotSoWellWritten)code if i select no date, and i use just the "site" and the "keywords" I can see on the console the first 10 result found on the google search page.
If i select a daterange from the datepickers i get the 503 error.
If you wanna try a working query, here's one that search on facebook.com the keyword "dog" starting from the 1st of november to the 15th generated with this "tool"
https://www.google.it/search?q=site:facebook.com+dog+daterange:2457328-2457342
`

I have no problems using the following code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Main
{
public static void main(String[] args) throws IOException
{
// the connection (wrapped in a try-catch)
Document jSoupDoc = Jsoup.connect("https://www.google.it/search?q=site:facebook.com+dog+daterange:2457328-2457342").userAgent("Mozilla/5.0").timeout(5000).get();
// fetching the links
Elements links = jSoupDoc.select("a[href]");
Element link;
for (int i = 0; i < links.size(); i++)
{
link = links.get(i);
String temp = link.attr("href");
// filtering the first 10 google links
if (temp.contains("url") && !temp.contains("webcache"))
{
String[] splitTemp = temp.split("=");
String[] splitTemp2 = splitTemp[1].split("&sa");
System.out.println(splitTemp2[0]);
}
}
}
}
The code gives this as output on my computer:
https://www.facebook.com/uniladmag/videos/1912071728815877/
https://it-it.facebook.com/DogEvolutionAsd
https://it-it.facebook.com/DylanDogSergioBonelliEditore
https://www.facebook.com/DelawareCountyDogShelter/
https://www.facebook.com/LostDogAlert/
https://it-it.facebook.com/pages/Toelettatura-Vanity-DOG/270854126382923
https://it-it.facebook.com/washdogsgm
https://www.facebook.com/thedailystar/videos/1193933410623520/
https://www.facebook.com/OakhurstDogPark/
https://www.facebook.com/bigdogdinerco/
A 503 error usually means that the web server is having temporary issues. Specifically:
503: The Web server (running the Web site) is currently unable to handle the HTTP request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay.
If this code works but your original code still does not, then your code is not generating the URL you posted and you should investigate further.

Besides the coding style, I don't see any functional problems with the provided code and it supplies the answers correctly (tested it locally). The problem might reside in the dateToJulian which I don't know what it returns and how the result is cast to int (if information is lost).
Also, consider the case in which the keywords contain dangerous characters and they are unescaped. They should be sanitized beforehand.
Another possibility is that Google is rejecting your queries if you are sending too many too fast. If this was done using a visual browser, you'd get a "We want to make sure you're not a robot." and a CAPTCHA page. That is why I'd recommend leveraging the Google API for your searches. See this SO for more info: How can you search Google Programmatically Java API

How extract the data from a list?

I'm developing an Android application and I want to recognize hashtags, mentions and links. I have a code that can be usable in objective-c that do my propose. I question these and now I have these code:
import java.net.URL;
import java.util.List;
String input = /* text from edit text */;
String[] words = input.split("\\s");
List<URL> urls=null;
for (String s : words){
try
{
urls.add(new URL(s));
}
catch (MalformedURLException e) {
// not a url
}
}
Now I want to put these on a tweet, I have developed the code to do it, and the tweet is based on an string. My question is how I put the data from the list in the string?
//I test these
String tweet="Using my app"+urls
But in the tweet appears "Using my appnull"
How I reuse this code to recognize hashtags and mentions?
I think that is changing the input.split("\\s") by "#\\s" or "#\\s"

You could just use a library here:
https://github.com/twitter/twitter-text-java
that does what you're trying to do.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

jsoup: How to extract correct data from this website - java

Related

Scrape information from Web Pages with Java?

jsoup : How to search for date text from a webpage

How to extract the next tag elements in jsoup

Java - Getting HTTP error 503 when querying google

How extract the data from a list?

Categories

Resources