Scrape information from Web Pages with Java? - java

I'm trying to extract data from a webpage. For example, let's say I wish to fetch information from chess.org.il.
I know the player's ID is 25022, which means I can request
http://www.chess.org.il/Players/Player.aspx?Id=25022
On that page I can see that this player's FIDE ID is 2821109.
From that, I can request this page:
http://ratings.fide.com/card.phtml?event=2821109
And from that I can see that stdRating=1602.
How can I get the "stdRating" output from a given "localID" input in Java?
(localID, fideID and stdRating are helper names I use to clarify the question.)

You could try the univocity-html-parser, which is very easy to use and avoids a lot of spaghetti code.
To get the standard rating, for example, you can use this code:
public static void main(String... args) {
    UrlReaderProvider url = new UrlReaderProvider("http://ratings.fide.com/card.phtml?event={EVENT}");
    url.getRequest().setUrlParameter("EVENT", 2821109);

    HtmlElement doc = HtmlParser.parseTree(url);

    String rating = doc.query()
            .match("small").withText("std.")
            .match("br").getFollowingText()
            .getValue();

    System.out.println(rating);
}
Which produces the value 1602.
But getting data by querying individual nodes and trying to stitch all pieces together is not exactly easy.
I expanded the code to illustrate how you can use the parser to get more information into records. Here I created records for the player and her rank details, which are available in the table of the second page. It took me less than an hour to get this done:
public static void main(String... args) {
    UrlReaderProvider url = new UrlReaderProvider("http://www.chess.org.il/Players/Player.aspx?Id={PLAYER_ID}");
    url.getRequest().setUrlParameter("PLAYER_ID", 25022);

    HtmlEntityList entities = new HtmlEntityList();
    HtmlEntitySettings player = entities.configureEntity("player");

    player.addField("id").match("b").withExactText("מספר שחקן").getFollowingText().transform(s -> s.replaceAll(": ", ""));
    player.addField("name").match("h1").followedImmediatelyBy("b").withExactText("מספר שחקן").getText();
    player.addField("date_of_birth").match("b").withExactText("תאריך לידה:").getFollowingText();
    player.addField("fide_id").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getText();

    HtmlLinkFollower playerCard = player.addField("fide_card_url").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getAttribute("href").followLink();
    playerCard.addField("rating_std").match("small").withText("std.").match("br").getFollowingText();
    playerCard.addField("rating_rapid").match("small").withExactText("rapid").match("br").getFollowingText();
    playerCard.addField("rating_blitz").match("small").withExactText("blitz").match("br").getFollowingText();
    playerCard.setNesting(Nesting.REPLACE_JOIN);

    HtmlEntitySettings ratings = playerCard.addEntity("ratings");
    configureRatingsBetween(ratings, "World Rank", "National Rank ISR", "world");
    configureRatingsBetween(ratings, "National Rank ISR", "Continent Rank Europe", "country");
    configureRatingsBetween(ratings, "Continent Rank Europe", "Rating Chart", "continent");

    Results<HtmlParserResult> results = new HtmlParser(entities).parse(url);
    HtmlParserResult playerData = results.get("player");
    String[] playerFields = playerData.getHeaders();
    for (HtmlRecord playerRecord : playerData.iterateRecords()) {
        for (int i = 0; i < playerFields.length; i++) {
            System.out.print(playerFields[i] + ": " + playerRecord.getString(playerFields[i]) + "; ");
        }
        System.out.println();

        HtmlParserResult ratingData = playerRecord.getLinkedEntityData().get("ratings");
        for (HtmlRecord ratingRecord : ratingData.iterateRecords()) {
            System.out.print(" * " + ratingRecord.getString("rank_type") + ": ");
            System.out.println(ratingRecord.fillFieldMap(new LinkedHashMap<>(), "all_players", "active_players", "female", "u16", "female_u16"));
        }
    }
}
private static void configureRatingsBetween(HtmlEntitySettings ratings, String startingHeader, String endingHeader, String rankType) {
    Group group = ratings.newGroup()
            .startAt("table").match("b").withExactText(startingHeader)
            .endAt("b").withExactText(endingHeader);

    group.addField("rank_type", rankType);
    group.addField("all_players").match("tr").withText("World (all", "National (all", "Rank (all").match("td", 2).getText();
    group.addField("active_players").match("tr").followedImmediatelyBy("tr").withText("Female (active players):").match("td", 2).getText();
    group.addField("female").match("tr").withText("Female (active players):").match("td", 2).getText();
    group.addField("u16").match("tr").withText("U-16 Rank (active players):").match("td", 2).getText();
    group.addField("female_u16").match("tr").withText("Female U-16 Rank (active players):").match("td", 2).getText();
}
The output will be:
id: 25022; name: יעל כהן; date_of_birth: 02/02/2003; fide_id: 2821109; rating_std: 1602; rating_rapid: 1422; rating_blitz: 1526;
* world: {all_players=195907, active_players=94013, female=5490, u16=3824, female_u16=586}
* country: {all_players=1595, active_players=1024, female=44, u16=51, female_u16=3}
* continent: {all_players=139963, active_players=71160, female=3757, u16=2582, female_u16=372}
Hope it helps
Disclosure: I'm the author of this library. It's commercial closed source but it can save you a lot of development time.

As @Alex R pointed out, you'll need a web scraping library for this.
The one he recommended, JSoup, is quite robust and is pretty commonly used for this task in Java, at least in my experience.
You'd first need to construct a Document that fetches your page, e.g.:
int localID = 25022; //your player's ID.
Document doc = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localID).get();
From this Document object you can fetch a lot of information, for example the FIDE ID you requested. Unfortunately, the web page you linked isn't very simple to scrape, so you'll basically need to go through every link on the page to find the relevant one, for example:
Elements fidelinks = doc.select("a[href*=fide.com]");
This Elements object should give you a list of all links that point to anything containing the text fide.com, but you probably only want the first one, e.g.:
Element fideurl = doc.selectFirst("a[href*=fide.com]");
From that point on, I don't want to write all the code for you, but hopefully this answer serves as a good starting point!
You can get the ID alone by calling the text() method on your Element object, and you can also get the link itself by calling Element.attr("href").
The CSS selector you can use to get the other value is
div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type, which will get you the std score specifically, at least with standard CSS, so it should work with Jsoup as well.
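If you prefer to stay with plain Jsoup, the whole localID to stdRating chain can be wired up in one small method. This is only a rough sketch: the getStdRating helper is hypothetical, and the selectors (a[href*=fide.com], small:containsOwn(std.)) are my assumptions about the current markup of the two pages, so they may need adjusting.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class FideRatingScraper {

    // Hypothetical helper: localID in, stdRating out. The selectors are assumptions
    // about the current markup of both pages and may need adjusting.
    static String getStdRating(int localId) throws IOException {
        Document playerPage = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localId).get();

        // The first link pointing at fide.com carries the FIDE ID in its href.
        Element fideLink = playerPage.selectFirst("a[href*=fide.com]");
        if (fideLink == null) {
            throw new IllegalStateException("No FIDE link found for player " + localId);
        }

        Document fideCard = Jsoup.connect(fideLink.absUrl("href")).get();

        // Assumption: the rating value is the text that follows the <small>std.</small> label.
        Element stdLabel = fideCard.selectFirst("small:containsOwn(std.)");
        return stdLabel != null ? stdLabel.parent().ownText().trim() : null;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(getStdRating(25022)); // expected to print something like 1602
    }
}

The idea is simply to pull the FIDE link off the first page, follow it, and then read the text that sits next to the "std." label on the rating card.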

Related

How to parse tabular data from CNBC Markets Page?

I have a program I am writing that takes user input to connect to a site, downloads its HTML into a text file, and retrieves data from a table twice a day. I understand the code will not be one-size-fits-all for any page (I will likely "hardwire" the URL into the code once I get it working). My issue at present is that my Jsoup parser isn't properly reading in the tabular data. I'm not sure if my element selectors are too generic? The table looks like it is in standard table/tr/td format, but my rows array populates with size 0. If someone could help me debug my parser and possibly provide some suggestions on where to look for making it grab data silently twice a day, I'd really appreciate it! There are no runtime/compile errors; I just need to correct the output.
Source site: https://www.cnbc.com/us-markets/
Source code for the table (snippet):
<table class="BasicTable-table"><thead class="BasicTable-tableHeading BasicTable-tableHeadingSortable"><tr><th class="BasicTable-textData"><span>SYMBOL <span class="icon-sort undefined"></span></span></th><th class="BasicTable-numData"><span>PRICE <span class="icon-sort undefined"></span></span></th><th class="BasicTable-numData">
My code:
public class StockScraper {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        System.out.println("Enter the complete url (including http://) of the site you would like to parse:");
        String html = input.nextLine();
        try {
            Document doc = Jsoup.connect(html).get();
            System.out.printf("Title: %s", doc.title());
            //Try to print site content
            System.out.println("");
            System.out.println("Writing html contents to 'html.txt'...");
            //Save html contents to text file
            PrintWriter outputfile = new PrintWriter("html.txt");
            outputfile.print(doc.outerHtml());
            outputfile.close();
            //Select stock data you want to retrieve
            System.out.println("Enter the name of the stock you want to check");
            String name = input.nextLine();
            //Pull data from CNBC Markets
            Element table = doc.select("table").get(0);
            Elements rows = table.select("tr");
            System.out.println(rows.size());
            for (int i = 1; i < rows.size(); i++) {
                Element rowx = rows.get(i);
                Elements col = rowx.select("td"); // select the cells of the current row, not of all rows
                if (col.get(0).text().equals(name)) { // compare the cell's text, not the Element object itself
                    System.out.println("I worked!");
                    System.out.println(col.get(1));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The problem here is that this site is a dynamic page that loads content after the browser initially downloads the page. Jsoup is not going to be adequate for scraping pages like this. A couple of options you have:
1) Use a tool that simulates a browser and makes all the necessary API calls. A couple of options are Selenium WebDriver or HtmlUnit.
2) Figure out the API calls you are interested in on this site, and just call those APIs directly to get a JSON document you can parse. You can see API details by opening developer tools in your browser and looking at the Network tab. For this site an example would be the following, which includes the stock quote for DJI (see the sketch after the sample response below):
https://quote.cnbc.com/quote-html-webservice/quote.htm?noform=1&partnerId=2&fund=1&exthrs=0&output=json&symbolType=issue&symbols=599362|579435|593933|49020635|49031016|5093160|617254|601065&requestMethod=extended
Returns:
ExtendedQuoteResult: {
xmlns: "http://quote.cnbc.com/services/MultiQuote/2006",
ExtendedQuote: [{
QuickQuote: {
symbol: ".DJI",
code: "0",
curmktstatus: "REG_MKT",
FundamentalData: {
yrlodate: "2020-03-23",
yrloprice: "18213.65",
yrhidate: "2020-02-12",
yrhiprice: "29568.57"
},
mappedSymbol: {
xsi:nil: "true"
},
source: "Exchange",
cnbcId: "599362",
prev_prev_closing: "21413.44",
high: "22783.45",
low: "21693.63",
provider: "CNBC Quote Cache",
streamable: "0",
last_time: "2020-04-06T17:16:28.000-0400",
countryCode: "US",
previous_day_closing: "21052.53",
altName: "Dow Industrials",
reg_last_time: "2020-04-06T17:16:28.000-0400",
last_time_msec: "1586207788000",
altSymbol: ".DJI",
change_pct: "7.73",
providerSymbol: ".DJI",
assetSubType: "Index",
comments: "RIC",
last: "22679.99",
issue_id: "599362",
cacheServed: "false",
responseTime: "Mon Apr 06 19:12:09 EDT 2020",
change: "1627.46",
timeZone: "EDT",
onAirName: "Dow Industrials",
symbolType: "issue",
assetType: "INDEX",
volume: "614200990",
fullVolume: "614200990",
realTime: "true",
name: "Dow Jones Industrial Average",
quoteDesc: { },
exchange: "Dow Jones Global Indexes",
shortName: "DJIA",
cachedTime: "Mon Apr 06 19:12:09 EDT 2020",
currencyCode: "USD",
open: "21693.63"
}
}
...
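A minimal sketch of option 2), using Jsoup only as an HTTP client (ignoreContentType(true) lets it return a non-HTML body) and org.json for parsing. The endpoint is the sample URL above trimmed to a single symbol ID, and the field names (ExtendedQuoteResult, ExtendedQuote, QuickQuote, name, last) are taken from the sample response, so treat them as assumptions that CNBC may change at any time:

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;

import java.io.IOException;

public class CnbcQuoteClient {
    public static void main(String[] args) throws IOException {
        // Same endpoint as above, trimmed to the single symbol id for .DJI (599362).
        String url = "https://quote.cnbc.com/quote-html-webservice/quote.htm"
                + "?noform=1&partnerId=2&fund=1&exthrs=0&output=json"
                + "&symbolType=issue&symbols=599362&requestMethod=extended";

        // Jsoup normally refuses non-HTML content types; ignoreContentType(true) lets us grab the raw body.
        String body = Jsoup.connect(url)
                .ignoreContentType(true)
                .userAgent("Mozilla/5.0")
                .execute()
                .body();

        // Field names mirror the sample response shown above.
        JSONObject root = new JSONObject(body).getJSONObject("ExtendedQuoteResult");
        JSONArray quotes = root.getJSONArray("ExtendedQuote");
        for (int i = 0; i < quotes.length(); i++) {
            JSONObject quote = quotes.getJSONObject(i).getJSONObject("QuickQuote");
            System.out.println(quote.getString("name") + ": " + quote.getString("last"));
        }
    }
}

For the "twice a day" part of the question, a java.util.concurrent.ScheduledExecutorService with scheduleAtFixedRate is a common way to run a fetch like this on a schedule.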

Java - Getting HTTP error 503 when querying google

I'm trying to code a little program in Java, with a small UI, that lets you use some of Google search's keywords to refine your search.
I have 2 text fields (one for the site and one for the keywords) and 2 date pickers to let the user select the date range for the search results.
When I press the search button it connects to the following URL:
"https://www.google.it/search?q=" + site + Keywords + daterange
site = "site:SITE_MAIN_URL"
keywords are the keywords I am looking for
daterange = "daterange:JULIAN_DATE_1 - JULIAN_DATE_2"
After all this I fetch the first 10 results, but here's the problem...
If I select no dates I can easily fetch the links.
If I set the daterange I get HTTP error 503, the one for Service Unavailable (if I paste the generated URL into my web browser everything works fine).
(The User-Agent is set to Mozilla/5.0.)
EDIT: didn't post any code :P
//here i generate the site
site = "site:" + website_field.getText();

//here i convert the dates using a class found on the net
d1 = (int) DateLabelFormatter.dateToJulian(date1);
d2 = (int) DateLabelFormatter.dateToJulian(date2);
daterange += "+daterange:" + d1 + "-" + d2;

//here i generate the keywords
keywords = keyword_field.getText();
String[] keyword = keywords.split(" ");
for (int i = 0; i < keyword.length; i++) {
    tempKeyword += "+" + keyword[i];
}

//the query
query = "https://www.google.it/search?q=" + site + tempKeyword + daterange;

//the connection (wrapped in a try-catch)
Document jSoupDoc = Jsoup.connect(query).userAgent("Mozilla/5.0").timeout(5000).get();

//fetching the links
Elements links = jSoupDoc.select("a[href]");
Element link;
for (int i = 0; i < links.size(); i++) {
    link = links.get(i);
    String temp = link.attr("href");
    // filtering the first 10 google links
    if (temp.contains("url")) //donothing
        if (temp.contains("webcache")) { //donothing
        } else {
            String[] splitTemp = temp.split("=");
            String[] splitTemp2 = splitTemp[1].split("&sa");
            System.out.println(splitTemp2[0]);
        }
}
After executing all this (not-so-well-written) code, if I select no date and use just the "site" and the "keywords", I can see on the console the first 10 results found on the Google search page.
If I select a daterange from the date pickers I get the 503 error.
If you want to try a working query, here's one, generated with this "tool", that searches facebook.com for the keyword "dog" from the 1st of November to the 15th:
https://www.google.it/search?q=site:facebook.com+dog+daterange:2457328-2457342
I have no problems using the following code:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main
{
    public static void main(String[] args) throws IOException
    {
        // the connection (wrapped in a try-catch)
        Document jSoupDoc = Jsoup.connect("https://www.google.it/search?q=site:facebook.com+dog+daterange:2457328-2457342")
                .userAgent("Mozilla/5.0")
                .timeout(5000)
                .get();

        // fetching the links
        Elements links = jSoupDoc.select("a[href]");
        Element link;
        for (int i = 0; i < links.size(); i++)
        {
            link = links.get(i);
            String temp = link.attr("href");

            // filtering the first 10 google links
            if (temp.contains("url") && !temp.contains("webcache"))
            {
                String[] splitTemp = temp.split("=");
                String[] splitTemp2 = splitTemp[1].split("&sa");
                System.out.println(splitTemp2[0]);
            }
        }
    }
}
The code gives this as output on my computer:
https://www.facebook.com/uniladmag/videos/1912071728815877/
https://it-it.facebook.com/DogEvolutionAsd
https://it-it.facebook.com/DylanDogSergioBonelliEditore
https://www.facebook.com/DelawareCountyDogShelter/
https://www.facebook.com/LostDogAlert/
https://it-it.facebook.com/pages/Toelettatura-Vanity-DOG/270854126382923
https://it-it.facebook.com/washdogsgm
https://www.facebook.com/thedailystar/videos/1193933410623520/
https://www.facebook.com/OakhurstDogPark/
https://www.facebook.com/bigdogdinerco/
A 503 error usually means that the web server is having temporary issues. Specifically:
503: The Web server (running the Web site) is currently unable to handle the HTTP request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay.
If this code works but your original code still does not, then your code is not generating the URL you posted and you should investigate further.
Besides the coding style, I don't see any functional problems with the provided code, and it returns the results correctly (I tested it locally). The problem might reside in dateToJulian: I don't know what it returns, or whether information is lost when the result is cast to int.
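One quick way to sanity-check that conversion is to compare it against java.time, which can produce a Julian Day Number directly; this is just a verification sketch, not part of the original code:

import java.time.LocalDate;
import java.time.temporal.JulianFields;

public class JulianCheck {
    public static void main(String[] args) {
        // Julian Day Numbers for the range used in the working example URL
        long from = LocalDate.of(2015, 11, 1).getLong(JulianFields.JULIAN_DAY);  // 2457328
        long to = LocalDate.of(2015, 11, 15).getLong(JulianFields.JULIAN_DAY);   // 2457342
        System.out.println("daterange:" + from + "-" + to);
    }
}

Those two numbers match the daterange values in the working example URL, so if dateToJulian returns something different, that would be the first place to look.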
Also, consider the case in which the keywords contain characters that need escaping; they should be sanitized beforehand.
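For example, the keyword part can be URL-encoded before it is appended to the query; a small sketch (the sample input is made up):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeKeywords {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String keywords = "dog & cat café";                    // made-up raw user input
        String encoded = URLEncoder.encode(keywords, "UTF-8"); // dog+%26+cat+caf%C3%A9
        String query = "https://www.google.it/search?q=site:facebook.com+" + encoded;
        System.out.println(query);
    }
}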
Another possibility is that Google is rejecting your queries if you are sending too many too fast. If this was done using a visual browser, you'd get a "We want to make sure you're not a robot." and a CAPTCHA page. That is why I'd recommend leveraging the Google API for your searches. See this SO for more info: How can you search Google Programmatically Java API
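If you go the API route, here is a minimal sketch of calling the Custom Search JSON API directly over HTTP. YOUR_API_KEY and YOUR_CSE_ID are placeholders you would obtain from the Google developers console, and the endpoint and parameters are the publicly documented ones as far as I know, so verify them against the current docs before relying on this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CustomSearchExample {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_API_KEY"; // placeholder
        String cseId = "YOUR_CSE_ID";   // placeholder
        String q = URLEncoder.encode("site:facebook.com dog", "UTF-8");

        URL url = new URL("https://www.googleapis.com/customsearch/v1"
                + "?key=" + apiKey + "&cx=" + cseId + "&q=" + q);

        // Print the raw JSON response; parse it with any JSON library you prefer.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}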

find amazon categories using python

I would like to get the categories of Amazon. I am planning to scrape rather than use the API.
I have scraped http://www.amazon.com. I have scraped all the categories and sub-categories under the Shop By Department drop-down. I have created a web service to do this. The code is here:
# imports assumed from context: Bottle for the web service, urllib2 (Python 2) and BeautifulSoup for scraping
from bottle import route, run, response
from json import dumps
from bs4 import BeautifulSoup
import urllib2

@route('/hello')
def hello():
    text = list()
    link = list()
    req = urllib2.Request("http://www.amazon.com",
                          headers={"Content-Type": "application/json"})
    html = urllib2.urlopen(req).read()
    soup = BeautifulSoup(html)
    last_page = soup.find('div', id="nav_subcats")
    for elm in last_page.findAll('a'):
        texts = elm.text
        links = elm.get('href')
        links = links.partition("&node=")[2]
        text.append(texts)
        link.append(links)
    alltext = list()
    for i, j in zip(text, link):
        alltext.append({"name": i, "id": j})
    response.content_type = 'application/json'
    print(alltext)
    return dumps(alltext)

run(host='localhost', port=8080, debug=True)
I am passing the category name and category ID as a JSON object to one of my team members, who passes it to the API to get the product listing for each category.
It is written in Java. Here is the code:
for (int pageno = 1; pageno <= 10; pageno++) {
    String page = String.valueOf(pageno);
    String category_string = selectedOption.get("category_name").toString();
    String category_id = selectedOption.get("category_id").toString();

    final Map<String, String> params = new HashMap<String, String>(3);
    params.put(AmazonClient.Op.PARAM_OPERATION, "ItemSearch");
    params.put("SearchIndex", category_string);
    params.put("BrowseNodeId", category_id);
    params.put("Keywords", category_string);
    params.put("ItemPage", page);

    System.out.println(client.documentToString(client.getXml(params)));

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    Document doc = null;
    DocumentBuilder db = dbf.newDocumentBuilder();
    InputStream is = client.getInputStream(params);
    doc = db.parse(is);
    NodeList itemList = doc.getElementsByTagName("Items");
But I am getting this error when I pass the category ID as the BrowseNodeId and the category name as the Keywords and SearchIndex.
For example:
Search Index and Keywords: Amazon Instant Video
BrowseNodeId: 2858778011
The value you specified for SearchIndex is invalid. Valid values include [ 'All','Apparel',...................................reless','WirelessAccessories' ].
I would like to know from which Amazon URL I can get all the categories and their browse nodes.
Thank you
I have never looked at Amazon's API before, so this is just a guess, but based on the error message it would seem that "Amazon Instant Video" is not a valid search index. Just because it is in the drop-down list doesn't necessarily mean that it is a valid search index.
Here's a list of search indices for US: http://docs.aws.amazon.com/AWSECommerceService/latest/DG/USSearchIndexParamForItemsearch.html . I don't know how up to date it is, but "Amazon Instant Video" does not appear on the list. The error message does include a list of valid search index values, and these do appear to correspond to the above list.
For other locales look here : http://docs.aws.amazon.com/AWSECommerceService/latest/DG/APPNDX_SearchIndexParamForItemsearch.html
I don't think that this is a coding problem per se.
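If the goal is just to filter by browse node, one option (an assumption on my part, I have not run this against the API) is to keep the BrowseNodeId and switch SearchIndex to a value the error message itself lists as valid, such as "All"; whether a given index can be combined with BrowseNodeId is something to verify in the Product Advertising API documentation:

import java.util.HashMap;
import java.util.Map;

public class ItemSearchParams {
    public static void main(String[] args) {
        // Hypothetical parameter map: keep the browse node, but use a SearchIndex
        // value that the error message lists as valid (e.g. "All"). Whether a given
        // index can be combined with BrowseNodeId should be verified in the
        // Product Advertising API documentation.
        Map<String, String> params = new HashMap<String, String>();
        params.put("Operation", "ItemSearch");
        params.put("SearchIndex", "All");
        params.put("BrowseNodeId", "2858778011"); // the node ID from the question
        params.put("ItemPage", "1");
        System.out.println(params);
    }
}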
You might like to take a look at python-amazon-product-api. The API might be useful to you, and the documentation might give you some ideas.

How to get h2 Tag of a table using Jsoup

I need some help scraping a webpage with Jsoup. I want to parse player profiles from the hcfactions webpage and gather their kills and deaths. The problem I'm running into is that each profile page is dynamically created and will only have said tables if the player has kills or deaths. So in order to tell which table I'm parsing, I need to get the header text that goes with it.
example web page: http://www.hcfactions.net/index.php?action=playerinfo&player=Djmaddox.
Below is a html segment from the web page I'm scraping:
<table class='table-bordered'><h2 style='text-align:center'>Deaths</h2>
<tr><td>Date</td><td>Reason</td><td>Details</td></tr><tr><td>Dec 11 5:27pm CST</td>.....
I have this code that pulls the tables and counts entries, but it won't pull the h2 tags with it for me to select.
public void getPlayerDetails(String name) {
    String data = "";
    Avatar temp = _db.getPlayer(name);
    playerUrl = "http://www.hcfactions.net/index.php?action=playersearch&player=" + name;
    try {
        // data = Jsoup.connect(url)
        //         .url(url).get().html();
        playerDoc = Jsoup.connect(playerUrl).get();
    } catch (IOException ex) {
        Logger.getLogger(JParser.class.getName()).log(Level.SEVERE, null, ex);
    }
    if (playerDoc.select("table").size() == 1) {
        return;
    } else if (playerDoc.select("table").size() >= 2) {
        for (int x = 1; x < playerDoc.select("table").size(); x++) {
            System.out.println("deaths");
            Element table = playerDoc.select("table").get(x);
            Iterator<Element> ite = table.select("tr").iterator();
            int count = 0;
            while (ite.hasNext()) {
                data = ite.next().text();
                count++;
            }
            if (count > 0) {
                temp.setDeaths(count - 1);
            }
        }
    }
}
The <h2> tag is in an invalid position. That's why Jsoup cannot find it, I think. You have to extract it yourself with regular expressions. You can get the content of the <h2> with the following code:
String tableToString = "<table class='table-bordered'><h2 style='text-align:center'>Deaths</h2>"
        + "<tr>"
        + "<td>Date</td>"
        + "<td>Reason</td>"
        + "<td>Details</td>"
        + "</tr>"
        + "</table>";

String regex = "<h2.*>(.*)?</h2>";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(tableToString);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
You can init tableToString with table.toString() from your code.
As ka3ak says, the <h2> is mispositioned. But you don't have to abandon your parser and resort to regex for that. Assuming Jsoup is a decent HTML parser (I've never used it myself), the <h2> element should end up immediately preceding the <table> element. Get your 'select' statement to look for it there.
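For example, here is a minimal sketch of that idea. It assumes Jsoup relocates the misplaced <h2> to just before its <table> (the usual HTML5 "foster parenting" recovery), which is worth verifying against the real page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableHeaderSketch {
    public static void main(String[] args) {
        String html = "<div><table class='table-bordered'>"
                + "<h2 style='text-align:center'>Deaths</h2>"
                + "<tr><td>Date</td><td>Reason</td><td>Details</td></tr></table></div>";

        Document doc = Jsoup.parse(html);
        for (Element table : doc.select("table")) {
            // The parser moves the misplaced <h2> out of the table, so it ends up
            // as the table's preceding sibling.
            Element header = table.previousElementSibling();
            String title = (header != null && header.tagName().equals("h2")) ? header.text() : "(no header)";
            System.out.println(title + ": " + table.select("tr").size() + " row(s)");
        }
    }
}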
Elements headers=playerDoc.select("div.span10.offset1 h2");
IMHO your selections seem to be a little bit overcomplicated, but maybe they have to be like that. Anyway, the snippet above will get you every H2 tag present in the proper container.
Later on you can select the required tables like this: Elements tables = playerDoc.select("div.span10.offset1 table"); and apply the proper data digging to them. The headers will be in corresponding order to the tables, of course. I think my job is done here :)
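As a small illustration of pairing them up, here is a sketch. The div.span10.offset1 container and the one-header-per-table layout are assumptions carried over from the snippets above:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;

public class SectionCounts {
    public static void main(String[] args) throws IOException {
        Document playerDoc = Jsoup.connect(
                "http://www.hcfactions.net/index.php?action=playerinfo&player=Djmaddox").get();

        // Headers and tables come back in document order, so index i of one
        // should correspond to index i of the other (per the answer above).
        Elements headers = playerDoc.select("div.span10.offset1 h2");
        Elements tables = playerDoc.select("div.span10.offset1 table");

        for (int i = 0; i < Math.min(headers.size(), tables.size()); i++) {
            String section = headers.get(i).text();           // e.g. "Deaths"
            int rows = tables.get(i).select("tr").size() - 1; // minus the header row
            System.out.println(section + ": " + rows + " entries");
        }
    }
}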

Grails AddTo in for loop

I am facing a problem because I'm a newbie to Grails.
I'm building a website for reading stories, and my goal now is to save the content of a story across several pages so I can get a list and then paginate it easily, so I did the following.
In the domain layer I created two domain classes. One is called Story and has this:
class Story {
    String title
    List pages

    static hasMany = [users: User, pages: Page]
    static belongsTo = [User]

    static mapping = {
        users lazy: false
        pages lazy: false
    }
}
and of course a domain class called Page, which has this:
class Page {
    String content
    Story story

    static belongsTo = Story

    static constraints = {
        content(blank: false, size: 3..300000)
    }
}
and the controller's save method goes like this:
def save = {
    def storyInstance = new Story(params)
    def pages = new Page(params)

    String content = pages.content
    String[] contentArr = content.split("\r\n")

    int i = 0
    StringBuilder page = new StringBuilder()
    for (StringBuilder line : contentArr) {
        i++
        page.append(line + "\r\n")
        if (i % 10 == 0) {
            pages.content = page
            storyInstance.addToPages(pages)
            page = new StringBuilder()
        }
    }

    if (storyInstance.save(flush: true)) {
        flash.message = "${message(code: 'default.created.message', args: [message(code: 'story.label', default: 'Story'), storyInstance.id])}"
        redirect(action: "viewstory", id: storyInstance.id)
    } else {
        render(view: "create", model: [storyInstance: storyInstance])
    }
}
I know it looks messy, but it's a prototype. Anyway, the problem is that I expect the storyInstance.addToPages(pages) line to add a page instance to the set of pages every time the condition is true, but what actually happens is that it gives me only the last instance, with the last page_idx, while I thought it should save the pages one by one so I could get a list of pages for every story.
Why does this happen, and is there a simpler way to do it than what I did?
Any help here is appreciated.
