No output for parsing google news content - java

With the code below, I want to get the titles and URLs of Google News search results.
It worked in the past, but I don't know why it no longer works.
Did Google change its CSS structure, or is something else going on?
Thanks.
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
String google = "http://www.google.com/search?q=";
String search = "stackoverflow";
String charset = "UTF-8";
String news="&tbm=nws";
String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news).userAgent(userAgent).get().select( ".g>.r>.a");
for (Element link : links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
}

If the question is "how do I get the code working again?", it is hard for anyone to know what the old page looked like unless they saved a copy.
I broke your select down like this, and it worked for me:
String string = google + URLEncoder.encode(search , charset) + news;
Document document = Jsoup.connect(string).userAgent(userAgent).get();
Elements links = document.select( ".r>a");
The current page source looks like
<div class="g">
<table>
<tbody>
<tr>
<td valign="top" style="width:516px"><h3 class="r">Marlboro Ransomware Defeated in One Day</h3>
Results:
Title: Marlboro Ransomware Defeated in One Day
URL: https://www.bleepingcomputer.com/news/security/marlboro-ransomware-defeated-in-one-day/
Title: Stack Overflow puts a new spin on resumes for developers
URL: https://techcrunch.com/2016/10/11/stack-overflow-puts-a-new-spin-on-resumes-for-developers/
Edited - Time range
The URL parameters look awful, but they are simple once decoded (%3A is an encoded ":", %2C an encoded "," and %2F an encoded "/").
Add the suffix &tbs=cdr%3A1%2Ccd_min%3A5%2F30%2F2016%2Ccd_max%3A6%2F30%2F2016
The part "min%3A5%2F30%2F2016" contains your minimum date, 5/30/2016.
min%3A + (month of year) + %2F + (day of month) + %2F + year
The part "max%3A6%2F30%2F2016" contains your maximum date, 6/30/2016.
max%3A + (month of year) + %2F + (day of month) + %2F + year
Here's the full URL searching for Mindy Kaling between 05/30/2016 and 06/30/2016:
https://www.google.com/search?tbm=nws&q=mindy%20kaling&tbs=cdr%3A1%2Ccd_min%3A5%2F30%2F2016%2Ccd_max%3A6%2F30%2F2016
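The suffix does not have to be assembled by hand. A minimal sketch, assuming the tbs format described above still holds: build the readable form "cdr:1,cd_min:M/D/YYYY,cd_max:M/D/YYYY" and let URLEncoder produce the escapes. The class and method names (NewsDateRange, tbsSuffix) are made up for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class NewsDateRange {
    // Builds the "&tbs=..." suffix for a custom date range (hypothetical helper).
    static String tbsSuffix(int minMonth, int minDay, int minYear,
                            int maxMonth, int maxDay, int maxYear) {
        String raw = "cdr:1,cd_min:" + minMonth + "/" + minDay + "/" + minYear
                + ",cd_max:" + maxMonth + "/" + maxDay + "/" + maxYear;
        try {
            // URLEncoder turns ':' into %3A, ',' into %2C and '/' into %2F.
            return "&tbs=" + URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String url = "https://www.google.com/search?tbm=nws&q="
                + URLEncoder.encode("mindy kaling", "UTF-8")
                + tbsSuffix(5, 30, 2016, 6, 30, 2016);
        System.out.println(url);
    }
}
```

Note that URLEncoder writes the space in the query as "+" rather than "%20"; both forms are accepted in a query string.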

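The code below worked for me. Note the selector ".g .r>a": it finds elements with class g, then any elements with class r anywhere inside them, then the a tags that are direct children of those. The original ".g>.r>.a" fails because ".a" matches a class named a rather than the a tag, and ".g>.r" requires the class-r element to be a direct child of the class-g element, which it is not in the current markup.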
Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news)
.userAgent(userAgent).get().select( ".g .r>a");
From documentation:
.class: find elements by class name, e.g. .masthead
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
Although this worked, I would not rely on it unless it is for study purposes or temporary use. Shipping it as part of a product could break whenever Google changes how the page is rendered.

Related

GWT - extract text in between two characters

In GWT I have a servlet that returns an image from the database to the client. I need to extract part of the returned string to show the image properly. What is returned in Chrome, Firefox, and IE has a slash in the src part, e.g. String s = "src=\"";, which is not visible in the string below. Maybe the slash adds extra quotes around the http string; I'm not sure.
What those three browsers return: <img style="-webkit-user-select: none;" src="http://localhost:8080/dashboardmanager/downloadfile?entityId=4886">
Edge doesn't have the slash in the src, so my method for extracting the image doesn't work in Edge.
What edge returns:
String edge = "<img src=”http://localhost:8080/dashboardmanager/downloadfile?entityId=4886”>";
Problem: I need to extract the string below.
http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
either with src= or src=\
What I tried, which works for the browsers that return "src=\":
String s = "src=\"";
int index = returned.indexOf(s) + s.length();
image.setUrl(returned.substring(index, returned.indexOf("\"", index + 1)));
But it fails in Edge because Edge doesn't return the slash.
I do not have access to Pattern and Matcher in GWT.
How can I extract (keeping in mind that the entityId number will change)
http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
from the returned string above?
EDIT:
I need a generic way to extract http://localhost:8080/dashboardmanager/downloadfile?entityId=4886
when the string might look either of these two ways:
String edge = "<img src=”http://localhost:8080/dashboardmanager/downloadfile?entityId=4886”>";
The other three browsers: <img style="-webkit-user-select: none;" src="http://localhost:8080/dashboardmanager/downloadfile?entityId=4886">
public static void main(String[] args) {
String toParse = "<img style=\"-webkit-user-select: none;\" src=\"http://localhost:8080/dashboardmanager/downloadfile?entityId=4886\">";
String delimiter = "src=\"";
int index = toParse.indexOf(delimiter) + delimiter.length();
System.out.println(toParse.substring(index, toParse.length()).split("\"")[0]);
}
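The snippet above only covers the straight-quote case. A possible sketch of a more generic version, using only plain String methods (no Pattern or Matcher): skip whatever quote character follows src=, then read up to the next quote, space or '>'. The class and method names are invented for this example, and treating the curly quotes from the Edge sample as delimiters is an assumption about what Edge actually sends.

```java
class ImageSrcExtractor {
    // Characters that may wrap the attribute value: straight and single quotes,
    // the curly quotes from the Edge sample, and a stray backslash.
    private static final String WRAPPERS = "\"'\u201C\u201D\\";

    // Returns the value of the first src attribute, or null if there is none.
    static String extractSrc(String html) {
        String marker = "src=";
        int start = html.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        // Skip any opening quote (and backslash) characters.
        while (start < html.length() && WRAPPERS.indexOf(html.charAt(start)) >= 0) {
            start++;
        }
        // Read until the closing quote, a space, or the end of the tag.
        int end = start;
        while (end < html.length()) {
            char c = html.charAt(end);
            if (WRAPPERS.indexOf(c) >= 0 || c == ' ' || c == '>') {
                break;
            }
            end++;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String others = "<img style=\"-webkit-user-select: none;\" "
                + "src=\"http://localhost:8080/dashboardmanager/downloadfile?entityId=4886\">";
        String edge = "<img src=\u201Dhttp://localhost:8080/dashboardmanager/downloadfile?entityId=4886\u201D>";
        System.out.println(extractSrc(others));
        System.out.println(extractSrc(edge));
    }
}
```

Because the end of the value is found by scanning rather than by a fixed delimiter, the entityId can have any number of digits.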

Scrape information from Web Pages with Java?

I'm trying to extract data from a web page. For example, let's say I wish to fetch information from chess.org.il.
I know the player's ID is 25022, which means I can request
http://www.chess.org.il/Players/Player.aspx?Id=25022
In that page I can see that this player's fide ID = 2821109.
From that, I can request this page:
http://ratings.fide.com/card.phtml?event=2821109
And from that I can see that stdRating=1602.
How can I get the "stdRating" output from a given "localID" input in Java?
(localID, fideID and stdRating are just helper names that I use to clarify the question.)
You could try the univocity-html-parser, which is very easy to use and avoids a lot of spaghetti code.
To get the standard rating for example you can use this code:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://ratings.fide.com/card.phtml?event={EVENT}");
url.getRequest().setUrlParameter("EVENT", 2821109);
HtmlElement doc = HtmlParser.parseTree(url);
String rating = doc.query()
.match("small").withText("std.")
.match("br").getFollowingText()
.getValue();
System.out.println(rating);
}
Which produces the value 1602.
But getting data by querying individual nodes and trying to stitch all pieces together is not exactly easy.
I expanded the code to illustrate how you can use the parser to get more information into records. Here I created records for the player and her rank details which are available in the table of the second page. It took me less than 1h to get this done:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://www.chess.org.il/Players/Player.aspx?Id={PLAYER_ID}");
url.getRequest().setUrlParameter("PLAYER_ID", 25022);
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings player = entities.configureEntity("player");
player.addField("id").match("b").withExactText("מספר שחקן").getFollowingText().transform(s -> s.replaceAll(": ", ""));
player.addField("name").match("h1").followedImmediatelyBy("b").withExactText("מספר שחקן").getText();
player.addField("date_of_birth").match("b").withExactText("תאריך לידה:").getFollowingText();
player.addField("fide_id").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getText();
HtmlLinkFollower playerCard = player.addField("fide_card_url").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getAttribute("href").followLink();
playerCard.addField("rating_std").match("small").withText("std.").match("br").getFollowingText();
playerCard.addField("rating_rapid").match("small").withExactText("rapid").match("br").getFollowingText();
playerCard.addField("rating_blitz").match("small").withExactText("blitz").match("br").getFollowingText();
playerCard.setNesting(Nesting.REPLACE_JOIN);
HtmlEntitySettings ratings = playerCard.addEntity("ratings");
configureRatingsBetween(ratings, "World Rank", "National Rank ISR", "world");
configureRatingsBetween(ratings, "National Rank ISR", "Continent Rank Europe", "country");
configureRatingsBetween(ratings, "Continent Rank Europe", "Rating Chart", "continent");
Results<HtmlParserResult> results = new HtmlParser(entities).parse(url);
HtmlParserResult playerData = results.get("player");
String[] playerFields = playerData.getHeaders();
for(HtmlRecord playerRecord : playerData.iterateRecords()){
for(int i = 0; i < playerFields.length; i++){
System.out.print(playerFields[i] + ": " + playerRecord.getString(playerFields[i]) +"; ");
}
System.out.println();
HtmlParserResult ratingData = playerRecord.getLinkedEntityData().get("ratings");
for(HtmlRecord ratingRecord : ratingData.iterateRecords()){
System.out.print(" * " + ratingRecord.getString("rank_type") + ": ");
System.out.println(ratingRecord.fillFieldMap(new LinkedHashMap<>(), "all_players", "active_players", "female", "u16", "female_u16"));
}
}
}
private static void configureRatingsBetween(HtmlEntitySettings ratings, String startingHeader, String endingHeader, String rankType) {
Group group = ratings.newGroup()
.startAt("table").match("b").withExactText(startingHeader)
.endAt("b").withExactText(endingHeader);
group.addField("rank_type", rankType);
group.addField("all_players").match("tr").withText("World (all", "National (all", "Rank (all").match("td", 2).getText();
group.addField("active_players").match("tr").followedImmediatelyBy("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("female").match("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("u16").match("tr").withText("U-16 Rank (active players):").match("td", 2).getText();
group.addField("female_u16").match("tr").withText("Female U-16 Rank (active players):").match("td", 2).getText();
}
The output will be:
id: 25022; name: יעל כהן; date_of_birth: 02/02/2003; fide_id: 2821109; rating_std: 1602; rating_rapid: 1422; rating_blitz: 1526;
* world: {all_players=195907, active_players=94013, female=5490, u16=3824, female_u16=586}
* country: {all_players=1595, active_players=1024, female=44, u16=51, female_u16=3}
* continent: {all_players=139963, active_players=71160, female=3757, u16=2582, female_u16=372}
Hope it helps
Disclosure: I'm the author of this library. It's commercial closed source but it can save you a lot of development time.
As @Alex R pointed out, you'll need a web scraping library for this.
The one he recommended, JSoup, is quite robust and is pretty commonly used for this task in Java, at least in my experience.
You'd first need to construct a document that fetches your page, eg:
int localID = 25022; //your player's ID.
Document doc = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localID).get();
From this Document object you can fetch a lot of information, for example the FIDE ID you requested. Unfortunately, the web page you linked isn't very simple to scrape, and you'll basically need to go through every link on the page to find the relevant one, for example:
Elements fidelinks = doc.select("a[href*=fide.com]");
This Elements object should give you a list of all links whose href contains the text fide.com, but you probably only want the first one, e.g.:
Element fideurl = doc.selectFirst("a[href*=fide.com]");
From that point on, I don't want to write all the code for you, but hopefully this answer serves as a good starting point!
You can get the ID alone by calling the text() method on your Element object, but you can also get the link itself by calling attr("href") on it.
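If you go the href route, the FIDE ID can be read from the link's query string without any scraping library. A small sketch, assuming the link keeps the card.phtml?event=<id> shape shown in the question; FideLink and eventId are hypothetical names.

```java
import java.net.URI;

class FideLink {
    // Returns the value of the "event" query parameter, or null if it is absent.
    static String eventId(String href) {
        String query = URI.create(href).getQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("event")) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String href = "http://ratings.fide.com/card.phtml?event=2821109";
        System.out.println(eventId(href));
    }
}
```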
The css selector you can use to get the other value is
div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type, which will get you the std score specifically, at least with standard css, so this should work with jsoup as well.

jsoup - Capturing <h1> elements excluding those that have the value &nbsp;

I'm using a crawler to capture data from a website.
Now, I'm trying to select all of the <h1> elements and print them (for now). I noticed that some headers contain only &nbsp;, which makes the data look empty.
I want to exclude <h1>s whose value is &nbsp;.
Here's what I have tried:
`private static void getAllH1(String url, Element tCon) {
// System.out.println("Url: " + url);
Elements headers1 = tCon.getElementsByTag("h1");
System.out.println("Url\t\tHeader");
for(Element h1: headers1) {
if(h1.text().length()!=0 && h1.text()!="\u00a0") {
System.out.println(url + "\t\t" + h1.text());
}
}
}`
EDIT: I saw in one of the threads here that jsoup reads &nbsp; as \u00a0, but it's still not working.
Here is an example output:
`
Url Header
http://www.url.com/index.asp Quick Links
http://www.url.com/index.asp What's New
http://www.url.com/index.asp  
http://www.url.com/index.asp What's Next
http://www.url.com/index.asp What's On
http://www.url.com/index.asp Key Rates
http://www.url.com/index.asp Public Advisories
`
Thank you in advance!
I found the answer in this jsoup issue:
Element.text() doesn't normalize '&nbsp;' whitespace #529
So I updated my jsoup from jsoup-1.9.2 to jsoup-1.11.2.
Then, when I ran the code (same code; no alteration), it finally recognized the &nbsp;.
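Independently of the jsoup version, the check in the question has a second problem: h1.text()!="\u00a0" compares object references rather than string contents, so it does not filter anything. A version-independent sketch that treats the non-breaking space as ordinary whitespace; HeaderFilter and isBlank are names invented for this example.

```java
class HeaderFilter {
    // True when the text is empty or holds only whitespace,
    // counting the non-breaking space (U+00A0) as whitespace.
    static boolean isBlank(String text) {
        return text.replace('\u00a0', ' ').trim().isEmpty();
    }

    public static void main(String[] args) {
        String[] headers = { "Quick Links", "\u00a0", "", "What's Next" };
        for (String header : headers) {
            if (!isBlank(header)) {
                System.out.println(header);
            }
        }
    }
}
```

In the loop from the question, the condition would then become if (!isBlank(h1.text())).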

Weird issue when using HTML form to pass information to Java Servlet

Okay so I have two Java Servlets, one for letting the user select which images to delete (DeleteImages) and another for actually deleting the images (HandleDelete). DeleteImages displays all the images in the container with a checkbox HTML form for the user to select which images to delete. Then, using POST the servlet passes that information along to HandleDelete which iterates over which images it received and deletes them.
I actually had this working, but then I tried to change the structure of the code (have DeleteImages forward to a .jsp file that output the HTML form which would then forward to HandleDelete) but that didn't work out, so I'm trying to go back to the old way and now it's not working even though I'm pretty sure it's the same as I had before.
From DeleteImages:
// retrieve image files
List<? extends SwiftObject> objs = os.objectStorage().objects().list("imageFiles");
out.println("<!DOCTYPE html><html><head><title>Object Storage - Delete</title><link rel='stylesheet' href='stylesheet.css' type='text/css' /></head>"
+ "<body>"
+ "<h1>Select images to Delete from container</h1>"
+ "<form method='POST' enctype='multipart/form-data' action='/ImageUpload/OSHandleDelete'>");
for (SwiftObject o : objs) {
// omitted code that gets the image's name, date last modified, and filepath (all Strings, I know this works)
out.println("<b>Name:</b> " + name + "<br/> <b>Time of Upload:</b> " + date + "<br/>"
+ "<input type='checkbox' name='"+name+"'> "
+ "<img src='"+filepath+"' alt='' style='max-width:800px;' /> <br/> <br/> <br/> ");
}
out.println("<input type='submit' value='Delete Images' /></form>");
out.println("</form> <br/> <br/> <div><br/><a class='return' href='index.jsp'><b>Click here to return home</b></a><br/><br/></div> <br/> <br/></body></html>");
From HandleDelete:
Enumeration<String> parameterNames = request.getParameterNames();
if (!parameterNames.hasMoreElements())
System.out.println("no parameters (null)");
while (parameterNames.hasMoreElements()) {
String paramName = parameterNames.nextElement();
System.out.println("*****************");
System.out.println("parameter: " + paramName);
String[] vals = request.getParameterValues(paramName);
if (vals == null)
System.out.println(" vals is null for " + paramName);
else {
for (int i = 0; i < vals.length; i++) {
System.out.println(" vals["+i+"]: " + vals[i]);
}
}
}
Right now I don't have HandleDelete actually doing anything besides print statements. This is because I use request.getParameter("<name>") to find out whether the user checked image of <name> (i.e. if it's null it was not checked but if it's not null it was checked).
The HTML form displays perfectly with the images and everything. My problem is that no matter what's checked in the HTML form, HandleDelete always prints to the console no parameters (null) meaning nothing was passed from DeleteImages to HandleDelete. I have a feeling the problem comes from either (1) the setAttribute statement in DeleteImages or (2) something with the HTML form. I've done a lot of searching and I'm pretty confident what I have is right though and I really can't figure out what's causing this issue (especially since I'm pretty sure this is exactly what I had before and it worked). Does anyone have any ideas?
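I found the error: I had included enctype='multipart/form-data' in the form.
I removed that part and now it's passing the parameters perfectly. The likely cause is that request.getParameterNames() does not see the fields of a multipart/form-data body unless the servlet is configured to handle multipart requests (for example with @MultipartConfig in Servlet 3.0+). Thanks for your help guys!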

How to get h2 Tag of a table using Jsoup

I need some help scraping a web page with Jsoup. I want to parse player profiles from the hcfactions web page and gather their kills and deaths. The problem I'm running into is that each profile page is dynamically created and will only have said tables if the player has kills or deaths. So in order to tell which table I'm parsing, I need to get the header text that's set after the call.
example web page: http://www.hcfactions.net/index.php?action=playerinfo&player=Djmaddox.
Below is a html segment from the web page I'm scraping:
<table class='table-bordered'><h2 style='text-align:center'>Deaths</h2>
<tr><td>Date</td><td>Reason</td><td>Details</td></tr><tr><td>Dec 11 5:27pm CST</td>.....
I have this code that pulls the tables and counts the entries, but it won't pull the h2 tags with them for me to select.
public void getPlayerDetails(String name) {
String data = "";
Avatar temp = _db.getPlayer(name);
playerUrl = "http://www.hcfactions.net/index.php?action=playersearch&player=" + name;
try {
// data = Jsoup.connect(url)
// .url(url).get().html();
playerDoc = Jsoup.connect(playerUrl).get();
} catch (IOException ex) {
Logger.getLogger(JParser.class.getName()).log(Level.SEVERE, null, ex);
}
if (playerDoc.select("table").size() == 1) {
return;
} else if (playerDoc.select("table").size() >= 2) {
for (int x = 1; x < playerDoc.select("table").size(); x++) {
System.out.println("deaths");
Element table = playerDoc.select("table").get(x);
Iterator<Element> ite = table.select("tr").iterator();
int count = 0;
while (ite.hasNext()) {
data = ite.next().text();
count++;
}
if (count > 0) {
temp.setDeaths(count - 1);
}
}
}
}
The <h2> tag is in an invalid position (directly inside <table>), which I think is why jsoup cannot find it there. You have to extract it yourself with a regular expression. You can get the content of the <h2> with the following code:
String tableToString = "<table class='table-bordered'><h2 style='text-align:center'>Deaths</h2>" + "<tr>" + "<td>Date</td>" + "<td>Reason</td>" + "<td>Details</td>" + "</tr>" + "</table>";
String regex = "<h2.*>(.*)?</h2>";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(tableToString);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
You can init tableToString with table.toString() from your code.
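One caveat with that pattern: ".*" is greedy, so when several <h2> elements sit on the same line the first match can run past earlier headings. A slightly more defensive sketch, using "[^>]*" for the attributes and a reluctant "(.*?)" for the content, and collecting every heading. H2Extractor is a made-up name, and the "Kills" heading in the sample input is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class H2Extractor {
    // [^>]* stays inside the opening tag; (.*?) stops at the nearest </h2>.
    private static final Pattern H2 = Pattern.compile(
            "<h2[^>]*>(.*?)</h2>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of every <h2> element found in the raw HTML, in order.
    static List<String> headings(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = H2.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<table class='table-bordered'><h2 style='text-align:center'>Deaths</h2>"
                + "<tr><td>Date</td></tr></table>"
                + "<table class='table-bordered'><h2>Kills</h2></table>"; // assumed sample
        System.out.println(headings(html));
    }
}
```

This is still regex over HTML, so it is best kept for small, known fragments like this one rather than whole pages.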
As ka3ak says, the <h2> is mispositioned. But you don't have to abandon your parser and resort to regex for that. Assuming jsoup is a decent HTML parser (I have never used it myself), the <h2> element should end up as the element immediately preceding the <table> element. Make your select statement look for it there.
Elements headers=playerDoc.select("div.span10.offset1 h2");
IMHO your selection seems a little bit overcomplicated, but maybe it has to be like that. Anyway, the snippet above will get you every h2 tag present in the proper container.
Later on you can select the required tables like this: Elements tables=playerDoc.select("div.span10.offset1 table"); and apply the proper data digging to them. The headers will be in the same order as the tables, of course. I think that my job is done here :)
