JSOUP HTTP error fetching URL. Status=403 - java

I am trying to scrape Google News content for a specific time range, 1/1/2016 to 12/31/2016.
This code worked originally, but after running it several times it started returning an HTTP error.
I don't know whether I set the user agent incorrectly or whether I have been blocked by Google.
> Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://ipv4.google.com/sorry/index?continue=http://www.google.com/search%253Fq%253Dstackoverflow%2526tbm%253Dnws%2526tbs%253Dcdr%2525253A1%2525252Ccd_min%2525253A5%2525252F30%2525252F2016%2525252Ccd_max%2525253A6%2525252F30%2525252F2016%2526start%253D0&q=EgTKLTckGKH5hsQFIhkA8aeDS-3IYZmr41q-m4rIMh7Uw7vC3wdLMgNyY24
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:679)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:676)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:628)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:260)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:249)
at javaapplication3.JavaApplication3.main(JavaApplication3.java:36)
Code here:
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
    String google = "http://www.google.com/search?q=";
    String search = "stackoverflow";
    String charset = "UTF-8";
    String news = "&tbm=nws";
    String string = google + URLEncoder.encode(search, charset) + news
            + "&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2016%2Ccd_max%3A12%2F31%2F2016";
    String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
    int numberOfResultpages = 10; // grabs the first ten pages of search results
    for (int i = 0; i < numberOfResultpages; i++) {
        // Google paginates in steps of 10 results, so "start" should be i * 10
        Document document = Jsoup.connect(string).userAgent(userAgent)
                .data("start", String.valueOf(i * 10)).get();
        Elements links = document.select(".r>a");
        for (Element link : links) {
            String title = link.text();
            String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
            url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
            if (!url.startsWith("http")) {
                continue; // Ads/news/etc.
            }
            System.out.println("Title: " + title);
            System.out.println("URL: " + url);
        }
    }
}
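Note: the ipv4.google.com/sorry/... URL in the stack trace is Google's abuse/CAPTCHA interstitial, so the block is most likely rate limiting rather than a wrong user agent. One mitigation, echoed in the Bing answer further down, is to pause between fetches. A minimal sketch (assuming java.util.Random is imported and main additionally declares throws InterruptedException):

// Inside the result-page loop above: pause a random 3-5 seconds
// between fetches to reduce the chance of triggering the interstitial.
Thread.sleep(3000 + new Random().nextInt(2000));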

Related

Google custom image search - only returns website images

I want to fetch some Google images with the Google Custom Search API, but instead of the Google images I am getting thumbnails of the websites. For example, I am getting the links of the website thumbnail images, when what I want are the links of the actual images.
Maybe someone can tell me how to do that!
The Code:
public static void main(String[] args) throws Exception {
    String key = "";
    String cx = "";
    String keyword = "coke";
    URL url = new URL("https://www.googleapis.com/customsearch/v1?key=" + key + "&cx=" + cx + "&q=" + keyword);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
    BufferedReader br = new BufferedReader(new InputStreamReader((conn.getInputStream())));
    String output;
    System.out.println("Output from Server .... \n");
    while ((output = br.readLine()) != null) {
        if ((output.contains("jpg") || output.contains("png")) && output.contains("src")) {
            System.out.println(output); // Will print the google search links
        }
    }
    conn.disconnect();
}
Thanks a lot!
You aren't specifying that you want image search from Google. You are just searching for possible images in normal results. You'll need to add searchType=image.
Check this question and learn more about querying here.
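As a sketch, the request URL from the question would then become (key and cx still to be filled in):

URL url = new URL("https://www.googleapis.com/customsearch/v1?key=" + key
        + "&cx=" + cx + "&q=" + keyword + "&searchType=image");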

Unexpected HTTP 403 error in Java

I am using an API in my Java app and requesting this URL (http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9999999999). I am getting an HTTP 403 error in the console, but in a web browser no error occurs and I get the expected response. I also tried other URLs and they work fine without any errors.
So, what is the problem with this URL and what should I do?
Here is source code :
Main.java
import org.json.simple.*;
import org.json.simple.parser.*;

public class Main
{
    public static void main(String[] args) throws Exception
    {
        String numb = "9999999999,8888888888";
        String response = new http_client("http://checkdnd.com/api/check_dnd_no_api.php?mobiles=" + numb).response;
        System.out.println(response);
        // parse the JSON response
        Object obj = JSONValue.parse(response);
        JSONObject jObj = (JSONObject) obj;
        String msg = (String) jObj.get("msg");
        System.out.println("MESSAGE : " + msg);
        JSONObject msg_text = (JSONObject) jObj.get("msg_text");
        String[] numbers = numb.split(",");
        for (String number : numbers)
        {
            if (number.length() != 10 || number.matches(".*[A-Za-z].*")) {
                System.out.println(number + " is invalid.");
            } else {
                if (msg_text.get(number).equals("Y"))
                {
                    System.out.println(number + " is DND Activated.");
                } else {
                    System.out.println(number + " is not DND Activated.");
                }
            }
        }
    }
}
Now, http_client.java:
import java.net.*;
import java.io.*;

public class http_client
{
    String response = "";

    http_client(String URL) throws Exception
    {
        URL url = new URL(URL);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        BufferedReader bs = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String data = "";
        String response = "";
        while ((data = bs.readLine()) != null) {
            response = response + data;
        }
        con.disconnect();
        url = null;
        con = null;
        this.response = response;
    }
}
Without seeing the code you're using to access the supplied URL (http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9999999999) it is a wee bit difficult to determine exactly where your problem lies, but my first guess would be that the link you provided is only accessible through a Secure Socket Layer (SSL). In other words, the link should start with https:// instead of http://.
To validate this, simply make the change to your URL string: https://checkdnd.com/api/check_dnd_no_api.php?mobiles=9999999999 and try again.
You're not going to have this issue with a browser for the simple reason that browsers will generally try both protocols to make a connection. It is also up to the website which protocols are acceptable; many allow both, and some just don't.
To check whether a URL string is utilizing a valid protocol, you can use this little method I quickly whipped up:
/**
 * This method will take the supplied URL String, regardless of the protocol (http or https)
 * specified at the beginning of the string, and will return whether it is actually an
 * "http" (no SSL) or "https" (SSL) protocol. A connection to the URL is attempted first
 * with the http protocol and, if successful (by way of data acquisition), that protocol
 * is returned. If not, the https protocol is attempted and, if successful, that protocol
 * is returned. If neither protocol is successful then null is returned.<br><br>
 *
 * Returns null if the supplied URL String is invalid, a protocol does not
 * exist, or a valid connection to the URL can not be established.<br><br>
 *
 * @param webLink (String) The full link path.<br>
 *
 * @return (String) Either "http" for a non-SSL link or "https" for an SSL link.
 * Null is returned if the supplied URL String is invalid, a protocol does
 * not exist, or a valid connection to the URL can not be established.
 */
public static String isHttpOrHttps(String webLink) {
    URL url;
    try {
        url = new URL(webLink);
    } catch (MalformedURLException ex) { return null; }
    String protocol = url.getProtocol();
    if (protocol.equals("")) { return null; }
    // Strip the supplied protocol so each attempt below actually tests
    // the protocol it claims to test.
    String rest = webLink.substring(webLink.indexOf(":"));
    URLConnection yc;
    try {
        // First attempt: force plain http.
        yc = new URL("http" + rest).openConnection();
        yc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        in.close();
        return "http";
    } catch (IOException e) {
        // Do nothing....check for https instead.
    }
    try {
        // Second attempt: force https.
        yc = new URL("https" + rest).openConnection();
        yc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
        yc.connect();
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        in.close();
        return "https";
    } catch (IOException e) {
        // Do nothing....allow for null to be returned.
    }
    return null;
}
To use this method:
// Note that the http protocol is supplied within the url string:
String protocol = isHttpOrHttps("http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9999999999");
System.out.println(protocol);
The output to the console will be: https. The isHttpOrHttps() method has determined that https is the protocol to use in order to acquire data from this URL, even though http was supplied.
To pull the page source from the web page you can perhaps use a method like this:
/**
 * Returns a List of Strings containing the page source for the supplied web
 * page link.<br><br>
 *
 * @param link (String) The URL address of the web page to process.<br>
 *
 * @return (List&lt;String&gt;) A List containing the page source for
 * the supplied web page link.
 */
public static List<String> getWebPageSource(String link) {
    if (link.equals("")) { return null; }
    try {
        URL url = new URL(link);
        URLConnection yc = null;
        // If url is an SSL endpoint (using a Secure Socket Layer such as https)...
        if (link.startsWith("https:")) {
            yc = new URL(link).openConnection();
            // send request for page data...
            yc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
            yc.connect();
        }
        // and if not an SSL endpoint (just http)...
        else { yc = url.openConnection(); }
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        String inputLine;
        List<String> sourceText = new ArrayList<>();
        while ((inputLine = in.readLine()) != null) {
            sourceText.add(inputLine);
        }
        in.close();
        return sourceText;
    }
    catch (MalformedURLException ex) {
        // Do whatever you want with the exception.
        ex.printStackTrace();
    }
    catch (IOException ex) {
        // Do whatever you want with the exception.
        ex.printStackTrace();
    }
    return null;
}
In order to utilize both the methods supplied here you can try something like this:
String netLink = "http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9999999999";
String protocol = isHttpOrHttps(netLink);
String netLinkProtocol = netLink.substring(0, netLink.indexOf(":"));
if (!netLinkProtocol.equals(protocol)) {
    netLink = protocol + netLink.substring(netLink.indexOf(":"));
}
List<String> list = getWebPageSource(netLink);
for (int i = 0; i < list.size(); i++) {
    System.out.println(list.get(i));
}
And the console output will display:
{"msg":"success","msg_text":{"9999999999":"N"}}

Bing Search with Jsoup - how can I avoid captcha?

keywordexist = false;
try {
    res = Jsoup
            .connect(bingSearchUrl.replaceAll("keyword", "intitle:\"" + keyword + "\""))
            .userAgent("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.15 (KHTML, like Gecko) Chrome/24.0.1295.0 Safari/537.15")
            .referrer("http://www.bing.com")
            .method(Connection.Method.GET).execute();
    doc = res.parse();
    System.out.println(bingSearchUrl.replaceAll("keyword", "intitle:\"" + keyword + "\""));
    elements = doc.select("li[class^=b_algo]");
    System.out.println(doc.html());
    System.out.println(elements.html());
    // String divContents = doc.select(".id-app-orig-desc").first().text();
    // elements.remove("div");
    if (elements.html().contains("<strong>" + keyword + "</strong>")) {
        keywordexist = true;
        System.out.println("keyword exists");
    }
} catch (IOException e) {
    e.printStackTrace();
}
I'm trying to use Jsoup to check a list of keywords I have in Bing Search, but whenever I run my program Jsoup always connects to Bing's CAPTCHA page. Is there any way I can avoid this? I thought this would be remedied by adding a user agent and referrer, but it doesn't seem to have any effect.
I used code similar to yours and got all the results. However, here are two points I noticed:
1. You should slow down between two searches. For example, add a random pause of 3000 to 5000 ms (see the sketch after the sample code).
2. Don't forget to escape the query parameters.
SAMPLE CODE
String bingSearchUrl = "http://www.bing.com/search?q=keyword";
String keyword = "stackoverflow jsoup";
String uaString = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.15 (KHTML, like Gecko) Chrome/24.0.1295.0 Safari/537.15";
String url = bingSearchUrl.replaceAll("keyword", URLEncoder.encode("intitle:\"" + keyword + "\"", "UTF-8"));
Document doc = Jsoup.connect(url).userAgent(uaString).get();
System.out.println(doc.select("li h2"));
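And for the pause, a minimal sketch, assuming the searches run over a hypothetical keywords list and that the checked exceptions (UnsupportedEncodingException, IOException, InterruptedException) are handled by the caller:

// Hypothetical outer loop: throttle successive Bing searches.
for (String kw : keywords) {
    String u = bingSearchUrl.replaceAll("keyword",
            URLEncoder.encode("intitle:\"" + kw + "\"", "UTF-8"));
    Document d = Jsoup.connect(u).userAgent(uaString).get();
    // ... process d ...
    Thread.sleep(3000 + new Random().nextInt(2000)); // random 3-5 s pause
}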

Using Regex Pattern with Jsoup on a Weblink

I'm trying to make a parser to get product info from a website. I've made a similar tool with PHP and regex, and I wish to do the same in Java. The objective is to fetch a parent page and build child product links with regex, in order to get their product info in a loop.
String curl = TextField1.getText();
URL url = new URL(curl);
URLConnection spoof = url.openConnection();
spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream(), "UTF-8"));
String strLine = "";
// Compile the pattern once, outside the read loop.
Pattern pattern = Pattern.compile("style='color:#000000;font-weight:bold;'>(.*?)</a>");
while ((strLine = in.readLine()) != null) {
    strLine = strLine.replaceAll(" ", "_");
    strLine = strLine.replaceAll("d'", "d");
    Matcher m = pattern.matcher(strLine);
    while (m.find()) {
        String enfurl = "http://www.exemple.com/fr/" + m.group(1) + ".htm";
        System.out.println(enfurl);
    }
}
This code works, but someone told me that Jsoup is a better solution for parsing HTML. I'm reading the Jsoup documentation, but after establishing a connection I don't know which syntax I should choose. Could you help me?
EDIT: OK, with this code:
Elements links = doc.select("a[href][title*=Cliquer pour obtenir des détails]");
for (Element link : links) {
    System.out.println(link.attr("href"));
    String urlenf = link.attr("href");
    Document docenf = Jsoup.connect(urlenf).get();
    System.out.println(docenf.body().text());
}
I've got the links... but now I must open another Jsoup connection to get the product info, and this part doesn't work. How could I use another Jsoup connection in the for loop? Thanks.
Try to get the URLs (and, generally, the content) like this:
String url = "PAGE_URL_GOES_HERE";
InputStream is = new URL(url).openStream();
String encoding = "UTF-8";
Document doc = Jsoup.parse(is , encoding , url);
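Passing the page URL as the third argument (the base URI) also lets Jsoup resolve relative links. To answer the EDIT, a hedged sketch of the nested fetch, using absUrl("href") so possibly-relative hrefs become absolute before the second connection is opened:

Elements links = doc.select("a[href][title*=Cliquer pour obtenir des détails]");
for (Element link : links) {
    // absUrl resolves the href against the document's base URI
    String urlenf = link.absUrl("href");
    Document docenf = Jsoup.connect(urlenf)
            .userAgent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)")
            .get();
    System.out.println(docenf.body().text());
}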
Update
Are you sure the problem is with the encoding of the url?
I tried the below code, and it works just fine.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) {
        try {
            String url = "http://www.larousse.fr/dictionnaires/francais-anglais/écrémer/27576?q=écrémé";
            Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)")
                    .get();
            System.out.println(doc.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Update 2
In any case, try this one too: Jsoup.connect(new String(url.getBytes("UTF-8")))
There are plenty of examples of jsoup usage on the net.
Document document = Jsoup.connect(targetUrl).get(); // get html page
Elements descElements = document
        .select("table#searchResult td:nth-child(2) font.detDesc"); // find elements by css selectors
for (int i = 0; i < descElements.size(); i++) {
    String torrentDesc = descElements.get(i).html(); // get tag content
}

Resolve HtmlCleaner issue of getting HTTP response code 403

I'm using HtmlCleaner to get data from a website, but I keep getting this error:
Server returned HTTP response code: 403 for URL: http://www.groupon.com/browse/chicago?z=skip
I'm not sure what I'm doing wrong, because I've used the same code before and it worked perfectly.
Is anyone able to help me, please?
Code is below:
public ArrayList ParseGrouponDeals(ArrayList arrayList) {
    try {
        CleanerProperties props = new CleanerProperties();
        props.setTranslateSpecialEntities(true);
        props.setTransResCharsToNCR(true);
        props.setOmitComments(true);
        TagNode root = new HtmlCleaner(props).clean(new URL("http://www.groupon.com/browse/chicago?z=skip"));

        // Get the wrapper.
        Object[] objects = root.evaluateXPath("//*[@id=\"browse-deals\"]");
        TagNode dealWrapper = (TagNode) objects[0];

        // Get the children.
        TagNode[] todayDeals = dealWrapper.getElementsByAttValue("class", "deal-list-tile grid_5_third", true, true);
        System.out.println("++++ Groupon Deal Today: " + todayDeals.length + " deals");
        for (int i = 0; i < todayDeals.length; i++) {
            String link = String.format("http://www.groupon.com%s", todayDeals[i].findElementByAttValue("class", "deal-permalink", true, true).getAttributeByName("href").toString());
            arrayList.add(link);
        }
        return arrayList;
    } catch (Exception e) {
        System.out.println("Error parsing Groupon:" + e.getMessage());
        e.printStackTrace();
    }
    return null;
}
For me, adding the 'User-Agent' header solves the problem; use it like in this snippet:
final URL urlSB = new URL("http://www.groupon.com/browse/chicago?z=skip");
final URLConnection urlConnection = urlSB.openConnection();
urlConnection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0");
urlConnection.connect();
final HtmlCleaner cleaner = new HtmlCleaner();
final CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
final TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());
