Some time ago I wrote a Python program that logged into a website over HTTPS, fetched some information, and logged out.
The program was quite simple:
import time
import urllib2

# baseurl and cookie are module-level globals defined elsewhere in the script.

class Richiesta(object):
    def __init__(self, url, data):
        self.url = url
        self.data = ""
        self.content = ""
        for k, v in data.iteritems():
            self.data += str(k) + "=" + str(v) + "&"
        if self.data == "":
            self.req = urllib2.Request(self.url)
        else:
            self.req = urllib2.Request(self.url, self.data)
        self.req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6')
        self.req.add_header('Referer', baseurl + '/www/')
        self.req.add_header('Cookie', cookie)

    def leggi(self):
        while self.content == "":
            try:
                r = urllib2.urlopen(self.req)
            except urllib2.HTTPError, e:
                print("Server error, retrying in 15 seconds")
                time.sleep(15)
            except urllib2.URLError, e:
                print("Network problem, trying to reconnect in 20 seconds")
                time.sleep(20)
            else:
                self.content = r.read().decode('utf-8')

def login(username, password):
    global cookie
    print("Starting the login procedure")
    url = "https://example.com/auth/Authenticate"
    data = {"login": "1", "username": username, "password": password}
    f = Richiesta(url, data)
    f.leggi()
Now, for some reason, I have to translate it into Java. This is what I've written so far:
import java.net.*;
import java.security.Security.*;
import java.io.*;
import javax.net.ssl.*;

public class SafeReq {
    String baseurl = "http://www.example.com";
    String useragent = "Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6";
    String content = "";

    public SafeReq(String s, String sid, String data) throws MalformedURLException {
        try {
            URL url = new URL(s);
            HttpsURLConnection request = (HttpsURLConnection) url.openConnection();
            request.setUseCaches(false);
            request.setDoOutput(true);
            request.setDoInput(true);
            request.setFollowRedirects(true);
            request.setInstanceFollowRedirects(true);
            request.setRequestProperty("User-Agent", useragent);
            request.setRequestProperty("Referer", "http://www.example.com/www/");
            request.setRequestProperty("Cookie", "sid=" + sid);
            request.setRequestProperty("Origin", "http://www.example.com");
            request.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            request.setRequestProperty("Content-length", String.valueOf(data.length()));
            request.setRequestMethod("POST");
            OutputStreamWriter post = new OutputStreamWriter(request.getOutputStream());
            post.write(data);
            post.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(request.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                content += inputLine;
            }
            post.close();
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public String leggi() {
        return content;
    }
}
The problem is that the login doesn't work, and when I try to fetch a page that requires me to be logged in, I get the "Login Again" message.
The two classes seem to do much the same thing, and I can't understand why I can't make the second one work... any idea?
Where do you get your sid from? From the symptoms, I would guess that your session cookie is not passed correctly to the server.
See this question for a possible solution: Cookies turned off with Java URLConnection.
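If you want to stick with HttpsURLConnection, one thing to try (a minimal sketch; it assumes the server tracks the session purely through cookies) is to install the JDK's built-in cookie manager before the first request, instead of copying the Cookie header around by hand:

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

// Install once (e.g. at the top of main); from then on every URLConnection
// in the JVM stores and resends cookies automatically.
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));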
In general, I recommend using HttpClient for implementing HTTP conversations in Java (anything more complicated than a simple one-time GET or POST). See the code examples (I guess the "Form based logon" example is appropriate in your case).
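For illustration, a form-based login with Apache HttpClient 4.x could be sketched like this (a sketch only, assuming HttpClient 4.3+ on the classpath; the URL and field names are copied from your Python version and may need adjusting):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class FormLogin {
    public static void main(String[] args) throws Exception {
        // The cookie store keeps the session cookie between requests.
        BasicCookieStore cookieStore = new BasicCookieStore();
        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build()) {
            HttpPost post = new HttpPost("https://example.com/auth/Authenticate");
            post.setEntity(new UrlEncodedFormEntity(Arrays.asList(
                    new BasicNameValuePair("login", "1"),
                    new BasicNameValuePair("username", "myUser"),
                    new BasicNameValuePair("password", "myPass")),
                    StandardCharsets.UTF_8));
            try (CloseableHttpResponse response = client.execute(post)) {
                System.out.println(EntityUtils.toString(response.getEntity()));
            }
            // Any further request made through `client` now carries the
            // session cookie automatically.
        }
    }
}

The cookie store is the important part: it plays the role of your global cookie variable, but HttpClient fills it and replays it for you.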
For anyone looking at this in the future: take a look at HtmlUnit.
This answer has a nice example.
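For instance, a login with HtmlUnit might look roughly like this (a sketch only, assuming a recent HtmlUnit 2.x; the form index and input names are hypothetical and must match the real login page):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitLogin {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage("https://example.com/auth/Authenticate");
            // Hypothetical: assumes the first form on the page is the login
            // form and its fields are named "username", "password" and "login".
            HtmlForm form = page.getForms().get(0);
            form.getInputByName("username").setValueAttribute("myUser");
            form.getInputByName("password").setValueAttribute("myPass");
            HtmlPage loggedIn = form.getInputByName("login").click();
            System.out.println(loggedIn.asText());
            // webClient keeps the session cookies, so later getPage(...)
            // calls are made as the logged-in user.
        }
    }
}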
Related
I am using an API in my Java app, requesting this URL: http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9999999999. I get an HTTP 403 error in the console, but in a web browser no error occurs and I get the expected response. I also tried other URLs and they work fine without any problems or errors.
So, what is the problem with this URL and what should I do?
Here is source code :
Main.java
import org.json.simple.*;
import org.json.simple.parser.*;

public class Main
{
    public static void main(String[] args) throws Exception
    {
        String numb = "9999999999,8888888888";
        String response = new http_client("http://checkdnd.com/api/check_dnd_no_api.php?mobiles=" + numb).response;
        System.out.println(response);
        // parse the JSON response
        Object obj = JSONValue.parse(response);
        JSONObject jObj = (JSONObject) obj;
        String msg = (String) jObj.get("msg");
        System.out.println("MESSAGE : " + msg);
        JSONObject msg_text = (JSONObject) jObj.get("msg_text");
        String[] numbers = numb.split(",");
        for (String number : numbers)
        {
            if (number.length() != 10 || number.matches(".*[A-Za-z].*")) {
                System.out.println(number + " is invalid.");
            } else {
                if (msg_text.get(number).equals("Y"))
                {
                    System.out.println(number + " is DND Activated.");
                } else {
                    System.out.println(number + " is not DND Activated.");
                }
            }
        }
    }
}
Now, http_client.java:
import java.net.*;
import java.io.*;

public class http_client
{
    String response = "";

    http_client(String URL) throws Exception
    {
        URL url = new URL(URL);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        BufferedReader bs = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String data = "";
        String response = "";
        while ((data = bs.readLine()) != null) {
            response = response + data;
        }
        con.disconnect();
        url = null;
        con = null;
        this.response = response;
    }
}
Without you showing us the code you're using to access the supplied URL (http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9999999999), it's a wee bit difficult to determine exactly where your problem lies, but my first guess would be that the link you provided is only accessible through a Secure Sockets Layer (SSL). In other words, the link should start with https:// instead of http://.
To validate this, simply make the change to your URL string: https://checkdnd.com/api/check_dnd_no_api.php?mobiles=9999999999 and try again.
You won't have this issue in a browser for the simple reason that browsers will generally try both protocols to make a connection. It is also up to the website which protocol is acceptable; lots allow both, and some just don't.
To check if a url string is utilizing a valid protocol you can use this little method I quickly whipped up:
/**
 * This method takes the supplied URL string, regardless of the protocol (http or https)
 * specified at the beginning of the string, and returns whether it is actually an
 * "http" (no SSL) or "https" (SSL) protocol. A connection to the URL is attempted first
 * with the http protocol and, if successful (by way of data acquisition), that protocol
 * is returned. If not, the https protocol is attempted and, if successful, that protocol
 * is returned. If neither protocol is successful, null is returned.<br><br>
 *
 * Returns null if the supplied URL string is invalid, a protocol does not
 * exist, or a valid connection to the URL can not be established.<br><br>
 *
 * @param webLink (String) The full link path.<br>
 *
 * @return (String) Either "http" for a non-SSL link or "https" for an SSL link.
 * Null is returned if the supplied URL string is invalid, a protocol does
 * not exist, or a valid connection to the URL can not be established.
 */
public static String isHttpOrHttps(String webLink) {
    URL url;
    try {
        url = new URL(webLink);
    } catch (MalformedURLException ex) {
        return null;
    }
    String protocol = url.getProtocol();
    if (protocol.equals("")) {
        return null;
    }
    // Strip the supplied protocol so that each one can be tried explicitly.
    String address = webLink.substring(webLink.indexOf(":") + 1);
    try {
        URLConnection yc = new URL("http:" + address).openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        in.close();
        return "http";
    } catch (IOException e) {
        // Do nothing... check for https instead.
    }
    try {
        URLConnection yc = new URL("https:" + address).openConnection();
        // Send request for page data...
        yc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
        yc.connect();
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        in.close();
        return "https";
    } catch (IOException e) {
        // Do nothing... allow null to be returned.
    }
    return null;
}
To use this method:
// Note that the http protocol is supplied within the url string:
String protocol = isHttpOrHttps("http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9999999999");
System.out.println(protocol);
The output to console will be: https. The isHttpOrHttps() method has determined that https is the protocol required to successfully acquire data, even though http was supplied.
To pull the page source from the web page you can perhaps use a method like this:
/**
 * Returns a List ArrayList containing the page source for the supplied web
 * page link.<br><br>
 *
 * @param link (String) The URL address of the web page to process.<br>
 *
 * @return (List ArrayList) A List ArrayList containing the page source for
 * the supplied web page link.
 */
public static List<String> getWebPageSource(String link) {
    if (link.equals("")) {
        return null;
    }
    try {
        URL url = new URL(link);
        URLConnection yc = null;
        // If the url is an SSL endpoint (using a Secure Sockets Layer, i.e. https)...
        if (link.startsWith("https:")) {
            yc = new URL(link).openConnection();
            // Send request for page data...
            yc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
            yc.connect();
        }
        // ...and if not an SSL endpoint (just http)...
        else {
            yc = url.openConnection();
        }
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        String inputLine;
        List<String> sourceText = new ArrayList<>();
        while ((inputLine = in.readLine()) != null) {
            sourceText.add(inputLine);
        }
        in.close();
        return sourceText;
    }
    catch (MalformedURLException ex) {
        // Do whatever you want with the exception.
        ex.printStackTrace();
    }
    catch (IOException ex) {
        // Do whatever you want with the exception.
        ex.printStackTrace();
    }
    return null;
}
In order to utilize both the methods supplied here you can try something like this:
String netLink = "http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9999999999";
String protocol = isHttpOrHttps(netLink);
String netLinkProtocol = netLink.substring(0, netLink.indexOf(":"));
if (!netLinkProtocol.equals(protocol)) {
    netLink = protocol + netLink.substring(netLink.indexOf(":"));
}
List<String> list = getWebPageSource(netLink);
for (int i = 0; i < list.size(); i++) {
    System.out.println(list.get(i));
}
And the console output will display:
{"msg":"success","msg_text":{"9999999999":"N"}}
I'm trying to read out the source code of a website.
But there is an issue when I want to fetch the code of this page, for example: "https://www.amazon.de/gp/bestsellers/pet-supplies/#2"
I've tried a lot, but I'm still just receiving the code of "https://www.amazon.de/gp/bestsellers/pet-supplies". So something is not working right, as I want to receive places 21-40 and not 1-20.
I'm using a URLConnection and a BufferedReader:
public String fetchPage(String urlS) {
    String s = null;
    String qc = ""; // start with an empty string, not null, so += doesn't prepend "null"
    try {
        URL url = new URL(urlS);
        URLConnection uc = url.openConnection();
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0");
        BufferedReader reader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
        while ((s = reader.readLine()) != null) {
            qc += s;
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
        qc = "receiving qc failed";
    }
    return qc;
}
Thank you in advance for your effort :)
The URL you're fetching contains an anchor (the #2 at the end). An anchor is a client-side concept, originally used to jump to a certain part of the page. Some web apps (mostly single-page apps) use the anchor to keep track of some sort of state (e.g. which page of products you're viewing).
Since the anchor is a client-side concept, the responding web server (or your browser/HTTP client library) just drops any anchors, as if you had actually requested https://www.amazon.de/gp/bestsellers/pet-supplies.
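You can see this from Java itself with a quick sketch:

URL url = new URL("https://www.amazon.de/gp/bestsellers/pet-supplies/#2");
System.out.println(url.getPath()); // /gp/bestsellers/pet-supplies/  (what the server sees)
System.out.println(url.getRef());  // 2  (the fragment, which never leaves the client)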
Bottom line is that you'll never get the second page this way... Good luck in scraping Amazon though ;)
I'm trying to read an HTML file from a URL. My code works with most sites, except some of them, such as http://dota2.gamepedia.com/Dota_2_Wiki. I guess I need to set a Java proxy or something?...
Here's my code:
try {
    URL webPage = new URL("http://dota2.gamepedia.com/Dota_2_Wiki");
    URLConnection con = webPage.openConnection();
    con.setConnectTimeout(5000);
    con.setReadTimeout(5000);
    BufferedReader in = new BufferedReader(
            new InputStreamReader(con.getInputStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null)
        System.out.println(inputLine);
    in.close();
}
catch (MalformedURLException exc) { exc.printStackTrace(); }
catch (IOException exc) { exc.printStackTrace(); }
As the result:
java.io.IOException: Server returned HTTP response code: 403 for URL: http://dota2.gamepedia.com/Dota_2_Wiki
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1838)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
at com.Popov.Main.main(Main.java:17)
Error code 403: how can I get access to it? By the way, it works correctly in the browser.
Most likely your problem is caused by not setting the user agent properly. For those of you who love vanilla Java, here is the code:
private static final String USER_AGENT = "Mozilla/5.0"; // any common browser UA string works

private void sendGet() throws Exception {
    String url = "http://dota2.gamepedia.com/Dota_2_Wiki";
    URL obj = new URL(url);
    CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    HttpURLConnection con = (HttpURLConnection) obj.openConnection();

    con.setRequestMethod("GET");
    con.setRequestProperty("User-Agent", USER_AGENT);

    int responseCode = con.getResponseCode();
    System.out.println("\nSending 'GET' request to URL : " + url);
    System.out.println("Response Code : " + responseCode);

    BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
    String inputLine;
    StringBuffer response = new StringBuffer();
    while ((inputLine = in.readLine()) != null) {
        response.append(inputLine);
    }
    in.close();

    System.out.println(response.toString());
}
Note that you also need to set up the cookie handler, because when I tried it without one, the code gave me a too-many-redirects loop.
You can simply try using the jsoup HTML parser. See this sample code:
public static void main(String[] args) throws IOException {
    Document doc = Jsoup
            .connect("http://dota2.gamepedia.com/Dota_2_Wiki")
            .userAgent(
                    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36")
            .timeout(0).followRedirects(true).execute().parse();
    Elements titles = doc.select(".entrytitle");
    // print all titles on the main page
    for (Element e : titles) {
        System.out.println("text: " + e.text());
        System.out.println("html: " + e.html());
    }
    // print all available links on the page
    Elements links = doc.select("a[href]");
    for (Element l : links) {
        System.out.println("link: " + l.attr("abs:href"));
    }
}
I think your problem here is that the server doesn't accept your "user agent" string and returns a 403 Forbidden code.
One answer suggested using Jsoup and setting the user agent manually, but didn't explain that setting the user agent is the crucial step. You could use that approach.
Or, you could read Setting user agent of a java URLConnection and set the user agent of the URLConnection yourself. This approach doesn't need any external libraries.
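A minimal sketch of that approach (the user-agent value is just an example; any common browser string should do):

URLConnection con = new URL("http://dota2.gamepedia.com/Dota_2_Wiki").openConnection();
con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36");
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));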
I'm trying to make a parser to get product info from a website. I've made a similar tool with PHP and regex, and I wish to do the same in Java. The objective is to fetch a parent link, build the child product links with regex, and get their product info in a loop.
String curl = TextField1.getText();
URL url = new URL(curl);
URLConnection spoof = url.openConnection();
spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream(), "UTF-8"));
String strLine = "";
while ((strLine = in.readLine()) != null) {
    Pattern pattern = Pattern.compile("style='color:#000000;font-weight:bold;'>(.*?)</a>");
    strLine = strLine.replaceAll(" ", "_");
    strLine = strLine.replaceAll("d'", "d");
    Matcher m = pattern.matcher(strLine);
    while (m.find()) {
        String enfurl = "http://www.exemple.com/fr/" + m.group(1) + ".htm";
        System.out.println(enfurl);
    }
}
This code works, but someone told me that Jsoup is a better solution for parsing HTML. I'm reading the Jsoup documentation, but after establishing a connection I don't know which syntax to choose. Could you help me?
EDIT: OK, with this code:
Elements links = doc.select("a[href][title*=Cliquer pour obtenir des détails]");
for (Element link : links) {
    System.out.println(link.attr("href"));
    String urlenf = link.attr("href");
    Document docenf = Jsoup.connect(urlenf).get();
    System.out.println(docenf.body().text());
}
I've got the links... but now I must open another Jsoup connection to get the product info, and this test doesn't work. How could I use another Jsoup connection in the for loop? Thanks.
Try to get the URLs (and generally, the content) like this:
String url = "PAGE_URL_GOES_HERE";
InputStream is = new URL(url).openStream();
String encoding = "UTF-8";
Document doc = Jsoup.parse(is , encoding , url);
Update
Are you sure the problem is with the encoding of the url?
I tried the below code, and it works just fine.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) {
        try {
            String url = "http://www.larousse.fr/dictionnaires/francais-anglais/écrémer/27576?q=écrémé";
            Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)")
                    .get();
            System.out.println(doc.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Update 2
In any case, try this one too: Jsoup.connect(new String(url.getBytes("UTF-8")))
There are plenty of examples of jsoup usage on the net.
Document document = Jsoup.connect(targetUrl).get(); // get the html page
Elements descElements = document
        .select("table#searchResult td:nth-child(2) font.detDesc"); // find elements by CSS selectors
for (int i = 0; i < descElements.size(); i++) {
    String torrentDesc = descElements.get(i).html(); // get the tag content
}
I am writing a small Java program to get the amount of results for a given Google search term. For some reason, in Java I am getting a 403 Forbidden, but I am getting the right results in web browsers. Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class DataGetter {

    public static void main(String[] args) throws IOException {
        getResultAmount("test");
    }

    private static int getResultAmount(String query) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(new URL("https://www.google.com/search?q=" + query).openConnection()
                .getInputStream()));
        String line;
        String src = "";
        while ((line = r.readLine()) != null) {
            src += line;
        }
        System.out.println(src);
        return 1;
    }
}
And the error:
Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.google.com/search?q=test
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
at DataGetter.getResultAmount(DataGetter.java:15)
at DataGetter.main(DataGetter.java:10)
Why is it doing this?
You just need to set the user agent header for it to work:
URLConnection connection = new URL("https://www.google.com/search?q=" + query).openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
connection.connect();

BufferedReader r = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));

StringBuilder sb = new StringBuilder();
String line;
while ((line = r.readLine()) != null) {
    sb.append(line);
}
System.out.println(sb.toString());
The SSL is handled transparently for you, as can be seen from your exception stacktrace.
Getting the result amount is not really this simple, though; after this you have to fake that you're a browser by fetching the cookie and parsing the redirect token link.
String response = sb.toString(); // the page fetched above
String cookie = connection.getHeaderField("Set-Cookie").split(";")[0];
Pattern pattern = Pattern.compile("content=\\\"0;url=(.*?)\\\"");
Matcher m = pattern.matcher(response);
if (m.find()) {
    String url = m.group(1);
    connection = new URL(url).openConnection();
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    connection.setRequestProperty("Cookie", cookie);
    connection.connect();

    r = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));
    sb = new StringBuilder();
    while ((line = r.readLine()) != null) {
        sb.append(line);
    }
    response = sb.toString();

    pattern = Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");
    m = pattern.matcher(response);
    if (m.find()) {
        long amount = Long.parseLong(m.group(1).replaceAll(",", ""));
        return amount;
    }
}
Running the full code I get 2930000000L as a result.
For me it worked by adding the header:
"Accept": "*/*"
You probably aren't setting the correct headers. Use LiveHttpHeaders (or an equivalent) in the browser to see what headers the browser is sending, then emulate them in your code, as sketched below.
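For example (the header values below are hypothetical; copy whatever your own browser actually sends):

URLConnection conn = new URL("https://www.google.com/search?q=test").openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
conn.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
conn.connect();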
It's because the site uses SSL. Try using the Jersey HTTP Client. You will probably also have to learn a little about HTTPS and certificates, but I think Jersey can be set to ignore most of the details relating to the actual security.