I'm trying to connect and retrieve the page title from here. The code works fine if I remove everything after ".com" from the link. The following code does not work:
try {
    Document doc = Jsoup.connect("https://news.google.com/news/local/section/geo/Sammamish,%20WA%2098075,%20United%20States/Sammamish,%20Washington?ned=us&hl=en")
            .data("query", "Java")
            .userAgent("Chrome")
            .cookie("auth", "token")
            .timeout(3000)
            .post();
    String title = doc.title();
    Log.d("hellomate", title);
} catch (IOException e) {
    Log.d("hellomatee", e.toString());
}
If the code worked, the title returned should be "Sammamish Washington - Google News".
The error returned from the code is: "org.jsoup.HttpStatusException: HTTP error fetching URL. Status=405, URL=https://news.google.com/news/local/section/geo/Sammamish,%20WA%2098075,%20United%20States/Sammamish,%20Washington?ned=us&hl=en"
What does status 405 mean? Does Jsoup not allow the kind of url I used?
Thanks.
Status 405 is an HTTP status code that means "Method Not Allowed": the server refuses the HTTP method you used (here POST) for that URL. You can find some documentation from Microsoft on it here. As @Andreas said, you can fix this by changing .post(); to .get();.
If you look at the jsoup docs under example, it shows you how you would probably want to structure your requests:
Jsoup.connect("http://en.wikipedia.org/").get();
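To see a 405 in isolation, here is a minimal sketch that does not use jsoup or Google News at all. It assumes a hypothetical local server (built with the JDK's com.sun.net.httpserver) that only accepts GET on /page, and a plain HttpURLConnection client. The class name, path, and response bodies are made up for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Local stand-in for a page that only accepts GET (hypothetical, not Google News). */
class MethodNotAllowedDemo {

    /** Sends a request with the given method and returns the HTTP status code. */
    static int statusFor(String method, int port) {
        try {
            URI uri = URI.create("http://127.0.0.1:" + port + "/page");
            HttpURLConnection con = (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
            con.setRequestMethod(method);
            if ("POST".equals(method)) {
                con.setDoOutput(true);          // POST carries a body, here an empty one
                con.getOutputStream().close();
            }
            int status = con.getResponseCode();
            con.disconnect();
            return status;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Starts the server, tries POST then GET, and returns both status codes. */
    static int[] run() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/page", exchange -> {
                boolean isGet = "GET".equals(exchange.getRequestMethod());
                byte[] body = (isGet ? "<title>ok</title>" : "Method Not Allowed")
                        .getBytes(StandardCharsets.UTF_8);
                if (!isGet) {
                    // A 405 response lists the methods the resource does accept.
                    exchange.getResponseHeaders().add("Allow", "GET");
                }
                exchange.sendResponseHeaders(isGet ? 200 : 405, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                return new int[] { statusFor("POST", port), statusFor("GET", port) };
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] codes = run();
        System.out.println("POST -> " + codes[0] + ", GET -> " + codes[1]);
    }
}
```

The same URL answers 405 to POST and 200 to GET, which is the situation in the question: the URL is fine, only the method is rejected.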
I'm trying to send an http request to bing's spell checking api using a GET request. I checked my parameters and headers on https://www.hurl.it/ and it returned a proper json with the spelling errors properly, however when I send the request from my java app it returns this json with NO spelling errors detected (therefore, text parameter HAS to be empty somehow). I'm definitely passing the correct key in the header because that part isn't sending an error and the code is 200 (success).
My string: "my funger is harting me"
My code returned:
{"_type":"SpellCheck","flaggedTokens":[]}
Hurl.it returned:
{
  "_type":"SpellCheck",
  "flaggedTokens":[
    {
      "offset":3,
      "token":"funger",
      "type":"UnknownToken",
      "suggestions":[
        {
          "suggestion":"finger",
          "score":0.903614003311793
        }
      ]
    },
    {
      "offset":13,
      "token":"harting",
      "type":"UnknownToken",
      "suggestions":[
        {
          "suggestion":"hurting",
          "score":0.903614003311793
        }
      ]
    }
  ]
}
This is my java code using Apache's HTTPClient library:
(note: "command.getAfter()" is the passed string I mentioned above. I debugged it and even hard coded a string to test it out. Same output obviously.)
HttpClient httpclient = HttpClients.createDefault();
try {
    URIBuilder builder = new URIBuilder("https://api.cognitive.microsoft.com/bing/v5.0/spellcheck/");
    builder.setParameter("text", command.getAfter());
    URI uri = builder.build();
    HttpGet request = new HttpGet(uri);
    request.setHeader("Ocp-Apim-Subscription-Key", "XXXXXXXX");
    HttpResponse response = httpclient.execute(request);
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        System.out.println(EntityUtils.toString(entity));
    }
} catch (Exception e) {
    System.out.println(e.getMessage());
}
EDIT: It turns out the URI returned in the request object returns this:
https://api.cognitive.microsoft.com/bing/v5.0/spellcheck/?text=my+funger+is+harting+me
So the parameter is not empty? But when given no text parameter in hurl.it, the API returns a "no parameters" error. When the text parameter is a single space " ", it returns a result identical to mine. I'm unsure what this means, since the URI seems to be valid and not empty, and my subscription key is working because I would get an error if it weren't...
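One way to rule the encoding in or out is to build the query string by hand in both common styles and send each one. The sketch below uses only java.net.URLEncoder. It does not claim that the "+" form is the cause of the empty result, it only makes the two variants easy to compare. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Produces the two common encodings of a query parameter value. */
class QueryEncodingCheck {

    /** Form-style encoding: spaces become '+'. */
    static String formEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Percent-style encoding: like formEncode, but spaces become %20. */
    static String percentEncode(String text) {
        // A literal '+' in the input is already encoded as %2B, so every
        // remaining '+' stands for a space and can be replaced safely.
        return formEncode(text).replace("+", "%20");
    }

    public static void main(String[] args) {
        String base = "https://api.cognitive.microsoft.com/bing/v5.0/spellcheck/?text=";
        String text = "my funger is harting me";
        System.out.println(base + formEncode(text));
        System.out.println(base + percentEncode(text));
    }
}
```

If both variants return the same empty flaggedTokens list, the encoding of the text parameter is probably not the problem and the difference from hurl.it lies elsewhere in the request.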
EDIT: I'm starting to suspect the Apache library is ignoring the parameters I'm passing in HttpGet(uri). I'm unsure, but I'm going to try a different solution to send the request with a header and see what happens.
EDIT: I tried the following code below:
String url = "https://api.cognitive.microsoft.com/bing/v5.0/spellcheck/?text=" + command.getAfter().replace(" ", "+");
try {
    URL request_url = new URL(url);
    //URIBuilder uri = new URIBuilder("https://api.cognitive.microsoft.com/bing/v5.0/spellcheck/");
    //uri.setParameter("text", command.getAfter());
    HttpURLConnection con = (HttpURLConnection) request_url.openConnection();
    con.setRequestMethod("GET");
    con.setRequestProperty("Ocp-Apim-Subscription-Key", Keys.BING_SPELL_CHECK_API);
    con.setConnectTimeout(100000);
    con.setReadTimeout(100000);
    con.setInstanceFollowRedirects(true);
    String theString = IOUtils.toString(con.getInputStream(), "UTF-8");
    System.out.println(theString);
} catch (IOException e) {
    e.printStackTrace();
}
It returned the same result as the Apache one... :/ What else should I try?
EDIT:
This is the output of the request as well:
https://api.cognitive.microsoft.com/bing/v5.0/spellcheck/?text=my+funger+is+hartingme - [Ocp-Apim-Subscription-Key: <XXXXXXXXXXXX>]
HTTP/1.1 200 OK - en_US
{"_type": "SpellCheck", "flaggedTokens": []}
I don't get it.... Why is the json outputted empty when hurl.it returns the correct json for this same request? Is this a java issue or something?
EDIT:
I just tried UniRest's api. Exact same result... What am I doing wrong here?!
I'm so lost...
Separate Issue:
I do want to note the following: When I set the bing api's version to 7.0, I get the following error:
Received http status code 401 with message Access Denied and body {"message":"Access denied due to invalid subscription key. Make sure to provide a valid key for an active subscription.","statusCode":401}
This is not the case with v5.0. I'm getting the correct key from my Azure portal. (The page is called Keys and lists two keys that you can use and regenerate.)
Answer to getting v7.0 to work:
You wrote: "This is not the case with v5.0. I'm getting the correct key from my Azure portal. (The page called Keys and lists 2 keys you can use and regenerate)"
You get two keys per version. So if you are seeing only two keys, they are likely both for v5.0; keys for v7.0 should be explicitly labeled as v7.0.
There should be a separate section for each version, each with its own endpoint.
Use the key and the endpoint from the same section together to get the desired result.
I am trying to send a request using jsoup with manually inserted cookies for the purpose of detecting SQL injection vulnerability.
The problem seems to be that only one of the cookies works, and I don't understand why.
I first authenticate myself manually and get the cookies. Example:
PHPSESSID : b74302c3c6af62d23047a450a40cbf5a
security : high
After I get the cookies, I send my request (which would look like this from the browser: http://localhost:8090/dvwa/vulnerabilities/sqli/?id='&Submit=Submit#) using the same PHPSESSID but with security : low. The purpose is to force a "You have an error in your SQL syntax" response, which signals an SQL injection vulnerability. The PHPSESSID is received correctly (I retrieve the dvwa/vulnerabilities/sqli page and not the login page, so the session is recognized as valid after authentication), but "security : low" seems not to work. I can't find the problem.
The jsoup code for the initial connection, which lets me parse the forms on the page, looks like this. I supply the cookies manually.
Connection connection = Jsoup.connect(urlDTO.getUrl())
        .userAgent(StringConstants.USER_AGENT)
        .cookies(cookies) // Map<String,String>
        .referrer(StringConstants.REFERRER);
Document htmlDocument = connection.get();
For sending the form I use this code:
Connection connection = Jsoup.connect(formDTO.getUrl())
        .userAgent(StringConstants.USER_AGENT)
        .cookies(cookies)
        .data(listToMap(formDTO.getInputList())) // id = ' , Submit = Submit
        .method(getMethod(formDTO.getMethod()))
        .referrer(StringConstants.REFERRER);
Connection.Response res = connection.execute();
Document doc = res.parse();
Does anyone know what I'm doing wrong?
After much debugging I found the origin of the problem and the odd behavior. There was no problem with the cookies, headers, or URL; the problem was in .method(). The default value of .method() is Method.GET, but since I was sending dynamic requests I had to determine the method dynamically as well. To do that, I parsed the forms to get each form's method and then passed the matching type when constructing the connection.
if (method.equals("post")) {
    return Method.POST;
}
if (method.equals("get")) {
    return Method.GET;
}
return Method.POST;
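This had worked until now, and surprisingly, even when the wrong method was sent the server still returned an almost valid response, so I overlooked it.
Here is the fix: compare the method case-insensitively and fall back to GET instead of POST.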
if (method.toLowerCase().equals("post")) {
    return Method.POST;
}
if (method.toLowerCase().equals("get")) {
    return Method.GET;
}
return Method.GET;
It was my mistake and not really a jsoup problem, but since I overlooked it, others might too, so here is a reminder.
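A slightly more defensive variant of the same idea is sketched below. It also tolerates a missing method attribute and stray whitespace, and falls back to GET, which is the HTML default for forms. The Method enum here is a hypothetical stand-in for org.jsoup.Connection.Method so that the sketch runs without jsoup, and FormMethodParser is a made-up name.

```java
import java.util.Locale;

/** Maps the raw "method" attribute of an HTML form to a request method. */
class FormMethodParser {

    /** Hypothetical stand-in for org.jsoup.Connection.Method. */
    enum Method { GET, POST }

    /**
     * Returns POST only when the attribute says so, ignoring case and
     * surrounding whitespace. Anything else, including null or an empty
     * string, falls back to GET.
     */
    static Method parse(String rawMethod) {
        if (rawMethod == null) {
            return Method.GET;
        }
        String normalized = rawMethod.trim().toUpperCase(Locale.ROOT);
        return "POST".equals(normalized) ? Method.POST : Method.GET;
    }

    public static void main(String[] args) {
        System.out.println(parse("POST"));
        System.out.println(parse(" post "));
        System.out.println(parse(null));
    }
}
```

Using Locale.ROOT keeps the comparison independent of the machine's default locale, which matters for case conversion in some languages.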
I want to scrape the site that Tumblr redirects to when you try to open a Tumblr page that doesn't exist. If I put the URL in the browser, I get to that redirected site. Jsoup, however, just gives back an "HTTP error fetching URL. Status=404" error. Any suggestions?
String userAgent = "Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6";
Document doc = Jsoup.connect("http://www.faszokvagyunk.tumblr.com").userAgent(userAgent).followRedirects(true).get();
Thank you.
Your code seems to handle other types of redirects just fine. With Tumblr, however, you get a 404 page with a 404 status, hence the exception. There could be several reasons for this:
The redirect might not happen at all
Tumblr redirects in an unusual way
Tumblr unnecessarily returns 404, which causes the exception
Other possibilities
I don't know whether this will help you, but you can instruct your jsoup connection to ignore HTTP errors by chaining ignoreHttpErrors as follows (this at least allows you to inspect the HTTP error):
Document doc = Jsoup.connect("http://oddhouredproductivity.tumblr.com/tagged/tips").userAgent(userAgent).followRedirects(true).ignoreHttpErrors(true).get();
ignoreHttpErrors instructs the connection not to throw an exception when it receives an error status code such as 404 or 500.
Connection ignoreHttpErrors(boolean ignoreHttpErrors)
Configures the
connection to not throw exceptions when a HTTP error occurs. (4xx -
5xx, e.g. 404 or 500). By default this is false; an IOException is
thrown if an error is encountered. If set to true, the response is
populated with the error body, and the status message will reflect the
error.
Parameters: ignoreHttpErrors - false (default) if HTTP errors should throw an exception; true if they should be ignored.
Returns: this Connection, for chaining
If you set ignoreHttpErrors to true, get() returns a Document built from the body of the error page. If you leave it at false, get() does not return a Document at all; it throws an HttpStatusException.
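The idea behind ignoreHttpErrors, reading the body of an error response instead of failing on it, can be sketched without jsoup. The example below swaps in the JDK's HttpURLConnection and a hypothetical local server that always answers 404 with an HTML body; it is an analogy, not jsoup's implementation. With jsoup itself, the corresponding calls would be execute() on the connection and statusCode() on the returned Connection.Response.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Reads status and body of a 404 response from a local stand-in server. */
class ErrorBodyDemo {

    /** Made-up error page served by the local server. */
    static final String NOT_FOUND_PAGE = "<html><title>Not found</title></html>";

    /** Returns "<status> <body>" for a request that the server answers with 404. */
    static String fetchStatusAndBody() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                byte[] body = NOT_FOUND_PAGE.getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(404, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            try {
                URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/missing");
                HttpURLConnection con = (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
                int status = con.getResponseCode();
                // For 4xx/5xx responses getInputStream() throws; the body is on the
                // error stream. This server always sends a body, so it is not null here.
                try (InputStream in = status >= 400 ? con.getErrorStream() : con.getInputStream()) {
                    return status + " " + new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchStatusAndBody());
    }
}
```

Checking the status code first and then deciding what to do with the body is usually safer than ignoring errors blindly, because a 404 page can otherwise be mistaken for real content.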
I also came across this site, which might demonstrate an actual Tumblr redirect. You might want to test with the URLs on that page, as they are proper Tumblr redirects. If you look inside the document retrieved for this page, you see a JavaScript redirect function that triggers after 3 seconds, as follows:
//redirect to new blog
setTimeout( redirectTumblr, 3000 );
function redirectTumblr() {
    location.replace('http://oddhour.tumblr.com' + location.pathname);
}
When I connect to the URL given in your question, I see a 404 page, and the content of that page as returned in the Document contains no sign of a redirect (unlike the other page).
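Because jsoup does not execute JavaScript, a script-based redirect like the location.replace call above is never followed automatically. A rough sketch of pulling the target out of the page source with a regular expression follows. The class name and the pattern are assumptions made for illustration, and the pattern only covers the simple case where the first argument starts with a quoted string literal.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Finds the base URL passed to location.replace(...) in a page's script text. */
class JsRedirectFinder {

    // Matches location.replace('<url>' or location.replace("<url>"
    // and captures the quoted string literal.
    private static final Pattern LOCATION_REPLACE =
            Pattern.compile("location\\.replace\\(\\s*(['\"])(.*?)\\1");

    /** Returns the quoted URL if the script contains a matching call. */
    static Optional<String> findRedirectBase(String html) {
        Matcher m = LOCATION_REPLACE.matcher(html);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String script = "setTimeout( redirectTumblr, 3000 );\n"
                + "function redirectTumblr() {\n"
                + "location.replace('http://oddhour.tumblr.com' + location.pathname);\n"
                + "}";
        System.out.println(findRedirectBase(script).orElse("no redirect found"));
    }
}
```

The extracted base URL plus the path of the original request can then be fetched with a second Jsoup.connect(...) call. A page without such a script, like the 404 page from the question, simply yields an empty Optional.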
This happens specifically with amazon.com. I am receiving a 503 error for their domain, but I can successfully parse other domains.
I am using the line
Document doc = Jsoup.connect(url).timeout(30000).get();
to connect to the URL.
You have to set a User Agent:
Document doc = Jsoup.connect(url).timeout(30000).userAgent("Mozilla/17.0").get();
(Other values work too; a real browser's user agent string is the best choice.)
Otherwise you will get blocked.
Please see also: Jsoup: select(div[class=rslt prod]) returns null when it shouldn't
You can try:
val ret = Jsoup.connect(url)
    .userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
    .timeout(2 * 1000)
    .followRedirects(true)
    .maxBodySize(1024 * 1024 * 3) // 3 MB max
    //.ignoreContentType(true) // for downloading XML, JSON, etc.
    .get()
This might work; amazon.com may need followRedirects set to true (which is already jsoup's default).
I am getting a 404 error when using Jsoup. The call is Document doc = Jsoup.parse(url, 30000) and the URL string is http://www.myland.co.il/%D7%9E%D7%97%D7%A9%D7%91-%D7%94%D7%A9%D7%A7%D7%99%D7%94
and the URL displays fine in Chrome. The error I am getting is java.io.IOException: 404 error loading URL http://www.myland.co.il/vmchk/××ש×-×שק××
Any ideas?
Don't use the parse() method for fetching websites; use connect() instead, which lets you configure more connection settings.
final String url = "http://www.myland.co.il/%D7%9E%D7%97%D7%A9%D7%91-%D7%94%D7%A9%D7%A7%D7%99%D7%94";
Document doc = Jsoup.connect(url).get();
However, the real problem is the URL encoding:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.myland.co.il/vmchk/××ש×-×שק××
Even decoding the URL back to UTF-8 doesn't solve this.
Do you have an "alternative" url?
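To check that the percent-encoded path itself is well-formed, it can be decoded with java.net.URLDecoder. The sketch below decodes the path from the question once as UTF-8 and once as ISO-8859-1. Reading the bytes in a single-byte charset is one plausible source of the garbled URL in the exception, but that is an assumption, not something the stack trace proves. Note also that the URL in the exception contains /vmchk/, which is not in the requested URL, so a redirect appears to be involved. The class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Decodes a percent-encoded path with different charsets for comparison. */
class PathDecodingCheck {

    /** Path segment taken from the URL in the question. */
    static final String ENCODED_PATH =
            "%D7%9E%D7%97%D7%A9%D7%91-%D7%94%D7%A9%D7%A7%D7%99%D7%94";

    /** Decodes percent escapes, interpreting the resulting bytes in the given charset. */
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Decoded as UTF-8, every two-byte sequence becomes one Hebrew letter.
        String utf8 = decode(ENCODED_PATH, "UTF-8");
        // Decoded as ISO-8859-1, every byte becomes its own character.
        String latin1 = decode(ENCODED_PATH, "ISO-8859-1");
        System.out.println("UTF-8 length:      " + utf8.length());
        System.out.println("ISO-8859-1 length: " + latin1.length());
    }
}
```

The UTF-8 result has 10 characters (nine Hebrew letters and a hyphen), while the ISO-8859-1 result has 19, each Hebrew letter having turned into two unrelated characters. So the encoded URL from the question is valid UTF-8, and any corruption would have to happen later, for example while handling a redirect target.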
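Try decoding the URL before connecting. Java's String has no decodeURL() method; java.net.URLDecoder.decode is the standard equivalent:
String url = "http://www.myland.co.il/%D7%9E%D7%97%D7%A9%D7%91-%D7%94%D7%A9%D7%A7%D7%99%D7%94";
// URLDecoder.decode(String, String) throws the checked UnsupportedEncodingException
Document doc = Jsoup.connect(URLDecoder.decode(url, "UTF-8")).get();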