Jsoup - how many KB it takes? - java

I have for example:
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
Document doc2 = Jsoup.connect("http://example.com")
.data("query", "Java")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.post();
Response response1 = Jsoup.connect("http://example.com").method(Connection.Method.GET)
.data("name", "peksak")
.execute();
How can I check, for each of these three examples, how many KB were downloaded from the internet?
I use this in my Android application, over a Wi-Fi connection on my phone.

Well, you can simply measure the length of the document. Note that String.length() counts characters, not bytes, so encode the string first to get the byte count:
Log.e("TAG", "Document size is " + doc.outerHtml().getBytes(StandardCharsets.UTF_8).length);
With the third example:
Log.e("TAG", "Document size is " + response1.body().getBytes(StandardCharsets.UTF_8).length);

Then you can make an HTTP HEAD request for each image to get its size. Just grab all the images on the page using jsoup, iterate through them, and make the request.
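Putting that together, here is a sketch of summing the HTML size plus each image's Content-Length via HEAD requests (the URL is a placeholder, and images whose server omits Content-Length are simply skipped):

```java
Document doc = Jsoup.connect("http://example.com/").get();
long totalBytes = doc.outerHtml().getBytes(StandardCharsets.UTF_8).length;

for (Element img : doc.select("img[src]")) {
    String src = img.absUrl("src"); // resolve relative URLs against the page
    if (src.isEmpty()) continue;
    Connection.Response res = Jsoup.connect(src)
            .ignoreContentType(true)
            .method(Connection.Method.HEAD) // headers only, no body download
            .execute();
    String len = res.header("Content-Length");
    if (len != null) totalBytes += Long.parseLong(len); // skip when absent
}
Log.e("TAG", "Approximate page weight: " + (totalBytes / 1024) + " KB");
```

This only approximates the real transfer size, since it ignores HTTP headers, compression, and resources other than images (CSS, scripts).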

Related

Google queries with jsoup returns 429

I'm trying to query Google using Jsoup. Unfortunately, after about 300 queries I now get the error shown below. I use the following code snippet. How can I fix the problem?
String request = "http://www.google.com/search?q=" + query + "&num=1";
System.out.println("Sending request..." + request);
Document doc = Jsoup
.connect(request)
.userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.timeout(5000).get();
HTTP error fetching URL. Status=429 (Too many requests)
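Status 429 means the server thinks requests are arriving too quickly. A common mitigation (not a guaranteed fix; Google actively blocks scraping, and spoofing the Googlebot user agent likely makes it worse) is to throttle and back off when a 429 arrives, sketched here:

```java
// Sketch: retry with exponential backoff when the server answers 429.
static Document fetchWithBackoff(String request) throws Exception {
    long backoffMs = 2000;
    for (int attempt = 0; attempt < 5; attempt++) {
        try {
            return Jsoup.connect(request)
                    .userAgent("Mozilla/5.0") // a plain browser UA, not Googlebot
                    .timeout(5000)
                    .get();
        } catch (HttpStatusException e) {
            if (e.getStatusCode() != 429) throw e; // only back off on rate limiting
            Thread.sleep(backoffMs);
            backoffMs *= 2; // 2s, 4s, 8s, ...
        }
    }
    throw new IllegalStateException("still rate-limited after retries");
}
```

For sustained query volumes, Google's official Custom Search API is the supported route.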

How do I solve HTML Error 500 in Jsoup Java

I have two parts running perfectly fine.
First I connect to the website and put my login name and password into the form.
I get the cookies and store them.
Connection.Response login = Jsoup.connect("website")
.data("name", "name")
.data("password", "password")
.method(Connection.Method.POST)
.execute();
Map<String, String> cookies = login.cookies();
This works just fine, and so does the next connect:
Document doc1 = Jsoup.connect("website/subpages")
.cookies(cookies)
.get();
doc1 works perfectly and I can get its text with:
String pages1 = doc1.toString();
But on my last request, I get Server Error 500:
Document pages2 = Jsoup.connect("website/anothersubpage")
.cookies(cookies)
.get();
I guess the problem is that the last URL, "website/anothersubpage", is not a fixed URL.
Each time I log in I get a new session key (which is the cookie I store in cookies), and the URL of the subpage changes.
After thinking about it, I parsed the whole page into a String and used substring to extract the variable URL.
String newLink = text.substring(text.indexOf("Start href"),text.indexOf("End href"));
It worked, so I stored the link (the href value) from the website in the String newLink.
But now if I use the same code as before:
Document pages2 = Jsoup.connect(newLink) // my parsed href link
.cookies(cookies)
.get();
I still get error 500. I have tried so many things, but I haven't been able to get it to work for three days now.
I am really grateful for every suggestion or tip.
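One thing worth trying: instead of carving the href out with substring(), let Jsoup extract it from the already-parsed page, and use absUrl() so a relative href becomes an absolute URL. A sketch, where the selector is hypothetical and must be adjusted to whatever identifies the link on the actual page:

```java
Document doc1 = Jsoup.connect("website/subpages")
        .cookies(cookies)
        .get();

// Hypothetical selector: first anchor whose href contains "subpage".
Element link = doc1.selectFirst("a[href*=subpage]");
if (link != null) {
    String newLink = link.absUrl("href"); // resolves a relative href to a full URL
    Document pages2 = Jsoup.connect(newLink)
            .cookies(cookies)
            .get();
}
```

A substring-extracted link often keeps a relative path or trailing markup, either of which can send the server an invalid request; absUrl() avoids the first problem and the selector avoids the second.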

Java - Jsoup HTTP error fetching URL. Status=405

I'm trying to connect and retrieve the page title from here. The code works fine if I remove everything after ".com" from the link. The following code does not work:
try {
Document doc = Jsoup.connect("https://news.google.com/news/local/section/geo/Sammamish,%20WA%2098075,%20United%20States/Sammamish,%20Washington?ned=us&hl=en")
.data("query", "Java")
.userAgent("Chrome")
.cookie("auth", "token")
.timeout(3000)
.post();
String title = doc.title();
Log.d("hellomate", title);
}
catch (IOException e) {
Log.d("hellomatee", e.toString());
}
If the code worked, the title returned should be "Sammamish Washington - Google News".
The error returned from the code is: "org.jsoup.HttpStatusException: HTTP error fetching URL. Status=405, URL=https://news.google.com/news/local/section/geo/Sammamish,%20WA%2098075,%20United%20States/Sammamish,%20Washington?ned=us&hl=en"
What does status 405 mean? Does Jsoup not allow the kind of URL I used?
Thanks.
Status 405 is an HTTP error code that means "Method Not Allowed". You can find some documentation from Microsoft on it here. As @Andreas said, you can fix this by changing .post(); to .get();.
If you look at the jsoup docs under Examples, they show how you would probably want to structure your requests:
Jsoup.connect("http://en.wikipedia.org/").get();
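Applied to the question's code, the only change needed is the final call (a sketch; the .data and .cookie calls from the original are likely unnecessary for this page and are dropped):

```java
Document doc = Jsoup.connect("https://news.google.com/news/local/section/geo/Sammamish,%20WA%2098075,%20United%20States/Sammamish,%20Washington?ned=us&hl=en")
        .userAgent("Chrome")
        .timeout(3000)
        .get(); // GET instead of POST avoids the 405
String title = doc.title();
```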

getting invalid json from twitter api

I'm trying to get data from the Twitter API with Java, using the jsoup and json-simple libraries.
Document doc = Jsoup.connect("https://api.twitter.com/1.1/search/tweets.json")
.header("Authorization", "Bearer " + token)
.header("charset", "utf-8")
.data("q", q)
.data("count", "2")
.data("max_id", currentStartId)
.ignoreContentType(true)
.get();
Then I receive some JSON. But when I try to parse it
String response = doc.text();
JSONObject requestObj = (JSONObject) parser.parse(response);
I get this error:
Exception in thread "main" Unexpected character (\) at position 3535.
at org.json.simple.parser.Yylex.yylex(Yylex.java:610)
at org.json.simple.parser.JSONParser.nextToken(JSONParser.java:269)
at org.json.simple.parser.JSONParser.parse(JSONParser.java:118)
at org.json.simple.parser.JSONParser.parse(JSONParser.java:81)
at org.json.simple.parser.JSONParser.parse(JSONParser.java:75)
At position 3535 the JSON contains:
"description":""\u0412\u0435\u0434\u043e\u043c\u043e\u0441\u0442\u0438". \u0415\u0436\u0435\u0434\u043d\u0435\u0432\u043d\u0430\u044f \u0434\u0435\u043b\u043e\u0432\u0430\u044f \u0433\u0430\u0437\u0435\u0442\u0430"
You shouldn't be using Jsoup, as it's designed for parsing and cleaning HTML pages. It's unlikely that whatever it spits out is useful JSON for you to process.
https://jsoup.org/apidocs/org/jsoup/Jsoup.html#connect-java.lang.String-
Use to fetch and parse a HTML page.
As the comment above suggests, you should use Twitter4J for this instead, or simply process the JSON directly after fetching it with URLConnection or OkHttp.
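If you do keep Jsoup for the fetch, note that doc.text() runs the payload through the HTML parser and its whitespace normalization, which can mangle escaped sequences like the \u0412 characters above. Reading the raw response body sidesteps that (a sketch, reusing the question's variables):

```java
String body = Jsoup.connect("https://api.twitter.com/1.1/search/tweets.json")
        .header("Authorization", "Bearer " + token)
        .data("q", q)
        .data("count", "2")
        .data("max_id", currentStartId)
        .ignoreContentType(true)
        .execute()  // returns the raw Connection.Response
        .body();    // the JSON string, untouched by the HTML parser
JSONObject requestObj = (JSONObject) parser.parse(body);
```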

JSOUP throws url status 503 in Eclipse but URL works fine in browser

Specifically, this happens with the website amazon.com. I am receiving a 503 error for their domain, but I can successfully parse other domains.
I am using the line
Document doc = Jsoup.connect(url).timeout(30000).get();
to connect to the URL.
You have to set a User Agent:
Document doc = Jsoup.connect(url).timeout(30000).userAgent("Mozilla/17.0").get();
(Or another one; it's best to choose a real browser user agent.)
Otherwise you'll get blocked.
Please see also: Jsoup: select(div[class=rslt prod]) returns null when it shouldn't
You can try:
val ret=Jsoup.connect(url)
.userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
.timeout(2*1000)
.followRedirects(true)
.maxBodySize(1024*1024*3) //3Mb Max
//.ignoreContentType(true) //for download xml, json, etc
.get()
This may work; amazon.com may also need followRedirects set to true.
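The snippet above is Kotlin; for a Java project, the same settings look like this (a sketch, with the user agent string just an example):

```java
Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
        .timeout(30000)
        .followRedirects(true)
        .maxBodySize(3 * 1024 * 1024) // 3 MB max
        .get();
```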
