I'm developing an application that logs in to a website. The problem is that when I read the browser's request header, there is a cookie that the browser sends. I need to know how to do the same in my application, i.e., have the connection set the request cookies by itself when I open it. I tried CookieHandler.setDefault( new CookieManager( null, CookiePolicy.ACCEPT_ALL ) ); but it didn't work.
Source:
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
URL url2 = new URL("https://m.example.com.br/login.jhtml");
HttpURLConnection conn = (HttpURLConnection) url2.openConnection();
conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
conn.setRequestMethod("POST");
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0");
conn.setRequestProperty("Content-Length", Integer.toString(parameters.getBytes().length));
conn.setInstanceFollowRedirects(true);
conn.setDoInput(true);
conn.setDoOutput(true);
conn.setUseCaches(false);

DataOutputStream wr = new DataOutputStream(conn.getOutputStream());
wr.writeBytes(parameters);
wr.flush();
wr.close();

if (conn.getResponseCode() == 200) {
    InputStream in = conn.getInputStream();
    BufferedReader rd = new BufferedReader(new InputStreamReader(in));
    String line = null;
    StringBuffer response = new StringBuffer();
    while ((line = rd.readLine()) != null) {
        response.append(line);
        response.append('\r');
    }
    rd.close();
    System.out.println(response.toString());
}
Request Header of my application:
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Connection: Keep-Alive
Accept-Encoding: gzip
Cookie: TS0163e05c="01ed0a5ec20a04efb37decf4185e55cfe68e06164c32f1a95d1d5b8f12c72abbee029ed64985c09681a55832e444c61821a1eb6fb22d6ed9880314fa0c342074316e309642";$Path="/";$Domain="example.com"; ps-website-switching-v2=%7B%22ps-website-switching%22%3A%22ps-website%22%7D; TS015a85bd=01ed0a5ec25aecf271e4e08c02f852e9ea6199a117a0a8e0339b3e98fd1d51518e5f09ead481039d4891f66e9cc48a13ced14792de
Content-Length: 198
Request Header of Browser:
Host: m.example.com
Connection: keep-alive
Content-Length: 197
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Linux; Android 5.0.2; LG-D337 Build/LRX22G) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.137 Mobile Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: _ga=GA1.3.313511484.1517525889; _gid=GA1.3.507266479.1517525889; DEretargeting=563; CSASF=; JS_SESS=; BT=%3B106%3B; DN....
Pay attention to the cookies: why are they so different? What can I do to send cookies the way the browser does, without setting them manually via conn.setRequestProperty("Cookie", cookie)?
HttpURLConnection is not a very reliable way to scrape or interact with websites, for the following reasons:
HttpURLConnection doesn't understand JavaScript. JavaScript can set cookies as well as provide major parts of functionality of the website.
HttpURLConnection doesn't download all resources associated with a page, like other .html files in frames, images (0 px images can sometimes also add cookies, but if you never get them, you'll never get the cookie), JavaScript, and so on.
CookieHandler only works for cookies that are passed to you directly in the HTTP Response Headers. If anything within the content of the site (including embedded content like images) would cause more cookies to be created, you're not getting them with CookieHandler, because it doesn't understand HTML/JS/etc.
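For what it's worth, here is a minimal sketch of the part CookieHandler does cover: once a CookieManager is installed, cookies set via Set-Cookie response headers are stored and replayed automatically on later requests to the same host (the URL below is the one from the question and merely stands in for any site):

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieManagerDemo {
    public static void main(String[] args) throws Exception {
        // Install a process-wide cookie store; HttpURLConnection will consult it.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);

        // First request: any Set-Cookie response headers land in the store.
        URL url = new URL("https://m.example.com.br/login.jhtml"); // placeholder URL
        HttpURLConnection first = (HttpURLConnection) url.openConnection();
        first.getResponseCode();

        // Inspect what was captured -- only header-based cookies show up here.
        for (HttpCookie cookie : manager.getCookieStore().get(url.toURI())) {
            System.out.println(cookie.getName() + "=" + cookie.getValue());
        }

        // Second request to the same host: the stored cookies are sent automatically,
        // no conn.setRequestProperty("Cookie", ...) needed. Cookies created by
        // JavaScript in the page never make it into this store.
        HttpURLConnection second = (HttpURLConnection) url.openConnection();
        second.getResponseCode();
    }
}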
You should use Selenium instead. Selenium automates a real web browser (or at least something closer to a real web browser) that can parse HTML and behaves according to the expectations of the web standards.
As far as which browser driver (backend) to use, here are a few options:
HtmlUnit, which is perhaps the fastest driver (at least it has the least memory overhead), but with a downside that not all the latest web standards and technologies are supported. If it works with your site, it's probably the best choice because it doesn't require any native driver component. It does support a fair chunk of JavaScript (see the link), but with perhaps less up-to-date feature support compared to the latest Firefox or Chrome. HtmlUnit is headless.
PhantomJS, which is based on a fairly dated version of WebKit. Your web standards support will be current as of about 2013, which is probably fine for most websites, but some cutting-edge features won't work. It also has a fair number of bugs with certain types of content. However, it's also headless, has a pretty big user base, and is generally lower overhead than a full-blown browser.
Firefox, Chrome, Edge, Opera, or IE. Firefox and Chrome now have headless support as an option.
The difference between "headless" and "not headless" (or "headed", if you prefer) is that a headless browser does not create any GUI windows. If you're running on a headless Linux box, this is practically a requirement unless you want to create an Xvfb virtual X server or something. If you're running this from a computer with a graphical interface (Windows, MacOS, or desktop Linux), it's up to you if you want to see the browser pop up when you run your code.
Headless browsers do tend to be relatively faster, and you can scale out more instances of them in parallel because they aren't taking up any graphics resources on your system as you use them. They just use the browser engine itself to process the web content and allow you to access/drive it through Selenium.
If you do want headless, but you need the very latest web platform features and standards support, look into using Headless Chrome or Headless Firefox.
Headless Chrome intro: https://developers.google.com/web/updates/2017/04/headless-chrome
Headless Firefox intro: https://developer.mozilla.org/en-US/Firefox/Headless_mode
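For the original login problem, here is a rough sketch of the Selenium route using the HtmlUnit driver (the form field names below are guesses; inspect the real login page to find the actual ones):

import java.util.Set;

import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class SeleniumLoginSketch {
    public static void main(String[] args) {
        // true = enable JavaScript, so scripts on the page can set their own cookies.
        WebDriver driver = new HtmlUnitDriver(true);
        try {
            driver.get("https://m.example.com.br/login.jhtml");

            // Hypothetical form field names -- inspect the real login page for the actual ones.
            driver.findElement(By.name("username")).sendKeys("myUser");
            WebElement password = driver.findElement(By.name("password"));
            password.sendKeys("myPassword");
            password.submit();

            // The browser engine has collected every cookie set by response headers,
            // JavaScript, or embedded resources during the login flow.
            Set<Cookie> cookies = driver.manage().getCookies();
            for (Cookie cookie : cookies) {
                System.out.println(cookie.getName() + "=" + cookie.getValue());
            }
        } finally {
            driver.quit();
        }
    }
}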
Related
I am trying to use JSoup to log in to a website that runs on a Microsoft-IIS/7.5 server powered by ASP.NET.
Whenever I open the website it shows this login prompt.
JSoup throws org.jsoup.HttpStatusException: HTTP error fetching URL. Status=401 when establishing a connection. I know nothing about web servers or how this login process works; I am just a Java developer, and I ran into this issue while building an app that downloads some data from this website.
The answers here that use basic authentication did not solve my problem; it keeps throwing the same exception.
Some servers refuse to serve content to requests that don't carry any of the usual browser headers.
Try setting some headers, for example:
Document doc = Jsoup.connect(loginPageUrl)
        .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36")
        .header("Accept", "text/html,application/xhtml+xml")
        .header("Accept-Language", "en-US")
        .header("Accept-Encoding", "gzip, deflate")
        .header("Referer", loginPageUrl)
        .get();
I am trying to read a file from my website using a URL. This is my code:
URLConnection download = url.openConnection();
download.addRequestProperty(
        "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
I have set the user agent, which I thought would fix the problem. Apparently not... Perhaps I did it wrong? Need solutions :)
403 means Forbidden... are you sure you don't need some kind of basic authentication in your headers, or anything like that?
I would suggest that you first call the URL you want to check with some kind of browser-based HTTP client (Advanced Rest Client for Chrome is pretty good - https://chrome.google.com/webstore/detail/advanced-rest-client/hgmloofddffdnphfgcellkdfbfbjeloo?hl=en-US - but any other is fine). That way you can quickly figure out whether the headers or anything else in your request is wrong before fighting with your code.
I'm using JSoup to connect to a webpage and scrape data from it, but it's giving me an HTTP 403 response (unlike my browser, which loads it successfully). From what I understand, I have to make my scraper pretend that it's a browser. I've tried two things: setting the user agent to be a browser, and setting the referrer to be the same website (both of these I got by browsing StackOverflow). I still, however, get a 403. My code looks like this (I know the browser is old, I just copypasted it, surely it shouldn't matter?):
Document doc = Jsoup.connect("http://www.website.com/subpage/")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11")
        .referrer("http://www.website.com")
        .get();
What else do I need to do to trick the server into thinking that I'm a browser?
Since you can load the page successfully (a 200?) with your browser, you can use that information to create a Jsoup connection.
Open up your browser's network tab in the developer tools, have a look at the request, and imitate it. For example, a GET to this page looks like:
Host: stackoverflow.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0
Accept: application/json, text/javascript; q=0.01
Accept-Language: sv-SE,sv;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded
X-Requested-With: XMLHttpRequest
Referer: http://stackoverflow.com/questions/37134906/fake-being-a-browser-to-avoid-a-403-error
Content-Length: 263
Cookie: x; prov=x; acct=t=wx
DNT: 1
Connection: keep-alive
All of these have corresponding Jsoup methods. This should be enough to fool the web server.
If you are still experiencing trouble, log the actual request to verify that it is being sent as expected.
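As an illustration, here is a rough sketch of how a captured request like the one above maps onto Jsoup's connection builder (the URL and header values are placeholders; copy the ones from your own capture):

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserLikeFetch {
    public static void main(String[] args) throws IOException {
        // Values below are placeholders based on the header dump above;
        // replace them with whatever your own browser capture shows.
        Document doc = Jsoup.connect("http://www.website.com/subpage/")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
                .header("Accept-Language", "sv-SE,sv;q=0.8,en-US;q=0.5,en;q=0.3")
                .header("Accept-Encoding", "gzip, deflate")
                .header("DNT", "1")
                .referrer("http://www.website.com")
                .get();
        System.out.println(doc.title());
    }
}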
There are several ways to distinguish web browsers from robot user agents. One possibility that comes to mind is checking for the Accept header content.
I suggest that you use Firefox developer tools to inspect your requests and add headers/cookies to your scraper application.
Additionally you can use a packet sniffer (ngrep, wireshark) and compare your requests with the requests of a real browser session to determine what signals are used.
A web server may return a 403 Forbidden HTTP status code in response to a request from a client for a web page or resource to indicate that the server can be reached and understood the request, but refuses to take any further action. Status code 403 responses are the result of the web server being configured to deny access, for some reason, to the requested resource by the client.
It works in the browser because the browser may be sending additional headers or cookies.
Check which headers or other parameters are required using Fiddler or the browser itself, and set those values in Jsoup; that should resolve your issue.
I'm working on a Spring MVC application, and I need to access client browser name and version.
I have an instance of HttpServletRequest as a parameter in my action and use the request.getHeader("User-Agent") method, but in Internet Explorer 9 this returned Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko.
I need the exact browser name and version. Are there any tools for doing that?
Acknowledging that the user agent is unreliable: still, in the absence of other ways, you have to parse the User-Agent header, which is in fact not that easy, as the number of combinations is overwhelming. Unless you want to roll your own, I would suggest
http://www.bitwalker.eu/software/user-agent-utils
source is available at
https://github.com/HaraldWalker/user-agent-utils
The usage is quite straightforward:
UserAgent userAgent = UserAgent.parseUserAgentString(request.getHeader("User-Agent"));
System.out.println(userAgent.getBrowser().getName() + " " + userAgent.getBrowserVersion());
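If you need the numeric parts separately, here is a small sketch, assuming the eu.bitwalker user-agent-utils artifact and using the IE11 string from the question; the exact strings returned depend on the library version:

import eu.bitwalker.useragentutils.UserAgent;
import eu.bitwalker.useragentutils.Version;

public class BrowserVersionDemo {
    public static void main(String[] args) {
        // The IE11 user agent from the question; note the misleading "like Gecko" token.
        String ua = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko";
        UserAgent userAgent = UserAgent.parseUserAgentString(ua);
        Version version = userAgent.getBrowserVersion();

        System.out.println(userAgent.getBrowser().getName()); // e.g. "Internet Explorer 11"
        if (version != null) {
            System.out.println(version.getMajorVersion());     // e.g. "11"
            System.out.println(version.getMinorVersion());     // e.g. "0"
        }
    }
}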
Another useful library for parsing the User-Agent HTTP header: browscap-java
I am trying to reverse engineer a web app (which uses a Flash object to communicate with the server). I captured the network traffic via Fiddler, i.e., I browsed the app using IE and recorded the traffic in Fiddler. This is the first time I am doing something like this, so I might be asking very basic questions :(
Now, I have those events/requests in Fiddler, but I am having a hard time understanding them (beyond the basic HTTP requests). So I am going to post the traffic flow, then the corresponding requests, and at the end my questions.
FLOW ON IE
Entered the URL website.com/app/app-subdomain/web-app
An HTML page is displayed asking for user and password
After login, an HTML page is displayed with the Flash object in it (the actual app)
IN FIDDLER
(requests in order)
The first thing I see is a request to URL:www.website.com:443, which results in a 200 status. Fiddler shows there are no cookies or anything else; only the "Client" headers are there.
The second request I see is to URL:app/app-subdomain/web-app. However, here is the part that confuses me: in the Fiddler request, I see a cookie, which looks something like this:
GET https://www.website.com/app/app-subdomain/web-app HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: www.website.com
Cookie:
NAME-WEB-SESSION=akcxxxpxkxfaxdxccajkjumxax;
hl=us;
dxxxxcxx=0;
NAMESSO=fdxdfxdfdxcfabxxxcxxcdbfexxxxfxfxxxefxecxxaxxxxxxefxxxxxxxxfxaxx;
XSRF-TOKEN=vXXnjhHE-ptnvmYfKfQVxscHyrGrfbwxyxkGzfWZGoU
So far, the thing that confuses me is: who generated this cookie? Say I am using Apache HttpClient; would this cookie be generated by it, or do I have to generate it myself? If I have to, how do I generate the values of those cookie key-value pairs?
You didn't specify what the first request was exactly, but more than likely this was just a HTTP CONNECT tunnel through which secure traffic flows. You should NEVER see a cookie on a CONNECT tunnel. Have you cleared your browser's cookies and cache? If not, the cookie you saw was likely set on a previous visit to the site and stored in the client's cookie jar. If you have cleared the cache and cookies, that implies that something on the client (e.g. Flash) generated the cookie via some other, non-standard, process.
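To the "who generates it" question: an HTTP client library does not invent these cookies on its own; it only stores what earlier server responses (or, in a browser, page scripts) set, and replays them on subsequent requests. A minimal sketch with Apache HttpClient 4.x, using a placeholder URL standing in for the site from the question:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class CookieJarDemo {
    public static void main(String[] args) throws Exception {
        // A cookie store that HttpClient fills from Set-Cookie response headers
        // and consults automatically on subsequent requests.
        BasicCookieStore cookieStore = new BasicCookieStore();
        CloseableHttpClient client = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build();

        // Placeholder URL standing in for the login/app page from the question.
        try (CloseableHttpResponse response = client.execute(
                new HttpGet("https://www.website.com/app/app-subdomain/web-app"))) {
            System.out.println(response.getStatusLine());
        }

        // Anything like NAME-WEB-SESSION or XSRF-TOKEN that the server set via
        // headers now lives here; cookies created purely by Flash/JavaScript won't.
        for (Cookie cookie : cookieStore.getCookies()) {
            System.out.println(cookie.getName() + "=" + cookie.getValue());
        }
        client.close();
    }
}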