Fake being a browser to avoid a 403 error - java

I'm using JSoup to connect to a webpage and scrape data from it, but it's giving me an HTTP 403 response (unlike my browser, which loads it successfully). From what I understand, I have to make my scraper pretend that it's a browser. I've tried two things: setting the user agent to a browser's, and setting the referrer to the same website (both of which I found by browsing Stack Overflow). I still, however, get a 403. My code looks like this (I know the browser is old, I just copy-pasted it; surely that shouldn't matter?):
Document doc = Jsoup.connect("http://www.website.com/subpage/")
.userAgent("\"User-Agent\", \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11\"")
.referrer("http://www.website.com")
.get();
What else do I need to do to trick the server into thinking that I'm a browser?

Since you can load the page successfully (a 200?) with your browser, you can use that information to create a Jsoup connection.
Open up your browser's network tab in the developer tools, have a look at the request, and imitate it. For example, a request to this page looks like this:
Host: stackoverflow.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: sv-SE,sv;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded
X-Requested-With: XMLHttpRequest
Referer: http://stackoverflow.com/questions/37134906/fake-being-a-browser-to-avoid-a-403-error
Content-Length: 263
Cookie: x; prov=x; acct=t=wx
DNT: 1
Connection: keep-alive
All these have corresponding Jsoup methods. This should be enough to fool the web server.
If you are still experiencing trouble, log the actual request to verify that it is sent as expected.
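For example, a minimal sketch of imitating such a request with Jsoup (the URL is the placeholder from the question; Host, Content-Length, and Connection are handled by Jsoup itself, any Cookie value has to come from your own session, and the Accept value below is a typical browser one for an HTML page, since the capture above happened to be an XHR):
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Connection conn = Jsoup.connect("http://www.website.com/subpage/")
    .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0")
    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    .header("Accept-Language", "sv-SE,sv;q=0.8,en-US;q=0.5,en;q=0.3")
    .header("DNT", "1")
    .referrer("http://www.website.com");
Document doc = conn.get();
// Log what was actually sent, to compare against the browser's request:
System.out.println(conn.request().headers());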

There are several ways to distinguish web browsers from robot user agents. One possibility that comes to mind is checking for the Accept header content.
I suggest that you use Firefox developer tools to inspect your requests and add headers/cookies to your scraper application.
Additionally, you can use a packet sniffer (ngrep, Wireshark) and compare your requests with the requests of a real browser session to determine what signals are used.

A web server may return a 403 Forbidden HTTP status code to indicate that it received and understood the request but refuses to take any further action. A 403 response is the result of the web server being configured to deny the client access to the requested resource for some reason.
It works in the browser because the browser may be sending additional headers or cookies.
Check which headers or other parameters are required using Fiddler or your browser's developer tools, and set those values in Jsoup; that should resolve your issue.

Related

Can't login to IIS login prompt using JSoup

I am trying to use JSoup to log in to a website running on a Microsoft-IIS/7.5 server powered by ASP.NET.
Whenever I open the website in a browser, it shows a login prompt.
JSoup throws org.jsoup.HttpStatusException: HTTP error fetching URL. Status=401 when establishing a connection. I know little about web servers or how this login process works; I am just a Java developer, and I ran into this issue while building an app that downloads some data from this website.
The answers here that use basic authentication did not solve my problem; it keeps throwing the same exception.
Some servers refuse to serve content for requests that do not carry typical browser headers.
Try setting some headers, example values:
Document doc = Jsoup.connect(loginPageUrl)
.userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36")
.header("Accept", "text/html,application/xhtml+xml text/html,application/xhtml+xml")
.header("Accept-Language", "en-US")
.header("Accept-Encoding", "gzip, deflate")
.header("Referer", loginPageUrl)
.get();
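If the 401 persists, note that a browser login prompt usually indicates HTTP authentication rather than a form login. Assuming it is Basic auth (IIS may instead use Windows/NTLM authentication, which Jsoup cannot negotiate on its own), a minimal sketch of sending credentials ("user" and "pass" are placeholders):
import java.nio.charset.StandardCharsets;
import java.util.Base64;

String credentials = Base64.getEncoder()
        .encodeToString("user:pass".getBytes(StandardCharsets.UTF_8)); // placeholder credentials
Document doc = Jsoup.connect(loginPageUrl)
        .header("Authorization", "Basic " + credentials) // HTTP Basic auth header
        .get();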

Cookie session using HttpURLConnection

I'm developing an application that logs in to a website. I have a problem: when I read the browser's request headers, there is a cookie that the browser sends. I need to know how I can do that in my application, i.e., have the connection define the request cookies by itself when it starts. I tried to use CookieHandler.setDefault( new CookieManager( null, CookiePolicy.ACCEPT_ALL ) ); but it didn't work.
Source:
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
String parameters = "..."; // URL-encoded form data (placeholder; not shown in the question)
URL url2 = new URL("https://m.example.com.br/login.jhtml");
HttpURLConnection conn = (HttpURLConnection) url2.openConnection();
conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
conn.setRequestMethod("POST");
// (note the value below repeats "User-Agent: ", which is why the dump further down shows it twice)
conn.setRequestProperty("User-Agent", "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0");
// HttpURLConnection normally computes Content-Length itself and ignores this
conn.setRequestProperty("Content-Length", Integer.toString(parameters.getBytes().length));
conn.setInstanceFollowRedirects(true); // per-connection redirect handling
conn.setDoInput(true);
conn.setDoOutput(true);
conn.setUseCaches(false);
DataOutputStream wr = new DataOutputStream(conn.getOutputStream());
wr.writeBytes(parameters);
wr.flush();
wr.close();
if (conn.getResponseCode() == 200) {
    InputStream in = conn.getInputStream();
    BufferedReader rd = new BufferedReader(new InputStreamReader(in));
    String line = null;
    StringBuffer response = new StringBuffer();
    while ((line = rd.readLine()) != null) {
        response.append(line);
        response.append('\r');
    }
    rd.close();
    System.out.println(response.toString());
}
Request Header of my application:
Content-Type: application/x-www-form-urlencoded
User-Agent: User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Connection: Keep-Alive
Accept-Encoding: gzip
Cookie: TS0163e05c="01ed0a5ec20a04efb37decf4185e55cfe68e06164c32f1a95d1d5b8f12c72abbee029ed64985c09681a55832e444c61821a1eb6fb22d6ed9880314fa0c342074316e309642";$Path="/";$Domain="example.com"; ps-website-switching-v2=%7B%22ps-website-switching%22%3A%22ps-website%22%7D; TS015a85bd=01ed0a5ec25aecf271e4e08c02f852e9ea6199a117a0a8e0339b3e98fd1d51518e5f09ead481039d4891f66e9cc48a13ced14792de
Content-Length: 198
Request Header of Browser:
Host: m.example.com
Connection: keep-alive
Content-Length: 197
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Linux; Android 5.0.2; LG-D337 Build/LRX22G) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.137 Mobile Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: _ga=GA1.3.313511484.1517525889; _gid=GA1.3.507266479.1517525889; DEretargeting=563; CSASF=; JS_SESS=; BT=%3B106%3B; DN....
Pay attention to the cookies: why are they so different? What can I do to send cookies like the browser does, without setting them manually via conn.setRequestProperty("Cookie", cookie);?
HttpURLConnection is not a very reliable way to scrape or interact with websites, for the following reasons:
HttpURLConnection doesn't understand JavaScript. JavaScript can set cookies as well as provide major parts of functionality of the website.
HttpURLConnection doesn't download all resources associated with a page, like other .html files in frames, images (0 px images can sometimes also add cookies, but if you never get them, you'll never get the cookie), JavaScript, and so on.
CookieHandler only works for cookies that are passed to you directly in the HTTP Response Headers. If anything within the content of the site (including embedded content like images) would cause more cookies to be created, you're not getting them with CookieHandler, because it doesn't understand HTML/JS/etc.
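To make that last point concrete, here is a minimal sketch (using the login URL from the question): cookies delivered via Set-Cookie response headers land in the CookieManager's store, but anything a page script would create never appears there:
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URL;

CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
CookieHandler.setDefault(manager);
// Make any request; HttpURLConnection consults the default CookieHandler.
new URL("https://m.example.com.br/login.jhtml").openConnection().getInputStream().close();
for (HttpCookie cookie : manager.getCookieStore().getCookies()) {
    // Only header-delivered cookies show up here.
    System.out.println(cookie.getName() + "=" + cookie.getValue());
}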
You should use Selenium instead. Selenium automates a real web browser (or at least something closer to a real web browser) that can parse HTML and behaves according to the expectations of the web standards.
As far as which browser driver (backend) to use, here are a few options:
HtmlUnit, which is perhaps the fastest driver (at least it has the least memory overhead), but with a downside that not all the latest web standards and technologies are supported. If it works with your site, it's probably the best choice because it doesn't require any native driver component. It does support a fair chunk of JavaScript (see the link), but with perhaps less up-to-date feature support compared to the latest Firefox or Chrome. HtmlUnit is headless.
PhantomJS, which is based on a fairly dated version of WebKit. Your web standards support will be current as of about 2013, which is probably fine for most websites, but some cutting-edge features won't work. It also has a fair number of bugs with certain types of content. However, it's also headless, has a pretty big user base, and is generally lower overhead than a full-blown browser.
Firefox, Chrome, Edge, Opera, or IE. Firefox and Chrome now have headless support as an option.
The difference between "headless" and "not headless" (or "headed", if you prefer) is that a headless browser does not create any GUI windows. If you're running on a headless Linux box, this is practically a requirement unless you want to create an Xvfb virtual X server or something. If you're running this from a computer with a graphical interface (Windows, MacOS, or desktop Linux), it's up to you if you want to see the browser pop up when you run your code.
Headless browsers do tend to be relatively faster, and you can scale out more instances of them in parallel because they aren't taking up any graphics resources on your system as you use them. They just use the browser engine itself to process the web content and allow you to access/drive it through Selenium.
If you do want headless, but you need the very latest web platform features and standards support, look into using Headless Chrome or Headless Firefox.
Headless Chrome intro: https://developers.google.com/web/updates/2017/04/headless-chrome
Headless Firefox intro: https://developer.mozilla.org/en-US/Firefox/Headless_mode
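For completeness, a minimal sketch of the Selenium route (this assumes the selenium-java dependency and a matching chromedriver binary on your PATH; the login URL is the one from your question):
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class SeleniumLoginSketch {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // omit this to watch the browser
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://m.example.com.br/login.jhtml");
            // ...fill in and submit the form with driver.findElement(...)...
            // The browser has run all scripts, so this includes JS-set cookies:
            for (Cookie cookie : driver.manage().getCookies()) {
                System.out.println(cookie.getName() + "=" + cookie.getValue());
            }
        } finally {
            driver.quit();
        }
    }
}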

Spring Boot and security: How to extend response for 302 status code?

Setup:
Spring Boot application
OAuth2 security
ReactJS for UI implementation
Use case:
Login to application
open other tab with same application in same browser
Log out of the application in one of the tabs; the user is redirected to the login view.
Go to the first tab (the user is already logged out; if I refresh the page I get the login form) and perform any action that triggers a POST/PUT/PATCH request. An example of the request and response is below:
Request:
Request URL:http://localhost:8080/api/info
Request Method:PUT
Status Code:302 Found
Remote Address:[::1]:8080
Referrer Policy:no-referrer-when-downgrade
accept:application/json
Accept-Encoding:gzip, deflate, br
Accept-Language:en-US,en;q=0.8,sv;q=0.6,ru;q=0.4,uk;q=0.2,fr;q=0.2
Cache-Control:no-cache
Connection:keep-alive
Content-Length:66
Content-Type:application/json
Cookie:_ga=GA1.1.1868465923.1505828166; _gid=GA1.1.612220229.1507272075; session=e4oSW4Kq; prod_user_session=4d6b615f-521704; user_session=g3ggLxJDomyZ
Host:localhost:8080
mode:cors
Origin:http://localhost:8080
Pragma:no-cache
Referer:http://localhost:8080/profile
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Response:
Cache-Control:no-cache, no-store, max-age=0, must-revalidate
Connection:keep-alive
Content-Length:0
Date:Fri, 06 Oct 2017 12:12:26 GMT
Expires:0
Location:http://localhost:8080/login
Pragma:no-cache
X-Content-Type-Options:nosniff
X-Frame-Options:DENY
X-XSS-Protection:1; mode=block
After this response, the system triggers a PUT request to http://localhost:8080/login and fails, because the PUT method is not allowed for http://localhost:8080/login.
Question:
I understand that I'm getting a 302 status and the Location:http://localhost:8080/login header because I'm already logged out. I want to extend the response for this case with a JSON body, or at least ensure that I get a 401 Unauthorized status code instead of a 302.
If I understood your question correctly, it sounds like you need two different responses for unauthorized requests: one for regular webpage loads (produces="text/html") and one for AJAX calls (produces="application/json"). Currently your unauthorized AJAX call gets redirected to the login page, which is a legitimate page, hence no 401 response code. For an example of a setup that accomplishes what you want with XML configs, see Spring Security Ajax login; for @Configuration-style config, see https://stackoverflow.com/a/27300215/1718213.
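As a rough sketch of the Java-config route (this is not the linked answer verbatim; the header used to detect AJAX calls below is an assumption, so match on whatever your ReactJS client reliably sends, e.g. its accept:application/json header):
import org.springframework.http.HttpStatus;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;
import org.springframework.security.web.authentication.HttpStatusEntryPoint;
import org.springframework.security.web.util.matcher.RequestHeaderRequestMatcher;

@EnableWebSecurity
public class AjaxAwareSecurityConfig extends WebSecurityConfigurerAdapter {
    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http
            .exceptionHandling()
                // AJAX requests get a bare 401 instead of the 302 redirect
                .defaultAuthenticationEntryPointFor(
                    new HttpStatusEntryPoint(HttpStatus.UNAUTHORIZED),
                    new RequestHeaderRequestMatcher("X-Requested-With", "XMLHttpRequest"))
                .and()
            .authorizeRequests().anyRequest().authenticated()
                .and()
            .formLogin(); // regular page loads keep the redirect-to-login behavior
    }
}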
Another option that is even more user-friendly is to use Spring's WebSocket support to signal logout events to all the tabs a given user might have open (across all devices and browsers) that would trigger each tab to redirect to the login page.

How to get the exact client browser name and version in Spring MVC?

I'm working on a Spring MVC application, and I need to access the client browser's name and version.
I have an instance of HttpServletRequest as a parameter in my action and use the request.getHeader("User-Agent") method, but in Internet Explorer 9 this returned Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko.
I need the exact name and version number. Are there any tools for doing that?
Acknowledging that the User-Agent header is unreliable and can be spoofed: still, in the absence of other signals, you have to parse it, which is in fact not easy, as the number of combinations is overwhelming. Unless you want to roll your own parser, I would suggest user-agent-utils:
http://www.bitwalker.eu/software/user-agent-utils
The source is available at
https://github.com/HaraldWalker/user-agent-utils
The usage is quite straightforward:
UserAgent userAgent = UserAgent.parseUserAgentString(request.getHeader("User-Agent"));
System.out.println(userAgent.getBrowser().getName() + " " + userAgent.getBrowserVersion());
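For instance, a sketch of using it inside a Spring MVC controller (class and mapping names are illustrative):
import eu.bitwalker.useragentutils.UserAgent;
import javax.servlet.http.HttpServletRequest;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.ResponseBody;

@Controller
public class BrowserInfoController {

    @GetMapping("/browser-info")
    @ResponseBody
    public String browserInfo(HttpServletRequest request) {
        // Parses the raw header into structured browser/version objects
        UserAgent userAgent = UserAgent.parseUserAgentString(request.getHeader("User-Agent"));
        return userAgent.getBrowser().getName() + " " + userAgent.getBrowserVersion();
    }
}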
Another useful library for parsing the User-Agent HTTP header is browscap-java.

Understanding Fiddler grabbed network traffic to build HTTP Client

I am trying to reverse engineer a web app (which uses a Flash object to communicate with the server). I have grabbed the network traffic via Fiddler, i.e., browsed the app using IE and captured the network traffic in Fiddler. This is the first time I am doing something like this, so I might be asking very basic questions :(
Now, I have those events/requests in Fiddler, but I am having a hard time understanding them (beyond basic HTTP requests). So I am going to post the traffic flow, then its respective traffic, and at the end the questions.
FLOW ON IE
Entered the URL website.com/app/app-subdomain/web-app
An HTML page is displayed with user & password request
After login, an HTML page is displayed with Flash object in it (the original app)
IN FIDDLER
(requests in order)
The first thing I see is a request to URL:www.website.com:443, which results in a 200 status. Fiddler shows there are no cookies or anything of the sort; only the "Clients" headers are there.
The second request I see is to URL:app/app-subdomain/web-app. However, here is the part I am confused about: in the Fiddler request, I see a cookie, which looks something like this:
GET https://www.website.com/app/app-subdomain/web-app HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: www.website.com
Cookie:
NAME-WEB-SESSION=akcxxxpxkxfaxdxccajkjumxax;
hl=us;
dxxxxcxx=0;
NAMESSO=fdxdfxdfdxcfabxxxcxxcdbfexxxxfxfxxxefxecxxaxxxxxxefxxxxxxxxfxaxx;
XSRF-TOKEN=vXXnjhHE-ptnvmYfKfQVxscHyrGrfbwxyxkGzfWZGoU
So far, the thing that confuses me is: who generated this cookie? Say I am using Apache HttpClient; would this cookie be generated by it, or do I have to do it myself? If I have to, how do I generate the values of those cookie key-value pairs?
You didn't specify what the first request was exactly, but more than likely it was just an HTTP CONNECT tunnel through which the secure traffic flows. You should NEVER see a cookie on a CONNECT tunnel. Have you cleared your browser's cookies and cache? If not, the cookie you saw was likely set on a previous visit to the site and stored in the client's cookie jar. If you have cleared the cache and cookies, that implies that something on the client (e.g. Flash) generated the cookie via some other, non-standard process.
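To answer the Apache HttpClient part directly: you do not generate session cookies yourself. Give the client a cookie store, and any Set-Cookie headers the server sends (typically on the login response) are captured and replayed automatically on later requests. A minimal sketch, assuming HttpClient 4.x and the placeholder URL from the question:
import org.apache.http.client.CookieStore;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class CookieJarDemo {
    public static void main(String[] args) throws Exception {
        CookieStore cookieStore = new BasicCookieStore();
        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build()) {
            try (CloseableHttpResponse response =
                     client.execute(new HttpGet("https://www.website.com/app/app-subdomain/web-app"))) {
                // Any Set-Cookie headers in the response are now in the store.
            }
            for (Cookie cookie : cookieStore.getCookies()) {
                System.out.println(cookie.getName() + "=" + cookie.getValue());
            }
            // A follow-up client.execute(...) to the same site sends them back automatically.
        }
    }
}
Cookies that are created client-side (e.g. by Flash or JavaScript) will not appear this way; those you would have to add to the store yourself.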
