I need to validate a PDF report, so I need to get the report that is embedded in the HTML. If I read that URL using:
File file = new File("url");
or
HttpWebConnection.getResponse();
it requests the URL in a separate session, hence it cannot get the file.
Does IEDriver have something like HtmlUnit's
HttpWebConnection.getResponse()
or can somebody suggest an alternative?
Unfortunately it does not.
If you want to get the response code you will need a proxy. If you are using Java, then BrowserMob Proxy is what you need. You may also try making an XmlHttpRequest from JavaScript and getting the status code that way.
You could also stick with the method you are using right now (a separate request from Java) but pass the session cookie along with it (you can obtain the cookies from WebDriver).
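For that last approach, a rough sketch (assuming the session is cookie-based; reportUrl and driver are placeholders for your own PDF link and WebDriver instance):
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.openqa.selenium.Cookie;

// Build a Cookie header from the session WebDriver already holds
StringBuilder cookieHeader = new StringBuilder();
for (Cookie c : driver.manage().getCookies()) {
    cookieHeader.append(c.getName()).append('=').append(c.getValue()).append("; ");
}

// Request the PDF in the same session by replaying those cookies
HttpURLConnection conn = (HttpURLConnection) new URL(reportUrl).openConnection();
conn.setRequestProperty("Cookie", cookieHeader.toString());
int status = conn.getResponseCode();     // response code for validation
InputStream pdf = conn.getInputStream(); // the PDF bytes themselves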
I am trying to upload a sound to myinstants.com using Java and OkHttp.
I used the Chrome dev tools to look at what requests get made and tried to recreate them using OkHttp, but I'm failing at the login part.
Chrome dev tools tells me the login is a form post with a content-type of application/x-www-form-urlencoded.
And I try to replicate the post using the following code:
RequestBody loginBody = new FormBody.Builder()
.add("csrfmiddlewaretoken", token) //this token is comes from inside the <input> tag that is retrieved in the HTML of a normal get request to https://myinstants.com/accounts/login and is diffrent every time you load the page
.add("login", username)
.add("password", password)
.add("remember", "on")
.add("next", "/new/")
.build();
Request login = new Request.Builder()
.url("https://www.myinstants.com/accounts/login/?next=/new/")
.addHeader("cookie", CookieHandler.getCookie()) // cookie that is generated from the "set-cookie" response headers of the get request to https://myinstants.com/accounts/login
.addHeader("content-type", "application/x-www-form-urlencoded")
.post(loginBody)
.build();
Response response = new OkHttpClient().newCall(login).execute();
According to the Chrome dev tools, the response to the above POST request should have a couple of Set-Cookie response headers, but they are not present for me.
I don't think the issue is with the cookie I'm using, because when comparing it to what is found in the Chrome dev tools, the cookie matches exactly (except for some values that are new every time you visit the site), so I think the issue is with the form post. Any ideas what I am doing wrong?
Some servers block requests if they see you're making them from outside a browser.
What often works (but not always) is to try tricking the server into thinking you're using a browser. You can do this by setting the "User-Agent" header.
To do so, open the dev tools of your browser (F12), access the "Network" tab and make a request to any site. Then, look into the "Request Headers" section for the "User-Agent" value. Just copy it and send it with your request.
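With OkHttp that would look roughly like this (the User-Agent string below is just a placeholder; replace it with the one copied from your dev tools):
Request login = new Request.Builder()
    .url("https://www.myinstants.com/accounts/login/?next=/new/")
    // placeholder User-Agent; copy the real value from your browser
    .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
    .addHeader("Cookie", CookieHandler.getCookie())
    .post(loginBody)
    .build();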
If all this fails, it's possible that the site has bot protection based on JavaScript. On this kind of site, the login page contains JavaScript that runs right before the login process and generates a random token that you need to send along with the credentials in order to log in successfully. Since you're accessing the site without a browser, you can't run JavaScript and thus you can't generate this token.
If that's the case, the best thing you can do is use a real browser controlled programmatically. For Java you can use Selenium, but I personally prefer Puppeteer from NodeJS. In essence they're the same thing: an API to remotely control a modified version of the Chromium/Chrome browser.
But with Puppeteer you have a little more flexibility than with Java because you don't need to convert between Java and JavaScript objects and vice versa.
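For reference, a rough Selenium sketch for this kind of login (the element locators below are assumptions, not taken from myinstants.com):
import java.util.Set;
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

WebDriver driver = new ChromeDriver();
driver.get("https://www.myinstants.com/accounts/login/?next=/new/");
driver.findElement(By.name("login")).sendKeys(username);         // locator is an assumption
driver.findElement(By.name("password")).sendKeys(password);      // locator is an assumption
driver.findElement(By.cssSelector("button[type='submit']")).click();

// after logging in, the authenticated session cookies are available for later requests
Set<Cookie> cookies = driver.manage().getCookies();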
I don't know what may be wrong and I don't have time to test your code now. But as a suggestion, you could try making the upload using another library and see if it works.
I recommend Apache Fluent API:
https://mvnrepository.com/artifact/org.apache.httpcomponents/fluent-hc
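Untested, but with the Fluent API the same login post would look roughly like this (sessionCookie is a placeholder for the cookie you captured from the GET of the login page; token, username and password are the values you already gather):
import org.apache.http.client.fluent.Form;
import org.apache.http.client.fluent.Request;

String html = Request.Post("https://www.myinstants.com/accounts/login/?next=/new/")
    .addHeader("Cookie", sessionCookie)   // placeholder: session cookie captured earlier
    .bodyForm(Form.form()
        .add("csrfmiddlewaretoken", token)
        .add("login", username)
        .add("password", password)
        .add("remember", "on")
        .add("next", "/new/")
        .build())
    .execute()
    .returnContent()
    .asString();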
I want to download the source of a webpage to a file (*.htm) (i.e. the entire content with all HTML markup) from this URL:
http://isap.sejm.gov.pl/DetailsServlet?id=WDU20061831353
which works perfectly fine with the FileUtils.copyURLToFile method.
However, the said page also has some links, for instance one which I'm very interested in:
http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true
This link works perfectly fine if I open it with a regular browser, but when I try to download it in Java by means of FileUtils, I get only a page with no content and the single message "trwa ladowanie danych" (which means "loading data..."), but then nothing happens and the target page is not loaded.
Could anyone help me with this? From the URL I can see that the page uses Servlets -- is there a special way to download pages created with servlets?
Regards --
This isn't a servlet issue - that just happens to be the technology used to implement the server, but generally clients don't need to care about that. I strongly suspect it's just that the server is responding with different data depending on the request headers (e.g. User-Agent). I see a very different response when I fetch it with curl compared to when I load it in Chrome, for example.
I suggest you experiment with curl, making a request which looks as close as possible to a request from a browser, and then fiddling until you can find out exactly which headers are involved. You might want to use Wireshark or Fiddler to make it easy to see the exact requests/responses involved.
Of course, even if you can fetch the original HTML correctly, there's still all the JavaScript - it would be entirely feasible for the HTML to contain none of the data, but for it to include JavaScript which does the actual data fetching. I don't believe that's the case for this particular page, but you may well find it happens for other pages you want to fetch.
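Once you know which headers matter, you can replay them from Java instead of relying on FileUtils.copyURLToFile alone. A rough sketch (the header values are placeholders you would refine from the curl comparison):
import java.io.File;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.io.FileUtils;

URL url = new URL("http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
// placeholder header values; copy the ones your browser actually sends
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36");
conn.setRequestProperty("Accept", "text/html,application/xhtml+xml");

// save the response body with commons-io, same library as copyURLToFile
InputStream in = conn.getInputStream();
FileUtils.copyInputStreamToFile(in, new File("related.htm"));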
Try using Selenium WebDriver to load the main page:
HtmlUnitDriver driver = new HtmlUnitDriver(true);
driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
driver.get(baseUrl);
and then navigate to the link
driver.findElement(By.linkText("link text")).click(); // match the link by its visible text (or use another suitable locator)
UPDATE: I checked the following: if I turn off cookies in Firefox and then try to load my page:
http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true
then I get the incorrect result just like in my Java app (i.e. the page with the "loading data" message instead of the proper content).
Now, how can I manage the cookies in Java to download this page properly?
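Would something along these lines be the right direction? A rough sketch with java.net.CookieManager: visit the main page first so the session cookies get stored, then download the related page with the same JVM-wide cookie store.
import java.io.File;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.io.FileUtils;

// store cookies for all subsequent HttpURLConnection-based requests in this JVM
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

// the first request establishes the session cookies
HttpURLConnection warmUp = (HttpURLConnection)
        new URL("http://isap.sejm.gov.pl/DetailsServlet?id=WDU20061831353").openConnection();
warmUp.getResponseCode();

// FileUtils opens the URL through URLConnection, so it should reuse the stored cookies
FileUtils.copyURLToFile(
        new URL("http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true"),
        new File("related.htm"));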
I am rewriting some code that used to use GET, replacing it with POST.
The download URL used to be a GET request to
https://myurl/getfile?fileid=1234&filetype=pdf
Now, I changed that to
https://myurl/getfile
and put the fileid=1234&filetype=pdf in the POST body.
I did this using jQuery's post method:
function postCall(url, param) {
$.post(url, param);
}
The server side is written in Java, and I tried to reuse the old GET code, which writes the file's binary content to the servlet's output stream.
However, my browser does not prompt the user to download the file, which it used to do for GET.
Previous posts on Stack Overflow did suggest that AJAX should not be used for file downloads. But what alternative should I use? The request is not generated by a form, though.
Many thanks.
I would suggest creating a form on the page (or creating one dynamically using jQuery), and then having that form do the POST submission (using jQuery's "submit" function or "trigger('submit')" on the form). This way the request won't be done asynchronously in the background. If the "getfile" script responds with a file with Content-Disposition: attachment, it should download.
That said, I'm not sure the browser will "prompt" the user in this scenario; this is dependent on the browser (whether a dialog appears to save the download, or it automatically downloads the file without a prompt).
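On the server side, the POST handler needs to send the same download headers the GET version did. A minimal sketch (inside the servlet's doPost; fileBytes stands in for however you currently load the file, and the filename/content type are placeholders):
// inside doPost, before writing the body
response.setContentType("application/pdf");   // or whatever matches the requested filetype
response.setHeader("Content-Disposition", "attachment; filename=\"report.pdf\""); // placeholder filename
response.setContentLength(fileBytes.length);
response.getOutputStream().write(fileBytes);
response.getOutputStream().flush();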
I am trying to automate a few things in my workplace, where we are not allowed to use the internet (not all websites; only a very few are allowed).
Requirement: I have a form which has a single text box and a single submit button; I have to put something in the text box and submit the form. From the response, I need to parse the HTML and get a specific piece of text. The pages are written in JSP.
Constraint: I don't have access to third-party libraries and have to work with Java 6.
Please point me in the right direction.
HttpURLConnection comes with Java by default. You may consider using this API; it covers most of the same functionality as Apache HttpClient. Here is a simple example of how to use HttpURLConnection.
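(A sketch only, Java 6 compatible; the form URL "http://intranet/search.jsp" and the parameter name "searchText" are placeholders for your JSP's action and input field.)
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

HttpURLConnection conn = (HttpURLConnection) new URL("http://intranet/search.jsp").openConnection();
conn.setRequestMethod("POST");
conn.setDoOutput(true);
conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

// write the form field exactly as the browser would
OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
out.write("searchText=" + URLEncoder.encode("my query", "UTF-8"));
out.flush();
out.close();

// read the returned HTML so it can be parsed for the specific text
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
StringBuilder html = new StringBuilder();
String line;
while ((line = in.readLine()) != null) {
    html.append(line).append('\n');
}
in.close();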
I would use something like Tamper Data from Firefox to capture the HTTP request that gets sent to the server, and then use HttpURLConnection (part of the JDK) to re-create that request.
I was trying to crawl some website content using a combination of jsoup and Java, save the relevant details to my database, and do the same activity daily.
But here is the deal: when I open the website in a browser I get the rendered HTML (with all the element tags). When I test the JavaScript part (the one I'm supposed to use to extract the correct data), it works just fine.
But when I do a parse/get with jsoup (from a Java class), only the initial page is downloaded for parsing. Meaning: there are some dynamic parts of the website whose data I want, but since they are rendered after the GET, asynchronously on the website, I'm unable to capture them with jsoup.
Does anybody know a way around this? Am I using the right toolset? More experienced people, I'd appreciate your advice.
You first need to check whether the website you're crawling demands any of the following to show all of its contents:
Authentication with Login/Password
Some sort of session validation on HTTP headers
Cookies
Some sort of time delay to load all the contents (sites heavy on JavaScript libraries, CSS and asynchronous data may need this)
A specific browser User-Agent
A proxy password if, for example, you're behind a corporate network security configuration
If anything on this list is needed, you can supply that data via the parameters of your Jsoup.connect() call. Please refer to the official docs:
http://jsoup.org/cookbook/input/load-document-from-url
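For instance, if the site only needs a browser-like User-Agent and a session cookie, something like this sketch would do (the URL, cookie name and cookie value are placeholders you would replace with values copied from your browser):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("http://example.com/page")   // placeholder URL
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
    .cookie("JSESSIONID", "value-copied-from-browser")     // placeholder cookie
    .timeout(10 * 1000)                                    // allow slow pages to respond
    .get();
Keep in mind that jsoup does not execute JavaScript, so data loaded asynchronously after the initial HTML will still be missing; for that you would need to request the underlying data endpoint directly or drive a real browser.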