How to download a file using headless (GUI-less) Selenium WebDriver in Java

I need to download files using a headless web browser in Java. I checked HtmlUnit, where I was able to download files in some simple cases, but not when the download is initiated by Ajax (it is actually more complicated: there are two requests, the first one fetches the URL and the second one downloads the actual file from that URL). I have since replaced HtmlUnit with Selenium and already checked two WebDrivers, HtmlUnitDriver and ChromeDriver.
HtmlUnitDriver - behaves similarly to HtmlUnit
ChromeDriver - I am able to download files in visible mode, but when I turn on headless mode files are no longer downloaded
ChromeOptions lChromeOptions = new ChromeOptions();
HashMap<String, Object> lChromePrefs = new HashMap<String, Object>();
lChromePrefs.put("profile.default_content_settings.popups", 0);
lChromePrefs.put("download.default_directory", _PATH_TO_DOWNLOAD_DIR);
lChromeOptions.setExperimentalOption("prefs", lChromePrefs);
lChromeOptions.addArguments("--headless");
return new ChromeDriver(lChromeOptions);
I know that downloading files in headless mode is turned off for security reasons, but there must be some workaround.
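One workaround that is often suggested (a sketch only, assuming a Selenium 4 ChromeDriver where executeCdpCommand is available; the parameter values below are assumptions) is to explicitly allow downloads in headless Chrome via the DevTools command Page.setDownloadBehavior after the driver has started:
ChromeOptions lChromeOptions = new ChromeOptions();
lChromeOptions.addArguments("--headless");
ChromeDriver lDriver = new ChromeDriver(lChromeOptions);
// Hedged sketch: tell headless Chrome to allow downloads into the directory from the prefs above
HashMap<String, Object> lDownloadParams = new HashMap<String, Object>();
lDownloadParams.put("behavior", "allow");
lDownloadParams.put("downloadPath", _PATH_TO_DOWNLOAD_DIR);
lDriver.executeCdpCommand("Page.setDownloadBehavior", lDownloadParams);
After this call, clicking a download link in headless mode should save the file to the given directory.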
I used HtmlUnit 2.28 before; a few minutes ago I started working with 2.29, but it still seems that the Ajax function stops somewhere. This is the way I retrieve data after the click when I expect file data: _link.click().getWebResponse().getContentAsStream()
Does WebConnectionWrapper show all the requests/responses that are made on the website? Do you know how I can debug this to get better insight? I see that the first part of the Ajax function is properly called after the link is clicked (there are 2 HTTP requests in this function). I even tried to create my own HTTP request to retrieve the data/file after the first response is fetched inside WebConnectionWrapper -> getResponse, but it returns a 404 error, which indicates that the second request had somehow been made, yet I don't see any log/debug information either in _link.click().getWebResponse().getContentAsStream() or in WebConnectionWrapper -> getResponse().

Regarding HtmlUnit you can try this:
Calling click() on a DOM element is a sync call. This means it returns after the response of this call is retrieved and processed. Usually all the JS libs out there do some async magic (like starting some processing with setTimeout(,10)) for various (good) reasons. Your code has to be aware of this.
A better approach is to do something like this:
Page page = _link.click();
webClient.waitForBackgroundJavaScript(1000);
Sometimes the Ajax requests do a redirect to the new content. We have to address this new stuff by checking the current window content:
page = page.getEnclosingWindow().getEnclosedPage();
Or, maybe better: in case of downloads the (binary) response might be opened in a new window.
WebWindow tmpWebWindow = webClient.getCurrentWindow();
tmpWebWindow = tmpWebWindow.getTopWindow();
page = tmpWebWindow.getEnclosedPage();
This might be the response you are looking for.
page.getWebResponse().getContentAsStream();
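Putting these pieces together, a minimal end-to-end sketch (assuming _link is the anchor that triggers the download, reusing _PATH_TO_DOWNLOAD_DIR from the question as the target directory, and with an arbitrary wait time and a placeholder file name) could look like this:
Page page = _link.click();
webClient.waitForBackgroundJavaScript(10000);
// the (binary) download response may have been opened in a new top-level window
Page downloadPage = webClient.getCurrentWindow().getTopWindow().getEnclosedPage();
try (InputStream in = downloadPage.getWebResponse().getContentAsStream()) {
    Files.copy(in, Paths.get(_PATH_TO_DOWNLOAD_DIR, "downloaded.bin"),
            StandardCopyOption.REPLACE_EXISTING);
}
The file name "downloaded.bin" is just a placeholder; the real name could be taken from the Content-Disposition header of downloadPage.getWebResponse().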
It's a bit tricky to guess what is going on with your web application. If you like, you can reach me via private mail or discuss this on the HtmlUnit user mailing list.

Related

Enriching headers via a proxy using Java and Selenium

I'm working on an automated web test stack using Selenium, Java and TestNG.
For authentication and safety reasons, I need to enrich the headers of the website I am accessing through Kubernetes.
For example, I can successfully use this curl command in a terminal to retrieve the page I want to access: curl -H 'Host: staging.myapp.com' -H 'X-Forwarded-Proto: https' http://nginx.myapp.svc.cluster.local.
So as you can see, I only need to add two headers: Host and X-Forwarded-Proto.
For a couple of days I've been trying to create a proxy that enriches the headers in my @BeforeMethod method, but I'm still stuck, and there are so many grey areas that I can't find a way to debug anything and understand what's wrong. For now, no matter what my code looks like, I keep getting a "No internet" (ERR_PROXY_CONNECTION_FAILED) error page in my driver when I launch it.
For example, one version of my code:
BrowserMobProxy browserMobProxy = new BrowserMobProxyServer();
browserMobProxy.setTrustAllServers(true);
browserMobProxy.addHeader("Host", "staging.myapp.com");
browserMobProxy.addHeader("X-Forwarded-Proto", "https");
browserMobProxy.start();
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setProxy(ClientUtil.createSeleniumProxy(browserMobProxy));
driver = new ChromeDriver(chromeOptions);
driver.get("http://nginx.myapp.svc.cluster.local");
I tried several other code structures, like:
defining browserMobProxyServer.addRequestFilter to add headers to requests (a sketch of this approach appears after the answer below)
using only org.openqa.selenium.Proxy
setting up the proxy with setHttpProxy("http://nginx.myapp.svc.cluster.local:8888")
But nothing works; I always get ERR_PROXY_CONNECTION_FAILED.
Anybody have any clue about that? Thanks!
OK so, after days of research, I found out two things:
Due to some configuration on my Mac, I need to force the host address (other people running the same code had no issue...):
proxy.setHttpProxy(Inet4Address.getLocalHost().getHostAddress() + ":" + browserMobProxyServer.getPort());
I have to manually alter headers via a response filter instead of using the .addHeader method:
browserMobProxyServer.addResponseFilter((response, content, messageInfo) -> {
    // Do something here related to response.headers()
});
I hope it will help some lost souls here.
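For the addRequestFilter attempt mentioned in the question, a hedged sketch of what such a filter could look like (the header names are taken from the question; everything else is an assumption about the BrowserMob/Netty headers API):
browserMobProxyServer.addRequestFilter((request, contents, messageInfo) -> {
    // Replace the Host header and add the forwarded-proto header on every outgoing request
    request.headers().remove("Host");
    request.headers().add("Host", "staging.myapp.com");
    request.headers().add("X-Forwarded-Proto", "https");
    return null; // null means: continue with the (modified) request
});
Whether this works in a given setup still depends on the browser actually reaching the proxy, which is what the host-address fix above addresses.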

Setting Cookies without visiting a page in Selenium

Can anyone explain to me how I could set cookies for a domain I have not visited, with the use of a plugin, using Selenium and geckodriver? I have been trying to set a cookie to avoid seeing a login page, but the domain for the cookie redirects, so I cannot set it by visiting the page and cannot figure out how else to do it.
I have tried this, but it looks as though I cannot specify it in Selenium because I cannot visit the page:
Cookie cookie11 = new Cookie("SID",
"cookievalue",
".google.com",
"/",
expiry1,
false,
false);
I found a plugin called Cookies Export/Import and I am trying to figure out whether I can use it to import the cookies.
Any help would be appreciated!
If you wish to use the specified extension to do this, I recommend looking at the SO answer on How do you use a firefox plugin within a selenium webdriver program written in java?, and you should be good from there.
However, I believe you can achieve this without using an extension, by using the addCookie() method.
WebDriver driver = new FirefoxDriver();
Cookie cookie = new Cookie("SID",
"cookievalue",
".example.com",
"/",
expiry1,
false,
false);
driver.manage().addCookie(cookie);
driver.get("http://www.example.com/login");
Assuming your cookie details are correct, you should be able to get past the login redirect.
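If the driver rejects the cookie because the browser is not currently on that domain, a commonly used workaround is to first load any cheap page on the domain (even a 404) and only then add the cookie. A minimal sketch, assuming example.com, a hypothetical non-redirecting URL, and an assumed one-day expiry:
WebDriver driver = new FirefoxDriver();
// Load something on the target domain first so the cookie's domain matches the current page
driver.get("http://www.example.com/some-page-that-can-404");  // hypothetical URL
Date expiry1 = new Date(System.currentTimeMillis() + 24L * 60 * 60 * 1000); // assumed 1-day expiry
Cookie cookie = new Cookie("SID", "cookievalue", ".example.com", "/", expiry1, false, false);
driver.manage().addCookie(cookie);
driver.get("http://www.example.com/login");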
See also:
WebDriver – How to Restore Cookies in New Browser Window
You cannot do that. See https://w3c.github.io/webdriver/webdriver-spec.html#add-cookie
I opened this issue with the spec https://github.com/w3c/webdriver/issues/1238
You would need to rebuild the browser without those validations if you want to get past this issue.
Here are the changes to make to Firefox (Marionette) to get past this:
https://gist.github.com/nddipiazza/1c8cc5ec8dd804f735f772c038483401

Java: Logging in to a website that uses complex JavaScript

I'd first like to start by saying that I've managed this using PhantomJS and Selenium. I load PhantomJS, load the URL (sports.coral.co.uk) and then check my balance. I am, however, trying to find a more lightweight option.
I have tried manually sending HTTP GET/POST requests using Apache's HttpClient. Monitoring the login process using Postman for Chrome shows 4 requests sent once the login button has been pressed. I have tried editing and re-sending them using Postman. However, from what I can tell there's a requestId that gets sent along with the requests, generated by the JavaScript on the page.
var requestId = (new Date().getTime()) + Math.round(Math.random() * 1000000);
var failedTimer = setTimeout('iapiRequestFailed(' + requestId + ')', iapiConf['loginDomainRetryInterval'] * 1000);
iapiRegisterRequestId(requestId, iapiCALLOUT_MESSAGES, failedTimer, request[3], request[4], request[5]);
return;
It looks like the original ID is a randomly generated number that then gets registered using another JavaScript function. I'm guessing the login is partly failing because I'm unable to provide an acceptable requestId. When I re-send the old requests the user is partly logged in; once I click on my account it says an error occurred. The only explanation would be the requestId.
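For what it's worth, the page's ID scheme is just the current time in milliseconds plus a random offset, so an equivalent value can be produced on the Java side for a hand-built request (a sketch only; whether the server accepts an ID that was never registered via iapiRegisterRequestId is a separate question):
// Same construction as the page's JS: epoch millis plus a random 0..1,000,000 offset
long requestId = System.currentTimeMillis() + Math.round(Math.random() * 1_000_000);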
I then decided to give HtmlUnit a go, as it seems like the type of thing I require. I did some research on using HttpClient with a JavaScript engine such as Rhino, and it seems HtmlUnit is the tool for that.
Before I even try to log in to the page, I get errors caused by the JavaScript on the page.
Here's the simple bit of code I use to connect to the page:
@Test
public void htmlunit() throws Exception {
    LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
    WebClient client = new WebClient(BrowserVersion.CHROME);
    client.getOptions().setJavaScriptEnabled(true);
    client.getOptions().setThrowExceptionOnScriptError(false);
    client.getOptions().setThrowExceptionOnFailingStatusCode(false);
    HtmlPage page = client.getPage("http://sports.coral.co.uk");
    System.out.println(page.asText());
    client.close();
}
When I comment out the LogFactory bit, I can see that loads of warnings are thrown:
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Feb 09, 2016 4:33:34 PM com.gargoylesoftware.htmlunit.html.HtmlScript isExecutionNeeded
WARNING: Script is not JavaScript (type: application/ld+json, language: ). Skipping execution. etc...
I'm guessing this means that HtmlUnit isn't compatible with the JavaScript that's being executed on the page?
I'm not very good with JavaScript, and the scripts on the page are obfuscated, which makes them even harder to read. What I don't understand is why the JS gets executed without error when using PhantomJS or ChromeDriver but not HtmlUnit. Is it because the Rhino engine isn't good enough to execute it? Am I missing something obvious?
This code will turn off all the JavaScript warnings caused by the HtmlUnit library rather than by your code.
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
WebClient client = new WebClient(BrowserVersion.CHROME);
client.getOptions().setJavaScriptEnabled(true);
client.getOptions().setThrowExceptionOnScriptError(false);
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
HtmlPage page = client.getPage("http://sports.coral.co.uk");
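Since the page's login flow is driven by background JavaScript, it may also help to give those scripts time to finish before reading the page content, using the same waitForBackgroundJavaScript idea shown in the download question above (the timeout value here is arbitrary):
client.waitForBackgroundJavaScript(10000);
System.out.println(page.asText());
client.close();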

Establish Connection First, Redirect User Second

I have an idea to make something pretty sweet, but I'm not sure if it's possible. Here is an example of a very basic Ajax function that I might use to establish a connection to a server...
function getFakePage(userId)
{
    var ajaxObject, path, params;
    ajaxObject = getAjaxObject();
    params = "?userId=" + userId;
    path = getInternalPath() + "someServlet" + params;
    ajaxObject.open("GET", path, true);
    ajaxObject.send();
    // On ready state change stuff here
}
So let's say I have a URL like this...
https://localhost:8443/Instride/user/1/admin
And I wanted to use JavaScript to redirect the user to this URL. Normally I would just do this...
window.location = "https://localhost:8443/Instride/user/1/admin";
But my idea is to create a JavaScript function (no JS frameworks, please) that combines the Ajax code with the window.location code. Basically, I would like to create a connection with the server via Ajax, send a servlet on that server the URL the user should be redirected to, and then redirect the user to that URL, so that for however long it takes the user to connect to my server from wherever they are in the world, they see a loading icon instead of a blank white page.
So to clarify exactly what I am trying to accomplish: I do not want to put window.location within the success handler of my Ajax function (because that would involve two round trips), and I do not want to return a huge chunk of HTML for the requested resource and add it to the page. I want to establish a connection to the server with Ajax, send a servlet the URL the user wants to go to, and then somehow override the Ajax function to redirect that user. Is this possible?
And I know some of you might think this is stupid, but it's not when you're talking about overseas users with slow dial-up connections staring at white pages. If it's possible, I'd love to hear some insight. Thank you very much!
First, let me say that the best solution is finding what is causing the slowness and fixing it.
Now, as to your question: yes, you could do it. You could even shoehorn it onto an existing application. But it wouldn't be pretty, and it comes with its own set of problems. Here are the steps:
1. Browser calls the Ajax cache service requesting "somepage.html"
2. Browser shows a loading icon
3. Server creates somepage.html and caches it in a temporary cache (Ehcache or another library would be good, probably with file backing for the cache depending on size)
4. Server responds to the Ajax request with the ID of the cached page
5. Browser now redirects to "somepage.html?cacheId={cacheId}", where the ID comes from the Ajax call
6. Server uses a filter to check whether a cached copy can be served for the page instead of rendering the actual page, thus speeding up the request (a sketch of such a filter follows below)
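To illustrate the filter step above, a minimal sketch of what that server-side check might look like (the class name, the in-memory cache map and the cacheId parameter are all hypothetical; a real implementation would use Ehcache or similar, as the steps suggest):
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CachedPageFilter implements Filter {
    // Hypothetical in-memory cache; the steps above suggest Ehcache with file backing instead
    private static final ConcurrentHashMap<String, String> PAGE_CACHE = new ConcurrentHashMap<>();

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String cacheId = req.getParameter("cacheId");
        String cached = (cacheId == null) ? null : PAGE_CACHE.get(cacheId);
        if (cached != null) {
            res.setContentType("text/html");
            res.getWriter().write(cached);   // serve the pre-rendered page, skip normal processing
            return;
        }
        chain.doFilter(req, res);            // no cached copy: fall through to the real page
    }

    public void init(FilterConfig cfg) {}
    public void destroy() {}
}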
Having said that, it would be better to just have the new page load quickly with a loading icon while it does any of the heavy lifting through Ajax.
You can't do an AJAX request and a location change in one. If you want to do only one request, you have to choose one of those methods, i.e. return some data and replace content on your current page, or load a completely new page.
It doesn't make any sense to want to do both. What you could want is stateful URLs, where your URL matches the content displayed, even if that content comes from an AJAX request. In that case an easy solution is to use the # part of the URL, which you can change freely (window.location.hash). Some modern browsers support changing the whole URL without causing the page to reload. I've used # with great success myself.

Simple example of jQuery Address to manage application state

I'm using the jQuery Address library to rewrite my URL depending on what the user is doing on the page. The intention is to allow users to bookmark pages and come back later. The page never refreshes, as all server interaction is done via Ajax.
jQuery Address is writing URLs like this:
http://localhost:9000/#/u/scott_tiger
I need to set up a route in Play to be able to route that request through to the appropriate controller. So I set this up:
GET /#/u/{username} Controller.showUser
This doesn't work, though; the route definition gets ignored. I've tried loads of things, such as trying to escape the "#" and replacing it with a variable that I've populated with Character.toString(35). None of this works.
Does anyone know how I can either define the route properly or get jQuery Address not to write the "#"?
EDIT: The "#" doesn't get sent to the server, does it? Doh! OK, the question is revised.
No. The # and the part of the URL after it are not sent to the server, so your Play app on the server will never see such URLs.
HTML5 solution
You need to handle these URLs on the client side using JavaScript. In modern browsers with good HTML5 support, you can modify the address without reloading the page. See Manipulating the browser history for how to do it in these browsers, and see When can I use... for browser support.
#-URLs
On Internet Explorer and older versions of other browsers, you need to use #-URLs and JavaScript to load the state (e.g. to get the user page /u/scott_tiger in your example). See How to run a JavaScript function when the user is visiting an hash link (#something) using JQuery? for how to do this in JavaScript. Also, if a user bookmarks a page with a #-URL, you need to reload the state.
See also: What's the shebang/hashbang (#!) in Facebook and new Twitter URLs for?
JavaScript libraries
You may use JavaScript libraries to handle this for you; history.js is an example. Bigger frameworks like Backbone.js also handle this.
Does anyone know how I can get jQuery Address not to write the "#".
If you don't write the #-part of the URL, the state cannot be linked, so you cannot get back to e.g. Scott Tiger's profile page if you bookmark it: the URL is only http://localhost:9000/ and you will arrive on the front page, while the user thought he would arrive on the profile page.
Armed with my new understanding of URLs (thanks @Jonas), I realised that I'd missed half of the story.
I'm using jQuery Address to change the URL depending on what you click in the application. This works great on lots of browsers. What I was missing was using jQuery Address to watch for external address changes (bookmarks, history, back/forward) and respond accordingly, i.e. set the page up correctly by firing the appropriate Ajax calls and rendering that data appropriately.
Changing the address
$.address.title("new title describing application state");
$.address.parameter("q", "val1");
$.address.parameter("g", "val2");
$.address.update();
Restoring the state
$.address.externalChange(function(event) {
    var val1 = event.parameters["q"];
    var val2 = event.parameters["g"];
    // do something with those values
});
