How to speed up page parsing in Selenium

What can I do if I load a page in Selenium and then have to run about 100 different parsing queries against it?
At the moment I use various driver.findElement(By...) calls, and the problem is that each one is an HTTP (GET/POST) request from Java to Selenium. As a result, parsing one simple page takes 30+ seconds, which is too much.
I think I should get the source code (driver.getPageSource()) with the first request and then parse that string locally (my page does not change while I parse it).
Can I build some kind of HTML object from this string so that I can keep working with WebElement queries?
Do I have to use another library to build the HTML object (for example, jsoup)? In that case I would have to rewrite my parsing queries, which currently use WebElements and XPath.
Anything else?

When you call findElement, Selenium does not need to parse the page to find the element. The HTML is parsed when the page is loaded. Some further parsing may happen when JavaScript modifies the page (for example, element.innerHTML += ...). What Selenium does is query the DOM with methods like .getElementsByClassName, .querySelector, etc. That said, if your browser runs on a remote machine, things can slow down. Even locally, a large number of round trips between your Selenium script and the browser can slow the script down quite a bit. What can you do?
What I prefer to do when I have a lot of queries to do on a page is to use .executeScript to do the work on the browser side. This can reduce dozens of queries to a single one. For instance:
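// One round trip: the filtering runs inside the browser.
@SuppressWarnings("unchecked")
List<WebElement> elements = (List<WebElement>) ((JavascriptExecutor) driver)
    .executeScript(
        "var elements = document.getElementsByClassName('foo');" +
        "return Array.prototype.filter.call(elements, function (el) {" +
        // getAttribute returns null for a missing attribute instead of throwing
        "  return el.getAttribute('whatever') === 'something';" +
        "});");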
(I've not run the code above. Watch out for typos!)
In this example, you'd get a list of all elements of class foo that have an attribute named whatever which has a value equal to something. (The Array.prototype.filter.call rigmarole is because .getElementsByClassName returns something that behaves like an Array but which is not an Array so it does not have a .filter method.)
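The same idea extends to many lookups at once. The following is a minimal sketch, not run against a real browser, that sends a whole list of CSS selectors in one executeScript call and gets back a map from selector to text content. ScriptRunner is a local stand-in I made up so the example compiles without Selenium on the classpath; it only mirrors the shape of JavascriptExecutor.executeScript(String, Object...). The class name BatchedLookup and the selectors are hypothetical too.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in that mirrors the shape of JavascriptExecutor.executeScript. */
interface ScriptRunner {
    Object executeScript(String script, Object... args);
}

class BatchedLookup {
    // One script for all selectors. The selector list arrives as arguments[0],
    // so no selector text has to be spliced into the JavaScript source.
    static final String SCRIPT =
        "var out = {};" +
        "arguments[0].forEach(function (sel) {" +
        "  var el = document.querySelector(sel);" +
        "  out[sel] = el ? el.textContent : null;" +
        "});" +
        "return out;";

    /** Performs a single round trip and returns selector -> text (null when nothing matched). */
    @SuppressWarnings("unchecked")
    static Map<String, Object> textBySelector(ScriptRunner runner, List<String> selectors) {
        return (Map<String, Object>) runner.executeScript(SCRIPT, selectors);
    }

    public static void main(String[] args) {
        // Fake runner for the demo: it counts calls and returns a canned result.
        int[] calls = {0};
        ScriptRunner fake = (script, scriptArgs) -> {
            calls[0]++;
            Map<String, Object> canned = new HashMap<>();
            canned.put("h1", "Example title");
            canned.put(".price", null);
            return canned;
        };
        Map<String, Object> texts = textBySelector(fake, List.of("h1", ".price"));
        System.out.println(texts.get("h1") + " after " + calls[0] + " round trip(s)");
    }
}
```

With a real driver you would pass ((JavascriptExecutor) driver)::executeScript as the runner. Returning plain strings instead of WebElements also avoids a further round trip per element for getText().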
Parsing locally is an option if you know that the page won't change as you examine it. You should get the page's source by using something like:
String html = (String) ((JavascriptExecutor) driver).executeScript(
"return document.documentElement.outerHTML");
By doing this, you see the page exactly as the browser interpreted it. You will have to use something other than Selenium to parse the HTML.

Maybe try evaluating your elements only when you use them?
I don't know the Java equivalent, but in C# you could do something like the following, which only looks up the element when it is used:
private static readonly By UsernameSelector = By.Name("username");
private IWebElement UsernameInputElement
{
get { return Driver.FindElement(UsernameSelector); }
}

Related

How to Add a <script> into Head Using Selenium's JavascriptExecutor

Summary
I want to figure out a way to add a <script> tag to the head of the DOM using Selenium's JavascriptExecutor; any other way of doing this would be fine too.
I have tried many approaches and found a few similar topics, but none of them solved my problem, which is why I am asking here.
For example, the solutions suggested in one similar question did not solve my problem. Some people say they worked for them, but they did not work for me.
What I've been trying to execute?
Here is the small snippet of the code that I want to execute:
WebDriver driver = new FirefoxDriver();
JavascriptExecutor jse = (JavascriptExecutor) driver;
jse.executeScript("var s = document.createElement('script');");
jse.executeScript("s.type = 'text/javascript';");
jse.executeScript("s.text = 'function foo() {console.log('foo')}';");
jse.executeScript("window.document.head.appendChild(s);");
I have omitted the code that navigates to a webpage with driver.get() before the scripts are executed.
Also, s.text would contain the actual script that I want to use; the foo() function is just a placeholder to give the idea.
The above code throws this error when you run it:
Exception in thread "main" org.openqa.selenium.JavascriptException: ReferenceError: s is not defined
So far I've tried every possible solution I could find on the Internet but none of them seems to work.

OP came up with the following solution:
jse.executeScript("var s=window.document.createElement('script');" +
"s.type = 'text/javascript';" + "s.text = function foo() {console.log('foo')};" +
"window.document.head.appendChild(s);");

For one, this line is invalid.
jse.executeScript("s.text = 'function foo() {console.log('foo')}';");
Note how single-quoted text is wrapped in single quotes. Escape one set as \" instead.
Personally, I would do it like this (edited to make it a global function):
using OpenQA.Selenium.Support.Extensions;
driver.ExecuteJavaScript("window.foo = function foo() {console.log('foo')}");
It's as simple as that. This registers foo as a function. After you execute this JavaScript, you can open the browser developer tools and call "foo()" to check. You can also check this by registering it directly in the console: enter "function foo() {console.log('foo')}" into your browser console, and then call "foo()".
No need to add this as a script tag.
EDIT #2: I fixed my code suggestion above so that the function is assigned to window and is therefore accessible globally, outside the anonymous function in which the JavaScript executor runs the code. This resolves the original issue, at least in my testing.
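To tie the answers together: each executeScript call runs its script as the body of a fresh anonymous function, so a var declared in one call is gone in the next (hence "s is not defined"), and nesting single quotes inside a single-quoted JavaScript string is a syntax error. One way around both, sketched below and not run against a real browser, is to do everything in a single call and pass the script body as an argument, which the script reads as arguments[0]. ScriptRunner is a made-up local stand-in that mirrors the shape of JavascriptExecutor.executeScript(String, Object...) so the example compiles without Selenium; ScriptTagInjector and addScriptToHead are hypothetical names.

```java
/** Hypothetical stand-in that mirrors the shape of JavascriptExecutor.executeScript. */
interface ScriptRunner {
    Object executeScript(String script, Object... args);
}

class ScriptTagInjector {
    // Everything happens in one call, so the variable s never leaves its scope.
    // The script body is read from arguments[0], so its quotes need no escaping.
    static final String INJECT =
        "var s = document.createElement('script');" +
        "s.text = arguments[0];" +
        "document.head.appendChild(s);";

    /** Appends a <script> element containing scriptBody to the document head. */
    static void addScriptToHead(ScriptRunner runner, String scriptBody) {
        runner.executeScript(INJECT, scriptBody);
    }

    public static void main(String[] args) {
        // Fake runner for the demo: it only prints what would be sent to the browser.
        ScriptRunner printing = (script, scriptArgs) -> {
            System.out.println("script: " + script);
            System.out.println("arg 0:  " + scriptArgs[0]);
            return null;
        };
        addScriptToHead(printing, "function foo() { console.log('foo'); }");
    }
}
```

With a real driver the call would be addScriptToHead(((JavascriptExecutor) driver)::executeScript, body). Because the injected code runs as a normal script element, a function declared in it ends up global without the window.foo assignment.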

HtmlUnit and HTTPS pages

I'm trying to write a program that checks available positions and books the first available one. I started writing it and ran into a problem pretty early.
The problem is that when I try to connect to the site (which uses HTTPS), the program doesn't do anything. It doesn't throw an error and it doesn't crash. The weirdest thing is that it works with some HTTPS websites and not with others.
I've spent countless hours trying to resolve this problem. I tried using HtmlUnitDriver and it still doesn't work. Please help.
private final WebClient webc = new WebClient(BrowserVersion.CHROME);
webc.getCookieManager().setCookiesEnabled(true);
HtmlPage loginpage = webc.getPage(loginurl);
System.out.println(loginpage.getTitleText());
I'm getting really frustrated with this. Thank you in advance.

As far as I can see, this has nothing to do with HTTPS. It is a good idea to do some traffic analysis using Charles or Fiddler.
Here is what you can see.
The page the server returns in response to your first call to https://online.enel.pl/ loads some external JavaScript. And then the story begins.
This JavaScript looks like this:
(function() {
var z = "";
var b = "766172205f3078666.....";
eval((function() {
for (var i = 0; i < b.length; i += 2) {
z += String.fromCharCode(parseInt(b.substring(i, i + 2), 16));
}
return z;
})());
})();
As you can see, someone wants to hide the real JavaScript that gets processed.
The next step is to check the JavaScript after this simple decoding.
It is really huge and looks like this:
var _0xfbfd = ['\x77\x71\x30\x6b\x77 ....
(function (_0x2ea96d, _0x460da4) {
var _0x1da805 = function (_0x55e996) {
while (--_0x55e996) {
_0x2ea96d['\x70\x75\x73\x68'](_0x2ea96d['\x73\x68\x69\x66\x74']());
}
};
.....
OK, now we have obfuscated JavaScript. If you like, you can start with http://ddecode.com/hexdecoder/ to get some more readable text, but this is the step where I stopped my analysis. It looks like this script does some really bad things, or someone still believes in security by obscurity.
If you run this with HtmlUnit, the code gets interpreted: yes, the decoding works and the code runs. Sadly, the code runs endlessly (maybe because of an error or some incompatibility with real browsers).
If you want to get this working, you have to figure out where the error is and open a bug report for HtmlUnit. You can start with a small local HTML file that includes the code from the first external JavaScript. Then add some log statements to get the decoded version. Then replace the original with the decoded version and try to understand what is going on. You can add alert statements and check whether the code in HtmlUnit follows the same path as it does in browsers. Sorry, my time is too limited to do all this work, but I would really like to help or fix this if you can point to a specific function in HtmlUnit that behaves differently from real browsers.
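If you want to do the first decoding step offline rather than in a browser, the loop in the loader is easy to reproduce. Here is a small Java sketch (class and method names are my own) that turns pairs of hex digits into characters, plus a second helper that expands \xNN escapes like the ones in the decoded script. The inputs are only the fragments visible above, since the full strings are truncated.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexDecode {
    // Matches a literal backslash, an x, and exactly two hex digits.
    private static final Pattern X_ESCAPE = Pattern.compile("\\\\x([0-9a-fA-F]{2})");

    /** Mirrors the loader's loop: every two hex digits become one character.
        A trailing unpaired digit is ignored. */
    static String decodeHexPairs(String hex) {
        StringBuilder out = new StringBuilder(hex.length() / 2);
        for (int i = 0; i + 1 < hex.length(); i += 2) {
            out.append((char) Integer.parseInt(hex.substring(i, i + 2), 16));
        }
        return out.toString();
    }

    /** Replaces each \xNN escape with the character it stands for. */
    static String decodeXEscapes(String text) {
        return X_ESCAPE.matcher(text).replaceAll(match ->
            Matcher.quoteReplacement(
                String.valueOf((char) Integer.parseInt(match.group(1), 16))));
    }

    public static void main(String[] args) {
        // First 16 digits of the string b shown in the loader.
        System.out.println(decodeHexPairs("766172205f307866")); // prints: var _0xf
        // Property names from the second-stage script.
        System.out.println(decodeXEscapes("['\\x70\\x75\\x73\\x68']"));     // prints: ['push']
        System.out.println(decodeXEscapes("['\\x73\\x68\\x69\\x66\\x74']")); // prints: ['shift']
    }
}
```

The decoded prefix lines up with the "var _0xfbfd" that opens the second-stage script, which is a quick sanity check that the decoding is right.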

Without the URL you are querying, it is difficult to say what could be wrong. However, having worked with HtmlUnit some time back, I found that it failed with many sites that I needed to get data from. Site owners do many things to stop programs from accessing their sites, and you might have to resort to a lower-level library like Apache HttpComponents, where you have more control over what is going on under the hood.
Also check whether the website is built with JavaScript, which is getting more and more popular but makes it increasingly difficult for programs to interrogate the content.

Selenium/Java: how to verify this complex text on a page

I want to verify that the text below (HTML code), which contains // characters etc., is present on the page, using Selenium/Java.
<div class="powatag" data-endpoint="https://api-sb2.powatag.com" data-key="b3JvYmlhbmNvdGVzdDErYXBpOjEyMzQ1Njc4" data-sku="519" data-lang="en_GB" data-type="bag" data-style="bg-act-left" data-colorscheme="light" data-redirect=""></div>
I would appreciate any help with this.

I believe you're looking for:
String textToVerify = "some html";
boolean bFoundText = driver.getPageSource().contains(textToVerify);
Assert.assertTrue(bFoundText);
Note that this checks the page source of the last loaded page, as detailed in the Javadoc. I've found that this also takes longer to execute, especially with large page sources. As such, this method is more prone to failure than validating the attributes and values; the answer from Breaks Software is what I use when possible, only with an XPath selector.

As Andreas commented, you probably want to verify individual attributes of the div element. Since you specifically mentioned the "//", I'm guessing that you are having trouble with the data-endpoint attribute. I'm assuming that your data-sku attribute identifies a unique element, so try something like this (not verified):
String endpoint = driver.findElement(
    By.cssSelector("div[data-sku='519']")).getAttribute("data-endpoint");
assertEquals("https://api-sb2.powatag.com", endpoint);
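If several attributes need checking, it can help to compare them all and report every mismatch in one go. The sketch below is only an illustration: AttributeCheck and mismatches are made-up names, and the attribute lookup is passed in as a plain function so the example runs without Selenium. With a real WebElement you would pass element::getAttribute. The expected values are taken from the div in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class AttributeCheck {
    /** Returns one message per attribute whose actual value differs from the expected one. */
    static List<String> mismatches(Function<String, String> getAttribute,
                                   Map<String, String> expected) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> e : expected.entrySet()) {
            String actual = getAttribute.apply(e.getKey());
            if (!e.getValue().equals(actual)) {
                problems.add(e.getKey() + ": expected '" + e.getValue()
                    + "' but was '" + actual + "'");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("data-endpoint", "https://api-sb2.powatag.com");
        expected.put("data-sku", "519");
        expected.put("data-lang", "en_GB");

        // Fake element for the demo; in a test this would be element::getAttribute.
        Map<String, String> fakeElement = Map.of(
            "data-endpoint", "https://api-sb2.powatag.com",
            "data-sku", "519",
            "data-lang", "en_GB");

        System.out.println(mismatches(fakeElement::get, expected)); // prints: []
    }
}
```

In a test you would then assert that the returned list is empty, so a failure message names every attribute that is wrong rather than only the first.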

Servlet not retrieve parameters from AJAX call

I use AJAX to call an action and pass parameters. The AJAX call is made from an XSL page, as follows:
xmlHttp.open("GET","examcont?action=AJAX_SectionsBySessionId&amp;sessionId="+sessionId,true);
I put amp; after the & because XSL raises this error when I remove it:
The reference to entity "sessionId" must end with the ';' delimiter
The problem is that the action is unable to read the sessionId parameter. However, when I try the same action URL without the amp;, the action reads the parameter successfully.

The problem seems to be that &amp; represents & in the style sheet but gets escaped to &amp; again during output (because the output is HTML/XML). You can try the following in XSL to avoid the escaping:
xmlHttp.open("GET","examcont?action=AJAX_SectionsBySessionId<xsl:text disable-output-escaping="yes">&amp;</xsl:text>sessionId="+sessionId,true);
However, note that if you run your XSL in the browser, this does not work on Firefox (although it is correct XSL and should work), according to https://bugzilla.mozilla.org/show_bug.cgi?id=98168.
As a portable alternative, you can use the following, which avoids mentioning & by inserting it at runtime with what you might call "JavaScript escaping":
xmlHttp.open("GET","examcont?action=AJAX_SectionsBySessionId"+String.fromCharCode(38)+"sessionId="+sessionId,true);
Also have a look at a similar question with a deeper discussion and other options: using an HTML entity in XSLT (e.g. &nbsp;).
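To see why the servlet cannot find sessionId when the escaped form reaches the URL, it helps to split the query string by hand. This is a rough sketch of how a query string is broken into parameters, assuming the literal text &amp; ends up in the request; it is not the container's actual code, and QuerySplit, parse and the value 42 are made up for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class QuerySplit {
    /** Splits a raw query string on '&' and '=' (no URL decoding, for brevity). */
    static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    public static void main(String[] args) {
        // Escaped form: the second parameter is named "amp;sessionId".
        System.out.println(
            parse("action=AJAX_SectionsBySessionId&amp;sessionId=42").keySet());
        // prints: [action, amp;sessionId]

        // Plain form: the parameter is named "sessionId" as intended.
        System.out.println(
            parse("action=AJAX_SectionsBySessionId&sessionId=42").keySet());
        // prints: [action, sessionId]
    }
}
```

Under that assumption, request.getParameter("sessionId") returns null for the first URL because the only parameters present are action and amp;sessionId.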

Jsoup - CSS Query selector issue (?)

I have an odd issue here. I've been using Jsoup 1.7.2 for a while with no problems, but now I get no results when I try to retrieve the main headlines from this website, www.jornaldamarinha.pt, using this code:
// Connecting...
Document doc = Jsoup.connect("http://www.jornaldamarinha.pt")
.timeout(0)
.get();
// "*[class*=zincontent-wrap]" in jsoup's selector syntax means:
// select all tags whose class attribute contains "zincontent-wrap".
Elements elems = doc.select("*[class*=zincontent-wrap]"); // Retrieves 0 results!
int t = elems.size();
Log.w("INFO", "Total Headlines: " + t);
// Loop through all retrieved headlines:
for (Element e : elems) {
    String headline = e.select("a").text();
    Log.w("HEADLINE", headline);
}
It fails! It retrieves 0 results (it should retrieve ~8).
The chances are that the issue is caused by:
Aliens... (Similar to androids, but uglier...)
Website encoding. (I tried decoding the incoming HTML as ISO-8859-15 to handle Portuguese special characters, but the issue remains.)
Malformed incoming HTML. (I doubt this is the issue, since the selector works fine on the "Try jsoup" online page, and Jsoup usually handles broken HTML very well.)
The minus symbol ("-") in the class name is confusing Jsoup. (This seems to me to be the main cause, or at least one cause, of the issue.)
Something else... (Very probably!)
BUT... at http://try.jsoup.org, if I fetch the URL http://www.jornaldamarinha.pt using this CSS query:
*[class*=zincontent-wrap]
everything works just great there! (It retrieves all ~8 correct results!)
SO... to sum up, all I need is to do exactly what that web page does, but in code.
THANKS in advance for any light or workaround on this! :)

SOLUTION! After all, everything in the code above was working correctly, as I suspected, except that the CSS query finds nothing with Android's default user agent (presumably the site serves different markup to that user agent). I just found out that setting userAgent on Jsoup's connection is VERY important! So I edited my code as follows, and it works like a charm now (with exactly the same results as on the http://try.jsoup.org page):
Document doc = Jsoup.connect("http://www.jornaldamarinha.pt")
.userAgent("Mozilla/5.0 Gecko/20100101 Firefox/21.0")
.timeout(0)
.get();
Hope this helps anyone else too! :)
