Jsoup - CSS Query selector issue (?) - java

I'm having an odd issue here. I've been using Jsoup 1.7.2 for a while with no problems, but now, when I try to retrieve the main headlines from this website: www.jornaldamarinha.pt, using this code:
// Connecting...
Document doc = Jsoup.connect("http://www.jornaldamarinha.pt")
        .timeout(0)
        .get();
// "*[class*=zincontent-wrap]", in "Jsoup idiom", means:
// select all elements whose class attribute contains "zincontent-wrap".
Elements elems = doc.select("*[class*=zincontent-wrap]"); // Retrieves 0 results!
int t = elems.size();
Log.w("INFO", "Total Headlines: " + t);
// Loop through all retrieved headlines:
for (Element e : elems) {
    String headline = e.select("a").text();
    Log.w("HEADLINE", headline);
}
It fails!... Retrieves 0 results. (Should retrieve ~8)
The chances are that the issue is caused by:
Aliens... (Similar to androids, but uglier...)
Website encoding. (I tried parsing the incoming HTML as ISO-8859-15, to handle Portuguese special characters, but the issue remains.)
Malformed incoming HTML. (I doubt this could be the issue, since the selector works fine on the "Try jsoup online" webpage, and Jsoup usually handles broken HTML very well.)
The hyphen ("-") in the class name is messing with Jsoup. (This seems, to me, to be the main (or at least one) cause of the issue.)
Something else... (Very probably!)
BUT... at http://try.jsoup.org if I fetch the URL: http://www.jornaldamarinha.pt using this CSS Query:
*[class*=zincontent-wrap]
Everything works just great, there! (Retrieves all the ~8 correct results!)
SO... to sum up, all I need is to do exactly what that webpage does, but from code.
THANKS, in advance, for any light or workaround, about this! :)

SOLUTION!... After all, everything in the above code was working correctly, as I suspected, except... that CSS query breaks with Android's default user agent. I just figured out that setting the userAgent on Jsoup's connection is VERY important! So I've edited my code in the following way, and... it works like a charm now!! (Exactly the same results as on the http://try.jsoup.org webpage.)
Document doc = Jsoup.connect("http://www.jornaldamarinha.pt")
        .userAgent("Mozilla/5.0 Gecko/20100101 Firefox/21.0")
        .timeout(0)
        .get();
Hope this helps anyone else too! :)

Related

HtmlUnit and HTTPS pages

I'm trying to make a program that checks available positions and books the first available one. I started writing it and I ran into a problem pretty early.
The problem is that when I try to connect to the site (which is HTTPS) the program doesn't do anything. It doesn't throw an error, it doesn't crash. And the weirdest thing is that it works with some HTTPS websites and with some it doesn't.
I've spent countless hours trying to resolve this problem. I tried using HtmlUnitDriver and it still doesn't work. Please help.
WebClient webc = new WebClient(BrowserVersion.CHROME);
webc.getCookieManager().setCookiesEnabled(true);
HtmlPage loginpage = webc.getPage(loginurl);
System.out.println(loginpage.getTitleText());
I'm getting really frustrated with this. Thank you in advance.
As far as I can see, this has nothing to do with HTTPS. It is a good idea to do some traffic analysis using Charles or Fiddler.
What you can see....
The page returned from the server as the response to your first call to https://online.enel.pl/ loads some external JavaScript. And then the story begins:
This JS looks like
(function() {
    var z = "";
    var b = "766172205f3078666.....";
    eval((function() {
        for (var i = 0; i < b.length; i += 2) {
            z += String.fromCharCode(parseInt(b.substring(i, i + 2), 16));
        }
        return z;
    })());
})();
As you can see, someone likes to hide the real JavaScript that gets processed.
The next step is to check the JavaScript after this simple decoding.
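If it helps, the same decoding can be reproduced offline with a few lines of Java. This is only a sketch, and the hex string below is just the short sample from above (paste the full "b" value from the page to get the real script):
// Sketch: reproduce the hex-to-text decoding the inline script performs,
// so the hidden JavaScript can be inspected offline.
public class HexDecode {
    public static void main(String[] args) {
        String b = "766172205f307866"; // truncated sample; decodes to "var _0xf"
        StringBuilder z = new StringBuilder();
        for (int i = 0; i + 2 <= b.length(); i += 2) {
            z.append((char) Integer.parseInt(b.substring(i, i + 2), 16));
        }
        System.out.println(z); // prints the decoded (still obfuscated) JavaScript
    }
}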
The decoded script is really huge and looks like this:
var _0xfbfd = ['\x77\x71\x30\x6b\x77 ....
(function (_0x2ea96d, _0x460da4) {
    var _0x1da805 = function (_0x55e996) {
        while (--_0x55e996) {
            _0x2ea96d['\x70\x75\x73\x68'](_0x2ea96d['\x73\x68\x69\x66\x74']());
        }
    };
    .....
Ok, now we have obfuscated JavaScript. If you like, you can start with http://ddecode.com/hexdecoder/ to get some more readable text, but this was the step where I stopped my analysis. It looks like this script does some really bad things, or someone still believes in security by obscurity.
If you run this with HtmlUnit, this code gets interpreted - yes, the decoding works and the code runs. Sadly, this code runs endlessly (maybe because of an error or some incompatibility with real browsers).
If you would like to get this working, you have to figure out where the error is and open a bug report for HtmlUnit. For this you can simply start with a small local HTML file and include the code from the first external JavaScript. Then add some log statements to get the decoded version. Then replace this with the decoded version and try to understand what is going on. You can start adding alert statements and check whether the code in HtmlUnit follows the same path as real browsers do. Sorry, but my time is too limited to do all this work, but I would really like to help/fix this if you can point to a specific function in HtmlUnit that works differently from real browsers.
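A rough sketch of that debugging setup might look like the following; the file name debug.html is hypothetical, CollectingAlertHandler is just one way to capture alert() output, and the exact option setters vary a bit between HtmlUnit versions:
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.CollectingAlertHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.File;

public class LocalScriptDebug {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient(BrowserVersion.CHROME);
        // Don't abort on script errors; we want to see how far the script gets.
        client.getOptions().setThrowExceptionOnScriptError(false);
        // Collect alert() output so the page's script can "log" decoded text back to us.
        CollectingAlertHandler alerts = new CollectingAlertHandler();
        client.setAlertHandler(alerts);

        // Hypothetical local test page that includes the first external script.
        HtmlPage page = client.getPage(new File("debug.html").toURI().toURL());
        client.waitForBackgroundJavaScript(10000);

        for (String message : alerts.getCollectedAlerts()) {
            System.out.println(message);
        }
    }
}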
Without the URL that you are querying it is difficult to say what could be wrong. However, having worked with HtmlUnit some time back, I found that it was failing with many sites that I needed to get data from. The site owners will do many things to stop you using programs to access them, and you might have to resort to using some lower-level library like Apache HTTP components, where you have more control over what is going on under the hood.
Also check whether the website is constructed using JavaScript, which is getting more and more popular but is making it increasingly difficult to use programs to interrogate the content.

How to speed up page parsing in Selenium

What can I do if I load a page in Selenium and then have to make something like 100 different parsing requests against that page?
At the moment I use various driver.findElement(By...) calls, and the problem is that every one of them is an HTTP (GET/POST) request from Java to Selenium. Because of this, parsing one simple page costs me 30+ seconds (too much).
I think I should get the source code (driver.getPageSource()) from the first request and then parse that string locally (my page does not change while I parse it).
Can I build some kind of HTML object from this string to keep working with WebElement requests?
Do I have to use another lib to build the HTML object (for example, jsoup)? In that case I will have to rebuild my parsing requests from WebElements and XPath.
Anything else?
When you call findElement, there is no need for Selenium to parse the page to find the element. The parsing of the HTML happens when the page is loaded. Some further parsing may happen due to JavaScript modifications to the page (like when doing element.innerHTML += ...). What Selenium does is query the DOM with methods like .getElementsByClassName, .querySelector, etc. That being said, if your browser is loaded on a remote machine, things can slow down. Even locally, if you are doing a huge number of round-trips between your Selenium script and the browser, it can impact the script's speed quite a bit. What can you do?
What I prefer to do when I have a lot of queries to do on a page is to use .executeScript to do the work on the browser side. This can reduce dozens of queries to a single one. For instance:
List<WebElement> elements = (List<WebElement>) ((JavascriptExecutor) driver)
    .executeScript(
        "var elements = document.getElementsByClassName('foo');" +
        "return Array.prototype.filter.call(elements, function (el) {" +
        "    return el.attributes.whatever.value === 'something';" +
        "});");
(I've not run the code above. Watch out for typos!)
In this example, you'd get a list of all elements of class foo that have an attribute named whatever which has a value equal to something. (The Array.prototype.filter.call rigmarole is because .getElementsByClassName returns something that behaves like an Array but which is not an Array so it does not have a .filter method.)
Parsing locally is an option if you know that the page won't change as you examine it. You should get the page's source using something like:
String html = (String) ((JavascriptExecutor) driver).executeScript(
        "return document.documentElement.outerHTML");
By doing this, you see the page exactly the way the browser interpreted it. You will have to use something other than Selenium to parse the HTML.
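For instance, a minimal sketch of that local parsing with jsoup (which the question already mentions); it assumes the usual org.jsoup imports, and the .foo class and whatever attribute are just the placeholders from the earlier example:
// Grab the rendered HTML once, then run all further queries locally with jsoup
// instead of round-tripping through Selenium for each one.
String html = (String) ((JavascriptExecutor) driver)
        .executeScript("return document.documentElement.outerHTML");
Document doc = Jsoup.parse(html, driver.getCurrentUrl());

// Same placeholder query as the executeScript example above: elements with
// class "foo" whose attribute "whatever" equals "something".
for (Element el : doc.select(".foo[whatever=something]")) {
    System.out.println(el.text());
}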
Maybe try evaluating your elements only when you try to use them?
I don't know about the Java equivalent, but in C# you could do something similar to the following, which only looks for the element when it is used:
private static readonly By UsernameSelector = By.Name("username");

private IWebElement UsernameInputElement
{
    get { return Driver.FindElement(UsernameSelector); }
}
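For what it's worth, a rough Java sketch of the same lazy-lookup idea (not from the original answer; it assumes a driver field on the page object) could be:
// Resolve the element only at the moment it is actually used.
private static final By USERNAME_SELECTOR = By.name("username");

private WebElement getUsernameInputElement() {
    return driver.findElement(USERNAME_SELECTOR);
}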

Selenium/Java: how to verify this complex text on a page

I want to verify that the text below (HTML code), which has // characters etc., is present on the page, using Selenium/Java:
<div class="powatag" data-endpoint="https://api-sb2.powatag.com" data-key="b3JvYmlhbmNvdGVzdDErYXBpOjEyMzQ1Njc4" data-sku="519" data-lang="en_GB" data-type="bag" data-style="bg-act-left" data-colorscheme="light" data-redirect=""></div>
Appreciate any help on this
I believe you're looking for:
String textToVerify = "some html";
boolean bFoundText = driver.getPageSource().contains(textToVerify);
Assert.assertTrue(bFoundText);
Note, this checks the page source of the last loaded page, as detailed here in the javadoc. I've found this to also take longer to execute, especially when dealing with large source code. As such, this method is more prone to failure than validating the attributes and values; the answer from Breaks Software is what I utilize when possible, only with an XPath selector.
As Andreas commented, you probably want to verify individual attributes of the div element. Since you specifically mentioned the "//", I'm guessing that you are having trouble with the data-endpoint attribute. I'm assuming that your data-sku attribute will bring you to a unique element, so try something like this (not verified):
String endpoint = driver.findElement(
        new By.ByCssSelector("div[data-sku='519']")).getAttribute("data-endpoint");
assertEquals("https://api-sb2.powatag.com", endpoint);

HtmlUnit behaving oddly: I am not able to retrieve the correct number of anchor or link tags

I am using HtmlUnit 2.10. I am creating a small link validator for a website. For crawling I am using this. During my research I was trying to crawl loans.xxxxxxx.com. It has 58 anchor tags and 5 link tags.
I am writing code like this:
List<HtmlElement> elementsOfPage = (List<HtmlElement>) htmlPage.getElementsByTagName("link");
Iterator<HtmlElement> it = elementsOfPage.iterator();
System.out.println(elementsOfPage.size());
while (it.hasNext()) {
    HtmlElement htmlElement = it.next();
    System.out.println(htmlElement.toString());
}
I am also doing the same procedure for the anchor tag, i.e. a. For link it shows only 3, and for anchor it shows only 56, even though there are 5 and 58 respectively.
There are some portions of the page that are commented out. I thought the web client would ignore them, but if you actually print the results, some of them do come from the commented-out code.
Note: before running the WebClient, I disabled applets, CSS and JavaScript, and increased the timeout to 7 seconds.
Why is this odd behavior happening?
How did you get the numbers 58 and 5? I tried to check the URL you provided with HtmlUnit 2.10 + the JSoup parser. The code is (Groovy, but almost Java):
def client = new WebClient(BrowserVersion.FIREFOX_3_6)
client.setThrowExceptionOnScriptError(false);
def page = (HtmlPage)client.getPage("http://loans.bankofamerica.com/en/index.html")
def doc = Jsoup.parse(page.asXml())
println doc.select("a").size()
println doc.select("link").size()
The results are 56 and 2. But with the default user agent
def client = new WebClient()
the results are 56 and 3! It seems the server gives different markup depending on the user agent string (and maybe other headers).
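Since the question itself is in Java, roughly the same check in plain Java might look like this (a sketch; it assumes HtmlUnit and jsoup on the classpath, and the HtmlUnit 2.10 style setter used in the Groovy snippet above):
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CountTags {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient(); // default user agent; pass a BrowserVersion to compare
        client.setThrowExceptionOnScriptError(false);

        HtmlPage page = client.getPage("http://loans.bankofamerica.com/en/index.html");
        // Re-parse the DOM that HtmlUnit built, so jsoup selectors can be used for counting.
        Document doc = Jsoup.parse(page.asXml());

        System.out.println("a tags:    " + doc.select("a").size());
        System.out.println("link tags: " + doc.select("link").size());
    }
}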

JSoup not translating ampersand in links in HTML

In JSoup the following test case should pass, but it does not.
@Test
public void shouldPrintHrefCorrectly() {
    String content = "<li>Good<ul><li><a href=\"article.php?boid=1865&sid=53&mid=1\">" +
            "Boss</a></li><li><a href=\"article.php?boid=186&sid=53&mid=1\">" +
            "heavent</a></li><li><a href=\"article.php?boid=167&sid=53&mid=1\">" +
            "hellos</a></li><li><a href=\"article.php?boid=181&sid=53&mid=1\">" +
            "Mr.Jackson!</a></li>";
    Document document = Jsoup.parse(content, "http://www.google.co.in/");
    Elements links = document.select("a[href^=article]");
    Iterator<Element> iterator = links.iterator();
    List<String> urls = new ArrayList<String>();
    while (iterator.hasNext()) {
        urls.add(iterator.next().attr("href"));
    }
    Assert.assertTrue(urls.contains("article.php?boid=181&sid=53&mid=1"));
}
Could any of you please give me the reason as to why it is failing?
There are three problems:
You're asserting that a bovikatanid parameter is present, while it's actually called boid.
The HTML source is using & instead of &amp;. This is technically invalid.
Jsoup is parsing &mid as the | entity somehow. It should have scanned until the ;.
To fix #1, you have to do it yourself. To fix #2, you have to report this issue to the server admin in question (it's their fault; however, since the average browser is forgiving about this, I'd imagine that Google is doing this to save bandwidth). To fix #3, I've reported an issue to the Jsoup guy to see what he thinks about this.
Update: see, Jonathan (the Jsoup guy) has fixed it. It'll be there in the next release.
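For what it's worth, a small sketch of point #2: if the markup escapes the ampersands as &amp;, Jsoup decodes them back to a plain & and the original assertion would hold (same imports as the test above assumed):
// Same markup as in the test, but with the ampersands properly escaped as &amp;.
String escaped = "<li><a href=\"article.php?boid=181&amp;sid=53&amp;mid=1\">Mr.Jackson!</a></li>";
Document fixed = Jsoup.parse(escaped, "http://www.google.co.in/");
// Jsoup decodes the entities, so the attribute comes back with literal '&' characters.
String href = fixed.select("a[href^=article]").first().attr("href");
System.out.println(href); // article.php?boid=181&sid=53&mid=1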
