JSoup not translating ampersand in links in html

JSoup not translating ampersand in links in html - java

In JSoup the following test case should pass, it is not.
#Test
public void shouldPrintHrefCorrectly(){
String content= "<li>Good<ul><li><a href=\"article.php?boid=1865&sid=53&mid=1\">" +
"Boss</a></li><li><a href=\"article.php?boid=186&sid=53&mid=1\">" +
"heavent</a></li><li><a href=\"article.php?boid=167&sid=53&mid=1\">" +
"hellos</a></li><li><a href=\"article.php?boid=181&sid=53&mid=1\">" +
"Mr.Jackson!</a></li>";
Document document = Jsoup.parse(content, "http://www.google.co.in/");
Elements links = document.select("a[href^=article]");
Iterator<Element> iterator = links.iterator();
List<String> urls = new ArrayList<String>();
while(iterator.hasNext()){
urls.add(iterator.next().attr("href"));
}
Assert.assertTrue(urls.contains("article.php?boid=181&sid=53&mid=1"));
}
Could any of you please give me the reason as to why it is failing?

There are three problems:
You're asserting that there's a bovikatanid parameter is present, while it's actually called boid.
The HTML source is using & instead of & in the source. This is technically invalid.
Jsoup is parsing &mid as | somehow. It should have scanned until ;.
To fix #1, you have to do it yourself. To fix #2, you have to report this issue to the serveradmin in question (it's their fault, however, since the average browser is forgiving on this, I'd imagine that Google is doing this to save bandwidth). To fix #3, I've reported an issue to the Jsoup guy to see what he thinks about this.
Update: see, Jonathan (the Jsoup guy) has fixed it. It'll be there in the next release.

Related

How to speed up page parsing in Selenium

What can I do in case if I load the page in Selenium and then I have to do like 100 different parsing requests to this page?
At this moment I use different driver.findElement(By...) and the problem is that every time it is a http (get/post) request from java into selenium. From this case one simple page parsing costs me like 30+ seconds (too much).
I think that I must get source code (driver.getPageSource()) from first request and then parse this string locally (my page does not change while I parse it).
Can I build some kind of HTML object from this string to keep working with WebElement requests?
Do I have to use another lib to build HTML object? (for example - jsoup) In this case I will have to rebuild my parsing requests from webelement's and XPath.
Anything else?

When you call findElement, there is no need for Selenium to parse the page to find the element. The parsing of the HTML happens when the page is loaded. Some further parsing may happen due to JavaScript modifications to the page (like when doing element.innerHTML += ...). What Selenium does is query the DOM with methods like .getElementsByClassName, .querySelector, etc. This being said, if your browser is loaded on a remote machine, things can slow down. Even locally, if you are doing a huge amount of round-trip to between your Selenium script and the browser, it can impact the script's speed quite a bit. What can you do?
What I prefer to do when I have a lot of queries to do on a page is to use .executeScript to do the work on the browser side. This can reduce dozens of queries to a single one. For instance:
List<WebElement> elements = (List<WebElement>) ((JavascriptExecutor) driver)
.executeScript(
"var elements = document.getElementsByClassName('foo');" +
"return Array.prototype.filter.call(elements, function (el) {" +
" return el.attributes.whatever.value === 'something';" +
"});");
(I've not run the code above. Watch out for typos!)
In this example, you'd get a list of all elements of class foo that have an attribute named whatever which has a value equal to something. (The Array.prototype.filter.call rigmarole is because .getElementsByClassName returns something that behaves like an Array but which is not an Array so it does not have a .filter method.)
Parsing locally is an option if you know that the page won't change as you examine it. You should get the page's source by using something like:
String html = (String) ((JavascriptExecutor) driver).executeScript(
"return document.documentElement.outerHTML");
By doing this, you see the page exactly in the way the browser interpreted it. You will have to use something else than Selenium to parse the HTML.

Maybe try evaluating your elements only when you try to use them?
I dont know about the Java equivalent, but in C# you could do something similar to the following, which would only look for the element when it is used:
private static readonly By UsernameSelector = By.Name("username");
private IWebElement UsernameInputElement
{
get { return Driver.FindElement(UsernameSelector); }
}

Selenium/ Java how to verify the this complex text on page

I want to verify below text(HTML code) is present on page which as // characters , etc using selenium /jav
<div class="powatag" data-endpoint="https://api-sb2.powatag.com" data-key="b3JvYmlhbmNvdGVzdDErYXBpOjEyMzQ1Njc4" data-sku="519" data-lang="en_GB" data-type="bag" data-style="bg-act-left" data-colorscheme="light" data-redirect=""></div>
Appreciate any help on this

I believe you're looking for:
String textToVerify = "some html";
boolean bFoundText = driver.getPageSource.contains(textToVerify)
Assert.assertTrue(bFoundText);
Note, this checks the page source of the last loaded page as detailed here in the javadoc. I've found this to also take longer to execute, especially when dealing with large source codes. As such, this method is more prone to failure than validating the attributes and values and the answer from Breaks Software is what I utilize when possible, only with an xpath selector

As Andreas commented, you probably want to verify individual attributes of the div element. since you specifically mentioned the "//", I'm guessing that you are having trouble with the data-endpoint attribute. I'm assuming that your data-sku attribute will bring you to a unique element, so Try something like this (not verified):
String endpoint = driver.findElement(
new By.ByCssSelector("div[data-sku='519']")).getAttribute("data-endpoint");
assertTrue("https://api-sb2.powatag.com", endpoint);

Jsoup - CSS Query selector issue (?)

I´m with an odd issue here, I´ve been using Jsoup 1.7.2 for a while, with no issues, only now, when I try to retrieve the main headlines from this website: www.jornaldamarinha.pt, using this code:
// Connecting...
Document doc = Jsoup.connect("http://www.jornaldamarinha.pt")
.timeout(0)
.get();
// "*[class*=zincontent-wrap]" in "Jsoup idiom", means:
// Select all tags that contains classes with "zincontent-wrap" on its name.
Elements elems = doc.select("*[class*=zincontent-wrap]"); // Retrieves 0 results!
int t = elems.size();
Log.w("INFO", "Total Headlines: " + t);
// Loop trought all retrieved headlines:
for (Element e : elems) {
String headline = e.select("a").text().toString();
Log.w("HEADLINE", headline);
};
It fails!... Retrieves 0 results. (Should retrieve ~8)
The chances are that the issue is caused by:
Aliens... (Similar to androids, but uglier...)
Website encoding. (I tried to encode incoming HTML with ISO-8859-15, to handle portuguese special characters, but the issue remains)
Mal-formatted incoming HTML. (I doubt this could be the issue, since the selector works fine on "Try jsoup online webpage", and Jsoup usually handles broken HTML very well)
The use of the minus symbol in the class name ("-") is messing with Jsoup. (Seems, to me, to be the main (or at least, one) cause of the issue)
Something else... (Very probably!)
BUT... at http://try.jsoup.org if I fetch the URL: http://www.jornaldamarinha.pt using this CSS Query:
*[class*=zincontent-wrap]
Everything works just great, there! (Retrieves all the ~8 correct results!)
SO... to resume, all I need is to do exactly what that webpage does, but using code.
THANKS, in advance, for any light or workaround, about this! :)

SOLUTION!... After all, everything in the above code, was working correctly, as I suspected, except... That CSS Query breaks on Android´s "default user agent". I just figured that setting "userAgent" to Jsoup´s connection method is VERY important! So, I´ve edited my code on the following way, and... Works like a charm now !! (Exactly with same results, as in http://try.jsoup.org webpage)
Document doc = Jsoup.connect("http://www.jornaldamarinha.pt")
.userAgent("Mozilla/5.0 Gecko/20100101 Firefox/21.0")
.timeout(0)
.get();
Hope this helps anyone else too! :)

How do I create with Java a well formatted request to publish on Wordpress with XML-RPC?

I've found this guide on internet to publish on Wordpress using XML-RPC inside my Java Project, for example I want a message on my Blog, every time it's specified date.
http://wordpress.rintcius.nl/post/look-how-this-wordpress-post-got-created-from-java
Now, I've followed the guide and I'm trying to let it run but I don't understand yet how exactly parameters for my post works.
For example using the method blogger.NewPost I call:
public Integer post(String contents) throws XmlRpcException {
Object[] params = new Object[] {
blogInfo.getApiKey(),
blogInfo.getBlogId(),
blogInfo.getUserName(),
blogInfo.getPassword(),
contents,
postType.booleanValue()
};
return (Integer) client.execute(POST_METHOD_NAME, params);
}
and my "contents" value is:
[title]Look how this wordpress post got created from java![/title]"
+ "[category]6[/category]"
+ FileUtils.getContentsOfResource("rintcius/blog/post.txt");
(I'm using "[" instead of "<" and "]" instead of ">" that are processed by stackoverflow)
Now, how could I use all parameters in this XML way?
Parameters here: http://codex.wordpress.org/XML-RPC_MetaWeblog_API#metaWeblog.newPost
And, it's the content only a "string" without any tag?
Thanks a lot to all!

Still don't know why it gives me back errors but i think it's only a bit outdated.
Found this other libraries that works perfectly
http://code.google.com/p/wordpress-java/
I advice all to use this since the other one is outdated
Thanks all

Regex submitting with empty string

I have the following REGEX that I'm serving up to java via an xml file.
[a-zA-Z -\(\) \-]+
This regex is used to validate server side and client side (via javascript) and works pretty well at allowing only alphabetic content and a few other characters...
My problem is that it will also allow zero lenth strings / empty through.
Does anyone have a simple and yet elegant solution to this?
I already tried...
[a-zA-Z -\(\) \-]{1,}+
but that didn;t seem to work.
Cheers!
UPDATE FOLLOWING INVESTIGATION
It appears the code I provided does in fact work...
String inputStr = " ";
String pattern = "[a-zA-Z -\\(\\) \\-]+";
boolean patternMatched = java.util.regex.Pattern.matches(pattern, inputStr);
if ( patternMatched ){
out.println("Pattern MATCHED");
}else{
out.println("NOT MATCHED");
}
After looking at this more closely I think the problem may well be within the logic of some of my java bean coding... It appears the regex is dropped out at the point where the string parse should take place, thereby allowing empty strings to be submitted... And also any other string... EEJIT that I am...
Cheers for the help in peer reviewing my initial stupid though....!

Have you tried this:
[a-zA-Z -\(\) \-]+

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

JSoup not translating ampersand in links in html - java

Related

How to speed up page parsing in Selenium

Selenium/ Java how to verify the this complex text on page

Jsoup - CSS Query selector issue (?)

How do I create with Java a well formatted request to publish on Wordpress with XML-RPC?

Regex submitting with empty string

Categories

Resources