HtmlUnit - Convert an HtmlPage into HTML string? - java

I'm using HtmlUnit to generate the HTML for various pages, but right now, the best I can do to get the page into the raw HTML that the server returns is to convert the HtmlPage into an XML string.
This is somewhat annoying because the XML output is rendered by web browsers differently than the raw HTML would. Is there a way to convert an HtmlPage into raw HTML instead of XML?
Thanks!

page.asXml() will return the HTML. page.asText() returns it rendered down to just text.

I'm not 100% certain I understood the question correctly, but maybe this will address your issue:
page.getWebResponse().getContentAsString()

I think there is no direct way to get the final page as HTML.
asXml() returns the result as XML, asText() returns the extracted text content.
The best you can do is to use asXml() and "transform" it to HTML:
htmlPage.asXml().replaceFirst("<\\?xml version=\"1.0\" encoding=\"(.+)\"\\?>", "<!DOCTYPE html>")
(Of course you can apply more transformations like converting <br/> to <br> - it depends on your requirements.)
Even the related Google documentation recommends this approach (although they don't apply any transformations):
// return the snapshot
out.println(page.asXml());
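The replaceFirst idea above can be extended into a small helper. This is only a sketch using plain string handling. The class name XmlToHtmlSketch, the method toHtml, and the short list of void elements are assumptions made for illustration and are not part of HtmlUnit. The result is still not guaranteed to match what the server originally sent.

```java
import java.util.regex.Pattern;

public class XmlToHtmlSketch {

    // Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
    private static final Pattern XML_DECLARATION =
            Pattern.compile("^\\s*<\\?xml[^>]*\\?>\\s*");

    // Self-closed void elements; this list is an illustrative subset, not exhaustive.
    private static final Pattern SELF_CLOSED_VOID =
            Pattern.compile("<(br|hr|img|input|meta|link)\\b([^>]*?)\\s*/>");

    // Swaps the XML declaration for an HTML5 doctype and rewrites <br/> as <br>, etc.
    // Input without an XML declaration is left without a doctype.
    public static String toHtml(String xml) {
        String withDoctype = XML_DECLARATION.matcher(xml).replaceFirst("<!DOCTYPE html>\n");
        return SELF_CLOSED_VOID.matcher(withDoctype).replaceAll("<$1$2>");
    }

    public static void main(String[] args) {
        // Stand-in for the string that htmlPage.asXml() would return.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<html><body>a<br/>b<img src=\"x.png\" /></body></html>";
        System.out.println(toHtml(xml));
    }
}
```

Regex-based rewriting like this is fragile for anything beyond simple markup, so it is best treated as a starting point for the "more transformations" mentioned above.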

I don't know of an answer short of switching on the Page type; for XmlPage and SgmlPage one must take the innerHTML of the HTML element and manually write out the attributes. It's neither elegant nor exact (it's missing the doctype), but it works.
Page.getWebResponse().getContentAsString()
This is incorrect, as it returns the text form of the original, unrendered bytes, with no JavaScript applied. If JavaScript executes and changes the page, this method will not see the changes.
page.asXml() will return the HTML. page.asText() returns it rendered down to just text.
Just to confirm: this only returns the text within text nodes and does not include the tags and their attributes. If you want the complete HTML, this is not good enough.

Maybe you want to go with something like this, instead of using the HtmlUnit framework's methods:
try (InputStreamReader isr = new InputStreamReader(url.openConnection().getInputStream());
    BufferedReader br = new BufferedReader(isr)) {
    // Accumulate the lines in a StringBuilder rather than concatenating Strings in a loop
    StringBuilder htmlSource = new StringBuilder();
    String line;
    while ((line = br.readLine()) != null) {
        htmlSource.append(line).append("\n");
    }
    return htmlSource.toString();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

Here is my solution that works for me:
ScriptResult scriptResult = htmlPage.executeJavaScript("document.documentElement.outerHTML;");
System.out.println(scriptResult.getJavaScriptResult().toString());

Related

Unable to retrieve table elements using jsoup

I'm new to using jsoup and I am struggling to retrieve the tables with class name verbtense and the headers Present and Past, under the div named Indicative, from this site: https://www.verbix.com/webverbix/Swedish/misslyckas
I have started off trying to do the following, but there are no results from the get go:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements tables = document.select("table[class=verbtense]"); // empty
I also tried this, but again no results:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements divs = document.select("div");
if (!divs.isEmpty()) {
for (Element div : divs) {
// all of these are empty
Elements verbTenses = div.getElementsByClass("verbtense");
Elements verbTables = div.getElementsByClass("verbtable");
Elements tables = div.getElementsByClass("table verbtable");
}
}
What am I doing incorrectly?
The page you are trying to scrape has content that is generated dynamically on the client side (with JavaScript), therefore you won't be able to extract the data using that link.
You might be able to scrape some content from the API call that this web page is making, e.g. https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
Inspect the browser console to see what the page is doing, and do the same.
The first catch is that this page loads its content asynchronously using AJAX and uses JavaScript to add the content to the DOM. You can even see the loader for a short time.
Jsoup can't parse and execute JavaScript so all you get is the initial page :(
The next step would be to check what the browser is doing and what is the source of this additional content. You can check it using Chrome's debugger (Ctrl + Shift + i). If you open Network tab, select only XHR communication and refresh the page you can see two requests:
One of them fetches this content: https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
As you can see, it's JSON with HTML fragments, and this content seems to have the verb forms you need. But here's another catch: unfortunately, Jsoup can't parse JSON :( So you'll have to use another library to get the HTML fragment, and then you can parse it using Jsoup.
General advice to download JSON is to ignore content type (Jsoup will complain it doesn't support JSON):
String json = Jsoup.connect("https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas").ignoreContentType(true).execute().body();
Then you'll have to use a JSON parsing library, for example json-simple, to obtain the HTML fragment, which you can then parse with Jsoup:
String json = Jsoup.connect(
"https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas")
.ignoreContentType(true).execute().body();
System.out.println(json);
JSONObject jsonObject = (JSONObject) JSONValue.parse(json);
String htmlFragmentObtainedFromJson = (String) ((JSONObject) jsonObject.get("p1")).get("html");
Document document = Jsoup.parse(htmlFragmentObtainedFromJson);
System.out.println(document);
Now you can try your initial approach with using selectors to get what you want from document object.

Html parser to replace hyperlinks in html document by preserving original html tags and line breaks

I am using the Jsoup HTML parser to replace hyperlinks in an HTML document. I want the original case, elements, and line breaks to stay as they are after the document is updated. However, Jsoup changes the case to lowercase, changes a few elements, and removes the line breaks. I have also tried ParseSettings, but with those settings doc.select("a[href]") does not return the elements. Below is the code I am using.
Can someone point me to the right HTML parser for Java to replace hyperlinks while keeping the rest of the HTML document as is?
File input = new File(fileEntry.getPath());
Parser parser = Parser.htmlParser();
parser.settings(new ParseSettings(true, true));
Document doc = parser.parseInput(input.toString(), "UTF-8");
Elements anchorLinks = doc.select("a[href]");
The documentation is your friend… even when there is no description in that documentation.
Notice the first argument is named html and the second argument is named baseUri.
The first argument needs to be actual HTML content, not a filename. Your code is trying to parse a filename as if it’s HTML.
The second argument needs to be a URI, or an empty string. "UTF-8" is not a valid URI at all, though since you aren’t trying to resolve the links, it may not be a critical mistake.
You probably want the Jsoup.parse method which takes both an InputStream and a customized Parser:
Document doc;
try (InputStream content = new BufferedInputStream(
new FileInputStream(input))) {
doc = Jsoup.parse(content, null, "", parser);
}
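To make the filename-versus-content point concrete, here is a minimal standard-library sketch (Java 11+ for Files.readString). The class FileContentSketch and both helper methods are hypothetical names used only for this illustration. It does not use Jsoup at all; it only shows that input.toString() yields a path while the markup has to be read from the file.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class FileContentSketch {

    // Hypothetical helper: writes sample markup to a temporary file for the demo.
    public static File writeTempHtml(String html) {
        try {
            File file = File.createTempFile("sample", ".html");
            file.deleteOnExit();
            Files.writeString(file.toPath(), html, StandardCharsets.UTF_8);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: returns the file's text, which is what a parser
    // expecting HTML content needs (as opposed to the file's path).
    public static String readHtml(File input) {
        try {
            return Files.readString(input.toPath(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File input = writeTempHtml("<A HREF=\"old.html\">Link</A>\n");
        System.out.println("input.toString(): " + input);           // a path, not markup
        System.out.println("file content:     " + readHtml(input)); // the actual HTML
    }
}
```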

How to validate that at least one element in a html string has content?

I have a wysiwyg editor that I can't modify that sometimes returns <p></p> which obviously looks like an empty field to the person using the wysiwyg.
So I need to add some validation on my back-end which uses java.
should be rejected
<p></p>
<p> </p>
<div><p> </p></div>
should be accepted
<p>a</p>
<div><p>a</p></div>
<p> </p>
<div><p>a</p></div>
basically as long as any element contains some content we will accept it and save it.
I am looking for libraries that I should look at and ideas for how to approach it. Thanks.
You may want to look at the jsoup library. It's pretty fast.
It takes HTML and you may return text from it (see example from their website below).
Extract attributes, text, and HTML from elements
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link"
I would advise you to do it on the client side, because it is natural for the browser to do this. You need to hook into the send or "save" step of your wysiwyg editor; a lot of them have this ability.
The JavaScript would be:
function stripIfEmpty(html)
{
var tmp = document.createElement("DIV");
tmp.innerHTML = html;
var contentText = tmp.textContent || tmp.innerText || "";
if(contentText.trim().length === 0){
return "";
}else{
return html;
}
}
If you need to do it on the backend, then the only correct solution is to use a library that parses HTML, like jsoup, as @Dmytro Pastovenskyi showed.
If you want to do it on the backend and can accept a fuzzy, non-strict check, you can use a regex such as replaceAll("\\<[^>]*>",""), then trim, then check whether the string is empty.
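The fuzzy strip-tags-then-trim check described above could look roughly like this in Java. It is a sketch, not a robust validator: the class and method names are made up for illustration, and treating &nbsp; and the non-breaking space character as blank is an added assumption that the answer does not mention.

```java
public class HtmlContentCheck {

    // Returns true if anything other than whitespace remains once tag-like text is removed.
    public static boolean hasContent(String html) {
        String text = html
                .replaceAll("<[^>]*>", "")   // drop anything that looks like a tag
                .replace("&nbsp;", " ")      // assumption: an &nbsp; entity counts as blank
                .replace('\u00A0', ' ');     // assumption: so does a literal non-breaking space
        return !text.trim().isEmpty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<p></p>",
            "<div><p> </p></div>",
            "<p>a</p>",
            "<p> </p><div><p>a</p></div>"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + hasContent(sample));
        }
    }
}
```

Because it never builds a DOM, this check can be fooled by comments, scripts, or a stray < in text, which is why the parser-based approach is the stricter option.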
You can use regular expressions (built-in to Java).
For example,
"<p>\\s*\\w+\\s*</p>"
would match a <p> element whose content is a single run of word characters (letters, digits, or underscores), optionally surrounded by whitespace.
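A quick way to see what this pattern does and does not accept is to run it with java.util.regex. The class name and the sample inputs below are illustrative assumptions. Note that the pattern only matches a bare <p> containing a single word, so multi-word paragraphs and paragraphs with attributes slip through.

```java
import java.util.regex.Pattern;

public class ParagraphRegexSketch {

    // The pattern from the answer: <p>, optional whitespace, one run of word characters, </p>
    private static final Pattern NON_EMPTY_P = Pattern.compile("<p>\\s*\\w+\\s*</p>");

    // True if the pattern is found anywhere in the input.
    public static boolean matches(String html) {
        return NON_EMPTY_P.matcher(html).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<p>a</p>",
            "<div><p>a</p></div>",
            "<p> </p>",
            "<p>hello world</p>",
            "<p class=\"x\">a</p>"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + matches(sample));
        }
    }
}
```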

Replace text on page depending on page URL

We are already replacing text with images on the website but have run into a little problem due to the platform we're running on - which is proprietary and provides limited access.
Our goal is to replace the price with an image, ONLY for this specific brand and all items within it.
It seems we need to form some sort of expression that looks at the current URL and, if it matches, replaces the text with the image.
Is this valid thinking and if so how do I go about doing this?
Here is a link to a sample product that is within the brand 'KW Suspension';
Yeah that shouldn't be too hard,
<script>
$(document).ready(function(){
if ( /.*\/kw_suspension\/.*/i.test(location.href) ) {
$(".yourprice").html("<img src='myimg.png' />");
}
});
</script>
you could also change it to check using a regexp, you just change what is within the () to fit your criteria.
EDIT
Added surrounding code and changed to regexp as suggested by OP.
You have access to location.href, which returns the current window's URL as a string, and you can use a regex match to see whether the brand is in the URL. You can then replace the pricing span:
var matcher = /kw_suspension/;
if (matcher.test(location.href)) {
    $('#ctl00_MainContentPlaceHolder_YourPriceLabel').replaceWith('better html here');
}
The above will just simply see if kw_suspension is in the url and then it replaces the span with the price with something else.
You can use indexOf to see if the url contains your keyword.
$(document).ready(function(){
var urlString = location.href; //get URL string
if( urlString.indexOf("kw_suspension") != -1){
$('div.yourprice').empty().html('<img src="/path/to/image.jpg" />');
}
});

How to serve a file with JSP?

This may sound totally stupid, but is a case of real life :(
I'm able to display a HTML table with a "virtual" link name.
Something like this:
Xyz description document.doc
Xyz description documentB.doc
Xyz description documentC.doc
This doc id represents an id in the database ( for these docs are stored in a blob as byte[] )
Anyway. I'm able to get that id, query the database and retrieve the byte[] ( and even store it in a tmp file )
What I can't figure out is how to "serve" the byte[] to the user when the user clicks on the link ( and after I perform the db retrieval ).
Now the very worst part, and what makes me ask this question here is, I need to do this with JSP only ( no servlet ) and without 3rd party libraries ( just... don't ask me why I hate it too )
So, how do I serve in a JSP the binary content of a byte array stored in the server file system?
My first guess is:
<%
InputStream read // read the file from the file system
response.getOutputStream().write( theBytesReader );
%>
Am I close to the solution?
Would this look to the client as if they had really clicked a link to a real file on the server?
Thanks in advance.
To the point, just write the same code in JSP as you would in a Servlet class. You can practically copy-paste it. Only ensure that you are not writing any template text to the stream; this includes line breaks and whitespace outside the scriptlets. Otherwise it would get written to the binary file as well and corrupt it.
If you have multiple scriptlet blocks, then you need to arrange them so that there's no linebreak between the ending %> of a scriptlet and the starting <% of the next scriptlet. Thus, e.g.
<%@page import="java.io.InputStream" %><%
//...
%>
instead of
<%@page import="java.io.InputStream" %>
<%
//...
%>
You need to set the MIME type in the HTTP response like below in addition to the sample code you provided.
response.setContentType("application/octet-stream");
Note, the application/octet-stream MIME type is used to indicate a binary file.
Please, please, please don't do this.
You're doing a disservice to your users.
HTTP is amazingly rich in terms of what it can do with files. Caching, chunking, random access, etc.
Take a look at something like FileServlet, and hammer that to fit. Yes, it's a Servlet, rather than a JSP, but this is what you want to do to be a good HTTP citizen.
Some containers have other options you can use; for example, you can hack Tomcat's DefaultServlet.
Something like this...
InputStream instr = null;
try {
instr = new BufferedInputStream( new FileInputStream("file.txt") );
for(int x=instr.read(); x!=-1; x=instr.read()){
out.write(x);
}
} finally {
out.close();
if( instr != null) instr.close();
}
You will need this as the response to the click (either on a page reload or in another jsp file).
There are better-buffered solutions that write whole byte arrays rather than one byte at a time... I will leave that for you.
Sorry you are stuck in JSP scriptlet land...Hope this helps.
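The byte-array buffering that the answer leaves as an exercise might be sketched as follows. The class name, the 8192-byte buffer size, and the choice to wrap IOException in UncheckedIOException are all assumptions made for this standalone example, which uses in-memory streams so that it runs outside a container.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamCopySketch {

    private static final int BUFFER_SIZE = 8192; // arbitrary size chosen for the sketch

    // Copies everything from in to out in chunks and returns the number of bytes copied.
    // IOException is wrapped so the sketch can be called without a throws clause.
    public static long copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read); // write only the bytes actually read
                total += read;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = "hello, world".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink);
        System.out.println(copied + " bytes copied");
    }
}
```

In a JSP the sink would presumably be response.getOutputStream() rather than a ByteArrayOutputStream, with the content type set before the first byte is written.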
