Edit HTML Document with Java

I have an HTML document stored in memory (set on a Flying Saucer XHTMLPanel) in my Java application:
xhtmlPanel.setDocument(Main.class.getResource("/mailtemplate/DefaultMail.html").toString());
The HTML file is below:
<html>
<head>
</head>
<body>
<p id="first"></p>
<p id="second"></p>
</body>
</html>
I want to set the contents of the p elements. I don't want to define a schema just so that I can use getElementById(), so what alternatives do I have?

XHTML is XML, so any XML parser would be my recommendation. I maintain the JDOM library, so I would naturally recommend using that, but other libraries, including the DOM model built into Java, will work. I would use something like:
Document doc = new SAXBuilder().build(Main.class.getResource("/mailtemplate/DefaultMail.html"));
// XPath that finds the `p` element with id="first"
XPathExpression<Element> xpe = XPathFactory.instance().compile(
        "//p[@id='first']", Filters.element());
Element p = xpe.evaluateFirst(doc);
p.setText("This is my text");
XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
xout.output(doc, System.out);
Produces the following:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head />
<body>
<p id="first">This is my text</p>
<p id="second" />
</body>
</html>
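The same lookup can be done with the DOM and XPath classes that ship with the JDK, which is the "embedded DOM model" route mentioned above. The following is only a minimal sketch under a few assumptions: the template is well-formed XHTML, the id contains no quote characters, and the class and method names (ParagraphEditor, setParagraphText) are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Flying Saucer or JDOM.
final class ParagraphEditor {

    /** Sets the text of the p element with the given id and returns the serialized markup. */
    static String setParagraphText(String xhtml, String id, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // An XPath attribute test works without a DTD or schema.
            // Assumption: id contains no single quote.
            Element p = (Element) XPathFactory.newInstance().newXPath()
                    .evaluate("//p[@id='" + id + "']", doc, XPathConstants.NODE);
            if (p == null) {
                throw new IllegalArgumentException("No p element with id " + id);
            }
            p.setTextContent(text);

            // Serialize as XML so the output stays well-formed XHTML.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not edit the document", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head></head><body>"
                + "<p id=\"first\"></p><p id=\"second\"></p></body></html>";
        System.out.println(setParagraphText(xhtml, "first", "This is my text"));
    }
}
```

The modified org.w3c.dom.Document could also be handed to the panel directly instead of being serialized, if the panel accepts a DOM document; that part is not shown here.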

Use a fine-grained HTML parsing and manipulation library like jsoup. You can easily create a Document by passing the HTML to Jsoup.parse(String html). The library supports the usual DOM manipulation functions, including CSS or jQuery-like selector syntax through doc.select(String selector), where doc is an instance of Document.
For example, you can select the first p using doc.select("p").first(). A minimal working solution would be:
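import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// htmlContent holds the HTML as a String
Document doc = Jsoup.parse(htmlContent);
// first() returns null if no p element matches
Element p = doc.select("p").first();
p.text("My Example Text");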
Reference:
Use selector-syntax to find elements

Related

Java sax parsing replacing a custom tag with the resolved value

I have an XML string which is actually HTML. It contains a few custom tags that should be read and replaced with their actual values. I am unable to figure out how to do this using SAX parsing.
<html>
<body>
<p>The joiner report for today</p>
<p><APP:FT value="THIS_WEEKDAY"/></p>
<p> </p>
</body>
</html>
This template would be processed using SAX parsing and Java code, where the value of the custom tag
<APP:FT>
would be evaluated using java code. For example
<APP:FT value="THIS_WEEKDAY"/>
should be replaced by TUESDAY considering today is 13-Dec-2016. It is easy to find the value, but I am unable to figure out a way to replace this in the HTML string. The final HTML should look like
<html>
<body>
<p>The joiner report for today</p>
<p>TUESDAY</p>
<p> </p>
</body>
</html>
Thank you, folks, for reading through. I solved the problem not with XML parsing but by using the FreeMarker template API - http://freemarker.org/
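For readers who would rather stay with XML tooling, here is a rough sketch of the replacement using the JDK's DOM parser instead of SAX or FreeMarker. It is not the approach the author ended up with. The class name CustomTagResolver and the map of resolved values are hypothetical; the sketch assumes the template is well-formed and relies on the default (non-namespace-aware) parser accepting the undeclared APP: prefix as part of the element name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: swaps every <APP:FT value="KEY"/> for a text node.
final class CustomTagResolver {

    static String resolve(String template, Map<String, String> values) {
        try {
            // The default factory is not namespace-aware, so "APP:FT" is
            // treated as a plain element name.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(template)));

            // getElementsByTagName returns a live list: it shrinks as tags
            // are replaced, so always take the first remaining item.
            NodeList tags = doc.getElementsByTagName("APP:FT");
            while (tags.getLength() > 0) {
                Element tag = (Element) tags.item(0);
                String resolved = values.getOrDefault(tag.getAttribute("value"), "");
                tag.getParentNode().replaceChild(doc.createTextNode(resolved), tag);
            }

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not resolve custom tags", e);
        }
    }

    public static void main(String[] args) {
        String template = "<html>\n<body>\n<p>The joiner report for today</p>\n"
                + "<p><APP:FT value=\"THIS_WEEKDAY\"/></p>\n</body>\n</html>";
        // In a real application the value would be computed, not hard-coded.
        System.out.println(resolve(template,
                Collections.singletonMap("THIS_WEEKDAY", "TUESDAY")));
    }
}
```

Unknown keys are replaced with an empty string here; throwing an exception instead may be the safer choice for a mail template.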

Convert XML document render to hard-code as HTML

I have a requirement to publish an HTML file from an XML file, where the HTML file shows hard-coded values as they were in the XML file at a specific point in time (i.e. independent of changes made to the XML after the HTML document is created).
Example: XML File
<dvd>
<name>Titanic</name>
<price>10</price>
</dvd>
<dvd>
<name>Avatar</name>
<price>12</price>
</dvd>
Now I need to convert these into an HTML document in which the values are hard-coded.
Example HTML File
<html>
<body>
<h1>DVD List</h1>
<table>
<tr ...>
<th>Name</th><th>Price</th>
<td>Titanic</td><td>10</td>
<td>Avatar</td><td>12</td>
I have tried using XSLT; however, this only provides a rendering of the XML document that updates whenever the XML changes. I need a point-in-time HTML document containing the values as they were in the XML.
Perhaps there is an easy way to do this with existing technologies, or with some simple custom Java code?
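One possible approach, sketched below, is to run the XSLT once from Java with javax.xml.transform and save the resulting text. The output is then an ordinary HTML string with no link back to the XML. This is an illustration under assumptions, not a confirmed answer to the question: the wrapper element dvds is invented (the sample XML needs a single root to be well-formed), and the class name DvdSnapshot and the stylesheet are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: renders the DVD list to a static HTML string.
final class DvdSnapshot {

    // Assumed stylesheet; expects a <dvds> root wrapping the <dvd> elements.
    private static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/dvds\">"
        + "<html><body><h1>DVD List</h1><table>"
        + "<tr><th>Name</th><th>Price</th></tr>"
        + "<xsl:for-each select=\"dvd\">"
        + "<tr><td><xsl:value-of select=\"name\"/></td>"
        + "<td><xsl:value-of select=\"price\"/></td></tr>"
        + "</xsl:for-each>"
        + "</table></body></html>"
        + "</xsl:template></xsl:stylesheet>";

    /** Applies the stylesheet once; the returned HTML contains literal values. */
    static String render(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dvds>"
                + "<dvd><name>Titanic</name><price>10</price></dvd>"
                + "<dvd><name>Avatar</name><price>12</price></dvd>"
                + "</dvds>";
        // Writing this string to a file yields the point-in-time snapshot.
        System.out.println(render(xml));
    }
}
```

If the earlier XSLT attempt was applied by the browser each time the XML was opened, that could explain why the rendering kept following the XML; transforming ahead of time avoids this.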

How to implement a selector for HTML DOM elements by class name using regexp

I have a question. Suppose I have the following HTML file.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title> New Document </title>
<meta name="Generator" content="EditPlus">
<meta name="Author" content="">
<meta name="Keywords" content="">
<meta name="Description" content="">
</head>
<body>
<h1>Welcome to My Homepage</h1>
<p class="intro">My name is Donald.</p>
<h1 class="intro"><p class="important">Note that this is an important paragraph.</p>
</h1>
<div class="intro important"><p class="apple">I live in apple.</p></div>
<div class="intro important">I like apple.</p></div>
<p>I live in Duckburg.</p>
</body>
</html>
I want to get HTML elements by class name.
If the selector is ".intro", it should return:
My name is Donald.
<p class="important">Note that this is an important paragraph.</p>
If the selector is ".intro.important", it should return:
Note that this is an important paragraph.
If the selector is ".intro.important>.apple", it should return:
I live in apple.
I know jQuery has a class selector for this, but I want to implement the function myself.
Can I use Java regexp to do this? It seems fine when the selector is a single class name, but it gets hard once the selector includes a child class name.
One more question: can Java get the DOM structure of the HTML?
You can't parse [X]HTML with regex
It's that simple: regular expressions were not built to cover the full grammar of XML, and different jobs need different tools.
CSS selectors are not readily available
Unfortunately, CSS selector parsers are not yet (afaik) part of DOM parsers, so you would need to use an XPath parser to achieve the same things as with CSS selectors.
There are, however, some projects such as jquery4j.org which port jQuery (plus widgets) to Java. They don't bring just CSS selectors to the table; they bring a lot more, and I'm not sure you really need all that.
XPath selectors as an alternative to CSS selectors
A DOM parser plus an XPath parser for Java is the best approach. The DOM parser reads and loads the HTML structure as DOM objects, while the XPath parser uses its own, different type of selectors to find objects within the DOM.
But be careful: don't feed the DOM parser huge amounts of HTML (entire pages) unless you really need it to sift through it all. If you have a smaller string that isolates the area of the HTML where your information sits, it's better to use DOM with that, because DOM parsers are memory-hungry beasts.
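To make the XPath route concrete, the sketch below translates the class selectors from the question into XPath 1.0 using the JDK's built-in parser. It assumes well-formed input (the page in the question is HTML 4 with unclosed meta tags, so it would need tidying first), and the names ClassSelector, hasClass and selectText are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper mapping CSS class tests to XPath 1.0 predicates.
final class ClassSelector {

    /**
     * XPath 1.0 test for one whitespace-separated class token. Padding with
     * spaces stops "intro" from matching a class such as "introduction".
     */
    static String hasClass(String name) {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + name + " ')";
    }

    /** Returns the text content of every node matched by the XPath. */
    static List<String> selectText(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Selection failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body>"
                + "<p class=\"intro\">My name is Donald.</p>"
                + "<div class=\"intro important\"><p class=\"apple\">I live in apple.</p></div>"
                + "<p>I live in Duckburg.</p>"
                + "</body>";
        // CSS ".intro"
        System.out.println(selectText(xml, "//*[" + hasClass("intro") + "]"));
        // CSS ".intro.important > .apple"
        System.out.println(selectText(xml, "//*[" + hasClass("intro") + "]["
                + hasClass("important") + "]/*[" + hasClass("apple") + "]"));
    }
}
```

Stacked predicates correspond to compound class selectors, and a single / step corresponds to the CSS child combinator; a descendant combinator would use // instead.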
Can I use Java regexp to do this?
You can create a regex that selects the nested content of a tag with a specific class name.
I can give you a regex that finds the content within a tag, but it does not take the class name into account:
<([a-z][a-z0-9]*+)[^>]*>.*?</\\1>
But if the class name has a child class name, it will make it hard.
In that case it is easier to work with Java string operations.
Can Java get the DOM structure of the HTML?
Yes, it can be done with jsoup (jsoup.org).

Extract HTML from xml

I want to extract an HTML page from an XML file. Any ideas, please?
<?xml ....>
<first>
</first>
<second>
</second>
<xhtml>
<html>
.....some html code here
</html>
</xhtml>
I want to extract the HTML page as it is from the above.
Because XML and HTML markup are similar, an XML parser might have issues with the embedded HTML. I would suggest that when you save the HTML data in the XML file, you encode it to prevent the XML parser from having issues. Then, when you read the data back from the XML, you just need to decode it for use.
<?xml ....?>
<first></first>
<second></second>
<markup>
&lt;html&gt;
code here
&lt;/html&gt;
</markup>
When you decode the markup section, it will look like this:
<html>
code here
</html>
You might find this of some use:
http://www.w3schools.com/xml/xml_parser.asp
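As a rough Java illustration of the encode and decode steps described above, the JDK's DOM classes handle the escaping themselves: text set through setTextContent is escaped on serialization and comes back unescaped from getTextContent. The class name MarkupEmbedding and the element name markup are only for this sketch.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: stores HTML as escaped text inside an XML element.
final class MarkupEmbedding {

    /** Wraps the HTML as the text of a markup element; the serializer escapes it. */
    static String embed(String html) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element markup = doc.createElement("markup");
            markup.setTextContent(html);
            doc.appendChild(markup);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not embed markup", e);
        }
    }

    /** Parses the XML and returns the decoded text of its root element. */
    static String extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract markup", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html>\ncode here\n</html>";
        String xml = embed(html);
        System.out.println(xml);          // escaped form stored in the XML
        System.out.println(extract(xml)); // original HTML again
    }
}
```

A CDATA section would be another way to carry the HTML unparsed; it is not shown here.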
You can extract the HTML from the XML using JavaScript. You can then create an element on your HTML page in JavaScript and dump the HTML in there. The only issue with this is that the XML data you're receiving seems to contain an html tag.
If you want to add the content to an existing page, you would have to strip the html and body tags.
If you use Python, extraction can be very easy.
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<?xml >
<first>
</first>
<second>
</second>
<xhtml>
<html>
.....some html code here
</html>
</xhtml>
'''
doc = SimplifiedDoc(html)
html = doc.xhtml.html
print (html)
First you need to install simplified_scrapy using pip.
pip install simplified_scrapy

Sum values in HTML page using XQuery / XPath 2.0

I have an HTML page like:
<html>
<head><title>Hello</title></head>
<body>
<div id="foo">
<h6>9</h6>
<h6>3</h6>
<h6>5</h6>
</div>
</body>
</html>
I'd like to use XQuery (or XPath 2.0) to sum the values in the <h6> elements. I'm using XMLBeans (with Saxon as the engine) and I tried the following, which just gives me a NullPointerException:
XmlObject xml = XmlObject.Factory.parse(xmlFile);
XmlCursor htmlCursor = xml.newCursor();
XmlCursor result = htmlCursor.execQuery("sum(for $val in $this//h6 return number($val))");
System.out.println(result.getObject());
Any ideas?
Use the XPath Sum Function:
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xmlFile);
XPath path = XPathFactory.newInstance().newXPath();
System.out.println(path.evaluate("sum(//h6)", doc));
prints:
17
I'm guessing here but the $this in your query looks a bit odd. In standard XQuery there is no variable in scope called $this. I assume you want the context item, so your query would look like:
sum(for $val in .//h6 return number($val))
or more simply:
sum(.//h6/number(.))
or just:
sum(.//h6)
Omitting the dot would mean that the XPath starts at the root of the document, not at the context item.
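The difference between the two forms can be seen with the JDK's own XPath engine. This is a small sketch with an invented second div, so that the two sums differ; the class H6Sum and its method are hypothetical. Note that the JDK engine implements XPath 1.0, so the 2.0-style sum(.//h6/number(.)) is not expected to work with it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates a numeric XPath relative to a context node.
final class H6Sum {

    static double evaluateAt(String xml, String contextPath, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // First locate the context item, then evaluate relative to it.
            Node context = (Node) xpath.evaluate(contextPath, doc, XPathConstants.NODE);
            return (Double) xpath.evaluate(expression, context, XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalStateException("Evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // The second div is an addition for this example only.
        String xml = "<html><body>"
                + "<div id=\"foo\"><h6>9</h6><h6>3</h6><h6>5</h6></div>"
                + "<div id=\"bar\"><h6>10</h6></div>"
                + "</body></html>";
        // Relative path: only the h6 elements under the context div.
        System.out.println(evaluateAt(xml, "//div[@id='foo']", "sum(.//h6)")); // 17.0
        // Absolute path: every h6 in the document, whatever the context.
        System.out.println(evaluateAt(xml, "//div[@id='foo']", "sum(//h6)"));  // 27.0
    }
}
```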
