Extract HTML from xml - java

I want to extract an HTML page from an XML file. Any ideas, please?
<?xml ....>
<first>
</first>
<second>
</second>
<xhtml>
<html>
.....some html code here
</html>
</xhtml>
I want to extract the HTML page exactly as it appears above.

Because XML and HTML markup are so similar, an XML parser will treat the embedded HTML tags as part of the XML tree and will fail if the HTML is not well-formed. I would suggest encoding (escaping) the HTML when you save it in the XML file, so that the parser sees it as plain text. When you read the data back from the XML, you just need to decode it before use.
<?xml ....?>
<first></first>
<second></second>
<markup>
&lt;html&gt;
code here
&lt;/html&gt;
</markup>
When you decode the markup section, it will look like this:
<html>
code here
</html>
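A minimal Java sketch of this encode/decode round trip, using only the JDK's DOM and Transformer classes. The wrapper element name data and the class name EscapedMarkupDemo are assumptions made for illustration; only the markup element comes from the example above. Setting the HTML as text content lets the serializer do the escaping, and getTextContent() returns the decoded markup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EscapedMarkupDemo {

    // Stores the HTML as plain text inside <markup>; the serializer escapes '<' and '&'.
    static String wrap(String html) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("data"); // hypothetical wrapper element
        Element markup = doc.createElement("markup");
        markup.setTextContent(html);
        root.appendChild(markup);
        doc.appendChild(root);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Reads the text of <markup>; the parser turns the entities back into markup characters.
    static String unwrap(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("markup").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>code here</p></body></html>";
        String xml = wrap(html);
        System.out.println(xml);         // the escaped form stored in the XML
        System.out.println(unwrap(xml)); // the original HTML again
    }
}
```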

You might find this of some use:
http://www.w3schools.com/xml/xml_parser.asp
You can extract the HTML from the XML using JavaScript. You can then create an element on your HTML page in JavaScript and dump the HTML in there. The only issue is that the XML data you are receiving appears to include its own html tag.
If you want to add the content to an existing page, you would have to strip the html and body tags first.

If you can use Python, the extraction is very easy with the simplified_scrapy package.
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<?xml >
<first>
</first>
<second>
</second>
<xhtml>
<html>
.....some html code here
</html>
</xhtml>
'''
doc = SimplifiedDoc(html)
html = doc.xhtml.html
print(html)
First you need to install simplified_scrapy using pip.
pip install simplified_scrapy
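Since the question asks about Java, the same extraction can be sketched with the JDK's built-in DOM parser and an identity Transformer. This assumes the file is well-formed XML with a single root element; the root wrapper and the class name HtmlFromXml below are hypothetical, because the sample in the question shows several top-level elements. The output method is forced to xml so that the serializer does not switch to HTML output rules for an element named html.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class HtmlFromXml {

    // Returns the first <html> element of the document, serialized back to a string.
    static String extractHtml(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node html = doc.getElementsByTagName("html").item(0);
        if (html == null) {
            throw new IllegalArgumentException("No <html> element found");
        }

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(html), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><first/><second/><xhtml>"
                + "<html><body><p>Hello</p></body></html>"
                + "</xhtml></root>";
        System.out.println(extractHtml(xml));
    }
}
```

If the embedded HTML is not well-formed XML (for example, unclosed tags), the parse step fails and an HTML-aware parser is the better tool.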

Related

Java sax parsing replacing a custom tag with the resolved value

I have an XML string that is actually HTML. It contains a few custom tags that should be read and replaced with their actual values. I cannot figure out how to do this with SAX parsing.
<html>
<body>
<p>The joiner report for today</p>
<p><APP:FT value="THIS_WEEKDAY"/></p>
<p> </p>
</body>
</html>
This template would be processed with SAX parsing, and the value of the custom tag
<APP:FT>
would be computed by Java code. For example,
<APP:FT value="THIS_WEEKDAY"/>
should be replaced by TUESDAY, given that today is 13-Dec-2016. Finding the value is easy, but I cannot figure out how to substitute it into the HTML string. The final HTML should look like this:
<html>
<body>
<p>The joiner report for today</p>
<p>TUESDAY</p>
<p> </p>
</body>
</html>
Thank you, folks, for reading through. I solved the problem not with XML parsing but by using the FreeMarker template API - http://freemarker.org/
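For readers who still want the SAX route, here is a minimal sketch under a few assumptions: the template is well-formed XML, the parser is left non-namespace-aware (the JDK default) so that the undeclared APP: prefix is accepted, and only THIS_WEEKDAY is resolved. The class name CustomTagResolver and its methods are hypothetical. The handler re-emits every SAX event as text and writes the resolved value in place of the custom tag.

```java
import java.io.StringReader;
import java.time.LocalDate;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CustomTagResolver extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final LocalDate today;

    CustomTagResolver(LocalDate today) {
        this.today = today;
    }

    // Hypothetical resolver: only THIS_WEEKDAY is handled in this sketch.
    private String resolve(String key) {
        if ("THIS_WEEKDAY".equals(key)) {
            return today.getDayOfWeek().toString();
        }
        return "";
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("APP:FT".equals(qName)) {
            // Write the resolved value instead of the custom tag.
            out.append(escape(resolve(atts.getValue("value"))));
            return;
        }
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!"APP:FT".equals(qName)) {
            out.append("</").append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    static String render(String template, LocalDate today) throws Exception {
        CustomTagResolver handler = new CustomTagResolver(today);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(template)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String template = "<html><body><p>The joiner report for today</p>"
                + "<p><APP:FT value=\"THIS_WEEKDAY\"/></p></body></html>";
        System.out.println(render(template, LocalDate.of(2016, 12, 13)));
    }
}
```

Passing the date in as a parameter keeps the sketch testable; production code would more likely pass LocalDate.now().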

Convert XML document render to hard-code as HTML

I need to publish an HTML file from an XML file. The HTML file must show hard-coded values as they were in the XML file at a specific point in time (i.e. it must be independent of any changes made to the XML after the HTML document is created).
Example: XML File
<dvd>
<name>Titanic</name>
<price>10</price>
</dvd>
<dvd>
<name>Avatar</name>
<price>12</price>
</dvd>
Now I need to convert these into an HTML document in which the values are hard-coded:
Example HTML File
<html>
<body>
<h1>DVD List</h1>
<table>
<tr ...>
<th>Name</th><th>Price</th>
<td>Titanic</td><td>10</td>
<td>Avatar</td><td>12</td>
I have tried using XSLT; however, this only provides a rendering of the XML document that updates whenever the XML changes. I need a point-in-time HTML document that holds the values as they were in the XML at that moment.
Perhaps there is an easy way to do this with existing technologies, or with some simple custom Java code?
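One possible approach with simple custom Java code is to read the XML once and write the values into a static HTML file. The sketch below assumes the dvd elements sit under a single root element (dvds is a made-up name, since the sample shows no root), and the class name DvdSnapshot and the output file name dvds.html are also hypothetical.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DvdSnapshot {

    // Builds an HTML table whose cells hold the values found in the XML right now.
    static String toHtml(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList dvds = doc.getElementsByTagName("dvd");

        StringBuilder html = new StringBuilder();
        html.append("<html>\n<body>\n<h1>DVD List</h1>\n<table>\n");
        html.append("<tr><th>Name</th><th>Price</th></tr>\n");
        for (int i = 0; i < dvds.getLength(); i++) {
            Element dvd = (Element) dvds.item(i);
            html.append("<tr><td>").append(escape(childText(dvd, "name")))
                .append("</td><td>").append(escape(childText(dvd, "price")))
                .append("</td></tr>\n");
        }
        html.append("</table>\n</body>\n</html>\n");
        return html.toString();
    }

    // Text of the first child element with the given tag, or "" if it is missing.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<dvds>"
                + "<dvd><name>Titanic</name><price>10</price></dvd>"
                + "<dvd><name>Avatar</name><price>12</price></dvd>"
                + "</dvds>";
        // The written file is a snapshot; later changes to the XML do not affect it.
        Files.write(Paths.get("dvds.html"), toHtml(xml).getBytes(StandardCharsets.UTF_8));
    }
}
```

Running an XSLT stylesheet once on the server through javax.xml.transform and saving the result would give the same kind of static file.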

Edit HTML Document with Java

I have an HTML document stored in memory (set on a Flying Saucer XHTMLPanel) in my java application.
xhtmlPanel.setDocument(Main.class.getResource("/mailtemplate/DefaultMail.html").toString());
The HTML file is below:
<html>
<head>
</head>
<body>
<p id="first"></p>
<p id="second"></p>
</body>
</html>
I want to set the contents of the p elements. I don't want to define a schema just so that I can use getElementById(), so what alternatives do I have?
XHTML is XML, so any XML parser would be my recommendation. I maintain the JDOM library, so I would naturally recommend using that, but other libraries, including the DOM model built into Java, will work. I would use something like:
Document doc = new SAXBuilder().build(Main.class.getResource("/mailtemplate/DefaultMail.html"));
// XPath that finds the `p` element with id="first"
XPathExpression<Element> xpe = XPathFactory.instance().compile(
"//p[@id='first']", Filters.element());
Element p = xpe.evaluateFirst(doc);
p.setText("This is my text");
XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
xout.output(doc, System.out);
Produces the following:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head />
<body>
<p id="first">This is my text</p>
<p id="second" />
</body>
</html>
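The same idea can be sketched with the DOM and XPath classes that ship with the JDK, with no extra library. This assumes the document is well-formed and declares no XML namespace, as in the sample file; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ParagraphEditor {

    // Finds <p id="..."> with XPath, sets its text, and returns the serialized document.
    static String setParagraphText(String xhtml, String id, String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // The id is spliced into the expression, so it must not contain quotes.
        Element p = (Element) xpath.evaluate("//p[@id='" + id + "']", doc, XPathConstants.NODE);
        if (p == null) {
            throw new IllegalArgumentException("No p element with id " + id);
        }
        p.setTextContent(text);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml"); // keep XML output rules for <html>
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head></head><body>"
                + "<p id=\"first\"></p><p id=\"second\"></p></body></html>";
        System.out.println(setParagraphText(xhtml, "first", "This is my text"));
    }
}
```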
Use a fine-grained HTML parsing and manipulation library like jsoup. You can easily create a Document by passing the HTML to Jsoup.parse(String html). The library supports the usual DOM manipulation functions as well as CSS or jQuery-like selector syntax through doc.select(String selector), where doc is an instance of Document.
For example, you can select the first p using doc.select("p").first(). A minimal working solution would be:
Document doc = Jsoup.parse(htmlContent);
Element p = doc.select("p").first();
p.text("My Example Text");
Reference:
Use selector-syntax to find elements

The element type "META" must be terminated by the matching end-tag "</META>"

I sometimes get the following error when I try to parse an XML file with Java (on a GAE server):
Parse: org.xml.sax.SAXParseException; lineNumber: 10; columnNumber: 3; The element type "META" must be terminated by the matching end-tag "</META>".
It does not happen every time; sometimes it works fine. The program also parses other XML files, and I have no problem with them.
This is the XML file I'm trying to parse:
http://www.fulhamchronicle.co.uk/london-chelsea-fc/rss.xml
Any help will be appreciated. Thanks.
Update:
Thanks for the answer. I changed my code to use a different parser, and the good news is that the file now parses correctly.
The bad news is that the problem has now moved to another feed: same error, same line, despite it being a completely different feed that worked perfectly before. Can anyone think of why this is happening?
That looks like it is a live document; i.e. one that changes fairly frequently. There is also no sign of a <meta> tag in it.
I can think of two explanations for what is happening:
Sometimes the document is being generated or created incorrectly.
Sometimes you are getting an HTML error page instead of the document you are expecting, and the XML parser can't cope with a <meta> tag in the HTML's <head>. That is because the <meta> tag in (valid) HTML does not need to have a matching / closing </meta> tag. (And for at least some versions of HTML, it is not allowed to have a closing tag.)
To track this down, you are going to have to capture the precise input that is causing the parse to fail.
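One way to capture the failing input is to read the whole response into memory before parsing and to dump it when the parser rejects it. The sketch below assumes the caller has already fetched the feed into a byte array and that UTF-8 is good enough for printing the dump; the class name FeedDebugParser and the method parseOrDump are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXParseException;

class FeedDebugParser {

    // Parses the bytes; on a parse error, prints the exact input and returns null.
    static Document parseOrDump(byte[] body) throws Exception {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(body));
        } catch (SAXParseException e) {
            System.err.println("Parse failed at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage());
            System.err.println("---- input that failed ----");
            System.err.println(new String(body, StandardCharsets.UTF_8));
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // An HTML error page with an unclosed META tag, standing in for a bad response.
        String errorPage = "<html><head><META name=\"robots\"></head>"
                + "<body>Service unavailable</body></html>";
        Document doc = parseOrDump(errorPage.getBytes(StandardCharsets.UTF_8));
        System.out.println(doc == null ? "not a feed" : "parsed");
    }
}
```

If the dump turns out to be an HTML error page, the real problem is on the fetching side (status code, redirects, rate limiting) rather than in the XML parser.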
There are two solutions:
You can try <meta/> instead of <meta>.
Add spring.thymeleaf.mode=LEGACYHTML5 to your application.properties file, and add this dependency to your pom.xml or build.gradle file:
pom.xml:
<dependency>
<groupId>net.sourceforge.nekohtml</groupId>
<artifactId>nekohtml</artifactId>
<version>1.9.21</version>
</dependency>
gradle:
compile 'net.sourceforge.nekohtml:nekohtml:1.9.21'
Just add a closing slash (/) to every meta tag:
<meta name=" " content=" " />
For example, use:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
It really works.
It is not XML but HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
The XML parser will not parse it.
The file appears to have no content and does not look like a valid RSS file. Perhaps a server-side error is occurring.
Can you use this tag?
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

java dom xml parser get html tags(<p color="something">some text</p>) from xml

I have an xml file with html tags like:
<?xml version="1.0" encoding="utf-8" ?>
<blog>
<blogid>49</blogid>
<title>[FIXED] Job requests page broken</title>
<fulltext>
<img title="page broken" src="images/west/blog/site-broken.jpg" alt="page broken" />
<p><span style="background-color: #ccffcc;">Update 28/05/2011</span>: Job requests page seems to be working OK now. If you find any issues please use the contact page to notify us. Thank you for your patience!</p>
<p> </p>
<p>Well, what can I say? Why does it always have to be that way? You are trying to create something new and something else gets broken on the way...</p>
</fulltext>
Now I want the whole HTML part between the <fulltext> tags, exactly as it is.
What I get right now is blank, I think because the DOM parser is parsing the HTML tags as well.
I tried XPath, but it does not work on Android.
I don't think you can get this XML into a DOM as-is if it is not well-formed. (EDIT: or is it well-formed?)
You would need to either a) escape the characters, making the XML well-formed and parseable (though probably not into the DOM you want; I guess you want to display the HTML in a different system), b) parse it using a stream processor, or c) fix it using string manipulation (wrap the content in <![CDATA[ ... ]]>) and then parse it into a DOM.
HTH
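Option c) can be sketched as follows. It assumes the fulltext tags carry no attributes and that the content never contains the sequence ]]> (which would end the CDATA section early); the class name FullTextExtractor is hypothetical. After the wrap, the parser sees the inner markup as plain character data, so getTextContent() returns it unchanged.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FullTextExtractor {

    // Wraps the body of <fulltext> in CDATA, parses, and returns the raw inner markup.
    static String extract(String xml) throws Exception {
        String wrapped = xml
                .replace("<fulltext>", "<fulltext><![CDATA[")
                .replace("</fulltext>", "]]></fulltext>");
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));
        return doc.getElementsByTagName("fulltext").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<blog><blogid>49</blogid><fulltext>"
                + "<p><span style=\"background-color: #ccffcc;\">Update</span>: working now.</p>"
                + "</fulltext></blog>";
        System.out.println(extract(xml));
    }
}
```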
Well-formed HTML markup like this is also well-formed XML (without getting into details related to XHTML). Therefore, there is no reason for the DOM parser not to treat those inner tags as XML tags.
Maybe what you're looking for is a way to flatten what's inside <fulltext>?
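Use a library like Jsoup for this purpose.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
public static void main(String[] args) {
// The inner quotes are escaped so that the string literal compiles.
String html = "<?xml version=\"1.0\"?><foo>" +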
"<bar>Some text — invalid!</bar></foo>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for (Element e : doc.select("bar")) {
System.out.println(e);
}
}
