Converting java xml sax event calls to an xml string


Does the Java XML SAX API provide a ContentHandler implementation that converts the event calls to an XML string? For example, the following calls to this handler should produce the following XML:
XMLPrinterHandler h = new XMLPrinterHandler(); // hypothetical class: the kind of handler being asked for
String data = "hello";
h.startDocument();
h.startElement("", "element", "element", new AttributesImpl());
h.characters(data.toCharArray(), 0, data.length());
h.endElement("", "element", "element");
h.endDocument();
System.out.println(h.getXml());
This should print:
<element>hello</element>
I'm dealing with some code which encodes some data as xml and would like to know the intermediate output. The encoding class takes a ContentHandler and calls the appropriate methods on it to encode the data.

You want:
SAXTransformerFactory f = (SAXTransformerFactory) TransformerFactory.newInstance();
TransformerHandler t = f.newTransformerHandler();
t.setResult(new StreamResult(System.out));
t.startDocument();
etc
The TransformerHandler performs a "null transformation" from SAX input to lexical XML output.
You can also use
t.getTransformer().setOutputProperty()
to set serialization properties such as indenting, based on the properties defined in the XSLT specification. (The standard JDK TransformerHandler gives you XSLT 1.0 serialization properties; if you want the extended set defined in XSLT 3.0 plus Saxon extensions, use the Saxon implementation.)
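Putting the pieces above together, here is a minimal sketch that captures the output in a String instead of writing to System.out. The class name SaxToString and the method render are made up for illustration, and the cast to SAXTransformerFactory assumes the default JDK TransformerFactory (check getFeature(SAXTransformerFactory.FEATURE) if you need to be sure for another implementation).

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: feeds SAX events to a TransformerHandler and returns the XML text.
final class SaxToString {

    static String render(String text) throws Exception {
        // Assumption: the factory found on the classpath supports SAX (the JDK default does).
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();

        // Serialization properties are set on the underlying Transformer.
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // The same event sequence as in the question.
        char[] chars = text.toCharArray();
        handler.startDocument();
        handler.startElement("", "element", "element", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "element", "element");
        handler.endDocument();

        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("hello"));
    }
}
```

In the question's scenario, the TransformerHandler would be passed to the encoding class in place of its usual ContentHandler, and the StringWriter read once endDocument() has been called.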
Personally I find that writing Java code as a direct client of the SAX ContentHandler interface is very clumsy. I much prefer the XMLStreamWriter interface.
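For comparison, a rough sketch of the same document written through XMLStreamWriter. The class and method names are illustrative only, and this does not help if the encoding class only accepts a ContentHandler.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: writes <element>text</element> using the StAX writer API.
final class StaxWriterExample {

    static String render(String text) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        // No writeStartDocument() call, so no XML declaration is emitted.
        writer.writeStartElement("element");
        writer.writeCharacters(text);   // special characters are escaped by the writer
        writer.writeEndElement();
        writer.flush();
        writer.close();

        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("hello"));
    }
}
```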

Related

EXI get JAXB unmarshaller

I wish to know the EXI equivalent of the JAXB unmarshaller.
I have looked at the EXI examples, where I have successfully obtained EXIFactory, set the grammar, get the XMLReader.
The example then creates a transformer to transform EXI stream to XML stream.
However, I do not need the output stream. I just need the unmarshalled result to stay as in-memory POJOs. I need the result to be direct unmarshall of EXI. I am using EXI marshall/unmarshall as a faster alternative to text XML.
Forgot to say which library I was using. Here it is:
<groupId>com.siemens.ct.exi</groupId>
<artifactId>exificient</artifactId>
<version>0.9.6</version>

The JAXB Marshaller and Unmarshaller let you choose among various input/output mechanisms,
e.g.
Unmarshaller.unmarshal( javax.xml.transform.Source source )
or
Marshaller.marshal( Object jaxbElement, javax.xml.transform.Result result )
EXIficient implements
javax.xml.transform.Source (see com.siemens.ct.exi.api.sax.EXISource)
javax.xml.transform.Result (see com.siemens.ct.exi.api.sax.EXIResult)
Both EXISource and EXIResult can be initialized with the EXIFactory.
Hope this helps,
-- Daniel

JAXB and XSLT processor

I am using JAXB and maven-jaxb2-plugin and I am able right now to bind my schemas to Java code successfully.
I also have a .xsl file "annotate_schemas.xsl" that modifies a specific schema adding some additional information.
Finally, on the schema that I want transformed, I added the header:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="annotate_schemas.xsl"?>
...
The problem is that, while the .xsl is correct (if I open my schema file in a browser, the transformation is done flawlessly), JAXB ignores it and binds an untouched version of my schema.
My question is: Does JAXB (and/or its plugin) have an XSLT processor?? Is there a way to tell JAXB to bind the result of the XSLT transformation instead of the original?
Thank you very much

JAXB, like the vast majority of XML-consuming applications, takes no notice of an <?xml-stylesheet?> processing instruction. If you want to transform a document before passing it to JAXB, you need to transform it explicitly, for example by using the JAXP transformation API. (There is an option in JAXP to request transformation according to the value of the xml-stylesheet PI, if that's how you want to control it: TransformerFactory.getAssociatedStylesheet() returns the stylesheet Source referenced by the PI, which you can then pass to newTransformer().)
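As a sketch of the explicit approach, the following applies a stylesheet to a schema file and writes the result to a new file, which the binding step could then be pointed at. The class name SchemaPreprocessor and the idea of running it as a separate step before the maven-jaxb2-plugin are assumptions, not something JAXB or the plugin provides.

```java
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical pre-processing step: schema + XSLT -> transformed schema file.
final class SchemaPreprocessor {

    static void transform(Path schema, Path stylesheet, Path output) throws Exception {
        // Compile the stylesheet (for example annotate_schemas.xsl).
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(stylesheet.toFile()));

        // Run the transformation and write the result to the output file.
        try (OutputStream os = Files.newOutputStream(output)) {
            transformer.transform(new StreamSource(schema.toFile()), new StreamResult(os));
        }
    }
}
```

The transformed file, rather than the original schema, is then what gets bound.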

You can try something like this:
TransformerFactory transFact = TransformerFactory.newInstance();
Templates displayTemplate = transFact.newTemplates(new StreamSource(new File("your_xsl_file")));
TransformerHandler handler =
((SAXTransformerFactory) transFact).newTransformerHandler(displayTemplate);

Saxon in Java: XSLT for CSV to XML

Mostly continued from this question: XSLT: CSV (or Flat File, or Plain Text) to XML
So, I have an XSLT from here: http://andrewjwelch.com/code/xslt/csv/csv-to-xml_v2.html
And it converts a CSV file to an XML document. It does this when used with the following command on the command line:
java -jar saxon9he.jar -xsl:csv-to-xml.xsl -it:main -o:output.xml
So now the question becomes: How do I do this in my Java code?
Right now I have code that looks like this:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
StreamSource xsltSource = new StreamSource(new File("location/of/csv-to-xml.xsl"));
Transformer transformer = transformerFactory.newTransformer(xsltSource);
StringWriter stringWriter = new StringWriter();
transformer.transform(documentSource, new StreamResult(stringWriter));
String transformedDocument = stringWriter.toString().trim();
(The Transformer is an instance of net.sf.saxon.Controller.)
The trick on the command line is to specify "-it:main" to point right at the named template in the XSLT. This means you don't have to provide the source file with the "-s" flag.
The problem starts again on the Java side. Where/how would I specify this "-it:main"? Wouldn't doing so break other XSLT's that don't need that specified? Would I have to name every template in every XSLT file "main?" Given the method signature of Transformer.transform(), I have to specify the source file, so doesn't that defeat all the progress I've made in figuring this thing out?
Edit: I found the s9api hidden inside the saxon9he.jar, if anyone is looking for it.

You are using the JAXP API, which was designed for XSLT 1.0. If you want to make use of XSLT 2.0 features, like the ability to start a transformation at a named template, I would recommend using the s9api interface instead, which is much better designed for this purpose.
However, if you've got a lot of existing JAXP code and you don't want to rewrite it, you can usually achieve what you want by downcasting the JAXP objects to the underlying Saxon implementation classes. For example, you can cast the JAXP Transformer as net.sf.saxon.Controller, and that gives you access to controller.setInitialTemplate(); when it comes to calling the transform() method, just supply null as the Source parameter.
Incidentally, if you're writing code that requires a 2.0 processor then I wouldn't use TransformerFactory.newInstance(), which will give you any old XSLT processor that it finds on the classpath. Use new net.sf.saxon.TransformerFactoryImpl() instead, which (a) is more robust, and (b) much much faster.

Storing html values in xml

I'm trying to figure out a way to extract specific information (name, description, id, etc.) from an HTML file, leaving the unwanted information behind, and store it in an XML file.
I thought of trying XSLT since it can do XML to HTML... but it doesn't seem to work the other way around.
I honestly don't know what other language I should try to accomplish this. I know basic Java and JavaScript, but I'm not too sure if they can do it. I'm kind of lost on getting this started.
I'm open to any advice/help. I'm willing to learn a new language too, as I'm just doing this for fun.

There are a number of Java libraries for handling HTML input that isn't well-formed (according to XML). These libraries also have built-in methods for querying or manipulating the document, but it's important to realize that once you've parsed the document it's usually pretty easy to treat it as though it were XML in the first place (using the standard Java XML interfaces). In other words, you only need these libraries to parse the malformed input; the other utilities they provide are mostly superfluous.
Here's an example that shows parsing HTML using HTMLCleaner and then converting that object into a standard org.w3c.dom.Document:
TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
DomSerializer ser = new DomSerializer(new CleanerProperties());
org.w3c.dom.Document doc = ser.createDOM(tagNode);
In Jsoup, simply parse the input and serialize it into a string:
String text = Jsoup.parse("<html><div><p>test").outerHtml();
And convert that string into a W3C Document using one of the methods described here:
How to parse a String containing XML in Java and retrieve the value of the root node?
You can now use the standard JAXP interfaces to transform this document:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
Note: Provide some XSLT source to tFact.newTransformer() to do something more useful than the identity transform.
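To illustrate that note, here is a small sketch that applies an extraction stylesheet to a DOM document. It assumes the HTML has already been cleaned into well-formed markup (parsed here from a string with the standard DocumentBuilder), and the element names, the class attribute value "item", and the class name ExtractFromDom are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example: pull id and name out of each <div class="item"> and emit plain XML.
final class ExtractFromDom {

    // Illustrative stylesheet; the input structure it matches is an assumption.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/'>"
        + "<items><xsl:for-each select=\"//div[@class='item']\">"
        + "<item id='{@id}'><name><xsl:value-of select='h2'/></name></item>"
        + "</xsl:for-each></items>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String extract(String wellFormedHtml) throws Exception {
        // Stand-in for the Document produced by HtmlCleaner or a similar library.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedHtml)));

        // Passing a stylesheet to newTransformer() replaces the identity transform.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(
            "<html><body><div class='item' id='42'><h2>Widget</h2><p>ignored</p></div></body></html>"));
    }
}
```

Only the id attribute and the h2 text reach the output; the paragraph is dropped because the stylesheet never selects it.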

I would use HTMLAgilityPack or Chris Lovett's SGMLReader.
Or, simply HTML Tidy.

Ideally, you can treat your HTML as XML. If you're lucky, it will already be XHTML, and you can process it as XML. If not, use something like http://nekohtml.sourceforge.net/ (an HTML tag balancer, etc.) to process the HTML into something that is XML compliant so that you can use XSLT.
I have a specific example and some notes around doing this on my personal blog at http://blogger.ziesemer.com/2008/03/scraping-suns-bug-database.html.

TagSoup
JSoup
Beautiful Soup

Java XML Parsing and original byte offsets

I'd like to parse some well-formed XML into a DOM, but I'd also like to know the offset of each node's tag in the original media.
For example, if I had an XML document with the content something like:
<html>
<body>
<div>text</div>
</body>
</html>
I'd like to know that the <div> node starts at offset 13 in the original media, and (more importantly) that "text" starts at offset 18.
Is this possible with standard Java XML parsers? JAXB? If no solution is easily available, what type of changes are necessary along the parsing path to make this possible?

The SAX API provides a rather obscure mechanism for this: the org.xml.sax.Locator interface. When you use the SAX API, you subclass DefaultHandler and pass that to the SAX parse methods, and the SAX parser implementation is supposed to inject a Locator into your DefaultHandler via setDocumentLocator(). As the parsing proceeds, the various callback methods on your ContentHandler are invoked (e.g. startElement()), at which point you can consult the Locator to find out the parsing position (via getColumnNumber() and getLineNumber()). Note that this gives a line and column, not a byte offset, and that the position reported is the end of the event being delivered, not its start.
Technically, this is optional functionality, but the javadoc says that implementations are "strongly encouraged" to provide it, so you can likely assume the SAX parser built into JavaSE will do it.
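A minimal sketch of the Locator mechanism, assuming the JDK's built-in SAX parser. The class name LocatorExample and the "name@line:column" string format are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: record the parser position reported at each start tag.
final class LocatorExample {

    static List<String> positions(String xml) throws Exception {
        List<String> found = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            // The parser calls this before startDocument() if it supports locators.
            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (locator != null) {
                    // Position is where the parser is now, i.e. just past the start tag.
                    found.add(qName + "@" + locator.getLineNumber()
                            + ":" + locator.getColumnNumber());
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html>\n<body>\n<div>text</div>\n</body>\n</html>";
        positions(xml).forEach(System.out::println);
    }
}
```

Turning a line and column into an offset in the original media still requires your own bookkeeping, for example an index of line start offsets built from the input.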
Of course, this does mean using the SAX API, which is no one's idea of fun, but I can't see a way of accessing this information using a higher-level API.
edit: Found this example.

Use an XMLStreamReader and its getLocation() method, which returns a Location object. location.getCharacterOffset() gives the offset of the current parsing position: a byte offset if the reader was created from a file or byte stream, and a character offset if it was created from a Reader.
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class Runner {
    public static void main(String[] argv) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // A byte stream (rather than a FileReader) is used so that the offsets are byte offsets.
        try (InputStream in = new FileInputStream("D:\\BigFile.xml")) {
            XMLStreamReader streamReader = factory.createXMLStreamReader(in);
            while (streamReader.hasNext()) {
                streamReader.next();
                if (streamReader.getEventType() == XMLStreamReader.START_ELEMENT) {
                    // Location of the parser at the current event.
                    Location location = streamReader.getLocation();
                    System.out.println("byte location: " + location.getCharacterOffset());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
