How to convert a Java object to String using Saxon

I am facing a problem with Xalan while converting a Java object to a String: empty open/close tags are converted to self-closing tags, e.g. <span></span> gets converted to </span>.
I fixed a similar problem while using the Saxon XSL transformer. Is it possible to use Saxon to convert a Java object to a String instead of Xalan?

First, I'm sure you mean <span/> for the self-closing tag.
Second: why is this a problem? If you are generating XML, <span></span> means exactly the same as <span/>, and will be treated the same by any XML parser. (If you're reading the XML without an XML parser, then DON'T.) On the other hand, if you are generating HTML, then specifying method="html" should be all you need to do, whether you are using Xalan or Saxon.
Third: I can't see any relationship between your serialization problem and the task of converting Java objects to strings.
You can certainly do such things in Saxon. The documentation for calling Java methods from Saxon can be found here: http://www.saxonica.com/documentation/extensibility/intro.xml (Sorry there's so much of it, but I don't know enough about your situation to give you a more precise pointer).
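For what it's worth, here is a minimal sketch of serializing through Saxon, assuming the "Java object" in question is a DOM Node; net.sf.saxon.TransformerFactoryImpl is Saxon's JAXP entry point, and method="html" is what keeps empty elements like <span></span> expanded:

import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringWriter;
import org.w3c.dom.Node;

public class NodeSerializer {
    // Serialize a DOM node to a String using Saxon's JAXP implementation.
    public static String serialize(Node node) throws TransformerException {
        // Ask for Saxon explicitly instead of whatever factory the JDK picks.
        TransformerFactory factory = new net.sf.saxon.TransformerFactoryImpl();
        Transformer identity = factory.newTransformer(); // identity transform
        // method="html" keeps empty elements such as <span></span> expanded.
        identity.setOutputProperty(OutputKeys.METHOD, "html");
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }
}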

Related

Generate code from antlr tokens

We are currently working on generating code using ANTLR. We have a grammar file that can recognize pretty much everything. Now, our problem is that we want to be able to recreate the original code from the tokens we generate, in order to create this new file.
We have a .txt file with our tokens that looks like this:
[#0,0:6=' ',<75>,channel=1,1:0]
[#1,7:20='IDENTIFICATION',<6>,1:7]
[#2,21:21=' ',<75>,channel=1,1:21]
[#3,22:29='DIVISION',<4>,1:22]
[#4,30:30='.',<3>,1:30]
[#5,31:40='\n \t ',<75>,channel=1,1:31]
[#6,41:50='PROGRAM-ID',<16>,2:9]
[#7,51:51='.',<3>,2:19]
[#8,52:52=' ',<75>,channel=1,2:20]
[#9,53:59='testpro',<76>,2:21]
[#10,60:60='.',<3>,2:28]
[#11,61:70='\n \t ',<75>,channel=1,2:29]
[#12,71:76='AUTHOR',<31>,3:9]
[#13,77:77='.',<3>,3:15]
Or is there another way to recreate the original code from the tokens?
Thanks in advance, Viktor
The most straightforward way to make the lexer output portable is to serialize the token stream for transport and storage. You could equally serialize the entire parser-generated parse tree. In either case, you will be capturing the full text of the source input.
The intrinsic complexity of the lexer stream object is a single class. The parse tree object complexity is also quite small, involving just a handful of standard classes. Consequently, the complexity of serialization & deserialization is almost entirely a linear function of the size of the parsed source input.
Google Gson is a simple-to-use, relatively fast Java object serialization library.
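As a sketch of that approach (MyLexer stands in for your generated ANTLR lexer class, and Tok is a hypothetical flattened view of the token fields): because hidden-channel tokens (whitespace, comments) are included, concatenating the text fields in order reproduces the original source.

import com.google.gson.Gson;
import org.antlr.v4.runtime.*;
import java.util.ArrayList;
import java.util.List;

public class TokenSnapshot {
    // Flat, Gson-friendly view of the token fields needed to rebuild the source.
    static class Tok {
        int type, channel, line, charPositionInLine;
        String text;
    }

    // Run the lexer to EOF and serialize every token (all channels) to JSON.
    public static String tokensToJson(CharStream input) {
        Lexer lexer = new MyLexer(input); // MyLexer = your generated lexer
        List<Tok> out = new ArrayList<>();
        for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
            Tok tok = new Tok();
            tok.type = t.getType();
            tok.channel = t.getChannel();
            tok.line = t.getLine();
            tok.charPositionInLine = t.getCharPositionInLine();
            tok.text = t.getText();
            out.add(tok);
        }
        return new Gson().toJson(out);
    }
}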
If your parser is generating some intermediate representation of the parsed source input, you could transport the IR directly using a defined record-serialization library like Google FlatBuffers to save & restore IR model instances.

Difference between SolrJ's ResponseParsers

The SolrJ library offers different parsers for Solr's responses.
Namely:
BinaryResponseParser
StreamingBinaryResponseParser
NoOpResponseParser
XMLResponseParser
Sadly the documentation doesn't say much about them, other than:
SolrJ uses a binary format, rather than XML, as its default format.
Users of earlier Solr releases who wish to continue working with XML
must explicitly set the parser to the XMLResponseParser, like so:
server.setParser(new XMLResponseParser());
So it looks like the XMLResponseParser is there mainly for legacy purposes.
What are the differences between the other parsers?
Can I expect performance improvements by using another parser instead of the XMLResponseParser?
The binary response parsers are meant to work directly with the Java object format (the binary POJO format) to make the creation of data objects as smooth as possible on the client side.
The XML parser was designed to work with the old response format, from the days when there was no real alternative (there was no binary response format in Solr yet). It's a lot more work to consider all the options of an XML format than to use the binary format directly.
The StreamingBinaryResponseParser does the same work as the BinaryResponseParser, but has been designed to make streaming documents (i.e. not creating a list of documents and returning that list, but instead return each document by itself without having to hold them all in memory at the same time) possible. See SOLR-2112 for a description of the feature and why it was added.
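For illustration, a minimal sketch of that streaming path using the newer HttpSolrClient API (the URL and field name are placeholders; queryAndStreamResponse wires up the StreamingBinaryResponseParser for you):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.StreamingResponseCallback;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class StreamingQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        client.queryAndStreamResponse(new SolrQuery("*:*"), new StreamingResponseCallback() {
            @Override
            public void streamSolrDocument(SolrDocument doc) {
                // Called once per document as it is read off the wire,
                // without buffering the whole result list in memory.
                System.out.println(doc.getFieldValue("id"));
            }
            @Override
            public void streamDocListInfo(long numFound, long start, Float maxScore) {
                System.out.println("numFound=" + numFound);
            }
        });
        client.close();
    }
}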
Lastly: yes, if you're using SolrJ, use the binary response format unless you have a very good reason for using the XML-based one. If you have to ask the question, you're probably better off with the binary format.

What is the use of static fields PI_ENABLE_OUTPUT_ESCAPING & PI_DISABLE_OUTPUT_ESCAPING and how can we use them?

I am new to JAXP and have no idea what the above static fields mean or how to use them.
I need an explanation along with examples.
Thanks in advance.
(Disclaimer - I maintain the JDOM XML Library.) These PIs (processing instructions) are designed to indicate to XML-outputting programs that they should break compatibility with the XML standard, and produce invalid XML.
Under certain conditions, this can be useful.
Here is a test case in the JDOM test harness. It basically has input like the following (I have added some whitespace to make it easier to see):
<root>
&amp;
<?javax.xml.transform.disable-output-escaping ?>
&amp;&amp;
<?javax.xml.transform.enable-output-escaping ?>
&amp;
</root>
In the above example, we have valid XML (the ampersands are escaped as &amp;). If you were to process this data through a system that recognizes the processing instructions, it should output (something like):
<root>
&amp;
&&
&amp;
</root>
Note that this is no longer valid XML... the & characters between the PIs have not been escaped correctly.
From a JDOM perspective, this is documented here in the javadoc
These instructions are normally used in XML transformations to produce output that is 'pretty', and is not consumed by machines but by people. Use them with caution.
Hope that gives you some insight.... all the best.
XSLT has a feature called "disable output escaping" that tells the serializer to write a text node containing the characters <a> literally as <a>, whereas it would normally escape it as &lt;a&gt;. This is a hack that is best avoided, for many reasons, one of which is that it requires a special side channel for the transformation engine to communicate with the serializer (so the transformer can tell the serializer to switch output escaping on and off).
In JAXP, to allow one vendor's transformation engine to talk to another vendor's serializer, the protocol for passing these escaping-off and escaping-on requests is this pair of processing instructions.
You don't need this feature and you can safely ignore its existence. Never be tempted to imagine that just because a feature is there, you must be missing something if you never use it.
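If you do want to see the mechanism in action anyway, here is a minimal sketch (my own illustration, not from the original posts) that pushes SAX events through an identity transformer and toggles escaping around one text node; the constants Result.PI_DISABLE_OUTPUT_ESCAPING and Result.PI_ENABLE_OUTPUT_ESCAPING are exactly the PI targets shown in the example above:

import javax.xml.transform.Result;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class EscapingDemo {
    public static void main(String[] args) throws Exception {
        SAXTransformerFactory f =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler h = f.newTransformerHandler(); // identity transform
        h.setResult(new StreamResult(System.out));

        char[] amp = "&".toCharArray();
        h.startDocument();
        h.startElement("", "root", "root", new AttributesImpl());
        h.characters(amp, 0, 1); // serialized as &amp;
        h.processingInstruction(Result.PI_DISABLE_OUTPUT_ESCAPING, "");
        h.characters(amp, 0, 1); // serialized as a raw, invalid &
        h.processingInstruction(Result.PI_ENABLE_OUTPUT_ESCAPING, "");
        h.characters(amp, 0, 1); // escaped again: &amp;
        h.endElement("", "root", "root");
        h.endDocument();
    }
}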

Is there any XPath processor for SAX model?

I'm looking for an XPath evaluator that doesn't rebuild the whole DOM document to look for nodes: the objective is to manage a large amount of XML data (ideally over 2 GB) with the SAX model, which is very good for memory management, while still offering the possibility to search for nodes.
Thank you all for the support!
For all those who say it's not possible: recently, after asking the question, I found a project named SAXPath (http://www.saxpath.org/), but I can't find any implementing project.
My current list (compiled from web search results and the other answers) is:
http://code.google.com/p/xpath4sax/
http://spex.sourceforge.net/
https://github.com/santhosh-tekuri/jlibs/wiki/XMLDog (also contains a performance chart)
http://www.cs.umd.edu/projects/xsq/ (university project, dead for 10 years, GPL)
MIT-Licensed approach http://softwareengineeringcorner.blogspot.com/2012/01/conveniently-processing-large-xml-files.html
Other parsers/memory models supporting fast XPath:
http://vtd-xml.sourceforge.net/ ("The world's fastest XPath 1.0 implementation.")
http://jaxen.codehaus.org/ (contains http://www.saxpath.org/)
http://www.saxonica.com/documentation/sourcedocs/streaming/streamable-xpath.html
The next step is to use the examples of XMLDog and compare the performance of all these approaches. Then the test cases should be extended to cover the supported XPath expressions.
We regularly parse 1GB+ complex XML files by using a SAX parser which extracts partial DOM trees that can be conveniently queried using XPath. I blogged about it here: http://softwareengineeringcorner.blogspot.com/2012/01/conveniently-processing-large-xml-files.html - Sources are available on github - MIT License.
XPath DOES work with SAX: most XSLT processors (in particular Saxon and Apache Xalan) support executing XPath expressions inside XSLT on a SAX stream without building the entire DOM.
They manage to do this, very roughly, as follows :
Examining the XPath expressions they need to match
Receiving SAX events and testing whether that node is needed, or will be needed, by one of the XPath expressions.
Ignoring the SAX event if it is of no use to the XPath expressions.
Buffering it if it is needed.
How they buffer it is also very interesting, because while some simply create DOM fragments here and there, others use highly optimized tables for quick lookup and reduced memory consumption.
How much they manage to optimize largely depends on the kind of XPath queries they find. As the Saxon documentation already posted clearly explains, queries that move "up" and then traverse "horizontally" (sibling by sibling) through the document obviously require the entire document to be present, but most queries require only a few nodes to be kept in RAM at any moment.
I'm pretty sure of this because back when I was building webapps every day using Cocoon, we hit the XSLT memory-footprint problem whenever we used a "//something" expression inside an XSLT, and quite often we had to rework XPath expressions to allow a better SAX optimization.
SAX is forward-only, while XPath queries can navigate the document in any direction (consider the parent::, ancestor::, preceding:: and preceding-sibling:: axes). I don't see how this would be possible in general. The best approximation would be some sort of lazy-loading DOM, but depending on your queries this may or may not give you any benefit - there is always a worst-case query such as //*[. != preceding::*].
Sorry, a slightly late answer here - it seems that this is possible for a subset of XPath - in general it's very difficult due to the fact that XPath can match both forwards and backwards from the "current" point. I'm aware of two projects that solve it to some degree using state machines: http://spex.sourceforge.net & http://www.cs.umd.edu/projects/xsq. I haven't looked at them in detail but they seem to use a similar approach.
I'll toss in a plug for a new project of mine, called AXS. It's at https://code.google.com/p/annotation-xpath-sax/ and the idea is that you annotate methods with (forward-axis-only) XPath statements and they get called when the SAX parser is at a node that matches it. So with a document
<doc>
<nodes>
<node name="a">text of node 1</node>
<node name="b">text of node 2</node>
<node otherattr="I have attributes!">text of node 3</node>
</nodes>
</doc>
you can do things like
#XPath("/nodes/node")
void onNode(String nodeText)
{
// will be called with "text of node [123]"
}
or
@XPathStart("//node[@name='']")
void onNode3(Attrs node3Attrs) { ... }
or
@XPathEnd("/nodes/node[2]")
void iDontCareAboutNode3() throws SAXException
{
throw new StopParsingException();
}
Of course, the library is so new that I haven't even made a release of it yet, but it's MIT licensed, so feel free to give it a try and see if it matches your need. (I wrote it to do HTML screen-scraping with low enough memory requirements that I can run it on old Android devices...) If you find bugs, please let me know by filing them on the googlecode site!
There are SAX/StAX-based XPath implementations, but they only support a small subset of XPath expressions/axes, largely due to SAX/StAX's forward-only nature. The best alternative I am aware of is extended VTD-XML: it supports full XPath and partial document loading via memory mapping, with a maximum document size of 256 GB, but you will need a 64-bit JVM to use it to its full potential.
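For reference, the basic VTD-XML query loop looks roughly like this (a sketch against the com.ximpleware API; the file name and XPath expression are placeholders):

import com.ximpleware.*;

public class VtdQuery {
    public static void main(String[] args) throws Exception {
        VTDGen gen = new VTDGen();
        if (!gen.parseFile("big.xml", true)) { // true = namespace aware
            throw new RuntimeException("parse failed");
        }
        VTDNav nav = gen.getNav();
        AutoPilot ap = new AutoPilot(nav);
        ap.selectXPath("//item/name");
        // evalXPath() walks the matches one by one; -1 signals no more hits.
        for (int i = ap.evalXPath(); i != -1; i = ap.evalXPath()) {
            System.out.println(nav.toString(i));
        }
    }
}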
What you could do is hook an XSLT transformer to a SAX input source. Your processing will be sequential, and the XSLT processor will attempt to catch the input as it comes and fiddle it into whatever result you specified. You can use this to pull a path's value out of the stream. This would come in especially handy if you wanted to produce a bunch of different XPath results in one pass.
You'll get (typically) an XML document as a result, but you could pull your expected output out of, say, a StreamResult without too much hassle.
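A minimal sketch of that wiring, where extract.xsl and big.xml are placeholders:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import java.io.StringWriter;

public class SaxXsltExtract {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // The stylesheet decides which paths to pull out of the stream.
        Transformer transformer =
                factory.newTransformer(new StreamSource("extract.xsl"));
        // A SAXSource feeds the transformer from a SAX parse, not a DOM.
        SAXSource source = new SAXSource(new InputSource("big.xml"));
        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        System.out.println(out);
    }
}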
Have a look at the streaming mode of the Saxon-SA XSLT-processor.
http://www.saxonica.com/documentation/sourcedocs/serial.html
"The rules that determine whether a path expression can be streamed are:
The expression to be streamed starts with a call on the document() or doc() function.
The path expression introduced by the call on doc() or document() must conform to a subset of XPath defined as follows:
any XPath expression is acceptable if it conforms to the rules for path expressions appearing in identity constraints in XML Schema. These rules allow no predicates; the first step (but only the first) can be introduced with "//"; the last step can optionally use the attribute axis; all other steps must be simple Axis Steps using the child axis.
In addition, Saxon allows the expression to contain a union, for example doc()/(*/ABC | /XYZ). Unions can also be expressed in abbreviated form, for example the above can be written as doc()//(ABC|XYZ).
The expression must either select elements only, or attributes only, or a mixture of elements and attributes.
Simple filters (one or more) are also supported. Each filter may apply to the last step or to the expression as a whole, and it must only use downward selection from the context node (the self, child, attribute, descendant, descendant-or-self, or namespace axes). It must not be positional (that is, it must not reference position() or last(), and must not be numeric: in fact, it must be such that Saxon can determine at compile time that it will not be numeric). Filters cannot be applied to unions or to branches of unions. Any violation of these conditions causes the expression to be evaluated without the streaming optimization.
These rules apply after other optimization rewrites have been applied to the expression. For example, some FLWOR expressions may be rewritten to a path expression that satisfies these rules.
The optimization is enabled only if explicitly requested, either by using the saxon:stream() extension function, or the saxon:read-once attribute on an XSLT xsl:copy-of instruction, or the XQuery pragma saxon:stream. It is available only if the stylesheet or query is processed using Saxon-SA."
Note: this facility is most likely available only in the commercial version. I've used Saxon extensively in the past, and it is a nice piece of work.
Hmm, I don't know if I really understand you. As far as I know, the SAX model is event-oriented: you do something when a certain node is encountered during parsing. Yes, it is better for memory, but I don't see how you would get XPath into it. As SAX does not build a model, I don't think this is possible.
I don't think XPath works with SAX, but you might take a look at StAX, which is an extended streaming XML API for Java.
http://en.wikipedia.org/wiki/StAX
The standard javax.xml.xpath API technically already works with streams: javax.xml.xpath.XPathExpression can be evaluated against an InputSource, which in turn can be constructed with a Reader. I don't think it constructs a DOM under the covers.
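A quick sketch of that API (the inline document is just for demonstration; whether the implementation builds a DOM internally is up to the vendor):

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class XPathOverStream {
    public static void main(String[] args) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The InputSource wraps a Reader; no DOM is handed in by the caller.
        InputSource in = new InputSource(new StringReader("<a><b>hi</b></a>"));
        String result = xpath.evaluate("/a/b", in);
        System.out.println(result); // prints "hi"
    }
}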
Have you also tried QuiXPath (https://code.google.com/p/quixpath/)?

Developing a (file) exchange format for java

I want to come up with a binary format for passing data between application instances in the form of POFs (Plain Old Files ;)).
Prerequisites:
should be cross-platform
information to be persisted includes a single POJO & arbitrary byte[]s (files, actually; the POJO stores their names in a String[])
only sequential access is required
should be a way to check data consistency
should be small and fast
should prevent an average user with archiver + notepad from modifying the data
Currently I'm using DeflaterOutputStream + OutputStreamWriter together with InflaterInputStream + InputStreamReader to save/restore objects serialized with XStream, one object per file. Readers/writers use UTF-8.
Now I need to extend this to support what is described above.
My idea of format:
{serialized to XML object}
{delimiter}
{String file name}{delimiter}{byte[] file data}
{delimiter}
{another String file name}{delimiter}{another byte[] file data}
...
{delimiter}
{delimiter}
{MD5 hash for the entire file}
Does this look sane?
What would you use for a delimiter and how would you determine it?
The right way to calculate MD5 in this case?
What would you suggest to read on the subject?
TIA.
It looks INsane.
Why invent a new file format?
Why try to prevent only stupid users from changing the file?
Why use a binary format (hard to compress)?
Why use a format that cannot be parsed while being received? (The receiver has to receive the entire file before being able to act on it.)
XML is already a serialization format that is compressible. So you are serializing a serialized format.
Would serialization of the model (if you are into MVC) not be another way? I'd prefer to use things in the language (or standard libraries) rather than roll my own if possible. The only issue I can see with that is that the file size may be larger than you want.
1) Does this look sane?
It looks fairly sane. However, if you are going to invent your own format rather than just using Java serialization, you should have a good reason. Do you have any good reasons (they do exist in some cases)? One of the standard reasons for using XStream is to make the result human-readable, which a binary format immediately loses. Do you have a good reason for a binary format rather than a human-readable one? See this question for why human-readable is good (and bad).
Wouldn't it be easier just to put everything in a signed JAR? There are already standard Java libraries and tools to do this, and you get compression and verification for free; see the sketch below.
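For instance, a sketch of packing the pieces into a JAR with the standard java.util.jar API (the entry name object.xml is made up for illustration; signing would be a separate jarsigner step):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

public class JarPacker {
    // Pack the serialized POJO plus the raw files into one JAR; the JAR can
    // then be signed with the stock jarsigner tool and verified on load.
    public static void pack(File out, byte[] pojoXml, Map<String, byte[]> files)
            throws IOException {
        try (JarOutputStream jar = new JarOutputStream(new FileOutputStream(out))) {
            jar.putNextEntry(new JarEntry("object.xml")); // hypothetical entry name
            jar.write(pojoXml);
            jar.closeEntry();
            for (Map.Entry<String, byte[]> e : files.entrySet()) {
                jar.putNextEntry(new JarEntry(e.getKey()));
                jar.write(e.getValue());
                jar.closeEntry();
            }
        }
    }
}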
2) What would you use for a delimiter and how would you determine it?
Rather than a delimiter, I'd explicitly store the length of each block before the block. It's just as easy, and it prevents you from having to escape the delimiter when it comes up on its own; see the sketch below.
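A minimal sketch of length-prefixed framing with plain DataOutputStream/DataInputStream:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BlockIO {
    // Each block is a 4-byte big-endian length followed by the payload,
    // so no delimiter (and no escaping) is ever needed.
    public static void writeBlock(DataOutputStream out, byte[] data) throws IOException {
        out.writeInt(data.length);
        out.write(data);
    }

    public static byte[] readBlock(DataInputStream in) throws IOException {
        byte[] data = new byte[in.readInt()];
        in.readFully(data);
        return data;
    }
}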
3) The right way to calculate MD5 in this case?
There is example code here which looks sensible.
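One way to do it with just the JDK (the file name and payload below are placeholders for your blocks): wrap the stream in a DigestOutputStream so the hash covers exactly the bytes written, then append the digest.

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestOutputStream;
import java.security.MessageDigest;

public class Md5Trailer {
    public static void main(String[] args) throws Exception {
        byte[] payload = "all blocks, concatenated".getBytes(StandardCharsets.UTF_8);
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (FileOutputStream file = new FileOutputStream("data.bin");
             DigestOutputStream out = new DigestOutputStream(file, md5)) {
            out.write(payload);      // everything written here is digested
            out.on(false);           // stop digesting before writing the hash itself
            out.write(md5.digest()); // append the 16-byte MD5 trailer
        }
    }
}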
4) What would you suggest to read on the subject?
On the subject of serialization? I'd read about Java serialization, JSON, and XStream serialization so that I understood the pros and cons of each, especially the benefits of human-readable files. I'd also look at a classic file format, for example one from Microsoft, to understand the design decisions from back in the days when every byte mattered, and how those formats have been extended. For example: the WAV file format.
Let's see, this should be pretty straightforward.
Prerequisites:
0. should be cross-platform
1. information to be persisted includes a single POJO & arbitrary byte[]s (files, actually; the POJO stores their names in a String[])
2. only sequential access is required
3. should be a way to check data consistency
4. should be small and fast
5. should prevent an average user with archiver + notepad from modifying the data
Well, guess what, you pretty much have it already; it's built into the platform: Object Serialization.
If you need to reduce the amount of data sent over the wire and provide custom serialization (for instance, you can send only 1, 2, 3 for a given object without including attribute names or anything similar, and read them back in the same sequence), you can use this somewhat hidden feature, sketched below.
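A minimal sketch of that kind of custom serialization using Externalizable (the Point class and its fields are invented for illustration):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Point implements Externalizable {
    private int x, y;

    public Point() {} // Externalizable requires a public no-arg constructor

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(x); // only the values, no field names or class metadata
        out.writeInt(y);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        x = in.readInt(); // read back in exactly the same order
        y = in.readInt();
    }
}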
If you really need it in plain text, you can also encode it; it takes almost the same number of bytes.
For instance this bean:
import java.io.*;
public class SimpleBean implements Serializable {
private String website = "http://stackoverflow.com";
public String toString() {
return website;
}
}
Could be represented like this:
rO0ABXNyAApTaW1wbGVCZWFuPB4W2ZRCqRICAAFMAAd3ZWJzaXRldAASTGphdmEvbGFuZy9TdHJpbmc7eHB0ABhodHRwOi8vc3RhY2tvdmVyZmxvdy5jb20=
See this answer
Additionally, if you need a sound protocol you can also check out Protobuf, Google's internal exchange format.
You could use a zip (rar / 7z / tar.gz / ...) library. Many exist, most are well tested, and it'll likely save you some time.
Possibly not as much fun though.
I agree that it doesn't really sound like you need a new format, or a binary one.
If you truly want a binary format, why not consider one of these first:
Binary XML (fast infoset, Bnux)
Hessian
Google Protocol Buffers
But besides that, many textual formats should work just fine (or perhaps better) too: they are easier to debug, have extensive tool support, and compress to about the same size as binary (binary compresses poorly, and information theory suggests that for the same effective information the same compression rate is achieved -- which has been true in my testing).
So perhaps also consider:
JSON works well; binary support via Base64 (with, say, http://jackson.codehaus.org/)
XML is not too bad either; efficient streaming parsers, some with Base64 support (http://woodstox.codehaus.org/, "typed access API" under 'org.codehaus.stax2.typed.TypedXMLStreamReader')
So it kind of sounds like you just want to build something of your own. Nothing wrong with that, as a hobby, but if so you need to treat it as such: it likely is not a requirement for the system you are building.
Perhaps you could explain how this is better than using an existing file format such as JAR.
Most standard file formats of this type just use a CRC, as it's faster to calculate. MD5 is more appropriate if you want to prevent deliberate modification.
Bencode could be the way to go.
Here's an excellent implementation by Daniel Spiewak.
Unfortunately, the bencode spec doesn't support UTF-8, which is a showstopper for me.
I might come back to this later, but currently XML seems like a better choice (with the blobs serialized as a Map).
