The SolrJ library offers different parsers for Solr's responses.
Namely:
BinaryResponseParser
StreamingBinaryResponseParser
NoOpResponseParser
XMLResponseParser
Sadly the documentation doesn't say much about them, other than:
SolrJ uses a binary format, rather than XML, as its default format.
Users of earlier Solr releases who wish to continue working with XML
must explicitly set the parser to the XMLResponseParser, like so:
server.setParser(new XMLResponseParser());
So it looks like the XMLResponseParser is there mainly for legacy purposes.
What are the differences between the others parsers?
Can I expect performance improvements by using an other parser over the XMLResponseParser?
The Binary Stream Parsers is meant to work directly with the Java Object Format (the binary POJO format) to make the creation of data objects as smooth as possible on the client side.
The XML parser was designed to work with the old response format where there wasn't any real alternatives (as there was no binary response format in Solr). It's a lot more work to consider all the options for an XML format than use the binary format directly.
The StreamingBinaryResponseParser does the same work as the BinaryResponseParser, but has been designed to make streaming documents (i.e. not creating a list of documents and returning that list, but instead return each document by itself without having to hold them all in memory at the same time) possible. See SOLR-2112 for a description of the feature and why it was added.
Lastly, yes, if you're using SolrJ, use the binary response format, unless you have a very good reason for using the XML based one. If you have to ask the question, you're probably better off with the binary format.
Related
Hello i am looking for an binary serialisation for java that:
- use xsd for schema
- output very small byte streams
- the byte stream should not contain the field names and data types
- pojos should be generated like it is possible gor jaxb
- nice to have: an implantation in java script
Do any some one know a solution for this?
If you're using an XSD schema, the conventional expectation is that you'd be serialising to/from XML. That's not a very small byte stream; plain text is quite inefficient for representing binary data (ints, floats, etc).
However there is an option. XSD schema and ASN.1 schema are interchangeable, more or less. There's even an official translation between the two defined by the ITU. There are tools that translate between the two.
Why is this relevant? Well, with ASN.1 you have access to a variety of different wire formats. There's a bunch of binary ones, as well as text ones (including, yes, XML and JSON). The important thing is that one of the binary ones is uPER (unaligned Packed Encoding Rules), which will use the bare minimum of bits to represent the data being sent.
For example, suppose that you'd got a class with an integer field, and you'd constrained its value to be between 0 and 7. uPER would use only 3 bits for that field.
What you can have is an XSD schema being i) tranlated to ASN.1 and compiled by an ASN.1 compiler (OSS), or ii) compiled directly by an ASN.1 compiler (Obj-Sys), producing Java POJO classes that can be serialised to/from ASN.1's uPER wireformat (and all the other binary formats, and XML and JSON too for that matter, depending on the ASN.1 compiler one's using). It's a similar way of working as with jaxb / xjc.
The tooling I've suggested in the previous paragraph requires, AFAIK, the ASN.1 proprietary compilers and tools from either Objective Systems (obj-sys.com) or OSS Nokalva (www.oss.com), and they're not free (n.b. I've been a customer of both, not otherwise associated with them). I think that there's a free online converter for XSD<-->ASN1 schema, and there are some free ASN1 compilers (though they commonly target C, not Java).
Links: OSS's XSD translator, Objective System's Compiler reference for XSD translation, OSS Java support, Obj-Sys's Java support
Having wittered on about XSD, ASN.1, etc there are other options that might be usable but probably mean dropping the XSD schema and using something else.
Google Protocol Buffers
There are Java (and a load of others) bindings for Google Protocol Buffers, and GBP wireformat is binary. It's not as good as ASN.1's uPER for data size, but certainly smaller than XML text. See here. It has its own schema language, and as far as I know there's no translator between XSD and GPB's.
Capn Proto
Another interesting option (see this project), again a binary format. It won't quite beat uPER for size, but it is fast to serialise / deserialise (or at least it is in C/C++). Again, I know of no translation between its schema language and XSD.
We are currently working on trying to generate a new code using antlr. We have a grammar file that pretty much can recognize everything. Now, our problem is that we want to be able to create code again using the tokens that we generate to create this new file.
We have a .txt file with our tokens that looks like this:
[#0,0:6=' ',<75>,channel=1,1:0]
[#1,7:20='IDENTIFICATION',<6>,1:7]
[#2,21:21=' ',<75>,channel=1,1:21]
[#3,22:29='DIVISION',<4>,1:22]
[#4,30:30='.',<3>,1:30]
[#5,31:40='\n \t ',<75>,channel=1,1:31]
[#6,41:50='PROGRAM-ID',<16>,2:9]
[#7,51:51='.',<3>,2:19]
[#8,52:52=' ',<75>,channel=1,2:20]
[#9,53:59='testpro',<76>,2:21]
[#10,60:60='.',<3>,2:28]
[#11,61:70='\n \t ',<75>,channel=1,2:29]
[#12,71:76='AUTHOR',<31>,3:9]
[#13,77:77='.',<3>,3:15]
Or is there another way to create the old code using tokens?
Thanks in advance, Viktor
The most straight forward way to make the lexer output portable is to serialize the tokenized output of the lexer for transport and storage. You could equally serialize the entire parser generated parse tree. In either case, you will be capturing the full text of the source input.
The intrinsic complexity of the lexer stream object is a single class. The parse tree object complexity is also quite small, involving just a handful of standard classes. Consequently, the complexity of the serialization & deserialization is almost entirely a linear function of size of the parsed source input.
Google Gson is a simple-to-use, relatively fast Java object serialization library.
If your parser is generating some intermediate representation of the parsed source input, you could directly transport the IR using a defined record serialization library like Google FlatBuffers to save & restore IR model instances.
As the title says. I'm sending a message, from my server, into a proxy which is outside of my control which then sends it onto my application. All I can do is send and receive strings. Is it possible to serialize to a plain string and send in this way without an input/output stream as you would normally have?
TIA
A little more info:
public class myClass implements java.io.Serializable {
int h = "ccc";
int i = "bbbb";
String myString = "aaaa";
}
I have this class, for example. Now I want to serialize it and send it as a string inside my HTTPpost and send to the proxy, can't do anything about this stage:
HttpPost post = new HttpPost("http://www.myURL.com/send.php?msg="+msg);
Then receive the msg as a string on the other side and convert it back.
Is that easily done without to many other library?
Yes.
This is done every day using JSON and XML, just to name a few formats of strings that are easily formatted and parsed. (Read about JAXRS to know about a way to use JSON formatted strings to do this and do the transfers. Or, read about JAXB which will format as XML but doesn't halp with the communication of the strings.)
You can do it in CSV format.
You can do it in fixed with fields of characters.
Morse code isn't much of a different concept only it starts with strings and converts to short and long beeps.
The way it works is this:
There is some code to which you pass an object and it returns a string in a known format.
You send the string to the other server somehow. Some ways to send strings have limits on the length.
The other server receives the string.
Using its knowledge of the format, that other server parses out the string contents and uses it.
Some notes:
If both servers use Java (or C# or Python or PHP or whatever) the formatting and parsing become symetrical. You start with a Java object of some type and end up with a Java object in the other JVM of the same type. But that is not a given. You can store values in a custom POJO in one server and a Map in the other.
If you write code to format and parse, it seems really easy as long as the contents are simple and you don't run afoul of transmission rules. For example, if you send in the query part of an HTTP get, you can't have any ampersand characters in the string.
If you use an existing library, you take advantage of everyone else's acquired knowlege of how to do this without error.
If you use a standard format for the string, it is easy to explain what's going on to someone else. If your project works, a third server might want to be in the communication loop and if it's controlled by someone else ...
Formatting is easier than parsing. There are lots of pitfalls that other people have already solved. If you are doing this to learn ways not to do things and improve your own knowledge base, by all means, do it yourself. If you want rock solid performance, use an existing and standard library and format.
Take a look at XStream. It serializes into XML, and is very simple to use.
Take a look at there Two Minute Tutorial
Yes it is possible. You can use ajax to to serialize the string to a json object and have it back to the server using an ajax.post event (javascript event).
I am facing a problem with Xalon while converting Java object to String, i.e empty open close tags are converted to self closing tags. eg. <span></span> gets converted to </span>.
I have fixed simliar problem while using Saxon XSL transformer. Is it possible to use Saxon to convert a java Object to String instead of Xalon.
First, I'm sure you mean <span/> for the self-closing tag.
Second: why is this a problem? If you are generating XML, <span></span> means exactly the same as <span/>, and will be treated the same by any XML parser. (If you're reading the XML without an XML parser, then DON'T). On the other hand, if you are generating HTML, then specifying method="html" should be all you need to do, whether you are using Xalan or Saxon.
Third: I can't see any relationship between your serialization problem and the task of converting Java objects to strings.
You can certainly do such things in Saxon. The documentation for calling Java methods from Saxon can be found here: http://www.saxonica.com/documentation/extensibility/intro.xml (Sorry there's so much of it, but I don't know enough about your situation to give you a more precise pointer).
I want to come up with a binary format for passing data between application instances in a form of POFs (Plain Old Files ;)).
Prerequisites:
should be cross-platform
information to be persisted includes a single POJO & arbitrary byte[]s (files actually, the POJO stores it's names in a String[])
only sequential access is required
should be a way to check data consistency
should be small and fast
should prevent an average user with archiver + notepad from modifying the data
Currently I'm using DeflaterOutputStream + OutputStreamWriter together with InflaterInputStream + InputStreamReader to save/restore objects serialized with XStream, one object per file. Readers/Writers use UTF8.
Now, need to extend this to support the previously described.
My idea of format:
{serialized to XML object}
{delimiter}
{String file name}{delimiter}{byte[] file data}
{delimiter}
{another String file name}{delimiter}{another byte[] file data}
...
{delimiter}
{delimiter}
{MD5 hash for the entire file}
Does this look sane?
What would you use for a delimiter and how would you determine it?
The right way to calculate MD5 in this case?
What would you suggest to read on the subject?
TIA.
It looks INsane.
why invent a new file format?
why try to prevent only stupid users from changing file?
why use a binary format ( hard to compress ) ?
why use a format that cannot be parsed while being received? (receiver has to receive entire file before being able to act on the file. )
XML is already a serialization format that is compressable. So you are serializing a serialized format.
Would serialization of the model (if you are into MVC) not be another way? I'd prefer to use things in the language (or standard libraries) rather then roll my own if possible. The only issue I can see with that is that the file size may be larger than you want.
1) Does this look sane?
It looks fairly sane. However, if you are going to invent your own format rather than just using Java serialization then you should have a good reason. Do you have any good reasons (they do exist in some cases)? One of the standard reasons for using XStream is to make the result human readable, which a binary format immediately loses. Do you have a good reason for a binary format rather than a human readable one? See this question for why human readable is good (and bad).
Wouldn't it be easier just to put everything in a signed jar. There are already standard Java libraries and tools to do this, and you get compression and verification provided.
2) What would you use for a delimiter and how determine it?
Rather than a delimiter I'd explicitly store the length of each block before the block. It's just as easy, and prevents you having to escape the delimiter if it comes up on its own.
3) The right way to calculate MD5 in this case?
There is example code here which looks sensible.
4) What would you suggest to read on the subject?
On the subject of serialization? I'd read about the Java serialization, JSON, and XStream serialization so I understood the pros and cons of each, especially the benefits of human readable files. I'd also look at a classic file format, for example from Microsoft, to understand possible design decisions from back in the days that every byte mattered, and how these have been extended. For example: The WAV file format.
Let's see this should be pretty straightforward.
Prerequisites:
0. should be cross-platform
1. information to be persisted includes a single POJO & arbitrary byte[]s (files actually, the POJO stores it's names in a String[])
2. only sequential access is required
3. should be a way to check data consistency
4. should be small and fast
5. should prevent an average user with archiver + notepad from modifying the data
Well guess what, you pretty much have it already, it's built-in the platform already:Object Serialization
If you need to reduce the amount of data sent in the wire and provide a custom serialization ( for instance you can sent only 1,2,3 for a given object without using the attribute name or nothing similar, and read them in the same sequence, ) you can use this somehow "Hidden feature"
If you really need it in "text plain" you can also encode it, it takes almost the same amount of bytes.
For instance this bean:
import java.io.*;
public class SimpleBean implements Serializable {
private String website = "http://stackoverflow.com";
public String toString() {
return website;
}
}
Could be represented like this:
rO0ABXNyAApTaW1wbGVCZWFuPB4W2ZRCqRICAAFMAAd3ZWJzaXRldAASTGphdmEvbGFuZy9TdHJpbmc7eHB0ABhodHRwOi8vc3RhY2tvdmVyZmxvdy5jb20=
See this answer
Additionally, if you need a sounded protocol you can also check to Protobuf, Google's internal exchange format.
You could use a zip (rar / 7z / tar.gz / ...) library. Many exists, most are well tested and it'll likely save you some time.
Possibly not as much fun though.
I agree in that it doesn't really sound like you need a new format, or a binary one.
If you truly want a binary format, why not consider one of these first:
Binary XML (fast infoset, Bnux)
Hessian
google packet buffers
But besides that, many textual formats should work just fine (or perhaps better) too; easier to debug, extensive tool support, compresses to about same size as binary (binary compresses poorly, and information theory suggests that for same effective information, same compression rate is achieved -- and this has been true in my testing).
So perhaps also consider:
Json works well; binary support via base64 (with, say, http://jackson.codehaus.org/)
XML not too bad either; efficient streaming parsers, some with base64 support (http://woodstox.codehaus.org/, "typed access API" under 'org.codehaus.stax2.typed.TypedXMLStreamReader').
So it kind of sounds like you just want to build something of your own. Nothing wrong with that, as a hobby, but if so you need to consider it as such.
It likely is not a requirement for the system you are building.
Perhaps you could explain how this is better than using an existing file format such as JAR.
Most standard files formats of this type just use CRC as its faster to calculate. MD5 is more appropriate if you want to prevent deliberate modification.
Bencode could be the way to go.
Here's an excellent implementation by Daniel Spiewak.
Unfortunately, bencode spec doesn't support utf8 which is a showstopper for me.
Might come to this later but currently xml seems like a better choice (with blobs serialized as a Map).