I am developing a socket application that needs to receive XML files over a socket. The XML files received vary in size from 1 KB to 100 KB. I am now thinking of storing the received data in a temporary file first and then passing it to the XML parser, but I am not sure whether that is the proper way to do it.
Another question: if I do it as mentioned above, should I pass a file object or a file path to the XML parser?
Thanks in advance,
Regards
Just send it straight to the parser. That's what browsers do. Adding a temp file costs you time and space with no actual benefit.
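For example, with the standard JAXP SAX API you can hand the socket's InputStream to the parser directly. A minimal sketch, assuming a TCP socket whose sender closes the connection after the document; the endpoint and the empty handler are placeholders:
import java.io.InputStream;
import java.net.Socket;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

try (Socket socket = new Socket("localhost", 9000)) {          // hypothetical endpoint
    InputStream in = socket.getInputStream();
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(in, new DefaultHandler());                    // plug in your own handler
}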
Do you think it would work to put a BufferedReader around whatever input stream you have? It wouldn't put the data into a temporary file, but it would let you hang onto it. You can create the BufferedReader with whatever buffer size you need.
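For instance (a sketch, assuming the data arrives as UTF-8 text over a TCP socket; the 64 KB buffer size is just an example):
// wrap the socket's stream in a BufferedReader with an explicit buffer size
BufferedReader reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream(), "UTF-8"), 64 * 1024);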
Did you write your XML parser? If you didn't, what will it accept as a parameter? If you did write it, are you asking about efficiency, that is, whether your parser should take the path or the file object to be most efficient?
You do not have to store the data from the socket in any file. Just read the whole DataInputStream into a byte array and you can then do whatever you need, e.g. create a String with the XML input to feed the parser, if needed. (I am assuming TCP sockets.)
If there is any preceding data, skip it so that only the actual XML data is fed to the parser.
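Something along these lines (a sketch assuming a TCP socket and the standard DOM parser; the class name and buffer size are illustrative):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.net.Socket;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlReceiver {
    public static Document receive(Socket socket) throws Exception {
        DataInputStream in = new DataInputStream(socket.getInputStream());
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {        // assumes the sender closes the stream
            buffer.write(chunk, 0, n);
        }
        // feed the collected bytes straight to the parser, no temporary file needed
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(buffer.toByteArray()));
    }
}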
I have a project in a concurrent and distributed programming course.
In this course we use Erlang.
I need to use a database from an XML file that already has a parser written in Java (this is the link for the XML and the parser: https://dblp.org/faq/1474681.html).
The XML file is 2.5 GB, so I understand that the first step is to create a number of processes in Erlang that will parse the XML, with each process parsing a chunk of it.
The thing is that this is the first time I'm doing something like this (combining Erlang and Java, and parsing a really big XML file), so I'm not sure how to approach the problem. Should I divide the XML into chunks before I start to parse it? Somehow set a start and an end for each process that parses the XML?
Just to clarify: the course is about Erlang and using processes in Erlang, so I must use it (I know there are Java multi-threading solutions, but that's not an option here).
I would really appreciate any ideas or help!
Thanks!
You can do it in Erlang without using Java. You do not need to read the whole file before processing it. You should use an XML parser that supports a streaming API. I recommend fast_xml, which is very fast (it uses C functions to parse the XML).
After initializing the stream parser state, in a loop (a recursive function) you read the file chunk by chunk (for example 1024 bytes per chunk) and give each chunk to the parser. When the parser finds new XML elements, it sends them to your callback process in the form of Erlang messages. In your callback process you can spawn more processes to work on each XML element.
I have implemented a program that transfers text files using a UDP socket in Java. I am using a PrintWriter to write and read, but with that I am not able to transfer any file other than text (say I want to transfer a PDF). What should be done in this case? I am using the code below to write the file.
Output_File_Write = new PrintWriter("dummy.txt");
Output_File_Write.print(new String(p.getData()));
Writers / PrintWriters are for writing text files. They take (Unicode-based) character data and encode it using the default character encoding (or a specified one), and write that to the file.
A PDF document (as you get it from the network) is in a binary format, so you need to use a FileOutputStream to write the file.
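For example (a sketch based on the snippet in the question; the method name and the append flag are my own choices):
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.DatagramPacket;

public class PacketWriter {
    // p is the DatagramPacket received in the question's code
    static void writePacket(DatagramPacket p, String fileName) throws IOException {
        try (FileOutputStream out = new FileOutputStream(fileName, true)) {   // append each packet
            // write exactly the bytes received, with no character encoding involved
            out.write(p.getData(), p.getOffset(), p.getLength());
        }
    }
}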
It is also a little bit concerning that you are attempting to transfer documents using UDP. UDP provides no guarantees that the datagrams sent will all arrive, or that they will arrive in the same order as they were sent. Unless you can always fit the entire document into a single datagram, you will have to do a significant amount of work to detect that datagrams have been dropped or have arrived in the wrong order ... and take remedial action.
Using TCP would be far simpler.
AFAIK PrintWriter is meant to be used with text. Quoting from the docs:
Prints formatted representations of objects to a text-output stream. This class implements all of the print methods found in PrintStream. It does not contain methods for writing raw bytes, for which a program should use unencoded byte streams.
To be able to send binary data you would need to use an API apt for it, for example PrintStream (which, unlike PrintWriter, can also write raw bytes).
I have a socket connection to an external system which accepts commands and sends results in XML. Every command and result is a standalone XML document.
Which Java parser (or combination of libraries) should I use to:
parse the stream continuously without closing the connection (I know it's stupid, but I tried DOMParser in the past and it throws an exception when another document root is encountered on the stream, which is perfectly understandable). I need something like: continuously read the stream and, when a document has been fully received, process it. I don't know in advance how big a document is, so I need to leave it to the parser to figure out where the document ends.
deserialize every incoming document into bean instances (similarly to what XStream does)
serialize command objects to the output stream from annotated class instances (similarly to what XStream does). I don't want to use two separate libraries for sending and receiving.
Well... XStream.createObjectInputStream seems to be what you need. I'm not sure whether the stream provided must enclose all objects in a root node, but in any case you could arrange an input stream that adds some virtual content to accommodate XStream's needs. I'll expand this answer later...
http://x-stream.github.io/objectstream.html has some samples...
Root node
Indeed, the reader needs a root node. So you need an input stream that yields <object-stream>, plus the real byte content, plus a </object-stream> at the end (if you care about the ending). Depending on what you need (input streams or readers) the implementation can be slightly different, but it can be done.
Sample
You can use SequenceInputStream to concatenate virtual content to the original input stream:
InputStream realOne = ..
// beware of the encoding!
InputStream root = new ByteArrayInputStream("<object-stream>".getBytes("UTF-8"));
InputStream all = new SequenceInputStream(root, realOne);
ObjectInputStream in = xstream.createObjectInputStream(new InputStreamReader(all, "UTF-8")); // voilà
If you use readers... well. There must be something equivalent :)
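Once you have that ObjectInputStream, you would typically read from it in a loop until the stream is exhausted. A rough continuation of the sample above (handleCommand is a hypothetical method standing in for your own processing):
// 'in' is the ObjectInputStream created from the wrapped stream above
try {
    while (true) {
        Object bean = in.readObject();   // one deserialized document per call
        handleCommand(bean);             // hypothetical: replace with your own processing
    }
} catch (EOFException endOfStream) {
    // the virtual </object-stream> end was reached: no more documents
} catch (ClassNotFoundException e) {
    // a document mapped to a class that is not on the classpath
}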
Your best bet is probably a SAX parser. With it, you can implement a ContentHandler and, in its endDocument method, do the processing and prepare for the next document. Have a look at this page for an explanation and examples: http://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
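A bare-bones sketch of that idea (what you collect in the callbacks is entirely up to you; the class name is arbitrary):
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ResultHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // collect whatever you need from the element and its attributes
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // the document is complete: process what you collected and reset for the next one
        text.setLength(0);
    }
}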
I'd say read one complete response, then parse it, then read the next one. I see no need to read responses continuously.
I have a servlet which reads a BINARY file and sends it to a client.
byte[] binaryData = FileUtils.readFileToByteArray(path);
response.getWriter().print(new String(binaryData));
It works for NON-BINARY files. When I have a BINARY file, the received file is larger than the original, or its content does not match. How can I read and send binary data?
Thanks.
Not via the Writer. Writers are for text data, not binary data. Your current code is trying to interpret arbitrary binary data as text, using the system default encoding. That's a really bad idea.
You want an output stream - so use response.getOutputStream(), and write the binary data to that:
response.getOutputStream().write(FileUtils.readFileToByteArray(path));
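If you go that route, it may also be worth telling the client what it is getting; a small variation on the above (the MIME type is an assumption, use whatever matches your file):
byte[] binaryData = FileUtils.readFileToByteArray(path);
response.setContentType("application/octet-stream");   // or the file's real MIME type
response.setContentLength(binaryData.length);
response.getOutputStream().write(binaryData);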
Do not use a Writer: it encodes your characters, and there will not always be a 1:1 mapping between characters and bytes (as you have experienced). Instead, use the OutputStream directly.
Also avoid reading the full content into memory if you don't need it all at once; serving many parallel requests would quickly consume memory. FileUtils has methods for this:
FileUtils.copyFile(path, response.getOutputStream());
I want to write multiple objects to a file, but the problem is that I don't have all the objects at once. I have to write one object and then close the file, and then maybe after some time I want to add another object to the same file.
I am currently doing it as
new FileOutputStream("filename", true)
so that it will append the object to the end of the file and not overwrite it. But I get this error:
java.io.StreamCorruptedException: invalid type code: AC
Any ideas how I can solve this issue?
Thanks,
One option is to segment the file into individual messages. When you want to write a message, first serialize it to a ByteArrayOutputStream. Then open the file for appending with DataOutputStream - write the length with writeInt, then write the data.
When you're reading from the stream, you'd open it with DataInputStream, then repeatedly call readInt to find the length of the next message, then readFully to read the message itself. Put the message into ByteArrayInputStream and then deserialize from that.
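A rough sketch of that scheme, using the built-in serialization for the individual messages (class and method names are just placeholders):
import java.io.*;
import java.util.*;

public class MessageFile {
    // serialize one object to a buffer, then append "length + bytes" to the file
    static void append(File file, Object message) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(message);
        }
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file, true))) {
            out.writeInt(buffer.size());   // length prefix
            buffer.writeTo(out);           // message bytes
        }
    }

    // read "length + bytes" records until end of file, deserializing each one
    static List<Object> readAll(File file) throws IOException, ClassNotFoundException {
        List<Object> messages = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            while (true) {
                int length;
                try {
                    length = in.readInt();
                } catch (EOFException endOfFile) {
                    break;                 // no more messages
                }
                byte[] data = new byte[length];
                in.readFully(data);
                try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
                    messages.add(ois.readObject());
                }
            }
        }
        return messages;
    }
}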
Alternatively, use a nicer serialization format than the built-in Java serialization - I'm a fan of Protocol Buffers but there are lots of alternatives available. The built-in serialization is too brittle for my liking.
You can't append different ObjectOutputStreams to the same file. You would have to use a different form of serialization, or read the file in and write out all the objects plus the new objects to a new file.
You need to serialize/deserialize the List<T>. Take a look at this Stack Overflow thread.
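In other words, instead of appending single objects you rewrite the whole list each time; a minimal sketch (the class name and file handling are placeholders, and every element must be Serializable):
import java.io.*;
import java.util.*;

public class ListStore {
    static void saveAll(List<?> items, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ArrayList<Object>(items));   // ArrayList itself is Serializable
        }
    }

    static List<?> loadAll(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (List<?>) in.readObject();
        }
    }
}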