I want to transfer some database data through a TCP socket. The data is formatted as JSON.
Since the database size might grow, I'm afraid that the maximum size of a String object will not be enough to store the entire data with JSON formatting.
I already had a problem transferring the data using the DataOutput method writeUTF().
What should I do? Maybe convert the database rows to CSV and transfer them through the Internet line by line? Or do I not need to worry about String limits, and can I solve the writeUTF() problem by getting the bytes of the String, transferring them through the socket, and rebuilding the String from the bytes at the destination?
Java strings can be extremely long - you're unlikely to run into problems with the String type itself. If you convert the string to binary first, then use writeInt to write the number of bytes, then the bytes themselves, that should be fine. The problem with writeUTF is that it writes the length with writeShort, so it can only handle up to 65,535 bytes of encoded data.
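For illustration, here is a minimal sketch of that length-prefixed approach (the send/receive method names are made up for the example; the streams would come from your socket):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedTransfer {

    // Writer side: prefix the payload with a 4-byte length, then the raw bytes.
    static void send(OutputStream rawOut, String json) throws IOException {
        DataOutputStream out = new DataOutputStream(rawOut);
        byte[] payload = json.getBytes(StandardCharsets.UTF_8);
        out.writeInt(payload.length); // int length: no 64K writeUTF limit
        out.write(payload);
        out.flush();
    }

    // Reader side: read the length, then exactly that many bytes.
    static String receive(InputStream rawIn) throws IOException {
        DataInputStream in = new DataInputStream(rawIn);
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload); // blocks until all bytes have arrived
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```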
Related
I have a Ruby program that writes data to a socket with sock.write, and I'm reading the data with ObjectInputStream in a Java file. I'm getting an invalid header error that translates to the first few characters of my stream.
I've read that if you use ObjectInputStream you must write with ObjectOutputStream, but since the writing file is in Ruby I'm not sure how to accomplish this.
As you say, ObjectInputStream assumes that the bytes it's receiving have been formatted by an ObjectOutputStream. That is, it is expecting the incoming bytes to be a specific representation of a Java primitive or object.
Your Ruby code is unlikely to format bytes in such a way.
You need to define exactly the byte format of the message passing from the Ruby to the Java process. You could tell us more about that message format, but it's likely you will need to use Java's ByteArrayInputStream (https://docs.oracle.com/javase/7/docs/api/java/io/ByteArrayInputStream.html). The data will come into the Java program as a raw array of bytes, and you will need to parse/unpack/process these bytes into whatever objects are appropriate.
Unless performance is critical, you'd probably be best off using JSON or YAML as the intermediate format. They would make it simple to send simple objects such as strings, arrays, and hashes (maps).
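As a sketch of that suggestion, here is what the Java side could look like if the Ruby process writes one UTF-8 JSON document per line (the port number and the newline-delimited framing are assumptions; on the Ruby side that would be something like sock.write(payload.to_json + "\n")):

```java
import java.io.*;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RubyJsonReceiver {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(4000); // hypothetical port
             Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each line is one JSON document from the Ruby side;
                // parse it with any JSON library (Jackson, Gson, json-io, ...).
                System.out.println("received: " + line);
            }
        }
    }
}
```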
I am building a history parser; there's an application that has already done the logging task (text based).
Now my supervisor wants me to create an application to read that log.
The log is created at the end of the month, and is separated by [date]:
[19-11-2014]
- what goes here
- what goes here
[20-11-2014]
- what goes here
- what goes here
etc...
If the log file is small, there's no problem processing the content with a DataInputStream to get the byte[], converting it to a String, and then doing the filtering (by doing substring and such).
But when the file is large (about 100 MB), it throws a Java heap space error. I know that this is because the length of the content exceeds the String max length; when I try not to convert the byte[] into a String, no exception is thrown.
Now the question is, how do I split the byte[] into several byte[]s,
where each new byte[] contains only a single:
[date]
- what goes here
So if within a month we have 9 dates in log, it would be split into 9 byte[].
The splitting marker would be based on [\\d{2}-\\d{2}-\\d{4}]; if it were a String, I could just use a regex to find all the markers, get the indexOf, and then substring it.
But how do I do this without converting to a String first? As that would throw the Java heap space error.
I think there are several concepts here that you're missing.
First, an InputStream is a Stream, which means it is a flow of bytes. What you do with that flow is up to you, but saving all of the stream to memory defies the point of the stream construct altogether.
Second, a DataInputStream is used to read back primitive values that were written to a binary stream by a DataOutputStream. Using it just to read text is overkill for this type of Stream, since a plain InputStream (or better, a Reader) can do that.
As for your specific problem, I would use a BufferedReader and read one line at a time until reaching the next date. At that point you can do whatever processing you need on the last chunk of lines you read, then free the memory, thus not running into the same problem.
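A minimal sketch of that chunk-by-date approach (the file name and the process method are placeholders):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LogSplitter {
    // Matches marker lines such as [19-11-2014]
    private static final Pattern DATE_MARKER = Pattern.compile("\\[\\d{2}-\\d{2}-\\d{4}\\]");

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("history.log"), StandardCharsets.UTF_8))) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                if (DATE_MARKER.matcher(line).matches() && !chunk.isEmpty()) {
                    process(chunk); // handle the previous date's entries
                    chunk.clear();  // free the memory before collecting the next date
                }
                chunk.add(line);
            }
            if (!chunk.isEmpty()) {
                process(chunk);     // last date in the file
            }
        }
    }

    static void process(List<String> chunk) {
        // Placeholder: do the filtering/substring work for one date here.
        System.out.println("chunk starting with: " + chunk.get(0));
    }
}
```

Only one date's worth of lines is ever held in memory at a time, so the file size no longer matters.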
I have a BufferedReader object and a PrintWriter object, so I can work by passing String objects made by json-io of any type (e.g.: List, Map, MyOwnClass).
My class has a byte[] attribute; this byte[] will hold a file's bytes, such as an image.
The JSON generated from my class is very, very big, obviously... Then I started to think that there must be a better way to transfer files.
Should I change the whole mechanism to transfer only byte[] instead of String? Does someone know what mechanism chat programs use? Should I reserve the first 20 bytes of the array for the message identification?
I would write it to the socket in binary:
Assuming a class with one String and one byte[].
The String
The length of the String is written with DataOutputStream.writeInt(int) (or methods for smaller integers) and then OutputStream.write(byte[]) on the return value of String.getBytes(String) with the charset explicitly specified.
The byte[]
The length is written with DataOutputStream.writeInt(int) (or methods for smaller integers) and then OutputStream.write(byte[]) for the byte[] to transfer.
On the other side you would do the exact opposite of this procedure.
I chose this binary approach over JSON because even though you could transmit the byte[] with JSON almost as efficiently as in binary, it would defeat the very purpose of JSON: being human-readable.
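A minimal sketch of that procedure, assuming a hypothetical message of one String plus one byte[]:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class BinaryTransfer {

    static void write(DataOutputStream out, String name, byte[] data) throws IOException {
        // The String: length of its encoded bytes, then the bytes, charset pinned explicitly.
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        out.writeInt(nameBytes.length);
        out.write(nameBytes);
        // The byte[]: its length, then the raw bytes.
        out.writeInt(data.length);
        out.write(data);
        out.flush();
    }

    // The exact opposite on the receiving side.
    static void read(DataInputStream in) throws IOException {
        byte[] nameBytes = new byte[in.readInt()];
        in.readFully(nameBytes);
        String name = new String(nameBytes, StandardCharsets.UTF_8);

        byte[] data = new byte[in.readInt()];
        in.readFully(data);
        // ... reconstruct your object from name and data here ...
    }
}
```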
I have some data in bytes, and I want to put it into Redis, but Redis only accepts binary-safe strings, and my data has some bytes that are not binary safe. So how can I convert these bytes into a binary-safe string so that I can save them to Redis?
Base64 works for me, but it makes the data larger; any better ideas?
UPDATE: I want to serialize my protobuf object to Redis, and the serialized data has '\x00' in it, so when I read the data back from Redis, I cannot deserialize it into an object. Then I tried base64; it works fine, but with a larger size.
So I want to figure out how to serialize binary data (a protobuf object) to Redis safely and with a smaller size.
You could try ISO-8859-1 encoding. This uses a one-to-one mapping between bytes and chars.
This could still result in corruption depending on why Redis needs this "binary safe" string. You may have to use base64.
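A quick sketch of that round trip (the sample bytes are arbitrary):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Iso88591RoundTrip {
    public static void main(String[] args) {
        byte[] original = { 0x00, (byte) 0xFF, 0x41, 0x7F }; // arbitrary binary data
        // ISO-8859-1 maps every byte value 0-255 to exactly one char, and back.
        String asString = new String(original, StandardCharsets.ISO_8859_1);
        byte[] restored = asString.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(original, restored)); // true
    }
}
```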
The only safe way to serialize a binary object (such as a protobuf object) is to base64 encode it. Base64 has a 33% overhead, but gives you the ability to safely convert arbitrary binary data to text (such as for use in an XML file) and back.
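With the built-in java.util.Base64 that round trip looks like this (the sample bytes stand in for serialized protobuf output):

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] serialized = { 0x00, 0x01, (byte) 0xFE }; // e.g. protobuf output containing '\x00'
        String safe = Base64.getEncoder().encodeToString(serialized); // text-safe, ~33% larger
        byte[] back = Base64.getDecoder().decode(safe);
        System.out.println(Arrays.equals(serialized, back)); // true
    }
}
```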
I read the following bit from Oracle:
Can I execute methods on compressed versions of my objects, for example isempty(zip(serial(x)))?
This is not really viable for arbitrary objects because of the encoding of objects. For a particular object (such as String) you can compare the resulting bit streams. The encoding is stable, in that every time the same object is encoded it is encoded to the same set of bits.
So I got this idea: say I have a char array about 4M long, is it possible for me to compress it to several hundred bytes using GZIPOutputStream, then map the whole file into memory and do random searches on it by comparing bits? Say I am looking for the char sequence "abcd" - could I somehow get the bit sequence of the compressed version of "abcd", and then just search the file for it? Thanks.
You cannot use GZIP or similar to do this, as the encoding of each byte changes as the stream is processed; i.e. the only way to determine what a byte means is to read all the bytes before it.
If you want to access the data randomly, you can break the String into smaller sections. That way you only need to decompress a relatively short section of data.
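A sketch of that sectioned approach (the chunk size is an arbitrary choice):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ChunkedCompression {
    static final int CHUNK_CHARS = 64 * 1024; // arbitrary section size

    // Compress each fixed-size section of the text independently.
    static List<byte[]> compress(String text) throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += CHUNK_CHARS) {
            String section = text.substring(i, Math.min(text.length(), i + CHUNK_CHARS));
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(buf)) {
                gzip.write(section.getBytes(StandardCharsets.UTF_8));
            }
            blocks.add(buf.toByteArray());
        }
        return blocks;
    }

    // Decompress only the block that covers the region you want to search.
    static String decompress(byte[] block) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(block));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] tmp = new byte[8192];
            int n;
            while ((n = gzip.read(tmp)) != -1) {
                out.write(tmp, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

Note that a match straddling two sections would be missed; a real implementation would overlap the sections by the length of the longest search term.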