This may sound foolish, but I'm wondering all the same...
Is it possible to take a string composed of a given character set and compress it by using a bigger character set, or by converting it into a number and then converting that back into a string?
For example, if you had a string that you know will be composed of [a-z][A-Z][0-9]-_+=, could you turn that into a number, then swap it back using more characters in order to compress it?
This is an area I'm not familiar with. I still want to keep it as a string, just a shorter one (for displaying/echoing/etc., not memory).
I wouldn't bother doing that, unless the string is huge. You can then try to compress it with commons-compress or java.util.zip
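As a rough illustration of the java.util.zip route, here is a minimal sketch using Deflater and Inflater. The class name ZipStringSketch and its two helper methods are made up for this example. Note that the result is a byte array, not a displayable string, and that short inputs can come out larger than they went in.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, not part of any library.
final class ZipStringSketch {

    // Compresses the UTF-8 bytes of the text with DEFLATE.
    static byte[] compress(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses compress(); malformed input is reported as IllegalArgumentException.
    static String decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid DEFLATE data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Repetitive text shrinks a lot with this approach; short or random-looking strings generally do not.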
A String internally keeps an array of 16-bit characters, which is wasteful for Western European languages. You can convert to UTF-8, which should give you a reduction of up to 50% for mostly-ASCII text, by doing
String myString = .....
ByteArrayOutputStream baos = new ByteArrayOutputStream();
baos.write(myString.getBytes("UTF-8"));
byte[] data = baos.toByteArray();
and hold onto it as a byte array.
Of course this is rather inconvenient if you actually want to use them as Strings, but if the point is long-term storage without much access, this would save you a bunch.
You would have to do the reverse to recreate a String.
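A minimal sketch of both directions, assuming the text is stored as UTF-8. The class and method names are invented for illustration. getBytes already returns a fresh array, so the stream wrapper is not strictly needed, and StandardCharsets avoids the checked exception thrown by the string-named overload.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class Utf8RoundTrip {

    // String -> compact byte[] (1 byte per ASCII character).
    static byte[] toBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // byte[] -> String, the reverse step.
    static String fromBytes(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }
}
```

Characters outside ASCII take two or more bytes in UTF-8, so the saving depends on the content.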
String is a core built-in type (a class, not a primitive); you are unlikely to regain any space by converting it unless you use Java's zip library, and even that will not yield the performance benefits you are presumably seeking.
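For the re-encoding idea in the question itself (string to number, number to a string over a bigger alphabet), here is a sketch using BigInteger. Everything in it is an assumption made for illustration: the class name, the choice of the 94 printable ASCII characters from '!' to '~' as the larger alphabet, and the leading sentinel digit used to preserve leading symbols. Going from 66 symbols to 94 can only save roughly 8% of the length, since log(66)/log(94) is about 0.92.

```java
import java.math.BigInteger;

// Hypothetical helper class, not part of any library.
final class AlphabetRecoder {
    // The 66-symbol alphabet from the question: a-z, A-Z, 0-9 and -_+=
    static final String SOURCE =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_+=";
    // Assumed larger alphabet: the 94 printable ASCII characters '!'..'~'.
    static final String TARGET = buildTarget();

    private static String buildTarget() {
        StringBuilder sb = new StringBuilder();
        for (char c = '!'; c <= '~'; c++) sb.append(c);
        return sb.toString();
    }

    // Reads the text as a base-66 number (with a leading 1 as sentinel so that
    // leading 'a' symbols survive) and writes that number in base 94.
    static String encode(String text) {
        BigInteger from = BigInteger.valueOf(SOURCE.length());
        BigInteger to = BigInteger.valueOf(TARGET.length());
        BigInteger value = BigInteger.ONE;
        for (int i = 0; i < text.length(); i++) {
            int digit = SOURCE.indexOf(text.charAt(i));
            if (digit < 0) throw new IllegalArgumentException("character outside alphabet");
            value = value.multiply(from).add(BigInteger.valueOf(digit));
        }
        StringBuilder out = new StringBuilder();
        while (value.signum() > 0) {
            BigInteger[] qr = value.divideAndRemainder(to);
            out.append(TARGET.charAt(qr[1].intValue()));
            value = qr[0];
        }
        return out.reverse().toString();
    }

    // Reverses encode(): base 94 back to a number, then base-66 digits until
    // only the sentinel 1 is left.
    static String decode(String encoded) {
        BigInteger from = BigInteger.valueOf(SOURCE.length());
        BigInteger to = BigInteger.valueOf(TARGET.length());
        BigInteger value = BigInteger.ZERO;
        for (int i = 0; i < encoded.length(); i++) {
            int digit = TARGET.indexOf(encoded.charAt(i));
            if (digit < 0) throw new IllegalArgumentException("character outside alphabet");
            value = value.multiply(to).add(BigInteger.valueOf(digit));
        }
        StringBuilder out = new StringBuilder();
        while (value.compareTo(BigInteger.ONE) > 0) {
            BigInteger[] qr = value.divideAndRemainder(from);
            out.append(SOURCE.charAt(qr[1].intValue()));
            value = qr[0];
        }
        return out.reverse().toString();
    }
}
```

The output is still a printable string, but the gain is small and the cost grows quickly with length, which fits the advice above not to bother unless the string is huge.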
Related
I am currently trying to process a large txt file (a bit less than 2GB) containing lines of strings.
I am loading all its content from an InputStream to a List<String>. I do that via the following snippet :
try(BufferedReader reader = new BufferedReader(new InputStreamReader(zipInputStream))) {
List<String> data = reader.lines()
.collect(Collectors.toList());
}
The problem is that the file itself is less than 2GB, but when I look at the memory, the JVM is allocating twice the size of the file:
Also, here are the heaviest objects in memory:
So what I understand is that Java is allocating twice the memory needed for the operation: once to put the content of the file in a byte array and once more to instantiate the string list.
My question is: can we optimize that and avoid using twice the memory needed?
tl;dr String objects can take 2 bytes per character.
The long answer: conceptually a String is a sequence of char. Each char will represent one Codepoint (or half of one, but we can ignore that detail for now).
Each codepoint tends to represent a character (sometimes multiple codepoints make up one "character", but that's another detail we can ignore for this answer).
That means that if you read a 2 GB text file that was stored with a single-byte encoding (usually a member of the ISO-8859-* family) or variable-byte encoding (mostly UTF-8), then the size in memory in Java can easily be twice the size on disk.
Now there's a good number of caveats to this, primarily that Java can (as an internal, invisible operation) use a single byte for each character in a String if and only if the characters used allow it (effectively if they fit into the fixed internal encoding that the JVM picked for this). But that didn't seem to happen for you.
What can you do to avoid that? That depends on what your use-case is:
Don't use String to store the data in the first place. Odds are that this data is actually representing some structure, and if you parse it into a dedicated format, you might get away with way less memory usage.
Don't keep the whole thing in memory: more often than not, you don't actually need everything in memory at once. Instead process and write away the data as you read it, thus never having more than a handful of records in memory at once.
Build your own string-like data type for your specific use-case. While building a full string replacement is a massive undertaking, if you know what subset of features you need it might actually be a quite surmountable challenge.
Try to make sure that the data is stored as compact strings, if possible, by figuring out why that's not already happening (this requires digging deep into the details of your JVM).
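The "don't keep the whole thing in memory" option can be sketched as follows. The per-line work here (counting lines and characters) is only a stand-in for whatever processing the real application does, the class name is invented, and UTF-8 is an assumed file encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class StreamingLines {

    // Reads the stream line by line and keeps only running totals, so each
    // line becomes garbage as soon as it has been processed.
    // Returns {lineCount, charCount}.
    static long[] summarize(InputStream in) {
        long lines = 0;
        long chars = 0;
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines++;
                chars += line.length(); // placeholder for the real per-line work
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {lines, chars};
    }
}
```

With this shape, memory use is bounded by the longest line plus the reader's buffer rather than by the file size.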
I need to copy the content of many large, different Strings into a static char array and use that array frequently in an efficiency-critical job, so it's important to avoid allocating too much new space.
For that reason, str.toCharArray() is ruled out, since it allocates a new array for every String.
As we all know, charAt(i) is slower and more complex than indexing with square brackets [i], so I want to use a byte[] or char[].
The good news is that there's str.getBytes(srcBegin, srcEnd, dst, dstBegin). The bad news is that it was (or is to be?) deprecated.
So how can we finish this demanding job?
I believe you want getChars(int, int, char[], int). That will copy the characters into the specified array, and I'd expect it to do it "as efficiently as reasonably possible".
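A minimal sketch of how getChars can fill one reusable buffer. The class name, the initial buffer size and the counting loop are assumptions for illustration. The shared static buffer makes this unsafe for concurrent use.

```java
// Hypothetical helper class, not part of any library.
final class ReusableCharBuffer {
    // Shared scratch buffer, allocated once. The initial size is an assumption.
    private static char[] buffer = new char[1 << 16];

    // Copies s into the shared buffer and returns the number of valid chars.
    static int load(String s) {
        int n = s.length();
        if (n > buffer.length) {
            // Grow only when a string does not fit; this should be rare.
            buffer = new char[Math.max(n, buffer.length * 2)];
        }
        s.getChars(0, n, buffer, 0);
        return n;
    }

    // Example hot loop: plain array indexing instead of charAt(i).
    static int count(String s, char target) {
        int n = load(s);
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (buffer[i] == target) hits++;
        }
        return hits;
    }
}
```

Only the first n entries of the buffer are valid after load; anything beyond that is left over from earlier strings.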
You should avoid converting between text and binary representations unless you really need to. Aside from anything else, that conversion itself is likely to be time-consuming.
A small stocktaking:
String does Unicode text; it can be normalized (java.text.Normalizer).
int[] code points are Unicode symbols
char[] is Unicode UTF-16 (2 bytes per char); sometimes a code point needs 2 chars: a surrogate pair.
byte[] is for binary data. Holding Unicode text in UTF-8 is relatively compact when much of it is ASCII or Latin-1.
Processing might be done on a ByteBuffer, CharBuffer, IntBuffer.
When dealing with Asian scripts, int code points probably is most feasible.
Otherwise bytes seem best.
Code points (or chars) also make sense when the Character class is utilized for classification of Unicode blocks and scripts, digits in several scripts, emoji, whatever.
For performance, bytes are usually the best choice because they are often the most compact representation; probably UTF-8.
One cannot deal with memory allocation efficiently here. getBytes should be used with a Charset, and almost always some kind of conversion happens. Because newer Java versions can keep a byte array instead of a char array internally for text that fits an encoding like Latin-1 (ISO-8859-1), even accessing an internal char array would not help. And new arrays are created.
What one can do, is using fast ByteBuffers.
Alternatively for lingual analysis one can use databases, maybe graph databases. At least something which can exploit parallelism.
You are pretty much restricted to the APIs offered within the String class, and obviously, that deprecated method is supposed to be replaced with getBytes() (or an alternative that allows you to specify a charset).
In other words: that problem you are talking about "having many large strings, that need to go into arrays" can't be solved easily.
Thus a distinct non-answer: look into your design. If performance is really critical, then do not create those many large strings upfront!
In other words: if your measurements convince you that you do have real performance issue, then adapt your design as needed. Maybe there is a chance that in the place where your strings are "coming" in ... you already do not use String objects, but something that works better for you, later on, performance wise.
But of course: that will lead to a complex, error prone solution, where you do a lot of "memory management" yourself. Thus, as said: measure first. Ensure that you have a real problem, and it actually sits in the place you think it sits.
str.getBytes(srcBegin, srcEnd, dst, dstBegin) is indeed deprecated. The relevant documentation recommends getBytes() instead. If you needed str.getBytes(srcBegin, srcEnd, dst, dstBegin) because sometimes you don't have to convert the entire string I suppose you could substring() first, but I'm not sure how badly that would impact your code's efficiency, if at all. Or if it's all the same to you if you store it in char[] then you can use getChars(int,int,char[],int) which is not deprecated.
Given the following example:
String f="FF00000000000000";
byte[] bytes = DatatypeConverter.parseHexBinary(f);
String f2= new String (bytes);
I want the output to be FF00000000000000 but it's not working with this method.
You're currently trying to interpret the bytes as if they were text encoded using the platform default encoding (UTF-8, ISO-8859-1 or whatever). That's not what you actually want to do at all - you want to convert it back to hex.
For that, just look at the converter you're using for the parsing step, and look for similar methods which work in the opposite direction. In this case, you want printHexBinary:
String f2 = DatatypeConverter.printHexBinary(bytes);
The approach of "look for reverse operations near the original operation" is a useful one in general... but be aware that sometimes you need to look at a parallel type, e.g. DataInputStream / DataOutputStream. When you find yourself using completely different types for inverse operations, that's usually a bit of a warning sign. (It's not always wrong, it's just worth investigating other options.)
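DatatypeConverter lives in javax.xml.bind, which is no longer bundled with JDK 11 and later, so here is a dependency-free sketch of the same parse/print pair. The class and method names are made up for this example; on Java 17 and later, java.util.HexFormat offers a built-in alternative.

```java
// Hypothetical helper class, not part of any library.
final class HexRoundTrip {
    private static final char[] DIGITS = "0123456789ABCDEF".toCharArray();

    // "FF00" -> {(byte) 0xFF, 0x00}
    static byte[] parseHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit");
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    // The inverse operation: each byte becomes two upper-case hex digits.
    static String printHex(byte[] bytes) {
        char[] out = new char[bytes.length * 2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xFF; // treat the byte as unsigned
            out[2 * i] = DIGITS[v >>> 4];
            out[2 * i + 1] = DIGITS[v & 0x0F];
        }
        return new String(out);
    }
}
```

The two methods sit next to each other and mirror each other, which is the "look for the reverse operation nearby" idea in practice.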
I read from ORACLE of the following bit:
Can I execute methods on compressed versions of my objects, for example isempty(zip(serial(x)))?
This is not really viable for arbitrary objects because of the encoding of objects. For a particular object (such as String) you can compare the resulting bit streams. The encoding is stable, in that every time the same object is encoded it is encoded to the same set of bits.
So I got this idea, say if I have a char array of 4M something long, is it possible for me to compress it to several hundreds of bytes using GZIPOutputStream, and then map the whole file into memory, and do random search on it by comparing bits? Say if I am looking for a char sequence of "abcd", could I somehow get the bit sequence of compressed version of "abcd", and then just search the file for it? Thanks.
You cannot use GZIP or similar to do this, as the encoding of each byte changes as the stream is processed, i.e. the only way to determine what a byte means is to read all the previous bytes.
If you want to access the data randomly, you can break the String into smaller sections. That way you only need to decompress a relatively short section of data.
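The "smaller sections" idea could look roughly like this. It is a sketch under assumptions: the class name, the fixed chunk size in chars, and the use of Deflater rather than GZIPOutputStream are all choices made for the example. Each chunk is compressed independently, so a lookup inflates only the chunk that contains the requested index.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, not part of any library.
final class ChunkedCompressedText {
    private final int chunkChars;
    private final int length;
    private final List<byte[]> chunks = new ArrayList<>();

    ChunkedCompressedText(String text, int chunkChars) {
        if (chunkChars <= 0) throw new IllegalArgumentException("chunkChars must be positive");
        this.chunkChars = chunkChars;
        this.length = text.length();
        for (int start = 0; start < length; start += chunkChars) {
            int end = Math.min(length, start + chunkChars);
            // Two bytes per char (high byte first), so offsets stay fixed-width.
            byte[] raw = new byte[(end - start) * 2];
            for (int i = start; i < end; i++) {
                char c = text.charAt(i);
                raw[2 * (i - start)] = (byte) (c >>> 8);
                raw[2 * (i - start) + 1] = (byte) c;
            }
            chunks.add(deflate(raw));
        }
    }

    // Inflates only the chunk that holds the requested index.
    char charAt(int index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException("index " + index);
        byte[] raw = inflate(chunks.get(index / chunkChars));
        int offset = (index % chunkChars) * 2;
        return (char) (((raw[offset] & 0xFF) << 8) | (raw[offset + 1] & 0xFF));
    }

    // Total size of all compressed chunks in bytes.
    int compressedSize() {
        int total = 0;
        for (byte[] chunk : chunks) total += chunk.length;
        return total;
    }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) throw new IllegalStateException("truncated chunk");
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt chunk", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

Smaller chunks make lookups cheaper but compress worse, so the chunk size is a trade-off. Searching for a sequence such as "abcd" still means inflating chunks and handling matches that straddle a chunk boundary.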
I am trying to pass a byte[] containing ASCII characters to log4j, to be logged into a file using the obvious representation. When I simply pass in the byte[] it is of course treated as an object and the logs are pretty useless. When I try to convert them to strings using new String(byte[] data), the performance of my application is halved.
How can I efficiently pass them in, without incurring the approximately 30us time penalty of converting them to strings?
Also, why does it take so long to convert them?
Thanks.
Edit
I should add that I am optimising for latency here - and yes, 30us does make a difference! Also, these arrays vary from ~100 all the way up to a few thousand bytes.
ASCII is one of the few encodings that can be converted to/from UTF16 with no arithmetic or table lookups so it's possible to convert manually:
String convert(byte[] data) {
StringBuilder sb = new StringBuilder(data.length);
for (int i = 0; i < data.length; ++ i) {
if (data[i] < 0) throw new IllegalArgumentException();
sb.append((char) data[i]);
}
return sb.toString();
}
But make sure it really is ASCII, or you'll end up with garbage.
What you want to do is delay processing of the byte[] array until log4j decides that it actually wants to log the message. This way you could log it at DEBUG level, for example, while testing and then disable it during production. For example, you could:
final byte[] myArray = ...;
Logger.getLogger(MyClass.class).debug(new Object() {
@Override public String toString() {
return new String(myArray);
}
});
Now you don't pay the speed penalty unless you actually log the data, because the toString method isn't called until log4j decides it'll actually log the message!
Now I'm not sure what you mean by "the obvious representation" so I've assumed that you mean convert to a String by reinterpreting the bytes as the default character encoding. Now if you are dealing with binary data, this is obviously worthless. In that case I'd suggest using Arrays.toString(byte[]) to create a formatted string along the lines of
[54, 23, 65, ...]
If your data is in fact ASCII (i.e. 7-bit data), then you should be using new String(data, "US-ASCII") instead of depending on the platform default encoding. This may be faster than trying to interpret it as your platform default encoding (which could be UTF-8, which requires more introspection).
You could also speed this up by avoiding the Charset-Lookup hit each time, by caching the Charset instance and calling new String(data, charset) instead.
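A small sketch of the cached-Charset variant, assuming Java 7 or later, where StandardCharsets provides ready-made constants. The class and method names are invented for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class AsciiDecode {
    // Resolved once, so no charset-name lookup happens per call.
    private static final Charset ASCII = StandardCharsets.US_ASCII;

    static String decode(byte[] data) {
        return new String(data, ASCII);
    }
}
```

Whether this is measurably faster than the string-named overload depends on the JVM, so it is worth benchmarking before relying on it.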
Having said that: it's been a very, very long time since I've seen real ASCII data in a production environment.
Halved performance? How large is this byte array? If it's for example 1MB, then there are certainly more factors to take into account than just "converting" from bytes to chars (which is supposed to be fast enough though). Writing 1MB of data instead of "just" 100 bytes (which the byte[].toString() may generate) to a log file is obviously going to take some time. The disk file system is not as fast as RAM.
You'll need to change the string representation of the byte array. Maybe log some more sensible information instead, e.g. the name associated with it (filename?), its length and so on. After all, what does that byte array actually represent?
Edit: I don't remember seeing the "approximately 30us" phrase in your question; maybe you edited it in within 5 minutes after asking. Either way, this is micro-optimization and it should certainly not cause "halved performance" in general, unless you write them a million times per second (and even then, why would you want to do that? Aren't you overusing logging?).
Take a look here: Faster new String(bytes, cs/csn) and String.getBytes(cs/csn)