Supposedly, it is "best practice" to specify the encoding when creating a String from a byte[]:
byte[] b = ...; // the bytes to decode
String a = new String(b, "UTF-8"); // 100% safe
String c = new String(b); // safe enough
If I know my installation has a default encoding of UTF-8, is it really necessary to specify the encoding for this to still be "best practice"?
Different use cases have to be distinguished here: if you get the bytes from an external source via some protocol with a specified encoding, then always use the first form (with explicit encoding).
If the source of the bytes is the local machine, for example a local text file, the second form (without explicit encoding) is better.
Always keep in mind that your program may be used on a different machine with a different platform encoding; it should work there without any changes.
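A minimal sketch of both cases (assuming the bytes are in a byte[] named buffer):
// Bytes received via a protocol that specifies UTF-8: always be explicit.
String fromProtocol = new String(buffer, StandardCharsets.UTF_8);
// Bytes produced on this machine with the platform default (e.g. a local text file):
String fromLocalFile = new String(buffer); // same as new String(buffer, Charset.defaultCharset())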
If I know my installation has a default encoding of UTF-8, is it really necessary to specify the encoding for this to still be "best practice"?
But do you know for sure that your installation will always have a default encoding of UTF-8? (Or at least, for as long as your code is used ...)
And do you know for sure that your code is never going to be used in a different installation that has a different default encoding?
If the answer to either of those is "No" (and unless you are prescient, it probably has to be "No") then I think that you should follow best practice ... and specify the encoding if that is what your application semantics requires:
If the requirement is to always encode (or decode) in UTF-8, then use "UTF-8".
If the requirement is to always encode (or decode) in using the platform default, then do that.
If the requirement is to support multiple encodings (or the requirement might change), then make the encoding name a configuration (or command-line) parameter, resolve it to a Charset object and use that.
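A minimal sketch of that last case, assuming the encoding name arrives as a command-line argument and the data sits in a byte[] named bytes:
String encodingName = args.length > 0 ? args[0] : "UTF-8"; // hypothetical source of the name
Charset charset = Charset.forName(encodingName);           // fails fast if the name is unknown
String decoded = new String(bytes, charset);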
The point of this "best practice" recommendation is to avoid a foreseeable problem that will arise if your platform's characteristics change. You don't think that is likely, but you probably can't be completely sure about it. But at the end of the day, it is your decision.
(The fact that you are actually thinking about whether "best practice" is appropriate to your situation is a GOOD THING ... in my opinion.)
I need to encode a String to a byte array using the UTF-8 encoding. I am using Google Guava; its Charsets class already defines a Charset constant for the UTF-8 encoding. There are two ways to do it:
String.getBytes(charsetName)
try {
    byte[] bytes = my_input.getBytes("UTF-8");
} catch (UnsupportedEncodingException ex) {
}
String.getBytes(Charset object)
// Charsets.UTF_8 is an instance of Charset
byte[] bytes = my_input.getBytes(Charsets.UTF_8);
My question is: which one should I use? They return the same result. For way 2 I don't have to put a try/catch! I took a look at the Java source code and I see that way 1 and way 2 are implemented differently.
Does anyone have any ideas?
If you are going to use a string literal (e.g. "UTF-8") ... you shouldn't. Instead use the second version and supply the constant value from StandardCharsets (specifically, StandardCharsets.UTF_8, in this case).
The first version is used when the charset is dynamic. This is going to be the case when you don't know what the charset is at compile time; it's being supplied by an end user, read from a config file or system property, etc.
Internally, both methods are calling a version of StringCoding.encode(). The first version of encode() is simply looking up the Charset by the supplied name first, and throwing an exception if that charset is unknown / not available.
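A short sketch of the two situations (text is a stand-in String, and charsetNameFromConfig is a hypothetical runtime value):
// Charset known at compile time: use the constant; no checked exception to handle.
byte[] fixed = text.getBytes(StandardCharsets.UTF_8);
// Charset known only at runtime (config file, user input, ...): look it up by name.
byte[] dynamic = text.getBytes(Charset.forName(charsetNameFromConfig));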
The first API is for situations when you do not know the charset at compile time; the second one is for situations when you do. Since it appears that your code needs UTF-8 specifically, you should prefer the second API:
byte[] bytes = my_input.getBytes(Charsets.UTF_8); // <<== UTF-8 is known at compile time
The first API is for situations when the charset comes from outside your program - for example, from the configuration file, from user input, as part of a client request to the server, and so on. That is why there is a checked exception thrown from it - for situations when the charset specified in the configuration or through some other means is not available.
Since they return the same result, you should use method 2: it is generally safer and more efficient to avoid asking the library to parse and possibly break on a user-supplied string. Also, avoiding the try/catch will make your own code cleaner.
Charsets.UTF_8 can be checked at compile time, which is most likely the reason you do not need a try/catch.
If you already have the Charset, then use the 2nd version as it's less error-prone.
This is a fundamental question about how Java works, so I don't have any code to support it.
I am new to Java development and want to know how the different number systems and character sets like UTF-8 and Unicode come together in Java.
Let's say a user creates a new String and an int with the same value:
int i = 100;
String s = "100";
The hardware of a computer understands zeros and ones, so it has to be converted to binary? (Correct me if I'm wrong.) This conversion should be done by the JVM (correct me if I'm wrong)? And to represent characters of different languages as characters that can be typed on an (English) keyboard, UTF-8 and such encodings are used (correction needed)?
Now how does this whole flow fit into the bigger picture of running a Java web application?
How does a String/int get converted to binary for the machine's hardware to understand?
How does it get converted to UTF-8 for a browser to understand?
And what are the default number format and character set in Java? If I'm reading the contents of a file, will they be read into binary or UTF-8?
All computers run in binary. The conversion is done by the JVM and the computer that you have. You shouldn't worry about converting the code into the corresponding 1's and 0's. The browser has its own hard-coded conversion to change the universal 1's and 0's (used by all programs and computer software) into however it decides to display the given information. All languages are just a translation guide for the user to "speak" with the computer, and vice versa. Hope this helps, though I don't think I really answered anything.
How Java represents any data type in memory is the choice of the actual JVM. In practice, the JVM will choose the format native to the processor (e.g. choose between little/big endian for int), simply because it offers the best performance on that platform.
Basically, the JLS makes certain guarantees (like that a byte has 8 bits and the values range from -128 to 127); the VM just maps that to the platform as it deems suitable (the JLS was specified to match common computing technology closely, so there is usually no magic needed to guess how primitive types map to the platform).
You should never care how the VM represents data in memory; Java does not offer any legal way to access the data in a manner where you would need to know (bypassing most of the VM's logic by using sun.misc.Unsafe is not considered legal).
If you care for educational purposes, learn what binary representations the underlying platform (e.g. x86) uses and take a look at the VM. It has little to do with Java really; it's all VM and platform specific.
For java.lang.String, it's the implementation of the class that defines how the String is stored internally (it went through quite some changes over major Java versions), but what that String exposes is quite narrowly defined (see the JDK javadoc for String.length(), String.charAt()).
As for how user input is translated to Java standard types, that's actually platform specific. The JVM selects the default encoding (e.g. String.getBytes() can return quite different results for the same string, depending on the platform; that's why it's recommended to explicitly specify the desired encoding). The same goes for many other things (time zone, number format etc.).
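A tiny sketch of that pitfall (the non-ASCII character is what makes the difference visible):
String s = "café";
byte[] platformBytes = s.getBytes();                   // depends on the platform default encoding
byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8); // identical on every platform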
CharSets and Formats are building blocks the program wires up to translate data from the outside world (file, HTTP or user input) into Java's representation of data (or vice versa). For example, a web application will use the encoding from an HTTP header to determine what Charset to use when interpreting the contents (the HTTP header itself is defined to be US-ASCII by the spec).
I want to write a reversible Encoder along with the corresponding Decoder, so that any string may be encoded to a legal file name conforming to the file naming rules of the Unix file system.
How to achieve this?
Example:
"xyz.txt" would be a valid file name, while "xyz/.txt" would not.
tl;dr: Your approach is flawed. Stick with the limitations of the file system. They're pretty hard to gracefully overcome (especially without introducing your own, even weirder limitations).
It's not possible to make one that is strictly decodable. You're trying to map a larger domain to a smaller domain, which means that the reverse mapping cannot be unambiguous.
This is easy to demonstrate with a simple example: how do you encode / such that it can be reversed? "Easy," you might say, "I'll just replace it with the token x." But now how do you know when an x is an actual x and when your x is a 'special' x that should be converted back to /? You can't.
You can of course make a system that is very unlikely to have any accidental clashes. For example, rather than changing / to - (which would be very error prone), you could change it to ---.
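A hedged sketch of that "unlikely clash" idea (the helper names are made up for illustration):
// Replace '/' with a token that is improbable in real names.
static String encode(String name) {
    return name.replace("/", "---");
}
static String decode(String fileName) {
    return fileName.replace("---", "/");
}
// Still not strictly reversible: a name that already contains "---" decodes incorrectly.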
Oh, also, for what it's worth, most Unix file systems actually consider any character other than / or the null character to be valid in a file name. Obviously using that is a pain in the ass though.
I've found a piece of code recently, which does the following:
String s = ... // whatever
...
s = new String(s.getBytes(myEncoding), myEncoding);
For me it appears to be absolute nonsense.
Is it possible that under certain circumstances (some specific combination of locale settings, used technologies, etc.), this code will do something useful?
Thanks in advance
Yes, that code is generally nonsense, and yes, it's possible that it could be doing "something" to the string (probably not anything good). Generally speaking, if you have already incorrectly converted bytes to chars, trying to re-convert is rarely going to give you legitimate results (there may be isolated instances where the right combination of character encodings happens to work).
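A small demonstration of how such a round trip destroys data when the charset cannot represent the characters (my own minimal example, not the question's code):
String s = "π"; // not representable in ISO-8859-1
String t = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1);
System.out.println(t); // prints "?": the data was already lost in getBytes() and cannot be recovered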
I want to come up with a binary format for passing data between application instances in the form of POFs (Plain Old Files ;)).
Prerequisites:
should be cross-platform
information to be persisted includes a single POJO & arbitrary byte[]s (files actually; the POJO stores its names in a String[])
only sequential access is required
should be a way to check data consistency
should be small and fast
should prevent an average user with archiver + notepad from modifying the data
Currently I'm using DeflaterOutputStream + OutputStreamWriter together with InflaterInputStream + InputStreamReader to save/restore objects serialized with XStream, one object per file. Readers/Writers use UTF-8.
Now, I need to extend this to support what is described above.
My idea of format:
{serialized to XML object}
{delimiter}
{String file name}{delimiter}{byte[] file data}
{delimiter}
{another String file name}{delimiter}{another byte[] file data}
...
{delimiter}
{delimiter}
{MD5 hash for the entire file}
Does this look sane?
What would you use for a delimiter and how would you determine it?
The right way to calculate MD5 in this case?
What would you suggest to read on the subject?
TIA.
It looks INsane.
why invent a new file format?
why try to prevent only stupid users from changing the file?
why use a binary format (hard to compress)?
why use a format that cannot be parsed while being received? (The receiver has to receive the entire file before being able to act on it.)
XML is already a serialization format that is compressible. So you are serializing a serialized format.
Would serialization of the model (if you are into MVC) not be another way? I'd prefer to use things in the language (or standard libraries) rather than roll my own if possible. The only issue I can see with that is that the file size may be larger than you want.
1) Does this look sane?
It looks fairly sane. However, if you are going to invent your own format rather than just using Java serialization then you should have a good reason. Do you have any good reasons (they do exist in some cases)? One of the standard reasons for using XStream is to make the result human readable, which a binary format immediately loses. Do you have a good reason for a binary format rather than a human readable one? See this question for why human readable is good (and bad).
Wouldn't it be easier just to put everything in a signed JAR? There are already standard Java libraries and tools to do this, and you get compression and verification provided.
2) What would you use for a delimiter and how would you determine it?
Rather than a delimiter I'd explicitly store the length of each block before the block. It's just as easy, and prevents you from having to escape the delimiter if it comes up on its own.
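A minimal sketch of length-prefixed blocks (assuming a DataOutputStream named out and the block's bytes in a byte[] named block):
out.writeInt(block.length); // length first...
out.write(block);           // ...then the payload; no delimiter, no escaping
// The reader mirrors this: int len = in.readInt(); then in.readFully(buf) for exactly len bytes.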
3) The right way to calculate MD5 in this case?
There is example code here which looks sensible.
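For reference, a minimal MD5 computation with the standard MessageDigest API looks roughly like this:
MessageDigest md = MessageDigest.getInstance("MD5"); // declares NoSuchAlgorithmException, though MD5 is always present
md.update(data);                                     // can be called repeatedly while streaming the file
byte[] digest = md.digest();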
4) What would you suggest to read on the subject?
On the subject of serialization? I'd read about Java serialization, JSON, and XStream serialization so I understood the pros and cons of each, especially the benefits of human readable files. I'd also look at a classic file format, for example from Microsoft, to understand possible design decisions from back in the days when every byte mattered, and how these have been extended. For example: the WAV file format.
Let's see, this should be pretty straightforward.
Prerequisites:
0. should be cross-platform
1. information to be persisted includes a single POJO & arbitrary byte[]s (files actually; the POJO stores its names in a String[])
2. only sequential access is required
3. should be a way to check data consistency
4. should be small and fast
5. should prevent an average user with archiver + notepad from modifying the data
Well guess what, you pretty much have it already; it's built into the platform: Object Serialization.
If you need to reduce the amount of data sent over the wire and provide a custom serialization (for instance, you can send only the values 1, 2, 3 for a given object without the attribute names or anything similar, and read them back in the same sequence), you can use this somewhat hidden feature, as sketched below.
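A rough sketch of that technique using Externalizable (the class and fields are made up for illustration):
import java.io.*;
public class Point implements Externalizable {
    private int x, y;
    public Point() {} // a public no-arg constructor is required by Externalizable
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(x); // raw values only, in a fixed order, no field names
        out.writeInt(y);
    }
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        x = in.readInt(); // read back in exactly the same order
        y = in.readInt();
    }
}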
If you really need it in plain text you can also encode it; it takes about the same number of bytes.
For instance this bean:
import java.io.*;

public class SimpleBean implements Serializable {
    private String website = "http://stackoverflow.com";

    public String toString() {
        return website;
    }
}
Could be represented like this:
rO0ABXNyAApTaW1wbGVCZWFuPB4W2ZRCqRICAAFMAAd3ZWJzaXRldAASTGphdmEvbGFuZy9TdHJpbmc7eHB0ABhodHRwOi8vc3RhY2tvdmVyZmxvdy5jb20=
See this answer
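One way such a text form can be produced, sketched with the standard library (java.util.Base64 plus plain object serialization):
ByteArrayOutputStream bos = new ByteArrayOutputStream();
try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
    oos.writeObject(new SimpleBean()); // serialize the bean to bytes
}
String encoded = Base64.getEncoder().encodeToString(bos.toByteArray());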
Additionally, if you need a sound protocol you can also check out Protobuf, Google's data interchange format.
You could use a zip (rar / 7z / tar.gz / ...) library. Many exists, most are well tested and it'll likely save you some time.
Possibly not as much fun though.
I agree in that it doesn't really sound like you need a new format, or a binary one.
If you truly want a binary format, why not consider one of these first:
Binary XML (fast infoset, Bnux)
Hessian
Google protocol buffers
But besides that, many textual formats should work just fine (or perhaps better) too; they are easier to debug, have extensive tool support, and compress to about the same size as binary (binary compresses poorly, and information theory suggests that for the same effective information the same compression rate is achieved; this has been true in my testing).
So perhaps also consider:
JSON works well; binary support via base64 (with, say, http://jackson.codehaus.org/)
XML not too bad either; efficient streaming parsers, some with base64 support (http://woodstox.codehaus.org/, "typed access API" under 'org.codehaus.stax2.typed.TypedXMLStreamReader').
So it kind of sounds like you just want to build something of your own. Nothing wrong with that, as a hobby, but if so you need to consider it as such.
It likely is not a requirement for the system you are building.
Perhaps you could explain how this is better than using an existing file format such as JAR.
Most standard file formats of this type just use a CRC, as it's faster to calculate. MD5 is more appropriate if you want to prevent deliberate modification.
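For what it's worth, a CRC is cheap to compute with java.util.zip (a minimal sketch, assuming the file's bytes are in data):
CRC32 crc = new CRC32();
crc.update(data);               // feed the bytes (can be called repeatedly while streaming)
long checksum = crc.getValue();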
Bencode could be the way to go.
Here's an excellent implementation by Daniel Spiewak.
Unfortunately, the bencode spec doesn't support UTF-8, which is a showstopper for me.
I might come back to this later, but currently XML seems like a better choice (with blobs serialized as a Map).