Java String.getBytes(charsetName) vs String.getBytes(Charset object)

I need to encode a String to a byte array using UTF-8 encoding. I am using Google Guava, whose Charsets class already defines a Charset instance for UTF-8. I have two ways to do this:
String.getBytes(charsetName)
try {
    byte[] bytes = my_input.getBytes("UTF-8");
} catch (UnsupportedEncodingException ex) {
}
String.getBytes(Charset object)
// Charsets.UTF_8 is an instance of Charset
byte[] bytes = my_input.getBytes(Charsets.UTF_8);
My question is: which one should I use? They return the same result, and with way 2 I don't have to write a try/catch! I took a look at the Java source code and saw that way 1 and way 2 are implemented differently.
Does anyone have any ideas?

If you are going to use a string literal (e.g. "UTF-8") ... you shouldn't. Instead use the second version and supply the constant value from StandardCharsets (specifically, StandardCharsets.UTF_8, in this case).
The first version is used when the charset is dynamic. This is going to be the case when you don't know what the charset is at compile time; it's being supplied by an end user, read from a config file or system property, etc.
Internally, both methods call a version of StringCoding.encode(). The charset-name version simply looks up the Charset by the supplied name first and throws an exception if that charset is unknown or not available.
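For the fixed-charset case in the question, a minimal sketch using the JDK's own constant rather than Guava's (StandardCharsets is in java.nio.charset):
byte[] bytes = my_input.getBytes(StandardCharsets.UTF_8); // no checked exception - the charset is guaranteed to exist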

The first API is for situations when you do not know the charset at compile time; the second one is for situations when you do. Since it appears that your code needs UTF-8 specifically, you should prefer the second API:
byte[] bytes = my_input.getBytes(Charsets.UTF_8); // <<== UTF-8 is known at compile time
The first API is for situations when the charset comes from outside your program - for example, from the configuration file, from user input, as part of a client request to the server, and so on. That is why there is a checked exception thrown from it - for situations when the charset specified in the configuration or through some other means is not available.
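A sketch of that dynamic case (the property name here is only an illustration):
String charsetName = System.getProperty("output.charset", "UTF-8"); // e.g. read from a config file
try {
    byte[] bytes = my_input.getBytes(charsetName);
} catch (UnsupportedEncodingException ex) {
    // the configured name is not a supported charset on this JVM - report it back to whoever supplied it
}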

Since they return the same result, you should use method 2: it is generally safer and more efficient to avoid asking the library to parse, and possibly break on, a user-supplied string. Also, avoiding the try/catch makes your own code cleaner.
Charsets.UTF_8 can be checked at compile time, which is most likely the reason you do not need a try/catch.

If you already have the Charset, use the second version, as it is less error-prone.

Related

Error which "shouldn't happen" caused by MalformedInputException when reading file to string with UTF-16

Path file = Paths.get("New Text Document.txt");
try {
    System.out.println(Files.readString(file, StandardCharsets.UTF_8));
    System.out.println(Files.readString(file, StandardCharsets.UTF_16));
} catch (Exception e) {
    System.out.println("yep it's an exception");
}
might yield
some text
Exception in thread "main" java.lang.Error: java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.lang.String.decodeWithDecoder(String.java:1212)
at java.base/java.lang.String.newStringNoRepl1(String.java:786)
at java.base/java.lang.String.newStringNoRepl(String.java:738)
at java.base/java.lang.System$2.newStringNoRepl(System.java:2390)
at java.base/java.nio.file.Files.readString(Files.java:3369)
at test.Test2.main(Test2.java:13)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.lang.String.decodeWithDecoder(String.java:1205)
... 5 more
This error "shouldn't happen". Here's the java.lang.String method:
private static int decodeWithDecoder(CharsetDecoder cd, char[] dst, byte[] src, int offset, int length) {
    ByteBuffer bb = ByteBuffer.wrap(src, offset, length);
    CharBuffer cb = CharBuffer.wrap(dst, 0, dst.length);
    try {
        CoderResult cr = cd.decode(bb, cb, true);
        if (!cr.isUnderflow())
            cr.throwException();
        cr = cd.flush(cb);
        if (!cr.isUnderflow())
            cr.throwException();
    } catch (CharacterCodingException x) {
        // Substitution is always enabled,
        // so this shouldn't happen
        throw new Error(x);
    }
    return cb.position();
}
EDIT: As #user16320675 noted, this happens when a UTF-8 file with an odd number of characters is read as UTF-16. With an even number of characters, neither the Error nor the MalformedInputException occurs. Why the Error?
This is a bug introduced in JDK 17.
Prior to this version, this Error-throwing code was only used for the String constructor, which indeed can never encounter a CharacterCodingException because it configures the decoder to substitute illegal content.
E.g., when you use
String s = new String(new byte[] { 50 }, StandardCharsets.UTF_16);
System.out.println(s.chars()
.mapToObj(c -> String.format(" U+%04x", c)).collect(Collectors.joining("", s, "")));
you’ll get
� U+fffd
In JDK 17, the code was refactored and the duplication removed. Now the same method, decodeWithDecoder, is used for both the String constructor and Files.readString. But Files.readString is supposed to report encoding errors instead of substituting the problematic content; therefore, the decoder has intentionally not been configured to substitute malformed content.
When you run
Path p = Files.write(Files.createTempFile("charset", "test"), new byte[] { 50 });
try (Closeable c = () -> Files.delete(p)) {
    String s = Files.readString(p, StandardCharsets.UTF_16);
}
under JDK 16, you’ll correctly get
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.lang.StringCoding.newStringNoRepl1(StringCoding.java:1053)
at java.base/java.lang.StringCoding.newStringNoRepl(StringCoding.java:1003)
at java.base/java.lang.System$2.newStringNoRepl(System.java:2265)
at java.base/java.nio.file.Files.readString(Files.java:3353)
at first.test17.CharsetProblem.main(CharsetProblem.java:23)
The now-removed dedicated routine threw the MalformedInputException encapsulated in an IllegalArgumentException. The immediate caller looks like
/*
 * Throws CCE, instead of replacing, if unmappable.
 */
static byte[] getBytesNoRepl(String s, Charset cs) throws CharacterCodingException {
    try {
        return getBytesNoRepl1(s, cs);
    } catch (IllegalArgumentException e) {
        // getBytesNoRepl1 throws IAE with UnmappableCharacterException or CCE as the cause
        Throwable cause = e.getCause();
        if (cause instanceof UnmappableCharacterException) {
            throw (UnmappableCharacterException) cause;
        }
        throw (CharacterCodingException) cause;
    }
}
and there lies the problem. When the code was refactored to use the same routine for the String constructor and Files.readString, this caller was not adapted. It still expects an IllegalArgumentException where the common method now throws an Error. Or the common method should have been adapted to better suit both cases, e.g. by having a parameter telling whether CharacterCodingException exceptions should be possible or not.
It’s worth noting that the charset decoding code has a lot of optimizations and shortcuts for commonly used charsets. That’s why you rarely get to this specific method. UTF-16 seems to be one (if not the) rare case where this method is used.
There are different things going on here. But, yeah, it sure looks like you found a JVM bug! Congratulations, I think :)
But, some context to explain precisely what's going on and what you found. I think your code's got bigger problems of your own making, and once you solve those, the JVM bug will no longer be a problem for you (but, by all means, do report it!). I'll try to cover all concerns:
Your code is broken because UTF-8 and UTF-16 are fundamentally incompatible. The upshot is that saving an even number of (ASCII) characters as UTF-8 is likely to result in something that can be read as UTF-16 without error, although what you read will be utter gobbledygook. With an odd number of characters, you'll run into decoding errors.
The JVM is buggy! You found a JVM bug - the effect of the decoding error should not be that an Error is thrown. The specific bug is that substitution doesn't actually cover all failure conditions, but the code is written with the assumption that it would.
The bug appears to be related to improper application of lenient mode, which requires explaining what substitution and underflow are.
UTF-8 vs. UTF-16
When you convert characters to bytes or vice versa, you are using a charset encoding.
Files are byte sequences, not characters.
There are no exceptions to these rules.
Hence, if you are typing characters, and saving, and you're not picking a charset encoding? Somebody is. If you're bashing on your keyboard in notepad.exe and saving, then notepad's picking one for you. You can't not have an encoding.
To try to explain the nuances of what happens here, forget about programming for a moment.
We decide on a protocol: you think of a way to describe a person using a single adjective; you write it down on a piece of paper (just the adjective) and give it to me. I then read it and guess which of our circle of friends you are attempting to describe. I happen to be bilingual and speak fluent Dutch and English. You don't know this, or you do, but we never discussed this part of the protocol between us.
You begin, and think of a particularly lanky person, so you decide to write down "slim", on the note. You leave the room, I enter, and I pick up the note.
I make a wrong assumption: I assume you wrote it in Dutch. So I read this note and, thinking you wrote it in Dutch, I read 'slim', which is an actual Dutch word, but it means "smart". Had you written down, say, "tall" on your note instead, this would not have occurred: "tall" is not in the Dutch dictionary, so I'd know that you made an 'error' (you wrote an invalid word - it was valid to you, but I'm reading it assuming it's Dutch, so I'd think you made a mistake). But "slim", those exact 4 letters, happens to be both valid Dutch AND valid English, and it doesn't mean the same thing at all.
UTF-8 vs UTF-16 is exactly like that: There are character sequences you can encode with UTF-16 that produce a byte stream, which so happens to also be entirely valid UTF-8, but it means something completely different, and vice versa! But there are also sequences of characters that, if saved as UTF-16 and then read as UTF-8 (or vice versa) would be invalid.
So, the "slim" situation can occur, and the "tall" situation can occur. Either one is mostly useless to you: When I read your note and see "Slim", and I thought that meant 'smart', we still 'lost' and I picked the wrong friend - no better a result. So what point is there, right? Anytime you convert chars to bytes and back again, every conversion step along the path needs to use the exact same encoding for all that beforehand or its never going to work.
But HOW it fails - that's the rub: When you wrote "slim" - I just picked the wrong friend. When you wrote "tall", I exclaimed that an error had occurred as that isn't a dutch word.
UTF-16 translates each character into a sequence of 2 or 4 bytes, depending on the character. When you save plain ASCII characters as UTF-8, they all end up being 1 byte each, and in general any 2 such bytes, decoded as a single UTF-16 character, 'is valid' (but a completely different character, completely unrelated to the input!). So if you save 8 ASCII chars as UTF-8 (or as ASCII - it boils down to the same stream of bytes) and then read them as UTF-16, it's highly likely that no exception is thrown. You get a 4-character string of gobbledygook out, though.
Let's try it!
String test = "gerikg";
byte[] saveAsUtf8 = test.getBytes(StandardCharsets.UTF_8);
String readAsUtf16 = new String(saveAsUtf8, StandardCharsets.UTF_16);
System.out.println(test);
System.out.println(readAsUtf16);
... results in:
gerikg
来物歧
See? Complete gobbledygook - unrelated chinese characters came out.
But, now lets go with an odd number:
String test = "gerikgw";
byte[] saveAsUtf8 = test.getBytes(StandardCharsets.UTF_8);
String readAsUtf16 = new String(saveAsUtf8, StandardCharsets.UTF_16);
System.out.println(test);
System.out.println(readAsUtf16);
gerikgw
来物歧�
Note that weird question-mark thing: that's a glyph (a glyph is an entry in a font: the symbol used to represent some character) which indicates that something went wrong here - this isn't a real character, but a decoding error.
But shove gerikgw in a text file (make sure it has no trailing newline, as that's a character too), run your code, and indeed - JVM BUG! Nice find!
Substitution
That weird question-mark symbol is a 'substitution'. UTF encodings can represent any Unicode code point. The Unicode system has room for over a million addressable characters (actually, not quite; some slots are intentionally marked as never-to-be-used, for fun reasons but too unrelated to go into here), but not every available slot is 'filled'. There's room for new characters if we need them later. Also, not every sequence of bytes is necessarily valid UTF-8.
So, what to do when 'invalid' input is detected? One option, in strict parsing mode, is to crash (throw something). Another is to 'read' the error as the 'error' character (shown with that question-mark glyph when you print it to a screen) and pick up where we left off. UTF-8 is a pretty cool format that 'knows' when a new character starts, so you can never get an offset issue (where we're 'offset by half' and keep reading stuff wrong because of misalignment).
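To see both modes side by side, here is a small self-contained sketch (the class name and the three sample bytes are mine; the last byte is deliberately left dangling):
import java.nio.ByteBuffer;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeModes {
    public static void main(String[] args) throws Exception {
        byte[] dangling = { 0x67, 0x65, 0x72 }; // two bytes decode to one UTF-16 char, one byte is left over

        // lenient: malformed input is replaced with U+FFFD and decoding continues
        System.out.println(StandardCharsets.UTF_16.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .decode(ByteBuffer.wrap(dangling))); // prints 来 followed by the substitution glyph

        // strict: the same input makes the decoder throw MalformedInputException
        StandardCharsets.UTF_16.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(dangling));
    }
}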
The JVM bug
This explains the code you've pasted: that malformed-encoding stuff 'cannot occur', as per the comment, because lenient mode is on, so any errors should just result in substitutions. Except it happened right there. This is a really silly error, one of those that make the author of the code visibly and audibly slap their forehead in pure shame:
In this case, there is a single byte left over at the end of the input, but in the UTF-16 world every valid byte representation is at least 2 bytes long. The decoder (CharsetDecoder cd) isn't buggy - it correctly detects this and reports it as malformed input, so if (!cr.isUnderflow()) cr.throwException(); results in cr.throwException() being executed, which naturally throws MalformedInputException, a subtype of CharacterCodingException. Control therefore hops straight to the catch block four lines below, which then says "this cannot happen".
Conclusion: the author had a brainfart moment. Only one of two things can be true:
A dangling byte can never occur here, ever. The brainfart is then that there's an if checking for the impossible, which is pointless.
A dangling byte CAN occur here, and the comment in the catch block is therefore incorrect: substitution doesn't fix this problem.
The correct code would presumably be, instead:
private static int decodeWithDecoder(CharsetDecoder cd, char[] dst, byte[] src, int offset, int length) {
    ByteBuffer bb = ByteBuffer.wrap(src, offset, length);
    CharBuffer cb = CharBuffer.wrap(dst, 0, dst.length);
    try {
        CoderResult cr = cd.decode(bb, cb, true);
        if (!cr.isUnderflow())
            cr.throwException();
        cr = cd.flush(cb);
        if (!cr.isUnderflow())
            cb.put('\uFFFD'); // emit the substitution character instead of throwing
    } catch (CharacterCodingException x) {
        // Substitution is always enabled,
        // so this shouldn't happen
        throw new Error(x);
    }
    return cb.position();
}
In other words - if this occurs, emit one substitution char (to represent the 'un-character' left by that dangling single byte that doesn't mean anything), and just return the result. After all, that fits the strategy of lenient mode, and the comment says that we're evidently in lenient mode ("Substitution is always enabled").
I suggest you file a bug with the OpenJDK project, or search for it first.
To work around it until it's fixed...
The workaround
Replace:
Files.readString(file, StandardCharsets.UTF_16);
with:
fixedReadString(file, StandardCharsets.UTF_16);
...
public static String fixedReadString(Path file, Charset charset) throws IOException {
    try {
        return Files.readString(file, charset);
    } catch (Error e) {
        if (!(e.getCause() instanceof MalformedInputException)) throw e;
        return "\uFFFD"; // see notes below for other options
    }
}
The one remaining question is what you want to do when this occurs. The input is definitely problematic, and I generally despise 'lenient' mode, so I'd just throw a MalformedInputException and, in general, rewrite it all to use strict mode instead. However, if you want to duplicate the intended effect (which is "来物歧�" - not useful, but it is what the code was supposed to return), that's not actually all that easy to recreate. You can pray that adding a random character at the end (say, a space) and re-parsing will at least produce something, you could rewrite the entire functionality of Files.readString yourself (not too complicated), or just return "�" - tossing away the entire string and leaving only the one substitution character, which should at least help someone debug their way to: ah, right, I'm using the wrong charset to read this file.
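If you do want the lenient result mentioned above, one way to approximate it is to read the bytes yourself and let the String constructor do the substituting, since (as discussed) that constructor always replaces malformed input (a sketch; the method name is mine):
static String lenientReadString(Path file, Charset charset) throws IOException {
    return new String(Files.readAllBytes(file), charset); // substitutes U+FFFD instead of throwing
}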

JNA call with String behaves differently from one with byte[]

I have a JNA Java interface for a C function mpv_set_option_string defined as:
public interface MPV extends StdCallLibrary {
    MPV INSTANCE = Native.loadLibrary("lib/mpv-1.dll", MPV.class, W32APIOptions.DEFAULT_OPTIONS);

    long mpv_create();
    int mpv_initialize(long handle);
    int mpv_set_option_string(long handle, String name, String data);
}
When I call this like this:
System.setProperty("jna.encoding", "UTF8");
long handle = MPV.INSTANCE.mpv_create();
int error = MPV.INSTANCE.mpv_initialize(handle);
error = MPV.INSTANCE.mpv_set_option_string(handle, "keep-open", "always");
I get an error back (-5) from the last call, indicating the option (keep-open) is not found.
However, if I change the JNA function signature to:
int mpv_set_option_string(long handle, byte[] name, byte[] data);
...and then call it like this:
error = MPV.INSTANCE.mpv_set_option_string(
    handle,
    "keep-open\0".getBytes(StandardCharsets.UTF_8),
    "always\0".getBytes(StandardCharsets.UTF_8)
);
...it returns no error (0) and works correctly (or so it seems).
What I don't get is that JNA is supposed to encode String by default as a NUL-terminated char * using UTF-8 (exactly what I do manually), yet I get different results.
Anyone able to shed some light on this?
You shouldn't be passing W32APIOptions to a library that isn't a Win32 API.
By default, JNA maps String to char*, so removing the options should fix the issue for you.
You should also be using an explicit native type for your handle instead of Java long. Pointer is probably correct in this case.
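Putting both suggestions together, the mapping might look roughly like this (a sketch, not verified against mpv; it assumes the library uses the default C calling convention, hence Library rather than StdCallLibrary):
public interface MPV extends Library {
    MPV INSTANCE = Native.loadLibrary("lib/mpv-1.dll", MPV.class); // no W32APIOptions

    Pointer mpv_create();
    int mpv_initialize(Pointer handle);
    int mpv_set_option_string(Pointer handle, String name, String data);
}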
Looks like I found the issue, although I'm not 100% sure what is happening.
It seems that using W32APIOptions.DEFAULT_OPTIONS means it will use the UNICODE settings (because w32.ascii property is false). This looked okay to me, as mpv-1.dll works with UTF-8 strings only, which is Unicode.
However, now I'm guessing that in this case it means it will call a wide-char version of the library function (and if that doesn't exist, still call the original function), and probably means it encodes Strings with two bytes per character. This is because most Win32 libraries have an ASCII and WIDE version of methods accepting strings, but nothing for UTF-8.
Since mpv-1.dll only accepts UTF-8 (and isn't really Win32), strings should just be encoded as UTF-8 bytes (basically, left alone). To let JNA know this, either do not pass a W32APIOptions map at all, or select ASCII_OPTIONS manually.

String to Byte[] and Byte to String

Given the following example:
String f = "FF00000000000000";
byte[] bytes = DatatypeConverter.parseHexBinary(f);
String f2 = new String(bytes);
I want the output to be FF00000000000000 but it's not working with this method.
You're currently trying to interpret the bytes as if they were text encoded using the platform default encoding (UTF-8, ISO-8859-1 or whatever). That's not what you actually want to do at all - you want to convert it back to hex.
For that, just look at the converter you're using for the parsing step, and look for similar methods which work in the opposite direction. In this case, you want printHexBinary:
String f2 = DatatypeConverter.printHexBinary(bytes);
The approach of "look for reverse operations near the original operation" is a useful one in general... but be aware that sometimes you need to look at a parallel type, e.g. DataInputStream / DataOutputStream. When you find yourself using completely different types for inverse operations, that's usually a bit of a warning sign. (It's not always wrong, it's just worth investigating other options.)
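Putting the two halves together (note that DatatypeConverter lives in javax.xml.bind, which is no longer bundled with recent JDKs and may need to be added as a dependency):
import javax.xml.bind.DatatypeConverter;

String f = "FF00000000000000";
byte[] bytes = DatatypeConverter.parseHexBinary(f);
String f2 = DatatypeConverter.printHexBinary(bytes); // "FF00000000000000" again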

Is specifying String encoding when parsing byte[] really necessary?

Supposedly, it is "best practice" to specify the encoding when creating a String from a byte[]:
byte[] bytes;
String a = new String(bytes, "UTF-8"); // 100% safe
String b = new String(bytes); // safe enough
If I know my installation has default encoding of utf8, is it really necessary to specify the encoding to still be "best practice"?
Different use cases have to be distinguished here: If you get the bytes from an external source via some protocol with a specified encoding then always use the first form (with explicit encoding).
If the source of the bytes is the local machine, for example a local text file, the second form (without explicit encoding) is better.
Always keep in mind that your program may be used on a different machine with a different platform encoding. It should work there without any changes.
If I know my installation has default encoding of utf8, is it really necessary to specify the encoding to still be "best practice"?
But do you know for sure that your installation will always have a default encoding of UTF-8? (Or at least, for as long as your code is used ...)
And do you know for sure that your code is never going to be used in a different installation that has a different default encoding?
If the answer to either of those is "No" (and unless you are prescient, it probably has to be "No") then I think that you should follow best practice ... and specify the encoding if that is what your application semantics requires:
If the requirement is to always encode (or decode) in UTF-8, then use "UTF-8".
If the requirement is to always encode (or decode) in using the platform default, then do that.
If the requirement is to support multiple encodings (or the requirement might change) then make the encoding name a configuration (or command line) parameter, resolve to a Charset object and use that.
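For that last case, a minimal sketch (the property name is only an illustration):
String name = System.getProperty("app.encoding", "UTF-8"); // from config or the command line
Charset cs = Charset.forName(name); // throws UnsupportedCharsetException if the name is unknown
String a = new String(bytes, cs);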
The point of this "best practice" recommendation is to avoid a foreseeable problem that will arise if your platform's characteristics change. You don't think that is likely, but you probably can't be completely sure about it. But at the end of the day, it is your decision.
(The fact that you are actually thinking about whether "best practice" is appropriate to your situation is a GOOD THING ... in my opinion.)

Preventing "Null Byte Attacks" | Java

My initial understanding of this topic is that I need to filter out certain junk characters in the request to avoid these attacks.
I have decided to solve this by pattern matching every request parameter before using it. Most of the posts available on the internet talk about the null byte, and the examples given show how file I/O is the main victim of this attack. So the following are my questions:
Is file I/O the only thing a null byte can affect, or are other operations also victims of this attack?
What are the characters/strings/patterns I need to watch for if I want to filter my request parameters to be safe from null byte attacks? I have a list, and I am sure it is not a complete one: %00, \0, 0x00 in hex.
The articles that I am referring to are:
http://projects.webappsec.org/w/page/13246949/Null%20Byte%20Injection
http://www.perlmonks.org/index.pl?node_id=38548
http://hakipedia.com/index.php/Poison_Null_Byte
Thanks in advance
So, to make it clearer:
The first post points out the vulnerability in Java that I am talking about. The string serverlogs.txt%00.db is allowed in Java, but in C/C++ it becomes serverlogs.txt, because in C the %00 would be replaced by a null byte, causing the string to terminate after serverlogs.txt. So we should avoid such characters. This is what I am trying to figure out: which such characters I should not allow.
String fn = request.getParameter("fn");
if (fn.endsWith(".db"))
{
    File f = new File(fn);
    // read the contents of "f" file
    …
}
Have you tried it? I wrote this quick unit test:
@Test
public void test() throws Exception {
    FileOutputStream out = new FileOutputStream("test.txt");
    out.write("hello!".getBytes("utf-8"));
    out.close();
    String badPath = "test.txt\0foo";
    File file = new File(badPath);
    FileInputStream in = new FileInputStream(file);
    System.out.println(StreamUtils.copyToString(in, Charset.forName("utf-8")));
}
Now, if the null character broke the string, I would expect to have the contents of my file printed to the console. Instead, I get a FileNotFoundException. For the record, this was using Java 1.7.0_40 on Ubuntu 13.04.
Update
Further investigation reveals this code in File#isInvalid:
final boolean isInvalid() {
    if (status == null) {
        status = (this.path.indexOf('\u0000') < 0) ? PathStatus.CHECKED
                                                   : PathStatus.INVALID;
    }
    return status == PathStatus.INVALID;
}
Not a bad question. I'm doubtful that this is a valid vulnerability on all platforms (for example, I believe Windows uses Pascal-style strings, not null-terminated strings, in its kernel), but I would not at all be surprised if some platforms and JVMs were in fact vulnerable to this kind of attack.
The key point to consider is where your strings are coming from, and what you're doing to those bytes before you interact with them as strings. Any bytes coming from a remote machine should always be assumed to be malicious until proven otherwise. And you should never take strings that come from over the Internet and try to turn them into paths on your local machine. Yes webservers like Apache do this, but that's also the most vulnerable code they have. The correct solution is: don't try to blacklist bad data (like null bytes), only whitelist good data.
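A minimal whitelist-style check in the spirit of that advice (the allowed pattern and the method name are only illustrations, not a complete policy):
private static final Pattern SAFE_NAME = Pattern.compile("[A-Za-z0-9._-]{1,64}");

static Path resolveRequestedFile(Path baseDir, String requestedName) {
    if (!SAFE_NAME.matcher(requestedName).matches()) { // accept only known-good names
        throw new IllegalArgumentException("rejected file name");
    }
    Path p = baseDir.resolve(requestedName).normalize();
    if (!p.startsWith(baseDir)) { // also blocks ../ traversal
        throw new IllegalArgumentException("rejected path");
    }
    return p;
}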
You might also fight the null byte issue from the other angle!
In May 2013 Oracle fixed the problem: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8014846
So, upgrade to Java 8 or Java 7u40 and you are protected.
(Yes, I tested it!) It works!
If a link to my personal blog is not considered spam, I'll drop it here:
http://crocode.blogspot.ru/2015/03/java-null-byte-injections.html
If I'm reading your question correctly, you want to prevent executable code from being injected into memory after the terminating null byte of a string.
Java ain't C.
Java doesn't use terminating null bytes for its strings, so you don't need to protect against this.
