I'm reading Java: A Beginner's Guide by Schildt, and in the I/O chapter, when using the FileWriter class, he uses the constructor FileWriter(String filename, Charset charset). For the charset he uses System.console().charset().
However, my VSCode tells me the method charset is undefined for the Console object...
Is there a way to obtain the charset of the console?
This book is showing you how to do it in the modern way. While java-the-ecosystem makes a point of not changing everything every other year, unlike some other programming language ecosystems, this particular thing did change: specifically, in the JDK18 release (JEP 400).[1]
Starting with JDK18, this is the 'model' that the JVM uses to deal with charset encoding issues:
Every constructor and method in Java that converts bytes to characters or vice versa necessarily uses a charset (you need one for the conversion; you can't not have one), and they all have overloads (in all versions of Java): you can either specify no charset, in which case 'the default' is used, or you can specify an explicit one. Starting with JDK18, all these methods use UTF-8 as the default, whether your host OS uses that as its default charset or not. This is the change: prior to JDK18, 'the default charset' meant the host OS charset; from JDK18 on, it means UTF-8.
In the unlikely case that you want to write data in the charset of the host OS, there's System.console().charset() (introduced in JDK17). Therefore, new FileWriter("file.txt") writes in UTF-8, whereas new FileWriter("file.txt", System.console().charset()) writes in the host OS charset, whatever that might be. On Linux that is usually UTF-8, so there's no difference; on Windows it's more often something like Cp1252.
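Concretely, a minimal sketch of both variants (file names are illustrative; FileWriter(String, Charset) needs JDK11+, and System.console() can return null when no console is attached, e.g. when output is redirected):

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) throws IOException {
        // Explicit UTF-8: identical behaviour on every JDK version.
        try (FileWriter w = new FileWriter("utf8.txt", StandardCharsets.UTF_8)) {
            w.write("hello");
        }
        // Host OS charset via the console (JDK17+), guarding against a null console.
        Charset hostCharset = System.console() != null
                ? System.console().charset()
                : Charset.defaultCharset(); // on JDK18+ this falls back to UTF-8
        try (FileWriter w = new FileWriter("host.txt", hostCharset)) {
            w.write("hello");
        }
    }
}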
That's a change. Prior to JDK18, it worked differently:
Those use-the-default-charset methods/constructors, such as new FileWriter(fileName), used the host OS charset. If you wanted UTF-8, you had to write new FileWriter(fileName, StandardCharsets.UTF_8).
There was no System.console().charset() method before JDK17. It didn't exist at all.
As an exception, all methods in the newer files API (java.nio.file) have defaulted to UTF-8 all along, even before this change.
Clearly, then, the JDK you told VSCode to use predates all this: it's telling you System.console().charset() does not exist, and that method was introduced in JDK17, so your VSCode must be using a JDK that predates it.
Unfortunately, this means it is impossible to write code that explicitly writes data in the 'host OS charset' and works both before and after this change, without jumping through some crazy hoops. (I raised this point on the OpenJDK mailing lists; I don't think any OpenJDK team member cares, as nobody has taken any action on it.)
As you said, you're a beginner, so presumably you don't particularly care about writing code that uses the host OS charset in a way that compiles on both older and newer JDKs.
Thus, pick a solution. Either one will work:
1. Upgrade to JDK17 or newer.
2. Use new FileWriter(fileName), without the charset argument.
I'd pick option #1: if the book did so here, it is likely going to end up using other things introduced in recent JDK releases too.
NB: For a beginner's book, worrying about charset encoding is a bizarre choice, especially considering that they evidently thought it was okay to use the obsolete FileWriter, presumably to keep things simple. It's almost like someone explaining how to drive a car taking a moment to explain how a carburettor works (waaaay too much detail; you don't need to know this when learning to drive until much later), while sort of handwaving away how to fuel it up, which is relevant much earlier. A bizarre choice: consider it one minor demerit against this book, and know that this charset malarkey is not what you should be worrying about right now if your aim is to learn Java.
[1] Congratulations - you are using a rather modern book. That's good news - a common problem is that tutorials are 20 years behind the times and end up showing you outdated ways of doing things. "Too new" is a better problem to have than "too old".
Related
As has been answered repeatedly, you can easily convert a String to an InputStream.
When browsing the Java 8 documentation I came across the long-deprecated StringBufferInputStream class, which states that
As of JDK 1.1, the preferred way to create a stream from a string is via the StringReader class.
What way is this referring to? There are several approaches requiring classes from non-default libraries, such as the error-prone ReaderInputStream from Apache Commons IO, but I'm looking for the preferred way mentioned in the documentation. The solutions referenced in other questions are sufficient for my use cases, but I'd still like to know what the documentation is referring to.
Update
Apparently this is a 16-year-old bug that hasn't been fixed. The proposed solution in the link is to use the deprecated class; I can't imagine that is what the documentation intends.
I don't know the answer to the javadoc part, but the first answer you pointed to is a reasonable one: just use String.getBytes(encoding) to get a byte array, then wrap it in a ByteArrayInputStream.
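In code, that suggestion looks roughly like this (a minimal sketch; the explicit charset avoids default-charset surprises):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StringToStream {
    public static void main(String[] args) throws Exception {
        String s = "hello world";
        // Encode the String to bytes, then wrap the array in a stream.
        InputStream in = new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
        // Read it back just to show the round trip (readAllBytes needs Java 9+).
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
}

For the character-oriented direction, the javadoc's StringReader suggestion applies, but that gives you a Reader, not an InputStream.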
But usually the more important question is: why on earth do you NEED such a conversion? In a well-designed system you should rarely need to go in this direction: it is against the normal flow of things, where inside the JDK you deal with chars and Strings, and at the boundaries with bytes and streams. Conversions in this direction are quite rare, so it does not seem necessary for the JDK to have explicit support.
I am a little bit curious about what happens if I manually change something in the bytecode before execution. For instance, suppose I assign an int variable to a byte variable without a cast, or remove a semicolon somewhere in the program, or do anything else that would normally lead to a compile-time error. As I understand it, all compile-time errors are checked by the compiler before the .class file is produced. So what happens when I successfully compile a program and then change the bytecode manually? Is there any mechanism to handle this? If not, how does the program behave when executed?
EDIT:
As Hot Licks, Darksonn and manouti have already given correct, satisfying answers, I'll just summarize for readers seeking an answer to this type of question:
Every Java virtual machine has a class-file verifier, which ensures that loaded class files have a proper internal structure. If the class-file verifier discovers a problem with a class file, it throws an exception. Because a class file is just a sequence of binary data, a virtual machine can't know whether a particular class file was generated by a well-meaning Java compiler or by shady crackers bent on compromising the integrity of the virtual machine. As a consequence, all JVM implementations have a class-file verifier that can be invoked on untrusted classes, to make sure the classes are safe to use.
Refer to this for more details.
You certainly can use a hex editor (e.g., the free "HDD Hex Editor Neo") or some other tool to modify the bytes of a Java .class file. But obviously, you must do so in a way that maintains the file's "integrity" (tables all in the correct format, etc). Furthermore (and much trickier), any modification you make must pass muster with the JVM's "verifier", which essentially rechecks everything that javac verified while compiling the program.
The verification process occurs during class loading and is quite complex. Basically, a data-flow analysis is done on each procedure to ensure that only the correct data types can "reach" a point where a particular data type is assumed. E.g., you can't change a load operation to load a reference to a HashMap onto the "stack" when the eventual user of the loaded reference will be assuming it's a String. (But enumerating all the checks the verifier does would be a major task in itself. I can't remember half of them, even though I wrote the verifier for the IBM iSeries JVM.)
(If you're asking if one can "jailbreak" a Java .class file to introduce code that does unauthorized things, the answer is no.)
You will most likely get a java.lang.VerifyError:
Thrown when the "verifier" detects that a class file, though well formed, contains some sort of internal inconsistency or security problem.
You can certainly do this, and there are even tools to make it easier, like http://set.ee/jbe/. The Java runtime will run your modified bytecode just as it would run the bytecode emitted by the compiler. What you're describing is a Java-specific case of a binary patch.
The semicolon example wouldn't be an issue, since semicolons are only for the convenience of the compiler and don't appear in the bytecode.
Either the bytecode executes normally and performs the instructions given, or the JVM rejects it.
I played around with programming directly in Java bytecode some time ago using Jasmin, and I noticed some things.
If the bytecode you edit it into makes sense, it will of course run as expected. However, there are some bytecode patterns that are rejected with a VerifyError.
For the specific example of out-of-bounds access: you can compile code with an out-of-bounds access just fine; it will get you an ArrayIndexOutOfBoundsException at runtime.
int[] arr = new int[20];
for (int i = 0; i < 100; i++) {
    arr[i] = i; // throws ArrayIndexOutOfBoundsException once i reaches 20
}
However you can construct bytecode that is more fundamentally flawed than that. To give an example I'll explain some things first.
Java bytecode works with a stack, and instructions operate on the top elements of that stack.
The stack naturally has different sizes at different places in the program, but sometimes you can use a goto in the bytecode so that the stack looks different depending on how you reached a given point.
For example, the stack might contain object, int, and the code stores the object in an object array and the int in an int array. But from somewhere else in the bytecode you goto that same spot with the stack containing int, object, which would result in an int being passed to an object array and vice versa.
This is just one example of the kinds of flaws that make bytecode fundamentally invalid. The JVM detects these kinds of flaws when the class is loaded at runtime, and throws a VerifyError if something doesn't hold.
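For illustration, here is one way to provoke that without hand-editing a class file, using the ASM bytecode library (an assumed third-party dependency, not something from the answers above). The generated method claims to return an int but performs a void return, which the verifier rejects when the class gets linked:

import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class VerifyErrorDemo {
    public static void main(String[] args) throws Exception {
        ClassWriter cw = new ClassWriter(0);
        cw.visit(Opcodes.V1_8, Opcodes.ACC_PUBLIC, "Bad", null, "java/lang/Object", null);
        MethodVisitor mv = cw.visitMethod(
                Opcodes.ACC_PUBLIC | Opcodes.ACC_STATIC, "f", "()I", null, null);
        mv.visitCode();
        mv.visitInsn(Opcodes.RETURN); // void return in a method declared to return int
        mv.visitMaxs(0, 0);
        mv.visitEnd();
        cw.visitEnd();
        byte[] bytes = cw.toByteArray();

        // Load the malformed class; verification runs when it is first really used.
        var loader = new ClassLoader() {
            Class<?> define(byte[] b) { return defineClass("Bad", b, 0, b.length); }
        };
        Class<?> bad = loader.define(bytes);
        bad.getDeclaredMethod("f").invoke(null); // throws java.lang.VerifyError
    }
}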
Java has two overloads each for String.toLowerCase and toUpperCase. One of the overloads takes a Locale as a parameter while the other one takes no parameters and uses the default locale (Locale.getDefault()).
The parameterless variants might not work as expected because case conversion respects internationalization, and the default locale is system dependent. Most notably, the lower case i is converted to an upper case dotted İ in the Turkish locale.
What is the purpose of these methods? Do the parameterless variants have any legitimate use? Or perhaps they were just a design mistake? (Not unlike several I/O APIs that use the system default character encoding by default.)
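The Turkish behaviour mentioned above is easy to reproduce; a minimal sketch (Locale.forLanguageTag("tr") selects the Turkish locale, Locale.ROOT the locale-independent rules):

import java.util.Locale;

public class TurkishDemo {
    public static void main(String[] args) {
        Locale tr = Locale.forLanguageTag("tr");
        System.out.println("i".toUpperCase(tr));           // İ (U+0130), not I
        System.out.println("TITLE".toLowerCase(tr));       // tıtle, with dotless ı (U+0131)
        System.out.println("i".toUpperCase(Locale.ROOT));  // I, locale-independent
    }
}

For machine-readable text (identifiers, protocol keywords and the like), passing Locale.ROOT is the usual way to opt out of this behaviour.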
I think they're just convenience methods that will work most of the time, since apps that really need i18n are probably a small minority in the universe of Java apps in the world.
If you hardcode a Unix path as a File name in a Java program and try to run it on a Windows box, you will also get wrong results, and that's not Java's fault.
I guess that's an implementation of the 'write once, run anywhere' principle.
It makes sense, because you can provide the default locale at JVM startup as one of the runtime parameters (e.g. -Duser.language=tr -Duser.country=TR).
Furthermore, the Java runtime has a bunch of similar locale-sensitive formatting classes for dates and numbers (SimpleDateFormat, NumberFormat, etc.).
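For instance, a quick sketch of how those formatters follow the locale (expected output shown in comments):

import java.text.NumberFormat;
import java.util.Locale;

public class LocaleFormatDemo {
    public static void main(String[] args) {
        double n = 1234.56;
        System.out.println(NumberFormat.getInstance(Locale.US).format(n));      // 1,234.56
        System.out.println(NumberFormat.getInstance(Locale.GERMANY).format(n)); // 1.234,56
    }
}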
Several blog posts suggest that default locales and charsets indeed were a design mistake and have no meaningful use.
I was frustrated recently by this question, where the OP wanted to change the format of the output depending on a feature of the number being formatted.
The natural mechanism would be to construct the format dynamically, but because PrintStream.format takes a String instead of a CharSequence, the construction must end with building a String.
It would have been so much more natural and efficient to build a class implementing CharSequence that provided the dynamic format on the fly, without having to create yet another String.
This seems to be a common theme in the Java libraries, where the default seems to be to require a String even though immutability is not a requirement. I am aware that keys in Maps and Sets should generally be immutable, for obvious reasons, but as far as I can see String is used far too often where a CharSequence would suffice.
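For illustration, a minimal sketch of the kind of class the question envisions (RepeatedChar and its behaviour are invented for this example, not taken from the linked question):

// A CharSequence that produces its characters on the fly; no String is
// materialized unless toString() is actually called.
final class RepeatedChar implements CharSequence {
    private final char c;
    private final int length;

    RepeatedChar(char c, int length) { this.c = c; this.length = length; }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return c;
    }

    @Override public CharSequence subSequence(int start, int end) {
        return new RepeatedChar(c, end - start); // bounds checks elided for brevity
    }

    @Override public String toString() {
        char[] buf = new char[length]; // only here is storage allocated
        java.util.Arrays.fill(buf, c);
        return new String(buf);
    }
}

An API that accepted CharSequence could consume this lazily; one that takes String forces the toString() call and the allocation it entails.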
There are a few reasons.
In a lot of cases, immutability is a functional requirement. For example, you've identified that a lot of collections / collection types will "break" if an element or key is mutated.
In a lot of cases, immutability is a security requirement. For instance, in an environment where you are running untrusted code in a sandbox, any case where untrusted code could pass a StringBuilder instead of a String to trusted code is a potential security problem.[1]
In a lot of cases, the reason is backwards compatibility. The CharSequence interface was introduced in Java 1.4; Java APIs that predate Java 1.4 do not use it. Furthermore, changing a preexisting method that uses String to use CharSequence risks binary compatibility issues; i.e. it could prevent old compiled Java code from running on a newer JVM.
In the remaining cases it could simply be "too much work, too little time". Making changes to existing standard APIs involves a lot of effort to make sure that the change is going to be acceptable to everyone (e.g. checking for the above), and convincing everyone that it will all be OK. Work has to be prioritized.
So while you find this frustrating, it is unavoidable.
[1] This would leave the Java API designer with an awkward choice. Does he/she write the API to make (expensive) defensive copies whenever it is passed a mutable "string", possibly changing the semantics of the API (from the user's perspective!)? Or does he/she label the API as "unsafe for untrusted code" ... and hope that developers notice / understand?
Of course, when you are designing your own APIs for your own reasons, you can make the call that security is not an issue. The Java API designers are not in that position. They need to design APIs that work for everyone. Using String is the simplest / least risky solution.
See http://docs.oracle.com/javase/6/docs/api/java/lang/CharSequence.html
Do you notice the part that explains that it has been around since 1.4? Previously, all the API methods used String (which has been around since 1.0).
Does anybody know a faster way to do what java.nio.charset.Charset.decode(..)/encode(..) does?
It's currently one of the bottlenecks of a technology that I'm using.
[EDIT]
Specifically, in my application, I changed one segment from a Java solution to a JNI solution (because there was a C++ technology that was more suitable for my needs than the Java technology I was using).
This change brought about a significant decrease in speed (and a significant increase in CPU & memory usage).
Looking deeper into the JNI solution that I used, the Java application communicates with the C++ application via byte[]. These byte[] are produced by Charset.encode(..) on the Java side and passed to the C++ side. Then, when the C++ side responds with a byte[], it gets decoded on the Java side via Charset.decode(..).
Running this against a profiler, I see that Charset.decode(..) and Charset.encode(..) both take a significantly long time compared to the whole execution time of the JNI solution. (I profiled only the JNI solution because it's something I could whip up quite quickly; I'll profile the whole application at a later date once I free up my schedule. :-))
Upon reading further into my problem, it seems this is a known problem with Charset.encode(..) and decode(..), and it is being addressed in Java 7. However, moving to Java 7 is not an option for me (for now) due to some constraints.
Which is why I'm asking here whether somebody knows a Java 5 solution / alternative to this. (Sorry, I should have mentioned sooner that this was for Java 5.) :-)
The javadoc for encode() and decode() makes it clear that these are convenience methods. For example, for encode():
Convenience method that encodes Unicode characters into bytes in this charset.
An invocation of this method upon a charset cs returns the same result as the expression

cs.newEncoder()
  .onMalformedInput(CodingErrorAction.REPLACE)
  .onUnmappableCharacter(CodingErrorAction.REPLACE)
  .encode(bb);

except that it is potentially more efficient because it can cache encoders between successive invocations.
The language is a bit vague there, but you might get a performance boost by not using these convenience methods. Create and configure the encoder once, and then re-use it:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

// Create and configure once (per thread; encoders are not thread-safe)...
CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder() // or whichever charset applies
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
// ...then re-use it; each encode(CharBuffer) call resets the encoder first.
// (encode throws the checked CharacterCodingException.)
ByteBuffer b1 = encoder.encode(CharBuffer.wrap("first"));
ByteBuffer b2 = encoder.encode(CharBuffer.wrap("second"));
ByteBuffer b3 = encoder.encode(CharBuffer.wrap("third"));
It always pays to read the javadoc, even if you think you already know the answer.
First part: it is a bad idea in general to pass arrays into JNI code. Because of the GC, Java has to copy arrays. In the worst case an array will be copied twice: on the way into the JNI code and on the way back :)
That is why the Buffer class hierarchy was introduced. And of course the Java dev team created a nice way to encode/decode chars:
Charset#newDecoder returns a CharsetDecoder, which can be used to convert a ByteBuffer to a CharBuffer according to a Charset. There are two main versions of the method:
CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)
CharBuffer decode(ByteBuffer in)
For maximum performance you want the first one: it has no hidden memory allocations inside.
Note that the encoder/decoder may maintain internal state, so be careful (for example, if you decode from a 2-byte encoding and the input buffer holds only half of a character...). Also, encoders/decoders are not thread-safe.
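Here is a sketch of that pattern (the names, buffer size and UTF-8 choice are illustrative; the decoder and the output CharBuffer are the pieces you would keep and re-use, one set per thread):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class DecodeLoop {
    public static String decode(byte[] bytes) throws Exception {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024); // reused; no per-call allocation
        StringBuilder result = new StringBuilder();
        while (true) {
            CoderResult cr = decoder.decode(in, out, true);
            out.flip(); result.append(out); out.clear();
            if (cr.isUnderflow()) break;           // all input consumed
            if (cr.isError()) cr.throwException(); // malformed / unmappable input
        }
        while (decoder.flush(out).isOverflow()) {  // drain any remaining state
            out.flip(); result.append(out); out.clear();
        }
        out.flip(); result.append(out);
        decoder.reset();                           // ready for the next message
        return result.toString();
    }
}

(The StringBuilder here is just to make the sketch self-contained; in the JNI scenario you would hand the CharBuffer contents on without building a String.)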
There are very few reasons to "squeeze" a String into a byte array.
I would recommend writing the C functions to take UTF-16 strings as parameters; Java strings are UTF-16 internally.
That way there is no need for any conversion.