I'm trying to understand the best practices related to these two methods of the Java nio Files class:
readAllLines vs readAllBytes
Which should I use, and when? Most results from Google show how to use them, but don't touch on when to use one over the other.
Can someone please help me understand?
From the readAllLines documentation (emphasis my own):
Read all lines from a file. Bytes from the file are decoded into
characters using the UTF-8 charset.
So right off the bat, you are told that the single-argument readAllLines(Path) decodes the file's bytes as UTF-8. If your files are stored in UTF-16, UTF-32, or some other non-UTF-8 character set, that overload is the wrong tool; use the readAllLines(Path, Charset) overload and pass the actual charset instead.
Also, you don't always deal with strings. Sometimes you are dealing with:
Binary data, which could be read and deserialized into some other object.
Image data.
So from my viewpoint, readAllLines is basically a readAllBytes with some extra processing on top of it (to transform bytes into a list of strings).
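A minimal sketch of the difference, assuming a small Latin-1 text file. The class name `ReadAllDemo` and its helper methods are made up for illustration; only `Files.readAllLines`, `Files.readAllBytes`, `Files.write` and `Files.createTempFile` come from the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ReadAllDemo {

    // Text: decode the file with an explicit charset and split it into lines.
    static List<String> readText(Path file, Charset charset) {
        try {
            return Files.readAllLines(file, charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Binary data (images, serialized objects, ...): no decoding at all.
    static byte[] readBinary(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes a two-line Latin-1 sample to a temp file.
    static Path writeSample() {
        try {
            Path file = Files.createTempFile("sample", ".txt");
            byte[] latin1 = "h\u00e9llo\nw\u00f6rld".getBytes(StandardCharsets.ISO_8859_1);
            return Files.write(file, latin1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSample();
        System.out.println(readText(file, StandardCharsets.ISO_8859_1));
        System.out.println(readBinary(file).length + " bytes");
    }
}
```

Reading the same sample with the one-argument readAllLines would try to decode Latin-1 bytes as UTF-8, which typically fails with a MalformedInputException.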
Related
This might be a bit of a beginner question, but it's fairly relevant when debugging encoding in Java: at what point is an encoding relevant to a String object?
Consider that I have a String object that I want to save to a file. Does the String object itself use some sort of encoding that I should manipulate, or is the encoding only specified by me when I create a stream of bytes to save?
The same applies to importing: when I open a file and get its bytes, I assume there's no encoding at hand, only bytes. When I parse these bytes into a String, I have to use an encoding to work out which characters they are. After I parse those bytes, does the String (in memory) carry some sort of meta information about the encoding, or is this handled only by the JVM?
This is vital because I'm having file import/export issues and I need to understand at which point I should worry about getting the encoding right.
Hope I explained my doubt well, and thank you in advance!
Java strings do not have explicit encoding information. They don't know where they came from, and they don't know where they are going. Every Java string exposes its contents as a sequence of UTF-16 code units, whatever the JVM does internally to store them.
You (optionally) specify what encoding to use whenever you want to turn a String into a sequence of bytes (e.g., to save to a file), or when you want to turn a sequence of bytes (e.g., read from a file) into a String.
Encoding matters to a String when you are serializing to or deserializing from disk or the web. There are multiple text encodings: ASCII, Latin-1, UTF-8, and UTF-16 (which comes in big-endian and little-endian variants, UTF-16BE and UTF-16LE).
See InputStreamReader for how to load a String from text encoded in something other than the platform default charset.
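To illustrate that the encoding only comes into play at the String/bytes boundary, here is a small sketch. The class `EncodingBoundary` and its two helpers are hypothetical names; they just wrap `String.getBytes(Charset)` and `new String(byte[], Charset)`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBoundary {

    // String -> bytes: this is the point where an encoding is chosen.
    static byte[] toBytes(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // bytes -> String: the same encoding must be named again to get the text back.
    static String toText(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // the String itself carries no charset
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE
        };
        for (Charset cs : charsets) {
            byte[] data = toBytes(text, cs);
            boolean same = toText(data, cs).equals(text);
            System.out.println(cs + ": " + data.length + " bytes, round trip ok = " + same);
        }
    }
}
```

The same four characters come out as 4, 5, 8 and 8 bytes respectively, which is why the charset has to be agreed on wherever bytes are written or read.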
I am working on a response wrapper. When an OutputStream is used, I can determine the number of bytes. However, when a Writer is used, I buffer the content as a char[] that is later written to the output.
It's a bit of a newbie question, but how can I be sure of the real content length when using a char[]? (I want to set the header. I know that I don't have to, but I want to.) I mean, I could convert the chars to bytes using the encoding in use and then flush the bytes, but I would like to skip another conversion.
Any other ideas?
It depends on the character encoding you are using:
If you are using an 8-bit encoding (such as Latin-1), and you only have mappable characters to convert, then the character count is the same as the byte count.
Otherwise, it is technically possible to work out how many bytes the character stream will encode to [1], but there isn't a simple and efficient way to do it.
Yeah, but that is the point: I do not know the encoding up front, and the user may use anything. I don't see any other pragmatic way of doing this besides converting to byte[], which probably happens under the hood anyway, I guess.
Actually, it should be easy to tell what the encoding is. Call getCharacterEncoding() on the response wrapper. If you do that after getWriter has been called, then it will be the encoding that was used to create the writer. (At least, that is what the javadocs imply ...)
1 - For instance, with UTF-8 you would first implement a mapping of char values to Unicode code points, and then figure how many bytes are required to encode each code point. This can be done without copying, etc, but you would need to code it from the ground up ... unless you can find a 3rd-party library that does this.
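The footnote's idea can be sketched for UTF-8 roughly as follows. This is an illustration rather than a vetted implementation: the class and method names are made up, and counting an unpaired surrogate as one byte is an assumption based on Java's encoder substituting '?' for it.

```java
public class Utf8Length {

    // Counts the bytes a char[] would occupy in UTF-8 without encoding or copying it.
    static long utf8Length(char[] chars) {
        long bytes = 0;
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // e.g. Latin-1 letters, Greek, Cyrillic
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < chars.length
                    && Character.isLowSurrogate(chars[i + 1])) {
                bytes += 4;                       // surrogate pair = one supplementary code point
                i++;                              // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                       // assumption: lone surrogate is replaced by '?'
            } else {
                bytes += 3;                       // rest of the Basic Multilingual Plane
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \u20ac \uD83D\uDE00";
        System.out.println(utf8Length(s.toCharArray()) + " bytes in UTF-8");
    }
}
```

For any other charset the arithmetic is different, so in the general case encoding to bytes and taking the length remains the pragmatic option.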
I have an array of strings that I need to save into a txt file. I'm only allowed to make files of at most 64 KB, so I need to know when to stop putting strings into the file.
Is there some method that, given an array of strings, can tell me how big the file will be without creating the file?
Is the file going to be ASCII-encoded? If so, every character you write will be 1 byte. Add up the string lengths as you go, and once the total number of characters exceeds 64 KB, you know to stop. Don't forget to count the newlines between strings, in case you're writing them.
Java ships with a library for input and output named NIO. I assume you already know how to use it. If you do not, look at the following links to learn more:
http://en.wikipedia.org/wiki/New_I/O
https://blogs.oracle.com/slc/entry/javanio_vs_javaio
http://docs.oracle.com/javase/tutorial/essential/io/fileio.html
We all know that all data types are just bytes in the end. Characters are the same, with a little more detail. The characters of the world (letters, numbers, symbols and so on) are mapped to a table named Unicode, and a character encoding algorithm turns them into a certain number of bytes when you save text to a file. Since I could spend hours talking about this, I suggest you take a look at the following links to understand more about character encoding:
http://www.w3schools.com/tags/ref_charactersets.asp
https://stackoverflow.com/questions/3049090/character-sets-explained-for-dummies
https://www.w3.org/International/questions/qa-what-is-encoding.en
http://unicode-table.com/en/
http://en.wikipedia.org/wiki/Character_encoding
By using Charset, CharsetEncoder and CharsetDecoder, you can choose a specific character encoding to save your text, and the final size of your file may vary depending on which one you choose. UTF-8 (the 8 refers to its 8-bit code units) stores each character in 1 to 4 bytes: 1 byte for ASCII, more for everything else. UTF-16 (16-bit code units) stores each character in 2 bytes, or 4 bytes for characters outside the Basic Multilingual Plane. So the number of bytes per character depends on both the encoding and the character being saved. The following link lists the encodings supported by the Java API (as of Java 7):
http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
With the NIO library, you do not need to actually save a file to know its size. If you encode your text into a ByteBuffer, you already know the final size of the file without even saving it.
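As a rough sketch of that idea, the following counts how many strings fit under a byte limit by encoding each one into a ByteBuffer and adding up the sizes. The class name, the method name and the "\n" line separator are assumptions for illustration. It encodes line by line, so it suits charsets without a byte-order mark such as UTF-8; plain "UTF-16" would add a BOM on every call and overcount.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SizeBudget {

    // Returns how many leading strings (each followed by "\n") fit within maxBytes.
    static int countThatFit(String[] lines, Charset charset, int maxBytes) {
        int used = 0;
        int count = 0;
        for (String line : lines) {
            ByteBuffer encoded = charset.encode(line + "\n");
            int size = encoded.remaining();   // encoded size in bytes, no file needed
            if (used + size > maxBytes) {
                break;                        // the next string would exceed the limit
            }
            used += size;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] lines = {"abc", "\u00e9", "\u20acuro"};
        int limit = 64 * 1024;
        System.out.println(countThatFit(lines, StandardCharsets.UTF_8, limit) + " strings fit");
    }
}
```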
Any questions, please comment.
I'm working on a Java plugin which would allow people to write to and read from a file by specifying a charset encoding they would wish to use. However, I was confused as to how I would encode multiple encodings in a single file. For example, suppose that A characters come from one charset and B characters come from another, would it be possible to write "AAAAABBBBBAAAAA" to a file?
If it is not possible, is this generally true for any programming language, or specifically for Java? And if it is possible, how would I then proceed to read (decode) the file?
I do not want to use the encode() and decode() methods of Charset since tests with them have failed (some charsets were not decoded properly). I also don't want to use third-party programs for various reasons, so the scope of this question is purely in the standard java packages/code.
Thanks a lot!
N.S.
You'd need to read it as a byte stream and know beforehand at which byte positions the characters start and end, or to use some special separator character/byterange which indicates the start and end of the character group. This way you can get the bytes of the specific character group and finally decode it using the desired character encoding.
This problem is not specific to Java. The requirement is just strange. I wonder how it makes sense to mix character encodings like that. Just use one uniform encoding all the time, for example UTF-8, which supports practically all characters mankind is aware of.
Of course it is in principle possible to write text that is encoded in different character sets into one file, but why would you ever want to do this?
A character encoding is simply a mapping from text characters to bytes and vice versa. A file consists of bytes. When writing a file, the character encoding determines how the characters are converted to bytes, and when reading, it determines how the bytes are converted back to characters.
You could have one part of the file encoded with one character encoding, and another part with another character encoding. You'd have to have some mechanism to keep track of what parts are encoded with what encoding, because the file doesn't automatically keep track of that for you.
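One possible bookkeeping mechanism, sketched with only standard classes: prefix every segment with its charset name and byte length. The layout (segment count, then name, length and bytes per segment) and the class `MixedEncodingFile` are invented for this example, not an established file format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MixedEncodingFile {

    // Writes: segment count, then for each segment its charset name, byte length and bytes.
    static byte[] write(String[] texts, Charset[] charsets) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(texts.length);
            for (int i = 0; i < texts.length; i++) {
                byte[] data = texts[i].getBytes(charsets[i]);
                out.writeUTF(charsets[i].name());
                out.writeInt(data.length);
                out.write(data);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the segments back, decoding each with the charset recorded next to it.
    static String read(byte[] file) {
        StringBuilder text = new StringBuilder();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            int segments = in.readInt();
            for (int i = 0; i < segments; i++) {
                Charset charset = Charset.forName(in.readUTF());
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
                text.append(new String(data, charset));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String[] texts = {"AAAAA", "BBBBB", "AAAAA"};
        Charset[] charsets = {
            StandardCharsets.US_ASCII, StandardCharsets.UTF_16BE, StandardCharsets.US_ASCII
        };
        System.out.println(read(write(texts, charsets)));
    }
}
```

The resulting file is no longer a plain text file that an editor can open, which is one more argument for sticking to a single encoding.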
I was wondering about this as well, because my client just asked a similar question. As BalusC mentioned, this is not a Java-specific problem.
After a bit of back and forth, I found that the real question might be about 'multiple encodings of information' rather than a file with multiple encodings.
For example, say we have an XML string whose text needs to be encoded as ISO-8859-1. If we save it as a file, we need to encode it. The default encoding for XML is UTF-8, and we don't necessarily have to encode the whole XML document as ISO-8859-1, since the XML node is just a vehicle for passing information to another system; it is the content (the value of the XML node) that needs to be persisted as ISO-8859-1. So do we need multiple encodings in this case? Probably not. We can still encode the XML as UTF-8 and pass it over. Once the client receives the XML, they read the information out of the UTF-8 encoded file and persist the value of the XML node as ISO-8859-1.
I need to convert Strings obtained from some APIs to InputStreams consumed by other APIs. The only way is to convert the String to a stream without knowing the exact encoding. So I assume it to be UTF-8, and it works fine for now. However, I would like to know what a better solution would be, given that I have no way of identifying the encoding of the source of the string.
There is no good solution to the problem of not knowing the encoding.
Because of this, you must demand that the encoding be explicitly specified, or else use one single agreed-upon encoding that is strictly adhered to.
Also, make sure you use the rare form of the constructor to InputStreamReader that condescends to raise an exception on an encoding error. That is InputStreamReader(InputStream in, CharsetDecoder dec). The other three are either broken or else infelicitously designed depending on your point of view or purposes, because they suppress encoding errors and render your program unreliable and nonportable.
Be very careful about missing errors, especially when you do not know for sure what you are getting — and even if you think you do :).
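A short sketch of that strict constructor in use. The class `StrictDecoding` and its methods are hypothetical; the decoder is configured with CodingErrorAction.REPORT explicitly (which is also the default for a freshly created decoder) so that bad input raises a CharacterCodingException instead of being silently replaced.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecoding {

    // Reads the whole stream as text, failing loudly on any encoding error.
    static String decodeStrict(InputStream in, Charset charset) throws IOException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    // Convenience check: true if the bytes are valid in the given charset.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            decodeStrict(new ByteArrayInputStream(data), charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {'a', (byte) 0xE9, 'b'}; // "aéb" in ISO-8859-1, invalid as UTF-8
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(latin1, StandardCharsets.ISO_8859_1));
    }
}
```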
The possible consequence of applying the incorrect encoding is getting the wrong data out the other end.
The specific consequences will depend on the specific encodings. For example, if you receive a stream of ISO-8859-1 characters, and try to decode using UTF-8, you'll probably get errors due to incorrect sequences. If you start with UTF-16 and assume that it's ISO-8859-1, you'll get twice as many characters as you expect, and every other one will be garbage.
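Both cases can be reproduced in a few lines. This sketch uses `new String(bytes, charset)`, which substitutes the replacement character U+FFFD for invalid sequences rather than throwing, so the "errors" show up as garbage characters here; `WrongEncoding` and `decodeAs` are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongEncoding {

    // Encodes text with one charset, then (mis)decodes the bytes with another.
    static String decodeAs(String original, Charset written, Charset assumed) {
        return new String(original.getBytes(written), assumed);
    }

    public static void main(String[] args) {
        // ISO-8859-1 bytes read as UTF-8: the accented letter is not valid UTF-8.
        String a = decodeAs("caf\u00e9", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(a.contains("\uFFFD"));

        // UTF-16 bytes read as ISO-8859-1: every byte becomes its own character.
        String b = decodeAs("abc", StandardCharsets.UTF_16BE, StandardCharsets.ISO_8859_1);
        System.out.println(b.length() + " characters instead of 3");
    }
}
```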
Encodings are not a property of Strings in Java, they're only relevant when you convert between Strings and bytes. If those APIs give you Strings, there is only one point where your program needs to use an encoding, which is when you convert the String back to bytes to be returned by the InputStream. And those "other APIs" of course need to know which encoding to use if they're going to interpret the contents as text data.
To add to the other answers, your deployed application will no longer be portable between Windows and Linux, since these usually have different default encodings.
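To keep the String-to-InputStream conversion portable, the charset can be named explicitly instead of relying on the platform default. A minimal sketch, with `StringStreams.toStream` as a hypothetical helper name and UTF-8 as the assumed agreed-upon encoding:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringStreams {

    // The only place an encoding is involved: turning the String into bytes.
    static InputStream toStream(String text, Charset charset) {
        return new ByteArrayInputStream(text.getBytes(charset));
    }

    public static void main(String[] args) throws Exception {
        InputStream in = toStream("h\u00e9llo", StandardCharsets.UTF_8);
        System.out.println(in.available() + " bytes available");
    }
}
```

Whatever consumes the stream still has to be told, or has to assume, the same charset.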