Is it possible to create InputStream for a UTF-8 file? - java

We are making some code changes to our production code.
Earlier we used an InputStream (basically a FileInputStream) to read a file from a file path, and this InputStream is passed on to many methods.
Now we have realized the file can also contain Chinese characters, so we want to use UTF-8 encoding.
I have the file path as a String, and the file sometimes contains Chinese characters and sometimes does not.
I am reluctant to make changes in so many methods and was trying to somehow use UTF-8 encoding while creating the InputStream (FileInputStream).
I searched on the internet, but all I could find reads the content through a BufferedReader/InputStreamReader (for example Reading InputStream as UTF-8 or http://www.mkyong.com/java/how-to-read-utf-8-encoded-data-from-a-file-java/).
So is it possible to read a file from a file path, handle Chinese characters as well, and turn it into an InputStream?

An InputStream does not handle text, so it does not care about the encoding; the direct answer to your question is: no, you can't create an InputStream with UTF-8 encoding.
You can however handle UTF-8 files just fine with an InputStream by simply carrying the bytes around and never manipulating them in any way.
If you want to read text from a file you need to construct a Reader and then you'll need to specify the encoding (UTF-8 for you) in the constructor.
If you show us the point where data from the InputStream gets turned into String or char[] objects, then I can show you the place where you need to change your code.
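To illustrate the usual shape of that change, here is a minimal sketch (the file name is a placeholder): the InputStream keeps carrying raw bytes, and UTF-8 is specified only at the single point where bytes are turned into characters.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

try (InputStream in = new FileInputStream("input.txt");
     // The encoding is chosen here, where bytes become characters:
     BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line); // Chinese characters decode correctly
    }
}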

Related

BufferedOutputStream not working with Korean characters as expected

I'm trying to write Korean characters to a File, but it writes what looks like gibberish, which I currently have to work around to show as Korean data when I open it as CSV. How can I get the Korean text to show correctly without the workaround of decoding it back to UTF-8?
File localExport = File.createTempFile("char-test", ".csv");
try (
        FileOutputStream fos = new FileOutputStream(localExport);
        BufferedOutputStream bos = new BufferedOutputStream(fos);
        OutputStreamWriter outputStreamWriter =
                new OutputStreamWriter(bos, StandardCharsets.UTF_8)
) {
    ArrayList<String> rows = new ArrayList<>();
    rows.add("\"가짜 사용자\",사용자123,saint1_user123");
    rows.add("\"페이크유저루노도스트레스 성도1\",saint1_user1");
    for (int i = 0; i < 2; i++) {
        String csvUserStr = rows.get(i);
        outputStreamWriter.write(csvUserStr);
    }
}
It's writing the data below instead of what I'm actually writing to the File.
There is absolutely nothing wrong with your Java code. You are writing those characters, including the Korean, precisely as written.
Whatever tool you are using to look at this file? That's the broken one. Tell it that the file is UTF-8. If you can't, get a better tool, or figure out which encoding it does read and update your Java code.
Note that CSV files, text files, etc - they do not store the encoding that was used to write the data. All the programs that read/write to the file need to just know what encoding it is, there's no real way to know other than being told.
UPDATE: From a comment it looks like 'the tool that is reading this' is Excel.
Excel asks for the encoding of the file when you use the 'import CSV' dialog. Pick UTF-8 in the dropdown. It depends on which version/OS you're on, but usually it's called 'File Origin'.
If you prefer that your client need not mess with that dialog, note that the default is usually something like MacRoman or Win1252, and with such an encoding it is in fact impossible to get Korean characters. They simply aren't in that character set.
If you want the fire-and-forget approach, generate the Excel file yourself, for example using Apache POI.
CSV files don't have any means to carry encoding information "in-band"—in the file itself. I'm guessing the default character encoding used for Excel CSV imports is the system default, so if that isn't Korean, they will have to specify the encoding when they import the CSV. If your client requires CSV, they have no choice but to accept that behavior.
However, if their requirement is to open your file in Excel (and not that the file has to be CSV format), you could write an Excel spreadsheet instead. The various Excel file formats do include character encoding information, so they would be able to open the file without manually specifying the encoding.
Library recommendations are off-topic, but libraries such as Apache POI make writing simple Excel sheets fairly easy. There are additional benefits as well, such as taking care of any necessary escaping for you, so that your file doesn't break whenever unanticipated values end up in the spreadsheet.
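As an illustration, here is a minimal sketch of writing the same rows as a real .xlsx file; it assumes the Apache POI (poi-ooxml) dependency, and the file and sheet names are made up:

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// .xlsx stores its text as UTF-8 internally, so Excel opens it
// directly, with no import dialog and no encoding guesswork.
try (Workbook workbook = new XSSFWorkbook();
     OutputStream out = new FileOutputStream("char-test.xlsx")) {
    Sheet sheet = workbook.createSheet("users");
    Row row = sheet.createRow(0);
    row.createCell(0).setCellValue("가짜 사용자");
    row.createCell(1).setCellValue("사용자123");
    row.createCell(2).setCellValue("saint1_user123");
    workbook.write(out);
}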
As mentioned, Excel fails to detect that the text is encoded in UTF-8. One solution is to write an invisible BOM character as the very first one:

outputStreamWriter.write("\uFEFF");
for ...

The BOM is a normally superfluous and ugly marker, but it lets tools distinguish the various UTF encodings.
By the way, take a look at the class Files; it can reduce the code to one line.
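For instance, a sketch of that one-liner using java.nio.file.Files (reusing the rows list from the question; the path is a placeholder):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Writes every row as a UTF-8 encoded line in one call.
Files.write(Paths.get("char-test.csv"), rows, StandardCharsets.UTF_8);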

Java IO - Size of output file is larger than Original file

I am trying to read any file and write the same content at a different location.

File file = new File("<PathToFile>");
String str = FileUtils.readFileToString(file);

Writing str at a different location without any modification to str:

File writefile = new File("<PathToWriteFile>");
FileUtils.writeStringToFile(writefile, str);
The problem is that the size of the output file is larger than that of the original file.
EDIT: I know the solution, i.e. reading and writing the files byte-wise, but my question is WHY is this happening?
It's probably a different string encoding - there can be about a 2x difference in the space occupied by each char. You can verify it using any text editor that shows the encoding (e.g. in Notepad, doing "Save As" without actually going through with it will show the encoding in the save dialog).
As a solution, you can either
Read/write it as binary data
Use writeStringToFile(File, String, Charset) if you know the right charset
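A sketch of both options follows; the paths are the placeholders from the question, and UTF-8 is only an assumed charset for illustration:

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.apache.commons.io.FileUtils;

// Option 1: copy the raw bytes; the copy is always the same size.
Files.copy(Paths.get("<PathToFile>"), Paths.get("<PathToWriteFile>"),
        StandardCopyOption.REPLACE_EXISTING);

// Option 2: use an explicit charset on both sides, so the platform
// default encoding cannot change the byte count on the way out.
File file = new File("<PathToFile>");
String str = FileUtils.readFileToString(file, StandardCharsets.UTF_8);
FileUtils.writeStringToFile(new File("<PathToWriteFile>"), str, StandardCharsets.UTF_8);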

How to read textfiles with unknown encoding?

I want to read several text files (e.g. CSV), but I don't know the encoding.
As the text files may contain special chars like umlauts, choosing the right encoding seems to be crucial.

new BufferedReader(new InputStreamReader(resource.getInputStream(), encoding));

I tried reading with ISO_8859_1, which did not work properly with encoded umlauts. So I tried UTF-8, which works.
But I don't know whether this might also cause problems with different files in the future, and I never know before reading a file which encoding it is in.
So how should I best read files whose encoding is unknown?
Strictly speaking the other two answers are right - you just have to know what the encoding is to be guaranteed of anything. However, there are libraries out there that will allow you to make educated guesses about the encoding. Check out ICU4J or jchardet, for example.
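For example, a minimal sketch of such a guess with ICU4J's CharsetDetector (it assumes the com.ibm.icu:icu4j dependency; the file path is a placeholder):

import java.nio.file.Files;
import java.nio.file.Paths;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

byte[] bytes = Files.readAllBytes(Paths.get("input.csv"));
CharsetDetector detector = new CharsetDetector();
detector.setText(bytes);
CharsetMatch match = detector.detect(); // best guess, not a guarantee
System.out.println(match.getName() + " (confidence " + match.getConfidence() + "/100)");
String text = match.getString(); // decode using the guessed charset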
You have to know the encoding; you cannot read the files correctly if you don't know it. As UTF-8 works, just keep using it. Also check with the producer of the files whether they will keep producing them in UTF-8. They should document this.
It is impossible to reliably recognize the encoding of a text file programmatically. The only way is to try opening it in a text editor with different encodings until you can read the text.

How to get Unicode in Jar [duplicate]

I wrote a program that reads Unicode text from outside the program. I am using Windows XP and Eclipse. When I run the program in the IDE, it shows the Unicode text, but when I export it as a .jar file, I am unable to read the Unicode and it shows boxes instead.
I followed these instructions to install Unicode support on my computer, including links to install the Telugu fonts on my system.
Can anyone please tell me how I can get Unicode working in Jar files?
While not answering the question directly, here is a small howto on how to read/write correctly from text files, in an OS dependent way.
The first thing to know is that the JVM has a file.encoding property. It defines the default encoding used for all file read/write operations and for all readers/writers created without specifying an encoding.
As such, you don't want to use the default constructors, but define the encoding each time. In Java, the class which "embodies" an encoding is Charset. If you want UTF-8, you will use:
StandardCharsets.UTF_8 (Java 7+),
Charset.forName("UTF-8") (Java 6 and earlier),
Charsets.UTF_8 (if you use Guava).
In order to read a file correctly, open an InputStream to that file, then an InputStreamReader over that InputStream (in the code samples below, UTF8 is the UTF-8 charset obtained from one of the methods above):
final InputStream in = new FileInputStream(...);
final Reader reader = new InputStreamReader(in, UTF8);
In order to write a file correctly, open an OutputStream to it, then an OutputStreamWriter over that OutputStream:
final OutputStream out = new FileOutputStream(...);
final Writer writer = new OutputStreamWriter(out, UTF8);
And, of course, do not forget to .close() all of these streams/readers/writers in a finally block (or with try-with-resources). Hint: if you don't use Java 7, use Guava 14.0+'s Closer. It is the most secure way to deal with multiple I/O resources and ensure they are dealt with correctly.
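On Java 7+, try-with-resources does the closing for you; a minimal sketch of the reading side (the file name is a placeholder):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// The reader and the underlying stream are both closed automatically,
// even if an exception is thrown while reading.
try (Reader reader = new InputStreamReader(
        new FileInputStream("input.txt"), StandardCharsets.UTF_8)) {
    char[] buf = new char[4096];
    int n;
    while ((n = reader.read(buf)) != -1) {
        // process buf[0..n) here
    }
}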
You have written the code already (I didn't follow the links), but you may compare it with How to import a font - registerFont is crucial.
Also, in a jar file all paths are case-sensitive. You may inspect the jar with 7zip or WinZip.

File upload-download in its actual format

I have to write code to upload/download a file to/from a remote machine. But when I upload the file, newlines are not preserved, and some binary characters are automatically inserted. Also, I'm not able to save the file in its actual format; I have to save it as "filename.ser". I'm using the serialization/deserialization concept of Java.
Thanks in advance.
How exactly are you transmitting the files? If you're using implementations of InputStream and OutputStream, they work on a byte-by-byte level so you should end up with a binary-equal output.
If you're using implementations of Reader and Writer, they convert the bytes to characters according to some character mapping, and then perform the reverse process when saving. Depending on the platform encodings of the various machines (and possibly other effects if you're not specifying the charset explicitly), you could well end up with differences in the binary file.
The fact that you mention newlines makes me think that you're using Readers to send strings (and possibly that you're stitching the strings back together yourself by manually adding newlines). If you want the files to be binary equal, then send them as a stream of bytes and store that stream verbatim. If you want them to be equal as strings in a given character set, then use Readers and Writers but specify the character set explicitly. If you want them to be transmitted as strings in the platform default set (not very useful), then accept that they're not going to be binary equal as files.
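A minimal sketch of that byte-for-byte approach (how you open the two streams depends on your transport, so the helper below is hypothetical):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Copies any stream verbatim; with no Reader/Writer involved, no
// charset or newline conversion can alter the bytes.
static void copy(InputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
    }
}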
(Also, your question really doesn't provide much information to solve it. To me, it basically reads "I wrote some code to do X, and it doesn't work. Where did I go wrong?" You seem to assume that your code is correct by not listing it, but at the same time recognise that it's not...)
