Character encoding UTF and ISO-8859-1 in CSV [duplicate]

Character encoding UTF and ISO-8859-1 in CSV [duplicate] - java

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How to add a UTF-8 BOM in java
My oracle database has a character set of UTF8.
I have a Java stored procedure which fetches record from the table and creates a csv file.
BLOB retBLOB = BLOB.createTemporary(conn, true, BLOB.DURATION_SESSION);
retBLOB.open(BLOB.MODE_READWRITE);
OutputStream bOut = retBLOB.setBinaryStream(0L);
ZipOutputStream zipOut = new ZipOutputStream(bOut);
PrintStream out = new PrintStream(zipOut,false,"UTF-8");
The german characters(fetched from the table) becomes gibberish in the csv if I use the above code. But if I change the encoding to use ISO-8859-1, then I can see the german characters properly in the csv file.
PrintStream out = new PrintStream(zipOut,false,"ISO-8859-1");
I have read in some posts which says that we should use UTF8 as it is safe and will also encode other language (chinese etc) properly which ISO-8859-1 will fail to do so.
Please suggest me which encoding I should use. (There are strong chances that we might have chinese/japanese words stored in the table in the future.)

You're currently only talking about one part of a process that is inherently two-sided.
Encoding something to bytes is only really relevant in the sense that some other process comes along and decodes it back into text at some later point. And of course, both processes need to use the same character set else the decode will fail.
So it sounds to me that the process that takes the BLOB out of the database and into the CSV file, is assuming that the bytes are an ISO-8859-1 encoding of text. Hence if you store them as UTF-8, the decoding messes (though the basic ASCII characters have the same byte representation in both, which is why they still decode correctly).
UTF-8 is a good character set to use in almost all circumstances, but it's not magic enough to overcome the immutable law that the same character set must be used for decoding as was used for encoding. So you can either change your CSV-creator to decode with UTF-8, else you'll have to continue encoding with ISO-8859-1.

I suppose your BLOB data is ISO-8859-1 encoded. As it's stored as binary and not as text its encoding is not depended on the databases encoding. You should check if the the BLOB was originaly written in UTF-8 encoding and if not, do so.

I think the problem is [Excel]csv could not figure out the utf8 encoding.
utf-8 csv issue
But I m still not able to resolve the issue even if I put a BOM on the PrintStream.
PrintStream out = new PrintStream(zipOut,false,"UTF-8");
out.write('\ufeff');
I also tried:
out.write(new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF });
but to no avail.

Related

decoding and encoding strings, ISO-8859-1 to UTF-8 in Java

I have read the other posts on this issue, but the solutions they presented did not work for me. Actually, the official Java documentation also did not work as intended (I am using Java 11) : https://docs.oracle.com/javase/tutorial/i18n/text/string.html
My problem is that I am reading one byte at a time from a byte buffer, putting that in a byte array, and making a String out of that byte array. The bytes I read are from an embedded system that can only send ISO-8859-1 bytes, so I end up with a byte array with ISO-8859-1 bytes and the Java String I end up getting is thus ISO-8859-1 encoded. No problem here. The String in IntelliJ looks like this :
The bytes I am trying to convert from ISO-8859-1 to UTF-8 are the ones in yellow. I want them to be UTF-8, so in the end the "C9" byte should be replace by the "C3A9" bytes.
The first step works correctly, I do this : maintenanceResponseString.getBytes(StandardCharsets.UTF_8) and I get the right bytes that I want, the UTF-8 encoding of the string, that's good :
The problem comes in here , when I try to make a STRING out of these new (and GOOD) bytes, like this :
new String(maintenanceResponseString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
The old bytes are back ?!! It's like the "getBytes(UTF-8)" never actually happened. That is NOT what the documentation says should happen... what am I missing here ? I have done tests and the string really is still ISO-8859-1 encoded... I don't know what is going on here. Where are the bytes from "getBytes" ?
How do you convert a String that contains ISO-8859-1 bytes to UTF-8 bytes ? I'm out of alternatives and I need to get it done real bad for a pro project... this should be easy !
Note : I have tried alternatives like
ByteBuffer buffer = StandardCharsets.UTF_8.encode(s);
return StandardCharsets.UTF_8.decode(buffer).toString();
But the exact same thing happens.
Thank you in advance for your help.
EDIT :
With some info in the comments about how Strings in Java 9+ get represented internally not as UTF-16 only anymore, but Latin-1 (why...), I think that is what made me think the Strings were "internally encoded in Latin-1" when it is just the default representation of the String if we don't specify the encoding we want to use when displaying the String.
From what I undestand now the String itself is not bound to any encoding, and you can CHOOSE the encoding you want to display it in when it gets written.
Actually my issue is that the String ends up written to an XML file via JAXB marshalling in LATIN-1, and I now think the issues lies over there... I will dig further when I access my work computer again and report here

It turns out there was nothing wrong with Strings and "their encoding". What happened is I got really confused because the debugger shows the contents of the String in a "default internal storage encoding", and that is ISO-8859-1 (but can be UTF-16, depends on the content of the String).
Quote from the JEP-254 :
We propose to change the internal representation of the String class
from a UTF-16 char array to a byte array plus an encoding-flag field.
The new String class will store characters encoded either as
ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes
per character), based upon the contents of the string. The encoding
flag will indicate which encoding is used.
But actually it doesn't matter the internal encoding storage. When it is time to be written, the String will use whatever encoding you want at the time of writing.
My issue actually was when I was sending the String in an HTTP request with Spring RestTemplate. I didn't have the header specifying the "charset" to use in the request, and RestTemplate defaults to ISO-8859-1 if not told otherwise. I added the charset=utf-8, and the String was correctly written as UTF-8 in the request.
Thank you to #VGR #Eugene #skomisa for the help

When is encoding being relevant in Java?

This might be a bit beginner question but it's fairly relevant considering debbuging encoding in Java: At what point is an encoding being relevant to a String object?
Consider I have a String object that I want to save to a file. Is the String object itself using some sort of encoding I should manipulate or this encoding will only be informed by me when I create a stream of bytes to save?
The same applies to importing: when I open a file and get it's bytes, I assume there's no encoding at hand, only bytes. When I parse this bytes to a String, I got to use an encoding to understand what characters are they. After I parse those bytes, the String (in memory) has some sort of meta information with the encoding or this is only being handled by the JVM?
This is vital considering I'm having file import/export issues and I got to understand at which point I should worry about getting the right encoding.
Hope I explained my doubt well, and thank you in advance!

Java strings do not have explicit encoding information. They don't know where they came from, and they don't know where they are going. All Java strings are stored internally as UTF-16.
You (optionally) specify what encoding to use whenever you want to turn a String into a sequence of bytes (e.g., to save to a file), or when you want to turn a sequence of bytes (e.g., read from a file) into a String.

Encoding is important to String when you are de/serializing from disk or the web. There are multiple text file formats: ascii, latin-1, utf-8/16 (I believe there may be two utf-16 formats, but I'm not 100%)
See InputStreamReader for how to load a String from text encoded in a non-default format

Find out encoding directly from an input stream [duplicate]

I'm facing a problem.
A file can be written in some encoding such as UTF-8, UTF-16, UTF-32, etc.
When I read a UTF-16 file, I use the code below:
BufferedReader in = new BufferedReader(
new InputStreamReader(
new FileInputStream(file), "UTF16"));
How can I determine which encoding the file is in before I read the file ?
When I read UTF-8 encoded file using UTF-16 I can't read the characters correctly.

There is no good way to do that. The question you're asking is like determining the radix of a number by looking at it. For example, what is the radix of 101?
Best solution would be to read the data into a byte array. Then you can use String(byte[] bytes, Charset charset) to test it with multiple encodings, most likely to least likely.

You cannot. Which transformation format applies is usually determined by the first four bytes of the file (assuming a BOM). You cannot see those just from the outside.

You can read the first few bytes and try to guess the encoding.
If all else fails, try reading with different encodings until one works (no exception when decoding and it 'looks' OK).

Android UTF-8 vs ANSI .equals not working / how change charset?

i read a list for my android app from a csv or txt file.
If the File is encoded UTF-8 with Notepad++ i seh the list all right. But i cant search/find strings with .equals.
If the file is encoded with widows as ansi, is cant see äöü etc. But now i can find strings.
Now my question. How can i found out what charset my string has?
I compare my frist string (from the file) with another string, read in in the app with searchview.
I "THINK" my searchview string from the app is ansi too, how to change that to UTF-8 and hope that the compare then works, again.
Android 4.4.2
Thank you
following dosent work:
String s = null;
try
{
s = new String(query.getBytes(), "UTF-8");
}
catch (UnsupportedEncodingException e)
{
Log.e("utf8", "conversion", e);
}

Java strings are always encoded as UTF-16, regardless of where the string data comes from.
It is important that you correctly identify the charset of the source data when converting it to a Java string. new String(query.getBytes(), "UTF-8") will work fine if the byte[] array is actually UTF-8 encoded. If you specify the wrong charset, you will get an UnsupportedEncodingException error only if you specify a charset that Java does not support. However, if you specify a charset that Java does support, and then the decoding of the data fails (typically because you specified the wrong charset for the data), you will get other errors instead, such as MalformedInputException or UnmappableCharacterException, or worse you will not get any errors at all and malformed/illegal bytes will simply be converted to the Unicode U+FFFD replacement character instead. If you need more control over error handling during the conversion process, you need to use the CharsetDecoder class instead.
Sometimes UTF-encoded files will have a BOM in the front, so you can check for that. But Ansi files do not use BOMs. If a UTF BOM is not present in the file, then you have to either analyze the raw data and take a guess (which will lead to problems if you guess wrong), or simply ask the user which charset to use.
Always know the charset of your data. If you don't know, ask. Avoid guessing.

ISO-8859-1 to UTF-8 in Java

An XML containing 哈瓦那 (UTF-8) is sent to Service A.
Service A sends it to Service B.
The string was encoded to å“ˆç“¦é‚£ (ISO-8859-1).
How do I encode it back to 哈瓦那? Considering that all strings in Java are UTF-16. Service B has to compare it as 哈瓦那 not å“ˆç“¦é‚£.
Thanks.

When you read a text file, you have to read it using the actual encoding used to create the file. If you specify the appropriate encoding, you'll get the correct characters in memory. So, if the same file (semantically) exists in two versions (UTF-8 encoded and ISO-8859-1), reading the first one with UTF-8 and the second one with ISO-8859-1 will lead to exactly the same chars in memory.
The above is true only if it made sense to encode the file in ISO-8859-1 in the first place. UTF-8 is able to store every unicode character. But ISO-8859-1 is able to encode only a small subset of the unicode characters (western languages characters). The characters you posted literally look like Chinese to me, and I don't think encoding them in ISO-8859-1 is even possible without losing everything.

I think you are misdiagnosing the problem:
An XML containing 哈瓦那 (UTF-8) is sent to Service A.
OK ...
Service A sends it to Service B.
OK ...
The string was converted to å“ˆç“¦é‚£ (ISO-8859-1).
This is not correct. The string has not been "converted". Rather, it has been decoded with the wrong character encoding. Specifically, it looks very much like something has taken UTF-8 encoded bytes, and assumed that they are ISO-8859-1 encoded, and decoded them accordingly.
Can you unpick this? It depends where the mistaken decoding first occurred. If it happens in Service B, then you should be able to relabel the data source as UTF-8, and then decode it correctly. On the other hand, if the first mistaken decoding happens in service A, then you could be out of luck. A mistaken decoding can result in loss of data as unrecognized codes are replaced with some other character. If that happens, the original data will be gone forever.
In either case, the best way to deal with this is to figure out what is getting the wrong character encoding mixed up, and fix that. Perhaps the XML needs to be fixed to specify the charset / encoding. Perhaps, the transport mechanism (e.g. HTTP request or response) needs to be corrected to include the proper document encoding.

Use writers and readers to encode/decode your input/output streams:
String yourText = "...";
InputStream yourInputStream = ...;
Writer out = new OutputStreamWriter(youInputStream, "UTF-8");
out.write(yourText);
Same for reader.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Character encoding UTF and ISO-8859-1 in CSV [duplicate] - java

I suppose your BLOB data is ISO-8859-1 encoded. As it's stored as binary and not as text its encoding is not depended on the databases encoding. You should check if the the BLOB was originaly written in UTF-8 encoding and if not, do so.

Related

decoding and encoding strings, ISO-8859-1 to UTF-8 in Java

When is encoding being relevant in Java?

Find out encoding directly from an input stream [duplicate]

Android UTF-8 vs ANSI .equals not working / how change charset?

ISO-8859-1 to UTF-8 in Java

Categories

Resources