I have an error with my file: all the characters look like "Giá»âºi tÃÂnh". I want to write a Java program that converts those characters back to normal ones. I tried converting them to bytes and then back to a String, but the text stayed the same.
You need to know the encoding of the file in order to do this. Java internally represents all Strings as UTF-16, so to fix the issue you have to use the file's actual encoding when reading it: http://goo.gl/PoBgo (Java API Docs)
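As a rough sketch of what "use that encoding when reading" can look like: the class and method names below are made up for illustration, and UTF-8 is only an assumed encoding, since the real one has to be found out from whoever produced the file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWithCharset {
    // Demo helper (hypothetical): writes text to a temporary file in the given charset.
    static Path writeTemp(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            Files.write(file, text.getBytes(charset));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file, decoding its bytes with the given charset.
    static String readAll(Path file, Charset charset) {
        try {
            return new String(Files.readAllBytes(file), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample text written with escapes so the source file stays pure ASCII.
        String original = "Gi\u1EDBi t\u00EDnh";
        Path file = writeTemp(original, StandardCharsets.UTF_8); // assumption: file is UTF-8
        // Decoding with the charset the file was written in restores the text.
        System.out.println(readAll(file, StandardCharsets.UTF_8).equals(original));
        // Decoding the same bytes with a different charset yields garbled text.
        System.out.println(readAll(file, StandardCharsets.ISO_8859_1).equals(original));
    }
}
```

The bytes on disk are the same in both calls; only the charset passed to the String constructor changes the result.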
Related
I'm currently writing a Java program which involves goals. It's basically a to-do list. Each goal has a few strings, such as name, description etc. I can save and load these goals to a file. My issue was separating the strings - I couldn't think of a character that couldn't be in the string itself. I ended up prefixing each string with its length and then a colon.
I'm sure there is something in the Java API that will handle this, like ObjectOutputStream. I'm curious about the 'general case', though. This must be an issue for any program that saves and loads strings from a file without being able to assume anything about the string. Is there a better way to go about this?
There are a couple of ways to handle your case, e.g.:
Encoding your String with something like base64
Applying a well-defined format, e.g. JSON or CSV
Plenty of tools support this, including:
Apache Commons Codec for base64 encoding/decoding
Jackson for JSON serializing/deserializing
opencsv for CSV serializing/deserializing
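To make the base64 option concrete, here is a minimal sketch. It uses java.util.Base64 from the standard library (Java 8+) rather than Apache Commons Codec, and the class and method names are invented for the example. The idea is that base64 output never contains a comma or a newline, so either can safely separate fields.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class Base64Fields {
    // Encodes every field as base64 and joins them with commas on one line.
    static String join(List<String> fields) {
        List<String> encoded = new ArrayList<>();
        for (String field : fields) {
            byte[] utf8 = field.getBytes(StandardCharsets.UTF_8);
            encoded.add(Base64.getEncoder().encodeToString(utf8));
        }
        return String.join(",", encoded);
    }

    // Splits a line produced by join() and decodes each field.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        // Limit -1 keeps trailing empty fields.
        for (String part : line.split(",", -1)) {
            fields.add(new String(Base64.getDecoder().decode(part), StandardCharsets.UTF_8));
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> goal = new ArrayList<>();
        goal.add("Buy milk");
        goal.add("Semi-skimmed, 2 litres: today\nor tomorrow");
        String line = join(goal);
        System.out.println(line);
        System.out.println(split(line).equals(goal));
    }
}
```

One limitation of this sketch: an empty list and a list holding a single empty string both serialize to an empty line, so a real format would also need to store the field count.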
This might be a bit of a beginner question, but it's fairly relevant when debugging encoding in Java: at what point does an encoding become relevant to a String object?
Suppose I have a String object that I want to save to a file. Does the String object itself use some sort of encoding that I should manipulate, or do I only specify the encoding when I create a stream of bytes to save?
The same applies to importing: when I open a file and get its bytes, I assume there's no encoding at hand, only bytes. When I parse these bytes into a String, I have to use an encoding to work out which characters they are. After I parse those bytes, does the String (in memory) carry some sort of meta information about the encoding, or is this handled only by the JVM?
This is vital because I'm having file import/export issues and I need to understand at which point I should worry about getting the right encoding.
Hope I explained my doubt well, and thank you in advance!
Java strings do not have explicit encoding information. They don't know where they came from, and they don't know where they are going. All Java strings are stored internally as UTF-16.
You (optionally) specify what encoding to use whenever you want to turn a String into a sequence of bytes (e.g., to save to a file), or when you want to turn a sequence of bytes (e.g., read from a file) into a String.
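A small sketch of those two conversion points, with made-up class and method names. The String itself carries no charset; one only appears in getBytes and in the String constructor.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringBytesDemo {
    // String -> bytes: this is where an encoding is chosen.
    static byte[] toBytes(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // bytes -> String: the same encoding is needed to get the text back.
    static String fromBytes(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "cari\u00F1o"; // n with tilde, written as an escape
        // The same String becomes a different number of bytes per encoding.
        System.out.println(toBytes(text, StandardCharsets.ISO_8859_1).length);
        System.out.println(toBytes(text, StandardCharsets.UTF_8).length);
        System.out.println(toBytes(text, StandardCharsets.UTF_16BE).length);
        // Round trip with matching charsets gives the original text back.
        byte[] utf8 = toBytes(text, StandardCharsets.UTF_8);
        System.out.println(fromBytes(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```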
Encoding matters for a String when you serialize it to, or deserialize it from, disk or the web. There are multiple text encodings: ASCII, Latin-1, UTF-8 and UTF-16 (which comes in two byte orders, big-endian and little-endian).
See InputStreamReader for how to load a String from text encoded in a non-default format.
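For example, something along these lines should work with InputStreamReader. The helper name is made up, and ISO-8859-1 is just the encoding assumed for the demo input.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
    // Decodes everything from the stream using an explicitly named charset
    // instead of the platform default.
    static String readText(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a file stream: Latin-1 bytes held in memory.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        String text = readText(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("caf\u00E9"));
    }
}
```

With a real file you would pass a FileInputStream instead of the ByteArrayInputStream.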
I have a complex XML file that I have to parse with Java to get the text inside some tags.
This works, but some rows contain Cyrillic (Serbian) characters. They appear correctly in the XML, differently when I read them in Java, and differently again when I save them to Oracle!
How can I process this data and save it from XML to Oracle in the correct Cyrillic form? Thanks.
First: read http://www.joelonsoftware.com/articles/Unicode.html
Second: you don't get a "simple string"; you have a file, which contains bytes. Those bytes represent a string only under a given encoding. When you read the file in as a string, you need to specify that encoding or things will get corrupted.
Once you have a java.lang.String, it is an actual Unicode representation and encoding-independent. But when you want to push that string to a database, you once again need to think about encoding, because at some point the database will have to transform that string into bytes to store it.
Additionally: never "trust" an editor when it comes to examining encoding issues. They almost always have automagic stuff to make stuff work so something that "looks fine" might actually be corrupt or only valid given the assumptions that that specific editor made.
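For the XML side of the second point, one option (a sketch, not necessarily how the asker's parser is set up) is to hand the parser the raw bytes rather than a String you decoded yourself, so it can use the encoding named in the XML declaration. The element name and the sample document are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlEncodingDemo {
    // Builds a tiny document whose declaration names the charset its bytes are in.
    static byte[] sampleXml(String charsetName) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charsetName + "\"?>"
                + "<city>\u0411\u0435\u043E\u0433\u0440\u0430\u0434</city>";
        return xml.getBytes(Charset.forName(charsetName));
    }

    // Parses raw bytes; the parser reads the declaration and decodes accordingly.
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String expected = "\u0411\u0435\u043E\u0433\u0440\u0430\u0434";
        // Different bytes on disk, same String once parsed.
        System.out.println(rootText(sampleXml("UTF-8")).equals(expected));
        System.out.println(rootText(sampleXml("ISO-8859-5")).equals(expected));
    }
}
```

The database half still depends on the JDBC driver and the database character set, which this sketch does not cover.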
I have a properties file which may or may not contain Unicode-escaped characters in the values of its keys. Please see the sample below. My job is to ensure that if a value in the properties file contains a non-ASCII character, then it is Unicode-escaped. So, in the sample below, the first entry is OK, and all entries like the second one should be converted to look like the first.
##sample.properties
escaped=cari\u00F1o
nonescaped=cariño
normal=darling
Essentially, my question is: how can I differentiate in Java between cari\u00F1o and cariño, since Java treats them as identical?
Properties files in Java must be saved in the ISO-8859-1 character set for Java to read them properly (this is what Properties.load(InputStream) assumes). That means that it is possible to use special characters from Western European languages without escaping them. It is not possible to use characters from other languages such as those from Eastern Europe, Russia, or China without escaping them.
As such there are only a few non-ASCII characters that can appear in a properties file without being escaped.
To detect whether characters have been escaped or not, you will need to open the properties file directly, rather than through the Properties class. The Properties class does all the unescaping for you when you load a file through it. You should open the file as an InputStream, using a FileInputStream or Class.getResourceAsStream. Once you do so, you can scan through the input stream one byte at a time and ensure that all bytes are in the 0x20-0x7E range, plus the newline characters \r and \n; that is the range of ASCII characters you would expect in a properties file.
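The byte scan described above could look roughly like this. The class and method names are made up, and the accepted range is exactly the one stated above, so a tab character would also be reported.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class AsciiScan {
    // Returns the offset of the first byte outside 0x20-0x7E, CR and LF,
    // or -1 if every byte is in that range.
    static int firstUnexpectedByte(InputStream in) {
        try {
            int offset = 0;
            int b;
            while ((b = in.read()) != -1) {
                boolean expected = (b >= 0x20 && b <= 0x7E) || b == '\r' || b == '\n';
                if (!expected) {
                    return offset;
                }
                offset++;
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-ins for two properties files saved as ISO-8859-1.
        byte[] escaped = "escaped=cari\\u00F1o\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] raw = "nonescaped=cari\u00F1o\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstUnexpectedByte(new ByteArrayInputStream(escaped)));
        System.out.println(firstUnexpectedByte(new ByteArrayInputStream(raw)));
    }
}
```

For a real file, wrapping the FileInputStream in a BufferedInputStream avoids one system call per byte.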
I would suggest that your translators don't try to write properties files directly. They should provide you with documents like spreadsheets that you convert into properties files. Or they could use a translation editor such as Attesoro (which I wrote) to let them save the properties files properly escaped.
You could simply use the native2ascii tool, which performs exactly this conversion (it will convert all non-ASCII characters to escapes but leave existing escapes intact).
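If you would rather do the same conversion in code, a sketch could look like the following. This is not native2ascii itself, just an illustration of the idea with a made-up class name, and the hex digits are written in upper case to match the sample in the question.

```java
public class UnicodeEscaper {
    // Replaces every non-ASCII char with a backslash-u escape of four hex digits.
    // Existing escapes consist of ASCII characters only, so they pass through untouched.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The first argument contains the real character, the second an existing escape.
        System.out.println(escapeNonAscii("nonescaped=cari\u00F1o"));
        System.out.println(escapeNonAscii("escaped=cari\\u00F1o"));
        System.out.println(escapeNonAscii("normal=darling"));
    }
}
```

The text passed in has to be decoded with the right charset first, otherwise the escapes will describe the wrong characters.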
Your problem is that the Java Properties class decodes properties files assuming ISO-8859-1 encoding, and parses escaped Unicode characters while doing so.
So from a Properties point of view, these two strings are indeed the same.
I believe if you need to differentiate these two, you will need to write your own parser.
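You can see the effect with a few lines of code. This is only a demonstration of the behaviour described above, with an invented class name and the sample entries from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesUnescapeDemo {
    // Loads properties from text, going through ISO-8859-1 bytes as a file would.
    static Properties load(String content) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(content.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // First value: a literal backslash escape, as it would appear in the file.
        // Second value: the actual n-with-tilde character.
        String content = "escaped=cari\\u00F1o\n" + "nonescaped=cari\u00F1o\n";
        Properties props = load(content);
        // After loading, both values are the same String.
        System.out.println(props.getProperty("escaped").equals(props.getProperty("nonescaped")));
    }
}
```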
It's actually a feature that you do not need to care about by default. The one thing that strikes me as the most odd is that the (only) encoding is ISO-8859-1, probably for historical reasons.
The library ICU4J seems to be what you're looking for. See the Normalization page.
I'm working on a Java plugin which would allow people to write to and read from a file by specifying a charset encoding they would wish to use. However, I was confused as to how I would encode multiple encodings in a single file. For example, suppose that A characters come from one charset and B characters come from another, would it be possible to write "AAAAABBBBBAAAAA" to a file?
If it is not possible, is this generally true for any programming language, or specifically for Java? And if it is possible, how would I then proceed to read (decode) the file?
I do not want to use the encode() and decode() methods of Charset since tests with them have failed (some charsets were not decoded properly). I also don't want to use third-party programs for various reasons, so the scope of this question is purely in the standard java packages/code.
Thanks a lot!
N.S.
You'd need to read it as a byte stream and know beforehand at which byte positions the characters start and end, or use some special separator character or byte range which indicates the start and end of each character group. This way you can get the bytes of a specific character group and finally decode them using the desired character encoding.
This problem is not specific to Java. The requirement is just strange. I wonder how it makes sense to mix character encodings like that. Just use one uniform encoding all the time, for example UTF-8, which supports practically all characters known to mankind.
Of course it is in principle possible to write text that is encoded in different character sets into one file, but why would you ever want to do this?
A character encoding is simply a mapping from text characters to bytes and vice versa. A file consists of bytes. When writing a file, the character encoding determines how the characters are converted to bytes, and when reading, it determines how the bytes are converted back to characters.
You could have one part of the file encoded with one character encoding, and another part with another character encoding. You'd have to have some mechanism to keep track of what parts are encoded with what encoding, because the file doesn't automatically keep track of that for you.
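One possible tracking mechanism, sketched with standard classes only: write each part as the charset name, then the byte count, then the encoded bytes. The record layout and the class name are made up for this example, not an established file format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MixedEncodingFile {
    // Writes one segment: charset name, byte count, then the encoded bytes.
    static void writeSegment(DataOutputStream out, String text, Charset charset) throws IOException {
        byte[] bytes = text.getBytes(charset);
        out.writeUTF(charset.name());
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads one segment back, using the charset name stored in front of it.
    static String readSegment(DataInputStream in) throws IOException {
        Charset charset = Charset.forName(in.readUTF());
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeSegment(out, "AAAAA", StandardCharsets.ISO_8859_1);
            writeSegment(out, "\u0411\u0411\u0411\u0411\u0411", StandardCharsets.UTF_16BE);
            writeSegment(out, "AAAAA", StandardCharsets.ISO_8859_1);
        }
        byte[] data = buffer.toByteArray();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String text = readSegment(in) + readSegment(in) + readSegment(in);
            System.out.println(text.length());
        }
    }
}
```

A file written this way is no longer a plain text file, since the length and name fields are binary, which is one more reason to prefer a single encoding such as UTF-8.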
I was wondering about this as well, because my client just asked a similar question. As BalusC mentioned, this is not a Java-specific problem.
After some back and forth, I found that the real question might be about multiple encodings of the information, rather than a file with multiple encodings.
For example, suppose we have XML text whose content needs to be stored as ISO-8859-1. If we save it as a file, we need to encode it somehow. The default encoding for XML is UTF-8, and we do not necessarily have to encode the whole XML document as ISO-8859-1: the XML is just a vehicle for passing information to another system, and only the content (the value of the XML node) needs to be persisted as ISO-8859-1. So do we need multiple encodings in this case? Probably not. We can still encode the XML as UTF-8 and pass it over. Once the client receives the XML, they read the information out of the UTF-8 encoded file and persist the value of the XML node as ISO-8859-1.
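On the receiving side, the last step might look something like this sketch. The class name is made up, and rejecting unencodable values is my own assumption about what the client would want; without the check, getBytes would silently substitute '?' for such characters.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Check {
    // Converts a node value to ISO-8859-1 bytes, refusing values that do not fit.
    static byte[] toLatin1(String value) {
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(value)) {
            throw new IllegalArgumentException("value cannot be represented in ISO-8859-1");
        }
        return value.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The value came out of the UTF-8 XML as a normal String; only now is it re-encoded.
        System.out.println(toLatin1("cari\u00F1o").length);
        try {
            toLatin1("\u0411"); // a Cyrillic letter, not part of ISO-8859-1
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```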