Handling Norwegian and Icelandic letters in Java

In Java,
I am receiving text input which contains Norwegian and Icelandic characters.
I read a stream, parse it into a String, assign it to some variables, and then create the output.
When I produce the output, the Norwegian and Icelandic characters come out distorted as ? or ¶ etc. The output files show the same broken characters when opened.
I am building a web project (.war) using Maven. What basic settings are required to handle Icelandic/Norwegian text in code?
I found a way of setting the Locale but could not produce correct output with it: Locale.setDefault(new Locale("is_IS", "Iceland"));
Kindly suggest how to do this.
Actual text: HÝS048
Distorted text: HÃ?S048 (when printed directly with System.out) or H??S048 (when I get the bytes from the String and build a new String using UTF-8)
Update (11:13)
I have used
CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
encoder.onMalformedInput(CodingErrorAction.REPORT);
encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("d:\\try1.csv"),encoder));
out.write(sb.toString());
out.flush();
out.close();
Output: H�S048
Update (12:41):
While reading the stream from the HTTP source I have used the following:
`BufferedReader in = new BufferedReader(new InputStreamReader(apiURL.openStream(), "UTF-8"));`
It shows the output perfectly on the console.
I fetch the values for the CSV and, after the business logic, put them into a bean.
Now I need to create the CSV file, but when I read the values back from the bean the text is distorted again. I am using a StringBuilder to append the bean values and write them to the file. Looking for ideas.

The solution to this problem is to keep the data in UTF-8 at every step: read it in UTF-8, print it in UTF-8, and create the file in UTF-8.
Read the data from the URL as below:
BufferedReader in = new BufferedReader(new InputStreamReader(apiURL.openStream(), "UTF-8"));
Then set it on beans or do whatever you want. When printing, note that System.out encodes with the console's charset, so the console itself must be configured for UTF-8:
System.out.println(sb.toString());
Then, when creating the file, use a writer with an explicit charset; a plain FileWriter would silently use the platform default encoding:
Writer writer = new OutputStreamWriter(new FileOutputStream("d:\\try2.csv"), StandardCharsets.UTF_8);
writer.append(sb.toString());
writer.flush();
writer.close();
This is how my problem got resolved.
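The same end-to-end round trip can be sketched more compactly with the NIO file API, which takes the charset explicitly at every step. This is a minimal illustrative sketch, not the asker's actual code; the class name and the temp file are made up for the example:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8WriteSketch {
    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("try2", ".csv");
        // Files.newBufferedWriter takes the charset explicitly, so the
        // platform default encoding never gets involved.
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            w.write("HÝS048");
        }
        // Reading back with the same charset round-trips the text intact.
        String back = Files.readString(out, StandardCharsets.UTF_8);
        System.out.println(back.equals("HÝS048")); // true
    }
}
```

As long as the same charset is named on both the write and the read, the text survives regardless of the platform default.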

Related

Files.readAllLines() does not read all characters correctly

I have a simple text file which contains only one character, '≤'. Nothing else. The file is encoded in UTF-8.
When I read this file using Files.readAllLines(), the character is shown as a question mark '?':
try (FileWriter fw = new FileWriter(new File(file, "f.txt"));
     PrintWriter writer = new PrintWriter(fw)) {
    List<String> lines = Files.readAllLines(deProp.toPath());
    for (String line : lines) {
        System.out.println(line);
        writer.write(line);
        writer.println();
    }
}
In my example I am trying to print the line to the console and to a new file. In both cases a question mark is shown instead.
Any suggestions to solve this?
Files.readAllLines(path) already uses UTF-8 (see the linked documentation). If you're using the Files.readAllLines(path, charset) variant, pass UTF-8 as the charset (for example, StandardCharsets.UTF_8).
Assuming you're using either the short version or passing UTF-8, the error lies not with Java but with your setup.
Either the file doesn't actually contain ≤ in UTF-8; or you're printing it in Java to a place whose font lacks the symbol and shows ? as a placeholder (a box symbol is more common for that); or you're sending the output somewhere that incorrectly assumes the data is not UTF-8.
The static method of the Files class, i.e.
public static List<String> readAllLines(Path path) throws IOException
reads all the lines from a file. The bytes of the file are decoded into characters using the UTF-8 charset. Invoking this method is equivalent to evaluating the expression:
Files.readAllLines(path, StandardCharsets.UTF_8)
It is possible that the file contains bytes that are not valid UTF-8. Check the text inside the file manually once.
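One way to check whether the file really contains '≤' in UTF-8 is to look at the raw bytes. A small sketch using a temp file (the class name is illustrative only):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ReadAllLinesSketch {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("le", ".txt");
        // '≤' is U+2264; in UTF-8 it occupies three bytes (0xE2 0x89 0xA4).
        Files.write(f, "≤".getBytes(StandardCharsets.UTF_8));
        List<String> lines = Files.readAllLines(f); // defaults to UTF-8
        System.out.println(lines.get(0).equals("≤"));          // true
        System.out.println(Files.readAllBytes(f).length == 3); // true
    }
}
```

If the byte count or the byte values differ from the expected UTF-8 sequence, the file was saved in some other encoding and readAllLines() is not the culprit.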

Shift_JIS to UTF_8 conversion of full width tilde [~] character returns a thicker character

I'm processing Shift_JIS files and outputting UTF-8 files. Most of the characters are displayed as expected when viewed in a text editor, except for the full width tilde character [~]. It becomes thicker similar to this: [~].
note: this is not the same character, I just don't know how to type it here so I bolded it
When I type it manually in the UTF-8 file, I get the regular version.
Here is my code:
try (BufferedReader in = new BufferedReader(new InputStreamReader (
new FileInputStream(inFile), Charset.forName("Shift_JIS")))) {
try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter (
new FileOutputStream(outFile), StandardCharsets.UTF_8))) {
IOUtils.copy(in, out);
}
}
I also tried using "MS932" and also tried not using IOUtils.
To read Shift_JIS files made with Windows, you have to use Charset.forName("windows-31j") rather than Charset.forName("Shift_JIS").
Java distinguishes between Shift_JIS and windows-31j. What Windows documentation calls "Shift_JIS" corresponds to windows-31j (MS932) in Java; what AIX documentation calls "Shift_JIS" corresponds to Shift_JIS in Java.
The character mappings of windows-31j and Shift_JIS differ slightly. For example, ~ (0x8160) is mapped to U+301C (WAVE DASH) by Shift_JIS and to U+FF5E (FULLWIDTH TILDE) by windows-31j. The Microsoft IME uses U+FF5E to represent the character ~.
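The mapping difference can be observed directly by decoding the same two bytes with both charsets; a small sketch (the class name is made up for the example):

```java
import java.nio.charset.Charset;

public class TildeMappingSketch {
    public static void main(String[] args) {
        byte[] waveDash = {(byte) 0x81, (byte) 0x60}; // the byte pair 0x8160
        String sjis = new String(waveDash, Charset.forName("Shift_JIS"));
        String w31j = new String(waveDash, Charset.forName("windows-31j"));
        // Shift_JIS decodes 0x8160 to WAVE DASH, windows-31j to FULLWIDTH TILDE.
        System.out.println(sjis.charAt(0) == '\u301C'); // true
        System.out.println(w31j.charAt(0) == '\uFF5E'); // true
    }
}
```

So the "thicker" tilde in the output is U+301C WAVE DASH, produced because the file was decoded as Shift_JIS instead of windows-31j.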

character ° encoding and visualization in txt file

I have a field in a table that contains the string "Address Pippo p.2 °".
My program reads this value and writes it into a txt file, but the output is:
"Address Pippo p.2 Â°" (the Â is unwanted)
This is a problem because the txt file is a positional (fixed-width) file.
I open the file with these Java instructions:
FileWriter fw = new FileWriter(file, true);
pw = new PrintWriter(fw);
I want to write the string without the strange character.
Any help? Thanks in advance.
Try writing the file with an explicit UTF-8 writer. FileWriter always uses the platform default encoding, which is where the stray characters come from:
File file = new File("D://test.txt");
Writer w = new OutputStreamWriter(new FileOutputStream(file, true), StandardCharsets.UTF_8);
PrintWriter pw = new PrintWriter(w);
String test = "Address Pippo p.2 °";
pw.write(test);
pw.close();
Java uses Unicode. When you write text to a file, it gets encoded using a particular character encoding. If you don't specify it explicitly, it will use a "system default encoding" which is whatever is configured as default for your particular JVM instance. You need to know what encoding you've used to write the file. Then you need to use the same encoding to read and display the file content. The funny characters you are seeing are probably due to writing the file using UTF-8 and then trying to read and display it in e.g. Notepad using Windows-1252 ("ANSI") encoding.
Decide what encoding you want and stick to it for both reading and writing. To write using Windows-1252, use:
Writer w = new OutputStreamWriter(new FileOutputStream(file, true), "windows-1252");
And if you write in UTF-8, then tell Notepad that you want it to read the file in UTF-8. One way to do that is to write the character '\uFEFF' (Byte Order Mark) at the beginning of the file.
If you use UTF-8, be aware that non-ASCII characters will throw the subsequent bytes out of position. So if, for example, a telephone field must always start at byte position 200, then having a non-ASCII character in an address field before it will make the telephone field start at byte position 201 or 202. Using windows-1252 encoding you won't have this issue, but that encoding can't encode all Unicode characters.
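The stray character in the question follows directly from this: the degree sign is one byte in windows-1252 but two bytes in UTF-8, and when those two UTF-8 bytes are decoded as windows-1252 the extra byte shows up as Â. A small illustrative sketch (class name made up for the example):

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class DegreeSignSketch {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "°"; // U+00B0 DEGREE SIGN
        // One byte (0xB0) in windows-1252; two bytes (0xC2 0xB0) in UTF-8.
        System.out.println(s.getBytes("windows-1252").length);        // 1
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 2
        // Decoding the two UTF-8 bytes as windows-1252 yields the stray Â.
        String mis = new String(s.getBytes(StandardCharsets.UTF_8), "windows-1252");
        System.out.println(mis); // Â°
    }
}
```

This byte-count difference is also exactly why non-ASCII characters shift the columns of a fixed-width file when UTF-8 is used.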

Convert Hindi Output à¤à¥ विनियमन à¤à¤§à¤¿à¤¨à¤¿à¤¯à¤®?

à¤à¥ विनियमन à¤à¤§à¤¿à¤¨à¤¿à¤¯à¤®
This is the output I obtain when I translate an English sentence. Is there any way to make it readable?
The goal is to translate an English sentence to Hindi. The translated Hindi output appears correctly in the console; I need to write it to a text file.
The translated sentence is set as "translation", fetched with getParameter(), and saved to the file:
String translation = request.getParameter("translation");
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(fileDir,true), "UTF-8");
BufferedWriter fbw = new BufferedWriter(writer);
fbw.write(translation);
This is an issue of mismatched character encodings.
Make sure the data returned from the request parameter is UTF-8-encoded; in a servlet, call request.setCharacterEncoding("UTF-8") before reading any parameter.
If the data is in a different encoding, you will have to use that same encoding when writing to the file.
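The pattern in the question (à¤…, à¥…) is the classic signature of UTF-8 bytes decoded as Latin-1 somewhere along the way. If the data was mangled exactly once in that direction, the damage is reversible, because ISO-8859-1 maps every byte value to a character losslessly. A sketch of that repair under that assumption (class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepairSketch {
    public static void main(String[] args) {
        String hindi = "विनियमन";
        // Simulate the bug: UTF-8 bytes wrongly decoded as ISO-8859-1.
        String garbled = new String(hindi.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        // Reversing that single wrong step recovers the original text.
        String repaired = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
        System.out.println(repaired.equals(hindi)); // true
    }
}
```

This is a last-resort recovery trick; the proper fix is still to make sure the parameter is decoded as UTF-8 in the first place.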

Same unicode character behaves differently in different IDEs

When I read the following Unicode string, it reads differently in different IDEs. When I run the program from NetBeans it works fine, but when I run it from Eclipse or directly from CMD it does not.
After reading, the characters ƒÂ are added
and the string becomes Mýxico.
The string to be read is Mýxico. I used CSVReader as follows:
sourceReader = new CSVReader(new FileReader(soureFile));
List<String[]> data = sourceReader.readAll();
Any suggestions?
It sounds like the different environments are using different encodings; for example, one is using UTF-8 and another something else.
Check that the encoding settings in all of them are the same.
We should specify the encoding when reading the file. So the statement above should be changed as follows:
targetReader=new CSVReader(new InputStreamReader(
new FileInputStream(targetFile), "UTF-8"));
data = targetReader.readAll();
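The underlying issue is that FileReader has no charset parameter and silently uses the platform default, which is what varies between NetBeans, Eclipse, and CMD. A self-contained sketch of the explicit-charset approach using plain JDK classes and a temp file (no CSVReader, so it runs without opencsv; the class name is made up):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetReadSketch {
    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempFile("source", ".csv");
        Files.write(csv, "Mýxico".getBytes(StandardCharsets.UTF_8));
        // Files.newBufferedReader pins the charset, unlike FileReader,
        // which silently uses the platform default.
        try (BufferedReader r = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            System.out.println(r.readLine().equals("Mýxico")); // true
        }
    }
}
```

With the charset pinned, the result is identical no matter which IDE or shell launches the program.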
