I have a file to read, do something with its information, and then write it back out to another file. The problem is that the original file contains some characters from Asian languages, like 坂本龍一, 東京事変 and メリー (I guess they're Chinese, Japanese and Korean). I can see them fine using Notepad++.
The problem is that when I read and rewrite them via Java, they get corrupted, and I see weird stuff in my output file like ???????? or Жанна БичевÑ?каÑ?
I think I've got something wrong with the encoding, but I have no idea which one to use or how to use it.
Can someone help me? Here's my code:
String fileToRead = SONG_2M;
Scanner scanner = new Scanner(new File(fileToRead), "UTF-8");
while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    String[] songData = line.split("\t");
    if (/* something */) {
        // save the string in the map
    }
}
scanner.close();
saveFile("coded_artist_small2.txt");
}
public void saveFile(String fileToSave) throws FileNotFoundException, UnsupportedEncodingException {
    PrintWriter writer = new PrintWriter(fileToSave, "UTF-8");
    for (Entry<String, Integer> entry : artistsMap.entrySet()) {
        writer.println(entry.getKey() + DELIMITER + entry.getValue());
    }
    writer.close();
}
It is likely that your input file is not, in fact, encoded in UTF-8. (Note that UTF-8 is a variable-width encoding that uses one to four bytes per character, not two.) For instance, the character 坂 you are seeing is Unicode code point U+5742, which UTF-8 encodes as the three bytes 0xE5 0x9D 0x82. If the file is actually in some other encoding, or if those bytes are decoded with the wrong charset, you get exactly the kind of garbled output you describe.
If you're unsure of your file's encoding, take a guess that it might be text in your platform's default charset. Try removing the encoding when you set up the Scanner, i.e. make the second line of your code:
Scanner scanner = new Scanner(new File(fileToRead));
If, in fact, you know the file is Unicode, bear in mind that there are several Unicode encodings. See this answer for a more comprehensive Unicode reader that deals with the various encodings.
For your output, you need to decide how you want the file encoded: some Unicode encoding (e.g. UTF-8) or ASCII.
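If you settle on UTF-8 for both sides, a minimal round-trip sketch could look like the following (the file names are placeholders, and it assumes the input really is UTF-8):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class Recode {
    public static void main(String[] args) throws IOException {
        // Read and write with explicit charsets instead of the platform default
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new FileInputStream("input.txt"), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(new OutputStreamWriter(
                 new FileOutputStream("output.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line); // characters survive because both sides agree on UTF-8
            }
        }
    }
}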
I have a simple text file which contains only one character, '≤'. Nothing else. The file has UTF-8 encoding.
When I read this file using the method Files.readAllLines(), the character is shown as a question mark '?'.
try (FileWriter fw = new FileWriter(new File(file, "f.txt"));
     PrintWriter writer = new PrintWriter(fw)) {
    List<String> lines = Files.readAllLines(deProp.toPath());
    for (String line : lines) {
        System.out.println(line);
        writer.write(line);
        writer.println();
    }
}
In my example I am trying to print the line to the console and to a new file. In both cases a question mark is shown instead.
Any suggestions to solve this?
Files.readAllLines(path) already uses UTF-8 (see the linked documentation). If you're using the Files.readAllLines(path, charset) variant, pass UTF-8 as the charset, of course (for example by using StandardCharsets.UTF_8).
Assuming you're using either the short version or passing UTF-8, the error lies not with Java, but with your setup.
Either the file doesn't actually contain ≤ encoded as UTF-8; or you're printing it in Java to a place that can't show the symbol (for example because your font lacks it and substitutes a placeholder character, though that is more usually a box than a question mark); or you're sending the output someplace that incorrectly presumes that what is sent is not UTF-8.
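Note also that the FileWriter in your snippet encodes with the platform default charset (explicit-charset overloads only arrived in later Java versions, and UTF-8 only became the default in Java 18), so even a correct read can be undone on the write side. A sketch of a fully explicit round trip, with placeholder paths standing in for your file and deProp:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;

public class CopyLines {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("de.properties"); // placeholder for deProp.toPath()
        Path target = Paths.get("f.txt");
        List<String> lines = Files.readAllLines(source); // decodes as UTF-8
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}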
The static method of the Files class, i.e.
public static List<String> readAllLines(Path path) throws IOException
reads all the lines from a file. The bytes from the file are decoded into characters using the UTF-8 charset. Invoking this method is equivalent to evaluating the expression:
Files.readAllLines(path, StandardCharsets.UTF_8)
It may be that the file contains bytes that are not valid UTF-8. Check the text inside the file manually once.
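If you'd rather verify than eyeball it, one option is to dump the file's first bytes in hex; '≤' in UTF-8 should appear as e2 89 a4. A quick sketch (the path is a placeholder):

import java.io.IOException;
import java.nio.file.*;

public class HexDump {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("f.txt")); // placeholder path
        for (int i = 0; i < Math.min(bytes.length, 16); i++) {
            System.out.printf("%02x ", bytes[i] & 0xff); // '≤' in UTF-8 is e2 89 a4
        }
        System.out.println();
    }
}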
I've looked around for answers to this (I'm sure they're out there), and I'm not sure it's possible.
So, I got a HUGE file that contains the word "för". I'm using RandomAccessFile because I know where it is (kind of) and can therefore use the seek() function to get there.
To know that I've found it, I have a String "för" in my program that I check for equality. Here's the problem: I ran the debugger, and when I get to "för", what I actually get to compare against is "för".
So my program terminates without finding any "för".
This is the code I use to get a word:
private static String getWord(RandomAccessFile file) throws IOException {
    StringBuilder stb = new StringBuilder();
    char c = (char) file.read();
    int end;
    do {
        stb.append(c);
        end = file.read();
        if (end == -1)
            return "-1";
        c = (char) end;
    } while (c != ' ');
    // note: String.trim() returns a new string, so its result must be used
    return stb.toString().trim();
}
So basically I return all the characters from the current point in the file up to the first ' ' character. So I do get the word, but since (char) file.read() reads a single byte (I think), the two-byte UTF-8 'ö' becomes the two characters 'Ã' and '¶'?
One reason for this guess is that if I open my file with encoding UTF-8 it's "för", but if I open the file with ISO-8859-15, the same place now has exactly what my getWord method returns: "för".
So my question:
When I'm sitting with a "för" and a "för", is there any way to fix this? Like saying "read "för" as if it was a UTF-8 string" to get "för"?
If you have to use a RandomAccessFile, you should read the content into a byte[] first and then convert the complete array to a String, something along the lines of:
byte[] buffer = new byte[whatever];
file.read(buffer);
String result = new String(buffer,"UTF-8");
This is only to give you a general impression what to do, you'll have to add some length-handling etc.
This will not work correctly if you start reading in the middle of a UTF-8 sequence, but so will any other method.
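As a rough illustration of that length-handling (the offset and buffer size are made-up values you would have to adapt to your file):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class ReadChunk {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("huge.txt", "r")) {
            file.seek(1024);               // hypothetical offset near the word;
                                           // if it lands mid-sequence, the first char decodes as \ufffd
            byte[] buffer = new byte[256]; // hypothetical chunk size
            int read = file.read(buffer);
            if (read > 0) {
                // decode only the bytes actually read
                String chunk = new String(buffer, 0, read, StandardCharsets.UTF_8);
                System.out.println(chunk.contains("för"));
            }
        }
    }
}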
You are using RandomAccessFile.read(). This reads single bytes. UTF-8 sometimes uses several bytes for one character.
Different methods to read UTF-8 from a RandomAccessFile are discussed here: Java: reading strings from a random access file with buffered input
If you don't necessarily need a RandomAccessFile, you should definitely switch to reading characters instead of bytes.
If possible, I would suggest Scanner.next(), which reads whitespace-delimited tokens (i.e. words) by default.
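For instance, a sketch under the assumption that the file is UTF-8 and words are whitespace-separated (the file name is a placeholder):

import java.io.File;
import java.io.IOException;
import java.util.Scanner;

public class FindWord {
    public static void main(String[] args) throws IOException {
        try (Scanner scanner = new Scanner(new File("huge.txt"), "UTF-8")) {
            while (scanner.hasNext()) {
                String word = scanner.next(); // whitespace-delimited token, decoded as UTF-8
                if (word.equals("för")) {
                    System.out.println("found it");
                    break;
                }
            }
        }
    }
}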
import java.nio.charset.Charset;
String encodedString = new String(originalString.getBytes("ISO-8859-15"), Charset.forName("UTF-8"));
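This works because it re-encodes the wrongly decoded characters back to their original bytes and then decodes those bytes as UTF-8. A small self-contained demonstration along those lines:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixMojibake {
    public static void main(String[] args) {
        String mojibake = "för"; // what getWord() returned
        // 'Ã' is 0xC3 and '¶' is 0xB6 in ISO-8859-15; the byte pair 0xC3 0xB6 is UTF-8 for 'ö'
        String fixed = new String(mojibake.getBytes(Charset.forName("ISO-8859-15")),
                                  StandardCharsets.UTF_8);
        System.out.println(fixed); // prints "för"
    }
}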
This is my method
public void readFile3() throws IOException
{
    try
    {
        FileReader fr = new FileReader(Path3);
        BufferedReader br = new BufferedReader(fr);
        String s = br.readLine();
        int a = 1;
        while (a != 2)
        {
            s = br.readLine();
            a++;
        }
        Storage.add(s);
        br.close();
    }
    catch (IOException e)
    {
        System.out.println(e.getMessage());
    }
}
For some reason I am unable to read the file, which contains only this:
Name
Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
When I debug the code, the String s is returned as "\ufffd\ufffdN a m e", and I have no clue where those extra characters are coming from. This is preventing me from properly reading the file.
\ufffd is the Unicode replacement character; it is emitted when a decoder encounters bytes that it cannot map to a valid character. I suppose you are on a Windows platform (or at least the file you read was created on Windows). Windows supports many formats for text files; the most common is ANSI, where each character is represented by its ANSI code.
But Windows can also use UTF-16 directly, where each character is represented by its Unicode code point as a 16-bit integer, so with 2 bytes per character. Those files use a special marker (a Byte Order Mark, in Windows dialect) to say:
that the file is encoded with 2 (or even 4) bytes per character
whether the encoding is little or big endian
(Reference: Using Byte Order Marks on MSDN)
Since you see N a m e after the first two replacement characters, and not Name, I suppose you have a UTF-16 encoded text file: the replacement characters are the mangled BOM, and the apparent gaps between the letters are the zero bytes of the 16-bit characters. Notepad can transparently edit those files (without even telling you the actual format), but other tools do have problems with them...
The excellent vim can read files with different encodings and convert between them.
If you want to use this kind of file directly in Java, you have to use the UTF-16 charset. From the Java SE 7 javadoc on Charset: "UTF-16: Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark".
You must specify the encoding when reading the file; in your case it is probably UTF-16.
Reader reader = new InputStreamReader(new FileInputStream(fileName), "UTF-16");
BufferedReader br = new BufferedReader(reader);
Check the documentation for more details: InputStreamReader class.
Check to see if the file is .odt, .rtf, or something other than .txt. That may be what's causing the extra characters to appear. Also, make sure that (even if it is a .txt file) your file is encoded as UTF-8.
Perhaps your document contains characters outside ASCII, such as '®'.
So I'm reading a plain text file in Java, and I'd like to identify which lines start with "abc". I did the following:
Charset charset = StandardCharsets.UTF_8;
BufferedReader br = Files.newBufferedReader(file.toAbsolutePath(), charset);
String line;
while ((line = br.readLine()) != null) {
    if (line.startsWith("abc")) {
        // Do something
    }
}
But if the first line of the file is "abcd", it won't match. By debugging I've found out that the first character is a 0 (non-printable character), and because of this it won't match. Why is that so? How could I robustly identify which lines start with "abc"?
EDIT: perhaps I should point out that I'm creating the file using Notepad.
Windows has a few problems with UTF-8, and as such it is a heavy user of the UTF-8 BOM (Byte Order Mark).
If my guess is correct, the first three bytes would then be (in hexadecimal): 0xef, 0xbb, 0xbf.
Given that, for instance, Excel creates UTF-8 CSV files with a BOM prefix, I wouldn't be surprised at all if Notepad did as well...
edit: not surprisingly, it seems this is the case: see here.
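Java's UTF-8 decoder does not strip the BOM for you, so a common workaround is to peek at the first character and skip it if it is U+FEFF. A minimal sketch (the file name is a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class SkipBom {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("abc.txt"); // placeholder path
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            br.mark(1);
            if (br.read() != '\uFEFF') {
                br.reset(); // no BOM: put the first character back
            }
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith("abc")) {
                    System.out.println(line);
                }
            }
        }
    }
}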
I have a file which is encoded as ISO-8859-1, and contains characters such as ô.
I am reading this file with Java code, something like:
File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
int byteCount = fr.read(buffer, 0, buffer.length);
if (byteCount <= 0) {
break;
}
String s = new String(buffer, 0, byteCount,"ISO-8859-1");
System.out.println(s);
}
However the ô character is always garbled, usually printing as a ?.
I have read around the subject (and learnt a little on the way) e.g.
http://www.joelonsoftware.com/articles/Unicode.html
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
http://www.ingrid.org/java/i18n/utf-16/
but still cannot get this working.
Interestingly, this works on my local PC (XP) but not on my Linux box.
I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:
System.out.println(java.nio.charset.Charset.availableCharsets());
I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.
To check for the first, I recommend you examine the relevant byte in the file. To check for the second, examine the relevant character in the string, printing it out with
System.out.println((int) s.charAt(index));
In both cases the result should be 244 decimal; 0xf4 hex.
See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).
In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.
EDIT: Here's a really easy way to prove whether or not the console will work:
System.out.println("Here's the character: \u00f4");
Parsing the file as fixed-size blocks of bytes is not good: what if some character has a byte representation that straddles two blocks? (That cannot happen with a single-byte encoding such as ISO-8859-1, but it can with UTF-8 and other multi-byte encodings.) Use an InputStreamReader with the appropriate character encoding instead:
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("myfile.csv"), "ISO-8859-1"));
char[] buffer = new char[4096]; // character (not byte) buffer
while (true)
{
    int charCount = br.read(buffer, 0, buffer.length);
    if (charCount == -1) break; // reached end-of-stream
    String s = String.valueOf(buffer, 0, charCount);
    // alternatively, we can append to a StringBuilder
    System.out.println(s);
}
Btw, remember to check that the Unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.
As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf(s) to see if there is a difference.
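If the console is the suspect, you can also take Java's default charset out of the equation by wrapping System.out in a PrintStream with an explicit encoding. A sketch (it assumes your terminal itself is set to UTF-8):

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // force UTF-8 output regardless of the JVM's default charset
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("Here's the character: \u00f4");
    }
}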
@Joel - your own answer confirms that the problem is a difference between the default encoding of your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).
Consider this code:
public static void main(String[] args) throws IOException {
    byte[] data = { (byte) 0xF4 };
    String decoded = new String(data, "ISO-8859-1");
    if (!"\u00f4".equals(decoded)) {
        throw new IllegalStateException();
    }
    // write default charset
    System.out.println(Charset.defaultCharset());
    // dump bytes to stdout
    System.out.write(data);
    // will encode to default charset when converting to bytes
    System.out.println(decoded);
}
By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed:
UTF-8
?ô
If I switch the terminal's encoding to ISO 8859-1, this is printed:
UTF-8
ôô
In both cases, the same bytes are being emitted by the Java program:
5554 462d 380a f4c3 b40a
The only difference is in how the terminal is interpreting the bytes it receives. In ISO 8859-1, ô is encoded as 0xF4. In UTF-8, ô is encoded as 0xC3B4. The other characters are common to both encodings.
If you can, try to run your program in a debugger to see what's inside your 's' string after it is created. It is possible that it has the correct content, but the output is garbled after the System.out.println(s) call. In that case, there is probably a mismatch between what Java thinks the encoding of your output is and the character encoding of your terminal/console on Linux.
Basically, if it works on your local XP PC but not on Linux, and you are parsing the exact same file (i.e. you transferred it in a binary fashion between the boxes), then it probably has something to do with the System.out.println call. I don't know how you verify the output, but if you do it by connecting with a remote shell from the XP box, then there is the character set of the shell (and the client) to consider.
Additionally, what Zach Scrivena suggests is also true: you cannot assume that you can create strings from chunks of data in that way. Either use an InputStreamReader or read the complete data into an array first (obviously not going to work for a large file). However, since it does seem to work on XP, I would venture that this is probably not your problem in this specific case.