I have a simple text file which contains only one character, '≤'. Nothing else. The file is UTF-8 encoded.
When I read this file using the method Files.readAllLines(), the character is shown as a question mark, '?':
try (FileWriter fw = new FileWriter(new File(file, "f.txt"));
     PrintWriter writer = new PrintWriter(fw)) {
    List<String> lines = Files.readAllLines(deProp.toPath());
    for (String line : lines) {
        System.out.println(line);
        writer.write(line);
        writer.println();
    }
}
In my example I am trying to print the line to the console and write it to a new file. In both cases a question mark is shown instead.
Any suggestions on how to solve this?
Files.readAllLines(path) already uses UTF-8 (see the linked documentation). If you're using the Files.readAllLines(path, charset) variant, then pass UTF-8 as the charset, of course (for example by using StandardCharsets.UTF_8).
Assuming you're using either the short version or passing UTF-8, the error lies not with Java, but with your setup.
Either the file doesn't actually contain ≤ in UTF-8, or you're printing it to a place that can't display the symbol (for example because the font doesn't have it and substitutes a placeholder glyph, though that is more usually a box than a question mark), or you're sending the output somewhere that incorrectly assumes the data is not UTF-8.
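A quick way to tell display problems apart from decoding problems is to print the code points Java actually read instead of the glyphs. A minimal, self-contained sketch (it writes its own temporary file rather than using your path):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Diagnose {
    public static void main(String[] args) throws Exception {
        // Create a one-character UTF-8 file, as described in the question.
        Path path = Files.createTempFile("leq", ".txt");
        Files.write(path, "\u2264".getBytes(StandardCharsets.UTF_8)); // '≤'

        // readAllLines(Path) decodes as UTF-8.
        for (String line : Files.readAllLines(path)) {
            // Print code points numerically: '≤' should appear as U+2264,
            // no matter what the console font can or cannot render.
            line.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        }
        Files.delete(path);
    }
}
```

If this prints U+2264 but printing the string itself shows '?', the decoding is correct and the console (its charset or font) is the culprit.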
The static method of the Files class, i.e.
public static List<String> readAllLines(Path path) throws IOException
reads all the lines from a file. The bytes from the file are decoded into characters using the UTF-8 charset. Invoking this method is equivalent to evaluating the expression:
Files.readAllLines(path, StandardCharsets.UTF_8)
It may be that the file contains some garbage, or bytes that aren't valid UTF-8. Check the text inside the file manually :p
Related
I want to export a string (Chinese text) to a CSV file inside a zip file. Where do I need to set the encoding to UTF-8? Or what approach should I take (based on the code below) to display Chinese characters in the exported CSV file?
This is the code I currently have.
ByteArrayOutputStream out = new ByteArrayOutputStream();
ZipOutputStream zipOut = new ZipOutputStream(out, StandardCharsets.UTF_8);
try {
    ZipEntry entry = new ZipEntry("chinese.csv");
    zipOut.putNextEntry(entry);
    zipOut.write("类型".getBytes());
} catch (IOException e) {
    e.printStackTrace();
} finally {
    zipOut.close();
    out.close();
}
Instead of "类型", I get "类型" in the CSV file.
First, you definitely need to change zipOut.write("类型".getBytes()); to zipOut.write("类型".getBytes(StandardCharsets.UTF_8));
Also, when you open the resulting CSV file, your editor might not be aware that the content is encoded in UTF-8, so you may need to tell it. For instance, in Notepad you can use the "Save As" option and change the encoding to UTF-8. In other words, your issue might just be a display problem rather than an actual encoding problem.
There is an open-source Java library that has a utility which converts any String to a Unicode sequence and vice versa. This utility has helped me many times when diagnosing various charset-related issues. Here is a sample of what the code does:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.
Here is the Javadoc for the class StringUnicodeEncoderDecoder.
I tried your inputs and got this:
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("类型"));
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("类型"));
And the output was:
\u7c7b\u578b
\u00e7\u00b1\u00bb\u00e5\u017e\u2039
So it looks like you did lose the information; it is not just a display issue.
The getBytes() method is one culprit: without an explicit charset it uses the default character set of your machine. From the Java String documentation:
getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
getBytes(String charsetName)
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
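The difference between the two overloads is easy to observe directly. A small sketch (the sample string matches the one from the question):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "\u7c7b\u578b"; // "类型"

        // Platform-dependent: whatever Charset.defaultCharset() happens to be.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + Arrays.toString(s.getBytes()));

        // Deterministic: always the UTF-8 bytes e7 b1 bb e5 9e 8b.
        System.out.println("UTF-8: "
                + Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

On a machine whose default charset is already UTF-8 the two lines agree, which is exactly why this bug often goes unnoticed until the code runs elsewhere.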
Furthermore, as @Slaw pointed out, make sure that you compile (javac -encoding <encoding>) your files with the same encoding they are saved in:
-encoding Set the source file encoding name, such as EUC-JP and UTF-8. If -encoding is not specified, the platform default converter is used.
A call to closeEntry() was also missing in the OP, by the way. I stripped the snippet down to what I found necessary to achieve the desired functionality.
try (FileOutputStream fileOut = new FileOutputStream("out.zip");
     ZipOutputStream zipOut = new ZipOutputStream(fileOut)) {
    zipOut.putNextEntry(new ZipEntry("chinese.csv"));
    zipOut.write("类型".getBytes("UTF-8"));
    zipOut.closeEntry();
}
Finally, as @MichaelGantman pointed out, you might want to check what is in which encoding, using a tool like a hex editor, also to rule out that the editor you view the result file in displays correct UTF-8 in a wrong way. "类" in UTF-8 is (hex) e7 b1 bb; in UTF-16 (Java's internal string encoding) it is 7c 7b.
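If you don't have a hex editor handy, the same check can be done in a few lines of Java. A sketch (it inspects strings in memory rather than file contents, but the idea is the same):

```java
import java.nio.charset.StandardCharsets;

public class HexDump {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u7c7b"; // "类"
        System.out.println("UTF-8 : " + hex(s.getBytes(StandardCharsets.UTF_8)));    // e7 b1 bb
        System.out.println("UTF-16: " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 7c 7b
    }
}
```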
I have a field in a table that contains the string "Address Pippo p.2 °".
My program read this value and write it into txt file, but the output is:
"Address Pippo p.2 °" (the "Â" is unwanted)
I have a problem because the txt file is a positional file.
I open the file with these Java instructions:
FileWriter fw = new FileWriter(file, true);
pw = new PrintWriter(fw);
I want to write the string without the strange characters.
Any help for me?
Thanks in advance.
Try encoding the string into UTF-8 like this,
File file = new File("D://test.txt");
FileWriter fw = new FileWriter(file, true);
PrintWriter pw = new PrintWriter(fw);
String test = "Address Pippo p.2 °";
ByteBuffer byteBuffer = Charset.forName("UTF-8").encode(test);
test = StandardCharsets.UTF_8.decode(byteBuffer).toString();
pw.write(test);
pw.close();
Java uses Unicode. When you write text to a file, it gets encoded using a particular character encoding. If you don't specify it explicitly, it will use a "system default encoding" which is whatever is configured as default for your particular JVM instance. You need to know what encoding you've used to write the file. Then you need to use the same encoding to read and display the file content. The funny characters you are seeing are probably due to writing the file using UTF-8 and then trying to read and display it in e.g. Notepad using Windows-1252 ("ANSI") encoding.
Decide what encoding you want and stick to it for both reading and writing. To write using Windows-1252, use:
Writer w = new OutputStreamWriter(new FileOutputStream(file, true), "windows-1252");
And if you write in UTF-8, then tell Notepad that you want it to read the file in UTF-8. One way to do that is to write the character '\uFEFF' (Byte Order Mark) at the beginning of the file.
If you use UTF-8, be aware that non-ASCII characters will throw the subsequent bytes out of position. So if, for example, a telephone field must always start at byte position 200, then having a non-ASCII character in an address field before it will make the telephone field start at byte position 201 or 202. Using windows-1252 encoding you won't have this issue, but that encoding can't encode all Unicode characters.
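The byte-position shift is easy to quantify: String.length() counts chars, while the width of a field in the file is the encoded byte count. A sketch (the address value is taken from the question):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FieldWidth {
    public static void main(String[] args) {
        String value = "Address Pippo p.2 \u00b0"; // ends with the non-ASCII '°'

        // '°' is 1 char but 2 bytes in UTF-8, so every field after it shifts right.
        System.out.println(value.length() + " chars");
        System.out.println(value.getBytes(StandardCharsets.UTF_8).length + " UTF-8 bytes");

        // windows-1252 keeps 1 byte per char, so positions stay aligned.
        System.out.println(value.getBytes(Charset.forName("windows-1252")).length
                + " windows-1252 bytes");
    }
}
```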
I have a file to read, save, do something with its information, and then write it back to another file. The problem is that the original file contains some characters from Asian languages, like 坂本龍一, 東京事変 and メリー (I guess they're Chinese, Japanese and Korean). I can see them using Notepad++.
The problem is that when I read and write them via Java they get corrupted, and I see weird stuff in my output file like ???????? or Жанна БичевÑ?каÑ?
I think I got something wrong with the encoding, but I've no idea which one to use or how to use it.
Can someone help me? Here's my code:
String fileToRead = SONG_2M;
Scanner scanner = new Scanner(new File(fileToRead), "UTF-8");
while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    String[] songData = line.split("\t");
    if (/* something */) {
        // save the string in the map
    }
}
scanner.close();
saveFile("coded_artist_small2.txt");
}
public void saveFile(String fileToSave) throws FileNotFoundException, UnsupportedEncodingException {
    PrintWriter writer = new PrintWriter(fileToSave, "UTF-8");
    for (Entry<String, Integer> entry : artistsMap.entrySet()) {
        writer.println(entry.getKey() + DELIMITER + entry.getValue());
    }
    writer.close();
}
It is likely that your input file is not, in fact, encoded in UTF-8 (a variable-width Unicode encoding that uses between one and four bytes per character). For instance, the character 坂 you are seeing is Unicode 0x5742; if the two bytes 0x57 and 0x42 were instead interpreted as single-byte characters, they would display as W followed by B.
If you're unsure of your file's encoding, take a guess that it might be plain ASCII text: try removing the explicit encoding when you set up the Scanner (which falls back to the platform default charset), i.e. make the second line of your code
Scanner scanner = new Scanner(new File(fileToRead));
If, in fact, you know the file is unicode, there are different encodings. See this answer for a more comprehensive unicode reader - dealing with various unicode encodings.
For your output, you need to decide how you want the file encoded: some Unicode encoding (e.g. UTF-8), or ASCII.
This is my method
public void readFile3() throws IOException
{
    try
    {
        FileReader fr = new FileReader(Path3);
        BufferedReader br = new BufferedReader(fr);
        String s = br.readLine();
        int a = 1;
        while (a != 2)
        {
            s = br.readLine();
            a++;
        }
        Storage.add(s);
        br.close();
    }
    catch (IOException e)
    {
        System.out.println(e.getMessage());
    }
}
For some reason I am unable to read the file, which only contains this: "
Name
Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz "
When I debug the code, the String s is returned as "\ufffd\ufffdN a m e" and I have no clue where those extra characters are coming from. This is preventing me from properly reading the file.
\ufffd is the Unicode replacement character; it is used when a decoder encounters bytes that have no valid interpretation in the charset being used. I suppose you are on a Windows platform (or at least the file you read was created on Windows). Windows supports many formats for text files; the most common is ANSI, where each character is represented by its single-byte ANSI code.
But Windows can also use UTF-16 directly, where each character is represented by its Unicode code point as a 16-bit integer, so with 2 bytes per character. Those files use special markers (a Byte Order Mark, in Windows parlance) to say:
that the file is encoded with 2 (or even 4) bytes per character
whether the encoding is little- or big-endian
(Reference: Using Byte Order Marks on MSDN)
As you see N a m e after the first two replacement characters, and not Name, I suppose you have a UTF-16-encoded text file. Notepad can transparently edit those files (without even telling you the actual format), but other tools do have problems with them...
The excellent vim can read files with different encodings and convert between them.
If you want to use this kind of file directly in Java, you have to use the UTF-16 charset. From the Java SE 7 Javadoc on Charset: UTF-16 - Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
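The asker's exact symptom is easy to reproduce. A sketch (new String(bytes, charset) is used for the wrong-decoding step because, unlike Files.readAllLines, it substitutes \ufffd instead of throwing on malformed input):

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        // A Windows "Unicode" text file: little-endian BOM, then "Name" in UTF-16LE.
        byte[] content = {
            (byte) 0xFF, (byte) 0xFE,   // byte order mark
            0x4E, 0x00, 0x61, 0x00,     // N, a
            0x6D, 0x00, 0x65, 0x00      // m, e
        };

        // Decoded as UTF-8: two replacement chars for the BOM, then N\0a\0m\0e\0.
        System.out.println(new String(content, StandardCharsets.UTF_8));

        // Decoded as UTF-16: the BOM is consumed and selects the byte order.
        System.out.println(new String(content, StandardCharsets.UTF_16)); // Name
    }
}
```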
You must specify the encoding when reading the file; in your case it is probably UTF-16.
Reader reader = new InputStreamReader(new FileInputStream(fileName), "UTF-16");
BufferedReader br = new BufferedReader(reader);
Check the documentation for more details: InputStreamReader class.
Check to see if the file is .odt, .rtf, or something other than plain .txt; that may be what's causing the extra UTF-16 characters to appear. Also, make sure that (even if it is a .txt file) your file is saved in the encoding you expect, such as UTF-8.
Perhaps you have UTF-16 characters such as '®' in your document.
So I'm reading a plain text file in Java, and I'd like to identify which lines start with "abc". I did the following:
Charset charset = StandardCharsets.UTF_8;
BufferedReader br = Files.newBufferedReader(file.toAbsolutePath(), charset);
String line;
while ((line = br.readLine()) != null) {
    if (line.startsWith("abc")) {
        // Do something
    }
}
But if the first line of the file is "abcd", it won't match. By debugging I've found out that the first character is a 0 (non-printable character), and because of this it won't match. Why is that so? How could I robustly identify which lines start with "abc"?
EDIT: perhaps I should point out that I'm creating the file using notepad
Windows has a few problems with UTF-8, and as such it is a heavy user of the UTF-8 BOM (Byte Order Mark).
If my guess is correct, the first three bytes would then be (in hexadecimal): 0xef, 0xbb, 0xbf.
Given that, for instance, Excel creates UTF-8 CSV files with a BOM prefix, I wouldn't be surprised at all if Notepad did as well...
edit: not surprisingly, it seems this is the case: see here.
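One pragmatic workaround (a sketch, assuming a UTF-8 BOM is the only possible prefix) is to strip the BOM from the first line before matching:

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomSafeRead {
    private static final String UTF8_BOM = "\uFEFF"; // what the bytes ef bb bf decode to

    public static void main(String[] args) throws Exception {
        // Simulate a file saved by Notepad: UTF-8 with a leading BOM.
        Path path = Files.createTempFile("bom", ".txt");
        Files.write(path, "\uFEFFabcd\nxyz\n".getBytes(StandardCharsets.UTF_8));

        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                // The BOM survives decoding as U+FEFF, glued onto the first line.
                if (first && line.startsWith(UTF8_BOM)) {
                    line = line.substring(1); // strip it so startsWith("abc") works
                }
                first = false;
                System.out.println(line.startsWith("abc") + ": " + line);
            }
        }
        Files.delete(path);
    }
}
```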