Java UTF-8 non ASCII chars not supported on Windows - java

I am working on a tool that produces a zip containing some generated files. Some of my users on Windows 10 reported that when I add a string to a file within the zip, non-ASCII characters are replaced by "?".
This is really strange because it works perfectly on Linux (NixOS). Do you have any idea why?
fis = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
...
public static void addToZip(String zipFilePath, final InputStream fis, final ZipOutputStream zos)
throws IOException {
final ZipEntry zipEntry = new ZipEntry(zipFilePath);
zipEntry.setLastModifiedTime(FileTime.fromMillis(0L));
zos.putNextEntry(zipEntry);
final byte[] bytes = new byte[1024];
int length;
while ((length = fis.read(bytes)) >= 0)
zos.write(bytes, 0, length);
zos.closeEntry();
fis.close();
if (!(Settings.PROTECTION.toBool()))
return;
zipEntry.setCrc(bytes.length);
zipEntry.setSize(new BigInteger(bytes).mod(BigInteger.valueOf(Long.MAX_VALUE)).longValue());
}
...
final ZipOutputStream zos = new ZipOutputStream(fos, StandardCharsets.UTF_8);

Since your description and your code are incomplete, I'm making a few assumptions:
Some of the files in zip archive are text files (or similar such as CSV).
The zip archives are created on a Linux system (or a single system under your control).
The zip archives are then sent to your users and used on different operating systems.
If so, the problem is not related to the zip archive. Instead, it's a general problem of text files. It would also occur if you just sent a single text file.
The cause of the problem is that text files do not contain any reliable information about their encoding. On your side, the text file is created using UTF-8 encoding. On the users' side, different operating systems and different tools are used to view or process the text file. Some of these tools might make an effort to determine the encoding and guess it correctly. But if they just use the operating system's default encoding, users on Windows will get the wrong encoding, as Windows defaults to Windows-1252 and similar encodings.
When a UTF-8 encoded file is processed as Windows-1252, non-ASCII characters come out wrong: depending on the tool, they appear as garbled sequences such as "Ã©" or as "?".
If your users view the text files with text editors, ask them to set the text editor to UTF-8. If the text files are processed with custom software, ask them to modify the software such that it explicitly uses UTF-8.
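A minimal sketch of the mismatch described in this answer, assuming the writer uses UTF-8 and the reader falls back to Windows-1252. The class name EncodingMismatch and the sample text are illustrative and not taken from the original tool.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingMismatch {
    /** Decodes the given bytes with an explicitly named charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent; the escape keeps this source pure ASCII.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Wrong: what a tool does when it assumes the Windows default code page.
        String wrong = decode(utf8, Charset.forName("windows-1252"));
        // Right: the reader names the same charset the writer used.
        String right = decode(utf8, StandardCharsets.UTF_8);

        // The two-byte UTF-8 sequence turns into two separate characters when misread.
        System.out.println(wrong.length() + " chars vs " + right.length() + " chars");
    }
}
```

The bytes are identical in both cases; only the charset chosen by the reader differs, which is why the fix belongs on the reading side.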

Related

Java, unzip folder with German characters in filenames

I'm trying to unzip a folder whose file names contain German characters, for example Aufhänge.
I know that Java 7 uses UTF-8 by default here, and I think "ä" is a valid UTF-8 character.
Here is my code snippet
public static void main(String[] args) throws IOException {
ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(ZIP_PATH), StandardCharsets.UTF_8);
ZipEntry zipEntry;
while ((zipEntry = zipInputStream.getNextEntry()) != null) {
System.out.println(zipEntry.getName());
}
}
This is an error that I get: java.lang.IllegalArgumentException: MALFORMED
It works with Charset.forName("Cp437"), but it doesn't work with StandardCharsets.UTF_8
You don't mention your operating system, nor how you created the zip file, but I managed to recreate your problem anyway, using 7-Zip on Windows 10:
Create a simple text file with some trivial content (e.g. nothing but the three characters "abc").
Save the file as D:\Temp\Aufhänge.txt. Note the umlaut in the file name.
Locate that file in Windows File Explorer.
Select the file and right click. From the context menu select 7-Zip > Add to "Aufhänge.zip" to create Aufhänge.zip.
Then, in NetBeans run the following code to unzip the file you just created:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
public class GermanZip {
static String ZIP_PATH = "D:\\Temp\\Aufhänge.zip";
public static void main(String[] args) throws FileNotFoundException, IOException {
ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(ZIP_PATH), Charset.forName("UTF-8"));
ZipEntry zipEntry;
while ((zipEntry = zipInputStream.getNextEntry()) != null) {
System.out.println(zipEntry.getName());
}
}
}
As you pointed out, the code throws java.lang.IllegalArgumentException: MALFORMED when evaluating the loop condition (zipEntry = zipInputStream.getNextEntry()) != null.
The problem arises because by default 7-Zip encodes the names of the files within the zip file using Cp437, as noted in this comment from 7-Zip:
Default encoding is OEM (DOS) encoding. It's for compatibility with
old zip software.
That's why the unzip works when using Charset.forName("Cp437") instead of Charset.forName("UTF-8").
If you want to unzip using Charset.forName("UTF-8") then you have to force 7-Zip to encode the filenames within the zip in UTF-8. To do this specify the cu parameter when running 7-Zip, as noted in the linked comment:
In Windows File Explorer select the file and right click.
From the context menu select 7-Zip > Add to Archive...
In the Add to Archive dialog specify cu in the Parameters field.
Having stored the zipped filenames in UTF-8 format, you can then replace Charset.forName("Cp437") with Charset.forName("UTF-8") in your code, and no exception will be thrown when unzipping.
This answer is specific to Windows 10 and 7-Zip, but the general principle should apply in any environment: if specifying an encoding of UTF-8 for your ZipInputStream be certain that the filenames within the zip file really are encoded using UTF-8. You can easily verify this by opening the zip file in a binary editor and searching for the names of the zipped files.
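The principle can be reproduced without 7-Zip. The following is a sketch under the assumption that a zip written by Java's own ZipOutputStream with Cp437 names behaves like the 7-Zip archive above; the class and method names are hypothetical. Checked I/O exceptions are wrapped so the helpers stay easy to call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipNameCharsets {
    /** Builds an in-memory zip with one entry whose name is encoded with nameCharset. */
    static byte[] zipWithEntry(String entryName, Charset nameCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes, nameCharset)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write("abc".getBytes(StandardCharsets.US_ASCII));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Lists entry names, decoding them with nameCharset. */
    static List<String> entryNames(byte[] zip, Charset nameCharset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip), nameCharset)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("Cp437");
        // "\u00e4" is a-umlaut; it is a single byte (0x84) in Cp437.
        byte[] zip = zipWithEntry("Aufh\u00e4nge.txt", cp437);
        System.out.println(entryNames(zip, cp437).size() + " entry read with Cp437");
        try {
            entryNames(zip, StandardCharsets.UTF_8);
        } catch (RuntimeException e) {
            // 0x84 on its own is not valid UTF-8, so decoding the name fails.
            System.out.println("Reading with UTF-8 failed: " + e);
        }
    }
}
```

The exact exception type and message for the UTF-8 case may vary between JDK versions, so the sketch catches RuntimeException rather than relying on the MALFORMED message.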
Update based on OP's comment/question below:
Unfortunately the .ZIP File Format Specification does not currently provide a way to store the encoding used for zipped file names apart from one exception, as described in "APPENDIX D - Language Encoding (EFS)":
D.2 If general purpose bit 11 is unset, the file name and comment
SHOULD conform to the original ZIP character encoding. If general
purpose bit 11 is set, the filename and comment MUST support The
Unicode Standard, Version 4.1.0 or greater using the character
encoding form defined by the UTF-8 storage specification. The
Unicode Standard is published by the The Unicode Consortium
(www.unicode.org). UTF-8 encoded data stored within ZIP files is
expected to not include a byte order mark (BOM).
So in your code, for each zipped file, first check whether bit 11 of the general purpose bit flag is set. If it is, then you can be certain that the name of that zipped file is encoded using UTF-8. Otherwise the encoding is whatever was used when the zipped file was created. That is Cp437 by default on Windows, but if you are running on Windows and processing a zip file created on Linux I don't think there is an easy way of determining the encoding(s) used.
Unfortunately ZipEntry does not provide a method to access the general purpose bit flag field of a zipped file, so you would need to process the zip file at the byte level to do that.
To add a further complication, "encoding" in this context relates to the encoding used for each zipped filename rather than for the zip file itself. One zipped file name could be encoded in UTF-8, another zipped file name could have been added using Cp437, etc.
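One way to do that byte-level check is to walk the central directory and read each entry's general purpose bit flag. This is a rough sketch, not a full zip parser: it assumes a small, well-formed, non-ZIP64 archive held entirely in memory, and the names ZipNameFlags, utf8Flags and zipWithEntry are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipNameFlags {
    private static final int END_OF_CENTRAL_DIR_SIG = 0x06054b50;
    private static final int CENTRAL_HEADER_SIG = 0x02014b50;
    private static final int UTF8_NAME_FLAG = 1 << 11; // general purpose bit 11

    /** Returns, per central directory entry in order, whether bit 11 is set. */
    static List<Boolean> utf8Flags(byte[] zip) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        // The end-of-central-directory record is at least 22 bytes; scan backwards for it.
        int eocd = -1;
        for (int i = zip.length - 22; i >= 0; i--) {
            if (buf.getInt(i) == END_OF_CENTRAL_DIR_SIG) {
                eocd = i;
                break;
            }
        }
        if (eocd < 0) {
            throw new IllegalArgumentException("no end-of-central-directory record found");
        }
        int entryCount = buf.getShort(eocd + 10) & 0xFFFF; // total number of entries
        int pos = buf.getInt(eocd + 16);                   // offset of the central directory
        List<Boolean> flags = new ArrayList<>();
        for (int n = 0; n < entryCount; n++) {
            if (buf.getInt(pos) != CENTRAL_HEADER_SIG) {
                throw new IllegalArgumentException("bad central directory header at " + pos);
            }
            int flag = buf.getShort(pos + 8) & 0xFFFF;     // general purpose bit flag
            flags.add((flag & UTF8_NAME_FLAG) != 0);
            int nameLen = buf.getShort(pos + 28) & 0xFFFF;
            int extraLen = buf.getShort(pos + 30) & 0xFFFF;
            int commentLen = buf.getShort(pos + 32) & 0xFFFF;
            pos += 46 + nameLen + extraLen + commentLen;   // fixed part is 46 bytes
        }
        return flags;
    }

    /** Test helper: builds a one-entry zip whose name is encoded with nameCharset. */
    static byte[] zipWithEntry(String entryName, Charset nameCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes, nameCharset)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(utf8Flags(zipWithEntry("Aufh\u00e4nge.txt", StandardCharsets.UTF_8)));
        System.out.println(utf8Flags(zipWithEntry("Aufh\u00e4nge.txt", Charset.forName("Cp437"))));
    }
}
```

Because the flag is read per entry, the result is a list rather than a single value, which matches the point that one archive can mix encodings.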

File changes permission after writing block of byte

I am having an issue when trying to read from and write to the same file using RandomAccessFile.
I read a block of 16 bytes from a file and write it to the same file at a given position (e.g. the 256th).
The problem is on the ra.write(b) line. When that line is executed, I get a message in the text editor Kate (I am using Manjaro Linux) saying:
The file /home/mite/IdeaProjects/IspitJuni2015/dat.txt was opened with UTF-8 encoding but contained invalid characters.
It is set to read-only mode, as saving might destroy its content.
Either reopen the file with the correct encoding chosen or enable the read-write mode again in the tools menu to be able to edit it.
and it turns on read-only mode.
Also I tried manually to uncheck the read-only permission in Kate but it's not working either. What seems to be the problem?
public static byte[] read(long i) throws IOException{
File in = new File("./dat.txt");
RandomAccessFile ra = new RandomAccessFile(in,"rw");
byte[] readObj= new byte[16];
if (i>in.length()/16)
{
return null;
}
ra.seek(i*16);
ra.read(readObj);
ra.close();
return readObj;
}
public static void write(long i, byte[] obj) throws IOException{
File out=new File("./dat.txt");
RandomAccessFile ra=new RandomAccessFile(out,"rw");
if (!out.exists())
{
out.createNewFile();
}
long size=out.length();
if (i*16>size)
{
ra.seek(out.length());
for (long j=size;j<i*16;j+=16)
{
byte[] b=new byte[16];
ra.write(b);
}
}
ra.seek((i)*16);
System.out.println(new String(obj));
ra.write(obj);
ra.close();
}
public static void main(String[] args) throws IOException{
write(35,read(4));
}
I think you misunderstand what your editor tells you.
First of all, not every possible sequence of bytes is a valid UTF-8 string, see for example "UTF-8 decoder capability and stress test". So when you copy 16 bytes from one place of UTF-8 file to another you might get a file which no longer contains a valid UTF-8 text.
I suspect that you have the same file open in Kate to see the results of your edits. The editor is telling you that the file is no longer valid UTF-8, so it does not know how to handle it correctly. To prevent accidental damage to data that now looks binary (not text) to it, the editor refuses to save anything from its UI back to that file. This does not change any permission at the file-system level, and other (dumber) editors will probably not warn you about such possible corruption.
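To see how copying a fixed-size block can break UTF-8, here is a small sketch that validates byte arrays with a strict decoder. The class name Utf8Validity and the sample string are illustrative assumptions; the cut positions are chosen to split a two-byte character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Validity {
    /** Returns true only if the bytes form a complete, well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "\u00ef" (i with diaeresis) occupies two bytes, C3 AF, at indexes 2 and 3.
        byte[] text = "na\u00efve".getBytes(StandardCharsets.UTF_8);

        byte[] cutAfterLeadByte = Arrays.copyOfRange(text, 0, 3); // ends with C3
        byte[] startsMidCharacter = Arrays.copyOfRange(text, 3, 6); // starts with AF

        System.out.println(isValidUtf8(text));
        System.out.println(isValidUtf8(cutAfterLeadByte));
        System.out.println(isValidUtf8(startsMidCharacter));
    }
}
```

A 16-byte block boundary can fall between the two bytes of such a character in exactly the same way, which would explain the warning from Kate.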
Thank you for your replies. I figured out the problem.
Text editors sometimes add one extra byte at the end of the file, which is not supported as a byte in Java. Usually this is the EOF byte, and it is treated as UTF-8, whereas Java only accepts reading and writing ASCII bytes unless you go through the writeUTF() method.
This byte is also invisible in text editors, which is why I wrote this post.
It took me two days to find the issue, so if someone gets stuck here, keep the EOF byte in mind.

can not save utf8 file in windows server with java

I have a simple Java application that saves some strings with UTF-8 encoding.
But when I open that file with Notepad and choose Save As, it shows the encoding as ANSI. I don't know where the problem is.
The code that saves the file is:
File fileDir = new File("c:\\Sample.txt");
Writer out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(fileDir), "UTF8"));
out.append("kodehelp UTF-8").append("\r\n");
out.append("??? UTF-8").append("\r\n");
out.append("???? UTF-8").append("\r\n");
out.flush();
out.close();
The characters you are writing to the file, as they appear in the code snippet, are in the basic ASCII subset of UTF-8. Notepad is likely auto-detecting the format and, seeing nothing outside the ASCII range, decides the file is ANSI.
If you want to force a different decision, place characters such as 字 or õ which are well out of the ASCII range.
It is possible that the ??? strings in your example were intended to be non-ASCII text. If so, make sure your IDE and/or build tool recognizes the source files as UTF-8, and that the files are indeed UTF-8 encoded. If you provide more information about your build system, then we can help further.
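The reason a detector cannot tell the difference is that ASCII-only text produces the same bytes in both encodings. A short sketch, assuming "ANSI" means Windows-1252 on the machine in question (it depends on the system locale); the class name AsciiSubset is illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AsciiSubset {
    /** True if the text encodes to identical bytes in UTF-8 and Windows-1252. */
    static boolean sameBytesInBothEncodings(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] ansi = text.getBytes(Charset.forName("windows-1252"));
        return Arrays.equals(utf8, ansi);
    }

    public static void main(String[] args) {
        // Pure ASCII: nothing in the file distinguishes UTF-8 from ANSI.
        System.out.println(sameBytesInBothEncodings("kodehelp UTF-8"));
        // "\u00f5" (o with tilde) is C3 B5 in UTF-8 but F5 in Windows-1252.
        System.out.println(sameBytesInBothEncodings("\u00f5 UTF-8"));
        // "\u5b57" (a CJK character) is three bytes in UTF-8 and unmappable in Windows-1252.
        System.out.println(sameBytesInBothEncodings("\u5b57 UTF-8"));
    }
}
```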

Commons Net FTPClient retrieved file encoding issue

I'm retrieving a file from a FTP Server. The file is encoded as UTF-8
ftpClient.connect(props.getFtpHost(), props.getFtpPort());
ftpClient.login(props.getUsername(), props.getPassword());
ftpClient.setFileType(FTP.BINARY_FILE_TYPE);
inputStream = ftpClient.retrieveFileStream(fileNameBuilder
.toString());
And then somewhere else I'm reading the input stream
bufferedReader = new BufferedReader(new InputStreamReader(
inputStream, "UTF-8"));
But the file is not getting read as UTF-8 Encoded!
I tried ftpClient.setAutodetectUTF8(true); but still doesn't work.
Any ideas?
EDIT:
For example a row in the original file is
...00248090041KENAN SARÐIN 00000000015.993FAC...
After downloading it through FTPClient, I parse it and load in a java object, one of the fields of the java object is name, which for this row is read as "KENAN SAR�IN"
I tried dumping to disk directly:
File file = new File("D:/testencoding/downloaded-file.txt");
FileOutputStream fop = new FileOutputStream(file);
ftpClient.retrieveFile(fileName, fop);
if (!file.exists()) {
file.createNewFile();
}
I compared the MD5 checksums of the two files (the one on the FTP server and the one dumped to disk), and they're the same.
I would separate out the problems first: dump the file to disk, and compare it with the original. If it's the same as the original, the problem has nothing to do with UTF-8. The FTP code looks okay though, and if you're saying you want the raw binary data, I'd expect it not to mess with anything.
If the file is the same after transfer as before, then the problem has nothing to do with FTP. You say "the file is not getting read as UTF-8 Encoded" but it's not clear what you mean. How certain are you that it's UTF-8 text to start with? If you could edit your question with the binary data, how it's being read as text, and how you'd expect it to be read as text, that would really help.
Try to download the file content as bytes and not as characters using InputStream and OutputStream instead of InputStreamReader. This way you are sure that the file is not changed during transfer.
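Since the checksums match, one way to test whether the file really is UTF-8 is to decode the same bytes with several charsets and compare the results. The sketch below rests on an unconfirmed assumption: that the problematic character is stored as the single byte 0xD0, which would be consistent with it showing as "Ð" in a Latin-1 viewer and as "�" when read as UTF-8. The class name and the choice of ISO-8859-9 (Turkish) as a candidate are guesses for illustration.

```java
import java.nio.charset.Charset;

final class DecodeCandidates {
    /** Decodes the bytes with the named charset; malformed input becomes U+FFFD. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumed raw bytes for part of the name: "SAR", one byte 0xD0, "IN".
        byte[] name = {'S', 'A', 'R', (byte) 0xD0, 'I', 'N'};

        for (String candidate : new String[] {"UTF-8", "ISO-8859-1", "ISO-8859-9"}) {
            String decoded = decode(name, candidate);
            // Print code points instead of glyphs so the console encoding cannot interfere.
            StringBuilder codePoints = new StringBuilder();
            decoded.codePoints().forEach(cp -> codePoints.append(String.format("U+%04X ", cp)));
            System.out.println(candidate + ": " + codePoints.toString().trim());
        }
    }
}
```

If the real file behaves like this, the data was never UTF-8 in the first place, and the fix is to pass the file's actual charset to InputStreamReader.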

How can i read a Russian file in Java?

I tried specifying UTF-8 for this, but it didn't work. How should I read a Russian file in Java?
FileInputStream fstream1 = new FileInputStream("russian.txt");
DataInputStream in = new DataInputStream(fstream1);
BufferedReader br = new BufferedReader(new InputStreamReader(in,"UTF-8"));
If the file is from a Windows PC, try either "windows-1251" or "Cp1251" for the charset name.
If the file is somehow in the MS-DOS encoding, try using "Cp866".
Both of these are single-byte encodings and changing the file type to UTF-8 (which is multibyte) does nothing.
If all else fails, open the file in a hex editor and paste a few lines of the hex dump into your question. Then we can identify the encoding.
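A sketch of both suggestions: reading with an explicit Cyrillic charset, and producing a hex dump without a hex editor. The sample word and the class name CyrillicDecoding are assumptions; in real use the InputStream would come from a FileInputStream on russian.txt.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class CyrillicDecoding {
    /** Reads the first line of the stream using the named charset. */
    static String firstLine(InputStream in, String charsetName) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, Charset.forName(charsetName)))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Formats bytes as space-separated hex pairs, similar to a hex editor view. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A Russian greeting, written with escapes so this source stays pure ASCII.
        String word = "\u041f\u0440\u0438\u0432\u0435\u0442";
        byte[] stored = word.getBytes(Charset.forName("windows-1251"));

        System.out.println(hex(stored));
        for (String candidate : new String[] {"UTF-8", "windows-1251", "Cp866"}) {
            String line = firstLine(new ByteArrayInputStream(stored), candidate);
            System.out.println(candidate + " matches: " + word.equals(line));
        }
    }
}
```

Single-byte Cyrillic text shows up in the dump as one byte per letter in the 0x80 to 0xFF range, whereas UTF-8 Cyrillic uses two bytes per letter starting with 0xD0 or 0xD1.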
As others mentioned you need to know how the file is encoded. A simple check is to (ab)use Firefox as an encoding detector: answer to similar question
If this is a display problem, it depends what you mean by "reads": in the console, in some window? See also How can I make a String with cyrillic characters display correctly?
