Does anyone know of a library or a way in Java to generate a tar archive with file names in the proper Windows national codepage (for example Cp1250)?
I tried the Java Tar library; example code:
final TarEntry entry = new TarEntry( files[i] );
String filename = files[i].getPath().replaceAll( baseDir, "" );
entry.setName( new String( filename.getBytes(), "Cp1250" ) );
out.putNextEntry( entry );
...
It doesn't work: national characters are broken when I extract the tar on Windows.
I've also found a strange thing: under Linux, Polish national characters are shown correctly only when I use ISO-8859-1:
entry.setName( new String( filename.getBytes(), "ISO-8859-1" ) );
despite the fact that the proper Polish codepage is ISO-8859-2, which doesn't work either.
I've also tried Cp852 for Windows, with no effect.
I know the limitations of tar format, but changing it is not an option.
Thanks for any suggestions.
Officially, TAR doesn't support non-ASCII characters in headers. However, I was able to use UTF-8-encoded filenames on Linux.
You could try this:
String filename = files[i].getName();
byte[] bytes = filename.getBytes("Cp1250");
entry.setName(new String(bytes, "ISO-8859-1"));
out.putNextEntry( entry );
This at least preserves the Cp1250 bytes in the TAR headers, because ISO-8859-1 maps bytes to characters one-to-one.
tar doesn't allow non-ASCII values in its headers. If you use a different encoding anyway, the result is up to whatever the target platform decides to do with those byte values. It sounds like your target platform's tar program is interpreting the bytes as ISO-8859-1, which is why that 'works'.
Have a look at extended attributes: http://www.freebsd.org/cgi/man.cgi?query=tar&sektion=5&manpath=FreeBSD+8-current
I am no expert here, but this seems to be the only official way to put non-ASCII values in a tar file header.
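For completeness, here is a minimal sketch of that official route using Apache Commons Compress, which can emit PAX extended headers carrying UTF-8 names (the Polish file name is a made-up example, and commons-compress is assumed to be on the classpath):
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

public class PaxTarExample {
    public static void main(String[] args) throws IOException {
        File file = new File("zażółć.txt"); // hypothetical file with Polish characters
        try (TarArchiveOutputStream out = new TarArchiveOutputStream(
                new FileOutputStream("archive.tar"), "UTF-8")) {
            // Store non-ASCII names in PAX extended headers as UTF-8
            out.setAddPaxHeadersForNonAsciiNames(true);
            TarArchiveEntry entry = new TarArchiveEntry(file);
            out.putArchiveEntry(entry);
            Files.copy(file.toPath(), out);
            out.closeArchiveEntry();
        }
    }
}
GNU tar and most modern extractors understand PAX headers, which sidesteps the codepage question entirely.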
I want to export a string (Chinese text) to a CSV file inside a zip file. Where do I need to set the encoding to UTF-8? Or what approach should I take (based on the code below) to display Chinese characters in the exported CSV file?
This is the code I currently have.
ByteArrayOutputStream out = new ByteArrayOutputStream();
ZipOutputStream zipOut = new ZipOutputStream(out, StandardCharsets.UTF_8);
try {
ZipEntry entry = new ZipEntry("chinese.csv");
zipOut.putNextEntry(entry);
zipOut.write("类型".getBytes());
} catch (IOException e) {
e.printStackTrace();
} finally {
zipOut.close();
out.close();
}
Instead of "类型", I get "ç±»åž‹" in the CSV file.
First, you definitely need to change zipOut.write("类型".getBytes()); to zipOut.write("类型".getBytes(StandardCharsets.UTF_8));
Also, when you open your resulting CSV file, the editor might not be aware that the content is encoded in UTF-8, so you may need to tell your editor that it is UTF-8. For instance, in Notepad you can use the "Save As" option and change the encoding to UTF-8. In other words, your issue might just be a display problem rather than an actual encoding problem.
There is an open-source Java library with a utility that converts any String to a Unicode sequence and vice versa. This utility has helped me many times when diagnosing various charset-related issues. Here is a sample of what the code does:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.
Here is the Javadoc for the class StringUnicodeEncoderDecoder.
I tried your inputs and got this:
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("类型"));
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("ç±»åž‹"));
And the output was:
\u7c7b\u578b
\u00e7\u00b1\u00bb\u00e5\u017e\u2039
So it looks like you did lose the information, and it is not just a display issue.
The getBytes() method is one culprit: without an explicit charset it uses the default character set of your machine. From the Java String documentation:
getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
getBytes(String charsetName)
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
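To see the difference concretely, a quick sketch (the first result depends on the machine's default charset, which is exactly the problem):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

System.out.println(Charset.defaultCharset()); // platform-dependent, e.g. windows-1252
byte[] platformBytes = "类型".getBytes(); // default charset; on a windows-1252 machine this is just "??" (2 bytes)
byte[] utf8Bytes = "类型".getBytes(StandardCharsets.UTF_8); // always 6 bytes: e7 b1 bb e5 9e 8b
System.out.println(platformBytes.length + " vs " + utf8Bytes.length);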
Furthermore, as @Slaw pointed out, make sure that you compile (javac -encoding <encoding>) your files with the same encoding the files are saved in:
-encoding Set the source file encoding name, such as EUC-JP and UTF-8. If -encoding is not specified, the platform default converter is used.
A call to closeEntry() was also missing in the OP, by the way. I stripped the snippet down to what I found necessary to achieve the desired functionality:
try (FileOutputStream fileOut = new FileOutputStream("out.zip");
ZipOutputStream zipOut = new ZipOutputStream(fileOut)) {
zipOut.putNextEntry(new ZipEntry("chinese.csv"));
zipOut.write("类型".getBytes("UTF-8"));
zipOut.closeEntry();
}
Finally, as @MichaelGantman pointed out, you might want to check what is in which encoding using a tool like a hex editor, also to rule out that the editor you view the result file in is displaying correct UTF-8 in a wrong way. "类" in UTF-8 is (hex) e7 b1 bb; in UTF-16 (Java's internal representation) it is 7c 7b.
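If no hex editor is at hand, the same check can be done in a few lines of Java:
import java.nio.charset.StandardCharsets;

// Print the UTF-8 bytes of a string in hex to verify what actually ends up in the file
for (byte b : "类".getBytes(StandardCharsets.UTF_8)) {
    System.out.printf("%02x ", b & 0xff); // prints: e7 b1 bb
}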
I have a text file with a strange encoding, "UCS-2 Little Endian", whose contents I want to read using Java.
The file's contents appear fine in Notepad++, but when I read the file using the following code, just garbage is printed to the console:
String filePath = "c:\\strange_file_encoding.txt";
BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( filePath ), "UTF8" ) );
String line = "";
while ( ( line = reader.readLine() ) != null ) {
System.out.println( line ); // Prints garbage characters
}
The main point is that the user selects the file to read, so it can be in any encoding. Since I can't detect the file's encoding, I decode it using "UTF8", but as the above example shows, it fails to read it correctly.
Is there a way to read such strange files correctly? Or can I at least detect whether my code will fail to read one correctly?
You are using UTF-8 as the encoding in the InputStreamReader constructor, so it will try to interpret the bytes as UTF-8 instead of UCS-2 LE. Here is the documentation: Charset
According to it, I suppose you need to use UTF-16LE.
Here is more info on the supported character sets and their Java names:
Supported Encodings
You're providing the wrong encoding to InputStreamReader. Have you tried using UTF-16LE instead of UTF8?
BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( filePath ), "UTF-16LE" ) );
According to Charset:
UTF-16LE: Sixteen-bit UCS Transformation Format, little-endian byte order
You cannot use UTF-8 encoding for all files, especially if you do not know which file encoding to expect. Use a library that can detect the file encoding before you read the file, for example juniversalchardet or jChardet.
For more info see Java : How to determine the correct charset encoding of a stream
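For example, a minimal detection sketch with juniversalchardet (the jar is assumed to be on the classpath; the file name is from the question):
import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectCharset {
    public static void main(String[] args) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[4096];
        try (FileInputStream fis = new FileInputStream("strange_file_encoding.txt")) {
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        detector.dataEnd();
        // May return null if the detector is not confident
        String encoding = detector.getDetectedCharset();
        System.out.println(encoding != null ? encoding : "unknown, falling back to a default");
    }
}
A UCS-2 Little Endian file normally starts with the BOM bytes FF FE, so even inspecting the first two bytes by hand would catch this particular case.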
This question concerns a Tomcat 7 web application, which is connected to a MySQL (5.5.16) database.
When I open a zip file that has filenames encoded in the windows-1252 charset, the characters seem to be interpreted correctly by Java:
ZipFile zf = new ZipFile( zipFile, Charset.forName( "windows-1252" ) );
Enumeration<? extends ZipEntry> entries = zf.entries();
while( entries.hasMoreElements() ) {
ZipEntry ze = entries.nextElement();
if( ! ze.isDirectory() ) {
String name = ze.getName();
System.out.println( name ); //prints correct filenames, e.g. café.pdf
}
}
Omitting the Charset object in the ZipFile constructor would cause an exception.
The filenames in the zip file are printed correctly to standard output, including diacritics.
But, when I subsequently try to store the filename in a database, the e-acute is replaced with a question mark (as seen with the mysql console client).
I had no problems inserting special characters from the web application into MySQL before.
When I execute an INSERT with é in Java source code:
statement.executeUpdate( "insert into files (filename) values ('café.pdf')" );
the é shows up well in MySQL.
Also, my log file shows a comma instead of é: caf‚.pdf
Does anyone know what could be happening here?
As you mentioned in the comments section, the incoming data (the zipped files' names) can be in different character sets. This is going to be an issue for you, because you are using the MySQL+JDBC link, which imposes a lot of limitations (like one character set per column in MySQL, and only one character set per connection in JDBC).
Therefore, I would recommend that you switch the character sets on the MySQL side to UTF8 (look for variables like character_set_server and character_set_connection), because that will enable you to transfer and store almost any character you may receive. See here for how to properly set up your MySQL server. Note that setting up the MySQL server might be challenging, so don't hesitate to PM me for additional help. JDBC will automatically adjust to the server's character_set_connection variable, so you don't have to change anything in your Java application.
The one thing you WILL have to change in your application is that you will have to convert all incoming data to UTF8 in order to send it to and store it on the MySQL server.
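As an illustration only (the URL parameters follow Connector/J conventions; database name, table and credentials are placeholders), forcing UTF-8 on the JDBC side could look like this:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class Utf8Insert {
    public static void main(String[] args) throws SQLException {
        // useUnicode/characterEncoding make Connector/J talk UTF-8 to the server
        String url = "jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=UTF-8";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "insert into files (filename) values (?)")) {
            ps.setString(1, "café.pdf"); // sent to the server as UTF-8
            ps.executeUpdate();
        }
    }
}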
Good luck.
In the table where you store the data, make sure you use a character set and collation that can represent the e-acute character.
The issue is resolved. This post suggested that the encoding of filenames in a zip file might not be windows-1252 but rather IBM437. Changing the Charset from:
ZipFile zf = new ZipFile( zipFile, Charset.forName( "windows-1252" ) );
to
ZipFile zf = new ZipFile( zipFile, Charset.forName( "IBM437" ) );
gave the desired result: when saving the acquired filename in MySQL, it was stored correctly with é.
What went wrong?
Printing out the filenames contained in the zip file to standard output with
System.out.println( name );
made me wrongly assume that the filenames in the zip file were being interpreted correctly: when I used the windows-1252 encoding to open the zip file, the filename was printed to standard output nicely, with the diacritic: café.pdf. Using other character encodings, different symbols appeared instead of the é.
But when I printed the Unicode value of the é character with the help of this answer, I was able to see that when opening the zip file with the windows-1252 encoding, the actual Unicode value was NOT \u00e9 (latin small letter e with acute) but \u201a (single low-9 quotation mark). When I opened the ZipFile with the IBM437 charset, the correct Unicode value DID appear.
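For reference, the check itself is tiny (my own sketch, not the exact code from that answer):
// Print each character's Unicode code point, so a wrong decoding
// is visible no matter what encoding the console itself uses
String name = ze.getName();
for (char c : name.toCharArray()) {
    System.out.printf("\\u%04x%n", (int) c); // é prints \u00e9; the mis-decode prints \u201a
}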
Of course, when printing a String to standard output with a PrintStream, the PrintStream is also associated with a certain character encoding. From the PrintStream Javadoc:
All characters printed by a PrintStream are converted into bytes using the platform's default character encoding.
I am working on Windows XP.
When I created a new PrintStream
out = new PrintStream( System.out, true, "IBM437" );
everything made sense: opening the zip file with the IBM437 character encoding and printing with the new PrintStream, é was displayed correctly.
There Ain't No Such Thing As Plain Text.
ElasticSearch is a search server which accepts data only in UTF-8.
When I try to give ElasticSearch the following text
"Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees"
through my Java application (basically, my Java application takes this info from a webpage and gives it to ElasticSearch), ES complains that it can't understand £ and it fails. So I filter the text through the code below:
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
Here £ is converted to �.
But when I then copy the text to a file in my home directory using bash, it goes in fine. Any pointers will help.
You have ISO-8859-1 octets in bytes, which you then tell String to decode as if they were UTF-8. When it does that, it doesn't recognize the illegal 0xA3 sequence and replaces it with the substitution character.
To fix this, you have to construct the string with the encoding the bytes are actually in, then convert it to the encoding that you want. See How do I convert between ISO-8859-1 and UTF-8 in Java?.
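A minimal sketch of that round trip, assuming the incoming bytes really are ISO-8859-1 (rawBytes is a placeholder for your input):
import java.nio.charset.StandardCharsets;

// rawBytes: the incoming ISO-8859-1 data (placeholder)
// Decode with the charset the bytes are actually in...
String s = new String(rawBytes, StandardCharsets.ISO_8859_1);
// ...then encode with the charset you actually want downstream
byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8); // £ (0xA3) becomes 0xC2 0xA3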
UTF-8 is easier than one thinks. In a String everything is Unicode characters.
Bytes-to-string conversion is done as follows.
(Note: Cp1252, or Windows-1252, is the Windows Latin-1 extension of ISO-8859-1; better to use that one.)
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream(file), "Cp1252"));
PrintWriter out = new PrintWriter(
new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
String s = "20 \u00A3"; // Escaping
To see why Cp1252 is more suitable than ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252
A String s is a series of characters that are basically independent of any character encoding (ok, not exactly independent, but close enough for our needs here). Whatever encoding your data was in when you loaded it into a String has already been decoded. The decoding was done either using the system default encoding (which is practically ALWAYS AN ERROR; never use the system default encoding, trust me, I have over 10 years of experience dealing with bugs related to wrong default encodings) or the encoding you explicitly specified when you loaded the data.
When you call getBytes("ISO-8859-1") for a String, you request that the String is encoded into bytes according to ISO-8859-1 encoding.
When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. You are creating a string from a byte array as if it had been encoded in UTF-8, yet just above you encoded it in ISO-8859-1; that is your error.
What you want to do is:
byte[] bytes = s.getBytes("UTF-8");
s = new String(bytes, "UTF-8");
When I check my file with Notepad++ it's in ANSI encoding. What am I doing wrong here?
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
try
{
out.write(text);
out.flush();
} finally
{
out.close();
}
UPDATE:
This is solved now. The reason for JBoss not understanding my XML wasn't the encoding; it was the naming of my XML file. Thanks all for the help, even though there really wasn't any problem...
If you're creating an XML file (as your comments imply), I would strongly recommend that you use the XML libraries to output this and write the correct XML encoding header. Otherwise your character encoding won't conform to XML standards and other tools (like your JBoss instance) will rightfully complain.
// Prepare the DOM document for writing
Source source = new DOMSource(doc);
// Prepare the output file
File file = new File(filename);
Result result = new StreamResult(file);
// Write the DOM document to the file
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);
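If you also want the encoding declared explicitly in the XML prolog, it can be set on the transformer before the transform call (a sketch, assuming UTF-8 is the desired encoding):
import javax.xml.transform.OutputKeys;

// Produces <?xml version="1.0" encoding="UTF-8"?> and encodes the output accordingly
xformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");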
There's no such thing as plain text. The problem is that an application is decoding character data without you telling it which encoding the data uses.
Although many Microsoft apps rely on the presence of a Byte Order Mark to indicate a Unicode file, this is by no means standard. The Unicode BOM FAQ says more.
You can add a BOM to your output by writing the character '\uFEFF' at the start of the stream. More info here. This should be enough for applications that rely on BOMs.
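In code, that is a one-line addition to the snippet from the question (a sketch; the writer encodes U+FEFF as the bytes EF BB BF):
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
try
{
    out.write('\uFEFF'); // write the BOM first, so BOM-aware editors detect UTF-8
    out.write(text);
    out.flush();
} finally
{
    out.close();
}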
UTF-8 is designed to be, in the common case, rather indistinguishable from ANSI. So when you write text to a file and encode the text with UTF-8, in the common case, it looks like ANSI to anyone else who opens the file.
UTF-8 is 1-byte-per-character for all ASCII characters, just like ANSI.
UTF-8 has all the same bytes for the ASCII characters as ANSI does.
UTF-8 does not have any special header characters, just as ANSI does not.
It's only when you start to get into the non-ASCII codepoints that things start looking different.
But in the common case, byte-for-byte, ANSI and UTF-8 are identical.
If there is no BOM (and Java doesn't output one for UTF8; it doesn't even recognize one when reading), the text is identical in the ANSI and UTF8 encodings as long as only characters in the ASCII range are used. Therefore Notepad++ cannot detect any difference.
(And there seems to be an issue with UTF8 handling in Java anyway...)
The IANA-registered name is "UTF-8", not "UTF8". However, Java should throw an exception for invalid encoding names, so that's probably not the problem.
I suspect that Notepad is the problem. Examine the text using a hexdump program, and you should see it properly encoded.
Did you try writing a BOM at the beginning of the file? A BOM is about the only thing that can tell an editor the file is in UTF-8; otherwise, a UTF-8 file can just look like Latin-1 or extended ANSI.
You can do it like this:
public final static byte[] UTF8_BOM = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
...
OutputStream os = new FileOutputStream(file);
os.write(UTF8_BOM);
os.flush();
OutputStreamWriter out = new OutputStreamWriter(os, "UTF8");
try
{
out.write(text);
out.flush();
} finally
{
out.close();
}