Converting decode utf-8 string to file

I am trying to save an image that I receive from an Android device. The Android app sends a UTF-8 encoded string, and below is the code I am using to save it.
String test = java.net.URLDecoder.decode(image_base64, "UTF-8");
byte[] data = Base64.decodeBase64(test.getBytes());
FileOutputStream stream = null;
try {
stream = new FileOutputStream("/var/lib/easy-tomcat7/webapps/test/test1.bmp");
stream.write(data);
stream.flush();
test1 += "success";
}
catch (IOException e)
{
test1 = "failure";
e.getMessage();
}
finally
{
test1 += "finally";
stream.close();
}
The file is created, but it is corrupted. I did a lot of research on this but cannot figure out why it is happening. Please help me solve this issue.

I assume you are using Base64 from Apache Commons Codec.
Note that you are dealing with multiple different kinds of encodings:
URL encoding
Base64 encoding
UTF-8 character encoding
Those are three totally different things, and you should understand all of them to understand what is happening exactly.
Check how exactly the image is encoded that you get from the Android device. Your code is assuming that you are getting it as URL-encoded Base64 data, using the UTF-8 character set. Is that indeed how the Android device is sending the data? You will have to check that with whoever wrote the Android application.
What does the string image_base64 contain? Is it valid, URL-encoded Base64 data?
You shouldn't call getBytes() on the string before you pass it to Base64.decodeBase64 - that will convert the string into a byte array using the default character encoding of the system you're running it on. Just do this instead:
byte[] data = Base64.decodeBase64(test);
To make matters worse, there are several variants of Base64 encoding (as you can see on the Wikipedia page about Base64). It may be the case that whatever variant the Android app used is different from what the Base64 class is using.
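The decoding chain described above can be sketched with JDK classes only. This is a minimal sketch, not the asker's actual setup: it swaps Apache Commons Codec for java.util.Base64, the class and method names (ImageDecodeSketch, decodeUrlEncodedBase64, saveImage) are made up for illustration, and it assumes the Android app really sends URL-encoded Base64. The MIME decoder is used because it tolerates line breaks, which some Base64 encoders insert.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper class, named for illustration only.
class ImageDecodeSketch {

    // Undoes the two layers the question's code assumes: URL encoding, then Base64.
    static byte[] decodeUrlEncodedBase64(String urlEncodedBase64) {
        String base64;
        try {
            // Turns %2B, %2F and %3D back into '+', '/' and '='.
            // Only correct if the sender really URL-encoded the payload:
            // a raw '+' would be turned into a space here.
            base64 = URLDecoder.decode(urlEncodedBase64, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
        // The MIME decoder skips line breaks and other characters
        // that are not part of the Base64 alphabet.
        return Base64.getMimeDecoder().decode(base64);
    }

    // Writes the decoded bytes; Files.write opens and closes the file itself,
    // so there is no stream left to close in a finally block.
    static void saveImage(String urlEncodedBase64, Path target) {
        try {
            Files.write(target, decodeUrlEncodedBase64(urlEncodedBase64));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = decodeUrlEncodedBase64("%2B%2B%2B%2B%2FwA%3D");
        System.out.println(data.length + " bytes decoded");
    }
}
```

If the decoded file is still corrupted, comparing the first few decoded bytes with the expected file signature (for example "BM" for a BMP) is a quick way to tell whether the problem is in the decoding or in the data that was sent.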

Pass the encoding to getBytes() as well:
Base64.decodeBase64(test.getBytes("utf-8"));

Related

How do I write Chinese characters in ZipEntry?

I want to export a string (Chinese text) to a CSV file inside a zip file. Where do I need to set the encoding to UTF-8? Or what approach should I take (based on the code below) to display Chinese characters in the exported CSV file?
This is the code I currently have.
ByteArrayOutputStream out = new ByteArrayOutputStream();
ZipOutputStream zipOut = new ZipOutputStream(out, StandardCharsets.UTF_8);
try {
ZipEntry entry = new ZipEntry("chinese.csv");
zipOut.putNextEntry(entry);
zipOut.write("类型".getBytes());
} catch (IOException e) {
e.printStackTrace();
} finally {
zipOut.close();
out.close();
}
Instead of "类型", I get "ç±»åž‹" in the CSV file.

First, you definitely need to change zipOut.write("类型".getBytes()); to zipOut.write("类型".getBytes(StandardCharsets.UTF_8));
Also, when you open the resulting CSV file, the editor might not be aware that the content is encoded in UTF-8, so you may need to tell it. For instance, in Notepad you can use the "Save As" option and change the encoding to UTF-8.
Your issue might also be just a display problem rather than an actual encoding problem. There is an open-source Java library with a utility that converts any String to a Unicode escape sequence and vice versa. This utility has helped me many times when diagnosing charset-related issues. Here is a sample of what the code does:
String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found at Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.
Here is javadoc for the class StringUnicodeEncoderDecoder
I tried your inputs and got this:
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("类型"));
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("ç±»åž‹"));
And the output was:
\u7c7b\u578b
\u00e7\u00b1\u00bb\u00e5\u017e\u2039
So it looks like you did lose the information, and it is not just a display issue.
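The same check can be reproduced without a third-party library. The sketch below is an illustration with made-up names (MojibakeSketch, toUnicodeEscapes, misdecode), not the library's code: it prints each char as a \uXXXX escape and shows that reading the UTF-8 bytes of "类型" as windows-1252 yields exactly the second sequence above. The Chinese text is written with escapes so the result does not depend on how the source file itself is encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class MojibakeSketch {

    // Formats every UTF-16 code unit of the string as a backslash-u escape.
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    // Encodes as UTF-8, then decodes the bytes with the wrong charset.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String text = "\u7c7b\u578b";
        System.out.println(toUnicodeEscapes(text));
        System.out.println(toUnicodeEscapes(misdecode(text)));
    }
}
```

This only shows that a UTF-8 versus windows-1252 mismatch produces the observed characters. It does not say at which step (compiling, writing, or viewing) the mismatch happened.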

The getBytes() method is one culprit: without an explicit charset, it uses the default character set of your machine. From the Java String documentation:
getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
getBytes(String charsetName)
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
Furthermore, as @Slaw pointed out, make sure that you compile your files (javac -encoding <encoding>) with the same encoding the files are saved in:
-encoding Set the source file encoding name, such as EUC-JP and UTF-8. If -encoding is not specified, the platform default converter is used.
By the way, a call to closeEntry() was missing in the original post. I stripped the snippet down to what I found necessary to achieve the desired functionality.
try (FileOutputStream fileOut = new FileOutputStream("out.zip");
ZipOutputStream zipOut = new ZipOutputStream(fileOut)) {
zipOut.putNextEntry(new ZipEntry("chinese.csv"));
zipOut.write("类型".getBytes("UTF-8"));
zipOut.closeEntry();
}
Finally, as @MichaelGantman pointed out, you might want to check which data is in which encoding using a tool such as a hex editor, also to rule out that the editor you view the result file in displays correct UTF-8 incorrectly. "类" in UTF-8 is (hex) e7 b1 bb; in UTF-16 (the encoding Java uses internally for strings) it is 7c 7b.
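The hex check can also be done in code. The following is a rough sketch with hypothetical names (ZipUtf8Sketch, zipSingleEntry, readFirstEntry, toHex): it writes the entry with explicit UTF-8 bytes into an in-memory zip, reads it back, and dumps the bytes as hex. Note that the Charset passed to ZipOutputStream only affects entry names and comments, not the entry content.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, named for illustration only.
class ZipUtf8Sketch {

    // Builds a zip archive in memory with one entry whose content is UTF-8 text.
    static byte[] zipSingleEntry(String entryName, String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zipOut = new ZipOutputStream(buffer, StandardCharsets.UTF_8)) {
            zipOut.putNextEntry(new ZipEntry(entryName));
            zipOut.write(text.getBytes(StandardCharsets.UTF_8)); // explicit charset
            zipOut.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The archive is complete only after the zip stream has been closed.
        return buffer.toByteArray();
    }

    // Returns the raw bytes of the first entry in the archive.
    static byte[] readFirstEntry(byte[] zipBytes) {
        try (ZipInputStream zipIn = new ZipInputStream(
                new ByteArrayInputStream(zipBytes), StandardCharsets.UTF_8)) {
            zipIn.getNextEntry();
            ByteArrayOutputStream content = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = zipIn.read(chunk)) != -1) {
                content.write(chunk, 0, n);
            }
            return content.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u7c7b\u578b";
        System.out.println(toHex(readFirstEntry(zipSingleEntry("chinese.csv", text))));
        System.out.println(toHex("\u7c7b".getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

If the first line shows e7 b1 bb e5 9e 8b, the zip content is correct UTF-8 and any remaining garbling comes from the program used to view the CSV.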

Ruby base 64 decoding for Java base 64 encoding

I have a string that is encoded in Java using
data = new String(Base64.getEncoder().encode(encVal), StandardCharsets.UTF_8);
I am receiving this encoded data as an API response. I want to base64 decode this in ruby. I am using
Base64.strict_decode64(data)
for this, but it is not working. Can anyone help me with this?

Your Java code is correct:
byte[] encVal = "Hello World".getBytes();
String data = new String(Base64.getEncoder().encode(encVal), StandardCharsets.UTF_8);
System.out.println(data); // SGVsbG8gV29ybGQ=
The SGVsbG8gV29ybGQ= decodes correctly using multiple tools, e.g. https://www.base64decode.org/.
If you see garbage characters after decoding your value, the most likely cause is an error when creating the byte[]. You may have to specify the correct encoding when creating the byte[].
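A minimal sketch of what specifying the encoding on the Java side could look like, assuming the original value is text. The class and method names (Base64RoundTripSketch, encode, decode) are illustrative. Both directions name UTF-8 explicitly, so the result does not depend on the platform default charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class, named for illustration only.
class Base64RoundTripSketch {

    // Text -> UTF-8 bytes -> Base64 string (standard alphabet, no line breaks).
    static String encode(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(raw);
    }

    // Base64 string -> bytes -> text, decoded with the same charset.
    static String decode(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String data = encode("Hello World");
        System.out.println(data);
        System.out.println(decode(data));
    }
}
```

If the bytes being encoded are not text at all (for example, encrypted data), the Base64 step is still fine, but the decoded bytes must not be interpreted as a string on the receiving side.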

Unicode escapes are not getting translated in Java

I wrote the code snippet below and I'm expecting the output to be Açılış Tarih/Saati: but instead I'm getting AÃ§Ä±lÄ±ÅŸ Tarih/Saati: The code is as follows.
public class ResourceBundleTest {
public static void main(String[] args) {
try{
String turkey = "A\u00e7\u0131l\u0131\u015f Tarih/Saati:";
System.out.println(new String(turkey.getBytes("UTF-8")));
}
catch(Exception e)
{
System.out.println("hello");
}
}
}
Please help me get rid of this issue.

This code is basically broken:
new String(turkey.getBytes("UTF-8"))
That:
Starts with a string.
Encodes the text as UTF-8.
Decodes that binary data using the platform default encoding.
It's like saving an image as a PNG and then trying to load it as a JPEG.
You can try:
System.out.println(turkey);
... if that doesn't show up as you want it to, the problem is probably just that your console doesn't support those characters. Trying to change the encoding won't help at all - the best it could do (if you were really lucky) is to get back to the original string. More likely (and it seems this is the case with your output) is that the platform default encoding isn't UTF-8, so you're losing data.

In turkey.getBytes("UTF-8") you explicitly create a byte array containing the String contents encoded in UTF-8. However, the constructor String(byte[]) decodes this byte sequence, according to its documentation, using the system's default encoding, which may not be UTF-8.
Why don't you just write System.out.println(turkey), though? What's the point of encoding the String in UTF-8 and then decoding it before printing it?
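To make the mismatch concrete, here is a small sketch (the names TurkishDecodeSketch and decodeUtf8BytesAs are made up for illustration). It decodes the same UTF-8 bytes once with UTF-8 and once with windows-1252, which is assumed here to stand in for the asker's platform default. The second result matches the garbled output in the question.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class TurkishDecodeSketch {

    static final String TURKEY = "A\u00e7\u0131l\u0131\u015f Tarih/Saati:";

    // Encodes the text as UTF-8 and decodes the bytes with the given charset.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), charset);
    }

    public static void main(String[] args) throws Exception {
        // Print through an explicit UTF-8 stream; the console must also use UTF-8
        // for the characters to show up correctly.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(decodeUtf8BytesAs(TURKEY, StandardCharsets.UTF_8));
        out.println(decodeUtf8BytesAs(TURKEY, Charset.forName("windows-1252")));
    }
}
```

Decoding with the same charset that was used for encoding only gets back the original string, which is why the round trip adds nothing over printing turkey directly.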

Base64 InputStream to String

I have been trying to read, through an input stream, a file that is plain text with some images and other files embedded in Base64, and to write it out again to a String while keeping the encoding. I mean, I want the String to contain something like:
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAoHBwgHBgoICAgLCgoLDhgQDg0NDh0VFhEYIx8lJCIf
IiEmKzcvJik0KSEiMEExNDk7Pj4+JS5ESUM8SDc9Pjv/2wBDAQoLCw4NDhwQEBw7KCIoOzs7Ozs7
I have been trying with the class Base64InputStream and others from packages such as org.apache.commons.codec, but I just cannot figure it out. Any kind of help would be really appreciated. Thanks in advance!
Edit
Piece of code using a reader:
BufferedReader br= new BufferedReader(new InputStreamReader(bodyPart.getInputStream()));
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
Getting as a result something like: .DIC;ÿÛC;("(;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;ÿÀ##"ÿÄ

Have you tried doing this:
final byte[] bytes64bytes = Base64.encodeBase64(IOUtils.toByteArray(is));
final String content = new String(bytes64bytes);

A text file containing some base64 data can be read with the charset of the rest of the file.
Base64 encoding is a means of encoding bytes in a limited set of characters that are unchanged by almost all character encodings, for example ASCII or UTF-8.
Base64 isn't a charset encoding, you don't have to specify you have some base64 encoded data when reading a file into a string.
So if your text file is generally UTF-8 (that's probable), you can read it without problem even if it contains a base64 encoded stream. Simply use a basic reader and don't use a Base64InputStream if you don't want to decode it.
When opening a file with a reader, you have to specify the encoding. If you don't know it, I suggest you test with the probable ones, like UTF-8, US-ASCII or ISO-8859-1.
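A basic reader with an explicit charset could look like the sketch below. The names Base64TextReaderSketch and readText are illustrative, and the stream is assumed to contain the text as stored, with the Base64 parts not yet decoded. Unlike the loop in the question, it puts the line breaks back, since readLine() strips them.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class Base64TextReaderSketch {

    // Reads the whole stream as text in the given charset, line by line.
    static String readText(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n'); // readLine() drops the line terminator
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sample = "Some text\n/9j/4AAQSkZJRgABAQEAYABgAAD/\n".getBytes(StandardCharsets.US_ASCII);
        System.out.print(readText(new ByteArrayInputStream(sample), StandardCharsets.UTF_8));
    }
}
```

If the stream hands back raw image bytes instead, as the output in the question suggests, something upstream has already decoded the Base64, and no choice of reader charset will restore it.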

If you have a normal InputStream object, then you can directly get a Base64-encoded stream from it using the constructor of the Apache Commons class Base64InputStream.

I found the solution, inspired by this post: getting base64 content string of an image from a mimepart in Java
I think it is kind of stupid to decode and then encode the Base64 data again, but it is the only way I found to handle this issue. If someone has a better solution, it would be much appreciated.
Thanks

UTF-8 character encoding in Java

I am having some problems getting some French text to convert to UTF8 so that it can be displayed properly, either in a console, text file or in a GUI element.
The original string is
HANDICAP╔ES
which is supposed to be
HANDICAPÉES
Here is a code snippet that shows how I am using the jackcess Database driver to read in the Access MDB file in an Eclipse/Linux environment.
Database database = Database.open(new File(filepath));
Table table = database.getTable(tableName, true);
Iterator<Map<String, Object>> rowIter = table.iterator();
while (rowIter.hasNext()) {
Map<String, Object> row = rowIter.next();
// convert fields to UTF
Map<String, Object> rowUTF = new HashMap<String, Object>();
try {
for (String key : row.keySet()) {
Object o = row.get(key);
if (o != null) {
String valueCP850 = o.toString();
// String nameUTF8 = new String(valueCP850.getBytes("CP850"), "UTF8"); // does not work!
String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
String valueUTF8 = new String(valueISO.getBytes(), "UTF-8"); // works!
rowUTF.put(key, valueUTF8);
}
}
} catch (UnsupportedEncodingException e) {
System.err.println("Encoding exception: " + e);
}
}
In the code you'll see where I want to convert directly to UTF8, which doesn't seem to work, so I have to do a double conversion. Also note that there doesn't seem to be a way to specify the encoding type when using the jackcess driver.
Thanks,
Cam

New analysis, based on new information.
It looks like your problem is with the encoding of the text before it was stored in the Access DB. It seems it had been encoded as ISO-8859-1 or windows-1252, but decoded as cp850, resulting in the string HANDICAP╔ES being stored in the DB.
Having correctly retrieved that string from the DB, you're now trying to reverse the original encoding error and recover the string as it should have been stored: HANDICAPÉES. And you're accomplishing that with this line:
String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
getBytes("CP850") converts the character ╔ to the byte value 0xC9, and the String constructor decodes that according to ISO-8859-1, resulting in the character É. The next line:
String valueUTF8 = new String(valueISO.getBytes(), "UTF-8");
...does nothing. getBytes() encodes the string in the platform default encoding, which is UTF-8 on your Linux system. Then the String constructor decodes it with the same encoding. Delete that line and you should still get the same result.
More to the point, your attempt to create a "UTF-8 string" was misguided. You don't need to concern yourself with the encoding of Java's strings--they're always UTF-16. When bringing text into a Java app, you just need to make sure you decode it with the correct encoding.
And if my analysis is correct, your Access driver is decoding it correctly; the problem is at the other end, possibly before the DB even comes into the picture. That's what you need to fix, because that new String(getBytes()) hack can't be counted on to work in all cases.
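The byte-level reasoning above can be checked with a short sketch. The names Cp850RepairSketch and reinterpret are made up for illustration, and the sketch assumes the JDK in use ships the IBM850 (cp850) charset, which is part of the extended charsets rather than the guaranteed standard set. It is meant for diagnosis only, not as a fix.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class Cp850RepairSketch {

    // Re-encodes the text with the charset it was wrongly decoded with,
    // then decodes those bytes with the charset that was originally intended.
    static String reinterpret(String text, Charset wrongCharset, Charset intendedCharset) {
        return new String(text.getBytes(wrongCharset), intendedCharset);
    }

    public static void main(String[] args) {
        Charset cp850 = Charset.forName("IBM850"); // assumed to be available
        String stored = "HANDICAP\u2554ES";          // the box-drawing character from the DB

        // The box-drawing character is the single byte 0xC9 in cp850.
        byte[] bytes = "\u2554".getBytes(cp850);
        System.out.println(String.format("0x%02X", bytes[0] & 0xFF));

        // Reading 0xC9 as ISO-8859-1 gives the accented capital E (U+00C9).
        String repaired = reinterpret(stored, cp850, StandardCharsets.ISO_8859_1);
        System.out.println(repaired.equals("HANDICAP\u00c9ES"));
    }
}
```

The trick works here only because both charsets are single-byte and the character involved exists in both. Characters that cp850 cannot represent would already have been replaced and cannot be recovered this way.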
Original analysis, based on no information. :-/
If you're seeing HANDICAP╔ES on the console, there's probably no problem. Given this code:
System.out.println("HANDICAPÉES");
The JVM converts the (Unicode) string to the platform default encoding, windows-1252, before sending it to the console. Then the console decodes that using its own default encoding, which happens to be cp850. So the console displays it wrong, but that's normal. If you want it to display correctly, you can change the console's encoding with this command:
CHCP 1252
To display the string in a GUI element, such as a JLabel, you don't have to do anything special. Just make sure you use a font that can display all the characters, but that shouldn't be a problem for French.
As for writing to a file, just specify the desired encoding when you create the Writer:
OutputStreamWriter osw = new OutputStreamWriter(
new FileOutputStream("myFile.txt"), "UTF-8");
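A slightly fuller sketch of the same idea, using try-with-resources so the writer is always closed. The class and method names (Utf8FileWriteSketch, writeUtf8, readUtf8, roundTripThroughTempFile) are hypothetical, and the temporary file exists only to make the example self-contained.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, named for illustration only.
class Utf8FileWriteSketch {

    // Writes the text as UTF-8, regardless of the platform default charset.
    static void writeUtf8(Path file, String text) {
        try (Writer writer = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file back, decoding with the same charset it was written in.
    static String readUtf8(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes to a temporary file, reads it back, and removes the file again.
    static String roundTripThroughTempFile(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-sketch", ".txt");
            try {
                writeUtf8(tmp, text);
                return readUtf8(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "HANDICAP\u00c9ES";
        System.out.println(original.equals(roundTripThroughTempFile(original)));
    }
}
```

Whatever reads the file later has to decode it as UTF-8 as well. Writing with one charset and reading with another is what produces garbled characters in the first place.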

String s = "HANDICAP╔ES";
System.out.println(new String(s.getBytes("CP850"), "ISO-8859-1")); // HANDICAPÉES
This shows the correct string value. This means that it was originally encoded with ISO-8859-1 and then incorrectly decoded with CP850 (as pointed out in a comment, the original encoding could also have been CP1252, a.k.a. Windows ANSI, since É has the same code point there as in ISO-8859-1).
Align your environment and binary pipelines so that they all use one and the same character encoding. You can't and shouldn't convert between them. You would risk losing information in the non-ASCII range that way.
Note: do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
Update: you are apparently still struggling with the problem. I'll repeat the important parts of the answer:
Align your environment and binary pipelines so that they all use one and the same character encoding.
You can not and should not convert between them. You would risk losing information in the non-ASCII range that way.
Do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
To fix the problem you need to choose character encoding X which you'd like to use throughout the entire application. I suggest UTF-8. Update MS Access to use encoding X. Update your development environment to use encoding X. Update the java.io readers and writers in your code to use encoding X. Update your editor to read/write files with encoding X. Update the application's user interface to use encoding X. Do not use Y or Z or whatever at some step. If the characters are already corrupted in some datastore (MS Access, files, etc), then you need to fix it by manually replacing the characters right there in the datastore. Do not use Java for this.
If you're actually using the "command prompt" as user interface, then you're actually lost. It doesn't support UTF-8. As suggested in the comments and in the article linked in the comments, you need to create a Swing application instead of relying on the restricted command prompt environment.

You can specify the encoding when opening the database. This worked perfectly and solved my encoding problem:
DatabaseImpl open = DatabaseImpl.open(new File("main.mdb"), true, null, Database.DEFAULT_AUTO_SYNC, java.nio.charset.Charset.availableCharsets().get("windows-1251"), null, null);
Table table = open.getTable("FolderInfo");

Using "ISO-8859-1" helped me deal with the French characters.
