convert UTF-8 String to ISO-8859-1 java - java

I have an application where I want to convert a UTF-8 encoded string to ISO-8859-1, because that is the encoding of my Oracle DB.
Currently this is what I'm inserting into my DB:
België
But I expect this:
België
When I print my string in Java I get the following:
België
Can anybody help me?
This is what I already tried:
System.out.println(xmlString);
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
ByteBuffer inputBuffer = ByteBuffer.wrap(xmlString.getBytes(utf8charset));
// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);
// encode ISO-8859-1
ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = outputBuffer.array();
xmlt = new oracle.xdb.XMLType(con, new String(outputData, iso88591charset));
The suggestion from the comments didn't work either:
byte[] utf8 = xmlString.getBytes("UTF-8");
byte[] latin = new String(utf8, "UTF-8").getBytes("ISO-8859-1");
ByteArrayInputStream bis = new ByteArrayInputStream(latin);
xmlt = new oracle.xdb.XMLType(con, bis);

UTF-8 can encode any Unicode code point. ISO-8859-1 covers only a tiny fraction of them (256 characters), so transcoding from UTF-8 to ISO-8859-1 is lossy: characters that ISO-8859-1 cannot represent are lost and show up as replacement characters ("?" or "�") in your text, and there may be other issues as well.
Transcoding from ISO-8859-1 to UTF-8 does not have this problem.
I would suggest transcoding the text like this:
byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");
or
byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");
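The lossy direction can be made visible with a small sketch. The class and method names below are made up for illustration. The lenient variant relies on String.getBytes substituting "?" for characters the target charset cannot represent; the strict variant uses a CharsetEncoder configured to report such characters instead of replacing them.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class Latin1Transcoder {

    // Lenient: characters ISO-8859-1 cannot represent are replaced with '?'.
    static byte[] utf8ToLatin1(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8); // decode UTF-8
        return text.getBytes(StandardCharsets.ISO_8859_1);      // encode ISO-8859-1
    }

    // Strict: throws instead of silently replacing anything.
    static byte[] utf8ToLatin1Strict(byte[] utf8) throws CharacterCodingException {
        CharBuffer chars = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(utf8));
        ByteBuffer out = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        // Copy only the bytes that were written; array() may expose extra capacity.
        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }

    public static void main(String[] args) throws Exception {
        // "Belgi" followed by U+00EB (e with diaeresis)
        byte[] utf8 = "Belgi\u00EB".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = utf8ToLatin1(utf8);
        System.out.println(utf8.length + " -> " + latin1.length); // 7 -> 6
    }
}
```

The strict variant is useful when silently storing "?" in the database would be worse than failing early.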

Related

Java - encode a string to Base64

I am converting a piece of javascript code to java and want to encode a string to Base64 in java.
Code in javascript:
let encodedData = btoa(String.fromCharCode.apply(null, new Uint8Array(array)))
This converts the Uint8Array to a string first and then encodes it to Base64. But I am not able to find a way to do the same in Java.
Java code is
InputStream insputStream = new FileInputStream(file);
long length = file.length();
byte[] bytes = new byte[(int) length];
insputStream.read(bytes);
insputStream.close();
byte[] encodedBytes = Base64.getEncoder().encode(bytes);
This encodes the bytes, due to which encodedData (JS) and encodedBytes (Java) are not the same.
What I want to do is something like:
String str = new String(bytes);
byte[] encodedBytes = Base64.getEncoder().encode(str); // ERROR: encode doesn't accept string
Is there any way to achieve this?
Base64.getEncoder().encode(str.getBytes(charset)) may help you (as Thomas noticed). But I can't guess your charset. The right syntax for the charset will be something like StandardCharsets.SOME_CHARSET or Charset.forName("some_charset").
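A minimal sketch of both options, with made-up class and method names. It assumes the goal is a Base64 String rather than a byte[]: Base64.Encoder.encodeToString returns one directly. Since the JavaScript snippet maps every byte to one character before calling btoa, Base64-encoding the raw bytes should correspond to it without going through a Java String at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class for illustration.
class Base64Demo {

    // Base64 of raw bytes, returned as a String.
    static String encodeBytes(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Base64 of text; the charset decides which bytes get encoded.
    static String encodeText(String text, Charset charset) {
        return Base64.getEncoder().encodeToString(text.getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] raw = {72, 105, 33}; // the ASCII bytes of "Hi!"
        System.out.println(encodeBytes(raw)); // SGkh

        // The same text gives different Base64 output under different charsets.
        String e = "\u00E9"; // U+00E9, e with acute accent
        System.out.println(encodeText(e, StandardCharsets.UTF_8));      // w6k=
        System.out.println(encodeText(e, StandardCharsets.ISO_8859_1)); // 6Q==
    }
}
```

Avoid new String(bytes) without a charset on binary data: decoding arbitrary bytes with the default charset can alter them before they are encoded.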

How to convert char in UTF-16 encoding to char in CP855 encoding?

For example, I have char a = 'п', and I want to get its value in the Cyrillic code page (CP855). Is that possible?
To convert text from one encoding to another you can use the java.nio.charset.Charset class:
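byte[] data = // your encoded data in UTF-16
Charset from = StandardCharsets.UTF_16;
Charset to = // cyrillic charset
ByteBuffer encoded = to.encode(from.decode(ByteBuffer.wrap(data)));
// copy only the bytes actually written; array() may include unused capacity
byte[] converted = new byte[encoded.remaining()];
encoded.get(converted);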
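Filled in as a runnable sketch, this could look as follows. The charset name "IBM855" is an assumption about what the JDK calls CP855; it lives in the extended charsets, so the code checks Charset.isSupported before using it. The class and method names are made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
class CharsetConverter {

    // Decodes 'data' from one charset and re-encodes it in another.
    static byte[] transcode(byte[] data, Charset from, Charset to) {
        ByteBuffer encoded = to.encode(from.decode(ByteBuffer.wrap(data)));
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        return result;
    }

    public static void main(String[] args) {
        String name = "IBM855"; // assumed JDK name for CP855
        if (!Charset.isSupported(name)) {
            System.out.println(name + " is not available in this runtime");
            return;
        }
        Charset cp855 = Charset.forName(name);
        // U+043F is the Cyrillic small letter pe
        byte[] utf16 = "\u043F".getBytes(StandardCharsets.UTF_16);
        byte[] converted = transcode(utf16, StandardCharsets.UTF_16, cp855);
        System.out.printf("0x%02X%n", converted[0] & 0xFF);
    }
}
```

Because a Java char is already a UTF-16 code unit, String.valueOf(a).getBytes(cp855) gives the same bytes without the intermediate UTF-16 byte array.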

String Encoding TextView.setText()

While setting text in a TextView, the character 'ù' isn't correctly interpreted. This is my code:
TextView tv = new TextView(context);
String s;
byte[] bytes;
s = "dgseùeT41ù";
bytes = s.getBytes("ISO-8859-1");
tv.setText(new String(bytes));
I don't know where I'm failing. Thank you for support
You encoded the string with "ISO-8859-1", but new String(bytes) decodes with the default charset, which is UTF-8 here. So either specify the charset when creating the string.
From docs
Case mapping is based on the Unicode Standard version specified by the
Character class and A String represents a string in the UTF-16 format
so
bytes = s.getBytes("ISO-8859-1");
tv.setText(new String(bytes,"ISO-8859-1"));
or don't specify a charset at all, so that the same default is used in both directions
bytes = s.getBytes();
tv.setText(new String(bytes));
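The effect of mismatched charsets can be reproduced without Android. The sketch below uses a made-up helper that encodes with one charset and decodes with another; only the matching pair returns the original text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
class CharsetRoundTrip {

    // Encodes the text with one charset and decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Same text as in the question; U+00F9 is u with grave accent.
        String s = "dgse\u00F9eT41\u00F9";

        String same = roundTrip(s, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        String mixed = roundTrip(s, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(same.equals(s));  // true
        System.out.println(mixed.equals(s)); // false: byte 0xF9 is not valid UTF-8
    }
}
```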

How does String.getBytes("utf-8") work in java?

OS default encoding: UTF-8
Converting a UTF-8 str to UTF-16 in Python:
utf8_str = "Hélô" # type(utf8_str) is str and encoded in UTF-8
unicode_str = utf8_str.decode("UTF-8") # type(unicode_str) is unicode
utf16_str = unicode_str.encode("UTF-16") #type(utf16_str) is str and encoded in UTF-16
As you can see, unicode is the bridge for converting a UTF-8 str to a UTF-16 str, and it's easy to understand.
But in Java, I am confused about the conversion:
String utf16Str = "Hélô";// String encoded in "UTF-16"
byte[] bytes = utf16Str.getBytes("UTF-8");//byte array encoded in UTF-8, getBytes will call a encode method.
String newUtf16Str = new String(bytes, "UTF-8");// String encoded in "UTF-16"
There is no decode, no unicode. So what happens in this process?
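One way to read the Java snippet is that the String itself plays the role of Python's unicode object: getBytes(charset) is the encode step and new String(bytes, charset) is the decode step. The sketch below, with made-up class and method names, spells both steps out through Charset.encode and Charset.decode and compares them with the String shortcuts.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration.
class EncodeDecodeSteps {

    // Explicit "encode": the chars of the String become UTF-8 bytes.
    static byte[] encodeUtf8(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // Explicit "decode": UTF-8 bytes become the chars of a new String.
    static String decodeUtf8(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        String text = "H\u00E9l\u00F4"; // the sample word from the question
        byte[] viaGetBytes = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(viaGetBytes, encodeUtf8(text))); // true
        System.out.println(decodeUtf8(viaGetBytes).equals(text));         // true
        System.out.println(text.length() + " chars, " + viaGetBytes.length + " bytes"); // 4 chars, 6 bytes
    }
}
```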

What is the target charset when I encrypt a string with Cipher and Base64.Encoder?

I'm trying to accommodate encrypted tokens in a DSL I'm designing (i.e. I need a char to be used as a delimiter). The encoder.encodeToString(...) docs say it uses the ISO-8859-1 charset. But when I encrypt a sampling of texts, it looks like it's not using all of the ISO-8859-1 charset: only upper/lower-case letters and some symbols appear, and not certain punctuation and accented chars. What am I missing about this encodeToString() call, and what is the final char domain?
//import java.util.Base64;
//import javax.crypto.Cipher;
//import javax.crypto.SecretKey;
static Cipher cipher;
public static String decrypt(String encryptedText, SecretKey secretKey) throws Exception {
Base64.Decoder decoder = Base64.getDecoder();
byte[] encryptedTextByte = decoder.decode(encryptedText);
cipher.init(Cipher.DECRYPT_MODE, secretKey);
byte[] decryptedByte = cipher.doFinal(encryptedTextByte);
String decryptedText = new String(decryptedByte);
return decryptedText;
}
String has a constructor with charset; otherwise the default OS charset is taken.
new String(decryptedByte, StandardCharsets.ISO_8859_1);
As there often is a mix-up of Latin-1 (ISO-8859-1) with Windows Latin-1 (Windows-1252) you might try "Windows-1252" too.
Base64 is named just so because it uses 64 characters from the ASCII table. The encoding used should not matter as long as it is compatible with ASCII.
If you want to use more than 64 characters you will have to use a different encoding.
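To see which characters can actually occur in the output, the sketch below builds an input that contains every possible 6-bit group and collects the distinct characters the encoder emits. The class and method names are made up; the encoders are the standard ones from java.util.Base64.

```java
import java.util.Base64;
import java.util.TreeSet;

// Hypothetical helper class for illustration.
class Base64Alphabet {

    // Returns every character the given encoder can emit, sorted by code point.
    static String alphabetOf(Base64.Encoder encoder) {
        byte[] input = new byte[64 * 3];
        for (int i = 0; i < 64; i++) {
            // 24 bits holding the same 6-bit value four times
            int group = (i << 18) | (i << 12) | (i << 6) | i;
            input[3 * i] = (byte) (group >> 16);
            input[3 * i + 1] = (byte) (group >> 8);
            input[3 * i + 2] = (byte) group;
        }
        TreeSet<Character> seen = new TreeSet<>();
        for (char c : encoder.encodeToString(input).toCharArray()) {
            seen.add(c);
        }
        // A one-byte input forces padding, which adds '='.
        for (char c : encoder.encodeToString(new byte[1]).toCharArray()) {
            seen.add(c);
        }
        StringBuilder sb = new StringBuilder();
        for (char c : seen) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(alphabetOf(Base64.getEncoder()));
        System.out.println(alphabetOf(Base64.getUrlEncoder()));
    }
}
```

With the basic encoder the result is the letters, the digits, "+", "/" and the padding "="; the URL-safe encoder uses "-" and "_" in place of "+" and "/". Any other character, for example ":" or ".", never appears in the output and is therefore safe as a delimiter.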
Encryption returns bytes, no matter what the encryption is.
I personally use the following function to convert from decrypted bytes to string:
public static String getStringFromBytes(byte[] data) {
StringBuilder sb = new StringBuilder();
if (data != null && data.length > 0) {
for (byte b : data) {
sb.append((char) (b & 0xFF)); // mask so negative bytes are not sign-extended
}
}
return sb.toString();
}
Read the documentation carefully:
This method first encodes all input bytes into a base64 encoded byte
array and then constructs a new String by using the encoded byte array
and the ISO-8859-1 charset.
This means two things:
The encoding scheme used is Base64. That is why you are not getting the output you expected: ISO-8859-1 is not the encoding scheme.
ISO-8859-1 is only used as the charset for constructing the String.
The Base64 class gives you encoders and decoders for the Base64 encoding scheme, so if you use them, the output will be encoded/decoded using the Base64 scheme.
This class consists exclusively of static methods for obtaining
encoders and decoders for the Base64 encoding scheme.
So what you need is to convert your byte array using a proper character encoding, such as UTF-8 or ISO-8859-1.
I would personally recommend "UTF-8" because it is widely used and can encode every Unicode character. It encodes ASCII in 1 byte, further Latin and other characters up to U+07FF in 2 bytes, the rest of the BMP in 3 bytes, and supplementary characters in 4 bytes, so it wastes no space on ASCII text.
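The UTF-8 byte counts can be checked directly: 1 byte for ASCII, 2 bytes up to U+07FF, 3 bytes for the rest of the BMP, and 4 bytes for supplementary characters. The class and method names in this sketch are made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
class Utf8Lengths {

    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1: ASCII
        System.out.println(utf8Length("\u00E9"));       // 2: U+00E9, Latin letter with accent
        System.out.println(utf8Length("\u20AC"));       // 3: U+20AC, euro sign (BMP)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4: U+1F600, supplementary character
    }
}
```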
There are many ways to encode your String using the desired encoding scheme. An example is below:
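byte[] byteArr = new byte[3]; // placeholder for the decrypted bytes
String decodeText = new String(byteArr, StandardCharsets.UTF_8); // decode with an explicit charset
ByteBuffer encoded = StandardCharsets.UTF_8.encode(decodeText); // encode the text as UTF-8 again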
