String Encoding TextView.setText() - java

While setting text in a TextView, the character 'ù' isn't displayed correctly. This is my code:
TextView tv = new TextView(context);
String s;
byte[] bytes;
s = "dgseùeT41ù";
bytes = s.getBytes("ISO-8859-1");
tv.setText(new String(bytes));
I don't know where I'm going wrong. Thank you for your support.

You encoded the string with "ISO-8859-1", but new String(bytes) decodes with the platform default charset, which is UTF-8 on Android. So either specify the charset when creating the string.
From the docs:
Case mapping is based on the Unicode Standard version specified by the
Character class
and
A String represents a string in the UTF-16 format
So:
bytes = s.getBytes("ISO-8859-1");
tv.setText(new String(bytes,"ISO-8859-1"));
or don't specify a charset at all, so that both calls use the same default:
bytes = s.getBytes();
tv.setText(new String(bytes));
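As a minimal sketch of the point above (the class and method names are hypothetical, and the TextView is left out so the snippet runs on a plain JVM), the following compares decoding with the charset that was used for encoding against decoding with a different one:

```java
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Encode and decode with the same charset: the text survives unchanged.
    static String matched(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode as ISO-8859-1 but decode as UTF-8: the byte 0xF9 ('ù' in
    // ISO-8859-1) is not valid UTF-8, so the decoder substitutes U+FFFD.
    static String mismatched(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "dgse\u00f9eT41\u00f9"; // "dgseùeT41ù", escaped to avoid source-encoding issues
        System.out.println(matched(s).equals(s));
        System.out.println(mismatched(s).equals(s));
    }
}
```

Passing the charset explicitly on both sides keeps the result independent of whatever the platform default happens to be.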

Related

Java - encode a string to Base64

I am converting a piece of javascript code to java and want to encode a string to Base64 in java.
Code in javascript:
let encodedData = btoa(String.fromCharCode.apply(null, new Uint8Array(array)))
This converts the Uint8Array to a string first and then encodes it to Base64, but I can't find a way to do the same in Java.
Java code is
InputStream insputStream = new FileInputStream(file);
long length = file.length();
byte[] bytes = new byte[(int) length];
insputStream.read(bytes);
insputStream.close();
byte[] encodedBytes = Base64.getEncoder().encode(bytes);
This encodes the bytes directly, so encodedData (JS) and encodedBytes (Java) are not the same.
What I want to do is something like:
String str = new String(bytes);
byte[] encodedBytes = Base64.getEncoder().encode(str); // ERROR: encode doesn't accept string
Is there any way to achieve this?
Base64.getEncoder().encode(str.getBytes(charset)) may help you (as Thomas noticed), but I can't guess your charset. The right syntax for the charset is something like StandardCharsets.SOME_CHARSET or Charset.forName("some_charset").
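A hedged sketch of how this could look, assuming the goal is to match the JavaScript btoa(String.fromCharCode(...)) call, which treats each byte as one character. ISO-8859-1 is chosen here because it maps every byte value to exactly one char; that choice, and the class and method names, are assumptions rather than something stated in the question:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64FromBytes {
    // Base64 text straight from the raw bytes; no intermediate String needed.
    static String direct(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Same result via a String. ISO-8859-1 maps each byte to one char and
    // back, so no information is lost on the way through the String.
    static String viaLatin1String(byte[] bytes) {
        String str = new String(bytes, StandardCharsets.ISO_8859_1);
        return Base64.getEncoder()
                .encodeToString(str.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x80, 0x02, 0x03};
        System.out.println(direct(bytes));
        System.out.println(viaLatin1String(bytes));
    }
}
```

If both methods agree, the detour through a String adds nothing, and encoding the file bytes directly is the simpler option.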

convert UTF-8 String to ISO-8859-1 java

I have an application where I want to convert a UTF-8 encoded string to ISO-8859-1, because that is the encoding of my Oracle DB.
Currently this is what I'm inserting in my DB:
België
But I expect this:
België
When I print my string in Java I get the following:
België
Can anybody help me?
This is what I have already tried:
System.out.println(xmlString);
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
ByteBuffer inputBuffer = ByteBuffer.wrap(xmlString.getBytes(utf8charset));
// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);
// encode ISO-8859-1
ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = outputBuffer.array();
xmlt = new oracle.xdb.XMLType(con, new String(outputData, iso88591charset));
A suggestion from the comments didn't work either:
byte[] utf8 = xmlString.getBytes("UTF-8");
byte[] latin = new String(utf8, "UTF-8").getBytes("ISO-8859-1");
ByteArrayInputStream bis = new ByteArrayInputStream(latin);
xmlt = new oracle.xdb.XMLType(con, bis);
UTF-8 can encode any Unicode code point. ISO-8859-1 can represent only a small fraction of them (256 characters), so transcoding from UTF-8 to ISO-8859-1 can cause "replacement characters" (such as � or ?) to appear in your text wherever unsupported characters are found, and there may be other issues as well.
Transcoding from ISO-8859-1 to UTF-8 does not have this problem.
I would suggest transcoding the text like this:
byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");
or
byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");
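The two directions above can be sketched as small helpers. The class and method names are hypothetical; the sketch uses the StandardCharsets constants instead of charset names so no checked exception is involved:

```java
import java.nio.charset.StandardCharsets;

class Transcode {
    // UTF-8 bytes -> ISO-8859-1 bytes. Characters that ISO-8859-1 cannot
    // represent are replaced by '?' (the encoder's default replacement).
    static byte[] utf8ToLatin1(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    // ISO-8859-1 bytes -> UTF-8 bytes. Every ISO-8859-1 character exists
    // in Unicode, so this direction is lossless.
    static byte[] latin1ToUtf8(byte[] latin1) {
        return new String(latin1, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = "Belgi\u00eb".getBytes(StandardCharsets.UTF_8); // "België"
        byte[] latin1 = utf8ToLatin1(utf8);
        System.out.println(utf8.length + " UTF-8 bytes, " + latin1.length + " ISO-8859-1 bytes");
    }
}
```

Note that the transcoding only matters at the byte level. Once the data is a Java String again, it is UTF-16 internally regardless of where the bytes came from.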

How to convert char in UTF-16 encoding to char in CP855 encoding?

For example, I have char a = 'п' and I want to get its Cyrillic "ASCII" value (its byte value in CP855). Is that possible?
To convert text from one encoding to another, you can use the java.nio.charset.Charset class:
byte[] data = // your encoded data in UTF-16
Charset from = StandardCharsets.UTF_16;
Charset to = // cyrillic charset
byte[] converted = to.encode(from.decode(ByteBuffer.wrap(data))).array();
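For a single char, a shorter route is possible, since a Java char is already UTF-16. The sketch below assumes the runtime ships the code page under the name "IBM855", which is not one of the charsets every Java platform is required to support, so it checks availability first. The class and method names are hypothetical:

```java
import java.nio.charset.Charset;

class Cp855Example {
    // Returns the unsigned byte value of c in the given single-byte charset.
    static int codeOf(char c, Charset target) {
        byte[] encoded = String.valueOf(c).getBytes(target);
        return encoded[0] & 0xFF; // mask so values >= 0x80 are not negative
    }

    public static void main(String[] args) {
        // Assumption: "IBM855" is the charset name for CP855 in this runtime.
        if (Charset.isSupported("IBM855")) {
            Charset cp855 = Charset.forName("IBM855");
            System.out.printf("0x%02X%n", codeOf('\u043f', cp855)); // 'п'
        } else {
            System.out.println("IBM855 is not available in this runtime");
        }
    }
}
```

A character that the target charset cannot represent is encoded as '?', so a result of 0x3F for anything other than a question mark suggests the character is missing from that code page.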

Byte array is a valid UTF8 encoded String in Java but not in Python

When I run the following in Python 2.7.6, I get an exception:
import base64
some_bytes = b"\x80\x02\x03"
print ("base 64 of the bytes:")
print (base64.b64encode(some_bytes))
try:
    print (some_bytes.decode("utf-8"))
except Exception as e:
    print(e)
The output:
base 64 of the bytes:
gAID
'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
So in Python 2.7.6 the bytes represented as gAID are not a valid UTF8.
When I try it in Java 8 (HotSpot 1.8.0_74), using this code:
java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
byte[] bytes = decoder.decode("gAID");
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
I don't get any exception.
How so? Why is the same byte array valid in Java and invalid in Python when decoding as UTF-8?
This is because the String constructor in Java simply doesn't throw exceptions on invalid input. See the documentation:
public String(byte[] bytes, Charset charset)
... This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
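As a minimal sketch of the CharsetDecoder route mentioned in the documentation, the following makes Java report malformed input instead of silently replacing it, which is closer to what Python does. The class and method names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Decodes strictly: malformed or unmappable input raises an exception
    // instead of being replaced with U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper that turns the exception into a boolean.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x80, 0x02, 0x03}; // the bytes behind "gAID"
        System.out.println(isValidUtf8(bytes));
    }
}
```

The String constructor remains the convenient choice when lossy decoding is acceptable; the explicit decoder is only needed when invalid input must be detected.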
It's not valid UTF-8. https://en.wikipedia.org/wiki/UTF-8
Bytes between 0x80 and 0xBF cannot be the first byte of a character. They can only be continuation bytes, that is, the second or a later byte of a multi-byte character.
Java replaces bytes which cannot be decoded with the replacement character (U+FFFD, often displayed as ?) rather than throwing an exception.
So in Python 2.7.6 the bytes represented as gAID are not a valid UTF8.
This is wrong as you try to decode the Base64 encoded bytes.
import base64
some_bytes = b"\x80\x02\x03"
print ("base 64 of the bytes:")
print (base64.b64encode(some_bytes))
# store the Base64-encoded bytes
some_bytes = base64.b64encode(some_bytes)
decoded_bytes = [hex(ord(c)) for c in some_bytes]
print ("decoded bytes: ")
print (decoded_bytes)
try:
    print (some_bytes.decode("utf-8"))
except Exception as e:
    print(e)
output
gAID
['0x67', '0x41', '0x49', '0x44']
gAID
In Java you create a String from the Base64-decoded bytes using the UTF-8 charset, which results (as already answered) in the default replacement character �.
Running the following snippet
java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
byte[] bytes = decoder.decode("gAID");
System.out.println("base 64 of the bytes:");
for (byte b : bytes) {
    System.out.printf("x%02x ", b);
}
System.out.println();
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
System.out.println(s);
produces the following output
base 64 of the bytes:
x80 x02 x03
?
There you can see the same bytes you are using in the Python snippet. In Python they lead to 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte; in Java they lead to a ? (which stands for the default replacement character on a non-Unicode console).
The following snippet uses the bytes of the text gAID to construct a String with the UTF-8 character set.
byte[] bytes = "gAID".getBytes(StandardCharsets.ISO_8859_1);
for (byte b : bytes) {
    System.out.printf("x%02x ", b);
}
System.out.println();
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
System.out.println(s);
output
x67 x41 x49 x44
gAID

What is the target charset when I encrypt a string with Cipher and Base64.Encoder?

I'm trying to accommodate encrypted tokens in a DSL I'm designing (i.e. I need a char to use as a delimiter). The encoder.encodeToString(...) docs say it uses the ISO-8859-1 charset. But when I encrypt a sample of texts, it looks like the output doesn't use all of ISO-8859-1: I see upper- and lower-case letters and some symbols, but not certain punctuation or accented chars. What am I missing about this encodeToString() call, and what is the final set of possible chars?
//import java.util.Base64;
//import javax.crypto.Cipher;
//import javax.crypto.SecretKey;
static Cipher cipher;
public static String decrypt(String encryptedText, SecretKey secretKey) throws Exception {
    Base64.Decoder decoder = Base64.getDecoder();
    byte[] encryptedTextByte = decoder.decode(encryptedText);
    cipher.init(Cipher.DECRYPT_MODE, secretKey);
    byte[] decryptedByte = cipher.doFinal(encryptedTextByte);
    String decryptedText = new String(decryptedByte);
    return decryptedText;
}
String has a constructor that takes a charset; otherwise the platform default charset is used.
new String(decryptedByte, StandardCharsets.ISO_8859_1);
As Latin-1 (ISO-8859-1) is often mixed up with Windows Latin-1 (Windows-1252), you might try "Windows-1252" too.
Base64 is named just so because it uses 64 characters from the ASCII table. The encoding used should not matter as long as it is compatible with ASCII.
If you want to use more than 64 characters you will have to use a different encoding.
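To make the character domain concrete, here is a small sketch that encodes every possible byte value and checks which characters appear. It assumes the basic encoder from Base64.getEncoder(); the URL-safe and MIME encoders use a slightly different output (for example '-' and '_', or added line breaks). The class and method names are hypothetical:

```java
import java.util.Base64;

class Base64Alphabet {
    // The 64 symbols of the basic alphabet plus the '=' padding character.
    static final String ALLOWED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";

    // True if every char of the text belongs to the basic Base64 alphabet.
    static boolean usesOnlyBase64Chars(String encoded) {
        for (int i = 0; i < encoded.length(); i++) {
            if (ALLOWED.indexOf(encoded.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    // Encodes the 256 possible byte values 0x00..0xFF in one go.
    static String encodeAllByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return Base64.getEncoder().encodeToString(all);
    }

    public static void main(String[] args) {
        String encoded = encodeAllByteValues();
        System.out.println(usesOnlyBase64Chars(encoded));
        System.out.println(encoded.indexOf('|') < 0);
    }
}
```

Any character outside that set, such as '|', ':' or ',', should therefore be usable as a delimiter between tokens produced by the basic encoder.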
Encryption returns bytes, no matter what the encryption is.
I personally use the following function to convert decrypted bytes to a string:
public static String getStringFromBytes(byte[] data) {
    StringBuilder sb = new StringBuilder();
    if (data != null && data.length > 0) {
        for (byte b : data) {
            // Mask to 0..255 so bytes >= 0x80 are not sign-extended
            sb.append((char) (b & 0xFF));
        }
    }
    return sb.toString();
}
Read the documentation carefully:
This method first encodes all input bytes into a base64 encoded byte
array and then constructs a new String by using the encoded byte array
and the ISO-8859-1 charset.
This means two things:
The encoding scheme used is Base64. That is why you are not getting the output you expected: ISO-8859-1 is not used as the encoding scheme.
ISO-8859-1 is only used as the charset for constructing the String from the encoded bytes.
The Base64 class gives you encoders and decoders for the Base64 encoding scheme, so if you use them, the output will be encoded or decoded using Base64.
This class consists exclusively of static methods for obtaining
encoders and decoders for the Base64 encoding scheme.
So what you need is to convert between your String and the byte array using an explicit charset, such as UTF-8 or ISO-8859-1.
I would personally recommend UTF-8 because it is widely used and can encode every Unicode character: ASCII takes 1 byte, code points up to U+07FF (which covers most accented Latin letters) take 2 bytes, the rest of the BMP takes 3 bytes, and supplementary characters take 4 bytes. So it is compact for ASCII-heavy text while still covering all of Unicode.
There are many ways to encode your String using the desired encoding scheme; an example is below:
byte[] byteArr = new byte[3];
String decodeText = new String(byteArr);
Charset.forName("UTF-8").encode(decodeText);
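Putting the pieces together, a round trip could look like the sketch below. The question does not say which cipher is used, so AES/GCM with a 12-byte IV is an assumption made purely for illustration, as are the class and method names. The point is only that the text is converted to and from bytes with an explicit charset, while Base64 handles the ciphertext:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

class TokenRoundTrip {
    // Assumption: AES/GCM stands in for whatever cipher the DSL really uses.
    private static final String TRANSFORMATION = "AES/GCM/NoPadding";

    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // text -> UTF-8 bytes -> ciphertext bytes -> Base64 text
    static String encrypt(String plainText, SecretKey key, byte[] iv) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] cipherBytes = cipher.doFinal(plainText.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(cipherBytes);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Base64 text -> ciphertext bytes -> UTF-8 bytes -> text
    static String decrypt(String token, SecretKey key, byte[] iv) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] plainBytes = cipher.doFinal(Base64.getDecoder().decode(token));
            return new String(plainBytes, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv); // a fresh IV per message
        String token = encrypt("dgse\u00f9eT41\u00f9", key, iv);
        System.out.println(decrypt(token, key, iv).equals("dgse\u00f9eT41\u00f9"));
    }
}
```

The charset used in encrypt and decrypt has to be the same one; which one is chosen matters less than keeping the two sides consistent.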
