Convert Unicode to UTF-8 byte[] and save into string (Java) - java

I need to convert the character unicode to a byte[] representation and save into Srting, for example
U+1F601 -> \xF0\x9F\x98\x81
I dont have idea how can i do it this..
Anyone has idea?Thanks

int[] codepoints = { 0x1F601 }; // U+1F601
String s = new String(codepoints, 0, codepoints.length);
byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // As UTF-8 (Unicode) bytes
System.out.println(Arrays.toString(bytes));
So one first coposes the Unicode code points into a java String. Java Strings hold Unicode.
When one wants bytes, say in UTF-8 - a Unicode representation -, then one has to indicate the CharSet in which the bytes will be.

Related

How does encoding/decoding bytes work in Java?

Little background: I'm doing cryptopals challenges and I finished https://cryptopals.com/sets/1/challenges/1 but realized I didn't learn what I guess is meant to be learned (or coded).
I'm using the Apache Commons Codec library for Hex and Base64 encoding/decoding. The goal is to decode the hex string and re-encode it to Base64. The "hint" at the bottom of the page says "Always operate on raw bytes, never on encoded strings. Only use hex and base64 for pretty-printing."
Here's my answer...
private static Hex forHex = new Hex();
private static Base64 forBase64 = new Base64();
public static byte[] hexDecode(String hex) throws DecoderException {
byte[] rawBytes = forHex.decode(hex.getBytes());
return rawBytes;
}
public static byte[] encodeB64(byte[] bytes) {
byte[] base64Bytes = forBase64.encode(bytes);
return base64Bytes;
}
public static void main(String[] args) throws DecoderException {
String hex = "49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d";
//decode hex String to byte[]
byte[] myHexDecoded = hexDecode(hex);
String myHexDecodedString = new String(myHexDecoded);
//Lyrics from Queen's "Under Pressure"
System.out.println(myHexDecodedString);
//encode myHexDecoded to Base64 encoded byte[]
byte[] myHexEncoded = encodeB64(myHexDecoded);
String myB64String = new String(myHexEncoded);
//"pretty printing" of base64
System.out.println(myB64String);
}
...but I feel like I cheated. I didn't learn how to decode bytes that were encoded as hex, and I didn't learn how to encode "pure" bytes to Base64, I just learned how to use a library to do something for me.
If I were to take a String in Java then get its bytes, how would I encode those bytes into hex? For example, the following code snip turns "Hello" (which is readable English) to the byte value of each character:
String s = "Hello";
char[] sChar = s.toCharArray();
byte[] sByte = new byte[sChar.length]
for(int i = 0; i < sChar.length; i++) {
sByte[i] = (byte) sChar[i];
System.out.println("sByte[" + i + "] = " +sByte[i]);
}
which yields sByte[0] = 72, sByte[1] = 101, sByte[2] = 108, sByte[3] = 108, sByte[4] = 111
Lets use 'o' as an example - I am guessing its decimal version is 111 - do I just take its decimal version and change that to its hex version?
If so, to decode, do I just take the the characters in the hex String 2 at a time, decompose them to decimal values, then convert to ASCII? Will it always be ASCII?
to decode, do I just take the the characters in the hex String 2 at a time, decompose them to decimal values, then convert to ASCII? Will it always be ASCII?
No. You take the characters 2 at a time, transform the character '0' to the numeric value 0, the character '1' to the numeric value 1, ..., the character 'a' (or 'A', depending on which encoding you want to support) to the numeric value 10, ..., the character 'f' or 'F' to the numeric value 15.
Then you multiply the first numeric value by 16, and you add it to the second numeric value to get the unsigned integer value of your byte. Then you transform that unsigned integer value to a signed byte.
ASCII has nothing to do with this algorithm.
To see how it's done in practice, since commons-codec is open-source, you can just look at its implementation.

Java NIO server receives random string [duplicate]

I'm writing a web application in Google app Engine. It allows people to basically edit html code that gets stored as an .html file in the blobstore.
I'm using fetchData to return a byte[] of all the characters in the file. I'm trying to print to an html in order for the user to edit the html code. Everything works great!
Here's my only problem now:
The byte array is having some issues when converting back to a string. Smart quotes and a couple of characters are coming out looking funky. (?'s or japanese symbols etc.) Specifically it's several bytes I'm seeing that have negative values which are causing the problem.
The smart quotes are coming back as -108 and -109 in the byte array. Why is this and how can I decode the negative bytes to show the correct character encoding?
The byte array contains characters in a special encoding (that you should know). The way to convert it to a String is:
String decoded = new String(bytes, "UTF-8"); // example for one encoding type
By The Way - the raw bytes appear may appear as negative decimals just because the java datatype byte is signed, it covers the range from -128 to 127.
-109 = 0x93: Control Code "Set Transmit State"
The value (-109) is a non-printable control character in UNICODE. So UTF-8 is not the correct encoding for that character stream.
0x93 in "Windows-1252" is the "smart quote" that you're looking for, so the Java name of that encoding is "Cp1252". The next line provides a test code:
System.out.println(new String(new byte[]{-109}, "Cp1252"));
Java 7 and above
You can also pass your desired encoding to the String constructor as a Charset constant from StandardCharsets. This may be safer than passing the encoding as a String, as suggested in the other answers.
For example, for UTF-8 encoding
String bytesAsString = new String(bytes, StandardCharsets.UTF_8);
You can try this.
String s = new String(bytearray);
public class Main {
/**
* Example method for converting a byte to a String.
*/
public void convertByteToString() {
byte b = 65;
//Using the static toString method of the Byte class
System.out.println(Byte.toString(b));
//Using simple concatenation with an empty String
System.out.println(b + "");
//Creating a byte array and passing it to the String constructor
System.out.println(new String(new byte[] {b}));
}
/**
* #param args the command line arguments
*/
public static void main(String[] args) {
new Main().convertByteToString();
}
}
Output
65
65
A
public static String readFile(String fn) throws IOException
{
File f = new File(fn);
byte[] buffer = new byte[(int)f.length()];
FileInputStream is = new FileInputStream(fn);
is.read(buffer);
is.close();
return new String(buffer, "UTF-8"); // use desired encoding
}
I suggest Arrays.toString(byte_array);
It depends on your purpose. For example, I wanted to save a byte array exactly like the format you can see at time of debug that is something like this : [1, 2, 3] If you want to save exactly same value without converting the bytes to character format, Arrays.toString (byte_array) does this,. But if you want to save characters instead of bytes, you should use String s = new String(byte_array). In this case, s is equal to equivalent of [1, 2, 3] in format of character.
The previous answer from Andreas_D is good. I'm just going to add that wherever you are displaying the output there will be a font and a character encoding and it may not support some characters.
To work out whether it is Java or your display that is a problem, do this:
for(int i=0;i<str.length();i++) {
char ch = str.charAt(i);
System.out.println(i+" : "+ch+" "+Integer.toHexString(ch)+((ch=='\ufffd') ? " Unknown character" : ""));
}
Java will have mapped any characters it cannot understand to 0xfffd the official character for unknown characters. If you see a '?' in the output, but it is not mapped to 0xfffd, it is your display font or encoding that is the problem, not Java.

How to convert the byte offset from sqlite FTS query into a character offset in java

I have a problem where I am searching my FTS tables in android and I get returned a byte offset for the result :
col termno byteoffset size
1 0 111 4
However problem is, when using cursor.getString(colNo) it gives me a UTF-16 string after which I am unable to tally up which character of the text is the start/end of the match.
Its a similar problem to : Detect character position in an UTF NSString from a byte offset(was SQLite offsets() and encoding problem)
However I cannot fathom the solution in the question. So how can I accurately know the character offsets in my string (for highlight) after I know the byte offsets?
Encode your string back to the same encoding that Sqlite was using, then extract the pieces you want in byte form and convert them back to strings:
String chars = cursor.getString(colNo);
byte[] bytes = chars.getBytes("UTF-8");
String prefix = new String(bytes, 0, byteOffset, "UTF-8");
String match = new String(bytes, byteOffset, size, "UTF-8");
int charOffset = prefix.length;
int charSize = match.length;
(Assuming that your data is encoded as UTF-8 bytes, which is probable.)
It is unfortunate that you have to do all this redundant encoding and decoding. It might perhaps be worth adding optimisations to short-cut the pure-ASCII common case.

Decode bytes to chars one at a time

I have an arbitrary chunk of bytes that represent chars, encoded in an arbitrary scheme (may be ASCII, UTF-8, UTF-16). I know the encoding.
What I'm trying to do is find the location of the last new line (\n) in the array of bytes. I want to know how many bytes are left over after reading the last encoded \n.
I can't find anything in the JDK or any other library that will let me convert a byte array to chars one by one. InputStreamReader reads the stream in chunks, not giving me any indication how many bytes are getting read to produce a char.
Am I going to have to do something as horrible are re-encoding each char to figure out its byte length?
You can try something like this
CharsetDecoder cd = Charset.forName("UTF-8").newDecoder();
ByteBuffer in = ByteBuffer.wrap(bytes);
CharBuffer out = CharBuffer.allocate(1);
int p = 0;
while (in.hasRemaining()) {
cd.decode(in, out, true);
char c = out.array()[0];
int nBytes = in.position() - p;
p = in.position();
out.position(0);
}

Convert byte array to string with equivalent number of bytes

Is it possible to convert a byte array to a string but where the length of the string is exactly the same length as the number of bytes in the array? If I use the following:
byte[] data; // Fill it with data
data.toString();
The length of the string is different than the length of the array. I believe that this is because Java and/or Android takes some kind of default encoding into account. The values in the array can be negative as well. Theoretically it should be possible to convert any byte to some character. I guess I need to figure out how to specify an encoding that generates a fixed single byte width for each character.
EDIT:
I tried the following but it didn't work:
byte[] textArray; // Fill this with some text.
String textString = new String(textArray, "ASCII");
textArray = textString.getBytes("ASCII"); // textArray ends up with different data.
You can use the String constructor String(byte[] data) to create a string from the byte array. If you want to specify the charset as well, you can use String(byte[] data, Charset charset) constructor.
Try your code sample with US-ASCII or ISO-8859-1 in place of ASCII. ASCII is not a built-in Character encoding for Java or Android, but one of those two are. They are guaranteed single-byte encodings, with a caveat that characters not in the character set will be silently truncated.
This should work fine!
public static byte[] stringToByteArray(String pStringValue){
int length= pStringValue.length();
byte[] bytes = new byte[length];
for(int index=0; index<length; index++){
char ch= pStringValue.charAt(index);
bytes[index]= (byte)ch;
}
return bytes;
}
since JDK 1.6:
You can also use:
stringValue.getBytes() which will return you a byte array.
In case of passing a NULL string, you need to handle that by either throwing the nullPointerException or handling it inside the method itself.

Categories