I'm developing a JPEG decoder (I'm in the Huffman phase) and I want to write binary strings into a file.
For example, let's say we have this:
String huff = "00010010100010101000100100";
I've tried to convert it to integers by splitting it into 8-character chunks and saving each chunk's integer representation, since I can't write individual bits:
huff.split("(?<=\\G.{8})");
int val = Integer.parseInt(str, 2);
out.write(val); // writes to a FileOutputStream
The problem is that, in my example, if I try to save "00010010", it is converted to 18 (binary 10010), and I need the leading 0's.
And finally, when I read:
int enter;
String code = "";
while ((enter = in.read()) != -1) {
    code += Integer.toBinaryString(enter);
}
I got:
Code = 10010
instead of:
Code = 00010010
I've also tried converting it to a BitSet and then to a byte[], but I have the same problem.
Your example is that you have the string "10010" and you want the string "00010010". That is, you need to left-pad this string with zeroes. Note that since you're joining the results of many calls to Integer.toBinaryString in a loop, you need to left-pad these strings inside the loop, before concatenating them.
while ((enter = in.read()) != -1) {
String binary = Integer.toBinaryString(enter);
// left-pad to length 8
binary = ("00000000" + binary).substring(binary.length());
code += binary;
}
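For the writing side, here's a minimal sketch (my own illustration, not from the question) that right-pads the final chunk with zeros to a full byte; a real Huffman writer would also have to record the original bit length somewhere so the padding can be discarded when reading back:

import java.io.FileOutputStream;
import java.io.IOException;

public class BitStringWriter {
    public static void main(String[] args) throws IOException {
        String huff = "00010010100010101000100100";
        // Right-pad with zeros so the last chunk is a full byte.
        int pad = (8 - huff.length() % 8) % 8;
        StringBuilder padded = new StringBuilder(huff);
        for (int i = 0; i < pad; i++) {
            padded.append('0');
        }
        try (FileOutputStream out = new FileOutputStream("huffman.bin")) {
            for (String chunk : padded.toString().split("(?<=\\G.{8})")) {
                out.write(Integer.parseInt(chunk, 2)); // one byte per 8-bit chunk
            }
        }
    }
}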
You might want to look at the UTF-8 algorithm, since it does much of what you want: it encodes large code point values into whole bytes, dropping unneeded leading zero bits while keeping the data unambiguously decodable, so it takes up less disk space.
Works with: Java version 7+
import java.nio.charset.StandardCharsets;
import java.util.Formatter;
public class UTF8EncodeDecode {
public static byte[] utf8encode(int codepoint) {
return new String(new int[]{codepoint}, 0, 1).getBytes(StandardCharsets.UTF_8);
}
public static int utf8decode(byte[] bytes) {
return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
}
public static void main(String[] args) {
System.out.printf("%-7s %-43s %7s\t%s\t%7s%n",
"Char", "Name", "Unicode", "UTF-8 encoded", "Decoded");
for (int codepoint : new int[]{0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}) {
byte[] encoded = utf8encode(codepoint);
Formatter formatter = new Formatter();
for (byte b : encoded) {
formatter.format("%02X ", b);
}
String encodedHex = formatter.toString();
int decoded = utf8decode(encoded);
System.out.printf("%-7c %-43s U+%04X\t%-12s\tU+%04X%n",
codepoint, Character.getName(codepoint), codepoint, encodedHex, decoded);
}
}
}
https://rosettacode.org/wiki/UTF-8_encode_and_decode#Java
UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode Standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" (slash) in filenames, "\" (backslash) in escape sequences, and "%" in printf.
https://en.wikipedia.org/wiki/UTF-8
Binary 11110000 10010000 10001101 10001000 is the UTF-8 byte sequence F0 90 8D 88. Since you are storing the bits as text, you go from having to store 32 characters to storing 8 hex digits (4 bytes). And because it's a well known and well designed encoding, you can reverse it easily. All the math is done for you.
Your example of 00010010100010101000100100 (or rather, left-padded to whole bytes, 00000000 01001010 00101010 00100100) converts to *$ (two unprintable characters on my machine). That's the UTF-8 encoding of the binary. I had mistakenly used a different site that was treating the data I put in as decimal instead of binary.
https://onlineutf8tools.com/convert-binary-to-utf8
For a really good explanation of UTF-8 and how it can apply to the answer:
https://hackaday.com/2013/09/27/utf-8-the-most-elegant-hack/
Edit:
I took this question as asking for a way to reduce the number of characters needed to store values, which is a form of encoding. UTF-8 is a type of encoding. Used in a "non-standard" way, the OP can use UTF-8 to encode their strings of 0's and 1's in a much shorter format. That's how this answer is relevant.
If you concatenate the characters, you can go from 4x 8 bits (32 bits) to 8x 8 bits (64 bits) just as easily, and encode a value as large as 9,223,372,036,854,775,807.
Related
I'm writing a web application in Google App Engine. It allows people to basically edit HTML code that gets stored as an .html file in the blobstore.
I'm using fetchData to return a byte[] of all the characters in the file. I'm trying to print it into an HTML page so the user can edit the HTML code. Everything works great!
Here's my only problem now:
The byte array is having some issues when converting back to a string. Smart quotes and a couple of other characters come out looking funky (?'s or Japanese symbols, etc.). Specifically, several bytes I'm seeing have negative values, which are causing the problem.
The smart quotes are coming back as -108 and -109 in the byte array. Why is this, and how can I decode the negative bytes to show the correct character encoding?
The byte array contains characters in a specific encoding (which you should know). The way to convert it to a String is:
String decoded = new String(bytes, "UTF-8"); // example for one encoding type
By the way, the raw bytes may appear as negative decimals just because the Java datatype byte is signed; it covers the range from -128 to 127.
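As a small illustration, masking with 0xFF recovers the unsigned value of a signed byte (the sample value is the -109 from this question):

byte b = (byte) 0x93; // prints as -109 because Java's byte is signed
int unsigned = b & 0xFF; // 147, i.e. 0x93
System.out.println(b + " -> " + unsigned + " -> 0x" + Integer.toHexString(unsigned));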
-109 = 0x93: Control Code "Set Transmit State"
The value (-109) is a non-printable control character in Unicode, so UTF-8 is not the correct encoding for that character stream.
0x93 in Windows-1252 is the "smart quote" that you're looking for, so the Java name of that encoding is "Cp1252". The next line provides test code:
System.out.println(new String(new byte[]{-109}, "Cp1252"));
Java 7 and above
You can also pass your desired encoding to the String constructor as a Charset constant from StandardCharsets. This may be safer than passing the encoding as a String, as suggested in the other answers.
For example, for UTF-8 encoding
String bytesAsString = new String(bytes, StandardCharsets.UTF_8);
You can try this.
String s = new String(bytearray);
public class Main {
/**
* Example method for converting a byte to a String.
*/
public void convertByteToString() {
byte b = 65;
//Using the static toString method of the Byte class
System.out.println(Byte.toString(b));
//Using simple concatenation with an empty String
System.out.println(b + "");
//Creating a byte array and passing it to the String constructor
System.out.println(new String(new byte[] {b}));
}
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
new Main().convertByteToString();
}
}
Output
65
65
A
public static String readFile(String fn) throws IOException
{
    File f = new File(fn);
    byte[] buffer = new byte[(int) f.length()];
    try (DataInputStream is = new DataInputStream(new FileInputStream(f))) {
        is.readFully(buffer); // a single read() is not guaranteed to fill the buffer
    }
    return new String(buffer, "UTF-8"); // use desired encoding
}
I suggest Arrays.toString(byte_array);
It depends on your purpose. For example, I wanted to save a byte array in exactly the format you see at debug time, something like [1, 2, 3]. If you want to save exactly that value without converting the bytes to character format, Arrays.toString(byte_array) does it. But if you want to save characters instead of bytes, you should use String s = new String(byte_array); in that case, s holds the character equivalents of [1, 2, 3].
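A quick sketch of the difference (sample values are mine):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ToStringDemo {
    public static void main(String[] args) {
        byte[] bytes = {72, 105}; // ASCII codes of 'H' and 'i'
        System.out.println(Arrays.toString(bytes)); // prints [72, 105]
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // prints Hi
    }
}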
The previous answer from Andreas_D is good. I'll just add that wherever you display the output there will be a font and a character encoding involved, and the font may not support some characters.
To work out whether Java or your display is the problem, do this:
for (int i = 0; i < str.length(); i++) {
    char ch = str.charAt(i);
    System.out.println(i + " : " + ch + " " + Integer.toHexString(ch)
            + ((ch == '\ufffd') ? " Unknown character" : ""));
}
Java will have mapped any characters it cannot understand to 0xfffd, the official replacement character for unknown characters. If you see a '?' in the output but it is not mapped to 0xfffd, then your display font or encoding is the problem, not Java.
If I am converting a UTF-8 char to byte, will there ever be a difference in the result of these 3 implementations based on locale, environment, etc.?
byte a = "1".getBytes()[0];
byte b = "1".getBytes(Charset.forName("UTF-8"))[0];
byte c = '1';
Your first line is dependent on the environment, because it will encode the string using the default character encoding of your system, which may or may not be UTF-8.
Your second line will always produce the same result, no matter what the locale or the default character encoding of your system is. It will always use UTF-8 to encode the string.
Note that UTF-8 is a variable-length character encoding. Only the first 128 characters are encoded in one byte; all other characters take up between 2 and 4 bytes.
Your third line assigns a char literal to a byte, a narrowing conversion that keeps the low byte of the character's UTF-16 code unit (Java char stores characters as UTF-16 code units). Since UTF-8 encodes the ASCII range with the same single-byte values, the result is the same as the second line here, but this does not hold for characters in general.
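To make that concrete, here is a small sketch (my own example) using a non-ASCII character, where the approaches stop agreeing:

import java.nio.charset.StandardCharsets;

public class EncodingDiff {
    public static void main(String[] args) {
        // 'é' is U+00E9; UTF-8 encodes it as two bytes, 0xC3 0xA9.
        byte utf8First = "é".getBytes(StandardCharsets.UTF_8)[0]; // 0xC3 = -61
        byte narrowed = (byte) 'é'; // low byte of the UTF-16 code unit: 0xE9 = -23
        System.out.println(utf8First + " vs " + narrowed); // -61 vs -23
    }
}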
In principle the question is already answered, but I cannot resist posting a little scribble, for those who like to play around with code:
import java.nio.charset.Charset;
public class EncodingTest {
private static void checkCharacterConversion(String c) {
byte asUtf8 = c.getBytes(Charset.forName("UTF-8"))[0];
byte asDefaultEncoding = c.getBytes()[0];
byte directConversion = (byte)c.charAt(0);
if (asUtf8 != asDefaultEncoding) {
System.out.println(String.format(
"First char of %s has different result in UTF-8 %d and default encoding %d",
c, asUtf8, asDefaultEncoding));
}
if (asUtf8 != directConversion) {
System.out.println(String.format(
"First char of %s has different result in UTF-8 %d and direct as byte %d",
c, asUtf8, directConversion));
}
}
public static void main(String[] argv) {
// btw: first time I ever wrote a for loop with a char - feels weird to me
for (char c = '\0'; c <= '\u007f'; c++) {
String cc = new String(new char[] {c});
checkCharacterConversion(cc);
}
}
}
If you run this e.g. with:
java -Dfile.encoding="UTF-16LE" EncodingTest
you will get no output.
But of course every single byte (ok, except for the first) will be wrong if you try:
java -Dfile.encoding="UTF-16BE" EncodingTest
because in "big endian" the first byte is always zero for ascii chars.
That is because in UTF-16 an ascii character '\u00xy is represented by two bytes, in UTF16-LE as [xy, 0] and in UTF16-BE as [0, xy]
However only the first statement produces any output, so b and c are indeed the same for the first 127 ascii characters - because in UTF-8 they are encoded by a single byte. This will not be true for any further characters, however; they all have multi-byte representations in UTF-8.
When reading from a file using readChar() in the RandomAccessFile class, unexpected output comes out.
Instead of the desired character, ? is displayed.
package tesr;
import java.io.RandomAccessFile;
import java.io.IOException;
public class Test {
public static void main(String[] args) {
try{
RandomAccessFile f=new RandomAccessFile("c:\\ankit\\1.txt","rw");
f.seek(0);
System.out.println(f.readChar());
}
catch(IOException e){
System.out.println("dkndknf");
}
// TODO Auto-generated method stub
}
}
You probably intended readByte. A Java char is UTF-16BE, a 2-byte Unicode representation, and two bytes of random binary data very often do not form a representable character: they are not valid UTF-16BE, or are only half of a "surrogate" pair (two chars that combine into one Unicode code point). Java represents such a failed conversion as a question mark in your case.
If you know what encoding the file is in, then for a single-byte encoding it is simple:
byte b = in.readByte();
byte[] bs = new byte[] { b };
String s = new String(bs, "Cp1252"); // Some single byte encoding
For variable-width UTF-8 it is also simple to identify the role of each byte (see the sketch after this list):
a single byte when the high bit is 0
a continuation byte when the high bits are 10
otherwise a starting byte (with some special cases) telling the number of bytes in the sequence by its high bits.
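A minimal sketch of that classification (the helper name is mine):

// Classify one byte of a UTF-8 stream by its high bits.
static String classifyUtf8Byte(byte b) {
    int u = b & 0xFF; // unsigned value
    if (u < 0x80) return "single byte (ASCII)"; // 0xxxxxxx
    if (u < 0xC0) return "continuation byte"; // 10xxxxxx
    if (u < 0xE0) return "start of a 2-byte sequence"; // 110xxxxx
    if (u < 0xF0) return "start of a 3-byte sequence"; // 1110xxxx
    return "start of a 4-byte sequence"; // 11110xxx (some values here are invalid)
}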
For UTF-16LE and UTF-16BE the file position must be a multiple of 2, and you read 2 bytes at a time:
byte[] bs = new byte[2];
in.readFully(bs); // read() alone is not guaranteed to fill the array
String s = new String(bs, StandardCharsets.UTF_16LE);
You almost certainly have a character encoding problem. It is not possible to simply read characters from a file. What actually happens is that an appropriate sequence of bytes is read, and those bytes are then interpreted according to a character encoding scheme to translate them into characters. When you want to read a file as text, Java must be told, perhaps implicitly, which character encoding to use.
If you tell Java the wrong encoding you will get gibberish. If you pick an arbitrary point in a file and start reading, and that location is not the start of the encoding of a character, you will get gibberish. One or both of those has happened in your case.
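A tiny illustration of the first failure mode (the example text is mine):

import java.nio.charset.StandardCharsets;

public class WrongEncodingDemo {
    public static void main(String[] args) {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        // Decoding UTF-8 bytes as ISO-8859-1 turns the two-byte 'é' into mojibake.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // hÃ©llo
    }
}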
The problem I am facing occurs when I try to cast some ASCII values to char.
For example:
(char)145 //returns ?
(char)129 //also returns ?
but it is supposed to return a different character. It happens to many other values as well.
I hope I have been clear enough.
ASCII is a 7-bit encoding system; some programs even use this to detect whether a file is binary or textual. Characters below 32 are control characters, used as directives (for instance newline, or printer commands).
The program will still work, however. A char is simply stored as a 16-bit unsigned value, but the values in question have no visible interpretation, so the textual output of both values shows nothing useful. On the other hand, comparisons like (char) 145 == (char) 129 still work (and return false), because to the processor there is no difference between a short and a char.
If you want to convert your value so that only the lowest seven bits count (thus moving it into the valid ASCII range), you can use masking:
int value = 145;
value &= 0x7f;
char c = (char) value;
The char type is Unicode, a 16-bit UTF-16 code unit. So you could do (char) 265 for c-with-circumflex (ĉ). ASCII is 7 bits, 0 to 127.
String s = "" + ((char)145) + ((char)129);
The above is a string of two Unicode characters (each one char, two bytes in UTF-16).
byte[] bytes = s.getBytes(StandardCharsets.US_ASCII); // ASCII is 7-bit; both chars become '?'
s = new String(bytes, StandardCharsets.US_ASCII); // "??"
byte[] bytes2 = s.getBytes(StandardCharsets.ISO_8859_1); // ISO-8859-1, Latin-1
byte[] bytes3 = s.getBytes("Windows-1252"); // Windows Latin-1
byte[] bytes4 = s.getBytes(StandardCharsets.UTF_8); // no information loss
s = new String(bytes4, StandardCharsets.UTF_8); // original string
In Java, String/char/Reader/Writer handle text (in Unicode), whereas byte[]/InputStream/OutputStream handle binary data, bytes.
Bytes must always be associated with an encoding to yield text.
Answer: as soon as there is a conversion from text to some encoding that cannot represent that char, a question mark may be written.
These expressions evaluate to true:
((char) 145) == '\u0091';
((char) 129) == '\u0081';
These UTF-16 values map to the Unicode code points U+0091 and U+0081:
0091;<control>;Cc;0;BN;;;;;N;PRIVATE USE ONE;;;;
0081;<control>;Cc;0;BN;;;;;N;;;;;
These are both control characters without visible graphemes (the question mark acts as a substitution character), and one of them ("Private Use One") has no designated purpose. Neither is in the ASCII set.
Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?
The array may be generated by code similar to:
byte[] utf8 = "Hello World".getBytes("UTF-8");
Alternatively it may have been generated by code similar to:
byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
messageContent[i] = (byte) i;
}
The key point is that we don't know what the array contains but need to find out in order to fill in the following function:
public final String getString(final byte[] dataToProcess) {
// Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
// If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
// If dataToProcess contains an encoded string then we will decode it and return.
}
How would this be extended to also cover UTF-16 or other encoding mechanisms?
It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is one kind of arbitrary binary data; but you can look for byte sequences that are invalid in UTF-8. If you find any, you know that it's not UTF-8.
If your array is large enough, this should work out well, since invalid sequences are very likely to appear in "random" binary data such as compressed data or image files.
However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
Here's a way to use the UTF-8 "binary" regex from the W3C site:
static boolean looksLikeUTF8(byte[] utf8) throws UnsupportedEncodingException
{
Pattern p = Pattern.compile("\\A(\n" +
" [\\x09\\x0A\\x0D\\x20-\\x7E] # ASCII\\n" +
"| [\\xC2-\\xDF][\\x80-\\xBF] # non-overlong 2-byte\n" +
"| \\xE0[\\xA0-\\xBF][\\x80-\\xBF] # excluding overlongs\n" +
"| [\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2} # straight 3-byte\n" +
"| \\xED[\\x80-\\x9F][\\x80-\\xBF] # excluding surrogates\n" +
"| \\xF0[\\x90-\\xBF][\\x80-\\xBF]{2} # planes 1-3\n" +
"| [\\xF1-\\xF3][\\x80-\\xBF]{3} # planes 4-15\n" +
"| \\xF4[\\x80-\\x8F][\\x80-\\xBF]{2} # plane 16\n" +
")*\\z", Pattern.COMMENTS);
String phonyString = new String(utf8, "ISO-8859-1");
return p.matcher(phonyString).matches();
}
As originally written, the regex is meant to be used on a byte array, but you can't do that with Java's regexes; the target has to be something that implements the CharSequence interface (so a char[] is out, too). By decoding the byte[] as ISO-8859-1, you create a String in which each char has the same unsigned numeric value as the corresponding byte in the original array.
As others have pointed out, tests like this can only tell you the byte[] could contain UTF-8 text, not that it does. But the regex is so exhaustive, it seems extremely unlikely that raw binary data could slip past it. Even an array of all zeroes wouldn't match, since the regex never matches NUL. If the only possibilities are UTF-8 and binary, I'd be willing to trust this test.
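As a quick usage sketch calling the method above (the sample inputs are mine; 0xC0 and 0xFF can never appear in well-formed UTF-8):

byte[] text = "Hëllo, world".getBytes(StandardCharsets.UTF_8);
byte[] binary = { (byte) 0xC0, (byte) 0x00, (byte) 0xFF }; // invalid UTF-8
System.out.println(looksLikeUTF8(text)); // true
System.out.println(looksLikeUTF8(binary)); // false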
And while you're at it, you could strip the UTF-8 BOM if there is one; otherwise, the UTF-8 CharsetDecoder will pass it through as if it were text.
UTF-16 would be much more difficult, because there are very few byte sequences that are always invalid. The only ones I can think of offhand are high-surrogate characters that are missing their low-surrogate companions, or vice versa. Beyond that, you would need some context to decide whether a given sequence is valid. You might have a Cyrillic letter followed by a Chinese ideogram followed by a smiley-face dingbat, but it would be perfectly valid UTF-16.
The question assumes that there is a fundamental difference between a string and binary data. While this is intuitively so, it is next to impossible to define precisely what that difference is.
A Java String is a sequence of 16 bit quantities that correspond to one of the (almost) 2**16 Unicode basic codepoints. But if you look at those 16 bit 'characters', each one could equally represent an integer, a pair of bytes, a pixel, and so on. The bit patterns don't have anything intrinsic about that says what they represent.
Now suppose that you rephrased your question as asking for a way to distinguish UTF-8 encoded TEXT from arbitrary binary data. Does this help? In theory no, because the bit patterns that encode any written text can also be a sequence of numbers. (It is hard to say what "arbitrary" really means here. Can you tell me how to test if a number is "arbitrary"?)
The best we can do here is the following:
Test if the bytes are a valid UTF-8 encoding.
Test if the decoded 16-bit quantities are all legal, "assigned" Unicode code points. (Some 16-bit quantities are illegal (e.g. 0xffff) and others are not currently assigned to any character.) But what if a text document really uses an unassigned code point?
Test if the Unicode code points belong to the "planes" that you expect based on the assumed language of the document. But what if you don't know what language to expect, or if the document uses multiple languages?
Test if the sequences of code points look like words, sentences, or whatever. But what if we have some "binary data" that happens to include embedded text sequences?
In summary, you can tell that a byte sequence is definitely not UTF-8 if the decode fails. Beyond that, if you make assumptions about language, you can say that a byte sequence is probably or probably not a UTF-8 encoded text document.
IMO, the best thing you can do is to avoid getting into a situation where your program needs to make this decision. And if you cannot avoid it, recognize that your program may get it wrong. With thought and hard work, you can make that unlikely, but the probability will never be zero.
In the original question (How can I check whether a byte array contains a Unicode string in Java?), I found that the term "Java Unicode" essentially refers to UTF-16 code units. I went through this problem myself and created some code that could help anyone with this type of question find some answers.
I have created two main methods: one will display UTF-8 code units and the other will create UTF-16 code units. UTF-16 code units are what you will encounter with Java and JavaScript, commonly seen in the form "\ud83d".
For more help with code units and conversion, try this website:
https://r12a.github.io/apps/conversion/
Here is the code...
byte[] array_bytes = text.toString().getBytes();
char[] array_chars = text.toString().toCharArray();
System.out.println();
byteArrayToUtf8CodeUnits(array_bytes);
System.out.println();
charArrayToUtf16CodeUnits(array_chars);
public static void byteArrayToUtf8CodeUnits(byte[] byte_array)
{
    System.out.println("array.length: = " + byte_array.length);
    for (int k = 0; k < byte_array.length; k++)
    {
        System.out.println("array byte: " + "[" + k + "]" + " converted to hex" + " = " + byteToHex(byte_array[k]));
    }
}
public static void charArrayToUtf16CodeUnits(char[] char_array)
{
    /* UTF-16 code units are also known as "Java Unicode" */
    System.out.println("array.length: = " + char_array.length);
    for (int i = 0; i < char_array.length; i++)
    {
        System.out.println("array char: " + "[" + i + "]" + " converted to hex" + " = " + charToHex(char_array[i]));
    }
}
public static String byteToHex(byte b)
{
    // Returns the hex String representation of byte b
    char[] hexDigit =
    {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };
    char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
    return new String(array);
}
public static String charToHex(char c)
{
    // Returns the hex String representation of char c
    byte hi = (byte) (c >>> 8);
    byte lo = (byte) (c & 0xff);
    return byteToHex(hi) + byteToHex(lo);
}
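A quick check of those helpers (sample values are mine):

System.out.println(byteToHex((byte) 0x4A)); // prints 4a
System.out.println(charToHex('€')); // prints 20ac (U+20AC, EURO SIGN)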
If the byte array begins with a Byte Order Mark (BOM) then it will be easy to distinguish what encoding has been used. The standard Java classes for processing text streams will probably deal with this for you automatically.
If you do not have a BOM in your byte data this will be substantially more difficult — .NET classes can perform statistical analysis to try to work out the encoding, but I think this assumes you already know that you are dealing with text data (just not which encoding was used).
If you have any control over the format for your input data your best choice would be to ensure that it contains a Byte Order Mark.
Try decoding it. If you do not get any errors, then it is a valid UTF-8 string.
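That check can be made strict with a CharsetDecoder configured to throw on malformed input instead of silently substituting; a minimal sketch of my own:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // throws on any invalid sequence
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}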
I think Michael has explained it well in his answer; this may be the only way to find out whether a byte array contains all valid UTF-8 sequences. I am using the following code in PHP:
function is_utf8($string) {
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
Taken from W3.org.