I am trying to encode bytes from an InputStream to plain text characters. So I made the string out of ints separated by spaces, like this:
InputStream in;
//etc
int b;
String finalString="";
while ((b = in.read()) != -1) finalString += b + " ";
in.close();
But the problem is, this makes the string 3-4 times larger than the original bytes. Is there any other way of encoding bytes to plain text?
If I understand correctly, you want to transform binary data into plain text. You should use Base64 for that. The encoded text is only 4/3 the size of the original data.
Apache commons-codec has a free implementation of a Base64 encoder (and decoder).
Another possibility is Hex encoding (which commons-codec also supports), but it needs 2 bytes of text for each byte of binary data.
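For illustration, a minimal sketch using the JDK's built-in java.util.Base64 (available since Java 8; commons-codec offers the same operations):

import java.util.Base64;

byte[] data = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
// Encode: every 3 input bytes become 4 ASCII characters
String encoded = Base64.getEncoder().encodeToString(data); // "yv66vg=="
// Decode back to the original bytes
byte[] decoded = Base64.getDecoder().decode(encoded);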
You can get all the bytes and put them into a byte array, and then create the string using the byte array.
i.e.
String newString = new String(byteArray);
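For example, a sketch that collects the whole stream into a byte array first; in is the InputStream from the question, and the "UTF-8" charset is an assumption, so use whatever encoding your bytes actually are:

import java.io.ByteArrayOutputStream;

ByteArrayOutputStream buffer = new ByteArrayOutputStream();
int b;
while ((b = in.read()) != -1) {
    buffer.write(b); // collect the raw bytes unchanged
}
in.close();
byte[] byteArray = buffer.toByteArray();
// Decode with an explicit charset rather than the platform default
String newString = new String(byteArray, "UTF-8");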
Your current solution produces strings that are 3-4 times longer than what's in the file because it concatenates the decimal value of each byte (plus a space) into a string.
Java provides a way of reading strings from streams without the need for writing loops, like this:
InputStream in;
BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"));
String s = r.readLine();
Follow the documentation for the String(byte[], String charsetName) constructor. For example, if your bytes are UTF-8 encoded text:
byte[] bytes = // you got that from somewhere...
String x = new String(bytes, "UTF-8");
Commons-codec has methods to encode bytes to Base64 encoding.
encodedText = new String(
org.apache.commons.codec.binary.Base64.encodeBase64(byteArray));
If you can get it all into a single byte[], then this should just be
new String(byteArray, StandardCharsets.UTF_16LE);
or whatever character encoding you expect the input to use.
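On Java 9 and later you can get that single byte[] straight from the stream; a small sketch (UTF-16LE is just the example charset carried over from above):

import java.nio.charset.StandardCharsets;

byte[] byteArray = in.readAllBytes(); // InputStream.readAllBytes() needs Java 9+
in.close();
String text = new String(byteArray, StandardCharsets.UTF_16LE);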
Could anyone please help me out here? I want to know the difference between the two string conversions below. I am trying to encode the string to UTF-8. Which one is the correct method?
String string2 = new String(string1.getBytes("UTF-8"), "UTF-8");
OR
String string3 = new String(string1.getBytes(), "UTF-8");
Also, if I use the two lines of code above together, i.e.
line 1: string1 = new String(string1.getBytes("UTF-8"), "UTF-8");
line 2: string1 = new String(string1.getBytes(), "UTF-8");
Will the value of string1 be the same after both lines?
PS: Purpose of doing all this is to send Japanese text in web service call.
So I want to send it with UTF-8 encoding.
According to the javadoc of String#getBytes(String charsetName):
Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.
And the documentation of String(byte[] bytes, Charset charset)
Constructs a new String by decoding the specified array of bytes using
the specified charset.
Thus getBytes() is the opposite operation of String(byte[]). getBytes() encodes the string to bytes, and String(byte[]) decodes the byte array back into a string. You have to use the same charset for both operations to preserve the actual string value. That means your second example is wrong:
// This is wrong because you are calling getBytes() with the default charset
// but converting those bytes to a string using UTF-8. It will mostly work,
// because the default encoding is usually UTF-8, but it can fail, so it is wrong.
new String(string1.getBytes(), "UTF-8");
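A small sketch of the correct round trip with matching charsets (the Japanese sample string is only an illustration):

import java.nio.charset.StandardCharsets;

String original = "こんにちは"; // hypothetical Japanese text
byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8); // encode: String -> bytes
String restored = new String(utf8Bytes, StandardCharsets.UTF_8); // decode with the SAME charset
System.out.println(original.equals(restored)); // true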
String and char (two-byte UTF-16 code units) in Java are for (Unicode) text.
When converting to and from byte[], one needs the Charset (encoding) of those bytes.
Both String.getBytes() and new String(byte[]) are shortcuts that use the default operating-system encoding. That is almost always wrong for cross-platform use.
So use
byte[] b = s.getBytes("UTF-8");
s = new String(b, "UTF-8");
Or better, not throwing a checked UnsupportedEncodingException:
byte[] b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);
(Android below API level 19 does not have StandardCharsets, however.)
The same holds for InputStreamReader and OutputStreamWriter, which bridge binary data (InputStream/OutputStream) and text (Reader/Writer).
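For example, a sketch of the same rule applied to those bridges (the file name is hypothetical):

import java.io.*;
import java.nio.charset.StandardCharsets;

// Write text as UTF-8 bytes
try (Writer w = new OutputStreamWriter(new FileOutputStream("data.txt"), StandardCharsets.UTF_8)) {
    w.write("some text");
}
// Read the same bytes back as UTF-8 text
try (BufferedReader r = new BufferedReader(
        new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8))) {
    String line = r.readLine();
}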
Please don't confuse yourself. "String" is usually used to refer to values in a datatype that stores text. In this case, java.lang.String.
Serialized text is a sequence of bytes created by applying a character encoding to a string. In this case, byte[].
There are no UTF-8-encoded strings in Java.
If your web service client library takes a string, pass it the string. If it lets you specify an encoding to use for serialization, pass it StandardCharsets.UTF_8 or equivalent.
If it doesn't take a string, then pass it string1.getBytes(StandardCharsets.UTF_8) and use whatever other mechanism it provides to let you tell the recipient that the bytes are UTF-8-encoded text. Or, get a different client library.
I have a large UTF-8 input that is divided to 1-kB size chunks. I need to process it using a method that accepts String. Something like:
for (File file : inputs) {
    byte[] b = FileUtils.readFileToByteArray(file);
    String str = new String(b, "UTF-8");
    processor.process(str);
}
My problem is that I have no guarantee that a multi-byte UTF-8 character is not split across two chunks. The result of running my code is that some lines end with '?', which corrupts my input.
What would be a good approach to solve this?
If I understand correctly, you had a large text, which was encoded with UTF-8, then split into 1-kilobyte files. Now you want to read the text back, but you are concerned that an encoded character might be split across file boundaries, and cause a UTF-8 decoding error.
The API is a bit dusty, but there is a SequenceInputStream that will create what appears to be a single InputStream from a series of sub-streams. Create one of these with a collection of FileInputStream instances, then create an InputStreamReader that decodes the stream of UTF-8 bytes to text for your application.
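A minimal sketch of that approach, assuming inputs and processor from the question, and that handing the text to the processor line by line is acceptable:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

List<InputStream> parts = new ArrayList<>();
for (File file : inputs) { // the 1-kB chunk files, in their original order
    parts.add(new FileInputStream(file));
}
// One logical stream over all chunks, so multi-byte sequences can span file boundaries
InputStream all = new SequenceInputStream(Collections.enumeration(parts));
try (BufferedReader reader = new BufferedReader(new InputStreamReader(all, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        processor.process(line);
    }
}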
I have an array of bytes that should be converted to a string. That array contains Windows-1257 encoded text. What is the best way of converting this array to a string? Later I will need to convert the string to an ISO-8859-13 byte array, but I know how to do that part of the job.
I tried like this:
String result = new String(currentByteArray, "ISO-8859-13");
But of course I got garbage where the local characters should be.
String unicodeString = new String(currentByteArray, "Windows-1257");
byte[] result = unicodeString.getBytes("ISO-8859-13");
or
PrintWriter out = new PrintWriter(file, "ISO-8859-13");
Java is very simple: String, Reader and Writer hold Unicode text and can contain all characters.
byte[], InputStream and OutputStream are for binary data.
Hence the String constructor for bytes needs the original encoding of those bytes,
and getBytes needs the encoding that the resulting bytes should be in.
Be aware that there are overloads without the encoding parameter. Those use the platform default encoding and are not portable.
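So the full conversion, spelled out with explicit charsets (Charset.forName avoids the checked exception of the String-named overloads):

import java.nio.charset.Charset;

// Decode the incoming bytes with the charset they were actually written in
String text = new String(currentByteArray, Charset.forName("windows-1257"));
// Re-encode the text for the target system
byte[] result = text.getBytes(Charset.forName("ISO-8859-13"));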
How can I decode a UTF-8 string on Android? I tried these commands but the output is the same as the input:
URLDecoder.decode("hello&//à", "UTF-8");
new String("hello&//à", "UTF-8");
EntityUtils.toString("hello&//à", "utf-8");
A string needs no encoding. It is simply a sequence of Unicode characters.
You need to encode when you want to turn a String into a sequence of bytes. The charset that you choose (UTF-8, cp1255, etc.) determines the Character->Byte mapping. Note that a character is not necessarily translated into a single byte. In most charsets, most Unicode characters are translated to at least two bytes.
Encoding of a String is carried out by:
String s1 = "some text";
byte[] bytes = s1.getBytes("UTF-8"); // Charset to encode into
You need to decode when you have a sequence of bytes and you want to turn them into a String. When you do that you need to specify, again, the charset with which the bytes were originally encoded (otherwise you'll end up with garbled text).
Decoding:
String s2 = new String(bytes, "UTF-8"); // Charset with which bytes were encoded
If you want to understand this better, a great text is "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
The core functions are getBytes(String charset) and new String(byte[] data). You can use these functions to do UTF-8 decoding.
UTF-8 decoding here is really a string-to-string conversion, with a byte array as the intermediate buffer. Since the target is a UTF-8 string, you can pass just the byte array to new String(); that is equivalent to new String(bytes, "UTF-8") only when the platform default charset is UTF-8, so passing "UTF-8" explicitly is safer.
The key, then, is the charset of the input (encoded) string, which is used to get the internal byte array and which you should know beforehand. If you don't, guess the most likely one; "ISO-8859-1" is a good guess for English users.
The decoding sentence should be
String decoded = new String(encoded.getBytes("ISO-8859-1"), "UTF-8");
Try looking at "decode string encoded in utf-8 format in android", but it doesn't look like your string is encoded with anything in particular. What do you think the output should be?
I'm making a chat client that uses special encryption. It has a problem reading letters like «, ƒ, ̕ from the input buffer.
I'm reading them into a byte array, and I tried using
Connection.getInputStream().read();
And also using
BufferedReader myInput = new BufferedReader(
new InputStreamReader(Connection.getInputStream()));
But there appears to be a problem as it displays them as square boxes.
You have to make sure that your InputStreamReader uses the same charset to decode the bytes into chars as the one used by the sender to encode chars into bytes. Look at the other constructors of InputStreamReader.
You must also make sure that the font you're using to display the chars supports your special characters.
Set the correct encoding on the stream through new InputStreamReader(.., "UTF-8"), or whatever encoding your input actually uses.
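For instance, a minimal sketch assuming the sender writes UTF-8 (Connection is the object from the question):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Decode the incoming bytes as UTF-8 instead of the platform default
BufferedReader myInput = new BufferedReader(
        new InputStreamReader(Connection.getInputStream(), StandardCharsets.UTF_8));
String line = myInput.readLine();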
Convert the byte array to a String, specifying the character set.
String data = new String(byteArray, "UTF-8");
Make sure that the display font supports UTF-8 or whatever charset you specified.
You can try using a DataInputStream and the readChar() method.
DataInputStream in = new DataInputStream(myinput);
// where myinput is your BufferedInputStream
char c = in.readChar();
should do what you want.