How to convert chunks of UTF-8 bytes to characters? - java

I have a large UTF-8 input that is divided into 1-kB chunks. I need to process it using a method that accepts a String. Something like:
for (File file : inputs) {
    byte[] b = FileUtils.readFileToByteArray(file);
    String str = new String(b, "UTF-8");
    processor.process(str);
}
My problem is that a multi-byte UTF-8 character may be split between two chunks. When I run my code, some lines end with '?', which corrupts my input.
What would be a good approach to solve this?

If I understand correctly, you had a large text, which was encoded with UTF-8, then split into 1-kilobyte files. Now you want to read the text back, but you are concerned that an encoded character might be split across file boundaries, and cause a UTF-8 decoding error.
The API is a bit dusty, but there is a SequenceInputStream that will create what appears to be a single InputStream from a series of sub-streams. Create one of these with a collection of FileInputStream instances, then create an InputStreamReader that decodes the stream of UTF-8 bytes to text for your application.
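A minimal sketch of that approach, assuming the chunk files are already sorted into their original order (readChunksAsUtf8 is a hypothetical helper name):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Presents the chunk files as one continuous stream, so a multi-byte UTF-8
// sequence split across two files is reassembled before decoding.
static String readChunksAsUtf8(List<File> inputs) throws IOException {
    List<InputStream> streams = new ArrayList<>();
    for (File file : inputs) {
        streams.add(new FileInputStream(file));
    }
    InputStream joined = new SequenceInputStream(Collections.enumeration(streams));
    StringBuilder sb = new StringBuilder();
    try (Reader reader = new InputStreamReader(joined, StandardCharsets.UTF_8)) {
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
    }
    return sb.toString();
}

The decoded text can then go to processor.process(...) in a single call, or the Reader can be consumed incrementally if the whole input is too large to hold in memory.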

Related

FileInputStream available method returns 2 when there is only 1 byte in the file

As said in the title, I am trying to read a file byte by byte using a FileInputStream. My code reads:
FileInputStream input = new FileInputStream(inFileName);
System.out.println(input.available());
My file inFileName contains only the character "±", which should only amount to one byte; however, when I run the program, the output is 2.
Any help is greatly appreciated.
That is a non-ASCII Unicode character, which in this case is encoded as 2 bytes.
http://www.fileformat.info/info/unicode/char/b1/index.htm
Scroll down to the UTF-8 part and you can see the value of each byte.
If your ultimate goal is to get a string from a byte array that is UTF-8, then you can generate a String from bytes using new String(bytes, "UTF-8");
It's also possible that this is UTF-16 (which would also be 2 bytes), but that is less common.
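A quick self-contained check (not part of the original answer):

import java.nio.charset.StandardCharsets;

public class EncodedLength {
    public static void main(String[] args) {
        // U+00B1 (PLUS-MINUS SIGN) is a single char, but two bytes in UTF-8
        byte[] utf8 = "±".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 2
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF); // prints C2 B1
        }
    }
}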

Find out encoding directly from an input stream [duplicate]

I'm facing a problem.
A file can be written in some encoding such as UTF-8, UTF-16, UTF-32, etc.
When I read a UTF-16 file, I use the code below:
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(file), "UTF-16"));
How can I determine which encoding the file is in before I read it?
When I read a UTF-8-encoded file using UTF-16, I can't read the characters correctly.
There is no good way to do that. The question you're asking is like determining the radix of a number by looking at it. For example, what is the radix of 101?
The best solution would be to read the data into a byte array. Then you can use new String(byte[] bytes, Charset charset) to test it with multiple encodings, from most likely to least likely.
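A sketch of that trial-decoding idea using CharsetDecoder, which reports malformed bytes instead of silently replacing them (tryDecode and the candidate ordering are illustrative):

import java.nio.ByteBuffer;
import java.nio.charset.*;

// Tries each candidate charset in turn; returns the first clean decode.
static String tryDecode(byte[] bytes, String... candidates) {
    for (String name : candidates) {
        CharsetDecoder decoder = Charset.forName(name).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // bytes are not valid in this charset; fall through to the next
        }
    }
    return null; // nothing decoded cleanly
}

For example, tryDecode(bytes, "UTF-8", "UTF-16", "ISO-8859-1"). Note that single-byte charsets like ISO-8859-1 accept any byte sequence, so they must come last.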
You cannot, in general. Which transformation format applies is usually indicated by the first four bytes of the file (assuming a BOM), and you cannot see those without inspecting the bytes themselves.
You can read the first few bytes and try to guess the encoding.
If all else fails, try reading with different encodings until one works (no exception when decoding and it 'looks' OK).
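A sketch of that BOM check (guessFromBom is a hypothetical helper; note that a UTF-8 file is not required to carry a BOM, so a null result only means the guessing has to continue):

// Inspects up to the first four bytes of the file for a byte order mark.
static String guessFromBom(byte[] head) {
    if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
        return "UTF-8";
    }
    if (head.length >= 4 && (head[0] & 0xFF) == 0xFF
            && (head[1] & 0xFF) == 0xFE && head[2] == 0 && head[3] == 0) {
        return "UTF-32LE"; // must be tested before UTF-16LE (same prefix)
    }
    if (head.length >= 4 && head[0] == 0 && head[1] == 0
            && (head[2] & 0xFF) == 0xFE && (head[3] & 0xFF) == 0xFF) {
        return "UTF-32BE";
    }
    if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
        return "UTF-16BE";
    }
    if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
        return "UTF-16LE";
    }
    return null; // no BOM; fall back to trial decoding or a default
}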

How To Convert InputStream To String To Byte Array In Java?

On my Java server I receive an InputStream from an iOS client, which looks like this:
--0xKhTmLbOuNdArY
Content-Disposition: form-data; filename="Image001"
Content-Type: image/png
âPNG
IHDR���#���#���™iqfi���gAMA��Ø»7äÈ���tEXtSoftware�Adobe ImageReadyq…e<��IDATx⁄‰;iê]Uôflπ˜Ω◊Ø;ΩB::õY
ê6LÄ“Õ¿
... etc. ...
≠Yy<‘_˜øüYmc˚æØ…ægflóÏK$å±çe0ˆΩleIë¢êH¢Tñê–Üd
≠≤§àä6D¸>˙÷˜˚<øÁ˘˝˜˚º^sÁ=Áû{ÓπÁ‹œπ˜úÄÎ:!44¡#
--0xKhTmLbOuNdArY--
The first and last lines are my HTTP boundary. Lines 2 and 3 hold information about the image file. And from line 5 until the penultimate line is the image file itself, which I need as a byte array.
So how do I get the image information as a String and the image file as a byte array from the InputStream?
The solution should be fast and efficient (the file size can be several megabytes, up to 10 MB).
My approach:
I convert the InputStream to a String, then split it and convert the second String to a byte array...
String str = org.apache.commons.io.IOUtils.toString( inputStream );
String[] strArray1 = str.split( "\r\n\r\n", 2 );
byte[] bytes = strArray1[1].getBytes();
That way is very fast, but the byte array seems to be damaged. I cannot create an image file from that byte array... some characters are incorrectly converted.
Perhaps someone can help?
The reason why your code breaks is the first line:
String str = org.apache.commons.io.IOUtils.toString( inputStream );
Trying to convert random bytes into Unicode characters, and then back to the same random bytes, isn't going to work.
The only way you can make this work is by reading the input in stages, rather than reading it all into a String.
1. Read from the InputStream until you're convinced you're past the HTTP boundary line.
2. Read the rest of the stream into a byte array (you can use IOUtils for that, too).
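A minimal sketch of that two-stage read, assuming the "\r\n\r\n" header terminator shown in the question (stripping the closing boundary from the tail of the body is left out):

import java.io.*;

// Scans raw bytes for the blank line that ends the headers, so the binary
// image body is never passed through a charset decoder.
static byte[] readBodyAfterHeaders(InputStream in, StringBuilder headersOut)
        throws IOException {
    ByteArrayOutputStream headerBytes = new ByteArrayOutputStream();
    int b, newlines = 0;
    while ((b = in.read()) != -1) {
        headerBytes.write(b);
        if (b == '\n') {
            if (++newlines == 2) break; // "\r\n\r\n" reached
        } else if (b != '\r') {
            newlines = 0;
        }
    }
    // The headers are plain ASCII, so decoding them to a String is safe here.
    headersOut.append(headerBytes.toString("US-ASCII"));
    // Everything remaining on the stream is the raw image body.
    return org.apache.commons.io.IOUtils.toByteArray(in);
}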
You probably don't want to convert your bytes to char and back; that would corrupt your bytes, since the byte stream doesn't correspond to any character encoding.
I would read the whole thing in as a byte[] using IOUtils.toByteArray, then look for the byte sequence "\r\n\r\n".getBytes() in that array.
Note that IOUtils.toByteArray doesn't stop until end-of-stream. This should be fine for HTTP 1.0, but will break for HTTP 1.1 which can send multiple requests on the same stream. In that case, you'll have to read incrementally to find the Content-Length field so you know how much of the InputStream to read.
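A sketch of that whole-array approach; the indexOf helper is written out because Java has no built-in byte-array search:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Naive byte-array search; adequate for a one-off separator lookup.
static int indexOf(byte[] data, byte[] pattern) {
    outer:
    for (int i = 0; i <= data.length - pattern.length; i++) {
        for (int j = 0; j < pattern.length; j++) {
            if (data[i + j] != pattern[j]) continue outer;
        }
        return i;
    }
    return -1;
}

// Usage: split the full request into header text and raw image bytes.
byte[] all = org.apache.commons.io.IOUtils.toByteArray(inputStream);
byte[] sep = "\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
int at = indexOf(all, sep); // production code should check for -1
String headers = new String(all, 0, at, StandardCharsets.US_ASCII);
byte[] body = Arrays.copyOfRange(all, at + sep.length, all.length);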

How to encode bytes to a string in Java

I am trying to encode bytes from an InputStream as plain text characters. So I made the string out of ints separated by spaces, like this:
InputStream in;
// etc.
int b;
String finalString = "";
while ((b = in.read()) != -1) finalString += b + " ";
in.close();
But the problem is, this makes the string 3-4 times larger than the original bytes. Is there any other way of encoding bytes to plain text?
If I understand correctly, you want to transform binary data into plain text. You should use Base64 for that; the size overhead is only a factor of 4/3.
Apache commons-codec has a free implementation of a Base64 encoder (and decoder).
Another possibility is Hex encoding (which commons-codec also supports), but it needs 2 bytes of text for each byte of binary data.
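A short sketch of both options using commons-codec, as suggested above (the variable names are illustrative):

import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.Hex;

byte[] data = org.apache.commons.io.IOUtils.toByteArray(in); // read the stream once
String base64 = Base64.encodeBase64String(data); // 4 output chars per 3 input bytes
String hex = Hex.encodeHexString(data);          // 2 output chars per input byte
byte[] original = Base64.decodeBase64(base64);   // round-trips to the original bytes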
You can get all the bytes and put them into a byte array, and then create the string using the byte array.
i.e.
String newString = new String(byteArray);
Your current solution produces strings that are 3 to 4 times longer than what's in the file because it concatenates decimal character codes into a string.
Java provides a way of reading strings from streams without the need for writing loops, like this:
InputStream in;
BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF8"));
String s = r.readLine();
Follow the documentation here
For example if your string is UTF8:
byte[] bytes = // you got that from somewhere...
String x = new String(bytes, "UTF8");
Commons-codec has methods to encode bytes to Base64 encoding.
encodedText = new String(
org.apache.commons.codec.binary.Base64.encodeBase64(byteArray));
If you can get it all into a single byte[], then this should just be
new String(byteArray, StandardCharsets.UTF_16LE);
or whatever character encoding you expect the input to use.

Writing Unicode to an RTF file

I'm trying to write strings in different languages to an RTF file. I have tried a few different things.
I use Japanese here as an example, but it's the same for the other languages I have tried.
public void writeToFile() {
    String strJapanese = "日本語";
    DataOutputStream outStream;
    File file = new File("C:\\file.rtf");
    try {
        outStream = new DataOutputStream(new FileOutputStream(file));
        outStream.writeBytes(strJapanese);
        outStream.close();
    } catch (Exception e) {
        System.out.println(e.toString());
    }
}
I also have tried:
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
Or more specific:
byte[] b = strJapanese.getBytes("Shift-JIS");
String output = new String(b);
The output stream also has the writeUTF method:
outStream.writeUTF(strJapanese);
You can also write the byte[] directly to the output stream with the write method. All of the above give me garbled characters for everything except West European languages. To see whether it works, I have tried opening the resulting document in Notepad++ and setting the appropriate encoding, and I have also used OpenOffice, where you can choose the encoding and font when opening the document.
If it does work but my computer can't open it properly, is there a way to check that?
Strings in Java are Unicode (UTF-16 internally), but when you want to write one to a file you need to specify an encoding:
try {
    FileOutputStream fos = new FileOutputStream("test.txt");
    Writer out = new OutputStreamWriter(fos, "UTF-8");
    out.write(str);
    out.close();
} catch (IOException e) {
    e.printStackTrace();
}
ref: http://download.oracle.com/javase/tutorial/i18n/text/stream.html
DataOutputStream outStream;
You probably don't want a DataOutputStream for writing an RTF file. DataOutputStream is for writing binary structures to a file, but RTF is text-based. Typically an OutputStreamWriter, setting the appropriate charset in the constructor would be the way to write to text files.
outStream.writeBytes(strJapanese);
In particular this fails because writeBytes really does write bytes, even though you pass it a String. A much more appropriate datatype would have been byte[], but that's just one of the places where Java's handling of bytes vs chars is confusing. The way it converts your string to bytes is simply by taking the lower eight bits of each UTF-16 code unit, and throwing the rest away. This results in ISO-8859-1 encoding with garbled nonsense for all the characters that don't exist in ISO-8859-1.
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
This doesn't really do anything useful. You encode to UTF-8 bytes and then decode them back to a String using the default charset. It's almost always a mistake to touch the default charset, as it is unpredictable across different machines.
outStream.writeUTF(strJapanese);
This would be a better stab at writing UTF-8, but it's still not quite right as it uses Java's bogus “modified UTF-8” encoding, and more importantly RTF files don't actually support UTF-8, and shouldn't really directly include any non-ASCII characters at all.
Traditionally, non-ASCII characters from 128 upwards should be written as hex byte escapes like \'80, and the encoding for them is specified, if at all, in font \fcharset and \cpg escapes that are very, very annoying to deal with, and which don't offer UTF-8 as one of the options.
In more modern RTF, you get \u1234x escapes as in Dabbler's answer (+1). Each escape encodes one UTF-16 code unit, which corresponds to a Java char, so it's not too difficult to regex-replace all non-ASCII characters with their escaped variants.
This is supported by Word 97 and later but some other tools may ignore the Unicode and fall back to the x replacement character.
RTF is not a very nice format.
You can write any Unicode character expressed as its decimal number by using the \u control word. E.g. \u1234? would represent the character whose Unicode code point is 1234, and ? is the replacement character for cases where the character cannot be adequately represented (e.g. because the font doesn't contain it).
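A sketch of the escaping both answers describe, simplified in that a real RTF writer would also escape backslashes and braces (rtfEscape is a hypothetical helper):

// Replaces every non-ASCII character with an RTF \uN? escape.
static String rtfEscape(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c < 128) {
            sb.append(c); // plain ASCII passes through unchanged
        } else {
            // \u takes a signed 16-bit decimal value, hence the (short) cast;
            // '?' is the fallback for readers that ignore Unicode escapes.
            sb.append("\\u").append((short) c).append('?');
        }
    }
    return sb.toString();
}

For example, rtfEscape("日本語") produces \u26085?\u26412?\u35486?.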
