StreamDecoder vs InputStreamReader when reading malformed files - java

I came across some strange behavior when reading files in Java 8, and I'm wondering if someone can make sense of it.
Scenario:
Reading a malformed text file. By malformed I mean that it contains bytes that are not a valid UTF-8 encoding of any Unicode code point.
The code i use to create such a file is as follows:
byte[] text = new byte[1];
char k = (char) -60;
text[0] = (byte) k;
FileUtils.writeByteArrayToFile(new File("/tmp/malformed.log"), text);
This code produces a file that contains exactly one byte, 0xC4, which is not part of the ASCII table (nor the extended one).
Attempting to cat this file produces the following output:
�
Which is the Unicode replacement character. This makes sense: 0xC4 is the lead byte of a two-byte UTF-8 sequence, but there is no continuation byte after it, so the decoder cannot produce a character from it. This is the behavior I expect from my Java code as well.
Pasting some common code:
private void read(Reader reader) throws IOException {
    CharBuffer buffer = CharBuffer.allocate(8910);
    buffer.flip();
    // move existing data to the front of the buffer
    buffer.compact();
    // pull in as much data as we can from the reader
    int charsRead = reader.read(buffer);
    // flip so the data can be consumed
    buffer.flip();
    ByteBuffer encode = Charset.forName("UTF-8").encode(buffer);
    byte[] body = new byte[encode.remaining()];
    encode.get(body);
    System.out.println(new String(body));
}
Here is my first approach using nio:
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(Channels.newReader(inputStream.getChannel(), "UTF-8"));
This produces the following exception:
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.Reader.read(Reader.java:100)
Which is not what I expected, but it also kind of makes sense, because this is actually a corrupt and illegal file, and the exception is basically telling us it expected more bytes to be read.
And my second one (using regular java.io):
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(new InputStreamReader(inputStream, "UTF-8"));
This does not fail and produces the exact same output as cat did:
�
Which also makes sense.
So my questions are:
What is the expected behavior from a Java Application in this scenario?
Why is there a difference between using Channels.newReader (which returns a StreamDecoder) and simply using the regular InputStreamReader? Am I doing something wrong with how I read?
Any clarifications would be much appreciated.
Thanks :)

The difference in behaviour actually goes right down to the StreamDecoder and Charset classes. The InputStreamReader gets its CharsetDecoder from StreamDecoder.forInputStreamReader(..), which does replacement on error:
StreamDecoder(InputStream in, Object lock, Charset cs) {
    this(in, lock,
         cs.newDecoder()
           .onMalformedInput(CodingErrorAction.REPLACE)
           .onUnmappableCharacter(CodingErrorAction.REPLACE));
}
while Channels.newReader(..) creates the decoder with the default settings (i.e. report instead of replace, which results in an exception further up the stack):
public static Reader newReader(ReadableByteChannel ch,
                               String csName) {
    checkNotNull(csName, "csName");
    return newReader(ch, Charset.forName(csName).newDecoder(), -1);
}
So they work differently, and there's no indication anywhere in the documentation about the difference. It's badly documented, but I suppose the channel variant reports errors because you'd rather get an exception than have your data silently corrupted.
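If you want the channel-based Reader to behave like the InputStreamReader, you can configure the error actions on the decoder yourself instead of relying on the defaults. A minimal sketch (Channels.newReader has an overload that accepts a CharsetDecoder; the -1 means "use the default buffer size"):
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(Channels.newReader(inputStream.getChannel(), decoder, -1));
With REPLACE configured, this prints the replacement character instead of throwing MalformedInputException.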
Be careful when dealing with character encodings!

Related

How to get rid of incorrect symbols during Java NIO decoding?

I need to read text from a file and, for instance, print it to the console. The file is in UTF-8. It seems that I'm doing something wrong, because some Russian characters are printed incorrectly. What's wrong with my code?
StringBuilder content = new StringBuilder();
try (FileChannel fChan = (FileChannel) Files.newByteChannel(Paths.get("D:/test.txt"))) {
    ByteBuffer byteBuf = ByteBuffer.allocate(16);
    Charset charset = Charset.forName("UTF-8");
    while (fChan.read(byteBuf) != -1) {
        byteBuf.flip();
        content.append(new String(byteBuf.array(), charset));
        byteBuf.clear();
    }
    System.out.println(content);
}
The result:
Здравствуйте, как поживае��е?
Это п��имер текста на русском яз��ке.ом яз�
The actual text:
Здравствуйте, как поживаете?
Это пример текста на русском языке.
UTF-8 uses a variable number of bytes per character. This gives you a boundary error: you have mixed buffer-based code with byte-array-based code, and you can't do that here. It is possible to read just enough bytes to be stuck halfway into a character; you then turn your input into a byte array and convert it, which fails, because you can't convert half a character.
What you really want is either to read ALL the data first and then convert the entire input, or to keep any half-characters in the ByteBuffer when you flip back (a sketch of that follows), or, better yet, to ditch all this and use code that is written to read actual characters. In general, the channel API complicates matters a ton; it's flexible, but complicated - that's how it goes.
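For illustration, here is roughly what "keeping the half-characters" looks like with a CharsetDecoder (a sketch only: same path and buffer size as your code, error handling omitted):
Charset utf8 = StandardCharsets.UTF_8;
CharsetDecoder decoder = utf8.newDecoder();
StringBuilder content = new StringBuilder();
try (FileChannel fChan = (FileChannel) Files.newByteChannel(Paths.get("D:/test.txt"))) {
    ByteBuffer byteBuf = ByteBuffer.allocate(16);
    CharBuffer charBuf = CharBuffer.allocate(16);
    while (fChan.read(byteBuf) != -1) {
        byteBuf.flip();                           // switch to draining mode
        decoder.decode(byteBuf, charBuf, false);  // incomplete sequences stay in byteBuf
        charBuf.flip();
        content.append(charBuf);
        charBuf.clear();
        byteBuf.compact();                        // carry leftover bytes into the next read
    }
    byteBuf.flip();
    decoder.decode(byteBuf, charBuf, true);       // signal end of input
    decoder.flush(charBuf);
    charBuf.flip();
    content.append(charBuf);
}
System.out.println(content);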
Unless you can explain why you need the channel API, don't use it. Do this instead:
Path target = Paths.get("D:/test.txt");
try (var reader = Files.newBufferedReader(target)) {
// read a line at a time here. Yes, it will be UTF-8 decoded.
}
or better yet, as you apparently want to read the whole thing in one go:
Path target = Paths.get("D:/test.txt");
var content = Files.readString(target);
NB: Unlike most Java methods that convert bytes to chars or vice versa, the Files API defaults to UTF-8 (instead of the useless and dangerous, untestable-bug-causing 'platform default encoding' that most Java APIs use). That's why this last, incredibly simple snippet is nevertheless correct.
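If you do need to go through raw bytes somewhere else in your code, passing the charset explicitly keeps the behaviour the same on every machine (a small illustration, reusing the same target path):
byte[] bytes = Files.readAllBytes(target);
String explicit = new String(bytes, StandardCharsets.UTF_8); // same result everywhere
String implicit = new String(bytes); // platform default encoding - may differ per machine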

Do I have any performance gain using the BufferedReader in this case?

I want to send a CSV file encoded in base64 from Client to Server, in order to parse it and use the data.
I want to get the InputStream directly from the Request object and pipe it to the reader used by the CSV parser.
Is there any performance or memory gain using this method?
Can the following code achieve this? I feel like there's something missing while decoding the content.
Is BufferedReader really needed in this example ?
/* Suppose I get a Base64 encoded CSV file from the client */
String csvContent = "Column 1;Column 2;Column 3\r\nValue 1;Value 2;Value 3\r\n";
ByteArrayInputStream inputStream = new ByteArrayInputStream(Base64.encodeBase64(csvContent.getBytes()));
/* retrieving the content UPDATED */
Base64InputStream b64InputStream = new Base64InputStream(inputStream, false);
/* Parsing the CSV content */
Reader reader = new BufferedReader(
new InputStreamReader(b64InputStream));
CSVParser csvParser = new CSVParser(reader, FORMAT_EXCEL_FR);
/* printing results */
csvParser.forEach(record -> printRecord(record));
Update
I replaced the byte[] array with a Base64InputStream from org.apache.commons.codec
Probably not. A BufferedReader ... uses a buffer. It is commonly used when your data is not in Java memory yet (e.g. socket communication, reading data from a file, ...).
In your case, you are wrapping a byte[], which means that the data is already in memory. So there is no point in adding a buffer.
The javadoc describes a BufferedReader as follows:
Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.
Now, let's say for example you want to read the content of a file and check something byte per byte, so you do a lot of int b = in.read(); calls. In that case, a buffered reader will actually fetch those bytes in chunks internally.
So, basically, whenever it is more efficient to fetch data in chunks, use a BufferedReader.
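To make that concrete, a small sketch of the byte-per-byte case (the file name is hypothetical; the same idea applies to Reader vs. BufferedReader for characters):
// Unbuffered: every read() call goes down to the underlying file.
try (InputStream raw = new FileInputStream("data.bin")) {
    int b;
    while ((b = raw.read()) != -1) {
        // inspect each byte
    }
}
// Buffered: the file is fetched in larger chunks and read() is answered from memory.
try (InputStream buffered = new BufferedInputStream(new FileInputStream("data.bin"))) {
    int b;
    while ((b = buffered.read()) != -1) {
        // inspect each byte
    }
}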
Update
In response to your update. No, also in this case it's not necessary to add a BufferedReader. As Holger pointed out:
It's likely that the CSVParser does that already (i.e. buffering).
I checked the source code of the CSVParser, and look at what's in the constructor:
public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber)
        throws IOException {
    ...
    this.lexer = new Lexer(format, new ExtendedBufferedReader(reader));
    ...
}
It wraps some kind of buffered reader by default. So, there's no need to add one yourself.
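So the pipeline from your update can be trimmed to something like this (a sketch; Base64InputStream is the commons-codec class you already use, FORMAT_EXCEL_FR and printRecord are your own names, and the explicit UTF-8 charset is an assumption about your data):
Base64InputStream b64InputStream = new Base64InputStream(inputStream, false);
Reader reader = new InputStreamReader(b64InputStream, StandardCharsets.UTF_8);
try (CSVParser csvParser = new CSVParser(reader, FORMAT_EXCEL_FR)) {
    csvParser.forEach(record -> printRecord(record));
}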

Interpret a string from one encoding to another in java

I've looked around for answers to this (I'm sure they're out there), and I'm not sure it's possible.
So, I got a HUGE file that contains the word "för". I'm using RandomAccessFile because I know where it is (kind of) and can therefore use the seek() function to get there.
To know that I've found it I have a String "för" in my program that I check for equality. Here's the problem: I ran the debugger, and when I get to "för", what I actually get to compare is "för".
So my program terminates without finding any "för".
This is the code I use to get a word:
private static String getWord(RandomAccessFile file) throws IOException {
    StringBuilder stb = new StringBuilder();
    String word;
    char c;
    c = (char) file.read();
    int end;
    do {
        stb.append(c);
        end = file.read();
        if (end == -1)
            return "-1";
        c = (char) end;
    } while (c != ' ');
    word = stb.toString();
    word.trim();
    return word;
}
So I return all the characters from the current point in the file up to the first ' ' character. Basically I get the word, but since (char) file.read() reads a single byte (I think), the UTF-8 'ö' becomes the two characters 'Ã' and '¶'?
One reason for this guess is that if I open my file with UTF-8 encoding, that place reads "för", but if I open the same file at the same place with ISO-8859-15, I see exactly what my getWord method returns: "för".
So my question:
When I'm sitting with a "för" and a "för", is there any way to fix this? Like saying "read "för" as if it was an UTF-8 string" to get "för"?
If you have to use a RandomAccessFile you should read the content into a byte[] first and then convert the complete array to a String - something along the lines of:
byte[] buffer = new byte[whatever];
int length = file.read(buffer);
String result = new String(buffer, 0, length, "UTF-8");
This is only to give you a general impression of what to do - you'll still have to decide how much to read and handle the end of the file, etc.
This will not work correctly if you start reading in the middle of a UTF-8 sequence, but so will any other method.
You are using RandomAccessFile.read(). This reads single bytes. UTF-8 sometimes uses several bytes for one character.
Different methods to read UTF-8 from a RandomAccessFile are discussed here: Java: reading strings from a random access file with buffered input
If you don't necessarily need a RandomAccessFile, you should definitely switch to reading characters instead of bytes.
If possible, I would suggest Scanner.next(), which returns the next whitespace-delimited word by default.
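A rough sketch of that approach (the file name is a placeholder; the charset is passed explicitly so 'ö' is decoded as UTF-8):
try (Scanner scanner = new Scanner(new File("huge.txt"), "UTF-8")) {
    while (scanner.hasNext()) {
        String word = scanner.next(); // already a decoded String, so "för" stays "för"
        if (word.equals("för")) {
            // found it
        }
    }
}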
import java.nio.charset.Charset;
String encodedString = new String(originalString.getBytes("ISO-8859-15"), Charset.forName("UTF-8"));

PDF generation issue in java code

I am getting a PDF attachment in a Soap response message. I need to generate a PDF back out of it. However, the generated PDF is of the following form:
%PDF-1.4
%
2 0 obj
<</Type/XObject/ColorSpace/DeviceRGB/Subtype/Image/BitsPerComponent 8/Width
278/Length 7735/Height 62/Filter/DCTDecode>>stream
How can I solve this issue?
Here is the code showing how I am embedding a PDF as an attachment:
message = messageFactory.createMessage();
SOAPBody body = message.getSOAPBody();
header.detachNode();
AttachmentPart attachment1 = message.createAttachmentPart();
fr = new FileReader(new File(pathName));
br = new BufferedReader(fr);
String stringContent = "";
line = br.readLine();
while (line != null) {
    stringContent = stringContent.concat(line);
    stringContent = stringContent.concat("\n");
    line = br.readLine();
}
fr.close();
br.close();
attachment1.setMimeHeader("Content-Type", "application/pdf");
attachment1.setContent(stringContent, "application/pdf");
The below code describes how I am getting PDF back from the SOAP message:
Object content = attachment1.getContent();
writePdf(content);
private void writePdf(Object content) throws IOException, PrintException,
        DocumentException {
    String str = content.toString();
    //byte[] b = Base64.decode(str);
    //byteArrayToFile(b);
    OutputStream file = new FileOutputStream(new File
            (AppConfig.getInstance().getConfigValue("webapp.root") +
            File.separator + "temp" + File.separator + "hede.pdf"));
    //String s2 = new String(bytes, "UTF-8");
    //System.out.println("S2::::::::::"+s2);
    Document document = new Document();
    PdfWriter.getInstance(document, file);
    document.open();
    document.add(new Paragraph(str));
    document.close();
    file.close();
}
Can anyone help me out?
There are several faults in the supplied code:
In the code showing how you are embedding the PDF as an attachment, you are using a Reader (a FileReader wrapped in a BufferedReader) to read the file to attach line by line, concatenating these lines with \n as separator, and sending the result of that concatenation as attachment content of type "application/pdf".
This is a procedure you might consider for text files (even though it isn't a good choice there either), but binary files read like this most likely get broken beyond repair (and PDFs are binary files, in spite of a phase early in their history when handling them as text was fairly harmless):
When reading the file, the Reader interprets its bytes according to some character encoding (as none is given explicitly here, most likely the platform default encoding is used) in order to transform them to Unicode characters collected in a String. Already at this point the binary data is most likely damaged.
When using readLine you read these Unicode characters until the Reader recognizes a line break. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a line feed (Java API JavaDocs). When you then concatenate these lines uniformly with \n as separator, you effectively replace every single carriage return and every carriage return - line feed pair with a single line feed, damaging the binary data even further.
When the attachment API then encodes this String as the content of the attachment part, it transforms your Unicode characters back into bytes. If by chance it assumes the same character encoding the Reader used before, this might heal some of the damage done there, but surely not all of it, and the line-break mangling of the step in between certainly isn't healed either. If a different encoding is used, the data is damaged once again.
Thus, check what other argument types your AttachmentPart.setContent method accepts, choose something that does not damage binary data (e.g. InputStreams, ByteBuffers, byte[], ...), and use that, e.g. a FileInputStream.
The code which describes how you are getting the PDF back from the SOAP message is even weirder... You assume that toString of the attachment content returns some meaningful string representation (very unlikely here), and then continue to create a new PDF containing that string representation as the text content of the first and only paragraph of the document. Thus, while your attachment creation code discussed above at least 'merely' damaged the PDF, your attachment retrieval code completely ignores the nature of the attachment and destroys it beyond recognition.
You should instead check the actual type of the content object, retrieve the binary data it holds according to its type, and store that content using a FileOutputStream (not a Writer, not Strings in between, and no copying 'line' by 'line').
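A rough sketch of the binary-safe version of both sides (the method names are from the SAAJ AttachmentPart API, available since SAAJ 1.3; paths and variable names are the ones from your code):
// Attaching: hand the raw bytes to the attachment, no Reader and no String involved.
byte[] pdfBytes = Files.readAllBytes(Paths.get(pathName));
AttachmentPart attachment1 = message.createAttachmentPart();
attachment1.setRawContentBytes(pdfBytes, 0, pdfBytes.length, "application/pdf");
message.addAttachmentPart(attachment1);

// Retrieving: write the raw bytes straight to a file.
byte[] received = attachment1.getRawContentBytes();
try (OutputStream out = new FileOutputStream("hede.pdf")) {
    out.write(received);
}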
And whatever source gave you the impression that your code was appropriate for the task... well, either you completely misunderstood it or you should shun it from now on.

Java App : Unable to read iso-8859-1 encoded file correctly

I have a file which is encoded as ISO-8859-1 and contains characters such as ô.
I am reading this file with java code, something like:
File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
    int byteCount = fr.read(buffer, 0, buffer.length);
    if (byteCount <= 0) {
        break;
    }
    String s = new String(buffer, 0, byteCount, "ISO-8859-1");
    System.out.println(s);
}
However, the ô character always comes out garbled, usually printing as a ?.
I have read around the subject (and learnt a little on the way) e.g.
http://www.joelonsoftware.com/articles/Unicode.html
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
http://www.ingrid.org/java/i18n/utf-16/
but still can not get this working
Interestingly this works on my local pc (xp) but not on my linux box.
I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:
System.out.println(java.nio.charset.Charset.availableCharsets());
I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.
I recommend that to check for the first, you examine the relevant byte in the file. To check for the second, examine the relevant character in the string, printing it out with
System.out.println((int) s.charAt(index));
In both cases the result should be 244 decimal; 0xf4 hex.
See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).
In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.
EDIT: Here's a really easy way to prove whether or not the console will work:
System.out.println("Here's the character: \u00f4");
Parsing the file as fixed-size blocks of bytes is not good: what if some character has a byte representation that straddles two blocks? Use an InputStreamReader with the appropriate character encoding instead:
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("myfile.csv"), "ISO-8859-1"));
char[] buffer = new char[4096]; // character (not byte) buffer
while (true)
{
    int charCount = br.read(buffer, 0, buffer.length);
    if (charCount == -1) break; // reached end-of-stream
    String s = String.valueOf(buffer, 0, charCount);
    // alternatively, we can append to a StringBuilder
    System.out.println(s);
}
Btw, remember to check that the Unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.
As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf(s) to see if there is a difference.
@Joel - your own answer confirms that the problem is a difference between the default encoding of your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).
Consider this code:
public static void main(String[] args) throws IOException {
    byte[] data = { (byte) 0xF4 };
    String decoded = new String(data, "ISO-8859-1");
    if (!"\u00f4".equals(decoded)) {
        throw new IllegalStateException();
    }
    // write default charset
    System.out.println(Charset.defaultCharset());
    // dump bytes to stdout
    System.out.write(data);
    // will encode to default charset when converting to bytes
    System.out.println(decoded);
}
By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed:
UTF-8
?ô
If I switch the terminal's encoding to ISO 8859-1, this is printed:
UTF-8
ôÃ´
In both cases, the same bytes are being emitted by the Java program:
5554 462d 380a f4c3 b40a
The only difference is in how the terminal is interpreting the bytes it receives. In ISO 8859-1, ô is encoded as 0xF4. In UTF-8, ô is encoded as 0xC3B4. The other characters are common to both encodings.
If you can, try to run your program in a debugger to see what's inside your 's' string after it is created. It is possible that it has the correct content, but the output is garbled after the System.out.println(s) call. In that case, there is probably a mismatch between what Java thinks the encoding of your output is and the character encoding of your terminal/console on Linux.
Basically, if it works on your local XP PC but not on Linux, and you are parsing the exact same file (i.e. you transferred it in a binary fashion between the boxes), then it probably has something to do with the System.out.println call. I don't know how you verify the output, but if you do it by connecting with a remote shell from the XP box, then there is the character set of the shell (and the client) to consider.
Additionally, what Zach Scrivena suggests is also true - you cannot assume that you can create strings from chunks of data in that way - either use an InputStreamReader or read the complete data into an array first (obviously not going to work for a large file). However, since it does seem to work on XP, then I would venture that this is probably not your problem in this specific case.
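One way to take the guesswork out of the output step is to pick the output encoding explicitly instead of relying on the platform default (a sketch; use whichever charset your terminal actually expects):
// Wrap stdout in a PrintStream with an explicit charset, here matching an ISO-8859-1 terminal
PrintStream out = new PrintStream(System.out, true, "ISO-8859-1");
out.println(s);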
