PDF generation issue in Java code

I am getting a PDF attachment in a Soap response message. I need to generate a PDF back out of it. However, the generated PDF is of the following form:
%PDF-1.4
%
2 0 obj
<</Type/XObject/ColorSpace/DeviceRGB/Subtype/Image/BitsPerComponent 8/Width
278/Length 7735/Height 62/Filter/DCTDecode>>stream
How can I solve this issue?
Here is the code showing how I am embedding a PDF as an attachment:
message = messageFactory.createMessage();
SOAPBody body = message.getSOAPBody();
header.detachNode();
AttachmentPart attachment1 = message.createAttachmentPart();
fr = new FileReader(new File(pathName));
br = new BufferedReader(fr);
String stringContent = "";
line = br.readLine();
while (line != null) {
    stringContent = stringContent.concat(line);
    stringContent = stringContent.concat("\n");
    line = br.readLine();
}
fr.close();
br.close();
attachment1.setMimeHeader("Content-Type", "application/pdf");
attachment1.setContent(stringContent, "application/pdf");
The below code describes how I am getting PDF back from the SOAP message:
Object content = attachment1.getContent();
writePdf(content);
private void writePdf(Object content) throws IOException, PrintException,
        DocumentException {
    String str = content.toString();
    //byte[] b = Base64.decode(str);
    //byteArrayToFile(b);
    OutputStream file = new FileOutputStream(new File(
            AppConfig.getInstance().getConfigValue("webapp.root") +
            File.separator + "temp" + File.separator + "hede.pdf"));
    //String s2 = new String(bytes, "UTF-8");
    //System.out.println("S2::::::::::"+s2);
    Document document = new Document();
    PdfWriter.getInstance(document, file);
    document.open();
    document.add(new Paragraph(str));
    document.close();
    file.close();
}
Can anyone help me out?

There are several faults in the supplied code:
In the code showing how you are embedding the PDF as an attachment, you use a Reader (a FileReader wrapped in a BufferedReader) to read the file to attach line by line, concatenate these lines using \n as a separator, and send the result of the concatenation as attachment content of type "application/pdf".
This is a procedure you might consider for text files (even though it isn't a good choice there either), but binary files read like this most likely get broken beyond repair (and PDFs are binary files, in spite of a phase early in their history when handling them as text was fairly harmless):
When reading a file, a Reader interprets its bytes according to some character encoding (as none is given explicitly here, most likely the platform default encoding is used) and transforms them into Unicode characters collected in a String. Already here the binary data is most likely damaged.
When using readLine, you read these Unicode characters until the Reader recognizes a line break. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a line feed (Java API JavaDocs). When you then concatenate these lines uniformly with \n as separator, you essentially replace all single carriage return characters and all carriage return / line feed pairs with single line feed characters, damaging the binary data even further.
When the attachment API you use encodes this string as the content of an attachment part, it transforms your Unicode characters back into bytes. If by chance the same character encoding is assumed as was used by the Reader before, this might heal some of the damage done back then, but surely not all of it, and the line break replacement of the step in between certainly isn't healed either. If a different encoding is used, the data is damaged once again.
Thus, check what other argument types your AttachmentPart.setContent method accepts, choose one that does not damage binary data (e.g. an InputStream, ByteBuffer, byte[], ...), and use that, e.g. a FileInputStream.
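For example, a minimal sketch assuming SAAJ 1.3+ and the javax.activation classes; it attaches the file as raw binary data via a DataHandler instead of a String (AttachmentPart.setRawContent(InputStream, String) would work similarly):
// Sketch: attach the PDF as binary data, never passing it through a String.
AttachmentPart attachment1 = message.createAttachmentPart();
DataHandler handler = new DataHandler(new FileDataSource(new File(pathName)));
attachment1.setDataHandler(handler);
attachment1.setContentType("application/pdf");
message.addAttachmentPart(attachment1);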
The code which describes how you are getting the PDF back from the SOAP message is even weirder... You assume that toString of the attachment content returns some meaningful string representation (very unlikely here), and then create a new PDF containing that string representation as the text content of its first and only paragraph. Thus, while your attachment creation code discussed above at least 'merely' damaged the PDF, your attachment retrieval code completely ignores the nature of the attachment and destroys it beyond recognition.
You should instead check the actual type of the content object, retrieve the binary data it holds according to its type, and store that content using a FileOutputStream (not a Writer, not with Strings in between, and not copying 'line' by 'line').
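A minimal sketch of that retrieval side, assuming SAAJ 1.3+ (getRawContent()) and a hypothetical targetPdfFile pointing at where the PDF should be written:
// Sketch: copy the attachment bytes to disk untouched, no Reader/Writer involved.
try (InputStream in = attachment1.getRawContent();
     OutputStream out = new FileOutputStream(targetPdfFile)) {
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
    }
}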
And whatever source gave you the impression that your code was appropriate for the task... well, either you completely misunderstood it or you should shun it from now on.

Related

How to get rid of incorrect symbols during Java NIO decoding?

I need to read text from a file and, for instance, print it to the console. The file is in UTF-8. It seems that I'm doing something wrong because some Russian characters are printed incorrectly. What's wrong with my code?
StringBuilder content = new StringBuilder();
try (FileChannel fChan = (FileChannel) Files.newByteChannel(Paths.get("D:/test.txt"))) {
    ByteBuffer byteBuf = ByteBuffer.allocate(16);
    Charset charset = Charset.forName("UTF-8");
    while (fChan.read(byteBuf) != -1) {
        byteBuf.flip();
        content.append(new String(byteBuf.array(), charset));
        byteBuf.clear();
    }
    System.out.println(content);
}
The result:
Здравствуйте, как поживае��е?
Это п��имер текста на русском яз��ке.ом яз�
The actual text:
Здравствуйте, как поживаете?
Это пример текста на русском языке.
UTF-8 uses a variable number of bytes per character, which gives you a boundary error: you have mixed buffer-based code with byte-array-based code, and you can't do that here. It is possible to read enough bytes to be stuck halfway into a character; you then turn your input into a byte array (and new String(byteBuf.array(), charset) converts the whole backing array, not just the bytes actually read) and convert it, which will fail, because you can't convert half a character.
What you really want is either to read ALL the data first and then convert the entire input, or to keep any half-characters in the ByteBuffer when you flip back, or better yet, to ditch all this and use code that is written to read actual characters. In general, using the channel API complicates matters a lot; it's flexible, but complicated - that's how it goes.
Unless you can explain why you need it, don't use it. Do this instead:
Path target = Paths.get("D:/test.txt");
try (var reader = Files.newBufferedReader(target)) {
// read a line at a time here. Yes, it will be UTF-8 decoded.
}
or better yet, as you apparently want to read the whole thing in one go:
Path target = Paths.get("D:/test.txt");
var content = Files.readString(target);
NB: Unlike most Java methods that convert bytes to chars or vice versa, the Files API defaults to UTF-8 (instead of the useless and dangerous, untestable-bug-causing 'platform default encoding' that most of the Java API uses). That's why this last, incredibly simple code is nevertheless correct.
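For completeness only (and not the recommendation here): if you really must stay with the channel API, the half-character problem can be handled by driving a CharsetDecoder yourself and keeping undecoded trailing bytes in the buffer between reads. A sketch, assuming the same file and UTF-8:
StringBuilder content = new StringBuilder();
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
ByteBuffer in = ByteBuffer.allocate(16);
CharBuffer out = CharBuffer.allocate(16);
try (FileChannel fChan = FileChannel.open(Paths.get("D:/test.txt"))) {
    while (fChan.read(in) != -1) {
        in.flip();
        decoder.decode(in, out, false); // stops before an incomplete character
        out.flip();
        content.append(out);
        out.clear();
        in.compact();                   // keep the undecoded trailing bytes
    }
    in.flip();
    decoder.decode(in, out, true);      // signal end of input
    decoder.flush(out);
    out.flip();
    content.append(out);
}
System.out.println(content);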

Shift_JIS to UTF_8 conversion of full width tilde [~] character returns a thicker character

I'm processing Shift_JIS files and outputting UTF-8 files. Most of the characters are displayed as expected when viewed in a text editor, except for the full width tilde character [~]. It becomes a thicker-looking character, similar to this: [~].
(Note: this is not the same character; I just don't know how to type it here, so I bolded it.)
When I type it manually in the UTF-8 file, I get the regular version.
Here is my code:
try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(inFile), Charset.forName("Shift_JIS")))) {
    try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(outFile), StandardCharsets.UTF_8))) {
        IOUtils.copy(in, out);
    }
}
I also tried using "MS932" and also tried not using IOUtils.
To read Shift_JIS files made with Windows, you have to use Charset.forName("Windows-31j") rather than Charset.forName("Shift_JIS").
Java distinguishes between Shift_JIS and Windows-31J. "Shift_JIS" in documentation for Windows means Windows-31J (MS932) in Java. On the other hand, "Shift_JIS" in documentation for AIX means Shift_JIS in Java.
The character mappings of Windows-31J and Shift_JIS differ slightly. For example, ~ (0x8160 in Shift_JIS) is mapped to U+301C by Shift_JIS and to U+FF5E by Windows-31J. The Microsoft IME uses U+FF5E (FULLWIDTH TILDE) to represent the character ~.
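Following this suggestion, the fix is just a different charset name in your existing copy loop; a minimal sketch (charset name lookup is case-insensitive, and MS932 is an alias of windows-31j):
try (BufferedReader in = new BufferedReader(new InputStreamReader(
         new FileInputStream(inFile), Charset.forName("windows-31j")));
     BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
         new FileOutputStream(outFile), StandardCharsets.UTF_8))) {
    IOUtils.copy(in, out); // 0x8160 now decodes to U+FF5E instead of U+301C
}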

StreamDecoder vs InputStreamReader when reading malformed files

I came across some strange behavior with reading files in Java 8 and I'm wondering if someone can make sense of it.
Scenario:
Reading a malformed text file. By malformed I mean that it contains bytes that do not map to any Unicode code points.
The code I use to create such a file is as follows:
byte[] text = new byte[1];
char k = (char) -60;
text[0] = (byte) k;
FileUtils.writeByteArrayToFile(new File("/tmp/malformed.log"), text);
This code produces a file that contains exactly one byte, which is not part of the ASCII table (nor the extended one).
Attempting to cat this file produces the following output:
�
Which is the Unicode replacement character. This makes sense because UTF-8 needs 2 bytes in order to decode non-ASCII characters, but we only have one. This is the behavior I expect from my Java code as well.
Pasting some common code:
private void read(Reader reader) throws IOException {
    CharBuffer buffer = CharBuffer.allocate(8910);
    buffer.flip();
    // move existing data to the front of the buffer
    buffer.compact();
    // pull in as much data as we can from the socket
    int charsRead = reader.read(buffer);
    // flip so the data can be consumed
    buffer.flip();
    ByteBuffer encode = Charset.forName("UTF-8").encode(buffer);
    byte[] body = new byte[encode.remaining()];
    encode.get(body);
    System.out.println(new String(body));
}
Here is my first approach using nio:
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(Channels.newReader(inputStream.getChannel(), "UTF-8"));
This produces the following exception:
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.Reader.read(Reader.java:100)
Which is not what I expected, but it also kind of makes sense, because this is actually a corrupt and illegal file, and the exception is basically telling us it expected more bytes to be read.
And my second one (using regular java.io):
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(new InputStreamReader(inputStream, "UTF-8"));
This does not fail and produces the exact same output as cat did:
�
Which also makes sense.
So my questions are:
What is the expected behavior from a Java Application in this scenario?
Why is there a difference between using Channels.newReader (which returns a StreamDecoder) and simply using the regular InputStreamReader? Am I doing something wrong with how I read?
Any clarifications would be much appreciated.
Thanks :)
The difference in behaviour actually goes right down to the StreamDecoder and Charset classes. The InputStreamReader gets a CharsetDecoder from StreamDecoder.forInputStreamReader(..), which does replacement on error:
StreamDecoder(InputStream in, Object lock, Charset cs) {
    this(in, lock,
         cs.newDecoder()
           .onMalformedInput(CodingErrorAction.REPLACE)
           .onUnmappableCharacter(CodingErrorAction.REPLACE));
}
while Channels.newReader(..) creates the decoder with the default settings (i.e. report instead of replace, which results in an exception further up):
public static Reader newReader(ReadableByteChannel ch,
                               String csName) {
    checkNotNull(csName, "csName");
    return newReader(ch, Charset.forName(csName).newDecoder(), -1);
}
So they work differently, and there's no indication anywhere in the documentation about the difference. This is badly documented, but I suppose the behaviour differs because you'd rather get an exception than have your data silently corrupted.
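If you do want the replacing behaviour together with the channel API, you can presumably pass an explicitly configured decoder yourself; a sketch:
// Make the channel-based Reader replace bad input, like InputStreamReader does.
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(Channels.newReader(inputStream.getChannel(), decoder, -1));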
Be careful when dealing with character encodings!

Why is my String returning "\ufffd\ufffdN a m e"

This is my method
public void readFile3() throws IOException
{
    try
    {
        FileReader fr = new FileReader(Path3);
        BufferedReader br = new BufferedReader(fr);
        String s = br.readLine();
        int a = 1;
        while (a != 2)
        {
            s = br.readLine();
            a++;
        }
        Storage.add(s);
        br.close();
    }
    catch (IOException e)
    {
        System.out.println(e.getMessage());
    }
}
For some reason I am unable to read the file, which only contains this:
Name
Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
When I debug the code, the String s is being returned as "\ufffd\ufffdN a m e" and I have no clue as to where those extra characters are coming from. This is preventing me from properly reading the file.
\ufffd is the replacement character in Unicode; it is used when you try to read a code that has no representation in Unicode. I suppose you are on a Windows platform (or at least the file you read was created on Windows). Windows supports many formats for text files, the most common being ANSI: each character is represented by its ANSI code.
But Windows can also use UTF-16 directly, where each character is represented by its Unicode code point as a 16-bit integer, so with 2 bytes per character. Those files use special markers (Byte Order Mark, in Windows dialect) to say:
that the file is encoded with 2 (or even 4) bytes per character
the encoding is little or big endian
(Reference : Using Byte Order Marks on MSDN)
As you write, after the first two replacement characters you get N a m e and not Name, so I suppose you have a UTF-16 encoded text file. Notepad can transparently edit those files (without even telling you the actual format), but other tools do have problems with them...
The excellent vim can read files with different encodings and convert between them.
If you want to use this kind of file directly in Java, you have to use the UTF-16 charset. From the Java SE 7 javadoc on Charset: UTF-16 Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark.
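For illustration only (not part of the original answer), here is a rough way to sniff the byte order mark yourself before picking a charset; note that the plain "UTF-16" charset already does this for you, as the javadoc quote above says, and that when reading with an explicit UTF-16LE/BE charset the BOM still shows up as a leading U+FEFF you have to skip.
// Hedged sketch: peek at the first two bytes to spot a UTF-16 byte order mark.
static Charset detectCharset(File file) throws IOException {
    try (InputStream in = new FileInputStream(file)) {
        int b0 = in.read();
        int b1 = in.read();
        if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
        return Charset.defaultCharset(); // no BOM: assume the platform/ANSI case
    }
}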
You must specify the encoding when reading the file; in your case it is probably UTF-16.
Reader reader = new InputStreamReader(new FileInputStream(fileName), "UTF-16");
BufferedReader br = new BufferedReader(reader);
Check the documentation for more details: InputStreamReader class.
Check to see if the file is .odt, .rtf, or something other than .txt. This may be what's causing the extra UTF-16 characters to appear. Also, make sure that (even if it is a .txt file) your file is encoded in UTF-8 characters.
Perhaps you have UTF-16 characters such as '®' in your document.

Java Unicode to readable text conversion decoding

I am developing a Java application where I am consuming a web service. The web service is created using a SAP server, which encodes the data automatically in Unicode. I get a Unicode string from the web service.
"
倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2
"
Above is the response.
I want to convert it to readable text format like String. I am using core Java.
倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2
That's a PDF file that has been interpreted as UTF-16LE.
You need to look at what component is receiving the response and how it's dealing with the input to stop it being decoded as UTF-16LE, but ultimately there isn't a 'readable' version of it as such, as it's a binary file. Extracting the document text out of a PDF file is a much bigger problem!
(Note: Unicode is a character set, UTF-16LE is an encoding of that set into bytes. Microsoft call the UTF-16LE encoding "Unicode" due to a historical accident, but that's misleading.)
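If the bytes really were decoded as UTF-16LE without loss somewhere upstream, you can in principle invert that step by re-encoding the string with the same charset, though any replacement characters introduced during decoding are gone for good; a speculative sketch:
// Speculative: only recovers the PDF if the UTF-16LE decode didn't mangle anything.
String mangled = ...; // the string shown in the question
byte[] pdfBytes = mangled.getBytes(StandardCharsets.UTF_16LE);
Files.write(Paths.get("recovered.pdf"), pdfBytes);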
If you have byte[] or an InputStream (both binary data) you can get a String or a Reader (both text) with:
final String encoding = "UTF-8"; // or "UTF-16LE" or "UTF-16BE"
byte[] b = ...;
String s = new String(b, encoding);
InputStream is = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding));
for (;;) {
    String line = reader.readLine();
    if (line == null) {
        break; // end of stream
    }
}
The reverse process uses:
byte[] b = s.getBytes(encoding);
OutputStream os = ...;
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, encoding));
writer.write(s);
Unicode is a numbering system for all characters. The UTF variants implement Unicode as bytes.
Your problem:
In the normal case (a web service), you would already have received a String. You could write that string to a file using the Writer above, for instance, either to check it yourself with a full Unicode font or to pass the file on for a check.
You need(?) to check which UTF variant the text is in. For Asiatic scripts, UTF-16 (little endian or big endian) is optimal. In XML it would already be declared.
Addition:
FileWriter writes to a file using the default encoding (from the operating system on your machine). Instead use:
new OutputStreamWriter(new FileOutputStream(new File("...")), "UTF-8")
If it is a binary PDF, as @bobince said, just use a FileOutputStream on a byte[] or InputStream.
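Something along these lines (a sketch; pdfBytes stands for whatever binary content you actually received, and "out.pdf" is just an example path):
// Binary data goes straight to a FileOutputStream, no Writer and no String.
byte[] pdfBytes = ...;
try (OutputStream os = new FileOutputStream("out.pdf")) {
    os.write(pdfBytes);
}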
This is definitely not a valid string. This looks like mangled UTF-16.
UPDATE
Indeed @bobince is right: this is a PDF file (most probably UTF-8 or plain ASCII) displayed as UTF-16. When displayed as UTF-8, this string indeed shows PDF source code. Good catch.
