How to read PDF from the .jar file

In my Maven project I have a PDF file located inside the resources folder. My function reads the PDF file from the resources folder and adds some values to the document based on the user's data.
This project is packaged as a .jar file using mvn clean install and is used as a dependency in my other Spring Boot application.
In my Spring Boot project I create an instance of the class that performs the work on the PDF. Once all the work on the PDF file is done and the file is saved to the file system, it is always empty (all pages are blank). I have the impression that mvn clean install does something to the PDF file. Here is what I've tried so far:
First way
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
File file= new ClassPathResource("/pdfs/testpdf.pdf").getFile();//Try to get PDF file
PDDocument pdf = PDDocument.load(file);//Load PDF document from the file
List<PDField> fields = forms.getFields();//Get input fields that I want to update in the PDF
fieldsMap.forEach(throwingConsumerWrapper((field,value) -> changeField(fields,field,value)));//Set input field values
pdf.save(byteArrayOutputStream);//Save value to the byte array
This works great, but as soon as the project is packaged in a .jar file I get an exception saying that new ClassPathResource("/pdfs/testpdf.pdf").getFile(); can't find the specified file.
This is expected, because the File class can't access anything inside a .jar file (it can only access the .jar file itself).
So, the solution to that problem is to use the InputStream instead of the File. Here is what I did:
Second way
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
InputStream inputStream = new ClassPathResource("/pdfs/testpdf.pdf").getInputStream();//Try to get input stream
PDDocument pdf = PDDocument.load(inputStream );//Load PDF document from the input stream
List<PDField> fields = forms.getFields();//Get input fields that I want to update in the PDF
fieldsMap.forEach(throwingConsumerWrapper((field,value) -> changeField(fields,field,value)));//Set input field values
pdf.save(byteArrayOutputStream);//Save value to the byte array
This time getInputStream() doesn't throw an error and the inputStream object is not null. But the PDF file, once saved on my file system, is empty, meaning all pages are blank.
I even tried copying the complete inputStream to a file byte by byte, but I noticed that every byte is equal to 0. Here is what I did:
Third way
InputStream inputStream = new ClassPathResource("/pdfs/test.pdf").getInputStream();
byte[] buffer = new byte[inputStream.available()];
inputStream.read(buffer);
File targetFile = new File(OUTPUT_FOLDER);
OutputStream outStream = new FileOutputStream(targetFile);
outStream.write(buffer);
The copied test.pdf is saved, but Adobe Reader reports it as corrupted when I open it.
Does anyone have an idea how to fix this?

You have to load it like this:
InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("pdfs/testpdf.pdf");
If you load it via the ClassLoader, the path starts at the root of the classpath, so it must not begin with a slash.
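The naming rule for ClassLoader lookups can be tried out with a small sketch. It builds a throwaway jar in the temp directory and looks one entry up through a URLClassLoader. The class name ClassLoaderPathDemo and the entry name are made up for illustration and are not part of the question's project.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

final class ClassLoaderPathDemo {

    /** Builds a temporary jar holding one entry, then looks that entry up by the given name. */
    static boolean foundInJar(String entryName, String lookupName) {
        try {
            Path jar = Files.createTempFile("demo", ".jar");
            jar.toFile().deleteOnExit();

            // Write a jar with a single (fake) PDF entry
            try (OutputStream out = Files.newOutputStream(jar);
                 JarOutputStream jos = new JarOutputStream(out)) {
                jos.putNextEntry(new JarEntry(entryName));
                jos.write("%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
                jos.closeEntry();
            }

            // Look the entry up the same way the answer does, via a ClassLoader
            try (URLClassLoader loader = new URLClassLoader(new URL[] { jar.toUri().toURL() });
                 InputStream in = loader.getResourceAsStream(lookupName)) {
                return in != null; // null means "not found"
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling foundInJar("pdfs/demo-only.pdf", "pdfs/demo-only.pdf") finds the entry, while the same lookup with a leading slash does not.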

After a few hours of investigation and good input from @Simon Martinelli and @Tilman Hausherr, I found I had 2 issues to solve:
First issue - Read the file correctly
In order to read a file from the resources folder you have to use the appropriate classes. As stated above, you can't use the File class to read a file from inside the .jar, so I used the following construction in my case:
InputStream inputStream = CreatePDF.class.getResourceAsStream("/pdfs/test.pdf");
PDDocument pdf = PDDocument.load(inputStream);
In my case the calling method in CreatePDF is static, so I reference the class directly. From a non-static (instance) method you can use the following:
InputStream inputStream = this.getClass().getResourceAsStream("/pdfs/test.pdf");
PDDocument pdf = PDDocument.load(inputStream);
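One thing worth guarding against: getResourceAsStream returns null instead of throwing when the resource is missing. Below is a minimal sketch of a helper that turns that into a clear error and reads the whole stream. The class name ResourceBytes is hypothetical, and readAllBytes assumes Java 9 or newer.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ResourceBytes {

    /**
     * Reads a classpath resource completely.
     * The path is absolute, e.g. "/pdfs/test.pdf", resolved from the classpath root.
     */
    static byte[] read(Class<?> anchor, String absolutePath) {
        try (InputStream in = anchor.getResourceAsStream(absolutePath)) {
            if (in == null) {
                // A missing resource is reported as null, not as an exception
                throw new FileNotFoundException("Resource not on classpath: " + absolutePath);
            }
            return in.readAllBytes(); // reads until end of stream (Java 9+)
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned bytes can be wrapped in a ByteArrayInputStream and passed to PDDocument.load(inputStream) as above.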
Second issue - My original problem
One thing I noticed in the third example of my question: when I copied the file byte by byte from the resources to my local folder, all bytes were equal to 0. I knew this couldn't be correct, so I tried the same thing with a simple .txt file, and in that case everything worked correctly. This suggested that mvn clean install was causing problems with PDF files.
After some investigation I realized that Maven resource filtering was causing the problem. If resource filtering is enabled:
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
</resource>
then your binary data is going to be corrupted, and that was my original problem. When I set it to false, it worked as expected.
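To see why a text-oriented step can damage binary data, here is a small sketch. It assumes the filter reads the file as text and writes it back. The exact Maven internals are not shown here, and the class name TextRoundTrip and the sample bytes are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TextRoundTrip {

    /** Decodes the bytes as text in the given charset and encodes them again. */
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    /** True if the bytes come back unchanged from the decode/encode round trip. */
    static boolean survives(byte[] data, Charset charset) {
        return Arrays.equals(data, roundTrip(data, charset));
    }

    /** A PDF-like sample: an ASCII header followed by bytes that are not valid UTF-8. */
    static byte[] samplePdfStart() {
        return new byte[] { '%', 'P', 'D', 'F', '-', '1', '.', '4', '\n',
                (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3 };
    }
}
```

Plain text such as "key=value" survives the UTF-8 round trip, while the PDF-like sample does not: the invalid byte sequences are replaced during decoding, so the written bytes differ from the original.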
Here is the warning from the Maven documentation:
Warning: Do not filter files with binary content like images! This
will most likely result in corrupt output.
If you have both text files and binary files as resources it is
recommended to have two separated folders. One folder
src/main/resources (default) for the resources which are not filtered
and another folder src/main/resources-filtered for the resources which
are filtered.
Here is an example of how you could do it within a single folder, using includes:
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
<includes>
<include>**/*.properties</include>
<include>**/*.xml</include>
<include>**/*.txt</include>
<include>**/*.html</include>
</includes>
</resource>
<resource>
<directory>src/main/resources</directory>
<filtering>false</filtering>
<includes>
<include>**/*.pdf</include>
</includes>
</resource>

Related

Custom maven plugin - file not found

I created a simple Maven plugin that generates files automatically. The files are created by reading templates located inside the plugin, in the resources folder (src/main/resources). When I use the plugin outside of an application it works correctly. If I try to run the plugin within an existing project, it cannot find the template files. How can I get around this?
I copy the resources into the jar in this way:
<resources>
<resource>
<directory>src/main/resources</directory>
<includes>
<include>**/*.txt</include>
</includes>
</resource>
</resources>
I use this code to read the file:
ClassLoader classLoader = getClass().getClassLoader();
String x = classLoader.getResource("template/" + fileName).getFile();
File file = new File(x);
The value of x is correct:
file:\C:\Users\myname\.m2\repository\com\ciro\myapp\0.0.1-SNAPSHOT\myapp-0.0.1-SNAPSHOT.jar!\template\assembler.txt
but when I execute
new File(x)
I get
file:\C:\Users\myname.m2\repository\com\ciro\myapp\0.0.1-SNAPSHOT\myapp-0.0.1-SNAPSHOT.jar!\template\
The syntax of the file, directory, or volume name is incorrect

I solved it this way:
InputStream in = getClass().getResourceAsStream("/template/" + fileName);
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String line = reader.readLine();
while (line!=null){
data.add(line);
line= reader.readLine();
}
following this link
Read all lines with BufferedReader
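The same loop can be written more compactly with try-with-resources so the reader is always closed. This is only a sketch: the class name TemplateLines is hypothetical, and UTF-8 is an assumption about how the templates are encoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

final class TemplateLines {

    /** Reads every line from the stream and closes it afterwards. */
    static List<String> readAll(InputStream in) {
        // Assumption: templates are UTF-8; an explicit charset avoids platform-dependent defaults
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the plugin it would be called as TemplateLines.readAll(getClass().getResourceAsStream("/template/" + fileName)), after checking that the stream is not null.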

Trying to compare generated inputStream with resource file - jUnit

I'm trying to compare the inputStream from a resource file with a created inputStream.
I'm doing that the following way:
InputStream isAsJpg = Thread.currentThread().getContextClassLoader()
.getResourceAsStream("koala.jpg");
InputStream returnedIs = ImageUtil.convertImageStreamToPdfStream(isAsJpg);
//get is from src/test/resources
InputStream expectedIs = Thread.currentThread().getContextClassLoader()
.getResourceAsStream("koala.pdf");
and for my tests I'm calling:
assertTrue(IOUtils.contentEquals(expectedIs, returnedIs));
but it returns false. Therefore I started creating files so that I could manually check whether the file is empty or something. So I added:
File tempFile = File.createTempFile("koala", ".pdf");
tempFile.deleteOnExit();
try (FileOutputStream out = new FileOutputStream(tempFile)) {
IOUtils.copy(returnedIs, out);
}
and I checked the content of the file manually and it seems OK. Then I wanted to create a file from the resource in the same way to check its content, and that PDF was empty, although it is placed in the src/test/resources directory and is not empty when I open it there.
What am I doing wrong? It seems as if I'm not getting the resource (koala.pdf) the correct way, but I can't find the error.
EDIT:
When I go and look to
C:..\target\test-classes
the file is there, but.. it is empty (blank page). Although when I open it from
C:..\src\test\resources
it is not empty. How can that be??

I've found a solution.
Maven can be told to replace placeholders of the form ${..} in resource files (resource filtering). My PDF is binary, so running it through that text filter corrupted the file.
I've changed the filtering in my pom:
<testResources>
<testResource>
<directory>src/test/resources</directory>
<filtering>false</filtering>
</testResource>
</testResources>
and the file in the target/test-classes also contains the image now.
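A quick way to confirm that the build copied a resource untouched is to compare the source file with the copy byte for byte. A minimal sketch, with the hypothetical class name CopyCheck:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class CopyCheck {

    /** True if both files have exactly the same bytes. */
    static boolean identical(Path source, Path copied) {
        try {
            // Fine for test resources; very large files would call for a streaming comparison
            return Arrays.equals(Files.readAllBytes(source), Files.readAllBytes(copied));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the case above that would be CopyCheck.identical(Paths.get("src/test/resources/koala.pdf"), Paths.get("target/test-classes/koala.pdf")), run from the project directory after the build.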

Does this line
InputStream returnedIs = ImageUtil.convertImageStreamToPdfStream(isAsJpg);
actually write to a file? It doesn't seem like it. It seems like you should use the InputStream returned to write to a corresponding OutputStream. Then continue as you were.

iText Error: java.io.IOException: trailer not found

I'm creating a web application which will fill a PDF form using iText. To create the PDF forms I'm first using Microsoft Word to create a template, saving it, then opening that file in Adobe Acrobat Xi Pro, adding my form fields, and saving it as a PDF. The problem is the PDF is not saving with a trailer so when I execute this:
PdfReader reader = new PdfReader(templateName);
It throws an exception "java.io.IOException: trailer not found". I know I can read a PDF if it has a trailer because I've tried reading other PDFs. So it appears the issue is that Acrobat is not adding a trailer to my PDF. Even if I try creating a PDF form from scratch in Acrobat it is not saved with a trailer.
Has anyone else run into this problem? Is there some setting in Acrobat that will add the trailer? Is there a way to get iText to read it without the trailer?
====UPDATE====
I must have had an old version of iText because when I downloaded the latest version I was able to read my PDF file. However after reading the file and stamping it I got an exception closing the stamper. The code looks like this:
PdfReader reader = new PdfReader(templateName);
FileOutputStream os = new FileOutputStream(outputPath);
PdfStamper stamper = new PdfStamper(reader, os);
AcroFields acroFields = stamper.getAcroFields();
List<String> fields = getFieldNames(getContextCd());
for (String field : fields) {
acroFields.setField (field, StringUtil.checkEmpty(request.getParameter(field)));
}
stamper.setFormFlattening(true);
stamper.close();
The error I got was:
java.lang.AbstractMethodError: javax.xml.parsers.DocumentBuilderFactory.setFeature(Ljava/lang/String;Z)V
at com.itextpdf.xmp.impl.XMPMetaParser.createDocumentBuilderFactory(XMPMetaParser.java:423)
at com.itextpdf.xmp.impl.XMPMetaParser.(XMPMetaParser.java:71)
at com.itextpdf.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:167)
at com.itextpdf.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:153)
at com.itextpdf.text.pdf.PdfStamperImp.close(PdfStamperImp.java:337)
at com.itextpdf.text.pdf.PdfStamper.close(PdfStamper.java:208)
The only jar file I added to my classpath is itextpdf-5.5.2.jar. Do I need any of the other jars?

My solution should work under the following assumptions.
Assumption 1: the PDF appears to be missing its trailer; it is placed under resources/xxx and is moved to classes/xxx by the build.
Assumption 2: you are reading the PDF from the classpath and creating a PdfReader object from that file path.
If the above assumptions are right, below is the solution followed by the reasoning:
Go to your pom file, or whatever build configuration file you have, and change the settings to exclude .pdf files from being filtered during the build. We want the PDF to move from resources/xxx to classes/xxx without any manipulation by the build. Something like below:
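<build>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
<excludes>
<exclude>**/*.pdf</exclude>
</excludes>
</resource>
<!-- copy the PDFs unfiltered; without this block the exclude above would leave them out of the build -->
<resource>
<directory>src/main/resources</directory>
<filtering>false</filtering>
<includes>
<include>**/*.pdf</include>
</includes>
</resource>
</resources>
</build>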
Explanation needed? Ok.
When resources are copied from resources/xxx to classes/xxx with filtering enabled, every file is processed as text, whatever its type, so that placeholders such as ${...} can be replaced. A PDF is binary, so this rewriting alters its bytes, and the structures a reader relies on, such as the trailer and the EOF marker, may no longer be found.
If the PDF is left out of filtering, it is copied byte for byte and the same file can be read without the trailer error.
Hope this helps!
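To check whether the copy on the classpath still looks like a PDF before handing it to a reader, a rough sanity check can help. This sketch only checks for the "%PDF-" header and an "%%EOF" marker near the end. It is a heuristic and not a validator: the class name PdfMarkers and the 1024-byte window are arbitrary choices for illustration.

```java
import java.nio.charset.StandardCharsets;

final class PdfMarkers {

    /** True if the data starts with "%PDF-" and "%%EOF" appears within the last 1024 bytes. */
    static boolean looksIntact(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so searching the string is safe for binary data
        String text = new String(data, StandardCharsets.ISO_8859_1);
        if (!text.startsWith("%PDF-")) {
            return false;
        }
        String tail = text.substring(Math.max(0, text.length() - 1024));
        return tail.contains("%%EOF");
    }
}
```

If the file in the source folder passes the check and the one under the build output folder fails, the build is the likely culprit rather than Acrobat.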

PDFbox loading large files

I'm trying to convert the first page of a PDF file to an image using PDFBox.
When I load a large PDF file I get an exception.
code:
PDDocument doc;
try {
InputStream input = new URL("http://www.jewishfederations.org/local_includes/downloads/39497.pdf").openStream();
doc = PDDocument.load(input);
PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
BufferedImage image =firstPage.convertToImage();
File outputfile = new File("image2.png");
ImageIO.write(image, "png", outputfile);
input.close();
doc.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
exception:
org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 72435 is wrong. Fall back to reading stream until 'endstream'.
org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72435 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:554)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
at Worker.main(Worker.java:27)
Caused by: java.io.IOException: Push back buffer is full
at java.io.PushbackInputStream.unread(Unknown Source)
at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:550)
... 5 more

An alternative solution for the 1.8.* PDFBox versions is to use the non-sequential parser. In that case, the code would not be
doc = PDDocument.load(input);
but
doc = PDDocument.loadNonSeq(input, null);
That parser (which will be the only one in the upcoming 2.0 version) does not depend on the size of a pushback buffer.

First, find the current buffer size:
System.out.println(System.getProperty("org.apache.pdfbox.baseParser.pushBackSize"));
Now that you have a baseline, do exactly what it suggests. Increase the buffer size above what you just printed out using this:
System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "<buffer size>");
Keep increasing the buffer size until it works. Hopefully you won't run out of memory; if you do, increase the heap.
This is how you set system properties at runtime. You could also pass it as a JVM argument, but I find that setting it near the beginning of main does the trick and makes the project easier for future developers to maintain.
For whatever reason, with large files the buffer isn't big enough to load the page. Maybe the page is loaded into a buffer before or while it's rendered into an image. My guess is that the DPI in the PDF is very high and the data can't fit in the buffer.
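Note that System.getProperty returns null when the property was never set, so the println above may just print "null". Here is a small sketch of setting the property defensively. The class name PushBackSize is hypothetical, and I'm assuming the call has to run before PDFBox is first used.

```java
final class PushBackSize {

    static final String PROPERTY = "org.apache.pdfbox.baseParser.pushBackSize";

    /**
     * Raises the property to at least the requested size and never lowers
     * an existing larger value. Returns the value now in effect.
     */
    static int ensureAtLeast(int bytes) {
        Integer current = Integer.getInteger(PROPERTY); // null when unset or not a number
        if (current == null || current < bytes) {
            System.setProperty(PROPERTY, Integer.toString(bytes));
            return bytes;
        }
        return current;
    }
}
```

The command-line equivalent is a JVM argument of the form -Dorg.apache.pdfbox.baseParser.pushBackSize=<buffer size>.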

I had a similar issue, which I thought was related to a large pdf file based on the error, however it turned out it was not. It turned out to be a corrupt pdf file.
For our use case, we had a PDF template file (whose form values we populate programmatically) as a resource in our project that is bundled into our war.
The exception I was seeing for reference: org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 480478 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize. We added the property and then ran things again and we got a different issue.
The next stack trace stated "Could not read embedded TTF for font TimesNewRoman,Bold". It took us a while, however after exploding the war and trying to open the pdf file in the war, we noticed that it was corrupt, but the pdf file that was in source was not corrupt and could be opened without issues.
The root cause of our issue was that we added "filtering" in our pom for our resource folder. We did this so that we could use some reflection to get some values in our health check page, but that corrupted the pdf file, which we figured out from the following reference: https://bitbucket.org/petermr/xhtml2stm/issues/12/pdf-files-are-being-corrupted-at-some
Below is an example of the filtering we setup that bit us:
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
</resource>
</resources>
Our solution was to remove this from our pom and rework how we got the information for our health page.

In the 2.0.* versions, open the PDF like this:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This sets up buffering to use only temporary file(s) (no main memory), with unrestricted size.
Good Luck

java FileInputStream - differences based on how the File object is referenced: classloader/filesystem

I'm using apache POI to extract some data from an excel file.
I need an InputStream to instantiate the POI HSSFWorkbook class HSSFWorkbook wb = new HSSFWorkbook(inputStreamX);
I'm finding differences if I try to construct the InputStream object like
InputStream inputStream = new FileInputStream(new File("/home/xxx/workspace/myproject/test/resources/importTest.xls"));
InputStream inputStream2 = new FileInputStream(getClass().getResource("/importTest.xls").getFile());
InputStream inputStream3 = new ClassPathResource("importTest.xls").getInputStream();
If I construct the POI object with inputStream it works fine.
But inputStream2 and inputStream3 are throwing this exception
java.io.IOException: Invalid header signature; read -2300849302551019537, expected -2226271756974174256
at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:100)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:84)
It seems that the header of the binary file is different and the library can't recognize it as an Excel file. I can't understand why.
The only difference I see is that inputStream2 & 3 are using the classloader to locate the file. (ClassPathResource is a Spring class).
I'd like to have the file path separated from the system. So I would prefer something like inputStream2 or 3.
Do you have any idea on why this is happening?
Thank you
Update:
I tried writing to disk the inputStream and inputStream2. The excel file that comes with inputStream is Ok. inputStream2 contains an excel file with some strange characters that wrap the real content.
It seems that maven corrupts the excel file in some way during the build.
So it's basically the file I retrieve with the classLoader (under /home/xxx/workspace/myproject/target/test-classes/importTest.xls) that is not ok.
Any idea?

The problem seems to be Maven's filtering option. Suppose the pom looks like this:
<testResource>
<directory>${basedir}/src/test/resources</directory>
<includes>
<include>**/*.xml</include>
<include>**/*.properties</include>
<include>**/*.sql</include>
<include>**/*.xls</include>
</includes>
<filtering>true</filtering>
</testResource>
With filtering set to true for .xls files, Maven corrupts them.
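The two numbers in the exception can be reproduced from the file header. The "expected" value is the first eight bytes of an OLE2 file (D0 CF 11 E0 A1 B1 1A E1) read as a little-endian long. Decoding those bytes as UTF-8 and encoding them again gives exactly the value that was read. That points to the file having been filtered as UTF-8 text, but this is an inference from the numbers and the question does not say so. The class name Ole2Signature is made up for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class Ole2Signature {

    /** The first eight bytes of an OLE2 container such as an .xls file. */
    static final byte[] HEADER = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };

    /** Reads the first eight bytes as a little-endian long, the form the exception prints. */
    static long signatureOf(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    /** Simulates a text filter: decode as UTF-8, then encode again. */
    static byte[] utf8RoundTrip(byte[] data) {
        return new String(data, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }
}
```

signatureOf(HEADER) gives -2226271756974174256, and signatureOf(utf8RoundTrip(HEADER)) gives -2300849302551019537, matching the "expected" and "read" values in the stack trace.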

Have you tried ClassLoader#getResourceAsStream(String)? It will probably behave similarly to your second attempt using Class#getResource(String), as alluded to in the latter's documentation.
My first thought here was that no such file was found, but if it's consistently reading the same value (-2300849302551019537) each time you run the program, that suggests there really is a file there that's being read. Trap the statement after you initialize your InputStream and inspect the stream instance in the debugger. You should be able to find a reference to the underlying file name. To make this easier at first, try using ClassLoader#getResources(String) and inspect the sequence of URLs returned.
