PDFBox loading large files - Java

I'm trying to convert the first page of a PDF file to an image using PDFBox.
When I'm loading a large PDF file, I get an exception.
Code:
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

PDDocument doc;
try {
    InputStream input = new URL("http://www.jewishfederations.org/local_includes/downloads/39497.pdf").openStream();
    doc = PDDocument.load(input);
    PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    BufferedImage image = firstPage.convertToImage();
    File outputfile = new File("image2.png");
    ImageIO.write(image, "png", outputfile);
    input.close();
    doc.close();
} catch (IOException e) {
    e.printStackTrace();
}
Exception:
org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 72435 is wrong. Fall back to reading stream until 'endstream'.
org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72435 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:554)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
at Worker.main(Worker.java:27)
Caused by: java.io.IOException: Push back buffer is full
at java.io.PushbackInputStream.unread(Unknown Source)
at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:550)
... 5 more

An alternative solution for the 1.8.* PDFBox versions is to use the non-sequential parser. In that case, the code would not be
doc = PDDocument.load(input);
but
doc = PDDocument.loadNonSeq(input, null);
That parser (which will be the only one in the upcoming 2.0 version) does not depend on the size of the pushback buffer.
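For context, a minimal end-to-end sketch of the 1.8.x non-sequential path (same URL and output file as the question; imports as in the question's code):

InputStream input = new URL("http://www.jewishfederations.org/local_includes/downloads/39497.pdf").openStream();
// the second argument is an optional RandomAccess scratch file; null keeps parsing buffers in memory
PDDocument doc = PDDocument.loadNonSeq(input, null);
try {
    PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    BufferedImage image = firstPage.convertToImage();
    ImageIO.write(image, "png", new File("image2.png"));
} finally {
    doc.close();
    input.close();
}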

First, find the current buffer size:
System.out.println(System.getProperty("org.apache.pdfbox.baseParser.pushBackSize"));
Now that you have a baseline, do exactly what the exception message suggests: increase the buffer size beyond what you just printed, using this:
System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "<buffer size>");
Keep increasing the buffer size until it works. Hopefully you won't run out of memory; if you do, increase the heap size.
This is how you set system properties at runtime. You could also pass the property as a command-line argument, but I find that setting it near the beginning of main does the trick and makes the project easier for future developers to maintain.
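A minimal sketch of that placement (the value 1000000 is an arbitrary starting point to tune, not a recommendation):

public static void main(String[] args) {
    // must run before any PDF is parsed
    System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "1000000");
    // ... the PDDocument.load(...) code from the question follows here ...
}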
For whatever reason, with large files you don't have a big enough buffer to load the page. Maybe the page is loaded into a buffer before or while it is rendered into an image. My guess is that the DPI of the PDF is very high and the page can't fit in the buffer.

I had a similar issue, which, based on the error, I thought was related to a large PDF file; however, it turned out it was not. It turned out to be a corrupt PDF file.
For our use case, we had a PDF template file (whose form values we populate programmatically) as a resource in our project that is baked into our WAR.
The exception I was seeing, for reference: org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 480478 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize. We added the property and ran things again, and we got a different issue.
The next stack trace stated "Could not read embedded TTF for font TimesNewRoman,Bold". It took us a while, but after exploding the WAR and trying to open the PDF file inside it, we noticed that it was corrupt, while the PDF file in source control was not corrupt and could be opened without issues.
The root cause of our issue was that we added "filtering" in our pom for our resources folder. We did this so that we could use some reflection to get some values in our health check page, but it corrupted the PDF file, which we figured out from the following reference: https://bitbucket.org/petermr/xhtml2stm/issues/12/pdf-files-are-being-corrupted-at-some
Below is an example of the filtering we set up that bit us:
<resources>
  <resource>
    <directory>src/main/resources</directory>
    <filtering>true</filtering>
  </resource>
</resources>
Our solution was to remove this from our pom and rework how we got the information for our health page.
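For completeness, a variant we did not take (my sketch, not part of the original fix) would be to keep filtering enabled for text resources but exclude binary files such as PDFs from it:

<resources>
  <resource>
    <directory>src/main/resources</directory>
    <filtering>true</filtering>
    <excludes>
      <exclude>**/*.pdf</exclude>
    </excludes>
  </resource>
</resources>

Either way, the key point is that the PDF must reach the WAR byte-for-byte untouched.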

In the 2.0.* versions, open the PDF like this:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This sets up buffering to use only temporary file(s) (no main memory), with no restriction on size.
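A slightly fuller 2.0.x sketch under that setting (in 2.0, rendering moved to the PDFRenderer class; the local file name here is illustrative):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

PDDocument doc = PDDocument.load(new File("large.pdf"), MemoryUsageSetting.setupTempFileOnly());
try {
    PDFRenderer renderer = new PDFRenderer(doc);
    BufferedImage image = renderer.renderImage(0); // zero-based page index
    ImageIO.write(image, "png", new File("image2.png"));
} finally {
    doc.close();
}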
Good luck!

Related

iText Error: java.io.IOException: trailer not found

I'm creating a web application which will fill a PDF form using iText. To create the PDF forms, I first use Microsoft Word to create a template, save it, then open that file in Adobe Acrobat XI Pro, add my form fields, and save it as a PDF. The problem is that the PDF is not saved with a trailer, so when I execute this:
PdfReader reader = new PdfReader(templateName);
It throws an exception "java.io.IOException: trailer not found". I know I can read a PDF if it has a trailer because I've tried reading other PDFs. So it appears the issue is that Acrobat is not adding a trailer to my PDF. Even if I try creating a PDF form from scratch in Acrobat it is not saved with a trailer.
Has anyone else run into this problem? Is there some setting in Acrobat that will add the trailer? Is there a way to get iText to read it without the trailer?
====UPDATE====
I must have had an old version of iText, because when I downloaded the latest version I was able to read my PDF file. However, after reading the file and stamping it, I got an exception when closing the stamper. The code looks like this:
PdfReader reader = new PdfReader(templateName);
FileOutputStream os = new FileOutputStream(outputPath);
PdfStamper stamper = new PdfStamper(reader, os);
AcroFields acroFields = stamper.getAcroFields();
List<String> fields = getFieldNames(getContextCd());
for (String field : fields) {
    acroFields.setField(field, StringUtil.checkEmpty(request.getParameter(field)));
}
stamper.setFormFlattening(true);
stamper.close();
The error I got was:
java.lang.AbstractMethodError: javax.xml.parsers.DocumentBuilderFactory.setFeature(Ljava/lang/String;Z)V
at com.itextpdf.xmp.impl.XMPMetaParser.createDocumentBuilderFactory(XMPMetaParser.java:423)
at com.itextpdf.xmp.impl.XMPMetaParser.<clinit>(XMPMetaParser.java:71)
at com.itextpdf.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:167)
at com.itextpdf.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:153)
at com.itextpdf.text.pdf.PdfStamperImp.close(PdfStamperImp.java:337)
at com.itextpdf.text.pdf.PdfStamper.close(PdfStamper.java:208)
The only jar file I added to my classpath is itextpdf-5.5.2.jar. Do I need any of the other jars?
My solution will work given two assumptions:
Assumption 1: the PDF missing the trailer is placed under resources/xxx, and after the build it is moved to classes/xxx.
Assumption 2: you are trying to read the PDF from the classpath and constructing a PdfReader object from that PDF file path.
If the above assumptions are right, below is the solution, followed by the reasoning:
Go to your pom file (or whatever build configuration file you have) and change the settings to exclude .pdf files from being filtered during the build. We want the PDF to move from resources/xxx to classes/xxx without any manipulation by build activities. Something like the below:
<build>
  <resources>
    <resource>
      <directory>src/main/resources</directory>
      <filtering>true</filtering>
      <excludes>
        <exclude>**/*.pdf</exclude>
      </excludes>
    </resource>
  </resources>
</build>
Explanation needed? OK.
When resources are moved from resources/xxx to classes/xxx, no matter what kind of files they are, they are all run through the build's filtering step, which processes them as text. A binary format like PDF, whose parser expects intact trailer and EOF markers, gets extra or altered bytes in the process, so any code that later tries to use the PDF sees a corrupt file.
If we exclude PDFs from the build's filtering, the same PDF will work, even though it appeared to be missing its trailer.
Hope this helps!

How can I get an image too big from a server?

I'm currently developing for BlackBerry and just bumped into this problem as I was trying to download an image from a server. The servlet the device communicates with is working correctly, as I have run a number of tests against it. But it gives me the
413 HTTP error ("Request entity too large").
I figure I will just get the bytes, uhm, portion by portion. How can I accomplish this?
This is the code of the servlet (the doGet() method):
try {
    ImageIcon imageIcon = new ImageIcon("c:\\Users\\dcalderon\\prueba.png");
    Image image = imageIcon.getImage();
    PngEncoder pngEncoder = new PngEncoder(image, true);
    output.write(pngEncoder.pngEncode());
} finally {
    output.close();
}
Thanks. It's worth mentioning that I am developing both the client-side and the server-side.
I'm not familiar with the server-side code, but you can look at this link to get an idea of how to upload a file using multipart requests to support big file transfers. It can also work on BlackBerry, with some modifications:
http://www.developer.nokia.com/Community/Wiki/HTTP_Post_multipart_file_upload_in_Java_ME
I'm not familiar with the PngEncoder class you're using, but just looking at your servlet code and the comment you made about the request size (2.2 MB), I'm guessing that part of your problem is that you're uncompressing the image and then transmitting it across the network.
I don't think you should have any PngEncoder or ImageIcon code in your servlet. You should just read the "c:\\Users\\dcalderon\\prueba.png" file in with a normal InputStream, as bytes, and then write those bytes to the servlet's output. I don't think it matters whether the file is a PNG image, an .mp3 file, or any other content (although you might need to set the content type to image/png).
So I would try transmitting the image compressed (as a .png, just as it's stored on disk), as sketched below. If that still doesn't work, then go with the suggestion to use multipart transmission.
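A minimal sketch of that doGet (the file path is from the question; the rest is illustrative, not the poster's actual servlet):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException {
    response.setContentType("image/png");
    InputStream in = new FileInputStream("c:\\Users\\dcalderon\\prueba.png");
    OutputStream out = response.getOutputStream();
    try {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // copy the compressed PNG bytes as-is
        }
    } finally {
        in.close();
    }
}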

Fastest way to access given lines of text file with and without using GZip and the Jar File (GZip in memory?)

I have a given number (5-7) of large UTF-8 text files (7 MB each). Decoded to Java's in-memory (UTF-16) representation, their size is about 15 MB each.
I need to load given parts of a given file. The files are known and do not change. I would like to access and load lines at a given place as fast as possible. I load these lines, add HTML tags, and display them in a JEditorPane. I know the bottleneck will be the JEditorPane's rendering of the generated HTML, but for now I would like to concentrate on the file access performance.
Moreover, the user can search for a given word in all the files.
For now, the code I use is:
private static void loadFile(String filename, int startLine, int stopLine) {
    // sb is a StringBuilder field defined elsewhere in the class
    try {
        FileInputStream fis = new FileInputStream(filename);
        InputStreamReader isr = new InputStreamReader(fis, "UTF8");
        BufferedReader reader = new BufferedReader(isr);
        // skip the lines before the requested range
        for (int j = 0; j < startLine; j++) {
            reader.readLine();
        }
        for (int j = startLine; j <= stopLine; j++) {
            //here I add HTML tags
            //or do string comparison in case of search by the user
            sb.append(reader.readLine());
        }
        reader.close();
    } catch (FileNotFoundException e) {
        System.out.println(e);
    } catch (IOException e) {
        System.out.println(e);
    }
}
Now my questions:
As the number of parts of each file is known (67 in my case, for each file), I could create 67 smaller files. It would be "faster" to load a given part, but slower when I do a search, as I must open each of the 67 files.
I have not done benchmarking, but my feeling is that opening 67 files for a search takes much longer than performing the empty reader.readLine() calls when loading a part of a single file.
So in my case it is better to have a single larger file. Do you agree with that?
If I put each large file in the resources, I mean in the Jar file, will the performance be worse? If yes, is it significantly worse?
And the related question is: what if I zip each file to save space? As far as I understand, a Jar file is simply a zip file.
I think I don't know how unzipping works. If I zip a file, will the file be decompressed in memory, or will my program be able to access the lines I need directly on disk?
Same for the Jar file: will it be decompressed in memory?
If unzipping is not done in memory, can someone edit my code to use a zip file?
Final question, and the most important for me: I could increase all the performance if everything were performed in memory, but due to Unicode and the quite large files, this could easily result in a heap of more than 100 MB. Is there a possibility of having the zip file loaded in memory and working on that? This would be fast and use only a little memory.
Summary of the questions:
1. In my case, is one large file better than plenty of small ones?
2. If files are zipped, is the unzip process (GZIPInputStream) performed in memory? Is the whole file unzipped in memory and then accessed, or is it possible to access it directly on disk?
3. If yes to question 2, can someone edit my code to do it?
4. MOST IMPORTANT: is it possible to have the zip file loaded in memory, and how?
I hope my questions are clear enough. ;-)
UPDATE: Thanks to Mike for the getResourceAsStream hint; I got it working.
Note that benchmarking shows that loading the gzipped file is efficient, but in my case it is too slow:
~200 ms for the gzip file
~125 ms for the standard file, which is thus 1.6 times faster.
Assuming that the resource folder is called resources:
private static void loadFile(String filename, int startLine, int stopLine) {
    // sb is a StringBuilder field defined elsewhere; replace MyClass with the
    // enclosing class name (this.class is not valid Java in a static method)
    try {
        GZIPInputStream zip = new GZIPInputStream(MyClass.class.getResourceAsStream("resources/" + filename));
        InputStreamReader isr = new InputStreamReader(zip, "UTF8");
        BufferedReader reader = new BufferedReader(isr);
        // skip the lines before the requested range
        for (int j = 0; j < startLine; j++) {
            reader.readLine();
        }
        for (int j = startLine; j <= stopLine; j++) {
            //here I add HTML tags
            //or do string comparison in case of search by the user
            sb.append(reader.readLine());
        }
        reader.close();
    } catch (FileNotFoundException e) {
        System.out.println(e);
    } catch (IOException e) {
        System.out.println(e);
    }
}
If the files really aren't changing very often, I would suggest using some other data structures. Creating a hash table of all the words and the locations where they show up would make searching much faster; creating an index of all the line start positions (as sketched below) would make loading a given part much faster.
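A small sketch of that line-start index, as an illustration (the file name is made up; RandomAccessFile.readLine consumes bytes up to each newline, which is enough to record byte offsets even though it does not decode UTF-8):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndexDemo {
    public static void main(String[] args) throws IOException {
        RandomAccessFile raf = new RandomAccessFile("big.txt", "r");
        List<Long> offsets = new ArrayList<Long>();
        offsets.add(0L); // line 0 starts at byte 0
        while (raf.readLine() != null) {
            offsets.add(raf.getFilePointer()); // byte offset where the next line begins
        }
        // Later: jump straight to, say, line 42 without reading the lines before it.
        raf.seek(offsets.get(42));
        System.out.println(raf.readLine());
        raf.close();
    }
}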
But, to answer your questions more directly:
Yes, one large file is probably still better than many small files; I doubt that reading a line and decoding from UTF-8 will be noticeable compared to opening many files, or decompressing many files.
Yes, the unzipping process is performed in memory, and on the fly. It happens as you request data, but it acts as a buffered stream and decompresses entire blocks at a time, so it is actually very efficient.
I can't fix your code directly, but I can suggest looking up getResourceAsStream:
http://docs.oracle.com/javase/6/docs/api/java/lang/Class.html#getResourceAsStream%28java.lang.String%29
This function will open a file that is in a zip / jar file and give you access to it as a stream, automatically decompressing it in memory as you use it.
If you treat it as a resource, Java will do it all for you. You will have to read up on some of the specifics of handling resources, but Java should handle it fairly intelligently.
I think it would be quicker for you to load the file(s) into memory. You can then zip around to whatever part of the file you need.
Take a look at RandomAccessFile for this.
The GZIPInputStream reads the file into memory as a buffered stream.
That's another question entirely :)
Again, the zip file will be decompressed in memory, depending on what class you use to open it.

Java file IO and "access denied" errors

I have been tearing my hair out over this, and thus I am looking for some help.
I have a loop of code that performs the following:
//imports omitted
public void afterPropertiesSet() throws Exception {
    //building of URL list omitted
    // urlMap is a HashMap<String,String> created and populated just prior
    for (String urlVar : urlMap.keySet()) {
        String myURLvar = urlMap.get(urlVar);
        System.out.println("URL is " + myURLvar);
        BufferedImage imageVar = ImageIO.read(new URL(myURLvar)); //URL confirmed to be valid even for executions that fail
        String fileName2Save = "filepath"; // a valid file path
        System.out.println("Target path is " + fileName2Save);
        File file2Save = new File(fileName2Save);
        file2Save.setWritable(true); //set these just to be sure
        file2Save.setReadable(true);
        try {
            ImageIO.write(imageVar, "png", file2Save); //error thrown here
        } catch (Exception e) {
            System.out.println("R: " + file2Save.canRead() + " W: " + file2Save.canWrite() + " E: " + file2Save.canExecute() + " Exists: " + file2Save.exists() + " is a file: " + file2Save.isFile());
            System.out.println("parent Directory perms"); // same as above except on parent directory of destination
        } //end try
    } //end for
}
This all runs on Windows 7 with JDK 1.6.26, NetBeans, and Tomcat 7.0.14. The target directory is actually inside my NetBeans project directory, in a folder for a normal web app (outside WEB-INF) where I would normally expect to have permission to write files.
When the error occurs I get one of two results for the file: (a) all false, or (b) all true. The parent directory permissions never change: all true, except for isFile.
The error thrown (a java.io exception with "Access denied") does not occur every time; in fact, 60% of the time the loop runs it throws no error. The remaining 40% of the time I get the error on one of the 60+ files it writes, infrequently the same one. The order of the URLs it starts from changes every time, so the order in which the files are written is variable. The file names are short and concise, like "1.png". The images are small, less than 8 KB.
In order to make sure the permissions are correct I have:
Given "Full control" to EVERYONE from the NetBeans project directory down
Run the JDK, JRE, and NetBeans as Administrator
Disabled UAC
Yet the error persists. Google searches for this seem to run the gamut and often read like voodoo. Clearly I (and Java and NetBeans, etc.) should have permission to write a file to the directory.
Anyone have any insight? This is all (code and the web server hosting the URL) on a closed system, so I can't cut and paste code or a stack trace.
Update: I confirmed the image URL is valid by doing a println/toString prior to each read. I then confirmed that (a) the web server hosting the target URL returned the image with an HTTP 200 code, and (b) the URL returned the image when tested in a web browser. In testing I also put an if() in after the read to confirm that the value was not null or empty. I also put in tests for null on all the other values. They are always as expected, even for a failure. The error always occurs inside the try block. The destination directory is the same every execution. Prior to every execution the directory is empty.
Update 2: Here is one of the stack traces (in this case the perms for file2Save are R: true, W: true, E: true, isFile: true, exists: true):
java.io.FileNotFoundException <fullFilepathhere> (Access is denied)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at javax.imageio.stream.FileImageOutputStream.<init>(FileImageOutputStream.java:53)
at com.sun.imageio.spi.FileImageOutputStreamSpi.createOutputStreamInstance(FileImageOutputStreamSpi.java:37)
at javax.imageio.ImageIO.createImageOutputStream(ImageIO.java:393)
at javax.imageio.ImageIO.write(ImageIO.java:1514)
at myPackage.myClass.afterPropertiesSet(thisClassexample.java:204)// 204 is the line number of the ImageIO write
This may not answer your problem, since there can be many other possibilities given your limited information.
One common possibility for not being able to write a file in a web application is the file locking issue on Windows, which occurs if the following four conditions are met simultaneously:
the target file exists under web root, e.g. WEB-INF folder and
the target file is served by the default servlet and
the target file has been requested at least once by client and
you are running under Windows
If you are trying to replace such a file that meets all four conditions, you will not be able to, because some servlet containers, such as Tomcat and Jetty, buffer the static content and lock the files, so you are unable to replace or change them.
If your web application has exactly this problem, you should not use the default servlet to serve the file contents. The default servlet is designed to serve static content which you do not want to change, e.g. CSS files, JavaScript files, background images, etc.
There is a trick to solve the file locking issue on Windows for Jetty by disabling NIO: http://docs.codehaus.org/display/JETTY/Files+locked+on+Windows
The trick is useful during development, e.g. when you want to edit a CSS file and see the change without restarting your web application, but it is not recommended for production. If your web application relies on this trick in production, then you should seriously consider redesigning your code.
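For reference, and from memory only (the linked page is the authority here), the Jetty trick boils down to disabling memory-mapped file buffers for the DefaultServlet in webdefault.xml:

<init-param>
  <param-name>useFileMappedBuffer</param-name>
  <param-value>false</param-value>
</init-param>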
I cannot tell you what's going on or why... I have a feeling it's something dependent on the way ImageIO tries to save the image. What you could do is save the BufferedImage by going through a ByteArrayOutputStream, as described below:
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.imageio.ImageIO;

BufferedImage bufferedImage = ImageIO.read(new File("sample_image.gif"));
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write(bufferedImage, "gif", baos);
baos.flush(); //Is this necessary??
byte[] resultImageAsRawBytes = baos.toByteArray();
baos.close(); //Not sure how important this is...
OutputStream out = new FileOutputStream("myImageFile.gif");
out.write(resultImageAsRawBytes);
out.close();
I'm not really familiar with ByteArrayOutputStream, but I guess its reset() method could be handy when dealing with saving multiple files, as sketched below. You could also try using its writeTo(OutputStream out) if you prefer. Documentation here.
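A quick sketch of that reuse pattern (the images array and file names are made up for illustration):

ByteArrayOutputStream baos = new ByteArrayOutputStream();
for (int i = 0; i < images.length; i++) {
    baos.reset(); // clears the buffer but keeps its allocated capacity
    ImageIO.write(images[i], "png", baos);
    OutputStream out = new FileOutputStream("image" + i + ".png");
    baos.writeTo(out); // copies the buffered bytes straight to the file
    out.close();
}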
Let me know how it goes...

Reading a file and editing it in Java

What I am doing is reading in an HTML file and looking for a specific location in the HTML where I can enter some text.
So I am using a BufferedReader to read in the HTML file and split it by the </HEAD> tag. I want to enter some text before this tag, but I am not sure how to do it. The HTML would then be along the lines of ...newText</HEAD>.
Would I need a PrintWriter to the same file, and if so, how would I tell it to write in the correct location?
I am not sure which way would be the most efficient way to do something like this.
Please help.
Thanks in advance.
Here is part of my Java code:
try {
    File f = new File("newFile.html");
    FileOutputStream fos = new FileOutputStream(f);
    PrintWriter pw = new PrintWriter(fos);
    BufferedReader read = new BufferedReader(new FileReader("file.html"));
    String str;
    int i = 0;
    boolean found = false;
    while ((str = read.readLine()) != null) {
        String[] data = str.split("</HEAD>");
        if (found == false) {
            pw.write(data[0]);
            System.out.println(data[0]);
            pw.write("</script>");
            found = true;
        }
        if (i < 1) {
            // note: data[1] only exists if this line actually contains </HEAD>
            pw.write(data[1]);
            System.out.println(data[1]);
            i++;
        }
        pw.write(str);
        System.out.println(str);
    }
    pw.close();
    read.close();
} catch (Exception e) {
    e.printStackTrace();
}
When I do this, it gets to a point in the file where I get these errors:
FATAL ERROR: MERLIN: Unable to connect to EDG API,
Cannot find .edg_properties file.,
java.lang.OutOfMemoryError: unable to create new native thread,
Cannot truncate table,
EXCEPTION:Cannot open connection to server: SQLExceptio,
Caught IOException: java.io.IOException: JZ0C0: Connection is already closed, ...
I'm not sure why I get these or what they all mean.
Please help.
Should be pretty easy:
Read file into a String
Split into before/after chunks
Open a temp file for writing
Write before chunk, your text, after chunk
Close up, and move temp file to original
Sounds like you are wondering about the last couple of steps in particular. Here is the essential code:
File htmlFile = ...;
...
File tempFile = File.createTempFile("foo", ".html");
FileWriter writer = new FileWriter(tempFile);
writer.write(before);
writer.write(yourText);
writer.write(after);
writer.close();
tempFile.renameTo(htmlFile);
Most people suggest writing to a temporary file and then copying the temporary file over the original on successful completion.
The forum thread has some ideas of how to do it.
GL.
For reading and writing you can use FileReader/FileWriter or the corresponding IO stream classes.
For the editing, I'd suggest using an HTML parser to handle the document. It can read the HTML document into an internal data structure, which simplifies your effort to search for content and apply modifications. Most (all?) parsers can serialize the document back to HTML again.
At least that way you're sure not to corrupt the HTML document structure.
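As one concrete illustration of that approach (jsoup is my example pick here; the answer itself doesn't name a parser):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HeadInserter {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.parse(new File("file.html"), "UTF-8");
        doc.head().append("<script>/* new text */</script>"); // lands just before </head>
        FileWriter writer = new FileWriter("newFile.html");
        writer.write(doc.outerHtml());
        writer.close();
    }
}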
Following up on the list of errors in your edit: a lot of that possibly stems from the OutOfMemoryError. It means you simply ran out of memory in the JVM, so Java was unable to allocate objects. This may be caused by a memory leak in your application, or it could simply be that the work you're trying to do transiently needs more memory than you have allocated.
You can increase the amount of memory the JVM starts up with by providing the -Xmx argument to the java executable, e.g.:
-Xmx1024m
would set the maximum heap size to 1024 megabytes.
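For example (MyApp is a placeholder for however you launch your application):
java -Xmx1024m MyApp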
The other issues might possibly be caused by this; when objects can't reliably be created or modified, lots of weird things tend to happen. That said, there are a few things you can take action on. In particular, whatever MERLIN is, it looks like it can't do its work because it needs a property file for EDG, which it is unable to find in the location it's looking. You'll probably need to either put a config file there or tell it to look in another location.
The other IOExceptions are fairly self-explanatory. Your program could not establish a connection to the server because of a SQLException (the underlying exception itself will probably be found in the logs), and some other part of the program tried to communicate with a remote machine using a closed connection.
I'd look at fixing the properties file (if it's not a benign error) and the memory issues first, and then see if any of the remaining problems still manifest.
