Java Communication Server to print 1 million PDF documents

I have a Java batch job which prints 1 million (1 page) PDF documents.
This batch job runs every 5 days.
For printing 1 million (1 page) PDF documents through a batch job, which method is better?
In these PDFs most of the text / paragraphs are the same for all customers; only a few values (Customer Id / Name / Due Date / Expiry Date / Amount) are picked dynamically from the database.
We have tried the following:
1) Jasper Reports
2) iText
But the above 2 methods are not giving good performance, because the static text / paragraphs are re-created at runtime for every document.
So I am thinking of an approach like this:
There will be a template with placeholders for the dynamic values (Customer Id / Name / Due Date / Expiry Date / Amount).
There will be a communication server, such as OpenOffice, which holds this template.
Our Java application, deployed on a web server, will fetch the dataset from the database and pass it on to this communication server, where the template is already loaded in memory; only the placeholder values will be replaced from the dataset and the result saved, like a "Save As" command.
Is this approach achievable? If yes, which API / communication server is better?
Here is the Jasper Reports code for reference:
InputStream is = getClass().getResourceAsStream("/jasperreports/reports/"+reportName+".jasper" );
JasperPrint print = JasperFillManager.fillReport(is, parameters, dataSource);
pdf = File.createTempFile("report.pdf", "");
JasperExportManager.exportReportToPdfFile(print, pdf.getPath());
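For reference (this note is not part of the original question): the snippet above re-reads the compiled .jasper file from the classpath on every call. A minimal sketch, assuming the same reportName, parameters, dataSource and pdf variables as above, that loads the compiled report once (via JasperReports' net.sf.jasperreports.engine.util.JRLoader) and reuses it for every fill:
// Load the compiled report once (e.g. at startup) and keep it in a field or cache.
private JasperReport loadCompiledReport(String reportName) throws JRException, IOException {
    try (InputStream is = getClass().getResourceAsStream(
            "/jasperreports/reports/" + reportName + ".jasper")) {
        return (JasperReport) JRLoader.loadObject(is); // deserialize the compiled report
    }
}

// Then fill the cached report object for each customer / batch chunk:
JasperPrint print = JasperFillManager.fillReport(report, parameters, dataSource);
JasperExportManager.exportReportToPdfFile(print, pdf.getPath());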

Wow. 1 million PDF files every 5 days.
Even if it takes you just 0.5 seconds to generate a PDF file from beginning to end (a finished file on disk), it will take about 500,000 seconds, i.e. nearly 6 full days, to generate this number of PDFs sequentially.
I think any approach that generates a file in a sub-second amount of time is fine (and JasperReports can certainly give you this level of performance).
I think you need to think about how you're going to optimise the whole process: you are certainly going to have to use multi-threading, and perhaps even several physical servers, to generate this number of files in any reasonable amount of time (at least overnight).
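As an illustration of the multi-threading point (this sketch is not from the original answer; loadCustomerIds and generatePdf are hypothetical stand-ins for your data access and report-generation code), a fixed thread pool can fan the work out over all available cores:
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class ParallelBatch {
    public static void main(String[] args) throws InterruptedException {
        List<String> customerIds = loadCustomerIds();      // hypothetical data access
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (final String id : customerIds) {
            pool.submit(() -> generatePdf(id));             // one task per document
        }
        pool.shutdown();                                    // no new tasks
        pool.awaitTermination(1, TimeUnit.DAYS);            // wait for the whole batch
    }

    // Hypothetical stand-ins for the real data access and PDF generation code.
    private static List<String> loadCustomerIds() { return Collections.emptyList(); }
    private static void generatePdf(String customerId) { /* fill template, write file */ }
}
Depending on the PDF library, each worker may need its own reader/template objects, since not all of them are safe to share across threads.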

I would go with PDF forms (this should be "fast"):
// iText 5-style imports shown as an assumption; adjust the packages to the iText version you use
import java.io.Closeable;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

import javax.annotation.Nonnull;

import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

public final class Batch {

    private static final String FORM = "pdf-form.pdf";

    public static void main(final String[] args) throws IOException {
        final PdfPrinter printer = new PdfPrinter(FORM);
        final List<Customer> customers = readCustomers();
        for (final Customer customer : customers) {
            try {
                printer.fillAndCreate("pdf-" + customer.getId(), customer);
            } catch (IOException e) {
                // handle exception
            } catch (DocumentException e) {
                // handle exception
            }
        }
        printer.close();
    }

    private static @Nonnull List<Customer> readCustomers() {
        // implement me: load the customers from the database
        throw new UnsupportedOperationException("implement me");
    }

    private Batch() {
        // nothing
    }
}

public class PdfPrinter implements Closeable {

    private final PdfReader reader;

    public PdfPrinter(@Nonnull final String src) throws IOException {
        reader = new PdfReader(src); // <= this reads the form PDF (the template) once
    }

    @Override
    public void close() {
        reader.close();
    }

    public void fillAndCreate(@Nonnull final String dest, @Nonnull final Customer customer)
            throws IOException, DocumentException {
        final PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest)); // dest = output
        final AcroFields form = stamper.getAcroFields();
        form.setField("customerId", customer.getId());
        form.setField("name", customer.getName());
        // ...
        stamper.close();
    }
}
see also: http://itextpdf.com/examples/iia.php?id=164
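A small follow-up note (not part of the answer above): if the generated copies should not remain editable form fields, the PdfStamper used in fillAndCreate can flatten the form before it is closed:
stamper.setFormFlattening(true); // replace the interactive fields with plain page content
stamper.close();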

As a couple of posters mentioned, 1 million PDF files every 5 days means sustaining a rate of over 2 documents per second (1,000,000 / (5 × 86,400 s) ≈ 2.3 documents per second). This is achievable from a pure document-generation point of view, but you need to keep in mind that the systems running the queries and compiling the data will also be under significant load. You also haven't said much about the PDFs themselves - a one page PDF is much easier to generate than a 40 page PDF...
I have seen iText and Docmosis achieve tens of documents per second, so Jasper and other technologies probably can too. I mention Docmosis because it works along the lines of the technique you mentioned (populating templates loaded into memory). Please note I work for the company that produces Docmosis.
If you haven't already, you will need to consider the hardware/software architecture and run trials with whatever technologies you are evaluating, to make sure you can get the performance you require. Presumably the peak load will be somewhat higher than the average load.
Good luck.

Related

How to avoid running out of memory reading a complex PDF via iText7?

I'm using iText7 and Java to read PDFs that are not very large (10-30MB), but they contain a massive number of objects, causing OutOfMemoryError problems when creating and using a PdfDocument. (The internal xref table and Map/Tree/Pdf[Dict/Array] objects are in the millions.)
For example, a single PDF might only be 33MB but it contains a single table with a million rows spanning 800 pages, and the bookkeeping inside of PdfDocument is blowing up to 400MB. Here's the sample code and heap dump:
public static void main(String[] args) throws Exception {
    // PDF file is 33MB on disk (one big table over 800 pages)
    File pdf = new File("big.pdf");                 // Also tried InputStream
    PdfReader reader = new PdfReader(pdf);          // 35MB heap
    PdfDocument document = new PdfDocument(reader); // 400MB+ heap
    // do stuff ... assuming we didn't get an OOM above
}
We added more memory to the JVM, but we don't know how big/complex some of these PDFs might be, so a long-term solution is needed, ideally one that lets us read contents in pieces or in an event-like callback manner (like XML+STAX/SAX).
Is there a more efficient way to either stream the PDF or break it up into sub PdfDocuments given a file or InputStream? We want to locate and extract objects like forms, tables, tooltips, etc.
Update: I got in contact with the iText team and iText 7 doesn't allow partial reading of PDFs. So there isn't much I can do in this case except add more RAM or pre-parse the PDF myself and look for "too much data" (a lot of work). I also checked PDFBox and it suffers from the same problem.
You can do the following for reading large files:
RandomAccessSourceFactory rasf = new RandomAccessSourceFactory();
RandomAccessSource ras = rasf.createBestSource(file);
RandomAccessFileOrArray rafoa = new RandomAccessFileOrArray(ras);
PdfReader pdfReader = new PdfReader(rafoa, null);

Java reading and indexing large amount of files

I'm working on an application that is supposed to read a large number of files (the test set is about 80,000 files) and extract the text from them. The files can be anything from txt, pdf, docx, etc. and will be parsed using Apache Tika.
Once the text is extracted, it will be indexed in ElasticSearch to become searchable. Elastic has, thus far, not been a problem.
The server on which this application will run has limited RAM (about 2GB).
Current
The Tika implementation is as follows:
private static final int PARSE_STRING_LIMIT = 100000;
private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);

public String parseToString(InputStream inputStream) throws IOException, TikaException {
    try {
        return TIKA_INSTANCE.parseToString(inputStream, new Metadata(), PARSE_STRING_LIMIT);
    } finally {
        IOUtils.closeQuietly(inputStream); // should already be closed by parseToString.
    }
}
For each file, a document object is created and given the appropriate values for the ElasticSearch mapping. Text extraction is done as follows:
String text = TIKA_INSTANCE.parseToString(new BufferedInputStream(new FileInputStream(file)));
attachmentDocumentNew.setText(text);
text = null;
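For illustration (this is an addition, not the original code): the same extraction can be written with try-with-resources so the stream is closed even if parsing throws; TIKA_INSTANCE and PARSE_STRING_LIMIT are the fields shown above.
try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
    String text = TIKA_INSTANCE.parseToString(in, new Metadata(), PARSE_STRING_LIMIT);
    attachmentDocumentNew.setText(text);
}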
One more caveat is that this is a Spring-boot application which will eventually run as a server so it can be called upon whenever the index is necessary (and some other stuff, such as statistics).
The jar is run with the following VM arguments:
java -Xms512m -Xmx1024m -XX:+UseG1GC -jar <jar>
The problem
Whenever I start indexing the files, I get an OutOfMemoryError. I tried profiling it using VisualVM, but it mostly shows char[] and byte[] instances, which don't provide a lot of information. I am also not well versed in multithreading or profiling (I do neither at this point), since I only have 2 years of programming experience.
The question
How do I reduce the memory footprint of the application without crashing the indexing?
Perhaps a more general question if the above is too specific:
How do I reduce the memory usage when reading a large number of files?
If you have experience building something like this, I'm also open to any suggestions :)
Edit
To clarify, I do not have to write much (any?) code for the Elasticsearch part of the application, since this is done using an existing library written by the people here.

Usefulness of DELETE_ON_CLOSE

There are many examples on the internet showing how to use StandardOpenOption.DELETE_ON_CLOSE, such as this:
Files.write(myTempFile, ..., StandardOpenOption.DELETE_ON_CLOSE);
Other examples similarly use Files.newOutputStream(..., StandardOpenOption.DELETE_ON_CLOSE).
I suspect all of these examples are probably flawed. The purpose of writing a file is that you're going to read it back at some point; otherwise, why bother writing it? But wouldn't DELETE_ON_CLOSE cause the file to be deleted before you have a chance to read it?
If you create a work file (to work with large amounts of data that are too large to keep in memory) then wouldn't you use RandomAccessFile instead, which allows both read and write access? However, RandomAccessFile doesn't give you the option to specify DELETE_ON_CLOSE, as far as I can see.
So can someone show me how DELETE_ON_CLOSE is actually useful?
First of all, I agree with you: in the Files.write(myTempFile, ..., StandardOpenOption.DELETE_ON_CLOSE) example the use of DELETE_ON_CLOSE is meaningless. After a (not so intense) search through the internet, the only example I could find which shows this usage was the one you might have got it from (http://softwarecave.org/2014/02/05/create-temporary-files-and-directories-using-java-nio2/).
This option is not intended to be used for Files.write(...) only. The API makes it quite clear:
This option is primarily intended for use with work files that are used solely by a single instance of the Java virtual machine. This option is not recommended for use when opening files that are open concurrently by other entities.
Sorry, I can't give you a meaningful short example, but think of such a file like a swap file/partition used by an operating system: there are cases where the current JVM needs to temporarily store data on disc and, after shutdown, the data is of no use anymore. As a practical example, it is similar to a JEE application server which might decide to serialize some entities to disc to free up memory.
edit Maybe the following (oversimplified) code can be taken as an example to demonstrate the principle. (So please: nobody should start a discussion about how this "data management" could be done differently, that using a fixed temporary filename is bad, and so on...)
- in the try-with-resources block you need, for some reason, to externalize data (the reasons are not the subject of the discussion)
- you have random read/write access to this externalized data
- this externalized data is only of use inside the try-with-resources block
- with the StandardOpenOption.DELETE_ON_CLOSE option you don't need to handle the deletion after use yourself, the JVM will take care of it (the limitations and edge cases are described in the API)
static final int RECORD_LENGTH = 20;
static final String RECORD_FORMAT = "%-" + RECORD_LENGTH + "s";

// add exception handling, left out only for the example
public static void main(String[] args) throws Exception {
    EnumSet<StandardOpenOption> options = EnumSet.of(
            StandardOpenOption.CREATE,
            StandardOpenOption.WRITE,
            StandardOpenOption.READ,
            StandardOpenOption.DELETE_ON_CLOSE
    );
    Path file = Paths.get("/tmp/enternal_data.tmp");
    try (SeekableByteChannel sbc = Files.newByteChannel(file, options)) {
        // during your business processing the below two cases might happen
        // several times in random order

        // example of huge datastructure to externalize
        String[] sampleData = {"some", "huge", "datastructure"};
        for (int i = 0; i < sampleData.length; i++) {
            byte[] buffer = String.format(RECORD_FORMAT, sampleData[i])
                    .getBytes();
            ByteBuffer byteBuffer = ByteBuffer.wrap(buffer);
            sbc.position(i * RECORD_LENGTH);
            sbc.write(byteBuffer);
        }

        // example of processing which needs the externalized data
        Random random = new Random();
        byte[] buffer = new byte[RECORD_LENGTH];
        ByteBuffer byteBuffer = ByteBuffer.wrap(buffer);
        for (int i = 0; i < 10; i++) {
            sbc.position(RECORD_LENGTH * random.nextInt(sampleData.length));
            sbc.read(byteBuffer);
            byteBuffer.flip();
            System.out.printf("loop: %d %s%n", i, new String(buffer));
        }
    }
}
DELETE_ON_CLOSE is intended for temporary work files.
If you need to perform an operation whose data has to be stored temporarily in a file, but you don't need the file outside the current execution, DELETE_ON_CLOSE is a good solution for that.
One example is when you need to store information that can't be kept in memory, for example because it is too large.
Another example is when you only need the information at a later moment and don't want to occupy memory for it in the meantime.
Imagine also a situation in which a process needs a lot of time to complete. You store information in a file and only use it later (perhaps many minutes or hours afterwards). This guarantees that memory is not used for that information while you don't need it.
DELETE_ON_CLOSE tries to delete the file when you explicitly close it by calling close(), or when the JVM shuts down if it was not manually closed before.
Here are two possible ways it can be used:
1. When calling Files.newByteChannel
This method returns a SeekableByteChannel suitable for both reading and writing, in which the current position can be modified.
Seems quite useful for situations where some data needs to be stored out of memory for read/write access and doesn't need to be persisted after the application closes.
2. Write to a file, read back, delete:
An example using an arbitrary text file:
Path p = Paths.get("C:\\test", "foo.txt");
System.out.println(Files.exists(p));
try {
    Files.createFile(p);
    System.out.println(Files.exists(p));

    try (BufferedWriter out = Files.newBufferedWriter(p, Charset.defaultCharset(), StandardOpenOption.DELETE_ON_CLOSE)) {
        out.append("Hello, World!");
        out.flush();

        try (BufferedReader in = Files.newBufferedReader(p, Charset.defaultCharset())) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
} catch (IOException ex) {
    ex.printStackTrace();
}
System.out.println(Files.exists(p));
This outputs (as expected):
false
true
Hello, World!
false
This example is obviously trivial, but I imagine there are plenty of situations where such an approach may come in handy.
However, I still believe the older File.deleteOnExit method may be preferable, as you won't need to keep the output stream open for the duration of any read operations on the file.
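For comparison, a minimal sketch of the File.deleteOnExit approach mentioned above (the file is removed when the JVM exits normally, rather than when a stream is closed):
File tmp = File.createTempFile("work", ".tmp"); // created in the default temp directory
tmp.deleteOnExit();                             // registered for deletion at JVM exit
// ... write to it, close the stream, read it back later in the same run ...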

Generate a large stream for testing

We have a web service where we upload files and want to write an integration test for uploading a somewhat large file. The testing process needs to generate the file (I don't want to add some larger file to source control).
I'm looking to generate a stream of about 50 MB to upload. The data itself doesn't matter much. I tried this with an in-memory object and that was fairly easy, but I was running out of memory.
The integration tests are written in Groovy, so we can use Groovy or Java APIs to generate the data. How can we generate a random stream for uploading without keeping it in memory the whole time?
Here is a simple program which generates a 50 MB text file with random content.
import java.io.PrintWriter;
import java.util.Random;

public class Test004 {
    public static void main(String[] args) throws Exception {
        PrintWriter pw = new PrintWriter("c:/test123.txt");
        Random rnd = new Random();
        for (int i = 0; i < 50 * 1024 * 1024; i++) {
            pw.write('a' + rnd.nextInt(10));
        }
        pw.flush();
        pw.close();
    }
}
You could construct a mock/dummy implementation of InputStream to supply random data, and then pass that in wherever your class/library/whatever is expecting an InputStream.
Something like this (untested):
class MyDummyInputStream extends InputStream {
    private Random rn = new Random(0);

    @Override
    public int read() {
        return rn.nextInt(256); // must return a value in 0-255, or -1 for end of stream
    }
}
Of course, if you need to know the data (for test comparison purposes), you'll either need to save this data somewhere, or you'll need to generate algorithmic data (i.e. a known pattern) rather than random data.
(Of course, I'm sure you'll find existing frameworks that do all this kind of thing for you...)
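Building on the dummy-stream idea above (this is an addition, not part of the original answer): if the upload needs a definite size, the stream can simply signal end-of-stream after roughly 50 MB.
import java.io.InputStream;
import java.util.Random;

class BoundedRandomInputStream extends InputStream {
    private final Random rn = new Random(0);
    private long remaining = 50L * 1024 * 1024; // ~50 MB, then end of stream

    @Override
    public int read() {
        if (remaining-- <= 0) {
            return -1;          // end of stream reached
        }
        return rn.nextInt(256); // next pseudo-random byte
    }
}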

Why does image rendering through the Ghostscript API take so much time?

I'm using Ghostscript to render images from PDFs through Java by running commands; however, I'm now trying to do the image rendering from PDF using ghost4j-0.5.0.jar with the code below, which I took from this website.
The problem is that the rendering process takes more than two minutes to generate one image, though it takes a second to do it through the command line. The thing is, I'm trying to run everything through Java, and I want to stop using ImageMagick and Ghostscript as external tools. Please note that I'm satisfied with Ghostscript and I don't want to use any other tool, as it gives me the image quality and sizes I need.
The code I'm using is:
public class SimpleRendererExample {

    public static void main(String[] args) {
        imageRenderingFromPdf();
    }

    public static void imageRenderingFromPdf() {
        try {
            PDFConverter converter = new PDFConverter(); // note: not used below
            PDFDocument doc;                             // note: not used below

            // load PDF document
            PDFDocument document = new PDFDocument();
            document.load(new File("d:/cur/outputfile.pdf"));

            // create renderer
            SimpleRenderer renderer = new SimpleRenderer();

            // set resolution (in DPI)
            renderer.setResolution(100);
            System.out.println("started");

            // render
            long before = System.currentTimeMillis();
            List<Image> images = renderer.render(document);
            long after = System.currentTimeMillis();
            System.out.println("render " + (after - before) / 1000);

            // write the first image to disk as PNG
            try {
                before = System.currentTimeMillis();
                ImageIO.write((RenderedImage) images.get(0), "png", new File("d:/dd" + ".png"));
                after = System.currentTimeMillis();
                System.out.println("write " + (after - before) / 1000);
            } catch (IOException e) {
                System.out.println("ERROR: " + e.getMessage());
            }
        } catch (Exception e) {
            System.out.println("ERROR: " + e.getMessage());
        }
    }
}
There are a couple of things slowing down the 'rendering' process.
First of all, it's not due to Ghostscript; Ghostscript itself works the same whether it's executed via the command line or via the API.
The speed difference is the result of the ghost4j rendering implementation. I just checked the source code of ghost4j and I see that it's a mixture of iText and Ghostscript.
So, here is how the code that you use works:
1. First the PDF document is loaded and parsed by iText.
2. Then a copy of the complete document is made by writing the loaded PDF document back to disk in a new place.
3. Then Ghostscript is initialized.
4. Then Ghostscript loads, parses and renders the document from the new place, a second time.
5. For each page, Ghostscript calls the ghost4j display device callback.
6. For each display device callback, ghost4j takes the rasterized page from memory and stores it to disk.
The end.
The weak parts are iText and the display device callback that is used. I think speed could be gained by letting Ghostscript take care of storing the rasterized result, instead of doing it manually from Java...
I think now you can see why you noticed the speed difference.
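For illustration (this is an addition, not from the answer above, and it goes back to running Ghostscript as an external process, which the question wanted to avoid): letting Ghostscript parse the PDF once and write the PNG files itself can be done from Java with ProcessBuilder; the gs executable name, paths and flags below are assumptions.
public class GsDirectExample {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "gs",                        // Ghostscript executable (gswin64c on Windows)
                "-dBATCH", "-dNOPAUSE", "-dQUIET",
                "-sDEVICE=png16m",           // 24-bit colour PNG output
                "-r100",                     // 100 DPI, matching renderer.setResolution(100)
                "-sOutputFile=d:/dd-%d.png", // one output file per page
                "d:/cur/outputfile.pdf");
        pb.inheritIO();                      // forward Ghostscript's output to the console
        int exitCode = pb.start().waitFor();
        System.out.println("gs finished with exit code " + exitCode);
    }
}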
