Write large PDFs with Java sequentially

I am looking for a Java library that lets you write large PDFs sequentially with a minimal amount of memory. Most of the libraries I have looked at have to build up the document in memory before you can actually write it.
The problem I have to deal with is OutOfMemoryErrors. It would be great if I could flush the writer programmatically whenever needed, e.g. for each page.
Does anyone have any recommendations? I need something with a license along the lines of the LGPL (so not the GPL or the Affero GPL that iText uses).

You can do that with iText. It supports writing to OutputStreams.
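For illustration, a minimal sketch of page-by-page writing to a stream with iText (this assumes the older iText 2.x / com.lowagie.text API, which was LGPL/MPL-licensed; whether memory is actually released per page depends on the version and on how the content is added):

```java
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

public class SequentialPdf {
    public static void main(String[] args) throws Exception {
        // Stream straight to disk instead of building the whole document in memory.
        OutputStream out = new BufferedOutputStream(new FileOutputStream("large.pdf"));
        Document document = new Document();
        PdfWriter.getInstance(document, out);
        document.open();
        for (int page = 0; page < 10000; page++) {
            document.add(new Paragraph("Content for page " + page));
            document.newPage(); // completed pages can be written out to the stream
        }
        document.close(); // also flushes and closes the underlying stream
    }
}
```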

The free version of Docmosis has a fairly open license, so it might suit you. It uses a template approach, which is different from building documents from code. Docmosis processes all documents in a stream-based fashion since it's intended for significant parallel use and for large documents. It also allows you to offload the most CPU-intensive part of the processing to another server. Hope that helps.

I actually had the same issue as you. A friend helped me out, but he did it in C# using an API called GhostScriptSharp; you should check it out.
I can't give you a copy of the code, since it's copyrighted, but I'm sure it would help you out, since I think the tool is built on Java.

jPod can swap indirect objects and supports incremental writing.
This is still not optimal as you need an "increment" on each flush, but better than nothing...
EDIT
Öhhh - this is one of the famous examples of self-describing code :-) You're right, there's not much of a tutorial, but the Javadoc is quite good.
jPod writes incrementally by default. See "CosDocument.setWriteModeHint" to set it to full mode.
The examples "CreateDoc" and "AppendPage" show simply how to add pages. You may do the same and call "save" every 10 or 100 pages. This should "soften" all references to pages in memory, and if they are not held by some other references of yours, they can be garbage collected.
There's still the question of how you fill the pages. There are examples dealing with content streams, too (DrawText, ...). BUT jPod is not like iText, Jasper or whatever; it provides only PDF model abstractions. There is no "Layouter" or "Renderer" that creates page content from text, HTML or something like that. How do you do this?


Why is all code I see confined to only a few lines?

I'm fairly new to programming and have been taking courses on Lynda to learn the fundamentals. I have some knowledge of Java and HTML, but I wanted to refresh my memory so I can start learning Objective-C. The Lynda course has us working in JavaScript because of its pretty core syntax. So, in order to get a point of reference, I tried downloading some .js files integrated into HTML pages from various sources. However, this proved to be unhelpful and I am at a loss for understanding because of the way the files are formatted. It seems as if most files put one line of code after the other. I realize that because of flexible whitespace restrictions with JavaScript that this does not hinder the way the code runs, but why did the developers choose to put it all on one line like that? They obviously didn't write the code that way, as that would be extremely tedious and hard to work with, so why did it come out that way when I try to view it? Is it just something that happens when you try to download the resources of a page? Any clarification would be appreciated.
Below is a photo of a JavaScript file I tried viewing. As you can see, all the code is restricted to one single line.
Also, if anyone could offer some insight about where to go after I've finished my course if I'm looking to develop for iOS, that would be greatly appreciated. Lynda also offers an Objective-C Essentials course, as well as an iOS Development course, but I feel like it's a pretty linear path that could be expanded on greatly with some literature or other online documentation.
What you are seeing is a minified version of the JavaScript file. The main advantage of minification is that it reduces the amount of data that needs to be transferred (bandwidth usage).
If you wish to view the code in human readable format, you can use online tools like this
Yes, as karthikr said, it's minified. That means it's all there, just without the line breaks, so to see it all you have to scroll right.
Or you can use http://jsbeautifier.org/ to bring back the line breaks.
There are several reasons for minifying JavaScript. One is that it makes the code less readable (yeah, some devs don't want you to "steal" functions or see easily what the code does). Another, and a big part of why it's done, is that it reduces bandwidth. A file with long variable names and whitespace everywhere can be several times bigger than its minified version - so it improves performance!
Bandwidth costs money, especially for users and especially if they're on mobile devices with a bandwidth limit.
So to solve this problem, developers will minimize the file size of whatever they can.
The JavaScript files you are seeing have been minified by tools such as UglifyJS or the YUI Compressor (list not exhaustive).
Doing this will take out unnecessary whitespace and reduce the lengths of variable and function names that are not globally exported.
Developers may also gzip the files, which will reduce the file size even further.

Is page code compression really needed?

I don't really love it when the code in a page is written on one line; I waste a lot of time trying to understand anything in there. Is the compressed code written in a page really worth it? By the principles of programming, code should be readable for the other programmers who will come to maintain it.
And by the way, could HTML comments affect page load time? They are visible to others:
<!-- comment goes here -->
But JSP comments? They are not visible to others:
<%-- comment goes here --%>
I think you are confusing many concepts here.
Page compression can be done at various levels. You can employ mod_gzip and mod_deflate or similar modules on your web or web-application servers, to compress the raw bytes served by the web/application server. This often saves a lot of bandwidth and is usually not a cause of problems for web-developers, because the browser will decompress the page content before rendering it (or displaying the source back in the "View Source" context).
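To make the bandwidth point concrete, here is a small, self-contained Java sketch (not tied to any particular web server) that compresses a text file with GZIP, the same algorithm mod_gzip/mod_deflate apply to responses, and prints the size difference:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class GzipSizeDemo {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get(args[0]);               // e.g. a large .html or .js file
        Path compressed = Paths.get(args[0] + ".gz");

        try (InputStream in = Files.newInputStream(source);
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(compressed))) {
            in.transferTo(out);                         // Java 9+; use a manual copy loop on older JDKs
        }

        System.out.printf("original:   %d bytes%n", Files.size(source));
        System.out.printf("compressed: %d bytes%n", Files.size(compressed));
    }
}
```

Run it against a typical HTML or JavaScript file and the compressed size is usually a small fraction of the original, which is where the bandwidth savings come from.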
The "page written in one line" is not compression. The technical term is minification or obfuscation. It is typically done for JavaScript, to reduce the size of the JavaScript file being served; this can reduce the filesize drastically, with the added benefit of being difficult to parse by human-readers. Web-developers who employ JavaScript minifiers are often clever enough to have the non-minified version of the source code available, so that debugging is not an issue.
One of the former customer sites that I worked on demonstrated a performance increase of up to 40% when employing GZIP compression on the wire, and between 5-10% when deployed with minified JavaScript files (there were thousands of such files). But again, your mileage might vary when using these techniques.
Finally, HTML comments (<!-- comment goes here -->) do have a performance cost, as it takes more time to serve pages with comments than pages without them. The impact on rendering might be negligible, as comments are often stripped out by the lexical analyzer. This is not true of JavaScript comments in inline script tags, which are first parsed by the HTML parser. The second type of comment (<%-- comment goes here --%>) is never served by the application server, as it is a JSP-style comment; the JSP compiler ignores these comments and thus does not generate any comments in the resulting HTML content.
HTML isn't meant to be read by others when it's being used in production. Generally the original code is going to be readable and things like HTML and JavaScript are commonly minified to decrease load time.
And yes, any comment that your browser has to download is going to increase page load time.
I don't really love it when the code in a page is written on one line; I waste a lot of time trying to understand anything in there. Is the compressed code written in a page really worth it?
It can be
By the principles of programming, code should be readable for others programmers who will come to maintain it too.
That is why minification is done as part of the build process. Developers working on it get sensibly formatted code.
And by the way, could HTML comments affect page load time? But JSP comments? They are not visible to others.
If it is delivered to the client, then it takes up some bandwidth. That may or may not be a significant amount of bandwidth depending on the context.
Some do it intentionally to discourage examination of their code, although with some effort it can be well formatted and be readable again. This is a bit like code obfuscation seen in Java.

Java chart library for really large data?

I'm looking for a chart library capable of handling a large number of data points - 300 million per chart, and even more. Obviously drawing, caching and approximation need to be implemented intelligently there.
Actually I need to represent waveforms, but not only them.
Target platform is Java, data comes from files.
Update: PC, Swing.
Not Java, but CERN does massive data crunching and its distributions/plots may well involve these kinds of data volumes. They use the ROOT package, which is C++. You can download it, although I couldn't see a licence; it's probably open source.
Or alternatively, take a look at R which might do what you need.
I have been happy with my use of JChart2D. Switching to it from JFreeChart saved us considerable processor use, and it has traces that combine multiple inputs into a mean point for speed and memory savings. I've never used those, seeing as I haven't needed to yet; I have put extremely large sets of data into a normal trace by accident, and it didn't seem to be a problem.
There may be a better charting system out there, but this one gets the job done quickly and effectively, it's free, open source, based on JPanels, and the author is around to answer questions and correct problems.
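For reference, a minimal JChart2D sketch along the lines of its standard "hello world" (class and package names as I recall them from the info.monitorenter API; the ring-buffer trace shown here keeps only the newest points in memory, which is part of why it copes with large data sets):

```java
import info.monitorenter.gui.chart.Chart2D;
import info.monitorenter.gui.chart.ITrace2D;
import info.monitorenter.gui.chart.traces.Trace2DLtd;

import javax.swing.JFrame;

public class WaveformChartDemo {
    public static void main(String[] args) {
        Chart2D chart = new Chart2D();
        // A ring-buffer trace: keeps only the most recent N points in memory.
        ITrace2D trace = new Trace2DLtd(100000);
        chart.addTrace(trace);

        // In a real application the points would be streamed from a file.
        for (int i = 0; i < 1000000; i++) {
            trace.addPoint(i, Math.sin(i / 1000.0));
        }

        JFrame frame = new JFrame("Waveform");
        frame.getContentPane().add(chart); // Chart2D is a JPanel
        frame.setSize(800, 400);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}
```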
I don't see a way to handle that amount of data on an Android phone, whatever library you use. You should think about doing all this processing on a server or in the cloud, and then having the phone download either an approximated data set that approximates the chart, or the chart itself rendered as an image file, so that Android phones can get it from the server without processing the data.
Regards,
stéphane
I assume that you are talking about a Swing Application.
I make use of JGoodies for all my Swing applications, including graphs and charts.
It takes a bit of getting used to, but once you are used to it, building UIs is fairly quick and easy.
The only problem is that there is a developer license cost involved.
You can download the Java Webstart examples to have a look at what it is capable of.

Are there any tools to isolate the content of a webpage?

I'm working on a school project in which we would like to analyze the content of webpages. We don't, however, want to deal with things like nav bars and comments. If we were looking at a specific website we could make a parser to filter that sort of extraneous stuff out specifically for that site, but we are hoping to work on arbitrary sites that we may not have encountered before.
I feel like it's a bit much to hope for, so I won't be surprised if nothing like this exists already, but does anyone know of a tool that can do that sort of content isolation on arbitrary websites? I've had a bit of luck diffing pages with others from the same site, but it's imperfect and leaves comments and such.
I am working in Java, but would welcome anything open source in any language that I can use for ideas.
I'm a little late to this one (especially for a school project), but if anyone finds this at some future point, the following may be helpful.
I stumbled across a Java library to do exactly this. Performance, in my simple tests, is similar to Readability.
http://code.google.com/p/boilerpipe/
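For anyone evaluating it, typical usage is roughly the following sketch (based on Boilerpipe's documented ArticleExtractor; the URL is just a placeholder):

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;

import java.net.URL;

public class ExtractMainContent {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/some-article.html");
        // Strips navigation, comments and other boilerplate, returning the main text.
        String text = ArticleExtractor.INSTANCE.getText(url);
        System.out.println(text);
    }
}
```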
You could try an unofficial API of arc90's Readability.
Basically what Readability does is extract content on a webpage and presents it to you as a nicely formatted article. Nav bars, comments, and all the other stuff that surrounds content on a webpage is gone.
I'm also a bit late to this conversation, but...
The Java Boilerpipe extractors are probably what you want (ArticleSentencesExtractor, probably), although there is at least one port of the arc90 Readability code to Java on GitHub.
If you want to build a poor man's Boilerpipe, you might try diffing two pages from the same site (assuming they are using the same template, you will likely get an interesting result).
The main difference between Boilerpipe, Readability and a diff-based hack is that Boilerpipe will strip out all HTML but preserve some structure.
I doubt that anything exists that would do what you want. Without some sort of semantic markup it is next to impossible to distinguish "real" content from the other stuff. This is a task that requires real intelligence.
There are of course good tools for parsing HTML of varying degrees of correctness, and it is often possible to cobble together some pattern-based solution for dealing with pages on a particular site ... assuming that there are common structures / patterns to be elicited.

Python, PyTables, Java - tying all together

Question in nutshell
What is the best way to get Python and Java to play nice with each other?
More detailed explanation
I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:
We have an agent-based modeling simulation written in Java. It has options of either writing locally to CSV files, or remotely via a connection to a Java server to an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and as I said, some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place rather than scattered across dozens of plain-text files. Furthermore, since it is a binary file format, it should give significant space savings compared to uncompressed CSVs.
As the diagram shows, the current post-processing we do of the raw output data from simulation also takes place in Java, and reads in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.
The Problem
As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from the simulation. Furthermore, the post-processing code is doing more than it should have to, essentially performing the work of a very, very poor man's relational database (making joins across 'tables' (CSV files) based on foreign keys (the unique agent IDs)). It is also difficult in this system to visualize the data in other ways (e.g. Prefuse, Processing, JMonkeyEngine), or to get some subset of the raw data to play with in MATLAB or SPSS.
Solution?
My group decided we really need a way of filtering and querying the data we have, as well as performing cross-table joins. Given this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use [XQuery as the query language on HDF5 files][3], but that paper describes having to write a compiler to convert from XQuery/XPath into the native HDF5 calls, which is way beyond our needs.
Enter [PyTables][4]. It seems to do exactly what we need (it provides two different ways of querying data, either through Python list comprehensions or through [in-kernel (C level) searches][5]).
The proposed architecture I envision is this:
What I'm not really sure how to do is link together the Python code that will be written for querying, the Java code that serves up the HDF5 files, and the Java code that does the post-processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries and instead let the excellent PyTables do this much more elegantly.
Java/Python options
A simple Google search turns up a few options for [communicating between Java and Python][7], but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should be running on the same machine as the Datahose so that the large .h5 files do not have to be transferred over the network; rather, the much smaller, filtered views of them would be transmitted to the clients. [Pyro][8] seems to be an interesting choice - does anyone have experience with that?
This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.
The initial plan of using PyTables as an intermediary layer between your other elements and the data files seems solid. However, one design constraint that wasn't mentioned is one of the most critical in all data processing: which of these data processing tasks can be done in a batch-processing style, and which are more of a live stream?
This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.
Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.
Another constraint that's significant is your IT infrastructure. Do you have high-speed, network-accessible storage available? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch processing needs. You mentioned running your PyTables-using application on the same server that's running the Java simulation. However, that means the server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when it queries the data.)
To answer your questions directly:
PyTables looks like a nice match.
There are many ways for Python and Java to communicate, but consider a language-agnostic communication method so these components can be changed later if necessary. This can be as simple as finding libraries that support both Java and Python and trying them. The API you choose to implement with whatever library should be the same anyway. (XML-RPC would be fine for prototyping, as it's in the standard library; Google's Protocol Buffers or Facebook's Thrift make good production choices.) But don't underestimate how great and simple just "writing things to intermediary files" can be if the data is predictable and batchable.
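As a concrete illustration of the language-agnostic point, here is a sketch of what the Java side of an XML-RPC call to a Python/PyTables query service might look like, using the Apache XML-RPC client library; the endpoint, method name and parameters are hypothetical:

```java
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class QueryClient {
    public static void main(String[] args) throws Exception {
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("http://datahose-host:8080/RPC2")); // hypothetical endpoint
        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // Hypothetical method exposed by the Python side (e.g. SimpleXMLRPCServer + PyTables).
        Map<String, Object> params = new HashMap<>();
        params.put("scenario", "baseline");
        params.put("metric", "mean");
        Object result = client.execute("query.aggregate", new Object[]{params});
        System.out.println(result);
    }
}
```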
To help with the design process more and flesh out your needs:
It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:
Create two diagrams of your current architecture, physical and logical.
On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
Be certain to label the resources available to each server and the type and resources available to each connection.
Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
On the logical diagram, create boxes for every application that is running in your current architecture.
Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on its own.)
Draw on disk resources (like the HDF5 and CSV files) as cylinders.
Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes an HDF5 file, the arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.
Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).
At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)
To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:
What data formats or methods can this application use to communicate?
What data does it actually want? (Is this always the same, or does it change on a whim depending on other requirements?)
How often does it need it?
Approximately how many resources does the application need?
What does the application do now that it doesn't do well?
What could this application do that would help, but that it isn't doing?
If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.
You've already described one particular situation: you have quite a bit of Java post-processing code that is doing "joins" on tables of data in CSV files, which is a "does it now but doesn't do it well". So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file and, before that, the simulation, so no, there's nothing that can do that better in the current architecture.
So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.
Some final guidelines:
Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult it is to implement and the more likely it is to fail. Use the smallest number of operations, and use the least complex operations. Sometimes just one application to handle the queries for all the other parts of your architecture is simplest. Sometimes an application to handle "live" queries and a separate application to handle "batch requests" is better.
Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great, the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language, the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than using the off-the-shelf solution that's going to work, for certain, very well, or it's not the best solution.
Refer to your physical layer documentation frequently so you understand the resource use of your considerations. A slow network link or putting too much on one server can both rule out otherwise good solutions.
Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki them or file them away so you can whip them out again when the topic comes up.
And the answer to the direct question, "How do I get Python and Java to play nicely together?", is simply "use a language-agnostic communication method." The truth of the matter is that Python and Java are both unimportant to the problem set you describe. What's important is the data that's flowing through it. Anything that can easily and effectively share data is going to be just fine.
Do not make this more complex than it needs to be.
Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the Operating System do what OS's do best.
Your Java application can simply fork a process which takes the necessary parameters as command-line options. Then your Java code can move on to the next thing while Python runs in the background.
This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".
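A minimal sketch of that approach from the Java side (the script name and arguments are hypothetical; this just shows the standard ProcessBuilder pattern - in a real setup the output-reading loop would typically live on its own thread so the simulation can keep running concurrently):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class PyTablesLauncher {
    public static void main(String[] args) throws Exception {
        // Hypothetical Python script that runs the PyTables queries against the HDF5 file.
        ProcessBuilder pb = new ProcessBuilder(
                "python", "run_queries.py", "--study", "scenario-42", "--out", "summary.csv");
        pb.redirectErrorStream(true); // merge stderr into stdout for simple logging

        Process process = pb.start();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println("[pytables] " + line);
            }
        }
        int exitCode = process.waitFor();
        System.out.println("Python finished with exit code " + exitCode);
    }
}
```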
You could try Jython, a Python interpreter for the JVM which can import Java classes.
Jython project homepage
Unfortunately, that's all I know on the subject.
Not sure if this is good etiquette: I couldn't fit all my comments into a normal comment, and the post has had no activity for 8 months.
Just wanted to see how this was going for you? We have a very very very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary we have to make/modify handwritten code to do summaries. Our binary files are about 10 GB in size and there is one of these for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.
I've just discovered PyTables and had a similar idea to yours. I was hoping to change our storage format to HDF5 and then run our summary reports/queries using PyTables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using PyTables?
