I'm developing a web application using Flex and JSP.
I am having some performance issues with displaying multiple PDF files.
I am trying to display about 50-100 PDF files. I know that is a little crazy.
So I changed the project to convert the PDF files to JPG and display the JPG files instead.
I'm wondering whether there is a way to reduce the size of the PDF files to about the size of the JPG files.
I would also like to know of other ways to improve the performance.
Does anyone know a good way to display many PDF files (that will be mostly just text) in a web application? Or should I just have it display JPG files?
If the PDF files are mostly text you should probably use HTML. Is there something that would prevent you from making regular pages from your PDFs?
You can convert the PDF to an RTF text file, then use the text from that file to populate your HTML page, perhaps in a table.
Check out the Ghostscript library for doing this conversion.
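As a rough sketch of the "populate your HTML page" step, assuming the text has already been extracted from the PDF into a list of lines by some other tool, something like the following would turn those lines into a one-column HTML table. The class and method names (`TextToHtmlTable`, `escapeHtml`, `toHtmlTable`) are made up for illustration and are not part of any library.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns already-extracted text lines into an HTML table.
final class TextToHtmlTable {

    // Escape the characters that would otherwise be interpreted as markup.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // One table row per extracted line of text.
    static String toHtmlTable(List<String> lines) {
        StringBuilder html = new StringBuilder("<table>\n");
        for (String line : lines) {
            html.append("  <tr><td>").append(escapeHtml(line)).append("</td></tr>\n");
        }
        return html.append("</table>\n").toString();
    }

    public static void main(String[] args) {
        // Sample lines standing in for text extracted from a PDF.
        System.out.print(toHtmlTable(Arrays.asList("Invoice 42", "Total < 100 & paid")));
    }
}
```

Escaping matters here because text pulled out of a PDF can contain `<` or `&`, which would otherwise break the page.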
Related
The task is to write a Java program that analyzes a PDF file. Apache PDFBox should be used.
The number of words, the number of images, the names of the fonts used, etc. are all no problem.
My problem is: how do I get all the font sizes used in the PDF file? I have read in many places that I have to use PDFTextStripper and writeString, but I don't see a solution.
So how do I get the font sizes in pt for a PDF file? Does anyone have an idea or a solution?
How to extract PDF file content in java completely as Text and render as HTML?
I don't mean extracting just the text or just the images separately. The requirement is to display the contents of the PDF file as HTML, looking like the original file, with images and tables in the same places as in the original.
Something like the sample in the answer to "Convert Word to HTML with Apache POI", which extracts the contents of an MS Word document into HTML using Apache POI.
Extracting data from a PDF file is fairly simple. There are multiple libraries out there that do it correctly. Extracting data while preserving its layout, on the other hand (the workflow the OP describes), is a very difficult process. The reason is simple: most PDF files don't really have any elements that define structure. When a PDF file displays a table, for example, it's very easy for a human to see it and understand that this is indeed a table with some data in it. However, in the PDF file itself, it is just a collection of vector lines with some text runs in between. Neither the PDF itself nor the PDF viewer is aware that this is a table. Therefore, when this data is converted to HTML, the converter doesn't know that it needs to draw a table, and instead sees vector art. This is just one example of why this is difficult; there are many others.
On the other hand, there is such a thing as "Tagged PDF" (section 10.7). It's a PDF in which structure elements are actually defined, so extraction is fairly easy. However, tagged PDF files are not as common as we would like, and in most cases you have no guarantee that you will be working with one.
There are some tools on the market that use sophisticated logic to infer the structure of an untagged document. Some of them do a better job than others at this. I've worked with Adobe Acrobat, which does a decent job at creating an HTML file. There is also an offering from Datalogics (I work for Datalogics) called PDF Alchemist which converts PDF to HTML. Both of them are commercial solutions.
If you are looking for a free solution, PDFBox does a good job of extracting content from a PDF document. However, it doesn't have the ability to create an HTML file, so that is something that will have to be implemented outside of the library. I'm not aware of any free PDF to HTML solution that does a good enough job that I would be willing to recommend it.
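To give a feel for the kind of inference involved with untagged PDFs, here is a deliberately naive sketch that groups positioned text runs into table-like rows by comparing their baselines. It assumes you have already obtained each run's text and x/y position from some extraction library. The `TextRun` type, the `groupIntoRows` method and the tolerance value are all hypothetical, and real tools use far more elaborate logic than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: infer rows from positioned text runs of an untagged page.
final class RowGrouper {

    // A piece of text with the position of its baseline (PDF units, y grows upward).
    static final class TextRun {
        final String text;
        final double x;
        final double y;

        TextRun(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Runs whose baselines lie within `tolerance` of the first run in a row
    // are treated as belonging to that row.
    static List<List<String>> groupIntoRows(List<TextRun> runs, double tolerance) {
        List<TextRun> sorted = new ArrayList<>(runs);
        // Top of the page first: larger y values come first.
        sorted.sort(Comparator.comparingDouble((TextRun r) -> r.y).reversed());

        List<List<TextRun>> rows = new ArrayList<>();
        double rowY = 0.0;
        for (TextRun run : sorted) {
            if (rows.isEmpty() || Math.abs(rowY - run.y) > tolerance) {
                rows.add(new ArrayList<>());
                rowY = run.y; // first run in the row anchors the baseline
            }
            rows.get(rows.size() - 1).add(run);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<TextRun> row : rows) {
            // Within a row, order the cells left to right.
            row.sort(Comparator.comparingDouble((TextRun r) -> r.x));
            List<String> cells = new ArrayList<>();
            for (TextRun run : row) {
                cells.add(run.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        List<TextRun> runs = Arrays.asList(
                new TextRun("12", 200, 679.8),
                new TextRun("Name", 50, 700),
                new TextRun("Bolt", 50, 680),
                new TextRun("Qty", 200, 700.4));
        System.out.println(groupIntoRows(runs, 2.0));
    }
}
```

Even this toy version shows the problem: nothing in the input says "table", so rows and columns have to be guessed from geometry, and the guess breaks as soon as cells wrap onto several lines or columns are not aligned.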
I have a bunch of banner files in .swf format. I ultimately want to place these inside a pdf using Java. I have been successful in putting jpegs inside a pdf in Java so far.
I want to know the following things:
Is it possible to place the banner files (.swf) in a PDF using either standard Java or an external library?
If the above is not possible, is it possible to convert these files into animated GIFs (again with standard or external Java libraries) and put those in the PDF?
At the very least, I would like to extract a frame from each of these .swf files and use it as a JPEG in my PDF.
Some PDF readers, such as Acrobat, may play the SWF, but most won't.
You would be better off converting them.
Here are some external tools that can help you in the process: http://www.swftools.org/
As for doing this in pure Java, it is possible, but it would require re-rendering all the SWF vector drawing, which is quite heavy work.
I currently have a program that takes a PDF saved on an Android device and calls an intent to display it. What I want to do is create a multipage PDF out of two PDF files saved on the device. Is there some sort of token or symbol that PDFs use to tell the viewer to start a new page? What can I put between the two input streams to indicate that I want a new PDF page?
I did a lot of research and tried to use iText, but it seemed to over-complicate my program. Is there a simple way to achieve my goal?
Thanks in advance!
How can I read the content of a scanned PDF file in Java/JSP or in JavaScript? Can you tell me how to achieve this in code?
Thanks in advance for any reply.
You can convert the scanned PDF to an image using Ghostscript and then feed it to an OCR engine, such as Tesseract. Take a look at VietOCR for an example implementation.
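A minimal sketch of that pipeline in Java, assuming the `gs` and `tesseract` command-line tools are installed and on the PATH (on Windows the Ghostscript executable is typically `gswin64c` instead). The class name, method names and file names are placeholders chosen for illustration, and the 300 dpi resolution is just a common choice for OCR.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: rasterize a scanned PDF with Ghostscript, then OCR a page with Tesseract.
final class ScannedPdfOcr {

    // Ghostscript command that renders every page of the PDF to a 24-bit PNG.
    // pngPattern may contain %03d, which Ghostscript replaces with the page number.
    static List<String> ghostscriptCommand(String pdfPath, String pngPattern, int dpi) {
        return Arrays.asList(
                "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
                "-sDEVICE=png16m",
                "-r" + dpi,
                "-sOutputFile=" + pngPattern,
                pdfPath);
    }

    // Tesseract command: recognized text is written to <outputBase>.txt.
    static List<String> tesseractCommand(String imagePath, String outputBase) {
        return Arrays.asList("tesseract", imagePath, outputBase);
    }

    // Run an external command and fail loudly if it does not exit cleanly.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode + ": " + command);
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder file names; adjust to your own input and output locations.
        run(ghostscriptCommand("scan.pdf", "page-%03d.png", 300));
        run(tesseractCommand("page-001.png", "page-001"));
    }
}
```

In a web application this would run in a servlet or a background job rather than in a JSP, with the resulting text files read back and returned to the front end.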
What you are trying to do (I think) is use OCR to extract text from an image PDF produced by a scanner. Java is probably the best choice for this. There are a number of options, depending on whether you are prepared to pay for software. Google for Java (or JavaScript), PDF and OCR.
IMO, this task is not something that should be done in a JSP. JSPs are best for rendering results ... not for generating them in the first place.
Actually, I am working on a similar project at the moment. I am doing it in the following steps, and the result works well.
The user uploads a scanned PDF to the PDFUploader servlet, which returns a server-side file name to the front end to indicate that the upload was successful.
The front end uses this file name and the default page 0 to ask the PDFReader servlet to retrieve the first page of the PDF file and display it at the front end. You can convert the PDF page to an image, or use an iframe to show it in an embedded PDF reader.
The front end uses this file name and the default page 0 to ask OCRServlet to perform OCR. I am using WeOCR and Tesseract as my OCR engine on an Apache HTTP server. I have modified some parts of submit.cgi in the WeOCR server, since I know which formats the WeOCR server will receive. I still have some problems when converting the scanned PDF to an image (I am using PDFBox).
Google for anything OCR-related.
Your best bet will be to use an existing library such as http://asprise.com/product/ocr/index.php?lang=java