How to verify text/content in thousands of PDF files - java

I want to automatically verify that a certain piece of text or sentence is present in each PDF file. I have thousands of PDF files that need to be checked for a specific text or sentence.

You can do this with Apache Lucene and Apache PDFBox.
Please refer to this post: http://www.programming-free.com/2012/11/simple-word-search-in-pdf-files-using.html
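For a one-off check across many files, the batch loop itself can be sketched with the JDK alone. This is a minimal sketch, not the linked post's method: the class name `PdfPhraseCheck` and the `extractText` parameter are hypothetical, and the text extractor is passed in by the caller. With PDFBox 2.x the extractor would typically wrap `PDDocument.load(file)` and `new PDFTextStripper().getText(document)`; the Lucene indexing step is left out here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PdfPhraseCheck {

    /**
     * Returns the *.pdf files under root whose extracted text does NOT contain the phrase.
     * extractText is an assumption of this sketch: the caller supplies it,
     * for example by wrapping a PDF library's text extraction.
     */
    static List<Path> filesMissingPhrase(Path root, String phrase,
                                         Function<Path, String> extractText) {
        String needle = normalize(phrase);
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .filter(p -> !normalize(extractText.apply(p)).contains(needle))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Collapse whitespace and ignore case, so that a line break inside the
    // extracted text does not hide a phrase that spans two lines.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }
}
```

An empty result means every PDF under the directory contains the phrase, which is easy to turn into a test assertion.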

Related

Unable to compress excel file (XLSX or XLS) using java

In a Java Spring MVC based application, I need to reduce the size of an Excel file, because the application is unable to send it as an email attachment.
Can anyone please help?
The formats with 'x' at the end (such as XLSX) are already zip archives, so compressing them again gains very little. You could unzip the Excel file and look for ways to shrink the data itself:
Images in lower resolution.
Excel XML with less repetition of styles, attributes.
Repeated expressions.
Maybe there are embedded fonts.
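To see where the size goes, the workbook can be opened as an ordinary zip archive. The following is a minimal sketch using only `java.util.zip`; the class and method names are hypothetical. It lists each entry with its compressed and uncompressed size, largest first.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class XlsxInspector {

    /**
     * Lists the entries of a zip-based workbook as
     * "name compressedBytes/uncompressedBytes", largest compressed size first.
     */
    static List<String> entriesBySize(Path xlsx) throws IOException {
        try (ZipFile zip = new ZipFile(xlsx.toFile())) {
            // Collect inside the try block: the entry stream is only valid
            // while the ZipFile is open.
            return zip.stream()
                    .sorted(Comparator.comparingLong(ZipEntry::getCompressedSize).reversed())
                    .map(e -> e.getName() + " " + e.getCompressedSize() + "/" + e.getSize())
                    .collect(Collectors.toList());
        }
    }
}
```

Large entries under `xl/media/` would point to images, while a large `xl/styles.xml` would point to repeated styles.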
Here is a tutorial on how to zip single or multiple files in Java, which should give you what you need:
http://www.kscodes.com/java/how-to-compress-files-in-java/

Retrieving PDF files saved in a database for manipulation in PDFBox (Spring Boot and Hibernate app)

Hello, I need to retrieve PDF files from a database for further work with PDFBox. How do I merge multiple PDF files stored in a database with PDFBox? I want to load them using Hibernate, but PDFBox needs a source location (String/directory) and a destination directory (also String/directory). Another problem is that Spring uploads and downloads files as Multipart files, while PDFBox needs an InputStream or File (bytes?). Do you have any idea how to merge multiple PDFs (saved in the database) into one PDF using PDFBox? The merged PDF will also be saved into the database. Thanks for your help.
I think a good approach would be to separate the problems and solve them one by one.
Do you have any idea how to merge multiple PDFs (saved in the database) into one PDF using PDFBox?
I don't know PDFBox in detail, but if it asks for a file you could use a temporary directory and store the file there. From there you could read the bytes and save them to your database.
See also: How to create a temporary directory/folder in Java?
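The temporary-directory approach can be sketched with `java.nio.file.Files`. This is only an illustration under assumptions: the class and method names are hypothetical, and the bytes are assumed to have already been loaded from the database (for example from a BLOB column).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class TempFileBridge {

    /** Writes the given bytes to a new file inside a fresh temporary directory. */
    static Path writeToTempFile(byte[] content, String fileName) throws IOException {
        Path dir = Files.createTempDirectory("pdf-work-");
        return Files.write(dir.resolve(fileName), content);
    }

    /** Reads a file back into memory, then deletes the file and its directory. */
    static byte[] readAndCleanUp(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        Files.delete(file);
        // Throws DirectoryNotEmptyException if something else was written there.
        Files.delete(file.getParent());
        return bytes;
    }
}
```

The path returned by `writeToTempFile` can be handed to any API that insists on a file location, and `readAndCleanUp` returns the bytes to store back in the database.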
Another problem is that Spring uploads and downloads files as Multipart files, while PDFBox needs an InputStream or File (bytes?)
I don't understand the question exactly. If PDFBox needs an InputStream, you could take a look at the ByteArrayInputStream class.
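A minimal sketch of the in-memory route, using only `java.io` (the class and method names are hypothetical): a `byte[]` loaded from the database is wrapped in a `ByteArrayInputStream`, and whatever a library writes can be collected in a `ByteArrayOutputStream` and stored back as a `byte[]`. On the Spring side, `MultipartFile` exposes both `getBytes()` and `getInputStream()`, so no file on disk should be needed. As far as I know, the PDFBox 2.x `PDFMergerUtility` accepts `InputStream` sources and an `OutputStream` destination, but check that against the version you use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class InMemoryStreams {

    /** Wraps bytes (e.g. a PDF loaded from the database) as an InputStream. */
    static InputStream asStream(byte[] pdfBytes) {
        return new ByteArrayInputStream(pdfBytes);
    }

    /** Reads a stream fully into a byte array, ready to be saved to the database. */
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        in.transferTo(out); // InputStream.transferTo requires Java 9 or later
        return out.toByteArray();
    }
}
```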

Apache Tika Output Format

I have a requirement where PDF files come in as input; I have to read each one and, based on some rules, split the PDF by page. The rules will be driven by data extracted from the given PDF.
I went through the Apache Tika toolkit, which I believe is supposed to be built for such a requirement. The data is extracted by this tool, but in text format. I want the output back in PDF format. I am not sure whether that is possible or not. Please suggest.
Thanks.
Manish.

How to search specific content (with or without regular expression) in pdf?

I have a list of PDF files. I want to search for the presence of specific content in each of these files and separate the files that have the content from the others. I want to know whether such a search function is possible using the Java library iText.

PDF Open Office or MS Word

I am new to Java. I have to read a PDF, OpenOffice or MS Word file, make changes to it, and render it as a PDF document on my web page. Could someone tell me which of these file formats has an API or SDK that is easy to use, and which SDK is best for this, so that I can read, update and render easily? The file also contains a table, but no images.
We use Apache POI to read Microsoft Office files. There are many libraries for PDF in Java. iText is something I have used. Once you pick the tools, do a selective search on Stack Overflow. There are plenty of discussions around these tools.
Depending on the types of updates you are doing, modifying PDF is going to be a problem: it's not intended for editing. You might have to find some way of converting the PDF to something else first, then edit that. Depending on the types of changes you want to make and the documents you are working from, even editing DOC and Writer files is going to be tricky. They are all different formats.
As Jayan mentioned, iText and POI may help you a little. OpenOffice Writer documents can be edited by unzipping them and then modifying the XML, or by using the UNO API. Word documents can be edited by using MS Office automation (bad idea), by converting to OpenOffice first and then editing, or, if DOCX, by unzipping and processing the XML.
Good luck.
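The "unzip and process the XML" route can be sketched with `java.util.zip` alone. This is a naive illustration under assumptions, not a robust editor: the class and method names are hypothetical, it does a plain string replacement inside one entry while copying every other entry unchanged, and it will miss text that the word processor has split across several XML elements. In a DOCX the main body is the entry `word/document.xml`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipXmlEditor {

    /**
     * Copies a zip-based document from source to target, replacing every
     * occurrence of search with replacement inside the entry named entryName.
     */
    static void replaceInEntry(Path source, Path target, String entryName,
                               String search, String replacement) throws IOException {
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(source));
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(target))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                out.putNextEntry(new ZipEntry(entry.getName()));
                if (entry.getName().equals(entryName)) {
                    // Reads only the current entry, not the whole archive.
                    String xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    out.write(xml.replace(search, replacement)
                            .getBytes(StandardCharsets.UTF_8));
                } else {
                    in.transferTo(out); // copy the entry's bytes unchanged
                }
                out.closeEntry();
            }
        }
    }
}
```

For example, `ZipXmlEditor.replaceInEntry(in, out, "word/document.xml", "Draft", "Final")` would write an edited copy of a DOCX. For an OpenOffice Writer file the entry would be `content.xml`, but the ODF package format has extra rules for its `mimetype` entry that this simple copy does not take care of.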
