How to read a PDF file in Java

I am working on a Java project that needs to read a PDF file.
I know it is possible using external libraries such as iText.
But is it possible to read a PDF file using only Java's built-in features, without any external library?

The JDK itself has no built-in API for parsing PDF files, so in practice you need a library. One option is Apache PDFBox, which allows creation of new PDF documents, manipulation of existing documents, and extraction of content from documents. Apache PDFBox also includes several command-line utilities.
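To illustrate how little the standard library offers here, the following is a minimal sketch that uses only JDK classes. It does not extract any text; it only checks whether a byte array starts with the usual "%PDF-" header and returns the version that follows. The class and method names (PdfHeader, version) are made up for this illustration, and the sketch assumes the header sits at the very first byte, which is the common case but not something every PDF reader insists on.

```java
import java.nio.charset.StandardCharsets;

final class PdfHeader {
    private static final String MAGIC = "%PDF-";

    /** Returns the version from a leading "%PDF-x.y" header, or null if there is none. */
    static String version(byte[] data) {
        // The header is plain ASCII, so ISO-8859-1 maps each byte to exactly one char.
        String head = new String(data, 0, Math.min(data.length, 16), StandardCharsets.ISO_8859_1);
        if (!head.startsWith(MAGIC)) {
            return null;
        }
        // Collect the digits and dots that follow the magic string.
        int end = MAGIC.length();
        while (end < head.length()
                && (Character.isDigit(head.charAt(end)) || head.charAt(end) == '.')) {
            end++;
        }
        return head.substring(MAGIC.length(), end);
    }
}
```

Everything past this point (object streams, fonts, text positioning) is what a library such as PDFBox or iText handles for you.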

You can extract the text of a PDF file with Apache PDFBox. In a Maven project, add this dependency to pom.xml:
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.8</version>
</dependency>
The code:
try {
    // Liferay-specific: look up the stored file entry and fetch it as a java.io.File
    DLFileEntry fileEntry = DLFileEntryLocalServiceUtil.getFileEntry(folder.getGroupId(), folder.getFolderId(), fileName);
    File file = DLFileEntryLocalServiceUtil.getFile(themeDisplay.getUserId(), fileEntry.getFileEntryId(), fileEntry.getVersion(), true);
    // PDDocument holds file resources; try-with-resources closes it when done
    try (PDDocument pddDocument = PDDocument.load(file)) {
        PDFTextStripper textStripper = new PDFTextStripper();
        String text = textStripper.getText(pddDocument);
    }
} catch (Exception e) {
    e.printStackTrace();
}
To read/create a PDF, see the documentation:
https://pdfbox.apache.org/

Related

How to convert docx to PDF without split tables

I have a dynamically generated docx with a few tables, and I'm trying to convert it to PDF. The resulting PDF spans two pages. I use the Apache POI XWPF converter, version 2.0.2.
In the docx file everything is fine, but in the converted PDF the tables are split across pages.
Does anyone have an idea how to fix this, or know a better library for converting docx to PDF?
PdfOptions options = PdfOptions.getDefault();
options.fontProvider((familyName, encoding, size, style, color) -> {
try {
BaseFont baseFont = BaseFont.createFont("fonts/times.ttf", encoding, BaseFont.EMBEDDED);
return new Font(baseFont, size, style, color);
} catch (Exception e) {
throw new RuntimeException(e);
}
});
PdfConverter.getInstance().convert(document, out, options);
I am not aware of any library that converts a doc[x] file into a PDF with completely faithful formatting. In my experience the only program that does is Word itself.
I achieved this by driving Word through its COM API from a PowerShell script:
$document_path = $args[0]
$document_parent_folder = $args[1]
$file_name = $args[2]
$word_app = New-Object -ComObject Word.Application
$document = $word_app.Documents.Open($document_path)
$pdf_filename = "$($document_parent_folder)\$($file_name)"
$document.SaveAs([ref] $pdf_filename, [ref] 17)
$document.Close()
$word_app.Quit()
Admittedly this is not the best solution: it depends on Microsoft Office being installed on the machine, and that brings a number of other problems with it. But it is the only approach that formatted my documents exactly how I wanted them.
The script takes three arguments:
- The path of the document that will be converted
- The folder where the document is located (the PDF is written there)
- The name of the PDF file
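Since the question is about a Java application, here is a rough sketch of how the three arguments could be passed to the script from Java. It only builds the command line; the class name WordToPdfCommand and the idea of keeping the script in a separate file are assumptions for this illustration, not part of the answer above.

```java
import java.util.Arrays;
import java.util.List;

final class WordToPdfCommand {
    /**
     * Builds the command line that runs the conversion script.
     * scriptPath is a hypothetical location of the PowerShell script shown above.
     */
    static List<String> build(String scriptPath, String docPath, String parentFolder, String pdfName) {
        return Arrays.asList(
                "powershell.exe", "-NoProfile", "-ExecutionPolicy", "Bypass",
                "-File", scriptPath,
                // The three positional arguments arrive in the script as $args[0..2]
                docPath, parentFolder, pdfName);
    }
}
```

The returned list can be handed to new ProcessBuilder(command).inheritIO().start(), followed by waitFor() to block until Word has finished. This only works on a Windows machine with Word installed.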

how to know whether a file is .docx or .doc format from Apache POI

I know we can do this by extension or by MIME type. Is there any other way to tell whether a file is in .docx or .doc format?
If it is just a matter of deciding between .doc and .docx for a collection of files known to be one or the other, but not marked with an extension, you can use the fact that a .docx file is a zipped collection of files. Something along the following lines might help:
boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;
where fileStream is whatever file or other input stream you wish to evaluate. You could further evaluate a zipped file by looking for key .docx entries. A good starting reference is Word Document (DOCX). Likewise, if you know it is just a binary file, you can test for Word's File Information Block (see Word (.doc) Binary File Format)
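The "look for key .docx entries" idea can be sketched with JDK classes only. The example below walks the ZIP entries and reports whether one of them is named word/document.xml, which is where Word conventionally stores the main document part; treating that name as the marker is an assumption of this sketch. The names DocxSniffer, looksLikeDocx and zipOf are invented for the illustration, and zipOf exists only to make the example easy to try without a real file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DocxSniffer {
    /** Returns true if the stream is a ZIP archive containing word/document.xml. */
    static boolean looksLikeDocx(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return true;
                }
            }
            return false; // not a ZIP at all, or a ZIP without the Word part
        } catch (IOException e) {
            return false; // corrupt or truncated data is treated as "not a docx"
        }
    }

    /** Demo helper: builds a tiny in-memory ZIP with a single entry. */
    static byte[] zipOf(String entryName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Note that the method consumes and closes the stream, so reopen the file before handing it to a real parser.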
You could use Apache Tika for content detection. But you should be aware that it is a huge framework (many required dependencies) for such a small task.
There is a way, though not a straightforward one, using Apache POI.
Try to read a .docx file using the HWPFDocument class. It gives you the following error:
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied
data appears to be in the Office 2007+ XML. You are calling the part
of POI that deals with OLE2 Office Documents. You need to call a
different part of POI to process this data (eg XSSF instead of HSSF)
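String filePath = "C:\\XXXX\\XXXX.docx";
// try-with-resources closes the stream even when parsing fails
try (FileInputStream inStream = new FileInputStream(new File(filePath))) {
    HWPFDocument doc = new HWPFDocument(inStream);
    WordExtractor wordExtractor = new WordExtractor(doc);
    System.out.println("Getting words" + wordExtractor.getText());
} catch (OfficeXmlFileException e) {
    // HWPF only reads the binary OLE2 format, so this is an OOXML (.docx) file
    System.out.println("It's not a .doc format");
} catch (IOException e) {
    e.printStackTrace();
}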
.docx can be read using XWPFDocument Class.
Why don't you use Apache Tika:
File file = new File("File Here");
Tika tika = new Tika();
String filetype = tika.detect(file);
System.out.println(filetype);
Assuming you're using Apache POI, you have a few options.
One is to grab the first few bytes of the file and ask POIFSFileSystem with the hasPOIFSHeader(byte[]) method. If you have a stream that supports mark/reset, you can instead use POIFSFileSystem.hasPOIFSHeader(InputStream). If those return true, try to open the file as a .doc with HWPF; otherwise try it as a .docx with XWPF.
Otherwise, if you prefer a try/catch approach, try to open the file with POIFSFileSystem and catch OfficeXmlFileException: if it opens fine it's a .doc, if you get the exception it's a .docx.
If you look at the source code for WorkbookFactory you'll see the first pattern in use; you can copy a similar set of logic from that.
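The header check can also be sketched without POI, as a way to see what it boils down to. OLE2 compound files (which is what .doc uses) start with the eight bytes D0 CF 11 E0 A1 B1 1A E1, and ZIP archives (which is what .docx uses) start with "PK" followed by 03 04. The class and enum names below are invented for this illustration; this is not POI's implementation, and the result only identifies the container, so an .xls file would also report OLE2 and any ZIP archive would report ZIP.

```java
import java.util.Arrays;

final class WordFormatSniffer {
    enum Kind { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (.doc, but also .xls and .ppt)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature of a ZIP archive (.docx, but also any other ZIP)
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    /** Classifies a file by its first bytes; pass at least the first 8 bytes. */
    static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A result of OLE2 suggests trying HWPF and ZIP suggests trying XWPF; the parser itself still has the final say on whether the file is a valid Word document.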

I can't import com.itextpdf.text.Document class

I'm building an Android app and I want to use iText to create a PDF file, but I can't use the Document class. As I have seen in tutorials, there should be an import com.itextpdf.text.Document for using the Document class. For this app, I'm using the com.itextpdf:itext-pdfa:5.5.9 library. I want to create a simple PDF file with 2 paragraphs, something like this:
try {
    File pdfFolder = new File(Environment.getExternalStoragePublicDirectory(
            Environment.DIRECTORY_DOCUMENTS), "pdfdemo");
    if (!pdfFolder.exists()) {
        pdfFolder.mkdir();
    }
    Date date = new Date();
    String timeStamp = new SimpleDateFormat("yyyyMMdd_HHmmss").format(date);
    // Use the (parent, child) constructor so the file is created inside pdfFolder
    File myFile = new File(pdfFolder, timeStamp + ".pdf");
    OutputStream output = new FileOutputStream(myFile);
    Document document = new Document();
    PdfAWriter.getInstance(document, output);
    document.open();
    document.add(new Paragraph(mSubjectEditText.getText().toString()));
    document.add(new Paragraph(mBodyEditText.getText().toString()));
    document.close();
} catch (Exception e) {
    // Print the failure instead of silently swallowing it
    e.printStackTrace();
}
Could anyone help me with this problem? What am I doing wrong?
You say:
I'm using com.itextpdf:itext-pdfa:5.5.9 library
That is wrong for two reasons:
- itext-pdfa is an add-on to iText that is meant for writing or manipulating PDF/A documents. It requires the core iText library. Read about the different parts of iText on the official web site: https://developers.itextpdf.com/itext-java
- You say you want to use iText on Android, but you are referring to iText for Java. iText for Java contains classes that are not allowed on Android (java.awt.*, javax.nio,...). You should use the Android port of iText, which is called iTextG: https://developers.itextpdf.com/itextg-android
It's as if you're using iText without having visited the official iText web site. How is that even possible?
Just open your app-level Gradle file and add the following line to your dependencies:
implementation 'com.itextpdf:itext-pdfa:5.5.9'
It works for me

how to add custom properties metadata to the pdf using apache fop

I am using Apache FOP to create PDF files and need to add specific metadata to the PDF. In Adobe Reader it is called "custom properties" and consists of name/value pairs.
I can add simple metadata like this:
out = new ByteArrayOutputStream();
fop = fopFactory.newFop(MimeConstants.MIME_PDF, foUserAgent, out);
foUserAgent.setKeywords("some keywords");
But I need to add customised metadata with name and value. Any idea how to do it?
Maybe you're lucky with the XMP support in FOP 1.1? Try with some keys that you find in the XMP specification.

Generating docx in java

I have a docx template that I am saving as .xml and then parsing the content.
Then I am generating a new, updated Word document. After the Word document is generated I am unable to open it directly. Word reports that the document is corrupt. I press OK. Then it asks whether I want to recover the document. I press OK, and then I get the updated document. This happens every time. I have created the same program as a standalone Java application, and the document generated by the standalone application opens without any errors. Could anyone give me some insight into this? I have used the same code on the server side.
Here is the code that I use to generate the document.
// Prepare the DOM document for writing
Source source = new DOMSource(doc);
// Prepare the output file; try-with-resources closes it even if the transform fails
try (FileOutputStream file = new FileOutputStream(filename)) {
    Result result = new StreamResult(file);
    // Write the DOM document to the file
    Transformer xformer = TransformerFactory.newInstance().newTransformer();
    xformer.transform(source, result);
} catch (TransformerConfigurationException e) {
    System.out.println("Transformation Configuration Exception in WriteXMLFile");
} catch (TransformerException e) {
    System.out.println("Transformation Exception in WriteXMLFile");
} catch (Exception e) {
    System.out.println("Exception in WriteXMLFile");
    e.printStackTrace();
}
I use the POI library to generate Word documents (.doc, not .docx, but it should work too).
With POI you can:
- open your Word document
- edit whatever you want with a clean API (no messing around in XML)
- write the result
http://poi.apache.org/
You can use POI or docx4j to ensure you create valid Word documents.
Have you checked the encoding of the JVM? I had that problem, and finally discovered that in Eclipse I had UTF-8, but in the standalone run I didn't specify an encoding, so the JVM took ISO-8859-1.
Please check it with the parameter -Dfile.encoding=UTF-8.
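To see why a mismatched default encoding can damage a document, here is a small sketch that writes text with one charset and reads it back with another. The class and method names are invented for the illustration. It is not a diagnosis of the code in the question, only a demonstration of the effect described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    /** Encodes text with one charset and decodes the bytes with another. */
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] bytes = text.getBytes(writeWith);
        return new String(bytes, readWith);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // e with acute accent
        // Same charset on both sides: the text survives.
        System.out.println(original.equals(
                roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
        // Written as UTF-8 but read as ISO-8859-1: one character becomes two.
        System.out.println(
                roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length());
    }
}
```

Passing an explicit Charset wherever bytes are turned into text (readers, writers, getBytes) makes the result independent of the -Dfile.encoding setting.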
I have used both Apache POI and docx4j extensively, and in my experience docx4j is more robust: it offers more support out of the box, not only for the document itself but also for tables and images. A lot of what docx4j does is automated, whereas with Apache POI you have to do a lot of manual coding for docx support. Unfortunately, not much has been done in POI for docx support. I would suggest using docx4j, as it has native support for opening and saving a new .docx file out of the box.
