I am new to Java programming. My current project requires me to read embedded (OLE) files in an Excel sheet and extract their text contents. Examples for reading an embedded Word file worked fine, but I cannot find help on reading an embedded PDF file. I tried a few things based on similar examples, but they didn't work.
http://poi.apache.org/spreadsheet/quick-guide.html#Embedded
My code is below; with some help I can probably get going in the right direction. I use Apache POI to read the embedded files in Excel and PDFBox to parse the PDF data.
public class ReadExcel1 {
    public static void main(String[] args) {
        try {
            FileInputStream file = new FileInputStream(new File("C:\\test.xls"));
            POIFSFileSystem fs = new POIFSFileSystem(file);
            HSSFWorkbook workbook = new HSSFWorkbook(fs);
            for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
                String oleName = obj.getOLE2ClassName();
                if (oleName.equals("Acrobat Document")) {
                    System.out.println("Acrobat reader document");
                    try {
                        DirectoryNode dn = (DirectoryNode) obj.getDirectory();
                        for (Iterator<Entry> entries = dn.getEntries(); entries.hasNext();) {
                            DocumentEntry nativeEntry = (DocumentEntry) dn.getEntry("CONTENTS");
                            byte[] data = new byte[nativeEntry.getSize()];
                            ByteArrayInputStream bao = new ByteArrayInputStream(data);
                            PDFParser pdfparser = new PDFParser(bao);
                            pdfparser.parse();
                            COSDocument cosDoc = pdfparser.getDocument();
                            PDFTextStripper pdfStripper = new PDFTextStripper();
                            PDDocument pdDoc = new PDDocument(cosDoc);
                            pdfStripper.setStartPage(1);
                            pdfStripper.setEndPage(2);
                            System.out.println("Text from the pdf " + pdfStripper.getText(pdDoc));
                        }
                    } catch (Exception e) {
                        System.out.println("Error reading " + e.getMessage());
                    } finally {
                        System.out.println("Finally ");
                    }
                } else {
                    System.out.println("nothing ");
                }
            }
            file.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Below is the output in Eclipse:
Acrobat reader document
Error reading Error: End-of-File, expected line
Finally
nothing
The PDFs weren't OLE 1.0 packaged, but embedded in some other way; at least the extraction worked for me.
This is not a general solution, because it depends on how the embedding application names the entries. For PDFs you could of course check all DocumentNodes for the magic number "%PDF"; for OLE 1.0 packaged elements this needs to be done differently.
I think the real filename of the PDF is hidden somewhere in the \1Ole or CompObj entries, but for this example, and apparently for your use case, it isn't necessary to determine it.
import java.io.*;
import java.net.URL;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.util.IOUtils;

public class EmbeddedPdfInExcel {
    public static void main(String[] args) throws Exception {
        NPOIFSFileSystem fs = new NPOIFSFileSystem(new URL("http://jamesshaji.com/sample.xls").openStream());
        HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
        for (HSSFObjectData obj : wb.getAllEmbeddedObjects()) {
            String oleName = obj.getOLE2ClassName();
            DirectoryNode dn = (DirectoryNode) obj.getDirectory();
            // In this sample the raw PDF bytes live in an entry named "CONTENTS"
            if (oleName.contains("Acro") && dn.hasEntry("CONTENTS")) {
                InputStream is = dn.createDocumentInputStream("CONTENTS");
                // Write the embedded PDF to a file named after its OLE directory
                FileOutputStream fos = new FileOutputStream(obj.getDirectory().getName() + ".pdf");
                IOUtils.copy(is, fos);
                fos.close();
                is.close();
            }
        }
        fs.close();
    }
}
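The "%PDF" magic-number check mentioned above could look roughly like the following. This is only a sketch using the plain JDK; the class and method names are made up for illustration and are not part of POI or PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI or PDFBox.
class PdfMagic {
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** True if the given bytes begin with the ASCII signature "%PDF". */
    static boolean isPdfHeader(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads up to four bytes from the stream and checks them against the signature. */
    static boolean startsWithPdfMagic(InputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                return false; // stream ended before four bytes were available
            }
            read += n;
        }
        return isPdfHeader(head);
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithPdfMagic(new ByteArrayInputStream(sample)));
    }
}
```

With POI, the stream for each candidate entry would presumably come from dn.createDocumentInputStream(entryName), as in the answer's code. Note that the check consumes the first bytes, so the entry has to be reopened before copying it.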
I want to open a password-protected docx file using Apache POI. Can anyone help me with complete code? My attempt (shown after the stack trace) fails with this exception:
Exception in thread "main" org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:126)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:113)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:301)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:413)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:394)
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("D:/abc.docx"));
EncryptionInfo info = new EncryptionInfo(fs);
Decryptor decryptor = Decryptor.getInstance(info);
if (!decryptor.verifyPassword("user")) {
    throw new RuntimeException("document is encrypted");
}
InputStream in = decryptor.getDataStream(fs);
HSSFWorkbook wb = new HSSFWorkbook(in);
File f = new File("D:/abc5.docx");
wb.write(f);
The basic code for decrypting the XML-based formats of Microsoft Office is shown in "XML-based formats - Decryption".
But of course one must know that a *.docx, which is a Word file in Office Open XML format, cannot be an HSSFWorkbook, which would be an Excel workbook in the binary BIFF file format. It must instead be an XWPFDocument.
So:
import java.io.InputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.crypt.EncryptionInfo;
import org.apache.poi.poifs.crypt.Decryptor;
import java.security.GeneralSecurityException;

public class ReadEncryptedXWPF {

    static XWPFDocument decryptdocx(POIFSFileSystem filesystem, String password) throws Exception {
        EncryptionInfo info = new EncryptionInfo(filesystem);
        Decryptor d = Decryptor.getInstance(info);
        try {
            if (!d.verifyPassword(password)) {
                throw new RuntimeException("Unable to process: document is encrypted");
            }
            InputStream dataStream = d.getDataStream(filesystem);
            return new XWPFDocument(dataStream);
        } catch (GeneralSecurityException ex) {
            throw new RuntimeException("Unable to process encrypted document", ex);
        }
    }

    public static void main(String[] args) throws Exception {
        POIFSFileSystem filesystem = new POIFSFileSystem(new FileInputStream("abc.docx"));
        XWPFDocument document = decryptdocx(filesystem, "user");
        XWPFWordExtractor extractor = new XWPFWordExtractor(document);
        System.out.println(extractor.getText());
        extractor.close();
    }
}
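As background to the OfficeXmlFileException in the question: a plain .docx is a ZIP package, while a password-protected .docx is wrapped in an OLE2 container, which is why POIFSFileSystem can open the encrypted file but the decrypted stream must go to XWPFDocument. The two containers can be told apart by their leading bytes. The following is a rough JDK-only sketch; the class, enum and method names are hypothetical and not part of POI.

```java
import java.util.Arrays;

// Hypothetical helper: classifies a file by its first bytes.
class OfficeContainerSniffer {
    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of the OLE2 compound file format (binary .doc/.xls, encrypted OOXML)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature of a ZIP archive (unencrypted .docx/.xlsx/.pptx)
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    static Container sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data != null
            && data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipHead = { 'P', 'K', 3, 4, 20, 0, 0, 0 };
        System.out.println(sniff(zipHead));
    }
}
```

A caller would read the first eight bytes of the file and pass them to sniff; an OLE2 result for a .docx suggests the encrypted route shown above is needed.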
I have solved this. The code is below:
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("D:/abc.docx"));
EncryptionInfo info = new EncryptionInfo(fs);
Decryptor decryptor = Decryptor.getInstance(info);
XWPFDocument document = null;
if (decryptor.verifyPassword("password")) {
    InputStream dataStream = decryptor.getDataStream(fs);
    document = new XWPFDocument(dataStream);
} else {
    throw new Exception("file is protected with password...please open with right password");
}
// Write the decrypted document back out
File f = new File("D:/abc.docx");
FileOutputStream fos = new FileOutputStream(f);
document.write(fos);
fos.close();
document.close();
I need to be able to parse the text contained in a file online at a given URL, e.g. http://website.com/document.pdf.
I am making a search engine which can tell me whether the searched word is in some file online and retrieve the file's URL, so I don't need to download the file, just read it.
I was looking for a way and found something with InputStream and openConnection, but didn't manage to actually do it.
I am using jsoup to crawl a website and retrieve the URLs, and I tried to parse the file with a jsoup method, but it does not work.
So what is the best way to do this?
EDIT:
I want to be able to do something like this:
File in = new File("http://website.com/document.pdf");
Document doc = Jsoup.parse(in, "UTF-8");
System.out.println(doc.toString());
You can use a URL instead of a File to access the remote document. Using Apache Tika, you should be able to get the content as a string this way:
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://website.com/document.pdf");
        ContentHandler contenthandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        PDFParser pdfparser = new PDFParser();
        // Stream the PDF straight from the URL; nothing is written to disk
        try (InputStream is = url.openStream()) {
            pdfparser.parse(is, contenthandler, metadata, new ParseContext());
        }
        System.out.println(contenthandler.toString());
    }
}
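If the parser at hand wants the whole document up front, the remote file can also be held in memory instead of being saved to disk, which is what the question asks for. A minimal JDK-only sketch follows; the class and method names are invented for illustration, and it assumes the documents are small enough to fit in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;

// Hypothetical helper: loads a remote resource into a byte array without touching the disk.
class InMemoryFetch {

    /** Drains the stream into a byte array; the caller remains responsible for closing it. */
    static byte[] readAll(InputStream in) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Opens the given address and returns its full content. */
    static byte[] fetch(String address) {
        try (InputStream in = new URL(address).openStream()) {
            return readAll(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting bytes can be wrapped in a ByteArrayInputStream and handed to whichever parser is used, for example the Tika call above.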
You can use this code to first download the PDF file and then read its text with the Apache PDFBox library. You need to add some JARs manually.
You also need to set the path of your local PDF file, which is "download.pdf" by default.
import com.gnostice.pdfone.PdfDocument;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.ConnectException;
import java.net.URL;
import java.net.URLConnection;

public class LoadDocumentFromURL {
    public static void main(String[] args) throws IOException {
        URL url1 = new URL("https://arxiv.org/pdf/1811.06933.pdf");
        byte[] ba1 = new byte[1024];
        int baLength;
        FileOutputStream fos1 = new FileOutputStream("download.pdf");
        try {
            // Contacting the URL
            // System.out.print("Connecting to " + url1.toString() + " ... ");
            URLConnection urlConn = url1.openConnection();
            // Checking whether the URL contains a PDF
            if (!urlConn.getContentType().equalsIgnoreCase("application/pdf")) {
                System.out.println("FAILED.\n[Sorry. This is not a PDF.]");
            } else {
                try {
                    // Read the PDF from the URL and save to a local file
                    InputStream is1 = url1.openStream();
                    while ((baLength = is1.read(ba1)) != -1) {
                        fos1.write(ba1, 0, baLength);
                    }
                    fos1.flush();
                    fos1.close();
                    is1.close();
                    // Load the PDF document and display its page count
                    //System.out.print("DONE.\nProcessing the PDF ... ");
                    PdfDocument doc = new PdfDocument();
                    try {
                        doc.load("download.pdf");
                        // System.out.println("DONE.\nNumber of pages in the PDF is " + doc.getPageCount());
                        // System.out.println(doc.getAuthor());
                        // System.out.println(doc.getKeywords());
                        // System.out.println(doc.toString());
                        doc.close();
                    } catch (Exception e) {
                        System.out.println("FAILED.\n[" + e.getMessage() + "]");
                    }
                } catch (ConnectException ce) {
                    //System.out.println("FAILED.\n[" + ce.getMessage() + "]\n");
                }
            }
        } catch (NullPointerException npe) {
            //System.out.println("FAILED.\n[" + npe.getMessage() + "]\n");
        }
        File file = new File("your local pdf file address which is download.pdf");
        PDDocument document = PDDocument.load(file);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String text = pdfStripper.getText(document);
        System.out.println(text);
        document.close();
    }
}
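One fragile spot in the code above is the Content-Type test: URLConnection.getContentType() can return null, and servers may append parameters such as a charset, in which case equalsIgnoreCase("application/pdf") fails. A more tolerant check could be sketched as follows (the helper name is hypothetical, and this is only one possible way to do it):

```java
import java.util.Locale;

// Hypothetical helper for a null-safe, parameter-tolerant Content-Type check.
class ContentTypes {

    /** True if the header value names the media type application/pdf. */
    static boolean isPdf(String contentType) {
        if (contentType == null) {
            return false; // header missing or unknown
        }
        // Drop any parameters such as "; charset=binary"
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("application/pdf");
    }

    public static void main(String[] args) {
        System.out.println(isPdf("application/pdf; charset=binary"));
    }
}
```

With such a helper, the condition would become ContentTypes.isPdf(urlConn.getContentType()), and catching NullPointerException for this case would no longer be needed.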
I don't understand how to update metadata (title, subject, author, etc.) for a docx file using Apache POI.
I have tried it for a doc file using Apache POI:
File poiFilesystem = new File(file_path1);
/* Open the POI filesystem. */
InputStream is = new FileInputStream(poiFilesystem);
POIFSFileSystem poifs = new POIFSFileSystem(is);
is.close();
/* Read the summary information. */
DirectoryEntry dir = poifs.getRoot();
SummaryInformation si;
try {
    DocumentEntry siEntry = (DocumentEntry)
        dir.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
    DocumentInputStream dis = new DocumentInputStream(siEntry);
    PropertySet ps = new PropertySet(dis);
    dis.close();
    si = new SummaryInformation(ps);
} catch (FileNotFoundException ex) {
    /* There is no summary information yet. We have to create a new
     * one. */
    si = PropertySetFactory.newSummaryInformation();
}
si.setAuthor("xzy");
System.out.println("Author changed to " + si.getAuthor() + ".");
si.setSubject("mysubject");
si.setTitle("mytitle");
The code below works with POI 3.10. You can set some metadata with PackageProperties:
import java.util.Date;
import org.apache.poi.openxml4j.opc.*;
import org.apache.poi.openxml4j.util.Nullable;

class SetDOCXMetadata {
    public static void main(String[] args) {
        try {
            OPCPackage opc = OPCPackage.open("metadata.docx");
            PackageProperties pp = opc.getPackageProperties();
            Nullable<String> foo = pp.getLastModifiedByProperty();
            System.out.println(foo.hasValue() ? foo.getValue() : "empty");
            // Set some properties
            pp.setCreatorProperty("M Kazarian");
            pp.setLastModifiedByProperty("M Kazarian " + System.currentTimeMillis());
            pp.setModifiedProperty(new Nullable<Date>(new Date()));
            pp.setTitleProperty("M Kazarian document");
            // Closing the package saves the changes
            opc.close();
        } catch (Exception e) {
            // Report failures instead of silently swallowing them
            e.printStackTrace();
        }
    }
}
I am able to generate a PDF from a docx file using docx4j, but I need to convert a doc file to PDF, including images and tables.
Is there any way to convert doc to docx in Java (or doc to PDF)?
docx4j contains org.docx4j.convert.in.Doc, which uses POI to read the .doc, but it is a proof of concept, not production ready code. Last I checked, there were limits to POI's HWPF parsing of a binary .doc.
Further to mqchen's comment, you can use LibreOffice or OpenOffice to convert doc to docx. But if you are going to use LibreOffice or OpenOffice, you may as well use it to convert both .doc and .docx directly to PDF. Google 'jodconverter'.
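jodconverter drives an OpenOffice/LibreOffice installation for you. As a simpler illustration of the same idea, the sketch below calls the LibreOffice command line directly rather than going through jodconverter. It assumes a soffice binary is installed and reachable under the name you pass in; the class and method names are made up for this example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the LibreOffice command line (not jodconverter).
class SofficeConvert {

    /** Builds the command for a headless conversion of one document to PDF. */
    static List<String> buildCommand(String sofficeBinary, Path input, Path outDir) {
        return Arrays.asList(
            sofficeBinary,
            "--headless",
            "--convert-to", "pdf",
            "--outdir", outDir.toString(),
            input.toString());
    }

    /** Runs the conversion and returns the exit code of the soffice process. */
    static int convert(String sofficeBinary, Path input, Path outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(sofficeBinary, input, outDir))
            .inheritIO()
            .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: "soffice" is on the PATH and report.doc exists in the working directory
        int exit = convert("soffice", Paths.get("report.doc"), Paths.get("out"));
        System.out.println("soffice exited with " + exit);
    }
}
```

The same command handles .doc and .docx input, which matches the point above about converting both directly to PDF.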
Cribbing off the POI unit tests, I came up with this to extract the text from a Word document:
public String getText(String document) {
    try {
        // As in the POI unit tests, the .doc is expected to be the first entry of a zip file
        ZipInputStream is = new ZipInputStream(new FileInputStream(document));
        try {
            is.getNextEntry();
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try {
                IOUtils.copy(is, baos);
            } finally {
                baos.close();
            }
            byte[] byteArray = baos.toByteArray();
            ByteArrayInputStream bais = new ByteArrayInputStream(byteArray);
            HWPFDocument doc = new HWPFDocument(bais);
            WordExtractor extractor = new WordExtractor(doc);
            return extractor.getText();
        } finally {
            is.close();
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
I do hope that points you in the right direction, if not sorts you entirely.
You can use jWordConvert for this.
jWordConvert is a Java library that can read and render Word documents natively to convert to PDF, to convert to images, or to print the documents automatically.
Details can be found at the following link:
http://www.qoppa.com/wordconvert/
This GitHub repository might be helpful for that:
https://github.com/guptachunky/Conversion-Work
The code below is linked from:
https://github.com/guptachunky/Conversion-Work/blob/main/src/main/java/com/conversion/Conversion/Service/ConversionService.java
public void docToPdf(FileDetail fileDetail, HttpServletResponse response) {
    InputStream doc;
    try {
        File docFile = converterToFile(fileDetail.getFile());
        doc = new FileInputStream(docFile);
        XWPFDocument document = new XWPFDocument(doc);
        PdfOptions options = PdfOptions.create();
        File file = File.createTempFile("output", ".pdf");
        OutputStream out = new FileOutputStream(file);
        PdfConverter.getInstance().convert(document, out, options);
        getClaimFiles(file, response);
    } catch (IOException e) {
        response.setStatus(AppConstant.SOMETHING_WENT_WRONG);
    }
}

public void getClaimFiles(File file, HttpServletResponse response) {
    try {
        response.setContentType("application/pdf");
        response.setHeader("Content-Disposition",
            "attachment; filename=dummy.pdf");
        response.getOutputStream().write(Files.readAllBytes(file.toPath()));
    } catch (Exception e) {
        response.setStatus(AppConstant.SOMETHING_WENT_WRONG);
    }
}
I'm trying to get the summary information from a file with Java and I can't find anything. I tried org.apache.poi.hpsf.*.
I need Author, Subject, Comments, Keywords and Title.
File rep = new File("C:\\Cry_ReportERP006.rpt");
/* Read a test document doc into a POI filesystem. */
final POIFSFileSystem poifs = new POIFSFileSystem(new FileInputStream(rep));
final DirectoryEntry dir = poifs.getRoot();
DocumentEntry dsiEntry = null;
try {
    dsiEntry = (DocumentEntry) dir.getEntry(DocumentSummaryInformation.DEFAULT_STREAM_NAME);
} catch (FileNotFoundException ex) {
    /*
     * A missing document summary information stream is not an error
     * and therefore silently ignored here.
     */
}
/*
 * If there is a document summary information stream, read it from
 * the POI filesystem.
 */
if (dsiEntry != null) {
    final DocumentInputStream dis = new DocumentInputStream(dsiEntry);
    final PropertySet ps = new PropertySet(dis);
    final DocumentSummaryInformation dsi = new DocumentSummaryInformation(ps);
    final SummaryInformation si = new SummaryInformation(ps);
    /* Execute the get... methods. */
    System.out.println(si.getAuthor());
}
As explained in the POI overview at http://poi.apache.org/overview.html, POI provides different parsers for different file types.
The following example extracts the Author/Creator from Office 2003 files:
public static String parseOLE2FileAuthor(File file) {
    String author = null;
    try {
        FileInputStream stream = new FileInputStream(file);
        POIFSFileSystem poifs = new POIFSFileSystem(stream);
        DirectoryEntry dir = poifs.getRoot();
        // The author is stored in the SummaryInformation property stream
        DocumentEntry siEntry = (DocumentEntry) dir.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
        DocumentInputStream dis = new DocumentInputStream(siEntry);
        PropertySet ps = new PropertySet(dis);
        SummaryInformation si = new SummaryInformation(ps);
        author = si.getAuthor();
        stream.close();
    } catch (IOException ex) {
        ex.printStackTrace();
    } catch (NoPropertySetStreamException ex) {
        ex.printStackTrace();
    } catch (MarkUnsupportedException ex) {
        ex.printStackTrace();
    } catch (UnexpectedPropertySetTypeException ex) {
        ex.printStackTrace();
    }
    return author;
}
For docx, pptx and xlsx files, POI has specialized classes.
Example for a .docx file:
public static String parseDOCX(File file) {
    String author = null;
    FileInputStream stream;
    try {
        stream = new FileInputStream(file);
        XWPFDocument docx = new XWPFDocument(stream);
        CoreProperties props = docx.getProperties().getCoreProperties();
        author = props.getCreator();
        stream.close();
    } catch (FileNotFoundException ex) {
        ex.printStackTrace();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return author;
}
For PPTX use XMLSlideShow, and for XLSX use XSSFWorkbook, instead of XWPFDocument.
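Because docx, pptx and xlsx files are all ZIP packages, the creator can also be read without POI when only that one field is needed. The sketch below uses just the JDK. It assumes the core properties sit at the conventional part name docProps/core.xml, and all class and method names are invented for illustration; minimalPackage exists only to make the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: reads dc:creator from an OOXML package using only the JDK.
class OoxmlCoreProps {
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    // Assumption: the conventional location of the core properties part
    private static final String CORE_PART = "docProps/core.xml";

    /** Returns the dc:creator value, or null if the part or element is missing. */
    static String readCreator(InputStream ooxml) {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!CORE_PART.equals(entry.getName())) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DTDs so that untrusted files cannot trigger external entity loading
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                Document doc = factory.newDocumentBuilder().parse(zip);
                NodeList creators = doc.getElementsByTagNameNS(DC_NS, "creator");
                return creators.getLength() > 0 ? creators.item(0).getTextContent() : null;
            }
            return null;
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException("Could not read core properties", e);
        }
    }

    /** Builds a tiny ZIP holding only core.xml, for demonstration (creator is not XML-escaped). */
    static byte[] minimalPackage(String creator) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<cp:coreProperties"
            + " xmlns:cp=\"http://schemas.openxmlformats.org/package/2006/metadata/core-properties\""
            + " xmlns:dc=\"" + DC_NS + "\">"
            + "<dc:creator>" + creator + "</dc:creator>"
            + "</cp:coreProperties>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(CORE_PART));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sample = minimalPackage("Jane Doe");
        System.out.println(readCreator(new ByteArrayInputStream(sample)));
    }
}
```

For real files, pass a FileInputStream on the .docx, .pptx or .xlsx. The POI classes remain the better choice when more than the core properties are needed.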
Please find the sample code here: Apache POI how-to.
In brief, you can create a listener, MyPOIFSReaderListener:
SummaryInformation si = (SummaryInformation)
    PropertySetFactory.create(event.getStream());
String title = si.getTitle();
String Author = si.getLastAuthor();
......
and register it as:
POIFSReader r = new POIFSReader();
r.registerListener(new MyPOIFSReaderListener(),
    "\005SummaryInformation");
r.read(new FileInputStream(filename));
For Office 2003 files, you can use classes inherited from POIDocument. Here is an example for a doc file:
FileInputStream in = new FileInputStream(file);
HWPFDocument doc = new HWPFDocument(in);
author = doc.getSummaryInformation().getAuthor();
Likewise, use HSLFSlideShowImpl for ppt, HSSFWorkbook for xls, and HDGFDiagram for vsd.
The SummaryInformation class exposes many other pieces of file information.
For Office 2007 or later files, see the answer by Dragos Catalin Trieanu.