How to get file summary information with Java/Apache POI

I am trying to get the summary information from a file with Java and I can't find anything. I tried org.apache.poi.hpsf.*.
I need the Author, Subject, Comments, Keywords and Title.
File rep = new File("C:\\Cry_ReportERP006.rpt");
/* Read a test document into a POI filesystem. */
final POIFSFileSystem poifs = new POIFSFileSystem(new FileInputStream(rep));
final DirectoryEntry dir = poifs.getRoot();
DocumentEntry dsiEntry = null;
try
{
dsiEntry = (DocumentEntry) dir.getEntry(DocumentSummaryInformation.DEFAULT_STREAM_NAME);
}
catch (FileNotFoundException ex)
{
/*
* A missing document summary information stream is not an error
* and therefore silently ignored here.
*/
}
/*
* If there is a document summary information stream, read it from
* the POI filesystem.
*/
if (dsiEntry != null)
{
final DocumentInputStream dis = new DocumentInputStream(dsiEntry);
final PropertySet ps = new PropertySet(dis);
final DocumentSummaryInformation dsi = new DocumentSummaryInformation(ps);
final SummaryInformation si = new SummaryInformation(ps);
/* Execute the get... methods. */
System.out.println(si.getAuthor());
}

As explained in the POI overview at http://poi.apache.org/overview.html, POI provides different parsers for different file types.
The following example extracts the author/creator from Office 2003 (OLE2) files:
public static String parseOLE2FileAuthor(File file) {
String author=null;
try {
FileInputStream stream = new FileInputStream(file);
POIFSFileSystem poifs = new POIFSFileSystem(stream);
DirectoryEntry dir = poifs.getRoot();
DocumentEntry siEntry = (DocumentEntry)dir.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
DocumentInputStream dis = new DocumentInputStream(siEntry);
PropertySet ps = new PropertySet(dis);
SummaryInformation si = new SummaryInformation(ps);
author=si.getAuthor();
stream.close();
} catch (IOException ex) {
ex.printStackTrace();
} catch (NoPropertySetStreamException ex) {
ex.printStackTrace();
} catch (MarkUnsupportedException ex) {
ex.printStackTrace();
} catch (UnexpectedPropertySetTypeException ex) {
ex.printStackTrace();
}
return author;
}
For docx, pptx and xlsx files, POI has specialized classes.
Example for a .docx file:
public static String parseDOCX(File file){
String author=null;
FileInputStream stream;
try {
stream = new FileInputStream(file);
XWPFDocument docx = new XWPFDocument(stream);
CoreProperties props = docx.getProperties().getCoreProperties();
author=props.getCreator();
stream.close();
} catch (FileNotFoundException ex) {
ex.printStackTrace();
} catch (IOException ex) {
ex.printStackTrace();
}
return author;
}
For PPTX use XMLSlideShow, and for XLSX use XSSFWorkbook, instead of XWPFDocument.
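As an alternative sketch that does not use the POI classes above: .docx, .pptx and .xlsx files are zip packages, and their core properties are conventionally stored in the part docProps/core.xml. The example below reads that part with the JDK only. The class name CorePropertiesReader and the map keys are made up for illustration, the part location is an assumption based on that convention, and the mapping of the "Comments" field to dc:description should be verified against your own files.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class CorePropertiesReader {
    private static final String DC = "http://purl.org/dc/elements/1.1/";
    private static final String CP =
            "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";

    /** Reads author, title, subject, comments and keywords from docProps/core.xml, if present. */
    static Map<String, String> read(File ooxmlFile)
            throws IOException, ParserConfigurationException, SAXException {
        Map<String, String> props = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(ooxmlFile)) {
            // Assumption: the core properties part sits at its conventional location.
            ZipEntry entry = zip.getEntry("docProps/core.xml");
            if (entry == null) {
                return props;
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document doc = factory.newDocumentBuilder().parse(in);
                put(props, doc, "author", DC, "creator");
                put(props, doc, "title", DC, "title");
                put(props, doc, "subject", DC, "subject");
                put(props, doc, "comments", DC, "description");
                put(props, doc, "keywords", CP, "keywords");
            }
        }
        return props;
    }

    /** Stores the text of the first matching element; absent elements are skipped. */
    private static void put(Map<String, String> props, Document doc,
                            String key, String namespace, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(namespace, localName);
        if (nodes.getLength() > 0) {
            props.put(key, nodes.item(0).getTextContent());
        }
    }
}
```

Properties that are not set in the file are simply missing from the returned map, so callers should check for null.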

Sample code is available in the Apache POI how-to.
In brief, you can create a listener, MyPOIFSReaderListener:
SummaryInformation si = (SummaryInformation)
PropertySetFactory.create(event.getStream());
String title = si.getTitle();
String lastAuthor = si.getLastAuthor();
......
and register it as:
POIFSReader r = new POIFSReader();
r.registerListener(new MyPOIFSReaderListener(),
"\005SummaryInformation");
r.read(new FileInputStream(filename));

For Office 2003 files, you can use classes inherited from POIDocument. Here is an example for a .doc file:
FileInputStream in = new FileInputStream(file);
HWPFDocument doc = new HWPFDocument(in);
author = doc.getSummaryInformation().getAuthor();
Similarly, use HSLFSlideShowImpl for .ppt,
HSSFWorkbook for .xls, and
HDGFDiagram for .vsd.
The SummaryInformation class exposes many other file properties as well.
For Office 2007 or later files, see the answer by @Dragos Catalin Trieanu.

Related

File not found in Directory (Static PDF attachment)

My requirement is to attach a static PDF to an email in addition to a generic PDF. I can attach the generic PDF without any issues, but fetching the static PDF from the directory fails. I have tried several approaches; could you please assist?
The error and the related code are below.
Error: java.io.FileNotFoundException: /mnt/DGB/Correspondence/Systems/PROD_DOCS/How_to_access_member_information.pdf (No such file or directory)
Code:
try {
File pdfFile = new File("//mnt/DGB/Correspondence/Systems/PROD_DOCS/How_to_access_member_information.pdf");
byte[] bytesArray = new byte[(int) pdfFile.length()];
FileInputStream fis = new FileInputStream(pdfFile);
fis.read(bytesArray); //read file into bytes[]
fis.close();
String registerId = notificationEngineService.registerFileOnNe("application/pdf", "How_to_access_member_information.pdf", bytesArray);
System.out.println("registerId 1=============================== " + registerId);
notificationEngineService.sendRegisteredAttViaNe(registerId, emailBody, dispInfo);
} catch (Exception e) {
System.out.println("Exception 10============================================================");
e.printStackTrace();
}
try {
File pdfFile = new File("\\\\dcpcifs01\\DGB\\Correspondence\\Systems\\PROD_DOCS\\How_to_access_member_information.pdf");
byte[] bytesArray = new byte[(int) pdfFile.length()];
FileInputStream fis = new FileInputStream(pdfFile);
fis.read(bytesArray); //read file into bytes[]
fis.close();
String registerId = notificationEngineService.registerFileOnNe("application/pdf", "How_to_access_member_information.pdf", bytesArray);
System.out.println("registerId 2=============================== " + registerId);
notificationEngineService.sendRegisteredAttViaNe(registerId, emailBody, dispInfo);
} catch (Exception e) {
System.out.println("Exception 10============================================================");
e.printStackTrace();
}
} catch (Exception ex) {
ex.printStackTrace();
throw new GroupRiskSystemException(ExceptionCode.COMPASS_ERROR.name());
}
return "";
}
private void sendEmail(MbsMembers memberObject) {
try {
System.out.println(" ======================Start0=================================== ");
za.co.discoverygrouprisk.common.jaxb.email.AttachmentType attachmentType = new za.co.discoverygrouprisk.common.jaxb.email.AttachmentType();
attachmentType.setMember(new MemberType());
attachmentType.setCamundaProcessId("0");
attachmentType.setFileName("How_to_access_member_information.pdf");
attachmentType.setChildBusinessKey(0l);
attachmentType.setNeID(0l);
DGRMultiAttachmentEmailDetailV01 emailDetail = new DGRMultiAttachmentEmailDetailV01();
SchemeDataType schemeDataType = new SchemeDataType();
SchemeType schemeType = new SchemeType();
SchemeNumberType schemeNumberType = new SchemeNumberType();
schemeNumberType.setValue(01);
schemeType.setSchemeNumber(schemeNumberType);
schemeDataType.setScheme(schemeType);
emailDetail.setSchemeData(schemeDataType);
EmailDataType emailDataType = new EmailDataType();
EmailType emailType = new EmailType();
emailType.setSubject("How to access member information");
emailType.setFromAddress("groupinfo@discovery.co.za");
emailType.setToAddress(memberObject.getEmailAddress());
emailDataType.setEmail(emailType);
emailDetail.setEmailData(emailDataType);
AttachmentDataType attachmentDataType = new AttachmentDataType();
// attachmentDataType.setLocation("//mnt/DGB/Correspondence/Systems/PROD_DOCS/");
attachmentDataType.setLocation("\\\\dcpcifs01\\DGB\\Correspondence\\Systems\\PROD_DOCS\\");
//mnt/DGB/Correspondence/2020/QA/MEMBER_REQUIREMENT_LETTER
attachmentDataType.setParentBusinessKey(01);
attachmentDataType.getAttachment().add(attachmentType);
emailDetail.setAttachmentData(attachmentDataType);
EmailDataSource adHocDS = new AdHocEmailDataSource(emailDetail);
String emailBody = createEmailBody(memberObject);
StandardEmailTemplate template = new StandardEmailTemplate(emailBody);
Email email = new StandardEmail(adHocDS, template);
email.createEmail();
email.sendEmail();
System.out.println(" ======================End0=================================== ");
} catch (Exception e) {
System.out.println(" ======================Exception0=================================== ");
e.printStackTrace();
}
}

Use the following path:
File pdfFile = new File("/mnt/DGB/Correspondence/Systems/PROD_DOCS/How_to_access_member_information.pdf");
In addition:
Create a directory under the user's home directory, for example /home/user_name/java-pdf, and put the PDF there.
Then run the code below once to check whether your code is able to access the file:
File homedir = new File(System.getProperty("user.home"));
File pdfFile = new File(homedir, "java-pdf/How_to_access_member_information.pdf");
The above code runs fine for me.
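To narrow down why a path fails before a FileInputStream is opened, a small diagnostic along these lines may help. It is only a sketch: the class FileAccessCheck and its method names are hypothetical. It also reads the file with Files.readAllBytes, because a single InputStream.read(byte[]) call is not guaranteed to fill the whole array.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileAccessCheck {

    /** Describes whether the path can be read by the current process, without throwing. */
    static String describe(Path path) {
        Path absolute = path.toAbsolutePath();
        if (!Files.exists(path)) {
            return "missing: " + absolute;
        }
        if (Files.isDirectory(path)) {
            return "is a directory: " + absolute;
        }
        if (!Files.isReadable(path)) {
            return "not readable by this user: " + absolute;
        }
        return "readable: " + absolute;
    }

    /** Reads the complete file into memory. */
    static byte[] readFully(Path path) throws IOException {
        return Files.readAllBytes(path);
    }
}
```

Running describe(...) under the same user account as the application server shows whether the mount point is visible to that process at all.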

Unzip *.docx file in memory without write to disk - Java

I want to unzip a *.docx file in memory without writing the output to disk. I found the following implementation, but it only lets me read the compressed files, not see the directory structure. It is important for me to know the location of each file in the directory tree. Can somebody point me in the right direction?
private static void UnzipFileInMemory() {
try {
ZipFile zf = new ZipFile("d:\\a.docx");
int i = 0;
for (Enumeration e = zf.entries(); e.hasMoreElements();) {
InputStream in = null;
try {
ZipEntry entry = (ZipEntry) e.nextElement();
System.out.println(entry);
in = zf.getInputStream(entry);
} catch (IOException ex) {
//Logger.getLogger(Tester.class.getName()).log(Level.SEVERE, null, ex);
} finally {
try {
in.close();
} catch (IOException ex) {
//Logger.getLogger(Tester.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
} catch (IOException ex) {
//Logger.getLogger(Tester.class.getName()).log(Level.SEVERE, null, ex);
}
}

Use ZipInputStream: zEntry.getName() in this example gives you the file's location inside the archive.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
public class unzip {
public static void main(String[] args) {
String filePath = "D:/Tmp/Tmp.zip";
String oPath = "D:/Tmp/";
new unzip().unzipFile(filePath, oPath);
}
public void unzipFile(String filePath, String oPath) {
FileInputStream fis = null;
ZipInputStream zipIs = null;
ZipEntry zEntry = null;
try {
fis = new FileInputStream(filePath);
zipIs = new ZipInputStream(new BufferedInputStream(fis));
while ((zEntry = zipIs.getNextEntry()) != null) {
try {
FileOutputStream fos = null;
String opFilePath = oPath + zEntry.getName();
fos = new FileOutputStream(opFilePath);
System.out.println(zEntry.getName());
fos.flush();
fos.close();
} catch (Exception ex) {
}
}
zipIs.close();
fis.close();
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
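Since the question asks for no disk output, here is a sketch of the same ZipInputStream loop that keeps everything in memory instead of opening a FileOutputStream. The class name InMemoryUnzip is made up for illustration. Each map key is the entry's full path inside the archive, which preserves the directory structure.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class InMemoryUnzip {

    /** Reads every file entry of a zip stream into memory, keyed by its path inside the archive. */
    static Map<String, byte[]> unzip(InputStream zipped) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(zipped)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directory entries carry no data
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zin.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                // getName() is the path inside the archive, e.g. "word/document.xml"
                files.put(entry.getName(), out.toByteArray());
            }
        }
        return files;
    }
}
```

For a .docx, calling unzip(new FileInputStream("d:\\a.docx")) would give keys such as word/document.xml, from which the tree can be rebuilt by splitting on "/".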

You can treat the zip-format file as a virtual file system (FileSystem). Java already has a protocol handler for that, jar:file://..., so you have to prepend "jar:" to the URI returned by File.toURI().
URI docxUri = ,,, // "jar:file:/C:/... .docx"
Map<String, String> zipProperties = new HashMap<>();
zipProperties.put("encoding", "UTF-8");
try (FileSystem zipFS = FileSystems.newFileSystem(docxUri, zipProperties)) {
Path documentXmlPath = zipFS.getPath("/word/document.xml");
Now you can use Files.delete() or Files.copy() between the real disk file system and the zip.
When using XML:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(Files.newInputStream(documentXmlPath));
//Element root = doc.getDocumentElement();
You can then use XPath to find the relevant places, and write the XML back again.
You might not even need XML parsing and could simply replace placeholders:
byte[] content = Files.readAllBytes(documentXmlPath);
String xml = new String(content, StandardCharsets.UTF_8);
xml = xml.replace("#DATE#", "2014-09-24");
xml = xml.replace("#NAME#", StringEscapeUtils.escapeXml("Sniper"));
...
content = xml.getBytes(StandardCharsets.UTF_8);
Files.delete(documentXmlPath);
Files.write(documentXmlPath, content);
For fast development, rename a copy of the .docx to a name with the .zip file extension and inspect the files.
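To see the directory tree with this approach, the zip file system can be walked like any other. The following is a minimal sketch, assuming the file is a valid zip; the class name DocxTree is hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

final class DocxTree {

    /** Lists every regular file inside a zip-format file with its full internal path, sorted. */
    static List<String> listFiles(Path zipFile) throws IOException {
        // Prepend "jar:" to the file URI, as described above.
        URI uri = URI.create("jar:" + zipFile.toUri());
        List<String> result = new ArrayList<>();
        try (FileSystem zipFs = FileSystems.newFileSystem(uri, Collections.<String, String>emptyMap());
             Stream<Path> paths = Files.walk(zipFs.getPath("/"))) {
            paths.filter(Files::isRegularFile)
                 .forEach(p -> result.add(p.toString()));
        }
        Collections.sort(result);
        return result;
    }
}
```

Nothing is extracted to disk here; the paths are resolved inside the archive, and Files.readAllBytes(...) on one of them returns that entry's content.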

Simply add a file check within your loop:
if (!entry.isDirectory()) // Alternatively: if(entry.getName().contains("."))
System.out.println(entry);

Checking if file exists, if so, dont create new file and append instead

private void saveFormActionPerformed(java.awt.event.ActionEvent evt) {
name = nameFormText.getText();
surname = surnameFormText.getText();
age = Integer.parseInt(ageFormText.getText());
stadium = stadiumFormText.getText();
Venues fix = new Venues();
fix.setName(name);
fix.setSurname(surname);
fix.setAge(age);
fix.setStadium(stadium);
File outFile;
FileOutputStream fStream;
ObjectOutputStream oStream;
try {
outFile = new File("output.data");
fStream = new FileOutputStream(outFile);
oStream = new ObjectOutputStream(fStream);
oStream.writeObject(fix);
JOptionPane.showMessageDialog(null, "File written successfully");
oStream.close();
} catch (IOException e) {
System.out.println(e);
}
}
This is what I have so far. Any ideas on what I could do with it to append the file if it's already created?

First check whether the file exists; if it does not, create a new one. To learn how to append objects to an ObjectOutputStream, take a look at this question.
File outFile = new File("output.data");
FileOutputStream fStream;
ObjectOutputStream oStream;
try {
if(!outFile.exists()) outFile.createNewFile();
fStream = new FileOutputStream(outFile, true); // true = append instead of truncating the existing file
oStream = new ObjectOutputStream(fStream);
oStream.writeObject(fix);
JOptionPane.showMessageDialog(null, "File written successfully");
oStream.close();
} catch (IOException e) {
System.out.println(e);
}

Using Java 7, it is simple:
final Path path = Paths.get("output.data");
try (
final OutputStream out = Files.newOutputStream(path, StandardOpenOption.CREATE,
StandardOpenOption.APPEND);
final ObjectOutputStream objOut = new ObjectOutputStream(out);
) {
// work here
} catch (IOException e) {
// handle exception here
}
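One caveat when appending, sketched here under assumptions: every new ObjectOutputStream writes a stream header, so a file that several append-mode runs have written to cannot be read back past the first object with a single ObjectInputStream (it fails with StreamCorruptedException). A common workaround is a subclass whose writeStreamHeader() calls reset() instead, used only when the file already has content. The class names ObjectFileAppender and NoHeaderObjectOutputStream are hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

final class ObjectFileAppender {

    /** ObjectOutputStream that continues an existing stream instead of starting a new one. */
    private static final class NoHeaderObjectOutputStream extends ObjectOutputStream {
        NoHeaderObjectOutputStream(OutputStream out) throws IOException {
            super(out);
        }

        @Override
        protected void writeStreamHeader() throws IOException {
            // The file already begins with a header; write a reset marker instead of a second one.
            reset();
        }
    }

    /** Appends one serializable object, creating the file if it does not exist yet. */
    static void append(Path file, Object value) throws IOException {
        boolean hasContent = Files.exists(file) && Files.size(file) > 0;
        try (OutputStream out = Files.newOutputStream(file,
                     StandardOpenOption.CREATE, StandardOpenOption.APPEND);
             ObjectOutputStream objOut = hasContent
                     ? new NoHeaderObjectOutputStream(out)
                     : new ObjectOutputStream(out)) {
            objOut.writeObject(value);
        }
    }

    /** Reads all objects back in the order they were appended. */
    static List<Object> readAll(Path file) throws IOException, ClassNotFoundException {
        List<Object> values = new ArrayList<>();
        try (InputStream in = Files.newInputStream(file);
             ObjectInputStream objIn = new ObjectInputStream(in)) {
            while (true) {
                values.add(objIn.readObject());
            }
        } catch (EOFException endOfFile) {
            // readObject() signals the end of the file with EOFException
        }
        return values;
    }
}
```

In the form handler from the question, the write would then become ObjectFileAppender.append(Paths.get("output.data"), fix), provided Venues implements Serializable.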

how to update metadata of docx file using apache poi in java?

I can't figure out how to update the metadata (title, subject, author, etc.) of a docx file using Apache POI.
I have tried it for a doc file using Apache POI:
File poiFilesystem = new File(file_path1);
/* Open the POI filesystem. */
InputStream is = new FileInputStream(poiFilesystem);
POIFSFileSystem poifs = new POIFSFileSystem(is);
is.close();
/* Read the summary information. */
DirectoryEntry dir = poifs.getRoot();
SummaryInformation si;
try
{
DocumentEntry siEntry = (DocumentEntry)
dir.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
DocumentInputStream dis = new DocumentInputStream(siEntry);
PropertySet ps = new PropertySet(dis);
dis.close();
si = new SummaryInformation(ps);
}
catch (FileNotFoundException ex)
{
/* There is no summary information yet. We have to create a new
* one. */
si = PropertySetFactory.newSummaryInformation();
}
si.setAuthor("xzy");
System.out.println("Author changed to " + si.getAuthor() + ".");
si.setSubject("mysubject");
si.setTitle("mytitle");

The code below works with POI 3.10. You can set some metadata with PackageProperties:
import java.util.Date;
import org.apache.poi.openxml4j.opc.*;
import org.apache.poi.openxml4j.util.Nullable;
class SetDOCXMetadata{
public static void main(String[] args){
try{
OPCPackage opc = OPCPackage.open("metadata.docx");
PackageProperties pp = opc.getPackageProperties();
Nullable<String> foo = pp.getLastModifiedByProperty();
System.out.println(foo.hasValue()?foo.getValue():"empty");
//Set some properties
pp.setCreatorProperty("M Kazarian");
pp.setLastModifiedByProperty("M Kazarian " + System.currentTimeMillis());
pp.setModifiedProperty(new Nullable<Date>(new Date()));
pp.setTitleProperty("M Kazarian document");
opc.close();
} catch (Exception e) {
e.printStackTrace(); // report failures instead of silently swallowing them
}
}
}

Read embedded pdf file in excel using Java

I am new to Java programming. My current project requires me to read embedded (OLE) files in an Excel sheet and get their text contents. The examples for reading an embedded Word file worked fine, but I am unable to find help on reading an embedded PDF file. I tried a few things based on similar examples, which didn't work out.
http://poi.apache.org/spreadsheet/quick-guide.html#Embedded
My code is below; with some help I can probably get going in the right direction. I have used Apache POI to read the embedded files in Excel and PDFBox to parse the PDF data.
public class ReadExcel1 {
public static void main(String[] args) {
try {
FileInputStream file = new FileInputStream(new File("C:\\test.xls"));
POIFSFileSystem fs = new POIFSFileSystem(file);
HSSFWorkbook workbook = new HSSFWorkbook(fs);
for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
String oleName = obj.getOLE2ClassName();
if(oleName.equals("Acrobat Document")){
System.out.println("Acrobat reader document");
try{
DirectoryNode dn = (DirectoryNode) obj.getDirectory();
for (Iterator<Entry> entries = dn.getEntries(); entries.hasNext();) {
DocumentEntry nativeEntry = (DocumentEntry) dn.getEntry("CONTENTS");
byte[] data = new byte[nativeEntry.getSize()];
ByteArrayInputStream bao= new ByteArrayInputStream(data);
PDFParser pdfparser = new PDFParser(bao);
pdfparser.parse();
COSDocument cosDoc = pdfparser.getDocument();
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(2);
System.out.println("Text from the pdf "+pdfStripper.getText(pdDoc));
}
}catch(Exception e){
System.out.println("Error reading "+ e.getMessage());
}finally{
System.out.println("Finally ");
}
}else{
System.out.println("nothing ");
}
}
file.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Below is the output in Eclipse:
Acrobat reader document
Error reading Error: End-of-File, expected line
Finally
nothing

The PDFs weren't OLE 1.0 packaged, but embedded in some other way; at least the extraction worked for me.
This is not a general solution, because it depends on how the embedding application names the entries. For PDFs you could of course check all DocumentNodes for the magic number "%PDF"; for OLE 1.0 packaged elements this needs to be done differently.
I think the real filename of the PDF is hidden somewhere in the \1Ole or CompObj entries, but for the example, and apparently for your use case, it isn't necessary to determine it.
import java.io.*;
import java.net.URL;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.util.IOUtils;
public class EmbeddedPdfInExcel {
public static void main(String[] args) throws Exception {
NPOIFSFileSystem fs = new NPOIFSFileSystem(new URL("http://jamesshaji.com/sample.xls").openStream());
HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
for (HSSFObjectData obj : wb.getAllEmbeddedObjects()) {
String oleName = obj.getOLE2ClassName();
DirectoryNode dn = (DirectoryNode)obj.getDirectory();
if(oleName.contains("Acro") && dn.hasEntry("CONTENTS")){
InputStream is = dn.createDocumentInputStream("CONTENTS");
FileOutputStream fos = new FileOutputStream(obj.getDirectory().getName()+".pdf");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
fs.close();
}
}
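The magic-number check mentioned above can be sketched without any POI calls, assuming an entry's bytes have already been read into an array. The class name PdfMagic is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class PdfMagic {
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data begins with the PDF header bytes "%PDF". */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

This only tests the first four bytes, so it would miss a PDF that is wrapped in another container with leading bytes before the header, which is the OLE 1.0 packaged case described above.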
