I want to open a password-protected docx file using Apache POI. Can anyone help me with the complete code, please? I can't get it to work with the code below, which throws this exception:
Exception in thread "main" org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:126)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:113)
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:301)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:413)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:394)
POIFSFileSystem fs=new POIFSFileSystem(new FileInputStream("D:/abc.docx"));
EncryptionInfo info=new EncryptionInfo(fs);
Decryptor decryptor=Decryptor.getInstance(info);
if(!decryptor.verifyPassword("user"))
{
throw new RuntimeException("document is encrypted");
}
InputStream in=decryptor.getDataStream(fs);
HSSFWorkbook wb=new HSSFWorkbook(in);
File f=new File("D:/abc5.docx");
wb.write(f);
The basic code for decrypting the XML-based formats of Microsoft Office is shown in XML-based formats - Decryption.
But of course one must know that *.docx, which is a Word file in Office Open XML format, cannot be an HSSFWorkbook, which would be an Excel workbook in binary BIFF file format. Instead it must be an XWPFDocument.
So:
import java.io.InputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.crypt.EncryptionInfo;
import org.apache.poi.poifs.crypt.Decryptor;
import java.security.GeneralSecurityException;
public class ReadEncryptedXWPF {
static XWPFDocument decryptdocx(POIFSFileSystem filesystem, String password) throws Exception {
EncryptionInfo info = new EncryptionInfo(filesystem);
Decryptor d = Decryptor.getInstance(info);
try {
if (!d.verifyPassword(password)) {
throw new RuntimeException("Unable to process: document is encrypted");
}
InputStream dataStream = d.getDataStream(filesystem);
return new XWPFDocument(dataStream);
} catch (GeneralSecurityException ex) {
throw new RuntimeException("Unable to process encrypted document", ex);
}
}
public static void main(String[] args) throws Exception {
POIFSFileSystem filesystem = new POIFSFileSystem(new FileInputStream("abc.docx"));
XWPFDocument document = decryptdocx(filesystem, "user");
XWPFWordExtractor extractor = new XWPFWordExtractor(document);
System.out.println(extractor.getText());
extractor.close();
}
}
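To see why the original attempt failed, it can help to check which container a file actually uses. An encrypted *.docx is stored in an OLE2 compound file (which is why POIFSFileSystem can open it), while the decrypted payload is a ZIP package that HSSFWorkbook cannot read. The following is only a sketch using the JDK alone; the class name ContainerSniffer and the method detect are made up for illustration and are not part of Apache POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class ContainerSniffer {
    // Signature of an OLE2 compound file (binary .doc/.xls and encrypted OOXML packages)
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Signature of a ZIP local file header (unencrypted .docx/.xlsx)
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    // Hypothetical helper: classifies a file by its first bytes
    static String detect(byte[] header) {
        if (startsWith(header, OLE2)) return "OLE2";
        if (startsWith(header, ZIP)) return "ZIP";
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ContainerSniffer <file>");
            return;
        }
        // Only the first 8 bytes are needed to tell the two containers apart
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            System.out.println(detect(in.readNBytes(8)));
        }
    }
}
```

If the encrypted file reports OLE2, the stream returned by getDataStream should then be handled as an Office Open XML package, for example by XWPFDocument for Word files.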
I have solved this. The code is below
POIFSFileSystem fs=new POIFSFileSystem(new FileInputStream("D:/abc.docx"));
EncryptionInfo info=new EncryptionInfo(fs);
Decryptor decryptor=Decryptor.getInstance(info);
XWPFDocument document=null;
if(decryptor.verifyPassword("password"))
{
InputStream dataStream = decryptor.getDataStream(fs);
document = new XWPFDocument(dataStream);
}else{
throw new Exception("file is protected with password...please open with right password");
}
File f=new File("D:/abc.docx");
FileOutputStream fos = new FileOutputStream(f);
document.write(fos);
document.close();
I'm trying to apply encryption to a binary xls file with Apache POI.
While I can successfully encrypt xml based files, if I encrypt a binary one I can't open it and I get the wrong password error.
This is my code:
@Test
public void testEncryption() throws Exception {
File file = new File("file.xls");
Workbook workbook = new HSSFWorkbook();
setData(workbook);
FileOutputStream fileOutputStream = new FileOutputStream(file);
workbook.write(fileOutputStream);
fileOutputStream.close();
encryptFile(file, "pass");
}
public void encryptFile(File file, String encryptKey) throws Exception {
FileInputStream fileInput = new FileInputStream(file.getPath());
BufferedInputStream bufferInput = new BufferedInputStream(fileInput);
POIFSFileSystem poiFileSystem = new POIFSFileSystem(bufferInput);
Biff8EncryptionKey.setCurrentUserPassword(encryptKey);
HSSFWorkbook workbook = new HSSFWorkbook(poiFileSystem, true);
FileOutputStream fileOut = new FileOutputStream(file.getPath());
workbook.writeProtectWorkbook(Biff8EncryptionKey.getCurrentUserPassword(), "");
workbook.write(fileOut);
bufferInput.close();
fileOut.close();
Biff8EncryptionKey.setCurrentUserPassword(null);
}
To encrypt HSSF, simply call Biff8EncryptionKey.setCurrentUserPassword before writing the HSSFWorkbook.
The simplest example looks like this:
import java.io.FileOutputStream;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
public class EncryptHSSF {
static void setData(HSSFWorkbook workbook) {
Sheet sheet = workbook.createSheet();
Row row = sheet.createRow(0);
Cell cell = row.createCell(0);
cell.setCellValue("Test");
}
public static void main(String[] args) throws Exception {
try (HSSFWorkbook workbook = new HSSFWorkbook();
FileOutputStream out = new FileOutputStream("file.xls") ) {
setData(workbook);
Biff8EncryptionKey.setCurrentUserPassword("pass");
workbook.write(out);
}
}
}
This works for me. If I open file.xls using Excel, it asks me for a password, and if I type pass there, the workbook opens.
If the aim is to encrypt an existing unencrypted workbook, then it would look like this:
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
public class EncryptHSSFExistingFile {
public static void main(String[] args) throws Exception {
try (HSSFWorkbook workbook = new HSSFWorkbook(new FileInputStream("Unencrypted.xls"));
FileOutputStream out = new FileOutputStream("Encrypted.xls") ) {
Biff8EncryptionKey.setCurrentUserPassword("pass");
workbook.write(out);
}
}
}
So it is the same solution: simply call Biff8EncryptionKey.setCurrentUserPassword before writing the HSSFWorkbook.
This code only creates encrypted workbooks that Microsoft Excel can open. LibreOffice Calc is not able to open those files, so the encryption method used seems to be unknown to LibreOffice Calc. I have not found a way to change the encryption method for HSSF, so HSSF encryption does not seem to be fully supported in Apache POI. Apache POI - Encryption support does not show an example either. So you are able to open and rewrite an encrypted HSSFWorkbook. The newly written workbook is then encrypted too, as long as Biff8EncryptionKey.setCurrentUserPassword has not been set to null. But you cannot write an encrypted HSSFWorkbook from scratch.
While trying to write into an Excel file, I get a NullPointerException from the line sheet.createRow(1).createCell(5).setCellValue("Pass");
I don't understand why this error occurs :(
package com.qtpselenium.Test;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import com.qtpselenium.util.Xls_Reader;
public class ReturnTestCaseResult {
public static void main(String[] args) {
String path =System.getProperty("user.dir") + "\\src\\com\\qtpselenium\\xls\\suiteA.xlsx";
/* Xls_Reader xlsr = new Xls_Reader(System.getProperty("user.dir") + "\\src\\com\\qtpselenium\\xls\\suiteA.xlsx");
ReportDataSetResult(xlsr, "TestCaseA1", 3, "Pass" , path);*/
ReportDataSetResult("TestCaseA1", path);
}
public static void ReportDataSetResult( String TestCaseName , String path){
System.out.println(TestCaseName +"----"+ path);
try {
FileInputStream fileinp = new FileInputStream(path);
XSSFWorkbook workbook = new XSSFWorkbook();
XSSFSheet sheet = workbook.getSheet(TestCaseName);
sheet.createRow(1).createCell(5).setCellValue("Pass");
FileOutputStream fileout = new FileOutputStream(path);
workbook.write(fileout);
fileout.close();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
You have used the no-arg constructor to create the workbook:
XSSFWorkbook workbook = new XSSFWorkbook();
which means there are no sheets in the workbook. This means your sheet variable will be null. I think you want to pass the FileInputStream fileinp into the workbook constructor so that it reads from an existing file?
XSSFWorkbook workbook = new XSSFWorkbook(fileinp);
Otherwise you will need to create a sheet called TestCaseName in the workbook before you can start adding rows to it.
Maybe row(1) is null; you can try creating it first.
I need to be able to parse the text contained in an online file at a given URL, e.g. http://website.com/document.pdf.
I am making a search engine which can tell me whether the searched word is in some file online and retrieve the file's URL, so I don't need to download the file, just read it.
I was looking for a way and found something with InputStream and openConnection, but didn't manage to actually do it.
I am using jsoup to crawl a website and retrieve the URLs, and I tried to parse the file with a Jsoup method, but it does not work.
So what is the best way to do this?
EDIT:
I want to be able to do something like this:
File in = new File("http://website.com/document.pdf");
Document doc = Jsoup.parse(in, "UTF-8");
System.out.println(doc.toString());
You can use a URL instead of a File to access the remote document. Using Apache Tika, you should be able to grab the content as a string this way:
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
public class URLReader {
public static void main(String[] args) throws Exception {
URL url = new URL("http://website.com/document.pdf");
ContentHandler contenthandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser pdfparser = new PDFParser();
// Open the remote document as a stream; nothing is saved to disk
try (InputStream is = url.openStream()) {
pdfparser.parse(is, contenthandler, metadata, new ParseContext());
}
System.out.println(contenthandler.toString());
}
}
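The underlying point is that java.io.File only addresses local paths, whereas a URL can be opened as an InputStream and read without writing anything to disk. A minimal JDK-only sketch of that idea follows; the class name UrlBytes and its methods are hypothetical helpers, and the address in main is simply the placeholder from the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;

class UrlBytes {
    // Hypothetical helper: drains a stream into memory and closes it
    static byte[] readAll(InputStream in) {
        try (InputStream stream = in) {
            return stream.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: opens the address and reads the whole body into memory
    static byte[] fetch(String address) {
        try {
            return readAll(URI.create(address).toURL().openStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder address taken from the question
        byte[] body = fetch("http://website.com/document.pdf");
        System.out.println("Read " + body.length + " bytes without creating a file");
    }
}
```

The resulting bytes (or the stream itself) can then be handed to whichever parser you use. Note that the raw bytes of a PDF are usually compressed, so searching them directly for a word will generally not work.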
You can use this code to first download the PDF file and then read its text with Apache PDFBox. You need to add some JARs manually.
You also need to set the path of your local PDF file, which is "download.pdf" by default.
import com.gnostice.pdfone.PdfDocument;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.ConnectException;
import java.net.URL;
import java.net.URLConnection;
public class LoadDocumentFromURL {
public static void main(String[] args) throws IOException {
URL url1 = new URL("https://arxiv.org/pdf/1811.06933.pdf");
byte[] ba1 = new byte[1024];
int baLength;
FileOutputStream fos1 = new FileOutputStream("download.pdf");
try {
// Contacting the URL
// System.out.print("Connecting to " + url1.toString() + " ... ");
URLConnection urlConn = url1.openConnection();
// Checking whether the URL contains a PDF
if (!urlConn.getContentType().equalsIgnoreCase("application/pdf")) {
System.out.println("FAILED.\n[Sorry. This is not a PDF.]");
} else {
try {
// Read the PDF from the URL and save to a local file
InputStream is1 = url1.openStream();
while ((baLength = is1.read(ba1)) != -1) {
fos1.write(ba1, 0, baLength);
}
fos1.flush();
fos1.close();
is1.close();
// Load the PDF document and display its page count
//System.out.print("DONE.\nProcessing the PDF ... ");
PdfDocument doc = new PdfDocument();
try {
doc.load("download.pdf");
// System.out.println("DONE.\nNumber of pages in the PDF is " + doc.getPageCount());
// System.out.println(doc.getAuthor());
// System.out.println(doc.getKeywords());
// System.out.println(doc.toString());
doc.close();
} catch (Exception e) {
System.out.println("FAILED.\n[" + e.getMessage() + "]");
}
} catch (ConnectException ce) {
//System.out.println("FAILED.\n[" + ce.getMessage() + "]\n");
}
}
} catch (NullPointerException npe) {
//System.out.println("FAILED.\n[" + npe.getMessage() + "]\n");
}
File file = new File("download.pdf"); // the local copy written above
PDDocument document = PDDocument.load(file);
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
System.out.println(text);
document.close();
}
}
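The download step above can probably be shortened with try-with-resources and Files.copy. The sketch below uses only the JDK and is not the code from the answer: PdfDownload, isPdfContentType and download are names invented here. It also handles a missing Content-Type header, or one that carries parameters, without relying on a NullPointerException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class PdfDownload {
    // Hypothetical helper: true for "application/pdf", with or without parameters
    static boolean isPdfContentType(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        String mediaType = contentType.split(";", 2)[0].trim();
        return mediaType.equalsIgnoreCase("application/pdf");
    }

    // Hypothetical helper: saves the response body to target if it is declared as a PDF
    static void download(String address, Path target) throws IOException {
        URLConnection connection = URI.create(address).toURL().openConnection();
        if (!isPdfContentType(connection.getContentType())) {
            throw new IOException("Not a PDF: " + connection.getContentType());
        }
        // Reuse the same connection instead of opening the URL a second time
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        download("https://arxiv.org/pdf/1811.06933.pdf", Path.of("download.pdf"));
        System.out.println("Saved download.pdf");
    }
}
```

After this, download.pdf can be passed to PDFBox exactly as in the last lines of the answer above.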
I am new to Java programming. My current project requires me to read embedded (OLE) files in an Excel sheet and get their text contents. The examples for reading an embedded Word file worked fine; however, I am unable to find help on reading an embedded PDF file. I tried a few things based on similar examples, but they didn't work out.
http://poi.apache.org/spreadsheet/quick-guide.html#Embedded
My code is below; with some help I can probably get it going in the right direction. I have used Apache POI to read the embedded files in Excel and PDFBox to parse the PDF data.
public class ReadExcel1 {
public static void main(String[] args) {
try {
FileInputStream file = new FileInputStream(new File("C:\\test.xls"));
POIFSFileSystem fs = new POIFSFileSystem(file);
HSSFWorkbook workbook = new HSSFWorkbook(fs);
for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
String oleName = obj.getOLE2ClassName();
if(oleName.equals("Acrobat Document")){
System.out.println("Acrobat reader document");
try{
DirectoryNode dn = (DirectoryNode) obj.getDirectory();
for (Iterator<Entry> entries = dn.getEntries(); entries.hasNext();) {
DocumentEntry nativeEntry = (DocumentEntry) dn.getEntry("CONTENTS");
byte[] data = new byte[nativeEntry.getSize()];
ByteArrayInputStream bao= new ByteArrayInputStream(data);
PDFParser pdfparser = new PDFParser(bao);
pdfparser.parse();
COSDocument cosDoc = pdfparser.getDocument();
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(2);
System.out.println("Text from the pdf "+pdfStripper.getText(pdDoc));
}
}catch(Exception e){
System.out.println("Error reading "+ e.getMessage());
}finally{
System.out.println("Finally ");
}
}else{
System.out.println("nothing ");
}
}
file.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Below is the output in Eclipse:
Acrobat reader document
Error reading Error: End-of-File, expected line
Finally
nothing
The PDFs weren't OLE 1.0 packaged, but embedded in some other way. At least the extraction worked for me.
This is not a general solution, because it depends on how the embedding application names the entries. Of course, for PDFs you could check all DocumentNode-s for the magic number "%PDF", and for OLE 1.0 packaged elements this needs to be done differently.
I think the real filename of the PDF is hidden somewhere in the \1Ole or CompObj entries, but for the example, and apparently for your use case, it's not necessary to determine it.
import java.io.*;
import java.net.URL;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.util.IOUtils;
public class EmbeddedPdfInExcel {
public static void main(String[] args) throws Exception {
NPOIFSFileSystem fs = new NPOIFSFileSystem(new URL("http://jamesshaji.com/sample.xls").openStream());
HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
for (HSSFObjectData obj : wb.getAllEmbeddedObjects()) {
String oleName = obj.getOLE2ClassName();
DirectoryNode dn = (DirectoryNode)obj.getDirectory();
if(oleName.contains("Acro") && dn.hasEntry("CONTENTS")){
InputStream is = dn.createDocumentInputStream("CONTENTS");
FileOutputStream fos = new FileOutputStream(obj.getDirectory().getName()+".pdf");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
fs.close();
}
}
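The "%PDF" magic-number check mentioned above could look roughly like the sketch below. It uses only the JDK; the class PdfMagic and the method looksLikePdf are made-up names. It also illustrates a likely reason for the "End-of-File" error in the question: there the byte array is allocated with the entry's size but, as far as the posted code shows, never filled from the entry, so the parser would see only zero bytes.

```java
import java.nio.charset.StandardCharsets;

class PdfMagic {
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    // Hypothetical helper: true if the data starts with the PDF magic number
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Allocated but never filled: contains only zeros
        byte[] allocatedOnly = new byte[16];
        // Stand-in for bytes that were actually read from the CONTENTS entry
        byte[] actuallyRead = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikePdf(allocatedOnly)); // prints false
        System.out.println(looksLikePdf(actuallyRead));  // prints true
    }
}
```

In the POI code, the bytes to check would come from reading the stream returned by createDocumentInputStream rather than from a freshly allocated array.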
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
public class GeneratePDF {
public static void main(String[] args) {
try {
String k = "<html><body> This is my Project </body></html>";
OutputStream file = new FileOutputStream(new File("E:\\Test.pdf"));
Document document = new Document();
PdfWriter.getInstance(document, file);
document.open();
document.add(new Paragraph(k));
document.close();
file.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
This is my code to convert HTML to PDF. The conversion runs, but the PDF file contains the whole HTML markup, while I need to display only the text. <html><body> This is my Project </body></html> gets saved to the PDF, while it should save only This is my Project.
You can do it with the HTMLWorker class (deprecated) like this:
import java.io.StringReader;
import com.itextpdf.text.html.simpleparser.HTMLWorker;
//...
try {
String k = "<html><body> This is my Project </body></html>";
OutputStream file = new FileOutputStream(new File("C:\\Test.pdf"));
Document document = new Document();
PdfWriter.getInstance(document, file);
document.open();
HTMLWorker htmlWorker = new HTMLWorker(document);
htmlWorker.parse(new StringReader(k));
document.close();
file.close();
} catch (Exception e) {
e.printStackTrace();
}
Or use XMLWorker (available as a separate jar download) with this code:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import com.itextpdf.tool.xml.XMLWorkerHelper;
//...
try {
String k = "<html><body> This is my Project </body></html>";
OutputStream file = new FileOutputStream(new File("C:\\Test.pdf"));
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, file);
document.open();
InputStream is = new ByteArrayInputStream(k.getBytes());
XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
document.close();
file.close();
} catch (Exception e) {
e.printStackTrace();
}
These links might be helpful for the conversion:
https://code.google.com/p/flying-saucer/
https://today.java.net/pub/a/today/2007/06/26/generating-pdfs-with-flying-saucer-and-itext.html
If it is a college project, you can also consider this:
http://pd4ml.com/examples.htm
It includes an example of converting HTML to PDF.