Word found unreadable content in .docx after replacing content through docx4j


I get the error "Word found unreadable content" when opening a .docx after replacing content in it with docx4j.
Here is the code snippet.
I am using the docx4j-6.1.2 jar.
public class Testt {
public static void main(String[] args) throws Exception {
final String TEMPLATE_NAME = "D://fileuploadtemp//123.docx";
InputStream templateInputStream = new FileInputStream(TEMPLATE_NAME);
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(templateInputStream);
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
String xpath = "//w:r[w:t[contains(text(),'TEST')]]";
List<Object> list = documentPart.getJAXBNodesViaXPath(xpath, true);
for (Object obj : list) {
org.docx4j.wml.ObjectFactory factory = new org.docx4j.wml.ObjectFactory();
org.docx4j.wml.Text t = factory.createText();
t.setValue("\r\n");
((R) obj).getContent().clear();
((R) obj).getContent().add(t);
}
OutputStream os = new FileOutputStream(new File("D://fileuploadtemp//1234.docx"));
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
wordMLPackage.save(outputStream);
outputStream.writeTo(os);
os.close();
outputStream.close();
templateInputStream.close();
}
}

Related

Using Apache POI to extract excel text type attachment encode issue

I am now using Apache POI to extract the attachments from the Excel file, here is part of my code.
Sheet sheetAt = workbook.getSheet(sheetName);
Drawing<?> drawingPatriarch = sheetAt.getDrawingPatriarch();
if (drawingPatriarch != null) {
Iterator<?> iterator = drawingPatriarch.iterator();
if (iterator.hasNext()) {
Object next = iterator.next();
if (next instanceof ObjectData) {
ObjectData o = (ObjectData) next;
IOUtil.write(o.getObjectData(), outputPath);
} else if (next instanceof Picture) {
Picture o = (Picture) next;
IOUtil.write(o.getData(), outputPath);
}
}
}
When the attachment is a binary type such as exe or png, the file extracted this way is fine. But when the attachment is a text type such as txt or pdf, the extracted file cannot be opened normally. Looking at its binary content, it holds a lot of extra data in addition to the original file. How do I parse this data?

I suspect your ObjectData is of type oleObject or Objekt-Manager-Shellobjekt. Those objects are stored in embedded oleObject*.bin files. Those files have file systems of their own which need to be read. To do so, first get the DirectoryEntry and DirectoryNode and then get the Ole10Native. Having that you can get the file data as byte[] data = ole10Native.getDataBuffer().
Complete example:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.poifs.filesystem.*;
public class ExcelGetObjectData {
public static void main(String[] args) throws Exception {
//String inFilePath = "./ExcelExampleIn.xlsx"; String outFilePath = "./ExcelExampleOut.xlsx";
String inFilePath = "./ExcelExampleIn.xls"; String outFilePath = "./ExcelExampleOut.xls";
try (Workbook workbook = WorkbookFactory.create(new FileInputStream(inFilePath));
FileOutputStream out = new FileOutputStream(outFilePath ) ) {
Sheet sheet = workbook.getSheetAt(0);
Drawing<?> drawingPatriarch = sheet.getDrawingPatriarch();
if (drawingPatriarch != null) {
for (Shape shape : drawingPatriarch) {
System.out.println(shape);
if (shape instanceof ObjectData) {
ObjectData objectData = (ObjectData) shape;
System.out.println(objectData.getFileName());
System.out.println(objectData.getOLE2ClassName());
System.out.println(objectData.getContentType());
if(objectData.getOLE2ClassName().equals("Objekt-Manager-Shellobjekt")) {
if (objectData.hasDirectoryEntry()) {
DirectoryEntry directory = objectData.getDirectory();
if (directory instanceof DirectoryNode) {
DirectoryNode directoryNode = (DirectoryNode)directory;
Ole10Native ole10Native = Ole10Native.createFromEmbeddedOleObject(directoryNode);
System.out.println(ole10Native.getCommand());
System.out.println(ole10Native.getFileName());
System.out.println(ole10Native.getLabel());
byte[] data = ole10Native.getDataBuffer(); //data now contains the embedded data
try (FileOutputStream dataOut = new FileOutputStream("./" + ole10Native.getLabel())) {
dataOut.write(data);
}
}
}
} else if(objectData.getOLE2ClassName().equals("...")) {
//...
}
} else if (shape instanceof /*other*/Object) {
//...
}
}
}
workbook.write(out);
}
}
}
Using EmbeddedExtractor extraction of all embedded objects could be done like so:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.ss.extractor.EmbeddedExtractor;
import org.apache.poi.ss.extractor.EmbeddedData;
public class ExcelEmbeddedExtractor {
public static void main(String[] args) throws Exception {
String inFilePath = "./ExcelExampleIn.xlsx"; String outFilePath = "./ExcelExampleOut.xlsx";
//String inFilePath = "./ExcelExampleIn.xls"; String outFilePath = "./ExcelExampleOut.xls";
try (Workbook workbook = WorkbookFactory.create(new FileInputStream(inFilePath));
FileOutputStream out = new FileOutputStream(outFilePath ) ) {
Sheet sheet = workbook.getSheetAt(0);
EmbeddedExtractor extractor = new EmbeddedExtractor();
for (EmbeddedData embeddedData : extractor.extractAll(sheet)) {
Shape shape = embeddedData.getShape();
System.out.println(shape);
String filename = embeddedData.getFilename();
System.out.println(filename);
byte[] data = embeddedData.getEmbeddedData(); //data now contains the embedded data
try (FileOutputStream dataOut = new FileOutputStream("./" + filename)) {
dataOut.write(data);
}
}
workbook.write(out);
}
}
}
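For .xlsx input, the oleObject*.bin files mentioned above can be inspected without POI, because an .xlsx workbook is a ZIP archive. The following is a minimal sketch using only java.util.zip. It assumes the embedded parts sit under xl/embeddings/, which is where Excel usually puts them but is not guaranteed; the class and method names are made up for illustration. It does not apply to the binary .xls format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class EmbeddedPartLister {
    // Assumption: Excel's usual folder for embedded objects inside the package.
    static final String EMBEDDINGS_PREFIX = "xl/embeddings/";

    /** Returns the names of all ZIP entries stored under xl/embeddings/. */
    static List<String> listEmbeddedParts(InputStream xlsx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().startsWith(EMBEDDINGS_PREFIX)) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of an .xlsx file as the first argument.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : listEmbeddedParts(in)) {
                System.out.println(name);
            }
        }
    }
}
```

Entries listed this way are still raw OLE containers; the payload has to be unwrapped as shown in the POI examples above.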

How can I replace words in a PDF file in Java?

I tried code like this, but it does not replace the words. Is this the correct way to replace words in a PDF file?
@SpringBootApplication
public class DocReadWriteApplication {
public static final String SRC = "../Downloads/Debt LOI.pdf";
public static final String DEST = "../Downloads/hello.pdf";
public static void main(String[] args) throws IOException, DocumentException {
File file = new File(DEST);
file.getParentFile().mkdirs();
manipulatePdf(SRC, DEST);
}
public static void manipulatePdf(String src, String dest) throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
PdfDictionary dict = reader.getPageN(1);
PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
PdfArray refs = null;
if (dict.get(PdfName.CONTENTS).isArray()) {
refs = dict.getAsArray(PdfName.CONTENTS);
} else if (dict.get(PdfName.CONTENTS).isIndirect()) {
refs = new PdfArray(dict.get(PdfName.CONTENTS));
}
for (int i = 0; i < refs.getArrayList().size(); i++) {
PRStream stream = (PRStream) refs.getDirectObject(i);
byte[] data = PdfReader.getStreamBytes(stream);
stream.setData(new String(data).replace("transaction", "Data").getBytes());
}
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
stamper.close();
reader.close();
}
}
Has anybody done something like this?
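One detail of the byte handling in the snippet above can be shown in isolation. `new String(data)` and `getBytes()` without arguments use the JVM's default charset, and with a multi-byte default such as UTF-8 any byte that is not valid in that charset is replaced during decoding, so the stream does not survive the round trip. ISO-8859-1 maps every byte value to exactly one char and back. The sketch below uses only the JDK, and the class and method names are made up for illustration. It is not offered as a fix for the missing replacement: the word may not be stored as literal text in the content stream at all.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StreamBytes {
    /** Replaces target with replacement while leaving all other bytes untouched. */
    static byte[] replaceLatin1(byte[] data, String target, String replacement) {
        // ISO-8859-1 decodes each byte 0..255 to the char with the same value.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return text.replace(target, replacement).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical stream fragment with a non-ASCII byte (0x93) in it.
        byte[] data = {'(', 't', 'r', 'a', 'n', 's', 'a', 'c', 't', 'i', 'o', 'n', ')', (byte) 0x93};
        byte[] out = replaceLatin1(data, "transaction", "Data");
        System.out.println(Arrays.toString(out));
    }
}
```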

Adding UNICODE emoticon on pdf with itext

I have a problem adding Unicode emoticons to a PDF created with iText. I tried the following with itextpdf core 5.5.13:
public class MathSymbols {
public static final String DEST = "EXAMPLE.pdf";
public static final String FONT = "/res/fonts/arialuni.ttf";
public static String TEXT ;
public static void main(String[] args) throws IOException, DocumentException {
File file = new File(DEST);
TEXT = "this " + new String(Character.toChars(0x1F600)) + " string \uD83D\uDE00 contains \ud83d\ude00 special \u2609 characters like this \u2208, \u2229, \u2211, \u222b, \u2206";
new MathSymbols().createPdf(DEST);
}
public void createPdf(String dest) throws IOException, DocumentException {
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream(dest));
document.open();
BaseFont bf = BaseFont.createFont(FONT, BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
Font f = new Font(bf,12);
Paragraph p = new Paragraph(TEXT, f);
document.add(p);
document.close();
}}
I have the same problem with itextpdf 7.x, using this snippet:
public class Main {
public static final String DEST = "example.pdf";
public static void main(String args[]) throws IOException {
File file = new File(DEST);
new Main().createPdf(DEST);
}
public void createPdf(String dest) throws IOException {
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdf = new PdfDocument(writer);
Document document = new Document(pdf);
PdfFont f = PdfFontFactory.createFont("/resources/arialuni.ttf", PdfEncodings.IDENTITY_H,true);
Paragraph p = new Paragraph("H\u2082SO\u2074 1 \uD83D\uDE00 contains \ud83d\ude00 spe \u2702 cial \u2609 characters like this \u2208, \u2229, \u2211, \u222b, \u2206").setFont(f).setFontSize(10);
document.add(p);
document.close();
}}
I tried different fonts and different approaches in Java.
But I only get a white space or a square in the PDF, and no emoticon.
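A note on building such test strings from code points: `Character.toChars` returns a `char[]`, and concatenating an array to a String appends the array's default `toString()` text rather than the character itself. Below is a minimal sketch of the conversion using only the JDK; the class and method names are made up for illustration. It only concerns how the string is built, not whether the chosen font contains a glyph for U+1F600.

```java
final class CodePointText {
    /** Converts a Unicode code point to a String (a surrogate pair above U+FFFF). */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String emoji = fromCodePoint(0x1F600);
        // Two UTF-16 code units, but a single code point.
        System.out.println(emoji.length());
        System.out.println(emoji.codePointCount(0, emoji.length()));
        System.out.println(emoji.equals("\uD83D\uDE00"));
    }
}
```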

Extracting font color along with font type from PDF using PDFBox

I need to extract the font color as well as the font type (e.g. Black, Tahoma, Bold) from a PDF in Java using PDFBox. Below is the code I have written to extract the font type and embed it in the extracted text.
public class PDFParse {
public static void main(String args[]) {
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File("Sample Bill.pdf");
try {
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper() {
String prevBaseFont = "";
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
StringBuilder builder = new StringBuilder();
for (TextPosition position : textPositions)
{
String baseFont = position.getFont().getBaseFont();
if (baseFont != null && !baseFont.equals(prevBaseFont))
{
builder.append('[').append(baseFont).append(']');
prevBaseFont = baseFont;
}
builder.append(position.getCharacter());
}
writeString(builder.toString());
}
};
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
pdfStripper.setSortByPosition(true);
String parsedText = pdfStripper.getText(pdDoc);
PrintWriter out = new PrintWriter("sample.txt");
out.println(parsedText);
out.close();
System.out.println(parsedText);
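} catch (IOException e) {
// The original snippet had no handler for the try block; report the failure
e.printStackTrace();
}
}
}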
How can I extract the font color for each word and embed it in the same extracted file? Thanks :)

adding image to word document using docx4j

I am trying to add an image to a Word document that I create with docx4j.
Here is my code:
package presaleshelperapplication;
import java.io.ByteArrayOutputStream;
import org.docx4j.dml.wordprocessingDrawing.Inline;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.BinaryPartAbstractImage;
import sun.misc.IOUtils;
public class PreSalesHelperApplication {
/**
* @param args the command line arguments
*/
public static void main(String[] args) throws Exception {
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
//wordMLPackage.getMainDocumentPart().addStyledParagraphOfText("Title", "Hello World");
//wordMLPackage.getMainDocumentPart().addParagraphOfText("Text");
java.io.InputStream is = new java.io.FileInputStream("/D:/Development/PreSalesData/sample.jpg");
// Read the image into a byte array (needs commons-io.jar)
byte[] bytes = org.apache.commons.io.IOUtils.toByteArray(is);
String filenameHint = null;
String altText = null;
int id1 = 0;
int id2 = 1;
org.docx4j.wml.P p = newImage( wordMLPackage, bytes,filenameHint, altText,id1, id2,6000 );
// Now add our p to the document
wordMLPackage.getMainDocumentPart().addObject(p);
wordMLPackage.save(new java.io.File("helloworld.docx") );
is.close();
}
public static org.docx4j.wml.P newImage( WordprocessingMLPackage wordMLPackage,
byte[] bytes,
String filenameHint, String altText,
int id1, int id2, long cx) throws Exception {
BinaryPartAbstractImage imagePart = BinaryPartAbstractImage.createImagePart(wordMLPackage, bytes);
Inline inline = imagePart.createImageInline(filenameHint, altText,id1, id2, cx,false);
// Now add the inline in w:p/w:r/w:drawing
org.docx4j.wml.ObjectFactory factory = new org.docx4j.wml.ObjectFactory();
org.docx4j.wml.P p = factory.createP();
org.docx4j.wml.R run = factory.createR();
p.getContent().add(run);
org.docx4j.wml.Drawing drawing = factory.createDrawing();
run.getContent().add(drawing);
drawing.getAnchorOrInline().add(inline);
return p;
}
}
When I run it, I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xmlgraphics/image/loader/ImageContext
My image file is fine, but I still get this error. What could be the problem?

docx4j has dependencies.
One of them is:
<dependency>
<groupId>org.apache.xmlgraphics</groupId>
<artifactId>xmlgraphics-commons</artifactId>
<version>1.5</version>
</dependency>
You need to add this to your class path.
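A quick way to confirm whether the dependency is visible at runtime is to probe for the class named in the error. This is a minimal sketch using only the JDK; the helper class and method names are made up for illustration.

```java
final class ClasspathProbe {
    /** Returns true if the named class can be located by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.xmlgraphics.image.loader.ImageContext";
        System.out.println(name + " visible: " + isOnClasspath(name));
    }
}
```

If this prints false when run with the same classpath as the application, the xmlgraphics-commons jar is not being picked up.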
