I'm writing a bidi String to an MS Word file using Apache POI after wrapping it with the sequence
aString = "\u202E" + aString + "\u202C";
The text renders correctly in the file and reads fine when I retrieve the string again. But if I modify the file in any way, reading that string suddenly returns true for isBlank().
Thank you in advance for any suggestions/help!
When Microsoft Word stores bidirectional text in its Office Open XML *.docx format, it sometimes uses special elements w:bdo (bidirectional override). Apache POI does not read those elements so far. So if an XWPFParagraph contains such elements, paragraph.getText() will return an empty string.
One could use org.apache.xmlbeans.XmlCursor to really get all text from all XWPFParagraphs, like so:
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.xmlbeans.XmlCursor;
public class ReadWordParagraphs {
static String getAllTextFromParagraph(XWPFParagraph paragraph) {
XmlCursor cursor = paragraph.getCTP().newCursor();
return cursor.getTextValue();
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("WordDocument.docx"));
for (XWPFParagraph paragraph : document.getParagraphs()) {
System.out.println(paragraph.getText()); // will not return text in w:bdo elements
System.out.println(getAllTextFromParagraph(paragraph)); // will return all text content of paragraph
}
}
}
Related
My goal is to insert a docx (while keeping its styles and formatting) into a specific row of another docx. The second docx contains the word "placeholder". First I have to find this word, and then replace it with the content of the first docx, keeping the styles and formats of the inserted docx.
I have an idea. Maybe I should create a new docx, split the second docx at the "placeholder", put the first part into the new docx, then the whole first docx, and then the second part of the second docx. But how can I keep the styles and formats? I don't have images, tables or anything like that, just text and formatting such as lists, tabs, text styles, etc.
Currently I am using apache POI and java. (I tried docx4j, but I had less success)
The example code does a simple merging but nothing more. How can I find the "placeholder" word and insert my docx there?
public static void merge(InputStream src1, InputStream src2, OutputStream dest) throws Exception {
OPCPackage src1Package = OPCPackage.open(src1);
OPCPackage src2Package = OPCPackage.open(src2);
XWPFDocument src1Document = new XWPFDocument(src1Package);
CTBody src1Body = src1Document.getDocument().getBody();
XWPFDocument src2Document = new XWPFDocument(src2Package);
CTBody src2Body = src2Document.getDocument().getBody();
appendBody(src1Body, src2Body);
src1Document.write(dest);
}
private static void appendBody(CTBody src, CTBody append) throws Exception {
XmlOptions optionsOuter = new XmlOptions();
optionsOuter.setSaveOuter();
String appendString = append.xmlText(optionsOuter);
String srcString = src.xmlText();
String prefix = srcString.substring(0, srcString.indexOf(">") + 1);
String mainPart = srcString.substring(srcString.indexOf(">") + 1, srcString.lastIndexOf("<"));
String suffix = srcString.substring(srcString.lastIndexOf("<"));
String addPart = appendString.substring(appendString.indexOf(">") + 1, appendString.lastIndexOf("<"));
CTBody makeBody = CTBody.Factory.parse(prefix + mainPart + addPart + suffix);
src.set(makeBody);
}
Re docx4j you can insert a docx at a specific location (eg in a table cell) using MergeDocx in our commercial Docx4j Enterprise.
You can get a trial version from https://www.plutext.com/m/index.php/products
Then see the MergeIntoTableCell sample and documentation.
Another solution: in my example, we can find the placeholder text in mainPart (using indexOf / lastIndexOf / substring works better than using a regex), replace it with addPart, and we are ready to go.
2 possible problems:
1: if addPart contains numbered or bulleted lists, they can become messy after being added to the other document.
2: inserting pictures is not possible this way; that has to be handled differently.
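A minimal sketch of that string-based replacement, working on the same mainPart and addPart strings that appendBody builds. The helper name replaceParagraphContaining is made up for illustration. It assumes the placeholder text sits in a single run (Word sometimes splits a word across several w:t elements) and that the enclosing paragraph is a plain top-level w:p. Because addPart holds whole paragraphs, the sketch swaps out the entire paragraph around the placeholder instead of pasting into the middle of a w:t element.

```java
class PlaceholderInsert {

    // Hypothetical helper: replaces the whole <w:p> paragraph that contains
    // `placeholder` with `addPart` (the inner XML of the other document's body).
    // Returns mainPart unchanged if the placeholder or its paragraph is not found.
    static String replaceParagraphContaining(String mainPart, String placeholder, String addPart) {
        int hit = mainPart.indexOf(placeholder);
        if (hit < 0) {
            return mainPart;
        }
        // A paragraph opens with "<w:p>" or "<w:p " (with attributes);
        // "<w:pPr" and similar tags must not be mistaken for it.
        int start = Math.max(mainPart.lastIndexOf("<w:p>", hit),
                             mainPart.lastIndexOf("<w:p ", hit));
        int close = mainPart.indexOf("</w:p>", hit);
        if (start < 0 || close < 0) {
            return mainPart;
        }
        int end = close + "</w:p>".length();
        return mainPart.substring(0, start) + addPart + mainPart.substring(end);
    }
}
```

In appendBody this would take the place of the plain concatenation mainPart + addPart. The two problems listed above (lists and pictures) still apply.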
We are using Aspose to generate downloadable content in Word and PDF format. We don't have separate code paths for PDF and Word.
There is a single code base that retrieves the data from the database; at the end it sets the output type to PDF (SaveFormat.PDF) or Word (SaveFormat.DOCX).
When we change the running head styles, we get the expected formatting in Word but not in PDF.
Note: We have updated the Aspose JAR, but it still does not work.
Could you please help with this issue? Thanks in advance.
package com.sam.test;
import java.text.MessageFormat;
import com.aspose.words.Document;
import com.aspose.words.DocumentBuilder;
import com.aspose.words.HeaderFooterType;
import com.aspose.words.ParagraphAlignment;
import com.aspose.words.SaveFormat;
public class SuperScriptTest {
public static void main(String[] args) throws Exception {
String fontName = "Times New Roman";
String fontColour = "black";
Double fontSize = 15.9996;
Double lineheight = 100.0;
String footerVariable = "";
Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.writeln("Aspose Sample document Content for Word file.");
com.aspose.words.Section currentSection = builder.getCurrentSection();
com.aspose.words.PageSetup pageSetup = currentSection.getPageSetup();
pageSetup.setDifferentFirstPageHeaderFooter(true);
// --- Create header for the first page. ---
pageSetup.setHeaderDistance(0.5 * 72 );
pageSetup.setFooterDistance(0.5 * 72);
builder.moveToHeaderFooter(HeaderFooterType.HEADER_FIRST);
builder.getParagraphFormat().setAlignment(ParagraphAlignment.LEFT);
String runningHead = "Running Head Test";
runningHead = MessageFormat
.format("<span style=\"margin:0px; font-family:{0}; font-size:{1}px; color:{2}; line-height:{3}%;\">{4}</span>",
fontName, fontSize, fontColour, lineheight,
runningHead);
if (!doc.getLastSection().getBody().hasChildNodes())
doc.getLastSection().remove();
builder.insertHtml(runningHead);
doc.save("C:/ASPOSE/Examples/ASPOSEPOC1/Aspose_word_doc.docx",SaveFormat.DOCX);
doc.save("C:/ASPOSE/Examples/ASPOSEPOC1/Aspose_pdf_doc.pdf",SaveFormat.PDF);
}
}
I use Apache PDFBox to parse text from a PDF file. I am trying to get the line that follows a specific line.
PDDocument document = PDDocument.load(new File("my.pdf"));
if (!document.isEncrypted()) {
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
System.out.println("Text from pdf:" + text);
} else{
log.info("File is encrypted!");
}
document.close();
Sample:
Sentence 1, nth line of file
Needed line
Sentence 3, n+2th line of file
I tried to get all the lines of the file into an array, but that is unreliable because I cannot filter for a specific text. The second solution has the same problem, which is why I am looking for a PDFBox-based solution.
Solution 1:
String[] lines = myString.split(System.getProperty("line.separator"));
Solution 2:
String neededline = (String) FileUtils.readLines(file).get("n+2th")
In fact, the source code for the PDFTextStripper class uses the exact same line ending as you, so your first attempt is as close to correct as possible using PDFBox.
You see, the PDFTextStripper getText method calls the writeText method which just writes to an output buffer line by line with the writeString method in the exact same way as you have already tried. The result returned from this method is the buffer.toString().
Therefore, given a well formatted PDF, it would seem the question you are really asking is how to filter an array for specific text. Here are some ideas:
First, capture the lines in an array like you said.
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class Main {
static String[] lines;
public static void main(String[] args) throws Exception {
PDDocument document = PDDocument.load(new File("my2.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
lines = text.split(System.getProperty("line.separator"));
document.close();
}
}
Here's a method to get a complete String by any line number index, easy:
// returns a full String line by number n
static String getLine(int n) {
return lines[n];
}
Here's a linear search method that finds a string match and returns the first line number where found.
// searches all lines for first line index containing `filter`
static int getLineNumberWithFilter(String filter) {
int n = 0;
for(String line : lines) {
if(line.indexOf(filter) != -1) {
return n;
}
n++;
}
return -1;
}
With the above, it is possible to get a line by its number:
System.out.println(getLine(8)); // line 8 for example
Or the entire line that contains your search text:
System.out.println(lines[getLineNumberWithFilter("Cat dog mouse")]);
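Since the question asks for the line that follows a specific line, the two ideas can be combined. A small sketch on plain strings; the helper name lineAfter is made up, and the text argument stands for whatever stripper.getText(document) returned. It splits on \R (any line break, Java 8+) instead of the platform line separator, so it does not matter which line ending ends up in the text.

```java
class LineAfter {

    // Hypothetical helper: returns the line following the first line
    // that contains `marker`, or null if there is no such line.
    static String lineAfter(String text, String marker) {
        String[] lines = text.split("\\R"); // \R matches \n, \r\n and \r
        for (int i = 0; i < lines.length - 1; i++) {
            if (lines[i].contains(marker)) {
                return lines[i + 1];
            }
        }
        return null;
    }
}
```

Returning null instead of indexing with -1 also avoids the ArrayIndexOutOfBoundsException that lines[getLineNumberWithFilter(...)] throws when nothing matches.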
This all seems pretty straightforward, but it works only under the assumption that the text can be split into lines by the line separator. If the solution is not as simple as the above ideas, I believe the source of your problem may not be your PDFBox implementation but rather the PDF you are trying to text mine.
Here's a link to a tutorial that also does what you are trying to do:
https://www.tutorialkart.com/pdfbox/extract-text-line-by-line-from-pdf/
Again, same approach...
Hey, I am trying to find a regex pattern in a directory of files and replace each match with the character 'X'. I started out trying to alter one file, but that is not working. I came up with the following code; any help would be appreciated.
My goal is to read all of the file content, find the regex pattern and replace it.
This code does not work either: it runs but does nothing to the text file.
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
public class DataChange {
public static void main(String[] args) throws IOException {
String absolutePathOne = "C:\\Users\\hoflerj\\Desktop\\After\\test.txt";
String[] files = { "test.txt" };
for (String file : files) {
File f = new File(file);
String content = FileUtils.readFileToString(new File(absolutePathOne));
FileUtils.writeStringToFile(f, content.replaceAll("2018(.+)", "X"));
}
}
}
File Content inside the file is:
3-MAAAA2017/2/00346
I am trying to have it read through the file and replace 2017/2/00346 with X's.
My goal is also to do this for about 3 files at a time.
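Two things in the posted code stand out: the pattern 2018(.+) cannot match content that contains 2017, and the code reads from absolutePathOne but writes to the relative path test.txt, which resolves against the working directory and may not be the file you are looking at. A minimal sketch of one way to do it with plain java.nio; the class name, the method names and the pattern (four digits, slash, digits, slash, digits) are assumptions based on the single sample line above and need adjusting to the real data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

class MaskPattern {

    // Assumed pattern, derived from the sample "2017/2/00346".
    static final Pattern TARGET = Pattern.compile("\\d{4}/\\d+/\\d+");

    // Replaces every match in the given content with "XXX".
    static String mask(String content) {
        return TARGET.matcher(content).replaceAll("XXX");
    }

    // Reads a file, masks its content and writes it back to the SAME path.
    static void maskFile(Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        Files.write(file, mask(content).getBytes(StandardCharsets.UTF_8));
    }

    // Usage: pass the paths of the files to process as arguments.
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            maskFile(Paths.get(name));
        }
    }
}
```

Passing three paths on the command line processes three files in one run.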
I have a Word (docx) file which contains equations, as shown in the images below.
I want to read the data of the docx file and save it to my database,
so that when needed I can get the data from the database and show it on my HTML page.
I used Apache POI to read the data from the docx file, but it cannot extract the equations.
Please help me!
Word *.docx files are ZIP archives containing XML files which are Office Open XML. The formulas contained in Word *.docx documents are Office MathML (OMML).
Unfortunately this XML format is not really well known outside Microsoft Office. So it is not directly usable in HTML, for example. But fortunately it is XML, and as such it can be transformed using XSLT. So we can transform the OMML into MathML, for example, which is usable in a wider range of use cases.
A transformation via XSLT is mainly based on an XSL definition of the transformation. Unfortunately creating one is not really easy either. But fortunately Microsoft has done that already, and if you have a current Microsoft Office installed, you can find this file, OMML2MML.XSL, in the Microsoft Office program directory under %ProgramFiles%\. If you don't find it there, do a web search to get it.
So knowing all this, we can get the OMML from the XWPFDocument, transform it into MathML and then save that for later use.
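The XSLT mechanism itself can be tried with the JDK alone, without POI or OMML2MML.XSL. A small sketch; the stylesheet here is made up purely for illustration (it turns an in element into an out element) and has nothing to do with the real OMML to MathML rules. The class and method names are hypothetical as well.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltDemo {

    // Made-up stylesheet for illustration only: <in>text</in> becomes <out>text</out>.
    static final String XSL =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/in\"><out><xsl:value-of select=\".\"/></out></xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML string and returns the result as a string.
    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

The real code does the same thing, only with OMML2MML.XSL as the stylesheet and the DOM node of a CTOMath as the source.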
My example stores the found formulas as MathML in an ArrayList of strings. You should also be able to store these strings in your database.
The example needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025. This is because it uses CTOMath which is not shipped with the smaller poi-ooxml-schemas jar.
Word document:
Java code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
import java.util.List;
import java.util.ArrayList;
/*
needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/
public class WordReadFormulas {
static File stylesheet = new File("OMML2MML.XSL");
static TransformerFactory tFactory = TransformerFactory.newInstance();
static StreamSource stylesource = new StreamSource(stylesheet);
static String getMathML(CTOMath ctomath) throws Exception {
Transformer transformer = tFactory.newTransformer(stylesource);
Node node = ctomath.getDomNode();
DOMSource source = new DOMSource(node);
StringWriter stringwriter = new StringWriter();
StreamResult result = new StreamResult(stringwriter);
transformer.setOutputProperty("omit-xml-declaration", "yes");
transformer.transform(source, result);
String mathML = stringwriter.toString();
stringwriter.close();
//The native OMML2MML.XSL transforms OMML into MathML as XML having special name spaces.
//We don't need this since we want to use the MathML in HTML, not in XML.
//So ideally we should change the OMML2MML.XSL to not do so.
//But to keep this example as simple as possible, we use replace to get rid of the XML specialities.
mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
mathML = mathML.replaceAll("xmlns:mml", "xmlns");
mathML = mathML.replaceAll("mml:", "");
return mathML;
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));
//storing the found MathML in an ArrayList of strings
List<String> mathMLList = new ArrayList<String>();
//getting the formulas out of all body elements
for (IBodyElement ibodyelement : document.getBodyElements()) {
if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
for (CTOMath ctomath : ctomathpara.getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
}
} else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
XWPFTable table = (XWPFTable)ibodyelement;
for (XWPFTableRow row : table.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph paragraph : cell.getParagraphs()) {
for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
for (CTOMath ctomath : ctomathpara.getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
}
}
}
}
}
}
document.close();
//creating a sample HTML file
String encoding = "UTF-8";
FileOutputStream fos = new FileOutputStream("result.html");
OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
writer.write("<!DOCTYPE html>\n");
writer.write("<html lang=\"en\">");
writer.write("<head>");
writer.write("<meta charset=\"utf-8\"/>");
//using MathJax for helping all browsers to interpret MathML
writer.write("<script type=\"text/javascript\"");
writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
writer.write(">");
writer.write("</script>");
writer.write("</head>");
writer.write("<body>");
writer.write("<p>Following formulas was found in Word document: </p>");
int i = 1;
for (String mathML : mathMLList) {
writer.write("<p>Formula" + i++ + ":</p>");
writer.write(mathML);
writer.write("<p/>");
}
writer.write("</body>");
writer.write("</html>");
writer.close();
Desktop.getDesktop().browse(new File("result.html").toURI());
}
}
Result:
Just tested this code using Apache POI 5.0.0 and it works. You need poi-ooxml-full-5.0.0.jar for Apache POI 5.0.0. Please read https://poi.apache.org/help/faq.html#faq-N10025 for which ooxml libraries are needed for which Apache POI version.
Adding to @Axel Richter's answer: I found it really hard to find the required set of dependencies, so here they are:
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>ooxml-schemas</artifactId>
<version>1.4</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>3.15</version>
</dependency>
And with Office 2019 I guess they don't provide OMML2MML.XSL so here's the link for it https://github.com/Versal/word2markdown/blob/master/libs/omml2mml.xsl