Discover titles/paragraphs in word docs

I'm trying to discover paragraphs/titles in word documents.
I use Apache POI to do this.
An example that I use is:
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(filesname));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
List<String> titles = new ArrayList<>();
try {
for (int i = 0; i < we.getText().length() - 1; i++) {
int startIndex = i;
int endIndex = i + 1;
Range range = new Range(startIndex, endIndex, doc);
CharacterRun cr = range.getCharacterRun(0);
if (cr.isBold() || cr.isItalic() || cr.getUnderlineCode() != 0) {
while (cr.isBold() || cr.isItalic() || cr.getUnderlineCode() != 0) {
i++;
endIndex += 1;
range = new Range(endIndex, endIndex + 1, doc);
cr = range.getCharacterRun(0);
}
range = new Range(startIndex, endIndex - 1, doc);
titles.add(range.text());
}
}
}
catch (IndexOutOfBoundsException iobe) {
//sometimes this happens have to find out why.
}
This works for all bold, italic or underlined text.
But what I want is to discover the font that is used most often, and then to detect variations from that font style. Does anyone have an idea?

Well, some thoughts would be to try some of the following:
cr.getFontSize() could be used at the beginning of a paragraph to see if the range changes font size. That in conjunction with bold, italic or underlined would be a good identifier.
cr.getFontName() could also be used to determine when and where the font changes in a given range.
cr.getColor() would be another possibility to help identify if the user is using different colors for a font.
I guess I would iterate over the range and create a new CharacterRun item each time the text characteristics change. Then evaluate each item based on its position in the paragraph as well as all of the aforementioned characteristics (size, color, name, bold, italics, etc.). Perhaps create some sort of weighting scale based on the most common values.
It might also be of value to create a Title object and store the values for each set of characteristics to help optimize searches in later character runs in the same document.
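The weighting idea above can be sketched without POI at all. This is only a minimal illustration, not a finished detector: RunStyle and StyledRun are hypothetical stand-ins for whatever you read from each CharacterRun (font name, size, bold, italic), and "most common" is assumed to mean "covers the most characters".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DominantStyle {
    /** Hypothetical stand-in for the properties read from a CharacterRun. */
    record RunStyle(String fontName, int fontSize, boolean bold, boolean italic) {}

    /** Hypothetical pair of a run's text and its style. */
    record StyledRun(String text, RunStyle style) {}

    /** Returns the style that covers the most characters, or null for an empty list. */
    static RunStyle dominant(List<StyledRun> runs) {
        // Weight each style by the number of characters it formats.
        Map<RunStyle, Integer> weight = new HashMap<>();
        for (StyledRun run : runs) {
            weight.merge(run.style(), run.text().length(), Integer::sum);
        }
        RunStyle best = null;
        int bestWeight = -1;
        for (Map.Entry<RunStyle, Integer> e : weight.entrySet()) {
            if (e.getValue() > bestWeight) {
                best = e.getKey();
                bestWeight = e.getValue();
            }
        }
        return best;
    }

    /** Returns the text of every run whose style differs from the dominant one. */
    static List<String> titleCandidates(List<StyledRun> runs) {
        RunStyle body = dominant(runs);
        return runs.stream()
                .filter(r -> !r.style().equals(body))
                .map(StyledRun::text)
                .toList();
    }
}
```

The runs returned by titleCandidates are only candidates; you would still want to filter them by position in the paragraph, as suggested above.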

You might want to take a look at the buildParagraphTagAndStyle method in Tika's WordExtractor:
https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
For HWPF (.doc), to call it you'd do:
StyleDescription style =
document.getStyleSheet().getStyleDescription(p.getStyleIndex());
TagAndStyle tas = buildParagraphTagAndStyle(
style.getName(), (parentTableLevel>0)
);
For XWPF (.docx) you'd do:
XWPFStyle style = styles.getStyle(paragraph.getStyleID());
TagAndStyle tas = WordExtractor.buildParagraphTagAndStyle(
style.getName(), paragraph.getPartType() == BodyType.TABLECELL
);

It will be easier if you process the data by converting it into paragraphs.
WordExtractor we = new WordExtractor(doc);
String[] para = we.getParagraphText();
Then work paragraph by paragraph. If your existing code could not identify the titles, you can check for bold and underlined text within each paragraph.
You can iterate over the paragraphs like this:
for(int i=0;i<para.length;i++)
{
System.out.println("Length of paragraph "+(i+1)+": "+ para[i].length());
System.out.println(para[i].toString());
}
A working example can be found here:
http://sanjaal.com/java/120/java-file/how-to-read-doc-file-using-java-and-apache-poi/#comments

Related

Get start position of character styled texts in a paragraph in Aspose Words for Android

I parse MS Word documents with Aspose Words for Android using the code below. Every paragraph in the document contains inline runs of text with their own character styles. I have the text and the style of each run, but is there any way to get its start position within the paragraph string, like String.indexOf()? I could convert the paragraph to a string, but then the style information is no longer available.
Document doc = new Document(file); // Get word document.
NodeCollection paras = doc.getChildNodes(NodeType.PARAGRAPH, true); // get all paragraphs.
for (Paragraph prg : (Iterable<Paragraph>) paras) {
for (Run run : (Iterable<Run>) prg.getChildNodes(NodeType.RUN, true)){
boolean defaultPrgFont = run.getFont().getStyle().getName().equals("Default Paragraph Font");
// Get different styled texts only.
if (!defaultPrgFont){
// Text of the run whose style differs from the paragraph default.
String runText = run.getText();
// Name of that run's character style.
String runStyle = run.getFont().getStyle().getName();
// Start position of the styled text within its paragraph.
int runStartPosition; // ?
}
}
}

You can calculate the length of the text in the runs that precede the styled run. Something like this:
Document doc = new Document("C:\\Temp\\in.docx"); // Get word document.
NodeCollection paras = doc.getChildNodes(NodeType.PARAGRAPH, true); // get all paragraphs.
for (Paragraph prg : (Iterable<Paragraph>) paras) {
int runStartPosition = 0;
for (Run run : (Iterable<Run>) prg.getChildNodes(NodeType.RUN, true)){
boolean defaultPrgFont = run.getFont().getStyle().getName().equals("Default Paragraph Font");
// Get different styled texts only.
if (!defaultPrgFont){
// Text of the run whose style differs from the paragraph default.
String runText = run.getText();
// Name of that run's character style.
String runStyle = run.getFont().getStyle().getName();
System.out.println(runStartPosition);
}
// Position is increased for all runs in the paragraph.
// Note that some runs might represent field codes and are not normally displayed.
runStartPosition += run.getText().length();
}
}

How to read PDF sections using Header font size using PDFBox?

I am trying to read PDF documents and split them into sections based on the header font size, or on font and font size together. My current implementation is based on the answer to this post. But because my PDF uses the same font for headers and sub-headers, I need to modify the code so that it searches by font size, or by both.
List<TextSectionDefinition> sectionDefinitions = Arrays.asList(
new TextSectionDefinition("Section", x -> x.get(0).get(0).getFont().getName().contains("Calibri,Bold"), TextSectionDefinition.MultiLine.multiLineHeader, true)
);
document.getClass();
PDFTextSectionStripper stripper = new PDFTextSectionStripper(sectionDefinitions);
stripper.getText(document);
System.out.println("Sections:");
List<String> texts = new ArrayList<>();
for (TextSection textSection : stripper.getSections()) {
String text = textSection.toString();
System.out.println(text);
texts.add(text);
}
return ResponseEntity.ok(texts);
My problem is that when I try to use getFontSize instead of getFont, it doesn't accept any parameters, in my case 16 (the font size).

In the answer you refer to there are text section definitions like this:
new TextSectionDefinition("Titel",
x->x.get(0).get(0).getFont().getName().contains("CMBX12"),
MultiLine.singleLine,
false)
I assume your remark
if I try to use getFontSize instead of getFont it doesn't allow any parameters to be entered, in my case 16
indicates that you want to replace the lambda expression in the second parameter
x->x.get(0).get(0).getFont().getName().contains("CMBX12")
with something that tests the font size. If so, have you tried replacing it with
x->x.get(0).get(0).getFontSize() == 16
or
x->x.get(0).get(0).getFontSizeInPt() == 16
or
x-> {
float size = x.get(0).get(0).getFontSizeInPt();
return size > 15 && size < 17;
}
yet?
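The range check in the last variant is usually the safer choice, because a float size read from a PDF may not be exactly 16. As a minimal sketch, the comparison can be wrapped in a small helper; FontSizeMatch and sizeNear are names made up for this illustration and are not part of PDFBox or of the referenced answer.

```java
import java.util.function.Predicate;

public class FontSizeMatch {
    /** Returns a predicate that accepts sizes within `tolerance` of `target`. */
    static Predicate<Float> sizeNear(float target, float tolerance) {
        return size -> Math.abs(size - target) <= tolerance;
    }
}
```

Assuming the elements of x are PDFBox TextPosition objects as in the referenced answer, the section definition's lambda could then read x -> FontSizeMatch.sizeNear(16f, 0.5f).test(x.get(0).get(0).getFontSizeInPt()).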

How do I ADD bullet points to a word document using Apache POI in Java

I have a Word document that is used as a template. Inside this template I have some tables that contain predefined bullet points. Now I'm trying to replace the placeholder string with a set of strings.
I'm totally stuck on this. My simplified methods look like this.
replaceKeyValue.put("[DescriptionOfItem]", new HashSet<>(Collections.singletonList("This is the description")));
replaceKeyValue.put("[AllowedEntities]", new HashSet<>(Arrays.asList("a", "b")));
replaceKeyValue.put("[OptionalEntities]", new HashSet<>(Arrays.asList("c", "d")));
replaceKeyValue.put("[NotAllowedEntities]", new HashSet<>(Arrays.asList("e", "f")));
try (XWPFDocument template = new XWPFDocument(OPCPackage.open(file))) {
template.getTables().forEach(
xwpfTable -> xwpfTable.getRows().forEach(
xwpfTableRow -> xwpfTableRow.getTableCells().forEach(
xwpfTableCell -> replaceInCell(replaceKeyValue, xwpfTableCell)
)
));
ByteArrayOutputStream baos = new ByteArrayOutputStream();
template.write(baos);
return new ByteArrayResource(baos.toByteArray());
} finally {
if (file.exists()) {
file.delete();
}
}
private void replaceInCell(Map<String, Set<String>> replacementsKeyValuePairs, XWPFTableCell xwpfTableCell) {
for (XWPFParagraph xwpfParagraph : xwpfTableCell.getParagraphs()) {
for (Map.Entry<String, Set<String>> replPair : replacementsKeyValuePairs.entrySet()) {
String keyToFind = replPair.getKey();
Set<String> replacementStrings = replacementsKeyValuePairs.get(keyToFind);
if (xwpfParagraph.getText().contains(keyToFind)) {
replacementStrings.forEach(replacementString -> {
XWPFParagraph paragraph = xwpfTableCell.addParagraph();
XWPFRun run = paragraph.createRun();
run.setText(replacementString);
});
}
}
}
}
I was expecting that some more bullet points will be added to the current cell. Am I missing something? The paragraph is the one containing the placeholder string and format.
Thanks for any help!
UPDATE: This is what part of the template looks like. I would like to automatically search for the terms and replace them. Searching works so far. But trying to replace the bullet points ends in a NullPointerException that I cannot locate.
Would it be easier to use fields? I need to keep the bullet point style though.
UPDATE 2: added download link and updated the code. Seems I can't alter the paragraphs if I'm iterating through them. I get a null-pointer.
Download link: WordTemplate

Since Microsoft Word is very, very "strange" in how it divides text into different runs in its storage, such questions cannot be answered without a complete example including all code and the Word documents in question. Generally usable code for adding content to Word documents does not seem to be possible, unless all adding or replacing happens only in fields (form fields, content controls or mail merge fields).
So I downloaded your WordTemplate.docx and ran the following code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;
import org.apache.xmlbeans.XmlCursor;
import java.util.*;
import java.math.BigInteger;
public class WordReadAndRewrite {
static void addItems(XWPFTableCell cell, XWPFParagraph paragraph, Set<String> items) {
XmlCursor cursor = null;
XWPFRun run = null;
CTR cTR = null; // for a deep copy of the run's low level object
BigInteger numID = paragraph.getNumID();
int indentationLeft = paragraph.getIndentationLeft();
int indentationHanging = paragraph.getIndentationHanging();
boolean first = true;
for (String item : items) {
if (first) {
for (int r = paragraph.getRuns().size()-1; r > 0; r--) {
paragraph.removeRun(r);
}
run = (paragraph.getRuns().size() > 0)?paragraph.getRuns().get(0):null;
if (run == null) run = paragraph.createRun();
run.setText(item, 0);
cTR = (CTR)run.getCTR().copy(); // take a deep copy of the run's low level object
first = false;
} else {
cursor = paragraph.getCTP().newCursor();
boolean thereWasParagraphAfter = cursor.toNextSibling(); // move cursor to next paragraph
// because the new paragraph shall be **after** that paragraph
// thereWasParagraphAfter is true if there is a next paragraph, else false
if (thereWasParagraphAfter) {
paragraph = cell.insertNewParagraph(cursor); // insert new paragraph if there are next paragraphs in cell
} else {
paragraph = cell.addParagraph(); // add new paragraph if there are no other paragraphs present in cell
}
paragraph.setNumID(numID); // set template paragraph's numbering Id
paragraph.setIndentationLeft(indentationLeft); // set template paragraph's indenting from left
if (indentationHanging != -1) paragraph.setIndentationHanging(indentationHanging); // set template paragraph's hanging indenting
run = paragraph.createRun();
if (cTR != null) run.getCTR().set(cTR); // set template paragraph's run formatting
run.setText(item, 0);
}
}
}
public static void main(String[] args) throws Exception {
Map<String, Set<String>> replaceKeyValue = new HashMap<String, Set<String>>();
replaceKeyValue.put("[AllowedEntities]", new HashSet<>(Arrays.asList("allowed 1", "allowed 2", "allowed 3")));
replaceKeyValue.put("[OptionalEntities]", new HashSet<>(Arrays.asList("optional 1", "optional 2", "optional 3")));
replaceKeyValue.put("[NotAllowedEntities]", new HashSet<>(Arrays.asList("not allowed 1", "not allowed 2", "not allowed 3")));
XWPFDocument document = new XWPFDocument(new FileInputStream("WordTemplate.docx"));
List<XWPFTable> tables = document.getTables();
for (XWPFTable table : tables) {
List<XWPFTableRow> rows = table.getRows();
for (XWPFTableRow row : rows) {
List<XWPFTableCell> cells = row.getTableCells();
for (XWPFTableCell cell : cells) {
int countParagraphs = cell.getParagraphs().size();
for (int p = 0; p < countParagraphs; p++) { // do not for each since new paragraphs were added
XWPFParagraph paragraph = cell.getParagraphArray(p);
String placeholder = paragraph.getText();
placeholder = placeholder.trim(); // this is the tricky part to get really the correct placeholder
Set<String> items = replaceKeyValue.get(placeholder);
if (items != null) {
addItems(cell, paragraph, items);
}
}
}
}
}
FileOutputStream out = new FileOutputStream("Result.docx");
document.write(out);
out.close();
document.close();
}
}
The code loops through the table cells in the Word document and looks for a paragraph that contains exactly the placeholder. This may even be the tricky part, since Word might split that placeholder into different text runs. If a match is found, it runs a method addItems, which takes the found paragraph as a template for numbering and indentation (this might be incomplete, though). Then it sets the first new item in the first text run of the found paragraph and removes all other text runs that may be there. Then it determines whether new paragraphs must be inserted into or appended to the cell. An XmlCursor is used for this. The other items are placed in the newly inserted or appended paragraphs, and the numbering and indentation settings are taken from the placeholder's paragraph.
As said, this code only shows the principle. It would have to be extended considerably to be generally usable. In my opinion, using text placeholders in Word documents for text replacement is not a good approach. Placeholders for variable text in Word documents should be fields: form fields, content controls or mail merge fields. The advantage of fields over text placeholders is that Word knows that fields are entities for variable text. It will not split them into multiple text runs for various strange reasons, as it often does with normal text.
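To illustrate why a placeholder split across runs is the tricky part, here is a minimal sketch in plain Java. It works on a list of strings standing in for the texts of a paragraph's runs; PlaceholderRuns and locate are names invented for this example, not POI API. Searching each run separately would miss the placeholder, while searching the joined text finds it and tells you which runs it spans.

```java
import java.util.List;

public class PlaceholderRuns {
    /**
     * Returns {firstRun, lastRun} (inclusive indices) of the runs that together
     * contain the placeholder, or null if the joined run text does not contain it.
     */
    static int[] locate(List<String> runTexts, String placeholder) {
        if (placeholder.isEmpty()) return null;
        String joined = String.join("", runTexts);
        int start = joined.indexOf(placeholder);
        if (start < 0) return null;
        int end = start + placeholder.length(); // exclusive end offset
        int first = -1;
        int last = -1;
        int offset = 0; // offset of the current run within the joined text
        for (int i = 0; i < runTexts.size(); i++) {
            int next = offset + runTexts.get(i).length();
            if (first < 0 && start < next) first = i; // run holding the first character
            if (end <= next) {                        // run holding the last character
                last = i;
                break;
            }
            offset = next;
        }
        return new int[] { first, last };
    }
}
```

With the run texts "Items: ", "[Allowed", "Entities", "]" and " end", the placeholder "[AllowedEntities]" spans runs 1 to 3, although no single run contains it.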

Divide page in 2 parts so we can fill each with different source

I need to create a user guide in which the content appears in two different languages on the same page, so the first half of the page would be in English and the second half in French. (In the future they might ask for a third language as well, but three at most.) So each page would have two blocks. How can I achieve this using iTextPDF in Java?
UPDATE
Following is the structure for more insight of the question.

If I understand your question correctly, you need to create something like this:
In this screen shot, you see the first part of the first book of Caesar's Commentaries on the Gallic War. Gallia omnia est divisa in partes tres, and so is each page in this document: the upper part shows the text in Latin, the middle part shows the text in English, the lower part shows the text in French. If you read the text, you'll discover that Belgians like me are considered being the bravest of all (although we aren't as civilized as one would wish). See three_parts.pdf if you want to take a look at the PDF.
This PDF was created with the ThreeParts example. In this example, I have 9 text files:
http://itextpdf.com/sites/default/files/liber1_1_la.txt
http://itextpdf.com/sites/default/files/liber1_1_en.txt
http://itextpdf.com/sites/default/files/liber1_1_fr.txt
http://itextpdf.com/sites/default/files/liber1_2_la.txt
http://itextpdf.com/sites/default/files/liber1_2_en.txt
http://itextpdf.com/sites/default/files/liber1_2_fr.txt
http://itextpdf.com/sites/default/files/liber1_3_la.txt
http://itextpdf.com/sites/default/files/liber1_3_en.txt
http://itextpdf.com/sites/default/files/liber1_3_fr.txt
Liber is the latin word for book, so all files are snippets from the first book, more specifically sections 1, 2, and 3, in Latin, English and French.
This is how I defined the languages and the rectangles for each language:
public static final String[] LANGUAGES = { "la", "en", "fr" };
public static final Rectangle[] RECTANGLES = {
new Rectangle(36, 581, 559, 806),
new Rectangle(36, 308.5f, 559, 533.5f),
new Rectangle(36, 36, 559, 261) };
In my code, I loop over the different sections, and I create a ColumnText object for each language:
PdfContentByte cb = writer.getDirectContent();
ColumnText[] columns = new ColumnText[3];
for (int section = 1; section <= 3; section++) {
for (int la = 0; la < 3; la++) {
columns[la] = createColumn(cb, section, LANGUAGES[la], RECTANGLES[la]);
}
while (addColumns(columns)) {
document.newPage();
for (int la = 0; la < 3; la++) {
columns[la].setSimpleColumn(RECTANGLES[la]);
}
}
document.newPage();
}
If you examine the body of the inner loop, you see that I first define three ColumnText objects, one for each language:
public ColumnText createColumn(PdfContentByte cb, int i, String la, Rectangle rect)
throws IOException {
ColumnText ct = new ColumnText(cb);
ct.setSimpleColumn(rect);
Phrase p = createPhrase(String.format("resources/text/liber1_%s_%s.txt", i, la));
ct.addText(p);
return ct;
}
In this case, I'm using ColumnText in text mode, and I read the text from the different files into a Phrase like this:
public Phrase createPhrase(String path) throws IOException {
Phrase p = new Phrase();
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream(path), "UTF8"));
String str;
while ((str = in.readLine()) != null) {
p.add(str);
}
in.close();
return p;
}
Once I have defined the ColumnText objects and added their content, I need to render the content to one or more pages until all the text from all columns is rendered. To achieve this, we use this method:
public boolean addColumns(ColumnText[] columns) throws DocumentException {
int status = ColumnText.NO_MORE_TEXT;
for (ColumnText column : columns) {
if (ColumnText.hasMoreText(column.go()))
status = ColumnText.NO_MORE_COLUMN;
}
return ColumnText.hasMoreText(status);
}
As you can see, I also create a new page for every new section I start. This isn't really necessary: I could add all the sections to a single ColumnText, but depending on how the Latin text is translated into English and French, you could end up with large discrepancies where section X of the Latin text starts on one page and the same section in English or French starts on another page. Hence my choice to start a new page, although it's not really necessary in this small proof of concept.

JTextPane - HTMLDocument: when adding/removing a new style, other attributes also changes

I have a JTextPane (or JEditorPane) in which I want to add some buttons to format text (as shown in the picture).
When I change the selected text to bold (creating a new Style), the font family (and other attributes) also change. Why? I want to set (or remove) the bold attribute on the selected text and leave everything else unchanged.
This is what I'm trying:
private void setBold(boolean flag){
HTMLDocument doc = (HTMLDocument) editorPane.getDocument();
int start = editorPane.getSelectionStart();
int end = editorPane.getSelectedText().length();
StyleContext ss = doc.getStyleSheet();
//check if BoldStyle exists and then add / remove it
Style style = ss.getStyle("BoldStyle");
if(style == null){
style = ss.addStyle("BoldStyle", null);
style.addAttribute(StyleConstants.Bold, true);
} else {
style.addAttribute(StyleConstants.Bold, false);
ss.removeStyle("BoldStyle");
}
doc.setCharacterAttributes(start, end, style, true);
}
But as I explained above, other attributes also change:
Any help will be appreciated. Thanks in advance!
http://oi40.tinypic.com/riuec9.jpg

What you are trying to do can be accomplished with one of the following two lines of code:
new StyledEditorKit.BoldAction().actionPerformed(null);
or
editorPane.getActionMap().get("font-bold").actionPerformed(null);
... where editorPane is an instance of JEditorPane of course.
Both will seamlessly take care of any attributes already defined, and both support text selection.
Regarding your code, it does not work with previously styled text because you are overwriting the corresponding attributes with nothing. I mean, you never gather the values for the attributes already set for the current selected text using, say, the getAttributes() method. So, you are effectively resetting them to whatever default the global stylesheet specifies.
The good news is you don't need to worry about all this if you use one of the snippets above. Hope that helps.
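If you do want to do it by hand, the key point is to merge the change into the existing attributes instead of replacing them. The following is only a sketch under some assumptions: it uses a plain DefaultStyledDocument (HTMLDocument is a subclass of it), BoldToggle and its methods are names made up for this example, and it reads the current bold state from the first selected character only.

```java
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;

public class BoldToggle {
    /** Toggles bold on [start, start + length) without touching other attributes. */
    static void toggleBold(DefaultStyledDocument doc, int start, int length) {
        // Read the attributes currently set on the first affected character.
        AttributeSet current = doc.getCharacterElement(start).getAttributes();
        SimpleAttributeSet change = new SimpleAttributeSet();
        StyleConstants.setBold(change, !StyleConstants.isBold(current));
        // replace = false merges the change into the existing attributes.
        doc.setCharacterAttributes(start, length, change, false);
    }

    /** Builds a small document whose text already has a font family set. */
    static DefaultStyledDocument sample() {
        DefaultStyledDocument doc = new DefaultStyledDocument();
        SimpleAttributeSet serif = new SimpleAttributeSet();
        StyleConstants.setFontFamily(serif, "Serif");
        try {
            doc.insertString(0, "Hello world", serif);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return doc;
    }
}
```

Calling toggleBold(doc, 0, 5) on the sample document makes "Hello" bold while it keeps its "Serif" font family; the last argument of setCharacterAttributes is what makes the difference compared to the code in the question, which passes true.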

I made some minor modifications to your code and it worked here:
private void setBold(boolean flag){
HTMLDocument doc = (HTMLDocument) editorPane.getDocument();
int start = editorPane.getSelectionStart();
int end = editorPane.getSelectionEnd();
if (start == end) {
return;
}
if (start > end) {
int life = start;
start = end;
end = life;
}
StyleContext ss = doc.getStyleSheet();
//check if BoldStyle exists and then add / remove it
Style style = ss.getStyle(editorPane.getSelectedText());
if(style == null){
style = ss.addStyle(editorPane.getSelectedText(), null);
style.addAttribute(StyleConstants.Bold, true);
} else {
style.addAttribute(StyleConstants.Bold, false);
ss.removeStyle(editorPane.getSelectedText());
}
doc.setCharacterAttributes(start, end - start, style, true);
}
