Extract word document comments and the text they comment on - java

I need to extract word document comments and the text they comment on. Below is my current solution, but it is not working as expcted
public class Main {
public static void main(String[] args) throws Exception {
var document = new Document("sample.docx");
NodeCollection<Paragraph> paragraphs = document.getChildNodes(PARAGRAPH, true);
List<MyComment> myComments = new ArrayList<>();
for (Paragraph paragraph : paragraphs) {
var comments = getComments(paragraph);
int commentIndex = 0;
if (comments.isEmpty()) continue;
for (Run run : paragraph.getRuns()) {
var runText = run.getText();
for (int i = commentIndex; i < comments.size(); i++) {
Comment comment = comments.get(i);
String commentText = comment.getText();
if (paragraph.getText().contains(runText + commentText)) {
myComments.add(new MyComment(runText, commentText));
commentIndex++;
break;
}
}
}
}
myComments.forEach(System.out::println);
}
private static List<Comment> getComments(Paragraph paragraph) {
#SuppressWarnings("unchecked")
NodeCollection<Comment> comments = paragraph.getChildNodes(COMMENT, false);
List<Comment> commentList = new ArrayList<>();
comments.forEach(commentList::add);
return commentList;
}
static class MyComment {
String text;
String commentText;
public MyComment(String text, String commentText) {
this.text = text;
this.commentText = commentText;
}
#Override
public String toString() {
return text + "-->" + commentText;
}
}
}
sample.docx contents are:
And the output is (which is incorrect):
factors-->This is word comment
%–10% of cancers are caused by inherited genetic defects from a person's parents.-->Second paragraph comment
Expected output is:
factors-->This is word comment
These factors act, at least partly, by changing the genes of a cell. Typically, many genetic changes are required before cancer develops. Approximately 5%–10% of cancers are caused by inherited genetic defects from a person's parents.-->Second paragraph comment
These factors act, at least partly, by changing the genes of a cell. Typically, many genetic changes are required before cancer develops. Approximately 5%–10% of cancers are caused by inherited genetic defects from a person's parents.-->First paragraph comment
Please help me with a better way of extarcting word document comments and the text they comment on. If you need additional details let me know, I will provide all the required details

The commented text is marked by special nodes CommentRangeStart and CommentRangeEnd. CommentRangeStart and CommentRangeEnd nodes has Id, which corresponds the Comment id the range is linked to. So you need to extract content between the corresponding start and end nodes.
By the way, the code example in the Aspose.Words API reference shows how print the contents of all comments and their comment ranges using a document visitor. Looks like exactly what you are looking for.
EDIT: You can use code like the following to accomplish your task. I did not provide full code for extracting content between nodes, is is availabel on GitHub
Document doc = new Document("C:\\Temp\\in.docx");
// Get the comments in the document.
Iterable<Comment> comments = doc.getChildNodes(NodeType.COMMENT, true);
Iterable<CommentRangeStart> commentRangeStarts = doc.getChildNodes(NodeType.COMMENT_RANGE_START, true);
Iterable<CommentRangeEnd> commentRangeEnds = doc.getChildNodes(NodeType.COMMENT_RANGE_END, true);
for (Comment c : comments)
{
System.out.println(String.format("Comment %d : %s", c.getId(), c.toString(SaveFormat.TEXT)));
CommentRangeStart start = null;
CommentRangeEnd end = null;
// Search for an appropriate start and end.
for (CommentRangeStart s : commentRangeStarts)
{
if (c.getId() == s.getId())
{
start = s;
break;
}
}
for (CommentRangeEnd e : commentRangeEnds)
{
if (c.getId() == e.getId())
{
end = e;
break;
}
}
if (start != null && end != null)
{
// Extract content between the start and end nodes.
// Code example how to extract content between nodes is here
// https://github.com/aspose-words/Aspose.Words-for-Java/blob/master/Examples/src/main/java/com/aspose/words/examples/programming_documents/document/ExtractContentBetweenCommentRange.java
}
else
{
System.out.println(String.format("Comment %d Does not have comment range"));
}
}

Related

Replace Text using Apache POI in Header/Footer not working when a dot is inside the placeholder

I use templ4docx/Apache POI (2.0.3/3.17).
There you can set a VariablePatten like this:
private static final String START_PATTERN = "#{";
private static final String END_PATTERN = "}";
...
targetDocx.setVariablePattern(new VariablePattern(START_PATTERN, END_PATTERN));
When i use a placeholder with dots, it´s not working inside Header/Footer. In the Body with dots it works. And Images works too with placeholder and dots inside!
Example in Word-Template:
#{Person.Name} // works in Body NOT in Header/Footer!
#{Person.Name} // works in Body and Header/Footer!
#{Person} // works in Body and Header/Footer!
#{Image.Name} // works in Body and Header/Footer! Here i use ImageVariable instead of Textvariable.
I debug the code an saw the "run.setText()" is called with the right Text but in the final document it´s not.
#Override
public void insert(Insert insert, Variable variable) {
if (!(insert instanceof TextInsert)) {
return;
}
if (!(variable instanceof TextVariable)) {
return;
}
TextInsert textInsert = (TextInsert) insert;
TextVariable textVariable = (TextVariable) variable;
for (XWPFRun run : textInsert.getParagraph().getRuns()) {
String text = run.getText(0);
if (StringUtils.contains(text, textInsert.getKey().getKey())) {
text = StringUtils.replace(text, textVariable.getKey(), textVariable.getValue());
if (text.contains("\n")) {
String[] textLines = text.split("\n");
run.setText(textLines[0], 0);
for (int i = 1; i < textLines.length; i++) {
run.addBreak();
run.setText(textLines[i]);
}
} else {
run.setText(text, 0);
}
}
}
}
Any ideas why it didn´t work with Placeholder "#{Person.Name}" as a TextVariable in Header/Footer? But it works with "#{PersonName}" and ImageVariable "#{Images.Logo1}"???
It looks like Word sometimes separates the placeholders, so you'll only find parts of the placeholder in the runs.
In the class "TextInsertStrategy" i check in the loop of the runs for split placeholders and treat it accordingly. With that I could solve the problem.

Java string array doesn't print correctly

I am currently working on a Java program that crawls a webpage and prints out some information from it.
There is one part that I can't figure out, and thats when I try to print out one specific String Array with some information in it, all it gives me is " ] " for that line. However, a few lines before, I also try printing out another String array in the exact same way and it prints out fine. When I test what is actually being passed to the "categories" variable, its the correct information and can be printed out there.
public class Crawler {
private Document htmlDocument;
String [] keywords, categories;
public void printData(String urlToCrawl)
{
nextURL=urlToCrawl;
crawl();
//This does what its supposed to do. (Print Statement 1)
System.out.print("Keywords: ");
for (String i :keywords) {System.out.print(i+", ");}
//This doesnt. (Print Statement 2)
System.out.print("Categories: ");
for (String b :categories) {System.out.print(b+", ");}
}
public void crawl()
{
//Gather Data
//open up JSOUP for HTTP parsing.
Connection connection = Jsoup.connect(nextURL).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
this.htmlDocument=htmlDocument;
System.out.println("Recieved Webpage "+ nextURL);
int guacCounter = 0;
for(Element guac : htmlDocument.select("script"))
{
if(guacCounter==5)
{
//String concentratedGuac = guac.toString();
String[] items = guac.toString().split("\\n");
categories = processGuac(items);
break;
}
else if(guacCounter<5) {
guacCounter++;
}
}
}
public String[] processKeywords(String totalKeywords)
{
String [] separatedKeywords = totalKeywords.split(",");
//System.out.println(separatedKeywords.toString());
return separatedKeywords;
}
public String[] processGuac(String[] inputGuac)
{
int categoryIsOnLine = 6;
String categoryData = inputGuac[categoryIsOnLine-1];
categoryData = categoryData.replace(",","");
categoryData = categoryData.replace("'","");
categoryData = categoryData.replace("|",",");
categoryData = categoryData.split(":")[1];
//this prints out the list of categories in string form.(Print Statement 3)
System.out.println("Testing here: " + categoryData.toString());
String [] categoryList=categoryData.split(",");
//This prints out the list of categories in array form correctly.(Print statement 4)
System.out.println("Testing here too: " );
for(String a : categoryList) {System.out.println(a);}
return categoryList;
}
}
I cut out a lot of the irrelevant parts of my code so there might be some missing variables.
Here is what my printouts look like:
PS1:
Keywords: What makes a good friend, making friends, signs of a good friend, supporting friends, conflict management,
PS2:
]
PS3:
Testing here: wellbeing,friends-and-family,friendships
PS4:
Testing here too:
wellbeing
friends-and-family
friendships

Play Framework IF statement and FOR loop

I am new to MVC and following this link I have a search page for resulting pdf metadatas by using Solr. My if statement and for loop in html side do not work
Searching.java in models folder:
public class Searching {
public String q;
public String outputTitle;
public String outputAuthor;
public String outputContent;
public String outputPage;
public String outputPath;
}
search function in Application.java:
final static Form<Searching> searchForm = form(Searching.class);
final static List<Searching> searchList = new ArrayList<Searching>();
public static Result search() {
Form<Searching> filledForm = searchForm.bindFromRequest();
Searching searched = filledForm.get();
....(database connection lines)
QueryResponse response = solr.query(query);
SolrDocumentList results = response.getResults();
if(results.isEmpty())
System.out.println("SEARCH NOT FOUND");
else {
for (int i = 0; i < results.size(); ++i) {
searched.outputTitle = (String)results.get(i).getFirstValue("title");
searched.outputAuthor = (String)results.get(i).getFirstValue("author");
searched.outputPage =results.get(i).getFirstValue("pageNumber").toString();
searched.outputContent = (String)results.get(i).getFirstValue("content");
searched.outputPath = (String)results.get(i).getFirstValue("path");
searchList.add(searched);
}
System.out.println("\nresults.getNumFound(): "+ searched.outputFound);
System.out.println("results.size(): "+results.size());
}
return play.mvc.Results.ok(search.render(searched, searchForm, searchList));
}
search.scala.html
#(searched: Searching, searchForm: Form[Searching], searchList: List[Searching])
.. some buttons,a search bar...
#if(searchList.isEmpty()) {
<h1>Error</h1>
} else {
#for(search <- searchList) {
<ul>Title: #search.outputTitle</ul>
<ul>Author: #search.outputAuthor <a href="#search.outputPath" download>Download PDF</a></ul>
<ul>Number of Page(s): #search.outputPage</ul>
}
}
Java code works well. I can see outputs on the terminal, but my html side has problem and it shows one book many times according to size of searchList
I am posting the answer explicitly even though I was able to help the OP in the chat - maybe somebody else is running into such problem but did not check the chat:
The problem is even though you have the line in the for-loop you are still using the same searched variable. What you have to do is to reinitialize the variable when entering the loop. Something like:
for (...) {
searched = new Searching();
searched.outputTitle = (String)results.get(i).getFirstValue("title");
....
searchList.add(searched);
}
This solves the problem with the duplicates and everything is fine now.

How to list items of a collection with Word Ole Automation

I'm using the SWT OLE api to edit a Word document in an Eclipse RCP. I read articles about how to read properties from the active document but now I'm facing a problem with collections like sections.
I would like to retrieve only the body section of my document but I don't know what to do with my sections object which is an IDispatch object. I read that the item method should be used but I don't understand how.
I found the solution so I'll share it with you :)
Here is a sample code to list all paragraphs of the active document of the word editor :
OleAutomation active = activeDocument.getAutomation();
if(active!=null){
int[] paragraphsId = getId(active, "Paragraphs");
if(paragraphsId.length > 0) {
Variant vParagraphs = active.getProperty(paragraphsId[0]);
if(vParagraphs != null){
OleAutomation paragraphs = vParagraphs.getAutomation();
if(paragraphs!=null){
int[] countId = getId(paragraphs, "Count");
if(countId.length > 0) {
Variant count = paragraphs.getProperty(countId[0]);
if(count!=null){
int numberOfParagraphs = count.getInt();
for(int i = 1 ; i <= numberOfParagraphs ; i++) {
Variant paragraph = paragraphs.invoke(0, new Variant[]{new Variant(i)});
if(paragraph!=null){
System.out.println("paragraph " + i + " added to list!");
listOfParagraphs.add(paragraph);
}
}
return listOfParagraphs;
}
}
}
}
}

How do I preserve line breaks when using jsoup to convert html to plain text?

I have the following code:
public class NewClass {
public String noTags(String str){
return Jsoup.parse(str).text();
}
public static void main(String args[]) {
String strings="<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
"<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> googlez</p></BODY> </HTML> ";
NewClass text = new NewClass();
System.out.println((text.noTags(strings)));
}
And I have the result:
hello world yo googlez
But I want to break the line:
hello world
yo googlez
I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.
If there's a <br> in the markup I parse, how can I get a line break in my resulting output?
The real solution that preserves linebreaks should be like this:
public static String br2nl(String html) {
if(html==null)
return html;
Document document = Jsoup.parse(html);
document.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
document.select("br").append("\\n");
document.select("p").prepend("\\n\\n");
String s = document.html().replaceAll("\\\\n", "\n");
return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}
It satisfies the following requirements:
if the original html contains newline(\n), it gets preserved
if the original html contains br or p tags, they gets translated to newline(\n).
With
Jsoup.parse("A\nB").text();
you have output
"A B"
and not
A
B
For this I'm using:
descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
We're using this method here:
public static String clean(String bodyHtml,
String baseUri,
Whitelist whitelist,
Document.OutputSettings outputSettings)
By passing it Whitelist.none() we make sure that all HTML is removed.
By passsing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.
On Jsoup v1.11.2, we can now use Element.wholeText().
String cleanString = Jsoup.parse(htmlString).wholeText();
user121196's answer still works. But wholeText() preserves the alignment of texts.
Try this by using jsoup:
public static String cleanPreserveLineBreaks(String bodyHtml) {
// get pretty printed html with preserved br and p tags
String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
// get plain text with preserved line breaks by disabled prettyPrint
return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}
For more complex HTML none of the above solutions worked quite right; I was able to successfully do the conversion while preserving line breaks with:
Document document = Jsoup.parse(myHtml);
String text = new HtmlToPlainText().getPlainText(document);
(version 1.10.3)
You can traverse a given element
public String convertNodeToText(Element element)
{
final StringBuilder buffer = new StringBuilder();
new NodeTraversor(new NodeVisitor() {
boolean isNewline = true;
#Override
public void head(Node node, int depth) {
if (node instanceof TextNode) {
TextNode textNode = (TextNode) node;
String text = textNode.text().replace('\u00A0', ' ').trim();
if(!text.isEmpty())
{
buffer.append(text);
isNewline = false;
}
} else if (node instanceof Element) {
Element element = (Element) node;
if (!isNewline)
{
if((element.isBlock() || element.tagName().equals("br")))
{
buffer.append("\n");
isNewline = true;
}
}
}
}
#Override
public void tail(Node node, int depth) {
}
}).traverse(element);
return buffer.toString();
}
And for your code
String result = convertNodeToText(JSoup.parse(html))
Based on the other answers and the comments on this question it seems that most people coming here are really looking for a general solution that will provide a nicely formatted plain text representation of an HTML document. I know I was.
Fortunately JSoup already provide a pretty comprehensive example of how to achieve this: HtmlToPlainText.java
The example FormattingVisitor can easily be tweaked to your preference and deals with most block elements and line wrapping.
To avoid link rot, here is Jonathan Hedley's solution in full:
package org.jsoup.examples;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;
import java.io.IOException;
/**
* HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
* plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
* scrape.
* <p>
* Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
* </p>
* <p>
* To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
* <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
* where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
*
* #author Jonathan Hedley, jonathan#hedley.net
*/
public class HtmlToPlainText {
private static final String userAgent = "Mozilla/5.0 (jsoup)";
private static final int timeout = 5 * 1000;
public static void main(String... args) throws IOException {
Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
final String url = args[0];
final String selector = args.length == 2 ? args[1] : null;
// fetch the specified URL and parse to a HTML DOM
Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();
HtmlToPlainText formatter = new HtmlToPlainText();
if (selector != null) {
Elements elements = doc.select(selector); // get each element that matches the CSS selector
for (Element element : elements) {
String plainText = formatter.getPlainText(element); // format that element to plain text
System.out.println(plainText);
}
} else { // format the whole doc
String plainText = formatter.getPlainText(doc);
System.out.println(plainText);
}
}
/**
* Format an Element to plain-text
* #param element the root element to format
* #return formatted text
*/
public String getPlainText(Element element) {
FormattingVisitor formatter = new FormattingVisitor();
NodeTraversor traversor = new NodeTraversor(formatter);
traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node
return formatter.toString();
}
// the formatting rules, implemented in a breadth-first DOM traverse
private class FormattingVisitor implements NodeVisitor {
private static final int maxWidth = 80;
private int width = 0;
private StringBuilder accum = new StringBuilder(); // holds the accumulated text
// hit when the node is first seen
public void head(Node node, int depth) {
String name = node.nodeName();
if (node instanceof TextNode)
append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
else if (name.equals("li"))
append("\n * ");
else if (name.equals("dt"))
append(" ");
else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
append("\n");
}
// hit when all of the node's children (if any) have been visited
public void tail(Node node, int depth) {
String name = node.nodeName();
if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
append("\n");
else if (name.equals("a"))
append(String.format(" <%s>", node.absUrl("href")));
}
// appends text to the string builder with a simple word wrap method
private void append(String text) {
if (text.startsWith("\n"))
width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
if (text.equals(" ") &&
(accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
return; // don't accumulate long runs of empty spaces
if (text.length() + width > maxWidth) { // won't fit, needs to wrap
String words[] = text.split("\\s+");
for (int i = 0; i < words.length; i++) {
String word = words[i];
boolean last = i == words.length - 1;
if (!last) // insert a space if not the last word
word = word + " ";
if (word.length() + width > maxWidth) { // wrap and reset counter
accum.append("\n").append(word);
width = word.length();
} else {
accum.append(word);
width += word.length();
}
}
} else { // fits as is, without need to wrap text
accum.append(text);
width += text.length();
}
}
#Override
public String toString() {
return accum.toString();
}
}
}
text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
works if the html itself doesn't contain "br2n"
So,
text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();
works more reliable and easier.
Try this:
public String noTags(String str){
Document d = Jsoup.parse(str);
TextNode tn = new TextNode(d.body().html(), "");
return tn.getWholeText();
}
Use textNodes() to get a list of the text nodes. Then concatenate them with \n as separator.
Here's some scala code I use for this, java port should be easy:
val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
.asScala.mkString("<br />\n")
Try this by using jsoup:
doc.outputSettings(new OutputSettings().prettyPrint(false));
//select all <br> tags and append \n after that
doc.select("br").after("\\n");
//select all <p> tags and prepend \n before that
doc.select("p").before("\\n");
//get the HTML from the document, and retaining original new lines
String str = doc.html().replaceAll("\\\\n", "\n");
This is my version of translating html to text (the modified version of user121196 answer, actually).
This doesn't just preserve line breaks, but also formatting text and removing excessive line breaks, HTML escape symbols, and you will get a much better result from your HTML (in my case I'm receiving it from mail).
It's originally written in Scala, but you can change it to Java easily
def html2text( rawHtml : String ) : String = {
val htmlDoc = Jsoup.parseBodyFragment( rawHtml, "/" )
htmlDoc.select("br").append("\\nl")
htmlDoc.select("div").prepend("\\nl").append("\\nl")
htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")
org.jsoup.parser.Parser.unescapeEntities(
Jsoup.clean(
htmlDoc.html(),
"",
Whitelist.none(),
new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
),false
).
replaceAll("\\\\nl", "\n").
replaceAll("\r","").
replaceAll("\n\\s+\n","\n").
replaceAll("\n\n+","\n\n").
trim()
}
/**
* Recursive method to replace html br with java \n. The recursive method ensures that the linebreaker can never end up pre-existing in the text being replaced.
* #param html
* #param linebreakerString
* #return the html as String with proper java newlines instead of br
*/
public static String replaceBrWithNewLine(String html, String linebreakerString){
String result = "";
if(html.contains(linebreakerString)){
result = replaceBrWithNewLine(html, linebreakerString+"1");
} else {
result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", linebreakerString)).text(); // replace and html line breaks with java linebreak.
result = result.replaceAll(linebreakerString, "\n");
}
return result;
}
Used by calling with the html in question, containing the br, along with whatever string you wish to use as the temporary newline placeholder.
For example:
replaceBrWithNewLine(element.html(), "br2n")
The recursion will ensure that the string you use as newline/linebreaker placeholder will never actually be in the source html, as it will keep adding a "1" untill the linkbreaker placeholder string is not found in the html. It wont have the formatting issue that the Jsoup.clean methods seem to encounter with special characters.
Based on user121196's and Green Beret's answer with the selects and <pre>s, the only solution which works for me is:
org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("<pre>\n</pre>");
elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
elementWithHtml.text();

Categories