I have XHTML content, and I have to create from this content a PDF file on the fly. I use iText pdf converter.
I tried the simple way, but I always get bad result after calling the XMLWorkerHelper parser.
XHTML:
<ul>
<li>First
<ol>
<li>Second</li>
<li>Second</li>
</ol>
</li>
<li>First</li>
</ul>
The expected value:
First
Second
Second
First
PDF result:
First Second Second
First
In the result there is no nested list. I need a solution for calling the parser, and not creating an iText Document instance.
Please take a look at the example NestedListHtml
In this example, I take your code snippet list.html:
<ul>
<li>First
<ol>
<li>Second</li>
<li>Second</li>
</ol>
</li>
<li>First</li>
</ul>
And I parse it into an ElementList:
// CSS
CSSResolver cssResolver =
XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
// HTML
HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
htmlContext.autoBookmark(false);
// Pipelines
ElementList elements = new ElementList();
ElementHandlerPipeline end = new ElementHandlerPipeline(elements, null);
HtmlPipeline html = new HtmlPipeline(htmlContext, end);
CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
// XML Worker
XMLWorker worker = new XMLWorker(css, true);
XMLParser p = new XMLParser(worker);
p.parse(new FileInputStream(HTML));
Now I can add this list to the Document:
for (Element e : elements) {
document.add(e);
}
Or I can list this list to a Paragraph:
Paragraph para = new Paragraph();
for (Element e : elements) {
para.add(e);
}
document.add(para);
You will get the desired result as shown in nested_list.pdf
You can not add nested lists to a PdfPCell or to a ColumnText. For instance: this will not work:
PdfPTable table = new PdfPTable(2);
table.addCell("Nested lists don't work in a cell");
PdfPCell cell = new PdfPCell();
for (Element e : elements) {
cell.addElement(e);
}
table.addCell(cell);
document.add(table);
This is due to a limitation in the ColumnText class that has been there for many years. We have evaluated the problem and the only way to fix this, would be to rewrite ColumnText entirely. This is not an item on our current technical road map.
Here's a workaround for nested ordered and un-ordered lists.
The rich Text editor I am using giving the class attribute "ql-indent-1/2/2/" for li tags, based on the attribute adding ul/ol starting and ending tags.
public String replaceIndentSubList(String htmlContent) {
org.jsoup.nodes.Document document = Jsoup.parseBodyFragment(htmlContent);
Elements element_UL = document.select("ul");
Elements element_OL = document.select("ol");
if (!element_UL.isEmpty()) {
htmlContent = replaceIndents(htmlContent, element_UL, "ul");
}
if (!element_OL.isEmpty()) {
htmlContent = replaceIndents(htmlContent, element_OL, "ol");
}
return htmlContent;
}
public String replaceIndents(String htmlContent, Elements element, String tagType) {
String attributeKey = "class";
String startingULTgas = "<" + tagType + ">";
String endingULTags = "</" + tagType + ">";
int lengthOfQLIndenet = new String("ql-indent-").length();
HashMap<String, String> startingLiTagMap = new HashMap<String, String>();
HashMap<String, String> lastLiTagMap = new HashMap<String, String>();
Pattern regex = Pattern.compile("ql-indent-\\d");
HashSet<String> hash_Set = new HashSet<String>();
Elements element_Tag = element.select("li");
for (org.jsoup.nodes.Element element2 : element_Tag) {
org.jsoup.nodes.Attributes att = element2.attributes();
if (att.hasKey(attributeKey)) {
String attributeValue = att.get(attributeKey);
Matcher matcher = regex.matcher(attributeValue);
if (matcher.find()) {
if (!startingLiTagMap.containsKey(attributeValue)) {
startingLiTagMap.put(attributeValue, element2.toString());
}
hash_Set.add(matcher.group(0));
if (!startingLiTagMap.get(attributeValue)
.equalsIgnoreCase(element2.toString())) {
lastLiTagMap.put(attributeValue, element2.toString());
}
}
}
}
System.out.println(htmlContent);
Iterator value = hash_Set.iterator();
while (value.hasNext()) {
String liAttributeKey = (String) value.next();
int noOfIndentes = Integer
.parseInt(liAttributeKey.substring(lengthOfQLIndenet));
if (noOfIndentes > 1)
for (int i = 1; i < noOfIndentes; i++) {
startingULTgas = startingULTgas + "<" + tagType + ">";
endingULTags = endingULTags + "</" + tagType + ">";
}
htmlContent = htmlContent.replace(startingLiTagMap.get(liAttributeKey),
startingULTgas + startingLiTagMap.get(liAttributeKey));
if (lastLiTagMap.get(liAttributeKey) != null) {
System.out.println("Inside last Li Map");
htmlContent = htmlContent.replace(lastLiTagMap.get(liAttributeKey),
lastLiTagMap.get(liAttributeKey) + endingULTags);
}
else {
htmlContent = htmlContent.replace(startingLiTagMap.get(liAttributeKey),
startingLiTagMap.get(liAttributeKey) + endingULTags);
}
startingULTgas = "<" + tagType + ">";
endingULTags = "</" + tagType + ">";
}
System.out.println(htmlContent);[enter image description here][1]
return htmlContent;
}
Related
I am trying to scrap links in pagination of GitHub repositories
I have scraped them separately but what Now I want is to optimize it using some loop. Any idea how can i do it? here is code
ComitUrl= "http://github.com/apple/turicreate/commits/master";
Document document2 = Jsoup.connect(ComitUrl ).get();
Element pagination = document2.select("div.pagination a").get(0);
String Url1 = pagination.attr("href");
System.out.println("pagination-link1 = " + Url1);
Document document3 = Jsoup.connect(Url1).get();
Element pagination2 = document3.select("div.pagination a").get(1);
String Url2 = pagination2.attr("href");
System.out.println("pagination-link2 = " + Url2);
Document document4 = Jsoup.connect(Url2).get();
Element check = document4.select("span.disabled").first();
if (check.text().equals("Older")) {
System.out.println("No pagination link more");
}
else { Element pagination3 = document4.select("div.pagination a").get(1);
String Url3 = pagination3.attr("href");
System.out.println("pagination-link3 = " + Url3);
}
Try something like given below:
public static void main(String[] args) throws IOException{
String url = "http://github.com/apple/turicreate/commits/master";
//get first link
String link = Jsoup.connect(url).get().select("div.pagination a").get(0).attr("href");
//an int just to count up links
int i = 1;
System.out.println("pagination-link_"+ i + "\t" + link);
//parse next page using link
//check if the div on next page has more than one link in it
while(Jsoup.connect(link).get().select("div.pagination a").size() >1){
link = Jsoup.connect(link).get().select("div.pagination a").get(1).attr("href");
System.out.println("pagination-link_"+ (++i) +"\t" + link);
}
}
I'm generating a .docx file with a .html file using docx4j.
The html file is first converted in xhtml with jTidy.
This file is the body of my document.
I'm doing the same thing with the header, I have a file for this too.
I can generate a header and add it to my document, but only as plain text, not as html.
Here is my code for the header :
//Header Part start
HeaderPart headerPart = new HeaderPart();
Relationship rel = wordMLPackage.getMainDocumentPart().addTargetPart(headerPart);
String hdrXml = "<w:hdr xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
+ "<w:p>"
+ "<w:pPr>"
//+ "<w:pStyle w:val=\"Header\"/>"
+ "<w:jc w:val=\"center\"/>"
+ "</w:pPr>"
+ "<w:r>"
+ "<w:t xml:space=\"preserve\">" + myFileContentInString + "</w:t>"
+ "</w:r>"
// + "<w:fldSimple w:instr=\" PAGE \\* MERGEFORMAT \">"
// + "<w:r>"
// + "<w:rPr>"
// + "<w:noProof/>"
// + "</w:rPr>"
// + "</w:r>"
// + "</w:fldSimple>"
+ "</w:p>"
+ "</w:hdr>";
Hdr hdr = (Hdr) XmlUtils.unmarshalString(hdrXml);
wordMLPackage.getDocumentModel().getSections().get(0).getHeaderFooterPolicy().getFirstHeader();
headerPart.setJaxbElement(hdr);
List<SectionWrapper> sections = wordMLPackage.getDocumentModel().getSections();
SectPr sectPr = sections.get(sections.size() - 1).getSectPr();
// There is always a section wrapper, but it might not contain a sectPr
if (sectPr == null) {
sectPr = objectFactory.createSectPr();
wordMLPackage.getMainDocumentPart().addObject(sectPr);
sections.get(sections.size() - 1).setSectPr(sectPr);
}
HeaderReference headerReference = objectFactory.createHeaderReference();
headerReference.setId(rel.getId());
headerReference.setType(HdrFtrRef.DEFAULT);
sectPr.getEGHdrFtrReferences().add(headerReference);
//Header Part End
The xhtml content is in "myFileContentInString".
I couldn't be able to find something about this, so if somebody has any idea ?
EDIT : After your answer, I updated my code this way (here is the full code):
String inputfilepath = "Offers/" + param.getKey1() + "_" + param.getKey2() + "/c.xhtml";
String inputfilepath2 = "Offers/" + param.getKey1() + "_" + param.getKey2() + "/cc.xhtml";
// Create an empty docx package
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
ObjectFactory objectFactory = Context.getWmlObjectFactory();
NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
ndp.unmarshalDefaultNumbering();
XHTMLImporterImpl xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
xHTMLImporter.setHyperlinkStyle("Hyperlink");
wordMLPackage.getMainDocumentPart().getContent().addAll(xHTMLImporter.convert(new File(inputfilepath), null));
//Header Part start
HeaderPart headerPart = new HeaderPart();
Relationship rel = wordMLPackage.getMainDocumentPart().addTargetPart(headerPart);
Hdr hdr = Context.getWmlObjectFactory().createHdr();
hdr.getContent().addAll(xHTMLImporter.convert(new File(inputfilepath2), null));
wordMLPackage.getDocumentModel().getSections().get(0).getHeaderFooterPolicy().getFirstHeader();
headerPart.setJaxbElement(hdr);
List<SectionWrapper> sections = wordMLPackage.getDocumentModel().getSections();
SectPr sectPr = sections.get(sections.size() - 1).getSectPr();
// There is always a section wrapper, but it might not contain a sectPr
if (sectPr == null) {
sectPr = objectFactory.createSectPr();
wordMLPackage.getMainDocumentPart().addObject(sectPr);
sections.get(sections.size() - 1).setSectPr(sectPr);
}
HeaderReference headerReference = objectFactory.createHeaderReference();
headerReference.setId(rel.getId());
headerReference.setType(HdrFtrRef.DEFAULT);
sectPr.getEGHdrFtrReferences().add(headerReference);
// Saving file
wordMLPackage.save(new java.io.File("Offers/" + param.getKey1() + "_" + param.getKey2() + "/html_output.docx"));
"inputfilepath2" contains the path fo my header xhtml file.
I tried to insert a simple Hello World in my header but it seems it is taking the "inputfilepath" for the body and the header.
Assuming you have a reference to the correct headerPart, something like:
Hdr hdr = Context.getWmlObjectFactory().createHdr();
hdr.getContent().addAll(
XHTMLImporter.convert( /* fill this in */ );
headerPart.setJaxbElement(hdr);
_ Hi , this is my web page :
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it , and then browse all elements inside the page and get their paths :
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();
for (Element element : elements) {
if (!element.ownText().isEmpty()) {
StringBuilder path = new StringBuilder(element.nodeName());
String value = element.ownText();
Elements p_el = element.parents();
for (Element el : p_el) {
path.insert(0, el.nodeName() + '/');
}
all.add(path + " = " + value + "\n");
System.out.println(path +" = "+ value);
}
}
return all;
my code give me this result :
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
in fact i want get result like this :
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
please could any one give me idea how to get reach this result :) . thanks in advance.
As asked here a idea.
Even if I'm quite sure that there better solutions to get the xpath for a given node. For example use xslt as in the answer to "Generate/get xpath from XML node java".
Here the possible solution based on your current attempt.
For each (parent) element check if there are more than one element with this name.
Pseudo code: if ( count (el.select('../' + el.nodeName() ) > 1)
If true count the preceding-sibling:: with same name and add 1.
count (el.select('preceding-sibling::' + el.nodeName() ) +1
This is my solution to this problem:
StringBuilder absPath=new StringBuilder();
Elements parents = htmlElement.parents();
for (int j = parents.size()-1; j >= 0; j--) {
Element element = parents.get(j);
absPath.append("/");
absPath.append(element.tagName());
absPath.append("[");
absPath.append(element.siblingIndex());
absPath.append("]");
}
This would be easier, if you traversed the document from the root to the leafs instead of the other way round. This way you can easily group the elements by tag-name and handle multiple occurences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();
public List<String> getAll() {
return Collections.unmodifiableList(all);
}
public void parse(Document doc) {
path.clear();
all.clear();
parse(doc.children());
}
private void parse(List<Element> elements) {
if (elements.isEmpty()) {
return;
}
Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));
for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
List<Element> list = entry.getValue();
String key = entry.getKey();
if (list.size() > 1) {
int index = 1;
// use paths with index
key += "[";
for (Element e : list) {
path.add(key + (index++) + "]");
handleElement(e);
path.remove(path.size() - 1);
}
} else {
// use paths without index
path.add(key);
handleElement(list.get(0));
path.remove(path.size() - 1);
}
}
}
private void handleElement(Element e) {
String value = e.ownText();
if (!value.isEmpty()) {
// add entry
all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
}
// process children of element
parse(e.children());
}
Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
fun Element.xpath(): String = buildString {
val parents = parents()
for (j in (parents.size - 1) downTo 0) {
val parent = parents[j]
append("/*[")
append(parent.siblingIndex() + 1)
append(']')
}
append("/*[")
append(siblingIndex() + 1)
append(']')
}
I'm trying to do a test program deal with Filenet Documents. I know how to get a document Instance with Object Id or path like ,
doc = Factory.Document.fetchInstance(os,ID,null);
doc = Factory.Document.fetchInstance(os,path,null);
but I like to add more finding options so I can fetch,
Document with name or a custom property . I am trying out this search as a approach to that:
String mySQLString = "SELECT * FROM DEMO WHERE DocumentTitle LIKE '"+prp+"'";
SearchSQL sqlObject = new SearchSQL();
sqlObject.setQueryString(mySQLString);
// System.out.println(mySQLString);
SearchScope searchScope = new SearchScope(os);
RepositoryRowSet rowSet = searchScope.fetchRows(sqlObject, null, null, new Boolean(true));
Iterator ppg = rowSet.iterator();
if (ppg.hasNext()) {
RepositoryRow rr = (RepositoryRow) ppg.next();
System.err.println(rr.getProperties());
Properties properties = rr.getProperties();
String ID = properties.getStringValue("ID");
System.out.println(ID);
doc = Factory.Document.fetchInstance(os,ID,null);
But ID is not a Document property , it's a System property. How can I get the document? How can I get the path or id with a Search and fetch this document ? Is there a fast way ?
After few little changes I have made it work. Following is the code
String mySQLString = "SELECT ID FROM Document WHERE DocumentTitle LIKE '"+prp+"'";
SearchSQL sqlObject = new SearchSQL();
sqlObject.setQueryString(mySQLString);
SearchScope searchScope = new SearchScope(os);
RepositoryRowSet rowSet = searchScope.fetchRows(sqlObject, null, null, new Boolean(true));
Iterator ppg = rowSet.iterator();
if (ppg.hasNext()) {
RepositoryRow rr = (RepositoryRow) ppg.next();
System.err.println(rr.getProperties());
Properties properties = rr.getProperties();
String ID = properties.getIdValue("ID").toString();
System.out.println(ID);
doc = Factory.Document.fetchInstance(os,ID,null);
}
now with few changes this can be use to get any document with property.
You can iterate through all the documents. Take a look at this:
public void iterateThruAllDocs(ObjectStore os, String foldername) {
Folder folder = Factory.Folder.fetchInstance(os, foldername, null);
Document doc = Factory.Document.getInstance(os, null, foldername);
DocumentSet docset = folder.get_ContainedDocuments();
Iterator<DocumentSet> docitr = docset.iterator();
while (docitr.hasNext()) {
doc = (Document) docitr.next();
String mydocid = doc.get_Id().toString();
Document mydoc = Factory.Document.fetchInstance(os, mydocid, null);
AccessPermissionList apl = mydoc.get_Permissions();
Iterator ite = apl.iterator();
while (ite.hasNext()) {
Properties documentProperties = doc.getProperties();
String DateCreated = documentProperties.getDateTimeValue("DateCreated").toString();
String DateLastModified = documentProperties.getDateTimeValue("DateLastModified").toString();
Permission permission = (Permission) ite.next();
String docTitle = doc.getProperties().getStringValue("DocumentTitle");
System.out.println("Document Title for document with DocumentID \"" + mydoc.get_Id() + "\" is \"" + docTitle + "\"");
//String someprop = documentProperties.getStringValue("someprop");
System.out.println("Document was Created on :: " + DateCreated);
System.out.println("Document was Last Modified on :: " + DateLastModified);
System.out.println("Grantee Name :: " + permission.get_GranteeName());
//if (permission.get_GranteeName().equals(groupname/username)) {
//permission.set_GranteeName("groupname/username");
// System.out.println("Security REMOVED for document....");
//mydoc.save(RefreshMode.REFRESH);
// Go Nuts on Documents Here......
break;
}
}
}
I am using docx4j to deal with word document formatting. I have one word document which is divided in number of tables. I want to read all the tables and if I find some keywords then I want to take those contents to another word document with all the formatting. My word document is as follow.
Like from above I want to take content which is below Some Title. Here my keyword is Sample Text. So whenever Sample Text gets repeated, content needs to be fetched to new word document.
I am using following code.
MainDocumentPart mainDocumentPart = null;
WordprocessingMLPackage docxFile = WordprocessingMLPackage.load(new File(fileName));
mainDocumentPart = docxFile.getMainDocumentPart();
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
ClassFinder finder = new ClassFinder(Tbl.class);
new TraversalUtil(mainDocumentPart.getContent(), finder);
Tbl tbl = null;
int noTbls = 0;
int noRows = 0;
int noCells = 0;
int noParas = 0;
int noTexts = 0;
for (Object table : finder.results) {
noTbls++;
tbl = (Tbl) table;
// Get all the Rows in the table
List<Object> allRows = DocxUtility.getDocxUtility()
.getAllElementFromObject(tbl, Tr.class);
for (Object row : allRows) {
Tr tr = (Tr) row;
noRows++;
// Get all the Cells in the Row
List<Object> allCells = DocxUtility.getDocxUtility()
.getAllElementFromObject(tr, Tc.class);
toCell:
for (Object cell : allCells) {
Tc tc = (Tc) cell;
noCells++;
// Get all the Paragraph's in the Cell
List<Object> allParas = DocxUtility.getDocxUtility()
.getAllElementFromObject(tc, P.class);
for (Object para : allParas) {
P p = (P) para;
noParas++;
// Get all the Run's in the Paragraph
List<Object> allRuns = DocxUtility.getDocxUtility()
.getAllElementFromObject(p, R.class);
for (Object run : allRuns) {
R r = (R) run;
// Get the Text in the Run
List<Object> allText = DocxUtility.getDocxUtility()
.getAllElementFromObject(r, Text.class);
for (Object text : allText) {
noTexts++;
Text txt = (Text) text;
}
System.out.println("No of Text in Para No: " + noParas + "are: " + noTexts);
}
}
System.out.println("No of Paras in Cell No: " + noCells + "are: " + noParas);
}
System.out.println("No of Cells in Row No: " + noRows + "are: " + noCells);
}
System.out.println("No of Rows in Table No: " + noTbls + "are: " + noRows);
}
System.out.println("Total no of Tables: " + noTbls );
Assuming your text is in a single run (ie not split across runs), then you can search for it via XPath. Or you can manually traverse using TraversalUtil. See docx4j's Getting Started for more info.
So finding your stuff is pretty easy. Copying the formatting it uses, and any rels in it, is in the general case, complicated. See my post http://www.docx4java.org/blog/2010/11/merging-word-documents/ for more on the issues involved.