Extract the first page content from docx file by XML parsing

I need to extract the first page's content from a docx file and save it as a separate document. Everything on the first page (images, tables, text) should be saved as-is in the new docx file.
What I tried:
I looked into the XML of the unzipped docx file. Since a Word document is reflowable, I couldn't find a page break marking where each page ends, so I couldn't determine the end of each page from document.xml.
Is there any way to get the XML content of only the first page of the document using the Java XML DOM parser?

Do not write a new parser; plenty of existing tools already handle this (and what if your input changes from XML to binary Word files?).
Use Apache POI, for example, as @JFB suggested.
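If you still want to inspect document.xml directly, the following is a minimal sketch using only the JDK (java.util.zip plus the DOM parser), not a full solution. It assumes the file was last saved by Word, which usually caches its pagination as w:lastRenderedPageBreak elements; these hints can be missing or stale, and explicit breaks appear as w:br with w:type="page". The class and method names (FirstPageProbe, indexOfFirstPageBreak) are made up for illustration.

```java
import java.io.File;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class FirstPageProbe {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /**
     * Returns the index (counting element children of w:body from 0) of the
     * first body-level element that contains a page break marker, or -1 if
     * no marker is found. Elements before that index lie entirely on page one.
     */
    static int indexOfFirstPageBreak(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the *NS lookups below
        Document doc = factory.newDocumentBuilder().parse(documentXml);
        Node body = doc.getElementsByTagNameNS(W_NS, "body").item(0);
        if (body == null) {
            return -1;
        }
        int index = 0;
        for (Node child = body.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            if (containsPageBreak((Element) child)) {
                return index;
            }
            index++;
        }
        return -1;
    }

    /** True if a descendant is a cached (rendered) or explicit page break. */
    static boolean containsPageBreak(Element element) {
        if (element.getElementsByTagNameNS(W_NS, "lastRenderedPageBreak").getLength() > 0) {
            return true;
        }
        NodeList breaks = element.getElementsByTagNameNS(W_NS, "br");
        for (int i = 0; i < breaks.getLength(); i++) {
            Element br = (Element) breaks.item(i);
            if ("page".equals(br.getAttributeNS(W_NS, "type"))) {
                return true;
            }
        }
        return false;
    }

    /** Convenience overload: reads word/document.xml straight from the docx (a zip). */
    static int indexOfFirstPageBreak(File docx) throws Exception {
        try (ZipFile zip = new ZipFile(docx)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("not a docx: word/document.xml missing");
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return indexOfFirstPageBreak(in);
            }
        }
    }
}
```

This only locates the boundary. Copying the first page into a new docx also means carrying over styles, numbering, images and relationships, which is exactly the work a library such as Apache POI takes off your hands.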

Related

How to write data to pdf file which contains html tags using itext lib in Java

I have a String containing some HTML tags, which comes from a database. I want to write it to a PDF file with the styling expressed by those HTML tags. I tried to use XMLWorkerHelper like this:
String html = "What is the equation of the line passing through the point (2,-3) and making an angle of -45<sup>2</sup> with the positive X-axis?";
XMLWorkerHelper.getInstance().parseXHtml(writer, document, new StringReader(html));
But it only reads the data inside the HTML tag (in this case only 2) and simply ignores the rest of the string. I want the entire String rendered with its HTML formatting.
With HTMLWorker it works perfectly, but that class is deprecated, so please let me know how to achieve this.
I am using the iText 5 library.

Restoring the deleted tag elements from a HTML file using jsoup

I am reading an HTML file from a folder, deleting some unwanted HTML tags from it, and saving the modified HTML file.
I have done all of this using the jsoup parsing library. The problem is: if in the future I want to take some tags off the list of unwanted tags, how should I do that? Once I have deleted the unwanted tags, the modified HTML no longer contains that content.

Keep the original file under its own filename:
full_featured_template.html
Then parse it with jsoup and save the result as
template_version_1.html
Then, in the future, parse the original again and save it as
template_version_2.html
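The workflow above can be sketched with the JDK alone. This is an illustration under assumptions, not part of jsoup: TemplateVersions and writeVersion are hypothetical names, and the filter parameter stands in for whatever jsoup-based clean-up you run with the current list of unwanted tags.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.UnaryOperator;

final class TemplateVersions {
    /**
     * Reads the untouched original, applies the given filter to its text and
     * writes the result to a new target file. The original is only read, and
     * CREATE_NEW makes the write fail instead of overwriting an older version.
     */
    static Path writeVersion(Path original, Path target, UnaryOperator<String> filter)
            throws IOException {
        String html = new String(Files.readAllBytes(original), StandardCharsets.UTF_8);
        String cleaned = filter.apply(html);
        return Files.write(target, cleaned.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    }
}
```

Because every version is derived from full_featured_template.html, changing the list of unwanted tags later just means running the filter again with the new list and a new target filename.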

Merging HTML, RTF to Docx using Docx4J

I'm new to Docx4j and I need some advice.
Currently I'm creating a simple (X)HTML document with Java. It contains some information from a database. After the HTML is created, Docx4j produces a Word docx file from it using a very simple Word template. This works fine.
Now I have to enhance this HTML. One database value contains a byte array that holds an RTF file.
Currently I'm putting this data into the HTML as a string.
String content = new String(allbytes,"UTF-8");
html+=content;
In the end, the HTML file looks like this:
<html>
....
<td>
{\rtf1\ansi\deflang1033\ftnbj\uc1\deff1.....
</td>
...
</html>
Docx4J now creates a Word docx that shows this RTF as a raw string rather than as imported RTF content.
That is to be expected, of course, but I would like it to appear as imported RTF.
How can I achieve this?
Is there a simple way to do this?

Converting rtf to docx content is outside the scope of docx4j.
You'll need to look for a third party solution which does rtf to docx, or failing that, rtf to (x)html (see Convert Rtf to HTML)
You could try http://sourceforge.net/projects/rtf2xml/ and then transform the XML to WordML.
Another possibility may be LibreOffice via JODConverter.
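For the RTF to (X)HTML route, one option that needs no extra dependency is the Swing text package that ships with the JDK. The sketch below assumes the RTF has already been decoded into a String; RtfToHtmlSketch and its methods are illustrative names. Swing's RTF support is limited (for example, it does not handle tables or images well), and the HTML it writes is plain HTML that would still need tidying into well-formed XHTML before docx4j can import it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.rtf.RTFEditorKit;

final class RtfToHtmlSketch {
    /** Parses RTF markup into a Swing styled document. */
    static Document parseRtf(String rtf) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(new StringReader(rtf), doc, 0);
        return doc;
    }

    /** Serialises the parsed RTF as HTML using the JDK's HTML editor kit. */
    static String toHtml(String rtf) throws IOException, BadLocationException {
        Document doc = parseRtf(rtf);
        StringWriter out = new StringWriter();
        new HTMLEditorKit().write(out, doc, 0, doc.getLength());
        return out.toString();
    }

    /** Plain text only, useful as a fallback when formatting does not matter. */
    static String toPlainText(String rtf) throws IOException, BadLocationException {
        Document doc = parseRtf(rtf);
        return doc.getText(0, doc.getLength());
    }
}
```

For anything beyond simple formatted text, an external converter such as the ones mentioned above is likely to give better results.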

How to parse JSP Pages into a XML file?

I am trying to convert a JSP page into an XML file. I have been using jsoup, and it reads the whole content well except for the server tags, but I can't understand how the whole HTML can be converted to XML tags. I mean, how can I fetch the data line by line?
My Code:
File html = new File("genXML.jsp");
Document doc = Jsoup.parse(html, "UTF-8", "http://www.example.com");
System.out.println(doc.html());
Any assistance would be great

First of all, converting JSP to XML is not the same as converting HTML to XML. I suppose you want to translate the HTML generated from a JSP into XML. Second, you don't want to do this line by line: an HTML block usually does not begin and end on a single line.
Anyway, you could use a tool like TagSoup to convert the HTML to XHTML, and XHTML is actually XML. I don't know whether TagSoup has a useful API, but at the very least it can be called from your code as an external process, using something like this:
Process tr = Runtime.getRuntime().exec(new String[]{ "..." } );
Then, if you want to transform the result to a target XML schema, you could apply an XSLT transformation. You can do that programmatically using JAXP.
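As a rough sketch of the JAXP step, the class below applies a stylesheet to an XML string using only javax.xml.transform from the JDK. XsltSketch and the TITLE_ONLY_XSLT stylesheet are made up for illustration. The stylesheet assumes un-namespaced input; real XHTML lives in the http://www.w3.org/1999/xhtml namespace, so its match and select expressions would need a namespace prefix.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltSketch {
    /** Hypothetical stylesheet: keeps only the page title in a custom <page> element. */
    static final String TITLE_ONLY_XSLT =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
            + "<xsl:template match=\"/\">"
            + "<page><title><xsl:value-of select=\"/html/head/title\"/></title></page>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    /** Applies the given XSLT to the given XML and returns the result as a String. */
    static String transform(String xml, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

Calling XsltSketch.transform(xhtml, XsltSketch.TITLE_ONLY_XSLT) on the converted page would then yield the target XML. The same method works with StreamSource built from files instead of strings.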
Hope I helped!

docx4j convert docx in wrong html format

I have some problems with the docx4j samples. I need to convert a file from docx to HTML and back. I'm trying to run the ConvertInXHTMLDocument.java sample. It creates the HTML file fine, but when I try to convert it back into docx, it throws an exception saying that close tags are missing (META, img, etc.). Has anyone encountered this problem?

XHTMLImporter requires its input to be well-formed XML. So you need to ensure you don't have missing close tags (META, img etc); if you do, run JTidy or similar first.
docx4j's (X)HTML output can either be HTML or XML. From 3.0, the property Convert.Out.HTML.OutputMethodXML will control which.
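Before handing markup to XHTMLImporter, you can check whether it is well-formed XML by running it through the JDK's DOM parser. The following is a small sketch with hypothetical names (WellFormedCheck, isWellFormed); it only reports the problem and does not repair anything, so a tool like JTidy is still needed for that. Markup with a DOCTYPE or named entities such as &nbsp; may need extra parser configuration that is not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    /** Returns true if the markup parses as XML, false on any well-formedness error. */
    static boolean isWellFormed(String markup)
            throws ParserConfigurationException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Typical cause: an unclosed <meta> or <img> tag in HTML output.
            return false;
        }
    }
}
```

An unclosed <img src="a.png"> makes the check return false, while the self-closed form <img src="a.png"/> passes.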

