How do you convert an RTF string to plain text in Java? The obvious answer is to use Swing's RTFEditorKit, and that seems to be the common answer around the Internet. However, the write method that claims to produce plain text isn't actually implemented: in Java 6 it is hard-coded to just throw an IOException.
I use Swing's RTFEditorKit in Java 6 like this:
RTFEditorKit rtfParser = new RTFEditorKit();
Document document = rtfParser.createDefaultDocument();
rtfParser.read(new ByteArrayInputStream(rtfBytes), document, 0);
String text = document.getText(0, document.getLength());
and that's working.
Try Apache Tika: http://tika.apache.org/0.9/formats.html#Rich_Text_Format
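With recent Tika versions the whole job is one call on the Tika facade. A minimal sketch, assuming the tika-app jar is on the classpath and a local document.rtf (the file name is just a placeholder):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.Tika;

public class RtfToTextTika {

    public static void main(String[] args) throws Exception {
        // Tika auto-detects the RTF format and returns the plain text.
        try (InputStream in = Files.newInputStream(Paths.get("document.rtf"))) {
            String text = new Tika().parseToString(in);
            System.out.println(text);
        }
    }
}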
You might consider RTF Parser Kit as a lightweight alternative to the Swing RTFEditorKit. The line below shows plain text extraction from an RTF file: the RTF is read from the input stream, and the extracted text is written to the output stream.
new StreamTextConverter().convert(new RtfStreamSource(inputStream), outputStream, "UTF-8");
(full disclosure: I'm the author of RTF Parser Kit)
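Wrapped into a complete program it might look like the sketch below; the import paths (com.rtfparserkit.converter.text.StreamTextConverter, com.rtfparserkit.parser.RtfStreamSource) and file names are assumptions on my part, so check them against the version of the library you actually pull in:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import com.rtfparserkit.converter.text.StreamTextConverter;
import com.rtfparserkit.parser.RtfStreamSource;

public class RtfParserKitDemo {

    public static void main(String[] args) throws Exception {
        // Reads RTF from document.rtf and writes the extracted text as UTF-8.
        try (FileInputStream in = new FileInputStream("document.rtf");
             FileOutputStream out = new FileOutputStream("document.txt")) {
            new StreamTextConverter().convert(new RtfStreamSource(in), out, "UTF-8");
        }
    }
}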
Here is the full code to parse an RTF file and write it out as plain text:
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

public class RtfToText {

    public static void main(String[] args) throws IOException, BadLocationException {
        RTFEditorKit rtf = new RTFEditorKit();
        Document doc = rtf.createDefaultDocument();

        // Parse the RTF file into the Swing document model.
        try (InputStreamReader reader = new InputStreamReader(
                new FileInputStream("C:\\SampleINCData.rtf"), "UTF-8")) {
            rtf.read(reader, doc, 0);
        }

        // Extract the plain text and write it to a file.
        String text = doc.getText(0, doc.getLength());
        try (FileWriter fw = new FileWriter("B:\\Sample INC Data.txt")) {
            fw.write(text);
        } catch (Exception e) {
            System.out.println(e);
        }
        System.out.println("Success...");
    }
}
I'm having problems with Apache POI and file MIME types.
I use a file template (Microsoft Word DOCX) to modify some values with Apache POI.
The original file has the MIME type "application/vnd.openxmlformats-officedocument.wordprocessingml.document" (in Linux: file -i {filename}), but when I process the file with POI and save it again, I get "application/octet-stream", and I want to keep the file's original MIME type.
I opened both files, original and modified, in a hex editor; both have the same magic number (50 4B 03 04), but the file sizes differ, even though the text is the same.
So is it possible to fix this? Has anyone had the same problem? I checked in LibreOffice, and it appears to show the same behavior as Apache POI.
Any help, any information will be appreciated.
As you already stated in a comment, the way Apache POI repacks the Office Open XML ZIP package leads some tools to misinterpret the content type. An Office Open XML file (*.docx, *.xlsx, *.pptx) is a ZIP archive, but something about how Microsoft Office packs that archive must be special. I have not found out what exactly it is, though.
Example:
Start with a Document.docx containing some simple content, saved by Microsoft Word.
For this, file -i produces:
axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i Document.docx
Document.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary
Now run that code:
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordReadAndReWrite {

    public static void main(String[] args) throws Exception {
        String inFilePath = "Document.docx";
        String outFilePath = "NewDocument.docx";

        XWPFDocument doc = new XWPFDocument(new FileInputStream(inFilePath));
        doc.createParagraph().createRun().setText("new text inserted");

        FileOutputStream out = new FileOutputStream(outFilePath);
        doc.write(out);
        out.close();
        doc.close();
    }
}
For the resulting NewDocument.docx, file -i produces:
axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i NewDocument.docx
NewDocument.docx: application/octet-stream; charset=binary
But if we do the same thing without Apache POI's ZipPackage, instead using a java.nio.file.FileSystem to get the XML out of the Office Open XML ZIP package, with the following code:
import java.nio.file.Files;
import java.nio.file.FileSystems;
import java.nio.file.FileSystem;
import java.nio.file.Paths;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.DOMSource;

public class WordReadAndReWriteFileSystem {

    public static void main(String[] args) throws Exception {
        String inFilePath = "Document.docx";
        String outFilePath = "NewDocument.docx";

        // Open the docx as a ZIP file system and parse /word/document.xml.
        FileSystem fileSystem = FileSystems.newFileSystem(Paths.get(inFilePath), null);
        Path wordDocumentXml = fileSystem.getPath("/word/document.xml");
        DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document xmlDocument = documentBuilder.parse(Files.newInputStream(wordDocumentXml, StandardOpenOption.READ));

        // Build <w:p><w:r><w:t>new text inserted</w:t></w:r></w:p>.
        Node p = xmlDocument.createElement("w:p");
        Node r = xmlDocument.createElement("w:r");
        p.appendChild(r);
        Node t = xmlDocument.createElement("w:t");
        r.appendChild(t);
        Node text = xmlDocument.createTextNode("new text inserted");
        t.appendChild(text);

        // Insert the new paragraph before the final section properties.
        Node body = xmlDocument.getElementsByTagName("w:body").item(0);
        Node sectPr = xmlDocument.getElementsByTagName("w:sectPr").item(0);
        body.insertBefore(p, sectPr);

        // Write the changed XML to a temporary file.
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource domSource = new DOMSource(xmlDocument);
        Path tmpDoc = Files.createTempFile("wordDocument", "tmp");
        tmpDoc.toFile().deleteOnExit();
        StreamResult streamResult = new StreamResult(Files.newOutputStream(tmpDoc, StandardOpenOption.WRITE));
        transformer.transform(domSource, streamResult);
        fileSystem.close();

        // Copy the original package, overwrite its /word/document.xml in place,
        // and copy the result to the output path; the ZIP layout stays untouched.
        Path tmpZip = Files.createTempFile("zipDocument", "tmp");
        tmpZip.toFile().deleteOnExit();
        Path path = Files.copy(Paths.get(inFilePath), tmpZip, StandardCopyOption.REPLACE_EXISTING);
        fileSystem = FileSystems.newFileSystem(path, null);
        wordDocumentXml = fileSystem.getPath("/word/document.xml");
        Files.copy(tmpDoc, wordDocumentXml, StandardCopyOption.REPLACE_EXISTING);
        fileSystem.close();
        Files.copy(tmpZip, Paths.get(outFilePath), StandardCopyOption.REPLACE_EXISTING);
    }
}
Then for the resulting NewDocument.docx, file -i produces:
axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i NewDocument.docx
NewDocument.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary
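If you want to see what Apache POI actually changed, one way is to compare the entry order of both ZIP packages. A quick sketch using only the JDK (pass the .docx path as the first argument); tools like file use heuristics that are sensitive to the package layout, so a differing entry order is a first hint:

import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ListZipEntries {

    public static void main(String[] args) throws Exception {
        // Print every entry name in archive order.
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream(args[0]))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                System.out.println(entry.getName());
            }
        }
    }
}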
This code shows the correct MIME type for every file I tested:
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;

public class Main {

    public static void main(String[] args) {
        String fileName = "model_libreoffice.docx";
        // String fileName = "model_poi.docx";
        // String fileName = "model_msoffice.docx";
        // String fileName = "model_repacked_bz2.docx";
        try {
            InputStream is = Main.class.getResourceAsStream("/" + fileName);
            Tika t = new Tika();
            // Tika inspects the actual bytes, not just the file extension.
            String mime = t.detect(is, fileName);
            System.out.println("----> " + mime);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
After long debugging and testing, I think the problem is with third-party validation of the files.
This simple code shows me the correct MIME type for all the files I tried, whether modified by Microsoft Office, LibreOffice, or Apache POI, or unzipped and zipped again (renamed to DOCX) from the content files of a DOCX...
So I think this problem can be marked as "solved" after all.
I already saw other questions about the same problem but I still get an error. Here is the small piece of code where I try to modify existing XML files. But it modifies some characters in the text.
import org.jdom2.Document;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import java.io.FileOutputStream;
import java.io.IOException;

public class ModifyXml {

    public static void main(String[] args) throws JDOMException, IOException {
        try {
            SAXBuilder sax = new SAXBuilder();
            Document doc = sax.build("F:\\c\\test.xml");

            XMLOutputter xmlOutput = new XMLOutputter();
            Format format = Format.getPrettyFormat();
            format.setEncoding("UTF-8");
            xmlOutput.setFormat(format);
            xmlOutput.output(doc, new FileOutputStream("F:\\c\\test2.xml"));
        } catch (IOException io) {
            io.printStackTrace();
        } catch (JDOMException e) {
            e.printStackTrace();
        }
    }
}
Here is a small XML file that I try to modify (in this case, just copy):
<?xml version="1.0" encoding="utf-8"?><page>
䕶法喇嘛所居此處𡸁仲無妻室亦降神附體
</page>
After running the program I get the following:
<?xml version="1.0" encoding="UTF-8"?>
<page>䕶法喇嘛所居此處𡸁仲無妻室亦降神附體</page>
Some Chinese characters are not transformed correctly.
Dang, I never noticed this bug in JDOM 2.
You will get the same result with any non-BMP character; try one of the emoji from recent years and you will see the same thing.
It happens because of the escape strategy that is automatically set for UTF-* encodings. What it does is rather wrong.
It will be fixed if you replace the strategy with one that doesn't escape anything besides the XML reserved characters:
format.setEscapeStrategy((c) -> false);
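Applied to the code from the question, a minimal sketch (paths kept from the question; the lambda needs JDOM 2.0.5+ on Java 8):

import java.io.FileOutputStream;
import org.jdom2.Document;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

public class ModifyXmlFixed {

    public static void main(String[] args) throws Exception {
        Document doc = new SAXBuilder().build("F:\\c\\test.xml");

        Format format = Format.getPrettyFormat();
        format.setEncoding("UTF-8");
        // Escape nothing beyond the XML reserved characters (those are
        // always escaped anyway); surrogate pairs now pass through intact.
        format.setEscapeStrategy(c -> false);

        XMLOutputter xmlOutput = new XMLOutputter(format);
        try (FileOutputStream out = new FileOutputStream("F:\\c\\test2.xml")) {
            xmlOutput.output(doc, out);
        }
    }
}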
I have an RTF file. It has lots of tables in it. I have been trying to use Java (POI and Tika) to extract the tables. This is easy enough in a .doc, where the tables are defined as such. However, in an RTF file there doesn't seem to be any 'this is a table' tag as part of the metadata. Does anyone know the best strategy for extracting a table from such a file? Would converting it to another file format help? Any clues for me to look up?
There is a Linux tool called unrtf; take a look at its manual.
With it you can transform your RTF file into HTML:
unrtf --html your_input_file.rtf > your_output_file.html
Now you can use any programming API for manipulating HTML/XML to extract the tables easily, for example as sketched below. Is that enough for what you need?
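A sketch of the table-extraction step with jsoup (any HTML parser will do; eachText() needs jsoup 1.11+, and the file name is the one from the unrtf command above):

import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlTableExtractor {

    public static void main(String[] args) throws Exception {
        // Parse the HTML that unrtf produced.
        Document doc = Jsoup.parse(new File("your_output_file.html"), "UTF-8");
        for (Element table : doc.select("table")) {
            for (Element row : table.select("tr")) {
                // One list entry per cell in this row.
                System.out.println(row.select("td, th").eachText());
            }
            System.out.println("--- end of table ---");
        }
    }
}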
Thanks hexin for your answer. In the end I was able to use Tika with the TXTParser, putting all the segments between bold tags (which is how my tables are separated) into an ArrayList. From there I had to use the tab separators to delimit the tables.
Here is the code, without the part that extracts the tables based on tabs (still working on that):
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TextParser {

    public static void main(final String[] args) throws IOException, TikaException {
        // -1 removes BodyContentHandler's default 100k character limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        FileInputStream inputstream = new FileInputStream(new File("/Users/mydoc.rtf"));
        ParseContext pcontext = new ParseContext();

        // Parse the RTF as plain text, leaving the raw RTF control words intact.
        TXTParser txtParser = new TXTParser();
        try {
            txtParser.parse(inputstream, handler, metadata, pcontext);
        } catch (SAXException e) {
            e.printStackTrace();
        }

        // Collect every segment between the bold markers (\b\f1\fs24) into a list.
        String s = handler.toString();
        Pattern pattern = Pattern.compile(
                "(\\\\b\\\\f1\\\\fs24.+?\\\\par .+?)\\\\b\\\\f1\\\\fs24.*?\\{\\\\",
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(s);
        ArrayList<String> arr = new ArrayList<String>();
        while (matcher.find()) {
            arr.add(matcher.group(1));
        }
        for (String name : arr) {
            System.out.println("The array number is: " + arr.indexOf(name) + " \n\n " + name);
        }
    }
}
I have an XML file with a node called "CONTENIDO"; in this node I have a PDF file encoded as a base64 string.
I'm trying to read this node, decode the base64 string, and save the PDF file to my computer.
The problem is that the saved file has the same size (in KB) as the original PDF and the same number of pages, but... all the pages are blank, without any content, and when I open the file a popup appears with an error saying "unknown distinctive 806.6n". I don't know what that means.
I've tried to find a solution on the Internet, with different ways of decoding the string, but I always get the same result... The XML is OK; I've checked the base64 string and it is OK.
I've also debugged the code and seen that the content of the variable "fichero", into which I read the base64 string, is also OK, so I don't know what the problem can be.
This is my code:
package prueba.sap.com;

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import sun.misc.BASE64Decoder;
import javax.xml.bind.DatatypeConverter;

public class anexoPO {

    public static void main(String[] args) throws Exception {
        FileInputStream inFile = new FileInputStream("C:/prueba/prueba_attach_b64.xml");
        FileOutputStream outFile = new FileOutputStream("C:/prueba/salida.pdf");
        anexoPO myMapping = new anexoPO();
        myMapping.execute(inFile, outFile);
        System.out.println("Success");
        System.out.println(inFile);
    }

    public void execute(InputStream in, OutputStream out)
            throws com.sap.aii.mapping.api.StreamTransformationException {
        try {
            //************************Code To Generate The XML Parsing Objects*****************************//
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(in);
            Document docout = db.newDocument();

            NodeList CONTENIDO = doc.getElementsByTagName("CONTENIDO");
            String fichero = CONTENIDO.item(0).getChildNodes().item(0).getNodeValue();

            //************** decode *************/
            //import sun.misc.BASE64Decoder;
            //BASE64Decoder decoder = new BASE64Decoder();
            //byte[] decoded = decoder.decodeBuffer(fichero);

            //import org.apache.commons.codec.binary.*;
            //byte[] decoded = Base64.decode(fichero);

            //import javax.xml.bind.DatatypeConverter;
            byte[] decoded = DatatypeConverter.parseBase64Binary(fichero);
            //************** decode *************/

            String str = new String(decoded);
            out.write(str.getBytes());
        } catch (Exception e) {
            System.out.print("Problem parsing the file");
            e.printStackTrace();
        }
    }
}
Thanks in advance.
Definitely:
out.write(decoded);
out.close();
Strings cannot represent all bytes, and PDF is binary.
Also remove the import of sun.misc.BASE64Decoder, as that package does not exist in every JRE. It might be ignored by the compiler, but I would not bet on it.
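On Java 8 and later, java.util.Base64 does the job without any proprietary classes. A minimal sketch of the decode-and-write step (the class, method, and path names here are mine, for illustration):

import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class DecodePdf {

    public static void writePdf(String base64Content, String outPath) throws Exception {
        // The MIME decoder tolerates the line breaks that XML tooling
        // often inserts into long base64 payloads.
        byte[] decoded = Base64.getMimeDecoder().decode(base64Content);
        try (OutputStream out = Files.newOutputStream(Paths.get(outPath))) {
            out.write(decoded); // raw bytes, never routed through a String
        }
    }
}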
The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.
When using the following code:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.
The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
System.out.println(paragraph);
}
I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?
If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?
This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:
/*
 * This class is used to read .doc and .docx files
 *
 * @author Developer
 */
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

class TextExtractor {
    private OutputStream outputstream;
    private ParseContext context;
    private Detector detector;
    private Parser parser;
    private Metadata metadata;
    private String extractedText;

    public TextExtractor() {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    public void process(String filename) throws Exception {
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            url = new URL(filename);
        }
        InputStream input = TikaInputStream.get(url, metadata);
        ContentHandler handler = new BodyContentHandler(outputstream);
        parser.parse(input, handler, metadata, context);
        input.close();
    }

    public void getString() {
        // Get the text into a String object
        extractedText = outputstream.toString();
        // Do whatever you want with this String object.
        System.out.println(extractedText);
    }

    public static void main(String args[]) throws Exception {
        if (args.length == 1) {
            TextExtractor textExtractor = new TextExtractor();
            textExtractor.process(args[0]);
            textExtractor.getString();
        } else {
            throw new Exception();
        }
    }
}
To compile:
javac -cp ".:tika-app-1.2.jar" TextExtractor.java
To run:
java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc
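(On Windows the classpath separator is a semicolon, so use -cp ".;tika-app-1.2.jar" instead.)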
There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).
The first option is to use WordExtractor, but wrap it in a call to stripFields(String) when calling it. That will remove the text-based fields included in the text, things like HYPERLINK that you've seen. Your code would become:
NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());
for (String rawText : extractor.getParagraphText()) {
    String text = extractor.stripFields(rawText);
    System.out.println(text);
}
The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:
TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
tika.getParser().parse(stream, handler, metadata, new ParseContext());
String text = handler.toString();
Try this; it works for me and is purely a POI solution. Make sure the document you are reading is a .docx (Word 2007 or later) and use XWPFDocument, like I do; for the older binary .doc format you will have to look at the HWPFDocument counterpart instead.
InputStream inputstream = new FileInputStream(m_filepath);
// read the file and place it in the XWPF (docx) document model
XWPFDocument adoc = new XWPFDocument(inputstream);
// gets the full text
String aString = new XWPFWordExtractor(adoc).getText();
Now if you want only certain parts, you can use getParagraphText(); don't use the text extractor, use it directly on the paragraphs like this:
for (XWPFParagraph p : adoc.getParagraphs()) {
    System.out.println(p.getParagraphText());
}