How to extract RTF tables - Java

I have an RTF file that contains lots of tables, and I have been trying to use Java (POI and Tika) to extract them. This is easy enough in a .doc file, where the tables are defined as such, but in an RTF file there doesn't seem to be any 'this is a table' tag in the metadata. Does anyone know the best strategy for extracting a table from such a file? Would converting it to another file format help? Any clues for me to look up?

There is a Linux tool called unrtf; have a look at its manual.
With it you can transform your RTF file into HTML:
unrtf --html your_input_file.rtf > your_output_file.html
Now you can use any HTML/XML library to extract the tables easily. Is that enough for your needs?
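If it helps, here is a small sketch of the HTML step in Java using jsoup (my choice; any HTML parser will do). The file name matches the command above:

import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExtractTables {
    public static void main(String[] args) throws Exception {
        // parse the HTML produced by unrtf
        Document doc = Jsoup.parse(new File("your_output_file.html"), "UTF-8");
        for (Element table : doc.select("table")) {
            for (Element row : table.select("tr")) {
                StringBuilder line = new StringBuilder();
                for (Element cell : row.select("td, th")) {
                    line.append(cell.text()).append('\t');
                }
                System.out.println(line.toString().trim());
            }
            System.out.println("----"); // separator between tables
        }
    }
}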

Thanks hexin for your answer. In the end I was able to use Tika's TXTParser and then put all the segments between bold tags (which is how my tables are separated) into an ArrayList. From there I had to use the tab separators to delimit the tables.
Here is the code without the bit to extract the tables based on tabs (still working on it):
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TextParser {

    public static void main(final String[] args) throws IOException, TikaException {
        // -1 removes the default character limit on the content handler
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        FileInputStream inputstream = new FileInputStream(new File("/Users/mydoc.rtf"));
        ParseContext pcontext = new ParseContext();

        // Plain text parser: reads the raw RTF source, control words included
        TXTParser txtParser = new TXTParser();
        try {
            txtParser.parse(inputstream, handler, metadata, pcontext);
        } catch (SAXException e) {
            e.printStackTrace();
        }

        String s = handler.toString();
        // Capture each segment between one bold run (\b\f1\fs24) and the next
        Pattern pattern = Pattern.compile(
                "(\\\\b\\\\f1\\\\fs24.+?\\\\par .+?)\\\\b\\\\f1\\\\fs24.*?\\{\\\\",
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(s);
        ArrayList<String> arr = new ArrayList<String>();
        while (matcher.find()) {
            arr.add(matcher.group(1));
        }
        for (String name : arr) {
            System.out.println("The array number is: " + arr.indexOf(name) + " \n\n " + name);
        }
    }
}
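For the missing tab-splitting step, the rough idea is something like this (a hypothetical helper, not final code): split each captured segment into rows on the RTF \par control word and into cells on \tab (or literal tab characters), keeping only rows that actually contain cell separators. It needs java.util.Arrays and java.util.List in addition to the imports above.

private static List<List<String>> toTable(String segment) {
    List<List<String>> table = new ArrayList<List<String>>();
    // rows are separated by \par in the raw RTF source
    for (String row : segment.split("\\\\par ")) {
        // cells are separated by \tab (or a literal tab, depending on the writer)
        String[] cells = row.split("\\\\tab |\t");
        if (cells.length > 1) {
            table.add(Arrays.asList(cells));
        }
    }
    return table;
}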

Related

Avro Schema for GenericRecord: Be able to leave blank fields

I'm using Java to convert JSON to Avro and store the records in GCS using Google Dataflow.
The Avro schema is created at runtime using SchemaBuilder.
One of the fields I define in the schema is an optional LONG field; it is defined like this:
SchemaBuilder.FieldAssembler<Schema> fields = SchemaBuilder.record(mainName).fields();
Schema concreteType = SchemaBuilder.nullable().longType();
fields.name("key1").type(concreteType).noDefault();
Now when I create a GenericRecord using the schema above without setting "key1", and then pass the resulting GenericRecord to the context of my DoFn with context.output(res); I get the following error:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: org.apache.avro.UnresolvedUnionException: Not in union ["long","null"]: 256
I also tried doing the same thing with withDefault(0L) and got the same result.
What am I missing?
Thanks
It works fine for me when I try it as below. You can also print the schema, which will help you compare, and you can try removing the nullable() for the long type.
fields.name("key1").type().nullable().longType().longDefault(0);
Provided the complete code that I used to test:
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaBuilder.FieldAssembler;
import org.apache.avro.SchemaBuilder.RecordBuilder;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

import java.io.File;
import java.io.IOException;

public class GenericRecordExample {

    public static void main(String[] args) {
        // build the schema, with a nullable long field that has a default
        FieldAssembler<Schema> fields;
        RecordBuilder<Schema> record = SchemaBuilder.record("Customer");
        fields = record.namespace("com.example").fields();
        fields = fields.name("first_name").type().nullable().stringType().noDefault();
        fields = fields.name("last_name").type().nullable().stringType().noDefault();
        fields = fields.name("account_number").type().nullable().longType().longDefault(0);
        Schema schema = fields.endRecord();
        System.out.println(schema.toString());

        // we build our first customer
        GenericRecordBuilder customerBuilder = new GenericRecordBuilder(schema);
        customerBuilder.set("first_name", "John");
        customerBuilder.set("last_name", "Doe");
        customerBuilder.set("account_number", 999333444111L);
        Record myCustomer = customerBuilder.build();
        System.out.println(myCustomer);

        // writing to a file
        final DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
            dataFileWriter.create(myCustomer.getSchema(), new File("customer-generic.avro"));
            dataFileWriter.append(myCustomer);
            System.out.println("Written customer-generic.avro");
        } catch (IOException e) {
            System.out.println("Couldn't write file");
            e.printStackTrace();
        }

        // reading from a file
        final File file = new File("customer-generic.avro");
        final DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        GenericRecord customerRead;
        try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, datumReader)) {
            customerRead = dataFileReader.next();
            System.out.println("Successfully read avro file");
            System.out.println(customerRead.toString());

            // get the data from the generic record
            System.out.println("First name: " + customerRead.get("first_name"));

            // read a non-existent field
            System.out.println("Non existent field: " + customerRead.get("not_here"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
If I understand your question correctly, you're trying to accept JSON strings and save them in a Cloud Storage bucket, using Avro as your coder for the data as it moves through Dataflow. There's nothing immediately obvious from your code that looks wrong to me. I have done this, including saving the data to Cloud Storage and to BigQuery.
You might consider a simpler, and probably less error-prone, approach: define a Java class for your data and use Avro annotations on it so the coder works properly. Here's an example:
import org.apache.avro.reflect.Nullable;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;

@DefaultCoder(AvroCoder.class)
public class Data {
    public long nonNullableValue;
    @Nullable public Long nullableValue;
}
Then, use this type in your DoFn implementations as you likely already are. Beam should be able to move the data between workers properly using Avro, even when the fields marked @Nullable are null.
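For example, a minimal DoFn sketch (the values here are made up) that outputs the annotated type and lets AvroCoder handle the null field:

import org.apache.beam.sdk.transforms.DoFn;

static class JsonToData extends DoFn<String, Data> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        Data d = new Data();
        d.nonNullableValue = 42L;
        // nullableValue is left null; @Nullable tells AvroCoder the union allows it
        c.output(d);
    }
}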

How do you add attachments from a generic file type using Aspose.Slides in Java?

How do you add attachments from a generic file type using Aspose.Slides in Java?
The manual PowerPoint operation I'm trying to do programmatically is:
Insert -> Object -> From file
Is it possible with Aspose.Slides to insert an Excel file as a link, using Java?
The code below works for attaching an Excel file using Aspose.Slides:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.aspose.slides.IOleEmbeddedDataInfo;
import com.aspose.slides.IOleObjectFrame;
import com.aspose.slides.OleEmbeddedDataInfo;
import com.aspose.slides.Presentation;
import com.aspose.slides.SaveFormat;

public class SetFileTypeForAnEmbeddingObject2 {

    public static void main(String[] args) throws IOException {
        Presentation pres = new Presentation();
        try {
            // Read the file to embed as an OLE object
            byte[] fileBytes = Files.readAllBytes(Paths.get("C:\\work\\Demo uploadt.xlsm"));

            // Create the OLE embedded data info
            IOleEmbeddedDataInfo dataInfo = new OleEmbeddedDataInfo(fileBytes, "xls");

            // Add the OLE object frame to the first slide
            IOleObjectFrame oleFrame = pres.getSlides().get_Item(0).getShapes()
                    .addOleObjectFrame(150, 420, 250, 50, dataInfo);
            oleFrame.setObjectIcon(true);

            pres.save("C:\\work\\" + "SetFileTypeForAnEmbeddingObject7.pptx", SaveFormat.Pptx);
        } finally {
            if (pres != null)
                pres.dispose();
        }
    }
}

Is it possible to fetch specific data from an Excel table (from a column or a row) using Apache Tika in Java?

There is a simple way of extracting the data, which I think only fetches it as text (using the toString() method), but I want to fetch data by a specified column or row name.
Below is sample code which simply prints the content of the MS Excel file along with its metadata (ignore this part).
It uses tika-app-1.13.jar (if you want to run this code, add this library).
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class MSExcelParse {

    public static void main(final String[] args)
            throws IOException, TikaException, SAXException {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        FileInputStream inputstream = new FileInputStream(
                new File("C:\\Users\\Username\\IdeaProjects\\Tika\\src\\example.xlsx"));
        ParseContext pcontext = new ParseContext();

        // OOXML parser for .xlsx files
        OOXMLParser msofficeparser = new OOXMLParser();
        msofficeparser.parse(inputstream, handler, metadata, pcontext);

        System.out.println("Contents of the document:" + handler.toString());
        System.out.println("Metadata of the document:");
        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
example.xlsx contains the data (kindly go through the link to see the data)
What I want to ask is: if I want to extract/fetch only the data from, say, the "age" column using Apache Tika in Java, is there any way of doing it?
I believe this will answer all your questions: http://poi.apache.org/spreadsheet/quick-guide.html#ReadWriteWorkbook
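Tika flattens a spreadsheet to plain text, so for cell-level access POI itself is the better fit. A rough sketch (the file and column names are taken from the question; it needs poi-ooxml on the classpath):

import java.io.File;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ReadAgeColumn {
    public static void main(String[] args) throws Exception {
        DataFormatter formatter = new DataFormatter();
        try (Workbook workbook = WorkbookFactory.create(new File("example.xlsx"))) {
            Sheet sheet = workbook.getSheetAt(0);
            // find the index of the "age" column in the header row
            int ageCol = -1;
            for (Cell cell : sheet.getRow(0)) {
                if ("age".equalsIgnoreCase(formatter.formatCellValue(cell))) {
                    ageCol = cell.getColumnIndex();
                }
            }
            if (ageCol < 0) {
                System.out.println("No 'age' column found");
                return;
            }
            // print the value of that column for every data row
            for (int r = 1; r <= sheet.getLastRowNum(); r++) {
                Row row = sheet.getRow(r);
                if (row == null) continue;
                System.out.println(formatter.formatCellValue(row.getCell(ageCol)));
            }
        }
    }
}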

Error decoding base64 string

I have an XML file which has a node called "CONTENIDO"; in this node I have a PDF file encoded as a base64 string.
I'm trying to read this node, decode the string in base64 and download the PDF file to my computer.
The problem is that the downloaded file has the same size (in KB) as the original PDF and the same number of pages, but... all the pages are blank, without any content, and when I open the downloaded file a popup appears with an error saying "unknown distinctive 806.6n". I don't know what that means.
I've tried to find a solution on the internet, with different ways to decode the string, but I always get the same result... The XML is OK, and I've checked that the base64 string is OK too.
I've also debugged the code and seen that the content of the variable "fichero", where I read the base64 string, is also OK, so I don't know what the problem can be.
This is my code:
package prueba.sap.com;

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import sun.misc.BASE64Decoder;
import javax.xml.bind.DatatypeConverter;

public class anexoPO {

    public static void main(String[] args) throws Exception {
        FileInputStream inFile = new FileInputStream("C:/prueba/prueba_attach_b64.xml");
        FileOutputStream outFile = new FileOutputStream("C:/prueba/salida.pdf");
        anexoPO myMapping = new anexoPO();
        myMapping.execute(inFile, outFile);
        System.out.println("Success");
        System.out.println(inFile);
    }

    public void execute(InputStream in, OutputStream out)
            throws com.sap.aii.mapping.api.StreamTransformationException {
        try {
            //************** Code to create the XML parsing objects **************//
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(in);
            Document docout = db.newDocument();

            NodeList CONTENIDO = doc.getElementsByTagName("CONTENIDO");
            String fichero = CONTENIDO.item(0).getChildNodes().item(0).getNodeValue();

            //************** decode *************/
            //import sun.misc.BASE64Decoder;
            //BASE64Decoder decoder = new BASE64Decoder();
            //byte[] decoded = decoder.decodeBuffer(fichero);

            //import org.apache.commons.codec.binary.*;
            //byte[] decoded = Base64.decode(fichero);

            //import javax.xml.bind.DatatypeConverter;
            byte[] decoded = DatatypeConverter.parseBase64Binary(fichero);
            //************** decode *************/

            String str = new String(decoded);
            out.write(str.getBytes());
        } catch (Exception e) {
            System.out.print("Problem parsing the file");
            e.printStackTrace();
        }
    }
}
Thanks in advance.
Definitely:
out.write(decoded);
out.close();
Strings cannot represent all bytes, and PDF is binary.
Also remove the import of sun.misc.BASE64Decoder, as this package does not exist everywhere. It might be removed by the compiler, however I would not bet on it.
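If you are on Java 8 or later, java.util.Base64 is a portable replacement for the sun.misc class; a sketch reusing the fichero and out variables from your code:

import java.util.Base64;

// the MIME decoder also tolerates line breaks inside the encoded node value
byte[] decoded = Base64.getMimeDecoder().decode(fichero);
out.write(decoded);
out.close();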

Java: Apache POI: Can I get clean text from MS Word (.doc) files?

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.
When using the following code:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.
The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for (String paragraph : wordExtractor.getParagraphText()) {
    System.out.println(paragraph);
}
I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?
If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?
This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:
/*
 * This class is used to read .doc and .docx files
 *
 * @author Developer
 */
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.URL;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

class TextExtractor {
    private OutputStream outputstream;
    private ParseContext context;
    private Detector detector;
    private Parser parser;
    private Metadata metadata;
    private String extractedText;

    public TextExtractor() {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    public void process(String filename) throws Exception {
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            url = new URL(filename);
        }
        InputStream input = TikaInputStream.get(url, metadata);
        ContentHandler handler = new BodyContentHandler(outputstream);
        parser.parse(input, handler, metadata, context);
        input.close();
    }

    public void getString() {
        // Get the text into a String object
        extractedText = outputstream.toString();
        // Do whatever you want with this String object.
        System.out.println(extractedText);
    }

    public static void main(String args[]) throws Exception {
        if (args.length == 1) {
            TextExtractor textExtractor = new TextExtractor();
            textExtractor.process(args[0]);
            textExtractor.getString();
        } else {
            throw new Exception();
        }
    }
}
To compile:
javac -cp ".:tika-app-1.2.jar" TextExtractor.java
To run:
java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc
There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).
The first option is to use WordExtractor, but wrap it in a call to stripFields(String) when calling it. That will remove the text-based fields included in the text, things like HYPERLINK that you've seen. Your code would become:
NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());
for (String rawText : extractor.getParagraphText()) {
    String text = extractor.stripFields(rawText);
    System.out.println(text);
}
The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:
TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
tika.getParser().parse(stream, handler, metadata, new ParseContext());
String text = handler.toString();
Try this; it works for me and is purely a POI solution. You will have to use the HWPFDocument counterpart if your document is in the older .doc (Word 97-2003) format; for .docx files use XWPFDocument like I do.
// read the file
InputStream inputstream = new FileInputStream(m_filepath);
// and place it in XWPF format
XWPFDocument adoc = new XWPFDocument(inputstream);
// gets the full text
aString = new XWPFWordExtractor(adoc).getText();
Now if you only want certain parts, you can use getParagraphText(), but don't go through the text extractor; use it directly on the paragraphs, like this:
for (XWPFParagraph p : adoc.getParagraphs()) {
    System.out.println(p.getParagraphText());
}
