Extract raw text from rtf file - java

Once upon a time I was using Apache Tika to extract text from RTF files. I was actually using the TXTParser class, because then I could use the raw output from the RTF (with all the formatting codes still in the text) to do various text extraction wizardry based on the formatting.
Then one day it just started to output blank strings, and I have no idea why.
public class TextParser {

    //@SuppressWarnings({ "rawtypes", "unchecked" })
    public TextParser() {
    }

    public static void main(final String[] args) throws IOException, TikaException {
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        FileInputStream inputstream = new FileInputStream(new File("/Users/sebastianzeki/Documents/PhysJava/dance.rtf"));
        ParseContext pcontext = new ParseContext();
        // Plain text parser
        TXTParser txtParser = new TXTParser();
        try {
            txtParser.parse(inputstream, handler, metadata, pcontext);
        } catch (SAXException e) {
            e.printStackTrace();
        }
        String s = handler.toString();
        System.out.println(s);
    }
}
I know there's nothing wrong with the file, because if I use another class (i.e. RTFParser) I get the whole file back.
HtmlParser would be an alternative, but it only gives me half the file.
Can anyone suggest an alternative way to get the RTF as required, or a fix for this weird problem?
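For reference, a minimal sketch of the RTFParser route mentioned above (assuming Tika 1.x on the classpath; the file path is the one from the question). Note that it returns the extracted plain text rather than the raw RTF with its control words, so it may not fully replace the TXTParser trick:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.rtf.RTFParser;
import org.apache.tika.sax.BodyContentHandler;

public class RtfParserSketch {
    public static void main(String[] args) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 removes the default write limit
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        try (InputStream in = new FileInputStream(new File("/Users/sebastianzeki/Documents/PhysJava/dance.rtf"))) {
            new RTFParser().parse(in, handler, metadata, context);
        }
        System.out.println(handler.toString()); // plain text extracted from the RTF
    }
}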

Delete existing file and create/write data into same file using TransformerFactory within a loop

I want to transform the input.XML file with an XSL stylesheet and overwrite the output.XML file in a loop, so that I end up with the final output.XML file.
I am able to do this; however, instead of overwriting the output.XML file, it appends to the same file, accumulating the data from every iteration of the loop.
To resolve this, I tried to delete the existing output.XML file inside the loop, just before transforming the input.XML file with the XSL, but I get this error:
java.nio.file.FileSystemException: Output.xml: The process cannot access the file because it is being used by another process.
So I'm NOT able to delete the existing file, and I'm also not able to overwrite the output.xml file.
Can anyone help with this, please?
I believe resolving either of these issues should help out.
Thanks
public static void main(String[] args) throws SAXException, IOException, ParserConfigurationException {
    // TODO Auto-generated method stub
    try {
        Double currentValue = 1.0;
        String inputXMLPath = "C:/MySystem/Input.xml";
        String outputXMLPath = "C:/MySystem/Output.xml";
        StreamSource inputStream = new StreamSource(inputXMLPath);
        FileOutputStream opStream = new FileOutputStream(new File(outputXMLPath));
        while (currentValue != 7.0) {
            String xslPath = "C:/MySystem/input.xsl";
            Path path = FileSystems.getDefault().getPath(outputXMLPath);
            Files.delete(path);
            performTransformation(xslPath, inputStream, opStream, outputXMLPath);
            StreamSource secondStream = new StreamSource(outputXMLPath);
            inputStream = secondStream;
            currentValue++;
            opStream.flush();
        }
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

public static void performTransformation(String xslPath, StreamSource inputStream, FileOutputStream opStream,
        String outputXMLPath) throws Exception {
    TransformerFactory tFactory = TransformerFactory.newInstance();
    Transformer transformer = null;
    transformer = tFactory.newTransformer(new StreamSource(xslPath));
    transformer.transform(inputStream, new StreamResult(opStream));
    opStream.flush();
}
Pretty sure you can't edit or delete this file because the output stream writing to it is still open; that's why it is reported as still in use. If you close the stream, delete the XML file, and then create a new one, it should work with no problem.
This should fit your use case, assuming the file is always given the same name.
Also, Files.deleteIfExists(Path) is useful: it will still delete the file, but it will not throw an exception if the file does not exist.
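A minimal sketch of that idea, using the paths from the question; the point is simply to close the stream before deleting:
import java.io.FileOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CloseThenDelete {
    public static void main(String[] args) throws Exception {
        Path output = Paths.get("C:/MySystem/Output.xml");
        FileOutputStream opStream = new FileOutputStream(output.toFile());
        // ... write the transformation result to opStream ...
        opStream.close();               // release the handle before touching the file
        Files.deleteIfExists(output);   // no exception if the file is already gone
        // a fresh FileOutputStream can now be opened for the next iteration
    }
}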
You can't delete the file because you haven't closed it. But instead of deleting it, why not re-create the file inside the loop rather than outside it:
public static void main(String[] args) throws SAXException, IOException, ParserConfigurationException {
    // TODO Auto-generated method stub *** DELETE THIS LINE ***
    try {
        Double currentValue = 1.0;
        String inputXMLPath = "C:/MySystem/Input.xml";
        String outputXMLPath = "C:/MySystem/Output.xml";
        StreamSource inputStream = new StreamSource(inputXMLPath);
        // *** DANGER! COMPARING DOUBLES IS PRONE TO ERRORS! ***
        while (currentValue != 7.0) {
            // *** USE TRY-WITH-RESOURCES TO ENSURE THE STREAM GETS CLOSED ***
            try (FileOutputStream opStream = new FileOutputStream(outputXMLPath)) {
                String xslPath = "C:/MySystem/input.xsl";
                performTransformation(xslPath, inputStream, opStream, outputXMLPath);
                StreamSource secondStream = new StreamSource(outputXMLPath);
                inputStream = secondStream;
                currentValue++;
            }
        }
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
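As a side note on the flagged double comparison, a sketch of the same loop driven by an int counter instead (assuming xslPath, inputStream and outputXMLPath are in scope as in the answer above):
for (int iteration = 1; iteration < 7; iteration++) {
    try (FileOutputStream opStream = new FileOutputStream(outputXMLPath)) {
        performTransformation(xslPath, inputStream, opStream, outputXMLPath);
        inputStream = new StreamSource(outputXMLPath);
    }
}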

Crawl online directories and parse online pdf document to extract text in java

I need to be able to crawl an online directory such as, for example, http://svn.apache.org/repos/asf/, and whenever a PDF, DOCX, TXT, or ODT file comes up during the crawl, I need to be able to parse it and extract its text.
I am using Files.walk to crawl locally on my laptop, and the Apache Tika library to parse the text, and it works just fine, but I don't really know how I can do the same with an online directory.
Here's the code that goes through my PC and parses the files just so you guys have an idea of what I'm doing:
public static void GetFiles() throws IOException {
    // PathXml is the path directory, such as "/home/user/", that
    // is taken from an XML file.
    Files.walk(Paths.get(PathXml)).forEach(filePath -> { // Crawling process (using Java 8)
        if (Files.isRegularFile(filePath)) {
            if (filePath.toString().endsWith(".pdf") || filePath.toString().endsWith(".docx") ||
                    filePath.toString().endsWith(".txt")) {
                try {
                    TikaReader.ParsedText(filePath.toString());
                } catch (IOException e) {
                    e.printStackTrace();
                } catch (SAXException e) {
                    e.printStackTrace();
                } catch (TikaException e) {
                    e.printStackTrace();
                }
                System.out.println(filePath);
            }
        }
    });
}
and here's the TikaReader method:
public static String ParsedText(String file) throws IOException, SAXException, TikaException {
    InputStream stream = new FileInputStream(file);
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try {
        parser.parse(stream, handler, metadata);
        System.out.println(handler.toString());
        return handler.toString();
    } finally {
        stream.close();
    }
}
So again, how can I do the same thing with the given online directory above?
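One way to approach the remote case, sketched here rather than taken from an accepted answer, is to open each file's URL as a stream and hand it to Tika exactly as with the local files; crawling the directory itself then means fetching the listing page and following its links (for example with an HTML parser such as jsoup). A minimal sketch of the parsing half, with an example URL:
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class RemoteTikaSketch {
    // Fetches a single remote document over HTTP and returns its extracted text.
    public static String parseRemote(String fileUrl) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream stream = new URL(fileUrl).openStream()) {
            new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }
}
A crawler would then call parseRemote only for links ending in the wanted extensions and recurse into sub-directory links.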

Unable to move a file using java while using apache tika

I am passing a file as an input stream to the parser.parse() method while using the Apache Tika library to convert the file to text. The method throws an exception (displayed below), but the input stream is closed in the finally block successfully. Then, while renaming the file, the File.renameTo method from java.io returns false. I am not able to rename/move the file despite having successfully closed the input stream. I am afraid another handle to the file is created while the parser.parse() method processes the file, and it doesn't get closed by the time the exception is thrown. Is that possible? If so, what should I do to rename the file?
The exception thrown while checking the content type is:
java.lang.NoClassDefFoundError: Could not initialize class com.adobe.xmp.impl.XMPMetaParser
at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
Please suggest any solution. Thanks in advance.
public static void main(String args[])
{
    InputStream is = null;
    StringWriter writer = new StringWriter();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    File file = null;
    File destination = null;
    try
    {
        file = new File("E:\\New folder\\testFile.pdf");
        boolean a = file.exists();
        destination = new File("E:\\New folder\\test\\testOutput.pdf");
        is = new FileInputStream(file);
        parser.parse(is, new WriteOutContentHandler(writer), metadata, new ParseContext()); // EXCEPTION IS THROWN HERE
        String contentType = metadata.get(Metadata.CONTENT_TYPE);
        System.out.println(contentType);
    }
    catch (Exception e1)
    {
        e1.printStackTrace();
    }
    catch (Throwable t)
    {
        t.printStackTrace();
    }
    finally
    {
        try
        {
            if (is != null)
            {
                is.close(); // CLOSES THE INPUT STREAM
            }
            writer.close();
        }
        catch (Exception e2)
        {
            e2.printStackTrace();
        }
    }
    boolean x = file.renameTo(destination); // RETURNS FALSE
    System.out.println(x);
}
This might be because other processes are still using the file, such as an anti-virus program, and it may also be the case that another part of your application is holding a lock on it.
Please check for that and deal with it; it may solve your problem.
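Beyond that, a hedged suggestion: opening the stream with try-with-resources makes it harder to leave a handle open, and java.nio.file.Files.move reports failures as an exception instead of File.renameTo's bare false, which at least tells you why the move failed. A sketch using the paths from the question:
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class MoveAfterParse {
    public static void main(String[] args) throws Exception {
        Path source = Paths.get("E:\\New folder\\testFile.pdf");
        Path destination = Paths.get("E:\\New folder\\test\\testOutput.pdf");
        try (InputStream is = new FileInputStream(source.toFile())) {
            // ... parser.parse(is, ...) as in the question ...
        }
        // Throws an informative exception (e.g. FileSystemException) instead of returning false
        Files.move(source, destination, StandardCopyOption.REPLACE_EXISTING);
    }
}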

WordnetSynonymParser in Lucene

I am new to Lucene and I'm trying to use WordnetSynonymParser to expand queries using the WordNet synonyms Prolog file. Here is what I have so far:
public class CustomAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new ClassicTokenizer(Version.LUCENE_47, reader);
        TokenStream filter = new StandardFilter(Version.LUCENE_47, source);
        filter = new LowerCaseFilter(Version.LUCENE_47, filter);
        SynonymMap mySynonymMap = null;
        try {
            mySynonymMap = buildSynonym();
        } catch (IOException e) {
            e.printStackTrace();
        }
        filter = new SynonymFilter(filter, mySynonymMap, false);
        return new TokenStreamComponents(source, filter);
    }

    private SynonymMap buildSynonym() throws IOException {
        File file = new File("wn/wn_s.pl");
        InputStream stream = new FileInputStream(file);
        Reader rulesReader = new InputStreamReader(stream);
        SynonymMap.Builder parser = null;
        parser = new WordnetSynonymParser(true, true, new StandardAnalyzer(Version.LUCENE_47));
        ((WordnetSynonymParser) parser).add(rulesReader);
        SynonymMap synonymMap = parser.build();
        return synonymMap;
    }
}
I get the error "The method add(CharsRef, CharsRef, boolean) in the type SynonymMap.Builder is not applicable for the arguments (Reader)"
However, the documentation of WordnetSynonymParser expects a Reader argument for the add function.
What am I doing wrong here?
Any help is appreciated.
If you are seeing documentation stating that WordNetSynonymParser has a method add(Reader), you are probably looking at documentation for an older version. The method certainly isn't there in the source code for 4.7. As of version 4.6.0, the method you are looking for is WordnetSynonymParser.parse(Reader).
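A minimal sketch of buildSynonym rewritten around parse(Reader), assuming Lucene 4.7 and the usual java.io / java.nio.charset imports; note that parse can throw java.text.ParseException, so the method has to declare it:
private SynonymMap buildSynonym() throws IOException, java.text.ParseException {
    try (Reader rulesReader = new InputStreamReader(new FileInputStream("wn/wn_s.pl"), StandardCharsets.UTF_8)) {
        WordnetSynonymParser parser = new WordnetSynonymParser(true, true, new StandardAnalyzer(Version.LUCENE_47));
        parser.parse(rulesReader);   // reads the WordNet prolog rules
        return parser.build();       // produces the SynonymMap for SynonymFilter
    }
}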

Tika - retrieve main content from docs

The GUI utility of Apache Tika provides an option for getting the main content (as opposed to the formatted text and structured text) of a given document or URL. I just want to know which method is responsible for extracting the main content of the document/URL, so that I can incorporate that method in my program. I'd also like to know whether they are using any heuristic algorithm while extracting data from HTML pages, because sometimes the advertisements don't appear in the extracted content.
UPDATE: I found out that BoilerpipeContentHandler is responsible for it.
The "main content" feature in the Tika GUI is implemented using the BoilerpipeContentHandler class that relies on the boilerpipe library for the heavy lifting.
I believe this is powered by the BodyContentHandler, which fetches just the HTML contents of the document body. This can additionally be combined with other handlers to return just the plain text of the body, if required.
public String[] tika_autoParser() {
    String[] result = new String[3];
    try {
        InputStream input = new FileInputStream(new File(path)); // path is defined elsewhere in the class
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        parser.parse(input, textHandler, metadata, context);
        result[0] = "Title: " + metadata.get(Metadata.TITLE);
        result[1] = "Body: " + textHandler.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    }
    return result;
}
