I am using Apache Tika 1.5 to parse the contents of a zip file. Here's my sample code:
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
InputStream stream = null;
try {
    stream = TikaInputStream.get(new File(zipFilePath));
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
try {
    parser.parse(stream, handler, metadata, context);
    logger.info("Content:\t" + handler.toString());
} catch (IOException e) {
    e.printStackTrace();
} catch (SAXException e) {
    e.printStackTrace();
} catch (TikaException e) {
    e.printStackTrace();
} finally {
    try {
        if (stream != null) {
            stream.close();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
In the logger statement, all I see is org.xml.sax.helpers.DefaultHandler@5bd8e367. I am missing something but am unable to figure out what. Looking for some help.
First up, you need to make sure you have all the right jars. You can call Apache Tika with only the tika-core jar on your classpath, but it won't be able to do much in the way of parsing. For parsing, you need tika-core plus tika-parsers plus all of their dependencies. The simplest way to handle that is to use Maven; it will sort out the dependencies for you.
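For reference, with Maven the dependencies look something like this (shown for the 1.5 release mentioned in the question; adjust the version as needed):

```xml
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.5</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.5</version>
</dependency>
```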
Otherwise, there's one problematic line in your code:
ContentHandler handler = new DefaultHandler();
If you want the plain text of the file, I'd suggest using:
ContentHandler handler = new BodyContentHandler();
If you want the XHTML version, then you'll instead want something like:
ContentHandler handler = new ToXMLContentHandler();
Finally, if you want control over how the embedded documents in the zip file get extracted and handled, take a look at the examples on the Tika Wiki for recursion.
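To see why DefaultHandler prints nothing, here is a small self-contained sketch using only the JDK's SAX classes (no Tika required). DefaultHandler overrides nothing, so every SAX event is silently discarded; a handler that collects character data, which is roughly what Tika's BodyContentHandler does for you, produces actual text:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerDemo {
    // A minimal handler that accumulates text content, roughly what
    // BodyContentHandler does. A plain DefaultHandler overrides nothing,
    // so its toString() is just Object's class-name@hash output.
    static class CollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CollectingHandler handler = new CollectingHandler();
        parser.parse(new InputSource(new StringReader("<doc>hello world</doc>")), handler);
        System.out.println(handler.toString()); // prints: hello world
    }
}
```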
Related
I'm trying to upload a zip file to the URL https://anypoint.mulesoft.com/designcenter/api-designer/projects/{projectId}/branches/master/import. The Content-Type must be application/zip; I can't change it to multipart/form-data. In Mule 3, a Java transformer class (com.test.FileReader) was used, with FileReader.class stored in lib, and it worked.
I tried to use the Read File component to read test.zip and set it as the payload, but it's not working. Any suggestion on how to upload a zip file in Mule 4?
package com.test;

import org.mule.transformer.*;
import org.mule.api.*;
import org.mule.api.transformer.*;
import java.io.*;

public class PayloadFileReader extends AbstractMessageTransformer {

    public Object transformMessage(final MuleMessage message, final String outputEncoding) throws TransformerException {
        byte[] result = null;
        try {
            result = this.readZipFile("test.zip");
        } catch (Exception e) {
            e.printStackTrace();
        }
        message.setPayload((Object) result);
        return message;
    }

    public String readFileTest(final String path) throws FileNotFoundException, IOException, Exception {
        final ClassLoader classLoader = this.getClass().getClassLoader();
        final File file = new File(classLoader.getResource(path).getFile());
        final FileReader fileReader = new FileReader(file);
        BufferedReader bufferReader = null;
        final StringBuilder stringBuffer = new StringBuilder();
        try {
            bufferReader = new BufferedReader(fileReader);
            String line;
            while ((line = bufferReader.readLine()) != null) {
                stringBuffer.append(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (bufferReader != null) {
                try {
                    bufferReader.close();
                } catch (IOException e2) {
                    e2.printStackTrace();
                }
            }
        }
        return stringBuffer.toString();
    }

    public byte[] readZipFile(final String path) {
        final ClassLoader classLoader = this.getClass().getClassLoader();
        final File file = new File(classLoader.getResource(path).getFile());
        final byte[] b = new byte[(int) file.length()];
        try (FileInputStream fileInputStream = new FileInputStream(file)) {
            fileInputStream.read(b);
        } catch (FileNotFoundException e) {
            System.out.println("Not Found.");
            e.printStackTrace();
        } catch (IOException e2) {
            System.out.println("Error");
            e2.printStackTrace();
        }
        return b;
    }
}
Assuming that your zip file corresponds to a valid API spec, in Mule 4 you don't need custom Java code to achieve what you want: you can read the file content using the File connector's Read operation and use an HTTP Request to upload it to Design Center through the Design Center API. Your flow should look like:
For the Read operation, you only need to set the file location in the File Path operation property.
There is no need to set the content type in the HTTP Request; Mule 4 will configure it automatically based on the file content loaded by the Read operation.
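As a rough sketch of what that flow might look like in XML (the connector namespaces, authentication headers, and exact attribute values are assumptions here and depend on your connector versions):

```xml
<flow name="upload-api-spec-flow">
    <!-- Read the zip from disk; the path is hypothetical -->
    <file:read path="test.zip"/>
    <!-- POST the raw bytes to the Design Center import endpoint;
         an Authorization header would be required in practice -->
    <http:request method="POST"
                  url="https://anypoint.mulesoft.com/designcenter/api-designer/projects/{projectId}/branches/master/import"/>
</flow>
```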
You can't use Java code that depends on Mule 3 classes in Mule 4. Don't bother trying to adapt the code; it is not meant to work, as their architectures are just different.
While in Mule 4 you can use plain Java code or create a module with the SDK, there is no reason to do so for this problem, and it would be counterproductive. My advice is to forget the Java code and solve the problem with pure Mule 4 components.
In this case there doesn't seem to be a need to actually use Java code. The File connector's Read operation should read the file just fine, as the Java code doesn't appear to do anything other than read the file into the payload.
Sending it through the HTTP Request connector should be straightforward. You didn't provide any details of the error (where it happens, the complete error message, the HTTP status code, the complete flow with the HTTP Request in both versions, etc.), and the API Designer REST API doesn't document an import endpoint, so it is difficult to say whether the request is correctly constructed.
I need to be able to crawl an online directory, for example http://svn.apache.org/repos/asf/, and whenever a pdf, docx, txt, or odt file comes up during the crawl, I need to parse it and extract its text.
I am using Files.walk to crawl locally on my laptop and the Apache Tika library to parse the text, and it works just fine, but I don't really know how to do the same thing on an online directory.
Here's the code that walks through my PC and parses the files, just so you have an idea of what I'm doing:
public static void GetFiles() throws IOException {
    // PathXml is the path directory, such as "/home/user/",
    // that is taken from an XML file.
    Files.walk(Paths.get(PathXml)).forEach(filePath -> { // crawling process (using Java 8)
        if (Files.isRegularFile(filePath)) {
            if (filePath.toString().endsWith(".pdf") || filePath.toString().endsWith(".docx") ||
                    filePath.toString().endsWith(".txt")) {
                try {
                    TikaReader.ParsedText(filePath.toString());
                } catch (IOException e) {
                    e.printStackTrace();
                } catch (SAXException e) {
                    e.printStackTrace();
                } catch (TikaException e) {
                    e.printStackTrace();
                }
                System.out.println(filePath);
            }
        }
    });
}
and here's the TikaReader method:
public static String ParsedText(String file) throws IOException, SAXException, TikaException {
    InputStream stream = new FileInputStream(file);
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try {
        parser.parse(stream, handler, metadata);
        System.out.println(handler.toString());
        return handler.toString();
    } finally {
        stream.close();
    }
}
So again, how can I do the same thing with the given online directory above?
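Crawling an online directory boils down to fetching each listing page over HTTP, extracting the href links, recursing into subdirectories, and downloading the files whose extensions you care about so their InputStream can be fed to Tika. Here is a rough JDK-only sketch of the link-extraction step (the sample HTML is hypothetical; a real crawler would read each page with URL.openStream()):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DirectoryLinkExtractor {
    // Extracts href targets from a directory-listing page and keeps only
    // the extensions we want to parse. A real crawler would also recurse
    // into links that end with "/" (subdirectories).
    static List<String> extractLinks(String html, String... extensions) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            String link = m.group(1);
            for (String ext : extensions) {
                if (link.endsWith(ext)) {
                    links.add(link);
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical listing page; in practice, read it from the URL.
        String html = "<html><body>"
                + "<a href=\"report.pdf\">report.pdf</a>"
                + "<a href=\"notes.txt\">notes.txt</a>"
                + "<a href=\"subdir/\">subdir/</a>"
                + "</body></html>";
        System.out.println(extractLinks(html, ".pdf", ".txt", ".docx", ".odt"));
        // prints: [report.pdf, notes.txt]
    }
}
```

An HTML parser such as jsoup would be more robust than a regex for real pages, but the shape of the crawl is the same.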
I am using Commons IO to download files from the internet.
This is the method I am using:
public void getFile(String url) {
    File f = new File("C:/Users/Matthew/Desktop/hello.txt");
    PrintWriter pw = new PrintWriter(f);
    pw.close();
    URL url1;
    try {
        url1 = new URL(url);
        FileUtils.copyURLToFile(url1, f);
    } catch (MalformedURLException e1) {
        e1.printStackTrace();
    } catch (IOException e1) {
        e1.printStackTrace();
    }
}
Is there a way I can download multiple files using this method and have them all saved to the hello.txt file? Using the above method, everything gets overwritten, and the last file downloaded is the one that ends up in hello.txt.
Basically, is there a way I can store multiple file downloads in one file?
Thanks.
There is no way using FileUtils. However, if you want to use Apache Commons, I'd suggest you do the following:
File f = new File("C:/Users/Matthew/Desktop/hello.txt");
URL url1;
try {
    url1 = new URL(url);
    IOUtils.copy(url1.openStream(), new FileOutputStream(f, true));
} catch (MalformedURLException e1) {
    e1.printStackTrace();
} catch (IOException e1) {
    e1.printStackTrace();
}
which does more or less the same thing, but uses append mode on the FileOutputStream.
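The key is the append flag. A JDK-only sketch of the same idea, using local byte sources in place of URL streams and try-with-resources so the streams are actually closed:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendDownloads {
    // Copies a stream onto the end of a file instead of overwriting it.
    // With a real download, this InputStream would come from url.openStream().
    static void appendTo(Path target, InputStream in) throws IOException {
        try (OutputStream out = Files.newOutputStream(target,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            in.transferTo(out); // Java 9+; loop with a buffer on older JDKs
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("hello", ".txt");
        appendTo(target, new ByteArrayInputStream("first file\n".getBytes()));
        appendTo(target, new ByteArrayInputStream("second file\n".getBytes()));
        // Both payloads survive: the second copy did not overwrite the first.
        System.out.print(new String(Files.readAllBytes(target)));
    }
}
```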
I am passing a file as an input stream to the parser.parse() method while using the Apache Tika library to convert the file to text. The method throws an exception (shown below), but the input stream is closed successfully in the finally block. Then, while renaming the file, the File.renameTo method from java.io returns false. I am not able to rename/move the file despite having closed the input stream. I suspect another handle to the file is created while parser.parse() processes it, and that it is not released until the exception is thrown. Is that possible? If so, what should I do to rename the file?
The exception thrown while checking the content type is:
java.lang.NoClassDefFoundError: Could not initialize class com.adobe.xmp.impl.XMPMetaParser
at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
Please suggest any solution. Thanks in advance.
public static void main(String args[]) {
    InputStream is = null;
    StringWriter writer = new StringWriter();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    File file = null;
    File destination = null;
    try {
        file = new File("E:\\New folder\\testFile.pdf");
        boolean a = file.exists();
        destination = new File("E:\\New folder\\test\\testOutput.pdf");
        is = new FileInputStream(file);
        parser.parse(is, new WriteOutContentHandler(writer), metadata, new ParseContext()); // EXCEPTION IS THROWN HERE
        String contentType = metadata.get(Metadata.CONTENT_TYPE);
        System.out.println(contentType);
    } catch (Exception e1) {
        e1.printStackTrace();
    } catch (Throwable t) {
        t.printStackTrace();
    } finally {
        try {
            if (is != null) {
                is.close(); // CLOSES THE INPUT STREAM
            }
            writer.close();
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
    boolean x = file.renameTo(destination); // RETURNS FALSE
    System.out.println(x);
}
This might be because other processes are still using the file, such as an anti-virus program; it may also be that another part of your application is holding a lock on it.
Please check for that; releasing the lock may solve your problem.
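One way to get a better diagnostic than renameTo's bare false is java.nio.file.Files.move, which throws an exception describing why the move failed (for example, a FileSystemException carrying the OS-level reason when another process holds the file). A small sketch with temporary paths standing in for the real ones:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveDemo {
    public static void main(String[] args) throws IOException {
        // Temporary stand-ins for the real source and destination paths.
        Path source = Files.createTempFile("testFile", ".pdf");
        Path destination = source.resolveSibling("testOutput.pdf");

        // Unlike File.renameTo, Files.move throws an informative exception
        // on failure instead of silently returning false.
        Files.move(source, destination, StandardCopyOption.REPLACE_EXISTING);
        System.out.println(Files.exists(destination)); // prints: true
    }
}
```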
The GUI utility of Apache Tika provides an option for getting the main content of a given document or URL (apart from the formatted text and structured text). I just want to know which method is responsible for extracting the main content of the docs/URL, so that I can incorporate that method into my program. Also, do they use any heuristic algorithm while extracting data from HTML pages? I ask because sometimes I can't see the advertisements in the extracted content.
UPDATE: I found out that BoilerpipeContentHandler is responsible for it.
The "main content" feature in the Tika GUI is implemented using the BoilerpipeContentHandler class that relies on the boilerpipe library for the heavy lifting.
I believe this is powered by the BodyContentHandler, which fetches just the HTML contents of the document body. It can additionally be combined with other handlers to return just the plain text of the body, if required.
public String[] tika_autoParser() {
    String[] result = new String[3];
    try {
        InputStream input = new FileInputStream(new File(path));
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        parser.parse(input, textHandler, metadata, context);
        result[0] = "Title: " + metadata.get(Metadata.TITLE);
        result[1] = "Body: " + textHandler.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    }
    return result;
}