Java: Microsoft Word document to HTML converter with style sheet

I am trying to convert .doc and .docx (Microsoft Word) files to HTML with Apache Tika.
I ended up with the following code, which works fine,
but it does not add any style sheet to the resulting HTML.
import java.io.*;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;

public class DocxConvert
{
    public static void main(String[] args)
    {
        InputStream input = null;
        try
        {
            // Collect the serialized HTML in memory
            StringWriter sw = new StringWriter();
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
            handler.setResult(new StreamResult(sw));

            input = new FileInputStream("f:\\file.doc");
            DefaultDetector detector = new DefaultDetector();
            Metadata metadata = new Metadata();
            org.apache.tika.parser.Parser parser = new AutoDetectParser(detector);
            parser.parse(input, handler, metadata, new ParseContext());

            System.out.print(sw.toString());
        }
        catch (Exception ex)
        {
            ex.printStackTrace();
        }
        finally
        {
            try
            {
                if (input != null)   // avoid an NPE when the stream was never opened
                {
                    input.close();
                }
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        }
    }
}
Is there any way to add or generate a style sheet for the output? Kindly help!

I used version 1.6 of Tika and it worked fine for me. Here are the pom dependencies I used.
http://tika.apache.org/download.html
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.6</version>
    </dependency>
</dependencies>
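These dependencies alone will not attach any CSS, though - Tika only emits plain XHTML. A minimal sketch for styling it (not part of the original code; the CSS rules and the output file name are made up, and it assumes java.nio.charset.StandardCharsets is imported alongside the java.io classes) is to splice a <style> block into the <head> of the markup collected in the StringWriter before writing it out:
String html = sw.toString();                       // XHTML produced by the TransformerHandler above
String css = "<style>"
        + "body { font-family: sans-serif; } "
        + "table, td { border: 1px solid #ccc; border-collapse: collapse; }"
        + "</style>";
String styled = html.replaceFirst("(?i)<head>", "<head>" + css);

try (Writer out = new OutputStreamWriter(
        new FileOutputStream("f:\\file.html"), StandardCharsets.UTF_8)) {
    out.write(styled);                             // styled HTML ends up next to the source document
}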

You can use unoconv; it requires OpenOffice or LibreOffice. Download it from here - it converts doc, docx, xls, etc. to PDF from the command line on your server. If you want to embed the PDF file with Apache or Apache Tomcat, I think pdf.js is a good solution.

Related

How to determine extension from fileBytes

My application allows users to download files. While creating the headers, I am using Tika to set the extension as shown below.
This works fine for PDF files but fails for DOC and EXCEL files.
private HttpHeaders getHeaderData(byte[] fileBytes) throws IOException, MimeTypeException {
    final HttpHeaders headers = new HttpHeaders();
    TikaInputStream tikaStream = TikaInputStream.get(fileBytes);
    Tika tika = new Tika();
    String mimeType = tika.detect(tikaStream);
    headers.setContentType(MediaType.valueOf(mimeType));
    MimeTypes defaultMimeTypes = MimeTypes.getDefaultMimeTypes();
    String extension = defaultMimeTypes.forName(mimeType).getExtension();
    headers.add("file-ext", extension);
    return headers;
}
I see that the mimeType resolves to "application/pdf" for PDF files but to "application/x-tika-ooxml" for Excel and Word files, which is the problem.
How can I get the Word (.docx) and Excel (.xls, .xlsx) formats if I have the file as bytes?
Why does this work for PDF?
Summary
The short answer is: You have to use Tika's detector with its MediaType class - not MimeTypes.
The slightly longer answer is: Even that will not get you all the way, because of how older MS-Office files are structured. For those you have to also parse the files, and inspect their metadata.
The term "media type" has replaced the term "MIME type" - see here:
[RFC2046] specifies that Media Types (formerly known as MIME types) and Media
Subtypes will be assigned and listed by the IANA.
Office 97-2003
When Tika inspects Excel and Word 97-2003 files using its detector, it will return a media type of application/x-tika-msoffice. I assume (perhaps incorrectly) that this is its way of handling a file-type group, where the detector cannot determine the specific flavor of MS-Office 97-2003 file, based on its analysis. This is similar to the application/x-tika-ooxml in your question.
Expected Results
Based on the IANA list here, and a Mozilla list here, these are the media types we expect to get for the following file types:
.pdf :: application/pdf
.xls :: application/vnd.ms-excel
.doc :: application/msword
.xlsx :: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
.docx :: application/vnd.openxmlformats-officedocument.wordprocessingml.document
The Program
The program shown below uses the following Maven dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.23</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.23</version>
    </dependency>
    <dependency>
        <groupId>javax.ws.rs</groupId>
        <artifactId>javax.ws.rs-api</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>
The program (just for this demo - not production ready) is shown below. Specifically, look at the tikaDetect() and tikaParse() methods.
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class Main {

    private final Set<File> msOfficeFiles = new HashSet<>();

    public static void main(String[] args) throws IOException, MimeTypeException,
            SAXException, TikaException {
        Main main = new Main();
        main.doFileDetection();
    }

    private void doFileDetection() throws IOException, MimeTypeException, SAXException, TikaException {
        File file1 = new File("C:/tmp/foo.pdf");
        File file2 = new File("C:/tmp/baz.xlsx");
        File file3 = new File("C:/tmp/bat.docx");
        // Excel 97-2003 format:
        File file4 = new File("C:/tmp/bar.xls");
        // Word 97-2003 format:
        File file5 = new File("C:/tmp/daz.doc");

        Set<File> files = new HashSet<>();
        files.add(file1);
        files.add(file2);
        files.add(file3);
        files.add(file4);
        files.add(file5);

        for (File file : files) {
            try (BufferedInputStream bis = new BufferedInputStream(
                    new FileInputStream(file))) {
                tikaDetect(file, bis);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        for (File file : msOfficeFiles) {
            tikaParse(file);
        }
    }

    private void tikaDetect(File file, BufferedInputStream bis)
            throws IOException, SAXException, TikaException {
        Detector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
        MediaType mediaType = detector.detect(bis, metadata);
        if (mediaType.toString().equals("application/x-tika-msoffice")) {
            msOfficeFiles.add(file);
        } else {
            System.out.println("Media Type for " + file.getName()
                    + " is: " + mediaType.toString());
        }
    }

    private void tikaParse(File file) throws SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        try (BufferedInputStream bis = new BufferedInputStream(
                new FileInputStream(file))) {
            parser.parse(bis, handler, metadata, context);
            tikaDetect(file, bis);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("Media Type for " + file.getName()
                + " is: " + metadata.get("Content-Type"));
    }
}
Actual Results
The program generates some warnings and information messages. If we ignore these for this exercise, we get the following print statements:
Media Type for bat.docx is: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Media Type for baz.xlsx is: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Media Type for foo.pdf is: application/pdf
Media Type for bar.xls is: application/vnd.ms-excel
Media Type for daz.doc is: application/msword
These match the expected official media (MIME) types.
Tika official usages:
https://tika.apache.org/1.26/detection.html
Tika supported formats:
https://tika.apache.org/1.26/formats.html
You could get the answers by simply reading the above 2 pages.
Here are some key quotes:
Microsoft Office and some related applications produce documents in the generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The older OLE 2 format was introduced in Microsoft Office version 97 and was the default format until Office version 2007 and the new XML-based OOXML format. The OfficeParser and OOXMLParser classes use Apache POI libraries to support text and metadata extraction from both OLE2 and OOXML documents.
That means you also need to include the Apache POI jars (or their Maven dependencies) for MS Office files.
Tika provides a wrapping detector in the form of org.apache.tika.detect.DefaultDetector. This uses the service loader to discover all available detectors, including any available container aware ones, and tries them in turn. For container aware detection, include the Tika Parsers jar and its dependencies in your project, then use DefaultDetector along with a TikaInputStream.
That means you need to include the Tika Parsers jar or Maven dependencies.
Then use
new DefaultDetector().detect(TikaInputStream.get(file), new Metadata());
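Putting that together with the extension lookup from the question, a minimal sketch might look like the following (assuming tika-parsers is on the classpath so container-aware detection kicks in; the method name is made up):
private String detectExtension(byte[] fileBytes) throws IOException, MimeTypeException {
    try (TikaInputStream stream = TikaInputStream.get(fileBytes)) {
        // Container-aware detection: .docx/.xlsx resolve to their specific OOXML types
        MediaType mediaType = new DefaultDetector().detect(stream, new Metadata());
        // Map the media type back to a default extension, e.g. ".docx"
        return MimeTypes.getDefaultMimeTypes().forName(mediaType.toString()).getExtension();
    }
}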

tika: disable looking inside zip files

Currently, Tika processes ZIP files by looking inside them.
I'd like to disable this feature and only get the application/zip MIME type.
I'm using this code right now:
public String getMimeType(InputStream is) throws IOException {
    TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
    Detector detector = tikaConfig.getDetector(); // new DefaultDetector();
    Metadata metadata = new Metadata();
    MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
    return mediaType.toString(); // the original snippet was missing a return statement
}
This code returns the MIME type of the content inside the ZIP, not application/zip.
Any ideas?
Based on your example I wrote a dummy app. I then used a large ZIP file that doesn't have the .zip extension. I do not see the behavior where Tika parses the whole file.
I looked with a debugger; Tika only reads 65536 bytes to determine the file type.
See Tika's MimeTypes.class:154:
public MediaType detect(InputStream input, Metadata metadata) throws IOException {
    List<MimeType> possibleTypes = null;

    // Get type based on magic prefix
    if (input != null) {
        input.mark(getMinLength());
        try {
            byte[] prefix = readMagicHeader(input);
            possibleTypes = getMimeType(prefix);
        } finally {
            input.reset();
        }
    }
dummy app
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Main {

    public static void main(String[] args) throws IOException {
        System.out.println(new Main().getMimeType(
                new FileInputStream(new File("C:\\temp\\apache-tomcat-8.0.47-windows-x64"))));
    }

    public String getMimeType(InputStream is) throws IOException {
        final TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
        Detector detector = tikaConfig.getDetector(); // new DefaultDetector();
        Metadata metadata = new Metadata();
        MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
        return mediaType.toString(); // toString() gives e.g. "application/zip"; getType() alone would only return "application"
    }
}
Maven:
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.18</version>
    </dependency>
</dependencies>
Can you show me an example that reads and parses the whole ZIP file? I can understand that that would be a problem when files exceed a certain size.
Unfortunately I can't help you if I can't reproduce the problem.
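If the goal is simply to always get application/zip, one option (a sketch, not something verified against your files) is to detect with tika-core's magic-byte detector only, rather than the container-aware DefaultDetector that tika-parsers registers:
// Sketch: MimeTypes only looks at the magic bytes (plus an optional file-name hint),
// so ZIP-based containers such as .docx should come back as plain application/zip.
// The file path below is made up.
Detector magicOnly = MimeTypes.getDefaultMimeTypes();
try (InputStream in = new BufferedInputStream(new FileInputStream("C:\\temp\\sample.docx"))) {
    MediaType type = magicOnly.detect(in, new Metadata());
    System.out.println(type); // application/zip when no resource-name hint is set
}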

Content type check for uploading APK file in JAVA

How do I check whether the user is actually uploading an APK file in Java?
By checking only the file extension, a user can rename any file (e.g. a TXT file) to .apk and upload it.
I tried to check the content type of the file as shown below:
if(apkFile.getContentType().equals("application/vnd.android.package-archive"))
But the problem here is that when uploading from some browsers the content type comes as application/vnd.android.package-archive, while from others it is application/octet-stream.
What is the proper way to check this?
The Apache Tika parser can detect your file type even if the extension was changed by renaming, though the parse method is a bit expensive. After parsing the file, the file metadata is stored in the Metadata object. You can follow this code:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class Test {

    public static void main(String[] args) {
        File file = null;
        InputStream is = null;
        String contentType = null;
        try {
            // suppose the f.apk file was renamed to f.txt
            file = new File("path\\to\\your\\file\\f.txt");
            is = new FileInputStream(file);
            Metadata metadata = new Metadata();
            BodyContentHandler handler = new BodyContentHandler();
            AutoDetectParser parser = new AutoDetectParser();
            try {
                // This step takes some time.
                parser.parse(is, handler, metadata);
            } catch (Exception ex) {
                ex.printStackTrace();
            } finally {
                is.close();
            }
            contentType = metadata.get("Content-Type");
            System.out.println("Content Type = " + contentType);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Required Maven dependencies:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.16</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.16</version>
</dependency>
An APK is a ZIP archive, so in addition to checking the file extension you could check whether the upload is actually a valid ZIP file, which reduces the chance that it is not an APK. After all, not everyone who renames a file to .apk started with a ZIP file.
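Following that idea, here is a minimal sketch (a hypothetical helper using plain java.util.zip, no Tika) that opens the upload as a ZIP and looks for the entries an APK normally contains:
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public final class ApkCheck {

    // Returns true when the stream is a readable ZIP containing the entries an APK normally has.
    public static boolean looksLikeApk(InputStream upload) {
        boolean hasManifest = false;
        boolean hasDex = false;
        try (ZipInputStream zip = new ZipInputStream(upload)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("AndroidManifest.xml".equals(entry.getName())) {
                    hasManifest = true;
                } else if ("classes.dex".equals(entry.getName())) {
                    hasDex = true;
                }
            }
        } catch (IOException e) {
            return false; // not a valid ZIP at all
        }
        return hasManifest && hasDex;
    }
}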

Can't get correct Key-Value Pairs with Tika

I'm trying to get the metadata values from an Office document, and all it shows as a key-value pair is this one:
Content-Type: application/zip
I just can't tell what the issue is here. Why does it only show the Content-Type?
What I'm interested in are keys like title.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class App
{
    private static final String PATH = "C:/docs/myDocument.docx";

    public static void main(String[] args) throws IOException, SAXException, TikaException
    {
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        InputStream fileStream = new FileInputStream(PATH);
        BodyContentHandler handler = new BodyContentHandler();
        parser.parse(fileStream, handler, metadata);

        String[] metadataNames = metadata.names();
        for (String key : metadataNames) {
            String value = metadata.get(key);
            System.out.println(key + ": " + value);
        }
    }
}
Promoting a comment to an answer - you appear to be missing some key Apache Tika jars or their dependencies.
If you're using Maven, then your pom (as of January 2015) should have something like:
<properties>
    <tika.version>1.7</tika.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>${tika.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>${tika.version}</version>
    </dependency>
</dependencies>
The tika-core artifact gives you everything you need to run Tika and to develop your own parsers, but it doesn't come with any parsers. It's the tika-parsers artifact (+ dependencies!) which provides all the built-in Tika parsers, and that is what you need to process files like yours.
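With tika-parsers (and its dependencies) on the classpath, the same loop should then print the OOXML document properties as well. As a small sketch (exact key names vary by format and Tika version), the title can also be read through a typed property:
// requires: import org.apache.tika.metadata.TikaCoreProperties;
// run after parser.parse(...) above, with tika-parsers present
String title = metadata.get(TikaCoreProperties.TITLE); // null if the document has no title set
System.out.println("title: " + title);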

How to use OpenNLP with Java?

I want to POS-tag an English sentence and do some processing. I would like to use OpenNLP, and I have it installed.
When I execute the command
I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt
it POS-tags the input in Text.txt and gives this output:
Loading POS Tagger model ... done (4.009s)
My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._.
Average: 66.7 sent/s
Total: 1 sent
Runtime: 0.015s
So I hope it is installed properly?
Now how do I do this POS tagging from inside a Java application? I have added the OpenNLP tools, jwnl, and maxent jars to the project, but how do I invoke the POS tagger?
Here's some (old) sample code I threw together, with modernized code to follow:
package opennlp;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;

public class OpenNlpTest {
    public static void main(String[] args) throws IOException {
        POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);

        String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
        ObjectStream<String> lineStream =
                new PlainTextByLineStream(new StringReader(input));

        perfMon.start();
        String line;
        while ((line = lineStream.read()) != null) {
            String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
            String[] tags = tagger.tag(whitespaceTokenizerLine);
            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
            System.out.println(sample.toString());
            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();
    }
}
The output is:
Loading POS Tagger model ... done (2.045s)
Can_MD anyone_NN help_VB me_PRP dig_VB through_IN OpenNLP's_NNP horrible_JJ documentation?_NN
Average: 76.9 sent/s
Total: 1 sent
Runtime: 0.013s
This is basically working from the POSTaggerTool class included as part of OpenNLP. The sample.getTags() is a String array that has the tag types themselves.
This requires direct file access to the training data, which is really, really lame.
An updated codebase for this is a little different (and probably more useful.)
First, a Maven POM:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.javachannel</groupId>
    <artifactId>opennlp-example</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-tools</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>[6.8.21,)</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
And here's the code, written as a test, therefore located in ./src/test/java/org/javachannel/opennlp/example:
package org.javachannel.opennlp.example;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.stream.Stream;

public class POSTest {
    private void download(String url, File destination) throws IOException {
        URL website = new URL(url);
        ReadableByteChannel rbc = Channels.newChannel(website.openStream());
        FileOutputStream fos = new FileOutputStream(destination);
        fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
    }

    @DataProvider
    Object[][] getCorpusData() {
        return new Object[][][]{{{
                "Can anyone help me dig through OpenNLP's horrible documentation?"
        }}};
    }

    @Test(dataProvider = "getCorpusData")
    public void showPOS(Object[] input) throws IOException {
        File modelFile = new File("en-pos-maxent.bin");
        if (!modelFile.exists()) {
            System.out.println("Downloading model.");
            download("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", modelFile);
        }
        POSModel model = new POSModel(modelFile);
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        perfMon.start();
        Stream.of(input).map(line -> {
            String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line.toString());
            String[] tags = tagger.tag(whitespaceTokenizerLine);
            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
            perfMon.incrementCounter();
            return sample.toString();
        }).forEach(System.out::println);
        perfMon.stopAndPrintFinalResult();
    }
}
This code doesn't actually test anything - it's a smoke test, if anything - but it should serve as a starting point. Another (potentially) nice thing is that it downloads a model for you if you don't have it downloaded already.
The URL http://bulba.sdsu.edu/jeanette/thesis/PennTags.html does not work anymore. I found the below on the 14th slide at http://www.slideshare.net/gagan1667/opennlp-demo
The above answer does provide a way to use the existing models from OpenNLP, but if you need to train your own model, maybe the below can help:
Here is a detailed tutorial with full code:
https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php
Depending upon your domain, you can build a dataset either automatically or manually. Building such a dataset manually can be really painful; annotation tools like a POS tagger can help make the process much easier.
Training data format
Training data is passed as a text file where each line is one data item. Each word in the line should be labeled in the format "word_LABEL"; the word and the label name are separated by an underscore '_'.
anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
Train model
The important class here is POSModel, which holds the actual model. We use the class POSTaggerME to do the model building. Below is the code to build a model from a training data file:
public POSModel train(String filepath) {
    POSModel model = null;
    TrainingParameters parameters = TrainingParameters.defaultParams();
    parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");
    try {
        try (InputStream dataIn = new FileInputStream(filepath)) {
            ObjectStream<String> lineStream = new PlainTextByLineStream(new InputStreamFactory() {
                @Override
                public InputStream createInputStream() throws IOException {
                    return dataIn;
                }
            }, StandardCharsets.UTF_8);
            ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
            model = POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
            return model;
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
Use model to do tagging.
Finally, we can see how the model can be used to tag unseen queries:
public void doTagging(POSModel model, String input) {
    input = input.trim();
    POSTaggerME tagger = new POSTaggerME(model);
    Sequence[] sequences = tagger.topKSequences(input.split(" "));
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(input.split(" ")) + " => " + tags);
    }
}
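A hypothetical usage of the two helpers above (the training file name is made up) could be:
// Train on a "word_LABEL" file as described earlier, then tag a new query.
POSModel model = train("brand-training.txt");
if (model != null) {
    doTagging(model, "anki overdrive starter kit");
}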
