Java Tika: how to convert HTML to plain text while retaining specific elements

The code below works perfectly for converting HTML to plain text...
URL url = new URL(your_url);
InputStream is = url.openStream();
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(is, textHandler, metadata, context);
System.out.println("Body: " + textHandler.toString());
My question is:
How can I retain specific elements, like links, or prevent them from being stripped during the HTML to plain text conversion?
Thanks and best regards...

There are many ways you can use Apache Tika for this kind of work.
As Gagravarr says, the ContentHandler you use is the key here. There are a number of useful ones that could help, or you can build your own custom one.
Since I'm not sure what you are looking to do, I've tried to share some examples of common approaches, particularly for HTML content.
MatchingContentHandler
A common route is to use a MatchingContentHandler to filter the content you are interested in:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
// Only select <a> tags to be output
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher anchorMatcher = xhtmlParser.parse("//xhtml:a/descendant::node()");
MatchingContentHandler handler = new MatchingContentHandler(new ToHTMLContentHandler(), anchorMatcher);
// Parse based on original question
HtmlParser parser = new HtmlParser();
Metadata metadata = new Metadata();
parser.parse(is, handler, metadata, new ParseContext());
System.out.println("Links: " + handler.toString());
It's worth noting this is for inclusion only, and that only a subset of XPath is supported. See XPathParser for details.
LinkContentHandler
If you just want to extract links, the LinkContentHandler is a great option:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, linkHandler, metadata, new ParseContext());
System.out.println("Links: " + linkHandler.getLinks());
Its code is also a great example of how to build a custom handler.
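If you want more than the default formatting, each returned Link object exposes its parts; a small follow-up sketch using the same linkHandler as above:
for (Link link : linkHandler.getLinks()) {
    // Link exposes the URI, anchor text and title of each extracted link
    System.out.println(link.getUri() + " -> " + link.getText());
}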
BoilerpipeContentHandler
The BoilerpipeContentHandler uses the Boilerpipe library underneath, allowing you to use one of its extractors to process the content.
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ExtractorBase extractor = ArticleSentencesExtractor.getInstance();
BoilerpipeContentHandler textHandler = new BoilerpipeContentHandler(new BodyContentHandler(), extractor);
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, textHandler, metadata, new ParseContext());
System.out.println(textHandler.getTextDocument().getTextBlocks());
These can be really useful if you are mainly interested in the meaningful content of a page, as the extractors help you focus on it.
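For instance, if you only care about the blocks Boilerpipe classified as main content, you can filter the TextBlock list from the parse above; a small sketch:
for (TextBlock block : textHandler.getTextDocument().getTextBlocks()) {
    if (block.isContent()) { // keep only blocks Boilerpipe marked as content
        System.out.println(block.getText());
    }
}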
Custom ContentHandler or ContentHandlerDecorator
You can build your own ContentHandler to do custom processing and get exactly what you want out from a file.
In some cases this could be writing out specific content, as in the example below; in other cases it could be processing such as collecting links and making them available, as in the LinkContentHandler.
Using custom ContentHandler instances is really powerful and there are a ton of examples available in the Apache Tika code base, as well other open source projects too.
Below is a bit of a contrived example, just trying to emit part of the HTML:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.setResult(new StreamResult(sw));
ContentHandlerDecorator h2Handler = new ContentHandlerDecorator(handler) {
    private final List<String> elementsToInclude = List.of("h2");
    private boolean processElement = false;

    @Override
    public void startElement(String uri, String local, String name, Attributes atts)
            throws SAXException {
        if (elementsToInclude.contains(name)) {
            processElement = true;
            super.startElement(uri, local, name, atts);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Skip whitespace
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!processElement) {
            return;
        }
        super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String name) throws SAXException {
        if (elementsToInclude.contains(name)) {
            processElement = false;
            super.endElement(uri, local, name);
        }
    }
};
HtmlParser parser = new HtmlParser();
parser.parse(is, h2Handler, new Metadata(), new ParseContext());
System.out.println("Heading Level 2s: " + sw.toString());
As you can see in some of these examples, one or more ContentHandler instances can be chained together. It's worth noting that some expect the output to be well-formed, so check the Javadocs. XHTMLContentHandler is also useful if you want to map different files into your own common format.
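As a minimal sketch of that last point, a custom Parser implementation would typically wrap the downstream handler in an XHTMLContentHandler and emit its own normalised structure (the element names and text here are just illustrative):
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
xhtml.startDocument();
// Map whatever the source format contains onto your own common structure
xhtml.element("h1", "Document title");
xhtml.element("p", "Body text extracted from the underlying format");
xhtml.endDocument();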
JSoup :)
Another route is JSoup: either used directly (skip the Tika part and use Jsoup.connect()) if you are processing HTML, or chained with Apache Tika if you want to read HTML generated from other file types.
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ToHTMLContentHandler html = new ToHTMLContentHandler();
HtmlParser parser = new HtmlParser();
parser.parse(is, html, new Metadata(), new ParseContext());
Document doc = Jsoup.parse(html.toString());
Elements h2List = doc.select("h2");
for (Element headline : h2List) {
    System.out.println(headline.text());
}
Once you've parsed it, you can query the document with Jsoup. It's not the most efficient option compared to a ContentHandler built for the job, but it can be useful for messy content sets.

Related

IKVM C# Tika Implementation - NoClassDefFoundError - sun.java2d.Disposer

I have a small library that utilizes IKVM to run Tika (1.2) for the purposes of extracting text and metadata for use within Lucene. I grab document and image paths from a CMS we are using, and pass them through here:
public TextExtractionResult Extract(string filePath)
{
    var parser = new AutoDetectParser();
    var metadata = new Metadata();
    var parseContext = new ParseContext();
    Class parserClass = parser.GetType();
    parseContext.set(parserClass, parser);
    try
    {
        // Attempt to fix ImageParser "NoClassDefFoundError"
        java.lang.System.setProperty("java.awt.headless", "true");
        var file = new File(filePath);
        var url = file.toURI().toURL();
        using (InputStream inputStream = TikaInputStream.get(url, metadata))
        {
            parser.parse(inputStream, getTransformerHandler(), metadata, parseContext);
            inputStream.close();
        }
        return AssembleExtractionResult(_outputWriter.toString(), metadata);
    }
    catch (Exception ex)
    {
        throw new ApplicationException("Extraction of text from the file '{0}' failed.".ToFormat(filePath), ex);
    }
}
Only when the files are .png does it bomb, with the sun.java2d.Disposer NoClassDefFoundError from the title. It seems as though it is most likely coming from Tika's ImageParser.
For those who are interested, you can see getTransformerHandler() here:
private TransformerHandler getTransformerHandler()
{
    var factory = TransformerFactory.newInstance() as SAXTransformerFactory;
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "text");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
    handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    _outputWriter = new StringWriter();
    handler.setResult(new StreamResult(_outputWriter));
    return handler;
}
I have looked around and keep being pointed in the direction of running headless, so I already tried that with no luck. Because this is a C# implementation via IKVM, is something missing? It works on all other documents as far as I can tell (.jpeg, .docx, .pdf, etc.).
Thanks to those who know more about Tika + IKVM implementations than I do.
Apache Tika 1.2 was released back on 17 July 2012, and there have been a lot of fixes and improvements since then.
You should upgrade to the most recent version of Apache Tika (1.12 as of writing), and that should solve your issue.

How to write modified meta data to mp3

I extracted meta information through the Apache Tika library, made some changes, and now I want to write the changed information back to the file.
The code snippet for extracting is here:
InputStream input = new FileInputStream(new File(...));
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
Parser parser = new Mp3Parser();
ParseContext parseCtx = new ParseContext();
parser.parse(input, handler, metadata, parseCtx);
input.close();
String[] tags = metadata.names();
Further on, I make some changes, like:
for (String tagName : tags) {
    metadata.remove(tagName);
}
And finally I want to write out the modified version. How can I do this?
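Tika itself is an extraction toolkit; as far as I know its Mp3Parser only reads tags and cannot write them back, so you would need a separate tag-writing library. A minimal sketch using the mp3agic library, which is purely an illustrative choice with hypothetical file names:
// Hypothetical sketch with the mp3agic library (not part of Tika)
Mp3File mp3File = new Mp3File("song.mp3");
ID3v2 tag = mp3File.hasId3v2Tag() ? mp3File.getId3v2Tag() : new ID3v24Tag();
tag.setTitle("New title"); // apply your modified values here
mp3File.setId3v2Tag(tag);
mp3File.save("song-updated.mp3"); // mp3agic saves to a new file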

Extract text from a large pdf with Tika

I am trying to extract text from a large PDF, but I only get the first pages. I need all of the text to be passed to a String variable.
This is the code
public class ParsePDF {
    public static void main(String args[]) throws Exception {
        try {
            File file = new File("C:/vlarge.pdf");
            String content = new Tika().parseToString(file);
            System.out.println("The Content: " + content);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
From the Javadocs:
To avoid unpredictable excess memory use, the returned string contains
only up to getMaxStringLength() first characters extracted from the
input document. Use the setMaxStringLength(int) method to adjust this
limitation.
Calling setMaxStringLength(-1) will disable this limit.
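For example, with the Tika facade from the question (a minimal sketch; exception handling omitted):
Tika tika = new Tika();
// -1 disables the length limit (the default is 100,000 characters)
tika.setMaxStringLength(-1);
String content = tika.parseToString(new File("C:/vlarge.pdf"));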
Try the Apache Tika API; it works for large PDFs too.
Sample:
InputStream input = new FileInputStream("sample.pdf");
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
new PDFParser().parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);

Why is my Tika Metadata object not being populated when using ForkParser?

ForkParser is a new Tika parser that was introduced in Tika version 0.9, located in org.apache.tika.fork. The new parser forks off a new JVM process to analyze the passed file stream. I figured this may be a good way to constrain how much memory I'm willing to devote to Tika's metadata extraction process. However, the Metadata object is not being populated with the appropriate metadata properties like it would be when using an AutoDetectParser. Tests have shown that the BodyContentHandler object is not null.
Why is the Metadata object not being populated with anything (except the manually added RESOURCE_NAME_KEY)?
public static Metadata getMetadata(File f) {
    Metadata metadata = new Metadata();
    try {
        FileInputStream fis = new FileInputStream(f);
        BodyContentHandler contentHandler = new BodyContentHandler(-1);
        ParseContext context = new ParseContext();
        ForkParser parser = new ForkParser();
        parser.setJavaCommand("/usr/local/java6/bin/java -Xmx64m");
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
        parser.parse(fis, contentHandler, metadata, context);
        fis.close();
        String contentType = metadata.get(Metadata.CONTENT_TYPE);
        logger.error("contentHandler: " + contentHandler.toString());
        logger.error("metadata: " + metadata.toString());
        return metadata;
    } catch (Throwable e) {
        logger.error("Exception while analyzing file\n" +
                "CAUTION: metadata may still have useful content in it!\n" +
                "Exception: " + e, e);
        return metadata;
    }
}
The ForkParser class in Tika 1.0 unfortunately does not support metadata extraction, since for now the communication channel to the forked parser process only supports passing back SAX events, not metadata entries. I suggest you file a TIKA improvement issue to get this fixed.
One workaround you might want to consider is getting the extracted metadata from the <meta> tags in the <head> section of the XHTML document returned by the forked parser. Those should be available and contain most of the metadata entries normally returned in the Metadata object.
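A rough sketch of that workaround, assuming you let the forked parser emit XML via Tika's ToXMLContentHandler and then read the <meta> tags back out (Jsoup is used here purely for convenience):
ForkParser parser = new ForkParser();
ToXMLContentHandler xmlHandler = new ToXMLContentHandler();
parser.parse(new FileInputStream(f), xmlHandler, new Metadata(), new ParseContext());
// The forked parser can pass back SAX events, so the XHTML (including
// the <meta> tags in the <head> section) survives the round trip
Document doc = Jsoup.parse(xmlHandler.toString());
for (Element meta : doc.select("head meta")) {
    System.out.println(meta.attr("name") + " = " + meta.attr("content"));
}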

How to validate an XML using XSD InputStream using oracle.xml.parser.v2.DOMParser

I am trying to validate an XML file against an XSD document. There are two ways in which I am doing this:
Both XML and XSDs are files
XML is a file and XSD is a stream
In the case of XML and XSDs are files,
validateXML(String xml, String xsd)
{
    m_domParser = new DOMParser();
    String url = "file:" + new File(xml).getAbsolutePath();
    String xsdUrl = "file:" + new File(xsd).getAbsolutePath();
    try
    {
        m_domParser.setValidationMode(XMLParser.SCHEMA_VALIDATION);
        m_domParser.setXMLSchema(xsdUrl);
        Validator handler = new Validator();
        m_domParser.setErrorHandler(handler);
        m_domParser.parse(url);
        m_xmlDoc = m_domParser.getDocument();
        // determine what kind of utility was requested
    }
}
It works fine and validates correctly. Here is the code I have written for validating using the XSD as a stream:
import org.xml.sax.InputSource;
import oracle.xml.parser.v2.DOMParser;
validateXML(String xmlFile, InputStream is)
{
    try {
        m_domParser = new DOMParser();
        m_domParser.setValidationMode(XMLParser.SCHEMA_VALIDATION);
        // Build the XMLSchema from the input stream
        XSDBuilder builder = new XSDBuilder();
        Reader reader = new InputStreamReader(is, "UTF-8");
        InputSource iSource = new InputSource(reader);
        iSource.setEncoding("UTF-8");
        XMLSchema schema = builder.build(iSource); // NOTE
        m_domParser.setXMLSchema(schema);
        Validator handler = new Validator();
        m_domParser.setErrorHandler(handler);
        // Get the URL for the XML file
        String url = "file:" + new File(xmlFile).getAbsolutePath();
        m_domParser.parse(url);
    }
}
but at the NOTE comment (schema = builder.build(iSource);) the build throws an exception: "invalid derivation from base type" / "missing derivation".
The XSD stream is being generated from the same XSD file, so why is it failing in the second case? While building the XMLSchema, what does "invalid derivation from the base type" mean?
Please help me understand what went wrong in the second case. Any quick responses are highly appreciated.
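As an aside, if you are not tied to the Oracle parser, the same stream-based validation can be done with the standard javax.xml.validation API; a minimal sketch (this sidesteps XSDBuilder rather than explaining the derivation error):
SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = factory.newSchema(new StreamSource(is)); // is = the XSD InputStream
javax.xml.validation.Validator validator = schema.newValidator();
// Throws SAXException if the document does not conform to the schema
validator.validate(new StreamSource(new File(xmlFile)));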
