The code below works perfectly for converting HTML to plain text...
URL url = new URL(your_url);
InputStream is = url.openStream();
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(is, textHandler, metadata, context);
System.out.println("Body: " + textHandler.toString());
My question is:
How can I retain specific elements like links etc., or prevent specific elements like links from being removed during the HTML to plain text conversion?
Thanks and best regards...
There are many ways you can use Apache Tika for this kind of work.
As Gagravarr says, the ContentHandler you use is the key here. There are a number of useful ones that could help, or you can build your own custom one.
Since I'm not sure what you are looking to do, I've tried to share some examples of common approaches, particularly for HTML content.
MatchingContentHandler
A common route is to use a MatchingContentHandler to filter the content you are interested in:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
// Only select <a> tags to be output
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher linkMatcher = xhtmlParser.parse("//xhtml:a/descendant::node()");
MatchingContentHandler handler = new MatchingContentHandler(new ToHTMLContentHandler(), linkMatcher);
// Parse based on original question
HtmlParser parser = new HtmlParser();
Metadata metadata = new Metadata();
parser.parse(is, handler, metadata, new ParseContext());
System.out.println("Links: " + handler.toString());
It's worth noting this is for inclusion only and supports only a subset of XPath; see XPathParser for details.
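Swapping the expression is enough to target a different part of the document. As a rough sketch (reusing the xhtmlParser above), this is roughly what BodyContentHandler does internally to keep only the body text:
// Match everything under <body> instead of just the <a> tags
Matcher bodyMatcher = xhtmlParser.parse("/xhtml:html/xhtml:body/descendant::node()");
MatchingContentHandler bodyHandler = new MatchingContentHandler(new ToTextContentHandler(), bodyMatcher);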
LinkContentHandler
If you just want to extract links, the LinkContentHandler is a great option:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, linkHandler, metadata, new ParseContext());
System.out.println("Links: " + linkHandler.getLinks());
Its code is also a great example of how to build a custom handler.
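If you want more than the raw list, each returned Link exposes its parts; a quick sketch using the linkHandler above (assuming the getUri() and getText() accessors on org.apache.tika.sax.Link):
for (Link link : linkHandler.getLinks()) {
    // Print the target URL and the anchor text of each extracted link
    System.out.println(link.getUri() + " -> " + link.getText());
}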
BoilerpipeContentHandler
The BoilerpipeContentHandler uses the Boilerpipe library underneath, allowing you to use one of its defined extractors to process the content.
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ExtractorBase extractor = ArticleSentencesExtractor.getInstance();
BoilerpipeContentHandler textHandler = new BoilerpipeContentHandler(new BodyContentHandler(), extractor);
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, textHandler, metadata, new ParseContext());
System.out.println(textHandler.getTextDocument().getTextBlocks());
These can be really useful if you are mainly interested in the content itself, as the extractors help you focus on it.
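For instance, a rough sketch that keeps only the blocks Boilerpipe classifies as main content (assuming the de.l3s.boilerpipe.document.TextBlock API), using the textHandler above:
for (TextBlock block : textHandler.getTextDocument().getTextBlocks()) {
    if (block.isContent()) {
        // Only print blocks the extractor marked as main content
        System.out.println(block.getText());
    }
}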
Custom ContentHandler or ContentHandlerDecorator
You can build your own ContentHandler to do custom processing and get exactly what you want out from a file.
In some cases this could be writing out specific content, as in the example below; in other cases it could be processing such as collecting and exposing links, as in the LinkContentHandler.
Using custom ContentHandler instances is really powerful, and there are a ton of examples available in the Apache Tika code base, as well as in other open source projects.
Below is a bit of a contrived example, just trying to emit part of the HTML:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.setResult(new StreamResult(sw));
ContentHandlerDecorator h2Handler = new ContentHandlerDecorator(handler) {
private final List<String> elementsToInclude = List.of("h2");
private boolean processElement = false;
@Override
public void startElement(String uri, String local, String name, Attributes atts)
throws SAXException {
if (elementsToInclude.contains(name)) {
processElement = true;
super.startElement(uri, local, name, atts);
}
}
@Override
public void ignorableWhitespace(char[] ch, int start, int length) {
// Skip whitespace
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
if (!processElement) {
return;
}
super.characters(ch, start, length);
}
@Override
public void endElement(
String uri, String local, String name) throws SAXException {
if (elementsToInclude.contains(name)) {
processElement = false;
super.endElement(uri, local, name);
}
}
};
HtmlParser parser = new HtmlParser();
parser.parse(is, h2Handler, new Metadata(), new ParseContext());
System.out.println("Heading Level 2s: " + sw.toString());
As you can see in some of these examples, one or more ContentHandler instances can be chained together, but note that some expect well-formed output, so check the Javadocs. XHTMLContentHandler is also useful if you want to map different file types into your own common format.
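For example, TeeContentHandler fans the same SAX events out to several handlers at once; a minimal sketch (reusing the handlers shown above) that collects links and plain text in a single pass:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
BodyContentHandler textHandler = new BodyContentHandler();
// Send the parse events to both handlers in one pass
TeeContentHandler tee = new TeeContentHandler(linkHandler, textHandler);
HtmlParser parser = new HtmlParser();
parser.parse(is, tee, new Metadata(), new ParseContext());
System.out.println("Links: " + linkHandler.getLinks());
System.out.println("Body: " + textHandler.toString());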
JSoup :)
Another route is Jsoup: use it directly (skip the Tika part and call Jsoup.connect()) if you are processing HTML, or chain it with Apache Tika if you want to work with the HTML that Tika generates from other file types.
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ToHTMLContentHandler html = new ToHTMLContentHandler();
HtmlParser parser = new HtmlParser();
parser.parse(is, html, new Metadata(), new ParseContext());
Document doc = Jsoup.parse(html.toString());
Elements h2List = doc.select("h2");
for (Element headline : h2List) {
System.out.println(headline.text());
}
Once you've parsed it, you can query the document with Jsoup. It's not the most efficient option compared to a ContentHandler built for the job, but it can be useful for messy content sets.
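Since the original question was about keeping links, the same selector approach covers that too; a quick sketch against the doc parsed above:
// Select every anchor that has an href attribute
for (Element link : doc.select("a[href]")) {
    System.out.println(link.attr("href") + " -> " + link.text());
}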
I'm using Crawler4j to extract pages and PDF files. I already checked that the byte array I get is valid and can be written out to a PDF file.
With this byte array, I do the following:
//Tika specific types
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream inputstream;
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();
...
byte[] contentData = null;
contentData = page.getContentData(); //Crawler4j content, delivers valid PDF
//Path path = Paths.get("C:\\Test\\local.pdf"); //use this line to read from a local pdf
//Default fields:
String title = "pdf title";
String content = "";
String suggestions = "";
//
try {
////contentData = Files.readAllBytes(path); //use this line to read from a local pdf
inputstream = new ByteArrayInputStream(contentData);
pdfparser.parse(inputstream, handler, metadata, pcontext); // THIS LINE CRASHES
content = "pdf suggestions";
suggestions = handler.toString();
} catch (Exception e) {
LOGGER.warn("Error parsing with Tika.", e);
}
I marked the crashing line. The resulting Exception is the following:
WARN 2017-07-26 11:17:51,302 [Thread-5] de.searchadapter.crawler.solrparser.parser.file.PDFFileParser - Error parsing with Tika.
org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
at org.apache.tika.metadata.Metadata.add(Metadata.java:305)
at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:209)
at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:150)
at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:239)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
at de.searchadapter.crawler.solrparser.parser.file.PDFFileParser.parse(PDFFileParser.java:82)
at de.searchadapter.crawler.solrparser.SolrParser.parse(SolrParser.java:36)
at de.searchadapter.crawler.SolrJAdapter.indexDocs(SolrJAdapter.java:58)
at de.searchadapter.crawler.WebCrawler.onBeforeExit(WebCrawler.java:63)
at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:309)
at java.lang.Thread.run(Thread.java:745)
The code above is from the PDFFileParser. I'm not setting any property, so I'm puzzled where this error comes from.
Additional info: the PDF file seems to use an unknown font; the following warning comes up:
11:17:50.963 [Thread-5] WARN o.a.pdfbox.pdmodel.font.PDSimpleFont - No Unicode mapping for f_i (30) in font GGOLOE+TheSansC5-Plain
EDIT: I edited the code so that it can read local PDF files. I tried another PDF file and didn't get the error, so it seems to result from the failing font.
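For completeness, here is a minimal sketch of the parse with fresh handler and Metadata instances per document. This is just a guess: since the declarations above sit outside the parse call, reusing the same Metadata across crawled documents could make Metadata.add hit an already-set single-valued property like xmpMM:DocumentID, which is exactly what PropertyTypeException reports.
// Fresh handler and metadata per document, so values from a previous
// parse cannot collide with single-valued XMP properties
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
try (InputStream inputstream = new ByteArrayInputStream(page.getContentData())) {
    new PDFParser().parse(inputstream, handler, metadata, new ParseContext());
    suggestions = handler.toString();
} catch (Exception e) {
    LOGGER.warn("Error parsing with Tika.", e);
}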
I have thousands of PDF documents that are 11-15 MB each. My program says that my document contains more than 100k characters.
Error output:
Exception in thread "main"
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException:
Your document contained more than 100000 characters, and so your
requested limit has been reached. To receive the full text of the
document, increase your limit.
How can I increase the limit to 10-15 MB?
I found a possible solution, the Tika facade class, but I could not find a way to integrate it with my code.
Tika tika = new Tika();
tika.setMaxStringLength(10*1024*1024);
Here is my code:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
String location = "C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf";
FileInputStream inputstream = new FileInputStream(location);
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
Output:
System.out.println("Content of the PDF: " + handler.toString());
Use
BodyContentHandler handler = new BodyContentHandler(-1);
to disable the limit.
From the Javadoc:
The internal string buffer is bounded at the given number of
characters. If this write limit is reached, then a SAXException is
thrown. Parameters: writeLimit - maximum number of characters to
include in the string, or -1 to disable the write limit
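Applied to the code in the question, that is the only change needed; a minimal sketch (you can also pass an explicit character budget instead of -1 if you prefer to keep a ceiling):
// -1 disables the write limit; alternatively pass e.g. 15 * 1024 * 1024
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(location);
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Content of the PDF: " + handler.toString());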
I found this code sample. However, it does not save the new metadata. How can I save the new metadata in the same file?
I tried IOUtils.copy, but the problem is that "the parser implementation will consume this stream but will not close it" (https://tika.apache.org/1.1/parser.html).
I need sample code to save the changes.
public void setMetadata(File param_File) throws IOException, SAXException, TikaException {
// parameters of parse() method
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(param_File);
ParseContext context = new ParseContext();
// Parsing the given file
parser.parse(inputstream, handler, metadata, context);
// list of metadata elements
System.out.println("===Before=== metadata elements and values of the given file :");
String[] metadataNamesb4 = metadata.names();
for (String name : metadataNamesb4) {
System.out.println(name + ": " + metadata.get(name));
}
// setting date meta data
metadata.set(TikaCoreProperties.CREATED, new Date());
// setting the title property
metadata.set(TikaCoreProperties.TITLE, "ram ,raheem ,robin ");
// printing all the meta data elements with new elements
System.out.println("===After=== List of all the metadata elements after adding new elements ");
String[] metadataNamesafter = metadata.names();
for (String name : metadataNamesafter) {
System.out.println(name + ": " + metadata.get(name));
}
//=======================================
//How to save the metadata? ==================
}
Thank you in advance for your answers, examples and help.
ForkParser is a new Tika parser that was introduced in Tika version 0.9, located in org.apache.tika.fork. The new parser forks off a new JVM process to analyze the passed file stream. I figured this may be a good way to constrain how much memory I'm willing to devote to Tika's metadata extraction process. However, the Metadata object is not being populated with the appropriate metadata properties as it would be when using an AutoDetectParser. Tests have shown that the BodyContentHandler object is not null.
Why is the Metadata object not being populated with anything (except the manually added RESOURCE_NAME_KEY)?
public static Metadata getMetadata(File f) {
Metadata metadata = new Metadata();
try {
FileInputStream fis = new FileInputStream(f);
BodyContentHandler contentHandler = new BodyContentHandler(-1);
ParseContext context = new ParseContext();
ForkParser parser = new ForkParser();
parser.setJavaCommand("/usr/local/java6/bin/java -Xmx64m");
metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
parser.parse(fis, contentHandler, metadata, context);
fis.close();
String contentType = metadata.get(Metadata.CONTENT_TYPE);
logger.error("contentHandler: " + contentHandler.toString());
logger.error("metadata: " + metadata.toString());
return metadata;
} catch (Throwable e) {
logger.error("Exception while analyzing file\n" +
"CAUTION: metadata may still have useful content in it!\n" +
"Exception: " + e, e);
return metadata;
}
}
The ForkParser class in Tika 1.0 unfortunately does not support metadata extraction, since for now the communication channel to the forked parser process only passes back SAX events, not metadata entries. I suggest you file a TIKA improvement issue to get this fixed.
One workaround you might want to consider is getting the extracted metadata from the <meta> tags in the <head> section of the XHTML document returned by the forked parser. Those should be available and contain most of the metadata entries normally returned in the Metadata object.
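A rough sketch of that workaround, using a markup-preserving handler in the parent process and Jsoup only to read the <meta> tags back out (any XML parser would do):
ToHTMLContentHandler xhtmlHandler = new ToHTMLContentHandler();
ForkParser parser = new ForkParser();
try (InputStream fis = new FileInputStream(f)) {
    // The forked JVM streams SAX events back, so the XHTML is built in this process
    parser.parse(fis, xhtmlHandler, new Metadata(), new ParseContext());
}
parser.close();
// Recover metadata entries from the <meta> tags in the <head>
Document doc = Jsoup.parse(xhtmlHandler.toString());
for (Element meta : doc.select("head meta")) {
    System.out.println(meta.attr("name") + " = " + meta.attr("content"));
}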