XPath application using Tika parser - Java

I want to clean up irregular web content (it may be HTML, PDF, images, etc., but mostly HTML). I am using the Tika parser for that, but I don't know how to apply an XPath the way I do in HtmlCleaner.
The code I use is:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
URL u = new URL("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
new HtmlParser().parse(u.openStream(), handler, metadata, context);
System.out.println(handler.toString());
But in this case I get no output, while for the URL google.com I do get output.
In either case I don't know how to apply the XPath.
Any ideas, please?
I also tried building my own XPath the way BodyContentHandler does:
HttpClient client = new HttpClient();
GetMethod method = new GetMethod("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
int status = client.executeMethod(method);
HtmlParser parse = new HtmlParser();
XPathParser parser = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
//Matcher matcher = parser.parse("/xhtml:html/xhtml:body/descendant:node()");
Matcher matcher = parser.parse("/html/body//h1");
ContentHandler textHandler = new MatchingContentHandler(new WriteOutContentHandler(), matcher);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parse.parse(method.getResponseBodyAsStream(), textHandler, metadata, context);
System.out.println("content: " + textHandler.toString());
But I am not getting the content at the given XPath.

I'd suggest you take a look at the source code for BodyContentHandler, which comes with Tika. BodyContentHandler only returns the XML within the body tag, based on an XPath.
In general, though, you should use a MatchingContentHandler to wrap your chosen ContentHandler with an XPath, which is what BodyContentHandler does internally.
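As a minimal sketch of that pattern (assuming a Tika 1.x-style API; the URL is a placeholder): Tika rewrites all incoming HTML into the XHTML namespace, so the XPath has to be namespace-qualified, the axis needs the double colon (descendant::node()), and the text must be read back from the wrapped handler rather than from the MatchingContentHandler itself:

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.ToTextContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.xml.sax.ContentHandler;

public class TikaXPathText {
    public static void main(String[] args) throws Exception {
        // Tika maps all HTML into the XHTML namespace, so the path must be
        // namespace-qualified; note the double colon in descendant::node().
        XPathParser xhtmlPaths = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
        Matcher matcher = xhtmlPaths.parse("/xhtml:html/xhtml:body/descendant::node()");

        // Keep a reference to the inner handler: the text is read from it,
        // not from the MatchingContentHandler wrapper.
        ToTextContentHandler textHandler = new ToTextContentHandler();
        ContentHandler handler = new MatchingContentHandler(textHandler, matcher);

        URL url = new URL("http://example.com/"); // placeholder URL
        try (InputStream in = url.openStream()) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        System.out.println(textHandler.toString());
    }
}

This is essentially what BodyContentHandler sets up for you, with the fixed path /xhtml:html/xhtml:body/descendant::node().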

Related

How to get SVGDocument object from an SVG string?

I want to get an SVGDocument object to load into a JSVGCanvas, but I only have an SVG string and no file, so I cannot construct it from a URI.
You can read your SVG from a StringReader like this:
StringReader reader = new StringReader(svgString);
String uri = "file:make-something-up";
String parser = XMLResourceDescriptor.getXMLParserClassName();
SAXSVGDocumentFactory f = new SAXSVGDocumentFactory(parser);
SVGDocument doc = f.createSVGDocument(uri, reader);
You need to make up a valid URI but it's not important unless you make relative references to other URIs from your SVG.
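If the goal is to display it, you can then hand the parsed document straight to the canvas; a short sketch using Batik's JSVGCanvas.setSVGDocument:

// Display the in-memory document without ever touching a file.
JSVGCanvas canvas = new JSVGCanvas();
canvas.setSVGDocument(doc);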

Convert html String to org.w3c.dom.Document in Java

To convert from an HTML String to org.w3c.dom.Document I'm using jtidy-r938.jar. Here is my code:
public static Document getDoc(String html) {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE);
    // tidy.setPrintBodyOnly(true);
    tidy.setXmlOut(false);
    tidy.setShowErrors(0);
    tidy.setShowWarnings(false);
    // tidy.setForceOutput(true);
    tidy.setQuiet(true);
    Writer out = new StringWriter();
    PrintWriter dummyOut = new PrintWriter(out);
    tidy.setErrout(dummyOut);
    tidy.setSmartIndent(true);
    // Use an explicit charset so the bytes match the configured input encoding.
    ByteArrayInputStream inputStream = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    Document doc = tidy.parseDOM(inputStream, null);
    return doc;
}
But sometimes the library works incorrectly and some tags are lost.
Please suggest a good open library for this task.
Thanks very much!
You don't say why the library sometimes gives a bad result.
Nevertheless, I work regularly with HTML files that I must extract data from, and the main problem I encounter is that some tags are not valid, for example because they are not closed.
The best solution I have found is the HtmlCleaner API (HtmlCleaner website).
It allows you to make your HTML file well formed.
Then transforming it into a w3c Document or another strict format is easier.
With HtmlCleaner, you could do something like this:
HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(html);
DomSerializer ser = new DomSerializer(cleaner.getProperties());
Document myW3cDoc = ser.createDOM(node);
(DomSerializer here is the one that ships with HtmlCleaner.)
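For instance, here is a self-contained sketch (the HTML string and the XPath are made up for illustration) that cleans a malformed fragment and then queries the resulting w3c Document with the standard javax.xml XPath API:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

public class CleanHtmlExample {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><h1>Title</h1><p>Unclosed paragraph</body></html>";

        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode node = cleaner.clean(html);  // repairs the malformed markup
        Document doc = new DomSerializer(cleaner.getProperties()).createDOM(node);

        // Once it is a w3c Document, standard XPath works against it.
        XPath xpath = XPathFactory.newInstance().newXPath();
        System.out.println(xpath.evaluate("//h1", doc));  // prints "Title"
    }
}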

Lucene 4 - How to discard numeric terms in index?

I'm using Apache Tika to parse XML documents before indexing with Apache Lucene.
This is the Tika part:
BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(f);
ParseContext pcontext = new ParseContext();
//Xml parser
XMLParser xmlparser = new XMLParser();
xmlparser.parse(inputstream, handler, metadata, pcontext);
return handler.toString(); // return the plain text
I use StandardAnalyzer with a stop-word list to tokenize my document:
analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET); // using stop words
Can I discard numeric terms, since I don't need them?
Thanks for your help.
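No answer was posted here, but one possible approach (a sketch assuming Lucene 4.4-style APIs; NumericTermFilter is a hypothetical name) is to subclass FilteringTokenFilter so that all-digit terms are dropped, and to rebuild the StandardAnalyzer chain around it:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;

// Hypothetical filter: drops every token that consists only of digits.
final class NumericTermFilter extends FilteringTokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    NumericTermFilter(Version version, TokenStream in) {
        super(version, in);
    }

    @Override
    protected boolean accept() throws IOException {
        for (int i = 0; i < termAtt.length(); i++) {
            if (!Character.isDigit(termAtt.charAt(i))) {
                return true; // keep terms with at least one non-digit
            }
        }
        return false; // reject purely numeric terms
    }
}

Then, wherever the analyzer is built, reproduce the StandardAnalyzer chain and append the filter:

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_44, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_44, source);
        stream = new StopFilter(Version.LUCENE_44, stream, StandardAnalyzer.STOP_WORDS_SET);
        stream = new NumericTermFilter(Version.LUCENE_44, stream);
        return new TokenStreamComponents(source, stream);
    }
};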

CSS parser parsing string content

I am trying to use CSS Parser in a Java project to extract the CSS rules/DOM from a String of text input.
All the examples that I have come across take the CSS file as input. Is there a way to bypass the file reading and work with the string content of the CSS file directly?
The class I am working on receives only the string content of the CSS file; all the reading has already been taken care of.
Right now I have this, where 'cssfile' is the file path of the CSS file being parsed:
InputStream stream = oParser.getClass().getResourceAsStream(cssfile);
InputSource source = new InputSource(new InputStreamReader(stream));
CSSOMParser parser = new CSSOMParser();
CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
System.out.println("Number of rules: " + ruleList.getLength());
A workaround I found is to create a StringReader with the contents and set it as the character stream of the InputSource. But there should be a better way to do this:
InputSource inputSource = new InputSource();
Reader characterStream = new StringReader(cssContent);
inputSource.setCharacterStream(characterStream);
CSSStyleSheet stylesheet = cssParserObj.parseStyleSheet(inputSource, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
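If I recall the SAC API correctly (worth verifying against your CSS Parser version), InputSource also has a constructor that takes a Reader directly, which removes the extra setup lines:

// One-step equivalent of the workaround above.
InputSource inputSource = new InputSource(new StringReader(cssContent));
CSSStyleSheet stylesheet = cssParserObj.parseStyleSheet(inputSource, null, null);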

Problems parsing a table inside an RTF file using Apache Tika

I'm trying to parse an RTF file using Apache Tika. Inside the file there is a table with several columns.
The problem is that the parser writes out the result without any information about which column a value was in.
What I'm doing right now is:
// 'tc' is presumably a TikaConfig created elsewhere.
AutoDetectParser adp = new AutoDetectParser(tc);
Metadata metadata = new Metadata();
String mimeType = new Tika().detect(file);
metadata.set(Metadata.CONTENT_TYPE, mimeType);
BodyContentHandler handler = new BodyContentHandler();
InputStream fis = new FileInputStream(file);
adp.parse(fis, handler, metadata, new ParseContext());
fis.close();
System.out.println(handler.toString());
It works, but I also need that kind of meta-information.
Is there already a handler which outputs something like HTML, preserving the structure of the read RTF file?
I would suggest that rather than asking Tika for the plain text version and then wondering where all your nice HTML information has gone, you instead ask Tika for the document as XHTML. You'll then be able to process that to find the information you want in your RTF file.
If you look at the Tika examples or the Tika unit tests, you'll see this same pattern as an easy way to get the XHTML output:
// 'parser' is e.g. an AutoDetectParser and 'input' the document InputStream.
Metadata metadata = new Metadata();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));
parser.parse(input, handler, metadata, new ParseContext());
String xhtml = sw.toString();
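From there the XHTML string can be examined with ordinary XML tooling. A hedged sketch (the XPath expressions are illustrative; the DocumentBuilderFactory is deliberately left non-namespace-aware so the XHTML elements match without prefixes):

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// ... after the Tika parse above:
Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));

XPath xpath = XPathFactory.newInstance().newXPath();
NodeList rows = (NodeList) xpath.evaluate("//tr", doc, XPathConstants.NODESET);
for (int i = 0; i < rows.getLength(); i++) {
    NodeList cells = (NodeList) xpath.evaluate("td", rows.item(i), XPathConstants.NODESET);
    // cells.item(j) is the value in column j of row i
}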
