Lucene 4 - How to discard numeric terms in index? - java

I'm using Apache Tika to parse an XML document before indexing it with Apache Lucene.
This is the Tika part:
BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(f);
ParseContext pcontext = new ParseContext();
//Xml parser
XMLParser xmlparser = new XMLParser();
xmlparser.parse(inputstream, handler, metadata, pcontext);
return handler.toString(); // return the plain text
I use StandardAnalyzer with a stop-word list to tokenize my document:
analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET); // using stop words
Can I discard numeric terms, since I don't need them?
Thanks for your help.
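One way to do this (a sketch, not an answer from the thread): chain a custom TokenFilter after the analyzer's other filters and have it skip tokens made up entirely of digits. The class name below is invented; the TokenFilter/CharTermAttribute API is standard Lucene 4.
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative filter: drops tokens that consist entirely of digits.
public final class DiscardNumericFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public DiscardNumericFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            if (!isAllDigits(termAtt)) {
                return true; // keep non-numeric tokens
            }
            // purely numeric token: skip it and look at the next one
        }
        return false; // end of stream
    }

    private static boolean isAllDigits(CharSequence term) {
        if (term.length() == 0) {
            return false;
        }
        for (int i = 0; i < term.length(); i++) {
            if (!Character.isDigit(term.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
To wire it in, you would subclass Analyzer and add the filter in createComponents after StandardTokenizer and StopFilter. Note that skipping tokens this way does not adjust position increments; Lucene 4's FilteringTokenFilter base class handles that for you if exact phrase positions matter.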

Related

CSS parser parsing string content

I am trying to use CSS Parser in a Java project to extract the CSS rules/DOM from a String of text input.
All the examples that I have come across take the CSS file as input. Is there a way to bypass the file reading and work with the string content of the CSS file directly? The class that I am working on receives only the string content of the CSS file; all the reading has already been taken care of.
Right now I have this, where 'cssfile' is the file path of the CSS file being parsed:
InputStream stream = oParser.getClass().getResourceAsStream(cssfile);
InputSource source = new InputSource(new InputStreamReader(stream));
CSSOMParser parser = new CSSOMParser();
CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
System.out.println("Number of rules: " + ruleList.getLength());
A workaround that I found was to create a Reader using a StringReader with the contents and set the characterStream on the InputSource, but there should be a better way to do this:
InputSource inputSource = new InputSource();
Reader characterStream = new StringReader(cssContent);
inputSource.setCharacterStream(characterStream);
CSSStyleSheet stylesheet = cssParserObj.parseStyleSheet(inputSource, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
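For what it's worth, SAC's InputSource also has a constructor that takes a Reader directly, which collapses the workaround to a single line. A minimal self-contained sketch (the sample CSS string is invented):
import java.io.StringReader;

import org.w3c.css.sac.InputSource;
import org.w3c.dom.css.CSSRuleList;
import org.w3c.dom.css.CSSStyleSheet;

import com.steadystate.css.parser.CSSOMParser;

public class CssFromString {
    public static void main(String[] args) throws Exception {
        String cssContent = "h1 { color: red; } p { margin: 0; }"; // sample input

        // InputSource accepts a Reader, so no file access is needed
        InputSource source = new InputSource(new StringReader(cssContent));

        CSSOMParser parser = new CSSOMParser();
        CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
        CSSRuleList ruleList = stylesheet.getCssRules();
        System.out.println("Number of rules: " + ruleList.getLength());
    }
}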

setting namespace true while parsing in dom4j

I am parsing an XML document using dom4j as below:
SAXReader reader = new SAXReader();
document = reader.read("C:/test.xml");
But the namespaces that were in the original are not kept when I write the XML back out as below:
FileOutputStream fos = new FileOutputStream("c:/test.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(document);
writer.flush();
How can I do this using dom4j? I am using dom4j because it is easy to work with.
I don't agree. This snippet
System.out.println(new SAXReader()
.read(new ByteArrayInputStream("<a:c xmlns:a='foo'/>"
.getBytes(Charset.forName("utf-8")))).getRootElement()
.getNamespaceURI());
will print
foo
Your problem is that the SAXReader#read(String) method takes a system ID argument, not a file name. Instead, try feeding the reader a File, an InputStream, or a URL:
reader.read(new File("C:/test.xml"))
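Putting it together, a minimal round trip might look like this (same file path as in the question):
import java.io.File;
import java.io.FileOutputStream;

import org.dom4j.Document;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;

public class Dom4jRoundTrip {
    public static void main(String[] args) throws Exception {
        // Read with a File (not a bare String), so the path is not
        // misinterpreted as a system ID
        SAXReader reader = new SAXReader();
        Document document = reader.read(new File("C:/test.xml"));

        // Write it back out; namespace declarations are preserved
        FileOutputStream fos = new FileOutputStream("C:/test.xml");
        XMLWriter writer = new XMLWriter(fos, OutputFormat.createPrettyPrint());
        writer.write(document);
        writer.close();
    }
}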

CSS styles and <ul> <li> tags being ignored while parsing using Apache Tika

While I am parsing a PDF or Word document using AutoDetectParser, the "li" and "ul" tags are converted to "p" tags. I need the exact HTML content that is there in the PDF or Word document.
I tried several approaches, such as the one below:
ToHTMLContentHandler textHandler = new ToHTMLContentHandler();
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(HtmlMapper.class, new IdentityHtmlMapper());
parser.parse(in, textHandler, metadata, context);
SAXTransformerFactory factory = (SAXTransformerFactory)SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
handler.setResult(new StreamResult(writer));
System.out.println(handler.toString());
return handler;
But the "li" tags are been replaced with "p" tags with class but the CSS style is not seen in the parsed HTML output.
Any help is appreciated.
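For reference, a sketch of how the pieces above are usually wired together: the TransformerHandler has to be the handler passed to parse(), and the output ends up in the Writer, not in the handler's toString(). (In the snippet above, the document is parsed with textHandler while the TransformerHandler is printed.) Method name and structure here are illustrative:
import java.io.InputStream;
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.IdentityHtmlMapper;

public class TikaIdentityHtml {
    public static String toHtml(InputStream in) throws Exception {
        StringWriter writer = new StringWriter();

        // Serialize the SAX events Tika emits straight into the writer
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
        handler.setResult(new StreamResult(writer));

        // IdentityHtmlMapper keeps HTML elements as-is instead of
        // normalizing them to Tika's safe subset
        ParseContext context = new ParseContext();
        context.set(HtmlMapper.class, new IdentityHtmlMapper());

        Parser parser = new AutoDetectParser();
        parser.parse(in, handler, new Metadata(), context);
        return writer.toString(); // the HTML lives in the writer
    }
}
Note that HtmlMapper only affects Tika's HTML parser; for PDF and Word input the structure comes from Tika's own XHTML mapping, which may still render list items as "p" elements.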

Problems parsing a table inside an RTF file using Apache Tika

I'm trying to parse an RTF file using Apache Tika. Inside the file there is a table with several columns.
The problem is that the parser writes out the result without any information about which column each value was in.
What I'm doing right now is:
AutoDetectParser adp = new AutoDetectParser(tc);
Metadata metadata = new Metadata();
String mimeType = new Tika().detect(file);
metadata.set(Metadata.CONTENT_TYPE, mimeType);
BodyContentHandler handler = new BodyContentHandler();
InputStream fis = new FileInputStream(file);
adp.parse(fis, handler, metadata, new ParseContext());
fis.close();
System.out.println(handler.toString());
It works, but I also need the structural meta-information, e.g. which column a value came from.
Is there already a Handler which outputs something like HTML with a structure of the read RTF file?
I would suggest that rather than asking Tika for the plain text version and then wondering where all your nice HTML information has gone, you instead ask Tika for the document as XHTML. You'll then be able to process that to find the information you want in your RTF file.
If you look at the Tika Examples or the Tika unit tests, you'll see this same pattern as an easy way to get the XHTML output:
// parser and input are your AutoDetectParser and document stream
Metadata metadata = new Metadata();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));
parser.parse(input, handler, metadata, new ParseContext());
String xhtml = sw.toString();

XPath application using tika parser

I want to clean irregular web content (maybe HTML, PDF, image, etc.), mostly HTML. I am using the Tika parser for that, but I don't know how to apply an XPath the way I do in HtmlCleaner.
The code I use is,
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
URL u = new URL("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
new HtmlParser().parse(u.openStream(), handler, metadata, context);
System.out.println(handler.toString());
But in this case I get no output, while for the URL google.com I do get output.
In either case I don't know how to apply the XPath.
Any ideas please...
I tried building a custom XPath the way BodyContentHandler does:
HttpClient client = new HttpClient();
GetMethod method = new GetMethod("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
int status = client.executeMethod(method);
HtmlParser htmlParser = new HtmlParser();
XPathParser xpathParser = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
//Matcher matcher = xpathParser.parse("/xhtml:html/xhtml:body/descendant:node()");
Matcher matcher = xpathParser.parse("/html/body//h1");
ContentHandler textHandler = new MatchingContentHandler(new WriteOutContentHandler(), matcher);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
htmlParser.parse(method.getResponseBodyAsStream(), textHandler, metadata, context);
System.out.println("content: " + textHandler.toString());
But I am not getting the content at the given XPath.
I'd suggest you take a look at the source code for BodyContentHandler, which comes with Tika. BodyContentHandler only returns the XML within the body tag, based on an XPath.
In general, though, you should use a MatchingContentHandler to wrap your chosen ContentHandler with an XPath, which is what BodyContentHandler does internally.
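For instance, here is a sketch using the expression BodyContentHandler uses internally. Note the xhtml prefix, which the plain /html/body//h1 in the question is missing: Tika emits XHTML-namespaced SAX events, and it only supports a small XPath subset, so check the BodyContentHandler source in your Tika version for the exact syntax it accepts.
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.WriteOutContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.xml.sax.ContentHandler;

public class BodyXPathExample {
    public static String extractBody(InputStream stream) throws Exception {
        // Prefix and namespace must match the XHTML that Tika emits
        XPathParser xpathParser =
                new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
        Matcher matcher =
                xpathParser.parse("/xhtml:html/xhtml:body/descendant::node()");

        // Only events matching the XPath reach the inner handler
        WriteOutContentHandler textHandler = new WriteOutContentHandler();
        ContentHandler matching = new MatchingContentHandler(textHandler, matcher);

        new HtmlParser().parse(stream, matching, new Metadata(), new ParseContext());
        // Read the result from the inner handler, not the wrapper
        return textHandler.toString();
    }
}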
