I wrote a custom tokenizer for Solr. When I first add records to Solr, they go through my tokenizer and the other filters, and while they pass through my tokenizer I call a web service and add the attributes I need. After that I can search without sending requests to the web service. But when I search with highlighting, the data goes through my tokenizer again. What should I do to avoid going through the tokenizer again?
When the highlighter is run on the text to highlight, the analyzer and tokenizer for the field are re-run on the text to score the different tokens against the submitted text, to determine which fragment is the best match for the query produced. You can see this code around line #62 of Highlighter.java in Lucene.
There are, however, a few options that might help avoid the need to re-parse the document text, all given as options on the community wiki page for Highlighting:
For the standard highlighter:
It does not require any special data structures such as termVectors, although it will use them if they are present. If they are not, this highlighter will re-analyze the document on-the-fly to highlight it. This highlighter is a good choice for a wide variety of search use-cases.
There are also two other highlighter implementations you might want to look at, as each uses support structures that might avoid re-tokenizing / re-analyzing the field (I think testing it will be a lot quicker for you than for me right now).
FastVector Highlighter: The FastVector Highlighter requires term vector options (termVectors, termPositions, and termOffsets) on the field.
Postings Highlighter: The Postings Highlighter requires storeOffsetsWithPositions to be configured on the field. This is a much more compact and efficient structure than term vectors, but is not appropriate for huge numbers of query terms.
You can switch the highlighting implementation by using hl.useFastVectorHighlighter=true or by adding <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/> to your searchComponent definition.
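For the FastVector highlighter, the field you highlight needs the term vector options enabled in the schema; a minimal sketch of such a field definition (the field name and type are placeholders for whatever you actually highlight):

    <field name="content" type="text_general" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true"/>

For the Postings highlighter you would instead set storeOffsetsWithPositions="true" on that field.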
Scenario:
I'm doing some HTML information extraction using a crawler. Right now, most of the rules for extraction are hardcoded (not the tags or things like that, but the loops, nested elements, etc.)
For instance, one common task is as follows:
Obtain the table with ID X. If it doesn't exist, there may be additional mechanisms to find the info, which are then triggered.
Find a row which contains some info. Usually the match is a regexp against a specific column.
Retrieve the data in a different column (usually marked in the td, or previously detected in the header)
The way I'm currently doing so is:
Query to get the body of the first table with id X (X is in a config file). Some websites in my list are buggy and duplicate that id on elements other than tables -.-
Iterate over the interesting cells, executing a regexp on cell.text() (the regexp is in the config file)
Get the parent row of the matching cells, and obtain the cell I need from the row (the identifier of the row is in the config file)
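For reference, the kind of hardcoded flow I mean looks roughly like this (sketched here with Jsoup; the selector, regex, and column index stand in for the values that would come from the config file):

    import java.util.regex.Pattern;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class TableRuleExample {
        public static void main(String[] args) {
            String html = "<table id=\"X\"><tr><td>Total price</td><td>EUR</td><td>42</td></tr></table>";
            Document doc = Jsoup.parse(html);
            Element table = doc.selectFirst("table#X");              // table id from config
            Pattern keyPattern = Pattern.compile("Total\\s+price");  // row-matching regexp from config
            for (Element cell : table.select("td")) {
                if (keyPattern.matcher(cell.text()).find()) {
                    Element row = cell.parent();                     // parent row of the matching cell
                    String value = row.select("td").get(2).text();   // target column index from config
                    System.out.println(value);                       // prints "42"
                }
            }
        }
    }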
Having all this hardcoded for the most part (except column names, table ids, etc.) gives me the benefit of being easy to implement and more efficient than a generic parser; however, it is less configurable, and some changes in the target websites force me to touch code, which makes it harder to delegate the task.
Question
Is there any language (preferably with a Java implementation available) which allows rules for extractions like these to be defined consistently? I'm using CSS-style selectors for some tasks, but others are not so simple, so my best guess is that there must be something extending that which would allow a non-programmer maintainer to add/modify rules on demand.
I would accept a Nutch-based answer, if there is one, as we're considering migrating our crawlers to Nutch, although I'd prefer a generic Java solution.
I was thinking about writing a parser generator and creating my own set of rules to allow users/maintainers to generate parsers, but it really feels like reinventing the wheel for no reason.
I'm doing something somewhat similar - not exactly what you're searching for, but maybe you can get some ideas.
First the crawling part:
I'm using Scrapy on Python 3.7.
For my project, that brought the advantage that it's very flexible and an easy crawling framework to build upon. Things like delays between requests, HTTP header language, etc. can mostly be configured.
For the information extraction part and rules:
In my last generation of crawler (I'm now working on the 3rd generation; the 2nd one is still running but is not as scalable) I used JSON files to hold the XPath / CSS rules for every page. So on starting my crawler, I loaded the JSON file for the specific page currently being crawled, and a generic crawler knew what to extract based on the loaded JSON file.
This approach isn't easily scalable since one config file per domain has to be created.
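A minimal sketch of what such a per-domain rules file could look like (the field names and selectors here are made up purely for illustration):

    {
      "domain": "example.com",
      "rules": {
        "title": { "type": "css",   "selector": "h1.product-title" },
        "price": { "type": "xpath", "selector": "//table[@id='specs']//tr[td='Price']/td[2]" }
      }
    }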
Currently, I'm still using Scrapy, with a starting list of 700 domains to crawl, and the crawler is now only responsible for downloading the whole website as HTML files.
These are being stored in tar archives by a shell script.
Afterward, a Python script goes through all members of the tar archives and analyzes the content for the information I'm looking to extract.
Here, as you said, it's a bit like re-inventing the wheel or writing a wrapper around an existing library.
In Python, one can use BeautifulSoup for removing all tags like script and style etc.
Then you can extract, for instance, all the text.
Or you could focus on tables only first, extract all tables into dicts, and then analyze them with regexes or similar.
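Since your stack is Java, the same tag-stripping step could be done with Jsoup instead of BeautifulSoup; just a rough sketch:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class StripTagsExample {
        public static void main(String[] args) {
            String html = "<html><head><script>var x=1;</script></head>"
                        + "<body><style>p{color:red}</style><p>Useful text</p></body></html>";
            Document doc = Jsoup.parse(html);
            doc.select("script, style").remove(); // drop script and style elements entirely
            String text = doc.text();             // remaining visible text, tags removed
            System.out.println(text);             // prints "Useful text"
        }
    }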
There are libraries like DragNet for boilerplate removal.
And there are some specific approaches on how to extract table structured information.
I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily done with ftp4j and a document name list), but now we need to sort through the data from the server. The client does work in contracts and wants us to pull out relevant information such as: licensor, licensee, product, agreement date, termination date, royalties, restrictions.
Since the documents are completely unstandardized, is that even possible to do? I can imagine loading in the files and searching it but I would have no idea how to pull out information from a paragraph such as the licensor and restrictions on the agreement. These are not hashes but instead are just long contracts. Even if I were to search for 'Licensor' it will come up in the document multiple times. The documents aren't even in a consistent file format. Some are PDF, some are text, some are html, and I've even seen some that were as bad as being a scanned image in a pdf.
My boss keeps pushing for me to work on this project but I feel as if I am out of options. I primarily do web and mobile so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about at the very minimum 1000 documents). I have been working on this in Java.
I'll do my best to give you some information, as this is not my area of expertise. I would strongly consider writing a script that identifies the type of file you are dealing with and then calls the appropriate parsing methods to handle what you are looking for.
Since you are dealing with big data, python could be pretty useful. Javascript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this;
If you are concerned only with Licensor followed by a name, you could identify the format of that particular instance and search for something similar using the regex you create. This can be extrapolated to other instances of searching.
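For example, if the contracts tend to phrase it as "Licensor: Some Name", a rough sketch of that regex idea in Java (the pattern and sample text are only guesses and would need tuning against your real documents):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LicensorExtractor {
        public static void main(String[] args) {
            String text = "This Agreement is made between Acme Corp (\"Licensor\") ... "
                        + "Licensor: Acme Corp. Licensee: Beta LLC.";
            // Very rough pattern: "Licensor" followed by ':' or 'is', then capture the name.
            Pattern p = Pattern.compile("Licensor\\s*(?::|is)\\s*([A-Z][\\w& -]+)");
            Matcher m = p.matcher(text);
            while (m.find()) {
                System.out.println("Candidate licensor: " + m.group(1).trim()); // prints "Acme Corp"
            }
        }
    }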
For getting text from an image, try using the APIs on this page:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, once you have the text out of the PDF, you should be able to search through it using a regex. That would be my method of attack, or possibly using String.split() and a StringBuffer that you can append to.
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
If you need anything else, let me know. I'll do my best to find it!
Apache Tika can extract plain text from almost any commonly used file format.
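A minimal sketch of that extraction step, using Tika's facade class (the file name is just a placeholder):

    import java.io.File;
    import org.apache.tika.Tika;

    public class ExtractText {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // Tika auto-detects the format (PDF, HTML, DOC, plain text, ...) and returns plain text.
            String text = tika.parseToString(new File("contracts/agreement-001.pdf"));
            System.out.println(text);
        }
    }

Note that for the scanned-image PDFs mentioned in the question, Tika alone won't give you text; those would need an OCR step (e.g. Tesseract) first.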
But with the situation you describe, you would still need to analyze the text, as in "natural language recognition". That's a field where, despite some advances (made by dedicated research teams spending many person-years!), computers still fail pretty badly (heck, even humans fail at it sometimes).
With the number of documents you mentioned (1000s), hire a temp worker and have them sorted/tagged by human brain power. It will be cheaper and you will have fewer misclassifications.
You can use Tika for text extraction. If there is a fixed pattern, you can extract information using regex or XPath queries. Another solution is to use Solr, as shown in this video. You don't need Solr, but watch the video to get the idea.
I am trying to parse out sentences from a huge amount of text. Using Java, I started off with NLP tools like OpenNLP and Stanford's parser.
But here is where I get stuck: though both these parsers are pretty great, they fail when it comes to non-uniform text.
For example, in my text most sentences are delimited by a period, but in some cases, like bullet points, they aren't. Here both the parsers fail miserably.
I even tried setting the option in the Stanford parser for multiple sentence terminators, but the output was not much better!
Any ideas??
Edit: To make it simpler, I am looking to parse text where the delimiter is either a new line ("\n") or a period (".") ...
First you have to clearly define the task. What, precisely, is your definition of 'a sentence?' Until you have such a definition, you will just wander in circles.
Second, cleaning dirty text is usually a rather different task from 'sentence splitting'. The various NLP sentence chunkers are assuming relatively clean input text. Getting from HTML, or extracted powerpoint, or other noise, to text is another problem.
Third, Stanford and other large caliber devices are statistical. So, they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher the error rate.
Write a custom sentence splitter. You could use something like the Stanford splitter as a first pass and then write a rule based post-processor to correct mistakes.
I did something like this for biomedical text I was parsing. I used the GENIA splitter and then fixed stuff after the fact.
EDIT: If you are taking in input HTML, then you should preprocess it first, for example handling bulleted lists and stuff. Then apply your splitter.
There's one more excellent toolkit for natural language processing - GATE. It has a number of sentence splitters, including the standard ANNIE sentence splitter (which doesn't fit your needs completely) and a RegEx sentence splitter. Use the latter for any tricky splitting.
The exact pipeline for your purpose is:
Document Reset PR.
ANNIE English Tokenizer.
ANNIE RegEx Sentence Splitter.
Also you can use GATE's JAPE rules for even more flexible pattern searching. (See Tao for full GATE documentation).
If you would like to stick with Stanford NLP or OpenNLP, then you'd better retrain the model. Almost all of the tools in these packages are machine-learning based. Only with customized training data can they give you an ideal model and good performance.
Here is my suggestion: manually split the sentences based on your criteria. I guess a couple of thousand sentences is enough. Then call the API or command line to retrain the sentence splitter. Then you're done!
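With OpenNLP, for example, the retraining step can be done from the command line, assuming a training file with one sentence per line (the file names here are placeholders):

    opennlp SentenceDetectorTrainer -model custom-sent.bin -lang en -data sentences.train -encoding UTF-8

The resulting custom-sent.bin can then be loaded into SentenceDetectorME in place of the stock model.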
But first of all, one thing you need to figure out is, as said in a previous answer: "First you have to clearly define the task. What, precisely, is your definition of 'a sentence'?"
I'm using Stanford NLP and OpenNLP in my project, Dishes Map, a delicious-dish discovery engine based on NLP and machine learning. They're working very well!
For a similar case, what I did was separate the text into different lines (split on new lines) based on where I want the text to split. In your case that is text starting with bullets (or, more exactly, text with a line-break tag at the end). This will also solve the similar problem that may occur if you are working with HTML for the same content.
And after separating those into different lines, you can send the individual lines for sentence detection, which will be more accurate.
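A minimal sketch of that two-step approach in Java, assuming OpenNLP's sentence detector and its pre-trained en-sent.bin model (the file path is a placeholder):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class PreSplitSentences {
        public static void main(String[] args) throws Exception {
            InputStream modelIn = new FileInputStream("en-sent.bin"); // pre-trained OpenNLP model
            SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(modelIn));

            String text = "First bullet point\nSecond bullet point\n"
                        + "A normal paragraph. It has two sentences.";

            // Step 1: split on new lines, so bullet / <br> boundaries are handled before NLP sees them.
            for (String line : text.split("\\n")) {
                // Step 2: run the statistical sentence detector on each line separately.
                for (String sentence : detector.sentDetect(line)) {
                    System.out.println(sentence);
                }
            }
            modelIn.close();
        }
    }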
I would like to index a set of documents that will contain semi-structured data, typically key-value pairs, something like #author Joe Bloggs. These keywords should then be available as searchable attributes of the document which can be queried individually.
I have been looking at Lucene and I'm able to build an index over the documents I'm interested in but I'm not sure how best to proceed with the next step of keyword extraction.
Is there a common approach for doing this in Lucene or another indexing system? I'd like to be able to search over the documents using a typical word search, as I'm already able to, and so would like something more than a custom regex extraction.
Any help would be greatly appreciated.
Niall
I wrote a source code search engine using Lucene as part of my bachelor thesis. One of the key features was that the source code was treated as structured information, and therefore should be searchable as such, i.e. searchable according to attributes as you describe above.
Here you can find more information about this project. If that is too extensive for you, I can sum up some things:
I created separate search fields for all the attributes which should be searchable. In my case those were, for example, 'method name', 'commentary', or 'class name'.
It can be advantageous to have the content of these fields overlap; however, this will blow up your index (but only linearly with the occurrence of redundant data in searchable fields).
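In Lucene terms that roughly means adding one field per extracted attribute next to the full text when building each Document; a minimal sketch (the field names and analyzer choice are just an example):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class IndexWithAttributes {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("index")),
                    new IndexWriterConfig(new StandardAnalyzer()));

            Document doc = new Document();
            // Attribute extracted from a "#author Joe Bloggs" style key-value pair.
            doc.add(new TextField("author", "Joe Bloggs", Field.Store.YES));
            // The full document body, for ordinary word searches.
            doc.add(new TextField("body", "full text of the document ...", Field.Store.YES));
            writer.addDocument(doc);
            writer.close();

            // A query such as author:"joe bloggs" then targets just that attribute,
            // while body:someword keeps the ordinary full-text search working.
        }
    }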
I am new to IR techniques.
I am looking for a Java-based API or tool that does the following.
Download the given set of URLs
Extract the tokens
Remove the stop words
Perform Stemming
Create Inverted Index
Calculate the TF-IDF
Kindly let me know how Lucene can be helpful to me.
Regards
Yuvi
You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.
Actually, TF-IDF is a score given to a term in a document, rather than the whole document.
If you just want the TF-IDF per term in a document, maybe use this method, without ever touching Lucene.
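For reference, the usual weighting is simple to compute directly; a minimal sketch using the common tf * log(N / df) form (other variants exist):

    public class TfIdf {
        /**
         * @param termFreq how often the term occurs in this document
         * @param docCount total number of documents in the collection (N)
         * @param docFreq  number of documents containing the term (df)
         */
        static double tfIdf(int termFreq, int docCount, int docFreq) {
            double tf = termFreq;                                // raw term frequency
            double idf = Math.log((double) docCount / docFreq);  // inverse document frequency
            return tf * idf;
        }

        public static void main(String[] args) {
            // e.g. a term appearing 3 times in a document, in a 1000-document
            // collection where 10 documents contain it:
            System.out.println(tfIdf(3, 1000, 10)); // approx. 3 * ln(100) = 13.8
        }
    }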
If you want to create a search engine, you need to do a bit more (such as extracting text from the given URLs, whose corresponding documents would probably not contain raw text). If this is the case, consider using Solr.