Indexing semi-structured data


I would like to index a set of documents that contain semi-structured data, typically key-value pairs such as #author Joe Bloggs. These keywords should then be available as searchable attributes of the document that can be queried individually.
I have been looking at Lucene and I'm able to build an index over the documents I'm interested in but I'm not sure how best to proceed with the next step of keyword extraction.
Is there a common approach for doing this in Lucene or another indexing system? I'd like to keep searching the documents with a typical word search, as I can already, so I'm looking for something more than a custom regex extraction.
Any help would be greatly appreciated.
Niall

I wrote a source code search engine using Lucene as part of my bachelor thesis. One of the key features was that the source code was treated as structured information and therefore had to be searchable as such, i.e. searchable by attributes as you describe above.
Here you can find more information about this project. If that is too extensive for you, I can sum up some things:
I created separate search fields for all the attributes that should be searchable. In my case those were, for example, 'method name', 'commentary' or 'class name'.
It can be advantageous to let the content of these fields overlap; however, this will inflate your index (though only linearly with the amount of redundant data in the searchable fields).
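As a rough illustration of the "one field per attribute" idea, here is a minimal sketch using only the Java standard library. It assumes a hypothetical convention in which every attribute sits on its own line as "#key value"; the class name AttributeExtractor and the "body" field name are made up for this example. Each entry of the resulting map could then be added to a Lucene Document as its own field, with the remaining text going into a general full-text field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Splits a document into "#key value" attribute fields and a free-text body. */
final class AttributeExtractor {
    // Assumed convention: an attribute is a whole line of the form "#key value".
    private static final Pattern ATTRIBUTE = Pattern.compile("^#(\\w+)\\s+(.+)$");

    /** Returns field name -> values; all non-attribute text is stored under "body". */
    static Map<String, List<String>> toFields(String text) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        StringBuilder body = new StringBuilder();
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            Matcher m = ATTRIBUTE.matcher(line);
            if (m.matches()) {
                // Attribute line: file the value under its (lower-cased) key.
                fields.computeIfAbsent(m.group(1).toLowerCase(), k -> new ArrayList<>())
                      .add(m.group(2).trim());
            } else if (!line.isEmpty()) {
                // Ordinary text: keep it for the normal word search.
                body.append(line).append('\n');
            }
        }
        fields.computeIfAbsent("body", k -> new ArrayList<>()).add(body.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        String doc = "#author Joe Bloggs\n#year 2010\nLucene indexes documents as sets of fields.";
        System.out.println(toFields(doc));
    }
}
```

Keeping the body separate from the attributes means the ordinary word search keeps working, while a query can also be restricted to a single attribute.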

Related

Configurable HTML information extraction

Scenario:
I'm doing some HTML information extraction using a crawler. Right now, most of the extraction rules are hardcoded (not the tags or things like that, but loops, nested elements, etc.)
For instance, one common task is as follows:
Obtain the table with ID X. If it doesn't exist, additional fallback mechanisms may be triggered to find the info
Find a row which contains some info. Usually the match is a regexp against a specific column.
Retrieve the data in a different column (usually marked in the td, or previously detected in the header)
The way I'm currently doing so is:
Query to get the body of first table with id X (X is in config file). Some websites of my list are buggy and duplicate that id on elements different than table -.-
Iterate over interesting cells, executing regexp on cell.text() (regexp is in config file)
Get the parent row of the matching cells, and obtain the cell I need from the row (identifier of the row is in config file)
Having all this hardcoded for the most part (except column names, table ids, etc.) gives me the benefit of being easy to implement and more efficient than a generic parser. However, it is less configurable, and some changes in the target websites force me to deal with code, which makes it harder to delegate the task.
Question
Is there any language (preferably with a Java implementation available) that allows such extraction rules to be defined consistently? I'm using CSS-style selectors for some tasks, but others are not so simple, so my best guess is that there must be something extending them that would let a non-programmer maintainer add or modify rules on demand.
I would accept a Nutch-based answer, if there is one, as we're studying migrating our crawlers to Nutch, although I'd prefer a generic Java solution.
I was thinking about writing a Parser generator and create my own set of rules to allow users/maintainers to generate parsers, but it really feels like reinventing the wheel for no reason.

I'm doing something somewhat similar - not exactly what you're searching for, but maybe you can get some ideas.
First the crawling part:
I'm using Scrapy on Python 3.7.
For my project, the advantage was that it's a very flexible crawling framework that is easy to build upon. Things like delays between requests, the HTTP header language, etc. can mostly be configured.
For the information extraction part and rules:
In my last generation of crawler (I'm now working on the 3rd gen; the 2nd one is still running but is not as scalable) I used JSON files to hold the XPath / CSS rules for every page. On startup, the crawler loaded the JSON file for the specific page being crawled, and a generic crawler knew what to extract based on that file.
This approach isn't easily scalable since one config file per domain has to be created.
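To make the "rules in a config file, generic extractor in code" idea concrete, here is a small Java sketch based on the standard javax.xml.xpath API. It is only an illustration under strong assumptions: the page is already well-formed XHTML (real-world HTML would first need a lenient parser), the rules arrive as a plain name-to-XPath map (which could be loaded from JSON or a properties file), and the class name RuleDrivenExtractor is invented for this example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Applies named XPath rules (e.g. loaded from a config file) to a well-formed page. */
final class RuleDrivenExtractor {

    /** Returns rule name -> trimmed text of the first node matched by that rule. */
    static Map<String, String> extract(String xhtml, Map<String, String> rules) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                // evaluate(String, Object) yields the string value of the first match,
                // or an empty string when nothing matches.
                result.put(rule.getKey(), xpath.evaluate(rule.getValue(), doc).trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not apply extraction rules", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><table id=\"X\">"
                + "<tr><td>Name</td><td>Widget</td></tr>"
                + "<tr><td>Price</td><td>42 EUR</td></tr>"
                + "</table></body></html>";
        Map<String, String> rules = new LinkedHashMap<>();
        // "In table X, find the row whose first cell is 'Price' and return its second cell."
        rules.put("price", "//table[@id='X']//tr[td[1]='Price']/td[2]");
        System.out.println(extract(page, rules));
    }
}
```

A single XPath expression can cover the "find the table, match a row on one column, read another column" pattern, so a maintainer only edits the rule strings. XPath 1.0 has no regular expressions, though, so regexp-based matching would still need a post-processing step in code.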
Currently, I'm still using Scrapy, with a starting list of 700 Domains to crawl and the crawler is now only responsible for downloading the whole website as HTML files.
These are being stored in tar archives by a shell script.
Afterward, a Python script goes through all members of the tar archives and analyzes the content for the information I'm looking to extract.
Here, as you said, it's a bit like re-inventing the wheel or writing a wrapper around an existing library.
In Python, one can use BeautifulSoup for removing all tags like script and style etc.
Then you can extract for instance all text.
Or you'd focus first on tables only, extract all tables into dicts and can then analyze with regex or similar.
There are libraries like DragNet for boilerplate removal.
And there are some specific approaches on how to extract table structured information.

Indexing NTriples with Lucene

A part of my project is to index the s-p-o in ntriples and I need some help figuring out how exactly to do this via Java (or other language if possible).
Problem statement:
We have around 10 files with the extension “.ntriple”, each containing at least 10k triples. Each file consists of multiple RDF triples, one per line:
<subject1_uri> <predicate1_uri> <object1_uri>
<subject2_uri> <predicate1_uri> <object2_uri>
<subject2_uri> <predicate1_uri> <object3_uri>
…..
…..
What I need to perform is, index each of these subjects, predicates and objects so that we can have a fast search and retrieve for queries like “Give me all subjects and objects for predicate1_uri” and so on.
I gave it a try using this example but I saw that this was doing a Full Text Search. This doesn't seem to be efficient as the ntriple files could be as large as 50MB per file.
Then I thought of NOT doing a full text search, instead just store s-p-o as an index Document and each (s,p,o) as a Document Field with another Field as Id (offset of the s-p-o in corresponding ntriple file).
I have two questions:
Is Lucene the only option for what I am trying to achieve?
Would the size of the index files themselves be larger than half the size of the data itself?
Any and all help really appreciated.

To answer your first question: No, Lucene is not the only option to do this. You can (and probably should) use any generic RDF database to store the triples. Then you can query the triples using their Java API or using SPARQL. I personally recommend Apache Jena as a Java API for working with RDF.
If you need free-text search across literals in your dataset, there is Lucene Integration with Apache Jena through Jena Text.
Regarding index sizes, this depends entirely on the entropy of your data. If you have 40,000 lines in an NTRIPLE file, but it's all replications of the same triples, then the index will be relatively small. Typically, however, RDF databases make multiple indexes of the data and you will see a size increase.
The primary benefit of this indexing is that you can ask more generic questions than “Give me all subjects and objects for predicate1_uri”. That question can be answered by linearly processing all of the NTRIPLE files without even knowing you're using RDF. The following SPARQL-like query shows an example of a more difficult search facilitated by these data stores:
SELECT DISTINCT ?owner
WHERE {
  ?owner :owns ?thing .
  ?thing rdf:type/rdfs:subClassOf* :Automobile .
  ?thing :hasColor "red"@en .
}
In the preceding query, we locate owners of something that is an automobile, or any more specific subclass of automobile, so long as the color of that thing is "red" (as specified in English).
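For the simple lookups from the question (for example, everything attached to predicate1_uri), the "one index per position" idea can be sketched in plain Java without Lucene or an RDF store. This is only an in-memory illustration: TripleIndex is a made-up name, and the parser assumes every term is a URI in angle brackets, so literals and blank nodes are silently skipped. A real RDF library such as Jena handles the full N-Triples grammar.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** In-memory index of URI-only triples, keyed by subject, predicate and object. */
final class TripleIndex {
    // Assumption: "<s> <p> <o>" with an optional trailing dot; anything else is ignored.
    private static final Pattern TRIPLE =
            Pattern.compile("^\\s*<([^>]*)>\\s+<([^>]*)>\\s+<([^>]*)>\\s*\\.?\\s*$");

    private final Map<String, List<String[]>> bySubject = new HashMap<>();
    private final Map<String, List<String[]>> byPredicate = new HashMap<>();
    private final Map<String, List<String[]>> byObject = new HashMap<>();

    /** Parses one line and registers the triple in all three indexes. */
    void addLine(String line) {
        Matcher m = TRIPLE.matcher(line);
        if (!m.matches()) {
            return; // not a URI-only triple under the assumption above
        }
        String[] triple = {m.group(1), m.group(2), m.group(3)};
        bySubject.computeIfAbsent(triple[0], k -> new ArrayList<>()).add(triple);
        byPredicate.computeIfAbsent(triple[1], k -> new ArrayList<>()).add(triple);
        byObject.computeIfAbsent(triple[2], k -> new ArrayList<>()).add(triple);
    }

    /** Each lookup returns {subject, predicate, object} arrays. */
    List<String[]> withSubject(String s) {
        return bySubject.getOrDefault(s, Collections.emptyList());
    }

    List<String[]> withPredicate(String p) {
        return byPredicate.getOrDefault(p, Collections.emptyList());
    }

    List<String[]> withObject(String o) {
        return byObject.getOrDefault(o, Collections.emptyList());
    }

    public static void main(String[] args) {
        TripleIndex index = new TripleIndex();
        index.addLine("<subject1_uri> <predicate1_uri> <object1_uri>");
        index.addLine("<subject2_uri> <predicate1_uri> <object2_uri>");
        for (String[] t : index.withPredicate("predicate1_uri")) {
            System.out.println(t[0] + " -> " + t[2]);
        }
    }
}
```

Every triple is referenced from three maps, which hints at why index size tends to grow relative to the raw data once several access paths are supported.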

Custom index in Apache Solr

Suppose that, in addition to simple text terms, I want to retrieve some complex data from text. For example, the text can contain descriptions of graphs in some format. I then want to run queries that put conditions on those graphs (for example, I want to find all documents with planar graphs or something like that). It seems that the standard Solr index is not sufficient for such a task because, as I understand it, it ultimately treats a document as a set of tokens that are just strings, whereas I need an additional index with a better-suited format. So the question is: can I somehow customize how data is indexed in and retrieved from Solr? I've read a lot of documentation but could not find an answer.

Yes. You are able to define each field in the schema.xml file. Within that file, you can define what type of data is stored, how the document is tokenized, and how the tokenized data is manipulated. In order to meet your need, you will probably need to write a custom tokenizer and possibly custom filters as well.

Your best starting point is to look at the field definition of text_general in the schema. It has various tokenizers and filters that are applied to the text and help with indexing. You can define different analysis chains for indexing and for querying.
You need to know that tokenizers apply to the text, and filters apply to each token. You have descriptions of graphs in some format. Can you elaborate on the type of format, so that we can think of better ways? There are many existing tokenizers and filters available. Depending on the format, you can use existing ones or write your own.
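The split between a tokenizer (works on the whole text) and filters (work on one token at a time) can be illustrated in plain Java. To be clear, this is not the Solr or Lucene API: AnalysisChain and its methods are hypothetical names used only to show the shape of such a pipeline, here with a whitespace tokenizer, a lower-casing filter and a stop-word filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

/** Conceptual model of an analysis chain: one tokenizer, then per-token filters. */
final class AnalysisChain {
    private final Function<String, List<String>> tokenizer;
    // A filter may rewrite a token or drop it (empty Optional).
    private final List<Function<String, Optional<String>>> filters;

    AnalysisChain(Function<String, List<String>> tokenizer,
                  List<Function<String, Optional<String>>> filters) {
        this.tokenizer = tokenizer;
        this.filters = filters;
    }

    /** Tokenizes the text, then pushes every token through all filters in order. */
    List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenizer.apply(text)) {
            Optional<String> current = Optional.of(token);
            for (Function<String, Optional<String>> filter : filters) {
                current = current.flatMap(filter);
            }
            current.ifPresent(out::add);
        }
        return out;
    }

    /** Example chain: whitespace tokenizer, lower-case filter, stop-word filter. */
    static AnalysisChain simple(Set<String> stopWords) {
        Function<String, List<String>> whitespace = text -> {
            String trimmed = text.trim();
            if (trimmed.isEmpty()) {
                return new ArrayList<>();
            }
            return Arrays.asList(trimmed.split("\\s+"));
        };
        Function<String, Optional<String>> lowerCase =
                token -> Optional.of(token.toLowerCase(Locale.ROOT));
        Function<String, Optional<String>> stop =
                token -> stopWords.contains(token) ? Optional.<String>empty() : Optional.of(token);
        return new AnalysisChain(whitespace, Arrays.asList(lowerCase, stop));
    }

    public static void main(String[] args) {
        AnalysisChain chain = simple(new HashSet<>(Arrays.asList("the", "a")));
        System.out.println(chain.analyze("The planar Graph has a node"));
    }
}
```

For a graph description format, the custom part would most likely be the tokenizer, since it decides what counts as a unit (for instance an edge or a vertex) before any filter sees it.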

Do I need the original files used to create an index in Lucene?

First of all, as I explained in my profile, I'm not a native English speaker, so I hope you can forgive me if I make some grammar mistakes.
I'm trying to develop with the Apache Lucene API in Java. I'm able to write some index and search methods, but I'm still confused about how it works behind the scenes.
From what I know, Lucene doesn't care about the origin of the data. It just takes the data and indexes it. Let me ask with a simple example:
I want to index words from a .txt based dictionary. Once Lucene has made its indexes, do I need the source .txt dictionary anymore? How, exactly, do the indexes work?
Do the indexes contain the necessary content to perform searches without the original source? Or do the indexes only contain directions to where words are stored in the original source .txt dictionary file? I'm a little confused.

Once you have indexed everything, Lucene does not refer back to, or have further need of, any of the source documents. Everything it needs to operate is saved in its index directory. Many people use Lucene to index files, others database records, others online resources. Whatever your source, you always have to bring in the data yourself (or with some third-party tool) and construct Documents for Lucene to index, and nothing about a Document says anything about where it came from. So, not only does Lucene not need to refer back to the original data sources, it couldn't find them even if you wanted it to.
Many people's implementations do rely on having the original source present. It's not at all unusual for people to set up Lucene to index everything, but only store a file name, database id, or some similar pointer to the original source. This allows them to perform an effective full-text search through Lucene, while delegating storage of the full content to some other system.
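The "index everything, store only a pointer" setup can be modelled with a toy inverted index in plain Java. This is a conceptual sketch rather than how Lucene is implemented internally; PointerIndex and the "dictionary.txt#1" style ids are invented for the example. The point is that a search is answered from the term-to-id map alone, and the original text is only needed if you want to display it afterwards.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: terms are indexed, but only a pointer to the source is kept. */
final class PointerIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexes the content under sourceId; the content itself is not retained. */
    void add(String sourceId, String content) {
        // Simplification: split on non-word characters (ASCII-oriented).
        for (String term : content.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(sourceId);
            }
        }
    }

    /** Returns the ids of all sources containing the term, without touching any source. */
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        PointerIndex index = new PointerIndex();
        index.add("dictionary.txt#1", "Apple: a round fruit");
        index.add("dictionary.txt#2", "Banana: a long fruit");
        System.out.println(index.search("fruit"));
    }
}
```

After add returns, the dictionary file could be deleted and search would still work; what you would lose is the ability to show the original entry behind each returned id.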

Java API : downloading and calculating tf-idf for a given web page

I am new to IR techniques.
I am looking for a Java-based API or tool that does the following:
Download the given set of URLs
Extract the tokens
Remove the stop words
Perform Stemming
Create Inverted Index
Calculate the TF-IDF
Kindly let me know how Lucene can be helpful to me.
Regards
Yuvi

You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.

Actually, TF-IDF is a score given to a term in a document, rather than the whole document.
If you just want the TF-IDFs per term in document, maybe use this method, without ever touching Lucene.
If you want to create a search engine, you need to do a bit more (such as extracting text from the given URLs, whose corresponding documents would probably not contain raw text). If this is the case, consider using Solr.
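For the per-term scores alone, a few lines of plain Java are enough. The sketch below is one common TF-IDF variant, not the only definition: term frequency is the count divided by the document length, and inverse document frequency is ln(N / df). Downloading and stemming are left out, tokenization is a simple split on non-word characters, and the class name TfIdf is just a label for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Minimal TF-IDF: tf = count / document length, idf = ln(N / document frequency). */
final class TfIdf {
    private final List<List<String>> docs = new ArrayList<>();
    private final Map<String, Integer> docFreq = new HashMap<>();
    private final Set<String> stopWords;

    TfIdf(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    /** Tokenizes the text, drops stop words, and updates document frequencies. */
    void addDocument(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        docs.add(tokens);
        // Count each distinct term once per document.
        for (String term : new HashSet<>(tokens)) {
            docFreq.merge(term, 1, Integer::sum);
        }
    }

    /** Score of a (lower-case) term within the document at docIndex. */
    double score(String term, int docIndex) {
        List<String> tokens = docs.get(docIndex);
        int df = docFreq.getOrDefault(term, 0);
        if (tokens.isEmpty() || df == 0) {
            return 0.0;
        }
        double tf = (double) Collections.frequency(tokens, term) / tokens.size();
        double idf = Math.log((double) docs.size() / df);
        return tf * idf;
    }

    public static void main(String[] args) {
        TfIdf model = new TfIdf(new HashSet<>(Arrays.asList("the")));
        model.addDocument("the cat sat");
        model.addDocument("the dog sat");
        model.addDocument("the cat ran");
        System.out.println(model.score("dog", 1));
    }
}
```

Note that the score is always for a term in a particular document, which matches the point above that TF-IDF belongs to a term rather than to the whole document.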
