Configurable HTML information extraction


Scenario:
I'm doing some HTML information extraction using a crawler. Right now, most of the rules for extraction are hardcoded (not the tags or things like that, but loops, nested elements, etc.).
For instance, one common task is as follows:
Obtain the table with ID X. If it doesn't exist, there may be additional fallback mechanisms that are triggered to find the info.
Find a row which contains some info. Usually the match is a regexp against a specific column.
Retrieve the data in a different column (usually marked in the td, or previously detected in the header)
The way I'm currently doing so is:
Query to get the body of the first table with id X (X is in the config file). Some websites on my list are buggy and duplicate that id on elements other than the table -.-
Iterate over interesting cells, executing regexp on cell.text() (regexp is in config file)
Get the parent row of the matching cells, and obtain the cell I need from the row (identifier of the row is in config file)
Having all this hardcoded for the most part (except column names, table ids, etc.) gives me the benefit of being easy to implement and more efficient than a generic parser. However, it is less configurable, and some changes in the target websites force me to deal with code, which makes it harder to delegate the task.
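The steps above can be sketched as a single declarative rule whose four values could come from the config file. This is only a minimal sketch under stated assumptions: `TableRule` and its fields are hypothetical names, columns are addressed by 1-based index, and the input is assumed to be well-formed XHTML so that the JDK's XML and XPath classes can read it (real-world HTML would first need a lenient parser or a clean-up step).

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical rule: "in table X, find the row whose column M matches a regexp, return column V". */
class TableRule {
    private final String tableId;     // id of the table (must not contain a quote character)
    private final int matchColumn;    // 1-based index of the column the regexp is tested against
    private final Pattern matchRegex; // regexp applied to the text of the match column
    private final int valueColumn;    // 1-based index of the column whose text is returned

    TableRule(String tableId, int matchColumn, String matchRegex, int valueColumn) {
        this.tableId = tableId;
        this.matchColumn = matchColumn;
        this.matchRegex = Pattern.compile(matchRegex);
        this.valueColumn = valueColumn;
    }

    /** Returns the value cell of the first matching row, or null if no row matches. */
    String apply(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Only <table> elements are considered, because some sites reuse the id on other tags.
            String rowsQuery = "(//table[@id='" + tableId + "'])[1]//tr";
            NodeList rows = (NodeList) xpath.evaluate(rowsQuery, doc, XPathConstants.NODESET);
            for (int i = 0; i < rows.getLength(); i++) {
                NodeList cells = ((Element) rows.item(i)).getElementsByTagName("td");
                if (cells.getLength() < Math.max(matchColumn, valueColumn)) {
                    continue; // header rows and short rows are skipped
                }
                String probe = cells.item(matchColumn - 1).getTextContent().trim();
                if (matchRegex.matcher(probe).find()) {
                    return cells.item(valueColumn - 1).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be parsed as XHTML", e);
        }
    }
}
```

A maintainer would then only edit the four rule values, for example `new TableRule("prices", 1, "^Gadget", 2)`, while the traversal code stays fixed.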
Question
Is there any language (preferably with a Java implementation available) which allows rules for extractions like those to be defined consistently? I'm using CSS-style selectors for some tasks, but others are not so simple, so my best guess is that there must be something extending them that would allow a non-programmer maintainer to add/modify rules on demand.
I would accept a Nutch-based answer, if there's one, as we're studying migrating our crawlers to Nutch, although I'd prefer a generic Java solution.
I was thinking about writing a parser generator and creating my own set of rules to allow users/maintainers to generate parsers, but it really feels like reinventing the wheel for no reason.

I'm doing something somewhat similar - not exactly what you're searching for, but maybe you can get some ideas.
First the crawling part:
I'm using Scrapy on Python 3.7.
For my project, this had the advantage of being a very flexible crawling framework that is easy to build upon. Things like delays between requests, the HTTP header language, etc. can mostly be configured.
For the information extraction part and rules:
In the last generation of my crawler (I'm now working on the 3rd generation; the 2nd one is still running but is not as scalable) I used JSON files to hold the XPath / CSS rules for every page. On start-up, the crawler loaded the JSON file for the specific page being crawled, and a generic crawler knew what to extract based on that file.
This approach isn't easily scalable since one config file per domain has to be created.
Currently, I'm still using Scrapy, with a starting list of 700 Domains to crawl and the crawler is now only responsible for downloading the whole website as HTML files.
These are being stored in tar archives by a shell script.
Afterward, a Python script goes through all members of the tar archives and analyzes the content for the information I'm looking to extract.
Here, as you said, it's a bit like re-inventing the wheel or writing a wrapper around an existing library.
In Python, one can use BeautifulSoup for removing all tags like script and style etc.
Then you can extract for instance all text.
Or you could focus on tables first: extract all tables into dicts and then analyze them with regex or similar.
There are libraries like DragNet for boilerplate removal.
And there are some specific approaches on how to extract table structured information.

Related

Is it possible to do this type of search in Java

I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily done with ftp4j and a document name list), but now we need to sort through the data from the server. The client works with contracts and wants us to pull out relevant information such as: Licensor, Licensee, Product, Agreement date, termination date, royalties, restrictions.
Since the documents are completely unstandardized, is that even possible to do? I can imagine loading in the files and searching it but I would have no idea how to pull out information from a paragraph such as the licensor and restrictions on the agreement. These are not hashes but instead are just long contracts. Even if I were to search for 'Licensor' it will come up in the document multiple times. The documents aren't even in a consistent file format. Some are PDF, some are text, some are html, and I've even seen some that were as bad as being a scanned image in a pdf.
My boss keeps pushing for me to work on this project but I feel as if I am out of options. I primarily do web and mobile so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about at the very minimum 1000 documents). I have been working on this in Java.

I'll do my best to give you some information, although this is not my area of expertise. I would strongly consider writing a script that identifies the type of file you are dealing with, and then calls the appropriate parsing methods to handle what you are looking for.
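As a rough illustration of the "identify the type, then dispatch" idea, here is a minimal sketch that looks at the first bytes of a file. `DocumentSniffer` and `Kind` are hypothetical names, and the checks are deliberately crude assumptions: a PDF starts with `%PDF-`, HTML is guessed from a doctype or an `<html` tag, and everything else is treated as plain text. It cannot tell a text PDF from a scanned-image PDF; that still needs OCR after detection.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper that guesses a document type from its leading bytes. */
class DocumentSniffer {
    enum Kind { PDF, HTML, TEXT }

    /** Inspects at most the first 512 bytes; the caller picks a parser based on the result. */
    static Kind detect(byte[] head) {
        int length = Math.min(head.length, 512);
        // ISO-8859-1 maps every byte to one char, so binary content cannot break decoding.
        String start = new String(head, 0, length, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (start.startsWith("%pdf-")) {
            return Kind.PDF;
        }
        if (start.startsWith("<!doctype html") || start.contains("<html")) {
            return Kind.HTML;
        }
        return Kind.TEXT;
    }
}
```

The calling code would switch on the returned `Kind` and hand the file to a PDF, HTML, or plain-text extractor.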
Since you are dealing with big data, Python could be pretty useful. JavaScript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this.
If you are concerned only with Licensor followed by a name, you could identify the format of that particular instance and search for something similar using the regex you create. This can be extrapolated to other instances of searching.
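A minimal sketch of that idea, assuming one particular format: a line of its own that reads `Licensor: <name>` or `Licensee: <name>`. `ContractFields` is a hypothetical name, and real contracts will need more patterns than this one (for instance, names introduced inside a sentence).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical extractor for "Label: value" lines in contract text. */
class ContractFields {
    // (?i) ignores case, (?m) lets ^ and $ match at line boundaries.
    private static final Pattern LABELLED =
            Pattern.compile("(?im)^\\s*(Licensor|Licensee)\\s*:\\s*(.+?)\\s*$");

    /** Returns the first value found for each label, keyed by the lower-cased label. */
    static Map<String, String> extract(String text) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = LABELLED.matcher(text);
        while (m.find()) {
            // Later mentions such as "The Licensor shall ..." do not start a line with
            // the label followed by a colon, so they are not picked up here.
            found.putIfAbsent(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return found;
    }
}
```

Keeping only the first labelled occurrence is one way to cope with the word appearing many times in the same contract.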
For getting text from an image, try using the APIs on this page:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, note that a PDF is not plain text: its content streams are usually compressed, so extract the text first (for example with the approach in the link above) and then search the result with a regex. That would be my method of attack, or possibly using string.split() and a string buffer that you can append to.
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
If you need anything else, let me know. I'll do my best to find it!

Apache tika can extract plain text from almost any commonly used file format.
But with the situation you describe, you would still need to analyze the text, as in "natural language recognition". That's a field where, despite some advances (made by dedicated research teams spending many person-years!), computers still fail pretty badly (heck, even humans fail at it sometimes).
With the number of documents you mentioned (thousands), hire a temp worker and have the documents sorted/tagged by human brain power. It will be cheaper and you will have fewer misclassifications.

You can use Tika for text extraction. If there is a fixed pattern, you can extract information using regex or XPath queries. Another solution is to use Solr as shown in this video. You don't need Solr, but watch the video to get the idea.

how to create a new word from template with docx4j

I have the following scenario, and need some advice:
The user will input a Word document as a template, and provide some parameters at runtime so I can query my database and get data to fill the document.
So, there are two basic things I need to do:
Replace every key in the document with its respective result from the current query line.
"Merge" (copy? duplicate?) the existing document unchanged into itself (append) depending on how many rows I got from the query, replacing the keys in each new copy with the next row's values.
What is the best approach to do this? I've managed to do the replace part for now, by using unmarshallFromTemplate and providing it a HashMap.
But this way is a little bit tricky, because I need to add "${variable_name}" in the document, and sometimes Word splits "${" and "}" across different tags, causing issues.
I've read about custom XML binding, but didn't understand it completely. Do I need to generate a custom XML, inject it into the document (all of this at runtime) and call applyBindings? If so, how would I bind the fields in the document to the XML? By name?

docx4j includes VariablePrepare, which can tidy up your input docx so that your keys are not split across separate runs.
But, you would still be better off switching to content control data binding, particularly if you have repeated data (think for example of line items in an invoice). Disclosure: I champion this approach in docx4j.
To adopt the content control data binding approach:
1. Dream up an XML format which makes sense for your data, and write some code to convert the results of your database query into that format.
2. Modify your template, so that the content controls are bound to elements in your XML document. Ordinarily you'd use an authoring add-in for Word to help with this. (The technology Microsoft uses for binding is XPath, so how you bind depends on your XML structure, but, yes, you'd typically bind to the element name or ID.)
3. Now that you have your XML file and a suitable input docx, ContentControlsMergeXML contains the code you need to create an instance document at run time. There's also a version of this for a servlet environment at https://github.com/plutext/OpenDoPE-WAR
As an alternative to 1 & 2, there is also org.docx4j.model.datastorage.migration.FromVariableReplacement in current nightlies, which can convert your existing "${" document. Only to a standardised target XML format though.
If you have further questions, there is a forum devoted to this topic at http://www.docx4java.org/forums/data-binding-java-f16/

Extracting webpage information based on a template in Java

Right now I use Jsoup to extract certain information (not all the text) from some third-party webpages, and I do it periodically. This works fine until the HTML of a webpage changes; such a change forces a change in the existing Java code, which is tedious because these webpages change very frequently. It also requires a programmer to fix the Java code. Here is an example of the HTML of interest on a webpage:
<div>
<p><strong>Score:</strong>2.5/5</p>
<p><strong>Director:</strong> Bryan Singer</p>
</div>
<div>some other info which I dont need</div>
Now here is what I want to do, I want to save this webpage (an HTML file) locally and create a template out of it, like:
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
<div>some other info which I dont need</div>
Along with the actual URLs of the webpages these HTML templates will be the input to the Java program which will find out the location of these predefined keywords (e.g. {MOVIE_RATING}, {MOVIE_DIRECTOR}) and extract the values from the actual webpages.
This way I wouldn't have to modify the Java program every time a webpage changes, I will just save the webpage's HTML and replace the data with these keywords and rest will be taken care by the program. For example in future the actual HTML code may look like this:
<div>
<div><b>Rating:</b>**1/2</div>
<div><i>Director:</i>Singer, Bryan</div>
</div>
and the corresponding template will look like this:
<div>
<div><b>Rating:</b>{MOVIE_RATING}</div>
<div><i>Director:</i>{MOVIE_DIRECTOR}</div>
</div>
Also, creating this kind of template can be done by a non-programmer: anyone who can edit a file.
Now the question is, how can I achieve this in Java and is there any existing and better approach to this problem?
Note: While googling I found some research papers, but most of them require some prior learning data and accuracy is also a matter of concern.

The approach you gave is pretty much similar to the Gilbert's except
the regex part. I don't want to step into the ugly regex world, I am
planning to use template approach for many other areas apart from
movie info e.g. prices, product specs extraction etc.
1. The template you describe is not actually a "template" in the normal sense of the word: a set of static content that is dumped to the output with a bunch of dynamic content inserted within it. Instead, it is the "reverse" of a template: a parsing pattern that is slurped up and discarded, leaving the desired parameters to be found.
2. Because your web pages change regularly, you don't want to hard-code the content to be parsed too precisely, but want to "zoom in" on its essential features, making the minimum of assumptions, i.e. you want to commit to literally matching key text such as "Rating:" and treat interleaving markup such as "<b/>" in a much more flexible manner, ignoring it and allowing it to change without breaking.
3. When you combine (1) and (2), you can give the result any name you like, but IT IS parsing using regular expressions, i.e. the template approach IS the parsing approach using a regular expression: they are one and the same. The question is: what form should the regular expression take?
3A. If you use java hand-coding to do the parsing then the obvious answer is that the regular expression format should just be the java.util.regex format. Anything else is a development burden and is "non-standard" and will be hard to maintain.
3B. If you want to use an HTML-aware parser, then jsoup is a good solution. The problem is that you need more text/regular expression handling and flexibility than jsoup seems to provide. It seems too locked into specific HTML tags and structures and so breaks when pages change.
3C. You can use a much more powerful grammar-controlled general text parser such as ANTLR: a form of Backus-Naur inspired grammar is used to control the parsing, and generator code is inserted to process parsed data. Here, the parsing grammar expressions can be very powerful indeed, with complex rules for how text is ordered on the page and how text fields and values relate to each other. The power is beyond your requirements because you are not processing a language. And there's no escaping the fact that you still need to describe the ugly bits to skip, such as markup tags. And wrestling with ANTLR for the first time involves educational investment before you get productivity payback.
3D. Is there a java tool that just uses a simple template type approach to give a simple answer? Well a google search doesn't give too much hope https://www.google.com/search?q=java+template+based+parser&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a. I believe that any attempt to create such a beast will degenerate into either basic regex parsing or more advanced grammar-controlled parsing because the basic requirements for matching/ignoring/replacing text drive the solution in those directions. Anything else would be too simple to actually work. Sorry for the negative view - it just reflects the problem space.
My vote is for (3A) as the simplest, most powerful and flexible solution to your needs.
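To make the point in (3) concrete, here is a minimal sketch of how a `{PLACEHOLDER}` template could be compiled into a java.util.regex pattern: visible text is matched literally, markup and whitespace between tokens are skipped flexibly, and each placeholder becomes a capturing group. `TemplateExtractor` is a hypothetical name, and the sketch assumes placeholders are upper-case names in braces and that each value contains no `<` character.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical compiler from a placeholder template to a regular expression. */
class TemplateExtractor {
    // A template token is a tag, a {PLACEHOLDER}, or a run of visible text.
    private static final Pattern TOKEN =
            Pattern.compile("<[^>]*>|\\{([A-Z_]+)\\}|[^<{]+");
    // Between tokens, any mix of whitespace and tags may appear on the real page.
    private static final String GAP = "(?:\\s|<[^>]*>)*";

    private final Pattern pattern;
    private final List<String> names = new ArrayList<>();

    TemplateExtractor(String template) {
        StringBuilder regex = new StringBuilder();
        Matcher m = TOKEN.matcher(template);
        while (m.find()) {
            String token = m.group();
            if (m.group(1) != null) {
                // Placeholder: remember its name and capture text up to the next tag.
                names.add(m.group(1));
                regex.append(GAP).append("([^<]+)");
            } else if (!token.startsWith("<") && !token.trim().isEmpty()) {
                // Visible text such as "Score:" must match literally.
                regex.append(GAP).append(Pattern.quote(token.trim()));
            }
            // Tags in the template are dropped; GAP tolerates whatever markup is present.
        }
        pattern = Pattern.compile(regex.toString());
    }

    /** Returns placeholder name to extracted value, or an empty map if the page does not match. */
    Map<String, String> extract(String html) {
        Map<String, String> values = new LinkedHashMap<>();
        Matcher m = pattern.matcher(html);
        if (m.find()) {
            for (int i = 0; i < names.size(); i++) {
                values.put(names.get(i), m.group(i + 1).trim());
            }
        }
        return values;
    }
}
```

Because the tags in the template are ignored, a page that swaps `<strong>` for `<b>` still matches; a page that renames the label "Score:" does not, and the template would need to be updated.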

Not really a template-based approach here, but jsoup can still be a workable solution if you just externalize your Selector queries to a configuration file.
Your non-programmer doesn't even have to see HTML, just update the selectors in the configuration file. Something like SelectorGadget will make it easier to pick out what selector to actually use.

How can I achieve this in Java and is there any existing and better approach to this problem?
The template approach is a good approach. You gave all of the reasons why in your question.
Your templates would consist of just the HTML you want to process, and nothing else. Here's my example based on your example.
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
Basically, you would use Jsoup to process your templates. Then, as you use Jsoup to process the web pages, you check all of your processed templates to see if there's a match.
On a template match, you find the keywords in the processed template, then you find the corresponding values in the processed web page.
Yes, this would be a lot of coding, and more difficult than my description indicates. Your Java programmer will have to break this description down into simpler and simpler tasks until she or he can code the tasks.

If the web page changes frequently, then you'll probably want to confine your search for the fields like MOVIE_RATING to the smallest possible part of the page, and ignore everything else. There are two possibilities: you could either use a regular expression for each field, or you could use some kind of CSS selector. I think either would work and either "template" can consist of a simple list of search expressions, regex or css, that you would apply. Just roll through the list and extract what you can, and fail if some particular field isn't found because the page changed.
For example, the regex could look like this:
Score:.*?([0-9]\.[0-9]/[0-9])
(I haven't tested this.)
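A minimal sketch of the "list of search expressions" idea, assuming the list lives in a properties-style text (`field=regex`) that a maintainer can edit. `FieldRules` is a hypothetical name, each regex is assumed to contain exactly one capturing group, and the example patterns avoid backslashes because the properties format treats them as escape characters.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical rule list: one regex with one capturing group per field. */
class FieldRules {
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    /** Parses "field=regex" lines, for example the contents of a rules file. */
    FieldRules(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Sort the field names so that rules are applied in a stable order.
        for (String field : new TreeSet<>(props.stringPropertyNames())) {
            rules.put(field, Pattern.compile(props.getProperty(field)));
        }
    }

    /** Applies every rule and fails loudly if a field is missing, e.g. after a page change. */
    Map<String, String> extract(String html) {
        Map<String, String> values = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(html);
            if (!m.find()) {
                throw new IllegalStateException("No match for field '" + rule.getKey() + "'");
            }
            values.put(rule.getKey(), m.group(1).trim());
        }
        return values;
    }
}
```

Two rules for the movie example could read `MOVIE_RATING=Score:(?:<[^>]*>)*([0-9](?:[.][0-9])?/[0-9])` and `MOVIE_DIRECTOR=Director:(?:<[^>]*>)*([^<]+)`.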

Or you can try a different approach, using what I would call 'rules' instead of templates: for each piece of information that you need from the page, you can define jQuery expression(s) that extract the text. Often, when the page change is small, the same well-written jQuery expressions will still give the same results.
Then you can use Jerry (jQuery in Java), with almost the same expressions, to fetch the text you are looking for. So it's not only about selectors; you also have other jQuery methods for walking/filtering the DOM tree.
For example, a rule for some Director text would be (in a sort of pseudo-Java-Jerry code):
$.find("div#movie").find("div:nth-child(2)")....text();
There could be more (and more complex) expressions in the rule, spread across several lines, that for example iterate some nodes etc.
If you are OO person, each rule may be defined in its own implementation. If you are groovy person, you can even rewrite rules when needed, without recompiling your project, and still being in java. Etc.
As you see, the core idea here is to define rules for how to find your text, and not to match against patterns, as that may be fragile to minor changes: imagine if just a space has been added between two divs :). In this example of mine, I've used jQuery-like syntax (actually, it's Jerry-like syntax, since we are in Java) to define rules. This is only because jQuery is popular and simple, and known by your web developer too; in the end you can define your own syntax (depending on the parsing tool you are using). For example, you may parse HTML into a DOM tree and then write rules using your own helper methods to traverse it to the place of interest. Jerry also gives you access to the underlying DOM tree.
Hope this helps.

I used the following approach to do something similar in a personal project of mine that generates an RSS feed from the leading real estate website in Spain.
Using this tool I found the rented place I'm currently living in ;-)
Get the HTML code from the page
Transform the HTML into XHTML. I used a library for this; I guess there might be better options available today
Use XPath to navigate the XHTML to the information you're interested in
Of course every time they change the original page you will have to change the XPath expression. The other approach I can think of -semantic analysis of the original HTML source- is far, far beyond my humble skills ;-)

Parsing very large XML files and marshalling to Java Objects

I have the following issue: I have very large XML files (300+ MB), and I need to parse them in order to add some of their values to the DB. The structure of these files is also very complex. I want to use a StAX parser, as it offers the nice possibility of pull-parsing (and thus processing) only parts of the XML file at a time, without loading the whole thing into memory. On the other hand, getting the values with StAX (at least on these XML files) is cumbersome: I need to write a ton of code. From this latter point of view it would help me immensely if I could unmarshal the XML file to Java objects (like JAXB does); however, this would load the whole file plus a ton of object instances into memory all at once.
My question is: is there some way to pull-parse (or just partially parse) the file sequentially, and then unmarshal only those parts to Java objects so I can deal with them easily without bogging down on memory?

I would recommend Eclipse EMF, but it has the same problem: if you give it the file name, it will parse the whole thing. There are some options to reduce how much is loaded, but I didn't bother much as we run on machines with 96 GB RAM. :)
Anyway, if your XML format is well defined, then one workaround is to fool EMF by breaking the whole file down into several smaller (but still well-formed) XML snippets, then feeding each snippet in one after the other. I don't know JAXB, but perhaps the same workaround can be applied there as well, which I would recommend, because EMF is too big a hammer for such a small issue.
Just to elaborate a bit if your XML looks like this:
<tag1>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
............
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
</tag1>
Then it can be broken down into one XML document per record, each starting with <tag2> and ending with </tag2>. In Java most parsers accept a stream, so just parse using whatever you want: create a string-backed stream or similar for each <tag2> in a loop and pass it to JAXB or EMF.
HTH

Well, first off I want to thank the two people who answered my question, but I ended up not using those suggestions. This is partly because the proposed technologies are a bit far from, let's say, "standard XML parsing" in Java, and it feels weird going so far when there's a similar tool already present in Java; and partly because I did in fact find a solution that uses only Java APIs to accomplish this.
I will not detail too much the solution I found, because I've already finished the implementation, and it's quite a big chunk of code to place here (I use Spring Batch on top of it all, with a ton of configuration and stuff).
I will however make a small comment on what I finally ended up doing:
The big idea here is that if you have an XML document AND its corresponding XSD schema, you can parse and unmarshal it with JAXB, and you can do it in chunks; those chunks can be read with an event parser such as StAX and then passed to the JAXB Unmarshaller.
This practically means that you must first decide where there is a good place in your XML file where you can say "this part here has A LOT of repetitive structure, I will treat those repetitions one at a time". Those repetitive parts are usually the same (child) tag repeated a lot inside a parent tag. So all you have to do is make an event listener in your StAX parser that is triggered at the start of each of those child tags, then stream the content of that child tag over to JAXB, unmarshal it with JAXB and process it.
Really the idea is excellently described in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6 which at that time was pretty new, so version-wise it's not that old at all):
http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/
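The chunking idea can be sketched with the JDK's StAX API alone. This is not the implementation described above: JAXB is no longer bundled with recent JDKs, so a hand-written mapper stands in for the JAXB step, and `ChunkedXmlReader` is a hypothetical name. The sketch assumes each repeated record element has only flat, text-only children. With JAXB on the classpath, the call to `readRecord` is where `Unmarshaller.unmarshal(XMLStreamReader, Class)` would go.

```java
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper that streams a large XML file one repeated record at a time. */
class ChunkedXmlReader {

    /** Calls the handler once per recordTag element and returns the number of records seen. */
    static int forEachRecord(Reader xml, String recordTag, Consumer<Map<String, String>> handler) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(recordTag)) {
                    // Only this one record is materialised; the rest of the file stays unread.
                    handler.accept(readRecord(reader));
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML stream", e);
        }
    }

    /** Stand-in for the JAXB step: maps the flat children of the current record to a map. */
    private static Map<String, String> readRecord(XMLStreamReader reader) throws XMLStreamException {
        Map<String, String> fields = new LinkedHashMap<>();
        // nextTag() skips whitespace and stops at a child start tag or at the record's end tag.
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            String name = reader.getLocalName();
            fields.put(name, reader.getElementText());
        }
        return fields;
    }
}
```

Memory use then depends on the size of one record rather than on the size of the whole file, which is the property the chunked approach is after.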

Document projection might be the answer here. Saxon and a number of other XQuery processors offer this as an option. If you have a reasonably simple query that selects a small amount of data from a large document, the query processor analyses the query to work out which parts of the tree need to be available for the query, and which can be discarded during processing. The resulting tree can often be only 1% of the size of the full document. Details for Saxon here:
http://saxonica.com/documentation/sourcedocs/projection.xml

File-based Document Storage in android

I'm in the early stages of a note-taking application for android and I'm hoping that somebody can point me to a nice solution for storing the note data.
Ideally, I'm looking to have a solution where:
Each note document is a separate file (for dropbox syncing)
A note can be composed of multiple pages
Note pages can have binary data (such as images)
A single page can be loaded without having to parse the entire document into memory
Thread-safety: Multiple reads/writes can occur at the same time.
XML is out (at least for the entire file), since I don't have a good way to extract a single page at a time. I considered using zip files, but (especially when compressed) I think they'd be stuck loading the entire file as well.
It seems like there should be a Java library out there that does this, but my google-fu is failing me. The only other alternative I can think of is to make a separate sqlite database for every note.
Does anybody know of a good solution to this problem? Thanks!

Seems like a relational database would work here. You just need to play around with the schema a little.
Maybe make a Pages table with each page including, say, a field for the document it belongs to and a field for its order in the document. Pages could also have a field for binary data, which might be contained in another table. If the document itself has additional data, maybe you have a table for documents too.
I haven't used SQLite transactions on an Android device, but it seems like that would be a good way to address thread safety.

I would recommend using SQLite to store the documents. Ultimately, it'll be easier than trying to deal with file I/O every time you access the note. Then, when somebody wants to upload to Dropbox, you generate the file on the fly and upload it. It would make sense to have a Notes table and a Pages table, at least. That way you can load each page individually, and a note is just a collection of pages anyway. Additionally, you can store images as BLOBs in the database for a particular page. Basically, if you only want one type of content per page, then you would have, in the Pages table, something like an id column and a content column. Alternatively, if you wanted to support something more complex, such as multiple types of content, then you would need to make your pages a collection of something else, like "entities."
IMO, a relational database is going to be the easiest way to accomplish your requirement of reading from particular pages without having to load the entire file.
