I am working on a project where I have emailed receipts from various courier agents. The emails are in HTML format.
However, they do not all follow a single structure; each email has a different format. I tried jsoup to extract the data, but it is difficult to write a separate extraction routine for each specific type of HTML. I need to extract the name, from-location, to-location, organization, and a few other details from the mail. I tried OpenNLP, but it does not recognize all locations and names; it catches some of the locations only when they appear in sentence form.
Can I create my own training data with HTML content in it, annotate it, and train a model to detect locations and names based on the HTML structure I have in the training data?
I think your initial approach is worth pursuing. I see a two-step option here:
Get the 'text' content of the mail using Jsoup. An example of that is here: Get Text from html Using Jsoup.
Use OpenNLP or StanfordNLP NER to extract the named entities: locations, names, etc. (a minimal sketch of both steps follows below).
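To make those two steps concrete, here is a minimal sketch. It assumes jsoup and OpenNLP are on the classpath and that you have downloaded the pre-trained en-ner-location.bin model; the sample HTML string and the model path are placeholders, not part of your actual data.

import java.io.FileInputStream;
import org.jsoup.Jsoup;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class MailEntityExtractor {
    public static void main(String[] args) throws Exception {
        // Step 1: strip the HTML markup and keep only the visible text.
        String html = "<html><body><p>Shipment for John Smith from Chicago to Boston.</p></body></html>";
        String text = Jsoup.parse(html).text();

        // Step 2: run an OpenNLP name finder over the tokens.
        // "en-ner-location.bin" is the pre-trained location model; point the
        // path at wherever you keep it.
        try (FileInputStream modelIn = new FileInputStream("en-ner-location.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(text);
            Span[] spans = finder.find(tokens);
            for (String location : Span.spansToStrings(spans, tokens)) {
                System.out.println("Location: " + location);
            }
        }
    }
}

The same pattern works for names by loading en-ner-person.bin instead.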
Another option involves playing around with the parse tree generated from the sentences and seeing whether there is a pattern for the data you're looking to extract.
As regards getting the from-location and to-location, you can try generating a parse tree for the sentences; there is an excellent example of that here: Extract noun phrase from Sentences OpenNLP. Just change the code at line 65 so that it gets the PP (prepositional phrase), as it currently gets the NP (noun phrase).
You'll notice that the from-location and to-location are prepositional phrases (from and to are prepositions). Once you get the prepositional phrases from the sentences, you can try to extract the noun component (after the preposition) and use other heuristics to determine whether they are locations.
Something that can also be very useful is having a lexicon of the possible locations. If there is a lexicon, your 'search space' is smaller: you can check your prepositional phrases against it to see whether they refer to known locations.
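A rough sketch of that combined idea is below. It assumes the pre-trained en-parser-chunking.bin model is available locally, and it uses a tiny hard-coded lexicon plus a naive "drop the first word" heuristic, both of which are stand-ins for something smarter.

import java.io.FileInputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;

public class PrepPhraseFinder {
    // Toy lexicon of known locations - in practice load this from a file or DB.
    private static final Set<String> KNOWN_LOCATIONS =
            new HashSet<>(Arrays.asList("Chicago", "Boston", "Mumbai"));

    public static void main(String[] args) throws Exception {
        // Path to the pre-trained parser model is a placeholder.
        try (FileInputStream modelIn = new FileInputStream("en-parser-chunking.bin")) {
            Parser parser = ParserFactory.create(new ParserModel(modelIn));
            Parse[] parses = ParserTool.parseLine(
                    "The package was shipped from Chicago to Boston .", parser, 1);
            collectPrepPhrases(parses[0]);
        }
    }

    // Walk the parse tree and print prepositional phrases whose noun part
    // matches the lexicon.
    private static void collectPrepPhrases(Parse p) {
        if ("PP".equals(p.getType())) {
            String phrase = p.getCoveredText();                 // e.g. "from Chicago"
            String noun = phrase.replaceFirst("^\\S+\\s+", ""); // drop the preposition
            if (KNOWN_LOCATIONS.contains(noun)) {
                System.out.println(phrase);
            }
        }
        for (Parse child : p.getChildren()) {
            collectPrepPhrases(child);
        }
    }
}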
As someone mentioned in the comments, no entity recognizer can do a perfect job out of the box. These things usually need a lot of tweaking, so you have to be keen on experimenting and looking at what the data says.
Hope this helps
I have a system that ultimately creates PDF files from HTML files. It works very much like a mail merge: it grabs data from a database, merges the data into placeholders in the HTML document, and then converts the HTML file to a PDF.
When I am unit testing the HTML file, I can look at the values in my placeholders. For example, if I have a John Smith and I want to validate that the name is "John Smith", I simply look at the value of the div after the merge.
I need to do something similar to validate the data in the PDF. Using PDFBox and iText I was able to extract text from a location, as well as text from the whole document, but I can't find anything that would let me create a "tag/placeholder/..." and extract information from it similar to what I do with the HTML file.
Is this possible with a PDF?
That's perfectly possible using pdf2Data, which is a solution from the iText suite.
You can find the demo here: http://pdf2data.online/
It essentially does exactly what you described: you are given a viewer and some tools that allow you to define areas of interest (what you called 'placeholders').
Areas of interest can be defined using:
coordinates
relative to other areas of interest
relative to text or regular expressions
matching a certain regular expression
matching a table
etc
The tool then stores your template as an XML file, and you can use Java or .NET code to extract information from a PDF that matches the template.
You are given either a JSON-like data structure or an XML file.
That should make it relatively straightforward to test whether a given area of interest contains a piece of text.
I have a .csv file with text and am supposed to parse the data and, based on specific keywords, replace the words with the necessary HTML tags for linking the keywords to a website.
So far, I have written a .csv parser and writer that gets all the required columns out of the first file and prints those columns to a newly created .csv file (e.g. text ID in one cell, text title in the next cell, and the actual text in the next cell).
Now I am still waiting to get a list of keywords, as well as the website hierarchy and the links to put in, but to be honest I have no idea how to continue working on this. Somehow I'll have to walk down the website hierarchy to where the text title is present, consider only the elements beneath it, and link them to keywords in my text. How can this be done? Is there special software, or extensions, libraries, or packages for Java, to do something like this?
Any help would be appreciated; I'm running on a deadline here...
THX!
P.S.: I am coding all of it in Java
I'm not sure, but it sounds like you want to create an href column in your output:
<a href="https://www.w3schools.com">Visit W3Schools</a>
You could do this most simply by concatenating the strings:
String makeHref(String title, String id, String link) {
    // builds e.g. <a href="https://www.w3schools.com">Visit W3Schools</a>
    return "<a href=\"" + link + "\">" + title + "</a>";
}
before you write out the second CSV. You'll need to escape the quotation marks, of course.
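For the quote escaping, one common convention (RFC 4180 style) is to wrap the whole cell in double quotes and double any quotes inside it; a hypothetical helper might look like:

// Hypothetical helper: makes the generated <a href="..."> markup safe
// to store in a single CSV cell by doubling the embedded quotes.
static String csvEscape(String field) {
    return "\"" + field.replace("\"", "\"\"") + "\"";
}

// e.g. csvEscape(makeHref("Visit W3Schools", "42", "https://www.w3schools.com"))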
It's also entirely possible that I didn't understand the question. You may want to try to be more specific if that's the case.
I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily done with ftp4j and a document-name list), but now we need to sort through the data from the server. The client is doing work in contracts and wants us to pull out relevant information such as: licensor, licensee, product, agreement date, termination date, royalties, and restrictions.
Since the documents are completely unstandardized, is that even possible to do? I can imagine loading in the files and searching them, but I would have no idea how to pull information such as the licensor and the restrictions on the agreement out of a paragraph. These are not hashes (structured key/value data); they are just long contracts. Even if I were to search for 'Licensor', it will come up in the document multiple times. The documents aren't even in a consistent file format: some are PDF, some are text, some are HTML, and I've even seen some that were as bad as a scanned image inside a PDF.
My boss keeps pushing for me to work on this project, but I feel as if I am out of options. I primarily do web and mobile, so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about, at the very minimum, 1000 documents.) I have been working on this in Java.
I'll do my best to give you some information, as this is not my area of expertise. I would strongly consider writing a script that identifies the type of file you are dealing with and then calls the appropriate parsing method to handle what you are looking for.
Since you are dealing with big data, Python could be pretty useful; JavaScript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this.
If you are concerned only with Licensor followed by a name, you could identify the format of that particular instance and search for something similar using a regex you create. This can be extrapolated to the other fields you are searching for.
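For example, a first stab at the "Licensor followed by a name" case might look like the sketch below; the sample contract text and the pattern itself are only illustrative and would need tuning against your real contract wording.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LicensorFinder {
    public static void main(String[] args) {
        String contractText = "This Agreement is made between Acme Corp (the "
                + "\"Licensor\") and Beta LLC (the \"Licensee\"). Licensor: Acme Corp.";

        // Illustrative pattern: "Licensor" followed by a colon or comma and a
        // capitalised name. Real contracts will need several variants.
        Pattern p = Pattern.compile("Licensor[:,]\\s*([A-Z][A-Za-z&.,' -]{2,60})");
        Matcher m = p.matcher(contractText);
        while (m.find()) {
            System.out.println("Candidate licensor: " + m.group(1).trim());
        }
    }
}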
For getting text from an image, try using the APIs discussed on these pages:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, once you have extracted the text from a PDF, you should most likely be able to search through it using a regex. That would be my method of attack, or possibly using String.split() and a string buffer that you can append to.
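If you would rather do the PDF step with an open-source library, a minimal Apache PDFBox (2.x) sketch is below; the file name is a placeholder. Note that this only works for PDFs that actually contain a text layer, so the scanned-image ones still need the OCR route above.

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextDump {
    public static void main(String[] args) throws Exception {
        // Placeholder path - point this at one of the contract PDFs.
        try (PDDocument doc = PDDocument.load(new File("contract.pdf"))) {
            String text = new PDFTextStripper().getText(doc);
            System.out.println(text);
        }
    }
}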
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
If you need anything else, let me know. I'll do my best to find it!
Apache Tika can extract plain text from almost any commonly used file format.
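For instance, a minimal sketch using the Tika facade class (assuming the Tika parsers are on the classpath; the file name is a placeholder):

import java.io.File;
import org.apache.tika.Tika;

public class TikaExtract {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Tika detects the format (PDF, HTML, DOC, ...) and returns plain text.
        String text = tika.parseToString(new File("some-contract.pdf"));
        System.out.println(text);
    }
}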
But with the situation you describe, you would still need to analyze the text, as in "natural language recognition". That's a field where, despite some advances made by dedicated research teams spending many person-years, computers still fail pretty badly (heck, even humans fail at it sometimes).
With the number of documents you mentioned (in the 1000s), hire a temp worker and have the documents sorted/tagged by human brain power. It will be cheaper and you will have fewer misclassifications.
You can use Tika for text extraction. If there is a fixed pattern, you can extract information using regex or XPath queries. Another solution is to use Solr, as shown in this video. You don't need Solr, but watch the video to get the idea.
I'm pretty sure the answer I'm going to get is: "why don't you just have the text files all be the same or follow some set format?" Unfortunately I do not have that option, but I was wondering whether there is a way to take any text file and translate it into another text or XML file that will always look the same.
The text files pretty much have the same data, just arranged differently.
The closest I can come up with is to have an XSLT sheet for each text file, but then I have to turn around and read the file that was just created, delete it, and repeat for each text file.
So, is there a way to grab the data off text files that essentially hold the same data, just stored differently, and store it in an object that I could then re-use later on in some process?
If it were up to me, I would push for every text file to follow some predefined format, since they all pretty much contain the same data, but it's not up to me.
Odd question... You say they are text files, yet you mention XSLT as a possible solution. XSLT will only work if the source is XML; if that is so, please redefine the question. If you say text files, I assume delimiter-separated (e.g. CSV), fixed-length, ...
There are some parsers out there (like Smooks) that allow you to parse multiple formats, but they will still require you to perform the "mapping" yourself, of course.
This is a typical problem in the integration world, so any integration tool should offer you a solution (e.g. WSO2, Fuse, ...).
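Whichever tool you pick, the mapping usually boils down to parsing each layout into one shared target object. A bare-bones hand-rolled version of that idea might look like the sketch below; the ParsedRecord fields and the semicolon-delimited layout are made up for illustration.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Shared target object every text-file layout gets mapped into
// (the fields here are illustrative).
class ParsedRecord {
    String id;
    String title;
    String body;
}

// One parser per incoming layout.
interface RecordParser {
    ParsedRecord parse(Path file) throws Exception;
}

// Example: a parser for a hypothetical semicolon-delimited layout.
class SemicolonLayoutParser implements RecordParser {
    public ParsedRecord parse(Path file) throws Exception {
        List<String> lines = Files.readAllLines(file);
        String[] cols = lines.get(0).split(";");
        ParsedRecord r = new ParsedRecord();
        r.id = cols[0];
        r.title = cols[1];
        r.body = cols[2];
        return r;
    }
}

Once every layout has its own RecordParser, the rest of the process only ever sees ParsedRecord objects, which is essentially the mapping that tools like Smooks formalize for you.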
I want to collect domain names (crawling). I have written a simple Java application that reads an HTML page and saves the code in a text file. Now I want to parse this text in order to collect all domain names without duplicates. But I need the domain names without "http://www.", just domainname.topleveldomain, or possibly domainname.subdomain.topleveldomain, or whatever number of subdomains (then the collected links need to be extracted the same way, collecting the links inside them, until I reach a certain number of links, say 100).
I have asked about this in a previous post, https://stackoverflow.com/questions/11113568/simple-efficient-java-web-crawler-to-extract-hostnames , and searched. jsoup seems like a good solution, but I have not worked with jsoup before, so before going deeply into it I just want to ask: does it achieve what I want to do? Any other suggestions for achieving my simple crawling in a simple way are welcome.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
So yes, you can connect to a website, extract its HTML, and parse it with jsoup.
The logic of extracting the top-level domain is "your part"; you will need to write that code yourself (a rough sketch follows the doc links below).
Take a look at the docs for more options...
Use selector-syntax to find elements
Use DOM methods to navigate a document
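As a rough sketch of how the pieces could fit together (the start URL is a placeholder, duplicates are handled with a Set, and the host is taken from java.net.URI rather than by string-chopping):

import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HostCollector {
    public static void main(String[] args) throws Exception {
        Set<String> hosts = new HashSet<>();   // a Set removes duplicates for free

        // Placeholder start page; a real crawler would keep following the
        // collected links until it reaches its limit (e.g. 100 hosts).
        Document doc = Jsoup.connect("http://example.com/").get();
        for (Element link : doc.select("a[href]")) {
            String href = link.attr("abs:href");   // absolute URL of the link
            try {
                String host = URI.create(href).getHost();   // e.g. sub.domain.tld
                if (host != null) {
                    hosts.add(host.startsWith("www.") ? host.substring(4) : host);
                }
            } catch (IllegalArgumentException ignored) {
                // malformed href - skip it
            }
        }
        hosts.forEach(System.out::println);
    }
}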