Java Algorithm To Extract Information From a String - java

I'm trying to implement a smart search feature in my application.
Usecase: The user enters the search term in a textbox
Eg: Find me a christian male 28 years old from Brazil.
I need to parse the input into a map as follows:
Gender: male
Age: 28
Location: Brazil
Religion: Christian
I have already glanced at OpenNLP, Cross Validate, Java pattern matching and regex, and information extraction. I'm not sure which one I should look into more deeply.
Is there any Java library already available for this specific domain?

There's an API that extracts structured information (JSON) from free text: http://wit.ai
You need to train Wit with some examples of what you want to achieve.

Just an approach (there are many ways to do this I think): split your String in a String[] and process each word as you need:
String str = "Find me a christian male 28 years old from Brazil";
for(String s : str.split(" ")){ //splits your String using space char
processWord(s);
}
Where processWord(s) should do something to determine if s is or not a key word based on your business rules.
EDIT: Well, as many people consider this answer insufficient I'll add some more tips.
Let's say you have a class in which you put some search criteria (assuming you want to get people that match these criteria):
public class SearchCriteria {
public void setGender(String gender){...}
public void setCountry(String country){...}
public void setReligion(String religion){...}
...
public void setWatheverYouThinkIsImportant(String str){...}
}
As @Sotirios pointed out in his comment, you may need a pool of matching words. Let's say you can use List<String> with basic matching words:
List<String> gender = Arrays.asList(new String[]{"MALE","FEMALE","BOY","GIRL"...});
List<String> country = Arrays.asList(new String[]{"ALGERIA","ARGENTINA","AUSTRIA"...});
List<String> religion = Arrays.asList(new String[]{"CHRISTIAN","JEWISH","MUSLIM"...});
Now I'll modify processWord(s) a little (assuming this method has access to lists above):
public void processWord(String word, SearchCriteria sc){
    String upper = word.toUpperCase(); // the lists above hold upper-case entries
    if(gender.contains(upper)){
        sc.setGender(upper);
        return;
    }
    if(country.contains(upper)){
        sc.setCountry(upper);
        return;
    }
    if(religion.contains(upper)){
        sc.setReligion(upper);
        return;
    }
    ....
}
Finally you need to process user's input:
String usersInput = "Find me a christian girl 28 years old from Brazil"; //sorry I changed "male" to "girl" but I like girls :P
SearchCriteria sc = new SearchCriteria();
for(String word : usersInput.split(" ")){
    processWord(word, sc);
}
// do something with your SearchCriteria object
Sure, you can do this much better. This is only one approach.
If you want to make the search more accurate, read about Levenshtein distance. It will help, for example, if somebody types "Brasil" instead of "Brazil" or "cristian" instead of "christian".
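To make the Levenshtein tip concrete, here is a minimal sketch of how a misspelled word could be mapped to the nearest entry of one of the keyword lists. The class name FuzzyKeywordMatch, the helper closest and the maxDistance threshold are hypothetical names introduced for this illustration, and "BRAZIL" is assumed to be part of the elided country list.

```java
import java.util.Arrays;
import java.util.List;

class FuzzyKeywordMatch {

    // Dynamic-programming Levenshtein distance, keeping only two rows in memory.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest keyword within maxDistance edits, or null if none is close enough.
    // Assumes the keywords are stored in upper case, as in the lists above.
    static String closest(String word, List<String> keywords, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String keyword : keywords) {
            int d = levenshtein(word.toUpperCase(), keyword);
            if (d < bestDistance) {
                bestDistance = d;
                best = keyword;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "BRAZIL" is an assumed entry; the original list is elided.
        List<String> country = Arrays.asList("ALGERIA", "ARGENTINA", "AUSTRIA", "BRAZIL");
        List<String> religion = Arrays.asList("CHRISTIAN", "JEWISH", "MUSLIM");
        System.out.println(closest("Brasil", country, 1));    // BRAZIL
        System.out.println(closest("cristian", religion, 1)); // CHRISTIAN
    }
}
```

A small threshold such as 1 keeps short filler words like "me" or "a" from being pulled toward unrelated keywords; the right value depends on your data.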

This is a pretty huge area of research in language processing: it's called Information Extraction. If it's Java you want, GATE has pretty extensive support for IE.

Related

Return the matching word in an efficient manner

Given an array of strings like ["crane", "drain", "refrain"] and a pattern such as *an* where * can match any number of characters.
Return the matching word in an efficient manner. (In this example, "crane")
I can solve it in very simple way:
String [] array = {"crane", "drain", "refrain"};
String pattern="an";
for(String s:array){
if(s.contains(pattern)){
System.out.println(s);
}
}
Is there a way to optimize the code's performance in Java? Consider that the array can contain a large number of strings.
You could try it with Regular Expressions (regex).
public class RegexExample3{
public static void main(String args[]){
String [] array = {"crane", "drain", "refrain"};
for(String s:array){
if(java.util.regex.Pattern.matches(".*an.*", s))
System.out.println(""+s);
}
}
}
Here is the link if someone doesn't know about regex and would want to understand it.
Well, if you want to check whether a word matches a pattern without using regex, contains, etc.,
I suggest encoding the pattern in such a way that a word encoded the same way can be compared against it numerically.
In your case, I suggest doing this:
static String EncodeString(String x){
String output="";
for(int i=0;i<x.length();i++){
// *an* == 0110
char c=x.charAt(i);
if(c=='n'|| c=='a'){
output +="1";
} else {
output +="0";
}
}
return output;
}
public static void main(String args[])
{
String pattern="*an*";
String enPattern=EncodeString(pattern);
String word="xyanxvsdanfgh";
String enWord=EncodeString(word);
System.out.println(enPattern+" "+enWord);
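// Long.parseLong rather than Integer.parseInt: the encoded word here has 13 digits and overflows int
long v1=Long.parseLong(enPattern);
long v2=Long.parseLong(enWord);
System.out.println(" % :"+ v2%v1);// notice here if word not match the pattern then the MOD operation will NOT return 0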
}
The assignment asks you to return the matching word, so the assumption is that there is one word, and only one word, that matches.
And if there is just one word, it is more efficient to return early instead of looping on. You were close.
String matching (String pattern, String [] array) {
for (String s:array)
if (s.contains (pattern))
return s;
return "";
}
Think about alternatives: how would you measure s.contains(pattern) against Pattern.matches, and how many cases would you have to generate to see a difference? Without measuring, you can't be sure that the regex version isn't less efficient. Maybe the pattern should be precompiled?
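As a starting point for such a comparison, here is a small sketch that puts the two candidates side by side: the early-return substring test and a regex that is compiled once before the loop. The class and method names are hypothetical, and no timing claims are made; you would still have to measure both on realistic data.

```java
import java.util.regex.Pattern;

class FirstMatchFinder {

    // Variant 1: plain substring test, returning on the first hit.
    static String firstByContains(String fragment, String[] words) {
        for (String word : words) {
            if (word.contains(fragment)) {
                return word;
            }
        }
        return "";
    }

    // Variant 2: the regex is compiled once and reused for every word.
    // Pattern.quote makes the fragment match literally; find() looks for it
    // anywhere in the word, so no leading or trailing ".*" is needed.
    static String firstByRegex(String fragment, String[] words) {
        Pattern compiled = Pattern.compile(Pattern.quote(fragment));
        for (String word : words) {
            if (compiled.matcher(word).find()) {
                return word;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String[] array = {"crane", "drain", "refrain"};
        System.out.println(firstByContains("an", array)); // crane
        System.out.println(firstByRegex("an", array));    // crane
    }
}
```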
In such assignments, assuming you quoted it accurately, you usually have to read every word very carefully.
Often people have good ideas about a topic and can't wait to implement the first one that comes to mind. Don't do it!
Given array of strings like ["crane", "drain", "refrain"] and a
pattern such as *an* where * can match any number of characters.
Return the matching word in an efficient manner. (In this example,
"crane")
Be very sensitive to anything that violates your expectations. You are asked to return the matching word. Did you notice the singular? What might it mean in the context of "efficient manner"?
Of course you have to know your teacher: whether he is a bit sloppy or not, and how fluent he is in the language he uses. But method interfaces that fit together are a big issue in software development, and so is reading the specs carefully. Otherwise you soon end up investing a lot of time in a solution that works but doesn't fit the problem.
Returning an empty String is probably not the best idea, but it seems sufficient for this level, and without further context it is hard to decide what an appropriate reaction would be if nothing is found. Again, the wording suggests that there is exactly one solution.

Search true if the word is Singular or Plural Java

I am trying to achieve the result in which if the user enters the word, in plural or singular, the regex should return true
For example 'I want to buy drone' or 'I want to buy drones'.
@Test
public void testProductSearchRegexp() {
String regexp = "(?i).*?\\b%s\\b.*?";
String query = "I want the drone with FLIR Duo";
String data1 = "drone";
String data2 = "FLIR Duo";
String data3 = "FLIR";
String data4 = "drones";
boolean isData1 = query.matches(String.format(regexp, data1));
boolean isData2 = query.matches(String.format(regexp, data2));
boolean isData3 = query.matches(String.format(regexp, data3));
boolean isData4 = query.matches(String.format(regexp, data4));
assertTrue(isData1);
assertTrue(isData2);
assertTrue(isData3);
assertTrue(isData4);//Test fails here (obviously)
}
Your valuable time on this question is much appreciated.
English is a language with many exceptions. Checking whether a word ends in 's' is simply not sufficient to determine whether it's plural.
The best way to solve this problem is to not solve this problem. It's been done before. Take advantage of that. One solution would be to make use of a third party API. The OED have one, for example.
If you were to make a request to their API such as:
/entries/en/mice
You would get back a JSON response containing:
"crossReferenceMarkers": [
"plural form of mouse"
],
From there it should be easy to parse. Simply checking for the presence of the word 'plural' may be sufficient.
They even have working Java examples that you can copy and paste.
An advantage of this approach is there's no compile-time dependency. A disadvantage is that you're relying on being able to make HTTP requests. Another is that you're limited by any restrictions they impose. The OED allows up to 3k requests/month and 60 requests/minute on their free plan, which seems pretty reasonable to me.
Well, something like this is very hard to achieve without external sources. Sure, many plural words end with 's', but there are also a lot of exceptions, like "knife" and "knives" or "cactus" and "cacti". For those you could use a Map to sort them out.
public static String getPlural(String singular){
String plural;
HashMap<String,String> irregularPlurals = new HashMap<>();
irregularPlurals.put("cactus","cacti");
irregularPlurals.put("knife","knives");
irregularPlurals.put("man","men");
/*add all your irregular ones*/
plural = irregularPlurals.get(singular);
if (plural == null){
return singular + "s";
}else{
return plural;
}
}
Very simple and not very practical but gets the job done when you only have a few words.
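One way the map idea could be tied back to the regex in the question is sketched below: the search term is reduced to a singular form, its plural is derived, and the query is tested against both. This is only a naive illustration; the class PluralAwareSearch and its methods are hypothetical names, the "strip a trailing s" rule is an assumption that fails for words such as "bus", and the irregular map holds just the three sample entries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PluralAwareSearch {

    // Singular -> irregular plural; only the sample entries from the answer above.
    private static final Map<String, String> IRREGULAR = new HashMap<>();
    static {
        IRREGULAR.put("cactus", "cacti");
        IRREGULAR.put("knife", "knives");
        IRREGULAR.put("man", "men");
    }

    static String pluralOf(String singular) {
        String plural = IRREGULAR.get(singular);
        return plural != null ? plural : singular + "s";
    }

    // Naive inverse of pluralOf: known singulars stay as they are, known irregular
    // plurals are looked up, and otherwise a trailing "s" is stripped.
    static String singularOf(String word) {
        if (IRREGULAR.containsKey(word)) {
            return word;
        }
        for (Map.Entry<String, String> entry : IRREGULAR.entrySet()) {
            if (entry.getValue().equals(word)) {
                return entry.getKey();
            }
        }
        if (word.length() > 1 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // True if the query contains the term in singular or plural form, as a whole word.
    static boolean mentions(String query, String term) {
        String singular = singularOf(term.toLowerCase());
        String plural = pluralOf(singular);
        String regex = "(?i).*?\\b(?:" + Pattern.quote(singular) + "|" + Pattern.quote(plural) + ")\\b.*?";
        return query.matches(regex);
    }

    public static void main(String[] args) {
        String query = "I want the drone with FLIR Duo";
        System.out.println(mentions(query, "drone"));    // true
        System.out.println(mentions(query, "drones"));   // true
        System.out.println(mentions(query, "FLIR Duo")); // true
        System.out.println(mentions(query, "camera"));   // false
    }
}
```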

Java - Search keywords list in another string list

I have a list of keywords in a List and I have data coming from some source which will be a list too.
I would like to find if any of keywords exists in the data list, if yes add those keywords to another target list.
E.g.
Keywords list = FIRSTNAME, LASTNAME, CURRENCY & FUND
Data list = HUSBANDFIRSTNAME, HUSBANDLASTNAME, WIFEFIRSTNAME, SOURCECURRENCY & CURRENCYRATE.
From the above example, I would like to make a target list with keywords FIRSTNAME, LASTNAME & CURRENCY; however, FUND should not be included as it doesn't exist in the data list.
I have a solution below that works by using two for loops (one inside another) and check with String contains method, but I would like to avoid two loops, especially one inside another.
for (int i=0; i<dataList.size();i++) {
for (int j=0; j<keywordsList.size();j++) {
if (dataList.get(i).contains(keywordsList.get(j))) {
targetSet.add(keywordsList.get(j));
break;
}
}
}
Is there any other alternate solution for my problem?
Here's a one loop approach using regex. You construct a pattern using your keywords, and then iterate through your dataList and see if you can find a match.
public static void main(String[] args) throws Exception {
    List<String> keywords = new ArrayList<>(Arrays.asList("FIRSTNAME", "LASTNAME", "CURRENCY", "FUND"));
    List<String> dataList = new ArrayList<>(Arrays.asList("HUSBANDFIRSTNAME", "HUSBANDLASTNAME", "WIFEFIRSTNAME", "SOURCECURRENCY", "CURRENCYRATE"));
    Set<String> targetSet = new HashSet<>();
    // Build the alternation FIRSTNAME|LASTNAME|... and compile it once, outside the loop
    Pattern pattern = Pattern.compile(String.join("|", keywords));
    for (String data : dataList) {
        Matcher matcher = pattern.matcher(data);
        if (matcher.find()) {
            targetSet.add(matcher.group());
        }
    }
    System.out.println(targetSet);
}
Results:
[CURRENCY, LASTNAME, FIRSTNAME]
Try the Aho–Corasick algorithm. It can count the occurrences of every keyword in the data (you only need to know whether each keyword appeared or not).
The Complexity is O(Sum(Length(Keyword)) + Length(Data) + Count(number of match)).
Here is the wiki-page:
In computer science, the Aho–Corasick algorithm is a string searching
algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is
a kind of dictionary-matching algorithm that locates elements of a
finite set of strings (the "dictionary") within an input text. It
matches all patterns simultaneously. The complexity of the algorithm
is linear in the length of the patterns plus the length of the
searched text plus the number of output matches.
I implemented it (about 200 lines) years ago for a similar case, and it works well.
If you only care whether a keyword appeared or not, you can modify the algorithm for your case and get a better complexity:
O(Sum(Length(Keyword)) + Length(Data)).
You can find implementations of this algorithm all over the internet, but I think it's good for you to understand the algorithm and implement it yourself.
EDIT:
I think you want to eliminate the two nested loops, so we need to find all keywords in one loop. Matching a set of patterns (keywords) against a text (data) is called the Set Match Problem. To solve it, you should choose the Aho–Corasick algorithm, which is designed specifically for that case. That way, we get a one-loop solution:
for (int i=0; i < dataList.size(); i++) {
targetSet.addAll(Ac.run(keywordsList, dataList.get(i))); // Ac stands in for your Aho–Corasick implementation
}
You can find an implementation here.
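For orientation, a compact sketch of what such an implementation could look like follows. It is not the implementation the answer links to: the class AhoCorasick, its Node type and the method findAll are names made up for this illustration, and it uses hash maps for the trie edges rather than an optimized transition table.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class AhoCorasick {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> output = new ArrayList<>(); // keywords ending at this node
        Node fail;                                     // longest proper suffix that is also a trie path
    }

    private final Node root = new Node();

    AhoCorasick(List<String> keywords) {
        // Phase 1: insert every keyword into the trie.
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.output.add(keyword);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> edge : current.next.entrySet()) {
                char c = edge.getKey();
                Node child = edge.getValue();
                Node f = current.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit keywords that end at the suffix node as well.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    // Returns every keyword that occurs somewhere in text, in order of first discovery.
    Set<String> findAll(String text) {
        Set<String> found = new LinkedHashSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node target = node.next.get(c);
            node = (target != null) ? target : root;
            found.addAll(node.output);
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick(Arrays.asList("FIRSTNAME", "LASTNAME", "CURRENCY", "FUND"));
        List<String> dataList = Arrays.asList("HUSBANDFIRSTNAME", "HUSBANDLASTNAME",
                "WIFEFIRSTNAME", "SOURCECURRENCY", "CURRENCYRATE");
        Set<String> targetSet = new LinkedHashSet<>();
        for (String data : dataList) {
            targetSet.addAll(ac.findAll(data));
        }
        System.out.println(targetSet); // [FIRSTNAME, LASTNAME, CURRENCY]
    }
}
```

The automaton is built once from the keywords; each data string is then scanned in a single pass, so the keyword list is never looped over per data item.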

Fastest way to parse txt file in Java

I have to parse a txt file for a tax calculator that has this form:
Name: Mary Jane
Age: 23
Status: Married
Receipts:
Id: 1
Place: Restaurant
Money Spent: 20
Id: 2
Place: Mall
Money Spent: 30
So, what I have done so far is:
public void read(File file) throws FileNotFoundException{
    Scanner scanner = new Scanner(file);
    String[] tokens = null;
    while(scanner.hasNextLine()){
        String line= scanner.nextLine();
        tokens = line.split(":");
        String lastToken = tokens[tokens.length - 1];
        System.out.println(lastToken);
    }
}
So, I want to load only the second column of this file (Mary Jane, 23, Married) into a class Taxpayer(name, age, status) and the receipts' info into an ArrayList.
I thought of taking the last token and saving it to a String array, but I can't do that because I can't save a String to a String array. Can someone help me? Thank you.
The fastest way, if your data is ASCII and you don't need charset conversion, is to use a BufferedInputStream and do all the parsing yourself -- find the line terminators, parse the numbers. Do NOT use a Reader, or create Strings, or create any objects per line, or use parseInt. Just use byte arrays and look at the bytes. It's a little messier, but pretend you're writing C code, and it will be faster.
Also give some thought to how compact the data structure you're creating is, and whether you can avoid creating an object per line there too by being clever.
Frankly, I think the "fastest" is a red herring. Unless you have millions of these files, it is unlikely that the speed of your code will be relevant.
And in fact, your basic approach to parsing (read a line using Scanner, split the line using String.split(...)) seems pretty sound.
What you are missing is that the structure of your code needs to match the structure of the file. Here's a sketch of how I would do it.
If you are going to ignore the first field of each line, you need a method that:
- reads a line, skipping empty lines,
- splits it, and
- returns the second field.
If you are going to check that the first field contains the expected keyword, then modify the method to take a parameter, and check the field. (I'd recommend this version ...)
Then call the above method in the correct pattern; e.g.
- call it 3 times to extract the name, age and marital status
- call it 1 time to skip the "Receipts" line
- use a while loop to call the method 3 times to read the 3 fields for each receipt.
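The outline above could be sketched roughly as follows. The class TaxFileReader and the method readField are hypothetical names, the receipts are kept as plain String arrays only to keep the example short, and the sample text is fed through a Scanner over a String instead of a File.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TaxFileReader {

    // Reads the next non-empty line, checks that its first field is the expected
    // keyword, and returns the text after the first colon.
    static String readField(Scanner scanner, String expectedKey) {
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine().trim();
            if (line.isEmpty()) {
                continue; // skip empty lines
            }
            String[] tokens = line.split(":", 2);
            if (tokens.length < 2 || !tokens[0].trim().equals(expectedKey)) {
                throw new IllegalStateException("Expected '" + expectedKey + "' but found: " + line);
            }
            return tokens[1].trim();
        }
        throw new IllegalStateException("Unexpected end of input while looking for '" + expectedKey + "'");
    }

    public static void main(String[] args) {
        String sample = "Name: Mary Jane\nAge: 23\nStatus: Married\nReceipts:\n"
                + "Id: 1\nPlace: Restaurant\nMoney Spent: 20\n"
                + "Id: 2\nPlace: Mall\nMoney Spent: 30\n";
        Scanner scanner = new Scanner(sample);

        // Three calls for the taxpayer fields, one to skip the "Receipts" line.
        String name = readField(scanner, "Name");
        String age = readField(scanner, "Age");
        String status = readField(scanner, "Status");
        readField(scanner, "Receipts");

        // hasNext() is false once only whitespace remains, which ends the loop.
        List<String[]> receipts = new ArrayList<>();
        while (scanner.hasNext()) {
            receipts.add(new String[] {
                readField(scanner, "Id"),
                readField(scanner, "Place"),
                readField(scanner, "Money Spent")
            });
        }
        System.out.println(name + ", " + age + ", " + status + ", receipts: " + receipts.size());
    }
}
```

Checking the keyword on every line means a malformed file fails fast with a message naming the offending line, rather than silently shifting fields.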
First, why do you need to invest time in the fastest possible solution? Is it because the input file is huge? I also do not understand how you want to store the result of parsing. Consider a new class with all the fields you need to extract from the file per person.
A few tips:
- Avoid unnecessary per-line memory allocations. line.split(":") in your code is an example of this.
- Use buffered input.
- Minimize input/output operations.
If these are not enough for you, try reading this article: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
Do you really need it to be as fast as possible? In situations like this, it's often fine to create a few objects and do a bit of garbage collection along the way in order to have more maintainable code.
I'd use two regular expressions myself (one for the taxpayer and another for the receipts loop).
My code would look something like:
public class ParsedFile {
private Taxpayer taxpayer;
private List<Receipt> receipts;
// getters and setters etc.
}
public class FileParser {
private static final Pattern TAXPAYER_PATTERN =
// this pattern includes capturing groups in brackets ()
Pattern.compile("Name: (.*?)\\s*Age: (.*?)\\s*Status: (.*?)\\s*Receipts:", Pattern.DOTALL);
public ParsedFile parse(File file) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(file));
    String firstChunk = getNextChunk(reader);
    Taxpayer taxpayer = parseTaxpayer(firstChunk);
    List<Receipt> receipts = new ArrayList<Receipt>();
    String chunk;
    while ((chunk = getNextChunk(reader)) != null) {
        receipts.add(parseReceipt(chunk));
    }
    return new ParsedFile(taxpayer, receipts);
}
private Taxpayer parseTaxpayer(String chunk) {
    Matcher matcher = TAXPAYER_PATTERN.matcher(chunk);
    if (!matcher.matches()) {
        // unchecked, so the method needs no throws clause
        throw new IllegalArgumentException(chunk + " does not match " + TAXPAYER_PATTERN.pattern());
    }
    // this is where we use the capturing groups from the regular expression
    return new Taxpayer(matcher.group(1), matcher.group(2), ...);
}
private Receipt parseReceipt(String chunk) {
// TODO implement
}
private String getNextChunk(BufferedReader reader) {
// keep reading lines until either a blank line or end of file
// return the chunk as a string
}
}

Searching a String array to find portions of elements

I have a String array that has individuals names in it (example):
["John Smith", "Ramon Ruiz", "Bill Bradford", "Suzy Smith", "Brad Johnson"]
I would like to write a method that prompts a user to input (in form of String) a name OR portion of a name, and then lists all names that contain the string entered by the user, (I can fix the case issue easily).
ex:
Name: rad (meaning user enters "rad")
Output:
Bill Bradford
Brad Johnson
Does anyone have any ideas on this (one that also preserves white spaces)? If there already is a good example of this, feel free to link me. I was unable to find a good method in API.
I would use
for(String name : names) {
if(org.apache.commons.lang3.StringUtils.containsIgnoreCase(name, stringToLookFor)) {
// Do your thing
}
}
You can use .indexOf(); it returns -1 if it does not find the substring in the String.
for(String name : myArray)
{
if (name.indexOf("rad") != -1) {
// contains word
}
}
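If you would rather not pull in Apache Commons, a case-insensitive version of the same loop can be sketched with the standard library alone. The class NameSearch and the method findMatches are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NameSearch {

    // Returns every name that contains the fragment, ignoring case.
    // The names themselves are returned unchanged, so spaces are preserved.
    static List<String> findMatches(String[] names, String fragment) {
        String needle = fragment.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String name : names) {
            if (name.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] names = {"John Smith", "Ramon Ruiz", "Bill Bradford", "Suzy Smith", "Brad Johnson"};
        for (String match : findMatches(names, "rad")) {
            System.out.println(match); // Bill Bradford, then Brad Johnson
        }
    }
}
```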
