Using regex and android for categorizing different fields - java

I am currently trying to build a business name card scanner app. The idea is to take a picture of a name card, extract the text, and sort the text into different EditText fields.
I have already completed the OCR part, which extracts all the text from a name card image.
What I am missing now is a regex method that takes the entire text extracted by OCR and sorts the name, email address, and phone number into their respective EditText fields.
Through some googling I have already found the regex formulas below:
private static final String EMAIL_PATTERN =
"[a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}" +
"\\@" +
"[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}" +
"(" +
"\\." +
"[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25}" +
")+";
private static final String PHONE_PATTERN =
"^[89]\\d{7}$";
// Java has no /.../i regex literal syntax; the (?i) flag makes the pattern case-insensitive.
private static final String NAME_PATTERN =
"(?i)^[a-z ,.'-]+$";
Currently I am just able to extract out the email address using the below method:
public String EmailValidator(String email) {
Pattern pattern = Pattern.compile(EMAIL_PATTERN);
Matcher matcher = pattern.matcher(email);
if (matcher.find()) {
return email.substring(matcher.start(), matcher.end());
} else {
// TODO handle condition when input doesn't have an email address
}
return email;
}
I am unsure how to edit the method above so that it applies all three regex patterns at once and displays the results in different EditText fields (name, email address, phone number).
--------------------------------------------EDIT-------------------------------------------------
After using @Styx's answer,
I have a problem with the parameter, namely with how I pass the text "textToUse" to the method as shown below:
I have also tried passing the text into all three parameters, but since the method is void, that cannot be done. If I change the method to return a String instead of void, it requires a return value.

Try this code. The function takes the recognized text and splits it on line breaks. It then loops over the lines and determines the type of each one by running a pattern check. As soon as a pattern matches, the loop moves on to the next iteration using the continue keyword. This code can also handle the situation where more than one email address or phone number appears on a single business card. Hope it helps. Cheers!
public void validator(String recognizeText) {
Pattern emailPattern = Pattern.compile(EMAIL_PATTERN);
Pattern phonePattern = Pattern.compile(PHONE_PATTERN);
Pattern namePattern = Pattern.compile(NAME_PATTERN);
String possibleEmail, possiblePhone, possibleName;
possibleEmail = possiblePhone = possibleName = "";
Matcher matcher;
String[] words = recognizeText.split("\\r?\\n");
for (String word : words) {
//check whether the line contains an email address
matcher = emailPattern.matcher(word);
if (matcher.find()) {
possibleEmail = possibleEmail + word + " ";
continue;
}
//check whether the line is a phone number
matcher = phonePattern.matcher(word);
if (matcher.find()) {
possiblePhone = possiblePhone + word + " ";
continue;
}
//check whether the line looks like a name
matcher = namePattern.matcher(word);
if (matcher.find()) {
possibleName = possibleName + word + " ";
continue;
}
}
//after the loop then only set possibleEmail, possiblePhone, and possibleName into
//their respective EditText here.
}
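One way around the "void cannot return three values" problem raised in the edit is to return a small result object instead of setting the fields inside the method. The following is only a sketch under stated assumptions: the `CardFields` and `CardParser` names are hypothetical, and the patterns are simplified stand-ins for the ones in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical holder for the three categories; not part of the original answer.
class CardFields {
    final List<String> names = new ArrayList<>();
    final List<String> emails = new ArrayList<>();
    final List<String> phones = new ArrayList<>();
}

class CardParser {
    // Simplified patterns for illustration only; substitute your own.
    private static final Pattern EMAIL = Pattern.compile("[\\w.%+-]+@[\\w-]+(\\.[\\w-]+)+");
    private static final Pattern PHONE = Pattern.compile("^[89]\\d{7}$");
    private static final Pattern NAME = Pattern.compile("^[a-zA-Z ,.'-]+$");

    // Splits the OCR text into lines and files each line under the first pattern it matches.
    static CardFields parse(String recognizedText) {
        CardFields fields = new CardFields();
        for (String raw : recognizedText.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (EMAIL.matcher(line).find()) {
                fields.emails.add(line);
            } else if (PHONE.matcher(line).find()) {
                fields.phones.add(line);
            } else if (NAME.matcher(line).find()) {
                fields.names.add(line);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        CardFields f = parse("Ben Conan\nGeneral Manager\n90010021\nbenconan@gmail.com");
        System.out.println("Names:  " + f.names);
        System.out.println("Emails: " + f.emails);
        System.out.println("Phones: " + f.phones);
    }
}
```

The caller can then fill each EditText from the returned object, for example `emailField.setText(TextUtils.join(" ", fields.emails))`, where `emailField` is a hypothetical EditText. Note that a name pattern this loose cannot tell a name from a job title, so "General Manager" also lands in `names`.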

Related

Regex Parsing Kafka Listener

I'm listening to a Kafka topic and receiving messages, comparing them to an object, and then trying to parse each message. I receive a number of messages about one search, and I'm only trying to get this one:
userName:User.Name userId:FDF3JH4 session:9cf2-21-c6-28-c360f1edba53 searchString:test, searchType:DEFAULT_SEARCH
This is what I want my log pattern to be:
String logPattern = ".*(userName:)(\\S+)\\s(userId:)(\\S+)\\s(session:)(\\S+)\\s(searchString:)([^,]).*";
if (isValidObject) {
final Pattern p = Pattern.compile(logPattern);
Matcher matcher = p.matcher(historyRequest.getLog());
if (!matcher.matches()) {
return;
}
I set up a test function to make sure the message I received and my pattern were correct, but when I put the pattern into the actual function, it doesn't work. It returns no results for String logPattern = ".*"; The strange thing is that, while experimenting with the log patterns, I was able to match a Kafka message with this log pattern and this log:
String logPattern = ".*[userName]\\:(\\S+)\\s\\w+:(\\S+)(\\s\\S+\\s\\w+\\:)([^,]+).*";
userName:User.Name userId:D394H4 session:3f1da-0c-fb-90-949a searchString:"test" took:13.0 page:1 resultSize:1 sponsored:false
As near as I can tell, you had a matching pattern. I'm guessing it didn't do exactly what you wanted because it didn't pick up the whole searchString argument: ([^,]) has no quantifier, so it captures a single character. I've posted some code below with a slightly modified version of your pattern. I did two things to it:
I eliminated the parentheses around the constant text
I fixed the pattern to match all text after searchString up to the comma
Here's the code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Logtest {
String logPattern_orig = ".*(userName:)(\\S+)\\s(userId:)(\\S+)\\s(session:)(\\S+)\\s(searchString:)([^,]).*";
String logPattern = ".*userName:(\\S+)\\suserId:(\\S+)\\ssession:(\\S+)\\ssearchString:([^,]*),.*";
String kafkaMsg = "userName:User.Name userId:FDF3JH4 session:9cf2-21-c6-28-c360f1edba53 searchString:test, searchType:DEFAULT_SEARCH";
void test() {
final Pattern p = Pattern.compile(logPattern);
Matcher matcher = p.matcher(kafkaMsg);
if (matcher.matches()) {
System.out.println("Matches!");
for (int i=1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + "='" + matcher.group(i) + "'");
}
}
}
public static final void main(String[] args) {
Logtest lt = new Logtest();
lt.test();
}
}
When I run it, I get the following output:
Matches!
Group 1='User.Name'
Group 2='FDF3JH4'
Group 3='9cf2-21-c6-28-c360f1edba53'
Group 4='test'
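The second log line in the question has no comma after the search string and wraps it in quotes, so the pattern above would not match it. A possible variant, sketched here as an assumption rather than as the answerer's method, uses find() instead of matches() and ends the search string at a quote, comma, or whitespace. The class name `SearchLogParser` is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SearchLogParser {
    // No leading or trailing .* is needed because find() scans for the pattern anywhere.
    // The optional quote before the capture group lets quoted search strings through.
    private static final Pattern LOG = Pattern.compile(
            "userName:(\\S+)\\s+userId:(\\S+)\\s+session:(\\S+)\\s+searchString:\"?([^\",\\s]*)");

    // Returns {userName, userId, session, searchString}, or null if the line does not match.
    static String[] parse(String log) {
        Matcher m = LOG.matcher(log);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String withComma = "userName:User.Name userId:FDF3JH4 session:9cf2-21-c6-28-c360f1edba53 "
                + "searchString:test, searchType:DEFAULT_SEARCH";
        String quoted = "userName:User.Name userId:D394H4 session:3f1da-0c-fb-90-949a "
                + "searchString:\"test\" took:13.0 page:1 resultSize:1 sponsored:false";
        System.out.println(String.join(" | ", parse(withComma)));
        System.out.println(String.join(" | ", parse(quoted)));
    }
}
```

The trade-off is that a search string containing spaces is cut at the first space; if that can happen, the terminator has to come from the log format itself.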

Java check one string in other string

I am receiving meta information in a radio player via ICY.
Here is a short example of how this can look:
die neue welle - Der beste Musikmix aus 4 Jahrzehnten! - WELSHLY ARMS - SANCTUARY - Der Mehr Musik-Arbeitstag mit Benni Rettich
Another example for the meta information stream would be:
SWR1 Baden Württemberg
or
Welshly Arms - Sanctuary
Now I need to extract the title from this. The problem is that the meta information string can have any format.
What I know:
- I know the complete meta information string, as shown in the first example above
- I know the station name, which is delivered by another ICY property
The first approach was to check if the string contains the station name (I thought if not, it has to be the title):
private boolean icyInfoContainsTitleInfo() {
String title = id3Values.get("StreamTitle"); //this is the title string
String icy = id3Values.get("icy-name"); //this is the station name
String[] titleSplit = title.split("\\s");
String[] icySplit = icy.split("\\s");
for (String a : titleSplit) {
StringBuilder abuilder = new StringBuilder();
abuilder.append(a);
for (String b : icySplit) {
StringBuilder builder = new StringBuilder();
builder.append(b);
if (builder.toString().toLowerCase().contains(abuilder.toString().toLowerCase())) {
return false;
}
}
}
return true;
}
But that does not help me if title and station are both present in the title string.
Is there a pattern that matches a string followed by a slash, backslash or a hyphen followed by another string?
Has anyone encountered a similar problem?
Since you don't have a specification and each station can send a different format, I would not try to find a "perfect" pattern. Instead, simply create a map that stores a regex for each station's format and use it to recover the title.
First, create a map
Map<String, String> stationPatterns = new HashMap<>();
Then, insert the patterns you know
stationPatterns.put("station1", "(.*)");
stationPatterns.put("station2", "station2 - (.*)");
...
Then you just need to look up the pattern for a station (each pattern must ALWAYS contain one capture group).
public String getPattern(String station){
return stationPatterns.getOrDefault(station, "(.*)"); // Use a default pattern that captures everything
}
With this, you just need to get a pattern to extract the title from a String.
Pattern pattern = Pattern.compile(getPattern(stationSelected));
Matcher matcher = pattern.matcher(title);
if (matcher.find()) {
System.out.println("Title : " + matcher.group(1));
} else {
System.err.println("The title doesn't match the format");
}
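The pieces above can be pulled together into one small class. This is a minimal sketch, assuming patterns are compiled once at registration time; the class name `StationTitles` and the pattern registered for "die neue welle" are illustrative guesses at that station's format, not something the station documents.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StationTitles {
    // Fallback for unknown stations: treat the whole stream title as the song title.
    private static final Pattern DEFAULT = Pattern.compile("(.*)");

    private final Map<String, Pattern> patterns = new HashMap<>();

    // Every registered regex must contain exactly one capture group: the title.
    void register(String station, String regex) {
        patterns.put(station, Pattern.compile(regex));
    }

    // Returns the captured title, or null if the stream title does not fit the station's format.
    String extractTitle(String station, String streamTitle) {
        Matcher m = patterns.getOrDefault(station, DEFAULT).matcher(streamTitle);
        return m.matches() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        StationTitles titles = new StationTitles();
        // Assumed format: "<station> - <slogan> - <artist> - <song> - <show>"
        titles.register("die neue welle", "die neue welle - .*? - (.+) - .*");

        System.out.println(titles.extractTitle("die neue welle",
                "die neue welle - Der beste Musikmix aus 4 Jahrzehnten! - WELSHLY ARMS - SANCTUARY"
                + " - Der Mehr Musik-Arbeitstag mit Benni Rettich"));
        System.out.println(titles.extractTitle("some other station", "Welshly Arms - Sanctuary"));
    }
}
```

Returning null for a non-matching title lets the caller decide whether to fall back to the raw string or to hide the title.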

Two separate patterns and matchers (java)

I'm working on a simple bot for Discord. The first pattern (reading) works fine and I get the results I'm looking for, but the second one doesn't seem to work and I can't figure out why.
Any help would be appreciated.
public void onMessageReceived(MessageReceivedEvent event) {
if (event.getMessage().getContent().startsWith("!")) {
String output, newUrl;
String word, strippedWord;
String url = "http://jisho.org/api/v1/search/words?keyword=";
Pattern reading;
Matcher matcher;
word = event.getMessage().getContent();
strippedWord = word.replace("!", "");
newUrl = url + strippedWord;
//Output contains the raw text from jisho
output = getUrlContents(newUrl);
//Searching through the raw text to pull out the first "reading: "
reading = Pattern.compile("\"reading\":\"(.*?)\"");
matcher = reading.matcher(output);
//Searching through the raw text to pull out the first "english_definitions: "
Pattern def = Pattern.compile("\"english_definitions\":[\"(.*?)]");
Matcher matcher2 = def.matcher(output);
event.getTextChannel().sendMessage(matcher2.toString());
if (matcher.find() && matcher2.find()) {
event.getTextChannel().sendMessage("Reading: "+matcher.group(1)).queue();
event.getTextChannel().sendMessage("Definition: "+matcher2.group(1)).queue();
}
else {
event.getTextChannel().sendMessage("Word not found").queue();
}
}
}
You have to escape the [ character as \\[ (one backslash to escape it in the regex, doubled because it sits inside a Java string literal). You also forgot the closing \".
The correct pattern looks like this:
Pattern def = Pattern.compile("\"english_definitions\":\\[\"(.*?)\"]");
In the output, you might want to re-add the \" at the start and end:
event.getTextChannel().sendMessage("Definition: \""+matcher2.group(1) + "\"").queue();
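One thing to watch with the corrected pattern: when the array holds several definitions, the lazy (.*?) keeps expanding until it reaches the quote that is directly followed by ], so the capture spans all of them. The sketch below is a possible variant that stops at the first closing quote instead. The JSON string is a trimmed, made-up sample in the rough shape of the response, and `JishoRegexDemo` is a hypothetical name; for anything beyond a quick bot, a JSON parser is likely the more robust choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JishoRegexDemo {
    static final Pattern READING = Pattern.compile("\"reading\":\"(.*?)\"");
    // [^\"]* cannot cross a quote, so only the first array element is captured.
    static final Pattern FIRST_DEFINITION =
            Pattern.compile("\"english_definitions\":\\[\"([^\"]*)\"");

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened response used only for this example.
        String json = "{\"japanese\":[{\"reading\":\"neko\"}],"
                + "\"senses\":[{\"english_definitions\":[\"cat\",\"feline\"]}]}";
        System.out.println("Reading: " + firstGroup(READING, json));
        System.out.println("Definition: " + firstGroup(FIRST_DEFINITION, json));
    }
}
```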

How to get multi sub strings from String, Android/Java

I know there are similar questions about this. However, I tried many solutions and none of them works for me.
I need help to extract multiple substrings from a string:
String content = "Ben Conan General Manager 90010021 benconan@gmail.com";
Note: The content of the String may not always be in this format; it may be all jumbled up.
I want to extract the phone number and email like below:
1. 90010021
2. benconan@gmail.com
In my project, I am trying to get this result and then display it in two different EditText fields.
I have tried using the Pattern and Matcher classes, but it did not work.
I can provide my code here if requested. Please help!
--------------------EDIT---------------------
Below is my current method which only take out the email address:
private static final String EMAIL_PATTERN =
"[a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}" +
"\\@" +
"[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}" +
"(" +
"\\." +
"[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25}" +
")+";
public String EmailValidator(String email) {
Pattern pattern = Pattern.compile(EMAIL_PATTERN);
Matcher matcher = pattern.matcher(email);
if (matcher.find()) {
return email.substring(matcher.start(), matcher.end());
} else {
// TODO handle condition when input doesn't have an email address
}
return email;
}
You can split your string into a list like this:
String str = "Ben Conan, General Manager, 90010021, benconan@gmail.com";
List<String> parts = Arrays.asList(str.split(",\\s*"));
Maybe you should do this instead:
String[] stringNames = new String[2];
stringNames[0] = "your phone number";
stringNames[1] = "your email";
System.out.println(Arrays.toString(stringNames));
Or:
String[] stringNames = {"your phone number", "your email"};
System.out.println(stringNames[1]);
String.split(...) is the Java method for that.
EXAMPLE:
String content = "Ben Conan, General Manager, 90010021, benconan@gmail.com";
String[] selection = content.split(",\\s*"); // split on commas and drop the spaces after them
System.out.println(selection[0]); // name
System.out.println(selection[3]); // email address
BUT if you want to do a Regex then take a look at this:
https://stackoverflow.com/a/16053961/982161
Try this regex for the phone number:
\d{8} ---> 8 is the number of digits in the phone number
You can use
\d{8,} ---> if the number can have 8 or more digits
Use the appropriate Java functions for matching (inside a Java string literal, write \\d). You can try the results here
http://regexr.com/
For email, it depends whether the format is simple or complicated. There is a good explanation here
http://www.regular-expressions.info/index.html
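Because the content may be jumbled, scanning with Matcher.find() is probably a better fit than splitting on a fixed delimiter. Here is a minimal sketch, assuming an 8-digit phone number and a deliberately simple email pattern; the `ContactExtractor` name and both patterns are assumptions for illustration, not a complete validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContactExtractor {
    // \b on both sides keeps the pattern from matching 8 digits inside a longer number.
    static final Pattern PHONE = Pattern.compile("\\b\\d{8}\\b");
    // Simplified email pattern; real-world addresses can be more varied.
    static final Pattern EMAIL = Pattern.compile("[\\w.%+-]+@[\\w-]+(?:\\.[\\w-]+)+");

    // Collects every substring of text that matches the pattern, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String content = "Ben Conan General Manager 90010021 benconan@gmail.com";
        System.out.println("Phone: " + findAll(PHONE, content));
        System.out.println("Email: " + findAll(EMAIL, content));
    }
}
```

Each result list can then be joined or indexed to fill the corresponding EditText; the order of the fields in the input no longer matters.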

complex regular expression in Java

I have a rather complex (to me it seems rather complex) problem that I'm using regular expressions in Java for:
I can get any text string that must be of the format:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
I started with a regular expression for extracting the text between the M:/:D:/:C:/:Q: as:
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
And that works fine if the <either a url or string> is just an alphanumeric string. But it all falls apart when the embedded string is a url of the format:
tcp://someurl.something:port
Can anyone help me adjust the above regex so that the text extracted after :D: can be either a URL or an alphanumeric string?
Here's an example:
public static void main(String[] args) {
String name = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
boolean matchFound = false;
ArrayList<String> values = new ArrayList<>();
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
Matcher m3 = Pattern.compile(pattern2).matcher(name);
while (m3.find()) {
matchFound = true;
String m = m3.group(2);
System.out.println("regex found match: " + m);
values.add(m);
}
}
In the above example, my results would be:
myString1
tcp://someurl.com:8989
myString2
1
Note that the strings can be of variable length and are alphanumeric, but some extra characters are allowed (such as the URL format with :// and/or the . and - characters).
You mention that the format is constant:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
Capture groups can do this for you with the pattern:
"M:(.*):D:(.*):C:(.*):Q:(.*)"
Or you can do a String.split() with a pattern of "M:|:D:|:C:|:Q:". However, the split will return an empty element at the first index. Everything else will follow.
public static void main(String[] args) throws Exception {
System.out.println("Regex: ");
String data = "M:<some text>:D:tcp://someurl.something:port:C:<some more text>:Q:<a number>";
Matcher matcher = Pattern.compile("M:(.*):D:(.*):C:(.*):Q:(.*)").matcher(data);
if (matcher.matches()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println(matcher.group(i));
}
}
System.out.println();
System.out.println("String.split(): ");
String[] pieces = data.split("M:|:D:|:C:|:Q:");
for (String piece : pieces) {
System.out.println(piece);
}
}
Results:
Regex:
<some text>
tcp://someurl.something:port
<some more text>
<a number>
String.split():
<some text>
tcp://someurl.something:port
<some more text>
<a number>
To extract the URL/text part you don't need the regular expression. Use
int startPos = input.indexOf(":D:")+":D:".length();
int endPos = input.indexOf(":C:", startPos);
String urlOrText = input.substring(startPos, endPos);
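One caveat with the snippet above is that indexOf returns -1 when a marker is missing, which would silently produce a wrong substring or throw. A guarded version might look like the following sketch; the helper name `between` and the class `FieldSlicer` are hypothetical.

```java
class FieldSlicer {
    // Returns the text between startTag and the next endTag, or null if either tag is missing.
    static String between(String input, String startTag, String endTag) {
        int start = input.indexOf(startTag);
        if (start < 0) {
            return null;
        }
        start += startTag.length();
        int end = input.indexOf(endTag, start);
        if (end < 0) {
            return null;
        }
        return input.substring(start, end);
    }

    public static void main(String[] args) {
        String input = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
        System.out.println(between(input, ":D:", ":C:")); // prints tcp://someurl.com:8989
    }
}
```

This still assumes the URL itself never contains the text ":C:".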
Assuming you need to do some validation along with the parsing,
break the regex into different parts like this:
String m_regex = "[\\w.]+"; // in Java, a . inside [] is just a plain dot
String url_regex = "."; // placeholder: there's a bunch online, pick your favorite
String d_regex = "(?:" + url_regex + "|\\p{Alnum}+)"; // URL or a sequence of alphanumeric characters
String c_regex = "[\\w.]+"; // I'm assuming you want this to be a bit more restrictive; not sure
String q_regex = "\\d+"; // what sort of number exactly? Assuming any string of digits here
String regex = "M:(?<M>" + m_regex + "):"
+ "D:(?<D>" + d_regex + "):"
+ "C:(?<C>" + c_regex + "):"
+ "Q:(?<Q>" + q_regex + ")";
Pattern p = Pattern.compile(regex);
Might be a good idea to keep the pattern as a static field somewhere and compile it in a static block so that the temporary regex strings don't overcrowd some class with basically useless fields.
Then you can retrieve each part by its name:
Matcher m = p.matcher( input );
if (m.matches()) {
String m_part = m.group( "M" );
...
String q_part = m.group( "Q" );
}
You can go even a step further by making a RegexGroup interface whose implementing objects each represent one part of the regex, with a name and the actual regex. You definitely lose simplicity that way, though, which makes the code harder to understand at a quick glance. (I wouldn't do this; I'm just pointing out that it's possible and has its own benefits.)
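For reference, the named-group approach could look like this when filled in with a concrete URL pattern. Treat it as a sketch: the URL regex (scheme, host, optional numeric port) is an assumption chosen to fit the example input, and `MessageParser` is a made-up name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MessageParser {
    // Assumed URL shape: scheme://host with an optional :port. Adjust to your real inputs.
    private static final String URL = "\\w+://[\\w.-]+(?::\\d+)?";

    // Each named group must have a unique name, or Pattern.compile throws PatternSyntaxException.
    static final Pattern MESSAGE = Pattern.compile(
            "M:(?<M>[\\w.]+):"
            + "D:(?<D>" + URL + "|\\p{Alnum}+):"
            + "C:(?<C>[\\w.]+):"
            + "Q:(?<Q>\\d+)");

    // Returns the four parts keyed by group name, or null if the input does not validate.
    static Map<String, String> parse(String input) {
        Matcher m = MESSAGE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : new String[] { "M", "D", "C", "Q" }) {
            parts.put(name, m.group(name));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1"));
        System.out.println(parse("M:myString1:D:plainText:C:myString2:Q:42"));
    }
}
```

Keeping the compiled pattern in a static final field follows the suggestion above about compiling once and avoiding stray temporary regex strings.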
