Regex Parsing Kafka Listener - java

I'm listening to a Kafka topic, receiving the messages, comparing them to an object, and then trying to parse each message. I receive a number of messages about one search, and I'm just trying to get this one:
userName:User.Name userId:FDF3JH4 session:9cf2-21-c6-28-c360f1edba53 searchString:test, searchType:DEFAULT_SEARCH
This is what I want my log pattern to be:
String logPattern = ".*(userName:)(\\S+)\\s(userId:)(\\S+)\\s(session:)(\\S+)\\s(searchString:)([^,]).*";
if (isValidObject) {
final Pattern p = Pattern.compile(logPattern);
Matcher matcher = p.matcher(historyRequest.getLog());
if (!matcher.matches()) {
return;
}
I set up a test function to make sure the message I received and my pattern were correct, but when I put it into the actual function, it doesn't work. It returns no results for String logPattern = ".*"; The strange thing is that, when messing around with the log patterns, I was able to match a Kafka message with this log pattern and this log:
String logPattern = ".*[userName]\\:(\\S+)\\s\\w+:(\\S+)(\\s\\S+\\s\\w+\\:)([^,]+).*";
userName:User.Name userId:D394H4 session:3f1da-0c-fb-90-949a searchString:"test" took:13.0 page:1 resultSize:1 sponsored:false

As near as I can tell, you had a matching pattern. I'm guessing it didn't do exactly what you wanted, because it didn't pick up the searchString argument: ([^,]) captures only a single character. I've posted some code below with a slightly modified version of your pattern. I did two things to it:
I eliminated the parentheses around the constant text.
I fixed the pattern to match all text after searchString up to the comma.
Here's the code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Logtest {
String logPattern_orig = ".*(userName:)(\\S+)\\s(userId:)(\\S+)\\s(session:)(\\S+)\\s(searchString:)([^,]).*";
String logPattern = ".*userName:(\\S+)\\suserId:(\\S+)\\ssession:(\\S+)\\ssearchString:([^,]*),.*";
String kafkaMsg = "userName:User.Name userId:FDF3JH4 session:9cf2-21-c6-28-c360f1edba53 searchString:test, searchType:DEFAULT_SEARCH";
void test() {
final Pattern p = Pattern.compile(logPattern);
Matcher matcher = p.matcher(kafkaMsg);
if (matcher.matches()) {
System.out.println("Matches!");
for (int i=1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + "='" + matcher.group(i) + "'");
}
}
}
public static final void main(String[] args) {
Logtest lt = new Logtest();
lt.test();
}
}
When I run it, I get the following output:
Matches!
Group 1='User.Name'
Group 2='FDF3JH4'
Group 3='9cf2-21-c6-28-c360f1edba53'
Group 4='test'
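To see why the original pattern matched yet lost the search string, it can help to compare what the last group of each pattern captures. This is only a small sketch: the class name LogPatternCheck and the helper lastGroup are made up for this illustration, and the message is the sample from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogPatternCheck {
    // Original pattern from the question: ([^,]) captures exactly one character.
    static final String ORIGINAL =
        ".*(userName:)(\\S+)\\s(userId:)(\\S+)\\s(session:)(\\S+)\\s(searchString:)([^,]).*";
    // Modified pattern from the answer: ([^,]*) captures everything up to the comma.
    static final String MODIFIED =
        ".*userName:(\\S+)\\suserId:(\\S+)\\ssession:(\\S+)\\ssearchString:([^,]*),.*";

    // Hypothetical helper: returns the last capturing group,
    // or null when the whole message does not match.
    static String lastGroup(String regex, String message) {
        Matcher m = Pattern.compile(regex).matcher(message);
        return m.matches() ? m.group(m.groupCount()) : null;
    }

    public static void main(String[] args) {
        String msg = "userName:User.Name userId:FDF3JH4 "
            + "session:9cf2-21-c6-28-c360f1edba53 searchString:test, searchType:DEFAULT_SEARCH";
        System.out.println(lastGroup(ORIGINAL, msg)); // prints t
        System.out.println(lastGroup(MODIFIED, msg)); // prints test
    }
}
```

Note that the second log line in the question (searchString:"test" took:13.0 ...) contains no comma, so the modified pattern would not match that format; lastGroup returns null for it.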

Related

Two separate patterns and matchers (java)

I'm working on a simple bot for Discord. The first pattern (reading) works fine and I get the results I'm looking for, but the second one doesn't seem to work and I can't figure out why.
Any help would be appreciated.
public void onMessageReceived(MessageReceivedEvent event) {
if (event.getMessage().getContent().startsWith("!")) {
String output, newUrl;
String word, strippedWord;
String url = "http://jisho.org/api/v1/search/words?keyword=";
Pattern reading;
Matcher matcher;
word = event.getMessage().getContent();
strippedWord = word.replace("!", "");
newUrl = url + strippedWord;
//Output contains the raw text from jisho
output = getUrlContents(newUrl);
//Searching through the raw text to pull out the first "reading: "
reading = Pattern.compile("\"reading\":\"(.*?)\"");
matcher = reading.matcher(output);
//Searching through the raw text to pull out the first "english_definitions: "
Pattern def = Pattern.compile("\"english_definitions\":[\"(.*?)]");
Matcher matcher2 = def.matcher(output);
event.getTextChannel().sendMessage(matcher2.toString());
if (matcher.find() && matcher2.find()) {
event.getTextChannel().sendMessage("Reading: "+matcher.group(1)).queue();
event.getTextChannel().sendMessage("Definition: "+matcher2.group(1)).queue();
}
else {
event.getTextChannel().sendMessage("Word not found").queue();
}
}
}
You have to escape the [ character as \\[ (once for the Java string and once for the regex). You also forgot the closing \".
The correct pattern looks like this:
Pattern def = Pattern.compile("\"english_definitions\":\\[\"(.*?)\"]");
In the output, you might want to re-add the \" at the start and end:
event.getTextChannel().sendMessage("Definition: \""+matcher2.group(1) + "\"").queue();
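One thing worth checking with the corrected pattern is what the lazy group captures when the array holds several definitions. The sketch below runs it against a simplified, made-up JSON string (not real jisho output); the class name JishoRegexSketch, the helper firstMatch, and the FIRST_DEF variant are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JishoRegexSketch {
    static final Pattern READING = Pattern.compile("\"reading\":\"(.*?)\"");
    // Corrected pattern from the answer: captures everything between [" and "].
    static final Pattern ALL_DEFS =
        Pattern.compile("\"english_definitions\":\\[\"(.*?)\"\\]");
    // Variant that stops at the first closing quote, i.e. the first definition only.
    static final Pattern FIRST_DEF =
        Pattern.compile("\"english_definitions\":\\[\"(.*?)\"");

    // Hypothetical helper: group 1 of the first match, or null if nothing matches.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Simplified stand-in for the API response, for illustration only.
        String json = "{\"reading\":\"neko\",\"english_definitions\":[\"cat\",\"feline\"]}";
        System.out.println(firstMatch(READING, json));   // prints neko
        System.out.println(firstMatch(ALL_DEFS, json));  // prints cat","feline
        System.out.println(firstMatch(FIRST_DEF, json)); // prints cat
    }
}
```

If the response structure gets more complicated than this, a JSON parser is likely to be more robust than regular expressions.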

Using regex and android for categorizing different fields

I am currently trying to build a business name card scanner app. The idea is to take a picture of a name card, extract the text, and sort the text into different EditText fields.
I have already completed the OCR part, which extracts all the text from a name card image.
What I am missing now is a regex method that takes the entire text extracted by OCR and sorts the name, email address, and phone number into their respective EditText fields.
Through some googling I have already found the regex formulas below:
private static final String EMAIL_PATTERN =
"[a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}" +
"\\@" +
"[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}" +
"(" +
"\\." +
"[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25}" +
")+";
private static final String PHONE_PATTERN =
"^[89]\\d{7}$";
private static final String NAME_PATTERN =
"/^[a-z ,.'-]+$/i";
Currently I can only extract the email address, using the method below:
public String EmailValidator(String email) {
Pattern pattern = Pattern.compile(EMAIL_PATTERN);
Matcher matcher = pattern.matcher(email);
if (matcher.find()) {
return email.substring(matcher.start(), matcher.end());
} else {
// TODO handle condition when input doesn't have an email address
}
return email;
}
I am unsure how to edit the above method so that it uses all three regex patterns at once and displays the results in different EditText fields (name, email address, phone number).
--------------------------------------------EDIT-------------------------------------------------
After using @Styx's answer,
I have a problem with the parameter, namely how I pass the text "textToUse" to the method as shown below:
I have also tried passing the text into all three parameters, but since the method is void, that cannot be done. And if I change the method to return a String instead of void, it requires a return value.
Try this code. The function takes the recognized text and splits it on line breaks. It then loops over the lines and determines the type of each one by running a pattern check. As soon as a pattern matches, the loop moves on to the next iteration using the continue keyword. This code can also handle the situation where more than one email address or phone number appears on a single business card. Hope it helps. Cheers!
public void validator(String recognizeText) {
Pattern emailPattern = Pattern.compile(EMAIL_PATTERN);
Pattern phonePattern = Pattern.compile(PHONE_PATTERN);
Pattern namePattern = Pattern.compile(NAME_PATTERN);
String possibleEmail, possiblePhone, possibleName;
possibleEmail = possiblePhone = possibleName = "";
Matcher matcher;
String[] words = recognizeText.split("\\r?\\n");
for (String word : words) {
//try to determine whether the word is an email by running a pattern check.
matcher = emailPattern.matcher(word);
if (matcher.find()) {
possibleEmail = possibleEmail + word + " ";
continue;
}
//try to determine whether the word is a phone number by running a pattern check.
matcher = phonePattern.matcher(word);
if (matcher.find()) {
possiblePhone = possiblePhone + word + " ";
continue;
}
//try to determine whether the word is a name by running a pattern check.
matcher = namePattern.matcher(word);
if (matcher.find()) {
possibleName = possibleName + word + " ";
continue;
}
}
//after the loop then only set possibleEmail, possiblePhone, and possibleName into
//their respective EditText here.
}
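Regarding the void/return-value problem raised in the EDIT, one option is to return all three values in a small holder object. The following is a minimal sketch under several assumptions: CardFields and CardParser are made-up names, the email pattern is a simplified stand-in for the one in the question, and the name pattern is rewritten for Java, which has no /.../i literal syntax and takes the case-insensitive flag as an argument to Pattern.compile instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for the three extracted fields.
class CardFields {
    final String name;
    final String email;
    final String phone;

    CardFields(String name, String email, String phone) {
        this.name = name;
        this.email = email;
        this.phone = phone;
    }
}

class CardParser {
    // Simplified email pattern, an assumption for this sketch.
    private static final Pattern EMAIL = Pattern.compile(
        "[a-zA-Z0-9+._%-]+@[a-zA-Z0-9][a-zA-Z0-9-]*(\\.[a-zA-Z0-9][a-zA-Z0-9-]*)+");
    private static final Pattern PHONE = Pattern.compile("^[89]\\d{7}$");
    // Java equivalent of /^[a-z ,.'-]+$/i.
    private static final Pattern NAME =
        Pattern.compile("^[a-z ,.'-]+$", Pattern.CASE_INSENSITIVE);

    static CardFields parse(String recognizedText) {
        StringBuilder name = new StringBuilder();
        StringBuilder email = new StringBuilder();
        StringBuilder phone = new StringBuilder();
        for (String raw : recognizedText.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            Matcher e = EMAIL.matcher(line);
            if (e.find()) {
                append(email, e.group());   // keep only the matched address
            } else if (PHONE.matcher(line).find()) {
                append(phone, line);
            } else if (NAME.matcher(line).find()) {
                append(name, line);
            }
        }
        return new CardFields(name.toString(), email.toString(), phone.toString());
    }

    // Joins multiple hits for the same field with a single space.
    private static void append(StringBuilder sb, String value) {
        if (sb.length() > 0) {
            sb.append(' ');
        }
        sb.append(value);
    }
}
```

The caller can then do something like CardFields f = CardParser.parse(textToUse); and set each EditText from f.name, f.email, and f.phone. Keep in mind that the name pattern accepts any line made only of letters, spaces, and a few punctuation marks, so a company name or job title on the card would also be collected as a name.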

Unable to find pattern in Java

I have been trying to use a pattern matcher to find a specific pattern. I created the regex pattern through this website, and it shows that the pattern is found in the text file I want to read.
Extra info: the code works like this: start reading the text file; when it meets >D10, enter another loop and collect the information until the next >D10 is found. Repeat this process until EOF.
My sample text file:
D14*
Y7620D03*
X247390Y66680D03*
X251540Y160150D03*
G01Y136780*
G03X-374970Y133680I3100J0*
D17*
Y7620D03*
X247390Y66680D03*
X251540Y160150D03*
G01Y136780*
G03X-374970Y133680I3100J0*
My pattern code in java:
private final Pattern PinNamePattern = compile("(D[1-9][0-9])\\*");
private final Pattern LocationXYPattern = compile("^(G0[1-3])?(X|Y)(-?[\\d]+)(D0[1-3])?\\*",Pattern.MULTILINE);
private final Pattern LocationXYIJPattern = compile("^(G0[1-3])?X(-?[\\d]+)?Y(-?[\\d]+)?I?(-?[\\d]+)?J?(-?[\\d]+)?(D0[1-3])?\\*",Pattern.MULTILINE);
My code in java:
while ((line = br.readLine()) != null) {
Matcher pinNameMatcher = PinNamePattern.matcher(line);
//If found Aperture Name
if (pinNameMatcher.find()) {
currentApperture = pinNameMatcher.group(1);
System.out.println(currentApperture);
pinNameMatcher.reset();
//Start matching Location X Y I J
//Will keep looping as long as next aperture name not found
//Second While loop
while (!(pinNameMatcher.find()) ) {
line = br.readLine();
Matcher locXYMatcher = LocationXYPattern.matcher(line);
Matcher locXYIJMatcher = LocationXYIJPattern.matcher(line);
LineNumber++;
if (locXYMatcher.find()) {
System.out.println("XY FOUND");
if (locXYIJMatcher.find()) {
System.out.println("XYIJ FOUND");
}
}
However, when I read the file with Java, the patterns are simply not found. Is there anything I missed, or am I doing it wrong? I have tried removing the "^" and the MULTILINE flag, but the pattern is still not found.
Your regex looks and works fine; it's possible you aren't applying it properly.
String s = "G03X-374970Y133680I3100J0*";
Pattern pattern = Pattern.compile("^(G0[1-3])?X(-?[\\d]+)?Y(-?[\\d]+)?I?(-?[\\d]+)?J?(-?[\\d]+)?(D0[1-3])?\\*");
Matcher m = pattern.matcher(s);
while (m.find()) {
String match = m.group(0); // renamed: a second "s" would not compile here
System.out.println(match); // prints G03X-374970Y133680I3100J0*
}
In your updated code, you are looking for the second and third pattern only when the first pattern matches, which is probably not what you want. Try using this as a foundation and improving upon it:
while ((line = br.readLine()) != null) {
Matcher pinNameMatcher = PinNamePattern.matcher(line);
if (pinNameMatcher.find()) {
currentApperture = pinNameMatcher.group(0);
System.out.println(currentApperture);
}
Matcher locXYMatcher = LocationXYPattern.matcher(line);
if (locXYMatcher.find()) {
System.out.println(locXYMatcher.group(0));
}
Matcher locXYIJMatcher = LocationXYIJPattern.matcher(line);
if (locXYIJMatcher.find()) {
System.out.println(locXYIJMatcher.group(0));
}
}
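A likely reason the inner loop in the question never prints anything: pinNameMatcher stays bound to the line it was created with, so after reset() the next find() succeeds again on that same aperture line and the while (!pinNameMatcher.find()) condition is false straight away. Reassigning line does not change what an existing Matcher scans. A single loop that remembers the current aperture avoids the problem. The sketch below uses the patterns from the question; the class name ApertureReader, the method names, and the map-based result are assumptions made for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApertureReader {
    static final Pattern PIN_NAME = Pattern.compile("(D[1-9][0-9])\\*");
    static final Pattern XY =
        Pattern.compile("^(G0[1-3])?(X|Y)(-?[\\d]+)(D0[1-3])?\\*");
    static final Pattern XYIJ = Pattern.compile(
        "^(G0[1-3])?X(-?[\\d]+)?Y(-?[\\d]+)?I?(-?[\\d]+)?J?(-?[\\d]+)?(D0[1-3])?\\*");

    // Groups every location line under the most recently seen aperture name.
    static Map<String, List<String>> read(BufferedReader br) throws IOException {
        Map<String, List<String>> byAperture = new LinkedHashMap<>();
        String current = null;
        String line;
        while ((line = br.readLine()) != null) {
            Matcher pin = PIN_NAME.matcher(line); // a fresh matcher for each line
            if (pin.find()) {
                current = pin.group(1);
                byAperture.putIfAbsent(current, new ArrayList<>());
                continue;
            }
            if (current == null) {
                continue; // location data before any aperture line is ignored
            }
            if (XY.matcher(line).find() || XYIJ.matcher(line).find()) {
                byAperture.get(current).add(line);
            }
        }
        return byAperture;
    }

    // Convenience wrapper for in-memory text, mainly useful for trying things out.
    static Map<String, List<String>> readText(String text) {
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            return read(br);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the sample file from the question, this yields two keys, D14 and D17, each holding the five location lines that follow it.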

complex regular expression in Java

I have a rather complex (to me it seems rather complex) problem that I'm using regular expressions in Java for:
I can get any text string that must be of the format:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
I started with a regular expression for extracting the text between the M:/:D:/:C:/:Q: as:
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
And that works fine if the <either a url or string> is just an alphanumeric string. But it all falls apart when the embedded string is a url of the format:
tcp://someurl.something:port
Can anyone help me adjust the above regex to extract the text after :D: as either a URL or an alphanumeric string?
Here's an example:
public static void main(String[] args) {
String name = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
boolean matchFound = false;
ArrayList<String> values = new ArrayList<>();
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
Matcher m3 = Pattern.compile(pattern2).matcher(name);
while (m3.find()) {
matchFound = true;
String m = m3.group(2);
System.out.println("regex found match: " + m);
values.add(m);
}
}
In the above example, my results would be:
myString1
tcp://someurl.com:8989
myString2
1
Note that the strings can be of variable length and alphanumeric, but some other characters are allowed (such as the URL format with :// and/or . - characters).
You mention that the format is constant:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
Capture groups can do this for you with the pattern:
"M:(.*):D:(.*):C:(.*):Q:(.*)"
Or you can do a String.split() with a pattern of "M:|:D:|:C:|:Q:". However, the split will return an empty element at the first index. Everything else will follow.
public static void main(String[] args) throws Exception {
System.out.println("Regex: ");
String data = "M:<some text>:D:tcp://someurl.something:port:C:<some more text>:Q:<a number>";
Matcher matcher = Pattern.compile("M:(.*):D:(.*):C:(.*):Q:(.*)").matcher(data);
if (matcher.matches()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println(matcher.group(i));
}
}
System.out.println();
System.out.println("String.split(): ");
String[] pieces = data.split("M:|:D:|:C:|:Q:");
for (String piece : pieces) {
System.out.println(piece);
}
}
Results:
Regex:
<some text>
tcp://someurl.something:port
<some more text>
<a number>
String.split():
<some text>
tcp://someurl.something:port
<some more text>
<a number>
To extract the URL/text part you don't need a regular expression. Use:
int startPos = input.indexOf(":D:")+":D:".length();
int endPos = input.indexOf(":C:", startPos);
String urlOrText = input.substring(startPos, endPos);
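One caveat with the indexOf approach: when ":D:" is missing, indexOf returns -1, so startPos silently becomes 2 and the substring is taken from the wrong place. A guarded variant might look like the sketch below; the class name MarkerExtract and the method between are made up for this illustration.

```java
class MarkerExtract {
    // Returns the text between the first occurrence of open and the next
    // occurrence of close after it, or null if either marker is missing.
    static String between(String input, String open, String close) {
        int start = input.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = input.indexOf(close, start);
        return end < 0 ? null : input.substring(start, end);
    }

    public static void main(String[] args) {
        String input = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
        System.out.println(between(input, ":D:", ":C:")); // prints tcp://someurl.com:8989
        System.out.println(between("M:onlyM", ":D:", ":C:")); // prints null
    }
}
```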
Assuming you need to do some validation along with the parsing:
break the regex into different parts like this:
String m_regex = "[\\w.]+"; // in Java a . inside [] is just a plain dot
String url_regex = "."; // placeholder: there are plenty online, pick your favorite.
String d_regex = "(?:" + url_regex + "|\\p{Alnum}+)"; // a URL or a sequence of alphanumeric characters
String c_regex = "[\\w.]+"; // I'm assuming you want this to be a bit more restrictive; not sure.
String q_regex = "\\d+"; // what sort of number exactly? assuming any string of digits here
// Each named group needs its own name; reusing a name throws PatternSyntaxException.
String regex = "M:(?<M>" + m_regex + "):"
+ "D:(?<D>" + d_regex + "):"
+ "C:(?<C>" + c_regex + "):"
+ "Q:(?<Q>" + q_regex + ")";
Pattern p = Pattern.compile(regex);
It might be a good idea to keep the pattern as a static field and compile it in a static block, so that the temporary regex strings don't clutter the class with otherwise useless fields.
Then you can retrieve each part by its name:
Matcher m = p.matcher( input );
if (m.matches()) {
String m_part = m.group( "M" );
...
String q_part = m.group( "Q" );
}
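Put together, the named-group approach could look roughly like the sketch below. The URL sub-pattern is only an assumed shape (scheme://host with an optional :port) chosen so that the example input from the question matches; the class name NamedGroupSketch and the method parse are likewise made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupSketch {
    // Assumed URL shape for this sketch: scheme://host with an optional :port.
    private static final String URL = "[a-z][a-z0-9+.-]*://[\\w.-]+(?::\\d+)?";

    // Compiled once and kept in a static field.
    private static final Pattern FORMAT = Pattern.compile(
          "M:(?<M>[\\w.]+):"
        + "D:(?<D>(?:" + URL + "|\\p{Alnum}+)):"
        + "C:(?<C>[\\w.]+):"
        + "Q:(?<Q>\\d+)");

    // Returns the four parts keyed by group name, or null if the input is invalid.
    static Map<String, String> parse(String input) {
        Matcher m = FORMAT.matcher(input);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : new String[] {"M", "D", "C", "Q"}) {
            parts.put(name, m.group(name));
        }
        return parts;
    }

    public static void main(String[] args) {
        Map<String, String> parts =
            parse("M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1");
        System.out.println(parts.get("D")); // prints tcp://someurl.com:8989
        System.out.println(parts.get("Q")); // prints 1
    }
}
```

Because the URL alternative consumes its own :port, the colon inside the URL no longer collides with the :C: delimiter.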
You can go a step further by making a RegexGroup interface whose implementing objects each represent one part of the regex, with a name and the actual regex. You definitely lose the simplicity, though, which makes the code harder to understand at a quick glance. (I wouldn't do this; I'm just pointing out that it's possible and has its own benefits.)

regex matcher check in if logic not working

Hi, you can see my code below. I have the strings country, Rank, and Grank in my code. Initially they are null, but if the regex matches, the value should change. Even when the regex matches, however, the value does not change; it is always null. If I remove all the if statements and just assign the strings, it works fine, but then an exception is thrown when no match is found. Please let me know how I can check this in the if logic.
System.err.println(content);
Pattern c = Pattern.compile("NAME=\"(.*)\" RANK");
Pattern r = Pattern.compile("\" RANK=\"(.*)\"");
Pattern gr = Pattern.compile("\" TEXT=\"(.*)\" SOURCE");
Matcher co = c.matcher(content);
Matcher ra = r.matcher(content);
Matcher gra = gr.matcher(content);
co.find();
ra.find();
gra.find();
String country = null;
String Rank = null;
String Grank = null;
if (co.matches()) {
country = co.group(1);
}
if (ra.matches()) {
Rank = ra.group(1);
}
if (gra.matches()) {
Grank = gra.group(1);
}
You have to escape a single \ - use double \\ then it should work.
Have you tried this?
while (co.find()) {
System.out.print("Start index: " + co.start());
System.out.print(" End index: " + co.end() + " ");
System.out.println(co.group());
}
Personally, I can't make your program work with or without the if statements, so it's not a logic problem; the pattern simply doesn't match the string for me.
So I changed it to get something working; maybe you can use it :)
String content = "NAME=\"salut\" RANK=\"pouet\" TEXT=\"text\" SOURCE";
System.out.println(content);
System.out.println(content.replaceAll(("NAME=\"(.*)\"\\sRANK=\"(.*)\"\\sTEXT=\"(.*)\" SOURCE"), "$1---$2---$3"));
Output
NAME="salut" RANK="pouet" TEXT="text" SOURCE
salut---pouet---text
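On the if logic itself: Matcher.matches() succeeds only when the entire input matches the pattern, whereas find() looks for the pattern anywhere inside the input. If content contains anything beyond what a pattern describes, matches() returns false even after a successful find(), which would explain the values staying null. Calling group(1) after a failed find() throws IllegalStateException, which fits the exception seen without the if statements. Testing the result of find() directly covers both cases. The sketch below assumes the sample content from the answer above; the class name AttributeFinder and the helper firstGroupOrNull are made up, and (.*) has been swapped for [^"]* so a group cannot run past its closing quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeFinder {
    static final Pattern NAME = Pattern.compile("NAME=\"([^\"]*)\" RANK");
    static final Pattern RANK = Pattern.compile("\" RANK=\"([^\"]*)\"");
    static final Pattern TEXT = Pattern.compile("\" TEXT=\"([^\"]*)\" SOURCE");

    // Uses find() as the condition: group 1 on success, null otherwise.
    static String firstGroupOrNull(Pattern p, String content) {
        Matcher m = p.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content = "NAME=\"salut\" RANK=\"pouet\" TEXT=\"text\" SOURCE";
        System.out.println(firstGroupOrNull(NAME, content)); // prints salut
        System.out.println(firstGroupOrNull(RANK, content)); // prints pouet
        System.out.println(firstGroupOrNull(TEXT, content)); // prints text
        System.out.println(firstGroupOrNull(NAME, "no attributes here")); // prints null
    }
}
```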
