Regular expression for comma separated string - java

I am trying to create a pattern in Java that matches the following string:
String message ="%%140911,A,140929100526,S0117.6262E03647.8107,000,067,F100,4F000100,108";
The pattern I have formed is not matching the string. What am I missing? This is the pattern I have tried so far:
private static final Pattern pattern = Pattern.compile(
"(\\%\\%)"+"(\\d)," + // Id
"([AL])," + // Validity a for valid and l for invalid
"(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})," + // Date (YYMMDD)Time (HHMMSS)
"([NS])" + "(\\d{2})(\\d{2}\\.\\d+)" + "([EW])" + "(\\d{3})(\\d{2}\\.\\d+)," + //loc
"(\\d+)," + // Speed
"(\\d+)," + // Direction
"([FC])" + "(\\d{3})," + // temperature in Fahrenheit/celsius
"(\\w{8})," + // status
"(\\d+)"); // event

You're missing a + quantifier in the first line: (\\d) matches only a single digit, but the ID 140911 has six. Try changing
"(\\%\\%)"+"(\\d),"
to
"(\\%\\%)"+"(\\d+),"

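As a quick check of the fix above, the sketch below compiles the question's pattern with (\\d+) for the ID and runs it against the sample message. The class name GpsMessagePattern and the helper idOf are made up for illustration; the group numbering simply follows the order of the parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GpsMessagePattern {
    // Same pattern as in the question, except the ID group is (\\d+) instead of (\\d).
    static final Pattern PATTERN = Pattern.compile(
        "(\\%\\%)" + "(\\d+),"                                    // Id
        + "([AL]),"                                               // Validity
        + "(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2}),"     // Date and time
        + "([NS])" + "(\\d{2})(\\d{2}\\.\\d+)"                    // Latitude
        + "([EW])" + "(\\d{3})(\\d{2}\\.\\d+),"                   // Longitude
        + "(\\d+),"                                               // Speed
        + "(\\d+),"                                               // Direction
        + "([FC])" + "(\\d{3}),"                                  // Temperature
        + "(\\w{8}),"                                             // Status
        + "(\\d+)");                                              // Event

    // Returns the ID (group 2) if the whole message matches, otherwise null.
    static String idOf(String message) {
        Matcher m = PATTERN.matcher(message);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String message = "%%140911,A,140929100526,S0117.6262E03647.8107,000,067,F100,4F000100,108";
        System.out.println(idOf(message)); // prints 140911
    }
}
```

With the original (\\d), the same call would return null, because a single digit cannot be followed directly by the comma after 140911.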

Regex: starts with messages and string between parent message curly brace

I want to get all the message data: the regex should find each top-level message and everything between the curly braces of that parent message. With the pattern below, I am not getting the whole parent body.
String data = "syntax = \"proto3\";\r\n" +
"package grpc;\r\n" +
"\r\n" +
"import \"envoyproxy/protoc-gen-validate/validate/validate.proto\";\r\n" +
"import \"google/api/annotations.proto\";\r\n" +
"import \"google/protobuf/wrappers.proto\";\r\n" +
"import \"protoc-gen-swagger/options/annotations.proto\";\r\n" +
"\r\n" +
"message Acc {\r\n" +
" message AccErr {\r\n" +
" enum Enum {\r\n" +
" UNKNOWN = 0;\r\n" +
" CASH = 1;\r\n" +
" }\r\n" +
" }\r\n" +
" string account_id = 1;\r\n" +
" string name = 3;\r\n" +
" string account_type = 4;\r\n" +
"}\r\n" +
"\r\n" +
"message Name {\r\n" +
" string firstname = 1;\r\n" +
" string lastname = 2;\r\n" +
"}";
List<String> allMessages = new ArrayList<>();
Pattern pattern = Pattern.compile("message[^\\}]*\\}");
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
String str = matcher.group();
allMessages.add(str);
System.out.println(str);
}
}
I expect my ArrayList of strings to have size 2, with contents like the following.
allMessages.get(0) should be:
message Acc {
message AccErr {
enum Enum {
UNKNOWN = 0;
CASH = 1;
}
}
string account_id = 1;
string name = 3;
string account_type = 4;
}
and allMessages.get(1) should be:
message Name {
string firstname = 1;
string lastname = 2;
}
First remove everything before the first line that starts with "message", then split on line breaks that are followed by "message" (the line breaks are part of the split pattern, so the blank lines between parent messages are consumed):
String[] messages = data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)");
If you actually need a List<String>, pass that result to Arrays.asList():
List<String> messages = Arrays.asList(data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)"));
The first regex matches everything from the start of the input up to, but not including, the first line that starts with message; that match is replaced with an empty string (i.e. deleted). Breaking it down:
(?sm) turns on flag s, which makes dot also match newlines, and flag m, which makes ^ and $ match at the start and end of each line
\\A means the very start of input
.*? means any quantity of any character (including newlines, because the s flag is set); the trailing ? makes the quantifier reluctant, so it matches as few characters as possible while still matching
(?=^message) is a lookahead and means the following characters are the start of a line, then "message"
The split regex matches one or more line break sequences when they are followed by "message":
\\R+ means one or more line break sequences (all OS variants)
(?=message) is a look ahead and means the following characters are "message"
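Put together, the two steps might look like the following sketch. The class name MessageSplitter, the method topLevelMessages, and the shortened sample input are assumptions for illustration; the two regexes are the ones explained above, with the first lookahead written as (?=^message) to follow the breakdown.

```java
import java.util.Arrays;
import java.util.List;

final class MessageSplitter {
    // Drops everything before the first line starting with "message",
    // then splits on line breaks that are directly followed by "message".
    static List<String> topLevelMessages(String data) {
        String trimmed = data.replaceAll("(?sm)\\A.*?(?=^message)", "");
        return Arrays.asList(trimmed.split("\\R+(?=message)"));
    }

    public static void main(String[] args) {
        String data = "syntax = \"proto3\";\r\n"
            + "\r\n"
            + "message Acc {\r\n"
            + "  message AccErr {\r\n"   // indented, so it is not a split point
            + "  }\r\n"
            + "  string account_id = 1;\r\n"
            + "}\r\n"
            + "\r\n"
            + "message Name {\r\n"
            + "  string firstname = 1;\r\n"
            + "}";
        List<String> messages = topLevelMessages(data);
        System.out.println(messages.size()); // prints 2
    }
}
```

This relies on nested messages being indented: a nested message that starts in column 0 would also be treated as a split point.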
Try this for your regex. It anchors on message being the start of a line, and uses a positive lookahead to find the next message or the end of messages.
Pattern.compile("(?s)\r\n(message.*?)(?=(\r\n)+message|$)")
// or
Pattern.compile("(?s)\r?\n(message.*?)(?=(\r?\n)+message|$)")
No splitting, parsing, or managing nested braces either :)
https://regex101.com/r/Wa2xxx/1

How to replace all domains with pattern in a XML string in Java?

I have an XML output like this (the <xml> element and the xlink:href attribute are just placeholders; you cannot rely on them when creating the regex pattern):
<xml>http://localhost:8080/def/abc/xyx</xml>
<element xlink:href="http://localhostABCDEF/def/ABC/XYZ">Some Text</element>
...
What I want to do is use a Java regex to replace the domain part matched by this pattern (I don't know the existing domains in advance):
http(s)?://.*/def/.*
with an input domain (e.g. http://google.com/def), so that the result will be:
<xml>http://google.com/def/abc/xyx</xml>
<element xlink:href="http://google.com/def/ABC/XYZ">Some Text</element>
...
How can I do it? I think a Java regex or String.replaceAll can do it (though the latter does not seem possible).
Regex: http[s]?:\/{2}.+\/def
Substitution: http://google.com/def
Details:
? Matches between zero and one times
[] Match a single character present in the list
. Matches any character
+ Matches between one and unlimited times
Java code:
String domain = "http://google.com/def";
String html = "<xml>http://localhost:8080/def/abc/xyx</xml>\r\n<element xlink:href=\"http://localhostABCDEF/def/ABC/XYZ\">Some Text</element>";
html = html.replaceAll("http[s]?:\\/{2}.+\\/def", domain);
System.out.print(html);
Output:
<xml>http://google.com/def/abc/xyx</xml>
<element xlink:href="http://google.com/def/ABC/XYZ">Some Text</element>
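One caveat with .+ is that it is greedy: if two URLs share a line, the match runs from the first http to the last /def on that line. A hedged alternative, assuming the host part never contains a slash, whitespace, quotes, or angle brackets, is to match only those characters before /def. The names DomainRewriter and rewrite are hypothetical.

```java
import java.util.regex.Matcher;

final class DomainRewriter {
    // Replaces "scheme://host[:port]/def" with newPrefix wherever it occurs.
    // The character class stops at the first '/', so only the host is consumed.
    static String rewrite(String xml, String newPrefix) {
        // quoteReplacement keeps '$' and '\' in newPrefix from being read as group references.
        return xml.replaceAll("https?://[^/\\s\"<>]+/def", Matcher.quoteReplacement(newPrefix));
    }

    public static void main(String[] args) {
        String xml = "<a href=\"http://localhost:8080/def/abc\"/><b>https://localhostABCDEF/def/XYZ</b>";
        System.out.println(rewrite(xml, "http://google.com/def"));
        // prints <a href="http://google.com/def/abc"/><b>http://google.com/def/XYZ</b>
    }
}
```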
Actually, this can be done with a regex, and it is simpler than parsing the XML document. Here is the answer:
String text = "<epsg:CommonMetaData>\n"
+ " <epsg:type>geographic 2D</epsg:type>\n"
+ " <epsg:informationSource>EPSG. See 3D CRS for original information source.</epsg:informationSource>\n"
+ " <epsg:revisionDate>2007-08-27</epsg:revisionDate>\n"
+ " <epsg:changes>\n"
+ " <epsg:changeID xlink:href=\"http://www.opengis.net/def/change-request/EPSG/0/2002.151\"/>\n"
+ " <epsg:changeID xlink:href=\"http://www.opengis.net/def/change-request/EPSG/0/2003.370\"/>\n"
+ " <epsg:changeID xlink:href=\"http://www.opengis.net/def/change-request/EPSG/0/2006.810\"/>\n"
+ " <epsg:changeID xlink:href=\"http://www.opengis.net/def/change-request/EPSG/0/2007.079\"/>\n"
+ " </epsg:changes>\n"
+ " <epsg:show>true</epsg:show>\n"
+ " <epsg:isDeprecated>false</epsg:isDeprecated>\n"
+ " </epsg:CommonMetaData>\n"
+ " </gml:metaDataProperty>\n"
+ " <gml:metaDataProperty>\n"
+ " <epsg:CRSMetaData>\n"
+ " <epsg:projectionConversion xlink:href=\"http://www.opengis.net/def/coordinateOperation/EPSG/0/15593\"/>\n"
+ " <epsg:sourceGeographicCRS xlink:href=\"http://www.opengis.net/def/crs/EPSG/0/4979\"/>\n"
+ " </epsg:CRSMetaData>\n"
+ " </gml:metaDataProperty>"
+ "<gml:identifier codeSpace=\"OGP\">http://www.opengis.net/def/area/EPSG/0/1262</gml:identifier>";
String patternString1 = "(http(s)?://.*/def/.*)";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(text);
String prefixDomain = "http://localhost:8080/def";
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
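// Keep everything after the first "def"; a limit of 2 stops a later "def" from truncating the path.
String url = prefixDomain + matcher.group(1).split("def", 2)[1];
// quoteReplacement keeps '$' or '\' in the URL from being read as group references.
matcher.appendReplacement(sb, Matcher.quoteReplacement(url));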
System.out.println(url);
}
matcher.appendTail(sb);
System.out.println(sb.toString());
which returns output https://www.diffchecker.com/CyJ8fY8p

Nested/Repeated Group in Regex

I have to parse a multi line string and retrieve the email addresses in a specific location.
And I have done it using the below code:
String input = "Content-Type: application/ms-tnef; name=\"winmail.dat\"\r\n"
+ "Content-Transfer-Encoding: binary\r\n" + "From: ABC aa DDD <aaaa.b#abc.com>\r\n"
+ "To: DDDDD dd <sssss.r#abc.com>\r\n" + "CC: Rrrrr rrede <sssss.rv#abc.com>, Dsssssf V R\r\n"
+ " <dsdsdsds.vr#abc.com>, Psssss A <pssss.a#abc.com>, Logistics\r\n"
+ " <LOGISTICS#abc.com>, Gssss Bsss P <gdfddd.p#abc.com>\r\n"
+ "Subject: RE: [MyApps] (PRO-34604) PR for Additional Monitor allocation [CITS\r\n"
+ " Ticket:258849]\r\n" + "Thread-Topic: [MyApps] (PRO-34604) PR for Additional Monitor allocation\r\n"
+ " [CITS Ticket:258849]\r\n" + "Thread-Index: AQHRXMJHE6KqCFxKBEieNqGhdNy7Pp8XHc0A\r\n"
+ "Date: Mon, 1 Feb 2016 17:56:17 +0530\r\n"
+ "Message-ID: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "References: <JA.101.1453963700000#myapps.abc.com>\r\n"
+ " <JA.101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "In-Reply-To: <JIRA.450101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "Accept-Language: en-US\r\n" + "Content-Language: en-US\r\n" + "X-MS-Has-Attach:\r\n"
+ "X-MS-Exchange-Organization-SCL: -1\r\n"
+ "X-MS-TNEF-Correlator: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "MIME-Version: 1.0\r\n" + "X-MS-Exchange-Organization-AuthSource: TURWINSRVRPS01.abc.com\r\n"
+ "X-MS-Exchange-Organization-AuthAs: Internal\r\n" + "X-MS-Exchange-Organization-AuthMechanism: 04\r\n"
+ "X-Originating-IP: [1.1.1.7]";
Pattern pattern = Pattern.compile("To:(.*<([^>]*)>).*Message-ID", Pattern.DOTALL);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
Pattern innerPattern = Pattern.compile("<([^>]*)>");
Matcher innerMatcher = innerPattern.matcher(matcher.group(1));
while (innerMatcher.find()) {
System.out.println("-->:" + innerMatcher.group(1));
}
}
This works fine. I first capture the part from To: up to Message-ID, which is the section I need, and then I use a second pattern to extract the email IDs from it.
Is there a better way to do this? Can it be done with a single pattern and matcher?
Update:
This is the expected output:
-->:sssss.r#abc.com
-->:sssss.rv#abc.com
-->:dsdsdsds.vr#abc.com
-->:pssss.a#abc.com
-->:LOGISTICS#abc.com
-->:gdfddd.p#abc.com
Ideally, you could have used lookarounds:
(?<=To:.*)<([^>]+)>(?=.*Message-ID)
Unfortunately, Java doesn't support unbounded-length lookbehinds. A workaround is to bound the quantifier (compile with Pattern.DOTALL so that . can cross line breaks):
(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)
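The bounded-lookbehind workaround could be exercised like this. It is only a sketch: RecipientExtractor, recipients, and the shortened header block are invented for illustration, and the 1000-character bound is an assumed upper limit on how far an address can sit from To:.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecipientExtractor {
    // (?s) lets '.' cross line breaks; {0,1000} keeps the lookbehind bounded,
    // which is what Java requires.
    private static final Pattern ADDRESS =
        Pattern.compile("(?s)(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)");

    // Collects every <...> that has "To:" behind it and "Message-ID" ahead of it.
    static List<String> recipients(String headers) {
        List<String> found = new ArrayList<>();
        Matcher m = ADDRESS.matcher(headers);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String headers = "From: A <a#x.com>\r\n"
            + "To: B <b#x.com>\r\n"
            + "CC: C <c#x.com>,\r\n"
            + " D <d#x.com>\r\n"
            + "Message-ID: <id#x.com>\r\n"
            + "References: <ref#x.com>";
        System.out.println(recipients(headers)); // prints [b#x.com, c#x.com, d#x.com]
    }
}
```

Addresses more than 1000 characters after To: would be missed, so the bound has to be chosen with the longest expected header block in mind.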
I think you are looking for all the emails inside <...> that come after To: and before Message-ID. So, you may use a \G based regex for one pass:
Pattern pt = Pattern.compile("(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);
Matcher m = pt.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
The regex matches:
(?:\\bTo:|(?!^)\\G) - a leading boundary, either To: as a whole word or the location after the previous successful match
.*? - any characters, as few as possible, up to the first <
<([^>]*)> - substring starting with < followed with zero or more characters other than > (Group 1) and followed with a closing >
(?=.*Message-ID) - a positive lookahead that makes sure there is Message-ID somewhere ahead of the current match.

Java RegEx Trouble: Matching Function Body

I'm having quite some trouble with the following regex.
String fktRegex ="public double " + a+ "2" + b + "(double value) {return value * (.*);}\n";
a and b are Strings that are inserted individually.
The regex works just fine until I also want to capture the number. That's the (.*) part...
Any help? I would be really glad! Thanks.
C.
Judging by your example, I think you need to escape a few regex metacharacters such as { } ( ) *, so your regex should probably look more like
"public double " + a + "2" + b + "\\(double value\\) \\{return value \\* (.*);\\}\n";
Demo
// abc2xyz
String a = "abc";
String b = "xyz";
String fktRegex = "public double " + a + "2" + b + "\\(double value\\) \\{return value \\* (.*);\\}\n";
String data = "public double abc2xyz(double value) {return value * 100000;}\n";
Pattern p = Pattern.compile(fktRegex);
Matcher m = p.matcher(data);
if(m.find()){
System.out.println(m.group(1));
}else{
System.out.println("no match found");
}
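An alternative to escaping each metacharacter by hand is Pattern.quote, which turns a string into a literal pattern. The sketch below assumes the same function layout as in the question; FunctionBodyMatcher, conversionPattern, and factor are hypothetical names. It also covers the case where a or b themselves contain regex metacharacters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FunctionBodyMatcher {
    // Quotes the fixed text before and after the number, leaving only (.*) as real regex.
    static Pattern conversionPattern(String a, String b) {
        String head = "public double " + a + "2" + b + "(double value) {return value * ";
        String tail = ";}\n";
        return Pattern.compile(Pattern.quote(head) + "(.*)" + Pattern.quote(tail));
    }

    // Returns the captured factor, or null if the function is not found.
    static String factor(String a, String b, String source) {
        Matcher m = conversionPattern(a, b).matcher(source);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String data = "public double abc2xyz(double value) {return value * 100000;}\n";
        System.out.println(factor("abc", "xyz", data)); // prints 100000
    }
}
```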

Advanced parsing of numeric ranges from string

I'm using Java to parse strings input by the user, representing either single numeric values or ranges. The user can input the following string:
10-19
The intention is to use the whole numbers from 10 to 19: 10, 11, 12, ..., 19
The user can also specify a list of numbers:
10,15,19
Or a combination of the above:
10-19,25,33
Is there a convenient method, perhaps based on regular expressions, to perform this parsing? Or must I split the string using String.split(), then manually iterate the special signs (',' and '-' in this case)?
This is how I would go about it:
Split the input using , as the delimiter.
If a token matches the regular expression ^(\\d+)-(\\d+)$, I know I have a range. I would then extract the two numbers and create the range (it is a good idea to check that the first number is lower than the second, because you never know...), then act accordingly.
If a token matches the regular expression ^\\d+$, I know I have a single number, and I would act accordingly.
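The steps above can be sketched as follows. RangeParser and parse are made-up names, and throwing IllegalArgumentException on bad input is just one possible way to "act accordingly".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RangeParser {
    private static final Pattern RANGE = Pattern.compile("^(\\d+)-(\\d+)$");
    private static final Pattern SINGLE = Pattern.compile("^\\d+$");

    // Expands input such as "10-12,25,33" into [10, 11, 12, 25, 33].
    static List<Integer> parse(String input) {
        List<Integer> values = new ArrayList<>();
        // A limit of -1 keeps trailing empty tokens, so "1,2," is rejected below.
        for (String token : input.split(",", -1)) {
            Matcher range = RANGE.matcher(token);
            if (range.matches()) {
                int low = Integer.parseInt(range.group(1));
                int high = Integer.parseInt(range.group(2));
                if (low > high) {
                    throw new IllegalArgumentException("Descending range: " + token);
                }
                for (int v = low; v <= high; v++) {
                    values.add(v);
                }
            } else if (SINGLE.matcher(token).matches()) {
                values.add(Integer.parseInt(token));
            } else {
                throw new IllegalArgumentException("Not a number or range: \"" + token + "\"");
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("10-12,25,33")); // prints [10, 11, 12, 25, 33]
    }
}
```

Whitespace is not tolerated here, so "1 - 9" is rejected; trimming each token first would be an easy extension if that input should be accepted.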
This tested (and fully commented) regex solution meets the OP requirements:
Java regex solution
// TEST.java 20121024_0700
import java.util.regex.*;
public class TEST {
public static boolean isValidIntRangeInput(String text) {
Pattern re_valid = Pattern.compile(
"# Validate comma separated integers/integer ranges.\n" +
"^ # Anchor to start of string. \n" +
"[0-9]+ # Integer of 1st value (required). \n" +
"(?: # Range for 1st value (optional). \n" +
" - # Dash separates range integer. \n" +
" [0-9]+ # Range integer of 1st value. \n" +
")? # Range for 1st value (optional). \n" +
"(?: # Zero or more additional values. \n" +
" , # Comma separates additional values. \n" +
" [0-9]+ # Integer of extra value (required). \n" +
" (?: # Range for extra value (optional). \n" +
" - # Dash separates range integer. \n" +
" [0-9]+ # Range integer of extra value. \n" +
" )? # Range for extra value (optional). \n" +
")* # Zero or more additional values. \n" +
"$ # Anchor to end of string. ",
Pattern.COMMENTS);
return re_valid.matcher(text).matches();
}
public static void printIntRanges(String text) {
Pattern re_next_val = Pattern.compile(
"# extract next integers/integer range value. \n" +
"([0-9]+) # $1: 1st integer (Base). \n" +
"(?: # Range for value (optional). \n" +
" - # Dash separates range integer. \n" +
" ([0-9]+) # $2: 2nd integer (Range) \n" +
")? # Range for value (optional). \n" +
"(?:,|$) # End on comma or string end.",
Pattern.COMMENTS);
Matcher m = re_next_val.matcher(text);
String msg;
int i = 0;
while (m.find()) {
msg = " value["+ ++i +"] ibase="+ m.group(1);
if (m.group(2) != null) {
msg += " range="+ m.group(2);
}
System.out.println(msg);
}
}
public static void main(String[] args) {
String[] arr = new String[]
{ // Valid inputs:
"1",
"1,2,3",
"1-9",
"1-9,10-19,20-199",
"1-8,9,10-18,19,20-199",
// Invalid inputs:
"A",
"1,2,",
"1 - 9",
" ",
""
};
// Loop through all test input strings:
int i = 0;
for (String s : arr) {
String msg = "String["+ ++i +"] = \""+ s +"\" is ";
if (isValidIntRangeInput(s)) {
// Valid input line
System.out.println(msg +"valid input. Parsing...");
printIntRanges(s);
} else {
// Match attempt failed
System.out.println(msg +"NOT valid input.");
}
}
}
}
Output:
String[1] = "1" is valid input. Parsing...
value[1] ibase=1
String[2] = "1,2,3" is valid input. Parsing...
value[1] ibase=1
value[2] ibase=2
value[3] ibase=3
String[3] = "1-9" is valid input. Parsing...
value[1] ibase=1 range=9
String[4] = "1-9,10-19,20-199" is valid input. Parsing...
value[1] ibase=1 range=9
value[2] ibase=10 range=19
value[3] ibase=20 range=199
String[5] = "1-8,9,10-18,19,20-199" is valid input. Parsing...
value[1] ibase=1 range=8
value[2] ibase=9
value[3] ibase=10 range=18
value[4] ibase=19
value[5] ibase=20 range=199
String[6] = "A" is NOT valid input.
String[7] = "1,2," is NOT valid input.
String[8] = "1 - 9" is NOT valid input.
String[9] = " " is NOT valid input.
String[10] = "" is NOT valid input.
Note that this solution simply demonstrates how to validate an input line and how to parse/extract value components from each line. It does not further validate that for range values the second integer is larger than the first. This logic check however, could be easily added.
Edit:2012-10-24 07:00 Fixed index i to count from zero.
You can use
strinput = '10-19,25,33'
eval(cat(2,'[',strrep(strinput,'-',':'),']'))
It is best to include some input checks. Note also that negative numbers will cause problems with this method.
In the simplest approach, you can use the evil eval for this:
A = eval('[10:19,25,33]')
A =
10 11 12 13 14 15 16 17 18 19 25 33
BUT of course you should think twice before you do that. Especially if this is a user-supplied string! Imagine what would happen if the user supplied any other command...
eval('!rm -rf /')
You would have to make sure that there really is nothing else than what you want. You could do this by regexp.
