Regex: starts with messages and string between parent message curly brace - java

I want to extract every top-level message block, that is, the message keyword and everything between the curly braces of that parent message, including any nested messages. With the pattern below, I am not getting the whole parent body.
String data = "syntax = \"proto3\";\r\n" +
"package grpc;\r\n" +
"\r\n" +
"import \"envoyproxy/protoc-gen-validate/validate/validate.proto\";\r\n" +
"import \"google/api/annotations.proto\";\r\n" +
"import \"google/protobuf/wrappers.proto\";\r\n" +
"import \"protoc-gen-swagger/options/annotations.proto\";\r\n" +
"\r\n" +
"message Acc {\r\n" +
" message AccErr {\r\n" +
" enum Enum {\r\n" +
" UNKNOWN = 0;\r\n" +
" CASH = 1;\r\n" +
" }\r\n" +
" }\r\n" +
" string account_id = 1;\r\n" +
" string name = 3;\r\n" +
" string account_type = 4;\r\n" +
"}\r\n" +
"\r\n" +
"message Name {\r\n" +
" string firstname = 1;\r\n" +
" string lastname = 2;\r\n" +
"}";
List<String> allMessages = new ArrayList<>();
Pattern pattern = Pattern.compile("message[^\\}]*\\}");
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
String str = matcher.group();
allMessages.add(str);
System.out.println(str);
}
}
I expect my list of strings to contain two elements, as shown below.
allMessages.get(0) should be:
message Acc {
message AccErr {
enum Enum {
UNKNOWN = 0;
CASH = 1;
}
}
string account_id = 1;
string name = 3;
string account_type = 4;
}
and allMessages.get(1) should be:
message Name {
string firstname = 1;
string lastname = 2;
}

First remove the input that precedes the first line starting with "message", then split on line breaks followed by "message" (the line breaks are part of the delimiter, so the blank lines between parent messages are consumed):
String[] messages = data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)");
See live demo.
If you actually need a List<String>, pass that result to Arrays.asList():
List<String> messages = Arrays.asList(data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)"));
The first regex matches everything from the start up to, but not including, the first line that starts with message, and that match is replaced with a blank (i.e. deleted). Breaking it down:
(?sm) turns on flags s, which makes dot also match newlines, and m, which makes ^ and $ match start and end of each line
\\A means the very start of input
.*? means any quantity of any character (including newlines, because the s flag is set); the trailing ? makes the quantifier reluctant, so it matches as few characters as possible while still matching
(?=^message) is a look ahead and means the following characters are a start of a line then "message"
See regex101 live demo for a thorough explanation.
The split regex matches one or more line break sequences when they are followed by "message":
\\R+ means one or more line break sequences (all OS variants)
(?=message) is a look ahead and means the following characters are "message"
See regex101 live demo for a thorough explanation.
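As a rough end-to-end check, the two steps above can be wrapped in a small helper. This is only a sketch: the names ProtoMessageSplitter and topLevelMessages are made up for illustration, and it assumes that nested messages are indented while top-level messages start in the first column, as in the question's data.

```java
import java.util.Arrays;
import java.util.List;

class ProtoMessageSplitter {

    // Hypothetical helper: returns each top-level "message ... { ... }" block as one string.
    static List<String> topLevelMessages(String data) {
        // Step 1: delete everything before the first line that starts with "message".
        String body = data.replaceAll("(?sm)\\A.*?(?=^message)", "");
        // Step 2: split on line breaks that are directly followed by "message".
        // Indented (nested) messages are not directly preceded by a line break, so they stay put.
        return Arrays.asList(body.split("\\R+(?=message)"));
    }

    public static void main(String[] args) {
        String data = "syntax = \"proto3\";\r\n"
                + "\r\n"
                + "message Acc {\r\n"
                + "  message AccErr {\r\n"
                + "  }\r\n"
                + "  string account_id = 1;\r\n"
                + "}\r\n"
                + "\r\n"
                + "message Name {\r\n"
                + "  string firstname = 1;\r\n"
                + "}";
        List<String> messages = topLevelMessages(data);
        System.out.println(messages.size() + " top-level messages");
        messages.forEach(System.out::println);
    }
}
```

Note that \R needs Java 8 or later, and that the approach relies on indentation rather than on brace matching, so it would probably misbehave on input where nested messages are not indented.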

Try this for your regex. It anchors on message being the start of a line, and uses a positive lookahead to find the next message or the end of messages.
Pattern.compile("(?s)\r\n(message.*?)(?=(\r\n)+message|$)")
// or
Pattern.compile("(?s)\r?\n(message.*?)(?=(\r?\n)+message|$)")
The message text is in capture group 1. No splitting, parsing, or managing nested braces either :)
https://regex101.com/r/Wa2xxx/1
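A minimal sketch of how this pattern might be used in a find loop, reading the message from capture group 1. The class and method names are hypothetical. Because the pattern requires a line break before message, it assumes the first top-level message is not on the very first line of the input, and that nested messages are indented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProtoMessageFinder {

    // Same pattern as in the answer; the \r and \n here are literal characters in the regex.
    private static final Pattern TOP_LEVEL_MESSAGE =
            Pattern.compile("(?s)\r?\n(message.*?)(?=(\r?\n)+message|$)");

    // Hypothetical helper: collects capture group 1 of every match.
    static List<String> find(String data) {
        List<String> result = new ArrayList<>();
        Matcher m = TOP_LEVEL_MESSAGE.matcher(data);
        while (m.find()) {
            result.add(m.group(1)); // group 0 would also include the leading line break
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "syntax = \"proto3\";\r\n"
                + "\r\n"
                + "message Acc {\r\n"
                + "  message AccErr {\r\n"
                + "  }\r\n"
                + "  string account_id = 1;\r\n"
                + "}\r\n"
                + "\r\n"
                + "message Name {\r\n"
                + "  string firstname = 1;\r\n"
                + "}";
        find(data).forEach(System.out::println);
    }
}
```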


How do I stop regex after finding "Message: "?

I'm splitting the body of a JSON message with the regex ":|\n" and storing the values into an array. I would like to get assistance with stopping my regex expression from splitting the message once it finds "Message: ".
In the JSON body, each section is separated by a new line, so the body looks similar to this:
{"body": "Name: Alfred Alonso\nCompany: null\nEmail: 123#abc.com\nPhone Number: 123-456-9999\nProject Type: Existing\nContact by: Email\nTime Frame: within 1 month\nMessage: Hello,\nThis is my message.\nThank You,\nJohn Doe"}
The code below works perfectly when the user doesn't create a new line within the message, so the entire message gets stored as one array value.
Thank you to anyone that can help me fix this!
String[] messArr = body.split(":|\n");
for (int i = 0; i < messArr.length; i++)
messArr[i] = messArr[i].trim();
if ("xxx".equals(eventSourceARN)) {
name = messArr[1];
String[] temp;
String delimiter = " ";
temp = name.split(delimiter);
name = temp[0];
String lastName = temp[1];
company = messArr[3];
email = messArr[5];
phoneNumber = messArr[7];
projectType = messArr[9];
contactBy = messArr[11];
timeFrame = messArr[13];
message = messArr[15];
I would like
messArr[14] = "Message"
messArr[15] = "Hello, This is my message. Thank you, John Doe"
This is what I get
[..., Message, Hello,, This is my message., Thank You, John Doe].
messArr[14] = "Message"
messArr[15] = "Hello,"
messArr[16] = "This is my message."
messArr[17] = "Thank You,"
messArr[18] = "John Doe"
Instead of using split, you can use a find loop, e.g.
Pattern p = Pattern.compile("([^:\\v]+): |((?<=Message: )(?s:.*)|(?!$).*)\\R?");
List<String> result = new ArrayList<>();
for (Matcher m = p.matcher(input); m.find(); )
result.add(m.start(1) != -1 ? m.group(1) : m.group(2));
Test
String input = "Name: Alfred Alonso\n" +
"Company: null\n" +
"Email: 123#abc.com\n" +
"Phone Number: 123-456-9999\n" +
"Project Type: Existing\n" +
"Contact by: Email\n" +
"Time Frame: within 1 month\n" +
"Message: Hello,\n" +
"This is my message.\n" +
"Thank You,\n" +
"John Doe";
Pattern p = Pattern.compile("([^:\\v]+): |((?<=Message: )(?s:.*)|(?!$).*)\\R?");
List<String> result = new ArrayList<>();
for (Matcher m = p.matcher(input); m.find(); )
result.add(m.start(1) != -1 ? m.group(1) : m.group(2));
for (int i = 0; i < result.size(); i++)
System.out.println("result[" + i + "]: " + result.get(i));
Output
result[0]: Name
result[1]: Alfred Alonso
result[2]: Company
result[3]: null
result[4]: Email
result[5]: 123#abc.com
result[6]: Phone Number
result[7]: 123-456-9999
result[8]: Project Type
result[9]: Existing
result[10]: Contact by
result[11]: Email
result[12]: Time Frame
result[13]: within 1 month
result[14]: Message
result[15]: Hello,
This is my message.
Thank You,
John Doe
Explanation
Match one of:
  (  Start capture #1
    [^:\v]+  Match one or more characters that are not a : or a linebreak
  )  End capture #1
  ": "  Match, but don't capture, a : followed by a space
| or:
  (  Start capture #2
    Match one of:
      (?<=Message: )(?s:.*)  Rest of input, i.e. all text including linebreaks, if the text is immediately preceded by "Message: "
    | or:
      (?!$)  Don't match if we're already at end-of-input
      .*  Match 0 or more characters up to end-of-line, excluding the EOL
  )  End capture #2
  \R?  Match, but don't capture, an optional linebreak. This doesn't apply to Message text, and is optional in case there is no Message text and no linebreak after the last value
If you want to, you could do exactly what you are doing and then put things together later. As you are trimming, notice where it says Message, then know that the Message is in the next slot and beyond. Then put it back together.
int messagePosition = -1;
for (int i = 0; i < messArr.length; i++){
messArr[i] = messArr[i].trim();
if (i>0 && messArr[i-1].equals("Message")){
messagePosition =i;
}
}
if (messagePosition > -1){
for (int i=messagePosition+1; i <messArr.length; i++){
messArr[messagePosition]=messArr[messagePosition]+" "+messArr[i];
}
}
One downside is that because arrays are fixed size, you need to act as if there is nothing beyond the messagePosition. So any calculations with length will be misleading. If for some reason you are worried you will look in the slots beyond, you could add messArr[i]=""; to the second for loop after the concatenation step.
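One possible way around the fixed-size downside is to copy the array down to its useful length once the message has been joined. The following is a sketch of that idea only; MessageRejoiner and splitKeepingMessage are hypothetical names, and it shares a limitation with the split itself: any colon inside the message text is consumed by the delimiter and not restored.

```java
import java.util.Arrays;

class MessageRejoiner {

    // Hypothetical helper: splits like the question does, then merges everything
    // after the "Message" label back into a single slot and truncates the array.
    static String[] splitKeepingMessage(String body) {
        String[] parts = body.split(":|\n");
        int messagePosition = -1;
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
            // Remember only the first slot that follows a "Message" label.
            if (messagePosition == -1 && i > 0 && parts[i - 1].equals("Message")) {
                messagePosition = i;
            }
        }
        if (messagePosition == -1) {
            return parts; // no message section found, nothing to merge
        }
        StringBuilder message = new StringBuilder(parts[messagePosition]);
        for (int i = messagePosition + 1; i < parts.length; i++) {
            message.append(' ').append(parts[i]);
        }
        // Drop the slots that were merged, so length is no longer misleading.
        String[] result = Arrays.copyOf(parts, messagePosition + 1);
        result[messagePosition] = message.toString();
        return result;
    }

    public static void main(String[] args) {
        String body = "Name: Alfred Alonso\nCompany: null\n"
                + "Message: Hello,\nThis is my message.\nThank You,\nJohn Doe";
        System.out.println(Arrays.toString(splitKeepingMessage(body)));
    }
}
```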

Java string split gives array index out of bounds error

I came across this unusual error today. Can anyone explain to me what I am doing wrong? Below is the code:
AreStringsPermuted checkStringPerObj = new AreStringsPermuted();
String[] inputStrings = {"siddu$isdud", "siddu$siddarth", "siddu$sidde"};
for(String inputString : inputStrings){
String[] stringArray = inputString.split("$");
if(checkStringPerObj.areStringsPermuted(stringArray[0],stringArray[1]))
System.out.println("Strings : " + stringArray[0] + " ," + stringArray[1] + " are permuted");
else
System.out.println("Strings : " + stringArray[0] + " ," + stringArray[1] + " are not permuted");
}
The above code fails when I try to split the string. For some reason split does not work when I try to divide each string using "$". Can anyone explain what I am doing wrong here?
Below is the error message:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at arraysAndStrings.TestClass.checkStringsPermuted(TestClass.java:24)
at arraysAndStrings.TestClass.main(TestClass.java:43)
String.split() takes a regular expression, so you need to quote strings that contain characters that have special meanings in regular expressions.
String regularExpression = Pattern.quote("$");
for (String inputString : inputStrings) {
String[] stringArray = inputString.split(regularExpression);
String.split() takes a regex pattern, and $ has a special meaning in regex (it matches the end of a line).
In your case, use "\\$" instead of "$".
String []arrayString = inputString.split("\\$");
For more information,
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
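To make the difference concrete, here is a small sketch comparing the unescaped split with the two fixes suggested above. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class DollarSplitDemo {

    // "$" is an end-of-input anchor here, so nothing is actually split off.
    static String[] splitUnescaped(String s) {
        return s.split("$");
    }

    // Pattern.quote turns the string into a literal pattern.
    static String[] splitQuoted(String s) {
        return s.split(Pattern.quote("$"));
    }

    // Escaping the dollar sign by hand has the same effect.
    static String[] splitEscaped(String s) {
        return s.split("\\$");
    }

    public static void main(String[] args) {
        String input = "siddu$isdud";
        System.out.println(splitUnescaped(input).length); // one element, so index 1 is out of bounds
        System.out.println(splitQuoted(input).length);
        System.out.println(splitEscaped(input).length);
    }
}
```

With the unescaped pattern the array holds the whole input as its only element, which is why stringArray[1] throws ArrayIndexOutOfBoundsException in the question.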

Split string after n amount of digits occurrence

I'm parsing some folder names here. I have a program that lists subfolders of a folder and parses folder names.
For example, one folder could be named something like this:
"Folder.Name.1234.Some.Info.Here-ToBeParsed"
and I would like to parse it so name would be "Folder Name". At the moment I'm first using string.replaceAll() to get rid of special characters and then there is this 4-digit sequence. I would like to split string on that point. How can I achieve this?
Currently my code looks something like this:
// Parsing string if regex p matches folder's name
if(b) {
//System.out.println("Folder: \" " + name + "\" contains special characters.");
String result = name.replaceAll("[\\p{P}\\p{S}]", " "); // Getting rid of all punctuations and symbols.
//System.out.println("Parsed: " + name + " > " + result);
// If string matches regex p2
if(b2) {
//System.out.println("Folder: \" " + result + "\" contains release year.");
String parsed_name[] = result.split("20"); // This is the line i would like to split when 4-digits in row occur.
//System.out.println("Parsed: " + result + " > " + parsed_name[0]);
movieNames.add(parsed_name[0]);
}
Or maybe there is even easier way to do this? Thanks in advance!
You should keep it simple like this:
String name = "Folder.Name.1234.Some.Info.Here-ToBeParsed";
String repl = name.replaceFirst( "\\.\\d{4}.*", "" ).
replaceAll( "[\\p{P}\\p{S}&&[^']]+", " " );
//=> Folder Name
replaceFirst removes everything from the first DOT that is followed by 4 digits to the end of the string
replaceAll replaces each run of punctuation and symbol characters (except apostrophe) with a single space
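The two calls could be wrapped in a helper like the one below so they can be tried on several folder names. This is a sketch; FolderNameParser, movieTitle and the second sample folder name are illustrative assumptions, not part of the original program.

```java
class FolderNameParser {

    // Hypothetical helper: cuts the name at ".<4 digits>" and turns the
    // remaining punctuation and symbols (except apostrophes) into spaces.
    static String movieTitle(String folderName) {
        return folderName
                .replaceFirst("\\.\\d{4}.*", "")
                .replaceAll("[\\p{P}\\p{S}&&[^']]+", " ");
    }

    public static void main(String[] args) {
        System.out.println(movieTitle("Folder.Name.1234.Some.Info.Here-ToBeParsed"));
        // Made-up name, used only to show that the apostrophe is kept.
        System.out.println(movieTitle("Ocean's.Eleven.2001.Some.Info"));
    }
}
```

If a folder name contains no dot followed by four digits, replaceFirst leaves it unchanged and only the punctuation replacement applies.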

Regular expression for comma separated string

I am trying to create a pattern in Java that matches the following string;
String message ="%%140911,A,140929100526,S0117.6262E03647.8107,000,067,F100,4F000100,108";
The pattern I have formed is not matching the string. What am I missing? This is the pattern I have tried so far:
private static final Pattern pattern = Pattern.compile(
"(\\%\\%)"+"(\\d)," + // Id
"([AL])," + // Validity a for valid and l for invalid
"(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})," + // Date (YYMMDD)Time (HHMMSS)
"([NS])" + "(\\d{2})(\\d{2}\\.\\d+)" + "([EW])" + "(\\d{3})(\\d{2}\\.\\d+)," + //loc
"(\\d+)," + // Speed
"(\\d+)," + // Direction
"([FC])" + "(\\d{3})," + // temperature in Fahrenheit/celsius
"(\\w{8})," + // status
"(\\d+)"); // event
You're missing a + quantifier in the first line: \\d matches exactly one digit, but the Id in your message (140911) has six. Try changing
"(\\%\\%)"+"(\\d),"
to
"(\\%\\%)"+"(\\d+),"

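A quick way to confirm the fix suggested in the answer above is to run the corrected pattern against the sample string from the question. The sketch below does that; the class and constant names are made up, and the group numbers simply follow the order of the parentheses in the question's pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaPatternCheck {

    // The question's pattern with the single change from (\\d) to (\\d+).
    static final Pattern FIXED = Pattern.compile(
            "(\\%\\%)" + "(\\d+),"                                      // Id
            + "([AL]),"                                                 // Validity
            + "(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2}),"       // Date and time
            + "([NS])(\\d{2})(\\d{2}\\.\\d+)([EW])(\\d{3})(\\d{2}\\.\\d+)," // Location
            + "(\\d+),"                                                 // Speed
            + "(\\d+),"                                                 // Direction
            + "([FC])(\\d{3}),"                                         // Temperature
            + "(\\w{8}),"                                               // Status
            + "(\\d+)");                                                // Event

    public static void main(String[] args) {
        String message =
                "%%140911,A,140929100526,S0117.6262E03647.8107,000,067,F100,4F000100,108";
        Matcher m = FIXED.matcher(message);
        if (m.matches()) {
            System.out.println("Id: " + m.group(2));
            System.out.println("Event: " + m.group(21));
        } else {
            System.out.println("no match");
        }
    }
}
```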
How to replace multiple spaces and newlines with one blank line

How to remove multiple spaces and newlines in a string, but preserve at least one blank line for each group of blank lines.
For example, change:
"This is
a string.
Something."
to
"This is
a string.
Something."
I'm using .trim() to strip whitespace from the beginning and end of a string, but I couldn't find anything for removing multiple spaces and newlines in a string.
I would like to keep just one whitespace and one newline.
The one-line solution to remove multiple spaces/newlines, but preserve at least one blank line from multiple blank lines:
str = str.replaceAll("(?m)(^ *| +(?= |$))", "").replaceAll("(?m)^$([\r\n]+?)(^$[\r\n]+?^)+", "$1");
Each individual line is trimmed too.
Here's some test code:
String str = " This is\r\n " +
"\r\n" +
" \r\n " +
" \r \n \n " +
"\r\n" +
" a string. ";
str = str.trim().replaceAll("(?m)(^ *| +(?= |$))", "").replaceAll("(?m)^$([\r\n]+?)(^$[\r\n]+?^)+", "$1");
System.out.println(str);
Output:
This is
a string.
The previous advice will trim all whitespace, including the linefeeds and replace them with a single space.
text = text.replaceAll("\\n\\s*\\n", "\n").replaceAll("[ \\t\\x0B\\f]+", " ").trim();
First it replaces any instances of linefeeds with only whitespace between them with a single linefeed, then it trims down any other whitespace to a single space ignoring linefeeds.
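The two-step approach can be sketched as a small method. One detail worth care is the replacement string: in replaceAll a backslash in the replacement escapes the next character, so the sketch passes a real linefeed through Matcher.quoteReplacement. The names WhitespaceCollapser and collapse, and the parameter that lets you choose "\n" or "\n\n" (to keep one blank line), are assumptions added for illustration.

```java
import java.util.regex.Matcher;

class WhitespaceCollapser {

    // Hypothetical helper. lineBreakReplacement is what each run of
    // "linefeed, optional whitespace, linefeed" is replaced with.
    static String collapse(String text, String lineBreakReplacement) {
        return text
                // Linefeeds separated only by whitespace collapse into the replacement.
                .replaceAll("\\n\\s*\\n", Matcher.quoteReplacement(lineBreakReplacement))
                // Runs of spaces, tabs, vertical tabs and form feeds become one space.
                .replaceAll("[ \\t\\x0B\\f]+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String text = "This  is\n \n\n  a   string. ";
        System.out.println("[" + collapse(text, "\n") + "]");
        System.out.println("[" + collapse(text, "\n\n") + "]");
    }
}
```

Indentation at the start of a line is reduced to a single space rather than removed, so lines after the first may still begin with one space.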
Here is what I came up with after a bit of testing...
public String keepOneWS(String str) {
Pattern p = Pattern.compile("(\\s+)");
Matcher m = p.matcher(str);
Pattern pBlank = Pattern.compile("[ \t]+");
String newLineReplacement = System.getProperty("line.separator") +
System.getProperty("line.separator");
StringBuffer sb = new StringBuffer();
while (m.find()) {
if(pBlank.matcher(m.group(1)).matches()) {
m.appendReplacement(sb, " ");
} else {
m.appendReplacement(sb, newLineReplacement);
}
}
m.appendTail(sb);
return sb.toString().trim();
}
public void testKeepOneWS() {
String str = " This \t is\r\n " +
"\r\n" +
" \r\n " +
" \r \n \t \n " +
"\r\n" +
" a \t string. \t ";
String expected = "This is" + System.getProperty("line.separator")+
System.getProperty("line.separator") + "a string.";
String actual = keepOneWS(str);
System.out.println("'" + actual + "'");
assertEquals(expected, actual);
}
After a group of whitespace is captured, it is checked whether it consists only of spaces and tabs. If so, the group is replaced by a single space; otherwise the group contains line terminators, and it is replaced by two line separators (that is, one blank line).
The output is:
'This is
a string.'
