Regex: split a complex list (two columns) in Java

I'm reading a two-column file line by line with readLine() and need to split each line, allowing for several different delimiters (see below).
Here are all the delimiter cases I need to handle (see the comments):
+--------+---------+
+ ##some text + //some text which starts with (##) I want to exclude this row
+ 341, 222 + //comma delimited
+ 211 321 + //space delimited
+ 541 1231 + //tab delimited
+ ##some text + //some text which starts with (##) I want to exclude this row
+ 11.3 321.11 + //double values delimited by tab
+ 331.3 33.11 + //double values delimited by space
+ 231.3, 33.1 + //double values delimited by comma
+ ##some text + //some text which starts with (##) I want to exclude this row
+--------+---------+
I want to obtain this table:
+--------+---------+
+ 341 222 +
+ 211 321 +
+ 541 1231 +
+ 11.3 321.11 +
+ 331.3 33.11 +
+ 231.3 33.1 +
+--------+---------+
I will be glad to find a solution to this issue
UPDATE:
For now I have ([,\s\t;])+ (for comma, tab, space, semicolon...), but I can't figure out how to handle the ##some text lines. I tried \##\w+, but it didn't work. Any advice?

You can try this regex; I have tested it and it works:
(\\d+\\.?\\d*),?\\s*?(\\d+\\.?\\d*)
Then replace with $1 and $2.
EDIT:
Try the code below:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class regcheck
{
    // Two numbers (integer or decimal) with an optional comma and optional whitespace between them.
    private static Pattern twopart = Pattern.compile("(\\d+\\.?\\d*),?\\s*?(\\d+\\.?\\d*)");

    public static void checkString(String s)
    {
        Matcher m = twopart.matcher(s);
        if (m.matches()) {
            System.out.println(m.group(1) + " " + m.group(2));
        } else {
            System.out.println(s + " does not match.");
        }
    }

    public static void main(String[] args) {
        System.out.println("Parts of strings are ");
        checkString("##some text");
        checkString("123, 4567");
        checkString("123, 342");
        checkString("45.45 4.3");
        checkString("3.78, 23.78");
    }
}
OUTPUT :
Parts of strings are
##some text does not match.
123 4567
123 342
45.45 4.3
3.78 23.78
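m.group(1) will give you the first part.
m.group(2) will give you the second part.
In your code, call the checkString() method on each line.
Note that because the comma and the whitespace are both optional in this pattern, a single number such as 1234 also matches (as 123 and 4). Requiring at least one separator, for example with [,\\s]+ between the two groups, avoids this.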

Assuming the ASCII table borders aren't part of the input, you could try this:
##[a-z\s]+|([\d\.]+)[,\s\t]+([\d\.]+)
then replace with:
\1 \2 (or $1 $2)
Note that this doesn't allow for commas inside the numbers.
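If a single regex is not a requirement, the same result can be had by handling each line separately. Here is a minimal sketch, assuming the table borders are not part of the file and that every data line holds exactly two fields. The names TwoColumnSplit and parseLine are made up for illustration and do not come from the question.

```java
import java.util.Arrays;
import java.util.List;

class TwoColumnSplit {
    // Returns the two fields of a data line, or null for a "##" comment line,
    // a blank line, or a line that does not have exactly two fields.
    static String[] parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("##")) {
            return null;
        }
        // One or more commas, semicolons or whitespace characters (\s includes tab).
        String[] parts = trimmed.split("[,;\\s]+");
        return parts.length == 2 ? parts : null;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "##some text", "341, 222", "211 321", "541\t1231", "231.3, 33.1");
        for (String line : lines) {
            String[] cols = parseLine(line);
            if (cols != null) {
                System.out.println(cols[0] + " " + cols[1]);
            }
        }
    }
}
```

In real code the loop would call parseLine on each string returned by readLine(). Skipping the comment lines with startsWith keeps the split regex small, at the cost of one extra check per line.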

Related

Regex: separate two numbers by a comma

I'm trying to write a regex that accepts only a number, then ",", then another number, optionally followed by the same pattern separated by ";", for example:
57,1000
57,1000;6393,1000
So far I have this: Pattern.compile("\\b[0-9;,]{1,5}?\\d+;([0-9]{1,5},?)+").matcher("57,1000").find();
It works for 57,1000;6393,1000, but it also allows letters and fails for 57,1000.

Try the regex "(\d+,\d+(;\d+,\d+)?)":
@Test
void regex() {
    Pattern p = Pattern.compile("(\\d+,\\d+)(;\\d+,\\d+)?");
    Assertions.assertTrue(p.matcher("57,1000").matches());
    Assertions.assertTrue(p.matcher("57,1000;6393,1000").matches());
}

How about like this. Just look for two numbers separated by a comma and capture them.
String[] data = {"57,1000",
                 "57,1000;6393,1000"};
Pattern p = Pattern.compile("(\\d+),(\\d+)");
for (String str : data) {
    System.out.println("For String : " + str);
    Matcher m = p.matcher(str);
    // find() walks through every "number,number" pair in the string
    while (m.find()) {
        System.out.println(m.group(1) + " " + m.group(2));
    }
    System.out.println();
}
prints
For String : 57,1000
57 1000
For String : 57,1000;6393,1000
57 1000
6393 1000
If you just want to validate the string, you can do the following. The regex matches one pair, optionally followed by a second pair preceded by a semicolon.
String regex = "(\\d+,\\d+)(;(\\d+,\\d+))?";
for (String str : data) {
    System.out.println("Testing String " + str + " : " + str.matches(regex));
}
prints
Testing String 57,1000 : true
Testing String 57,1000;6393,1000 : true
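Both patterns above accept at most two pairs. The question does not say whether longer lists can occur, so the following is only a sketch under the assumption that any number of ";"-separated pairs should be accepted. PairList and isValid are hypothetical names.

```java
import java.util.regex.Pattern;

class PairList {
    // One "number,number" pair followed by zero or more ";number,number" pairs.
    private static final Pattern PAIRS = Pattern.compile("\\d+,\\d+(?:;\\d+,\\d+)*");

    static boolean isValid(String s) {
        // matches() requires the whole string to fit the pattern, so stray letters are rejected.
        return PAIRS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("57,1000"));                 // true
        System.out.println(isValid("57,1000;6393,1000;1,2"));   // true
        System.out.println(isValid("57,abc"));                  // false
    }
}
```

Using matches() instead of find() is what rules out the letters: find() succeeds as soon as any part of the input fits the pattern.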

Regex: starts with messages and string between parent message curly brace

I want to extract every top-level message, that is, the keyword message and everything between the curly braces of that parent message, including any nested messages. With the pattern below, I am not getting the whole parent body.
String data = "syntax = \"proto3\";\r\n" +
"package grpc;\r\n" +
"\r\n" +
"import \"envoyproxy/protoc-gen-validate/validate/validate.proto\";\r\n" +
"import \"google/api/annotations.proto\";\r\n" +
"import \"google/protobuf/wrappers.proto\";\r\n" +
"import \"protoc-gen-swagger/options/annotations.proto\";\r\n" +
"\r\n" +
"message Acc {\r\n" +
" message AccErr {\r\n" +
" enum Enum {\r\n" +
" UNKNOWN = 0;\r\n" +
" CASH = 1;\r\n" +
" }\r\n" +
" }\r\n" +
" string account_id = 1;\r\n" +
" string name = 3;\r\n" +
" string account_type = 4;\r\n" +
"}\r\n" +
"\r\n" +
"message Name {\r\n" +
" string firstname = 1;\r\n" +
" string lastname = 2;\r\n" +
"}";
List<String> allMessages = new ArrayList<>();
Pattern pattern = Pattern.compile("message[^\\}]*\\}");
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
    String str = matcher.group();
    allMessages.add(str);
    System.out.println(str);
}
I expect the list to contain two strings.
allMessages.get(0) should be:
message Acc {
message AccErr {
enum Enum {
UNKNOWN = 0;
CASH = 1;
}
}
string account_id = 1;
string name = 3;
string account_type = 4;
}
and allMessages.get(1) should be:
message Name {
string firstname = 1;
string lastname = 2;
}

First remove the input prior to "message" appearing at the start of a line, then split on newlines followed by "message" (include the newlines in the split so that the newlines between parent messages are consumed):
String[] messages = data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)");
See live demo.
If you actually need a List<String>, pass that result to Arrays.asList():
List<String> messages = Arrays.asList(data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)"));
The first regex matches everything from the start up to, but not including, the first line that starts with message, which is replaced with a blank (i.e. deleted). Breaking it down:
(?sm) turns on flags s, which makes dot also match newlines, and m, which makes ^ and $ match start and end of each line
\\A means the very start of input
.*? means any quantity of any character (including newlines, because the s flag is set), and the trailing ? makes it reluctant, so it matches as few characters as possible while still matching
(?=^message) is a look ahead and means the following characters are a start of a line then "message"
See regex101 live demo for a thorough explanation.
The split regex matches one or more line break sequences when they are followed by "message":
\\R+ means one or more line break sequences (all OS variants)
(?=message) is a look ahead and means the following characters are "message"
See regex101 live demo for a thorough explanation.

Try this for your regex. It anchors on message being the start of a line, and uses a positive lookahead to find the next message or the end of messages.
Pattern.compile("(?s)\r\n(message.*?)(?=(\r\n)+message|$)")
// or
Pattern.compile("(?s)\r?\n(message.*?)(?=(\r?\n)+message|$)")
No splitting, parsing, or managing nested braces either :)
https://regex101.com/r/Wa2xxx/1
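Both regex answers rely on nested messages being indented, so that only parent messages start a line. If that cannot be guaranteed for the closing side, counting braces is an alternative. This is a rough sketch, not a proto parser: it assumes the braces are balanced, that no brace appears inside a string literal or comment, and that each parent message starts at the beginning of a line. TopLevelMessages and extract are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopLevelMessages {
    // "message" at the start of a line, then a name, then the opening brace.
    private static final Pattern START = Pattern.compile("(?m)^message\\s+\\w+\\s*\\{");

    static List<String> extract(String data) {
        List<String> result = new ArrayList<>();
        Matcher m = START.matcher(data);
        int from = 0;
        while (m.find(from)) {
            int depth = 1;          // the brace matched by START is already open
            int i = m.end();
            while (i < data.length() && depth > 0) {
                char c = data.charAt(i++);
                if (c == '{') {
                    depth++;
                } else if (c == '}') {
                    depth--;
                }
            }
            if (depth != 0) {
                break;              // unbalanced braces: stop rather than guess
            }
            result.add(data.substring(m.start(), i));
            from = i;               // resume the search after the closing brace
        }
        return result;
    }
}
```

With the data string from the question, extract(data) should return two elements, one starting with "message Acc {" and one starting with "message Name {".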

Unwanted elements appearing when splitting a string with multiple separators in Java

I have a string from which I need to remove all of the punctuation characters listed below, as well as spaces. My code looks as follows:
String s = "s[film] fever(normal) curse;";
String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s+]");
System.out.println("spart[0]: " + spart[0]);
System.out.println("spart[1]: " + spart[1]);
System.out.println("spart[2]: " + spart[2]);
System.out.println("spart[3]: " + spart[3]);
System.out.println("spart[4]: " + spart[4]);
But I am getting some blank elements. The output is:
spart[0]: s
spart[1]: film
spart[2]:
spart[3]: fever
spart[4]: normal
My desired output is:
spart[0]: s
spart[1]: film
spart[2]: fever
spart[3]: normal
spart[4]: curse

Try with this:
public static void main(String[] args) {
    String s = "s[film] fever(normal) curse;";
    // The + now follows the closing bracket, so a run of separators counts as one delimiter.
    String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s]+");
    for (String string : spart) {
        System.out.println("'" + string + "'");
    }
}
output:
's'
'film'
'fever'
'normal'
'curse'

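I believe it is because the + sits inside the character class, where it is a literal plus sign rather than a quantifier. Each separator is therefore matched on its own, and two adjacent separators (such as "] ") leave an empty string between them. Move the + outside the closing bracket so that a run of separators is treated as one delimiter.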

String[] spart = s.replaceAll("\\W", " ").split(" +");
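The difference between a + inside and outside the brackets can be seen on a tiny input. The sketch below uses a deliberately short character class, and the names SplitQuantifier, splitEach and splitRuns are invented for this illustration.

```java
import java.util.Arrays;

class SplitQuantifier {
    // '+' inside the brackets is a literal plus sign: every separator splits on its own.
    static String[] splitEach(String s) {
        return s.split("[,\\s+]");
    }

    // '+' after the brackets is a quantifier: a run of separators counts as one delimiter.
    static String[] splitRuns(String s) {
        return s.split("[,\\s]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitEach("a, b"))); // [a, , b]
        System.out.println(Arrays.toString(splitRuns("a, b"))); // [a, b]
    }
}
```

The empty element in the first result sits between the comma and the space, which is the same effect as the blank spart[2] in the question.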

Split string after n amount of digits occurrence

I'm parsing some folder names here. I have a program that lists subfolders of a folder and parses folder names.
For example, one folder could be named something like this:
"Folder.Name.1234.Some.Info.Here-ToBeParsed"
and I would like to parse it so that the name becomes "Folder Name". At the moment I first use String.replaceAll() to get rid of special characters, which leaves the 4-digit sequence in the string. I would like to split the string at that point. How can I achieve this?
Currently my code looks something like this:
// Parse the string if regex p matches the folder's name
if (b) {
    //System.out.println("Folder: \" " + name + "\" contains special characters.");
    String result = name.replaceAll("[\\p{P}\\p{S}]", " "); // Get rid of all punctuation and symbols.
    //System.out.println("Parsed: " + name + " > " + result);
    // If the string matches regex p2
    if (b2) {
        //System.out.println("Folder: \" " + result + "\" contains release year.");
        String[] parsed_name = result.split("20"); // This is where I would like to split when 4 digits in a row occur.
        //System.out.println("Parsed: " + result + " > " + parsed_name[0]);
        movieNames.add(parsed_name[0]);
    }
}
Or maybe there is an even easier way to do this? Thanks in advance!

You should keep it simple like this:
String name = "Folder.Name.1234.Some.Info.Here-ToBeParsed";
String repl = name.replaceFirst("\\.\\d{4}.*", "")
                  .replaceAll("[\\p{P}\\p{S}&&[^']]+", " ");
//=> Folder Name
replaceFirst removes the first dot that is followed by 4 digits, together with everything after it
replaceAll replaces each run of punctuation and symbol characters (except the apostrophe) with a single space
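If the 4-digit sequence itself is needed later (the question's comments suggest it is a release year), a capturing pattern is one option. This is only a sketch: it assumes the digits are delimited by dots as in the example folder name, and FolderName and title are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FolderName {
    // Group 1: everything before the first ".dddd" that is followed by a dot or the end.
    // Group 2: the four digits.
    private static final Pattern P = Pattern.compile("^(.*?)\\.(\\d{4})(?:\\..*)?$");

    static String title(String folder) {
        Matcher m = P.matcher(folder);
        // Fall back to the whole name when there is no 4-digit part.
        String raw = m.matches() ? m.group(1) : folder;
        return raw.replaceAll("[\\p{P}\\p{S}]+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(title("Folder.Name.1234.Some.Info.Here-ToBeParsed")); // Folder Name
    }
}
```

After a successful matches() call, m.group(2) holds the four digits, so the same matcher could supply the year without a second pass.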

Advanced parsing of numeric ranges from string

I'm using Java to parse strings input by the user, representing either single numeric values or ranges. The user can input the following string:
10-19
meaning the whole numbers from 10 to 19: 10, 11, 12, ..., 19.
The user can also specify a list of numbers:
10,15,19
Or a combination of the above:
10-19,25,33
Is there a convenient method, perhaps based on regular expressions, to perform this parsing? Or must I split the string using String.split(), then manually iterate the special signs (',' and '-' in this case)?

This is how I would go about it:
Split the input using , as the delimiter.
If a token matches the regular expression ^(\\d+)-(\\d+)$, then I know I have a range. I would then extract the two numbers and create my range (it might be a good idea to make sure that the first number is lower than the second, because you never know...). You then act accordingly.
If a token matches the regular expression ^\\d+$, I know I have only one number, so I have a specific page. I would then act accordingly.
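The steps above can be sketched roughly as follows. This is one possible reading rather than a definitive implementation: it assumes descending ranges and malformed tokens should be rejected with an exception, and RangeParser and parse are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeParser {
    private static final Pattern RANGE = Pattern.compile("^(\\d+)-(\\d+)$");
    private static final Pattern SINGLE = Pattern.compile("^\\d+$");

    // Expands input such as "10-19,25,33" into the individual integers.
    static List<Integer> parse(String input) {
        List<Integer> values = new ArrayList<>();
        // Limit -1 keeps trailing empty tokens, so "1,2," is rejected below.
        for (String token : input.split(",", -1)) {
            String t = token.trim();
            Matcher range = RANGE.matcher(t);
            if (range.matches()) {
                int lo = Integer.parseInt(range.group(1));
                int hi = Integer.parseInt(range.group(2));
                if (lo > hi) {
                    throw new IllegalArgumentException("Descending range: " + t);
                }
                for (int n = lo; n <= hi; n++) {
                    values.add(n);
                }
            } else if (SINGLE.matcher(t).matches()) {
                values.add(Integer.parseInt(t));
            } else {
                throw new IllegalArgumentException("Not a number or range: " + t);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("10-12,25,33")); // [10, 11, 12, 25, 33]
    }
}
```

Throwing on a descending range is a design choice; swapping the two bounds would be just as reasonable if the input is meant to be forgiving.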

This tested (and fully commented) regex solution meets the OP requirements:
Java regex solution
// TEST.java 20121024_0700
import java.util.regex.*;
public class TEST {
public static Boolean isValidIntRangeInput(String text) {
Pattern re_valid = Pattern.compile(
"# Validate comma separated integers/integer ranges.\n" +
"^ # Anchor to start of string. \n" +
"[0-9]+ # Integer of 1st value (required). \n" +
"(?: # Range for 1st value (optional). \n" +
" - # Dash separates range integer. \n" +
" [0-9]+ # Range integer of 1st value. \n" +
")? # Range for 1st value (optional). \n" +
"(?: # Zero or more additional values. \n" +
" , # Comma separates additional values. \n" +
" [0-9]+ # Integer of extra value (required). \n" +
" (?: # Range for extra value (optional). \n" +
" - # Dash separates range integer. \n" +
" [0-9]+ # Range integer of extra value. \n" +
" )? # Range for extra value (optional). \n" +
")* # Zero or more additional values. \n" +
"$ # Anchor to end of string. ",
Pattern.COMMENTS);
Matcher m = re_valid.matcher(text);
return m.matches();
}
public static void printIntRanges(String text) {
Pattern re_next_val = Pattern.compile(
"# extract next integers/integer range value. \n" +
"([0-9]+) # $1: 1st integer (Base). \n" +
"(?: # Range for value (optional). \n" +
" - # Dash separates range integer. \n" +
" ([0-9]+) # $2: 2nd integer (Range) \n" +
")? # Range for value (optional). \n" +
"(?:,|$) # End on comma or string end.",
Pattern.COMMENTS);
Matcher m = re_next_val.matcher(text);
String msg;
int i = 0;
while (m.find()) {
msg = " value["+ ++i +"] ibase="+ m.group(1);
if (m.group(2) != null) {
msg += " range="+ m.group(2);
}
System.out.println(msg);
}
}
public static void main(String[] args) {
String[] arr = new String[]
{ // Valid inputs:
"1",
"1,2,3",
"1-9",
"1-9,10-19,20-199",
"1-8,9,10-18,19,20-199",
// Invalid inputs:
"A",
"1,2,",
"1 - 9",
" ",
""
};
// Loop through all test input strings:
int i = 0;
for (String s : arr) {
String msg = "String["+ ++i +"] = \""+ s +"\" is ";
if (isValidIntRangeInput(s)) {
// Valid input line
System.out.println(msg +"valid input. Parsing...");
printIntRanges(s);
} else {
// Match attempt failed
System.out.println(msg +"NOT valid input.");
}
}
}
}
Output:
String[1] = "1" is valid input. Parsing...
  value[1] ibase=1
String[2] = "1,2,3" is valid input. Parsing...
  value[1] ibase=1
  value[2] ibase=2
  value[3] ibase=3
String[3] = "1-9" is valid input. Parsing...
  value[1] ibase=1 range=9
String[4] = "1-9,10-19,20-199" is valid input. Parsing...
  value[1] ibase=1 range=9
  value[2] ibase=10 range=19
  value[3] ibase=20 range=199
String[5] = "1-8,9,10-18,19,20-199" is valid input. Parsing...
  value[1] ibase=1 range=8
  value[2] ibase=9
  value[3] ibase=10 range=18
  value[4] ibase=19
  value[5] ibase=20 range=199
String[6] = "A" is NOT valid input.
String[7] = "1,2," is NOT valid input.
String[8] = "1 - 9" is NOT valid input.
String[9] = " " is NOT valid input.
String[10] = "" is NOT valid input.
Note that this solution simply demonstrates how to validate an input line and how to parse/extract value components from each line. It does not further validate that for range values the second integer is larger than the first. This logic check however, could be easily added.
Edit:2012-10-24 07:00 Fixed index i to count from zero.

In MATLAB, you can use
strinput = '10-19,25,33'
eval(cat(2,'[',strrep(strinput,'-',':'),']'))
It is best to include some input checks; negative numbers will also cause problems with this method.

In the simplest approach you can use the evil eval for this
A = eval('[10:19,25,33]')
A =
10 11 12 13 14 15 16 17 18 19 25 33
BUT of course you should think twice before you do that. Especially if this is a user-supplied string! Imagine what would happen if the user supplied any other command...
eval('!rm -rf /')
You would have to make sure that there really is nothing else than what you want. You could do this by regexp.
