Advanced parsing of numeric ranges from string - java

I'm using Java to parse strings input by the user, representing either single numeric values or ranges. The user can input the following string:
10-19
The intention is to use all whole numbers from 10 to 19: 10, 11, 12, ..., 19
The user can also specify a list of numbers:
10,15,19
Or a combination of the above:
10-19,25,33
Is there a convenient method, perhaps based on regular expressions, to perform this parsing? Or must I split the string using String.split(), then manually handle the special characters (',' and '-' in this case)?

This is how I would go about it:
Split the input using , as the delimiter.
If a token matches this regular expression: ^(\\d+)-(\\d+)$, then I know I have a range. I would then extract the two numbers and create the range (it is a good idea to check that the first number is not greater than the second, because you never know...).
If a token matches this regular expression: ^\\d+$, I know I have a single number, so I would add that one value.
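The steps above can be sketched roughly as follows. This is only an illustration: the class name RangeParser, the method parse, and the choice to throw IllegalArgumentException on bad tokens or descending ranges are assumptions, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: split on commas, then classify each token by regex.
final class RangeParser {
    private static final Pattern RANGE = Pattern.compile("^(\\d+)-(\\d+)$");
    private static final Pattern SINGLE = Pattern.compile("^\\d+$");

    static List<Integer> parse(String input) {
        List<Integer> values = new ArrayList<>();
        // Limit -1 keeps trailing empty tokens, so "1,2," is rejected below.
        for (String token : input.split(",", -1)) {
            Matcher range = RANGE.matcher(token);
            if (range.matches()) {
                int lo = Integer.parseInt(range.group(1));
                int hi = Integer.parseInt(range.group(2));
                if (lo > hi) {
                    // Assumption: a descending range is treated as an error.
                    throw new IllegalArgumentException("Descending range: " + token);
                }
                for (int v = lo; v <= hi; v++) {
                    values.add(v);
                }
            } else if (SINGLE.matcher(token).matches()) {
                values.add(Integer.parseInt(token));
            } else {
                throw new IllegalArgumentException("Not a number or range: \"" + token + "\"");
            }
        }
        return values;
    }
}
```

With this sketch, RangeParser.parse("10-19,25,33") yields the twelve values 10 through 19 followed by 25 and 33, while an input such as "1,2," is rejected because of the empty trailing token.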

This tested (and fully commented) regex solution meets the OP requirements:
Java regex solution
// TEST.java 20121024_0700
import java.util.regex.*;
public class TEST {
public static boolean isValidIntRangeInput(String text) {
Pattern re_valid = Pattern.compile(
"# Validate comma separated integers/integer ranges.\n" +
"^ # Anchor to start of string. \n" +
"[0-9]+ # Integer of 1st value (required). \n" +
"(?: # Range for 1st value (optional). \n" +
" - # Dash separates range integer. \n" +
" [0-9]+ # Range integer of 1st value. \n" +
")? # Range for 1st value (optional). \n" +
"(?: # Zero or more additional values. \n" +
" , # Comma separates additional values. \n" +
" [0-9]+ # Integer of extra value (required). \n" +
" (?: # Range for extra value (optional). \n" +
" - # Dash separates range integer. \n" +
" [0-9]+ # Range integer of extra value. \n" +
" )? # Range for extra value (optional). \n" +
")* # Zero or more additional values. \n" +
"$ # Anchor to end of string. ",
Pattern.COMMENTS);
return re_valid.matcher(text).matches();
}
public static void printIntRanges(String text) {
Pattern re_next_val = Pattern.compile(
"# extract next integers/integer range value. \n" +
"([0-9]+) # $1: 1st integer (Base). \n" +
"(?: # Range for value (optional). \n" +
" - # Dash separates range integer. \n" +
" ([0-9]+) # $2: 2nd integer (Range) \n" +
")? # Range for value (optional). \n" +
"(?:,|$) # End on comma or string end.",
Pattern.COMMENTS);
Matcher m = re_next_val.matcher(text);
String msg;
int i = 0;
while (m.find()) {
msg = " value["+ ++i +"] ibase="+ m.group(1);
if (m.group(2) != null) {
msg += " range="+ m.group(2);
}
System.out.println(msg);
}
}
public static void main(String[] args) {
String[] arr = new String[]
{ // Valid inputs:
"1",
"1,2,3",
"1-9",
"1-9,10-19,20-199",
"1-8,9,10-18,19,20-199",
// Invalid inputs:
"A",
"1,2,",
"1 - 9",
" ",
""
};
// Loop through all test input strings:
int i = 0;
for (String s : arr) {
String msg = "String["+ ++i +"] = \""+ s +"\" is ";
if (isValidIntRangeInput(s)) {
// Valid input line
System.out.println(msg +"valid input. Parsing...");
printIntRanges(s);
} else {
// Match attempt failed
System.out.println(msg +"NOT valid input.");
}
}
}
}
Output:
String[1] = "1" is valid input. Parsing...
value[1] ibase=1
String[2] = "1,2,3" is valid input. Parsing...
value[1] ibase=1
value[2] ibase=2
value[3] ibase=3
String[3] = "1-9" is valid input. Parsing...
value[1] ibase=1 range=9
String[4] = "1-9,10-19,20-199" is valid input. Parsing...
value[1] ibase=1 range=9
value[2] ibase=10 range=19
value[3] ibase=20 range=199
String[5] = "1-8,9,10-18,19,20-199" is valid input. Parsing...
value[1] ibase=1 range=8
value[2] ibase=9
value[3] ibase=10 range=18
value[4] ibase=19
value[5] ibase=20 range=199
String[6] = "A" is NOT valid input.
String[7] = "1,2," is NOT valid input.
String[8] = "1 - 9" is NOT valid input.
String[9] = " " is NOT valid input.
String[10] = "" is NOT valid input.
Note that this solution simply demonstrates how to validate an input line and how to parse/extract value components from each line. It does not further validate that, for range values, the second integer is larger than the first. That check could easily be added, however.
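One possible shape for that extra check, reusing the extraction pattern from printIntRanges without the comments. The class and method names are hypothetical, and treating a range such as 7-7 as acceptable is an assumption; BigInteger is used only so that very long digit strings cannot overflow.

```java
import java.math.BigInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical follow-up check; call it only on text that already passed validation.
final class RangeOrderCheck {
    // Same structure as re_next_val above, written without COMMENTS mode.
    private static final Pattern NEXT_VALUE =
            Pattern.compile("([0-9]+)(?:-([0-9]+))?(?:,|$)");

    static boolean rangesAreAscending(String validatedText) {
        Matcher m = NEXT_VALUE.matcher(validatedText);
        while (m.find()) {
            if (m.group(2) != null) {
                BigInteger first = new BigInteger(m.group(1));
                BigInteger second = new BigInteger(m.group(2));
                // Assumption: equal bounds (e.g. "7-7") are allowed.
                if (second.compareTo(first) < 0) {
                    return false;
                }
            }
        }
        return true;
    }
}
```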
Edit:2012-10-24 07:00 Fixed index i to count from zero.

In MATLAB, you can use
strinput = '10-19,25,33'
eval(cat(2,'[',strrep(strinput,'-',':'),']'))
It is best to add some input checks; negative numbers will also cause problems with this method.

In the simplest approach, you can use the evil eval for this:
A = eval('[10:19,25,33]')
A =
10 11 12 13 14 15 16 17 18 19 25 33
BUT of course you should think twice before you do that. Especially if this is a user-supplied string! Imagine what would happen if the user supplied any other command...
eval('!rm -rf /')
You would have to make sure that the string really contains nothing other than what you expect. You could do this with a regexp.


Regex: starts with messages and string between parent message curly brace

I want to extract every top-level message: the keyword message plus everything between the curly braces of that parent message, including any nested messages. With the pattern below, I do not get the whole parent body.
String data = "syntax = \"proto3\";\r\n" +
"package grpc;\r\n" +
"\r\n" +
"import \"envoyproxy/protoc-gen-validate/validate/validate.proto\";\r\n" +
"import \"google/api/annotations.proto\";\r\n" +
"import \"google/protobuf/wrappers.proto\";\r\n" +
"import \"protoc-gen-swagger/options/annotations.proto\";\r\n" +
"\r\n" +
"message Acc {\r\n" +
" message AccErr {\r\n" +
" enum Enum {\r\n" +
" UNKNOWN = 0;\r\n" +
" CASH = 1;\r\n" +
" }\r\n" +
" }\r\n" +
" string account_id = 1;\r\n" +
" string name = 3;\r\n" +
" string account_type = 4;\r\n" +
"}\r\n" +
"\r\n" +
"message Name {\r\n" +
" string firstname = 1;\r\n" +
" string lastname = 2;\r\n" +
"}";
List<String> allMessages = new ArrayList<>();
Pattern pattern = Pattern.compile("message[^\\}]*\\}");
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
String str = matcher.group();
allMessages.add(str);
System.out.println(str);
}
}
I expect my list to contain 2 strings, like below.
allMessages.get(0) should be:
message Acc {
message AccErr {
enum Enum {
UNKNOWN = 0;
CASH = 1;
}
}
string account_id = 1;
string name = 3;
string account_type = 4;
}
and allMessages.get(1) should be:
message Name {
string firstname = 1;
string lastname = 2;
}
First remove everything before the first line that starts with "message", then split on line breaks that are followed by "message" (the line breaks are part of the split pattern, so those between parent messages are consumed):
String[] messages = data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)");
See live demo.
If you actually need a List<String>, pass that result to Arrays.asList():
List<String> messages = Arrays.asList(data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)"));
The first regex matches everything from the start of the input up to, but not including, the first line that starts with message; that match is replaced with an empty string (i.e. deleted). Breaking it down:
(?sm) turns on flags s, which makes dot also match newlines, and m, which makes ^ and $ match start and end of each line
\\A means the very start of input
.*? matches any quantity of any character (including newlines, because the s flag is set); the ? makes the quantifier reluctant, so it matches as few characters as possible while still allowing the overall match
(?=^message) is a look ahead and means the following characters are a start of a line then "message"
See regex101 live demo for a thorough explanation.
The split regex matches one or more line break sequences when they are followed by "message":
\\R+ means one or more line break sequences (all OS variants)
(?=message) is a look ahead and means the following characters are "message"
See regex101 live demo for a thorough explanation.
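A small self-contained sketch of the two steps explained above, wrapped in a method so it can be tried on a shortened input. The class and method names are hypothetical; the sketch assumes, as the explanation does, that top-level messages start at the beginning of a line and nested ones are indented.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the strip-then-split approach.
final class ProtoMessageSplit {
    static List<String> topLevelMessages(String proto) {
        // Step 1: drop everything before the first line that starts with "message".
        String fromFirstMessage = proto.replaceAll("(?sm)\\A.*?(?=^message)", "");
        // Step 2: split on line breaks that are directly followed by "message".
        // Indented (nested) messages are not split, because spaces come first.
        return Arrays.asList(fromFirstMessage.split("\\R+(?=message)"));
    }
}
```

For an input with a header, one parent message containing an indented nested message, and a second parent message, this returns two strings, one per parent.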
Try this for your regex. It anchors on message being the start of a line, and uses a positive lookahead to find the next message or the end of messages.
Pattern.compile("(?s)\r\n(message.*?)(?=(\r\n)+message|$)")
// or
Pattern.compile("(?s)\r?\n(message.*?)(?=(\r?\n)+message|$)")
No splitting, parsing, or managing nested braces either :)
https://regex101.com/r/Wa2xxx/1
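Both regex answers depend on top-level messages starting at the beginning of a line. If the formatting cannot be relied on, one alternative is to count braces explicitly. This is a different technique from the answers above, shown only as a rough sketch: the names are hypothetical, and braces inside comments or string literals are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical brace-depth scanner for top-level "message Name { ... }" blocks.
final class TopLevelMessages {
    private static final Pattern MESSAGE_START =
            Pattern.compile("\\bmessage\\s+\\w+\\s*\\{");

    static List<String> extract(String proto) {
        List<String> result = new ArrayList<>();
        Matcher m = MESSAGE_START.matcher(proto);
        int from = 0;
        while (m.find(from)) {
            int depth = 0;
            int end = -1;
            // m.end() - 1 is the index of the opening brace of this message.
            for (int i = m.end() - 1; i < proto.length(); i++) {
                char c = proto.charAt(i);
                if (c == '{') {
                    depth++;
                } else if (c == '}' && --depth == 0) {
                    end = i + 1; // just past the matching closing brace
                    break;
                }
            }
            if (end < 0) {
                break; // unbalanced braces: stop rather than guess
            }
            result.add(proto.substring(m.start(), end));
            from = end; // resume after this block, which skips nested messages
        }
        return result;
    }
}
```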

How to split a java string with a regex?

I am struggling a bit with splitting a string.
Here is an example of an input and the correct output I want:
Input: "Hope you're doing well! I am doing ok. " <--- A few spaces after the period
Output:
["Hope", " ", "you're", " ", "doing", " ", "well", "!", " ", "I", " ",
"am", " ", "doing", " ", "ok", ".", " ", " ", " ", " ", " "]
I want an output that puts each word at its own index (even if the word includes an apostrophe). I also want every space and punctuation mark (?, !, ., " ") to have its own index in the array.
Here's what I've tried: I have taken a string message and used the split function. I have used a regex which is giving me almost the correct output, but it's not accounting for extra spaces after the period.
The regex I used:
"\\b |(?=\\p{Punct})|(?<=\\p{Punct}) | "
Anyone have any suggestions? Thank you for your time.
Here is a way to do it but it is rather unorthodox.
String str = "Hope you're doing well! I am doing ok. ";
Establish a regex for all the punctuation, spaces, etc you want using a capture group.
String regex = "([!\\s\\.])";
Then wrap each occurrence in a marker character that does not otherwise appear in the input. In this case I used a #. You could also use several characters together as the split delimiter.
Then split on that character.
String[] tokens = str.replaceAll(regex, "#$1#").split("#");
System.out.println(Arrays.toString(tokens));
Prints
[Hope, , you're, , doing, , well, !, , , I, , am, , doing, , ok, ., , ]
You can get rid of the empty strings ("") as follows:
tokens = Arrays.stream(tokens).filter(s->!s.isEmpty()).toArray(String[]::new);
First you need to define what constitutes the characters of a "word", and anything else should become a separate token. Here I've defined word-characters as letters, numbers, apostrophes, and dashes, so the separate tokens are anything but those:
[^\p{L}\p{N}'-]
Then you build a regex using zero-width positive lookahead and lookbehind for the non-word characters, with a little extra to make sure we don't do a zero-width match at the beginning or end of input.
(?<!^)(?=[^\p{L}\p{N}'-])|(?<=[^\p{L}\p{N}'-])(?!$)
As Java code, that would be:
String input = "Hope you're doing well! I am doing ok. ";
String regex = "(?<!^)(?=[^\\p{L}\\p{N}'-])|(?<=[^\\p{L}\\p{N}'-])(?!$)";
String[] tokens = input.split(regex);
System.out.println(Arrays.stream(tokens).map(s -> '"' + s + '"')
.collect(Collectors.joining(", ", "[", "]")));
Output
["Hope", " ", "you're", " ", "doing", " ", "well", "!", " ", "I", " ",
"am", " ", "doing", " ", "ok", ".", " ", " ", " ", " ", " "]

Java, Regex: Replace fields in Json by field key

Requirements:
For each Json field where key matches specified constant replace value with another constant.
{"regular":"a", "sensitive":"b"}
Parameters: "sensitive", "*****".
Expected:
{"regular":"a", "sensitive":"*****"}
Values may or may not have double quotes around them. The replacement constant is always double quoted. The Json may be malformed. A Java implementation is preferred.
Key comparison is case insensitive.
Depending on how malformed your "JSON" is, the following might work - if not, we need more test cases:
"sensitive"\s*:\s* # match "sensitive":
( # capture in group 1:
"[^"]*" # any quoted value
| # or
[^\s,{}"]* # any unquoted value, ending at a comma, brace or whitespace
) # end of group 1
In Java:
String resultString = subjectString.replaceAll(
"(?x)\"sensitive\"\\s*:\\s* # match \"sensitive\":\n" +
"( # capture in group 1:\n" +
" \"[^\"]*\" # any quoted value\n" +
"| # or\n" +
" [^\\s,{}\"]* # an unquoted value, ending at comma, brace or whitespace\n" +
") # end of group 1",
"\"sensitive\":\"*****\"");
Test it live on regex101.com.
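The question also asks for the key and the replacement to be parameters and for the key comparison to be case insensitive. A possible parameterized variant of the regex above is sketched here; the class and method names are hypothetical, and it assumes keys are always double quoted and that quoted values contain no escaped quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that masks the value of one key, keeping the key as written.
final class FieldMasker {
    static String mask(String json, String key, String replacement) {
        Pattern p = Pattern.compile(
                // Group 1: the quoted key, colon, and surrounding whitespace.
                "(\"" + Pattern.quote(key) + "\"\\s*:\\s*)"
                // Group 2: a quoted value, or an unquoted one ending at comma, brace or whitespace.
                + "(\"[^\"]*\"|[^\\s,{}\"]*)",
                Pattern.CASE_INSENSITIVE);
        // quoteReplacement keeps '$' and '\' in the replacement constant literal.
        String quoted = Matcher.quoteReplacement("\"" + replacement + "\"");
        return p.matcher(json).replaceAll("$1" + quoted);
    }
}
```

Because group 1 is written back unchanged, the original spelling and spacing of the key are preserved, and only the value is replaced.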
You can use a positive lookbehind to achieve this:
public static void main(String[] args) {
String s = "{\"regular\":\"a\", \"sensitive\":\"b\"}";
String key = "sensitive";
String val = "****";
System.out.println(s.replaceAll("(?<=\"" + key + "\":\")(\\w+)", val));
key = "regular";
System.out.println(s.replaceAll("(?<=\"" + key + "\":\")(\\w+)", val));
}
Output:
{"regular":"a", "sensitive":"****"}
{"regular":"****", "sensitive":"b"}
You can use the following regex:
String t= "{\"regular\":\"a\", \"sensitive\":\"b\"}"; //{"regular":"a", "sensitive":"b"}
String r = t.replaceAll("(\\s*)\"?sensitive\"?\\s*:\\s*\"?b\"?\\s*", "$1\"sensitive\":\"*****\"");
System.out.println("output "+r); //output {"regular":"a", "sensitive":"*****"}
t= "{\"regular\":\"a\",sensitive:b}"; //{"regular":"a",sensitive:b}
r = t.replaceAll("(\\s*)\"?sensitive\"?\\s*:\\s*\"?b\"?\\s*", "$1\"sensitive\":\"*****\"");
System.out.println("output "+r); //output {"regular":"a","sensitive":"*****"}
DEMO: https://regex101.com/r/uHUhEl/1/

Split string after n amount of digits occurrence

I'm parsing some folder names here. I have a program that lists subfolders of a folder and parses folder names.
For example, one folder could be named something like this:
"Folder.Name.1234.Some.Info.Here-ToBeParsed"
and I would like to parse it so that the name becomes "Folder Name". At the moment I first use string.replaceAll() to get rid of special characters, which leaves this 4-digit sequence. I would like to split the string at that point. How can I achieve this?
Currently my code looks something like this:
// Parsing string if regex p matches folder's name
if(b) {
//System.out.println("Folder: \" " + name + "\" contains special characters.");
String result = name.replaceAll("[\\p{P}\\p{S}]", " "); // Getting rid of all punctuations and symbols.
//System.out.println("Parsed: " + name + " > " + result);
// If string matches regex p2
if(b2) {
//System.out.println("Folder: \" " + result + "\" contains release year.");
String parsed_name[] = result.split("20"); // This is the line i would like to split when 4-digits in row occur.
//System.out.println("Parsed: " + result + " > " + parsed_name[0]);
movieNames.add(parsed_name[0]);
}
Or maybe there is an even easier way to do this? Thanks in advance!
You should keep it simple like this:
String name = "Folder.Name.1234.Some.Info.Here-ToBeParsed";
String repl = name.replaceFirst( "\\.\\d{4}.*", "" ).
replaceAll( "[\\p{P}\\p{S}&&[^']]+", " " );
//=> Folder Name
replaceFirst removes everything from the first dot followed by 4 digits through the end of the string.
replaceAll replaces each run of punctuation and symbols (except apostrophes) with a single space.

Regex: split a complex list (two columns) in Java

I'm trying to figure out how to split each line of a two-column file, read with readLine(), while handling several different delimiters (see below).
Here are all possibilities of my delimiters (see comments)
+--------+---------+
+ ##some text + //some text which starts with (##) I want to exclude this row
+ 341, 222 + //comma delimited
+ 211 321 + //space delimited
+ 541 1231 + //tab delimited
+ ##some text + //some text which starts with (##) I want to exclude this row
+ 11.3 321.11 + //double values delimited by tab
+ 331.3 33.11 + //double values delimited by space
+ 231.3, 33.1 + //double values delimited by comma
+ ##some text + //some text which starts with (##) I want to exclude this row
+--------+---------+
I want to obtain this table:
+--------+---------+
+ 341 222 +
+ 211 321 +
+ 541 1231 +
+ 11.3 321.11 +
+ 331.3 33.11 +
+ 231.3 33.1 +
+--------+---------+
I will be glad to find a solution to this issue.
UPDATE:
For now I have ([,\s\t;])+ (for comma, tab, space, semicolon...), but I can't figure out how to handle the ##some text rows. I tried \##\w+, but it didn't work. Any advice?
You can try this regex; I have tested it and it works:
(\\d+\\.?\\d*),?\\s*?(\\d+\\.?\\d*)
Then replace with $1 and $2.
EDIT:
Try the code below:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class regcheck
{
private static Pattern twopart = Pattern.compile("(\\d+\\.?\\d*),?\\s*?(\\d+\\.?\\d*)");
public static void checkString(String s)
{
Matcher m = twopart.matcher(s);
if (m.matches()) {
System.out.println(m.group(1) +" " + m.group(2));
} else {
System.out.println(s + " does not match.");
}
}
public static void main(String[] args) {
System.out.println("Parts of strings are ");
checkString("##some text");
checkString("123, 4567");
checkString("123, 342");
checkString("45.45 4.3");
checkString("3.78, 23.78");
}
}
OUTPUT :
Parts of strings are
##some text does not match.
123 4567
123 342
45.45 4.3
3.78 23.78
m.group(1) gives you the first part.
m.group(2) gives you the second part.
In your code, call checkString() once per line.
Assuming the ASCII table border isn't part of the input, you could try this:
##[a-z\s]+|([\d\.]+)[,\s\t]+([\d\.]+)
then replace with:
\1 \2 (or $1 $2)
Note that this doesn't allow for commas inside the numbers.
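In Java, the same idea could look roughly like the sketch below. It is a slightly stricter variant rather than the exact regex above: each number needs digits on both sides of an optional decimal point, the separator class also accepts the semicolon mentioned in the question, and lines starting with ## are skipped before matching. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical line filter: keeps "number separator number" rows, drops the rest.
final class TwoColumnParser {
    private static final Pattern ROW = Pattern.compile(
            "\\s*(\\d+(?:\\.\\d+)?)[,;\\s]+(\\d+(?:\\.\\d+)?)\\s*");

    static List<String> parse(List<String> lines) {
        List<String> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().startsWith("##")) {
                continue; // comment row, excluded as requested
            }
            Matcher m = ROW.matcher(line);
            if (m.matches()) {
                // Normalize every accepted row to "first second".
                rows.add(m.group(1) + " " + m.group(2));
            }
        }
        return rows;
    }
}
```

Requiring at least one separator character avoids a pitfall of an optional separator, where a single number such as 1234 could be split into two groups.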
