How to split a java string with a regex?


I am struggling a bit with splitting a string.
Here is an example of an input and the correct output I want:
Input: "Hope you're doing well! I am doing ok. " <--- A few spaces after the period
Output:
[Hope, " ", you're, " ", doing, well, "!" , " ", "I", " ",
"am", " ", "doing", " ", "ok", "." , " ", " ", " ", " ", " "]
I want an output that puts each word at its own index (even if the word includes an apostrophe). I also want every space and punctuation mark (?, !, ., " ") to have its own index in the array.
Here's what I've tried: I have taken a string message and used the split function. I have used a regex which is giving me almost the correct output, but it's not accounting for extra spaces after the period.
The regex I used:
"\\b |(?=\\p{Punct})|(?<=\\p{Punct}) | "
Anyone have any suggestions? Thank you for your time.

Here is a way to do it but it is rather unorthodox.
String str = "Hope you're doing well! I am doing ok. ";
Establish a regex for all the punctuation, spaces, etc you want using a capture group.
String regex = "([!\\s\\.])";
Then replace each occurrence with itself surrounded by a character that does not otherwise appear in the input. In this case I used a #. You could also use a sequence of several characters as the split delimiter.
Then split on that character.
String[] tokens = str.replaceAll(regex, "#$1#").split("#");
System.out.println(Arrays.toString(tokens));
Prints
[Hope, , you're, , doing, , well, !, , , I, , am, , doing, , ok, ., , ]
You can get rid of the empty strings ("") as follows:
tokens = Arrays.stream(tokens).filter(s->!s.isEmpty()).toArray(String[]::new);
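The steps above can be put together in one small class. This is only a sketch: the class name DelimiterSplitSketch and the method tokenize are made up for illustration, and the guard against a marker that already occurs in the input is an assumption of mine, since the approach silently breaks in that case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DelimiterSplitSketch {
    // Hypothetical helper: wraps each space, '!' or '.' in the marker,
    // splits on the marker, then drops the empty strings left between markers.
    static List<String> tokenize(String input, String marker) {
        if (input.contains(marker)) {
            // Assumption: a marker that already occurs in the input would corrupt the tokens.
            throw new IllegalArgumentException("Marker occurs in input: " + marker);
        }
        String safeMarker = Matcher.quoteReplacement(marker);
        String marked = input.replaceAll("([!\\s.])", safeMarker + "$1" + safeMarker);
        return Arrays.stream(marked.split(Pattern.quote(marker)))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two trailing spaces, so the last two tokens are single spaces.
        System.out.println(tokenize("Hi there! Ok.  ", "#"));
    }
}
```

Quoting the marker on both sides means a multi-character marker or one containing regex metacharacters should behave the same way as #.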

First you need to define what constitutes the characters of a "word", and anything else should become a separate token. Here I've defined word-characters as letters, numbers, apostrophes, and dashes, so the separate tokens are anything but those:
[^\p{L}\p{N}'-]
Then you build a regex using zero-width positive lookahead and lookbehind for the non-word characters, with a little extra to make sure we don't do a zero-width match at the beginning or end of the input.
(?<!^)(?=[^\p{L}\p{N}'-])|(?<=[^\p{L}\p{N}'-])(?!$)
As Java code, that would be:
String input = "Hope you're doing well! I am doing ok. ";
String regex = "(?<!^)(?=[^\\p{L}\\p{N}'-])|(?<=[^\\p{L}\\p{N}'-])(?!$)";
String[] tokens = input.split(regex);
System.out.println(Arrays.stream(tokens).map(s -> '"' + s + '"')
.collect(Collectors.joining(", ", "[", "]")));
Output
["Hope", " ", "you're", " ", "doing", " ", "well", "!", " ", "I", " ",
"am", " ", "doing", " ", "ok", ".", " ", " ", " ", " ", " "]

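For the lookaround approach, here is a small self-contained sketch that keeps the dash in both character classes, so that a hyphenated word stays in one piece. The class name LookaroundSplitSketch and the sample inputs are my own, not from the answer.

```java
import java.util.Arrays;

class LookaroundSplitSketch {
    // Word characters are letters, digits, apostrophe and dash; the same set is
    // used in the lookahead and in the lookbehind.
    static final String BOUNDARY =
            "(?<!^)(?=[^\\p{L}\\p{N}'-])|(?<=[^\\p{L}\\p{N}'-])(?!$)";

    static String[] tokenize(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        // Each token is printed in brackets so that single spaces stay visible.
        Arrays.stream(tokenize("A well-known fact, isn't it?  "))
                .forEach(t -> System.out.println("[" + t + "]"));
    }
}
```

If the dash were missing from the lookbehind class, the split would also fire right after a dash and cut "well-known" in two.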
Related

Regex: starts with messages and string between parent message curly brace

I want to get all the message data. Such that it should look for message and all the data between curly braces of the parent message. With the below pattern, I am not getting all parent body.
String data = "syntax = \"proto3\";\r\n" +
"package grpc;\r\n" +
"\r\n" +
"import \"envoyproxy/protoc-gen-validate/validate/validate.proto\";\r\n" +
"import \"google/api/annotations.proto\";\r\n" +
"import \"google/protobuf/wrappers.proto\";\r\n" +
"import \"protoc-gen-swagger/options/annotations.proto\";\r\n" +
"\r\n" +
"message Acc {\r\n" +
" message AccErr {\r\n" +
" enum Enum {\r\n" +
" UNKNOWN = 0;\r\n" +
" CASH = 1;\r\n" +
" }\r\n" +
" }\r\n" +
" string account_id = 1;\r\n" +
" string name = 3;\r\n" +
" string account_type = 4;\r\n" +
"}\r\n" +
"\r\n" +
"message Name {\r\n" +
" string firstname = 1;\r\n" +
" string lastname = 2;\r\n" +
"}";
List<String> allMessages = new ArrayList<>();
Pattern pattern = Pattern.compile("message[^\\}]*\\}");
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
String str = matcher.group();
allMessages.add(str);
System.out.println(str);
}
}
I am expecting response like below in my array list of string with size 2.
allMessage.get(0) should be:
message Acc {
message AccErr {
enum Enum {
UNKNOWN = 0;
CASH = 1;
}
}
string account_id = 1;
string name = 3;
string account_type = 4;
}
and allMessage.get(1) should be:
message Name {
string firstname = 1;
string lastname = 2;
}

First remove everything before the first line that starts with "message", then split on line breaks that are followed by "message" (the line breaks are part of the split pattern, so the blank lines between parent messages are consumed):
String[] messages = data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)");
If you actually need a List<String>, pass that result to Arrays.asList():
List<String> messageList = Arrays.asList(data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)"));
The first regex matches everything from the start of the input up to, but not including, the first line that starts with message, and that match is replaced with an empty string (i.e. deleted). Breaking it down:
(?sm) turns on flags s, which makes dot also match newlines, and m, which makes ^ and $ match start and end of each line
\\A means the very start of input
.*? means any quantity of any character (including newlines, because the s flag is set), and the trailing ? makes the quantifier reluctant, so it matches as few characters as possible while still matching
(?=^message) is a look ahead and means the following characters are a start of a line then "message"
The split regex matches one or more line break sequences when they are followed by "message":
\\R+ means one or more line break sequences (all OS variants)
(?=message) is a look ahead and means the following characters are "message"
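A runnable sketch of the two steps described above, using the variant of the first regex with ^ inside the lookahead. The class and method names are hypothetical. It assumes, as in the question, that nested messages are indented and top-level messages start in the first column.

```java
import java.util.Arrays;
import java.util.List;

class TopLevelMessageSplit {
    // Hypothetical helper: returns one string per top-level "message" block.
    static List<String> topLevelMessages(String data) {
        // Step 1: delete everything before the first line that starts with "message".
        String fromFirstMessage = data.replaceAll("(?sm)\\A.*?(?=^message)", "");
        // Step 2: split on line breaks that are directly followed by "message".
        return Arrays.asList(fromFirstMessage.split("\\R+(?=message)"));
    }

    public static void main(String[] args) {
        String data = "syntax = \"proto3\";\r\n"
                + "\r\n"
                + "message Acc {\r\n"
                + "  message AccErr {\r\n"
                + "  }\r\n"
                + "  string account_id = 1;\r\n"
                + "}\r\n"
                + "\r\n"
                + "message Name {\r\n"
                + "  string firstname = 1;\r\n"
                + "}";
        topLevelMessages(data).forEach(m -> System.out.println(m + "\n----"));
    }
}
```

The indentation does the real work here: a nested message is preceded by spaces, so the lookahead (?=message) fails directly after its line break and no split happens there.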

Try this for your regex. It anchors on message being the start of a line, and uses a positive lookahead to find the next message or the end of messages.
Pattern.compile("(?s)\r\n(message.*?)(?=(\r\n)+message|$)")
// or
Pattern.compile("(?s)\r?\n(message.*?)(?=(\r?\n)+message|$)")
Each message is captured in group 1. No splitting, parsing, or managing nested braces either :)
https://regex101.com/r/Wa2xxx/1
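To show how that pattern would be used with Matcher.find(), here is a minimal sketch. The class name TopLevelMessageFinder is made up, and the regex is written with \\r?\\n escapes instead of literal line-break characters, which should be equivalent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopLevelMessageFinder {
    // A line break, then "message" at the start of a line, lazily up to the
    // next line break(s) followed by "message", or up to the end of the input.
    private static final Pattern TOP_LEVEL =
            Pattern.compile("(?s)\\r?\\n(message.*?)(?=(\\r?\\n)+message|$)");

    static List<String> find(String data) {
        List<String> messages = new ArrayList<>();
        Matcher matcher = TOP_LEVEL.matcher(data);
        while (matcher.find()) {
            messages.add(matcher.group(1)); // group(1) leaves out the leading line break
        }
        return messages;
    }

    public static void main(String[] args) {
        String data = "syntax = \"proto3\";\n\nmessage Acc {\n  message AccErr {\n  }\n"
                + "  string account_id = 1;\n}\n\nmessage Name {\n  string firstname = 1;\n}";
        find(data).forEach(m -> System.out.println(m + "\n----"));
    }
}
```

One thing to watch: the pattern requires a line break before message, so a message on the very first line of the input would be skipped. As with the split approach, it relies on nested messages being indented.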

Cannot escape double quotation in JAVA

I'm struggling to escape the double quotation marks when I try to convert my Map to String
//iterate through map and append to array
StringBuilder mapAsString = new StringBuilder("{");
for (String key : billingDetails.keySet()) {
mapAsString.append(key + ":" + billingDetails.get(key) + ", ");
}
mapAsString.delete(mapAsString.length()-2, mapAsString.length()).append("}");
String bdResult = mapAsString.toString();
Current output
"position": [
"{2018:Element3, 2012:Element2, 2010:Element1}"
Expected output
"position": [
"{"2018":"Element3", "2012":"Element2", "2010":"Element1"}"
The following line appends the contents of the Map to the String; however, I can't figure out how to escape the quotation marks so that I get the expected output:
mapAsString.append(key + ":" + billingDetails.get(key) + ", ");
Escaping characters in java has always been somewhat of a difficulty for me and I would appreciate it if someone with a wider knowledge would help me out
Cheers and thanks

Escape the quotes with a backslash:
mapAsString.append("\"" + key + "\":\"" + billingDetails.get(key) + "\", ");
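For comparison, the same escaping can be combined with Collectors.joining, which adds the braces and separators itself, so the trailing ", " never has to be deleted. This is only a sketch: the question does not show the type of billingDetails, so Map<String, String> is an assumption, and the class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class QuotedMapSketch {
    // Hypothetical helper: renders {"key":"value", ...} with escaped double quotes.
    static String toQuotedString(Map<String, String> map) {
        return map.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                .collect(Collectors.joining(", ", "{", "}"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output order is predictable.
        Map<String, String> billingDetails = new LinkedHashMap<>();
        billingDetails.put("2018", "Element3");
        billingDetails.put("2012", "Element2");
        billingDetails.put("2010", "Element1");
        System.out.println(toQuotedString(billingDetails));
        // prints {"2018":"Element3", "2012":"Element2", "2010":"Element1"}
    }
}
```

An empty map simply gives {}, whereas the delete(length()-2, length()) call in the question would throw for an empty map. Note that keys or values which themselves contain double quotes are not escaped here; if the result has to be valid JSON, a JSON library is probably the safer choice.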

Split string after n amount of digits occurrence

I'm parsing some folder names here. I have a program that lists subfolders of a folder and parses folder names.
For example, one folder could be named something like this:
"Folder.Name.1234.Some.Info.Here-ToBeParsed"
and I would like to parse it so name would be "Folder Name". At the moment I'm first using string.replaceAll() to get rid of special characters and then there is this 4-digit sequence. I would like to split string on that point. How can I achieve this?
Currently my code looks something like this:
// Parsing string if regex p matches folder's name
if(b) {
//System.out.println("Folder: \" " + name + "\" contains special characters.");
String result = name.replaceAll("[\\p{P}\\p{S}]", " "); // Getting rid of all punctuations and symbols.
//System.out.println("Parsed: " + name + " > " + result);
// If string matches regex p2
if(b2) {
//System.out.println("Folder: \" " + result + "\" contains release year.");
String parsed_name[] = result.split("20"); // This is the line i would like to split when 4-digits in row occur.
//System.out.println("Parsed: " + result + " > " + parsed_name[0]);
movieNames.add(parsed_name[0]);
}
Or maybe there is even easier way to do this? Thanks in advance!

You should keep it simple like this:
String name = "Folder.Name.1234.Some.Info.Here-ToBeParsed";
String repl = name.replaceFirst( "\\.\\d{4}.*", "" ).
replaceAll( "[\\p{P}\\p{S}&&[^']]+", " " );
//=> Folder Name
replaceFirst removes the first dot that is followed by 4 digits, together with everything after it
replaceAll replaces each run of punctuation and symbol characters (except the apostrophe) with a single space
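Wrapped in a method, the two calls can be tried on a few folder names. The class name, the method name and the second folder name below are made up for illustration.

```java
class FolderNameSketch {
    // Hypothetical helper: cuts at the first ".dddd" and turns separators into spaces.
    static String title(String folderName) {
        return folderName
                .replaceFirst("\\.\\d{4}.*", "")                 // drop ".1234" and the rest
                .replaceAll("[\\p{P}\\p{S}&&[^']]+", " ");      // separators to one space, keep '
    }

    public static void main(String[] args) {
        System.out.println(title("Folder.Name.1234.Some.Info.Here-ToBeParsed")); // Folder Name
        System.out.println(title("It's.A.Test.2012.Extra"));                     // It's A Test
    }
}
```

A name without a dot followed by four digits is left uncut and only has its separators replaced.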

How to replace multiple spaces and newlines with one blank line

How to remove multiple spaces and newlines in a string, but preserve at least one blank line for each group of blank lines.
For example, change:
"This is
a string.
Something."
to
"This is
a string.
Something."
I'm using .trim() to strip whitespace from the beginning and end of a string, but I couldn't find anything for removing multiple spaces and newlines in a string.
I would like to keep just one whitespace and one newline.

The one-line solution to remove multiple spaces/newlines, but preserve at least one blank line from multiple blank lines:
str = str.replaceAll("(?m)(^ *| +(?= |$))", "").replaceAll("(?m)^$([\r\n]+?)(^$[\r\n]+?^)+", "$1");
Each individual line is trimmed too.
Here's some test code:
String str = " This is\r\n " +
"\r\n" +
" \r\n " +
" \r \n \n " +
"\r\n" +
" a string. ";
str = str.trim().replaceAll("(?m)(^ *| +(?= |$))", "").replaceAll("(?m)^$([\r\n]+?)(^$[\r\n]+?^)+", "$1");
System.out.println(str);
Output:
This is
a string.

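The previous advice will trim all whitespace, including the linefeeds, and replace it with a single space. Note that the replacement must be the actual newline character "\n": in a replaceAll replacement string, "\\n" is an escaped n and inserts the letter n.
String result = text.replaceAll("\\n\\s*\\n", "\n").replaceAll("[ \\t\\x0B\\f]+", " ").trim();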
First it replaces any instances of linefeeds with only whitespace between them with a single linefeed, then it trims down any other whitespace to a single space ignoring linefeeds.
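Since the question asks to keep one blank line, here is a variant of the same two-step idea that replaces each run with "\n\n" instead of a single linefeed. That change is mine, not the answer's. The sketch assumes \n line endings, and the class and method names are made up.

```java
class WhitespaceCollapseSketch {
    // Hypothetical helper: collapses runs of blank lines to one blank line
    // and runs of other whitespace to one space.
    static String collapse(String text) {
        return text
                .replaceAll("\\n\\s*\\n", "\n\n")       // two or more line breaks -> one blank line
                .replaceAll("[ \\t\\x0B\\f]+", " ")    // spaces, tabs, VT, FF -> one space
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(collapse("  This   is\n\n \n\n\na \t string.  "));
    }
}
```

A single line break between two lines is left alone, because the first pattern needs at least two linefeeds. Lines are not trimmed individually, so a space directly before or after a line break survives as one space.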

Here is what I came up with after a bit of testing...
public String keepOneWS(String str) {
Pattern p = Pattern.compile("(\\s+)");
Matcher m = p.matcher(str);
Pattern pBlank = Pattern.compile("[ \t]+");
String newLineReplacement = System.getProperty("line.separator") +
System.getProperty("line.separator");
StringBuffer sb = new StringBuffer();
while (m.find()) {
if(pBlank.matcher(m.group(1)).matches()) {
m.appendReplacement(sb, " ");
} else {
m.appendReplacement(sb, newLineReplacement);
}
}
m.appendTail(sb);
return sb.toString().trim();
}
public void testKeepOneWS() {
String str = " This \t is\r\n " +
"\r\n" +
" \r\n " +
" \r \n \t \n " +
"\r\n" +
" a \t string. \t ";
String expected = "This is" + System.getProperty("line.separator")+
System.getProperty("line.separator") + "a string.";
String actual = keepOneWS(str);
System.out.println("'" + actual + "'");
assertEquals(expected, actual);
}
After a group of whitespace is captured, it is checked whether it consists only of spaces and tabs. If so, the group is replaced by a single space. Otherwise the group contains line terminators, and it is replaced by two line separators, which leaves exactly one blank line.
The output is:
'This is
a string.'

Advanced parsing of numeric ranges from string

I'm using Java to parse strings input by the user, representing either single numeric values or ranges. The user can input the following string:
10-19
The intention is to use the whole numbers from 10 to 19 --> 10,11,12...19
The user can also specify a list of numbers:
10,15,19
Or a combination of the above:
10-19,25,33
Is there a convenient method, perhaps based on regular expressions, to perform this parsing? Or must I split the string using String.split(), then manually iterate the special signs (',' and '-' in this case)?

This is how I would go about it:
Split the input using , as the delimiter.
If a part matches the regular expression ^(\\d+)-(\\d+)$, then I know I have a range. I would then extract the two numbers and create my range (it is a good idea to make sure that the first number is not greater than the second, because you never know...).
If a part matches the regular expression ^\\d+$, I know I have a single number, and I would add it as it is.
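The steps above can be sketched as follows. The class name RangeParserSketch and the choice to throw IllegalArgumentException for bad input are my own assumptions. The answer leaves open what "act accordingly" means, and here it is taken to mean expanding everything into one list of integers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeParserSketch {
    private static final Pattern RANGE = Pattern.compile("^(\\d+)-(\\d+)$");
    private static final Pattern SINGLE = Pattern.compile("^\\d+$");

    // Hypothetical helper: expands "10-12,25" into [10, 11, 12, 25].
    static List<Integer> parse(String input) {
        List<Integer> values = new ArrayList<>();
        // Limit -1 keeps trailing empty parts, so "1,2," is rejected below.
        for (String part : input.split(",", -1)) {
            Matcher range = RANGE.matcher(part);
            if (range.matches()) {
                int from = Integer.parseInt(range.group(1));
                int to = Integer.parseInt(range.group(2));
                if (from > to) {
                    throw new IllegalArgumentException("Descending range: " + part);
                }
                for (int i = from; i < to; i++) {
                    values.add(i);
                }
                values.add(to); // added separately so i never overflows past 'to'
            } else if (SINGLE.matcher(part).matches()) {
                values.add(Integer.parseInt(part));
            } else {
                throw new IllegalArgumentException("Not a number or range: \"" + part + "\"");
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("10-19,25,33"));
    }
}
```

Numbers too large for an int make Integer.parseInt throw a NumberFormatException, which is itself an IllegalArgumentException, so callers only need to handle one exception type. Whitespace around numbers is rejected rather than trimmed.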

This tested (and fully commented) regex solution meets the OP requirements:
Java regex solution
// TEST.java 20121024_0700
import java.util.regex.*;
public class TEST {
public static Boolean isValidIntRangeInput(String text) {
Pattern re_valid = Pattern.compile(
"# Validate comma separated integers/integer ranges.\n" +
"^ # Anchor to start of string. \n" +
"[0-9]+ # Integer of 1st value (required). \n" +
"(?: # Range for 1st value (optional). \n" +
" - # Dash separates range integer. \n" +
" [0-9]+ # Range integer of 1st value. \n" +
")? # Range for 1st value (optional). \n" +
"(?: # Zero or more additional values. \n" +
" , # Comma separates additional values. \n" +
" [0-9]+ # Integer of extra value (required). \n" +
" (?: # Range for extra value (optional). \n" +
" - # Dash separates range integer. \n" +
" [0-9]+ # Range integer of extra value. \n" +
" )? # Range for extra value (optional). \n" +
")* # Zero or more additional values. \n" +
"$ # Anchor to end of string. ",
Pattern.COMMENTS);
Matcher m = re_valid.matcher(text);
if (m.matches()) return true;
else return false;
}
public static void printIntRanges(String text) {
Pattern re_next_val = Pattern.compile(
"# extract next integers/integer range value. \n" +
"([0-9]+) # $1: 1st integer (Base). \n" +
"(?: # Range for value (optional). \n" +
" - # Dash separates range integer. \n" +
" ([0-9]+) # $2: 2nd integer (Range) \n" +
")? # Range for value (optional). \n" +
"(?:,|$) # End on comma or string end.",
Pattern.COMMENTS);
Matcher m = re_next_val.matcher(text);
String msg;
int i = 0;
while (m.find()) {
msg = " value["+ ++i +"] ibase="+ m.group(1);
if (m.group(2) != null) {
msg += " range="+ m.group(2);
};
System.out.println(msg);
}
}
public static void main(String[] args) {
String[] arr = new String[]
{ // Valid inputs:
"1",
"1,2,3",
"1-9",
"1-9,10-19,20-199",
"1-8,9,10-18,19,20-199",
// Invalid inputs:
"A",
"1,2,",
"1 - 9",
" ",
""
};
// Loop through all test input strings:
int i = 0;
for (String s : arr) {
String msg = "String["+ ++i +"] = \""+ s +"\" is ";
if (isValidIntRangeInput(s)) {
// Valid input line
System.out.println(msg +"valid input. Parsing...");
printIntRanges(s);
} else {
// Match attempt failed
System.out.println(msg +"NOT valid input.");
}
}
}
}
Output:
r'''
String[1] = "1" is valid input. Parsing...
value[1] ibase=1
String[2] = "1,2,3" is valid input. Parsing...
value[1] ibase=1
value[2] ibase=2
value[3] ibase=3
String[3] = "1-9" is valid input. Parsing...
value[1] ibase=1 range=9
String[4] = "1-9,10-19,20-199" is valid input. Parsing...
value[1] ibase=1 range=9
value[2] ibase=10 range=19
value[3] ibase=20 range=199
String[5] = "1-8,9,10-18,19,20-199" is valid input. Parsing...
value[1] ibase=1 range=8
value[2] ibase=9
value[3] ibase=10 range=18
value[4] ibase=19
value[5] ibase=20 range=199
String[6] = "A" is NOT valid input.
String[7] = "1,2," is NOT valid input.
String[8] = "1 - 9" is NOT valid input.
String[9] = " " is NOT valid input.
String[10] = "" is NOT valid input.
'''
Note that this solution simply demonstrates how to validate an input line and how to parse/extract value components from each line. It does not further validate that, for range values, the second integer is larger than the first. This check, however, could easily be added.
Edit:2012-10-24 07:00 Fixed index i to count from zero.

In MATLAB, you can use
strinput = '10-19,25,33'
eval(cat(2,'[',strrep(strinput,'-',':'),']'))
It is best to include some input checks. Negative numbers will also cause problems with this method.

In the simplest approach, you can use the evil eval for this:
A = eval('[10:19,25,33]')
A =
10 11 12 13 14 15 16 17 18 19 25 33
BUT of course you should think twice before you do that. Especially if this is a user-supplied string! Imagine what would happen if the user supplied any other command...
eval('!rm -rf /')
You would have to make sure that the string really contains nothing other than what you expect. You could do this with a regexp.
