Splitting a string on whitespaces - java

I'm currently trying to split a string into a multi-line string.
The regex should select each whitespace character that has at least 13 characters before it.
The problem is that the 13-character count does not reset after the previously selected whitespace. So, after the first 13 characters, the regex selects every whitespace.
I'm using the following regex with a positive look-behind of 13 characters:
(?<=.{13})
(there is a whitespace at the end)
You can test the regex with the following code:
import java.util.ArrayList;
public class HelloWorld{
public static void main(String []args){
String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
for (String string : str.split("(?<=.{13}) ")) {
System.out.println(string);
}
}
}
The output of this code is as follows:
This is a test.
The
app
should
break
this
string
in
substring
on
whitespaces
after
13
characters
But it should be:
This is a test.
The app should
break this string
in substring on
whitespaces after
13 characters

You may actually use a lazy limiting quantifier to match the lines and then replace with $0\n:
.{13,}?[ ]
Demo in Java:
String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
System.out.println(str.replaceAll(".{13,}?[ ]", "$0\n"));
Note that the pattern matches:
.{13,}? - any character other than a line terminator (if you need to match any character, use the DOTALL modifier, though I doubt you need it in the current scenario), at least 13 times, and lazily as many more as needed to reach the first space that follows
[ ] - a literal space (a character class is redundant, but it helps visualize the pattern).
The replacement pattern - "$0\n" - is re-inserting the whole matched value (it is stored in Group 0) and appends a newline after it.
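One side effect of replacing with $0\n is that every line keeps the matched space at its end. The following is a minimal sketch of a variant, assuming that trailing space is unwanted: the line content is captured in group 1 and the space is left outside the group. The class name WrapDemo and the helper wrap are hypothetical names used only for illustration.

```java
public class WrapDemo {
    // Hypothetical helper: insert a line break at the first space that
    // follows at least minWidth characters, dropping that space.
    static String wrap(String text, int minWidth) {
        // Group 1 holds the line content; the space stays outside the group,
        // so it is replaced by the newline instead of being kept.
        return text.replaceAll("(.{" + minWidth + ",}?) ", "$1\n");
    }

    public static void main(String[] args) {
        String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
        System.out.println(wrap(str, 13));
    }
}
```

Only one space is consumed per break, so input with runs of several spaces would need " +" in place of the single space.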

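You can match and capture the lines rather than splitting. A fixed (.{13}) + would capture exactly 13 characters ending right before a space, which cuts off the start of most lines, so use a lazy .{13,}? and also allow the end of the input as a terminator. A final fragment shorter than 13 characters is not captured.
Java code:
Pattern p = Pattern.compile("(.{13,}?)(?: +|$)");
Matcher m = p.matcher(text);
List<String> matches = new ArrayList<>();
while (m.find()) {
    // group 1 is the line content without the trailing spaces
    matches.add(m.group(1));
}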
It will produce:
This is a test.
The app should
break this string
in substring on
whitespaces after
13 characters

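You can do this with String.split and a regular expression. It would look like this:
line.split("\\s+");
This splits the string at every run of one or more whitespace characters, so each word becomes its own element; it does not enforce the 13-character minimum.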

Related

Use regex to get 2 specific groups of substring

String s = #Section250342,Main,First/HS/12345/Jack/M,2000 10.00,
#Section250322,Main,First/HS/12345/Aaron/N,2000 17.00,
#Section250399,Main,First/HS/12345/Jimmy/N,2000 12.00,
#Section251234,Main,First/HS/12345/Jack/M,2000 11.00
Wherever /Jack/M appears in the string, I want to extract the section numbers (250342, 251234) and the associated values (10.00, 11.00) using a regex.
I tried something like this https://regex101.com/r/4te0Lg/1 but it still does not work:
.Section(\d+(?:\.\d+)?).*/Jack/M
If the only parts of each section that change are the section number, the name of the person and the last value (like in your example) then you can make a pattern very easily by using one of the sections where Jack appears and replacing the numbers you want by capturing groups.
Example:
#Section250342,Main,First/HS/12345/Jack/M,2000 10.00
becomes,
#Section(\d+),Main,First/HS/12345/Jack/M,2000 (\d+\.\d{2})
If the section substring keeps the format but the other parts of it may change, then just replace the rest like this:
#Section(\d+),\w+,(?:\w+/)*Jack/M,\d+ (\d+\.\d{2})
I'm assuming that "Main" is a class, "First/HS/..." is a path and that the last value always has 2 and only 2 decimal places.
\d - A digit: [0-9]
\w - A word character: [a-zA-Z_0-9]
+ - one or more times
* - zero or more times
{2} - exactly 2 times
() - a capturing group
(?:) - a non-capturing group
For reference see: https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/util/regex/Pattern.html
Simple Java example on how to get the values from the capturing groups using java.util.regex.Pattern and java.util.regex.Matcher
import java.util.regex.*;
public class GetMatch {
public static void main(String[] args) {
String s = "#Section250342,Main,First/HS/12345/Jack/M,2000 10.00,#Section250322,Main,First/HS/12345/Aaron/N,2000 17.00,#Section250399,Main,First/HS/12345/Jimmy/N,2000 12.00,#Section251234,Main,First/HS/12345/Jack/M,2000 11.00";
Pattern p = Pattern.compile("#Section(\\d+),\\w+,(?:\\w+/)*Jack/M,\\d+ (\\d+\\.\\d{2})"); // \\. matches a literal dot
Matcher m;
String[] tokens = s.split(",(?=#)"); //split the sections into different strings
for(String t : tokens) //checks every string that we got with the split
{
m = p.matcher(t);
if(m.matches()) //if the string matches the pattern then print the capturing groups
System.out.printf("Section: %s, Value: %s\n", m.group(1), m.group(2));
}
}
}
You could use 2 capture groups, and use a tempered greedy token approach to not cross #Section followed by a digit.
#Section(\d+)(?:(?!#Section\d).)*\bJack/M,\d+\h+(\d+(?:\.\d+)?)\b
Explanation
#Section(\d+) Match #Section and capture 1+ digits in group 1
(?:(?!#Section\d).)* Match any character, as long as #Section followed by a digit does not start at that position
\bJack/M, Match the word Jack and /M,
\d+\h+ Match 1+ digits and 1+ horizontal whitespace characters
(\d+(?:\.\d+)?) Capture group 2, match 1+ digits and an optional decimal part
\b A word boundary
In Java:
String regex = "#Section(\\d+)(?:(?!#Section\\d).)*\\bJack/M,\\d+\\h+(\\d+(?:\\.\\d+)?)\\b";
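To show how the two capture groups are read back, here is a minimal sketch that runs this regex with Matcher.find() over the single-line input used in the other answer. The class name TemperedTokenDemo and the helper extract are hypothetical; the output format section=value; is chosen only for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedTokenDemo {
    // The regex from the answer above, unchanged.
    static final Pattern SECTION = Pattern.compile(
        "#Section(\\d+)(?:(?!#Section\\d).)*\\bJack/M,\\d+\\h+(\\d+(?:\\.\\d+)?)\\b");

    // Hypothetical helper: returns "section=value;" for every section that contains Jack/M.
    static String extract(String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = SECTION.matcher(input);
        while (m.find()) {
            // group 1 is the section number, group 2 is the trailing value
            sb.append(m.group(1)).append('=').append(m.group(2)).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "#Section250342,Main,First/HS/12345/Jack/M,2000 10.00,"
                 + "#Section250322,Main,First/HS/12345/Aaron/N,2000 17.00,"
                 + "#Section250399,Main,First/HS/12345/Jimmy/N,2000 12.00,"
                 + "#Section251234,Main,First/HS/12345/Jack/M,2000 11.00";
        System.out.println(extract(s));
    }
}
```

Because the tempered token stops at the next #Section, a section without Jack/M cannot borrow the value of a later one.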

Extracting words with - included upper lowercase not working for words it only extracts chars

I'm trying to extract several words from a string with Matcher and Pattern. I spent some time on the regular expression I'm using, but it doesn't work as expected: I'm able to extract the individual characters of the words I want, but not the entire words. Any help would be much appreciated.
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main (String[] args){
String mebo = "1323 99BIMCP 1 2 BMWQ-CSPS-D1, 0192, '29229'";
Pattern pattern = Pattern.compile("[((a-zA-Z1-9-0)/W)]");
Matcher matcher = pattern.matcher(mebo);
while (matcher.find()) {
System.out.printf("Word is %s %n",matcher.group(0));
}
}
}
This is current output:
Word is 1 Word is 3 Word is 2 Word is 3 Word is 9 Word is 9 Word
is B Word is I Word is M Word is C Word is P Word is 1 Word is 2
Word is B Word is M Word is W Word is Q Word is - Word is C Word
is S Word is P Word is S Word is - Word is D Word is 1 Word is 0
Word is 1 Word is 9 Word is 2 Word is 2 Word is 9 Word is 2 Word
is 2 Word is 9
============
My expectation is to iterate entire words for example:
String mebo = "1323 99BIMCP 1 2 BMWQ-CSPS-D1, 0192, '29229'"
word is 1323 word is 99BIMCP word is 1 word is 2 word is BMWQ-CSPS-D1
word is 0192 word is 29229
From your regex, it seems you want to include letters, digits and - in your match, so you can use this:
[\w-]+
[\w-]+ - matches one or more word characters (a-z, A-Z, 0-9, _) or hyphens.
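As a minimal sketch of how this pattern plugs into the Matcher loop from the question, the example below collects every match into a list. The class name WordExtractor and the helper words are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // Hypothetical helper: every run of word characters and hyphens becomes one entry.
    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("[\\w-]+").matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String mebo = "1323 99BIMCP 1 2 BMWQ-CSPS-D1, 0192, '29229'";
        for (String word : words(mebo)) {
            System.out.printf("Word is %s%n", word);
        }
    }
}
```

The + quantifier is what turns single-character matches into whole words; the commas and quotes are simply never part of a match.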
The easiest solution here seems to be to skip the Matcher loop and just split the string instead. You want to allow digits, alphabetic characters and - in your words. Consider the following code:
for (String word : mebo.split("[^\\d\\w-]+")) {
System.out.printf("Word is %s %n", word);
}
This should exhibit the desired behaviour. Note that without the + in the splitting pattern, this would generate some empty strings. The \d is redundant, since \w already covers digits.
This splits the input string on every run of characters that are not among your desired characters, which is accomplished with a negated character class.
I would suggest a regex split, followed by a regex replacement:
String mebo = "1323 99BIMCP 1 2 BMWQ-CSPS-D1, 0192, '29229'";
String[] parts = mebo.split("\\s*,?\\s+");
for (String part : parts) {
System.out.println(part.replaceAll("[']", ""));
}
1323
99BIMCP
1
2
BMWQ-CSPS-D1
0192
29229
The logic here is to split on whitespace, possibly including a comma separator. Then, we can do a regex replacement cleanup to remove stray characters such as single quotes. Double quotes and any other unwanted characters can easily be added to the character class used for replacement.
In general, regex alone may not suffice here, and you may need a parser to cover every edge case. Case in point, consider the following input line:
One, "Two or more", Three
My answer fails here, because it blindly splits on whitespace and does not know that whitespace inside quotes is not a delimiter. A regex would also fail here.

Regex add space between all punctuation

I need to add spaces between all punctuation in a string.
\\ "Hello: World." -> "Hello : World ."
\\ "It's 9:00?" -> "It ' s 9 : 00 ?"
\\ "1.B,3.D!" -> "1 . B , 3 . D !"
I think a regex is the way to go, matching all non-punctuation [a-zA-Z\\d]+, adding a space before and/or after, then extracting the remainder matching all punctuation [^a-zA-Z\\d]+.
But I don't know how to (recursively?) call this regex. Looking at the first example, the regex will only match the "Hello". I was thinking of just building a new string by continuously removing and appending the first instance of the matched regex, while the original string is not empty.
private String addSpacesBeforePunctuation(String s) {
StringBuilder builder = new StringBuilder();
final String nonpunctuation = "[a-zA-Z\\d]+";
final String punctuation = "[^a-zA-Z\\d]+";
String found;
while (!s.isEmpty()) {
// regex stuff goes here
found = ???; // found group from respective regex goes here
builder.append(found);
builder.append(" ");
s = s.replaceFirst(found, "");
}
return builder.toString().trim();
}
However, this doesn't feel like the right way to go... I think I'm overcomplicating things...
You can use a lookaround-based regex with the punctuation class \p{Punct} in Java:
str = str.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
(?<=\\S) asserts that the previous char is not whitespace
(?<=\\p{Punct}) asserts that the previous char is a punctuation char
(?=\\p{Punct}) asserts that the next char is a punctuation char
(?=\\S) asserts that the next char is not whitespace
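To check this against the three examples from the question, here is a minimal sketch that wraps the replaceAll call in a method. The class name PunctuationSpacer and the method addSpaces are hypothetical names for illustration.

```java
public class PunctuationSpacer {
    // Inserts a space at every position between two non-whitespace chars
    // where at least one of the two neighbours is punctuation.
    static String addSpaces(String s) {
        return s.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
    }

    public static void main(String[] args) {
        System.out.println(addSpaces("Hello: World."));
        System.out.println(addSpaces("It's 9:00?"));
        System.out.println(addSpaces("1.B,3.D!"));
    }
}
```

The regex matches only zero-width positions, so nothing is removed from the input; spaces are only added.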
When you see a punctuation mark, you have four possibilities:
1. Punctuation is surrounded by spaces
2. Punctuation is preceded by a space
3. Punctuation is followed by a space
4. Punctuation is neither preceded nor followed by a space.
Here is code that does the replacement properly:
String ss = s
.replaceAll("(?<=\\S)\\p{Punct}", " $0")
.replaceAll("\\p{Punct}(?=\\S)", "$0 ");
It uses two expressions: the first handles case 3 (no space before the punctuation) and the second handles case 2 (no space after it). Since the expressions are applied one after the other, together they take care of case 4 as well. Case 1 requires no change.
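The four possibilities listed above can be exercised with a short sketch. The class name TwoPassSpacer, the method addSpaces and the sample inputs built around a comma are hypothetical and chosen only for illustration.

```java
public class TwoPassSpacer {
    static String addSpaces(String s) {
        return s
            // pass 1: add a space before punctuation that follows a non-space
            .replaceAll("(?<=\\S)\\p{Punct}", " $0")
            // pass 2: add a space after punctuation that precedes a non-space
            .replaceAll("\\p{Punct}(?=\\S)", "$0 ");
    }

    public static void main(String[] args) {
        // surrounded by spaces, preceded by a space, followed by a space, neither
        String[] inputs = { "a , b", "a ,b", "a, b", "a,b" };
        for (String input : inputs) {
            System.out.println(addSpaces(input));
        }
    }
}
```

All four inputs should come out as "a , b", which is the point of applying the two passes in sequence.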

Java regex to find delimiters with release character

I'm trying to get behavior similar to what \ has in Java String literals. If there are two of them, it's a literal \; otherwise it "escapes" whatever follows. So if a delimiter follows a single release char, it doesn't count. But two release chars resolve to a release char literal, so the delimiter that follows should be considered a delimiter. In other words, if an odd number of release chars precedes a delimiter, it's ignored; for zero or an even number, it's a delimiter. In the examples below, the release char is ? and the delimiter is :
?: <- : is not a delimiter
??: <- : is a delimiter
???: <- : is not a delimiter
????: <- : is a delimiter
Here's sample code showing what doesn't work.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TestPattern
{
public static void main(final String[] args)
{
final Matcher m = Pattern.compile("(\\?\\?)*[^\\?]\\:").matcher("a??:b:c");
m.find(0);
System.out.println(m.end());
}
}
The following should work
\b(\?{2})*:
The * means there can be zero of that group. So that capturing group can be the empty string. [^\\?] can be any character that isn't a ?, as ? is not a special character inside a character class. The \ is ignored.
Therefore, b: (with an empty string preceding it) matches, and the second colon is your last (and, in this case, first) match.
I think you simply want "(\\?\\?)*\\?:".
Your regex means:
Zero or more '??'
(\\?\\?)*
Followed by not '?'
[^\\?]
Ending in ':'
\\:
So, your last match is the last colon. That's why the result offset is 6.
You could change it to:
final Matcher m = Pattern.compile("((\\?){2})+").matcher("a??:b:????:c");
while (m.find()) {
    // prints 1 and 6, the positions where
    // you would have to start
    // escaping...
    System.out.println(m.start());
}
It appears that it works just by reversing the regex. Putting the "don't match a ?" first, and then the "any even number of ?'s", seems to do the trick:
[^?](\\?\\?)*:
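Because the leading [^?] consumes one character, that pattern cannot see a delimiter at the very start of the input or directly after another delimiter. The sketch below is a variation, not the pattern from the answer: it swaps [^?] for a negative lookbehind (?<!\?), which checks the preceding character without consuming it. The class name DelimiterFinder and the helper delimiterIndexes are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterFinder {
    // (?<!\?)  the run of release chars must not be preceded by another '?'
    // (\?\?)*  zero or more pairs of '?', i.e. an even count
    // :        the delimiter itself
    static final Pattern DELIMITER = Pattern.compile("(?<!\\?)(\\?\\?)*:");

    // Hypothetical helper: index of every ':' preceded by zero or an even number of '?'.
    static List<Integer> delimiterIndexes(String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = DELIMITER.matcher(input);
        while (m.find()) {
            indexes.add(m.end() - 1); // the ':' is the last matched character
        }
        return indexes;
    }

    public static void main(String[] args) {
        System.out.println(delimiterIndexes("a??:b:c"));
        System.out.println(delimiterIndexes("a?:b"));
        System.out.println(delimiterIndexes("a::b"));
    }
}
```

For "a??:b:c" this reports the colons at indexes 3 and 5, and for "a?:b" it reports none.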

regex delete heading and tailing punctuation

I am trying to write a regex in Java to get rid of all leading and trailing punctuation characters except for "-" in a String, while keeping the punctuation within words intact.
I tried to replace the punctuation with "" using String regex = "[\\p{Punct}+&&[^-]]"; but it deletes the punctuation within the word too.
I also tried to match a pattern, String regex = "[(\\w+\\p{Punct}+\\w+)]"; with Matcher.matches() to match a group, but it gives me null for input String word = "#(*&wor(&d#)("
I am wondering what the right way is to deal with regex group matching in this case.
Examples:
Input: #)($&word#)($& Output: word
Input: #)($)word#google.com#)(*$&$ Output: word#google.com
Pattern p = Pattern.compile("^\\p{Punct}*(.*?)\\p{Punct}*$");
Matcher m = p.matcher("#)($)word#google.com#)(*$&$");
if (m.matches()) {
System.out.println(m.group(1));
}
To give some more info, the key is to have marks for the beginning and end of the string in the regex (^ and $) and to have the middle part match non-greedily (using *? instead of just *).
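The pattern above strips every leading and trailing punctuation character, including "-". As a minimal sketch of one way to keep hyphens, assuming the intersection class [\p{Punct}&&[^-]] from the question is what "except for -" should mean, the anchors can be combined with replaceAll. The class name TrimPunctuation and the method trim are hypothetical.

```java
public class TrimPunctuation {
    // Removes a leading run and a trailing run of punctuation other than '-'.
    // Punctuation inside the word is untouched because of the ^ and $ anchors.
    static String trim(String word) {
        return word.replaceAll("^[\\p{Punct}&&[^-]]+|[\\p{Punct}&&[^-]]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(trim("#)($&word#)($&"));
        System.out.println(trim("#)($)word#google.com#)(*$&$"));
        System.out.println(trim("#-word-#"));
    }
}
```

Note that a hyphen ends the run being stripped, so punctuation on the inner side of a leading or trailing hyphen is left in place.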
