Splitting a string using RegEx matches instead of delimiters - java

I want to split a string like "1.2 5" into the tokens {"1", ".", "2", "5"} (order matters). I tried to do this with String.split() using the regex ([0-9])\w*|\. but that pattern describes what I want to match, not the delimiters.
Is there another method that does this? Is it even possible to split two tokens that are connected while keeping both intact (e.g. splitting "1.2" as in the example above)?
More examples:
"1 2 8" => {"1", "2", "8"}
"1 122 .8" => {"1", "122", ".", "8"}
"1 2.800" => {"1", "2", ".", "800"}

This regex should work:
s.split("(?=\\.)(?<! )|(?<=\\.)| +")
It works by splitting at places in the string where:
the next character is a literal . (lookahead) and the preceding character is not a space (negative lookbehind)
the preceding character is a literal . (lookbehind)
there are one or more space characters
Java's split method removes any matching part of the string. The lookahead/lookbehind matches are zero-width, so split doesn't actually consume any of the string when splitting on them. A zero-width match just marks a position in the string to split at.
This solution works for all your given examples, and it also works for multiple spaces.
In response to your comment about the (?<! ) part of the regex: without that part, the pattern matches every space character, the position before every ., and the position after every .. One of your examples had a space followed by a . (e.g. "2 .8"), which would split like this:
["2", "", ".", "8"]
Note the empty string in the 2nd position. This is because it has split on the space, and then found a position before a ., and split there too. The (?<! ) prevents this by saying "only split before a . if it's not preceded by a space character".
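The behaviour described above can be checked with a small sketch. The class and method names here are made up for illustration; only the two regexes come from the answer.

```java
import java.util.Arrays;

final class LookaroundSplitDemo {
    // Split before a '.' (unless a space precedes it), after a '.', or on runs of spaces.
    static String[] tokenize(String s) {
        return s.split("(?=\\.)(?<! )|(?<=\\.)| +");
    }

    // Same pattern without the (?<! ) guard, to show the extra empty token.
    static String[] tokenizeWithoutGuard(String s) {
        return s.split("(?=\\.)|(?<=\\.)| +");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("1.2 5")));            // [1, ., 2, 5]
        System.out.println(Arrays.toString(tokenize("1 122 .8")));         // [1, 122, ., 8]
        System.out.println(Arrays.toString(tokenizeWithoutGuard("2 .8"))); // [2, , ., 8]
    }
}
```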

You don't need regex matching; Java has a built-in StringTokenizer that is meant for exactly this.
Try this:
StringTokenizer st = new StringTokenizer("1.2 5", ". ");
while(st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
Output:
1
2
5
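EDIT: if you want to include the delimiters, use new StringTokenizer(string, delimiters, true); the third argument is returnDelims. Each delimiter is then returned as a token of its own, so the dot shows up in the output (and so does the space between 2 and 5, omitted here):
1
.
2
5
If you just want to return the dot, but not the space, skip the space token in the loop.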
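A minimal sketch of that last suggestion, assuming the tokens should be collected into a list rather than printed. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class TokenizerDemo {
    static List<String> tokenize(String s) {
        // Third argument (returnDelims) = true: delimiters come back as one-character tokens.
        StringTokenizer st = new StringTokenizer(s, ". ", true);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!token.equals(" ")) { // keep "." but drop the space delimiter
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1.2 5"));    // [1, ., 2, 5]
        System.out.println(tokenize("1 122 .8")); // [1, 122, ., 8]
    }
}
```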

I'd rather match the tokens directly: runs of digits with \d+ and runs of non-digit, non-whitespace symbols with [^\d\s]+:
String s = "1.2 5";
Pattern pattern = Pattern.compile("\\d+|[^\\d\\s]+");
Matcher matcher = pattern.matcher(s);
List<String> lst = new ArrayList<>();
while (matcher.find()){
lst.add(matcher.group(0));
}
System.out.println(lst); // => [1, ., 2, 5]
Pattern details:
\d+ - 1 or more digits
| - or
[^\d\s]+ - one or more chars other than a whitespace or digit

Related

Java replace strings between two commas

String s = "9,3,5,*****,1,2,3";
I'd like to access the "5", which sits between two commas right before "*****", and replace only this "5" with another value.
How could I do this in Java?
You can try using the following regex replacement:
String input = "9,3,5,*****,1,2,3";
input = input.replaceAll("[^,]*,\\*{5}", "X,*****");
Here is an explanation of the regex:
[^,]*, match any number of non-comma characters, followed by one comma
\\*{5} followed by five asterisks
This means to match whatever CSV term plus a comma comes before the five asterisks in your string. We then replace this with what you want, along with the five stars in the original pattern.
I'd use a regular expression with a lookahead, to find a string of digits that precedes ",*****", and replace it with the new value. The regular expression you're looking for would be \d+(?=,\*{5}) - that is, one or more digits, with a lookahead consisting of a comma and five asterisks. So you'd write
newString = oldString.replaceAll("\\d+(?=,\\*{5})", "replacement");
Here is an explanation of the regex pattern used in the replacement:
\\d+ match one or more digits, but only when
(?=,\\*{5}) we can lookahead and assert that what follows immediately
is a single comma followed by five asterisks
It is important to note that the lookahead (?=,\\*{5}) asserts but does not consume. Hence, we can ignore it with regards to the replacement.
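Both regex replacements can be tried side by side with the sketch below. The class and method names are invented for illustration, and wrapping the new value in Matcher.quoteReplacement is an addition not in the answers: it keeps any $ or \ in the value from being interpreted as a group reference.

```java
import java.util.regex.Matcher;

final class ReplaceBeforeStarsDemo {
    // Lookahead version: only the digits are matched, so only they are replaced.
    static String replaceDigits(String csv, String newValue) {
        return csv.replaceAll("\\d+(?=,\\*{5})", Matcher.quoteReplacement(newValue));
    }

    // Consuming version: the field, the comma and the asterisks are all matched,
    // so the comma and asterisks have to be written back in the replacement.
    static String replaceField(String csv, String newValue) {
        return csv.replaceAll("[^,]*,\\*{5}", Matcher.quoteReplacement(newValue + ",*****"));
    }

    public static void main(String[] args) {
        String input = "9,3,5,*****,1,2,3";
        System.out.println(replaceDigits(input, "6")); // 9,3,6,*****,1,2,3
        System.out.println(replaceField(input, "X"));  // 9,3,X,*****,1,2,3
    }
}
```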
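Taking the new value to be '6':
String str = "9,3,5,*****,1,2,3";
char newstr = '6';
int idx = str.indexOf(",*") - 1; // position of the character just before ",*"
StringBuilder sb = new StringBuilder(str);
sb.setCharAt(idx, newstr); // change only that position; replace(char, char) would change every '5'
str = sb.toString();
This only works for single-character values. If ",*" might be missing, indexOf returns -1 and the code throws an IndexOutOfBoundsException, so check for that case and handle it.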
You could split on , and then join with a , (after replacing 5 with the desired value - say X). Like,
String[] arr = "9,3,5,*****,1,2,3".split(",");
arr[2] = "X";
System.out.println(String.join(",", arr));
Which outputs
9,3,X,*****,1,2,3
You can use split() to get at the string you want to replace:
String str = "9,3,5,*****,1,2,3";
String[] myStrings = str.split(",");
String str1 = myStrings[2];
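The split-based answers hard-code index 2. A possible variation, sketched here under the assumption that the target is always the field immediately before the "*****" marker, looks the position up instead. The helper name and its parameters are hypothetical.

```java
final class SplitJoinDemo {
    // Replaces the field that comes right before the first occurrence of marker.
    static String replaceBefore(String csv, String marker, String value) {
        String[] parts = csv.split(",", -1); // -1 keeps trailing empty fields
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].equals(marker)) {
                parts[i - 1] = value;
                break;
            }
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        System.out.println(replaceBefore("9,3,5,*****,1,2,3", "*****", "X")); // 9,3,X,*****,1,2,3
    }
}
```

If the marker is absent, the input comes back unchanged.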

How to replace character between two characters?

I have strings "ABC DE", "ABC FE", "ABC RE".
How to replace the characters between ABC and E using regex?
Trying to do this with a regex and replace
str.replace((ABC )[^*](E), 'G');
If you want to replace any characters that appear between "ABC " and "E", you could accomplish this via lookarounds (a lookbehind and a lookahead) and the replaceAll() method:
String[] strings = { "ABC DE", "ABC FE", "ABC RE" };
for(int s = 0; s < strings.length; s++){
// Update each string, replacing these characters with a G
strings[s] = strings[s].replaceAll("(?<=ABC ).*(?=E)", "G");
}
Likewise, if you didn't explicitly want the space after "ABC", simply remove it from the lookbehind by using (?<=ABC).*(?=E).
You probably want to use regex replaceAll with "ABC(.*?)E" instead.
str = str.replaceAll("ABC(.*?)E", "G");
Explanation:
ABC matches the characters ABC literally (case sensitive)
1st Capturing group (.*?)
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
E matches the character E literally (case sensitive)
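One difference between the two answers is worth checking: with lookarounds only the middle part is matched, whereas "ABC(.*?)E" matches (and therefore replaces) the ABC and the E as well. The sketch below compares them; the third method, which puts the ends back through group references, is an assumption about the intent and is not taken from either answer. All names are made up.

```java
final class ReplaceBetweenDemo {
    // Lookarounds: only the text between "ABC " and "E" is matched, so only it is replaced.
    static String withLookarounds(String s) {
        return s.replaceAll("(?<=ABC ).*(?=E)", "G");
    }

    // Plain pattern: "ABC", the middle and "E" are all part of the match and all get replaced.
    static String wholeMatch(String s) {
        return s.replaceAll("ABC(.*?)E", "G");
    }

    // Capturing groups: put the matched ends back with $1 and $2.
    static String withGroups(String s) {
        return s.replaceAll("(ABC ).*?(E)", "$1G$2");
    }

    public static void main(String[] args) {
        System.out.println(withLookarounds("ABC DE")); // ABC GE
        System.out.println(wholeMatch("ABC DE"));      // G
        System.out.println(withGroups("ABC DE"));      // ABC GE
    }
}
```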

Regex add space between all punctuation

I need to add spaces between all punctuation in a string.
// "Hello: World." -> "Hello : World ."
// "It's 9:00?" -> "It ' s 9 : 00 ?"
// "1.B,3.D!" -> "1 . B , 3 . D !"
I think a regex is the way to go, matching all non-punctuation [a-zA-Z\\d]+, adding a space before and/or after, then extracting the remainder matching all punctuation [^a-zA-Z\\d]+.
But I don't know how to (recursively?) call this regex. Looking at the first example, the regex will only match the "Hello". I was thinking of just building a new string by continuously removing and appending the first instance of the matched regex, while the original string is not empty.
private String addSpacesBeforePunctuation(String s) {
StringBuilder builder = new StringBuilder();
final String nonpunctuation = "[a-zA-Z\\d]+";
final String punctuation = "[^a-zA-Z\\d]+";
String found;
while (!s.isEmpty()) {
// regex stuff goes here
found = ???; // found group from respective regex goes here
builder.append(found);
builder.append(" ");
s = s.replaceFirst(found, "");
}
return builder.toString().trim();
}
However this doesn't feel like the right way to go... I think I'm over complicating things...
You can use a lookaround-based regex with the punctuation class \p{Punct} in Java:
str = str.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
(?<=\\S) Asserts if prev char is not a white-space
(?<=\\p{Punct}) asserts a position if previous char is a punctuation char
(?=\\p{Punct}) asserts a position if next char is a punctuation char
(?=\\S) Asserts if next char is not a white-space
When you see a punctuation mark, you have four possibilities:
1. Punctuation is surrounded by spaces
2. Punctuation is preceded by a space
3. Punctuation is followed by a space
4. Punctuation is neither preceded nor followed by a space.
Here is code that does the replacement properly:
String ss = s
.replaceAll("(?<=\\S)\\p{Punct}", " $0")
.replaceAll("\\p{Punct}(?=\\S)", "$0 ");
It uses two expressions: the first handles case 3 (it adds the missing space before the punctuation) and the second handles case 2 (it adds the missing space after it). Since they are applied one after the other, together they take care of case 4 as well. Case 1 requires no change.
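Both answers can be checked against the examples from the question with a sketch like this one. The class and method names are illustrative only; the regexes are the ones quoted above.

```java
final class PunctuationSpacingDemo {
    // Single zero-width pattern: insert a space at every position that touches
    // punctuation on one side and has non-whitespace on both sides.
    static String singlePass(String s) {
        return s.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
    }

    // Two passes: first add a missing space before punctuation, then a missing space after it.
    static String twoPass(String s) {
        return s
            .replaceAll("(?<=\\S)\\p{Punct}", " $0")
            .replaceAll("\\p{Punct}(?=\\S)", "$0 ");
    }

    public static void main(String[] args) {
        System.out.println(singlePass("Hello: World.")); // Hello : World .
        System.out.println(twoPass("It's 9:00?"));       // It ' s 9 : 00 ?
        System.out.println(twoPass("1.B,3.D!"));         // 1 . B , 3 . D !
    }
}
```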

Specific Regex Pattern

I wish to take a string input from the user and extract words or numbers like so:
String problem = "I'm lo#o#king t%o ext!r$act a^ll 6 su*bs(tr]i{ngs.";
String[] solve = {"I'm", "looking", "to", "extract", "all", "6", "substrings"};
Basically, I want to extract numbers and words with complete disregard to punctuation except apostrophes. I know how to get words and strings but I can't seem to figure out this tricky part.
You could do it like this:
String s = "I'm lo#o#king t%o ext!r$act a^ll 6 su*bs(tr]i{ngs.";
String parts[] = s.replaceAll("[^\\s\\w']|(?<!\\b)'|'(?!\\b)", "").split("\\s+");
System.out.println(Arrays.toString(parts));
Output:
[I'm, looking, to, extract, all, 6, substrings]
Explanation:
[^\\s\\w'] matches any character that is not whitespace, a word character, or a single quote.
(?<!\\b)'|'(?!\\b) matches a ' that is not preceded by a word character or not followed by one, so only an apostrophe inside a word (as in I'm) is kept.
replaceAll function replaces all the matched characters with an empty string.
Finally we do splitting on the resultant string according to one or more space characters.

How can I create a Java regular expression for a comma-separated list

How can I create a Java regular expression for a comma-separated list such as the following?
(3)
(3,6)
(3 , 6 )
I tried, but it does not match anything:
Pattern.compile("\\(\\S[,]+\\)")
And how can I get the value "3", or "3" and "6", in my code from the Matcher class?
It's not clear to me exactly what your input looks like, but I doubt the pattern you're using is what you want. Your pattern will match a literal (, followed by a single non-whitespace character, followed by one or more commas, followed by a literal ).
If you want to match a number, optionally followed by a comma and another number, all surrounded by parentheses, you could try this pattern (whitespace is allowed around the comma, and only the digits are captured):
"\\(\\s*(\\d+)\\s*(?:,\\s*(\\d+))?\\s*\\)"
That should match (3), ( 3 ), ( 3, 6), etc. but not (a) or (3, a).
You can retrieve the matched digit(s) using Matcher.group; the first digit will be group 1, the second (if any) will be group 2.
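Retrieving the groups could look like the sketch below. It assumes a variant of the pattern that allows whitespace after the comma and captures only the digits, so that group 2 holds just the second number; the class name and the choice to return a list of integers are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParenListDemo {
    // Assumed pattern: "(" number [ "," number ] ")" with optional whitespace around each part.
    private static final Pattern LIST =
        Pattern.compile("\\(\\s*(\\d+)\\s*(?:,\\s*(\\d+))?\\s*\\)");

    // Returns the one or two numbers found, or an empty list if the input does not match.
    static List<Integer> parse(String input) {
        List<Integer> values = new ArrayList<>();
        Matcher m = LIST.matcher(input);
        if (m.matches()) {
            values.add(Integer.parseInt(m.group(1)));
            if (m.group(2) != null) { // group 2 is null when the optional part is absent
                values.add(Integer.parseInt(m.group(2)));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("(3)"));      // [3]
        System.out.println(parse("(3,6)"));    // [3, 6]
        System.out.println(parse("(3 , 6 )")); // [3, 6]
        System.out.println(parse("(a)"));      // []
    }
}
```

Integer.parseInt throws NumberFormatException for digit runs that do not fit in an int, so very long numbers would need a different type.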
Validation regex
You can try this meta-regex approach for clarity:
String pattern =
"< part (?: , part )* >"
.replace("<", "\\(")
.replace(">", "\\)")
.replace(" ", "\\s*")
.replace("part", "[^\\s(,)]++");
System.out.println(pattern);
/*** this is the pattern
\(\s*[^\s(,)]++\s*(?:\s*,\s*[^\s(,)]++\s*)*\s*\)
****/
The part pattern is [^\s(,)]++, i.e. one or more of anything but whitespace, parentheses and comma. This construct is called a negated character class: [aeiou] matches any of the 5 vowel letters, while [^aeiou] matches everything else (which includes consonants but also digits, symbols and whitespace).
The + repetition is also made possessive to ++ for optimization. The (?:...) construct is a non-capturing group, also for optimization.
References
regular-expressions.info/Character Class, Possessive Quantifier, Non-capturing Group
java.util.regex.Pattern
Testing and splitting
We can then test the pattern as follows:
String[] tests = {
"(1,3,6)",
"(x,y!,a+b=c)",
"( 1, 3 , 6)",
"(1,3,6,)",
"(())",
"(,)",
"()",
"(oh, my, god)",
"(oh,,my,,god)",
"([],<>)",
"( !! , ?? , ++ )",
};
for (String test : tests) {
if (test.matches(pattern)) {
String[] parts = test
.replaceAll("^\\(\\s*|\\s*\\)$", "")
.split("\\s*,\\s*");
System.out.printf("%s = %s%n",
test,
java.util.Arrays.toString(parts)
);
} else {
System.out.println(test + " no match");
}
}
This prints:
(1,3,6) = [1, 3, 6]
(x,y!,a+b=c) = [x, y!, a+b=c]
( 1, 3 , 6) = [1, 3, 6]
(1,3,6,) no match
(()) no match
(,) no match
() no match
(oh, my, god) = [oh, my, god]
(oh,,my,,god) no match
([],<>) = [[], <>]
( !! , ?? , ++ ) = [!!, ??, ++]
This uses String.split to get a String[] of all the parts after trimming the brackets out.
