Split on capitalized words not between underscores - java

Given the following string: ThisIsA_SimpleTest_Case
I want to split on all capitalized words not between underscores and on the first underscore of a string between underscores.
The expected split result: This Is A SimpleTest Case
I came up with the following non-working regex, for the Java regex flavor:
(?=_[a-zA-Z]*_|[A-Z])
But this of course doesn't work, since it's an OR and not an AND. It also splits on all capitalized words within underscores, which is something I want to ignore.

Wiktor is right, it should be easier to try to match instead of splitting on what you don't want.
But because it's a fun challenge, I got one that will split it like you wanted.
_|(?<!_)(?=[A-Z])(?=[^_]*(?:_[^_]*_[^_]*)*[^_]*$)
Also works with multiple pairs of underscores.
(It can certainly be improved, I might try to simplify it)
The idea is:
_| Split on any underscore, removing it from the final list.
(?<!_) Not right after an underscore. Without this, you might get empty matches after the split (those positions are already handled by the _|). It can be skipped if you don't care.
(?=[A-Z]) Split before capital letters.
(?=[^_]*(?:_[^_]*_[^_]*)*[^_]*$) But the position must be followed by an even number of underscores. If the number is odd, the position is between two underscores and should not be split. I assume the whole string can't contain an odd number of underscores.
Test at https://regex101.com/r/Iov1Yl/1/
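As a quick check of the pattern above, here is a minimal sketch in Java. The class and method names are made up for illustration, and it assumes Java 8 or later, where a zero-width match at the very start of the input does not produce a leading empty string.

```java
import java.util.Arrays;
import java.util.List;

class UnderscoreAwareSplitDemo {
    // Regex quoted from the answer: split on "_" itself, or before a capital
    // that is not right after "_" and has an even number of "_" to its right.
    static final String SPLIT_REGEX =
            "_|(?<!_)(?=[A-Z])(?=[^_]*(?:_[^_]*_[^_]*)*[^_]*$)";

    static List<String> splitWords(String input) {
        return Arrays.asList(input.split(SPLIT_REGEX));
    }

    public static void main(String[] args) {
        // prints [This, Is, A, SimpleTest, Case]
        System.out.println(splitWords("ThisIsA_SimpleTest_Case"));
    }
}
```

The capital T inside SimpleTest is left alone because exactly one underscore follows it, so the even-count lookahead fails there.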

You might split on:
(?=(?<!_)[A-Z](?![A-Za-z]*_))|(?<!_[A-Za-z]{0,1000}|^)(?=[A-Z])|_
(?=(?<!_)[A-Z](?![A-Za-z]*_)) If it is a position where a char A-Z is not directly preceded by _ and has no _ at the right
| Or
(?<!_[A-Za-z]{0,1000}|^)(?=[A-Z]) If it is a position that is not preceded by an underscore followed only by letters, is not the start of the string, and has a char A-Z directly to the right
| Or
_ Match an underscore
Example code
String regex = "(?=(?<!_)[A-Z](?![A-Za-z]*_))|(?<!_[A-Za-z]{0,1000}|^)(?=[A-Z])|_";
String str = "ThisIsA_SimpleTest_Case";
String[] parts = str.split(regex);
for (String part : parts) {
    System.out.println(part);
}
Output
This
Is
A
SimpleTest
Case

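Another approach is to transform the string first and split afterwards:
import java.util.Arrays;

public static void main(String[] args) {
    String input = "ThisIsA_SimpleTest_Case";
    // Replace the leading underscore with "," and drop the trailing one;
    // "#" shields the lowercase-to-uppercase boundary inside the wrapped part.
    String inputReplace1 = input.replaceAll("_(\\w+[a-z])([A-Z]\\w+)_", ",$1#$2");
    // Insert "," at every remaining lowercase-to-uppercase boundary.
    String inputReplace2 = inputReplace1.replaceAll("(?<=[a-z])(?=[A-Z])", ",");
    // Remove the shield.
    String inputReplace3 = inputReplace2.replaceAll("#", "");
    System.out.println(Arrays.asList(inputReplace3.split(",")));
}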
Output:
[This, Is, A, SimpleTest, Case]

Related

Splitting characters

My characters are "!,;,%,#,**,**,(,)", which I get from XML. When I split on ',', I lose the ',' element.
How can I avoid that?
I have already tried changing the comma to '&#002C', but it does not work.
The result I want is "!,;,%,#,,,(,)", not "!,;,%,#,,(,)".
String::split takes a regex, so you can split with ((?<!,),|,(?!,)) like this:
String string = "!,;,%,#,,,(,)";
String[] split = string.split("((?<!,),|,(?!,))");
Details
(?<!,), match a comma if not preceded by a comma
| or
,(?!,) match a comma if not followed by a comma
Outputs
!
;
%
#
,
(
)
If you are trying to extract all characters from a string, you can do so by using String.toCharArray() [1]:
String str = "sample string here";
char[] char_array = str.toCharArray();
If you just want to iterate over the characters in the string, you can use the character array obtained from the above method, or use a for loop and str.charAt(i) [2] to access the character at position i.
[1] https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#toCharArray()
[2] https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#charAt(int)
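A small sketch of both access styles follows. The class and the helper countChar are hypothetical names used only for this illustration; toCharArray() and charAt(int) are the String methods referenced above.

```java
class CharIterationDemo {
    // Counts occurrences of target by reading characters in place with charAt.
    static int countChar(String str, char target) {
        int count = 0;
        for (int i = 0; i < str.length(); i++) {
            if (str.charAt(i) == target) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "!,;,%,#,,,(,)";
        // toCharArray copies the characters into a new array.
        char[] chars = str.toCharArray();
        System.out.println(chars.length);        // prints 13
        System.out.println(countChar(str, ',')); // prints 7
    }
}
```

Note that neither method separates delimiter commas from the comma that is meant as data, so this only helps when every character is wanted.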
Try this; it could be helpful. First I replace the ',' element with a placeholder string and then split. After the split, the placeholder is replaced with ',' again.
public static void main(String[] args) {
String str = "!,;,%,#,**,**,(,)";
System.out.println(str);
str = str.replace("**,**","**/!/**");
String[] array = str.split(",");
System.out.println(Arrays.stream(array).map(s -> s.replace("**/!/**", ",")).collect(Collectors.toList()));
}
Output
!,;,%,#,**,**,(,)
[!, ;, %, #, ,, (, )]
First, we need to define when the comma is an actual delimiter, and when it is part of a character sequence.
We need to assume that a sequence of commas surrounded by commas is an actual character sequence we want to capture. It can be done with lookarounds:
String s = "!,;,,,%,#,**,**,,,,(,)";
List<String> list = Arrays.asList(s.split(",(?!,)|(?<!,),"));
This regular expression splits by a comma that is either preceded by something that is not a comma, or followed by something that is not a comma.
Note that your format string, in which every character sequence is separated by a comma, is a bad design: it has to allow a comma as a sequence and also allow sequences of more than one character. That means the two can be combined!
What, for example, if I want to use these two character sequences:
,
,,,,
Then I construct the formatting string like this: ,,,,,,. It is now unclear whether , and ,,,, should be character sequences, or ,, and ,,,.
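The lookaround split from this answer, and the ambiguity it warns about, can be tried with a minimal sketch like the following. The class and method names are invented for illustration; the regex is the one quoted in the answer.

```java
import java.util.Arrays;
import java.util.List;

class CommaSequenceSplitDemo {
    // A comma is a delimiter when it is not followed by a comma
    // or not preceded by a comma.
    static final String DELIMITER = ",(?!,)|(?<!,),";

    static List<String> split(String s) {
        return Arrays.asList(s.split(DELIMITER));
    }

    public static void main(String[] args) {
        // prints [!, ;, %, #, ,, (, )]
        System.out.println(split("!,;,%,#,,,(,)"));
        // Six commas: the regex settles on one reading of the input,
        // an empty string followed by a four-comma run.
        // prints [, ,,,,]
        System.out.println(split(",,,,,,"));
    }
}
```

The second call shows that the regex cannot recover which two sequences were intended; it simply treats the outermost commas as delimiters.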

Split String end with special characters - Java

I have a string which I want to first split by space, and then separate the words from the special characters.
For Example, let's say the input is:
Hi, How are you???
I already wrote the logic to split by space here:
String input = "Hi, How are you???";
String[] words = input.split("\\s+");
Now, I want to separate each word from the special character.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
If the string does not end with any special characters, just ignore it.
Can you please help me with the regular expression and code for this in Java?
Following regex should help you out:
(\s+|[^A-Za-z0-9]+)
This is not written as a Java string literal, so in Java you need to escape the backslash (write \\s instead of \s).
It matches on whitespaces \s+ and on strings of characters consisting not of A-Za-z0-9. This is a workaround, since there isn't (or at least I do not know of) a regex for special characters.
You can test this regex here.
If you use this regex with the split function, it will return the words, not the special characters and whitespace it matched on.
UPDATE
According to this answer here on SO, Java has \P{Alpha}+, which matches a run of non-alphabetic characters. So you could try:
(\s|\P{Alpha})+
I want to separate each word from the special character.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
A regex to achieve the above behavior:
String stringToSearch ="Hi, you???";
Pattern p1 = Pattern.compile("[a-z]{0}\\b");
String[] str = p1.split(stringToSearch);
System.out.println(Arrays.asList(str));
output:
[Hi, , , you, ???]
@mike is right... we need to split the sentence on the special characters, which leaves just the words. Here is the code:
public static void main(String[] args) {
    String match = "Hi, How are you???";
    String[] words = match.split("\\P{Alpha}+");
    for (String word : words) {
        System.out.print(word + " ");
    }
}
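If the punctuation runs need to be kept as tokens too, as the question asks, one option is to match tokens instead of splitting. This is a sketch of that swapped-in technique, not something taken from the answers above; the class name and pattern are assumptions. \p{Alpha} is ASCII-only by default in Java, so digits and non-ASCII letters would land in the punctuation branch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPunctuationTokenizer {
    // Either a run of letters, or a run of characters that are
    // neither letters nor whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{Alpha}+|[^\\p{Alpha}\\s]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Hi, ,, How, are, you, ???]
        System.out.println(tokenize("Hi, How are you???"));
    }
}
```

Whitespace is never matched by either branch, so it is skipped without a separate split step.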

How can I escape the \s (space char) from a string?

What I need is to escape each word in a string and escape each special char like: !,?._'#. What I've tried is this:
public class Solution
{
public static void main(String[] args)
{
Scanner scan = new Scanner(System.in);
Pattern pat = Pattern.compile("[!|,|?|.|_|'|#]");
String a = scan.nextLine();
scan.close();
String[] part = pat.split(a);
System.out.println(part.length);
for(String p: part)
System.out.println(p);
}
}
While this does escape the special characters, I can't manage to find a way to have the regex match the spaces between each word.
Also, I've tried using \s and \\s after the regex.
For input like: The dog is a very lazy dog, isn't he?
output should be:
The
dog
is
a
very
lazy
dog
isn
t
he
[..] is a character class, which describes the range of a single character, not two characters (we can allow repetition with quantifiers like + * {min,max}, but that is not the case here).
Also, you don't need | inside [..], because there it is a plain character, not the OR operator. So [a|b] doesn't mean a OR b; it matches any one of the characters a, | or b (so a further | such as in |c just adds | again, plus c).
Based on example you provided, you may be looking for:
Pattern pat = Pattern.compile("[!,?._'#\\s]+");
or since this may be more readable
Pattern pat = Pattern.compile("([!,?._'#]|\\s)+");
You would need to use the OR operator | outside of [..] and write \s as "\\s", since \ is also a special character in String literals (it can be used, for instance, to create the tab character \t), so it requires escaping.
I wrapped the entire expression in (..) to create a group that represents all your delimiters. This allowed me to use + (the quantifier meaning "one or more occurrences"), so now your regex sees ,. as a single delimiter for split. This ensures one split on a whole run of consecutive delimiters, rather than a split on each of them separately. So instead of "a,.b" -> ["a", "", "b"] we now get ["a", "b"].
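The difference the + quantifier makes can be seen in a short sketch. The class and constant names are made up for this example; the character class is the one suggested above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DelimiterRunSplitDemo {
    // One delimiter character per split.
    static final Pattern SINGLE = Pattern.compile("[!,?._'#\\s]");
    // A whole run of delimiter characters per split.
    static final Pattern RUN = Pattern.compile("[!,?._'#\\s]+");

    public static void main(String[] args) {
        // prints [a, , b] because an empty string sits between "," and "."
        System.out.println(Arrays.asList(SINGLE.split("a,.b")));
        // prints [a, b]
        System.out.println(Arrays.asList(RUN.split("a,.b")));
        // prints [The, dog, is, a, very, lazy, dog, isn, t, he]
        System.out.println(Arrays.asList(
                RUN.split("The dog is a very lazy dog, isn't he?")));
    }
}
```

The trailing "?" does not add an empty element, since split drops trailing empty strings by default.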

How to split comma-separated string but exclude some words containing comma in Java

Assume that we have below string:
"test01,test02,test03,exceptional,case,test04"
What I want is to split the string into string array, like below:
["test01","test02","test03","exceptional,case","test04"]
How can I do that in Java?
This negative lookaround regex should work for you:
(?<!exceptional),|,(?!case)
Java Code:
String[] arr = str.split("(?<!exceptional),|,(?!case)");
Explanation:
This regex matches a comma if either of these two conditions is met:
the comma is not preceded by the word exceptional, checked with the negative lookbehind (?<!exceptional)
the comma is not followed by the word case, checked with the negative lookahead (?!case)
That effectively disallows splitting on comma when it is surrounded by exceptional and case on either side.
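A minimal sketch to try the lookaround regex from this answer; the class and method names are hypothetical. The second call illustrates that the comma is protected only when both neighbours match.

```java
import java.util.Arrays;
import java.util.List;

class ExceptionalCaseSplitDemo {
    // Split on a comma unless it sits between "exceptional" and "case".
    static final String REGEX = "(?<!exceptional),|,(?!case)";

    static List<String> split(String s) {
        return Arrays.asList(s.split(REGEX));
    }

    public static void main(String[] args) {
        // prints [test01, test02, test03, exceptional,case, test04]
        System.out.println(
                split("test01,test02,test03,exceptional,case,test04"));
        // Only one side matches here, so the comma still splits.
        // prints [exceptional, other]
        System.out.println(split("exceptional,other"));
    }
}
```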
@anubhava's answer is great; use it. For completeness, here's a general solution that is applicable to many situations and uses a beautifully simple regex:
exceptional,case|(,)
The left side of the alternation | matches complete exceptional,case. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left. We then replace these commas by something distinctive, and split on that string.
This program shows how to use the regex:
String subject = "somethingelse,case,test02,test03,exceptional,case,test04,exceptional,notcase";
Pattern regex = Pattern.compile("exceptional,case|(,)");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "##SplitHere##");
else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("##SplitHere##");
for (String split : splits) System.out.println(split);
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
How can Java know that exceptional,case is a single word that should not be split?
If there were some other recurring character, such as quotes, you could split on that.
For example, if the input were
"test01","test02","test03","exceptional,case","test04"
you could split it using ","
So in your case it is not possible, unless you use a regular expression.
Here's a dead-simple answer, don't know why I didn't think of it yesterday:
(?<!exceptional(?=,case)),
Explanation
A comma (the last character of the regex) that is not preceded by exceptional followed by ,case
String s1 = "test01.test02.test03.{i}.case.test04.test03.{i}.test03.{i}.test03.{i}";
String[] arr1 = s1.split("(?<!)\\.|\\.(?!\\{i})");
Output:
test01
test02
test03.{i}
case
test04
test03.{i}
test03.{i}
test03.{i}
You probably want to use split()
Like this:
String[] array = "test01,test02,test03,exceptional,case,test04".split(",");

RegEx to split camelCase or TitleCase (advanced)

I found a brilliant RegEx to extract the part of a camelCase or TitleCase expression.
(?<!^)(?=[A-Z])
It works as expected:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
For example with Java:
String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words now equals new String[]{"lorem", "Ipsum"}
My problem is that it does not work in some cases:
Case 1: VALUE -> V / A / L / U / E
Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext
To my mind, the result should be:
Case 1: VALUE
Case 2: eclipse / RCP / Ext
In other words, given n uppercase chars:
if the n chars are followed by lower case chars, the groups should be: (n-1 chars) / (n-th char + lower chars)
if the n chars are at the end, the group should be: (n chars).
Any idea on how to improve this regex?
The following regex works for all of the above examples:
public static void main(String[] args)
{
for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
System.out.println(w);
}
}
It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".
The first part of the regex on its own fails on "eclipseRCPExt" by failing to split between "RCP" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.
It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:
(?<=[a-z])(?=[A-Z])
Here is how this regex splits your example data:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCPExt
The only difference from your desired output is with the eclipseRCPExt, which I would argue is correctly split here.
Addendum - Improved version
Note: This answer recently got an upvote and I realized that there is a better way...
By adding a second alternative to the above regex, all of the OP's test cases are correctly split.
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
Here is how the improved regex splits the example data:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCP / Ext
Edit:20130824 Added improved version to handle RCPExt -> RCP / Ext case.
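For reference, a minimal Java sketch that runs the improved regex over the question's test cases. The class and method names are invented for illustration; the pattern is the one given above.

```java
import java.util.Arrays;
import java.util.List;

class CamelCaseSplitDemo {
    // Split between lower and upper case, or between an upper-case letter
    // and an upper-case letter that starts a new word.
    static final String REGEX =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    static List<String> split(String s) {
        return Arrays.asList(s.split(REGEX));
    }

    public static void main(String[] args) {
        String[] samples = {
            "value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"
        };
        for (String sample : samples) {
            // e.g. eclipseRCPExt -> [eclipse, RCP, Ext]
            System.out.println(sample + " -> " + split(sample));
        }
    }
}
```

Because both alternatives use a positive lookbehind that needs a preceding character, no split can occur at the start of the string.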
Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase
I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own that I've tested and seems to do exactly what you're looking for:
((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))
and here's an example of using it:
; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
; (^[a-z]+) Match against any lower-case letters at the start of the string.
; ([A-Z]{1}[a-z]+) Match against Title case words (one upper case followed by lower case letters).
; ([A-Z]+(?=([A-Z][a-z])|($))) Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)
Here I'm separating each word with a space, so here are some examples of how the string is transformed:
ThisIsATitleCASEString => This Is A Title CASE String
andThisOneIsCamelCASE => and This One Is Camel CASE
This solution above does what the original post asks for, but I also needed a regex to find camel and pascal strings that included numbers, so I also came up with this variation to include numbers:
((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))
and an example of using it:
; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
; (^[a-z]+) Match against any lower-case letters at the start of the command.
; ([0-9]+) Match against one or more consecutive numbers (anywhere in the string, including at the start).
; ([A-Z]{1}[a-z]+) Match against Title case words (one upper case followed by lower case letters).
; ([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))) Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)
And here are some examples of how a string with numbers is transformed with this regex:
myVariable123 => my Variable 123
my2Variables => my 2 Variables
The3rdVariableIsHere => The 3 rdVariable Is Here
12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too
To handle more letters than just A-Z:
s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");
Either:
Split after any lowercase letter, that is followed by uppercase letter.
E.g parseXML -> parse, XML.
or
Split after any letter, that is followed by upper case letter and lowercase letter.
E.g. XMLParser -> XML, Parser.
In more readable form:
// Assumes JUnit 4; adjust the imports for JUnit 5.
import static java.util.stream.Collectors.joining;
import static org.junit.Assert.assertEquals;

import java.util.regex.Pattern;
import org.junit.Test;

public class SplitCamelCaseTest {
    static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
    static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";
    static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
        BETWEEN_LOWER_AND_UPPER + "|" + BEFORE_UPPER_AND_LOWER
    );

    public static String splitCamelCase(String s) {
        return SPLIT_CAMEL_CASE.splitAsStream(s)
            .collect(joining(" "));
    }

    @Test
    public void testSplitCamelCase() {
        assertEquals("Camel Case", splitCamelCase("CamelCase"));
        assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
        assertEquals("XML Parser", splitCamelCase("XMLParser"));
        assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
        assertEquals("VALUE", splitCamelCase("VALUE"));
    }
}
Brief
Both top answers here provide code using positive lookbehinds, which are not supported by all regex flavours. The regex below will capture both PascalCase and camelCase and can be used in multiple languages.
Note: I do realize this question is regarding Java, however, I also see multiple mentions of this post in other questions tagged for different languages, as well as some comments on this question for the same.
Code
([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)
Results
Sample Input
eclipseRCPExt
SomethingIsWrittenHere
TEXTIsWrittenHERE
VALUE
loremIpsum
Sample Output
eclipse
RCP
Ext
Something
Is
Written
Here
TEXT
Is
Written
HERE
VALUE
lorem
Ipsum
Explanation
Match one or more uppercase alpha character [A-Z]+
Or match zero or one uppercase alpha character [A-Z]?, followed by one or more lowercase alpha characters [a-z]+
Ensure what follows is an uppercase alpha character [A-Z] or word boundary character \b
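Since the question is about Java, here is a hedged sketch of how the matching regex above could be used with Matcher.find() instead of split. The class and method names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CamelCaseMatchDemo {
    // A run of capitals, or an optional capital plus lower-case letters,
    // that ends right before a capital or at a word boundary.
    private static final Pattern WORD =
            Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [eclipse, RCP, Ext]
        System.out.println(words("eclipseRCPExt"));
        // prints [TEXT, Is, Written, HERE]
        System.out.println(words("TEXTIsWrittenHERE"));
    }
}
```

The lookahead is what makes the greedy [A-Z]+ give back its last capital when that capital starts the next word, as in RCP followed by Ext.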
You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.
You can use the expression below for Java:
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)
Instead of looking for separators that aren't there, you might also consider finding the name components (those are certainly there):
String test = "_eclipse福福RCPExt";
Pattern componentPattern = Pattern.compile("_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)", Pattern.COMMENTS);
Matcher componentMatcher = componentPattern.matcher(test);
List<String> components = new LinkedList<>();
int endOfLastMatch = 0;
while (componentMatcher.find()) {
// matches should be consecutive
if (componentMatcher.start() != endOfLastMatch) {
// do something horrible if you don't want garbage in between
// we're lenient though, any Chinese characters are lucky and get through as group
String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
components.add(startOrInBetween);
}
components.add(componentMatcher.group(1));
endOfLastMatch = componentMatcher.end();
}
if (endOfLastMatch != test.length()) {
String end = test.substring(endOfLastMatch);
components.add(end);
}
System.out.println(components);
This outputs [eclipse, 福福, RCP, Ext]. Conversion to an array is of course simple.
I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels, above, works with the Microsoft flavour of regex.
I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters: ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b).
This is able to split strings such as:
DrivingB2BTradeIn2019Onwards
to
Driving B2B Trade In 2019 Onwards
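The digit-aware variant can be checked in Java with a short sketch. The class and method names are hypothetical; the pattern is the one suggested in this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CamelCaseWithDigitsDemo {
    // Capitals and digits form one kind of run, so "B2B" and "2019" stay whole.
    private static final Pattern WORD =
            Pattern.compile("([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\\b)");

    static String spaced(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            words.add(m.group(1));
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // prints Driving B2B Trade In 2019 Onwards
        System.out.println(spaced("DrivingB2BTradeIn2019Onwards"));
    }
}
```

Matching only separates the words; the original letter case of each word is preserved.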
A JavaScript Solution
/**
* howToDoThis ===> ["", "how", "To", "Do", "This"]
* @param word word to be split
*/
export const splitCamelCaseWords = (word: string) => {
if (typeof word !== 'string') return [];
return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
};
