Java split by alphabetic char creates an empty value in array

I want to split my string before every occurrence of an alphabetic character.
For example:
"s1l1e13" to an array of: ["s1","l1","e13"]
When I try this simple regex split, I get some weird results:
testStr = "s1l1e13"
Arrays.toString(testStr.split("(?=[a-z])"))
gives me the array:
["","s1","l1","e13"]
How can I create the split without the empty array element?
I tried a couple more things:
testStr = "s1"
Arrays.toString(testStr.split("(?=[a-z])"))
does return the correct array: ["s1"]
But when trying to use substring
testStr = "s1l1e13"
Arrays.toString(testStr.substring(1).split("(?=[a-z])"))
I get in return ["1","l1","e13"]
What am I missing?

Your lookahead matches at each position just before any character from a to z, marking the following positions:
s1 l1 e13
^  ^  ^
So by splitting using just the lookahead, it returns ["", "s1", "l1", "e13"]
You can use a negative lookbehind here. This looks behind to check that the position is not at the beginning of the string.
String s = "s1l1e13";
String[] parts = s.split("(?<!\\A)(?=[a-z])");
System.out.println(Arrays.toString(parts)); //=> [s1, l1, e13]

Your problem is that (?=[a-z]) means "place before [a-z]" and in your text
s1l1e13
you have 3 such places. I will mark them with |
|s1|l1|e13
so split (unfortunately, correctly) produces "" "s1" "l1" "e13" and doesn't automatically remove the leading empty element for you.
To solve this problem you have at least two options:
Make sure that there is something before the place you need to split on (so it is not at the start of your string). You can use for instance (?<=\\d)(?=[a-z]) if you want to split after a digit but before a letter.
(PREFERRED SOLUTION) Start using Java 8, which automatically removes the empty string at the start of the result array if the regex match there is zero-length (look-arounds are zero-length).
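A minimal sketch of the first option, assuming the input always alternates one lowercase letter with one or more digits. The class and method names are made up for illustration. Because the lookbehind requires a digit before each split point, position 0 can never match, whichever Java version runs it.

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    // Splits only where a digit is immediately followed by a lowercase letter.
    static String[] splitAfterDigit(String input) {
        return input.split("(?<=\\d)(?=[a-z])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAfterDigit("s1l1e13"))); // [s1, l1, e13]
        System.out.println(Arrays.toString(splitAfterDigit("s1")));      // [s1]
    }
}
```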

The first match yields "" because (?=[a-z]) is a zero-width lookahead: it only checks that a letter follows and consumes no characters. The "s" at the beginning is a letter, so the pattern matches at position 0, before any text.
If you want the regex to always consume something, you could use ".+(?=[a-z])", but note that split discards whatever the pattern consumes, so this is not a drop-in fix.

The problem is that the initial "s" counts as an alphabetic character, so the regex splits at the position just before it.
There is nothing before the s, so split represents that nothing as an empty string in the first element. (Trailing empty strings, by contrast, are dropped by split.)
If this is the only string you're splitting, or if every string you split starts with a letter, just drop the first element of the array. Otherwise, you'll probably need to loop through each array as you make it so that you can drop empty elements.

So it seems your matches have the pattern x###, where x is a letter and # is a digit.
I'd use the following regex:
([a-z][0-9]+)
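One way to apply that pattern is to collect the matches instead of splitting between them. The sketch below assumes lowercase letters and ASCII digits only; TokenMatcherDemo and tokens are hypothetical names, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenMatcherDemo {
    private static final Pattern TOKEN = Pattern.compile("[a-z][0-9]+");

    // Collects every letter-plus-digits token found in the input, in order.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("s1l1e13")); // [s1, l1, e13]
    }
}
```

Since nothing is being split, there is no empty leading element to worry about.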

Related

How to use regular expressions on an index of a String of array in Java

I am basically trying to find regular expression for a text "TC XX" where XX can be any two digit number. My piece of code is:
boolean b = DocArray[RTArrayIndex].matches("/TC \\d{2}/");
where DocArray - an array of string which is basically derived from another string separated by \t
RTArrayIndex - current index of the DocArray array.
Regular Expression - /TC \\d{2}/
The value of string at the current index is "TC 10", but still the value of "b" I am getting is false.
Another index of the array contains the string, "Refer Logs of TC 10" too, but again the value of "b" is false.
You have a few problems. First, your regex contains some "/" characters, which it is attempting to match. If you remove both of those, you will have a slightly better regex.
boolean b = DocArray[RTArrayIndex].matches("TC \\d{2}");
The regex above should match your first example, but not your second, because matches() requires the whole string to match. You need to account for leading and trailing characters. You can do this by using ".*". "." is a placeholder for any character at all, and "*" means it can be seen any number of times. If you add ".*" to the beginning and end of your pattern, any string that contains the substring "TC \d\d" will match your regex.
boolean b = DocArray[RTArrayIndex].matches(".*TC \\d{2}.*");
Remove the slashes at the beginning and the end of your regular expression, like this:
TC \\d{2}
This works for your first example. If you want all strings containing TC 10, you need to add something at the beginning and the end, like .* (which means 'anything').
The final regular expression should be:
.*TC \\d{2}.*
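As a rough illustration of the difference between whole-string and substring matching, the sketch below uses Matcher.find() as an alternative to wrapping the pattern in ".*". The class and method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    private static final Pattern TC = Pattern.compile("TC \\d{2}");

    // True only when the entire string is "TC", a space, and two digits.
    static boolean isExactly(String s) {
        return TC.matcher(s).matches();
    }

    // True when the pattern occurs anywhere inside the string.
    static boolean contains(String s) {
        return TC.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isExactly("TC 10"));               // true
        System.out.println(isExactly("Refer Logs of TC 10")); // false
        System.out.println(contains("Refer Logs of TC 10"));  // true
        // The slashes are treated as literal characters, so this fails:
        System.out.println("TC 10".matches("/TC \\d{2}/"));   // false
    }
}
```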

Java String split regexp returns empty strings with multiple delimiters

I have a problem that I can't seem to find an answer here for, so I'm asking it.
The thing is that I have a string and I have delimiters. I want to create an array of strings from the things which are between those delimiters (might be words, numbers, etc). However, if I have two delimiters next to one another, the split method will return an empty string for one of the instances.
I tested this against even more delimiters that are in succession. I found out that if I have n delimiters, I will have n-1 empty strings in the result array. In other words, if I have both "," and " " as delimiters, and the sentence "This is a very nice day, isn't it", then the array with results would be like:
{... , "day", "", "isn't" ...}
I want to get those extra empty strings out and I can't figure out how to do that. A sample regex for the delimiters that I have is:
"[\\s,.-\\'\\[\\]\\(\\)]"
Also can you explain why there are extra empty strings in the result array?
P.S. I read some of the similar posts which included information about the second parameter of split. I tried negative, zero, and positive numbers, and I didn't get the result that I'm looking for. (One of the questions had an answer saying that -1 as a parameter might solve the problem, but it didn't.)
You can use this regex for splitting:
[\\s,.'\\[\\]()-]+
Keep an unescaped hyphen at the first or last position in a character class; otherwise it is treated as a range operator, as in A-Z or 0-9.
You must use the quantifier + to match one or more delimiters.
I think your problem is just the regex itself. You should use a greedy quantifier (and escape the hyphen so it is not read as a range):
"[\\s,.\\-'\\[\\]\\(\\)]+"
See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#sum
X+ ... X, one or more times
Your regular expression describes just one single character. If you want it to match multiple separators at once, use a quantifier:
String s = "This is a very nice day, isn't it";
String[] tokens = s.split("[\\s,.\\-\\[\\]()']+");
(Note the '+' at the end of the expression)
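To see where the empty strings come from, here is a small sketch with a reduced delimiter set (whitespace and comma only, an assumption made to keep the example short). Without the quantifier, the comma and the space after it are two separate matches, and the empty string between them ends up in the result.

```java
import java.util.Arrays;

class DelimiterRunDemo {
    // Each delimiter character is its own match.
    static String[] splitSingle(String s) {
        return s.split("[\\s,]");
    }

    // A run of delimiter characters counts as one match.
    static String[] splitRuns(String s) {
        return s.split("[\\s,]+");
    }

    public static void main(String[] args) {
        String s = "nice day, isn't it";
        System.out.println(Arrays.toString(splitSingle(s))); // [nice, day, , isn't, it]
        System.out.println(Arrays.toString(splitRuns(s)));   // [nice, day, isn't, it]
    }
}
```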
If you want to get rid of empty strings, you can use the Guava project Splitter class.
on method:
Returns a splitter that uses the given fixed string as a separator.
Example (ignoring empty strings):
System.out.println(
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split("foo,bar,, qux")
);
Output:
[foo, bar, qux]
onPattern method:
Returns a splitter that considers any subsequence matching a given
pattern (regular expression) to be a separator.
Example (ignoring empty strings):
System.out.println(
Splitter
.onPattern("([,.|])")
.trimResults()
.omitEmptyStrings()
.split("foo|bar,, qux.hi")
);
Output:
[foo, bar, qux, hi]
For more details, consult Splitter documentation.

Why the space appears as sub string in this split instruction?

I have a string with spaces and some non-informative characters and substrings that need to be excluded, keeping only some important sections. I used split as below:
String myString[]={"01: Hi you look tired today? Can I help you?"};
myString=myString[0].split("[\\s+]");// Split based on any white spaces
for(int ii=0;ii<myString.length;ii++)
System.out.println(myString[ii]);
The result is :
01:
Hi
you
look
tired
today?
Can
I
help
you?
The spaces appeared after the split as substrings when the regex is "[\s+]" but disappeared when the regex is "\s+". I am confused and not able to find an answer in the related Stack Overflow pages. The link regex-Pattern made me more confused.
Please help, I am new to Java.
19/1/2015: Edit
After your valuable advice, I reached a point in my program where a conditional statement needs to be decomposed and processed. The case I have is:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\,]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is fine till now as:
01:IF
rd.h
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
with
0.4610;
My next step is to add the string "with" to the regex and get rid of this word while doing the split.
I tried it this way:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\, with]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is not perfect, because I got an unwanted extra split at every "h" letter:
01:IF
rd.
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
0.4610;
Any advice on how to specify string with mixed white spaces and separation marks?
Many thanks.
Inside square brackets, [\s+] is a character class containing the whitespace characters plus the literal plus sign. It matches only one character, so a sequence of spaces will produce many empty strings, as Todd noted, and it will also use + as a separator.
You should use \s+ (without brackets) as the separator. That means one or more whitespace characters.
myString=myString[0].split("\\s+");
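The difference shows up as soon as the input has two spaces in a row or a literal plus sign. The sketch below uses a made-up input string and hypothetical method names.

```java
import java.util.Arrays;

class WhitespaceClassDemo {
    // Character class: one whitespace character OR one literal '+'.
    static String[] classSplit(String s) {
        return s.split("[\\s+]");
    }

    // Quantified \s: a run of one or more whitespace characters.
    static String[] runSplit(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        String s = "Hi  you+me"; // note the two spaces after "Hi"
        System.out.println(Arrays.toString(classSplit(s))); // [Hi, , you, me]
        System.out.println(Arrays.toString(runSplit(s)));   // [Hi, you+me]
    }
}
```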
Your biggest problem is not understanding enough about regular expressions to write them properly. One key point you don't comprehend is that [...] is a character class, which is a list of characters any one of which can match. For example:
[abc] matches either a, b or c (it does not match "abc")
[\\s+] matches any whitespace or "+" character
[with] matches a single character that is either w, i, t or h
[.$&^?] matches those literal characters - most characters lose their special regex meaning when in a character class
To split on any number of whitespace, comma and ampersand and consume "with" (if it appears), do this:
String [] s2 = s1.split("[\\s,&]+(with[\\s,&]+)?");
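For comparison, this sketch runs both patterns on a shortened version of the rule string from the question (the shortening is only to keep the output readable). The class and method names are hypothetical.

```java
import java.util.Arrays;

class OptionalWordDemo {
    // Letters inside [...] are individual delimiter characters: w, i, t and h.
    static String[] withCharClass(String s) {
        return s.split("[\\s&, with]+");
    }

    // The word "with" is consumed only as a whole, and only between delimiters.
    static String[] withGroup(String s) {
        return s.split("[\\s,&]+(with[\\s,&]+)?");
    }

    public static void main(String[] args) {
        String s = "rd.h && v.L THEN av.VHR with 0.4610;";
        System.out.println(Arrays.toString(withCharClass(s))); // [rd., v.L, THEN, av.VHR, 0.4610;]
        System.out.println(Arrays.toString(withGroup(s)));     // [rd.h, v.L, THEN, av.VHR, 0.4610;]
    }
}
```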
You can try it easily here Online Regex and get useful comments.

Removing every other character in a string using Java regex

I have this homework problem where I need to use regex to remove every other character in a string.
In one part, I have to delete characters at index 1,3,5,... I have done this as follows:
String s = "1a2b3c4d5";
System.out.println(s.replaceAll("(.).", "$1"));
This prints 12345 which is what I want. Essentially I match two characters at a time, and replacing with the first character. I used group capturing to do this.
The problem is, I'm having trouble with the second part of the homework, where I need to delete characters at index 0,2,4,...
I have done the following:
String s = "1a2b3c4d5";
System.out.println(s.replaceAll(".(.)", "$1"));
This prints abcd5, but the correct answer must be abcd. My regex is only incorrect if the input string length is odd. If it's even, then my regex works fine.
I think I'm really close to the answer, but I'm not sure how to fix it.
You are indeed very close to the answer: just make matching the second char optional.
String s = "1a2b3c4d5";
System.out.println(s.replaceAll(".(.)?", "$1"));
// prints "abcd"
This works because:
Regex is greedy by default, it will take the second character if it's there
When the input is of odd length, the second char won't be there at the last replacement, but you'd still match one char (i.e. last char in input)
You can still use backreferences in substitution even if the group fails to match
It will substitute in the empty string, not "null"
This is different from Matcher.group(int), which returns null for failed groups
References
regular-expressions.info/Optional
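The points above can be checked with a short sketch. It assumes the input contains no line terminators, since "." does not match them by default. OptionalGroupDemo and dropEvenIndices are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // Keeps the characters at indices 1, 3, 5, ... and drops the rest.
    static String dropEvenIndices(String s) {
        return s.replaceAll(".(.)?", "$1");
    }

    public static void main(String[] args) {
        System.out.println(dropEvenIndices("1a2b3c4d5")); // abcd
        System.out.println(dropEvenIndices("1a2b3c4d"));  // abcd

        // With a single character, group 1 does not participate in the match.
        Matcher m = Pattern.compile(".(.)?").matcher("5");
        System.out.println(m.matches() + " " + m.group(1)); // true null
    }
}
```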
A closer look at the first part
Let's take a closer look at the first part of the homework:
String s = "1a2b3c4d5";
System.out.println(s.replaceAll("(.).", "$1"));
// prints "12345"
Here you didn't have to use ? for the second char, but it "works" because even though you didn't match the last char, you didn't have to! The last char can remain unmatched, unreplaced, due to the problem specification.
Now suppose that we want to delete chars at index 1,3,5..., and put the chars at index 0,2,4... in brackets.
String s = "1a2b3c4d5";
System.out.println(s.replaceAll("(.).", "($1)"));
// prints "(1)(2)(3)(4)5"
A-ha!! Now you're experiencing the exact same problem with odd-length input! You couldn't match the last char with your regex, because your regex needs two chars, but there's only one char at the end for odd-length input!
The solution, again, is to make matching the second char optional:
String s = "1a2b3c4d5";
System.out.println(s.replaceAll("(.).?", "($1)"));
// prints "(1)(2)(3)(4)(5)"
my regex is only incorrect if the input string length is odd. if it's even, then my regex works fine.
Change your expression to .(.)? The question mark makes the second character optional, which means it doesn't matter whether the input length is odd or even.
Your regex needs 2 chars to match, so fails on the final char.
This regex:
".(.{0,1})"
Will make the second char optional, so it will match with your final '5' as well

Java String split regex not working as expected

The following Java code will print "0". I would expect that this would print "4". According to the Java API String.split "Splits this string around matches of the given regular expression". And from the linked regular expression documentation:
Predefined character classes
. Any character (may or may not match line terminators)
Therefore I would expect "Test" to be split on each character. I am clearly misunderstanding something.
System.out.println("Test".split(".").length); //0
You're right: it is split on each character. However, the "split" character is not returned by this function, hence the resulting 0.
The important part in the javadoc: "Trailing empty strings are therefore not included in the resulting array. "
I think you want "Test".split(".", -1).length, but this will return 5 and not 4 (there are 5 "spaces": one before T, one between T and e, others for e-s, s-t, and the last one after the final t).
If you meant to split on a literal dot, you have to escape it with two backslashes:
Here is an example:
String[] parts = string.split("\\.",-1);
Everything is OK. "Test" is split on each character, and there is nothing left between them. If you want to iterate over each character of your string, you can use the charAt and length methods.
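The cases discussed in these answers can be compared side by side. This is only a sketch; the helper name pieces is made up, and a limit of 0 behaves the same as calling split with one argument.

```java
class SplitDotDemo {
    // Number of elements that split returns for the given regex and limit.
    static int pieces(String s, String regex, int limit) {
        return s.split(regex, limit).length;
    }

    public static void main(String[] args) {
        System.out.println(pieces("Test", ".", 0));     // 0: every piece is empty and trailing, so all are dropped
        System.out.println(pieces("Test", ".", -1));    // 5: a negative limit keeps the empty pieces
        System.out.println(pieces("Test", "\\.", 0));   // 1: no literal dot, so the string comes back whole
        System.out.println(pieces("a.b.c", "\\.", 0));  // 3
        System.out.println("Test".toCharArray().length); // 4: one entry per character
    }
}
```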
