Java Regex Metacharacters - java

I found this thread, and one of the users on it posted the following line of code:
String[] digits2 = number.split("(?<=.)");
I have consulted a couple of sources, like 1 and 2, to decipher what this code means, but I can't figure it out. Can anybody explain what the argument to the split() method means?
Edit: To anyone who has the same question as I had, here's another helpful link

This is a positive lookbehind. The overall expression means "the position after any character", matched without consuming anything. Essentially, if the string looks like
ABC
then the matches occur at each |, that is, after every character (split() discards the trailing empty string produced by the last one):
A|B|C|

In Java 7 and earlier, .split("") (splitting on the empty pattern) also matches the empty string at the start of the input, which produces an undesirable empty first element; since Java 8, a zero-width match at the start no longer produces that leading empty string. (?<=.) is a zero-width assertion (it does not consume any characters) that matches the empty position preceded by any character (preceded by, because it is a lookbehind). This splits at the empty position after each character, but not at the position between the start of the string and the first character.
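A minimal sketch of the lookbehind split described above. The class name, the method name splitChars and the sample input are made up for illustration; only the pattern comes from the question.

```java
import java.util.Arrays;

class LookbehindSplitDemo {
    // Splits text at every position that is preceded by a character.
    static String[] splitChars(String text) {
        return text.split("(?<=.)");
    }

    public static void main(String[] args) {
        String number = "12345"; // hypothetical sample input
        System.out.println(Arrays.toString(splitChars(number))); // prints [1, 2, 3, 4, 5]
    }
}
```

The match after the last character would yield a trailing empty string, but split() drops trailing empty strings by default.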

Related

Java Regex missing a match in the output

I am currently matching a string against a regular expression. My pattern is:
"(?<=\p{Alnum}|\p{Punct})(\p{Alnum}+\p{Punct}{1})"
I am matching it with the string:
"https://www.google.com/"
My desired result with the above regex and string is:
https:, www., google., com/
I get all the matches except 'https:'. For that one the result is 'ttps:' instead of the required 'https:'.
I am not able to understand where I went wrong. Can anyone please help me in figuring this out?
You can use
(?<![^\p{Alnum}\p{Punct}])(\p{Alnum}+\p{Punct})
See the online regex demo.
The (?<![^\p{Alnum}\p{Punct}]) negative lookbehind matches a location that is not immediately preceded by a char other than an alphanumeric or punctuation char, so the start of the string also qualifies.
Note that your regex required an alphanumeric or punctuation char immediately on the left, so it was impossible to match at the start of string position.
Note that {1} is always redundant; you can see more about regex redundancy in the "Writing cleaner regular expressions" YouTube video of mine.
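To compare the two patterns side by side, here is a small sketch. The class name and the helper findAll are hypothetical; the two regexes and the input are the ones from the question and the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlPartsDemo {
    // Pattern from the question: needs a character on the left, so it cannot match at index 0.
    static final Pattern ORIGINAL =
            Pattern.compile("(?<=\\p{Alnum}|\\p{Punct})(\\p{Alnum}+\\p{Punct}{1})");
    // Pattern from the answer: only forbids a non-alphanumeric, non-punctuation char on the left.
    static final Pattern FIXED =
            Pattern.compile("(?<![^\\p{Alnum}\\p{Punct}])(\\p{Alnum}+\\p{Punct})");

    // Collects group 1 of every match, in order.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        String url = "https://www.google.com/";
        System.out.println(findAll(ORIGINAL, url)); // prints [ttps:, www., google., com/]
        System.out.println(findAll(FIXED, url));    // prints [https:, www., google., com/]
    }
}
```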

Java - Regex Replace All will not replace matched text

I am trying to remove a lot of Unicode escape sequences from a string but am having issues with regex in Java.
Example text:
\u2605 StatTrak\u2122 Shadow Daggers
Example Desired Result:
StatTrak Shadow Daggers
The current regex code I have that will not work:
list.replaceAll("\\\\u[0-9]+","");
The code executes but the text is not replaced. Other solutions I have seen seem to use only two backslashes ("\\"), but anything fewer than four gives me the typical error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 2
\u[0-9]+
I've tried the current regex solution in online test environments like RegexPlanet and FreeFormatter and both give the correct result.
Any help would be appreciated.
Assuming you want to replace these "special strings" with an empty String: \u2605 and \u2122 are characters outside the printable ASCII range that the POSIX character class \p{Print} matches, so you can replace every character that is not in that class with "". The result then matches your expectation, apart from a leading space that trim() removes.
A sample would be:
list = list.replaceAll("\\P{Print}", "");
Hope this helps.
In Java source, something like your \u2605 is not a literal sequence of six characters; it represents a single Unicode character, so your pattern "\\\\u[0-9]+" will not match it.
Your pattern describes a literal \ character followed by the character u followed by one or more digits 0 through 9, but what is in your string is the single character at Unicode code point U+2605, the "Black Star" character.
It works just like other escape sequences: in the string "some\tmore" there is no \ character and no t character, only the single character 0x09, a tab. Because \t is an escape sequence known to Java (and other languages), the compiler replaces it with the character it represents, and the literal \ and t are no longer in the string.
Kenny Tai Huynh's answer, replacing non-printables, may be the easiest way to go, depending on what sorts of things you want removed. Alternatively, if the set of characters you want to keep is very limited, you could list them and remove everything else, such as mystring.replaceAll("[^A-Za-z0-9]", "");
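The point about escape sequences can be checked with a short sketch. It assumes the escapes are written in Java source, so each one compiles to a single character; the class and method names are hypothetical, and the trim() call is an addition to drop the space left behind at the front.

```java
class UnicodeEscapeDemo {
    // Removes every character outside printable ASCII, then trims the leftover edge spaces.
    static String stripNonPrintable(String text) {
        return text.replaceAll("\\P{Print}", "").trim();
    }

    public static void main(String[] args) {
        // Each Unicode escape in this literal becomes ONE character at compile time.
        String name = "\u2605 StatTrak\u2122 Shadow Daggers";
        System.out.println("\u2605".length());       // prints 1
        System.out.println(stripNonPrintable(name)); // prints StatTrak Shadow Daggers
    }
}
```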
I'm an idiot. I was calling replaceAll on the string but not assigning the result, because I thought it altered the string in place.
What I had previously:
list.replaceAll("\\\\u[0-9]+","");
What I needed:
list = list.replaceAll("\\\\u[0-9]+","");
Result works fine now, thanks for the help.
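For anyone hitting the same thing, here is a sketch of the difference, assuming the backslashes really are characters in the data (for example, text read from a file). The class and method names are hypothetical, and the trim() is an addition to remove the leftover leading space.

```java
class ReplaceAllAssignDemo {
    // Strips literal escape text: a backslash, the letter u, then one or more digits.
    static String stripEscapeText(String text) {
        return text.replaceAll("\\\\u[0-9]+", "").trim();
    }

    public static void main(String[] args) {
        // The doubled backslashes make the backslash a real character in the string.
        String list = "\\u2605 StatTrak\\u2122 Shadow Daggers";

        list.replaceAll("\\\\u[0-9]+", ""); // result discarded: Strings are immutable
        System.out.println(list);           // still contains the escape text

        list = stripEscapeText(list);       // assign the returned String
        System.out.println(list);           // prints StatTrak Shadow Daggers
    }
}
```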

Why does the space appear as a substring in this split instruction?

I have a string with spaces and some non-informative characters and substrings that need to be excluded, keeping only the important sections. I used split as below:
String myString[]={"01: Hi you look tired today? Can I help you?"};
myString=myString[0].split("[\\s+]");// Split based on any white spaces
for(int ii=0;ii<myString.length;ii++)
System.out.println(myString[ii]);
The result is :
01:
Hi
you
look
tired
today?
Can
I
help
you?
The spaces appeared after the split as substrings when the regex is "[\s+]" but disappeared when the regex is "\s+". I am confused and could not find an answer in the related Stack Overflow pages. The link regex-Pattern made me more confused.
Please help, I am new to Java.
19/1/2015:Edit
After your valuable advice, I reached a point in my program where a conditional statement needs to be decomposed and processed. The case I have is:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\,]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is fine till now as:
01:IF
rd.h
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
with
0.4610;
My next step is to add the string "with" to the regex so that this word is dropped during the split.
I tried it this way:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\, with]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is not perfect, because I got an unwanted extra split at every "h" letter:
01:IF
rd.
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
0.4610;
Any advice on how to specify a string together with mixed whitespace and separator characters?
Many thanks.
Inside square brackets, [\s+] is a character class containing the whitespace characters plus the literal plus sign. It matches only one character at a time, so a sequence of spaces is split into many empty strings, as Todd noted, and + is also treated as a separator.
You should use \s+ (without brackets) as the separator. That means one or more whitespace characters.
myString=myString[0].split("\\s+");
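A quick sketch of the difference between the two separators. The class name, method names and the sample string are invented for illustration.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Character class: splits on ONE whitespace character or ONE literal '+'.
    static String[] splitOnClass(String text) {
        return text.split("[\\s+]");
    }

    // Quantified \s: splits on a run of one or more whitespace characters.
    static String[] splitOnRun(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String sample = "Hi  you+me"; // hypothetical input: two spaces and a plus sign
        System.out.println(Arrays.toString(splitOnClass(sample))); // prints [Hi, , you, me]
        System.out.println(Arrays.toString(splitOnRun(sample)));   // prints [Hi, you+me]
    }
}
```

The empty element in the first result is the gap between the two consecutive spaces.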
Your biggest problem is not understanding enough about regular expressions to write them properly. One key point you don't comprehend is that [...] is a character class, which is a list of characters any one of which can match. For example:
[abc] matches either a, b or c (it does not match "abc")
[\\s+] matches any whitespace or "+" character
[with] matches a single character that is either w, i, t or h
[.$&^?] matches those literal characters - most characters lose their special regex meaning when in a character class
To split on any number of whitespace, comma and ampersand characters and consume "with" (if it appears), do this:
String [] s2 = s1.split("[\\s,&]+(with[\\s,&]+)?");
You can try it easily here Online Regex and get useful comments.
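Applied to the rule string from the question, the split above can be sketched like this. The class and method names are hypothetical.

```java
class RuleSplitDemo {
    // Splits on runs of whitespace, commas and ampersands, swallowing a following "with" if present.
    static String[] tokens(String rule) {
        return rule.split("[\\s,&]+(with[\\s,&]+)?");
    }

    public static void main(String[] args) {
        String s1 = "01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
        for (String token : tokens(s1)) {
            System.out.println(token);
        }
    }
}
```

Because "with" is matched as a sequence outside the character class, the h in rd.h is no longer treated as a separator.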

Regex to accept only alphabets and spaces and disallowing spaces at the beginning and the end of the string

I have the following requirements for validating an input field:
It should only contain alphabets and spaces between the alphabets.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] with the Unicode property for letters, \p{L}.) Then there can be a space followed by at least one letter, and this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem with ^(?!\s*$) in your expression is that the lookahead only fails if there is nothing but whitespace up to the end of the string, so it rejects all-whitespace input but not leading whitespace. If you want to disallow leading whitespace, just remove the end of string anchor inside the lookahead ==> ^(?!\s)[-a-zA-Z ]*$. But this still allows the string to end with whitespace. To avoid this, add a lookbehind at the end of the string ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think lookarounds are not needed for this task.
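The three patterns discussed in this answer can be compared with a small sketch. The class name, constant names and the helper accepts are hypothetical, and String.matches is just one convenient way to run them.

```java
class TrimmedLettersDemo {
    // Pattern from the question: only rejects input that is entirely whitespace.
    static final String ORIGINAL = "^(?!\\s*$)[-a-zA-Z ]*$";
    // Lookaround version: no whitespace at either end (still allows '-' and the empty string).
    static final String LOOKAROUND = "^(?!\\s)[-a-zA-Z ]*(?<!\\s)$";
    // Letter-run version: words of Unicode letters separated by single spaces.
    static final String UNICODE_LETTERS = "^\\p{L}+(?: \\p{L}+)*$";

    static boolean accepts(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(accepts(ORIGINAL, " abc"));         // prints true (the reported bug)
        System.out.println(accepts(LOOKAROUND, " abc"));       // prints false
        System.out.println(accepts(LOOKAROUND, "abc "));       // prints false
        System.out.println(accepts(UNICODE_LETTERS, "abc def")); // prints true
    }
}
```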
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it is equivalent to
[ \t\n\x0B\f\r]
which includes horizontal tab (09), line feed (10), vertical tab (11), form feed (12), carriage return (13) and space (32).
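A minimal sketch of the \s-based pattern above used with String.matches, to show that it accepts other whitespace such as a tab between words. The class and method names are hypothetical.

```java
class NameValidatorDemo {
    // Letters only, words separated by one or more whitespace characters of any kind.
    static boolean isValid(String input) {
        return input.matches("[a-zA-Z]+(\\s+[a-zA-Z]+)*");
    }

    public static void main(String[] args) {
        System.out.println(isValid("John Smith"));  // prints true
        System.out.println(isValid("John\tSmith")); // prints true: a tab counts as \s
        System.out.println(isValid(" John"));       // prints false
        System.out.println(isValid("John "));       // prints false
    }
}
```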
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses a negative lookahead and a negative lookbehind to disallow spaces at the beginning or at the end of the string, and requires the entire string to match.
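I think the problem is the lookahead (?!\s*$): the ? there is part of the lookahead syntax rather than an optional quantifier, and the lookahead only rejects a string that consists entirely of whitespace.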
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
That is: one leading letter, then an optional run of letters and whitespace that must end with a letter.
I don't know if words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
The string must start with a letter (or several letters), not a space.
The string can contain several words, but every word besides the first must have a space before it.
Hope I helped.

Regex (Java) to remove all characters up to but not including (a number or a letter a-f followed by a number)

I need help constructing the regular expression to remove all characters up to but not including (a number or a letter a-f followed by a number) in Java:
Here's what I came up with (doesn't work):
string.replaceFirst(".+?(\\d|[a-f]\\d)","");
That line of code replaces the entire string with an empty string.
My reasoning: .+? is every character up to either \\d (a digit) or [a-f]\\d (any of the letters a-f followed by a digit).
This doesn't work, however. Can I have some help?
Thanks
EDIT: changed replace with replaceFirst
First off, replace() acts on literals, not regexes. You should use replaceFirst or replaceAll depending on what you want. Your regex problem is that you're including the suffix as part of the string to replace. You can give this a try:
input.replaceFirst(".+?(\\d|[a-f]\\d)","$1")
Here I just include the suffix in the replacement string as well. The more correct approach is to make that a zero-width assertion so that it doesn't get included in the region to replace. You can use a positive lookahead:
input.replaceFirst(".+?(?=(\\d|[a-f]\\d))", "")
The other answers given here have the problem that if the string starts with a-f followed by a number, or just a number, they will actually match and replace the first character. Not sure if that's a relevant scenario. This more convoluted pattern should work though:
"([^a-f\\d]|([a-f](?!\\d)))+"
(that is, everything that's not a digit or a-f, or a-f not followed by a digit).
I'd suggest something along the lines of
string.replaceFirst(".*?(?=(\\d|[a-f]\\d))", "");
s = s.replaceFirst(".*?(?=[a-f]?\\d)", "");
Using .*? instead of .+? ensures that the first character gets checked by the lookahead, solving the problem @johusman mentioned. And while your (\\d|[a-f]\\d) isn't causing a problem, [a-f]?\\d is both more efficient and more readable.
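The difference between .+? and .*? shows up when the input already starts with the wanted suffix. A small sketch, with hypothetical class, method and input names:

```java
class PrefixStripDemo {
    // Lazy prefix that must consume at least one character before the lookahead is tried.
    static String stripWithPlus(String s) {
        return s.replaceFirst(".+?(?=[a-f]?\\d)", "");
    }

    // Lazy prefix that may be empty, so the lookahead is checked at the very first character.
    static String stripWithStar(String s) {
        return s.replaceFirst(".*?(?=[a-f]?\\d)", "");
    }

    public static void main(String[] args) {
        System.out.println(stripWithStar("xyz a5rest")); // prints a5rest
        System.out.println(stripWithStar("a5rest"));     // prints a5rest
        System.out.println(stripWithPlus("a5rest"));     // prints 5rest (first character lost)
    }
}
```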
