Why the space appears as sub string in this split instruction? - java

I have a string with spaces and some non-informative characters and substrings that need to be excluded, keeping only some important sections. I used split as below:
String myString[]={"01: Hi you look tired today? Can I help you?"};
myString=myString[0].split("[\\s+]");// Split based on any white spaces
for(int ii=0;ii<myString.length;ii++)
System.out.println(myString[ii]);
The result is :
01:
Hi
you
look
tired
today?
Can
I
help
you?
The spaces appeared after the split as substrings when the regex is "[\s+]" but disappeared when the regex is "\s+". I am confused and was not able to find an answer in the related Stack Overflow pages. The link regex-Pattern made me more confused.
Please help, I am new with java.
19/1/2015:Edit
After your valuable advice, I reached a point in my program where a conditional statement needs to be decomposed and processed. The case I have is:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\,]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is fine till now as:
01:IF
rd.h
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
with
0.4610;
My next step is to add string "with" to the regex and get rid of this word while doing the split.
I tried it this way:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\, with]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is not perfect, because I got an unwanted extra split at every "h" letter:
01:IF
rd.
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
0.4610;
Any advice on how to specify string with mixed white spaces and separation marks?
Many thanks.

Inside square brackets, [\s+] is a character class containing the whitespace characters plus the literal plus sign. It matches only one character at a time, so a sequence of spaces is split into many empty strings, as Todd noted, and + is also used as a separator.
You should use \s+ (without brackets) as the separator. That means one or more whitespace characters.
myString=myString[0].split("\\s+");
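As a rough illustration of the difference (a minimal sketch; the class name, method names and the sample string with repeated spaces are made up for this example), the bracketed form splits at every single whitespace character, so runs of spaces leave empty strings behind, while \s+ treats each run as one separator:

```java
import java.util.Arrays;

public class SplitWhitespaceDemo {
    // Character class: every single whitespace character (or a literal '+') is a separator.
    static String[] splitPerCharacter(String s) {
        return s.split("[\\s+]");
    }

    // Quantified \s: a whole run of whitespace counts as one separator.
    static String[] splitPerRun(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        String sample = "Hi  you   look"; // two spaces, then three spaces
        System.out.println(Arrays.toString(splitPerCharacter(sample))); // [Hi, , you, , , look]
        System.out.println(Arrays.toString(splitPerRun(sample)));       // [Hi, you, look]
    }
}
```

With single spaces only, both forms give the same tokens; the difference shows up as soon as two whitespace characters are adjacent, or when the text contains a + sign.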

Your biggest problem is not understanding enough about regular expressions to write them properly. One key point you don't comprehend is that [...] is a character class, which is a list of characters any one of which can match. For example:
[abc] matches either a, b or c (it does not match "abc")
[\\s+] matches any whitespace or "+" character
[with] matches a single character that is either w, i, t or h
[.$&^?] matches those literal characters - most characters lose their special regex meaning when in a character class
To split on any number of whitespace, comma and ampersand and consume "with" (if it appears), do this:
String [] s2 = s1.split("[\\s,&]+(with[\\s,&]+)?");
You can try it easily here Online Regex and get useful comments.
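To see that pattern at work on the rule from the question, here is a small sketch (the class and method names are only illustrative). The optional group consumes "with" only when it sits between separators, so a token that merely starts with those letters is left alone:

```java
import java.util.Arrays;

public class SplitWithKeywordDemo {
    // Separators: runs of whitespace, commas and ampersands,
    // optionally followed by the word "with" and another run of separators.
    static String[] splitRule(String rule) {
        return rule.split("[\\s,&]+(with[\\s,&]+)?");
    }

    public static void main(String[] args) {
        String s1 = "01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
        // Prints: [01:IF, rd.h, dq.L, o.LL, v.L, THEN, la.VHB, av.VHR, 0.4610;]
        System.out.println(Arrays.toString(splitRule(s1)));
    }
}
```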

Related

Can we replace a specific part of a string literal using any predefined function and regex

I want to replace "&" with a random word "$d" in a given sentence.
Can we replace only those words which start with & and are followed by a single character and a space?
Example:-
Input:-
Two literals are &a and &b and also check &abc and &bac here.
Output:-
Two literals are $da and $db and also check &abc and &bac here.
In the above example in input, the only words that should be replaced are &a and &b(not the complete word should be replaced, only just the '&' in both the words) because these two random words start with & and are followed by a single character and a space.
In the case of the replaceAll() function, it replaces the entire word when I used regex:-
String str="Two literals are &a and &b and also check &abc and &bac here.";
str = str.replaceAll("\\&[a-zA-Z]{1}\\s", "\\$d");
System.out.println(str);
//output for this:-Two literals are $d and $d and also check &abc and &bac here.
//expected output:-Two literals are $da and $db and also check &abc and &bac here.
The correct code for this would be
str.replaceAll("&([a-zA-Z]\\s)", "\\$d$1")
This is an example of backreferencing captured groups in regex, and here is a nice reference for it. Additionally, here's a relevant StackOverflow question about it.
Essentially, the match inside the parentheses ([a-zA-Z]\\s) matches a single letter and a space. The value of this match can be referenced with $1 since it is of capturing group 1.
So we replace &(a ) with $d(a ) (brackets here to demonstrate what is captured). Credit to u/rzwitserloot for reminding me that OP wants $ not &.
You presumably want a concept called look-ahead: You can match on things being there without 'consuming' it. You can even match on things NOT being there. That's what you want here: Match &[a-z], but only if looking ahead past that, we do NOT see another letter:
for (String test : List.of("Two literals are &a and &bcd", "A literal is &a", "How about &a?")) {
System.out.println(test.replaceAll("&(?=[a-zA-Z](?![a-zA-Z]))", "\\$d"));
}
Perhaps instead you want the single letter thing to just be on any word break (i.e. &z00 should NOT turn into $dz00, even though there is no letter after the z). Then I suggest:
"&(?=[a-zA-Z]\\b)"
That's a lot simpler to read!
A few notes:
(?=x) is 'positive lookahead'. It doesn't itself match anything but makes the match fail if x is not immediately following the match.
(?!x) is 'negative lookahead'. It doesn't itself match anything but makes the match fail if x is immediately following the match.
$ has special meaning in the replacement part so we need to escape it.
\\b is regexpese for 'word break': Doesn't match any characters, but fails if we aren't on a 'word break'. Spaces, dots, end-of-input, end-of-line, a dash, an ampersand - many things are word breaks.
We don't want to match those letters because if we do, they would be replaced.
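The three variants discussed above can be compared side by side. This is only a sketch (class and method names are invented here), but it shows where they differ: the capture-group version needs whitespace after the letter, the negative-lookahead version only needs "no second letter", and the \b version also rejects a following digit:

```java
public class LookaheadReplaceDemo {
    // Capture the letter plus whitespace and put it back with $1.
    static String viaCaptureGroup(String s) {
        return s.replaceAll("&([a-zA-Z]\\s)", "\\$d$1");
    }

    // Match only the '&'; require one letter that is not followed by another letter.
    static String viaLookahead(String s) {
        return s.replaceAll("&(?=[a-zA-Z](?![a-zA-Z]))", "\\$d");
    }

    // Match only the '&'; require one letter followed by a word boundary.
    static String viaWordBoundary(String s) {
        return s.replaceAll("&(?=[a-zA-Z]\\b)", "\\$d");
    }

    public static void main(String[] args) {
        String sentence = "Two literals are &a and &b and also check &abc and &bac here.";
        System.out.println(viaCaptureGroup(sentence)); // Two literals are $da and $db and also check &abc and &bac here.
        System.out.println(viaCaptureGroup("How about &a?")); // unchanged: no whitespace after the letter
        System.out.println(viaLookahead("How about &a?"));    // How about $da?
        System.out.println(viaLookahead("&z00"));             // $dz00
        System.out.println(viaWordBoundary("&z00"));          // &z00
    }
}
```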

Matching The Arabic punctuation marks in Java

I want to edit REGEX_PATTERN2 in this code so that it works with the matches() method for the Arabic punctuation marks:
String REGEX_PATTERN = "[\\.|,|:|;|!|_|\\?]+";
String s1 = "My life :is happy, stable";
String[] result = s1.split(REGEX_PATTERN);
for (String myString : result) {
System.out.println(myString);
}
String REGEX_PATTERN2 = "[\\.|,|:|;|!|_|،|؛|؟\\?]+";
String s2 = " حياتي ؛ سعيدة، مستقر";
String[] result2 = s2.split(REGEX_PATTERN2);
for (String myString : result2) {
System.out.println(myString);
}
The output I wanted
My life
is happy
stable
حياتي
سعيدة
مستقر
How can I edit this code to use matches() instead of split() and get the same output with Arabic punctuation marks?
There are a few problems here. First this example:
if (word.matches("[\\.|,|:|;|!|\\?]+"))
That is mildly incorrect (see footnote 1) for the following reasons:
A . does not need to be escaped in a character class.
A | does not mean alternation in a character class.
A ? does not need to be escaped in a character class.
(For more details, read the javadoc or a tutorial on Java regexes.)
So you can rewrite the above as:
if (word.matches("[.,:;!?]+"))
... assuming that you don't want to classify the pipe character as punctuation.
Now this:
if (word.matches("[\.|,|:|;|!|،|؛|..|...|؟|\?]+"))
You have the same problems as above. In addition, you seem to have used two and three full-stop / period characters instead of (presumably) some Unicode character. I suspect they might be \ufbb7 or \u061e or \u06db, but I'm no linguist. (Certainly 2 or 3 full-stops is incorrect.)
So what are the punctuation characters in Arabic?
To be honest, I think that the answer depends on what source you look at, but Wikipedia states:
Only the Arabic question mark ⟨؟⟩ and the Arabic comma ⟨،⟩ are used in regular Arabic script typing and the comma is often substituted for the Latin script comma (,).
1 - By mildly incorrect, I mean that the mistakes in this example are mostly harmless. However, your inclusion of (multiple instances of) the | character in the class does mean that you will incorrectly classify a "pipe" as punctuation.
[] denotes a regex character class, which means it only matches single characters. ... is 3 characters, so it cannot be used in a character class.
In a character class, you don't separate characters with |, and you don't need to escape . and ?.
You probably meant this, which is a list of alternate character sequences:
"(?:\\.|,|:|;|!|\\?|،|؛|؟|\\.\\.|\\.\\.\\.)+"
You might get better performance if you do use a character class where you can:
"(?:\\.{1,3}|[,:;!?،؛؟])+"
Of course, with the + at the end, matching 1-3 periods in each iteration is rather redundant, so this will do:
"[.,:;!?،؛؟]+"
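A minimal sketch of using that final character class with matches() follows. The class and method names are hypothetical, and the three Arabic marks are written as Unicode escapes (\u060C comma, \u061B semicolon, \u061F question mark) only to keep the source unambiguous in editors that reorder right-to-left text:

```java
public class ArabicPunctuationDemo {
    // Latin punctuation plus the Arabic comma, semicolon and question mark.
    private static final String PUNCTUATION = "[.,:;!?\u060C\u061B\u061F]+";

    // matches() must cover the whole token, so mixed tokens are rejected.
    static boolean isPunctuationOnly(String token) {
        return token.matches(PUNCTUATION);
    }

    public static void main(String[] args) {
        System.out.println(isPunctuationOnly("\u061F\u060C")); // true
        System.out.println(isPunctuationOnly("..."));          // true
        System.out.println(isPunctuationOnly("a,"));           // false
        System.out.println(isPunctuationOnly("|"));            // false
    }
}
```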
Here's a different approach, that uses Unicode properties instead of specific characters (In case you care about more Arabic marks than just the question mark and comma mentioned in another answer):
"(?=^[\\p{InArabic}.,:;!?]+$)^\\p{IsPunctuation}+$"
It matches an entire string of characters that have a punctuation category, that also are either in the Arabic block or are one of the other punctuation characters you listed in your efforts.
It'll match strings like "؟،" or "؟،:", but not "؟،ؠ" or "؟،a".
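The Unicode-property pattern can be checked against those same four strings. This is a sketch only: the class and method names are invented, and the Arabic characters are written as escapes (\u061F question mark, \u060C comma, \u0620 an Arabic letter):

```java
public class UnicodePunctuationDemo {
    // Lookahead: every character is in the Arabic block or one of . , : ; ! ?
    // Main pattern: every character has a Unicode punctuation category.
    private static final String ARABIC_OR_LISTED_PUNCTUATION =
            "(?=^[\\p{InArabic}.,:;!?]+$)^\\p{IsPunctuation}+$";

    static boolean isPunctuation(String s) {
        return s.matches(ARABIC_OR_LISTED_PUNCTUATION);
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation("\u061F\u060C"));       // true
        System.out.println(isPunctuation("\u061F\u060C:"));      // true
        System.out.println(isPunctuation("\u061F\u060C\u0620")); // false: Arabic letter
        System.out.println(isPunctuation("\u061F\u060Ca"));      // false: Latin letter
    }
}
```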

Java split by alphabeta char creates an empty value in array

I want to split my string on every occurrence of an alphabetic character.
for example:
"s1l1e13" to an array of: ["s1","l1","e13"]
when trying to use this simple split by regex i get some weird results:
testStr = "s1l1e13"
Arrays.toString(testStr.split("(?=[a-z])"))
gives me the array of:
["","s1","l1","e13"]
how can i create the split without the empty array element?
I tried a couple more things:
testStr = "s1"
Arrays.toString(testStr.split("(?=[a-z])"))
does return the correct array: ["s1"]
but when trying to use substring
testStr = "s1l1e13"
Arrays.toString(testStr.substring(1).split("(?=[a-z])"))
i get in return ["1","l1","e13"]
what am i missing?
Your Lookahead marks each position before any character of a to z; marking the following positions:
s1 l1 e13
^  ^  ^
So by splitting using just the Lookahead, it returns ["", "s1", "l1", "e13"]
You can use a Negative Lookbehind here. This looks behind to see if there is not the beginning of the string.
String s = "s1l1e13";
String[] parts = s.split("(?<!\\A)(?=[a-z])");
System.out.println(Arrays.toString(parts)); //=> [s1, l1, e13]
Your problem is that (?=[a-z]) means "place before [a-z]" and in your text
s1l1e13
you have 3 such places. I will mark them with |
|s1|l1|e13
so split (unfortunately correctly) produces "" "s1" "l1" "e13" and doesn't automatically remove the leading empty element for you.
To solve this problem you have at least two options:
make sure that there is something before the place you need to split on (so it is not at the start of your string). You can use for instance (?<=\\d)(?=[a-z]) if you want to split after a digit but before a letter
(PREFERRED SOLUTION) start using Java 8, which automatically removes the empty string at the start of the result array if the regex used in split is zero-length (look-arounds are zero-length).
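Both look-behind guards mentioned in these answers can be tried with a short sketch like the one below (the class and method names are made up). Either one keeps the split from firing at index 0, so the result has no leading empty string regardless of the Java version:

```java
import java.util.Arrays;

public class SplitBeforeLetterDemo {
    // Split before a letter, but only where a digit comes right before it.
    static String[] splitAfterDigit(String s) {
        return s.split("(?<=\\d)(?=[a-z])");
    }

    // Split before a letter, except at the very start of the input.
    static String[] splitNotAtStart(String s) {
        return s.split("(?<!\\A)(?=[a-z])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAfterDigit("s1l1e13"))); // [s1, l1, e13]
        System.out.println(Arrays.toString(splitNotAtStart("s1l1e13"))); // [s1, l1, e13]
    }
}
```

The two differ on input such as "ab1": the digit-based guard leaves it whole, while the start-of-input guard still splits between "a" and "b1".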
The first match finds "" to be okay because it's looking ahead for any alphabetic character. This is a zero-width lookahead, so it doesn't need to consume anything. The "s" at the beginning is alphabetic, so the regex matches at the position just before it.
If you want the regex to always match something, you could use ".+(?=[a-z])", but note that the characters consumed by .+ are removed from the result.
The problem is that the initial "s" counts as an alphabetic character. So, the regex is trying to split at s.
The issue is that there is nothing before the s, so the regex engine shows that there is nothing by adding an empty string as the first element. The end of the string is not a concern: split drops trailing empty strings by default.
If this is the only string you're splitting, or if every array you had starts with a letter but does not end with one, just truncate the array to omit the first element. Otherwise, you'll probably need to loop through each array as you make it so that you can drop empty elements.
So it seems your matches have the pattern x###, where x is a letter, and # is a digit.
I'd make the following Regex:
([a-z][0-9]+)
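Instead of splitting, that pattern can be used to collect the tokens directly with a Matcher. A minimal sketch, with an illustrative class name and helper method:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenMatchDemo {
    // One lowercase letter followed by one or more digits.
    private static final Pattern TOKEN = Pattern.compile("[a-z][0-9]+");

    // Collect every non-overlapping match, left to right.
    static List<String> tokens(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("s1l1e13")); // [s1, l1, e13]
    }
}
```

Unlike split, this silently skips anything that does not fit the letter-plus-digits shape, which may or may not be what you want.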

Java: Do I have an effective regex to eliminate symbols & rename a file?

I have a series of link names from which I'm trying to eliminate special characters. From a brief filewalk, my biggest concerns appear to be brackets, parentheses and colons. After unsuccessfully wrestling with escape characters to SELECT : [ and (, I decided instead to exclude everything I wanted to KEEP in the filename.
Consider:
String foo = inputFilename; // SAMPLE DATA: [Phone]_Michigan_billing_(automatic).html
String scrubbedFoo = foo.replaceAll("[^a-zA-Z-._]", "");
Expected Result: Phone_Michigan_billing_automatic.html
My escape-character regex was approaching 60 characters when I ditched it. The last version I saved before changing strategies was [:.(\\[)|(\\()|(\\))|(\\])] where I thought I was asking for escape-character-[() and ].
The blanket exclude seems to work just fine. Is the Regex really that simple? Any input on how effective this strategy will be? I feel like I'm missing something and need a couple sets of eyes.
In my opinion, you're using the wrong tool for this job. StringUtils (from Apache Commons Lang) has a method named replaceChars that will replace all occurrences of a char with another one. Here's the documentation:
public static String replaceChars(String str,
String searchChars,
String replaceChars)
Replaces multiple characters in a String in one go. This method can also be used to delete characters.
For example:
replaceChars("hello", "ho", "jy") = jelly.
A null string input returns null. An empty ("") string input returns an empty string. A null or empty set of search characters returns the input string.
The length of the search characters should normally equal the length of the replace characters. If the search characters is longer, then the extra search characters are deleted. If the search characters is shorter, then the extra replace characters are ignored.
StringUtils.replaceChars(null, *, *) = null
StringUtils.replaceChars("", *, *) = ""
StringUtils.replaceChars("abc", null, *) = "abc"
StringUtils.replaceChars("abc", "", *) = "abc"
StringUtils.replaceChars("abc", "b", null) = "ac"
StringUtils.replaceChars("abc", "b", "") = "ac"
StringUtils.replaceChars("abcba", "bc", "yz") = "ayzya"
StringUtils.replaceChars("abcba", "bc", "y") = "ayya"
StringUtils.replaceChars("abcba", "bc", "yzx") = "ayzya"
So in your example:
String translated = StringUtils.replaceChars("[Phone]_Michigan_billing_(automatic).html", "[]():", null);
System.out.println(translated);
Will output:
Phone_Michigan_billing_automatic.html
This will be more straightforward and easier to understand than any regex you could write.
I think your regex is the way to go. In general, white listing values instead of black listing them is almost always better (only allowing characters you KNOW are good instead of eliminating all characters you think are bad). From a security standpoint this regex should be preferred. You will never end up with an inputFilename which has invalid characters.
suggested regex: [^a-zA-Z-._]
I think your regex can be as simple as \W which will match everything that is not a word character (letters, digits, and underscores). This is the negation of \w
So your code becomes:
foo.replaceAll("\\W","");
As pointed out in the comments, the above also removes periods. This will keep periods as well:
foo.replaceAll("[^\\w.]","");
Details: remove everything that is not (the ^ inside the character class) a digit, underscore or letter (the \w) or a period (the .)
As noted above there may be other chars you want to white list, like -. Just include them in your character class as you go along.
foo.replaceAll("[^\\w.\\-]","");
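A small sketch comparing the two whitelists discussed here (the class and method names, and the second sample filename, are invented for illustration). Note that the letters-only class from the question also strips digits, while the \w-based one keeps them:

```java
public class FilenameScrubDemo {
    // Keep word characters (letters, digits, underscore), periods and hyphens.
    static String scrub(String name) {
        return name.replaceAll("[^\\w.\\-]", "");
    }

    // The whitelist from the question: letters, hyphen, period, underscore.
    static String scrubLettersOnly(String name) {
        return name.replaceAll("[^a-zA-Z-._]", "");
    }

    public static void main(String[] args) {
        String sample = "[Phone]_Michigan_billing_(automatic).html";
        System.out.println(scrub(sample));            // Phone_Michigan_billing_automatic.html
        System.out.println(scrubLettersOnly(sample)); // Phone_Michigan_billing_automatic.html
        System.out.println(scrub("report:2015 (v2).txt"));            // report2015v2.txt
        System.out.println(scrubLettersOnly("report:2015 (v2).txt")); // reportv.txt
    }
}
```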

Java - Unknown characters passing as [a-zA-z0-9]*?

I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
System.out.println(someData); //someData contains gottenData
certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display Text, it's not all like that.
For now, I don't mind the [?] as long as it also contains some string along with it.
Please help.
[EDIT] as far as I can tell from the (very large) input, the [?]'s are either white spaces either nothing at all; maybe there's some sort of encoding issue, also perhaps something to do with #text nodes (input is xml)
The * quantifier matches "zero or more", which means it will match a string that does not contain any of the characters in your class. Try the + quantifier, which means "one or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.find() instead of Matcher.matches(), it will not require a full string match and you can use the regex [a-zA-Z0-9]+. (Matcher.lookingAt() also drops the full-match requirement, but it only matches at the start of the input.)
You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex.
Matcher.matches() always matches the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-z0-9]*");
if (!p.matcher(gottenData).matches())
System.out.println("doesn't match.");
this prints "doesn't match."
The correct answer is a combination of the above answers. First, I imagine your intended character match is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extra (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need to put in the start and end anchors if you want the string to consist only of letters and numbers.
Finally, you use the * operator, which means 0 or more, so the pattern can match 0 characters: an empty string passes matches(), and without a full-string match the pattern effectively matches any input. What you need is the + quantifier. So I will submit that the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
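The points made across these answers can be checked with a short sketch (class and method names are hypothetical). It contrasts * with +, the A-z typo with A-Z, and a full match with a "contains" check via find():

```java
import java.util.regex.Pattern;

public class AlnumFilterDemo {
    private static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+");

    // matches() requires the whole input to be alphanumeric (anchors are implied).
    static boolean isAlnum(String s) {
        return ALNUM.matcher(s).matches();
    }

    // find() only requires at least one alphanumeric character somewhere.
    static boolean containsAlnum(String s) {
        return ALNUM.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println("".matches("[a-zA-Z0-9]*"));    // true: * accepts the empty string
        System.out.println(isAlnum(""));                   // false: + needs at least one character
        System.out.println("a_b".matches("[a-zA-z0-9]+")); // true: A-z also covers [ \ ] ^ _ `
        System.out.println(isAlnum("a_b"));                // false
        System.out.println(isAlnum("a "));                 // false: trailing space
        System.out.println(containsAlnum(" \u00A0x "));    // true: some text among odd whitespace
    }
}
```

For the case in the question (keep anything that has some text in it, drop strings that are only whitespace or unknown symbols), the find()-based check is probably the closer fit.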
You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string
Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...
Did anyone consider adding a space to the regex: [a-zA-Z0-9 ]*? This should match any normal text with letters, numbers and spaces. If you want quotes and other special chars, add them to the regex too.
You can quickly test your regex at http://www.regexplanet.com/simple/
You can check whether the input value contains only letters and digits by using the regex ^[a-zA-Z0-9]*$
If your value contains only letters and digits, it shows match, e.g. riz99, riz99z
Otherwise it shows not match, e.g. 99z., riz99.z, riz99.9
Example code:
if(e.target.value.match('^[a-zA-Z0-9]*$')){
console.log('match')
}
else{
console.log('not match')
}
online working example
