What I need is to split a string into its words, treating each special character such as !,?._'# as a delimiter. What I've tried is this:
import java.util.Scanner;
import java.util.regex.Pattern;

public class Solution
{
    public static void main(String[] args)
    {
        Scanner scan = new Scanner(System.in);
        // Splits on the listed characters only; spaces are not treated as delimiters
        Pattern pat = Pattern.compile("[!|,|?|.|_|'|#]");
        String a = scan.nextLine();
        scan.close();
        String[] part = pat.split(a);
        System.out.println(part.length);
        for (String p : part)
            System.out.println(p);
    }
}
While this does split on the special characters, I can't find a way to make the regex also match the spaces between words.
I've also tried adding \s and \\s after the regex.
For input like: The dog is a very lazy dog, isn't he?
output should be:
The
dog
is
a
very
lazy
dog
isn
t
he
[..] is a character class, which describes the set of characters allowed at a single position, not two characters (we can allow repetition with quantifiers like +, * or {min,max}, but that is not the case here).
Also, you don't need | inside [..] because there it is a literal character, not the OR operator. So [a|b] doesn't mean a OR b; it matches any one of the characters a, | or b (repeating |, as in [a|b|c], just lists | again along with c).
Based on example you provided, you may be looking for:
Pattern pat = Pattern.compile("[!,?._'#\\s]+");
or since this may be more readable
Pattern pat = Pattern.compile("([!,?._'#]|\\s)+");
You would need to use the OR operator | outside of [..] and write \s as "\\s", since \ is also a special character in String literals (it is used, for instance, to create the tab character \t), so it requires escaping.
I wrapped the entire expression in (..) to create a group that represents all your delimiters. This allowed me to apply + (the quantifier for "one or more occurrences"), so now your regex sees ,. as a single delimiter. That ensures one split on a whole run of consecutive delimiters, rather than a split on each of them separately. So instead of "a,.b" -> ["a", "", "b"] we now get ["a", "b"].
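To see the character-class version in action, here is a minimal sketch that runs it on the sentence from the question. The class name DelimiterSplitDemo and the helper words are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class DelimiterSplitDemo {
    // A run of one or more listed characters or whitespace counts as ONE delimiter.
    private static final Pattern DELIMITERS = Pattern.compile("[!,?._'#\\s]+");

    // Hypothetical helper: returns the words of a line in order.
    static List<String> words(String line) {
        return Arrays.asList(DELIMITERS.split(line));
    }

    public static void main(String[] args) {
        for (String word : words("The dog is a very lazy dog, isn't he?")) {
            System.out.println(word);
        }
    }
}
```

The trailing ? produces no empty element because split drops trailing empty strings. A delimiter at the very start of the input would still give an empty first element, so filter that out if it matters.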
Related
Given the following string: ThisIsA_SimpleTest_Case
I want to split on all capitalized words not between underscores and on the first underscore of a string between underscores.
The expected split result: This Is A SimpleTest Case
I came up with the following non-working regex, for the Java regex flavor:
(?=_[a-zA-Z]*_|[A-Z])
But this of course doesn't work, since it's an "or" and not an "and". It also splits on all capitalized words between underscores, which is something I want to ignore.
Wiktor is right, it should be easier to try to match instead of splitting on what you don't want.
But because it's a fun challenge, I got one that will split it like you wanted.
_|(?<!_)(?=[A-Z])(?=[^_]*(?:_[^_]*_[^_]*)*[^_]*$)
Also works with multiple pairs of underscores.
(It can certainly be improved, I might try to simplify it)
The idea is:
_| Split on any underscore removing it from the final list.
(?<!_) Not right after an underscore. If you don't do that, you might get empty matches after the split (cases already handled by the _|). Can be skipped if you don't care.
(?=[A-Z]) Split before capital letters.
(?=[^_]*(?:_[^_]*_[^_]*)*[^_]*$) But the position must be followed by an even number of underscores. If the number is odd, you're between a pair of underscores and it should not split. I assume the string as a whole can't contain an odd number of underscores.
Test at https://regex101.com/r/Iov1Yl/1/
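For anyone who wants to check this outside regex101, here is a small sketch that feeds the same regex to String.split. The class name and the parts helper are illustrative only, and it assumes Java 8 or later, where a zero-width match at index 0 does not produce a leading empty string.

```java
import java.util.Arrays;
import java.util.List;

final class UnderscoreAwareSplitDemo {
    // The regex from the answer above, unchanged (it needs no extra escaping in Java).
    static final String REGEX = "_|(?<!_)(?=[A-Z])(?=[^_]*(?:_[^_]*_[^_]*)*[^_]*$)";

    // Hypothetical helper: split and wrap the result in a List for easy printing.
    static List<String> parts(String s) {
        return Arrays.asList(s.split(REGEX));
    }

    public static void main(String[] args) {
        // prints [This, Is, A, SimpleTest, Case]
        System.out.println(parts("ThisIsA_SimpleTest_Case"));
    }
}
```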
You might split on:
(?=(?<!_)[A-Z](?![A-Za-z]*_))|(?<!_[A-Za-z]{0,1000}|^)(?=[A-Z])|_
(?=(?<!_)[A-Z](?![A-Za-z]*_)) If it is a position directly before a char A-Z that is not directly preceded by _ and is not followed by a run of letters ending in _
| Or
(?<!_[A-Za-z]{0,1000}|^)(?=[A-Z]) If it is a position that is not the start of the string and is not preceded by an underscore followed by up to 1000 letters, and what is directly at the right is a char A-Z
| Or
_ Match an underscore
Example code
String regex = "(?=(?<!_)[A-Z](?![A-Za-z]*_))|(?<!_[A-Za-z]{0,1000}|^)(?=[A-Z])|_";
String str = "ThisIsA_SimpleTest_Case";
String[] parts = str.split(regex);
for (String part : parts)
System.out.println(part);
Output
This
Is
A
SimpleTest
Case
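Another approach: rewrite the string before splitting.
The underscore pair is first replaced by a comma plus a # marker that hides the camel-case boundary inside it. Commas are then inserted at every remaining lowercase-to-uppercase boundary, the marker is removed, and the result is split on commas: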
public static void main(String[] args) {
String input = "ThisIsA_SimpleTest_Case";
String inputReplace1 = input.replaceAll("_(\\w+[a-z])([A-Z]\\w+)_", ",$1#$2");
String inputReplace2 = inputReplace1.replaceAll("(?<=[a-z])(?=[A-Z])", ",");
String inputReplace3 = inputReplace2.replaceAll("#", "");
System.out.println(Arrays.asList(inputReplace3.split(",")));
}
Output:
[This, Is, A, SimpleTest, Case]
I need to add spaces between all punctuation in a string.
// "Hello: World." -> "Hello : World ."
// "It's 9:00?" -> "It ' s 9 : 00 ?"
// "1.B,3.D!" -> "1 . B , 3 . D !"
I think a regex is the way to go, matching all non-punctuation [a-zA-Z\\d]+, adding a space before and/or after, then extracting the remainder matching all punctuation [^a-zA-Z\\d]+.
But I don't know how to (recursively?) call this regex. Looking at the first example, the regex will only match the "Hello". I was thinking of just building a new string by continuously removing and appending the first instance of the matched regex, while the original string is not empty.
private String addSpacesBeforePunctuation(String s) {
StringBuilder builder = new StringBuilder();
final String nonpunctuation = "[a-zA-Z\\d]+";
final String punctuation = "[^a-zA-Z\\d]+";
String found;
while (!s.isEmpty()) {
// regex stuff goes here
found = ???; // found group from respective regex goes here
builder.append(found);
builder.append(" ");
s = s.replaceFirst(found, "");
}
return builder.toString().trim();
}
However, this doesn't feel like the right way to go... I think I'm overcomplicating things...
You can use a lookaround-based regex with the punctuation class \p{Punct} in Java:
str = str.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
(?<=\\S) asserts that the previous char is not whitespace
(?<=\\p{Punct}) asserts that the previous char is a punctuation char
(?=\\p{Punct}) asserts that the next char is a punctuation char
(?=\\S) asserts that the next char is not whitespace
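A quick way to try this is a sketch like the following, which runs the expression on the three inputs from the question. The class and method names are made up for the example.

```java
final class PunctuationSpacingDemo {
    // Zero-width position between two non-space chars where at least one side is punctuation.
    private static final String BOUNDARY =
            "(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)";

    // Hypothetical helper: inserts a single space at every such position.
    static String addSpaces(String s) {
        return s.replaceAll(BOUNDARY, " ");
    }

    public static void main(String[] args) {
        System.out.println(addSpaces("Hello: World.")); // Hello : World .
        System.out.println(addSpaces("It's 9:00?"));    // It ' s 9 : 00 ?
        System.out.println(addSpaces("1.B,3.D!"));      // 1 . B , 3 . D !
    }
}
```

Because every match is zero-width, nothing is consumed and existing spaces are left alone.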
When you see a punctuation mark, you have four possibilities:
Punctuation is surrounded by spaces
Punctuation is preceded by a space
Punctuation is followed by a space
Punctuation is neither preceded nor followed by a space.
Here is code that does the replacement properly:
String ss = s
.replaceAll("(?<=\\S)\\p{Punct}", " $0")
.replaceAll("\\p{Punct}(?=\\S)", "$0 ");
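It uses two expressions: the first adds a space before punctuation that directly follows a non-space character (case 3), and the second adds a space after punctuation that is directly followed by a non-space character (case 2). Since the expressions are applied one after the other, together they take care of case 4 as well. Case 1 requires no change.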
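Here is the two-pass version wrapped in a method so it can be checked against the question's inputs. This is only a sketch, and the names TwoPassSpacingDemo and space are illustrative.

```java
final class TwoPassSpacingDemo {
    // Hypothetical helper applying the two replacements in sequence.
    static String space(String s) {
        return s
                // Pass 1: space BEFORE punctuation that follows a non-space char.
                .replaceAll("(?<=\\S)\\p{Punct}", " $0")
                // Pass 2: space AFTER punctuation that is followed by a non-space char.
                .replaceAll("\\p{Punct}(?=\\S)", "$0 ");
    }

    public static void main(String[] args) {
        System.out.println(space("It's 9:00?")); // It ' s 9 : 00 ?
        System.out.println(space("a , b"));      // a , b  (already spaced, unchanged)
    }
}
```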
I want to remove these characters from a String:
+ - ! ( ) { } [ ] ^ ~ : \
I also want to remove these sequences:
/*
*/
&&
||
I mean that I will not remove a single & or |; I will remove them only when the second character follows the first one (/* */ && ||).
How can I do that efficiently and fast in Java?
Example:
a:b+c1|x||c*(?)
will be:
abc1|xc*?
This can be done via a long, but actually very simple regex.
String aString = "a:b+c1|x||c*(?)";
String sanitizedString = aString.replaceAll("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|", "");
System.out.println(sanitizedString);
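Since the question asks about speed: String.replaceAll compiles the regex on every call, so if this runs often you could compile the pattern once and reuse it. The sketch below assumes that is the bottleneck (I have not measured it), and the class and method names are made up.

```java
import java.util.regex.Pattern;

final class SanitizerDemo {
    // Same regex as above, compiled once and reused for every call.
    private static final Pattern UNWANTED =
            Pattern.compile("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|");

    // Hypothetical helper: strips the single characters and the two-character sequences.
    static String sanitize(String s) {
        return UNWANTED.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("a:b+c1|x||c*(?)")); // abc1|xc*?
        System.out.println(sanitize("a&b|c"));           // a&b|c  (single & and | are kept)
    }
}
```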
I think that the java.lang.String.replaceAll(String regex, String replacement) is all you need:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll(java.lang.String, java.lang.String).
There are two ways to do that:
1)
// Needs: import java.util.ArrayList;
ArrayList<String> arrayList = new ArrayList<String>();
arrayList.add("+");
arrayList.add("-");
arrayList.add("!");
arrayList.add("||");
arrayList.add("&&");
arrayList.add("(");
arrayList.add(")");
arrayList.add("{");
arrayList.add("}");
arrayList.add("[");
arrayList.add("]");
arrayList.add("~");
arrayList.add("^");
arrayList.add(":");
arrayList.add("\\"); // a single backslash
arrayList.add("/*");
arrayList.add("*/");
// No entry for a lone "/": only the sequences /* and */ should be removed.
String string = "a:b+c1|x||c*(?)";
for (int i = 0; i < arrayList.size(); i++) {
    // replace() takes a literal, not a regex, and is a no-op when the token is absent
    string = string.replace(arrayList.get(i), "");
}
System.out.println(string);
2)
String string = "a:b+c1|x||c*(?)";
string = string.replaceAll("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|", "");
System.out.println(string);
Thomas wrote on How to remove special characters from a string?:
That depends on what you define as special characters, but try
replaceAll(...):
String result = yourString.replaceAll("[-+.^:,]","");
Note that the ^ character must not be the first one in the list, since
you'd then either have to escape it or it would mean "any but these
characters".
Another note: the - character needs to be the first or last one on the
list, otherwise you'd have to escape it or it would define a range
(e.g. :-, would mean "all characters in the range : to ,").
So, in order to keep consistency and not depend on character
positioning, you might want to escape all those characters that have a
special meaning in regular expressions (the following list is not
complete, so be aware of other characters like (, {, $ etc.):
String result = yourString.replaceAll("[\\-\\+\\.\\^:,]","");
If you want to get rid of all punctuation and symbols, try this regex:
[\p{P}\p{S}] (keep in mind that in Java strings you'd have to escape
back slashes: "[\\p{P}\\p{S}]").
A third way could be something like this, if you can exactly define
what should be left in your string:
String result = yourString.replaceAll("[^\\w\\s]","");
Here's less restrictive alternative to the "define allowed characters"
approach, as suggested by Ray:
String result = yourString.replaceAll("[^\\p{L}\\p{Z}]","");
The regex matches everything that is not a letter in any language and
not a separator (whitespace, linebreak etc.). Note that you can't use
[\P{L}\P{Z}] (upper case P means not having that property), since that
would mean "everything that is not a letter or not whitespace", which
almost matches everything, since letters are not whitespace and vice
versa.
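To compare the variants from the quoted answer side by side, here is a small sketch. The class and method names are invented, and the accented letters are written as Unicode escapes so the source stays ASCII.

```java
final class CharacterFilterDemo {
    // Keeps word characters and whitespace; Java's \w is ASCII-only by default.
    static String keepWordAndSpace(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    // Keeps letters of any language and separators; digits are removed too.
    static String keepLettersAndSeparators(String s) {
        return s.replaceAll("[^\\p{L}\\p{Z}]", "");
    }

    // Removes punctuation and symbols, keeps everything else.
    static String stripPunctuationAndSymbols(String s) {
        return s.replaceAll("[\\p{P}\\p{S}]", "");
    }

    public static void main(String[] args) {
        String input = "H\u00e9llo, w\u00f6rld! 123";
        System.out.println(keepWordAndSpace(input));         // accented letters are dropped
        System.out.println(keepLettersAndSeparators(input)); // accented letters kept, digits dropped
        System.out.println(stripPunctuationAndSymbols("a+b=c, ok!"));
    }
}
```

The first variant silently drops accented letters, which is worth knowing before choosing it for non-English text.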
I have a string s that I want to split up so each part separated by the "|" symbol becomes an element of a string array.
Is this how I would go about doing so?
String s = "FirstName1 LastName1|FirstName2 LastName2|FirstName3 LastName4|";
String [] names = s.split("|");
I then want to add these elements to an ArrayList. I did the following
for(int i = 0; i < names.length; i++)
{
friendsNames.add(names[i]);
}
But my ArrayList reads as follows.
Element 1: F
Element 2: I
Element 3: R
Element 4: S
Element 5: T
Element 6:
Element 7: N
etc.
Any suggestions for where I am going wrong?
You need to escape the "|" since it's a special character in the regex world.
String [] names = s.split("\\|");
Why 2 "\" ?
For the regex expression, you escape the "|" with "\|". But since "\" is the escape character for Java, you need to escape it to preserve the "\" for the regex expression.
The pipe is a special character in the regular expression accepted by the split method. Therefore you must escape it so that it is interpreted as a literal separator character.
import static java.util.Arrays.asList;
//...
String s = "FirstName1 LastName1|FirstName2 LastName2|FirstName3 LastName4|";
List<String> items = asList(s.split("\\|"));
If you don't intend your delimiter to represent a regular expression, use the Pattern.quote() method to escape any special characters it contains.
String[] items = s.split(Pattern.quote("|"));
The pipe character, | indicates an alternative in a regular expression. So your split specification means, "split on any empty string, or any empty string." There are empty strings between every character, so that's why every character is split apart.
Special regex characters, like pipe, can be "escaped" with \. This disables it as a special character, and makes it match the | character in input instead.
To complicate things, \ has special meaning in Java character and character string literals, so it needs to be escaped too—with another \ character! This means that, altogether, your delimiter should be "\\|".
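Putting the answers together, here is a minimal sketch that splits the string from the question with both the escaped delimiter and Pattern.quote, then collects the names into a list. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class PipeSplitDemo {
    // Escaped pipe: the regex \| matches a literal | character.
    static List<String> splitEscaped(String s) {
        return new ArrayList<>(Arrays.asList(s.split("\\|")));
    }

    // Pattern.quote builds a regex that matches the delimiter literally.
    static List<String> splitQuoted(String s) {
        return new ArrayList<>(Arrays.asList(s.split(Pattern.quote("|"))));
    }

    public static void main(String[] args) {
        String s = "FirstName1 LastName1|FirstName2 LastName2|FirstName3 LastName4|";
        // Both print three names; the trailing empty string is dropped by split.
        System.out.println(splitEscaped(s));
        System.out.println(splitQuoted(s));
    }
}
```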
I'm new to regular expressions...
I need a regular expression that matches a string containing only:
0-9, a-z, A-Z, space, comma, and single quote.
If the string contains any character outside that set, it is invalid.
Is that something like:
Pattern p = Pattern.compile("\\s[a-zA-Z0-9,']");
Matcher m = p.matcher("to be or not");
boolean b = m.lookingAt();
Thank you!
Fix your expression by adding bounds:
Pattern p = Pattern.compile("^\\s[a-zA-Z0-9,']+$");
Now you can call m.find() and be sure that it returns true only if your string consists solely of the enumerated symbols.
BTW, is it a mistake that you put \\s at the beginning? It means that the string must start with a single whitespace character. If that is not a requirement, just remove it.
You need to include the space inside the character class and allow more than one character:
Pattern p = Pattern.compile("[\\sa-zA-Z0-9,']*");
Matcher m = p.matcher("to be or not");
boolean b = m.matches();
Note that \s will match any whitespace character (including newlines, tabs, carriage returns, etc.) and not only the space character.
You probably want something like this:
"^[a-zA-Z0-9,' ]+$"
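As a rough check of the last pattern, the sketch below validates a few strings with Matcher.matches(), which already requires the whole input to match, so the ^ and $ anchors are left out. The class and method names are made up, and rejecting the empty string is an assumption that follows from using + rather than *.

```java
import java.util.regex.Pattern;

final class AllowedCharsDemo {
    // Letters, digits, comma, single quote and the plain space character only.
    private static final Pattern ALLOWED = Pattern.compile("[a-zA-Z0-9,' ]+");

    // Hypothetical helper: true when every character of s is in the allowed set.
    static boolean isValid(String s) {
        return ALLOWED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("to be or not")); // true
        System.out.println(isValid("it's 9, ok"));   // true
        System.out.println(isValid("hello!"));       // false
    }
}
```

By contrast, lookingAt() from the question only anchors at the start, so it would accept "to be!" because a valid prefix exists.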