Regex for special characters in Java

public static final String specialChars1= "\\W\\S";
String str2 = str1.replaceAll(specialChars1, "").replace(" ", "+");
public static final String specialChars2 = "`~!##$%^&*()_+[]\\;\',./{}|:\"<>?";
String str2 = str1.replaceAll(specialChars2, "").replace(" ", "+");
Whatever str1 is, I want all characters other than letters and numbers to be removed, and spaces to be replaced by a plus sign (+).
My problem is that if I use specialChars1, it does not remove some characters like ;, ', ", and if I use specialChars2 it gives me an error:
java.util.regex.PatternSyntaxException: Syntax error U_REGEX_MISSING_CLOSE_BRACKET near index 32:
How can this be achieved? I have searched but could not find a perfect solution.

This worked for me:
String result = str.replaceAll("[^\\dA-Za-z ]", "").replaceAll("\\s+", "+");
For this input string:
/-+!##$%^&())";:[]{}\ |wetyk 678dfgh
It yielded this result:
+wetyk+678dfgh

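replaceAll expects a regex, so wrap the characters in a character class and escape [, ] and \ inside it (an unescaped [ would open a nested class in Java):
public static final String specialChars2 = "[`~!##$%^&*()_+\\[\\]\\\\;\',./{}|:\"<>?]";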
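If keeping a hand-written character class correct feels fragile, the same list of characters can be filtered without any regex. The following is only a sketch: the class CharFilter and the helper removeChars are hypothetical names, not part of the question, and the list of unwanted characters is an example.

```java
class CharFilter {
    // Hypothetical helper: drops every character of `input` that appears in `unwanted`.
    // No regex is involved, so nothing in `unwanted` needs escaping for the regex engine.
    static String removeChars(String input, String unwanted) {
        StringBuilder kept = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            if (unwanted.indexOf(c) < 0) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Only Java string escaping is needed here (\\ for a backslash, \" for a quote).
        String unwanted = "`~!@#$%^&*()_+[]\\;',./{}|:\"<>?";
        String cleaned = removeChars("a[b]c; d|e", unwanted).replace(" ", "+");
        System.out.println(cleaned); // abc+de
    }
}
```

String.replace(" ", "+") takes literal text rather than a regex, so it is safe to keep as in the question.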

The problem with your first regex is that "\W\S" means: find a sequence of two characters, the first of which is not a word character (letter, digit or underscore), followed by a character which is not whitespace.
What you mean is "[^\w\s]", which means: find a single character which is neither a word character nor whitespace. (We can't use "[\W\S]", as this means find a character which is not a word character OR is not whitespace, which is essentially every character.) Note that \w also matches the underscore, so use "[^A-Za-z0-9\s]" if underscores should be removed too.
The second regex is a problem because you are trying to use reserved characters without escaping them. You can enclose them in [], where most characters (not all) lose their special meaning, but the whole thing would look very messy and you would have to check that you haven't missed any punctuation.
Example:
String sequence = "qwe 123 :#~ ";
String withoutSpecialChars = sequence.replaceAll("[^\\w\\s]", "");
String spacesAsPluses = withoutSpecialChars.replaceAll("\\s", "+");
System.out.println("without special chars: '"+withoutSpecialChars+ '\'');
System.out.println("spaces as pluses: '"+spacesAsPluses+'\'');
This outputs:
without special chars: 'qwe 123 '
spaces as pluses: 'qwe+123++'
If you want to group multiple spaces into one + then use "\s+" as your regex instead (remember to escape the backslash in the Java string).
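To make the difference between these patterns concrete, here is a small sketch that runs each of them on the same sample. The class name PatternComparison, the helper names and the sample string are made up for illustration.

```java
class PatternComparison {
    // Removes every match of `regex` from `input`.
    static String strip(String input, String regex) {
        return input.replaceAll(regex, "");
    }

    // Keeps ASCII letters, digits and whitespace, then turns each whitespace run into one '+'.
    static String toPlusSeparated(String input) {
        return input.replaceAll("[^A-Za-z0-9\\s]", "").replaceAll("\\s+", "+");
    }

    public static void main(String[] args) {
        String sample = "x; y_';z";
        // Two-character sequence: eats " y" and "';" but leaves the first ';' behind.
        System.out.println(strip(sample, "\\W\\S"));           // x;_z
        // Single character that is neither word character nor whitespace; '_' survives.
        System.out.println(strip(sample, "[^\\w\\s]"));        // x y_z
        // Explicit letters and digits only; '_' is removed as well.
        System.out.println(strip(sample, "[^A-Za-z0-9\\s]"));  // x yz
        System.out.println(toPlusSeparated(sample));           // x+yz
    }
}
```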

I had a similar problem to solve and I used the following method:
text.replaceAll("\\p{Punct}+", "").replaceAll("\\s+", "+");
Code with a timing benchmark:
public static String cleanPunctuations(String text) {
return text.replaceAll("\\p{Punct}+", "").replaceAll("\\s+", "+");
}
public static void test(String in){
long t1 = System.currentTimeMillis();
String out = cleanPunctuations(in);
long t2 = System.currentTimeMillis();
System.out.println("In=" + in + "\nOut="+ out + "\nTime=" + (t2 - t1)+ "ms");
}
public static void main(String[] args) {
String s1 = "My text with 212354 digits spaces and \n newline \t tab " +
"[`~!##$%^&*()_+[\\\\]\\\\\\\\;\\',./{}|:\\\"<>?] special chars";
test(s1);
String s2 = "\"Sample Text=\" with - minimal \t punctuation's";
test(s2);
}
Sample Output
In=My text with 212354 digits spaces and
newline tab [`~!##$%^&*()_+[\\]\\\\;\',./{}|:\"<>?] special chars
Out=My+text+with+212354+digits+spaces+and+newline+tab+special+chars
Time=4ms
In="Sample Text=" with - minimal punctuation's
Out=Sample+Text+with+minimal+punctuations
Time=0ms
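One thing to keep in mind with this approach: by default \p{Punct} in Java covers only ASCII punctuation, so symbols such as the euro sign are left in place. The sketch below contrasts it with a Unicode-aware negated class. The names PunctDemo, asciiPunct and unicodeAware are hypothetical, and the sample string is invented.

```java
class PunctDemo {
    // The approach from the answer above: strip ASCII punctuation, then join words with '+'.
    static String asciiPunct(String text) {
        return text.replaceAll("\\p{Punct}+", "").replaceAll("\\s+", "+");
    }

    // Alternative: strip everything that is not a Unicode letter, a Unicode number or whitespace.
    static String unicodeAware(String text) {
        return text.replaceAll("[^\\p{L}\\p{N}\\s]+", "").replaceAll("\\s+", "+");
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9 \u20ac5, ok!"; // "café €5, ok!"
        System.out.println(asciiPunct(sample));   // café+€5+ok
        System.out.println(unicodeAware(sample)); // café+5+ok
    }
}
```

Which of the two is preferable depends on whether non-ASCII symbols can occur in the input at all.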

you can use a regex like this:
[<#![CDATA[¢<(+|!$*);¬/¦,%_>?:#="~{#}\]]]#>]`
remove "#" at first and at end from expression
regards

@npinti
Using "\w" is almost the same as "\dA-Za-z", except that "\w" also matches the underscore.
This worked for me:
String result = str.replaceAll("[^\\w ]", "").replaceAll("\\s+", "+");

Related

Matching a whole word with leading or trailing special symbols like dollar in a string

I can replace dollar signs by using Matcher.quoteReplacement. I can replace words by adding boundary characters:
from = "\\b" + from + "\\b";
outString = line.replaceAll(from, to);
But I can't seem to combine them to replace words with dollar signs.
Here's an example. I am trying to replace "$temp4" (NOT $temp40) with "register1".
String line = "add, $temp4, $temp40, 42";
String to = "register1";
String from = "$temp4";
String outString;
from = Matcher.quoteReplacement(from);
from = "\\b" + from + "\\b"; //do whole word replacement
outString = line.replaceAll(from, to);
System.out.println(outString);
Outputs
"add, $temp4, $temp40, 42"
How do I get it to replace $temp4 and only $temp4?
Use unambiguous word boundaries, (?<!\w) and (?!\w), instead of \b, which is context dependent:
from = "(?<!\\w)" + Pattern.quote(from) + "(?!\\w)";
The (?<!\w) is a negative lookbehind that fails the match if there is a word char immediately to the left of the current location, and (?!\w) is a negative lookahead that fails the match if there is a word char immediately to the right of the current location. Pattern.quote(from) is necessary to escape any special chars in the from variable.
See the Java demo:
String line = "add, $temp4, $temp40, 42";
String to = "register1";
String from = "$temp4";
String outString;
from = "(?<!\\w)" + Pattern.quote(from) + "(?!\\w)";
outString = line.replaceAll(from, to);
System.out.println(outString);
// => add, register1, $temp40, 42
Matcher.quoteReplacement() is for the replacement string (to), not the regex (from). To include a string literal in the regex, use Pattern.quote():
from = Pattern.quote(from);
$ has special meaning in regex (it means "end of input"). To remove any special meaning from characters in your target, wrap it in regex quote/unquote expressions \Q...\E. Also, because $ is not a "word" character, the word boundary won't work, so use lookarounds instead:
line = line.replaceAll("(?<!\\S)\\Q" + from + "\\E(?![^ ,])", to);
Normally, Pattern.quote is the way to go to escape characters that may be specially interpreted by the regex engine.
However, the regular expression is still incorrect, because there is no word boundary before the $ in line; space and $ are both non-word characters. You need to place the word boundary after the $ character. There is no need for Pattern.quote here, because you're escaping things yourself.
String from = "\\$\\btemp4\\b";
Or more simply, because you know there is a word boundary between $ and temp4 already:
String from = "\\$temp4\\b";
The from variable can be constructed from the expression to replace. If from has "$temp4", then you can escape the dollar sign and add a word boundary.
from = "\\" + from + "\\b";
Output:
add, register1, $temp40, 42
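Putting the pieces from the answers above together, a reusable helper might look like the sketch below. The class WholeTokenReplace and the method replaceToken are hypothetical names. It quotes the search text with Pattern.quote and, as an extra precaution that the question did not ask for, quotes the replacement with Matcher.quoteReplacement so that a $ in the replacement is not read as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeTokenReplace {
    // Replaces `from` only where it is not glued to a word character on either side.
    static String replaceToken(String line, String from, String to) {
        String regex = "(?<!\\w)" + Pattern.quote(from) + "(?!\\w)";
        return line.replaceAll(regex, Matcher.quoteReplacement(to));
    }

    public static void main(String[] args) {
        String line = "add, $temp4, $temp40, 42";
        System.out.println(replaceToken(line, "$temp4", "register1"));
        // add, register1, $temp40, 42

        // The replacement may contain '$' as well once it is quoted.
        System.out.println(replaceToken("cost: price", "price", "$5"));
        // cost: $5
    }
}
```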

Using a regex to match a word ending in a comma but not within another word

I want to use a regex to achieve two objectives: match a string only when it is a complete word (don't match "on" inside of "contact"), and match strings that end with a comma or period.
This is an example. It is meant to find the string (str2) in str and replace it with the same string surrounded by parentheses.
while(scan2.hasNext()) {
    String str2 = scan2.next();
    str = str.replaceAll("\\b" + str2 + "\\b", "(" + str2 + ")");
}
It does avoid matching strings within words, but it ignores strings that end in a comma or period.
How would I do this?
public class Main {
public static void main(String[] args) {
System.out.println(replace("upon contact", "on"));
System.out.println(replace("upon contact,", "contact"));
System.out.println(replace("upon contact", "contact"));
}
private static String replace(String s1, String s2) {
return s1.replaceAll(String.format("\\b(%s)\\b(?=[.,])", s2), "\\($1\\)");
}
}
upon contact // matches only complete words
upon (contact), // replaces match with (match)
upon contact // only matches if ends with , or .
The following regex matches string ending with comma/period or string composed by a single complete word:
(?s)(^(?<A>\b\w+\b)$)|((?s)^(?<B>.+(?<=[,.]))$)
See also https://regex101.com/r/E78rQV/1/ for more explanations.
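I took the liberty of adding the exclamation point and question mark.
Brackets match any one of the characters inside them, but \b has no word-boundary meaning inside brackets (Java rejects it there), so put the alternatives in a lookahead instead. This also leaves the punctuation in place:
str = str.replaceAll("\\b" + str2 + "(?=\\b|[.,!?])", "(" + str2 + ")");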
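A plain \b already sits between a word and a following comma or period, so one possible reading of the problem is that the tokens coming out of the scanner still carry their trailing punctuation (for example "contact,"). That is an assumption, not something the question states. Under it, a sketch could strip the punctuation from the token first and quote it before building the pattern. WordWrapper, stripTrailingPunctuation and wrapWord are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordWrapper {
    // Assumption: tokens may end in '.' or ',' that should not be part of the search word.
    static String stripTrailingPunctuation(String token) {
        return token.replaceAll("[.,]+$", "");
    }

    // Wraps every whole-word occurrence of `word` in parentheses; punctuation after it is untouched.
    static String wrapWord(String text, String word) {
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement("(" + word + ")"));
    }

    public static void main(String[] args) {
        String text = "upon contact, on time.";
        System.out.println(wrapWord(text, "on"));
        // upon contact, (on) time.
        System.out.println(wrapWord(text, stripTrailingPunctuation("contact,")));
        // upon (contact), on time.
    }
}
```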

java regex replaceAll with negated groups

I'm trying to use the String.replaceAll() method with regex to only keep letter characters and ['-_]. I'm trying to do this by replacing every character that is neither a letter nor one of the characters above by an empty string.
So far I have tried something like this (in different variations) which correctly keeps letters but replaces the special characters I want to keep:
current = current.replaceAll("(?=\\P{L})(?=[^\\'-_])", "");
Make it simpler:
current = current.replaceAll("[^a-zA-Z'_-]", "");
Explanation :
Match any char not in a to z, A to Z, ', _, - and replaceAll() method will replace any matched char with nothing.
Tested input : "a_zE'R-z4r#m"
Output : a_zE'R-zrm
You don't need lookahead, just use negated regex:
current = current.replaceAll("[^\\p{L}'_-]+", "");
[^\\p{L}'_-] will match anything that is not a letter (unicode) or single quote or underscore or hyphen.
Your regex is too complicated. Just specify the characters you want to keep, and use ^ to negate, so [^a-z'_-] means "anything but these".
public class Replacer {
public static void main(String[] args) {
System.out.println("with 1234 &*()) -/.,>>?chars".replaceAll("[^\\w'_-]", ""));
}
}
You can try this:
String str = "Se#rbi323a`and_Eur$ope#-t42he-[A%merica]";
str = str.replaceAll("[\\d+\\p{Punct}&&[^-'_\\[\\]]]+", "");
System.out.println("str = " + str);
This is the result (the backtick is removed as well, because it belongs to \p{Punct} and is not in the excluded set):
str = Serbiaand_Europe-the-[America]
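The negated classes suggested above differ in two ways that are easy to miss: whether non-ASCII letters count as letters, and where the hyphen sits inside the brackets. The sketch below illustrates both. The class KeepLetters, its method names and the sample strings are invented for the example.

```java
class KeepLetters {
    // Keeps ASCII letters plus ' _ - (the hyphen is last, so it is a literal hyphen).
    static String asciiOnly(String s) {
        return s.replaceAll("[^a-zA-Z'_-]", "");
    }

    // Keeps any Unicode letter plus ' _ -
    static String anyLetter(String s) {
        return s.replaceAll("[^\\p{L}'_-]", "");
    }

    // Pitfall: here '-_ is a RANGE from ' to _, which covers digits, upper-case letters
    // and a lot of punctuation, so far less is removed than intended.
    static String misplacedHyphen(String s) {
        return s.replaceAll("[^a-z'-_]", "");
    }

    public static void main(String[] args) {
        String accented = "caf\u00e9_na\u00efve-4#";        // "café_naïve-4#"
        System.out.println(asciiOnly(accented));            // caf_nave-
        System.out.println(anyLetter(accented));            // café_naïve-
        System.out.println(misplacedHyphen("A-b_c'd9?"));   // A-b_c'd9?
        System.out.println(asciiOnly("A-b_c'd9?"));         // A-b_c'd
    }
}
```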

Java - Split string

I have a string which is separated by "." but when I try to split it by the dot it is not getting split.
Here is the exact code I have. Please let me know what could cause this not to split the string.
public class TestStringSplit {
public static void main(String[] args) {
String testStr = "[Lcom.hexgen.ro.request.CreateRequisitionRO;";
String test[] = testStr.split(".");
for (String string : test) {
System.out.println("test : " + string);
}
System.out.println("Str Length : " + test.length);
}
}
I have to separate the above string and get only the last part. In the above case it is CreateRequisitionRO, not CreateRequisitionRO;. Please help me to get this.
You can split this string with StringTokenizer and get each word between the dots:
StringTokenizer tokenizer = new StringTokenizer(string, ".");
String firstToken = tokenizer.nextToken();
String secondToken = tokenizer.nextToken();
As you are looking for the last word, CreateRequisitionRO, you can also use:
String testStr = "[Lcom.hexgen.ro.request.CreateRequisitionRO;";
String yourString = testStr.substring(testStr.lastIndexOf('.')+1, testStr.length()-1);
String testStr = "[Lcom.hexgen.ro.request.CreateRequisitionRO;";
String test[] = testStr.split("\\.");
for (String string : test) {
System.out.println("test : " + string);
}
System.out.println("Str Length : " + test.length);
The "." is a regular expression wildcard; you need to escape it.
Change String test[] = testStr.split("."); to String test[] = testStr.split("\\.");.
Because String.split takes a regex argument, you need to escape the dot character (which is a wildcard in regex).
Note that String.split takes a regular expression, and . has special meaning in a regular expression (it matches any character except line terminators), so you need to escape it:
String test[] = testStr.split("\\.");
Note that you escape the . at the level of regular expression once: \., and to specify \. in a string literal, \ needs to be escaped again. So the string to pass to String.split is "\\.".
Another way is to specify it inside a character class, where . loses its special meaning:
String test[] = testStr.split("[.]");
You need to escape the . as it is a special character in a regular expression. Your split line needs to be:
String test[] = testStr.split("\\.");
Split takes a regular expression as a parameter. If you want to split by the literal ".", you need to escape the dot because that is a special character in a regular expression. Try putting 2 backslashes before your dot ("\\.") - hopefully that does what you are looking for.
String test[] = testStr.split("\\.");
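As a combined sketch of the answers above: splitting on an unescaped "." treats every character as a delimiter, which leaves only empty strings that split then discards, hence the empty array. The helper below uses Pattern.quote as yet another way to get a literal dot and also drops the trailing semicolon. SplitDemo and lastSegment are hypothetical names.

```java
import java.util.regex.Pattern;

class SplitDemo {
    // Returns the text after the last dot, without a trailing ';' if there is one.
    static String lastSegment(String s) {
        String[] parts = s.split(Pattern.quote("."));
        String last = parts[parts.length - 1];
        return last.endsWith(";") ? last.substring(0, last.length() - 1) : last;
    }

    public static void main(String[] args) {
        String testStr = "[Lcom.hexgen.ro.request.CreateRequisitionRO;";
        // Unescaped dot: every character is a delimiter, so nothing is left.
        System.out.println(testStr.split(".").length);   // 0
        // Escaped dot: five pieces.
        System.out.println(testStr.split("\\.").length); // 5
        System.out.println(lastSegment(testStr));        // CreateRequisitionRO
    }
}
```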

java - Why replaceAll is not working?

I'm starting to learn regex and I don't know if I understand it correctly.
I have a problem with the replaceAll function because it does not replace the characters in a string that I want to replace.
Here is my code:
public class TestingRegex {
public static void main (String args[]) {
String string = "Hel%l&+++o_Wor_++l%d&#";
char specialCharacters[] = {'%', '%', '&', '_'};
for (char sc : specialCharacters) {
if (string.contains(sc + ""))
string = string.replaceAll(sc + "", "\\" + sc);
}
System.out.println("New String: " + string);
}
}
The output is the same as the original. Nothing changed.
I want the output to be : Hel\%l\&+++o\_Wor\_++l\%d\&\#.
Please help. Thanks in advance.
The reason why it's not working: in the replacement string the backslash is itself an escape character, so you need four backslashes in a Java string literal to produce a single "real" backslash in the output.
string = string.replaceAll(sc + "", "\\\\" + sc);
should work. But this is not the right way to do it. You don't need a for loop at all:
String string = "Hel%l&+++o_Wor_++l%d&#";
string = string.replaceAll("[%&_]", "\\\\$0");
and you're done.
Explanation:
[%&_] matches any of the three characters you want to replace
$0 is the result of the match, so
"\\\\$0" means "a backslash plus whatever was matched by the regex".
Caveat: This solution is obviously not checking whether any of those characters had already been escaped previously. So
Hello\%
would become
Hello\\%
which you would not want to happen. Could this be a problem?
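If double escaping is a concern, one possible way around it is a negative lookbehind that skips characters already preceded by a backslash. This is only a sketch under assumptions: EscapeDemo and escapeSpecials are hypothetical names, # is included in the class because the desired output in the question escapes it, and an escaped backslash directly before a special character (as in \\%) would be misread as an existing escape.

```java
class EscapeDemo {
    // Prefixes %, &, _ and # with a backslash unless a backslash is already in front of them.
    static String escapeSpecials(String s) {
        // Regex: (?<!\\)[%&_#]   Replacement: a literal backslash followed by the match ($0).
        return s.replaceAll("(?<!\\\\)[%&_#]", "\\\\$0");
    }

    public static void main(String[] args) {
        System.out.println(escapeSpecials("Hel%l&+++o_Wor_++l%d&#"));
        // Hel\%l\&+++o\_Wor\_++l\%d\&\#
        System.out.println(escapeSpecials("Hello\\%"));
        // Hello\%   (already escaped, left alone)
    }
}
```

Because already escaped characters are skipped, running the method twice gives the same result as running it once.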
