Regular expression to find everything except a pattern - java

I'm pretty new to regular expressions and am looking for one that matches everything except what a given regex matches. I've found ways to match anything except a specific string, but I need it to exclude whatever a regex matches. It also has to work in Java.
Background: I am working with ANSI-colored strings. I want to take a string whose text may be formatted with ANSI color codes and remove everything except those color codes. The result should give me the color formatting that would apply to any character appended to the string.
A formatted string may look like this:
Hello \u001b[31;44mWorld\u001b[0m!
which would display as Hello World! where the World would be colored red on a blue background.
My regex to find the codes is
\u001b\[\d+(;\d+)*m
Now I want a regex that matches everything but the color codes, so it matches
Hello \u001b[31;44m World \u001b[0m !

Your regex in context:
public static void main(String[] args) {
String input = "Hello \u001b[31;44mWorld\u001b[0m!";
String result = Pattern.compile("\u001b\\[\\d+(;\\d+)*m").matcher(input).replaceAll("");
System.out.println("Output: '" + result + "'");
}
Output:
Output: 'Hello World!'

Regex isn't really meant to give 'everything but' a match. The easiest general way to do something like this is to match what you can describe (the color codes, in your case) and then remove those matches from the string you have. What is left is 'everything but' the match.
Quick sample (very untested)
String everythingBut = "string that has regex matches".replaceAll("r[eg]+x ", "");
This should result in "string that has matches", i.e. the inverse of your regex.

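String text = "Hello \u001b[31;44mWorld\u001b[0m!";
String codes = text;
// Split on the bracketed codes to get the plain-text pieces, then delete each piece from a copy.
// The pattern has no \u001b, so the escape character stays in the pieces and is deleted with them.
for (String s : text.split("\\[([;0-9]+)m")) {
    codes = codes.replace(s, ""); // literal replace; replaceAll would treat s as a regex
}
System.out.println(codes);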
OUTPUT:
[31;44m[0m

You can do it like this. It simply finds all the matches and puts them in an array, which can be joined into a String if desired. Note that Matcher.results() requires Java 9 or later.
String pat = "\u001b\\[\\d+(;\\d+)*m";
String html = "Hello \u001b[31;44mWorld\u001b[0m!";
Matcher m = Pattern.compile(pat).matcher(html);
String[] s = m.results().map(mr->mr.group()).toArray(String[]::new);
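
For readers on Java 8, the same idea can be sketched with a plain find() loop that joins the codes as it goes. This is a minimal sketch, not a definitive implementation: the class name AnsiCodes and the method codesOnly are made up for illustration, and the pattern is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnsiCodes {
    // Pattern from the question: ESC, "[", digits, optional ";digits" groups, "m"
    private static final Pattern CODE = Pattern.compile("\u001b\\[\\d+(;\\d+)*m");

    /** Returns only the colour codes found in text, concatenated in order of appearance. */
    static String codesOnly(String text) {
        StringBuilder codes = new StringBuilder();
        Matcher m = CODE.matcher(text);
        while (m.find()) {
            codes.append(m.group());
        }
        return codes.toString();
    }

    public static void main(String[] args) {
        String codes = codesOnly("Hello \u001b[31;44mWorld\u001b[0m!");
        // Make the escape character visible instead of letting the terminal interpret it
        System.out.println(codes.replace("\u001b", "<ESC>"));
    }
}
```

Appending the returned string in front of new text should reproduce the formatting state at the end of the original string, assuming every code in the input matches the pattern.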

Related

Remove everything from String which is not on an allowlist using regex

The following regular expression removes each listed word from a string:
String regex = "\\b(operation|for the|am i|regex|mountain)\\b";
String sentence = "I am looking for the inverse operation by using regex";
String s = Pattern.compile(regex).matcher(sentence.toLowerCase()).replaceAll("");
System.out.println(s); // output: "i am looking  inverse  by using " (the spaces around each removed word remain)
I am looking for the inverse operation using regex, so the following example should work.
The entries "am i" and "mountain" just indicate that the list can contain many more words. Entries containing spaces can also occur in the list.
String regex = "<yet to find>"; // contains words operation,for the,am i,regex,mountain
String sentence = "I am looking for the inverse operation by using regex";
String s = Pattern.compile(regex).matcher(sentence.toLowerCase()).replaceAll("");
System.out.println(s); // output: " for the operation regex"
Regards, Harris
Try the regex:
(?:(?!for the|operation|am i|mountain|regex).)*(for the|operation|am i|mountain|regex|$)
Replace each match with the contents of group 1 (\1 or $1; Java uses $1).
Explanation:
(?:(?!for the|operation|am i|mountain|regex).)* - matches zero or more characters, one at a time, as long as the current position is not the start of for the, operation, am i, mountain or regex
(for the|operation|am i|mountain|regex|$) - matches for the, operation, am i, mountain, regex or the end of the string, and captures it in group 1
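
The pattern above can be tried out in Java roughly as follows. This is a sketch under some assumptions: the class name KeepListedWords and the method keepListed are invented for illustration, and the replacement is "$1 " rather than "$1" because replacing with group 1 alone leaves the kept words directly adjacent to each other; the trailing separator is then trimmed.

```java
import java.util.regex.Pattern;

class KeepListedWords {
    // Word list from the question; entries are used as-is, without word boundaries
    private static final String WORDS = "for the|operation|am i|mountain|regex";

    // Skip characters that do not start a listed word, then capture a listed word or end of input
    private static final Pattern SKIP_THEN_KEEP =
            Pattern.compile("(?:(?!" + WORDS + ").)*(" + WORDS + "|$)");

    /** Keeps only the listed words, separated by single spaces (separator choice is an assumption). */
    static String keepListed(String sentence) {
        return SKIP_THEN_KEEP.matcher(sentence.toLowerCase())
                .replaceAll("$1 ")
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(keepListed("I am looking for the inverse operation by using regex"));
    }
}
```

Because the alternation has no \b anchors, a listed word that appears inside a longer word is kept as well; adding word boundaries would be a further refinement.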
To expand on Singh's answer in the comments, I'd add that hard-coding the regex for a set of words is not very portable. What if the words change? Are they just words or are they patterns? Can you isolate the part of code that will do this work and test it?
Assuming they're just words:
Define a whitelist
String[] whitelist = {
"operation",
"for",
"the",
"am i",
"regex",
"mountain"
};
Write a method for filtering the words so that only the whitelisted ones are allowed.
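String sanitized(String raw, String[] whitelist) {
    // Build the alternation word1|word2|...; Pattern.quote keeps each entry literal
    StringBuilder termsInOr = new StringBuilder();
    for (String word : whitelist) {
        termsInOr.append("|").append(Pattern.quote(word));
    }
    // Lazily skip text up to the next whitelisted term, or up to the end of the line
    String regex = ".*?(?:\\b(" + termsInOr.substring(1) + ")\\b|$)";
    // Keep only the captured term (group 1), followed by a space as a separator
    return Pattern.compile(regex, Pattern.MULTILINE)
            .matcher(raw)
            .replaceAll("$1 ")
            .trim();
}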
This way the logic is isolated: you have two inputs (a whitelist and the raw string) and one sanitized output. It can be tested with assertions based on your expected output (test cases), and if you have a different whitelist or raw string somewhere else in the code, you can call the method with those instead.

Java | Split words and round brackets with its content into elements of a String Array using regex

Hopefully you can help me out, since I'm really bad at regex.
Given these examples of String input patterns:
"string1 string2 (more strings here)"
"string1 (more words)"
"str1 str2 str3 [...] strn [...] (words. again.)"
I want to end up with a String[] that looks like this:
["string1", "string2", "(more strings here)"]
Basically it should detect each word, and everything (including non-word characters) inside round brackets, as an individual group and put the groups in a String array.
I understand that this captures the round brackets and their content: (\((.*?)\))
and this captures the words: (\w+)
but I have no idea how to combine them. Or is there a better alternative in Java?
// match a run of word characters, or everything between "(" and ")"
Pattern pattern = Pattern.compile("(\\w+|\\(.*?\\))");
Matcher matcher = pattern.matcher("string1 (more words)"); // input string
List<String> stringArrayList = new ArrayList<>();
// call find() repeatedly to get each successive match in the input
while (matcher.find()) {
    stringArrayList.add(matcher.group());
}
String[] output = stringArrayList.toArray(new String[0]); // final output
for (String entry : output) {
    System.out.println(entry); // printing
}
You could match the string with the following regular expression (with the case-insensitive flag set), collecting the matches in an array.
"\\([^)]*\\)|[a-z\\d]+"
Start your Java engine! (click "Java")
The following link to regex101.com uses the equivalent regex for the PCRE (PHP) engine. I've included that to allow the reader to examine how each part of the regex works. (Move the cursor around to see interesting details pop up on the screen.)
Start your PCRE engine!
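
As a rough sketch of how that second pattern might be used in Java, the snippet below compiles it with Pattern.CASE_INSENSITIVE and collects the matches. The class name BracketTokenizer and the method tokenize are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketTokenizer {
    // Either a whole "(...)" group without nested ")" or a run of letters and digits
    private static final Pattern TOKEN =
            Pattern.compile("\\([^)]*\\)|[a-z\\d]+", Pattern.CASE_INSENSITIVE);

    /** Splits the input into words and bracketed groups, in order of appearance. */
    static String[] tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("string1 string2 (more strings here)")));
    }
}
```

The bracket alternative comes first so that an opening parenthesis is consumed as a whole group before the word alternative gets a chance to match its contents.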

Regex to remove only special characters and not other language letters

I used a regular expression to remove special characters from a name. The expression removes every character except the letters of the English alphabet and whitespace.
public static void main(String args[]) {
String name = "Özcan Sevim.";
name = name.replaceAll("[^a-zA-Z\\s]", " ").trim();
System.out.println(name);
}
Output:
zcan Sevim
Expected Output:
Özcan Sevim
I get the wrong result doing it this way. The right way would be to remove special characters based on their ASCII codes so that letters from other languages are not removed. Can someone help me with a regex that removes only the special characters?
You can use \p{IsLatin} or \p{IsAlphabetic}
name = name.replaceAll("[^\\p{IsLatin}]", " ").trim();
Or to remove the punctuation just use \p{Punct} like this :
name = name.replaceAll("\\p{Punct}", " ").trim();
Outputs
Özcan Sevim
Take a look at the full list in the Summary of regular-expression constructs and use whichever construct helps you.
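
A variant of the same idea, sketched with \p{IsAlphabetic} so that letters from any script are kept. The class name NameCleaner and the method clean are illustrative names only, and dropping the unwanted characters (instead of replacing them with a space) is an assumption about the desired result.

```java
class NameCleaner {
    /** Removes everything that is neither alphabetic (in any script) nor whitespace. */
    static String clean(String name) {
        return name.replaceAll("[^\\p{IsAlphabetic}\\s]", "").trim();
    }

    public static void main(String[] args) {
        // \u00D6 is the letter O with diaeresis, written as an escape to avoid source-encoding issues
        System.out.println(clean("\u00D6zcan Sevim."));
    }
}
```

Digits and underscores are not alphabetic, so they are removed too; extend the character class if they should be kept.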
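Use Guava's CharMatcher for that; it will be easier to read and maintain. Matching on ASCII alone would also strip Ö, so retain letters and whitespace instead:
name = CharMatcher.javaLetter().or(CharMatcher.whitespace()).retainFrom(name);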
Use [\W+] or "[^a-zA-Z0-9]" as the regex to match any special character, and use String.replaceAll(regex, String) to replace each special character with an empty string. Remember that the first argument of String.replaceAll is a regex, so a character that should be treated literally has to be escaped with a backslash.
String string= "hjdg$h&jk8^i0ssh6";
Pattern pt = Pattern.compile("[^a-zA-Z0-9]");
Matcher match= pt.matcher(string);
while(match.find())
{
String s= match.group();
string=string.replaceAll("\\"+s, "");
}
System.out.println(string);

replacing constant value in java Regex with other value if it matches that Regex

I have learned a little about Java regex, and in my project I have to do some text replacement. For example, I have this line:
db.articles.Find(112);
I want to replace Find with byId in every occurrence that matches this regex:
\s[a-zA-Z]+(\.)[a-zA-Z]+(\.)Find\([0-9]+\);
I wrote this Java code:
public static void main(String[] args) {
String data = " db.articles.Find(112);";
String regex = "\\s[a-zA-Z]+(\\.)[a-zA-Z]+(\\.)Find\\([0-9]+\\);";
data = data.replaceAll(regex, "byId");
System.out.println(data); // output is byId
// but i want output something like this -> db.articles.byId(112);
}
but it does not work as expected.
Example input
db.articles.Find(12);
dbContex.users.Find(1);
Db.libs.Find(50);
Example output
db.articles.byId(12);
dbContex.users.byId(1);
Db.libs.byId(50);
The replaceAll() method replaces the entire matched string with the replacement value, so you need to capture the parts you want to keep, and insert them in the replacement value:
replaceAll("\\b([a-zA-Z]+\\.[a-zA-Z]+\\.)Find(\\([0-9]+\\);)", "$1byId$2")
See regex101 for demo.
Changes applied:
Replaced \s with \b (word boundary)
Removed capturing of periods ((\\.) -> \\.)
Added capturing of text before and after Find
Added captured text to replacement ($1 and $2)
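
Putting the changes listed above together, a small self-contained sketch might look like this. The class name FindToById and the method rename are hypothetical; the regex and replacement string are the ones from this answer.

```java
class FindToById {
    /** Renames Find to byId in calls of the form name.name.Find(digits); and keeps the rest. */
    static String rename(String source) {
        // $1 = "name.name." before Find, $2 = "(digits);" after it
        return source.replaceAll(
                "\\b([a-zA-Z]+\\.[a-zA-Z]+\\.)Find(\\([0-9]+\\);)",
                "$1byId$2");
    }

    public static void main(String[] args) {
        String input = "db.articles.Find(12);\ndbContex.users.Find(1);\nDb.libs.Find(50);";
        System.out.println(rename(input));
    }
}
```

Text that does not match the full pattern, such as a Find call without two dotted names in front of it, is left unchanged.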

Regular Expression to get parts of the string which are not inside of $( )

Can you please help me write a regular expression for the string below?
Hi $(abc) frnd $(xyz)
In this text I want to match all words that are not surrounded by $( ). So, in the above string, I want to match Hi and frnd.
I tried \$((.[^)]*.)) but it matches $(abc) and $(xyz), whereas I want to match the text outside those markers.
Can you use negative lookbehind in Java? This seems to work in C# (but you never can tell 100% with regexes!)
(?<!\$\([A-Za-z]*)[A-Za-z]+(?!\))
You can either split the string into parts that don't contain $(...), or you can use replaceAll function to remove the $(...).
// Raw regex: \$\([^)]+\)
str.split("\\$\\([^)]+\\)");
str.replaceAll("\\$\\([^)]+\\)", "")
Then you can extract text all you want. The regex assumes that the text in between $(...) doesn't allow ) to be specified. In cases such as $(abc$(crap)_outside, only _outside will be left after the replacement.
It is possible to write a single regex to pick out the words and ignore the $(...), by using last match boundary \G, but it is simpler to do as above: remove the $(...) parts before matching the text.
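
The simpler two-step route (remove the $(...) parts, then match the remaining words) could be sketched as follows. The class name OutsideDollarParens and the method wordsOutside are invented for this example, and replacing each $(...) with a space instead of an empty string is an assumption made so that text on either side of a removed part is not glued together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutsideDollarParens {
    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Returns the words that are not inside a $(...) group. */
    static List<String> wordsOutside(String text) {
        // Step 1: blank out every $(...) group (no nested ")" allowed, as in the regex above)
        String stripped = text.replaceAll("\\$\\([^)]+\\)", " ");
        // Step 2: collect the remaining words
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(stripped);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsOutside("Hi $(abc) frnd $(xyz)"));
    }
}
```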
Try the code below; it prints the desired output.
public static void main(String[] args)
{
String text = "Hi $(abc) frnd $(xyz)";
String[] textArr = text.split("\\$\\([^)]*\\)");
for (String word : textArr)
{
System.out.println(word.trim());
}
}
