Remove duplicated characters from String using regex keeping first occurrences


I know how to remove duplicated characters from a String while keeping the first occurrences, without regex:
String method(String s){
String result = "";
for(char c : s.toCharArray()){
result += result.contains(c+"")
? ""
: c;
}
return result;
}
// Example input: "Type unique chars!"
// Output: "Type uniqchars!"
I also know how to remove duplicated characters from a String while keeping the last occurrences, with regex:
String method(String s){
return s.replaceAll("(.)(?=.*\\1)", "");
}
// Example input: "Type unique chars!"
// Output: "Typnique chars!"
As for my question: Is it possible, with a regex, to remove duplicated characters from a String, but keep the first occurrences instead of the last?
As for why I'm asking: I came across this codegolf answer using the following function (based on the first example above):
String f(char[]s){String t="";for(char c:s)t+=t.contains(c+"")?"":c;return t;}
and I was wondering if this can be done shorter with a regex and String input. But even if it's longer, I'm just curious in general if it's possible to remove duplicated characters from a String with a regex, while keeping the first occurrences of each character.

This is not the shortest option, and it does not rely on a regex alone, but it is still an option: reverse the string, run the regex you already have, and then reverse the result back.
public static String g(StringBuilder s){
return new StringBuilder(
s.reverse().toString()
.replaceAll("(?s)(.)(?=.*\\1)", ""))
.reverse().toString();
}
See the online Java demo
Note: I suggest adding (?s) (the inline form of the Pattern.DOTALL flag) to the regex so that . matches any character, including line breaks (by default, . does not match line terminators).
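As a minimal sketch of the same reversal idea with a plain String input (the class and method names here are made up for illustration), the steps could look like this:

```java
class DedupKeepFirst {
    // Keeps the first occurrence of every character by reversing the input,
    // dropping every character that occurs again later, and reversing back.
    static String keepFirst(String s) {
        String reversed = new StringBuilder(s).reverse().toString();
        String deduped = reversed.replaceAll("(?s)(.)(?=.*\\1)", "");
        return new StringBuilder(deduped).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(keepFirst("Type unique chars!")); // prints "Type uniqchars!"
    }
}
```

The regex itself is unchanged from the question; only the direction in which the string is scanned differs.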

Related

Remove everything from String which is not on an allowlist using regex

The following regular expression removes each listed word from a string:
String regex = "\\b(operation|for the|am i|regex|mountain)\\b";
String sentence = "I am looking for the inverse operation by using regex";
String s = Pattern.compile(regex).matcher(sentence.toLowerCase()).replaceAll("");
System.out.println(s); // output: "i am looking inverse by using "
I am looking for the inverse operation using a regex, so the following example should work.
The words "am i" and "mountain" just indicate that there can be many more words in the list, and that entries containing spaces can occur in the list as well.
String regex = "<yet to find>"; // contains words operation,for the,am i,regex,mountain
String sentence = "I am looking for the inverse operation by using regex";
String s = Pattern.compile(regex).matcher(sentence.toLowerCase()).replaceAll("");
System.out.println(s); // output: " for the operation regex"
Regards, Harris

Try the regex:
(?:(?!for the|operation|am i|mountain|regex).)*(for the|operation|am i|mountain|regex|$)
Replace each match with the contents of group 1 (\1 or $1, depending on the regex flavor; Java uses $1).
Explanation:
(?:(?!for the|operation|am i|mountain|regex).)* - matches 0+ characters, none of which is the start of for the, operation, am i, mountain, or regex
(for the|operation|am i|mountain|regex|$) - matches either for the or operation or am i or mountain or regex or end of the string and captures it in group 1
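A minimal sketch of how this pattern could be applied in Java, assuming the replacement is "$1" as described above (the class and method names are hypothetical):

```java
class AllowlistFilter {
    // The pattern from the answer: skip everything up to the next allowed
    // term (or the end of the string) and capture that term in group 1.
    static final String REGEX =
        "(?:(?!for the|operation|am i|mountain|regex).)*(for the|operation|am i|mountain|regex|$)";

    static String keepAllowed(String sentence) {
        // Every match is replaced by its captured allowed term.
        return sentence.toLowerCase().replaceAll(REGEX, "$1");
    }

    public static void main(String[] args) {
        String sentence = "I am looking for the inverse operation by using regex";
        System.out.println(keepAllowed(sentence)); // prints "for theoperationregex"
    }
}
```

Note that the kept terms end up directly next to each other, because the text between them (including the spaces) is part of what gets removed.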

To expand on Singh's answer in the comments, I'd add that hard-coding the regex for a set of words is not very portable. What if the words change? Are they just words or are they patterns? Can you isolate the part of code that will do this work and test it?
Assuming they're just words:
Define a whitelist
String[] whitelist = {
"operation",
"for",
"the",
"am i",
"regex",
"mountain"
};
Write a method for filtering the words so that only the whitelisted ones are allowed.
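String sanitized(String raw, String[] whitelist) {
    // Join the whitelisted terms into one alternation: |term1|term2|...
    StringBuilder termsInOr = new StringBuilder();
    for (String word : whitelist) {
        // Pattern.quote keeps any regex metacharacters in a term literal
        termsInOr.append("|").append(Pattern.quote(word));
    }
    // Lazily skip text up to the next whitelisted term, captured in group 1
    String regex = ".*?\\b(" + termsInOr.substring(1) + ")\\b";
    // Assumption: subst was not defined in the original; "$1" keeps only the captured term
    String subst = "$1";
    return Pattern.compile(regex, Pattern.MULTILINE)
        .matcher(raw)
        .replaceAll(subst);
}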
This way the logic is isolated: you have two inputs (a whitelist and the raw string) and one sanitized output. It can be tested with assertions based on your expected output (test cases), and if you have a different whitelist or raw string somewhere else in the code, you can call the method with those instead.

Split a string using multiple delimiters in java [duplicate]

I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.
Example input:
This is a string that "will be" highlighted when your 'regular expression' matches something.
Desired output:
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.
Note that "will be" and 'regular expression' retain the space between the words.

I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:
[^\s"']+|"([^"]*)"|'([^']*)'
I added the capturing groups because you don't want the quotes in the list.
This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
if (regexMatcher.group(1) != null) {
// Add double-quoted string without the quotes
matchList.add(regexMatcher.group(1));
} else if (regexMatcher.group(2) != null) {
// Add single-quoted string without the quotes
matchList.add(regexMatcher.group(2));
} else {
// Add unquoted word
matchList.add(regexMatcher.group());
}
}
If you don't mind having the quotes in the returned list, you can use much simpler code:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}

There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:
parsings strings: extracting words and phrases
Best way to parse Space Separated Text
UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?
m/('.*?'|".*?"|\S+)/g
Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).
This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.
Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)

If you want to allow escaped quotes inside the string, you can use something like this:
(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
Quoted strings will be group 2, single unquoted words will be group 3.
You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/

The regex from Jan Goyvaerts is the best solution I found so far, but creates also empty (null) matches, which he excludes in his program. These empty matches also appear from regex testers (e.g. rubular.com).
If you turn the searches around (first look for the quoted parts and then for the space-separated words), you might do it in one go with:
("[^"]*"|'[^']*'|[\S]+)+

(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s
This will match the spaces not surrounded by double quotes.
I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.

It'll probably be easier to search the string, grabbing each part, vs. split it.
Reason being, you can have it split at the spaces before and after "will be". But, I can't think of any way to specify ignoring the space between inside a split.
(not actual Java)
string = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
regex = "\"(\\\"|(?!\\\").)+\"|[^ ]+"; // search for a quoted or non-spaced group
final = new Array();
while (string.length > 0) {
string = string.trim();
if (Regex(regex).test(string)) {
final.push(Regex(regex).match(string)[0]);
string = string.replace(regex, ""); // progress to next "word"
}
}
Also, capturing single quotes could lead to issues:
"Foo's Bar 'n Grill"
//=>
"Foo"
"s Bar "
"n"
"Grill"

String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);
for (int i = 0; i < len; i++)
{
m.region(i, len);
if (m.lookingAt())
{
String s = m.group(1);
if ((s.startsWith("\"") && s.endsWith("\"")) ||
(s.startsWith("'") && s.endsWith("'")))
{
s = s.substring(1, s.length() - 1);
}
System.out.println(i + ": \"" + s + "\"");
i += (m.group(0).length() - 1);
}
}
which produces the following output:
0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."

I liked Marcus's approach, however, I modified it so that I could allow text near the quotes, and support both " and ' quote characters. For example, I needed a="some value" to not split it into [a=, "some value"].
"(?<!\\G\\S{0,99999}[\"'].{0,99999})\\s|(?<=\\G\\S{0,99999}\".{0,99999}\"\\S{0,99999})\\s|(?<=\\G\\S{0,99999}'.{0,99999}'\\S{0,99999})\\s"

Jan's approach is great but here's another one for the record.
If you actually wanted to split as mentioned in the title, keeping the quotes in "will be" and 'regular expression', then you could use this method which is straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc
The regex:
'[^']*'|\"[^\"]*\"|( )
The two left alternations match complete 'quoted strings' and "double-quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expressions on the left. We replace those with SplitHere then split on SplitHere. Again, this is for a true split case where you want "will be", not will be.
Here is a full working implementation (see the results on the online demo).
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
Pattern regex = Pattern.compile("\'[^']*'|\"[^\"]*\"|( )");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "SplitHere");
else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits) System.out.println(split);
} // end main
} // end Program

If you are using c#, you can use
string input= "This is a string that \"will be\" highlighted when your 'regular expression' matches <something random>";
List<string> list1 =
Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""|'(?<match>[\w\s]*)'|<(?<match>[\w\s]*)>").Cast<Match>().Select(m => m.Groups["match"].Value).ToList();
foreach(var v in list1)
Console.WriteLine(v);
I have specifically added "|<(?<match>[\w\s]*)>" to highlight that you can specify any character to group phrases (in this case I am using < > to group).
Output is :
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something random

1st one-liner using String.split()
String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
String[] split = s.split( "(?<!(\"|').{0,255}) | (?!.*\\1.*)" );
[This, is, a, string, that, "will be", highlighted, when, your, 'regular expression', matches, something.]
don't split at the blank, if the blank is surrounded by single or double quotes
split at the blank when the 255 characters to the left and all characters to the right of the blank are neither single nor double quotes
adapted from original post (handles only double quotes)

I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
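A minimal sketch of such a hand-written function, under the assumption that quotes are never escaped and that any " or ' outside a quoted run opens a new one (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareTokenizer {
    // Splits on whitespace, except inside single- or double-quoted runs.
    // The quote characters themselves are dropped from the tokens.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0;          // 0 means "currently outside quotes"
        boolean inToken = false; // true once the current token has started
        for (char c : input.toCharArray()) {
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;             // closing quote ends the quoted run
                } else {
                    current.append(c);     // keep everything inside quotes
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                 // opening quote
                inToken = true;
            } else if (Character.isWhitespace(c)) {
                if (inToken) {             // whitespace outside quotes ends a token
                    tokens.add(current.toString());
                    current.setLength(0);
                    inToken = false;
                }
            } else {
                current.append(c);
                inToken = true;
            }
        }
        if (inToken) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
        tokenize(s).forEach(System.out::println);
    }
}
```

Because every apostrophe is treated as a quote, input such as Foo's would start a quoted run; handling that would need an extra rule.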

A couple hopefully helpful tweaks on Jan's accepted answer:
(['"])((?:\\\1|.)+?)\1|([^\s"']+)
Allows escaped quotes within quoted strings
Avoids repeating the pattern for the single and double quote; this also simplifies adding more quoting symbols if needed (at the expense of one more capturing group)
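As a rough sketch of how this tweaked pattern might be used from Java (the class and method names are hypothetical; quoted content lands in group 2 and unquoted words in group 3):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteTokens {
    // The pattern above as a Java string literal: every regex backslash is
    // doubled and every double quote is escaped.
    private static final Pattern TOKEN =
        Pattern.compile("(['\"])((?:\\\\\\1|.)+?)\\1|([^\\s\"']+)");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 2 is set for quoted strings, group 3 for unquoted words.
            result.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        tokens("one \"two \\\"three\\\" four\" 'five six'").forEach(System.out::println);
    }
}
```

The escaped quotes stay escaped in the returned token (the backslashes are kept), so an extra unescaping step would be needed if the raw text is wanted.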

You can also try this:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something";
String ss[] = str.split("\"|\'");
for (int i = 0; i < ss.length; i++) {
if ((i % 2) == 0) {//even
String[] part1 = ss[i].split(" ");
for (String pp1 : part1) {
System.out.println("" + pp1);
}
} else {//odd
System.out.println("" + ss[i]);
}
}

The following returns an array of arguments. Arguments are the variable 'command' split on spaces, unless included in single or double quotes. The matches are then modified to remove the single and double quotes.
using System.Text.RegularExpressions;
var args = Regex.Matches(command, "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'").Cast<Match>
().Select(iMatch => iMatch.Value.Replace("\"", "").Replace("'", "")).ToArray();

When you come across a pattern like this:
String str = "2022-11-10 08:35:00,470 RAV=REQ YIP=02.8.5.1 CMID=caonaustr CMN=\"Some Value Pyt Ltd\"";
//this helped
String[] str1= str.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println("Value of split string is "+ Arrays.toString(str1));
This results in: [2022-11-10, 08:35:00,470, RAV=REQ, YIP=02.8.5.1, CMID=caonaustr, CMN="Some Value Pyt Ltd"]
This regex matches a space only if it is followed by an even number of double quotes.

How to find if a string contains any objects in an array?

Say if I have the following code
String sum = "(5+5)/2*6";
char[] bodmasChars = {'+','-','*','/','.','(',')'};
Is there a way to check whether the string contains any of the elements in my char[]?

A regex
String sum = "(5+5)/2*6";
if (sum.matches("(?s).*[-\\+\\*/\\.()].*")) { ...
(?s) lets . also match newlines.
[...] is a character class: a set of possible characters or character ranges. Inside a character class, +, * and . need no escaping, so the pattern probably has a few backslashes too many.

You can do it with a regex, or simply with chained contains calls:
String s = "(5+5)/2*6";
if (s.contains("+") || s.contains("-") || s.contains("*") /* || ... one check per remaining character */) {
    System.out.println("yes");
}
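If the characters already live in a char[], a small loop avoids writing one contains call per character. This is only a sketch; the class and method names are invented for illustration:

```java
class ContainsAny {
    // Returns true as soon as one of the candidate characters occurs in s.
    static boolean containsAny(String s, char[] candidates) {
        for (char c : candidates) {
            if (s.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        char[] bodmasChars = {'+', '-', '*', '/', '.', '(', ')'};
        System.out.println(containsAny("(5+5)/2*6", bodmasChars)); // prints "true"
        System.out.println(containsAny("12345", bodmasChars));     // prints "false"
    }
}
```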

How to check if a string contains a substring containing spaces?

Say I have a string like this in java:
"this is {my string: } ok"
Note, there can be any number of white spaces in between the various characters. How do I check the above string to see if it contains just the substring:
"{my string: }"
Many thanks!

If you are looking to see if a String contains another specific sequence of characters then you could do something like this :
String stringToTest = "blah blah blah";
if(stringToTest.contains("blah")){
return true;
}
You could also use matches. For a decent explanation on matching Strings I would advise you check out the Java Oracle tutorials for Regular Expressions at :
http://docs.oracle.com/javase/tutorial/essential/regex/index.html
Cheers,
Jamie

If you have any number of white space between each character of your matching string, I think you are better off removing all white spaces from the string you are trying to match before the search. I.e. :
String searchedString = "this is {my string: } ok";
String stringToMatch = "{my string: }";
boolean foundMatch = searchedString.replaceAll(" ", "").contains(stringToMatch.replaceAll(" ",""));

Put it all into a string variable, say s, then call s.contains("{my string: }"); this returns true if {my string: } is in s.

For this purpose you need to use String#contains(CharSequence).
Note, there can be any number of white spaces in between the various
characters.
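For this purpose, String#trim() returns a copy of the string with leading and trailing whitespace omitted. Note that it does not remove whitespace inside the string, so the check below only succeeds when the inner spacing is exactly as written.
For example: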
String myStr = "this is {my string: } ok";
if (myStr.trim().contains("{my string: }")) {
//Do something.
}

The easiest thing to do is to strip all the spaces from both strings.
return stringToSearch.replaceAll("\\s", "").contains(
stringToFind.replaceAll("\\s", ""));

Look for the regex
\{\s*my\s+string:\s*\}
This matches any sequence that contains
A left brace
Zero or more spaces
'my'
One or more spaces
'string:'
Zero or more spaces
A right brace
Where 'space' here means any whitespace (tab, space, newline, cr)
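In Java, the regex above could be applied with Matcher.find(), which looks for the pattern anywhere in the input. A minimal sketch, with a made-up class and method name:

```java
import java.util.regex.Pattern;

class BraceSubstring {
    // The regex from above; each backslash is doubled in the Java string literal.
    private static final Pattern MY_STRING =
        Pattern.compile("\\{\\s*my\\s+string:\\s*\\}");

    static boolean containsMyString(String input) {
        // find() succeeds if the pattern occurs anywhere in the input.
        return MY_STRING.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsMyString("this is {my string: } ok"));     // prints "true"
        System.out.println(containsMyString("this is {  my   string:} ok"));  // prints "true"
        System.out.println(containsMyString("this is {mystring: } ok"));      // prints "false"
    }
}
```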

Regular Expression problem in Java

I am trying to create a regular expression for the replaceAll method in Java. The test string is abXYabcXYZ and the pattern is abc. I want to replace any symbol except the pattern with +. For example the string abXYabcXYZ and pattern [^(abc)] should return ++++abc+++, but in my case it returns ab++abc+++.
public static String plusOut(String str, String pattern) {
pattern= "[^("+pattern+")]" + "".toLowerCase();
return str.toLowerCase().replaceAll(pattern, "+");
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
When I try to replace the pattern with + there is no problem - abXYabcXYZ with pattern (abc) returns abxy+xyz. Pattern (^(abc)) returns the string without replacement.
Is there any other way to write NOT(regex) or group symbols as a word?

What you are trying to achieve is pretty tough with regular expressions, since there is no way to express “replace strings not matching a pattern”. You will have to use a “positive” pattern, telling what to match instead of what not to match.
Furthermore, you want to replace every character with a replacement character, so you have to make sure that your pattern matches exactly one character. Otherwise, you will replace whole strings with a single character, returning a shorter string.
For your toy example, you can use negative lookaheads and lookbehinds to achieve the task, but this may be more difficult for real-world examples with longer or more complex strings, since you will have to consider each character of your string separately, along with its context.
Here is the pattern for “not ‘abc’”:
[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c
It consists of five sub-patterns, connected with “or” (|), each matching exactly one character:
[^abc] matches every character except a, b or c
a(?!bc) matches a if it is not followed by bc
(?<!a)b matches b if it is not preceded with a
b(?!c) matches b if it is not followed by c
(?<!ab)c matches c if it is not preceded with ab
The idea is to match every character that is not in your target word abc, plus every word character that, according to the context, is not part of your word. The context can be examined using negative lookaheads (?!...) and lookbehinds (?<!...).
You can imagine that this technique will fail once you have a target word containing one character more than once, like example. It is pretty hard to express “match e if it is not followed by x and not preceded by l”.
Especially for dynamic patterns, it is by far easier to do a positive search and then replace every character that did not match in a second pass, as others have suggested here.
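For the toy example, the five-alternative pattern can be tried out with a plain replaceAll. This is only a sketch for the fixed word abc; the class and method names are invented for illustration:

```java
class NotAbcRegex {
    // Matches exactly one character that is not part of an occurrence of "abc".
    static final String NOT_ABC = "[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c";

    static String plusOut(String text) {
        // Lowercase first, as in the question, then replace each match with "+".
        return text.toLowerCase().replaceAll(NOT_ABC, "+");
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ")); // prints "++++abc+++"
    }
}
```

Since every alternative consumes exactly one character, the result always has the same length as the input.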

[^ ... ] will match one character that is not any of ...
So your pattern "[^(abc)]" is saying "match one character that is not a, b, c or the left or right bracket"; and indeed that is what happens in your test.
It is hard to say "replace all characters that are not part of the string 'abc'" in a single trivial regular expression. What you might do instead to achieve what you want could be some nasty thing like
while the input string still contains "abc"
find the next occurrence of "abc"
append to the output a string containing as many "+"s as there are characters before the "abc"
append "abc" to the output string
skip, in the input string, to a position just after the "abc" found
append to the output a string containing as many "+"s as there are characters left in the input
or possibly if the input alphabet is restricted you could use regular expressions to do something like
replace all occurrences of "abc" with a single character that does not occur anywhere in the existing string
replace all other characters with "+"
replace all occurrences of the target character with "abc"
which will be more readable but may not perform as well
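The second, placeholder-based idea could be sketched as follows. It assumes the NUL character never occurs in the input, which is what makes it safe as a placeholder; the class and method names are hypothetical:

```java
class PlaceholderPlusOut {
    // Assumption: the NUL character never appears in the input text.
    private static final String PLACEHOLDER = String.valueOf((char) 0);

    static String plusOut(String text, String word) {
        // 1. Replace every occurrence of the word with the placeholder.
        String marked = text.replace(word, PLACEHOLDER);
        // 2. Replace every other character with "+".
        String plussed = marked.replaceAll("[^\\x00]", "+");
        // 3. Put the word back where the placeholder is.
        return plussed.replace(PLACEHOLDER, word);
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ", "abc")); // prints "++++abc+++"
    }
}
```

String.replace works on literal text, so the word needs no regex quoting here.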

Negating regexps is usually troublesome. I think you might want to use negative lookahead. Something like this might work:
String pattern = "(?<!ab).(?!abc)";
I didn't test it, so it may not really work for degenerate cases. And the performance might be horrible too. It is probably better to use a multistep algorithm.
Edit: No I think this won't work for every case. You will probably spend more time debugging a regexp like this than doing it algorithmically with some extra code.

Try to solve it without regular expressions:
String out = "";
int i;
for(i=0; i<text.length() - pattern.length() + 1; ) {
if (text.substring(i, i + pattern.length()).equals(pattern)) {
out += pattern;
i += pattern.length();
}
else {
out += "+";
i++;
}
}
for(; i<text.length(); i++) {
out += "+";
}

Rather than a single replaceAll, you could always try something like:
@Test
public void testString() {
final String in = "abXYabcXYabcHIH";
final String expected = "xxxxabcxxabcxxx";
String result = replaceUnwanted(in);
assertEquals(expected, result);
}
private String replaceUnwanted(final String in) {
final Pattern p = Pattern.compile("(.*?)(abc)([^a]*)");
final Matcher m = p.matcher(in);
final StringBuilder out = new StringBuilder();
while (m.find()) {
out.append(m.group(1).replaceAll(".", "x"));
out.append(m.group(2));
out.append(m.group(3).replaceAll(".", "x"));
}
return out.toString();
}

Instead of using replaceAll(...), I'd go for a Pattern/Matcher approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static String plusOut(String str, String pattern) {
StringBuilder builder = new StringBuilder();
String regex = String.format("((?:(?!%s).)++)|%s", pattern, pattern);
Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
while(m.find()) {
builder.append(m.group(1) == null ? pattern : m.group().replaceAll(".", "+"));
}
return builder.toString();
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
}
Note that you'll need to use Pattern.quote(...) if your String pattern contains regex meta-characters.
Edit: I didn't see a Pattern/Matcher approach was already suggested by toolkit (although slightly different)...
