How to remove duplicate characters in a string using regex?


I need to replace the duplicate characters in a string. I tried using
outputString = str.replaceAll("(.)(?=.*\\1)", "");
This replaces the duplicate characters but the position of the characters changes as shown below.
input
haih
output
aih
But I need to get an output hai. That is the order of the characters that appear in the string should not change. Given below are the expected outputs for some inputs.
input
aaaassssddddd
output
asd
input
cdddddggggeeccc
output
cdge
How can this be achieved?

Your code keeps the last occurrence of each character instead of the first, so how about reversing the string before and after the replacement?
outputString = new StringBuilder(str).reverse().toString();
// outputString is now hiah
outputString = outputString.replaceAll("(.)(?=.*\\1)", "");
// outputString is now iah
outputString = new StringBuilder(outputString).reverse().toString();
// outputString is now hai
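The three steps above can be wrapped in a small helper. This is only a sketch: the class and method names are made up for illustration, and it assumes the input has no line terminators, since . does not match them by default.

```java
class ReverseDedupe {
    // Hypothetical helper: keeps the first occurrence of every character.
    static String keepFirstOccurrences(String str) {
        // Reverse so that "first occurrence" becomes "last occurrence".
        String reversed = new StringBuilder(str).reverse().toString();
        // Remove every character that appears again later in the string.
        String deduped = reversed.replaceAll("(.)(?=.*\\1)", "");
        // Reverse back to restore the original order.
        return new StringBuilder(deduped).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(keepFirstOccurrences("haih"));            // hai
        System.out.println(keepFirstOccurrences("aaaassssddddd"));   // asd
        System.out.println(keepFirstOccurrences("cdddddggggeeccc")); // cdge
    }
}
```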

Overview
It's possible with Oracle's implementation, but I wouldn't recommend this answer, for several reasons:
It relies on a bug in the implementation, which interprets *, + and {n,} as {0, 0x7FFFFFFF}, {1, 0x7FFFFFFF} and {n, 0x7FFFFFFF} respectively, and so allows the look-behind to contain such quantifiers. Since it relies on a bug, there is no guarantee that it will work the same way in the future.
It is an unmaintainable mess. Anyone with basic Java knowledge can read normal code, but the regex in this answer can only be understood at a glance by people who know the ins and outs of the regex implementation.
Therefore, this answer is for educational purposes, rather than something to be used in production code.
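For comparison, the "normal code" alternative could look roughly like the sketch below. It is not part of the regex solution; the class and method names are hypothetical, and it works on UTF-16 char units, so characters outside the Basic Multilingual Plane are not handled specially.

```java
import java.util.HashSet;
import java.util.Set;

class PlainDedupe {
    // Hypothetical helper: keeps the first occurrence of every char, in order.
    static String keepFirstOccurrences(String input) {
        Set<Character> seen = new HashSet<>();
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            // Set.add returns false if the char was already present.
            if (seen.add(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepFirstOccurrences("haih"));            // hai
        System.out.println(keepFirstOccurrences("cdddddggggeeccc")); // cdge
    }
}
```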
Solution
Here is the one-liner replaceAll regex solution:
String output = input.replaceAll("(.)(?=(.*))(?<=(?=\\1.*?\\1\\2$).+)","");
Printing out the regex:
(.)(?=(.*))(?<=(?=\1.*?\1\2$).+)
What we want to do is to look-behind to see whether the same character has appeared before or not. The capturing group (.) at the beginning captures the current character, and the look-behind group is there to check whether the character has appeared before. So far, so good.
However, since the backreference \1 doesn't have an obvious length, it can't appear in the look-behind directly.
This is where we make use of the bug to look-behind up to the beginning of the string, then use a look-ahead inside the look-behind to include the backreference, as you can see (?<=(?=...).+).
This is not the end of the problem, though. While the non-assertion pattern inside look-behind .+ can't advance past the position after the character in (.), the look-ahead inside can. As a simple test:
"haaaaaaaaa".replaceAll("h(?<=(?=(.*)).*)","$1")
> "aaaaaaaaaaaaaaaaaa"
To make sure that the search doesn't spill beyond the current character, I capture the rest of the string in a look-ahead (?=(.*)) and use it to "mark" the current position (?=\\1.*?\\1\\2$).
Can this be done in one replacement without using look-behind?
I think it is impossible. We need to differentiate the first appearance of a character from subsequent appearances of the same character. While we can do this for one fixed character (e.g. a), the problem requires us to do so for all characters in the string.
For your information, this removes all subsequent appearances of a fixed character (h is used here):
.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)","$1$2")
To do this for multiple characters, we must keep track of whether each character has appeared before, across matches and for all characters. The regex above shows the across-matches part, but the other condition pretty much makes this impossible.
We obviously can't do this in a single match, since subsequent occurrences can be non-contiguous and arbitrary in number.
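To see the fixed-character regex in action, here is a small sketch. The wrapper class and method name are made up; the pattern is the one quoted above for h. The first alternative keeps everything up to and including the first h, and the \G alternative then drops each later run of h while keeping the text after it.

```java
class FixedCharDedupe {
    // Hypothetical helper: removes every 'h' except the first one.
    static String keepFirstH(String s) {
        return s.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(keepFirstH("haih"));   // hai
        System.out.println(keepFirstH("hahbhh")); // hab
        System.out.println(keepFirstH("abc"));    // abc (no h, nothing to do)
    }
}
```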

Related

java 8 regular expression for meta characters [duplicate]

I'm trying to write a regular expression to check whether a sentence has metacharacters, e.g. "I need to make payment of $50 for the purchase, should i use CASH|CC". In this sentence I need to identify whether metacharacters are present.
I tried \\\\$ or ^(\\\\$)\\$. What is the right syntax for Pattern.matches("^([\\\\$]$)", text); to identify the special characters? I don't need to replace them, just identify whether the sentence contains these characters.

If you want to know whether a string contains meta characters, you can use something like this:
boolean hasIt = sentence.chars().anyMatch(c -> "\\.[]{}()*+?^$|".indexOf(c) >= 0);
By not using the Regex engine, you don’t need to quote the characters which have a special meaning to it.
Using Pattern.matches creates three unnecessary obstacles to the task. First, you have to quote all characters correctly. Second, you need a regex construct to turn the characters into alternatives, e.g. [abc] or a|b|c. Third, matches checks whether the entire string matches the pattern, rather than whether it contains an occurrence, so you'd need something like .*pattern.* to make matches behave like find, if you insist on it.
Which leads to the xy-problem of this task. It’s not clear which metacharacters you actually want to check and why you need this information in the first place.
If you want to search for this sentence within another text, just use Pattern.compile(sentence, Pattern.LITERAL) to disable interpretation of meta characters. Or Pattern.quote(sentence) when you want to assemble a pattern containing the sentence.
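A short sketch of both options, using the standard Pattern.LITERAL flag and Pattern.quote. The class name, method names and sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class LiteralSearch {
    // Searches for the sentence as plain text; meta characters are not interpreted.
    static boolean containsLiteral(String text, String sentence) {
        return Pattern.compile(sentence, Pattern.LITERAL).matcher(text).find();
    }

    // Assembles a larger pattern around the quoted sentence:
    // here, "the text ends with the sentence".
    static boolean endsWithLiteral(String text, String sentence) {
        return Pattern.compile(Pattern.quote(sentence) + "$").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("payment of $50 today", "$50")); // true
        // Without LITERAL, "CASH|CC" is an alternation and "CC" alone would match.
        System.out.println(containsLiteral("CC only", "CASH|CC"));          // false
        System.out.println(endsWithLiteral("should i use CASH|CC", "CASH|CC")); // true
    }
}
```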
But if you don’t want to search for it, this information has no relevance. Note that “Is this a meta character?” may lead to a different answer than “Does it need quoting?”. Even this tutorial combines these questions in a misleading way. At two close places it names the metacharacters and describes the quoting syntax, leading to the wrong impression that all of these characters need quoting.
For example, - only has a special meaning within a character class, so if there is no character class, which you detect by the presence of [, the - does not imply the presence of metacharacters. But while - truly needs quoting within the character class, the characters = and ! are metacharacters only in a certain context, which requires a metacharacter, so they never require quoting.
But if you are trying to check for a metacharacter to decide whether to use the Regex engine or to perform a plain text search, e.g. via String.indexOf, you are performing premature optimization. This is not only a waste of development effort; optimizing before you even have actual code you could measure often leads to the opposite result. Performing a pattern matching using the Regex engine with a string containing no metacharacters can lead to a more efficient search than a plain indexOf on the String. In the reference implementation, the Regex engine uses the Boyer Moore algorithm while the plaintext search methods on String use a naive search.

Edit: As mentioned by commenters Andreas and Holger, the meta characters used by regular expressions sometimes depend on a syntactical subdefinition, like character classes or specific sequences (lookahead, lookbehind, ...), and are therefore not intrinsically metacharacters per se. Some are only meta characters in a specific context. However, the answer provided here will include all possible meta characters, with the exception of the operators that only become meta characters when prefixed by \. This means that characters will sometimes be matched in locations where they are not actually meta characters.
This question has half the answer: List of all special characters that need to be escaped in a regex
You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The Java regular expression system exposes no character class for its own special characters (regrettably).
Special constructs (named-capturing and non-capturing)
(?<name>X) X, as a named-capturing group
(?:X) X, as a non-capturing group
(?idmsuxU-idmsuxU) Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
This block alone contains a lot (though not all) of the meta characters. I had to leave out the last two rows of the citation, because the character sequences confused the parser of this page.
I would suggest the following:
public static final Pattern META_CHARS = Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");
But be aware, that this list might very well be incomplete, and that this contains typical characters such as , and . which are part of the regex syntax. So you probably got a lot of escaping to do...
From there you can:
Matcher metaDetector = META_CHARS.matcher(stringToTest);
if (metaDetector.find()) {
    // this is the found meta character...
    String metaCharacter = metaDetector.group(0);
    System.out.print(metaCharacter);
}
And if you want to find all meta characters, then make a while out of if in the above code snippet. If you do, for the line "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC." you receive \{[$!!,|., which is correct, as # and " are not meta characters in regex.
As Andreas correctly mentions, the exact pattern can be reduced to "[\\\\\\]\\[(){}^$?*+.|]", because this will tell you, whether or not at least one meta character is present. However this might miss some meta characters, if multiple are present. If this is not important, then the shorter chain is sufficient.
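The "while instead of if" variant could be sketched as follows. The class and method names are hypothetical; the pattern is the longer META_CHARS pattern suggested above, and the sample sentence is the one quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharFinder {
    static final Pattern META_CHARS =
            Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");

    // Hypothetical helper: concatenates every meta character found, in order.
    static String findAll(String stringToTest) {
        StringBuilder found = new StringBuilder();
        Matcher metaDetector = META_CHARS.matcher(stringToTest);
        while (metaDetector.find()) {
            found.append(metaDetector.group(0));
        }
        return found.toString();
    }

    public static void main(String[] args) {
        String line = "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC.";
        System.out.println(findAll(line)); // \{[$!!,|.
    }
}
```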

Regex to Split String on Even Number of Preceding Characters

I'm trying to develop a regex that will split a string on a single quote only if the single quote is preceded by zero question marks, or an even number of question marks. For example, the following string:
ABC??'DEF?'GHI'JKL????'MNO'
would result in:
ABC??
DEF?'GHI
JKL????
MNO
I've tried using this negative lookbehind:
(?<!\?\?)*\'
But that results in:
ABC??
DEF?
GHI
JKL????
MNO
I've also tried the following
(?<!(\?\?)*)\' results in runtime error
(?:\?\?)*\'
(?!\?\?)+\'
Any ideas would be greatly appreciated.

The split method isn't handy in this kind of situation. A workaround is to describe everything that isn't the delimiter and to use the find method:
[^?']+(?:\?.[^?']*)*|(?:\?.[^?']*)+
pattern details:
[^?']* # zero or more characters that aren't a `?` or a `'`
(?: # open a non-capturing group
\? . # a question mark followed by a character (that can be a `?` or a `'`)
[^?']* # zero or more characters that aren't a `?` or a `'`
)* # close the non-capturing group and repeat it zero or more times
[^?']*(?:\?.[^?']*)* describes all that isn't the delimiter including the empty string. To avoid empty matches, I use 2 branches of an alternation: [^?']+(?:\?.[^?']*)* and (?:\?.[^?']*)+ to ensure there's at least one character.
(If you want to allow the empty string at the start of the string, add |^ at the end of the pattern)
You can also use the split method, but the pattern to do it isn't efficient since it needs to look backward at each position (and it is limited, since lookbehind in Java only allows bounded quantifiers):
(?<=(?<!\?)(?:\?\?){0,100})'
or perhaps more efficient like this:
'(?<=(?<!\?)(?:\?\?){0,100}')
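The find-based workaround above can be used roughly like this. The class and method names are made up for illustration; the pattern is the two-branch one given above, and the input is the string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteSplitter {
    // Matches the pieces between delimiters instead of the delimiters themselves.
    private static final Pattern PIECE =
            Pattern.compile("[^?']+(?:\\?.[^?']*)*|(?:\\?.[^?']*)+");

    // Hypothetical helper: collects every non-empty piece, in order.
    static List<String> pieces(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PIECE.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ABC??, DEF?'GHI, JKL????, MNO]
        System.out.println(pieces("ABC??'DEF?'GHI'JKL????'MNO'"));
    }
}
```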

Have you tried a positive lookbehind?
(?<=.')

This regex will do it:
[A-Z]+(\?\?)*'

If you only need to handle a single question mark, not three, five, etc., you could use this:
(?<![^\?]\?)'
You could expand on this concept to match other specific odd numbers of question marks. For example, this will properly not split on a quote preceded by one, three, or five question marks:
(?<![^\?]\?|[^\?]\?{3}|[^\?]\?{5})'
Lookbehinds must be fixed-width, but some engines allow an alternation of fixed-width branches inside a single lookbehind. Others do not, and would require it to be written as three separate lookbehinds:
(?<![^\?]\?)(?<![^\?]\?{3})(?<![^\?]\?{5})'
Obviously this is getting a bit messy, though. And it can't handle an arbitrary odd number of ?.

Java Regular Expression for number of exactly 5 digits anywhere in the string

I'm trying to create a regular expression to parse a 5 digit number out of a string no matter where it is but I can't seem to figure out how to get the beginning and end cases.
I've used the pattern as follows \\d{5} but this will grab a subset of a larger number...however when I try to do something like \\D\\d{5}\\D it doesn't work for the end cases. I would appreciate any help here! Thanks!
For a few examples (55555 is what should be extracted):
At the beginning of the string
"55555blahblahblah123456677788"
In the middle of the string
"2345blahblah:55555blahblah"
At the end of the string
"1234567890blahblahblah55555"

Since you are using a language that supports them, use negative lookarounds:
"(?<!\\d)\\d{5}(?!\\d)"
These will assert that your \\d{5} is neither preceded nor followed by a digit. Whether that is due to the edge of the string or a non-digit character does not matter.
Note that these assertions themselves are zero-width matches. So those characters will not actually be included in the match. That is why they are called lookbehind and lookahead. They just check what is there, without actually making it part of the match. This is another disadvantage of using \\D, which would include the non-digit character in your match (or require you to use capturing groups).
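As a quick check against the three examples from the question, a sketch like the following can be used. The class and method names are hypothetical; returning null when nothing is found is just a choice made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FiveDigitFinder {
    // Exactly five digits, neither preceded nor followed by another digit.
    private static final Pattern FIVE_DIGITS = Pattern.compile("(?<!\\d)\\d{5}(?!\\d)");

    // Hypothetical helper: returns the first standalone 5-digit number, or null.
    static String firstFiveDigits(String input) {
        Matcher m = FIVE_DIGITS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstFiveDigits("55555blahblahblah123456677788")); // 55555
        System.out.println(firstFiveDigits("2345blahblah:55555blahblah"));    // 55555
        System.out.println(firstFiveDigits("1234567890blahblahblah55555"));   // 55555
        System.out.println(firstFiveDigits("123456"));                        // null
    }
}
```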

Java - Unknown characters passing as [a-zA-z0-9]*?

I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if (!p.matcher(gottenData).matches())
    System.out.println(someData); //someData contains gottenData
certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display Text, it's not all like that.
For now, I don't mind the [?] as long as it also contains some string along with it.
Please help.
[EDIT] As far as I can tell from the (very large) input, the [?]'s are either white space or nothing at all; maybe there's some sort of encoding issue, or perhaps it has something to do with #text nodes (the input is XML).

The * quantifier matches "zero or more", which means it will match a string that does not contain any of the characters in your class. Try the + quantifier, which means "One or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.lookingAt() instead of Matcher.matches, it will not require a full string match and you can use the regex [a-zA-Z0-9]+.

You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex.
Matcher.matches() always matches the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-Z0-9]*");
if (!p.matcher(gottenData).matches())
    System.out.println("doesn't match.");
This prints "doesn't match."

The correct answer is a combination of the above answers. First, I imagine your intended character class is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extras (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need to put in the start and end anchors if you want the string to consist only of letters and numbers.
Finally, you use the * quantifier, which means 0 or more, so the pattern can match 0 characters: an empty string passes the check, and with a search that doesn't require a full match the pattern would match any input. What you need is the + quantifier. So the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
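A small sketch of that pattern in use, together with a demonstration of what the A-z typo lets through. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class AlnumCheck {
    private static final Pattern ALNUM = Pattern.compile("^[a-zA-Z0-9]+$");

    // Hypothetical helper: true only for non-empty, purely alphanumeric strings.
    static boolean isAlphanumeric(String s) {
        return ALNUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAlphanumeric("riz99")); // true
        System.out.println(isAlphanumeric(""));      // false, + requires one character
        System.out.println(isAlphanumeric("a "));    // false, space is not in the class
        // The A-z range also covers [ \ ] ^ _ and `, so an underscore slips through.
        System.out.println(Pattern.matches("[a-zA-z0-9]*", "a_b")); // true
    }
}
```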

You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string.

Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...

Did anyone consider adding a space to the regex, i.e. [a-zA-Z0-9 ]*? This should match any normal text with letters, numbers and spaces. If you want quotes and other special chars, add them to the regex too.
You can quickly test your regex at http://www.regexplanet.com/simple/

You can check whether the input value contains only letters and digits by using the regex ^[a-zA-Z0-9]*$
If the value contains only letters and digits, it shows "match", e.g. riz99, riz99z
Otherwise it shows "not match", e.g. 99z., riz99.z, riz99.9
Example code (JavaScript):
if (e.target.value.match('^[a-zA-Z0-9]*$')) {
    console.log('match')
}
else {
    console.log('not match')
}

use of delimiter function from scanner for "abc-def"

I'm currently trying to filter a text-file which contains words that are separated with a "-". I want to count the words.
scanner.useDelimiter(("[.,:;()?!\" \t\n\r]+"));
The problem which occurs simply is: words that contain a "-" will get separated and counted for being two words. So just escaping with \- isn't the solution of choice.
How can I change the delimiter-expression, so that words like "foo-bar" will stay, but the "-" alone will be filtered out and ignored?
Thanks ;)

OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?
Example:
This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.
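So, what you need as a delimiter is either any run of whitespace and/or punctuation (which the regex you showed already covers), or a hyphen that is surrounded by at least one whitespace character on each side. The regex operator for "or" is "|"; because the first alternative that matches wins, the hyphen alternative should come first, otherwise the whitespace before a lone hyphen is consumed on its own and the hyphen is left behind as a token. Many regex implementations have a shorthand for the whitespace character class (spaces, tabs, and newlines): "\s", which is written "\\s" inside a Java string literal.
"\\s+-\\s+|[.,:;()?!\"\\s]+"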
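A sketch of such a delimiter with Scanner, assuming the backslashes are doubled for the Java string literal and the hyphen alternative is listed first, so that a lone hyphen is consumed together with its surrounding whitespace. The class name, method name and sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class HyphenAwareWords {
    // Either a hyphen surrounded by whitespace, or a run of punctuation/whitespace.
    private static final String DELIMITER = "\\s+-\\s+|[.,:;()?!\"\\s]+";

    // Hypothetical helper: returns the words, keeping "foo-bar" as one word.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [foo-bar, baz, qux, done]
        System.out.println(words("foo-bar baz - qux, done."));
    }
}
```

A hyphen that directly follows punctuation (for example ", - ") is still not covered by this simple form, so it would come back as a token of its own.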

If possible try to use the pre-defined classes... makes the regex much easier to read. See java.util.regex.Pattern for options.
Maybe this is what you are looking for:
string.split("\\s+(\\W*\\s)?")
Reads: Match 1 or more whitespace chars optionally followed by zero or more non-word characters and a whitespace character.

This is not very simple. One thing to try would be {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphen}.
It might be easier to just ignore words returned by the scanner that consist entirely of hyphens.

Scanner scanner = new Scanner("one two2 - (three) four-five - ,....|");
scanner.useDelimiter("(\\B+-\\B+|[.,:;()?!\" \t|])+");
while (scanner.hasNext()) {
    System.out.println(scanner.next("\\w+(-\\w+)*"));
}
NB
the next(String) method asserts that you get only words since the original useDelimiter() method misses "|"
NB
you have used the regular expression "\r\n|\n" as line terminator. The JavaDocs for java.util.regex.Pattern shows other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]"

This should be simple enough: [^\\w-]\\W*|-\\W+
But of course if it's prose, and you want to exclude underscores:
[^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
or if you don't expect numerics:
[^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+
EDIT: These are the simpler forms. Keep in mind that a complete solution, one that also handles dashes at the beginning and end of lines, would follow this pattern: (?:^|[^\\w-])\\W*|-(?:\\W+|$)
