java 8 regular expression for meta characters [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 3 years ago.
I am trying to write a regular expression to check whether a sentence contains metacharacters, for example "I need to make payment of $50 for the purchase, should i use CASH|CC". I need to identify whether metacharacters are present in this sentence.
I tried \\\\$ and ^(\\\\$)\\$. What is the right syntax for Pattern.matches("^([\\\\$]$)", text); to identify the special characters? I don't need to replace them, just to detect whether the sentence contains them.

If you want to know whether a string contains meta characters, you can use something like this:
boolean hasIt = sentence.chars().anyMatch(c -> "\\.[]{}()*+?^$|".indexOf(c) >= 0);
By not using the Regex engine, you don’t need to quote the characters which have a special meaning to it.
Using Pattern.matches creates three unnecessary obstacles to the task. First, you have to quote all characters correctly. Second, you need a regex construct to turn the characters into alternatives, e.g. [abc] or a|b|c. Third, matches checks whether the entire string matches the pattern rather than whether it contains an occurrence, so you’d need something like .*pattern.* to make matches behave like find, if you insist on it.
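To make the difference between matches and find concrete, here is a minimal sketch. The class and method names are made up for illustration, and the character class simply lists the same characters as the one-liner above; it is not meant as a definitive list of metacharacters.

```java
import java.util.regex.Pattern;

final class MatchesVersusFind {
    // Character class with the characters from the one-liner above.
    // Only \, [ and ] are escaped; the others are literal at these positions.
    static final Pattern META = Pattern.compile("[\\\\.\\[\\]{}()*+?^$|]");

    // find() succeeds as soon as the class matches anywhere in the text.
    static boolean containsMeta(String text) {
        return META.matcher(text).find();
    }

    // matches() has to consume the whole input, so the class must be wrapped
    // in .* on both sides; (?s) lets . also cross line breaks.
    static boolean containsMetaViaMatches(String text) {
        return Pattern.matches("(?s).*" + META.pattern() + ".*", text);
    }
}
```

With the bare class, Pattern.matches would only return true for a one-character string that is itself a metacharacter, which is why the pattern in the question never matched the sentence.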
This leads to the XY problem behind this task. It’s not clear which metacharacters you actually want to check and why you need this information in the first place.
If you want to search for this sentence within another text, just use Pattern.compile(sentence, Pattern.LITERAL) to disable interpretation of meta characters. Or Pattern.quote(sentence) when you want to assemble a pattern containing the sentence.
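As a rough sketch of both options, assuming the sentence fragment below is the text to search for (the class, constant and method names are hypothetical):

```java
import java.util.regex.Pattern;

final class LiteralSearch {
    static final String SENTENCE = "should i use CASH|CC";

    // Pattern.LITERAL: the whole pattern string is treated as plain characters,
    // so | is not an alternation.
    static boolean containsLiteral(String haystack) {
        return Pattern.compile(SENTENCE, Pattern.LITERAL).matcher(haystack).find();
    }

    // Pattern.quote: returns a quoted form of the sentence that can be
    // embedded in a larger pattern, here followed by optional trailing whitespace.
    static boolean endsWithSentence(String haystack) {
        return Pattern.compile(Pattern.quote(SENTENCE) + "\\s*$").matcher(haystack).find();
    }
}
```

Without either of these, the sentence compiles as the alternation "should i use CASH" or "CC", so it would also be found in a text that merely contains "CC".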
But if you don’t want to search for it, this information has no relevance. Note that “Is this a meta character?” may lead to a different answer than “Does it need quoting?”. Even this tutorial combines these questions in a misleading way. At two close places it names the metacharacters and describes the quoting syntax, leading to the wrong impression that all of these characters need quoting.
For example, - only has a special meaning within a character class, so if there is no character class, which you detect by the presence of [, the - does not imply the presence of metacharacters. But while - truly needs quoting within the character class, the characters = and ! are metacharacters only in a certain context, which requires a metacharacter, so they never require quoting.
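The context dependence described above can be checked with a few full-match tests. This is only an illustrative sketch; fullMatch is a hypothetical helper that wraps Pattern.matches.

```java
import java.util.regex.Pattern;

final class ContextDependentMeta {
    // Thin wrapper so the checks below read uniformly.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Outside a character class, - is an ordinary character.
        System.out.println(fullMatch("a-c", "a-c"));     // true
        // Inside a class, - between two characters forms a range.
        System.out.println(fullMatch("[a-c]", "b"));     // true
        // Quoted inside the class, - is literal, so b is no longer covered.
        System.out.println(fullMatch("[a\\-c]", "b"));   // false
        System.out.println(fullMatch("[a\\-c]", "-"));   // true
        // = and ! only gain a meaning after "(?", so here they match themselves.
        System.out.println(fullMatch("a=b!", "a=b!"));   // true
    }
}
```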
But if you are trying to check for a metacharacter to decide whether to use the Regex engine or to perform a plain text search, e.g. via String.indexOf, you are performing premature optimization. This is not only a waste of development effort; optimizing before you even have actual code you could measure often leads to the opposite result. Pattern matching with the Regex engine on a string containing no metacharacters can be more efficient than a plain indexOf on the String. In the reference implementation, the Regex engine uses the Boyer-Moore algorithm while the plain text search methods on String use a naive search.

Edit: As mentioned by commenters Andreas and Holger, the meta characters used by regular expressions sometimes depend on a syntactical sub-definition, such as character classes or specific sequences (lookahead, lookbehind, ...), and are therefore not intrinsically metacharacters. Some are only meta characters in a specific context. However, the answer provided here includes all possible meta characters, with the exception of the operators that only become meta characters when prefixed by \. This means that characters will sometimes be matched in locations where they are not actually meta characters.
This question has half the answer: List of all special characters that need to be escaped in a regex
You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The Java regular expression system exposes no character class for its own special characters (regrettably).
Special constructs (named-capturing and non-capturing)
(?<name>X) X, as a named-capturing group
(?:X) X, as a non-capturing group
(?idmsuxU-idmsuxU) Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
This block alone contains a lot (though not all) of the meta characters. I had to leave out the last two rows of the citation, because the character sequences confused the parser of this page.
I would suggest the following:
public static final Pattern META_CHARS = Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");
But be aware that this list might very well be incomplete, and that it contains common characters such as , and . which are part of the regex syntax. So you probably have a lot of escaping to do...
From there you can:
Matcher metaDetector = META_CHARS.matcher(stringToTest);
if (metaDetector.find()) {
// this is the found meta character...
String metaCharacter = metaDetector.group(0);
System.out.print(metaCharacter);
}
And if you want to find all meta characters, turn the if in the above code snippet into a while. If you do, for the line "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC." you receive \{[$!!,|., which is correct, as # and " are not meta characters in regex.
As Andreas correctly mentions, the pattern can be reduced to "[\\\\\\]\\[(){}^$?*+.|]", because this is enough to tell you whether at least one meta character is present. However, when multiple meta characters are present, the shorter class will not report the context-dependent ones (such as -, ! and :). If this is not important, then the shorter class is sufficient.
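For reference, the while variant could be sketched like this. It reuses the META_CHARS pattern from above; the class name and the collect method are only illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharCollector {
    // Same character class as META_CHARS in the answer.
    static final Pattern META_CHARS =
            Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");

    // Collects every character of the input that the class matches, in order.
    static String collect(String stringToTest) {
        StringBuilder found = new StringBuilder();
        Matcher metaDetector = META_CHARS.matcher(stringToTest);
        while (metaDetector.find()) {
            found.append(metaDetector.group());
        }
        return found.toString();
    }
}
```

Calling collect with the example line from above returns the \{[$!!,|. sequence quoted in the answer.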

Related

Regex - Disallow certain characters to appear consecutively

I'm not sure if this is possible or not:
Writing a program to convert an infix notation to postfix notation. All is working well so far but trying to implement validation is proving difficult.
I'm trying to use a regex to validate an infix notation, conforming to the following rules:
String must only start with a number or ( (program does not allow negative numbers)
String must only end with number or )
String must only contain 0-9*/()-+
String must not allow following characters to appear together +*/-
I have a regex which conforms to the first 3 rules:
(^[0-9(])([0-9+()*]+)([0-9)]+$)
Is it possible to use regex to implement the last rule?

I will answer only the fourth rule, as that is the only one you have a problem with.
Yes, it is possible, but I think regex is not an appropriate tool for this check...
This pattern ^(?(?=.*\+)(?!.*[\*\/-])).+$ will match any string that contains + and does not contain the other characters /, * and -. Even for one character it is already lengthy and hard to read. See demo.
It uses a conditional expression (?...) to check whether the lookahead for + was successful; if it was, the negative lookahead ensures that none of the characters *, / or - occur.
For all characters, the regex will become very big and hard to maintain.
That's why I don't recommend it for this task.

I agree with Michał Turczyn that regex is not the tool for this, but not with the reason. It is easy to implement your restrictions. However, your restrictions also allow expressions like (0+3, 2(*4), ((((1, and other things you likely don't want, so regex validation is kind of pointless. If you were writing this with a regex engine with some significant power like PCRE (Perl, PHP) or Onigmo (Ruby), you could fake a parser in regex; but in Java, regex is quite restricted in what it can do. It is enough for the requirements in the question, though:
^[0-9(](?:(?![+*/-][+*/-])[0-9+*/()-])*[0-9)]$
starts with digit or paren
any number of repetitions of any allowed character, such that that character and the next character aren't both operators
ends with digit or closing paren.
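A small sketch of how this pattern might be used from Java. The class and method names are hypothetical; the regex is the one given above.

```java
import java.util.regex.Pattern;

final class InfixCharCheck {
    // Pattern from the answer: allowed start, allowed body without two
    // adjacent operators, allowed end.
    static final Pattern INFIX =
            Pattern.compile("^[0-9(](?:(?![+*/-][+*/-])[0-9+*/()-])*[0-9)]$");

    static boolean looksValid(String expr) {
        return INFIX.matcher(expr).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksValid("(1+2)*3")); // true
        System.out.println(looksValid("1++2"));    // false, two operators in a row
        System.out.println(looksValid("(0+3"));    // true, parentheses are not balanced-checked
    }
}
```

As noted above, the check accepts structurally broken input such as (0+3, so it only enforces the four character-level rules, not the validity of the expression.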

Is it better to use regular expressions to match on content or on delimiter?

For a concrete example, I want to break a text document into sentences. I'm considering using the following regular expressions (they still might need tweaking):
[!?][\s]*|[.\n][!?\s]+[.!?\s]* which matches on punctuation/whitespace (stuff between sentences - delimiters)
(.|\n)*?([!?.]\s+|[\n]{2,}|$): which captures any string of characters followed by punctuation or newline (full sentences - the content I want itself)
Generally, which of these methods is preferred? In my specific context, I'd like to keep track of the begin and end indices of each sentence, so I can't do something as simple as String#split.
Thanks.

Assuming you intend to use vanilla Pattern and Matcher processing, the first regex will usually match far fewer characters (ending punctuation and some whitespace characters at most) and as such should be the faster of the two. This can make a difference if you're parsing a very large document.
However, it might be clumsier to extract start and end indexes for each sentence, because you'll need information from two consecutive matches to be able to delimit a whole sentence. The second regex directly maps sentences to individual matches and enables the simplest code.
So no free lunch here. Both will get the job done, although you can probably make them more robust depending on the type of document you're targeting. In particular, beware of unexpected punctuation characters in the middle of sentences, as in:
... the "?" character can be used to...
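To illustrate the delimiter-based variant, here is a sketch of how begin and end indices could be derived from consecutive delimiter matches. It uses the first regex from the question unchanged; the class name, the method name and the choice to exclude the delimiter from each span are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SentenceSpans {
    // Delimiter pattern from the question (first variant).
    static final Pattern DELIMITER =
            Pattern.compile("[!?][\\s]*|[.\\n][!?\\s]+[.!?\\s]*");

    // Returns {begin, end} pairs (end exclusive) for the text between delimiters.
    static List<int[]> spans(String text) {
        List<int[]> result = new ArrayList<>();
        Matcher m = DELIMITER.matcher(text);
        int begin = 0;
        while (m.find()) {
            // A sentence runs from the end of the previous delimiter
            // to the start of the current one.
            if (m.start() > begin) {
                result.add(new int[] {begin, m.start()});
            }
            begin = m.end();
        }
        // Trailing text without a closing delimiter.
        if (begin < text.length()) {
            result.add(new int[] {begin, text.length()});
        }
        return result;
    }
}
```

For "Hello world. How are you? Fine!" this yields the spans [0, 11), [13, 24) and [26, 30). The bookkeeping with begin is the extra state the answer refers to when it mentions needing two consecutive matches.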

How to remove duplicate characters in a string using regex?

I need to replace the duplicate characters in a string. I tried using
outputString = str.replaceAll("(.)(?=.*\\1)", "");
This replaces the duplicate characters but the position of the characters changes as shown below.
input
haih
output
aih
But I need to get an output hai. That is the order of the characters that appear in the string should not change. Given below are the expected outputs for some inputs.
input
aaaassssddddd
output
asd
input
cdddddggggeeccc
output
cdge
How can this be achieved?

It seems like your code is leaving the last character, so how about this?
outputString = new StringBuilder(str).reverse().toString();
// outputString is now hiah
outputString = outputString.replaceAll("(.)(?=.*\\1)", "");
// outputString is now iah
outputString = new StringBuilder(outputString).reverse().toString();
// outputString is now hai

Overview
It's possible with Oracle's implementation, but I wouldn't recommend this answer for many reasons:
It relies on a bug in the implementation, which interprets *, + or {n,} as {0, 0x7FFFFFFF}, {1, 0x7FFFFFFF}, {n, 0x7FFFFFFF} respectively, which allows the look-behind to contain such quantifiers. Since it relies on a bug, there is no guarantee that it will work similarly in the future.
It is an unmaintainable mess. Normal code can be read by anyone with some basic Java knowledge, but using the regex in this answer limits the number of people who can understand the code at a glance to those who understand the ins and outs of the regex implementation.
Therefore, this answer is for educational purpose, rather than something to be used in production code.
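For comparison, the "normal code" alternative mentioned above might look like the following sketch. It is not part of the original answer; the class and method names are made up, and it relies on distinct() keeping the first occurrence of each element on an ordered stream.

```java
final class FirstOccurrences {
    // Keeps the first occurrence of every code point and drops later repeats.
    static String dedupe(String input) {
        return input.codePoints()
                .distinct()
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(dedupe("haih"));            // hai
        System.out.println(dedupe("aaaassssddddd"));   // asd
        System.out.println(dedupe("cdddddggggeeccc")); // cdge
    }
}
```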
Solution
Here is the one-liner replaceAll regex solution:
String output = input.replaceAll("(.)(?=(.*))(?<=(?=\\1.*?\\1\\2$).+)","");
Printing out the regex:
(.)(?=(.*))(?<=(?=\1.*?\1\2$).+)
What we want to do is to look-behind to see whether the same character has appeared before or not. The capturing group (.) at the beginning captures the current character, and the look-behind group is there to check whether the character has appeared before. So far, so good.
However, since the backreference \1 doesn't have an obvious length, it can't appear in the look-behind directly.
This is where we make use of the bug to look-behind up to the beginning of the string, then use a look-ahead inside the look-behind to include the backreference, as you can see (?<=(?=...).+).
This is not the end of the problem, though. While the non-assertion pattern inside look-behind .+ can't advance past the position after the character in (.), the look-ahead inside can. As a simple test:
"haaaaaaaaa".replaceAll("h(?<=(?=(.*)).*)","$1")
> "aaaaaaaaaaaaaaaaaa"
To make sure that the search doesn't spill beyond the current character, I capture the rest of the string in a look-ahead (?=(.*)) and use it to "mark" the current position (?=\\1.*?\\1\\2$).
Can this be done in one replacement without using look-behind?
I think it is impossible. We need to differentiate the first appearance of a character with subsequent appearance of the same character. While we can do this for one fixed character (e.g. a), the problem requires us to do so for all characters in the string.
For your information, this is for removing all subsequent appearance of a fixed character (h is used here):
.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)","$1$2")
To do this for multiple characters, we must keep track of whether the character has appeared before or not, across matches and for all characters. The regex above shows the across matches part, but the other condition kinda makes this impossible.
We obviously can't do this in a single match, since subsequent occurrences can be non-contiguous and arbitrary in number.

Compare non-English characters

I have a problem when I try to compare 'Đ' and 'D': I need the comparison to return true, but with the English locale it returns false, because 'Đ' cannot be replaced with the regex:
"\\p{Block=CombiningDiacriticalMarks}+".

The character class [\u0110D] will match either a Đ or a D (Đ is code point U+0110).
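A short sketch of both approaches, with hypothetical class and method names. The stripMarks helper is an assumption about what the question was doing (NFD normalization followed by removal of combining marks); it is included to show why that route does not help for Đ, which has no canonical decomposition.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class StrokeDCompare {
    // Matches exactly one character: U+0110 (D with stroke) or a plain D.
    // The doubled backslash hands \u0110 to the regex engine unchanged.
    static final Pattern D_LIKE = Pattern.compile("[\\u0110D]");

    static boolean isDLike(String s) {
        return D_LIKE.matcher(s).matches();
    }

    // Decompose, then drop combining diacritical marks. This turns u-umlaut
    // into u, but leaves U+0110 as it is because it does not decompose.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}
```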
Matching Non-English Characters, a Primer
One common example is the word 'über-geek'. How do I match that word whether or not there is an umlaut above the u? Simple: [\u00FCu]ber-geek will match either 'über-geek' or 'uber-geek' (ü is code point U+00FC).
Depending on your regular expression engine, there are multiple great ways to match locale-specific characters. Buy a book on your specific implementation to discover its wrinkles. By the way, you can find an excellent resource for Unicode-specific regex information at Regular-Expressions.info's Unicode page.
What if I want to match any character? If your engine supports the \X escape, it will act as a . in a Unicode context. This means that multiple Unicode code points which combine to form one grapheme will register as one 'character' to the engine.
NOTE: I'm not trying to 'steal' an answer to this one, and I'll delete mine if Ted Hopp moves his out of the comments. I just wanted to make sure that people looking for non-English regex matches can see that this question did indeed get answered.

Java regular expression to match _all_ whitespace characters

I'm looking for a regular expression in Java which matches all whitespace characters in a String. "\s" matches only some; it does not match &nbsp; and similar non-ASCII whitespace characters. I'm looking for a regular expression which matches all (common) white-space characters which can occur in a Java String.
[Edit]
To clarify: I do not mean the string sequence "&nbsp;". I mean the single Unicode character U+00A0 that is often represented by "&nbsp;", e.g. in HTML, and all other Unicode characters with a similar white-space meaning, e.g. "NARROW NO-BREAK SPACE" (U+202F), the word joiner encoded in Unicode 3.2 and above as U+2060, "ZERO WIDTH NO-BREAK SPACE" (U+FEFF) and any other character that can be regarded as white-space.
[Answer]
For my purpose, i.e. catching all whitespace characters, Unicode + traditional, the following expression does the job:
[\p{Z}\s]
The answer is in the comments below but since it is a bit hidden I repeat it here.
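A quick way to see what each class covers is to test single code points against it. The sketch below uses hypothetical names and default flags (no UNICODE_CHARACTER_CLASS). Note that U+2060 and U+FEFF from the list above are format characters (category Cf), not separators, so [\p{Z}\s] does not match them.

```java
import java.util.regex.Pattern;

final class WhitespaceClasses {
    static final Pattern ASCII_WS = Pattern.compile("\\s");
    static final Pattern SEPARATORS = Pattern.compile("\\p{Z}");
    static final Pattern COMBINED = Pattern.compile("[\\p{Z}\\s]");

    // True if the pattern matches the single given code point.
    static boolean matches(Pattern p, int codePoint) {
        return p.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(ASCII_WS, 0x00A0));   // false: \s misses NO-BREAK SPACE
        System.out.println(matches(SEPARATORS, 0x00A0)); // true
        System.out.println(matches(SEPARATORS, '\t'));   // false: tab is a control character
        System.out.println(matches(COMBINED, 0x202F));   // true
        System.out.println(matches(COMBINED, 0x2060));   // false: WORD JOINER is category Cf
    }
}
```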

&nbsp; is not a whitespace character, as far as regexps are concerned. You need to either modify the regexp to include those strings in addition to \s, like /(\s|&nbsp;|%20)/, or first parse the string contents to get the ASCII or Unicode representation of the data.
You are mixing abstraction levels here.
If, as seems to be the case after a careful reread of the question, you are after a way to match all whitespace characters, meaning standard ASCII whitespace plus the Unicode whitespace code points, \p{Z} or \p{Zs} will do the work.
You should really clarify your question, because it has misled a lot of people (and even earned the correct answer some downvotes).

You clarified the question the way I expected: you're actually not looking for the String literal &nbsp;, as many here seem to think, and for which the solution is too obvious.
Well, unfortunately, there's no way to match them using regex. Best is to include the particular codepoints in the pattern, for example: "[\\s\\xA0]".
Edit: as turned out in one of the comments, you could use the undocumented "\\p{Z}" for this. Alan, can you please leave a comment on how you found that out? This one is quite useful.

The &nbsp; is only whitespace in HTML. Use an HTML parser to extract the plain text, and \s should work just fine.

In case anyone runs into this question again looking for help, I suggest pursuing the following answer: https://stackoverflow.com/a/6255512/1678392
The short version: \\p{javaSpaceChar}
Why: Per the Pattern class, this maps the Character.isSpaceChar method:
Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
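A minimal sketch of that mapping, with hypothetical names: the regex property and Character.isSpaceChar should agree on every code point tested.

```java
import java.util.regex.Pattern;

final class JavaSpaceChar {
    static final Pattern SPACE_CHAR = Pattern.compile("\\p{javaSpaceChar}");

    // True if the regex property matches the single given code point.
    static boolean regexSaysSpace(int codePoint) {
        return SPACE_CHAR.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        int[] samples = {' ', '\t', 0x00A0, 0x2028, 0x202F};
        for (int cp : samples) {
            // Both columns are expected to print the same value per row.
            System.out.printf("U+%04X regex=%b isSpaceChar=%b%n",
                    cp, regexSaysSpace(cp), Character.isSpaceChar(cp));
        }
    }
}
```

Like \p{Z}, this covers the Unicode separator categories only, so tab and newline are not matched; combine it with \s if those are needed as well.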

Click here for a summary I made of several competing definitions of "whitespace".
You might end up having to explicitly list the additional ones you care about that aren't matched by one of the prefab ones.

&nbsp; is not white space. It is a character encoding sequence that represents whitespace in HTML. You most likely want to convert HTML encoded text into plain text before running your string match against it. If that is the case, go look up
javax.swing.text.html

The regex characters are the only ones independent of encoding. Here is a list of some characters which - in Unicode - are non-printing:
How many non-printing characters are in common use?
