Exclude words with forward slash in Java regexp

I'm trying to allow only certain words through a regexp filter in Java, i.e.:
Pattern p = Pattern.compile("^[a-zA-Z0-9\\s\\.-_]{1," + s.length() + "}$");
But I find that it allows through 140km/h because forward slash isn't handled. Ideally, this word should not be allowed.
Can anyone suggest a fix to my current version?
I'm new to regexp and don't particularly follow it fully yet.
The regexp is in a utils class method as follows:
public static boolean checkStringAlphaNumericChars(String s) {
s = s.trim();
if ((s == null) || (s.equals(""))) {
return false;
}
Pattern p = Pattern.compile("^[a-zA-Z0-9\\s\\.-_]{1," + s.length() + "}$");
// Pattern p = Pattern.compile("^[a-zA-Z0-9_\\s]{1," + s.length() + "}");
Matcher m = p.matcher(s);
if (m.matches()) {
return true;
}
else {
return false;
}
}
I want to allow strings with underscore, space, period, minus. And to ensure that strings with alpha numerics like 123.45 or -500.00 are accepted but where 5,000.00 is not.

Is it because the hyphen is second-to-last in your character set and is therefore defining a range from the '.' to the '_', which includes '/'?
Try this:
Pattern p = Pattern.compile("^[a-zA-Z0-9\\s\\._-]+$");
Also, NullUserException is right that there is no need for {1," + s.length() + "}; a plain + quantifier is enough. Because the expression starts with '^' and ends with '$', the entire string must be consumed.
Finally, you can use \w as a substitute for [a-zA-Z_0-9], simplifying your expression to "^[\\w\\s\\.-]+$"
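To see the range problem concretely, here is a minimal sketch that compares the original character class with the corrected one. The class and method names (HyphenRangeDemo, acceptedByBroken, acceptedByFixed) are made up for illustration and are not from the question.

```java
import java.util.regex.Pattern;

final class HyphenRangeDemo {
    // "\\.-_" is read as a range from '.' (U+002E) to '_' (U+005F); '/' (U+002F) lies inside it.
    static final Pattern BROKEN = Pattern.compile("^[a-zA-Z0-9\\s\\.-_]+$");
    // With the hyphen placed last it is a literal hyphen, so no range is formed.
    static final Pattern FIXED = Pattern.compile("^[a-zA-Z0-9\\s._-]+$");

    static boolean acceptedByBroken(String s) {
        return BROKEN.matcher(s).matches();
    }

    static boolean acceptedByFixed(String s) {
        return FIXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"140km/h", "123.45", "-500.00", "5,000.00"}) {
            System.out.println(s + " broken=" + acceptedByBroken(s) + " fixed=" + acceptedByFixed(s));
        }
    }
}
```

Note that the original class also rejects -500.00, because a hyphen that forms a range is not itself a member of the class unless it happens to fall inside that range.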

You can just use
public static boolean checkStringAlphaNumericChars(String s) {
return (s != null) && s.matches("[\\w\\s.-]+");
}
The short-circuited null check ensures s is not null when you try to do .matches() on it.
\w matches alphanumerics plus the underscore. tchrist will also be the first to point out this is more correct than [A-Za-z0-9_]
The + at the very end ensures you have at least one character (i.e. the string is not empty)
There's no need to use ^ and $ since .matches() tries to match the pattern against the whole string.
There's also no need to escape the dot (.) in a character class.
New Demo: http://ideone.com/qraob
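The point about .matches() versus anchors can be checked with a small sketch. MatchesVsFindDemo and its method names are hypothetical; the pattern is the one from this answer.

```java
import java.util.regex.Pattern;

final class MatchesVsFindDemo {
    static final Pattern ALLOWED = Pattern.compile("[\\w\\s.-]+");

    // matches() requires the whole input to match, so no ^ or $ is needed.
    static boolean wholeString(String s) {
        return s != null && ALLOWED.matcher(s).matches();
    }

    // find() is satisfied by any matching substring, so it is too lenient for validation.
    static boolean anySubstring(String s) {
        return s != null && ALLOWED.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeString("140km/h"));  // false: '/' is not in the class
        System.out.println(anySubstring("140km/h")); // true: "140km" alone matches
        System.out.println(wholeString("-500.00"));  // true
    }
}
```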

Related

How to replace excessive SQL wildcard by single regex pattern?

I am creating a function that strips the illegal wildcard patterns from the input string. The ideal solution should use a single regex expression, if at all possible.
The illegal wildcard patterns are: %% and %_%. Each instance of those should be replaced with %.
Here's the rub... I'm fuzz testing the function by running it against various inputs to try to break it.
It works for the most part; however, with complicated inputs, it doesn't.
The rest of this question has been updated:
The following inputs should return empty string (not an exhaustive list):
The following inputs should return % (not an exhaustive list).
%_%
%%
%%_%%
%_%%%
%%_%_%
%%_%%%_%%%_%
There will be cases where there are other characters with the input... like:
Foo123%_%
Should return "Foo123%"
B4r$%_%
Should return "B4r$%"
B4rs%%_%
Should return "B4rs%"
%%Lorem_%%
Should return "%Lorem_%"
I have tried using several different patterns and my tests are failing.
String input = "%_%%%%_%%%_%";
// old method:
public static String ancientMethod1(String input){
if (input == null)
return "";
return input.replaceAll("%_%", "").replaceAll("%%", ""); // Output: ""
}
// Attempt 1:
// Doesn't quite work right.
// "A%%" is returned as "A%%" instead of "A%"
public static String newMethod1(String input) {
String result = input;
while (result.contains("%%") || result.contains("%_%"))
result = result.replaceAll("%%","%").replaceAll("%_%","%");
if (result.equals("%"))
return "";
return input;
}
// Attempt 2:
// Succeeds, but I would like to simplify this:
public static String newMethod2(String input) {
if (input == null)
return "";
String illegalPattern1 = "%%";
String illegalPattern2 = "%_%";
String result = input;
while (result.contains(illegalPattern1) || result.contains(illegalPattern2)) {
result = result.replace(illegalPattern1, "%");
result = result.replace(illegalPattern2, "%");
}
if (result.equals("%") || result.equals("_"))
return "";
return result;
}
Here's a more complete defined example of how I'm using this: https://gist.github.com/sometowngeek/697c839a1bf1c9ee58be283b1396cf2e

This regular expression string matches all your examples:
"%(?:_?%)+"
It matches strings consisting of a '%' character followed by one or more sequences consisting of zero or one '_' character and one '%' character (close to literal translation), which is another way of saying what I did in comments: "a sequence of '%' and '_' characters, beginning and ending with '%', and not containing two consecutive '_' characters".
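As a rough sketch of how that pattern could be plugged into the function from the question: WildcardCollapser and collapse are names invented here, and returning "" for null simply mirrors the question's own null handling.

```java
import java.util.regex.Pattern;

final class WildcardCollapser {
    // A '%' followed by one or more groups of an optional '_' and a '%'.
    private static final Pattern ILLEGAL_RUN = Pattern.compile("%(?:_?%)+");

    static String collapse(String input) {
        if (input == null) {
            return "";
        }
        // Every illegal run is replaced by a single '%'.
        return ILLEGAL_RUN.matcher(input).replaceAll("%");
    }

    public static void main(String[] args) {
        String[] inputs = {"%_%", "%%", "%%_%%%_%%%_%", "Foo123%_%", "B4r$%_%", "%%Lorem_%%"};
        for (String input : inputs) {
            System.out.println(input + " -> " + collapse(input));
        }
    }
}
```

A single replaceAll pass is enough here because the + quantifier consumes the whole run at once, so no loop is needed.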

I'm not sure whether the inputs can take forms other than the ones listed. If not, an expression with start and end anchors may be more suitable here, testing the inputs either one by one or with something similar to:
^%{1,3}(_%{1,3})?(_%{1,3})?(_%)?$
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "^%{1,3}(_%{1,3})?(_%{1,3})?(_%)?$";
final String string = "%_%\n"
+ "%%\n"
+ "%%_%%\n"
+ "%%%_%%%\n"
+ "%_%%%\n"
+ "%%%_%\n"
+ "%%_%_%\n"
+ "%%_%%%_%%%_%";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}

Your newMethod1 actually works, except you have a typo: you're returning the input parameter, not the result of your processing!
Change:
return input; // oops!
to:
return result;
Also, because you're not using regex, you should use replace() rather than replaceAll(), ie:
result = result.replace("%%","%").replace("%_%","%"); // still replaces all occurrences
replace() still replaces all occurrences.
BTW, although not as strict, this works for all of your (currently) posted examples:
public static String myMethod(String input) {
return input.replaceAll("%[%_]*", "%");
}

It looks like all the patterns start with %, then have 0+ % or _ chars and end with %.
Use a mere
input = input.replaceAll("%[%_]*%", "%");
Details
% - a % char
[%_]* - 0 or more % or _ chars
% - a % char.
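The patterns suggested in this thread differ in strictness, which only shows up on inputs outside the question's list. The sketch below compares them on a few such inputs; the class name PatternStrictnessDemo, the helper collapse, and the sample inputs are assumptions made for illustration.

```java
final class PatternStrictnessDemo {
    // Replaces every match of the given regex with a single '%'.
    static String collapse(String regex, String input) {
        return input.replaceAll(regex, "%");
    }

    public static void main(String[] args) {
        String strict = "%(?:_?%)+";   // no two consecutive '_' allowed inside the run
        String bounded = "%[%_]*%";    // any mix of '%' and '_' between two '%'
        String loose = "%[%_]*";       // a '%' followed by any mix of '%' and '_'

        for (String input : new String[] {"A%_B", "%__%", "50%"}) {
            System.out.println(input
                    + " strict=" + collapse(strict, input)
                    + " bounded=" + collapse(bounded, input)
                    + " loose=" + collapse(loose, input));
        }
        // A%_B  strict=A%_B  bounded=A%_B  loose=A%B
        // %__%  strict=%__%  bounded=%     loose=%
        // 50%   strict=50%   bounded=50%   loose=50%
    }
}
```

Which behaviour is right for %__% or A%_B depends on whether those count as illegal wildcard patterns, which the question does not say.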

Regex when pattern involves dollar sign ($)

I'm running into a bit of an issue when it comes to matching sub-patterns that involve the dollar sign. For example, consider the following chunk of text:
(en $) foo
oof ($).
ofo (env. 80 $US)
I'm using the following regex:
Pattern p = Pattern.compile(
"\\([\\p{InARABIC}\\s]+\\)|\\([\\p{InBasic_Latin}\\s?\\$]+\\)|\\)([\\p{InARABIC}\\s]+)\\(",
Pattern.CASE_INSENSITIVE);
public String replace(String text) {
Matcher m = p.matcher(text);
String replacement = m.replaceAll(match -> {
if (m.group(1) == null) {
return m.group();
} else {
return "(" + match.group(1) + ")";
}
});
return replacement;
}
but it can't match text containing $.

This code is similar to replaceAll(regex, replacement). The problem is that $ is special not only in the regex argument but also in the replacement, where it refers to a matched group, as in $x (where x is the group number) or ${groupName} if your regex has (?<groupName>subregex).
This allows us to write code like
String doubled = "abc".replaceAll(".", "$0$0");
System.out.println(doubled); //prints: aabbcc
which replaces each character with two copies of itself: each character is matched by . and placed in group 0, so $0$0 represents two repetitions of that matched character.
But in your case the text contains $, so when it is matched you are replacing it with itself. The replacement then contains a $ with no group number (or group name) after it, which results in IllegalArgumentException: Illegal group reference.
The solution is to escape that $ in the replacement. You can do it manually with \, but it is better to use the method designed for that purpose, Matcher#quoteReplacement (if the replacement syntax ever changes and more characters need escaping, this method should change along with the regex engine, which should save you some trouble later).
So try changing your code to
public String replace(String text) {
    Matcher m = p.matcher(text);
    String replacement = m.replaceAll(match -> {
        if (match.group(1) == null) {
            // quoteReplacement makes any $ or \ in the match literal
            return Matcher.quoteReplacement(match.group());
        } else {
            return Matcher.quoteReplacement("(" + match.group(1) + ")");
        }
    });
    return replacement;
}
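The effect of quoting can be reproduced in isolation with the plain String-based replaceAll. This is only a sketch: DollarReplacementDemo, the placeholder text "price: X" and both method names are invented for the example.

```java
import java.util.regex.Matcher;

final class DollarReplacementDemo {
    // The replacement is interpreted: $0, $1, ... are group references.
    static String unquoted(String replacement) {
        return "price: X".replaceAll("X", replacement);
    }

    // The replacement is taken literally, whatever it contains.
    static String quoted(String replacement) {
        return "price: X".replaceAll("X", Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(unquoted("[$0]"));   // price: [X]
        System.out.println(quoted("80 $US"));   // price: 80 $US
        try {
            unquoted("80 $US");                 // "$U" is not a valid group reference
        } catch (IllegalArgumentException e) {
            System.out.println("unquoted failed: " + e.getMessage());
        }
    }
}
```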

Multiple matches with delimiter

this is my regex:
([+-]*)(\\d+)\\s*([a-zA-Z]+)
group no.1 = sign
group no.2 = multiplier
group no.3 = time unit
The thing is, I would like to match given input but it can be "chained". So my input should be valid if and only if the whole pattern is repeating without anything between those occurrences (except of whitespaces). (Only one match or multiple matches next to each other with possible whitespaces between them).
valid examples:
1day
+1day
-1 day
+1day-1month
+1day +1month
+1day +1month
invalid examples:
###+1day+1month
+1day###+1month
+1day+1month###
###+1day+1month###
###+1day+1month###
In my case I can use the matcher.find() method. This would do the trick, but it will accept input like this: +1day###+1month, which is not valid for me.
Any ideas? This can be solved with multiple IF conditions and multiple checks for start and end indexes but I'm searching for elegant solution.
EDIT
The regex suggested in the comments below, ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$, will partially do the trick, but if I use it in the code below it returns a different result from the one I'm looking for.
The problem is that I cannot use (*my regex*)+ because it will match the whole thing.
The solution could be to match the whole input with ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$ and then use ([+-]*)(\\d+)\\s*([a-zA-Z]+) with matcher.find() and matcher.group(i) to extract each match and its groups. But I was looking for a more elegant solution.

This should work for you:
^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$
First, by adding the beginning and ending anchors (^ and $), the pattern will not allow invalid characters to occur anywhere before or after the match.
Next, I included optional whitespace before and after the repeated pattern (\s*).
Finally, the entire pattern is enclosed in a repeater so that it can occur multiple times in a row ((...)+).
On a side note, I'd also recommend changing [+-]* to [+-]? so that the sign can occur at most once.
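A repeated capturing group only keeps its last iteration, so the anchored pattern validates the input but does not hand back every term. One way around that is the two-step approach mentioned in the question's edit: validate first, then extract with find(). The sketch below assumes that approach; DurationParser and parse are hypothetical names, and it uses [+-]? as recommended above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DurationParser {
    // Whole-input check: one or more terms separated only by optional whitespace.
    private static final Pattern WHOLE =
            Pattern.compile("\\s*(?:[+-]?\\d+\\s*[a-zA-Z]+\\s*)+");
    // Single term: group 1 = sign, group 2 = multiplier, group 3 = time unit.
    private static final Pattern TERM =
            Pattern.compile("([+-]?)(\\d+)\\s*([a-zA-Z]+)");

    // Returns one {sign, multiplier, unit} array per term, or an empty list if the input is invalid.
    static List<String[]> parse(String input) {
        List<String[]> terms = new ArrayList<>();
        if (input == null || !WHOLE.matcher(input).matches()) {
            return terms;
        }
        Matcher m = TERM.matcher(input);
        while (m.find()) {
            terms.add(new String[] {m.group(1), m.group(2), m.group(3)});
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String[] t : parse("+1day -1month")) {
            System.out.println("sign=" + t[0] + " multiplier=" + t[1] + " unit=" + t[2]);
        }
        System.out.println(parse("+1day###+1month").isEmpty()); // true
    }
}
```

An empty list is unambiguous here because a valid input always contains at least one term.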

You could use ^ and $ for that, to match the start and end of the string
^\s*(?:([+-]?)(\d+)\s*([a-z]+)\s*)+$
https://regex101.com/r/lM7dZ9/2
See the Unit Tests for your examples. Basically, you just need to allow the pattern to repeat and force that nothing besides whitespace occurs in between the matches.
Combined with line start/end matching and you're done.

You can use String.matches or Matcher.matches in Java to match the entire region.
Java Example:
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
import java.util.regex.Pattern;
import org.junit.Test;

public class RegTest {
    public static final Pattern PATTERN = Pattern.compile(
            "(\\s*([+-]?)(\\d+)\\s*([a-zA-Z]+)\\s*)+");

    @Test
    public void testDays() throws Exception {
        assertTrue(valid("1 day"));
        assertTrue(valid("-1 day"));
        assertTrue(valid("+1day-1month"));
        assertTrue(valid("+1day -1month"));
        assertTrue(valid(" +1day +1month "));
        assertFalse(valid("+1day###+1month"));
        assertFalse(valid(""));
        assertFalse(valid("++1day-1month"));
    }

    // matches() succeeds only if the whole input matches the pattern
    private static boolean valid(String s) {
        return PATTERN.matcher(s).matches();
    }
}

You can proceed like this:
String p = "\\G\\s*(?:([-+]?)(\\d+)\\s*([a-z]+)|\\z)";
Pattern RegexCompile = Pattern.compile(p, Pattern.CASE_INSENSITIVE);
String s = "+1day 1month";
ArrayList<HashMap<String, String>> results = new ArrayList<HashMap<String, String>>();
Matcher m = RegexCompile.matcher(s);
boolean validFormat = false;
while( m.find() ) {
if (m.group(1) == null) {
// if the capture group 1 (or 2 or 3) is null, it means that the second
// branch of the pattern has succeeded (the \z branch) and that the end
// of the string has been reached.
validFormat = true;
} else {
// otherwise, this is not the end of the string and the match result is
// temporarily stored in the ArrayList 'results'
HashMap<String, String> result = new HashMap<String, String>();
result.put("sign", m.group(1));
result.put("multiplier", m.group(2));
result.put("time_unit", m.group(3));
results.add(result);
}
}
if (validFormat) {
for (HashMap<String, String> item : results) {
System.out.println("sign: " + item.get("sign")
+ "\nmultiplier: " + item.get("multiplier")
+ "\ntime_unit: " + item.get("time_unit") + "\n");
}
} else {
results.clear();
System.out.println("Invalid Format");
}
The \G anchor matches the start of the string or the position after the previous match. In this pattern, it ensures that all matches are contiguous. If the end of the string is reached, this proves that the string is valid from start to end.

Investigate a string in java whether it is include some special signs?

I have the following Java method and have some conditions for the parameter searchPattern:
public boolean checkPatternMatching(String sourceToScan, String searchPattern) {
boolean patternFounded;
if (sourceToScan == null) {
patternFounded = false;
} else {
Pattern pattern = Pattern.compile(Pattern.quote(searchPattern),
Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sourceToScan);
patternFounded = matcher.find();
}
return patternFounded;
}
I want to allow all letters (both uppercase and lowercase) and only (!) the special signs "-", ":" and "=". For any other character the method must return false.
How can I implement this logic for the parameter "searchPattern"?

Try searchPattern = "[a-zA-Z:=-]"

Try this pattern [a-zA-Z=,_!:]+
String pattern ="[a-zA-Z=,_!:]+";
String input="hello_:,!=";
if(input.matches(pattern)){
System.out.println("true");
}else{
System.out.println("false");
}

"[a-zA-Z!=:\\s-]+"
Square brackets define a character class, which matches any single character listed inside the brackets. The + means one or more characters from the class, and the \\s is for whitespace. The hyphen goes last so that it is taken literally instead of forming a range.
So if you want just letters and spaces, as per your comment in the original post
"[a-zA-Z\\s]+"

Use searchPattern as [a-zA-Z!:=-]+

searchPattern = "^[A-Za-z!=:-]+$"
^ means "begins with"
$ means "ends with"
[A-Za-z!=:-] is a character class that contains any letter or the symbols !, =, :, -
+ means "1 or more" of the preceding
This will work if the string will solely contain those symbols, ie no spaces or anything else.
If you want a string that contains the given symbols and may also contain whitespace, use:
searchPattern = "^[A-Za-z!=:\\s-]+$"
\\s stands for white-space character
Finally, if you want to simply see if a string contains any one of these symbols, you can use:
searchPattern = "[A-Za-z!=:-]"
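One thing to keep in mind with all of these suggestions: the method in the question wraps searchPattern in Pattern.quote, which makes the whole string a literal, so a character class passed that way is searched for verbatim. The sketch below shows that effect and a whole-string check built on a class compiled without quoting. AllowedCharsCheck and its method names are hypothetical, and the class [a-zA-Z:=-] follows the question's list of allowed signs.

```java
import java.util.regex.Pattern;

final class AllowedCharsCheck {
    // Letters plus '-', ':' and '='; the hyphen is last so it is taken literally.
    private static final Pattern ONLY_ALLOWED = Pattern.compile("[a-zA-Z:=-]+");

    // True only if every character of s is a letter, '-', ':' or '='.
    static boolean containsOnlyAllowed(String s) {
        return s != null && ONLY_ALLOWED.matcher(s).matches();
    }

    // Same shape as the question's method: the search pattern is quoted, hence literal.
    static boolean literalFind(String source, String searchPattern) {
        return source != null
                && Pattern.compile(Pattern.quote(searchPattern), Pattern.CASE_INSENSITIVE)
                        .matcher(source)
                        .find();
    }

    public static void main(String[] args) {
        System.out.println(containsOnlyAllowed("key=Value:a-b")); // true
        System.out.println(containsOnlyAllowed("key=1"));         // false
        System.out.println(literalFind("abc", "[a-z]"));          // false: looks for the text "[a-z]"
        System.out.println(literalFind("x[A-Z]y", "[a-z]"));      // true: literal, case-insensitive
    }
}
```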

Replacing spaces within quotes

I'm really struggling with regex here. Using Java how would I go about replacing all spaces within quotes (double quotes really) with another character (or escaped space "\ ") but ONLY if the phrase ends with a wildcard character.
word1 AND "word2 word3 word4*" OR "word5 word6" OR word7
to
word1 AND "word2\ word3\ word4*" OR "word5 word6" OR word7

I think the best solution is to use a regular expression to find the quoted strings you want, and then to replace the spaces within the regex's match. Something like this:
import java.util.regex.*;
class SOReplaceSpacesInQuotes {
public static void main(String[] args) {
Pattern findQuotes = Pattern.compile("\"[^\"]+\\*\"");
for (String arg : args) {
Matcher m = findQuotes.matcher(arg);
StringBuffer result = new StringBuffer();
while (m.find())
m.appendReplacement(result, m.group().replace(" ", "\\\\ "));
m.appendTail(result);
System.out.println(arg + " -> " + result.toString());
}
}
}
Running java SOReplaceSpacesInQuotes 'word1 AND "word2 word3 word4*" OR "word5 word6*" OR word7' then happily produced the output word1 AND "word2 word3 word4*" OR "word5 word6*" OR word7 -> word1 AND "word2\ word3\ word4*" OR "word5\ word6*" OR word7, which is exactly what you wanted.
The pattern is "[^"]+\*", but backslashes and quotes have to be escaped for Java. This matches a literal quote, one or more non-quotes, a *, and a quote, which is what you want. This assumes that (a) you aren't allowed to have embedded \" escape sequences, and (b) that * is the only wildcard. If you have embedded escape sequences, then use "([^\\"]|\\.)+\*" (which, escaped for Java, is \"([^\\\\\"]|\\\\.)+\\*\"); if you have multiple wildcards, use "[^"]+[*+]"; and if you have both, combine them in the obvious way. Dealing with multiple wildcards is a matter of just letting any of them match at the end of the string; dealing with escape sequences is done by matching a quote followed by any number of non-backslash, non-quote characters, or a backslash preceding anything at all.
Now, that pattern finds the quoted strings you want. For each argument to the program, we then match all of them and, using m.group().replace(" ", "\\\\ "), replace each space in what was matched (the quoted string) with a backslash and a space. (The Java literal "\\\\ " is the three-character string \\ followed by a space. Two real backslashes are required because appendReplacement treats a backslash in the replacement string as an escape character, so a literal backslash has to be escaped itself.) If you haven't seen appendReplacement and appendTail before (I hadn't), here's what they do: in tandem, they iterate through the entire string, replacing whatever was matched with the second argument to appendReplacement, and appending it all to the given StringBuffer. The appendTail call is necessary to catch whatever didn't match at the end. The documentation for Matcher.appendReplacement(StringBuffer,String) contains a good example of their use.
Edit: As Roland Illig pointed out, this is problematic if certain kinds of invalid input can appear, such as a AND "b" AND *"c", which would become a AND "b"\ AND\ *"c". If this is a danger (or if it could possibly become a danger in the future, which it likely could), then you should make it more robust by always matching quotes, but only replacing if they ended in a wildcard character. This will work as long as your quotes are always appropriately paired, which is a much weaker assumption. The resulting code is very similar:
import java.util.regex.*;
class SOReplaceSpacesInQuotes {
public static void main(String[] args) {
Pattern findQuotes = Pattern.compile("\"[^\"]+?(\\*)?\"");
for (String arg : args) {
Matcher m = findQuotes.matcher(arg);
StringBuffer result = new StringBuffer();
while (m.find()) {
if (m.group(1) == null)
m.appendReplacement(result, m.group());
else
m.appendReplacement(result, m.group().replace(" ", "\\\\ "));
}
m.appendTail(result);
System.out.println(arg + " -> " + result.toString());
}
}
}
We put the wildcard character in a group, and make it optional, and make the body of the quotes reluctant with +?, so that it will match as little as possible and let the wildcard character get grouped. This way, we match each successive pair of quotes, and since the regex engine won't restart in the middle of a match, we'll only ever match the insides, not the outsides, of quotes. But now we don't always want to replace the spaces; we only want to do so if there was a wildcard character. This is easy: test to see if group 1 is null. If it is, then there wasn't a wildcard character, so replace the string with itself. Otherwise, replace the spaces. And indeed, java SOReplaceSpacesInQuotes 'a AND "b d" AND *"c d"' yields the desired a AND "b d" AND *"c d" -> a AND "b d" AND *"c d", while java SOReplaceSpacesInQuotes 'a AND "b d" AND "c d*"' performs a substitution to get a AND "b d" AND "c d*" -> a AND "b d" AND "c\ d*".
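If the quoted text may itself contain $ or \, passing it straight to appendReplacement can fail or be misread, because both characters are special in a replacement string. A possible variant, sketched here under that assumption, lets Matcher.quoteReplacement do the escaping so the space only needs a single real backslash in front of it. QuotedWildcardEscaper and escapeSpaces are names made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedWildcardEscaper {
    // A quoted string; group 1 is set only when the quote ends with '*'.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]+?(\\*)?\"");

    static String escapeSpaces(String input) {
        Matcher m = QUOTED.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String quoted = m.group();
            if (m.group(1) != null) {
                // Wildcard present: put one backslash before every space.
                quoted = quoted.replace(" ", "\\ ");
            }
            // quoteReplacement escapes any '\' and '$' so the text is appended literally.
            m.appendReplacement(out, Matcher.quoteReplacement(quoted));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeSpaces(
                "word1 AND \"word2 word3 word4*\" OR \"word5 word6\" OR word7"));
    }
}
```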

Do you really need regular expressions here? The task seems well-described, but a little too complex for regular expressions. So I would rather program it out explicitly.
package so4478038;
import static org.junit.Assert.*;
import org.junit.Test;
public class QuoteSpaces {
public static String escapeSpacesInQuotes(String input) {
StringBuilder sb = new StringBuilder();
StringBuilder quotedWord = new StringBuilder();
boolean inQuotes = false;
for (int i = 0, imax = input.length(); i < imax; i++) {
char c = input.charAt(i);
if (c == '"') {
if (!inQuotes) {
quotedWord.setLength(0);
} else {
String qw = quotedWord.toString();
if (qw.endsWith("*")) {
sb.append(qw.replace(" ", "\\ "));
} else {
sb.append(qw);
}
}
inQuotes = !inQuotes;
}
if (inQuotes) {
quotedWord.append(c);
} else {
sb.append(c);
}
}
return sb.toString();
}
@Test
public void test() {
assertEquals("word1 AND \"word2\\ word3\\ word4*\" OR \"word5 word6\" OR word7", escapeSpacesInQuotes("word1 AND \"word2 word3 word4*\" OR \"word5 word6\" OR word7"));
}
}

Does this work?
str.replaceAll("\"", "\\");
I don't have an IDE at hand, so I haven't tested it.
