regex - disallowing a sequence of characters

I have a variable in the form of {varName} or {varName, "defaultValue"} and I want a regex that will match it. varName is only alphanumeric and "_" (\w+), and the default value can be anything except the combination of "} which signifies the end of the variable. White space doesn't matter between the braces, comma, varName or defaultValue. So far the regex that I have come up with is
\{\s*(\w+)\s*(,\s*\"([^(\"\})]*)\"\s*)?\}
The problem is that the match ends at the first " OR } rather than at the combination "}. For example, {hello, "world"} matches, but {hello, "wor"ld"} and {hello, "wor}ld"} do not.
Any idea how to solve this? In case this helps, I'm coding it using Java.

final Pattern p = Pattern.compile("\\{\\s*(\\w+)\\s*(,\\s*\"((?!\"\\}).*)\"\\s*)?\\}");
Matcher m1 = p.matcher("{hello, \"world\"}");
if (m1.matches()) {
    System.out.println("var1:" + m1.group(1));
    System.out.println("val1:" + m1.group(3));
}
Matcher m2 = p.matcher("{hello, \"wor}ld\"}");
if (m2.matches()) {
    System.out.println("var2:" + m2.group(1));
    System.out.println("val2:" + m2.group(3));
}
Matcher m3 = p.matcher("{hello, \"wor}\"ld\"}");
if (m3.matches()) {
    System.out.println("var3:" + m3.group(1));
    System.out.println("val3:" + m3.group(3));
}
/*output:
var1:hello
val1:world
var2:hello
val2:wor}ld
var3:hello
val3:wor}"ld */
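A related sketch, not part of the answer above: in that pattern the (?!"\}) lookahead is checked only once, at the start of the value, so the result relies on matches() anchoring the whole input. If the variables are embedded in longer text and located with find(), one common option is to repeat the lookahead before every character of the value. The class name TemperedValueDemo and the extract helper are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedValueDemo {
    // Group 1: variable name; group 2: default value (null when absent).
    // (?:(?!"\s*\}).)* consumes one character at a time, but only while the
    // upcoming text is not a quote followed by optional whitespace and '}'.
    static final Pattern VAR = Pattern.compile(
            "\\{\\s*(\\w+)\\s*(?:,\\s*\"((?:(?!\"\\s*\\}).)*)\"\\s*)?\\}");

    // Collects "name=value" for every variable found anywhere in the text.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = VAR.matcher(text);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [x=null, hello=wor}"ld, y=1]
        System.out.println(extract("a {x} b {hello, \"wor}\"ld\"} c {y, \"1\"}"));
    }
}
```

Because the value can never run past a closing "} sequence, each variable in a longer text is matched separately.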

I have found the solution to my own question, it's only fair to share. The following regex:
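\{\s*(\w+)\s*(,\s*\"(.*?)\s*\"\s*)?\}
will do the trick. The lazy .*? stops at the first '"}' sequence when the pattern is used with find(); with matches(), as in the example below, it extends as far as needed for the whole input to match.
In Java, this will be (continuing the previous answer's example):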
final Pattern p = Pattern.compile("\\{\\s*(\\w+)\\s*(,\\s*\"(.*?)\\s*\"\\s*)?\\}");
Matcher m1 = p.matcher("{hello, \"world\"}");
if (m1.matches()) {
    System.out.println("var1:" + m1.group(1));
    System.out.println("val1:" + m1.group(3));
}
Matcher m2 = p.matcher("{hello, \"wor\"rld\"}\"}");
if (m2.matches()) {
    System.out.println("var2:" + m2.group(1));
    System.out.println("val2:" + m2.group(3));
}
/* Output
var1:hello
val1:world
var2:hello
val2:wor"rld"}
*/
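As a small sketch of how the lazy quantifier interacts with the matching method, the snippet below runs the same pattern through matches() and find() on the second input from the example. The class name LazyMatchDemo and its two helper methods are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyMatchDemo {
    static final Pattern P = Pattern.compile(
            "\\{\\s*(\\w+)\\s*(,\\s*\"(.*?)\\s*\"\\s*)?\\}");

    // Value captured when the whole input must match the pattern.
    static String viaMatches(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(3) : null;
    }

    // Value captured by the first match found anywhere in the input.
    static String viaFind(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String input = "{hello, \"wor\"rld\"}\"}";
        System.out.println(viaMatches(input)); // wor"rld"}
        System.out.println(viaFind(input));    // wor"rld
    }
}
```

With find(), the match ends at the first quote that is followed by a closing brace; with matches(), the lazy group is forced to grow until the final "} of the input.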

Related

Search substring in a string using regex

I'm trying to search for a set of words, contained in an ArrayList (terms_1pers), inside a string. Since the precondition is that there should be no letters immediately before or after the search word, I thought of using a regular expression.
I just don't know what I'm doing wrong with the matches method. In the code below, if no match is found, the record is written to an external file.
String url = csvRecord.get("url");
String text = csvRecord.get("review");
String var = null;
for(String term : terms_1pers)
{
    if(!text.matches("[^a-z]"+term+"[^a-z]"))
    {
        var="true";
    }
}
if(!var.equals("true"))
{
    bw.write(url+";"+text+"\n");
}

To find regex matches, you should use the regex classes Pattern and Matcher.
String term = "term";
ArrayList<String> a = new ArrayList<String>();
a.add("123term456"); // match
a.add("A123Term5"); // no match
a.add("term456"); // match
a.add("123term"); // match
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for (String text : a) {
    Matcher m = p.matcher(text);
    if (m.find()) {
        // group(0) is the whole match; the term is the first capturing group, so use group(1)
        System.out.println("Found: " + m.group(1));
    }
    else System.out.println("No match for: " + term);
}
In the example above, we create an instance of Pattern (https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) to find matches in the text you are matching against.
Note that I adjusted the regex a bit. The character class in this code excludes all letters A-Z and their lowercase versions from the parts before and after the term. It also allows for situations where there are no characters at all before or after the term. If you need to have something there, use + instead of *. I also used ^ and $ to anchor the pattern to the start and end of the text, so the whole text must consist of only these three parts. If this doesn't fit your use case, you may need to adjust.
To demonstrate using this with a variety of different terms:
ArrayList<String> terms = new ArrayList<String>();
terms.add("term");
terms.add("the book is on the table");
terms.add("1981 was the best year ever!");
ArrayList<String> a = new ArrayList<String>();
a.add("123term456");
a.add("A123Term5");
a.add("the book is on the table456");
a.add("1##!231981 was the best year ever!9#");
for (String term : terms) {
    Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
    for (String text : a) {
        Matcher m = p.matcher(text);
        if (m.find()) {
            // the term is the first capturing group, so use group(1)
            System.out.println("Found: " + m.group(1) + " in " + text);
        }
        else System.out.println("No match for: " + term + " in " + text);
    }
}
Output for this is:
Found: term in 123term456
No match for: term in A123Term5
No match for: term in the book is on the table456....
In response to the question about matching the term case-insensitively, here's a way to build the regex string by using java.lang.Character to generate alternatives for upper- and lowercase letters.
String term = "This iS the teRm.";
String matchText = "123This is the term.";
StringBuilder str = new StringBuilder();
str.append("^[^A-Za-z]*(");
for (int i = 0; i < term.length(); i++) {
    char c = term.charAt(i);
    if (Character.isLetter(c))
        str.append("(" + Character.toLowerCase(c) + "|" + Character.toUpperCase(c) + ")");
    else str.append(c);
}
str.append(")[^A-Za-z]*$");
System.out.println(str.toString());
Pattern p = Pattern.compile(str.toString());
Matcher m = p.matcher(matchText);
if (m.find()) System.out.println("Found!");
else System.out.println("Not Found!");
This code outputs two lines. The first is the regex string that is compiled into the Pattern: "^[^A-Za-z]*((t|T)(h|H)(i|I)(s|S) (i|I)(s|S) (t|T)(h|H)(e|E) (t|T)(e|E)(r|R)(m|M).)[^A-Za-z]*$". This adjusted regex allows the letters in the term to be matched regardless of case. The second line is "Found!" because the mixed-case term is found within matchText.
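As an alternative sketch (a swapped-in technique, not the approach built above), the same effect can usually be had by compiling the pattern with the Pattern.CASE_INSENSITIVE flag and quoting the term with Pattern.quote, so that characters such as "." are treated literally. The class and method names are illustrative assumptions.

```java
import java.util.regex.Pattern;

class CaseInsensitiveTermDemo {
    // True when the text consists of non-letters, then the term in any
    // letter case, then non-letters again.
    static boolean matchesTerm(String term, String text) {
        Pattern p = Pattern.compile(
                "^[^A-Za-z]*(" + Pattern.quote(term) + ")[^A-Za-z]*$",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesTerm("This iS the teRm.", "123This is the term.")); // true
        System.out.println(matchesTerm("term", "A123Term5"));                         // false
    }
}
```

By default the flag only covers ASCII letters; adding Pattern.UNICODE_CASE extends it to other scripts.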

There are several things to note:
matches requires a full string match, so [^a-z]term[^a-z] will only match a string like ":term.". You need to use .find() to find partial matches.
If you pass a literal string to a regex, you need to Pattern.quote it; otherwise, if it contains special characters, it will not be matched correctly.
To check if a word has some pattern before or after or at the start/end, you should either use alternations with anchors (like (?:^|[^a-z]) or (?:$|[^a-z])) or lookarounds, (?<![a-z]) and (?![a-z]).
To match any letter just use \p{Alpha} or - if you plan to match any Unicode letter - \p{L}.
The var variable is more logical to set to Boolean type.
Fixed code:
String url = csvRecord.get("url");
String text = csvRecord.get("review");
Boolean var = false;
for(String term : terms_1pers)
{
    Matcher m = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
    // If the search must be case insensitive use
    // Matcher m = Pattern.compile("(?i)(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
    if(!m.find())
    {
        var = true;
    }
}
if (!var) {
    bw.write(url+";"+text+"\n");
}
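To try the lookaround idea from the notes above in isolation, here is a minimal, self-contained sketch. The class name WordSearchDemo and the containsWord helper are illustrative assumptions; only Pattern.quote and the (?<!\p{L}) and (?!\p{L}) lookarounds come from the answer.

```java
import java.util.regex.Pattern;

class WordSearchDemo {
    // True when term occurs in text with no letter directly before or after it.
    static boolean containsWord(String text, String term) {
        Pattern p = Pattern.compile(
                "(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("I, myself, agree", "myself")); // true
        System.out.println(containsWord("themselves", "selves"));       // false
        System.out.println(containsWord("cost is $5.00 today", "$5.00")); // true
    }
}
```

The last call shows why quoting matters: without Pattern.quote, the "$" and "." in the term would be interpreted as regex metacharacters.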

You did not consider the case where the text before and after the term may contain letters,
so adding .* at the front and end should solve your problem.
for (String term : terms_1pers)
{
    if (text.matches(".*[^a-zA-Z]+" + term + "[^a-zA-Z]+.*"))
    {
        var = "true";
        break; // exit the loop
    }
}
if (!"true".equals(var)) // null-safe: var is still null when nothing matched
{
    bw.write(url + ";" + text + "\n");
}

Java Regex expression not working

I have a problem with a regex that is not working. I don't know what I am doing wrong. My code:
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
Pattern p = Pattern.compile("\\btimetable:(.*);");
//also tried "timetable:(.*);" and "(\\btimetable:)(.*)(;)"
Matcher m = p.matcher(test);
while(m.find()) {
    System.out.println("S:" + m.start() + ", E:" + m.end());
    System.out.println("x: "+ test.substring(m.start(), m.end()));
}
Expected result:
(1) "timetable:xxxxxtimetable:"
(2) "timetable: fullihhghtO"
Thanks for any help.

A non-capturing group could be handy in our case:
String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
Pattern p = Pattern.compile("(?:\\btimetable:(.*?);)+"); // <-- here
Matcher m = p.matcher(test);
int i = 1;
while (m.find()) {
    System.out.println(i + ") "+ m.group(1));
    i++;
}
OUTPUT
1) xxxxxtimetable:
2) fullihhghtO
Regex explained:
(?:\\btimetable:(.*?);)+ by using the non-capturing group (?:\\btimetable:...) we consume the "timetable:" prefix without capturing it, then the capturing group (.*?) captures what we want (everything between \btimetable: and ;). Pay special attention to the non-greedy term .*?, which means that we consume the minimum possible number of characters up to the next ;. If we don't use this lazy form, the regex uses the default "greedy" mode and consumes all the characters up to the last ; in the string!
All of that applies if you want to capture only the unique part. If you want to capture the whole thing instead:
1) timetable:xxxxxtimetable:;
2) timetable: fullihhghtO;
It can be done easily by modifying the line with the regex to:
Pattern p = Pattern.compile("\\b(timetable:.*?;)+");
which is even simpler: only one capturing group (see that we still have to use the non-greedy mode!).
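To make the greedy versus non-greedy difference concrete, here is a small sketch that runs both variants of the pattern over the test string from the question and collects group 1 of every match. The class name GreedyVsLazyDemo and the captures helper are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Returns group 1 of every match of regex in input.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
        // Greedy: a single match that runs up to the last ';'
        System.out.println(captures("\\btimetable:(.*);", test));
        // Lazy: one match per ';'-terminated entry
        System.out.println(captures("\\btimetable:(.*?);", test));
    }
}
```

Note that the second lazy capture keeps the space after the colon; call trim() on it if that matters.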

You don't need to use regex, a simple split would do it :
public static void main(String[] args) throws IOException {
    String test = "timetable:xxxxxtimetable:; timetable: fullihhghtO;";
    String[] array = test.split(";");
    String str1 = array[0].trim();
    String str2 = array[1].trim();
    System.out.println(str1 + "\n" + str2);
    // timetable:xxxxxtimetable:
    // timetable: fullihhghtO
}

Count regex matches with streams

I am trying to count the number of matches of a regex pattern with a simple Java 8 lambdas/streams based solution. For example for this pattern/matcher :
final Pattern pattern = Pattern.compile("\\d+");
final Matcher matcher = pattern.matcher("1,2,3,4");
There is the method splitAsStream which splits the text on the given pattern instead of matching the pattern. Although it's elegant and preserves immutability, it's not always correct :
// count is 4, correct
final long count = pattern.splitAsStream("1,2,3,4").count();
// count is 0, wrong
final long count = pattern.splitAsStream("1").count();
I also tried (ab)using an IntStream. The problem is I have to guess how many times I should call matcher.find() instead of until it returns false.
final long count = IntStream
        .iterate(0, i -> matcher.find() ? 1 : 0)
        .limit(100)
        .sum();
I am familiar with the traditional solution while (matcher.find()) count++; where count is mutable. Is there a simple way to do that with Java 8 lambdas/streams ?

To use Pattern::splitAsStream properly you have to invert your regex. That means instead of \\d+ (which splits on every number) you should use \\D+. This gives you every number in your String.
final Pattern pattern = Pattern.compile("\\D+");
// count is 4
long count = pattern.splitAsStream("1,2,3,4").count();
// count is 1
count = pattern.splitAsStream("1").count();
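One caveat worth sketching: according to the splitAsStream documentation, a positive-width match at the very beginning of the input produces an empty leading element, so an input such as ",1,2" would be counted as three items. Filtering out empty strings is one way to guard against that. The class name DigitRunCounter and the countNumbers helper are illustrative assumptions.

```java
import java.util.regex.Pattern;

class DigitRunCounter {
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Splits on runs of non-digits and counts the non-empty pieces,
    // so a leading separator does not add a phantom element.
    static long countNumbers(String input) {
        return NON_DIGITS.splitAsStream(input)
                .filter(s -> !s.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countNumbers("1,2,3,4")); // 4
        System.out.println(countNumbers("1"));       // 1
        System.out.println(countNumbers(",1,2"));    // 2
    }
}
```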

The rather contrived language in the javadoc of Pattern.splitAsStream is probably to blame.
The stream returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence.
If you print out all of the elements produced for 1,2,3,4 you may be surprised to notice that it is actually returning the commas, not the numbers.
System.out.println("[" + pattern.splitAsStream("1,2,3,4")
        .collect(Collectors.joining("!")) + "]");
prints [!,!,!,]. The count is 4 and not 3 because the input starts with a match, so an empty leading string is included in the stream, while trailing empty strings are dropped.
This also explains why "1" gives 0: there are no strings between numbers in the input.
A quick demo:
private void test(Pattern pattern, String s) {
    System.out.println(s + "-[" + pattern.splitAsStream(s)
            .collect(Collectors.joining("!")) + "]");
}
public void test() {
    final Pattern pattern = Pattern.compile("\\d+");
    test(pattern, "1,2,3,4");
    test(pattern, "a1b2c3d4e");
    test(pattern, "1");
}
prints
1,2,3,4-[!,!,!,]
a1b2c3d4e-[a!b!c!d!e]
1-[]

You can extend AbstractSpliterator to solve this:
static class SpliterMatcher extends AbstractSpliterator<Integer> {
    private final Matcher m;
    public SpliterMatcher(Matcher m) {
        super(Long.MAX_VALUE, NONNULL | IMMUTABLE);
        this.m = m;
    }
    @Override
    public boolean tryAdvance(Consumer<? super Integer> action) {
        // emit one element per successful find()
        boolean found = m.find();
        if (found)
            action.accept(m.groupCount());
        return found;
    }
}
final Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher("1");
long count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 1
matcher = pattern.matcher("1,2,3,4");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 4
matcher = pattern.matcher("foobar");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 0

In short, you have a stream of Strings and a String pattern: how many of those strings match the pattern?
final String myString = "1,2,3,4";
Long count = Arrays.stream(myString.split(","))
        .filter(str -> str.matches("\\d+"))
        .count();
where the first line can be replaced by any other source of a stream of strings, e.g. the stream() of a List<String>.
Am I wrong ?

Java 9
You may use Matcher#results() to get hold of all matches:
Stream<MatchResult> results()
Returns a stream of match results for each subsequence of the input sequence that matches the pattern. The match results occur in the same order as the matching subsequences in the input sequence.
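A minimal sketch of counting with Matcher#results(), assuming Java 9 or later. The class name ResultsCountDemo and its helper methods are illustrative assumptions; results(), MatchResult.group() and count() are the standard API.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ResultsCountDemo {
    // Number of non-overlapping matches of regex in input.
    static long countMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results().count();
    }

    // The matched substrings themselves, in order of appearance.
    static List<String> matchedTexts(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(countMatches("\\d+", "1,2,3,4")); // 4
        System.out.println(countMatches("\\d+", "1"));       // 1
        System.out.println(matchedTexts("\\d+", "a1b22c"));  // [1, 22]
    }
}
```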
Java 8 and lower
Another simple solution based on using a reverse pattern:
String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
Here, all non-digits are removed from the start and end of the string, and then the string is split by non-digit sequences without reporting any trailing empty elements (since 0 is passed as the limit argument to split).
See this demo:
String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1,2,3".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);// => 3
System.out.println("hz 1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1 hz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("xxx 1 223 zzz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);//=>2

regex for removing zeros in decimal string

I need to remove leading zeros from each numeric part of a decimal string,
e.g. 007.004(100.007) should be transformed to 7.4(100.7).
I tried using a matcher based on the pattern "0+(\d)":
Pattern p = Pattern.compile(regex);
Matcher m = null;
try {
m = p.matcher(version);
while (m.find()) {
System.out.println("Group : " + m.group());
System.out.println("Group 1 : " + m.group(1));
version = version.replaceFirst(m.group(), m.group(1));
System.out.println("Version: " + version);
}
but this results in 7.4(10.7). Any thoughts on this ?

You need to do a replacement with this pattern:
(\\([^)]+\\))|0+
and this replacement string (in Java, group references in a replacement are written with $):
$1
In other words, you need to capture everything between parentheses first and then look for zeros. Use the replaceAll method.

There is no need to call replaceFirst on the string inside the matching loop:
while (m.find()) {
    version = version.replaceFirst(m.group(), m.group(1));
You can instead use this replacement:
version = version.replaceAll("(^|\\.)0+", "$1");

If you are trying to remove leading zeroes before a nonzero digit, then you can match such runs with this pattern: "(?<!\\d)0+(?=[1-9])". That even uses a zero-length lookahead, as your tags suggest you might have wanted to do. It would be simpler to use than yours, too, because it doesn't match anything you want to keep:
Pattern p = Pattern.compile("(?<!\\d)0+(?=[1-9])");
Matcher m = p.matcher(version);
version = m.replaceAll("");
If you're only going to do this once, then you can simplify to a one-liner:
version = version.replaceAll("(?<!\\d)0+(?=[1-9])", "");
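A short sketch that wraps the one-liner in a helper and runs it on the sample from the question plus two edge cases. The class name LeadingZeroDemo and the method name stripLeadingZeros are illustrative assumptions.

```java
class LeadingZeroDemo {
    // Removes runs of zeros that are not preceded by a digit and are
    // followed by a nonzero digit; all other zeros are left untouched.
    static String stripLeadingZeros(String version) {
        return version.replaceAll("(?<!\\d)0+(?=[1-9])", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingZeros("007.004(100.007)")); // 7.4(100.7)
        System.out.println(stripLeadingZeros("0.5"));              // 0.5
        System.out.println(stripLeadingZeros("10.020"));           // 10.20
    }
}
```

Because the lookahead requires a nonzero digit, a component that consists only of zeros (such as the "0" in "0.5") is kept as it is.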

How to use multiple different patterns?

How can I check a string against multiple regex patterns instead of a single one? It works when I try one pattern, but when I combine the patterns it doesn't work.
When I run this code with just one of the patterns, I get the corresponding value (time or price) from the string, but when I combine them I get no output.
Thanks for your help.
Here is my code:
String line = "This order was places for QT 30.00$ ! OK? and time is 2:45";
String pattern = "\\d+[.,]\\d+.[$]"+"\\d:\\d\\d";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.find( )) {
    System.out.println("Found value: " + m.group(0) );
} else {
    System.out.println("NO MATCH");
}

The "+" operator does not separate patterns - it concatenates strings.
What you can do is provide a pattern that accepts characters in between the two groups.
String pattern = "(\\d+[.,]\\d+.[$]).*(\\d:\\d\\d)";
The parentheses above are optional. If you include them, you can get the matched price and time as separate strings:
if (m.find( )) {
    System.out.println("Found value: " + m.group(1) + " with time: " + m.group(2));
}
EDIT:
Just noticed your comment that you're looking for OR, not AND.
You can do that with an expression of the form X | Y:
String pattern = "\\d+[.,]\\d+.[$]|\\d:\\d\\d";
This will match either a price or a time, whichever occurs first. You can get the match with m.group(0).
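If both values are wanted rather than just the first one, find() can be called in a loop with the same alternation. The sketch below assumes that is the goal; the class name PriceOrTimeDemo and the findAll helper are illustrative assumptions, and the pattern is the one from the answer, unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceOrTimeDemo {
    // Price or time, exactly as in the alternation above.
    static final Pattern PRICE_OR_TIME =
            Pattern.compile("\\d+[.,]\\d+.[$]|\\d:\\d\\d");

    // Collects every price or time in the line, in order of appearance.
    static List<String> findAll(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PRICE_OR_TIME.matcher(line);
        while (m.find()) {
            out.add(m.group(0));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "This order was places for QT 30.00$ ! OK? and time is 2:45";
        System.out.println(findAll(line)); // [30.00$, 2:45]
    }
}
```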
