Using Java regex to validate date from a long string


I'm trying to write a Java routine that can parse out dates from a long string, i.e. given the string:
"Please have the report to me by 6/15, because the shipment comes in on 6/18"
The regex would find both 6/15 and 6/18. I've looked on Stack Overflow and elsewhere, and most examples of a date regex simply verify whether a given string is a date, rather than finding dates within a larger amount of text. Ideally, I'd want a regex that could identify all of the main ways people write dates numerically, i.e. 6/15, 6/15/12, 06/15/12, 15/6/12, 15/06/12, although perhaps it would be best to separate these into different regexes for the purpose of cla. I'm new to regexes (I just started learning about them two days ago) and they are still a bit cryptic to me, so I'd appreciate a detailed explanation of any regex suggestions.

If you're not bothering with range checking, this suffices:
(\d{1,2})/(\d{1,2})(?:/(\d{4}|\d{2}))?
To check that you can't do 2/29/2001 but can do 2/29/2000, you really want to do it after the regexp has done its job, or you're going to end up in an asylum.
EDIT: Better yet, for isolating the century, and protecting against things like 2/193 (prompted by Alex's question, even though it's a separate issue):
\b(\d{1,2})/(\d{1,2})(?:/(\d{2})?(\d{2}))?\b
You'd get 4 captures in each match: [month, day, century, year], where century and year could be empty.
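As a minimal sketch of how the pattern above could be applied to the sentence from the question, the following loops over Matcher.find() and collects each match. The class and method names (DateFinder, findDates) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateFinder {
    // Pattern from the answer above: month/day with an optional 2- or 4-digit year.
    private static final Pattern DATE =
            Pattern.compile("\\b(\\d{1,2})/(\\d{1,2})(?:/(\\d{2})?(\\d{2}))?\\b");

    // Returns every date-like substring in the order it appears in the text.
    static List<String> findDates(String text) {
        List<String> dates = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            dates.add(m.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        String text = "Please have the report to me by 6/15, because the shipment comes in on 6/18";
        System.out.println(findDates(text)); // [6/15, 6/18]
    }
}
```

The word boundaries are what reject input such as 2/193: no split of the digits leaves a boundary after the day.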

\d{1,2}/\d{1,2}(?:/(?:\d{2}){1,2})?
Here's the breakdown:
\d{1,2} matches 1 or 2 digits
/ followed by a /
\d{1,2} followed by 1 or 2 more digits
(?:/(?:\d{2}){1,2})? followed by an optional slash and a 2- or 4-digit year
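From the matches, you'll probably want to parse them with a Java date parser (for example SimpleDateFormat or java.time) instead of trying to put all the validation rules in the regex.
You may want to protect against fractions as well, such as 1/4th.
This can be done by appending a negative lookahead to your regex: (?!th|rd|nd) which causes the regex to not match if followed by th, rd, or nd. Note that the day digits can backtrack, so 1/25th would still match as 1/2 unless the lookahead also rejects a following digit, e.g. (?!\d|th|rd|nd).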
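To illustrate doing the range checks after the regex has matched, here is a rough sketch that hands the captured numbers to java.time.LocalDate, which rejects impossible dates such as 2/29/2001. It uses the capturing pattern from the earlier answer. The month-first order, the defaultYear parameter for dates without a year, and the choice of 20 as the century for two-digit years are all assumptions made for this example, not something the answers specify.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateValidator {
    private static final Pattern DATE =
            Pattern.compile("\\b(\\d{1,2})/(\\d{1,2})(?:/(\\d{2})?(\\d{2}))?\\b");

    // Assumptions: month comes first, a missing year means defaultYear,
    // and a missing century means 2000-2099.
    static Optional<LocalDate> toDate(String candidate, int defaultYear) {
        Matcher m = DATE.matcher(candidate);
        if (!m.matches()) {
            return Optional.empty();
        }
        int month = Integer.parseInt(m.group(1));
        int day = Integer.parseInt(m.group(2));
        int year = defaultYear;
        if (m.group(4) != null) {
            int century = m.group(3) != null ? Integer.parseInt(m.group(3)) : 20;
            year = century * 100 + Integer.parseInt(m.group(4));
        }
        try {
            // LocalDate.of throws DateTimeException for month 13, Feb 30, and so on.
            return Optional.of(LocalDate.of(year, month, day));
        } catch (DateTimeException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toDate("2/29/2000", 2012)); // Optional[2000-02-29]
        System.out.println(toDate("2/29/2001", 2012)); // Optional.empty
    }
}
```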

What exactly is your question? You should read some guide about regex first.
You need a method that returns every match in the String like this:
p is the regex, text is your text.
private LinkedList<String> matches(String p, String text) {
    LinkedList<String> results = new LinkedList<String>();
    Pattern pattern = Pattern.compile(p);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        results.add(matcher.group());
    }
    return results;
}
You can separate each date-pattern with |
If you put a part of your regex into parentheses (...), this part is treated as a "group".
So you can extract single numbers out of the matching string (if you want to).

Related

How to find a last occurrence of set of characters in string using regex in java?

I need to find the last index of any of a set of characters in a string. Consider the set of characters x, y, z and the string Vereador Luiz Pauly Home; then I need the index 18.
To find the index, I created a pattern with the DOTALL flag and a greedy quantifier: (?s).*(x|y|z). When the pattern is applied to that (multiline) string, I can read the index from the start of the group. The code:
int findIndex(String str){
    int index = -1;
    Pattern p = Pattern.compile("(?s).*(x|y|z)");
    Matcher m = p.matcher(str);
    if(m.find()){
        index = m.start(1);
    }
    return index;
}
As expected, it returns the correct value if there is a match.
But if there is no match, it takes far too long (17 minutes for 600000 characters) because of the greedy match.
I tried other quantifiers but couldn't get the desired output. Can anyone suggest a better regex?
PS: I could also traverse the content from the end to find the index, but I hope there is a better way in regex that can do the job quickly.

There are a few ways to solve the problem, and the best one depends on the size of the input and the complexity of the pattern:
1. Reverse the input string and possibly the pattern; this might work for simple patterns. Unfortunately, java.util.regex cannot match a pattern from right to left.
2. Instead of using a greedy quantifier, simply match the pattern and loop over Matcher.find() until the last occurrence is found.
3. Use a different regex engine with better performance, e.g. RE2/J: linear time regular expression matching in Java.
If option 2 is not efficient enough for your case, I'd suggest trying RE2/J:
Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b, the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.
If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.
In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.
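The second option (looping over Matcher.find() and keeping the last hit) might look roughly like the following sketch. The class and method names are illustrative only; the pattern [xyz] is equivalent to the question's (x|y|z) for single characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastMatchByLoop {
    private static final Pattern XYZ = Pattern.compile("[xyz]");

    // Scans left to right and remembers the start of the most recent match.
    static int lastIndex(String str) {
        Matcher m = XYZ.matcher(str);
        int index = -1;
        while (m.find()) {
            index = m.start();
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(lastIndex("Vereador Luiz Pauly Home")); // 18
        System.out.println(lastIndex("no match here"));            // -1
    }
}
```

Because the pattern contains no unbounded quantifier, a string without any match is rejected after a single pass.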

Performance issues with the (?s).*(x|y|z) regex come from the fact that .* is the first subpattern: it grabs the whole string first, and then backtracking occurs to find x, y or z. If there is no match, or the match is at the start of the string, and the string is very large, this might take a really long time.
The ([xyz])(?=[^xyz]*$) pattern seems a little bit better: it captures x, y or z and asserts there is no other x, y or z up to the end of the string, but it also is somewhat resource-consuming due to each lookahead check after a match is found.
The fastest regex to get your job done is
^(?:[^xyz]*+([xyz]))+
It matches
^ - start of string
(?:[^xyz]*+([xyz]))+ - 1 or more repetitions of
[^xyz]*+ - any 0 or more chars other than x, y and z matched possessively (no backtracking into the pattern is allowed)
([xyz]) - Group 1: x, y or z.
The Group 1 value and data will belong to the last iteration of the repeated group (as all the preceding data is re-written with each subsequent iteration).
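A small sketch of how the possessive pattern above could be used from Java, reading the index from the start of Group 1. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastMatchPossessive {
    // Pattern from the answer above; Group 1 ends up holding the last x, y or z.
    private static final Pattern LAST = Pattern.compile("^(?:[^xyz]*+([xyz]))+");

    static int lastIndex(String str) {
        Matcher m = LAST.matcher(str);
        // start(1) is the position captured by the final iteration of the group.
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        System.out.println(lastIndex("Vereador Luiz Pauly Home")); // 18
        System.out.println(lastIndex("no match here"));            // -1
    }
}
```

No DOTALL flag is needed because a negated character class such as [^xyz] already matches line breaks. Repeated groups in java.util.regex can use a lot of stack on very long inputs with many matches, so it is worth testing this on realistic data sizes.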

StringBuilder has a reverse() method and is also a CharSequence, so you can search the reversed text directly.
Pattern p = Pattern.compile("[xyz]");
StringBuilder sb = new StringBuilder(str).reverse();
Matcher m = p.matcher(sb);
return m.find() ? sb.length() - m.end() : -1;
Unfortunately reversal is costly.
A solution without regex is probably faster.
(BTW surrogate pairs are handled correctly by the reversal.)
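For comparison, a regex-free version might be as simple as a backward scan, sketched below. The helper name lastIndexOfAny and its chars parameter are made up for this example; it treats the input as plain char values.

```java
class LastIndexScan {
    // Walks from the end and returns the first position whose char is in 'chars'.
    static int lastIndexOfAny(String str, String chars) {
        for (int i = str.length() - 1; i >= 0; i--) {
            if (chars.indexOf(str.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(lastIndexOfAny("Vereador Luiz Pauly Home", "xyz")); // 18
        System.out.println(lastIndexOfAny("no match here", "xyz"));            // -1
    }
}
```

This needs no copy of the input and stops as soon as the last occurrence is found.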

Java Pattern What is the issue with these pattern match?

I would like a Java pattern that matches a series of non-whitespace characters, optionally followed by a series of whitespace characters, then followed by a pair of parentheses containing anything, using this code:
Pattern p1 = Pattern.compile("[^\\s+][\\s*]\\({1}[.*]\\){1}");
however, when I tried to match it with "a (a)", false is returned.
Possibly similar problems:
Two websites separated by whitespace:
Pattern p4 = Pattern.compile("([^\\s+]([\\.]{1}[^\\s+])+)[\\s+]([^\\s+]([\\.]{1}[^\\s+])+)");
Two strings of non-whitespace characters separated by one of a list of punctuation marks or words present in the code below (e.g. and, or, aka...) (it could start with the list of words).
Pattern p2 = Pattern.compile(
"([^\\s+][\\s+])?([and|or|aka|&|Related to|moved from|now|formerly|and by the same host|and any address starting with]{1}[\\s+][^\\s+])+");
Pattern p3 = Pattern.compile("[^\\s]+[\\s*][,|&|;|\\s+/|/\\s+]{1}[\\s*][^\\s+]");

I think reading the docs on patterns in Java might be helpful.
The particular issue is that you put + and * in the wrong place, but I think the underlying reason is that you don't understand what [something] means. The following code
Pattern p1 = Pattern.compile("[^\\s]+[\\s]*\\({1}.*\\){1}");
//Pattern p1 = Pattern.compile("[^\\s]+[\\s]*\\(.*\\)"); //simplified same pattern
String t = "a (a)";
Matcher matcher = p1.matcher(t);
System.out.println(matcher.matches());
prints true.

[^\\s]+[\\s]*\\(.*?\\)
Will do what you want. Move the asterisk and plus sign outside the character class brackets. Both instances of {1} do nothing: with no other quantifier, a token is matched exactly once. Finally, [.*] literally means "permit one of these two characters", a . or a *.
[test] means one of t, e, or s. The second t is irrelevant. Most characters inside character classes mean their literal counterpart, but the exceptions involve a lot more explaining than should be done in an S/O answer.
Note that while this will succeed for, say, a (b), it will give unexpected results if there are two occurrences to match in the same sentence, and it is generally just a messy expression.
For a realistic expression, you need to provide realistic sample data.
An exceptional resource, after learning the basics, is a realtime testing environment such as http://regex101.com. With syntax highlighting, match highlighting, match breakdown, and tooltips on mouseover of tokens, it's a great way to take the second step. While it only supports a few (commonly used) flavors, most mature programming/scripting languages share the same basic/intermediate capabilities in regex.
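To make the character-class point from this answer concrete, the following sketch compares a quantifier inside brackets with one outside. The helper name check is arbitrary; it simply wraps Pattern.matches, which requires the whole input to match.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True only if the entire input matches the regex.
    static boolean check(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Inside brackets, '.' and '*' are literals: [.*] is exactly one '.' or '*'.
        System.out.println(check("[.*]", "*"));   // true
        System.out.println(check("[.*]", "abc")); // false
        // Outside brackets, .* means "any character, zero or more times".
        System.out.println(check(".*", "abc"));   // true
        // The pattern from the question versus the corrected one.
        System.out.println(check("[^\\s+][\\s*]\\({1}[.*]\\){1}", "a (a)")); // false
        System.out.println(check("[^\\s]+[\\s]*\\(.*?\\)", "a (a)"));        // true
    }
}
```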

I need a regex command that isolates all numbers not adjacent to a caret (^)

I am having a lot of trouble figuring out a regex for this, and I can't seem to find the right combination to fit what I want.
Example:
Input: 1x^3+5x^2+6x+2
Output: 1 5 6 2
I need to isolate those values, as they are the coefficients of my polynomial. The input is a String so I figured the best way to do this was by using the .split() function with a custom Regex command.

You can use this regular expression:
(?<!\^)\d+(?!\^)
This uses a negative lookbehind and a negative lookahead to skip digits adjacent to ^.
Since you want to extract coefficients, it finds one or more digits. Modify the middle part if needed.
You can use it this way in Java, for example:
Matcher m = Pattern.compile("(?<!\\^)\\d+(?!\\^)").matcher("1x^3+5x^2+6x+2");
while (m.find()) {
    System.out.println("Coefficient: " + m.group());
}
EDIT:
If you also want to detect negative coefficients, you can check for an optional - before digits:
(?<!\^)-?\d+(?!\^)
Keep in mind that as you try to capture more complicated patterns, regular expressions become less suitable as you may get lost in a number of cases to cover.
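One such case is a multi-digit exponent: with the pattern above, the second digit of x^12 is not directly preceded by ^, so it is reported as a coefficient. The sketch below shows this and a possible variant whose lookbehind also rejects a preceding digit. That stricter pattern is a suggestion for illustration and is not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Coefficients {
    static final String ORIGINAL = "(?<!\\^)\\d+(?!\\^)";
    // Assumed variant: the digit run must not follow '^' or another digit.
    static final String STRICTER = "(?<![\\^\\d])\\d+";

    static List<String> extract(String regex, String polynomial) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(polynomial);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract(ORIGINAL, "1x^3+5x^2+6x+2")); // [1, 5, 6, 2]
        System.out.println(extract(ORIGINAL, "3x^12+7"));        // [3, 2, 7]
        System.out.println(extract(STRICTER, "3x^12+7"));        // [3, 7]
    }
}
```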

Regex which matches a string containing at least the specified characters

I have a huge dictionary which I'm trying to look through using a regex. What I would like to do is find all the words in the dictionary which contain at least one occurrence of each character I provide, in no particular order.
Right now I can find words which only contain the specified characters but like I said that is not exactly what I want.
Example:
I want at least one occurrence of each of the following characters {b, a, d}
astring.matches(regex)
I would expect words like:
badder,
baddest,
baffled
Notice they all contain at least one occurrence of each character, but in no particular order, and other characters are present in the strings.
Anyone know how to do this? Other suggestions are also welcome!

You need a series of look-aheads:
^(?=.*b)(?=.*a)(?=.*d).*
which is a pain to construct. However, you can ease the pain by using regex to build it:
String regex = "^" + "bad".replaceAll(".", "(?=.*$0)") + ".*";
If you run this check repeatedly, precompile the pattern instead of calling String.matches(), because every call to String.matches() compiles the regex again (there is no caching):
// do this once
Pattern pattern = Pattern.compile(regex);
// reuse the pattern many times
if (pattern.matcher(input).matches())

You can use a lookahead to do this if it's available
(?=.*b)(?=.*a)(?=.*d)
However this is quite inefficient. Any reason you can't use multiple String.indexOf checks?
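The indexOf alternative mentioned above could be sketched as follows, next to a lookahead pattern built in a loop for comparison. The names containsAll and buildLookaheads are invented for this example, and Pattern.quote is added only so that regex metacharacters in the required set are treated literally.

```java
import java.util.regex.Pattern;

class ContainsAll {
    // Regex-free check: every required character must occur somewhere in the word.
    static boolean containsAll(String word, String required) {
        for (int i = 0; i < required.length(); i++) {
            if (word.indexOf(required.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    // Builds ^(?=.*b)(?=.*a)(?=.*d).* style patterns, quoting each character.
    static Pattern buildLookaheads(String required) {
        StringBuilder sb = new StringBuilder("^");
        for (char c : required.toCharArray()) {
            sb.append("(?=.*").append(Pattern.quote(String.valueOf(c))).append(")");
        }
        return Pattern.compile(sb.append(".*").toString());
    }

    public static void main(String[] args) {
        Pattern p = buildLookaheads("bad");
        for (String word : new String[] {"badder", "baffled", "bed"}) {
            System.out.println(word + ": " + containsAll(word, "bad")
                    + " / " + p.matcher(word).matches());
        }
    }
}
```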

Is this Regex incorrect? No matches found

I'm trying to parse through a string formatted like this, except with more values:
Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value
The Regex
((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))
In the actual string, there are about double the amount of key/values, but I'm keeping it short for brevity. I have them in parentheses so I can call them in groups. The keys I have stored as Constants, and they will always be the same. The problem is, it never finds a match which doesn't make sense (unless the Regex is wrong)

Judging by your comment above, it sounds like you're creating the Pattern and Matcher objects and associating the Matcher with the target string, but you aren't actually applying the regex. That's a very common mistake. Here's the full sequence:
String regex = "Key1=(.*),Key2=(.*)"; // etc.
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(targetString);
// Now you have to apply the regex:
if (m.find())
{
String value1 = m.group(1);
String value2 = m.group(2);
// etc.
}
Not only do you have to call find() or matches() (or lookingAt(), but nobody ever uses that one), you should always call it in an if or while statement--that is, you should make sure the regex actually worked before you call any methods like group() that require the Matcher to be in a "matched" state.
Also notice the absence of most of your parentheses. They weren't necessary, and leaving them out makes it easier to (1) read the regex and (2) keep track of the group numbers.

Looks like you'd do better to do:
String[] pairs = data.split(",");
Then parse the key/value pairs one at a time

Your regex is working for me...
If you are always getting an IllegalStateException, I would say that you are trying to do something like:
matcher.group(1);
without having invoked the find() method.
You need to call that method before any attempt to fetch a group (or you will be in an illegal state to call the group() method)
Give this a try:
String test = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
Pattern pattern = Pattern.compile("((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))");
Matcher matcher = pattern.matcher(test);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}

It's not wrong per se, but it requires a lot of backtracking which might cause the regular expression engine to bail. I would try a split as suggested elsewhere, but if you really need to use a regular expression, try making it non-greedy.
((Key1)=(.*?)),((Key2)=(.*?)),((Key3)=(.*?)),((Key4)=(.*?)),((Key5)=(.*?)),((Key6)=(.*?)),((Key7)=(.*?))
To understand why it requires so much backtracking, understand that for
Key1=(.*),Key2=(.*)
applied to
Key1=x,Key2=y
Java's regular expression engine matches the first (.*) to x,Key2=y and then tries stripping characters off the right until it can get a match for the rest of the regular expression: ,Key2=(.*). It effectively ends up asking,
Does "" match ,Key2=(.*), no so try
Does "y" match ,Key2=(.*), no so try
Does "=y" match ,Key2=(.*), no so try
Does "2=y" match ,Key2=(.*), no so try
Does "y2=y" match ,Key2=(.*), no so try
Does "ey2=y" match ,Key2=(.*), no so try
Does "Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes so the first .* is "x" and the second is "y".
EDIT:
In Java, the non-greedy quantifier changes things so that it starts off trying to match nothing and then builds from there.
Does "x,Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes.
So when you've got 7 keys it doesn't need to unmatch 6 of them which involves unmatching 5 which involves unmatching 4, .... It can do its job in one forward pass over the input.
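A quick way to compare the two quantifiers on the two-key example is sketched below. The helper name groups and its wholeInput flag are illustrative. One detail worth noticing is that a trailing reluctant group captures nothing when used with find(), because the match is allowed to stop early; matches() forces it to extend to the end of the input.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns groups 1 and 2, using matches() or find() depending on wholeInput.
    static String[] groups(String regex, String input, boolean wholeInput) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean ok = wholeInput ? m.matches() : m.find();
        if (!ok) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String input = "Key1=x,Key2=y";
        System.out.println(Arrays.toString(groups("Key1=(.*),Key2=(.*)", input, true)));    // [x, y]
        System.out.println(Arrays.toString(groups("Key1=(.*?),Key2=(.*?)", input, true)));  // [x, y]
        // With find(), the last reluctant group is satisfied by the empty string.
        System.out.println(Arrays.toString(groups("Key1=(.*?),Key2=(.*?)", input, false))); // [x, ]
    }
}
```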

I'm not going to say that there's no regex that will work for this, but it's most likely more complicated to write (and more importantly, read, for the next person that has to deal with the code) than it's worth. The closest I'm able to get with a regex is if you append a terminal comma to the string you're matching, i.e, instead of:
"Key1=value1,Key2=value2"
you would append a comma so it's:
"Key1=value1,Key2=value2,"
Then, the regex that got me the closest is: "(?:(\\w+?)=(\\S+?),)?+"...but this doesn't quite work if the values have commas.
You can try to continue tweaking that regex from there, but the problem I found is that there's a conflict in the behavior between greedy and reluctant quantifiers. You'd have to specify a capturing group for the value that is greedy with respect to commas up to the last comma prior to a non-capturing group comprised of word characters followed by the equal sign (the next key)...and this last non-capturing group would have to be optional in case you're matching the last value in the sequence, and maybe itself reluctant. Complicated.
Instead, my advice is just to split the string on "=". You can get away with this because presumably the values aren't allowed to contain the equal sign character.
Now you'll have a bunch of substrings, each of which consists of the characters that make up a value, then the last comma in the substring, followed by a key. You can easily find the last comma in each substring using String.lastIndexOf(',').
Treat the first and last substrings specially (because the first one does not have a prepended value and the last one has no appended key) and you should be in business.
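The split-on-"=" approach described above might be sketched like this. The class name, the use of a LinkedHashMap, and the exception for malformed input are choices made for the example; it assumes keys contain no commas and that neither keys nor values contain '='.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EqualsSplitParser {
    static Map<String, String> parse(String data) {
        // Limit -1 keeps a trailing empty string, so an empty last value survives.
        String[] parts = data.split("=", -1);
        Map<String, String> result = new LinkedHashMap<>();
        String key = parts[0];                      // first piece is only a key
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (i == parts.length - 1) {            // last piece is only a value
                result.put(key, part);
            } else {
                int comma = part.lastIndexOf(',');  // separates value from next key
                if (comma < 0) {
                    throw new IllegalArgumentException("Missing comma in: " + part);
                }
                result.put(key, part.substring(0, comma));
                key = part.substring(comma + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Key1=a,b,Key2=c,Key3=d,e")); // {Key1=a,b, Key2=c, Key3=d,e}
    }
}
```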

If you know you always have 7, the hack of least resistance is
^Key1=(.+),Key2=(.+),Key3=(.+),Key4=(.+),Key5=(.+),Key6=(.+),Key7=(.+)$
Try it out at http://www.fileformat.info/tool/regex.htm
I'm pretty sure there is a better way to parse this that goes through .find() rather than .matches(), which I think I would recommend because it lets you move down the string one key=value pair at a time. It moves you into the whole "greedy" evaluation discussion.
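A possible shape for the find()-based approach is sketched below: a small pattern for a single key=value pair, applied repeatedly. The pattern (\w+)=([^,]*) is an assumption that keys are word characters and values contain no commas; the class name is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PairWalker {
    // One key=value pair; the value runs up to the next comma or the end of input.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=([^,]*)");

    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(data);
        while (m.find()) {                 // each call advances to the next pair
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
        System.out.println(parse(data).size()); // 7
    }
}
```

Because [^,]* cannot run past a comma, there is nothing for the engine to backtrack over, and the same pattern works for any number of keys.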

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
The simplest solution is the most robust.
final String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
final String[] pairs = data.split(",");
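for (final String pair : pairs)
{
    // Limit 2 keeps any '=' inside the value intact
    final String[] keyValue = pair.split("=", 2);
    final String key = keyValue[0];
    // A pair without '=' has no value
    final String value = keyValue.length > 1 ? keyValue[1] : "";
}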
