Incremental Pattern (RegEx) matching in Java? - java

Is there a way or an efficient library that allows for incremental regular expression matching in Java?
What I mean by that is, I would like to have an OutputStream that I can send a couple bytes at a time to and that keeps track of matching the data so far against a regular expression. If a byte is received that will cause this regular expression to definitely not match, I would like the stream to tell me so. Otherwise it should keep me informed about the current best match, if any.
I realize that this is likely to be an extremely difficult and not well defined problem, since one can imagine regular expressions that can match a whole expression or any part of it or not have a decision until the stream is closed anyways. Even something as trivial as .* can match H, He, Hel, Hell, Hello, and so forth. In such a case, I would like the stream to say: "Yes, this expression could match if it was over now, and here are the groups it would return."
But if Pattern internally steps through the string it matches character by character, it might not be so hard?

Incremental matching can be nicely achieved by computing the finite state automaton corresponding to a regular expression, and performing state transitions on that while processing the characters of the input. Most lexers work this way. This approach won't work well for groups, though.
So perhaps you could make this two parts: have one matcher which figures out whether there is any match at all, or any chance of a match in the future. You can use that to give you a quick reply after every input character. Once you have a complete match, you can exucte a backtracking and grouping regular expression engine to identify your matching groups. In some cases, it might be feasible to encode the grouping stuff into the automaton as well, but I can't think of a generic way to accomplish this.

Related

Java Matcher matches() method to match the entire region against the pattern

I have a pattern (\{!(.*?)\})+ that can be used to validate an expression of format {!someExpression} one or more number of times.
I am performing
Pattern.compile("(\\{!(.*?)\\})+").matcher("{!expression1} {!expression2}").matches() to match the entire region against the pattern.
There is a space between expression1 and expression2.
Expected -> false
Actual -> true
I tried both greedy and lazy quantifiers but not able to figure out the catch here. Any help is appreciated.
Of course it matches. Your regexp says so. matches() matches the whole string, so you're doing exactly what you are asking. The point is, that regex matches the whole string. Try it in any regex tool.
Specifically, (.*?) will happily match expression1} {!expression2. Why shouldn't it? You said 'non-greedy' which doesn't do anything unless we're talking about subgroup matching; non-greediness cannot change what is being matched, it only affects, if it matches, how the groups are divided out. Non-greedy does not mean 'magically do what I want you to', however useful that might seem to be. . will match } just as well as x.
As a general rule if you're using non-greediness you're doing it wrong. It's not a universal rule; if you really know what you're doing (mostly: That you're modifying how backrefs / group matches / find() ends up spacing it out), it's fine. If you're tossing non-greediness in there as you write your regexp that's usually a sign you misunderstand what you're actually writing down.
Presumably, your intent with the non-greedy operator here is that you do not want it to also consume the } that 'ends' the {!expr} block.
In which case, just ask for that then: "Consume everything that isn't a }":
Pattern.compile("(\\{!([^}]*)\\})+").matcher("{!expression1} {!expression2}").matches()
works great.
If your intent is instead that expressions can also contain {} symbols and that this is a much more convoluted grammar system then your question cannot be answered without a full breakdown of what the grammarsystem entails. Note that many grammars are not 'regular' (that's a specific term that refers to a subset of all imaginable grammars), then it cannot be parsed out with a regular expression. That's what the 'regular' in regular expression refers to: A class of grammars. regexes can be used meaningfully on anything that fits a regular grammar. They are useless for anything that isn't, even if it seems like it could work. Thus, if there is a sizable grammar behind this {expr} syntax, it's possible you need an actual full parser for it.
As a simple example, java the language is not regular and therefore cannot meaningfully be parsed with regexes (that is: Whatever aim your regex has, I can write a valid java file that the compiler understands which your regex won't).

Finding all regular expression/s from a value

I have a variable having some URL and a file containing 100's of regular expression. How can i find which regular expression/s will hold true for that variable. I don't want to do the pattern match for each and every pattern in the file. Looking for performance efficient solution.
While, ultimately, you won't get away with a "true" performance-efficient solution, there are some simple heuristics you can utilize to help cut down on the number of patterns you need to evaluate.
For instance, try "grouping" patterns using simplified versions. Consider the two patterns
[a-z]\d[a-z]
[a-z]{3}
Any string matching both of these patterns will also match the pattern [a-z].[a-z]. If you skip the previous two patterns if the more general pattern doesn't match, you'll (likely) save on overall processing time. The more you can generalize, the more patterns you can eliminate at once. The ultimate expression of this is hierarchical, in which patterns follow a file-system-like organization of groups. While the worst-case performance of this system is worse than just going through all the patterns, the average case will likely be somewhat better as different groups of patterns are eliminated.
You're not going to get better than O(n) performance on the number of regexes, but you're likely to have savings on the coefficient of n.

Does java world has the counterpart Regexp::Optimizer in perl? [duplicate]

I wrote a Java program which can generate a sequence of symbols, like "abcdbcdefbcdbcdefg". What I need is Regex optimizer, which can result "a((bcd){2}ef){2}g".
As the input may contain unicodes, like "a\u0063\u0063\bbd", I prefer a Java version.
The reason I want to get a "shorter" expression is for saving space/memory. The sequence of symbols here could be very long.
In general, to find the "shortest" optimized regex is hard. So, here, I don't need ones that guarantee the "shortest" criteria.
I've got a nasty feeling that the problem of creating the shortest regex that matches a given input string or set of strings is going to be computationally "difficult". (There are parallels with the problem of computing Kolmogorov Complexity ...)
It is also worth noting that the optimal regex for abcdbcdefbcdbcdefg in terms of matching speed is likely to be abcdbcdefbcdbcdefg. Adding repeating groups may make the regex string shorter, but it won't make the regex faster. In fact, it is likely to be slower unless the regex engine unrolls the repeating groups.
The reason that I need this is due to the space/memory limits.
Do you have clear evidence that you need to do this?
I suspect that you won't save a worthwhile amount of space by doing this ... unless the input strings are really long. (And if they are long, then you'll get better results using a regular text compression algorithm to compress the strings.)
Regular expressions are not a substitute for compression
Don't use a regular expression to represent a string constant. Regular expressions are designed to be used to match one of many strings. That's not what you're doing.
I assume you are trying to find a small regex to encode a finite set of input strings. If so, you haven't chosen the best possible subject line.
I can't give you an existing program, but I can tell you how to approach writing one.
There is no canonical minimum regex form and determining the true minimum size regex is NP hard. Certainly your sets are finite, so this may be a simpler problem. I'll have to think about it.
But a good heuristic algorithm would be:
Construct a trivial non-deterministic finite automaton (NFA) that accepts all your strings.
Convert the NFA to a deterministic finite automaton (DFA) with the subset construction.
Minimize the DFA with the standard algorithm.
Use the construction from the proof of Kleene's theorem to get to a regex.
Note that step 3 does give you a unique minimum DFA. That would probably be the best way to encode your string sets.

Combining (OR) arbitrary regular expressions

tl;dr Is there a way to OR/combine arbitrary regexes into a single regex (for matching, not capturing) in Java?
In my application I receive two lists from the user:
list of regular expressions
list of strings
and I need to output a list of the strings in (2) that were not matched by any of the regular expressions in (1).
I have the obvious naive implementation in place (iterate over all strings in (2); for each string iterate over all patterns in (1); if no pattern match the string add it to the list that will be returned) but I was wondering if it was possible to combine all patterns into a single one and let the regex compiler exploit optimization opportunities.
The obvious way to OR-combine regexes is obviously (regex1)|(regex2)|(regex3)|...|(regexN) but I'm pretty sure this is not the correct thing to do considering that I have no control over the individual regexes (e.g. they could contain all manners of back/forward references). I was therefore wondering if you can suggest a better way to combine arbitrary regexes in java.
note: it's only implied by the above, but I'll make it explicit: I'm only matching against the string - I don't need to use the output of the capturing groups.
Some regex engines (e.g. PCRE) have the construct (?|...). It's like a non-capturing group, but has the nice feature that in every alternation groups are counted from the same initial value. This would probably immediately solve your problem. So if switching the language for this task is an option for you, that should do the trick.
[edit: In fact, it will still cause problems with clashing named capturing groups. In fact, the pattern won't even compile, since group names cannot be reused.]
Otherwise you will have to manipulate the input patterns. hyde suggested renumbering the backreferences, but I think there is a simpler option: making all groups named groups. You can assure yourself that the names are unique.
So basically, for every input pattern you create a unique identifier (e.g. increment an ID). Then the trickiest part is finding capturing groups in the pattern. You won't be able to do this with a regex. You will have to parse the pattern yourself. Here are some thoughts on what to look out for if you are simply iterating through the pattern string:
Take note when you enter and leave a character class, because inside character classes parentheses are literal characters.
Maybe the trickiest part: ignore all opening parentheses that are followed by ?:, ?=, ?!, ?<=, ?<!, ?>. In addition there are the option setting parentheses: (?idmsuxU-idmsuxU) or (?idmsux-idmsux:somePatternHere) which also capture nothing (of course there could be any subset of those options and they could be in any order - the - is also optional).
Now you should be left only with opening parentheses that are either a normal capturing group or a named on: (?<name>. The easiest thing might be to treat them all the same - that is, having both a number and a name (where the name equals the number if it was not set). Then you rewrite all of those with something like (?<uniqueIdentifier-md5hashOfName> (the hyphen cannot be actually part of the name, you will just have your incremented number followed by the hash - since the hash is of fixed length there won't be any duplicates; pretty much at least). Make sure to remember which number and name the group originally had.
Whenever you encounter a backslash there are three options:
The next character is a number. You have a numbered backreference. Replace all those numbers with k<name> where name is the new group name you generated for the group.
The next characters are k<...>. Again replace this with the corresponding new name.
The next character is anything else. Skip it. That handles escaping of parentheses and escaping of backslashes at the same time.
I think Java might allow forward references. In that case you need two passes. Take care of renaming all groups first. Then change all the references.
Once you have done this on every input pattern, you can safely combine all of them with |. Any other feature than backreferences should not cause problems with this approach. At least not as long as your patterns are valid. Of course, if you have inputs a(b and c)d then you have a problem. But you will have that always if you don't check that the patterns can be compiled on their own.
I hope this gave you a pointer in the right direction.

Can I declare preference over matching terms in a regular expression?

Is there a way to declare preference in a regular expression?
For example assume I have the following terms to search:
cat eats mouse
And I have the following text:
I saw yesterday a big mouse in our house. Why? We have a cat!A cat eats mouse.Right?
I want a regular expression that matches the section specifically the section A cat eats mouse.
I.e. although the terms exist in other parts, that sentence is a better match i.e. it is prefered.
But if this part was missing it would have matched the I saw yesterday a big mouse in our house. Or We have a cat.
Can this be expressed in a regular expression?
No, regex isn't the right tool for this.
You can use a regex (though a plain substring search might be more appropriate) to find each of the words you're looking for, and assign weights to the matches (based number of occurrence of each term, appearance of all terms, relative order of the terms...) outside the regex.
But your end goal is too fuzzy, not regular enough - you'll need more than just regular expressions.
I'm not sure what kind of pattern you're looking to apply, but note that when using the vertical bar to write alternatives, the first one that matches will succeed. This means that if you have something like (<pattern1>|<pattern2>) if both of them match, the preference will be given to <pattern1> since that's the first one that will be checked.
Regular expressions are basically for matching words of regular languages, in most programming contexts, parts of the matched word are then extracted and used in the program. However, your matching pattern is context-sensitive (the matcher needs to both remember what has been before and what comes next) and therefore not in the expression power of regular expressions.
An approach to your problem could be that you use a sentence tokenizer to extract sentences and then score each sentence based on the words withing and, eventually, their constellation. Your problem seems highly related to the problem of automated text summarization. So you could look for information on this.

Categories