Pattern matching in file or string - java

I have been working on a program which amongst other things will search for repeating patterns within a string.
Finding and counting the matches for each pattern type is the easy part and I can sort from the highest scoring to the lowest scoring based on number of matches found.
Choosing which of the overlapping matches to keep is a bit more difficult, should i remove the leftmost or rightmost?
Lets say I keep the first match found and remove the right most overlapping one and so on. The issue arises when I move on to the next pattern type and find that it would been better to instead remove the leftmost match from the pattern type per above. This would have allowed this this pattern to fit into the space, etc..
However again when I get to the next set of patterns, it could transpire that leaving things as they were the first time would benefit,etc...
This swinging to and fro could repeat for the entire file.
My question is: are there any algorithms or techniques to calculate best fit for every single pattern while maintaining the most repeated patterns at the top of the list?
Any advice would be much appreciated ;)
Ed

Try to show an example
The only thing that you should do would be (in my opinion):
-Instead of erasing the most-left or most-right, try to save them all, and after analyzing all the matches you should decide what to do. Deleting without certainty is not a good option.

Related

Regex Pattern Catastrophic backtracking

I have the regex shown below used in one of my old Java systems which is causing backtracking issues lately.
Quite often the backtracking threads cause the CPU of the machine to hit the upper limit and it does not return back until the application is restarted.
Could any one suggest a better way to rewrite this pattern or a tool which would help me to do so?
Pattern:
^\[(([\p{N}]*\]\,\[[\p{N}]*)*|[\p{N}]*)\]$
Values working:
[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]
Catastrophic backtracking values — if anything is wrong in the values (an extra brace added at the end):
[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]]
Never use * when + is what you mean. The first thing I noticed about your regex is that almost everything is optional. Only the opening and closing square brackets are required, and I'm pretty sure you don't want to treat [] as a valid input.
One of the biggest causes of runaway backtracking is to have two or more alternatives that can match the same things. That's what you've got with the |[\p{N}]* part. The regex engine has to try every conceivable path through the string before it gives up, so all those \p{N}* constructs get into an endless tug-of-war over every group of digits.
But there's no point trying to fix those problems, because the overall structure is wrong. I think this is what you're looking for:
^\[\p{N}+\](?:,\[\p{N}+\])*$
After it consumes the first token ([1234567]), if the next thing in the string is not a comma or the end of the string, it fails immediately. If it does see a comma, it must go on to match another complete token ([89023432]), or it fails immediately.
That's probably the most important thing to remember when you're creating a regex: if it's going to fail, you want it to fail as quickly as possible. You can use features like atomic groups and possessive quantifiers toward that end, but if you get the structure of the regex right, you rarely need them. Backtracking is not inevitable.

Detecting phone numbers using sapi

This is regarding to the question in Detecting phone numbers using sapi. I used that grammar in the answer.
But it gets numbers with space how can I create grammar get numbers without space?
Short answer - it would be quite difficult to do that, as SAPI is designed to recognize words, which have spaces in between them.
However, it's pretty straightforward to annotate the grammar so that you can find the start & end position of the phone number within the reco result, at which point you can remove the spaces using string replacement.
Alternatively, it's also pretty straightforward to place the phone number in a tag, at which point you can extract the number from the rules structure. (This might be trickier from Java.)

Regex search pattern in very large file

I'd like to search pattern in very large file (f.e above 1 GB) that consists of single line.
It is not possible to load it into memory. Currently, I use BufferedReaderto read into buffers (1024 chars).
The main steps:
Read data into two buffers
Search pattern in that buffers
Increment variable if pattern was found
Copy second buffer into first
Load data into second buffers
Search pattern in both buffers.
Increment variable if pattern was found
Repeat above steps (start from 4) until EOF
That algorithm (two buffers) lets me to avoid situation, where searched piece of text is split by chunks. It works like a chram unless pattern result is smaller that two buffers length. For example I can't manage with case, when result is longer - let's say long as 3 buffers (but I've only data in two buffers, so match will fail!). What's more, I can realize such a case:
Prepare 1 GB single line file, that consits of "baaaaaaa(....)aaaaab"
Search for pattern ba*b.
The whole file match pattern!
I don't have to print the result, I've only to be able to say: "Yea, I was able to find pattern" or "No, I wasn't able to find that".
It's possible with java? I mean:
Ability to determine, whether a pattern is present in file (without loading whole line into memory, see case above
Find the way handle the case, when match result is longer than chunk.
I hope my explanation is pretty clear.
I think the solution for you would be to implement CharSequence as a wrapper over very large text files.
Why? Because building a Matcher from a Pattern takes a CharSequence as an argument.
Of course, easier said than done... But then you only have three methods to implement, so that shouldn't be too hard...
EDIT I took the plunge and I ate my own dog's food. The "worst part" is that it actually works!
It seems like you may need to break that search-pattern down into pieces, since, given your restrictions, searching for it in its entirety is failing.
Can you determine that a buffer contains the beginning of a match? If so, save that state and then search the next portion for the next part of the match. Continue until the entire search-term is found.

Java & Regex: Matching a substring that is not preceded by specific characters

This is one of those questions that has been asked and answered hundreds of times over, but I'm having a hard time adapting other solutions to my needs.
In my Java-application I have a method for censoring bad words in chat messages. It works for most of my words, but there is one particular (and popular) curse word that I can't seem to get rid of. The word is "faen" (which is simply a modern slang for "satan", in the language in question).
Using the pattern "fa+e+n" for matching multiple A's and E's actually works; however, in this language, the word for "that couch" or "that sofa" is "sofaen". I've tried a lot of different approaches, using variations of [^so] and (?!=so), but so far I haven't been able to find a way to match one and not the other.
The real goal here, is to be able to match the bad words, regardless of the number of vowels, and regardless of any non-letters in between the components of the word.
Here's a few examples of what I'm trying to do:
"String containing faen" Should match
"String containing sofaen" Should not match
"Non-letter-censored string with f-a#a-e.n" Should match
"Non-letter-censored string with sof-a#a-e.n" Should not match
Any tips to set me off in the right direction on this?
You want something like \bf[^\s]+a[^\s]+e[^\s]+n[^\s]\b. Note that this is the regular expression; if you want the Java then you need to use \\b[^\\s]+f[^\\s]+a[^\\s]+e[^\\s]+n[^\\s]\b.
Note also that this isn't perfect, but does handle the situations that you have suggested.
It's a terrible idea to begin with. You think, your users would write something like "f-aeen" to avoid your filter but would not come up with "ffaen" or "-faen" or whatever variation that you did not prepare for? This is a race you cannot win and the real loser is usability.

Combining (OR) arbitrary regular expressions

tl;dr Is there a way to OR/combine arbitrary regexes into a single regex (for matching, not capturing) in Java?
In my application I receive two lists from the user:
list of regular expressions
list of strings
and I need to output a list of the strings in (2) that were not matched by any of the regular expressions in (1).
I have the obvious naive implementation in place (iterate over all strings in (2); for each string iterate over all patterns in (1); if no pattern match the string add it to the list that will be returned) but I was wondering if it was possible to combine all patterns into a single one and let the regex compiler exploit optimization opportunities.
The obvious way to OR-combine regexes is obviously (regex1)|(regex2)|(regex3)|...|(regexN) but I'm pretty sure this is not the correct thing to do considering that I have no control over the individual regexes (e.g. they could contain all manners of back/forward references). I was therefore wondering if you can suggest a better way to combine arbitrary regexes in java.
note: it's only implied by the above, but I'll make it explicit: I'm only matching against the string - I don't need to use the output of the capturing groups.
Some regex engines (e.g. PCRE) have the construct (?|...). It's like a non-capturing group, but has the nice feature that in every alternation groups are counted from the same initial value. This would probably immediately solve your problem. So if switching the language for this task is an option for you, that should do the trick.
[edit: In fact, it will still cause problems with clashing named capturing groups. In fact, the pattern won't even compile, since group names cannot be reused.]
Otherwise you will have to manipulate the input patterns. hyde suggested renumbering the backreferences, but I think there is a simpler option: making all groups named groups. You can assure yourself that the names are unique.
So basically, for every input pattern you create a unique identifier (e.g. increment an ID). Then the trickiest part is finding capturing groups in the pattern. You won't be able to do this with a regex. You will have to parse the pattern yourself. Here are some thoughts on what to look out for if you are simply iterating through the pattern string:
Take note when you enter and leave a character class, because inside character classes parentheses are literal characters.
Maybe the trickiest part: ignore all opening parentheses that are followed by ?:, ?=, ?!, ?<=, ?<!, ?>. In addition there are the option setting parentheses: (?idmsuxU-idmsuxU) or (?idmsux-idmsux:somePatternHere) which also capture nothing (of course there could be any subset of those options and they could be in any order - the - is also optional).
Now you should be left only with opening parentheses that are either a normal capturing group or a named on: (?<name>. The easiest thing might be to treat them all the same - that is, having both a number and a name (where the name equals the number if it was not set). Then you rewrite all of those with something like (?<uniqueIdentifier-md5hashOfName> (the hyphen cannot be actually part of the name, you will just have your incremented number followed by the hash - since the hash is of fixed length there won't be any duplicates; pretty much at least). Make sure to remember which number and name the group originally had.
Whenever you encounter a backslash there are three options:
The next character is a number. You have a numbered backreference. Replace all those numbers with k<name> where name is the new group name you generated for the group.
The next characters are k<...>. Again replace this with the corresponding new name.
The next character is anything else. Skip it. That handles escaping of parentheses and escaping of backslashes at the same time.
I think Java might allow forward references. In that case you need two passes. Take care of renaming all groups first. Then change all the references.
Once you have done this on every input pattern, you can safely combine all of them with |. Any other feature than backreferences should not cause problems with this approach. At least not as long as your patterns are valid. Of course, if you have inputs a(b and c)d then you have a problem. But you will have that always if you don't check that the patterns can be compiled on their own.
I hope this gave you a pointer in the right direction.

Categories