Java & Regex: Matching a substring that is not preceded by specific characters

Java & Regex: Matching a substring that is not preceded by specific characters - java

This is one of those questions that has been asked and answered hundreds of times over, but I'm having a hard time adapting other solutions to my needs.
In my Java-application I have a method for censoring bad words in chat messages. It works for most of my words, but there is one particular (and popular) curse word that I can't seem to get rid of. The word is "faen" (which is simply a modern slang for "satan", in the language in question).
Using the pattern "fa+e+n" for matching multiple A's and E's actually works; however, in this language, the word for "that couch" or "that sofa" is "sofaen". I've tried a lot of different approaches, using variations of [^so] and (?!=so), but so far I haven't been able to find a way to match one and not the other.
The real goal here, is to be able to match the bad words, regardless of the number of vowels, and regardless of any non-letters in between the components of the word.
Here's a few examples of what I'm trying to do:
"String containing faen" Should match
"String containing sofaen" Should not match
"Non-letter-censored string with f-a#a-e.n" Should match
"Non-letter-censored string with sof-a#a-e.n" Should not match
Any tips to set me off in the right direction on this?

You want something like \bf[^\s]+a[^\s]+e[^\s]+n[^\s]\b. Note that this is the regular expression; if you want the Java then you need to use \\b[^\\s]+f[^\\s]+a[^\\s]+e[^\\s]+n[^\\s]\b.
Note also that this isn't perfect, but does handle the situations that you have suggested.

It's a terrible idea to begin with. You think, your users would write something like "f-aeen" to avoid your filter but would not come up with "ffaen" or "-faen" or whatever variation that you did not prepare for? This is a race you cannot win and the real loser is usability.

Related

How to select strings with the most keywords matches?

I'm trying to select top 3 strings which contains the most matches..
I'll explain it like this:
assume that we have the following keywords: "pc, programming, php, java"
and the following sentences:
a[0]="what is java??"<br>
a[1]="I love playing and programming on pc"<br>
a[2]="I'm good at programming php and java"<br>
a[3]="I'm programming php and java on my pc"<br>
so only the last 3 strings must be selected cause they are the top 3 strings containing the most matches.
How to do this in java???

If your dataset is small and you only care about exact matches, you could do something like the following:
Loop over each of your sentences performing an indexOf check for each keyword. If this returns something that isn't -1 then increment a counter for that sentence. Repeat for each keyword. At the end find the 3 sentences that have the highest counter.
This approach will have all kinds of issues though including things such as:
Case insensitivity
Tags matching partial words, e.g. "java" matching "javascript"
Ideally you would use a full text engine like Lucene/Solr/ElasticSearch and let that do all the heavy lifting for you

Arguably the easiest method would be to use Regex, an expression based system which searches for patterns within strings.
Pick up a website which teaches Regex. I suggest this one for starters.
http://regexone.com/
Afterwards, familiarize yourself with Java Regex. I suggest looking into capture groups.
I will not give you code to do this, because I believe there are many online examples you can look at, and it is in your best interest to learn how to do this by yourself.

with regex, is using both "is" and "is not" range definitons within the same range possible?

Note: I'm using a 3rd party app that uses regex for searches which has its own flavor but almost always works like java's flavor of regex. Of course this may not matter.
After searching for many different ways of this same question (phrased many ways), I did not see any tutorials, examples, or even mentions of whether it is possible to use both an "is" (positive?) and "is not" (negative?) definition within the same range.
I can't run a test the example right now in the app to see if my ideas work, because the amount of data being searched is massive and will screw up the matches it has already gathered. I'm only asking because of this.
Here are examples of what I thought might work but caused tester to act weird:
[\w^\s<>.!?]{2}
[\w|^\s<>.!?]{2}
I would rather have it work the way I think the first one would work (any digit, lower case, or upper case character, or other normal character that is not a space, >, <, period, !, or ?) rather then the second which only has an or operator.
The regex testers I used gave me different funky results which is what is confusing me.
Also note: I'm using this within a capture group which is followed by a catch everything match which I may or may not be using properly. So if you'd like to include how to follow what I'm attempting with how to properly do that, feel free. I AM MAINLY JUST CURIOUS TO IF THIS WAS POSSIBLE OR NOT, OR IF IT WAS A IMPROPER METHOD.

Why do you need the \w at all?
[^\s<>.!?]{2}
This already matches all alphanumeric characters since they are neither space nor any of the punctuation characters you mentioned.
In general, you can substract character classes to some degree, for example, to match alphanumerics exluding digits, you can do
[^\W\d]
because [^\W] matches the same as \w, and \d is substracted from that because it's in a negated character class.
Edit:
Some regex engines (like XPath, .NET and JGSoft) allow flexible character class substraction like this:
[a-z-[e-g]]
to match any character from the range [a-z], excluding e, f and g. But Java does not have this feature.

Another possibility is to use two ranges and combine them; e.g.
([\w]|[^\s<>.!?]){2}
However, this does bring up the question of what you are actually trying to express here. Because this example (as I've rewritten it) doesn't make a lot of sense.
What it says is "a word character, or any character that is not whitespace or certain punctuation". But the class of characters that are not "whitespace or certain punctuation" ALREADY includes all of the word characters. So, unless you mean something different, the \w is redundant.

From your question, it looks like a no-space regex would match your needs, you can achieve that with:
[\S]{2}

Java string: classes or packages with advanced functions?

I am doing string manipulations and I need more advanced functions than the original ones provided in Java.
For example, I'd like to return a substring between the (n-1)th and nth occurrence of a character in a string.
My question is, are there classes already written by users which perform this function, and many others for string manipulations? Or should I dig on stackoverflow for each particular function I need?

Check out the Apache Commons class StringUtils, it has plenty of interesting ways to work with Strings.
http://commons.apache.org/lang/api-2.3/index.html?org/apache/commons/lang/StringUtils.html

Have you looked at the regular expression API? That's usually your best bet for doing complex things with strings:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
Along the lines of what you're looking to do, you can traverse the string against a pattern (in your case a single character) and match everything in the string up to but not including the next instance of the character as what is called a capture group.
It's been a while since I've written a regex, but if you were looking for the character A for instance, then I think you could use the regex A([^A]*) and keep matching that string. The stuff in the parenthesis is a capturing group, which I reference below. To match it, you'd use the matcher method on pattern:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#matcher%28java.lang.CharSequence%29
On the Matcher instance, you'd make sure that matches is true, and then keep calling find() and group(1) as needed, where group(1) would get you what is in between the parentheses. You could use a counter in your looping to make sure you get the n-1 instance of the letter.
Lastly, Pattern provides flags you can pass in to indicate things like case insensitivity, which you may need.
If I've made some mistakes here, then someone please correct me. Like I said, I don't write regexes every day, so I'm sure I'm a little bit off.

Splitting text to sentences and sentence to words: BreakIterator vs regular expressions

I accidentally answered a question where the original problem involved splitting sentence to separate words.
And the author suggested to use BreakIterator to tokenize input strings and some people liked this idea.
I just don't get that madness: how 25 lines of complicated code can be better than a simple one-liner with regexp?
Please, explain me the pros of using BreakIterator and the real cases when it should be used.
If it's really so cool and proper then I wonder: do you really use the approach with BreakIterator in your projects?

From looking at the code posted at that answer, it looks like BreakIterator takes into consideration the language and locale of the text. Getting that level of support via regex will surely be a considerable pain. Perhaps that is the main reason it is preferred over a simple regex?

The BreakIterator gives some nice explicit control and iterates cleanly in a nested way over each sentence and word. I'm not familiar with exactly what specifying the locale does for you, but I'm sure its quite helpful sometimes as well.
It didn't strike me as complicate at all. Just set up one iterator for the sentence level, another for the word level, nest the word one inside the second one.
If the problem changed into something different the solution you had on the other question might've just been out the window. However, that pattern of iterating through sentences and words can do a lot.
Find the sentence where any word occurs the most repeated times. Output it along with that word
Find the word used most times throughout the whole string.
Find all words that occur in every sentence
Find all words that occur a prime number of times in 2 or more sentences
The list goes on...

Java Regex, capturing groups with comma separated values

InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .
ExpectedOutput:
bruises
wounds
marks
dislocations
Injuries
Generalized Pattern Tried:
".[\s]?(\w+?)"+ // bruises.
"(?:(\s)?,(\s)?(\w+?))*"+ // wounds marks dislocations
"[\s]?(?:or|and) other (\w+)."; // Injuries
The pattern should be able to match other input strings like: A soldier may have bruiser or other injuries that hurt him.
On trying the generalized pattern above, the output is:
bruises
dislocations
Injuries
There is something wrong with the capturing group for "(?:(\s)?,(\s)?(\w+?))*". The capturing group has one more occurences.. but it returns only "dislocations". "marks" and "dislocation: are devoured.
Could you please suggest what should be the right pattern, and where is the mistake?
This question comes closest to this question, but that solution didn't help.
Thanks.

When the capture group is annotated with a quantifier [ie: (foo)*] then you will only get the last match. If you wanted to get all of them then you need to quantifier inside the capture and then you will have to manually parse out the values. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons... even if you weren't ultimately doing NLP.
How to fix: (?:(\s)?,(\s)?(\w+?))*
Well, the quantifier basically covers the whole regex in that case and you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words then that's something like: \w+(?:\s*,\s*\w+)* Then don't bother with capture groups and just split the whole match.
And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times but you have a whole industry of science-guys to draw from: http://gate.ac.uk/

Regex in not suited for (natural) language processing. With regex, you can only match well defined patterns. You should really, really abandon the idea of doing this with regex.
You may want to start a new question where you specify what programming language you're using to perform this task and ask for pointers there.
EDIT
PSpeed posted a promising link to a 3rd party library, Gate, that's able to do many language processing tasks. And it's written in Java. I have not used it myself, but looking at the people/institutions working on it, it seems pretty solid.

The pattern that works is: \w+(?:\s*,\s*\w+)* and then manually separate CSV
There is no other method to do this with Java Regex.
Ideally, Java regex is not suitable for NLP. A useful tool for text mining is: gate.ac.uk
Thanks to Bart K. , and PSpeed.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java & Regex: Matching a substring that is not preceded by specific characters - java

You want something like \bf[^\s]+a[^\s]+e[^\s]+n[^\s]\b. Note that this is the regular expression; if you want the Java then you need to use \\b[^\\s]+f[^\\s]+a[^\\s]+e[^\\s]+n[^\\s]\b. Note also that this isn't perfect, but does handle the situations that you have suggested.

It's a terrible idea to begin with. You think, your users would write something like "f-aeen" to avoid your filter but would not come up with "ffaen" or "-faen" or whatever variation that you did not prepare for? This is a race you cannot win and the real loser is usability.

Related

How to select strings with the most keywords matches?

with regex, is using both "is" and "is not" range definitons within the same range possible?

Java string: classes or packages with advanced functions?

Splitting text to sentences and sentence to words: BreakIterator vs regular expressions

Java Regex, capturing groups with comma separated values

Categories

Resources