Java Regex Matcher Problem with dynamic Strings - java

I have some problems with Regex in Java and dynamic input - No problems with Regex at all ;)
private static Pattern START_SUITE = Pattern.compile("Test Suite '(\\S+)'.*started at\\s+(.*)");
String line = "Test Suite '/a/long/path/to/some/file.octest(Tests)' started at 2011-07-09 08:01:34 +0000";
Matcher m = START_SUITE.matcher(line);
if (m.matches()) {
//do something
}
This works fine with my test java application with the string above.
But when the String comes from another source, the Matcher doesn't match it.
processHandler.addProcessListener(new ProcessAdapter() {
@Override
public void onTextAvailable(final ProcessEvent event, final Key outputType) {
try {
outputParser.myMatchStringFunction(event.getText());
}
...
}
public void myMatchStringFunction(String line) {
Matcher m = START_SUITE.matcher(line);
if (m.matches()) {
...
I checked the String with printing and it looks ok.
Any ideas what could happen?

Whether the string came from a string literal or dynamically from input won't affect anything at all. So it's either something wrong with your regular expression, or something in your input that you weren't expecting and need to trim off.
You say you've printed the string - but it's easy to miss non-printable characters, or newlines etc.
I suggest you print a sample failing string out in full, including the Unicode character values, e.g.
for (int i = 0; i < text.length(); i++)
{
char c = text.charAt(i);
System.out.println("Position: " + i + " Character: " + c
+ " Unicode: " + (int) c);
}
Then you'll be able to put exactly that string into your code if you need to, and you'll probably be able to spot what's wrong just by inspecting it in that form.

Thanks for that hint.
Adding DOTALL and (.*) at the end of every pattern solved the problem:
private static Pattern START_SUITE = Pattern.compile("Test Suite '(\\S+)'.*started at\\s+(.*)", Pattern.DOTALL);
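A likely explanation for why DOTALL helps, offered as an assumption since the thread never shows the raw process output: text delivered by the process listener probably ends with a line terminator. Without DOTALL, `.` does not match line terminators, so `matches()`, which must consume the whole input, fails. The sketch below reproduces that with a hypothetical input that has a trailing `\n`; the class name is made up for illustration.

```java
import java.util.regex.Pattern;

final class TrailingNewlineDemo {
    private static final String REGEX = "Test Suite '(\\S+)'.*started at\\s+(.*)";
    static final Pattern PLAIN = Pattern.compile(REGEX);
    static final Pattern WITH_DOTALL = Pattern.compile(REGEX, Pattern.DOTALL);

    public static void main(String[] args) {
        // Assumption: the dynamic input carries a trailing newline.
        String line = "Test Suite '/a/long/path/to/some/file.octest(Tests)' "
                + "started at 2011-07-09 08:01:34 +0000\n";

        System.out.println(PLAIN.matcher(line).matches());        // false: '.' stops at '\n'
        System.out.println(WITH_DOTALL.matcher(line).matches());  // true: '.' also matches '\n'
        System.out.println(PLAIN.matcher(line.trim()).matches()); // true: newline removed first
    }
}
```

If the trailing newline is indeed the cause, trimming the input before matching may be a less invasive fix than changing every pattern.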

Related

String matching in java

I am currently struggling with my "dirty word" filter finding partial matches.
example: if I pass in these two params replaceWord("ass", "passing pass passed ass")
to this method
private static String replaceWord(String word, String input) {
Pattern legacyPattern = Pattern.compile(word, Pattern.CASE_INSENSITIVE);
Matcher matcher = legacyPattern.matcher(input);
StringBuilder returnString = new StringBuilder();
int index = 0;
while(matcher.find()) {
returnString.append(input.substring(index,matcher.start()));
for(int i = 0; i < word.length() - 1; i++) {
returnString.append('*');
}
returnString.append(word.substring(word.length()-1));
index = matcher.end();
}
if(index < input.length()){
returnString.append(input.substring(index));
}
return returnString.toString();
}
I get p**sing p**s p**sed **s
When I really just want "passing pass passed **s".
Does anyone know how to avoid this partial matching with this method??
Any help would be great thanks!
This tutorial from Oracle should point you in the right direction.
You want to use a word boundary in your pattern:
Pattern p = Pattern.compile("\\bword\\b", Pattern.CASE_INSENSITIVE);
Note, however that this still is problematic (as profanity filtering always is). A "non-word character" that defines the boundary is anything not included in [0-9A-Za-z_]
So for example, _ass would not match.
You also have the problem of profanity derived terms ... where the term is prepended to say, "hole", "wipe", etc
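To make the word-boundary behaviour concrete, here is a minimal sketch. The class and method names are invented for illustration, and the fixed `***` replacement is a simplification rather than the masking scheme from the question. `Pattern.quote` is used on the assumption that the word might contain regex metacharacters.

```java
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    // Replaces only whole-word occurrences of 'word' (case-insensitive).
    static String mask(String word, String input) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(input).replaceAll("***");
    }

    public static void main(String[] args) {
        System.out.println(mask("ass", "passing pass passed ass")); // passing pass passed ***
        // '_' is a word character, so there is no boundary between '_' and 'a'.
        System.out.println(mask("ass", "_ass"));                    // _ass
    }
}
```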
I'm working on a dirty word filter as we speak, and the option I chose to go with was Soundex and some regex.
I first filter out strange characters, keeping only those matched by \w, which is [a-zA-Z_0-9].
Then use soundex(String) to make a string that you can check against the soundex string of the word you want to test.
String soundExOfDirtyWord = Soundex.soundex(dirtyWord);
String soundExOfTestWord = Soundex.soundex(testWord);
if (soundExOfTestWord.equals(soundExOfDirtyWord)) {
System.out.println("The test word sounds like " + dirtyWord);
}
I just keep a list of dirty words in the program and have SoundEx run through them to check. The algorithm is something worth looking at.
You could also use the replaceAll() method from the Matcher class. It replaces all occurrences of the pattern with your specified replacement string. Something like this:
private static String replaceWord(String word, String input) {
Pattern legacyPattern = Pattern.compile("\\b" + word + "\\b", Pattern.CASE_INSENSITIVE);
Matcher matcher = legacyPattern.matcher(input);
String replacement = "";
for (int i = 0; i < word.length() - 1; i++) {
replacement += "*";
}
replacement += word.charAt(word.length() - 1);
return matcher.replaceAll(replacement);
}

Replacing regex with the same amount of "." as its length

See this for my current attempt: http://regexr.com?374vg
I have a regex that captures what I want it to capture, the thing is that the String().replaceAll("regex", ".") replaces everything with just one ., which is fine if it's at the end of the line, but otherwise it doesn't work.
How can I replace every character of the match with a dot, so I get the same amount of . symbols as its length?
Here's a one line solution:
str = str.replaceAll("(?<=COG-\\d{0,99})\\d", ".").replaceAll("COG-(?=\\.+)", "....");
Here's some test code:
String str = "foo bar COG-2134 baz";
str = str.replaceAll("(?<=COG-\\d{0,99})\\d", ".").replaceAll("COG-(?=\\.+)", "....");
System.out.println(str);
Output:
foo bar ........ baz
This is not possible using String#replaceAll. You might be able to use Pattern.compile(regexp) and iterate over the matches like so:
StringBuilder result = new StringBuilder();
Pattern pattern = Pattern.compile(regexp);
Matcher matcher = pattern.matcher(inputString);
int previous = 0;
while (matcher.find()) {
result.append(inputString.substring(previous, matcher.start()));
result.append(buildStringWithDots(matcher.end() - matcher.start()));
previous = matcher.end();
}
result.append(inputString.substring(previous, inputString.length()));
To use this you have to define buildStringWithDots(int length) to build a String containing length dots.
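One possible definition of that helper, as a sketch: the wrapping class name is hypothetical, and `Arrays.fill` is just one of several ways to build the string.

```java
import java.util.Arrays;

final class DotHelper {
    // Returns a String made of 'length' dot characters.
    static String buildStringWithDots(int length) {
        char[] dots = new char[length];
        Arrays.fill(dots, '.');
        return new String(dots);
    }

    public static void main(String[] args) {
        System.out.println(buildStringWithDots(8)); // ........
    }
}
```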
Consider this code:
Pattern p = Pattern.compile("COG-([0-9]+)");
Matcher mt = p.matcher("Fixed. Added ''Show annualized values' chackbox in EF Comp Report. Also fixed the problem with the missing dots for the positions and the problem, described in COG-18613");
if (mt.find()) {
char[] array = new char[mt.group().length()];
Arrays.fill(array, '.');
System.out.println( " <=> " + mt.replaceAll(new String(array)));
}
OUTPUT:
Fixed. Added ''Show annualized values' chackbox in EF Comp Report. Also fixed the problem with the missing dots for the positions and the problem, described in .........
Personally, I'd simplify your life and just do something like this (for starters). I'll let you finish.
public class Test {
public static void main(String[] args) {
String cog = "COG-19708";
for (int i = cog.indexOf("COG-"); i < cog.length(); i++) {
System.out.println(cog.substring(i,i+1));
// build new string
}
}
}
Can you put your regex in grouping so replace it with string that matches the length of matched grouping? Something like:
regex = (_what_i_want_to_match)
String().replaceAll(regex, create string that has that many '.' as length of $1)
?
note: $1 is what you matched in your search
see also: http://www.regular-expressions.info/brackets.html
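A plain replacement string cannot compute anything from the length of $1, so the idea above needs a loop. One way to sketch it in Java is with `Matcher.appendReplacement`, building the dots per match. The class and method names are invented for illustration, and the regex passed in is assumed to have a capturing group 1 that spans the whole match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupLengthDots {
    // Replaces each match with as many dots as group 1 has characters.
    static String dotOut(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder dots = new StringBuilder();
            for (int i = 0; i < m.group(1).length(); i++) {
                dots.append('.');
            }
            // quoteReplacement is not strictly needed for dots, but keeps the call safe.
            m.appendReplacement(out, Matcher.quoteReplacement(dots.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dotOut("(COG-\\d+)", "described in COG-18613"));
        // described in .........
    }
}
```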

Discard the leading and trailing series of a character, but retain the same character otherwise

I have to process a string with the following rules:
It may or may not start with a series of '.
It may or may not end with a series of '.
Whatever is enclosed between the above should be extracted. However, the enclosed string also may or may not contain a series of '.
For example, I can get following strings as input:
''''aa''''
''''aa
aa''''
''''aa''bb''cc''''
For the above examples, I would like to extract the following from them (respectively):
aa
aa
aa
aa''bb''cc
I tried the following code in Java:
Pattern p = Pattern.compile("[^']+(.+'*.+)[^']*");
Matcher m = p.matcher("''''aa''bb''cc''''");
while (m.find()) {
int count = m.groupCount();
System.out.println("count = " + count);
for (int i = 0; i <= count; i++) {
System.out.println("-> " + m.group(i));
}
}
But I get the following output:
count = 1
-> aa''bb''cc''''
-> ''bb''cc''''
Any pointers?
EDIT: Never mind, I was using a * at the end of my regex, instead of +. Doing this change gives me the desired output. But I would still welcome any improvements for the regex.
This one works for me.
String str = "''''aa''bb''cc''''";
Pattern p = Pattern.compile("^'*(.*?)'*$");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group(1));
}
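As a quick check, the same pattern can be run over all four sample inputs from the question. This is a small sketch; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteStripper {
    private static final Pattern P = Pattern.compile("^'*(.*?)'*$");

    // Returns the text between the leading and trailing runs of single quotes.
    static String strip(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group(1) : s;
    }

    public static void main(String[] args) {
        String[] inputs = { "''''aa''''", "''''aa", "aa''''", "''''aa''bb''cc''''" };
        for (String in : inputs) {
            System.out.println(strip(in));
        }
        // Prints: aa, aa, aa, aa''bb''cc (one per line)
    }
}
```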
Have a look at the boundary matchers of Java's Pattern class (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html). Especially $ (end of a line) might be interesting. I also recommend the following Eclipse plugin for regex testing: http://sourceforge.net/projects/quickrex/. It lets you see exactly what the match and the groups of your regex will be for a given test string.
E.g. try the following pattern: [^']+(.+'*.+)+[^'$]
I'm not that good at Java, so I hope the regex is sufficient. For your examples, it works well.
s/^'*(.+?)'*$/$1/gm

Regex to remove spaces from file name

I have some HTML strings which contain images. I need to remove spaces from the image names because some tablets do not accept them. (I already renamed all image resources.) I think the only fixed part is ...
src="file:///android_asset/images/ ?? ?? .???"
because those links are valid links.
I spent half a day on it and am still struggling with performance. The following code works but is really slow...
public static void main(String[] args) {
String str = "<IMG height=286 alt=\"eye_anatomy 1.jpg\" src=\"file:///android_asset/images/eye_anatomy 1 .jpg\" width=350 border=0></P> fd ssda f \r\n"
+ "fd <P align=center><IMG height=286 alt=\"eye_anatomy 1.jpg\" src=\"file:///android_asset/images/ eye_anato my 1 .bmp\" width=350 border=0></P>\r\n"
+ "\r\n<IMG height=286 alt=\"eye_anatomy 1.jpg\" src=\"file:///android_asset/images/eye_anatomy1.png\" width=350 border=0>\r\n";
Pattern p = Pattern.compile("(.*?)(src=\"file:///android_asset/images/)(.*?\\s+.*?)(\")", Pattern.DOTALL);
Matcher m = p.matcher(str);
StringBuilder sb = new StringBuilder("");
int i = 0;
while (m.find()) {
sb.append(m.group(1)).append(m.group(2)).append(m.group(3).replaceAll("\\s+", "")).append(m.group(4));
i = m.end();
}
sb.append(str.substring(i, str.length()));
System.out.println(sb.toString());
}
So the real question is: how can I remove spaces from image names efficiently using regex?
Thank you.
Regex is as regex does. :-) Seriously, regex is great for really particular cases, but for stuff like this I find myself writing lower-level code. So the following isn't a regex; it's a function. But it does what you want and does it much faster than your regex. (That said, if someone does come up with a regex that fits the bill and performs well, I'd love to see it.)
The following function segments the source string using spaces as delimiters, then recognizes and cleans up your alt and src attributes by not appending spaces while assembling the result. I did the alt attribute only because you were putting file names there too. One side effect is that this will collapse multiple spaces into one space in the rest of the markup, but browsers do that anyway. You can optimize the code a bit by re-using a StringBuilder. It presumes double-quotes around attributes.
I hope this helps.
private String removeAttrSpaces(final String str) {
final StringBuilder sb = new StringBuilder(str.length());
boolean inAttribute = false;
for (final String segment : str.split(" ")) {
if (segment.startsWith("alt=\"") || segment.startsWith("src=\"")) {
inAttribute = true;
}
if (inAttribute && segment.endsWith("\"")) {
inAttribute = false;
}
sb.append(segment);
if (!inAttribute) {
sb.append(' ');
}
}
return sb.toString();
}
Here's a function that should be faster http://ideone.com/vlspF:
private static String removeSpacesFromImages(String aText){
Pattern p = Pattern.compile("(?<=src=\"file:///android_asset/images/)[^\"]*");
StringBuffer result = new StringBuffer();
Matcher matcher = p.matcher(aText);
while ( matcher.find() ) {
matcher.appendReplacement(result, Matcher.quoteReplacement(matcher.group(0).replaceAll("\\s+","")));
}
matcher.appendTail(result);
return result.toString();
}

Regular Expression problem in Java

I am trying to create a regular expression for the replaceAll method in Java. The test string is abXYabcXYZ and the pattern is abc. I want to replace any symbol except the pattern with +. For example the string abXYabcXYZ and pattern [^(abc)] should return ++++abc+++, but in my case it returns ab++abc+++.
public static String plusOut(String str, String pattern) {
pattern= "[^("+pattern+")]" + "".toLowerCase();
return str.toLowerCase().replaceAll(pattern, "+");
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
When I try to replace the pattern with + there is no problem - abXYabcXYZ with pattern (abc) returns abxy+xyz. Pattern (^(abc)) returns the string without replacement.
Is there any other way to write NOT(regex) or group symbols as a word?
What you are trying to achieve is pretty tough with regular expressions, since there is no way to express “replace strings not matching a pattern”. You will have to use a “positive” pattern, telling what to match instead of what not to match.
Furthermore, you want to replace every character with a replacement character, so you have to make sure that your pattern matches exactly one character. Otherwise, you will replace whole strings with a single character, returning a shorter string.
For your toy example, you can use negative lookaheads and lookbehinds to achieve the task, but this may be more difficult for real-world examples with longer or more complex strings, since you will have to consider each character of your string separately, along with its context.
Here is the pattern for “not ‘abc’”:
[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c
It consists of five sub-patterns, connected with “or” (|), each matching exactly one character:
[^abc] matches every character except a, b or c
a(?!bc) matches a if it is not followed by bc
(?<!a)b matches b if it is not preceded with a
b(?!c) matches b if it is not followed by c
(?<!ab)c matches c if it is not preceded with ab
The idea is to match every character that is not in your target word abc, plus every word character that, according to the context, is not part of your word. The context can be examined using negative lookaheads (?!...) and lookbehinds (?<!...).
You can imagine that this technique will fail once you have a target word containing one character more than once, like example. It is pretty hard to express “match e if it is not followed by x and not preceded by l”.
Especially for dynamic patterns, it is by far easier to do a positive search and then replace every character that did not match in a second pass, as others have suggested here.
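For the toy example, the five-part pattern can be tried directly with replaceAll. This is a minimal sketch that hard-codes the target word abc, since the pattern itself is specific to it; the class name is invented for illustration.

```java
final class NotAbcDemo {
    // Each alternative matches exactly one character that is not part of "abc".
    static final String NOT_ABC = "[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c";

    static String plusOut(String str) {
        return str.toLowerCase().replaceAll(NOT_ABC, "+");
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ")); // ++++abc+++
    }
}
```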
[^ ... ] will match one character that is not any of ...
So your pattern "[^(abc)]" is saying "match one character that is not a, b, c or the left or right parenthesis"; and indeed that is what happens in your test.
It is hard to say "replace all characters that are not part of the string 'abc'" in a single trivial regular expression. What you might do instead to achieve what you want could be some nasty thing like
while the input string still contains "abc"
find the next occurrence of "abc"
append to the output a string containing as many "+"s as there are characters before the "abc"
append "abc" to the output string
skip, in the input string, to a position just after the "abc" found
append to the output a string containing as many "+"s as there are characters left in the input
or possibly if the input alphabet is restricted you could use regular expressions to do something like
replace all occurrences of "abc" with a single character that does not occur anywhere in the existing string
replace all other characters with "+"
replace all occurrences of the target character with "abc"
which will be more readable but may not perform as well.
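The second, three-step idea could be sketched as follows. The marker character U+E000 (from the Unicode private use area) is an arbitrary choice made here, not something from the answer, and the method rejects inputs that already contain it. The class name is hypothetical.

```java
final class PlaceholderPlusOut {
    // Assumption: this private-use character never occurs in real input.
    private static final char MARKER = '\uE000';

    static String plusOut(String input, String target) {
        if (target.isEmpty() || input.indexOf(MARKER) >= 0) {
            throw new IllegalArgumentException("empty target or marker present in input");
        }
        String marker = String.valueOf(MARKER);
        // 1. Collapse every occurrence of the target into the single marker character.
        String marked = input.replace(target, marker);
        // 2. Replace every character that is not the marker with '+'.
        String plussed = marked.replaceAll("[^" + marker + "]", "+");
        // 3. Expand the marker back into the target.
        return plussed.replace(marker, target);
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abxyabcxyz", "abc")); // ++++abc+++
    }
}
```

Because steps 1 and 3 use `String.replace`, which treats its arguments literally, the target does not need regex quoting in this variant.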
Negating regexps is usually troublesome. I think you might want to use negative lookahead. Something like this might work:
String pattern = "(?<!ab).(?!abc)";
I didn't test it, so it may not really work for degenerate cases. And the performance might be horrible too. It is probably better to use a multistep algorithm.
Edit: No I think this won't work for every case. You will probably spend more time debugging a regexp like this than doing it algorithmically with some extra code.
Try to solve it without regular expressions:
String out = "";
int i;
for(i=0; i<text.length() - pattern.length() + 1; ) {
if (text.substring(i, i + pattern.length()).equals(pattern)) {
out += pattern;
i += pattern.length();
}
else {
out += "+";
i++;
}
}
for(; i<text.length(); i++) {
out += "+";
}
Rather than a single replaceAll, you could always try something like:
@Test
public void testString() {
final String in = "abXYabcXYabcHIH";
final String expected = "xxxxabcxxabcxxx";
String result = replaceUnwanted(in);
assertEquals(expected, result);
}
private String replaceUnwanted(final String in) {
final Pattern p = Pattern.compile("(.*?)(abc)([^a]*)");
final Matcher m = p.matcher(in);
final StringBuilder out = new StringBuilder();
while (m.find()) {
out.append(m.group(1).replaceAll(".", "x"));
out.append(m.group(2));
out.append(m.group(3).replaceAll(".", "x"));
}
return out.toString();
}
Instead of using replaceAll(...), I'd go for a Pattern/Matcher approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static String plusOut(String str, String pattern) {
StringBuilder builder = new StringBuilder();
String regex = String.format("((?:(?!%s).)++)|%s", pattern, pattern);
Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
while(m.find()) {
builder.append(m.group(1) == null ? pattern : m.group().replaceAll(".", "+"));
}
return builder.toString();
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
}
Note that you'll need to use Pattern.quote(...) if your String pattern contains regex meta-characters.
Edit: I didn't see a Pattern/Matcher approach was already suggested by toolkit (although slightly different)...
