Java performance issue - regular expression vs. internal String method

I'm having the following issue:
I have some string somewhere in my application that I want to check - the check is whether this string contains a character that is different than " "(white space), /n and /r
For example:
" g" - Contains
" /n " - Not Contains
" " - Not Contains
I want to do it in a reg expression, but I don't want to use the common pattern .*[a-zA-Z0-9]+.* . Instead, I want something like .*[!" ""/n"/r"]. (every character that is different than " " "/r" and "n").
My problems are that
I don't know if this pattern is valid (the above isn't working)
I'm not sure whether it would be much faster than using the regular String methods.

Firstly, you mean \n and \r, and in Java this means escaping the backslash as well with \\n and \\r.
Secondly, if you merely mean to catch any non-whitespace, just use the pattern \\S or [^\\s] with find() (\\S* would also match the empty string, so it succeeds on every input). \S is non-whitespace, \s is whitespace, and [^<charset>] means "match anything that isn't one of these."
Thirdly, if this is a repeated check, be sure to only compile the regex once then use it multiple times.
Fourthly, follow the usual strategy for profiling. First, is this on a critical path in your application? If so, benchmark it yourself.

Here's something that does exactly what you want, but (like I said above) it'll be faster to go over the characters yourself:
Pattern NOT_WHITESPACE_DETECTOR = Pattern.compile("[^ \\n\\r]");
Matcher m = NOT_WHITESPACE_DETECTOR.matcher(" \n \r bla ");
if (m.find()) {
//string contains a non-white-space
}
Also note that Java's definition of whitespace is much wider than the one you specified, and even then there are Unicode whitespace characters that Java doesn't detect (there are libraries that do, however).
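To make the comparison concrete, here is a minimal sketch that puts the precompiled pattern next to a plain character loop. The class and method names are made up for illustration, and nothing here measures speed; it only shows that both checks give the same answers, so you can benchmark them against each other in your own application.

```java
import java.util.regex.Pattern;

class NonBlankCheck {
    // Compiled once and reused, as recommended above.
    private static final Pattern NOT_WHITESPACE = Pattern.compile("[^ \\n\\r]");

    // Regex version: true if any character other than ' ', '\n', '\r' occurs.
    static boolean containsOtherRegex(String s) {
        return NOT_WHITESPACE.matcher(s).find();
    }

    // Loop version: same check without the regex engine.
    static boolean containsOtherLoop(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != '\n' && c != '\r') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {" g", " \n ", " ", ""};
        for (String s : samples) {
            System.out.println(containsOtherRegex(s) + " " + containsOtherLoop(s));
        }
    }
}
```

Both methods stop at the first offending character, so the difference between them is mostly the overhead of the regex machinery.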

Related

Java regular expression for number starts with code

I am not a Java developer but I am interfacing with a Java system.
Please help me with a regular expression that would detect all numbers starting with 25678 or 25677.
For example in rails would be:
^(25677|25678)
Sample input is 256776582036 and 256782405036
^(25678|25677)
or
^2567[78]
If you use ^(25678|25677)[0-9]*$ it guarantees that the remaining characters are all digits and not other characters (the trailing $ is what enforces this).
That should do the trick for you: it looks for either prefix and then any number of digits after it.
In Java the regex would be the same, assuming that the number takes up the entire line. You could further simplify it to
^2567[78]
If you need to match a number anywhere in the string, use \b anchor (double the backslash if you are making a string literal in Java code).
\b2567[78]
how about if there is a possibility of a + at the beginning of a number
Add an optional +, like this [+]? or like this \+? (again, double the backslash for inclusion in a string literal).
Note that it is important to know what Java API is used with the regular expression, because some APIs will require the regex to cover the entire string in order to declare it a match.
Try something like:
String number = ...;
if (number.matches("^2567[78].*$")) {
//yes it starts with your number
}
Regex ^2567[78].*$ Means:
Number starts with 2567 followed by either 7 or 8 and then followed by any character.
If you need only digits after, say, 25677, then the regex should be ^2567[78]\\d*$, which means the prefix at the beginning is followed by zero or more digits up to the end of the string.
The regex syntax of Java is pretty close to that of rails, especially for something this simple. The trick is in using the correct API calls. If you need to do more than one search, it's worthwhile to compile the pattern once and reuse it. Something like this should work (strings is a placeholder for whatever collection of inputs you have):
Pattern p = Pattern.compile("^2567[78]");
for (String s : strings) {
    if (p.matcher(s).find()) {
        // string starts with 25677 or 25678
    } else {
        // string starts with something else
    }
}
If it's a one-shot deal, then you can simplify all this by changing the pattern to cover the entire string:
if (someString.matches("2567[78].*")) {
// string starts with 25677 or 25678
}
The matches() method tests whether the entire string matches the pattern; hence the leading ^ anchor is unnecessary but the trailing .* is needed.
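If you need to account for an optional leading + (as you indicated in a comment to another answer), just include \\+? at the start of the pattern (or after the ^ if that's used). The + has to be escaped, because a bare + is a quantifier and +? on its own is not a valid pattern.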
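Putting the pieces from these answers together, here is a small sketch of both API styles: find() with an anchored prefix pattern, and matches() with a pattern that covers the whole string. The class and method names are invented for the example, and the optional leading + is included as discussed above.

```java
import java.util.regex.Pattern;

class PrefixCheck {
    // Compiled once; find() only needs the anchored prefix.
    private static final Pattern PREFIX = Pattern.compile("^\\+?2567[78]");

    // True if the string starts with 25677 or 25678 (optionally after a '+').
    static boolean startsWithCode(String s) {
        return PREFIX.matcher(s).find();
    }

    // True only if the whole string is the prefix followed by digits.
    // matches() must cover the entire input, so no ^ or $ is needed.
    static boolean isNumberWithCode(String s) {
        return s.matches("\\+?2567[78]\\d*");
    }

    public static void main(String[] args) {
        String[] samples = {"256776582036", "+256782405036", "256796582036", "25677abc"};
        for (String s : samples) {
            System.out.println(s + " " + startsWithCode(s) + " " + isNumberWithCode(s));
        }
    }
}
```

The last sample shows the difference between the two calls: it starts with the code, so find() succeeds, but it is not all digits, so matches() fails.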

Parse Drools rule file with Java regex

I'm interested in parsing a Drools rule file using regular expressions.
Having a string with the content of the whole .drl file, I'd like to have 4 substrings:
A substring with the content of <name>
A substring with the content of <attribute>
A substring with the content of <conditional element>
A substring with the content of <action>
A Drools rule has the following structure, according to the official documentation:
rule "<name>"
<attribute>*
when
<conditional element>*
then
<action>*
end
I've tried using this pattern, but it hasn't worked well:
^rule"(.|\n|\r|\t)+"(.|\n|\r|\t)+\bwhen\b(.|\n|\r|\t)+\bthen\b(.|\n|\r|\t)+\bend\b?$
Does anyone have an idea of how could I proceed?
I know your question is about regexp, but I would strongly advise against using it. There are way too many cases that will fail with your regexp... for instance, rule names that are a single word don't need "", rule keyword does not need to be the first thing in the line, etc...
/*this is a comment on the start of the line*/ rule X...
Instead of regexp, just use the DrlParser directly and it will give you all the information you need:
String drl = "package foo \n"
+ "declare Bean1 \n"
+ "field1: java.math.BigDecimal \n"
+ "end \n"
+ "rule bigdecimal\n"
+ "when \n"
+ "Bean1( field1 == 0B ) \n"
+ "then \n"
+ "end";
DrlParser parser = new DrlParser(LanguageLevelOption.DRL6);
PackageDescr pkgDescr = parser.parse( null, drl );
PackageDescr.getRules() will give you all the RuleDescr in the file, each RuleDescr has a getName() to give you the rule name, etc. All type safe, no edge cases, etc.
You almost got it. This works:
^rule\s+\"(.|\n|\r|\t)+\"(.|\n|\r|\t)+\bwhen\b(.|\n|\r|\t)+\bthen\b(.|\n|\r|\t)+\bend\b?$
Another solution:
^\s*rule\s+\"([^\"]+)\"[\s\S]+\s+when\s+([\s\S]+)\s+then\s+([\s\S]+)\send\s*$
Note: You missed the space and " -> \"
Tips:
You can use \s for whitespace characters.
[^\"] matches any character except ".
[\s\S] matches any character, including line breaks.
\b matches at a word boundary, i.e. next to [a-zA-Z0-9_]. \s+ consumes whitespace and stops at the first non-whitespace character. It is just an extra precaution in case an attribute starts with a special character.
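For what it's worth, here is a rough sketch of how the second pattern could be applied from Java to a string that holds exactly one rule. The class name, method name and sample rule are made up for the example. Note that this pattern has three capture groups (name, conditions, actions); the attributes are matched but not captured, and it needs at least one attribute line between the name and when.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DrlRegexSketch {
    // Same pattern as above, with backslashes doubled for a Java string literal.
    private static final Pattern RULE = Pattern.compile(
        "^\\s*rule\\s+\"([^\"]+)\"[\\s\\S]+\\s+when\\s+([\\s\\S]+)\\s+then\\s+([\\s\\S]+)\\send\\s*$");

    // Returns {name, conditions, actions} for a string holding a single rule.
    static String[] parse(String drl) {
        Matcher m = RULE.matcher(drl);
        if (!m.find()) {
            throw new IllegalArgumentException("not a rule this pattern understands");
        }
        return new String[] { m.group(1), m.group(2).trim(), m.group(3).trim() };
    }

    public static void main(String[] args) {
        // Hypothetical rule text, only for demonstration.
        String drl = "rule \"Discount\"\n"
                   + "salience 10\n"
                   + "when\n"
                   + "    Order( total > 100 )\n"
                   + "then\n"
                   + "    applyDiscount();\n"
                   + "end\n";
        for (String part : parse(drl)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

This only covers the simple case. Unquoted rule names, comments and several rules in one file will still trip it up, which is the argument for the parser-based answer.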
Use a program like Rad Software Regular Expression Designer. That will dramatically simplify editing and testing your regex code.

Regex Lookahead and Lookbehinds: followed by this or that

I'm trying to write a regular expression that checks ahead to make sure there is either a whitespace character OR an opening parenthesis after the words I'm searching for.
Also, I want it to look back and make sure it is preceded by either a non-Word (\W) or nothing at all (i.e. it is the beginning of the statement).
So far I have,
"(\\W?)(" + words.toString() + ")(\\s | \\()"
However, this also matches the stuff at either ends - I want this pattern to match ONLY the word itself - not the stuff around it.
I'm using Java flavor Regex.
As you tagged your question yourself, you need lookarounds:
String regex = "(?<=\\W|^)(" + Pattern.quote(words.toString()) + ")(?= |[(])"
(?<=X) means "preceded by X"
(?<!X) means "not preceded by X"
(?=X) means "followed by X"
(?!X) means "not followed by X"
What about the word itself: will it always start with a word character (i.e., one that matches \w)? If so, you can use a word boundary for the leading condition.
"\\b" + theWord + "(?=[\\s(])"
Otherwise, you can use a negative lookbehind:
"(?<!\\w)" + theWord + "(?=[\\s(])"
I'm assuming the word is either quoted like so:
String theWord = Pattern.quote(words.toString());
...or doesn't need to be.
If you don't want a group to be captured by the matching, you can use the special construct (?:X)
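So, in your case (without the spaces around |, which would otherwise be matched literally):
"(?:\\W?)(" + words.toString() + ")(?:\\s|\\()"
You will then have only two groups: group(0) for the whole match, which still includes the surrounding characters, and group(1) for the word you are looking for.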
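Here is a small runnable sketch of the lookaround variant, using the negative lookbehind and the lookahead from the answers above. The class name, method name and sample text are invented for the example. Since lookarounds consume nothing, the match is the word alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordLookaround {
    // Start offsets of every occurrence of word that is not preceded by a
    // word character and is followed by whitespace or '('.
    static List<Integer> findWord(String word, String text) {
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(word) + "(?=[\\s(])");
        Matcher m = p.matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            // m.group() is exactly the word here; the lookarounds add nothing to it.
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findWord("print", "print(x); sprint (y); print z; printer"));
    }
}
```

In the sample text, "sprint" is rejected by the lookbehind and "printer" by the lookahead, so only the two standalone occurrences are reported.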

Regex; backreferencing a character that was NOT matched in a character set

I want to construct a regex, that matches either ' or " and then matches other characters, ending when a ' or an " respectively is matched, depending on what was encountered right at the start. So this problem appears simple enough to solve with the use of a backreference at the end; here is some regex code below (it's in Java so mind the extra escape chars such as the \ before the "):
private static String seekerTwo = "(['\"])([a-zA-Z])([a-zA-Z0-9():;/`\\=\\.\\,\\- ]+)(\\1)";
This code will successfully deal with things such as:
"hello my name is bob"
'i live in bethnal green'
The trouble comes when I have a String like this:
"hello this seat 'may be taken' already"
Using the above regex on it will fail on the initial part upon encountering ' then it would continue and successfully match 'may be taken'... but this is obviously insufficient, I need the whole String to be matched.
What I'm thinking, is that I need a way to ignore the type of quotation mark, which was NOT matched in the very first group, by including it as a character in the character set of the 3rd group. However, I know of no way to do this. Is there some sort of sneaky NOT backreference function or something? Something I can use to reference the character in the 1st group that was NOT matched?? Or otherwise some kind of solution to my predicament?
This can be done using negative lookahead assertions. The following solution even takes into account that you could escape a quote inside a string:
(["'])(?:\\.|(?!\1).)*\1
Explanation:
(["']) # Match and remember a quote.
(?: # Either match...
\\. # an escaped character
| # or
(?!\1) # (unless that character is identical to the quote character in \1)
. # any character
)* # any number of times.
\1 # Match the corresponding quote.
This correctly matches "hello this seat 'may be taken' already" or "hello this seat \"may be taken\" already".
In Java, with all the backslashes:
Pattern regex = Pattern.compile(
"([\"']) # Match and remember a quote.\n" +
"(?: # Either match...\n" +
" \\\\. # an escaped character\n" +
"| # or\n" +
" (?!\\1) # (unless that character is identical to the matched quote char)\n" +
" . # any character\n" +
")* # any number of times.\n" +
"\\1 # Match the corresponding quote",
Pattern.COMMENTS);
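If you want to try it out, here is a compact sketch that uses the same pattern without the COMMENTS flag and collects every quoted string in a text. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteMatcher {
    // (["'])(?:\\.|(?!\1).)*\1 written as a Java string literal.
    private static final Pattern QUOTED =
        Pattern.compile("([\"'])(?:\\\\.|(?!\\1).)*\\1");

    // Returns every quoted string, including its surrounding quote characters.
    static List<String> findQuoted(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findQuoted("\"hello this seat 'may be taken' already\""));
        System.out.println(findQuoted("say 'it\\'s fine' now"));
    }
}
```

The second call in main exercises the escaped-quote branch: the backslash and the quote after it are consumed together, so the string does not end early.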
Tim's solution works fairly well if you can use lookaround (which Java does support). But if you should find yourself using a language or tool that does not support lookaround, you could simply match both cases (double quoted strings and single quoted strings) separately:
"(\\"|[^"])*"|'(\\'|[^'])*'
matches each case separately, but returns either case as the whole match
HOWEVER
Both cases can fall prey to at least one eventuality. If you don't look closely, you may think there should be two matches in this excerpt:
He turned to get on his bike. "I'll see you later, when I'm done with all this" he said, looking back for a moment before starting his journey. As he entered the street, one of the city's trolleys collided with Mike's bicycle. "Oh my!" exclaimed an onlooker.
...but there are three matches, not two:
"I'll see you later, when I'm done with all this"
's trolleys collided with Mike'
"Oh my!"
and this excerpt contains only ONE match:
The fight wasn't over yet, though. "Hey!" yelled Bob. "What do you want?" I retorted. "I hate your guts!" "Why would I care?" "Because I love you!" "You do?" Bob paused for a moment before whispering "No, I couldn't love you!"
can you find that one? :D
't over yet, though. "Hey!" yelled Bob. "What do you want?" I retorted. "I hate your guts!" "Why would I care?" "Because I love you!" "You do?" Bob paused for a moment before whispering "No, I couldn'
I would recommend (if you are up for using lookaround), that you consider doing some extra checking (such as a positive lookbehind for whitespace or similar before the first quote) to make sure you don't match things like 's trolleys collided with Mike' - though I wouldn't put much money on any solution without a lot of testing first. Adding (?<=\s|^) to the beginning of either expression will avoid the above cases... i.e.:
(?<=\s|^)(["'])(?:\\.|(?!\1).)*\1 #based on Tim's
or
(?<=\s|^)("(\\"|[^"])*"|'(\\'|[^'])*') #based on my alternative
I'm not sure how efficient lookaround is compared to non-lookaround, so the two above may be equivalent, or one may be more efficient than the other (?)
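To see the effect of the lookbehind guard, here is a sketch that runs the unguarded and the guarded version of Tim's pattern over a shortened variant of the trolley excerpt. The class name, method name and sample sentence are made up for the example, and it says nothing about the efficiency question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteGuard {
    // Tim's pattern as-is.
    static final Pattern PLAIN =
        Pattern.compile("([\"'])(?:\\\\.|(?!\\1).)*\\1");
    // Same pattern, but the opening quote must follow whitespace or start of input.
    static final Pattern GUARDED =
        Pattern.compile("(?<=\\s|^)([\"'])(?:\\\\.|(?!\\1).)*\\1");

    static List<String> matches(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "He left. \"I'll see you later\" he said. "
                    + "One of the city's trolleys hit Mike's bike. \"Oh my!\"";
        System.out.println(matches(PLAIN, text));   // includes the apostrophe-to-apostrophe span
        System.out.println(matches(GUARDED, text)); // only the two real quotations
    }
}
```

The guard still depends on how the text is punctuated (a quote right after an opening bracket would be skipped, for instance), so test it against your real data.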

Parsing quoted text in java

Is there an easy way to parse quoted text into strings in Java? I have lines like this to parse:
author="Tolkien, J.R.R." title="The Lord of the Rings"
publisher="George Allen & Unwin" year=1954
and all I want is Tolkien, J.R.R.,The Lord of the Rings,George Allen & Unwin, 1954 as strings.
You could either use a regex like
"(.+?)"
It will match the characters between a pair of quotes; the reluctant +? stops at the nearest closing quote, so several quoted values on one line are matched separately. In Java this would be:
Pattern p = Pattern.compile("\"(.+?)\"");
Matcher m = p.matcher("author=\"Tolkien, J.R.R.\"");
while (m.find()) {
    System.out.println(m.group(1));
}
Note that group(1) is used: it is the first capturing group, i.e. the text between the quotes. group(0) is the full match, including the quotes.
Of course, you could also use substring to select everything except the first and last character:
String quoted = "\"Tolkien, J.R.R.\"";
String unquoted;
if (quoted.length() >= 2 && quoted.indexOf("\"") == 0 && quoted.lastIndexOf("\"") == quoted.length() - 1) {
    // starts and ends with a quote: strip both
    unquoted = quoted.substring(1, quoted.length() - 1);
} else {
    unquoted = quoted;
}
There are some fancy pattern regex nonsense things that fancy people and fancy programmers like to use.
I like to use String.split(). It's a simple function and does what you need it to do.
So if I have a String word: "hello" and I want to take out "hello", I can simply do this:
myStr = string.split("\"")[1];
This will cut the string into bits based on the quote marks.
If I want to be more specific, I can do
myStr = string.split("word: \"")[1].split("\"")[0];
That way I cut it with word: " and "
Of course, you run into problems if word: " is repeated twice, which is what patterns are for. I don't think you'll have to deal with that problem for your specific question.
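Also, be cautious around characters like . (and other regex metacharacters). split() takes a regex, so those characters will trigger funny behavior unless they are escaped. In a Java string literal the escape is a double backslash, e.g. "\\." for a literal dot; Pattern.quote() does the same for a whole string.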
Best of luck!
Can you presume your document is well-formed and does not contain syntax errors? If so, you are simply interested in every other token after using String.split().
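As a sketch of the "every other token" idea, assuming well-formed input where quotes always come in pairs (the class and method names are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;

class SplitOnQuotes {
    // Splits on the quote character and keeps the odd-numbered tokens,
    // which are the pieces that sat between an opening and a closing quote.
    static List<String> quotedValues(String line) {
        String[] tokens = line.split("\"");
        List<String> values = new ArrayList<>();
        for (int i = 1; i < tokens.length; i += 2) {
            values.add(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(quotedValues("author=\"Tolkien, J.R.R.\" title=\"The Lord of the Rings\""));
        System.out.println(quotedValues("publisher=\"George Allen & Unwin\" year=1954"));
    }
}
```

Note that the unquoted year=1954 from the question is not picked up by this approach; it would need separate handling.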
If you need something more robust, you may need to use the Scanner class (or a StringBuffer and a for loop ;-)) to pick out the valid tokens, taking into account additional criteria beyond "I saw a quotation mark somewhere".
For example, some reasons you might need a more robust solution than splitting the string blindly on quotation marks: perhaps it's only a valid token if the quotation mark starting it comes immediately after an equals sign. Or perhaps you need to handle values that are not quoted as well as quoted ones? Will \" need to be handled as an escaped quotation mark, or does that count as the end of the string? Can it have either single or double quotes (e.g. HTML), or will it always be correctly formatted with double quotes?
One robust way would be to think like a compiler and use a Java based Lexer (such as JFlex), but that might be overkill for what you need.
If you prefer a low-level approach, you could iterate through your input stream character by character using a while loop, and when you see an =" start copying the characters into a StringBuffer until you find another non-escaped ", either concatenating to the various wanted parsed values or adding them to a List of some sort (depending on what you plan to do with your data). Then continue reading until you encounter your start token (eg: =") again, and repeat.
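A minimal sketch of that low-level loop might look like the following. It works on a String rather than a stream, uses StringBuilder in place of StringBuffer, and treats a backslash as escaping the next character; those choices, and the class and method names, are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedValueScanner {
    // Collects every value that starts right after =" and runs up to the
    // next quote that is not escaped with a backslash.
    static List<String> quotedValues(String input) {
        List<String> values = new ArrayList<>();
        int i = 0;
        while (i < input.length() - 1) {
            if (input.charAt(i) == '=' && input.charAt(i + 1) == '"') {
                StringBuilder value = new StringBuilder();
                i += 2; // skip the =" start token
                while (i < input.length() && input.charAt(i) != '"') {
                    if (input.charAt(i) == '\\' && i + 1 < input.length()) {
                        i++; // drop the backslash, keep the escaped character
                    }
                    value.append(input.charAt(i));
                    i++;
                }
                values.add(value.toString());
            }
            i++; // move past the closing quote or the current character
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(quotedValues(
            "author=\"Tolkien, J.R.R.\" title=\"The \\\"Lord\\\" of the Rings\" year=1954"));
    }
}
```

An unterminated value at the end of the input is simply returned as far as it goes; whether that should be an error instead depends on how strict you need to be.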
