Optimizing a regex-based lookup function

I have the following function.
private boolean codeContains(String name, String code) {
if (name == null || code == null) {
return false;
}
Pattern pattern = Pattern.compile("\\b" + Pattern.quote(name) + "\\b");
Matcher matcher = pattern.matcher(code);
return matcher.find();
}
It is called many thousands of times in my code, and it is the function in which my program spends the most time. Is there any way to make this function faster, or is it already as fast as it can be?

If you don't need to check word boundaries, you could do this:
private boolean codeContains(String name, String code) {
return name != null && code != null && code.indexOf(name)>=0;
}
If you need to check word boundaries but, as I suppose is your case, you have one big code string that you search often, you could "compile" the code once by splitting the code string with the split method and putting the tokens in a HashSet (checking whether a token is in a HashSet is reasonably fast).
Of course, if you have more than one code, it's easy to store them in a structure adapted to your program, for example in a map having as key the file name.
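The split-and-HashSet idea above could be sketched roughly as follows. The class name CodeIndex and the choice of "\\W+" as the separator are assumptions made for illustration, not part of the answer; the sketch also assumes code is non-null and that each name is a single word token (a name containing spaces or punctuation will never be found this way).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: tokenizes the code once so that later lookups are hash lookups.
class CodeIndex {
    // All word tokens found in the code, built once and reused for every lookup.
    private final Set<String> tokens;

    CodeIndex(String code) {
        // Assumption: a "word" is a run of \w characters, which is what \b delimits in the original regex.
        tokens = new HashSet<>(Arrays.asList(code.split("\\W+")));
    }

    boolean contains(String name) {
        return name != null && tokens.contains(name);
    }
}
```

Building the index costs one pass over the code, so this only pays off when the same code string is searched many times.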

"Plain" string operations will (almost) always be faster than regex, especially when you can't pre-compile the pattern.
Something like this would be considerably faster (with large enough name and code strings), assuming Character.isLetterOrDigit(...) suits your needs:
private boolean codeContains(String name, String code) {
if (name == null || code == null || code.length() < name.length()) {
return false;
}
if (code.equals(name)) {
return true;
}
int index = code.indexOf(name);
int nameLength = name.length();
if (index < 0) {
return false;
}
if (index == 0) {
// found at the start
char after = code.charAt(index + nameLength);
return !Character.isLetterOrDigit(after);
}
else if (index + nameLength == code.length()) {
// found at the end
char before = code.charAt(index - 1);
return !Character.isLetterOrDigit(before);
}
else {
// somewhere inside
char before = code.charAt(index - 1);
char after = code.charAt(index + nameLength);
return !Character.isLetterOrDigit(after) && !Character.isLetterOrDigit(before);
}
}
And a small test succeeds:
@Test
public void testCodeContainsFaster() {
final String code = "FOO some MU code BAR";
org.junit.Assert.assertTrue(codeContains("FOO", code));
org.junit.Assert.assertTrue(codeContains("MU", code));
org.junit.Assert.assertTrue(codeContains("BAR", code));
org.junit.Assert.assertTrue(codeContains(code, code));
org.junit.Assert.assertFalse(codeContains("FO", code));
org.junit.Assert.assertFalse(codeContains("BA", code));
org.junit.Assert.assertFalse(codeContains(code + "!", code));
}

This code seemed to do it:
private boolean codeContains(String name, String code) {
if (name == null || code == null || name.length() == 0 || code.length() == 0) {
return false;
}
int nameLength = name.length();
int lastIndex = code.length() - nameLength;
if (lastIndex < 0) {
return false;
}
for (int curr = 0; curr <= lastIndex; ) {
int index = code.indexOf(name, curr);
if (index < 0) {
break;
}
int indexEnd = index + nameLength;
// a match counts as a whole word only if the characters on both sides are non-alphabetic
boolean leftOk = index == 0 || !Character.isAlphabetic(code.charAt(index - 1));
boolean rightOk = index == lastIndex || !Character.isAlphabetic(code.charAt(indexEnd));
if (leftOk && rightOk) {
return true;
}
// resume the search just past the start of the rejected match
curr = index + 1;
}
return false;
}
The accepted answer goes to dystroy as he was the first to point me in the right direction, excellent answer by Bart Kiers though, +1!

Related

check if char[] contains only one letter and one int

I have no idea how to check whether a char[] contains exactly one letter (a or b) in the first position and exactly one digit (0-8) in the second position, for example a2 or b2.
I have the code below, but I do not know what should go in place of digitCounter +=1;
private boolean isStringValidFormat(String s) {
boolean ret = false;
if (s == null) return false;
int digitCounter = 0;
char first = s.charAt(0);
char second = s.charAt(1);
if (first == 'a' || first == 'b') {
if (second >= 0 && second <= '8') {
digitCounter +=1;
}
}
ret = digitCounter == 2; //only two position
return ret;
}
public char[] readFormat() {
char[] ret = null;
while (ret == null) {
String s = this.readString();
if (isStringValidFormat(s)) {
ret = s.toCharArray();
}else {
System.out.println("Incorrect. Values must be between 'a0 - a8' and 'b0 - b8'");
}
}
return new char[0];
}

First, I would test for null and that there are two characters in the String. Then you can use a simple boolean check to test if first is a or b and the second is between 0 and 8 inclusive. Like,
private boolean isStringValidFormat(String s) {
if (s == null || s.length() != 2) {
return false;
}
char first = s.charAt(0);
char second = s.charAt(1);
return (first == 'a' || first == 'b') && (second >= '0' && second <= '8');
}

For a well understood pattern, use Regex:
private static final Pattern pattern = Pattern.compile("^[ab][0-8]$");
public boolean isStringValidFormat(String input) {
if (input != null) {
return pattern.matcher(input).matches();
}
return false;
}

Verify if String matches real number

I'm trying to verify whether a String s is a real number. For that I created this method:
public static boolean Real(String s, int i) {
boolean resp = false;
//
if ( i == s.length() ) {
resp = true;
} else if ( s.charAt(i) >= '0' && s.charAt(i) <= '9' ) {
resp = Real(s, i + 1);
} else {
resp = false;
}
return resp;
}
public static boolean isReal(String s) {
return Real(s, 0);
}
But obviously it works only for whole numbers. Can anybody give me a tip on how to do this?
P.S.: I can only use the s.charAt(int) and length() Java functions.

You could try doing something like this. Added recursive solution as well.
public static void main(String[] args) {
System.out.println(isReal("123.12"));
}
public static boolean isReal(String string) {
boolean delimiterMatched = false;
char delimiter = '.';
for (int i = 0; i < string.length(); i++) {
char c = string.charAt(i);
if (!(c >= '0' && c <= '9' || c == delimiter)) {
// contains a character that is neither a digit nor the delimiter
return false;
}
if (c == delimiter) {
// delimiter matched twice
if (delimiterMatched) {
return false;
}
delimiterMatched = true;
}
}
// if matched delimiter once return true
return delimiterMatched;
}
Recursive solution
public static boolean isRealRecursive(String string) {
return isRealRecursive(string, 0, false);
}
private static boolean isRealRecursive(String string, int position, boolean delimiterMatched) {
char delimiter = '.';
if (position == string.length()) {
return delimiterMatched;
}
char c = string.charAt(position);
if (!(c >= '0' && c <= '9' || c == delimiter)) {
// contains a character that is neither a digit nor the delimiter
return false;
}
if (c == delimiter) {
// delimiter matched twice
if (delimiterMatched) {
return false;
}
delimiterMatched = true;
}
return isRealRecursive(string, position+1, delimiterMatched);
}

You need to use a regex. The regex to verify whether a string holds a floating-point number is:
^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$
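As a minimal sketch, the pattern above could be used from Java like this. The class name RealRegex is made up for the example, and note that this relies on java.util.regex, so it does not satisfy the asker's restriction to charAt and length.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex quoted above.
class RealRegex {
    // Compiled once; backslashes are doubled because this is a Java string literal.
    private static final Pattern REAL =
            Pattern.compile("^[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?$");

    static boolean isReal(String s) {
        // matches() requires the whole input to match the pattern.
        return s != null && REAL.matcher(s).matches();
    }
}
```

With this pattern a trailing dot such as "12." is rejected, while a leading dot such as ".5" is accepted.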

Can anybody give me a tip on how to do this?
Starting with your existing recursive matcher for whole numbers, modify it and use it in another method to match the whole numbers in:
["+"|"-"]<whole-number>["."[<whole-number>]]
Hint: you will most likely need to change the existing method to return the index of that last character matched rather than just true / false. Think of the best way to encode "no match" in an integer result.
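One possible reading of that hint, using only charAt and length, is sketched below. The names RealParser and matchWhole are hypothetical, and as a small variation the helper returns the index just past the last digit rather than the index of the last digit itself, with -1 encoding "no match".

```java
// Hypothetical sketch of the grammar ["+"|"-"]<whole-number>["."[<whole-number>]].
class RealParser {
    // Returns the index just past the run of digits starting at i,
    // or -1 if there is no digit at position i ("no match").
    static int matchWhole(String s, int i) {
        if (i >= s.length() || s.charAt(i) < '0' || s.charAt(i) > '9') {
            return -1;
        }
        int rest = matchWhole(s, i + 1);
        return rest == -1 ? i + 1 : rest;
    }

    static boolean isReal(String s) {
        int i = 0;
        // optional sign
        if (i < s.length() && (s.charAt(i) == '+' || s.charAt(i) == '-')) {
            i++;
        }
        // mandatory whole-number part
        i = matchWhole(s, i);
        if (i == -1) {
            return false;
        }
        // optional "." followed by an optional fractional part
        if (i < s.length() && s.charAt(i) == '.') {
            i++;
            int frac = matchWhole(s, i);
            if (frac != -1) {
                i = frac;
            }
        }
        // valid only if the whole string was consumed
        return i == s.length();
    }
}
```

Because the grammar makes the fractional digits optional, "12." is accepted here, while ".5" is not.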

public static boolean isReal(String str) {
boolean real = true;
boolean sawDot = false;
char c;
for(int i = str.length() - 1; 0 <= i && real; i --) {
c = str.charAt(i);
if('-' == c || '+' == c) {
if(0 != i) {
real = false;
}
} else if('.' == c) {
if(!sawDot)
sawDot = true;
else
real = false;
} else {
if('0' > c || '9' < c)
real = false;
}
}
return real;
}

Analyze a string of characters in java for format A^nB^n

I have a string of characters with As and Bs that I need to analyze for the language A^nB^n. The following code works most of the time, but when there is a letter that is not an "A" or "B" it may still return true. For example, AABACABAA should not be true, but it says it is. AABB is true; AABBAABB is not true. I have to use stacks and am not allowed to use counting.
public static boolean isL2(String line){
// set up empty stacks
Stack L2Stack = new Stack();
// initialize loop counter
int i = 0;
int n = line.length();
/* Push all 'A's to a_stack */
while ((i < line.length()) && (line.charAt(i) == 'A')) {
char ch = line.charAt(i);
L2Stack.push(ch);
i++;
}
/* Pop an 'A' for each consecutive 'B' */
while ((i < line.length()) && (line.charAt(i) == 'B')) {
if (!L2Stack.empty()){
L2Stack.pop();
i++;
}
else
return false;
}
if (i == n && !L2Stack.empty()){
return false; // more As than Bs
}
if (i != n && L2Stack.empty()){
return false; //more Bs than As
}else
return true;
}

if (i != n && L2Stack.empty()) {
return false; //more Bs than As
}
Should be
if (i != n) {
return false;
}
Since if you haven't finished reading all the characters, you can't return true, regardless of whether or not the stack is empty.
I'm assuming that AAABBBA should return false.
That change would also handle illegal characters.
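Putting that change into the asker's method gives something like the sketch below. The class name L2Checker is made up, and the two final checks are folded into one return, which is an editorial simplification rather than part of the answer.

```java
import java.util.Stack;

// Hypothetical wrapper class for the corrected method.
class L2Checker {
    static boolean isL2(String line) {
        Stack<Character> stack = new Stack<>();
        int i = 0;
        int n = line.length();
        // Push every leading 'A'.
        while (i < n && line.charAt(i) == 'A') {
            stack.push('A');
            i++;
        }
        // Pop one 'A' for each consecutive 'B'.
        while (i < n && line.charAt(i) == 'B') {
            if (stack.empty()) {
                return false; // more Bs than As
            }
            stack.pop();
            i++;
        }
        // Accept only if every character was consumed and every 'A' was matched.
        return i == n && stack.empty();
    }
}
```

Under this version the empty string is accepted (n = 0); add an explicit check if that should be rejected.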

String equation error checker not working

I have a method that checks whether a written equation is correct.
This method checks for:
Multiple parentheses
Excess operators
Double digits
q's
and any character in the string that is not one of these:
private static final String operators = "-+/*%_";
private static final String operands = "0123456789x";
It was working fine, but then I added modulo (%) to the operators. Now, whenever my code reaches the part of the method that checks to the left and right of an operand to see whether it is at the beginning or the end of the string, I get this error:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 3
My method and all its helper methods:
private static final String operators = "-+/*%_";
private static final String operands = "0123456789x";
public Boolean errorChecker(String infixExpr)
{
char[] chars = infixExpr.toCharArray();
StringBuilder out = new StringBuilder();
for (int i = 0; i<chars.length; i++)
{
System.out.print(infixExpr.charAt(i));
if (isOperator(infixExpr.charAt(i)))
{
if (i == 0 || i == infixExpr.length())
{
out.append(infixExpr.charAt(i));
}
else if (isOperator(infixExpr.charAt(i + 1)) && isOperator(infixExpr.charAt(i - 1)))
{
System.out.println("To many Operators.");
return false;
}
else if (isOperator(infixExpr.charAt(i + 1)))
{
if (infixExpr.charAt(i) != '-' || infixExpr.charAt(i + 1) != '-')
{
System.out.println("To many Operators.");
return false;
}
}
else if (isOperator(infixExpr.charAt(i - 1)))
{
if (infixExpr.charAt(i) != '-' || infixExpr.charAt(i - 1) != '-')
{
System.out.println("To many Operators.");
return false;
}
}
}
else if (isOperand(infixExpr.charAt(i)))
{
if (i == 0 || i == infixExpr.length())
{
out.append(infixExpr.charAt(i));
}//THE LINE RIGHT BELOW THIS COMMENT THROWS THE ERROR!!!!!
else if (isOperand(infixExpr.charAt(i + 1)) || isOperand(infixExpr.charAt(i - 1)))
{
System.out.println("Double digits and Postfix form are not accepted.");
return false;
}
}
else if (infixExpr.charAt(i) == 'q')
{
System.out.println("Your meow is now false. Good-bye.");
System.exit(1);
}
else if(infixExpr.charAt(i) == '(' || infixExpr.charAt(i) == ')')
{
int p1 = 0;
int p2 = 0;
for (int p = 0; p<chars.length; p++)
{
if(infixExpr.charAt(p) == '(')
{
p1++;
}
if(infixExpr.charAt(p) == ')')
{
p2++;
}
}
if(p1 != p2)
{
System.out.println("To many parentheses.");
return false;
}
}
else
{
System.out.println("You have entered an invalid character.");
return false;
}
out.append(infixExpr.charAt(i));
}
return true;
}
private boolean isOperator(char val)
{
return operators.indexOf(val) >= 0;
}
private boolean isOperand(char val)
{
return operands.indexOf(val) >= 0;
}
My main portion that runs the method:
Boolean meow = true;
while(meow)
{
System.out.print("Enter infix expression: ");
infixExpr = scan.next();//THE LINE RIGHT BELOW THIS COMMENT THROWS THE ERROR!!!!!
if(makePostfix.errorChecker(infixExpr) == true)
{
System.out.println("Converted expressions: "
+ makePostfix.convert2Postfix(infixExpr));
meow = false;
}
}
It was working fine before, but now it won't even pass 1+2, which was previously working, and I changed NONE of what you see. What's wrong!?!?

It looks like the problem is that you check the character at index (i + 1) several times in your code. Let's say you input a string that is five characters long. The program goes through and reaches the line:
else if (isOperator(infixExpr.charAt(i + 1)) && isOperator(infixExpr.charAt(i - 1)))
If i == 4, this will cause the code:
infixExpr.charAt(i + 1)
to throw an index error.
In essence, you're checking for a character at index five (the sixth character) in a string that is five characters long and whose maximum index is four. Also, your check for
if (i == 0 || i == infixExpr.length())
won't work as is, because i never reaches infixExpr.length() inside the loop. Maybe check for (i == infixExpr.length() - 1) instead.
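One way to avoid the out-of-range access altogether is to route every neighbor lookup through a bounds-checked helper. The sketch below assumes the same operators string as the question; the names NeighborCheck, isOperatorAt and hasOperatorNeighbor are hypothetical.

```java
// Hypothetical helpers that never call charAt with an out-of-range index.
class NeighborCheck {
    private static final String OPERATORS = "-+/*%_";

    static boolean isOperator(char c) {
        return OPERATORS.indexOf(c) >= 0;
    }

    // True only if index i exists in s and holds an operator.
    static boolean isOperatorAt(String s, int i) {
        return i >= 0 && i < s.length() && isOperator(s.charAt(i));
    }

    // True if the character before or after index i is an operator.
    static boolean hasOperatorNeighbor(String s, int i) {
        return isOperatorAt(s, i - 1) || isOperatorAt(s, i + 1);
    }
}
```

With helpers like these, the first and last characters need no special case, because a missing neighbor simply counts as "not an operator".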

sorting collections with string elements in java

I have a collection, and I wrote code to sort its values using my own comparator.
My comparator code is:
private static class MatchComparator implements Comparator<xmlparse> {
@Override
public int compare(xmlparse object1, xmlparse object2) {
String match1 = object1.getMatchId();
String match2 = object2.getMatchId();
return match1.compareTo(match2);
}
}
I call Collections.sort(list, new MatchComparator());
Everything runs fine, but my problem is that the sorted list is in the wrong order when I print it.
Input for list
Match19
Match7
Match12
Match46
Match32
output from the sorted list
Match12
Match19
Match32
Match46
Match7
my expected output is
Match7
Match12
Match19
Match32
Match46

To get the order you need, you could either prefix the one-digit numbers with a zero (e.g. Match07), or split the string into a prefix and a numeric part and implement the sorting as a numeric comparison.
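The second option could look roughly like this sketch. It compares plain strings to stay self-contained; in the asker's comparator the two strings would come from getMatchId(). The names MatchIdOrder, numberStart and BY_PREFIX_THEN_NUMBER are hypothetical, and the code assumes every id ends in a run of digits small enough to fit in a long.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: order ids by text prefix first, then by trailing number.
class MatchIdOrder {
    // Index of the first character of the trailing run of digits.
    static int numberStart(String id) {
        int i = id.length();
        while (i > 0 && Character.isDigit(id.charAt(i - 1))) {
            i--;
        }
        return i;
    }

    static final Comparator<String> BY_PREFIX_THEN_NUMBER = (a, b) -> {
        int sa = numberStart(a);
        int sb = numberStart(b);
        int byPrefix = a.substring(0, sa).compareTo(b.substring(0, sb));
        if (byPrefix != 0) {
            return byPrefix;
        }
        // Assumption: both ids end in digits; parseLong throws otherwise.
        return Long.compare(Long.parseLong(a.substring(sa)),
                            Long.parseLong(b.substring(sb)));
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sorted(List<String> ids) {
        List<String> copy = new ArrayList<>(ids);
        copy.sort(BY_PREFIX_THEN_NUMBER);
        return copy;
    }
}
```

Long.compare is used instead of subtracting the two numbers so that the result cannot overflow.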

The problem is that String.compareTo(..) compares the words char by char.
If all strings start with Match, you can easily fix this with:
public int compare(xmlparse object1, xmlparse object2) {
String match1 = object1.getMatchId();
String match2 = object2.getMatchId();
return Integer.parseInt(match1.replace("Match", ""))
- Integer.parseInt(match2.replace("Match", ""));
}
If they don't all start with Match, you can strip the non-digit characters with a regex:
Integer.parseInt(match1.replaceAll("[a-zA-Z]+", ""));
Or
Integer.parseInt(match1.replaceAll("[\\p{Alpha}\\p{Punct}]+", ""));
And a final note: name your classes in UpperCamelCase, i.e. XmlParse instead of xmlparse. That's what the convention dictates.

Implement a function and use it for comparison:
instead of
return match1.compareTo(match2);
use
return compareNatural(match1, match2, true);
Here is a function which does a natural comparison on strings:
private static int compareNatural(String s, String t, boolean caseSensitive) {
int sIndex = 0;
int tIndex = 0;
int sLength = s.length();
int tLength = t.length();
while (true) {
// both character indices are after a subword (or at zero)
// Check if one string is at end
if (sIndex == sLength && tIndex == tLength) {
return 0;
}
if (sIndex == sLength) {
return -1;
}
if (tIndex == tLength) {
return 1;
}
// Compare sub word
char sChar = s.charAt(sIndex);
char tChar = t.charAt(tIndex);
boolean sCharIsDigit = Character.isDigit(sChar);
boolean tCharIsDigit = Character.isDigit(tChar);
if (sCharIsDigit && tCharIsDigit) {
// Compare numbers
// skip leading 0s
int sLeadingZeroCount = 0;
while (sChar == '0') {
++sLeadingZeroCount;
++sIndex;
if (sIndex == sLength) {
break;
}
sChar = s.charAt(sIndex);
}
int tLeadingZeroCount = 0;
while (tChar == '0') {
++tLeadingZeroCount;
++tIndex;
if (tIndex == tLength) {
break;
}
tChar = t.charAt(tIndex);
}
boolean sAllZero = sIndex == sLength || !Character.isDigit(sChar);
boolean tAllZero = tIndex == tLength || !Character.isDigit(tChar);
if (sAllZero && tAllZero) {
continue;
}
if (sAllZero && !tAllZero) {
return -1;
}
if (tAllZero) {
return 1;
}
int diff = 0;
do {
if (diff == 0) {
diff = sChar - tChar;
}
++sIndex;
++tIndex;
if (sIndex == sLength && tIndex == tLength) {
return diff != 0 ? diff : sLeadingZeroCount - tLeadingZeroCount;
}
if (sIndex == sLength) {
if (diff == 0) {
return -1;
}
return Character.isDigit(t.charAt(tIndex)) ? -1 : diff;
}
if (tIndex == tLength) {
if (diff == 0) {
return 1;
}
return Character.isDigit(s.charAt(sIndex)) ? 1 : diff;
}
sChar = s.charAt(sIndex);
tChar = t.charAt(tIndex);
sCharIsDigit = Character.isDigit(sChar);
tCharIsDigit = Character.isDigit(tChar);
if (!sCharIsDigit && !tCharIsDigit) {
// both number sub words have the same length
if (diff != 0) {
return diff;
}
break;
}
if (!sCharIsDigit) {
return -1;
}
if (!tCharIsDigit) {
return 1;
}
} while (true);
} else {
// Compare words
// No collator specified. All characters should be ascii only. Compare character-by-character.
do {
if (sChar != tChar) {
if (caseSensitive) {
return sChar - tChar;
}
sChar = Character.toUpperCase(sChar);
tChar = Character.toUpperCase(tChar);
if (sChar != tChar) {
sChar = Character.toLowerCase(sChar);
tChar = Character.toLowerCase(tChar);
if (sChar != tChar) {
return sChar - tChar;
}
}
}
++sIndex;
++tIndex;
if (sIndex == sLength && tIndex == tLength) {
return 0;
}
if (sIndex == sLength) {
return -1;
}
if (tIndex == tLength) {
return 1;
}
sChar = s.charAt(sIndex);
tChar = t.charAt(tIndex);
sCharIsDigit = Character.isDigit(sChar);
tCharIsDigit = Character.isDigit(tChar);
} while (!sCharIsDigit && !tCharIsDigit);
}
}
}
a better one is here

The comparison is lexicographic and not numerical; that's your problem. In lexicographic ordering, 10 comes before 9.
See this question for open source implementation solutions. You can also implement your own string comparison, which shouldn't be that hard.

It seems that String.compareTo() does not do what you expect. It performs a so-called lexicographical comparison, but you want to compare by number. You need to modify the code of your comparator.
@Override
public int compare(xmlparse object1, xmlparse object2) {
String match1 = object1.getMatchId();
String match2 = object2.getMatchId();
Long n1 = getNumber(match1);
Long n2 = getNumber(match2);
return n1.compareTo(n2);
}
where getNumber() extracts the trailing number from a string of the form "matchXX".
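The answer does not show getNumber(), so the following is only one way it might be written. The class name MatchNumber and the decision to throw on ids without a trailing number are assumptions.

```java
// Hypothetical implementation of the getNumber() helper mentioned above.
class MatchNumber {
    static Long getNumber(String matchId) {
        // Walk backwards over the trailing digits.
        int start = matchId.length();
        while (start > 0 && Character.isDigit(matchId.charAt(start - 1))) {
            start--;
        }
        if (start == matchId.length()) {
            // Assumption: an id without a trailing number is treated as an error.
            throw new IllegalArgumentException("No trailing number in: " + matchId);
        }
        return Long.parseLong(matchId.substring(start));
    }
}
```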

We Keep Coding
