Using Regex to analyse Strings in Java

I created a simple quiz program and I am trying to figure out a method that returns three types of answers using regex. The three types would be fully correct, correct (but with a spelling error), and partially correct but still accepted as correct.
For example, compared against the String "Elephants", these three strings should all be accepted by the method: 1. "Elephants", 2. "Elephents", 3. "Elephant".
The 1st string is fully correct, so it would return "Correct Answer".
The 2nd string is correct but has a spelling error ('e' instead of 'a'), so it would return "Correct although spelled Elephants".
The 3rd string is partially correct (no 's' at the end), but would return "Answer accepted".
Could anyone figure out the three regex expressions I could use for this method?
Thanks, much appreciated.

There is no regex solution for this, but you can implement a "distance algorithm" to measure the relative similarity of two words. One very common algorithm for this is Levenshtein distance, or edit distance: it tells you how many "editing actions" it would take to go from the answer the user typed in to the correctly spelled answer. Replacing, inserting, or deleting a symbol counts as one action. If the distance is two or less, the answer the user typed in is likely just a spelling error; if the distance is three or more, it's either a very poorly spelled answer or an incorrect answer (both should be counted as incorrect).
The Wikipedia article on Levenshtein distance has a pseudocode implementation of the algorithm.
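As a rough illustration of the idea above, here is a minimal sketch of the standard dynamic-programming Levenshtein distance in Java. The class name EditDistance, the method names, and the grade helper with its "Incorrect" label are assumptions made for this example; only the thresholds (two or less counts as a spelling error) come from the answer.

```java
class EditDistance {
    // Levenshtein distance computed row by row, keeping only two rows in memory.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical grading rule: distance 0 is correct, 1 or 2 is a spelling error.
    static String grade(String given, String expected) {
        int distance = levenshtein(given, expected);
        if (distance == 0) {
            return "Correct Answer";
        }
        if (distance <= 2) {
            return "Correct although spelled " + expected;
        }
        return "Incorrect";
    }
}
```

Note that "Elephents" and "Elephant" are both at distance 1 from "Elephants", so the distance alone does not separate the second and third cases from the question; that would need an extra check, for example on the length of the answer.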

The first regex to match: Elephants
If it doesn't match, try Eleph[ae]nts for the second.
If that doesn't match either, try Elephant.
You could additionally combine them with word-end markers such as \b.
For testing regexes, this site is really cool: http://gskinner.com/RegExr/
With regexes, you have to try to guess the spelling errors.

Fully Correct Regex:
"Elephants"
Correct although spelled "Elephants" Regex:
"[^E]lephants|E[^l]ephants|El[^e]phants|Ele[^p]hants|Elep[^h]ants|Eleph[^a]nts|Elepha[^n]ts|Elephan[^t]s|Elephant[^s]"
Answer accepted Regex:
"lephants|Eephants|Elphants|Elehants|Elepants|Elephnts|Elephats|Elephans|Elephant"
You could write a small program which automatically generates the regexes that validate an answer and reports which case the answer falls into:
Correct
Correct although misspelled
Answer accepted
For instance, assuming that the correct answer is "Elephants", you could write a routine which builds the regex for the second case (Correct although misspelled).
String generateCorrectAlthoughMisspelledAnswerRegex(final String answer) {
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < answer.length(); i++) {
        // Replace the character at position i with "any character except the correct one".
        // Assumes the answer contains no regex metacharacters.
        String misspelled = answer.substring(0, i) + "[^" + answer.charAt(i) + "]"
                + answer.substring(i + 1);
        builder.append(misspelled);
        // Separate the alternatives with "|", except after the last one.
        if (i < answer.length() - 1) { builder.append("|"); }
    }
    return builder.toString();
}
For example, calling the function with the argument "Elephants"
generateCorrectAlthoughMisspelledAnswerRegex("Elephants")
generates the regex that tests for the second case:
"[^E]lephants|E[^l]ephants|El[^e]phants|Ele[^p]hants|Elep[^h]ants|Eleph[^a]nts|Elepha[^n]ts|Elephan[^t]s|Elephant[^s]"
You can do the same for the other cases.
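One possible way to "do the same for the other cases" is sketched below. It generates a one-wrong-character regex and a one-missing-character regex, then tries them in order. The class AnswerRegexes, its method names, and the "Incorrect" fallback are hypothetical. Unlike the hand-written regexes above, this sketch uses Pattern.quote and a negative lookahead instead of a negated character class, so that answers containing regex metacharacters are handled as literals.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

class AnswerRegexes {
    // One alternative per position: the character there must differ from the correct one.
    static String substitutionRegex(String answer) {
        StringJoiner alternatives = new StringJoiner("|");
        for (int i = 0; i < answer.length(); i++) {
            String prefix = Pattern.quote(answer.substring(0, i));
            String correctChar = Pattern.quote(String.valueOf(answer.charAt(i)));
            String suffix = Pattern.quote(answer.substring(i + 1));
            // (?!x). means "any single character that is not x"
            alternatives.add(prefix + "(?!" + correctChar + ")." + suffix);
        }
        return alternatives.toString();
    }

    // One alternative per position: the answer with that single character removed.
    static String deletionRegex(String answer) {
        StringJoiner alternatives = new StringJoiner("|");
        for (int i = 0; i < answer.length(); i++) {
            alternatives.add(Pattern.quote(answer.substring(0, i) + answer.substring(i + 1)));
        }
        return alternatives.toString();
    }

    // Tries the three cases in order of strictness.
    static String classify(String given, String answer) {
        if (given.equals(answer)) {
            return "Correct Answer";
        }
        if (given.matches(substitutionRegex(answer))) {
            return "Correct although spelled " + answer;
        }
        if (given.matches(deletionRegex(answer))) {
            return "Answer accepted";
        }
        return "Incorrect";
    }
}
```

With the question's examples, classify("Elephents", "Elephants") falls into the second case and classify("Elephant", "Elephants") into the third. Answers with two or more errors are rejected by this sketch.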

Related

Searching for a word in a String using parallelism with Java Fork/Join

Let's say I want to count the occurrences of a word in a string in parallel.
For example, say we have the string "Hello i am bob and my name is bob" and the word "bob".
The function needs to return 2.
Achieving this sequentially is pretty easy. We just need a for loop that goes over the string and counts every time our word matches a word in the string.
I am trying to solve this using parallelism. I thought about splitting the string on every white space and passing each word to a thread, which then checks whether it matches the searched word. However, looking for white spaces in the string is still done sequentially, so parallelism cannot be beneficial here.
Is there any other way to achieve this?

This is not a problem for Fork/Join, since it is not a recursive task. The Stream API is the way to go here:
// requires: import java.util.Arrays;
String str = "Hello i am bob and my name is bob";
long count = Arrays.stream(str.split("\\s+"))  // split on runs of whitespace
        .parallel()                            // process the words in parallel
        .filter(s -> s.equals("bob"))
        .count();
System.out.println("Bob appeared " + count + " times");

You can check str.indexOf("bob") != str.lastIndexOf("bob"). If they are not equal, there are at least two occurrences. You can then remove the first bob and check again: if indexOf and lastIndexOf still differ, remove the first one again and keep searching until you are done. I'm sure there is a way to make this better.
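The repeated search described above can be written without removing anything from the string, by passing a start position to indexOf. This is a minimal sequential sketch; the class and method names are made up for illustration. It counts non-overlapping substring occurrences, so unlike the split-based approach it would also count the "bob" inside "bobby".

```java
class OccurrenceCounter {
    // Counts non-overlapping occurrences of word inside text.
    static int count(String text, String word) {
        if (word.isEmpty()) {
            return 0; // an empty word would otherwise match at every position
        }
        int count = 0;
        int from = 0;
        // indexOf(word, from) returns -1 once there are no more occurrences.
        while ((from = text.indexOf(word, from)) != -1) {
            count++;
            from += word.length(); // continue searching after this occurrence
        }
        return count;
    }
}
```

For the string from the question, OccurrenceCounter.count("Hello i am bob and my name is bob", "bob") returns 2.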

Java Regex First Name Validation

I understand that validating a first name field is highly controversial because there are so many different possibilities. However, I am just learning regex, and to help grasp the concept I have designed some simple validations, just to make sure I can make the code do exactly what I want, regardless of whether it conforms to best business logic practices.
I am trying to validate a few things.
The first name is between 1 and 25 characters.
The first name can only start with an a-z (ignore case) character.
After that the first name can contain a-z (ignore case) and [ '-,.].
The first name can only end with an a-z (ignore case) character.
public static boolean firstNameValidation(String name){
    boolean valid = name.matches("(?i)(^[a-z]+)[a-z .,-]((?! .,-)$){1,25}$");
    System.out.println("Name: " + name + "\nValid: " + valid);
    return valid;
}

Try this regex
^[^- '](?=(?![A-Z]?[A-Z]))(?=(?![a-z]+[A-Z]))(?=(?!.*[A-Z][A-Z]))(?=(?!.*[- '][- '.]))(?=(?!.*[.][-'.]))[A-Za-z- '.]{2,}$

Your expression is almost correct. The following is a modification that satisfies all of the conditions:
valid = name.matches("(?i)(^[a-z])((?![ .,'-]$)[a-z .,'-]){0,24}$");

A regex for the same:
([a-zA-Z]{1}[a-zA-Z_',.-]{0,23}[a-zA-Z]{0,1})

Let's change the order of the requirements:
ignore case: "(?i)"
can only start with an a-z character: "(?i)[a-z]"
can only end with an a-z character: "(?i)[a-z](.*[a-z])?"
is between 1 and 25 characters: "(?i)[a-z](.{0,23}[a-z])?"
can contain a-z and [ '-,.]: "(?i)[a-z]([- ',.a-z]{0,23}[a-z])?"
The last one should do the job:
valid = name.matches("(?i)[a-z]([- ',.a-z]{0,23}[a-z])?");
Test on RegexPlanet (press the Java button).
Notes on the points above:
could have used "[a-zA-Z]" instead of "(?i)"
the ? is needed since we want to allow one-character names
23 is the total length minus the first and the last character (25-1-1)
the - must come first (or last) inside [], otherwise it is interpreted as a range separator (assuming you didn't mean the characters between ' and ,)
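To see how the final expression behaves at the boundaries, here is a small sketch that precompiles it and tries a few names. The class name FirstNameCheck, the method isValid, and the sample names are illustrative assumptions; the regex itself is the one derived above.

```java
import java.util.regex.Pattern;

class FirstNameCheck {
    // First character a letter, then optionally up to 23 allowed characters plus a final letter.
    private static final Pattern FIRST_NAME =
            Pattern.compile("(?i)[a-z]([- ',.a-z]{0,23}[a-z])?");

    static boolean isValid(String name) {
        // matches() requires the whole input to match, so no ^ or $ is needed.
        return FIRST_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"A", "Mary-Ann", "O'Neil", "-Bob", "Bob.", ""};
        for (String sample : samples) {
            System.out.println("\"" + sample + "\" valid: " + isValid(sample));
        }
    }
}
```

Names that start or end with punctuation, the empty string, and anything longer than 25 characters are rejected.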

Try this simplest version:
^[a-zA-Z][a-zA-Z][-',.][a-zA-Z]{1,25}$
Thanks for sharing.

A Unicode-compatible version of @ProPhoto's answer:
^[^- '](?=(?!\p{Lu}?\p{Lu}))(?=(?!\p{Ll}+\p{Lu}))(?=(?!.*\p{Lu}\p{Lu}))(?=(?!.*[- '][- '.]))(?=(?!.*[.][-'.]))(\p{L}|[- '.]){2,}$

Java if else on partial match

I've moved to Selenium WebDriver, and I keep finding the most confusing examples.
I need to be able to read a string (that part works) and run a conditional that checks whether specific text is present.
For the sake of this example:
String oneoff = "Jeff is old";
I need to match on Jeff (see the code below): as long as Jeff exists in the string, I want to return true. If Jeff doesn't exist, I will then check for, say, 50-75 other names. However, the string may contain the name plus additional text that cannot be controlled, so I have to do a partial match.
Question 1. Am I stuck having to build each regex expression in that crazy format that I have been seeing, or am I missing something obvious?
Question 2. Will someone, for my sanity, please show me the proper way to match on Jeff, with the possibility of text before and after the name Jeff?
Thank you!
String oneoff = driver.findElement(By.id("id_one_off_byline"))
.getAttribute("value");
System.out.println("One Off is:" + oneoff);
if (oneoff.matches("Jeff")) {
System.out.println("It is Jeff");
} else {
System.out.println("it is not jeff");
}
This is just the functional part of the code.

"as long as Jeff exists in the string, I want to return true"
Then you should probably test it with
if (oneoff.contains("Jeff"))
since matches takes a regex that must match the entire string, so if (oneoff.matches("Jeff")) would return true only if oneoff is exactly "Jeff".

You do not need matches() for the code you have supplied. Use oneoff.equals("String") for an exact string comparison; matches() is meant for regular expressions. Use oneoff.contains("String") if you want true whenever the string occurs anywhere inside the target string.
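For the 50-75 names mentioned in the question, one option is to keep the names in a list and test each with contains. The sketch below also shows a regex equivalent using find(), which looks for a match anywhere in the text rather than requiring the whole string to match. The class BylineMatcher, its method names, and the sample names are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

class BylineMatcher {
    // Returns the first name from the list that occurs anywhere in the text.
    static Optional<String> findName(String text, List<String> names) {
        for (String name : names) {
            if (text.contains(name)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    // Regex variant: find() succeeds on a partial match, unlike String.matches().
    static boolean containsViaRegex(String text, String name) {
        return Pattern.compile(Pattern.quote(name)).matcher(text).find();
    }
}
```

Both checks are case-sensitive. A call such as findName(oneoff, names) with a list holding all the candidate names replaces a long chain of if/else branches.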

if (oneoff.contains("Jeff")) {
System.out.println("It is Jeff");
} else if (!oneoff.contains("Jeff")) {
System.out.println("it is not jeff");
}
I think you should change your code to look like this. Keep in mind that contains is case-sensitive and exact, so a string containing a variant such as "JEef", "JEEF" or "Jeef " will not be recognized as "Jeff".
I hope it works. I ran into the same bug as you and this is how I got around it.

Regexp for a string to contain only letters, numbers and space in Java

Requirement: String should contain only letters, numbers and space.
I have to pass a clean name to another API.
Implementation: Java
I came up with this for my requirement
public static String getCleanFilename(String filename) {
if (filename == null) {
return null;
}
return filename.replaceAll("[^A-Za-z0-9 ]","");
}
This works well for a few of my test cases, but I want to know whether I am missing any boundary conditions, or whether there is a better (faster) way to do it.

In addition to the comments: I don't think performance is an issue in a scenario where user input is taken (and a filename shouldn't be that long anyway).
But concerning your question: you can reduce the number of replacements by adding a + to your regex, so that each run of disallowed characters is replaced in one go:
[^A-Za-z0-9 ]+
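A minimal sketch of the same method with the + quantifier, assuming the pattern is also compiled once and reused (String.replaceAll compiles its regex on every call). The class name FilenameCleaner is an assumption for this example; the method name comes from the question.

```java
import java.util.regex.Pattern;

class FilenameCleaner {
    // Compiled once; matches runs of anything that is not a letter, digit or space.
    private static final Pattern DISALLOWED = Pattern.compile("[^A-Za-z0-9 ]+");

    static String getCleanFilename(String filename) {
        if (filename == null) {
            return null;
        }
        return DISALLOWED.matcher(filename).replaceAll("");
    }
}
```

The result is the same as with the original regex; only the number of individual replacements changes.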

To answer your direct question: \t fails your method, because only the literal space character counts as "space". Switch to \s ([...\s]) and you're good.
At any rate, your design is probably flawed. Instead of silently altering user input, let the user know what you don't allow and have them make the correction manually.
EDIT:
If the filename doesn't matter, take the SHA-2 hash of the file name and use that. A hex-encoded hash contains only letters and digits, so it is guaranteed to meet your requirements.
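A minimal sketch of the hashing idea, assuming SHA-256 as the SHA-2 variant and lowercase hex as the encoding; the class and method names are made up for illustration. The output is always 64 characters drawn from 0-9 and a-f.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashedName {
    // Returns the SHA-256 digest of the filename as a 64-character lowercase hex string.
    static String sha256Hex(String filename) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(filename.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The original name cannot be recovered from the hash, so this only fits when the receiving API does not need a human-readable name.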

codingBat plusOut using regex

This is similar to my previous efforts (wordEnds and repeatEnd): as a mental exercise, I want to solve this toy problem using regex only.
Description from codingbat.com:
Given a string and a non-empty word string, return a version of the original string where all chars have been replaced by pluses ("+"), except for appearances of the word string which are preserved unchanged.
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
There is no mention whether or not to allow overlap (e.g. what is plusOut("+xAxAx+", "xAx")?), but my non-regex solution doesn't handle overlap and it passes, so I guess we can assume non-overlapping occurrences of word if it makes it simpler (bonus points if you provide solutions for both variants!).
In any case, I'd like to solve this using regex (of the same style that I did before with the other two problems), but I'm absolutely stumped. I don't even have anything to show, because I have nothing that works.
So let's see what the stackoverflow community comes up with.

This passes all their tests:
public String plusOut(String str, String word) {
return str.replaceAll(
String.format("(?<!(?=\\Q%s\\E).{0,%d}).", word, word.length()-1),
"+"
);
}
Also, I get:
plusOut("1xAxAx2", "xAx") → "+xAxAx+"
If that's the result you were looking for then I pass your overlap test as well, but I have to admit, that one's by accident. :D

This is provided here just for reference. This is essentially Alan's solution, but using replace instead of String.format.
public String plusOut(String str, String word) {
return str.replaceAll(
"(?<!(?=word).{0,M})."
.replace("word", java.util.regex.Pattern.quote(word))
.replace("M", String.valueOf(word.length()-1)),
"+"
);
}
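To make the lookbehind approach easier to follow, here is a self-contained sketch that builds the same regex by plain concatenation and runs it on the examples from the problem statement. The class name PlusOutDemo is an assumption. The idea: a character is replaced unless, somewhere between 0 and word.length()-1 positions before it, the word starts.

```java
import java.util.regex.Pattern;

class PlusOutDemo {
    static String plusOut(String str, String word) {
        // (?<! ... ) fails the match if, looking back at most word.length()-1 characters,
        // there is a position where (?=word) succeeds, i.e. the current character
        // lies inside an occurrence of word.
        String regex = "(?<!(?=" + Pattern.quote(word) + ").{0," + (word.length() - 1) + "}).";
        return str.replaceAll(regex, "+");
    }

    public static void main(String[] args) {
        System.out.println(plusOut("12xy34", "xy"));
        System.out.println(plusOut("12xy34", "1"));
        System.out.println(plusOut("12xy34xyabcxy", "xy"));
    }
}
```

Because every character covered by any occurrence is kept, overlapping occurrences such as "xAx" in "1xAxAx2" are preserved as a whole.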

An extremely simple solution, using \G:
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
However, there is a caveat. Calling plusOut("12xxxxx34", "xxx") with the implementation above will return ++xxx++++.
Then again, the problem statement is not clear about the behavior in such a case to begin with. There is not even a test case for this situation (my program passed all test cases).
The regex is basically the same as the looping solution (which also passes all test cases):
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
if (!str.startsWith(word, i))
out.setCharAt(i++, '+');
else
i += word.length();
}
return out.toString();
It repeatedly skips occurrences of the word, and otherwise replaces the current character when no occurrence of the word starts at that position.

I think you could leverage a negated range to do this. As this is just a hint, it's not tested though!
Turn your "xy" into a regexp like this: "[^xy]"
...and then wrap that into a regexp which replaces strings matched by that expression with "+".
