codingBat plusOut using regex - java

This is similar to my previous efforts (wordEnds and repeatEnd): as a mental exercise, I want to solve this toy problem using regex only.
Description from codingbat.com:
Given a string and a non-empty word string, return a version of the original string where all chars have been replaced by pluses ("+"), except for appearances of the word string which are preserved unchanged.
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
There is no mention whether or not to allow overlap (e.g. what is plusOut("+xAxAx+", "xAx")?), but my non-regex solution doesn't handle overlap and it passes, so I guess we can assume non-overlapping occurrences of word if it makes it simpler (bonus points if you provide solutions for both variants!).
In any case, I'd like to solve this using regex (of the same style that I did before with the other two problems), but I'm absolutely stumped. I don't even have anything to show, because I have nothing that works.
So let's see what the stackoverflow community comes up with.

This passes all their tests:
public String plusOut(String str, String word) {
    return str.replaceAll(
        String.format("(?<!(?=\\Q%s\\E).{0,%d}).", word, word.length()-1),
        "+"
    );
}
Also, I get:
plusOut("1xAxAx2", "xAx") → "+xAxAx+"
If that's the result you were looking for then I pass your overlap test as well, but I have to admit, that one's by accident. :D
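For anyone who wants to try this locally, here is a minimal harness around the same pattern. The class name PlusOutDemo is only a placeholder; the inputs are the CodingBat examples from the question plus the overlap case above.

```java
class PlusOutDemo {
    // Same regex as above: a character is replaced only if no occurrence of
    // word starts at its position or up to word.length()-1 characters before it.
    static String plusOut(String str, String word) {
        return str.replaceAll(
            String.format("(?<!(?=\\Q%s\\E).{0,%d}).", word, word.length() - 1),
            "+"
        );
    }

    public static void main(String[] args) {
        System.out.println(plusOut("12xy34", "xy"));        // ++xy++
        System.out.println(plusOut("12xy34", "1"));         // 1+++++
        System.out.println(plusOut("12xy34xyabcxy", "xy")); // ++xy++xy+++xy
        System.out.println(plusOut("1xAxAx2", "xAx"));      // +xAxAx+
    }
}
```

One thing to keep in mind: \Q...\E quoting breaks if word itself contains \E, so this sketch assumes it does not.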

This is provided here just for reference. This is essentially Alan's solution, but using replace instead of String.format.
public String plusOut(String str, String word) {
    return str.replaceAll(
        "(?<!(?=word).{0,M})."
            .replace("word", java.util.regex.Pattern.quote(word))
            .replace("M", String.valueOf(word.length()-1)),
        "+"
    );
}

An extremely simple solution, using \G:
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
However, there is a caveat. Calling plusOut("12xxxxx34", "xxx") with the implementation above will return ++xxx++++.
Anyway, the problem is not clear about the behavior in such a case to begin with. There is not even a test case for such a situation (since my program passed all test cases).
The regex is basically the same as the looping solution (which also passes all test cases):
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
    if (!str.startsWith(word, i))
        out.setCharAt(i++, '+');
    else
        i += word.length();
}
return out.toString();
It repeatedly skips occurrences of the word, then replaces the current character if the word does not start at that position.
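To compare the two variants side by side, here is a rough sketch that wraps both in static methods and runs them on the input from the caveat. The class and method names (PlusOutNoOverlap, withRegex, withLoop) are made up for this illustration.

```java
import java.util.regex.Pattern;

class PlusOutNoOverlap {
    // Regex variant: \G anchors each match where the previous one ended,
    // the possessive group skips whole occurrences of word, and the
    // following single character is replaced by '+'.
    static String withRegex(String str, String word) {
        String quoted = Pattern.quote(word);
        return str.replaceAll("\\G((?:" + quoted + ")*+).", "$1+");
    }

    // Loop variant: skip the word when it starts at i, otherwise replace
    // the character at i.
    static String withLoop(String str, String word) {
        StringBuilder out = new StringBuilder(str);
        for (int i = 0; i < out.length(); ) {
            if (!str.startsWith(word, i))
                out.setCharAt(i++, '+');
            else
                i += word.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(withRegex("12xxxxx34", "xxx"));    // ++xxx++++
        System.out.println(withLoop("12xxxxx34", "xxx"));     // ++xxx++++
        System.out.println(withRegex("12xy34xyabcxy", "xy")); // ++xy++xy+++xy
    }
}
```

Both variants treat occurrences as non-overlapping, which is why only the first xxx survives in the first example.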

I think you could leverage a negated range to do this. As this is just a hint, it's not tested though!
Turn your "xy" into a regexp like this: "[^xy]"
...and then wrap that into a regexp which replaces strings matched by that expression with "+".

Related

Regex to match if string *only* contains *all* characters from a character set, plus an optional one

I ran into a wee problem with Java regex. (I must say in advance, I'm not very experienced in either Java or regex.)
I have a string, and a set of three characters. I want to find out if the string is built from only these characters. Additionally (just to make it even more complicated), two of the characters must be in the string, while the third one is optional.
I do have a solution, my question is rather if anyone can offer anything better/nicer/more elegant, because this makes me cry blood when I look at it...
The set-up
The mandatory characters are: | (pipe) and - (dash).
The string in question should be built from a combination of these. They can be in any order, but both have to be in it.
The optional character is: : (colon).
The string can contain colons, but it does not have to. This is the only other character allowed, apart from the above two.
Any other characters are forbidden.
Expected results
Following strings should work/not work:
"------" = false
"||||" = false
"---|---" = true
"|||-|||" = true
"--|-|--|---|||-" = true
...and...
"----:|--|:::|---::|" = true
":::------:::---:---" = false
"|||:|:::::|" = false
"--:::---|:|---G---n" = false
...etc.
The "ugly" solution
Now, I have a solution that seems to work, based on this stackoverflow answer. The reason I'd like a better one will become obvious when you've recovered from seeing this:
if (string.matches("^[(?\\:)?\\|\\-]*(([\\|\\-][(?:\\:)?])|([(?:\\:)?][\\|\\-]))[(?\\:)?\\|\\-]*$") || string.matches("^[(?\\|)?\\-]*(([\\-][(?:\\|)?])|([(?:\\|)?][\\-]))[(?\\|)?\\-]*$")) {
//do funny stuff with a meaningless string
} else {
//don't do funny stuff with a meaningless string
}
Breaking it down
The first regex
"^[(?\\:)?\\|\\-]*(([\\|\\-][(?:\\:)?])|([(?:\\:)?][\\|\\-]))[(?\\:)?\\|\\-]*$"
checks for all three characters
The next one
"^[(?\\|)?\\-]*(([\\-][(?:\\|)?])|([(?:\\|)?][\\-]))[(?\\|)?\\-]*$"
checks for the two mandatory ones only.
...Yea, I know...
But believe me I tried. Nothing else gave the desired result, but allowed through strings without the mandatory characters, etc.
The question is...
Does anyone know how to do it a simpler / more elegant way?
Bonus question: There is one thing I don't quite get in the regexes above (more than one, but this one bugs me the most):
As far as I understand(?) regular expressions, (?\\|)? should mean that the character | is either contained or not (unless I'm very much mistaken), still in the above setup it seems to enforce that character. This of course suits my purpose, but I cannot understand why it works that way.
So if anyone can explain what I'm missing there, that'd be really great. Besides, I suspect this holds the key to a simpler solution (checking for both mandatory and optional characters in one regex would be ideal).
Thank you all for reading (and suffering) through my question, and even bigger thanks to those who reply. :)
PS
I did try stuff like ^[\\|\\-(?:\\:)?)]$, but that would not enforce all mandatory characters.
Use a lookahead based regex.
^(?=.*\\|)(?=.*-)[-:|]+$
or
^(?=.*\\|)[-:|]*-[-:|]*$
or
^[-:|]*(?:-:*\\||\\|:*-)[-:|]*$
(?=.*\\|) expects at least one pipe.
(?=.*-) expects at least one hyphen.
[-:|]+ any char from the list one or more times.
$ End of the line.
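As a quick sanity check, the first pattern can be run against the sample strings from the question. This is only a sketch; the class name PipeDashCheck and the method isValid are invented for the example.

```java
import java.util.regex.Pattern;

class PipeDashCheck {
    // The two lookaheads require at least one '|' and at least one '-';
    // the character class restricts the whole string to '-', ':' and '|'.
    private static final Pattern VALID =
        Pattern.compile("^(?=.*\\|)(?=.*-)[-:|]+$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "------", "||||", "---|---", "|||-|||", "--|-|--|---|||-",
            "----:|--|:::|---::|", ":::------:::---:---", "|||:|:::::|",
            "--:::---|:|---G---n"
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" = " + isValid(s));
        }
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.matches() would do.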
Here is a simple answer:
(?=.*\|.*-|.*-.*\|)^([-|:]+)$
This says that the string needs to have a '-' followed by '|', or a '|' followed by a '-', via the look-ahead. Then the string only matches the allowed characters.
Demo: http://fiddle.re/1hnu96
Here is one without lookahead or lookbehind.
^(?:[-:|]*\\|[-:|]*-[-:|]*|[-:|]*-[-:|]*\\|[-:|]*)$
This doesn't scale, so Avinash's solution is to be preferred if your regex engine supports lookaround.

How can I use hash sets in java to determine if a string contains valid characters?

I'm writing a lexical analyzer and have never used hash sets. I want to take a string and make sure it's legal. I think I understand how to build the hash set with valid characters but I'm not sure how to compare the string with the hash set to ensure it contains valid characters. I can't find an example anywhere. Can someone point me to code that would do this?
HashSet has the function contains() for this, since it implements the Collection interface.
You cannot compare an entire string to a HashSet<Character>, but you can do it one character at a time:
HashSet<Character> valid = new HashSet<Character>();
valid.add('a');
valid.add('d');
valid.add('f');
boolean allOk = true;
for (char c : "fad".toCharArray()) {
    if (!valid.contains(c)) {
        allOk = false;
        break;
    }
}
System.out.println(allOk);
However, this is not the most efficient way of doing it. A better approach would be to construct a regex with the characters that you need, and call matches() on the string:
// Let's say x, y, and z are the valid characters
String regex = "[xyz]*";
if (myString.matches(regex)) {
    System.out.println("All characters in the string are in 'x', 'y', and 'z'");
}
I think you are probably over-thinking this problem. (For instance, spending too much time thinking how to make the lexer "efficient" ...)
The conventional ways to test for valid / invalid characters in a lexer are:
use a big switch statement, or
perform a sequence of "character class" tests; e.g. using the result of Character.getType(char)
Or better still, use a lexer generator.
Using a HashSet is neither more efficient nor more readable than a switch. And the "character class" approach could be a lot more readable than both ... depending on your validation rules.
But if I haven't convinced you, see #blinkenlights' Answer :-)
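To make the switch and "character class" suggestions a bit more concrete, here is a rough sketch that combines them. The rule set (letters, digits, whitespace and a few operator characters) is purely hypothetical, as are the names CharValidator, isLegal and firstIllegal; substitute whatever your lexer's grammar actually allows.

```java
class CharValidator {
    // Hypothetical rule set for illustration only: letters, digits,
    // whitespace, and a handful of operator/punctuation characters.
    static boolean isLegal(char c) {
        // "Character class" tests from java.lang.Character
        if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
            return true;
        }
        // Switch for the individual symbols
        switch (c) {
            case '+': case '-': case '*': case '/':
            case '=': case '(': case ')': case ';':
                return true;
            default:
                return false;
        }
    }

    // Returns the index of the first illegal character, or -1 if all are legal.
    static int firstIllegal(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (!isLegal(input.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

Returning the index rather than a boolean lets the lexer report where the offending character is.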

Using Regex to analyse Strings in Java

I created a simple quiz program and I am trying to figure out a method to return 3 types of answers using regex. The 3 answers would be either fully correct, correct (but with a spelling error), or partially correct, but still counted as correct.
For example, the method should accept all three of the following strings when comparing them to the String "Elephants": 1. "Elephants", 2. "Elephents", 3. "Elephant".
The 1st string is fully correct, so would return "Correct Answer".
The 2nd string is correct but has a spelling error ('e' instead of an 'a'), so it will return "Correct although spelled Elephants".
The 3rd string is partially correct (No 's' at the end), but will return "Answer accepted"
Could anyone figure out the three types of Regex expressions I could use for this method?
Thanks much appreciated.
There is no regex solution for this, but you can implement a "distance algorithm" to measure the relative similarity of two words. One very common algorithm for this is Levenshtein Distance, or Edit Distance: it tells you how many "editing actions" it would take to go from the answer the user typed in to the correctly spelled answer. Replacing, inserting, or deleting a symbol counts as one action. If the distance is two or less, the answer the user typed in is likely just a spelling error; if the distance is three or more, it's either a very poorly spelled answer, or an incorrect answer (both should be counted as incorrect).
The wikipedia article linked above has pseudocode implementation for the algorithm.
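A compact Java version of the edit distance could look like the sketch below. The class name AnswerChecker, the method grade and the returned messages are assumptions made for this example; the threshold of two comes from the reasoning above.

```java
class AnswerChecker {
    // Classic dynamic-programming edit distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                    prev[j - 1] + cost);                    // replace or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical grading rule: 0 edits is correct, up to 2 is a misspelling.
    static String grade(String given, String expected) {
        int distance = levenshtein(given, expected);
        if (distance == 0) return "Correct Answer";
        if (distance <= 2) return "Correct although spelled " + expected;
        return "Incorrect";
    }
}
```

Note that "Elephents" and "Elephant" are both at distance 1 from "Elephants", so the distance alone cannot tell a misspelling from a missing letter; the comparison is also case-sensitive as written.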
The first regex to match: Elephants
If it doesn't match try Eleph[ae]nts for the second.
If also not, try Elephant.
You could additionally combine it with word end markers.
For testing regexes, this site is really cool: http://gskinner.com/RegExr/
With regexes, you have to try to guess the spelling errors.
Fully Correct Regex:
"Elephants"
Correct although spelled "Elephants" Regex:
"[^E]lephants|E[^l]ephants|El[^e]phants|Ele[^p]hants|Elep[^h]ants|Eleph[^a]nts|Elepha[^n]ts|Elephan[^t]s|Elephant[^s]"
Answer accepted Regex:
"lephants|Eephants|Elphants|Elehants|Elepants|Elephnts|Elephats|Elephans|Elephant"
You could write a small program that automatically generates the regexes for validating an answer and reports which of these cases the answer falls into:
Correct
Correct although misspelled
Answer accepted
For instance, assuming that the correct answer is "Elephants", you could write a routine which tests for the second case (Correct although misspelled).
String generateCorrectAltoughMispelledAnswerRegex(final String answer) {
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < answer.length(); i++) {
        // Replace the i-th character with "any character except the correct one"
        String mispelled = answer.substring(0, i) + "[^" + answer.charAt(i) + "]" +
            answer.substring(i + 1);
        builder.append(mispelled);
        // Separate the alternatives with '|', but not after the last one
        if (i < answer.length() - 1) { builder.append("|"); }
    }
    return builder.toString();
}
e.g.: By calling the function generateCorrectAltoughMispelledAnswerRegex with the argument "Elephants"
generateCorrectAltoughMispelledAnswerRegex("Elephants")
it will generate the regexp to test for the second case:
"[^E]lephants|E[^l]ephants|El[^e]phants|Ele[^p]hants|Elep[^h]ants|Eleph[^a]nts|Elepha[^n]ts|Elephan[^t]s|Elephant[^s]"
You can do the same for the other cases.
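For the third case (Answer accepted, i.e. one letter missing), a generator might look roughly like this. The class name MissingLetterRegex is invented for the example, and the sketch assumes the answer contains no regex metacharacters.

```java
class MissingLetterRegex {
    // Builds an alternation of every variant of answer with exactly one
    // character removed. Assumes answer has no regex metacharacters.
    static String generate(String answer) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < answer.length(); i++) {
            if (i > 0) {
                builder.append('|');
            }
            // Everything before index i, followed by everything after it
            builder.append(answer, 0, i).append(answer, i + 1, answer.length());
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        String regex = generate("Elephants");
        System.out.println(regex);
        System.out.println("Elephant".matches(regex)); // true
    }
}
```

Repeated letters produce duplicate alternatives (for example "aa" yields "a|a"), which is harmless for matching.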

select a word from a section of string?

I'm trying to find out if there are any methods in Java which would help me achieve the following.
I want to pass a method a parameter like below
"(hi|hello) my name is (Bob|Robert). Today is a (good|great|wonderful) day."
I want the method to select one of the words inside the parenthesis separated by '|' and return the full string with one of the words randomly selected. Does Java have any methods for this or would I have to code this myself using character by character checks in loops?
You can parse it by regexes.
The regex would be \(\w+(\|\w+)*\); in the replacement you just split the argument on the '|' and return the random word.
Something like
import java.util.Random;
import java.util.regex.*;
public final class Replacer {
    //aText: "(hi|hello) my name is (Bob|Robert). Today is a (good|great|wonderful) day."
    //returns e.g.: "hello my name is Bob. Today is a wonderful day."
    public static String getEditedText(String aText){
        StringBuffer result = new StringBuffer();
        Matcher matcher = fINITIAL_A.matcher(aText);
        while ( matcher.find() ) {
            matcher.appendReplacement(result, getReplacement(matcher));
        }
        matcher.appendTail(result);
        return result.toString();
    }
    //matches a parenthesized group of '|'-separated words, e.g. "(hi|hello)"
    private static final Pattern fINITIAL_A = Pattern.compile(
        "\\((\\w+(\\|\\w+)*)\\)",
        Pattern.CASE_INSENSITIVE
    );
    private static final Random RANDOM = new Random();
    //aMatcher.group(1): "hi|hello"
    //words: ["hi", "hello"]
    //returns e.g.: "hello"
    private static String getReplacement(Matcher aMatcher){
        //split() takes a regex, so the pipe has to be escaped
        String[] words = aMatcher.group(1).split("\\|");
        int index = RANDOM.nextInt(words.length);
        return words[index];
    }
}
(Note that this code is written just to illustrate an idea.)
Maybe this helps:
Pass the three strings "hi|hello", "Bob|Robert" and "good|great|wonderful" as arguments to the method.
Inside the method, split each string into an array,
e.g. String[] firstStringArray = thatString.split("\\|"); (the pipe must be escaped because split() takes a regex), and do the same for the other two.
Then pick a random element from each array.
As far as I know, Java doesn't have any method that does this directly.
You have to write code for it or use a regex.
I don't think Java has anything that will do what you want directly. Personally, instead of doing things based on regexps or characters, I would make a method something like:
String madLib(Set<String> greetings, Set<String> names, Set<String> dispositions)
{
// pick randomly from each of the sets and insert into your background string
}
There is no direct support for this. And you should ideally not try a low level solution.
You should search for 'random sentence generator'. The way you are writing
`(Hi|Hello)`
etc. is called a grammar. You have to write a parser for the grammar. Again there are many solutions for writing parsers. There are standard ways to specify grammar. Look for BNF.
The parser and generator problems have been solved many times over, and the interesting part of your problem will be writing the grammar.
Java does not provide any ready-made method for this. You can either use a regex as described by Penartur or write your own Java method to split the strings and pick random words. The StringTokenizer class can help you if you follow the second approach.
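A rough sketch of the StringTokenizer route follows. The class name RandomChoice and the method choose are invented for this example, and it assumes the template is well formed (balanced parentheses, no nesting, at least one word per group).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.StringTokenizer;

class RandomChoice {
    // Replaces each "(a|b|c)" group in template with one randomly chosen
    // alternative. Assumes balanced, non-nested, non-empty groups.
    static String choose(String template, Random rng) {
        StringBuilder out = new StringBuilder();
        // returnDelims = true, so "(" and ")" come back as tokens of their own
        StringTokenizer parts = new StringTokenizer(template, "()", true);
        boolean inGroup = false;
        while (parts.hasMoreTokens()) {
            String token = parts.nextToken();
            if (token.equals("(")) {
                inGroup = true;
            } else if (token.equals(")")) {
                inGroup = false;
            } else if (inGroup) {
                // Split the group's contents on '|' and pick one at random
                List<String> options = new ArrayList<>();
                StringTokenizer alternatives = new StringTokenizer(token, "|");
                while (alternatives.hasMoreTokens()) {
                    options.add(alternatives.nextToken());
                }
                out.append(options.get(rng.nextInt(options.size())));
            } else {
                out.append(token); // plain text outside any group
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String template =
            "(hi|hello) my name is (Bob|Robert). Today is a (good|great|wonderful) day.";
        System.out.println(choose(template, new Random()));
    }
}
```

Passing the Random in as a parameter makes it possible to seed it for reproducible output in tests.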

Is there a regular expression for finding/replacing the common start of all lines in a chunk of text?

Imagine this string:
		if(editorPart instanceof ITextEditor){
			ITextEditor editor = (ITextEditor)editorPart;
			selection = (ITextSelection) editor.getSelectionProvider().getSelection();
		}else if( editorPart instanceof MultiPageEditorPart){
			//this would be the case for the XML editor
			selection = (ITextSelection) editorPart.getEditorSite().getSelectionProvider().getSelection();
		}
I can see, visually, that the "common" start in each of these lines is two tab characters. Is there a regular expression that would replace -- only at the beginning of each line (including the first and last line), this common start, such that after the regex I'd end up with that same string, only essentially un-indented?
I can't simply search for "two tabs" in this case because there might be two tabs elsewhere in the text but not at the start of a line.
I've implemented this functionality with a different method but thought it'd be a fun regex challenge, if it's possible at all
The ^ symbol in a regular expression matches the beginning of a line (in multiline mode). So:
s/^\t\t//gm
Would remove two tabs at the beginning of each line.
In general (i.e. if you want to match an arbitrary prefix, not necessarily two tabs), there may or may not be a way. It depends on which regular expression engine you're using. I would imagine that maybe something roughly like this might work:
\B^(.+).*?$(?:^\1.*?$)+\E
note that I've probably screwed up the regex syntax, just think of it as regex pseudocode of sorts (\B is beginning of string, ^ is beginning of line, $ is end of line, \E is end of string)
But this really isn't a job I would do with a regular expression. A simple character-by-character parser seems much better suited.
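For illustration, the character-by-character idea might be sketched in Java as follows. The class name Unindent and the method stripCommonPrefix are made up for this example, and it assumes lines are separated by "\n".

```java
class Unindent {
    // Removes the longest prefix shared by every line of text.
    // Assumes "\n" line separators.
    static String stripCommonPrefix(String text) {
        String[] lines = text.split("\n", -1);
        // Start with the whole first line and shrink it line by line
        String prefix = lines[0];
        for (String line : lines) {
            int n = 0;
            while (n < prefix.length() && n < line.length()
                    && prefix.charAt(n) == line.charAt(n)) {
                n++;
            }
            prefix = prefix.substring(0, n);
        }
        // Rebuild the text with the shared prefix cut from each line
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(lines[i].substring(prefix.length()));
        }
        return out.toString();
    }
}
```

A completely empty line reduces the common prefix to nothing, so blank lines would need special handling in real use.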
Not in one regex. You need to make two passes: matches() to find the longest common prefix, then replaceAll() to remove it. Here's my best solution:
import java.util.regex.*;
public class Test
{
    public static void main(String[] args) throws Exception
    {
        String target =
            "\t\tif(editorPart instanceof ITextEditor){\n"
            + "\t\t\tITextEditor editor = (ITextEditor)editorPart;\n"
            + "\t\t\tselection = (ITextSelection) fee.fie().fum();\n"
            + "\t\t}else if( editorPart instanceof MultiPageEditorPart){\n"
            + "\t\t\t//this would be the case for the XML editor\n"
            + "\t\t\tselection = (ITextSelection) fee.fie().foe().fum();\n"
            + "\t\t}";
        System.out.printf("%n%s%n", target);
        Pattern p = Pattern.compile("^(\\s+).*+(?:\n\\1.*+)*+");
        Matcher m = p.matcher(target);
        if (m.matches())
        {
            String indent = m.group(1);
            String result = target.replaceAll("(?m)^" + indent, "");
            System.out.printf("%n%s%n", result);
        }
    }
}
Of course, this assumes (as Jonathan Leffler hinted at in his comment to your question) that the target string is not part of a larger string, and you're only removing whitespace. Without those assumptions the task becomes a lot more complex.
It's absolutely possible. As everyone points out, I'd never inflict this on a real project, though.
My answer, if you're curious, is here. I tried writing it in perl, but it doesn't support variable-length lookbehinds.
EDIT: Fixed it! The linked code now works. If you'd like hints, just comment -- I don't want to give it away if you want to solve it yourself, though.
