Java regex - erase characters followed by \b (backspace) - java

I have a string constructed from user keyboard types, so it might contain '\b' characters (backspaces).
I want to clean the string, so that it will not contain the '\b' characters, as well as the characters they are meant to erase. For instance, the string:
String str = "\bHellow\b world!!!\b\b\b.";
Should be printed as:
Hello world.
I have tried a few things with replaceAll, and what I have now is:
System.out.println(str.replaceAll("^\b+|.\b+", ""));
Which prints:
Hello world!!.
Single '\b' is handled fine, but multiples of it are ignored.
So, can I solve it with Java's regex?
EDIT:
I have seen this answer, but it seems not to apply to Java's replaceAll.
Maybe I'm missing something with the verbatim string...

It can't be done in one pass unless there is a practical limit on the number of consecutive backspaces (which there isn't), and there is a guarantee (which there isn't) that there are no "extra" backspaces for which there is no preceding character to delete.
This does the job (it's only 2 small lines):
while (str.contains("\b"))
str = str.replaceAll("^\b+|[^\b]\b", "");
This also handles edge cases such as "x\b\by": the first backspace consumes the x, which leaves an extra backspace at the start of the string. That one is trimmed on the next pass, leaving just "y".
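As a quick check of the loop above, here is a minimal self-contained sketch. The class name BackspaceLoop and the method name clean are illustrative and not part of the original answer.

```java
class BackspaceLoop {
    // Repeatedly strips leading backspaces and "character + backspace" pairs
    // until no backspace is left in the string.
    static String clean(String str) {
        while (str.contains("\b")) {
            str = str.replaceAll("^\b+|[^\b]\b", "");
        }
        return str;
    }

    public static void main(String[] args) {
        System.out.println(clean("\bHellow\b world!!!\b\b\b.")); // Hello world.
        System.out.println(clean("x\b\by"));                      // y
    }
}
```

Each pass removes at most one backspace from every run of backspaces, so a run of k backspaces needs roughly k passes.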

This looks like a job for Stack!
Stack<Character> stack = new Stack<Character>();
// for-each character in the string
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
// push if it's not a backspace
if (c != '\b') {
stack.push(c);
// else pop if possible
} else if (!stack.empty()) {
stack.pop();
}
}
// convert stack to string
StringBuilder builder = new StringBuilder(stack.size());
for (Character c : stack) {
builder.append(c);
}
// print it
System.out.println(builder.toString());
Regex, while nice, isn't well suited to every task. This approach is not as concise as Bohemian's, but it is more efficient. Using a stack is O(n) in every case, while a regex approach like Bohemian's is O(n^2) in the worst case.
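The same single-pass idea can also be sketched with a StringBuilder used as the stack, which avoids boxing every char into a Character. This is a variant for illustration, not the code from the answer above; the names BackspaceBuffer and clean are made up.

```java
class BackspaceBuffer {
    // Single pass: append ordinary characters, drop the last one on a backspace.
    static String clean(String str) {
        StringBuilder out = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c != '\b') {
                out.append(c);
            } else if (out.length() > 0) {
                // "Pop": shrink the buffer by one character.
                out.setLength(out.length() - 1);
            }
            // A backspace with nothing before it is simply ignored.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("\bHellow\b world!!!\b\b\b.")); // Hello world.
    }
}
```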

The problem you are trying to solve can't be solved with a single regular expression in the formal sense. The grammar that generates the language {any_symbol}*{any_symbol}^n{\b}^n (which is a special case of your input) isn't regular. You need to store state somewhere (how many symbols it has read before the \b run, and how many \b characters), but a DFA can't do that, because it can't know how many consecutive \b characters it will find. All the proposed solutions are just regexes for your particular case ("\bHellow\b world!!!\b\b\b.") and can easily be broken by a more complicated test.
The easiest solution for your case is to replace, in a loop, each pair {anything except \b}{\b}.
UPD: The solution proposed by @Bohemian seems perfectly correct.
UPD 2:
It seems that Java's regexes can handle not only regular languages but also inputs like {a}^n{b}^n, by using recursive lookahead, so in Java it is possible to match those groups with a single regex.
Thanks to @Pshemo for the comments and @Elist for the edits!

If I understand the question correctly, this is the solution to your question:
String str = "\bHellow\b world!!!\b\b\b.";
System.out.println(str.replace(".?\\\b", ""));

This has been a nice riddle. I think you can use a regex to remove the same number of identical repeated characters and \bs (i.e. for your particular input string):
String str = "\bHellow\b world!!!\b\b\b.";
System.out.println(str.replaceAll("^\b+|(?:([^\b])(?=\\1*+(\\2?+\b)))+\\2", ""));
This is an adaptation of How can we match a^n b^n with Java regex?.
See IDEONE demo, where I added .replace("\b","<B>")); to see if there are any \bs left.
Output:
Hello world.
A generic regex-only solution is outside of regex scope... for now.

Related

Regex for finding between 1 and 3 character in a string

I am trying to write a regex that returns true if [A-Za-z] occurs between 1 and 3 times, but I am not able to do this.
public static void main(String[] args) {
String regex = "(?:([A-Za-z]*){3}).*";
String regex1 = "(?=((([A-Za-z]){1}){1,3})).*";
Pattern pattern = Pattern.compile(regex);
System.out.println(pattern.matcher("AD1CDD").find());
}
Note: for consecutive 3 characters I am able to write it, but what I want to achieve is the occurrence should be between 1 and 3 only for the entire string. If there are 4 characters, it should return false. I have used look-ahead to achieve this
If I understand your question correctly, you want to check if
at least 1 and at most 3 characters from the range [a-zA-Z] are in the string, and
any other character can occur arbitrarily often?
First of all, just counting the characters and not using a regular expression is more efficient, as this is not a regular language problem, but a trivial counting problem. There is nothing wrong with using a for loop for this problem (except that interpreters such as Python and R can be fairly slow).
Nevertheless, you can (ab-) use extended regular expressions:
^([^A-Za-z]*[A-Za-z]){1,3}[^A-Za-z]*$
This is fairly straightforward, once you also model the "other" characters. And that is what you should do to define a pattern: model all accepted strings (i.e. the entire "language"), not only those characters you want to find.
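In Java, that pattern could be applied roughly as follows. This is a minimal sketch; the class name OneToThreeLetters and the method name matches are illustrative only.

```java
import java.util.regex.Pattern;

class OneToThreeLetters {
    // Models the whole string: 1 to 3 groups of "non-letters, then one letter",
    // followed by trailing non-letters only.
    private static final Pattern ONE_TO_THREE =
        Pattern.compile("^([^A-Za-z]*[A-Za-z]){1,3}[^A-Za-z]*$");

    static boolean matches(String s) {
        return ONE_TO_THREE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("AD1CDD")); // false: five letters
        System.out.println(matches("1A2B3"));  // true: two letters
        System.out.println(matches("123"));    // false: no letters
    }
}
```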
Alternatively, you can "findAll" matches of ([A-Za-z]), and look at the length of the result. This may be more convenient if you also need the actual characters.
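Java has no findAll method, but looping over Matcher.find() gives the same count. A minimal sketch, with illustrative names (LetterCount, countLetters, hasOneToThree):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterCount {
    private static final Pattern LETTER = Pattern.compile("[A-Za-z]");

    // Counts every single-letter match in the input.
    static int countLetters(String s) {
        Matcher m = LETTER.matcher(s);
        int count = 0;
        while (m.find()) {
            count++; // m.group() holds the matched letter if it is needed
        }
        return count;
    }

    static boolean hasOneToThree(String s) {
        int n = countLetters(s);
        return n >= 1 && n <= 3;
    }

    public static void main(String[] args) {
        System.out.println(countLetters("AD1CDD"));  // 5
        System.out.println(hasOneToThree("AD1CDD")); // false
    }
}
```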
The for loop would look something like this:
public static boolean containsOneToThreeAlphabetic(String str) {
    int matched = 0;
    // length() is a method on String, not a field
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) matched++;
    }
    return matched >= 1 && matched <= 3;
}
This is straightforward, readable, extensible, and efficient (in compiled languages). You can also add an if (matched>=4) return false; (or break) to stop early.
Please, stop playing with regex, you'll complicate not only your own life, but the life of the people, who have to handle your code in the future. Choose a simpler approach, find all [A-Za-z]+ strings, put them into the list, then check every string, if the length is within 1 and 3 or beyond that.
Regex
/([A-Za-z])(?=(?:.*\1){3})/s
Looking for a char and for 3 repetitions of it. So if it matches there are 4 or more equal chars present.

Java Program to reverse every word, special chars should remain in the exact same place

I am trying to solve a java program where i need to reverse every word keeping special characters in same place.
Input: String s = "Why $Java Is# Great?"
Output: yhW $avaJ sI# taerG?
I was able to do word reverse but not sure how to handle special characters in place.
Solution using stack :
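// requires: import java.util.Stack;
String s = "Why $Java Is# Great?";
String Output = "yhW $avaJ sI# taerG?";
StringBuilder sb1 = new StringBuilder();
Stack<Character> stack = new Stack<>();
for (int i = 0; i < s.length(); i++) {
    char ch = s.charAt(i);
    if (!((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'))) {
        // Non-letter: flush the pending word in reverse, then keep this character in place.
        while (!stack.isEmpty()) {
            sb1.append(stack.pop());
        }
        sb1.append(ch);
    } else {
        // Letter: hold it until the word ends.
        stack.push(ch);
    }
}
// Flush the last word in case the string ends with a letter.
while (!stack.isEmpty()) {
    sb1.append(stack.pop());
}
System.out.println(sb1.toString());
System.out.println(sb1.toString().equals(Output));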
I won't give you code, as this could be an assignment and in that case you should complete it on your own, but here is (one of) the solution(s):
Use a Pattern and Matcher with the regex of \w+ (consecutive word characters) to find single words without altering other stuff.
Create a loop that you will use the setup from step-1 to handle each word without altering anything else.
Inside the loop, using the index and length of the matches to have the range of indexes in the string which the word is in, do a simple loop that swaps the order of the characters in that range.
Since the first loop handles finding every word (and nothing else), and the second, inner loop handles reversing the words (and nothing else), the result should be what you expect.
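For readers who are not treating this as an assignment, the steps above could be sketched as follows. The class and method names are illustrative. Note that \w also matches digits and underscores, so those are treated as word characters here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReverseWords {
    static String reverseWords(String s) {
        // The matcher scans the original (immutable) string; the edits go into a copy.
        Matcher m = Pattern.compile("\\w+").matcher(s);
        StringBuilder out = new StringBuilder(s);
        while (m.find()) {
            // Swap characters inward from both ends of the matched range.
            int i = m.start();
            int j = m.end() - 1;
            while (i < j) {
                char tmp = out.charAt(i);
                out.setCharAt(i, out.charAt(j));
                out.setCharAt(j, tmp);
                i++;
                j--;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseWords("Why $Java Is# Great?")); // yhW $avaJ sI# taerG?
    }
}
```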
The java.lang.Character.isLetter(char ch) determines if the specified character is a letter.
Here is a small pseudocode
if(!character.isLetter) don't reverse
Hope that helped
A simple solution would be to write a method that checks if a character is special or not, e.g. call it isSpecialChar() that returns a boolean. If you're unfamiliar as of how to do it, check representing chars using ASCII values. It's pretty basic and straightforward.
Roughly, you have to do the following: check each character and append it to the string as appropriate. If it is part of a word, you can reverse the word using another helper method, for instance reverse(), which would return the string reversed. For better performance and learning, try using a StringBuilder instead of a regular String.
References: How to get ASCII from char
About String builders

java String.replaceAll char between two numbers

I would like to replace every '-' character that is between two numbers, or between a number and '.', with the character '&'. For example:
String input = "2.1(-7-11.3)-12.1*-2.3-.11";
String output = "2.1(-7&11.3)-12.1*-2.3&.11";
I have something like this, but I try to do it easier.
public void preperString(String input) {
input=input.replaceAll(" ","");
input=input.replaceAll(",",".");
input=input.replaceAll("-","&");
input=input.replaceAll("\\(&","\\(-");
input=input.replaceAll("\\[&","\\[-");
input=input.replaceAll("\\+&","\\+-");
input=input.replaceAll("\\*&","\\*-");
input=input.replaceAll("/&","/-");
input=input.replaceAll("\\^&","\\^-");
input=input.replaceAll("&&","&-");
input=input.replaceFirst("^&","-");
for (String s : input.split("[^.\\-\\d]")) {
if (!s.equals(""))
numbers.add(Double.parseDouble(s)); // numbers: presumably a collection of Double declared elsewhere
}
}
You can do it in one shot by using regex capture groups:
String input = "2.1(-7-11.3)-12.1*-2.3-.11";
input = input.replaceAll("([\\d.])-([\\d.])", "$1&$2");
Output
2.1(-7&11.3)-12.1*-2.3&.11
([\\d.])-([\\d.])
//      ^--------- replace the hyphen (-) that is between
//  ^________^---- two digits (\d)
//   ^________^--- or between a digit (\d) and a dot (.)
regex demo
Let me guess. You don't really have a use for & here; you're just trying to replace certain minus signs with & so that they won't interfere with the split that you're trying to use to find all the numbers (so that the split doesn't return "-7-11" as one of the array elements, in your original example). Is that correct?
If my guess is right, then the correct answer is: don't use split. It is the wrong tool for the job. The purpose of split is to split up a string by looking for delimiter patterns (such as a sequence of whitespace or a comma); but where the format of the elements between the delimiters doesn't much matter. In your case, though, you are looking for elements of a particular numeric format (it might start with -, and otherwise will have at least one digit and at most one period; I don't know what your exact requirements are). In this case, instead of split, the right way to do this is to create a regular expression for the pattern you want your numbers to have, and then use m.find in a loop (where m is a Matcher) to get all your numbers.
If you need to treat some - characters differently (e.g. in -7-11, where you want the second - to be an operator and not part of -11), then you can make special checks for that in your loop, and skip over the - signs that you know you want to treat as operators.
It's simpler, readers will understand what you're trying to do, and it's less error-prone because all you have to do is make sure your pattern for expressing numbers accurately reflects what you're looking for.
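A minimal sketch of that find loop follows. The number format is an assumption based on the question's own rule: a '-' counts as a sign unless it directly follows a digit or a dot, and a number is either digits with an optional fraction or a bare fraction such as .11. The names NumberScanner and numbers are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberScanner {
    // Optional sign (only when not directly preceded by a digit or dot),
    // then digits with an optional fraction, or a bare fraction like ".11".
    private static final Pattern NUMBER =
        Pattern.compile("(?:(?<![\\d.])-)?(?:\\d+(?:\\.\\d*)?|\\.\\d+)");

    static List<Double> numbers(String input) {
        List<Double> result = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            result.add(Double.parseDouble(m.group()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [2.1, -7.0, 11.3, -12.1, -2.3, 0.11]
        System.out.println(numbers("2.1(-7-11.3)-12.1*-2.3-.11"));
    }
}
```

With this rule, the '-' in -7-11.3 that follows the 7 is skipped as an operator, so no temporary & replacement is needed.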
It's common for newer Java programmers to think regexes and split are magic tools that can solve everything. But often the result ends up being too complex (code uses overly complicated regexes, or relies on trickery like having to replace characters with & temporarily). I cannot look at your original code and convince myself that it works right. It's not worth it.
You can use lookahead and lookbehind to match digit or dot:
input.replaceAll("(?<=[\\d\\.])-(?=[\\d\\.])","&")
Have a look on this fiddle.

Need help for writing regular expression

I am weak at writing regular expressions, so I'm going to need some help with this one. I need a regular expression that can validate that a string is a set of letters (the letters must be unique) delimited by commas.
Only one character and after that a comma
Examples:
A,E,R
R,A
E,R
Thanks
You can use a repeated group to validate it's a comma separated string.
^[AER](?:,[AER])*$
To also require that the characters are unique, you would do something like:
^([AER])(?:,(?!\1)([AER])(?!.*\2))*$
If I understand it correctly, a valid string will be a series (possibly zero long) of two-character patterns, where each pattern is a letter followed by a comma; finally followed at the end by one letter.
Thus:
"^([A-Za-z],)*[A-Za-z]$"
EDIT: Since you've clarified that the letters have to be A, E, or R:
"^([AER],)*[AER]$"
Something like this "^([AER],)*[AER]$"
#Edit: regarding the uniqueness, if you can drop the "last character cannot be a comma" requirement (which can be checked before the regex anyway in constant time) then this should work:
"^(?:([AER],?)(?!.*\\1))*$"
This will match A,E,R, hence you need that check before performing the regex. I do not take responsibility for the performance but since it's only 3 letters anyway...
The above is a java regex obviously, if you want a "pure one" ^(?:([AER],?)(?!.*\1))*$
#Edit2: sorry, missed one thing: this actually requires that check and then you need to add a comma at the end since otherwise it will also match A,E,E. Kind of limited I know.
My own ugly but extensible solution, which will disallow leading and trailing commas, and checks that the characters are unique.
It uses a forward-declared backreference: note how the second capturing group comes after the reference made to it, (?!.*\2). On the first repetition, the second capturing group hasn't captured anything yet, so Java treats any attempt to reference the text matched by it as a failure, and the negative lookahead therefore succeeds.
^([AER])(?!.*\1)(?:,(?!.*\2)([AER]))*+$
Demo on regex101 (PCRE flavor has the same behavior for this case)
Demo on RegexPlanet
Test cases:
A,E,R
A,R,E
E,R,A
A
R,E
R
E
A,
A,R,
A,A,R
E,A,E
A,E,E
X,R,E
R,A,E,
,A
AA,R,E
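A small harness for running the pattern above against these test cases in Java might look like this. It is a sketch; the class name UniqueCsvLetters and the method name isValid are illustrative. The first seven cases should be accepted and the remaining nine rejected.

```java
import java.util.regex.Pattern;

class UniqueCsvLetters {
    // Same pattern as above, with backslashes doubled for a Java string literal.
    private static final Pattern UNIQUE =
        Pattern.compile("^([AER])(?!.*\\1)(?:,(?!.*\\2)([AER]))*+$");

    static boolean isValid(String s) {
        return UNIQUE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] cases = {
            "A,E,R", "A,R,E", "E,R,A", "A", "R,E", "R", "E",
            "A,", "A,R,", "A,A,R", "E,A,E", "A,E,E", "X,R,E", "R,A,E,", ",A", "AA,R,E"
        };
        for (String c : cases) {
            System.out.println(c + " -> " + isValid(c));
        }
    }
}
```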
Note: I'm going to answer the original question. That is, I don't care if the elements repeat.
We've had several suggestions for this regex:
^([AER],)*[AER]$
Which does indeed work. However, to match a String, it first has to back up one character because it will find that there is no , at the end. So we switch it for this to increase performance:
^[AER](,[AER])*$
Notice that this will match a correct String the very first time it attempts to. But also note that we don't need to worry about the ( )* backing up at all; it will either match the first time, or it won't match the String at all. So we can further improve performance by using a possessive quantifier:
^[AER](,[AER])*+$
This will take the whole String and attempt to match it. If it fails, then it stops, saving time by not doing useless backing up.
If I were trying to ensure the String had no repeated elements, I would not use regex; it just complicates things. You end up with less-readable code (sadly, most people don't understand regex) and, oftentimes, slower code. So I would build my own validator:
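public static boolean isCommaDelimitedSet(String toValidate, HashSet<Character> toMatch) {
    // Letters and commas alternate, so a valid string has odd length
    // (this also rejects the empty string and a trailing comma).
    if (toValidate.length() % 2 == 0) return false;
    HashSet<Character> seen = new HashSet<Character>();
    for (int index = 0; index < toValidate.length(); index++) {
        if (index % 2 == 0) {
            // Even index: must be an allowed character that has not appeared yet.
            if (!toMatch.contains(toValidate.charAt(index))) return false;
            if (!seen.add(toValidate.charAt(index))) return false;
        } else {
            // Odd index: must be the delimiter.
            if (toValidate.charAt(index) != ',') return false;
        }
    }
    return true;
}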
This assumes that you want to be able to pass in a set of characters that are allowed. If you don't want that and have explicit chars you want to match, change the contents of the if (index % 2 == 0) block to:
char c = toValidate.charAt(index);
if (c != 'A' && c != 'E' && c != 'R' /* && and so on */) return false;

How to use regular expressions to match everything before a certain type of word

I am new to regular expressions.
Is it possible to match everything before a word that meets a certain criteria:
E.g.
THIS IS A TEST - - +++ This is a test
I would like it to encounter a word that begins with an uppercase and the next character is lower case. This constitutes a proper word. I would then like to delete everything before that word.
The example above should produce: This is a test
I only want to do this processing until it finds the proper word, and then stop.
Any help would be appreciated.
Thanks
Replace
^.*?(?=[A-Z][a-z])
with the empty string. This works for ASCII input. For non-ASCII input (Unicode, other languages), different strategies apply.
Explanation
.*? Everything, until
(?= followed by
[A-Z] one of A .. Z and
[a-z] one of a .. z
)
The Java Unicode-enabled variant would be this:
^.*?(?=\p{Lu}\p{Ll})
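In Java, the replacement could be applied with String.replaceFirst. A minimal sketch, with illustrative names (TrimToProperWord, trim):

```java
class TrimToProperWord {
    // Removes everything before the first uppercase letter that is
    // immediately followed by a lowercase letter.
    static String trim(String s) {
        return s.replaceFirst("^.*?(?=\\p{Lu}\\p{Ll})", "");
    }

    public static void main(String[] args) {
        System.out.println(trim("THIS IS A TEST - - +++ This is a test")); // This is a test
    }
}
```

If the input contains no such word, the pattern does not match and the string is returned unchanged.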
Having woken up a bit, you don't need to delete anything, or even create a sub-group - just find the pattern expressed elsewhere in answers. Here's a complete example:
import java.util.regex.*;
public class Test
{
public static void main(String args[])
{
Pattern pattern = Pattern.compile("[A-Z][a-z].*");
String original = "THIS IS A TEST - - +++ This is a test";
Matcher match = pattern.matcher(original);
if (match.find())
{
System.out.println(match.group());
}
else
{
System.out.println("No match");
}
}
}
EDIT: Original answer
This looks like it's doing the right thing:
import java.util.regex.*;
public class Test
{
public static void main(String args[])
{
Pattern pattern = Pattern.compile("^.*?([A-Z][a-z].*)$");
String original = "THIS IS A TEST - - +++ This is a test";
String replaced = pattern.matcher(original).replaceAll("$1");
System.out.println(replaced);
}
}
Basically the trick is not to ignore everything before the proper word - it's to group everything from the proper word onwards, and replace the whole text with that group.
The above would fail with "*** FOO *** I am fond of peanuts" because the "I" wouldn't be considered a proper word. If you want to fix that, change the [a-z] to [a-z\s] which will allow for whitespace instead of a letter.
I really don't get why people go to regular expressions so quickly.
I've done a lot of string parsing (Used to screen-scrape vt100 menu screens) and I've never found a single case where Regular Expressions would have been much easier than just writing code. (Maybe a couple would have been a little easier, but not much).
I kind of understand they are supposed to be easier once you know them--but you see someone ask a question like this and realize they aren't easy for every programmer to just get by glancing at it. If it costs 1 programmer somewhere down the line 10 minutes of thought, it has a huge net loss over just coding it, even if you took 5 minutes to write 5 lines.
So it's going to need documentation--and if someone who is at that same level comes across it, he won't be able to modify it without knowledge outside his domain, even with documentation.
I mean if the poster had to ask on a trivial case--then there just isn't such thing as a trivial case.
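public String getRealText(String scanMe) {
    // Stop one short of the end so that charAt(i + 1) stays in bounds.
    for (int i = 0; i < scanMe.length() - 1; i++)
        if (Character.isUpperCase(scanMe.charAt(i)) && Character.isLowerCase(scanMe.charAt(i + 1)))
            return scanMe.substring(i);
    return null;
}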
I mean it's 5 lines, but it's simple, readable, and faster than most (all?) RE parsers. Once you've wrapped a regular expression in a method and commented it, the difference in size isn't measurable. The difference in time--well for the poster it would have obviously been a LOT less time--as it might be for the next guy that comes across his code.
And this string operation is one of the ones that are even easier in C with pointers--and it would be even quicker since the testing functions are macros in C.
By the way, make sure you also look for a space in the second slot, not just a lower-case letter, otherwise you'll miss any lines starting with the words A or I.
([A-Z][a-z].+)
would match:
This is a text
then you can do something like this
'.*([A-Z][a-z].*)\s*'
.* matches anything
( [A-Z] #followed by an upper case char
[a-z] #followed by a lower case char
.*) #followed by anything
\s* #followed by zero or more white space
Which is what you are looking for I think
