How to remove front and end character of a string - java

For this input: |PS D#W ||OOOP #||# || QQWQ|
I want to remove the first and last pipe from the string by modifying the regex below, which I wrote to remove spaces and special characters.
str = str.replaceAll("[^a-zA-Z|]","");
I also want to combine it with this regex, str.replaceAll("\\|+","|"), which collapses consecutive pipes inside the string into a single pipe. Is it possible to combine these into one regex?
Expected output: PSDW|OOOP|QQWQ

I won't ask why you want a regex to remove the first and last pipe (you could check whether the string starts and ends with a pipe and use substring).
Remember that a regex is heavier than plain string operations.
But to remove the first and last pipe you can use
str = str.replaceAll("^\\|(.*)\\|$","$1");
Then apply the other replacements.
Of course, it depends on how long the string is and how often you call this method, but it is the shortest answer to the question as asked.
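Putting the pieces together, a minimal sketch of the whole clean-up might look like this. The class and method names (PipeCleaner, clean) are made up for illustration; the three patterns are the ones quoted in the question and in the answer above, applied one after another rather than merged into a single regex.

```java
class PipeCleaner {
    // Hypothetical helper: chains the three replacements discussed above.
    static String clean(String str) {
        // 1. Drop everything except ASCII letters and pipes.
        str = str.replaceAll("[^a-zA-Z|]", "");
        // 2. Collapse runs of pipes into a single pipe.
        str = str.replaceAll("\\|+", "|");
        // 3. Strip one leading and one trailing pipe when both are present.
        return str.replaceAll("^\\|(.*)\\|$", "$1");
    }

    public static void main(String[] args) {
        System.out.println(clean("|PS D#W ||OOOP #||# || QQWQ|")); // PSDW|OOOP|QQWQ
    }
}
```

Note that step 3 only fires when the string both starts and ends with a pipe; a string with a pipe at only one end is left unchanged by it.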


Regular expression to match whole list as well as its parts [duplicate]

In Java I have a string like +aba,biba,-miba, which is a list of sort orders. There might be any number of parts; "aba", "biba" and "miba" are just examples.
I would like to write a regular expression that finds the +/- signs and aba, biba, miba.
I would also like to check whether the full string matches the syntax, which means that I need to match +aba,biba,-miba as a whole as well.
I managed to write regex for the first part:
([+-]?)([^,]*)[,]?
How should I complete the expression so that I can get the second part out of it as well?
Depending on the complexity of the list, i.e. what can be part of it, a regex that checks the entire list is quite straightforward. Such a regex can contain a group that represents each part, together with a quantifier, but you cannot extract all the parts from a single match, because a repeated capturing group in Java keeps only its last iteration. You would therefore need either a simple split() to get the parts or a second regex to extract them.
Assuming your list is separated by comma, doesn't contain whitespace and only allows +/- as well as lower-case characters you could use the following expression to check the format of the list:
boolean listMatches = list.matches("^([+-]?[a-z]+(,(?!$))?)*$");
Note that String.matches() makes ^ and $ superfluous, but I added them for completeness in case you use another method to apply the expression. This basically checks for any number of lower-case "names", each preceded by an optional + or - and followed by a comma if it isn't the last character in the string. Because that comma is optional, names that run together without a comma, such as +aba-miba, are accepted as well.
Note that this also allows an empty list. If the list must contain at least one element you might use something like this:
boolean listMatches = list.matches("^[+-]?[a-z]+(,[+-]?[a-z]+)*$");
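To see how the two validation patterns differ, here is a small sketch that runs both against a few sample lists. The class name, the constant names and the sample strings are assumptions chosen for illustration; only the two patterns come from the answer.

```java
class SortListValidator {
    // Lenient pattern from the answer: also accepts an empty list.
    static final String LENIENT = "^([+-]?[a-z]+(,(?!$))?)*$";
    // Stricter pattern from the answer: at least one element, comma required between elements.
    static final String STRICT = "^[+-]?[a-z]+(,[+-]?[a-z]+)*$";

    public static void main(String[] args) {
        String[] samples = {"+aba,biba,-miba", "", "+aba,", "+aba-miba"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" lenient=" + s.matches(LENIENT)
                    + " strict=" + s.matches(STRICT));
        }
    }
}
```

Both patterns accept +aba,biba,-miba and reject a trailing comma. Only the lenient one accepts the empty string and the comma-less +aba-miba.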
Looking for the parts could then look like this:
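import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern partPattern = Pattern.compile("([+-]?)([a-z]+)");
Matcher partMatcher = partPattern.matcher(list);
// Each find() advances to the next "name" in the list.
while (partMatcher.find()) {
    String direction = partMatcher.group(1); // "+", "-", or "" when no sign is given
    String name = partMatcher.group(2);      // e.g. "aba"
}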
Note that this could also be done with a combination of list.split(","), list.charAt(0) and list.substring(1, list.length()) - it's up to you :)
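The split-based alternative mentioned above could be sketched roughly as follows. The class name, the method name and the "direction:name" output format are assumptions made for this example; note that charAt and substring are applied to each item returned by split, not to the whole list.

```java
import java.util.ArrayList;
import java.util.List;

class SortListSplitter {
    // Returns "direction:name" entries; an empty direction means no sign was given.
    static List<String> parse(String list) {
        List<String> parts = new ArrayList<>();
        for (String item : list.split(",")) {
            if (item.isEmpty()) {
                continue; // skip empty items, e.g. from an empty input string
            }
            char first = item.charAt(0);
            boolean signed = first == '+' || first == '-';
            String direction = signed ? String.valueOf(first) : "";
            String name = signed ? item.substring(1) : item;
            parts.add(direction + ":" + name);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("+aba,biba,-miba")); // [+:aba, :biba, -:miba]
    }
}
```

This sketch does no validation of its own, so it would typically be combined with one of the matches() checks shown earlier.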

how to break the string using keywords using regex

I have a scenario where I need to break the input string below into pieces based on keywords, using a regex.
Keywords are UPRCAS, REPLC, LOWCAS and TUPIL.
String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
The output should be as follows
UPRCAS-0004-abcd
REPLC-0003-123
TUPIL-0005-adf2344
LOWCAS-0003-ABCD
How can I achieve this using Java regex?
I have tried splitting by '-' and by a regex, but both approaches give an array of strings that I have to process again, combining three strings to form UPRCAS-0004-abcd. I feel this is not efficient, since it needs an extra array whose elements have to be joined back together.
String[] tokens = input.split("-");
String[] r = input.split("(?=\\p{Upper})");
Please let me know if we can split the string using a regex based on the keywords. Basically I need to extract the string between the keyword boundaries.
Edited question after understanding the limitation of the existing problem
The regex should be generic and extract the strings from the input between the UPPERCASE characters.
The regex should not contain the keywords used to split the string.
I understand that it is a bad idea to add a new keyword to the regex pattern every time. I expect the solution to be as generic as possible.
Thanks all for your time. Really appreciate it.
Split using the following regex:
(?=UPRCAS|REPLC|LOWCAS|TUPIL)
The (?=xxx) is a zero-width positive lookahead, meaning that it matches the empty space immediately preceding one of the 4 keywords.
See Regular-Expressions.info for more information: Lookahead and Lookbehind Zero-Length Assertions
Test
String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
String[] output = input.split("(?=UPRCAS|REPLC|LOWCAS|TUPIL)");
for (String value : output) {
    System.out.println(value);
}
Output
UPRCAS-0004-abcd
REPLC-0003-123
TUPIL-0005-adf2344
LOWCAS-0003-ABCD
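The edited question asks for a pattern that does not hard-code the keywords. One possible variation, which is my assumption and not part of the answer above, is to split before any run of uppercase letters that is immediately followed by a hyphen and not preceded by another uppercase letter. It assumes Java 8 or later (where a zero-width match at the start produces no empty leading element) and that a payload never ends in uppercase letters directly before the next keyword.

```java
import java.util.Arrays;

class KeywordSplitter {
    // Assumed generic boundary: the start of an uppercase run that is followed by '-'.
    static String[] split(String input) {
        return input.split("(?<![A-Z])(?=[A-Z]+-)");
    }

    public static void main(String[] args) {
        String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
        System.out.println(Arrays.toString(split(input)));
    }
}
```

The negative look-behind keeps the split from firing inside a keyword, and the trailing ABCD is left alone because no hyphen follows it.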
You can try this regex:
\w+-\w+-(?:[a-z0-9]+|[A-Z]+)
Demo: https://regex101.com/r/etKBjI/3
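As a rough sketch, the pattern above can be applied with Matcher.find() to collect each record directly, instead of splitting. The class and method names are hypothetical. The pattern expects each payload to be either lower-case letters and digits or all upper-case letters, so a mixed-case payload would be cut short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordExtractor {
    // Pattern taken from the answer above.
    private static final Pattern RECORD =
            Pattern.compile("\\w+-\\w+-(?:[a-z0-9]+|[A-Z]+)");

    // Collects every KEYWORD-length-payload record found in the input.
    static List<String> extract(String input) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            records.add(m.group());
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
        extract(input).forEach(System.out::println);
    }
}
```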

How to remove duplicate characters in a string using regex?

I need to replace the duplicate characters in a string. I tried using
outputString = str.replaceAll("(.)(?=.*\\1)", "");
This replaces the duplicate characters but the position of the characters changes as shown below.
input
haih
output
aih
But I need the output hai; that is, the order in which the characters first appear in the string should not change. Given below are the expected outputs for some inputs.
input
aaaassssddddd
output
asd
input
cdddddggggeeccc
output
cdge
How can this be achieved?
It seems like your code keeps only the last occurrence of each character, so how about reversing the string first?
outputString = new StringBuilder(str).reverse().toString();
// outputString is now hiah
outputString = outputString.replaceAll("(.)(?=.*\\1)", "");
// outputString is now iah
outputString = new StringBuilder(outputString).reverse().toString();
// outputString is now hai
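Wrapped in a method, the reverse-replace-reverse idea above might look like this. The class and method names are placeholders; the sample inputs are the ones from the question.

```java
class DedupByReversal {
    // Keeps the first occurrence of each character by working on the reversed string.
    static String dedupe(String str) {
        String reversed = new StringBuilder(str).reverse().toString();
        // Removes every character that occurs again later in the reversed string.
        String kept = reversed.replaceAll("(.)(?=.*\\1)", "");
        return new StringBuilder(kept).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(dedupe("haih"));            // hai
        System.out.println(dedupe("aaaassssddddd"));   // asd
        System.out.println(dedupe("cdddddggggeeccc")); // cdge
    }
}
```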
Overview
It's possible with Oracle's implementation, but I wouldn't recommend this answer for many reasons:
It relies on a bug in the implementation, which interprets *, + or {n,} as {0, 0x7FFFFFFF}, {1, 0x7FFFFFFF}, {n, 0x7FFFFFFF} respectively, which allows the look-behind to contain such quantifiers. Since it relies on a bug, there is no guarantee that it will work the same way in the future.
It is an unmaintainable mess. Anyone with basic Java knowledge can read ordinary code, but using the regex in this answer limits the people who can understand the code at a glance to those who know the ins and outs of the regex implementation.
Therefore, this answer is for educational purpose, rather than something to be used in production code.
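For comparison, the kind of ordinary code the overview alludes to could be as simple as the following sketch. It is not part of the original answer; the class and method names are hypothetical, and it works on char values, so it does not treat surrogate pairs specially.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class DedupPlain {
    // Keeps the first occurrence of each character, preserving order.
    static String dedupe(String input) {
        // A LinkedHashSet ignores repeats and remembers insertion order.
        Set<Character> seen = new LinkedHashSet<>();
        for (char c : input.toCharArray()) {
            seen.add(c);
        }
        StringBuilder out = new StringBuilder(seen.size());
        for (char c : seen) {
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dedupe("haih")); // hai
    }
}
```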
Solution
Here is the one-liner replaceAll regex solution:
String output = input.replaceAll("(.)(?=(.*))(?<=(?=\\1.*?\\1\\2$).+)","");
Printing out the regex:
(.)(?=(.*))(?<=(?=\1.*?\1\2$).+)
What we want to do is to look-behind to see whether the same character has appeared before or not. The capturing group (.) at the beginning captures the current character, and the look-behind group is there to check whether the character has appeared before. So far, so good.
However, since the backreference \1 doesn't have an obvious length, it can't appear in the look-behind directly.
This is where we make use of the bug to look-behind up to the beginning of the string, then use a look-ahead inside the look-behind to include the backreference, as you can see (?<=(?=...).+).
This is not the end of the problem, though. While the non-assertion pattern inside look-behind .+ can't advance past the position after the character in (.), the look-ahead inside can. As a simple test:
"haaaaaaaaa".replaceAll("h(?<=(?=(.*)).*)","$1")
> "aaaaaaaaaaaaaaaaaa"
To make sure that the search doesn't spill beyond the current character, I capture the rest of the string in a look-ahead (?=(.*)) and use it to "mark" the current position (?=\\1.*?\\1\\2$).
Can this be done in one replacement without using look-behind?
I think it is impossible. We need to distinguish the first appearance of a character from subsequent appearances of the same character. While we can do this for one fixed character (e.g. a), the problem requires us to do so for all characters in the string.
For your information, this is how to remove all subsequent appearances of a fixed character (h is used here):
.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)","$1$2")
To do this for multiple characters, we must keep track of whether each character has appeared before, across matches and for all characters. The regex above shows the across-matches part, but the other condition makes this practically impossible.
We obviously can't do this in a single match, since subsequent occurrences can be non-contiguous and arbitrary in number.
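To make the fixed-character case concrete, here is a small sketch that applies the h-specific replacement quoted above to a few strings. The class name, method name and sample inputs are assumptions for illustration.

```java
class FixedCharDedup {
    // Keeps the first 'h' and removes every later one, using the regex quoted above.
    static String keepFirstH(String input) {
        return input.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(keepFirstH("haih"));   // hai
        System.out.println(keepFirstH("hahaha")); // haaa
        System.out.println(keepFirstH("abc"));    // abc
    }
}
```

The first alternative consumes everything up to and including the first h plus the following non-h characters; each later match must start where the previous one ended (\G), so every remaining h is dropped while the text between them is kept.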

combine multiple regex to extract sub string from : separated string

I have been stuck for some time developing a single regex to extract a path from any of the following strings:
1. "life:living:fast"
2. "life"
3. ":life"
4. ":life:"
I have these regex expressions to use :
"(.{3,}):", ":(.{3,}):", ":(.{3,})", "(.{3,})"
The first match is all I need, i.e. the desired result for each input is the string located where the word life is (consider life to be a variable).
But for some reason combining these individual regexes is a pain: if I execute them sequentially I get the word 'life' extracted, but I am unable to combine them into one.
I appreciate your effort.
If you want the first life with the colons, you can use this:
^:?(?:.{3,}?)(?::|$)
See demo
If you prefer the first life without the colons, switch to this:
((?<=^:)|^)([^:]{3,}?)(?=:|$)
See demo
How it Works #1: ^:?(?:.{3,}?)(?::|$)
With ^:?, at the beginning of the string, we match an optional colon
(?:.{3,}?) lazily matches three or more chars up to...
(?::|$) a colon or the end of the string
How it Works #2: ((?<=^:)|^)([^:]{3,}?)(?=:|$)
((?<=^:)|^) ensures that we are either positioned at the beginning of the string, or after a colon immediately after the beginning of the string
([^:]{3,}?) lazily matches chars that are not colons...
up to a point where the lookahead (?=:|$) can assert that what follows is a colon or the end of the string.
You can use this pattern, since you are looking for the first word:
(?<=^:?)[^:]{3,}
Note that this pattern only finds the first word; it doesn't validate the rest of the string.
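As a quick check of the colon-free variant from the first answer, the sketch below runs it against the four inputs from the question and reads the segment from group 2. The class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstSegment {
    // Second pattern from the first answer; group 2 holds the segment without colons.
    private static final Pattern SEGMENT =
            Pattern.compile("((?<=^:)|^)([^:]{3,}?)(?=:|$)");

    // Returns the first segment, or null when the input has none of at least 3 chars.
    static String first(String path) {
        Matcher m = SEGMENT.matcher(path);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {"life:living:fast", "life", ":life", ":life:"};
        for (String input : inputs) {
            System.out.println(input + " -> " + first(input)); // each prints "... -> life"
        }
    }
}
```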

Regular expression removing all words shorter than n

Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.
I thought something like \s\w{1,2}\s would grab all the 1 and 2 letter words (a whitespace, one to two word characters and another whitespace), but it just doesn't work.
Where am I wrong?
I've got it working fairly well, but it took two passes.
public static void main(String[] args) {
    String passage = "Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.";
    System.out.println(passage);
    passage = passage.replaceAll("\\b[\\w']{1,2}\\b", "");
    passage = passage.replaceAll("\\s{2,}", " ");
    System.out.println(passage);
}
The first pass removes all words containing fewer than three characters. Note that I had to include the apostrophe in the character class because the word "I'm" was giving me trouble without it. You may find other special characters in your text that you also need to include here.
The second pass is necessary because the first pass left a few spots where there were double spaces. This just collapses all occurrences of 2 or more spaces down to one. It's up to you whether you need to keep this or not, but I think it's better with the spaces collapsed.
Output:
Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.
Well, looking for regexp Java that deletes all words shorter than characters.
If you don't want the whitespace matched, you might want to use
\b\w{1,2}\b
to get the word boundaries.
That's working for me in RegexBuddy using the Java flavor; for the test string
"The dog is fun a cat"
it highlights "is" and "a". Similarly for words at the beginning/end of a line.
You might want to post a code sample.
(And, as GameFreak just posted, you'll still end up with double spaces.)
EDIT:
\b\w{1,2}\b\s?
is another option. This will partially fix the space-stripping issue, although words at the end of a string or followed by punctuation can still cause issues. For example, "A dog is fun no?" becomes "dog fun ?" In any case, you're still going to have issues with capitalization (dog should now be Dog).
Try: \b\w{1,2}\b although you will still have to get rid of the double spaces that will show up.
If you have a string like this:
hello there my this is a short word
This regex will match all words in the string greater than or equal to 3 characters in length:
\w{3,}
Resulting in:
hello there this short word
That, to me, is the easiest approach. Why try to match what you don't want, when you can match what you want a lot easier? No double spaces, no leftovers, and the punctuation is under your control. The other approaches break on multiple spaces and aren't very robust.
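A minimal sketch of this match-what-you-want approach, with hypothetical class and method names, collects every match of \w{3,} and joins the results with single spaces:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongWords {
    private static final Pattern LONG_WORD = Pattern.compile("\\w{3,}");

    // Keeps only runs of three or more word characters, joined by single spaces.
    static String keepLongWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = LONG_WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(keepLongWords("hello there my this is a short word"));
        // hello there this short word
    }
}
```

Because only word characters are matched, punctuation is dropped from the result unless it is handled separately.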
