How to remove specific repeated characters from text? - java

I have a String like
"this is line 1\n\n\nthis is line 2\n\n\nthis is line 3\t\t\tthis is line 3 also"
What I want to do is remove repeated specific characters like "\n", "\t" from this text.
"this is line 1\nthis is line 2\nthis is line 3\tthis is line 3 also"
I tried some regular expressions, but they didn't work for me.
text = text.replace("/[^\\w\\s]|(.)\\1/gi", "");
Is there any regex for this?

If you only need to remove specific whitespace chars, \s won't help because it overmatches: it will also match spaces, carriage returns, form feeds, etc.
You may use a character class with the chars, wrap it in a capturing group, and use a backreference to the captured value to match its repetitions. Then replace with a backreference to the Group 1 value:
.replaceAll("([\n\t])\\1+", "$1")
See the regex demo.
Details
([\n\t]) - Group 1 (referred to with \1 from the pattern and $1 from the replacement pattern): a character class matching either a newline or a tab
\1+ - one or more repetitions of the value in Group 1.
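As a minimal sketch of the replacement above, applied to the string from the question (the class and method names here are illustrative, not from the original answer):

```java
class CollapseRepeats {
    // Collapse runs of the same newline or tab character into a single one.
    static String collapse(String text) {
        // ([\n\t]) captures one newline or tab; \\1+ matches further copies of that same character.
        return text.replaceAll("([\n\t])\\1+", "$1");
    }

    public static void main(String[] args) {
        String text = "this is line 1\n\n\nthis is line 2\n\n\nthis is line 3\t\t\tthis is line 3 also";
        System.out.println(collapse(text));
    }
}
```

Because the backreference must repeat the captured character, a newline followed by a tab is left alone; only runs of the same character are collapsed.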

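I would use Guava's CharMatcher (note that this removes every ISO control character, including single \n and \t, rather than collapsing repeats):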
CharMatcher.javaIsoControl().removeFrom(myString)

Related

Java Regex, remove leading spaces of each line

I want to use Java's String.replaceAll(regex, "") to remove all leading spaces from each line in a multi-line text string and, if possible, remove all carriage returns as well. What should the regex be?
Converting my comment to an answer so that the solution is easy to find for future visitors.
You may use regex replacement in Java:
str = str.replaceAll("(?m)^\\s+|\\s+$", "");
RegEx Details:
(?m): Enable MULTILINE mode so that ^ and $ are matched in every line.
^\\s+: Match 1+ whitespaces at line start
|: OR
\\s+$: Match 1+ whitespaces before line end
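A small sketch of this replacement on a multi-line string, assuming \n line breaks (the class and method names are made up for illustration):

```java
class TrimLines {
    // Strip leading and trailing whitespace from every line of a multi-line string.
    static String trimLines(String str) {
        // (?m) makes ^ and $ match at each line start and line end.
        return str.replaceAll("(?m)^\\s+|\\s+$", "");
    }

    public static void main(String[] args) {
        String input = "  foo  \n\tbar\n   baz";
        System.out.println(trimLines(input));
    }
}
```

Since \s also matches line breaks, blank lines between content lines are swallowed by this pattern as well.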

Replace URL String with Integer characters located in the end of that String

I have a URL and tried to replace the whole string with the integer at the end of the link using regex.
The URL is something like
https://some.storage.com/test123456.bucket.com/folder/80.png
Regex I tried to use:
Integer.parseInt(string.replaceAll(".*[^\\d](\\d+)", "$1"))
The output for that regex is "80.png", and I need only "80". I also tried this tool: https://regex101.com. As far as I can see, the main problem is that ".png" is not matched by my regex, so after substitution this part is left appended to the captured group.
I'm a total noob in regex, so I kindly ask you for help.
You may use
String result = string.replaceAll("(?:.*\\D)?(\\d+).*", "$1");
See the regex demo.
NOTE: If there is no match, the result will be equal to the string value. If you do not want this behavior, instead of "(?:.*\\D)?(\\d+).*", use "(?:.*\\D)?(\\d+).*|.+".
Details
(?:.*\D)? - an optional (it must be optional because the Group 1 pattern might also be matched at the start of the string) sequence of
.* - any 0+ chars other than line break chars, as many as possible
\D - a non-digit
(\d+) - Group 1: any one or more digits
.* - any 0+ chars other than line break chars, as many as possible
The replacement is $1, the backreference to the Group 1 value, which is the last chunk of 1+ digits in a string that has no line breaks.
Line breaks can be supported if you prepend the pattern with the (?s) inline DOTALL modifier, i.e. "(?s)(?:.*\\D)?(\\d+).*|.+".
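To illustrate both variants from this answer, here is a hedged sketch using the URL from the question (the class and method names are hypothetical):

```java
class LastNumber {
    // Keep only the last run of digits; a string without digits is returned unchanged.
    static String lastDigits(String s) {
        return s.replaceAll("(?:.*\\D)?(\\d+).*", "$1");
    }

    // Same, but the extra |.+ branch turns a string without digits into "".
    static String lastDigitsOrEmpty(String s) {
        return s.replaceAll("(?:.*\\D)?(\\d+).*|.+", "$1");
    }

    public static void main(String[] args) {
        String url = "https://some.storage.com/test123456.bucket.com/folder/80.png";
        int number = Integer.parseInt(lastDigits(url));
        System.out.println(number);
    }
}
```

With the second variant, Integer.parseInt would still throw a NumberFormatException on the empty result, so callers may want to check for "" first.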

How to extract and replace a String with specific format?

I have an input String like:
(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)
What I want to do is find all words starting with "rm" and replace them with a call to a remove function.
(remove(01ADS21212), 'adfffddd', remove(Adssssss), '1231232131', remove(2321312322))
I am trying to use the replaceAll function, but I don't know how to extract the parts after the "rm" literal.
statement.replaceAll("\\(rm*.,", "remove($1)");
Is there any way to get these parts?
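You have not captured any substring with a capturing group, so there is no Group 1 for $1 to refer to (when a match is found, Java throws an IndexOutOfBoundsException for a reference to a nonexistent group).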
You may use
.replaceAll("\\brm(\\w*)", "remove($1)")
See the regex demo
Details
\b - a word boundary (to start matching only at the start of a word)
rm - a literal part
(\w*) - Group 1: 0+ word chars (letters, digits or underscores)
The $1 in the replacement pattern stands for Group 1 value.
If you mean to match any chars other than a comma and whitespace after rm, use "\\brm([^\\s,]*)", see this regex demo.
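A short sketch of the \w-based replacement on the input from the question (class and method names are illustrative only):

```java
class RmToRemove {
    // Wrap the part after a word-initial "rm" in remove(...).
    static String rewrite(String statement) {
        // \\b anchors at a word start, (\\w*) captures the word characters after "rm".
        return statement.replaceAll("\\brm(\\w*)", "remove($1)");
    }

    public static void main(String[] args) {
        String input = "(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)";
        System.out.println(rewrite(input));
    }
}
```

The word boundary keeps an "rm" in the middle of a word, such as in "farm", from being rewritten.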
Use "Replace" with an empty string (C# shown below; note that this only strips "rm" and does not add the remove(...) wrapper).
E.g.:
string str = "(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)";
Console.WriteLine(str.Replace("rm", ""));
Output : (01ADS21212, 'adfffddd', Adssssss, '1231232131', 2321312322)

Replace substring of text matching regexp

I have text that looks like something like this:
1. Must have experience in Java 2. Team leader...
I want to render this in HTML as an ordered list. Now adding the </li> tag to the end is simple enough:
s = replace(s, ". ", "</li>");
But how do I go about replacing the 1., 2. etc with <li>?
I have the regular expression \d*\.$ which matches a number with a period, but the problem is that it is a substring, so matching 1. Must have experience in Java 2. Team leader with \d*\.$ returns false.
Code
See regex in use here
\d+\.\s+(.*?)\s*(?=\d+\.\s+|$)
Replace
<li>$1</li>\n
Results
Input
1. Must have experience in Java 2. Team leader...
Output
<li>Must have experience in Java</li>
<li>Team leader...</li>
Explanation
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
(.*?) Capture any character any number of times, but as few as possible, into capture group 1
\s* Match any number of whitespace characters
(?=\d+\.\s+|$) Positive lookahead ensuring that either of the following matches
\d+\.\s+
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
$ Assert position at the end of the line
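The answer gives the pattern without host-language code; in Java it could be applied roughly as follows (a sketch with hypothetical names; backslashes are doubled for the Java string literal):

```java
class NumberedToListItems {
    // Turn "1. foo 2. bar" style text into one <li>...</li> line per item.
    static String toListItems(String s) {
        // The lazy (.*?) stops as soon as the lookahead sees the next "N. " marker or the end.
        return s.replaceAll("\\d+\\.\\s+(.*?)\\s*(?=\\d+\\.\\s+|$)", "<li>$1</li>\n");
    }

    public static void main(String[] args) {
        String input = "1. Must have experience in Java 2. Team leader...";
        System.out.print("<ol>\n" + toListItems(input) + "</ol>\n");
    }
}
```

An item whose text itself contains a number followed by a period and whitespace would be split at that point, so this assumes item text free of such sequences.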
But how do I go about replacing the 1., 2. etc with <li>?
You can use String#replaceAll, which accepts a regex, instead of replace:
s = s.replaceAll("\\d+\\.\\s", "<li>");
Note
You don't need to use $ at the end of your regex.
You have to escape the dot . because it means any character in regex.
You can use \s for one whitespace character, \s* for zero or more, or \s+ for one or more.
We want
<ol>
<li>one</li>
<li>two</li>
</ol>
This can be done as:
s = s.replaceAll("(?s)(\\d+\\.)\\s+(.*?\\.)\\s*", "<li>$2</li></ol>");
s = s.replaceFirst("<li>", "<ol><li>");
s = s.replaceAll("(?s)</li></ol><li>", "</li>\n<li>");
The trick is to first add </li></ol> with a spurious </ol> that should only remain after the last list item.
(?s) is the DOTALL notation, causing . to also match line breaks.
In case of more than one numbered list this will not do. Also it assumes one single sentence per list item.
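The three steps can be sketched end to end as below. This is an illustration under two assumptions of mine: every item is a single sentence ending in a period, and the item group uses a reluctant .*? so that it stops at the first period (a greedy .* would run to the last period in the text). The class and method names are hypothetical.

```java
class OrderedListSketch {
    static String toOrderedList(String s) {
        // Step 1: wrap each "N. sentence." in <li>...</li> plus a provisional </ol>.
        s = s.replaceAll("(?s)(\\d+\\.)\\s+(.*?\\.)\\s*", "<li>$2</li></ol>");
        // Step 2: open the list before the first item.
        s = s.replaceFirst("<li>", "<ol><li>");
        // Step 3: drop every provisional </ol> that is followed by another item.
        s = s.replaceAll("(?s)</li></ol><li>", "</li>\n<li>");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(toOrderedList("1. First item. 2. Second item."));
    }
}
```

Only the last </ol> survives step 3, which is what closes the list.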

Java Regex does not match newline

My code is as follows:
public class Test {
static String REGEX = ".*([ |\t|\r\n|\r|\n]).*";
static String st = "abcd\r\nefgh";
public static void main(String args[]){
System.out.println(st.matches(REGEX));
}
}
The code outputs false. In any other cases it matches as expected, but I can't figure out what the problem here is.
You need to remove the character class.
static String REGEX = ".*( |\t|\r\n|\r|\n).*";
You can't put \r\n inside a character class as a sequence. Inside the brackets it is treated as two separate items, \r and \n, so the class matches either \r or \n (and | is a literal pipe there, not alternation). You already know that .* won't match line breaks, so the first .* matches abcd and the character class matches a single character, i.e. \r. The next character is \n, which the final .* can't match, so the regex fails.
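The difference between the two patterns can be checked with a small sketch like this (the class, constant, and method names are made up for the example):

```java
class LineBreakMatchDemo {
    // Character class: matches exactly ONE of space, |, tab, \r, \n.
    static final String WITH_CLASS = ".*([ |\t|\r\n|\r|\n]).*";
    // Alternation in a group: \r\n can be consumed as a two-character sequence.
    static final String WITH_GROUP = ".*( |\t|\r\n|\r|\n).*";

    static boolean matchesWithClass(String s) {
        return s.matches(WITH_CLASS);
    }

    static boolean matchesWithGroup(String s) {
        return s.matches(WITH_GROUP);
    }

    public static void main(String[] args) {
        String st = "abcd\r\nefgh";
        System.out.println(matchesWithClass(st)); // false
        System.out.println(matchesWithGroup(st)); // true
    }
}
```

The character-class version does match input with a single \n, and even a string such as "ab|cd", because | is just another member of the class.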
UPDATE:
Based on your comments, you need something like this:
.*(?:[ \r\n\t].*)+
EXPLANATION:
In plain words, it is a regex that matches a line, then 1 or more lines. Or, just a multiline text.
.* - 0 or more characters other than a newline
(?:[ \r\n\t].*)+ - a non-capturing group that matches 1 or more times a sequence of
[ \r\n\t] - either a space, or a \r or \n or \t
.* - 0 or more characters other than a newline
See demo
Original answer
You can fix your pattern 2 ways:
String REGEX = ".*(?:\r\n|[ \t\r\n]).*";
This way we match either \r\n sequence, or any character in the character class.
Or, since the character class only matches 1 character, we can add + after it to match 1 or more:
String REGEX = ".*[ \t\r\n]+.*";
See IDEONE demo
Note that it is not a good idea to use single characters in alternations, because it decreases performance.
Also note that capturing groups should not be overused. If you do not plan to use the contents of the group, use non-capturing groups ((?:...)), or remove them.
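For comparison, the three patterns from this answer can be exercised side by side. The sketch below uses hypothetical names and a few sample inputs of my own in addition to the string from the question:

```java
class WhitespacePatterns {
    // Either a \r\n pair or one whitespace character from the class.
    static final String PAIR_OR_SINGLE = ".*(?:\r\n|[ \t\r\n]).*";
    // One or more whitespace characters from the class.
    static final String ONE_OR_MORE = ".*[ \t\r\n]+.*";
    // A line followed by one or more (whitespace char + rest of line) chunks.
    static final String MULTI_CHUNK = ".*(?:[ \r\n\t].*)+";

    static boolean test(String regex, String s) {
        return s.matches(regex);
    }

    public static void main(String[] args) {
        String st = "abcd\r\nefgh";
        System.out.println(test(PAIR_OR_SINGLE, st)); // true
        System.out.println(test(ONE_OR_MORE, st));    // true
        System.out.println(test(MULTI_CHUNK, st));    // true
    }
}
```

The first pattern consumes at most one line break, so it rejects input with two consecutive \n characters, while the other two accept it.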
