I have an Arabic string from which I need to remove all special characters, Latin letters, punctuation, e.g. (, . ;), and Arabic punctuation, e.g. (َ ً ُ ِ). I have written the following code:
String input = "some text";
Pattern p = Pattern.compile("[\\p{P}\\w]");
java.util.regex.Matcher m = p.matcher(input);
while (m.find()) {
}
m.reset();
input = m.replaceAll(" ");
p = Pattern.compile("[\\p{Mn}\\p{Nd}\\p{InLatin-1Supplement}]+");
m = p.matcher(input);
while (m.find()) {
}
m.reset();
input = m.replaceAll("");
It worked well for almost all characters, but I still have problems removing or replacing these: ($ ^ + < > |). I don't want to remove each one separately by repeating the replaceAll statement. I even tried:
Pattern p = Pattern.compile("[^\\p{L}\\p{Nd}]+");
but I kept finding those characters ($ ^ + < > |) in the resulting text. Is there any way to do it?
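One thing that may explain the leftovers: $ ^ + < > | belong to the Unicode Symbol categories (Sc, Sk, Sm), which \p{P} does not cover. Below is a minimal sketch, assuming the goal is to keep only Arabic letters separated by single spaces. The class name ArabicCleaner and the method clean are hypothetical, not part of the original post.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class ArabicCleaner {
    // Combining marks (M), which include the Arabic diacritics; removed outright
    // so that words are not split apart.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");
    // Punctuation (P), symbols (S, which covers $ ^ + < > |), numbers (N)
    // and Latin-script characters; each run is replaced by one space.
    private static final Pattern UNWANTED =
            Pattern.compile("[\\p{P}\\p{S}\\p{N}\\p{IsLatin}]+");

    static String clean(String input) {
        String noMarks = MARKS.matcher(input).replaceAll("");
        String spaced = UNWANTED.matcher(noMarks).replaceAll(" ");
        return spaced.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Latin letters, symbols, a vocalized Arabic word, an Arabic comma, digits.
        String sample = "a$^+<>|b \u0645\u064E\u0631\u0652\u062D\u0628\u064B\u0627\u060C 12";
        System.out.println(clean(sample));
    }
}
```

Removing the marks before the other characters mirrors the two-step approach in the question: diacritics vanish without leaving a gap, while everything else becomes a separator.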
Related
CharSequence content = new StringBuffer("aaabbbccaaa");
String pattern = "([a-zA-Z])\\1\\1+";
String replace = "-";
Pattern patt = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = patt.matcher(content);
boolean isMatch = matcher.find();
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < content.length(); i++) {
while (matcher.find()) {
matcher.appendReplacement(buffer, replace);
}
}
matcher.appendTail(buffer);
System.out.println(buffer.toString());
In the above code, content is the input string.
I am trying to find repeated occurrences of a character in the string and cap each run at a maximum number of occurrences.
For example:
input -("abaaadccc",2)
output - "abaadcc"
Here aaa and ccc are replaced by aa and cc, as the maximum allowed repetition is 2.
In the above code, I found such occurrences and tried replacing them with -, and that works. But how can I get the current character and replace the run with the allowed number of occurrences?
i.e. if aaa is found, it is replaced by aa.
Or is there an alternative method without using regex?
You can declare the second group in a regex and use it as a replacement:
String result = "aaabbbccaaa".replaceAll("(([a-zA-Z])\\2)\\2+", "$1");
Here's how it works:
( first group - a character repeated two times
([a-zA-Z]) second group - a character
\2 a character repeated once
)
\2+ a character repeated at least once more
Thus, the first group captures a replacement string.
It isn't hard to extrapolate this solution for a different maximum value of allowed repeats:
String input = "aaaaabbcccccaaa";
int maxRepeats = 4;
String pattern = String.format("(([a-zA-Z])\\2{%s})\\2+", maxRepeats-1);
String result = input.replaceAll(pattern, "$1");
System.out.println(result); //aaaabbccccaaa
Since you defined a group in your regex, you can get the matching characters of this group by calling matcher.group(1). In your case it contains the first character from the repeating group so by appending it twice you get your expected result.
CharSequence content = new StringBuffer("aaabbbccaaa");
String pattern = "([a-zA-Z])\\1\\1+";
Pattern patt = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = patt.matcher(content);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
System.out.println("found : "+matcher.start()+","+matcher.end()+":"+matcher.group(1));
matcher.appendReplacement(buffer, matcher.group(1)+matcher.group(1));
}
matcher.appendTail(buffer);
System.out.println(buffer.toString());
Output:
found : 0,3:a
found : 3,6:b
found : 8,11:a
aabbccaa
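Since the question also asks for an alternative without regex, here is a minimal sketch that counts the length of the current run while copying characters. The names RepeatLimiter and limitRepeats are hypothetical. Unlike the [a-zA-Z] patterns above, this version caps runs of any character, not only letters.

```java
// Hypothetical helper for illustration only.
final class RepeatLimiter {
    // Copies input, keeping at most maxRepeats consecutive copies of a character.
    static String limitRepeats(String input, int maxRepeats) {
        StringBuilder out = new StringBuilder();
        int run = 0; // length of the current run of identical characters
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            run = (i > 0 && c == input.charAt(i - 1)) ? run + 1 : 1;
            if (run <= maxRepeats) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(limitRepeats("abaaadccc", 2)); // abaadcc
    }
}
```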
Below is my Java code to delete all pairs of adjacent matching letters, but I am running into problems with the Java Matcher class.
My Approach
I am trying to find all successive repeated characters in the input e.g.
aaa, bb, ccc, ddd
Next replace the odd length match with the last matched pattern and even length match with "" i.e.
aaa -> a
bb -> ""
ccc -> c
ddd -> d
s occurs only once, so it is not matched by the regex pattern and is excluded from the substitution.
I am calling Matcher.appendReplacement to do conditional replacement of the patterns matched in input, based on the group length (even or odd).
Code:
public static void main(String[] args) {
String s = "aaabbcccddds";
int i=0;
StringBuffer output = new StringBuffer();
Pattern repeatedChars = Pattern.compile("([a-z])\\1+");
Matcher m = repeatedChars.matcher(s);
while(m.find()) {
if(m.group(i).length()%2==0)
m.appendReplacement(output, "");
else
m.appendReplacement(output, "$1");
i++;
}
m.appendTail(output);
System.out.println(output);
}
Input : aaabbcccddds
Actual Output : aaabbcccds (only replacing ddd with d but skipping aaa, bb and ccc)
Expected Output : acds
This can be done in a single replaceAll call like this:
String repl = str.replaceAll( "(?:(.)\\1)+", "" );
The regex (?:(.)\\1)+ matches every run of adjacent identical pairs and replaces it with the empty string, thus leaving us with one character from each odd-length run.
Code using Pattern and Matcher:
final Pattern p = Pattern.compile( "(?:(.)\\1)+" );
Matcher m = p.matcher( "aaabbcccddds" );
String repl = m.replaceAll( "" );
//=> acds
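A small sketch of how this pattern behaves on a few inputs, wrapped in a hypothetical removePairs helper. One point worth noting: replaceAll makes a single pass, so pairs that only become adjacent after a removal are not collapsed again.

```java
// Hypothetical wrapper around the pattern from the answer above.
final class PairRemovalDemo {
    static String removePairs(String s) {
        // Each iteration of the group consumes one pair of identical characters.
        return s.replaceAll("(?:(.)\\1)+", "");
    }

    public static void main(String[] args) {
        System.out.println(removePairs("aaabbcccddds")); // acds
        System.out.println(removePairs("aabb"));         // empty string
        System.out.println(removePairs("abba"));         // aa (the outer pair is not re-scanned)
    }
}
```

If cascading removal were wanted (so that "abba" ends up empty), the call would have to be repeated until the string stops changing; that is a different requirement from the one stated in the question.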
You can try it like this:
public static void main(String[] args) {
String s = "aaabbcccddds";
StringBuffer output = new StringBuffer();
Pattern repeatedChars = Pattern.compile("(\\w)(\\1+)");
Matcher m = repeatedChars.matcher(s);
while(m.find()) {
if(m.group(2).length()%2!=0)
m.appendReplacement(output, "");
else
m.appendReplacement(output, "$1");
}
m.appendTail(output);
System.out.println(output);
}
It is similar to yours, but group 1 holds only the first character, so its length is always 1. That's why I introduced a second group, which captures the remaining adjacent repeats. Since its length is one less than the whole run, I reversed the odd/even logic, and voila:
acds
is printed.
You don't need multiple if statements. Try:
(?:(\\w)(?:\\1\\1)+|(\\w)\\2+)(?!\\1|\\2)
Replace with $1
Java code:
str.replaceAll("(?:(\\w)(?:\\1\\1)+|(\\w)\\2+)(?!\\1|\\2)", "$1");
Regex breakdown:
(?: Start of non-capturing group
(\\w) Capture a word character
(?:\\1\\1)+ Match an even number of same character
| Or
(\\w) Capture a word character
\\2+ Match any number of same character
) End of non-capturing group
(?!\\1|\\2) Not followed by previous captured characters
Using Pattern and Matcher with StringBuffer:
StringBuffer output = new StringBuffer();
Pattern repeatedChars = Pattern.compile("(?:(\\w)(?:\\1\\1)+|(\\w)\\2+)(?!\\1|\\2)");
Matcher m = repeatedChars.matcher(s);
while(m.find()) m.appendReplacement(output, "$1");
m.appendTail(output);
System.out.println(output);
I am very stuck. I use this format to read a player's name in a string, like so:
"[PLAYER_yourname]"
I have tried for a few hours and can't figure out how to read only the part after the '_' and before the ']' to get their name.
Could I have some help? I played around with sub strings, splitting, some regex and no luck. Thanks! :)
BTW: This question is different, if I split by _ I don't know how to stop at the second bracket, as I have other string lines past the second bracket. Thanks!
You can do:
String s = "[PLAYER_yourname]";
String name = s.substring(s.indexOf("_") + 1, s.lastIndexOf("]"));
You can use substring. int x = str.indexOf('_') gives you the index where the '_' is found, and int y = str.lastIndexOf(']') gives you the index where the last ']' is found. Then str.substring(x + 1, y) gives you the text from just after the underscore up to, but not including, the closing bracket.
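Because the question mentions more text after the closing bracket, a variant that searches for the first ']' after the underscore may be safer than lastIndexOf. This is a minimal sketch; the names PlayerName and extractName are hypothetical, and returning null for malformed input is an assumption about the desired behaviour.

```java
// Hypothetical helper for illustration only.
final class PlayerName {
    // Returns the text between the first '_' and the next ']', or null if either is missing.
    static String extractName(String line) {
        int underscore = line.indexOf('_');
        if (underscore < 0) {
            return null;
        }
        int close = line.indexOf(']', underscore + 1);
        if (close < 0) {
            return null;
        }
        return line.substring(underscore + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(extractName("[PLAYER_yourname] other [text]")); // yourname
    }
}
```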
Using the regex matcher functions you could do:
String s = "[PLAYER_yourname]";
String p = "\\[[A-Z]+_(.+)\\]";
Pattern r = Pattern.compile(p);
Matcher m = r.matcher(s);
if (m.find( ))
System.out.println(m.group(1));
Result:
yourname
Explanation:
\[ matches the character [ literally
[A-Z]+ matches one or more uppercase letters A-Z (case sensitive)
_ matches the character _ literally
(.+) 1st capturing group: matches any character (except newline), one or more times
\] matches the character ] literally
This solution uses Java regex
String player = "[PLAYER_yourname]";
Pattern PLAYER_PATTERN = Pattern.compile("^\\[PLAYER_(.*?)]$");
Matcher matcher = PLAYER_PATTERN.matcher(player);
if (matcher.matches()) {
System.out.println( matcher.group(1) );
}
// prints yourname
You can do it like this:
public static void main(String[] args) throws InterruptedException {
String s = "[PLAYER_yourname]";
System.out.println(s.split("[_\\]]")[1]);
}
output: yourname
Try:
Pattern pattern = Pattern.compile(".*?_([^\\]]+)");
Matcher m = pattern.matcher("[PLAYER_yourname]");
if (m.find()) { // find(), not matches(): the pattern does not consume the trailing ']'
String name = m.group(1);
// name = "yourname"
}
This code doesn't seem to do the right job. It removes the spaces between the words!
input = scan.nextLine().replaceAll("[^A-Za-z0-9]", "");
I want to remove all extra spaces and all numbers or abbreviations from a string, except words and this character: '.
For Example:
input: 34 4fF$##D one 233 r # o'clock 329riewio23
returns: one o'clock
public static String filter(String input) {
return input.replaceAll("[^A-Za-z0-9' ]", "").replaceAll(" +", " ");
}
The first replaceAll removes every character except letters, digits, the single quote, and spaces. The second collapses each run of one or more spaces into a single space. Note that digits are kept by this character class; remove 0-9 from it if numbers should be dropped as well.
Your solution doesn't work because you don't replace numbers and you also replace the ' character.
Check out this solution:
Pattern pattern = Pattern.compile("[^| ][A-Za-z']{2,} ");
String input = scan.nextLine();
Matcher matcher = pattern.matcher(input);
StringBuilder result = new StringBuilder();
while (matcher.find()) {
result.append(matcher.group());
}
System.out.println(result.toString());
Note that [^| ] is a negated character class: it matches any single character other than | or a space, not "the beginning of the string or a space". It is followed by two or more letters or apostrophes ([A-Za-z']{2,}) and a trailing space, so a word is only picked up if it is at least three characters long and is followed by a space.
If you want to just extract that time information use this regex group match:
input = scan.nextLine();
Pattern p = Pattern.compile("([a-zA-Z]{3,})\\s.*?(o'clock)");
Matcher m = p.matcher(input);
if (m.find()) {
input = m.group(1) + " " + m.group(2);
}
The regex is quite naive though, and will only work if the input is always of a similar format.
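An alternative that avoids the trailing-space requirement is to split on whitespace and keep only the tokens that look like words. This is a minimal sketch under the assumption that a "word" is two or more letters or apostrophes, so single letters such as r are treated as abbreviations and dropped. The names WordFilter and keepWords are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class WordFilter {
    // Keeps whitespace-separated tokens made only of letters and apostrophes (length >= 2).
    static String keepWords(String input) {
        List<String> kept = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            if (token.matches("[A-Za-z']{2,}")) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(keepWords("34 4fF$##D one 233 r # o'clock 329riewio23")); // one o'clock
    }
}
```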
String regex = "(\\s*T\\s*R\\s*A\\s*)*";
Pattern p = Pattern.compile(regex);
Trying to match "TRA", "T R A", "T R A", etc. Works fine for first case, with no spaces, but not for anything with spaces (just ignores). Not sure what I'm doing wrong.
EDIT
Essentially, I'm trying to match all occurrences of TRA, whether or not there are an arbitrary number of spaces between each letter (or occurrence).
For example: "TRATTR A T RA T RA" has 4 occurrences, and I want to match them all with one regex.
You should use:
String regex = "(\\s*T\\s*R\\s*A\\s*)";
instead of:
String regex = "(\\s*T\\s*R\\s*A\\s*)*";
Your regex is trying to match 0 or more occurrences of the given text and as per your question you're just trying to match it once.
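To see what "0 or more occurrences" means in practice, the sketch below compares the first match found by the * and + versions of the pattern. The firstMatch helper is hypothetical. With *, the first find() succeeds immediately with an empty match at index 0, because zero repetitions are allowed there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class StarQuantifierDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "S T R A";
        // Zero-or-more: matches the empty string before 'S'.
        System.out.println("[" + firstMatch("(\\s*T\\s*R\\s*A\\s*)*", text) + "]");
        // One-or-more: skips ahead to the real occurrence.
        System.out.println("[" + firstMatch("(\\s*T\\s*R\\s*A\\s*)+", text) + "]");
    }
}
```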
Update: To match multiple occurrences use code like this:
String regex = "(\\s*T\\s*R\\s*A\\s*)";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher("T R A T R A T R A");
while (m.find())
System.out.printf("name=[%s]%n", m.group(1));
For your goal, the correct regex would be (\\s*T\\s*R\\s*A\\s*)+, as it requires at least one occurrence of the TRA group and won't match the empty string.
Example:
String regex = "(\\s*T\\s*R\\s*A\\s*)+";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher("S T R A T R A T R A N G E");
if (m.find()) {
System.out.println(m.group());
} else {
System.out.println("No match");
}
Output:
T R A T R A T R A
This works for me:
String regex = "(\\s*T\\s*R\\s*A\\s*)";
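To count the occurrences from the question's example, a find() loop over a pattern without the outer quantifier is enough. This is a minimal sketch; the names TraCounter and countTra are hypothetical, and the leading and trailing \s* are dropped because they are not needed for counting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class TraCounter {
    // T, R, A with any amount of whitespace between the letters.
    private static final Pattern TRA = Pattern.compile("T\\s*R\\s*A");

    static int countTra(String text) {
        Matcher m = TRA.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTra("TRATTR A T RA T RA")); // 4
    }
}
```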