Split a string with multiple delimiters while keeping these delimiters - java

Let's say we have a string:
String x = "a| b |c& d^ e|f";
What I want to obtain looks like this:
a
|
b
|
c
&
d
^
e
|
f
How can I achieve this by using x.split(regex)? Namely what is the regex?
I tried this link: How to split a string, but also keep the delimiters?
It gives a very good explanation on how to do it with one delimiter.
However, using and modifying that to fit multiple delimiters (lookahead and lookbehind mechanism) is not that obvious to someone who is not familiar with regex.

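The regex for splitting on optional whitespace after a word boundary is
\\b\\s*
Note that \\b matches at a word boundary, i.e. a position between a word character (a letter, a digit or an underscore) and a non-word character or the start/end of the string; \\s* then matches any number of whitespace characters that follow, and the string is split there. Whitespace that comes before a word boundary (such as the space in "| b") is not consumed, so some tokens keep a trailing space.
Here is sample Java code (demo on IDEONE):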
String str = "a| b |c& d^ e|f";
String regex = "\\b\\s*";
String[] spts = str.split(regex);
for(int i =0; i < spts.length && i < 20; i++)
{
System.out.println(spts[i]);
}
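The lookahead/lookbehind technique from the linked question extends to several delimiters by listing them in a character class. Here is a minimal sketch, assuming the delimiters are exactly |, & and ^ and that surrounding whitespace should be discarded; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitKeepDelimiters {
    // Split at every position directly after or directly before a delimiter.
    // Both alternatives are zero-width, so the delimiters themselves are kept.
    static final String AROUND_DELIMITER = "(?<=[|&^])|(?=[|&^])";

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String part : s.split(AROUND_DELIMITER)) {
            String token = part.trim();   // drop the optional spaces
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("a| b |c& d^ e|f")) {
            System.out.println(token);
        }
    }
}
```

Inside a character class, | and & need no escaping, and ^ is literal as long as it is not the first character.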

Related

Is there a regex expression suitable for splitting two one digit numbers separated by a whitespace?

I'm prompting the user for input using the java scanner. The input needs to be two single digits from 0-2 separated by a whitespace (eg. "1 2").
When I try to split "1 2" with \s, I get an ArrayIndexOutOfBoundsException,
whereas splitting "1-2" with \- works perfectly fine.
I'm completely new to regex and would really appreciate some help :)
My code:
public void x() {
int n = -1;
Scanner scanner = new Scanner(System.in);
System.out.println("Pick your coordinates. X goes first. Eg. 1 1");
String input = scanner.nextLine();
for (int i = 0; i <= 2; i++) {
// input = input.replaceAll("\\s", "-").toLowerCase();
parts = input.split("\\d{1}\\s\\d{1}");
String x = parts[0];
String y = parts[1];
Pattern pattern = Pattern.compile(String.valueOf(i));
Matcher matcher = pattern.matcher(x);
boolean matchFound = matcher.find();
if (matchFound) {
break;
} else {
n = i;
System.out.println("match N");
}
}
System.out.println("t");
}
In IntelliJ IDEA, if you hover the mouse over split, you can see the documentation of the method:
public String[] split(String regex)
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.
The string "boo:and:foo", for example, yields the following results
with these expressions:
+-------+-------------------------+
| Regex | Result                  |
+-------+-------------------------+
| :     | { "boo", "and", "foo" } |
| o     | { "b", "", ":and:f" }   |
+-------+-------------------------+
Parameters:
regex - the delimiting regular expression
Returns:
the array of strings computed by splitting this string around matches
of the given regular expression
Therefore, you only have to pass in \\s to split by any space character.
Try
parts = input.split("\\s");
I believe you're looking for something like this:
\d{1}\s\d{1}
\d{1} = Match any digit, 1 time
\s = Followed by Any whitespace
According to documentation:
split(String regex)
Splits this string around matches of the given regular expression.
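So, you are splitting the string in the wrong way: your pattern matches the whole input "1 2", so nothing is left on either side of the match and the resulting array is empty.
If you want to validate the input against that pattern, you can do it like this: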
input.matches("\\d{1}\\s\\d{1}")
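Validation and splitting can be combined: check the whole line with matches() first, then split on the whitespace. The sketch below assumes the digits are limited to 0-2 as the question states; the class and method names are hypothetical.

```java
class CoordinateInput {
    // Returns {x, y}, or null when the input is not two digits in 0-2
    // separated by exactly one whitespace character.
    static int[] parse(String input) {
        if (!input.matches("[0-2]\\s[0-2]")) {
            return null;
        }
        String[] parts = input.split("\\s");   // the delimiter is the whitespace only
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    public static void main(String[] args) {
        int[] xy = parse("1 2");
        System.out.println(xy[0] + "," + xy[1]);
        System.out.println(parse("1-2") == null);
    }
}
```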

trouble with writing regex java

String t always consists of two distinct alternating characters. For example, if string t's two distinct characters are x and y, then t could be xyxyx or yxyxy but not xxyy or xyyx.
But a.matches() always returns false and output becomes 0. Help me understand what's wrong here.
public static int check(String a) {
char on = a.charAt(0);
char to = a.charAt(1);
if(on != to) {
if(a.matches("["+on+"("+to+""+on+")*]|["+to+"("+on+""+to+")*]")) {
return a.length();
}
}
return 0;
}
Use regex (.)(.)(?:\1\2)*\1?.
(.) Match any character, and capture it as group 1
(.) Match any character, and capture it as group 2
\1 Match the same characters as was captured in group 1
\2 Match the same characters as was captured in group 2
(?:\1\2)* Match 0 or more pairs of group 1+2
\1? Optionally match a dangling group 1
Input must be at least two characters long. Empty string and one-character string will not match.
As java code, that would be:
if (a.matches("(.)(.)(?:\\1\\2)*\\1?")) {
See regex101.com for working examples (1).
1) Note that regex101 requires use of ^ and $, which are implied by the matches() method. It also requires use of flags g and m to showcase multiple examples at the same time.
UPDATE
As pointed out by Austin Anderson:
fails on yyyyyyyyy or xxxxxx
To prevent that, we can add a zero-width negative lookahead, to ensure input doesn't start with two of the same character:
(?!(.)\1)(.)(.)(?:\2\3)*\2?
See regex101.com.
Or you can use Austin Anderson's simpler version:
(.)(?!\1)(.)(?:\1\2)*\1?
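As a quick check, the last regex can be dropped into the check method from the question. This is a sketch only; the class name is an assumption.

```java
class AlternatingCheck {
    // (.)      first character, captured as group 1
    // (?!\1)   the next character must differ from group 1
    // (.)      second character, captured as group 2
    // (?:\1\2)*\1?   any number of further pairs, plus an optional dangling group 1
    static final String ALTERNATING = "(.)(?!\\1)(.)(?:\\1\\2)*\\1?";

    static int check(String a) {
        return a.matches(ALTERNATING) ? a.length() : 0;
    }

    public static void main(String[] args) {
        System.out.println(check("xyxyx")); // 5
        System.out.println(check("xxyy"));  // 0
        System.out.println(check("yyyy"));  // 0
    }
}
```

Because matches() only succeeds when the entire string matches, no ^ or $ anchors are needed.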
Actually, your regex is almost correct. The problem is that you have enclosed it in two character classes, and you also need to match an optional trailing character at the end.
You just need to use this regex:
public static int check(String a) {
if (a.length() < 2)
return 0;
char on = a.charAt(0);
char to = a.charAt(1);
if(on != to) {
String re = on+"("+to+on+")*"+to+"?|"+to+"("+on+to+")*"+on+"?";
System.out.println("re: " + re);
if(a.matches(re)) {
return a.length();
}
}
return 0;
}
Code Demo

How to match strings split by "|e|" sign

I've written a program to split a string by |o| and |e| signs.
This is my whole string (which I want to process):
code|e|0.07610 |o| p|e|0.02225 |o| li|e|0.02032 |o| applet|e|0.01305 |o| pre|e|0.01289
I wrote a utility function to parse the above string. The following is part of that function:
String [] trs = tgs[1].split("[^ |o| ]"); //tgs[1] have the whole string
for (int i=0 ; i<9; i++) {
String t = trs[i].split("[^|e|]")[0];
e.add(new ProbTranslate(t, Double.parseDouble(trs[i].split("[^|e|]")[1])));
}
But it seems to be incorrect (when I debug the program, I get incorrect results). I suspect that my mistake is in the regex part, so I am looking for a proper regex for parsing the above string.
Any help would be appreciated. Thanks.
To quote special characters in regular expressions, Java provides a method: java.util.regex.Pattern#quote
Applying to your example above, this could e.g. lead to
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
final String[] split1 = "code|e|0.07610 |o| p|e|0.02225 |o| li|e|0.02032 |o| applet|e|0.01305 |o| pre|e|0.01289".split(Pattern.quote(" |o| "));
for (int i = 0; i < split1.length; ++i) {
final String name = split1[i];
final String[] split2 = name.split(Pattern.quote("|e|"));
for (int j = 0; j < split2.length; ++j) {
System.out.println(split2[j]);
}
System.out.println("");
}
}
}
Output:
code
0.07610
p
0.02225
li
0.02032
applet
0.01305
pre
0.01289
Solution
Make two changes:
"[^ |o| ]" ➔ "( \\|o\\| )"
"[^|e|]" ➔ "(\\|e\\|)"
With those changes, your code would look like this:
String [] trs = tgs[1].split("( \\|o\\| )");
for (int i=0 ; i<9; i++) {
String t = trs[i].split("(\\|e\\|)")[0];
e.add(new ProbTranslate(t, Double.parseDouble(trs[i].split("(\\|e\\|)")[1])));
}
Explanation
There are three problems with your regex:
1) String#split(String) splits around the subsequences that match the given regex. Therefore, if you want to split around (and thereby remove) every |o|, your regex needs to match |o|. Your negated character class does the opposite: it matches every character other than the ones listed. Don't do that.
2) Brackets (e.g. [characters]) denote a character class, which means "any one of these individual characters" rather than "this complete substring". To match a complete substring, write it out as a literal sequence. The parentheses in the solution above form a capture group; they are harmless but not required for split.
3) The character | is a metacharacter in regex. If you want to match a literal | rather than use it for alternation, you must escape it. And since this is Java, you must escape the \ too, because \| is not a valid escape sequence in a Java string literal and the code would not compile. Hence, \\|.
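Both ways of matching the delimiters literally, Pattern.quote and manual escaping, can be seen side by side in a small parser. This is a sketch under the assumption that every entry has the form name|e|number; the class and method names are hypothetical and it does not use the ProbTranslate type from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class ProbParser {
    // Maps each term to its probability, preserving the input order.
    static Map<String, Double> parse(String line) {
        Map<String, Double> result = new LinkedHashMap<>();
        // Pattern.quote turns the delimiter into a literal pattern.
        for (String entry : line.split(Pattern.quote(" |o| "))) {
            // Equivalent manual escaping: each | becomes \\| in the Java literal.
            String[] pair = entry.split("\\|e\\|");
            result.put(pair[0], Double.parseDouble(pair[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("code|e|0.07610 |o| p|e|0.02225 |o| li|e|0.02032"));
    }
}
```

Iterating over the split result also avoids the hard-coded loop bound of 9 entries.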

Replace all characters bigger than f

I normally use this call to replace characters in a String:
myString.replace("f", "a").trim()
but this time I want to create a Hex String so I want to replace all characters that are bigger than f with the character a.
Is it possible to adapt this command ?
If you have an upper bounding character (I'll use z as an example), you could use a regular expression with replaceAll:
myString = myString.trim().replaceAll("[g-z]", "a");
The regular expression [g-z] means "any character g through z inclusive", see Pattern for details.
You may want to create the regular expression explicitly rather than relying on replaceAll's default version, if you want case-insensitivity for instance:
myString = Pattern.compile("[g-z]", Pattern.CASE_INSENSITIVE)
.matcher(myString.trim())
.replaceAll("a");
You can iterate the characters of your string, and replace those greater than f with what you want;
StringBuilder newString=new StringBuilder();
for (int i = 0; i < myString.length(); i++) {
char c = myString.charAt(i);
if (c > 'f') {
newString.append('a');
} else {
newString.append(c);
}
}
You could use a regular expression to do so: [^a-f0-9] matches any character that is not allowed in a string representing a (lowercase) hexadecimal number. You would then replace every match with your desired value.
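A minimal sketch of that replacement, assuming lowercase hex digits only (add A-F to the class if uppercase should be allowed); the class and method names are made up for illustration.

```java
class HexSanitizer {
    // Replaces every character that is not a lowercase hex digit with 'a'.
    static String toHexLike(String s) {
        return s.trim().replaceAll("[^a-f0-9]", "a");
    }

    public static void main(String[] args) {
        System.out.println(toHexLike("12fz")); // 12fa
        System.out.println(toHexLike("cafe")); // cafe
    }
}
```

Unlike the [g-z] range, the negated class also replaces punctuation, spaces and uppercase letters.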

Remove Special Characters For A Pattern Java

I want to remove these characters from a String:
+ - ! ( ) { } [ ] ^ ~ : \
I also want to remove these sequences:
/*
*/
&&
||
I mean that I will not remove a lone & or |, only the two-character sequences (/* */ && ||).
How can I do that efficiently in Java?
Example:
a:b+c1|x||c*(?)
will be:
abc1|xc*?
This can be done via a long, but actually very simple regex.
String aString = "a:b+c1|x||c*(?)";
String sanitizedString = aString.replaceAll("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|", "");
System.out.println(sanitizedString);
I think that the java.lang.String.replaceAll(String regex, String replacement) is all you need:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll(java.lang.String, java.lang.String).
There are two ways to do that:
1)
import java.util.Arrays;
import java.util.List;
// Multi-character tokens first, then the single characters from the question.
// A lone "/" is deliberately not in the list: only /* and */ should be removed.
List<String> tokens = Arrays.asList(
"/*", "*/", "&&", "||",
"+", "-", "!", "(", ")", "{", "}", "[", "]", "^", "~", ":", "\\");
String string = "a:b+c1|x||c*(?)";
for (String token : tokens) {
// replace() treats the token literally, so no regex escaping is needed
string = string.replace(token, "");
}
System.out.println(string); // prints abc1|xc*?
2)
String string = "a:b+c1|x||c*(?)";
string = string.replaceAll("[+\\-!(){}\\[\\]^~:\\\\]|/\\*|\\*/|&&|\\|\\|", "");
System.out.println(string);
Thomas wrote on How to remove special characters from a string?:
That depends on what you define as special characters, but try
replaceAll(...):
String result = yourString.replaceAll("[-+.^:,]","");
Note that the ^ character must not be the first one in the list, since
you'd then either have to escape it or it would mean "any but these
characters".
Another note: the - character needs to be the first or last one on the
list, otherwise you'd have to escape it or it would define a range (
e.g. :-, would mean "all characters in the range : to ,").
So, in order to keep consistency and not depend on character
positioning, you might want to escape all those characters that have a
special meaning in regular expressions (the following list is not
complete, so be aware of other characters like (, {, $ etc.):
String result = yourString.replaceAll("[\\-\\+\\.\\^:,]","");
If you want to get rid of all punctuation and symbols, try this regex:
[\p{P}\p{S}] (keep in mind that in Java strings you'd have to escape
backslashes: "[\\p{P}\\p{S}]").
A third way could be something like this, if you can exactly define
what should be left in your string:
String result = yourString.replaceAll("[^\\w\\s]","");
Here's less restrictive alternative to the "define allowed characters"
approach, as suggested by Ray:
String result = yourString.replaceAll("[^\\p{L}\\p{Z}]","");
The regex matches everything that is not a letter in any language and
not a separator (whitespace, linebreak etc.). Note that you can't use
[\P{L}\P{Z}] (upper case P means not having that property), since that
would mean "everything that is not a letter or not whitespace", which
almost matches everything, since letters are not whitespace and vice
versa.
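The difference between the two character classes in that last note can be checked directly. A small sketch; the class name, method names and sample input are assumptions for illustration.

```java
class NegatedClassDemo {
    // Removes everything that is neither a letter nor a separator.
    static String keepLettersAndSeparators(String s) {
        return s.replaceAll("[^\\p{L}\\p{Z}]", "");
    }

    // Matches anything that is a non-letter OR a non-separator,
    // which covers every character, so nothing is left.
    static String removeEverything(String s) {
        return s.replaceAll("[\\P{L}\\P{Z}]", "");
    }

    public static void main(String[] args) {
        String sample = "a, b!";
        System.out.println("[" + keepLettersAndSeparators(sample) + "]"); // [a b]
        System.out.println("[" + removeEverything(sample) + "]");         // []
    }
}
```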
