Java replaceAll() & split() irregularities


I know, I know, now I have two problems 'n all that, but regex here means I don't have to write two complicated loops. Instead, I have a regex that only I understand, and I'll be employed for yonks.
I have a string, say stack.overflow.questions[0].answer[1].postDate, and I need to get the [0] and the [1], preferably in an array. "Easy!" my neurons exclaimed, just use regex and the split method on your input string; so I came up with this:
String[] tokens = input.split("[^\\[\\d\\]]");
which produced the following:
[, , , , , , , , , , , , , , , , [0], , , , , , , [1]]
Oh dear. So, I thought, "what would replaceAll do in this instance?":
String onlyArrayIndexes = input.replaceAll("[^\\[\\d\\]]", "");
which produced:
[0][1]
Hmm. Why so? I'm looking for a two-element string array that contains "[0]" as the first element and "[1]" as the second. Why does split not work here, when the Javadoc says both methods use the Pattern class?
To summarise, I have two questions: why does the split() call produce that large array with seemingly random space characters and am I right in thinking the replaceAll works because the regex replaces all characters not matching "[", a number and "]"? What am I missing that means I expect them to produce similar output (OK that's three, and please don't answer "a clue?" to this one!).

Well, from what I can see the split does work: it gives you an array holding the pieces of the string between matches, where a match is any single character that is not a bracket or a digit. Adjacent matches leave empty strings between them.
As for the replaceAll, I think your assumption is right. It removes everything (replaces each match with "") that is not what you want.
From the API documentation:
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
The string "boo:and:foo", for example, yields the following results with these expressions:
Regex    Result
:        { "boo", "and", "foo" }
o        { "b", "", ":and:f" }
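To make the effect of the limit argument concrete, here is a minimal sketch using the documentation's own "boo:and:foo" example. The class and method names (SplitLimitDemo, splitDefault, splitKeepTrailing) are made up for illustration.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // One-argument split behaves like limit 0: trailing empty strings are dropped.
    static String[] splitDefault(String input, String regex) {
        return input.split(regex);
    }

    // A negative limit keeps trailing empty strings.
    static String[] splitKeepTrailing(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        // prints [b, , :and:f]
        System.out.println(Arrays.toString(splitDefault("boo:and:foo", "o")));
        // prints [b, , :and:f, , ]
        System.out.println(Arrays.toString(splitKeepTrailing("boo:and:foo", "o")));
    }
}
```

Note that only trailing empty strings are removed. Empty strings at the start or in the middle of the result are always kept, which is why the question's array is full of them.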

This is not a direct answer to your question, however I want to show you a great API that will suit your need.
Check out Splitter from Google Guava.
So for your example, you would use it like this:
Iterable<String> tokens = Splitter.onPattern("[^\\[\\d\\]]").omitEmptyStrings().trimResults().split(input);
//Now you get back an Iterable which you can iterate over. Much better than an Array.
for(String s : tokens) {
System.out.println(s);
}
This prints:
[0]
[1]

split splits on boundaries defined by the regex you provide, so it's no great surprise you're getting lots of entries — nearly all of the characters in the string match your regex and so, by definition, are boundaries on which a split should occur.
replaceAll replaces matches for your regex with the replacement you give it, which in your case is a blank string.
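As a rough illustration of the difference, the sketch below applies the question's regex both ways and then filters out the empty strings that split leaves between adjacent delimiters. The class and method names are hypothetical, and the filtering step is only one possible workaround, not something the question used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SplitVersusReplace {
    // The regex from the question: any single character that is not '[', a digit, or ']'.
    static final String REGEX = "[^\\[\\d\\]]";

    // split treats every matching character as a delimiter; drop the empty pieces.
    static List<String> nonEmptyTokens(String input) {
        return Arrays.stream(input.split(REGEX))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // replaceAll deletes every matching character and joins what is left.
    static String stripped(String input) {
        return input.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        String input = "stack.overflow.questions[0].answer[1].postDate";
        System.out.println(nonEmptyTokens(input)); // prints [[0], [1]]
        System.out.println(stripped(input));       // prints [0][1]
    }
}
```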
If you're trying to grab the 0 and the 1, it's a trivial loop:
String text = "stack.overflow.questions[0].answer[1].postDate";
Pattern pat = Pattern.compile("\\[(\\d+)\\]");
Matcher m = pat.matcher(text);
List<String> results = new ArrayList<String>();
while (m.find()) {
results.add(m.group(1)); // Or just .group() if you want the [] as well
}
String[] tokens = results.toArray(new String[0]);
Or if it's always exactly two of them:
String text = "stack.overflow.questions[0].answer[1].postDate";
Pattern pat = Pattern.compile(".*\\[(\\d+)\\].*\\[(\\d+)\\].*");
Matcher m = pat.matcher(text);
m.find();
String[] tokens = new String[2];
tokens[0] = m.group(1);
tokens[1] = m.group(2);

The problem is that split is the wrong operation here.
In ruby, I'd tell you to string.scan(/\[\d+\]/), which would give you the array ["[0]","[1]"]
Java doesn't have a single-method equivalent, but we can write a scan method as follows:
public List<String> scan(String string, String regex){
List<String> list = new ArrayList<String>();
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
while(matcher.find()) {
list.add(matcher.group());
}
return list;
}
and we can call it as scan(string,"\\[\\d+\\]")
The equivalent Scala code is:
"""\[\d+\]""".r findAllIn string
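On newer JDKs the same idea can be written without an explicit loop. The sketch below assumes Java 9 or later, where Matcher.results() is available; the class name ScanDemo is made up for illustration.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ScanDemo {
    // Collects every match of regex in input, in order of appearance.
    // Requires Java 9+ for Matcher.results().
    static List<String> scan(String input, String regex) {
        return Pattern.compile(regex)
                .matcher(input)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "stack.overflow.questions[0].answer[1].postDate";
        System.out.println(scan(input, "\\[\\d+\\]")); // prints [[0], [1]]
    }
}
```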

Related

Java | Split words and round brackets with its content into elements of a String Array using regex

Hopefully you can help me out, since I'm really bad at regex, so
Given these examples of String input patterns:
"string1 string2 (more strings here)"
"string1 (more words)"
"str1 str2 str3 [...] strn [...] (words. again.)"
I want to end up with a String[] that looks like this:
["string1", "string2", "(more strings here)"]
Basically it should detect words and everything (also non characters) in round brackets as an individual group and put it in an String Array.
I understand that this captures the round brackets and their content: (\((.*?)\))
and this captures the words: (\w+)
but i have no idea how to combine them. Or is there a better alternative in Java?

Pattern pattern =
Pattern.compile("([\\w]+|\\(.*?\\))"); // match continuous word characters or all strings between "(" and ")"
Matcher matcher =
pattern.matcher("string1 (more words)"); // input string
List<String> stringArrayList = new ArrayList<>();
// run matcher again and again to find the next match of regex on the input
while (matcher.find()) {
stringArrayList.add(matcher.group());
}
String[] output = stringArrayList.toArray(new String[0]); // final output
for (String entry :
output) {
System.out.println(entry); // printing
}

You could match the string with the following regular expression (with the case-insensitive flag set), collecting the matches in an array.
"\\([^)]*\\)|[a-z\\d]+"
Start your Java engine! (click "Java")
The following link to regex101.com uses the equivalent regex for the PCRE (PHP) engine. I've included that to allow the reader to examine how each part of the regex works. (Move the cursor around to see interesting details pop up on the screen.)
Start your PCRE engine!
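For readers without access to the linked demos, here is a minimal sketch of how that regex might be applied in Java. The class and method names are hypothetical; Pattern.CASE_INSENSITIVE stands in for the case-insensitive flag mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordsAndParens {
    // Either a whole parenthesised group, or a run of letters and digits.
    private static final Pattern TOKEN =
            Pattern.compile("\\([^)]*\\)|[a-z\\d]+", Pattern.CASE_INSENSITIVE);

    static String[] tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : tokenize("string1 string2 (more strings here)")) {
            System.out.println(token);
        }
    }
}
```

The parenthesised alternative is listed first so that words inside the brackets are consumed as part of the group rather than matched individually.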

Converting string of lists to list of string in java

I am receiving a list of strings serialized as a single string, like this: "["a", "b"]". I would like to convert it to a list of strings. I can do this by stripping the leading and trailing brackets and then splitting on the comma. The problem is that I may also receive the value as a single string, "a", which I also want to convert to a list of strings. Is there any way to generalize this?

One possible solution is to use Regex.
Your expression can look like this: "(.+?)"
.+? matches any character (except for line terminators)
+? Quantifier - Matches between one and unlimited times, as few times as possible, expanding as needed.
String tokens = "[\"a\", \"b,c\", \"test\"]";
String pattern = "\"(.+?)\"";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(tokens);
List<String> tokenList = new ArrayList<String>();
while (m.find()) {
tokenList.add(m.group(1)); // group(1) is the text between the quotes; group() would include the quotes
}
System.out.println(tokenList);
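One way to cover both input shapes from the question is sketched below. It assumes values are non-empty and contain no escaped quotes, and it adds a fallback for a bare, unquoted value, which is a guess at what the asker receives. The names QuotedListParser and parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedListParser {
    // Same expression as above: the shortest run of characters between two quotes.
    private static final Pattern QUOTED = Pattern.compile("\"(.+?)\"");

    static List<String> parse(String raw) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(raw);
        while (m.find()) {
            values.add(m.group(1)); // text between the quotes
        }
        // Assumption: a value with no quotes at all is a single bare element.
        if (values.isEmpty() && !raw.trim().isEmpty()) {
            values.add(raw.trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("[\"a\", \"b,c\", \"test\"]")); // prints [a, b,c, test]
        System.out.println(parse("\"a\""));                        // prints [a]
    }
}
```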

you can generalize the following:
String str = "\"[\"a\",\"b\"]\"";
String[] splitStrs = str.split("\"",7);
System.out.println(splitStrs[0]+" "+splitStrs[1]+" "+splitStrs[2]+" "+splitStrs[3]+" "+splitStrs[4]+" "+splitStrs[5]+" "+splitStrs[6]);
My output
[ a , b ]

Java split returns white spaces in result

I'm using the function "split" on this string:
p(80,2)
I would like to obtain just the two numbers, so this is what I do:
String[] split = msg.msgContent().split("[p(,)]");
The regex is correct (or at least, I think so) since it splits the two numbers and puts them in the vector "split", but it turns out that this vector has a length of 4, and the first two positions are occupied by white spaces.
In fact, if I print each vector position, this is the result:
Split:
80
2
I've tried adding \\s to the regex to match with white spaces, but since there are none in my string, it didn't work.

You don't need split here, just use a simple regex to extract the digits from your string:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(msg.msgContent());
while (m.find()) {
String number = m.group();
// add to array
}
Note that String#split takes a regex, and the regex you passed doesn't match the pattern you're looking for.
You might want to read the documentation of Pattern and Matcher for more information about the solution above.

split accepts a regular expression as parameter, and this is a character class: [p(,)].
Your code splits on every character in the class, so the leading p and ( each produce an empty string:
"p(80,2)" will return the array {"", "", "80", "2"}
I know it's not very pretty, but you can filter the empty strings out:
List<String> collect = Pattern.compile("[^\\d]+")
.splitAsStream(input)
.filter(token -> !token.isEmpty())
.collect(Collectors.toList());
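The following sketch reproduces the four-element result from the question and shows one possible way to avoid the leading empty strings, by stripping the non-digit prefix before splitting. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class LeadingEmptyDemo {
    // The split from the question: 'p' and '(' are adjacent delimiters at the start.
    static String[] rawSplit(String input) {
        return input.split("[p(,)]");
    }

    // Remove leading non-digits first, then split on runs of non-digits.
    static String[] numbers(String input) {
        return input.replaceAll("^\\D+", "").split("\\D+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(rawSplit("p(80,2)"))); // prints [, , 80, 2]
        System.out.println(Arrays.toString(numbers("p(80,2)")));  // prints [80, 2]
    }
}
```

The "white spaces" in the question's output are these empty strings, which is why adding \\s to the regex made no difference.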

Since you're splitting on p and (, the first two characters of your string are resulting in splits. I would split on the comma after replacing the p, (, and ). Like this:
String x = "p(80,2)";
String [] y = x.replaceAll("[p()]", "").split(",");

Split is not really what you need here, but if you want to use it you can do something like this:
"p(80,2)".replace("p(", "").replace(")", "").split(",")
Results with
[80, 2]

Split a string based on pattern and merge it back

I need to split a string based on a pattern and then merge part of it back together.
For example, below are the actual and expected strings.
String actualstr="abc.def.ghi.jkl.mno";
String expectedstr="abc.mno";
When I use the code below, I can store the pieces in an array and iterate over it to rebuild the string. Is there a simpler and more efficient way to do this?
String[] splited = actualstr.split("[\\.\\.\\.\\.\\.\\s]+");
Although I can access the pieces by index, is there an easier way to do this? Please advise.

You do not understand how regexes work.
Here is your regex without the escapes: [\.\.\.\.\.\s]+
You have a character class ([]). Which means there is no reason to have more than one . in it. You also don't need to escape .s in a char class.
Here is an equivalent regex to your regex: [.\s]+. As a Java String that's: "[.\\s]+".
You can do .split("regex") on your string to get an array. It's very simple to get a solution from that point.
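To show that the two character classes behave the same, and how the expected string could be rebuilt from the array, here is a minimal sketch. The class and method names are hypothetical, and it assumes the input contains at least one non-delimiter token.

```java
import java.util.Arrays;

class FirstAndLastDemo {
    // Splits on runs of dots or whitespace, then joins the first and last pieces.
    static String firstAndLast(String input) {
        String[] parts = input.split("[.\\s]+");
        return parts[0] + "." + parts[parts.length - 1];
    }

    // The asker's original regex and the simplified one produce the same array.
    static boolean sameSplit(String input) {
        return Arrays.equals(
                input.split("[\\.\\.\\.\\.\\.\\s]+"),
                input.split("[.\\s]+"));
    }

    public static void main(String[] args) {
        String actualstr = "abc.def.ghi.jkl.mno";
        System.out.println(sameSplit(actualstr));    // prints true
        System.out.println(firstAndLast(actualstr)); // prints abc.mno
    }
}
```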

I would use a replaceAll in this case
String actualstr="abc.def.ghi.jkl.mno";
String str = actualstr.replaceAll("\\..*\\.", ".");
This will replace everything from the first . through the last . with a single .
You could also use split
String[] parts = actualstr.split("\\.");
String str = parts[0]+"."+parts[parts.length-1]; // first and last word

public static String merge(String string, String delimiter, int... partnumbers)
{
String[] parts = string.split(delimiter);
String result = "";
for ( int x = 0 ; x < partnumbers.length ; x ++ )
{
result += result.length() > 0 ? delimiter.replaceAll("\\\\","") : "";
result += parts[partnumbers[x]];
}
return result;
}
and then use it like:
merge("abc.def.ghi.jkl.mno", "\\.", 0, 4);

I would do it this way
Pattern pattern = Pattern.compile("(\\w*\\.).*\\.(\\w*)");
Matcher matcher = pattern.matcher("abc.def.ghi.jkl.mno");
if (matcher.matches()) {
System.out.println(matcher.group(1) + matcher.group(2));
}
If you can cache the result of
Pattern.compile("(\\w*\\.).*\\.(\\w*)")
and reuse "pattern" every time, this code will be very efficient, as pattern compilation is the most expensive step. The java.lang.String.split() method that other answers suggest uses the same Pattern.compile() internally if the pattern length is greater than 1, meaning that it repeats this expensive compilation on each invocation of the method. See java.util.regex - importance of Pattern.compile()?. So it is much better to compile the Pattern once, cache it, and reuse it.
matcher.group(1) refers to the first group of () which is "(\w*\.)"
matcher.group(2) refers to the second one which is "(\w*)"
Although we don't use it here, note that group(0) is the match for the whole regex.
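A minimal sketch of the caching idea: the pattern is compiled once into a static final field and reused on every call. The class name and the fallback of returning the input unchanged when it does not match are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstLastExtractor {
    // Compiled once when the class is loaded; Pattern instances are safe to share.
    private static final Pattern FIRST_AND_LAST =
            Pattern.compile("(\\w*\\.).*\\.(\\w*)");

    static String firstAndLast(String input) {
        Matcher matcher = FIRST_AND_LAST.matcher(input);
        // Assumption: inputs with fewer than two dots are returned as-is.
        return matcher.matches() ? matcher.group(1) + matcher.group(2) : input;
    }

    public static void main(String[] args) {
        System.out.println(firstAndLast("abc.def.ghi.jkl.mno")); // prints abc.mno
    }
}
```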

Splitting a string using Regex in Java

Would anyone be able to assist me with some regex?
I want to split the following string into a number, a string, and a number:
"810LN15"
One method requires 810 to be returned, another requires LN, and another should return 15.
The only real solution to this is using regex, as the numbers will grow in length.
What regex can I use to accommodate this?

String.split won't give you the desired result, which I guess would be "810", "LN", "15", since it would have to look for a token to split at and would strip that token.
Try Pattern and Matcher instead, using this regex: (\d+)|([a-zA-Z]+), which would match any sequence of numbers and letters and get distinct number/text groups (i.e. "AA810LN15QQ12345" would result in the groups "AA", "810", "LN", "15", "QQ" and "12345").
Example:
Pattern p = Pattern.compile("(\\d+)|([a-zA-Z]+)");
Matcher m = p.matcher("810LN15");
List<String> tokens = new LinkedList<String>();
while(m.find())
{
String token = m.group(); // group 0 is always the entire match; group(1) would be null for letter runs
tokens.add(token);
}
//now iterate through 'tokens' and check whether you have a number or text

In Java, as in most regex flavors (Python before 3.7 being a notable exception), the split() regex isn't required to consume any characters when it finds a match. Here I've used lookaheads and lookbehinds to match any position that has a digit on one side of it and a non-digit on the other:
String source = "810LN15";
String[] parts = source.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
System.out.println(Arrays.toString(parts));
output:
[810, LN, 15]

(\\d+)([a-zA-Z]+)(\\d+) should do the trick. The first capture group will be the first number, the second capture group will be the letters in between, and the third capture group will be the second number. The double backslashes are for Java.
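A minimal sketch of how those three capture groups might be read out. The class name and the choice to throw on input that does not fit the number-letters-number shape are assumptions, not part of the answer above.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodeParts {
    private static final Pattern CODE = Pattern.compile("(\\d+)([a-zA-Z]+)(\\d+)");

    // Returns {leading number, letters, trailing number}.
    static String[] parts(String code) {
        Matcher matcher = CODE.matcher(code);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unexpected format: " + code);
        }
        return new String[] {matcher.group(1), matcher.group(2), matcher.group(3)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parts("810LN15"))); // prints [810, LN, 15]
    }
}
```

Because each group uses +, the numbers can grow to any length without changing the regex.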

This gives you the exact thing you guys are looking for
Pattern p = Pattern.compile("(([a-zA-Z]+)|(\\d+))|((\\d+)|([a-zA-Z]+))");
Matcher m = p.matcher("810LN15");
List<Object> tokens = new LinkedList<Object>();
while(m.find())
{
String token = m.group( 1 );
tokens.add(token);
}
System.out.println(tokens);
