Splitting a string using Regex in Java

Would anyone be able to assist me with a regex?
I want to split the following string into a number, a string, and a number:
"810LN15"
One method needs to return 810, another needs to return LN, and another should return 15.
The only real solution is to use a regex, as the numbers will grow in length.
What regex can I use to accommodate this?

String.split won't give you the desired result, which I guess would be "810", "LN", "15", since it would have to look for a token to split at and would strip that token.
Try Pattern and Matcher instead, using this regex: (\d+)|([a-zA-Z]+), which matches any run of digits or any run of letters and yields distinct number/text tokens (e.g. "AA810LN15QQ12345" would result in the tokens "AA", "810", "LN", "15", "QQ" and "12345").
Example:
Pattern p = Pattern.compile("(\\d+)|([a-zA-Z]+)");
Matcher m = p.matcher("810LN15");
List<String> tokens = new LinkedList<String>();
while (m.find()) {
    // group() (group 0) is always the entire match; group(1) would be null for letter tokens
    String token = m.group();
    tokens.add(token);
}
//now iterate through 'tokens' and check whether you have a number or text
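The final step hinted at in the comment above (deciding whether each token is a number or text) could look roughly like this. It is only a sketch: the class name TokenSplitter and the helpers tokenize and isNumber are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not from the original answer.
class TokenSplitter {
    // A token is either a run of digits or a run of ASCII letters.
    private static final Pattern TOKEN = Pattern.compile("\\d+|[a-zA-Z]+");

    // Collects every digit run and letter run, in order of appearance.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group()); // group() is the whole match
        }
        return tokens;
    }

    // True when the token is non-empty and consists only of digits.
    static boolean isNumber(String token) {
        return !token.isEmpty() && token.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        for (String token : tokenize("810LN15")) {
            System.out.println(token + " -> " + (isNumber(token) ? "number" : "text"));
        }
    }
}
```

For "810LN15" this should report 810 and 15 as numbers and LN as text.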

In Java, as in most regex flavors (Python before 3.7 being a notable exception), the split() regex isn't required to consume any characters when it finds a match. Here I've used lookaheads and lookbehinds to match any position that has a digit on one side of it and a non-digit on the other:
String source = "810LN15";
String[] parts = source.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
System.out.println(Arrays.toString(parts));
output:
[810, LN, 15]

(\\d+)([a-zA-Z]+)(\\d+) should do the trick. The first capture group will be the first number, the second capture group will be the letters in between, and the third capture group will be the second number. The double backslashes are needed inside a Java string literal.
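A minimal sketch of how those three capture groups might be read, assuming the input always has the number-letters-number shape. The class name ThreePartCode and the method parse are invented for this example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not from the original answer.
class ThreePartCode {
    private static final Pattern CODE = Pattern.compile("(\\d+)([a-zA-Z]+)(\\d+)");

    // Returns {leading number, letters, trailing number},
    // or null when the whole input does not have that shape.
    static String[] parse(String input) {
        Matcher m = CODE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("810LN15"))); // prints [810, LN, 15]
    }
}
```

matches() requires the entire string to fit the pattern, so unlike the find() loops above this rejects inputs such as "LN15" rather than returning partial results.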

This gives you exactly what you are looking for:
// The second half of the original alternation was redundant, so one group is enough
Pattern p = Pattern.compile("([a-zA-Z]+|\\d+)");
Matcher m = p.matcher("810LN15");
List<String> tokens = new LinkedList<String>();
while (m.find()) {
    String token = m.group(1);
    tokens.add(token);
}
System.out.println(tokens);

Related

Java split returns white spaces in result

I'm using the function "split" on this string:
p(80,2)
I would like to obtain just the two numbers, so this is what I do:
String[] split = msg.msgContent().split("[p(,)]");
The regex is correct (or at least I think so), since it separates the two numbers and puts them in the array "split", but the array turns out to have a length of 4, and the first two positions appear to hold white space.
In fact, if I print each vector position, this is the result:
Split:
80
2
I've tried adding \\s to the regex to match with white spaces, but since there are none in my string, it didn't work.

You don't need split here, just use a simple regex to extract the digits from your string:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(msg.msgContent());
while (m.find()) {
String number = m.group();
// add to array
}
Note that String#split takes a regex, and the regex you passed matches each delimiter character individually, so the adjacent p and ( at the start of the string produce empty strings in the result.
You might want to read the documentation of Pattern and Matcher for more information about the solution above.

split accepts a regular expression as its parameter, and [p(,)] is a character class.
Your code therefore splits on every character in the class:
"p(80,2)" will return the array {"", "", "80", "2"}
I know this is not very pretty, but it works:
List<String> collect = Pattern.compile("[^\\d]+")
    .splitAsStream(s)
    .filter(part -> part.length() > 0)
    .collect(Collectors.toList());
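To see where the "white spaces" come from, the following sketch prints the raw split result next to a filtered one. The class name SplitEmpties and its two methods are assumptions made for this illustration.

```java
import java.util.Arrays;

// Hypothetical demo class, not from the original answers.
class SplitEmpties {
    // Splits on each of the characters p ( , ) exactly as in the question.
    static String[] rawSplit(String s) {
        return s.split("[p(,)]");
    }

    // Same split, but with the empty strings filtered out afterwards.
    static String[] numbersOnly(String s) {
        return Arrays.stream(s.split("[p(,)]"))
                .filter(part -> !part.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // The leading "p" and "(" each end an empty substring.
        System.out.println(Arrays.toString(rawSplit("p(80,2)")));    // prints [, , 80, 2]
        System.out.println(Arrays.toString(numbersOnly("p(80,2)"))); // prints [80, 2]
    }
}
```

The entries are empty strings rather than spaces, which is why adding \\s to the regex made no difference. The trailing empty string after ")" is dropped by split, but leading ones are kept.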

Since you're splitting on p and (, the first two characters of your string are resulting in splits. I would split on the comma after replacing the p, (, and ). Like this:
String x = "p(80,2)";
String [] y = x.replaceAll("[p()]", "").split(",");

split is not really what you need here, but if you want to use it you can do something like this:
"p(80,2)".replace("p(", "").replace(")", "").split(",")
This results in:
[80, 2]

regex pattern matcher

I'm not too familiar with regex, and I have a block of code that does not seem to be working as expected. I think I know why, but I'm looking for a solution.
Here is the string "whereClause":
where filter_2_id = 20 and acceptable_flag is true
String whereClause = report.getWhereClause();
String[] tokens = whereClause.split("filter_1_id");
Pattern p = Pattern.compile("(\\d{3})\\d+");
Matcher m = p.matcher(tokens[0]);
List<Integer> filterList = new ArrayList<Integer>();
if (m.find()) {
do {
String local = m.group();
filterList.add(Integer.parseInt(local));
} while (m.find());
}
When I am debugging, it looks like it gets to the if (m.find()) { but then skips over the block completely. Is it because the regex pattern (\d{3})\d+ only matches numbers with more than 3 digits? I actually need it to scan for any run of digits, so should I just use 0-9 inside instead?
Help or advice, please.

You can try the regular expression "=\\s*(\\d+)" and then modify m.group() to m.group(1). This should look for an equal sign, possibly followed by some whitespace, and then a sequence of one or more digits. Putting the digits part in parentheses creates a group, which will be group 1 (group 0 is the whole match).
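Applied to the where clause from the question, that suggestion might look like the sketch below. The class name FilterIds and the method extract are hypothetical, and the sketch assumes every number of interest directly follows an equals sign.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not from the original answer.
class FilterIds {
    // An equals sign, optional whitespace, then one or more digits (group 1).
    private static final Pattern ASSIGNED_NUMBER = Pattern.compile("=\\s*(\\d+)");

    static List<Integer> extract(String whereClause) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = ASSIGNED_NUMBER.matcher(whereClause);
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1))); // group 1 holds only the digits
        }
        return ids;
    }

    public static void main(String[] args) {
        String whereClause = "where filter_2_id = 20 and acceptable_flag is true";
        System.out.println(extract(whereClause)); // prints [20]
    }
}
```

A plain while (m.find()) loop replaces the if/do-while combination from the question. Note that the digit in filter_2_id is not picked up, because it does not follow an equals sign.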

Splitting strings delimited by [[ ]] in java?

I have an input string of the following form, "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]", and I need to extract the tokens "Animal rights", "Anthropocentrism" and so on.
I tried using the split method in the String library but I am not able to find the appropriate regular expression to get the tokens, it would be great if someone could help.
I am basically trying to parse the internal links in a Wikipedia XML file you can check out the format here.

You probably shouldn't be using split() here but instead a Matcher:
String input = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
Matcher m = Pattern.compile("\\[\\[(.*?)\\]\\]").matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
Animal rights
Anthropocentrism
Anthropology

A pattern like this should work:
\[\[(.*?)\]\]
This will match a literal [[ followed by zero or more of any character, non-greedily, captured in group 1, followed by a literal ]].
Don't forget to escape the \ in the Java string literal:
Pattern.compile("\\[\\[(.*?)\\]\\]");
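The position of the ? matters here: (.*?) is a lazy quantifier inside the group, whereas (.*)? is a greedy group that has merely been made optional. The sketch below compares the two on the question's input. The class name BracketLinks and the method extract are invented for this comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answer.
class BracketLinks {
    // Returns group 1 of every match of the given regex in the input.
    static List<String> extract(String input, String regex) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String input = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
        // Lazy: stops at the first closing ]] after each opening [[.
        System.out.println(extract(input, "\\[\\[(.*?)\\]\\]").size()); // prints 3
        // Greedy: runs on to the last ]] in the input, giving one long match.
        System.out.println(extract(input, "\\[\\[(.*)?\\]\\]").size()); // prints 1
    }
}
```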

It's pretty easy with regex.
\[\[(.+?)\]\]
I recommend using .+ to make sure there is actually something between the brackets, so that you don't end up storing an empty entry in your array.
String[] output = new String[10]; // assumes at most 10 matches
String pattern = "\\[\\[(.+?)\\]\\]";
String input = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
Matcher m = Pattern.compile(pattern).matcher(input);
int increment = 0;
while (m.find() && increment < output.length) {
    output[increment] = m.group(1);
    increment++;
}
Since you said you also wanted to learn regex, I'll break it down.
\[ twice matches the two [ brackets; you need the \ because [ is a special character in regex
. matches any character except newlines
+ means one or more of the preceding item
? after the + makes it lazy, so the engine first matches the preceding item only once and then extends the match one character at a time until the rest of the pattern matches
\] matches a literal ]

Try the following:
String str = "[[Animal rights]] [[Anthropocentrism]] [[Anthropology]]";
str = str.replaceAll("(^\\[\\[|\\]\\]$)", "");
String[] array = str.split("\\]\\] \\[\\[");
System.out.println(Arrays.toString(array));
// prints "[Animal rights, Anthropocentrism, Anthropology]"

Java - Split by delimiters

Given _<A_>_<B_>_<Z_>, I want to extract A, B, Z into an array.
Basically _< is the starting delimiter and _> is the ending delimiter.

You can use lookaround assertions to match only the content between the delimiters.
String text = "_<A_>_<B_>_<Z_>";
List<String> result = new ArrayList<String>();
Pattern p = Pattern
    .compile("(?<=_<)" + // Lookbehind assertion: the opening delimiter must come right before
        ".*?" +          // Match as little as possible until the lookahead succeeds
        "(?=_>)"         // Lookahead assertion: the closing delimiter must follow
    );
Matcher m = p.matcher(text);
while (m.find()) {
    result.add(m.group(0));
}

That's simple: cut off the first opening delimiter and the last closing delimiter, and then split on close-open:
string.replaceFirst( "^_<(.*)_>$", "$1" ).split( "_>_<" );

You extract them using capture groups.
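As a rough sketch of the capture-group approach, assuming the content between the delimiters never itself contains _>. The class name DelimitedItems and the method extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not from the original answer.
class DelimitedItems {
    // Opening delimiter, lazily captured content (group 1), closing delimiter.
    private static final Pattern ITEM = Pattern.compile("_<(.*?)_>");

    static List<String> extract(String text) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(text);
        while (m.find()) {
            items.add(m.group(1)); // only the part between _< and _>
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(extract("_<A_>_<B_>_<Z_>")); // prints [A, B, Z]
    }
}
```

Compared with the lookaround version, the delimiters are consumed as part of each match and the content is read from group 1 instead of group 0.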

Drop everything up to and including the first _<, drop the final _>, and then split what remains by _>_< to get A, B, Z.

Java replaceAll() & split() irregularities

I know, I know, now I have two problems 'n all that, but regex here means I don't have to write two complicated loops. Instead, I have a regex that only I understand, and I'll be employed for yonks.
I have a string, say stack.overflow.questions[0].answer[1].postDate, and I need to get the [0] and the [1], preferably in an array. "Easy!" my neurons exclaimed, just use regex and the split method on your input string; so I came up with this:
String[] tokens = input.split("[^\\[\\d\\]]");
which produced the following:
[, , , , , , , , , , , , , , , , [0], , , , , , , [1]]
Oh dear. So, I thought, "what would replaceAll do in this instance?":
String onlyArrayIndexes = input.replaceAll("[^\\[\\d\\]]", "");
which produced:
[0][1]
Hmm. Why so? I'm looking for a two-element string array that contains "[0]" as the first element and "[1]" as the second. Why does split not work here, when the Javadoc says that both methods use the Pattern class?
To summarise, I have two questions: why does the split() call produce that large array with seemingly random space characters and am I right in thinking the replaceAll works because the regex replaces all characters not matching "[", a number and "]"? What am I missing that means I expect them to produce similar output (OK that's three, and please don't answer "a clue?" to this one!).

Well, from what I can see the split does work: it splits the string at every character that is not a bracket or a digit, and gives you an array of the pieces in between.
As for the replaceAll, I think your assumption is right. It removes everything (replaces each match with "") that is not what you want.
From the API documentation:
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
The string "boo:and:foo", for example, yields the following results with these expressions:
Regex    Result
:        { "boo", "and", "foo" }
o        { "b", "", ":and:f" }
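The effect of the limit argument mentioned in that excerpt can be checked with a short sketch. The class name SplitLimits and the helper describe are made up for this illustration.

```java
import java.util.Arrays;

// Hypothetical demo class, not from the original answer.
class SplitLimits {
    // Splits the documentation's sample string and renders the result as text.
    static String describe(String regex, int limit) {
        return Arrays.toString("boo:and:foo".split(regex, limit));
    }

    public static void main(String[] args) {
        // Limit 0 (the one-argument default): trailing empty strings are dropped.
        System.out.println(describe("o", 0));  // prints [b, , :and:f]
        // Negative limit: trailing empty strings are kept.
        System.out.println(describe("o", -1)); // prints [b, , :and:f, , ]
        // Positive limit: at most that many pieces are returned.
        System.out.println(describe(":", 2));  // prints [boo, and:foo]
    }
}
```

Only trailing empty strings are affected by a limit of zero. Empty strings at the start or in the middle of the result are always kept, which matches what the question observed.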

This is not a direct answer to your question, however I want to show you a great API that will suit your need.
Check out Splitter from Google Guava.
So for your example, you would use it like this:
Iterable<String> tokens = Splitter.onPattern("[^\\[\\d\\]]").omitEmptyStrings().trimResults().split(input);
//Now you get back an Iterable which you can iterate over. Much better than an Array.
for(String s : tokens) {
System.out.println(s);
}
This prints:
[0]
[1]

split splits on boundaries defined by the regex you provide, so it's no great surprise you're getting lots of entries — nearly all of the characters in the string match your regex and so, by definition, are boundaries on which a split should occur.
replaceAll replaces matches for your regex with the replacement you give it, which in your case is a blank string.
If you're trying to grab the 0 and the 1, it's a trivial loop:
String text = "stack.overflow.questions[0].answer[1].postDate";
Pattern pat = Pattern.compile("\\[(\\d+)\\]");
Matcher m = pat.matcher(text);
List<String> results = new ArrayList<String>();
while (m.find()) {
results.add(m.group(1)); // Or just .group() if you want the [] as well
}
String[] tokens = results.toArray(new String[0]);
Or if it's always exactly two of them:
String text = "stack.overflow.questions[0].answer[1].postDate";
Pattern pat = Pattern.compile(".*\\[(\\d+)\\].*\\[(\\d+)\\].*");
Matcher m = pat.matcher(text);
m.find();
String[] tokens = new String[2];
tokens[0] = m.group(1);
tokens[1] = m.group(2);

The problem is that split is the wrong operation here.
In Ruby, I'd tell you to use string.scan(/\[\d+\]/), which would give you the array ["[0]","[1]"]
Java doesn't have a single-method equivalent, but we can write a scan method as follows:
public List<String> scan(String string, String regex) {
    List<String> list = new ArrayList<String>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(string);
    while (matcher.find()) {
        list.add(matcher.group());
    }
    return list;
}
and we can call it as scan(string,"\\[\\d+\\]")
The equivalent Scala code is:
"""\[\d+\]""".r findAllIn string
