String splitting regex with patterns ignored - java

I have a source string from which I want to extract the data fields:
String source = "data|junk,data|junk|junk,data,data|junk";
String[] result = source.split(",");
The above gives data|junk, data|junk|junk, data, data|junk. To further get the data out, I did this:
for (int i = 0; i < result.length; i++) {
result[i] = result[i].split("\\|")[0];
}
This gives what I wanted: data, data, data, data. Is it possible to do this in a single split with the right regex?
String[] result = source.split("\\|.*?,");
The above gives data, data, data,data|junk; the last two data items are not split. Could you please help with the correct regex to get the result I want?
Example string: "Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf"
Expected result: "Ann, Bob, Clara, David"

You can change your regular expression to account for the "junk", then keep matching while it matches data:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class RegexTest {
public static void main(String[] args) {
String input = "Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf";
Pattern p = Pattern.compile("(\\w+)(\\|\\w+)*,?");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
}
}
The regular expression looks for word characters (letters, digits, and underscores) and captures them. It then looks for a pipe symbol (escaped so that it does not have a special meaning in the regular expression) followed again by word characters. This pipe plus word characters can occur any number of times (zero or more). After that there may optionally be a comma.
This prints
Ann
Bob
Clara
David
It also captures the "junk", and you could access that with m.group(2) in the loop. If you don't want to capture that, insert a ?: into the regular expression:
Pattern.compile("(\\w+)(?:\\|\\w+)*,?");
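As a minimal sketch, the non-capturing variant can be wrapped in a helper that returns the leading fields as a list. The class name LeadingFields and the method name extract are made up for this illustration; the pattern is the one from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (names are assumptions, not from the original answer).
final class LeadingFields {
    // Group 1 is the data; the junk is matched by a non-capturing group.
    private static final Pattern RECORD = Pattern.compile("(\\w+)(?:\\|\\w+)*,?");

    static List<String> extract(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            out.add(m.group(1)); // keep only the leading word of each record
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract("Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf"));
    }
}
```

Note that \\w only covers letters, digits, and underscores, so this sketch assumes the data and junk contain no spaces or punctuation.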

In the string,
Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf
\\|.*?, - this will match |anynoncommastring,
but this doesn't match the final |rijfidjf since that does not end in comma. So to match that, use (,|$) instead of just ,, making the regex \\|.*?(,|$)
But the above does not match a single isolated comma, so alternating , with \\|.*?(,|$), makes the final regex (\\|.*?(,|$)|,).
The pattern (\\|.*?(,|$)|,) works,
String source = "Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf";
String[] result = source.split("(\\|.*?(,|$)|,)");
for (int i = 0; i < result.length; i++) {
System.out.println(result[i]);
}
Output:
Ann
Bob
Clara
David
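A possible simplification, offered as a sketch rather than as the answer's own pattern: because the junk never contains a comma, a negated character class can replace the lazy quantifier and the end-of-input alternative. The class and method names here are hypothetical.

```java
import java.util.Arrays;

// Hypothetical wrapper around a single split call.
final class SingleSplit {
    // Delimiter is either a pipe followed by everything up to (and including)
    // the next comma, or a bare comma between two junk-free fields.
    static String[] firstFields(String source) {
        return source.split("\\|[^,]*,?|,");
    }

    public static void main(String[] args) {
        String source = "Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf";
        System.out.println(Arrays.toString(firstFields(source)));
    }
}
```

The final |rijfidjf is matched as a delimiter at the end of the input; split drops the trailing empty string that results, so no extra element appears.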

I came up with the following solution:
String source = "one|junk,two|junk|junk,three,four|junk|junk";
String[] result = source.split("([|](?:(.*?,(?=[^,]+[|,]|$))|.*$))|,");
System.out.println(Arrays.toString(result));
[one, two, three, four]

Related

How can I split a string without knowing the split characters a-priori?

For my project I have to read various input graphs. Unfortunately, the input edges have not the same format. Some of them are comma-separated, others are tab-separated, etc. For example:
File 1:
123,45
67,89
...
File 2
123 45
67 89
...
Rather than handling each case separately, I would like to automatically detect the split characters. Currently I have developed the following solution:
String str = "123,45";
String splitChars = "";
for(int i=0; i < str.length(); i++) {
if(!Character.isDigit(str.charAt(i))) {
splitChars += str.charAt(i);
}
}
String[] endpoints = str.split(splitChars);
Basically I pick the first row and select all the non-numeric characters, then I use the generated substring as split characters. Is there a cleaner way to perform this?
split takes a regexp, so your code can fail for several reasons. If the separator has a special meaning in a regexp (say, +), it will fail. If the line contains more than exactly two numbers, it will also fail, because splitChars is the concatenation of every non-digit character in the line. For example, for the line 1, 2, 3 your splitChars string becomes ", , ", and your split would do nothing (that pattern would split the string "test, , abc" into two, nothing else).
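If you do want to keep the detected-separator approach, here is a rough sketch that addresses the first two failure modes: it takes only the first run of non-digit characters as the separator and passes it through Pattern.quote so it is treated literally. The class and method names are invented for this example, and it assumes every line uses the same separator.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper; assumes one consistent separator per file.
final class LiteralSeparator {
    /** Returns the first run of non-digit characters in the line, or "" if none. */
    static String detectSeparator(String line) {
        int start = 0;
        while (start < line.length() && Character.isDigit(line.charAt(start))) {
            start++;
        }
        int end = start;
        while (end < line.length() && !Character.isDigit(line.charAt(end))) {
            end++;
        }
        return line.substring(start, end);
    }

    /** Splits on the separator taken literally, never as a regex. */
    static String[] splitOn(String line, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("no separator detected");
        }
        return line.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        String sep = detectSeparator("123+45"); // "+" would be invalid as a raw regex
        System.out.println(Arrays.toString(splitOn("67+89", sep)));
    }
}
```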
Why not make a regexp to fetch digits, and then find all sequences of digits, instead of focussing on the separators?
You're using regexps whether you want to or not, so let's make it official and use Pattern, while we are at it.
private static final Pattern ALL_DIGITS = Pattern.compile("\\d+");
// then in your split method..
Matcher m = ALL_DIGITS.matcher(str);
List<Integer> numbers = new ArrayList<Integer>();
// Prefer a List to an array in general.
while (m.find()) {
numbers.add(Integer.parseInt(m.group(0)));
}
\\d+ means: one or more digits.
m.find() finds the next match (so, the next block of digits), returning false if there aren't any more.
m.group(0) retrieves the entire matched string.
Split the string on \\D+ which means one or more non-digit characters.
Demo:
import java.util.Arrays;
public class Main {
public static void main(String[] args) {
// Test strings
String[] arr = { "123,45", "67,89", "125 89", "678 129" };
for (String s : arr) {
System.out.println(Arrays.toString(s.split("\\D+")));
}
}
}
Output:
[123, 45]
[67, 89]
[125, 89]
[678, 129]
Why not split with [^\d]+ (every run of non-digits):
for (String n : "123,456 789".split("[^\\d]+")) {
System.out.println(n);
}
Result:
123
456
789
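One caveat with splitting on non-digits: if a line starts with a separator (for example a leading space), split returns an empty first element. A small sketch that filters those out and converts to int follows; the class and method names are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper that tolerates leading separators.
final class DigitRuns {
    static int[] parseInts(String line) {
        return Arrays.stream(line.split("\\D+"))
                .filter(s -> !s.isEmpty())      // drop the empty leading element, if any
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    public static void main(String[] args) {
        // A plain split keeps an empty string in front of 123 here.
        System.out.println(Arrays.toString(" 123,45".split("\\D+")));
        System.out.println(Arrays.toString(parseInts(" 123,45")));
    }
}
```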

How to get a String with Java regular expression in brackets within brackets

How can I get the strings inside nested brackets? See the code below.
String str = "C1<C2, C3<T1>>.C4<T2>.C5";
I need to get C1<C2, C3<T1>>, C4<T2>, and C5.
Here is what I tried:
Pattern pat = Pattern.compile("(\\w+(<[^>]+>)?)(.\\w+(<[^>]+>)?)*");
Matcher mat = pat.matcher(str);
but the result was
C1<C2, C3<T1>
There are 2 problems that I see with your code:
It seems like you are only printing the first match instead of looping through the results. Use while(mat.find()) to iterate through the list of matches.
Simplify your pattern to \\w+(<[^>]+>+)? to get C1<C2, C3<T1>>, C4<T2>, and C5.
RegEx pattern explained:
\\w+ = 1 or more alphanumeric or underscore characters
()? = 0 or 1 of what is in the parentheses
< = match the < character
[^>]+ = 1 or more characters other than >
>+ = 1 or more > character (An alternative would be >{1,2} if you want to enforce only either one or two > characters.)
Your resulting code should look like the following:
public static void main(String[] args)
{
String str = "C1<C2, C3<T1>>.C4<T2>.C5";
Pattern pat = Pattern.compile("\\w+(<[^>]+>+)?");
Matcher mat = pat.matcher(str);
while(mat.find()) {
System.out.println(mat.group());
}
}
If you just want a list of the parts though, a much simpler way to accomplish this would be to use split() instead of RegEx. You can split the string on ., save the pieces in an array and then iterate through the array as so desired.
That would be accomplished with the following:
String[] parts = str.split("\\.");
Just split on dots:
String[] parts = str.split("\\.");
This does what you want using the sample input in the question.
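Both the regex and the plain split rely on the shape of the sample input: a dot inside angle brackets, or nesting deeper than the pattern anticipates, would break them. As a hedged alternative, here is a sketch that tracks bracket depth and splits only on dots at depth zero. The class and method names are assumptions for this example, and it assumes the brackets are balanced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical depth-aware splitter; assumes balanced < and >.
final class GenericPathSplitter {
    static List<String> splitTopLevel(String s) {
        List<String> parts = new ArrayList<>();
        int depth = 0; // how many '<' are currently open
        int start = 0; // start index of the current part
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '<') {
                depth++;
            } else if (c == '>') {
                depth--;
            } else if (c == '.' && depth == 0) {
                parts.add(s.substring(start, i)); // dot outside any brackets
                start = i + 1;
            }
        }
        parts.add(s.substring(start)); // last part after the final top-level dot
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitTopLevel("C1<C2, C3<T1>>.C4<T2>.C5"));
    }
}
```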

Splitting a string between a char

I want to split a String on a delimiter.
Example String:
String str="ABCD/12346567899887455422DEFG/15479897445698742322141PQRS/141455798951";
Now I want strings such as ABCD/12346567899887455422 and DEFG/15479897445698742322141; that is, I want:
only 4 chars before /
after /, any number of chars (digits and letters).
Update:
The only time I need the previous 4 characters is after a delimiter is shown, as the string may contain letters or numbers...
My code attempt:
public class StringReq {
public static void main(String[] args) {
String str = "BONL/1234567890123456789CORT/123456789012345678901234567890HOLD/123456789012345678901234567890INTC/123456789012345678901234567890OTHR/123456789012345678901234567890PHOB/123456789012345678901234567890PHON/123456789012345678901234567890REPA/123456789012345678901234567890SDVA/123456789012345678901234567890TELI/123456789012345678901234567890";
testSplitStrings(str);
}
public static void testSplitStrings(String path) {
System.out.println("splitting of sprint starts \n");
String[] codeDesc = path.split("/");
String[] codeVal = new String[codeDesc.length];
for (int i = 0; i < codeDesc.length; i++) {
codeVal[i] = codeDesc[i].substring(codeDesc[i].length() - 4,
codeDesc[i].length());
System.out.println("line" + i + "==> " + codeDesc[i] + "\n");
}
for (int i = 0; i < codeVal.length - 1; i++) {
System.out.println(codeVal[i]);
}
System.out.println("splitting of sprint ends");
}
}
You say that digits and letters can appear after /, but in your example I don't see any letters that should be included in the result after /.
So, based on that assumption, you can simply split at places that have a digit before them and an A-Z character after them.
To do so, you can split with a regex that uses the look-around mechanism, like str.split("(?<=[0-9])(?=[A-Z])")
Demo:
String str = "BONL/1234567890123456789CORT/123456789012345678901234567890HOLD/123456789012345678901234567890INTC/123456789012345678901234567890OTHR/123456789012345678901234567890PHOB/123456789012345678901234567890PHON/123456789012345678901234567890REPA/123456789012345678901234567890SDVA/123456789012345678901234567890TELI/123456789012345678901234567890";
for (String s : str.split("(?<=[0-9])(?=[A-Z])"))
System.out.println(s);
Output:
BONL/1234567890123456789
CORT/123456789012345678901234567890
HOLD/123456789012345678901234567890
INTC/123456789012345678901234567890
OTHR/123456789012345678901234567890
PHOB/123456789012345678901234567890
PHON/123456789012345678901234567890
REPA/123456789012345678901234567890
SDVA/123456789012345678901234567890
TELI/123456789012345678901234567890
If letters can actually appear in the second part (after /), then you can use a split that looks for places followed by four uppercase letters and a /, like split("(?=[A-Z]{4}/)") (assuming that you are using at least Java 8; if not, you will need to manually exclude the split at the start of the string, for instance by adding (?!^) or (?<=.) at the start of your regex).
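A short sketch of the lookahead split with (?!^) added, so the result does not depend on how the Java version treats a zero-width match at the start of the string. The sample string and the names CodeSplitter and splitRecords are invented for illustration; the values after / deliberately contain lowercase letters.

```java
import java.util.Arrays;

// Hypothetical wrapper; the sample data is made up for this example.
final class CodeSplitter {
    // Split at every position (except the very start) that is followed by
    // four uppercase letters and a slash. Nothing is consumed by the split.
    static String[] splitRecords(String s) {
        return s.split("(?!^)(?=[A-Z]{4}/)");
    }

    public static void main(String[] args) {
        String str = "ABCD/12ab34DEFG/99xPQRS/1";
        for (String part : splitRecords(str)) {
            System.out.println(part);
        }
    }
}
```

This still assumes that the value part never contains four consecutive uppercase letters followed by a /, since such a run would be taken as the start of a new record.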
You can use a regex with Pattern and Matcher:
Pattern pattern = Pattern.compile("[A-Z]{4}/[0-9]*");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group());
}
Instead of:
String[] codeDesc = path.split("/");
Just use this regex (4 characters before / and any characters after):
String[] codeDesc = path.split("(?=.{4}/)(?<=.)");
Even simpler using \d:
path.split("(?=[A-Za-z])(?<=\\d)");
EDIT:
Added a condition requiring exactly four letters of either case:
path.split("(?=[A-Za-z]{4})(?<=\\d)");
output:
BONL/1234567890123456789
CORT/123456789012345678901234567890
HOLD/123456789012345678901234567890
INTC/123456789012345678901234567890
OTHR/123456789012345678901234567890
PHOB/123456789012345678901234567890
PHON/123456789012345678901234567890
REPA/123456789012345678901234567890
SDVA/123456789012345678901234567890
TELI/123456789012345678901234567890
It is still unclear whether this is the author's expected result.

Splitting line based on comma, strange line

I have the following line comma separated,
LanguageID=0,LastKnownPeriod="Active",c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
Using the split method I can get the comma-separated values, but the problem arises with text such as c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}, since it contains a comma itself.
So the words after splitting should be:
LanguageID=0
LastKnownPeriod="Active"
c_MultiPartyCall={Counter=1,TimeStamp=1394539271448} (comma is again found within the word)
LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"} (comma is again found within the word in curly brackets)
I tried with following code but didn't work:
String arr[]=input_line.split("(.*!{),(.*!})");
for (int i=0;i<arr.length;i++)
System.out.println(arr[i]);
Please advise.
Use regular expressions instead:
([\w_]+=(?:\{[\w=_,\[\]"\|:\.\s-]*\}))|([^,]+)
This will group the line into 4 sections:
LanguageID=0
LastKnownPeriod="Active"
c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}
LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
Code:
import java.util.regex.*;
public class JavaRegEx {
public static void main(String[] args) {
String line = "LanguageID=0,LastKnownPeriod=\"Active\",c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},LTH={Data=[\"1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||\",\"1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||\"}";
Pattern pattern = Pattern.compile("([\\w_]+=(?:\\{[\\w=_,\\[\\]\"\\|:\\.\\s-]*\\}))|([^,]+)");
Matcher matcher = pattern.matcher(line);
while(matcher.find())
System.out.println(matcher.group(0));
}
}
First, just splitting on a comma isn't how CSV works
a,b,"c,d"
has only three values, a, b, and c,d. I recommend using a CSV parser, like opencsv. CSV is not terribly complicated, but it isn't as simple as split by comma.
Second, your CSV data is invalid because you have a quote and a comma in a field that isn't quoted.
In other words, if you want the two values a and b","c, then the CSV is
a,"b"",""c"
(Note that quotes inside a quoted field are escaped by doubling them.)
Otherwise, it is impossible to tell what fields you actually wanted. A CSV parser would choke on your data.
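To illustrate the quoting rules described above, here is a minimal sketch of a parser for a single CSV line. It is not a substitute for a real CSV library: it assumes one record per line, comma as the delimiter, and no embedded line breaks. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-line CSV sketch; a real parser handles more cases.
final class CsvLineSketch {
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote means a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are plain text inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("a,b,\"c,d\""));        // three fields
        System.out.println(parseLine("a,\"b\"\",\"\"c\"")); // two fields
    }
}
```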
While it might be possible to do this by split(), it's much easier to match the actual tokens (where split() matches the delimiters between the tokens). Your tokens all consist of one or more of any characters other than comma or brace, optionally followed by a pair of braces enclosing some non-brace characters (which can include commas):
[^,{}]+(?:\{[^{}]+\})?
The Java code for that would be:
List<String> matchList = new ArrayList<String>();
Pattern p = Pattern.compile("[^,{}]+(?:\\{[^{}]+\\})?");
Matcher m = p.matcher(s);
while (m.find()) {
matchList.add(m.group());
}
But it looks like you can break it down further:
Pattern p = Pattern.compile("(\\w+)=([^,{}]+|\\{[^{}]+\\})");
Matcher m = p.matcher(TEST_STR);
while (m.find()) {
System.out.printf("%nname = %s%nvalue = %s%n",
m.group(1), m.group(2));
}
output:
name = LanguageID
value = 0
name = LastKnownPeriod
value = "Active"
name = c_MultiPartyCall
value = {Counter=1,TimeStamp=1394539271448}
name = LTH
value = {Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
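If the name/value pairs are needed later, the same pattern can fill a map instead of printing. This is only a sketch: the class and method names are assumptions, and a LinkedHashMap is chosen here simply to keep the fields in input order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built on the name=value pattern shown above.
final class KeyValueSketch {
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=([^,{}]+|\\{[^{}]+\\})");

    static Map<String, String> parse(String line) {
        Map<String, String> out = new LinkedHashMap<>(); // preserves input order
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            out.put(m.group(1), m.group(2)); // braces and quotes stay in the value
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "LanguageID=0,LastKnownPeriod=\"Active\","
                + "c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}";
        parse(line).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Like the pattern it is based on, this assumes braces are never nested inside a value.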

How to split a '*' String in Java

I have a problem splitting a string on '*split_*'; it seems my Java program (in NetBeans) can't split when '*split_*' is used as the delimiter.
Any idea how to overcome this?
I referred to this solution, but it only covers splitting without '*': How to split a string in Java
String echoPHP= "test*split_*test2";
String[] strArray = echoPHP.split("*split_*");
String part1 = strArray[0]; // expected: test
String part2 = strArray[1]; // expected: test2
System.out.println(strArray[0]);
System.out.println(strArray[1]);
error is:
Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*split_*
output supposed to be:
test
test2
Use Pattern.quote() around your split string to ensure it's taken as a literal, not a regular expression:
String[] strArray = echoPHP.split(Pattern.quote("*split_*"));
You'll have difficulties otherwise, since * is a special character in regular expressions, used to match any number of occurrences of the character or group that preceded it.
Of course, you could manually escape all the special characters used in regular expressions using \, but this is both less clear and more error prone if you don't want to use any regular expression features.
try: echoPHP.split("\\*split_\\*")
The important thing to remember is that the String you pass to the split method is really a regular expression. Refer to the API for more details: http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)
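For comparison, a small sketch showing that quoting the delimiter and escaping it by hand give the same result for the string in the question. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical side-by-side of the two approaches.
final class LiteralSplit {
    // Pattern.quote turns any delimiter into a literal pattern.
    static String[] withQuote(String s, String delimiter) {
        return s.split(Pattern.quote(delimiter));
    }

    // Manual escaping: each * needs a backslash, doubled inside a Java string.
    static String[] withManualEscapes(String s) {
        return s.split("\\*split_\\*");
    }

    public static void main(String[] args) {
        String echoPHP = "test*split_*test2";
        System.out.println(Arrays.toString(withQuote(echoPHP, "*split_*")));
        System.out.println(Arrays.toString(withManualEscapes(echoPHP)));
    }
}
```

Pattern.quote is the safer choice when the delimiter comes from a variable, since it needs no knowledge of which characters are special.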
Here are different cases to split string in java. You can use one which may fit in your application.
case 1 : Here is code to split string by a character "." :
String imageName = "picture1.jpg";
String [] imageNameArray = imageName.split("\\.");
for(int i =0; i< imageNameArray.length ; i++)
{
System.out.println(imageNameArray[i]);
}
And what if mistakenly there are spaces left before or after "." in such cases? It's always best practice to consider those spaces also.
String imageName = "picture1 . jpg";
String [] imageNameArray = imageName.split("\\s*\\.\\s*");
for(int i =0; i< imageNameArray.length ; i++)
{
System.out.println(imageNameArray[i]);
}
Here, \\s* absorbs any spaces around the dot, and \\. matches the dot literally, so you get only the required split strings.
Now, suppose you have placed parameters between two special characters, like #parameter# or *parameter*, or even two different signs at a time like *parameter#. We can get a list of all the parameters between those signs with this code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Splitter {
public static void main(String[] args) {
String pattern1 = "#"; // opening sign
String pattern2 = "#"; // closing sign
String text = "(#n1_1#/#n2_2#)*2/#n1_1#*34/#n4_4#";
// Quote both signs so they are matched literally; capture lazily between them.
Pattern p = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2));
Matcher m = p.matcher(text);
// Create the list once, outside the loop, so it accumulates every match.
List<String> result = new ArrayList<>();
while (m.find()) {
result.add(m.group(1));
}
System.out.println(result);
}
}
Here the list result will hold the parameters n1_1, n2_2, n1_1 and n4_4 (n1_1 occurs twice in the text).
You can use split method like this
String[] strArray = echoPHP.split("\\*split_\\*");
* is a special character in regular expressions, so you should put "\\" in front of each * character.
