Splitting line based on comma, strange line - java

I have the following comma-separated line:
LanguageID=0,LastKnownPeriod="Active",c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
Using the split method, I can get the comma-separated values, but the problem comes with text such as c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}, which contains a comma itself.
So the tokens after splitting should be:
LanguageID=0
LastKnownPeriod="Active"
c_MultiPartyCall={Counter=1,TimeStamp=1394539271448} (comma is again found within the word)
LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"} (comma is again found within the word in curly brackets)
I tried the following code, but it didn't work:
String arr[]=input_line.split("(.*!{),(.*!})");
for (int i=0;i<arr.length;i++)
System.out.println(arr[i]);
Please advise.

Use regular expressions instead:
([\w_]+=(?:\{[\w=_,\[\]"\|:\.\s-]*\}))|([^,]+)
This will group the line into 4 sections:
LanguageID=0
LastKnownPeriod="Active"
c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}
LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
Code:
import java.util.regex.*;
public class JavaRegEx {
    public static void main(String[] args) {
        String line = "LanguageID=0,LastKnownPeriod=\"Active\",c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},LTH={Data=[\"1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||\",\"1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||\"}";
        Pattern pattern = Pattern.compile("([\\w_]+=(?:\\{[\\w=_,\\[\\]\"\\|:\\.\\s-]*\\}))|([^,]+)");
        Matcher matcher = pattern.matcher(line);
        // Each match is one top-level section of the line.
        while (matcher.find()) {
            System.out.println(matcher.group(0));
        }
    }
}

First, just splitting on a comma isn't how CSV works:
a,b,"c,d"
has only three values, a, b, and c,d. I recommend using a CSV parser, like opencsv. CSV is not terribly complicated, but it isn't as simple as split by comma.
Second, your CSV data is invalid because you have a quote and a comma in a field that isn't quoted.
In other words, if you want the two values a and b","c, then the CSV is
a,"b"",""c"
(Note that a quote inside a quoted field is escaped by doubling it.)
Otherwise, it is impossible to tell what fields you actually wanted. A CSV parser would choke on your data.
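To make the quoting rules above concrete, here is a minimal sketch of a hand-written reader for a single CSV line. The names CsvLineSketch and parseLine are made up for illustration, and the code assumes well-formed input with no line breaks inside fields. For real data, a library such as opencsv remains the safer choice.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineSketch {
    // Splits one CSV line into fields. Inside a quoted field a comma is data,
    // and a doubled quote ("") stands for one literal quote.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote -> literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("a,b,\"c,d\""));       // prints [a, b, c,d]
        System.out.println(parseLine("a,\"b\"\",\"\"c\"")); // prints [a, b","c]
    }
}
```

The second call is the a,"b"",""c" example from above: it yields two fields, a and b","c.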

While it might be possible to do this by split(), it's much easier to match the actual tokens (where split() matches the delimiters between the tokens). Your tokens all consist of one or more of any characters other than comma or brace, optionally followed by a pair of braces enclosing some non-brace characters (which can include commas):
[^,{}]+(?:\{[^{}]+\})?
The Java code for that would be:
List<String> matchList = new ArrayList<String>();
Pattern p = Pattern.compile("[^,{}]+(?:\\{[^{}]+\\})?");
Matcher m = p.matcher(s);
while (m.find()) {
matchList.add(m.group());
}
But it looks like you can break it down further:
Pattern p = Pattern.compile("(\\w+)=([^,{}]+|\\{[^{}]+\\})");
Matcher m = p.matcher(TEST_STR);
while (m.find()) {
System.out.printf("%nname = %s%nvalue = %s%n",
m.group(1), m.group(2));
}
output:
name = LanguageID
value = 0
name = LastKnownPeriod
value = "Active"
name = c_MultiPartyCall
value = {Counter=1,TimeStamp=1394539271448}
name = LTH
value = {Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
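If the name/value pairs are what you ultimately need, the second pattern can fill a map directly. This is only a sketch: KeyValueSketch and parse are hypothetical names, and like the regex itself it assumes the braces are never nested.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueSketch {
    // name = word characters; value = either a run without commas/braces
    // or one brace-enclosed group (which may contain commas).
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=([^,{}]+|\\{[^{}]+\\})");

    // Collects every top-level name=value pair, keeping the original order.
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "LanguageID=0,LastKnownPeriod=\"Active\","
                + "c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}";
        Map<String, String> pairs = parse(line);
        System.out.println(pairs.get("LanguageID"));       // prints 0
        System.out.println(pairs.get("c_MultiPartyCall")); // prints {Counter=1,TimeStamp=1394539271448}
    }
}
```

A LinkedHashMap is used here so that iterating over the result returns the pairs in the order they appeared in the line.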

Related

String splitting regex with patterns ignored

I have a source string that I want to split the data out:
String source = "data|junk,data|junk|junk,data,data|junk";
String[] result = source.split(",");
The above gives data|junk, data|junk|junk, data, data|junk. To further get the data out, I did this:
for (int i = 0; i < result.length; i++) {
result[i] = result[i].split("\\|")[0];
}
Which gives what I wanted data, data, data, data. I want to see if it is possible to do it in one split with the right regex:
String[] result = source.split("\\|.*?,");
The above gives data, data, data,data|junk, in which the last two data are not split. Could you please help with the correct regex to get the result I wanted?
Example string: "Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf"
Expected result: "Ann, Bob, Clara, David"
You can change your regular expression to account for the "junk", then keep matching while it matches data:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class RegexTest {
    public static void main(String[] args) {
        String input = "Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf";
        Pattern p = Pattern.compile("(\\w+)(\\|\\w+)*,?");
        Matcher m = p.matcher(input);
        // Group 1 is the data in front of the first pipe.
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}
The regular expression looks for word characters (letters, digits, and underscores) and captures them. It then looks for a pipe symbol (escaped so that it does not have a special meaning in the regular expression), followed again by word characters. This pipe plus word characters can occur any number of times (zero or more). After that there may be a comma, optionally.
This prints
Ann
Bob
Clara
David
It also captures the "junk", and you could access that with m.group(2) in the loop. If you don't want to capture that, insert a ?: into the regular expression:
Pattern.compile("(\\w+)(?:\\|\\w+)*,?");
In the string,
Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf
the pattern \\|.*?, matches |anynoncommastring, (a pipe, the junk after it, and the comma that follows),
but it doesn't match the final |rijfidjf, since that does not end in a comma. To match that as well, use (,|$) instead of just the comma, making the regex \\|.*?(,|$)
That still does not match a comma that directly follows data (as after Clara), so alternate it with a plain comma, which makes the final regex (\\|.*?(,|$)|,).
With the pattern (\\|.*?(,|$)|,) the split works:
String source = "Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf";
String[] result = source.split("(\\|.*?(,|$)|,)");
for (int i = 0; i < result.length; i++) {
System.out.println(result[i]);
}
Output:
Ann
Bob
Clara
David
I came up with the following solution:
String source = "one|junk,two|junk|junk,three,four|junk|junk";
String[] result = source.split("([|](?:(.*?,(?=[^,]+[|,]|$))|.*$))|,");
System.out.println(Arrays.toString(result));
[one, two, three, four]
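If a single split call is not a hard requirement, a two-step variant may be easier to read: first delete every "|junk" run, then split on the commas that remain. This is a sketch rather than a drop-in answer; StripJunkSketch and extract are invented names, and it assumes the data itself never contains a pipe or a comma.

```java
import java.util.Arrays;

public class StripJunkSketch {
    // Removes each pipe together with everything up to (not including)
    // the next comma, then splits the cleaned string on commas.
    static String[] extract(String source) {
        return source.replaceAll("\\|[^,]*", "").split(",");
    }

    public static void main(String[] args) {
        String source = "Ann|xcjiajeaw,Bob|aijife|vdsjisdjfe,Clara,David|rijfidjf";
        System.out.println(Arrays.toString(extract(source))); // prints [Ann, Bob, Clara, David]
    }
}
```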

How can I avoid splitting on a comma in brackets?

I have the string below, which I want to split into a String array on multiple delimiters.
The delimiters are comma (,), semicolon (;), "OR" and "AND".
But I do not want to split on a comma if it's in brackets.
Example input:
device_name==device503,device_type!=GATEWAY;site_name<site3434 OR country==India AND location==BLR; new_name=in=(Rajesh,Suresh)
I am able to split the String with regex, but it doesn't handle commas in brackets correctly.
How can I fix this?
Pattern ptn = Pattern.compile("(,|;|OR|AND)");
String[] parts = ptn.split(query);
for(String p:parts){
System.out.println(p);
queryParams.add(p.trim());
}
You could use a negative look-ahead:
String[] parts = input.split(",(?![^()]*\\))|;| OR | AND ");
Or an uglier (but perhaps conceptually simpler) way you could do it would be to replace any commas within brackets with a temporary placeholder, then do the split and replace the placeholders with real commas in the results.
String input = "X,Y=((A,B),C) OR Z";
// Greedy match from the first "(" to the last ")".
Pattern pattern = Pattern.compile("\\(.*\\)");
Matcher matcher = pattern.matcher(input);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    // Hide the commas inside the brackets behind a placeholder.
    matcher.appendReplacement(sb, matcher.group().replaceAll(",", "_COMMA_"));
}
matcher.appendTail(sb);
String[] parts = sb.toString().split("(,|;| OR | AND )");
for (String part : parts) {
    // Restore the real commas in each part.
    System.out.println(part.replace("_COMMA_", ","));
}
Prints:
X
Y=((A,B),C)
Z
Alternatively, you could write your own little tokenizer that reads the input character-by-character using charAt(index) or define a grammar for an off-the-shelf parser.
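As a rough sketch of the hand-written tokenizer idea, the following keeps a counter of open brackets and treats a delimiter as a split point only while the counter is zero, so nested brackets are handled. The names TopLevelSplitSketch, split and delimiterAt are invented for this example, and the code assumes that brackets are balanced and that OR and AND are always surrounded by single spaces.

```java
import java.util.ArrayList;
import java.util.List;

public class TopLevelSplitSketch {
    private static final String[] DELIMITERS = {",", ";", " OR ", " AND "};

    // Splits on the delimiters above, but only outside of round brackets.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        int depth = 0; // number of currently open brackets
        int start = 0; // start index of the part being read
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (depth == 0) {
                String hit = delimiterAt(input, i);
                if (hit != null) {
                    parts.add(input.substring(start, i).trim());
                    i += hit.length();
                    start = i;
                    continue;
                }
            }
            i++;
        }
        parts.add(input.substring(start).trim());
        return parts;
    }

    // Returns the delimiter that starts at the given index, or null if none does.
    private static String delimiterAt(String input, int index) {
        for (String d : DELIMITERS) {
            if (input.startsWith(d, index)) {
                return d;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(split("X,Y=((A,B),C) OR Z")); // prints [X, Y=((A,B),C), Z]
    }
}
```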
You can use a negative look-ahead (?!...), which looks at the following characters; if those characters match the pattern inside the look-ahead, the overall match fails.
String query = "device_name==device503,device_type!=GATEWAY;site_name<site3434 OR country==India AND location==BLR; new_name=in=(Rajesh,Suresh)";
String[] parts = query.split("\\s*(,(?![^()]*\\))|;|OR|AND)\\s*");
for(String part: parts)
System.out.println(part);
Output:
device_name==device503
device_type!=GATEWAY
site_name<site3434
country==India
location==BLR
new_name=in=(Rajesh,Suresh)
So in this case we check whether the characters following the , are 0 or more characters which aren't either ( or ), followed by a ), and if this is true, the , match fails.
This won't work if you can have nested brackets.
Note:
String also has a split method (as used above), which is useful for simplicity's sake (but would be slower than reusing the same Pattern over and over again for multiple Strings).
You can add \\s* (0 or more whitespace characters) to your regex to remove any spaces before or after a delimiter.
If you're using | without anything before or after (e.g. "a|b|c"), you don't need to put it in brackets.
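The point about reusing a Pattern could look roughly like this: the expression is compiled once into a constant and then applied to any number of query strings. ReusedSplitPatternSketch and split are hypothetical names. One thing to keep in mind is that, because OR and AND are matched without surrounding spaces in this pattern, they would also match inside an upper-case value such as VENDOR.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class ReusedSplitPatternSketch {
    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern DELIMITER =
            Pattern.compile("\\s*(,(?![^()]*\\))|;|OR|AND)\\s*");

    static String[] split(String query) {
        return DELIMITER.split(query);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("a==1,b=in=(x,y) OR c==2")));
        // prints [a==1, b=in=(x,y), c==2]
        System.out.println(Arrays.toString(split("p==1; q==2 AND r==3")));
        // prints [p==1, q==2, r==3]
    }
}
```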

Converting string of lists to list of string in java

I am getting a value that is a list of strings in string format, like this: "["a", "b"]". I would like to convert it to a list of strings. I can do this by stripping the leading and trailing brackets and then splitting on commas. The problem is that I may also receive the value as a single string, "a", which I also want to convert to a list of strings. Is there any way to generalize this?
One possible solution is to use Regex.
Your expression can look like this: "(.+?)"
. matches any character (except for line terminators)
+? is a lazy quantifier: it matches between one and unlimited times, as few times as possible, expanding as needed.
String tokens = "[\"a\", \"b,c\", \"test\"]";
String pattern = "\"(.+?)\"";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(tokens);
List<String> tokenList = new ArrayList<String>();
while (m.find()) {
    // group(1) is the text between the quotes; group() would keep the quotes.
    tokenList.add(m.group(1));
}
System.out.println(tokenList);
You can generalize the following:
String str = "\"[\"a\",\"b\"]\"";
String[] splitStrs = str.split("\"",7);
System.out.println(splitStrs[0]+" "+splitStrs[1]+" "+splitStrs[2]+" "+splitStrs[3]+" "+splitStrs[4]+" "+splitStrs[5]+" "+splitStrs[6]);
My output
[ a , b ]
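One way to cover both input shapes with the same code is to ignore the brackets entirely and collect whatever sits between pairs of quotes. The sketch below does that; QuotedTokensSketch and toList are made-up names, and it assumes every element is wrapped in double quotes and contains no escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedTokensSketch {
    // Captures the text between one pair of double quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Works for a bracketed list such as ["a", "b"] and for a single "a".
    static List<String> toList(String value) {
        List<String> tokens = new ArrayList<>();
        Matcher m = QUOTED.matcher(value);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(toList("[\"a\", \"b\"]")); // prints [a, b]
        System.out.println(toList("\"a\""));          // prints [a]
    }
}
```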

How to get a String with Java regular expression in brackets within brackets

How can I get the strings inside brackets? See the code below.
String str = "C1<C2, C3<T1>>.C4<T2>.C5";
I need to get C1<C2, C3<T1>>, C4<T2>, and C5.
Here is what I tried:
Pattern pat = Pattern.compile("(\\w+(<[^>]+>)?)(.\\w+(<[^>]+>)?)*");
Matcher mat = pat.matcher(str);
but the result was
C1<C2, C3<T1>
There are 2 problems that I see with your code:
1. It seems like you are only printing the first match instead of looping through the results. Use while(mat.find()) to iterate through the list of matches.
2. Simplify your pattern to \\w+(<[^>]+>+)? to get C1<C2, C3<T1>>, C4<T2>, and C5.
RegEx pattern explained:
\w+ = 1 or more alphanumeric or underscore characters
()? = 0 or 1 of what is in the parentheses
< = match the < character
[^>]+ = 1 or more characters other than >
>+ = 1 or more > characters (An alternative would be >{1,2} if you want to enforce only either one or two > characters.)
Your resulting code should look like the following:
public static void main(String[] args)
{
    String str = "C1<C2, C3<T1>>.C4<T2>.C5";
    Pattern pat = Pattern.compile("\\w+(<[^>]+>+)?");
    Matcher mat = pat.matcher(str);
    // Print every match, not just the first one.
    while(mat.find()) {
        System.out.println(mat.group());
    }
}
If you just want a list of the parts, though, a much simpler way to accomplish this would be to use split() instead of a Pattern and Matcher. You can split the string on ., save the pieces in an array, and then iterate through the array as desired.
That would be accomplished with the following:
String[] parts = str.split("\\.");
Just split on dots:
String[] parts = str.split("\\.");
This does what you want using the sample input in the question.

java regular expression split pattern

I want to split the following string:
String line ="DOB,1234567890,11,07/05/12,\"first,last\",100,\"is,a,good,boy\"";
into following tokens:
DOB
1234567890
11
07/05/12
first,last
100
is,a,good,boy
I tried using following regular expression:
import java.util.*;
import java.lang.*;
import java.util.regex.*;
import org.apache.commons.lang.StringUtils;
class SplitString{
public static final String quotes = "\".[[((a-z)|(A-Z))]+( ((a-z)|(A-Z)).,)*.((a-z)|(A-Z))].\"" ;
public static final String ISSUE_UPLOAD_FILE_PATTERN = "((a-z)|(A-Z))+ [(((a-z)|(A-Z)).,)* + ("+quotes+".,) ].((a-z)|(A-Z)) + ("+quotes+")";
public static void main(String[] args){
String line ="DOB,1234567890,11,07/05/12,\"first,last\",100,\"is,a,good,boy\"";
String delimiter = ",";
Pattern p = Pattern.compile(ISSUE_UPLOAD_FILE_PATTERN);
Pattern pattern = Pattern.compile(ISSUE_UPLOAD_FILE_PATTERN);
String[] output = pattern.split(line);
System.out.println(" pattern: "+pattern);
for(String a:output){
System.out.println(" output: "+a);
}
}
}
Am I missing anything in the regular expression?
This is an updated version of your code that gives you your expected output:
public static final String ISSUE_UPLOAD_FILE_PATTERN = "(?<=(^|,))(([^\",]+)|\"([^\"]*)\")(?=($|,))";
public static void main(String[] args) {
    String line = "DOB,1234567890,11,07/05/12,\"first,last\",100,\"is,a,good,boy\"";
    Matcher matcher = Pattern.compile(ISSUE_UPLOAD_FILE_PATTERN).matcher(line);
    while (matcher.find()) {
        // Group 3 is an unquoted field, group 4 the contents of a quoted one.
        if (matcher.group(3) != null) {
            System.out.println(matcher.group(3));
        } else {
            System.out.println(matcher.group(4));
        }
    }
}
The regex works like this:
(?<=(^|,)): Check that the character before the match is start of string or a ,
(([^\",]+)|\"([^\"]*)\"): Match either one or more characters that are neither " nor , (group 3), or a quoted string "<any number of (not ")>" (group 4)
(?=($|,)): Check that the character after the match is end of string or a ,
The result will be in either group 3 or group 4, depending on which part matched.
Your regular expressions do some weird stuff with [ and ]: the use of these doesn't look at all like character ranges. For this reason, I didn't bother to decipher and fix all of your expression.
As a second note, you should make sure what your regular expressions should describe: do you want them to match the delimiter between tokens, or each individual non-delimiter token? Use of the split method implies the former, but I guess for your application, the latter is easier to achieve. In fact in a recent answer of mine I came up with a regular expression matching tokens of a csv file:
String tokenPattern = "\"[^\"]*(\"\"[^\"]*)*\"|[^,]*";
This will match
unquoted strings up to but not including the next comma
quoted strings up to the closing quote, including embedded commas
quoted strings including double quotes
You can use this, create a matcher for your line, iterate over all matches using find and extract the token using group(). You could also use that loop to strip quotes and transform double quotes to single quotes, if you need the semantic value of the column.
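A sketch of such a loop is shown below. Because [^,]* can also match the empty string, a plain find() loop would report an extra empty match in front of each comma, so this version instead anchors the pattern at the current position with region() and lookingAt() and steps over the comma itself. CsvTokenSketch and tokens are hypothetical names, and the input is assumed to be a single well-formed line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvTokenSketch {
    // A quoted token (with "" as an embedded quote) or an unquoted run without commas.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"]*(\"\"[^\"]*)*\"|[^,]*");

    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        int pos = 0;
        while (true) {
            m.region(pos, line.length());
            m.lookingAt(); // always succeeds: the second alternative may match nothing
            String token = m.group();
            if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
                // Strip the enclosing quotes and collapse doubled quotes.
                token = token.substring(1, token.length() - 1).replace("\"\"", "\"");
            }
            result.add(token);
            pos = m.end();
            if (pos >= line.length()) {
                break;
            }
            if (line.charAt(pos) != ',') {
                throw new IllegalArgumentException("Expected ',' at index " + pos);
            }
            pos++; // step over the comma
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "DOB,1234567890,11,07/05/12,\"first,last\",100,\"is,a,good,boy\"";
        for (String token : tokens(line)) {
            System.out.println(token);
        }
    }
}
```

For the line from the question this prints the seven expected tokens, one per line, with the quotes removed.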
As an alternative, you could of course also use a CSV reader as suggested in comments to your question.
