java regular expression split pattern - java

I want to split the following string:
String line ="DOB,1234567890,11,07/05/12,\"first,last\",100,\"is,a,good,boy\"";
into following tokens:
DOB
1234567890
11
07/05/12
first,last
100
is,a,good,boy
I tried using following regular expression:
import java.util.*;
import java.lang.*;
import java.util.regex.*;
import org.apache.commons.lang.StringUtils;
class SplitString{
public static final String quotes = "\".[[((a-z)|(A-Z))]+( ((a-z)|(A-Z)).,)*.((a-z)|(A-Z))].\"" ;
public static final String ISSUE_UPLOAD_FILE_PATTERN = "((a-z)|(A-Z))+ [(((a-z)|(A-Z)).,)* + ("+quotes+".,) ].((a-z)|(A-Z)) + ("+quotes+")";
public static void main(String[] args){
String line ="DOB,1234567890,11,07/05/12,\"first,last\",100,\"is,a,good,boy\"";
String delimiter = ",";
Pattern p = Pattern.compile(ISSUE_UPLOAD_FILE_PATTERN);
Pattern pattern = Pattern.compile(ISSUE_UPLOAD_FILE_PATTERN);
String[] output = pattern.split(line);
System.out.println(" pattern: "+pattern);
for(String a:output){
System.out.println(" output: "+a);
}
}
}
Am I missing anything in the regular expression?

This is an updated version of your code that gives you your expected output:
public static final String ISSUE_UPLOAD_FILE_PATTERN = "(?<=(^|,))(([^\",]+)|\"([^\"]*)\")(?=($|,))";
public static void main(String[] args) {
String line = "DOB,1234567890,11,07/05/12,\"first,last\",100,\"is,a,good,boy\"";
Matcher matcher = Pattern.compile(ISSUE_UPLOAD_FILE_PATTERN).matcher(line);
while (matcher.find()) {
if (matcher.group(3) != null) {
System.out.println(matcher.group(3));
} else {
System.out.println(matcher.group(4));
}
}
}
The regex works like this:
(?<=(^|,)): Check that the match is preceded by the start of the string or a ,
(([^\",]+)|\"([^\"]*)\"): Match either one or more characters that are neither " nor , (group 3), or a quoted run "<any number of characters other than ">" whose content, without the quotes, is group 4
(?=($|,)): Check that the match is followed by the end of the string or a ,
The result will be in either group 3 or group 4, depending on which alternative matched.

Your regular expressions do some weird stuff with [ and ]: the way you use them doesn't look at all like character classes. For this reason, I didn't bother to decipher and fix all of your expression.
As a second note, you should be clear about what your regular expression is meant to describe: do you want it to match the delimiter between tokens, or each individual non-delimiter token? Use of the split method implies the former, but I guess for your application the latter is easier to achieve. In fact, in a recent answer of mine I came up with a regular expression matching the tokens of a CSV file:
String tokenPattern = "\"[^\"]*(\"\"[^\"]*)*\"|[^,]*";
This will match
unquoted strings up to but not including the next comma
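quoted strings up to the closing quote, including embedded commas
quoted strings containing doubled (escaped) double quotes
You can use this by creating a matcher for your line, iterating over all matches using find, and extracting each token using group(). Be aware that [^,]* can also match the empty string, so the loop may report empty matches that you need to tell apart from genuinely empty fields. You could also use that loop to strip the surrounding quotes and turn doubled quotes into single ones, if you need the semantic value of the column.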
As an alternative, you could of course also use a CSV reader as suggested in comments to your question.
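To make the matcher loop concrete, here is a minimal sketch rather than a full CSV parser. It wraps the token pattern above in a lookbehind and a lookahead, which is an assumption added here (borrowed from the other answer) so that find() only starts a token at the start of the line or right after a comma. The class name CsvTokens and the helper unquote are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvTokens {
    // Token pattern from the answer, wrapped in lookarounds (an assumption of this sketch):
    // a token must start at the beginning of the line or after a comma,
    // and must end before a comma or at the end of the line.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<=^|,)(?:\"[^\"]*(?:\"\"[^\"]*)*\"|[^,]*)(?=,|$)");

    // Returns the fields of one line, with quotes stripped and doubled quotes collapsed.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            fields.add(unquote(m.group()));
        }
        return fields;
    }

    // Hypothetical helper: "a ""b""" becomes a "b"; unquoted tokens are returned unchanged.
    static String unquote(String token) {
        if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
            return token.substring(1, token.length() - 1).replace("\"\"", "\"");
        }
        return token;
    }

    public static void main(String[] args) {
        String line = "DOB,1234567890,11,07/05/12,\"first,last\",100,\"is,a,good,boy\"";
        for (String field : parse(line)) {
            System.out.println(field);
        }
    }
}
```

Empty fields come back as empty strings. Input with unbalanced quotes is not handled, which is one more reason to prefer a real CSV reader for anything beyond simple lines.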

Related

spliting a string by space and dot and comma at the same time

How can I split a string by space, dot and comma at the same time? I want to get rid of them and get words only.
My code for space:
str=array.get(0).split(" ");
After some advice I wrote this:
str=array.get(0).split("[ ]|[.]|[,]|[ \t]");
but I see a new problem: [two screenshots, "String" and "New problem", are not available]
The split method accepts a regex pattern, so you can match more elaborate cases to split your string.
A matching pattern for your case would be:
[ \.,]+
Regex explanation:
[ .,] - The brackets create a character set, which matches any single character in the set (inside the brackets the dot needs no escaping, so [ \.,] and [ .,] are equivalent).
+ - The plus sign is a quantifier; it matches the previous token (the character set) one or more times. This solves the problem of consecutive delimiters creating empty strings in the array.
You can test it with the following code:
class Main {
public static void main(String[] args) {
String str = "Hello, World!, StackOverflow. Test Regex";
String[] split = str.split("[ .,]+");
for(String s : split){
System.out.println(s);
}
}
}
The output is:
Hello
World!
StackOverflow
Test
Regex
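Using .split() can lead to empty entries in your array, for example when the string starts with a delimiter.
Try matching the words instead:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class WordFinder {
public static void main(String[] args) {
String text = "This is... a real sentence, actually.";
// \w+ matches each run of word characters, so no empty entries can occur
Pattern reg = Pattern.compile("\\w+");
Matcher m = reg.matcher(text);
while (m.find()) {
System.out.println(m.group());
}
}
}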
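To see where the empty entry comes from, here is a small sketch comparing the two approaches on a string that starts with delimiters. The class and method names are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitVersusFind {
    // Splits on runs of spaces, dots and commas.
    static String[] bySplit(String text) {
        return text.split("[ .,]+");
    }

    // Collects runs of word characters instead of cutting at delimiters.
    static List<String> byFind(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = ", Hello, World. ";
        // The leading ", " produces an empty first element; trailing empties are dropped.
        System.out.println(Arrays.toString(bySplit(text)));
        System.out.println(byFind(text));
    }
}
```

With split you would have to skip the empty first element yourself, while the matcher loop never produces one. Note that \w+ also drops characters such as ! that the split version keeps.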

Java regex only backslash(\\) not working

I am using a pattern which contains a backslash (\) escaped once, but it is not working at all. I get "no match" as the result.
package com.test;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TestClassRegex {
private static final String VALIDATION = "^[0-9\\-]+$";
public static void main(String[] args) {
String line = "1234\56";
Pattern r = Pattern.compile(VALIDATION);
Matcher m = r.matcher(line);
if (m.matches()) {
System.out.println("match");
}
else {
System.out.println("no match !!");
}
}
}
How can I write a pattern which recognizes a backslash literally?
I have actually seen another post:
Java regular expression value.split("\\."), "the back slash dot" divides by character?
which doesn't answer my question completely, hence I need some pointers here.
"1234\56" does not contain a backslash: the value of that string literal is "1234."
Why?
Inside a Java string literal, a \ followed by octal digits is an octal escape that denotes the character with that code. Here \56 is octal 56, which is decimal 46 in the ASCII table, and that character is .
That's exactly the reason why you're not getting a match here.
Solution
First of all, change your regex to ^[0-9\\\\-]+$ because in a Java string literal the \ itself has to be escaped: the four backslashes in the source become the two-character regex \\, which matches one literal backslash. Your original pattern, "^[0-9\\-]+$", only reaches the regex engine as \-, an escaped hyphen.
For the same reason, your input needs to be written as "1234\\56".
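A short sketch of both points, assuming the goal is to accept digits, hyphens and literal backslashes. The class name BackslashRegex and the method isValid are made up for this example.

```java
import java.util.regex.Pattern;

class BackslashRegex {
    // Source literal "^[0-9\\\\-]+$" reaches the regex engine as ^[0-9\\-]+$ :
    // digits, a literal backslash, or a hyphen, one or more times.
    static final Pattern VALIDATION = Pattern.compile("^[0-9\\\\-]+$");

    static boolean isValid(String s) {
        return VALIDATION.matcher(s).matches();
    }

    public static void main(String[] args) {
        String octal = "1234\56";    // \56 is an octal escape, so this is 1234 followed by a dot
        String escaped = "1234\\56"; // a real backslash between 1234 and 56

        System.out.println(octal + " -> " + isValid(octal));
        System.out.println(escaped + " -> " + isValid(escaped));
    }
}
```

The first string fails because it contains a dot, not a backslash; the second one matches.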

Replacing digits separated with commas using String.replace("","");

I have a string which looks like following:
Turns 13,000,000 years old
Now I want to convert the digits to words in English. I have a function ready for that; however, I am having trouble detecting the original number (13,000,000 in this case), because it is separated by commas.
Currently I am using the following regex to detect a number in a string:
stats = stats.replace((".*\\d.*"), (NumberToWords.start(Integer.valueOf(notification_data_greet))));
But the above seems not to work, any suggestions?
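You need to extract the number using a regex which allows for the commas. The most robust one I can think of right now is
\d{1,3}(,?\d{3})*
which matches any unsigned integer, both with correctly placed commas and without commas (and weird combinations thereof, like 100,000000). One caveat: when used with find(), a run of digits without commas whose length is not a multiple of three, such as 1000, is matched in pieces (100, then 0); anchoring the pattern with word boundaries, \b\d{1,3}(,?\d{3})*\b, avoids that.
Then remove every , from the match and you can parse as usual: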
Pattern p = Pattern.compile("\\d{1,3}(,?\\d{3})*"); // You can store this as static final
Matcher m = p.matcher(input);
while (m.find()) { // Go through all matches
String num = m.group().replace(",", "");
int n = Integer.parseInt(num);
// Do stuff with the number n
}
Working example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) throws InterruptedException {
String input = "1,300,000,000";
Pattern p = Pattern.compile("\\d{1,3}(,?\\d{3})*"); // You can store this as static final
Matcher m = p.matcher(input);
while (m.find()) { // Go through all matches
String num = m.group().replace(",", "");
System.out.println(num);
int n = Integer.parseInt(num);
System.out.println(n);
}
}
}
Gives output
1300000000
1300000000
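The question was about replacing the number inside a longer sentence, so here is a rough sketch of that step using appendReplacement. It uses a word-boundary variant of the pattern, and toWords is only a hypothetical stand-in for the asker's NumberToWords function; both are assumptions of this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberReplacer {
    // Same idea as the pattern above, with \b added on both sides so that
    // a number like 1000 is matched whole instead of as 100 and 0.
    private static final Pattern NUMBER = Pattern.compile("\\b\\d{1,3}(,?\\d{3})*\\b");

    // Hypothetical placeholder for the real number-to-words conversion.
    static String toWords(long n) {
        return "<" + n + ">";
    }

    // Replaces every number in the text with the result of toWords.
    static String replaceNumbers(String text) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            long n = Long.parseLong(m.group().replace(",", ""));
            // quoteReplacement keeps $ and \ in the converted text from being interpreted
            m.appendReplacement(out, Matcher.quoteReplacement(toWords(n)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceNumbers("Turns 13,000,000 years old"));
    }
}
```

Long.parseLong still throws a NumberFormatException for digit runs that do not fit into a long, so very long numbers would need separate handling.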
Try this regex:
[0-9][0-9]?[0-9]?((,)?[0-9][0-9][0-9])*
This matches numbers that are separated by a comma for each 1000; the optional comma has to come before each group of three digits. So it will match
10,000,000
but not, as a single number,
10,1,1,1
You can do it with the help of DecimalFormat instead of a regular expression
DecimalFormat format = (DecimalFormat) DecimalFormat.getInstance();
System.out.println(format.parse("10,000,000"));
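A slightly fuller sketch of the same idea, under two assumptions: the text uses US-style grouping, so the locale is fixed instead of relying on the default, and the checked ParseException is turned into an unchecked one. The class and method names are invented here.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class GroupedNumberParser {
    // Parses a number that uses commas as thousands separators, e.g. "10,000,000".
    static long parseGrouped(String text) {
        // Locale.US is an assumption: in some other locales the comma is the decimal separator.
        NumberFormat format = NumberFormat.getNumberInstance(Locale.US);
        try {
            return format.parse(text).longValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseGrouped("10,000,000"));
    }
}
```

This only parses a string that starts with the number; you would still need a regex or similar to locate the number inside a longer sentence.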
Try the regex below to match comma-separated numbers:
\d{1,3}(,\d{3})+
Make the last part optional to also match numbers that contain no comma at all (that is, up to three digits):
\d{1,3}(,\d{3})*

Splitting line based on comma, strange line

I have the following line comma separated,
LanguageID=0,LastKnownPeriod="Active",c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
Using the split method I can get the comma-separated values, but the actual problem comes with text such as c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}, since it contains a comma itself.
So the tokens after splitting should be:
LanguageID=0
LastKnownPeriod="Active"
c_MultiPartyCall={Counter=1,TimeStamp=1394539271448} (comma is again found within the word)
LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"} (comma is again found within the word in curly brackets)
I tried with following code but didn't work:
String arr[]=input_line.split("(.*!{),(.*!})");
for (int i=0;i<arr.length;i++)
System.out.println(arr[i]);
Please advise.
Use regular expressions instead:
([\w_]+=(?:\{[\w=_,\[\]"\|:\.\s-]*\}))|([^,]+)
This will group the line into 4 sections:
LanguageID=0
LastKnownPeriod="Active"
c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}
LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
Code:
import java.util.regex.*;
public class JavaRegEx {
public static void main(String[] args) {
String line = "LanguageID=0,LastKnownPeriod=\"Active\",c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},LTH={Data=[\"1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||\",\"1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||\"}";
Pattern pattern = Pattern.compile("([\\w_]+=(?:\\{[\\w=_,\\[\\]\"\\|:\\.\\s-]*\\}))|([^,]+)");
Matcher matcher = pattern.matcher(line);
while(matcher.find())
System.out.println(matcher.group(0));
}
}
First, just splitting on a comma isn't how CSV works
a,b,"c,d"
has only three values, a, b, and c,d. I recommend using a CSV parser, like opencsv. CSV is not terribly complicated, but it isn't as simple as split by comma.
Second, your CSV data is invalid because you have a quote and a comma in a field that isn't quoted.
In other words, if you want the values a and b","c, then the CSV is
a,"b"",""c"
(Note that a quote inside a quoted field is escaped by doubling it.)
Otherwise, it is impossible to tell what fields you actually wanted. A CSV parser would choke on your data.
While it might be possible to do this by split(), it's much easier to match the actual tokens (where split() matches the delimiters between the tokens). Your tokens all consist of one or more of any characters other than comma or brace, optionally followed by a pair of braces enclosing some non-brace characters (which can include commas):
[^,{}]+(?:\{[^{}]+\})?
The Java code for that would be:
List<String> matchList = new ArrayList<String>();
Pattern p = Pattern.compile("[^,{}]+(?:\\{[^{}]+\\})?");
Matcher m = p.matcher(s);
while (m.find()) {
matchList.add(m.group());
}
But it looks like you can break it down further:
Pattern p = Pattern.compile("(\\w+)=([^,{}]+|\\{[^{}]+\\})");
Matcher m = p.matcher(TEST_STR);
while (m.find()) {
System.out.printf("%nname = %s%nvalue = %s%n",
m.group(1), m.group(2));
}
output:
name = LanguageID
value = 0
name = LastKnownPeriod
value = "Active"
name = c_MultiPartyCall
value = {Counter=1,TimeStamp=1394539271448}
name = LTH
value = {Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
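If you want the pairs in a data structure rather than printed, the same name/value pattern can feed a map. This is just a sketch; the class name KeyValueLine and the shortened sample line are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueLine {
    // name = word characters; value = either a run without commas and braces,
    // or one pair of braces enclosing non-brace characters (commas allowed).
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=([^,{}]+|\\{[^{}]+\\})");

    // Collects name/value pairs in the order they appear on the line.
    static Map<String, String> parse(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "LanguageID=0,LastKnownPeriod=\"Active\","
                + "c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}";
        for (Map.Entry<String, String> e : parse(line).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Nested braces are not supported by this pattern, and a repeated name simply overwrites the earlier value in the map.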

regex last word in a sentence ending with punctuation (period)

I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir#dur:~/NetBeansProjects/regex$
thufir#dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir#dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)$");
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
System.out.println(match);
}
}
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?
You can use a lookahead assertion. For example, to match the sentence without the period:
[\w\s]+(?=\.)
and, for just the last word (the word before the "."):
[\w]+(?=\.)
If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of whether the line ends at the end of a sentence or not, you can use:
\w+(?=(\.?$))
Or if you want to also include ,!;: etc then
\w+(?=(\p{Punct}?$))
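Although the question asks for the regex only, a quick Java check of the last pattern may help. This is a sketch with invented names (LastWord, lastWord), and the capturing group inside the lookahead is dropped because it is not needed here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWord {
    // A run of word characters that is followed by optional punctuation and the end of input.
    private static final Pattern LAST_WORD = Pattern.compile("\\w+(?=\\p{Punct}?$)");

    // Returns the last word without trailing punctuation, or null if there is none.
    static String lastWord(String sentence) {
        Matcher m = LAST_WORD.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi"));
        System.out.println(lastWord("a b cd efg hi."));
    }
}
```

Because the punctuation sits inside the lookahead, group() returns only the word in both cases.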
You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case). To say a little more, matcher.group(0) returns the full match. So your regex is almost correct. One possible improvement concerns your use of $, which matches only at the end of the input (or of each line, in multiline mode). Use it only if your sentence fills the line exactly!
With the regular expression (\w+)\p{Punct} you get a group count of 1, which means matcher.group(0) holds the word with its punctuation and matcher.group(1) holds the word without it.
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
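For illustration, a small sketch of the difference between group(0) and group(1) with this pattern. The class name and the choice to keep only the last match on the line are assumptions of the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBeforePunct {
    private static final Pattern WORD_PUNCT = Pattern.compile("(\\w+)\\p{Punct}");

    // Returns { group(0), group(1) } of the last match, or null if nothing matches.
    static String[] lastMatch(String sentence) {
        Matcher m = WORD_PUNCT.matcher(sentence);
        String[] result = null;
        while (m.find()) {
            // group(0) is the whole match, group(1) only the word inside the parentheses
            result = new String[] { m.group(0), m.group(1) };
        }
        return result;
    }

    public static void main(String[] args) {
        String[] groups = lastMatch("a b cd efg hi.");
        System.out.println("group(0) = " + groups[0]);
        System.out.println("group(1) = " + groups[1]);
    }
}
```

Unlike the lookahead variants, this pattern needs a punctuation character after the word, so a sentence without a final period gives no match.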
To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet
By using the $ operator you will only get a match at the end of a line. So if you have multiple sentences on one line you will not get a match in the middle one.
So you should just use:
(\w+)\.
the capture group will give the correct match.
You can see an example here
I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
}
System.out.println(match);
}
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?
