java: regex for string - java

I'm trying to use a regex to split a string into fields, but it doesn't work completely and skips some parts that should be split. Here is the part of the program that processes the string:
void parser(String s) {
    String REG1 = "(',\\d)|(',')|(\\d,')|(\\d,\\d)";
    Pattern p1 = Pattern.compile(REG1);
    Matcher m1 = p1.matcher(s);
    int counter = 0; // field counter (declaration was missing from the snippet)
    int end = 0;     // end index of the previous match
    while (m1.find()) {
        System.out.println(counter + ": " + s.substring(end, m1.end() - 1) + " " + end + " " + m1.end());
        end = m1.end();
        counter++;
    }
}
The string is:
s= 3101,'12HQ18U0109','11YX27X0041','XX21','SHV7-P Hig, Hig','','GW1','MON','E','A','ASEXPORT-1',1,101,0,'0','1500','V','','',0,'mb-master1'
and the problem is that it doesn't split ,1, or ,0,
Rules for parsing are: String is enclosed by ,' ', for example ,'ASEXPORT-1',
int is enclosed only by , ,
expected output =
3101 | 12HQ18U0109 | 11YX27X0041 | XX21 | SHV7-P Hig, Hig| |GW1 |MON |E | A| ASEXPORT-1| 1 |101 |0 | 0 |1500 | V| | | 0 |mb-master1
Altogether 21 elements.

You can split it with this regex
,(?=([^']*'[^']*')*[^']*$)
It splits at a , only if an even number of ' characters follow it, which means the comma is outside any quoted field.
So for
3101,'12HQ18,U0109','11YX27X0041'
output would be
3101
'12HQ18,U0109'
'11YX27X0041'
Note: this won't work for nested quotes such as 'hello 'h,i'world'. If such cases can occur, use the following regex instead:
(?<='),(?=')|(?<=\d),(?=\d|')|(?<=\d|'),(?=\d)
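As a minimal sketch, the first regex can be applied to the string from the question like this. The class name QuoteAwareSplit and the helper splitFields are made up for illustration, and stripping the surrounding quotes is an assumption based on the expected output in the question:

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplit {
    // Split at commas followed by an even number of single quotes,
    // i.e. commas that sit outside any '...' field.
    private static final String SEPARATOR = ",(?=([^']*'[^']*')*[^']*$)";

    // Hypothetical helper: returns the fields with surrounding quotes removed.
    static List<String> splitFields(String line) {
        List<String> fields = new ArrayList<>();
        for (String raw : line.split(SEPARATOR, -1)) {
            // Strip one leading and one trailing quote, if present.
            fields.add(raw.replaceAll("^'|'$", ""));
        }
        return fields;
    }

    public static void main(String[] args) {
        String s = "3101,'12HQ18U0109','11YX27X0041','XX21','SHV7-P Hig, Hig','','GW1','MON',"
                + "'E','A','ASEXPORT-1',1,101,0,'0','1500','V','','',0,'mb-master1'";
        List<String> fields = splitFields(s);
        for (int i = 0; i < fields.size(); i++) {
            System.out.println((i + 1) + ": " + fields.get(i));
        }
    }
}
```

The lookahead rescans the rest of the line for every comma, so this is best kept to short lines like the one in the question.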

If you also (for some bizarre reason) need to know each match's start and end index in the original string (as in your sample output), you can use the following pattern:
String regex = "('[^']*'|\\d+)";
which would match an unquoted integer or a single-quoted string.
You can optionally remove the leading and trailing ' using a "second-pass" on the matching substring:
match = match.replaceAll("\\A'|'\\Z", "");
which replaces a leading and trailing ' with nothing.
The code could look like this:
Pattern pat = Pattern.compile("('[^']*'|\\d+)");
Matcher m = pat.matcher(str);
int counter = 0;
while (m.find()) {
    String match = m.group(1);
    // Take the indices straight from the matcher instead of tracking them by hand
    int start = m.start();
    int end = m.end();
    match = match.replaceAll("\\A'|'\\Z", ""); // <-- comment out to keep the
                                               //     leading and trailing quotes
    System.out.format("%d: %s [%d - %d]%n", ++counter, match, start, end);
}
See, also, this short demo.

Related

Removing whitespaces at the beginning of the string with Regex gives null Java

I would like to get groups from a string that is loaded from txt file. This file looks something like this (notice the space at the beginning of file):
as431431af,87546,3214| 5a341fafaf,3365,54465 | 6adrT43 , 5678 , 5655
The first part of each record, up to the first comma, can contain digits and letters; the second and third parts contain only digits. After each | the pattern repeats.
First, I load the txt file into a string:
String readFile3 = readFromTxtFile("/resources/file.txt");
Then I remove all whitespace with a regex:
String no_whitespace = readFile3.replaceAll("\\s+", "");
After that I try to get the groups:
Pattern p = Pattern.compile("[a-zA-Z0-9]*,\\d*,\\d*", Pattern.MULTILINE);
Matcher m = p.matcher(no_whitespace);
int lastMatchPos = 0;
while (m.find()) {
    System.out.println(m.group());
    lastMatchPos = m.end();
}
if (lastMatchPos != no_whitespace.length())
    System.out.println("Invalid string!");
Now I would like to remove the "," from each group and assign every value to its own variable, but these are the groups I get (notice the null):
nullas431431af,87546,3214
5a341fafaf,3365,54465
6adrT43,5678,5655
What am I doing wrong? Even when I physically remove the space from the beginning of the txt file, the result is the same.
Is there an easier way to get the groups in this string with a regex and assign each part between the commas to its own variable?
You can split on | surrounded by optional whitespace and then split the resulting items on , surrounded by optional whitespace:
String str = "as431431af,87546,3214| 5a341fafaf,3365,54465 | 6adrT43 , 5678 , 5655";
String[] items = str.split("\\s*\\|\\s*");
List<String[]> res = new ArrayList<>();
for (String i : items) {
    String[] parts = i.split("\\s*,\\s*");
    res.add(parts);
    System.out.println(parts[0] + " - " + parts[1] + " - " + parts[2]);
}
See the Java demo printing
as431431af - 87546 - 3214
5a341fafaf - 3365 - 54465
6adrT43 - 5678 - 5655
The results are in the res list.
Note that
\s* - matches zero or more whitespaces
\| - matches a pipe char
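If each value should end up in its own variable, the two split calls above can feed a small holder class. The sketch below assumes a hypothetical Entry class with made-up field names (code, first, second), and it assumes the two numeric parts fit into an int; none of that comes from the question:

```java
import java.util.ArrayList;
import java.util.List;

class EntryParser {
    // Hypothetical holder for one "code,number,number" record.
    static final class Entry {
        final String code;
        final int first;
        final int second;

        Entry(String code, int first, int second) {
            this.code = code;
            this.first = first;
            this.second = second;
        }
    }

    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        // trim() drops whitespace around the whole input, such as the
        // leading space mentioned in the question.
        for (String item : text.trim().split("\\s*\\|\\s*")) {
            String[] parts = item.split("\\s*,\\s*");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected 3 fields: " + item);
            }
            entries.add(new Entry(parts[0],
                    Integer.parseInt(parts[1]),
                    Integer.parseInt(parts[2])));
        }
        return entries;
    }

    public static void main(String[] args) {
        String text = " as431431af,87546,3214| 5a341fafaf,3365,54465 | 6adrT43 , 5678 , 5655";
        for (Entry e : parse(text)) {
            System.out.println(e.code + " - " + e.first + " - " + e.second);
        }
    }
}
```

Throwing on a record with the wrong number of fields is one possible design choice; collecting the bad records instead would work just as well.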
The pattern that you tried uses only the optional quantifier *, so it could also match a string consisting of commas alone.
You also don't need Pattern.MULTILINE as there are no anchors in the pattern.
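A quick way to see the difference between the two quantifiers is to test both patterns against a string of bare commas. The class name below is illustrative only:

```java
import java.util.regex.Pattern;

class QuantifierCheck {
    // Pattern from the question: every part is optional.
    static final Pattern STAR = Pattern.compile("[a-zA-Z0-9]*,\\d*,\\d*");
    // Same shape, but each part needs at least one character.
    static final Pattern PLUS = Pattern.compile("[a-zA-Z0-9]+,\\d+,\\d+");

    public static void main(String[] args) {
        String onlyCommas = ",,";
        System.out.println(STAR.matcher(onlyCommas).matches());          // true
        System.out.println(PLUS.matcher(onlyCommas).matches());          // false
        System.out.println(PLUS.matcher("6adrT43,5678,5655").matches()); // true
    }
}
```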
You can use 3 capture groups with + as the quantifier to match one or more occurrences, and after each record either match a pipe | or assert the end of the string with $:
([a-zA-Z0-9]+),([0-9]+),([0-9]+)(?:\||$)
For example
String readFile3 = "as431431af,87546,3214| 5a341fafaf,3365,54465 | 6adrT43 , 5678 , 5655";
String no_whitespace = readFile3.replaceAll("\\s+", "");
Pattern p = Pattern.compile("([a-zA-Z0-9]+),([0-9]+),([0-9]+)(?:\\||$)");
Matcher matcher = p.matcher(no_whitespace);
while (matcher.find()) {
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println(matcher.group(i));
    }
    System.out.println("--------------------------------");
}
Output
as431431af
87546
3214
--------------------------------
5a341fafaf
3365
54465
--------------------------------
6adrT43
5678
5655
--------------------------------

Regex to replace comments with number of new lines

I want to replace all Java-style comments (/* */) with the number of new lines for that comment. So far, I can only come up with something that replaces comments with an empty string
String.replaceAll("/\\*[\\s\\S]*?\\*/", "")
Is it possible to replace the matching regexes instead with the number of new lines it contains? If this is not possible with just regex matching, what's the best way for it to be done?
For example,
/* This comment
has 2 new lines
contained within */
will be replaced with a string of just 2 new lines.
Since Java supports the \G construct, just do it all in one go.
Use a global regex replace function.
Find
"/(?:\\/\\*(?=[\\S\\s]*?\\*\\/)|(?<!\\*\\/)(?!^)\\G)(?:(?!\\r?\\n|\\*\\/).)*((?:\\r?\\n)?)(?:\\*\\/)?/"
Replace
"$1"
https://regex101.com/r/l1VraO/1
Expanded
(?:
/ \*
(?= [\S\s]*? \* / )
|
(?<! \* / )
(?! ^ )
\G
)
(?:
(?! \r? \n | \* / )
.
)*
( # (1 start)
(?: \r? \n )?
) # (1 end)
(?: \* / )?
==================================================
==================================================
If you ever need to handle comment delimiters that appear inside quoted strings, like this:
String comment = "/* this is a comment*/"
here is an extended regex that parses the quoted string as well as the comment.
It is still a single regex, applied once in a global find / replace.
Find
"/(\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\")|(?:\\/\\*(?=[\\S\\s]*?\\*\\/)|(?<!\")(?<!\\*\\/)(?!^)\\G)(?:(?!\\r?\\n|\\*\\/).)*((?:\\r?\\n)?)(?:\\*\\/)?/"
Replace
"$1$2"
https://regex101.com/r/tUwuAI/1
Expanded
( # (1 start)
"
[^"\\]*
(?:
\\ [\S\s]
[^"\\]*
)*
"
) # (1 end)
|
(?:
/ \*
(?= [\S\s]*? \* / )
|
(?<! " )
(?<! \* / )
(?! ^ )
\G
)
(?:
(?! \r? \n | \* / )
.
)*
( # (2 start)
(?: \r? \n )?
) # (2 end)
(?: \* / )?
You can do it with a regex "replacement loop".
Most easily done in Java 9+:
String result = Pattern.compile("/\\*(?:[^*]++|\\*(?!/))*+\\*/").matcher(input)
.replaceAll(r -> r.group().replaceAll(".*", ""));
The main regex has been optimized for performance. The lambda has not been optimized.
For all Java versions:
Matcher m = Pattern.compile("/\\*(?:[^*]++|\\*(?!/))*+\\*/").matcher(input);
StringBuffer buf = new StringBuffer();
while (m.find())
    m.appendReplacement(buf, m.group().replaceAll(".*", ""));
String result = m.appendTail(buf).toString();
Test
final String input = "Line 1\n"
+ "/* Inline comment */\n"
+ "Line 3\n"
+ "/* One-line\n"
+ " comment */\n"
+ "Line 6\n"
+ "/* This\n"
+ " comment\n"
+ " has\n"
+ " 4\n"
+ " lines */\n"
+ "Line 12";
Matcher m = Pattern.compile("(?s)/\\*(?:[^*]++|\\*(?!/))*+\\*/").matcher(input);
String result = m.replaceAll(r -> r.group().replaceAll(".*", ""));
// Show input/result side-by-side
String[] inLines = input.split("\n", -1);
String[] resLines = result.split("\n", -1);
int lineCount = Math.max(inLines.length, resLines.length);
System.out.println("input |result");
System.out.println("-------------------------+-------------------------");
for (int i = 0; i < lineCount; i++) {
    System.out.printf("%-25s|%s%n", (i < inLines.length ? inLines[i] : ""),
                                    (i < resLines.length ? resLines[i] : ""));
}
Output
input                    |result
-------------------------+-------------------------
Line 1                   |Line 1
/* Inline comment */     |
Line 3                   |Line 3
/* One-line              |
 comment */              |
Line 6                   |Line 6
/* This                  |
 comment                 |
 has                     |
 4                       |
 lines */                |
Line 12                  |Line 12
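A variant that copies only the line breaks out of each comment, rather than blanking every line with a nested replaceAll, might look like the following sketch. The class and method names are illustrative, and treating \r\n, \n and \r each as one line break is an assumption about the input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentBlanker {
    // Same block-comment pattern as in the answer above.
    private static final Pattern COMMENT =
            Pattern.compile("/\\*(?:[^*]++|\\*(?!/))*+\\*/");
    private static final Pattern LINE_BREAK = Pattern.compile("\r\n|\n|\r");

    static String blankComments(String source) {
        Matcher m = COMMENT.matcher(source);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the code between the previous comment and this one.
            out.append(source, last, m.start());
            // Keep only the line breaks that were inside the comment.
            Matcher lb = LINE_BREAK.matcher(m.group());
            while (lb.find()) {
                out.append(lb.group());
            }
            last = m.end();
        }
        out.append(source, last, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "Line 1\n/* This comment\nhas 2 new lines\ncontained within */\nLine 5";
        System.out.println(blankComments(input));
    }
}
```

Because nothing is passed through a replacement string, there is no need to worry about $ or \ characters in the input. Like the answer above, this does not recognise comment delimiters inside string literals.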
Maybe this expression,
\/\*.*?\*\/
in s (DOTALL) mode might be close to what you have in mind.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class re {
    public static void main(String[] args) {
        final String regex = "\\/\\*.*?\\*\\/";
        final String string = "/* This comment\n"
                + "has 2 new lines\n"
                + "contained within */\n\n"
                + "Some codes here 1\n\n"
                + "/* This comment\n"
                + "has 2 new lines\n"
                + "contained within \n"
                + "*/\n\n\n"
                + "Some codes here 2";
        final String subst = "\n\n";
        final Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
        final Matcher matcher = pattern.matcher(string);
        final String result = matcher.replaceAll(subst);
        System.out.println(result);
    }
}
Output
Some codes here 1
Some codes here 2
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.

How to delimit both "=" and "==" in Java when reading

I want to be able to output both "==" and "=" as tokens.
For example, the input text file is:
biscuit==cookie apple=fruit+-()
The output:
biscuit
=
=
cookie
apple
=
fruit
+
-
(
)
What I want the output to be:
biscuit
==
cookie
apple
=
fruit
+
-
(
)
Here is my code:
Scanner s = null;
try {
    s = new Scanner(new BufferedReader(new FileReader("input.txt")));
    s.useDelimiter("\\s|(?<=\\p{Punct})|(?=\\p{Punct})");
    while (s.hasNext()) {
        String next = s.next();
        System.out.println(next);
    }
} finally {
    if (s != null) {
        s.close();
    }
}
Thank you.
Edit: I want to be able to keep the current regex.
Just split the input string using the regex below.
String s = "biscuit==cookie apple=fruit";
String[] tok = s.split("\\s+|\\b(?==+)|(?<==)(?!=)");
System.out.println(Arrays.toString(tok));
Output:
[biscuit, ==, cookie, apple, =, fruit]
Explanation:
\\s+ Matches one or more space characters.
| OR
\\b(?==+) Matches a word boundary only if it's followed by one or more = symbols.
| OR
(?<==) Lookbehind asserting that the position is preceded by a = symbol.
(?!=) And match that boundary only if it's not followed by a = symbol.
Update:
String s = "biscuit==cookie apple=fruit+-()";
String[] tok = s.split("\\s+|(?<!=)(?==+)|(?<==)(?!=)|(?=[+()-])");
System.out.println(Arrays.toString(tok));
Output:
[biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
You might be able to qualify those punctuations with some additional assertions.
# "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))"
\s
| (?<= == )
| (?<= \p{Punct} )
(?!
(?<= = )
(?= = )
)
| (?= \p{Punct} )
(?!
(?<= = )
(?= = )
)
Info update
If some characters aren't covered in \p{Punct} just add them as a separate class within
the punctuation subexpressions.
For engines that don't do certain properties well inside classes, use this ->
# Raw: \s|(?<===)|(?<=\p{Punct}|[=+])(?!(?<==)(?==))|(?=\p{Punct}|[=+])(?!(?<==)(?==))
\s
| (?<= == )
| (?<= \p{Punct} | [=+] )
(?!
(?<= = )
(?= = )
)
| (?= \p{Punct} | [=+] )
(?!
(?<= = )
(?= = )
)
For engines that handle properties well inside classes, this is a better one ->
# Raw: \s|(?<===)|(?<=[\p{Punct}=+])(?!(?<==)(?==))|(?=[\p{Punct}=+])(?!(?<==)(?==))
\s
| (?<= == )
| (?<= [\p{Punct}=+] )
(?!
(?<= = )
(?= = )
)
| (?= [\p{Punct}=+] )
(?!
(?<= = )
(?= = )
)
In other words, you want to split on:
one or more whitespace characters
a position that has = after it and a non-= before it (like foo|= where | represents the position)
a position that has = before it and a non-= after it (like =|foo where | represents the position)
which gives:
s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
//              ^^^^ ^^^^^^^^^^^ ^^^^^^^^^^^
//cases:         1)       2)          3)
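To try this delimiter without a file, a Scanner can read from a String instead. The following is a small sketch with illustrative class and method names; it only exercises the three cases listed above, so characters such as + or ( are not split off:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class EqualsTokens {
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        // try-with-resources closes the Scanner when done
        try (Scanner s = new Scanner(text)) {
            // 1) whitespace, 2) before a run of '=', 3) after a run of '='
            s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
            while (s.hasNext()) {
                result.add(s.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("biscuit==cookie apple=fruit"));
    }
}
```

Since cases 2) and 3) only fire at the edges of a run of = characters, == stays together as one token.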
Since it looks like you are building a parser, I would suggest using a tool that lets you define a proper grammar, such as http://www.antlr.org/. But if you must stick with regex, another improvement that makes the regex easier to write is to use Matcher#find instead of a Scanner delimiter. That way your regex and code could look like this:
String data = "biscuit==cookie apple=fruit+-()";
String regex = "<=|==|>=|[\\Q<>+-=()\\E]|[^\\Q<>+-=()\\E]+";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group());
Output:
biscuit
==
cookie apple
=
fruit
+
-
(
)
You can make this regex more general by using
String regex = "<=|==|>=|\\p{Punct}|\\P{Punct}+";
//                       ^^^^^^^^^^ ^^^^^^^^^^^-- standard cases
//              ^^ ^^ ^^------------------------- special cases
This approach also requires reading the data from the file first and storing it in a single String, which you then parse. You can find many ways to read text from a file, for instance in this question:
Reading a plain text file in Java
so you can use something like
String data = new String(Files.readAllBytes(Paths.get("input.txt")));
You can specify the encoding the String should use when decoding the bytes with the constructor String(bytes, encoding). So you can write new String(bytes, "UTF-8") or, to avoid typos in the encoding name, use one of the constants in the StandardCharsets class, as in new String(bytes, StandardCharsets.UTF_8).
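Putting the Matcher#find idea together, a tokenizer could be sketched as below. It is a variation on the general regex above, not the answer's exact pattern: the catch-all alternative also excludes whitespace, on the assumption that cookie and apple should be separate tokens as in the question's desired output. The class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OperatorTokenizer {
    // Two-character operators are listed first so they win over the
    // single-character punctuation alternative.
    private static final Pattern TOKEN =
            Pattern.compile("<=|==|>=|\\p{Punct}|[^\\p{Punct}\\s]+");

    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        while (m.find()) { // whitespace matches no alternative and is skipped
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("biscuit==cookie apple=fruit+-()")) {
            System.out.println(token);
        }
    }
}
```

Adding another multi-character operator, for example !=, only means adding one more alternative in front of \\p{Punct}.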
(?===)|(?<===)|\s|(?<!=)(?==)|(?<==)(?!=)|(?=\p{P})|(?<=\p{P})|(?=\+)
You can try this. See the demo:
http://regex101.com/r/wQ1oW3/18

regex, string to float number, get rid off other chars

With Strings like 123.456mm I would like to get one String with the number and another with the unit of measurement. So in the above case, one String with 123.456 and the other String with mm. So far I have this:
String str = "123.456mm";
String length = str.replaceAll("[\\D|\\.*]+","");
String lengthMeasurement = str.replaceAll("[\\W\\d]+","");
println(length, lengthMeasurement);
The output is:
123456 mm
The dot is gone and I can't get it back.
How can I keep the dots?
You can use:
String str = "123.456mm";
String length = str.replaceAll("[^\\d.]+",""); // 123.456
String lengthMeasurement = str.replaceAll("[\\d.]+",""); // mm
Try,
String str = "123.456mm";
String str1 = str.replaceAll("[a-zA-Z]", "");
String str2 = str.replaceAll("\\d|\\.", "");
System.out.println(str1);
System.out.println(str2);
Output:
123.456
mm
Try with Pattern and Matcher using below regex and get the matched group from index 1 and 2.
(\d+\.?\d*)(\D+)
Try below sample code:
String str = "123.456mm";
Pattern p = Pattern.compile("(\\d+\\.?\\d*)(\\D+)");
Matcher m = p.matcher(str);
if (m.find()) {
    System.out.println("Length: " + m.group(1));
    System.out.println("Measurement : " + m.group(2));
}
output:
Length: 123.456
Measurement : mm
Pattern description:
( group and capture to \1:
\d+ digits (0-9) (1 or more times)
\.? '.' (optional (0 or 1 time))
\d* digits (0-9) (0 or more times)
) end of \1
( group and capture to \2:
\D+ non-digits (all but 0-9) (1 or more times)
) end of \2
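Since the question is ultimately about getting a float out of the string, the two groups can be wrapped in a small helper and the first one handed to Double.parseDouble. The class and method names here are hypothetical, and returning null for non-matching input is just one possible convention:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MeasurementParser {
    // Same pattern as above: group 1 is the number, group 2 the unit.
    private static final Pattern MEASUREMENT = Pattern.compile("(\\d+\\.?\\d*)(\\D+)");

    // Returns {number, unit}, or null if no number followed by a unit is found.
    static String[] split(String text) {
        Matcher m = MEASUREMENT.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("123.456mm");
        if (parts != null) {
            double value = Double.parseDouble(parts[0]);
            System.out.println(value + " " + parts[1]);
        }
    }
}
```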

Java String tokens

I have a string line
String user_name = "id=123 user=aron name=aron app=application";
and I have a list that contains: {user,cuser,suser}
And I have to get the user part from the string. So I have code like this:
List<String> userName = Config.getConfig().getList(Configuration.ATT_CEF_USER_NAME);
String result = null;
for (String param : user_name.split("\\s", 0)) {
    for (String user : userName) {
        String userParam = user.concat("=.*");
        if (param.matches(userParam)) {
            result = param.split("=")[1];
        }
    }
}
But the problem is that if the user value in the string contains spaces, it does not work.
For example:
String user_name = "id=123 user=aron nicols name=aron app=application";
Here user has the value aron nicols, which contains a space. How can I write code that gets me the exact user value, i.e. aron nicols?
If you want to split only on spaces that come right before a key=... token such as user=..., then maybe add a look-ahead condition like
split("\\s(?=\\S*=)")
This regex will split on
\\s space
(?=\\S*=) which is followed by zero or more (*) non-space (\\S) characters ending with =. A look-ahead (?=...) is also a zero-length match, which means the part it matches is not consumed by split and stays in the result.
Demo:
String user_name = "id=123 user=aron nicols name=aron app=application";
for (String s : user_name.split("\\s(?=\\S*=)"))
System.out.println(s);
output:
id=123
user=aron nicols
name=aron
app=application
From your comment on another answer it seems that a = escaped with \ shouldn't be treated as the separator between key and value but as part of the value. In that case you can add a negative look-behind to check that there is no \ before the =: placing (?<!\\\\) right before the = requires that it is not preceded by \.
BTW, to create a regex that matches a literal \ we need to write it as \\, but in a Java String literal each \ must itself be escaped, which is why we end up with \\\\.
So you can use
split("\\s(?=\\S*(?<!\\\\)=)")
Demo:
String user_name = "user=Dist\\=Name1, xyz src=activedirectorydomain ip=10.1.77.24";
for (String s : user_name.split("\\s(?=\\S*(?<!\\\\)=)"))
System.out.println(s);
output:
user=Dist\=Name1, xyz
src=activedirectorydomain
ip=10.1.77.24
Do it like this:
First split input string using this regex:
" +(?=\\w+(?<!\\\\)=)"
This will give you 4 name=value tokens like this:
id=123
user=aron nicols
name=aron
app=application
Now you can just split on = to get your name and value parts.
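Both steps together could look like the following sketch. The class name, the parse helper, and the use of a LinkedHashMap are illustrative choices, and the list of candidate keys is taken from the question:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyValueLine {
    static Map<String, String> parse(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        // Step 1: split only at spaces that are directly followed by "key=".
        for (String token : line.split(" +(?=\\w+(?<!\\\\)=)")) {
            // Step 2: the limit of 2 keeps any further '=' inside the value.
            String[] kv = token.split("=", 2);
            if (kv.length == 2) {
                pairs.put(kv[0], kv[1]);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> pairs =
                parse("id=123 user=aron nicols name=aron app=application");
        // Try the candidate keys from the question in order.
        for (String key : new String[] { "user", "cuser", "suser" }) {
            if (pairs.containsKey(key)) {
                System.out.println(pairs.get(key));
                break;
            }
        }
    }
}
```

Splitting with a limit of 2 means a value such as Dist\=Name1 is kept whole instead of being cut at its escaped =.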
CODE FISH, this simple regex captures the user in Group 1: user=\s*(.*?)\s+name=
It will capture "Aron", "Aron Nichols", "Aron Nichols The Benevolent", and so on.
It relies on the knowledge that name= always follows user=
However, if you're not sure that the token following user is name, you can use this:
user=\s*(.*?)(?=$|\s+\w+=)
Here is how to use the second expression (for the first, just change the string in Pattern.compile):
String ResultString = null;
try {
    Pattern regex = Pattern.compile("user=\\s*(.*?)(?=$|\\s+\\w+=)", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    Matcher regexMatcher = regex.matcher(subjectString);
    if (regexMatcher.find()) {
        ResultString = regexMatcher.group(1);
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
