I need to extract tokens from a text using a regex. An example text looks like this:
data.orderType.`order.created.time`
Right now I'm using the following regex to tokenize this string.
`(.*?)`|[^.]+
This regex only partially does the job and gives these tokens:
data,orderType,`order.created.time`
The problem is that the backticks are included in the last token. How can I drop the backticks and get just the following?
data,orderType,order.created.time
You already capture the part between backticks, so just use matcher.group(1) when it participated in the match (that is, when it is not null):
Java demo:
String s = "data.orderType.`order.created.time`";
String regex = "`([^`]*)`|[^.`]+";
List<String> result = new ArrayList<>();
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
    if (m.group(1) != null) {
        // Group 1 matched: use the text between the backticks
        result.add(m.group(1));
    } else {
        // Plain token without backticks
        result.add(m.group());
    }
}
System.out.println(result);
// => [data, orderType, order.created.time]
Note that I also added a backtick to the negated character class, [^.`]+, because I assume backticks can only appear in pairs.
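To see what the paired-backtick assumption means in practice, here is a minimal sketch that wraps the same pattern in a helper. The class name BacktickTokenizer and the method tokenize are made up for illustration. With an unpaired backtick, the stray character is simply skipped rather than reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BacktickTokenizer {
    // Either a backtick-quoted chunk (captured without the backticks)
    // or a run of characters that are neither dots nor backticks.
    private static final Pattern TOKEN = Pattern.compile("`([^`]*)`|[^.`]+");

    // Hypothetical helper, not part of the original answer.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // group(1) is null when the second alternative matched
            tokens.add(m.group(1) != null ? m.group(1) : m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("data.orderType.`order.created.time`"));
        // Unpaired backtick: neither alternative can consume it, so it is skipped
        System.out.println(tokenize("a.`b"));
    }
}
```

If unpaired backticks need to be treated as an error, that check would have to be added separately, since find() silently moves past characters it cannot match.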
I am a PHP developer and it has been only a few months since I started learning Java. Here, I have a function in PHP to retrieve the hashtags from a string.
{
    preg_match_all('/(^|[^a-z0-9_])#([a-z0-9_]+)/i', $text, $matchedHashtags);
    $hashtag = '';
    if (!empty($matchedHashtags[0])) {
        foreach ($matchedHashtags[0] as $match) {
            $hashtag .= preg_replace("/[^a-z0-9]+/i", "", $match) . ',';
        }
    }
    return rtrim($hashtag, ',');
}
This function returns a new string containing the hashtags, separated by commas. My question is: how can I achieve exactly this in Java? Regards.
One option might be to use a single pattern, instead of first matching and then replacing.
Then you can either concatenate the results in a string, or add the results to a list and use String.join with a comma.
\B#\w*[A-Za-z]\w*
\B Match at a position where \b does not match
#\w*[A-Za-z]\w* Match # followed by word characters that include at least one letter A-Za-z (so that tags consisting only of digits or underscores are not matched, as requested in the comments)
For example
List<String> matches = new ArrayList<String>();
Matcher m = Pattern.compile("\\B#\\w*[A-Za-z]\\w*")
        .matcher("#test ##$#test a#test,#2021, #__");
while (m.find()) {
    matches.add(m.group());
}
System.out.println(String.join(",", matches));
Output
#test,#test
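To get closer to the PHP function, which returns the tags without the leading #, the tag text can be put in a capturing group. This is a sketch only: the class and method names are invented here, and unlike the PHP version it keeps underscores inside a tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagExtractor {
    // Same pattern as above, with a capturing group around the tag text.
    private static final Pattern HASHTAG = Pattern.compile("\\B#(\\w*[A-Za-z]\\w*)");

    // Hypothetical helper: returns the tags without '#', joined by commas.
    static String extractHashtags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            tags.add(m.group(1)); // group 1 is the tag without the '#'
        }
        return String.join(",", tags);
    }

    public static void main(String[] args) {
        // a#tag is skipped (the '#' follows a word character), #2021 has no letter
        System.out.println(extractHashtags("Learning #Java after #PHP, not a#tag or #2021"));
    }
}
```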
I'm trying to convert a URL string to lowercase while keeping certain patterns in it unchanged.
For example, for input like:
http://BLABLABLA?qUERY=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}
the expected output is the lowercased URL with the macros left as they were:
http://blablabla?query=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}
I tried to capture the strings using a regex but couldn't figure out a proper way to do the replacement, and replaceAll() doesn't seem to do the job either. Any hints, please?
It looks like you want to change any uppercase character which is not inside ${...} to its lowercase form.
With a construct like
Matcher matcher = ...
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
    String matchedPart = ...
    ...
    matcher.appendReplacement(buffer, replacement);
}
matcher.appendTail(buffer);
String result = buffer.toString();
or, since Java 9, with Matcher#replaceAll(Function<MatchResult,String> replacer), which lets us rewrite it as
String replaced = matcher.replaceAll(m -> {
    String matchedPart = m.group();
    ...
    return replacement;
});
you can build the replacement dynamically based on matchedPart.
So you can let your regex first try to match ${...} and, only where ${...} cannot match at the current position, let it match [A-Z]. While iterating over the matches you can decide, based on the match itself (for example its length, or whether it starts with $), whether to use its lowercase form or its original form as the replacement.
BTW, the regex engine lets us put $x (where x is a group number) or ${name} (where name is a named group) in the replacement to reuse those parts of the match. So if we want ${..} to appear literally in the replacement, we need to escape the $ as \$. To avoid doing that manually we can use Matcher.quoteReplacement.
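The effect of the escaping can be seen in isolation with a small sketch. The class and helper names are made up for illustration, and the input string is arbitrary. The unquoted call is rejected because ${MACRO} is read as a reference to a named group that the pattern does not define.

```java
import java.util.regex.Matcher;

public class QuoteReplacementDemo {
    // Hypothetical helper: inserts 'literal' verbatim for every match of 'regex'.
    static String replaceWithLiteral(String input, String regex, String literal) {
        // quoteReplacement escapes '\' and '$' so they lose their special meaning
        return input.replaceAll(regex, Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(replaceWithLiteral("x=VALUE", "VALUE", "${MACRO}"));

        try {
            // Unquoted: "${MACRO}" is parsed as a named-group reference
            "x=VALUE".replaceAll("VALUE", "${MACRO}");
        } catch (IllegalArgumentException e) {
            System.out.println("unquoted replacement rejected: " + e.getMessage());
        }
    }
}
```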
Demo:
String yourUrlString = "http://BLABLABLA?qUERY=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}";
Pattern p = Pattern.compile("\\$\\{[^}]+\\}|[A-Z]");
Matcher m = p.matcher(yourUrlString);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    String match = m.group();
    if (match.length() == 1) {
        // A single uppercase letter outside ${...}: lowercase it
        m.appendReplacement(sb, match.toLowerCase());
    } else {
        // A ${...} macro: keep it literally
        m.appendReplacement(sb, Matcher.quoteReplacement(match));
    }
}
m.appendTail(sb);
String replaced = sb.toString();
System.out.println(replaced);
or, since Java 9:
String replaced = Pattern.compile("\\$\\{[^}]+\\}|[A-Z]")
        .matcher(yourUrlString)
        .replaceAll(m -> {
            String match = m.group();
            if (match.length() == 1)
                return match.toLowerCase();
            else
                return Matcher.quoteReplacement(match);
        });
System.out.println(replaced);
Output: http://blablabla?query=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}
This regex will match all the characters before the first &macro, and put everything between http:// and the first &macro in its own group so you can modify it.
http://(.*?)&macro
UPDATE: If you don't want to use groups, this regex will match only the characters between http:// and the first &macro
(?<=http://)(.*?)(?=&macro)
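A rough sketch of how the lookaround version could be used from Java follows. It assumes the query parameters are literally named macro1, macro2 and so on, so that the first one starts with &macro. The class and method names are invented. The matched span is lowercased and the rest of the string is copied unchanged, which avoids any replacement-string escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LowercasePrefix {
    // Everything between "http://" and the first "&macro" (assumed parameter prefix)
    private static final Pattern BEFORE_MACROS =
            Pattern.compile("(?<=http://)(.*?)(?=&macro)");

    // Hypothetical helper: lowercases only the matched span.
    static String lowercaseBeforeMacros(String url) {
        Matcher m = BEFORE_MACROS.matcher(url);
        if (!m.find()) {
            return url; // no "http://...&macro" structure: leave the input alone
        }
        return url.substring(0, m.start())
                + m.group().toLowerCase()
                + url.substring(m.end());
    }

    public static void main(String[] args) {
        String url = "http://BLABLABLA?qUERY=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}";
        System.out.println(lowercaseBeforeMacros(url));
    }
}
```

Note that this only touches the part before the first macro parameter. Uppercase text between or after the macros is left alone, which the ${...}|[A-Z] approach handles.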
I have the following text: some unknown random stuff here: (ben, tim, sam toben, suzei)
I need to use regex only to pull out each of the items from the (unknown sized) comma separated list: ben, tim, sam toben, suzei into the matched groups. Leading/trailing whitespace doesn't matter.
I tried the following: (?:\(([^,]+)) but it will only pull out ben as a group.
Any ideas?
This isn't something you can do in the way you want with regexes only. It is, however, possible to do it using regexes and String.split().
Something like this:
String[] getList(String text) {
    Pattern p = Pattern.compile(".+\\((.+)\\)"); // Note the doubled backslashes
    Matcher m = p.matcher(text);
    if (m.find()) {
        return m.group(1).split(",");
    } else {
        return new String[0];
    }
}
What this does is grab the contents of the parentheses, then split those contents into an array on the comma character.
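If the surrounding whitespace should be dropped as well, a variant along these lines may help. It is a sketch with a slightly different pattern than the one above: it takes the first parenthesized group that contains no nested parentheses, and the class name is made up.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParenList {
    // First "(...)" group without nested parentheses
    private static final Pattern IN_PARENS = Pattern.compile("\\(([^()]*)\\)");

    static String[] getList(String text) {
        Matcher m = IN_PARENS.matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        String inner = m.group(1).trim();
        if (inner.isEmpty()) {
            return new String[0]; // "()" yields no items
        }
        // Split on commas and swallow the whitespace around them
        return inner.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String text = "some unknown random stuff here: (ben, tim, sam toben, suzei)";
        System.out.println(Arrays.toString(getList(text)));
    }
}
```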
I have the following comma-separated line:
LanguageID=0,LastKnownPeriod="Active",c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
Using the split method I can get the comma-separated values, but the actual problem comes with text such as c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}, since a comma is found within the value itself.
So the items after splitting should be:
LanguageID=0
LastKnownPeriod="Active"
c_MultiPartyCall={Counter=1,TimeStamp=1394539271448} (comma is again found within the word)
LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"} (comma is again found within the word in curly brackets)
I tried with following code but didn't work:
String arr[]=input_line.split("(.*!{),(.*!})");
for (int i = 0; i < arr.length; i++)
    System.out.println(arr[i]);
Please advise.
Use regular expressions instead:
([\w_]+=(?:\{[\w=_,\[\]"\|:\.\s-]*\}))|([^,]+)
This will group the line into 4 sections:
LanguageID=0
LastKnownPeriod="Active"
c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}
LTH={Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
Code:
import java.util.regex.*;
public class JavaRegEx {
    public static void main(String[] args) {
        String line = "LanguageID=0,LastKnownPeriod=\"Active\",c_MultiPartyCall={Counter=1,TimeStamp=1394539271448},LTH={Data=[\"1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||\",\"1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||\"}";
        Pattern pattern = Pattern.compile("([\\w_]+=(?:\\{[\\w=_,\\[\\]\"\\|:\\.\\s-]*\\}))|([^,]+)");
        Matcher matcher = pattern.matcher(line);
        // Print each whole match: either name={...} or a plain comma-free item
        while (matcher.find())
            System.out.println(matcher.group(0));
    }
}
First, just splitting on a comma isn't how CSV works:
a,b,"c,d"
has only three values, a, b, and c,d. I recommend using a CSV parser, like opencsv. CSV is not terribly complicated, but it isn't as simple as split by comma.
Second, your CSV data is invalid because you have a quote and a comma in a field that isn't quoted.
In other words, if you want the values a and b","c, then the CSV is
a,"b"",""c"
(Note that quotes inside a quoted field are escaped by doubling them.)
Otherwise, it is impossible to tell what fields you actually wanted. A CSV parser would choke on your data.
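To illustrate the quoting rules described above, here is a minimal hand-rolled sketch for a single line. It is not the API of opencsv or any other library, the names are invented, and it ignores things a real parser handles (embedded newlines, malformed input).

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineSketch {
    // Hypothetical helper: splits one CSV line, honouring "..." and doubled quotes.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote means one literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // commas are ordinary inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;           // opening quote
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());      // last field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("a,b,\"c,d\""));          // three fields
        System.out.println(parseLine("a,\"b\"\",\"\"c\""));    // two fields
    }
}
```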
While it might be possible to do this by split(), it's much easier to match the actual tokens (where split() matches the delimiters between the tokens). Your tokens all consist of one or more of any characters other than comma or brace, optionally followed by a pair of braces enclosing some non-brace characters (which can include commas):
[^,{}]+(?:\{[^{}]+\})?
The Java code for that would be:
List<String> matchList = new ArrayList<String>();
Pattern p = Pattern.compile("[^,{}]+(?:\\{[^{}]+\\})?");
Matcher m = p.matcher(s);
while (m.find()) {
    matchList.add(m.group());
}
But it looks like you can break it down further:
Pattern p = Pattern.compile("(\\w+)=([^,{}]+|\\{[^{}]+\\})");
Matcher m = p.matcher(TEST_STR);
while (m.find()) {
    System.out.printf("%nname = %s%nvalue = %s%n",
            m.group(1), m.group(2));
}
output:
name = LanguageID
value = 0
name = LastKnownPeriod
value = "Active"
name = c_MultiPartyCall
value = {Counter=1,TimeStamp=1394539271448}
name = LTH
value = {Data=["1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||","1|MTC|01.01.1970 15:00:00|0.0|7|-1|OnPeakAccountID|0|1000||"}
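If the pairs are needed later by name, the same name/value pattern can feed a map. This is a sketch with an invented class name and a shortened sample line. A LinkedHashMap is used only to keep the original order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueLine {
    // name = either a plain value (no comma or brace) or one {...} block
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=([^,{}]+|\\{[^{}]+\\})");

    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "LanguageID=0,LastKnownPeriod=\"Active\","
                + "c_MultiPartyCall={Counter=1,TimeStamp=1394539271448}";
        parse(line).forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```

Pairs nested inside braces (Counter=1 and so on) are not visited separately, because the matcher resumes after the closing brace of the enclosing value.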
I have a question about Java regular expressions, which I am new to.
I need help forming a regex for the statement below:
Statement: a-alphanumeric&b-digits&c-digits
Possible matching Examples: 1) a-90485jlkerj&b-34534534&c-643546
2) A-RT7456ffgt&B-86763454&C-684241
Use case: First I have to validate the input string against the regular expression. If the input string matches, I then have to extract the a value, b value and c value, for example
90485jlkerj, 34534534 and 643546 respectively.
Could someone please share how I can achieve this in the best possible way?
I really appreciate your help on this.
You can use this pattern:
^(?i)a-([0-9a-z]++)&b-([0-9]++)&c-([0-9]++)$
If what you are trying to match is not the whole string, just remove the anchors:
(?i)a-([0-9a-z]++)&b-([0-9]++)&c-([0-9]++)
Explanations:
(?i) makes the pattern case-insensitive
[0-9]++ a digit, one or more times (possessive)
[0-9a-z]++ the same, but for digits and letters
^ anchor for the start of the string
$ anchor for the end of the string
The parentheses in the two patterns are capturing groups (they catch the parts you want)
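Put together in Java, validation and extraction could look roughly like this. The class and method names are made up. Because matches() already requires the whole input to match, the anchors are left out.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AbcParser {
    private static final Pattern FORMAT =
            Pattern.compile("(?i)a-([0-9a-z]++)&b-([0-9]++)&c-([0-9]++)");

    // Hypothetical helper: returns {a, b, c}, or null if the input is invalid.
    static String[] parse(String input) {
        Matcher m = FORMAT.matcher(input);
        if (!m.matches()) {
            return null; // validation failed
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("a-90485jlkerj&b-34534534&c-643546")));
        System.out.println(Arrays.toString(parse("A-RT7456ffgt&B-86763454&C-684241")));
        System.out.println(Arrays.toString(parse("a-abc&b-12x&c-1"))); // invalid b value
    }
}
```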
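Given a string with the format a-XXX&b-XXX&c-XXX, you can extract all XXX parts in one simple line:
String[] parts = str.replaceAll("(?i)[abc]-", "").split("&");
parts will be an array with 3 elements, the target strings you want. The (?i) flag makes the prefixes case-insensitive, so the A-/B-/C- example works too. Note that this line does not validate the input.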
The simplest regex that matches your string (written here as a Java string literal, hence the doubled backslashes) is:
^(?i)a-([\\da-z]+)&b-(\\d+)&c-(\\d+)
Your target strings are then in groups 1, 2 and 3, but you need a lot of code around that to get the strings out, which, as shown above, is not necessary.
The following code may help:
String[] texts = new String[]{"a-90485jlkerj&b-34534534&c-643546", "A-RT7456ffgt&B-86763454&C-684241"};
Pattern full = Pattern.compile("^(?i)a-([\\da-z]+)&b-(\\d+)&c-(\\d+)");
Pattern patternA = Pattern.compile("(?i)([\\da-z]+)&[bc]");
Pattern patternB = Pattern.compile("(\\d+)");
for (String text : texts) {
    if (full.matcher(text).matches()) {
        // Splitting on "-" leaves the values with a trailing "&b" or "&c"
        for (String part : text.split("-")) {
            Matcher m = patternA.matcher(part);
            if (m.matches()) {
                System.out.println(part.substring(m.start(), m.end()).split("&")[0]);
            }
            m = patternB.matcher(part);
            if (m.matches()) {
                System.out.println(part.substring(m.start(), m.end()));
            }
        }
    }
}