Using a variable in a \x hex regular expression pattern


I am trying to split a String on a byte value, for example "first\x00second" on the 0x00 delimiter.
I found that I cannot combine the \x token with a variable.
static public ArrayList split_by_byte(String value, byte spliter) {
if (spliter < 0)
throw new IllegalArgumentException("Negative delimiter value: " + spliter);
ArrayList<String> result = new ArrayList();
String[] groups = value.split("[\\x" + spliter + "]");
for (String group : groups) {
result.add(group);
}
return result;
}
How can I use a variable's value in patterns like \xNN?

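The \x escape is interpreted by the regex engine, not by the Java compiler, so it must reach the engine as \x followed by exactly two hexadecimal digits.
Concatenating the byte appends its decimal value, which is not what \xNN expects.
Format the value as two hex digits before building the pattern, or quote the delimiter character itself instead of using an escape.
You can also use the Pattern and Matcher classes, or the split method.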
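One way to build the \xNN escape from a variable is to format the byte as two hex digits. The following is a minimal sketch under that assumption; the class and method names are illustrative and not from the original post, and it keeps the question's restriction to non-negative bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitByByte {
    // Splits value on the character whose code equals the given byte.
    static List<String> splitByByte(String value, byte splitter) {
        if (splitter < 0) {
            throw new IllegalArgumentException("Negative delimiter value: " + splitter);
        }
        // %02x always yields two hex digits, e.g. 0 -> "00", 10 -> "0a",
        // so the regex engine sees a complete \xNN escape.
        String regex = String.format("\\x%02x", splitter);
        return new ArrayList<>(Arrays.asList(value.split(regex)));
    }

    public static void main(String[] args) {
        System.out.println(splitByByte("first\0second", (byte) 0x00)); // [first, second]
    }
}
```

The same call works for printable delimiters, for example `splitByByte("a,b", (byte) 0x2c)`.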

Related

Java: How to replace consecutive characters with a single character?

How can I replace consecutive characters with a single character in java?
String fileContent = "def  mnop.UVW";
String oldDelimiters = " .";
String newDelimiter = "!";
for (int i = 0; i < oldDelimiters.length(); i++){
Character character = oldDelimiters.charAt(i);
fileContent = fileContent.replace(String.valueOf(character), newDelimiter);
}
Current output: def!!mnop!UVW
Desired output: def!mnop!UVW
Notice the two spaces are replaced with two exclamation marks. How can I replace consecutive delimiters with one delimiter?

Since you want to match consecutive characters from the old delimiter, a regex solution doesn't seem to be feasible here. You can instead match char by char if it belongs to one of the old delimiter chars and then set it with the new one as shown below.
import java.util.*;
public class Main{
public static void main(String[] args) {
String fileContent = "def  mnop.UVW";
String oldDelimiters = " .";
// add all old delimiters in a set for fast checks
Set<Character> set = new HashSet<>();
for(int i=0;i<oldDelimiters.length();++i) set.add(oldDelimiters.charAt(i));
/*
match all consecutive chars at once, check if it belongs to an old delimiter
and replace it with the new one
*/
String newDelimiter = "!";
StringBuilder res = new StringBuilder("");
for(int i=0;i<fileContent.length();++i){
if(set.contains(fileContent.charAt(i))){
while(i + 1 < fileContent.length() && fileContent.charAt(i) == fileContent.charAt(i+1)) i++;
res.append(newDelimiter);
}else{
res.append(fileContent.charAt(i));
}
}
System.out.println(res.toString());
}
}
Demo: https://onlinegdb.com/r1BC6qKP8

s = s.replaceAll("([ \\.])[ \\.]+", "$1");
Or if only several same delimiters have to be replaced:
s = s.replaceAll("([ \\.])\\1+", "$1");
[....] is a group of alternative characters
First (...) is group 1, $1
\\1 is the text of the first group
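The difference between the two patterns is easiest to see on a run of mixed delimiters. This is a small sketch with made-up method names: the first pattern collapses any run of delimiters to its first character, while the back-reference version only collapses repeats of the same character.

```java
class CollapseDemo {
    // Collapses any run of spaces and dots to the first character of the run.
    static String collapseAnyRun(String s) {
        return s.replaceAll("([ \\.])[ \\.]+", "$1");
    }

    // Collapses only repeats of the same delimiter character.
    static String collapseSameChar(String s) {
        return s.replaceAll("([ \\.])\\1+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapseAnyRun("a .b"));             // a b
        System.out.println(collapseSameChar("a .b"));           // a .b
        System.out.println(collapseSameChar("def  mnop..UVW")); // def mnop.UVW
    }
}
```

Neither pattern swaps in the new delimiter; that still needs a separate replacement step.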

While not using regex, I thought a solution with Streams was needed, because everyone loves streams:
private static class StatefulFilter implements Predicate<String> {
private final String needle;
private String last = null;
public StatefulFilter(String needle) {
this.needle = needle;
}
@Override
public boolean test(String value) {
boolean duplicate = last != null && last.equals(value) && value.equals(needle);
last = value;
return !duplicate;
}
}
public static void main(String[] args) {
System.out.println(
"def  mnop.UVW"
.codePoints()
.sequential()
.mapToObj(c -> String.valueOf((char) c))
.filter(new StatefulFilter(" "))
.map(x -> x.equals(" ") ? "!" : x)
.collect(Collectors.joining(""))
);
}
Runnable example: https://onlinegdb.com/BkY0R2twU
Explanation:
Theoretically, you aren't really supposed to have a stateful filter, but technically, as long as the stream is not parallelized, it works fine:
.codePoints() - splits the String into a Stream
.sequential() - since we care about the order of characters, our Stream must not be processed in parallel
.mapToObj(c -> String.valueOf((char) c)) - the comparison in the filter is more intuitive if we convert to String, but it's not really needed
.filter(new StatefulFilter(" ")) - here we filter out any space that comes after another space
.map(x -> x.equals(" ") ? "!" : x) - now we can replace the remaining spaces with exclamation marks
.collect(Collectors.joining("")) - and finally we can join the characters together to reconstitute a String
The StatefulFilter itself is pretty straightforward: it checks a) whether we have a previous character at all, b) whether the previous character is the same as the current character, and c) whether the current character is the delimiter (space). It returns false (meaning the character gets deleted) only if a, b and c are all true.

The biggest difficulty in using a regex for this is creating an expression from your oldDelimiters string. For example:
String oldDelimiters = " .";
String expression = "\\" + String.join("+|\\", oldDelimiters.split("")) + "+";
String text = "def mnop.UVW;abc .df";
String result = text.replaceAll(expression, "!");
(Edit: since characters in the expression are now escaped anyway, I removed the character classes and edited the following text to reflect that change.)
Where the generated expression looks like \ +|\.+, i.e. each character is quantified and constitutes one alternative of the expression. The engine will match and replace one alternative at a time if it can be matched. result now contains:
def!mnop!UVW;abc!!df
Not sure how backwards compatible this is, due to split() behaviour in previous versions of Java (producing a leading empty string when splitting on the empty string), but with current versions this should be fine.
Edit: As it is, this breaks if the delimiting characters contain digits or characters representing unescaped regex tokens (i.e. 1, b, etc.).
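A possible way around that limitation is to quote each delimiter with Pattern.quote instead of prefixing a backslash. This is a sketch, not the answerer's code: the class and method names are hypothetical, and it assumes oldDelimiters is non-empty. It keeps the same "one alternative per character" behaviour as the expression above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DelimiterCollapser {
    // Replaces each run of one delimiter character with newDelimiter.
    // Assumes oldDelimiters is non-empty (an empty string would yield an empty regex).
    static String collapse(String text, String oldDelimiters, String newDelimiter) {
        String expression = oldDelimiters.codePoints()
                // Quote every delimiter so digits, letters and metacharacters are all literal.
                .mapToObj(cp -> "(?:" + Pattern.quote(new String(Character.toChars(cp))) + ")+")
                .collect(Collectors.joining("|"));
        // quoteReplacement keeps '$' and '\' in the new delimiter literal as well.
        return text.replaceAll(expression, Matcher.quoteReplacement(newDelimiter));
    }

    public static void main(String[] args) {
        System.out.println(collapse("def  mnop.UVW;abc .df", " .", "!")); // def!mnop!UVW;abc!!df
        System.out.println(collapse("a11b|c", "1|", "-"));                // a-b-c
    }
}
```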

Get all matches within a string using compile and regex

I'm trying to get all matches that start with _ and end with = from a URL which looks like
?_field1=param1,param2,paramX&_field2=param1,param2,paramX
In that case I'm looking for any instance of _fieldX=
A method which I use to get it looks like
public static List<String> getAllMatches(String url, String regex) {
List<String> matches = new ArrayList<String>();
Matcher m = Pattern.compile("(?=(" + regex + "))").matcher(url);
while(m.find()) {
matches.add(m.group(1));
}
return matches;
}
called as
List<String> fieldsList = getAllMatches(url, "_.=");
but somehow it does not find what I expected.
Any suggestions as to what I have missed?

A regex like (?=(_.=)) finds every (possibly overlapping) occurrence of _, followed by exactly one char (other than a line break char), followed by =. Your field names are longer than one character, so nothing matches.
You need no overlapping matches in the context of the string you provided.
You may just use a lazy dot matching pattern, _(.*?)=. Alternatively, you may use a negated character class based regex: _([^=]+)= (it will capture into Group 1 any one or more chars other than = symbol).
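As an illustration of the negated character class variant, here is a minimal sketch applied to the URL from the question. The class and method names are made up for the example; group(1) holds the field name without the surrounding _ and =, while group(0) would return the whole _fieldX= match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldNames {
    // "_", then one or more characters that are not "=", then "=".
    private static final Pattern FIELD = Pattern.compile("_([^=]+)=");

    static List<String> fieldNames(String url) {
        List<String> matches = new ArrayList<>();
        Matcher m = FIELD.matcher(url);
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        String url = "?_field1=param1,param2,paramX&_field2=param1,param2,paramX";
        System.out.println(fieldNames(url)); // [field1, field2]
    }
}
```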

Since you are passing a regex to the method, it seems you want a generic function.
If so, you may use this method:
public static List<String> getAllMatches(String url, String start, String end) {
List<String> matches = new ArrayList<String>();
Matcher m = Pattern.compile(start + "(.*?)" + end).matcher(url);
while(m.find()) {
matches.add(m.group(1));
}
return matches;
}
and call it as:
List<String> fieldsList = getAllMatches(url, "_", "=");

Split a string based on pattern and merge it back

I need to split a string based on a pattern and then merge part of it back together.
For example, below are the actual and expected strings.
String actualstr="abc.def.ghi.jkl.mno";
String expectedstr="abc.mno";
When I use the code below, I can store the parts in an array and iterate over it to build the result. Is there a simpler and more efficient way to do this?
String[] splited = actualstr.split("[\\.\\.\\.\\.\\.\\s]+");
Although I can access the parts by index, is there an easier way to do this? Please advise.

You do not understand how regexes work.
Here is your regex without the escapes: [\.\.\.\.\.\s]+
You have a character class ([]). Which means there is no reason to have more than one . in it. You also don't need to escape .s in a char class.
Here is an equivalent regex to your regex: [.\s]+. As a Java String that's: "[.\\s]+".
You can do .split("regex") on your string to get an array. It's very simple to get a solution from that point.

I would use a replaceAll in this case
String actualstr="abc.def.ghi.jkl.mno";
String str = actualstr.replaceAll("\\..*\\.", ".");
This replaces everything from the first . to the last . (inclusive) with a single .
You could also use split
String[] parts = actualstr.split("\\.");
String str = parts[0]+"."+parts[parts.length-1]; // first and last word

public static String merge(String string, String delimiter, int... partnumbers)
{
String[] parts = string.split(delimiter);
String result = "";
for ( int x = 0 ; x < partnumbers.length ; x ++ )
{
result += result.length() > 0 ? delimiter.replaceAll("\\\\","") : "";
result += parts[partnumbers[x]];
}
return result;
}
and then use it like:
merge("abc.def.ghi.jkl.mno", "\\.", 0, 4);

I would do it this way
Pattern pattern = Pattern.compile("(\\w*\\.).*\\.(\\w*)");
Matcher matcher = pattern.matcher("abc.def.ghi.jkl.mno");
if (matcher.matches()) {
System.out.println(matcher.group(1) + matcher.group(2));
}
If you can cache the result of
Pattern.compile("(\\w*\\.).*\\.(\\w*)")
and reuse "pattern" every time, this code will be very efficient, because pattern compilation is the most expensive step. The java.lang.String.split() method that other answers suggest uses the same Pattern.compile() internally if the pattern length is greater than 1, meaning it repeats this expensive compilation on each invocation. See java.util.regex - importance of Pattern.compile()?. So it is much better to compile the Pattern once, cache it, and reuse it.
matcher.group(1) refers to the first group of () which is "(\w*\.)"
matcher.group(2) refers to the second one which is "(\w*)"
Although we don't use it here, note that group(0) is the match for the whole regex.
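A minimal sketch of the caching idea: the pattern is compiled once into a static final field and reused on every call. The class and method names are illustrative, and returning the input unchanged when there is no match is an assumption of this sketch, not something the answer specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstAndLast {
    // Compiled once when the class is loaded, then shared by all calls.
    private static final Pattern FIRST_AND_LAST = Pattern.compile("(\\w*\\.).*\\.(\\w*)");

    // Keeps the first word, one dot, and the last word.
    static String firstAndLast(String s) {
        Matcher m = FIRST_AND_LAST.matcher(s);
        // Assumption: input with fewer than two dots is returned as-is.
        return m.matches() ? m.group(1) + m.group(2) : s;
    }

    public static void main(String[] args) {
        System.out.println(firstAndLast("abc.def.ghi.jkl.mno")); // abc.mno
    }
}
```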

Recursive replace with Java regular expression?

I can replace ABC(10,5) with (10)%(5) using:
replaceAll("ABC\\(([^,]*)\\,([^,]*)\\)", "($1)%($2)")
but I'm unable to figure out how to do it for ABC(ABC(20,2),5) or ABC(ABC(30,2),3+2).
If I'm able to convert to ((20)%(2))%5 how can I convert back to ABC(ABC(20,2),5)?
Thanks,
j

I am going to answer the first question. I was not able to do the task in a single replaceAll, and I don't think it is even achievable. However, if I use a loop, this should do the work for you:
String termString = "([0-9+\\-*/()%]*)";
String pattern = "ABC\\(" + termString + "\\," + termString + "\\)";
String [] strings = {"ABC(10,5)", "ABC(ABC(20,2),5)", "ABC(ABC(30,2),3+2)"};
for (String str : strings) {
while (true) {
String replaced = str.replaceAll(pattern, "($1)%($2)");
if (replaced.equals(str)) {
break;
}
str = replaced;
}
System.out.println(str);
}
I am assuming you are writing parser for numeric expressions, thus the definition of term termString = "([0-9+\\-*/()%]*)". It outputs this:
(10)%(5)
((20)%(2))%(5)
((30)%(2))%(3+2)
EDIT As per the OP request I add the code for decoding the strings. It is a bit more hacky than the forward scenario:
String [] encoded = {"(10)%(5)", "((20)%(2))%(5)", "((30)%(2))%(3+2)"};
String decodeTerm = "([0-9+\\-*ABC\\[\\],]*)";
String decodePattern = "\\(" + decodeTerm + "\\)%\\(" + decodeTerm + "\\)";
for (String str : encoded) {
while (true) {
String replaced = str.replaceAll(decodePattern, "ABC[$1,$2]");
if (replaced.equals(str)) {
break;
}
str = replaced;
}
str = str.replaceAll("\\[", "(");
str = str.replaceAll("\\]", ")");
System.out.println(str);
}
And the output is:
ABC(10,5)
ABC(ABC(20,2),5)
ABC(ABC(30,2),3+2)

You can start by evaluating the innermost reducible expressions first, until no more reducible expressions remain. However, you have to take care of other ,, ( and ). The solution of @BorisStrandjev is better, more bulletproof.
String infix(String expr) {
    // Rewrite innermost groups first. Finished parentheses become the
    // placeholders << and >>, so [^,()] only ever sees unprocessed ones.
    for (;;) {
        String expr2 = expr
            // innermost ABC(x,y) whose arguments contain no parentheses
            .replaceAll("ABC\\(([^,()]*),([^,()]*)\\)", "<<$1>>%<<$2>>")
            // innermost plain parentheses that do not belong to an ABC call
            .replaceAll("(?<!ABC)\\(([^()]*)\\)", "<<$1>>");
        if (expr2.equals(expr))
            break;
        expr = expr2;
    }
    expr = expr.replaceAll("<<", "(");
    expr = expr.replaceAll(">>", ")");
    return expr;
}

You could use this Regular Expressions library https://github.com/florianingerl/com.florianingerl.util.regex , that also supports Recursive Regular Expressions.
Converting ABC(ABC(20,2),5) to ((20)%(2))%(5) looks like this:
Pattern pattern = Pattern.compile("(?<abc>ABC\\((?<arg1>(?:(?'abc')|[^,])+)\\,(?<arg2>(?:(?'abc')|[^)])+)\\))");
Matcher matcher = pattern.matcher("ABC(ABC(20,2),5)");
String replacement = matcher.replaceAll(new DefaultCaptureReplacer() {
@Override
public String replace(CaptureTreeNode node) {
if ("abc".equals(node.getGroupName())) {
return "(" + replace(node.getChildren().get(0)) + ")%(" + replace(node.getChildren().get(1)) + ")";
} else
return super.replace(node);
}
});
System.out.println(replacement);
assertEquals("((20)%(2))%(5)", replacement);
Converting back again, i.e. from ((20)%(2))%(5) to ABC(ABC(20,2),5) looks like this:
Pattern pattern = Pattern.compile("(?<fraction>(?<arg>\\(((?:(?'fraction')|[^)])+)\\))%(?'arg'))");
Matcher matcher = pattern.matcher("((20)%(2))%(5)");
String replacement = matcher.replaceAll(new DefaultCaptureReplacer() {
@Override
public String replace(CaptureTreeNode node) {
if ("fraction".equals(node.getGroupName())) {
return "ABC(" + replace(node.getChildren().get(0)) + "," + replace(node.getChildren().get(1)) + ")";
} else if ("arg".equals(node.getGroupName())) {
return replace(node.getChildren().get(0));
} else
return super.replace(node);
}
});
System.out.println(replacement);
assertEquals("ABC(ABC(20,2),5)", replacement);

You can try to rewrite the string using the Polish notation and then replace any % X Y with ABC(X,Y).
Here's the wiki link for the Polish notation.
The problem is that you need to find out which rewrite of ABC(X,Y) occurred first when you recursively replaced them in your string. The Polish notation is useful for "deciphering" the order that these rewrites occur and is widely used in expression evaluation.
You can do this by using a stack and recording which replacement occurred first: find the innermost set of parentheses, push only that expression onto the stack, then remove it from your string. When you want to reconstruct the original expression, just start at the top of the stack and apply the reverse transformation (X)%(Y) -> ABC(X,Y).
This is somewhat a form of the Polish notation, with the only difference being that you don't store the entire expression as a string, but rather store it in a stack for easier processing.
In short, when replacing, start with the inner-most terms (the ones that have no parentheses in them) and apply the reverse replace.
It may be helpful to use (X)%(Y) -> ABC{X,Y} as an intermediary rewrite rule, then rewrite the curly brackets as round brackets. This way it will be easier to determine which is the inner-most term, as the new terms won't use round brackets. Also it is easier to implement, but not as elegant.
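For the forward direction, the nesting can also be handled without regex by tracking parenthesis depth. Note that this is a recursive scan rather than the explicit stack described above, and it is only a sketch: the class and method names are hypothetical, and it assumes well-formed input in which every ABC call has exactly two arguments.

```java
class AbcConverter {
    // Rewrites every ABC(x,y) as (x)%(y), converting nested calls recursively.
    static String toInfix(String expr) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < expr.length()) {
            if (expr.startsWith("ABC(", i)) {
                int start = i + 4;
                int depth = 1;
                int comma = -1;
                int j = start;
                // Find the matching ')' and the comma that separates the two arguments.
                for (; j < expr.length(); j++) {
                    char c = expr.charAt(j);
                    if (c == '(') {
                        depth++;
                    } else if (c == ')') {
                        if (--depth == 0) break;
                    } else if (c == ',' && depth == 1 && comma < 0) {
                        comma = j;
                    }
                }
                if (j == expr.length() || comma < 0) {
                    throw new IllegalArgumentException("Malformed ABC call at index " + i);
                }
                out.append('(').append(toInfix(expr.substring(start, comma)))
                   .append(")%(").append(toInfix(expr.substring(comma + 1, j)))
                   .append(')');
                i = j + 1;
            } else {
                out.append(expr.charAt(i++));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toInfix("ABC(ABC(20,2),5)"));   // ((20)%(2))%(5)
        System.out.println(toInfix("ABC(ABC(30,2),3+2)")); // ((30)%(2))%(3+2)
    }
}
```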

Finding tokens in a Java String

Is there a nice way to extract tokens that start with a pre-defined string and end with a pre-defined string?
For example, let's say the starting string is "[" and the ending string is "]". If I have the following string:
"hello[world]this[[is]me"
The output should be:
token[0] = "world"
token[1] = "[is"
(Note: the second token has a 'start' string in it)

I think you can use the Apache Commons Lang feature that exists in StringUtils:
substringsBetween(java.lang.String str,
java.lang.String open,
java.lang.String close)
The API docs say it:
Searches a String for substrings
delimited by a start and end tag,
returning all matching substrings in
an array.
The Commons Lang substringsBetween API can be found here:
http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)

Here is the way I would go to avoid dependency on commons lang.
public static String escapeRegexp(String regexp){
String specChars = "\\$.*+?|()[]{}^";
String result = regexp;
for (int i=0;i<specChars.length();i++){
Character curChar = specChars.charAt(i);
result = result.replaceAll(
"\\"+curChar,
"\\\\" + (i<2?"\\":"") + curChar); // \ and $ must have special treatment
}
return result;
}
public static List<String> findGroup(String content, String pattern, int group) {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(content);
List<String> result = new ArrayList<String>();
while (m.find()) {
result.add(m.group(group));
}
return result;
}
public static List<String> tokenize(String content, String firstToken, String lastToken){
String regexp = lastToken.length()>1
?escapeRegexp(firstToken) + "(.*?)"+ escapeRegexp(lastToken)
:escapeRegexp(firstToken) + "([^"+lastToken+"]*)"+ escapeRegexp(lastToken);
return findGroup(content, regexp, 1);
}
Use it like this :
String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");
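The hand-written escaping can probably be replaced by the standard Pattern.quote, which turns any string into a literal pattern. A minimal sketch of that alternative follows; the class name is made up, and it always uses the lazy (.*?) form instead of the negated character class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Tokenizer {
    // Returns the text between each open/close pair, scanning left to right.
    static List<String> tokenize(String content, String open, String close) {
        // Pattern.quote makes "[" and "]" (or any other delimiter) match literally.
        Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
        Matcher m = p.matcher(content);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello[world]this[[is]me", "[", "]")); // [world, [is]
    }
}
```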

StringTokenizer? Set the search string to "[]" and the "include tokens" flag to false and I think you're set.

The normal StringTokenizer won't work for this requirement; you have to tweak it or write your own.

There's one way you can do this. It isn't particularly pretty. What it involves is going through the string character by character. When you reach a "[", you start putting the characters into a new token. When you reach a "]", you stop. This would be best done using a data structure not an array since arrays are of static length.
Another solution, which may be possible, is to use regexes with the String's split method. The only problem I have is coming up with a regex that would split the way you want it to. What I can come up with is (]string of characters[) XOR (string of characters[) XOR (]string of characters). Each set of parentheses denotes a different regex. You should evaluate them in this order so you don't accidentally remove anything you want. I'm not familiar with regexes in Java, so I used "string of characters" to denote that there are characters in between the brackets.

Try a regular expression like:
(.*?\[(.*?)\])
The second capture should contain all of the information between the set of []. This will however not work properly if the string contains nested [].

StringTokenizer won't cut it for the specified behavior. You'll need your own method. Something like:
public List<String> extractTokens(String txt, String str, String end) {
    int so=0,eo;
    List<String> lst=new ArrayList<>();
    // find each start marker, then the first end marker after it
    while(so<txt.length() && (so=txt.indexOf(str,so))!=-1) {
        so+=str.length();
        if(so<txt.length() && (eo=txt.indexOf(end,so))!=-1) {
            lst.add(txt.substring(so,eo));
            so=eo+end.length();
        }
    }
    return lst;
}

The regular expression \\[[\\[\\w]+\\] gives us
[world] and
[[is]
