I am looking for a way to incrementally apply a regular expression pattern, i.e. I am looking for a matcher which I can update with characters as they come in and which tells me on each character whether it is still matching or not.
Here is an illustration in code (MagicMatcherIAmLookingFor is the thing I am looking for; characterSource is something I can query for new characters, say an InputStreamReader):
final Pattern pattern = Pattern.compile("[0-9]+");
final MagicMatcherIAmLookingFor incrementalMatcher = pattern.magic();
final StringBuilder stringBuilder = new StringBuilder();
char character;
while (characterSource.isNotEOF()) {
character = characterSource.getNextCharacter();
incrementalMatcher.add(character);
if (incrementalMatcher.matches()) {
stringBuilder.append(character);
} else {
return result(
stringBuilder.toString(),
remaining(character, characterSource)
);
}
}
I did not find a way to utilize the existing java.util.regex.Pattern like that, but maybe I just did not find it. Or is there an alternative library to the built in regular expressions which provides such a feature?
I did not have any luck searching the web for it: all the results are completely swamped with how to use Java regular expressions in the first place.
I am targeting Java 8+.
Is this the kind of object you are looking for?
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MagicMatcher {
private Pattern pattern;
private Matcher matcher;
private String stringToCheck;
public MagicMatcher(Pattern p , String s) {
pattern = p;
stringToCheck = s;
updateMatcher();
}
public boolean matches() {
return matcher.matches();
}
private void updateMatcher() {
matcher = pattern.matcher(stringToCheck);
}
public void setStringToCheck(String s) {
stringToCheck = s;
updateMatcher();
}
public String getStringToCheck() {
return stringToCheck;
}
public void addCharacterToCheck(char c) {
stringToCheck += c;
updateMatcher();
}
public void addStringToCheck(String s) {
stringToCheck += s;
updateMatcher();
}
}
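If rebuilding a Matcher for every character is acceptable, `Matcher.hitEnd()` (part of `java.util.regex` since Java 5) can supply the "could still match" signal that `matches()` alone does not give: after a failed `matches()`, `hitEnd()` returning true means the engine ran out of input, so more characters might still lead to a match. The following is only a minimal sketch along those lines. The class name `IncrementalMatcher` and the method `isViable()` are made up for illustration, and the whole buffer is re-scanned on every character, so this is not a true streaming matcher.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: feeds characters one by one and re-runs the pattern on the buffer.
class IncrementalMatcher {
    private final Pattern pattern;
    private final StringBuilder buffer = new StringBuilder();
    private boolean matches;
    private boolean hitEnd;

    IncrementalMatcher(Pattern pattern) {
        this.pattern = pattern;
    }

    // Appends one character and re-evaluates the whole buffer.
    void add(char c) {
        buffer.append(c);
        Matcher m = pattern.matcher(buffer);
        matches = m.matches(); // does the input so far match completely?
        hitEnd = m.hitEnd();   // did the engine reach the end of the input while trying?
    }

    // True if everything added so far matches the pattern as a whole.
    boolean matches() {
        return matches;
    }

    // True if the input matches now or might still match once more characters arrive.
    boolean isViable() {
        return matches || hitEnd;
    }

    public static void main(String[] args) {
        IncrementalMatcher im = new IncrementalMatcher(Pattern.compile("[0-9]+"));
        for (char c : "123abc".toCharArray()) {
            im.add(c);
            System.out.println(c + ": matches=" + im.matches() + ", viable=" + im.isViable());
        }
    }
}
```

In the loop from the question, `isViable()` would be the condition to test for patterns such as `ab+c`, where a prefix does not match yet but could still do so with more input.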
I have a set of IP addresses in a special format, and I need to check whether each one matches the required regex pattern. My pattern currently looks like this:
private static final String FIRST_PATTERN = "([0-9]{1,3}\\\\{2}?.[0-9]{1,3}\\\\{2}?.[0-9]{1,3}\\\\{2}?.[0-9]{1,3})";
This pattern lets me check strict IP addresses and recognize the pattern when the IP addresses are static, for example: "65\\.33\\.42\\.12" or "112\\.76\\.39\\.104, 188\\.35\\.122\\.148".
I should, however, also be able to match some non-static IPs, like this:
"155\\.105\\.178\\.(8[1-9]|9[0-5])"
or this:
"93\\.23\\.75\\.(1(1[6-9]|2[0-6])),
113\\.202\\.167\\.(1(2[8-9]|[3-8][0-9]|9[0-2]))"
I have tried several approaches, but the result is always false when I try to match those IPs. I have searched for a solution for a decent amount of time without finding one, and I cannot work out how to do it myself. Can anyone help?
UPDATE Whole code snippet:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class IPAdressValidator {
Pattern pattern;
Matcher matcher;
private static final String FIRST_PATTERN = "([0-9]{1,3}\\\\{2}?.[0-9]{1,3}\\\\{2}?.[0-9]{1,3}\\\\{2}?.[0-9]{1,3})";
public IPAdressValidator() {
pattern = Pattern.compile(FIRST_PATTERN);
}
public CharSequence validate(String ip) {
matcher = pattern.matcher(ip);
boolean found = matcher.find();
if (found) {
for (int i = 0; i <= matcher.groupCount(); i++) {
int groupStart = matcher.start(i);
int groupEnd = matcher.end(i);
return ip.subSequence(groupStart, groupEnd);
}
}
return null;
}
}
and my Main:
public class Main {
public static void main(String[] args) {
IPAdressValidator validator = new IPAdressValidator();
String[] ips = {
"53\\\\.22\\\\.14\\\\.43",
"123\\\\.55\\\\.19\\\\.137",
"93\\.152\\.199\\.1",
"(93\\.199\\.(?:1(?:0[6-7]))\\.(?:[0-9]|[1-9][0-9]|1(?:[0-9][0-9])|2(?:[0-4][0-9]|5[0-5])))",
"193\\\\.163\\\\.100\\\\.(8[1-9]|9[0-5])",
"5\\\\.56\\\\.188\\\\.130, 188\\\\.194\\\\.180\\\\.138, 182\\\\.105\\\\.24\\\\.15",
"188\\\\.56\\\\.147\\\\.193,41\\\\.64\\\\.202\\\\.19"
};
for (String ip : ips) {
System.out.printf("%20s: %b%n", ip, validator.validate(ip));
}
}
}
Firstly, I'm aware of similar questions that have been asked such as here:
How to split a string, but also keep the delimiters?
However, I'm having an issue implementing a split of a string using Pattern.split() where the pattern is based on a list of delimiters that can sometimes appear to overlap. Here is the example:
The goal is to split a string based on a set of known codewords which are surrounded by slashes, where I need to keep both the delimiter (codeword) itself and the value after it (which may be empty string).
For this example, the codewords are:
/ABC/
/DEF/
/GHI/
Based on the thread referenced above, the pattern is built as follows using look-ahead and look-behind to tokenise the string into codewords AND values:
((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))
Working string:
"123/ABC//DEF/456/GHI/789"
Using split, this tokenises nicely to:
"123","/ABC/","/DEF/","456","/GHI/","789"
Problem string (note single slash between "ABC" and "DEF"):
"123/ABC/DEF/456/GHI/789"
Here the expectation is that "DEF/456" is the value after "/ABC/" codeword because the "DEF/" bit is not actually a codeword, but just happens to look like one!
Desired outcome is:
"123","/ABC/","DEF/456","/GHI/","789"
Actual outcome is:
"123","/ABC","/","DEF/","456","/GHI/","789"
As you can see, the slash between "ABC" and "DEF" is getting isolated as a token itself.
I've tried solutions as per the other thread using only look-ahead OR look-behind, but they all seem to suffer from the same issue. Any help appreciated!
If you are OK with find rather than split, using some non-greedy matches, try this:
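import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
public class SampleJava {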
static final String[] CODEWORDS = {
"ABC",
"DEF",
"GHI"};
static public void main(String[] args) {
String input = "/ABC/DEF/456/GHI/789";
String codewords = Arrays.stream(CODEWORDS)
.collect(Collectors.joining("|", "/(", ")/"));
// codewords = "/(ABC|DEF|GHI)/";
Pattern p = Pattern.compile(
/* codewords */ ("(DELIM)"
/* pre-delim */ + "|(.+?(?=DELIM))"
/* final bit */ + "|(.+?$)").replace("DELIM", codewords));
Matcher m = p.matcher(input);
while(m.find()) {
System.out.print(m.group(0));
if(m.group(1) != null) {
System.out.print(" ← code word");
}
System.out.println();
}
}
}
Output:
/ABC/ ← code word
DEF/456
/GHI/ ← code word
789
Use a combination of positive and negative look arounds:
String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");
The pattern is also considerably simpler because it uses alternations inside a single look-ahead/look-behind instead of repeating the look-arounds for each codeword.
See live demo.
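To try the look-around split locally, a small harness like the one below can run it against both sample strings from the question. This is only a sketch: the class name `LookaroundSplitDemo` and the helper `tokenize` are made up for illustration. Note that the four dots in the negative look-behind appear to rely on every codeword being exactly three letters long, so a codeword list with other lengths would need a different guard.

```java
import java.util.Arrays;

// Hypothetical harness around the split expression from the answer above.
class LookaroundSplitDemo {
    static final String SPLIT_REGEX =
            "(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)"
            + "|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))";

    // Splits before and after each real codeword, keeping the codewords as tokens.
    static String[] tokenize(String input) {
        return input.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        // Working string from the question (two consecutive codewords).
        System.out.println(Arrays.toString(tokenize("123/ABC//DEF/456/GHI/789")));
        // Problem string from the question ("DEF/" only looks like a codeword).
        System.out.println(Arrays.toString(tokenize("123/ABC/DEF/456/GHI/789")));
    }
}
```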
Following some TDD principles (Red-Green-Refactor), here is how I would implement such behaviour:
Write specs (Red)
I defined a set of unit tests that explain how I understood your "tokenization process". If any test is not correct according to what you expect, feel free to tell me and I'll edit my answer accordingly.
import static org.assertj.core.api.Assertions.assertThat;
import java.util.List;
import org.junit.Test;
public class TokenizerSpec {
Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");
@Test
public void itShouldTokenizeTwoConsecutiveCodewords() {
String input = "123/ABC//DEF/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
}
@Test
public void itShouldTokenizeMisleadingCodeword() {
String input = "123/ABC/DEF/456/GHI/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
}
@Test
public void itShouldTokenizeWhenValueContainsSlash() {
String input = "1/23/ABC/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
}
@Test
public void itShouldTokenizeWithoutCodewords() {
String input = "123/456/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123/456/789");
}
@Test
public void itShouldTokenizeWhenEndingWithCodeword() {
String input = "123/ABC/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/");
}
@Test
public void itShouldTokenizeWhenStartingWithCodeword() {
String input = "/ABC/123";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "123");
}
@Test
public void itShouldTokenizeWhenOnlyCodeword() {
String input = "/ABC//DEF//GHI/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
}
}
Implement according to the specs (Green)
This class makes all the tests above pass:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
public final class Tokenizer {
private final List<String> codewords;
public Tokenizer(String... codewords) {
this.codewords = Arrays.asList(codewords);
}
public List<String> splitPreservingCodewords(String input) {
List<String> tokens = new ArrayList<>();
int lastIndex = 0;
int i = 0;
while (i < input.length()) {
final int idx = i;
Optional<String> codeword = codewords.stream()
.filter(cw -> input.startsWith(cw, idx))
.findFirst();
if (codeword.isPresent()) {
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
tokens.add(codeword.get());
i += codeword.get().length();
lastIndex = i;
} else {
i++;
}
}
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
return tokens;
}
}
Improve implementation (Refactor)
Not done at the moment (not enough time that I can spend on that answer now). I'll do some refactor on Tokenizer with pleasure if you request me to (but later). :-) Or you can do it yourself quite securely since you have the unit tests to avoid regressions.
I would like to do a simple String replacement with a regular expression in Java, but the replacement value is not static; I would like it to be dynamic, as is possible in JavaScript.
I know I can make:
"some string".replaceAll("some regex", "new value");
But I would like something like:
"some string".replaceAll("some regex", new SomeThinkIDontKnow() {
public String handle(String group) {
return "my super dynamic string group " + group;
}
});
Maybe there is a Java way to do this, but I am not aware of it...
You need to use the Java regex API directly.
Create a Pattern object for your regex (this is reusable), then call the matcher() method to run it against your string.
You can then call find() repeatedly to loop through each match in your string, and assemble a replacement string as you like.
Here is how such a replacement can be implemented.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegExCustomReplacementExample
{
public static void main(String[] args)
{
System.out.println(
new ReplaceFunction() {
public String handle(String group)
{
return "«"+group.substring(1, group.length()-1)+"»";
}
}.replace("A simple *test* string", "\\*.*?\\*"));
}
}
abstract class ReplaceFunction
{
public String replace(String source, String regex)
{
final Pattern pattern = Pattern.compile(regex);
final Matcher m = pattern.matcher(source);
boolean result = m.find();
if(result) {
StringBuilder sb = new StringBuilder(source.length());
int p=0;
do {
sb.append(source, p, m.start());
sb.append(handle(m.group()));
p=m.end();
} while (m.find());
sb.append(source, p, source.length());
return sb.toString();
}
return source;
}
public abstract String handle(String group);
}
This might look a bit complicated at first, but that doesn't matter since you only need to write it once. The subclasses implementing the handle method look simpler. An alternative is to pass the Matcher instead of the matched String (group 0) to the handle method, as it offers access to all groups matched by the pattern (if the pattern defines groups).
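On Java 9 or later, the same idea is available without a hand-written loop, because `Matcher.replaceAll` has an overload that takes a `Function<MatchResult, String>`. The sketch below assumes Java 9+; the class name `DynamicReplaceDemo` and the wrapper method are made up for illustration. The handler receives the `MatchResult`, so it can read any group, and its return value is wrapped in `Matcher.quoteReplacement` so that `$` and `\` in the result are taken literally.

```java
import java.util.function.Function;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around Matcher.replaceAll(Function), which requires Java 9 or later.
class DynamicReplaceDemo {
    static String replaceAll(String source, String regex, Function<MatchResult, String> handler) {
        return Pattern.compile(regex)
                .matcher(source)
                // Quote the handler's result so '$' and '\' are not read as group references.
                .replaceAll(match -> Matcher.quoteReplacement(handler.apply(match)));
    }

    public static void main(String[] args) {
        String result = replaceAll("A simple *test* string", "\\*(.*?)\\*",
                match -> "<" + match.group(1) + ">");
        System.out.println(result);
    }
}
```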
Recently I found a very helpful method in the StringUtils library:
StringUtils.stripAccents(String s)
It is really helpful for removing special characters and converting them to an ASCII "equivalent", for instance ç=c etc.
Now I am working for a German customer who needs exactly this, but only for non-German characters: any umlauts should stay untouched. I realised that stripAccents won't be useful in that case.
Does anyone have experience with this?
Are there any useful tools/libraries/classes or maybe regular expressions?
I tried to write a class that parses and replaces such characters, but building such a map for all languages can be very difficult...
Any suggestions appreciated...
It is best to build a custom function, for example like the following. If you want to avoid converting a particular character, remove it from the mapping, i.e. delete it and its counterpart at the same index from the two constant strings.
private static final String UNICODE =
"ÀàÈèÌìÒòÙùÁáÉéÍíÓóÚúÝýÂâÊêÎîÔôÛûŶŷÃãÕõÑñÄäËëÏïÖöÜüŸÿÅåÇçŐőŰű";
private static final String PLAIN_ASCII =
"AaEeIiOoUuAaEeIiOoUuYyAaEeIiOoUuYyAaOoNnAaEeIiOoUuYyAaCcOoUu";
public static String toAsciiString(String str) {
if (str == null) {
return null;
}
StringBuilder sb = new StringBuilder();
for (int index = 0; index < str.length(); index++) {
char c = str.charAt(index);
int pos = UNICODE.indexOf(c);
if (pos > -1) {
sb.append(PLAIN_ASCII.charAt(pos));
} else {
sb.append(c);
}
}
return sb.toString();
}
public static void main(String[] args) {
System.out.println(toAsciiString("Höchstalemannisch"));
}
My gut feeling tells me the easiest way to do this would be to just list allowed characters and strip accents from everything else. This would be something like
import java.util.regex.*;
import java.text.*;
public class Replacement {
public static void main(String args[]) {
String from = "aoeåöäìé";
String result = stripAccentsFromNonGermanCharacters(from);
System.out.println("Result: " + result);
}
private static String patternContainingAllValidGermanCharacters =
"a-zA-Z0-9äÄöÖéÉüÜß";
private static Pattern nonGermanCharactersPattern =
Pattern.compile("([^" + patternContainingAllValidGermanCharacters + "])");
public static String stripAccentsFromNonGermanCharacters(
String from) {
return stripAccentsFromCharactersMatching(
from, nonGermanCharactersPattern);
}
public static String stripAccentsFromCharactersMatching(
String target, Pattern myPattern) {
StringBuffer myStringBuffer = new StringBuffer();
Matcher myMatcher = myPattern.matcher(target);
while (myMatcher.find()) {
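// quoteReplacement keeps '$' and '\' in the matched text from being read as group references
myMatcher.appendReplacement(myStringBuffer,
Matcher.quoteReplacement(stripAccents(myMatcher.group(1))));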
}
myMatcher.appendTail(myStringBuffer);
return myStringBuffer.toString();
}
// pretty much the same thing as StringUtils.stripAccents(String s)
// used here so I can demonstrate the code without StringUtils dependency
public static String stripAccents(String text) {
return Normalizer.normalize(text,
Normalizer.Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
}
(I realize the pattern probably doesn't contain all the characters needed, but add whatever is missing.)
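The same "keep a whitelist, strip everything else" idea can also be sketched without a regex replacement loop, by walking the code points and only decomposing those that are not on the keep list. This is an illustrative variant rather than a drop-in replacement: the class name `GermanSafeAccentStripper` and the contents of `KEEP` are assumptions and should be adjusted to whatever the customer considers German. Composing to NFC first is an extra step so that input arriving in decomposed form (a base letter followed by a combining mark) is still recognised as an umlaut.

```java
import java.text.Normalizer;

// Hypothetical helper: strips accents from every character except a fixed keep list.
class GermanSafeAccentStripper {
    // Lower- and upper-case a/o/u umlauts plus sharp s, written as escapes
    // so the source file stays plain ASCII.
    private static final String KEEP = "\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF";

    static String strip(String input) {
        // Compose first so that a base letter plus combining diaeresis becomes one code point.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        composed.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (KEEP.contains(ch)) {
                out.append(ch); // protected character: copy unchanged
            } else {
                // Decompose and drop the combining marks, as in stripAccents above.
                out.append(Normalizer.normalize(ch, Normalizer.Form.NFD)
                        .replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("H\u00F6chstalemannisch \u00E7a et l\u00E0"));
    }
}
```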
This might give you a workaround: here you can detect the language and extract only the text in that language.
EDIT:
You can pass the raw string as input and set the language detection to German; it will then detect the German characters and discard the rest.
I am having some trouble converting a PHP preg_match call to Java. I thought I had it all correct, but it doesn't seem to be working. Here is the code:
Original PHP:
/* Pattern for 44 Character UUID */
$pattern = "([0-9A-F\-]{44})";
if (preg_match($pattern,$content)){
/*DO ACTION*/
}
My Java code:
final String pattern = "([0-9A-F\\-]{44})";
public static boolean pregMatch(String pattern, String content) {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(content);
boolean b = m.matches();
return b;
}
if (pregMatch(pattern, line)) {
//DO ACTION
}
So my test input is:
DBA40365-7346-4DB4-A2CF-52ECA8C64091-0
Using a series of System.out calls, I found that b = false.
To implement a function as you did in your code:
final String pattern = "[0-9A-F\\-]{44}";
public static boolean pregMatch(String pattern, String content) {
return content.matches(pattern);
}
And then you can call it as:
if (pregMatch(pattern, line)) {
//DO ACTION
}
You don't need the parentheses in your pattern because they just create a match group, which you are not using. If you need access to back references, you would need the parentheses and more advanced regex code using the Pattern and Matcher classes.
You could just use String.matches():
if (line.matches("[0-9A-F-]{44}")) {
// do action
}
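One difference worth keeping in mind when porting: PHP's preg_match succeeds if the pattern is found anywhere in the subject, whereas `String.matches()` and `Matcher.matches()` only succeed if the entire input matches. In the PHP pattern above, the outer parentheses also serve as the regex delimiters rather than as a group. If the UUID can be embedded in a longer line, `Matcher.find()` is the closer equivalent. The sketch below illustrates this with a made-up 44-character sample; the class name `PregMatchDemo` and the sample value are assumptions. The test input shown in the question is 38 characters long, so `{44}` would not match it with either method.

```java
import java.util.regex.Pattern;

// Hypothetical port of preg_match semantics: succeed if the pattern occurs anywhere.
class PregMatchDemo {
    static boolean pregMatch(String pattern, String content) {
        return Pattern.compile(pattern).matcher(content).find();
    }

    public static void main(String[] args) {
        String pattern = "[0-9A-F-]{44}";
        // Made-up 44-character identifier embedded in a longer line.
        String line = "id=DBA40365-7346-4DB4-A2CF-52ECA8C64091-0000000;";
        System.out.println("find:    " + pregMatch(pattern, line)); // searches within the line
        System.out.println("matches: " + line.matches(pattern));    // requires the whole line to match
    }
}
```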