Firstly, I'm aware of similar questions that have been asked such as here:
How to split a string, but also keep the delimiters?
However, I'm having trouble implementing a split of a string using Pattern.split() where the pattern is based on a list of delimiters that can sometimes appear to overlap. Here is the example:
The goal is to split a string based on a set of known codewords which are surrounded by slashes, where I need to keep both the delimiter (codeword) itself and the value after it (which may be empty string).
For this example, the codewords are:
/ABC/
/DEF/
/GHI/
Based on the thread referenced above, the pattern is built as follows using look-ahead and look-behind to tokenise the string into codewords AND values:
((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))
Working string:
"123/ABC//DEF/456/GHI/789"
Using split, this tokenises nicely to:
"123","/ABC/","/DEF/","456","/GHI/","789"
Problem string (note single slash between "ABC" and "DEF"):
"123/ABC/DEF/456/GHI/789"
Here the expectation is that "DEF/456" is the value after "/ABC/" codeword because the "DEF/" bit is not actually a codeword, but just happens to look like one!
Desired outcome is:
"123","/ABC/","DEF/456","/GHI/","789"
Actual outcome is:
"123","/ABC","/","DEF/","456","/GHI/","789"
As you can see, the slash between "ABC" and "DEF" is getting isolated as a token itself.
I've tried solutions as per the other thread using only look-ahead OR look-behind, but they all seem to suffer from the same issue. Any help appreciated!
If you are OK with find rather than split, using some non-greedy matches, try this:
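import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
public class SampleJava {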
static final String[] CODEWORDS = {
"ABC",
"DEF",
"GHI"};
static public void main(String[] args) {
String input = "/ABC/DEF/456/GHI/789";
String codewords = Arrays.stream(CODEWORDS)
.collect(Collectors.joining("|", "/(", ")/"));
// codewords = "/(ABC|DEF|GHI)/";
Pattern p = Pattern.compile(
/* codewords */ ("(DELIM)"
/* pre-delim */ + "|(.+?(?=DELIM))"
/* final bit */ + "|(.+?$)").replace("DELIM", codewords));
Matcher m = p.matcher(input);
while(m.find()) {
System.out.print(m.group(0));
if(m.group(1) != null) {
System.out.print(" ← code word");
}
System.out.println();
}
}
}
Output:
/ABC/ ← code word
DEF/456
/GHI/ ← code word
789
Use a combination of positive and negative look arounds:
String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");
Using alternations inside a single look-ahead/look-behind also simplifies the pattern considerably.
See live demo.
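To see what each lookaround in the pattern above contributes, here is a small harness. It is only a sketch: the class and method names are made up for illustration, and the `....` part assumes every codeword has exactly three letters, as in the question.

```java
import java.util.Arrays;
import java.util.List;

final class LookaroundSplitDemo {
    // Split right after a codeword, unless the nine preceding characters are
    // "/XXX/" plus four more, i.e. the apparent codeword that just ended
    // reused the closing slash of a real codeword.
    private static final String AFTER =
            "(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)";

    // Split right before a codeword, unless its leading slash is actually
    // the closing slash of the previous codeword.
    private static final String BEFORE =
            "(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))";

    static List<String> tokenize(String s) {
        return Arrays.asList(s.split(AFTER + "|" + BEFORE));
    }

    public static void main(String[] args) {
        // prints [123, /ABC/, /DEF/, 456, /GHI/, 789]
        System.out.println(tokenize("123/ABC//DEF/456/GHI/789"));
        // prints [123, /ABC/, DEF/456, /GHI/, 789]
        System.out.println(tokenize("123/ABC/DEF/456/GHI/789"));
    }
}
```

If the codewords had different lengths, the fixed-width `....` would no longer line up and the negative look-behinds would need to be rewritten.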
Following some TDD principles (Red-Green-Refactor), here is how I would implement such behaviour:
Write specs (Red)
I defined a set of unit tests that explain how I understood your "tokenization process". If any test is not correct according to what you expect, feel free to tell me and I'll edit my answer accordingly.
import static org.assertj.core.api.Assertions.assertThat;
import java.util.List;
import org.junit.Test;
public class TokenizerSpec {
Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");
@Test
public void itShouldTokenizeTwoConsecutiveCodewords() {
String input = "123/ABC//DEF/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
}
@Test
public void itShouldTokenizeMisleadingCodeword() {
String input = "123/ABC/DEF/456/GHI/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
}
@Test
public void itShouldTokenizeWhenValueContainsSlash() {
String input = "1/23/ABC/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
}
@Test
public void itShouldTokenizeWithoutCodewords() {
String input = "123/456/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123/456/789");
}
@Test
public void itShouldTokenizeWhenEndingWithCodeword() {
String input = "123/ABC/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/");
}
@Test
public void itShouldTokenizeWhenStartingWithCodeword() {
String input = "/ABC/123";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "123");
}
@Test
public void itShouldTokenizeWhenOnlyCodeword() {
String input = "/ABC//DEF//GHI/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
}
}
Implement according to the specs (Green)
This class makes all the tests above pass:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
public final class Tokenizer {
private final List<String> codewords;
public Tokenizer(String... codewords) {
this.codewords = Arrays.asList(codewords);
}
public List<String> splitPreservingCodewords(String input) {
List<String> tokens = new ArrayList<>();
int lastIndex = 0;
int i = 0;
while (i < input.length()) {
final int idx = i;
Optional<String> codeword = codewords.stream()
.filter(cw -> input.startsWith(cw, idx))
.findFirst();
if (codeword.isPresent()) {
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
tokens.add(codeword.get());
i += codeword.get().length();
lastIndex = i;
} else {
i++;
}
}
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
return tokens;
}
}
Improve implementation (Refactor)
Not done yet; I don't have enough time to spend on this answer right now. I'll gladly refactor Tokenizer later if you ask me to. :-) Or you can do it yourself quite safely, since the unit tests guard against regressions.
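One possible direction for that refactoring, offered only as a sketch: let a regex alternation of the quoted codewords find the delimiters and collect the text between matches. The class name RegexTokenizer is made up here; the method name mirrors the one used in the specs above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class RegexTokenizer {
    private final Pattern codewordPattern;

    RegexTokenizer(String... codewords) {
        if (codewords.length == 0) {
            throw new IllegalArgumentException("at least one codeword is required");
        }
        // Quote each codeword so it is matched literally, then OR them together.
        String alternation = Arrays.stream(codewords)
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.codewordPattern = Pattern.compile(alternation);
    }

    List<String> splitPreservingCodewords(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = codewordPattern.matcher(input);
        int lastIndex = 0;
        while (m.find()) {
            // Text between the previous codeword and this one is a value.
            if (m.start() > lastIndex) {
                tokens.add(input.substring(lastIndex, m.start()));
            }
            tokens.add(m.group());
            lastIndex = m.end();
        }
        // Whatever follows the last codeword is the final value.
        if (lastIndex < input.length()) {
            tokens.add(input.substring(lastIndex));
        }
        return tokens;
    }
}
```

Because find() resumes after the end of each match, the closing slash of /ABC/ is never reused as the opening slash of a would-be /DEF/, which is what the misleading-codeword spec requires.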
Related
I am looking for a way to incrementally apply a regular expression pattern, i.e. I am looking for a matcher which I can update with characters as they come in and which tells me on each character whether it is still matching or not.
Here is an illustration in code (MagicMatcherIAmLookingFor is the thing I am looking for, characterSource is something which I can query for new character, say an InputStreamReader for that matter):
final Pattern pattern = Pattern.compile("[0-9]+");
final MagicMatcherIAmLookingFor incrementalMatcher = pattern.magic();
final StringBuilder stringBuilder = new StringBuilder();
char character;
while (characterSource.isNotEOF()) {
character = characterSource.getNextCharacter();
incrementalMatcher.add(character);
if (incrementalMatcher.matches()) {
stringBuilder.append(character);
} else {
return result(
stringBuilder.toString(),
remaining(character, characterSource)
);
}
}
I did not find a way to utilize the existing java.util.regex.Pattern like that, but maybe I just did not find it. Or is there an alternative library to the built in regular expressions which provides such a feature?
I did not have any luck searching the web for it - all the results are completely swamped with how to use java regular expressions in the first place.
I am targeting Java 8+
Is this the kind of object you are looking for?
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MagicMatcher {
private Pattern pattern;
private Matcher matcher;
private String stringToCheck;
public MagicMatcher(Pattern p , String s) {
pattern = p;
stringToCheck = s;
updateMatcher();
}
public boolean matches() {
return matcher.matches();
}
private void updateMatcher() {
matcher = pattern.matcher(stringToCheck);
}
public void setStringToCheck(String s) {
stringToCheck = s;
updateMatcher();
}
public String getStringToCheck() {
return stringToCheck;
}
public void addCharacterToCheck(char c) {
stringToCheck += c;
updateMatcher();
}
public void addStringToCheck(String s) {
stringToCheck += s;
updateMatcher();
}
}
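A related option with the built-in classes is Matcher.hitEnd(), which reports whether the last match attempt ran out of input. The sketch below (class and enum names are made up) uses it to tell "cannot match any more" apart from "might still match if more characters arrive". It still re-runs the match over the whole buffer on every call, so it is not truly incremental.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixMatchSketch {
    enum State { MATCH, PARTIAL, FAIL }

    static State check(Pattern pattern, CharSequence soFar) {
        Matcher m = pattern.matcher(soFar);
        if (m.matches()) {
            return State.MATCH;
        }
        // hitEnd() is true when the engine reached the end of the input while
        // trying to match, so additional characters could still produce a match.
        return m.hitEnd() ? State.PARTIAL : State.FAIL;
    }

    public static void main(String[] args) {
        Pattern decimal = Pattern.compile("[0-9]+\\.[0-9]+");
        StringBuilder buffer = new StringBuilder();
        for (char c : "12.5x".toCharArray()) {
            buffer.append(c);
            System.out.println(buffer + " -> " + check(decimal, buffer));
        }
    }
}
```

In the loop from the question, FAIL would be the signal to stop consuming characters, while PARTIAL means "keep reading".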
I have a utility class that resolves placeholders in a string, as shown in the example below. All variables are surrounded by { and }.
If my string is
Language is {lang} and version 2 is {version}. Home located at {java.home}
the output is
Language is java and version 2 is 1.8. Home located at C:/java
But if my string is
Language is {lang} and version 2 is {version}. Home located at {{lang}.home}
the output is
Language is java and version 2 is 1.8. Home located at {java.home}
I am trying to find a way to resolve nested properties recursively, but I ran into several issues. Can any logic be added to the code so that inner properties are resolved dynamically?
import java.util.*;
import java.util.regex.*;
public class MyClass {
public static void main(String args[]) {
System.setProperty("lang" , "java");
System.setProperty("version" , "1.8");
System.setProperty("java.home" , "C:/java");
System.out.println(resolve("Language is {lang} and version 2 is {version}. Home located at {java.home}"));
System.out.println(resolve("Language is {lang} and version 2 is {version}. Home located at {{lang}.home}"));
}
public static String resolve(String input) {
List<String> tokens = matchers("[{]\\S+[}]", input);
String value;
for(String token : tokens) {
value = getProperty(token);
if (null != value) {
input = input.replace(token, value);
}
value = "";
}
return input;
}
private static String getProperty(String key) {
key = key.substring(1, key.length()-1);
return System.getProperty(key);
}
public static List<String> matchers(String regex, String text) {
List<String> matches = new ArrayList<String>();
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
matches.add(matcher.group());
}
return matches;
}
public static boolean contains(String regex, String text) {
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
return matcher.find();
}
}
You just need the pattern to match only placeholders whose content has no inner { or }, using [^{}]. No curly bracket inside means no nested value, so you can safely do the replacement.
First, we create a Pattern; we need to escape the braces, and we add a capture group for later.
Pattern p = Pattern.compile("\\{([^{}]+)\\}");
Then we check with the current value:
Matcher m = p.matcher(s);
Now, we just have to check if there is a match and loop on it.
while( m.find() ){
...
}
Inside the loop we need the captured key, so we take the first group and look up its value (let's assume it is always present):
String key = m.group(1);
String value = properties.get(key); //add some fail safe.
Using Matcher.replaceFirst, we safely replace only the first match, which is the one we just read the key from. If you use replaceAll, it will replace every match with the same value.
s = m.replaceFirst(properties.get(key));
Now, since we have updated the String, we need to run the regex against it again:
m = p.matcher(s);
Here is a full example:
Map<String, String> properties = new HashMap<>();
properties.put("lang", "java");
properties.put("java.version", "1.8");
String s = "This is {{lang}.version}.";
Pattern p = Pattern.compile("\\{([^{}]+)\\}");
Matcher m = p.matcher(s);
while(m.find()){
String key = m.group(1);
s = m.replaceFirst(properties.get(key));
System.out.println(s);
m = p.matcher(s); //Reset the matcher
}
This is {java.version}.
This is 1.8.
This has one drawback: it creates a new Matcher for every replacement, so it is most likely not optimal (optimization is not the point here).
FYI: using Matcher.replaceFirst instead of String.replaceFirst avoids compiling a new Pattern each time. Here is the String.replaceFirst code:
public String replaceFirst(String regex, String replacement) {
return Pattern.compile(regex).matcher(this).replaceFirst(replacement);
}
We already have a Matcher to do that, so use it.
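Putting these steps together with the fail-safes mentioned along the way, a minimal sketch could look like the following. The class name, the MAX_PASSES limit and the choice to leave unknown keys untouched are assumptions of this sketch, not part of the snippet above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderResolver {
    // Matches only innermost placeholders: braces with no brace inside.
    private static final Pattern INNERMOST = Pattern.compile("\\{([^{}]+)\\}");
    // Arbitrary guard against definitions that keep producing new placeholders.
    private static final int MAX_PASSES = 100;

    static String resolve(String input, Map<String, String> properties) {
        String current = input;
        for (int pass = 0; pass < MAX_PASSES; pass++) {
            Matcher m = INNERMOST.matcher(current);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                String value = properties.get(m.group(1));
                // Unknown keys are left as they are.
                String replacement = (value != null) ? value : m.group();
                // quoteReplacement stops '$' and '\' from being read as group references.
                m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
            }
            m.appendTail(sb);
            String next = sb.toString();
            if (next.equals(current)) {
                return next; // nothing changed in this pass, so we are done
            }
            current = next;
        }
        throw new IllegalStateException("Too many passes; cyclic placeholder definition?");
    }

    public static void main(String[] args) {
        Map<String, String> properties = new HashMap<>();
        properties.put("lang", "java");
        properties.put("java.version", "1.8");
        // prints "This is 1.8."
        System.out.println(resolve("This is {{lang}.version}.", properties));
    }
}
```

Each pass replaces all innermost placeholders at once, so the number of Matcher instances grows with the nesting depth rather than with the number of placeholders.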
There are lots of ways you could achieve this.
You need some way to communicate to the caller either whether a replacement is necessary, or whether one was made.
A simple option:
public boolean hasPlaceholder(String s) {
// return true if s contains a {} placeholder, else false
}
Using this you can repeatedly replace until done:
while(hasPlaceholder(s)) {
s = replacePlaceholders(s);
}
This does scan through the string more times than is strictly necessary, but you shouldn't optimise prematurely.
A more sophisticated option is for the replacePlaceholders() method to report back whether it succeeded. For that you'll need a response class that wraps the result String and the wasReplaced() boolean:
ReplacementResult replacePlaceholders(String s) {
// process string into newString, counting placeholders replaced
return new ReplacementResult(count > 0, newString);
}
(Implementation of ReplacementResult left as an exercise)
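For completeness, one possible shape for that wrapper, as a plain immutable value class. The constructor argument order follows the call shown above; everything else is just one reasonable choice.

```java
final class ReplacementResult {
    private final boolean replaced;
    private final String string;

    ReplacementResult(boolean replaced, String string) {
        this.replaced = replaced;
        this.string = string;
    }

    /** True if at least one placeholder was replaced in this pass. */
    boolean wasReplaced() {
        return replaced;
    }

    /** The string after this pass. */
    String string() {
        return string;
    }
}
```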
Using this you can do:
ReplacementResult result = replacePlaceholders(s);
while(result.wasReplaced()) {
result = replacePlaceholders(result.string());
}
So, each time you call replacePlaceholders() it will either make at least one replacement, or it will report false having verified that there are no more replacements to make.
You mention recursion in the question. This can of course be done, and it would mean avoiding scanning through the whole string each time -- as you can look at just the replacement fragment. This is untested Java-like pseudocode:
String replaceRecursively(String s) {
StringBuilder result = new StringBuilder();
while(Token token = takeTokenFrom(s)) {
if(token.isPlaceholder()) {
String rawReplacement = lookupReplacement(token);
String processedReplacement = replaceRecursively(rawReplacement);
result.append(processedReplacement);
} else {
result.append(token.text());
}
}
return result.toString();
}
For all of these solutions, you should beware of infinite loops or stack-blowing recursion. What if you replace "{foo}" with "{foo}"? (or worse, what if you replace "{foo}" with "{foo}{foo}"!?).
Of course the simplest way is to be in control of the configuration, and simply not trigger that problem. Detecting the problem programmatically is entirely possible, but complex enough that it would warrant another SO question if you want it.
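As a rough idea of what such a check could look like for the recursive variant, the sketch below keeps a stack of keys currently being expanded and fails as soon as a key shows up inside its own expansion. All names are hypothetical, properties come from a plain Map, and composed keys such as {{lang}.home} are not handled here; only placeholders inside replacement values are expanded.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CycleAwareResolver {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^{}]+)\\}");

    static String resolve(String s, Map<String, String> properties) {
        return resolve(s, properties, new ArrayDeque<>());
    }

    private static String resolve(String s, Map<String, String> properties,
                                  Deque<String> inProgress) {
        Matcher m = PLACEHOLDER.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            String raw = properties.get(key);
            String replacement;
            if (raw == null) {
                replacement = m.group(); // unknown key: keep the placeholder
            } else {
                if (inProgress.contains(key)) {
                    throw new IllegalStateException("Cyclic placeholder: " + key);
                }
                inProgress.push(key);
                // Expand placeholders inside the replacement before using it.
                replacement = resolve(raw, properties, inProgress);
                inProgress.pop();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this in place, a definition such as foo = "{foo}" raises an exception on first use instead of looping or overflowing the stack.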
I need to match if filenames have exactly 2 underscores and extension 'txt'.
For example:
asdf_assss_eee.txt -> true
asdf_assss_eee_txt -> false
asdf_assss_.txt -> false
private static final String FILENAME_PATTERN = "/^[A-Za-z0-9]+_[A-Za-z0-9]+_[A- Za-z0-9]\\.txt";
does not work.
You just need to add + after the third char class and you must remove the first forward slash.
private static final String FILENAME_PATTERN = "^[A-Za-z0-9]+_[A-Za-z0-9]+_[A-Za-z0-9]+\\.txt$";
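A quick way to check the corrected pattern against the three examples from the question (the class and method names are only for illustration):

```java
import java.util.regex.Pattern;

final class FilenameCheck {
    private static final Pattern FILENAME_PATTERN =
            Pattern.compile("^[A-Za-z0-9]+_[A-Za-z0-9]+_[A-Za-z0-9]+\\.txt$");

    static boolean isValid(String filename) {
        // matches() requires the whole string to match the pattern.
        return FILENAME_PATTERN.matcher(filename).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("asdf_assss_eee.txt")); // true
        System.out.println(isValid("asdf_assss_eee_txt")); // false
        System.out.println(isValid("asdf_assss_.txt"));    // false
    }
}
```

Since matches() already anchors at both ends, the ^ and $ are redundant here, but they keep the pattern correct if it is later used with find().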
You can use a regex like this with insensitive flag:
[a-z\d]+_[a-z\d]+_[a-z\d]+\.txt
Or with inline insensitive flag
(?i)[a-z\d]+_[a-z\d]+_[a-z\d]+\.txt
Working demo
In case you want to shorten it a little, you could do:
([a-z\d]+_){2}[a-z\d]+\.txt
Update
So let's assume you want at least one character after the second underscore, before the file extension.
Regex is still not "needed" for this. You could split the String by the underscore and you should have 3 elements from the split. If the 3rd element is just ".txt" then it's not valid.
Example:
public static void main(String[] args) throws Exception {
String[] data = new String[] {
"asdf_assss_eee.txt",
"asdf_assss_eee_txt",
"asdf_assss_.txt"
};
for (String d : data) {
System.out.println(validate(d));
}
}
public static boolean validate(String str) {
if (!str.endsWith(".txt")) {
return false;
}
String[] pieces = str.split("_");
return pieces.length == 3 && !pieces[2].equalsIgnoreCase(".txt");
}
Results:
true
false
false
Old Answer
Not sure I understand why your third example is false, but this is something that can easily be done without regex.
Start with checking to see if the String ends with ".txt", then check if it contains only two underscores.
Example:
public static void main(String[] args) throws Exception {
String[] data = new String[] {
"asdf_assss_eee.txt",
"asdf_assss_eee_txt",
"asdf_assss_.txt"
};
for (String d : data) {
System.out.println(validate(d));
}
}
public static boolean validate(String str) {
if (!str.endsWith(".txt")) {
return false;
}
return str.chars().filter(c -> c == '_').count() == 2;
}
Results:
true
false
true
Use this Pattern:
Pattern p = Pattern.compile("_[^_]+_[^_]+\\.txt");
and use .find() instead of .matches() on the Matcher:
Matcher m = p.matcher(filename);
if (m.find()) {
// found
}
Recently I found a very helpful method in the StringUtils library:
StringUtils.stripAccents(String s)
I found it really helpful for removing special characters and converting them to an ASCII "equivalent", for instance ç=c etc.
Now I am working for a German customer who really needs to do such a thing, but only for non-German characters. Any umlauts should stay untouched. I realised that stripAccents won't be useful in that case.
Does anyone have experience with this?
Are there any useful tools/libraries/classes or maybe regular expressions?
I tried to write a class that parses and replaces such characters, but it can be very difficult to build such a map for all languages...
Any suggestions appreciated...
It is best to build a custom function, something like the following. If you want to avoid converting a particular character, remove its pair from the two constant strings.
private static final String UNICODE =
"ÀàÈèÌìÒòÙùÁáÉéÍíÓóÚúÝýÂâÊêÎîÔôÛûŶŷÃãÕõÑñÄäËëÏïÖöÜüŸÿÅåÇçŐőŰű";
private static final String PLAIN_ASCII =
"AaEeIiOoUuAaEeIiOoUuYyAaEeIiOoUuYyAaOoNnAaEeIiOoUuYyAaCcOoUu";
public static String toAsciiString(String str) {
if (str == null) {
return null;
}
StringBuilder sb = new StringBuilder();
for (int index = 0; index < str.length(); index++) {
char c = str.charAt(index);
int pos = UNICODE.indexOf(c);
if (pos > -1) {
sb.append(PLAIN_ASCII.charAt(pos));
} else {
sb.append(c);
}
}
return sb.toString();
}
public static void main(String[] args) {
System.out.println(toAsciiString("Höchstalemannisch"));
}
My gut feeling tells me the easiest way to do this would be to just list allowed characters and strip accents from everything else. This would be something like
import java.util.regex.*;
import java.text.*;
public class Replacement {
public static void main(String args[]) {
String from = "aoeåöäìé";
String result = stripAccentsFromNonGermanCharacters(from);
System.out.println("Result: " + result);
}
private static String patternContainingAllValidGermanCharacters =
"a-zA-Z0-9äÄöÖéÉüÜß";
private static Pattern nonGermanCharactersPattern =
Pattern.compile("([^" + patternContainingAllValidGermanCharacters + "])");
public static String stripAccentsFromNonGermanCharacters(
String from) {
return stripAccentsFromCharactersMatching(
from, nonGermanCharactersPattern);
}
public static String stripAccentsFromCharactersMatching(
String target, Pattern myPattern) {
StringBuffer myStringBuffer = new StringBuffer();
Matcher myMatcher = myPattern.matcher(target);
while (myMatcher.find()) {
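// quoteReplacement keeps '$' and '\' in the input from being read as group references
myMatcher.appendReplacement(myStringBuffer,
Matcher.quoteReplacement(stripAccents(myMatcher.group(1))));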
}
myMatcher.appendTail(myStringBuffer);
return myStringBuffer.toString();
}
// pretty much the same thing as StringUtils.stripAccents(String s)
// used here so I can demonstrate the code without StringUtils dependency
public static String stripAccents(String text) {
return Normalizer.normalize(text,
Normalizer.Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
}
(I realize the pattern probably doesn't contain all the characters needed, but add whatever is missing.)
This might give you a workaround: you can detect the language and extract only the text in that language.
EDIT:
You can pass the raw string as input and set the language detection to German; it will then detect the German characters and discard the rest.
What is the most elegant way to convert a hyphen separated word (e.g. "do-some-stuff") to the lower camel-case variation (e.g. "doSomeStuff") in Java?
Use CaseFormat from Guava:
import static com.google.common.base.CaseFormat.*;
String result = LOWER_HYPHEN.to(LOWER_CAMEL, "do-some-stuff");
With Java 8 there is finally a one-liner:
Arrays.stream(name.split("\\-"))
.map(s -> Character.toUpperCase(s.charAt(0)) + s.substring(1).toLowerCase())
.collect(Collectors.joining());
Though it takes splitting over 3 actual lines to be legible ツ
(Note: "\\-" is for kebab-case as per question, for snake_case simply change to "_")
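As written, the stream version above capitalizes every segment, so "do-some-stuff" comes out as "DoSomeStuff". A small variation that leaves the first segment in lower case, which is what the question asks for, might look like this (the class and method names are invented for the sketch):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class KebabToCamelSketch {
    static String toLowerCamel(String name) {
        String[] parts = name.split("-");
        if (parts.length == 0) {
            return ""; // input consisted only of hyphens
        }
        // Capitalize every segment after the first, skipping empty ones
        // produced by consecutive hyphens.
        String tail = Arrays.stream(parts, 1, parts.length)
                .filter(s -> !s.isEmpty())
                .map(s -> Character.toUpperCase(s.charAt(0)) + s.substring(1).toLowerCase())
                .collect(Collectors.joining());
        return parts[0].toLowerCase() + tail;
    }

    public static void main(String[] args) {
        System.out.println(toLowerCamel("do-some-stuff")); // doSomeStuff
    }
}
```

The filter on empty segments also avoids the exception that charAt(0) would throw for input such as "do--some".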
The following method should handle the task quite efficiently, in O(n). We just iterate over the characters of the XML method name, skip any '-', and capitalize the character that follows it.
public static String toJavaMethodName(String xmlMethodName) {
StringBuilder nameBuilder = new StringBuilder(xmlMethodName.length());
boolean capitalizeNextChar = false;
for (char c:xmlMethodName.toCharArray()) {
if (c == '-') {
capitalizeNextChar = true;
continue;
}
if (capitalizeNextChar) {
nameBuilder.append(Character.toUpperCase(c));
} else {
nameBuilder.append(c);
}
capitalizeNextChar = false;
}
return nameBuilder.toString();
}
Why not try this:
split on "-"
uppercase each word, skipping the first
join
EDIT: On second thought... While trying to implement this, I found out there is no simple way to join a list of strings in Java, unless you use StringUtils from Apache Commons. So you will need to create a StringBuilder anyway, and thus the algorithm is going to get a little ugly :(
CODE: Here is a sample of the approach mentioned above. Could someone with a Java compiler (sorry, don't have one handy) test this? And benchmark it against the other versions found here?
public static String toJavaMethodNameWithSplits(String xmlMethodName)
{
String[] words = xmlMethodName.split("-"); // split on "-"
StringBuilder nameBuilder = new StringBuilder(xmlMethodName.length());
nameBuilder.append(words[0]);
for (int i = 1; i < words.length; i++) // skip first
{
nameBuilder.append(words[i].substring(0, 1).toUpperCase());
nameBuilder.append(words[i].substring(1));
}
return nameBuilder.toString(); // join
}
If you'd rather not depend on a library, you can combine a regex with String.format. Use a regex to extract the first character after each -, and use those characters as the arguments for String.format. A bit tricky, but it works without an (explicit) loop ;).
public class Test {
public static void main(String[] args) {
System.out.println(convert("do-some-stuff"));
}
private static String convert(String input) {
return String.format(input.replaceAll("\\-(.)", "%S"), input.replaceAll("[^-]*-(.)[^-]*", "$1-").split("-"));
}
}
Here is a slight variation of Andreas' answer that does more than the OP asked for:
public static String toJavaMethodName(final String nonJavaMethodName){
final StringBuilder nameBuilder = new StringBuilder();
boolean capitalizeNextChar = false;
boolean first = true;
for(int i = 0; i < nonJavaMethodName.length(); i++){
final char c = nonJavaMethodName.charAt(i);
if(!Character.isLetterOrDigit(c)){
if(!first){
capitalizeNextChar = true;
}
} else{
nameBuilder.append(capitalizeNextChar
? Character.toUpperCase(c)
: Character.toLowerCase(c));
capitalizeNextChar = false;
first = false;
}
}
return nameBuilder.toString();
}
It handles a few special cases:
fUnnY-cASe is converted to funnyCase
--dash-before-and--after- is converted to dashBeforeAndAfter
some.other$funky:chars? is converted to someOtherFunkyChars
For those who have the com.fasterxml.jackson library in the project and don't want to add Guava, you can use Jackson's naming strategy:
new PropertyNamingStrategy.SnakeCaseStrategy().translate(input); // input is the String to convert
Get the Apache Commons Lang jar for StringUtils. Then you can use the capitalize method:
import org.apache.commons.lang.StringUtils;
public class MyClass{
public String myMethod(String str) {
StringBuffer buff = new StringBuffer();
String[] tokens = str.split("-");
for (String i : tokens) {
buff.append(StringUtils.capitalize(i));
}
return buff.toString();
}
}
As I'm not a big fan of adding a library just for one method, I implemented my own solution (from camel case to snake case):
public String toSnakeCase(String name) {
StringBuilder buffer = new StringBuilder();
for(int i = 0; i < name.length(); i++) {
if(Character.isUpperCase(name.charAt(i))) {
if(i > 0) {
buffer.append('_');
}
buffer.append(Character.toLowerCase(name.charAt(i)));
} else {
buffer.append(name.charAt(i));
}
}
return buffer.toString();
}
It needs to be adapted depending on the input and output cases.
If you use the Spring Framework, you can use the provided StringUtils.
import org.springframework.util.StringUtils;
import java.util.Arrays;
import java.util.stream.Collectors;
public class NormalizeUtils {
private static final String DELIMITER = "_";
private NormalizeUtils() {
throw new IllegalStateException("Do not init.");
}
/**
* Take name like SOME_SNAKE_ALL and convert it to someSnakeAll
*/
public static String fromSnakeToCamel(final String name) {
if (StringUtils.isEmpty(name)) {
return "";
}
final String allCapitalized = Arrays.stream(name.split(DELIMITER))
.filter(c -> !StringUtils.isEmpty(c))
// lower-case first: capitalize() only changes the first character
.map(String::toLowerCase)
.map(StringUtils::capitalize)
.collect(Collectors.joining());
return StringUtils.uncapitalize(allCapitalized);
}
}
Iterate through the string. When you find a hyphen, remove it and capitalise the next letter.