Given the following string:
"foo bar-baz-zzz"
I want to split it at the characters " " and "-", keeping those characters wherever no split is made, and get every combination of splits.
I want to get a two-dimensional array containing
{{"foo", "bar", "baz", "zzz"}
,{"foo bar", "baz", "zzz"}
,{"foo", "bar-baz", "zzz"}
,{"foo bar-baz", "zzz"}
,{"foo", "bar", "baz-zzz"}
,{"foo bar", "baz-zzz"}
,{"foo", "bar-baz-zzz"}
,{"foo bar-baz-zzz"}}
Is there any built-in method in Java to split the string this way? Maybe in a library like Apache Commons? Or do I have to write a wall of for-loops?
Here is a recursive solution that works. I used a List<List<String>> rather than a 2-dimensional array to make things easier. The code is a bit ugly and could probably be tidied up a little.
Sample output:
$ java Main foo bar-baz-zzz
Processing: foo bar-baz-zzz
[foo, bar, baz, zzz]
[foo, bar, baz-zzz]
[foo, bar-baz, zzz]
[foo, bar-baz-zzz]
[foo bar, baz, zzz]
[foo bar, baz-zzz]
[foo bar-baz, zzz]
[foo bar-baz-zzz]
Code:
import java.util.*;
public class Main {
public static void main(String[] args) {
// First build a single string from the command line args.
StringBuilder sb = new StringBuilder();
Iterator<String> it = Arrays.asList(args).iterator();
while (it.hasNext()) {
sb.append(it.next());
if (it.hasNext()) {
sb.append(' ');
}
}
process(sb.toString());
}
protected static void process(String str) {
System.err.println("Processing: " + str);
List<List<String>> results = new LinkedList<List<String>>();
// Invoke the recursive method that does the magic.
process(str, 0, results, new LinkedList<String>(), new StringBuilder());
for (List<String> result : results) {
System.err.println(result);
}
}
protected static void process(String str, int pos, List<List<String>> resultsSoFar, List<String> currentResult, StringBuilder sb) {
if (pos == str.length()) {
// Base case: Reached end of string so add buffer contents to current result
// and add current result to resultsSoFar.
currentResult.add(sb.toString());
resultsSoFar.add(currentResult);
} else {
// Step case: Inspect character at pos and then make recursive call.
char c = str.charAt(pos);
if (c == ' ' || c == '-') {
// When we encounter a ' ' or '-' we recurse twice; once where we treat
// the character as a delimiter and once where we treat it as a 'normal'
// character.
List<String> copy = new LinkedList<String>(currentResult);
copy.add(sb.toString());
process(str, pos + 1, resultsSoFar, copy, new StringBuilder());
sb.append(c);
process(str, pos + 1, resultsSoFar, currentResult, sb);
} else {
sb.append(c);
process(str, pos + 1, resultsSoFar, currentResult, sb);
}
}
}
}
Here's a much shorter version, written in a recursive style. I apologize for only being able to write it in Python. I like how concise it is; surely someone here will be able to make a Java version.
def rec(h,t):
    if len(t)<2: return [[h+t]]
    if (t[0]!=' ' and t[0]!='-'): return rec(h+t[0], t[1:])
    return rec(h+t[0], t[1:]) + [ [h]+x for x in rec('',t[1:])]
and the result:
>>> rec('',"foo bar-baz-zzz")
[['foo bar-baz-zzz'], ['foo bar-baz', 'zzz'], ['foo bar', 'baz-zzz'], ['foo bar'
, 'baz', 'zzz'], ['foo', 'bar-baz-zzz'], ['foo', 'bar-baz', 'zzz'], ['foo', 'bar
', 'baz-zzz'], ['foo', 'bar', 'baz', 'zzz']]
Here is a class that will lazily return lists of split values:
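import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Split implements Iterator<List<String>> {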
private Split kid; private final Pattern pattern;
private String subsequence; private final Matcher matcher;
private boolean done = false; private final String sequence;
public Split(Pattern pattern, String sequence) {
this.pattern = pattern; matcher = pattern.matcher(sequence);
this.sequence = sequence;
}
@Override public List<String> next() {
if (done) { throw new IllegalStateException(); }
while (true) {
if (kid == null) {
if (matcher.find()) {
subsequence = sequence.substring(matcher.end());
kid = new Split(pattern, sequence.substring(0, matcher.start()));
} else { break; }
} else {
if (kid.hasNext()) {
List<String> next = kid.next();
next.add(subsequence);
return next;
} else { kid = null; }
}
}
done = true;
List<String> list = new ArrayList<String>();
list.add(sequence);
return list;
}
@Override public boolean hasNext() { return !done; }
@Override public void remove() { throw new UnsupportedOperationException(); }
}
(Forgive the code formatting - it is to avoid nested scrollbars).
For the sample invocation:
Pattern pattern = Pattern.compile(" |-");
String str = "foo bar-baz-zzz";
Split split = new Split(pattern, str);
while (split.hasNext()) {
System.out.println(split.next());
}
...it will emit:
[foo, bar-baz-zzz]
[foo, bar, baz-zzz]
[foo bar, baz-zzz]
[foo, bar-baz, zzz]
[foo, bar, baz, zzz]
[foo bar, baz, zzz]
[foo bar-baz, zzz]
[foo bar-baz-zzz]
I imagine the implementation could be improved upon.
Why do you need that?
Notice that for a given string of N tokens you want to get an array of roughly N*2^N strings. This can consume a lot of memory if not done carefully...
I guess you will probably need to iterate through them all, right? If so, then it's better to create a class that keeps the original string and just gives you a different way of splitting it each time you ask. This way you will save a lot of memory and get better scalability.
There is no library method.
To accomplish that, tokenize the string (in your case using " -" as the delimiters) while preserving the separators, then associate each separator with a binary flag and build all combinations based on the values of the flags.
In your case, you have 3 separators: " ", "-" and "-", so you have 3 binary flags. You will end up with 2^3 = 8 combinations.
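The flag idea above can be sketched as follows. This is only an illustration under stated assumptions: the class name FlagSplitter and the method allSplits are made up here, the separators are hard-coded to ' ' and '-', and the int bitmask limits it to fewer than 31 separators.

```java
import java.util.ArrayList;
import java.util.List;

public class FlagSplitter {
    // Hypothetical helper: returns every way of splitting s at ' ' and '-'.
    // Separators that are not used as split points stay inside the tokens.
    public static List<List<String>> allSplits(String s) {
        // Record the index of every separator character.
        List<Integer> seps = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ' ' || c == '-') {
                seps.add(i);
            }
        }
        List<List<String>> results = new ArrayList<>();
        // Bit k of mask is the flag for separator k: 1 = split there, 0 = keep it.
        for (int mask = 0; mask < (1 << seps.size()); mask++) {
            List<String> parts = new ArrayList<>();
            int start = 0;
            for (int k = 0; k < seps.size(); k++) {
                if ((mask & (1 << k)) != 0) {
                    parts.add(s.substring(start, seps.get(k)));
                    start = seps.get(k) + 1;
                }
            }
            parts.add(s.substring(start));
            results.add(parts);
        }
        return results;
    }

    public static void main(String[] args) {
        // Prints 8 lists, from [foo bar-baz-zzz] to [foo, bar, baz, zzz].
        for (List<String> parts : allSplits("foo bar-baz-zzz")) {
            System.out.println(parts);
        }
    }
}
```

Because each combination is derived from the mask alone, the same loop could also hand out one split at a time instead of building the whole list.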
I am trying to find out whether "dog" and "cat" occur the same number of times in a given String.
It should return true if the counts are equal, or false otherwise. How can I do this without loops (while, for, etc.)?
This is my current attempt:
class Main {
public static boolean catsDogs(String s) {
String cat = "cat";
String dog = "dog";
if (s.contains(cat) && s.contains(dog)) {
return true;
}
return false;
}
public static void main(String[] args) {
boolean r = catsDogs("catdog");
System.out.println(r); // => true
System.out.println(catsDogs("catcat")); // => false
System.out.println(catsDogs("1cat1cadodog")); // => true
}
}
With Java 9+, Matcher.results() returns a stream of the matches, which can be counted:
public static boolean catsDogs(String s) {
Pattern pCat = Pattern.compile("cat");
Pattern pDog = Pattern.compile("dog");
Matcher mCat = pCat.matcher(s);
Matcher mDog = pDog.matcher(s);
return (mCat.results().count() == mDog.results().count());
}
Alternatively, you can count the occurrences by removing the word with replace() and comparing the string lengths (in case you don't want to use split):
public static boolean catsDogs(String s) {
return count(s,"cat") == count(s,"dog");
}
public static int count(String s, String catOrDog) {
return (s.length() - s.replace(catOrDog, "").length()) / catOrDog.length();
}
public static void main(String[] args) {
boolean r = catsDogs("catdog");
System.out.println(r); // => true
System.out.println(catsDogs("catcat")); // => false
System.out.println(catsDogs("1cat1cadodog")); // => true
}
Here are a couple of single-statement solutions based on the Java 9 method Matcher.results(), which produces a stream of MatchResult objects, one for each matching subsequence in the given string.
We can also make this method more versatile by providing a pair of regular expressions as arguments instead of hard-coding them.
teeing() + summingInt()
We can turn the stream of MatchResult objects into a stream of strings by extracting the matched groups, and then collect the data using the teeing() collector (available since Java 12), which takes two downstream collectors and a function that produces the result from the values returned by each collector.
public static boolean hasSameFrequency(String str,
String regex1,
String regex2) {
return Pattern.compile(regex1 + "|" + regex2).matcher(str).results()
.map(MatchResult::group)
.collect(Collectors.teeing(
Collectors.summingInt(group -> group.matches(regex1) ? 1 : 0),
Collectors.summingInt(group -> group.matches(regex2) ? 1 : 0),
Objects::equals
));
}
collectingAndThen() + partitioningBy()
Similarly, we can use a combination of collectors collectingAndThen() and partitioningBy().
The downside of this approach compared to the one above is that partitioningBy() materializes the stream elements as the values of the map (whereas we're only interested in their quantity), but it performs fewer comparisons.
public static boolean hasSameFrequency(String str,
String regex1,
String regex2) {
return Pattern.compile(regex1 + "|" + regex2).matcher(str).results()
.map(MatchResult::group)
.collect(Collectors.collectingAndThen(
Collectors.partitioningBy(group -> group.matches(regex1)),
map -> map.get(true).size() == map.get(false).size()
));
}
Firstly, I'm aware of similar questions that have been asked such as here:
How to split a string, but also keep the delimiters?
However, I'm having an issue implementing a split of a string using Pattern.split() where the pattern is based on a list of delimiters that can sometimes appear to overlap. Here is the example:
The goal is to split a string based on a set of known codewords which are surrounded by slashes, where I need to keep both the delimiter (codeword) itself and the value after it (which may be empty string).
For this example, the codewords are:
/ABC/
/DEF/
/GHI/
Based on the thread referenced above, the pattern is built as follows using look-ahead and look-behind to tokenise the string into codewords AND values:
((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))
Working string:
"123/ABC//DEF/456/GHI/789"
Using split, this tokenises nicely to:
"123","/ABC/","/DEF/","456","/GHI/","789"
Problem string (note single slash between "ABC" and "DEF"):
"123/ABC/DEF/456/GHI/789"
Here the expectation is that "DEF/456" is the value after "/ABC/" codeword because the "DEF/" bit is not actually a codeword, but just happens to look like one!
Desired outcome is:
"123","/ABC/","DEF/456","/GHI/","789"
Actual outcome is:
"123","/ABC","/","DEF/","456","/GHI/","789"
As you can see, the slash between "ABC" and "DEF" is getting isolated as a token itself.
I've tried solutions as per the other thread using only look-ahead OR look-behind, but they all seem to suffer from the same issue. Any help appreciated!
If you are OK with find rather than split, using some non-greedy matches, try this:
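import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
public class SampleJava {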
static final String[] CODEWORDS = {
"ABC",
"DEF",
"GHI"};
static public void main(String[] args) {
String input = "/ABC/DEF/456/GHI/789";
String codewords = Arrays.stream(CODEWORDS)
.collect(Collectors.joining("|", "/(", ")/"));
// codewords = "/(ABC|DEF|GHI)/";
Pattern p = Pattern.compile(
/* codewords */ ("(DELIM)"
/* pre-delim */ + "|(.+?(?=DELIM))"
/* final bit */ + "|(.+?$)").replace("DELIM", codewords));
Matcher m = p.matcher(input);
while(m.find()) {
System.out.print(m.group(0));
if(m.group(1) != null) {
System.out.print(" ← code word");
}
System.out.println();
}
}
}
Output:
/ABC/ ← code word
DEF/456
/GHI/ ← code word
789
Use a combination of positive and negative look arounds:
String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");
There's also a considerable simplification by using alternations inside single look ahead/behind.
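To see the split in action on the two strings from the question, here is a small sketch. The class name LookaroundSplit and the tokenize helper are made up for illustration; the regex is the one from this answer. Note that the `....` in the look-behind assumes the text mistaken for a codeword is three characters followed by a slash, as in the example, so other inputs may need a different pattern.

```java
import java.util.Arrays;
import java.util.List;

public class LookaroundSplit {
    // The split pattern from the answer, broken over two lines for readability.
    static final String REGEX =
            "(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)"
            + "|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))";

    // Hypothetical helper: splits the input at the zero-width positions matched by REGEX.
    public static List<String> tokenize(String s) {
        return Arrays.asList(s.split(REGEX));
    }

    public static void main(String[] args) {
        // Working string: prints [123, /ABC/, /DEF/, 456, /GHI/, 789]
        System.out.println(tokenize("123/ABC//DEF/456/GHI/789"));
        // Problem string: prints [123, /ABC/, DEF/456, /GHI/, 789]
        System.out.println(tokenize("123/ABC/DEF/456/GHI/789"));
    }
}
```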
Following some TDD principles (Red-Green-Refactor), here is how I would implement such behaviour:
Write specs (Red)
I defined a set of unit tests that explain how I understood your "tokenization process". If any test is not correct according to what you expect, feel free to tell me and I'll edit my answer accordingly.
import static org.assertj.core.api.Assertions.assertThat;
import java.util.List;
import org.junit.Test;
public class TokenizerSpec {
Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");
@Test
public void itShouldTokenizeTwoConsecutiveCodewords() {
String input = "123/ABC//DEF/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
}
@Test
public void itShouldTokenizeMisleadingCodeword() {
String input = "123/ABC/DEF/456/GHI/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
}
@Test
public void itShouldTokenizeWhenValueContainsSlash() {
String input = "1/23/ABC/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
}
@Test
public void itShouldTokenizeWithoutCodewords() {
String input = "123/456/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123/456/789");
}
@Test
public void itShouldTokenizeWhenEndingWithCodeword() {
String input = "123/ABC/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/");
}
@Test
public void itShouldTokenizeWhenStartingWithCodeword() {
String input = "/ABC/123";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "123");
}
@Test
public void itShouldTokenizeWhenOnlyCodeword() {
String input = "/ABC//DEF//GHI/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
}
}
Implement according to the specs (Green)
This class makes all the tests above pass:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
public final class Tokenizer {
private final List<String> codewords;
public Tokenizer(String... codewords) {
this.codewords = Arrays.asList(codewords);
}
public List<String> splitPreservingCodewords(String input) {
List<String> tokens = new ArrayList<>();
int lastIndex = 0;
int i = 0;
while (i < input.length()) {
final int idx = i;
Optional<String> codeword = codewords.stream()
.filter(cw -> input.startsWith(cw, idx))
.findFirst();
if (codeword.isPresent()) {
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
tokens.add(codeword.get());
i += codeword.get().length();
lastIndex = i;
} else {
i++;
}
}
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
return tokens;
}
}
Improve implementation (Refactor)
Not done at the moment (I don't have enough time to spend on this answer right now). I'll gladly refactor Tokenizer later if you ask me to. :-) Or you can do it yourself quite safely, since the unit tests guard against regressions.
I need to match if filenames have exactly 2 underscores and extension 'txt'.
For example:
asdf_assss_eee.txt -> true
asdf_assss_eee_txt -> false
asdf_assss_.txt -> false
private static final String FILENAME_PATTERN = "/^[A-Za-z0-9]+_[A-Za-z0-9]+_[A- Za-z0-9]\\.txt";
does not work.
You just need to add + after the third character class, remove the leading forward slash, and anchor the end with $.
private static final String FILENAME_PATTERN = "^[A-Za-z0-9]+_[A-Za-z0-9]+_[A-Za-z0-9]+\\.txt$";
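As a quick check, the corrected pattern can be run against the three file names from the question. This is a minimal sketch; the class name FilenameCheck and the isValid helper are made up for illustration.

```java
import java.util.regex.Pattern;

public class FilenameCheck {
    // The corrected pattern: three alphanumeric parts joined by underscores, then ".txt".
    private static final Pattern FILENAME_PATTERN =
            Pattern.compile("^[A-Za-z0-9]+_[A-Za-z0-9]+_[A-Za-z0-9]+\\.txt$");

    // Hypothetical helper: true only if the whole name matches the pattern.
    public static boolean isValid(String name) {
        return FILENAME_PATTERN.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("asdf_assss_eee.txt")); // true
        System.out.println(isValid("asdf_assss_eee_txt")); // false
        System.out.println(isValid("asdf_assss_.txt"));    // false
    }
}
```

Since matches() already requires the whole input to match, the ^ and $ anchors are redundant here, but they keep the pattern safe if it is later used with find().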
You can use a regex like this with the case-insensitive flag:
[a-z\d]+_[a-z\d]+_[a-z\d]+\.txt
Or with the inline case-insensitive flag:
(?i)[a-z\d]+_[a-z\d]+_[a-z\d]+\.txt
In case you want to shorten it a little, you could do:
([a-z\d]+_){2}[a-z\d]+\.txt
Update
So let's assume you want at least one character after the second underscore, before the file extension.
Regex is still not "needed" for this. You could split the String by the underscore and you should have 3 elements from the split. If the 3rd element is just ".txt" then it's not valid.
Example:
public static void main(String[] args) throws Exception {
String[] data = new String[] {
"asdf_assss_eee.txt",
"asdf_assss_eee_txt",
"asdf_assss_.txt"
};
for (String d : data) {
System.out.println(validate(d));
}
}
public static boolean validate(String str) {
if (!str.endsWith(".txt")) {
return false;
}
String[] pieces = str.split("_");
return pieces.length == 3 && !pieces[2].equalsIgnoreCase(".txt");
}
Results:
true
false
false
Old Answer
Not sure I understand why your third example is false, but this is something that can easily be done without regex.
Start with checking to see if the String ends with ".txt", then check if it contains only two underscores.
Example:
public static void main(String[] args) throws Exception {
String[] data = new String[] {
"asdf_assss_eee.txt",
"asdf_assss_eee_txt",
"asdf_assss_.txt"
};
for (String d : data) {
System.out.println(validate(d));
}
}
public static boolean validate(String str) {
if (!str.endsWith(".txt")) {
return false;
}
return str.chars().filter(c -> c == '_').count() == 2;
}
Results:
true
false
true
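Use this Pattern:
Pattern p = Pattern.compile("_[^_]+_[^_]+\\.txt");
and use .find() instead of .matches() in the Matcher. Note that find() only looks for the pattern somewhere in the name, so a name with more than two underscores (for example a_b_c_d.txt) is also accepted: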
Matcher m = p.matcher(filename);
if (m.find()) {
// found
}
Say I had the string "foo1bar2" and I wanted to perform the following replacements in parallel, with an expected output of "bar1foo2".
foo => bar
bar => foo
The string cannot be tokenized as the substrings might occur anywhere, any number of times.
A naive approach would be to chain the replacements like this; however, it fails because the second replacement undoes the first.
String output = input.replace("foo", "bar").replace("bar", "foo");
=> foo1foo2
or
String output = input.replace("bar", "foo").replace("foo", "bar");
=> bar1bar2
I'm not sure regex can help me here either? This isn't homework by the way, just geeky interest. I've tried googling this but unsure how to describe the problem.
Try first replacing "foo" with something else that won't occur anywhere else in the String. Then replace "bar" with "foo" then replace the temporary replacement from step 1 with "bar".
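A minimal sketch of that three-step swap is shown below. The class name PlaceholderSwap and the choice of the NUL character as the temporary placeholder are assumptions made for illustration; the approach only works if the placeholder really never occurs in the input, so the sketch checks for that.

```java
public class PlaceholderSwap {
    // Assumption: the NUL character never appears in real input.
    private static final String PLACEHOLDER = "\0";

    // Hypothetical helper: swaps every occurrence of a with b and of b with a.
    public static String swap(String input, String a, String b) {
        if (input.contains(PLACEHOLDER)) {
            throw new IllegalArgumentException("input contains the placeholder");
        }
        return input
                .replace(a, PLACEHOLDER)  // step 1: park every a
                .replace(b, a)            // step 2: b becomes a
                .replace(PLACEHOLDER, b); // step 3: parked a becomes b
    }

    public static void main(String[] args) {
        System.out.println(swap("foo1bar2", "foo", "bar")); // bar1foo2
    }
}
```

If the two search strings can overlap in the input (for example "ab" and "ba" in "aba"), the order of the steps decides which one wins, so this is not a fully general parallel replacement.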
I actually like Code-Guru's answer better, but since you said it's just a curiosity, here's a recursive solution. The idea is to isolate just the piece of the string that you are replacing and recurse on the rest so we don't accidentally replace something that we already did. Now if two of your rules have a common prefix, you may have to do some ordering of your rules to get the desired results, but here goes:
public class ParallelReplace
{
public String replace(String s, Rule... rules)
{
return runRule(s, 0, rules);
}
private String runRule(String s, int curRule, Rule... rules)
{
if (curRule == rules.length)
{
return s;
}
else
{
Rule r = rules[curRule];
int index = s.indexOf(r.lhs);
if (index != -1)
{
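// Text before the match cannot contain this rule's lhs, so move on to the next rule;
// text after the match may still contain it, so keep applying the current rule there.
return runRule(s.substring(0, index), curRule + 1, rules) + r.rhs
+ runRule(s.substring(index + r.lhs.length()), curRule, rules);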
}
else
{
return runRule(s, curRule + 1, rules);
}
}
}
public static class Rule
{
public String lhs;
public String rhs;
public Rule(String lhs, String rhs)
{
this.lhs = lhs;
this.rhs = rhs;
}
}
public static void main(String[] args)
{
String s = "foo1bar2";
ParallelReplace pr = new ParallelReplace();
System.out.println(pr.replace(s, new Rule("foo", "bar"), new Rule("bar", "foo")));
}
}
What is the most elegant way to convert a hyphen separated word (e.g. "do-some-stuff") to the lower camel-case variation (e.g. "doSomeStuff") in Java?
Use CaseFormat from Guava:
import static com.google.common.base.CaseFormat.*;
String result = LOWER_HYPHEN.to(LOWER_CAMEL, "do-some-stuff");
With Java 8 there is finally a one-liner:
Arrays.stream(name.split("\\-"))
.map(s -> Character.toUpperCase(s.charAt(0)) + s.substring(1).toLowerCase())
.collect(Collectors.joining());
Though it takes splitting over 3 actual lines to be legible ツ
(Note: "\\-" is for kebab-case as per question, for snake_case simply change to "_")
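As written, the stream version capitalizes the first segment as well, so "do-some-stuff" becomes "DoSomeStuff". A possible variation that lowercases the first character afterwards is sketched here; the class name KebabToCamel and the method toLowerCamel are made up for illustration, and the filter for empty segments is an addition not present in the original one-liner.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class KebabToCamel {
    // Hypothetical helper: converts kebab-case to lower camel case.
    public static String toLowerCamel(String name) {
        String upperCamel = Arrays.stream(name.split("-"))
                .filter(s -> !s.isEmpty()) // skip empty segments from "--" or a leading "-"
                .map(s -> Character.toUpperCase(s.charAt(0)) + s.substring(1).toLowerCase())
                .collect(Collectors.joining());
        // Lowercase only the very first character to get lower camel case.
        return upperCamel.isEmpty()
                ? upperCamel
                : Character.toLowerCase(upperCamel.charAt(0)) + upperCamel.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(toLowerCamel("do-some-stuff")); // doSomeStuff
    }
}
```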
The following method should handle the task quite efficiently, in O(n). We just iterate over the characters of the XML method name, skip any '-', and capitalize the character that follows it.
public static String toJavaMethodName(String xmlMethodName) {
StringBuilder nameBuilder = new StringBuilder(xmlMethodName.length());
boolean capitalizeNextChar = false;
for (char c:xmlMethodName.toCharArray()) {
if (c == '-') {
capitalizeNextChar = true;
continue;
}
if (capitalizeNextChar) {
nameBuilder.append(Character.toUpperCase(c));
} else {
nameBuilder.append(c);
}
capitalizeNextChar = false;
}
return nameBuilder.toString();
}
Why not try this:
split on "-"
uppercase each word, skipping the first
join
EDIT: On second thought... While trying to implement this, I found out there is no simple way to join a list of strings in Java, unless you use StringUtils from Apache Commons. So you will need to create a StringBuilder anyway, and thus the algorithm is going to get a little ugly :(
CODE: Here is a sample of the above-mentioned approach. Could someone with a Java compiler (sorry, don't have one handy) test this? And benchmark it with other versions found here?
public static String toJavaMethodNameWithSplits(String xmlMethodName)
{
String[] words = xmlMethodName.split("-"); // split on "-"
StringBuilder nameBuilder = new StringBuilder(xmlMethodName.length());
nameBuilder.append(words[0]);
for (int i = 1; i < words.length; i++) // skip first
{
nameBuilder.append(words[i].substring(0, 1).toUpperCase());
nameBuilder.append(words[i].substring(1));
}
return nameBuilder.toString(); // join
}
If you don't want to depend on a library, you can use a combination of a regex and String.format. Use a regex to extract the characters that follow each -, and use these as input for String.format. A bit tricky, but it works without an (explicit) loop ;).
public class Test {
public static void main(String[] args) {
System.out.println(convert("do-some-stuff"));
}
private static String convert(String input) {
return String.format(input.replaceAll("\\-(.)", "%S"), input.replaceAll("[^-]*-(.)[^-]*", "$1-").split("-"));
}
}
Here is a slight variation of Andreas' answer that does more than the OP asked for:
public static String toJavaMethodName(final String nonJavaMethodName){
final StringBuilder nameBuilder = new StringBuilder();
boolean capitalizeNextChar = false;
boolean first = true;
for(int i = 0; i < nonJavaMethodName.length(); i++){
final char c = nonJavaMethodName.charAt(i);
if(!Character.isLetterOrDigit(c)){
if(!first){
capitalizeNextChar = true;
}
} else{
nameBuilder.append(capitalizeNextChar
? Character.toUpperCase(c)
: Character.toLowerCase(c));
capitalizeNextChar = false;
first = false;
}
}
return nameBuilder.toString();
}
It handles a few special cases:
fUnnY-cASe is converted to funnyCase
--dash-before-and--after- is converted to dashBeforeAndAfter
some.other$funky:chars? is converted to someOtherFunkyChars
For those who have the com.fasterxml.jackson library in the project and don't want to add Guava, you can use the translate method of a Jackson naming strategy (note that SnakeCaseStrategy converts camelCase to snake_case):
new PropertyNamingStrategy.SnakeCaseStrategy().translate(name);
Get the Apache Commons jar for StringUtils. Then you can use the capitalize method:
import org.apache.commons.lang.StringUtils;
public class MyClass{
public String myMethod(String str) {
StringBuffer buff = new StringBuffer();
String[] tokens = str.split("-");
for (String i : tokens) {
buff.append(StringUtils.capitalize(i));
}
return buff.toString();
}
}
As I'm not a big fan of adding a library just for one method, I implemented my own solution (from camel case to snake case):
public String toSnakeCase(String name) {
StringBuilder buffer = new StringBuilder();
for(int i = 0; i < name.length(); i++) {
if(Character.isUpperCase(name.charAt(i))) {
if(i > 0) {
buffer.append('_');
}
buffer.append(Character.toLowerCase(name.charAt(i)));
} else {
buffer.append(name.charAt(i));
}
}
return buffer.toString();
}
Needs to be adapted depending on the input and output cases.
In case you use Spring Framework, you can use the provided StringUtils.
import org.springframework.util.StringUtils;
import java.util.Arrays;
import java.util.stream.Collectors;
public class NormalizeUtils {
private static final String DELIMITER = "_";
private NormalizeUtils() {
throw new IllegalStateException("Do not init.");
}
/**
* Take name like SOME_SNAKE_ALL and convert it to someSnakeAll
*/
public static String fromSnakeToCamel(final String name) {
if (StringUtils.isEmpty(name)) {
return "";
}
final String allCapitalized = Arrays.stream(name.split(DELIMITER))
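.filter(c -> !StringUtils.isEmpty(c))
// Lowercase first so SOME_SNAKE_ALL becomes someSnakeAll rather than sOMESNAKEALL.
.map(String::toLowerCase)
.map(StringUtils::capitalize)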
.collect(Collectors.joining());
return StringUtils.uncapitalize(allCapitalized);
}
}
Iterate through the string. When you find a hyphen, remove it and capitalise the next letter.