Collect HashSet / Java 8 / Regex Pattern / Stream API

Recently I changed my project's JDK from version 7 to 8, and now I am rewriting some code snippets using the new features that came with Java 8.
final Matcher mtr = Pattern.compile(regex).matcher(input);
HashSet<String> set = new HashSet<String>() {{
while (mtr.find()) add(mtr.group().toLowerCase());
}};
How can I write this code using the Stream API?

A Matcher-based spliterator implementation can be quite simple if you reuse the JDK-provided Spliterators.AbstractSpliterator:
public class MatcherSpliterator extends AbstractSpliterator<String[]>
{
private final Matcher m;
public MatcherSpliterator(Matcher m) {
super(Long.MAX_VALUE, ORDERED | NONNULL | IMMUTABLE);
this.m = m;
}
@Override public boolean tryAdvance(Consumer<? super String[]> action) {
if (!m.find()) return false;
final String[] groups = new String[m.groupCount()+1];
for (int i = 0; i <= m.groupCount(); i++) groups[i] = m.group(i);
action.accept(groups);
return true;
}
}
Note that the spliterator provides all matcher groups, not just the full match. Also note that this spliterator supports parallelism because AbstractSpliterator implements a splitting policy.
Typically you will use a convenience stream factory:
public static Stream<String[]> matcherStream(Matcher m) {
return StreamSupport.stream(new MatcherSpliterator(m), false);
}
This gives you a powerful basis to concisely write all kinds of complex regex-oriented logic, for example:
private static final Pattern emailRegex = Pattern.compile("([^,]+?)@([^,]+)");
public static void main(String[] args) {
final String emails = "kid@gmail.com, stray@yahoo.com, miks@tijuana.com";
System.out.println("User has e-mail accounts on these domains: " +
matcherStream(emailRegex.matcher(emails))
.map(gs->gs[2])
.collect(joining(", ")));
}
Which prints
User has e-mail accounts on these domains: gmail.com, yahoo.com, tijuana.com
For completeness, your code will be rewritten as
Set<String> set = matcherStream(mtr).map(gs->gs[0].toLowerCase()).collect(toSet());

Marko's answer demonstrates how to get matches into a stream using a Spliterator. Well done, give that man a big +1! Seriously, make sure you upvote his answer before you even consider upvoting this one, since this one is entirely derivative of his.
I have only a small bit to add to Marko's answer, which is that instead of representing the matches as an array of strings (with each array element representing a match group), the matches are better represented as a MatchResult which is a type invented for this purpose. Thus the result would be a Stream<MatchResult> instead of Stream<String[]>. The code gets a little simpler, too. The tryAdvance code would be
if (m.find()) {
action.accept(m.toMatchResult());
return true;
} else {
return false;
}
The map call in his email-matching example would change to
.map(mr -> mr.group(2))
and the OP's example would be rewritten as
Set<String> set = matcherStream(mtr)
.map(mr -> mr.group(0).toLowerCase())
.collect(toSet());
Using MatchResult gives a bit more flexibility in that it also provides offsets of match groups within the string, which could be useful for certain applications.
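For reference, the fragments above can be assembled into one self-contained class. This is only a sketch under a few assumptions: the names MatchResultStreams, MatchResultSpliterator and lowerCaseMatches are made up for illustration, and the IMMUTABLE characteristic is left out because the matcher is stateful.

```java
import java.util.Set;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import static java.util.Spliterator.NONNULL;
import static java.util.Spliterator.ORDERED;

public class MatchResultStreams {

    /** Emits one MatchResult snapshot per successful find(). */
    static final class MatchResultSpliterator
            extends Spliterators.AbstractSpliterator<MatchResult> {
        private final Matcher m;

        MatchResultSpliterator(Matcher m) {
            super(Long.MAX_VALUE, ORDERED | NONNULL); // size unknown up front
            this.m = m;
        }

        @Override
        public boolean tryAdvance(Consumer<? super MatchResult> action) {
            if (!m.find()) {
                return false;
            }
            // toMatchResult() copies the state, so later find() calls do not affect it
            action.accept(m.toMatchResult());
            return true;
        }
    }

    public static Stream<MatchResult> matcherStream(Matcher m) {
        return StreamSupport.stream(new MatchResultSpliterator(m), false);
    }

    /** The snippet from the question: every match, lower-cased, collected into a set. */
    public static Set<String> lowerCaseMatches(String regex, String input) {
        return matcherStream(Pattern.compile(regex).matcher(input))
                .map(mr -> mr.group().toLowerCase())
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseMatches("[A-Za-z]+", "Foo bar FOO, Baz"));
    }
}
```

If Java 9 or later is an option, Matcher.results() already returns a Stream<MatchResult>, so the custom spliterator is only needed on Java 8.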

I don't think you can turn this into a Stream without writing your own Spliterator, but I don't know why you would want to.
Matcher.find() is a state-changing operation on the Matcher object, so running each find() in a parallel stream would produce inconsistent results. Running the stream serially wouldn't perform better than the Java 7 equivalent and would be harder to understand.

What about Pattern.splitAsStream ?
Stream<String> stream = Pattern.compile(regex).splitAsStream(input);
and then a collector to get a set.
Set<String> set = stream.map(String::toLowerCase).collect(Collectors.toSet());
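One thing worth checking before adopting this: splitAsStream yields the text between matches, so the regex has to describe the separators rather than the tokens that find() would return. A small sketch of the difference follows; the class name, method name and sample input are made up for illustration, and a TreeSet is used only to get a stable print order.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SplitAsStreamDemo {

    /** Lower-cased pieces of the input found BETWEEN matches of delimiterRegex. */
    public static Set<String> tokensBetween(String delimiterRegex, String input) {
        return Pattern.compile(delimiterRegex)
                .splitAsStream(input)
                .map(String::toLowerCase)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        String input = "Foo bar FOO, Baz";
        // Delimiter pattern: runs of non-letters separate the words.
        System.out.println(tokensBetween("[^A-Za-z]+", input)); // [bar, baz, foo]
        // Token pattern: the words themselves are treated as separators,
        // so only the spaces and punctuation between them come back.
        System.out.println(tokensBetween("[A-Za-z]+", input));
    }
}
```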

What about
public class MakeItSimple {
public static void main(String[] args) throws FileNotFoundException {
Scanner s = new Scanner(new File("C:\\Users\\Admin\\Desktop\\TextFiles\\Emails.txt"));
HashSet<String> set = new HashSet<>();
while ( s.hasNext()) {
String r = s.next();
if (r.matches("([^,]+?)@([^,]+)")) {
set.add(r);
}
}
set.stream().map(String::toUpperCase).forEach(System.out::println);
s.close();
}
}

Here is the implementation using Spliterator interface.
// To get the required set
Set<String> result = (StreamSupport.stream(new MatcherGroupIterator(pattern,input ),false))
.map( s -> s.toLowerCase() )
.collect(Collectors.toSet());
...
private static class MatcherGroupIterator implements Spliterator<String> {
private final Matcher matcher;
public MatcherGroupIterator(Pattern p, String s) {
matcher = p.matcher(s);
}
@Override
public boolean tryAdvance(Consumer<? super String> action) {
if (!matcher.find()){
return false;
}
action.accept(matcher.group());
return true;
}
@Override
public Spliterator<String> trySplit() {
return null;
}
@Override
public long estimateSize() {
return Long.MAX_VALUE;
}
@Override
public int characteristics() {
return Spliterator.NONNULL;
}
}

Related

Java - Abstract Syntax Tree with grammar

I am building a simple grammar parser with regex. It works, but now I want to add an Abstract Syntax Tree, and I still don't understand how to set it up. I have included the parser.
The parser gets a string and tokenizes it with the lexer.
The tokens include the value and a type.
Any idea how to set up nodes to build an AST?
public class Parser {
lexer lex;
Hashtable<String, Integer> data = new Hashtable<String, Integer>();
public Parser( String str){
ArrayList<Token> token = new ArrayList<Token>();
String[] strpatt = { "[0-9]*\\.[0-9]+", //0
"[a-zA-Z_][a-zA-Z0-9_]*",//1
"[0-9]+",//2
"\\+",//3
"\\-",//4
"\\*",//5
"\\/",//6
"\\=",// 7
"\\)",// 8
"\\("//9
};
lex = new lexer(strpatt, "[\\ \t\r\n]+");
lex.set_data(str);
}
public int peek() {
//System.out.println(lex.peek().type);
return lex.peek().type;
}
public boolean peek( String[] regex) {
return lex.peek(regex);
}
public void set_data( String s) {
lex.set_data(s);
}
public Token scan() {
return lex.scan();
}
public int goal() {
int ret = 0;
while(peek() != -1) {
ret = expr();
}
return ret;
}
}

Currently, you are simply evaluating as you parse:
ret = ret * term()
The easiest way to think of an AST is that it is just a different kind of evaluation. Instead of producing a numeric result from numeric sub-computations, as above, you produce a description of the computation from descriptions of the sub-computations. The description is represented as a small structure which contains the essential information:
ret = BuildProductNode(ret, term());
Or, perhaps
ret = BuildBinaryNode(Product, ret, term());
It's a tree because the Node objects which are being passed around refer to other Node objects without there ever being a cycle or a node with two different parents.
Clearly there are a lot of details missing from the above, particularly the precise nature of the Node object. But it's a rough outline.
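To make the outline a bit more concrete, here is one possible shape for the Node objects. It is a minimal sketch, not the OP's design: the names AstSketch, Node, Num and Binary are invented, only integers, the four operators and parentheses are handled, and it uses its own tiny regex tokenizer because the lexer class from the question is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AstSketch {

    /** A node describes a computation; eval() walks the description later. */
    public interface Node { int eval(); }

    public static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        public int eval() { return value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    public static final class Binary implements Node {
        final char op;
        final Node left, right;
        Binary(char op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
        public int eval() {
            int l = left.eval(), r = right.eval();
            switch (op) {
                case '+': return l + r;
                case '-': return l - r;
                case '*': return l * r;
                case '/': return l / r;
                default: throw new IllegalStateException("unknown operator " + op);
            }
        }
        @Override public String toString() { return "(" + left + " " + op + " " + right + ")"; }
    }

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private AstSketch(String src) {
        Matcher m = Pattern.compile("\\d+|[-+*/()]").matcher(src);
        while (m.find()) tokens.add(m.group());
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }

    // expr := term (('+' | '-') term)*
    private Node expr() {
        Node ret = term();
        while (peek().equals("+") || peek().equals("-")) {
            char op = tokens.get(pos++).charAt(0);
            ret = new Binary(op, ret, term()); // build a node instead of computing
        }
        return ret;
    }

    // term := factor (('*' | '/') factor)*
    private Node term() {
        Node ret = factor();
        while (peek().equals("*") || peek().equals("/")) {
            char op = tokens.get(pos++).charAt(0);
            ret = new Binary(op, ret, factor());
        }
        return ret;
    }

    // factor := NUMBER | '(' expr ')'
    private Node factor() {
        String t = tokens.get(pos++);
        if (t.equals("(")) {
            Node inner = expr();
            pos++; // skip the closing ")"
            return inner;
        }
        return new Num(Integer.parseInt(t));
    }

    public static Node parse(String src) { return new AstSketch(src).expr(); }

    public static void main(String[] args) {
        Node ast = parse("1 + 2 * (3 - 4)");
        System.out.println(ast);        // (1 + (2 * (3 - 4)))
        System.out.println(ast.eval()); // -1
    }
}
```

The parsing methods have the same structure as an evaluating parser; the only change is that each step returns a Node rather than a number, and evaluation becomes a separate walk over the tree.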

Refactoring a nested foreach

private List<Enum> getEnumFromType(List<Bean.Var> vars, List<Enum> enums) {
List<Enum> enumList = new ArrayList<>();
for (Bean.Var var : vars) {
String typeWithoutTypeIdentifierPrefix = var.getType().substring(1,var.getType().length());
for (Enum enumVal : enums) {
if (typeWithoutTypeIdentifierPrefix.equals(enumVal.getName())) {
if (!enumList.contains(enumVal)) {
enumList.add(enumVal);
}
}
}
}
return enumList;
}

You have chained two terminal stream operators.
.forEach() returns void, hence the second .forEach() complains that it can't find a stream to work with.
You may want to read some of the Java 8 Stream documentation before continuing.
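As a rough illustration of the point about terminal operations: a pipeline gets exactly one terminal operation at the end, and nested collections are flattened with flatMap before it. The data and names below (TerminalOperationDemo, codesStartingWith, a list of country-code lists) are hypothetical and only stand in for the OP's types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TerminalOperationDemo {

    /** Collects every code from every zone that starts with the given prefix. */
    public static List<String> codesStartingWith(String prefix, List<List<String>> zones) {
        List<String> hits = new ArrayList<>();
        // Does not compile: forEach returns void, so nothing can be chained after it.
        // zones.stream().forEach(z -> z.stream()).forEach(hits::add);
        zones.stream()
             .flatMap(List::stream)                    // intermediate: flatten the nesting
             .filter(code -> code.startsWith(prefix))  // intermediate: keep matching codes
             .forEachOrdered(hits::add);               // the single terminal operation
        return hits;
    }

    public static void main(String[] args) {
        List<List<String>> zones = Arrays.asList(
                Arrays.asList("DE", "DK"),
                Arrays.asList("FR"),
                Arrays.asList("DZ", "US"));
        System.out.println(codesStartingWith("D", zones)); // [DE, DK, DZ]
    }
}
```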

Don't do this.
Don't get the idea that the Java 8 Stream API should be used every time you are looping through a collection. It's not a wildcard that you can use to replace all enhanced for loops, especially nested ones.
Your error occurs because you are trying to call forEach on the return value of forEach. Since your for loops are nested, the calls to forEach should also be nested in the stream version. The second for loop should be put in a place like this:
.forEach(countries -> countries.getFromZone().getCountries().stream().filter(country -> country.getCode().equals(selectedCountry)).forEach(...))
But seriously, Don't do this.
Your code is very messy in the stream version. It is far less readable than the for loops, mainly because you have a nested for loop. Instead of trying to rewrite your code using streams, you should try to abstract out the logic of your current code. Extract some methods for example:
for (Rate rate : product.getrates()) {
if (rateMatches(value)) { // I tried to guess what you are trying to do here. If you have better names please use yours
for (Countrys countrys : rate.getFromCountry().getCountries()) {
if (countrys.getCode().equals(selectedCountry)) {
updateDisplay(value);
break;
}
}
}
}
This way it's much clearer.

Don't overcomplicate it; think of it in simple terms. Keep in mind that streams are also about making code easier to follow:
find all Rate/Countrys pairs that match your criteria
For each of them, update value accordingly.
Java streams approach (there are more alternatives):
public void yourMethod() {
X product = ...;
Y value = ...;
Z selectedCountry = ...;
if (product.getRates() == null || product.getRates().isEmpty()) {
return;
}
product.getRates().stream()
.filter(r -> matchesValueRate(r, value))
.flatMap(this::rateCountrysPairStream)
.filter(p -> matchesSelectedCountry(p, selectedCountry))
.forEach(p -> updateValue(p, value));
}
public boolean matchesValueRate(Rate candidate, Y value) {
return value.getAtrribute().getRateType().getCode().equalsIgnoreCase(candidate.getRateType().getCode()) && ...; // add your tzone filter also
}
public Stream<Pair<Rate, Countrys>> rateCountrysPairStream(Rate rate) {
return rate.getFromCountry().getCountries().stream().map(c -> Pair.of(rate, c));
}
public boolean matchesSelectedCountry(Pair<Rate, Countrys> candidate, Z selectedCountry) {
return selectedCountry.equals(candidate.second().getCode());
}
public void updateValue(Pair<Rate, Countrys> rateCountry, Y value) {
Rate rate = rateCountry.first();
Countrys country = rateCountry.second();
// do your display stuff here
}
public static class Pair<K, V> {
private final K first;
private final V second;
private Pair(K first, V second) {
this.first = first;
this.second = second;
}
public static <K, V> Pair<K, V> of(K first, V second) {
return new Pair<>(first, second);
}
public K first() {
return first;
}
public V second() {
return second;
}
}

using java streams in parallel with collect(supplier, accumulator, combiner) not giving expected results

I'm trying to find the number of words in a given string. Below is a sequential algorithm for it, which works fine.
public int getWordcount() {
boolean lastSpace = true;
int result = 0;
for(char c : str.toCharArray()){
if(Character.isWhitespace(c)){
lastSpace = true;
}else{
if(lastSpace){
lastSpace = false;
++result;
}
}
}
return result;
}
But when I tried to 'parallelize' this with the Stream.collect(supplier, accumulator, combiner) method, I got wordCount = 0. I am using an immutable class (WordCountState) just to maintain the state of the word count.
Code :
public class WordCounter {
private final String str = "Java8 parallelism helps if you know how to use it properly.";
public int getWordCountInParallel() {
Stream<Character> charStream = IntStream.range(0, str.length())
.mapToObj(i -> str.charAt(i));
WordCountState finalState = charStream.parallel()
.collect(WordCountState::new,
WordCountState::accumulate,
WordCountState::combine);
return finalState.getCounter();
}
}
public class WordCountState {
private final boolean lastSpace;
private final int counter;
private static int numberOfInstances = 0;
public WordCountState(){
this.lastSpace = true;
this.counter = 0;
//numberOfInstances++;
}
public WordCountState(boolean lastSpace, int counter){
this.lastSpace = lastSpace;
this.counter = counter;
//numberOfInstances++;
}
//accumulator
public WordCountState accumulate(Character c) {
if(Character.isWhitespace(c)){
return lastSpace ? this : new WordCountState(true, counter);
}else{
return lastSpace ? new WordCountState(false, counter + 1) : this;
}
}
//combiner
public WordCountState combine(WordCountState wordCountState) {
//System.out.println("Returning new obj with count : " + (counter + wordCountState.getCounter()));
return new WordCountState(this.isLastSpace(),
(counter + wordCountState.getCounter()));
}
public boolean isLastSpace() {
return lastSpace;
}
public int getCounter() {
return counter;
}
}
I've observed three issues with the above code:
1. The number of objects (WordCountState) created is greater than the number of characters in the string.
2. The result is always 0.
3. As per the accumulator/consumer documentation, shouldn't the accumulator return void? Even though my accumulator method returns an object, the compiler doesn't complain.
Any clue where I might have gone off track?
UPDATE :
Used solution as below -
public int getWordCountInParallel() {
Stream<Character> charStream = IntStream.range(0, str.length())
.mapToObj(i -> str.charAt(i));
WordCountState finalState = charStream.parallel()
.reduce(new WordCountState(),
WordCountState::accumulate,
WordCountState::combine);
return finalState.getCounter();
}

You can always invoke a method and ignore its return value, so it’s logical to allow the same when using method references. Therefore, it’s no problem creating a method reference to a non-void method when a consumer is required, as long as the parameters match.
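A small standalone illustration of that rule, using List.add (which returns boolean) where a Consumer is expected. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class IgnoredReturnValue {

    public static List<String> collectWithConsumers() {
        List<String> target = new ArrayList<>();

        // List.add returns boolean, yet the reference fits Consumer<String>:
        // the returned value is simply discarded.
        Consumer<String> byReference = target::add;
        byReference.accept("a");
        byReference.accept("b");

        // A lambda whose body is a single method call is accepted the same way.
        Consumer<String> byLambda = s -> target.add(s);
        byLambda.accept("c");

        return target;
    }

    public static void main(String[] args) {
        System.out.println(collectWithConsumers()); // [a, b, c]
    }
}
```

This is why collect(WordCountState::new, WordCountState::accumulate, WordCountState::combine) compiles even though accumulate and combine return new objects: the results are silently dropped.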
What you have created with your immutable WordCountState class, is a reduction operation, i.e. it would support a use case like
Stream<Character> charStream = IntStream.range(0, str.length())
.mapToObj(i -> str.charAt(i));
WordCountState finalState = charStream.parallel()
.map(ch -> new WordCountState().accumulate(ch))
.reduce(new WordCountState(), WordCountState::combine);
whereas the collect method supports the mutable reduction, where a container instance (may be identical to the result) gets modified.
There is still a logical error in your solution: each WordCountState instance starts out assuming that it is preceded by a space character, without knowing the actual situation, and the combiner makes no attempt to fix this.
A way to fix and simplify this, still using reduction, would be:
public int getWordCountInParallel() {
return str.codePoints().parallel()
.mapToObj(WordCountState::new)
.reduce(WordCountState::new)
.map(WordCountState::getResult).orElse(0);
}
public class WordCountState {
private final boolean firstSpace, lastSpace;
private final int counter;
public WordCountState(int character){
firstSpace = lastSpace = Character.isWhitespace(character);
this.counter = 0;
}
public WordCountState(WordCountState a, WordCountState b) {
this.firstSpace = a.firstSpace;
this.lastSpace = b.lastSpace;
this.counter = a.counter + b.counter + (a.lastSpace && !b.firstSpace? 1: 0);
}
public int getResult() {
return counter+(firstSpace? 0: 1);
}
}
If you are worrying about the number of WordCountState instances, note how many Character instances this solution does not create, compared to your initial approach.
However, this task is indeed suitable for mutable reduction, if you rewrite your WordCountState to a mutable result container:
public int getWordCountInParallel() {
return str.codePoints().parallel()
.collect(WordCountState::new, WordCountState::accumulate, WordCountState::combine)
.getResult();
}
public class WordCountState {
private boolean firstSpace, lastSpace=true, initial=true;
private int counter;
public void accumulate(int character) {
boolean white=Character.isWhitespace(character);
if(lastSpace && !white) counter++;
lastSpace=white;
if(initial) {
firstSpace=white;
initial=false;
}
}
public void combine(WordCountState b) {
if(initial) {
this.initial=b.initial;
this.counter=b.counter;
this.firstSpace=b.firstSpace;
this.lastSpace=b.lastSpace;
}
else if(!b.initial) {
this.counter += b.counter;
if(!lastSpace && !b.firstSpace) counter--;
this.lastSpace = b.lastSpace;
}
}
public int getResult() {
return counter;
}
}
Note how consistently using int to represent Unicode characters allows you to use the codePoints() stream of a CharSequence, which is not only simpler, but also handles characters outside the Basic Multilingual Plane and is potentially more efficient, as it doesn’t need boxing to Character instances.

When you call stream().collect(supplier, accumulator, combiner), the accumulator and the combiner are BiConsumers, so they return void. The problem is this:
collect(WordCountState::new,
WordCountState::accumulate,
WordCountState::combine)
In your case this actually means the following (shown for the accumulator, but the same goes for the combiner):
(wordCounter, character) -> {
WordCountState state = wordCounter.accumulate(character); // the returned state is discarded
return;
}
This is indeed not obvious. Let's say we have two methods:
public void accumulate(Character c) {
if (!Character.isWhitespace(c)) {
counter++;
}
}
public WordCountState accumulate2(Character c) {
if (Character.isWhitespace(c)) {
return lastSpace ? this : new WordCountState(true, counter);
} else {
return lastSpace ? new WordCountState(false, counter + 1) : this;
}
}
For them, the code below works just fine. The same holds for a lambda whose body is a single method call, such as (s, c) -> s.accumulate2(c); only a block lambda that explicitly returns a value is rejected.
BiConsumer<WordCountState, Character> cons = WordCountState::accumulate;
BiConsumer<WordCountState, Character> cons2 = WordCountState::accumulate2;
You can picture it slightly differently, for example via a class that implements BiConsumer:
BiConsumer<WordCountState, Character> clazz = new BiConsumer<WordCountState, Character>() {
@Override
public void accept(WordCountState state, Character character) {
WordCountState newState = state.accumulate2(character);
return;
}
};
As such, your combine and accumulate methods need to change to:
public void combine(WordCountState wordCountState) {
counter = counter + wordCountState.getCounter();
}
public void accumulate(Character c) {
if (!Character.isWhitespace(c)) {
counter++;
}
}

First of all, would it not be easier to just use something like input.split("\\s+").length to get the word count?
In case this is an exercise in streams and collectors, let's discuss your implementation. The biggest mistake was pointed out by you already: Your accumulator and combiner should not return new instances. The signature of collect tells you that it expects BiConsumer, which do not return anything. Because you create new object in the accumulator, you never increase the count of the WordCountState objects your collector actually uses. And by creating a new object in the combiner you would discard any progress you would have made. This is also why you create more objects than characters in your input: one per character, and then some for the return values.
See this adapted implementation:
public static class WordCountState
{
private boolean lastSpace = true;
private int counter = 0;
public void accumulate(Character character)
{
if (!Character.isWhitespace(character))
{
if (lastSpace)
{
counter++;
}
lastSpace = false;
}
else
{
lastSpace = true;
}
}
public void combine(WordCountState wordCountState)
{
counter += wordCountState.counter;
}
}
Here, we do not create new objects in every step, but change the state of the ones we have. I think you tried to create new objects because your ternary operators forced you to return something and/or you couldn't change the instance fields as they are final. They do not need to be final, though, and you can easily change them.
Running this adapted implementation sequentially now works fine, as we nicely look at the chars one by one and end up with 11 words.
In parallel, though, it fails. It seems it creates a new WordCountState for every char, but does not count all of them, and ends up at 29 (at least for me). This shows a basic flaw with your algorithm: Splitting on every character doesn't work in parallel. Imagine the input abc abc, which should result in 2. If you do it in parallel and do not specify how to split the input, you might end up with these chunks: ab, c a, bc, which would add up to 4.
The problem is that by parallelizing between characters (i.e. in the middle of words), you make your separate WordCountStates dependent on each other (because each would need to know which one comes before it and whether that one ended with a whitespace char). This defeats the parallelism and results in errors.
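The ab, c a, bc chunking can be reproduced without any streams. The sketch below (class and method names are invented for illustration) counts each chunk independently, shows that the naive sum is too high, and shows that the sum becomes correct once a merge subtracts one whenever a word continues across a chunk boundary. That boundary check is exactly the extra information a combiner would need.

```java
public class ChunkedWordCount {

    /** Sequential count: a word starts at a non-space that follows a space (or the start). */
    public static int countWords(String s) {
        boolean lastSpace = true;
        int result = 0;
        for (char c : s.toCharArray()) {
            if (Character.isWhitespace(c)) {
                lastSpace = true;
            } else if (lastSpace) {
                lastSpace = false;
                result++;
            }
        }
        return result;
    }

    /** Adds up per-chunk counts, ignoring what happens at chunk boundaries. */
    public static int naiveSum(String... chunks) {
        int total = 0;
        for (String chunk : chunks) {
            total += countWords(chunk);
        }
        return total;
    }

    /** Same as naiveSum, but undoes the double count when a word spans two chunks. */
    public static int correctedSum(String... chunks) {
        int total = 0;
        boolean prevEndsInWord = false;
        for (String chunk : chunks) {
            if (chunk.isEmpty()) {
                continue;
            }
            total += countWords(chunk);
            boolean startsInWord = !Character.isWhitespace(chunk.charAt(0));
            if (prevEndsInWord && startsInWord) {
                total--; // the same word was counted once in each chunk
            }
            prevEndsInWord = !Character.isWhitespace(chunk.charAt(chunk.length() - 1));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countWords("abc abc"));            // 2
        System.out.println(naiveSum("ab", "c a", "bc"));      // 4
        System.out.println(correctedSum("ab", "c a", "bc"));  // 2
    }
}
```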
Aside from all that, it might be easier to implement the Collector interface instead of providing the three methods:
public static class WordCountCollector
implements Collector<Character, SimpleEntry<AtomicInteger, Boolean>, Integer>
{
@Override
public Supplier<SimpleEntry<AtomicInteger, Boolean>> supplier()
{
return () -> new SimpleEntry<>(new AtomicInteger(0), true);
}
@Override
public BiConsumer<SimpleEntry<AtomicInteger, Boolean>, Character> accumulator()
{
return (count, character) -> {
if (!Character.isWhitespace(character))
{
if (count.getValue())
{
String before = count.getKey().get() + " -> ";
count.getKey().incrementAndGet();
System.out.println(before + count.getKey().get());
}
count.setValue(false);
}
else
{
count.setValue(true);
}
};
}
@Override
public BinaryOperator<SimpleEntry<AtomicInteger, Boolean>> combiner()
{
return (c1, c2) -> new SimpleEntry<>(new AtomicInteger(c1.getKey().get() + c2.getKey().get()), false);
}
@Override
public Function<SimpleEntry<AtomicInteger, Boolean>, Integer> finisher()
{
return count -> count.getKey().get();
}
@Override
public Set<java.util.stream.Collector.Characteristics> characteristics()
{
return new HashSet<>(Arrays.asList(Characteristics.CONCURRENT, Characteristics.UNORDERED));
}
}
We use a pair (SimpleEntry) to keep the count and the knowledge about the last space. This way, we do not need to implement the state in the collector itself or write a param object for it. You can use this collector like this:
return charStream.parallel().collect(new WordCountCollector());
This collector parallelizes better than the initial implementation, but its results still vary (mostly between 14 and 16) because of the mentioned weaknesses in your approach.

Java Pattern.split() with overlapping delimiters

Firstly, I'm aware of similar questions that have been asked such as here:
How to split a string, but also keep the delimiters?
However, I'm having an issue implementing a split of a string using Pattern.split() where the pattern is based on a list of delimiters that can sometimes appear to overlap. Here is the example:
The goal is to split a string based on a set of known codewords which are surrounded by slashes, where I need to keep both the delimiter (codeword) itself and the value after it (which may be empty string).
For this example, the codewords are:
/ABC/
/DEF/
/GHI/
Based on the thread referenced above, the pattern is built as follows using look-ahead and look-behind to tokenise the string into codewords AND values:
((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))
Working string:
"123/ABC//DEF/456/GHI/789"
Using split, this tokenises nicely to:
"123","/ABC/","/DEF/","456","/GHI/","789"
Problem string (note single slash between "ABC" and "DEF"):
"123/ABC/DEF/456/GHI/789"
Here the expectation is that "DEF/456" is the value after "/ABC/" codeword because the "DEF/" bit is not actually a codeword, but just happens to look like one!
Desired outcome is:
"123","/ABC/","DEF/456","/GHI/","789"
Actual outcome is:
"123","/ABC","/","DEF/","456","/GHI/","789"
As you can see, the slash between "ABC" and "DEF" is getting isolated as a token itself.
I've tried solutions as per the other thread using only look-ahead OR look-behind, but they all seem to suffer from the same issue. Any help appreciated!

If you are OK with find rather than split, using some non-greedy matches, try this:
public class SampleJava {
static final String[] CODEWORDS = {
"ABC",
"DEF",
"GHI"};
static public void main(String[] args) {
String input = "/ABC/DEF/456/GHI/789";
String codewords = Arrays.stream(CODEWORDS)
.collect(Collectors.joining("|", "/(", ")/"));
// codewords = "/(ABC|DEF|GHI)/";
Pattern p = Pattern.compile(
/* codewords */ ("(DELIM)"
/* pre-delim */ + "|(.+?(?=DELIM))"
/* final bit */ + "|(.+?$)").replace("DELIM", codewords));
Matcher m = p.matcher(input);
while(m.find()) {
System.out.print(m.group(0));
if(m.group(1) != null) {
System.out.print(" ← code word");
}
System.out.println();
}
}
}
Output:
/ABC/ ← code word
DEF/456
/GHI/ ← code word
789

Use a combination of positive and negative look arounds:
String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");
There's also a considerable simplification by using alternations inside single look ahead/behind.
See live demo.

Following some TDD principles (Red-Green-Refactor), here is how I would implement such behaviour:
Write specs (Red)
I defined a set of unit tests that explain how I understood your "tokenization process". If any test is not correct according to what you expect, feel free to tell me and I'll edit my answer accordingly.
import static org.assertj.core.api.Assertions.assertThat;
import java.util.List;
import org.junit.Test;
public class TokenizerSpec {
Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");
@Test
public void itShouldTokenizeTwoConsecutiveCodewords() {
String input = "123/ABC//DEF/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
}
@Test
public void itShouldTokenizeMisleadingCodeword() {
String input = "123/ABC/DEF/456/GHI/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
}
@Test
public void itShouldTokenizeWhenValueContainsSlash() {
String input = "1/23/ABC/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
}
@Test
public void itShouldTokenizeWithoutCodewords() {
String input = "123/456/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123/456/789");
}
@Test
public void itShouldTokenizeWhenEndingWithCodeword() {
String input = "123/ABC/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/");
}
@Test
public void itShouldTokenizeWhenStartingWithCodeword() {
String input = "/ABC/123";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "123");
}
@Test
public void itShouldTokenizeWhenOnlyCodeword() {
String input = "/ABC//DEF//GHI/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
}
}
Implement according to the specs (Green)
This class makes all the tests above pass:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
public final class Tokenizer {
private final List<String> codewords;
public Tokenizer(String... codewords) {
this.codewords = Arrays.asList(codewords);
}
public List<String> splitPreservingCodewords(String input) {
List<String> tokens = new ArrayList<>();
int lastIndex = 0;
int i = 0;
while (i < input.length()) {
final int idx = i;
Optional<String> codeword = codewords.stream()
.filter(cw -> input.startsWith(cw, idx))
.findFirst();
if (codeword.isPresent()) {
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
tokens.add(codeword.get());
i += codeword.get().length();
lastIndex = i;
} else {
i++;
}
}
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
return tokens;
}
}
Improve implementation (Refactor)
Not done at the moment (not enough time that I can spend on that answer now). I'll do some refactor on Tokenizer with pleasure if you request me to (but later). :-) Or you can do it yourself quite securely since you have the unit tests to avoid regressions.

Java 8 lambda expression callback type when parsing strings with regex

I would like to write a function, that can match a string against regex and execute a callback with all group matches as parameters.
I came up with this and it works:
private static void parse(String source, String regex,
Consumer<String[]> callback) {
//convert regex groups 1..n into integer array 0..n-1 and call
//consumer callback with this array
Matcher m = Pattern.compile(regex).matcher(source);
String[] ret = new String[m.groupCount()];
if (m.matches()) {
for (int i=0; i<m.groupCount(); i++) {
ret[i] = m.group(1+i);
}
callback.accept(ret);
}
}
You can then do
parse("Add 42,43", "Add (\\d+?),(\\d+?)", p -> processData(p[0],p[1]));
What I would like to be able to do ideally is this
parse("Add 42,43", "Add (\\d+?),(\\d+?)", (x,y) -> processData(x,y));
What would be the most elegant way? The only one I can think of is to declare multiple functional interfaces with 1..n parameters and use overloads to handle it. Any better ideas, maybe with reflection?

As I understand it, the question is whether there is syntactic sugar for tuple initialization from an array, i.e.:
val (hour, minutes, seconds) = new String[]{"12", "05", "44"};
... except for it is hidden inside a lambda arguments declaration.
As far as I know, there is no such syntax in Java 8 and your approach seems the most convenient. There is in Scala, however: Is there a way to initialize multiple variables from array or List in Scala?.
There are similar instructions in Scala as well:
scala> val s = "Add 42,43"
scala> val r = "Add (\\d+?),(\\d+?)".r
scala> val r(x,y) = s
x: String = 42
y: String = 43

Since I solved it for myself by now, I will post a solution I came up with here. If someone proposes a better one, maybe with method chaining or more generic, I will gladly grant an answer.
You can use the class below like this:
String msg = "add:42,34";
ParseUtils.parse(msg, "add:(\\d+),(\\d+)", (int x,int y) -> simulator.add(x, y));
ParseUtils.parse(msg, "add:(\\d+),(\\d+)", simulator::add); //IntBiConsumer match
ParseUtils.parse(msg, "add:(.*?),", System.out::println);
And here is the class (I omitted trivial error processing and boolean returns if no match):
public class ParseUtils {
@FunctionalInterface
public interface Consumer { void accept(String s); }
@FunctionalInterface
public interface BiConsumer { void accept(String a, String b); }
//... you can add TriConsumer etc. if you need to ...
@FunctionalInterface //conveniently parses integers
public interface IntBiConsumer { void accept(int x, int y); }
// implementations -----
public static void parse(String src, String regex, Consumer callback) {
callback.accept(parse(src, regex)[0]);
}
public static void parse(String src, String regex, BiConsumer callback) {
String[] p = parse(src, regex);
callback.accept(p[0],p[1]);
}
public static void parse(String src, String regex,
IntBiConsumer callback) {
String[] p = parse(src, regex);
callback.accept(Integer.parseInt(p[0]), Integer.parseInt(p[1]));
}
public static String[] parse(String source, String pattern) {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(source);
String[] ret = new String[m.groupCount()];
if (m.matches()) {
for (int i=0; i<m.groupCount(); i++) {
ret[i] = m.group(1 + i);
}
}
return ret;
}
}
