Java 8 Stream of sentences - java

I’d like to use Java 8 streams to take a stream of strings (for example read from a plain text file) and produce a stream of sentences. I assume sentences can cross line boundaries.
So for example, I want to go from:
"This is the", "first sentence. This is the", "second sentence."
to:
"This is the first sentence.", "This is the second sentence."
I can see that it’s possible to get a stream of parts of sentences as follows:
Pattern p = Pattern.compile("\\.");
Stream<String> lines
= Stream.of("This is the", "first sentence. This is the", "second sentence.");
Stream<String> result = lines.flatMap(s -> p.splitAsStream(s));
But then I’m not sure how to produce a stream to join the fragments into sentences. I want to do this in a lazy way so that only what is needed from the original stream is read. Any ideas?

Breaking text into sentences is not as easy as just looking for dots. E.g., you don’t want to split in the middle of “Mr.Smith”…
Thankfully, there is already a JRE class which takes care of that, the BreakIterator. What it doesn’t have, is Stream support, so in order to use it with streams, some support code around it is required:
public class SentenceStream extends Spliterators.AbstractSpliterator<String>
implements Consumer<CharSequence> {
public static Stream<String> sentences(Stream<? extends CharSequence> s) {
return StreamSupport.stream(new SentenceStream(s.spliterator()), false);
}
Spliterator<? extends CharSequence> source;
CharBuffer buffer;
BreakIterator iterator;
public SentenceStream(Spliterator<? extends CharSequence> source) {
super(Long.MAX_VALUE, ORDERED|NONNULL);
this.source = source;
iterator=BreakIterator.getSentenceInstance(Locale.ENGLISH);
buffer=CharBuffer.allocate(100);
buffer.flip();
}
@Override
public boolean tryAdvance(Consumer<? super String> action) {
for(;;) {
int next=iterator.next();
if(next!=BreakIterator.DONE && next!=buffer.limit()) {
action.accept(buffer.subSequence(0, next-buffer.position()).toString());
buffer.position(next);
return true;
}
if(!source.tryAdvance(this)) {
if(buffer.hasRemaining()) {
action.accept(buffer.toString());
buffer.position(0).limit(0);
return true;
}
return false;
}
iterator.setText(buffer.toString());
}
}
@Override
public void accept(CharSequence t) {
buffer.compact();
if(buffer.remaining()<t.length()) {
CharBuffer bigger=CharBuffer.allocate(
Math.max(buffer.capacity()*2, buffer.position()+t.length()));
buffer.flip();
bigger.put(buffer);
buffer=bigger;
}
buffer.append(t).flip();
}
}
With that support class, you can simply say, e.g.:
Stream<String> lines = Stream.of(
"This is the ", "first sentence. This is the ", "second sentence.");
sentences(lines).forEachOrdered(System.out::println);
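For comparison, here is a minimal sketch of BreakIterator used directly on a single in-memory string, without any stream support. The class name BreakIteratorDemo and the method name split are made up for illustration; only the BreakIterator calls come from the JRE.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the answer's SentenceStream.
class BreakIteratorDemo {
    // Splits one complete string into sentences using the JRE's BreakIterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // A boundary-to-boundary span keeps its trailing whitespace, so trim it.
            sentences.add(text.substring(start, end).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        split("This is the first sentence. This is the second sentence.")
            .forEach(System.out::println);
    }
}
```

This only works when the whole text is already in memory, which is exactly the limitation the spliterator above is meant to remove.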

This is a sequential, stateful problem, which Stream's designer is not too fond of.
In a more general sense, you are implementing a lexer, which converts a sequence of tokens to a sequence of another type of tokens. While you might use Stream to solve it with tricks and hacks, there is really no reason to. Just because Stream is there doesn't mean we have to use it for everything.
That being said, an answer to your question is to use flatMap() with a stateful function that holds intermediary data and emits the whole sentence when a dot is encountered. There is also the issue of EOF - you'll need a sentinel value for EOF in the source stream so that the function can react to it.
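A rough sketch of that idea, assuming sentences end at a plain dot (as in the question) and that the stream is consumed sequentially and only once. The class name, the EOF sentinel object and the sentences method are all hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StatefulFlatMapSentences {
    // Hypothetical sentinel appended to the source; compared by identity,
    // so it cannot collide with a real input line.
    private static final String EOF = new String("<EOF>");

    static Stream<String> sentences(Stream<String> lines) {
        // Mutable state captured by the lambda: this is the "hack" part and
        // makes the pipeline unsafe for parallel use.
        StringBuilder pending = new StringBuilder();
        return Stream.concat(lines, Stream.of(EOF)).flatMap(line -> {
            List<String> out = new ArrayList<>();
            if (line == EOF) {
                // End of input: emit whatever fragment is left over.
                String rest = pending.toString().trim();
                if (!rest.isEmpty()) out.add(rest);
                pending.setLength(0);
                return out.stream();
            }
            if (pending.length() > 0) pending.append(' ');
            pending.append(line.trim());
            int dot;
            // Emit every complete sentence currently in the buffer.
            while ((dot = pending.indexOf(".")) >= 0) {
                out.add(pending.substring(0, dot + 1).trim());
                pending.delete(0, dot + 1);
            }
            return out.stream();
        });
    }

    public static void main(String[] args) {
        List<String> result = sentences(
                Stream.of("This is the", "first sentence. This is the", "second sentence."))
            .collect(Collectors.toList());
        result.forEach(System.out::println);
    }
}
```

The sentinel is what lets the function flush a trailing fragment that does not end with a dot.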

My StreamEx library has a collapse method which is designed to solve such tasks. First let's change your regexp to look-behind one, to leave the ending dots, so we can later use them:
StreamEx.of(input).flatMap(Pattern.compile("(?<=\\.)")::splitAsStream)
Here the input can be an array, a list, a JDK stream, or just comma-separated strings.
Next we collapse two strings if the first one does not end with dot. The merging function should join both parts into single string adding a space between them:
.collapse((a, b) -> !a.endsWith("."), (a, b) -> a + ' ' + b)
Finally we should trim the leading and trailing spaces if any:
.map(String::trim);
The whole code is here:
List<String> lines = Arrays.asList("This is the", "first sentence. This is the",
"second sentence. Third sentence. Fourth", "sentence. Fifth sentence.", "The last");
Stream<String> stream = StreamEx.of(lines)
.flatMap(Pattern.compile("(?<=\\.)")::splitAsStream)
.collapse((a, b) -> !a.endsWith("."), (a, b) -> a + ' ' + b)
.map(String::trim);
stream.forEach(System.out::println);
The output is the following:
This is the first sentence.
This is the second sentence.
Third sentence.
Fourth sentence.
Fifth sentence.
The last
Update: since StreamEx 0.3.4 version you can safely do the same with parallel stream.

Related

Java 8 Streams Remove Duplicate Letter

I'm trying to apply my knowledge of streams to some leetcode algorithm questions. Here is a general summary of the question:
Given a string which contains only lowercase letters, remove duplicate
letters so that every letter appears once and only once. You must make
sure your result is the smallest in lexicographical order among all
possible results.
Example:
Input: "bcabc"
Output: "abc"
Another example:
Input: "cbacdcbc"
Output: "acdb"
This seemed like a simple problem, just stream the values into a new list from the string, sort the values, find the distinct values, and then throw it back into a list, and append the list's value to a string. Here is what I came up with:
public String removeDuplicateLetters(String s)
{
char[] c = s.toCharArray();
List<Character> list = new ArrayList<>();
for(char ch : c)
{
list.add(ch);
}
List<Character> newVal = list.stream().distinct().collect(Collectors.toList());
String newStr = "";
for(char ch : newVal)
{
newStr += ch;
}
return newStr;
}
The first example is working perfectly, but instead of "acdb" for the second output, I'm getting "abcd". Why would abcd not be the least lexicographical order? Thanks!
As I had pointed out in the comments using a LinkedHashSet would be best here, but for the Streams practice you could do this:
public static String removeDuplicateLetters(String s) {
return s.chars().sorted().distinct().collect(
StringBuilder::new,
StringBuilder::appendCodePoint,
StringBuilder::append
).toString();
}
Note: distinct() comes after sorted() in order to optimize the stream, see Holger's explanation in the comments as well as this answer.
Lots of different things here, so I'll number them:
1. You can stream the characters of a String using String#chars() instead of making a List where you add all the characters.
2. To ensure that the resulting string is smallest in lexicographical order, we can sort the IntStream.
3. We can convert the IntStream back to a String by performing a mutable reduction with a StringBuilder. We then convert this StringBuilder to our desired string.
A mutable reduction is the Stream way of doing the equivalent of something like:
for (char ch : newVal) {
newStr += ch;
}
However, this has the added benefit of using a StringBuilder for concatenation instead of a String. See this answer as to why this is more performant.
For the actual question you have about the conflict of expected vs. observed output: I believe abcd is the right answer for the second output, since it is the smallest in lexicographical order.
public static void main(String[] args) {
String string = "cbacdcbc";
string.chars()
.mapToObj(item -> (char) item)
.collect(Collectors.toSet()).forEach(System.out::print);
}
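The output is abcd (note that a Set built by Collectors.toSet() does not guarantee iteration order, so this ordering is incidental). Hope this helps!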

Java 8 Streams - how to compare elements?

I want to find anagrams in a .txt file using Java Streams. Here is what I have:
try (InputStream is = new URL("http://wiki.puzzlers.org/pub/wordlists/unixdict.txt").openConnection().getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
Stream<String> stream = reader.lines()) {
And the method for anagrams:
public boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.replaceAll("[\\s]", "").toCharArray();
char[] word2 = secondWord.replaceAll("[\\s]", "").toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}
How to check if a word in unixdict.txt is anagram using Java 8 Stream? Is there any way to compare one word to all words in the stream?
When you want to find all anagrams, it’s not recommended to try to compare one word with all other words, as you’ll end up comparing every word with every other word, which is known as quadratic time complexity. For processing 1,000 words, you would need one million comparisons, for processing 100,000 words, you would need 10,000,000,000 comparisons and so on.
You may change your isAnagram method to provide a lookup key for data structures like HashMap:
static CharBuffer getAnagramKey(String s) {
char[] word1 = s.replaceAll("[\\s]", "").toCharArray();
Arrays.sort(word1);
return CharBuffer.wrap(word1);
}
The class CharBuffer wraps a char[] array and provides the necessary equals and hashCode methods without copying the array contents, which makes it preferable to constructing a new String.
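A small sketch of why that matters for map keys: plain char[] arrays compare by identity, while CharBuffer compares by content. The class name AnagramKeyDemo and the sample words are made up for this illustration.

```java
import java.nio.CharBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class AnagramKeyDemo {
    static CharBuffer getAnagramKey(String s) {
        char[] chars = s.replaceAll("\\s", "").toCharArray();
        Arrays.sort(chars);
        return CharBuffer.wrap(chars);
    }

    public static void main(String[] args) {
        // Arrays inherit Object.equals, so equal content is not enough.
        System.out.println("listen".toCharArray().equals("listen".toCharArray()));
        // CharBuffer.equals and hashCode look at the wrapped characters.
        System.out.println(getAnagramKey("listen").equals(getAnagramKey("silent")));

        Map<CharBuffer, String> seen = new HashMap<>();
        seen.put(getAnagramKey("listen"), "listen");
        // Looking up with the key of an anagram finds the earlier word.
        System.out.println(seen.get(getAnagramKey("silent")));
    }
}
```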
As a side note, .replaceAll("[\\s]", "") could be simplified to .replaceAll("\\s", ""), both would eliminate all space characters, but the example input of your question has no space characters at all. To remove all non-word characters like apostrophes and ampersands, you could use s.replaceAll("\\W", "").
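To make the difference between the two patterns concrete, here is a tiny sketch; the class name, method names and sample strings are invented for illustration.

```java
class WhitespaceRegexDemo {
    // Removes whitespace only; "[\\s]" and "\\s" behave identically here.
    static String stripSpaces(String s) {
        return s.replaceAll("\\s", "");
    }

    // Removes everything that is not a letter, digit or underscore.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        System.out.println(stripSpaces("a b\tc"));
        System.out.println(stripNonWord("don't & won't"));
    }
}
```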
Then, you may process all words to find anagrams in a single linear pass like
URL srcURL = new URL("http://wiki.puzzlers.org/pub/wordlists/unixdict.txt");
try(InputStream is = srcURL.openStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
Stream<String> stream = reader.lines()) {
stream.collect(Collectors.groupingBy(s -> getAnagramKey(s)))
.values().stream()
.filter(l -> l.size() > 1)
.forEach(System.out::println);
}
With this solution, the printing likely becomes the more expensive part for larger word lists. So you might change the stream’s operation, e.g. the following prints the top ten of anagram combinations:
stream.collect(Collectors.groupingBy(s -> getAnagramKey(s)))
.values().stream()
.filter(l -> l.size() > 1)
.sorted(Collections.reverseOrder(Comparator.comparingInt(List::size)))
.limit(10)
.forEach(System.out::println);
This works. I first did all the sorts in the stream but this is much more efficient.
InputStream is = new URL("http://wiki.puzzlers.org/pub/wordlists/unixdict.txt")
.openConnection().getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
String word = "germany";
final String sword = sortedWord(word);
reader.lines().filter(w -> sortedWord(w).compareTo(sword) == 0).forEach(
System.out::println);
static String sortedWord(String w) {
char[] chs = w.toCharArray();
Arrays.sort(chs);
return String.valueOf(chs);
}
A possible improvement would be to filter the lengths of the words first. And you might want to try this word list as it has more words in it.
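The length pre-filter could look roughly like this. It is a sketch over an in-memory stream rather than the remote word list, and the class name LengthFilterDemo and method anagramsOf are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LengthFilterDemo {
    static String sortedWord(String w) {
        char[] chs = w.toCharArray();
        Arrays.sort(chs);
        return String.valueOf(chs);
    }

    static List<String> anagramsOf(String word, Stream<String> words) {
        final String sword = sortedWord(word);
        return words
            // Cheap check first: words of a different length can never match,
            // so they skip the sort entirely.
            .filter(w -> w.length() == word.length())
            .filter(w -> sortedWord(w).equals(sword))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(anagramsOf("listen",
            Stream.of("silent", "enlist", "google", "tinsel", "list")));
    }
}
```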
I think your best option might be to use the multimap collector to convert the stream into a Guava multimap using the sorted version of the string as the key to the map. See Cleanest way to create a guava MultiMap from a java8 stream for an example of how to do this. If you only want the resulting sets of anagrams, you could then use
multimap.asMap().entrySet().stream()... to filter and collect the results per your needs.

How to write JUnit test case for finding first non-repeating character in stream?

Since users are focusing more on minor loopholes than on the requirement, I am giving the actual working code (replacing the earlier version) for which I need a JUnit test case.
import java.util.*;
public class FirstNonRepeatingCharacterStream {
List<Character> chars = new ArrayList<>();
boolean[] repeated = new boolean[256];
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
FirstNonRepeatingCharacterStream tester =
new FirstNonRepeatingCharacterStream();
while (true) {
Character ch = new Character(sc.next().charAt(0));
Character output = tester.firstNonRepeatingCharStream1(ch);
System.out.println(output);
}
}
public Character firstNonRepeatingCharStream1(Character x) {
if (x == null) {
return x;
}
Character output = null;
if (!repeated[x]) {
if (!chars.contains(x)) {
chars.add(x);
} else {
chars.remove(new Character(x));
repeated[x] = true;
}
}
if (chars.size() != 0) {
output = new Character(chars.get(0));
}
return output;
}
}
User enters one character at a time.
input a -> output a
input b -> that means input ab as its stream -> output a
input a -> that means input aba as its stream -> output b
input c -> that means input abac as its stream -> output b
input b -> that means input abacb as its stream -> output c
input a -> that means input abacba as its stream -> output c
input d -> that means input abacbad as its stream -> output c
Please let me know how to write unit test which should comply with main method. Not necessary to have while loop in junit test case.
thanks in advance.
This sounds like it would mostly boil down to coming up with "mean" test strings to try and hit various edge cases in your code:
String testStrings[] = {
null,
"", // empty string
"a", // easiest possible string with a match
"aa", // easiest possible string with no match
"aba", // slightly less easy string with a match
"aaaaa", // no match on N instances of a character
"aaaaab", // match on end of N instances of a character
"baaaaa", // match at beginning of N instances of a character
"aabaaa", // match in the middle of N instances of a character
"abcdefghijklmnopqrstuvwxyzyxwvutsrqponmlkjihgfedcba", // harder string where the unique letter is in the middle (z)
"abcdefghijklmnopqrstuvwxyzzyxwvutsrqponmlkjihgfedcb", // harder string where the unique character is at the front (a)
"bcdefghijklmnopqrstuvwxyzzyxwvutsrqponmlkjihgfedcba", // harder string where the unique character is at the back
"abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxy" // other harder string... etc.
};
You then develop a similar array of expected output, and then you blaze through all of your cases in a single for(int i = 0;...) loop.
Another way of developing test input would be to generate test strings algorithmically, such that all the characters occur in pairs in one of three strategies (aabbcc, abccba, abcabc), then poke out the first character, then the last character, then one in the middle (you check which character you're removing first and that becomes your test value).
Lots of different ways to skin this cat.
PS: Your current code will break with null or any character whose "integer" value is greater than 255. Lots of kooky Unicode characters out there... I seem to recall hearing that several dead and fictional languages are now in Unicode (Ancient Egyptian, J.R.R. Tolkien's Elvish, Klingon, etc.), to say nothing of the CJKV range (Chinese, Japanese, Korean, Vietnamese). Lots of "code points" out there beyond 255.
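Both points can be sketched together: a variant that tracks characters in sets instead of a boolean[256], checked by a simple table of inputs and expected outputs. This is an illustrative rewrite, not the asker's class; the names FirstNonRepeatingSketch, next and afterFeeding are made up, and the table is exercised with a plain loop rather than JUnit.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

class FirstNonRepeatingSketch {
    // Characters seen exactly once so far, in order of first appearance.
    private final Set<Character> candidates = new LinkedHashSet<>();
    // Characters seen more than once; no fixed-size array, so any char works.
    private final Set<Character> repeated = new HashSet<>();

    Character next(char x) {
        if (!repeated.contains(x)) {
            if (!candidates.add(x)) {
                // Second sighting: it is no longer a candidate.
                candidates.remove(x);
                repeated.add(x);
            }
        }
        return candidates.isEmpty() ? null : candidates.iterator().next();
    }

    // Feeds a whole string and returns the answer after the last character.
    static Character afterFeeding(String input) {
        FirstNonRepeatingSketch s = new FirstNonRepeatingSketch();
        Character result = null;
        for (char c : input.toCharArray()) result = s.next(c);
        return result;
    }

    public static void main(String[] args) {
        String[] inputs = {"", "a", "aa", "aba", "aaaaab", "abacbad", "\u4e2d\u6587\u4e2d"};
        Character[] expected = {null, 'a', null, 'b', 'b', 'c', '\u6587'};
        for (int i = 0; i < inputs.length; i++) {
            Character actual = afterFeeding(inputs[i]);
            if (!Objects.equals(expected[i], actual)) {
                throw new AssertionError("case " + i + ": expected " + expected[i] + " but got " + actual);
            }
        }
        System.out.println("all cases passed");
    }
}
```

The last table entry uses characters well above 255, which the array-based version cannot index.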
I solved it myself.
class FirstNonRepeatingCharacterStreamTest3 {
private static FirstNonRepeatingCharacterStream tester;
@BeforeAll
static void setUpBeforeClass() throws Exception {
tester = new FirstNonRepeatingCharacterStream();
}
@DisplayName("MyTest")
@ParameterizedTest(name = "{index} => input={0}, output= {1}")
@CsvSource({
"'a', 'a'",
"'b', 'a'",
"'a', 'b'",
"'c', 'b'",
"'b', 'c'",
"'a', 'c'",
"'d', 'c'"
})
public void testFirstNonRepeatingCharStream101(Character input, Character output) {
Character actual = tester.firstNonRepeatingCharStream1(input);
assertEquals(output, actual);
}
}

convert first character of string to uppercase using java 8 lambdas only

I want to create a basic program of converting first character of string to uppercase through lambdas
Input
singhakash
Output
Singhakash
I tried
String st = "singhakash";
//approach 1
System.out.print(st.substring(0, 1).toUpperCase());
st.substring(1).codePoints()
.forEach(e -> System.out.print((char) e));
System.out.println();
//approach 2
System.out.print(st.substring(0, 1).toUpperCase());
IntStream.range(0, st.length())
.filter(i -> i > 0)
.mapToObj(st::charAt)
.forEach(System.out::print);
But in both cases I have to print the first character separately. Is there any way I can do that without having a separate print statement?
Note: I can do that normally by loop or any other approach but I am looking for lambdas only solution.
Thanks
You could do it like this:
String st = "singhakash";
IntStream.range(0, st.length())
.mapToObj(i -> i == 0 ? Character.toUpperCase(st.charAt(i)) : st.charAt(i))
.forEach(System.out::print);
The simplest way to do it would be
String result = Character.toUpperCase(st.charAt(0))+st.substring(1);
If you feel like you have to optimize it, i.e. reduce the number of copying operations (instead of letting the JVM do it), you may use:
StringBuilder sb=new StringBuilder(st);
sb.setCharAt(0, Character.toUpperCase(sb.charAt(0)));
String result=sb.toString();
But if it really has to be done using the fancy new Java 8 feature, you can use
String result=IntStream.concat(
IntStream.of(st.codePointAt(0)).map(Character::toUpperCase), st.codePoints().skip(1) )
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
.toString();
This solution will even handle supplementary code points correctly, so it even has an advantage over the simple solutions (though it would not be too hard to make these supplementary code point aware too).
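To illustrate the supplementary code point remark, here is a sketch comparing the charAt-based approach with the codePoints-based one. The class and method names are invented, and the sample input uses U+10428 (a lowercase Deseret letter, chosen only because it lies outside the Basic Multilingual Plane and has an uppercase form).

```java
import java.util.stream.IntStream;

class SupplementaryDemo {
    // Simple approach: operates on the first char, i.e. a lone surrogate
    // when the first code point is supplementary.
    static String capitalizeSimple(String st) {
        return Character.toUpperCase(st.charAt(0)) + st.substring(1);
    }

    // Stream approach from the answer: operates on whole code points.
    static String capitalizeCodePoints(String st) {
        return IntStream.concat(
                IntStream.of(st.codePointAt(0)).map(Character::toUpperCase),
                st.codePoints().skip(1))
            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
            .toString();
    }

    public static void main(String[] args) {
        String input = new String(Character.toChars(0x10428)) + "bc";
        // The simple version leaves the string unchanged for this input.
        System.out.println(capitalizeSimple(input).equals(input));
        // The code point version upper-cases the first letter.
        System.out.println(Integer.toHexString(capitalizeCodePoints(input).codePointAt(0)));
        System.out.println(capitalizeCodePoints("singhakash"));
    }
}
```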
If you want to print directly, you can use
IntStream.concat(
IntStream.of(st.codePointAt(0)).map(Character::toUpperCase), st.codePoints().skip(1))
.forEach(cp -> System.out.print(Character.toChars(cp)));
String is immutable in Java. Just uppercase the first character, and append the rest. Something like,
System.out.println(Character.toUpperCase(st.charAt(0)) + st.substring(1));
st.replaceFirst(st.subSequence(0,1).toString(), st.subSequence(0,1).toString().toUpperCase()).codePoints().forEach(e -> System.out.print((char)e));

For Loop Replacement: For Loops to Filters

I'm working on an assignment for a Computer Science III class (Java programming), and in it we have to encode a file based on Huffman coding.
import java.util.Scanner;
import java.util.ArrayList;
import java.io.*;
import java.util.Collections;
import java.util.StringTokenizer;
public class Client {
public static void main(String[] args) throws IOException {
// TODO code application logic here
Scanner in = new Scanner(System.in);
System.out.println("Enter a filename to read from.");
String filename = in.nextLine();
File file = new File(filename);
Scanner inputFile = new Scanner(file);
String line, word;
StringTokenizer token;
ArrayList<Character> chars = new ArrayList<>();
while(inputFile.hasNext()){
line = inputFile.nextLine();
ArrayList<Character> lineChar = new ArrayList<>();
for (int i=0; i<line.length(); i++){
if (line.charAt(i)!=' '){
lineChar.add(line.charAt(i));
}
}
chars.addAll(lineChar);
}
ArrayList<Character> prob = new ArrayList<Character>();
for (int i=0; i<chars.size(); i++){
if (!prob.contains(chars.get(i))){
prob.add(chars.get(i));
}
}
for (int i=0; i<prob.size(); i++){
System.out.print("Frequency of " + prob.get(i));
System.out.println(": " + ((double)Collections.frequency(chars, prob.get(i)))/chars.size());
}
I was working on it in my NetBeans IDE and followed some suggestions. It changed the last two for loops to:
chars.stream().filter((char1) -> (!prob.contains(char1))).forEach((char1) -> {
prob.add(char1);
});
prob.stream().map((prob1) -> {
System.out.print("Frequency of " + prob1);
return prob1;
}).forEach((prob1) -> {
System.out.println(": " + ((double) Collections.frequency(chars, prob1)) / chars.size());
});
I am really, really, really intrigued by this, but I find it difficult to trace everything. It obviously operates in the same way as my for loops and after testing I see that it -does- work, but I want to understand why and how. Can anybody give me any insight?
Your IDE replaced some of your code with new Java 8 features - Streams and lambda expressions. You should read about them.
Streams allow you to perform operations on a collection in a pipeline, where only the final (terminal) operation does the actual iteration over the elements (for only as many elements as it requires).
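A small sketch of that laziness, counting how many elements a filter actually examines. The class name, method names and sample numbers are made up for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyStreamDemo {
    // Builds a pipeline but never runs a terminal operation on it.
    static int examinedWithoutTerminal() {
        AtomicInteger examined = new AtomicInteger();
        Stream<Integer> pipeline = Stream.of(1, 2, 3, 4, 5)
            .filter(x -> { examined.incrementAndGet(); return x > 1; });
        return examined.get(); // the filter has not been invoked at all
    }

    // Runs findFirst, which stops as soon as one element passes the filter.
    static int examinedWithFindFirst() {
        AtomicInteger examined = new AtomicInteger();
        Stream.of(1, 2, 3, 4, 5)
            .filter(x -> { examined.incrementAndGet(); return x > 1; })
            .findFirst();
        return examined.get(); // only 1 and 2 were looked at
    }

    public static void main(String[] args) {
        System.out.println(examinedWithoutTerminal());
        System.out.println(examinedWithFindFirst());
    }
}
```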
Lambda expressions allow you to write less code when passing anonymous class instances implementing functional interfaces (= interfaces with a single abstract method) to methods.
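As a sketch of that equivalence, the same Predicate written once as an anonymous class and once as a lambda. The class name LambdaDemo and the "not a space" condition are chosen only for illustration (the condition mirrors the space check in the question's loop).

```java
import java.util.function.Predicate;

class LambdaDemo {
    // Pre-Java 8 style: an anonymous class implementing the single abstract method.
    static final Predicate<Character> ANONYMOUS = new Predicate<Character>() {
        @Override
        public boolean test(Character c) {
            return c != ' ';
        }
    };

    // Java 8 style: the lambda body becomes the implementation of test().
    static final Predicate<Character> LAMBDA = c -> c != ' ';

    public static void main(String[] args) {
        System.out.println(ANONYMOUS.test('x') == LAMBDA.test('x'));
        System.out.println(ANONYMOUS.test(' ') == LAMBDA.test(' '));
    }
}
```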
Here's an attempt to explain what the new code does :
chars.stream() // creates a Stream<Character> from your chars List
.filter((char1) -> (!prob.contains(char1))) // keeps only Characters not contained
// in prob List
.forEach((char1) -> {prob.add(char1);}); // iterates over all the elements of
// the Stream (i.e. those that weren't
// filtered out) and adds them to prob
prob.stream() // creates a Stream<Character> of the prob List
.map((prob1) -> {
System.out.print("Frequency of " + prob1);
return prob1;
}) // prints "Frequency of " + character for the current Character in the Stream
.forEach((prob1) -> { // prints the frequency of each character in the Stream
System.out.println(": " + ((double) Collections.frequency(chars, prob1)) / chars.size());
});
The map operation on the second Stream is a bit strange. Usually map is used to convert a Stream of one type to a Stream of another type. Here it is used to print output and it returns the same Stream. I wouldn't use map for that. You can simply move the printing to the forEach.
prob.stream() // creates a Stream<Character> of the prob List
.forEach((prob1) -> { // prints the frequency of each character in the Stream
System.out.print("Frequency of " + prob1);
System.out.println(": " + ((double) Collections.frequency(chars, prob1)) / chars.size());
});
Actually, you don't need a Stream for that, since Collections also have a forEach method in Java 8 :
prob.forEach((prob1) -> { // prints the frequency of each character in the Stream
System.out.print("Frequency of " + prob1);
System.out.println(": " + ((double) Collections.frequency(chars, prob1)) / chars.size());
});
NetBeans did what it could to refactor your code to use Java 8 streams, but it can actually be done much better. For example, it appears that prob is supposed to contain a distinct list of Characters. In Java 8, you can do it like this:
List<Character> prob = chars.stream()
.distinct()
.collect(Collectors.toList());
But all you are using prob for is to then calculate how many times each Character appears in chars. With streams, you can do it without first making a prob list:
Map<Character, Long> freq = chars.stream()
.collect(
Collectors.groupingBy(
x->x,
Collectors.counting()
)
);
The static methods in the Collectors class are usually just imported statically, so the above would be written as:
Map<Character, Long> freq = chars.stream()
.collect(groupingBy(x->x, counting()));
That means, take my stream of chars and make a map. The key of the map is the char itself (that's what x->x does) and the value of the map is the count of how many times that char occurs in chars.
But that's not all! The first half of your method goes over the lines of the file and collects the chars. That can be rewritten with streams as well:
Stream<Character> charStream = Files.lines(Paths.get(filename))
.flatMap(line -> line.chars().mapToObj(i->(char) i));
Files.lines(..) gives us a stream of lines. The flatMap part is a bit cryptic, but it unrolls every string into a stream of individual chars and joins the streams so that we have one big stream of chars.
And now we put it all together:
public static void main(String[] args) throws IOException {
Scanner in = new Scanner(System.in);
System.out.println("Enter a filename to read from.");
String filename = in.nextLine();
Map<Character, Long> freq = Files.lines(Paths.get(filename))
.flatMap(line -> line.chars().mapToObj(i -> (char) i))
.collect(groupingBy(x -> x, counting()));
long total = freq.values().stream().mapToLong(x->x).sum();
freq.forEach((chr, count) ->
System.out.format("Frequency of %s: %s%n", chr, ((double) count) / total)
);
}
Edit:
To output frequencies in sorted order, do this (using import static java.util.Comparator.*):
freq.entrySet().stream()
.sorted(comparing(e->e.getValue(), reverseOrder()))
.forEach(e -> System.out.format("Frequency of %s: %s%n", e.getKey(), (double) e.getValue() / total));
We take the map of Character to count, stream its entries, sort them by values in reverse order and print each one out.
This to me looks like NetBeans refactored your code to use Java 8's lambda or functional programming operations using the map - reduce from the Stream interface.
For more information on map() / reduce()/ stream interface refer to this link
Please read the suggestions that the IDE provides before you apply them :)
First, you should read about the java.util.stream package, to get a first impression of how the API is designed and for what purposes.
Here is what your first loop does, in word form:
Iterate over the values from 0 to chars.size()-1 and add the corresponding element from chars to prob, but only if it's not already there.
With the Stream API added to Java with Java 8, such tasks can be written in a functional programming style which focuses on the "how is it done" not on the "with what is it done".
chars.stream()
.filter(char1 -> !prob.contains(char1))
.forEach(char1 -> {
prob.add(char1);
});
ArrayList implements Collection<T> and therefore the method stream().
This stream (all elements from the Collection in a pipeline) is being filtered (by your former if-statement)
On the stream of the remaining elements, execute the final operation prob.add
This might be a bit too much for now, but you can change the last operation (.forEach) to be even clearer:
//...
.forEach(prob::add);
For better insight or debugging purposes, you might find Stream#peek interesting.
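A minimal sketch of peek used for tracing, assuming a sequential stream. The class name PeekDemo, the method distinctWithTrace and the trace list are hypothetical; in practice you might print instead of collecting the messages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class PeekDemo {
    // Records what each stage sees while computing the distinct characters.
    static List<Character> distinctWithTrace(List<Character> chars, List<String> trace) {
        return chars.stream()
            .peek(c -> trace.add("saw " + c))    // every element entering the pipeline
            .distinct()
            .peek(c -> trace.add("kept " + c))   // only elements that survived distinct()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        System.out.println(distinctWithTrace(Arrays.asList('a', 'b', 'a'), trace));
        trace.forEach(System.out::println);
    }
}
```

Because peek only observes elements, it leaves the result of the pipeline unchanged.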
