I want to find anagrams in a .txt file using Java Streams. Here is what I have:
try (InputStream is = new URL("http://wiki.puzzlers.org/pub/wordlists/unixdict.txt").openConnection().getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
Stream<String> stream = reader.lines()) {
And the method for anagrams:
public boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.replaceAll("[\\s]", "").toCharArray();
char[] word2 = secondWord.replaceAll("[\\s]", "").toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}
How can I check whether a word in unixdict.txt is an anagram using a Java 8 Stream? Is there any way to compare one word to all the words in the stream?
When you want to find all anagrams, it’s not recommended to compare one word with all other words, as you’ll end up comparing every word with every other one, which is known as quadratic time complexity. For processing 1,000 words, you would need one million comparisons; for 100,000 words, you would need 10,000,000,000 comparisons, and so on.
You may change your isAnagram method to provide a lookup key for data structures like HashMap:
static CharBuffer getAnagramKey(String s) {
char[] word1 = s.replaceAll("[\\s]", "").toCharArray();
Arrays.sort(word1);
return CharBuffer.wrap(word1);
}
The class CharBuffer wraps a char[] array and provides the necessary equals and hashCode methods without copying the array contents, which makes it preferable to constructing a new String.
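A quick standalone check of that content-based equality (my own snippet, not from the original answer):

```java
import java.nio.CharBuffer;

public class CharBufferKeyDemo {
    public static void main(String[] args) {
        // Two distinct arrays with identical contents: the wrapped buffers
        // compare equal and hash alike, so they behave as proper HashMap keys
        // while sharing the underlying arrays instead of copying them.
        CharBuffer k1 = CharBuffer.wrap(new char[] {'a', 'c', 't'});
        CharBuffer k2 = CharBuffer.wrap(new char[] {'a', 'c', 't'});
        System.out.println(k1.equals(k2) && k1.hashCode() == k2.hashCode());  // true
    }
}
```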
As a side note, .replaceAll("[\\s]", "") could be simplified to .replaceAll("\\s", ""), both would eliminate all space characters, but the example input of your question has no space characters at all. To remove all non-word characters like apostrophes and ampersands, you could use s.replaceAll("\\W", "").
Then, you may process all words to find anagrams in a single linear pass like
URL srcURL = new URL("http://wiki.puzzlers.org/pub/wordlists/unixdict.txt");
try(InputStream is = srcURL.openStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
Stream<String> stream = reader.lines()) {
stream.collect(Collectors.groupingBy(s -> getAnagramKey(s)))
.values().stream()
.filter(l -> l.size() > 1)
.forEach(System.out::println);
}
With this solution, the printing likely becomes the more expensive part for larger word lists. So you might change the stream’s operation, e.g. the following prints the top ten of anagram combinations:
stream.collect(Collectors.groupingBy(s -> getAnagramKey(s)))
.values().stream()
.filter(l -> l.size() > 1)
.sorted(Collections.reverseOrder(Comparator.comparingInt(List::size)))
.limit(10)
.forEach(System.out::println);
This works. I first did all the sorts in the stream, but this is much more efficient.
InputStream is = new URL("http://wiki.puzzlers.org/pub/wordlists/unixdict.txt")
.openConnection().getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
String word = "germany";
final String sword = sortedWord(word);
reader.lines().filter(w -> sortedWord(w).compareTo(sword) == 0).forEach(
System.out::println);
static String sortedWord(String w) {
char[] chs = w.toCharArray();
Arrays.sort(chs);
return String.valueOf(chs);
}
A possible improvement would be to filter the lengths of the words first. And you might want to try this word list as it has more words in it.
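That length pre-filter could look like the sketch below; the inline word list (including the made-up entry "myanger") is just for illustration, since a real run would stream the word file as above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AnagramFilter {
    static String sortedWord(String w) {
        char[] chs = w.toCharArray();
        Arrays.sort(chs);
        return String.valueOf(chs);
    }

    public static void main(String[] args) {
        String word = "germany";
        String sword = sortedWord(word);
        // The cheap length comparison runs first, so the sort in sortedWord
        // is only paid for words that could possibly match.
        List<String> matches = Stream.of("meagerly", "germany", "gramme", "myanger")
                .filter(w -> w.length() == word.length())
                .filter(w -> sortedWord(w).equals(sword))
                .collect(Collectors.toList());
        System.out.println(matches);  // [germany, myanger]
    }
}
```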
I think your best option might be to use the multimap collector to convert the stream into a Guava multimap using the sorted version of the string as the key to the map. See Cleanest way to create a guava MultiMap from a java8 stream for an example of how to do this. If you only want the resulting sets of anagrams, you could then use
multimap.asMap().entrySet().stream()... to filter and collect the results per your needs.
I'm trying to apply my knowledge of streams to some leetcode algorithm questions. Here is a general summary of the question:
Given a string which contains only lowercase letters, remove duplicate
letters so that every letter appears once and only once. You must make
sure your result is the smallest in lexicographical order among all
possible results.
Example:
Input: "bcabc"
Output: "abc"
Another example:
Input: "cbacdcbc"
Output: "acdb"
This seemed like a simple problem: stream the values from the string into a new list, sort the values, find the distinct values, collect them back into a list, and append the list's values to a string. Here is what I came up with:
public String removeDuplicateLetters(String s)
{
char[] c = s.toCharArray();
List<Character> list = new ArrayList<>();
for(char ch : c)
{
list.add(ch);
}
List<Character> newVal = list.stream().distinct().collect(Collectors.toList());
String newStr = "";
for(char ch : newVal)
{
newStr += ch;
}
return newStr;
}
The first example is working perfectly, but instead of "acdb" for the second output, I'm getting "abcd". Why would abcd not be the least lexicographical order? Thanks!
As I had pointed out in the comments, using a LinkedHashSet would be best here, but for the Streams practice you could do this:
public static String removeDuplicateLetters(String s) {
return s.chars().sorted().distinct().collect(
StringBuilder::new,
StringBuilder::appendCodePoint,
StringBuilder::append
).toString();
}
Note: distinct() comes after sorted() in order to optimize the stream; see Holger's explanation in the comments as well as this answer.
There are a lot of different things here, so I'll number them:
You can stream the characters of a String using String#chars() instead of making a List where you add all the characters.
To ensure that the resulting string is smallest in lexicographical order, we can sort the IntStream.
We can convert the IntStream back to a String by performing a mutable reduction with a StringBuilder. We then convert this StringBuilder to our desired string.
A mutable reduction is the Stream way of doing the equivalent of something like:
for (char ch : newVal) {
newStr += ch;
}
However, this has the added benefit of using a StringBuilder for concatenation instead of a String. See this answer as to why this is more performant.
For the actual question you have about the conflict of expected vs. observed output: I believe abcd is the right answer for the second output, since it is the smallest in lexicographical order.
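For comparison, the LinkedHashSet approach mentioned at the start of this answer deduplicates while keeping first-encounter order instead of sorting — a minimal sketch:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DedupKeepOrder {
    // LinkedHashSet remembers insertion order, so duplicates are dropped
    // while the first occurrence of each letter keeps its position.
    static String dedup(String s) {
        Set<Character> seen = new LinkedHashSet<>();
        for (char ch : s.toCharArray()) {
            seen.add(ch);  // add() is a no-op for letters already present
        }
        StringBuilder sb = new StringBuilder();
        for (char ch : seen) {
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dedup("cbacdcbc"));  // cbad
    }
}
```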
public static void main(String[] args) {
String string = "cbacdcbc";
string.chars()
.mapToObj(item -> (char) item)
.collect(Collectors.toSet()).forEach(System.out::print);
}
The output is abcd. Hope this helps!
I’d like to use Java 8 streams to take a stream of strings (for example read from a plain text file) and produce a stream of sentences. I assume sentences can cross line boundaries.
So for example, I want to go from:
"This is the", "first sentence. This is the", "second sentence."
to:
"This is the first sentence.", "This is the second sentence."
I can see that it’s possible to get a stream of parts of sentences as follows:
Pattern p = Pattern.compile("\\.");
Stream<String> lines
= Stream.of("This is the", "first sentence. This is the", "second sentence.");
Stream<String> result = lines.flatMap(s -> p.splitAsStream(s));
But then I’m not sure how to produce a stream to join the fragments into sentences. I want to do this in a lazy way so that only what is needed from the original stream is read. Any ideas?
Breaking text into sentences is not as easy as just looking for dots. E.g., you don’t want to split in between “Mr.Smith”…
Thankfully, there is already a JRE class which takes care of that: the BreakIterator. What it doesn’t have is Stream support, so in order to use it with streams, some support code around it is required:
public class SentenceStream extends Spliterators.AbstractSpliterator<String>
implements Consumer<CharSequence> {
public static Stream<String> sentences(Stream<? extends CharSequence> s) {
return StreamSupport.stream(new SentenceStream(s.spliterator()), false);
}
Spliterator<? extends CharSequence> source;
CharBuffer buffer;
BreakIterator iterator;
public SentenceStream(Spliterator<? extends CharSequence> source) {
super(Long.MAX_VALUE, ORDERED|NONNULL);
this.source = source;
iterator=BreakIterator.getSentenceInstance(Locale.ENGLISH);
buffer=CharBuffer.allocate(100);
buffer.flip();
}
@Override
public boolean tryAdvance(Consumer<? super String> action) {
for(;;) {
int next=iterator.next();
if(next!=BreakIterator.DONE && next!=buffer.limit()) {
action.accept(buffer.subSequence(0, next-buffer.position()).toString());
buffer.position(next);
return true;
}
if(!source.tryAdvance(this)) {
if(buffer.hasRemaining()) {
action.accept(buffer.toString());
buffer.position(0).limit(0);
return true;
}
return false;
}
iterator.setText(buffer.toString());
}
}
@Override
public void accept(CharSequence t) {
buffer.compact();
if(buffer.remaining()<t.length()) {
CharBuffer bigger=CharBuffer.allocate(
Math.max(buffer.capacity()*2, buffer.position()+t.length()));
buffer.flip();
bigger.put(buffer);
buffer=bigger;
}
buffer.append(t).flip();
}
}
With that support class, you can simply say, e.g.:
Stream<String> lines = Stream.of(
"This is the ", "first sentence. This is the ", "second sentence.");
sentences(lines).forEachOrdered(System.out::println);
This is a sequential, stateful problem, which the designers of Stream are not too fond of.
In a more general sense, you are implementing a lexer, which converts a sequence of tokens to a sequence of another type of tokens. While you might use Stream to solve it with tricks and hacks, there is really no reason to. Just because Stream is there doesn't mean we have to use it for everything.
That being said, an answer to your question is to use flatMap() with a stateful function that holds intermediary data and emits the whole sentence when a dot is encountered. There is also the issue of EOF - you'll need a sentinel value for EOF in the source stream so that the function can react to it.
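Here is a hedged sketch of that flatMap-with-state idea (the sentinel value and the joinSentences helper are my own names, not an established API, and the stateful lambda is only safe on a sequential stream):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SentenceFlatMap {
    private static final String EOF = "\u0000";  // sentinel; assumes real lines never contain it

    static List<String> joinSentences(Stream<String> lines) {
        StringBuilder pending = new StringBuilder();  // mutable state captured by the lambda
        return Stream.concat(lines, Stream.of(EOF))
            .flatMap(line -> {
                if (line.equals(EOF)) {               // flush the unfinished tail at end of input
                    return pending.length() == 0
                        ? Stream.<String>empty()
                        : Stream.of(pending.toString());
                }
                List<String> complete = new ArrayList<>();
                for (String part : line.split("(?<=\\.)")) {  // split but keep the dots
                    String p = part.trim();
                    if (p.isEmpty()) continue;
                    if (pending.length() > 0) pending.append(' ');
                    pending.append(p);
                    if (p.endsWith(".")) {            // a dot completes the pending sentence
                        complete.add(pending.toString());
                        pending.setLength(0);
                    }
                }
                return complete.stream();
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        joinSentences(Stream.of("This is the", "first sentence. This is the",
                                "second sentence."))
            .forEach(System.out::println);
    }
}
```

Running main prints the two joined sentences from the question.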
My StreamEx library has a collapse method which is designed to solve such tasks. First, let’s change your regexp to a look-behind one, to leave the ending dots in place so we can use them later:
StreamEx.of(input).flatMap(Pattern.compile("(?<=\\.)")::splitAsStream)
Here the input can be an array, a list, a JDK stream, or just comma-separated strings.
Next, we collapse two strings if the first one does not end with a dot. The merging function should join both parts into a single string, adding a space between them:
.collapse((a, b) -> !a.endsWith("."), (a, b) -> a + ' ' + b)
Finally we should trim the leading and trailing spaces if any:
.map(String::trim);
The whole code is here:
List<String> lines = Arrays.asList("This is the", "first sentence. This is the",
"second sentence. Third sentence. Fourth", "sentence. Fifth sentence.", "The last");
Stream<String> stream = StreamEx.of(lines)
.flatMap(Pattern.compile("(?<=\\.)")::splitAsStream)
.collapse((a, b) -> !a.endsWith("."), (a, b) -> a + ' ' + b)
.map(String::trim);
stream.forEach(System.out::println);
The output is the following:
This is the first sentence.
This is the second sentence.
Third sentence.
Fourth sentence.
Fifth sentence.
The last
Update: since StreamEx 0.3.4 version you can safely do the same with parallel stream.
So basically I'm trying to take two text files (one with many jumbled words and one with many dictionary words) and convert them into two separate arrays.
Following that, I need to compare the jumbled strings from the first array and match each one up to its dictionary counterpart in the second array (e.g. aannab in the first array to banana in the second array).
I know how to set one array from a string; however, I don't know how to do two from two separate text files.
Use a HashMap for matching, where the first text file's data becomes the map's keys and the second text file's data becomes the values. Then, given a key, you can look up its matching value.
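A minimal sketch of that idea; one assumption on my part is that, since a jumble's letters can be in any order, the sorted letters serve as the canonical key rather than the raw jumble (the arrays stand in for the two files' contents):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class JumbleMatcher {
    // Canonical key: the word's letters in sorted order ("banana" -> "aaabnn")
    static String key(String w) {
        char[] c = w.toLowerCase().toCharArray();
        Arrays.sort(c);
        return String.valueOf(c);
    }

    public static void main(String[] args) {
        String[] jumbles = {"aannab", "dgo"};             // stand-in for file 1
        String[] dictionary = {"banana", "dog", "cat"};   // stand-in for file 2

        Map<String, String> bySortedLetters = new HashMap<>();
        for (String word : dictionary) {
            bySortedLetters.put(key(word), word);  // key: sorted letters, value: word
        }
        for (String jumble : jumbles) {
            System.out.println(jumble + " -> " + bySortedLetters.get(key(jumble)));
        }
    }
}
```

This prints aannab -> banana and dgo -> dog, one lookup per jumble instead of a scan of the whole dictionary.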
You can read each file into an array like this:
static String[] readFile(String filename) throws IOException {
    List<String> stringList = new ArrayList<>();
    // try-with-resources closes the reader even if an exception is thrown
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(filename)))) {
        String line;
        while ((line = br.readLine()) != null) {
            stringList.add(line);
        }
    }
    return stringList.toArray(new String[stringList.size()]);
}
Next, try to do the matching:
String[] jumbles = readFile("jumbles.txt");
String[] dict = readFile("dict.txt");
for (String jumble : jumbles) {
for (String word : dict) {
// can only be a match if the same length
if (jumble.length() == word.length()) {
//next loop through each letter of jumble and see if it
//appears in word.
}
}
}
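One way to fill in that inner check (my own suggestion — the original leaves it as an exercise) is to tally the letters instead of scanning pairwise:

```java
public class JumbleCheck {
    // Two words contain the same letters iff all 26 tallies cancel out.
    // Assumes lowercase a-z input, matching the jumble examples above.
    static boolean sameLetters(String a, String b) {
        if (a.length() != b.length()) {
            return false;  // different lengths can never match
        }
        int[] diff = new int[26];
        for (int i = 0; i < a.length(); i++) {
            diff[a.charAt(i) - 'a']++;
            diff[b.charAt(i) - 'a']--;
        }
        for (int d : diff) {
            if (d != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameLetters("aannab", "banana"));  // true
        System.out.println(sameLetters("aannab", "bananas")); // false
    }
}
```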
I know how to set one array from a string, however I don't know how to do two from two separate text files
I would encourage you to divide your problem into don't-knows and knows.
Search the don't-knows on the internet; you will find lots of ways to handle them.
Then search for what you know, to explore whether it can be done in a better way.
To help you here,
Your don't-knows:
Reading a file in Java.
Processing the content of the read file.
Your knowns:
String-to-array conversion (search whether there are better ways for your use case).
Combine both :-)
I am reading a newline-separated text file into a String-array.
Since I know the delimiter will always be \n, I should be able to append each word to a StringBuilder, then split it using the delimiter.
Simply put, which method should I use and why?
Method A:
1. Create an ArrayList (or another more suited Collection)
2. Append each row to the list
3. Return list.toArray()
Method B:
1. Create a StringBuilder
2. Append each row to the builder
3. Return builder.split("\n")
Not sure it makes much of a difference; the toArray method is most likely faster, as there is less String processing. The split would have to process the entire data with a regex, while the toArray method just needs to loop over the Collection.
If you amend your method B so that you don't read the file line by line into the StringBuilder, but instead use Files.readAllBytes to get the entire file as a String and then split it, you will probably find the performance more or less identical.
If you have Java 8:
final Path path = /*some path*/
final String[] lines = Files.lines(path).toArray(String[]::new);
Note, your method A can be improved by using Files.readAllLines:
final String[] lines = Files.readAllLines(path, StandardCharsets.UTF_8).
toArray(new String[0]);
There's probably very little difference. I don't think you're working with very large files anyway, so it shouldn't matter. You can profile the different ways if you really are interested in it, but the choice you make is quite irrelevant.
I would go with the ArrayList way if it was my choice, since concatenation just for splitting afterwards seems redundant.
Wait, if you read a file in this format:
A
B
C
D
E
F
Why not just read it and save it at the same time?
Something like:
BufferedReader bufferedReader = new BufferedReader(new FileReader("test.txt"));
List<String> lines = new ArrayList<String>();
for (String line; (line = bufferedReader.readLine()) != null; )
{
lines.add(line);
}
System.out.println(lines);
And you will have [A, B, C, D, E, F] in your lines List.
I want to read in a list of words. Then I want alphabetize each of the characters within each word such that I have an entire list of words where each letter is alphabetized. For example if I wanted to read in "cat" "dog" "mouse" from a text file I would have [a,c,t], [d,g,o], and [e,m,o,s,u].
I'm implementing this in Java. I thought about a linked list or some other Collection but I'm not really sure how to implement those with respect to this. I know it's not as simple as converting each string to a char array or using array list. (I already tried those)
Does anyone have any suggestions or examples of doing this?
Basically, I'm just trying to get better with algorithms.
public class AnagramSolver1 {
    static List<String> inputList = new ArrayList<String>();

    public static void main(String[] args) throws IOException {
        List<String> dictionary = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader("src/dictionary.txt"));
        String line = null;
        while (null != (line = in.readLine())) {
            dictionary.add(line);
        }
        in.close();
        char[] word;
        for (int i = 0; i < dictionary.size(); i++) {
            word = dictionary.get(i).toCharArray();
            System.out.println(word);
        }
    }
}
If you have a String called word, you can obtain a sorted char[] of the characters in word via Arrays.sort
char[] chars = word.toCharArray();
Arrays.sort(chars);
I assume you would want to repeat this process for each member of a collection of words.
If you're interested in knowing what happens behind the scenes here, I would urge you to take a look at the source.
Java provides good support for sorting already: all you need is to convert your String to a char[] array, call Arrays.sort on it, and then convert that array back to a String.
If you want to have some fun with algorithms, however, you could try a linear counting sort: count the letters in the original, then go through the counts in alphabetical order and write out each character as many times as it was counted.
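A sketch of that counting sort, assuming input restricted to lowercase a–z:

```java
public class CountingSortLetters {
    // Linear-time letter sort: tally each letter, then emit the alphabet
    // in order, repeating each letter as many times as it was counted.
    static String sortLetters(String word) {
        int[] counts = new int[26];
        for (char ch : word.toCharArray()) {
            counts[ch - 'a']++;
        }
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < 26; i++) {
            for (int n = 0; n < counts[i]; n++) {
                sb.append((char) ('a' + i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortLetters("mouse"));  // emosu
    }
}
```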