Scala HashMap throwing key not found exception - java

I am very new to Scala, and would appreciate any help (have looked everywhere and spent the last 8 hours trying to figure this out)
Currently I have
def apply(file: String) : Iterator[String] = {
scala.io.Source.fromFile(file).getLines().map(_.toLowerCase)
}
As well as
def groupFreq[A,B](xs: Iterator[A], f: A => B): HashMap[B, Int] = {
var freqMap = new HashMap[B, Int]
for (x <- xs) freqMap = freqMap + ( f(x) -> ( freqMap.getOrElse( f(x) , 0 ) +1 ) )
freqMap
}
apply just takes a file of words that we pass in.
groupFreq takes xs: Iterator[A] and a grouping function f that converts A values to their B groups.
The function returns a HashMap that for each B group, counts the number of A values that fell into the group.
I use both of these functions, to help me with charFreq, a function that uses both apply and groupFreq to pass back a HashMap that counts how many times a Char appears throughout the entire file. If the char does not appear anywhere in the file, then there should be no mapping for it.
def charFreq(file: String): HashMap[Char, Int] =
{
var it = Iterator[Char]()
val words = apply(file)
for {
xs<-words
} yield { it = it ++ xs.toIterator }
val chars = it
val grouper = (x: Char) => x
groupFreq(chars, grouper)
}
My solution compiles and apply and groupFreq work as intended, but when I run charFreq, it says
charFreq threw an exception: java.util.NoSuchElementException: key not found: d
I believe I'm doing something wrong, most likely with my for loop and yield, but I've gone through the logic many times and I don't get why it doesn't work.
Google and StackOverflow have recommended flatMap, but I couldn't get that to work either.
Any help would be appreciated. Keep in mind this is a class assignment with the skeleton methods set up, so I cannot change the way apply and groupFreq and charFreq are set up, I can only manipulate the bodies which I have tried to do.

I can't reproduce your error with some random text files of strings. I suspect it occurred in an earlier iteration of groupFreq() w/o a getOrElse() type test.
However, when I run your code, I end up with an empty map from the call to charFreq(). You're correct that the loop/yield in charFreq() is problematic. It's easier to see when you put a val l = in front of the for and check the value in an IDE, which shows that l is of type Iterator[Unit]. Because an Iterator's map is lazy and nothing ever consumes l, the body that appends to it never runs, so it stays empty.
You don't need vars for the for loop. A Scala for is not the same as a C-style for loop; it is equivalent to calling flatMap/map over its elements (though others can express this much better than I can). The values you yield are collected into the result for you, so there is nothing to concatenate by hand.
Here are two ways to get an Iterator[Char] for your call to groupFreq():
1> Remove the unnecessary var it and fill chars directly with a for comprehension loop:
val chars = for {
xs<-words
i<-xs.toIterator
} yield { i }
2> call flatMap directly on the words val:
val chars = words.flatMap( s => s )

A. Regarding your problem, there is at least one issue I can spot in the code:
The way you build up an iterator (in charFreq) seems to be too heavy. words.toIterator would suffice.
The way you update the map also seems strange to me. I would rather do:
val mapped = f(x)
if (!(freqMap contains mapped)) freqMap(mapped) = 0
freqMap(mapped)+=1
B. As far as I understand, this problem can be solved with a one-liner (which is why Scala is so cool of course ;-) )
def charFreq(file:String) =
file.toCharArray.groupBy(m=>m).map(m => (m._1,m._2.size))
Explanation:
1) toCharArray converts your string into an array of Char elements
2) groupBy(m=>m) groups together all elements with the same values, the result will be of type Map[Char,Array[Char]], where every char is mapped to the array of all occurrences of that char in your string.
3) now all we need is to map each entry of the Map[Char,Array[Char]] to Map[Char,Int] by using the mapping map(m => (m._1,m._2.size)), which takes every element (key->value), leaves the key intact and transforms the value (an array) into the size of that array.
4) If your input string is going to be very large (I haven't evaluated that but if it's in the ballpark of MB I'd start to worry about that), then I would probably use another solution, with mutable map which I'd fill up while iterating over the source:
def charFreq(hugeFile:String) = {
//create a mutable map, which can be updated when needed
val mm = scala.collection.mutable.Map[Char,Int]()
//iterate over the string
for (m <- hugeFile) {
//ensure that our map contains the entry for the given character
if (! (mm contains m)) mm(m) = 0
mm(m) = mm(m)+1
}
//return the result as an immutable map
mm.toMap
}
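Since the question is also tagged java, here is a minimal Java sketch of the same counting idea, for comparison only. The class name CharFrequency and the choice to take the lines as an Iterable<String> are assumptions made for illustration; they are not part of the assignment skeleton. Map.merge plays the role of getOrElse(key, 0) + 1, so a character that never occurs simply has no mapping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the original assignment.
final class CharFrequency {
    // Counts how often each (lower-cased) character occurs across all lines.
    static Map<Character, Integer> charFreq(Iterable<String> lines) {
        Map<Character, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (char c : line.toLowerCase().toCharArray()) {
                // Insert 1 for a new key, otherwise add 1 to the existing count.
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Looking up a missing character with get returns null instead of throwing, and containsKey can be used to test for presence.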

Related

Java/Kotlin: Tokenize a string ignoring the contents of nested quotes

I would like to split a string by spaces but keep the spaces inside the quotes (and the quotes themselves). The problem is, the quotes can be nested, and also I would need to do this for both single and double quotes. So, from the line this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes" I would like to get [this, "'"is a possible option"'", and, ""so is this"", and, '''this one too''', and, even, ""mismatched quotes"].
This question has already been asked, but not the exact question that I'm asking. Here are several solutions: one uses a matcher (in this case """x""" would be split into [""", x"""], so this is not what I need) and Apache Commons (which works with """x""" but not with ""x"", since it takes the first two double quotes and leaves the last two with x). There are also suggestions of writing a function to do so manually, but this would be the last resort.
You can achieve that with the following regex: ["']+[^"']+?["']+. Using that pattern you retrieve the indices where you want to split like this:
val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()
The rest is building the list out of substrings. Here the complete function:
fun String.splitByPattern(pattern: String): List<String> {
val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()
var lastIndex = 0
return indices.mapIndexed { i, ele ->
val end = if(i % 2 == 0) ele else ele + 1 // magic
substring(lastIndex, end).apply {
lastIndex = end
}
}
}
Usage:
val str = """
this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes"
""".trim()
println(str.splitByPattern("""["']+[^"']+?["']+"""))
Output:
[this , "'"is a possible option"'", and , ""so is this"", and , '''this one too''', and even , ""mismatched quotes"]
Try it out on Kotlin's playground!
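A rough Java counterpart, as a sketch only: it reuses the same regex for the quoted runs but additionally splits the text between them on whitespace, so unquoted words come out one by one without trailing spaces. The class name QuoteTokenizer and the helper addWords are made up for this example, and it assumes quote characters only ever appear as delimiters (an apostrophe inside a word such as don't would confuse it).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the regex-based approach.
final class QuoteTokenizer {
    // One or more quote chars, a lazy run of non-quote chars, one or more quote chars.
    private static final Pattern QUOTED = Pattern.compile("[\"']+[^\"']+?[\"']+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        int last = 0;
        while (m.find()) {
            // Text before the quoted run is split into plain words.
            addWords(input.substring(last, m.start()), tokens);
            tokens.add(m.group());
            last = m.end();
        }
        addWords(input.substring(last), tokens);
        return tokens;
    }

    private static void addWords(String gap, List<String> out) {
        for (String word : gap.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word);
            }
        }
    }
}
```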

Searching for key words in a large file (Java/Scala)

I have a large File ~120MB which contains UTF 8 encoded Strings and I need to search for certain words in this file.
The format of the file looks like this:
[resource]<label>[resource]<label>[resource]<label>... including braces as one huge line so I can read it fast into memory.
I search only in the labels and return the labels and resources where a label contains one or more of the key words. Both the labels and the key words are in lower case.
Currently I load the whole file and create a list of Strings. Each entry in this list contains a pair of resource and label in the format [resource]<label>. The size of this list is approximately 3,000,000. I "iterate" through this list with a tail-recursive function and check whether each label contains one of the key words. This is quite fast (<800ms), but the search needs a lot of memory and CPU power.
My search function looks like this:
@tailrec
def search2( l: List[String], list: List[(String, String)]): List[(String, String)] = {
l match {
case Nil => list
case a :: as => {
val found = keyWords.foldRight(List.empty[(String, String)]) { (x, y) =>
if (a.contains(x)) {
val split = a.split("<")
if (split.size == 2) { (split(0).replace("[", "").replace("]", ""), split(1)) :: y }
else { y }
} else { y }
}
search2(as, found ::: list)
}
}
}
search2(buffer, Nil) //buffer is the list with my 3,000,000 elements
The search needs to be really fast (< 2 seconds). I already tried the MappedByteBuffer but the UTF 8 encoding made it quite difficult to search for a byte sequence and it was really slow (but maybe my search function was just bad).
If needed I could change the format or even split labels and resources into two different files.
You do not need to reparse the file every time you search for an element.
Read your file once and for all and put the words in a Map[String, Set[String]].
Something like:
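import scala.io.Source
// Splits the single huge line into (label, resource) pairs.
def extractLabelResources(line: String): Array[(String, String)] = {
// ...
}
// Built once; maps each label to the set of resources that carry it.
val allWords: Map[String, Set[String]] =
extractLabelResources(Source.fromFile(file).getLines().next())
.groupBy { case (label, _) => label }
.map { case (label, pairs) => label -> pairs.map(_._2).toSet }
def search(word: String): Set[String] = allWords.getOrElse(word, Set.empty)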
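The same "parse once, look up many times" idea can be sketched in Java. This is only an illustration under stated assumptions: the class name LabelIndex is invented, the parsing regex assumes resources contain no ] and labels contain no >, and key words are assumed to match whole whitespace-separated words of a label (the original search used substring matching, which a plain hash lookup cannot answer).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical inverted index: word in a label -> resources whose label has that word.
final class LabelIndex {
    // Matches one "[resource]<label>" pair.
    private static final Pattern PAIR = Pattern.compile("\\[([^\\]]*)\\]<([^>]*)>");

    private final Map<String, Set<String>> resourcesByWord = new HashMap<>();

    LabelIndex(String data) {
        Matcher m = PAIR.matcher(data);
        while (m.find()) {
            String resource = m.group(1);
            for (String word : m.group(2).split("\\s+")) {
                if (!word.isEmpty()) {
                    resourcesByWord.computeIfAbsent(word, k -> new HashSet<>()).add(resource);
                }
            }
        }
    }

    // Expected constant-time lookup once the index has been built.
    Set<String> search(String word) {
        return resourcesByWord.getOrDefault(word, Collections.emptySet());
    }
}
```

Building the index costs one pass over the data and extra memory; after that each query is a single map lookup rather than a scan of three million entries.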

java regex: is it possible to find number of captures in a match without looping

My code looks like this and it works fine for finding all the numbers in the matrix but it seems overly complicated to me.
String attr = "matrix(1 0 0 1 22.51 35)";
Pattern nums = Pattern.compile("(-*\\d+(?:\\.\\d+)*)");
Matcher m2 = nums.matcher(attr);
while(m2.find()) {
Log.i(logTag, "s = " + m2.group(0));
}
I would like to allocate an array and then assign values to it so I could do something like:
Pattern nums = Pattern.compile("(-*\\d+(?:\\.\\d+)*)");
Matcher m2 = nums.matcher(attr);
String [] matches = new String[number_of_matches];
int index = 0;
while (m2.find()) {
matches[index++] = m2.group(0);
}
Is this possible? I've looked for several hours and can't find anything like this in native java but I've found several pieces of example code to implement this functionality but it's something I'd expect to find in the regex library.
In PERL my code would look like:
$s = "matrix(1 0 0 1 22.51 35)";
@x = ($s =~ m{(\d+(?:\.\d+)*)}g);
x @x
0 1
1 0
2 0
3 1
4 22.51
5 35
No, there's no method for that functionality.
If it existed, it would do exactly what you describe: go over all matches and count them (and then reset to the beginning of the string)
If you really needed to know up-front, you could just write the counting function yourself.
But I would suggest that you use a List (such as an ArrayList) instead of an array; then you don't need to know the number of matches up-front, and the List interface is generally much more convenient to use than an array.
(Your Perl result also returns a variable-size list rather than a fixed-size array, if my rusty Perl knowledge is not mistaken)
You should be using the List interface for these types of operations anyway.
Pattern nums = Pattern.compile("(-*\\d+(?:\\.\\d+)*)");
Matcher m2 = nums.matcher(attr);
List <String> matches = new ArrayList<String>();
while (m2.find()) {
matches.add(m2.group(0));
}
You can't do it. With find() you initiate scanning of your String, about which you don't know much beforehand. You can learn more only if the match succeeds; from the docs:
If the match succeeds then more information can be obtained via the
start, end, and group methods, and subsequent invocations of the
find() method will start at the first character not matched by this
match.
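If an array really is wanted in the end, one option is to collect into a list first and convert afterwards. The sketch below wraps that in a small helper; the class and method names (RegexMatches, allMatches) are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: gathers every match of a pattern into a String array.
final class RegexMatches {
    static String[] allMatches(Pattern pattern, CharSequence input) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group()); // group() is the whole match, same as group(0)
        }
        // The list knows its size, so the array can be allocated only now.
        return found.toArray(new String[0]);
    }
}
```

Usage would look like RegexMatches.allMatches(Pattern.compile("-*\\d+(?:\\.\\d+)*"), attr). On Java 9 and later, Matcher.results() offers a stream of the matches, which avoids the explicit loop.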

Java - Search performantly for subset of String in String list

I want to search through a list of Strings and return the values which contain the search string.
The list could look like this (it can have up to 1000 entries), although it is not guaranteed that an entry is always letters and then a digit. It could be digits only, words only or even both mixed up:
entry 1
entry 2
entry 3
entry 4
test 1
test 2
test 3
tst 4
If the user does search for 1, these should be returned:
entry 1
test 1
The situation is that the user has a search bar and can enter a search string. This search string is used to search through the list.
How can this be done performantly?
Currently, I have got:
for (String s : strings) {
if (s.contains(searchedText)) result.add(s);
}
It is O(N) and really slow. Especially if the user types many characters at a time.
Maybe I don't understand your question, but as you know, in Java String objects are immutable but can also be treated as a collection (array) of chars. So one thing you can do is perform the search with better algorithms such as binary search, Aho-Corasick, Rabin–Karp, Boyer–Moore string search, StringSearch or one of these. You may also consider using abstract data types with better performance (hashing, trees etc.).
This is very simple if you use streams:
final List<String> items = Arrays.asList("entry 1", "entry 2", "entry 3", "test 1", "test 2", "test 3");
final String searchString = "1";
final List<String> results = items.parallelStream() // work in parallel
.filter(s -> s.contains(searchString)) // pick out items that match
.collect(Collectors.toList()); // and turn those into a result list
results.forEach(System.out::println);
Notice the parallelStream() which will cause the list to be filtered and traversed using all available CPUs.
In your case you can use the results when the user expands the search term (while typing) to reduce the amount of items that need to be filtered, because if 's' matches all items in result, all those that match 'se' will be a sub-list of result.
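The narrowing idea can be sketched as follows. This is an illustrative, single-threaded sketch; the class name IncrementalSearch and its fields are made up. It relies on one fact: if the new query contains the previous query as a substring, every item matching the new query also matched the previous one, so only the previous result needs to be filtered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reuses the previous result while the user keeps typing.
final class IncrementalSearch {
    private final List<String> all;
    private String lastQuery = "";
    private List<String> lastResult;

    IncrementalSearch(List<String> items) {
        this.all = new ArrayList<>(items);
        this.lastResult = this.all; // the empty query matches everything
    }

    List<String> search(String query) {
        // Narrow the previous result if possible, otherwise start from the full list.
        List<String> candidates = query.contains(lastQuery) ? lastResult : all;
        List<String> result = new ArrayList<>();
        for (String s : candidates) {
            if (s.contains(query)) {
                result.add(s);
            }
        }
        lastQuery = query;
        lastResult = result;
        return result;
    }
}
```

When the user deletes characters or types something unrelated, the sketch falls back to scanning the whole list, which is still O(N).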
If you don't use any additional structures, you cannot do better than looking through all of your data. That takes O(N).
If you can do some preparations, like building text index, you can increase performance of search. General information: http://en.wikipedia.org/wiki/Full_text_search. If you can make some assumptions about your data (like the last symbol is number and you are going to search only by it), it'll be easy to create such index.
Depending on the upper limit of the number in the string and if you have no concerns about space, use an Array of ArrayLists where the array index is the number of the string:
@SuppressWarnings("unchecked") // Java does not allow creating generic arrays directly
ArrayList<String>[] data = new ArrayList[1000];
for ( int i = 0; i < 1000; i++ )
data[i] = new ArrayList<String>();
//inserting data (assumes the string ends in a single digit)
int num = Integer.parseInt(datastring.substring(datastring.length() - 1));
data[num].add(datastring);
//getting all data that has a 1
for ( String s: data[1] )
result.add(s);
Using a HashMap that maps each number to a single String would overwrite previously mapped values when you put new values into it,
i.e. if 1 maps to entry and you then try to map 1 to test, entry would get replaced with test.
As another idea, you could just keep a count of the number of strings with each number, so when you're searching, you know how many to look for, so as soon as you find all of them, you stop searching:
int[] str_count = new int[1000];
for ( int i = 0; i < 1000; i++ )
str_count[i] = 0;
//when storing data into the list:
int num = Integer.parseInt(datastring.substring(datastring.length() - 1));
str_count[num]++;
//when searching the list for 1s:
int count = str_count[1];
for (String s : strings) {
if (s.contains(searchedText))
result.add(s);
if (result.size() == count)
break;
}
The first idea is much faster but takes up more space. The second idea takes up less space, but its worst case is still O(N).

intersection of two strings using Java HashSet

I am trying to learn Java by doing some assignments from a Stanford class and am having trouble answering this question.
boolean stringIntersect(String a, String b, int len): Given 2 strings,
consider all the substrings within them of length len. Returns true if
there are any such substrings which appear in both strings. Compute
this in O(n) time using a HashSet.
I can't figure out how to do it using a Hashset because you cannot store repeating characters. So stringIntersect(hoopla, loopla, 5) should return true.
thanks!
Edit: Thanks so much for all your prompt responses. It was helpful to see explanations as well as code. I guess I couldn't see why storing substrings in a hashset would make the algorithm more efficient. I originally had a solution like :
public static boolean stringIntersect(String a, String b, int len) {
assert (len>=1);
if (len>a.length() || len>b.length()) return false;
String s1=new String(),s2=new String();
if (a.length()<b.length()){
s1=a;
s2=b;
}
else {
s1=b;
s2=a;
}
int index = 0;
while (index<=s1.length()-len){
if (s2.contains(s1.substring(index,index+len)))return true;
index++;
}
return false;
}
I'm not sure I understand what you mean by "you cannot store repeating characters". A HashSet is a Set, so it can do two things: you can add values to it, and you can check whether a value is already in it. In this case, the problem wants you to answer the question by storing strings, not chars, in the HashSet. To do this in Java:
Set<String> stringSet = new HashSet<String>();
Try breaking this problem into two parts:
1. Generate all the substrings of length len of a string
2. Use this to solve the problem.
The hint for part two is:
Step 1: For the first string enter the substrings into a hashset
Step 2: For the second string, check the values in the hashset
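Put together, the two steps could look like the sketch below. The wrapper class name SubstringIntersect is an assumption for this example; the method signature follows the assignment text. It is the straightforward version, so the cost of hashing each substring still depends on len.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper class for the assignment's stringIntersect method.
final class SubstringIntersect {
    static boolean stringIntersect(String a, String b, int len) {
        if (len < 1 || len > a.length() || len > b.length()) {
            return false;
        }
        // Step 1: store every substring of a with length len.
        Set<String> seen = new HashSet<>();
        for (int i = 0; i + len <= a.length(); i++) {
            seen.add(a.substring(i, i + len));
        }
        // Step 2: look up every substring of b with length len.
        for (int i = 0; i + len <= b.length(); i++) {
            if (seen.contains(b.substring(i, i + len))) {
                return true;
            }
        }
        return false;
    }
}
```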
Note (Advanced): this problem is poorly specified. Entering and checking strings in a hash table is O(the length of the string). For a string a of length n you have O(n-k) substrings of length k. So for a string a of length n and a string b of length m you have O((n-k)*k+(m-k)*k). This is not really O(n), since your running time for k = n/2 is O((n/2)*(n/2)) = O(n^2).
Edit: So what if you actually want to do this in O(n) (or perhaps O(n+m+k))? My belief is that the original homework was asking for something like the algorithm I described above. But we can do better. What's more, we can do better and still make a HashSet the crucial tool for our algorithm. The idea is to perform our search using a "Rolling Hash." Wikipedia describes a couple: http://en.wikipedia.org/wiki/Rolling_hash, but we will implement our own.
A simple solution would be to XOR the values of the character hashes together. This would allow us to add a new char to the hash in O(1) and remove one in O(1), making computing the next hash trivial. But this simple algorithm won't work, for two reasons:
The character hashes may not provide sufficient entropy. Okay, we don't know if we will have this problem, but let's solve it anyway, just for fun.
We will hash permutations to the same value ... "abc" should not have the same hash as "cba"
To solve the first problem we can use an idea from AI, namely let's steal from Zobrist hashing. The idea is to assign every possible character a random value of a greater length. If we were using ASCII, we could easily create an array with all the ASCII characters, but that will run into problems when using Unicode characters. The alternative is to assign values lazily.
import scala.collection.mutable.HashMap
import scala.util.Random
object LazyCharHash{
private val map = HashMap.empty[Char,Int]
private val r = new Random
def lHash(c: Char): Int = {
val d = map.get(c)
d match {
case None => {
map.put(c,r.nextInt)
lHash(c)
}
case Some(v) => v
}
}
}
This is Scala code. Scala tends to be less verbose than Java, but still allows me to use Java collections, so I will be using imperative-style Scala throughout. It wouldn't be that hard to translate.
The second problem can be solved as well. First, instead of using a pure XOR, we combine our XOR with a shift, thus the hash function is now:
def fullHash(s: String) = {
var h = 0
for(i <- 0 until s.length){
h = h >>> 1
h = h ^ LazyCharHash.lHash(s.charAt(i))
}
h
}
Of course, using fullHash won't give a performance advantage. It is just a specification.
We need a way of using our hash function to store values in the HashSet (I promised we would use it). We can just create a wrapper class:
class HString(hash: Int, string: String){
def getHash = hash
def getString = string
override def equals(otherHString: Any): Boolean = {
otherHString match {
case other: HString => (hash == other.getHash) && (string == other.getString)
case _ => false
}
}
override def hashCode = hash
}
Okay, to make the hashing function rolling, we just have to XOR out the value associated with the character we will no longer be using. Doing that just takes shifting that value by the appropriate amount.
def stringIntersect(a: String, b: String, len: Int): Boolean = {
val stringSet = new HashSet[HString]()
var h = 0
for(i <- 0 until len){
h = h >>> 1
h = h ^ LazyCharHash.lHash(a.charAt(i))
}
stringSet.add(new HString(h,a.substring(0,len)))
for(i <- len until a.length){
h = h >>> 1
h = h ^ (LazyCharHash.lHash(a.charAt(i - len)) >>> (len))
h = h ^ LazyCharHash.lHash(a.charAt(i))
stringSet.add(new HString(h,a.substring(i - len + 1,i + 1)))
}
...
You can figure out how to finish this code on your own.
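For readers who want to see a complete rolling-hash version, here is a Java sketch. Note that it swaps in a different technique from the one above: a polynomial (Rabin-Karp style) rolling hash instead of the Zobrist-style XOR-and-shift, and a HashMap from hash value to start positions instead of a HashSet of wrapper objects. The class name RollingIntersect and the constant BASE are assumptions chosen for illustration. Candidate hits are verified with regionMatches, so hash collisions can slow it down but cannot produce a wrong answer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch using a polynomial rolling hash (not the Zobrist variant).
final class RollingIntersect {
    private static final int BASE = 31;

    static boolean stringIntersect(String a, String b, int len) {
        if (len < 1 || len > a.length() || len > b.length()) {
            return false;
        }
        // BASE^(len-1); int overflow simply means arithmetic modulo 2^32.
        int pow = 1;
        for (int i = 1; i < len; i++) {
            pow *= BASE;
        }
        // Hash of each window of a -> start indices of windows with that hash.
        Map<Integer, List<Integer>> startsByHash = new HashMap<>();
        int h = 0;
        for (int i = 0; i < a.length(); i++) {
            if (i >= len) {
                h -= a.charAt(i - len) * pow; // remove the outgoing character
            }
            h = h * BASE + a.charAt(i);       // shift and add the incoming character
            if (i >= len - 1) {
                startsByHash.computeIfAbsent(h, k -> new ArrayList<>()).add(i - len + 1);
            }
        }
        h = 0;
        for (int i = 0; i < b.length(); i++) {
            if (i >= len) {
                h -= b.charAt(i - len) * pow;
            }
            h = h * BASE + b.charAt(i);
            if (i >= len - 1) {
                List<Integer> starts = startsByHash.getOrDefault(h, Collections.emptyList());
                for (int start : starts) {
                    // Confirm the match character by character to rule out collisions.
                    if (a.regionMatches(start, b, i - len + 1, len)) {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
```

Each window hash is updated in O(1), so the expected cost is linear in the input lengths when collisions are rare; with many equal hashes the verification step dominates.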
Is this O(n)? Well, it depends on what you mean. Big O, big Omega, and big Theta are all notations for bounds. They could describe the worst case of the algorithm, the best case, or something else. In this case the modification gives expected O(n) performance, but this only holds if we avoid hash collisions. It still takes O(n) to tell whether two Strings are equal. This random approach works pretty well, and you can scale up the size of the random bit arrays to make it work better, but it does not have guaranteed performance.
You should not store characters in the Hashset, but substrings.
When considering string "hoopla": if you store the substrings "hoopl" and "oopla" in the Hashset (linear operation), then it's linear again to find if one of the substrings of "loopla" matches.
I don't know how they're thinking you're supposed to use the HashSet but I ended up doing a solution like this:
public class StringComparator {
public static boolean compare( String a, String b, int len ) {
Set<String> pieces = new HashSet<String>();
for ( int x = 0; (x + len) <= a.length(); x++ ) {
pieces.add( a.substring( x, x + len ) );
}
for ( String piece : pieces ) {
if ( b.contains(piece) ) {
return true;
}
}
return false;
}
}
