Searching for key words in a large file (Java/Scala)

I have a large file (~120 MB) which contains UTF-8 encoded strings, and I need to search for certain words in this file.
The format of the file looks like this:
[resource]<label>[resource]<label>[resource]<label>... including the brackets, as one huge line so I can read it into memory quickly.
I search only in the labels and return the labels and resources where a label contains one or more of the key words. Both the labels and the key words are in lower case.
Currently I load the whole file and create a list of strings. Each entry in this list contains a pair of resource and label in the format [resource]<label>, and the size of this list is approximately 3,000,000. I "iterate" through this list with a tail-recursive function and check whether each label contains one of the key words. This is quite fast (< 800 ms), but the search needs a lot of memory and CPU power.
My search function looks like this:
import scala.annotation.tailrec

@tailrec
def search2(l: List[String], list: List[(String, String)]): List[(String, String)] = {
  l match {
    case Nil => list
    case a :: as => {
      val found = keyWords.foldRight(List.empty[(String, String)]) { (x, y) =>
        if (a.contains(x)) {
          val split = a.split("<")
          if (split.size == 2) { (split(0).replace("[", "").replace("]", ""), split(1)) :: y }
          else { y }
        } else { y }
      }
      search2(as, found ::: list)
    }
  }
}
search2(buffer, Nil) //buffer is the list with my 3,000,000 elements
The search needs to be really fast (< 2 seconds). I already tried a MappedByteBuffer, but the UTF-8 encoding made it quite difficult to search for a byte sequence, and it was really slow (though maybe my search function was just bad).
If needed I could change the format or even split labels and resources into two different files.

You do not need to reparse the file every time you search for an element.
Read your file once and for all and put the words in a Map[String, Set[String]].
Something like:
import scala.io.Source

val allWords: Map[String, Set[String]] =
  extractLabelResources(Source.fromFile(file).getLines().next())
    .groupBy { case (label, _) => label }
    .mapValues(_.map { case (_, resource) => resource }.toSet)
    .toMap

def extractLabelResources(line: String): Array[(String, String)] = {
  // ... parse the single "[resource]<label>..." line into (label, resource) pairs
}

def search(word: String): Set[String] = allWords.getOrElse(word, Set.empty)
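For illustration, extractLabelResources could be filled in roughly like this, assuming each entry in the single line really looks like "[resource]<label>" with a closing ">" (a sketch only; adjust the delimiter handling to your actual format):
def extractLabelResources(line: String): Array[(String, String)] = {
  line.split("""\[""")                    // one chunk per "resource]<label>" entry
    .filter(_.nonEmpty)                   // drop the empty chunk before the first '['
    .map { entry =>
      // split "resource]<label>" into the resource and the label
      val Array(resource, label) = entry.split("""\]<""", 2)
      (label.stripSuffix(">"), resource)  // (label, resource), matching the groupBy above
    }
}
Note that search as written does an exact lookup on the whole label; matching key words that occur inside a label would need a different index (for example keyed by the individual words of each label).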

Related

Gatling: How to split a value coming from the Feeder?

In a CSV file I have something like this:
term
testing
I want to split testing into characters. I want something like this:
.feed(Feeders.search)
.foreach("${term}".toList, "search") {
exec(http("Auto Complete")
.get("${baseUrlHttps}/search/autocomplete")
.queryParam("term", "${search}")
.check(status is 200)
.check(jsonPath("$..products[0].code").optional.saveAs("code"))).pause(MIN_PAUSE, MAX_PAUSE)
}
The above code is not working as I wanted: it splits the literal string "${term}" into characters, whereas I wanted to split the word "testing" from the CSV into characters. Is there any workaround for this?
That's not how autocomplete works. You're not posting char by char, you're re-posting with one more char each time. E.g., you'll be posting "test", then "testi", then "testin" and finally "testing" (there's usually a minimum length).
exec { session =>
  // build the list of prefixes, starting at the minimum autocomplete length (3 chars here)
  val term = session("term").as[String]
  val parts = for (i <- 3 to term.size) yield term.substring(0, i)
  session.set("parts", parts)
}
.foreach("${parts}", "search") {
exec(http("Auto Complete")
.get("${baseUrlHttps}/search/autocomplete")
.queryParam("term", "${search}")
.check(status is 200)
.check(jsonPath("$..products[0].code").optional.saveAs("code"))).pause(MIN_PAUSE, MAX_PAUSE)
}

HashMap match against Object value

I'm making a text-based adventure game in Java, and I want to be able to match the player's location with the location of a character/enemy. My list of characters is imported from a text file and put into a HashMap. The importing from a text file is a requirement.
I can match the location if I specify the value (name) of the character, but I want to be able to have it go through and match on the "location" property of the character. Here is what I have:
Character object:
Character(String name, String location, int maxHp, int maxAttackDmg, String description)
Character HashMap:
HashMap<String, Character> characters = ReadIn.createCharacters();
ReadIn.createCharacters() parses a text file for the character properties.
There is another method called player.getLocation() which gives the player's current location.
Here is what I have working:
if (player.getLocation().equals(skeleton.getLocation()) && skeleton.getDefeated() == false) {
Encounter e = new Encounter();
e.fight(player,skeleton);
}
If the player is in the same location as the character, and the character has not been defeated, then call the fight() method.
What I want to do is this:
if (player.getLocation().equals(any location in the character HashMap) {
Encounter e = new Encounter();
e.fight(player,<matched character from HashMap>);
}
I know what I want to do, I just don't know how to do it in Java. I'm quite new to Java and programming in general. Hopefully I gave enough detail, but I can provide more if needed. Any help would be greatly appreciated!
The traditional way to achieve this in Java would be:
for (Character character : characters.values()) {
    if (character.getLocation().equals(player.getLocation())
            && !character.getName().equals(player.getName())
            && !character.isDefeated()) {
        ... fight ...
    }
}
The alternative using streams would be:
characters.values().stream()
.filter(ch -> ch.getLocation().equals(player.getLocation()))
.filter(ch -> !ch.getName().equals(player.getName()))
.filter(ch -> !ch.isDefeated())
.forEach(ch -> ... fight ...);
If you want a single encounter irrespective of the number of characters with the same location, then replace forEach with:
.findAny()
.ifPresent(ch -> ... fight ...);
I assume that your characters map is a "location to character" map; if so, then you can use this:
Character matchingCharacter = characters.get(player.getLocation());
if (matchingCharacter != null) {
Encounter e = new Encounter();
e.fight(player, matchingCharacter);
}

Java/Kotlin: Tokenize a string ignoring the contents of nested quotes

I would like to split a string by spaces but keep the spaces inside the quotes (and the quotes themselves). The problem is, the quotes can be nested, and also I would need to do this for both single and double quotes. So, from the line this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes" I would like to get [this, "'"is a possible option"'", and, ""so is this"", and, '''this one too''', and, even, ""mismatched quotes"].
This question has already been asked, but not the exact question that I'm asking. Here are several solutions: one uses a matcher (in this case """x""" would be split into [""", x"""], so this is not what I need) and Apache Commons (which works with """x""" but not with ""x"", since it takes the first two double quotes and leaves the last two with x). There are also suggestions of writing a function to do so manually, but this would be the last resort.
You can achieve that with the following regex: ["']+[^"']+?["']+. Using that pattern you retrieve the indices where you want to split like this:
val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()
The rest is building the list out of substrings. Here is the complete function:
fun String.splitByPattern(pattern: String): List<String> {
val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()
var lastIndex = 0
return indices.mapIndexed { i, ele ->
val end = if (i % 2 == 0) ele else ele + 1 // even i: ele is the start of a match (cut the plain text before it); odd i: ele is the inclusive end of the match, so +1 makes it exclusive
substring(lastIndex, end).apply {
lastIndex = end
}
}
}
Usage:
val str = """
this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes"
""".trim()
println(str.splitByPattern("""["']+[^"']+?["']+"""))
Output:
[this , "'"is a possible option"'", and , ""so is this"", and , '''this one too''', and even , ""mismatched quotes"]
Try it out on Kotlin's playground!

How to get lines before and after a match from a Java 8 stream, like grep?

I have a text file that has a lot of lines in it. If I want to find lines before and after a match in grep, I do it like this:
grep -A 10 -B 10 "ABC" myfile.txt
How can I implement the equivalent in Java 8 using streams?
If you're willing to use a third party library and don't need parallelism, then jOOλ offers SQL-style window functions as follows:
Seq.seq(Files.readAllLines(Paths.get(new File("/path/to/Example.java").toURI())))
.window(-1, 1)
.filter(w -> w.value().contains("ABC"))
.forEach(w -> {
System.out.println("-1:" + w.lag().orElse(""));
System.out.println(" 0:" + w.value());
System.out.println("+1:" + w.lead().orElse(""));
// ABC: Just checking
});
Yielding
-1: .window(-1, 1)
0: .filter(w -> w.value().contains("ABC"))
+1: .forEach(w -> {
-1: System.out.println("+1:" + w.lead().orElse(""));
0: // ABC: Just checking
+1: });
The lead() function accesses the next value in traversal order from the window; the lag() function accesses the previous value.
Disclaimer: I work for the company behind jOOλ
Such a scenario is not well supported by the Stream API, as the existing methods do not provide access to an element's neighbors in the stream. The closest solution I can think of without creating custom iterators/spliterators or third-party library calls is to read the input file into a List and then use a stream of indices:
List<String> input = Files.readAllLines(Paths.get(fileName));
Predicate<String> pred = str -> str.contains("ABC");
int contextLength = 10;
IntStream.range(0, input.size()) // line numbers
// filter them leaving only numbers of lines satisfying the predicate
.filter(idx -> pred.test(input.get(idx)))
// add nearby numbers
.flatMap(idx -> IntStream.rangeClosed(idx-contextLength, idx+contextLength))
// remove numbers which are out of the input range
.filter(idx -> idx >= 0 && idx < input.size())
// sort numbers and remove duplicates
.distinct().sorted()
// map to the lines themselves
.mapToObj(input::get)
// output
.forEachOrdered(System.out::println);
The grep output also includes a special delimiter like "--" to designate omitted lines. If you want to go further and mimic that behavior as well, I can suggest trying my free StreamEx library, as it has an intervalMap method which is helpful in this case:
// Same as IntStream.range(...).filter(...) steps above
IntStreamEx.ofIndices(input, pred)
// same as above
.flatMap(idx -> IntStream.rangeClosed(idx-contextLength, idx+contextLength))
// remove numbers which are out of the input range
.atLeast(0).less(input.size())
// sort numbers and remove duplicates
.distinct().sorted()
.boxed()
// merge adjacent numbers into single interval and map them to subList
.intervalMap((i, j) -> (j - i) == 1, (i, j) -> input.subList(i, j + 1))
// flatten all subLists prepending them with "--"
.flatMap(list -> StreamEx.of(list).prepend("--"))
// skipping first "--"
.skip(1)
.forEachOrdered(System.out::println);
As Tagir Valeev noted, this kind of problem isn't well supported by the streams API. If you want to read lines from the input incrementally and print out matching lines with context, you'd have to introduce a stateful pipeline stage (or a custom collector or spliterator), which adds quite a bit of complexity.
If you're willing to read all the lines into memory, it turns out that BitSet is a useful representation for manipulating groups of matches. This bears some similarity to Tagir's solution, but instead of using integer ranges to represent lines to be printed, it uses 1-bits in a BitSet. Some advantages of BitSet are that it has a number of built-in bulk operations, and it has a compact internal representation. It can also produce a stream of indexes of the 1-bits, which is quite useful for this problem.
First, let's start out by creating a BitSet that has a 1-bit for each line that matches the predicate:
void contextMatch(Predicate<String> pred, int before, int after, List<String> input) {
int len = input.size();
BitSet matches = IntStream.range(0, len)
.filter(i -> pred.test(input.get(i)))
.collect(BitSet::new, BitSet::set, BitSet::or);
Now that we have the bit set of matching lines, we stream out the indexes of each 1-bit. We then set the bits in the bitset that represent the before and after context. This gives us a single BitSet whose 1-bits represent all of the lines to be printed, including context lines.
BitSet context = matches.stream()
.collect(BitSet::new,
(bs,i) -> bs.set(Math.max(0, i - before), Math.min(i + after + 1, len)),
BitSet::or);
If we just want to print out all the lines, including context, we can do this:
context.stream()
.forEachOrdered(i -> System.out.println(input.get(i)));
The actual grep -A a -B b command prints a separator between each group of context lines. To figure out when to print a separator, we look at each 1-bit in the context bit set. If there's a 0-bit preceding it, or if it's at the very beginning, we set a bit in the result. This gives us a 1-bit at the beginning of each group of context lines:
BitSet separators = context.stream()
.filter(i -> i == 0 || !context.get(i-1))
.collect(BitSet::new, BitSet::set, BitSet::or);
We don't want to print the separator before each group of context lines; we want to print it between each group. That means we have to clear the first 1-bit (if any):
// clear the first bit
int first = separators.nextSetBit(0);
if (first >= 0) {
separators.clear(first);
}
Now, we can print out the result lines. But before printing each line, we check to see if we should print a separator first:
context.stream()
.forEachOrdered(i -> {
if (separators.get(i)) {
System.out.println("--");
}
System.out.println(input.get(i));
});
}

Scala HashMap throwing key not found exception

I am very new to Scala and would appreciate any help (I have looked everywhere and spent the last 8 hours trying to figure this out).
Currently I have
def apply(file: String) : Iterator[String] = {
scala.io.Source.fromFile(file).getLines().map(_.toLowerCase)
}
As well as
def groupFreq[A,B](xs: Iterator[A], f: A => B): HashMap[B, Int] = {
var freqMap = new HashMap[B, Int]
for (x <- xs) freqMap = freqMap + ( f(x) -> ( freqMap.getOrElse( f(x) , 0 ) +1 ) )
freqMap
}
apply just takes a file of words that we pass in.
groupFreq takes xs: Iterator[A] and a grouping function f that converts A values to their B groups.
The function returns a HashMap that, for each B group, counts the number of A values that fell into the group.
I use both of these functions to help me with charFreq, a function that uses both apply and groupFreq to pass back a HashMap that counts how many times a Char appears throughout the entire file. If the char does not appear anywhere in the file, then there should be no mapping for it.
def charFreq(file: String): HashMap[Char, Int] =
{
var it = Iterator[Char]()
val words = apply(file)
for {
xs<-words
} yield { it = it ++ xs.toIterator }
val chars = it
val grouper = (x: Char) => x
groupFreq(chars, grouper)
}
My solution compiles and apply and groupFreq work as intended, but when I run charFreq, it says
charFreq threw an exception: java.util.NoSuchElementException: key not found: d
I believe I'm doing something wrong, most likely with my for loop and yield, but I've gone through the logic many times and I don't get why it doesn't work.
Google and StackOverflow have recommended flatMap, but I couldn't get that to work either.
Any help would be appreciated. Keep in mind this is a class assignment with the skeleton methods set up, so I cannot change the way apply, groupFreq and charFreq are set up; I can only manipulate the bodies, which is what I have tried to do.
I can't reproduce your error with some random text files of strings. I suspect it occurred in an earlier iteration of groupFreq() without a getOrElse()-style check.
However, when I run your code, I end up with an empty map from the call to charFreq(). You're correct that the for/yield in charFreq() is problematic. It's easier to see when you put a val l = in front of the for and check the value in an IDE, which shows that l is of type Iterator[Unit].
You don't need vars for the for loop. A for loop isn't the same as a C-style for loop; it is equivalent to calling flatMap/map over its elements (though others can express this much better than I can). The yield result is built up for you (defined by the steps you take inside it).
Here are two ways to get an Iterator[Char] for your call to groupFreq():
1) Remove the unnecessary var it and fill chars directly with a for-comprehension:
val chars = for {
  xs <- words
  i <- xs.toIterator
} yield { i }
2) Call flatMap directly on the words val:
val chars = words.flatMap( s => s )
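With either of these, charFreq can keep its required signature. A minimal sketch using option 2, relying on the apply and groupFreq definitions from the question (and its existing HashMap import):
def charFreq(file: String): HashMap[Char, Int] = {
  val words = apply(file)            // Iterator[String] of lower-cased lines
  val chars = words.flatMap(s => s)  // flatten the lines into an Iterator[Char]
  val grouper = (x: Char) => x       // group every character by itself
  groupFreq(chars, grouper)
}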
A. Regarding your problem, there is at least one issue I can spot in the code:
The way you build up an iterator (in charFreq) seems to be too heavy. words.toIterator would suffice.
The way you update the map also seems strange to me. I would rather do:
val mapped = f(x)
// note: in-place updates like this need a scala.collection.mutable.Map
if (!(freqMap contains mapped)) freqMap(mapped) = 0
freqMap(mapped) += 1
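Spelled out, groupFreq with that update pattern might look like this (a sketch only: it uses a mutable map internally, since in-place updates only work on scala.collection.mutable.Map, and converts back to the immutable HashMap the signature requires):
def groupFreq[A, B](xs: Iterator[A], f: A => B): HashMap[B, Int] = {
  val freqMap = scala.collection.mutable.Map[B, Int]()
  for (x <- xs) {
    val mapped = f(x)
    if (!(freqMap contains mapped)) freqMap(mapped) = 0
    freqMap(mapped) += 1
  }
  HashMap(freqMap.toSeq: _*)  // copy into the immutable HashMap required by the signature
}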
B. As far as I understand, this problem can be solved with a one-liner (which is why Scala is so cool of course ;-) )
def charFreq(file: String) =
  file.toCharArray.groupBy(m => m).map(m => (m._1, m._2.size))
Explanation:
1) toCharArray converts your string into an array of Char elements
2) groupBy(m => m) groups together all elements with the same value; the result will be of type Map[Char,Array[Char]], where every char is mapped to the array of all occurrences of that char in your string.
3) Now all we need is to map each entry of the Map[Char,Array[Char]] to a Map[Char,Int] by using map(m => (m._1,m._2.size)), which takes every element (key -> value), leaves the key intact and transforms the value (an array) into the size of that array.
4) If your input string is going to be very large (I haven't evaluated that, but if it's in the ballpark of megabytes I'd start to worry about it), then I would probably use another solution, with a mutable map which I'd fill up while iterating over the source:
def charFreq(hugeFile: String) = {
  // create a mutable map, which can be updated when needed
  val mm = scala.collection.mutable.Map[Char, Int]()
  // iterate over the string
  for (m <- hugeFile) {
    // ensure that our map contains the entry for the given character
    if (!(mm contains m)) mm(m) = 0
    mm(m) = mm(m) + 1
  }
  // return the result as an immutable map
  mm.toMap
}
