Here is what I am trying to do.
I am reading in a list of words with each having a level of complexity. Each line has a word followed by a comma and the level of the word. "watch, 2" for example. I wish to put all of the words of a given level into a set to ensure their uniqueness in that level. There are 5 levels of complexity, so ideally I'd like an array with 5 elements, each of which is a set.
I can then add words to each of the sets as I read them in. Later on, I wish to pull out a random word of a specified level.
I'm happy with everything except how to create an array of sets. I've read several other posts here that seem to agree that this can't be done exactly as I would hope, but I can't find a good workaround. (No, I'm not willing to have 5 sets in a switch statement. Goes against the grain.)
Thanks.
You can use a map: use the level as the key and the set containing the words as the value. When a random word is requested for a level, get the value (the set in this case) using the level as the key and pick a random element from it. This also scales if you increase the number of levels.
public static void main(String[] args) {
    Map<Integer, Set<String>> levelSet = new HashMap<Integer, Set<String>>();
    // Your code goes here to get the level and word
    String word = "";
    int level = 0;
    addStringToLevel(levelSet, word, level);
}

private static void addStringToLevel(Map<Integer, Set<String>> levelSet,
        String word, int level) {
    if (levelSet.get(level) == null) {
        // this is the first string added for this level,
        // so create a container to hold it
        levelSet.put(level, new HashSet<String>());
    }
    Set<String> wordContainer = levelSet.get(level);
    wordContainer.add(word);
}

private static String getStringFromLevel(Map<Integer, Set<String>> levelSet,
        int level) {
    if (levelSet.get(level) == null) {
        return null;
    }
    Set<String> wordContainer = levelSet.get(level);
    // return a random string from wordContainer
    int randomIndex = new Random().nextInt(wordContainer.size());
    return new ArrayList<String>(wordContainer).get(randomIndex);
}
If you are willing to use Guava, try SetMultimap. It will take care of everything for you.
SetMultimap<Integer, String> map = HashMultimap.create();
map.put(5, "value");
The collection will take care of creating the inner Set instances for you, unlike the array or List solutions, which require either pre-creating the Sets or checking that they exist.
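To pull a random word for a given level later (as the question asks), a minimal sketch; it assumes a java.util.Random instance named random and the usual java.util imports:

// get(level) returns the (possibly empty) Set of words stored for that level
Set<String> wordsForLevel = map.get(2);
if (!wordsForLevel.isEmpty()) {
    List<String> asList = new ArrayList<String>(wordsForLevel);
    String randomWord = asList.get(random.nextInt(asList.size()));
}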
Consider using a List instead of an array.
Doing so might make your life easier.
List<Set<String>> wordSetLevels = new ArrayList<Set<String>>();
// ...
for (int i = 0; i < 5; i++) {
    wordSetLevels.add(new HashSet<String>());
}
wordSetLevels = Collections.unmodifiableList(wordSetLevels);
// ...
wordSetLevels.get(2).add("watch");
import java.util.HashSet;
import java.util.List;
import java.util.Set;
public class Main {
private Set<String>[] process(List<String> words) {
@SuppressWarnings("unchecked")
Set<String>[] arrayOfSets = new Set[5];
for(int i=0; i<arrayOfSets.length; i++) {
arrayOfSets[i] = new HashSet<String>();
}
for(String word: words) {
int index = getIndex(word);
String val = getValue(word);
arrayOfSets[index].add(val);
}
return arrayOfSets;
}
private int getIndex(String str) {
//TODO Implement
return 0;
}
private String getValue(String str) {
//TODO Implement
return "";
}
}
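The two TODO helpers could be filled in along these lines, as a sketch that assumes each list element has the question's "watch, 2" form with levels 1 through 5 (so the array index is level - 1):

private int getIndex(String line) {
    // "watch, 2" -> level 2 -> array index 1
    String levelPart = line.substring(line.lastIndexOf(',') + 1).trim();
    return Integer.parseInt(levelPart) - 1;
}

private String getValue(String line) {
    // everything before the comma is the word itself
    return line.substring(0, line.lastIndexOf(',')).trim();
}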
tldr: How can I search for an entry in multiple (read-only) Java HashMaps at the same time?
The long version:
I have several dictionaries of various sizes stored as HashMap<String, String>. Once they are read in, they are never changed (strictly read-only).
I want to check whether, and in which dictionary, an entry with my key is stored.
My code was originally looking for a key like this:
public DictionaryEntry getEntry(String key) {
for (int i = 0; i < _numDictionaries; i++) {
HashMap<String, String> map = getDictionary(i);
if (map.containsKey(key))
return new DictionaryEntry(map.get(key), i);
}
return null;
}
Then it got a little more complicated: my search string could contain typos, or be a variant of the stored entry. For example, if the stored key was "banana", I might look up "bannana" or "a banana", but would still like the entry for "banana" returned. Using the Levenshtein distance, I now loop through all dictionaries and each entry in them:
public DictionaryEntry getEntry(String key) {
for (int i = 0; i < _numDictionaries; i++) {
HashMap<String, String> map = getDictionary(i);
for (Map.Entry<String, String> entry : map.entrySet()) {
// Calculate Levenshtein distance, store closest match etc.
}
}
// return closest match or null.
}
So far everything works as it should and I'm getting the entry I want. Unfortunately I have to look up around 7000 strings, in five dictionaries of various sizes (~ 30 - 70k entries) and it takes a while. From my processing output I have the strong impression my lookup dominates overall runtime.
My first idea to improve the runtime was to search all dictionaries in parallel. Since none of the dictionaries is ever changed and no more than one thread accesses a dictionary at the same time, I don't see any safety concerns.
The question is just: how do I do this? I have never used multithreading before. My search only turned up ConcurrentHashMap (which, to my understanding, I don't need) and the Runnable class, where I'd have to put my processing into the run() method. I think I could rewrite my current class to fit Runnable, but I was wondering whether there is a simpler way to do this (or how to do it simply with Runnable; right now my limited understanding suggests I'd have to restructure a lot).
Since I was asked to share the Levenshtein logic: it's really nothing fancy, but here you go:
private int _maxLSDistance = 10;

public Map.Entry<String, String> getClosestMatch(String key) {
    Map.Entry<String, String> _closestMatch = null;
    int _lsDistance = Integer.MAX_VALUE;
    if (key == null) {
        return null;
    }
    for (Map.Entry<String, String> entry : _dictionary.entrySet()) {
        // Perfect match
        if (entry.getKey().equals(key)) {
            return entry;
        }
        // Similar match
        else {
            int dist = StringUtils.getLevenshteinDistance(entry.getKey(), key);
            // If "dist" is below the threshold and smaller than the distance of the entry stored so far
            if (dist < _maxLSDistance) {
                if (_closestMatch == null || dist < _lsDistance) {
                    _closestMatch = entry;
                    _lsDistance = dist;
                }
            }
        }
    }
    return _closestMatch;
}
In order to use multithreading in your case, it could look something like this:
The "monitor" class, which basically stores the results and coordinates the threads:
public class Results {
private int nrOfDictionaries = 4;
private ArrayList<String> results = new ArrayList<String>();
public void prepare() {
nrOfDictionaries = 4;
results = new ArrayList<String>();
}
public synchronized void oneDictionaryFinished() {
nrOfDictionaries--;
System.out.println("one dictionary finished");
notifyAll();
}
public synchronized boolean isReady() throws InterruptedException {
while (nrOfDictionaries != 0) {
wait();
}
return true;
}
public synchronized void addResult(String result) {
results.add(result);
}
public ArrayList<String> getAllResults() {
return results;
}
}
The thread itself, which can be set to search a specific dictionary:
public class ThreadDictionarySearch extends Thread {
// the actual dictionary
private String dictionary;
private Results results;
public ThreadDictionarySearch(Results results, String dictionary) {
this.dictionary = dictionary;
this.results = results;
}
@Override
public void run() {
for (int i = 0; i < 4; i++) {
// search dictionary;
results.addResult("result of " + dictionary);
System.out.println("adding result from " + dictionary);
}
results.oneDictionaryFinished();
}
}
And the main method for demonstration:
public static void main(String[] args) throws Exception {
Results results = new Results();
ThreadDictionarySearch threadA = new ThreadDictionarySearch(results, "dictionary A");
ThreadDictionarySearch threadB = new ThreadDictionarySearch(results, "dictionary B");
ThreadDictionarySearch threadC = new ThreadDictionarySearch(results, "dictionary C");
ThreadDictionarySearch threadD = new ThreadDictionarySearch(results, "dictionary D");
threadA.start();
threadB.start();
threadC.start();
threadD.start();
if (results.isReady()) {
// it stays here until all dictionaries are searched,
// because in "Results" it is told to wait() while not finished
for (String string : results.getAllResults()) {
System.out.println("RESULT: " + string);
}
}
}
I think the easiest would be to use a stream over the entry set:
public DictionaryEntry getEntry(String key) {
for (int i = 0; i < _numDictionaries; i++) {
HashMap<String, String> map = getDictionary(i);
map.entrySet().parallelStream().forEach((entry) ->
{
// Calculate Levenshtein distance, store closest match etc.
}
);
}
// return closest match or null.
}
Provided you are using Java 8, of course. You could also wrap the outer loop into an IntStream, and you could use Stream.reduce (or min with a Comparator) directly to get the entry with the smallest distance.
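For illustration, a hedged sketch of the min/Comparator variant of that idea. It reuses the question's getDictionary, DictionaryEntry, _numDictionaries, _maxLSDistance and Apache Commons StringUtils, and needs java.util.Comparator and java.util.Map imports; it is a sketch, not a drop-in implementation:

public DictionaryEntry getEntry(String key) {
    DictionaryEntry best = null;
    int bestDistance = Integer.MAX_VALUE;
    for (int i = 0; i < _numDictionaries; i++) {
        // reduce each dictionary's entry set in parallel to its single closest entry
        Map.Entry<String, String> closest = getDictionary(i).entrySet()
                .parallelStream()
                .min(Comparator.comparingInt(
                        (Map.Entry<String, String> e) -> StringUtils.getLevenshteinDistance(e.getKey(), key)))
                .orElse(null);
        if (closest != null) {
            // distance is recomputed once for the winner; fine for a sketch
            int dist = StringUtils.getLevenshteinDistance(closest.getKey(), key);
            if (dist < _maxLSDistance && dist < bestDistance) {
                bestDistance = dist;
                best = new DictionaryEntry(closest.getValue(), i);
            }
        }
    }
    return best;
}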
Maybe try thread pools:
ExecutorService es = Executors.newFixedThreadPool(_numDictionaries);
for (int i = 0; i < _numDictionaries; i++) {
// prepare a Runnable implementation that contains the logic of your search
es.submit(prepared_runnable);
}
I believe you may also try to find a quick estimate for strings that cannot possibly match (e.g. a significant difference in length), and use it to finish your logic early and move on to the next candidate.
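A hedged sketch of what the prepared tasks could look like, using Callable and Future instead of a bare Runnable and reusing the es executor from above. searchDictionary is a hypothetical helper that returns the closest DictionaryEntry of a single dictionary (or null), and the surrounding method has to handle InterruptedException/ExecutionException from Future.get():

List<Future<DictionaryEntry>> futures = new ArrayList<>();
for (int i = 0; i < _numDictionaries; i++) {
    final int dictionaryIndex = i;
    // each task searches exactly one dictionary and returns its best candidate
    futures.add(es.submit(() -> searchDictionary(dictionaryIndex, key)));
}
List<DictionaryEntry> candidates = new ArrayList<>();
for (Future<DictionaryEntry> future : futures) {
    DictionaryEntry candidate = future.get();   // blocks until that dictionary is done
    if (candidate != null) {
        candidates.add(candidate);
    }
}
es.shutdown();
// pick the overall closest match from "candidates", as in the single-threaded code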
I have strong doubts that HashMaps are a suitable solution here, especially if you want fuzzy matching and stop words. You should use a proper full-text search solution like Elasticsearch or Apache Solr, or at least an embeddable engine like Apache Lucene.
That being said, you can use a poor man's version: create an array of your maps and a SortedMap, iterate over the array, take the keys of the current HashMap and store them in the SortedMap together with the index of their HashMap. To retrieve a key, you first search the SortedMap for that key, get the respective HashMap from the array using the stored index, and look the key up in only that one HashMap. That should be fast enough without needing multiple threads to dig through the HashMaps. However, you could turn the code below into a Runnable and run multiple lookups in parallel.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
public class Search {
public static void main(String[] arg) {
if (arg.length == 0) {
System.out.println("Must give a search word!");
System.exit(1);
}
String searchString = arg[0].toLowerCase();
/*
* Populating our HashMaps.
*/
HashMap<String, String> english = new HashMap<String, String>();
english.put("banana", "fruit");
english.put("tomato", "vegetable");
HashMap<String, String> german = new HashMap<String, String>();
german.put("Banane", "Frucht");
german.put("Tomate", "Gemüse");
/*
* Now we create our ArrayList of HashMaps for fast retrieval
*/
List<HashMap<String, String>> maps = new ArrayList<HashMap<String, String>>();
maps.add(english);
maps.add(german);
/*
* This is our index
*/
SortedMap<String, Integer> index = new TreeMap<String, Integer>(String.CASE_INSENSITIVE_ORDER);
/*
* Populating the index:
*/
for (int i = 0; i < maps.size(); i++) {
// We iterate through our HashMaps...
HashMap<String, String> currentMap = maps.get(i);
for (String key : currentMap.keySet()) {
/* ...and populate our index with lowercase versions of the keys,
* referencing the array from which the key originates.
*/
index.put(key.toLowerCase(), i);
}
}
// In case our index contains our search string...
if (index.containsKey(searchString)) {
/*
* ... we find out in which map of the ones stored in maps
* the word in the index originated from.
*/
Integer mapIndex = index.get(searchString);
/*
* Next, we look up said map.
*/
HashMap<String, String> origin = maps.get(mapIndex);
/*
* Last, we retrieve the value from the origin map
*/
String result = origin.get(searchString);
/*
* The above steps can be shortened to
* String result = maps.get(index.get(searchString).intValue()).get(searchString);
*/
System.out.println(result);
} else {
System.out.println("\"" + searchString + "\" is not in the index!");
}
}
}
Please note that this is a rather naive implementation only provided for illustration purposes. It doesn't address several problems (you can't have duplicate index entries, for example).
With this solution, you are basically trading startup speed for query speed.
Okay! Since your concern is to get a faster response, I would suggest dividing the work between threads.
Say you have 5 dictionaries: keep three dictionaries with one thread and let another thread take care of the remaining two.
Whichever thread finds the match can then halt or signal the other thread to stop.
You may need some extra logic to divide the work that way, but it won't hurt your overall running time.
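A hedged sketch of that early-exit idea, using a java.util.concurrent.atomic.AtomicBoolean as the shared stop flag (dictionary and key stand for the question's variables):

// shared flag that every worker checks, so the remaining threads stop
// as soon as one of them has found an exact match
AtomicBoolean matchFound = new AtomicBoolean(false);

Runnable searchTask = () -> {
    for (Map.Entry<String, String> entry : dictionary.entrySet()) {
        if (matchFound.get()) {
            return;                        // another thread already found it
        }
        if (entry.getKey().equals(key)) {
            matchFound.set(true);          // tell the other threads to stop
            // record the result somewhere thread-safe
            return;
        }
        // otherwise compute the Levenshtein distance as before
    }
};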
And you may need a few more changes in your code to get your closest match:
for (Map.Entry<String, String> entry : _dictionary.entrySet()) {
You are iterating over the entry set, but you never use the values, and getting the entry set is a bit more expensive. I would suggest you just use keySet(), since you are not really interested in the values in that map:
for (String entryKey : _dictionary.keySet()) {
For more details on the performance of maps, please read this link: Map performances
Iteration over the collection-views of a LinkedHashMap requires time proportional to the size of the map, regardless of its capacity. Iteration over a HashMap is likely to be more expensive, requiring time proportional to its capacity.
I am a beginner in Java. Basically, I have loaded each text document and stored each individual word from the document in a HashMap. After which, I tried storing all the HashMaps in an ArrayList. Now I am stuck with how to retrieve all the words in my HashMaps that are in the ArrayList!
private static long numOfWords = 0;
private String userInputString;
private static long wordCount(String data) {
long words = 0;
int index = 0;
boolean prevWhiteSpace = true;
while (index < data.length()) {
//Intialise character variable that will be checked.
char c = data.charAt(index++);
//Determine whether it is a space.
boolean currWhiteSpace = Character.isWhitespace(c);
//If previous is a space and character checked is not a space,
if (prevWhiteSpace && !currWhiteSpace) {
words++;
}
//Assign current character's determination of whether it is a spacing as previous.
prevWhiteSpace = currWhiteSpace;
}
return words;
} //
public static ArrayList StoreLoadedFiles()throws Exception{
final File f1 = new File ("C:/Users/Admin/Desktop/dataFiles/"); //specify the directory to load files
String data=""; //reset the words stored
ArrayList<HashMap> hmArr = new ArrayList<HashMap>(); //array of hashmap
for (final File fileEntry : f1.listFiles()) {
Scanner input = new Scanner(fileEntry); //load files
while (input.hasNext()) { //while there are still words in the document, continue to load all the words in a file
data += input.next();
input.useDelimiter("\t"); //similar to split function
} //while loop
String textWords = data.replaceAll("\\s+", " "); //remove all found whitespaces
HashMap<String, Integer> hm = new HashMap<String, Integer>(); //Creates a Hashmap that would be renewed when next document is loaded.
String[] words = textWords.split(" "); //store individual words into a String array
for (int j = 0; j < numOfWords; j++) {
int wordAppearCount = 0;
if (hm.containsKey(words[j].toLowerCase().replaceAll("\\W", ""))) { //replace non-word characters
wordAppearCount = hm.get(words[j].toLowerCase().replaceAll("\\W", "")); //remove non-word character and retrieve the index of the word
}
if (!words[j].toLowerCase().replaceAll("\\W", "").equals("")) {
//Words stored in hashmap are in lower case and have special characters removed.
hm.put(words[j].toLowerCase().replaceAll("\\W", ""), ++wordAppearCount);//index of word and string word stored in hashmap
}
}
hmArr.add(hm);//stores every single hashmap inside an ArrayList of hashmap
} //end of for loop
return hmArr; //return hashmap ArrayList
}
public static void LoadAllHashmapWords(ArrayList m){
for(int i=0;i<m.size();i++){
m.get(i); //stuck here!
}
Firstly, your logic won't work correctly. In the StoreLoadedFiles() method you iterate through the words like for (int j = 0; j < numOfWords; j++) {. The numOfWords field is initialized to zero, and hence this loop won't execute at all. You should initialize it with the length of the words array.
Having said that, to retrieve the values from a HashMap in a list of HashMaps, you should first iterate through the list and, for each HashMap, take its entry set. Map.Entry is basically the key/value pair that you store in the HashMap. So when you invoke the map.entrySet() method, it returns a java.util.Set<Map.Entry<Key, Value>>. A set is returned because the keys are unique.
So a complete program will look like.
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map.Entry;
import java.util.Scanner;
public class FileWordCounter {
public static List<HashMap<String, Integer>> storeLoadedFiles() {
final File directory = new File("C:/Users/Admin/Desktop/dataFiles/");
List<HashMap<String, Integer>> listOfWordCountMap = new ArrayList<HashMap<String, Integer>>();
Scanner input = null;
StringBuilder data;
try {
for (final File fileEntry : directory.listFiles()) {
input = new Scanner(fileEntry);
input.useDelimiter("\t");
data = new StringBuilder();
while (input.hasNext()) {
data.append(input.next());
}
input.close();
String wordsInFile = data.toString().replaceAll("\\s+", " ");
HashMap<String, Integer> wordCountMap = new HashMap<String, Integer>();
for(String word : wordsInFile.split(" ")){
String strippedWord = word.toLowerCase().replaceAll("\\W", "");
int wordAppearCount = 0;
if(strippedWord.length() > 0){
if(wordCountMap.containsKey(strippedWord)){
wordAppearCount = wordCountMap.get(strippedWord);
}
wordCountMap.put(strippedWord, ++wordAppearCount);
}
}
listOfWordCountMap.add(wordCountMap);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} finally {
if(input != null) {
input.close();
}
}
return listOfWordCountMap;
}
public static void loadAllHashmapWords(List<HashMap<String, Integer>> listOfWordCountMap) {
for(HashMap<String, Integer> wordCountMap : listOfWordCountMap){
for(Entry<String, Integer> wordCountEntry : wordCountMap.entrySet()){
System.out.println(wordCountEntry.getKey() + " - " + wordCountEntry.getValue());
}
}
}
public static void main(String[] args) {
List<HashMap<String, Integer>> listOfWordCountMap = storeLoadedFiles();
loadAllHashmapWords(listOfWordCountMap);
}
}
Since you are a beginner in Java programming, I would like to point out a few best practices that you could start using from the beginning.
Closing resources: In your while loop to read from files you open a Scanner like Scanner input = new Scanner(fileEntry);, but you never close it. This causes resource leaks. You should always use a try-catch-finally block and close resources in the finally block.
Avoid unnecessary redundant calls: If an operation stays the same on every iteration of a loop, try moving it outside the loop to avoid redundant calls. In your case, for example, setting the scanner delimiter with input.useDelimiter("\t"); is essentially a one-time operation after a scanner is initialized, so you could move it outside the while loop.
Use StringBuilder instead of String: Repeated string manipulation such as concatenation should be done with a StringBuilder (or StringBuffer when you need synchronization) instead of += or +. This is because String is immutable, meaning its value cannot be changed, so each concatenation creates a new String object. This results in a lot of unused instances in memory, whereas a StringBuilder is mutable and its value can be changed.
Naming convention: The usual naming convention in Java is to start method names with a lower-case letter and capitalize the first letter of each subsequent word. So it is standard practice to name a method storeLoadedFiles as opposed to StoreLoadedFiles. (This could be opinion based ;))
Give descriptive names: It's good practice to give descriptive names; it helps later code maintenance. It's better to use a name like wordCountMap as opposed to hm. If someone goes through your code in the future, they'll understand it better and faster with descriptive names. Again opinion based.
Use generics as much as possible: This avoids additional casting overhead.
Avoid repetition: Similar to point 2, if an operation produces the same output and is needed multiple times, move it to a variable and use the variable. In your case you were using words[j].toLowerCase().replaceAll("\\W", "") multiple times. The result is the same every time, but it creates unnecessary instances and repetition, so you could move it to a String and use that String elsewhere.
Try using a for-each loop wherever possible: It relieves us from taking care of indexing.
These are just suggestions. I tried to include most of them in my code, but I won't say it's perfect. Since you are a beginner, if you start using these best practices now they'll get ingrained in you. Happy coding.. :)
for (HashMap<String, Integer> map : m) {
for(Entry<String,Integer> e:map.entrySet()){
//your code here
}
}
or, if using Java 8, you can play with lambdas:
m.stream().forEach((map) -> {
map.entrySet().stream().forEach((e) -> {
//your code here
});
});
But before all of this you have to change the method signature to public static void LoadAllHashmapWords(List<HashMap<String, Integer>> m), otherwise you would have to use a cast.
P.S. Are you sure your extraction method works? I've tested it a bit and got a list of empty hashmaps every time.
I am decomposing a series of 90,000+ strings into a discrete list of the individual, non-duplicated pairs of words that are included in the strings with the rxcui id values associated with each string. I have developed a method which tries to accomplish this, but it is producing a lot of redundancy. Analysis of the data shows there are about 12,000 unique words in the 90,000+ source strings, after I clean and format the contents of the strings.
How can I change the code below so that it avoids creating the redundant rows in the destination 2D ArrayList (shown below the code)?
public static ArrayList<ArrayList<String>> getAllWords(String[] tempsArray){//int count = tempsArray.length;
int fieldslenlessthan2 = 0;//ArrayList<String> outputarr = new ArrayList<String>();
ArrayList<ArrayList<String>> twoDimArrayList= new ArrayList<ArrayList<String>>();
int idx = 0;
for (String s : tempsArray) {
String[] fields = s.split("\t");//System.out.println(" --- fields.length is: "+fields.length);
if(fields.length>1){
ArrayList<String> row = new ArrayList<String>();
System.out.println("fields[0] is: "+fields[0]);
String cleanedTerms = cleanTerms(fields[1]);
String[] words = cleanedTerms.split(" ");
for(int j=0;j<words.length;j++){
String word=words[j].trim();
word = word.toLowerCase();
if(isValidWord(word)){//outputarr.add(word);
System.out.println("words["+j+"] is: "+word);
row.add(word_id);//WORD_ID NEEDS TO BE CREATED BY SOME METHOD.
row.add(fields[0]);
row.add(word);
twoDimArrayList.add(row);
idx += 1;
}
}
}else{fieldslenlessthan2 += 1;}
}
System.out.println("........... fieldslenlessthan2 is: "+fieldslenlessthan2);
return twoDimArrayList;
}
The output of the above method currently looks like the following, with many rxcui values for some name values, and with many name values for some rxcui:
How do I change the code above so that the output is a list of unique pairs of name/rxcui values, summarizing all relevant data from the current output while removing only the redundancies?
If you just need a Collection of all words, use a HashSet. Sets are primarily used for contains logic. If you need to associate a value with your string, use a HashMap.
public HashSet<String> getUniqueWords(String[] stringArray) {
HashSet<String> uniqueWords = new HashSet<String>();
for (String str : stringArray) {
uniqueWords.add(str);
}
return uniqueWords;
}
This will give you a collection of all the unique Strings in your array. If you need an ID use a HashMap
String[] strList; // your String array
int idCounter = 0;
HashMap<String, Integer> stringIDMap = new HashMap<String, Integer>();
for (String str : strList) {
if (!stringIDMap.containsKey(str)) {
stringIDMap.put(str, new Integer(idCounter));
idCounter++;
}
}
This will provide you a HashMap with unique String keys and unique Integer values. To get an id for a String you do this:
stringIDMap.get("myString"); // returns the Integer ID associated with the String "myString"
UPDATE
Based on the question update from the OP, I recommend creating an object that holds the String value and the rxcui. You can then place these in a Set or HashMap using an implementation similar to the one provided above.
public MyObject(String str, int rxcui); // The constructor for your new object
MyObject mo1 = new MyObject("hello", 5);
Either
mySet.add(mo1);
will work, or
myMap.put(mo1.getStr(), mo1.getRxcui());
What is the purpose of the unique word ID? Is the word itself not unique enough since you are not keeping duplicates?
A very basic way would be to keep a counter going as you are checking new words. For each word that doesn't already exist you could increase the counter and use the new value as the unique id.
Lastly, might I suggest you use a HashMap instead. It would allow you to both insert and retrieve words in O(1) time. I am not entirely sure what you are going for, but I think the HashMap might give you more range.
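A minimal sketch of that counter idea (cleanedWords is a placeholder for whatever cleaned word sequence your parsing produces):

Map<String, Integer> wordIds = new HashMap<>();
int nextId = 0;
for (String word : cleanedWords) {
    if (!wordIds.containsKey(word)) {
        wordIds.put(word, nextId++);   // only a previously unseen word gets a new id
    }
}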
Edit2:
It would be something a little more along these lines. This should help you out.
public static Set<DataPair> getAllWords(String[] tempsArray) {
    Set<DataPair> set = new HashSet<>();
    for (String row : tempsArray) {
        // PARSE YOUR STRING DATA
        // the way you were doing it seemed fine, but something like this
        String[] rowArray = row.split(" ");
        String word = rowArray[1];
        int id = Integer.parseInt(rowArray[0]);
        DataPair pair = new DataPair(word, id);
        set.add(pair);
    }
    return set;
}
class DataPair {
    private String word;
    private int id;

    public DataPair(String word, int id) {
        this.word = word;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof DataPair) {
            return ((DataPair) o).word.equals(word) && ((DataPair) o).id == id;
        }
        return false;
    }

    // hashCode must be consistent with equals, otherwise a HashSet cannot deduplicate pairs
    @Override
    public int hashCode() {
        return 31 * word.hashCode() + id;
    }
}
I am trying to create a program that detects if multiple words are in a string as fast as possible, and if so, executes a behavior. Preferably, I would like it to detect the order of these words too but only if this can be done fast. So far, this is what I have done:
if (input.contains("adsf") && input.contains("qwer")) {
execute();
}
As you can see, doing this for multiple words would become tiresome. Is this the only way or is there a better way of detecting multiple substrings? And is there any way of detecting order?
I'd create a regular expression from the words:
Pattern pattern = Pattern.compile("(?=.*adsf)(?=.*qwer)");
if (pattern.matcher(input).find()) {
execute();
}
For more details, see this answer: https://stackoverflow.com/a/470602/660143
Editors note: Despite being heavily upvoted and accepted, this does not function the same as the code in the question. execute is called on the first match, like a logical OR.
You could use an array:
String[] matches = new String[] {"adsf", "qwer"};
boolean found = true;
for (String s : matches)
{
    if (!input.contains(s))
    {
        found = false;
        break;
    }
}
if (found)
{
    execute();
}
This is as efficient as the code you posted but more maintainable. Looking for a more efficient solution sounds like a micro-optimization that should be ignored until proven to be a bottleneck in your code; in any case, with a huge string set the solution could be a trie.
In Java 8 you could do
public static boolean containsWords(String input, String[] words) {
return Arrays.stream(words).allMatch(input::contains);
}
Sample usage:
String input = "hello, world!";
String[] words = {"hello", "world"};
if (containsWords(input, words)) System.out.println("Match");
This is a classical interview and CS problem.
The Rabin-Karp algorithm is usually what people first talk about in interviews. The basic idea is that as you go through the string, you maintain a rolling hash of the current window, adding the incoming character and dropping the outgoing one. If the hash matches the hash of one of your match strings, you verify and report a match. This avoids having to scan back and forth between the text and your match strings.
https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
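For illustration, a minimal single-pattern sketch of the rolling-hash idea (not the multi-pattern variant, where you would hash all match strings of one length up front):

// Rolling polynomial hash with base B and modulus M; a hash hit is confirmed
// with an explicit character comparison to rule out collisions.
static boolean rabinKarpContains(String haystack, String needle) {
    final long B = 256, M = 1_000_000_007L;
    int n = haystack.length(), m = needle.length();
    if (m == 0) return true;
    if (n < m) return false;

    long needleHash = 0, windowHash = 0, highPow = 1;
    for (int i = 0; i < m - 1; i++) highPow = (highPow * B) % M;   // B^(m-1) mod M
    for (int i = 0; i < m; i++) {
        needleHash = (needleHash * B + needle.charAt(i)) % M;
        windowHash = (windowHash * B + haystack.charAt(i)) % M;
    }
    for (int i = 0; ; i++) {
        if (windowHash == needleHash && haystack.regionMatches(i, needle, 0, m)) {
            return true;                               // confirmed match at position i
        }
        if (i + m >= n) return false;                  // no room left to slide the window
        // slide the window: drop haystack[i], append haystack[i + m]
        windowHash = (windowHash - haystack.charAt(i) * highPow % M + M) % M;
        windowHash = (windowHash * B + haystack.charAt(i + m)) % M;
    }
}

Usage: rabinKarpContains("find adsf here", "adsf") returns true.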
Another typical topic for this interview question is using a trie structure to speed up the lookup. With a large set of match strings you would otherwise have to test each one at every position; a trie makes that check more efficient.
https://en.wikipedia.org/wiki/Trie
Additional algorithms are:
- Aho–Corasick https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
- Commentz-Walter https://en.wikipedia.org/wiki/Commentz-Walter_algorithm
If you have a lot of substrings to look up, then a regular expression probably isn't going to be much help, so you're better off putting the substrings in a list, then iterating over them and calling input.indexOf(substring) on each one. This returns the int index where the substring was found. If you put each result (except -1, which means the substring wasn't found) into a TreeMap (with the index as the key and the substring as the value), then you can retrieve the substrings in order of appearance by iterating over the map's keySet().
Map<Integer, String> substringIndices = new TreeMap<Integer, String>();
List<String> substrings = new ArrayList<String>();
substrings.add("asdf");
// etc.
for (String substring : substrings) {
int index = input.indexOf(substring);
if (index != -1) {
substringIndices.put(index, substring);
}
}
for (Integer index : substringIndices.keySet()) {
System.out.println(substringIndices.get(index));
}
Use a tree structure to hold the substrings per code point. This eliminates the need to compare every needle separately at each position of the haystack.
Note that this is efficient only if the needle set is almost constant. Individual additions or removals of substrings are not a problem, but rebuilding the tree from scratch for a different set of strings each time would definitely slow it down.
StringSearcher.java:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class StringSearcher {

    private NeedleTree needles = new NeedleTree(-1);
    private boolean caseSensitive;
    private List<Integer> lengths = new ArrayList<>();
    private int maxLength;

    public StringSearcher(List<String> inputs, boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
        for (String input : inputs) {
            if (!lengths.contains(input.length())) {
                lengths.add(input.length());
            }
            NeedleTree tree = needles;
            for (int i = 0; i < input.length(); i++) {
                tree = tree.child(caseSensitive ? input.codePointAt(i)
                        : Character.toLowerCase(input.codePointAt(i)));
            }
            tree.markSelfSet();
        }
        maxLength = Collections.max(lengths);
    }

    public boolean matches(String haystack) {
        if (!caseSensitive) {
            haystack = haystack.toLowerCase();
        }
        for (int i = 0; i < haystack.length(); i++) {
            // walk the tree from this position; maybe we can even skip the
            // substring copy and read from haystack directly?
            String substring = haystack.substring(i, Math.min(i + maxLength, haystack.length()));
            NeedleTree tree = needles;
            for (int j = 0; j < substring.length(); j++) {
                tree = tree.childOrNull(substring.codePointAt(j));
                if (tree == null) {
                    break;
                }
                if (tree.isSelfSet()) {
                    return true;
                }
            }
        }
        return false;
    }
}
NeedleTree.java:
import java.util.HashMap;
import java.util.Map;

class NeedleTree {

    private int codePoint;
    private boolean selfSet;
    private Map<Integer, NeedleTree> children = new HashMap<>();

    public NeedleTree(int codePoint) {
        this.codePoint = codePoint;
    }

    public NeedleTree childOrNull(int codePoint) {
        return children.get(codePoint);
    }

    public NeedleTree child(int codePoint) {
        NeedleTree child = children.get(codePoint);
        if (child == null) {
            // Map.put returns the previous value (null here), so create the node first
            child = new NeedleTree(codePoint);
            children.put(codePoint, child);
        }
        return child;
    }

    public boolean isSelfSet() {
        return selfSet;
    }

    public void markSelfSet() {
        selfSet = true;
    }
}
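A short usage sketch for the two classes above (plus java.util.Arrays and java.util.List imports):

List<String> needles = Arrays.asList("adsf", "qwer");
StringSearcher searcher = new StringSearcher(needles, false);    // case-insensitive
System.out.println(searcher.matches("input with QWER inside"));  // true: "qwer" occurs
System.out.println(searcher.matches("nothing to see here"));     // false

Note that matches returns true as soon as any one of the substrings is found; combining the results for all substrings (the AND case from the question) is left to the caller.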
I think another approach would be something like this, where the values are checked one by one with indexOf, which also tells you the position (and therefore the order) of each match; -1 means the value is not present:
String s = "123";
System.out.println(s.indexOf("1")); // 0
System.out.println(s.indexOf("2")); // 1
System.out.println(s.indexOf("5")); // -1
I have this input:
5
it
your
reality
real
our
The first line is the number of strings coming after it. And I should store them this way (pseudocode):
associative_array = [ 2 => ['it'], 3 => ['our'], 4 => ['real', 'your'], 7 => ['reality']]
As you can see, the keys of the associative array are the lengths of the strings stored in the inner arrays.
So how can I do this in Java? I come from the PHP world, so a comparison with PHP would be very welcome.
MultiMap<Integer, String> m = new MultiHashMap<Integer, String>();
for(String item : originalCollection) {
m.put(item.length(), item);
}
djechlin already posted a better version, but here's a complete standalone example using just JDK classes:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
public class Main {
public static void main(String[] args) throws Exception{
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
String firstLine = reader.readLine();
int numOfRowsToFollow = Integer.parseInt(firstLine);
Map<Integer,Set<String>> stringsByLength = new HashMap<>(numOfRowsToFollow); //worst-case size
for (int i=0; i<numOfRowsToFollow; i++) {
String line = reader.readLine();
int length = line.length();
Set<String> alreadyUnderThatLength = stringsByLength.get(length); //int boxed to Integer
if (alreadyUnderThatLength==null) {
alreadyUnderThatLength = new HashSet<>();
stringsByLength.put(length, alreadyUnderThatLength);
}
alreadyUnderThatLength.add(line);
}
System.out.println("results: "+stringsByLength);
}
}
its output looks like this:
3
bob
bart
brett
results: {4=[bart], 5=[brett], 3=[bob]}
Java doesn't have associative arrays, but it does have HashMap, which mostly accomplishes the same goal. In your case, you can have multiple values for any given key, so what you can do is make each entry in the HashMap an array or a collection of some kind; ArrayList is a likely choice. That is:
HashMap<Integer, ArrayList<String>> words = new HashMap<Integer, ArrayList<String>>();
I'm not going to go through the code to read your list from a file or whatever, that's a different question. But just to give you the idea of how the structure would work, suppose we could hard-code the list. We could do it something like this:
ArrayList<String> set = new ArrayList<String>();
set.add("it");
words.put(Integer.valueOf(2), set);

set = new ArrayList<String>();   // don't clear(): the map still references the old list
set.add("your");
set.add("real");
words.put(Integer.valueOf(4), set);
Etc.
In practice, you probably would regularly be adding words to an existing set. I often do that like this:
void addWord(String word)
{
Integer key=Integer.valueOf(word.length());
ArrayList<String> set=words.get(key);
if (set==null)
{
set=new ArrayList<String>();
words.put(key,set);
}
// either way we now have a set
set.add(word);
}
Side note: I often see programmers end a block like this by putting "set" back into the Hashmap, i.e. "words.put(key,set)" at the end. This is unnecessary: it's already there. When you get "set" from the Hashmap, you're getting a reference, not a copy, so any updates you make are just "there", you don't have to put it back.
Disclaimer: This code is off the top of my head. No warranties expressed or implied. I haven't written any Java in a while so I may have syntax errors or wrong function names. :-)
As your key appears to be a small integer, you could also use a list of lists. In this case the simplest solution is a hand-rolled MultiMap like
Map<Integer, Set<String>> stringsByLength = new LinkedHashMap<>();
for (String s : strings) {
    Integer len = s.length();
    Set<String> set = stringsByLength.get(len);
    if (set == null)
        stringsByLength.put(len, set = new LinkedHashSet<>());
    set.add(s);
}
private HashMap<Integer, List<String>> map = new HashMap<Integer, List<String>>();
void addStringToMap(String s) {
int length = s.length();
if (map.get(length) == null) {
map.put(length, new ArrayList<String>());
}
map.get(length).add(s);
}
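And a hedged usage sketch for this helper, reading the question's input format inside the same class (the count on the first line, then one word per line; needs java.util.Scanner):

Scanner in = new Scanner(System.in);
int count = Integer.parseInt(in.nextLine().trim());
for (int i = 0; i < count; i++) {
    addStringToMap(in.nextLine().trim());
}
System.out.println(map);   // e.g. {2=[it], 3=[our], 4=[your, real], 7=[reality]} (order not guaranteed)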