Iterate through only part of a large list in java - java

I'm trying to make a Boggle game in Java, and for my program once I randomize the board I have a method which iterates through the possible combinations and compares each one to a dictionary list to check if it's a valid word, and if yes, I put it in the key. It works fine, however the program takes three or four minutes to generate the key, which is mostly due to the size of the dictionary. The one I'm using has about 19k words and comparing every combination takes up a ton of time. Here's the part of the code I'm trying to make faster:
if (str.length()>3&&!key.contains(str)&&prefixes.contains(str.substring(0,3))&&dictionary.contains(str)){
key.add(str);
}
where str is the combination generated. prefixes is a list I generated based on dictionary that goes like this:
public void buildPrefixes(){
for (String word:dictionary){
if(!prefixes.contains(word.substring(0,3))){
prefixes.add(word.substring(0,3));
}
}
}
which just adds all the three-letter prefixes in the dictionary, such as "abb" and "mar", so that when str is gibberish like "xskfjh" it won't get checked against the whole dictionary, just against prefixes, which has something like 1k entries.
What I'm trying to do is cut down on time by iterating through only the words in the dictionary that have the same first letter as str, so if str is "abbey" then it will only check str against words that start with "a" instead of the whole list, which would cut down on time significantly. Or even better, it only checks str against words that have the same prefix. I am pretty new to Java so I would really appreciate if you're very descriptive in your answers, thanks!

What the comments are trying to say is: do not reinvent the wheel. Java is not assembler or C, and it is powerful enough to handle such trivial cases.
Here is some simple code which shows that a plain Set can handle your vocabulary easily:
import java.util.Set;
import java.util.TreeSet;
public class Work {
public static void main(String[] args) {
long startTime=System.currentTimeMillis();
Set<String> allWords=new TreeSet<String>();
for (int i=0; i<20000;i++){
allWords.add(getRandomWord());
}
System.out.println("Total words "+allWords.size()+" in "+(System.currentTimeMillis()-startTime)+" milliseconds");
}
static String getRandomWord() {
int length=3+(int)(Math.random()*10);
String r = "";
for(int i = 0; i < length; i++) {
r += (char)(Math.random() * 26 + 97);
}
return r;
}
}
On my computer it shows
Total words 19875 in 47 milliseconds
As you can see, 125 words out of 20,000 were duplicates. And that time covers not only generating 20,000 words in a very inefficient way, but also storing them and checking for duplicates.
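Applied to the Boggle code from the question, the same idea might look like the sketch below. It is only an illustration under assumptions: the class name BoggleLookup and the method isValidWord are made up here, and the dictionary is assumed to fit in memory. Both the dictionary and the three-letter prefixes live in a HashSet, so each contains call is a hash lookup instead of a scan over a list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: names are illustrative, not from the original program.
class BoggleLookup {
    private final Set<String> dictionary;
    private final Set<String> prefixes = new HashSet<>();

    BoggleLookup(List<String> words) {
        dictionary = new HashSet<>(words);
        for (String word : dictionary) {
            // Guard added here: substring(0, 3) would throw on words shorter than 3.
            if (word.length() >= 3) {
                prefixes.add(word.substring(0, 3)); // a Set ignores duplicates by itself
            }
        }
    }

    // Same test as in the question, but every lookup is a hash lookup.
    boolean isValidWord(String str) {
        return str.length() > 3
                && prefixes.contains(str.substring(0, 3))
                && dictionary.contains(str);
    }

    public static void main(String[] args) {
        BoggleLookup lookup = new BoggleLookup(Arrays.asList("abbey", "mark", "marble", "cat"));
        Set<String> key = new TreeSet<>(); // a Set also makes the !key.contains(str) check unnecessary
        for (String str : Arrays.asList("abbey", "xskfjh", "mark", "abbey", "cat")) {
            if (lookup.isValidWord(str)) {
                key.add(str);
            }
        }
        System.out.println(key); // prints [abbey, mark]
    }
}
```

With a hash-based dictionary the prefix check is mostly redundant, but it is kept so the condition stays recognizable next to the original.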

Related

NZEC error in Hackerearth problem in java

I'm trying the solve this hacker earth problem https://www.hackerearth.com/practice/basic-programming/input-output/basics-of-input-output/practice-problems/algorithm/anagrams-651/description/
I have tried searching through the internet but couldn't find the ideal solution to solve my problem
This is my code:
String a = new String();
String b = new String();
a = sc.nextLine();
b = sc.nextLine();
int t = sc.nextInt();
int check = 0;
int againCheck =0;
for (int k =0; k<t; k++)
{
for (int i =0; i<a.length(); i++)
{
char ch = a.charAt(i);
for (int j =0; j<b.length(); j++)
{
check =0;
if (ch != b.charAt(j))
{
check=1;
}
}
againCheck += check;
}
}
System.out.println(againCheck*againCheck);
I expect the output to be 4, but it is showing the "NZEC" error
Can anyone help me, please?
The requirements state [1] that the input is a number (N) followed by 2 x N lines. Your code is reading two strings followed by a number. It is probably throwing an InputMismatchException when it attempts to parse the 3rd line of input as a number.
Hints:
It pays to read the requirements carefully.
Read this article on CodeChef about how to debug a NZEC: https://discuss.codechef.com/t/tutorial-how-to-debug-an-nzec-error/11221. It explains techniques such as catching exceptions in your code and printing out a Java stacktrace so that you can see what is going wrong.
[1] Admittedly, the requirements are not crystal clear. But in the sample input the first line is a number.
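A minimal sketch of reading the input in that order (the count first, then two lines per test case) could look like this. It rests on assumptions: the class and method names are made up, the strings are assumed to contain only lowercase a-z, and the answer is taken to be the number of deletions needed to make the two strings anagrams. The try/catch in main prints a stack trace, which is the debugging technique described in the CodeChef article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical class; names are illustrative only.
class AnagramInput {
    // Deletions needed to make a and b anagrams. Assumes lowercase a-z only.
    static int deletions(String a, String b) {
        int[] counts = new int[26];
        for (char c : a.toCharArray()) counts[c - 'a']++;
        for (char c : b.toCharArray()) counts[c - 'a']--;
        int total = 0;
        for (int n : counts) total += Math.abs(n);
        return total;
    }

    // Reads the number of test cases first, then two lines per test case.
    static List<Integer> solve(Scanner sc) {
        int t = Integer.parseInt(sc.nextLine().trim());
        List<Integer> results = new ArrayList<>();
        for (int k = 0; k < t; k++) {
            String a = sc.nextLine().trim();
            String b = sc.nextLine().trim();
            results.add(deletions(a, b));
        }
        return results;
    }

    public static void main(String[] args) {
        try {
            for (int r : solve(new Scanner(System.in))) {
                System.out.println(r);
            }
        } catch (Exception e) {
            e.printStackTrace(); // shows what went wrong instead of a bare NZEC
        }
    }
}
```

For the input lines 1, cde, abc this prints 4, which matches the output expected in the question.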
As I've written in other answers as well, it is best to write your code like this when submitting on sites:
def myFunction():
    try:
        pass  # MY LOGIC HERE
    except Exception as E:
        print("ERROR Occurred : {}".format(E))
This will clearly show you what error you are facing in each test case. For a site like hacker earth, that has several input problems in various test cases, this is a must.
Coming to your question, NZEC stands for : NON ZERO EXIT CODE
This could mean any and everything from input error to server earthquake.
Regardless of hacker-whatsoever.com, I am going to give you two useful things:
An easier algorithm, so you can code it yourself, because your algorithm will not work as you expect;
A Java 8+ solution with a totally different algorithm, more complex but more efficient.
SIMPLE ALGORITHM
In your solution you have a typical double for loop that you use to check whether every char in a is also in b. That part is good, but the rest can be discarded. Try to implement this:
For each character of a, find the first occurrence of that character in b.
If there is a match, remove that character from both a and b.
The number of remaining characters in both strings is the number of deletions you have to perform to turn them into strings that have the same characters, aka anagrams. So, return the sum of the lengths of a and b.
NOTE: It is important that you keep track of what you have already encountered: with your approach you would have counted the same character several times!
As you can see, it's just pseudocode for a naive algorithm, a hint to help you with your studying. In fact this algorithm has a worst-case complexity of O(n^2) (because of the nested loop), which is generally bad. Now, a better solution.
BETTER SOLUTION
My algorithm is just O(n). It works this way:
I build a map. (If you don't know what that is, put simply it's a data structure that stores key-value pairs.) In this case the keys are characters, and the values are integer counters bound to the respective character.
Every time a character is found in a, its counter increases by 1;
Every time a character is found in b, its counter decreases by 1;
Now every counter represents the difference between the number of times its character is present in a and in b. So, the sum of the absolute values of the counters is the solution!
To implement it, I actually add an entry to the map whenever I find a character for the first time, instead of pre-constructing a map with the whole alphabet. I also made heavy use of lambda expressions, to give you a very different perspective.
Here's the code:
import java.util.HashMap;
public class HackerEarthProblemSolver {
private static final String a = "", //replace "" with your first input string
b = ""; //replace "" with your second input string
static int sum = 0; //the result, must be static because it is modified inside a lambda
public static void main (String[] args){
HashMap<Character,Integer> map = new HashMap<>(); //creating the map
for (char c: a.toCharArray()){ //for each character in a
map.computeIfPresent(c, (k,i) -> i+1); //+1 to its counter
map.computeIfAbsent(c , k -> 1); //initialize its counter to 1 (0+1)
}
for (char c: b.toCharArray()){ //for each character in b
map.computeIfPresent(c, (k,i) -> i-1); //-1 to its counter
map.computeIfAbsent(c , k -> -1); //initialize its counter to -1 (0-1)
}
map.forEach((k,i) -> sum += Math.abs(i) ); //summing the absolute values of the counters
System.out.println(sum);
}
}
Basically both solutions just count how many letters the two strings have in common, but with different approaches.
Hope I helped!

Trying every letter/word combination possible in Java

I'm trying to create a program, that will "create" a series of characters over and over, and compare them to a keyword (unknown to the user or computer). This is very similar to a "brute force" attack if you will, except this will logically build out every single letter it can.
The other thing, is that I've temporarily built this code to handle JUST 5 letter words, and have it broken out into a "value" 2D string array. I have this as a very temporary solution, to help logically discover what it is that my code is doing, before I throw it into super-dynamic and complex for-loops.
public class Sample{
static String key, keyword = "hello";
static String[] list = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","1","2","3","3","4","5","6","7","8","9"};
int keylen = 5; // Eventually, this will be thrown into a for-loop, to get dynamic "keyword" sizes. (Will test to every word, more/less than 5 characters eventually)
public static void main(String[] args) {
String[] values = {"a", "a", "a", "a", "a"}; // More temporary hardcodes. If I can figure out the for loop, the rest can be set to dynamic values.
int changeout_pos = 0;
int counter = 0;
while(true){
if (counter == list.length){ counter = 0; changeout_pos++; } // Swap out each letter we have in list, in every position once.
// Try to swap them. (Try/catch is temporary lazy way of forcing the computer to say "we've gone through all possible combinations")
try { values[changeout_pos] = list[counter]; } catch (Exception e) { break; }
// Add up all the values in their respectful positions. Again, will be dynamic (and in a for-loop) once figured out.
key = values[0] + values[1] + values[2] + values[3] + values[4];
System.out.println(key); // Temporarily print it.
if (key.equalsIgnoreCase(keyword)){ break; } // If it matches our lovely keyword, then we're done. We've done it!
counter ++; // Try another letter.
}
System.out.println("Done! \nThe keyword was: " + key); // Should return what "Keyword" is.
}
}
My goal is to have the output look like this: (For five letter example)
aaaaa
aaaab
aaaac
...
aaaba
aaabb
aaabc
aaabd
...
aabaa
aabab
aabac
...
So on and so forth. By running this code now however, it is not what I was hoping for. Now, it will go:
aaaaa
baaaa
caaaa
daaaa
... (through until 9)
9aaaa
9baaa
9caaa
9daaa
...
99aaa
99baa
99caa
99daa
... (Until it hits 99999 without finding the "keyword")
Any help appreciated. I'm really struggling to solve this puzzle.
First of all, your alphabet is missing 0 (zero) and z. It also has 3 twice.
Second, the number of five-letter words using 36 possible characters is 60,466,176. The formula is (size of alphabet)^(length of word); in this case, that is 36^5. I ran your code, and it's only generating 176 permutations.
On my machine, with a basic implementation of five nested for loops, each iterating over the alphabet, it took 144 seconds to generate and print all the permutations. So, if you're getting quick results, you should check what's being generated.
Of course, manually nesting for loops isn't a valid solution for when you want the length of the word to be variable, so you still have some work to do. However, my advice would be to pay attention to the details and validate your assumptions!
Good luck.
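For the variable-length case, one common way to avoid hand-nested loops is to treat the word as an odometer: an array of indexes into the alphabet where the rightmost position advances first and carries to the left when it wraps. The sketch below is only an illustration; the class and method names are made up, and it collects the words in a list purely so that a tiny alphabet can be printed. With 36 characters and length 5 you would compare each candidate to the keyword inside the loop instead of storing 60 million strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class; an illustration of odometer-style enumeration.
class Odometer {
    // Returns every word of the given length, with the rightmost position changing fastest.
    static List<String> allWords(char[] alphabet, int length) {
        List<String> out = new ArrayList<>();
        int[] idx = new int[length];   // idx[i] is the alphabet index used at position i
        char[] buf = new char[length];
        while (true) {
            for (int i = 0; i < length; i++) {
                buf[i] = alphabet[idx[i]];
            }
            out.add(new String(buf));
            // Advance the rightmost position; on wrap-around reset it and carry left.
            int pos = length - 1;
            while (pos >= 0 && ++idx[pos] == alphabet.length) {
                idx[pos] = 0;
                pos--;
            }
            if (pos < 0) {
                break; // every position wrapped, so all combinations were produced
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allWords("abc".toCharArray(), 2));
        // prints [aa, ab, ac, ba, bb, bc, ca, cb, cc]
    }
}
```

The number of words produced is (size of alphabet)^(length of word), which is 3^2 = 9 in this small example.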

Looping through an ArrayList with another Arraylist in Java

I have a large array list of sentences and another array list of words.
My program loops through the array list and removes an element from that array list if the sentence contains any of the words from the other.
The sentences array list can be very large, and I coded a quick and dirty nested for loop. While this works for when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.
for (int i = 0; i < SENTENCES.size(); i++) {
for (int k = 0; k < WORDS.size(); k++) {
if (SENTENCES.get(i).contains(" " + WORDS.get(k) + " ") == true) {
//Do something
}
}
}
Is there a more efficient way of doing this than a nested for loop?
There are a few inefficiencies in your code, but at the end of the day, if you've got to search for sentences containing words then there's no getting away from loops.
That said, there are a couple of things to try.
First, make WORDS a HashSet. Its contains method will be far quicker than an ArrayList's because it does a hash look-up to get the value.
Second, switch the logic about a bit like this:
Iterator<String> sentenceIterator = SENTENCES.iterator();
sentenceLoop:
while (sentenceIterator.hasNext())
{
String sentence = sentenceIterator.next();
for (String word : sentence.replaceAll("\\p{P}", " ").toLowerCase().split("\\s+"))
{
if (WORDS.contains(word))
{
sentenceIterator.remove();
continue sentenceLoop;
}
}
}
This code (which assumes you're trying to remove sentences that contain certain words) uses Iterators and avoids the string concatenation and parsing logic you had in your original code (replacing it with a single regex) both of which should be quicker.
But bear in mind that, as with all things performance, you'll need to test these changes to see whether they improve the situation.
What you must change is the way you handle the removal of the data. This is noted in this part of the explanation of your problem:
The sentences array list can be very large (...). While this works for when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.
The cause of this is that removal time in ArrayList takes O(N), and since you're doing this inside a loop, then it will take at least O(N^2).
I recommend using LinkedList rather than ArrayList to store the sentences, and use Iterator rather than your naive List#get since it already offers Iterator#remove in time O(1) for LinkedList.
In case you cannot change the design to LinkedList, I recommend storing the sentences that are valid in a new List, and at the end replacing the contents of your original List with this new List, thus saving a lot of time.
Apart from this big improvement, you can improve the algorithm even more by using a Set to store the words to lookup rather than using another List since the lookup in a Set is O(1).
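The "collect the valid sentences in a new List" variant, combined with a Set for the words, might be sketched as follows. The names are illustrative, the words are assumed to be lowercase, and sentences are split on non-word characters, which is a simplification compared with the " word " check in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class; names are illustrative only.
class SentenceFilter {
    // Returns the sentences that contain none of the given (lowercase) words.
    static List<String> keepClean(List<String> sentences, Set<String> words) {
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            boolean hit = false;
            for (String token : sentence.toLowerCase().split("\\W+")) {
                if (words.contains(token)) { // hash lookup, not a scan over a list
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                kept.add(sentence); // appending to an ArrayList is cheap, unlike removing
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sentences = new ArrayList<>(Arrays.asList(
                "The quick brown fox.", "A lazy dog sleeps.", "Nothing to see here."));
        Set<String> words = new HashSet<>(Arrays.asList("fox", "dog"));
        List<String> kept = keepClean(sentences, words);
        // Replace the contents of the original list in one go.
        sentences.clear();
        sentences.addAll(kept);
        System.out.println(sentences); // prints [Nothing to see here.]
    }
}
```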
What you could do is put all your words into a HashSet. This allows you to check if a word is in the set very quickly. See https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html for documentation.
HashSet<String> wordSet = new HashSet<>();
for (String word : WORDS) {
wordSet.add(word);
}
Then it's just a matter of splitting each sentence into the words that make it up, and checking if any of those words are in the set.
for (String sentence : SENTENCES) {
String[] sentenceWords = sentence.split(" "); // You probably want to use a regex here instead of just splitting on a " ", but this is just an example.
for (String word : sentenceWords) {
if (wordSet.contains(word)) {
// The sentence contains one of the special words.
// DO SOMETHING
break;
}
}
}
I will create a set of words from second ArrayList:
Set<String> listOfWords = new HashSet<String>();
listOfWords.add("one");
listOfWords.add("two");
I will then iterate over the set and the first ArrayList and use Contains:
for (String word : listOfWords) {
for(String sentence : Sentences) {
if (sentence.contains(word)) {
// do something
}
}
}
Also, if you are free to use any open source jar, check this out:
searching string in another string
First, your program has a bug: it would not count words at the beginning and at the end of a sentence.
Your current program has runtime complexity of O(s*w), where s is the length, in characters, of all sentences, and w is the length of all words, also in characters.
If WORDS is relatively small (a few hundred items or so), you could use a regex to speed things up considerably: construct a pattern like this, and use it in a loop:
StringBuilder regex = new StringBuilder();
boolean first = true;
// Let's say WORDS={"quick", "brown", "fox"}
regex.append("\\b(?:");
for (String w : WORDS) {
if (!first) {
regex.append('|');
} else {
first = false;
}
regex.append(w);
}
regex.append(")\\b");
// Now regex is "\b(?:quick|brown|fox)\b", i.e. your list of words
// separated by OR signs, enclosed in non-capturing groups
// anchored to word boundaries by '\b's on both sides.
Pattern p = Pattern.compile(regex.toString());
for (int i = 0; i < SENTENCES.size(); i++) {
if (p.matcher(SENTENCES.get(i)).find()) {
// Do something
}
}
Since regex gets pre-compiled into a structure more suitable for fast searches, your program would run in O(s*max(w)), where s is the length, in characters, of all sentences, and w is the length of the longest word. Given that the number of words in your collection is about 200 or 300, this could give you an order of magnitude decrease in running time.
If you have enough memory you can tokenize SENTENCES and put them in a Set. Then it would be better in performance and also more correct than current implementation.
Well, looking at your code I would suggest two things that will improve the performance from each iteration:
Remove " == true". The contains operation already returns a boolean, so it is enough for the if, comparing it with true adds one extra operation for each iteration that is not needed.
Do not concatenate Strings inside a loop (" " + WORDS.get(k) + " ") as it is a quite expensive operation because + operator creates new objects. Better use a string buffer / builder and clear it after each iteration with stringBuffer.setLength(0);.
Besides that, for this case I do not know any other approach, maybe you can use regular expressions if you can abstract a pattern out of those words you want to remove and have then only one loop.
Hope it helps!
If you are concerned about efficiency, I think the most effective way to do this is to use the Aho-Corasick algorithm. You have two nested loops here plus a contains() call (which I think takes, at best, length of sentence + length of word time). Aho-Corasick gives you one loop over the sentences, and checking a sentence for all the words takes time proportional to the length of the sentence, which is roughly number-of-words times faster (plus a preprocessing time for building the finite state machine, which is relatively small).
I'll approach this from a more theoretical point of view. If you don't have a memory limitation, you can try to mimic the logic of counting sort.
Say M1 = sentences.size, M2 = number of words per sentence, and N = words.size.
Assume all sentences have the same number of words, just for simplicity.
Your current approach's complexity is O(M1.M2.N).
We can create a mapping from each word to its positions in the sentences.
Loop through your ArrayList of sentences and turn it into a two-dimensional jagged array of words. Loop through the new array and create a HashMap where the key is a word and the value is an ArrayList of that word's positions (say with length X). That's O(2M1.M2.X) = O(M1.M2.X).
Then loop through your words ArrayList, look each word up in the HashMap, loop through its list of positions, and remove each one. That's O(N.X).
Say you need to give the result as an ArrayList of strings; we then need another loop to concatenate everything. That's O(M1.M2).
Total complexity is O(M1.M2.X) + O(N.X) + O(M1.M2).
Assuming X is way smaller than N, you'll probably get better performance.
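A rough sketch of that word-to-position mapping is below. It simplifies the idea in one respect, which is an assumption on my part: instead of recording word positions inside each sentence and rebuilding the sentences afterwards, it records only the index of each sentence a word occurs in and marks whole sentences as removed. The names are made up and the words are assumed to be lowercase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class; a simplified take on the word-to-position map.
class InvertedIndex {
    static List<String> removeMatching(List<String> sentences, List<String> words) {
        // Build the map: word -> indexes of the sentences that contain it.
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < sentences.size(); i++) {
            for (String token : sentences.get(i).toLowerCase().split("\\W+")) {
                index.computeIfAbsent(token, k -> new ArrayList<>()).add(i);
            }
        }
        // One map lookup per word; mark every sentence that word occurs in.
        boolean[] removed = new boolean[sentences.size()];
        for (String word : words) {
            for (int i : index.getOrDefault(word, Collections.emptyList())) {
                removed[i] = true;
            }
        }
        // Final pass: collect the sentences that were never marked.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            if (!removed[i]) {
                result.add(sentences.get(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("one fish", "two fish", "red herring");
        System.out.println(removeMatching(sentences, Arrays.asList("one", "two")));
        // prints [red herring]
    }
}
```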

Java - Search keywords list in another string list

I have a list of keywords in a List and I have data coming from some source which will be a list too.
I would like to find if any of keywords exists in the data list, if yes add those keywords to another target list.
E.g.
Keywords list = FIRSTNAME, LASTNAME, CURRENCY & FUND
Data list = HUSBANDFIRSTNAME, HUSBANDLASTNAME, WIFEFIRSTNAME, SOURCECURRENCY & CURRENCYRATE.
From the above example, I would like to make a target list with the keywords FIRSTNAME, LASTNAME & CURRENCY; FUND should not be included, as it doesn't exist in the data list.
I have a solution below that works by using two for loops (one inside another) and check with String contains method, but I would like to avoid two loops, especially one inside another.
for (int i=0; i<dataList.size();i++) {
for (int j=0; j<keywordsList.size();j++) {
if (dataList.get(i).contains(keywordsList.get(j))) {
targetSet.add(keywordsList.get(j));
break;
}
}
}
Is there any other alternate solution for my problem?
Here's a one loop approach using regex. You construct a pattern using your keywords, and then iterate through your dataList and see if you can find a match.
public static void main(String[] args) throws Exception {
List<String> keywords = new ArrayList<>(Arrays.asList("FIRSTNAME", "LASTNAME", "CURRENCY", "FUND"));
List<String> dataList = new ArrayList<>(Arrays.asList("HUSBANDFIRSTNAME", "HUSBANDLASTNAME", "WIFEFIRSTNAME", "SOURCECURRENCY", "CURRENCYRATE"));
Set<String> targetSet = new HashSet<>();
// Compile the pattern once, outside the loop
Pattern pattern = Pattern.compile(String.join("|", keywords));
for (String data : dataList) {
Matcher matcher = pattern.matcher(data);
if (matcher.find()) {
targetSet.add(matcher.group());
}
}
System.out.println(targetSet);
}
Results:
[CURRENCY, LASTNAME, FIRSTNAME]
Try the Aho–Corasick algorithm. This algorithm can count the appearances of every keyword in the data (you only need to know whether each one appeared or not).
The Complexity is O(Sum(Length(Keyword)) + Length(Data) + Count(number of match)).
Here is the wiki-page:
In computer science, the Aho–Corasick algorithm is a string searching
algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is
a kind of dictionary-matching algorithm that locates elements of a
finite set of strings (the "dictionary") within an input text. It
matches all patterns simultaneously. The complexity of the algorithm
is linear in the length of the patterns plus the length of the
searched text plus the number of output matches.
I implemented it (about 200 lines) years ago for a similar case, and it works well.
If you only care whether a keyword appeared or not, you can modify the algorithm for your case to get a better complexity:
O(Sum(Length(Keyword)) + Length(Data)).
You can find implementations of the algorithm all over the internet, but I think it's good for you to understand the algorithm and implement it yourself.
EDIT:
I think you want to eliminate the two loops, so we need to find all keywords in one loop. This is called the set matching problem: matching a set of patterns (keywords) against a text (data). The Aho–Corasick algorithm is designed for exactly that case, and with it we get a one-loop solution:
// Ac stands for an Aho–Corasick automaton built once from keywordsList (pseudocode)
for (int i=0; i < dataList.size(); i++) {
targetSet.addAll(Ac.run(dataList.get(i)));
}
You can find an implementation here.
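For reference, a compact sketch of the automaton is below. It is not the roughly 200-line implementation mentioned above and not any particular library's API: the class name AhoCorasickSketch and the method findAll are made up, and the trie uses a HashMap per node for brevity rather than speed. The automaton is built once from the keywords, then each data string is scanned in a single pass.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class; an illustrative sketch, not a tuned implementation.
class AhoCorasickSketch {
    // One trie node: outgoing edges, failure link, and the keywords ending here.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final List<String> hits = new ArrayList<>();
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> keywords) {
        // Phase 1: insert every keyword into the trie.
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.hits.add(keyword);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> edge : node.next.entrySet()) {
                Node child = edge.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(edge.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(edge.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit keywords that end at the same text position.
                child.hits.addAll(child.fail.hits);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns every keyword that occurs in it.
    Set<String> findAll(String text) {
        Set<String> found = new LinkedHashSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail; // fall back to the longest suffix that is still in the trie
            }
            node = node.next.getOrDefault(c, root);
            found.addAll(node.hits);
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> keywordsList = Arrays.asList("FIRSTNAME", "LASTNAME", "CURRENCY", "FUND");
        List<String> dataList = Arrays.asList("HUSBANDFIRSTNAME", "HUSBANDLASTNAME",
                "WIFEFIRSTNAME", "SOURCECURRENCY", "CURRENCYRATE");
        AhoCorasickSketch ac = new AhoCorasickSketch(keywordsList);
        Set<String> targetSet = new TreeSet<>(); // sorted only to make the output deterministic
        for (String data : dataList) {
            targetSet.addAll(ac.findAll(data));
        }
        System.out.println(targetSet); // prints [CURRENCY, FIRSTNAME, LASTNAME]
    }
}
```

Unlike the regex version, which stops at the first match in each data string, findAll reports every keyword that occurs, including overlapping ones.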

How can i extract specific terms from string lines in Java?

I have a serious problem with extracting terms from each string line. To be more specific, I have one csv formatted file which is actually not csv format (it saves all terms into line[0] only)
So, here's just example string line among thousands of string lines:
(split() doesn't work.!!! )
test.csv
"31451 CID005319044   15939353   C8H14O3S2    beta-lipoic acid   C1C[S#](=O)S[C##H]1CCCCC(=O)O "
"12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O "
"9048   CTD042032 23241  C3HO4O3S2 Berberine  [C##H]1CCCCC(=O)O "
I want to extract "beta-lipoic acid" ,"saponin" and "Berberine" only which is located in 5th position.
You can see there are big spaces between terms, so that's why I said 5th position.
In this case, how can I extract terms located in 5th position for each line?
One more thing: the length of whitespace between each of the six terms is not always equal. the length could be one, two, three, four, or five, or something like that.
Because the length of whitespace is random, I can not use the .split() function.
For example, in the first line I would get "beta-lipoic" instead of "beta-lipoic acid".
Here is a solution for your problem using the string split and index of,
import java.util.ArrayList;
public class StringSplit {
public static void main(String[] args) {
String[] seperatedStr = null;
int fourthStrIndex = 0;
String modifiedStr = null, finalStr = null;
ArrayList<String> strList = new ArrayList<String>();
strList.add("31451 CID005319044   15939353   C8H14O3S2 beta-lipoic acid C1C[S#](=O)S[C##H]1CCCCC(=O)O ");
strList.add("12232 COD05374044 23439353 C924O3S2 saponin CCCC(=O)O ");
strList.add("9048 CTD042032 23241 C3HO4O3S2 Berberine [C##H]1CCCCC(=O)O ");
for (String item: strList) {
seperatedStr = item.split("\\s+");
fourthStrIndex = item.indexOf(seperatedStr[3]) + seperatedStr[3].length();
modifiedStr = item.substring(fourthStrIndex, item.length());
finalStr = modifiedStr.substring(0, modifiedStr.indexOf(seperatedStr[seperatedStr.length - 1]));
System.out.println(finalStr.trim());
}
}
}
Output:
beta-lipoic acid
saponin
Berberine
Option 1 : Use String.split and split only on two or more consecutive whitespace characters. Like the code below:
String s[] = str.split("\\s\\s+");
for (String string : s) {
System.out.println(string);
}
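Option 2 : Implement your own string split logic by going through all the characters. Sample code below (this code is just to give an idea; I did not test it.)
public static List<String> getData(String str) {
List<String> list = new ArrayList<>();
StringBuilder s = new StringBuilder();
int count = 0; // consecutive spaces seen so far
for (char c : str.toCharArray()) {
if (c == ' ') {
count++;
if (count == 2 && s.length() > 0) { // two spaces in a row end the current term
list.add(s.toString());
s.setLength(0);
}
} else {
if (count == 1 && s.length() > 0) {
s.append(' '); // a single space belongs to the term
}
count = 0;
s.append(c);
}
}
if (s.length() > 0) {
list.add(s.toString()); // the last term may not be followed by two spaces
}
return list;
}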
This would be a relatively easy fix if it weren't for beta-lipoic acid...
Assuming that only spaces/tabs/other whitespace separate terms, you could split on whitespace.
Pattern whitespace = Pattern.compile("\\s+");
String[] terms = whitespace.split(line); // Not 100% sure of syntax here...
// Your desired term should be index 4 of the terms array
While this would work for the majority of your terms, this would also result in you losing the "acid" in "beta-lipoic acid"...
Another hacky solution would be to add in a check for the 6th spot in the array produced by the above code and see if it matches English letters. If so, you can be reasonably confident that the 6th spot is actually part of the same term as the 5th spot, so you can then concatenate those together. This falls apart pretty quickly though if you have terms with >= 3 words. So something like
Pattern possibleEnglishWord = Pattern.compile("[a-zA-Z]+"); // Can add dashes and such as needed
if (terms.length > 5 && possibleEnglishWord.matcher(terms[5]).matches()) {
// return terms[4] + " " + terms[5] or something like that
}
Another thing you can try is to replace all groups of spaces with a single space, and then remove every token that isn't made up of just English letters/dashes:
line = whitespace.matcher(line).replaceAll(" ");
Pattern notEnglishWord = Pattern.compile("\\S*[^a-zA-Z\\s-]\\S*"); // Matches any token containing a character other than a letter or dash
line = notEnglishWord.matcher(line).replaceAll("").trim();
Then hopefully the only thing that is left would be the term you're looking for.
Hopefully this helps, but I do admit it's rather convoluted. One of the issues is that it appears that non-term words may have only one space between them, which would fool Option 1 as presented by Hirak... If that weren't the case that option should work.
Oh by the way, if you do end up doing this, put the Pattern declarations outside of any loops. They only need to be created once.
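Putting those pieces together, a single precompiled regex can also pull out the 5th term directly. This is a sketch under one assumption taken from the sample lines: the first four terms and the last term never contain whitespace, so only the 5th term may contain spaces. The class name FifthField and the method extract are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; names are illustrative only.
class FifthField {
    // Compiled once, outside any loop. Skips four whitespace-free terms,
    // captures everything up to the last whitespace-free term.
    private static final Pattern LINE =
            Pattern.compile("\\s*(?:\\S+\\s+){4}(.+?)\\s+\\S+\\s*");

    // Returns the 5th term, or null if the line does not have the expected shape.
    static String extract(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "31451 CID005319044   15939353   C8H14O3S2    beta-lipoic acid   C1C[S#](=O)S[C##H]1CCCCC(=O)O ",
            "12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O ",
            "9048   CTD042032 23241  C3HO4O3S2 Berberine  [C##H]1CCCCC(=O)O "
        };
        for (String line : lines) {
            System.out.println(extract(line));
        }
        // prints:
        // beta-lipoic acid
        // saponin
        // Berberine
    }
}
```

Because the pattern does not depend on how many spaces separate the terms, it also handles lines where the terms are separated by a single space.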
