I have 2 sets of data.
Let's say one is people and the other is groups.
A person can be in multiple groups, and a group can contain multiple people.
My operations will basically be CRUD on group and people.
As well as a method that makes sure a list of people are all in different groups (which gets called a lot).
Right now I'm thinking of making a table of binary 0s and 1s, with a row per person and a column per group.
I can perform the check in O(n) time by adding the selected rows together and comparing the result with the bitwise OR of the same rows: the two are equal exactly when no group column is shared.
E.g.:
Group A B C D
ppl1 1 0 0 1
ppl2 0 1 1 0
ppl3 0 0 1 0
ppl4 0 1 0 0
check (ppl1, ppl2) = (1001 + 0110) == (1001 | 0110)
                   = 1111 == 1111
                   = true
check (ppl2, ppl3) = (0110 + 0010) == (0110 | 0010)
                   = 1000 == 0110
                   = false
I'm wondering if there is a data structure that does something similar already so I don't have to write my own and maintain O(n) runtime.
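To make the check concrete, here is a rough sketch of what I have in mind, using java.util.BitSet for the rows of bits (the class and method names are just illustrative):

import java.util.BitSet;
import java.util.List;

public class GroupMembership {
    // One BitSet per person; bit i is set when that person is in group i.
    // Returns true when no two people in the list share a group.
    static boolean allInDifferentGroups(List<BitSet> memberships) {
        BitSet seen = new BitSet();
        for (BitSet groups : memberships) {
            if (seen.intersects(groups)) {
                return false; // some group already has a member from this list
            }
            seen.or(groups);  // mark these groups as taken
        }
        return true;
    }
}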
I don't know all of the details of your problem, but my gut instinct is that you may be overthinking things here. How many objects are you planning on storing in this data structure? If you have really large amounts of data to store, I would recommend that you use an actual database instead of a data structure. The type of operations you are describing are classic examples of things that relational databases are good at. MySQL and PostgreSQL are large-scale relational databases that could do this sort of thing in their sleep. If you'd like something lighter weight, SQLite would probably be of interest.
If you do not have large amounts of data to store in this data structure, I'd recommend keeping it simple and only optimizing when you are sure it won't be fast enough for what you need to do. As a first shot, I'd just recommend using Java's built-in List interface to store your people and a Map to store groups. You could do something like this:
// Use a list to keep track of People
List<Person> myPeople = new ArrayList<Person>();
Person steve = new Person("Steve");
myPeople.add(steve);
myPeople.add(new Person("Bob"));
// Use a Map to track Groups
Map<String, List<Person>> groups = new HashMap<String, List<Person>>();
groups.put("Everybody", myPeople);
groups.put("Developers", Arrays.asList(steve));
// Does a group contain everybody?
groups.get("Everybody").containsAll(myPeople); // returns true
groups.get("Developers").containsAll(myPeople); // returns false
This definitely isn't the fastest option available, but if you do not have a huge number of People to keep track of, you probably won't even notice any performance issues. If you do have some special conditions that would make the speed of regular Lists and Maps unfeasible, please post them and we can make suggestions based on those.
EDIT:
After reading your comments, it appears that I misread your issue on the first run through. It looks like you're not so much interested in mapping groups to people, but instead mapping people to groups. What you probably want is something more like this:
Map<Person, List<String>> associations = new HashMap<Person, List<String>>();
Person steve = new Person("Steve");
Person ed = new Person("Ed");
associations.put(steve, Arrays.asList("Everybody", "Developers"));
associations.put(ed, Arrays.asList("Everybody"));
// This is the tricky part: true means no two of them share a group
boolean inSeparateGroups = checkForSharedGroups(associations, Arrays.asList(steve, ed));
So how do you implement the checkForSharedGroups method? In your case, since the numbers surrounding this are pretty low, I'd just try out the naive method and go from there.
public boolean checkForSharedGroups(
        Map<Person, List<String>> associations,
        List<Person> peopleToCheck) {
    List<String> groupsThatHaveMembers = new ArrayList<String>();
    for (Person p : peopleToCheck) {
        List<String> groups = associations.get(p);
        if (groups == null) {
            continue; // this person isn't in any group
        }
        for (String s : groups) {
            if (groupsThatHaveMembers.contains(s)) {
                // We've already seen this group, so we can return
                return false;
            } else {
                groupsThatHaveMembers.add(s);
            }
        }
    }
    // If we've made it to this point, nobody shares any groups.
    return true;
}
This method probably doesn't have great performance on large datasets, but it is very easy to understand. Because it's encapsulated in its own method, it should also be easy to update if it turns out you need better performance. If you do need to increase performance, I would look at overriding the equals and hashCode methods of Person, which would make lookups in the associations map faster. From there you could also look at a custom type instead of String for groups, also with overridden equals and hashCode, which would considerably speed up the contains call used above.
The reason why I'm not too concerned about performance is that the numbers you've mentioned aren't really that big as far as algorithms are concerned. Because this method returns as soon as it finds two matching groups, in the very worst case you will call ArrayList.contains a number of times equal to the number of groups that exist. In the very best case it only needs to be called twice. Performance will likely only be an issue if you call checkForSharedGroups very, very often, in which case you might be better off finding a way to call it less often instead of optimizing the method itself.
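For reference, a minimal sketch of what an overridden equals/hashCode pair on Person might look like, assuming the name is the only identifying field (that assumption is mine, not something from your description):

import java.util.Objects;

public class Person {
    private final String name;

    public Person(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        return Objects.equals(name, ((Person) o).name); // assumes the name identifies a person
    }

    @Override
    public int hashCode() {
        return Objects.hash(name);
    }
}

Remember that equals and hashCode need to be overridden together, or HashMap lookups will misbehave.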
Have you considered a hash table? If you know all of the keys you'll be using in advance, you can use a perfect hash function, which will allow you to achieve constant-time lookups.
How about having two separate entities, People and Group? Inside People keep a Set of Group, and vice versa.
class People{
Set<Group> groups;
//API for addGroup, getGroup
}
class Group{
Set<People> people;
//API for addPeople,getPeople
}
check(People p1, People p2):
1) call getGroups on both p1 and p2
2) compare the sizes of the two sets
3) iterate over the smaller set and check whether each group is present in the other set
Now you can store the People objects in basically any data structure: preferably a linked list if the size is not fixed, otherwise an array.
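A rough sketch of that check, assuming People exposes a getGroups() accessor (the exact method name is my guess):

import java.util.Set;

class GroupCheck {
    // Returns true when p1 and p2 have no group in common.
    static boolean check(People p1, People p2) {
        Set<Group> g1 = p1.getGroups(); // assumed accessor name
        Set<Group> g2 = p2.getGroups();
        Set<Group> smaller = (g1.size() <= g2.size()) ? g1 : g2;
        Set<Group> larger = (smaller == g1) ? g2 : g1;
        for (Group g : smaller) {
            if (larger.contains(g)) {
                return false; // shared group found
            }
        }
        return true;
    }
}

If you prefer a library call, java.util.Collections.disjoint(g1, g2) does essentially the same thing.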
Related
For one of my school assignments, I have to parse GenBank files using Java. I have to store and retrieve the content of the files together with the extracted information while maintaining the smallest time complexity possible. Is there a difference between using HashMaps and storing the data as records? I know that using HashMaps would be O(1), but the readability and immutability of records lead me to prefer them instead. The objects will be stored in an array.
This is my approach now:
public static GenBankRecord parseGenBankFile(File gbFile) throws IOException {
try (var fileReader = new FileReader(gbFile); var reader = new BufferedReader(fileReader)) {
String organism = null;
List<String> contentList = new ArrayList<>();
while (true) {
String line = reader.readLine();
if (line == null) break; //Breaking out if file end has been reached
contentList.add(line);
if (line.startsWith(" ORGANISM ")) {
// Organism type found
organism = line.substring(12); // Selecting the correct part of the line
}
}
// Loop ended
var content = String.join("\n", contentList);
return new GenBankRecord(gbFile.getName(),organism, content);
}
}
with GenBankRecord being the following:
record GenBankRecord(String fileName, String organism, String content) {
    @Override
    public String toString() {
        return organism;
    }
}
Is there a difference between using a record and a HashMap, assuming the key-value pairs are the same as the fields of the record?
String current_organism = gbRecordInstance.organism();
and
String current_organism = gbHashMap.get("organism");
I have to store and retrieve the content of the files together with the extracted information maintaining the smallest time complexity possible.
Firstly, I am somewhat doubtful that your teachers actually stated the requirements like that. It doesn't make a lot of sense to optimize just for time complexity.
Complexity is not efficiency.
Big O complexity is not about the value of the measure (e.g. time taken) itself. It is actually about how the measure (e.g. time taken) changes as some variable gets very large.
For example, HashMap.get(nameStr) and someRecord.name() are both O(1) complexity.
But they are not equivalent in terms of efficiency. Using Java 17 record types or regular Java classes with named fields will be orders of magnitude faster than using a HashMap. (And it will use orders of magnitude less memory.)
Assuming that your objects have a fixed number of named fields, the complexity (i.e. how the performance changes with an ever-increasing number of fields) is not even relevant.
Performance is not everything.
The most important differences between a HashMap and a record class are actually in the functionality that they provide:
A Map<String, SomeType> provides a set of name/value pairs where:
the number of pairs in the set is not fixed
the names are not fixed
the types of the values are all instances of SomeType or a subtype.
A record (or classic class) can be viewed as a set of fieldname/value pairs where:
the number of pairs is fixed at compile time
the field names are fixed at compile time
the field types don't have to be subtypes of any single given type.
As @Louis Wasserman commented:
Records and HashMap are apples and oranges -- it doesn't really make sense to compare them.
So really, you should be choosing between records and hashmaps by comparing the functionality / constraints that they provide versus what your application actually needs.
(The problem description in your question is not clear enough for us to make that judgement.)
Efficiency concerns may be relevant, but they are a secondary concern. (If the code doesn't meet functional requirements, efficiency is moot.)
Is Complexity relevant to your assignment?
Well ... maybe yes. But not in the area that you are looking at.
My reading of the requirements is that one of them is that you be able to retrieve information from your in-memory data structures efficiently.
But so far you have been thinking about storing individual records. Retrieval implies that you have a collection of records and you have to (efficiently) retrieve a specific record, or maybe a set of records matching some criteria. So that implies you need to consider the data structure to represent the collection.
Suppose you have a collection of N records (or whatever) representing (say) N organisms:
If the collection is a List<SomeRecord>, you need to iterate the list to find the record for (say) "cat". That is O(N).
If the collection is a HashMap<String, SomeRecord> keyed by the organism name, you can find the "cat" record in O(1).
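For example, a minimal sketch of that second option, keyed by the organism name (and assuming organism names are unique, which may not hold for your data):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GenBankIndex {
    private final Map<String, GenBankRecord> byOrganism = new HashMap<>();

    // Build the index once, in O(N)...
    GenBankIndex(List<GenBankRecord> records) {
        for (GenBankRecord r : records) {
            byOrganism.put(r.organism(), r);
        }
    }

    // ...then each lookup is O(1) on average.
    GenBankRecord find(String organism) {
        return byOrganism.get(organism);
    }
}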
I need some structure in which to store N enums, some of them repeated, and I need to be able to easily extract them. So far I've tried to use an EnumSet like this:
cards = EnumSet.of(
BEST_OF_THREE,
BEST_OF_THREE,
SIMPLE_QUESTION,
SIMPLE_QUESTION,
STAR);
But now I see it can only hold one of each. Conceptually, which would be the best structure to use for this problem?
You can use a Map from your enum type to Integer, where the integer indicates how many of each value there are. Google Guava's Multiset does this for you, and handles the edge cases of adding a value when there is no entry yet and of removing a value when none are left.
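For illustration, a rough sketch of the Multiset approach; the enum here is a placeholder standing in for whatever your card enum actually is:

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

class CardCounts {
    enum CardType { BEST_OF_THREE, SIMPLE_QUESTION, STAR } // placeholder enum

    public static void main(String[] args) {
        Multiset<CardType> cards = HashMultiset.create();
        cards.add(CardType.BEST_OF_THREE);
        cards.add(CardType.BEST_OF_THREE);
        cards.add(CardType.SIMPLE_QUESTION);
        cards.add(CardType.SIMPLE_QUESTION);
        cards.add(CardType.STAR);

        System.out.println(cards.count(CardType.BEST_OF_THREE)); // 2
        cards.remove(CardType.BEST_OF_THREE);                    // removes one occurrence
        System.out.println(cards.count(CardType.BEST_OF_THREE)); // 1
    }
}

If I remember correctly, Guava also has EnumMultiset.create(CardType.class), which is backed by an array indexed by ordinal, much like the second strategy below.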
Another strategy is to use the enum's ordinal index. Because this index is unique, you can use it to index into an int array sized to the number of enum constants, where the count in each slot indicates how many of each value you have. Like this:
// initialize array for counting each enumeration type
// (Java guarantees each element of a new int[] starts at zero)
int[] cardCount = new int[CardEnum.values().length];
...
// incrementing the count for an enumeration (when we add)
cardCount[BEST_OF_THREE.ordinal()]++;
...
// decrementing the count for an enumeration (when we remove)
cardCount[BEST_OF_THREE.ordinal()]--;
// DEBUG: assert cardCount[BEST_OF_THREE.ordinal()] >= 0
...
// getting the count for an enumeration
int count = cardCount[BEST_OF_THREE.ordinal()];
Edit, some time later:
Having read the clarifying comments underneath the original post, it is clear that you're best off with a linear structure with an entry per element; I hadn't realized that you don't need detailed counts of each value. Storing them in a Multiset or an equivalent counting structure makes it hard to pick an element at random, as you need to map an index picked at random from [0, size) to a particular bucket, which takes log time.
Sets don't allow duplicates, so if you want repeats you'll need either a List or a Map.
If you just need the number of duplicates, an EnumMap with Integer values is probably your best bet (see the sketch after this list).
If the order is important, and you need quick access to the number of each type, you'll probably need to roll your own data structure.
If the order is important (but the count of each is not), then a List is the way to go, which implementation depends on how you will use it.
LinkedList - Best when there will be many inserts/removals from the beginning of the List. Indexing into a LinkedList is very expensive, and should be avoided whenever possible. If a List is built by shifting data onto the front of the list, but any later additions are at the end, conversion to an ArrayList once the initial List is built is a good idea - especially if indexing into the List is anticipated at any point.
ArrayList - When in doubt, this is a good place to start. Inserting or removing items requires shifting, so if this is a common operation look elsewhere.
TreeList - This is a good all-around option, and insertions and removals anywhere in the List are inexpensive. This does require the Apache commons library, and uses a bit more memory than the others.
Benchmarks, and the code used to generate them, can be found in this gist.
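For the EnumMap option mentioned above, a minimal counting sketch might look like this (the enum name and constants are placeholders):

import java.util.EnumMap;
import java.util.Map;

class EnumCounting {
    enum CardType { BEST_OF_THREE, SIMPLE_QUESTION, STAR } // placeholder enum

    public static void main(String[] args) {
        Map<CardType, Integer> counts = new EnumMap<>(CardType.class);

        // Adding a card bumps its count; merge handles the missing-key case.
        counts.merge(CardType.BEST_OF_THREE, 1, Integer::sum);
        counts.merge(CardType.BEST_OF_THREE, 1, Integer::sum);
        counts.merge(CardType.STAR, 1, Integer::sum);

        System.out.println(counts.getOrDefault(CardType.BEST_OF_THREE, 0));   // 2
        System.out.println(counts.getOrDefault(CardType.SIMPLE_QUESTION, 0)); // 0
    }
}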
How do I unit test generated strings where the final order is fairly flexible? Let's say I'm trying to test some code that prints out generated SQL built from key-value pairs. However, the exact order of many of the fragments does not matter.
For example
SELECT
*
FROM
Cats
WHERE
fur = 'fluffy'
OR
colour = 'white'
is functionally identical to
SELECT
*
FROM
Cats
WHERE
colour = 'white'
OR
fur = 'fluffy'
It doesn't matter in which order the condition clauses get generated, but it does matter that they follow the where clause. Also, it is hard to predict since the ordering of pairs when looping through the entrySet() of a HashMap is not predictable. Sorting the keys would solve this, but introduces a runtime penalty for no (or negative) business value.
How do I unit test the generation of such strings without over-specifying the order?
I thought about using a regex, but I can't think of how to write one that says: "SELECT * FROM Cats WHERE", followed by one of {"fur = 'fluffy'", "colour = 'white'"}, followed by "OR", followed by one of {"fur = 'fluffy'", "colour = 'white'"} ... and not the one used last time.
NB: I'm not actually doing this with SQL, it just made for an easier way to frame the problem.
I see a few different options:
If you can live with a modest runtime penalty, LinkedHashMap keeps insertion order.
If you want to solve this without changing your implementation at all, in your example I don't see why you should have to do anything more complicated than checking that every fragment appears in the generated SQL and that each one appears after the WHERE. Roughly:
Map<String, String> parametersAndValues = new HashMap<>();
parametersAndValues.put("fur", "fluffy");
parametersAndValues.put("colour", "white");

String generatedSql = generateSql(parametersAndValues);

int whereIndex = generatedSql.indexOf("WHERE");
for (Map.Entry<String, String> entry : parametersAndValues.entrySet()) {
    String fragment = String.format("%s = '%s'", entry.getKey(), entry.getValue());
    assertThat(generatedSql, containsString(fragment));
    assertThat(whereIndex, is(lessThan(generatedSql.indexOf(fragment))));
}
But we can do it even simpler than that. Since you don't actually have to test this with a large set of parameters - for most implementations there are only three important quantities, "none, one, or many" - it's actually feasible to test it against all possible values:
String variation1 = "SELECT ... WHERE fur = 'fluffy' OR colour = 'white'";
String variation2 = "SELECT ... WHERE colour = 'white' OR fur = 'fluffy'";
assertThat(generatedSql, anyOf(is(variation1), is(variation2)));
Edit: To avoid writing all possible variations by hand (which gets rather tedious if you have more than two or three items as there are n! ways to combine n items), you could have a look at the algorithm for generating all possible permutations of a sequence and do something like this:
List<List<String>> permutations = allPermutationsOf("fur = 'fluffy'",
"colour = 'white'", "scars = 'numerous'", "disposition = 'malignant'");
List<String> allSqlVariations = new ArrayList<>(permutations.size());
for (List<String> permutation : permutations) {
    allSqlVariations.add("SELECT ... WHERE " + String.join(" OR ", permutation));
}
assertThat(generatedSql, isIn(allSqlVariations));
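The allPermutationsOf helper above is hypothetical; a minimal recursive sketch could look like this (the joining itself is covered by String.join, available since Java 8):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Permutations {
    // Returns every ordering of the given items (n! results, so keep n small).
    static List<List<String>> allPermutationsOf(String... items) {
        return permute(new ArrayList<>(Arrays.asList(items)));
    }

    private static List<List<String>> permute(List<String> items) {
        List<List<String>> result = new ArrayList<>();
        if (items.isEmpty()) {
            result.add(new ArrayList<>());
            return result;
        }
        for (int i = 0; i < items.size(); i++) {
            List<String> rest = new ArrayList<>(items);
            String head = rest.remove(i);
            for (List<String> tail : permute(rest)) {
                tail.add(0, head); // prepend the chosen head to each permutation of the rest
                result.add(tail);
            }
        }
        return result;
    }
}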
Well, one option would be to somehow parse the SQL, extract the list of fields, and check that everything is there, disregarding the order of the fields. However, this is going to be rather ugly: if done right, you have to implement a complete SQL parser (obviously overkill); if you do it quick and dirty using a regex or similar, you risk the test breaking on minor changes to the generated SQL.
Instead, I'd propose to use a combination of unit and integration testing:
Have a unit test that tests the code which supplies the list of fields for building the SQL. That is, have a method like Map<String, String> getRestrictions() which you can easily unit test.
Have an integration test for the SQL generation as a whole, which runs against a real database (maybe an embedded one like the H2 database, which you can start just for the test).
That way, you unit-test the actual values supplied to the SQL, and you integration-test that you are really creating the right SQL.
Note: In my opinion this is an example of "integration code", which cannot be usefully unit-tested. The problem is that the code does not produce a real, testable result by itself. Rather, its purpose is to interface with a database (by sending it SQL), which produces the result. In other words, the code does the right thing not if it produces some specific SQL string, but if it drives the database to do the right thing. Therefore, this code can be meaningfully tested only with the database, i.e. in an integration test.
First, use a LinkedHashMap instead of a regular HashMap. It shouldn't introduce any noticeable performance degradation. Rather than sorting, it retains insertion ordering.
Second, insert the pairs into the map in a well understood manner. Perhaps you are getting the data from a table, and adding an ordering index is unacceptable. But perhaps the database can be ordered by primary key or something.
Combined, those two changes should give you predictable results.
Alternatively, compare actual vs. expected using something smarter than string equals. Perhaps a regex to scrape out all the pairs that have been injected into the actual SQL query?
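As a rough sketch of that regex idea, assuming the injected clauses all look like key = 'value' (which is an assumption about the generated SQL):

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SqlPairScraper {
    private static final Pattern PAIR = Pattern.compile("(\\w+) = '([^']*)'");

    // Collects the key/value clauses so two queries can be compared order-insensitively.
    static Set<String> scrapePairs(String sql) {
        Set<String> pairs = new HashSet<>();
        Matcher m = PAIR.matcher(sql);
        while (m.find()) {
            pairs.add(m.group(1) + "=" + m.group(2));
        }
        return pairs;
    }
}

You could then assert that scrapePairs(actualSql) equals scrapePairs(expectedSql), plus a separate check on the fixed "SELECT ... WHERE" prefix.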
The best I have come up with so far is to use some library during testing (such as PowerMockito) to replace the HashMap with a SortedMap like TreeMap. That way the order is fixed for the tests. However, this only works if the map isn't built in the same code that generates the string.
I am wondering what's the most efficient way to check if a row already exists in a List<Set<Foo>>. A Foo object has a key/value pair (as well as other fields which aren't applicable to this question). Each Set in the List is unique.
As an example:
List[
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:2][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:3]
]
I want to be able to check if a new Set (Ex: Set[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]) exists in the List.
Each Set could contain anywhere from 1-20 Foo objects. The List can contain anywhere from 1-100,000 Sets. Foo's are not guaranteed to be in the same order in each Set (so they will have to be pre-sorted for the correct order somehow, like a TreeSet)
Idea 1: Would it make more sense to turn this into a matrix? Where each column would be the Foo_Key and each row would contain a Foo_Value?
Ex:
A B C
-----
1 3 4
1 2 4
1 3 3
And then look for a row containing the new values?
Idea 2: Would it make more sense to create a hash of each Set and then compare it to the hash of a new Set?
Is there a more efficient way I'm not thinking of?
If you use TreeSets for your Sets, can't you just do list.contains(set), since a TreeSet will handle the equals check?
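To illustrate, here is a sketch of that idea with a HashSet as the outer collection instead of a List, so contains doesn't have to scan all 100,000 entries; it assumes Foo implements equals and hashCode over its key and value (the field names are made up):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class RowLookup {
    static final class Foo {
        final String key;
        final int value;

        Foo(String key, int value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Foo)) return false;
            Foo other = (Foo) o;
            return value == other.value && Objects.equals(key, other.key);
        }

        @Override
        public int hashCode() {
            return Objects.hash(key, value);
        }
    }

    public static void main(String[] args) {
        Set<Set<Foo>> rows = new HashSet<>();
        rows.add(new HashSet<>(Arrays.asList(new Foo("A", 1), new Foo("B", 3), new Foo("C", 4))));

        Set<Foo> candidate = new HashSet<>(Arrays.asList(new Foo("C", 4), new Foo("A", 1), new Foo("B", 3)));
        System.out.println(rows.contains(candidate)); // true; Set equality ignores element order
    }
}

Note that Set.equals and Set.hashCode are order-independent, so no pre-sorting or TreeSet is strictly required as long as Foo itself has sensible equals/hashCode.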
Also, consider using Guava's Multiset class.
I would recommend you use a less weird data structure. As for finding stuff: Generally Hashes or Sorting + Binary Searching or Trees are the ways to go, depending on how much insertion/deletion you expect. Read a book on basic data structures and algorithms instead of trying to re-invent the wheel.
Lastly: if this is not a purely academic question, just loop through the lists and do the comparison. Most likely that is acceptably fast. Even 100,000 entries will take a fraction of a second, and therefore won't matter in 99% of all use cases.
I like to quote Knuth: Premature optimisation is the root of all evil.
I'm working on a piece of software that very frequently needs to return a single list that consists of the first (up to) N elements of a number of other lists. The return is not modified by its clients -- it's read-only.
Currently, I am doing something along the lines of (code simplified for readability):
List<String> ret = new ArrayList<String>();
for (List<String> aList : lists) {
    // add the first N elements, if they exist
    ret.addAll(aList.subList(0, Math.min(aList.size(), MAXMATCHESPERLIST)));
    if (ret.size() >= MAXMATCHESTOTAL) {
        break;
    }
}
return ret;
I'd like to avoid the creation of the new list and the use of addAll() as I don't need to be returning a new list, and I'm dealing with thousands of elements per second. This method is a major bottleneck for my application.
What I'm looking for is an implementation of List that simply consists of the subList() results (those are cheap views, not actual copies) of each of the contained lists.
I've looked through the usual suspects including java.util, Commons Collections, Commons Lang, etc., but can't for the life of me find any such implementation. I'm pretty sure it has to have been implemented at some point though and hopefully I've missed something obvious.
So I'm turning to you, Stack Overflow -- is anyone aware of such an implementation? I can write one myself, but I hate re-inventing the wheel if the wheel is out there.
Suggestions for alternative more efficient approaches are very welcome!
Optional background detail (probably not all that relevant to my question, but just in case it helps you understand what I'm trying to do): this is for a program to fill crossword-style grids with words that revolve around a theme. Each theme may have any number of candidate word lists, ordered in decreasing order of theme relevancy. For instance, the "film" theme may start with a list of movie titles, then a list of actors, then a generic list of places that may or may not be film-relevant, then a generic list of english words. The lists are each stored in a wildcarded trie structure to allow fast lookups that meet the grid constraints (e.g. "CAT" would be stored in trie'd lists against the keys "CAT", "CA?", "C??", "?AT", ... "???" etc.) Lists vary from a few words to several tens of thousands of words.
For any given query, e.g. "C??", I want to return a list that contains up to N (say 50) matching words, ordered in the same order as the source lists. So if list 1 contains 3 matches for "C??", list 2 contains 7, and list 3 contains 100, I need a return list that contains first the 3 matches from list 1, then the 7 matches from list 2, then 40 of the matches from list 3. And I want that returned "conjoined list view" operation to be more efficient than having to continuously call addAll(), in a similar manner to the implementation of subList().
Caching the returned lists is not an option due to memory constraints -- my trie is already consuming the vast majority of my (32 bit) max-sized heap.
PS this isn't homework, it's for a real project. Any help much appreciated!
Do you need random access for the resulting list, or does your client code only iterate over the result?
If you only need to iterate over the result, create a custom list implementation that holds the list of the original lists as an instance field. Return a custom iterator that takes items from each underlying list one by one and stops when there are no more items in any of the underlying lists or when it has already returned MAXMATCHESTOTAL items.
With some thought you can do the same for random access.
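A sketch of that iterator idea, with the class name and limits invented for illustration:

import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Walks the source lists in order, handing out at most maxPerList items
// from each list and at most maxTotal items overall, without copying anything.
class LimitedConcatIterable implements Iterable<String> {
    private final List<List<String>> lists;
    private final int maxPerList;
    private final int maxTotal;

    LimitedConcatIterable(List<List<String>> lists, int maxPerList, int maxTotal) {
        this.lists = lists;
        this.maxPerList = maxPerList;
        this.maxTotal = maxTotal;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int listIndex = 0;   // which source list we are in
            private int indexInList = 0; // position within that list
            private int returned = 0;    // items handed out so far

            @Override
            public boolean hasNext() {
                if (returned >= maxTotal) return false;
                // Skip to the next list once we hit its per-list cap or its end.
                while (listIndex < lists.size()
                        && (indexInList >= maxPerList
                            || indexInList >= lists.get(listIndex).size())) {
                    listIndex++;
                    indexInList = 0;
                }
                return listIndex < lists.size();
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                returned++;
                return lists.get(listIndex).get(indexInList++);
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException("read-only view");
            }
        };
    }
}

Nothing is copied; clients simply iterate over the view.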
Use list.addAll() multiple times. Simple, requires no external jars, but inefficient.
The Jakarta (Apache Commons) collections framework has such a list. It is effective, but requires an external jar and does not support generics.
Check out Guava from Google; I think it has something like what you are looking for (Iterables.concat combined with Iterables.limit, for example).
What's wrong with returning the sublist? That is the fastest way, since the sublist is not a copy but uses a reference to the backing array, and clients are read-only - seems perfect to me.
EDIT:
I understand why you want to group up the contents of several lists to make a larger chunk, but can you change your clients to not need such a large chunk? See my other answer regarding the BlockingQueue and producer/consumer approach.
Have you considered using a BlockingQueue and having consumers pull items from the queue one by one as they need them, rather than getting items in chunks (lists)? It seems you are attempting to reinvent the producer/consumer pattern here.
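A minimal sketch of that idea (all names are made up for illustration):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class MatchPipeline {
    private static final String END = "\u0000END"; // sentinel meaning "no more matches"

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> matches = new ArrayBlockingQueue<>(50);

        // Producer: pushes matching words onto the queue as it finds them.
        Thread producer = new Thread(() -> {
            try {
                for (String word : new String[] {"CAT", "CAR", "CAP"}) { // stand-in for trie lookups
                    matches.put(word);
                }
                matches.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer: pulls words one at a time instead of waiting for a whole list.
        while (true) {
            String word = matches.take();
            if (word.equals(END)) break;
            System.out.println(word);
        }
        producer.join();
    }
}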