I have a question about mapping a Long value to another Long value and the best way to achieve that.
I have to map left values to right values just before writing the data to the database.
3 => 70
8 => 12
1 => 45
Is there any "best way"? I was thinking about a static map where the left Long would be the key and the right one the value, so I just have to get the value corresponding to a given key.
Is that a good approach?
You have two main options: an associative container, or an array. If the input values are all within a small range and performance is very important, you could use an array. Otherwise you may as well use a map as you said.
As @John Zwinck points out, a map is generally fine for this type of thing. The cost of mapping one Long to another is trivial and will be dwarfed by the network latency of writing to a database (so don't bother with a primitive array :).
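For what it's worth, a minimal sketch of the static-map idea might look like this (the key/value pairs are just the ones from the question):

import java.util.HashMap;
import java.util.Map;

public class ValueMapper {

    private static final Map<Long, Long> MAPPING = new HashMap<>();
    static {
        MAPPING.put(3L, 70L);
        MAPPING.put(8L, 12L);
        MAPPING.put(1L, 45L);
    }

    /** Returns the mapped value, or the original value if no mapping exists. */
    public static Long map(Long left) {
        return MAPPING.getOrDefault(left, left);
    }
}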
Open for extension, but closed for modification
I think it's probably more important for you to consider what happens if the mappings change or you need to add another one. In line with SOLID principles (and in particular open-closed), it should be possible to modify the mappings without changing the class.
In practice you should make sure you can read the mappings (initially, on demand, or periodically) from an external source (e.g. a property file, db, NoSQL cache).
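As a rough sketch, assuming the mappings live in a properties file with lines like 3=70 (the file name and location here are hypothetical), loading them could look like:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class MappingLoader {

    public static Map<Long, Long> load(String path) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        // Convert each "left=right" property into a Long-to-Long entry
        Map<Long, Long> mappings = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            mappings.put(Long.valueOf(key), Long.valueOf(props.getProperty(key)));
        }
        return mappings;
    }
}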
Use a map and pass the initial capacity to the constructor if you know the number of key-value pairs. Choose the map implementation carefully, depending on your concurrency and ordering requirements.
For possibly no better reason at this point in time than "we've always done it like this", how are new systems being architected to handle the reference data used to represent state codes?
For example, a Case may have 2 valid states, 'Open' or 'Closed'. Historically I've seen many systems where these valid values would be stored in a database table containing this reference data, and referred to as a code type ('CaseStatus'), and each valid value has a 'code' value (eg 'OPN') and a decode or display value that is used when the value is needed to be displayed to a user (in this case 'Open').
If developing a Java based system today, from a code point of view with type safety, we would define an Enum like this:
public enum CaseStatus {
    Open("OPN"),
    Closed("CLS");

    private final String codeValue;

    private CaseStatus(String codeValue) {
        this.codeValue = codeValue;
    }

    // Accessor for the code value that gets persisted to the database
    public String getCodeValue() {
        return codeValue;
    }
}
This is great solely from the view of the source code; the Enum enforces type-safety with a restricted list of valid values, but by itself there is no representation of this code type or its valid values in the database. If there are users of the data who run ad hoc reports directly against the database, they need a way to look up decoded values for 'OPN' and 'CLS'. Historically this would have been done using a reference table containing the code type, the codes, and their decode values.
It seems odd that we continue to use these state code values as three-letter codes, where the original motivation of saving space in the database no longer applies ('OPN' vs 'Open' is hardly a great optimization anyway).
What other approaches have people used or seen on recent systems they have worked on? Do you maintain the reference data only in the database, only in code, or in both places, and if you maintain it in both, what approaches do you use to keep the two in sync?
First, if there are only two possible values, and they cannot be expected to grow into a larger set (as in your example of open/closed), I would probably define a status_open column as BOOLEAN, SMALLINT (0/1), or CHAR (Y/N).
When the universe of status is bigger (or may increase to more than two values), I would use a surrogate key. While saving a few bytes is hardly an optimization, indexing and joining CHAR valued columns is more expensive than indexing and joining INTEGER columns. While I don't have a metric on the issue of INTEGER vs CHAR(3), I would suppose that for this case the difference would not be as big as in the case of INTEGER vs CHAR(50).
However, a disadvantage that I find in small CHAR abbreviations is that sometimes it is difficult to find meaningful values. Suppose that you have a status of "broken - replacement has been ordered": does it help if I call it "BRO"? Is it better than calling it 3?
On the other hand, even when it is not required by the model, I find it convenient to add a short VARCHAR description column to the status table, describing what each mnemonic or surrogate key means. (After the model grows, it becomes quite difficult to remember all of them!)
My implementation (with due exceptions in particular cases) would likely be:
On the Java side, the enum, as you defined it. (Even for boolean-like values, it sometimes helps to have a distinct enum for each value, particularly if there are methods taking several of those values as parameters. Methods with a long list of parameters of the same type are a recipe for disaster.)
On the SQL side:
CREATE TABLE status (
    id INTEGER PRIMARY KEY,
    description VARCHAR(40)
);

CREATE TABLE entity (
    ...
    status_id INTEGER REFERENCES status(id)
);
INSERT INTO status VALUES (0,'Closed');
INSERT INTO status VALUES (1,'Open');
INSERT INTO status VALUES (2,'Broken - replacement has been ordered');
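As a sketch (not part of the original answer), the Java-side enum could carry the same surrogate key used in the status table above, so the code and the database stay aligned; the id values mirror the INSERT statements.

public enum Status {
    Closed(0),
    Open(1),
    BrokenReplacementOrdered(2);

    private final int id;

    private Status(int id) {
        this.id = id;
    }

    /** Value written to entity.status_id. */
    public int getId() {
        return id;
    }

    /** Resolves the enum constant for a status_id read back from the database. */
    public static Status fromId(int id) {
        for (Status status : values()) {
            if (status.id == id) {
                return status;
            }
        }
        throw new IllegalArgumentException("Unknown status id: " + id);
    }
}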
One solution I've encountered is to use a materialized view in the database to dynamically recalculate the denormalized relation. In a document based database you would probably store the CaseStatus as a String. Finally, you might use an ORM tool to store CaseStatus as an Object but in the cases I'm familiar with the reference data is stored in the database (if you store it in code then it requires a build and deployment to production, along with additional testing for the release).
I'm reviewing the capabilities of Google's Guava API and I ran into a data structure that I haven't seen used in my 'real world programming' experience, namely the BiMap. Is the only benefit of this construct the ability to quickly retrieve a key for a given value? Are there any problems where the solution is best expressed using a BiMap?
Any time you want to be able to do a reverse lookup without having to populate two maps. For instance, a phone directory where you would like to look up the phone number by name, but would also like to do a reverse lookup to get the name from the number.
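A minimal sketch of that phone directory using Guava's HashBiMap (the names and numbers are made up):

import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;

public class PhoneDirectory {
    public static void main(String[] args) {
        BiMap<String, String> directory = HashBiMap.create();
        directory.put("Alice", "555-0100");
        directory.put("Bob", "555-0199");

        // Forward lookup: name -> number
        System.out.println(directory.get("Alice"));               // 555-0100

        // Reverse lookup: number -> name, without maintaining a second map
        System.out.println(directory.inverse().get("555-0199"));  // Bob

        // put(newKey, existingValue) would throw; forcePut overrides the old mapping
        directory.forcePut("Charlie", "555-0100");
        System.out.println(directory.inverse().get("555-0100"));  // Charlie
    }
}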
Louis mentioned the memory savings possible in a BiMap implementation. That's the only thing that you can't get by wrapping two Map instances. Still, if you let us wrap the Map instances for you, we can take care of a few edge cases. (You could handle all these yourself, but why bother? :))
If you call put(newKey, existingValue), we'll error out immediately to keep the two maps in sync, rather than adding the entry to one map before realizing that it conflicts with an existing mapping in the other. (We provide forcePut if you do want to override the existing value.) We provide similar safeguards for inserting null or other invalid values.
BiMap views keep the two maps in sync: If you remove an element from the entrySet of the original BiMap, its corresponding entry is also removed from the inverse. We do the same kind of thing in Entry.setValue.
We handle serialization: A BiMap and its inverse stay "connected," and the entries are serialized only once.
We provide a smart implementation of inverse() so that foo.inverse().inverse() returns foo, rather than a wrapper of a wrapper.
We override values() to return a Set. This set is identical to what you'd get from inverse().keySet() except that it maintains the same iteration order as the original BiMap.
Here is a tricky data structure and data organization case.
I have an application that reads data from large files and produces objects of various types (e.g., Boolean, Integer, String) that are categorized in a few (less than a dozen) groups and then stored in a database.
Each object is currently stored in a single HashMap<String, Object> data structure. Each such HashMap corresponds to a single category (group). Each database record is built from the information in all the objects contained in all categories (HashMap data structures).
A requirement has appeared for checking whether subsequent records are "equivalent" in the number and type of columns, where equivalence must be verified across all maps by comparing the name (HashMap key) and the type (actual class) of each stored object.
I am looking for an efficient way of implementing this functionality, while maintaining the original object categorization, because listing objects by category in the fastest possible way is also a requirement.
An idea would be to just sort the keys (e.g., by replacing each HashMap with a TreeMap) and then walk over all maps. An alternative would be to just copy everything in a TreeMap for comparison purposes only.
What would be the most efficient way of implementing this functionality?
Also, how would you go about finding the difference (i.e., the fields added and those removed) between successive records?
Create a meta SortedSet in which you store all the created maps.
That means a SortedSet<Map<String,Object>>, e.g. a TreeSet with a custom Comparator<Map<String,Object>> that checks exactly your requirement: the same number and names of keys and the same object type per value.
You can then use the contains() method of this meta set structure to find out if a similar record already exists.
==== EDIT ====
Since I misunderstood the relation between the database records and the maps in the first place, I have to change the semantics of my answer a little bit.
I would still use the mentioned SortedSet<Map<String,Object>>, but of course the Map<String,Object> would now point to the map you and havexy suggested.
On the other hand, it could be a step forward to use a Set<Set<KeyAndType>> or SortedSet<Set<KeyAndType>>, where KeyAndType contains only the key and the type, with an appropriate Comparable implementation or equals and hashCode.
Why? You asked how to find the differences between two records. If each record relates to one of those inner Set<KeyAndType> instances, you can easily use retainAll() to form the intersection of two successive sets.
If you compare this to the idea of a SortedSet<Map<String,Object>>, in both cases the logic that differentiates between the fields sits in the comparator, one time comparing inner sets, one time comparing inner maps. Since that information is lost once the surrounding set is constructed, it will be hard to get the differences between two records later on unless you keep another, reduced structure that is easy to use for finding such differences. And since such a Set<KeyAndType> can act as a key as well as an easy basis for comparison between two records, it is a good candidate for both purposes.
If, furthermore, you want to keep the relation between such a Set<KeyAndType> and your record or group of Map<String,Object>, your meta structure could be something like:
Map<Set<KeyAndType>,DatabaseRecord> or Map<Set<KeyAndType>,GroupOfMaps> implemented by a simple LinkedHashMap which allows simple iteration in original order.
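A sketch of what KeyAndType and the per-category signature could look like (the class name KeyAndType is from the answer above; everything else is made up):

import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public final class KeyAndType {
    private final String key;
    private final Class<?> type;

    public KeyAndType(String key, Class<?> type) {
        this.key = key;
        this.type = type;
    }

    // equals/hashCode on key + type are what make set comparison and retainAll() work
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof KeyAndType)) return false;
        KeyAndType other = (KeyAndType) o;
        return key.equals(other.key) && type.equals(other.type);
    }

    @Override
    public int hashCode() {
        return Objects.hash(key, type);
    }

    /** Builds the signature of one category map: key name plus value class. */
    public static Set<KeyAndType> signatureOf(Map<String, Object> category) {
        Set<KeyAndType> signature = new HashSet<>();
        for (Map.Entry<String, Object> entry : category.entrySet()) {
            signature.add(new KeyAndType(entry.getKey(), entry.getValue().getClass()));
        }
        return signature;
    }
}

Two records are then equivalent when their signatures are equal; the fields they share come from retainAll(), and the added or removed ones from removeAll() in either direction.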
One solution is to keep both the category-based HashMaps and a combined TreeMap. This requires slightly more memory, though not much, as you will just keep the same references in both of them.
So whenever you add to or remove from a HashMap, you do the same operation on the TreeMap too. This way both will always stay in sync.
You can then use the TreeMap for comparison, whether you want to compare the types of the objects or their actual contents.
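A minimal sketch of keeping both structures in sync, assuming field names are unique across categories (otherwise the combined key would need to include the category); the class and method names are made up:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class RecordStore {
    private final Map<String, Map<String, Object>> byCategory = new HashMap<>();
    private final TreeMap<String, Object> combined = new TreeMap<>();

    public void put(String category, String key, Object value) {
        byCategory.computeIfAbsent(category, c -> new HashMap<>()).put(key, value);
        combined.put(key, value); // same reference, kept in sync with the category maps
    }

    public void remove(String category, String key) {
        Map<String, Object> group = byCategory.get(category);
        if (group != null) {
            group.remove(key);
        }
        combined.remove(key);
    }

    /** Sorted, combined view used for record-to-record comparison. */
    public TreeMap<String, Object> combinedView() {
        return combined;
    }
}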
This is actually more of a Lucene question, but it's in the context of a neo4j database.
I have a database that's divided into 50 or so node types (so "collections" or "tables" in other types of dbs). Each has a subset of properties that need to be indexed, some share the same name, some don't.
When searching, I always want to find nodes of a specific type, never across all nodes.
I can see three ways of organizing this:
One index per type, properties map naturally to index fields: index 'foo', 'id'='1234'.
A single global index, each field maps to a property name, to distinguish the type either include it as part of the value ('id'='foo:1234') or check the nodes once they're returned (I expect duplicates to be very rare).
A single index, type is part of the field name: 'foo.id'='1234'.
Once created, the database is read-only.
Are there any benefits to one of those, in terms of convenience, size/cache efficiency, or performance?
As I understand it, for the first option neo4j will create a separate physical index for each type, which seems suboptimal. For the third, I end up with most lucene docs only having a small subset of the fields, not sure if that affects anything.
I came across this problem recently when I was building an ActiveRecord connection adapter for Neo4j over REST, to be used in a Rails project. Since both ActiveRecord and ActiveRelation are tightly coupled to SQL syntax, it became difficult to fit everything into NoSQL. It might not be the best solution, but here's how I solved it:
Created an index named model_index which indexed nodes under two keys, type and model
Index lookup with the type key currently happens with just one value, model. This was introduced primarily to achieve SHOW TABLES SQL functionality, which can get me a list of all models present in the graph.
Index lookup with model key takes place with values corresponding to different model names in my system. This is primarily for achieving DESC <TABLENAME> functionality.
With each table creation as in CREATE TABLE, a node is created with table definition attributes being stored in node properties.
The created node is indexed under model_index with type:model and model:<model-name>. This puts the newly created model into the list of 'tables' and also allows one to reach the model node directly via an index lookup with the model key.
For each record created per model (type in your case), an outgoing edge is created labeled instances directed from model node to this new record. v[123] :=> [instances] :=> v[245] where v[123] represents model node and v[245] represents a record of v[123]'s type.
Now if you want to get all instances of a specified type, you could lookup the model_index with model:<model-name> to reach a model node and then fetch all adjacent nodes over an outgoing edge labeled instances. Filtered lookups can be further achieved by applying filters and other complex traversals.
The above solution keeps model_index from clogging, since it only contains two entries per model, and achieves an effective record lookup via one index lookup and a single-level traversal.
Although in your case nodes of different types are not adjacent to each other, even if they were, you could determine the type of any arbitrary node by simply looking up its adjacent node over an incoming edge labeled instances. Further, I'm considering incorporating SpringDataGraph's pattern of storing a __type__ property on each instance node to avoid this adjacent-node lookup.
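For illustration only, here is a rough sketch of the model_index scheme using Neo4j's legacy embedded indexing API (roughly as it looked around Neo4j 2.x) rather than the Gremlin scripts mentioned below; the model name and method names are made up:

import java.util.ArrayList;
import java.util.List;

import org.neo4j.graphdb.*;
import org.neo4j.graphdb.index.Index;

public class ModelIndexExample {

    private static final RelationshipType INSTANCES =
            DynamicRelationshipType.withName("instances");

    public static Node createModel(GraphDatabaseService graphDb, String modelName) {
        try (Transaction tx = graphDb.beginTx()) {
            Index<Node> modelIndex = graphDb.index().forNodes("model_index");
            Node modelNode = graphDb.createNode();
            modelNode.setProperty("name", modelName);
            modelIndex.add(modelNode, "type", "model");     // for "SHOW TABLES"
            modelIndex.add(modelNode, "model", modelName);  // for "DESC <TABLENAME>"
            tx.success();
            return modelNode;
        }
    }

    public static Node createRecord(GraphDatabaseService graphDb, Node modelNode) {
        try (Transaction tx = graphDb.beginTx()) {
            Node record = graphDb.createNode();
            modelNode.createRelationshipTo(record, INSTANCES);
            tx.success();
            return record;
        }
    }

    /** All instances of a model: one index lookup plus a single-level traversal. */
    public static List<Node> instancesOf(GraphDatabaseService graphDb, String modelName) {
        try (Transaction tx = graphDb.beginTx()) {
            Index<Node> modelIndex = graphDb.index().forNodes("model_index");
            Node modelNode = modelIndex.get("model", modelName).getSingle();
            List<Node> instances = new ArrayList<>();
            for (Relationship rel : modelNode.getRelationships(Direction.OUTGOING, INSTANCES)) {
                instances.add(rel.getEndNode());
            }
            tx.success();
            return instances;
        }
    }
}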
I'm currently translating AREL to Gremlin scripts for almost everything. You could find the source code for my AR Adapter at https://github.com/yournextleap/activerecord-neo4j-adapter
Hope this helps, Cheers! :)
A single index will be smaller than several little indexes, because some data, such as the term dictionary, will be shared. However, since a term dictionary lookup is an O(lg(n)) operation, a lookup in a bigger term dictionary might be a little slower. (If you have 50 indexes, this would only require 6 (2^6 >= 50) more comparisons, so it is likely you won't notice any difference.)
Another advantage of a smaller index is that the OS cache is likely to make queries run faster.
Instead of your options 2 and 3, I would index two different fields, id and type, and search for (id:ID AND type:TYPE), but I don't know if it is possible with neo4j.
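For what it's worth, Neo4j's legacy Lucene-backed indexes do accept raw Lucene query syntax, so a sketch of that two-field lookup might look like this (the index and field names are made up):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

public class TwoFieldLookup {
    public static Node findByIdAndType(GraphDatabaseService graphDb, String id, String type) {
        try (Transaction tx = graphDb.beginTx()) {
            Index<Node> index = graphDb.index().forNodes("nodes");
            // Lucene query syntax: both clauses must match
            Node result = index.query("id:" + id + " AND type:" + type).getSingle();
            tx.success();
            return result;
        }
    }
}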
spring-data-neo4j is using the first approach - it creates a different index for each type. So I guess that's a good option for the general scenario. But in your particular case it might be suboptimal, as you say. I'd run some benchmarks to measure the performance.
The other two, by the way, seem a bit artificial. You are possibly indexing completely unrelated information in the same index, which doesn't sound right.
What is the best data structure to store an Oracle table that's about 140 rows by 3 columns? I was thinking about a multidimensional array.
By best I do not necessarily mean most efficient (though I'd be curious to know your opinions), since the program will run as a job with plenty of time, but I do have some restrictions:
It is possible for multiple keys to be "null" at first, so the first column might have multiple null values. I also need to be able to access elements from the other columns. Is there anything better than a linear search to access the data?
So again, something like [][][] would work, but is there something like a three-column map where I can access a row by the key or by the second column? I know maps have only two values.
All data will probably be strings or cast as strings.
Thanks
A custom class with 3 fields, and a java.util.List of that class.
There's no benefit in shoe-horning data into arrays in this case, you get no improvement in performance, and certainly no improvement in code maintainability.
This is another example of people writing FORTRAN in an object-oriented language.
Java's about objects. You'd be much better off if you started using objects to abstract your problem, hide details away from clients, and reduce coupling.
What sensible object, with meaningful behavior, do those three items represent? I'd start with that, and worry about the data structures and persistence later.
All data will probably be strings or cast as strings.
This is fine if they really are strings, but I'd encourage you to look deeper and see if you can do better.
For example, if you write an application that uses credit scores you might be tempted to persist it as a number column in a database. But you can benefit from looking at the problem harder and encapsulating that value into a CreditScore object. When you have that, you realize that you can add something like units ("FICO" versus "TransUnion"), scale (range from 0 to 850), and maybe some rich behavior (e.g., rules governing when to reorder the score). You encapsulate everything into a single object instead of scattering the logic for operating on credit scores all over your code base.
Start thinking less in terms of tables and columns and more about objects. Or switch languages. Python has the notion of tuples built in. Maybe that will work better for you.
If you need to access your data by key and by another key, then I would just use 2 maps for that and define a separate class to hold your record.
class Record {
    String field1;
    String field2;
    String field3;
}
and
Map<String, Record> firstKeyMap = new HashMap<String, Record>();
Map<String, Record> secondKeyMap = new HashMap<String, Record>();
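A short sketch of how the two maps might be populated and used together (the values here are placeholders):

Record record = new Record();
record.field1 = "key-from-column-1";   // may be null at first
record.field2 = "key-from-column-2";
record.field3 = "some value";

if (record.field1 != null) {
    firstKeyMap.put(record.field1, record);
}
secondKeyMap.put(record.field2, record);

// O(1) lookups by either key, no linear search needed
Record byFirst = firstKeyMap.get("key-from-column-1");
Record bySecond = secondKeyMap.get("key-from-column-2");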
I'd create an object which maps your record and then create a collection of those objects.