We have the following situation. We have a couple of repositories that hold documents. We have written front-end services that deal with documents and document data across the different repositories. We have operations that allow you to, among other things, store new documents and retrieve document metadata.
The problem is, there are different types of documents in the repositories that each have different sets of metadata. For example, all documents in one repository have document name, date added, size, ID, document type and document source. Billing documents also have billing account number and customer name. Policy documents have policy number, insured name and agency code. Some special policy documents also have effective date and packet type.
In the second repository, documents have document name, date added, size, type and location. Invoices (which are Billing documents in the other repository) have account number and customer name, but also invoice date. Policy documents have policy number, insured name, agency code, effective date and policy type. Some special policy documents have cancellation date and amount due.
The reality is more complicated, but this represents the basic issue I'm having.
I don't really have control over the existing metadata fields. Those are defined elsewhere and some of it's legacy. Also, these are SOAP web services, but will eventually become RESTful. But for now, they're defined by a WSDL.
So, what's the best way to represent these things that have many similarities, but some differences?
Some of the considerations:
I'd like to shield the client from as much repository-specific info as possible. In a perfect world, the client shouldn't care if the doc is from one repository or another, although the different fields may make this a pipe dream.
I'd like a single newDocument and getDocumentProperties call to accept and return the pertinent data for each type, rather than have individual new and get calls for each different document type.
I could go with one big fat object with all possible fields and an enum to tell them apart, but that means the client has to magically know what fields apply and what don't.
I could go with a specific object for each possible set of document fields, but then the client has to know whether the doc is going to or coming from a particular repository which is more than I want them to know.
For now, I've gone with the best (or worst?) of both worlds, going with a few high-level abstractions (Policy document, Billing document), converting where I can and leaving any unknown or undefined data for that abstraction empty.
But this means that the client still has to know that, for example, for some Billing docs you'll have invoice date, but for others you won't. Or that for docs from one repository you'll have an ID but for the other you'll have location.
Anyway, I'm looking for best practices for dealing with these sorts of similar, but different objects.
So, what's the best way to represent these things that have many similarities, but some differences?
I think how you represent/model the data depends on your application requirements, and there isn't a globally accepted best practice that I know of. Some (all?) of the options are:
Map document fields with key value pairs
One fat object with every possible field.
Slim hierarchy with classes containing only shared fields.
Slim hierarchy + dynamic meta-data (e.g. BillingDocument contains only the shared fields plus a map holding the fields unique to each repository; see the sketch after this list)
Complex hierarchy with subclasses to hold the unique fields (e.g. BaseBillingDocument, RepoOneBillingDocument, RepoTwoBillingDocument)
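To make the fourth option concrete, here is a minimal Java sketch. The class and field names (BillingDocument, the extras map, and its accessors) are illustrative assumptions, not anything from your actual WSDL:

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Shared fields common to billing documents in every repository.
public class BillingDocument {
    private String documentName;
    private Date dateAdded;
    private long size;
    private String accountNumber;
    private String customerName;

    // Repository-specific fields (e.g. "invoiceDate", "location")
    // travel in a generic map instead of dedicated properties.
    private final Map<String, String> extraMetadata = new HashMap<>();

    public void putExtra(String key, String value) {
        extraMetadata.put(key, value);
    }

    public String getExtra(String key) {
        return extraMetadata.get(key); // null if the source repository didn't supply it
    }

    // getters/setters for the shared fields omitted for brevity
}
```

The trade-off is the usual one: the map keeps the contract stable when repositories differ, but the client loses compile-time knowledge of which keys exist.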
Some of the considerations:
I'd like to shield the client from as much repository-specific info as possible. In a perfect world, the client shouldn't care if the doc is from one repository or another, although the different fields may make this a pipe dream.
This is a business issue, not a technical one. Normalise the data by discarding unnecessary fields, declaring them optional (and expected to be empty at times), computing missing values where they can be derived from other common attributes, or living with the fact that you have different subtypes of the same document (BillingDocRepo1, BillingDocRepo2).
I'd like a single newDocument and getDocumentProperties call to accept and return the pertinent data for each type, rather than have individual new and get calls for each different document type.
This is doable in almost all of the representations. Inheritance and polymorphism are supported in both REST and SOAP web services, and it is also doable if you're using a dynamic schema (a map, for instance, or a class with metadata).
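As an illustration of the single-operation idea, a hedged sketch of what the service-side types might look like in Java; the names (DocumentProperties and its subtypes, DocumentService) are assumptions, and on the wire the subtypes would surface as xsd extensions of a base complex type:

```java
import javax.xml.bind.annotation.XmlSeeAlso;

// Base type plus subtypes; in the WSDL these become a base complexType
// with extensions, so one operation can accept or return any of them.
@XmlSeeAlso({BillingDocumentProperties.class, PolicyDocumentProperties.class})
abstract class DocumentProperties {
    protected String documentName;
    protected java.util.Date dateAdded;
    protected long size;
}

class BillingDocumentProperties extends DocumentProperties {
    protected String accountNumber;
    protected String customerName;
    protected java.util.Date invoiceDate; // null when the source repository doesn't track it
}

class PolicyDocumentProperties extends DocumentProperties {
    protected String policyNumber;
    protected String insuredName;
    protected String agencyCode;
}

// One pair of operations covers every subtype.
interface DocumentService {
    String newDocument(DocumentProperties properties);
    DocumentProperties getDocumentProperties(String documentId);
}
```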
Related
In The Pragmatic Programmer book, in the chapter on data source duplication, the authors state:
Many Data sources allow you to introspect on their data schema. This can be used to remove much of the duplication between them and your code. Rather than manually creating the code to contain this stored data, you can generate the containers directly from the schema. Many persistence frameworks will do this heavy lifting for you.
So far so good. We can achieve this easily by connecting our IDE to the DB and letting it create our entities for us.
Then it continues:
There’s another option, and one we often prefer. Rather than writing code that represents external data in a fixed structure (an instance of a struct or class, for example), just stick it into a key/value data structure (your language might call it a map, hash, dictionary, or even object). On its own this is risky .... we recommend adding a second layer to this solution: a simple table-driven validation suite that verifies that the map you’ve created contains at least the data you need. Your API documentation tool might be able to generate this.
The idea, if I got it right, is to avoid having an entity to represent the table in the DB (so as to avoid duplication of knowledge) and instead to use a map, so that if we add a new column to the schema we don’t need to update our representation of that schema (i.e. the entity) in our application as well.
Then comes the part that is not clear to me: they talk about an autogenerated “table-driven validation suite that verifies that the map you’ve created contains at least the data you need”.
Does any of you know what an implementation of this concept would look like?
The closest thing I could find on Google about this topic is this question on StackOverflow, but the answers skipped the second part.
I think it really depends on the language you’re using and on the data you need to read. For Java, if you’re mapping the raw data to a Map, what you could do is use validators (e.g. Hibernate validators or Spring validators) to define your custom annotations and enforce that the schema’s constraints are respected when creating the in-memory representation (e.g. you’re reading a user table with id as the primary key, so the map must contain the id key with a valid value).
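One possible reading of the book's suggestion, sketched in Java (the required-column table and the validateUserRow helper are my own invention, not from the book): load the row into a plain Map, then run it through a small, table-driven list of required keys.

```java
import java.util.List;
import java.util.Map;

// A tiny "table-driven validation suite": the schema knowledge lives in one
// small table of required keys instead of in a dedicated entity class.
public class RowValidator {

    // This table could itself be generated from the DB schema or API docs.
    private static final List<String> REQUIRED_USER_COLUMNS =
            List.of("id", "email", "created_at");

    public static void validateUserRow(Map<String, Object> row) {
        for (String column : REQUIRED_USER_COLUMNS) {
            if (row.get(column) == null) {
                throw new IllegalStateException(
                        "Row is missing required column: " + column);
            }
        }
    }
}
```

The data itself keeps flowing around as a Map, so adding a column to the table doesn't force a change to an entity class; only rows missing the keys you actually depend on fail fast.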
In my application I have a set of objects which stay alive during the whole application lifecycle, and I need to create a historical database of them.
These objects are instances of a hierarchy of Java / Scala classes annotated with Hibernate annotations, which I use in my application to load them at startup. Luckily all the classes already contain a timestamp, which means that I do not need to change the object model to be able to create historical records.
What is the most suitable approach:
Use Hibernate without annotations, providing external XML mappings that are the same as the annotation mappings except for the primary key (which is now a composite key consisting of the previous primary key plus the timestamp)
Use other classes for historical records (this sounds very complicated, as I have a hierarchy of classes and not a single class, and I would have to subclass my HistoricalRecordClass for every type of record, as I want to build it back). Still use Hibernate
Use a completely different approach (please note I do not like ORMs; it is just a matter of convenience)
Some considerations:
The goal of storing historical records is that the user, through a single GUI, might access both the real-time values of certain data or the historical value, just by specifying a date.
How do you intend to use the historical records? The easiest solution would be to serialize them as JSON and log them to a file.
I've never combined Hibernate XML mappings with Hibernate annotations, but if it works, it sounds more attractive than carrying two parallel object models.
If you need to be able to recreate the application state at any point in time, then you're more or less stuck with writing them to a database (because of the fast random access). You could cheat and have a "history" table that has a composite key of id + timestamp + type, then a "json" field where you just marshal the thing down and save it. That would help with a) carrying one history table instead of a bunch of clone tables, and b) give you some flexibility if the schema changes (i.e. leverage the open schema nature of JSON)
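To make that "cheat" concrete, here is a rough sketch using plain JDBC and Jackson; the table name, column names and the snapshot parameter are assumptions, not anything from your existing object model:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import com.fasterxml.jackson.databind.ObjectMapper;

public class HistoryWriter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // history(id, recorded_at, type, json) with a composite primary key
    // on (id, recorded_at, type); the object itself is stored whole as JSON.
    public void writeSnapshot(Connection conn, long id, Object snapshot) throws Exception {
        String sql = "INSERT INTO history (id, recorded_at, type, json) VALUES (?, ?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
            ps.setString(3, snapshot.getClass().getName());
            ps.setString(4, MAPPER.writeValueAsString(snapshot));
            ps.executeUpdate();
        }
    }
}
```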
But since it's archive data with a different usage pattern (you're just reading/writing the records whole), I'd think about some other means of storing it than with the same strict schema as the live data.
It's a nice application of the "write once" paradigm... do you have Hadoop available? ;)
I am exposing a couple domain objects via a SOAP based web service. Some of my domain objects have a large number of fields. I do not want to include values in my web service request/response unless they are needed.
For example, if I have a Book domain object with fields title, genre, and isbn, and I want to use my web service to update the name of a book, I want my request to include only the title field (omitting the other two fields that aren't being updated).
Likewise, I want my web service clients to be able to specify which fields they want to be returned when they load books.
This would allow clients to load the title field thereby reducing the size of the data going across the wire because the fields that aren't needed would not be included in the response.
Does anyone know of any patterns or best practices to deal with this type of requirement?
You've touched on multiple problems, each of which deserves a separate explanation:
Reducing traffic - reducing traffic usually means reducing round trips, not reducing payload. It is achieved by implementing better operations that perform multiple actions, instead of exposing CRUD operations.
Reducing payload - if you don't want to transfer the whole entity, use Data Transfer Objects: special objects carrying only the data required for a given operation.
Dynamic response - web services are not supposed to do that. A web service has a fixed interface defined by a WSDL, where each message payload is specified by an XSD. If you want to dynamically change the returned data structure, you will break this. That doesn't mean it is impossible - you can declare that your service operation returns xsd:any (any XML), and it will be your duty to prepare the returned XML and your client's duty to parse it.
You can either make the fields optional in the XSD data type, or you can specify that in the changeTitle request you don't expect a Book, but only an ID and a string.
When you invent a changeAttributes request and you have optional fields, you have to decide what a missing field means: clear this field, or leave this field untouched.
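A minimal sketch of what a dedicated request object for the second option might look like with JAXB annotations; the element names and the ChangeTitleRequest type are illustrative, not part of any existing contract:

```java
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// The request carries only what the operation needs: an ID and the new title.
// The other Book fields simply never appear in this message's XSD.
@XmlRootElement(name = "changeTitleRequest")
public class ChangeTitleRequest {

    private String isbn;
    private String title;

    @XmlElement(required = true)
    public String getIsbn() { return isbn; }
    public void setIsbn(String isbn) { this.isbn = isbn; }

    @XmlElement(required = true)
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
}
```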
I've recently overheard people saying that data transfer objects (DTOs) are an anti-pattern.
Why? What are the alternatives?
Some projects have all data twice. Once as domain objects, and once as data transfer objects.
This duplication has a huge cost, so the architecture needs to get a huge benefit from this separation to be worth it.
DTOs are not an anti-pattern. When you're sending some data across the wire (say, to a web page in an Ajax call), you want to be sure that you conserve bandwidth by only sending data that the destination will use. Also, it is often convenient for the presentation layer to have the data in a slightly different format than a native business object.
I know this is a Java-oriented question, but in .NET languages anonymous types, serialization, and LINQ allow DTOs to be constructed on-the-fly, which reduces the setup and overhead of using them.
"DTO an AntiPattern in EJB 3.0" (original link currently offline) says:
The heavy weight nature of Entity Beans in EJB specifications prior to EJB 3.0 resulted in the usage of design patterns like Data Transfer Objects (DTO). DTOs became the lightweight objects (which should have been the entity beans themselves in the first place), used for sending the data across the tiers... now the EJB 3.0 spec makes the Entity bean model the same as Plain Old Java Objects (POJO). With this new POJO model, you will no longer need to create a DTO for each entity or for a set of entities... If you want to send the EJB 3.0 entities across the tier, just make them implement java.io.Serializable.
OO purists would say that DTO is anti-pattern because objects become data table representations instead of real domain objects.
I don't think DTOs are an anti-pattern per se, but there are antipatterns associated with the use of DTOs. Bill Dudney refers to DTO explosion as an example:
http://www.softwaresummit.com/2003/speakers/DudneyJ2EEAntiPatterns.pdf
There are also a number of abuses of DTOs mentioned here:
http://anirudhvyas.com/root/2008/04/19/abuses-of-dto-pattern-in-java-world/
They originated because of three-tier systems (typically using EJB as the technology) as a means to pass data between tiers. Most modern-day Java systems based on frameworks such as Spring take an alternative, simplified view, using POJOs as domain objects (often annotated with JPA etc...) in a single tier... The use of DTOs there is unnecessary.
Some consider DTOs an anti-pattern due to their possible abuses. They're often used when they shouldn't be/don't need to be.
This article vaguely describes some abuses.
The question should not be "why", but "when".
It's definitely an anti-pattern when the only result of using it is higher cost - run-time or maintenance. I worked on projects with hundreds of DTOs identical to database entity classes. Each time you wanted to add a single field you had to add it in something like four places - the DTO, the entity, the conversion from DTO to domain classes or entities, the inverse conversion, ... You'd forget some of the places and the data would get inconsistent.
It's not an anti-pattern when you really need a different representation of the domain classes - flatter, richer, ...
Personally I start with a domain class and pass it around, with proper checks at the right places. I can annotate and/or add some "helper" classes to make mappings to database, to serialization formats like JSON or XML ... I can always split a class to two if I feel the need.
It's about your point of view - I prefer to look at a domain object as a single object playing various roles, instead of multiple objects created from each other. If the only role of an object is to transport data, then it's a DTO.
If you're building a distributed system, then DTOs are certainly not an anti-pattern. Not everyone develops in that context, but take, for example, an Open Social app running entirely off JavaScript.
It will post a load of data to your API. This is then deserialized into some form of object, typically a DTO/request object. This can then be validated to ensure the data entered is correct before being converted into a model object.
In my opinion, it's seen as an anti-pattern because it's misused. If you're not building a distributed system, chances are you don't need them.
DTO becomes a necessity and not an ANTI-PATTERN when you have all your domain objects load associated objects EAGERly.
If you don't make DTOs, you will have unnecessary transferred objects from your business layer to your client/web layer.
To limit the overhead in this case, transfer DTOs instead.
I think people mean it could be an anti-pattern if you implement all remote objects as DTOs. A DTO is merely a set of attributes, and if you have big objects you would always transfer all the attributes even if you do not need or use them. In the latter case, prefer using the Proxy pattern.
The intention of a Data Transfer Object is to store data from different sources and then transfer it into a database (or Remote Facade) at once.
However, the DTO pattern violates the Single Responsibility Principle, since the DTO not only stores data, but also transfers it from or to the database/facade.
The need to separate data objects from business objects is not an antipattern, since it is probably required to separate the database layer anyway.
Instead of DTOs you should use the Aggregate and Repository Patterns, which separates the collection of objects (Aggregate) and the data transfer (Repository).
To transfer a group of objects you can use the Unit Of Work pattern, that holds a set of Repositories and a transaction context; in order to transfer each object in the aggregate separately within the transaction.
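For illustration only, a very rough Java sketch of the Repository plus Unit of Work shape described above; the interfaces and the Order aggregate are hypothetical names, not from any particular framework:

```java
import java.util.ArrayList;
import java.util.List;

// A repository abstracts persistence for one aggregate type.
interface OrderRepository {
    Order findById(long id);
    void save(Order order);
}

// A unit of work collects changed aggregates and commits them together.
class UnitOfWork {
    private final List<Order> dirtyOrders = new ArrayList<>();
    private final OrderRepository orders;

    UnitOfWork(OrderRepository orders) {
        this.orders = orders;
    }

    void registerDirty(Order order) {
        dirtyOrders.add(order);
    }

    void commit() {
        // A real implementation would open and commit a transaction here.
        for (Order order : dirtyOrders) {
            orders.save(order);
        }
        dirtyOrders.clear();
    }
}

class Order { /* aggregate root with its own invariants */ }
```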
Let's say I have a set of Countries in my application. I expect this data to change but not very often. In other words, I do not look at this set as an operational data (I would not provide CRUD operations for Country, for example).
That said I have to store this data somewhere. I see two ways to do that:
Database driven. Create and populate a Country table. Provide some sort of DAO to access it (findById() ?). This way client code will have to know Id of a country (which also can be a name or ISO code). On the application side I will have a class Country.
Application driven. Create an Enum where I can list all the Countries known to my system. It will be stored in DB as well, but the difference would be that now client code does not have to have lookup method (findById, findByName, etc) and hardcode Id, names or ISO codes. It will reference particular country directly.
I lean towards second solution for several reasons. How do you do this?
Is this correct to call this 'dictionary data'?
Addendum: One of the main problems here is that if I have a lookup method like findByName("Czechoslovakia"), then after 1992 it will return nothing. I do not know how the client code will react to that (after all, it sort of expects to always get a Country back, because, well, it is dictionary data). It gets even worse if I have something like findById(ID_CZ). It will be really hard to find all these dependencies.
If I will remove Country.Czechoslovakia from my enum, I will force myself to take care of any dependency on Czechoslovakia.
In some applications I've worked on there has been a single 'Enum' table in the database that contained all of this type of data. It simply consisted of two columns: EnumName and Value, and would be populated like this:
"Country", "Germany"
"Country", "United Kingdom"
"Country", "United States"
"Fruit", "Apple"
"Fruit", "Banana"
"Fruit", "Orange"
This was then read in and cached at the beginning of the application execution. The advantages being that we weren't using dozens of database tables for each distinct enumeration type; and we didn't have to recompile anything if we needed to alter the data.
This could easily be extended to include extra columns, e.g. to specify a default sort order or alternative IDs.
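A hedged sketch of how that single Enum table might be read and cached at start-up; the table and column names follow the answer above, and the JDBC wiring is purely illustrative:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EnumCache {

    // "Country" -> ["Germany", "United Kingdom", ...], "Fruit" -> [...]
    private final Map<String, List<String>> valuesByEnumName = new HashMap<>();

    public void load(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT EnumName, Value FROM Enum")) {
            while (rs.next()) {
                valuesByEnumName
                        .computeIfAbsent(rs.getString("EnumName"), k -> new ArrayList<>())
                        .add(rs.getString("Value"));
            }
        }
    }

    public List<String> valuesOf(String enumName) {
        return valuesByEnumName.getOrDefault(enumName, List.of());
    }
}
```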
This won't help you, but it depends...
- What are you going to do with those countries?
Will you store them in other tables in the DB? What will happen to existing data if you add new countries? Will other applications access that data?
- Are you going to translate the country names into several languages?
- Will the business logic of your application depend on the chosen country?
- Do you need a Country class?
etc...
Without more information I would start with an enum with a few countries and refactor depending on my needs...
If it's not going to change very often and you can afford to bring the application down to apply updates, I'd place it in a Java enumeration and write my own methods for findById(), findByName() and so on.
Advantages:
Fast - no DB access for invariant data (or caching requirement);
Simple;
Plays nice with refactoring tools.
Disadvantages:
Need to bring down the application to update.
If you place the data in its own jarfile, updating is as simple as updating the jar and restarting the application.
The hardcoding concern can be made to go away either by consumers storing a value of the enumeration itself, or by referencing the ISO code which is unlikely to change for countries...
If you're worried about keeping this enumeration "in synch" with the database, write an integration test that checks exactly that and run it regularly (eg: on your CI machine).
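A minimal sketch of the enum-with-lookup-methods approach described here; the ISO codes shown are real, but the structure and method names are just illustrative:

```java
public enum Country {
    GERMANY("DE", "Germany"),
    UNITED_KINGDOM("GB", "United Kingdom"),
    UNITED_STATES("US", "United States");

    private final String isoCode;
    private final String displayName;

    Country(String isoCode, String displayName) {
        this.isoCode = isoCode;
        this.displayName = displayName;
    }

    public String isoCode() { return isoCode; }
    public String displayName() { return displayName; }

    public static Country findByIsoCode(String code) {
        for (Country c : values()) {
            if (c.isoCode.equals(code)) {
                return c;
            }
        }
        return null; // or throw, or return Optional.empty()
    }

    public static Country findByName(String name) {
        for (Country c : values()) {
            if (c.displayName.equalsIgnoreCase(name)) {
                return c;
            }
        }
        return null;
    }
}
```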
Personally, I've always gone for the database approach, mostly because I'm already storing other information in the database so writing another DAO is easy.
But another approach might be to store it in a properties file in the jar? I've never done it that way in Java, but it seems to be common in iPhone development (something I'm currently learning).
I'd probably have a text file embedded into my jar. I'd load it into memory on start-up (or on first use.) At that point:
It's easy to change (even by someone with no programming knowledge)
It's easy to update even without full redeployment - put just the text file somewhere on the class path
No database access required
EDIT: Okay, if you need to refer to the particular country data from code, then either:
Use the enum approach, which will always mean redeployment
Use the above approach, but keep an enum of country IDs and then have a unit test to make sure that each ID is mapped in the text file. That means you could change the rest of the data without redeployment, and a non-technical person can still update the data without seeing scary code everywhere.
Ultimately it's a case of balancing pros and cons - if the advantages above aren't relevant for you (e.g. there'll always be a coder on hand, and deployment isn't an issue) then an enum makes sense.
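A sketch of the kind of unit test meant in the second option above, assuming a hypothetical countries.txt on the classpath with one `ID=Name` line per country and a small enum of the IDs the code refers to; all names here are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

public class CountryDataTest {

    enum CountryId { DE, GB, US } // the IDs that code references directly

    @Test
    public void everyCountryIdIsMappedInTheTextFile() throws Exception {
        Set<String> idsInFile = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                getClass().getResourceAsStream("/countries.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                idsInFile.add(line.split("=", 2)[0].trim());
            }
        }
        for (CountryId id : CountryId.values()) {
            assertTrue(idsInFile.contains(id.name()),
                    "countries.txt is missing an entry for " + id);
        }
    }
}
```

The rest of the data (names, sort order, etc.) can then change freely in the text file without redeployment, while the test keeps the hard-coded IDs honest.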
One of the advantages of using a database table is you can put foreign key constraints in. That way your referential integrity will always be intact. No need to run integration tests as DanVinton suggested for enums, it will never get out of sync.
I also wouldn't try making a general enum table as saw-lau suggested, mainly because you lose clean foreign key constraints, which is the main advantage of having them in the DB in the first place (you might as well stick them in a text file). Databases are good at handling lots of tables. Prefix the table names with "ENUM_" if you want to distinguish them in some fashion.
The app can always load them into a Map at start-up time or when triggered by a reload event.
EDIT: From comments, "Of course I will use foreign key constraints in my DB. But it can be done with or without using enums on app side"
Ah, I missed that bit while reading the second bullet point in your question. However, I still say it is better to load them into a Map, mainly based on DRY. Otherwise, when whoever maintains it comes to add a new country, they're surely going to update one place but not the other, and be scratching their heads until they figure out they needed to update it in two different places. A case of premature optimisation: the performance benefit would be minimal, at the cost of less maintainable code, IMHO.
I'd start off doing the easiest thing possible - an enum. When it comes to the point that countries change almost as frequently as my code, then I'd make the table external so that it can be updated without a rebuild. But note that when you make it external you open a whole can of UI, testing and documentation worms.