Jena/ARQ: Difference between Model, Graph and DataSet - java

I'm starting to work with the Jena Engine and I think I got a grasp of what semantics are.
However I'm having a hard time understanding the different ways to represent a bunch of triples in Jena and ARQ:
The first thing you stumble upon when starting is Model, and the documentation says it's Jena's name for RDF graphs.
However, there is also Graph, which seemed to be the necessary tool when I want to query a union of models. It does not seem to share a common interface with Model, although one can get the Graph out of a Model.
Then there is DataSet in ARQ, which also seems to be a collection of triples of some sort.
Sure, after some looking around in the API, I found ways to somehow convert from one into another. However, I suspect there is more to it than three different interfaces for the same thing.
So, the question is: What are the key design differences between these three? When should I use which one? Especially: when I want to hold individual bunches of triples but query them as one big bunch (union), which of these data structures should I use (and why)?
Also, do I "lose" anything when "converting" from one into another (e.g. does model.getGraph() contain less information in some way than model)?

Jena is divided into an API, for application developers, and an SPI for systems developers, such as people making storage engines, reasoners etc.
Dataset, Model, Statement, Resource and Literal are API interfaces and provide many conveniences for application developers.
DatasetGraph, Graph, Triple and Node are SPI interfaces. They're pretty spartan and simple to implement (as you'd hope if you've got to implement the things).
The wide variety of API operations all resolve down to SPI calls. To give an example the Model interface has four different contains methods. Internally each results in a call:
Graph#contains(Node, Node, Node)
such as
graph.contains(nodeS, nodeP, nodeO); // model.contains(s, p, o) or model.contains(statement)
graph.contains(nodeS, nodeP, Node.ANY); // model.contains(s, p)
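The delegation described above can be sketched in plain Java. This is only an illustration of the API-over-SPI pattern: MiniTriple, MiniGraph, MemGraph and MiniModel are hypothetical stand-ins, not Jena's classes, and null is assumed to play the wildcard role that Node.ANY plays in the calls above.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT Jena's classes.
record MiniTriple(String s, String p, String o) {}

// SPI-style interface: spartan, with a single pattern-matching contains.
// A null argument acts as a wildcard (an assumption of this sketch).
interface MiniGraph {
    void add(MiniTriple t);
    boolean contains(String s, String p, String o);
}

class MemGraph implements MiniGraph {
    private final Set<MiniTriple> triples = new HashSet<>();

    @Override
    public void add(MiniTriple t) {
        triples.add(t);
    }

    @Override
    public boolean contains(String s, String p, String o) {
        return triples.stream().anyMatch(t ->
                (s == null || s.equals(t.s()))
                        && (p == null || p.equals(t.p()))
                        && (o == null || o.equals(t.o())));
    }
}

// API-style wrapper: holds no triples itself, only convenience overloads
// that all resolve to the one SPI call.
class MiniModel {
    private final MiniGraph graph;

    MiniModel(MiniGraph graph) {
        this.graph = Objects.requireNonNull(graph);
    }

    MiniGraph getGraph() {
        return graph;
    }

    void add(String s, String p, String o) {
        graph.add(new MiniTriple(s, p, o));
    }

    boolean contains(String s, String p, String o) {
        return graph.contains(s, p, o);
    }

    boolean contains(String s, String p) {
        return graph.contains(s, p, null); // wildcard object
    }

    boolean contains(MiniTriple t) {
        return graph.contains(t.s(), t.p(), t.o());
    }
}
```

Because the wrapper keeps no state of its own, getting the graph back out of it loses nothing in this sketch.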
Concerning your question about losing information: with Model and Graph you don't (as far as I recall). The more interesting case is Resource versus Node. Resources know which model they belong to, so you can (in the API) write resource.addProperty(...), which eventually becomes a Graph#add. Node has no such convenience, and is not associated with a particular Graph. Hence Resource#asNode is lossy.
Finally:
When I want to hold individual bunches of triples but query them as one big bunch (union), which of these datastructures should I use (and why)?
You're clearly a normal user, so you want the API. You want to store triples, so use Model. Now you want to query the models as one union: You could:
Model#union() everything, which will copy all the triples into a new model.
ModelFactory.createUnion() everything, which will create a dynamic union (i.e. no copying).
Store your models as named models in a TDB or SDB dataset store, and use the unionDefaultGraph option.
The last of these works best for large numbers of models, and for large models, but is a little more involved to set up.
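The difference between the copying union and the dynamic union can be illustrated with ordinary Java sets. This is a sketch of the general idea only, not of how Jena implements either option; UnionViews, copyUnion and dynamicUnion are hypothetical names.

```java
import java.util.AbstractSet;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.stream.Stream;

class UnionViews {
    // Copying union: a snapshot. Later changes to a or b are not reflected.
    static <T> Set<T> copyUnion(Set<T> a, Set<T> b) {
        Set<T> out = new HashSet<>(a);
        out.addAll(b);
        return out;
    }

    // Dynamic union: a read-only view that consults both sets on every call,
    // so nothing is copied and later additions to a or b show up.
    static <T> Set<T> dynamicUnion(Set<T> a, Set<T> b) {
        return new AbstractSet<T>() {
            @Override
            public boolean contains(Object o) {
                return a.contains(o) || b.contains(o);
            }

            @Override
            public Iterator<T> iterator() {
                // Skip elements of b that are already in a to avoid duplicates.
                return Stream.concat(a.stream(),
                        b.stream().filter(x -> !a.contains(x))).iterator();
            }

            @Override
            public int size() {
                int n = a.size();
                for (T x : b) {
                    if (!a.contains(x)) {
                        n++;
                    }
                }
                return n;
            }
        };
    }
}
```

The view trades memory for lookup cost: every query touches both underlying sets, which is the same trade-off to weigh when choosing between the first two options.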

Short answer: Model is just a stateless wrapper with lots of convenience methods around a Graph. ModelFactory.createModelForGraph(Graph) wraps a graph in a model. Model.getGraph() gets the wrapped graph.
Most application programmers would use Model. Personally I prefer to use Graph because it's simpler. I have trouble remembering all the cruft on the Model class.
Dataset is a collection of several Models: one “default model” and zero or more “named models”. This corresponds to the notion of an “RDF dataset” in SPARQL. (Technically speaking, SPARQL is not a query language for “RDF graphs” but for “RDF datasets” which can be collections of named RDF graphs plus a default graph.)

Related

Implementing Apache TinkerPop over my JavaBeans

I'm new to graph databases (although I've extensive experience with Semantic Web technologies) and I'd like to understand if what I've in mind makes sense.
I've my own data model, made of Java's JavaBean objects, the model is rather similar to a graph, with a Node interface (and a few subclasses), an Edge interface (and a few subclasses), methods to query the model (get Node instances with attribute = 'x', get all edges for a node, etc).
I'd like to wrap this model with one of those query languages out there (let's say Cypher or Gremlin), so to have something more standardised and so that I can avoid implementing my own query language and, most importantly, my own query engine.
One obvious way would be to use Neo4j or some TinkerPop implementation as a backend for my object model (or similarly, to convert/synch my objects to a graph for one of those frameworks). However, because the model is already graph-like, has good search methods and efficient storage components (to/from simple XML files), I'm also thinking that maybe I could adapt a query language to my model. TinkerPop seems designed to support that.
Does this make sense? Is TinkerPop the best (or a good) way to go? Is/are there documentation/tutorials about that?
As a committer of SimpleGraph, I had similar needs, which led me to start the SimpleGraph open source project in the first place.
For conversion of POJOs to and from TinkerPop there is the ORM/OGM stack Ferma.
The idea of SimpleGraph is to "graphenize" other information sources, e.g. the tabular structures of Excel tables or SQL databases.
Since your own data structures are already in graph form, the mapping to and from TinkerPop is much simpler. The SimpleGraph approach in this case would be a simple back-and-forth link between your node and edge structures and TinkerPop's, so that each TinkerPop node (vertex) corresponds to one of your nodes and each TinkerPop edge corresponds to one of your edges. I have successfully used this approach, e.g. for a graphical representation of UML models, by mapping XML structural elements to TinkerPop elements and to graphical representation elements in a graph editor at the same time. So my answers would be:
Does this make sense? Yes
Is TinkerPop the best (or a good) way to go? Yes
Is/are there documentation/tutorials about that? I'd say neither yes nor no to this one.
I have not seen a specific tutorial for your use case. If you experiment a bit e.g. with the SimpleGraph modules you might get a feeling how things work.

Design for defining graphs or flowing structures

I'm trying to create a system for representing and designing graphs in an easy way. That means it should be easy to create some graphical representation from the data structure, but it should also be easy to store the structure and do easy calculations on it. Easy calculations in this sense are questions like which nodes are the next nodes from a given node in the graph.
Is there some nice way to define stuff like this in XML or database structures? The latter would be easier to edit.
Is there maybe already some good java library abstract enough to support my problems?
I'm trying to define a production process which can also have cycles (these cycles are not so important and could be modeled differently), but it feels kind of weird having to make these fundamental design decisions when this problem is so generic.
JUNG - http://jung.sourceforge.net/, may be a good solution for you. It's pretty extensible and has visualization, graph algorithm support etc
neo4j is the "standard" graph database. you can abstract away from a particular implementation (so that you can change the database without changing your code) using Blueprints.
alternatively, if the database part is not so important, a library like jgrapht (i wasn't aware of jung, from chris's answer, but it looks similar) gives you access to the usual algorithms for in-memory structures.
[neo4j licencing]
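For the "which nodes are the next nodes from a given node" kind of question, an in-memory adjacency map is often enough. The following is a minimal sketch, not a replacement for the libraries named above; ProcessGraph and its methods are hypothetical names, and cycles are allowed.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal directed graph for illustration.
class ProcessGraph<V> {
    // Each vertex maps to the set of vertices reachable in one step.
    private final Map<V, Set<V>> successors = new LinkedHashMap<>();

    void addVertex(V v) {
        successors.computeIfAbsent(v, k -> new LinkedHashSet<>());
    }

    void addEdge(V from, V to) {
        addVertex(from);
        addVertex(to);
        successors.get(from).add(to);
    }

    // Answers "which nodes come next after this one?".
    // Unknown vertices simply have no successors.
    Set<V> next(V v) {
        return Collections.unmodifiableSet(successors.getOrDefault(v, Set.of()));
    }
}
```

A structure like this maps directly onto two database tables (vertices and edges) or onto nested XML elements, so the storage decision can be made later.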

Collection as a metaphor for real world containers

I find modeling physical containers using collections very intuitive. I override/delegate add methods with added capacity constraints based on physical attributes such as volume of added elements, sort based on physical attributes, locate elements by using maps of position to element and so on.
However, when I read the documentation of collection classes, I get the impression that it's not the intended use, that it's just a mathematical construct and a bounded queue is just meant to be constrained by the number of elements and so forth.
Indeed, I think that unless I'm able to model this collection coherently, I should perhaps not expose this class as a collection but only delegate to it internally. Opinions?
Many structures in software development do not have a physical counterpart. In fact, some structures and algorithms are quite abstract, and do not model objects directly in the physical world. So just because an object does not serve as a suitable model for physical objects in the real world does not necessarily mean it cannot be used effectively to solve problems within a computer program.
Indeed, I think that unless I'm able to model this collection coherently, I should perhaps not expose this class as a collection but only delegate to it internally. Opinions?
Firstly, you don't want to get too hung up with the modeling side of software engineering. UML style models (usually) serve primarily as a way of organizing and expressing the developer's high level ideas about how an application should be implemented. There is no need to have a strict one-to-one relationship between the classes in the model and the implementation classes in the application code.
Second, you don't want to get too hung up about modeling "real world" (i.e. physical) objects and their behavior. Most of the "objects" that are used in a typical application have no real connection with the real world. For example, a "folder" or "directory" is really little more than an analogy of the physical objects with the same names. There's typically no need for the computer concept to be constrained by the physical behavior of the real world objects.
Finally, there are a number of software engineering reasons why it is a bad idea to have your Java domain classes extend the standard collection types. For example:
The collections have a generic behavior that it is typically not appropriate to expose in a domain object. For instance, you typically don't want components of a domain object to be added and removed willy-nilly.
By extending a collection type, you are implicitly giving permission for some part of your application to treat domain objects as just lists or sets or whatever.
By extending collection classes, you would be hard-wiring implementation details into your domain APIs. For example, you would need to decide between extending ArrayList or LinkedList, and changing your mind would result (at least) in a binary API incompatibility ... and possibly worse.
Not entirely sure that I've understood you correctly. I gather that you want to know if you should expose the collection (subclassing) or wrap it (have a private field).
As Robert says, it really depends on the case. It's pretty much your choice. Nonetheless I'd say that in many cases the better choice is to not expose the collection because the constraints define the object you are modelling and are not fully congruent with the underlying collection. In short: users of your object shouldn't need to know that they are dealing with a collection unless it is really a collection with some speciality e.g. has all properties of a collection but allows only a certain number of objects.
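Wrapping rather than extending could look like the sketch below. Item and Crate are hypothetical domain types invented for illustration; the capacity rule (total volume must not exceed a limit) is an assumption based on the question's description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical domain types for illustration.
record Item(String name, double volume) {}

class Crate {
    private final double capacity;
    // The list is an implementation detail; callers never see it directly.
    private final List<Item> items = new ArrayList<>();
    private double used;

    Crate(double capacity) {
        this.capacity = capacity;
    }

    // Rejects the item if it would exceed the crate's volume.
    boolean add(Item item) {
        if (used + item.volume() > capacity) {
            return false;
        }
        items.add(item);
        used += item.volume();
        return true;
    }

    // Read-only view, so the constraint cannot be bypassed.
    List<Item> contents() {
        return Collections.unmodifiableList(items);
    }

    double remainingVolume() {
        return capacity - used;
    }
}
```

Since the backing list is private, it can later be swapped for another collection without changing the Crate API.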

Why doesn't the Java Collections API include a Graph implementation?

I’m currently learning the Java Collections API and feel I have a good understanding of the basics, but I’ve never understood why this standard API doesn’t include a Graph implementation. The three base classes are easily understandable (List, Set, and Map) and all their implementations in the API are mostly straightforward and consistent.
Considering how often graphs come up as a potential way to model a given problem, this just doesn’t make sense to me (it’s possible it does exist in the API and I’m not looking in the right place of course). Steve Yegge suggests in one of his blog posts that a programmer should consider graphs first when attacking a problem, and if the problem domain doesn’t fit naturally into this data structure, only then consider the alternative structures.
My first guess is that there is no universal way to represent graphs, or that their interfaces may not be generic enough for an API implementation to be useful? But if you strip down a graph to its basic components (vertices and a set of edges that connect some or all of the vertices) and consider the ways that graphs are commonly constructed (methods like addVertex(v) and insertEdge(v1, v2)) it seems that a generic Graph implementation would be possible and useful.
Thanks for helping me understand this better.
Note that some special graphs are included in the Collection Framework, notably linked lists and trees.
This also points to a possible reason why no general Graph implementation is present: as graphs can have so many different forms and flavours with wildly different characteristics, a general Graph might not turn out to be very useful.
Also, at least in my practice so far, I haven't felt the need for graphs most of the time. Some domains surely do need them, but many simply don't. (Out of more than a dozen projects in various domains I have been involved in so far, I can count two which actually needed graphs.) So I guess there was no really big pressure from the Java community in general to have a Graph in the Collection Framework. It contains only the basic stuff, which is needed "almost always", by "almost everyone". And one of its strengths is indeed its (relative) simplicity and clarity, which, I believe, its designers see as an asset to be preserved.

Data Model Evolution

When writing code I am seeing requirements to change data models (e.g. adding/changing/removing data members from a class). When these data models belong to an interface, it seems difficult to change without breaking the existing client codes. So I am wondering if there is any best practice of designing interfaces/data models in a way to minimize the impact during evolution.
The closest thing I can find from google is data contract versioning. But that seems to be a .net specific topic. I am wondering if the same practice applies to the Java world, or there is a different or generic way to deal with data model evolution.
Thanks
There are some tools which can help, have a look at LiquiBase.
This article on developerWorks gives a good overview.
There are no easy answers to this in either the Java or data modeling domains.
Some changes are upwards compatible; e.g. addition of new methods, optional fields, subclasses and so on.
Some changes are not compatible, but can be handled using a simple transformation; e.g. addition of a mandatory field could be supported by a transformation that adds an extra constructor argument.
Some changes unavoidably require major programmer intervention.
Another point to note is that the problem gets a lot harder when the data corresponding to the data models is persistent, and cannot be thrown away when the data model changes. This is referred to as the "schema evolution" problem, and I believe that it has been proven that there is no general solution.
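One way to soften the "new mandatory field" case for existing callers is to keep the old constructor and have it delegate with a default. This is a sketch under assumptions: Price is a hypothetical class, and "EUR" as the default currency is invented for illustration; whether a sensible default exists depends on the domain.

```java
import java.util.Objects;

// Hypothetical data model, version 2: "currency" was added as a mandatory field.
final class Price {
    private final long amountInCents;
    private final String currency;

    // New (v2) constructor with the extra mandatory argument.
    Price(long amountInCents, String currency) {
        this.amountInCents = amountInCents;
        this.currency = Objects.requireNonNull(currency, "currency");
    }

    // Old (v1) constructor kept so existing client code still compiles.
    // The default value is an assumption made for this sketch.
    Price(long amountInCents) {
        this(amountInCents, "EUR");
    }

    long amountInCents() {
        return amountInCents;
    }

    String currency() {
        return currency;
    }
}
```

This keeps the in-memory API source compatible, but it does nothing for persisted data, which still needs a migration step.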
