I'm currently struggling to wrap my head around some concurrency concepts in general. Suppose we have a REST API with several endpoints for updating and creating entities in our database. Let's assume that we receive 100 concurrent requests for a certain update. How do we guarantee that our data consistency is retained? If working with Java, I guess some options would be:
Using lock mechanisms
Using synchronization on methods in our service layer
However, surely this would have a huge impact on the scalability of our application? But I currently can't see any other way of ensuring that we don't encounter race conditions when interacting with our database. (Also, I don't think there is much point in adding synchronization to every method we write in our service?)
So, I guess the question is: how can we reliably secure our application from race conditions with concurrent requests, while at the same time retaining scalability?
I realize this is a very open-ended and conceptual question, but please if you can point me in the right direction of what area / topic I should dive into for learning, I would be grateful.
Thanks!
You have a good understanding of the problem.
You have to decide between eventual consistency and strong consistency. Strong consistency will limit scaling to a certain extent, but you also really need to sit down and be realistic/honest about your scaling needs (or your consistency needs).
It's also possible to limit the scope of consistency: for example, individual rows in a database could be consistent, or you could be consistent geographically within a region or a continent. Different queries can also have different requirements.
Creating efficient and strongly consistent databases is a whole field of research, and all the big tech giants have people working on it; there are too many solutions/technologies to list. Just googling something like "strong consistency scaling" will get you a ton of results you can read.
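One concrete mechanism worth reading about alongside that (it is not mentioned above, and the entity and field names here are made up) is optimistic locking, e.g. via JPA's @Version: concurrent updates don't block each other, and the transaction that lost the race fails instead of silently overwriting data.

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Account {

    @Id
    private Long id;

    private long balance;

    // The JPA provider increments this column on every update; a writer that
    // read an older version gets an OptimisticLockException at flush/commit
    // time instead of silently overwriting another request's change.
    @Version
    private long version;

    public void deposit(long amount) {
        this.balance += amount;
    }
}

The application then retries or reports the conflict, which keeps the service methods free of synchronized blocks.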
I have this concept of rewriting a game engine as a scalable collection of microservices.
It's currently a proof of concept, but the main principle is that each player's session/connection is held and managed by a single container, so containers will scale up and down based on the number of connected users.
Each player container will speak to multiple other microservices to gather data and perform actions; these services will be static replicas of 2 or 3.
There is one microservice I have in mind which I feel is a bit of a bottleneck, and which I'm currently looking for ways to make more 'scalable' and 'robust'.
The microservice in question is the GameMap service. There will be multiple GameMap services (at least one service for each unique or instanced game map). Each map will contain N cells, and each cell can contain objects with different types/states (e.g. other PlayerObjects, ItemObjects).
I would like to have at least 2 replicas for each GameMap so I can instantly flip to the other if one were, for some reason, to fail and shut down. It is important for users to have a seamless transition between the failing and failover GameMap. To achieve that I need to have consistent/up-to-date state shared between them.
Being able to load balance traffic between the two replicas is a nice-to-have but not essential.
So far the one potential solution I have come up with is Hazelcast. This would allow me to keep the state of each map cell in a scalable in-memory data grid (again for robustness and scalability).
I expect that there may be up to hundreds of state changes within a game map every second, and my concern is that it may be too slow and cause huge latency between users.
Has anyone got any hints, suggestions or feedback on either the scenario in general or, more importantly, the use case of Hazelcast here?
P.S. I can upload my very crude connectivity/architecture diagram of my game engine as microservices at some point if it helps or if anyone is interested.
It really depends on your requirements, environment etc.
Especially if you want to be HA, you probably want to replicate to different availability zones or potentially different regions, and you will be bound by the speed of light (or need to accept that there is a chance of data loss). In other words, the performance is mostly determined by the infrastructure.
But just to give you some ballpark numbers: for a simple read on c5.9xlarge instances on EC2, between machines in the same low-latency placement group, you are looking at 100-200 µs. And running hundreds of thousands of gets per second per instance is normally not an issue.
In other words, it is very difficult to say whether this is the right approach. Depending on your situation and how important this is, I would take a single slice of your whole system and make some benchmarks to get an impression of how well it performs and how well it scales.
But my alarm bells go off when I see the combination of 'microservices' with 'real time' and 'game engine'.
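To make the single-slice benchmark idea concrete, here is a minimal sketch that measures raw put throughput against a Hazelcast IMap holding cell state (the map name and key scheme are made up, and the IMap import location shown is the one used in Hazelcast 4.x):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class CellStateBenchmark {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        // Hypothetical map holding one entry per game-map cell.
        IMap<String, String> cells = hz.getMap("gamemap-1-cells");

        int writes = 100_000;
        long start = System.nanoTime();
        for (int i = 0; i < writes; i++) {
            cells.put("cell-" + (i % 1_000), "state-" + i);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(writes + " puts took " + elapsedMs + " ms");

        hz.shutdown();
    }
}

Run this against replicas placed in the actual target availability zones, since the network hop will dominate the numbers far more than Hazelcast itself.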
Recently I have seen a lot of code in a few projects using streams for filtering objects, like:
library.stream()
       .map(book -> book.getAuthor())
       .filter(author -> author.getAge() >= 50)
       .map(Author::getSurname)
       .map(String::toUpperCase)
       .distinct()
       .limit(15)
       .collect(toList());
Are there any advantages to using that instead of a direct HQL/SQL query to the database that returns the already-filtered results?
Isn't the second approach much faster?
If the data originally comes from a DB it is better to do the filtering in the DB rather than fetching everything and filtering locally.
First, database management systems are good at filtering: it is part of their main job, and they are therefore optimized for it. The filtering can also be sped up by using indexes.
Second, fetching and transmitting many records and unmarshalling the data into objects, just to throw a lot of them away when filtering locally, is a waste of bandwidth and computing resources.
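For comparison, a rough JPQL equivalent of the stream pipeline from the question that pushes the filtering, projection and limit into the database (this assumes Book has an author association and Author has surname and age fields, which the question does not show, and that entityManager is an injected javax.persistence.EntityManager):

List<String> surnames = entityManager.createQuery(
        "select distinct upper(a.surname) " +
        "from Book b join b.author a " +
        "where a.age >= 50", String.class)
    .setMaxResults(15)   // corresponds to limit(15)
    .getResultList();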
At first glance: streams can be made to run in parallel just by changing the code to use parallelStream(). (Disclaimer: of course it depends on the specific context whether just changing the stream type will produce correct results; but yes, it can be that easy.)
Then: streams "invite" the use of lambda expressions. And those in turn lead to the use of invokedynamic bytecode instructions, sometimes gaining performance advantages compared to the "old-school" way of writing such code. (And to clarify a common misunderstanding: invokedynamic is a property of lambdas, not streams!)
These would be reasons to prefer "stream" solutions nowadays (from a general point of view).
Beyond that: it really depends ... let's have a look at your example input. This looks like it is dealing with ordinary Java POJOs that already reside in memory, within some sort of collection. Processing such objects in memory directly would definitely be faster than going to some off-process database to do the work there!
But, of course: if the above calls, like book.getAuthor(), were doing a "deep dive" and actually talking to an underlying database, then chances are that doing the whole thing in a single query gives you better performance.
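For the purely in-memory case, the parallel variant really is a one-word change; whether it is actually faster (and still correct) depends on the operations being stateless and the collection being big enough to outweigh the forking overhead:

library.parallelStream()
       .map(book -> book.getAuthor())
       .filter(author -> author.getAge() >= 50)
       .map(Author::getSurname)
       .map(String::toUpperCase)
       .distinct()
       .limit(15)
       .collect(toList());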
The first thing to realize is that you can't tell from just this code what statement is issued against the database. It might very well be that all the filtering, limiting and mapping is collected, and upon the invocation of collect all that information is used to construct a matching SQL statement (or whatever query language is used) and sent to the database.
With this in mind there are many reasons why stream-like APIs are used.
It is hip. Streams and lambdas are still rather new to most Java developers, so they feel cool when they use them.
If something like in the first paragraph is used, it actually creates a nice DSL to construct your query statements. Scala's Slick and .NET's LINQ were early examples I know about, although I assume somebody built something like it in LISP long before I was born.
The streams might be reactive streams and encapsulate a non-blocking API. These APIs are really nice because they don't force you to block resources like threads while you are waiting for results, but using them requires either tons of callbacks or a much nicer stream-based API to process the results.
They are nicer to read than imperative code. Maybe the processing done in the stream can't [easily/by the author] be done with SQL, so the alternatives aren't SQL vs. Java (or whatever language you are using) but imperative Java vs. "functional" Java. The latter often reads nicer.
So there are good reasons to use such an API.
With all that said: It is, in almost all cases, a bad idea to do any sorting/filtering and the like in your application, when you can offload it to the database. The only exception I can currently think of is when you can skip the whole roundtrip to the database, because you already have the result locally (e.g. in a cache).
Well, your question should ideally be: is it better to do reduction/filtering operations in the DB, or to fetch all records and do it in Java using streams?
The answer isn't straightforward and any stats that give a "concrete" answer will not generalize to all cases.
The operations you are talking about are better done in the DB itself, because that is what DBs are designed for: very fast handling of data. Of course, in the case of relational databases there will usually be some bookkeeping and locking going on to ensure that independent transactions don't end up making the data inconsistent, but even with that, DBs do a pretty good job of filtering data, especially large data sets.
One case where I would prefer filtering data in Java code rather than in the DB is when you need to extract different features from the same data. For example, right now you are getting only the authors' surnames. If you also wanted all books written by each author, the authors' ages, their children, their places of birth, etc., then it makes sense to get only one "read-only" copy from the DB and use parallel streams to derive the different pieces of information from the same data set.
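A sketch of that idea (the authorRepository and the getBooks() accessor on Author are hypothetical and not part of the question, and the usual java.util.stream.Collectors import is assumed): fetch the authors once, then derive several views from the same in-memory list.

List<Author> authors = authorRepository.findAll();   // hypothetical single read

List<String> surnames = authors.parallelStream()
        .map(Author::getSurname)
        .collect(Collectors.toList());

Map<String, Integer> ageBySurname = authors.parallelStream()
        .collect(Collectors.toMap(Author::getSurname, Author::getAge, (first, second) -> first));

long prolificAuthors = authors.parallelStream()
        .filter(author -> author.getBooks().size() > 10)
        .count();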
Unless measured and proven for a specific scenario, either could be good or equally bad. The reason you usually want to push these kinds of queries to the database is because (among other things):
the DB can handle much larger data sets than your Java process
Queries in a database can be indexed (making them much faster)
On the other hand, if your data is small, using a Stream the way you did is effective. Writing such a Stream pipeline is very readable (once you speak Streams well enough).
Hibernate and other ORMs are usually far more useful for writing entities than for reading them, because they allow developers to offload the ordering of specific writes to a framework that will almost never "get that wrong".
For reading and reporting, on the other hand (and considering we are talking about a DB here), an SQL query is likely to be better, because there will not be any framework in between, and you will be able to tune query performance in terms of the database that will be executing the query rather than in terms of your framework of choice, which gives you more flexibility in how that tuning can be done.
I'm working on a Genetic Programming project that tries to generate GPs that represent an image. My approach is to split the image into independent sections and have separate threads do the evolution jobs on them.
Since things are going to be asynchronous, naturally you'd want objects to be independent as well. The problem is that I noticed that certain objects in JGAP are actually shared variables, so they are going to be shared between threads, and that would cause a lot of issues. For example, I noticed that all Variables with the same name are the same, which means that if I wanted to evaluate more than one IGPProgram at the same time I'd have to lock the variable, which could really hamper performance.
I also noticed that if you try to create more than one GPConfiguration, the program complains that you have to reset it first. So it seems to me that all GPConfigurations are shared (i.e. you can't have multiple threads create multiple configurations at the same time), which is a problem because creating GPProblems can take a lot of time, and I'm creating a lot of GPProblems, so I was hoping to reduce the time taken by splitting the work across multiple threads.
Are there any "gotchas" that I would need to know about when working with JGAP and threads? Unfortunately, multithreading isn't touched upon too much in the JGAP documentation and I was hoping I'd get some advice from people who might have experience with JGAP.
According to the FAQ, JGAP "does support multi-threaded computation". However, this doesn't mean that the entire API/object graph is fully thread safe. Do you have a code sample that demonstrates the problem you are having? I don't think you're going to get a canonical answer without refining your question a bit.
There is a threaded example in the JGAP distribution zip under examples/src/examples/simpleBooleanThreaded.
If you want a variable that is not shared across threads, and you only want to make minor changes to let the code support multithreading, you can use ThreadLocal.
When and how should I use a ThreadLocal variable?
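A minimal sketch of that approach (the buffer here is just a stand-in for whatever per-thread state you need; JGAP's own classes may still impose their own sharing, as noted above):

// Each thread sees its own independent instance, so concurrent fitness
// evaluations don't step on each other's state and no locking is needed.
private static final ThreadLocal<StringBuilder> SCRATCH =
        ThreadLocal.withInitial(StringBuilder::new);

void evaluate() {
    StringBuilder buffer = SCRATCH.get();   // this thread's copy
    buffer.setLength(0);                    // reset before reuse
    // ... build up per-thread state here without synchronization ...
}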
We are trying to cache the results of database selects (in a hash map) so we don't have to execute them multiple times. Whenever the database changes, we refresh these lists so the application picks up the changes.
Now we have a large number of lists to fetch, so it is taking too much time to load the pick lists from the database.
So I have some questions regarding this issue:
How can I find out how much memory a list is using? (I have used the approach of running the garbage collector and measuring the difference in free memory, but there are many lists, so it takes too much time.)
How can I optimize the list refresh?
Thanks for the help.
How can I find out how much memory the list is using?
JProfiler
VisualVM
How can I optimize the list refresh?
Make sure you're using the correct collection type for your data.
Have a look here.
Also have a look at the Guava collections.
One last thing: ignis is quite right to advise you not to use System.gc(); this might be the very reason you're having performance problems. This is why.
First, while not wanting to generalize when it comes to performance problems, the issues you're seeing are unlikely to be purely down to memory use, though if the lists are large this could come into play when they're refreshed and a large number of objects become eligible for collection.
To solve issues relating to garbage collection there are a few rules of thumb, but it always comes down to breaking out a profiler and tuning the garbage collector - there's more on that here.
But before that, any loading from a database is going to involve iteration over a result set, so the biggest optimization you can make will be to reduce the size of the result sets. There are a couple of ways to do that:
if you're using a map, try to use keys that don't require loading, and do the load only when you get a miss (see the sketch after this list).
once loaded, only refresh the rows that have changed since you last loaded the data, though this obviously doesn't solve the start-up problem.
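A minimal sketch of the load-on-miss idea from the first bullet (PickListDao and PickListItem are hypothetical placeholders for your own data-access code):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PickListCache {

    private final Map<String, List<PickListItem>> cache = new ConcurrentHashMap<>();
    private final PickListDao dao;   // hypothetical data-access object

    public PickListCache(PickListDao dao) {
        this.dao = dao;
    }

    public List<PickListItem> get(String listName) {
        // Hits the database only on a miss; later calls are served from memory.
        return cache.computeIfAbsent(listName, dao::loadPickList);
    }

    public void refresh(String listName) {
        // Refresh a single list instead of reloading everything.
        cache.put(listName, dao.loadPickList(listName));
    }
}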
Now all that said, I would not recommend you write your own caching code in the first place. The reasons I say this are:
all modern RDBMSs cache, so provided your queries are performant, getting the actual result set should not be a bottleneck.
Hibernate provides not only ORM but a robust and well understood caching solution.
if you really need to cache massive datasets, use Coherence or similar - the cache can be started in a separate JVM and your application doesn't need to take the load hit.
You have two problems here: discovering how much memory is in use, and managing a cache. I'm not sure that the two are really closely related, although they may be.
Discovering how much memory an object uses isn't extremely difficult: one excellent article you can use for reference is "Sizeof for Java" from JavaWorld. It escapes the whole garbage collection fiasco, which has a ton of holes in it (it's slow, it doesn't count the object but the heap - meaning that other objects factor into your results that you may not want, etc.)
Managing the time to initialize the cache is another problem. I work for a company that has a data grid as a product, and thus I'm biased; be aware.
One option is not using a cache at all, but using a data grid. I work for GigaSpaces Technologies, and I feel ours is the best; we can load data from a database on startup, and hold your data there as a distributed, transactional data store in memory (so your greatest cost is network access.) We have a community edition as well as full-featured platforms, depending on your need and budget. (The community edition is free.) We support various protocols, including JDBC, JPA, JMS, Memcached, a Map API (similar to JCache), and a native API.
Other similar options include Coherence, which is itself a data grid, and Terracotta DSO, which can distribute an object graph on a JVM heap.
You can also look at the cache projects themselves: Two include Ehcache and OSCache. (Again: bias. I was one of the people who started OpenSymphony, so I've a soft spot for OSCache.) In your case, what would happen is not a preload of cache - note that I don't know your application, so I'm guessing and might be wrong - but a cache on demand. When you acquire data, you'd check the cache for data first and fetch from the DB only if the data is not in cache, and load the cache on read.
Of course, you can also look at memcached, although I obviously prefer my employer's offering here.
Be aware that invoking System.gc() or Runtime.getRuntime().gc() is a bad idea unless you really need to do that. You should leave to the VM the task of deciding when to free objects, unless after profiling you have found that it's the only way to make the application go faster on your client's VM.
I tend to use YourKit for this sort of thing. It costs money but IMO is worth every penny (no connection other than as a customer).
It has been suggested that, in order to improve the performance of our system, lazy loading should be used across the board - that is, to change the @OneToOne mapping with the "mappedBy" property to a @OneToMany mapping. This is to stop the loading of unwanted data from the database, which leads to slowness in the application.
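For reference, a minimal sketch of the kind of mapping being described (the entity names are made up; note that a @OneToMany association is lazy by default in JPA, whereas @OneToOne defaults to eager):

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.OneToMany;

@Entity
public class Department {

    @Id
    private Long id;

    // The collection is only loaded, via an extra select, when it is
    // first accessed - and only while the session is still open.
    @OneToMany(mappedBy = "department", fetch = FetchType.LAZY)
    private List<Employee> employees;
}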
We run a multi-tier system (basically 2-tier). We have the front end, using JSF, and the back end, which contains the business and database access layers. Front and back communicate via EJB, but the EJBs have no real logic in them. Other technologies used: Spring and Hibernate.
Now, after some reading on the topic, it seems that lazy loading is not a silver bullet, in that it needs to be applied correctly. For each lazy load, a select statement will be issued to fetch the data. There is also the issue that if the front end accesses a property that is to be lazily loaded and the session/connection has been closed on the back end, then we will get a null.
Is the above a correct concern?
So, what is the best approach/practice to go about in implementing a lazy loading solution or performance improvement? The hope is not to redo the data model if at all possible.
My initial thought was to work with the DBA group to get an idea of what is going on between the two systems - how the queries look, how we are using the data, etc. - identify trouble spots, and examine the Hibernate objects/queries to see how best to improve them. Also to look at the front end to determine what data is passed from the back to the front to be displayed, and how.
Good approach/other approaches?
The very first thing you should do is measure your application and find out what exactly is causing your performance issues.
Use a tool like JProfiler to find out where the issues are.
Once you know what's going on you can then decide how you're going to fix it.
Just going straight to implementing a lazy loading scheme without knowing what's causing your performance issues will be a waste of your time.
If you discover that the DB layer is where your issue is, then you can get the DBAs involved to see if your schema/queries can be improved before doing anything more radical.
It is true that if your fetching of data is slowing your load times down, then lazy loading is a great solution. But applying it across the board seems to be a premature optimization. You would probably want to implement it for each set of data and then test to see if it speeds up the app. If it does not, then it is not a good candidate for lazy loading.
The way I have implemented lazy loading caused no change in the data tier. It was all done in the business logic and presentation controllers. As I am not familiar with EJB, I am going to assume that this will work for your Java app. Anyway, when I implement lazy loading, I load no data (at least none of the data I am going to load lazily) until it is needed. Then I call the data tier and get my data (or a subset of the data).
As for the connection concern, you will need to put checks in place to test whether the data connection is closed - that is, if you are pooling the data connections - and if it is closed, reopen it. But as with the actual lazy loading implementation, this should be done in your logic classes and not in the front end, so you don't have to duplicate this functionality many times.
Lazy loading is great for delaying expensive operations, but I agree with your conclusion that it's not a silver bullet to solve all performance issues.
Go with your gut instinct and do some analysis first to see where the real performance problems are in the application. It may turn out that lazy loading some data will be a good solution, or it could be something completely different. There's really no way to tell until you start profiling and analyzing what is going on inside the app.
It depends so much on what you are trying to do that it is difficult to say. Your idea of going to your DBA could definitely be productive. The only thing people might be able to do is provide some examples. In my case:
Lazy loading was a huge performance improvement for us in the following scenario. We had a huge tree with 15,000 nodes on different levels. Originally we tried to load the whole tree and then render it. It took ages. We changed the code to load the branches lazily, only when the user expanded the nodes. It takes slightly longer on node expansion, but the application overall feels faster. In this case it makes sense to change the mapping.
Now, if you need to process the whole tree anyway, because it is required by your business logic, lazy loading is not going to make much of a difference. In this case just changing the mapping is really no solution, and it might even be more expensive.
You need to think a little bit on what your application does, on where does it feel slow, on what are you trying to accomplish. Pulling out a profiler is also no silver bullet.
Without specific examples from your application it makes no sense to say if lazy loading is useful or not.