How can I synchronize access to a common resource in Java when my application is deployed on multiple instances behind a load balancer?
As far as I know, synchronization works only within one JVM. But when we deploy the same Java application on multiple instances to handle the load, how can we provide a synchronization mechanism?
For example: there is an HDFS file whose contents the Java application appends to or edits. When I deploy my application on multiple instances, how can I make sure that only one request accesses that HDFS file at a time?
Short answer - you can't do it without introducing a lot of complexity into your setup.
Technically you can use something like a distributed lock on ZooKeeper, but I wouldn't really recommend it. Distributed locks are hard to reason about at scale, and there is also the additional complexity of operating ZooKeeper itself.
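If you do go that route anyway, here is a rough sketch using Apache Curator's InterProcessMutex recipe (Curator itself, the connection string and the lock path are my assumptions; the answer only mentions ZooKeeper):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkLockSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every instance agrees on the same lock path.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/hdfs-file");
        lock.acquire();
        try {
            // append to / edit the HDFS file here; only one instance gets this far at a time
        } finally {
            lock.release();
        }
        client.close();
    }
}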
Regarding the example you posted, isn't that why systems like HBase were built? Model your data in a Key -> [multiple columns] format; you can then read and write data through HBase, and it will do the heavy lifting of editing and managing the underlying files for you behind the scenes.
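If you do move the data into HBase, a minimal write along those lines might look like this (the table, column family and qualifier names are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("file_edits"))) {
            // Key -> [multiple columns]: one row per logical record.
            Put put = new Put(Bytes.toBytes("record-42"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("content"),
                          Bytes.toBytes("new value"));
            table.put(put);   // HBase manages the underlying HDFS files for you
        }
    }
}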
On the other hand, if you can model the change you want to make to the file as an event, you can build your system on the principles of Event-Driven Architecture (a rough sketch follows the links below).
You can read more about this here:
Introduction video on EDA from Martin Fowler - https://www.youtube.com/watch?v=STKCRSUsyP0
Part 1 - https://www.confluent.io/blog/build-services-backbone-events/
Part 2 - https://www.confluent.io/blog/apache-kafka-for-service-architectures/
The CQRS model that Martin Fowler writes about - https://martinfowler.com/bliki/CQRS.html
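As a rough sketch of that event-driven idea, each instance could publish its intended change as an event and let a single consumer apply the writes (Kafka fits the links above, but the topic name and payload here are illustrative assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FileEditEventSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-host:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by file means all edits for one file land on one partition,
            // so a single consumer can apply them in order.
            producer.send(new ProducerRecord<>("file-edits", "file-123", "append:..."));
        }
    }
}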
I can recommend the distributed lock mechanism provided by Redis.
It works like a standard lock or mutex, but in the context of a distributed system. The application instance that acquires the lock blocks access to the resource while it makes its changes, then releases the lock to allow other instances to access the resource.
We already use this solution in production to protect access to critical resources that do not provide synchronization and consistency natively.
Here is the link for the Redis distributed lock:
Distributed locks with Redis
I believe there are other solutions providing the same feature. Redis is very lightweight, scalable and easy to integrate.
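For a Java application, a minimal sketch with the Redisson client (one of the Java clients for this pattern; my assumption, since the answer doesn't name a specific library) could look like this:

import org.redisson.Redisson;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedisLockSketch {
    public static void main(String[] args) {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://redis-host:6379");
        RedissonClient redisson = Redisson.create(config);

        RLock lock = redisson.getLock("shared-resource-lock");
        lock.lock();          // blocks until no other instance holds the lock
        try {
            // access the shared resource here
        } finally {
            lock.unlock();    // let the other instances in
        }
        redisson.shutdown();
    }
}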
I'm trying to accomplish something that is conceptually very simple to understand. I want to synchronize a block of Java code between different machines. There are two instances of a program running on different machines that cannot run at the same time.
I've heard of ZooKeeper, JGroups and Akka too, but while reading the documentation they seemed a bit overkill for what I'm trying to do. Does anyone have any idea if there's anything more straightforward to use?
Thanks in advance,
Rui
I think Hazelcast's Distributed Lock (http://docs.hazelcast.org/docs/3.6/manual/html-single/index.html#lock) may be helpful. Hazelcast is relatively lightweight, so it should hopefully not be overkill.
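A minimal sketch against the 3.x API that the answer links (the lock name is made up; each machine starts or joins a cluster member):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ILock;

public class HazelcastLockSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        ILock lock = hz.getLock("critical-block");
        lock.lock();          // only one member at a time gets past this line
        try {
            // the code block that must not run on two machines at once
        } finally {
            lock.unlock();
        }
        hz.shutdown();
    }
}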
If all the technologies you mentioned (also take a look at Terracotta) are too sophisticated for your needs, maybe simple database locking?
A SELECT ... FOR UPDATE statement will lock the given database record, causing other clients that run the same query to block. Simple, yet safe and reliable.
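For illustration, a hedged JDBC sketch (the locks table, its row, and the connection details are assumptions, and the exact FOR UPDATE syntax varies slightly between databases):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DbLockSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-host/app", "user", "secret")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id FROM locks WHERE name = ? FOR UPDATE")) {
                ps.setString(1, "my-critical-section");
                try (ResultSet rs = ps.executeQuery()) {
                    // The row is now locked; other instances running the same
                    // query block here until we commit or roll back.
                    // ... run the code that must not execute concurrently ...
                }
            }
            conn.commit();    // releases the row lock
        }
    }
}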
A very basic solution would be to use RMI.
Designate one machine as the master and give it a method that uses a mutex so that only one caller can pass at a time.
Call this special method via RMI from all the other (slave) instances before running your special Java code block.
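A rough sketch of that idea (the class names, registry port and lock name are made up; a fair Semaphore stands in for the mutex so that callers queue up):

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.concurrent.Semaphore;

// Remote interface shared by the master and the slaves.
interface RemoteLock extends Remote {
    void acquire() throws RemoteException, InterruptedException;
    void release() throws RemoteException;
}

// Runs on the machine chosen as master.
class RemoteLockServer implements RemoteLock {
    private static RemoteLockServer server;   // strong refs keep the RMI objects alive
    private static Registry registry;

    private final Semaphore permit = new Semaphore(1, true);

    public void acquire() throws InterruptedException { permit.acquire(); }
    public void release() { permit.release(); }

    public static void main(String[] args) throws Exception {
        server = new RemoteLockServer();
        RemoteLock stub = (RemoteLock) UnicastRemoteObject.exportObject(server, 0);
        registry = LocateRegistry.createRegistry(1099);
        registry.rebind("lock", stub);
    }
}

// On every slave, wrap the special code block like this.
class SlaveSketch {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry("master-host", 1099);
        RemoteLock lock = (RemoteLock) registry.lookup("lock");
        lock.acquire();       // the RMI call blocks until the permit is free
        try {
            // the special Java code block
        } finally {
            lock.release();
        }
    }
}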
I was reading about how Clojure is 'cool' because of its syntax, and because it runs on the JVM and is therefore multithreaded, etc.
Are languages like Ruby and Python single-threaded in nature, then (when running as a web app)?
What are the underlying differences between Python/Ruby and, say, Java running on Tomcat?
Doesn't the web server have a pool of threads to work with in all cases?
Both Python and Ruby have full support for multi-threading. There are some implementations (e.g. CPython, MRI, YARV) which cannot actually run threads in parallel, but that's a limitation of those specific implementations, not the language. This is similar to Java, where there are also some implementations which cannot run threads in parallel, but that doesn't mean that Java is single-threaded.
Note that in both cases there are lots of implementations which can run threads in parallel: PyPy, IronPython, Jython, IronRuby and JRuby are only a few examples.
The main difference between Clojure on the one side and Python, Ruby, Java, C#, C++, C, PHP and pretty much every other mainstream and not-so-mainstream language on the other side is that Clojure has a sane concurrency model. All the other languages use threads, which we have known to be a bad concurrency model for at least 40 years. Clojure OTOH has a sane update model which allows it to not only present one but actually multiple sane concurrency models to the programmer: atomic updates, software transactional memory, asynchronous agents, concurrency-aware thread-local global variables, futures, promises, dataflow concurrency and in the future possibly even more.
A confused question with a lot of confused answers...
First, threading and parallel execution are different things. Python supports threads just fine; what its main real-world implementation doesn't support is running them in parallel. (In CPython, only one VM thread can execute at a time; the many attempts to decouple VM threads have all failed.)
Second, this is irrelevant for web apps. You don't need Python backends to execute concurrently in the same process. You spawn separate processes for each backend, which can then each handle requests in parallel because they're not tied together at all.
Using threads for web backends is a bad idea. Why introduce the perils of threading (locking, race conditions, deadlocks) into something inherently embarrassingly parallel? It's much safer to tuck each backend away in its own isolated process and avoid the potential for all of these problems.
(There are advantages to sharing memory space: it saves memory by sharing static code. But that can be achieved without threads.)
CPython has a Global Interpreter Lock (GIL), which can reduce the performance of multi-threaded code in Python. The net effect, in some cases, is that threads can't actually run simultaneously because of lock contention. Not all Python implementations use a GIL, so this may not apply to Jython, IronPython or other implementations.
The language itself does support threading and other asynchronous operations. Python libraries can also use threading internally without exposing it directly to the Python interpreter.
If you've heard anything negative about Python and threading (or heard that it doesn't support it), it was probably someone encountering a situation where the GIL was causing a bottleneck.
Certainly the web server will have a pool of threads, but that pool is outside the control of your program. Those threads are used to handle HTTP requests: each HTTP request is handled in a separate thread, and the thread is released back to the pool when the associated HTTP response is finished. If the web server didn't have such a pool, it would be extremely slow at serving requests.
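To make the thread-per-request idea concrete, here is a toy sketch of such a pool (this is not how a real servlet container is implemented; the port and pool size are arbitrary):

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadPoolServerSketch {
    public static void main(String[] args) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(50);
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();
                // Each connection is handled on a pool thread; the thread goes
                // back to the pool once the response has been written.
                pool.execute(() -> handle(client));
            }
        }
    }

    private static void handle(Socket client) {
        try (Socket c = client) {
            c.getOutputStream().write(
                "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok".getBytes());
        } catch (IOException ignored) {
        }
    }
}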
Whether a programming language is single-threaded or multithreaded depends on whether it is possible to programmatically spawn new threads in that language. If that isn't possible, the language is single-threaded, PHP for example. As far as I can see, both Ruby and Python support multithreading.
The short answer is yes, they are single threaded.
The long answer is it depends.
JRuby is multithreaded and can be run in Tomcat like other Java code. MRI (the default Ruby) and CPython both have a GIL (Global Interpreter Lock) and are thus effectively single-threaded.
The way this works for web servers is further complicated by the number of available server configurations. For most Ruby applications there are (at least) two levels of servers: a proxy/static-file server like Nginx, and then the Ruby app server.
Nginx does not use threads like Apache or Tomcat; it uses non-blocking events (and, I think, forked worker processes). This allows it to deal with higher levels of concurrency than would be possible with the overhead and scheduling inefficiencies of native threads.
The various Ruby app servers also work in different ways to get high throughput and concurrency without threads. Thin uses libev and the asynchronous evented model, like Nginx. Mongrel uses a round-robin pool of worker processes. Unicorn uses native Unix IPC (select on a socket) to load-balance to a pool of forked processes through one master proxy socket.
Threads are only one way to address concurrency. Multiple processes and evented models are a different approach that ties in well with the Unix heritage. This is fundamentally different from the way Java treats the world.
Python
Let me try to put it more simply than the more detailed answers.
The heart of the answer here doesn't really have to do with Python being single-threaded versus multi-threaded. It has more to do with threading versus multiprocessing.
Saying Python is "single-threaded" doesn't really capture reality, because you can certainly have more than one thread running in a Python process. Just use the threading library, and create more than one thread. There, now you have just proven that Python isn't single-threaded.
But using multiple threads in Python does NOT mean you're using multiple CPU processors concurrently. In fact, the Global Interpreter Lock prevents this. So this is where questions arise.
Basically, threading in Python cannot be used for parallel CPU computation. But you CAN do parallel CPU computation with Python by using multiprocessing instead of multi-threading.
I found this article very helpful when researching this: https://timber.io/blog/multiprocessing-vs-multithreading-in-python-what-you-need-to-know/. It includes real-world examples of when you'd want to use multiprocessing versus multi-threading.
Most languages don't define single or multithreading. Usually, that is left up to the libraries to implement.
That being said, some languages are better at it than others. CPython, for instance, has issues with interpreter locking during multithreading; Jython (Python running on the JVM) does not.
Some of the real power of Clojure (IMO) is that it runs on the JVM. You get multithreading and tons of libraries for free.
A few interpreted programming languages such as CPython and Ruby support threading, but have a limitation that is known as a Global Interpreter Lock (GIL). The GIL is a mutual exclusion lock held by the interpreter that prevents the interpreter from concurrently interpreting the application's code on two or more threads at the same time, which effectively limits the concurrency on multiple core systems.
(from the Wikipedia article on threads)
Keeping this very short:
Python supports multithreading.
Python does NOT support parallel execution of its threads.
Exception: the above may vary for Python implementations that do not use a GIL (Global Interpreter Lock).
An implementation that does not use a GIL is multithreaded and also supports parallel execution.
Ruby
The Ruby interpreter is single-threaded, which is to say that several of its methods are not thread safe.
In the Rails world, this single-threadedness has mostly been pushed out to the server. So you'll see Nginx running with a pool of Mongrel servers, each of which has an interpreter in memory, processes one request at a time, and runs in its own thread.
Passenger, running "Ruby Enterprise", brings improved garbage collection and some thread safety into Rails, and it's nice.
There is still work to be done in Rails in this area, but it's getting there slowly; in general, the idea is to have multiple services and servers.
How to untangle the knots in all those threads...
Clojure did not invent threading; however, it has particularly strong support for it, with software transactional memory, atoms, agents, parallel map operations, and more.
All the others have accumulated threading support over time. Ruby is a special case, as some of its implementations have green threads, a kind of software-emulated thread that does not use all the cores. Ruby 1.9 will put this to rest.
Regarding web servers: no, they do not always work multithreaded. Apache has traditionally run as a flock of daemons, i.e. a pool of separate single-threaded processes. These days there are more options for running Apache servers.
To summarize, all modern languages support threading in one form or another.
Newer languages like Scala and Clojure add specific support for working with multiple threads without explicit locking, as locking has traditionally been the great pitfall of multithreading.
Reading these answers here... a lot of them try to sound smarter than they really are, IMHO (I'm mostly talking about the Ruby-related stuff, as that's what I'm most familiar with).
In fact, JRuby is currently the only Ruby implementation that supports true concurrency: on the JVM, Ruby threads are mapped to native OS threads, with no GIL interfering. For the other implementations it is fair to say that Ruby is not truly multithreaded.
In 1.8.x, Ruby actually runs inside one OS thread, and while green threads give you a feeling of concurrency, in reality the GIL pretty much prevents true concurrency.
In Ruby 1.9 this changed a bit: a Ruby process can now have many OS threads attached to it (in addition to the green threads), but again the GIL defeats the purpose and becomes the bottleneck.
In practice, from a regular web-app standpoint, it should not matter much whether it is single- or multithreaded. The problem mostly arises on the server side anyhow, and it is mostly a matter of differences in scaling technique.
Yes, Ruby and Python can handle multithreading, but in many (web) cases it is better to rely on the threads created for the HTTP requests from the client to the server. Even if you spawn many threads in the same application to lower runtime cost or to handle many tasks at once, in a web application that usually takes too long: nobody will happily wait more than a fraction of a second for a single page to respond. It is wiser to use AJAX (Asynchronous JavaScript And XML) techniques: make sure the page renders quickly, and insert the expensive results asynchronously later.
That does not mean that multithreading is useless for the web! It is highly recommended for lowering the load on your server when you want to run complicated, heavyweight computations (not for the website itself, I mean), but whatever they produce should end up in files or databases, so that it can then be served easily in an HTTP response.
This is a bit related to this question.
I'm using make to extract some information concerning some C programs. I'm wrapping the compilation using a bash script that runs my Java program and then gcc. Basically, I'm doing:
make CC=~/my_script.sh
I would like to use several jobs (the -j option of make). It runs several processes according to the dependency rules.
If I understand correctly, I would have as many instances of the JVM as jobs, right?
The thing is that I'm using sqlite-jdbc to collect some info, so the problem is how to avoid several processes trying to modify the DB at the same time.
It seems that the SQLite lock is JVM-dependent (I mean a lock can be "seen" only inside the locking JVM), and that the same is true for RandomAccessFile.lock().
Do you have any idea how to do that? (Creating a tmp file and then checking whether it exists seems to be one possibility, but may be expensive. A locking table in the DB?)
thanks
java.nio.channels.FileLock allows OS-level cross-process file locking.
However, using make to start a bash script that runs several JVMs in parallel before calling gcc sounds altogether too Rube-Goldbergian and brittle to me.
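A minimal sketch of that approach (the lock-file path is made up; every JVM spawned by make would open the same file before touching the database):

import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CrossProcessLockSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(
                 Paths.get("/tmp/db.lock"),
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {   // blocks until the OS grants the lock
            // update the SQLite database here
        }                                        // the lock is released on close
    }
}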
There are several solutions for this.
If your lock only needs to work within the same machine, you can use a server socket to implement it: the process that manages to bind to the port first owns the lock, and the other processes wait for the port to become available (a sketch follows below).
If you need a lock that spans multiple machines, you can use a memcached lock. This requires a running memcached server. I can paste some code if you are interested in this solution.
You can get a Java library for connecting to memcached here.
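A sketch of the server-socket variant (the port number is arbitrary; all processes on the machine just have to agree on it):

import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class PortLockSketch {
    public static void main(String[] args) throws Exception {
        ServerSocket lock = null;
        while (lock == null) {
            try {
                // Whoever binds the port first holds the "lock".
                lock = new ServerSocket(54321, 1, InetAddress.getByName("127.0.0.1"));
            } catch (BindException alreadyHeld) {
                Thread.sleep(200);   // another process owns it; retry shortly
            }
        }
        try {
            // critical section: only one process on this machine gets here
        } finally {
            lock.close();            // frees the port for the next process
        }
    }
}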
You may try Terracotta for sharing objects between various JVM instances. It may seem too heavyweight a solution for your needs, but it is at least worth considering.
Without having the source code for a Java API, is there any way to know whether the API's methods create multiple threads? Are there any conventions to follow if you are writing Java APIs that create multiple threads? This may be a very fundamental question, but it came out of a discussion in which the crux was: "How do you know which Java APIs create threads and which don't?"
One way of determining which libraries create new threads is to disallow thread creation and ThreadGroup modification in the SecurityManager. See the java.lang.SecurityManager.checkAccess(Thread) and checkAccess(ThreadGroup) methods. By implementing your own SecurityManager, you are able to react to the creation of threads.
To answer the other question: many libraries create new threads, even if you don't expect it. For example, APIs for HTTP communication create timers for keep-alives or session timeouts, and Java 2D creates a signalling thread. Java itself has multiple threads, e.g. the finalizer thread, the AWT/Swing event dispatch thread, etc.
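A rough sketch of that approach (it targets older JDKs, since the SecurityManager is deprecated in recent Java; note that constructing a Thread actually triggers the ThreadGroup overload, so both are overridden here just to log):

public class ThreadSpySketch {
    public static void main(String[] args) {
        System.setSecurityManager(new SecurityManager() {
            @Override
            public void checkPermission(java.security.Permission p) {
                // allow everything; we only want to observe, not restrict
            }
            @Override
            public void checkAccess(Thread t) {
                new Exception("Thread touched: " + t.getName()).printStackTrace();
            }
            @Override
            public void checkAccess(ThreadGroup g) {
                new Exception("Thread created in group: " + g.getName()).printStackTrace();
            }
        });

        // Any library call after this point that spawns a thread will print a
        // stack trace showing where the thread came from.
        new Thread(() -> {}).start();
    }
}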
There's no way to tell. Actually, I don't think you would normally care that much unless you're in some kind of constrained environment. What I've found to be more relevant is determining whether a method is written with the expectation of being run on a particular thread (the AWT event dispatch thread, in the case I've seen). There's no way to do that either, unless the code uses some kind of naming convention or it's documented.
In my experience, if you are looking at core Java (not J2EE), the only place I can think of where threads are created by core Java is Swing.
I haven't seen any example of other threads being created by the core Java APIs, except for the Thread class, of course. :)
But if you are using other libraries, they may be creating threads. If you don't want to profile, you may want to use AspectJ to log whenever a new thread is created, along with the stack trace of what called it, so you can see what is creating the threads.
UPDATE:
According to this post, Swing uses 4 threads; it also explains how you can go about killing off the threads, if needed.
http://www.herongyang.com/Swing/jframe_2.html
If you want to see active threads, just fire up the jvisualvm application (located in your $JDK/bin directory) and connect to any local Java process. You'll be able to see a multitude of information about the process, including thread names, status, and history. Get more information here.
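If you would rather list the threads programmatically instead of attaching a tool, a small sketch with the standard ThreadMXBean does the same kind of thing:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Print the name and state of every live thread in this JVM.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            System.out.println(info.getThreadName() + " - " + info.getThreadState());
        }
    }
}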