I am trying to reconstruct a flow of information from multiple parts handled by different Java processes. Please note that I don't generate the flows; I just read some information about them.
I've tried using MySQL (MyISAM/InnoDB tables) with INSERT ON DUPLICATE KEY UPDATE using an id for each flow. I've also tried storing all the pieces of information and running a query at the end to get the full information. Neither of these approaches yielded the performance needed.
I'm looking for a solution that will allow me to have a set of shared objects between multiple Java processes. The objects should be persistent between runs and fast to look up and update concurrently (>100k lookups/updates per second).
I've thought of a few solutions including:
NoSQL: something like MongoDB, HBase etc.
a caching solution like EhCache, Memcached etc.
The problem is that I don't have any experience with any of these solutions. So, what would you recommend that fits the following criteria:
very fast on a single system. Most of the applications I mentioned were built for distributed systems, but that's not the case here.
easy to learn/use (I want to be able to prototype it in a day)
mature technology
free to use even for commercial purposes
preferably open-source
You could try a separate Java process that coordinates between the others. This process would hold the information to pass over to the main processes. You could wire them up with RMI.
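A minimal sketch of what such a coordinating process could expose, assuming each flow is identified by a string id and each piece of information is a string. The names `FlowStore`, `InMemoryFlowStore`, `addPart` and `lookup` are hypothetical and not from the question. The store uses `ConcurrentHashMap.merge`, which plays the same role as `INSERT ... ON DUPLICATE KEY UPDATE` but in memory.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical remote interface; the names are illustrative only.
interface FlowStore extends Remote {
    // Merge one piece of information into the flow identified by flowId.
    void addPart(String flowId, String part) throws RemoteException;

    // Return everything collected so far for a flow, or null if the flow is unknown.
    String lookup(String flowId) throws RemoteException;
}

// In-memory implementation held by the coordinating process.
class InMemoryFlowStore implements FlowStore {
    private final Map<String, String> flows = new ConcurrentHashMap<>();

    @Override
    public void addPart(String flowId, String part) {
        // merge() is atomic per key: insert if absent, otherwise combine with the existing value.
        flows.merge(flowId, part, (existing, added) -> existing + "|" + added);
    }

    @Override
    public String lookup(String flowId) {
        return flows.get(flowId);
    }
}
```

To make the store reachable from the other processes, the coordinator would export it with `UnicastRemoteObject.exportObject(store, 0)` and bind the resulting stub in a registry created with `LocateRegistry.createRegistry(port)`. This sketch does not cover persistence between runs, and whether RMI calls can reach the >100k operations per second asked for is untested here.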
If you only want to exchange objects between Java applications, you could also look into tuple spaces. There are space implementations specifically for Java, JavaSpaces, which should be able to do what you need. I'm not sure they can keep up with the performance requirement, though. I'm also not sure how widely this technology is still used, since it only supports Java and isn't as flexible as today's NoSQL stores.
Wikipedia has a more detailed description and list of different implementations, many of which are open source.
The other option is to go with Redis: it supports notifications, and it can certainly scale to the requirements you are looking for.
The old (legacy?) solution is JavaSpaces. However, from a software architect's point of view, I would say distributed caches are its replacement nowadays. In particular, take a look at Hazelcast and Infinispan.
From the performance viewpoint I am not happy with the performance of the "big" distributed caching solutions, when only a single in-memory cache is needed, see my writeup on the cache2k benchmarks page (hazelcast needs to be added here).
Anyway, please clarify your problem statement first, because your question falls into the XY problem category. You are not describing the actual problem, and your question just boils down to asking for a "fast reliable distributed objects" solution. What kind of data comes in? What is the rate? How is it accessed? What consistency guarantees need to be met, considering that writing and reading happen in parallel?
By the term "flow of information" it sounds more like a complex event processing problem to me.
Related
I would like to build a distributed NoSQL database or key-value store using golang, to learn golang and practice the distributed systems knowledge I've learnt at school. The target use case I can think of is running MapReduce on top of it and implementing an HDFS-compatible "filesystem" to expose the data to Hadoop, similar to running Hadoop on Ceph and Amazon S3.
My question is: what difficulties should I expect when integrating such a NoSQL database with Hadoop? Or when integrating with other languages (e.g., providing Ruby/Python/Node.js/C++ APIs) if I use golang to build the system?
Ok, I'm not much of a Hadoop user so I'll give you some more general lessons learned about the issues you'll face:
Protocol. If you're going with REST, Go will be fine, but expect to find some gotchas in the default HTTP library's defaults (not expiring idle keepalive connections, not necessarily knowing when a reader has closed a stream). But if you want something more compact, know that: a. the Thrift implementation for Go, last I checked, was lacking and relatively slow. b. Go has great support for RPC, but it might not play well with other languages. So you might want to check out protobuf, or work on top of the Redis protocol or something like that.
GC. Go's GC is very simplistic (STW, not generational, etc). If you plan on heavy memory caching on the order of multiple gigabytes, expect GC pauses all over the place. There are techniques to reduce GC pressure, but the straightforward Go idioms aren't usually optimized for that.
mmap'ing in Go is not straightforward, so it will be a bit of a struggle if you want to leverage that.
Besides slices, lists and maps, you won't have a lot of built in data structures to work with, like a Set type. There are tons of good implementations of them out there, but you'll have to do some digging up.
Take the time to learn concurrency patterns and interface patterns in Go. It's a bit different than other languages, and as a rule of thumb, if you find yourself struggling with a pattern from other languages, you're probably doing it wrong. A good talk about Go concurrency is this one IMHO http://www.youtube.com/watch?v=QDDwwePbDtw
A few projects you might want to have a look at:
Groupcache - a distributed key/value cache written in Go by Brad Fitzpatrick for Google's own use. It's a great implementation of a simple yet super robust distributed system in Go. https://github.com/golang/groupcache and check out Brad's presentation about it: http://talks.golang.org/2013/oscon-dl.slide
InfluxDB which includes a Go based version of the great Raft algorithm: https://github.com/influxdb/influxdb
My own humble (pretty dead) project, a redis compliant database that's based on a plugin architecture. My Go has improved since, but it has some nice parts, and it includes a pretty fast server for the redis protocol. https://bitbucket.org/dvirsky/boilerdb
I have seen many so-called "reverse geocoding" libraries in various languages; all depend on calling an external provider via REST or some similar method. However, you cannot call a REST provider if you must handle thousands of requests per second.
On the other hand, the problem should be simple to solve - CSV-based databases are available for free with this information. The issue is the time and cost of writing an efficient and well-tested in-memory search implementation, versus downloading or buying an existing one.
I can't find any after a lot of looking, but I can't believe there can't be one. Is there any pre-written library that does this?
This question:
Fastest way to find the location(zip, city, state) given latitude/longitude
came the closest, but essentially indicates how to write the solution, not that there is anything available off the shelf. But there must be some library everyone uses for this. A dozen people a day must have this problem.
Spatial databases (e.g. PostgreSQL with PostGIS) use algorithms that are fast at looking up data for a given latitude/longitude. As you want to use a Java library and have it in memory, you could look at the H2 Spatial database. I have never used it, so I can't comment on its performance.
Edit: Hm, looking closer at the link I've provided shows that this is a planned feature... Personally I'd simply use PostgreSQL/PostGIS (with or without Java as server frontend) and be done with it. If your server has enough memory it will still fit the "in-memory" requirement. Naturally it does not fit the Java library requirement. There is, however, JSI, which could be used in memory and with Java.
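For a sense of what the in-memory lookup from the question involves, here is a baseline sketch: a linear scan over all known places using the haversine great-circle distance. The class `NearestPlaceIndex` and its methods are hypothetical names. This is not the spatial index (such as an R-tree) that PostGIS or JSI would use; it only shows the core computation, and at thousands of requests per second over a large data set a real index would likely be needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory reverse geocoder: linear scan with the haversine distance.
class NearestPlaceIndex {
    private static final double EARTH_RADIUS_KM = 6371.0;

    private static final class Place {
        final String name;
        final double lat;
        final double lon;

        Place(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    private final List<Place> places = new ArrayList<>();

    // Register a place; coordinates are in decimal degrees (e.g. parsed from a CSV row).
    void add(String name, double latDeg, double lonDeg) {
        places.add(new Place(name, latDeg, lonDeg));
    }

    // Great-circle distance in kilometres between two points given in degrees.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Name of the closest registered place, or null if the index is empty.
    String nearest(double latDeg, double lonDeg) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Place p : places) {
            double d = haversineKm(latDeg, lonDeg, p.lat, p.lon);
            if (d < bestDistance) {
                bestDistance = d;
                best = p.name;
            }
        }
        return best;
    }
}
```

Each lookup costs time proportional to the number of places, so this is mainly useful as a correctness reference when evaluating a faster indexed implementation.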
What is the best way to cache a Java web application that uses MySQL, in order to improve performance?
What are the best techniques to do it?
Is it better to do this at the application level or the database level?
I'm new to this, so, sorry if I'm wrong.
There are ways to improve performance at both the database level and the application level.
At the database level, here are a few suggestions:
Query optimization
Creating indexes on frequently queried data.
Some ORM layers, such as Hibernate, also provide mechanisms to cache results at a first level and a second level.
At the application level there are many options. A few of them are:
1. EHCache
2. Memcached
3. JCS
Here is a complete list of Java-based caching frameworks:
java-caching-system
Some googling will help you find many other options.
For the UI layer there are many areas for improvement, such as:
Proper use of HTTP headers
Fewer server hits
How JavaScript files are loaded
How CSS files are loaded
Use of CDN servers
Yahoo has a very good blog about this; see YSlow from Yahoo. If you are in the early stages of development, I suggest not going for these yet, as they amount to premature optimization and can lead to many problems.
Why don't you look at below links which could help you.
Article1
Article2
Frameworks exist for this purpose, and Ehcache is one of them. Here you can read up on how to use it: ehcache
Unfortunately, the question is way too broad (there are books on the topic, so it literally falls in the FAQ definition of offtopic)—and thus is likely to be closed soon.
In brief, there are plenty of Java caching solutions, including for example Guava and ehcache.
The three best techniques would be:
Profile
Profile
Profile
First, before changing anything; second, to make sure your changes have an effect; and third, in production, to make sure your changes work in real life.
On the levels—both, the decision depends on the profiling data.
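To illustrate what application-level caching means in its simplest form, here is a small least-recently-used cache built on the JDK's `LinkedHashMap`. The class name `LruCache` is hypothetical, and this is only a sketch of the idea, not how Ehcache or Guava are implemented. Those libraries add expiry, statistics and thread safety on top.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: evicts the least recently accessed entry once maxEntries is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true makes iteration order follow access recency instead of insertion order.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
        return size() > maxEntries;
    }
}
```

A typical use would be to check the cache before running a MySQL query and to `put` the result afterwards. The class is not thread-safe; in a web application it would need to be wrapped, for example with `Collections.synchronizedMap`, or replaced with one of the libraries mentioned above.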
I want to find out more about NoSQL databases/data-stores available for use from Java, and so far I have tried out Project Voldemort. Except for the awfully chosen name, it seems fine so far.
I'd like to find out more about other such database systems. The Wikipedia article has a list of some of them, and there is some documentation on their project pages.
However, instead of comparing technical specs and tutorials provided by authors, what I would like to know is:
What are your experiences with working with these libraries on real projects? Which one would you recommend for use based on that experience, which one you wouldn't and why?
I know that the only people able to answer this question are those who have actually used more than one such database, but I hope that someone has.
EDIT:
By "real project" I primarily mean a project in production (but in absence of these anything larger than a homework or finished tutorial applies).
I worked with a relational database that had an enormous amount of data in it, most of it concentrated in a single table, which was denormalized for performance anyway. But, because of the entire mess with constraints etc, creating a usable cluster had shown horrible results in both stability and performance.
Now, I'm quite sure that almost any of these NoSQL systems would be a better choice than what I had at my disposal. But there has to be a difference between them, too, whether it is in documentation, stability between versions, community, ease of use, whatever... And there are many giants. Whose shoulders to choose? :D
We have been working with HBase for our projects. Our experience is:
The community is very dynamic and extremely helpful
The installation procedure for developers is quite easy in either pseudo-distributed or standalone mode
We have been using it for integration tests written in the style of unit tests
Installing a cluster is also easy, but compared to some other NoSQL stores it has more components to install.
Administration: we are still getting to grips with it, so I can't say much about it yet.
Do not use it for SQL-like SELECT queries; for that we are using Apache Solr
To make development and testing easier we have come up with a simple object mapper - https://github.com/smart-it/smart-dao
The reason I chose HBase is that, like other NoSQL stores, it solves sharding and scaling by design, making things easier in the long run, and that seems to hold up well.
Maybe the most prominent of the Java NoSQL solutions is Cassandra. It has some features beyond Voldemort (an Order-Preserving Partitioner, which allows range queries; a BigTable-style structure for values) and is missing others (no alternate storage backends or version clocks for versioning).
Its performance is optimized for fast writes, but its biggest strength is probably the ease with which it can be scaled horizontally by adding new nodes (something where Voldemort is a bit more static).
Compared to, say, MongoDB, its data model is quite simple and often there's no point in using much more than key/value abstraction (that is, handle data mapping on client side, store serialized objects).
It has full replication and distribution, unlike some k/v stores (couchdb, from what I understand).
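The "handle data mapping on client side, store serialized objects" approach mentioned above can be sketched as follows. A `ConcurrentHashMap` stands in for the remote key/value store, and the class `SerializedValueStore` is a hypothetical name; no Cassandra or Voldemort API is used here. The point is only that the store sees opaque byte arrays while the client owns the mapping.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Client-side mapping over a plain key/value abstraction.
class SerializedValueStore {
    // Stand-in for the remote store: it only ever holds opaque bytes.
    private final Map<String, byte[]> backend = new ConcurrentHashMap<>();

    // Serialize a value with standard Java serialization.
    static byte[] toBytes(Serializable value) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
            out.flush(); // make sure buffered data reaches the byte array before copying it
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild an object from bytes produced by toBytes().
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    void put(String key, Serializable value) {
        backend.put(key, toBytes(value));
    }

    // Returns a fresh deserialized copy, or null if the key is absent.
    Object get(String key) {
        byte[] data = backend.get(key);
        return data == null ? null : fromBytes(data);
    }
}
```

The trade-off of this design is that the store cannot query inside the values, which is why it suits a pure key/value abstraction rather than a document-oriented model.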
It's pretty difficult to nail down a good choice without knowing exactly what your use case is. Much of it depends on what kind of data model you are comfortable with and what fits your needs. You have key-value stores, document-oriented, column-oriented, etc. Another huge factor is each product's take on scaling and how it chooses to deal with availability/consistency trade-offs.
I like MongoDB. I like how it supports queries and I like the document oriented data models. It fits many problems that I seem to run into. There is a Great (with capital G) community as seen at the recent MongoSV event.
Your best bet is to pick 3 different products and evaluate them. I would also see if you can find some companies that have presented at conferences and told their stories of how they were successful. Videos from MongoSV will be available soon.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I am currently developing a proof of concept for an alternative data store. The reason why is I need to enhance a read-mostly clustered webapp, but also because I want to free myself from the pain of the sometimes overly-complex ORM+RDBMS solution.
Overall the idea is quite similar to a distributed cache with persistence (letting the cluster be the SoR), however:
want to be able to retrieve any object along with its children, by id (providing class & id) [only that to start off, as the main querying part is already resolved with lucene in my app].
need to have map of maps of types ( ~ tables in the relational world), and therein distributed maps of 'dehydrated' stored objects (flattening the object graph via reflection deep cloning)
a bin log (like Prevayler, for example) for: eventual recovery if whole cluster goes down; development (and ability to refactor code / change structure); perhaps asynchronously processed for other purposes (reporting, whatever)
eventually later on try to integrate a statically-typed query mechanism, like LINQ, Jaque or H2's JaQu / see ODBs / Lucene (?)
it has to be transaction-aware (not sure "JTA type" though)
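The "map of maps of types" holding 'dehydrated' objects described in the list above could look roughly like this in plain Java, before any clustering is added. All names (`DehydratedStore`, `dehydrate`, `Customer`) are hypothetical. The sketch flattens only one level, using the declared fields of the object's own class; child objects and superclass fields are not followed, and nothing is distributed, logged or transactional.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DehydratedStore {
    // Outer key: type (roughly a table); inner key: object id; value: field name -> field value.
    private final Map<Class<?>, Map<Object, Map<String, Object>>> tables = new ConcurrentHashMap<>();

    // Copy the instance fields declared on the object's class into a plain map.
    static Map<String, Object> dehydrate(Object obj) {
        Map<String, Object> fields = new HashMap<>();
        for (Field f : obj.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
                continue; // skip constants and compiler-generated fields
            }
            try {
                f.setAccessible(true);
                fields.put(f.getName(), f.get(obj));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("cannot read field " + f.getName(), e);
            }
        }
        return fields;
    }

    // Store the dehydrated form of obj under its class and the given id.
    void put(Object id, Object obj) {
        tables.computeIfAbsent(obj.getClass(), k -> new ConcurrentHashMap<>())
              .put(id, dehydrate(obj));
    }

    // Retrieve by class and id; returns null if either is unknown.
    Map<String, Object> get(Class<?> type, Object id) {
        Map<Object, Map<String, Object>> table = tables.get(type);
        return table == null ? null : table.get(id);
    }
}

// Sample domain class used only for illustration.
class Customer {
    private final String name;
    private final int age;

    Customer(String name, int age) {
        this.name = name;
        this.age = age;
    }
}
```

In a clustered version, the inner maps would be the distributed maps provided by the chosen product, and the rehydration step (building objects back from the field maps) would still have to be written.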
I'm planning to implement this idea with Hazelcast (I love its super-simple API) or Terracotta (which I never used - but I'm aware of their 'sweet spot', medium-term data). If you will, my aim is to do more or less what Jonas once blogged about. Using one of these, stored data would roughly have to fit in the sum of the JVM heaps of the cluster.
This should be pretty simple to scale, and would avoid the relational impedance mismatch (i.e. the same as with an ODB) as well as the JDBC + I/O overhead.
Do you know of other tools/frameworks or combination thereof already providing similar functionality, that I'm ignoring?
Can you suggest other ways of tackling this 'getting rid of the DB'? What flaws do you already see in this idea?
Concurrency-wise would it make sense to consider Scala instead of Java?
How about non-relational data stores such as Couch DB, Neo4j, HyperTable, HBase?
A similar question was asked one month ago - but there was no concrete solution.
BTW I just stumbled upon the concept of Enterprise Data Fabric, which, to my surprise, describes a lot of these ideas.
Definitely give Terracotta a try. It's free (unless you go Enterprise which has an SLA and support). It is a JVM-level cluster, so to speak, so you don't have the issues associated with sessions on multiple boxes behind disparate JK workers (assuming you're using this for a J2EE app).
I'm just rambling, so have a look here: http://en.wikipedia.org/wiki/Terracotta_Cluster
UPDATE: There are numerous bits of info on Terracotta on the web too, e.g. http://blog.terracottatech.com/2007/12/fud_of_the_week_terracotta_doe.html
UPDATE2 Bit of background on my views: I work for a company with a fairly big audience. We have a enterprise MySQL running with a master and about 5 slaves (times 2 considering we have 2 channels, with 4 app servers per channel), using MySQL's JDBC Replication driver (for which we've already submitted various patches). We use Spring2.5/Hibernate3 using Spring's declarative JTA transaction management, so read-onlies go to the slaves. With the advent of numerous Ajax enhancements on a future version of our site, our DB servers' load has gone up - we create pricing summaries for thousands of products for all countries, taking into account duties/tax rules for all these countries (plus promotions and real-time auctions running all the time), then the Ajax services have the latest prices in a blink. Terracotta takes the load off the DB and app servers by making these prices available to all app servers on a JVM-layer, with all the JVMs across the boxes linked. So, server A can update the prices every few minutes, and if Ajax hits server B, the prices are available immediately. I know there are people/companies out there with similar businesses, who probably have better ideas and implementations, so I'm always open for discussion, but this is my two cents.
I get inspiration from the guys at Facebook too, for instance this very informative article:
http://www.facebook.com/note.php?note_id=23844338919
They talk about memcached which you should also definitely check out.
As Neo4j is mentioned in the question, I'm chiming in with a few thoughts on using a graph database in this case. (I'm part of the Neo4j team)
retrieving children is trivial in a graph db
there is a map implementation for neo4j
as graphs are native to a graph db you could consider not to flatten the object graph, but to persist data in nodes and edges/relationships (this gives you more flexibility in handling the data)
neo4j is fully transactional
With the new DB technologies emerging today, there's really no need to stay with a RDBMS if your data isn't a good fit for the relational paradigm.
Seems to me Terracotta is a perfect fit for your requirements:
cluster a map to retrieve children via keys (e.g. clustered Map)
map of maps - no problem
no explicit bin log - but Terracotta already persists everything to disk so full cluster restart is already supported
integrated already to Compass, Hibernate Search, and Lucene for search
Transactions? Too slow. Use the cache as a datastore. With persistence you won't lose data writing to (clustered) memory and trickle back to the DB.
In addition, Terracotta does the "reflection" thing you ask for - although it doesn't use reflection as that is far too slow. It uses BCM. Only changes are propagated on the network.
Hazelcast, by the way, requires serialization, so it will be slow and will not do well at all with a map-of-maps data structure (every put will result in a full deep-clone copy across the network), and it doesn't have any kind of persistence built in.
Interesting.
I have a view that we all develop a zoo which comprises all the abstraction layers we habitually use in our projects. And each abstraction layer is a completely different animal.
My goal is to minimize the amount of time spent on just care and feeding of the animals whenever it diverts me from solving the problem at hand - it's overhead - wasted resources. So the fewer, simpler abstraction layers we can get away with, the more productive we are.
I can usually do just fine with two beasties - OOP and RDBMS, coupled through nice, simple, minimal, hand-crafted DAL. For me, ORM is mostly overhead - one abstraction too many, and a pretty hungry one.
Don't discount the option of treating stored procedures as an abstraction tool, either. If you're real comfortable with SQL, it can be a useful resource for implementing a light-weight BL facade that means not needing to think about the ORM problem.
And this post suggests the emergence of alternatives to RDBMS for some requirements, anyway.
Thanks for your answers.
Actually, you talk about DBs which is something I want to completely take out of the picture.
The use case I'm targeting is a startup's small/medium-sized clustered webapp (boxes in a LAN, or in the cloud). It needs to retrieve objects at ~RAM-speed levels and scale fairly easily. As a side-effect, one wouldn't have to think about DB server installations, impedance mismatch, JDBC, caches, polluting domain models with annotations, etc.
Again, what I want to accomplish is something like described here, and I would love to have some more feedback on ideas concerning the actual implementation (why use Terracotta instead of Hazelcast, use serialization or deep cloning via reflection or whatever else, and also the major drawbacks of an approach like this - eg. why wouldn't you change it for your current ORM/DB setup).
It has to be super simple to integrate, so it'll feature a really neat Java API, improving code readability. No other software (DB, memcached) will be required.
Try GigaSpaces. I think they have exactly what you require, and if I'm not mistaken there's a free version for startups.
Some concepts:
"Space" is some place where you can store and retrieve objects
Space can be backed by any JDBC-compliant DB, automatically (no code, only configuration)
Space can be started in your java process, so all accesses are at RAM speed
Space can be clustered/partitioned in any way you want (full mirror, partial, grid).
Space supports distributed or local transactions
Check their wiki, (but only "programmer's guide" - all the rest is marketing BS).