From what little understanding of Cassandra I have, it seems that data locality is mostly transparent to the client application that accesses a node, as it should be.
However, what if I explicitly wanted to access only the data of a column family that is local to the node I'm connected to? Is such a thing possible? I haven't found a way to get this from a client API out of the box. It seems I could extract some of this information from the system tables, but I can't quite figure out how.
The idea is to perform MapReduce, but without using Hadoop: a local client would connect to its local Cassandra node, perform aggregation on the local data, and then pass the result back upstream.
Is such a thing possible at all? It seems so, since I've seen evidence of Hadoop being able to use Cassandra, but the examples are geared towards Hadoop rather than a generic client. The local client (the bit talking to Cassandra) would be in Java. I'm currently using Hector, but I'm unsure whether it provides any data locality information.
A recent article on the Netflix Techblog introduces Aegisthus, a project which reads the SSTables stored on disk across the cluster and merges them into a single, consistent view of the data (in MapReduce). I would imagine that the mechanics would then trivially exist for generating a view of the data on a single node.
Unfortunately, I don't think they've open-sourced this tool yet, so you won't be able to use it. For now it is mainly a glimmer of hope that, yes, it is possible to natively read SSTables using non-Cassandra code.
You may be able to hack something together using the Cassandra source that reads SSTables and have that feed the local client you're hoping to build. A great starting point would be looking at the source of org.apache.cassandra.tools.SSTableExport which is used in the sstable2json tool.
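Alternatively, for finding out which rows are local without touching SSTables at all, one option is to ask the cluster for its token ring over the Thrift API and keep only the ranges whose endpoints include the node you are connected to. This is a minimal, hedged sketch (not Hector-specific); the keyspace name, port, and address-matching logic are assumptions you would need to adapt to your setup:

import java.net.InetAddress;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.TokenRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class LocalRanges {
    public static void main(String[] args) throws Exception {
        // Connect over Thrift to the local node (default RPC port 9160).
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();

        // describe_ring returns every token range and the replicas that own it.
        String localAddress = InetAddress.getLocalHost().getHostAddress();
        for (TokenRange range : client.describe_ring("MyKeyspace")) { // keyspace name assumed
            if (range.getEndpoints().contains(localAddress)) {
                // A range replicated on this node: range scans restricted to
                // (start_token, end_token] would touch only local data.
                System.out.printf("local range: (%s, %s]%n",
                        range.getStart_token(), range.getEnd_token());
            }
        }
        transport.close();
    }
}

Your aggregation client could then issue range queries bounded by those tokens so each node only scans the data it owns.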
I'm currently getting into socket programming and building a multi-threaded console application where I need to register/login users. The data needs to be saved locally, but I can't seem to find the right structure for it.
Here are the ideas I thought about:
Simply saving the data to a .txt file (this will be troublesome when searching and authenticating logins).
Using the Java Preferences API; but since the application is multi-threaded, I keep overwriting the data each time a new client connects to my server. Can I create a new node for each new user?
What do you guys think is the ideal structure for saving login credentials? (security isn't currently a concern for this application)
I would consider the H2 database engine.
quote:"Very fast, open source, JDBC API Embedded and server modes; in-memory
databases Browser based Console application Small footprint: around 2
MB jar file size"
http://www.h2database.com
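To give an idea of how little code an embedded, in-memory H2 store takes, here is a minimal sketch; the table layout and credentials are made up, and H2 handles concurrent connections for you:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class H2UserStore {
    public static void main(String[] args) throws Exception {
        // In-memory database; DB_CLOSE_DELAY=-1 keeps it alive across connections.
        // Use "jdbc:h2:./users" instead for a file-backed database.
        Connection conn = DriverManager.getConnection("jdbc:h2:mem:users;DB_CLOSE_DELAY=-1");

        conn.createStatement().execute(
            "CREATE TABLE IF NOT EXISTS users (name VARCHAR PRIMARY KEY, password VARCHAR)");

        PreparedStatement insert = conn.prepareStatement("INSERT INTO users VALUES (?, ?)");
        insert.setString(1, "alice");
        insert.setString(2, "secret");
        insert.execute();

        PreparedStatement query = conn.prepareStatement(
            "SELECT password FROM users WHERE name = ?");
        query.setString(1, "alice");
        ResultSet rs = query.executeQuery();
        System.out.println("Authenticated: " + (rs.next() && "secret".equals(rs.getString(1))));
    }
}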
It really depends on what you want to do with the application. The answer differs depending on how you would answer the following questions:
Do you want/need to persist the databases?
Is there any other data which you need to store along with that?
Are you using plain Java or a framework like Spring?
Some options:
if you're just prototyping and you don't need persistence: consider using in-memory storage. For simplicity of coding and dependencies, something like a ConcurrentMap can be completely sufficient (see the sketch after this list). If you wrap it properly, you can exchange it later, and you don't add dependencies and complexity at an early stage.
If you're prototyping but you still need persistence, using properties files on top of the ConcurrentMaps can give you a quick win.
There might be more stages to this, depending on where you want to go; choosing a database at some point can be an option. Depending on your experience and needs, you can use a SQL or NoSQL database. Personally, I get faster results with NoSQL (MongoDB in my case) but prefer SQL in production for use cases like account management.
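A minimal sketch of such a wrapped in-memory store (plain-text passwords, since security is explicitly not a concern here; the class and method names are made up):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class InMemoryUserStore {
    private final ConcurrentMap<String, String> users = new ConcurrentHashMap<>();

    // putIfAbsent is atomic, so two clients registering the same name
    // concurrently cannot overwrite each other.
    public boolean register(String username, String password) {
        return users.putIfAbsent(username, password) == null;
    }

    public boolean authenticate(String username, String password) {
        String stored = users.get(username);
        return stored != null && stored.equals(password);
    }
}

Because the storage detail is hidden behind register/authenticate, you can later swap the map for properties files or a database without touching the socket-handling code.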
I have created a Java daemon program that collects data from social network accounts. I use a lot of services, including Flickr, S3, geocoding, etc. Currently I have the program set up to read all these API keys from a properties file. I also have a similarly formatted properties file in my test folder that contains different keys for testing purposes. These property files are obviously not committed to source control. This collection program writes to a Mongo DB. I am also building a web app that also works with Mongo and will be deployed alongside the collector. During development I have been reading that it is best to store keys as environment variables on the production side. It got me thinking, which leads me to my question...
I am wondering if there is a better way to handle these keys in my Java program (from a deployment standpoint), or about possible routes people have taken when doing something similar. Can someone shed some light on this?
I would recommend a database. If you are only storing API keys for personal use, then the size of the database probably isn't a major concern. Personally, I would suggest MySQL (or alternatively SQLite), as they are both quite well supported.
If you encrypt your keys then it shouldn't matter too much where you store your database, although of course I still wouldn't make it openly downloadable. Just pick a good encryption tool and do not try developing your own encryption algorithm!
The latest hotness (in a world of containers) is to use ZooKeeper, etcd, or Consul as a distributed configuration store. The confd tool can ensure that application configuration files are kept in sync with changes to the configuration.
My personal preference is Consul, which has a similar template tool called consul-template, and another called envconsul if you would prefer your program to consume environment variables.
Finally, HashiCorp, the makers of Consul, have an encryption product called Vault. It works well with Consul and is also supported by consul-template.
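Whichever store ends up injecting the values, the Java side can stay simple: look the key up as an environment variable first and fall back to the development properties file. A hedged sketch (the variable and property names are made up):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ApiKeys {
    private final Properties fallback = new Properties();

    public ApiKeys(String propertiesPath) throws IOException {
        try (FileInputStream in = new FileInputStream(propertiesPath)) {
            fallback.load(in);
        }
    }

    // "flickr.api.key" is looked up as FLICKR_API_KEY in the environment
    // (the convention envconsul and most container setups use), then in the file.
    public String get(String name) {
        String env = System.getenv(name.toUpperCase().replace('.', '_'));
        return env != null ? env : fallback.getProperty(name);
    }

    public static void main(String[] args) throws IOException {
        ApiKeys keys = new ApiKeys("config.properties");
        System.out.println(keys.get("flickr.api.key"));
    }
}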
I'm planning to write a Java application which relies on a small graph (around 3000 nodes) to represent its structure. The data should be loaded from a custom file at startup to create an in-memory graph database. I've looked into Neo4j but saw that you can't make it run directly in-memory. Googling around a bit, I found Google JIMFS (Java in-memory file system), which may suit my needs.
Does anyone have experience with getting Neo4j to work on a JIMFS FileSystem?
Are there more suitable alternatives that work in Java (possibly in-memory out of the box, like HSQLDB) for small-scale graphs and still provide a declarative query language like Cypher?
Note that performance is not so much of an issue to me, it's more of a playground to gather some experience with graph databases, but I don't want the application to create a Database file system on disk.
Note that performance is not so much of an issue to me,
In that case you can go for Neo4j's ImpermanentGraphDatabase, which is created like this:
graphDb = new TestGraphDatabaseFactory().newImpermanentDatabase();
It doesn't create any files on filesystem.
Source:
http://neo4j.com/docs/stable/tutorials-java-unit-testing.html
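For a slightly fuller picture, here is a sketch of creating and querying such a database with Cypher. It assumes Neo4j 2.2 or later with the neo4j-kernel test artifact (which provides TestGraphDatabaseFactory) on the classpath:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;
import org.neo4j.test.TestGraphDatabaseFactory;

public class InMemoryGraph {
    public static void main(String[] args) {
        // Lives entirely in memory; nothing is written to disk.
        GraphDatabaseService graphDb = new TestGraphDatabaseFactory().newImpermanentDatabase();

        try (Transaction tx = graphDb.beginTx()) {
            graphDb.execute("CREATE (a:Node {name:'a'})-[:LINKS]->(b:Node {name:'b'})");
            Result result = graphDb.execute("MATCH (n:Node) RETURN n.name AS name");
            while (result.hasNext()) {
                System.out.println(result.next().get("name"));
            }
            tx.success();
        }
        graphDb.shutdown();
    }
}

Loading your custom file at startup would then just be a loop that turns each record into a CREATE statement (or, better, a parameterized query) inside one transaction.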
I don't know why you wouldn't want the application to create a database file system on disk, but I can easily tell that there are many options. I used Neo4j and for most cases found its query methodology clear and its visualizer very useful, which, in my limited experience, makes it my number one choice. However, considering your requirements, you might find this interesting:
https://bitbucket.org/lambdazen/bitsy/wiki/Home
I have an embedded system using a Python interface. Currently the system uses a (system-local) XML file to persist data in case the system gets turned off, but normally the system runs the entire time. When the system starts, the XML file is read in and the information is stored in Python objects. The information is then used for processing. My aim is to edit this information remotely (over TCP/IP), even during processing. I would like to use Java to get this done, and I have been thinking about some way to share the objects. The problem is that I'm missing the keywords to find the right technologies. What I found is SOAP, but I think it is not the right thing for this case; is that true? I'm grateful for any tips.
As I understand it, you are using an XML file to store start-up configuration. My assumption about the interface between the Java and Python apps: you want your Java application to retrieve objects over the Python interface, process them locally, and send them back to the Python interface to reload the config?
Depending on your circumstances, you could work something out with the following:
Jython
Pickle (if you have no restriction on startup config file format or can afford to do conversion)
https://pypi.python.org/pypi/Pyro4
Also you can get some ideas from here:
Sharing a complex object between Python processes?
You could have your Python application open an XML-RPC socket that clients connect to. This lets an outside application execute an endpoint that manipulates your Python object values in some way. There are several good choices of Java XML-RPC libraries, including the excellent org.apache.xmlrpc library.
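A hedged sketch of the Java side using the Apache XML-RPC 3.x client API; the endpoint URL and the set_config_value method are made-up examples of what the Python side (e.g. a SimpleXMLRPCServer) might expose:

import java.net.URL;

import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class ConfigClient {
    public static void main(String[] args) throws Exception {
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("http://embedded-host:8000/RPC2")); // hypothetical endpoint

        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // Call a (hypothetical) method registered on the Python XML-RPC server.
        Object result = client.execute("set_config_value",
                new Object[]{"sensor.threshold", 42});
        System.out.println("Server replied: " + result);
    }
}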
I am trying to build a recommendation engine. For that, I am thinking of using Apache Mahout, but I cannot work out whether Mahout processes the data in real time or pre-processes it while the server is idle and stores the results somewhere in a database.
Also, does anyone have any idea what approach sites like Amazon and Netflix follow?
Either/or, but not both. There are parts inside, from an older project, that are essentially real-time for moderate scale. There are also Hadoop-based implementations, which are all offline. The two are not related.
I am a primary creator of these parts, and if you want a system that does both together, I suggest you look at my current project, Myrrix (http://myrrix.com).