I was thinking of building an app to serve audio content.
The first question I get is how to store it. Two obvious solutions come to mind:
Dump in database as BLOB
Dump in filesystem, and store path in DB
There was a similar question here, and the answer urged storing in the filesystem. I can think of at least one disadvantage of storing in files, i.e. I lose all the backup, recovery and other awesome features of databases.
Also I wanted to know how both solutions would fare in terms of scalability.
Does anyone know how Flickr or YouTube does it?
Or does anyone have even more creative (scalable :)) ideas?
Your file system should have backup and recovery procedures set up if this data is important. (The rest of the application is backed up, right?) So you shouldn't use a database just for the backup and restore capability.
Storing the files outside of the database allows you to separate your database and file servers which will be a plus on the scalability side.
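To make that concrete, a minimal sketch of the "file on disk, path in the DB" approach might look like this (the table, column names and storage root are made-up names):

    import java.nio.file.*;
    import java.sql.*;

    // Minimal sketch: write the media file to the filesystem,
    // store only its path (plus metadata) in the database.
    // Table, columns and storage root are hypothetical.
    public class AudioStore {
        private static final Path STORAGE_ROOT = Paths.get("/var/media/audio");

        public static void saveAudio(Connection conn, String id, byte[] audioBytes) throws Exception {
            Files.createDirectories(STORAGE_ROOT);
            Path target = STORAGE_ROOT.resolve(id + ".mp3");
            Files.write(target, audioBytes);              // the bytes live on disk...

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO audio_files (id, path) VALUES (?, ?)")) {
                ps.setString(1, id);
                ps.setString(2, target.toString());       // ...only the path goes into the DB
                ps.executeUpdate();
            }
        }
    }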
I would definitely go for the filesystem. Storing and delivering (large) files is exactly what it was made for.
Storing files in a filesystem would also allow for using a Content Delivery Network. Outsourcing the storage may bring several benefits.
This is a classic question. And a classic argument, with good points for both solutions. Scalability can be achieved with both solutions. Distributed databases are usually easier to handle than distributed filesystems if you grow to the size where all your media doesn't fit on a single server (but even that is open to debate). Think MongoDB or other NoSQL scalable databases.
It boils down to what features you need. It is very hard to implement transactionality on a filesystem, so if it is a concern to you, you should use a database.
Backup and recovery of a filesystem is much easier to implement than a proper, consistent backup of a database. Also, if you lose a file on the disk, it's just a file. If you lose a part of a huge table, it's a loss of all files contained or referenced in that table (as the table becomes unreadable).
Of course, for small databases where you can turn off the DBMS and quickly copy all DB files all of the above is not applicable, but this scenario is almost the same as having data files on the disk.
I think that both ways are viable, but the backup issue is definitely there. Both solutions are scalable given the right design, but big files are probably better off in the filesystem.
I have file data (specifically language resource files). These files are automatically generated using machine translation APIs (Google Translate). They change relatively infrequently, but when the master one changes (a new string added or changed), this causes all the other language files to be updated automatically.
I'm trying to decide between serving these files directly from the blobstore or serving them from memcache and storing them in the datastore.
Which is faster/more efficient?
Nick Johnson described the speed tradeoffs in this article. The blobstore is best at handling uploads from users. For your problem, you will probably get the fastest and cheapest performance using memcache backed by the datastore. In Python, NDB will automate this for you. In Java, use Objectify.
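As a rough sketch of that read-through pattern using the low-level App Engine APIs (the entity kind "ResourceFile" and property "content" are just placeholders for however you model the files):

    import com.google.appengine.api.datastore.*;
    import com.google.appengine.api.memcache.MemcacheService;
    import com.google.appengine.api.memcache.MemcacheServiceFactory;

    // Read-through cache: try memcache first, fall back to the datastore.
    public class ResourceLoader {
        private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
        private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        public String load(String lang) throws EntityNotFoundException {
            String cached = (String) cache.get(lang);
            if (cached != null) {
                return cached;                                    // served from memory
            }
            Entity entity = datastore.get(KeyFactory.createKey("ResourceFile", lang));
            String content = ((Text) entity.getProperty("content")).getValue();
            cache.put(lang, content);                             // warm the cache for next time
            return content;
        }
    }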
It really depends on what you're serving. When people talk about the blobstore they are generally talking about large data (media files) that aren't going to fit in memcache. Our app serves up a lot of audio files and I've found that the blobstore is particularly good for this because it supports progressive HTTP download.
In both cases the lookup time is virtually instantaneous (they are both just maps and you look up data by a key). The time it takes to serve it depends on the item being returned. I can't think of any reason why I would take something from the blobstore and put it in memcache. It's really not going to save any time.
Now the datastore is a different beast...
The answer to every "which is faster" question is "benchmark it". The particularities of your setup (disk speed, memory access latency, bandwidth, demonic infestations) make any general answer about performance chancy at best. The fact that you're running in Google App Engine just makes this even harder - you don't know what hardware you're going to get! So test it.
That said, it is likely that a local(ish) memcache like Google provides will be faster than anything that might involve hitting the disk. Memory access latency is orders of magnitude lower than disk access latency, and memory bandwidth is a hundred times or more that of even the fastest SSDs on the market today.
So, if you can afford the RAM and you want to maximize your responsiveness, storing your data in memory is generally more efficient.
I am saving a huge BLOB (500 MB) into an Oracle DB using JDBC. It takes a lot of time on insertion and later on retrieval.
Please suggest, if any of you have encountered this problem.
Lots of non-database people are really scared of sticking BLOBs and CLOBs in databases. They shouldn't be. Oracle manages them very well. Also bear in mind that Oracle also develops file systems (including BTRFS), so they know about storing all kinds of data. Data in the database can be better protected against media or system failure, secured against unauthorised access and audited for improper use.
You should be using 11g and SecureFile LOBs. This document on SecureFile performance gives guidelines for achieving performance that is as good as or better than regular filesystem storage.
Of course it is worth checking what the bottleneck is first. If it is the network between the app server and the DB server then no amount of database tuning will bypass the issue.
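If the bottleneck is on the client side, one thing worth checking is that you stream the LOB rather than building a 500 MB byte array in the JVM heap. A minimal JDBC sketch (table and column names are made up; SecureFiles is a property of the table's DDL, not of this code):

    import java.io.*;
    import java.sql.*;

    // Stream the BLOB in and out instead of materialising it in memory.
    public class LobStreaming {
        public static void insert(Connection conn, long id, File audio) throws Exception {
            try (PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO media (id, payload) VALUES (?, ?)");
                 InputStream in = new BufferedInputStream(new FileInputStream(audio))) {
                ps.setLong(1, id);
                ps.setBinaryStream(2, in, audio.length());   // streamed, not buffered in memory
                ps.executeUpdate();
            }
        }

        public static void read(Connection conn, long id, File target) throws Exception {
            try (PreparedStatement ps = conn.prepareStatement(
                     "SELECT payload FROM media WHERE id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        try (InputStream in = rs.getBinaryStream(1);
                             OutputStream out = new BufferedOutputStream(new FileOutputStream(target))) {
                            in.transferTo(out);              // Java 9+; otherwise copy in a chunked loop
                        }
                    }
                }
            }
        }
    }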
500 MB blob? Oh my God.
Look, seriously, the answer is "don't do that!" Relational databases aren't intended for that or optimized for it; the way the tables and filesystem have to be organized for indexing and searching isn't suited for that kind of big, indigestible lump.
Consider, instead, creating a separate filesystem for these big lumps and storing a pathname in the database.
I have a number of rather large binary files (fixed length records, the layout of which is described in another –textual– file). Data files can get as big as 6 GB. Layout files (cobol copybooks) are small in size, usually less than 5 KB.
All data files are concentrated in a GNU/Linux server (although they were generated in a mainframe).
I need to provide the testers with the means to edit those binary files. There is a free product called RecordEdit (http://record-editor.sourceforge.net/), but it has two severe drawbacks:
It forces the testers to download the huge files through SFTP, only to upload them once again every time a slight change has been made. Very inefficient.
It loads the entire file into working memory, rendering it useless for all but the relatively small data files.
What I have in mind is a client/server architecture based in Java:
The server would be running a permanent process, listening for editing requests coming from the client. Such requests would include stuff like
return the list of available files
lock a certain file for editing
modify this data in that record
return the n-th page of records
and so on…
The client could take any form (RCP-based on a desktop –which is my first candidate–, ncurses on the same server, a web application in the middle…) as long as it is able to send requests to the server.
I've been exploring NIO (because of its buffers) and MINA (because of protocol transparency) in order to implement the scheme. However, before any further advancement of this endeavor, I would like to collect your expert opinions.
Is mine a reasonable way to frame the problem?
Is it feasible to do it using the language and frameworks I'm thinking of? Is it convenient?
Do you know of any patterns, blue prints, success cases or open projects that resemble or have to do with what I'm trying to do?
As I see it, the tricky thing here is decoding the files on the server. Once you've written that, it should be pretty easy.
I would suggest that, whatever you use client-side, it should basically upload a 'diff' of the person's changes.
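To give an idea of the server side, a minimal sketch of applying such edits to a fixed-length-record file with RandomAccessFile (the record length here is a placeholder; in practice it would come from the parsed copybook):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Seek to the requested record instead of loading the whole 6 GB file.
    public class RecordFile {
        private final RandomAccessFile file;
        private final int recordLength;

        public RecordFile(String path, int recordLength) throws IOException {
            this.file = new RandomAccessFile(path, "rw");
            this.recordLength = recordLength;
        }

        public byte[] readRecord(long recordNumber) throws IOException {
            byte[] record = new byte[recordLength];
            file.seek(recordNumber * recordLength);
            file.readFully(record);
            return record;
        }

        public void writeRecord(long recordNumber, byte[] record) throws IOException {
            file.seek(recordNumber * recordLength);
            file.write(record, 0, recordLength);   // in-place update of a single record
        }
    }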
Might it make sense to make something that acts like a database (or use an existing database) for this data? Or is there just too much of it?
Depending on how many people need to do this, the quick-and-dirty solution is to run the program via X forwarding -- that eliminates a number of the issues, as long as that server has quite a lot of RAM free.
Is mine a reasonable way to frame the problem?
IMO, yes.
Is it feasible to do it using the language and frameworks I'm thinking of?
I think so. But there are other alternatives. For example:
Put the records into a database, and access by a key consisting of a filename + a record number. Could be a full RDBMS, or a more lightweight solution.
Implement as a RESTful web service with a UI implemented in HTML + JavaScript.
Implement using a scalable distributed file-system.
Also, from your description there doesn't seem to be a pressing need to use a highly scalable / transport-independent layer ... unless you need to support hundreds of simultaneous users.
Is it convenient?
Convenient for who? If you are talking about you the developer, it depends if you are already familiar with those frameworks.
Have you considered using a distributed file system like OpenAFS? That should be able to handle very large files. Then you can write a client-side app for editing the files as if they are local.
I need to store records into persistent storage and retrieve them on demand. The requirements are as follows:
Extremely fast retrieval and insertion
Each record will have a unique key. This key will be used to retrieve the record
The data stored should be persistent i.e. should be available upon JVM restart
A separate process would move stale records to an RDBMS once a day
What do you guys think? I cannot use a standard database because of latency issues. In-memory databases like HSQLDB/H2 have performance constraints. Moreover, the records are simple string objects and do not really call for SQL. I am thinking of some kind of flat-file based solution. Any ideas? Any open source project? I am sure there must be someone who has solved this problem before.
There are lots of diverse tools and methods, but I think none of them can shine in all of these requirements.
For low latency, you can only rely on in-memory data access - disks are physically too slow (and SSDs too). If data does not fit in the memory of a single machine, we have to distribute our data to more nodes summing up enough memory.
For persistency, we have to write our data to disk after all. Supposing optimal organization, this can be done as a background activity, not affecting latency.
However, for reliability (failover, HA or whatever), disk operations cannot be totally independent of the access methods: we have to wait for the disks when modifying data to make sure our operation will not disappear. Concurrency also adds some complexity and latency.
Data model is not restricting here: most of the methods support access based on a unique key.
We have to decide:
if data fits in the memory of one machine, or we have to find distributed solutions,
if concurrency is an issue, or there are no parallel operations,
if reliability is strict and we cannot lose modifications, or if we can live with the fact that an unplanned crash would result in data loss.
Solutions might be:
Self-implemented data structures using the standard Java library, files etc. may not be the best solution, because reliability and low latency require clever implementations and lots of testing,
Traditional RDBMSs have a flexible data model, durable, atomic and isolated operations, caching etc. - they actually know too much, and are mostly hard to distribute. That's why they are too slow if you cannot turn off the unwanted features, which is usually the case.
NoSQL and key-value stores are good alternatives. These terms are quite vague, and cover lots of tools. Examples are
BerkeleyDB or Kyoto Cabinet as one-machine persistent key-value stores (using B-trees): can be used if the data set is small enough to fit in the memory of one machine.
Project Voldemort as a distributed key-value store: uses BerkeleyDB java edition inside, simple and distributed,
ScalienDB as a distributed key-value store: reliable, but not too slow for writes either.
MemcacheDB, Redis and other caching databases with persistence,
popular NoSQL systems like Cassandra, CouchDB, HBase etc: used mainly for big data.
A list of NoSQL tools can be found e.g. here.
Voldemort's performance tests report sub-millisecond response times, and these can be achieved quite easily, however we have to be careful with the hardware too (like the network properties mentioned above).
Have a look at LinkedIn's Voldemort.
If all the data fits in memory, MySQL can run in memory instead of from disk (MySQL Cluster, Hybrid Storage). It can then handle storing itself to disk for you.
What about something like CouchDB?
I would use a BlockingQueue for that. Simple, and built into Java.
I do something similar using realtime data from the Chicago Mercantile Exchange. The data is sent to one place for realtime use... and to another place (via TCP), using a BlockingQueue (Producer/Consumer) to persist the data to a database (Oracle, H2). The Consumer uses a time-delayed commit to avoid disk sync issues in the database (H2-type databases use asynchronous commit by default and avoid that issue). I log the persisting in the Consumer to keep track of the queue size, to be sure it is able to keep up with the Producer. Works pretty well for me.
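A stripped-down sketch of that producer/consumer setup (the JDBC URL and table name are placeholders, and it assumes the table already exists):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Callers enqueue records; a single consumer thread drains the queue
    // and persists them, so writers never wait on the database.
    public class PersistingQueue {
        private final BlockingQueue<String[]> queue = new ArrayBlockingQueue<>(10_000);

        public void put(String key, String value) throws InterruptedException {
            queue.put(new String[] {key, value});   // blocks if the consumer falls behind
        }

        public void startConsumer() {
            Thread consumer = new Thread(() -> {
                try (Connection conn = DriverManager.getConnection("jdbc:h2:./records");
                     PreparedStatement ps = conn.prepareStatement(
                         "MERGE INTO records (id, payload) KEY (id) VALUES (?, ?)")) {
                    while (true) {
                        String[] record = queue.take();
                        ps.setString(1, record[0]);
                        ps.setString(2, record[1]);
                        ps.executeUpdate();          // upsert one record per queue entry
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            consumer.setDaemon(true);
            consumer.start();
        }
    }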
MySQL with shards may be a good idea. However, it depends on the data volume, transactions per second and latency you need.
In-memory databases are also a good idea. In fact MySQL provides memory-based tables as well.
Would a tuple space / JavaSpaces work? Also check out other enterprise data fabrics like Oracle Coherence and GemStone.
MapDB provides highly performant HashMaps/TreeMaps that are persisted to disk. It's a single library that you can embed in your Java program.
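A minimal sketch, assuming a recent MapDB 3.x release (file and map names are arbitrary):

    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.Serializer;
    import java.util.concurrent.ConcurrentMap;

    // Embedded, disk-backed map that survives JVM restarts.
    public class MapDbExample {
        public static void main(String[] args) {
            DB db = DBMaker.fileDB("records.db")
                           .transactionEnable()          // crash safety at some write cost
                           .make();
            ConcurrentMap<String, String> map = db.hashMap(
                    "records", Serializer.STRING, Serializer.STRING).createOrOpen();

            map.put("key-1", "some record payload");
            db.commit();                                  // flush to disk
            System.out.println(map.get("key-1"));
            db.close();
        }
    }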
Have you actually proved that using an out-of-process SQL database like MySQL or SQL Server is too slow, or is this an assumption?
You could use a SQL database approach in conjunction with an in-memory cache to ensure that retrievals do not hit the database at all. Despite the fact that the records are plaintext I would still advise using SQL over a flat file solution (e.g. using a text column in your table schema) as the RDBMS will perform optimisations that a file system cannot (e.g. caching recently accessed pages, etc).
However, without more information about your access patterns, expected throughput, etc. I can't provide much more in the way of suggestions.
If you are looking for a simple key-value store and don't need complex sql querying, Berkeley DB might be worth a look.
Another alternative is Tokyo Cabinet, a modern DBM implementation.
How bad would it be if you lose a couple of entries in case of a crash?
If it isn't that bad the following approach might work for you:
Create a flat file for each entry, with the file name equal to the id. Possibly one file for a not-so-big number of consecutive entries.
Make sure your controller has a good cache and/or use one of the existing caches implemented in Java.
Talk to a filesystem expert about how to make this really fast.
It is simple and it might be fast.
Of course you lose transactions, including the ACID properties.
Sub-millisecond r/w means you cannot depend on disk, and you have to be careful about network latency. Just forget about standard SQL-based solutions, main-memory or not. In a ms, you cannot get more than about 100 KByte over a GBit network. Ask a telecom engineer; they are used to solving these kinds of problems.
How much does it matter if you lose a record or two? Where are they coming from? Do you have a transactional relationship with the source?
If you have serious reliability requirements then I think you may need to be prepared to pay some DB Overhead.
Perhaps you could separate the persistence problem from the in-memory problem. Use a pub-sub approach: one subscriber looks after the in-memory copy, the other persists the data ready for a subsequent startup?
Distributed caching products such as WebSphere eXtreme Scale (no Java EE dependency) might be relevant if you can buy rather than build.
Chronicle Map is a ConcurrentMap implementation which stores keys and values off-heap, in a memory-mapped file. So you have persistence on JVM restart.
ChronicleMap.get() is consistently faster than 1 us, sometimes as fast as 100 ns / operation. It's the fastest solution in the class.
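A minimal sketch, assuming Chronicle Map 3.x (the sizing hints and file name are arbitrary and would need tuning for your records):

    import net.openhft.chronicle.map.ChronicleMap;
    import java.io.File;

    // Off-heap, memory-mapped map that survives JVM restarts.
    public class ChronicleExample {
        public static void main(String[] args) throws Exception {
            ChronicleMap<String, String> map = ChronicleMap
                    .of(String.class, String.class)
                    .name("records")
                    .entries(1_000_000)                 // expected number of entries
                    .averageKey("some-key")             // sizing hints for the off-heap layout
                    .averageValue("a typical record payload")
                    .createPersistedTo(new File("records.dat"));

            map.put("some-key", "a typical record payload");
            System.out.println(map.get("some-key"));
            map.close();
        }
    }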
Will all the records and keys you need fit in memory at once? If so, you could just use a HashMap<String,String>, since it's Serializable.
I'm coming from a web-development background and I am wondering how I would make a learning algorithm in Java/C++. Not so much the algorithm part, but making the program "remember" what it learned from the previous day. I would think something like saving a file, but I suspect there might be an easier way. Apologies if this question is just over the top stupid. Thanks.
I think that would depend a bit on the problem domain. You might want to store learned "facts" or "relationships" in a DB so that they can be easily searched. If you are training a neural network, then you'd probably just dump the network state to a file. In general, I think once you have a mechanism that does the learning, the appropriate storage representation will be relatively apparent.
Maybe if you can flesh out your plan on what kind of learning you'd like to implement, people can provide more guidance on what the implementation should look like, including the state storage.
Not stupid, but a little ill-formed maybe.
What you're going to do as your program "learns" something is update the state of some data structure. If you want to retain that state, you need to persist the data structure to some external store. That means translating the data structure to some external format that you can read back in without loss.
Java provides a straightforward way to do this via the Serializable interface; you serialize the data by writing Serializable objects out through an ObjectOutputStream, and a matching ObjectInputStream will reload them later.
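A minimal sketch of that round trip (the map of weights just stands in for whatever structure your learner actually updates):

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;

    // Persist learned state between runs with plain Java serialization.
    public class LearnedState {
        public static void save(Map<String, Double> weights, File file) throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(weights);              // write the whole structure in one call
            }
        }

        @SuppressWarnings("unchecked")
        public static Map<String, Double> load(File file) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                return (Map<String, Double>) in.readObject();
            }
        }

        public static void main(String[] args) throws Exception {
            Map<String, Double> weights = new HashMap<>();
            weights.put("feature-1", 0.42);
            save(weights, new File("state.ser"));
            System.out.println(load(new File("state.ser")));   // reloaded on the next run
        }
    }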
If you want to access and save large amounts of data maybe a database would work well. This would allow you to structure the data and look it up in an easier manner. I'm not too well versed on the subject but I think for remembering and recalling things a database would be vastly superior to a file system.
A robust/flexible solution in Java (C++ too, but I wouldn't know how) would be using a database. Java 1.6 comes with the Apache Derby database, which can be embedded in your apps. Using JPA (Java Persistence API) makes it easy to interface with any database you can find drivers for.
You should look into Neural Network software development. Here's a collection of nice Neural Network libraries for different languages. I am not sure if this is the easy way but once accomplished would be very handy.