File storage server - java

I'm looking for a cleaner way to horizontally scale my Java app with minimal impact on sources and infrastructure. Here is my problem: I'm currently saving resources in the local file system, so I need to share these files consistently among all my new processing nodes.
I know about Ehcache and the Terracotta Server Array, but localRestartable (guaranteed persistence) is only available in Ehcache Enterprise, and I want to stay as far away from commercial licenses as possible.
Other alternatives could be memcached, Redis, MongoDB (with persistence in mind), or even NFS, but I want the opinion of those who have experience using these as storage services. One clarification: my requirements rule out any online cloud storage service, although I'm open to any alternative that can be installed in my own data center.

With MongoDB you can take advantage of:
- replica sets to distribute data to multiple servers
- sharding to scale writes (if appropriate for the volume of data and writes you need to manage)
You have a few options for storing your binary files in MongoDB:
1) You could save the files as binary data within a field in a MongoDB document. The current document size limit (as of MongoDB 2.2) is 16 MB, which seems more than adequate for your ~1 MB files.
2) You can use the GridFS API to conveniently work with larger documents or fetch binary files in smaller chunks (see also: Java docs for the GridFS class).
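To make the two options concrete, here is a minimal plain-Java sketch of the decision and of the chunking idea behind GridFS. It does not call the MongoDB driver: the class name, the headroom constant, and the chunk size passed by the caller are all assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BinaryStoragePlan {
    // Document size limit quoted in the answer above: 16 MB.
    static final long MAX_DOCUMENT_BYTES = 16L * 1024 * 1024;

    // Hypothetical headroom reserved for the document's other fields.
    static final long METADATA_HEADROOM_BYTES = 64 * 1024;

    /** True if a payload of this size can live in a single document field (option 1). */
    static boolean fitsInline(long payloadBytes) {
        return payloadBytes + METADATA_HEADROOM_BYTES <= MAX_DOCUMENT_BYTES;
    }

    /** Splits data into consecutive chunks of at most chunkSize bytes (the idea behind option 2). */
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int end = Math.min(data.length, offset + chunkSize);
            chunks.add(Arrays.copyOfRange(data, offset, end));
        }
        return chunks;
    }
}
```

In practice the GridFS API does the chunking for you; the sketch only shows why files above the document limit need a chunked representation.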


What is the best way to process large CSV files?

I have a third-party system that generates a large amount of data each day (CSV files stored on an FTP server). Three types of files are generated:
- every 15 minutes (2 files); these are pretty small (~2 MB)
- every day at 5 PM (~200-300 MB)
- every midnight (this CSV file is about 1 GB)
Overall, the 4 CSVs total about 1.5 GB, but keep in mind that some of the files are generated every 15 minutes. The data also has to be aggregated (not a hard process, but it will definitely take time). I need fast responses.
I am thinking about how to store this data and about the overall implementation.
We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard shared with other applications won't handle such a load. What comes to my mind:
- An upgrade to MS SQL Enterprise on a separate server.
- PostgreSQL on a separate server. Right now I'm working on a PoC for this approach.
What would you recommend here? There are probably better alternatives.
Edit #1
Those large files contain new data each day.
Okay. After spending some time on this problem (reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.
Tl;dr
Database: PostgreSQL as it is good for CSV, free and open source.
Tool: Apache Spark is a good fit for such type of tasks. Good performance.
DB
The choice of database is an important decision: what to pick, and how it will cope with this amount of data in the future. It should definitely be a separate server instance, so that it does not put additional load on the main database instance or block other applications.
NoSQL
I thought about using Cassandra here, but that solution would be too complex right now. Cassandra does not support ad-hoc queries. Its data storage layer is basically a key-value store, which means that you must model your data around the queries you need, rather than around the structure of the data itself.
RDBMS
I didn't want to overengineer here, so I settled on a relational database.
MS SQL Server
It is a viable option, but the big downside is pricing. The Enterprise edition costs a lot of money given our hardware. Regarding pricing, you could read this policy document.
Another drawback is the support for CSV files, which will be our main data source. According to the article referenced below, MS SQL Server handles CSV import and export poorly. Typical problems:
- MS SQL Server silently truncating a text field.
- MS SQL Server's text encoding handling going wrong.
- MS SQL Server throwing an error message because it doesn't understand quoting or escaping.
More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.
PostgreSQL
This database is a mature, battle-tested product. I have heard a lot of positive feedback on it from others (of course, there are some tradeoffs too). It has a more classic SQL syntax and good CSV support, and it is open source.
It is worth mentioning that SSMS is much better than PGAdmin. SSMS has an autocomplete feature and shows multiple results (when you run several queries you get all the results at once, whereas in PGAdmin you get only the last one).
Anyway, right now I'm using DataGrip from JetBrains.
Processing Tool
I've looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark makes it easier to scale if that is needed in the future. That said, Spring Batch could do this work too.
For an Apache Spark example, the code can be found in the learning-spark project.
My choice is Apache Spark for now.
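For a sense of the aggregation step itself, independent of the tool, here is a minimal plain-Java sketch (not Spark) that streams a CSV line by line and sums one numeric column per key. The class name, the column layout, and the naive comma split are assumptions; real files with quoted fields need a proper CSV parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

class CsvAggregator {
    /**
     * Streams a CSV whose first line is a header and sums valueColumn per keyColumn.
     * Only one line is held in memory at a time, so file size is not a concern.
     * Uses a naive comma split: quoted fields containing commas are NOT supported.
     */
    static Map<String, Double> sumByKey(Reader source, int keyColumn, int valueColumn)
            throws IOException {
        Map<String, Double> totals = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line = reader.readLine(); // skip the header line
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                String[] fields = line.split(",", -1);
                double value = Double.parseDouble(fields[valueColumn]);
                totals.merge(fields[keyColumn], value, Double::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data standing in for one of the 15-minute files.
        String csv = "device,reading\na,1.5\nb,2.0\na,0.5\n";
        System.out.println(sumByKey(new StringReader(csv), 0, 1)); // prints {a=2.0, b=2.0}
    }
}
```

A single-threaded pass like this may be enough for the small 15-minute files; the case for Spark is stronger for the 1 GB nightly file or if the volume grows.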
You might consider looking into the Apache Spark project. After validating and curating the data maybe use Presto to run queries.
You could use uniVocity-parsers to process the CSV as fast as possible, as this library comes with the fastest CSV parser around. I'm the author of this library, and it is open source and free (Apache V2 License).
For loading the data into a database, you could try the univocity framework (commercial). We use it to load massive amounts of data into databases such as SQL Server and PostgreSQL very quickly: from 25K to 200K rows/second, depending on the database and its configuration.
Here's a simple example of what the code to migrate from your CSV would look like:
public static void main(String... args) {
    // Configure the CSV input directory
    CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
    csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");

    // Grab column names from the header row of each CSV file
    csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);

    javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); // specific to your environment

    // Configure the target database
    JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);

    // Use only for Postgres: its JDBC driver requires us to convert the input
    // Strings from the CSV to the correct column types.
    database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

    DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));

    // Create a mapping between data stores "csv" and "database"
    DataStoreMapping mapping = engine.map(csv, database);

    // If the names of the CSV files and their columns match the database tables
    // and their columns, the mappings can be detected automatically
    mapping.autodetectMappings();

    // Load the database
    engine.executeCycle();
}
To improve performance, the framework allows you to manage the database schema and perform operations such as dropping constraints and indexes, loading the data, and recreating them. Data and schema transformations are also very well supported if you need them.
Hope this helps.
Pentaho Data Integration (or a similar ETL tool) can handle importing the data into a SQL database and can do aggregation on the fly. PDI has a community edition and can be run stand-alone or via a Java API.

Cloud Storage vs. Datastore latency

I have a simple data file I want to store. I don't need any indexes or queries performed on it, so I can put it in Cloud Storage. BUT, the latency of fetching the file is very important. What is the latency I can expect when fetching a file from Cloud Storage vs. the latency in fetching an entity from the Datastore?
I could not find a good reference for this issue...
You shouldn't expect a specific latency as it'll vary depending on a large number of things. If the file is that important, then just package it with the files when distributing the program if that's possible.
If the file fits within the size limit of a Datastore entity (1 MB), then storing it there makes sense.
I have seen lower latency on Datastore retrieval than on GCS (again, this depends heavily on the size of the object).
Another advantage of Datastore is that the NDB Python interface will transparently cache the entity in memcache.
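Since the latency varies with object size and environment, the practical route is to measure both options under your own conditions. The following is a minimal, storage-agnostic timing sketch in Java; the class name is made up, and the `fetch` supplier is a placeholder for whatever Cloud Storage or Datastore call you actually make.

```java
import java.util.Arrays;
import java.util.function.Supplier;

class LatencyProbe {
    /** Calls fetch `runs` times and returns the per-call latencies in nanoseconds, sorted ascending. */
    static long[] measure(Supplier<?> fetch, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            fetch.get(); // placeholder for the real storage call
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples;
    }

    /** Nearest-rank percentile (0 < p <= 100) of a non-empty, ascending-sorted array. */
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

Comparing the median and a high percentile (for example the 95th) for each backend is usually more informative than a single average, because the slowest requests tend to dominate perceived latency.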

Storing JSON objects: SQLite vs serialization to disk

I will be building an app that pulls down JSON objects from a web service, in the low hundreds, each relatively small (say 20 KB).
The app won't be doing much else than displaying these POJOs, downloading new and updated ones when available and deleting out of date ones. What would be the preferred method for persistent storage of these objects? I guess the two main contenders are storing them in a SQLite DB, maybe using ORMLite to cut down on the overhead, or just serialize the objects to disk, probably in one large file and use a very fast JSON parser.
Any ideas what would be the preferred method?
You could consider using CouchDB as a cache between the mobile client and your web service.
CouchDB would have to run as a service on the Internet, caching the objects from the web service. On the client you can use TouchDB-Android: https://github.com/couchbaselabs/TouchDB-iOS/wiki/Why-TouchDB%3F . TouchDB-Android can synchronize automatically with a CouchDB instance running on the Internet. The application itself would then access TouchDB only. TouchDB automatically detects whether or not there is an Internet connection, so your application keeps running even when offline.
Advantages:
- Caching of JSON calls
- Client keeps working when the Internet connection is down, and synchronizes automatically when the connection is up again.
- Takes load off your web service, and you can scale.
We used this setup before to allow Android software to work seamlessly, even when the Internet connection dropped frequently and the service we accessed data from was quite slow and had limited capacity.
A DBMS such as SQLite comes with querying, indexing, and sorting capabilities (and other standard SQL DBMS features); you should consider whether you need any of these. How many objects are you planning to have in the production environment? If it is, say, a million, the disk serialization approach might not scale.
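For comparison, the disk-serialization option from the question can be sketched with one file per object rather than one large file, which makes updating and deleting individual objects cheap. This is only an illustration: the class name is made up, ids are assumed to be safe file names, and on Android the java.nio.file API requires API level 26 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

class JsonFileCache {
    private final Path dir;

    JsonFileCache(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // Assumption: ids are safe to use as file names (e.g. numeric ids or UUIDs).
    private Path fileFor(String id) {
        return dir.resolve(id + ".json");
    }

    /** Stores or replaces the raw JSON text for one object. */
    void put(String id, String json) throws IOException {
        Files.write(fileFor(id), json.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the stored JSON text, or empty if the object is not cached. */
    Optional<String> get(String id) throws IOException {
        Path file = fileFor(id);
        if (!Files.exists(file)) {
            return Optional.empty();
        }
        return Optional.of(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }

    /** Deletes an out-of-date object; returns true if a file was removed. */
    boolean remove(String id) throws IOException {
        return Files.deleteIfExists(fileFor(id));
    }
}
```

With a few hundred objects of about 20 KB each, a layout like this should stay manageable; once you need queries or sorting, the SQLite route becomes more attractive.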

Is storing temporary files into JackRabbit a good idea?

Does anybody know how much overhead Jackrabbit has compared with pure file system persistence?
I'm using it for a CMS project, but I also have to persist temporary files (that unfortunately have properties/metadata)... I don't know if I should also use Jackrabbit for that.
I think the overhead is significant enough to avoid this... at least the I/O on the file system.
These files are the same as the rest of the files in the repo, but they will certainly be deleted within a minute.
Should I create a layer to persist files with properties via the Java IO API, should I use Jackrabbit, or should I use a database? If so, can it be tuned for performance somehow?
By default, Jackrabbit stores binaries in the FileDataStore, which uses a FileOutputStream, so the overhead is relatively low. However, the binaries in the data store remain until they are garbage collected, which might be a problem for you if you create a huge number of temporary files.
Metadata: it depends how much metadata you have. The metadata is stored in the persistence manager and possibly in the search index (Lucene). The main performance problem there is usually fulltext search, so disable it if possible.
"should I use jackrabbit or should I use database"
That really depends on your use case. Jackrabbit does not claim to be "faster than a database", but the data model (hierarchical, key value pairs) may be better or easier to use.
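If you do go the plain Java I/O route mentioned in the question, one simple layout is a temporary data file plus a sibling ".properties" file holding the metadata. The sketch below is an assumption about how such a layer might look (class and method names are made up); it is not a Jackrabbit API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class TempFileWithMetadata {
    /** Writes content to a new temp file in dir and its metadata to a sibling ".properties" file. */
    static Path save(Path dir, byte[] content, Properties metadata) throws IOException {
        Path data = Files.createTempFile(dir, "tmp-", ".bin");
        Files.write(data, content);
        try (OutputStream out = Files.newOutputStream(sidecar(data))) {
            metadata.store(out, null); // null: no comment header
        }
        return data;
    }

    /** Reads back the metadata stored next to the given data file. */
    static Properties loadMetadata(Path data) throws IOException {
        Properties properties = new Properties();
        try (InputStream in = Files.newInputStream(sidecar(data))) {
            properties.load(in);
        }
        return properties;
    }

    /** Removes both the data file and its metadata file. */
    static void delete(Path data) throws IOException {
        Files.deleteIfExists(sidecar(data));
        Files.deleteIfExists(data);
    }

    private static Path sidecar(Path data) {
        return data.resolveSibling(data.getFileName() + ".properties");
    }
}
```

Because the files only live for about a minute, deleting them explicitly like this avoids waiting for a data store garbage collection.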

How to efficiently manage files on a filesystem in Java?

I am creating a few JAX-WS endpoints, for which I want to save the received and sent messages for later inspection. To do this, I am planning to save the messages (XML files) into filesystem, in some sensible hierarchy. There will be hundreds, even thousands of files per day. I also need to store metadata for each file.
I am considering putting the metadata (just a couple of fields) into a database table, but keeping the XML file content itself in files on the filesystem, so as not to bloat the database with content data (which is seldom read).
Is there some simple library that helps me in saving, loading, deleting etc. the files? It's not that tricky to implement it myself, but I wonder if there are existing solutions? Just a simple library that already provides easy access to the filesystem (preferably across different operating systems).
Or do I even need that, should I just go with raw/custom Java?
"Is there some simple library that helps me in saving, loading, deleting etc. the files? It's not that tricky to implement it myself, but I wonder if there are existing solutions? Just a simple library that already provides easy access to filesystem (preferably over different operating systems)."
Java API
Well, if what you need to do is really simple, you should be able to achieve your goal with java.io.File (delete, check existence, read, write, etc.) and a few stream manipulations with FileInputStream and FileOutputStream.
You can also throw in Apache commons-io and its handy FileUtils for a few more utility functions.
Java is independent of the OS. You just need to make sure you use File.separator (not File.pathSeparator, which separates entries in a path list such as the classpath), or use the constructor File(File parent, String child) so that you don't need to mention the separator explicitly.
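As a small illustration of the File(File parent, String child) approach, the sketch below builds a date-based hierarchy for the saved messages without ever writing a separator. The class name and the base/yyyy/MM/dd layout are assumptions, chosen only to show one "sensible hierarchy".

```java
import java.io.File;
import java.time.LocalDate;
import java.util.Locale;

class MessagePaths {
    /** Builds base/yyyy/MM/dd/<messageId>.xml without hard-coding a separator. */
    static File fileFor(File base, LocalDate day, String messageId) {
        File year = new File(base, String.format(Locale.ROOT, "%04d", day.getYear()));
        File month = new File(year, String.format(Locale.ROOT, "%02d", day.getMonthValue()));
        File dayDir = new File(month, String.format(Locale.ROOT, "%02d", day.getDayOfMonth()));
        return new File(dayDir, messageId + ".xml");
    }
}
```

Before writing, call getParentFile().mkdirs() on the result so that the directories exist. Splitting by day keeps each directory at hundreds or thousands of entries, which most file systems handle comfortably.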
The Java file API is relatively high-level in order to abstract the differences between operating systems. Most of the time it is sufficient. It falls short only if you need a relatively OS-specific feature that is not in the API, e.g. checking the physical size of a file on disk (not the logical size), permissions on *nix, free space/quota of the hard drive, etc.
Most operating systems keep an internal buffer for file reads and writes. FileOutputStream.write and FileOutputStream.flush ensure the data has been handed to the OS, but not necessarily written to disk. The Java API also supports this low-level control over buffering (example here), which matters for systems such as databases.
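A minimal sketch of forcing data to disk, assuming you want each message durably stored before returning: FileChannel.force asks the OS to flush its buffers for that file to the storage device. The class and method names are made up for illustration.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Path;

class DurableWrite {
    /** Writes the bytes and asks the OS to push them to the storage device before returning. */
    static void writeDurably(Path target, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(target.toFile())) {
            out.write(data);
            out.flush();                  // hands any buffered data to the OS
            out.getChannel().force(true); // forces file content and metadata to the device
        }
    }
}
```

Forcing every write is noticeably slower than letting the OS schedule it, so it is usually reserved for data you cannot afford to lose in a crash.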
Also, both files and directories are represented by File, and you need to check with isDirectory. This can be confusing, for instance if you have one file x and one directory /x (I don't remember exactly how to handle this issue, but there is a way).
Web service
The web service can use either xs:base64Binary to pass the data, or use MTOM (Message Transmission Optimization Mechanism) if files are large.
Transactions
Note that the database is transactional and the file system is not. So you might have to add a few checks in case operations fail and are retried.
You could go with a complicated design involving some form of distributed transaction (see this answer), or try a simpler design that provides the level of robustness you need. A possible design could be:
- Update. If the user wants to overwrite a file, you actually create a new one. The indirection between the logical file name and the physical file is stored in the database. This way you never overwrite a physical file once written, which keeps rollback consistent.
- Create. Same story when the user wants to create a file.
- Delete. If the user wants to delete a file, you do it only in the database first. A periodic job polls the file system to identify files that are not listed in the database and removes them. This two-phase delete ensures that the delete operation can be rolled back.
This is not as robust as writing BLOBs into a real transactional database, but it provides some robustness. You could otherwise have a look at commons-transaction, but I feel like the project is dead (2007).
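The update/create/delete design above can be sketched as follows. This is a simplified illustration, not a complete implementation: the in-memory map is a stand-in for the database table, and the class and method names are made up.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class VersionedFileStore {
    private final Path dir;
    // Stand-in for the database table: logical name -> physical file name.
    private final Map<String, String> index = new ConcurrentHashMap<>();

    VersionedFileStore(Path dir) {
        this.dir = dir;
    }

    /** Create or update: always write a fresh physical file, then repoint the index. */
    void put(String logicalName, byte[] content) throws IOException {
        String physical = UUID.randomUUID().toString();
        Files.write(dir.resolve(physical), content);
        index.put(logicalName, physical); // in a real system this happens in the DB transaction
    }

    /** Returns the current content, or null if the logical name is unknown. */
    byte[] get(String logicalName) throws IOException {
        String physical = index.get(logicalName);
        return physical == null ? null : Files.readAllBytes(dir.resolve(physical));
    }

    /** Delete: only the index entry is removed; the file is left for sweep(). */
    void delete(String logicalName) {
        index.remove(logicalName);
    }

    /** Periodic job: deletes physical files the index no longer references; returns the count. */
    int sweep() throws IOException {
        Set<String> live = new HashSet<>(index.values());
        int removed = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!live.contains(file.getFileName().toString())) {
                    Files.delete(file);
                    removed++;
                }
            }
        }
        return removed;
    }
}
```

One caveat with this sketch: a sweep running at the same moment as a put could delete a file that has been written but not yet indexed, so a real sweeper should skip files younger than some grace period.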
There is DataNucleus, a Java persistence provider. It is a little too heavy for this case, but it supports the JPA and JDO Java standards with different datastores (RDBMS, object storage, XML, JSON, Excel, etc.). If the product is already using JPA or JDO, it might be worth considering DataNucleus, as saving data into different datastores should be transparent. I suppose DataNucleus supports splitting the data into several files, creating the sensible directory/file structure I wanted (in my question), but this is just a guess.
Support for XML and JSON seems to be experimental.
