Fastest way to read/write to Random Access Files? - java

Note: I have seen similar questions, but they all refer to large files. This is for small amounts of data read and written constantly, and many files will be written to and read from at once, so performance will be an issue.
Currently, I'm using a RandomAccessFile for an "account"; it's fast with basic I/O:
raf.write();
I have seen random access files used with file channels and with buffered I/O. Which is the fastest (again, for small data), and could you please supply an example as proof?

If you want correctness across multiple read/write processes, you are going to sacrifice performance either to non-buffered APIs like RandomAccessFile, or else to inter-process locking.
You can't validly compare to what you could achieve within a single process without contention.
You could investigate MappedByteBuffer, but be aware it brings its own problems in its wake.
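If you do go down that path, here is a minimal sketch of what a memory-mapped, fixed-length-record "account" file could look like. The record size, record count and offsets are invented for the example, and it does nothing about the inter-process correctness issues discussed above:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MappedAccountStore {
        private static final int RECORD_SIZE = 128; // hypothetical fixed record length

        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(Path.of("accounts.dat"),
                    StandardOpenOption.READ, StandardOpenOption.WRITE,
                    StandardOpenOption.CREATE)) {
                // Map enough space for, say, 1000 records; mapping past EOF extends the file.
                MappedByteBuffer map = channel.map(
                        FileChannel.MapMode.READ_WRITE, 0, 1000L * RECORD_SIZE);

                // Update record 42 in place; no explicit write() call is needed,
                // the OS flushes dirty pages (or force() does it eagerly).
                map.putLong(42 * RECORD_SIZE, 5000L);
                long balance = map.getLong(42 * RECORD_SIZE);
                System.out.println("balance = " + balance);

                map.force(); // flush to disk; still no inter-process guarantee
            }
        }
    }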
I personally would look into using a database. That's what they're for.

Smaller scale Java distributed programming

I'm learning a bit more about Hadoop and its applications, and I understand it is geared toward massive datasets and large files. Let's say I had an application in which I was processing a relatively small number of files (say 100k), which isn't a huge number for something like Hadoop/HDFS. However, it does take a non-trivial amount of time to run on a single machine, so I'd like to distribute the processing.
The problem can be broken down into a map-reduce-style problem (e.g. each of the files can be processed independently and then I can aggregate the results). I'm open to using infrastructure such as Amazon EC2, but I'm not so sure about which technologies to explore for actually aggregating the results of the process. It seems like Hadoop might be a bit of overkill here.
Can anyone provide guidance on this type of problem?
First off, you may want to reconsider your assumption that you can't combine files. Even images can be combined; you just need to figure out how to do that in a way that allows you to break them out again in your mappers. Combining them with some sort of sentinel value or magic number between them might make it possible to turn them into one giant file.
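As a rough, non-Hadoop-specific illustration of that "combine, then split again" idea, a length-prefixed bundle is one possible framing (the class and method names here are invented for the sketch):

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class Bundler {
        // Concatenate many small files into one, prefixing each with its name and length
        // so a reader (e.g. a custom RecordReader) can split them apart again.
        public static void bundle(List<Path> inputs, Path bundle) throws IOException {
            try (DataOutputStream out = new DataOutputStream(
                    new FileOutputStream(bundle.toFile()))) {
                for (Path p : inputs) {
                    byte[] bytes = Files.readAllBytes(p);
                    out.writeUTF(p.getFileName().toString()); // keep the original name
                    out.writeInt(bytes.length);               // length prefix for splitting
                    out.write(bytes);                         // raw content
                }
            }
        }
    }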
Other options include HBase, where you could store the images in cells. HBase also has a built-in TableMapper and TableReducer, and can store the results of your processing alongside the raw data in a semi-structured way.
EDIT: As for the "is Hadoop overkill" question, you need to consider the following:
Hadoop adds at least one machine of overhead (the HDFS NameNode). You typically don't want to store data or run jobs on that machine, since it is a SPOF.
Hadoop is best suited for processing data in batch, with relatively high latency. As @Raihan mentions, there are several other FOSS distributed compute architectures that may serve your needs better if you need realtime or low-latency results.
100k files isn't so few. Even if they are 100 KB each, that's 10 GB of data.
Other than the above, Hadoop is a relatively low-overhead way of approaching distributed computing problems. It has a huge, helpful community behind it, so you can get help quickly if you need it. And it is focused on running on cheap hardware and a free OS, so there really isn't any significant overhead.
In short, I'd try it before you discard it for something else.

java - very fast writing to a file [duplicate]

This question already has answers here:
Fastest way to write huge data in text file Java
(7 answers)
Closed 6 years ago.
I get a fast stream of data (objects) and I would like to write it to a file.
This is a standalone process, so it doesn't do anything but read the data from a socket, parse it to CSV, and write it all to a file.
What is the best way to write a lot of CSV lines to a file?
Is buffered writing my solution?
Is there a buffered File object in Java?
Should I manage it myself and use writeLines()?
If you're dealing with a huge throughput of data then I suggest you use a set of in-memory buffers where you deposit the arriving data, and then have a thread/threadpool which uses Java NIO to "consume" these buffers and write them to disk. You will, however, be limited by the disk writing speed -- bear in mind that it's not unusual for the network to be faster than your hard disk! So you might want to consider a threadpool which writes to different physical locations and only "pastes" these files together after all the data has been received and written.
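A minimal sketch of that arrangement (all names and sizes are invented for the example): producer threads hand CSV lines to a bounded queue, and a single consumer thread drains it and writes with a FileChannel:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class QueuedCsvWriter implements Runnable {
        private static final ByteBuffer POISON = ByteBuffer.allocate(0);
        private final BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<>(1024);
        private final Path target;

        public QueuedCsvWriter(Path target) { this.target = target; }

        // Called by the socket-reading thread(s): blocks on the queue, never on the disk.
        public void submit(String csvLine) throws InterruptedException {
            queue.put(ByteBuffer.wrap((csvLine + "\n").getBytes(StandardCharsets.UTF_8)));
        }

        public void shutdown() throws InterruptedException { queue.put(POISON); }

        @Override public void run() {
            try (FileChannel out = FileChannel.open(target,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
                ByteBuffer buf;
                while ((buf = queue.take()) != POISON) {
                    while (buf.hasRemaining()) {
                        out.write(buf); // disk speed is still the ceiling
                    }
                }
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
    }

The socket-reading code calls submit(), a single Thread runs the instance, and the bounded queue provides back-pressure if the disk falls behind.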
As mentioned above, chances are that it's disk I/O that limits you, not Java abstractions.
But beyond using a good lib to deal with CSV, you might consider using other (even more) efficient formats like JSON, as well as compression. GZIP is good at compressing things but relatively slow; there are faster codecs too. For example, LZF (like this Java implementation) is fast enough to compress at speeds higher than typical disk I/O (and uncompress even faster). So compressing output may well increase throughput as well as reduce disk usage.
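And a sketch of the compression point using GZIPOutputStream, simply because it ships with the JDK; LZF would be a third-party drop-in with the same stream-wrapping shape:

    import java.io.BufferedWriter;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class CompressedCsvSink {
        public static void main(String[] args) throws IOException {
            try (Writer out = new BufferedWriter(new OutputStreamWriter(
                    new GZIPOutputStream(new FileOutputStream("data.csv.gz")),
                    StandardCharsets.UTF_8))) {
                // Each line is compressed before it ever hits the disk,
                // trading CPU time for (often much) less I/O.
                for (int i = 0; i < 1_000_000; i++) {
                    out.write(i + ",some,sample,fields\n");
                }
            }
        }
    }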

RFC: What's a good approach to remotely edit very large binary files?

I have a number of rather large binary files (fixed-length records, the layout of which is described in another, textual, file). Data files can get as big as 6 GB. Layout files (COBOL copybooks) are small, usually less than 5 KB.
All data files are concentrated in a GNU/Linux server (although they were generated in a mainframe).
I need to provide the testers with the means to edit those binary files. There is a free product called RecordEdit (http://record-editor.sourceforge.net/), but it has two severe drawbacks:
It forces the testers to download the huge files through SFTP, only to upload them once again every time a slight change has been made. Very inefficient.
It loads the entire file into working memory, rendering it useless for all but the relatively small data files.
What I have in mind is a client/server architecture based in Java:
The server would be running a permanent process, listening for editing-oriented requests coming from the client. Such requests would include things like:
return the list of available files
lock a certain file for editing
modify this data in that record
return the n-th page of records
and so on…
The client could take any form (RCP-based on the desktop, which is my first candidate; ncurses on the same server; a web application in between…) as long as it is able to send requests to the server.
I've been exploring NIO (because of its buffers) and MINA (because of protocol transparency) in order to implement the scheme. However, before any further advancement of this endeavor, I would like to collect your expert opinions.
Is mine a reasonable way to frame the problem?
Is it feasible to do it using the language and frameworks I'm thinking of? Is it convenient?
Do you know of any patterns, blueprints, success cases or open projects that resemble or have to do with what I'm trying to do?
As I see it, the tricky thing here is decoding the files on the server. Once you've written that, it should be pretty easy.
I would suggest that, whatever the thing you use client-side is, it should basically upload a 'diff' of the person's changes.
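As a sketch of what applying such a diff could look like on the server (the record length, record index and byte layout are placeholders; the real values would come from the copybook), a positioned write touches only the affected record rather than rewriting a 6 GB file:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class RecordPatcher {
        // Overwrite record number recordIndex of a fixed-length-record file in place.
        public static void patchRecord(Path dataFile, int recordLength,
                                       long recordIndex, byte[] newRecord) throws IOException {
            if (newRecord.length != recordLength) {
                throw new IllegalArgumentException("record must be exactly " + recordLength + " bytes");
            }
            try (FileChannel ch = FileChannel.open(dataFile, StandardOpenOption.WRITE)) {
                long offset = recordIndex * recordLength;
                ByteBuffer buf = ByteBuffer.wrap(newRecord);
                while (buf.hasRemaining()) {
                    ch.write(buf, offset + buf.position()); // positioned write, no seek needed
                }
            }
        }
    }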
Might it make sense to make something that acts like a database (or use an existing database) for this data? Or is there just too much of it?
Depending on how many people need to do this, the quick-and-dirty solution is to run the program via X forwarding -- that eliminates a number of the issues, as long as that server has quite a lot of RAM free.
Is mine a reasonable way to frame the problem?
IMO, yes.
Is it feasible to do it using the language and frameworks I'm thinking of?
I think so. But there are other alternatives. For example:
Put the records into a database, and access by a key consisting of a filename + a record number. Could be a full RDBMS, or a more lightweight solution.
Implement as a RESTful web service with a UI implemented in HTML + javascript.
Implement using a scalable distributed file-system.
Also, from your description there doesn't seem to be a pressing need to use a highly scalable / transport independent layer ... unless you need to support hundreds of simultaneous users.
Is it convenient?
Convenient for who? If you are talking about you the developer, it depends if you are already familiar with those frameworks.
Have you considered using a distributed file system like OpenAFS? That should be able to handle very large files. Then you can write a client-side app for editing the files as if they are local.

java io read and write lock

Suppose I have a file that might get written by one thread/process (Writer) and read by another thread/process (Reader).
Writer updates the file every x time interval, and Reader reads it every y time interval.
If they happen to read and write to the file at the same time, will there be any issues? Would the read block until the write finishes, or would the read fail? And vice versa?
What's the best practice here?
You'll need to devise your own locking protocol to implement in the applications. Specifics depend on the underlying operating system, but in general, nothing will stop one process from reading a file even when another process is writing to it.
Java has a FileLock class that can be used to coordinate access to a file. However, you'll need to read the caveats carefully, especially those relating to the system-dependence of this feature. Testing the feature on the target operating system is extremely important.
A key concept of Java's FileLock is that it is only "advisory". Your process should be able to detect that another process holds a lock on a file, but your process can ignore it and do what it likes with the file, no restrictions.
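For illustration, a minimal sketch of cooperative locking with FileLock; both processes have to play along for it to mean anything, per the advisory caveat above, and error handling is omitted:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class CooperativeWriter {
        public static void writeExclusively(Path file, byte[] data) throws IOException {
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                // Blocks until no other cooperating process holds a lock on this file.
                try (FileLock lock = ch.lock()) {
                    ch.write(ByteBuffer.wrap(data));
                    ch.force(true); // make sure readers see the complete write
                }
            }
        }
    }

A cooperating reader would open its own channel and take a shared lock with ch.lock(0, Long.MAX_VALUE, true) before reading.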
The question is ambiguous as to whether multiple processes will use the file, or merely separate threads within a single Java process. That's a big difference. If the problem requires only thread safety within a single process, a ReentrantReadWriteLock can provide a robust, high-performance solution, without any platform-specific pitfalls.
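And if it really is only threads within one process, a sketch along these lines would be enough (the file path and line-based content are placeholders):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class SharedFile {
        private final Path path;
        private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();

        public SharedFile(Path path) { this.path = path; }

        public void write(List<String> lines) throws IOException {
            rw.writeLock().lock();            // excludes readers and other writers
            try {
                Files.write(path, lines, StandardCharsets.UTF_8);
            } finally {
                rw.writeLock().unlock();
            }
        }

        public List<String> read() throws IOException {
            rw.readLock().lock();             // many readers may hold this at once
            try {
                return Files.readAllLines(path, StandardCharsets.UTF_8);
            } finally {
                rw.readLock().unlock();
            }
        }
    }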
Best practice is not to use a file for communication between processes. Files are not designed for this purpose. Instead you should use messaging, which IS designed for communication between processes. You can use files as well, to audit what has been sent/received.
If you use files alone, you could come up with a solution which is good enough, but I don't believe you will have a solution which could be considered best practice.

Loading and analyzing massive amounts of data

So for some research work, I need to analyze a ton of raw movement data (currently almost a gig of data, and growing) and spit out quantitative information and plots.
I wrote most of it using Groovy (with JFreeChart for charting) and when performance became an issue, I rewrote the core parts in Java.
The problem is that analysis and plotting takes about a minute, whereas loading all of the data takes about 5-10 minutes. As you can imagine, this gets really annoying when I want to make small changes to plots and see the output.
I have a couple ideas on fixing this:
Load all of the data into a SQLite database.
Pros: It'll be fast. I'll be able to run SQL to get aggregate data if I need to.
Cons: I have to write all that code. Also, for some of the plots I need access to each point of data, so with a couple hundred thousand files to load, some parts may still be slow.
Java RMI to return the object. All the data gets loaded into one root object, which, when serialized, is about 200 MB. I'm not sure how long it would take to transfer a 200 MB object through RMI (same client).
I'd have to run the server and load all the data but that's not a big deal.
Major pro: this should take the least amount of time to write
Run a server that loads the data and executes a Groovy script on command within the server VM. Overall, this seems like the best idea (for implementation time vs. performance, as well as other long-term benefits).
What I'd like to know is: have other people tackled this problem?
Post-analysis (3/29/2011): A couple months after writing this question, I ended up having to learn R to run some statistics. Using R was far, far easier and faster for data analysis and aggregation than what I was doing.
Eventually, I ended up using Java to run preliminary aggregation, and then ran everything else in R. Making beautiful charts was also much easier in R than with JFreeChart.
Databases are very scalable, if you are going to have massive amounts of data. In MS SQL we currently group/sum/filter about 30GB of data in 4 minutes (somewhere around 17 million records I think).
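For the SQLite idea specifically, a rough sketch over JDBC could look like the following (this assumes the sqlite-jdbc driver is on the classpath, and the table layout is invented for the example):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class MovementDb {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:movement.db")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS movement " +
                               "(run_id INTEGER, t REAL, x REAL, y REAL)");
                }
                conn.setAutoCommit(false); // batched inserts are dramatically faster
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO movement (run_id, t, x, y) VALUES (?, ?, ?, ?)")) {
                    // ... loop over parsed data points, set the parameters, call
                    // ins.addBatch(), and ins.executeBatch() every few thousand rows ...
                }
                conn.commit();
            }
        }
    }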
If the data is not going to grow very much, then I'd try out approach #2. You can make a simple test application that creates a 200-400 MB object with random data and test the performance of transferring it before deciding whether you want to go that route.
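Before committing to that route, a quick measurement like the following gives a rough idea of what serializing a ~200 MB object graph costs, which is the bulk of what RMI has to do here (the payload is just random doubles standing in for the real objects):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.util.Random;

    public class SerializationCostTest {
        public static void main(String[] args) throws IOException {
            // ~200 MB of doubles; run with a large enough heap, e.g. -Xmx2g
            double[][] payload = new double[25_000][1_000];
            Random rnd = new Random(42);
            for (double[] row : payload) {
                for (int i = 0; i < row.length; i++) {
                    row[i] = rnd.nextDouble();
                }
            }

            long start = System.nanoTime();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(payload);
            }
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("serialized %d MB in %d ms%n",
                    bytes.size() / (1024 * 1024), millis);
        }
    }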
Before you make a decision it's probably worth understanding what is going on with your JVM, as well as your physical system resources.
There are several factors that could be at play here:
jvm heap size
garbage collection algorithms
how much physical memory you have
how you load the data - is it from a file that is fragmented all over the disk?
do you even need to load all of the data at once - can it be done in batches? (see the sketch after this list)
if you are doing it in batches, you can vary the batch size and see what happens
if your system has multiple cores perhaps you could look at using more than one thread at a time to process/load data
if using multiple cores already and disk I/O is the bottleneck, perhaps you could try loading from different disks at the same time
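Here is the batch-loading sketch referred to above (the parsing step is a placeholder for whatever the raw movement format actually needs):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchLoader {
        // Read the raw file in fixed-size batches so the whole data set never has
        // to sit in memory at once; tune batchSize and watch heap/GC behaviour.
        public static void load(Path file, int batchSize) throws IOException {
            try (BufferedReader in = Files.newBufferedReader(file)) {
                List<String> batch = new ArrayList<>(batchSize);
                String line;
                while ((line = in.readLine()) != null) {
                    batch.add(line);
                    if (batch.size() == batchSize) {
                        process(batch);   // aggregate or persist, then let the batch go
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    process(batch);
                }
            }
        }

        private static void process(List<String> batch) {
            // placeholder: parse each line into a data point and feed the analysis
        }
    }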
You should also look at http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp if you aren't familiar with the settings for the VM.
If your data has relational properties, there is nothing more natural than storing it in an SQL database. There you can solve your biggest problem -- performance -- at the cost of "just" writing the appropriate SQL code.
Seems very straightforward to me.
I'd look into analysis using R. It's a statistical language with graphing capabilities. It could put you ahead, especially if that's the kind of analysis you intend to do. Why write all that code?
I would recommend running a profiler to see what part of the loading process is taking the most time and if there's a possible quick win optimization. You can download an evaluation license of JProfiler or YourKit.
Ah, yes: large data structures in Java. Good luck with that, surviving "death by garbage collection" and all. What Java seems to do best is wrapping a UI around some other processing engine, although it does free developers from most memory management tasks -- for a price. If it were me, I would most likely do the heavy crunching in Perl (having had to recode several chunks of a batch system in Perl instead of Java in a past job, for performance reasons), then spit the results back to your existing graphing code.
However, given your suggested choices, you probably want to go with the SQL DB route. Just make sure that it really is faster for a few sample queries; watch the query-plan data and all that (assuming your system will log or interactively show such details).
Edit (to Jim Ferrans), re: Java big-N faster than Perl (comment below): the benchmarks you referenced are primarily little "arithmetic" loops, rather than something that does a few hundred MB of I/O and stores it in a Map / %hash / Dictionary / associative array for later revisiting. Java I/O might have gotten better, but I suspect all the abstractness still makes it comparatively slow, and I know the GC is a killer. I haven't checked this lately; I don't process multi-GB data files on a daily basis at my current job like I used to.
Feeding the trolls (12/21): I measured Perl to be faster than Java for doing a bunch of sequential string processing. In fact, depending on which machine I used, Perl was between 3 and 25 times faster than Java for this kind of work (batch + string). Of course, the particular thrash-test I put together did not involve any numeric work, which I suspect Java would have done a bit better, nor did it involve caching a lot of data in a Map/hash, which I suspect Perl would have done a bit better. Note that Java did much better at using large numbers of threads, though.
