I've been reading a few articles on Gutmann method of securely wiping data. I understood that the method is designed for hard disks. I want to write my tiny app that securely wipes data (there are a few on Google Play, I know) from either phone memory or SD card.
My questions are:
Question 1: Gutmann or others?
Given the above, is the Gutmann algorithm both effective and efficient? I believe it is effective, because it overwrites the data so many times that a technology like flash memory has no way of recovering data from 35 writes ago. I'm not sure it's efficient, though: would fewer random overwrite passes achieve the same result?
Question 2: do I really overwrite sectors?
A question that came to mind is the following: if I overwrite a file in Java, does the Linux kernel write the new data to the old sectors, or does it allocate new sectors on the physical media and deallocate the old ones? This makes all the difference...
Re #2, the link you cited is not relevant. new FileOutputStream() doesn't overwrite the file at all, in the sense you mean: it creates a new one, or appends to an existing one, so it is most unlikely to reuse the same disk blocks. However, new RandomAccessFile() in "rw" mode does indeed overwrite the file in place, and you would reasonably expect it to reuse the same disk blocks, although it is possible to imagine a filesystem that didn't.
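For what it's worth, here is a minimal sketch of the in-place case: open the file with RandomAccessFile in "rw" mode and overwrite its existing bytes with random data. The path and pass count are placeholders, and whether the same physical blocks are actually reused is still up to the filesystem and, on flash, the wear-levelling controller.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.security.SecureRandom;

public class OverwriteInPlace {

    // Overwrite every byte of an existing file with random data, in place.
    // Note: the filesystem or flash controller may still remap blocks underneath us.
    static void overwrite(String path, int passes) throws IOException {
        SecureRandom rnd = new SecureRandom();
        byte[] buf = new byte[8192];
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            long length = raf.length();
            for (int pass = 0; pass < passes; pass++) {
                raf.seek(0);
                long remaining = length;
                while (remaining > 0) {
                    rnd.nextBytes(buf);
                    int n = (int) Math.min(buf.length, remaining);
                    raf.write(buf, 0, n);
                    remaining -= n;
                }
                raf.getFD().sync(); // push this pass to the device before starting the next
            }
        }
    }
}
```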
Related
What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file in memory - too large an expansion (a 1.5 GB file turns into a 35 GB in-memory object array). The file can't be stored as a single byte array either (limited by Java's maximum array length of 2^31-1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm, that requires multiple reads. During the calculation of each file pass, there's a considerable amount of heap usage by the algorithm itself, which is unavoidable, hence the requirement to read it multiple times.
The problem you have here is that you cannot mmap() the way the system call of the same name does: the syscall can map huge regions (up to the full 64-bit address space in principle), while FileChannel#map() cannot map more than Integer.MAX_VALUE (2^31-1) bytes in a single call.
However, what you can do is wrap a FileChannel into a class and create several "map ranges" covering all the file.
I have done "nearly" such a thing except more complicated: largetext. More complicated because I have to do the decoding process to boot, and the text which is loaded must be so into memory, unlike you who reads bytes. Less complicated because I have a define JDK interface to implement and you don't.
You can however use nearly the same technique using Guava and a RangeMap<Long, MappedByteBuffer>.
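A rough sketch of that idea, with the chunk size and class name invented here for illustration (this is not the largetext code): map the file in fixed-size chunks and index the chunks by the byte range they cover.

```java
import com.google.common.collect.Range;
import com.google.common.collect.RangeMap;
import com.google.common.collect.TreeRangeMap;

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

// Hypothetical helper: maps a large file as a series of MappedByteBuffers,
// each no larger than CHUNK, and indexes them by the byte range they cover.
public class ChunkedFileMapping implements AutoCloseable {
    private static final long CHUNK = 1L << 30; // 1 GiB per mapping (illustrative)

    private final FileChannel channel;
    private final RangeMap<Long, MappedByteBuffer> mappings = TreeRangeMap.create();

    public ChunkedFileMapping(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.READ);
        long size = channel.size();
        for (long start = 0; start < size; start += CHUNK) {
            long len = Math.min(CHUNK, size - start);
            mappings.put(Range.closedOpen(start, start + len),
                         channel.map(FileChannel.MapMode.READ_ONLY, start, len));
        }
    }

    // Read the byte at an absolute file offset; reads spanning two chunks
    // would need extra handling in a real implementation.
    public byte getByte(long offset) {
        Map.Entry<Range<Long>, MappedByteBuffer> e = mappings.getEntry(offset);
        if (e == null) throw new IndexOutOfBoundsException("offset " + offset);
        return e.getValue().get((int) (offset - e.getKey().lowerEndpoint()));
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```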
I implement CharSequence in this project above; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want; or, well, whatever suits you. Your main problem will be to define that interface. I suspect what CharSequence does is not what you want.
Meh, I may even have a go at it some day, largetext is quite exciting a project and this looks like the same kind of thing; except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory creates mappings with only a small part of the file in memory and the rest left on disk; such an implementation could also exploit the principle of locality: the most recently queried part of the file would be kept in memory for faster access.
See also here.
EDIT I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of a ByteBuffer.allocateDirect(), except it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate database (wide-column, graph, etc.) for your scenario. That's what I'd do. You'd not only get sophisticated query methods on your data, but could also mangle the data using distributed map-reduce implementations doing whatever you want. Maybe that's what you're after (you even dropped the big-data bomb).
How about creating "a dictionary" as the bridge between your program and the target file? Your program will call the dictionary, and the dictionary will refer it to the big fat file.
I'm not sure if I'm asking this question right, but I want to make some sort of lyrics player using subtitle files. Since I also want it to handle larger files (say 10,000 lines), it's not a good idea to load the whole file before playing it: that could cost a lot of time and keep an unnecessary amount of data in RAM. That's why I want to load it the way online videos do, for example (they keep a few minutes buffered in RAM and discard what has already been played, all while playing). I believe this is called buffering.
My question is: are there any ready-made I/O classes in Java that allow this sort of thing? I know of a lot of classes with "buffer" in their name, but I have little to no idea what they do or how they differ from the classes without "buffer" in their name.
I'm toying around with creating a pure Java audio mixing library, preferably one that can be used with Android, not entirely practical but definitely an interesting thing to have. I'm sure it's been done already, but just for my own learning experience I am trying to do this with wav files since there are usually no compression models to work around.
Given the nature of java.io, it defines many InputStream-type classes, each implementing operations that are primarily for reading data from some underlying resource. What you do with the data afterward (dump it, aggregate it in your own address space, etc.) is up to you. I want this to be pure Java, i.e. it works anywhere (no JNI necessary), is optimized for low-memory configurations, and is simple to extend.
I understand the nature of the RIFF format and how to assemble the PCM sample data, but I'm at a loss for the best way to manage the memory required for inflating the files into memory. Using a FileInputStream, only so much of the data is read at a time, depending on the underlying file system and how the read operations are invoked. FileInputStream doesn't provide a way to track where in the file you are, so retrieving streams for mixing later is not possible. My goal would be to inflate the RIFF document into Java objects that allow reading and writing of the appropriate regions of the underlying chunk.
If I allocate space for the entire thing, e.g. all the PCM sample data, that's about 50 MB per average song. On a typical smartphone or tablet, how likely is it that this will affect overall performance? Would I be better off coming up with my own InputStream type that keeps track of where the chunks are in the stream? For files this will result in lots of blocking when fetching PCM samples, but it will still cut down on the overall memory footprint on the system.
I'm not sure I understand all of your question, but I'll answer what I can. Feel free to clarify in the comments, and I'll edit.
Don't keep all the file data in memory for a DAW-type app, or for any file/video player that is expected to play large files. This might work on some devices depending on the memory model, but you are asking for trouble.
Instead, read the required section of the file as needed (i.e. on demand). It's actually a bit more complex than that, because you don't want to read the file in the audio playback thread (you don't want audio playback, which is low-latency, to depend on file I/O, which is high-latency). To get around that, you may have to buffer some of the file in advance. (It depends on whether you are using a callback or a blocking model.)
Using FileInputStream works fine; you'll just have to keep track of where everything is in the file yourself (this involves converting milliseconds or whatever into samples, then into bytes, and taking the size of the header[1] into account). A slightly better option is RandomAccessFile because it allows you to jump around.
My slides from a talk on programming audio software might help, especially if you are confused by callback vs. blocking: http://blog.bjornroche.com/2011/11/slides-from-fundamentals-of-audio.html
[1] or, more correctly, knowing the offset of the audio data in the file.
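A hedged sketch of the milliseconds-to-samples-to-bytes bookkeeping described above, assuming a canonical 44-byte WAV header and 16-bit stereo PCM; in practice you would read the real values from the fmt chunk rather than hard-coding them.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class PcmSeeker {
    // Illustrative values; a real mixer would parse them from the RIFF fmt chunk.
    static final int HEADER_BYTES = 44;      // canonical RIFF/WAVE header size
    static final int SAMPLE_RATE = 44100;
    static final int CHANNELS = 2;
    static final int BYTES_PER_SAMPLE = 2;   // 16-bit PCM

    // milliseconds -> sample index -> byte offset, plus the header offset
    static long offsetForMillis(long millis) {
        long sampleIndex = millis * SAMPLE_RATE / 1000;
        return HEADER_BYTES + sampleIndex * CHANNELS * BYTES_PER_SAMPLE;
    }

    // Jump straight to the PCM we want and read it.
    static byte[] readAt(RandomAccessFile raf, long millis, int numBytes) throws IOException {
        byte[] buf = new byte[numBytes];
        raf.seek(offsetForMillis(millis));
        raf.readFully(buf);
        return buf;
    }
}
```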
I have a Huge data file and I only need specific data from this file, and later on, I will be using these data frequently.
So which of these two methods would be more efficient:
save this data in global variables (maybe a LinkedList) and use it every time I need it
save it in a file, and read the file every time I need the data
I should mention that the data could be a huge amount of integers.
Which of the two ways would give better performance with respect to speed and memory?
If the file I/O overhead is not an issue for you: Save them in a file and create an index file mapping keys to file positions so you do not have to read your huge file.
If the data fits in your RAM and you want to be able to access it quickly, go with the first approach (but maybe without an index file) and read the data into memory at startup or when it is first needed.
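A rough sketch of the index-file suggestion, with the key type and record layout invented for illustration: keep a small in-memory map from key to byte offset, and seek into the big file only when a record is actually requested.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class IndexedFile {
    private final RandomAccessFile data;
    private final Map<String, Long> index = new HashMap<>(); // key -> byte offset

    public IndexedFile(String dataPath) throws IOException {
        data = new RandomAccessFile(dataPath, "r");
        // In practice the index would be loaded from a previously written index file;
        // populating it via addToIndex() here is just for illustration.
    }

    public void addToIndex(String key, long offset) {
        index.put(key, offset);
    }

    // Read one fixed-size record for a key without scanning the whole file.
    public byte[] read(String key, int recordLength) throws IOException {
        Long offset = index.get(key);
        if (offset == null) return null;
        byte[] record = new byte[recordLength];
        data.seek(offset);
        data.readFully(record);
        return record;
    }
}
```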
As long as it fits in memory, working in memory is surely orders of magnitude faster. But do not use LinkedList - it has a huge per-element overhead. And do not use any standard Collection at all, since that means boxing and blows up the memory overhead by a factor of 3 at least.
You could use int[] or a specialized collection for primitive types.
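For instance, a sketch of loading the values into a plain int[] (the file format and record count are assumptions): no boxing, roughly 4 bytes per value instead of the 16 or more an Integer object costs.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class LoadInts {
    // Read 'count' 4-byte big-endian ints from a binary file into a primitive array.
    static int[] load(String path, int count) throws IOException {
        int[] values = new int[count];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            for (int i = 0; i < count; i++) {
                values[i] = in.readInt();
            }
        }
        return values;
    }
}
```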
I'd recommend using a memory-mapped file viewed as a java.nio.IntBuffer. That way the data resides primarily on disk but gets mapped into memory too.
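Something along these lines, for example (the file name is a placeholder, and a single mapping is limited to about 2 GiB): the file's contents are exposed as an IntBuffer without being copied onto the heap.

```java
import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedInts {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("numbers.bin"),
                                               StandardOpenOption.READ)) {
            // Map the whole file (must be under ~2 GiB per mapping) and view it as ints.
            IntBuffer ints = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
            int first = ints.get(0); // random access by index, paged in by the OS on demand
            System.out.println("first int: " + first + ", count: " + ints.limit());
        }
    }
}
```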
Probably the first one.
But there really isn't enough information there to answer you properly.
Firstly, a linked list is fine if you only ever traverse it in order. However, if you need random access (5th element, then 100th, then 12th, then 45th...), it's lousy, and you'd be better off with an ArrayList or something. Secondly, if you're storing lots of ints and you use one of the standard Java collections, each int will be boxed, which may present a performance overhead.
Then you haven't said what 'huge' means. Thousands? Millions?
So, yeah, you need to say what kind of numbers you're dealing with, and what the access patterns are likely to be. And is the 'filtering' step a one-off, or is it done quite frequently?
It depends on the system spec. If you are designing your app for one machine, the task is simple; otherwise you should take into account the memory and/or disk space limits on the client's computer.
I don't think you can compare the performance of these two approaches directly, as each has its own benefits and drawbacks. I'm certain there are algorithms you could investigate further, connected with reading part of a file into memory, or with creating a cache (when you read a number from the file, store it in memory, so the next time you need it, it is already there).
Alright. So I have a very large amount of binary data (let's say, 10GB) distributed over a bunch of files (let's say, 5000) of varying lengths.
I am writing a Java application to process this data, and I wish to institute a good design for the data access. Typically what will happen is such:
One way or another, all the data will be read during the course of processing.
Each file is (typically) read sequentially, requiring only a few kilobytes at a time. However, it is often necessary to have, say, the first few kilobytes of each file simultaneously, or the middle few kilobytes of each file simultaneously, etc.
There are times when the application will want random access to a byte or two here and there.
Currently I am using the RandomAccessFile class to read into byte buffers (and ByteBuffers). My ultimate goal is to encapsulate the data access into some class such that it is fast and I never have to worry about it again. The basic functionality is that I will be asking it to read frames of data from specified files, and I wish to minimize the I/O operations given the considerations above.
Examples for typical access:
Give me the first 10 kilobytes of all my files!
Give me byte 0 through 999 of file F, then give me byte 1 through 1000, then give me 2 through 1001, etc, etc, ...
Give me a megabyte of data from file F starting at such and such byte!
Any suggestions for a good design?
Use Java NIO and MappedByteBuffers, and treat your files as a list of byte arrays. Then, let the OS worry about the details of caching, read, flushing etc.
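A hedged sketch of that approach for the kind of access described in the question, e.g. "give me bytes 0 through 9999 of this file" (the class and method names are mine):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FrameReader {
    // Map a file once; the mapping stays valid after the channel is closed,
    // and the OS pages data in and out as it is touched.
    static MappedByteBuffer mapFile(Path p) throws IOException {
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // "Give me bytes [start, start + length) of this file."
    static byte[] frame(MappedByteBuffer mapped, int start, int length) {
        byte[] out = new byte[length];
        ByteBuffer view = mapped.duplicate(); // independent position, shared contents
        view.position(start);
        view.get(out);
        return out;
    }
}
```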
#Will
Pretty good results. A quick comparison of reading a large binary file:
Test 1 - Basic sequential read with RandomAccessFile: 2656 ms
Test 2 - Basic sequential read with buffering: 47 ms
Test 3 - Basic sequential read with MappedByteBuffers and further frame-buffering optimization: 16 ms
Wow. You are basically implementing a database from scratch. Is there any possibility of importing the data into an actual RDBMS and just using SQL?
If you do it yourself you will eventually want to implement some sort of caching mechanism, so the data you need comes out of RAM if it is there, and you are reading and writing the files in a lower layer.
Of course, this also entails a lot of complex transactional logic to make sure your data stays consistent.
I was going to suggest that you follow up on Eric's database idea and learn how databases manage their buffers—effectively implementing their own virtual memory management.
But as I thought about it more, I concluded that most operating systems already do a better job of file system caching than you are likely to manage without low-level access in Java.
There is one lesson from database buffer management that you might consider, though. Databases use an understanding of the query plan to optimize the management strategy.
In a relational database, it's often best to evict the most-recently-used block from the cache. For example, a "young" block holding a child record in a join won't be looked at again, while the block containing its parent record is still in use even though it's "older".
Operating system file caches, on the other hand, are optimized for reusing recently used data (and for reading ahead of the most recently read data). If your application doesn't fit that pattern, it may be worth managing the cache yourself.
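If you do end up managing it yourself, one low-effort starting point (a generic sketch, not the MRU-style strategy described above) is a LinkedHashMap in access order acting as an LRU block cache; the key format and capacity are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A tiny LRU cache of file blocks, keyed by something like "path#blockNumber".
public class BlockCache extends LinkedHashMap<String, byte[]> {
    private final int maxBlocks;

    public BlockCache(int maxBlocks) {
        super(16, 0.75f, true);    // accessOrder = true gives LRU iteration order
        this.maxBlocks = maxBlocks;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxBlocks; // evict the least recently used block
    }
}
```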
You may want to take a look at an open source, simple object database called jdbm - it has a lot of this kind of thing developed, including ACID capabilities.
I've done a number of contributions to the project, and it would be worth a review of the source code if nothing else to see how we solved many of the same problems you might be working on.
Now, if your data files are not under your control (i.e. you are parsing text files generated by someone else, etc...) then the page-structured type of storage that jdbm uses may not be appropriate for you - but if all of these files are files that you are creating and working with, it may be worth a look.
#Eric
But my queries are going to be much, much simpler than anything I can do with SQL. And wouldn't a database access be much more expensive than a binary data read?
This is to answer the part about minimizing I/O traffic. On the Java side, all you can really do is wrap your readers in BufferedReaders. Aside from that, your operating system will handle other optimizations like keeping recently-read data in the page cache and doing read-ahead on files to speed up sequential reads. There's no point in doing additional buffering in Java (although you'll still need a byte buffer to return the data to the client).
I had someone recommend hadoop (http://hadoop.apache.org) to me just the other day. It looks like it could be pretty nice, and might have some marketplace traction.
I would step back and ask yourself why you are using files as your system of record, and what gains that gives you over using a database. A database certainly gives you the ability to structure your data. Given the SQL standard, it might be more maintainable in the long run.
On the other hand, your file data may not be structured so easily within the constraints of a database. The largest search company in the world :) doesn't use a database for their business processing. See here and here.