I have a use case where I might be writing a 100 G file to my new IGFS store. I want to start reading the beginning of the file before the end of the file has finished writing, as writing 100G could take a minute or two.
Since I can't speed up my hardware, I would like to speed up the software by beginning to read the file before I've closed my write stream. I have several GB written out, so there is plenty of data to start reading. When I write a simple test for this case, though, I get an exception thrown because IGFS doesn't seem to let me read from a stream when I am still writing to it. Not unreasonable... except that I know under the hood that the first segments of the file are written and done with.
Does anyone know how I might get around this? I suppose I could write a bunch of code to break files into 500M segments or something, but I am hoping that will be unnecessary.
Instead of using Ignite in the IGFS mode, deploy it in the standard configuration - as separate memory-centric storage with native persistence enabled. Let Ignite store a subset of the data you have in Hadoop that is used by the operations that need to be accelerated. This configuration allows using all the Ignite APIs, including the Spark integration.
I have looked at examples describing best practices for file write/create operations, but I have not seen an example that takes my requirements into consideration. I have to create a class which reads the contents of one file, does some data transformation, writes the transformed contents to a different file, and then sends that file to a web service. Both files can ultimately be quite large, up to 20 MB, and it is unpredictable when these files will be created because they are generated by the user. There could be two minutes between occurrences of this process, or several could happen in the same second. The system is not crazy in the sense that there could be hundreds of these operations in the same second, but there could be several.
My instinct says to solve it by:
Creating a separate thread when the process begins.
Read the first file.
Do the data transformation.
Write the contents to the new file.
Send the file to the service.
Delete the created file.
Am I missing something? Is there a best practice to tackle this kind of issue?
The first question you should ask is whether you need to write the file to disk in the first place. Even if you are supposed to send a file to a consumer at the end of your processing phase, you could keep the file contents in memory and send them. The consumer doesn't care whether the file is stored on disk or not, since it only receives an array of bytes with the file contents.
The only scenario in which it would make sense to store the file on disk is if your processes communicate via disk files (i.e. your producer writes a file to disk, sends a notification to your consumer, and your consumer afterwards reads the file from disk, for example based on a file name it receives in the notification).
Regarding I/O best practices, make sure you use buffers to read (and potentially write) files. This could greatly reduce the memory overhead (since you would end up keeping only a chunk instead of the whole 20 MB file in memory at a given moment).
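For example, a buffered line-by-line transform keeps only one line in memory at a time. This is just a sketch; the class name is mine and the upper-casing step stands in for your real transformation:

```java
import java.io.*;

class BufferedTransform {
    // Streams the input line by line through a buffer, applies a transformation,
    // and writes the result, so only one line is held in memory at a time
    // instead of the whole 20 MB file.
    static void transform(Reader in, Writer out) throws IOException {
        try (BufferedReader reader = new BufferedReader(in);
             BufferedWriter writer = new BufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.toUpperCase()); // placeholder transformation
                writer.write('\n');
            }
        }
    }
}
```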
Regarding adding multiple threads, you should test whether that actually improves your application's performance. If your application is already I/O intensive, adding more threads will add even more contention on your I/O streams, which would result in performance degradation.
Without the full details of the situation, a problem like this may be better solved with existing software such as Apache NiFi:
An easy to use, powerful, and reliable system to process and distribute data.
It's very good at picking up files, transforming them, and putting them somewhere else (and sending emails, and generating analytics, and...). NiFi is a very powerful tool, but it may be overkill if your needs are just a couple of files, given the additional set-up.
Given the description you have given, I think you should perform the operations for each file on one thread; i.e. one thread will download the file, process it and then upload the results.
If you need parallelism, then implement the download / process / upload as a Runnable and submit the tasks to an ExecutorService with a bounded thread pool. And tune the size of the thread pool. (That's easy if you expose the thread pool size as a config property.)
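A minimal sketch of that structure, with the pool size exposed as a constructor parameter (in a real application it would come from a config property). The class name is hypothetical, and the download/process/upload body is left to the caller as a Runnable:

```java
import java.util.concurrent.*;

class FilePipeline {
    private final ExecutorService pool;

    // Bounded pool: tune the size via configuration rather than hard-coding it.
    FilePipeline(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // One task handles a single file end to end: download, process, upload.
    Future<?> submit(Runnable downloadProcessUpload) {
        return pool.submit(downloadProcessUpload);
    }

    // Stops accepting tasks and waits for in-flight tasks to finish.
    void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```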
Why this way?
It is simple. Minimal synchronization is required.
One of the three subtasks is likely to be your performance bottleneck. So by combining all three into a single task, you avoid the situation where the non-bottleneck tasks get too far ahead. And if you get too far ahead on some of the subtasks you risk running out of (local) disk space.
I'm going to contradict what Alex Rolea said about buffering. Yes, it may help. But on a modern (e.g. Linux) operating system on a typical modern machine, memory <-> disk I/O is unlikely to be the main bottleneck. It is more likely that the bottleneck will be network I/O or server-side I/O performance (especially if the server is serving other clients at the same time).
So, I would not prematurely tune the buffering. Get the system working, benchmark it, profile / analyze it, and based on those results figure out where the real bottlenecks are and how best to address them.
Part of the solution may be to not use disk at all. (I know you think you need to, but unless your server and its protocols are really strange, you should be able to stream the data to the server out of memory on the client side.)
The story:
A few days ago I was thinking about inter-process communication based on file exchange. Say process A creates several files during its work and process B reads these files afterwards. To ensure that all files were correctly written, it would be convenient to create a special file whose existence signals that all operations were done.
Simple workflow:
process A creates file "file1.txt"
process A creates file "file2.txt"
process A creates file "processA.ready"
Process B is waiting until file "processA.ready" appears and then reads file1 and file2.
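In code, the convention could look roughly like this (class and method names are illustrative; file names taken from the workflow above):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

class ReadyFileDemo {
    // Process A: write the data files first, create the marker file last.
    static void writeAll(Path dir) throws Exception {
        Files.write(dir.resolve("file1.txt"), "data1".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("file2.txt"), "data2".getBytes(StandardCharsets.UTF_8));
        Files.createFile(dir.resolve("processA.ready")); // signals "all files written"
    }

    // Process B: only read file1 and file2 once the marker exists.
    static boolean ready(Path dir) {
        return Files.exists(dir.resolve("processA.ready"));
    }
}
```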
Doubts:
File operations are performed by the operating system, specifically by its file subsystem. Since implementations can differ between Unix, Windows and macOS, I'm uncertain about the reliability of file-exchange inter-process communication. Even if the OS guarantees this consistency, there are things like the JIT compiler in Java, which can reorder program instructions.
Questions:
1. Are there any real specifications on file operations in operating systems?
2. Is JIT really allowed to reorder file operation program instructions for a single program thread?
3. Is file exchange still a relevant option for inter-process communication nowadays, or is it unconditionally better to choose TCP/HTTP/etc.?
You don't need to know OS details in this case. The Java I/O API is documented well enough to tell whether a file was saved or not.
The JVM can't reorder native calls. This is not written in the JMM explicitly, but it is implied that it can't: the JVM cannot know what impact a native call has, so reordering such calls could be quite dangerous.
There are some disadvantages of using files as a way of communication:
It uses IO which is slow
It is difficult to split the processes between different machines in case you need to (there are ways, using Samba for example, but they are quite platform-dependent)
You could use File watcher (WatchService) in Java to receive a signal when your .ready file appears.
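A sketch of that idea, blocking until a file with the ".ready" extension appears (class name is mine; note that files created before the directory is registered are not reported, so real code should also check for an existing marker first):

```java
import java.nio.file.*;

class ReadyWatcher {
    // Blocks until a file whose name ends in ".ready" is created in dir.
    static Path awaitReadyFile(Path dir) throws Exception {
        try (WatchService ws = dir.getFileSystem().newWatchService()) {
            dir.register(ws, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = ws.take(); // blocks until at least one event arrives
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path name = (Path) event.context();
                    if (name.toString().endsWith(".ready")) {
                        return dir.resolve(name);
                    }
                }
                key.reset(); // re-arm the key to receive further events
            }
        }
    }
}
```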
Reordering could apply, but it shouldn't hurt your application logic in this case; see the following link:
https://assylias.wordpress.com/2013/02/01/java-memory-model-and-reordering/
I don't know the size of your data, but I feel it would still be better to use a message queue (MQ) solution in this case. File I/O is a relatively slow operation which could slow down the system.
I used a file-exchange-based approach on one of my projects. It's based on renaming file extensions when a process is done, so another process can retrieve the file by checking a file-name expression.
The FTP process downloads a file and appends '.downloaded' to its name.
The main task processor searches the directory for '*.downloaded' files.
Before starting, the job renames the file to '.processing'.
When finished, it renames it to '.done'.
In case of error, it creates a new supplementary file with an '.error' extension and puts the last processed line and the exception trace there. On retries, if this file exists, the job reads it and resumes from the correct position.
A locator process searches for '.done' files and, according to its config, moves them to a backup folder or deletes them.
This approach is working fine with a huge load in a mobile operator network.
One consideration point: using unique file names is important, because the behaviour of moving a file changes according to the operating system.
E.g. Windows gives an error when the same file already exists at the destination, whereas Unix overwrites it.
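A hedged sketch of one state transition; the class and method names are mine, and passing REPLACE_EXISTING to Files.move makes the overwrite behaviour explicit instead of OS-dependent:

```java
import java.nio.file.*;

class ExtensionStates {
    // Advances a file to the next processing state by renaming its extension,
    // e.g. "job1.downloaded" -> "job1.processing". REPLACE_EXISTING avoids the
    // OS-dependent behaviour: a plain rename fails on Windows if the target
    // already exists, while Unix silently overwrites it.
    static Path advance(Path file, String fromExt, String toExt) throws Exception {
        String name = file.getFileName().toString();
        if (!name.endsWith(fromExt)) {
            throw new IllegalStateException("unexpected state: " + name);
        }
        Path target = file.resolveSibling(
                name.substring(0, name.length() - fromExt.length()) + toExt);
        return Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```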
I am working on an application which has to read and process ~29K files (~500GB) everyday. The files will be in zipped format and available on a ftp.
What I have done: I plan to download the files from the FTP, unzip them and process them using multi-threading, which has reduced the processing time significantly (when the number of active threads is fixed to a small number). I've written some code and tested it for ~3.5K files (~32GB). Details here: https://stackoverflow.com/a/32247100/3737258
However, the estimated processing time, for ~29K files, still seems to be very high.
What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files, ~500GB, to 3-4 hours.
Please note that each file has to be read line by line, and each line has to be written to a new file with some modification (some information removed and some new information added).
You should profile your application and see where the current bottleneck is, and fix that. Proceed until you are at your desired speed or cannot optimize further.
For example:
Maybe you unzip to disk. This is slow; do it in memory instead.
Maybe there is a lot of garbage collection. See if you can reuse objects.
Maybe the network is the bottleneck.. etc.
You can, for example, use visualvm.
It's hard to provide you one solution for your issue, since it might be that you simply reached the hardware limit.
Some Ideas:
You can parallelize the processing of the information you read: hand several read lines to one thread (out of a pool), which processes them sequentially.
Use java.nio instead of java.io see: Java NIO FileChannel versus FileOutputstream performance / usefulness
Use a profiler
Instead of the profiler, simply write log messages and measure the duration in multiple parts of your application
Optimize the hardware (use SSD drives, experiment with block size, filesystem, etc.)
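As a concrete example of the java.nio suggestion, FileChannel.transferTo can let the operating system move bytes directly between files without looping through a user-space buffer. A minimal sketch (the class name is mine):

```java
import java.io.*;
import java.nio.channels.FileChannel;

class NioCopy {
    // Copies src to dst with FileChannel.transferTo; the loop is needed because
    // transferTo may move fewer bytes than requested in a single call.
    static void copy(File src, File dst) throws IOException {
        try (FileChannel in = new FileInputStream(src).getChannel();
             FileChannel out = new FileOutputStream(dst).getChannel()) {
            long pos = 0, size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
    }
}
```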
If you are interested in parallel computing then please try Apache Spark; it is meant to do exactly what you are looking for.
There is a file - stored on an external server which is updated very frequently by a vendor. My application polls this file every minute getting the values out. All I am doing is reading the file.
I am worried that by doing this I could inadvertently lock the file so that it can't be written to by the vendor. Is this a possibility?
Kind regards
Further to Eric's answer - you could check the Last Modified Property of the temp file and only merge it with your 'working' file when it changes - that should protect you from read/write conflicts and only merge files just after the vendor has written to the temp. Though this is messy and mrab's comment is valid, a better solution should be found.
I have faced this problem several times, and as Peter Lawrey says there isn't any portable way to do this. If your environment is Unix this should not be an issue at all, as these concurrent access conditions are properly managed by the operating system. However, Windows does not handle this at all (yes, that's the consequence of using an amateur OS for production work, lol).
That said, there is a way to solve this if your vendor is flexible enough. They could write to a temp file and, when finished, move the temp file to the final destination. By doing this you avoid any concurrent access to the file between you and the vendor.
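If the vendor can adopt that convention, the writer side is a couple of lines with java.nio (names here are illustrative; the temp and final paths must be on the same filesystem for the move to be atomic):

```java
import java.nio.file.*;

class SafePublish {
    // Writer side: produce the data under a temporary name, then move it to
    // the final name in one atomic step. A reader polling finalPath never
    // observes a half-written file.
    static void publish(Path temp, byte[] contents, Path finalPath) throws Exception {
        Files.write(temp, contents);
        Files.move(temp, finalPath, StandardCopyOption.ATOMIC_MOVE);
    }
}
```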
Another way is to know precisely (difficult?) the timing of your vendor's updates and avoid reading the file during certain time frames. For instance, if your vendor updates the file every hour, avoid reading from five-to-the-hour to five-past-the-hour.
Hope it helps.
There is the Windows Shadow Copy service for volumes. This would allow reading from the backup copy.
If the third-party software is in Java too, and uses a logger, that should be tweakable: e.g. rotating every minute across 10 files or so.
I would try to relentlessly read the file (when modified since last read), till something goes wrong. Maybe you can make a test run with hundreds of reads in the weekend or at midnight, when no harm is done.
My answer:
Maybe you need a local watch program, a watch service for a directory, that waits till the file is modified and then makes a fast copy; after that, allowing the copy to be transmitted.
I am currently writing a Java application that receives data from various sensors. How often this happens varies, but I believe that my application will receive signals about 100k times per day. I would like to log the data received from a sensor every time the application receives a signal. Because the application does much more than just log sensor data, performance is an issue. I am looking for the best and fastest way to log the data. Thus, I might not use a database, but rather write to a file and keep 1 file per day.
So what is faster: using a database or logging to files? No doubt there are also a lot of options for which logging software to use. Which is best for my purpose, if logging to a file is the best option?
The data stored might be used later for analytical purposes, so please keep this in mind as well.
I would recommend you first of all to use log4j (or any other logging framework).
You can use a jdbc appender that writes into the db or any kind of file appender that writes into the file. The point is that your code will be generic enough to be changed later if you like...
In general files are much faster than db access, but there is a place for optimizations here.
If the performance is critical, you can use batching/asynchronous calls to the logging infrastructure.
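A minimal sketch of the asynchronous idea; a real logging framework's async appender does this for you, and the "STOP" sentinel here is just a simplification for the example (class name is mine):

```java
import java.io.*;
import java.util.concurrent.*;

class AsyncLogger {
    private static final String STOP = "__STOP__"; // sentinel, a simplification
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    AsyncLogger(Writer out) {
        // A single worker thread drains the queue and does the actual I/O.
        worker = new Thread(() -> {
            try (BufferedWriter w = new BufferedWriter(out)) {
                String line;
                while (!(line = queue.take()).equals(STOP)) {
                    w.write(line);
                    w.write('\n');
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        worker.start();
    }

    // Callers return immediately; no disk I/O happens on the signal path.
    void log(String message) {
        queue.offer(message);
    }

    // Flushes remaining messages and stops the worker.
    void close() throws InterruptedException {
        queue.offer(STOP);
        worker.join();
    }
}
```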
A free database on a cheap PC should be able to record 10 records per second easily.
A tuned database on a good system or a logger on a cheap PC should be able to write 100 records/lines per second easily.
A tuned logger should be able to write 1000 lines per second easily.
A fast binary logger can perform 1 million records per second easily (depending on the size of the record)
Your requirement works out to about 1.2 records per second on average (100k per day), which you should be able to achieve any way you like. I assume you want to be able to query your data, so you will want it in a database eventually; I would put it there.
Ah, the world of embedded systems. I had a similar problem when working with a hovercraft. I solved it with a separate computer (you can do this with a separate program) on the local area network that would just sit and listen as a server for the logs I sent to it. The file-writer program was written in C++. This solves two of your problems. First is the obvious performance gain while writing the logs. Second, the Java program is freed of writing any logs at all (it only acts as a proxy) and can concentrate on performance-critical tasks. Using a DB for this would be overkill, unless you're using SQLite.
Good luck!