I am writing code to transfer files to hadoop hdfs parallel. So I have many threads calling filesystem.copyFromLocalFile.
I think the cost of opening a filesystem is not small, so I just have one filesystem opened in my project. So I though there might be a a problem when so many threads calling it at the same time. But so far, it works fine with no problem.
Could anyone please give me some information about this copy method?
Thank you very much& have a great weekend.
I see the following design points to consider:
a) Where will be bottleneck of the process? I think in 2-3 parallel copy operations local disk or 1GB Ethernet will became a bottleneck. You can do it in form of multithreaded application or you can run a few processes. In any case I do not think you need a high level of parallelism.
b) Error handling. Failure of the one thread should not stop the whole process, and, in the same time file should not be lost. What I am usually doing in such cases is to agree that in a worst case file can be copied twice. If it is Ok - system can work in simple "copy then delete" scenario.
c) If you copy from the one of the cluster nodes - HDFS will became unbalanced, since one replica will be stored on the host from where you copy. You will need to do the balance constantly.
Can you tell me what more information you want about copyFromLocalFile()?
I'm not sure but I guess in your case, threads share the same resource among themselves. Since, you have only one instance of FileSystem, each thead will probably share this object in a time sharing basis among themselves.
Related
I have looked at examples that tell best practices for file write/create operations but have not seen an example that takes into consideration my requirements. I have to create a class which reads the contents of 1 file, does some data transformation, and then write the transformed contents to a different file then sends the file to a web service. Both files ultimately can be quite large like up to 20 MB and also it is unpredictable when these files will be created because they are generated by the user. Therefore it could be like 2 minutes between the time when this process occurs or it could be several all in the same second. The system is not like crazy in the sense that it could be like hundreds of these operations in the same second but it could be several.
My instinct says to solve it by:
Creating a separate thread when the process begins.
Read the first file.
Do the data transformation.
Write the contents to the new file.
Send the file to the service.
Delete the created file.
Am I missing something? Is there a best practice to tackle this kind of issue?
The first question you should ask is weather you need to write the file to the disk in the first place. Even if you are supposed to send a file to a consumer at the end of your processing phase, you could keep the file contents in memory and send that. The consumer doesn't care weather the file is stored on disk or not, since it only receives an array of bytes with the file contents.
The only scenario in which it would make sense to store the file on disk would be if you would communicate between your processes via disk files (i.e. your producer writes a file to disk, sends some notification to your consumer and afterwards your consumer reads the file from disk - for example based on a file name it receives from the notification).
Regarding I/O best practices, make sure you use buffers to read (and potentially write) files. This could greatly reduce the memory overhead (since you would end up keeping only a chunk instead of the whole 20 MB file in memory at a given moment).
Regarding adding multiple threads, you should test weather that improves your application performance or not. If your application is already I/O intensive, adding multiple threads will result in adding even more contention on your I/O streams, which would result in a performance degradation.
Without the full details of the situation, a problem like this may be better solved with existing software such as Apache NiFi:
An easy to use, powerful, and reliable system to process and distribute data.
It's very good at picking up files, transforming them, and putting them somewhere else (and sending emails, and generating analytics, and...). NiFi is a very powerful tool, but may be overkill if you're needs are just a couple of files given the additional set-up.
Given the description you have given, I think you should perform the operations for each file on one thread; i.e. on thread will download the file, process it and then upload the results.
If you need parallelism, then implement the download / process / upload as a Runnable and submit the tasks to an ExecutorService with a bounded thread pool. And tune the size of the thread pool. (That's easy if you expose the thread pool size as a config property.)
Why this way?
It is simple. Minimal synchronization is required.
One of the three subtasks is likely to be your performance bottleneck. So by combining all three into a single task, you avoid the situation where the non-bottleneck tasks get too far ahead. And if you get too far ahead on some of the subtasks you risk running out of (local) disk space.
I'm going to contradict what Alex Rolea said about buffering. Yes, it may help. But in on a modern (e.g. Linux) operating system on a typical modern machine, memory <-> disk I/O is unlikely to be the main bottleneck. It is more likely that the bottleneck will be network I/O or server-side I/O performance (especially if the server is serving other clients at the same time.)
So, I would not prematurely tune the buffering. Get the system working, benchmark it, profile / analyze it, and based on those results figure out where the real bottlenecks are and how best to address them.
Part of the solution may be to not use disk at all. (I know you think you need to, but unless your server and its protocols are really strange, you should be able to stream the data to the server out of memory on the client side.)
I have two Java programs that are scheduled to run one after the other. When they run in sequence without an overlap, there are no issues. But sometimes one of them gets prolonged a little while more due to the volume it is processing and the second one starts before the first one ends. Now, when this happens at a point when the second one is half way through completion it crashes giving the Max no of files open exception. The first one finishes successfully though. When run separately, with the same volume, there are no issues with either of them. Both processes are completely independent of each other - no common resources, invoked from different scripts and are finally 2 different processes running on the same OS on 2 different JVM's - I use a HP-UNIX system and have tried to trace the open handles using the TUSC utility but there aren't any that could cause such a problem. Xmx is 2Gigs for both and I doubt that would be reached - but is there an explanation to this that I am not seeing? Can the parallel runs be an issue or is it just a coincidence?
The solution to your problem could be either to up your file descriptor limit (see link below) or to make sure that your code is properly closing resources (file input streams, sockets) so you are not leaking open file descriptors. I'll leave it up to you which approach you take.
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02261152/c02261152.pdf
Edit
If your program really is generating that many files at a time, I would certainly look into upping the open file descriptor limit. Then, you might also consider putting in a throttle/regulator into the code. Create X number of files and then back off for a few seconds, let the system reap the resources back and then continue on again. This might be helpful.
Also, this link sounds very similar to your issue:
Apache FOP 1.0 Multithreading - Too many open files err24
In a similar scenario and with resource constraints, we have used the below architecture.
There will be a monitor thread which will be started first which will have a count variable. There will be a configurable limit till which new processes will be created. For every new process, the count variable will be incremented. Once the process is completed, the count variable will be decremented. New processes will be created only if the count is lesser than the configured limit.
Above approach helped us to gain better control and able to scale up wherever it was possible.
There is a file - stored on an external server which is updated very frequently by a vendor. My application polls this file every minute getting the values out. All I am doing is reading the file.
I am worried that by doing this I could inadvertently lock the file so it cant be written too by the vendor. Is this a possibility?
Kind regards
Further to Eric's answer - you could check the Last Modified Property of the temp file and only merge it with your 'working' file when it changes - that should protect you from read/write conflicts and only merge files just after the vendor has written to the temp. Though this is messy and mrab's comment is valid, a better solution should be found.
I have faced this problem several times, and as Peter Lawrey says there isn't any portable way to do this, and if your environment is Unix this should not be an issue at all as these concurrent access conditions are properly managed by the operating systems. However windows do not handle this at all (yes, that's the consequence of using an amateur OS for production work, lol).
Now that's said, there is a way to solve this if your vendor is flexible enough. They could write to a temp file and when finished move the temp file to the final destination. By doing this you avoid any concurrent access to the file between you and the vendor.
Another way is to precisely (difficult?) know the timing of your vendors update and avoid reading the file during some time frames. For instance if your vendor update the file every hour, avoid reading from five-to-the-hour to five-past-the-hour.
Hope it helps.
There is the Windows Shadow Copy service for volumes. This would allow to read the backup copy.
If the third party software is in java too, and uses a Logger, that should be tweakable: every minute writing to the next from 10 files or so.
I would try to relentlessly read the file (when modified since last read), till something goes wrong. Maybe you can make a test run with hundreds of reads in the weekend or at midnight, when no harm is done.
My answer:
Maybe you need a local watch program, a watch service for a directoryr, that waits till the file is modified, and then makes a fast cooy; after that allowing the copy to be transmitted.
We have an interface with an external system in which we get flat files from them and process those files. At present we run a job a few times a day that checks if the file is at the ftp location and then processes if it exists.
I recently read that it is a bad idea to make use of file systems as a message broker which is why I am putting in this question. Can someone clarify if a situation like this one is a right fitment for the use of some other tool and if so which one?
Ours is a java based application.
The first question you should ask is "is it working?".
If the answer to that is yes, then you should be circumspect about change just because you read it was a bad idea. I've read that chocolate may be bad for you but I'm not giving it up :-)
There are potential problems that you can run into, such as files being deleted without your knowledge, or trying to process files that are only half-transferred (though there are ways to mitigate both of those, such as permissions in the former case, or the use of sentinel files or content checking in the latter case).
Myself, I would prefer a message queueing system such as IBM's MQ or JMS (since that's what they're built for, and they do make life a little easier) but, as per the second paragraph above, only if either:
problems appear or become evident with the current solution; or
you have some spare time and money lying around for unnecessary rework.
The last bullet needs expansion. While the work may be unnecessary (in terms of fixing a non-existent problem), that doesn't necessarily make it useless, especially if it can improve performance or security, or reduce the maintenance effort.
I would use a database to synchronize your files. Have a database that points to the file locations. Put an entry into the database only when the files have been fully transferred. This would ensure that you are picking up completed files. You can poll the database to check if new entries are present instead of polling the file system. A very easy simple set up for a polling mechanism. If you would like to be told when a new file appears on the folder, then you would need to go in for a Message Queue.
I have a program that starts up and creates an in-memory data model and then creates a (command-line-specified) number of threads to run several string checking algorithms against an input set and that data model. The work is divided amongst the threads along the input set of strings, and then each thread iterates the same in-memory data model instance (which is never updated again, so there are no synchronization issues).
I'm running this on a Windows 2003 64-bit server with 2 quadcore processors, and from looking at Windows task Manager they aren't being maxed-out, (nor are they looking like they are being particularly taxed) when I run with 10 threads. Is this normal behaviour?
It appears that 7 threads all complete a similar amount of work in a similar amount of time, so would you recommend running with 7 threads instead?
Should I run it with more threads?...Although I assume this could be detrimental as the JVM will do more context switching between the threads.
Alternatively, should I run it with fewer threads?
Alternatively, what would be the best tool I could use to measure this?...Would a profiling tool help me out here - indeed, is one of the several profilers better at detecting bottlenecks (assuming I have one here) than the rest?
Note, the server is also running SQL Server 2005 (this may or may not be relevant), but nothing much is happening on that database when I am running my program.
Note also, the threads are only doing string matching, they aren't doing any I/O or database work or anything else they may need to wait on.
My guess would be that your app is bottlenecked on memory access, i.e. your CPU cores spend most of the time waiting for data to be read from main memory. I'm not sure how well profilers can diagnose this kind of problem (the profiler itself could influence the behaviour considerably). You could verify the guess by having your code repeat the operations it does many times on a very small data set.
If this guess is correct, the only thing you can do (other than getting a server with more memory bandwidth) is to try and increase the locality of your memory access to make better use of caches; but depending on the details of the application that may not be possible. Using more threads may in fact lead to worse performance because of cores sharing cache memory.
Without seeing the actual code, it's hard to give proper advice. But do make sure that the threads aren't locking on shared resources, since that would naturally prevent them all from working as efficiently as possible. Also, when you say they aren't doing any io, are they not reading an input or writing an output either? this could also be a bottleneck.
With regards to cpu intensive threads, it is normally not beneficial to run more threads than you have actual cores, but in an uncontrolled environment like this with other big apps running at the same time, you are probably better off simply testing your way to the optimal number of threads.