I have a requirement wherein large zipped files (sizes in the GBs) arrive in a directory on a Unix server (let's say server1), and I have to write an application that will poll that directory and copy the files to another Unix server (let's say server2) as they come. I have a way of knowing when a single file has been completely copied into the directory (a corresponding metadata file that only appears once the copy operation for that file is complete). Since there are hundreds of files, we don't want to wait for all of them before starting. Once files are copied to server2, I have to unzip them and run some validations before landing them in my final repository.
Questions
What would be the appropriate technology for this scenario in terms of speed: shell scripting, Java, or something else?
Since we will be doing the transfer operation file by file, how do we achieve parallelism (other than multithreading, if we use Java)?
Is there any existing library/package/tool available that fits this scenario? (One possible approach is sketched after this list.)
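A minimal sketch of the polling approach in Java, assuming the sentinel files end in ".meta" and that server2 is reachable via a mounted path (all paths, names, and the pool size here are illustrative assumptions, not part of the original question). Each completed file is handed to a thread pool, so transfers run in parallel while the poller keeps scanning:

import java.io.IOException;
import java.nio.file.*;
import java.util.Set;
import java.util.concurrent.*;

public class InboxMover {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path inbox = Paths.get("/data/inbox");         // directory on server1 (illustrative)
        Path outbox = Paths.get("/mnt/server2/work");  // server2 mount point (illustrative)
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Set<Path> seen = ConcurrentHashMap.newKeySet();

        while (true) {
            // the .meta file only appears once the matching data file is complete
            try (DirectoryStream<Path> metas = Files.newDirectoryStream(inbox, "*.meta")) {
                for (Path meta : metas) {
                    String name = meta.getFileName().toString();
                    Path data = inbox.resolve(name.substring(0, name.length() - ".meta".length()));
                    if (Files.exists(data) && seen.add(data)) {
                        pool.submit(() -> {
                            try {
                                Files.copy(data, outbox.resolve(data.getFileName()),
                                           StandardCopyOption.REPLACE_EXISTING);
                            } catch (IOException e) {
                                seen.remove(data);     // forget it so the next poll retries
                            }
                        });
                    }
                }
            }
            Thread.sleep(5_000);                       // poll interval
        }
    }
}

An alternative to the sleep loop is java.nio.file.WatchService, which delivers create events instead of requiring a scan; for parallelism beyond one JVM, the same pattern works with several worker processes as long as they partition the files between them.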
Related
I have my data on hdfs, the folder structure is something like,
hdfs://ns1/abc/20200101/00/00/
hdfs://ns1/abc/20200101/00/01/
hdfs://ns1/abc/20200101/00/02/
......
Basically, we create folder every minute and put hundreds of files in the folder.
We have a Spark (2.3) application (written in Java) which processes data on a daily basis, so the input path we use is like hdfs://ns1/abc/20200101, simple and straightforward. But sometimes a few files are corrupt or zero size, and this causes the whole Spark job to fail.
So is there a simple way to just ignore any bad file? I have tried --conf spark.sql.files.ignoreCorruptFiles=true, but it doesn't help at all.
Or can we pass some 'file pattern' on the command line when submitting the Spark job, since those bad files usually have a different file extension?
Or, since I'm using JavaSparkContext#newAPIHadoopFile(path, ...) to read data from HDFS, is there any trick I can do with JavaSparkContext#newAPIHadoopFile(path, ...) so that it ignores bad files?
Thanks.
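For what it's worth, here is a minimal sketch of the glob idea, under the assumption (taken from the question itself) that the bad files use a different extension; the ".dat" extension and the app name are placeholders. newAPIHadoopFile accepts glob patterns in the path, so files that don't match are never opened:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GlobInput {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("glob-input"));
        // only files matching the glob are listed; other extensions are never touched
        String input = "hdfs://ns1/abc/20200101/*/*/*.dat";   // ".dat" is a placeholder
        JavaPairRDD<LongWritable, Text> rdd = sc.newAPIHadoopFile(
                input, TextInputFormat.class, LongWritable.class, Text.class,
                new Configuration());
        System.out.println(rdd.count());
        sc.stop();
    }
}

Zero-size files that still carry the good extension would slip through a glob, so a custom org.apache.hadoop.fs.PathFilter registered on the Hadoop Configuration is the heavier but more precise alternative.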
I know that when I write a new file to a folder that ends in ".zip", it compresses the file. This is when using BufferedOutputStream in Java and saving to a Windows file system. I'm saving these files to a network drive, so the write time is dependent on network speed.
Will saving to a .zip folder speed up write time? In other words, does it transfer the data uncompressed and then compresses it (so it wouldn't speed up write time) or does it compress then write out the file? Sorry if this is an ignorant question.
There are so many misconceptions in the Question, I think it is worth going through them one at a time.
I know that when I write a new file to a folder that ends in ".zip" it compresses the file.
That is not correct. Creating a file with a ".zip" suffix does not automatically make it compressed. Writing files to a directory that has ".zip" as its filename suffix (?!?) doesn't either. Not in Java. Not in other languages.
In order to get compression, the application needs to take steps to make it happen. In Java you could use ZipOutputStream to write a file in ZIP format. However, ZIP is actually an "archive" format, designed to hold multiple files. If you are simply trying to compress a single file, there are better alternatives; e.g. GZIPOutputStream.
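For example, a minimal single-file compression sketch with GZIPOutputStream (the file names are illustrative); the bytes are compressed as they pass through the stream:

import java.io.*;
import java.nio.file.*;
import java.util.zip.GZIPOutputStream;

public class GzipOneFile {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("report.bin");        // illustrative input file
        Path out = Paths.get("report.bin.gz");    // compressed output
        try (InputStream src = Files.newInputStream(in);
             OutputStream dst = new GZIPOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(out)))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = src.read(buf)) != -1) {
                dst.write(buf, 0, n);             // compressed on the way out
            }
        }
    }
}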
(It is also possible that this so-called "ZIP folder" you are talking about is a normal ZIP file that has been "mounted" as a loopback file system. You / someone else would have had to set that up explicitly. Anyhow, if this is what is going on here, it is nothing to do with Java. It is all happening in external software and in the operating system where the ZIP is "mounted".)
This is when using BufferedOutputStream in JAVA and saving to a windows file system.
Erm ... no. See above. However, you are correct that it may be better to use a BufferedOutputStream to write files, though it only really helps if your application writes the files in small chunks; e.g. a byte at a time. (Stream compression complicates the issue, so it is difficult to give a simple, general answer on this.)
I'm saving these files to a network drive, so the write time is dependent on network speed.
Correct. It is also dependent on network latency, the protocols used and the load on the remote file server. (If you have a ZIP "mounted", then that is going to add overheads too.)
Will saving to a .zip folder speed up write time?
Maybe. See above. It depends what you mean by a ZIP folder.
Ignoring that, writing the files (the right way) in compressed and / or archive form from Java may speed up writes. There are actually two things to consider:
For plain compression, you are trading off the time it takes the application (!!) to compress and decompress the data against the time (and disk space) you save by moving and storing fewer bytes.
For ZIP files (and similar archive formats) there is a second potential saving. Storing and retrieving lots of individual small files from a file system is slow compared with storing and retrieving a single ZIP file containing those files.
And if you are looking for optimal compression, then ZIP is not the best option.
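To illustrate the second point, a sketch that bundles every regular file in a directory into a single archive with ZipOutputStream (the directory and archive names are illustrative); the file server then sees one large sequential write instead of many small ones:

import java.io.IOException;
import java.nio.file.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipDirectory {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("smallfiles");                  // illustrative source directory
        try (ZipOutputStream zip = new ZipOutputStream(
                     Files.newOutputStream(Paths.get("bundle.zip")));
             DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) continue;       // skip subdirectories etc.
                zip.putNextEntry(new ZipEntry(f.getFileName().toString()));
                Files.copy(f, zip);                          // entry body, compressed by default
                zip.closeEntry();
            }
        }
    }
}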
In other words, does it transfer the data uncompressed and then compresses it (so it wouldn't speed up write time) or does it compress then write out the file?
There are so many variables that it is hard to say for sure. But unless you have done something odd, it is likely that the bytes are sent over the network in compressed form.
Finally, I would advise you NOT to try to combine mounted ZIP files and network shares:
The combination of the two could potentially interact in ways that make performance worse.
There is a risk that you will end up with a corrupted ZIP or lost files if the network share goes offline at an inconvenient point.
Does Java FileChannel.transferTo() work cleverly when files are on a network?
The code is written in Java 1.7
I want to make some major modifications to a binary file on a slow network. To protect against the network connection being lost, instead of writing directly to the file I write to a new file. When I have finished writing the new file, I delete the old file and rename the new file to the old file's name.
My question is: is it better for the new file to be
1. In the same location as the original file
2. Local on the computer
With 1, writing to the file could be slower, but the rename should be quicker; in fact, on most OSes it would be effectively immediate. With 2, writing to the file should be quicker, but renaming the file would then be slower.
I feel the answer is 1.
Actually, if I open a FileChannel to both files and transfer the bytes directly from one channel to the other, do the bytes have to come over the network to my computer and back over the network, or can they be copied directly from one place on the network to the other?
I'm guessing here but the files are probably mounted via some network file system (NFS, SMB) on your computer. So you can access them like local files; they are just slower.
As for the first question: you're not gaining anything by first writing the file locally. In the end, you always have to move the file to the correct place on the network, and that always involves a "copy all bytes" operation. For example, Java's File.renameTo() will fail when the two files aren't on the same hard disk / mount. So you have to manually copy the bytes to the destination folder anyway. Some IO frameworks do that for you when necessary, but it always happens.
As for directly copying data between two remote hosts: There are a few network filesystems which support such operations but it's a special feature. The usual culprits (NFS and SMB) don't. They always download the whole file from the source and then upload it to the target.
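A sketch of option 1 under those constraints, assuming Java 7's java.nio.file API and an NFS/SMB mount (the paths are illustrative): write the new version into the same directory as the original, then swap it in with an atomic move, which remains a cheap rename because source and target sit on the same mount:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.*;

public class SafeRewrite {
    public static void main(String[] args) throws IOException {
        Path target = Paths.get("/mnt/share/data.bin");      // illustrative remote file
        Path temp = target.resolveSibling("data.bin.tmp");   // same directory => same mount

        try (OutputStream out = Files.newOutputStream(temp)) {
            out.write(new byte[] {1, 2, 3});                 // stand-in for the real rewrite
        }
        // atomic only within one file store; on typical Unix mounts this replaces the
        // old file in a single step, and it throws AtomicMoveNotSupportedException
        // rather than silently copying if temp and target were on different mounts
        Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}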
At the moment I am tracing a load of files that come into my system into a directory. The issue is that I am running out of inodes (the number of files I can store). For "replay" reasons (there are other reasons too), I would like the files to stay separate, so I can't just write everything to one file.
I am wondering whether I can replace this with code that writes the files into a ZIP on the fly. However, my concern is: what happens if the JVM crashes during processing for whatever reason? Will I end up with a corrupt ZIP file? Or is there a way to ensure that the ZIP is valid after every "write"?
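A ZipOutputStream only writes the ZIP central directory when it is closed, so a crash mid-stream generally leaves an invalid archive. A hedged sketch of bounding the loss instead of eliminating it (the naming scheme and batch size are illustrative): roll to a fresh archive every N entries, so every archive that has been closed is complete and valid, and at most the one open at crash time is lost:

import java.io.IOException;
import java.nio.file.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class RollingZipWriter implements AutoCloseable {
    private final Path dir;
    private final int entriesPerZip;
    private ZipOutputStream current;
    private int entries = 0;
    private int zipIndex = 0;

    public RollingZipWriter(Path dir, int entriesPerZip) {
        this.dir = dir;
        this.entriesPerZip = entriesPerZip;
    }

    public void write(String name, byte[] data) throws IOException {
        if (current == null) {
            Path zipPath = dir.resolve("trace-" + (zipIndex++) + ".zip");
            current = new ZipOutputStream(Files.newOutputStream(zipPath));
        }
        current.putNextEntry(new ZipEntry(name));  // each traced file is one entry
        current.write(data);
        current.closeEntry();
        if (++entries >= entriesPerZip) {
            current.close();                       // finished archive is now valid on disk
            current = null;
            entries = 0;
        }
    }

    @Override
    public void close() throws IOException {
        if (current != null) current.close();
    }
}

Each closed trace-N.zip also consumes only one inode, which addresses the original problem; tune entriesPerZip to trade inode usage against the amount of data at risk in a crash.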
What would be a practical use for temporary files (see code below)?
File temp = File.createTempFile("temp-file-name", ".tmp");
Why can't you store the data you would keep in the file in some variables instead? If the file is (probably) going to be deleted on program exit (as "temp" implies), why even create it?
An example is when downloading a file: it often appears as a temporary file while the download completes.
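That download pattern can be sketched like this (the URL and file names are illustrative): stream into a temp file created in the destination directory, and only rename it to its final name once the stream completes, so readers never observe a half-written file:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class DownloadToTemp {
    public static void main(String[] args) throws IOException {
        File dir = new File("downloads");            // illustrative target directory
        dir.mkdirs();
        // temp file lives in the same directory, so the final rename is cheap
        File temp = File.createTempFile("download-", ".part", dir);
        try (InputStream in = new URL("https://example.com/file.bin").openStream()) {
            Files.copy(in, temp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
        if (!temp.renameTo(new File(dir, "file.bin"))) {
            throw new IOException("could not rename " + temp);
        }
    }
}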
The two reasons I know of:
As storage space for large chunks of memory you don't need at the moment, when doing memory-intensive tasks like video editing
A kind of hacky way of doing interprocess communication
Aside from the RAM-versus-disk point above: you may use temp files as precursor files, i.e. files about to be processed or served. For example, a server may generate a large PDF for a browser. That PDF would be stored as a temp file while the (possibly slow) browser downloads it. Once the transfer is complete, the temp file can be destroyed.
For our little 'imagefilesystem' project (http://code.google.com/p/imagefilesystem/) we actually use the /tmp directory to store the thumbnails we create based upon the images in the local file system. The thumbs are created on demand and are, as the name /tmp itself suggests, temporary in nature, so we don't end up with GBs of permanent data.