I have two machines with different Java applications that both run on Linux and use a common Windows share folder. One app triggers the other to generate a specific file (e.g. an image or a PDF), and then the first app tries to upload the generated file to S3. The problem is that I sometimes get this:
com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified did not match what we received.
OR this:
com.amazonaws.AmazonClientException: Data read has a different length than the expected: dataLength=247898; expectedLength=262062; includeSkipped=false; in.getClass()=class com.amazonaws.internal.ResettableInputStream; markedSupported=true; marked=0; resetSinceLastMarked=false; markCount=1; resetCount=0
All the processes happen synchronously, one after another (I have also checked the logs, which show no concurrent activity). Also, I am not setting the MD5 hash or the content length myself; the AWS SDK handles that on its own.
So my guess is that the generating application has written the file and returned, but the OS is in fact still flushing it to disk in the background, and that is why the first app picks up an incomplete file.
I would really appreciate suggestions on how to handle such situations. Maybe there is a way to detect whether the file is still being modified by the OS?
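One way to approximate that check from the reading side is to poll the file and wait until its size and modification time stop changing before starting the upload. This is only a heuristic sketch under assumptions not in the question: the quiet period and timeout are made up, and modification-time granularity on network shares can be coarse.

import java.io.File;

public class FileStabilityCheck {

    // Poll until the file's size and mtime have been unchanged for quietMillis.
    // A long enough quiet period is taken to mean the writer has finished.
    public static void waitUntilStable(File file, long quietMillis, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long lastLength = -1L;
        long lastModified = -1L;
        long stableSince = System.currentTimeMillis();
        while (System.currentTimeMillis() < deadline) {
            long length = file.length();
            long modified = file.lastModified();
            if (length != lastLength || modified != lastModified) {
                // Still changing: remember the new snapshot and keep waiting.
                lastLength = length;
                lastModified = modified;
                stableSince = System.currentTimeMillis();
            } else if (System.currentTimeMillis() - stableSince >= quietMillis) {
                return; // stable long enough; safe to hand the file to the upload
            }
            Thread.sleep(200);
        }
        throw new IllegalStateException("File never stabilized: " + file);
    }
}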
I was experiencing AmazonS3Exception: The Content-MD5 you specified did not match what we received. I finally solved it by addressing the first item on the list below; it was not terribly obvious.
Possible Solutions For Anyone Else:
Make sure not to use the same ObjectMetadata object across multiple putObject calls (see the sketch after this list).
Consider disabling chunked encoding: client.setS3ClientOptions(S3ClientOptions.builder().disableChunkedEncoding().build())
Make sure the file isn't being edited while it's being uploaded.
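To illustrate the first item: construct a fresh ObjectMetadata for every putObject call. The SDK can populate fields such as the content length and Content-MD5 on the metadata object while handling the request, so a reused instance may carry stale values into the next upload. A minimal sketch; the bucket name and uploader class are placeholders, not from the original posts.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import java.io.File;
import java.util.List;

public class FreshMetadataUploader {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    public void uploadAll(String bucket, List<File> files) {
        for (File f : files) {
            // A brand-new ObjectMetadata per request; reusing one instance can
            // carry a stale Content-MD5 from a previous call and trigger the
            // "Content-MD5 ... did not match" error.
            ObjectMetadata metadata = new ObjectMetadata();
            s3.putObject(new PutObjectRequest(bucket, f.getName(), f)
                    .withMetadata(metadata));
        }
    }
}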
Related
I'm using Java to parse XML files that arrive over FTP. The problem is that a file I pick up may still be being copied or modified by the FTP server, so I need a way to check whether the file has been completely written.
I've tried the File::canWrite method (which did not work at all) and looking for the closing tag of the XML file, but neither works reliably in every case. The File::renameTo trick is pretty slow and doesn't look decent, and although it mostly works, it doesn't cover every case either. Is there any good and fast way to check whether a file has been completely copied?
Thanks a lot!
Short answer: no. The best practice is for the writer to create the file under a temporary name, for example somefile.part, and rename it when done; the writing program needs to do that. The workaround when you don't control the writing application is to check the modification time and ensure that some reasonable amount of time, perhaps a minute, has passed since the most recent change. Then you assume that the file is complete.
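If you do control the writer, the write-then-rename pattern looks roughly like this. A sketch under the assumption that the filesystem supports an atomic move (some network filesystems do not, in which case a plain Files.move plus the naming convention still helps):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicWriter {
    // Write everything to a ".part" file, then rename it, so readers never
    // observe a half-written file under its final name.
    public static void writeAtomically(Path target, byte[] content) throws IOException {
        Path part = target.resolveSibling(target.getFileName() + ".part");
        try (OutputStream out = Files.newOutputStream(part)) {
            out.write(content);
        }
        Files.move(part, target, StandardCopyOption.ATOMIC_MOVE);
    }
}

Readers then simply ignore anything ending in .part.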
I'm planning on using Amazon S3 to store millions of relatively small files (~100 kB-2 MB). To save on upload time I structured them into directories (tens/hundreds of files per directory) and decided to use TransferManager's uploadDirectory/uploadFileList. However, after an individual file has been uploaded I need to perform specific operations on my HDD and DB. Is there any way (preferably implementing observers/listeners) to be notified whenever a specific file has finished uploading, or am I cursed with only being able to verify whether the entire MultipleFileUpload succeeded?
For whatever it's worth, I'm using the Java SDK; however, I should be able to adapt a .NET/REST solution to my needs.
This isn't exactly what you asked, but it's pretty sweet and seems like an appropriate solution...
S3 does have notifications you can configure to alert you when an object has been created or deleted (or if a reduced redundancy object was lost). These can go to SQS, SNS, or Lambda (which could potentially even run the code that updates the database), and of course if you send them to SNS you can then fan them out to multiple destinations.
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#notification-how-to-event-types-and-destinations
Don't make the mistake, however, of selecting only the upload subtype you assume is being used; use the s3:ObjectCreated:* event unless you have a specific reason not to.
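For reference, turning on such a notification from the Java SDK looks roughly like this. The bucket name and SNS topic ARN are placeholder assumptions, and the topic's access policy must already allow S3 to publish to it.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketNotificationConfiguration;
import com.amazonaws.services.s3.model.S3Event;
import com.amazonaws.services.s3.model.TopicConfiguration;
import java.util.EnumSet;

public class EnableUploadNotifications {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        BucketNotificationConfiguration config = new BucketNotificationConfiguration();
        // s3:ObjectCreated:* covers puts, posts, copies, and completed multipart
        // uploads, so it fires no matter which upload mechanism created the object.
        config.addConfiguration("uploadFinished",
                new TopicConfiguration("arn:aws:sns:us-east-1:123456789012:uploads",
                        EnumSet.of(S3Event.ObjectCreated)));
        s3.setBucketNotificationConfiguration("my-bucket", config);
    }
}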
I have a directory with files that cannot be removed because they are used by other applications or have read-only attributes. This means that I can't move or delete the files the way Mule does for its natural file tracking. In order to process these files through Mule once they arrive, or when they get updated, without deleting/moving them from the original directory, I need some sort of custom tracking. To do this I think I need to add some rules and be able to track files that are:
New files
Processed files
Updated files
For this, I thought of having a log file in the same directory that would track each file by name and modification date, but I'm not sure this is the correct way of doing it. I would need to read and write this log file and compare its contents with the current files in the directory in order to determine which files are new or updated. This seems a bit too complicated and requires quite a bit of extra programming (maybe as Groovy scripts or by overriding some methods).
Is there any simpler way to do this in Mule? If not, how should I start tackling this problem? I'm guessing I can write some Java to talk to the File endpoint.
As Victor Romero pointed out, an Idempotent Filter does the trick. I tried two types of Idempotent Filter to see which one works best: the Idempotent Message Filter and the Idempotent Secure Hash Message Filter. Both of them did the job; however, I ended up using the Idempotent Message Filter (no hash) to log the timestamp and filename in the simple-text-file-store.
Just after the File inbound-endpoint:
<idempotent-message-filter idExpression="#[message.inboundProperties.originalFilename+'-'+message.inboundProperties.timestamp]" storePrefix="prefix" doc:name="Idempotent Message">
    <simple-text-file-store name="uniqueProcessedMessages" directory="C:\yourDirectory"/>
</idempotent-message-filter>
Only files that are new or modified for the purposes of my process pass through. However, the Idempotent Secure Hash Message Filter should do a better job of identifying files whose contents have changed.
I have been working on an Android project in which I have to download/upload a few files via HTTP. I was wondering if there is a way to make these downloads/uploads resumable. That is, if a file is being downloaded or uploaded and the connection drops for a short moment (this sometimes corrupts the file; the process is stopped and next time it starts from zero), the transfer should pause, and once the internet is back on my device it should resume from the point where it stopped, so that the file does not get corrupted and the process does not start from zero.
Is there any way to achieve this functionality in Android/Java? Please do let me know. Thanks in advance.
HTML itself doesn't provide the ability to upload a file in chunks. The file-upload form field is a simple mechanism that works with the file as a whole, and so sends it from scratch. To fulfill your requirements you need a more sophisticated client/server setup. A Java applet is one candidate for the client side, and the server side is trivial. However, you would need to implement some protocol yourself (handshake, start sending the file, continue from some offset, validation), and this is not an easy task. Even widely used protocols (for example FTP) don't generally provide this ability, and even once you have built all this, it will be compatible only with itself. Is it really worth all the effort? The common answer is no, and that's the reason you don't see this approach in the wild.
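For the download half, at least, plain HTTP already provides a building block: if the server supports Range requests, the client can resume an interrupted download from the last byte it managed to write. A minimal sketch under that assumption (the URL and target file are placeholders):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ResumableDownload {
    // Resume a download by asking the server only for the bytes we don't have yet.
    public static void download(String url, File target) throws IOException {
        long have = target.exists() ? target.length() : 0;
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (have > 0) {
            conn.setRequestProperty("Range", "bytes=" + have + "-");
        }
        int code = conn.getResponseCode();
        // 206 Partial Content: the server honored the Range header, so append.
        // 200 OK: no Range support; start the file over from scratch.
        boolean append = (code == HttpURLConnection.HTTP_PARTIAL);
        try (InputStream in = conn.getInputStream();
             FileOutputStream out = new FileOutputStream(target, append)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}

Resumable uploads, as noted above, still require server-side cooperation that HTTP alone does not standardize.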
We are building a service to front fetching remote static files for our Android app. The service will give a readout of the current MD5 checksum of a file. The concept is that we retain the static file on the device until the checksum changes; when the file changes, the service returns a different checksum, and this is the trigger for the device to download the file again.
I was thinking of just laying the downloaded files down in the file system with a .md5 file next to each one. When the code starts up, I'd go over all the files and build a map of file name (known to be unique) to checksum. Then, on each request for a file, I'd check the remote service (whose response would only be checked every few minutes) and compare the result against the map.
The more I thought about this, the more I thought someone must have already done it. So before I put time into this I was wondering if there was a project out there doing this. I did some searching but could not find any.
Yes, this is built into HTTP. You can use conditional requests and cache files based on ETags, Last-Modified headers, etc. If you are looking for a library that implements your particular caching scheme, it's a bit unlikely that one exists. Write one and share it on GitHub :)
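For example, a conditional GET keyed on a stored ETag lets the server reply 304 Not Modified instead of resending the body. A minimal sketch; how the ETag is persisted alongside the cached file is left out, and the URL is a placeholder:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalFetch {
    // Ask the server whether our cached copy (identified by its ETag) is stale.
    // Returns true when a fresh copy should be downloaded.
    public static boolean isStale(String url, String cachedEtag) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (cachedEtag != null) {
            conn.setRequestProperty("If-None-Match", cachedEtag);
        }
        int code = conn.getResponseCode();
        // 304 Not Modified: keep the local file; no body was transferred.
        return code != HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}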