Returning video information Video Intelligence API - java

I am not able to display the information of the local video, when I do the test with the videos examples are returning, but when I try with the files of the machine does not return anything.
public String consultar() throws Throwable {
requisicaoVideo("C:\\Users\\Web Designer\\Desktop\\Placas de Carros\\cat.mp4");
return "analiseForenseVideos.xhtml";
}
public void requisicaoVideo(String filePath) throws Exception {
try (VideoIntelligenceServiceClient client = VideoIntelligenceServiceClient.create()) {
// Read file and encode into Base64
Path path = Paths.get(filePath);
byte[] data = Files.readAllBytes(path);
byte[] encodedBytes = Base64.encodeBase64(data);
System.out.println(encodedBytes + "Linha 74");
AnnotateVideoRequest request = AnnotateVideoRequest.newBuilder()
.setInputContent(ByteString.copyFrom(encodedBytes)).addFeatures(Feature.LABEL_DETECTION).build();
// Create an operation that will contain the response when the operation
// completes.
OperationFuture<AnnotateVideoResponse, AnnotateVideoProgress> response = client.annotateVideoAsync(request);
System.out.println("Waiting for operation to complete...");
System.out.println(response.get().getAnnotationResultsList() + "Linha 83");
for (VideoAnnotationResults results : response.get().getAnnotationResultsList()) {
// process video / segment level label annotations
System.out.println("Locations: ");
for (LabelAnnotation labelAnnotation : results.getSegmentLabelAnnotationsList()) {
System.out.println("Video label: " + labelAnnotation.getEntity().getDescription());
// categories
for (Entity categoryEntity : labelAnnotation.getCategoryEntitiesList()) {
System.out.println("Video label category: " + categoryEntity.getDescription());
}
// segments
for (LabelSegment segment : labelAnnotation.getSegmentsList()) {
double startTime = segment.getSegment().getStartTimeOffset().getSeconds()
+ segment.getSegment().getStartTimeOffset().getNanos() / 1e9;
double endTime = segment.getSegment().getEndTimeOffset().getSeconds()
+ segment.getSegment().getEndTimeOffset().getNanos() / 1e9;
System.out.printf("Segment location: %.3f:%.2f\n", startTime, endTime);
System.out.println("Confidence: " + segment.getConfidence());
}
}

I am with Google Cloud Support. Thank you for reporting this issue. I have been doing some tests and identified some kind of bug in analyzeLabelsFile function in Detect.java file.
If you let the job run for a lot of time, it might get finished (for me takes 30sec importing the file from Google Cloud Storage and 16min using the local file), but anyway is not providing information, just "Locations: " message at the end.
I have sent all the relevant information regarding this (how to reproduce the issue, possible cause, etc.) to Google Video Intelligence API team so they can have a look.
I have not found a workaround for local files but you can process the file in GCS through its URL and the analyzeLabels function.

Related

S3 / MinIO with Java / Scala: Saving byte buffers chunks of files to object storage

So, imagine that I have a Scala Vert.x Web REST API that receives file uploads via HTTP multipart requests. However, it doesn't receive the incoming file data as a single InputStream. Instead, each file is received as a series of byte buffers handed over via a few callback functions.
The callbacks basically look like this:
// the callback that receives byte buffers (chunks) of the file being uploaded
// it is called multiple times until the full file has been received
upload.handler { buffer =>
// send chunk to backend
}
// the callback that gets called after the full file has been uploaded
// (i.e. after all chunks have been received)
upload.endHandler { _ =>
// do something after the file has been uploaded
}
// callback called if an exception is raised while receiving the file
upload.exceptionHandler { e =>
// do something to handle the exception
}
Now, I'd like to use these callbacks to save the file into a MinIO Bucket (MinIO, if you're unfamiliar, is basically self-hosted S3 and it's API is pretty much the same as the S3 Java API).
Since I don't have a file handle, I need to use putObject() to put an InputStream into MinIO.
The inefficient work-around I'm currently using with the MinIO Java API looks like this:
// this is all inside the context of handling a HTTP request
val out = new PipedOutputStream()
val in = new PipedInputStream()
var size = 0
in.connect(out)
upload.handler { buffer =>
s.write(buffer.getBytes)
size += buffer.length()
}
upload.endHandler { _ =>
minioClient.putObject(
PutObjectArgs.builder()
.bucket("my-bucket")
.object("my-filename")
.stream(in, size, 50000000)
.build())
}
Obviously, this isn't optimal. Since I'm using a simple java.io stream here, the entire file ends up getting loaded into memory.
I don't want to save the File to disk on the server before putting it into object storage. I'd like to put it straight into my object storage.
How could I accomplish this using the S3 API and a series of byte buffers given to me via the upload.handler callback?
EDIT
I should add that I am using MinIO because I cannot use a commercially-hosted cloud solution, like S3. However, as mentioned on MinIO's website, I can use Amazon's S3 Java SDK while using MinIO as my storage solution.
I attempted to follow this guide on Amazon's website for uploading objects to S3 in chunks.
That solution I attempted looks like this:
context.request.uploadHandler { upload =>
println(s"Filename: ${upload.filename()}")
val partETags = new util.ArrayList[PartETag]
val initRequest = new InitiateMultipartUploadRequest("docs", "my-filekey")
val initResponse = s3Client.initiateMultipartUpload(initRequest)
upload.handler { buffer =>
println("uploading part", buffer.length())
try {
val request = new UploadPartRequest()
.withBucketName("docs")
.withKey("my-filekey")
.withPartSize(buffer.length())
.withUploadId(initResponse.getUploadId)
.withInputStream(new ByteArrayInputStream(buffer.getBytes()))
val uploadResult = s3Client.uploadPart(request)
partETags.add(uploadResult.getPartETag)
} catch {
case e: Exception => println("Exception raised: ", e)
}
}
// this gets called for EACH uploaded file sequentially
upload.endHandler { _ =>
// upload successful
println("done uploading")
try {
val compRequest = new CompleteMultipartUploadRequest("docs", "my-filekey", initResponse.getUploadId, partETags)
s3Client.completeMultipartUpload(compRequest)
} catch {
case e: Exception => println("Exception raised: ", e)
}
context.response.setStatusCode(200).end("Uploaded")
}
upload.exceptionHandler { e =>
// handle the exception
println("exception thrown", e)
}
}
}
This works for files that are small (my test small file was 11 bytes), but not for large files.
In the case of large files, the processes inside the upload.handler get progressively slower as the file continues to upload. Also, upload.endHandler is never called, and the file somehow continues uploading after 100% of the file has been uploaded.
However, as soon as I comment out the s3Client.uploadPart(request) portion inside upload.handler and the s3Client.completeMultipartUpload parts inside upload.endHandler (basically throwing away the file instead of saving it to object storage), the file upload progresses as normal and terminates correctly.
I figured out what I was doing wrong (when using the S3 client). I was not accumulating bytes inside my upload.handler. I need to accumulate bytes until the buffer size is big enough to upload a part, rather than upload each time I receive a few bytes.
Since neither Amazon's S3 client nor the MinIO client did what I want, I decided to dig into how putObject() was actually implemented and make my own. This is what I came up with.
This implementation is specific to Vert.X, however it can easily be generalized to work with built-in java.io InputStreams via a while loop and using a pair of Piped- streams.
This implementation is also specific to MinIO, but it can easily be adapted to use the S3 client since, for the most part, the two APIs are the same.
In this example, Buffer is basically a container around a ByteArray and I'm not really doing anything special here. I replaced it with a byte array to ensure that it would still work, and it did.
package server
import com.google.common.collect.HashMultimap
import io.minio.MinioClient
import io.minio.messages.Part
import io.vertx.core.buffer.Buffer
import io.vertx.core.streams.ReadStream
import scala.collection.mutable.ListBuffer
class CustomMinioClient(client: MinioClient) extends MinioClient(client) {
def putReadStream(bucket: String = "my-bucket",
objectName: String,
region: String = "us-east-1",
data: ReadStream[Buffer],
objectSize: Long,
contentType: String = "application/octet-stream"
) = {
val headers: HashMultimap[String, String] = HashMultimap.create()
headers.put("Content-Type", contentType)
var uploadId: String = null
try {
val parts = new ListBuffer[Part]()
val createResponse = createMultipartUpload(bucket, region, objectName, headers, null)
uploadId = createResponse.result.uploadId()
var partNumber = 1
var uploadedSize = 0
// an array to use to accumulate bytes from the incoming stream until we have enough to make a `uploadPart` request
var partBuffer = Buffer.buffer()
// S3's minimum part size is 5mb, excepting the last part
// you should probably implement your own logic for determining how big
// to make each part based off the total object size to avoid unnecessary calls to S3 to upload small parts.
val minPartSize = 5 * 1024 * 1024
data.handler { buffer =>
partBuffer.appendBuffer(buffer)
val availableSize = objectSize - uploadedSize - partBuffer.length
val isMinPartSize = partBuffer.length >= minPartSize
val isLastPart = uploadedSize + partBuffer.length == objectSize
if (isMinPartSize || isLastPart) {
val partResponse = uploadPart(
bucket,
region,
objectName,
partBuffer.getBytes,
partBuffer.length,
uploadId,
partNumber,
null,
null
)
parts.addOne(new Part(partNumber, partResponse.etag))
uploadedSize += partBuffer.length
partNumber += 1
// empty the part buffer since we have already uploaded it
partBuffer = Buffer.buffer()
}
}
data.endHandler { _ =>
completeMultipartUpload(bucket, region, objectName, uploadId, parts.toArray, null, null)
}
data.exceptionHandler { exception =>
// should also probably abort the upload here
println("Handler caught exception in custom putObject: " + exception)
}
} catch {
// and abort it here as well...
case e: Exception =>
println("Exception thrown in custom `putObject`: " + e)
abortMultipartUpload(
bucket,
region,
objectName,
uploadId,
null,
null
)
}
}
}
This can all be used pretty easily.
First, set up the client:
private val _minioClient = MinioClient.builder()
.endpoint("http://localhost:9000")
.credentials("my-username", "my-password")
.build()
private val myClient = new CustomMinioClient(_minioClient)
Then, where you receive the upload request:
context.request.uploadHandler { upload =>
myClient.putReadStream(objectName = upload.filename(), data = upload, objectSize = myFileSize)
context.response().setStatusCode(200).end("done")
}
The only catch with this implementation is that you need to know the file sizes in advance for the request.
However, this can easily be solved the way I did it, especially if you're using a web UI.
Before attempting to upload the files, send a request to the server containing a map of file name to file size.
That pre-request should generate a unique ID for the upload.
The server can save group of filename->filesize using the upload ID as an index. - Server sends the upload ID back to the client.
Client sends the multipart upload request using the upload ID
Server pulls the list of files and their sizes and uses it to call .putReadStream()

Google Drive resumable upload in v3

I am looking for some help/example to perform a resumeable upload to Google Drive using the new v3 REST API in Java.
I know there is a low level description here: Upload files | Google Drive API. But at the moment I am not willing to understand any of these low level requests, if there isn't another, simpler method ( like former MediaHttpUploader, which is deprecated now...)
What I currently do is:
File fileMetadata = new File();
fileMetadata.setName(name);
fileMetadata.setDescription(...);
fileMetadata.setParents(parents);
fileMetadata.setProperties(...);
FileContent mediaContent = new FileContent(..., file);
drive.files().create(fileMetadata, mediaContent).execute();
But for large files, this isn't good if the connection interrupts.
I've just created an implementation on that recently. It will create a new file on your DriveFolder and return its metadata when the task succeeds. While uploading, it will also update the listener with uploading info. I added comments to make it auto explanable:
public Task<File> createFile(java.io.File yourfile, MediaHttpUploaderProgressListener uploadListener) {
return Tasks.call(mExecutor, () -> {
//Generates an input stream with your file content to be uploaded
FileContent mediaContent = new FileContent("yourFileMimeType", yourfile);
//Creates an empty Drive file
File metadata = new File()
.setParents(parents)
.setMimeType(yourFileMimeType)
.setName(yourFileName);
//Builds up the upload request
Drive.Files.Create uploadFile = mDriveService.files().create(metadata, mediaContent);
//This will handle the resumable upload
MediaHttpUploader uploader = uploadBackup.getMediaHttpUploader();
//choose your chunk size and it will automatically divide parts
uploader.setChunkSize(MediaHttpUploader.MINIMUM_CHUNK_SIZE);
//according to Google, this enables gzip in future (optional)
uploader.setDisableGZipContent(false); versions
//important, this enables resumable upload
uploader.setDirectUploadEnabled(false);
//listener to be updated
uploader.setProgressListener(uploadListener);
return uploadFile.execute();
});
}
And make your Activity extends MediaHttpUploaderProgressListener so you have real time updates on the file progress:
#Override
public void progressChanged(MediaHttpUploader uploader) {
String sizeTemp = "Uploading"
+ ": "
+ Formatter.formatShortFileSize(this, uploader.getNumBytesUploaded())
+ "/"
+ Formatter.formatShortFileSize(this, totalFileSize);
runOnUiThread(() -> textView.setText(sizeTemp));
}
For calculating the progress percentage, you simply do:
double percentage = uploader.getNumBytesUploaded() / totalFileSize
Or use this one:
uploader.getProgress()
It gives you the percentage of bytes that have been uploaded, represented between 0.0 (0%) and 1.0 (100%). But be sure to have your content length specified, otherwise it will throw IllegalArgumentException.

AWS Amazon S3 Java SDK - Refresh credentials / token when expired while uploading large file

I'm trying to upload a large file to a server which uses a token and the token expires after 10 minutes, so if I upload a small file it will work therefore if the file is big than I will get some problems and will be trying to upload for ever while the access is denied
So I need refresh the token in the BasicAWSCredentials which is than used for the AWSStaticCredentialsProvider therefore I'm not sure how can i do it, please help =)
Worth to mention that we use a local server (not amazon cloud) with provides the token and for convenience we use amazon's code.
here is my code:
public void uploadMultipart(File file) throws Exception {
//this method will give you a initial token for a given user,
//than calculates when a new token is needed and will refresh it just when necessary
String token = getUsetToken();
String existingBucketName = myTenant.toLowerCase() + ".package.upload";
String endPoint = urlAPI + "s3/buckets/";
String strSize = FileUtils.byteCountToDisplaySize(FileUtils.sizeOf(file));
System.out.println("File size: " + strSize);
AwsClientBuilder.EndpointConfiguration endpointConfiguration = new AwsClientBuilder.EndpointConfiguration(endPoint, null);//note: Region has to be null
//AWSCredentialsProvider
BasicAWSCredentials sessionCredentials = new BasicAWSCredentials(token, "NOT_USED");//secretKey should be set to NOT_USED
AmazonS3 s3 = AmazonS3ClientBuilder
.standard()
.withCredentials(new AWSStaticCredentialsProvider(sessionCredentials))
.withEndpointConfiguration(endpointConfiguration)
.enablePathStyleAccess()
.build();
int maxUploadThreads = 5;
TransferManager tm = TransferManagerBuilder
.standard()
.withS3Client(s3)
.withMultipartUploadThreshold((long) (5 * 1024 * 1024))
.withExecutorFactory(() -> Executors.newFixedThreadPool(maxUploadThreads))
.build();
PutObjectRequest request = new PutObjectRequest(existingBucketName, file.getName(), file);
//request.putCustomRequestHeader("Access-Token", token);
ProgressListener progressListener = progressEvent -> System.out.println("Transferred bytes: " + progressEvent.getBytesTransferred());
request.setGeneralProgressListener(progressListener);
Upload upload = tm.upload(request);
LocalDateTime uploadStartedAt = LocalDateTime.now();
log.info("Starting upload at: " + uploadStartedAt);
try {
upload.waitForCompletion();
//upload.waitForUploadResult();
log.info("Upload completed. " + strSize);
} catch (Exception e) {//AmazonClientException
log.error("Error occurred while uploading file - " + strSize);
e.printStackTrace();
}
}
Solution found !
I found a way to get this working and for to be honest I quite happy about the result, I've done so many tests with big files (50gd.zip) and in every scenario worked very well
My solution is, remove the line: BasicAWSCredentials sessionCredentials = new BasicAWSCredentials(token, "NOT_USED");
AWSCredentials is a interface so we can override it with something dynamic, the the logic of when the token is expired and needs a new fresh token is held inside the getToken() method meaning you can call every time with no harm
AWSCredentials sessionCredentials = new AWSCredentials() {
#Override
public String getAWSAccessKeyId() {
try {
return getToken(); //getToken() method return a string
} catch (Exception e) {
return null;
}
}
#Override
public String getAWSSecretKey() {
return "NOT_USED";
}
};
When uploading a file (or parts of a multi-part file), the credentials that you use must last long enough for the upload to complete. You CANNOT refresh the credentials as there is no method to update AWS S3 that you are using new credentials for an already signed request.
You could break the upload into smaller files that upload quicker. Then only upload X parts. Refresh your credentials and upload Y parts. Repeat until all parts are uploaded. Then you will need to finish by combining the parts (which is a separate command). This is not a perfect solution as transfer speeds cannot be accurately controlled AND this means that you will have to write your own upload code (which is not hard).

How to download a large file from Google Cloud Storage using Java with checksum control

I want to download large files from Google Cloud Storage using the google provided Java library com.google.cloud.storage. I have working code, but I still have one question and one major concern:
My major concern is, when is the file content actually downloaded? During (references to the code below) storage.get(blobId), during blob.reader() or during reader.read(bytes)? This gets very important when it comes to how to handle an invalid checksum, what do I need to do in order to actually trigger that the file is fetched over the network again?
The simpler question is: Is there built in functionality to do md5 (or crc32c) check on the received file in the google library? Maybe I don't need to implement it on my own.
Here is my method trying to download big files from Google Cloud Storage:
private static final int MAX_NUMBER_OF_TRIES = 3;
public Path downloadFile(String storageFileName, String bucketName) throws IOException {
// In my real code, this is a field populated in the constructor.
Storage storage = Objects.requireNonNull(StorageOptions.getDefaultInstance().getService());
BlobId blobId = BlobId.of(bucketName, storageFileName);
Path outputFile = Paths.get(storageFileName.replaceAll("/", "-"));
int retryCounter = 1;
Blob blob;
boolean checksumOk;
MessageDigest messageDigest;
try {
messageDigest = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException ex) {
throw new RuntimeException(ex);
}
do {
LOGGER.debug("Start download file {} from bucket {} to Content Store (try {})", storageFileName, bucketName, retryCounter);
blob = storage.get(blobId);
if (null == blob) {
throw new CloudStorageCommunicationException("Failed to download file after " + retryCounter + " tries.");
}
if (Files.exists(outputFile)) {
Files.delete(outputFile);
}
try (ReadChannel reader = blob.reader();
FileChannel channel = new FileOutputStream(outputFile.toFile(), true).getChannel()) {
ByteBuffer bytes = ByteBuffer.allocate(128 * 1024);
int bytesRead = reader.read(bytes);
while (bytesRead > 0) {
bytes.flip();
messageDigest.update(bytes.array(), 0, bytesRead);
channel.write(bytes);
bytes.clear();
bytesRead = reader.read(bytes);
}
}
String checksum = Base64.encodeBase64String(messageDigest.digest());
checksumOk = checksum.equals(blob.getMd5());
if (!checksumOk) {
Files.delete(outputFile);
messageDigest.reset();
}
} while (++retryCounter <= MAX_NUMBER_OF_TRIES && !checksumOk);
if (!checksumOk) {
throw new CloudStorageCommunicationException("Failed to download file after " + MAX_NUMBER_OF_TRIES + " tries.");
}
return outputFile;
}
The google-cloud-java storage library does not validate checksums on its own when reading data beyond normal HTTPS/TCP correctness checking. If it compared the MD5 of the received data to the known MD5, it would need to download the entire file before it could return any results from read(), which for very large files would be infeasible.
What you're doing is a good idea if you need the additional protection of comparing MD5s. If this is a one-off task, you could use the gsutil command-line tool, which does this same sort of additional check.
As the JavaDoc of ReadChannel says:
Implementations of this class may buffer data internally to reduce remote calls.
So the implementation you get from blob.reader() could cache the whole file, some bytes or nothing and just fetch byte for byte when you call read(). You will never know and you shouldn't care.
As only read() throws an IOException and the other methods you used do not, I'd say that only calling read() will actually download stuff. You can also see this in the sources of the lib.
Btw. despite the example in the JavaDocs of the library, you should check for >= 0, not > 0. 0 just means nothing was read, not that end of stream is reached. End of stream is signaled by returning -1.
For retrying after a failed checksum check, get a new reader from the blob. If something caches the downloaded data, then the reader itself. So if you get a new reader from the blob, the file will be redownloaded from remote.

Java: Watching a directory to move large files

I have been writing a program that watches a directory and when files are created in it, it changes the name and moves them to a new directory. In my first implementation I used Java's Watch Service API which worked fine when I was testing 1kb files. The problem that came up is that in reality the files getting created are anywhere from 50-300mb. When this happened the watcher API would find the file right away but could not move it because it was still being written. I tried putting the watcher in a loop (which generated exceptions until the file could be moved) but this seemed pretty inefficient.
Since that didn't work, I tried up using a timer that checks the folder every 10s and then moves files when it can. This is the method I ended up going for.
Question: Is there anyway to signal when a file is done being written without doing an exception check or continually comparing the size? I like the idea of using the Watcher API just once for each file instead of continually checking with a timer (and running into exceptions).
All responses are greatly appreciated!
nt
I ran into the same problem today. I my usecase a small delay before the file is actually imported was not a big problem and I still wanted to use the NIO2 API. The solution I choose was to wait until a file has not been modified for 10 seconds before performing any operations on it.
The important part of the implementation is as follows. The program waits until the wait time expires or a new event occures. The expiration time is reset every time a file is modified. If a file is deleted before the wait time expires it is removed from the list. I use the poll method with a timeout of the expected expirationtime, that is (lastmodified+waitTime)-currentTime
private final Map<Path, Long> expirationTimes = newHashMap();
private Long newFileWait = 10000L;
public void run() {
for(;;) {
//Retrieves and removes next watch key, waiting if none are present.
WatchKey k = watchService.take();
for(;;) {
long currentTime = new DateTime().getMillis();
if(k!=null)
handleWatchEvents(k);
handleExpiredWaitTimes(currentTime);
// If there are no files left stop polling and block on .take()
if(expirationTimes.isEmpty())
break;
long minExpiration = min(expirationTimes.values());
long timeout = minExpiration-currentTime;
logger.debug("timeout: "+timeout);
k = watchService.poll(timeout, TimeUnit.MILLISECONDS);
}
}
}
private void handleExpiredWaitTimes(Long currentTime) {
// Start import for files for which the expirationtime has passed
for(Entry<Path, Long> entry : expirationTimes.entrySet()) {
if(entry.getValue()<=currentTime) {
logger.debug("expired "+entry);
// do something with the file
expirationTimes.remove(entry.getKey());
}
}
}
private void handleWatchEvents(WatchKey k) {
List<WatchEvent<?>> events = k.pollEvents();
for (WatchEvent<?> event : events) {
handleWatchEvent(event, keys.get(k));
}
// reset watch key to allow the key to be reported again by the watch service
k.reset();
}
private void handleWatchEvent(WatchEvent<?> event, Path dir) throws IOException {
Kind<?> kind = event.kind();
WatchEvent<Path> ev = cast(event);
Path name = ev.context();
Path child = dir.resolve(name);
if (kind == ENTRY_MODIFY || kind == ENTRY_CREATE) {
// Update modified time
FileTime lastModified = Attributes.readBasicFileAttributes(child, NOFOLLOW_LINKS).lastModifiedTime();
expirationTimes.put(name, lastModified.toMillis()+newFileWait);
}
if (kind == ENTRY_DELETE) {
expirationTimes.remove(child);
}
}
Write another file as an indication that the original file is completed.
I.g 'fileorg.dat' is growing if done create a file 'fileorg.done' and check
only for the 'fileorg.done'.
With clever naming conventions you should not have problems.
Two solutions:
The first is a slight variation of the answer by stacker:
Use a unique prefix for incomplete files. Something like myhugefile.zip.inc instead of myhugefile.zip. Rename the files when upload / creation is finished. Exclude .inc files from the watch.
The second is to use a different folder on the same drive to create / upload / write the files and move them to the watched folder once they are ready. Moving should be an atomic action if they are on the same drive (file system dependent, I guess).
Either way, the clients that create the files will have to do some extra work.
I know it's an old question but maybe it can help somebody.
I had the same issue, so what I did was the following:
if (kind == ENTRY_CREATE) {
System.out.println("Creating file: " + child);
boolean isGrowing = false;
Long initialWeight = new Long(0);
Long finalWeight = new Long(0);
do {
initialWeight = child.toFile().length();
Thread.sleep(1000);
finalWeight = child.toFile().length();
isGrowing = initialWeight < finalWeight;
} while(isGrowing);
System.out.println("Finished creating file!");
}
When the file is being created, it will be getting bigger and bigger. So what I did was to compare the weight separated by a second. The app will be in the loop until both weights are the same.
Looks like Apache Camel handles the file-not-done-uploading problem by trying to rename the file (java.io.File.renameTo). If the rename fails, no read lock, but keep trying. When the rename succeeds, they rename it back, then proceed with intended processing.
See operations.renameFile below. Here are the links to the Apache Camel source: GenericFileRenameExclusiveReadLockStrategy.java and FileUtil.java
public boolean acquireExclusiveReadLock( ... ) throws Exception {
LOG.trace("Waiting for exclusive read lock to file: {}", file);
// the trick is to try to rename the file, if we can rename then we have exclusive read
// since its a Generic file we cannot use java.nio to get a RW lock
String newName = file.getFileName() + ".camelExclusiveReadLock";
// make a copy as result and change its file name
GenericFile<T> newFile = file.copyFrom(file);
newFile.changeFileName(newName);
StopWatch watch = new StopWatch();
boolean exclusive = false;
while (!exclusive) {
// timeout check
if (timeout > 0) {
long delta = watch.taken();
if (delta > timeout) {
CamelLogger.log(LOG, readLockLoggingLevel,
"Cannot acquire read lock within " + timeout + " millis. Will skip the file: " + file);
// we could not get the lock within the timeout period, so return false
return false;
}
}
exclusive = operations.renameFile(file.getAbsoluteFilePath(), newFile.getAbsoluteFilePath());
if (exclusive) {
LOG.trace("Acquired exclusive read lock to file: {}", file);
// rename it back so we can read it
operations.renameFile(newFile.getAbsoluteFilePath(), file.getAbsoluteFilePath());
} else {
boolean interrupted = sleep();
if (interrupted) {
// we were interrupted while sleeping, we are likely being shutdown so return false
return false;
}
}
}
return true;
}
While it's not possible to be notificated by the Watcher Service API when the SO finish copying, all options seems to be 'work around' (including this one!).
As commented above,
1) Moving or copying is not an option on UNIX;
2) File.canWrite always returns true if you have permission to write, even if the file is still being copied;
3) Waits until the a timeout or a new event occurs would be an option, but what if the system is overloaded but the copy was not finished? if the timeout is a big value, the program would wait so long.
4) Writing another file to 'flag' that the copy finished is not an option if you are just consuming the file, not creating.
An alternative is to use the code below:
boolean locked = true;
while (locked) {
RandomAccessFile raf = null;
try {
raf = new RandomAccessFile(file, "r"); // it will throw FileNotFoundException. It's not needed to use 'rw' because if the file is delete while copying, 'w' option will create an empty file.
raf.seek(file.length()); // just to make sure everything was copied, goes to the last byte
locked = false;
} catch (IOException e) {
locked = file.exists();
if (locked) {
System.out.println("File locked: '" + file.getAbsolutePath() + "'");
Thread.sleep(1000); // waits some time
} else {
System.out.println("File was deleted while copying: '" + file.getAbsolutePath() + "'");
}
} finally {
if (raf!=null) {
raf.close();
}
}
}
This is a very interesting discussion, as certainly this is a bread and butter use case: wait for a new file to be created and then react to the file in some fashion. The race condition here is interesting, as certainly the high-level requirement here is to get an event and then actually obtain (at least) a read lock on the file. With large files or just simply lots of file creations, this could require a whole pool of worker threads that just periodically try to get locks on newly created files and, when they're successful, actually do the work. But as I am sure NT realizes, one would have to do this carefully to make it scale as it is ultimately a polling approach, and scalability and polling aren't two words that go together well.
I had to deal with a similar situation when I implemented a file system watcher to transfer uploaded files. The solution I implemented to solve this problem consists of the following:
1- First of all, maintain a Map of unprocessed file (As long as the file is still being copied, the file system generates Modify_Event, so you can ignore them if the flag is false).
2- In your fileProcessor, you pickup a file from the list and check if it's locked by the filesystem, if yes, you will get an exception, just catch this exception and put your thread in wait state (i.e 10 seconds) and then retry again till the lock is released. After processing the file, you can either change the flag to true or remove it from the map.
This solution will be not be efficient if the many versions of the same file are transferred during the wait timeslot.
Cheers,
Ramzi
Depending on how urgently you need to move the file once it is done being written, you can also check for a stable last-modified timestamp and only move the file one it is quiesced. The amount of time you need it to be stable can be implementation dependent, but I would presume that something with a last-modified timestamp that hasn't changed for 15 secs should be stable enough to be moved.
For large file in linux, the files gets copied with a extension of .filepart. You just need to check the extension using commons api and register the ENTRY_CREATE event. I tested this with my .csv files(1GB) and add it worked
public void run()
{
try
{
WatchKey key = myWatcher.take();
while (key != null)
{
for (WatchEvent event : key.pollEvents())
{
if (FilenameUtils.isExtension(event.context().toString(), "filepart"))
{
System.out.println("Inside the PartFile " + event.context().toString());
} else
{
System.out.println("Full file Copied " + event.context().toString());
//Do what ever you want to do with this files.
}
}
key.reset();
key = myWatcher.take();
}
} catch (InterruptedException e)
{
e.printStackTrace();
}
}
If you don't have control over the write process, log all ENTRY_CREATED events and observe if there are patterns.
In my case, the files are created via WebDav (Apache) and a lot of temporary files are created but also two ENTRY_CREATED events are triggered for the same file. The second ENTRY_CREATED event indicates that the copy process is complete.
Here are my example ENTRY_CREATED events. The absolute file path is printed (your log may differ, depending on the application that writes the file):
[info] application - /var/www/webdav/.davfs.tmp39dee1 was created
[info] application - /var/www/webdav/document.docx was created
[info] application - /var/www/webdav/.davfs.tmp054fe9 was created
[info] application - /var/www/webdav/document.docx was created
[info] application - /var/www/webdav/.DAV/__db.document.docx was created
As you see, I get two ENTRY_CREATED events for document.docx. After the second event I know the file is complete. Temporary files are obviously ignored in my case.
So, I had the same problem and had the following solution work for me.
Earlier unsuccessful attempt - Trying to monitor the "lastModifiedTime" stat of each file but I noticed that a large file's size growth may pause for some time.(size does not change continuously)
Basic Idea - For every event, create a trigger file(in a temporary directory) whose name is of the following format -
OriginalFileName_lastModifiedTime_numberOfTries
This file is empty and all the play is only in the name. The original file will only be considered after passing intervals of a specific duration without a change in it's "last Modified time" stat. (Note - since it's a file stat, there's no overhead -> O(1))
NOTE - This trigger file is handled by a different service(say 'FileTrigger').
Advantage -
No sleep or wait to hold the system.
Relieves the file watcher to monitor other events
CODE for FileWatcher -
val triggerFileName: String = triggerFileTempDir + orifinalFileName + "_" + Files.getLastModifiedTime(Paths.get(event.getFile.getName.getPath)).toMillis + "_0"
// creates trigger file in temporary directory
val triggerFile: File = new File(triggerFileName)
val isCreated: Boolean = triggerFile.createNewFile()
if (isCreated)
println("Trigger created: " + triggerFileName)
else
println("Error in creating trigger file: " + triggerFileName)
CODE for FileTrigger (cron job of interval say 5 mins) -
val actualPath : String = "Original file directory here"
val tempPath : String = "Trigger file directory here"
val folder : File = new File(tempPath)
val listOfFiles = folder.listFiles()
for (i <- listOfFiles)
{
// ActualFileName_LastModifiedTime_NumberOfTries
val triggerFileName: String = i.getName
val triggerFilePath: String = i.toString
// extracting file info from trigger file name
val fileInfo: Array[String] = triggerFileName.split("_", 3)
// 0 -> Original file name, 1 -> last modified time, 2 -> number of tries
val actualFileName: String = fileInfo(0)
val actualFilePath: String = actualPath + actualFileName
val modifiedTime: Long = fileInfo(1).toLong
val numberOfTries: Int = fileStats(2).toInt
val currentModifiedTime: Long = Files.getLastModifiedTime(Paths.get(actualFilePath)).toMillis
val differenceInModifiedTimes: Long = currentModifiedTime - modifiedTime
// checks if file has been copied completely(4 intervals of 5 mins each with no modification)
if (differenceInModifiedTimes == 0 && numberOfTries == 3)
{
FileUtils.deleteQuietly(new File(triggerFilePath))
println("Trigger file deleted. Original file completed : " + actualFilePath)
}
else
{
var newTriggerFileName: String = null
if (differenceInModifiedTimes == 0)
{
// updates numberOfTries by 1
newTriggerFileName = actualFileName + "_" + modifiedTime + "_" + (numberOfTries + 1)
}
else
{
// updates modified timestamp and resets numberOfTries to 0
newTriggerFileName = actualFileName + "_" + currentModifiedTime + "_" + 0
}
// renames trigger file
new File(triggerFilePath).renameTo(new File(tempPath + newTriggerFileName))
println("Trigger file renamed: " + triggerFileName + " -> " + newTriggerFileName)
}
}
I speculate that java.io.File.canWrite() will tell you when a file has been done writing.

Categories