How to copy a large number of files from one S3 folder to another - Java

I'm trying to move a large number of files (around 300 KB max each) from one S3 folder to another.
I'm using the AWS SDK for Java and tried to move around 1,500 files.
It took too much time, and the number of files may increase to 10,000.
For each file copied, I need to delete it from the source folder, as there is no method to move a file.
This is what I tried:
public void moveFiles(String fromKey, String toKey) {
    Stream<S3ObjectSummary> objectSummeriesStream = this.getObjectSummeries(fromKey);
    objectSummeriesStream.forEach(file -> {
        this.s3Bean.copyObject(bucketName, file.getKey(), bucketName, toKey);
        this.s3Bean.deleteObject(bucketName, file.getKey());
    });
}

private Stream<S3ObjectSummary> getObjectSummeries(String key) {
    // get the files whose prefix is "key" (prefixes can be considered folders)
    ListObjectsRequest listObjectsRequest = new ListObjectsRequest().withBucketName(this.bucketName)
            .withPrefix(key);
    ObjectListing outFilesList = this.s3Bean.listObjects(listObjectsRequest);
    return outFilesList.getObjectSummaries()
            .stream()
            .filter(x -> !x.getKey().equals(key));
}

If you are using a Java application, you can try using several threads to copy the files:
private final ExecutorService executorService = Executors.newFixedThreadPool(20);

public void moveFiles(String fromKey, String toKey) {
    Stream<S3ObjectSummary> objectSummeriesStream = this.getObjectSummeries(fromKey);
    objectSummeriesStream.forEach(file ->
        executorService.submit(() -> {
            this.s3Bean.copyObject(bucketName, file.getKey(), bucketName, toKey);
            this.s3Bean.deleteObject(bucketName, file.getKey());
        }));
}
This should speed up the process.
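Two practical notes on the threaded version (a sketch, assuming the same executorService field as above): submit() only queues the work, so wait for the tasks to finish before treating the move as complete; and since listObjects returns at most 1,000 keys per page, moving 10,000 files also requires paginating the listing (e.g. with isTruncated()/listNextBatchOfObjects()).
// after all copy/delete tasks have been submitted:
executorService.shutdown();                          // stop accepting new tasks
executorService.awaitTermination(1, TimeUnit.HOURS); // block until queued tasks finish (throws InterruptedException)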
An alternative might be using AWS Lambda. Once a file appears in the source bucket you can, for example, put an event in an SQS FIFO queue. A Lambda will start a single-file copy for each event. If I am not mistaken, you can run up to 500 Lambda instances in parallel. It should be fast.
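For illustration, the Lambda side of that idea could look roughly like this (a minimal sketch, not production code; the bucket name, target prefix, and the assumption that each SQS message body carries just the source object key are all hypothetical):
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class MoveObjectHandler implements RequestHandler<SQSEvent, Void> {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private static final String BUCKET = "my-bucket";   // hypothetical
    private static final String TO_PREFIX = "archive/"; // hypothetical target "folder"

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage msg : event.getRecords()) {
            String sourceKey = msg.getBody(); // assumed message format: the plain object key
            // copy then delete = "move"
            s3.copyObject(BUCKET, sourceKey, BUCKET, TO_PREFIX + sourceKey);
            s3.deleteObject(BUCKET, sourceKey);
        }
        return null;
    }
}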

Related

How to delete multiple objects in an Amazon S3 bucket using Java V2

So I want to delete all the objects that could be inside a folder in S3 (basically, everything with a certain prefix).
How do I do that?
I am currently using this while (true) loop, but I am told it is not a good approach to use while (true).
This is what I am using right now:
while (true) {
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        this.s3Client.deleteObject(bucketName, objectSummary.getKey());
    }
    if (objectListing.isTruncated()) {
        objectListing = s3Client.listNextBatchOfObjects(objectListing);
    } else {
        break;
    }
}
When using the AWS SDK for Java V2 S3 client, you can set up your code to use a List of ObjectIdentifier objects to delete. Add a new entry to the List for each object you want to delete, specifying the object's key (its path within the bucket) in the ObjectIdentifier.
You need to populate the List with every object you want to delete. So if you have 20 objects, you add 20 entries to the List, each with a valid key value that references an object to delete.
Then call deleteObjects(). This is a cleaner way to delete many objects: you delete multiple objects in one call instead of making many calls.
See this code.
public static void deleteBucketObjects(S3Client s3, String bucketName, String objectName) {
    ArrayList<ObjectIdentifier> toDelete = new ArrayList<>();
    toDelete.add(ObjectIdentifier.builder()
            .key(objectName)
            .build());
    try {
        DeleteObjectsRequest dor = DeleteObjectsRequest.builder()
                .bucket(bucketName)
                .delete(Delete.builder()
                        .objects(toDelete).build())
                .build();
        s3.deleteObjects(dor);
    } catch (S3Exception e) {
        System.err.println(e.awsErrorDetails().errorMessage());
        System.exit(1);
    }
    System.out.println("Done!");
}
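To answer the original question (delete everything under a given prefix), the same idea can be combined with a prefix listing. The sketch below uses the V2 paginator; method and variable names are illustrative:
import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.Delete;
import software.amazon.awssdk.services.s3.model.DeleteObjectsRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.ObjectIdentifier;

public static void deleteByPrefix(S3Client s3, String bucketName, String prefix) {
    ListObjectsV2Request listRequest = ListObjectsV2Request.builder()
            .bucket(bucketName)
            .prefix(prefix)
            .build();

    // The paginator follows continuation tokens for you, so this also works past 1,000 keys.
    List<ObjectIdentifier> toDelete = new ArrayList<>();
    s3.listObjectsV2Paginator(listRequest).contents()
            .forEach(obj -> toDelete.add(ObjectIdentifier.builder().key(obj.key()).build()));

    if (toDelete.isEmpty()) {
        return;
    }

    // Note: deleteObjects accepts at most 1,000 keys per request, so a very large prefix
    // would need to be deleted in batches of 1,000.
    s3.deleteObjects(DeleteObjectsRequest.builder()
            .bucket(bucketName)
            .delete(Delete.builder().objects(toDelete).build())
            .build());
}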

Java NIO: read a folder's contents and their attributes at once

I'm writing a backup program in Java using the NIO package.
As far as I can tell, I could do it as in the code example below, i.e. read the folder's content list and then make a separate file-attributes request for each file. This is not an efficient approach, especially for big folders with thousands of files in them, so maybe there is another way to read a folder's contents together with all their attributes?
Or should I use something else instead of NIO for this case? Thank you very much.
public void scan(Path folder) throws IOException {
    try (DirectoryStream<Path> ds = Files.newDirectoryStream(folder)) {
        for (Path path : ds) {
            // Map<String, Object> attributes = Files.readAttributes(path, "size,lastModifiedTime");
        }
    }
}
Thanks to DuncG, the answer is very simple:
HashMap<Path, BasicFileAttributes> attrs = new HashMap<>();
BiPredicate<Path, BasicFileAttributes> predicate = (p, a) -> attrs.put(p, a) == null;
try (Stream<Path> stream = Files.find(folder, Integer.MAX_VALUE, predicate)) {
    stream.forEach(p -> { }); // attrs is filled as a side effect while the stream is consumed
}
I made a benchmark to compare: this example executes about 3x faster than the example from the question text, so it seems to invoke fewer filesystem I/O operations... in theory...
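Another NIO option in the same spirit, in case it helps (a sketch using the same java.nio.file classes plus SimpleFileVisitor and FileVisitResult, inside a method that declares throws IOException like scan above; not benchmarked here): Files.walkFileTree hands the visitor the BasicFileAttributes it already read for each entry, so no second attributes request is needed.
Map<Path, BasicFileAttributes> attrs = new HashMap<>();
Files.walkFileTree(folder, new SimpleFileVisitor<Path>() {
    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes a) {
        attrs.put(file, a); // attributes were read during the walk itself
        return FileVisitResult.CONTINUE;
    }
});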

How to handle split Streams functionally

Given the following code, how can I simplify it to a single, functional line?
// DELETE CSV TEMP FILES
final Map<Boolean, List<File>> deleteResults = Stream.of(tmpDir.listFiles())
        .filter(tempFile -> tempFile.getName().endsWith(".csv"))
        .collect(Collectors.partitioningBy(File::delete));

// LOG SUCCESSES AND FAILURES
deleteResults.entrySet().forEach(entry -> {
    if (entry.getKey() && !entry.getValue().isEmpty()) {
        LOGGER.debug("deleted temporary files, {}",
                entry.getValue().stream().map(File::getAbsolutePath).collect(Collectors.joining(",")));
    } else if (!entry.getValue().isEmpty()) {
        LOGGER.debug("failed to delete temporary files, {}",
                entry.getValue().stream().map(File::getAbsolutePath).collect(Collectors.joining(",")));
    }
});
This is a common pattern I run into, where I have a stream of things, and I want to filter this stream, creating two streams based on that filter, so I can then do one thing with Stream A and another thing with Stream B. Is this an anti-pattern, or is it supported somehow?
If you particularly don't want the explicit variable referencing the interim map then you can just chain the operations:
.collect(Collectors.partitioningBy(File::delete))
.forEach((del, files) -> {
    if (del) {
        LOGGER.debug(... files.stream()...);
    } else {
        LOGGER.debug(... files.stream()...);
    }
});
If you want to log all files of either category together, there is no way around collecting them into a data structure until all elements are known. Still, you can simplify your code:
Stream.of(tmpDir.listFiles())
      .filter(tempFile -> tempFile.getName().endsWith(".csv"))
      .collect(Collectors.partitioningBy(File::delete,
              Collectors.mapping(File::getAbsolutePath, Collectors.joining(","))))
      .forEach((success, files) -> {
          if (!files.isEmpty()) {
              LOGGER.debug(success? "deleted temporary files, {}":
                                    "failed to delete temporary files, {}",
                           files);
          }
      });
This doesn’t collect the files into a List but directly into the intended String for the subsequent logging action. The logging action is also identical for both cases and only differs in the message.
Still, the most interesting thing is why deleting a file failed, which a boolean doesn’t tell. Since Java 7, the nio package provides a better alternative:
Create a helper method:
public static String deleteWithReason(Path p) {
    String problem;
    IOException ioEx;
    try {
        Files.delete(p);
        return "";
    }
    catch(FileSystemException ex) {
        problem = ex.getReason();
        ioEx = ex;
    }
    catch(IOException ex) {
        ioEx = ex;
        problem = null;
    }
    return problem != null? problem.replaceAll("\\.?\\R", ""): ioEx.getClass().getName();
}
and use it like
Files.list(tmpDir.toPath())
     .filter(tempFile -> tempFile.getFileName().toString().endsWith(".csv"))
     .collect(Collectors.groupingBy(YourClass::deleteWithReason,
             Collectors.mapping(p -> p.toAbsolutePath().toString(), Collectors.joining(","))))
     .forEach((failure, files) ->
         LOGGER.debug(failure.isEmpty()? "deleted temporary files, {}":
                                         "failed to delete temporary files, " + failure + ", {}",
                      files)
     );
The disadvantage, if you want to call it that, is that it does not produce a single entry for all failed files if they have different failure reasons. But that’s obviously unavoidable if you want to log them together with the reason why they couldn’t be deleted.
Note that if you want to exclude “being deleted by someone else concurrently” from the failures, you can simply use Files.deleteIfExists(p) instead of Files.delete(p), and an already-deleted file will be treated as a success.

Java - processing documents in parallel

I have 5 documents (say) and I have some processing to do on each of them. Processing here includes opening the document/file, reading the data, and doing some document manipulation (editing text, etc.). For document manipulation I will probably be using docx4j or apache-poi. But my use case is this: I want to somehow process these 4-5 documents in parallel, utilizing the multiple cores available to me on my CPU. The processing of each document is independent of the others.
What would be the best way to achieve this parallel processing in Java? I have used ExecutorService in Java before, and the Thread class too. But I don't have much idea about the newer concepts like Streams or RxJava. Can this task be achieved by using parallel streams as introduced in Java 8? What would be better to use: Executors, Streams, the Thread class, etc.? If Streams can be used, please provide a link where I can find a tutorial on how to do that. Thanks for your help!
You can process the documents in parallel with Java streams using the following pattern:
List<File> files = ...
files.parallelStream().forEach(f -> process(f));
or
File[] files = dir.listFiles();
Stream.of(files).parallel().forEach(f -> process(f));
Note: process cannot throw a checked exception in this example. I suggest you either log it or return a result object.
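Since the question also mentions ExecutorService, here is roughly what the equivalent could look like with a fixed thread pool (a sketch; process(File) stands in for your docx4j/apache-poi logic, and the pool size is just an assumption):
import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DocumentProcessor {

    public void processAll(List<File> files) throws InterruptedException {
        // one worker thread per available core
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (File f : files) {
            pool.submit(() -> process(f));
        }
        pool.shutdown();                          // no new tasks accepted
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the submitted tasks to finish
    }

    private void process(File f) {
        // open, read and manipulate the document here
    }
}
Both approaches use the available cores; the parallel stream is less code, while an explicit executor gives you more control over pool size and shutdown.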
If you want to learn about ReactiveX, I would recommend using RxJava's Observable.zip (http://reactivex.io/documentation/operators/zip.html), which lets you run multiple processes in parallel. Here is an example:
public class ObservableZip {

    private Scheduler scheduler;
    private Scheduler scheduler1;
    private Scheduler scheduler2;

    @Test
    public void testAsyncZip() {
        scheduler = Schedulers.newThread();  // thread to open and read file 1
        scheduler1 = Schedulers.newThread(); // thread to open and read file 2
        scheduler2 = Schedulers.newThread(); // thread to open and read file 3
        Observable.zip(obAsyncString(file1), obAsyncString1(file2), obAsyncString2(file3),
                       (s, s2, s3) -> s.concat(s2).concat(s3))
                  .subscribe(result -> showResult("All files in one:", result));
    }

    public void showResult(String transactionType, String result) {
        System.out.println(result + " " + transactionType);
    }

    public Observable<String> obAsyncString(File file) {
        return Observable.just(file)
                .observeOn(scheduler)
                .map(f -> {
                    // here you read your file and return its content
                    return "";
                });
    }

    public Observable<String> obAsyncString1(File file) {
        return Observable.just(file)
                .observeOn(scheduler1)
                .map(f -> {
                    // here you read your file 2 and return its content
                    return "";
                });
    }

    public Observable<String> obAsyncString2(File file) {
        return Observable.just(file)
                .observeOn(scheduler2)
                .map(f -> {
                    // here you read your file 3 and return its content
                    return "";
                });
    }
}
Like I said, this is just in case you want to learn about ReactiveX; if not, adding this framework to your stack just to solve this issue would be a little overkill, and I would much rather use the previous parallel stream solution.

What is a performant way to make Java 8 streams produce a formatted string?

Context: given a directory, I'd like to list all the files in there that contain a pattern in their name, ordered by lastModified timestamp, and format this list as a JSON string where I'd get the name and timestamp of each file:
[{"name": "somefile.txt", "timestamp": 123456},
{"name": "otherfile.txt", "timestamp": 456789}]
I've got the following code:
private StringBuilder jsonFileTimestamp(File file) {
    return new StringBuilder("{\"name\":\"")
            .append(file.getName())
            .append("\", \"timestamp\":")
            .append(file.lastModified())
            .append("}");
}

public String getJsonString(String path, String pattern, int skip, int limit) throws IOException {
    return Files.list(Paths.get(path))
            .map(Path::toFile)
            .filter(file -> file.getName().contains(pattern))
            .sorted((f1, f2) -> Long.compare(f2.lastModified(), f1.lastModified()))
            .skip(skip)
            .limit(limit)
            .map(f -> jsonFileTimestamp(f))
            .collect(Collectors.joining(",", "[", "]"));
}
This is working well. I am just concerned about the performance of the StringBuilder instantiation (or String concatenation). It is OK as long as the number of files stays small (which is my case, so I'm fine), but I am curious: what would you suggest as an optimization? I feel like I should use reduce with the correct accumulator and combiner, but I can't get my brain around it.
Thanks.
UPDATE
I finally went with the following "optimization":
private StringBuilder jsonFileTimestampRefactored(StringBuilder res, File file) {
    return res.append(res.length() == 0 ? "" : ",")
            .append("{\"name\":\"")
            .append(file.getName())
            .append("\", \"timestamp\":")
            .append(file.lastModified())
            .append("}");
}

public String getJsonStringRefactored(String path, String pattern, int skip, int limit) throws IOException {
    StringBuilder sb = Files.list(Paths.get(path))
            .map(Path::toFile)
            .filter(file -> file.getName().contains(pattern))
            .sorted((f1, f2) -> Long.compare(f2.lastModified(), f1.lastModified()))
            .skip(skip)
            .limit(limit)
            .reduce(new StringBuilder(),
                    (StringBuilder res, File file) -> jsonFileTimestampRefactored(res, file),
                    (StringBuilder a, StringBuilder b) -> a.append(a.length() == 0 || b.length() == 0 ? "" : ",").append(b));
    return new StringBuilder("[").append(sb).append("]").toString();
}
This version creates only 2 instances of StringBuilder, whereas the older one instantiates as many of them as there are files in the directory.
On my workstation, the first implementation takes 1289 ms to complete over 3379 files, while the second one takes 1306 ms. The second implementation costs me 1% more time when I was expecting (very small) savings.
I don't feel like the new version is easier to read or maintain, so I'll keep the old one.
Thanks all.
String formatting is such a trivial portion of your application's performance that it's almost never worth optimizing; only think about it if profiling shows an actual hot spot. In fact, most applications use reflective JSON mappers, and their bottlenecks are elsewhere (usually I/O). The StringBuilder approach you're using is the most efficient way you can do this in Java without manually twiddling character arrays, and it's even going farther than I would myself (I'd use String#format()).
Write your code for clarity instead. The current version is fine.
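For reference, the String#format() variant mentioned above could look like this (a drop-in sketch for the jsonFileTimestamp helper from the question, trading a little raw speed for readability):
private String jsonFileTimestamp(File file) {
    return String.format("{\"name\":\"%s\", \"timestamp\":%d}", file.getName(), file.lastModified());
}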
