Lazily recurse Java 8 stream - java

I'm using the Google Cloud Java API to get objects out of Google Cloud Storage (GCS). The code for this reads something like this:
Storage storage = ...
List<StorageObject> storageObjects = storage.objects().list(bucket).execute().getItems();
But this will not return all items (storage objects) in the GCS bucket, it'll only return the first 1000 items in the first "page". So in order to get the next 1000 items one should do:
Objects objects = storage.objects().list(bucket).execute();
String nextPageToken = objects.getNextPageToken();
List<StorageObject> itemsInFirstPage = objects.getItems();
if (nextPageToken != null) {
// recurse
}
What I want to do is to find an item that matches a Predicate while traversing all items in the GCS bucket until the predicate is matched. To make this efficient I'd like to only load the items in the next page when the item wasn't found in the current page. For a single page this works:
Predicate<StorageObject> matchesItem = ...
takeWhile(storage.objects().list(bucket).execute().getItems().stream(), not(matchesItem));
Where takeWhile is copied from here.
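For reference, here is a minimal Java 8 takeWhile sketch in the spirit of that linked answer (Java 9 and later have Stream.takeWhile built in); it wraps the source spliterator and stops advancing once the predicate fails:

import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

static <T> Stream<T> takeWhile(Stream<T> stream, Predicate<? super T> predicate) {
    Spliterator<T> source = stream.spliterator();
    Spliterator<T> limited = new Spliterators.AbstractSpliterator<T>(source.estimateSize(), 0) {
        private boolean stillGoing = true;

        @Override
        public boolean tryAdvance(Consumer<? super T> action) {
            if (!stillGoing) {
                return false;
            }
            boolean hadNext = source.tryAdvance(element -> {
                if (predicate.test(element)) {
                    action.accept(element);
                } else {
                    stillGoing = false; // stop after the first non-matching element
                }
            });
            return hadNext && stillGoing;
        }
    };
    return StreamSupport.stream(limited, false);
}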
And this will load the storage objects from all pages recursively:
private Stream<StorageObject> listGcsPageItems(String bucket, String pageToken) {
    if (pageToken == null) {
        return Stream.empty();
    }

    Storage.Objects.List list = storage.objects().list(bucket);
    if (!pageToken.equals(FIRST_PAGE)) {
        list.setPageToken(pageToken);
    }

    Objects objects = list.execute();
    String nextPageToken = objects.getNextPageToken();
    List<StorageObject> items = objects.getItems();

    return Stream.concat(items.stream(), listGcsPageItems(bucket, nextPageToken));
}
where FIRST_PAGE is just a "magic" String that instructs the method not to set a specific page (which will result in the first page items).
The problem with this approach is that it's eager, i.e. all items from all pages are loaded before the "matching predicate" is applied. I'd like this to be lazy (one page at a time). How can I achieve this?

I would implement a custom Iterator<StorageObject> or Supplier<StorageObject> that keeps the current page list and the next page token in its internal state, producing StorageObjects one by one.
Then I would use the following code to find the first match:
Optional<StorageObject> result =
Stream.generate(new StorageObjectSupplier(...))
.filter(predicate)
.findFirst();
Supplier will only be invoked until the match is found, i.e. lazily.
Another way is to implement the supplier per page, i.e. class StorageObjectPageSupplier implements Supplier<List<StorageObject>>, and use the stream API to flatten it:
Optional<StorageObject> result =
Stream.generate(new StorageObjectPageSupplier(...))
.flatMap(List::stream)
.filter(predicate)
.findFirst();
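A minimal sketch of such a page supplier, assuming the google-api-services-storage client types used in the question (the class name and error handling are illustrative; also note that Stream.generate() is unbounded, so findFirst() only terminates if a match actually exists):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

class StorageObjectPageSupplier implements Supplier<List<StorageObject>> {
    private final Storage storage;
    private final String bucket;
    private String pageToken;
    private boolean firstPage = true;

    StorageObjectPageSupplier(Storage storage, String bucket) {
        this.storage = storage;
        this.bucket = bucket;
    }

    @Override
    public List<StorageObject> get() {
        if (!firstPage && pageToken == null) {
            // the bucket is exhausted; keep returning empty pages because
            // Stream.generate() never stops asking the supplier
            return Collections.emptyList();
        }
        try {
            Storage.Objects.List request = storage.objects().list(bucket);
            if (!firstPage) {
                request.setPageToken(pageToken);
            }
            Objects objects = request.execute();
            firstPage = false;
            pageToken = objects.getNextPageToken();
            List<StorageObject> items = objects.getItems();
            return items != null ? items : Collections.emptyList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}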

Related

Java Stream - Retrieving repeated records from CSV

I searched the site and didn't find anything similar. I'm a newbie to Java streams, but I understand they can replace explicit loops. I would like to know whether there is a way to filter a CSV file using a stream, as shown below, so that only the repeated records are included in the result, grouped by the Center field.
Initial CSV file
Final result
In addition, the same pair cannot appear in the final result inversely, as shown in the table below:
This shouldn't happen
Is there a way to do it using stream and grouping at the same time, since theoretically, two loops would be needed to perform the task?
Thanks in advance.
You can do it in one pass as a stream with O(n) efficiency:
class PersonKey {
    // have a field for every column that is used to detect duplicates
    String center, name, mother, birthdate;

    public PersonKey(String line) {
        // implement String constructor
    }

    // implement equals and hashCode using all fields
}

List<String> lines; // the input
Set<PersonKey> seen = new HashSet<>();

List<String> duplicates = lines.stream()
    .filter(p -> !seen.add(new PersonKey(p)))
    .distinct()
    .collect(Collectors.toList());
The trick here is that a HashSet has constant time operations and its add() method returns false if the value being added is already in the set, true otherwise.
What I understood from your examples is that you consider an entry a duplicate if all the attributes have the same value except the ID. You can use anyMatch for this:
list.stream().filter(x ->
list.stream().anyMatch(y -> isDuplicate(x, y))).collect(Collectors.toList())
So what does isDuplicate(x, y) do?
It returns a boolean. In this method you check whether all the fields have the same value except the id:
private boolean isDuplicate(CsvEntry x, CsvEntry y) {
    return !x.getId().equals(y.getId())
        && x.getName().equals(y.getName())
        && x.getMother().equals(y.getMother())
        && x.getBirth().equals(y.getBirth());
}
I've assumed you've read all the entries as Strings. Change the checks according to the type. This will give you the duplicate entries with their corresponding IDs.
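If the grouping by Center is also required, a hedged sketch along these lines could work, assuming a CsvEntry class with the getters used above plus a getCenter() accessor (the composite key format is illustrative):

Map<String, List<CsvEntry>> duplicatesByCenter = list.stream()
    // group entries by everything except the ID
    .collect(Collectors.groupingBy(e -> e.getName() + "|" + e.getMother() + "|" + e.getBirth()))
    .values().stream()
    // keep only the groups that actually contain repeated records
    .filter(group -> group.size() > 1)
    .flatMap(List::stream)
    // then group the remaining duplicates by the Center field
    .collect(Collectors.groupingBy(CsvEntry::getCenter));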

Java 8 Streams for List iteration

I have a HashMap whose keys are List<Dto> and whose values are List<List<String>>:
Map<List<Dto>, List<List<String>>> mapData = new HashMap<>();
and an ArrayList<Dto>.
I want to iterate over this map, get the keys (key1, key2, etc.), get the corresponding value, set it on the Dto object, and then add the Dto to a list. I am able to do this successfully with a foreach loop, but I can't get it done correctly with Java 8 streams, so I need some help with that. Here is the sample code:
List<DTO> dtoList = new ArrayList();
DTO dto = new DTO();
mapData.entrySet().stream().filter(e->{
if(e.getKey().equals("key1")){
dto.setKey1(e.getValue())
}
if(e.getKey().equals("key2")){
dto.setKey2(e.getValue())
}
});
Here e.getValue() is a List<List<String>>, so the first thing is that I need to iterate over it to set the value. The second thing is that I need to add the dto to the ArrayList dtoList. How can I achieve this?
Here is the basic snippet I tried without using a HashMap, where the keys come from column, the values from multiList, and dtoList is where I finally add the Dto:
for(List<Dto> dtoList: column) {
if ("Key1".equalsIgnoreCase(column.getName())) {
index = dtoList.indexOf(column);
}
}
for(List<String> listoflists: multiList) {
if(listoflists.contains(index)) {
for(String s: listoflists) {
dto.setKey1(s);
}
dtoList.add(dto);
}
}
See https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html
Stream operations are divided into intermediate and terminal operations, and are combined to form stream pipelines. A stream pipeline consists of a source (such as a Collection, an array, a generator function, or an I/O channel); followed by zero or more intermediate operations such as Stream.filter or Stream.map; and a terminal operation such as Stream.forEach or Stream.reduce.
So in your snippet above, filter isn't really doing anything. To trigger it, you'd add a collect operation at the end. Notice that the filter lambda function needs to return a boolean for your code to compile in the first place.
mapData.entrySet().stream().filter(entry -> {
// do something here
return true;
}).collect(Collectors.toList());
Of course you don't need to abuse intermediate operations - or generate a bunch of new objects - for straightforward tasks, something like this should suffice:
mapData.entrySet().stream().forEach(entry -> {
// do something
});
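For the List<List<String>> values mentioned in the question, a hedged sketch of flattening them while building the result list might look like this (the setValues setter is hypothetical, since the exact DTO model isn't shown):

List<DTO> dtoList = mapData.entrySet().stream()
    .map(entry -> {
        DTO dto = new DTO();
        // flatten the List<List<String>> value into a single List<String>
        List<String> flattened = entry.getValue().stream()
            .flatMap(List::stream)
            .collect(Collectors.toList());
        dto.setValues(flattened); // hypothetical setter
        return dto;
    })
    .collect(Collectors.toList());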

Collect stream only if allMatch filter and process stream once in Java

I have the following stream code:
List<Data> results = items.stream()
.map(item -> requestDataForItem(item))
.filter(data -> data.isValid())
.collect(Collectors.toList());
Data requestDataForItem(Item item) {
// call another service here
}
The problem is that I want to call requestDataForItem only when all elements in the stream are valid. For example, if the first item is invalid, I don't want to make the call for any further element in the stream. There is .allMatch in the Stream API, but it returns a boolean. I want to do the same as .allMatch and then .collect the result when everything matched. Also, I want to process the stream only once; with two loops it is easy. Is this possible with the Java Streams API?
This would be a job for Java 9:
List<Data> results = items.stream()
.map(item -> requestDataForItem(item))
.takeWhile(data -> data.isValid())
.collect(Collectors.toList());
This operation will stop at the first invalid element. In a sequential execution, this implies that no subsequent requestDataForItem calls are made. In a parallel execution, some additional elements might get processed concurrently, before the operation stops, but that’s the price for efficient parallel processing.
In either case, the result list will only contain the elements before the first encountered invalid element and you can easily check using results.size() == items.size() whether all elements were valid.
In Java 8, there is no such simple method, and using an additional library or rolling your own implementation of takeWhile wouldn't pay off considering how simple the non-stream solution would be:
List<Data> results = new ArrayList<>();
for (Item item : items) {
    Data data = requestDataForItem(item);
    if (!data.isValid()) break;
    results.add(data);
}
You could theoretically use .allMatch then collect if .allMatch returns true, but then you'd be processing the collection twice. There's no way to do what you're trying to do with the streams API directly.
You could create a method to do this for you and simply pass your collection to it as opposed to using the stream API. This is slightly less elegant than using the stream API but more efficient as it processes the collection only once.
List<Data> results = getAllIfValid(
    items.stream()
         .map(item -> requestDataForItem(item))
         .collect(Collectors.toList())
);

public List<Data> getAllIfValid(List<Data> items) {
    List<Data> results = new ArrayList<>();
    for (Data d : items) {
        if (!d.isValid()) {
            return new ArrayList<>();
        }
        results.add(d);
    }
    return results;
}
This will return all the results if every element passes and only processes the items collection once. If any fail the isValid() check, it'll return an empty list as you want all or nothing. Simply check to see if the returned collection is empty to see whether or not all items passed the isValid() check.
Implement a two step process:
test if allMatch returns true.
If it does return true, do the collect with a second stream.
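A minimal sketch of that two-step idea (note it calls requestDataForItem twice per item, so it only makes sense if the call is cheap or its results are cached):

List<Data> results;
boolean allValid = items.stream()
    .map(item -> requestDataForItem(item))
    .allMatch(Data::isValid);
if (allValid) {
    results = items.stream()
        .map(item -> requestDataForItem(item))
        .collect(Collectors.toList());
} else {
    results = Collections.emptyList();
}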
Try this.
List<Data> result = new ArrayList<>();
boolean allValid = items.stream()
.map(item -> requestDataForItem(item))
.allMatch(data -> data.isValid() && result.add(data));
if (!allValid)
result.clear();

Java stream use in loop

I am quite new to java streams. Do I need to re-create the stream each time in this loop or is there a better way to do this? Creating the stream once and using the .noneMatch twice results in "stream already closed" exception.
for (ItemSetNode itemSetNode : itemSetNodeList)
{
    Stream<Id> allUserNodesStream = allUserNodes.stream().map(n -> n.getNodeId());
    Id nodeId = itemSetNode.getNodeId();

    // if none of the user node ids match the node id, the user is missing the node
    if (allUserNodesStream.noneMatch(userNode -> userNode.compareTo(nodeId) == 0))
    {
        isUserMissingNode = true;
        break;
    }
}
Thank you !
I would suggest you make a list of all the user ids outside the loop. Just make sure the class Id overrides the equals() method.
List<Id> allUsersIds = allUserNodes.stream().map(n -> n.getNodeId()).collect(Collectors.toList());
for (ItemSetNode itemSetNode : itemSetNodeList)
{
    Id nodeId = itemSetNode.getNodeId();
    if (!allUsersIds.contains(nodeId))
    {
        isUserMissingNode = true;
        break;
    }
}
The following code should be equivalent, except that the value of the boolean is reversed so it's false if there are missing nodes.
First all the user node Ids are collected to a TreeSet (if Id implements hashCode() and equals() you should use a HashSet). Then we stream itemSetNodeList to see if all those nodeIds are contained in the set.
TreeSet<Id> all = allUserNodes
.stream()
.map(n -> n.getNodeId())
.collect(Collectors.toCollection(TreeSet::new));
boolean isAllNodes = itemSetNodeList
.stream()
.allMatch(n -> all.contains(n.getNodeId()));
There are many ways to write equivalent (at least to outside eyes) code, this uses a Set to improve the lookup so we don't need to keep iterating the allUserNodes collection constantly.
You want to avoid using a stream in a loop, because that will turn your algorithm into O(n²) when you're doing a linear loop and a linear stream operation inside it. This approach is O(n log n), for the linear stream operation and O(log n) TreeSet lookup. With a HashSet this goes down to just O(n), not that it matters much unless you're dealing with large amount of elements.
You also could do something like this:
Set<Id> allUserNodeIds = allUserNodes.stream()
.map(ItemSetNode::getNodeId)
.collect(Collectors.toCollection(TreeSet::new));
return itemSetNodeList.stream()
.anyMatch(n -> !allUserNodeIds.contains(n.getNodeId())); // or firstMatch
Or even:
Collectors.toCollection(() -> new TreeSet<>(new YourComparator()));
Terminal operations of a Stream, such as noneMatch(), close the Stream and make it not reusable.
If you need to reuse this Stream :
Stream<Id> allUserNodesStream = allUserNodes.stream().map( n -> n.getNodeId() );
just move it into a method :
public Stream<Id> getAllUserNodesStream(){
return allUserNodes.stream().map( n -> n.getNodeId());
}
and invoke it as you need it to create it :
if (getAllUserNodesStream().noneMatch( userNode -> userNode.compareTo( nodeId ) == 0 ))
Now remember that a stream pipeline still iterates over its source each time it runs. Performing the same iteration multiple times may not be desirable, so you should consider this point before instantiating the same stream multiple times.
As an alternative to creating multiple streams to detect a match with nodeId:
if (allUserNodesStream.noneMatch( userNode -> userNode.compareTo( nodeId ) == 0 ) ) {
isUserMissingNode = true;
break;
}
rather use a Set that contains all the ids of allUserNodes:
if (idsFromUserNodes.contains(nodeId)){
isUserMissingNode = true;
break;
}
It will make the logic simpler and the performance better.
Of course, it assumes that compareTo() is consistent with equals(), which is strongly recommended (though not required).
This takes each item from itemSetNodeList and checks, using noneMatch(), whether its id is present among the user node ids. If it is not present, noneMatch() returns true, so anyMatch() stops the search and isUserMissing becomes true. If every item is found, isUserMissing is false.
boolean isUserMissing = itemSetNodeList.stream()
    .anyMatch(itemSetNode -> allUserNodes.stream()
        .map(n -> n.getNodeId())
        .noneMatch(id -> id.compareTo(itemSetNode.getNodeId()) == 0));

Amazon s3 returns only 1000 entries for one bucket and all for another bucket (using java sdk)?

I am using the code below to get a list of all file names from an S3 bucket. I have two buckets in S3. For one of the buckets the code returns all the file names (more than 1000), but the same code returns only 1000 file names for the other bucket. I just don't get what is happening.
Why does the same code work for one bucket and not for the other?
Also, my bucket has a hierarchical structure: folder/filename.jpg.
ObjectListing objects = s3.listObjects("bucket.new.test");
do {
    for (S3ObjectSummary objectSummary : objects.getObjectSummaries()) {
        String key = objectSummary.getKey();
        System.out.println(key);
    }
    objects = s3.listNextBatchOfObjects(objects);
} while (objects.isTruncated());
Improving on @Abhishek's answer. This code is slightly shorter and the variable names are fixed.
You have to get the object listing, add its contents to the collection, then get the next batch of objects from the listing. Repeat the operation until the listing is no longer truncated.
List<S3ObjectSummary> keyList = new ArrayList<S3ObjectSummary>();
ObjectListing objects = s3.listObjects("bucket.new.test");
keyList.addAll(objects.getObjectSummaries());

while (objects.isTruncated()) {
    objects = s3.listNextBatchOfObjects(objects);
    keyList.addAll(objects.getObjectSummaries());
}
For Scala developers, here is a recursive function to execute a full scan and map of the contents of an AmazonS3 bucket using the official AWS SDK for Java:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{S3ObjectSummary, ObjectListing, GetObjectRequest}
import scala.collection.JavaConversions.{collectionAsScalaIterable => asScala}
def map[T](s3: AmazonS3Client, bucket: String, prefix: String)(f: (S3ObjectSummary) => T) = {

  def scan(acc: List[T], listing: ObjectListing): List[T] = {
    val summaries = asScala[S3ObjectSummary](listing.getObjectSummaries())
    val mapped = (for (summary <- summaries) yield f(summary)).toList

    if (!listing.isTruncated) mapped.toList
    else scan(acc ::: mapped, s3.listNextBatchOfObjects(listing))
  }

  scan(List(), s3.listObjects(bucket, prefix))
}
To invoke the above curried map() function, simply pass the already constructed (and properly initialized) AmazonS3Client object (refer to the official AWS SDK for Java API Reference), the bucket name and the prefix name in the first parameter list. Also pass the function f() you want to apply to map each object summary in the second parameter list.
For example
val keyOwnerTuples = map(s3, bucket, prefix)(s => (s.getKey, s.getOwner))
will return the full list of (key, owner) tuples in that bucket/prefix
or
map(s3, "bucket", "prefix")(s => println(s))
as you would normally do with monads in functional programming.
I have just changed the above code to use addAll instead of a for loop to add objects one by one, and it worked for me:
List<S3ObjectSummary> keyList = new ArrayList<S3ObjectSummary>();
ObjectListing object = s3.listObjects("bucket.new.test");
keyList = object.getObjectSummaries();
object = s3.listNextBatchOfObjects(object);

while (object.isTruncated()) {
    keyList.addAll(object.getObjectSummaries());
    object = s3.listNextBatchOfObjects(object);
}
keyList.addAll(object.getObjectSummaries());
After that you can simply use any iterator over list keyList.
An alternative way is to use a recursive method:
/**
 * A recursive method to wrap the {@link AmazonS3} listObjectsV2 method.
 * <p>
 * By default, ListObjectsV2 can only return some or all (UP TO 1,000) of the objects in a bucket per request.
 * Ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
 * <p>
 * However, this method can return unlimited {@link S3ObjectSummary} for each request.
 *
 * @param request
 * @return
 */
private List<S3ObjectSummary> getS3ObjectSummaries(final ListObjectsV2Request request) {
    final ListObjectsV2Result result = s3Client.listObjectsV2(request);
    final List<S3ObjectSummary> resultSummaries = result.getObjectSummaries();
    if (result.isTruncated() && isNotBlank(result.getNextContinuationToken())) {
        final ListObjectsV2Request nextRequest = request.withContinuationToken(result.getNextContinuationToken());
        final List<S3ObjectSummary> nextResultSummaries = this.getS3ObjectSummaries(nextRequest);
        resultSummaries.addAll(nextResultSummaries);
    }
    return resultSummaries;
}
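A hedged usage sketch, assuming an initialized s3Client field as above; the bucket and prefix names are placeholders:

ListObjectsV2Request request = new ListObjectsV2Request()
        .withBucketName("my-bucket")     // placeholder bucket name
        .withPrefix("some/prefix/");     // placeholder prefix
List<S3ObjectSummary> allSummaries = getS3ObjectSummaries(request);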
If you want to get all objects (more than 1000 keys), you need to send another request to S3 with the last key as the marker. Here is the code.
private static String lastKey = "";
private static String preLastKey = "";
...
do {
    preLastKey = lastKey;
    AmazonS3 s3 = new AmazonS3Client(new ClasspathPropertiesFileCredentialsProvider());
    String bucketName = "bucketname";
    ListObjectsRequest lstRQ = new ListObjectsRequest().withBucketName(bucketName).withPrefix("");
    lstRQ.setMarker(lastKey);
    ObjectListing objectListing = s3.listObjects(lstRQ);
    // loop over the files in this batch
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        // get the object and do something.....
        lastKey = objectSummary.getKey(); // remember the last key for the next request
    }
} while (!lastKey.equals(preLastKey));
In Scala:
val first = s3.listObjects("bucket.new.test")
val listings: Seq[ObjectListing] = Iterator.iterate(Option(first))(_.flatMap(listing =>
if (listing.isTruncated) Some(s3.listNextBatchOfObjects(listing))
else None
))
.takeWhile(_.nonEmpty)
.toList
.flatten
Paolo Angioletti's code can't get all the data, only the last batch of data.
I think it might be better to use ListBuffer.
This method does not support setting startAfterKey.
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ObjectListing, S3ObjectSummary}
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer
def map[T](s3: AmazonS3Client, bucket: String, prefix: String)(f: (S3ObjectSummary) => T): List[T] = {

  def scan(acc: ListBuffer[T], listing: ObjectListing): List[T] = {
    val r = acc ++= listing.getObjectSummaries.asScala.map(f).toList
    if (listing.isTruncated) scan(r, s3.listNextBatchOfObjects(listing))
    else r.toList
  }

  scan(ListBuffer.empty[T], s3.listObjects(bucket, prefix))
}
The second method is to use awssdk-v2
<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>s3</artifactId>
<version>2.1.0</version>
</dependency>
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{ListObjectsV2Request, S3Object}
import scala.collection.JavaConverters._
def listObjects[T](s3: S3Client, bucket: String,
prefix: String, startAfter: String)(f: (S3Object) => T): List[T] = {
val request = ListObjectsV2Request.builder()
.bucket(bucket).prefix(prefix)
.startAfter(startAfter).build()
s3.listObjectsV2Paginator(request)
.asScala
.flatMap(_.contents().asScala)
.map(f)
.toList
}
By default the API returns up to 1,000 key names. The response might contain fewer keys but will never contain more.
A better implementation would be to use the newer ListObjectsV2 API:
List<S3ObjectSummary> docList = new ArrayList<>();
ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(bucketName).withPrefix(folderFullPath);
ListObjectsV2Result listing;

do {
    listing = this.getAmazonS3Client().listObjectsV2(req);
    docList.addAll(listing.getObjectSummaries());
    String token = listing.getNextContinuationToken();
    req.setContinuationToken(token);
    LOG.info("Next Continuation Token for listing documents is :" + token);
} while (listing.isTruncated());
The code given by @oferei works well and I upvoted it. But I want to point out the root issue with @Abhishek's code: the problem is with the do-while loop.
If you observe carefully, you fetch the next batch of objects in the second-to-last statement and only then check whether you have exhausted the total list of files. So when you fetch the last batch, isTruncated() becomes false, you break out of the loop, and you never process the last (total % 1000) records. For example, if you had 2123 records in total, you would end up fetching 1000 and then 1000, i.e. 2000 records. You would miss 123 records, because isTruncated breaks the loop even though you fetch the next batch only after checking it.
Apologies, I can't post a comment, else I would have commented on the upvoted answer.
The reason you are getting only the first 1000 objects is that this is how listObjects is designed to work.
This is from its JavaDoc
Returns some or all (up to 1,000) of the objects in a bucket with each request.
You can use the request parameters as selection criteria to return a subset of the objects in a bucket.
A 200 OK response can contain valid or invalid XML. Make sure to design your application to parse the contents of the response and handle it appropriately.
Objects are returned sorted in an ascending order of the respective key names in the list. For more information about listing objects, see Listing object keys programmatically
To get paginated results automatically, use listObjectsV2Paginator method
ListObjectsV2Request listReq = ListObjectsV2Request.builder()
.bucket(bucketName)
.maxKeys(1)
.build();
ListObjectsV2Iterable listRes = s3.listObjectsV2Paginator(listReq);
// Helper method to work with paginated collection of items directly
listRes.contents().stream()
.forEach(content -> System.out.println(" Key: " + content.key() + " size = " + content.size()));
You can opt for manual pagination as well if needed.
Reference: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/pagination.html
