GAE: Exception while allocating 16 digit ids - java

I've noticed a problem with allocating IDs on Google App Engine while using the datastore. My application has a set of data that has to be uploaded initially. The data was prepared on a test App Engine environment, so it has auto-generated values for its ID fields. Since I want to preserve these values, I recreate the entities through the remote API with Objectify as a separate process. After the upload I want to make sure the used IDs are removed from the value range of the auto-generator, so I call DatastoreService.allocateIdRange with a range of a single long value. Everything works fine on the dev server, but on appspot I receive an "Exceeded maximum allocated IDs" IllegalArgumentException for some values (16-digit values).
Is there any limitation on allocateIdRange calls? (I have found none in the documentation.)
Below is the sample code I use to allocate the IDs in the datastore after the upload:
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
String kind = Key.getKind(clazz);
PreparedQuery query = datastore.prepare(new Query(kind).setKeysOnly());
for (Entity entity : query.asIterable()) {
    long id = entity.getKey().getId();
    KeyRange keyRange = new KeyRange(null, kind, id, id);
    DatastoreService.KeyRangeState state = datastore.allocateIdRange(keyRange);
}

This is a known issue with allocateIdRange(). A better error message would be "You can't call allocateIdRange() on scattered ids".
Scattered ids are the default since 1.8.1 and have values >= 2^52. Unfortunately we don't currently expose an API to reserve these ids.
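Given that, one workaround in line with this answer is to reserve only the ids that fall in the legacy range and skip the scattered ones. A minimal sketch (the method name and the skip policy are assumptions, not an official recommendation):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.KeyRange;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;

// Scattered ids start at 2^52; allocateIdRange() rejects them.
private static final long SCATTERED_ID_THRESHOLD = 1L << 52;

void reserveLegacyIds(DatastoreService datastore, String kind) {
    PreparedQuery query = datastore.prepare(new Query(kind).setKeysOnly());
    for (Entity entity : query.asIterable()) {
        long id = entity.getKey().getId();
        if (id >= SCATTERED_ID_THRESHOLD) {
            continue; // no public API to reserve scattered ids, so skip them
        }
        datastore.allocateIdRange(new KeyRange(null, kind, id, id));
    }
}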

It sounds like you may be trying to allocate an ID larger than the maximum allowed ID, which is limited by the largest integer size in JavaScript, 2^53.
Here is the page describing the App Engine limitation and the largest JavaScript integer.

Related

Spring Data JPA: Efficiently Query The Database for A Large Dataset

I have written an application to scrape a huge set of reviews. For each review I store the review itself in Review_Table(User_Id, Trail_Id, Rating), the user in Username(Id, Username, UserLink), and the Trail, which is built previously in the code (Id, ...60 other attributes).
for (Element card : reviewCards) {
    String userName = card.select("expression").text();
    String userLink = card.select("expression").attr("href");
    String userRatingString = card.select("expression").attr("aria-label");
    Double userRating;
    if (userRatingString.equals("NaN Stars")) {
        userRating = 0.0;
    } else {
        userRating = Double.parseDouble(userRatingString.replaceAll("[^0-9.]", ""));
    }
    User u;
    Rating r;
    // probably this is the bottleneck
    if (userService.getByUserLink(userLink) != null) {
        u = new User(userName, userLink, new HashSet<Rating>());
        r = Rating.builder()
                .user(u)
                .userRating(userRating)
                .trail(t)
                .build();
    } else {
        u = userService.getByUserLink(userLink);
        r = Rating.builder()
                .user(u)
                .userRating(userRating)
                .trail(t)
                .build();
    }
    i = i + 1;
    ratingSet.add(r);
    userSet.add(u);
}
saveToDb(userSet, t, link, ratingSet);
savedEntities = savedEntities + 1;
log.info(savedEntities + " Saved Entities");
}
The code works fine for small to medium sized datasets, but I hit a huge bottleneck for larger ones. Suppose I already have 13K user entities stored in the Postgres DB and another batch of 8500 reviews comes in to be scraped: I have to check, for every review, whether the user of that review is already stored. This takes forever.
I tried to define an index on the UserLink attribute in Postgres, but the speed didn't improve at all.
I tried to collect all the users stored in the DB into a set and use the contains method to check whether a particular user already exists in the set (this way I thought I could bypass the database bottleneck of ~8K reads and writes, but in a risky way, because if there were too many users in the table I would run into a memory overflow). The speed, again, didn't improve.
At this point I don't have any other idea to improve this.
Well, for one, you would certainly benefit from not querying for each user individually in a loop. What you can do is query and cache only the UserLink or UserName values, i.e. fetch and cache the complete set of just one of them, because that's all you seem to need for the check in the if-else.
You can query for individual fields with a Spring Data JPA @Query, either directly or with Spring Data JPA projections if you need a subset of fields, and cache and use them for the lookup (a sketch follows below). If you think the users could run into millions or billions, you could consider a distributed cache like Apache Ignite, where your collection could scale easily.
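A minimal sketch of that idea, assuming a hypothetical UserRepository and the User entity from the question; only the userLink column is fetched:

import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;

public interface UserRepository extends JpaRepository<User, Long> {
    // Single-field projection: no full User entities are materialized.
    @Query("select u.userLink from User u")
    List<String> findAllUserLinks();
}

With that in place, the per-review database call becomes a set lookup built once before the loop, e.g. Set<String> knownLinks = new HashSet<>(userRepository.findAllUserLinks()) and then knownLinks.contains(userLink) inside the loop.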
Btw, the if-else seems to be inverted, is it not?
Next, don't store each review individually, which the above code appears to imply; write in batches instead. Also, since you are using Postgres, you can use the CopyManager provided by Postgres for bulk data transfer, wiring it in through a Spring Data custom repository. So you can keep writing to a new text/CSV file locally on a set schedule (every x minutes), use it to write that batched text/CSV to the table (after those x minutes), and then remove the file. This would be really quick; see the sketch below.
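A sketch of the CopyManager route as a custom repository method; the table and column names are assumptions, and csvBatch is whatever has been accumulated in the local file:

import java.io.StringReader;
import java.sql.Connection;

import javax.sql.DataSource;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class UserBulkRepositoryImpl {

    private final DataSource dataSource;

    public UserBulkRepositoryImpl(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Streams a whole CSV batch into the table in a single COPY command.
    public long bulkInsertUsers(String csvBatch) throws Exception {
        try (Connection connection = dataSource.getConnection()) {
            CopyManager copyManager = new CopyManager(connection.unwrap(BaseConnection.class));
            return copyManager.copyIn(
                    "COPY users (user_name, user_link) FROM STDIN WITH (FORMAT csv)",
                    new StringReader(csvBatch));
        }
    }
}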
The other option is to write a stored procedure that combines the above and invoke it, again via a custom repository.
Please let me know which one you would like elaborated.
UPDATE (Jan 12 2022):
One other item I missed: when querying by UserLink or UserName, you can use a very efficient form of select that Postgres supports instead of an IN clause, like below:
#Select("select u from user u where u.userLink = ANY('{:userLinks}'::varchar[])", nativeQuery = true)
List<Users> getUsersByLinks(#Param("userLinks") String[] userLinks);
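A hypothetical usage fragment (scrapedLinks and the getUserLink() accessor are assumptions); one round trip covers a whole batch of scraped links:

String[] links = scrapedLinks.toArray(new String[0]);
Map<String, User> usersByLink = new HashMap<>();
for (User user : userRepository.getUsersByLinks(links)) {
    usersByLink.put(user.getUserLink(), user);
}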

Timeouts in datastore queries

I am using Objectify v5.1.11 from auto-scaled App Engine instances in the Java 8 runtime environment.
I have an API which IoT devices call periodically to upload statistics information. In this API, I insert an entity into the datastore to store the statistics. This entity uses the datastore's auto-generated IDs. The entity definition is as follows:
#Entity(name = "Stats")
public class StatsEntity {
#Id
private Long statisticsId;
#Index
private Long deviceId;
#Index
private String statsKey;
#Index
private Date creationTime;
}
But I had a requirement to check for duplicates before inserting the entity, so I switched to custom (String) IDs. I came up with a mechanism of appending the deviceId to the statsKey string provided by the device (unique for each statistic within a device) to generate the ID.
This avoids the eventual-consistency behaviour I would get if I used a query to check whether the entity already exists; since get-by-ID is strongly consistent, I can use it to check for duplicates.
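A sketch of that duplicate check with Objectify (the separator and the constructor are illustrative, and it assumes the @Id field has been changed to a String):

// Build the custom id and rely on the strongly consistent get-by-key.
String statsId = deviceId + "_" + statsKey;
StatsEntity existing = ofy().load().type(StatsEntity.class).id(statsId).now();
if (existing == null) {
    ofy().save().entity(new StatsEntity(statsId, deviceId, statsKey, new Date())).now();
}
// else: duplicate upload, skip the insert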
There is another API to fetch the statistics uploaded by a device. In this API, I list the entities by filtering on deviceId and ordering by creationTime in descending order (newest first) with a page size of 100. This request times out because it exceeds App Engine's 60s limit. I see the following exception in the logs:
Task was cancelled.
java.util.concurrent.CancellationException: Task was cancelled.
at com.google.common.util.concurrent.AbstractFuture.cancellationExceptionWithCause(AbstractFuture.java:1355)
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:555)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:436)
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:99)
at com.google.appengine.tools.development.TimedFuture.get(TimedFuture.java:42)
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
at com.google.appengine.api.utils.FutureWrapper.get(FutureWrapper.java:93)
at com.google.appengine.api.datastore.FutureHelper.getInternal(FutureHelper.java:69)
at com.google.appengine.api.datastore.FutureHelper.quietGet(FutureHelper.java:33)
at com.google.appengine.api.datastore.BaseQueryResultsSource.loadMoreEntities(BaseQueryResultsSource.java:243)
at com.google.appengine.api.datastore.BaseQueryResultsSource.loadMoreEntities(BaseQueryResultsSource.java:180)
at com.google.appengine.api.datastore.QueryResultIteratorImpl.ensureLoaded(QueryResultIteratorImpl.java:173)
at com.google.appengine.api.datastore.QueryResultIteratorImpl.hasNext(QueryResultIteratorImpl.java:70)
at com.googlecode.objectify.impl.KeysOnlyIterator.hasNext(KeysOnlyIterator.java:29)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:580)
at com.google.common.collect.TransformedIterator.hasNext(TransformedIterator.java:42)
at com.googlecode.objectify.impl.ChunkIterator.hasNext(ChunkIterator.java:39)
at com.google.common.collect.MultitransformedIterator.hasNext(MultitransformedIterator.java:50)
at com.google.common.collect.MultitransformedIterator.hasNext(MultitransformedIterator.java:50)
at com.google.common.collect.Iterators$PeekingImpl.hasNext(Iterators.java:1105)
at com.googlecode.objectify.impl.ChunkingIterator.hasNext(ChunkingIterator.java:51)
at com.ittiam.cvml.dao.repository.PerformanceStatsRepositoryImpl.list(PerformanceStatsRepositoryImpl.java:154)
at com.ittiam.cvml.service.PerformanceStatsServiceImpl.listPerformanceStats(PerformanceStatsServiceImpl.java:227)
The statsKey provided by the device is based on time and hence monotonically increasing (a step increase of 15 minutes), which is bad as per this link.
But my traffic is not large enough to warrant this behaviour: each device makes 2 to 3 requests every 15 minutes and there are about 300 devices.
When I try to list entities for devices which haven't made any request since I made the switch to custom IDs, I still observe this issue.
Edit
My code to list the entity is as follows:
Query<StatsEntity> query = ofy().load().type(StatsEntity.class);
List<StatsEntity> entityList = new ArrayList<StatsEntity>();
query = query.filter("deviceId", deviceId);
query = query.order("-creationTime");
query = query.limit(100);
QueryResultIterator<StatsEntity> iterator = query.iterator();
while (iterator.hasNext()) {
    entityList.add(iterator.next());
}
This error usually occurs because of write contention, which arises when multiple transactions write to and read from the same entity group concurrently.
There are various approaches to this problem:
A query in a normal request lives for only about 30 seconds, but you can extend that by converting your API into a task queue; for handling such write-contention issues you should generally use a task queue, which can run for around 10 minutes (see the sketch below).
If possible, make your entity group smaller.
You can find more approaches here.
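A minimal sketch of the task-queue route (the handler URL and parameter name are hypothetical): the request handler only enqueues the work, and the task itself gets roughly ten minutes to run the slow listing.

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// Enqueue the listing work instead of doing it inside the 60s request.
Queue queue = QueueFactory.getDefaultQueue();
queue.add(TaskOptions.Builder
        .withUrl("/tasks/list-stats")
        .param("deviceId", String.valueOf(deviceId))
        .method(TaskOptions.Method.POST));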
Hope this answers your question!!!

Is there another way to implement a search function without Elasticsearch in Java

I am a newbie developer on the web server side. I have developed an app for a shop to manage their orders. Here comes a question.
I have an order table like:
orderId, orderNumber …
and an orderProduct table like:
orderProductId, productId, productNumber, productName, productDescription.
I have a search function that gets all orders matching a search string.
The API is like:
GET /api/orders?productNumberSearch={searchStr}&productNameSearch={searchStr2}&productDescriptionSearch={searchStr3}
My implementation is like:
String queryStr1 = getParameterFromRequestWithDefault("productNumberSearch", "");
String queryStr2 = getParameterFromRequestWithDefault("productNameSearch", "");
String queryStr3 = getParameterFromRequestWithDefault("productDescriptionSearch", "");
List<OrderProduct> orderProducts = getAllOrderProductsFromDatabase();
List<Integer> filterOrderIds = orderProducts.stream()
        .filter(item -> item.getNumber().contains(queryStr1)
                && item.getName().contains(queryStr2)
                && item.getDescription().contains(queryStr3))
        .map(OrderProduct::getOrderId)
        .collect(Collectors.toList());
List<Order> orders = getOrdersByIds(filterOrderIds);
I use Spring MVC and MySQL. The code above works. However, if many requests arrive at the same time, an out-of-memory exception is thrown. And since there are Chinese characters in the database, MySQL full-text search does not work well.
So is there another way to implement the search function without Elasticsearch?
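One direction, sketched under assumed field and repository names (Spring Data JPA on top of the existing MySQL schema), is to push the three substring filters into the database so the whole orderProduct table never has to be loaded into memory; LIKE does plain substring matching, so it also works on Chinese text:

import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface OrderProductRepository extends JpaRepository<OrderProduct, Integer> {

    // Only the matching order ids come back from MySQL.
    @Query("select p.orderId from OrderProduct p "
            + "where p.productNumber like concat('%', :number, '%') "
            + "and p.productName like concat('%', :name, '%') "
            + "and p.productDescription like concat('%', :description, '%')")
    List<Integer> findOrderIds(@Param("number") String number,
                               @Param("name") String name,
                               @Param("description") String description);
}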

Dynamodb AWS Java scan withLimit is not working

I am trying to use DynamoDBScanExpression with a limit of 1, using Java aws-sdk version 1.11.140.
Even if I use .withLimit(1), i.e.
List<DomainObject> result = mapper.scan(DomainObject.class, new DynamoDBScanExpression().withLimit(1));
it returns the list of all entries, i.e. 7. Am I doing something wrong?
P.S. I tried the CLI for this query, and
aws dynamodb scan --table-name auditlog --limit 1 --endpoint-url http://localhost:8000
returns just 1 result.
DynamoDBMapper.scan returns a PaginatedScanList; paginated results are loaded on demand when you execute an operation that requires them. Some operations, such as size(), must fetch the entire list, but results are lazily fetched page by page when possible.
Hence, the limit parameter set on DynamoDBScanExpression is the maximum number of items to be fetched per page.
So in your case a PaginatedList is returned, and when you call size() on it, it attempts to load all items from DynamoDB; under the hood the items are loaded one per page (each page is a fetch request to DynamoDB) until it gets to the end of the PaginatedList.
Since you're only interested in the first result, a good way to get it without fetching all 7 items from DynamoDB would be:
Iterator<DomainObject> it = mapper.scan(DomainObject.class, new DynamoDBScanExpression().withLimit(1)).iterator();
if (it.hasNext()) {
    DomainObject dob = it.next();
}
With the above code, only the first item will be fetched from DynamoDB.
The takeaway is that the limit parameter in DynamoDBScanExpression is used for pagination purposes only: it is a limit on the number of items per page, not a limit on the number of pages that can be requested.

How to get entries from the second level query cache?

In my grails application, I want to display all the current entries of the second-level cache from all regions.
My code is as follows:
def getCacheStats() {
    StatisticsImpl stats = sessionFactory.statistics
    for (regionName in stats.secondLevelCacheRegionNames) {
        log.debug stats.getSecondLevelCacheStatistics(regionName).entries
    }
}
Everything works fine as long as the region name is not org.hibernate.cache.StandardQueryCache (the region used for the query cache). In that case, an exception is thrown:
java.lang.ClassCastException: org.hibernate.cache.QueryKey cannot be cast to org.hibernate.cache.CacheKey
Having googled around, I didn't find any clues about how to display the list of entries of the cached query result sets associated with the StandardQueryCache and UpdateTimestampsCache regions.
Could you please help me find a solution for this?
It's fairly complicated, but this should get you further. You can access the query cache via the SessionFactory, so assuming you have access to that (e.g. via 'def sessionFactory') you can get to the underlying caches like this:
def cache = sessionFactory.queryCache
def realCache = cache.region.@underlyingCache.backingCache
def keys = realCache.keys
for (key in keys) {
    def value = realCache.get(key).value
    // do something with the value
}
Note that the values will be a List of Long values. I'm not sure what the first one signifies (it's a large value, e.g. 5219682970079232), but the remaining are the IDs of the cached domain class instances.
