How to generate report from huge mongodb document [duplicate]

How to generate report from huge mongodb document [duplicate] - java

Using the code:
all_reviews = db_handle.find().sort('reviewDate', pymongo.ASCENDING)
print all_reviews.count()
print all_reviews[0]
print all_reviews[2000000]
The count prints 2043484, and it prints all_reviews[0].
However when printing all_reviews[2000000], I get the error:
pymongo.errors.OperationFailure: database error: Runner error: Overflow sort stage buffered data usage of 33554495 bytes exceeds internal limit of 33554432 bytes
How do I handle this?

You're running into the 32MB limit on an in-memory sort:
https://docs.mongodb.com/manual/reference/limits/#Sort-Operations
Add an index to the sort field. That allows MongoDB to stream documents to you in sorted order, rather than attempting to load them all into memory on the server and sort them in memory before sending them to the client.

As said by kumar_harsh in the comments section, i would like to add another point.
You can view the current buffer usage using the below command over the admin database:
> use admin
switched to db admin
> db.runCommand( { getParameter : 1, "internalQueryExecMaxBlockingSortBytes" : 1 } )
{ "internalQueryExecMaxBlockingSortBytes" : 33554432, "ok" : 1 }
It has a default value of 32 MB(33554432 bytes).In this case you're running short of buffer data so you can increase buffer limit with your own defined optimal value, example 50 MB as below:
> db.adminCommand({setParameter: 1, internalQueryExecMaxBlockingSortBytes:50151432})
{ "was" : 33554432, "ok" : 1 }
We can also set this limit permanently by the below parameter in the mongodb config file:
setParameter=internalQueryExecMaxBlockingSortBytes=309715200
Hope this helps !!!
Note:This commands supports only after version 3.0 +

solved with indexing
db_handle.ensure_index([("reviewDate", pymongo.ASCENDING)])

If you want to avoid creating an index (e.g. you just want a quick-and-dirty check to explore the data), you can use aggregation with disk usage:
all_reviews = db_handle.aggregate([{$sort: {'reviewDate': 1}}], {allowDiskUse: true})
(Not sure how to do this in pymongo, though).

JavaScript API syntax for the index:
db_handle.ensureIndex({executedDate: 1})

In my case, it was necessary to fix nessary indexes in code and recreate them:
rake db:mongoid:create_indexes RAILS_ENV=production
As the memory overflow does not occur when there is a needed index of field.
PS Before this I had to disable the errors when creating long indexes:
# mongo
MongoDB shell version: 2.6.12
connecting to: test
> db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )
Also may be needed reIndex:
# mongo
MongoDB shell version: 2.6.12
connecting to: test
> use your_db
switched to db your_db
> db.getCollectionNames().forEach( function(collection){ db[collection].reIndex() } )

Related

Cassandra, Java and MANY Async request : is this good?

I'm developping a Java application with Cassandra with my table :
id | registration | name
1 1 xxx
1 2 xxx
1 3 xxx
2 1 xxx
2 2 xxx
... ... ...
... ... ...
100,000 34 xxx
My tables have very large amount of rows (more than 50,000,000). I have a myListIds of String id to iterate over. I could use :
SELECT * FROM table WHERE id IN (1,7,18, 34,...,)
//image more than 10,000,000 numbers in 'IN'
But this is a bad pattern. So instead I'm using async request this way :
List<ResultSetFuture> futures = new ArrayList<>();
Map<String, ResultSetFuture> map = new HashMap<>();
// map : key = id & value = data from Cassandra
for (String id : myListIds)
{
ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(id));
mapFutures.put(id, resultSetFuture);
}
Then I will process my data with getUninterruptibly() method.
Here is my problems : I'm doing maybe more than 10,000,000 Casandra request (one request for each 'id'). And I'm putting all these results inside a Map.
Can this cause heap memory error ? What's the best way to deal with that ?
Thank you

Note: your question is "is this a good design pattern".
If you are having to perform 10,000,000 cassandra data requests then you have structured your data incorrectly. Ultimately you should design your database from the ground up so that you only ever have to perform 1-2 fetches.
Now, granted, if you have 5000 cassandra nodes this might not be a huge problem(it probably still is) but it still reeks of bad database design. I think the solution is to take a look at your schema.

I see the following problems with your code:
Overloaded Cassandra cluster, it won't be able to process so many async requests, and you requests will be failed with NoHostAvailableException
Overloaded cassadra driver, your client app will fails with IO exceptions, because system will not be able process so many async requests.(see details about connection tuning https://docs.datastax.com/en/developer/java-driver/3.1/manual/pooling/)
And yes, memory issues are possible. It depends on the data size
Possible solution is limit number of async requests and process data by chunks.(E.g see this answer )

MongoDB: Query has implicit limit(256)?

I've created (in code) a default collection in MongoDB and am querying it, and have discovered that while the code will return all the data when I run it locally, it won't when I query it on a deployment server. It returns a maximum of 256 records.
Notes:
This is not a capped collection.
Locally, I'm running 3.2.5, the remote MongoDB version is 2.4.12
I am not using the limit parameter. When I use it, I can limit both the local and deployment server, but the deployment server will still never return more than 256 records.
The amount of data being fetched from the server is <500K. Nothing huge.
The code is in Clojure, using Monger, which itself just calls the Java com.mongodb stuff.
I can pull in more than 256 records from the remote server using Robomongo though I'm not sure how it does this, as I cannot connect to the remote from the command line (auth failed using the same credentials, so I'm guessing version incompatibility there).
Any help is appreciated.
UPDATE: Found the thing that triggers the problem: When I sort the output, it reduces the output to 256—but only when I pull from Mongo 2.4! I don't know if this is a MongoDB itself, the MongoDB java class, or Monger, but here is the code that illustrates the issue, as simple as I could make it:
(ns mdbtest.core
(:require [monger.core :as mg]
[monger.query :as mq]))
(defn get-list []
(let [coll (mq/with-collection
(mg/get-db
(mg/connect {:host "old-mongo"}) "mydb") "saves"
(mq/sort (array-map :createdDate -1)))] ;;<<==remove sort
coll))

You need to specify a bigger batch-size, the default is 256 records.
Here's an example from my own code:
=> (count (with-db (q/find {:keywords "lisa"})
(q/sort {:datetime 1}) ))
256
=> (count (with-db (q/find {:keywords "lisa"})
(q/sort {:datetime 1})
(q/batch-size 1000) ))
688
See more info here: http://clojuremongodb.info/articles/querying.html#setting_batch_size

Spark SQL fails because "Constant pool has grown past JVM limit of 0xFFFF"

I am running this code on EMR 4.6.0 + Spark 1.6.1 :
val sqlContext = SQLContext.getOrCreate(sc)
val inputRDD = sqlContext.read.json(input)
try {
inputRDD.filter("`first_field` is not null OR `second_field` is not null").toJSON.coalesce(10).saveAsTextFile(output)
logger.info("DONE!")
} catch {
case e : Throwable => logger.error("ERROR" + e.getMessage)
}
In the last stage of saveAsTextFile, it fails with this error:
16/07/15 08:27:45 ERROR codegen.GenerateUnsafeProjection: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown past JVM limit of 0xFFFF
/* 001 */
/* 002 */ public java.lang.Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] exprs) {
/* 003 */ return new SpecificUnsafeProjection(exprs);
/* 004 */ }
(...)
What could be the reason? Thanks

Solved this problem by dropping all the unused column in the Dataframe, or just filter columns you actually need.
Turnes out Spark Dataframe can not handle super wide schemas. There is no specific number of columns where Spark might break with “Constant pool has grown past JVM limit of 0xFFFF” - it depends on kind of query, but reducing number of columns can help to workaround this issue.
The underlying root cause is in JVM's 64kb for generated Java classes - see also Andrew's answer.

This is due to known limitation of Java for generated classes to go beyond 64Kb.
This limitation has been worked around in SPARK-18016 which is fixed in Spark 2.3 - will be released in Jan/2018.

For future reference, this issue was fixed in spark 2.3 (As Andrew noted).
If you encounter this issue on Amazon EMR, upgrade to release version 5.13 or above.

Inconsistent counter values between replicas in Cassandra

I've got a 3 machine Cassandra cluster using rack unaware placements strategy with a replication factor of 2.
The column family is defined as follows:
create column family UserGeneralStats with comparator = UTF8Type and default_validation_class = CounterColumnType;
Unfortunately after a few days of production use I got some inconsistent values for the counters:
Query on replica 1:
[default#StatsKeyspace] list UserGeneralStats['5261666978': '5261666978'];
Using default limit of 100
-------------------
RowKey: 5261666978
=> (counter=bandwidth, value=96545030198)
=> (counter=downloads, value=1013)
=> (counter=previews, value=10304)
Query on replica 2:
[default#StatsKeyspace] list UserGeneralStats['5261666978': '5261666978'];
Using default limit of 100
-------------------
RowKey: 5261666978
=> (counter=bandwidth, value=9140386229)
=> (counter=downloads, value=339)
=> (counter=previews, value=1321)
As the standard read repair mechanism doesn't seem to repair the values I tried to force an
anti-entropy repair using nodetool repair. It did't have any effect on the counter values.
Data inspection showed that the lower values for the counters are the correct ones so I suspect that either Cassandra (or Hector which I used as API to call Cassandra from Java) retried some increments.
Any ideas how to repair the data and possibly prevent the sittuation from happening again?

If neither RR nor repair fixes it, it's probably a bug.
Please upgrade to 0.8.3 (out today) and verify it's still present in that version, then you can file a ticket at https://issues.apache.org/jira/browse/CASSANDRA.

How do I limit the number of results when using the Java driver for mongo db?

http://api.mongodb.org/java/2.1/com/mongodb/DBCollection.html#find(com.mongodb.DBObject,com.mongodb.DBObject,int,int)
Using this with Grails and the mongo db plugin.
Here's the code I'm using... not sure why but the cursor is returning the entire set of data. In this case, I'm just trying to return the first 20 matches (with is_processed = false):
def limit = {
def count = 1;
def shape_cursor = mongo.shapes.find(new BasicDBObject("is_processed", false),new BasicDBObject(),0,20);
while(shape_cursor.hasNext()){
shape_cursor.next();
render "<div>" + count + "</div"
count++;
}
}
Anyone have an idea?

limit is a method of DBCursor: DBCursor.limit(n).
So you simply need to do
def shape_cursor = mongo.shapes.find(...).limit(20);

According to the JavaDoc you linked to the second int parameter is not the maximum number to return, but
batchSize - if positive, is the # of objects per batch sent back from the db. all objects that match will be returned. if batchSize < 0, its a hard limit, and only 1 batch will either batchSize or the # that fit in a batch
Maybe a negative number (-20) would do what you want, but I find the statement above too confusing to be sure about it, so I would set the batchSize to 20 and then filter in your application code.
Maybe file this as a bug / feature request. There should be a way to specify skip/limit that works just like on the shell interface. (Update: and there is, on the cursor class, see the other answer).

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to generate report from huge mongodb document [duplicate] - java

solved with indexing db_handle.ensure_index([("reviewDate", pymongo.ASCENDING)])

If you want to avoid creating an index (e.g. you just want a quick-and-dirty check to explore the data), you can use aggregation with disk usage: all_reviews = db_handle.aggregate([{$sort: {'reviewDate': 1}}], {allowDiskUse: true}) (Not sure how to do this in pymongo, though).

JavaScript API syntax for the index: db_handle.ensureIndex({executedDate: 1})

Related

Cassandra, Java and MANY Async request : is this good?

MongoDB: Query has implicit limit(256)?

Spark SQL fails because "Constant pool has grown past JVM limit of 0xFFFF"

Inconsistent counter values between replicas in Cassandra

How do I limit the number of results when using the Java driver for mongo db?

Categories

Resources