I've created (in code) a default collection in MongoDB and am querying it, and have discovered that while the code will return all the data when I run it locally, it won't when I query it on a deployment server. It returns a maximum of 256 records.
Notes:
This is not a capped collection.
Locally I'm running 3.2.5; the remote MongoDB version is 2.4.12.
I am not using the limit parameter. When I use it, I can limit both the local and deployment server, but the deployment server will still never return more than 256 records.
The amount of data being fetched from the server is <500K. Nothing huge.
The code is in Clojure, using Monger, which itself just calls the Java com.mongodb stuff.
I can pull more than 256 records from the remote server using Robomongo, though I'm not sure how it does this. I cannot connect to the remote from the command line (auth fails with the same credentials, so I'm guessing there is a version incompatibility).
Any help is appreciated.
UPDATE: Found the thing that triggers the problem: when I sort the output, it is cut down to 256 records, but only when I pull from Mongo 2.4! I don't know if this is MongoDB itself, the MongoDB Java driver, or Monger, but here is the code that illustrates the issue, as simple as I could make it:
(ns mdbtest.core
  (:require [monger.core :as mg]
            [monger.query :as mq]))

(defn get-list []
  (let [coll (mq/with-collection
               (mg/get-db
                 (mg/connect {:host "old-mongo"}) "mydb") "saves"
               (mq/sort (array-map :createdDate -1)))] ;;<<==remove sort
    coll))
You need to specify a larger batch size; the default is 256 records.
Here's an example from my own code:
=> (count (with-db (q/find {:keywords "lisa"})
(q/sort {:datetime 1}) ))
256
=> (count (with-db (q/find {:keywords "lisa"})
(q/sort {:datetime 1})
(q/batch-size 1000) ))
688
See more info here: http://clojuremongodb.info/articles/querying.html#setting_batch_size
I've got a streaming Dataflow pipeline, written in Java with Apache Beam 2.35. It commits data to BigQuery via the Storage Write API. Initially the code looks like this:
BigQueryIO.writeTableRows()
.withTimePartitioning(/* some column */)
.withClustering(/* another column */)
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
.withTriggeringFrequency(Duration.standardSeconds(30))
.withNumStorageWriteApiStreams(20) // want to make this dynamic
This code runs in different environments, e.g. Dev and Prod. When I deploy to Dev I want 2 Storage Write API streams and in Prod I want 20, and I'm trying to pass these values in at deploy time with Cloud Build.
The cloudbuild-dev.yaml looks like
steps:
- lots-of-steps
args:
--numStorageWriteApiStreams=${_NUM_STORAGEWRITEAPI_STREAMS}
substitutions:
_PROJECT: dev-project
_NUM_STORAGEWRITEAPI_STREAMS: '2'
I expose the substitution in the job code with an interface
ValueProvider<String> getNumStorageWriteApiStreams();
void setNumStorageWriteApiStreams(ValueProvider<String> numStorageWriteApiStreams);
I then refactor the writeTableRows() call to invoke getNumStorageWriteApiStreams()
BigQueryIO.writeTableRows()
.withTimePartitioning(/* some column */)
.withClustering(/* another column */)
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
.withTriggeringFrequency(Duration.standardSeconds(30))
.withNumStorageWriteApiStreams(Integer.parseInt(String.valueOf(options.getNumStorageWriteApiStreams())))
Now it's dynamic but I get a build failure on account of java.lang.IllegalArgumentException: methods with same signature getNumStorageWriteApiStreams() but incompatible return types: [class java.lang.Integer, interface org.apache.beam.sdk.options.ValueProvider]
My understanding was that Integer.parseInt returns an int, which I want so I can pass it to withNumStorageWriteApiStreams() which requires an int.
I'd appreciate any help I can get here, thanks.
Turns out BigQueryOptions.java already has a method getNumStorageWriteApiStreams() that returns an Integer. I was unknowingly trying to redeclare it with a different return type, oops.
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java#L95-L98
I'm developing a Java application with Cassandra. My table looks like this:
id      | registration | name
1       | 1            | xxx
1       | 2            | xxx
1       | 3            | xxx
2       | 1            | xxx
2       | 2            | xxx
...     | ...          | ...
...     | ...          | ...
100,000 | 34           | xxx
My tables have a very large number of rows (more than 50,000,000). I have a list of String ids, myListIds, to iterate over. I could use:
SELECT * FROM table WHERE id IN (1,7,18, 34,...,)
// imagine more than 10,000,000 numbers in 'IN'
But this is a bad pattern, so instead I'm using async requests this way:
List<ResultSetFuture> futures = new ArrayList<>();
Map<String, ResultSetFuture> mapFutures = new HashMap<>();
// mapFutures: key = id, value = future holding the data from Cassandra
for (String id : myListIds)
{
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(id));
    mapFutures.put(id, resultSetFuture);
}
Then I will process my data with the getUninterruptibly() method.
Here is my problem: I'm making possibly more than 10,000,000 Cassandra requests (one request for each 'id'), and I'm putting all the results inside a Map.
Can this cause a heap memory error? What's the best way to deal with that?
Thank you
Note: your question is "is this a good design pattern".
If you have to perform 10,000,000 Cassandra data requests, then you have structured your data incorrectly. Ultimately you should design your database from the ground up so that you only ever have to perform 1-2 fetches.
Now, granted, if you have 5000 Cassandra nodes this might not be a huge problem (it probably still is), but it still reeks of bad database design. I think the solution is to take a look at your schema.
I see the following problems with your code:
An overloaded Cassandra cluster: it won't be able to process so many async requests, and your requests will fail with NoHostAvailableException.
An overloaded Cassandra driver: your client app will fail with IO exceptions, because the system will not be able to process so many async requests (see details about connection tuning: https://docs.datastax.com/en/developer/java-driver/3.1/manual/pooling/).
And yes, memory issues are possible. It depends on the data size.
A possible solution is to limit the number of async requests and process the data in chunks (e.g. see this answer).
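The chunking idea can be sketched as follows. This is a minimal illustration in Python rather than the Java driver from the question, and it makes no claim about the driver's API: fetch_in_chunks, fake_fetch and the chunk_size and max_workers values are all hypothetical names and numbers chosen for the example. The point is only that each chunk's results are handled and dropped before the next chunk is requested, so memory use is bounded by the chunk size rather than by the total number of ids.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_in_chunks(ids, fetch_one, handle_result, chunk_size=1000, max_workers=32):
    """Call fetch_one(id) for every id, with at most chunk_size requests pending."""
    processed = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(ids), chunk_size):
            chunk = ids[start:start + chunk_size]
            # pool.map submits the whole chunk, then yields results in input order.
            for item_id, rows in zip(chunk, pool.map(fetch_one, chunk)):
                # Process each result right away instead of storing it in a big map.
                handle_result(item_id, rows)
                processed += 1
    return processed


def fake_fetch(item_id):
    # Hypothetical stand-in for a real query such as one bound statement per id.
    return [(item_id, registration, "xxx") for registration in range(1, 4)]


row_counts = {}
total = fetch_in_chunks(
    [str(i) for i in range(1, 26)],
    fake_fetch,
    lambda item_id, rows: row_counts.update({item_id: len(rows)}),
    chunk_size=10,
    max_workers=4,
)
```

Here the callback only records a row count per id; in a real job it would write the rows out or aggregate them, so that nothing larger than one chunk is ever held at once.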
Following is what I am doing.
I am using the Mule MS Dynamics connector to create a contact
I get records from mysql Database (Inserted from source file)
Transform it to CRM specific object in dataweave
This works for over 10 million records, but for a few hundred records I am getting the following error:
Problem writing SAAJ model to stream: Invalid white space character (0x1f) in text to output (in xml 1.1, could output as a character entity)
With some research I found out that (0x1f) represents US "Unit separator".
I tried replacing this character in my dataweave like this
%var replaceSaaj = (x) -> (x replace /\"0x1f"/ with "" default "")
but the issue persists.
I even tried to look for these characters in my source file and database with no luck.
I am aware that this connector internally uses SOAP services.
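One thing worth checking is the pattern itself: as written, it appears to match the literal text "0x1f" (quotes included) rather than the control character U+001F. Assuming DataWeave regexes follow Java regex syntax, the character would be written with an escape such as \x1f or \u001F. The sketch below shows the same idea in Python, not DataWeave; strip_invalid_xml_chars is a hypothetical helper name, and the character class covers every control character that XML 1.0 disallows in text, not just 0x1f.

```python
import re

# Control characters that XML 1.0 does not allow in text:
# everything below 0x20 except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_XML_INVALID_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_chars(value):
    """Return value without XML-invalid control characters; None becomes ''."""
    if value is None:
        return ""
    return _XML_INVALID_CONTROL.sub("", value)


# "\x1f" is the unit separator character named in the error message.
cleaned = strip_invalid_xml_chars("John\x1fSmith")
```

Because the unit separator is invisible in most editors and database clients, searching the source data for the visible text "0x1f" would not find it either; searching for the actual character (for example with a hex escape) is more likely to turn up the affected records.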
Using the code:
all_reviews = db_handle.find().sort('reviewDate', pymongo.ASCENDING)
print all_reviews.count()
print all_reviews[0]
print all_reviews[2000000]
The count prints 2043484, and it prints all_reviews[0].
However when printing all_reviews[2000000], I get the error:
pymongo.errors.OperationFailure: database error: Runner error: Overflow sort stage buffered data usage of 33554495 bytes exceeds internal limit of 33554432 bytes
How do I handle this?
You're running into the 32MB limit on an in-memory sort:
https://docs.mongodb.com/manual/reference/limits/#Sort-Operations
Add an index to the sort field. That allows MongoDB to stream documents to you in sorted order, rather than attempting to load them all into memory on the server and sort them in memory before sending them to the client.
As kumar_harsh said in the comments section, I would like to add another point.
You can view the current buffer usage using the below command over the admin database:
> use admin
switched to db admin
> db.runCommand( { getParameter : 1, "internalQueryExecMaxBlockingSortBytes" : 1 } )
{ "internalQueryExecMaxBlockingSortBytes" : 33554432, "ok" : 1 }
It has a default value of 32 MB (33554432 bytes). In this case the sort is running out of buffer space, so you can raise the buffer limit to a value that suits your workload, for example about 50 MB as below:
> db.adminCommand({setParameter: 1, internalQueryExecMaxBlockingSortBytes:50151432})
{ "was" : 33554432, "ok" : 1 }
We can also set this limit permanently by the below parameter in the mongodb config file:
setParameter=internalQueryExecMaxBlockingSortBytes=309715200
Hope this helps !!!
Note: This command is supported only in version 3.0 and later.
Solved with indexing:
db_handle.ensure_index([("reviewDate", pymongo.ASCENDING)])
If you want to avoid creating an index (e.g. you just want a quick-and-dirty check to explore the data), you can use aggregation with disk usage:
all_reviews = db_handle.aggregate([{$sort: {'reviewDate': 1}}], {allowDiskUse: true})
(Not sure how to do this in pymongo, though).
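For what it's worth, a rough PyMongo equivalent might look like the sketch below. It assumes PyMongo 3 or later, where Collection.aggregate takes options such as allowDiskUse as keyword arguments and returns a cursor. sorted_reviews is a hypothetical helper name, and FakeCollection is only a stand-in so the snippet runs without a MongoDB server; with a real collection you would pass db_handle instead.

```python
def sorted_reviews(collection, field="reviewDate"):
    """Return a cursor over the documents sorted by field, allowing disk use."""
    pipeline = [{"$sort": {field: 1}}]  # 1 means ascending
    # PyMongo takes aggregate options as keyword arguments.
    return collection.aggregate(pipeline, allowDiskUse=True)


class FakeCollection:
    """Stand-in that records the call, so the sketch runs without a server."""

    def aggregate(self, pipeline, **options):
        self.pipeline = pipeline
        self.options = options
        return iter([])


fake = FakeCollection()
cursor = sorted_reviews(fake)
```

Note that the cursor returned by aggregate is meant to be iterated (for review in sorted_reviews(db_handle): ...); it does not support positional indexing the way the find() cursor in the question does.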
JavaScript API syntax for the index:
db_handle.ensureIndex({executedDate: 1})
In my case, it was necessary to fix the necessary indexes in code and recreate them:
rake db:mongoid:create_indexes RAILS_ENV=production
The memory overflow does not occur when the sort field has the needed index.
PS Before this I had to disable the errors when creating long indexes:
# mongo
MongoDB shell version: 2.6.12
connecting to: test
> db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )
A reIndex may also be needed:
# mongo
MongoDB shell version: 2.6.12
connecting to: test
> use your_db
switched to db your_db
> db.getCollectionNames().forEach( function(collection){ db[collection].reIndex() } )
I've got a 3-machine Cassandra cluster using a rack-unaware placement strategy with a replication factor of 2.
The column family is defined as follows:
create column family UserGeneralStats with comparator = UTF8Type and default_validation_class = CounterColumnType;
Unfortunately after a few days of production use I got some inconsistent values for the counters:
Query on replica 1:
[default#StatsKeyspace] list UserGeneralStats['5261666978': '5261666978'];
Using default limit of 100
-------------------
RowKey: 5261666978
=> (counter=bandwidth, value=96545030198)
=> (counter=downloads, value=1013)
=> (counter=previews, value=10304)
Query on replica 2:
[default#StatsKeyspace] list UserGeneralStats['5261666978': '5261666978'];
Using default limit of 100
-------------------
RowKey: 5261666978
=> (counter=bandwidth, value=9140386229)
=> (counter=downloads, value=339)
=> (counter=previews, value=1321)
As the standard read repair mechanism doesn't seem to repair the values, I tried to force an anti-entropy repair using nodetool repair. It didn't have any effect on the counter values.
Data inspection showed that the lower values for the counters are the correct ones so I suspect that either Cassandra (or Hector which I used as API to call Cassandra from Java) retried some increments.
Any ideas how to repair the data and possibly prevent the situation from happening again?
If neither RR nor repair fixes it, it's probably a bug.
Please upgrade to 0.8.3 (out today) and verify it's still present in that version, then you can file a ticket at https://issues.apache.org/jira/browse/CASSANDRA.