MongoDB collection size on Windows: how to insert large data? - java

I am using the latest version of MongoDB (64-bit). I want to insert a large amount of data into a collection, for example 3 million items. I am using the Java Mongo driver.
One item of the collection looks like this:
{
"_id": ObjectId("54ef7b31fa435e54747f3975"),
"splitRatio": "1.0",
"adjClose": "8.4199657167121",
"ex_Dividend": "0.0",
"date": "1985-09-27",
"close": "8.81",
"adjVolume": "3000.0",
"id": NumberLong(13379),
"open": "8.81",
"simbol": "AAME",
"adjOpen": "8.4199657167121",
"adjHigh": "8.4199657167121",
"volume": "1200.0",
"high": "8.81",
"reportDate": ISODate("1985-09-26T19:00:00.0Z"),
"low": "8.81",
"adjLow": "8.4199657167121"
}
Memory and CPU usage are high. It makes no difference whether I do a bulk insert or insert the objects one by one; every time it stops when the collection reaches 13,381 items.
I created the collection by hand and set its size to 10 GB.
I do not know which configuration to run MongoDB with in order to store large amounts of data. I have one node.
For comparison, 3 million rows in a MySQL table are no problem.
What's wrong?
Also, MongoDB created two files for the database, each 2,146,435,072 bytes in size. Maybe that is some limit?
[stats screenshot omitted]

I created a standalone project that just inserts into MongoDB, with no Spring, no Tomcat, no beans, and it works now :)
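For anyone hitting the same wall, a plain bulk insert from such a standalone program can look roughly like this. This is a minimal sketch, not the actual project: the host, the stocks/quotes database and collection names, and the reduced set of fields are all assumptions, and it uses the legacy Java driver API of that era:

import com.mongodb.BasicDBObject;
import com.mongodb.BulkWriteOperation;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class StandaloneBulkLoad {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection quotes = client.getDB("stocks").getCollection("quotes");

        BulkWriteOperation bulk = quotes.initializeUnorderedBulkOperation();
        int pending = 0;
        for (int i = 0; i < 3000000; i++) {
            bulk.insert(new BasicDBObject("simbol", "AAME")
                    .append("date", "1985-09-27")
                    .append("close", "8.81"));
            pending++;
            // Flush in batches so the pending operations don't pile up in memory.
            if (pending == 10000) {
                bulk.execute();
                bulk = quotes.initializeUnorderedBulkOperation();
                pending = 0;
            }
        }
        if (pending > 0) {
            bulk.execute();
        }
        client.close();
    }
}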

Related

How to programmatically dump data (bulk write) to Elasticsearch

I am trying to learn the ropes of Elasticsearch. As part of QA testing, I want to write a massive number of records to ES (say 10K records). Each record is a geo location (x, y) coordinate. Each write will arbitrarily increase the value of (x, y). I can keep a counter that I update in every loop iteration in Java and write to ES, but I am guessing there may be a better way (in the ES documentation, I came across the _bulk keyword).
Is there any ES way of doing massive programmatic writes to ES?
You can use the bulk API in Elasticsearch to add multiple documents at once. The API and examples are described in Elasticsearch: The Definitive Guide, which can be found here: https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html
I'm guessing you would want to build a string like this:
{ "index": { "_index": "locationstorage", "_type": "geolocation" }} \n
{ "xcoord": 1, "ycoord": 2 } \n
{ "index": { "_index": "locationstorage", "_type": "geolocation" }} \n
{ "xcoord": 2, "ycoord": 3 } \n
and POST it to the /_bulk API.
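If you just want to talk to ES over plain HTTP, that string can be assembled and POSTed with nothing but the JDK. A minimal sketch, assuming a node at localhost:9200 and reusing the locationstorage/geolocation index and type from above (the 10K loop mirrors the scenario in the question):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class EsBulkWriter {
    public static void main(String[] args) throws Exception {
        // Build the newline-delimited bulk body: one action line, one document line.
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            body.append("{\"index\":{\"_index\":\"locationstorage\",\"_type\":\"geolocation\"}}\n");
            body.append("{\"xcoord\":").append(i).append(",\"ycoord\":").append(i + 1).append("}\n");
        }

        // POST the whole batch to the /_bulk endpoint.
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:9200/_bulk").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.toString().getBytes("UTF-8"));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}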

SQL query on JSON data that contains an array

Table - ABC
Record 1 -- {"count": 0, "Groups": ["PADDY TRADERS", "WHEET TRADERS", "DAL TRADERS"], "apmcId": "33500150180006", "isSent": 0, "userId": "M0000020PRFREG2015050636624494USS1"}
Record 2 -- {"count": 0, "Groups": ["X TRADERS", "Y TRADERS", "Z TRADERS"], "apmcId": "566565656", "isSent": 0, "userId": "5435435435345435435435"}
These are the records in the ABC table. I am querying as below, expecting the first record to be returned, but I am not able to get it. Please help me with querying records that contain list data inside:
"SELECT * FROM ABC WHERE data->>'Groups'->'PADDY TRADERS'";
MySQL does not support JSON directly unless you are on version >= 5.7 (see here for a nice blog post concerning JSON in MySQL 5.7).
Therefore all you can do is get the field in which the JSON is stored, interpret it with the JSON library of your choice and do whatever you need doing.
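As a rough illustration of that application-side approach, here is a sketch that reads the JSON column over JDBC and filters with Jackson. The column name data is taken from the query in the question; the connection details and the choice of Jackson are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class GroupFilter {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/mydb", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT data FROM ABC")) {
            while (rs.next()) {
                // Parse the JSON payload and check the Groups array for a match.
                JsonNode doc = mapper.readTree(rs.getString("data"));
                for (JsonNode group : doc.path("Groups")) {
                    if ("PADDY TRADERS".equals(group.asText())) {
                        System.out.println(doc);
                        break;
                    }
                }
            }
        }
    }
}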

How to write a query with index intersection with the Mongo Java driver

I googled and read the official MongoDB doc (http://docs.mongodb.org/manual/core/index-intersection/), but didn't find any tutorial or indication of the syntax for a query using index intersection.
Does MongoDB automatically apply index intersection when a query involves 2 fields that are each covered by a separate single-field index? I don't think so.
Here is what cursor.explain() shows when I run a query between 2 dates and with a given "name" ("name" is a field; both date and name are indexed):
{
"cursor": "BtreeCursor Name_1",
"isMultiKey": false,
"n": 99330,
"nscannedObjects": 337500,
"nscanned": 337500,
"nscannedObjectsAllPlans": 337601,
"nscannedAllPlans": 337705,
"scanAndOrder": false,
"indexOnly": false,
"nYields": 18451,
"nChunkSkips":
"millis": 15430,
"indexBounds": {
"Name": [
[
"blabla",
"blabla"
]
]
},
"allPlans": [
{
"cursor": "BtreeCursor Name_1",
"isMultiKey": false,
"n": 99330,
"nscannedObjects": 337500,
"nscanned": 337500,
"scanAndOrder": false,
"indexOnly": false,
"nChunkSkips": 0,
"indexBounds": {
"Name": [
[
"blabla",
"blabla"
]
]
}
},
{
"cursor": "BtreeCursor Date_1",
"isMultiKey": false,
"n": 0,
"nscannedObjects": 101,
"nscanned": 102,
"scanAndOrder": false,
"indexOnly": false,
"nChunkSkips": 0,
"indexBounds": {
"Date": [
[
"2014-08-23 10:28:50.221",
"2014-08-23 13:28:50.221"
]
]
}
},
{
"cursor": "Complex Plan",
"n": 0,
"nscannedObjects": 0,
"nscanned": 103,
"nChunkSkips": 0
}
]
}
The complex plan shows nothing, and the elapsed time is about 16 s. If I query only by name, without the date, it takes only 0.9 s.
I want to learn how to write a query using index intersection with the Mongo Java driver, something like hint() in the mongo shell. Any example or tutorial link is welcome.
I know how to write basic queries with the MongoDB Java driver, so you can just post the essential code example if that saves you time.
Thanks in advance.
After reading these links: http://docs.mongodb.org/manual/core/query-plans/#index-filters
https://jira.mongodb.org/browse/SERVER-3071
I have come to the conclusion that there is, for now, no way to force a query to use index intersection.
In fact, when several candidate indexes are possible for a query, MongoDB runs them in parallel and waits for one index to "win the race". The winning index is the one that either completes the whole query first or returns a threshold number of matching results first. MongoDB then uses that index for the query.
If your queries vary a lot and you cannot build many compound indexes, you are stuck; you can only trust MongoDB's trial runs.
Sometimes one index is more selective than another, but that does not mean it returns results more quickly. In my case, the "name" index is more selective: it may fetch fewer documents, but it needs a date comparison to decide whether each fetched document matches the whole query. On the other side, the "date" index fetches more documents from disk but only does a simple equality test on the "name" field to decide whether a document matches. That is probably why it can win the race.
As for index intersection, it has never been used in any of my query tests. I doubt how useful it is and expect MongoDB to improve its performance in a future version.
If my conclusion is wrong, please point it out. Still learning about MongoDB :)
Does mongodb apply automatically index intersection when the query
involves 2 fields which are separately indexed by a single index?
has been answered here: MongoDB index intersection
You can't force MongoDB to apply index intersection; rather, you can modify your queries to allow the MongoDB query optimizer to apply the index intersection strategy to them.
To learn how your query parameters affect the indexing process, see this link, though it is about compound indexes:
http://java.dzone.com/articles/optimizing-mongodb-compound
And the Java API provides two methods to use hint() with the find() operation:
MongoDB Java API
public DBCursor hint(String indexName)
public DBCursor hint(DBObject indexKeys)
Informs the database of indexed fields of the collection in order to
improve performance.
which can be used as below:
DBCursor cursor = collection.find(query).hint(indexName);
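For completeness, here is a hedged sketch of both overloads against the query from the explain() output above. It assumes the legacy driver, an existing DBCollection named collection, and an index named Date_1; the date strings are the ones shown in the plan:

// Query on both indexed fields.
DBObject query = new BasicDBObject("Name", "blabla")
        .append("Date", new BasicDBObject("$gte", "2014-08-23 10:28:50.221")
                .append("$lte", "2014-08-23 13:28:50.221"));

// Hint by index name:
DBCursor byName = collection.find(query).hint("Date_1");

// Hint by index key pattern:
DBCursor byKeys = collection.find(query).hint(new BasicDBObject("Date", 1));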

MongoDB - Update collection atomically if set does not exist

I have the following document in my collection:
{
"_id":NumberLong(106379),
"_class":"x.y.z.SomeObject",
"name":"Some Name",
"information":{
"hotelId":NumberLong(106379),
"names":[
{
"localeStr":"en_US",
"name":"some Other Name"
}
],
"address":{
"address1":"5405 Google Avenue",
"city":"Mountain View",
"cityIdInCitiesCodes":"123456",
"stateId":"CA",
"countryId":"US",
"zipCode":"12345"
},
"descriptions":[
{
"localeStr":"en_US",
"description": "Some Description"
}
]
},
"providers":[
],
"some other set":{
"a":"bla bla bla",
"b":"bla,bla bla",
}
"another Property":"fdfdfdfdfdf"
}
I need to run through all documents in the collection and, if "providers": [] is empty, create a new set based on the values of the information section.
I'm far from being a MongoDB expert, so I have a few questions:
Can I do it as an atomic operation?
Can I do this using the MongoDB console? As far as I understand, I can do it using the $addToSet and $each operators?
If not, is there any Java-based driver that provides such functionality?
Can I do it as atomic operation?
Every document will be updated in an atomic fashion. There is no "atomic" in MongoDB in the RDBMS sense, meaning there is no multi-document transaction in which all operations succeed or fail together, but you can prevent other writes from interleaving by using the $isolated operator.
Can I do this using MongoDB console?
Sure you can. To find all documents with an empty providers array, you can issue a command like:
db.zz.find({ providers: { $size: 0 } })
To update all documents where the array is of zero length by adding a fixed string, you can issue a query such as:
db.zz.update({providers : { $size : 0}}, {$addToSet : {providers : "zz"}})
If you want to add a portion to your document based on the document's own data, you can use the notorious $where query (do mind the warnings appearing in that link), or, as you mentioned, query for the empty providers array and use cursor.forEach().
If not is there any Java based driver that can provide such functionality?
Sure, there is a Java driver, as there is for every other major programming language. It can do practically everything described here, and basically everything you can do from the shell. I suggest you get started with the Java Language Center.
There are also several frameworks that facilitate working with MongoDB and bridge the object-document worlds. I will not give a list here, as I'm pretty biased, but I'm sure a quick Google search will do.
db.so.find({ providers: { $size: 0} }).forEach(function(doc) {
doc.providers.push( doc.information.hotelId );
db.so.save(doc);
});
This will push the information.hotelId of each matching document into its empty providers array. Replace that with whatever field you would rather insert into the providers array.
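The same loop written against the Java driver could look roughly like this. This is a sketch using the legacy DBCollection API; so is assumed to be the DBCollection behind the shell example:

DBCursor emptyProviders = so.find(
        new BasicDBObject("providers", new BasicDBObject("$size", 0)));
for (DBObject doc : emptyProviders) {
    BasicDBList providers = (BasicDBList) doc.get("providers");
    DBObject information = (DBObject) doc.get("information");
    // Mirrors doc.providers.push(doc.information.hotelId) from the shell version.
    providers.add(information.get("hotelId"));
    so.save(doc);
}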

MongoDB data model to support unique visitors, per event, per date range

I've got multiple websites, where each website has visitors that "trigger" multiple events I want to track. I have a log of those events from all websites; each event entry contains the website-id, the event-name and the user-id that performed the event (for the sake of simplicity, let's say that's it).
The requirements:
Be able to get, per website-id and event-name, how many unique visitors got it.
This should support also date range (distinct unique visitors on the range).
I was thinking of creating a collection per "website-id" with the following data model (as example):
collection ev_{websiteId}:
[
{
_id: "error"
dailyStats: [
{
_id: 20121005 <-- (yyyyMMdd int, should be indexed!)
hits: 5
users: [
{
_id: 1, <-- should be indexed!
hits: 1
},
{
_id: 2
hits: 3
},
{
_id: 3,
hits: 1
}
]
},
{
_id: 20121004
hits: 8
users: [
{
_id: 1,
hits: 2
},
{
_id: 2
hits: 3
},
{
_id: 3,
hits: 3
}
]
},
]
},
{
_id: "pageViews"
dailyStats: [
{
_id: 20121005
hits: 500
users: [
{
_id: 1,
hits: 100
},
{
_id: 2
hits: 300
},
{
_id: 3,
hits: 100
}
]
},
{
_id: 20121004
hits: 800
users: [
{
_id: 1,
hits: 200
},
{
_id: 2
hits: 300
},
{
_id: 3,
hits: 300
}
]
},
]
},
]
I'm using the _id to hold the event-id.
I'm using dailyStats._id to hold when it happened (an integer in yyyyMMdd format).
I'm using dailyStats.users._id to represent a user's unique-id hash.
In order to get the unique users, I should basically be able to count (with map-reduce?) the distinct items in the users array(s) for a given date range (I will convert the date range to yyyyMMdd).
My questions:
Does this data model make sense to you? I'm concerned about the scalability of this model over time (if some client has a lot of daily unique visitors, it may result in a huge document).
I was thinking of deleting dailyStats documents with _id < [date as yyyyMMdd]. This way I can keep my document sizes at a sane number, but still, there are limits here.
Is there an easy way to run an "upsert" that will create the dailyStats entry if it does not exist, add the user if not already present, and increment the "hits" property on both?
What about map-reduce? How would you approach it (I need to run a distinct on users._id across all subdocuments in the given date range)? Is there an easier way with the new aggregation framework?
By the way, another option for counting unique visitors is Redis bitmaps, but I am not sure it's worth maintaining multiple data stores (maintenance-wise).
A few comments on the architecture above.
I'm slightly worried, as you've pointed out, about scalability and about how much pre-aggregation you're really doing.
Most of the Mongo instances I've worked with for metrics have cases similar to yours, but you really seem to be relying heavily on updating a single document and upserting various parts of it, which is going to slow down and potentially cause a bit of locking.
I might suggest a different path, one that the Mongo folks themselves suggest when talking with them about doing metrics. Seeing as you already have a structure in mind, I'd create something along the lines of:
{
"_id": "20121005_siteKey_page",
"hits": 512,
"users": [
{
"uid": 5,
"hits": 512
}
]
}
This way you are limiting your document sizes to something that is going to be reasonable for quick upserts. From here you can do mapreduce jobs in batches to further extend what you're looking to see.
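As a rough illustration, the "quick upsert" against such a per-day document could look like this from the Java driver. A minimal sketch using the legacy DBCollection API; the document key and uid are taken from the example above, and coll is assumed to be the metrics collection:

// Upsert the per-day document and bump the total hit counter.
DBObject dayDoc = new BasicDBObject("_id", "20121005_siteKey_page");
coll.update(dayDoc,
        new BasicDBObject("$inc", new BasicDBObject("hits", 1)),
        true /* upsert */, false /* multi */);

// Try to bump the counter of an existing entry in the users array
// (positional $ operator).
DBObject userMatch = new BasicDBObject("_id", "20121005_siteKey_page")
        .append("users.uid", 5);
WriteResult res = coll.update(userMatch,
        new BasicDBObject("$inc", new BasicDBObject("users.$.hits", 1)));

// If that user was not in the array yet, add them.
if (res.getN() == 0) {
    coll.update(dayDoc, new BasicDBObject("$push",
            new BasicDBObject("users",
                    new BasicDBObject("uid", 5).append("hits", 1))));
}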
It also depends on your end goal: are you looking to provide realtime metrics? What sort of granularity are you attempting to get? Redis bitmaps may be something you want to at least look at; there is a great article here.
Regardless it is a fun problem to solve :)
Hope this has helped!
