I am creating reports on MongoDB using Java, and I need to use map-reduce to build them. I have three replicas in production. For report queries I do not want to send requests to the primary MongoDB database; I want to query only a secondary replica. The complication is that map-reduce creates a temporary collection.
1) Is there any problem if I set the read preference to secondary for reports using map-reduce?
2) Will the temporary collection be created on the secondary replica?
3) Is there any other way to use a secondary replica for reporting, since I do not want to create traffic on the primary database?
4) Will I get the correct, desired results given that I have huge data?
Probably the easiest way to do this is to connect to the secondary directly, instead of connecting to the replica set with ReadPreference.SECONDARY_ONLY. In that case it will definitely create the temporary collection on the secondary, and you should get the correct results (of course!).
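Something like this, as a rough sketch with the legacy 2.x Java driver (hostnames are placeholders; on very old driver versions the class is Mongo rather than MongoClient):

import java.util.Arrays;
import com.mongodb.MongoClient;
import com.mongodb.ReadPreference;
import com.mongodb.ServerAddress;

public class SecondaryReports {
    public static void main(String[] args) throws Exception {
        // Option 1: connect to the whole replica set and route reads to secondaries.
        MongoClient rsClient = new MongoClient(Arrays.asList(
                new ServerAddress("host1", 27017),
                new ServerAddress("host2", 27017),
                new ServerAddress("host3", 27017)));
        rsClient.setReadPreference(ReadPreference.secondary());

        // Option 2: connect straight to one secondary. The driver still needs
        // a non-primary read preference to allow queries against that node.
        MongoClient direct = new MongoClient(new ServerAddress("host2", 27017));
        direct.setReadPreference(ReadPreference.secondaryPreferred());
    }
}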
I would also advise you to look at the Aggregation Framework, as it's a lot faster and often easier to use and debug than map-reduce jobs. It's not as powerful, but I have yet to find a situation where I couldn't use the Aggregation Framework for my aggregation and reporting needs.
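For example, a simple status report with the 2.x driver's aggregate() helper (the database, collection, and field names here are made up):

import com.mongodb.AggregationOutput;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class StatusReport {
    public static void main(String[] args) throws Exception {
        DBCollection orders = new MongoClient("host1")
                .getDB("shop").getCollection("orders");
        // $group by status and sum the amounts -- a typical small report
        // that would otherwise need a map-reduce job.
        DBObject group = new BasicDBObject("$group",
                new BasicDBObject("_id", "$status")
                        .append("total", new BasicDBObject("$sum", "$amount")));
        AggregationOutput out = orders.aggregate(group);
        for (DBObject row : out.results()) {
            System.out.println(row);
        }
    }
}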
If the result set is large, then holding the entire result set in memory (in a server-side cache, e.g. Hazelcast) is not feasible. With large result sets you cannot afford to keep them in memory, so you have to fetch a chunk of data at a time (query-based paging). The downside of query-based paging is that there will be multiple calls to the database for multiple page requests.
Can anyone suggest how to implement a hybrid approach to this?
I haven't put any sample code here since I think the question is more about the logic than about specific code. Still, if you need sample code I can add it.
The most effective solution is to use the primary key as the paging criterion. This lets us rely on first-class constructs like a BETWEEN range query, which is simple for the RDBMS to optimize, and the primary key of the queried entity will most likely be indexed already.
Retrieving data using a range query on the primary key is a two-step process: first retrieve the collection of primary keys, then generate the intervals that identify a proper subset of the data, followed by the actual queries against the data.
This approach is almost as fast as the brute-force version, while its memory consumption is about one tenth. By selecting an appropriate page size for this implementation, you can trade execution time against memory consumption. This version is also stateless: it does not keep references to resources the way the ScrollableResults version does, nor does it strain the database like the version using setFirstResult/setMaxResults.
Effective pagination using Hibernate
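A minimal sketch of the two-step approach described above (the Order entity and its id property are stand-ins for your own mapping):

import java.util.List;
import org.hibernate.Session;

public class KeyRangePager {
    // Step 1: load only the primary keys (cheap, typically index-only) and
    // walk them in fixed-size intervals. Step 2: fetch each interval with a
    // BETWEEN range query against the primary key.
    @SuppressWarnings("unchecked")
    static void process(Session session, int pageSize) {
        List<Long> ids = session
                .createQuery("select o.id from Order o order by o.id")
                .list();
        for (int i = 0; i < ids.size(); i += pageSize) {
            long from = ids.get(i);
            long to = ids.get(Math.min(i + pageSize, ids.size()) - 1);
            List<Order> chunk = session
                    .createQuery("from Order o where o.id between :from and :to")
                    .setParameter("from", from)
                    .setParameter("to", to)
                    .list();
            // process the chunk, then drop the reference so it can be GC'd
        }
    }
}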
We are using MongoDB for our Java EE based application.
We are using one primary and two secondaries as part of a MongoDB replica set.
MongoDB is continuously read from, updated, and written to as part of our architecture.
Because of this we are getting dirty reads (zero values) when we fetch from MongoDB.
To solve this, I am trying to configure things so that read operations always fetch from the primary itself, while writes/updates can go to the primary/secondary.
Please let me know if this is possible (always read from the primary and use either primary or secondary for writes/updates).
And will such a design have any negative impact?
Because MongoDB uses staged writing for inserts (journaling), you can see a lag in write visibility even on a single node.
http://docs.mongodb.org/manual/reference/glossary/#term-journal
Use a write concern to avoid problems with dirty data: http://api.mongodb.org/java/2.6/com/mongodb/WriteConcern.html
You can read more about write operations:
http://docs.mongodb.org/manual/core/write-operations/
That page states that performing writes without error checking is not recommended for production.
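As a sketch with the 2.x driver (host, database, and collection names are placeholders):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

public class SafeWrites {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("host1");
        // Wait for the primary to acknowledge every write instead of
        // fire-and-forget behaviour.
        mongo.setWriteConcern(WriteConcern.SAFE);

        // Or require replication to a secondary for an individual write:
        DBCollection col = mongo.getDB("mydb").getCollection("mycol");
        col.insert(new BasicDBObject("x", 1), WriteConcern.REPLICAS_SAFE);
    }
}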
To rule out reads from secondaries (slave reads), you have to set a corresponding ReadPreference. See the documentation on the Mongo connection class.
Something like this:
mongo.setReadPreference(ReadPreference.primary());
The documentation says, however, that by default all reads should already go to primary:
By default, all read and write operations will be made on the primary, but it's possible to read from secondaries by changing the read preference:
Are you sure you didn't turn slave reads on in the app?
What is the best way to implement the following scenario?
I need to query a database table containing millions of records from a Java application. Then, for each record in the table, my application should call a third-party API and get a status field as the response. Finally, my application should update each row in the table with the status returned by the API.
Note - I am trying to figure out a method to do this in the best possible way. I understand that querying all the records together is not the best way forward.
Do not try to eat the elephant in one bite. Chunk it. Heard of pagination? Use it. See here: MySQL pagination without double-querying?
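A rough sketch of the chunked approach, using keyset paging over the primary key to avoid the double-query problem from the linked answer (table/column names and callApi() are made up; resource handling is trimmed for brevity):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StatusUpdater {
    // Read one id-ordered page at a time, call the API per row, and push the
    // statuses back in a single batched UPDATE per chunk.
    static void run(Connection con, int pageSize) throws SQLException {
        long lastId = 0;
        while (true) {
            PreparedStatement select = con.prepareStatement(
                "SELECT id, payload FROM records WHERE id > ? ORDER BY id LIMIT ?");
            select.setLong(1, lastId);
            select.setInt(2, pageSize);
            ResultSet rs = select.executeQuery();

            PreparedStatement update = con.prepareStatement(
                "UPDATE records SET status = ? WHERE id = ?");
            boolean sawRows = false;
            while (rs.next()) {
                sawRows = true;
                lastId = rs.getLong("id");
                update.setString(1, callApi(rs.getString("payload")));
                update.setLong(2, lastId);
                update.addBatch();
            }
            if (!sawRows) break;      // no more pages
            update.executeBatch();    // one round trip per chunk
        }
    }

    static String callApi(String payload) { return "OK"; } // stand-in for the 3rd-party call
}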
You could use Oracle features such as SQL*Loader or Data Pump, invoked via JDBC or a script.
Databases are not designed for repeatedly updating millions of records via a Java API; this can take many minutes. If that is not fast enough, you may need to use a dataset embedded in Java (either caching or replacing your database).
My use case is as follows --
I have a database table with around 1000+ entries; the table is updated/edited infrequently, but I expect this to change in the future. Some of the columns in the table contain strings of considerable length.
I am now writing a UI application that will have mouseover events displaying text derived from the aforementioned database table.
For my use case, I have decided to write a backend 'server' that hosts an in-memory database holding all the data present in that table. On startup, the UI app will cache the required data from the in-memory database hosted by the backend server.
Does my use case justify using an in-memory database? If not, what alternatives should I consider?
EDIT 1 --
My use case also involves running multiple searches of varying complexity on the database very frequently.
Seems like an excellent use-case for an in-memory database. Writing it yourself, on the other hand, is probably not the way to go.
There are plenty of existing options for just about any imaginable scenario: http://en.wikipedia.org/wiki/In-memory_database
If you're doing complex searches on text data, Lucene is quite excellent. It has special in-memory storage backends, but really, it doesn't matter for such a tiny dataset - it will always be quickly cached anyway.
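A tiny sketch of an in-memory Lucene index (constructor details vary between Lucene versions; this is roughly the 5.x API):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class TooltipIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();   // index lives entirely in memory
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(new StandardAnalyzer()));
        Document doc = new Document();
        doc.add(new TextField("text", "long descriptive tooltip text", Store.YES));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        ScoreDoc[] hits = searcher.search(
                new QueryParser("text", new StandardAnalyzer()).parse("tooltip"),
                10).scoreDocs;
        for (ScoreDoc hit : hits) {
            System.out.println(searcher.doc(hit.doc).get("text"));
        }
    }
}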
I am building a web crawler and would like to know a few things:
1) Can I use MapReduce to fetch the data from the net?
2) Can I save the fetched data to HBase?
3) Can I write an app in PHP to fetch the data from HBase? If yes, can you give me a code snippet? How can I add/view/delete data in HBase using PHP?
For your questions, yes, it can all be done. How you approach it depends on what exactly you want to achieve.
1) Your main control process would need to partition the task. You would likely maintain some kind of list of addresses to crawl, possibly running sequential MapReduce jobs that each read the list in and split it between mappers, which do the crawling and write directly to HBase or another intermediary store. The mappers would also output newly discovered URLs, which would be filtered down to uniques in the reduce phase, with the reducer outputting the list of things to crawl next (see the first sketch below). You'd also need to maintain a list of recently crawled URLs and filter those out, but that's not specific to MR/HBase.
2) You can use TableOutputFormat to send the job's output to HBase, or you can simply open an HBase connection with HTable and write directly from your mapper (see the second sketch below).
3) As TheDeveloper said, yes, with Thrift. His link is good.
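First sketch, for point 1: one crawl round as a map-only fetch plus a dedupe reduce (Fetcher and its methods are hypothetical stubs for the real HTTP fetch and link extraction):

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CrawlRound {
    // Mappers split the URL list between them, fetch each page, and emit the
    // links they discover.
    public static class FetchMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text url, Context ctx)
                throws IOException, InterruptedException {
            String page = Fetcher.fetch(url.toString());
            // ... store the page content (e.g. in HBase) here ...
            for (String link : Fetcher.extractLinks(page)) {
                ctx.write(new Text(link), NullWritable.get());
            }
        }
    }

    // The reducer sees each discovered URL as one group, so writing the key
    // once filters the next round's crawl list down to uniques.
    public static class DedupeReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text url, Iterable<NullWritable> ignored, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(url, NullWritable.get());
        }
    }

    // Stand-ins for the real HTTP fetch and HTML link extraction.
    static class Fetcher {
        static String fetch(String url) { return ""; }
        static List<String> extractLinks(String page) { return Collections.emptyList(); }
    }
}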
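Second sketch, for point 2: wiring a job so mapper output lands in an HBase table via TableOutputFormat (classic org.apache.hadoop.hbase.mapreduce API; the "pages" table and column names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CrawlToHBase {
    public static Job configure() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "crawl-to-hbase");
        job.setOutputFormatClass(TableOutputFormat.class);
        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "pages");
        job.setNumReduceTasks(0);   // map-only: mappers emit Puts directly
        return job;
    }

    // Inside the mapper, each crawled page becomes one row keyed by its URL.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static void writeRow(Mapper.Context ctx, String url, String page)
            throws Exception {
        Put put = new Put(Bytes.toBytes(url));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes(page));
        ctx.write(new ImmutableBytesWritable(Bytes.toBytes(url)), put);
    }
}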
For question number 3: you can interact with HBase from PHP, but you need to do it via the Thrift interface. See this blog post for more info. Hope this helps.
It can also be done easily via REST using Stargate.
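For example, fetching one cell over Stargate's REST interface from Java (host, table, row, and column here are invented; cell values come back base64-encoded in the JSON):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class StargateGet {
    public static void main(String[] args) throws Exception {
        // GET /<table>/<row>/<family:qualifier> against the Stargate server
        URL url = new URL("http://hbase-host:8080/pages/example.com/content:html");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestProperty("Accept", "application/json");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        in.close();
    }
}

The same request works from PHP with curl, which may be simpler than Thrift if you only need basic add/view/delete operations.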