I have a database full of two different types of users (Mentors and Mentees), and I want the second group (Mentees) to be able to "search" for people in the first group (Mentors) who match their profile. Mentors and Mentees can both go in and change items in their profile at any time.
Currently, I am using Apache Mahout for the user matching (recommender.mostSimilarIDs()). The problem I'm running into is that I have to reload the user data every single time anyone searches. By itself, this doesn't take that long, but when Mahout processes the data it seems to take a very long time (14 minutes for 3,000 Mentors and 3,000 Mentees). After processing, matching takes mere seconds. I also get the same INFO message over and over while it's processing ("Processed 2248 users"), even though the code suggests the message should only be output every 10,000 users.
I'm using the GenericUserBasedRecommender and the GenericDataModel, along with the NearestNUserNeighborhood, AveragingPreferenceInferrer and PearsonCorrelationSimilarity. I load mentors from the database, add the mentee to the list of POJOs and convert them to a FastByIDMap to give to the DataModel.
Is there a better way to be doing this? The product owner needs the data to be current for every search.
(I'm the author.)
You shouldn't need to ask it to reload the data every time; why are you doing that?
14 minutes also sounds way, way too long to load such a small amount of data; something's wrong. You might follow up with more info at user@mahout.apache.org.
You are seeing log messages from a DataModel, which you can disable in your logging system of choice. It prints one final count. This is nothing to worry about.
I would advise you against using a PreferenceInferrer unless you absolutely know you want it. Do you actually have ratings here? I might suggest LogLikelihoodSimilarity if not.
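For illustration, here is a minimal sketch of that suggestion against the 0.x Taste API: the same components as above, but with LogLikelihoodSimilarity and no PreferenceInferrer. The class name, the neighborhood size of 25, and the assumption that you build the FastByIDMap elsewhere (as you already do) are placeholders, not a prescription:

```java
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MentorMatcher {

    /** Finds the users most similar to the given mentee, without a PreferenceInferrer. */
    public static long[] matchMentors(FastByIDMap<PreferenceArray> prefs,
                                      long menteeId, int howMany) throws Exception {
        DataModel model = new GenericDataModel(prefs);
        // Log-likelihood ignores preference values, so it suits "has this profile
        // attribute" style boolean data where there are no real ratings.
        UserSimilarity similarity = new LogLikelihoodSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);
        return recommender.mostSimilarUserIDs(menteeId, howMany);
    }
}
```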
I have an Android app and I am planning to use the Kinvey database to store some data.
One of the fields in each record will hold the time the app was last used.
This last-used time will be set by my app whenever the app is opened.
What I am basically trying to achieve is to run some code at the end of each month and clear all the records whose last-used time is more than 10 days old.
Can anyone please tell me whether it is possible to do this?
The reason for doing this is to use as little server storage space as possible, since they provide only 1 GB per month on the free plan.
What I understand from this...
What I am basically trying to achieve is to run some code at the end of each month and clear all the records whose last-used time is more than 10 days old.
Can anyone please tell me whether it is possible to do this?
... is that you would like to have scheduled code, which will be executed once a month. Please take a further look at the Scheduled Code feature of Kinvey.
https://devcenter.kinvey.com/html5/tutorials/scheduled-code-getting-started
Edit: more information on the topic...
Kinvey Scheduled code allows you to execute one of your custom endpoints on a specific date in the future. It is commonly used to
Aggregate, archive, and cleanup data.
Pull data from a third-party API into Kinvey.
Send out a batch of e-mails or push notifications.
I won't describe the step-by-step setup of Scheduled Code, since those steps may change in the future. Please follow the steps from the link above; they should be enough to get you going.
I am trying to write a huge amount of data, fetched from a MySQL database, to CSV using Super CSV. How can I manage the performance issue simply? Does Super CSV impose any limits when writing?
Since you included almost no detail in your question about how you are approaching the problem, it's hard to make concrete recommendations. So, here's a general one:
Unless you are writing your file to a really slow medium (some old USB stick or something), the slowest step in your process should be reading the data from the database.
There are two general ways you can structure your program:
The bad way: Reading all the data from the database into your application's memory first and then, in a second step, writing it all in one shot to the csv file.
The right way: "Stream" the data from the db into the csv file, i.e. write the data to the csv file as it comes into your application (record by record or batch by batch).
The idea is to set up something usually referred to as a "pipeline". Think of it like a conveyor belt in a factory: You have multiple steps in your process of assembling some widget. What you don't want to do is have station 1 process all widgets while stations 2 and 3 sit idle, and then pass the whole container of widgets to station 2 to begin work while stations 1 and 3 sit idle, and so forth. Instead, station 1 needs to send small batches (1 at a time or 10 at a time or so) of finished widgets to station 2 immediately, so that it can start working on them as soon as possible. The goal is to keep all stations as busy as possible at all times.
In your example, station 1 is mysql retrieving the records, station 2 is your application that forwards (and processes?) them, and station 3 is supercsv. So, simply make sure that supercsv can start working as soon as possible, rather than having to wait for mysql to finish the entire request.
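As a rough, non-authoritative sketch of such a pipeline with plain JDBC and Super CSV's CsvListWriter: the connection URL, query and column names below are placeholders, and the setFetchSize(Integer.MIN_VALUE) call is the MySQL Connector/J-specific way of asking the driver to stream rows instead of buffering the whole result set.

```java
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.supercsv.io.CsvListWriter;
import org.supercsv.io.ICsvListWriter;
import org.supercsv.prefs.CsvPreference;

public class StreamToCsv {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/mydb", "user", "password");   // placeholders
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             ICsvListWriter csv = new CsvListWriter(
                     new FileWriter("out.csv"), CsvPreference.STANDARD_PREFERENCE)) {

            // MySQL Connector/J: stream rows instead of loading them all into memory.
            stmt.setFetchSize(Integer.MIN_VALUE);

            try (ResultSet rs = stmt.executeQuery("SELECT id, name, amount FROM big_table")) {
                csv.writeHeader("id", "name", "amount");
                while (rs.next()) {
                    // Write each record as soon as it arrives, so station 3 (supercsv)
                    // works while station 1 (mysql) is still producing rows.
                    csv.write(rs.getString("id"), rs.getString("name"), rs.getString("amount"));
                }
            }
        }
    }
}
```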
If you do this right, you should be able to generate the csv file as quickly as mysql can throw records at you*, and then, if it's still too slow, you need to rethink your database backend.
*I haven't used supercsv yet, so I don't know how well it performs, but given how trivial its job is and how popular it is, I would find it hard to believe that it would end up performing worse (measured in processing time per record) than mysql in this task. But this might be something worth verifying...
I have a group of nodes that send measurements to a bootstrap server. In the end I want the bootstrap server to sum all the measurements and write the result to a file. One way to do that is to overwrite the file each time a measurement message is received (after summing up the current measurements), but this would be very inefficient. I want to store the measurement data and write it to the file only once, after the simulation is completed.
The problem is that the simulator code I am using is not under my control; it's a library, so I can't tell when exactly the simulation is going to end (and hence I can't tell which measurement message will be the last one).
I naively tried to store the measurement data in a static class, but this data is not accessible when the simulation terminates. Is there any other way I can do this?
Thanks,
I would find the last message using a timeout.
Write to disk when you have new data but you haven't received anything for a while, e.g. a second.
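A minimal sketch of that watchdog idea, assuming a ScheduledExecutorService; the one-second quiet period and the flush callback (which would write your current sum to the file) are placeholders for whatever fits your simulator:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Flushes accumulated measurements to disk once no new message has arrived for a while. */
public class IdleFlusher {
    private final AtomicLong lastMessageAt = new AtomicLong(System.currentTimeMillis());
    private volatile boolean dirty = false;
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public IdleFlusher(Runnable flush, long quietMillis) {
        // Check periodically; if there is unwritten data and nothing new arrived
        // during the quiet period, write the current sum out.
        timer.scheduleAtFixedRate(() -> {
            if (dirty && System.currentTimeMillis() - lastMessageAt.get() > quietMillis) {
                flush.run();
                dirty = false;
            }
        }, quietMillis, quietMillis, TimeUnit.MILLISECONDS);
    }

    /** Call this from the message handler every time a measurement arrives. */
    public void onMessage() {
        lastMessageAt.set(System.currentTimeMillis());
        dirty = true;
    }

    public void shutdown() {
        timer.shutdownNow();
    }
}
```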
If you cannot store the data you need in the process (which it seems you can't, since the static class failed), you need to persist the data some other way. To an on-disk file is one option, and another common one would be to a database.
I've been going round in circles with what must be a very simple challenge but I want to do it the most efficient way from the start. So, I've watched Brett Slatkin's Google IO videos (2008 & 2009) about building scalable apps including http://www.youtube.com/watch?v=AgaL6NGpkB8 and read the docs but as a n00b, I'm still not sure.
I'm trying to build an app on GAEJ similar to the original 'hotornot' where a user is presented with an item which they rate. Once they rate it, they are presented with another one which they haven't seen before.
My question is this: is it more efficient to do a query up front to grab x items (say 100) and put them in a list (stored in memcache?), or is it better to simply make a query for a new item after each rating?
To keep track of the items a user has seen, I'm planning to keep those items' keys in a list property of the user's entity. Does that sound sensible?
I've really got myself confused about this so any help would be much appreciated.
I would personally do something like:
When a user logs in, create a list of 100 random IDs that they have not seen. Then as they click to the next item, do a query to the datastore and pull back the one at the front of the list.
If this ends up too slow you can try to cache, but it is really hard to memcache your entire database. Even loading the 100 items each user needs will be hard as the number of users scales out. Pulling back one entry per page load is not slow: each click will post one rating and pull one item back. Simple, and only a few ms from the datastore. Computing the 100 random IDs they haven't seen can be slow, so it makes sense to do that ahead of time and keep it around (in their request or session, depending on how you are doing that...)
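As a sketch only, using the low-level App Engine datastore API; the "Item" kind and the per-user deque of unseen IDs (built once at login) are assumptions about how you store things:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import java.util.Deque;

public class ItemServer {
    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    /** Pops the next unseen item ID from the per-user queue and fetches just that entity. */
    public Entity nextItem(Deque<Long> unseenIds) throws EntityNotFoundException {
        Long nextId = unseenIds.poll();                   // queue built once at login
        if (nextId == null) {
            return null;                                  // user has rated everything queued so far
        }
        Key key = KeyFactory.createKey("Item", nextId);   // "Item" kind is an assumption
        return datastore.get(key);                        // single get per page load: a few ms
    }
}
```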
I challenge you :)
I have a process that someone already implemented. I will try to describe the requirements, and I was hoping I could get some input on the "best way" to do this.
It's for a financial institution.
I have a routing framework that will allow me to receive files and send requests to other systems. I have a database I can use as I wish, but only my software and I have access to it.
The facts
Via the routing framework I receive a file.
Each line in this file follows a fixed length format with the identification of a person and an amount (+ lots of other stuff).
99% of the time this file is below 100 MB (around 800 bytes per line, i.e. 2.2 MB is roughly 2,600 lines).
Once a year we get 1-3 GB of data instead.
Running on an "appserver"
I can fork subprocesses as I like. (within reason)
I cannot ensure consistency when running for more than two days: subprocesses may die, the connection to the db/framework might be lost, files might move.
I can NOT send reliable messages via the framework. The call is synchronous, so I must wait for the answer.
It's possible/likely that sending these getPerson requests will crash my "process" when I send lots of them.
We're using java.
Requirements
I must return a file with all the data, and I must add some more info for some lines (about 25-50% of the lines: at least 25,000).
This info I can only get by doing a getPerson request via the framework to another system, one per person. It takes between 200 and 400 msec.
It must be able to complete within two days
Nice to have
Checkpointing. If I'm going to run for a long time, I sure would like to be able to restart the process without starting from the top.
...
How would you design this?
I will later add the current "hack" and my brief idea
========== Current solution ================
It's running on BEA/Oracle Weblogic Integration, not by choice but by definition
When the file is received, each line is read into a database row with
id, line, batchfilename, and a status of 'Needs processing'.
When all lines are in the database, the rows are split by mod 4 and a process is started for each quarter of the rows; each line that needs it is enriched by the getPerson call and its status is set to 'Processed' (38,000 in the current batch).
When all 4 quarters of the rows have been processed, a writer process starts, selecting 100 rows at a time from the database, writing them to the file and updating their status to 'Written'.
When all is done the new file is handed back to the routing framework, and an "I'm done" email is sent to the operations crew.
The 4 processing processes can/will fail, so it's possible to restart them with an HTTP GET to a servlet on WLI.
Simplify as much as possible.
The batches (processing them as units, with their various sizes) appear to be something you can discard if you want the simplest process. It sounds like the rows are atomic, not the batches.
Feed all the lines as separate atomic transactions through an asynchronous FIFO message queue, with a good mechanism for detecting (and appropriately logging and routing) failures. Then you can deal with the problems strictly on an exception basis. (A queue table in your database can probably work.)
Maintain batch identity only with a column in the message record, and summarize batches by that means however you need, whenever you need.
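To make the queue-table idea concrete, here is a rough JDBC sketch of one worker iteration; the table and column names are invented, PersonClient stands in for your framework's getPerson call, and the LIMIT ... FOR UPDATE claim is MySQL-flavoured SQL that you would adapt to whatever database you actually have:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** Placeholder for the synchronous getPerson call through the routing framework. */
interface PersonClient {
    String enrich(String line) throws Exception;
}

/** One worker iteration: claim the next unprocessed line, enrich it, mark it done. */
public class QueueWorker {

    // Assumed table: line_queue(id, batch_id, line, status) with status 'NEW' / 'DONE'.

    public boolean processOne(Connection conn, PersonClient personClient) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement claim = conn.prepareStatement(
                     "SELECT id, line FROM line_queue WHERE status = 'NEW' LIMIT 1 FOR UPDATE");
             ResultSet rs = claim.executeQuery()) {

            if (!rs.next()) {
                conn.rollback();          // nothing left to do
                return false;
            }
            long id = rs.getLong("id");
            String enriched = personClient.enrich(rs.getString("line"));

            try (PreparedStatement done = conn.prepareStatement(
                         "UPDATE line_queue SET line = ?, status = 'DONE' WHERE id = ?")) {
                done.setString(1, enriched);
                done.setLong(2, id);
                done.executeUpdate();
            }
            conn.commit();                // the row is the unit of work, not the batch
            return true;
        } catch (Exception e) {
            conn.rollback();              // row stays 'NEW', so it can be retried or routed as a failure
            throw e;
        }
    }
}
```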
When you receive the file, parse it and put the information in the database.
Make one table with a record per line that will need a getPerson request.
Have one or more threads get records from this table, perform the request and put the completed record back in the table.
Once all records are processed, generate the complete file and return it.
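A sketch of the "one or more threads" step with a plain ExecutorService; Record, PersonService and the pool size of 8 are placeholders for whatever your own model and framework client look like:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Placeholder for the remote getPerson lookup (200-400 ms per call). */
interface PersonService {
    String getPerson(String personId);
}

/** Placeholder for one parsed line from the table. */
interface Record {
    String getPersonId();
    void setPersonInfo(String info);
    void markProcessed();   // persists the new status back to the table
}

public class Enricher {

    /** Enriches all pending records, running a handful of getPerson calls in parallel. */
    public void enrichAll(List<Record> pending, PersonService personService)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (Record record : pending) {
            pool.submit(() -> {
                // In real code, catch and log failures here so one bad record
                // does not silently disappear inside the Future.
                record.setPersonInfo(personService.getPerson(record.getPersonId()));
                record.markProcessed();
            });
        }
        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.DAYS);   // matches the two-day requirement
    }
}
```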
If the processing of the file takes 2 days, then I would start by implementing some sort of resume feature. Split the large file into smaller ones and process them one by one. If for some reason the whole processing is interrupted, you will not have to start all over again.
By splitting the larger file into smaller files, you could also use more servers to process the files.
You could also use a mass loader (Oracle's SQL*Loader, for example) to get the large amount of data from the file into the table, again adding a column to mark whether a line has been processed, so you can pick up where you left off if the process crashes.
The return value could be many small files which would be combined into a single large file at the end. If the database approach is chosen, you could also save the results in a table, which could then be extracted to a CSV file.
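As an illustration of the splitting idea mentioned above (chunk size, file naming and charset handling are all arbitrary here), a sketch like this lets a restart redo only the chunk that was in flight:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileSplitter {

    /** Splits the big input file into numbered chunk files of linesPerChunk lines each. */
    public static void split(Path input, Path outDir, int linesPerChunk) throws Exception {
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            BufferedWriter chunk = null;
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                if (lineNo % linesPerChunk == 0) {
                    if (chunk != null) {
                        chunk.close();
                    }
                    // chunk-000001.txt, chunk-000002.txt, ... are processed independently,
                    // so only the chunk that was in flight needs to be redone after a crash.
                    chunk = Files.newBufferedWriter(
                            outDir.resolve(String.format("chunk-%06d.txt", lineNo / linesPerChunk + 1)));
                }
                chunk.write(line);
                chunk.newLine();
                lineNo++;
            }
            if (chunk != null) {
                chunk.close();
            }
        }
    }
}
```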