I maintain a lottery website with millions of users. Some active users (perhaps more than 30,000) will buy more than 1,000 lottery tickets within one second.
The current logic uses select ... for update to protect the account balance, but as a result the database server is overloaded and very slow. We have to process the purchases in real time.
Has anyone dealt with a similar situation before?
First, you need to design a transactional system that satisfies your business rules. For the moment, forget about disk and memory, and what goes where. Try to design a system that is as lightweight as possible, that does the minimum required amount of locking, that satisfies your business rules.
Now, run the system, what happens? If performance is acceptable, congratulations, you're done.
If performance is not acceptable, avoid the temptation to guess at the problem and start making adjustments. You need to profile the system to understand where the most time is being spent, so that you know which areas to focus your tuning efforts on. The easiest way to do this is to trace it using SQL_TRACE. You've not mentioned the Oracle edition, version, or platform, so I'll assume you're on at least some version of 10gR2 and can use DBMS_MONITOR to start and end traces. Scoping is important here: it's critical that you start the trace, run the code that you want to profile, and then immediately shut off the trace. This way you trace only what you're interested in, and the profile won't contain any extraneous information. Once you have the trace file, you need to process it. There are several tools. The most common is TKPROF, which is provided by Oracle but really doesn't do a very good job. The best free profiler that I'm aware of is OraSRP. Download a copy of OraSRP and check your results. The data in the report should point you in the right direction.
Once you've done all that, if you still have questions, ask a new question here, and I'm sure we can help you interpret the output of OraSRP, to help you understand where your bottlenecks are.
Hope that helps.
Personally, I would lock/update the accounts in memory and update the database as a background task. Using this approach you can easily support thousands of updates and accounts.
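A minimal sketch of that idea, assuming a single application process owns all balances. The class and method names (InMemoryLedger, tryDebit, drainDirty) are made up for illustration, and the actual database write is left to whatever background task calls drainDirty.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory ledger: balances are debited atomically in memory
// and a background task persists the changed ones later.
class InMemoryLedger {
    // Balance per account id, in the smallest currency unit.
    private final Map<Long, AtomicLong> balances = new ConcurrentHashMap<>();
    // Accounts whose balance changed since the last flush.
    private final Set<Long> dirty = ConcurrentHashMap.newKeySet();

    void deposit(long accountId, long amount) {
        balances.computeIfAbsent(accountId, id -> new AtomicLong()).addAndGet(amount);
        dirty.add(accountId);
    }

    // Debits only if the balance is sufficient. The compare-and-set loop
    // stands in for the row lock taken by SELECT ... FOR UPDATE.
    boolean tryDebit(long accountId, long amount) {
        AtomicLong balance = balances.get(accountId);
        if (balance == null) {
            return false;
        }
        while (true) {
            long current = balance.get();
            if (current < amount) {
                return false;
            }
            if (balance.compareAndSet(current, current - amount)) {
                dirty.add(accountId);
                return true;
            }
        }
    }

    long balanceOf(long accountId) {
        AtomicLong balance = balances.get(accountId);
        return balance == null ? 0 : balance.get();
    }

    // Called by the background task: returns the current balance of every
    // account changed since the previous call, ready to be written to the DB.
    Map<Long, Long> drainDirty() {
        Map<Long, Long> snapshot = new HashMap<>();
        for (Long id : dirty) {
            if (dirty.remove(id)) {
                snapshot.put(id, balances.get(id).get());
            }
        }
        return snapshot;
    }
}
```

A scheduled task could call drainDirty() every few hundred milliseconds and write the snapshot in one batch. The trade-off is durability: anything not yet flushed is lost if the process dies, so whether this is acceptable depends on your business rules.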
A. Speed up things without modifying the code:
1 - You can keep the table entirely in memory (that is, in the SGA; it is still stored on disk as well):
alter table t storage ( buffer_pool keep )
(discuss this with your DBA before doing it)
2 - If the table is too big and you update the same rows again and again, it is probably sufficient to use the cache attribute:
alter table t cache
This command puts the blocks of your table at the highest-priority end of the LRU list when they are used, so they are less likely to be aged out of the SGA.
Here is a discussion of the differences: ask tom
3 - Another, more advanced solution that needs more analysis and resources is TimesTen.
B. Speed up your database operations:
Identify the top queries and:
create indexes where you update or select only one row or a small set of rows.
partition large tables scanned for only a segment of data.
Have you identified a top query?
I am working on a project where I was provided with a Java matrix-multiplication program that can run on a distributed system. It is run like so:
usage: java Coordinator maxtrix-dim number-nodes coordinator-port-num
For example:
java blockMatrixMultiplication.Coordinator 25 25 54545
Here's a snapshot of what the output looks like: [output snapshot not included]
I want to extend this code with some kind of failsafe ability, and I am curious about how I would create checkpoints within a running matrix multiplication calculation. The general idea is to recover to where it was in a computation (but it doesn't need to be that fine-grained; recovering to the beginning, i.e. row 0, column 0, is enough).
My first idea is to use log files (like Apache log4j), where I would log the relevant matrix status. Then, if we forcibly shut down the app in the middle of a calculation, we could recover to a reasonable checkpoint.
Should I use MySQL for such a task (or maybe a more lightweight database)? Or would a basic log file (plus some useful Apache libraries) be good enough? Any tips appreciated, thanks.
source-code :
MatrixMultiple
Coordinator
Connection
DataIO
Worker
If I understand the problem correctly, all you need to do is recover your place in a single matrix calculation in the event of a crash or if the application is quit half way through.
Minimum Viable Solution
The simplest approach would be to recover just the two matrixes you were actively multiplying, but none of your progress, and multiply them from the beginning next time you load the application.
The Process:
At the beginning of public static int[][] multiplyMatrix(int[][] a, int[][] b) in your MatrixMultiple class, create a file, let's call it recovery_data.txt with the state of the two arrays being multiplied (parameters a and b). Alternatively, you could use a simple database for this.
At the end of public static int[][] multiplyMatrix(int[][] a, int[][] b) in your MatrixMultiple class, right before you return, clear the contents of the file, or wipe your database.
When the program is initially run, most likely near the beginning of main(String[] args), you should check whether the text file is non-empty. If it is, multiply the matrices stored in the file and display the output; otherwise proceed as usual.
Notes on implementation:
Whether to use a simple text file or a full-fledged relational database is a decision you are going to have to make, mostly based on real-world data that only you can really know, but in my mind a text file wins out in most situations, and here are my reasons why. You are going to want to read the data sequentially to rebuild your matrix, so being relational is not that useful. Databases are harder to work with: not too hard, but compared to a text file there is no question, and since you would not make much use of querying, that isn't balanced out by the ways they usually make a programmer's life easier.
Consider how you are going to store your arrays. In a text file, you have several options, my recommendation would be to store each row in a line of text, separated by spaces or commas, or some other character, and then put an extra line of blank space before the second matrix. I think a similar approach is used in crAlexander's Answer here, but I have not tested his code. Alternatively, you could use something more complicated like JSON, but I think that would be too heavy handed to justify. If you are using a database, then the relational structure should make several logical arrangements for your data apparent as well.
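Here is a rough sketch of the text-file variant, using the layout suggested above (one row per line, values separated by spaces, a blank line between the two matrices). The RecoveryFile class and its method names are my own invention and not part of your code base.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: stores the two input matrices before multiplying
// and removes the file again once the result has been returned.
class RecoveryFile {
    // Writes matrix a, a blank line, then matrix b.
    static void save(Path file, int[][] a, int[][] b) {
        List<String> lines = new ArrayList<>();
        appendMatrix(lines, a);
        lines.add("");
        appendMatrix(lines, b);
        try {
            Files.write(file, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void appendMatrix(List<String> lines, int[][] matrix) {
        for (int[] row : matrix) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(row[i]);
            }
            lines.add(sb.toString());
        }
    }

    // Returns {a, b}, or null when there is nothing to recover.
    static int[][][] load(Path file) {
        if (!Files.exists(file)) {
            return null;
        }
        List<String> lines;
        try {
            lines = Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<int[]> a = new ArrayList<>();
        List<int[]> b = new ArrayList<>();
        List<int[]> current = a;
        for (String line : lines) {
            if (line.isEmpty()) {
                current = b; // the blank line switches to the second matrix
                continue;
            }
            String[] parts = line.split(" ");
            int[] row = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                row[i] = Integer.parseInt(parts[i]);
            }
            current.add(row);
        }
        if (a.isEmpty() || b.isEmpty()) {
            return null;
        }
        return new int[][][] { a.toArray(new int[0][]), b.toArray(new int[0][]) };
    }

    // Call right before multiplyMatrix returns.
    static void clear(Path file) {
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this, multiplyMatrix would call save at the top and clear right before returning, and main would call load on startup and resume the multiplication if the result is not null.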
Strategic Checkpoints
You expressed interest in saving some calculations by taking advantage of the possibility that some of them will already have been handled the last time the program ran. Let's first look at the pros and cons of adding checkpoints after every row has been processed, as best I can see them.
Pros:
Save computation time next time the program is run, if the system had been closed.
Cons:
Making the extra writes will either use more nodes if distributed (more on that later) or increase general latency from calculations because you now have to throw in a database write operation for every checkpoint
More complicated to implement (but probably not by too much)
If my comments on the implementation of the Minimum Viable Solution convinced you that you could get away with a text file and would not need an RDBMS, note that the points about not leveraging queries and about everything being accessed sequentially no longer hold here, so a database is now perhaps the smarter choice.
I'm not saying that checkpoints are definitely not the better solution, just that I don't know if they are worth it, but here is what I would consider:
Do you expect people to be quitting half way through a calculation frequently relative to the total amount of calculations they will be running? If you think this feature will be used a lot, then the pro of adding checkpoints becomes much more significant relative to the con of it slowing down calculations as a whole.
Does it take a long time to complete a typical calculation that people are providing the program? If so, the added latency I mentioned in the cons is (percentage wise) smaller, and so perhaps more tolerable, but users are already less happy with performance, and so that cancels out some of the effect there. It also makes the argument for checkpointing more significant because it has the potential to save more time.
And so I would only recommend checkpointing like this if you expect a relatively large amount of instances where this is happening, and if it takes a relatively large amount of time to complete a calculation.
If you decide to go with checkpoints, then modify the approach to:
After every row of the result array has been processed, write the content of that row to your database or, if you use the text file, append it to the end of the text file, after another empty line that separates it from the last matrix.
On startup, if you need to finish a calculation that has already begun, compute and distribute only the rows that have yet to be considered, and retrieve the content of the other rows from your database.
A quick point on implementing frequent checkpoints: You could greatly reduce the extra latency from adding in frequent checkpoints by pushing this task out to an additional thread. Doing this would use more processes, and there is always some latency in actually spawning the process or thread, but you do not have to wait for the entire write operation to be completed before proceeding.
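To illustrate the extra-thread idea, here is a sketch that hands each finished row to a single background writer thread which appends it to a file. RowCheckpointer, rowDone and completedRows are hypothetical names, and the "row index followed by the values" line format is just one possible layout.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical checkpoint writer: rows are appended by one background thread.
class RowCheckpointer implements AutoCloseable {
    private final Path file;
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    RowCheckpointer(Path file) {
        this.file = file;
    }

    // Queues a finished result row; the caller does not wait for the disk write.
    void rowDone(int rowIndex, int[] row) {
        StringBuilder sb = new StringBuilder().append(rowIndex);
        for (int value : row) {
            sb.append(' ').append(value);
        }
        String line = sb.append(System.lineSeparator()).toString();
        writer.execute(() -> {
            try {
                Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    // Reads the rows finished in a previous run, keyed by row index.
    // Lines with the wrong number of values are skipped.
    static Map<Integer, int[]> completedRows(Path file, int columns) {
        Map<Integer, int[]> rows = new TreeMap<>();
        if (!Files.exists(file)) {
            return rows;
        }
        List<String> lines;
        try {
            lines = Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != columns + 1) {
                continue;
            }
            int[] row = new int[columns];
            for (int i = 0; i < columns; i++) {
                row[i] = Integer.parseInt(parts[i + 1]);
            }
            rows.put(Integer.parseInt(parts[0]), row);
        }
        return rows;
    }

    // Waits for queued writes to finish before shutting down.
    @Override
    public void close() {
        writer.shutdown();
        try {
            writer.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

On restart, the rows returned by completedRows can be copied into the result and only the missing row indexes handed out again. Note that a row interrupted mid-write could still have the right number of values, so a stricter version would add a per-line checksum or end marker.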
A quick warning on the implementation of any such failsafe method
If there is an unchecked edge case in which some sort of invalid matrix crashes the program, this failsafe now bricks the program entirely by retrying that matrix on every start. To combat this, I see some obvious solutions, but perhaps a bit of thought would let you modify my approaches into something you prefer:
Use a lot of try and catch statements, if you get any sort of error that seems to be caused by malformed data, wipe your recovery file, or modify it to add a note that tells your program to treat it as a special case. A good treatment of this special case may be to display the two matrixes at start with an explanation that your program failed to multiply them likely due to malformed content.
Add data in your file/database on how many times the program has quit while solving the current problem, if this is not the first resume, treat it like the special case in the above option.
I hope that this provided enough information for you to implement your failsafe in the way that makes the most sense given what you suspect the realistic use to be, and note that there are perhaps other ways you could approach this problem as well, and these could equally have their own lists of pros and cons to take into consideration.
I'm fairly new to programming, at least when it comes to anything substantial. I am about to start work on management software for my employer which draws its data from, and stores its data to, an SQL database. I will likely be using JDBC to interact with it.
To try and accurately describe the problem I am going to focus on a very small portion of the program. In the database, there is a table that stores Job records. There are a couple of thousand of them. I want to display all available Jobs (as a text reference from the table) in a scroll-able panel in the program with a search function.
So, my question is... Should I create Job objects from each record in one go and have the program work with the objects to display them, OR should I simply display strings taken directly from the records? The first method would mean that other details of each job are stored in advance, so that when I open a record in the UI the load times should be minimal. However, it also sounds like it would take a great deal of resources when it initially populates the panel and generates the objects. The second method would mean issuing a large number of queries to the database and might avoid the initial resource overhead, but I don't want to put too much strain on the SQL Server because other in-house software relies on it.
Really, I don't know anything about how I should be doing this. But that really is my question. Apologies if I am displaying my ignorance in this post, and thank you in advance for any help you can offer.
"A couple thousand" is a very small number for modern computers. If you have any sort of logic to perform on these records (they're not all modified solely via stored procedures), you're going to have a much easier time using an object-relational mapping (ORM) tool like Hibernate. Look into the JPA specification, which allows you to create Java classes that represent database objects and then simply annotate them to describe how they're stored in the database. Using an ORM like this system does have some overhead, but it's nearly always worthwhile, since computers are fast and programmers are expensive.
Note: This is a specific example of the rule that you should do things in the clearest and easiest-to-understand way unless you have a very specific reason not to, and in particular that you shouldn't optimize for speed unless you've measured your program's performance and have determined that a specific section of the code is causing problems. Use the abstractions that make the code easy to understand and come back later if you actually have to speed things up.
The essence of my problem is that there are too many solutions, and I would like to find which one wins out in pros and cons before I build an infrastructure around it.
(Simplified for the purpose of this forum) This is an auction site where five auctions are stored in a rank #1-5, #1 being the currently featured auction. The other four are simply "on deck." After either a couple hours or the completion of that auction, #2-5 move up to #1-4 and a new one is chosen to be #5
I'm using a dedicated server and I've been considering just storing the data in the servlet or maybe adding a column in the database as a boolean for each auction...like "isFeatured = 1"
Suffice it to say the data is read about 5 times+ more often than it is written, which is why I'm leaning towards good old SQL.
If you can retrieve the relevant auctions from the DB with a simple query using ORDER BY and TOP or something similar, then try that. If no performance issues occur, then KISS and you're done.
Otherwise, if these 5 auctions stay valid for a while, cache them in memory. Have a singleton hold these auctions and provide methods for updating them, for example. Maybe you want to use a caching library instead. Update the top 5 whenever necessary, but serve them directly out of memory without hitting a DB or something similarly expensive.
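A minimal sketch of such a singleton, assuming auctions are identified by a numeric id. FeaturedAuctions and its methods are invented for illustration; loading the ids from the database is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory holder for the ranked auction ids (#1 first).
final class FeaturedAuctions {
    private static final FeaturedAuctions INSTANCE = new FeaturedAuctions();

    // Readers always see a complete, immutable snapshot of the ranking.
    private final AtomicReference<List<Long>> ranked =
            new AtomicReference<>(Collections.<Long>emptyList());

    private FeaturedAuctions() {
    }

    static FeaturedAuctions getInstance() {
        return INSTANCE;
    }

    // Replaces the whole ranking, e.g. after reloading it from the database.
    void replaceAll(List<Long> auctionIds) {
        ranked.set(Collections.unmodifiableList(new ArrayList<>(auctionIds)));
    }

    // Served straight from memory on every read.
    List<Long> current() {
        return ranked.get();
    }

    // Drops #1, moves the others up one rank and appends the new last entry.
    void rotate(long newAuctionId) {
        ranked.updateAndGet(old -> {
            List<Long> next = new ArrayList<>(old.subList(Math.min(1, old.size()), old.size()));
            next.add(newAuctionId);
            return Collections.unmodifiableList(next);
        });
    }
}
```

This only works as-is when a single application server holds the cache; with several servers each one would need to reload or be notified when the ranking changes.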
What kind of scale are you looking for? How many application servers need access to the data?
I think you're probably making this more complicated than it is. Just use a database, take a hit of ACID, and move onto whatever else you need to work on. :P
Have you taken a look at SQLite? It allows for "good old SQL" without all of the hassles of setting up a separate database server. As long as the data isn't too huge (to be fair, I haven't tested the size limits, but I've skimmed blog entries mentioning the use of SQLite to process files of several dozen MB in size quickly and with no problems), you should be fine.
It isn't a perfect solution for all needs (frankly, I sometimes find the dynamic typing to be a pain), but since it relies on locally stored files, reads will be much faster than firing up a network connection to talk to a more "traditional" RDBMS.
Sorry for the poor topic name, I could not think of anything better ;)
I am working on a news broadcast website project, and the stakeholder asked me to create a unique HTML file for each article and save it on disk instead of using a DBMS like MySQL, so that users can access the file directly and no computation is needed, which means there won't be any bottleneck.
And I did so.
My question is: is this (what he asked me to do) a good and popular practice in programming?
What are the pros and cons?
Thank you all, and sorry for my poor English writing :P
If you got a template and can generate these pages automatically, it can be a good practise. Like you say, it prevents your server from having to generate the page. It only needs to put through the plain page.
And if you need to change the layout, or need to edit an article, you can just regenerate the page.
It is quite common, although lots of pages always have some dynamic content, like a date, user info or other session or time specific data. In this case you cannot cache the entire page. Of course you can combine both. Have dynamic index pages and front page, and only cache the actual articles themselves. But I read in your question that that is what you've done now.
Pros:
Faster retrieval of pages
Less load on your webserver
Less load on your database server
Cons:
Need to do some extra work to update the cache when an article is modified
Cannot have any dynamic content in the page
There probably isn't a problem at all. Most webservers are able to serve large amounts of dynamic pages (premature optimization is the root of all evil).
There are other ways to speed things up that don't have the above cons. You could cache query results in Memcache and/or use APC cache to speed up your PHP code and decrease disk I/O.
But there are web hosting companies dedicated entirely to serving static content. That static content can be served from memory too, making it even faster than APC-cached dynamic content, so if you really, really need the performance, yes, this is the way to go. But I seriously doubt you do.
Static pages are good for small websites. If you have the chance, go for it but if you need complex operations, dynamic page structure should be the way to go.
For an article site, I'd go with dynamic pages since the concept is dynamic (You'll need to update the site, add new articles, maybe add new features like commenting, user activity etc).
It is easier to add/delete/edit an article directly from an admin panel, with static pages, you'd have to find your way through the html code.
The list would go on and on...
Without a half-decent templating system, you'd have to store the full article AND the page layout and styles in the one file.
This means it'd be difficult to update the look and feel across all the published articles, and if you wanted to query the articles and return a list (such as those from a specific author or in a specific category), you'd be a bit stuck too.
If you think of it as a replacement for your database: no, that's not good practice. You lose a lot of information, and editing pages later will be harder, as will setting up indexed search functions, ...
If you think of it as a caching solution: then yes, this is good practice and also a common technique. But think about how to do the caching and when to replace the files with new versions, and only do it if you have few write accesses and a lot of read accesses to your pages (which is typical for an article site ^^)
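As a sketch of the caching variant: the article stays in the database, and a static HTML file is (re)written whenever the article is saved or edited. ArticlePageCache, the file naming and the tiny inline template are assumptions of mine, and the body is treated as plain text here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical static-page cache: one generated HTML file per article.
class ArticlePageCache {
    private final Path dir;

    ArticlePageCache(Path dir) {
        this.dir = dir;
    }

    // Minimal escaping so article text cannot break the markup.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // The "template": a shared layout with the article content inserted.
    static String render(String title, String body) {
        return "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"><title>"
                + escape(title) + "</title></head>\n<body><h1>" + escape(title)
                + "</h1>\n<p>" + escape(body) + "</p></body></html>\n";
    }

    // Call after every insert or update of the article in the database;
    // an existing file for the same id is overwritten.
    Path publish(long articleId, String title, String body) {
        Path page = dir.resolve("article-" + articleId + ".html");
        try {
            Files.createDirectories(dir);
            Files.write(page, render(title, body).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page;
    }
}
```

Because the database remains the source of truth, changing the layout just means running publish again for every article.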
Definitely not a common practice, and I would not do it this way, especially not for the reason of avoiding a bottleneck: you won't have any bottleneck there, nor any performance problem. How many unique visitors is your site likely to get? Hundreds of thousands?
In fact, reading from the disk is more likely to be a problem. DB operations can be optimized, cached in memory, etc - the db server performs various optimizations. On the other hand, you read the file each time (or handle the caching yourself).
The usual and preferred way to do it is:
store and load content from DB
have a template (header + footer) for the page, and only insert the content
have an admin panel with an editor (as rich as possible) where you can modify the content of the article
I started out asking myself why a stakeholder might be asking you to implement a system this way. Why would he / she care, as long as your system meets the requirements? There are two possible answers to this:
The stakeholder is a bit of a control freak; e.g. an ex-techie who likes to interfere with what his developers do.
The stakeholder has had a bad experience in the past; e.g. with a previous system where the content was "locked into" a database with an unwieldy front end that made life hell for the users.
From this standpoint, how would you address the problem? My take is that you need to get to the bottom of why the stakeholder is asking for this. Does he have some genuine concern? Can you address that concern in the system design?
The bottom line is that "is this best practice" is not the overriding criterion here. Arguably, "what the customer wants" or "what the customer needs" are more important.
What I think you need to do is:
Find out what the stakeholder's real concern is.
Discuss with him / her (and other stakeholders) the design options that will address those concerns. Present them with the alternatives and an honest assessment of their implications, and involve them in the decision making.
I am attempting to model a realistic social network (Facebook). I am a Computer Science Graduate student so I have a grasp on basic data structures and algorithms.
The Idea:
I began this project in java. My idea is to create multiple Areas of Users. Each User in a given area will have a random number of friends with a normal distribution around a given mean. Each User will have a large percentage or cluster of "Friends" from the Area that they belong to. The remainder of their "Friends" will be smaller clusters from a few different random Areas.
Initial Structure
I wanted to create an ArrayList of areas
ArrayList<Area> areas
With each Area holding an ArrayList of Users
ArrayList<User> users
And each User holding an ArrayList of "Friends"
ArrayList<User> friends
From there I can go through each Area, and each User in that Area and give that user most of their friends from that Area, as well as a few friends from a few random Areas. This is easy enough as long as my data set remains small.
The problem:
When I try to create large data sets, I get an OutOfMemoryError because the heap runs out of memory. I now realize that this approach will be impossible if I want to create, say, 30 Areas with 1 million users per Area and 200 friends per User. I eat up almost 2 GB with 1 Area... so now what? My algorithm would work if I could create all the users ahead of time and then simply "give" friends to each user. But I need the Areas and Users created first: there needs to be a User in an Area before it can be made a "friend".
Next Step:
I like my algorithm; it is simple and easy to understand. What I need is a better way to store this data, since it can't all be held in memory at once. For each user, I am going to need to access not only the Area that user belongs to, but also a few random Areas.
My Questions:
1. What technology/data structure should I be putting this data into. In the end I basically want a User->Friends relationship. The "Area" idea is a way to make this relationship realistic.
2. Should I be using a different language all together. I know that technologies such as Lucene, Hadoop, etc. were created with Java, and are used for large amounts of data...But I have never used them and would like some guidance before I dive into something new.
3. Where should I begin? Obviously I cannot use only java with the data in memory. But I also need to create these Areas of Users before I can give a User a list of Friends.
Sorry for the semi-long read, but I wanted to lay out exactly where I am so you could guide me in the right direction. Thank you to everyone that took the time to read/help me with this topic.
You need a searchable storage solution to hold your data (rather than holding it all in memory). Either a relational database (such as Oracle, MySQL, or SQL Server) with an O/RM (such as Hibernate) or a nosql database such as mongodb will work just fine.
Use a database with an ORM tool (JPA with Hibernate, etc.).
Load data lazily, only when it is really needed.
Unload data from the cache/session when it is no longer required or is inactive.
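The load-lazily and unload-when-inactive points can be sketched with a small least-recently-used cache in plain Java. LazyLruCache is a made-up class, and the loader function stands in for whatever database or ORM lookup you use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache: loads entries on first access and evicts the
// least recently used entry once the capacity is exceeded.
class LazyLruCache<K, V> {
    private final Function<K, V> loader;
    private final LinkedHashMap<K, V> entries;

    LazyLruCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the most recently used entries last.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached value, loading it only if it is not in memory.
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    synchronized boolean isCached(K key) {
        return entries.containsKey(key);
    }
}
```

For example, new LazyLruCache<>(100_000, id -> loadUserFromDatabase(id)) would keep at most 100,000 users in memory at a time, where loadUserFromDatabase is a placeholder for your own lookup.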
Feel free to let me know if anything is difficult to understand.
http://puspendu.wordpress.com/
There is probably no benefit keeping it all in memory, unless you are planning on using every node in some visual algorithm to display relationships.
So, if you use a database then you can build your relationships, give random demographic information, if you want to model that also, and then it is a matter of just writing your queries.
But if you do need a large amount of data in memory, then with 64-bit Java you can set the heap size to a much larger number, depending on how much memory your computer has.
So, once you built your relationships, then you can begin to write the queries to relate the information in different ways.
You may want to look at using Lists instead of arrays when sizes differ, so that you aren't wasting memory when you read the data back. I expect that is the main reason you are running out of memory: if you size for 100 users where the largest number of friends is 50 but most have 10, then you are wasting space for the vast majority of users. This matters especially when you are dealing with millions of users, as the cost of the pointer for each object becomes non-trivial.
You may want to re-examine your data structures; I expect you have some inefficiencies there.
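A quick back-of-the-envelope calculation helps when re-examining the data structures. The sketch below uses the numbers from the question (30 Areas, 1 million users per Area, 200 friends per User) and assumes, as a best case, that each friend link is stored as a bare 4-byte int id rather than an object reference inside an ArrayList. FriendGraphEstimate is a made-up name.

```java
// Hypothetical estimate of the minimum memory needed for the friend lists.
class FriendGraphEstimate {
    // Total number of friend entries across all users.
    static long totalFriendLinks(long areas, long usersPerArea, long friendsPerUser) {
        return areas * usersPerArea * friendsPerUser;
    }

    // Lower bound: every friend entry stored as one int id.
    static long bytesAsIntIds(long links) {
        return links * Integer.BYTES;
    }

    public static void main(String[] args) {
        long links = totalFriendLinks(30, 1_000_000, 200);
        System.out.println(links + " friend links");                 // 6000000000 friend links
        System.out.println(bytesAsIntIds(links) + " bytes as int IDs"); // 24000000000 bytes as int IDs
    }
}
```

Even in this best case the friend lists alone need about 24 GB, so for the full data set some form of on-disk storage, or generating and writing out one Area at a time, looks unavoidable regardless of how compact the in-memory layout is.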
You may also want to use some monitoring tools, and this page may help:
http://www.scribd.com/doc/42817553/Java-Performance-Monitoring
Even something as simple as jconsole would help you to see what is going on with your application.
Well you are not breaking new ground here and there are a lot of existing models that you can pull great amounts of information from and tailor to suit your needs. Especially if you are open to the technologies used. I understand your desire to have it fill this huge number from the start but keep in mind a solid foundation can be built upon and changed as needed without a complete rewrite.
There is some good info and many links to additional good info as to what FB, LinkedIn, Digg, and others are doing here at Stackoverflow question 1009025