Approximate/Fuzzy matching when we have a huge list - java

I have a table in a MySQL database containing the complete information of a user. Now, I want to find the user's record based on the user name entered by the user, and I want to make my matching intelligent. For example, the user enters "Bilal Ahmed" but the actual entry in the table is "Bilal Ahmad"; notice there is a difference of just a single character.
Soundex will be time-consuming and will not be effective from an accuracy point of view either, as I have 17 lakh (1.7 million) records, increasing day by day.
Kindly suggest how I can approach this issue.
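For reference, the "difference of just a single character" described above is exactly what an edit distance such as Levenshtein distance measures. A minimal Java sketch follows (illustrative only; scanning 1.7 million rows with it on every lookup would still be slow without some indexing or blocking scheme on top):

public class Levenshtein {
    // Counts the single-character edits (insert, delete, substitute)
    // needed to turn one string into another, using two rolling rows.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Bilal Ahmed", "Bilal Ahmad")); // prints 1
    }
}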

Related

Generate Unique Enrollment Number

I have two tables named College and Student.
I want to generate a unique enrollment number for each student at the time of registration.
In a single college, multiple registrations are possible at the same time. For example, college ABC has many people who can register students.
My logic for generating the enrollment ID is YY_College-PK_Last-five-digit-increment:
YY_COLFK_DDDDD
At student registration time I first fire a MAX query like
select Max(Enrollment_No) from student where College_Fk=101
to get the last Enrollment_No, then split off the last five digits, increment by 1, and insert it.
When two students' data is submitted at the same time, there is a chance of generating the same Enrollment_No for both students.
How can I manage this problem?
On the Java side of things you could draw some inspiration from concepts such as UUIDs (see https://www.baeldung.com/java-uuid for example).
But as you are using a database, you should rather use the capabilities of the database itself; see How to generate unique id in MySQL? for some examples.
In other words: the database is your single source of truth. It offers you the ability to have IDs that are guaranteed to be unique!
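As one concrete sketch of letting the database guarantee uniqueness, you could keep a per-college counter row and lock it inside a transaction. This is a hedged example, not your schema: the college_counter table, its columns, and the pre-seeded row per college are assumptions.

import java.sql.*;

public class EnrollmentAllocator {

    // Returns e.g. "13_101_00042" for year "13" and college 101.
    // Assumes college_counter(college_fk, last_seq) has a row per college.
    public static String nextEnrollmentNo(Connection con, int collegeFk, String yy)
            throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement lock = con.prepareStatement(
                 "SELECT last_seq FROM college_counter WHERE college_fk = ? FOR UPDATE");
             PreparedStatement bump = con.prepareStatement(
                 "UPDATE college_counter SET last_seq = ? WHERE college_fk = ?")) {
            lock.setInt(1, collegeFk);
            int next;
            try (ResultSet rs = lock.executeQuery()) {
                rs.next();                      // row stays locked until commit
                next = rs.getInt("last_seq") + 1;
            }
            bump.setInt(1, next);
            bump.setInt(2, collegeFk);
            bump.executeUpdate();
            con.commit();
            return String.format("%s_%d_%05d", yy, collegeFk, next);
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}

Because SELECT ... FOR UPDATE serializes concurrent registrations per college, two simultaneous inserts can no longer read the same MAX value.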
If you want to add two students' data at the same time, then you must be using the insert statement twice, so for each record you have to ...

Fetching sorted data from server in chunk?

I need to implement a feature where I display customer names (along with other customer data) from an Oracle database in ascending or descending order.
Say I display the first 100 names from the DB in descending order.
There is a "show more" button which will display the next 100 names.
I am planning to fetch the next records based on the last index, so in step 2 I will fetch names 101 to 200.
But the problem here is: what if, just before step 2, a name was updated by some other user?
In that case a name can be skipped (if it was updated from X to A) or duplicated (if it was updated from A to Z) when I fetch records by index in step 2.
Consider that the names displayed on the first page run from Z to X.
How can I handle this scenario so that I display the correct records without skips or duplicates?
One way I can think of is to fetch all record IDs into memory (either web server memory or cursor memory), store them as a temporary result, and then return the data from there instead of from the live data. But if I have millions of records, that will put a load on memory, either on the web server or the DB.
What is the best approach, and how do other sites handle this kind of scenario?
If you really want each user to view a fixed snapshot of the table data, then you will have to do some caching behind the scenes. You have a valid concern about what would happen if, when requesting page 2, several new records landed on what would have been page 1, causing the same information to be viewed again on page 2. Playing devil's advocate, though, I could also argue that a user might be viewing records which were deleted and are no longer there. This could be equally bad in terms of user experience.
The way I have usually seen this problem handled is to just do a fresh query for each page. Since you are using Oracle, you would likely be using OFFSET and FETCH. It is possible that there could be a duplicated/missing record problem, but unless your data is very rapidly changing, it may be a minor one.
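To make the fresh-query-per-page idea concrete, here is a hedged JDBC sketch against Oracle 12c+ (the customers table and its columns are assumptions; note the unique tie-breaker column in ORDER BY, which keeps the ordering stable across pages):

import java.sql.*;

public class CustomerPage {

    // Fetches page N (0-based) of 100 customer names with a fresh query each time.
    public static void printPage(Connection con, int page) throws SQLException {
        String sql = "SELECT customer_id, name FROM customers "
                   + "ORDER BY name DESC, customer_id "
                   + "OFFSET ? ROWS FETCH NEXT 100 ROWS ONLY";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, page * 100);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}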

How to find the value that has maximum duplicates in app engine datastore using Java?

I have a datastore that stores the cab booking details of customers. In the admin console I need to display statistics to the admin, like the busiest location, peak hours, and total bookings at a particular location on a particular day. For the busiest location I need to retrieve the location from which the most cabs have been booked. Should I iterate through the entire datastore and keep a count, or is there a method to find which location value has the maximum and minimum duplicates?
I am using an Ajax call to a Java servlet which should return the busiest location.
I also need a suggestion for maintaining such a stats page. Should I keep a separate entity kind just for counters and stats and update it every time a customer books a cab, or is iterating through the entire datastore the right logic for the stats page? Thanks in advance.
There are too many unknowns about your data model and usage patterns to offer a specific solution, but I can offer a few tips.
Updating a counter every time you create a new record will increase your writing costs by 2 write operations, which may or may not be significant.
Using keys-only queries is very cheap and fast. It is the preferred method for counting something, so you should try to model your data in such a way that a keys-only query can give you an answer. For example, if a "trip" entity has a property for "id of a starting point", and this property is indexed, you can loop through your locations using a keys-only query to count the number of trips that started from each location.
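A minimal sketch of that keys-only counting idea, using the classic App Engine low-level Datastore API; the entity kind "Trip" and property "startLocationId" are assumptions about your data model:

import com.google.appengine.api.datastore.*;

public class LocationStats {

    // Counts trips that started from the given location.
    // Keys-only queries are billed as small operations, so this is cheap.
    public static int tripsFrom(String locationId) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Query q = new Query("Trip")
                .setFilter(new Query.FilterPredicate(
                        "startLocationId", Query.FilterOperator.EQUAL, locationId))
                .setKeysOnly();
        return ds.prepare(q).countEntities(FetchOptions.Builder.withDefaults());
    }
}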
Assuming that you record a lot of trips, and that an admin page will be visited/refreshed not very frequently, the keys-only queries approach is the way to go. If the admin page is visited/refreshed many times per hour, you may be better off with the counters.

Creating report from 1 million + records in MySQL and display in Java JSP page

I am working on a MySQL database with three tables: workout_data, excercises, and sets. I'm facing issues generating reports based on these three tables.
For more context, a number of sets make up an excercise, and a number of excercises make up a workout.
I currently have the metrics for which a report is to be generated from the data in these tables. I have to generate reports for the past 42 days, including this week. The queries run for a long time before I get the report from joining these tables.
For example, the sets table has more than 1 million records just for the past 42 days. The id in this table corresponds to the excercise_id in the excercises table, and the id of the excercises table is the workout_id in the workout_data table.
I'm running this query and it takes more than 10 minutes to get the data. I have to prepare a report and show it to the user in the browser, but due to this long-running query the web page times out and the user cannot see the report.
Any advice on how to achieve this?
SELECT REPORTSETS.USER_ID,REPORTSETS.WORKOUT_LOG_ID,
REPORTSETS.SET_DATE,REPORTSETS.EXCERCISE_ID,REPORTSETS.SET_NUMBER
FROM EXCERCISES
INNER JOIN REPORTSETS ON EXCERCISES.ID=REPORTSETS.EXCERCISE_ID
where user_id=(select id from users where email='testuser1@gmail.com')
and substr(set_date,1,10)='2013-10-29'
GROUP BY REPORTSETS.USER_ID,REPORTSETS.WORKOUT_LOG_ID,
REPORTSETS.SET_DATE,REPORTSETS.EXCERCISE_ID,REPORTSETS.SET_NUMBER
Two things:
First, you have the following WHERE clause item to pull out a single day's data:
AND substr(set_date,1,10)='2013-10-29'
This completely defeats the use of an index on the date. If your set_date column has a DATETIME datatype, what you want is
AND set_date >= '2013-10-29'
AND set_date < '2013-10-29' + INTERVAL 1 DAY
This will allow the use of a range scan on an index on set_date. It looks to me like you might want a compound index on (user_id, set_date). But you should muck around with EXPLAIN to figure out whether that's right.
Second, you're misusing GROUP BY. That clause is pointless unless you have some kind of summary function like SUM() or GROUP_CONCAT() in your query. Do you want ORDER BY?
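Putting both fixes together, a hedged JDBC sketch of the reworked query might look like this (assuming set_date really is a DATETIME; the join to users replaces the subquery, and the GROUP BY is dropped in favour of ORDER BY):

import java.sql.*;

public class ReportQuery {

    // Prints one user's sets for a single day using a sargable date range.
    public static void printSets(Connection con, String email, Date day)
            throws SQLException {
        String sql =
            "SELECT rs.user_id, rs.workout_log_id, rs.set_date, "
          + "       rs.excercise_id, rs.set_number "
          + "FROM reportsets rs "
          + "JOIN users u ON u.id = rs.user_id "
          + "WHERE u.email = ? "
          + "AND rs.set_date >= ? "
          + "AND rs.set_date < DATE_ADD(?, INTERVAL 1 DAY) "
          + "ORDER BY rs.set_date, rs.set_number";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, email);
            ps.setDate(2, day);
            ps.setDate(3, day);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%d %d %s %d %d%n",
                            rs.getLong("user_id"), rs.getLong("workout_log_id"),
                            rs.getTimestamp("set_date"), rs.getLong("excercise_id"),
                            rs.getInt("set_number"));
                }
            }
        }
    }
}

The original inner join to EXCERCISES is omitted here because no column from it is selected; keep it if it is meant to filter out orphan rows.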
Comments on your SQL that you might want to look into:
1) Do you have an index on USER_ID and SET_DATE?
2) Your datatype for SET_DATE looks wrong; is it a VARCHAR? Storing it as a DATE will mean that the DB can optimise your search much more efficiently. At the moment the substring method is called countless times per query, as it has to be run for every row considered by that part of your WHERE clause.
3) Is the GROUP BY really required? Unless I'm missing something, the GROUP BY part of the statement brings nothing to the table ;)
It should make a significant difference if you could store the date either as a DATE or in the format you need for the comparison. Performing a substr() call on every date must be time-consuming.
Surely the suggestions about tuning the query will help improve its speed. But I think the main point here is what can be done with more than 1 million records before the session times out. What if you have 2 or 3 million records; will some performance tuning solve the problem? I don't think so. So:
1) If you want to display the data in the browser, use pagination and query (for example) only the first 100 records.
2) If you want to generate a report (like a PDF), then use an asynchronous method (JMS); see the sketch below.
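For option 2, a minimal JMS 2.0 sketch of handing the report off to a background worker (the queue name reportRequests and the String payload are illustrative; the ConnectionFactory comes from your provider's JNDI setup):

import javax.jms.*;

public class ReportRequests {

    // Enqueue a report request and return immediately; a separate
    // consumer runs the long query and stores or emails the result.
    public static void request(ConnectionFactory factory, String userId) {
        try (JMSContext ctx = factory.createContext()) {
            Queue queue = ctx.createQueue("reportRequests");
            ctx.createProducer().send(queue, userId);
        }
    }
}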

How to update user on effective date?

We are currently developing a user-based web application in Java.
A user can have many properties like first name, last name, region (e.g. Asia, Europe, US, UK), country, division, branch, product, etc.
There are normal CRUD screens for this user with all the above fields.
The add/edit screen for the user will also have a date field called effective date. What makes this add/edit different from a normal CRUD edit is that an update made to the user will not take effect until the effective date.
Let's say today is 6 April and I add a new user with region Asia and effective date 10 April. Now I go and change the same user's region from Asia to US, but with an effective date of 15 May.
So until 15 May the system should show his region as Asia only, and on 15 May his region should change to US automatically.
You can think of it as a user who is working in Asia as of April but who is moving to work in the US from 15 May, i.e. his region changes from Asia to US with an effective date of 15 May. So until 15 May he should be shown as a user in Asia only.
So I cannot just update his region from Asia to US in the database as a simple update operation.
And this applies to a lot of other fields like division, branch and product as well.
I cannot have multiple records for the same user.
EDIT 1:
We should also be able to see the user's history and future updates.
For example, on 6 April I should be able to see that the user's region will change from 15 May and that his division will change from A to B starting 10 May.
I should also be able to delete updates: say I later come to know that the proposed transfer of the user from Asia to US effective 15 May is not going to happen any more, then I should be able to delete that update.
EDIT 2:
Given the above situations, how do I make changes in my original user table and the user change log table?
Say a user in the system has region Asia, which is going to change from Asia to US in the next few weeks. An admin uses the same update screen for the user, changes the region from Asia to US, and chooses some future date as the effective date.
Now how do I check whether the region has changed from Asia to US (and there can be many more fields like region)? Shall I do it at the code level, or is it possible to do it at the database level using triggers etc.?
Please help me out with designing the system and database for this.
I suggest you implement this system by maintaining a CHANGELOG table and a scheduler which runs at a specific time every day.
CHANGELOG table
Attributes :
targetTable
targetPrimaryKey
targetColumn
targetValue
effectiveDate
Now, whenever an update is made to one of the required fields, insert a corresponding entry in the CHANGELOG table.
For example:
targetTable : User
targetPrimaryKey : 3 (primary row id of user record)
targetColumn : Region
targetValue : US
effectiveDate : 15 May 2012
Now, a scheduler will run every day (say at 12:00 AM), check the CHANGELOG table for scheduled changes due that day, and make those changes in the respective target tables; a sketch of such a job follows below.
Note: this approach has two advantages:
You don't have to maintain two entries for the same record.
You can target multiple fields (region, division, branch) and multiple tables as well.
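A hedged sketch of the daily job that applies due CHANGELOG rows (the id and applied columns are assumptions added so a row is only applied once; since table/column identifiers cannot be bound as ? parameters, real code should validate them against a whitelist):

import java.sql.*;

public class ChangelogApplier {

    // Applies every change whose effectiveDate is today, then flags it.
    public static void applyDue(Connection con) throws SQLException {
        String pick = "SELECT id, targetTable, targetColumn, targetPrimaryKey, targetValue "
                    + "FROM changelog WHERE effectiveDate = CURRENT_DATE AND applied = 0";
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(pick)) {
            while (rs.next()) {
                // identifiers are concatenated here (whitelist them in real code)
                String update = "UPDATE " + rs.getString("targetTable")
                              + " SET " + rs.getString("targetColumn") + " = ?"
                              + " WHERE id = ?";
                try (PreparedStatement ps = con.prepareStatement(update)) {
                    ps.setString(1, rs.getString("targetValue"));
                    ps.setLong(2, rs.getLong("targetPrimaryKey"));
                    ps.executeUpdate();
                }
                try (PreparedStatement done = con.prepareStatement(
                        "UPDATE changelog SET applied = 1 WHERE id = ?")) {
                    done.setLong(1, rs.getLong("id"));
                    done.executeUpdate();
                }
            }
        }
    }
}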
You can create a mapping table such as User_Region containing User_ID, Region_ID and Change_Date. Change_Date gives the date when the switch in region takes place for the user. A null Change_Date could imply that the user is currently in that region.
Then you can have a list of regions for a User_ID along with the dates, which can be displayed according to your convenience.
CREATE TABLE User_Region (
    User_ID INT,
    Region_ID INT,
    Change_Date DATE,
    FOREIGN KEY (User_ID) REFERENCES User(ID),
    FOREIGN KEY (Region_ID) REFERENCES Region(ID)
);
A historical table would do the trick. That means having as many history records as needed, each with a date column, and treating the one closest to the current date as the current one. Otherwise, you have to keep the update history in a separate table and have a batch update process apply the changes. I would recommend overcoming the restriction on multiple records for the same user by normalizing the user entity: a users table with the key columns (e.g. id), and a joined table containing the historical updates together with the rest of the columns.
You have two possibilities here:
You have a different table that looks like your user table, with an additional column validFrom, and a cron (Quartz) job that applies the updates every morning; see the Quartz sketch below.
You add the column to your regular user table together with a validTo column and change all your queries to reflect that.
As you already said you cannot have multiple records, option 2 would be a bad solution. Stick with option 1.
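For the scheduling half of option 1, a small Quartz 2.x sketch (job and trigger names are illustrative; the job body would call whatever applies the due changes, e.g. the CHANGELOG job above):

import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class EffectiveDateScheduler {

    public static class ApplyChangesJob implements Job {
        @Override
        public void execute(JobExecutionContext ctx) {
            // look up pending changes due today and apply them here
        }
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(ApplyChangesJob.class)
                .withIdentity("applyEffectiveDateChanges").build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(0, 5))
                .build();                       // fires at 00:05 every day
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}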
A "pure" data model would be to include a date in the key, and have multiple editions, but there are some issues with changing future events.
I think I would try to model future changes as a list of "operations"; changing a field in the future is, in effect, scheduling an operation that will modify the field on that date.
The application periodically applies the operations as they become due.
In the GUI, it's trivial to show a list of pending operations, and it's easy to cancel or modify future operations.
You may set up rules to limit or issue warnings for conflicting operations, but if the list is clearly visible, that is less of a problem.
In my opinion you should use the EVENT statement. Here are the two events:
CREATE EVENT myevent_change
ON SCHEDULE AT DATE_CHANGING
DO
UPDATE myschema.mytable SET mycol = mycol + 1;

CREATE EVENT myevent_revert
ON SCHEDULE AT DATE_BACKUP
DO
UPDATE myschema.mytable SET mycol = mycol - 1;
I can think of two approaches:
Keep all versions in main table
| user_name (PK) | effective_date (PK) | ...remaining columns
In this approach every user is represented by several rows in the User table, but only a single row is current. In order to find the current data you need some extra querying:
SELECT *
FROM User
WHERE user_name = ?
AND effective_date <= NOW()
ORDER BY effective_date DESC
LIMIT 1
So we fetch the versions of the user with the given user_name (there are multiple versions of each user with different effective_date values, which is why effective_date must also be part of the primary key) and limit the result to the most recent version whose effective_date is not in the future.
This approach has several drawbacks:
what to do with foreign keys? Logically there is only one user in the system
poor performance and more complicated queries
what to do with outdated versions?
Pending versions table
Keep the original (main) User table schema without any changes and have a second table called Pending_user_changes. The latter will have the same schema plus an effective_date column. If the granularity of changes is one day, write a job (in the database or in the application) that looks for any pending changes that take effect as of today and applies them to the main table.
I find this solution much cleaner: the primary/foreign keys in the main table never change, and the main table is not cluttered with old and duplicated data.
