I have millions of customer names stored in my database.
I have to cluster the names that are phonetically similar to each other.
One approach I am using is to match each name against a selection of similar names fetched from the database based on Soundex, Metaphone, initials, etc.,
but it is very slow.
Now I am thinking about generating a unique ID for each name and clustering similar IDs,
but I am not able to generate such IDs.
The names are Indian names written in the English alphabet.
Is there any algorithm for clustering similar names?
Please help.
The key problem here is "phonetically similar". You'd need to know how to generate a unique ID from phonemes.
You don't say which language and alphabet these names are stored in.
Maybe the problem has more in common with speech synthesis algorithms:
http://social.msdn.microsoft.com/Forums/da/netfxbcl/thread/b6b88747-9616-462e-9cf6-78c19da32f38
Or this one for Java:
http://voce.sourceforge.net/
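One way to avoid comparing every name against every other name is to compute a phonetic key once per name and group names that share a key. The sketch below uses a simplified Soundex-style key; the class and method names (`PhoneticBlocking`, `key`, `cluster`) are made up for illustration, and the coding is not tuned for Indian names, so treat it as a starting point rather than a finished algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class PhoneticBlocking {
    // Soundex digit for each letter A..Z; '0' marks letters that are skipped (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    /** Simplified Soundex-style key: first letter plus three digits (an assumption, not full Soundex). */
    static String key(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') continue;   // ignore spaces, dots, digits
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw);                    // keep the first letter as-is
            } else if (code != '0' && code != last) {
                out.append(code);                   // skip vowels and repeated codes
            }
            last = code;
            if (out.length() == 4) break;
        }
        while (out.length() > 0 && out.length() < 4) out.append('0');
        return out.toString();
    }

    /** Groups names by key in a single pass, so each name is processed once. */
    static Map<String, List<String>> cluster(List<String> names) {
        Map<String, List<String>> clusters = new TreeMap<>();
        for (String name : names) {
            clusters.computeIfAbsent(key(name), k -> new ArrayList<>()).add(name);
        }
        return clusters;
    }
}
```

The key can also be stored in an indexed database column, so that finer comparisons (edit distance, for example) only need to run inside each group.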
I'd like to use the Java CWS web service to perform searches against Metadata stored in categories.
When I execute the getFieldInfo method the searchable fields in my categories are not listed. I'm trying to figure out the structure of the query expression for searching this info.
Any help greatly appreciated.
In case anyone comes across this, the attributes for categories are not returned by the getFieldInfo method. They have special names in the format of:
attr_xxxxx_yyy, where xxxxx is the id of the category and yyy is a sequential number, incremented for each attribute in the category.
I want to design an address book with the following fields:
UID Name PhoneNumber1 PhoneNumber2
UID identifies each entry uniquely. Let's say I want to save 2 million records.
Now I want to decide how to store these records so that they are searchable by both name and phone number.
Which data structure and search technique should I use?
Thanks in advance.
What if you have conflicting names?
John Smith could return multiple times.
It appears that you are better off just using PhoneNumber1/PhoneNumber2 as your search variables.
I'd recommend a hash table for this, as it gives O(1) average-case lookups, and with 2 million records you don't want it to take forever to find someone.
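As a rough illustration of the hash table idea, the sketch below keeps two in-memory indexes: one from phone number to contact and one from name to a list of contacts, since names such as John Smith can repeat. The class and method names are hypothetical, and it assumes each phone number belongs to a single contact and that only exact matches are needed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class AddressIndex {
    static final class Contact {
        final int uid;
        final String name;
        final String phone1;
        final String phone2;   // may be null

        Contact(int uid, String name, String phone1, String phone2) {
            this.uid = uid;
            this.name = name;
            this.phone1 = phone1;
            this.phone2 = phone2;
        }
    }

    // Names are not unique, so each name maps to a list of contacts.
    private final Map<String, List<Contact>> byName = new HashMap<>();
    // Assumption: a phone number belongs to exactly one contact.
    private final Map<String, Contact> byPhone = new HashMap<>();

    void add(Contact c) {
        byName.computeIfAbsent(normalize(c.name), k -> new ArrayList<>()).add(c);
        if (c.phone1 != null) byPhone.put(c.phone1, c);
        if (c.phone2 != null) byPhone.put(c.phone2, c);
    }

    /** Exact, case-insensitive name lookup; returns an empty list if nothing matches. */
    List<Contact> findByName(String name) {
        return byName.getOrDefault(normalize(name), Collections.emptyList());
    }

    /** Exact phone lookup; returns null if the number is unknown. */
    Contact findByPhone(String phone) {
        return byPhone.get(phone);
    }

    private static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }
}
```

Both lookups are average-case O(1), but substring or prefix searches would need a different structure.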
Normalise that to the following tables and columns:
Names: UID, Name
PhoneNumbers: UID, SN, PhoneNumber
SN is a serial number, so 1 or 2 (and in the future, 3 to 1000 as well).
Each search you do should run two queries, one for each table (or one UNION query on both tables):
SELECT UID, Name
FROM Names
WHERE Name LIKE '%<search string>%'
SELECT UID, PhoneNumber
FROM PhoneNumbers
WHERE PhoneNumber LIKE '%<search string>%'
ORDER BY UID -- so that multiple matches for the same user appear together
Combining the results of both queries can be done in Java.
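A minimal sketch of that combining step, assuming the rows of the two result sets have already been read into simple (UID, value) pairs. The `Row`, `Match` and `merge` names are hypothetical, and the database access itself is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MatchMerger {
    /** One row from either query: the UID plus the matched name or phone number. */
    static final class Row {
        final int uid;
        final String value;

        Row(int uid, String value) {
            this.uid = uid;
            this.value = value;
        }
    }

    /** Everything that matched for one UID. */
    static final class Match {
        String name;                                    // null if only a phone number matched
        final List<String> phones = new ArrayList<>();  // matching phone numbers, possibly empty
    }

    /** Merges both result sets into one entry per UID, ordered by UID. */
    static Map<Integer, Match> merge(List<Row> nameRows, List<Row> phoneRows) {
        Map<Integer, Match> byUid = new TreeMap<>();
        for (Row r : nameRows) {
            byUid.computeIfAbsent(r.uid, k -> new Match()).name = r.value;
        }
        for (Row r : phoneRows) {
            byUid.computeIfAbsent(r.uid, k -> new Match()).phones.add(r.value);
        }
        return byUid;
    }
}
```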
Why don't you design an AddressBook class?
class AddressBook {
    private Integer uuid;
    private String name;
    // Phone numbers are stored as strings: they can exceed the Integer range and may start with 0 or +.
    private String phoneNumber1;
    private String phoneNumber2;
    // getters & setters
}
Create an AddressBook table in your database with the corresponding fields. uuid will be the primary key. Persist the AddressBook object.
To search by name:
select * from AddressBook where name = 'something';
To search by phone number:
select * from AddressBook where phoneNumber1 = 'something';
That depends on what your main goals are:
If requirements development is done and you have decided to use a relational data model for data storage and retrieval, then @aneroid's answer is an option.
Keep in mind that:
A pattern search such as '%<search string>%' with a leading wildcard imposes a considerable cost on the RDBMS engine, since it generally cannot use an ordinary index. For large-scale data, look into the full-text search features of your RDBMS.
If performance is the main goal, an in-memory relational database is an option.
If the RDBMS can be skipped, then Java's built-in data structures
will come in handy; compare them in terms of time complexity.
I am wondering how to develop a dynamic way to generate column prefix patterns. The main idea is to standardize corporate naming patterns when defining column names. For example:
If I have to create a column that holds a date, the prefix will be DT_*column_name*;
If it is a name column, it will be NM_*column_name*;
But if there is no well-defined pattern, you can suggest a name that needs to be approved.
Has anyone ever thought about something like this?
Thank you in advance
**EDIT**
Sorry, I think I didn't explain it well enough. It's not exactly about type prefixes, but about specific business/corporate names. For example (again):
Column customer should be prefixed with CSTM_
Column digit should be prefixed with DIGT_
Column franchising should be prefixed with FRCH_
I don't think this is a good idea.
Consider that your organization may use different databases, such as Oracle and MySQL, with different data types.
The column name should be used to describe the column. You already have the column's type defined.
Another drawback appears if you want your application to be supported by multiple database engines. Are you going to change the schema?
In my database I have a table containing localized cities.
Cities
_id |name_en |name_de |name_it
0 |Rome |Rom |Roma
1 |Munich |München |Monaco
...
Now I want to show a ListView where each row consists of all the names, starting with the name in the user's language. The whole list should also be sorted by the city name in the user's language.
Which is the right design-pattern for this kind of problem?
This is quite a broad question, but here's one general approach:
Get the user's current language ( Get the current language in device )
Query your database with this language code
Bind the returned Cursor to your ListView
Please post your relevant code if you want specific help.
Obviously you need to decide which column to use in the SQL query (both which column to retrieve and which column to sort by). So your column names should be public constants, and you need a method that returns a column name (one of the constants) depending on the current device locale. Use one of the constants as a fallback if the locale does not match any locale the application knows.
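A minimal sketch of that idea, using the column names from the question's Cities table. The class and method names (`CityColumns`, `columnFor`, `query`) are made up for illustration, and English is assumed as the fallback.

```java
import java.util.Locale;

public class CityColumns {
    public static final String NAME_EN = "name_en";
    public static final String NAME_DE = "name_de";
    public static final String NAME_IT = "name_it";

    /** Maps a locale to one of the known columns; English is the assumed fallback. */
    static String columnFor(Locale locale) {
        switch (locale.getLanguage()) {
            case "de": return NAME_DE;
            case "it": return NAME_IT;
            default:   return NAME_EN;
        }
    }

    /** Builds the query; concatenation is safe here because the column is one of our constants. */
    static String query(Locale locale) {
        return "SELECT _id, name_en, name_de, name_it FROM Cities ORDER BY " + columnFor(locale);
    }
}
```

On a device you would pass `Locale.getDefault()` and bind the resulting cursor to the list; putting the user's language first within each row is then a matter of how the row view is laid out.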
Please suggest a Java product (preferably open source) that does the following:
data deduplication
deduplication scoring
customizable deduplication rules and scoring rules.
Please see the example:
I have an input staging database named "INPUT_DB"
I have a table named "INPUT_PERSONS"
There are several fields in this table:
ID (some meaningless surrogate primary key)
FIRST_NAME
LAST_NAME
SECOND_NAME
BIRTH_DATE
PASSPORT_SERIES (PASSPORT_SERIES + PASSPORT_NUM is a unique identifier of a citizen)
PASSPORT_NUM
I have to look through all records in INPUT_PERSONS and find duplicates and matches.
Several rules should be created:
If PASSPORT_SERIES+PASSPORT_NUM equals that of another record, the two records are duplicates. The score in this case is 100 out of 100.
If FIRST_NAME, LAST_NAME are equal, but PASSPORT_SERIES+PASSPORT_NUM has one different character (misprint for example), then these records are possible duplicates and their score is 90 out of 100.
And so on....
Is there a ready-made solution that I could use as a base?
I've done this in the past and based it on the Fellegi-Sunter algorithm. See this question: Is there a open source implementation for Fellegi-Sunter?
The DUKE project may fill your requirement: https://github.com/larsga/Duke
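For orientation, the two rules from the question can be sketched in plain Java as below. This is neither Duke's API nor the Fellegi-Sunter model, just a hand-written illustration with hypothetical names (`DedupScoring`, `Person`, `score`). It assumes "one different character" means a single substituted character at the same position, and that all fields are non-null.

```java
public class DedupScoring {
    static final class Person {
        final String firstName;
        final String lastName;
        final String passportSeries;
        final String passportNum;

        Person(String firstName, String lastName, String passportSeries, String passportNum) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.passportSeries = passportSeries;
            this.passportNum = passportNum;
        }
    }

    /** Returns a score from 0 to 100 for a pair of records, using only the two rules in the question. */
    static int score(Person a, Person b) {
        String pa = a.passportSeries + a.passportNum;
        String pb = b.passportSeries + b.passportNum;
        if (pa.equals(pb)) {
            return 100;                       // rule 1: identical passport identifier
        }
        boolean sameName = a.firstName.equalsIgnoreCase(b.firstName)
                && a.lastName.equalsIgnoreCase(b.lastName);
        if (sameName && differsByOneChar(pa, pb)) {
            return 90;                        // rule 2: same name, one misprinted character
        }
        return 0;                             // no rule matched
    }

    /** True if both strings have the same length and differ in exactly one position. */
    static boolean differsByOneChar(String x, String y) {
        if (x.length() != y.length()) return false;
        int diff = 0;
        for (int i = 0; i < x.length(); i++) {
            if (x.charAt(i) != y.charAt(i)) diff++;
        }
        return diff == 1;
    }
}
```

Comparing every pair of records is quadratic, so on a large INPUT_PERSONS table you would normally score only records that already share some key, such as the last name.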