Address Book Search - Which data structure should I use for large data?

I want to design an address book with the following fields:
UID Name PhoneNumber1 PhoneNumber2
UID identifies the name uniquely. Let's say I want to save 2 million records.
I want to decide how to store these records so that they are searchable by both Name and PhoneNumber.
Which data structure and search technique should I use?
Thanks in advance

What if you have conflicting names?
A search for John Smith could return multiple records.
It appears that you are better off just using PhoneNumber1/PhoneNumber2 as your search keys.
I'd recommend a hash table for this, as it allows O(1) average-time lookups, and with 2 million records you don't want it to take forever to find someone.
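A minimal sketch of the hash-based lookup described above, assuming everything fits in memory. It uses java.util.HashMap as the hash table; the class AddressBookIndex, its Contact type and the method names are hypothetical and chosen only for illustration. Names get a list per key, so conflicting names such as John Smith are all kept.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory address book; all names here are illustrative.
class AddressBookIndex {
    static final class Contact {
        final int uid;
        final String name;
        final String phone1;
        final String phone2; // may be null

        Contact(int uid, String name, String phone1, String phone2) {
            this.uid = uid;
            this.name = name;
            this.phone1 = phone1;
            this.phone2 = phone2;
        }
    }

    // Assumption: a phone number belongs to at most one contact.
    private final Map<String, Contact> byPhone = new HashMap<>();
    // Names may collide, so each name maps to a list of contacts.
    private final Map<String, List<Contact>> byName = new HashMap<>();

    void add(Contact c) {
        byPhone.put(c.phone1, c);
        if (c.phone2 != null) {
            byPhone.put(c.phone2, c);
        }
        byName.computeIfAbsent(c.name, k -> new ArrayList<>()).add(c);
    }

    // Average O(1) lookup; returns null when the number is unknown.
    Contact findByPhone(String phone) {
        return byPhone.get(phone);
    }

    // Returns every contact with exactly this name (possibly an empty list).
    List<Contact> findByName(String name) {
        return byName.getOrDefault(name, Collections.emptyList());
    }
}
```

Both maps only support exact matches; partial matches on a name or number would need a different structure.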

Normalise that to the following tables and columns:
Names: UID, Name
PhoneNumbers: UID, SN, PhoneNumber
SN is a serial number, so 1 or 2 (and in the future, 3 to 1000 as well).
Each search you do should run two queries, one for each table (or one UNION query on both tables):
SELECT UID, Name
FROM Names
WHERE Name LIKE '%<search string>%';
SELECT UID, PhoneNumber
FROM PhoneNumbers
WHERE PhoneNumber LIKE '%<search string>%'
ORDER BY UID; -- so that multiple matches with same user appear together
Combining the results of both queries can be done in Java.
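As a rough sketch of that combining step, assuming the rows of each query have already been read into lists of {UID, matched value} pairs (for example from a JDBC ResultSet). The class SearchResultMerger and the row layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; the row shape mirrors the two SELECT statements.
class SearchResultMerger {
    // Each row is {uid (Integer), matched value (String)}.
    static Map<Integer, List<String>> merge(List<Object[]> nameRows, List<Object[]> phoneRows) {
        // TreeMap keeps UIDs sorted, so all matches for one user stay together.
        Map<Integer, List<String>> byUid = new TreeMap<>();
        for (Object[] row : nameRows) {
            byUid.computeIfAbsent((Integer) row[0], k -> new ArrayList<>())
                 .add("name: " + row[1]);
        }
        for (Object[] row : phoneRows) {
            byUid.computeIfAbsent((Integer) row[0], k -> new ArrayList<>())
                 .add("phone: " + row[1]);
        }
        return byUid;
    }
}
```

A user who matches on both the name and a phone number then appears once, with both matches listed under the same UID.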

Why don't you design a class AddressBook?
class AddressBook {
    private Integer uuid;
    private String name;
    // Phone numbers are kept as strings: they can exceed the int range and may start with 0 or +.
    private String phoneNumber1;
    private String phoneNumber2;
    // getters & setters
}
Create an AddressBook table in your database with the corresponding fields. uuid will be the primary key. Persist the AddressBook object.
To search by name:
select * from AddressBook where name = 'something';
To search by phone number:
select * from AddressBook where phoneNumber1 = 'something';

That depends on what your main targets are:
If requirements development is done and you have decided to use a relational data model for data storage and retrieval, then @aneroid's answer is an option.
Keep in mind that:
A wildcard match such as WHERE Name LIKE '%<search string>%' puts a considerable cost on the RDBMS engine, since a leading wildcard generally prevents the use of an ordinary index. You may want to look into the full-text search features of your RDBMS for large-scale data.
If performance is the main target, an in-memory relational database is an option.
If the RDBMS can be skipped, then the standard Java data structures will come in handy; see how they compare in terms of time complexity.
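One possible illustration of the plain Java route is a sorted map, which supports prefix search on names in O(log n) plus the number of matches. This is only a sketch: the class NamePrefixIndex is hypothetical, and it assumes no name contains the character '\uffff', which is used as the upper bound of the range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical prefix index over names; not a full-text search.
class NamePrefixIndex {
    // Sorted map from name to the UIDs that carry that name.
    private final TreeMap<String, List<Integer>> byName = new TreeMap<>();

    void add(int uid, String name) {
        byName.computeIfAbsent(name, k -> new ArrayList<>()).add(uid);
    }

    // All UIDs whose name starts with the prefix, in name order.
    List<Integer> findByPrefix(String prefix) {
        // Every key starting with prefix lies in [prefix, prefix + '\uffff').
        SortedMap<String, List<Integer>> range =
                byName.subMap(prefix, prefix + Character.MAX_VALUE);
        List<Integer> uids = new ArrayList<>();
        for (List<Integer> group : range.values()) {
            uids.addAll(group);
        }
        return uids;
    }
}
```

A substring match in the middle of a name is not covered by this structure; that is where a full-text index or a suffix-based structure would come in.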

Related

How to select items in date range in DynamoDB

How can I select all items within a given date range?
SELECT * FROM GameScores WHERE createdAt >= start_date AND createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    String stringStart = df.format(start);
    String stringEnd = df.format(end);
    ScanSpec scanSpec = new ScanSpec();
    scanSpec.withFilterExpression("CreatedAt BETWEEN :from AND :to")
        .withValueMap(
            new ValueMap()
                .withString(":from", stringStart)
                .withString(":to", stringEnd));
    ItemCollection<ScanOutcome> items = null;
    items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work; I'm getting fewer results than expected.

I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TL;DR
Set up your table so that you can retrieve data with queries, rather than scans and filters, and then create indexes to support lesser-used access patterns and improve querying flexibility. Because of how fast reads are when providing the full (or, although not as fast, partial) primary key, i.e. using queries, DynamoDB is optimal when table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan is a great way to retrieve all items in your table, and unless you have all your items within one partition, it is the only way to retrieve all the items in your table. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single scan request can retrieve at most 1 MB of data, reads one partition at a time, and is capped at that partition's read capacity. When a scan hits either limit, follow-up scans happen sequentially, meaning a scan on a large table could take multiple round trips.
On top of that, with scans you consume read capacity based on the size of the items read, no matter how much (or little) data is returned. If you request only a small number of attributes in your ProjectionExpression, and your FilterExpression eliminates 90% of the items in your table, you still pay to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
Let's now look at: How can I select all items, based on some criteria?
The ideal way to retrieve data based on some criteria (in your case SELECT * FROM GameScores WHERE createdAt >= start_date AND createdAt <= end_date) would be to query the base table (or an index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
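To make the partition key plus sort key idea concrete, here is a small in-memory model in plain Java. It is not the DynamoDB API: the class PartitionedScores is hypothetical, and createdAt is assumed to be stored as an ISO-8601 UTC string, so that string order matches time order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model only; this does not call DynamoDB.
class PartitionedScores {
    // partition key (GameName) -> sort key (createdAt) -> score
    private final Map<String, TreeMap<String, Integer>> partitions = new HashMap<>();

    void put(String gameName, String createdAt, int score) {
        partitions.computeIfAbsent(gameName, k -> new TreeMap<>()).put(createdAt, score);
    }

    // Models: GameName = :name AND createdAt BETWEEN :from AND :to (both ends inclusive).
    List<Integer> query(String gameName, String from, String to) {
        TreeMap<String, Integer> partition = partitions.get(gameName);
        if (partition == null) {
            return Collections.emptyList();
        }
        // Only the requested partition is touched, and only the matching key range is read.
        return new ArrayList<>(partition.subMap(from, true, to, true).values());
    }
}
```

The model shows why such a query never looks at other GameName partitions, and why the range condition only works on the attribute the items are sorted by.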
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table's structure does not fit some of your application's access patterns, you might need to. However, in DynamoDB, denormalizing data can also support more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strong read consistency, so if you need that, you'll need to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to create an LSI. Another difference between the two is the primary key: a GSI can have a different partition and sort key than the base table, while an LSI can only differ in the sort key. More about indexes.

DynamoDB query on two ranges and then sort

Let’s say I have an album table where the partition key is author and the sort key is album. Each item also has price, startDate and endDate attributes. Let’s say I want to find all the albums where “author=a”, “album=b”, “startDate < c”, “endDate > d” and “price is between e and f”, sorted by price. Is the most efficient way to query on the partition key and sort key, then filter the results on conditions c, d, e and f, and then sort by price? Can a secondary index help here? (It seems one secondary index can only be used to query on one or two non-key attributes, but my use case requires < and > operations on multiple non-key attributes and then sorting.)
Thanks!

I am working through a similar schema design process.
The short answer is it will depend on exactly how much data you have that falls into the various categories, as well as on the exact QUERIES you hope to run against that data.
The main thing to remember is that you can only ever QUERY based on your Sort Key (where you know the Partition Key) but you ALSO have to maintain uniqueness in order to not overwrite needed data.
A good way to visualize this in your case would be as follows:
Each Artist is Unique (Artist seems to me like a good Partition Key)
Each Artist can have multiple Albums, making Album a good Sort Key (in cases where you will search for an Album for a known Artist)
In the above case your Sort Key is combined with your Partition Key to form the composite primary key, per the following answer (which is worth a read!), which allows you to write a query where you know the artist but only the START of the title. A key condition on the sort key supports begins_with but not contains; a contains match has to go into a filter expression.
I.e. here artist = "Pink Floyd" QUERY where album begins_with "Dark Side"
That would match "Pink Floyd" Dark Side of the Moon.
That being said you would only ever have one "Price" for Pink Floyd - Dark Side of the Moon since the Partition Key and Sort Key combine to handle uniqueness. You would overwrite the existing object when you updated the entry with your second price.
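The overwrite behaviour can be illustrated with a plain Java map keyed on the combined partition and sort key. This is only a model of the uniqueness rule, not DynamoDB code; the class CompositeKeyDemo and the "|" separator are arbitrary choices for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Models primary-key uniqueness: one stored value per (artist, album) pair.
class CompositeKeyDemo {
    private final Map<String, Double> priceByKey = new HashMap<>();

    // Returns the price previously stored under the same key, or null if there was none.
    Double putPrice(String artist, String album, double price) {
        return priceByKey.put(artist + "|" + album, price);
    }

    Double getPrice(String artist, String album) {
        return priceByKey.get(artist + "|" + album);
    }

    int size() {
        return priceByKey.size();
    }
}
```

Writing a second price for the same artist and album replaces the first one, so the map still holds a single entry afterwards.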
So the real question is, what are the best Sort Keys for my use case?
To answer that you need to know what your most frequent QUERIES will be before you build the system.
Price Based Queries?
In your question you mentioned Price attributes in a case where you appear to know the artist and album.
“author=a”, “album=b”, “startDate < c”, “endDate > d” and “price is between e and f”, sorted by price
To me in this case you probably DO NOT know the Artist, or if you do you probably do not know the Album, since you are probably looking to write a Query that returns albums from multiple artists or at least multiple Albums from the same artist.
HOWEVER
That may not be the case if you are creating a database that contains multiple entries (say from multiple vendors selling the same artist / album at different prices). In that case I would say the easiest way is to store only ONE entry for an Artist-Album (partition key) at a given price (sort key), but you would lose all other entries that match that same price for the Artist-Album.
Multiple Queries MAY require Multiple Tables
I had a similar use case and ended up needing to create multiple tables in order to handle my queries. Data is passed / processed from one table and spit out into another one using a Lambda that is triggered on insertion. I then send some queries to one table and some other queries to the initial table.
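For the filter-then-sort step raised in the question, a client-side version could look roughly like the sketch below. It assumes the items for one author have already been fetched, and that startDate and endDate are ISO date strings (yyyy-MM-dd) so that string comparison matches date order. The class AlbumFilter and its Album type are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical post-processing of items returned for a single author.
class AlbumFilter {
    static final class Album {
        final String title;
        final String startDate; // assumed ISO format, e.g. "2020-01-31"
        final String endDate;   // assumed ISO format
        final double price;

        Album(String title, String startDate, String endDate, double price) {
            this.title = title;
            this.startDate = startDate;
            this.endDate = endDate;
            this.price = price;
        }
    }

    // Keeps albums with startDate < c, endDate > d and e <= price <= f, sorted by price.
    static List<Album> filterAndSort(List<Album> items, String c, String d, double e, double f) {
        return items.stream()
                .filter(a -> a.startDate.compareTo(c) < 0)
                .filter(a -> a.endDate.compareTo(d) > 0)
                .filter(a -> a.price >= e && a.price <= f)
                .sorted(Comparator.comparingDouble((Album a) -> a.price))
                .collect(Collectors.toList());
    }
}
```

Whether this is acceptable depends on how many items a single author partition holds, since every item in the partition is read before the filter applies.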

JPA CriteriaBuilder like on double

I'm trying to retrieve data out of a legacy database.
The column in the table is defined as a DECIMAL(13,0) containing account numbers.
The column data type cannot be changed as it will have a major impact on the legacy system. Essentially all programs using the table need to be changed and then recompiled which is not an option.
We have a requirement to find all records where the account number contains a value, for example the user could search for 12345 and all accounts with an account number containing 12345 should be returned.
If this was a CHAR/VARCHAR, I would use:
criteriaBuilder.like(root.<String>get(Record_.accountNumber), searchTerm)
Because the column is defined as DECIMAL(13,0), the accountNumber property is a double.
Is there a way to perform a like on a DECIMAL/double field?
The SQL would be
SELECT * FROM ACCOUNTS WHERE accountNumber LIKE '%12345%'

I have not actually tried this, but I believe it should work
criteriaBuilder.like(
root.get(Record_.accountNumber).as(String.class),
searchTerm)
This should generate a query kind of like this:
SELECT * FROM ACCOUNTS WHERE CAST(accountNumber AS text) LIKE '%12345%'
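Note that the pattern passed to like needs the % wildcards itself to produce a contains match. A small helper for building such a pattern might look like the sketch below; the class LikePatterns and the choice of backslash as the escape character are assumptions for illustration, not part of JPA.

```java
// Hypothetical helper for building a "contains" LIKE pattern.
class LikePatterns {
    // Escape character to pass alongside the pattern; backslash is an arbitrary choice.
    static final char ESCAPE = '\\';

    // Wraps the term in % wildcards and escapes any %, _ or escape characters inside it.
    static String contains(String term) {
        StringBuilder sb = new StringBuilder("%");
        for (char ch : term.toCharArray()) {
            if (ch == '%' || ch == '_' || ch == ESCAPE) {
                sb.append(ESCAPE);
            }
            sb.append(ch);
        }
        return sb.append('%').toString();
    }
}
```

The result could then be passed to the CriteriaBuilder.like overload that takes an escape character, for example criteriaBuilder.like(expression, LikePatterns.contains(searchTerm), LikePatterns.ESCAPE).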

Deduplication with scoring framework/application/server on Java to work with database input staging

Please suggest a Java product (I would prefer open source) which provides:
data deduplication
deduplication scoring
allows to customize deduplication rules and scoring rules.
Please see the example:
I have an input staging database named "INPUT_DB"
I have a table named "INPUT_PERSONS"
There are several fields in this table:
ID (some meaningless surrogate primary key)
FIRST_NAME
LAST_NAME
SECOND_NAME
BIRTH_DATE
PASSPORT_SERIES (PASSPORT_SERIES + PASSPORT_NUM is a unique identifier of a citizen)
PASSPORT_NUM
I have to look through all records in INPUT_PERSONS and find duplicates and matches.
Several rules should be created:
If PASSPORT_SERIES+PASSPORT_NUM equals that of some other record, the two records are duplicates. The score for this situation is 100 out of 100.
If FIRST_NAME and LAST_NAME are equal, but PASSPORT_SERIES+PASSPORT_NUM differs by one character (a misprint, for example), then these records are possible duplicates and their score is 90 out of 100.
And so on....
Is it possible to find some ready-made solution and use it as a base?
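To make the two example rules concrete, they could be written in plain Java roughly as follows. This is only a sketch of the rules as stated, not taken from any existing product: the class DuplicateScorer is hypothetical, "one different character" is interpreted as a single substituted character at the same position, and every other case scores 0 as a placeholder.

```java
// Hypothetical scorer for the two rules listed above; names are illustrative.
class DuplicateScorer {
    static final class Person {
        final String firstName;
        final String lastName;
        final String passportSeries;
        final String passportNum;

        Person(String firstName, String lastName, String passportSeries, String passportNum) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.passportSeries = passportSeries;
            this.passportNum = passportNum;
        }
    }

    // Returns 100 for identical passports, 90 for same names with a one-character
    // passport difference, and 0 otherwise (placeholder for further rules).
    static int score(Person a, Person b) {
        String pa = a.passportSeries + a.passportNum;
        String pb = b.passportSeries + b.passportNum;
        if (pa.equals(pb)) {
            return 100;
        }
        boolean sameName = a.firstName.equals(b.firstName) && a.lastName.equals(b.lastName);
        if (sameName && differsByOneChar(pa, pb)) {
            return 90;
        }
        return 0;
    }

    // True when both strings have equal length and exactly one differing position.
    static boolean differsByOneChar(String x, String y) {
        if (x.length() != y.length()) {
            return false;
        }
        int diffs = 0;
        for (int i = 0; i < x.length(); i++) {
            if (x.charAt(i) != y.charAt(i)) {
                diffs++;
            }
        }
        return diffs == 1;
    }
}
```

Comparing every pair of records this way is quadratic in the number of rows, so a real solution would typically group candidate records first before scoring them.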

I've done this in the past and based it on the Fellegi-Sunter algorithm. See this question: Is there a open source implementation for Fellegi-Sunter?

The Duke project may meet your requirements: https://github.com/larsga/Duke

How do you store a collection of Strings in SQLite on Android?

I'm quite inexperienced with using databases in applications, so I need a bit of guidance.
I have a Java object with several primitive fields, and several references to Strings and ArrayList objects. The primitives and Strings map nicely to available SQLite fields, but I'm not sure how I can persist the ArrayLists.
I was entertaining two ideas, one of which is to serialise the ArrayLists and store them in a Text field, the other is to have a column which points to a table with arity 1, in which I can store the individual strings, but I'm unsure of how to implement this in android. I'm open to different approaches, but I wouldn't know how to implement the latter in java using SQLite, so a solution would be lovely. Thanks.

Without knowing further details, I can say that the textbook way to do this is to create a second table for the array list, and then include the id of the primary record in each row of that table.
For example, suppose your object consists of
String name;
int age;
ArrayList<String> hobbies;
You would create tables like this:
create table person (personid int, name varchar(30), age int);
create table hobby (hobbyid int, personid int, description varchar(30));
Then your data might look like this:
Person
11 Bob 18
12 Sally 68
13 Ford 42
Hobby
21 11 fishing
22 11 hunting
23 12 needlepoint
24 12 rock-climbing
25 13 hitch-hiking
To get the list of anyone's hobbies, you'd use a query like:
select person.name, hobby.description
from person
join hobby on person.personid=hobby.personid
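On the Java side, the mapping between the ArrayList and the rows of the hobby table could be sketched as below. The code deliberately leaves out the SQLite calls and only shows the shape of the data; the class HobbyRows and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mapping between a person's hobby list and rows of the hobby table.
class HobbyRows {
    static final class HobbyRow {
        final int personId;
        final String description;

        HobbyRow(int personId, String description) {
            this.personId = personId;
            this.description = description;
        }
    }

    // One row per list element, each carrying the parent's id (what you would INSERT).
    static List<HobbyRow> toRows(int personId, List<String> hobbies) {
        List<HobbyRow> rows = new ArrayList<>();
        for (String hobby : hobbies) {
            rows.add(new HobbyRow(personId, hobby));
        }
        return rows;
    }

    // Rebuilds the per-person lists from rows (what you would do after a SELECT).
    static Map<Integer, List<String>> toLists(List<HobbyRow> rows) {
        Map<Integer, List<String>> byPerson = new LinkedHashMap<>();
        for (HobbyRow row : rows) {
            byPerson.computeIfAbsent(row.personId, k -> new ArrayList<>()).add(row.description);
        }
        return byPerson;
    }
}
```

The hobbyid column is not modelled here, on the assumption that the database generates it on insert.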

Generally (from what I have learned), if you have an object which itself contains a list of other objects, that is a one-to-many (or potentially many-to-many) relationship. To store this data you would want to use another table. In the other table, you will have a primary key for the child object, and then a foreign key referencing the parent object to which it belongs. See this link for a better explanation.
Example:
CREATE TABLE User (
    _id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    name TEXT
);
CREATE TABLE UserPicture (
    _id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    userId INTEGER,
    path TEXT,
    FOREIGN KEY(userId) REFERENCES User(_id)
);
Now say you have a User object with a List of UserPictures. When you save to the database you will want to iterate over each picture and insert each one into the UserPicture table, using the userId as the link back to the User table.
In the case of a many-to-many relationship, each object would have a List of its child objects. A better example of this would be a Membership/Role system. A User would have a List of Roles, and a Role would have a List of Users, since a user can (normally) be in multiple roles, and a role can of course have multiple users. This simply requires what I think is called a join table: UserInRole would have two columns, UserID and RoleID, to show that User X belongs to Role Y.
As far as how to implement it, search around for 'Android sqlite tutorial'. Here and here are two links with tutorials on how to set up a SQLite Android database app.

You can use this library.
Use the annotation '@SaveList' on the List of Strings.
//To save
EscapeSQLBoiler.getEscapeSQLBoiler(this).saveMyList("UniqueKey", strings);
//to get
strings = EscapeSQLBoiler.getEscapeSQLBoiler(this).giveMyListSavedInKey(UniqueKey);
Library path
https://dl.bintray.com/bipinayetra/maven/com/bipinayetra/save-processor/
https://dl.bintray.com/bipinayetra/maven/com/bipinayetra/save-annotation/
Download
dependencies {
    annotationProcessor 'com.bipinayetra:save-processor:1.0.0'
    implementation 'com.bipinayetra:save-annotation:1.0.0'
}
To use Save List library, add the plugin to your buildscript:
allprojects {
    repositories {
        maven {
            url "https://dl.bintray.com/bipinayetra/maven"
        }
    }
}
You can also view this information on Blog
https://www.bipinayetra.com/products/2018/12/25/save-yourself-from-saving.
