What is a good way to remove duplicates?

I have a varchar column. It contains values separated by semicolon (;).
For example, it looks like
10;20;21;17;20;21;22;
It's not always 7 elements. It could contain anything from around 30 to 70. They designed it this way because the values are actually genome segments, and it makes sense to enter or retrieve them collectively.
I need to remove records with duplicate columns, so if I see another record with the same value as above, I need to remove it.
I also need to remove a record if its values are all contained in another record. For example, I need to remove
10;;21;17;20;21;22;
because it's the same as the first but it doesn't have the second value, 20. If it's more complete than the first, I will remove the first one instead.
1;2;3;4;5;6;7; and 1;2;3;4;5;6;7;8; are dups and I'm taking the 2nd one because it's more complete. 1;2;3;4;5;6;;7 is also a duplicate. In this case, if they have 13 or more matched numbers and no mismatch, we will merge them so it becomes a single value 1;2;3;4;5;6;7;7;.
I can scan each record in java but I'm afraid that it will be complicated and time consuming, given that the table contains millions of records. I was wondering if it's doable in oracle itself.
My final goal is to calculate the frequency with which those numbers occur. For instance, if number 10 appears 5 out of 100 times, it will be 5%. The calculation will be simple. However, I can't calculate this unless I make sure there are no duplicates in the table in the first place.

Note: This answer is a placeholder because the question looks in danger of closure but I think it will be worthy of an answer once all the rules are established.
It's trivial to remove the exact duplicates:
delete from your_table y
where y.rowid not in ( select min(x.rowid)
from your_table x
group by x.genome_string)
The hard part is identifying duplicate strings that match exactly apart from nulls. Merging rows makes the logic even more convoluted.
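If the comparison ends up being done in Java after all, the rule could be sketched roughly as follows. This is only an illustration under assumptions that the question does not fully confirm: values are compared position by position, an empty slot matches anything, and the class and method names (SegmentCompare, compatible, merge) are made up here. The "13 or more matched numbers" threshold is not implemented.

```java
import java.util.Arrays;
import java.util.List;

class SegmentCompare {
    // Splits "10;;21;" into ["10", "", "21"], keeping empty slots.
    static List<String> parse(String s) {
        String body = s.endsWith(";") ? s.substring(0, s.length() - 1) : s;
        return Arrays.asList(body.split(";", -1));
    }

    // True if the two rows agree at every position where both have a value.
    static boolean compatible(String a, String b) {
        List<String> x = parse(a), y = parse(b);
        int n = Math.min(x.size(), y.size());
        for (int i = 0; i < n; i++) {
            String u = x.get(i), v = y.get(i);
            if (!u.isEmpty() && !v.isEmpty() && !u.equals(v)) {
                return false;
            }
        }
        return true;
    }

    // Merges two compatible rows, taking whichever value is present per position.
    static String merge(String a, String b) {
        List<String> x = parse(a), y = parse(b);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < Math.max(x.size(), y.size()); i++) {
            String u = i < x.size() ? x.get(i) : "";
            String v = i < y.size() ? y.get(i) : "";
            out.append(u.isEmpty() ? v : u).append(';');
        }
        return out.toString();
    }
}
```

Comparing every pair of rows this way is quadratic, so with millions of records the rows would first need to be grouped by something cheap (for example the first few non-empty values) before calling compatible.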

The sql below is a solution ONLY IF:
1;2;3;4;5; is a more complete form of 1;2;;5
All your entries end with ;
The query was tested using SQLite, so it may need some changes for Oracle.
It expects a table "TEST" with a column "VALUE"
SELECT
DISTINCT VALUE
from TEST As ORIGIN_TEST
WHERE NOT EXISTS (SELECT VALUE FROM TEST
WHERE
VALUE <> ORIGIN_TEST.VALUE AND
(VALUE LIKE replace(ORIGIN_TEST.VALUE, ';;', ';_%;') OR
VALUE LIKE ORIGIN_TEST.VALUE || '_%;')
)

Related

How can I generate an item code value with an auto-increment integer plus an identification integer?

I'm creating an item code for my inventory system. I want a numbering scheme of integer values like this, using Java.
for example this
for group 1 the code would be 001 -
0010001,
0010002
for group 2 the code would be 002-
0020003,
0020004
for group 3 the code would be 003-
0030005,
0030006
The items are encoded individually, so when I add a new entry it should detect which group the item belongs to and generate the desired item code. The first 3 digits are the identification value of the group it belongs to, and the next 4 digits are just the incrementing value. The whole code would be stored as one integer in a MySQL database.

You need to decide:
Are the item codes to be represented as: one integer, a pair of integers (group & item), a string ... or something else.
Is the numbering scheme per the first example or the second one. (You seem to have chosen one scheme now ...)
How you are going to populate the items and codes. Do you read the codes? Do you generate them all in one go while loading items from a file. Do you create items and item ids one at a time (e.g. interactively).
How is this information going to be "stored"? In memory only? In a flat file? In a database? (MySQL ... ?)
These decisions will largely dictate how you implement the item id "generation".
Basically, your problem here is that >>you<< need to figure out what the requirements are. Once you have done that, the set of possible solutions will reduce to a manageable size, and you can then either work it out for yourself or ask a sensible question.
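For illustration only, and assuming the scheme described in the question (a 3-digit group followed by a 4-digit running number), the formatting step might look like the sketch below. The class and method names are hypothetical, and where the next running number comes from (for example a MySQL AUTO_INCREMENT column) is deliberately left open.

```java
import java.util.Locale;

class ItemCodes {
    // Formats group 1, sequence 1 as "0010001": 3-digit group, then 4-digit counter.
    static String format(int group, int sequence) {
        if (group < 0 || group > 999 || sequence < 0 || sequence > 9999) {
            throw new IllegalArgumentException("group or sequence out of range");
        }
        return String.format(Locale.ROOT, "%03d%04d", group, sequence);
    }

    // The same code as a single integer; note that leading zeros are lost.
    static int asInt(int group, int sequence) {
        return Integer.parseInt(format(group, sequence));
    }
}
```

Storing the code as one integer drops the leading zeros (0010001 becomes 10001), which is one reason the representation needs to be decided first.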

Database performance for filter by integer or boolean?

I will be having a database table with a few million entries, eg products of an online shop.
If one is out of stock, I want to mark it somehow, and I want to exclude it from any findAll() sql fetches.
Therefore I thought of the following options:
each product already has an integer count of availability. I anyhow have to set that = 0. select * from products where availcount > 0
or I could introduce a boolean available = 'true' field that I set to false if out of stock, and the query would then be ...where available = 'true'
Question: will this make any difference? Are there reasons one of these options should be preferred?

I would stick with the stock levels (int availcount). Bit fields are typically very difficult to index usefully, unless the data is massively skewed, with on the order of 1% or fewer of the products out of stock (and since you will likely be searching for in-stock products only, any index on the flag will be unused).
Since it seems you already store the actual stock level in any event, not storing a separate in-stock indicator will save you the headache of keeping the two columns in sync.
Finally, many RDBMSs allow you to add COMPUTED columns (or, failing that, you can add the available indicator to a VIEW), which lets you derive the available indicator from the actual availcount without any storage overhead.
Edit
As per the comments below, note that an index on availcount (for queries WHERE availcount = 0 and availcount > 0) will be just as unselective as an index on a bit field, and so just as unlikely to be used, although an index may not be needed if the products are generally searched by other criteria.
In addition to deriving the in-stock indicator in the database, this determination can also be made in code, e.g. an additional bool isAvailable() { return availcount > 0 ;} method on your entity class.
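As a minimal Java sketch of that last point (the Product class and its accessors are assumed here, not taken from the question), the flag is computed from the count on demand and never stored:

```java
class Product {
    private int availCount;

    Product(int availCount) {
        this.availCount = availCount;
    }

    // Updated whenever stock changes; this is the only stored state.
    void setAvailCount(int availCount) {
        this.availCount = availCount;
    }

    int getAvailCount() {
        return availCount;
    }

    // Derived from availCount, so it cannot drift out of sync with it.
    boolean isAvailable() {
        return availCount > 0;
    }
}
```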

If you already have the availcount column anyway, there is no reason to add a new one; your availcount > 0 filter will do.
If you do not need the count for other reasons, and are just trying to decide between having a count or a boolean, consider how hard it is going to be to update that column rather than filtering.
If you only have a boolean, you'll only need to touch it when the product goes out of stock (or comes back in). Having a count is more complex: you'd need to update it every time a sale is made or the item is restocked. This is more complicated, has possible performance implications, and brings a bunch of corner cases to care about. So, unless you need the count for other purposes, it's probably a better idea to stick with the boolean.

I think the two options would be equally efficient on SELECT as long as there's an index on the column in question.
Indexing availcount will have a small penalty on any update of this column (and I guess this column will be updated often). On the other hand, having an available column will add redundancy to your database (i.e. it will not be normalized) which you may want to avoid.

Better to query once, then organize objects based on returned column value, or query twice with different conditions?

I have a table which I need to query, then organize the returned objects into two different lists based on a column value. I can either query the table once, retrieving the column by which I would differentiate the objects and arrange them by looping through the result set, or I can query twice with two different conditions and avoid the sorting process. Which method is generally better practice?
MY_TABLE
NAME AGE TYPE
John 25 A
Sarah 30 B
Rick 22 A
Susan 43 B
Either SELECT * FROM MY_TABLE, then sort in code based on returned types, or
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'A' followed by
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'B'

Logically, a DB query from Java code will be more expensive than a loop within the code, because querying the DB involves several steps such as connecting to the DB, creating the SQL query, firing the query and getting the results back.
Besides, something can go wrong between firing the first and second query.
With an optimized single query and a loop in the code, you can save a lot of time compared with firing two queries.
In your case, you can sort in the query itself if it helps:
SELECT * FROM MY_TABLE ORDER BY TYPE
In future if there are more types added to your table, you need not fire an additional query to retrieve it.
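The loop itself can be quite small. The sketch below assumes the rows have already been read from the result set into a simple Person class (hypothetical, mirroring the NAME, AGE and TYPE columns) and groups them by type in one pass, so a new type needs no code change:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class Person {
    final String name;
    final int age;
    final String type;

    Person(String name, int age, String type) {
        this.name = name;
        this.age = age;
        this.type = type;
    }
}

class TypeSplitter {
    // Groups rows by their TYPE value, preserving the order in which rows arrive.
    static Map<String, List<Person>> byType(List<Person> rows) {
        Map<String, List<Person>> groups = new LinkedHashMap<>();
        for (Person p : rows) {
            groups.computeIfAbsent(p.type, k -> new ArrayList<>()).add(p);
        }
        return groups;
    }
}
```

With the sample table, byType(rows).get("A") would hold John and Rick, and get("B") would hold Sarah and Susan.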

It is heavily dependent on the context. If each list is really huge, I would let the database do the hard part of the job with 2 queries. Conversely, in a web application using a farm of application servers and a central database, I would use one single query.
For the general use case, IMHO, I would save database resources, because the database is a common point of congestion, and use only one query.
The only objective argument I can find is that the splitting of the list occurs in memory with a very simple algorithm and in a single JVM, whereas each query requires a bit of initialization and may involve disk access or loading of index pages.

In general, one query performs better.
Also, by issuing two queries you can potentially get inconsistent results (though this may be fixed with a higher transaction isolation level).
In any case I believe you still need to iterate through resultset (either directly or by using framework's methods that return collections).

From the database point of view, you optimally have exactly one statement that fetches exactly everything you need and nothing else. Therefore, your first option is better. But don't generalize that answer in a way that makes you query more data than needed. It's a common mistake for beginners to select all rows from a table (no where clause) and do the filtering in code instead of letting the database do its job.

It also depends on your data set volume. For instance, if you have a large data set, doing a select * without any condition might take some time, but if you have an index on your 'TYPE' column, then adding a where clause will reduce the time taken to execute the query. If you are dealing with a small data set, then doing a select * followed by your logic in the Java code is a better approach.

There are four main bottlenecks involved in querying a database.
The query itself - how long the query takes to execute on the server depends on indexes, table sizes etc.
The data volume of the results - there could be hundreds of columns or huge fields and all this data must be serialised and transported across the network to your client.
The processing of the data - java must walk the query results gathering the data it wants.
Maintaining the query - it takes manpower to maintain queries, simple ones cost little but complex ones can be a nightmare.
By careful consideration it should be possible to work out a balance between all four of these factors - it is unlikely that you will get the right answer without doing so.

You can query by two conditions:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B'
This will do both for you at once, and if you want them sorted, you could do the same, but just add an ORDER BY clause:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B' ORDER BY TYPE ASC
This will sort the results by type, in ascending order.
EDIT:
I didn't notice that originally you wanted two different lists. In that case, you could just do this query, and then find the index where the type changes from 'A' to 'B' and copy the data into two arrays.

What is the best way to match over 10000 different elements in database?

Ok here's my scenario:
Programming language: Java
I have a MYSQL database which has around 100,000,000 entries.
I have a list of values in memory, say valueList, with around 10,000 entries.
I want to iterate through valueList and check whether each value in this list, has a match in the database.
This means I have to make at least 10,000 database calls, which is highly inefficient for my application.
Other way would be to load the entire database into memory once, and then do the comparison in the memory itself. This is fast but needs a huge amount of memory.
Could you guys suggest a better approach for this problem?
EDIT :
Suppose valueList consists of values like :
{"New","York","Brazil","Detroit"}
From the database, I'll have a match for Brazil and Detroit, but not for New and York, though New York would have matched. So the next step is, in case of any remaining non-matched values, to combine them to see if they match now. In this case, I combine New and York and then find the match.
With the approach I was following before (one-by-one database calls), this was possible. But with the approach of creating a temp table, this won't be possible.

You could insert the 10k records into a temporary table with a single insert like this:
insert into tmp_table (id_col)
values (1),
(3),
...
(7);
Then join the 2 tables to get the desired results.
I don't know your table structure, but it could be like this:
select s.*
from some_table s
inner join tmp_table t on t.id_col = s.id
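From Java, the multi-row insert above could be built with placeholders rather than literal values and then bound through a PreparedStatement. The helper below is only a sketch: the class name and the idea of splitting the 10,000 values into smaller chunks are assumptions, and the table and column names are expected to come from code, never from user input.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BulkInsertSql {
    // Builds e.g. "insert into tmp_table (id_col) values (?),(?),(?)" for 3 rows.
    static String multiRowInsert(String table, String column, int rowCount) {
        if (rowCount < 1) {
            throw new IllegalArgumentException("rowCount must be >= 1");
        }
        String placeholders = String.join(",", Collections.nCopies(rowCount, "(?)"));
        return "insert into " + table + " (" + column + ") values " + placeholders;
    }

    // Splits the values into consecutive chunks of at most `size` elements.
    static <T> List<List<T>> chunks(List<T> values, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < values.size(); i += size) {
            out.add(values.subList(i, Math.min(i + size, values.size())));
        }
        return out;
    }
}
```

Each chunk would then get one statement from multiRowInsert(table, column, chunk.size()), with its values bound in order before executing.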

Most efficient way to check if row exists in grid Java

All,
I am wondering what's the most efficient way to check if a row already exists in a List<Set<Foo>>. A Foo object has a key/value pair(as well as other fields which aren't applicable to this question). Each Set in the List is unique.
As an example:
List[
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:2][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:3]
]
I want to be able to check if a new Set (Ex: Set[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]) exists in the List.
Each Set could contain anywhere from 1-20 Foo objects. The List can contain anywhere from 1-100,000 Sets. Foo's are not guaranteed to be in the same order in each Set (so they will have to be pre-sorted for the correct order somehow, like a TreeSet)
Idea 1: Would it make more sense to turn this into a matrix? Where each column would be the Foo_Key and each row would contain a Foo_Value?
Ex:
A B C
-----
1 3 4
1 2 4
1 3 3
And then look for a row containing the new values?
Idea 2: Would it make more sense to create a hash of each Set and then compare it to the hash of a new Set?
Is there a more efficient way I'm not thinking of?
Thanks

If you use TreeSets for your Sets, can't you just do list.contains(set), since a TreeSet will handle the equals check?
Also, consider using Guava's Multiset class.
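Note that list.contains(set) still scans the whole List. A variant along the lines of Idea 2 in the question is to keep the rows in a HashSet of Sets, which relies on Set.equals and Set.hashCode being based on contents rather than order. The sketch below assumes Foo is compared by key and value only; the RowIndex class is hypothetical.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class Foo {
    final String key;
    final int value;

    Foo(String key, int value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Foo)) {
            return false;
        }
        Foo f = (Foo) o;
        return value == f.value && key.equals(f.key);
    }

    @Override
    public int hashCode() {
        return Objects.hash(key, value);
    }
}

class RowIndex {
    private final Set<Set<Foo>> rows = new HashSet<>();

    // Returns false if an equal row is already present.
    // The row is copied so later changes by the caller cannot corrupt the index.
    boolean add(Set<Foo> row) {
        return rows.add(new HashSet<>(row));
    }

    // Hash lookup; the order of elements within the row does not matter.
    boolean contains(Set<Foo> row) {
        return rows.contains(row);
    }
}
```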

I would recommend you use a less weird data structure. As for finding stuff: Generally Hashes or Sorting + Binary Searching or Trees are the ways to go, depending on how much insertion/deletion you expect. Read a book on basic data structures and algorithms instead of trying to re-invent the wheel.
Lastly: if this is not a purely academic question, loop through the list and do the comparison. Most likely, that is acceptably fast. Even 100,000 entries will take a fraction of a second, and therefore not matter in 99% of all use cases.
I like to quote Knuth: Premature optimisation is the root of all evil.
