Derby - How to handle additions to objective fields

I'm looking to create a table for users and tracking their objectives. The objectives themselves would be on the order of 100s, if not 1000s, and would be maintained in their own table, but it wouldn't know who completed them - it would only define what objectives are available.
Objective:
 ID | Name    | Notes
----+---------+---------
    |         |
Now, in the Java environment, the users will have a java.util.BitSet for the objectives. So I can go
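/* in class User */
boolean hasCompletedObjective(int objectiveNum) {
    // BitSet.length() is the index of the highest set bit plus one, not the
    // number of defined objectives, so it cannot serve as an upper bound here.
    // BitSet.get() simply returns false for any index past that point.
    if (objectiveNum < 0)
        throw new IllegalArgumentException("Objective " + objectiveNum + " is invalid. Use a constant from class Objective.");
    return objectives.get(objectiveNum);
}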
I know internally, the BitSet uses a long[] to do its storage. What would be the best way to represent this in my Derby database? I'd prefer to keep it in columns on the AppUser table if at all possible, because they really are elements of the user.
Derby does not support arrays (to my knowledge), and while I'm not sure what the column limit is, something seems wrong with having 1000 columns, especially since I know I will not be querying the database with things like
SELECT *
FROM AppUser
WHERE AppUser.ObjectiveXYZ
What are my options, both for storing it, and marshaling it into the BitSet?
Are there viable alternatives to java.util.BitSet?
Is there a flaw in the general approach? I'm open to ideas!
Thanks!
*EDIT: If at all possible, I would like the ability to add more objectives with only a data modification, not a table modification. But again, I'm open to ideas!

[puts on fake moustache]
Store the bitset as a BLOB. Start by simply serializing it, then if you want more space-efficiency, try pushing the results through a DeflaterOutputStream on their way to the database. For better space- and time-efficiency, try the bitmap compression method used in FastBit, which breaks the bitset into 31-bit chunks, then run-length encodes all-zero chunks, packing the literal and run chunks into 32-bit words along with a discriminator bit.
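As a rough sketch of the first suggestion, the bitset can be turned into bytes and deflated before it is written to the BLOB column. The class and method names here are made up for illustration, and the sketch uses BitSet.toByteArray() rather than Java serialization, which is a swapped-in choice that avoids the serialization header. It needs Java 9 or later for readAllBytes().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper: converts a BitSet to and from a deflated byte[] for a BLOB column. */
final class BitSetBlobCodec {
    private BitSetBlobCodec() {}

    /** Packs the bits into bytes and compresses them with DEFLATE. */
    static byte[] toCompressedBytes(BitSet bits) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(bits.toByteArray());
        } catch (IOException e) {
            // Not expected for in-memory streams; rethrown unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reverses toCompressedBytes: inflates the blob and rebuilds the BitSet. */
    static BitSet fromCompressedBytes(byte[] blob) {
        try (InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(blob))) {
            return BitSet.valueOf(in.readAllBytes());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting byte[] could then be passed to PreparedStatement.setBytes() on the way in and read back with ResultSet.getBytes().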
If you know you'll only look at the objective bitset while the ResultSet that brought it from the database is still open, write a new bitset class that wraps the Blob interface and implements get on top of getBytes. This avoids having to read the whole BLOB into memory to check a few specific bits, and at least avoids having to allocate a separate buffer for the bitset if you do want to look at all the values. Note that making this work with a compressed bitset will take substantial ingenuity.
Be aware that this approach gives you no referential integrity, and no ability to query on the user-objective relationship, little flexibility for different uses of the data in future, and is exactly the kind of thing that Don Knuth warned you about.

The orthodox way to do this does not involve bitsets at all. You have a table for users, a table for objectives, and a join table, indicating which objectives a user has. Something like:
create table users (
id integer primary key,
name varchar(100) not null
);
create table objectives (
id integer primary key,
name varchar(100) not null
);
create table user_objective (
user_id integer not null references users,
objective_id integer not null references objectives,
primary key (user_id, objective_id)
);
Whenever a user has an objective, you put a row in the join table indicating the fact.
If you want to get the results into a bitset for a user, do an outer join of the user onto the objectives table via the join table, such that you get a row back for every objective, which has a single column with, say, a 1 for each joined objective, or 0 if there was no join.
The orthodox approach would also be to use a Set on the Java side, rather than a bitset. That maps very nicely onto the join table. Have you considered doing it this way?
If you're worried about memory consumption, a set will use about one pointer per objective a user actually has; the bitset will use a bit per possible objective. Most JVMs have 32-bit pointers (only old or huge-heaped 64-bit JVMs have 64-bit pointers), so if each user has on average less than 1/32nd of the possible objectives, the set will use less memory. There are some groovy data structures which will be able to store this information more compactly than either of those structures, but let's leave that to another question.
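A minimal sketch of the Java side of the join-table approach, assuming the user_objective table from above. The class and method names are invented for illustration, and the query has not been run against Derby. It reads only the rows that exist for one user into a Set, and converts that Set into a BitSet if one is still wanted.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.BitSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper for reading one user's objectives from the join table. */
final class UserObjectives {
    private UserObjectives() {}

    // Only completed objectives have rows, so a plain select is enough for the Set form.
    static final String COMPLETED_SQL =
        "select objective_id from user_objective where user_id = ?";

    /** Loads the ids of all objectives the given user has completed. */
    static Set<Integer> loadCompleted(Connection conn, int userId) throws SQLException {
        Set<Integer> completed = new LinkedHashSet<>();
        try (PreparedStatement ps = conn.prepareStatement(COMPLETED_SQL)) {
            ps.setInt(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    completed.add(rs.getInt(1));
                }
            }
        }
        return completed;
    }

    /** Converts the id set to a BitSet; assumes objective ids are small non-negative integers. */
    static BitSet toBitSet(Set<Integer> completed) {
        BitSet bits = new BitSet();
        for (int id : completed) {
            bits.set(id);
        }
        return bits;
    }
}
```

Adding a new objective is then just an insert into the objectives table, with no schema change.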

Related

How to select items in date range in DynamoDB

How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
String stringStart = df.format(start);
String stringEnd = df.format(end);
ScanSpec scanSpec = new ScanSpec();
scanSpec.withFilterExpression("CreatedAt BETWEEN :from AND :to")
.withValueMap(
new ValueMap()
.withString(":from", stringStart)
.withString(":to", stringEnd));
ItemCollection<ScanOutcome> items = null;
items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work; I'm getting fewer results than expected.

I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TL;DR
Set up your table so that you can retrieve data with queries rather than scans and filters, and then create indexes to support less frequently used access patterns and improve querying flexibility. Because of how fast reads are when providing the full (or, although not as fast, partial) primary key, i.e. using queries, DynamoDB is optimal when table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan is a great way to retrieve all items in your table, and unless you have all your items within one partition, it is the only way to retrieve all the items in your table. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single scan can only retrieve a maximum of 1MB of data, of a single partition, and is capped at that partition's read capacity. When a scan hits either limit, the follow-up scans run sequentially, so a scan of a large table can take multiple round trips.
On top of that, with scans you consume read capacity based on the size of the item, no matter how much (or little) data is returned. If you only request a small number of attributes in your ProjectionExpression, and your FilterExpression eliminates 90% of the items in your table, you still pay to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
Let's now look at: How can I select all items, based on some criteria?
The ideal way to retrieve data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date) would be to query the base table (or an index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
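One detail worth checking before relying on BETWEEN: when createdAt is stored as a string, the comparison is a string comparison, so it only matches chronological order if every timestamp uses the same fixed-width format and the same time zone. The sketch below assumes createdAt holds ISO-8601 strings in UTC with millisecond precision; the class and method names are made up. Note that a SimpleDateFormat with a quoted 'Z', as in the question, formats in the JVM's default time zone unless told otherwise, which may or may not be related to the missing results.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: fixed-width UTC timestamps whose string order matches time order. */
final class SortKeyTimestamps {
    private SortKeyTimestamps() {}

    // The trailing 'Z' is a literal, so the formatter is pinned to UTC explicitly.
    private static final DateTimeFormatter ISO_MILLIS_UTC =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").withZone(ZoneOffset.UTC);

    /** Formats an instant as e.g. 2015-06-01T12:30:45.123Z. */
    static String format(Instant instant) {
        return ISO_MILLIS_UTC.format(instant);
    }

    /** Inclusive range check on the formatted strings, mirroring what BETWEEN does. */
    static boolean between(String value, String from, String to) {
        return value.compareTo(from) >= 0 && value.compareTo(to) <= 0;
    }
}
```

The strings produced by format() are what would go into the :from and :to placeholders of the key condition.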
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table data structure does not fit some of the access patterns for your application, you might need to. However, in DynamoDB, the denormalization of data also supports more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strong read consistency, so if you need that, you'll need to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to create an LSI. Another difference between the two is the primary key: a GSI can have a different partition key and sort key than the base table, while an LSI can only differ in its sort key. More about indexes.

Cassandra; best practice regarding Indexes?

I am modelling a Cassandra schema to get a bit more familiar on the subject and was wondering what is the best practice regarding creating indexes.
For example:
create table emailtogroup(email text, groupid int, primary key(email));
select * from emailtogroup where email='joop';
create index on emailtogroup(groupid);
select * from emailtogroup where groupid=2 ;
Or I can create an entirely new table:
create table grouptoemail(groupid int, email text, primary key(groupid, email));
select * from grouptoemail where groupid=2;
They both do the job.
I would expect creating a new table to be faster, because groupid then becomes the partition key. But I'm not sure what "magic" happens when creating an index, and whether this magic has a downside.

In my opinion, your first approach is correct.
create table emailtogroup(email text, groupid int, primary key(email));
because 1) in your case email is sort of unique, good candidate for primary key and 2) multiple emails can belong to same group, good candidate for secondary index. Please refer to this post - Cassandra: choosing a Partition Key
The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. well distributed data across each node) then you want your partitioning key to be as random as possible.
The second form of table creation is useful for range scans. For example if you have a use case like
i) List all the email groups which the user has joined from 1st Jan 2010 to 1st Jan 2013.
In that case you may have to design a table like
create table grouptoemail(email text, ts timestamp, groupid int, primary key(email, ts));
In this case all the email groups which the user joined will be clustered (stored together) on disk.

It depends on the cardinality of groupid. The cassandra docs:
When not to use an index
Do not use an index to query a huge volume of records for a small
number of results. For example, if you create an index on a
high-cardinality column, which has many distinct values, a query
between the fields will incur many seeks for very few results. In the
table with a billion users, looking up users by their email address (a
value that is typically unique for each user) instead of by their
state, is likely to be very inefficient. It would probably be more
efficient to manually maintain the table as a form of an index instead
of using the Cassandra built-in index. For columns containing unique
data, it is sometimes fine performance-wise to use an index for
convenience, as long as the query volume to the table having an
indexed column is moderate and not under constant load.
Naturally, there is no support for counter columns, in which every
value is distinct.
Conversely, creating an index on an extremely low-cardinality column,
such as a boolean column, does not make sense. Each value in the index
becomes a single row in the index, resulting in a huge row for all the
false values, for example. Indexing a multitude of indexed columns
having foo = true and foo = false is not useful.
So basically, if you are going to be dealing with a large dataset, and groupid won't return a lot of rows, a secondary index may not be the best idea.
Week #4 of DataStax Academy's Java Development with Apache Cassandra class talks about how to model these problems efficiently. Check that out if you get a chance.

How can I insert common data into a temp table from disparate schemas?

I am not sure how to solve this problem:
We import order information from a variety of online vendors ( Amazon, Newegg etc ). Each vendor has their own specific terminology and structure for their orders that we have mirrored into a database. Our data imports into the database with no issues, however the problem I am faced with is to write a method that will extract required fields from the database, regardless of the schema.
For instance assume we have the following structures:
Newegg structure:
"OrderNumber" integer NOT NULL, -- The Order Number
"InvoiceNumber" integer, -- The invoice number
"OrderDate" timestamp without time zone, -- Create date.
Amazon structure:
"amazonOrderId" character varying(25) NOT NULL, -- Amazon's unique, displayable identifier for an order.
"merchant-order-id" integer DEFAULT 0, -- A unique identifier optionally supplied for the order by the Merchant.
"purchase-date" timestamp with time zone, -- The date the order was placed.
How can I select these items and place them into a temporary table for me to query against?
The temporary table could look like:
"OrderNumber" character varying(25) NOT NULL,
"TransactionId" integer,
"PurchaseDate" timestamp with time zone
I understand that some of the databases represent an order number with an integer and others a character varying; to handle that I plan on casting the datatypes to String values.
Does anyone have a suggestion for me to read about that will help me figure this out?
I don't need an exact answer, just a nudge in the right direction.
The data will be consumed by Java, so if any particular Java classes will help, feel free to suggest them.

First, you can create a VIEW to provide this functionality:
CREATE VIEW orders AS
SELECT '1'::int AS source -- or any other tag to identify source
      ,"OrderNumber"::text AS order_nr
      ,"InvoiceNumber" AS transaction_id -- no cast .. is int already
      ,"OrderDate" AT TIME ZONE 'UTC' AS purchase_date -- !! see explanation
FROM tbl_newegg
UNION ALL -- not UNION!
SELECT 2
      ,"amazonOrderId"
      ,"merchant-order-id"
      ,"purchase-date"
FROM tbl_amazon;
You can query this view like any other table:
SELECT * FROM orders WHERE order_nr = '123' AND source = 2; -- order_nr is text in the view
The source is necessary if the order_nr is not unique. How else would you guarantee unique order-numbers over different sources?
A timestamp without time zone is ambiguous in a global context. It's only good in connection with its time zone. If you mix timestamp and timestamptz, you need to place the timestamp at a certain time zone with the AT TIME ZONE construct to make this work. For more explanation read this related answer.
I use UTC as time zone, you might want to provide a different one. A simple cast "OrderDate"::timestamptz would assume your current time zone. AT TIME ZONE applied to a timestamp results in timestamptz. That's why I did not add another cast.
While you can, I advise not to use camel-case identifiers in PostgreSQL ever. Avoids many kinds of possible confusion. Note the lower case identifiers (without the now unnecessary double-quotes) I supplied.
Don't use varchar(25) as type for the order_nr. Just use text without arbitrary length modifier if it has to be a string. If all order numbers consist of digits exclusively, integer or bigint would be faster.
Performance
One way to make this fast would be to materialize the view. I.e., write the result into a (temporary) table:
CREATE TEMP TABLE tmp_orders AS
SELECT * FROM orders;
ANALYZE tmp_orders; -- temp tables are not auto-analyzed!
ALTER TABLE tmp_orders
ADD constraint orders_pk PRIMARY KEY (order_nr, source);
You need an index. In my example, the primary key constraint provides the index automatically.
If your tables are big, make sure you have enough temporary buffers to handle this in RAM before you create the temp table. Else it will actually slow you down.
SET temp_buffers = '1000MB';
Has to be the first call to temp objects in your session. Don't set it high globally, just for your session. A temp table is dropped automatically at the end of your session anyway.
To get an estimate of how much RAM you need, create the table once and measure:
SELECT pg_size_pretty(pg_total_relation_size('tmp_orders'));
More on object sizes under this related question on dba.SE.
All the overhead pays off only if you have to process a number of queries within one session. For other use cases there are other solutions. If you know the source table at the time of the query, it would be much faster to direct your query to the source table instead. If you don't, I would question the uniqueness of your order_nr once more. If it is, in fact, guaranteed to be unique you can drop the column source I introduced.
For only one or a few queries, it might be faster to use the view instead of the materialized view.
I would also consider a plpgsql function that queries one table after the other until the record is found. Might be cheaper for a couple of queries, considering the overhead. Indexes for every table needed of course.
Also, if you stick to text or varchar for your order_nr, consider COLLATE "C" for it.

Sounds like you need to create an abstract class that will define the basics of interacting with the data, then derive a class per database schema you need to access. This will allow the core code to operate on a single object type, and each implementation can then specify the queries in a form specific to that database schema.
Something like:
public class Order
{
private String orderNumber;
private BigDecimal orderTotal;
... etc ...
}
public abstract class AbstractOrderInformation
{
public abstract ArrayList<Order> getOrders();
...
}
with a Newegg class:
public class NeweggOrderInformation extends AbstractOrderInformation
{
public ArrayList<Order> getOrders() {
... do the work of getting the newegg order
}
...
}
Then you can have an arbitrarily large number of formats and when you need information, you can just iterate over all the implementations and get the Orders from each.
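The "iterate over all the implementations" step might look roughly like the sketch below. The Order and AbstractOrderInformation types are cut-down stand-ins for the classes above, and FixedOrders is a made-up implementation that returns canned data in place of a real database query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch: gathers orders from every vendor-specific implementation. */
final class OrderAggregation {
    private OrderAggregation() {}

    /** Cut-down stand-in for the Order class; only the fields needed for the demo. */
    static final class Order {
        final String source;
        final String orderNumber;

        Order(String source, String orderNumber) {
            this.source = source;
            this.orderNumber = orderNumber;
        }
    }

    /** One subclass per vendor schema, each knowing its own queries. */
    abstract static class AbstractOrderInformation {
        abstract List<Order> getOrders();
    }

    /** Hypothetical implementation returning fixed data instead of querying a database. */
    static final class FixedOrders extends AbstractOrderInformation {
        private final List<Order> orders;

        FixedOrders(Order... orders) {
            this.orders = Arrays.asList(orders);
        }

        @Override
        List<Order> getOrders() {
            return orders;
        }
    }

    /** Core code sees only the abstract type and a single Order shape. */
    static List<Order> collectAll(List<? extends AbstractOrderInformation> sources) {
        List<Order> all = new ArrayList<>();
        for (AbstractOrderInformation source : sources) {
            all.addAll(source.getOrders());
        }
        return all;
    }
}
```

Adding another vendor then means adding one more subclass and registering it in the list passed to collectAll.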

Efficient solution for grouping same values in a large dataset

At my job I was to develop and implement a solution for the following problem:
Given a dataset of 30M records extract (key, value) tuples from the particular dataset field, group them by key and value storing the number of same values for each key. Write top 5000 most frequent values for each key to a database. Each dataset row contains up to 100 (key, value) tuples in a form of serialized XML.
I came up with the solution like this (using Spring-Batch):
Batch job steps:
Step 1. Iterate over the dataset rows and extract (key, value) tuples. Upon getting some fixed number of tuples dump them on disk. Each tuple goes to a file with the name pattern '/chunk-', thus all values for a specified key are stored in one directory. Within one file values are stored sorted.
Step 2. Iterate over all '' directories and merge their chunk files into one grouping same values. Since the values are stored sorted, it is trivial to merge them for O(n * log k) complexity, where 'n' is the number of values in a chunk file and 'k' is the initial number of chunks.
Step 3. For each merged file (in other words for each key) sequentially read its values using PriorityQueue to maintain top 5000 values without loading all the values into memory. Write queue content to the database.
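Step 3 can be sketched as a bounded min-heap: every (value, count) pair is offered to the queue, and whenever the queue grows past N the smallest count is dropped, so at most N + 1 entries are held at any time. The class and method names are made up, and the input is assumed to be a stream of already-grouped (value, count) pairs such as the merged file from Step 2 would provide.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Illustrative sketch: keeps the k most frequent values while streaming over counts. */
final class TopValues {
    private TopValues() {}

    /** Returns the k entries with the highest counts, most frequent first. */
    static List<Map.Entry<String, Long>> topK(Iterable<Map.Entry<String, Long>> counts, int k) {
        Comparator<Map.Entry<String, Long>> byCount =
            (a, b) -> Long.compare(a.getValue(), b.getValue());
        // Min-heap on count: the head is always the weakest candidate seen so far.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(byCount);
        if (k <= 0) {
            return new ArrayList<>();
        }
        for (Map.Entry<String, Long> entry : counts) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // evict the entry with the smallest count
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());
        return result;
    }
}
```

With N = 5000 the heap stays small regardless of how many distinct values a key has, which fits the 1GB heap limit mentioned below.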
I spent about a week on this task, mainly because I had not worked with Spring-Batch before and because I tried to emphasize scalability, which requires a careful implementation of the multi-threading part.
The problem is that my manager considers this task way too easy to spend that much time on.
And the question is: do you know a more efficient solution, or maybe a less efficient one that would be easier to implement? And how much time would you need to implement my solution?
I am aware of MapReduce-like frameworks, but I can't use them because the application is supposed to run on a simple PC with 3 cores and 1GB of Java heap.
Thank you in advance!
UPD: I think I did not state my question clearly. Let me ask it another way:
Given the problem and being the project manager or at least the task reviewer would you accept my solution? And how much time would you dedicate to this task?

Are you sure this approach is faster than doing a pre-scan of the XML-file to extract all keys, and then parse the XML-file over and over for each key? You are doing a lot of file management tasks in this solution, which is definitely not for free.
As you have three Cores, you could parse three keys at the same time (as long as the file system can handle the load).

Your solution seems reasonable and efficient; however, I'd probably use SQL.
While parsing the Key/Value pairs I'd insert/update into a SQL table.
I'd then query the table for the top records.
Here's an example using only T-SQL (SQL 2008, but the concept should be workable in most any modern RDBMS).
The SQL between /* START */ and /* END */ would be the statements you need to execute in your code.
BEGIN
    -- database table
    DECLARE @tbl TABLE (
        k INT -- key
        , v INT -- value
        , c INT -- count
        , UNIQUE CLUSTERED (k, v)
    )
    -- insertion loop (for testing)
    DECLARE @x INT
    SET @x = 0
    SET NOCOUNT OFF
    WHILE (@x < 1000000)
    BEGIN
        --
        SET @x = @x + 1
        DECLARE @k INT
        DECLARE @v INT
        SET @k = CAST(RAND() * 10 as INT)
        SET @v = CAST(RAND() * 100 as INT)
        -- the INSERT / UPDATE code
        /* START this is the sql you'd run for each row */
        UPDATE @tbl SET c = c + 1 WHERE k = @k AND v = @v
        IF @@ROWCOUNT = 0
            INSERT INTO @tbl VALUES (@k, @v, 1)
        /* END */
        --
    END
    SET NOCOUNT ON
    -- final select
    DECLARE @topN INT
    SET @topN = 50
    /* START this is the sql you'd run once at the end */
    SELECT
        a.k
        , a.v
    FROM (
        SELECT
            ROW_NUMBER() OVER (PARTITION BY k ORDER BY k ASC, c DESC) [rid]
            , k
            , v
        FROM @tbl
    ) a
    WHERE a.rid <= @topN -- ROW_NUMBER() starts at 1, so <= keeps exactly @topN rows per key
    /* END */
END

Gee, it doesn't seem like much work to try the old fashioned way of just doing it in-memory.
I would try just doing it first, then if you run out of memory, try one key per run (as per @Storstamp's answer).

If the "simple" solution is not an option because of the size of the data, my next choice would be an SQL database. However, as most of these need quite a lot of memory (and slow to a crawl when RAM is heavily overcommitted), maybe you should redirect your search toward a NoSQL database such as MongoDB, which can be quite efficient even when mostly disk-based (which your environment basically requires, with only 1GB of heap available).
The NoSQL database will do all the basic bookkeeping for you (storing the data, keeping track of all indexes, sorting it), and will probably do it a bit more efficiently than your solution, because all data can be sorted and indexed as it is inserted, removing the extra steps of sorting the lines in the /chunk- files, merging them etc.
You will end up with a solution that is probably much easier to administer, and it will also allow you to set up different kinds of queries, instead of being optimized only for this specific case.
As a project manager I would not oppose your current solution. It is already fast and solves the problem. As an architect, however, I would object because the solution is a bit hard to maintain, and because it does not use proven technologies that basically do part of what you have coded on your own. It is hard to beat the tree and hash implementations of modern databases.

Storing UUID in HSQLDB database

I wish to store UUIDs created using java.util.UUID in a HSQLDB database.
The obvious option is to simply store them as strings (in the code they will probably just be treated as such), i.e. varchar(36).
What other options should I consider for this, considering issues such as database size and query speed (neither of which is a huge concern given the volume of data involved, but I would like to consider them at least)?

HSQLDB has a built-in UUID type. Use that:
CREATE TABLE t (
id UUID PRIMARY KEY
);

You have a few options:
Store it as a VARCHAR(36), as you already have suggested. This will take 36 bytes (288 bits) of storage per UUID, not counting overhead.
Store each UUID in two BIGINT columns, one for the least-significant bits and one for the most-significant bits; use UUID#getLeastSignificantBits() and UUID#getMostSignificantBits() to grab each part and store it appropriately. This will take 128 bits of storage per UUID, not counting any overhead.
Store each UUID as an OBJECT; this stores it as the binary serialized version of the UUID class. I have no idea how much space this takes up; I'd have to run a test to see what the default serialized form of a Java UUID is.
The upsides and downsides of each approach depend on how you're passing the UUIDs around your app. If you're passing them around as their string equivalents, then the downside of requiring double the storage capacity for the VARCHAR(36) approach is probably outweighed by not having to convert them each time you do a DB query or update. If you're passing them around as native UUIDs, then the BIGINT method is probably pretty low-overhead.
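For the two-BIGINT option, the conversion in each direction is short. The sketch below uses made-up class and method names, and also shows a 16-byte form that would suit a binary column, with the most-significant half written first (an assumed byte order, chosen here for illustration).

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Illustrative sketch: UUID conversions for BIGINT-pair or 16-byte binary storage. */
final class UuidColumns {
    private UuidColumns() {}

    /** Splits a UUID into {mostSignificantBits, leastSignificantBits} for two BIGINT columns. */
    static long[] toLongs(UUID id) {
        return new long[] { id.getMostSignificantBits(), id.getLeastSignificantBits() };
    }

    /** Rebuilds the UUID from the two BIGINT values. */
    static UUID fromLongs(long mostSignificant, long leastSignificant) {
        return new UUID(mostSignificant, leastSignificant);
    }

    /** Packs a UUID into 16 bytes, most-significant half first. */
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
            .putLong(id.getMostSignificantBits())
            .putLong(id.getLeastSignificantBits())
            .array();
    }

    /** Reverses toBytes; expects exactly the 16 bytes it produced. */
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return new UUID(buffer.getLong(), buffer.getLong());
    }
}
```

Either form halves the storage compared with the 36-character string, at the cost of one conversion per read or write.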
Oh, and it's nice that you're looking to consider speed and storage space issues, but as many better than me have said, it's also good that you recognize that these might not be critically important given the amount of data your app will be storing and maintaining. As always, micro-optimization for the sake of performance is only important if not doing so leads to unacceptable cost or performance. Otherwise, these two issues -- the storage space of the UUIDs, and the time it takes to maintain and query them in the DB -- are reasonably low-importance given the cheap cost of storage and the ability of DB indices to make your life much easier. :)

I would recommend char(36) instead of varchar(36). Not sure about hsqldb, but in many DBMS char is a little faster.
For lookups, if the DBMS is smart, then you can use an integer value to "get closer" to your UUID.
For example, add an int column to your table as well as the char(36). When you insert into your table, insert the uuid.hashCode() into the int column. Then your searches can be like this
WHERE intCol = ? and uuid = ?
As I said, if hsqldb is smart like mysql or sql server, it will narrow the search by the intCol and then only compare at most a few values by the uuid. We use this trick to search through million+ record tables by string, and it is essentially as fast as an integer lookup.

Using BINARY(16) is another possibility. Less storage space than character types. Use CREATE TYPE UUID .. or CREATE DOMAIN UUID .. as suggested above.

I think the easiest thing to do would be to create your own domain thus creating your own UUID "type" (not really a type, but almost).
You also should consider the answer to this question (especially if you plan to use it instead of a "normal" primary key)
INT, BIGINT or UUID/GUID in HSQLDB? (deleted by community ...)
HSQLDB: Domain Creation and Manipulation
