Efficiently copy large timeseries results in Java

I am querying data from a timeseries database (Influx in my case) using Java.
I have approximately 20,000-100,000 values (Strings) in the database.
Mapping the results that I get via the Influx Java API to my domain objects seems to be very inefficient (ca. 0.5 s on a small machine).
I suppose this is due to the "resource-intensive" creation of the domain objects.
I am currently using the Stream API:
QueryResult.Series series = result.getResults().get(0).getSeries().get(0);
List<ItemHistoryEntity> mappedList = series.getValues().stream().parallel()
        .map(valueList -> new ItemHistoryEntity(valueList))
        .collect(Collectors.toList());
Unfortunately, downsampling my data in the database is not an option in my case.
How can I do this more efficiently in Java?
EDIT:
The next thing I will do with the list is downsample it. The problem is that for further downsampling I need the oldest timestamp in the list, and to get that timestamp I have to iterate over the full list. Would it be more efficient not to call Collectors.toList() until I have reduced the size of the list, even though I then need to iterate it at least twice? Or should I find the oldest timestamp with an additional DB query and then iterate the list only once, calling the collector only for the reduced list?
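For reference, mapping and finding the oldest timestamp can be done in a single pass. A minimal sketch; it assumes ItemHistoryEntity exposes a getTime() accessor returning a java.time.Instant (a placeholder for whatever the real entity offers) and needs java.util.ArrayList:
List<ItemHistoryEntity> mappedList = new ArrayList<>(series.getValues().size());
Instant oldest = Instant.MAX;
for (List<Object> valueList : series.getValues()) {
    // build the entity once and track the oldest timestamp on the fly
    ItemHistoryEntity entity = new ItemHistoryEntity(valueList);
    if (entity.getTime().isBefore(oldest)) {
        oldest = entity.getTime();
    }
    mappedList.add(entity);
}
That way the full list is traversed only once before the downsampling step.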

Related

How to select items in date range in DynamoDB

How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    String stringStart = df.format(start);
    String stringEnd = df.format(end);

    ScanSpec scanSpec = new ScanSpec()
            .withFilterExpression("CreatedAt BETWEEN :from AND :to")
            .withValueMap(new ValueMap()
                    .withString(":from", stringStart)
                    .withString(":to", stringEnd));

    ItemCollection<ScanOutcome> items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work; I'm getting fewer results than expected.
I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TLDR;
Set up your table so that you can retrieve data with queries rather than scans and filters, and then create indexes to support less frequently used access patterns and improve querying flexibility. Because reads that provide the full primary key (or, although not as fast, a partial one), i.e. queries, are so fast, DynamoDB works best when the table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan is a great way to retrieve all items in your table, and unless you have all your items within one partition, it is the only way to retrieve all the items in your table. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single scan can only retrieve a maximum of 1 MB of data from a single partition and is capped at that partition's read capacity. When a scan tops out at either limit, consecutive scans happen sequentially, meaning a scan of a large table can take multiple round trips.
On top of that, scans consume read capacity based on the size of the items read, no matter how much (or how little) data is returned. Even if you request only a few attributes in your ProjectionExpression and your FilterExpression eliminates 90% of the items in your table, you still pay to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
Let's now look at: How can I select all items, based on some criteria?
The ideal way to retrieve data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date) is to query the base table (or an index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
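For example, with the same Document API as in your snippet, such a query could look like the sketch below. GameName as the partition key and CreatedAt as the sort key are assumptions that may not match your actual table:
QuerySpec querySpec = new QuerySpec()
        .withKeyConditionExpression("GameName = :game AND CreatedAt BETWEEN :from AND :to")
        .withValueMap(new ValueMap()
                .withString(":game", "PacMan")      // assumed partition key value
                .withString(":from", stringStart)
                .withString(":to", stringEnd));
ItemCollection<QueryOutcome> items = gamesScoresTable.query(querySpec);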
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table's structure does not fit some of your application's access patterns, you might need to. However, in DynamoDB, denormalizing your data also supports more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strongly consistent reads, so if you need those, you'll have to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to add an LSI. Another difference between the two is the primary key: a GSI can have a different partition key and sort key than the base table, while an LSI can only differ in the sort key. More about indexes.

How is data retrieved from hash tables for collisions

I understand that hash tables are designed for easy storage and retrieval of data when handling massive amounts of it. However, when retrieving a specific piece of data, how does the table find it if it was stored in an alternative location due to a collision?
Say there are 10 indexes and data A is stored at index 3. Data E collides because index 3 is already occupied by data A, and collision resolution puts it at index 7 instead. When it comes time to retrieve data E, how does the lookup find E instead of applying the hash function and retrieving A?
Sorry if this is a dumb question. I'm still somewhat new to programming.
I don't believe that Java will resolve a hashing collision by moving an item to a different bucket. Doing that would make it difficult if not impossible to determine the correct bucket into which it should have been hashed. If you read this SO article carefully, you will note that it points out two tools which Java has at its disposal. First, it maintains a list of values for each bucket* (read note below). Second, if the list becomes too large it can increase the number of buckets.
*Note: I believe that the list has since been replaced with a tree. This ensures O(lg n) performance for lookup, insertion, etc. within a bucket in the worst case, whereas with a list the worst-case performance was O(n).
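To make the retrieval side concrete, here is a small self-contained sketch of my own (not from the linked article): every key is forced into the same bucket, yet the map still returns the right value, because keys inside a bucket are compared with equals() on lookup.
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Every Key hashes to the same bucket on purpose; equals() still tells them apart.
    static final class Key {
        final String name;
        Key(String name) { this.name = name; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }
    }

    public static void main(String[] args) {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key("A"), "data A");
        map.put(new Key("E"), "data E");
        System.out.println(map.get(new Key("E"))); // prints "data E", not "data A"
    }
}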

Java 8 Stream vs Collection Storage

I have been reading up on Java 8 Streams and the way data is streamed from a data source, rather than the entire collection being loaded to extract data from.
This quote in particular I read in an article regarding streams in Java 8.
No storage. Streams don't have storage for values; they carry values from a source (which could be a data structure, a generating function, an I/O channel, etc) through a pipeline of computational steps.
I understand the concept of streaming data in from a source piece by piece. What I don't understand is: if you are streaming from a collection, how is there no storage? The collection already exists on the heap; you are just streaming the data from that collection, and the collection already exists in "storage".
What's the difference memory-footprint wise if I were to just loop through the collection with a standard for loop?
The statement about streams and storage means that a stream doesn't have any storage of its own. If the stream's source is a collection, then obviously that collection has storage to hold the elements.
Let's take one of the examples from that article:
int sum = shapes.stream()
                .filter(s -> s.getColor() == BLUE)
                .mapToInt(s -> s.getWeight())
                .sum();
Assume that shapes is a Collection that has millions of elements. One might imagine that the filter operation would iterate over the elements from the source and create a temporary collection of results, which might also have millions of elements. The mapToInt operation might then iterate over that temporary collection and generate its results to be summed.
That's not how it works. There is no temporary, intermediate collection. The stream operations are pipelined, so elements emerging from filter are passed through mapToInt and thence to sum without being stored into and read from a collection.
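You can make that element-by-element flow visible with peek(); the following small demo is mine, not from the article:
import java.util.List;

public class PipelineDemo {
    public static void main(String[] args) {
        int sum = List.of(1, 2, 3).stream()
                .peek(n -> System.out.println("filter sees " + n))
                .filter(n -> n % 2 == 1)
                .peek(n -> System.out.println("map sees    " + n))
                .mapToInt(n -> n * 10)
                .sum();
        System.out.println("sum = " + sum);
        // The "filter sees" and "map sees" lines interleave per element,
        // showing that no intermediate collection is built between stages.
    }
}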
If the stream source weren't a collection -- say, elements were being read from a network connection -- there needn't be any storage at all. A pipeline like the following:
int sum = streamShapesFromNetwork()
                .filter(s -> s.getColor() == BLUE)
                .mapToInt(s -> s.getWeight())
                .sum();
might process millions of elements, but it wouldn't need to store millions of elements anywhere.
Think of the stream as a nozzle connected to the water tank that is your data structure. The nozzle doesn't have its own storage. Sure, the water (data) the stream provides is coming from a source that has storage, but the stream itself has no storage. Connecting another nozzle (stream) to your tank (data structure) won't require storage for a whole new copy of the data.
A Collection is a data structure. Based on the problem, you decide which collection to use, e.g. ArrayList or LinkedList (considering time and space complexity). A Stream, on the other hand, is just a processing tool that makes your life easier.
Another difference is that you can think of a Collection as an in-memory data structure to which you can add and from which you can remove elements.
With a Stream, you can perform two kinds of operations:
a. Intermediate operations: filter, map, sort, limit on the result set
b. Terminal operations: forEach, or collect the result set into a collection
But notice that with a stream you can't add or remove elements.
A Stream is a kind of iterator: you can traverse a collection through a stream. Note that you can traverse a stream only once. Let me give you an example for better understanding:
Example1:
List<String> employeeNameList = Arrays.asList("John", "Peter", "Sachin");
Stream<String> s = employeeNameList.stream();
// iterate through the list
s.forEach(System.out::println); // this works perfectly fine
s.forEach(System.out::println); // throws IllegalStateException: stream has already been operated upon
So what you can infer is: you can iterate a collection as many times as you want, but once you have iterated a stream, it is consumed and you need to create a new one to traverse the elements again.
I hope that is clear.
A stream is just a view of the data; it has no storage of its own, and you can't modify the underlying collection (assuming it's a stream that was built on top of a collection) through the stream. It's like "read only" access.
If you have any RDBMS experience - it's the exact same idea of "view".
The previous answers are mostly correct, but a more intuitive way to think about it follows (for readers landing here from Google):
Think of streams as UNIX pipelines of text:
cat input.file | sed ... | grep ... > output.file
In general, those UNIX text utilities consume a small amount of RAM compared to the input data they process.
That's not always the case: think of sort, which needs to keep intermediate data in memory.
The same is true for streams: sometimes intermediate data is needed, but most of the time it is not.
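For instance, a stateful operation like sorted() must buffer its whole input before it can emit anything. A small sketch of my own:
List.of(3, 1, 2).stream()
        .peek(n -> System.out.println("before sort: " + n))
        .sorted()
        .forEach(n -> System.out.println("after sort:  " + n));
// All the "before sort" lines print before any "after sort" line,
// because sorted() has to see every element first.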
As an extra analogy, to some extent serverless cloud APIs follow the same design as UNIX pipelines or Java streams.
They do not exist in memory until they have some input data to process. The cloud platform launches them and injects the input data, and the output is sent gradually somewhere else, so the serverless API does not consume many resources (most of the time).
There are no absolute truths in this case either.

How can I improve performance of string processing with less memory?

I'm implementing this in Java.
Symbol file      Store data file
1\item1          10\storename1
10\item20        15\storename6
11\item6         15\storename9
15\item14        1\storename250
5\item5          1\storename15
The user will search store names using wildcards like storename?
My job is to search the store names and produce a full string using symbol data. For example:
item20-storename1
item14-storename6
item14-storename9
My approach is:
reading the store data file line by line
if any line contains matching search string (like storename?), I will push that line to an intermediate store result file
I will also copy the itemno of a matching storename into an arraylist (like 10,15)
when this arraylist size%100==0 then I will remove duplicate item no's using hashset, reducing arraylist size significantly
when arraylist size >1000
sort that list using Collections.sort(itemno_arraylist)
open symbol file & start reading line by line
for each line, Collections.binarySearch(itemno_arraylist, itemno)
if matching then push result to an intermediate symbol result file
continue with step1 until EOF of store data file
...
After all of this I would combine two result files (symbol result file & store result file) to present actual strings list.
This approach works, but it consumes a lot of CPU time and main memory.
I want to know a better solution with reduced CPU time (currently 2 min) & memory (currently 80MB). There are many collection classes available in Java. Which one would give a more efficient solution for this kind of huge string processing problem?
Any thoughts on this kind of string processing problem, particularly in Java, would be great and helpful.
Note: Both files would be nearly a million lines long.
Replace the two flat files with an embedded database (there's plenty of them, I used SQLite and Db4O in the past): problem solved.
So you need to replace 10\storename1 with item20-storename1 because the symbol file contains 10\item20. The obvious solution is to load the symbol file into a Map:
String[] tokens = symbolFile.readLine().split("\\\\");
map.put(tokens[0], tokens[1]);
Then read the store file line by line and replace:
String[] tokens = storeFile.readLine().split("\\\\");
output.println(map.get(tokens[0]) + '-' + tokens[1]);
This is the fastest method, though it still uses a lot of memory for the map. You can reduce the memory by storing the map in a database, but that would increase the time significantly.
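Putting it together, a minimal end-to-end sketch of that approach; the file names, the output target and the startsWith check are placeholders for your real input and wildcard matching:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class StoreJoin {
    public static void main(String[] args) throws IOException {
        Map<String, String> itemBySymbol = new HashMap<>();
        try (BufferedReader symbols = Files.newBufferedReader(Path.of("symbol.txt"))) {
            String line;
            while ((line = symbols.readLine()) != null) {
                String[] t = line.split("\\\\");    // "1\item1" -> ["1", "item1"]
                itemBySymbol.put(t[0], t[1]);
            }
        }
        try (BufferedReader stores = Files.newBufferedReader(Path.of("store.txt"));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Path.of("result.txt")))) {
            String line;
            while ((line = stores.readLine()) != null) {
                String[] t = line.split("\\\\");    // "10\storename1" -> ["10", "storename1"]
                if (t[1].startsWith("storename")) {  // stand-in for the real wildcard match
                    out.println(itemBySymbol.get(t[0]) + "-" + t[1]);
                }
            }
        }
    }
}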
If your input data file does not change frequently, then parse the file once and put the data into a List of a custom class, e.g. FileStoreRecord, that maps a record in the file. Define an equals method on that class. Perform all subsequent steps over the List; e.g. for search, you can call contains, passing the search string in the form of a FileStoreRecord object.
If the file does change after some time, you may want to refresh the List at a certain interval, or keep track of the list creation time and compare it against the file's update timestamp before using it; if they differ, recreate the list. Another way to manage the file check could be a thread that continuously polls the file and, the moment it is updated, triggers a refresh of the list.
Is there any limitation preventing you from using a Map?
You can add the items to a Map and then search them easily.
One million records means 1M * recordSize, so it will not be a problem.
Map<Integer, Item> itemMap = new HashMap<>();
...
Item item = itemMap.get(store.getItemNo());
But the best solution would be to use a database.

Unknown record number while executing SQLite query

When I am executing a SQLite query (using sqlite4java) I am following the general scheme of executing it step by step and obtaining one row at a time. My final result should be a 2-D array whose length corresponds to the number of records. The problem I am facing is that I don't know in advance how many records the query will return, so I basically store them in an ArrayList and then copy the references into the actual array. Is there a technique to somehow obtain the number of records the query will return prior to executing it fully?
My final result should be a 2-D array, whose length should correspond to the amount of records.
Why? It would generally be a better idea to make the result a List<E> where E is some custom type representing "a record".
It sounds like you're already creating an ArrayList - so why do you need an actual array? The Collections API is generally more flexible and convenient than using arrays directly.
Not if you're using JDBC. You can, however, first run a COUNT() query and then the real query.
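A minimal JDBC sketch of that idea, assuming a SQLite JDBC driver is on the classpath; the database URL, table and column names are made up, and with sqlite4java the approach is the same, only the API differs:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class CountThenQuery {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db");
             Statement st = conn.createStatement()) {
            int rowCount;
            try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM records")) {
                rs.next();
                rowCount = rs.getInt(1);
            }
            String[][] result = new String[rowCount][2]; // sized up front
            int i = 0;
            try (ResultSet rs = st.executeQuery("SELECT col1, col2 FROM records")) {
                while (rs.next()) {
                    result[i][0] = rs.getString(1);
                    result[i][1] = rs.getString(2);
                    i++;
                }
            }
            // Note: the table could change between the two statements
            // unless both run inside one transaction.
        }
    }
}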
