I'm using Spark with Java connector to process my data.
One of the essential operations I need to perform on the data is counting the number of records (rows) in a DataFrame.
I tried df.count() but the execution time is extremely slow (30-40 seconds for 2-3M records).
Also, due to the system's requirement, I don't want to use df.rdd().countApprox() API because we need the exact count number.
Could somebody give me a suggestion of any alternatives that return exactly the same result as df.count() does, with faster execution time?
Highly appreciate your replies.
df.cache
df.count
The first count will still be slow, since the data is cached during that first execution, but subsequent counts will give you good performance.
Leveraging df.cache depends on the use case.
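As a minimal sketch of that pattern with the Java API (assuming df is the DataFrame from the question):

    df.cache();                    // only marks the DataFrame for caching; nothing is materialised yet
    long firstCount = df.count();  // runs the full job and populates the cache, so it is still slow
    long secondCount = df.count(); // answered from the cached data, typically much faster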
A simple way to check whether a dataframe has any rows is to do a Try(df.head): if it's a Success, there's at least one row in the dataframe; if it's a Failure, the dataframe is empty. In Scala this is just a pattern match on Try(df.head); a rough Java sketch of the same idea follows.
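This sketch assumes that head() on an empty DataFrame surfaces as a NoSuchElementException:

    boolean hasRows;
    try {
        df.head();                                       // fetches at most one row
        hasRows = true;
    } catch (java.util.NoSuchElementException e) {       // no row could be returned
        hasRows = false;
    }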
As for why df.count() is slow: count() is an action, so it triggers execution of the entire lineage that produces the DataFrame, not just a cheap row tally.
Count is very fast. You need to look at some of your other operations: the data loading and the transformations you do to generate the DataFrame that you are counting. That is the part slowing you down, not the count itself.
If you can reduce the amount of data you load, or cut out any transformations that don't affect the count, you may be able to speed things up. If that's not an option, you may be able to write your transformations more efficiently. Without knowing your transformations, though, it's not possible to say what the bottleneck might be.
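As a hedged sketch of that idea (assuming spark is your SparkSession; the path, column names, and filter below are placeholders, not anything from the question):

    // Keep only what the count actually depends on: filters that change the
    // result stay, columns the count never needs are dropped.
    Dataset<Row> slim = spark.read().parquet("/data/events")                     // hypothetical source
            .filter(org.apache.spark.sql.functions.col("status").equalTo("ACTIVE"))  // a filter that affects the count
            .select("id");                                                        // any single cheap column is enough
    long n = slim.count();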
I just found out that loading the data into a Spark DataFrame for further queries and counting is unnecessary.
Instead, we can use the Aerospike client to do the job, and it's much faster than the above approach.
Here's a reference on how to use the Aerospike client:
http://www.aerospike.com/launchpad/query_multiple_filters.html
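For illustration, a hedged sketch of counting records directly with the Aerospike Java client via a scan; the host, namespace, and set names are assumptions, and whether this beats the Spark approach depends entirely on your data volume:

    // Assumes the aerospike-client dependency is on the classpath.
    AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
    AtomicLong count = new AtomicLong();
    client.scanAll(new ScanPolicy(), "myNamespace", "mySet",
            (key, record) -> count.incrementAndGet());   // ScanCallback invoked once per record
    System.out.println("records: " + count.get());
    client.close();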
Thanks everyone
I am trying to count the number of rows that a Java process reads and writes. The process is using the SQL API dealing with Datasets of Row. Adding .count() at various points seems to slow it down a lot, even if I do a .persist() prior to those points.
I have also seen code that does a
.map(row -> {
    accumulator.add(1);   // count the row as it passes through
    return row;           // hand the row on unchanged
}, SomeEncoder)
which works well enough, but the deserialization and re-serialization of the whole row seems unnecessary, and it isn't automatic since one has to come up with the correct SomeEncoder at each point.
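A more complete sketch of that map-with-accumulator pattern in the Java API; the input/output paths are placeholders, and RowEncoder.apply is one way to obtain the "SomeEncoder" for plain Rows (the exact encoder API varies across Spark versions, e.g. Encoders.row in newer releases):

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.catalyst.encoders.RowEncoder;
    import org.apache.spark.util.LongAccumulator;

    SparkSession spark = SparkSession.builder().getOrCreate();
    Dataset<Row> df = spark.read().parquet("/tmp/in");                        // placeholder input
    LongAccumulator rowsWritten = spark.sparkContext().longAccumulator("rowsWritten");

    Dataset<Row> counted = df.map(
            (MapFunction<Row, Row>) row -> { rowsWritten.add(1); return row; },
            RowEncoder.apply(df.schema()));                                    // the "SomeEncoder" for Rows

    counted.write().parquet("/tmp/out");                                       // placeholder output
    // The accumulator only holds a meaningful value after an action has run.
    System.out.println("rows written: " + rowsWritten.value());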
A third option is maybe to call a UDF0 that does the counting and then drop the dummy object it would return but I'm not sure if Spark would be allowed to optimize the whole code away if it can tell the UDF0 isn't changing the output.
Is there a good way of counting without deserializing the rows? Or alternatively, is there a method that does the equivalent of Java's streams' .peek() where the returned data isn't important?
EDIT: to clarify, the job isn't just counting; the counting is only for record-keeping purposes. The job is doing other things. In fact, this is a pretty generic problem: I've got lots of jobs that do some transformations on data and save them somewhere, and I just want to keep a running record of how many rows these jobs read and wrote.
Thank you
What is the fastest way to populate a Hazelcast data grid? Reading through the documentation I can see a couple of variants:
1. Use multithreading and IMap.set.
2. Use multithreading and IMap.putAll.
3. Use Distributed Execution in order to start populating the grid from all participants.
My performance benchmark shows that IMap.putAll is faster than IMap.set. But it is stated in the Hazelcast documentation that IMap.putAll does not come with guarantees that everything will be inserted atomically.
Can someone clarify a little bit about what would be the fastest way to populate a data grid with data ?
Is variant number 3 good ?
I would see the same three options. Anyhow, as you mentioned, option two does not guarantee that everything was put into the map atomically, but if you just load data and wait for all threads to finish loading via IMap::putAll you should be fine.
Apart from that, IMap::set would be the alternative. In any case you want to multithread the loading process. I would play around a bit with different thread counts; loading the data from a client is normally recommended, to keep the nodes free for storage operations.
I personally never benchmarked your third option; anyhow, it would be possible as well. I'm just not sure it is worth the additional work.
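As a rough sketch of multi-threaded loading from a client with IMap::putAll (map name, batch size, and thread count are assumptions; on Hazelcast 3.x the IMap import is com.hazelcast.core.IMap):

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;                          // com.hazelcast.core.IMap on 3.x
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public final class GridLoader {
        // batches: the source data pre-split into chunks of, say, 10k entries each
        public static void load(List<Map<String, String>> batches) throws InterruptedException {
            HazelcastInstance client = HazelcastClient.newHazelcastClient();
            IMap<String, String> map = client.getMap("grid");        // placeholder map name
            ExecutorService pool = Executors.newFixedThreadPool(4);  // tune the thread count
            for (Map<String, String> batch : batches) {
                pool.submit(() -> map.putAll(batch));                // one bulk put per batch
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);                // wait until loading is done
            client.shutdown();
        }
    }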
How much data do you want to load that you're concerned it could be slow? Do you already know that loading is slow? Do you use Java Serialization (this is a huge performance killer)? Do you use indexes (those have to be generated while putting data)?
There are normally a lot of optimizations you can apply to speed up not only data loading but also normal operation.
I'm writing an Android application which stores a set of ~50,000 strings, and I need input on how best to store them.
My objective is to be able to query with low latency for a list of strings matching a pattern (like Hello W* or *m Aliv*), but avoid a huge initialization time.
I thought of the following 2 ways:
A Java collection. I imagine a Java collection should be quick to search, but given that it's fairly large, I'm afraid it might have a big impact on the app initialization time.
A table in an SQLite database. I imagine this would go easy on initialization time (since it doesn't need to be loaded into memory), but I'm afraid the query would impose some relevant latency, since it needs to start an SQLite process (or doesn't it?).
Are my "imagine"s correct or horribly wrong? Which way would be best?
If you want quick (as in instant) search times, what you need is a full-text index of your strings. Fortunately, SQLite has some full-text search support with the FTS extension. SQLite is part of the Android APIs and the initialisation time is totally negligible. What you do have to watch is that the index (the .sqlite file) has to either be shipped with your app in the .apk, or be re-created the first time the app opens (and that can take quite some time).
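A hedged sketch of the FTS approach on Android (the table and column names are made up, and the db handle is assumed to come from your SQLiteOpenHelper); note that FTS token-prefix queries cover the Hello W* case, while leading-wildcard patterns like *m Aliv* need a different strategy:

    // Build the full-text index once (e.g. on first launch, or ship it pre-built in the .apk).
    db.execSQL("CREATE VIRTUAL TABLE IF NOT EXISTS strings_fts USING fts4(content)");
    db.execSQL("INSERT INTO strings_fts(content) VALUES (?)", new Object[]{"Hello World"});

    // Token-prefix match, e.g. everything containing "Hello" plus a token starting with "W".
    Cursor cursor = db.rawQuery(
            "SELECT content FROM strings_fts WHERE content MATCH ?",
            new String[]{"Hello W*"});
    while (cursor.moveToNext()) {
        String match = cursor.getString(0);   // use the matching string
    }
    cursor.close();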
Look at data structures like a Patricia trie (http://en.wikipedia.org/wiki/Radix_tree) or a ternary search tree (http://en.wikipedia.org/wiki/Ternary_search_tree). They will dramatically reduce your search time and, depending on the amount of overlap in your strings, may actually reduce the memory requirements. The Java collections are good for many purposes but are not optimal for large sets of short strings.
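A minimal sketch using Apache Commons Collections' PatriciaTrie as one ready-made radix-tree implementation (my choice of library, not something the answer specifies); it handles prefix queries like Hello W*:

    import org.apache.commons.collections4.trie.PatriciaTrie;
    import java.util.List;
    import java.util.SortedMap;

    public final class PrefixIndex {
        private final PatriciaTrie<Boolean> trie = new PatriciaTrie<>();

        public PrefixIndex(List<String> strings) {            // e.g. your ~50,000 strings
            for (String s : strings) {
                trie.put(s, Boolean.TRUE);
            }
        }

        // All stored strings starting with the given prefix, e.g. "Hello W".
        public SortedMap<String, Boolean> startingWith(String prefix) {
            return trie.prefixMap(prefix);
        }
    }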
I would definitely stick to SQLite. It's really fast in both initialization and querying. SQLite runs in the application process, so there are almost no initialization penalties. A query is normally fired in a background thread so as not to block the main thread. It will be very fast on 50,000 records, and you won't load all the data into memory, which is also important.
If your strings numbered only around 50, you could use a Java collection; in that case a database would just be time-consuming.
I use DB2 9.7.5, 64-bit. The server has enough memory but no clustering.
I need to make huge computations: compute several (roughly 20) ratios in my DB. Some of them can take as long as 25 seconds.
The results are stored in a result table.
Now I have several solutions (as a policy, we exclude stored procedures):
I call each ratio, one at a time, from a Java client, OR
I call several ratios from a multi-threaded Java client.
My assumption is that it is useless to call from multiple threads, since my DB is the bottleneck. But I'm not wholly sure that the DB engine really gives 100% of the CPU to one query. I think the engine must probably be able to share its CPU power between several queries.
I am currently reading the IBM Data manual but would like to have your feedback.
Many thanks.
I need to make huge computations: compute several (roughly 20) ratios in my DB. Some of them can take as long as 25 seconds.
25 seconds is not necessarily a bad thing; maybe it's a wonderful result. It depends on what you compute.
Now I have several solutions (as a policy, we exclude stored procedures).
Stored procedures are not evil; you just need to know how to use them safely.
My assumption is that it is useless to call from multiple threads, since my DB is the bottleneck. But I'm not wholly sure that the DB engine really gives 100% of the CPU to one query. I think the engine must probably be able to share its CPU power between several queries.
Multithreading in Java never hurts (as long as you keep the threads safe), and it is especially useful in your case, where you are doing a lot of calculations.
I don't use DB2, so I don't know how good it is at multithreading, but if it's single-threaded I doubt it will ever reach 100% CPU usage. You should check your DB2 configuration files to tweak it a little bit.
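To make the multi-threaded client option concrete, here is a hedged sketch that fires the ratio queries in parallel over separate JDBC connections; the URL, credentials, and SQL are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public final class RatioRunner {
        public static void main(String[] args) throws Exception {
            List<String> ratioQueries = List.of(
                    "SELECT SUM(a) / SUM(b) FROM my_schema.facts",              // hypothetical ratio 1
                    "SELECT COUNT(x) * 1.0 / COUNT(*) FROM my_schema.facts");   // hypothetical ratio 2
            ExecutorService pool = Executors.newFixedThreadPool(4);             // tune to what DB2 can absorb
            List<Future<Double>> results = new ArrayList<>();
            for (String sql : ratioQueries) {
                results.add(pool.submit(() -> {
                    // one connection per query so DB2 can work on several at once
                    try (Connection c = DriverManager.getConnection(
                                 "jdbc:db2://host:50000/MYDB", "user", "password");
                         Statement s = c.createStatement();
                         ResultSet rs = s.executeQuery(sql)) {
                        rs.next();
                        return rs.getDouble(1);
                    }
                }));
            }
            for (Future<Double> f : results) {
                System.out.println(f.get());        // blocks until that ratio is computed
            }
            pool.shutdown();
        }
    }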
Also, read up on IBM DB2 clustering.
I also suggest using a data-warehouse tool to analyze your script performance against DB2.
Good luck
Take a look at Materialized Query Tables (MQTs). If what you are working with is reporting, and especially if it doesn't require absolutely up-to-date information, you can set up MQTs that contain the parts that are heavy to calculate, refreshed for instance hourly.
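As a rough illustration of a deferred-refresh MQT holding a precomputed ratio (the table, columns, and grouping are placeholders; check the exact options against your DB2 version):

    // Executed once over a JDBC Statement `s`; refresh it on your own schedule.
    s.execute(
            "CREATE TABLE ratio_mqt AS (" +
            "  SELECT region, SUM(a) / SUM(b) AS ratio " +
            "  FROM my_schema.facts GROUP BY region" +
            ") DATA INITIALLY DEFERRED REFRESH DEFERRED");
    s.execute("REFRESH TABLE ratio_mqt");   // populates the MQT; reports can then read it directly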
We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we can increase memory limits, we hesitate to do so, since it requires a high allocation that most of the time isn't necessary.
We are considering using a customized java.util.List implementation to spool to disk when we hit peak loads like this, but under lighter circumstances will remain in memory.
The data is loaded once into the collection, subsequently iterated over and processed, and then thrown away. It doesn't need to be sorted once it's in the collection.
Does anyone have pros/cons regarding such an approach?
Is there an open source product that provides some sort of List impl like this?
Thanks!
Updates:
Not to be cheeky, but by 'huge' I mean exceeding the amount of memory we're willing to allocate without interfering with other processes on the same hardware. What other details do you need?
The application is, essentially a batch processor that loads in data from multiple database tables and conducts extensive business logic on it. All of the data in the list is required since aggregate operations are part of the logic done.
I just came across this post which offers a very good option: STXXL equivalent in Java
Do you really need to use a List? Write an implementation of Iterator (it may help to extend AbstractIterator) that steps through your data instead. Then you can make use of helpful utilities like these with that iterator. None of this will cause huge amounts of data to be loaded eagerly into memory -- instead, records are read from your source only as the iterator is advanced.
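For example, a sketch using Guava's AbstractIterator over a file of records (the line-per-record format is an assumption); only the current record is ever held in memory:

    import com.google.common.collect.AbstractIterator;
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.UncheckedIOException;

    final class RecordIterator extends AbstractIterator<String> {
        private final BufferedReader reader;

        RecordIterator(BufferedReader reader) {
            this.reader = reader;
        }

        @Override
        protected String computeNext() {
            try {
                String line = reader.readLine();
                return line != null ? line : endOfData();   // endOfData() marks the iterator as exhausted
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }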
If you're working with huge amounts of data, you might want to consider using a database instead.
Back it up to a database and do lazy loading on the items.
An ORM framework may be in order. It depends on your usage. It may be pretty straightforward, or the worst of your nightmares; it's hard to tell from what you've described.
I'm an optimist, and I think that using an ORM framework (such as Hibernate) would solve your problem in about 3-5 days.
Is there sorting/processing that's going on while the data is being read into the collection? Where is it being read from?
If it's being read from disk already, would it be possible to simply batch-process it directly from disk, instead of reading it into a list completely and then iterating? How inter-dependent is the data?
I would also question why you need to load all of the data in memory to process it. Typically, you should be able to do the processing as it is being loaded and then use the result. That would keep the actual data out of memory.
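A hedged sketch of that process-while-loading idea over JDBC (the URL, query, and the aggregate computed are placeholders); no List is ever built:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public final class StreamingAggregate {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "password");
                 PreparedStatement ps = c.prepareStatement("SELECT amount FROM my_schema.facts")) {
                ps.setFetchSize(1_000);                    // pull rows in chunks instead of all at once
                try (ResultSet rs = ps.executeQuery()) {
                    long rows = 0;
                    double total = 0;
                    while (rs.next()) {                    // only the current row is held in memory
                        total += rs.getDouble("amount");
                        rows++;
                    }
                    System.out.println("avg = " + (total / rows));
                }
            }
        }
    }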