How can Redis dump to several files instead of only one? - java

Hello, I'd like to ask two questions. (I am using Java and Jedis.)
I want to write 2 GB of data to Redis; how can I write it faster?
Does Redis dump the data to several files, not only dump.rdb? For example, if the data is as large as 4 GB, will it be dumped to dump.rdb and dump2.rdb?

You can import data into Redis faster by using commands with variadic parameters (such as MSET), and/or by using pipelining (which is supported by Jedis) to aggregate round trips to the Redis instance. The fewer round trips, the faster the import.
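For example, here is a minimal sketch of a pipelined bulk write with Jedis (the host, key names, and batch size are illustrative, not prescriptive):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class BulkImport {
    public static void main(String[] args) {
        // Assumes a Redis instance on localhost:6379; adjust as needed.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            Pipeline pipeline = jedis.pipelined();
            for (int i = 0; i < 1_000_000; i++) {
                pipeline.set("key:" + i, "value:" + i);
                // Flush periodically so the client-side buffer stays bounded.
                if (i % 10_000 == 0) {
                    pipeline.sync();
                }
            }
            pipeline.sync(); // one final round trip for the remaining commands
        }
    }
}
```

Each sync() call sends all queued commands in a single round trip, which is where the speedup comes from.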
Another good practice is to deactivate the AOF (if it is activated) and the background RDB dump (if it is activated) during the import operation, and reactivate them afterwards.
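If your setup allows it, this can be done at runtime with CONFIG SET. A rough sketch, assuming the same Jedis connection as above and that CONFIG SET is permitted on the instance (the save schedule shown when re-enabling is just a common default; substitute your own):

```java
jedis.configSet("appendonly", "no"); // deactivate the AOF
jedis.configSet("save", "");         // deactivate background RDB snapshots
// ... run the bulk import ...
jedis.configSet("appendonly", "yes");
jedis.configSet("save", "900 1 300 10 60 10000");
```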
Redis cannot dump to several data files. But if you write 2 GB of data into Redis, there is no way the dump file can take 4 GB: the dump file is always much more compact than the data in memory. The only way to get several dump files is to start multiple Redis instances and shard the data across them.

Related

Can Apache Spark speed up the process of reading millions of records from Oracle DB and then writing these to a file?

I am new to Apache Spark.
I have a requirement to read millions (~5 million) of records from an Oracle database, do some processing on these records, and write the processed records to a file.
At present, this is done in Java, and in this process:
- the records in the DB are categorized into different subsets, based on some data criteria
- in the Java process, 4 threads run in parallel
- each thread reads a subset of records, processes them, and writes the processed records to a new file
- finally, it merges all these files into a single file
Still, it takes around half an hour to complete the whole process.
So I would like to know whether Apache Spark could make this process faster: read millions of records from the Oracle DB, process them, and write them to a file?
If Spark can make this process faster, what is the best approach to implementing it in my process? Also, will it be effective in a non-clustered environment too?
Appreciate the help.
Yeah, you can do that using Spark; it's built for distributed processing! http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
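As a sketch of what that could look like in Java (the connection details, table name, and partitioning bounds below are hypothetical; the Oracle JDBC driver must be on the classpath, and a partitioned JDBC read lets several executors pull rows in parallel):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OracleExport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("oracle-export")
                .master("local[*]") // non-clustered run; drop this for a real cluster
                .getOrCreate();

        Dataset<Row> records = spark.read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
                .option("dbtable", "MY_SCHEMA.MY_TABLE")
                .option("user", "scott")
                .option("password", "tiger")
                // Split the read into parallel partitions by a numeric column.
                .option("partitionColumn", "ID")
                .option("lowerBound", "1")
                .option("upperBound", "5000000")
                .option("numPartitions", "8")
                .load();

        // ... apply your processing here, e.g. records.filter(...) ...

        // coalesce(1) mirrors the "merge into a single file" step.
        records.coalesce(1).write().csv("/tmp/processed-records");
    }
}
```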
You should use a well-configured Spark cluster to achieve this. Performance is something you need to fine-tune by adding more worker nodes as required.

Spark Thrift Server for exposing big files?

We have set up a Thrift Server with Spark 2.0 in Mesos client mode.
When trying to query one 170 MB Parquet file (select * from the table), it always fails with a Java out-of-memory exception (Java heap space),
even though there are a couple of executors/workers and the executors' tasks complete successfully (as read from the Spark UI).
The query finally completes successfully when the JVM memory is increased to 25 GB and the Spark driver memory gets 21 GB! It seems the bottleneck is the driver memory itself.
Kryo serialization is used (spark.kryoserializer.buffer.max=1024m); the files are stored in an S3 bucket; YARN is not used.
--Why does the driver consume that much memory for such a simple query?
--What other parameters/configuration can help to support large data sets and concurrent JDBC connections?
Thanks.
Q1: Parquet files are compressed; when loaded into memory, they are decompressed. What's more, Java objects, including strings, have their own overhead, and if you have lots of small strings, the cost can be considerable.
Q2: Not sure about Spark 2.0, but for some previous versions you could use the incremental collect option to get results batch by batch.
As @paul said, don't trust the file size.
Parquet is a columnar storage file format, so retrieving data with "*" is really not a good idea, but it is good for group-by queries.
The driver's role is to manage the worker executors and then give you the query result at the end, so all your data will be collected on the driver.
Try limiting your query and specifying some fields rather than *.
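A hypothetical illustration of the difference (assuming an existing SparkSession named spark and a registered table; the table and column names are made up):

```java
// Pulls every column of every row onto the driver - avoid this.
Dataset<Row> all = spark.sql("SELECT * FROM my_parquet_table");

// Reads only two columns and caps the rows the driver must collect.
Dataset<Row> slim = spark.sql("SELECT id, name FROM my_parquet_table LIMIT 10000");
```

Because Parquet is columnar, projecting only the needed fields means the other columns are never even read from storage.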

Pull files concurrently using single SFTP connection in Java - Improving SFTP performance

I need to pull files concurrently from a remote server using a single SFTP connection in Java code.
I've already found a few links about pulling the files one by one over a single connection.
Like:
Use sftpChannel.ls("path to dir"), which returns the list of files in the given path as a Vector; you then iterate over the Vector and download each file with sftpChannel.get().
But I want to pull multiple files concurrently, e.g. 2 files at a time, over a single connection.
Thank You!
The ChannelSftp.get method returns an InputStream.
So you can call get multiple times, acquiring a stream for each download, and then keep polling the streams until all of them reach end-of-file.
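A rough sketch of that interleaved-streams idea with JSch (host, credentials, and file names are placeholders; error handling is omitted for brevity):

```java
import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class ParallelSftpDownload {
    public static void main(String[] args) throws Exception {
        Session session = new JSch().getSession("user", "example.com", 22);
        session.setPassword("password");
        session.setConfig("StrictHostKeyChecking", "no"); // demo only
        session.connect();

        ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
        sftp.connect();

        // Open one stream per remote file over the same connection.
        InputStream in1 = sftp.get("/remote/file1.dat");
        InputStream in2 = sftp.get("/remote/file2.dat");

        try (OutputStream out1 = new FileOutputStream("file1.dat");
             OutputStream out2 = new FileOutputStream("file2.dat")) {
            byte[] buf = new byte[32 * 1024];
            boolean done1 = false, done2 = false;
            // Keep polling both streams until each reaches end-of-file.
            while (!done1 || !done2) {
                int n;
                if (!done1) {
                    if ((n = in1.read(buf)) < 0) done1 = true;
                    else out1.write(buf, 0, n);
                }
                if (!done2) {
                    if ((n = in2.read(buf)) < 0) done2 = true;
                    else out2.write(buf, 0, n);
                }
            }
        }

        sftp.disconnect();
        session.disconnect();
    }
}
```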
Though I do not see what advantage this gives you over a sequential download.
If you want to improve performance, you first need to know what the bottleneck is.
The typical bottlenecks are:
- Network speed: if you are already saturating the network, you cannot improve anything.
- Network latency: if latency is the bottleneck, increasing the size of the SFTP request queue may help. Use the ChannelSftp.setBulkRequests method (the default is 16, so use a higher number); see the sketch after this list.
- CPU: if the CPU is the bottleneck, you either have to improve the efficiency of the encryption implementation or spread the load across CPU cores. Spreading the encryption load of a single session/connection is tricky and would have to be supported by the low-level SSH implementation; I do not think JSch or any other implementation supports that.
- Disk: if a disk drive (local or remote) is the bottleneck (unlikely), the parallel transfers as shown above may help, even when using a single connection, if the parallel transfers each use a different disk drive.
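A small illustration of the request-queue tweak (assuming an already-connected ChannelSftp as in the sketch above; the file names are placeholders):

```java
sftp.setBulkRequests(64); // JSch default is 16; raise it on high-latency links
sftp.get("/remote/large-file.dat", "large-file.dat");
```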
For more in-depth information, see my answers to:
Why is FileZilla SFTP file transfer max capped at 1.3MiB/sec instead of saturating available bandwidth? rsync and WinSCP are even slower
Why is FileZilla so much faster than PSFTP?

Spring batch - using in-memory database for huge file processing

I am using Spring Batch to process huge data (150 GB) and produce a 60 GB output file. I am using a vertical scaling approach with 15 threads (step partitioning approach).
The job execution details are stored in an in-memory database. The CPU utilization is high because it is running on a single machine and the file size is huge. But the server has a good configuration, such as a 32-core processor, and I am using 10 GB of memory for this process.
My question is: if I move this to a separate database, will it reduce the CPU utilization? Also, is using an in-memory database for production a bad choice/decision?
Regards,
Shankar
When you are talking about moving from the in-memory DB to a separate DB, you are just talking about the batch runtime tables (job_instance, job_execution, step_execution, ...), right?
If so, I wouldn't expect the CPU usage to drop a lot. Depending on your chunk size, a lot more CPU will be needed for your data processing than for updating the batch runtime tables.
Whether using an in-memory DB for production is a good decision or not depends on your needs. Two points to consider:
You lose any information which was written into the batch runtime tables. This can be helpful for debug sessions or simply for having a kind of history, but you can also "persist" such information in logfiles.
You will not be able to implement a restartable job. This can be an issue if your job takes hours to complete, but for a job that only reads from a file, writes to a file, and completes within a couple of minutes, it is not really a problem.
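If you do decide to move the batch runtime tables to a real database, here is a minimal sketch, assuming @EnableBatchProcessing (which builds its JobRepository on whatever DataSource bean the context provides); the connection details are hypothetical, and the Spring Batch metadata schema must already exist in that database:

```java
import javax.sql.DataSource;
import org.apache.commons.dbcp2.BasicDataSource;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class BatchConfig {
    @Bean
    public DataSource dataSource() {
        // Spring Batch will store job_instance, job_execution,
        // step_execution, ... in this database instead of in memory.
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("org.postgresql.Driver");
        ds.setUrl("jdbc:postgresql://dbhost:5432/batchmeta");
        ds.setUsername("batch");
        ds.setPassword("secret");
        return ds;
    }
}
```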

Fastest way to store data from sensors in java

I am currently writing a Java application that receives data from various sensors. How often this happens varies, but I believe my application will receive signals about 100k times per day. I would like to log the data received from a sensor every time the application receives a signal. Because the application does much more than just log sensor data, performance is an issue. I am looking for the best and fastest way to log the data. Thus, I might not use a database, but rather write to a file and keep one file per day.
So which is faster: using a database or logging to files? No doubt there are also a lot of options for which logging software to use. Which is best for my purpose if logging to a file is the best option?
The data stored might be used later for analytical purposes, so please keep this in mind as well.
I would recommend, first of all, that you use log4j (or any other logging framework).
You can use a JDBC appender that writes to the DB, or any kind of file appender that writes to a file. The point is that your code will be generic enough to be changed later if you like...
In general, files are much faster than DB access, but there is room for optimization here.
If performance is critical, you can use batching/asynchronous calls to the logging infrastructure.
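A minimal sketch of that approach, assuming Log4j 2 (the onSignal handler and field names are made up for illustration; the actual appender, whether file, JDBC, or async, is chosen in the external configuration, so the destination can be swapped without code changes):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class SensorLogger {
    private static final Logger LOG = LogManager.getLogger(SensorLogger.class);

    // Hypothetical handler invoked on each sensor signal.
    public void onSignal(String sensorId, double value, long timestampMillis) {
        // Parameterized messages avoid string formatting work
        // when the log level is disabled.
        LOG.info("sensor={} value={} ts={}", sensorId, value, timestampMillis);
    }
}
```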
A free database on a cheap PC should be able to record 10 records per second easily.
A tuned database on a good system, or a logger on a cheap PC, should be able to write 100 records/lines per second easily.
A tuned logger should be able to write 1,000 lines per second easily.
A fast binary logger can write 1 million records per second easily (depending on the size of the record).
Your requirement is about 1.2 records per second on average (100,000 signals / 86,400 seconds per day), which you should be able to achieve any way you like. I assume you will want to query your data, so you will want it in a database eventually, and that is where I would put it.
Ah, the world of embedded systems. I had a similar problem when working with a hovercraft. I solved it with a separate computer (you can do this with a separate program) on the local area network that would just sit and listen as a server for the logs I sent to it. The file-writing program was written in C++. This should solve two of your problems: first, the obvious performance gain while writing the logs; and second, the Java program is freed from writing any logs at all (it only acts as a proxy) and can concentrate on performance-critical tasks. Using a DB for this is going to be overkill, unless you're using SQLite.
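A toy sketch of that separate-receiver idea, in Java rather than C++ (the port and file name are arbitrary): the receiver does all the disk I/O, so the sending application only pays for a socket write.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class LogReceiver {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(9999);
             Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream()));
             PrintWriter file = new PrintWriter("sensors.log")) {
            String line;
            // Every received line goes straight to disk on this side.
            while ((line = in.readLine()) != null) {
                file.println(line);
            }
        }
    }
}
```

The main application then just opens a Socket to this receiver and writes one line per signal.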
Good luck!