We have set up a Thrift Server with Spark 2.0 in Mesos client mode.
When we try to query a single 170 MB Parquet file (select * from the table), it always fails with a Java out-of-memory exception (Java heap space).
This happens even though there are a couple of executors/workers and the executors' tasks complete successfully (according to the Spark UI).
The query finally completes successfully once the JVM memory is increased to 25 GB and the Spark driver memory to 21 GB! The bottleneck seems to be the driver memory itself.
Kryo serialization is used (spark.kryoserializer.buffer.max=1024m), the files are stored in an S3 bucket, and YARN is not used.
--Why does the driver consume that much memory for such a simple query?
--What other parameters/configuration can help support large data sets and concurrent JDBC connections?
Thanks.
Q1: Parquet files are compressed; when they are loaded into memory, the data is decompressed. What's more, every Java object, including a string, has its own overhead, and if you have lots of small strings the cost can be considerable.
Q2: Not sure about Spark 2.0, but in some previous versions you could use the incremental collect option to fetch results batch by batch.
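Some Spark builds expose this as the spark.sql.thriftServer.incrementalCollect property; whether it is available and what it is called depends on the version, so check the documentation for yours. A minimal sketch of passing it when starting the Thrift Server:

./sbin/start-thriftserver.sh --conf spark.sql.thriftServer.incrementalCollect=true

With incremental collect enabled, the server fetches result partitions one at a time instead of collecting the whole result set on the driver, trading some throughput for a much smaller driver memory footprint.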
As #paul said, don't trust the file size.
Parquet is a columnar storage file format, so retrieving data with "*" is really not a good idea, but it is good for group-by queries.
The driver's role is to manage the worker executors and then give you the query result at the end, so all of your data will be collected on the driver.
Try limiting your query and selecting specific fields rather than *.
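For example (the column names are hypothetical), a projection plus a limit keeps the result set collected on the driver small:

SELECT order_id, amount, created_at
FROM my_table
LIMIT 1000;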
Related
I am new to Apache Spark.
I have a requirement to read millions (~5 million) of records from an Oracle database, do some processing on these records, and write the processed records to a file.
At present, this is done in Java, and in this process:
- the records in the DB are categorized into different subsets, based on some data criteria
- in the Java process, 4 threads run in parallel
- each thread reads a subset of records, processes them, and writes the processed records to a new file
- finally, all of these files are merged into a single file.
Still, it takes around half an hour to complete the whole process.
So I would like to know whether Apache Spark could make this process faster: read millions of records from the Oracle DB, process them, and write them to a file.
If Spark can make this process faster, what is the best approach to implement it? Will it also be effective in a non-clustered environment?
Appreciate the help.
Yes, you can do that using Spark; it's built for distributed processing! See http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
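A minimal sketch in Java, assuming Spark 2.x with the Oracle JDBC driver on the classpath; the connection details, table, and partition column are hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OracleExtract {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("oracle-extract")
                .master("local[4]")   // non-clustered: just use the local cores
                .getOrCreate();

        // Read the table in parallel: Spark issues one query per partition,
        // using ranges of the numeric partition column.
        Dataset<Row> records = spark.read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")  // hypothetical
                .option("dbtable", "APP.ORDERS")                           // hypothetical
                .option("user", "app_user")
                .option("password", "secret")
                .option("partitionColumn", "ORDER_ID")
                .option("lowerBound", "1")
                .option("upperBound", "5000000")
                .option("numPartitions", "4")
                .load();

        // Do the processing with DataFrame operations, then write the result.
        records.filter("STATUS = 'ACTIVE'")   // hypothetical processing step
               .coalesce(1)                   // single output file, like the current process
               .write()
               .option("header", "true")
               .csv("/data/out/orders");

        spark.stop();
    }
}

In a non-clustered environment this still parallelizes across the local cores, so the gain over the existing 4-thread Java process may be modest; the bigger wins come from adding worker nodes.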
You should use a well-configured Spark cluster to achieve this. Performance is something you fine-tune by adding more worker nodes as required.
This is what I have been trying to achieve.
We are in the process of retiring a vendor tool called GO-Anywhere that reads data from a DB2 database by firing a select query, creates a file, writes the data to it, zips it, and SFTPs it to a machine where our ETL tool can read it.
I have been able to achieve what GO-Anywhere does in almost the same time, in fact beating the above tool by 5 minutes on a 6.5 GB file, by using JSch and jarring/un-jarring on the fly. This brings the time to read and write the file down from 32 minutes to 27 minutes.
But to meet the new SLA requirements we need to bring the time down further, to almost half of what I have now, i.e. around 13 minutes.
To achieve this, I have been able to read the .MBR file directly and push it to the Linux machine in 13 minutes or less, but the format of this file is not clear text.
I would like to know how one can convert the .MBR file into plain-text format using Java or an AS400 command, without firing the SQL.
Any help appreciated.
You're under the mistaken impression that a "FILE" on the IBM i is like a file on Windows/Unix/Linux.
It's not.
Like every other object type in IBM i, it's an object with well defined interfaces.
In the particular case of a *FILE object, it's a database table. DB2 for i isn't an add-on DBMS installed on top of the OS; DB2 for i is simply the name they gave to the DBMS integrated into the OS. A user program can't simply open the storage space directly the way you can with files on Windows/Unix/Linux; you have to go through the interfaces provided by the OS.
There are two interfaces available, Record Level Access (RLA) or SQL. Both can be used from a Java application. RLA is provided by the com.ibm.as400.access.AS400File class. SQL access is provided by the JDBC classes.
SQL is likely to provide the best performance, since you're dealing with a set of records instead of one record at a time as with RLA.
Take a look at the various performance-related JDBC properties available.
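A minimal sketch of the JDBC route using the jt400 driver; the host, credentials, and table are hypothetical, and the exact property names and values should be checked against the jt400 documentation:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class Db2iExtract {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "etluser");       // hypothetical credentials
        props.put("password", "secret");
        props.put("block size", "512");     // larger record blocking for sequential reads
        props.put("prefetch", "true");      // let the driver prefetch result data

        // Register the jt400 driver (not needed on JDBC 4+ if the jar is on the classpath).
        Class.forName("com.ibm.as400.access.AS400JDBCDriver");

        try (Connection conn = DriverManager.getConnection("jdbc:as400://myibmi", props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM MYLIB.ORDERS")) {  // hypothetical table
            while (rs.next()) {
                // Write each row to the output file; numeric columns arrive already
                // converted, so there is no manual unpacking of packed decimal.
            }
        }
    }
}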
From a performance standpoint, it's unlikely that your single process fully utilizes the system, i.e., CPU usage won't be at 100%, nor will disk activity be upward of 60-80%.
That being the case, your best bet is to break the process into multiple ones. You'll need some way to limit each process to a selected set of records, possibly by segregating on the primary key. That will add some overhead unless the records are in primary-key order. If the table doesn't have deleted records, using RRN() to segregate by physical order may work, but be warned: on older versions of the OS, the use of RRN() required a full table scan.
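For example (the library, table, and key names are hypothetical), each parallel extract job could take one range:

SELECT * FROM MYLIB.ORDERS WHERE ORDER_ID BETWEEN 1 AND 1000000;
SELECT * FROM MYLIB.ORDERS WHERE ORDER_ID BETWEEN 1000001 AND 2000000;
-- or, if the table has no deleted records, by physical order:
SELECT * FROM MYLIB.ORDERS AS T WHERE RRN(T) BETWEEN 1 AND 1000000;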
My guess at what is happening: there are packed-decimal fields in the source table which aren't getting unpacked by your home-grown method of reading the table.
There are several possibilities.
Have the IBM i team create a view over the source table in which all of the numeric columns are zoned decimal. Additionally, omit columns that the ETL doesn't need; that reduces the I/O by not having to move those bytes around. Perform the extract over that view. Note: such a view may already exist on the system.
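A sketch of such a view (the names and precisions are hypothetical); on DB2 for i, NUMERIC is zoned decimal while DECIMAL is packed:

CREATE VIEW MYLIB.ORDERS_ETL AS
  SELECT ORDER_ID,
         CAST(AMT_PACKED AS NUMERIC(11, 2)) AS AMT
  FROM MYLIB.ORDERS;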
Have the IBM i team build appropriate indexes. Often, SQL bottlenecks can be alleviated with proper indexes.
Don't ZIP and UNZIP; send the raw file to the other system. Even at 6GB, gigabit Ethernet can easily deal with that.
Load an ODBC driver on the ETL system and have it directly read the source table (or the appropriate view) rather than send a copy to the ETL system.
Where did the SLA time limit come from? If the SLA said 'subsecond response time' what would you do? At some point, the SLA needs to reflect some version of reality as defined by the laws of physics. I'm not saying that you've reached that limit: I'm saying that you need to find the rationale for it.
Have the IBM i team make sure they are current on patches (PTFs). IBM often addresses performance issues via PTFs.
Have the IBM i team make sure that the subsystem where your jobs are running has enough memory.
I am using Spring Batch to process a huge amount of data (150 GB) and produce a 60 GB output file. I am using a vertical-scaling approach with 15 threads (a step-partitioning approach).
The job execution details are stored in an in-memory database. CPU utilization is high because everything runs on a single machine and the file size is huge, but the server has a good configuration (a 32-core processor), and I am using 10 GB of memory for this process.
My question is: if I move this to a separate database, will it reduce CPU utilization? Also, is using an in-memory database in production a bad choice/decision?
Regards,
Shankar
When you are talking about moving from the in-memory DB to a separate DB, you are just talking about the batch runtime tables (job_instance, job_execution, step_execution, ...), right?
If so, I wouldn't expect the CPU usage to drop much. Depending on your chunk size, far more CPU is needed for your data processing than for updating the batch runtime tables.
Whether using an in-memory DB in production is a good decision depends on your needs. Two points to consider:
You lose any information written into the batch runtime tables. That information can be helpful for debugging sessions or simply as a kind of history, but you can also "persist" such information in log files.
You will not be able to implement a restartable job. This can be an issue if your job takes hours to complete, but for a job that only reads from a file, writes to a file, and completes within a couple of minutes, it is not really a problem.
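If you do decide to persist the batch runtime tables, here is a minimal sketch with Java configuration (the database choice and connection details are hypothetical, and the Spring Batch metadata schema must already exist in that database): with @EnableBatchProcessing, providing a DataSource bean is enough to make the JobRepository JDBC-backed, which is what enables restartability.

import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

@Configuration
@EnableBatchProcessing
public class BatchRepositoryConfig {

    // The BATCH_JOB_INSTANCE, BATCH_JOB_EXECUTION, ... tables live in this database,
    // so executions survive a JVM restart and failed jobs can be restarted.
    @Bean
    public DataSource dataSource() {
        DriverManagerDataSource ds = new DriverManagerDataSource();
        ds.setDriverClassName("org.postgresql.Driver");           // hypothetical database choice
        ds.setUrl("jdbc:postgresql://batch-db:5432/batch_meta");  // hypothetical connection details
        ds.setUsername("batch");
        ds.setPassword("secret");
        return ds;
    }
}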
We are using EhCache 2.6.2. Because we need high survivability, we use only disk storage and not memory storage.
After every data update in the program, we flush the data to disk.
After a while, the cache.data file exceeded 1 GB. When the data file was 250 MB, the flush took 250 ms; at 1 GB, it takes 3.5 s.
Our objects are about 20 KB each, so there are a lot of them.
Is there a way to split the data file into a few smaller files and let EhCache handle it?
We would prefer a solution involving only configuration changes and no code changes, because it is a production environment.
Environment details:
Running WebSphere 7 with IBM Java 1.6 and EhCache 2.6.2 on AIX 6.1 (64-bit).
In Ehcache 2.6.2, all cache data is always on disk because the storage model changed, so you could benefit from a speed-up by using memory storage in addition to the disk storage.
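For example, something along these lines in ehcache.xml keeps a heap tier in front of the persistent disk store (the cache name and sizes are hypothetical; check the attribute names against the Ehcache 2.x schema you are using):

<cache name="dataCache"
       maxEntriesLocalHeap="10000"
       overflowToDisk="true"
       diskPersistent="true"/>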
What do you mean when you say:
After every data update we have in the program, we flush the data to the disk.
Regarding the performance of the disk store, there is one option that you can try:
<cache diskAccessStripes="4" ...>
...
</cache>
where the diskAccessStripes attribute takes a power-of-two value. Try it first with small values and see whether you gain anything. The exact effect of this attribute depends on many factors: hardware, operating system, and the usage patterns of your application.
Hello, I'd like to ask two questions (I am using Java and Jedis).
I want to write 2 GB of data to Redis; how can I write it faster?
Does Redis dump the data to several files, not only dump.rdb? For example, if the data is as large as 4 GB, will it be dumped to dump.rdb and dump2.rdb?
You can import data into Redis faster by using variadic commands (such as MSET) and/or pipelining (which is supported by Jedis) to aggregate round trips to the Redis instance. The fewer the round trips, the faster the import.
Another good practice is to deactivate the AOF and the background RDB dumps (if they are activated) during the import operation, and reactivate them afterwards.
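A minimal sketch with Jedis (the host, keys, and values are hypothetical); it turns persistence off for the duration of the import and pushes the writes through a pipeline:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class BulkImport {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Turn persistence off during the import; restore your normal
            // settings (the appendonly / save values from redis.conf) afterwards.
            jedis.configSet("appendonly", "no");
            jedis.configSet("save", "");

            // Send the writes through a pipeline to cut down on round trips.
            Pipeline pipe = jedis.pipelined();
            for (int i = 0; i < 1000000; i++) {
                pipe.set("key:" + i, "value-" + i);
                if (i % 10000 == 0) {
                    pipe.sync();   // flush this batch to the server
                }
            }
            pipe.sync();
        }
    }
}

On top of this, MSET (or a pipelined mset) can pack several keys into a single command, but pipelining alone already removes most of the round-trip cost.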
Redis cannot dump to several data files. But if you write 2 GB of data in Redis, there is no way the dump file can take 4 GB. The dump file is always much more compact than the data in memory. The only way to get several dump files is to start multiple Redis instances and shard the data.