We are using EhCache 2.6.2. Because we need high survivability, we use only the disk store and not the memory store.
After every data update in the program, we flush the data to disk.
After a while, the cache.data file grew past 1 GB. When the data file was 250 MB, a flush took 250 ms; at 1 GB it takes 3.5 sec.
Our objects are about 20 KB each, so there are millions of them.
Is there a way to split the data file into several smaller files and let EhCache handle it?
We would prefer a solution involving only configuration changes, not code changes, because this is a production environment.
Environment details:
Running WebSphere 7 with IBM Java 1.6 and EhCache 2.6.2 on AIX 6.1 64-bit.
In Ehcache 2.6.2 the storage model changed so that all cache data is always on disk; you could therefore get a speed-up by using a memory store in addition to the disk store.
What do you mean when you say:
After every data update in the program, we flush the data to disk.
Regarding the performance of the disk store, there is one option that you can try:
<cache diskAccessStripes="4" ...>
...
</cache>
where the diskAccessStripes attribute takes a power-of-two value. Try it first with small values and see if you gain anything. The exact effect of this attribute depends on many factors: hardware, operating system, and the usage patterns of your application.
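For completeness, here is a minimal programmatic sketch of combining a small on-heap tier with the persistent disk store, assuming the Ehcache 2.x fluent configuration API (the cache name, sizes and stripe count are placeholder values; the same attributes can be set purely in ehcache.xml, which matches your configuration-only constraint):
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.config.CacheConfiguration;

public class CacheSetupSketch {
    public static void main(String[] args) {
        // Sketch only: "dataCache", 10000 and 4 are placeholders to tune for your load.
        CacheConfiguration config = new CacheConfiguration("dataCache", 10000) // keep ~10k hot entries on heap
                .eternal(true)
                .overflowToDisk(true)   // spill everything else to the disk store
                .diskPersistent(true);  // keep the disk store across JVM restarts
        config.setDiskAccessStripes(4); // assumed setter mirroring the diskAccessStripes XML attribute
        CacheManager manager = CacheManager.getInstance();
        manager.addCache(new Cache(config));
    }
}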
Related
We have a small search app running locally. For the back end we use Apache Solr 6.6.2 for data indexing and storage; the front end is PHP behind an Apache2 web server.
These services are installed on a server with 48 cores and 96 GB of RAM. The index is expected to hold about 200 million documents, and each document can have at most 20 fields. Most fields are both indexed and stored.
We expect hundreds of thousands of simultaneous requests. What is the best Apache Solr configuration to handle this? We started Solr with 20 GB of RAM and stress tested it, but performance starts to degrade at around 100 users. Where is the problem, and what is the best way to deal with it?
We have also tested Solr in SolrCloud mode, but performance did not improve much. We expected that a memory problem would show up as an OOM exception, but nothing like that happened. We have only changed the schema according to our requirements and changed the memory via the command line; all other settings are defaults.
Here are a few references we have already consulted:
https://wiki.apache.org/solr/SolrPerformanceProblems
https://blog.cloudera.com/blog/2017/06/apache-solr-memory-tuning-for-production/
We have 200 million records in each collection and we have 200 collections. We have 5 servers, and each server has 8 cores and 64 GB of RAM.
I would suggest you spread the load across multiple servers.
Replicate the data on each server so that requests get divided across multiple servers. The more servers you have, the quicker you will be able to respond.
Note: keep the replication-factor formula 2F+1 in mind: with 5 servers you should have at least 3 replicas. I would suggest going with 5 replicas (1 replica on each server).
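For illustration, a minimal SolrJ sketch of creating such a collection (the collection name, config set name and ZooKeeper addresses are placeholders, and the calls assume the SolrJ 6.x Collections API):
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollectionSketch {
    public static void main(String[] args) throws IOException, SolrServerException {
        // Placeholder ZooKeeper ensemble; point this at your own cluster.
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
            // 5 shards across the cluster, replicationFactor 2
            // (use 5 if you want a copy of every shard on every node, as suggested above).
            CollectionAdminRequest.createCollection("products", "productsConfig", 5, 2)
                    .process(client);
        }
    }
}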
If you plan to handle hundreds of thousands of requests per second you will need more than one server, no matter how big it is, even if it's just for HA / DR purposes. So I would recommend using SolrCloud and sharding the index across multiple machines, with multiple replicas, just to start.
Beyond that, the devil is in the details:
How fast do you want queries to perform (median and 99th percentile)? This will help you size CPU and memory needs.
How complex are your queries?
Are you using filters? (Requires more heap memory)
How fast is your disk access?
Will you be adding data in real time? (This impacts your autoCommit and soft commit settings; see the sketch after this answer.)
But first and foremost you need to get away from "one big box" thinking.
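On the real-time indexing point above, here is a small sketch of one of the knobs involved, commitWithin on the client side (names, values and the collection are placeholders; the server-side counterparts live in the autoCommit / autoSoftCommit sections of solrconfig.xml):
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithCommitWithin {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            doc.addField("title", "example");

            UpdateRequest update = new UpdateRequest();
            update.add(doc);
            // Let Solr fold commits together instead of hard-committing per document;
            // frequent hard commits under heavy query load hurt latency.
            update.setCommitWithin(10_000); // milliseconds
            update.process(client, "products");
        }
    }
}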
Recently I started using Hibernate Search indexing and I am trying to find a stable solution for a production environment. The setup: on a WildFly 10 AS I use indexing through a Hibernate OGM PersistenceContext, which automatically adds data to the index (an Infinispan file cache store).
The problem is that I have an MDB consuming data from a JMS queue, and within one call of onMessage (one queue entry contains around 1 million entities, so the requests are big) I need to persist those entities and publish them to another AMQP queue via a stateless EJB.
While persisting and publishing, I noticed that after a certain amount of time a major GC can no longer reclaim memory; once the old generation fills up, eden fills up as well, and the rate of persisting and publishing messages degrades sharply.
My guess is that the onMessage call runs in a transaction, and until it finishes it keeps all the data (index entries or persisted entities) in memory to be able to roll back, so the old generation cannot be cleaned.
I am attaching some monitoring screenshots. You can clearly see that once both memory spaces (old gen and eden) are full and trying to empty, the rate of publishing messages to the other queue degrades sharply (I create the entities one by one from the list that arrives via JMS, then persist them and publish them in a for loop to a RabbitMQ queue). Is there any way to keep the index entirely on disk with Infinispan, if that is the cause? I already tried a minimum eviction value, a small chunk size, etc., without much success. I also tried changing GC algorithms but always end up in the same situation. Maybe another Infinispan persistent file store implementation? I use the single-file cache store for now and used the soft-index cache store before. Any suggestions or thoughts?
Thanks
Hibernate Search 5.6.1, Infinispan 8.2.4, Hibernate OGM 5.1, Wildfly 10
(Monitoring screenshots: VisualGC from VisualVM, VisualVM, RabbitMQ, JMS threads, Hibernate Search sync thread.)
The latest version of Infinispan (9.2) is able to store data "off heap", so the short answer is yes, it's possible. But consider the big picture before choosing to do that: not all scenarios benefit from off-heap storage, as this depends on a number of factors.
Infinispan is by definition meant to buffer the hottest data in memory, by default "on heap", as that helps overall performance when the entries are plain Java objects, since you can then skip the (de)serialization overhead. You need to tune your heap sizes to accommodate the load you are planning; it cannot do that automatically. The easiest strategy is to observe it under load with tools like the ones you used, while allowing a very generous heap size, and then trim it down to a reasonable size you know will work for your load.
So first verify that the heap is not simply too small for peak operation before suspecting a leak or unbounded growth. If there actually is a leak, you might first want to try upgrading, as those versions are quite old and a lot of issues have been fixed since.
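If you do want to try it, here is a minimal programmatic sketch of off-heap storage with the Infinispan 9.2 API as I understand it (the cache name and size limit are placeholders; the same can be expressed in the XML configuration):
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.cache.StorageType;
import org.infinispan.eviction.EvictionType;
import org.infinispan.manager.DefaultCacheManager;

public class OffHeapSketch {
    public static void main(String[] args) {
        // Placeholder sizing: keep at most ~500 MB of entries outside the Java heap.
        Configuration cfg = new ConfigurationBuilder()
                .memory()
                    .storageType(StorageType.OFF_HEAP)
                    .evictionType(EvictionType.MEMORY)
                    .size(500L * 1024 * 1024)
                .build();
        DefaultCacheManager manager = new DefaultCacheManager();
        manager.defineConfiguration("index-data", cfg);
        manager.getCache("index-data").put("key", "value");
        manager.stop();
    }
}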
We have set up one Thrift Server with Spark 2.0 in Mesos client mode.
When trying to query one 170 MB Parquet file (select * from the table), it always fails with a Java out-of-memory exception (Java heap space).
This happens even though there are a couple of executors/workers and the executors' tasks complete successfully (as seen in the Spark UI).
The query finally completes successfully when the JVM memory is increased to 25 GB and the Spark driver gets 21 GB of memory! The bottleneck seems to be the driver memory itself.
Kryo serialization is used (spark.kryoserializer.buffer.max=1024m), the files are stored in an S3 bucket, and YARN is not used.
--Why does the driver consume that much memory for such a simple query?
--What other parameters/configuration can help support larger data sets and concurrent JDBC connections?
Thanks.
Q1: Parquet files are compressed; when loaded into memory, the data is decompressed. What's more, Java objects (including strings) carry per-object overhead, and if you have lots of small strings the cost can be considerable.
Q2: Not sure about Spark 2.0, but in some previous versions you could use the incremental collect option to fetch results batch by batch.
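If your build has it, the relevant settings would look roughly like this (a sketch only: spark.sql.thriftServer.incrementalCollect exists in Spark 2.x but verify it for your exact version, and with the bundled start-thriftserver.sh these would normally be passed as --conf key=value rather than set in code):
import org.apache.spark.SparkConf;

// Sketch only: the values are placeholders to tune.
SparkConf conf = new SparkConf()
        .set("spark.sql.thriftServer.incrementalCollect", "true") // stream results to the client batch by batch instead of one huge collect on the driver
        .set("spark.driver.maxResultSize", "4g");                  // fail with a clear error instead of an OOM when a collected result is too large
// The driver heap itself must be set before the JVM starts,
// e.g. --driver-memory 8g on spark-submit / start-thriftserver.sh.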
As @paul said, don't trust the file size.
Parquet is a columnar storage file format, so retrieving data with "select *" is really not a good idea, but it is good for group-by queries.
The driver's role is to manage the worker executors and then give you the query result at the end, so all your data will be collected on the driver.
Try limiting your query and selecting specific fields rather than *.
I am using Spark Streaming in my application. Data arrives as streaming files every 15 minutes. I have allocated 10 GB of RAM to the Spark executors, and with this setting my Spark application runs fine.
But looking at the Spark UI, under the Storage tab -> "Size in Memory", the RAM usage keeps increasing over time.
When I started the streaming job, the "Size in Memory" usage was in KB. Today, 2 weeks, 2 days and 22 hours after starting the job, it has grown to 858.4 MB.
I have also noticed one more thing, under the Streaming heading:
When I started the job, Processing Time and Total Delay (from the image) were about 5 seconds; after 16 days they have increased to 19-23 seconds, while the streaming file size is almost the same.
Before I increased the executor memory to 10 GB, the Spark job kept failing roughly every 5 days (with the default executor memory of 1 GB). With 10 GB of executor memory it has now been running continuously for more than 16 days.
I am worried about memory issues. If the "Size in Memory" value keeps increasing like this, then sooner or later I will run out of RAM and the Spark job will fail again, even with 10 GB of executor memory. What can I do to avoid this? Do I need to change some configuration?
Just to give some context, I have enabled the following properties in the Spark context:
SparkConf sparkConf = new SparkConf()
    .setMaster(sparkMaster)
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    .set("spark.streaming.minRememberDuration", "1440");
I have also enabled checkpointing as follows:
sc.checkpoint(hadoop_directory)
One more thing I want to highlight: I had an issue while enabling checkpointing. I have already posted a question about it here:
Spark checkpointing error when joining static dataset with DStream
I was not able to set up checkpointing the way I wanted, so I did it differently (shown above), and it is working fine now. I am not asking the checkpointing question again; I only mention it in case the current memory issue is somehow related to the previous (checkpointing) one.
Environment details:
Spark 1.4.1 on a two-node cluster of CentOS 7 machines, with Hadoop 2.7.1.
I am worried about memory issues. If the "Size in Memory" value keeps increasing like this, then sooner or later I will run out of RAM and the Spark job will fail again, even with 10 GB of executor memory.
No, that's not how RAM works. Running out is perfectly normal, and when you run out, you take RAM that you are using for less important purposes and use it for more important purposes.
For example, if your system has free RAM, it can try to keep everything it has written to disk in RAM as well. Who knows, something might try to read that data from disk again, and having it in RAM will save an I/O operation. Since free RAM is wasted RAM (it's not like you can use 1 GB less today in order to use 1 GB more tomorrow; any RAM not used right now is a chance to avoid I/O that is lost forever), the system might as well use it for anything that might help. But that doesn't mean it can't evict those things from RAM when the memory is needed for some other purpose.
It is not at all unusual for almost all of a system's RAM to be in use and for almost all of it to also be reclaimable at the same time. This is typical behavior on most modern systems.
I am using Spring Batch to process a huge amount of data (150 GB) and produce a 60 GB output file. I am using a vertical scaling approach with 15 threads (step partitioning).
The job execution details are stored in an in-memory database. CPU utilization is high because everything runs on a single machine and the file size is huge, but the server has a good configuration (a 32-core processor), and I am using 10 GB of memory for this process.
My question is: if I move this to a separate database, will it reduce CPU utilization? Also, is using an in-memory database in production a bad choice/decision?
Regards,
Shankar
When you are talking about moving from in-memory db to a separate db, you are just talking about the batch runtime tables (job_instance, job_execution, step_execution, ...), right?
If so, I wouldn't expect CPU usage to drop a lot. Depending on your chunk size, far more CPU is needed for your data processing than for updating the batch runtime tables.
Whether using an in-memory db in production is a good decision depends on your needs. Two points to consider:
You lose all information written into the batch runtime tables. That information can be helpful for debugging sessions or simply as a kind of history, but you can also "persist" such information in log files.
You will not be able to implement a restartable job. This can be an issue if your job takes hours to complete, but for a job that only reads from a file, writes to a file, and completes within a couple of minutes, it is not really a problem.
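If you do decide to move the batch runtime tables to a separate database, the change is usually just providing a DataSource bean for @EnableBatchProcessing to pick up; here is a minimal sketch (driver, URL and credentials are placeholders, and the Spring Batch schema scripts must be applied to that database):
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

@Configuration
@EnableBatchProcessing
public class BatchRepositoryConfig {

    // With a DataSource bean present, Spring Batch stores job_instance, job_execution,
    // step_execution, ... in this database instead of an in-memory repository,
    // which also makes jobs restartable after a crash.
    @Bean
    public DataSource dataSource() {
        DriverManagerDataSource ds = new DriverManagerDataSource(); // placeholder; use a pooled DataSource in production
        ds.setDriverClassName("org.postgresql.Driver");             // placeholder driver
        ds.setUrl("jdbc:postgresql://dbhost:5432/batchdb");         // placeholder URL
        ds.setUsername("batch");
        ds.setPassword("secret");
        return ds;
    }
}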