Efficient way to query PostgreSQL every 30 seconds? - java

I need to fetch data (how many records are waiting to be processed) from certain PostgreSQL tables in AWS for reporting. The result of the query is posted to a log, picked up by FluentD daemons, and pushed to Elasticsearch/Kibana. The straightforward way to do this is to write a small Spring Boot app that pings the DB every 30 seconds or so. This feels inefficient and costly to me. Is there a better way to do this?
Appreciate your help.
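For illustration, the polling approach described above would look roughly like the sketch below; the table, column, and log format are placeholders, and it assumes a configured DataSource plus @EnableScheduling on the application class:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Minimal polling reporter: counts waiting records and writes one log line
// every 30 seconds for FluentD to pick up.
@Component
public class PendingRecordsReporter {

    private static final Logger log = LoggerFactory.getLogger(PendingRecordsReporter.class);
    private final JdbcTemplate jdbcTemplate;

    public PendingRecordsReporter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Scheduled(fixedDelay = 30_000)
    public void reportPendingRecords() {
        // Hypothetical table/column; replace with the real reporting query.
        Long waiting = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM pending_records WHERE status = 'WAITING'", Long.class);
        log.info("pending_records_waiting={}", waiting);
    }
}
```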

Instead of querying the DB periodically, use Change Data Capture (CDC) to produce a stream of change events. Using stream processing, write the result to an Elasticsearch index. If you're not concerned about vendor lock-in, you can use AWS DMS, Kinesis and Lambda to do that. Otherwise, you can use a suitable Kafka connector to read the changes and post the events to Kafka, then push the data to Elasticsearch using Kafka Streams.
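A rough sketch of what the Kafka Streams leg could look like, assuming a CDC connector already writes change events to a hypothetical pending-records-changes topic; this simply counts events per key and writes to an output topic that an Elasticsearch sink connector would then index (a real aggregation would need to distinguish newly inserted from processed records):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class PendingRecordsStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pending-records-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Hypothetical CDC topic keyed by table name; the value is the change event.
        KStream<String, String> changes = builder.stream(
                "pending-records-changes", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, Long> counts = changes.groupByKey().count();
        // Output topic for an Elasticsearch sink connector to pick up.
        counts.toStream().to("pending-records-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```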

Related

Reading huge file and writing in RDBMS

I have a huge text file that is continuously being appended to from a common place. I need to read it line by line from my Java application and write the data to a SQL RDBMS, such that if the Java application crashes it should resume from where it left off rather than start from the beginning.
It's a plain text file. Each row contains:
<Datatimestamp> <service name> <paymentType> <success/failure> <session ID>
The data retrieved from the database should also be available in real time, without performance or availability issues in the web application.
Here is my approach:
Deploy the application on two boxes, each with a heartbeat that pings the other system for service availability.
When you get a successful heartbeat response, you also get the timestamp of the last successfully read record.
When the next heartbeat response fails, the application on the other system can take over, based on:
1. the failed response
2. the last successful timestamp
Also, since data retrieval needs to be very close to real time and the data is huge, can I crawl the database and put the data into Solr or Elasticsearch for faster retrieval, instead of making database calls?
There are various ways to do this; what is the best way?
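To make the "resume from where it left off" requirement concrete, here is a minimal single-node sketch that tracks the byte offset of the last fully processed line in a small side file; the file names are placeholders and the heartbeat/failover logic is left out:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResumableFileReader {

    public static void main(String[] args) throws IOException {
        // Side file holding the byte position of the last fully processed line.
        Path offsetFile = Paths.get("reader.offset");
        long offset = Files.exists(offsetFile)
                ? Long.parseLong(Files.readAllLines(offsetFile).get(0).trim())
                : 0L;

        try (RandomAccessFile file = new RandomAccessFile("payments.log", "r")) {
            file.seek(offset);
            String line;
            while ((line = file.readLine()) != null) {
                // Row format: <timestamp> <service> <paymentType> <success/failure> <sessionId>
                process(line);
                // Persist the offset only after the line is handled, so a crash
                // replays at most the line that was in flight.
                Files.write(offsetFile,
                        Long.toString(file.getFilePointer()).getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    private static void process(String line) {
        // Placeholder: insert into the RDBMS or publish to a queue here.
        System.out.println(line);
    }
}
```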
I would put a messaging system (for example RabbitMQ) between the text file and the DB-writing applications. In this case the messaging system functions as a queue: one application constantly reads the file and publishes the rows as messages to the broker, and on the other side multiple "DB-writing applications" can read from the queue and write to the DB (see the sketches after this answer).
The advantage of a messaging system is its support for multiple clients reading from the queue. The messaging system takes care of synchronizing the clients, dealing with errors, dead letters, etc., and the clients don't need to care about which payloads were processed by other instances.
Regarding maintaining multiple instances of the "DB-writing applications": I would go for ready-made cluster solutions, perhaps a Docker cluster managed by Kubernetes.
Another viable alternative is a streaming platform like Apache Kafka.
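A minimal sketch of the file-reading side with RabbitMQ, publishing each row to a durable queue; host, queue name and file path are placeholders, and tailing the continuously growing file (plus the offset tracking sketched above) is omitted:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

import java.io.BufferedReader;
import java.io.FileReader;
import java.nio.charset.StandardCharsets;

public class FileToQueuePublisher {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker host

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel();
             BufferedReader reader = new BufferedReader(new FileReader("payments.log"))) {

            // Durable queue so messages survive a broker restart.
            channel.queueDeclare("payment-rows", true, false, false, null);

            String line;
            while ((line = reader.readLine()) != null) {
                channel.basicPublish("", "payment-rows",
                        MessageProperties.PERSISTENT_TEXT_PLAIN,
                        line.getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```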
You can use software like Filebeat to read the file and direct its output to RabbitMQ or Kafka. From there a Java program can subscribe to / consume the data and write it into an RDBMS.
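A matching sketch of one consumer ("DB-writing application") reading from the same hypothetical RabbitMQ queue and inserting rows over JDBC; the connection strings, table and parsing are placeholders, and manual acks ensure a message only leaves the queue after the insert succeeded:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

import java.nio.charset.StandardCharsets;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class QueueToDbWriter {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker host

        Connection mq = factory.newConnection();
        Channel channel = mq.createChannel();
        channel.queueDeclare("payment-rows", true, false, false, null);
        channel.basicQos(50); // cap unacknowledged messages per consumer

        // Placeholder JDBC URL and credentials.
        java.sql.Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/payments", "user", "password");

        DeliverCallback onMessage = (consumerTag, delivery) -> {
            String line = new String(delivery.getBody(), StandardCharsets.UTF_8);
            // Row format: <timestamp> <service> <paymentType> <success/failure> <sessionId>
            String[] fields = line.split("\\s+");
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT INTO payment_events (ts, service, payment_type, status, session_id) "
                            + "VALUES (?, ?, ?, ?, ?)")) {
                for (int i = 0; i < 5; i++) {
                    ps.setString(i + 1, fields[i]);
                }
                ps.executeUpdate();
                // Ack only after the insert succeeded; otherwise the message is redelivered.
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            } catch (Exception e) {
                // Reject without requeue so a bad row doesn't loop forever
                // (route it to a dead-letter queue in a real setup).
                channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, false);
            }
        };

        channel.basicConsume("payment-rows", false, onMessage, consumerTag -> { });
    }
}
```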

Broker disk usage after topic deletion

I'm using Apache Kafka. I dump huge databases into Kafka, where each database table is a topic.
I cannot delete a topic before it is completely consumed. I cannot set a time-based retention policy because I don't know when a topic will be consumed. I have limited disk and too much data, so I have to write code that orchestrates consumption and deletion programmatically. I understand that the problem appears because we're using Kafka for batch processing, but I can't change the technology stack.
What is the correct way to delete a consumed topic from the brokers?
Currently, I'm calling kafka.admin.AdminUtils#deleteTopic, but I can't find clear documentation for it. The method signature doesn't take the Kafka broker URLs. Does that mean I'm only deleting the topic's metadata and the brokers' disk usage isn't reduced? And when does the actual append-log file deletion happen?
Instead of using a time-based retention policy, are you able to use a size-based policy? log.retention.bytes is a per-partition setting that might help you out here.
I'm not sure how you'd want to determine that a topic is fully consumed, but calling deleteTopic against the topic initially marks it for deletion. As soon as there are no consumers/producers connected to the cluster and accessing those topics, and if delete.topic.enable is set to true in your server.properties file, the controller will then delete the topic from the cluster as soon as it is able to do so. This includes purging the data from disk. It can take anywhere between a few seconds and several minutes to do this.
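For reference, the old ZooKeeper-based kafka.admin.AdminUtils API has been superseded by the broker-facing AdminClient, which takes the bootstrap servers explicitly; its delete also triggers removal of the log segments, subject to delete.topic.enable as described above. A minimal sketch, with the broker address and topic name as placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.Collections;
import java.util.Properties;

public class TopicCleaner {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Marks the topic for deletion; the controller removes the metadata and
            // purges the log segments from disk once it is able to do so.
            admin.deleteTopics(Collections.singletonList("consumed-table-topic"))
                 .all()
                 .get(); // wait for the request to complete
        }
    }
}
```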

Spark Streaming Write Ahead Logs with Custom Receiver

I have a Spark Streaming application using a Custom Receiver and I want it to be fully fault-tolerant. To do so, I have enabled Write Ahead Logs (WAL) in the configuration file when running spark-submit and have checkpointing set up (using getOrCreate).
A tutorial I saw online says that for the WAL to recover buffered data properly with a custom receiver, I need to make sure the receiver is reliable and that data is acknowledged only after it has been saved to the WAL directory. The reference on the Spark website also talks about acknowledging data from the source:
https://spark.apache.org/docs/1.6.1/streaming-custom-receivers.html
However, there is no example code showing how to set up that order:
1. First save the data to the WAL (by calling store())
2. Acknowledge the data (??)
Any idea how I can do it?
Currently, in my Spark UI, I see that the application resumes with multiple batches having "0 events".
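For what it's worth, the usual shape of a reliable receiver is: pull a batch from the source, call one of the blocking store(multiple records) variants (which return only after Spark has stored the records, and written them to the WAL when spark.streaming.receiver.writeAheadLog.enable is set), and only then acknowledge the batch back to the source. A hedged Java sketch, where the source and its acknowledge call are entirely hypothetical:

```java
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

import java.util.Collections;
import java.util.List;

// Sketch of a reliable receiver: records are acknowledged to the (hypothetical)
// source only after the blocking store(...) call has returned.
public class ReliableCustomReceiver extends Receiver<String> {

    public ReliableCustomReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        new Thread(this::receive, "reliable-receiver").start();
    }

    @Override
    public void onStop() {
        // The receive loop checks isStopped(); nothing else to clean up here.
    }

    private void receive() {
        MySource source = new MySource(); // stand-in for the real upstream system
        while (!isStopped()) {
            List<String> batch = source.pollBatch();   // hypothetical
            if (batch.isEmpty()) {
                continue;
            }
            // 1. Blocking call: returns only after the records have been stored
            //    (and written to the WAL when it is enabled).
            store(batch.iterator());
            // 2. Only now tell the source the records are safe on the Spark side.
            source.acknowledge(batch);                 // hypothetical
        }
    }

    // Placeholder for the real source with an acknowledgement mechanism.
    private static class MySource {
        List<String> pollBatch() { return Collections.emptyList(); }
        void acknowledge(List<String> batch) { /* no-op in this sketch */ }
    }
}
```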

How to architect in-memory processing for real time trading order status in Java

We have an application that prepares trade orders and sends them to the broker (Interactive Brokers) through their API.
The application uses MySQL as a data store and ActiveMQ for processing trade order status messages from the Interactive Brokers API.
We store all order-related data, notifications, order positions and order sequence data in the database, which is taking considerable time due to database query latency.
We have a performance issue submitting and processing more than 200 orders; it takes about 3-4 minutes.
We profiled the application (JProfiler) and found that database calls contribute most of the overall order execution time.
As it is a trading application, 3-4 minutes is a huge amount of time, since trade prices change very quickly in the market.
Please suggest some in-memory frameworks or databases where we could process everything on the fly and sync with the actual database at a later point (preferably with a MySQL connector).
Also, Interactive Brokers sends order status messages as callback events. Please let us know of any Complex Event Processing (CEP) frameworks in Java that fit well with such in-memory frameworks.
Any help with how to architect this or go about implementing such frameworks is appreciated.
Thank you.
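Only to illustrate the "process in memory, sync with MySQL later" shape being asked about, here is a minimal hand-rolled write-behind sketch (a concurrent map plus a background flusher); every name is a placeholder, and a real deployment would more likely use an in-memory data grid or cache with a MySQL write-behind store:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class InMemoryOrderStore {

    // Hypothetical order snapshot kept entirely in memory.
    public static final class Order {
        final String id;
        final String status;
        Order(String id, String status) { this.id = id; this.status = status; }
    }

    private final Map<String, Order> orders = new ConcurrentHashMap<>();
    private final Map<String, Order> dirty = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public InMemoryOrderStore() {
        // Sync changed entries to MySQL in the background, so the order-processing
        // and IB callback paths never wait on database latency.
        flusher.scheduleWithFixedDelay(this::flushToDatabase, 1, 1, TimeUnit.SECONDS);
    }

    // Called from the order/callback path: memory only, no DB round trip.
    public void upsert(Order order) {
        orders.put(order.id, order);
        dirty.put(order.id, order);
    }

    public Order get(String orderId) {
        return orders.get(orderId);
    }

    private void flushToDatabase() {
        for (Map.Entry<String, Order> entry : dirty.entrySet()) {
            persist(entry.getValue());
            // Remove only if no newer version arrived while we were flushing.
            dirty.remove(entry.getKey(), entry.getValue());
        }
    }

    private void persist(Order order) {
        // Placeholder: batch INSERT ... ON DUPLICATE KEY UPDATE via JDBC goes here.
    }
}
```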

Easiest point-multipoint data distribution library/framework in Java

I have a table in a central database that gets constantly appended to. I want to batch these appends every couple of minutes and have them sent to a bunch of "slave" servers. These servers will, in turn, process that data and then discard it (think distributed warehousing). Each "slave" server needs a different subset of the data and there is a natural partitioning key I can use for that.
Basically, I need this to be eventually consistent: every batch of data must eventually be delivered to every "slave" server (reliability), even if the "slave" was down at the moment the batch was ready to be delivered (durability). I don't care about the order in which the batches are delivered.
Possible solutions I have considered:
MySQL replication does not fit my requirement because I would have to replicate the whole table on each server.
ETL & ESB products are too bloated for this; I am not doing any data processing.
Plain JMS: I could use it, but I'm looking for something even simpler.
JGroups is interesting, but members that have left the group will not get the missed messages once they rejoin.
Pushing files & ack files across servers: doable, but I don't know of any framework for it, so I would need to write my own.
Note: This question is about how to move the data from the central server to the N others with reliability & durability; not how to create or ingest it.
(Edited on Aug 24 to add durability requirement)
You may use JGroups for this. It's a toolkit for reliable multicast communication.
I ended up finding "Spring Integration", which includes adapters to poll directories via SFTP, for example.
http://www.springsource.org/spring-integration
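For reference, a rough sketch of how the Spring Integration side on a "slave" server could look, using the SFTP inbound adapter from the Java DSL; the host, credentials, directories and poll interval are all placeholders, and archiving/error handling is omitted:

```java
import com.jcraft.jsch.ChannelSftp;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.integration.file.remote.session.CachingSessionFactory;
import org.springframework.integration.file.remote.session.SessionFactory;
import org.springframework.integration.sftp.dsl.Sftp;
import org.springframework.integration.sftp.session.DefaultSftpSessionFactory;

import java.io.File;

@Configuration
public class BatchPullConfig {

    @Bean
    public SessionFactory<ChannelSftp.LsEntry> sftpSessionFactory() {
        DefaultSftpSessionFactory factory = new DefaultSftpSessionFactory();
        factory.setHost("central-server"); // placeholder
        factory.setUser("slave-user");     // placeholder
        factory.setPassword("secret");     // placeholder
        factory.setAllowUnknownKeys(true);
        return new CachingSessionFactory<>(factory);
    }

    @Bean
    public IntegrationFlow batchPullFlow() {
        return IntegrationFlows
                .from(Sftp.inboundAdapter(sftpSessionFactory())
                                .remoteDirectory("/batches/slave-1") // partitioned per slave
                                .localDirectory(new File("incoming-batches"))
                                .autoCreateLocalDirectory(true)
                                .deleteRemoteFiles(true),
                        e -> e.poller(Pollers.fixedDelay(120_000))) // every 2 minutes
                .handle(message -> {
                    File batch = (File) message.getPayload();
                    // Process the batch file here, then discard it.
                })
                .get();
    }
}
```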
