Spark Streaming Write Ahead Logs with Custom Receiver - java

I have a Spark Streaming application using a Custom Receiver and I want it to be fully fault-tolerant. To do so, I have enabled Write Ahead Logs (WAL) in the configuration file when running spark-submit and have checkpointing set up (using getOrCreate).
A tutorial I saw online says that to make sure the WAL recovers buffered data properly with a custom receiver, I need to make sure the receiver is reliable and the data is acknowledged only after it has been saved to the WAL directory. The reference on the Spark website also talks about acknowledging data from the source:
https://spark.apache.org/docs/1.6.1/streaming-custom-receivers.html
However, there is no example code showing how to set up the ordering:
First, save the data to the WAL (by calling store())
Then, acknowledge the data (??)
Any idea how I can do it?
Currently, in my Spark UI, I see that the application resumes with multiple batches having "0 events".
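For reference, here is a minimal sketch of how I understand the ordering should work, based on the docs: a reliable receiver stores records in multi-record blocks (the store() variants that take multiple records block until Spark has reliably saved the block, including writing it to the WAL when it is enabled) and acknowledges the source only after store() returns. MySource and its poll()/ack() methods are hypothetical placeholders for the real source client:

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

import java.util.List;

public class ReliableCustomReceiver extends Receiver<String> {

    public ReliableCustomReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        new Thread(this::receive, "reliable-receiver").start();
    }

    @Override
    public void onStop() {
        // nothing to do; receive() exits once isStopped() becomes true
    }

    private void receive() {
        // MySource, poll() and ack() are hypothetical stand-ins for the actual source client
        MySource source = MySource.connect();
        while (!isStopped()) {
            List<String> batch = source.poll(100);
            if (!batch.isEmpty()) {
                // store(iterator) blocks until Spark has reliably stored the whole block
                // (written to the WAL when spark.streaming.receiver.writeAheadLog.enable=true)
                store(batch.iterator());
                // acknowledge only after store() has returned successfully
                source.ack(batch);
            }
        }
    }
}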

Related

Reading huge file and writing in RDBMS

I have a huge text file which is continuously being appended to in a common place. I need to read it line by line from my Java application and write it into a SQL RDBMS, such that if the Java application crashes, it should resume from where it left off and not from the beginning.
It's a plain text file. Each row contains:
<Datatimestamp> <service name> <paymentType> <success/failure> <session ID>
Also, the data retrieved from the database should be available in real time, without any performance or availability issues in the web application.
Here is my approach:
Deploy the application on two boxes, each with a heartbeat that pings the other system to check service availability.
When you get a successful heartbeat response, you also get the timestamp of the last successfully read line.
When the next heartbeat response fails, the application on the other system can take over, based on:
1. the failed response
2. the last successful timestamp.
Also, since retrieval needs to be near real time and the data is huge, can I index the database into Solr or Elasticsearch for faster retrieval, instead of making database calls?
There are various ways to do this; what is the best way?
I would put a messaging system (for example RabbitMQ) between the text file and the DB-writing applications. In this case, the messaging system functions as a queue: one application constantly reads the file and inserts the rows as messages into the broker, and on the other side, multiple "DB-writing applications" can read from the queue and write to the DB.
The advantage of the messaging system is its support for multiple clients reading from the queue. The messaging system takes care of synchronizing the clients, dealing with errors, dead letters, etc., and the clients don't care about which payload was processed by other instances.
Regarding maintaining multiple instances of the "DB-writing applications": I would go for ready-made cluster solutions, perhaps a Docker cluster managed by Kubernetes?
Another viable alternative is a streaming platform, like Apache Kafka.
You can use a tool like Filebeat to read the file and direct the Filebeat output to RabbitMQ or Kafka. From there a Java program can subscribe to / consume the data and put it into an RDBMS.
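If you go the Kafka route, a minimal sketch of the consuming side could look like the following. The topic name (payment-log), table (payment_log), and connection settings are placeholders for illustration; offsets are committed only after the JDBC batch has been written, so a crashed writer resumes from the last committed offset rather than from the beginning:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PaymentLogDbWriter {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "payment-log-writers");          // multiple instances share the load
        props.put("enable.auto.commit", "false");               // commit only after the DB write succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/payments", "user", "pass");
             PreparedStatement insert = db.prepareStatement(
                 "INSERT INTO payment_log (ts, service, payment_type, status, session_id) VALUES (?, ?, ?, ?, ?)")) {

            consumer.subscribe(Collections.singletonList("payment-log"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // expected row format: <Datatimestamp> <service name> <paymentType> <success/failure> <session ID>
                    String[] f = record.value().split("\\s+", 5);
                    if (f.length < 5) continue;                   // skip malformed lines
                    for (int i = 0; i < 5; i++) insert.setString(i + 1, f[i]);
                    insert.addBatch();
                }
                if (!records.isEmpty()) {
                    insert.executeBatch();
                    consumer.commitSync();                        // offsets advance only after rows are persisted
                }
            }
        }
    }
}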

Efficient way to query PostgreSQL every 30 seconds?

I need to fetch data (how many records are waiting to be processed) from certain PostgreSQL tables in AWS for reporting. The result of the query is posted to a log, picked up by Fluentd daemons, and pushed to Elasticsearch/Kibana. The straightforward way to do this is to write a small Spring Boot app that pings the DB every 30 seconds or so. This, I feel, is inefficient and costly. Is there a better way to do this?
Appreciate your help.
Instead of querying the DB periodically, use Change Data Capture (CDC) to produce a stream of change events, and use stream processing to write the result to an Elasticsearch index. If you're not concerned about vendor lock-in, you can use AWS DMS, Kinesis and Lambda to do that. Otherwise, you can use a suitable Kafka connector to read the changes and post the events to Kafka, then push the data to Elasticsearch using Kafka Streams.
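One such connector is Debezium's PostgreSQL source connector; a rough, illustrative connector configuration might look like this, with hostname, credentials, and table names as placeholders to adapt to your setup:

{
  "name": "reporting-postgres-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "my-db-host",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "reporting",
    "database.server.name": "reporting",
    "table.include.list": "public.pending_records",
    "plugin.name": "pgoutput"
  }
}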

Confirming document upload in couchbase

I am creating an app that logs data. I am creating documents that hold the data and sending those documents to a Couchbase server, or at least I am trying to. One major concern I have is: how do I confirm a document is stored on the server so that it can be immediately deleted on the device? I am hoping there is a quick and efficient way to do this. The end result is to have a thread constantly checking whether there is a connection to Couchbase, and if so, start sending data up to clear it off the device. Most documentation seems to be about syncing the database, but I don't want to do that because I don't want to keep a copy of the data on the device; it would take up too much storage. Thanks for any help.
EDIT: For clarification, I currently have the app storing many data points in documents. I want to send these documents to a Couchbase server. I don't want to "sync" the documents, but rather just insert them into the database and then immediately delete them off the device. How would one go about doing this? Most examples I have seen typically sync documents such as profile information, where changes can be made in various synced databases and all those changes appear in every database. Instead, I want a one-way relationship with the database where information is sent, confirmed as received, then immediately deleted from the device.
There are at least a few possibilities.
If you are expecting a solid network connection, or are ok with handling errors yourself, you can achieve this with a direct REST call to Sync Gateway. You can, of course, always write your own REST server that talks directly to Couchbase Server, too.
The second way relies on an older version of Couchbase Lite. Couchbase Lite 2.x is a major rewrite of the product. As of the current shipping version (2.1), it does not support this approach, so you'll need to use the 1.x version (1.3 or later, IIRC). See further down for how to approach this with 2.1.
Set up a push only replication. After replication, cycle through the docs and purge all the ones that are not still pending. (This uses the isDocumentPending method on the Replication class. That's the key piece not available as of 2.1.) You can either run one shot replications and do this after the replication completes, or monitor the replication state of a continuous replication.
Purging a document from the local CB Lite database effectively makes it act as if it never existed on that device. By running a push only replication, you don't have to worry about the docs getting sent back to the device.
Using 2.1, you can't as easily determine if a document has been replicated. So you need to run a replication to completion while avoiding a race condition with writing something new.
One approach here is to pause writing documents, run a one shot replication, then purge the documents before starting up again. You could also work out something with alternating databases, or tracking documents yourself somehow, etc.
For completeness, if you were in a situation where you had a mixed use, that is, wanted only some documents pushed up off the device and forgotten, and some synced, you would control this through Sync Gateway channels.
I don't know Lite and Sync Gateway well enough, but from a Server perspective:
You could use the new Eventing service in Couchbase. When a document is created in bucket A, you could write an event to copy that document over to bucket B. Then, if the documents are deleted on the device, it wouldn't matter if they get deleted from bucket A.
I have a bucket "staging" and a bucket "final". I created a function called "moveIt" with "final" (I aliased as 'f').
The OnUpdate function could be as simple as:
function OnUpdate(doc, meta) {
    f[meta.id] = doc;
}
My main concern would be the timing. I don't think there's an easy way for your mobile app to know that the event has finished copying a document before you decide to delete it in Lite and start a Sync. But it might be worth a try. Check out the docs to learn more about the Eventing service details.
In Couchbase Lite 2.5, you can use document replication events to detect when a document has synced (pushed to the server or pulled from the server). You can register a callback on the Couchbase Lite replicator to detect whether documents have been pushed to the Sync Gateway and then use the purge API to purge them locally.
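A rough sketch of that approach, assuming the Couchbase Lite 2.5 Android/Java API: a continuous push-only replicator with a document replication listener that purges each document once it is reported as pushed without error. The Sync Gateway endpoint URL and database name are placeholders, and the exact listener/accessor names should be checked against the 2.5 docs:

import com.couchbase.lite.CouchbaseLiteException;
import com.couchbase.lite.Database;
import com.couchbase.lite.Document;
import com.couchbase.lite.ReplicatedDocument;
import com.couchbase.lite.Replicator;
import com.couchbase.lite.ReplicatorConfiguration;
import com.couchbase.lite.URLEndpoint;

import java.net.URI;
import java.net.URISyntaxException;

public class PushAndPurge {

    // Starts a continuous push-only replication and purges each document locally
    // once the replicator reports it was pushed without error.
    public static Replicator start(final Database db) throws URISyntaxException {
        ReplicatorConfiguration config = new ReplicatorConfiguration(
                db, new URLEndpoint(new URI("ws://sync-gateway.example.com:4984/logs")));
        config.setReplicatorType(ReplicatorConfiguration.ReplicatorType.PUSH);  // push only, nothing comes back
        config.setContinuous(true);

        Replicator replicator = new Replicator(config);
        replicator.addDocumentReplicationListener(replication -> {
            if (!replication.isPush()) {
                return;
            }
            for (ReplicatedDocument pushed : replication.getDocuments()) {
                if (pushed.getError() != null) {
                    continue; // the replicator will retry; purge it on a later event
                }
                try {
                    Document local = db.getDocument(pushed.getID());
                    if (local != null) {
                        db.purge(local); // gone from the device, but stays on the server
                    }
                } catch (CouchbaseLiteException e) {
                    // log and move on; the document can be purged on the next callback
                }
            }
        });
        replicator.start();
        return replicator;
    }
}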

Broker disk usage after topic deletion

I'm using Apache Kafka. I dump huge databases into Kafka, where each database table is a topic.
I cannot delete a topic before it has been completely consumed. I cannot set a time-based retention policy because I don't know when a topic will be consumed. I have limited disk and too much data. I have to write code that will orchestrate the consumption and deletion programmatically. I understand that the problem appears because we're using Kafka for batch processing, but I can't change the technology stack.
What is the correct way to delete a consumed topic from the brokers?
Currently, I'm calling kafka.admin.AdminUtils#deleteTopic, but I can't find clear documentation on it. The method signature doesn't contain Kafka server URLs. Does that mean that I'm deleting only the topic's metadata and the brokers' disk usage isn't reduced? So when does the real append-log file deletion happen?
Instead of using a time-based retention policy, are you able to use a size-based policy? log.retention.bytes is a per-partition setting that might help you out here.
I'm not sure how you'd want to determine that a topic is fully consumed, but calling deleteTopic against the topic initially marks it for deletion. As soon as there are no consumers/producers connected to the cluster and accessing those topics, and if delete.topic.enable is set to true in your server.properties file, the controller will then delete the topic from the cluster as soon as it is able to do so. This includes purging the data from disk. It can take anywhere between a few seconds and several minutes to do this.
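As a side note on the API question: the newer org.apache.kafka.clients.admin.AdminClient (Kafka 0.11+) talks to the brokers directly, so it takes bootstrap server URLs, unlike the older AdminUtils which goes through ZooKeeper. A minimal sketch, with the broker address and topic name as placeholders:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.Collections;
import java.util.Properties;

public class TopicCleanup {
    public static void deleteConsumedTopic(String topic) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Marks the topic for deletion; the controller removes the log segments from disk
            // asynchronously, provided delete.topic.enable=true on the brokers.
            admin.deleteTopics(Collections.singletonList(topic)).all().get();
        }
    }
}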

what is best option for creating log message buffer

I am working on a web application which needs to be deployed to the cloud. There is a cloud service which can store log messages for applications securely. This is exposed as a REST API which can take at most 25 log messages in JSON format per call. We are currently using Log4j (open to other frameworks too) to log to a file. Now, we need to transition our application from file-based logging to using the cloud REST API.
I assume it would be expensive to make a REST API call for every log message and that it would slow down the application.
In this context, I am considering writing a custom appender which writes to a buffer. The buffer can be in-memory or persistent, and will be read and emptied periodically by a separate thread or process that sends 25 messages at a time to the cloud REST API.
option 1:
using in-memory buffer
My custom appender would write messages to an in-memory list and keep filling it.
There would be a daemon thread which keeps removing 25 messages at a time from the buffer and writes them to the cloud using the REST API. The downside of this approach is that if the application/server/node crashes, we lose critical log messages which could help diagnose why the crash occurred. I am not sure if this is the right way of thinking.
option 2:
using a persistent buffer (database/message queue):
The appender can log messages to a database table temporarily, or post them to a message queue, which will be processed by a separate long-running job that picks up messages from the DB or queue and posts them to the cloud using the REST API.
Please advise which option looks best.
There are a lot of built-in appenders in Log4j 2: https://logging.apache.org/log4j/2.x/manual/appenders.html and if you use a dedicated cloud service, it may provide a specific appender.
If it fits your environment, maybe try a stack like ELK with the Log4j RollingFile appender; with that technique you won't lose log entries.
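For option 1 from the question, here is a rough sketch of what such a custom Log4j 2 appender could look like. The postBatch() call and the endpoint it would hit are placeholders, and, as noted in the question, anything still sitting in the in-memory buffer is lost if the process crashes:

import org.apache.logging.log4j.core.Filter;
import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.appender.AbstractAppender;
import org.apache.logging.log4j.core.config.plugins.Plugin;
import org.apache.logging.log4j.core.config.plugins.PluginAttribute;
import org.apache.logging.log4j.core.config.plugins.PluginElement;
import org.apache.logging.log4j.core.config.plugins.PluginFactory;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

@Plugin(name = "CloudBuffer", category = "Core", elementType = "appender", printObject = true)
public final class CloudBufferAppender extends AbstractAppender {

    private static final int BATCH_SIZE = 25;                       // REST API limit per call
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(10_000);

    private CloudBufferAppender(String name, Filter filter) {
        super(name, filter, null);
        Thread drainer = new Thread(this::drain, "cloud-log-drainer");
        drainer.setDaemon(true);
        drainer.start();
    }

    @PluginFactory
    public static CloudBufferAppender createAppender(@PluginAttribute("name") String name,
                                                     @PluginElement("Filter") Filter filter) {
        return new CloudBufferAppender(name, filter);
    }

    @Override
    public void append(LogEvent event) {
        buffer.offer(event.getMessage().getFormattedMessage());     // drops the message if the buffer is full
    }

    private void drain() {
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        while (true) {
            try {
                batch.add(buffer.take());                            // block until at least one message arrives
                buffer.drainTo(batch, BATCH_SIZE - batch.size());    // then grab up to 25 in total
                postBatch(batch);
                batch.clear();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void postBatch(List<String> messages) {
        // placeholder: serialize 'messages' to JSON and POST them to the cloud logging endpoint
    }
}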
