Writing Spark Streaming Output to a Socket - java

I have a DStream "Crowd" and I want to write each element in "Crowd" to a socket. When I try to read from that socket, it doesn't print anything. I am using the following lines of code:
import java.net.ServerSocket
import java.io.PrintStream

val server = new ServerSocket(4000, 200)
val conn = server.accept()
val out = new PrintStream(conn.getOutputStream())
crowd.foreachRDD(rdd => rdd.foreach(record => out.println(record)))
But if I use this instead (which is not what I want):
crowd.foreachRDD(rdd => out.println(rdd))
It does write something to the socket.
I suspect there is a problem with using rdd.foreach(), although it seems like it should work. I am not sure what I am missing.

The code outside the DStream closure is executed on the driver, while rdd.foreach(...) is executed on each distributed partition of the RDD.
So the socket is created on the driver's machine, and the job tries to write to it from other machines - that will not work, for obvious reasons.
DStream.foreachRDD itself is executed on the driver, so in that case the socket and the computation live on the same host. Therefore it works.
Given the distributed nature of RDD computation, this server-socket approach is hard to make work, because dynamic service discovery becomes a challenge, i.e. "where is my server socket open?". Look into a system that gives you centralized access to distributed data. Kafka is a good alternative for this kind of streaming process.
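For illustration only, a rough sketch of that Kafka route could look like the following (written against the Java API, which the later questions here also use; the broker address, topic name and String record type are placeholders rather than details from the question). Each partition opens its own producer on the executor, so nothing has to be serialized from the driver:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.streaming.api.java.JavaDStream;

public class CrowdToKafka {
    public static void publish(JavaDStream<String> crowd) {
        crowd.foreachRDD(rdd ->
            rdd.foreachPartition(records -> {
                // Created on the executor, once per partition
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
                props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    while (records.hasNext()) {
                        producer.send(new ProducerRecord<>("crowd", records.next())); // placeholder topic
                    }
                }
            })
        );
    }
}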

Here in the official documentation you have the answer!
You have to create the connection inside the foreachRDD function. If you want to do it optimally, create a "pool" of connections, borrow a connection inside the foreachPartition function, and then call foreach to send the elements through that connection. This is the example code from the documentation for doing it the best way:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
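ConnectionPool in that snippet is not a Spark class; you provide it yourself. A minimal sketch of the idea (here in Java, with a plain socket stream whose records would be written with println() rather than a send() method; host and port are placeholders):

import java.io.PrintStream;
import java.net.Socket;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal sketch of a static, lazily initialized connection pool
public class ConnectionPool {
    private static final ConcurrentLinkedQueue<PrintStream> pool = new ConcurrentLinkedQueue<>();

    public static PrintStream getConnection() throws java.io.IOException {
        PrintStream conn = pool.poll();
        if (conn == null) {
            // Lazily open a new connection when the pool is empty
            conn = new PrintStream(new Socket("sink-host", 4000).getOutputStream(), true);
        }
        return conn;
    }

    public static void returnConnection(PrintStream conn) {
        pool.offer(conn);
    }
}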
In any case, check the other answers as well, as they provide good background on the context of the problem.

crowd.foreachRDD(rdd => {rdd.collect.foreach(record=>{out.println(record)})})
The code you suggested in the comments (above) will work fine, but in that case you have to collect all records of the RDD in the driver. If the number of records is small, that is fine, but if it is larger than the driver's memory, it becomes a bottleneck. Your first choice should always be to process the data where it lives, on the workers. Remember that an RDD is distributed across the worker machines, so collecting means first bringing every record of the RDD to the driver, which increases communication - a killer in distributed computing. So, as stated, your code is only OK when the RDD holds a limited number of records.
I am working on a similar problem and have been searching for how to pool connections and serialize them to the client machines. If somebody has an answer to that, it would be great.

Related

Apache Spark take Action on Executors in fully distributed mode

I am new to Spark; I have a basic idea of how transformations and actions work (guide). I am trying some NLP operations on each line (basically paragraphs) in a text file. After processing, the result should be sent to a server (REST API) for storage. The program is run as a Spark job (submitted using spark-submit) on a cluster of 10 nodes in YARN mode. This is what I have done so far.
...
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<String> processedLines = lines
    .map(line -> {
        // processed here
        return result;
    });
processedLines.foreach(line -> {
    // Send to server
});
This works, but the foreach loop seems sequential; it seems like it is not running in distributed mode on the worker nodes. Am I correct?
I tried the following code but it doesn't work. Error: java: incompatible types: inferred type does not conform to upper bound(s). Obviously it's wrong, because map is a transformation, not an action.
lines.map(line -> { /* processing */ })
     .map(line -> { /* Send to server */ });
I also tried take(), but it requires an int and processedLines.count() is of type long.
processedLines.take(processedLines.count()).forEach(pl -> { /* Send to server */ });
The data is huge (greater than 100 GB). What I want is that both the processing and the sending to the server are done on the worker nodes. The processing part in the map definitely takes place on the worker nodes. But how do I send the processed data from the worker nodes to the server, given that the foreach seems to be a sequential loop taking place in the driver (if I am correct)? Simply put: how do I execute the action on the worker nodes and not in the driver program?
Any help will be highly appreciated.
foreach is an action in Spark. It basically takes each element of the RDD and applies a function to that element.
foreach is performed on the executor (worker) nodes; it does not get applied on the driver node. Note that in Spark's local execution mode the driver and the executor can reside in the same JVM.
Check this for reference foreach explanation
Your approach looks OK: you map each element of the RDD and then apply foreach to each element. The only reason I can think of for why it is taking time is the size of the data you are dealing with (~100 GB).
One optimization is to repartition the input data set. Ideally each partition should be around 128 MB for good performance. There are many articles about best practices for repartitioning data; I would suggest you follow them, it will give some performance benefit.
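As a back-of-the-envelope sketch of that sizing (the figures are purely illustrative, not taken from your job):

// Illustrative only: aim for roughly 128 MB per partition
long totalInputBytes = 100L * 1024 * 1024 * 1024;                        // ~100 GB of input, per the question
int targetPartitions = (int) (totalInputBytes / (128L * 1024 * 1024));   // ~128 MB per partition
JavaRDD<String> repartitioned = processedLines.repartition(targetPartitions);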
The second optimization to consider is the memory you assign to each executor node. It plays a very important role in Spark tuning.
The third optimization to consider is batching the network calls to the server. You are currently making one network call per element of the RDD. If your design allows you to batch these calls so that more than one element is sent per call, that might help as well, especially if the latency is mostly due to these network calls.
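A hedged sketch of that batching, done per partition so it still runs on the workers (postBatch() is a hypothetical helper standing in for whatever your REST client offers):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

public class BatchedSender {
    static void sendInBatches(JavaRDD<String> processedLines) {
        processedLines.foreachPartition(lines -> {
            List<String> batch = new ArrayList<>();
            while (lines.hasNext()) {
                batch.add(lines.next());
                if (batch.size() == 500) {
                    postBatch(batch);   // one HTTP call for 500 records
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                postBatch(batch);       // flush the remainder
            }
        });
    }

    static void postBatch(List<String> batch) {
        // placeholder: POST the batch to the REST API
    }
}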
I hope this helps.
Firstly, when your code runs on the executors it is already in distributed mode. Now, if you want to use all the CPU resources on the executors for more parallelism, you should go for an async option, preferably batch-wise, to avoid creating an excessive number of client connection objects, as below.
You can replace your code
processedLines.foreach(line -> { ... })
with either of the following:
processedLines.foreachAsync(line -> {
    // Send to server
}).get();

// To iterate batch-wise I would go for this
processedLines.foreachPartitionAsync(lineIterator -> {
    // Create your output client connection here
    while (lineIterator.hasNext()) {
        String line = lineIterator.next();
        // Send to server over that connection
    }
}).get();
Both functions return a Future, i.e. the job is submitted on a separate thread as a non-blocking call, which automatically adds parallelism to your code.

Use (force) second JDBC connection to the same DB in the same thread

I am using streaming in my query to a MySQL DB. It works fine until I issue another query during the streaming. That is fully expected and explained in the java.sql.SQLException:
Streaming result set com.mysql.jdbc.RowDataDynamic@16559dec is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
Since I really would like to make my second query in the middle of streaming, apparently I just need to use another DB connection for that.
So how can I force using another connection in the same thread?
I am using Spring Data with Hibernate
Please do not suggest fetching all at once or paging, that's not the point of the question.
Edits:
streaming in my case means a select from a long table (millions of records) where the next row is only transferred when it is requested. It is not streaming file content. This is an article about it: http://knes1.github.io/blog/2015/2015-10-19-streaming-mysql-results-using-java8-streams-and-spring-data.html
using another thread is a solution, but the question is about having 2 connections at the same time in the same thread
Program flow:
run the query using stream (using connection 1)
for every row from the stream
do something on db (using connection 2)
streaming finishes, connection 1 closes
So 1 thread with { 1. open conn, 2-8. asynchronously stream, 9. close conn }, { 3. open or use conn, 4. query, 9. close or skip }??
Try using a connection pool instead, if the streaming is short enough.
Otherwise, this may also be a case where large streams are better stored as files, with only the file name kept in the database (using a UUID, for instance, to generate the file names). Then the streaming can be done outside of the database, and you could throttle it to prevent a self-made denial of service.
After the question was re-edited.
So the scenario is:
void f() {
    open conn
    do a java 8 stream
    g()
    close conn
}

void g() {
    open conn
    ...
    close conn
}
This is possible, and there are several ways of handling it: with a single global connection, with a connection pool, transactional or with autocommit.
For queries I guess the most important thing is to close everything. Try-with-resources is ideal for not leaking resources.
try (Connection conn = ...) {
    ...
    try (PreparedStatement stm = conn.prepareStatement(sql)) {
        ...
        try (ResultSet rs = stm.executeQuery()) {
            ... the stream
        }
    }
}
The above also closes the ResultSet, which might be what got you into trouble.
It is also quite conceivable to pass the stream on, since it has access to the result set.
As Stream is AutoCloseable too, you might need to tweak the code there. Or use a CachedRowSet instead of a ResultSet.
Sorry for this somewhat indeterminate answer.
I'm not sure I fully understand what you want to do. I take it that you want to stream your data, transform each item somehow, and then persist the result. If that's the case, having a separate 'persist' method annotated with @Transactional(propagation = Propagation.REQUIRES_NEW) should do the trick.
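A minimal sketch of that approach, assuming hypothetical ItemWriter/ItemRepository/Item types (they are not from the question): the REQUIRES_NEW method must be called through another bean so Spring's transaction proxy kicks in, and the inner transaction then runs on a second connection from the pool, separate from the one doing the streaming.

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

@Service
public class ItemWriter {

    private final ItemRepository repository;   // your Spring Data repository

    public ItemWriter(ItemRepository repository) {
        this.repository = repository;
    }

    // Suspends the surrounding (streaming) transaction and opens a new one,
    // which runs on a different connection from the pool.
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void persist(Item item) {
        repository.save(item);
    }
}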
If you want more than one thread processing your stream (say, because you do some REST calls as part of it, or for some other reason you think it might take long), you might want to consider pushing the streamed elements onto a blocking queue and having multiple threads read from there, doing whatever needs doing with each item.

Publishing to KDB from multiple threads

We have an application with multiple threads which reuse one KDB connection.
From a performance perspective, would it be better to open multiple connections to a multithreaded KDB instance to speed up the process? Also interesting: is there any potential downside if we publish from multiple threads to a single connection? We have a Java app and use the exxeleron Java library.
Aside from the fact that a single socket connection to KDB isn't very resource hungry by itself, in the end I think you'll find that disk seeks and memory allocation are by far the largest bottlenecks, not how many connections you have to a database. That said, since you ask...
Let's go on simple assumptions:
The KDB database is a historical database. The multithreading options on that side are a negative port number and -s, which can't be set simultaneously
You have a single process, let's call it A, that accesses it
With a negative port number you get a multi-threaded input queue. So if A has the ability to issue multiple queries, they can be dispatched simultaneously and KDB+ won't block on each call. However, A would somehow need to be able to match the incoming stream of results to the particular queries. You could query it like (<queryId>;<actualQuery>) and parse the first element for identification, I suppose. However, in this use case it sounds like you should have multiple A's.
With -s you get multi-threaded queries, so your q queries have to be written as such (sometimes you get it for free, though, like querying across partitions). You'll block on every call, so there is no real advantage in having multiple A's.

Elegant/efficient way reading millions of records in MySQL Database, Java

I have a MySQL database with ~8,000,000 records. Since I need to process them all, I use a BlockingQueue: a Producer reads from the database and puts 1000 records in the queue, and a Consumer processes the records it takes from the queue.
I am writing this in Java, but I'm stuck figuring out how I can (in a clean, elegant way) read from my database and 'suspend' reading once the BlockingQueue is full. After that, control is handed to the Consumer until there are free spots in the BlockingQueue again. From there the Producer should continue reading records from the database.
Is it clean/elegant/efficient to keep my database connection open so it can read continuously? Or should I, once control shifts from Producer to Consumer, close the connection, store the id of the last record read, and later reopen the connection and continue reading from that id? The latter does not seem great to me since the database would have to open/close a lot! But the former is not so elegant in my opinion either.
With persistent connections:
You cannot build transaction processing effectively
User sessions on the same connection are impossible
The application is not scalable
With time you may need to extend it, and that will require management/tracking of persistent connections
If a script, for whatever reason, cannot release a lock on a table, then any following scripts will block indefinitely and one would have to restart the DB server
Using transactions, a transaction block will also carry over to the next script (using the same connection) if script execution ends before the transaction block completes, etc.
Persistent connections do not bring anything that you cannot do with non-persistent connections.
Then why use them at all?
The only possible reason is performance: use them when the overhead of creating a link to your MySQL server is high. And that depends on many factors, like:
Database type
Whether the MySQL server is on the same machine and, if not, how far away it is - it might even be outside your local network/domain
How heavily the machine MySQL sits on is loaded by other processes
One can always replace persistent connections with non-persistent connections. It might change the performance of the script, but not its behavior!
Commercial RDBMSs may be licensed by the number of concurrently open connections, and there persistent connections can backfire.
If you are using a bounded BlockingQueue by passing a capacity value in the constructor, then the producer will block when it attempts to call put() until the consumer removes an item by calling take().
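A minimal sketch of that blocking behaviour, with fetchBatch() and process() as placeholders for the database read and the record processing (they are not real APIs from the question):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Pipeline {

    private static final List<String> POISON = new ArrayList<>();   // end-of-stream marker

    public static void main(String[] args) throws InterruptedException {
        // The capacity (here 10 batches) is what makes put() block and "suspend" the producer
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(10);

        Thread producer = new Thread(() -> {
            try {
                List<String> batch;
                while (!(batch = fetchBatch(1000)).isEmpty()) {
                    queue.put(batch);                        // blocks while the queue is full
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                List<String> batch;
                while ((batch = queue.take()) != POISON) {   // blocks while the queue is empty
                    batch.forEach(Pipeline::process);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }

    private static List<String> fetchBatch(int size) { return new ArrayList<>(); } // placeholder: read next 1000 rows
    private static void process(String record) { }                                 // placeholder: handle one record
}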
It would help to know more about when or how the program is going to be executed in order to decide how to deal with database connections. Some easy choices: give the producer and each consumer an individual connection, have a connection pool for all consumers while the producer holds its own connection, or have all producers and consumers use a connection pool.
You can minimize the number of connections by using something such as Spring to manage your connection pool and transactions; however, that is only necessary in some execution situations.

MySQL and Java: Insert efficiently as data comes in via events with high frequency

When an external event occurs (incoming measurement data), an event handler in my Java code is called. The data should be written to a MySQL database. Due to the high frequency of these calls (>1000 per second) I'd like to handle the inserts efficiently. Unfortunately I'm not a professional developer and an idiot with databases.
Neglecting the efficiency aspect my code would look roughly like this:
public class X {
    public void eventHandler(data) {
        connection = DriverManager.getConnection()
        statement = connection.prepareStatement("insert …")
        statement.setString(1, data)
        statement.executeUpdate()
        statement.close()
        connection.close()
    }
}
My understanding is that by calling addBatch() and executeBatch() on the statement I could limit the physical disk access to, say, every 1000th insert. However, as you can see in the code sketch above, the statement object is newly instantiated on every call of eventHandler(), so my impression is that the batch mechanism won't be helpful in this context. The same goes for turning off auto-commit and then calling commit() on the connection object, since the connection is closed after every insert.
I could turn connection and statement from local variables into class members and reuse them during the whole runtime of the program. But wouldn't it be bad style to keep the database connection open at all times?
A solution would be to buffer the data manually and then write to the database only after collecting a proper batch. But so far I still hope that you smart guys will tell me how to let the database do the buffering for me.
I could turn connection and statement from local variables into class members and reuse them during the whole runtime of the program. But wouldn't it be bad style to keep the database connection open at all times?
Considering that most (database-)connection pools are usually configured to keep at least one or more connections open at all times, I wouldn't call it "bad style". This is to avoid the overhead of starting a new connection on each database operation (unless necessary, if all the already opened connections are in use and the pool allows for more).
I'd probably go with some form of batching in this case (but of course I don't know all your requirements, environment, etc.). If the data doesn't need to be immediately available somewhere else, you could build some form of job queue for writing the data, push the incoming data there, and let other thread(s) take care of writing it to the database in batches of N. Take a look at what classes are available in the java.util.concurrent package.
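As a rough sketch of that idea (the JDBC URL, table and column names are placeholders, not from the question): the event handler only enqueues, and one writer thread drains the queue and writes with addBatch()/executeBatch().

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchingWriter implements Runnable {

    private static final int BATCH_SIZE = 1000;
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called from eventHandler(): cheap and non-blocking for the caller
    public void enqueue(String data) {
        queue.offer(data);
    }

    @Override
    public void run() {
        try (Connection connection = DriverManager.getConnection(
                 "jdbc:mysql://localhost/measurements", "user", "password");
             PreparedStatement statement = connection.prepareStatement(
                 "INSERT INTO samples(value) VALUES (?)")) {
            connection.setAutoCommit(false);
            while (!Thread.currentThread().isInterrupted()) {
                // take() blocks until at least one record is available
                statement.setString(1, queue.take());
                statement.addBatch();
                int n = 1;
                String next;
                while (n < BATCH_SIZE && (next = queue.poll()) != null) {
                    statement.setString(1, next);
                    statement.addBatch();
                    n++;
                }
                statement.executeBatch();   // one round trip for up to BATCH_SIZE inserts
                connection.commit();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

With MySQL Connector/J, also consider setting rewriteBatchedStatements=true on the JDBC URL so the batch is sent as a single multi-row insert.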
I would suggest you use a LinkedList<> to buffer the data (like a queue), then store the data into the DBMS as and when required in a separate thread, executed at regular intervals (maybe every 2 seconds?).
See how to construct a queue using LinkedList in Java
