Substitute for thread executor pool in Scala - java

My application requires that I have multiple threads running fetching data from various HDFS nodes. For that I am using the thread executor pool and forking threads.
Forking at :
val pathSuffixList = fileStatuses.getOrElse("FileStatus", List[Any]()).asInstanceOf[List[Map[String, Any]]]
pathSuffixList.foreach(block => {
ConsumptionExecutor.execute(new Consumption(webHdfsUri,block))
})
My class Consumption :
class Consumption(webHdfsUri: String, block:Map[String,Any]) extends Runnable {
override def run(): Unit = {
val uriSplit = webHdfsUri.split("\\?")
val fileOpenUri = uriSplit(0) + "/" + block.getOrElse("pathSuffix", "").toString + "?op=OPEN"
val inputStream = new URL(fileOpenUri).openStream()
val datumReader = new GenericDatumReader[Void]()
val dataStreamReader = new DataFileStream(inputStream, datumReader)
// val schema = dataStreamReader.getSchema()
val dataIterator = dataStreamReader.iterator()
while (dataIterator.hasNext) {
println(" data : " + dataStreamReader.next())
}
}
}
ConsumptionExecutor :
object ConsumptionExecutor{
val counter: AtomicLong = new AtomicLong()
val executionContext: ExecutorService = Executors.newCachedThreadPool(new ThreadFactory {
def newThread(r: Runnable): Thread = {
val thread: Thread = new Thread(r)
thread.setName("ConsumptionExecutor-" + counter.incrementAndGet())
thread
}
})
executionContext.asInstanceOf[ThreadPoolExecutor].setMaximumPoolSize(200)
def execute(trigger: Runnable) {
executionContext.execute(trigger)
}
}
However I want to use Akka streaming/ Akka actors where in I don't need to give a fixed thread pool size and Akka takes care of everything.
I am pretty new to Akka and the concept of Streaming and actors . Can someone give me any leads in the form of a sample code to fit my use case?
Thanks in advance!

An idea would be to create a (subclass) instance of ActorPublisher for each HDFS node that you are reading from, and then Merge them in as multiple Sources in a FlowGraph.
Something like this pseudo-code, where the details of the ActorPublisher sources are left out:
val g = PartialFlowGraph { implicit b =>
import FlowGraphImplicits._
val in1 = actorSource1
val in2 = actorSource2
// etc.
val out = UndefinedSink[T]
val merge = Merge[T]
in1 ~> merge ~> out
in2 ~> merge
// etc.
}
This can be improved for a collection of actor sources by just iterating over them and adding an edge to the merge for each one, but this gives the idea.

Related

Akka stream broadcast in java

I am trying to broadcast to 2 sink from a source in java, got stuck in between, any pointer will be helpful
public static void main(String[] args) {
ActorSystem system = ActorSystem.create("GraphBasics");
ActorMaterializer materializer = ActorMaterializer.create(system);
final Source<Integer, NotUsed> source = Source.range(1, 1000);
Sink<Integer,CompletionStage<Done>> firstSink = Sink.foreach(x -> System.out.println("first sink "+x));
Sink<Integer,CompletionStage<Done>> secondsink = Sink.foreach(x -> System.out.println("second sink "+x));
RunnableGraph.fromGraph(
GraphDSL.create(
b -> {
UniformFanOutShape<Integer, Integer> bcast = b.add(Broadcast.create(2));
b.from(b.add(source)).viaFanOut(bcast).to(b.add(firstSink)).to(b.add(secondsink));
return ClosedShape.getInstance();
}))
.run(materializer);
}
i am not that much familiar with java api for akka-stream graphs, so i used the official doc. there are 2 errors in your snippet:
when you added source to the graph builder, you need to get Outlet from it. so instead of b.from(b.add(source)) there should smth like this: b.from(b.add(source).out()) according to the official doc
you can't just call two .to method in a row, because .to expects smth with Sink shape, which means kind of dead end. instead you need to attach 2nd sink to the bcast directly, like this:
(...).viaFanOut(bcast).to(b.add(firstSink));
b.from(bcast).to(b.add(secondSink));
all in all the code should look like this:
ActorSystem system = ActorSystem.create("GraphBasics");
ActorMaterializer materializer = ActorMaterializer.create(system);
final Source<Integer, NotUsed> source = Source.range(1, 1000);
Sink<Integer, CompletionStage<Done>> firstSink = foreach(x -> System.out.println("first sink " + x));
Sink<Integer, CompletionStage<Done>> secondSink = foreach(x -> System.out.println("second sink " + x));
RunnableGraph.fromGraph(
GraphDSL.create(b -> {
UniformFanOutShape<Integer, Integer> bcast = b.add(Broadcast.create(2));
b.from(b.add(source).out()).viaFanOut(bcast).to(b.add(firstSink));
b.from(bcast).to(b.add(secondSink));
return ClosedShape.getInstance();
}
)
).run(materializer);
Final note - i would think twice whether it makes sense to use graph api. If you case as simple as this one (just 2 sinks), you might want just to use alsoTo or alsoToMat. They give you the possibility to attach multiple sinks to the flow without the need to use graphs.

Flink does not checkpoint, and BucketingSink leaves files in pending state, when source is generated from Collection

I'm trying to generate some test data using a collection, and write that data to s3, Flink doesn't seem to do any checkpointing at all when I do this, but it does do checkpointing when the source comes from s3.
For example, this DOES checkpoint and leaves output files in a completed state:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setMaxParallelism(128)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.enableCheckpointing(2000L)
env.setStateBackend(new RocksDBStateBackend("s3a://my_bucket/simple_job/rocksdb_checkpoints"))
val lines: DataStream[String] = {
val path = "s3a://my_bucket/simple_job/in"
env
.readFile(
inputFormat = new TextInputFormat(new Path(path)),
filePath = path,
watchType = FileProcessingMode.PROCESS_CONTINUOUSLY,
interval = 5000L
)
}
val sinkFunction: BucketingSink[String] =
new BucketingSink[String]("s3a://my_bucket/simple_job/out")
.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
lines.addSink(sinkFunction)
env.execute()
Meanwhile, this DOES NOT checkpoint, and leaves files in a .pending state even after the job has finished:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setMaxParallelism(128)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.enableCheckpointing(2000L)
env.setStateBackend(new RocksDBStateBackend("s3a://my_bucket/simple_job/rocksdb_checkpoints"))
val lines: DataStream[String] = env.fromCollection((1 to 100).map(_.toString))
val sinkFunction: BucketingSink[String] =
new BucketingSink[String]("s3a://my_bucket/simple_job/out")
.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
lines.addSink(sinkFunction)
env.execute()
It turns out that this is because of this ticket: https://issues.apache.org/jira/browse/FLINK-2646 and simply comes about because the stream from collection finishes finishes before the app ever has time to make a single checkpoint.

Spark foeach no effect on Java Map [duplicate]

Below is code for getting list of file Names in a zipped file
def getListOfFilesInRepo(zipFileRDD : RDD[(String,PortableDataStream)]) : (List[String]) = {
val zipInputStream = zipFileRDD.values.map(x => new ZipInputStream(x.open))
val filesInZip = new ArrayBuffer[String]()
var ze : Option[ZipEntry] = None
zipInputStream.foreach(stream =>{
do{
ze = Option(stream.getNextEntry);
ze.foreach{ze =>
if(ze.getName.endsWith("java") && !ze.isDirectory()){
var fileName:String = ze.getName.substring(ze.getName.lastIndexOf("/")+1,ze.getName.indexOf(".java"))
filesInZip += fileName
}
}
stream.closeEntry()
} while(ze.isDefined)
println(filesInZip.toList.length) // print 889 (correct)
})
println(filesInZip.toList.length) // print 0 (WHY..?)
(filesInZip.toList)
}
I execute above code in the following manner :
scala> val zipFileRDD = sc.binaryFiles("./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip")
zipFileRDD: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = ./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip BinaryFileRDD[17] at binaryFiles at <console>:25
scala> getListOfFilesInRepo(zipRDD)
889
0
res12: List[String] = List()
Why i am not getting 889 and instead getting 0?
It happens because filesInZip is not shared between workers. foreach operates on a local copy of filesInZip and when it finishes this copy is simply discarded and garbage collected. If you want to keep the results you should use transformation (most likely a flatMap) and return collected aggregated values.
def listFiles(stream: PortableDataStream): TraversableOnce[String] = ???
zipInputStream.flatMap(listFiles)
You can learn more from Understanding closures

Scala server with Spray Akka overthreading

I'm using spray.io and akka.io on my freebsd 1CPU/2GB server and I'm facing
threading problems. I've started to notice it when I got OutOfMemory exception
because of "can't create native thread".
I check Thread.activeCount() usually and I see it grows enormously.
Currently I use these settings:
myapp-namespace-akka {
akka {
loggers = ["akka.event.Logging$DefaultLogger"]
loglevel = "DEBUG"
stdout-loglevel = "DEBUG"
actor {
deployment {
default {
dispatcher = "nio-dispatcher"
router = "round-robin"
nr-of-instances = 1
}
}
debug {
receive = on
autoreceive = on
lifecycle = on
}
nio-dispatcher {
type = "Dispatcher"
executor = "fork-join-executor"
fork-join-executor {
parallelism-min = 8
parallelism-factor = 1.0
parallelism-max = 16
task-peeking-mode = "FIFO"
}
shutdown-timeout = 4s
throughput = 4
throughput-deadline-time = 0ms
attempt-teamwork = off
mailbox-requirement = ""
}
aside-dispatcher {
type = "Dispatcher"
executor = "fork-join-executor"
fork-join-executor {
parallelism-min = 8
parallelism-factor = 1.0
parallelism-max = 32
task-peeking-mode = "FIFO"
}
shutdown-timeout = 4s
throughput = 4
throughput-deadline-time = 0ms
attempt-teamwork = on
mailbox-requirement = ""
}
}
}
}
I want nio-dispatcher to be my default non blocking (lets say single thread)
dispatcher. And I execute all my futures (db, network queries) on aside-dispatcher.
I get my contexts through my application as follows:
trait Contexts {
def system: ActorSystem
def nio: ExecutionContext
def aside: ExecutionContext
}
object Contexts {
val Scope = "myapp-namespace-akka"
}
class ContextsImpl(settings: Config) extends Contexts {
val System = "myapp-namespace-akka"
val NioDispatcher = "akka.actor.nio-dispatcher"
val AsideDispatcher = "akka.actor.aside-dispatcher"
val Settings = settings.getConfig(Contexts.Scope)
override val system: ActorSystem = ActorSystem(System, Settings)
override val nio: ExecutionContext = system.dispatchers.lookup(NioDispatcher)
override val aside: ExecutionContext = system.dispatchers.lookup(AsideDispatcher)
}
// Spray trait mixed to service actors
trait ImplicitAsideContext {
this: EnvActor =>
implicit val aside = env.contexts.aside
}
I think that I did mess up with configs or implementations. Help me out here.
Usually I see thousands of threads for now on my app until it crashes (I set
freebsd pre process limit to 5000).
If your app indeed starts so many threads this can be usually tracked back to blocking behaviours inside ForkJoinPools (bad bad thing to do!), I explained the issue in detail in the answer here: blocking blocks, so you may want to read up on it there and verify what threads are being created in your app and why – the ForkJoinPool does not have a static upper limit.

Does Akka has a ExecutorCompletionService equivalent where Futures are queued by their completion time?

With java I can create an ExecutorCompletionService with an executor and a bunch of tasks. This class arranges that submitted tasks are, upon completion, placed on a queue accessible using take.
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
Does Akka has something similar for managing Futures returned by actors?
This answer is for Scala only. In scala there is sequence/firstCompletedOf to compose futures, which returns you new future completing after all/one of the underlying futures isCompleted (which is equivalent to examples from CompletionService's api docs). Such solution is more safe than ecs.take().get() as there is no blocking if you use onComplete listener; however, if you still want some blocking waiter - use Await.result. So, no need for CompletionService as list of futures is flexible enough and much more safe. Equivalent of first example:
val solvers: List[() => Int] = ...
val futures = solvers.map(s => Future {s()}) //run execution
(Future sequence futures) onComplete { results: Seq[Int] =>
results.map(use)
}
Another example is cancelling the task:
val solvers: List[Future => Int] = ... //some list of functions(tasks), Future is used to check if task was interrupted
val (futures, cancels): solvers.map(cancellableFuture) //see https://stackoverflow.com/questions/16020964/cancellation-with-future-and-promise-in-scala
(Future firstCompletedOf futures) onComplete { result: Int =>
cancels.foreach(_())
use(result)
}
Talking about Java, Akka has adaptation of scala's futures: http://doc.akka.io/docs/akka/snapshot/java/futures.html
If you just want to sequentially process results on their completion, you may use actor for that:
val futures: List[Future]
futures.map(_ pipeTo actor) //actor's mailbox is used as queue
To model completion queue's behavior (which is not recommended):
import scala.concurrent._
import duration._
import scala.concurrent.ExecutionContext.Implicits.global //some execution context
class Queue[T](solvers: Seq[() => T]) extends Iterator[T]{
case class Result(f: Future[Result], r: T)
var futures: Set[Future[Result]] = solvers map {s =>
lazy val f: Future[Result] = Future{Result(f, s())}
f
} toSet
def hasNext() = futures.nonEmpty
def next() = {
val result = Await.result((Future firstCompletedOf futures.toSeq), Duration.Inf)
futures -= result.f
result.r
}
}
scala> val q = new Queue(List(() => 1, () => 2, () => 3, () => 4))
q: Queue[Int] = non-empty iterator
scala> q.next
res14: Int = 2
scala> q.next
res15: Int = 1
scala> q.foreach(println)
4
3
Maybe this probable solution without using ExecutorCompletionService will help you:
import java.util.concurrent.atomic.AtomicLong
import java.util.concurrent._
import scala.concurrent.duration._
import scala.util._
import scala.concurrent.{ExecutionContextExecutorService, ExecutionContext, Future}
class BatchedIteratorsFactory[S,R](M: Int, timeout: Duration) {
implicit val ec = ExecutionContext.fromExecutor(Executors.newCachedThreadPool())
val throttlingQueue = new LinkedBlockingQueue[Future[R]](M) // Can't put more than M elements to the queue
val resultQueue = new LinkedBlockingQueue[Try[R]](M)
val jobCounter = new AtomicLong(0)
def iterator(input: Iterator[S])(job: S => R): Iterator[Try[R]] = {
val totalWork = Future(input.foreach { elem =>
jobCounter.incrementAndGet
throttlingQueue.put(Future { job(elem) } andThen {
case r => resultQueue.put(r); throttlingQueue.poll() // the order is important here!
})
})
new Iterator[Try[R]] {
override def hasNext: Boolean = jobCounter.get != 0 || input.hasNext
override def next(): Try[R] = {
jobCounter.decrementAndGet
Option(resultQueue.poll(timeout.toMillis, TimeUnit.MILLISECONDS)).getOrElse(
throw new TimeoutException(s"No task has been completed within ${timeout.toMillis} ms!")
)
}
}
}
}
So you can use it like this:
val job = { (elem: Int) =>
val result = elem * elem
Thread.sleep(1000L) // some possibel computation...
result
}
val src = Range(1, 16).toIterator
val it = new BatchedIteratorsFactory[Int, Int](M = 3, timeout = 4 seconds)
.iterator(src)(job)

Categories