Below is the code for getting the list of file names in a zipped file:
def getListOfFilesInRepo(zipFileRDD : RDD[(String,PortableDataStream)]) : (List[String]) = {
  val zipInputStream = zipFileRDD.values.map(x => new ZipInputStream(x.open))
  val filesInZip = new ArrayBuffer[String]()
  var ze : Option[ZipEntry] = None
  zipInputStream.foreach(stream => {
    do {
      ze = Option(stream.getNextEntry);
      ze.foreach{ze =>
        if(ze.getName.endsWith("java") && !ze.isDirectory()){
          var fileName:String = ze.getName.substring(ze.getName.lastIndexOf("/")+1,ze.getName.indexOf(".java"))
          filesInZip += fileName
        }
      }
    } while(ze.isDefined)
    println(filesInZip.toList.length) // print 889 (correct)
  })
  println(filesInZip.toList.length) // print 0 (WHY..?)
  filesInZip.toList
}
I execute the above code in the following manner:
scala> val zipFileRDD = sc.binaryFiles("./handsOn/")
zipFileRDD: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = ./handsOn/ BinaryFileRDD[17] at binaryFiles at <console>:25
scala> getListOfFilesInRepo(zipRDD)
res12: List[String] = List()
Why am I getting 0 instead of 889?
It happens because filesInZip is not shared between workers. foreach operates on a local copy of filesInZip, and when it finishes this copy is simply discarded and garbage collected. If you want to keep the results, you should use a transformation (most likely a flatMap) and return the collected values.
def listFiles(stream: PortableDataStream): TraversableOnce[String] = ???
You can learn more from the "Understanding closures" section of the Spark programming guide.
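A minimal sketch of the flatMap idea, assuming the per-archive work is pulled into a pure function that returns its results instead of mutating a shared buffer. The names ZipListing, listJavaFiles and sampleZip are hypothetical and not from the original post; sampleZip exists only so the sketch can be run without Spark.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object ZipListing {
  // Returns the base names (without ".java") of all .java files in a zip stream.
  // Nothing outside the function is mutated, so it is safe to run on any worker.
  def listJavaFiles(in: InputStream): List[String] = {
    val zis = new ZipInputStream(in)
    try {
      Iterator
        .continually(zis.getNextEntry)   // getNextEntry returns null at the end
        .takeWhile(_ != null)
        .filter(e => !e.isDirectory && e.getName.endsWith(".java"))
        .map { e =>
          val name = e.getName
          name.substring(name.lastIndexOf("/") + 1, name.length - ".java".length)
        }
        .toList                          // force evaluation before the stream is closed
    } finally zis.close()
  }

  // Hypothetical helper: builds a small in-memory zip with empty entries.
  def sampleZip(names: Seq[String]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(bytes)
    names.foreach { n =>
      zos.putNextEntry(new ZipEntry(n))
      zos.closeEntry()
    }
    zos.close()
    bytes.toByteArray
  }
}
```

With Spark, the same function could presumably be plugged in as zipFileRDD.flatMap { case (_, pds) => ZipListing.listJavaFiles(pds.open()) }.collect().toList, so that the names travel back to the driver as the result of the transformation rather than through a captured variable.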
So I'm trying to parse a Wavefront .obj file to be displayed with OpenGL ES. The thing is, I'm getting a NullPointerException, as if the file did not exist or was empty.
I tried two different ways of parsing the file, made sure there were no empty lines in it, and put it in different folders (assets, src root, res, etc.), but the result is the same. Maybe the error has more to do with the OpenGL part of the code? I'm kind of lost, because apparently it should work.
I also tried buffering the file outside the function, with the same result. In another question here, the problem had to do with "trying to update UI from worker Thread", but going async did not help me.
I got the code idea from this blog:
And the file to base my work on from here:
fun loadObjFile() {
    try {
        var str: String
        var tmp: Array<String>
        var ftmp: Array<String>
        var v: Float
        val vlist = ArrayList<Float>()
        val nlist = ArrayList<Float>()
        val fplist = ArrayList<Fp>()
        val mContext: Context? = null
        //val inb: BufferedReader = File("androidmodel.obj").bufferedReader()
        //val inputString = inb.use { it.readText() }
        val inb = BufferedReader(InputStreamReader(mContext?.getAssets()?.open("src/main/res/androidmodel.obj")), 1024) //Error is here
        while (inb.readLine().also { str = it } != null) {
            tmp = str.split(" ".toRegex()).toTypedArray()
            //Parse the vertices
            if (tmp[0].equals("v", ignoreCase = true)) {
                for (i in 1..3) {
                    v = tmp[i].toFloat()
                    vlist.add(v)
                }
            }
            //Parse the vertex normals
            if (tmp[0].equals("vn", ignoreCase = true)) {
                for (i in 1..3) {
                    v = tmp[i].toFloat()
                    nlist.add(v)
                }
            }
            //Parse the faces/indices
            if (tmp[0].equals("f", ignoreCase = true)) {
                for (i in 1..3) {
                    ftmp = tmp[i].split("/".toRegex()).toTypedArray()
                    val chi = ftmp[0].toInt() - 1.toLong()
                    var cht = 0
                    if (ftmp[1] != "") cht = ftmp[1].toInt() - 1
                    val chn = ftmp[2].toInt() - 1
                    fplist.add(Fp(chi, cht, chn))
                }
            }
        }
        val vbb = ByteBuffer.allocateDirect(fplist.size * 4 * 3)
        mVertexBuffer = vbb.asFloatBuffer()
        val nbb = ByteBuffer.allocateDirect(fplist.size * 4 * 3)
        mNormBuffer = nbb.asFloatBuffer()
        for (j in fplist.indices) {
            mVertexBuffer?.put(vlist[(fplist[j].Vi * 3).toInt()])
            mVertexBuffer?.put(vlist[(fplist[j].Vi * 3 + 1).toInt()])
            mVertexBuffer?.put(vlist[(fplist[j].Vi * 3 + 2).toInt()])
            mNormBuffer?.put(nlist[fplist[j].Ni * 3])
            mNormBuffer?.put(nlist[fplist[j].Ni * 3 + 1])
            mNormBuffer?.put(nlist[fplist[j].Ni * 3 + 2])
        }
        mIndexBuffer = CharBuffer.allocate(fplist.size)
        for (j in fplist.indices) {
            mIndexBuffer?.put(j.toChar())
        }
    } catch (e: IOException) {
        e.printStackTrace()
    }
}

private class Fp
(var Vi: Long, var Ti: Int, var Ni: Int)
The problem is that you pass null into InputStreamReader. The path to the asset is wrong.
First of all, the file should be located under the assets directory, which sits at the same level in the directory hierarchy as the java and res folders.
Second, you should pass a path relative to the assets directory. So if your file is located directly under assets, the relative path is "androidmodel.obj", and that is the argument to pass to open().
I also strongly recommend checking for non-null, because if mContext is null the issue will return:
mContext?.getAssets()?.open("androidmodel.obj")?.let { nonNullAsset ->
    val inb = BufferedReader(InputStreamReader(nonNullAsset), 1024)
    // parse the file here
}
The ?.let { part is crucial, as it runs the let function only if the object is not null.
If there is no assets directory, just create it as a plain directory and the IDE will pick it up automatically.
As the NPE still occurs, the only reason left is a null value in the mContext variable. Make sure it is initialized.
After a little more digging, I can say that this was the issue from the beginning. Any attempt to pass a wrong file path to open() results in a FileNotFoundException, not an NPE. Thus, even though the path you use is wrong, you never reached the point of opening the file, because the context is null.
I'm trying to generate some test data using a collection and write that data to S3. Flink doesn't seem to do any checkpointing at all when I do this, but it does checkpoint when the source comes from S3.
For example, this DOES checkpoint and leaves output files in a completed state:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStateBackend(new RocksDBStateBackend("s3a://my_bucket/simple_job/rocksdb_checkpoints"))
val lines: DataStream[String] = {
  val path = "s3a://my_bucket/simple_job/in"
  env.readFile(
    inputFormat = new TextInputFormat(new Path(path)),
    filePath = path,
    watchType = FileProcessingMode.PROCESS_CONTINUOUSLY,
    interval = 5000L
  )
}
val sinkFunction: BucketingSink[String] =
  new BucketingSink[String]("s3a://my_bucket/simple_job/out")
    .setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
lines.addSink(sinkFunction)
env.execute()
Meanwhile, this DOES NOT checkpoint, and leaves files in a .pending state even after the job has finished:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStateBackend(new RocksDBStateBackend("s3a://my_bucket/simple_job/rocksdb_checkpoints"))
val lines: DataStream[String] = env.fromCollection((1 to 100).map(_.toString))
val sinkFunction: BucketingSink[String] =
  new BucketingSink[String]("s3a://my_bucket/simple_job/out")
    .setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
lines.addSink(sinkFunction)
env.execute()
It turns out that this is covered by an existing Flink ticket, and it simply comes about because the stream from the collection finishes before the app ever has time to make a single checkpoint.
For the below code, both stream1 and stream2 run fine individually and I can see output, but the joined stream just doesn't log anything at all. I have a feeling it has something to do with the join window, but the data from both streams comes in at almost exactly the same time.
val stream = builder.stream(stringSerde, byteArraySerde, "topic")
val stream1 = stream
.filter((key, value) => somefilter(key, value))
.through(stringSerde, byteArraySerde, "topic1")
val stream2 = stream
.filter((key, value) => someotherfilter(key, value))
.through(stringSerde, byteArraySerde, "topic2")
val joinedStream = stream1
  .join(stream2, (value1: Array[Byte], value2: Array[Byte]) => {
    println("wont print anything")
    return somerandomdata
  },
  stringSerde, byteArraySerde, byteArraySerde)
Shouldn't the keys of both topics be the same in order to join them?
I think the Javadoc explains this:
This might also be an interesting read:
With Java I can create an ExecutorCompletionService with an executor and a bunch of tasks. This class arranges that submitted tasks are, upon completion, placed on a queue accessible using take.
Does Akka have something similar for managing Futures returned by actors?
This answer is for Scala only. In Scala, Future.sequence and Future.firstCompletedOf compose futures: each returns a new future that completes after all, or respectively one, of the underlying futures has completed (which is equivalent to the examples from CompletionService's API docs). This is safer than ecs.take().get() because nothing blocks if you use an onComplete listener; if you still want a blocking wait, use Await.result. So there is no need for a CompletionService, as a list of futures is flexible enough and much safer. Equivalent of the first example:
val solvers: List[() => Int] = ...
val futures = solvers.map(s => Future { s() }) // run execution
(Future sequence futures) onComplete {
  case Success(results) => ... // results: List[Int]
  case Failure(e) => ...
}
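As a runnable illustration of the two combinators, here is a small sketch using only the Scala standard library. The object name FutureComposition and the three constant tasks are assumptions made for the example; the global execution context stands in for whatever context the application uses.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureComposition {
  // Hypothetical tasks standing in for the "solvers" in the answer.
  val solvers: List[() => Int] = List(() => 1, () => 2, () => 3)

  // Starts every task; the returned future completes once all of them have
  // finished, and the results keep the order of the input list.
  def all(): Future[List[Int]] =
    Future.sequence(solvers.map(s => Future(s())))

  // Starts every task; the returned future completes with whichever
  // task happens to finish first.
  def first(): Future[Int] =
    Future.firstCompletedOf(solvers.map(s => Future(s())))
}
```

Await.result(FutureComposition.all(), 5.seconds) blocks the caller until all three results are available; in non-blocking code the same futures would be consumed with onComplete or foreach instead.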
Another example is cancelling the task:
val solvers: List[Future => Int] = ... //some list of functions(tasks), Future is used to check if task was interrupted
val (futures, cancels): //see
(Future firstCompletedOf futures) onComplete { result: Int =>
For Java, Akka provides an adaptation of Scala's futures.
If you just want to process results sequentially as they complete, you can use an actor for that:
val futures: List[Future[Int]] = ...
futures.foreach(_ pipeTo actor) // the actor's mailbox is used as the queue
To model a completion queue's behavior (which is not recommended):
import scala.concurrent._
import duration._
import ExecutionContext.Implicits.global // some execution context
class Queue[T](solvers: Seq[() => T]) extends Iterator[T] {
  case class Result(f: Future[Result], r: T)
  var futures: Set[Future[Result]] = solvers.map { s =>
    lazy val f: Future[Result] = Future { Result(f, s()) }
    f
  }.toSet
  def hasNext = futures.nonEmpty
  def next() = {
    val result = Await.result(Future.firstCompletedOf(futures.toSeq), Duration.Inf)
    futures -= result.f
    result.r
  }
}
scala> val q = new Queue(List(() => 1, () => 2, () => 3, () => 4))
q: Queue[Int] = non-empty iterator
scala> q.next
res14: Int = 2
scala> q.next
res15: Int = 1
scala> q.foreach(println)
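A different way to get completion-order results, sketched here as an assumption rather than as part of the original answer, is to let every future push its own result into a java.util.concurrent.LinkedBlockingQueue when it completes. This is closer to how ExecutorCompletionService behaves and avoids re-scanning the set of pending futures on every call. CompletionQueue is a hypothetical name, and the class is meant for a single consumer thread.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Try

// Hypothetical helper: results become available in completion order.
class CompletionQueue[T](tasks: Seq[() => T])(implicit ec: ExecutionContext) {
  private val done = new LinkedBlockingQueue[Try[T]]()

  // Start every task; each one enqueues its own outcome (success or failure).
  tasks.foreach(t => Future(t()).onComplete(r => done.put(r)))

  // Number of results not yet handed out; only touched by the consumer.
  private var remaining = tasks.size

  def hasNext: Boolean = remaining > 0

  // Blocks until the next task has finished, like ExecutorCompletionService.take.
  def take(): Try[T] = {
    remaining -= 1
    done.take()
  }
}
```

The consumer still blocks in take(), so the earlier caveat applies: when the results can be handled asynchronously, onComplete or an actor mailbox is usually the better fit.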
Maybe this possible solution, which does not use ExecutorCompletionService, will help you:
import java.util.concurrent.atomic.AtomicLong
import java.util.concurrent._
import scala.concurrent.duration._
import scala.util._
import scala.concurrent.{ExecutionContextExecutorService, ExecutionContext, Future}
class BatchedIteratorsFactory[S,R](M: Int, timeout: Duration) {
  implicit val ec = ExecutionContext.fromExecutor(Executors.newCachedThreadPool())
  val throttlingQueue = new LinkedBlockingQueue[Future[R]](M) // Can't put more than M elements to the queue
  val resultQueue = new LinkedBlockingQueue[Try[R]](M)
  val jobCounter = new AtomicLong(0)
  def iterator(input: Iterator[S])(job: S => R): Iterator[Try[R]] = {
    val totalWork = Future(input.foreach { elem =>
      jobCounter.incrementAndGet // one more result is expected
      throttlingQueue.put(Future { job(elem) } andThen {
        case r => resultQueue.put(r); throttlingQueue.poll() // the order is important here!
      })
    })
    new Iterator[Try[R]] {
      override def hasNext: Boolean = jobCounter.get != 0 || input.hasNext
      override def next(): Try[R] = {
        jobCounter.decrementAndGet // one result is being consumed
        Option(resultQueue.poll(timeout.toMillis, TimeUnit.MILLISECONDS)).getOrElse(
          throw new TimeoutException(s"No task has been completed within ${timeout.toMillis} ms!")
        )
      }
    }
  }
}
So you can use it like this:
val job = { (elem: Int) =>
  val result = elem * elem
  Thread.sleep(1000L) // some possible computation...
  result
}
val src = Range(1, 16).toIterator
val it = new BatchedIteratorsFactory[Int, Int](M = 3, timeout = 4.seconds)
  .iterator(src)(job)
My application requires multiple threads that fetch data from various HDFS nodes. For that I am using an executor thread pool and forking threads.
Forking at:
val pathSuffixList = fileStatuses.getOrElse("FileStatus", List[Any]()).asInstanceOf[List[Map[String, Any]]]
pathSuffixList.foreach(block => {
  ConsumptionExecutor.execute(new Consumption(webHdfsUri,block))
})
My class Consumption:
class Consumption(webHdfsUri: String, block:Map[String,Any]) extends Runnable {
  override def run(): Unit = {
    val uriSplit = webHdfsUri.split("\\?")
    val fileOpenUri = uriSplit(0) + "/" + block.getOrElse("pathSuffix", "").toString + "?op=OPEN"
    val inputStream = new URL(fileOpenUri).openStream()
    val datumReader = new GenericDatumReader[Void]()
    val dataStreamReader = new DataFileStream(inputStream, datumReader)
    // val schema = dataStreamReader.getSchema()
    val dataIterator = dataStreamReader.iterator()
    while (dataIterator.hasNext) {
      println(" data : " + dataIterator.next())
    }
  }
}
ConsumptionExecutor:
object ConsumptionExecutor{
  val counter: AtomicLong = new AtomicLong()
  val executionContext: ExecutorService = Executors.newCachedThreadPool(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val thread: Thread = new Thread(r)
      thread.setName("ConsumptionExecutor-" + counter.incrementAndGet())
      thread
    }
  })
  def execute(trigger: Runnable): Unit = {
    executionContext.execute(trigger)
  }
}
However, I want to use Akka Streams or Akka actors, so that I don't need to specify a fixed thread pool size and Akka takes care of everything.
I am pretty new to Akka and to the concepts of streaming and actors. Can someone give me any leads, in the form of sample code, that fit my use case?
Thanks in advance!
An idea would be to create a (subclass) instance of ActorPublisher for each HDFS node that you are reading from, and then Merge them in as multiple Sources in a FlowGraph.
Something like this pseudo-code, where the details of the ActorPublisher sources are left out:
val g = PartialFlowGraph { implicit b =>
  import FlowGraphImplicits._
  val in1 = actorSource1
  val in2 = actorSource2
  // etc.
  val out = UndefinedSink[T]
  val merge = Merge[T]
  in1 ~> merge ~> out
  in2 ~> merge
  // etc.
}
This can be improved for a collection of actor sources by just iterating over them and adding an edge to the merge for each one, but this gives the idea.