For the code below, both stream1 and stream2 run fine individually and I can see their output, but the joined stream doesn't log anything at all. I suspect it has something to do with the join window, but the data from both streams arrives at almost exactly the same time.
val stream = builder.stream(stringSerde, byteArraySerde, "topic")
val stream1 = stream
.filter((key, value) => somefilter(key, value))
.through(stringSerde, byteArraySerde, "topic1")
val stream2 = stream
.filter((key, value) => someotherfilter(key, value))
.through(stringSerde, byteArraySerde, "topic2")
val joinedStream = stream1
.join(stream2, (value1: Array[Byte], value2: Array[Byte]) => {
  println("won't print anything")
  somerandomdata // `return` is not valid inside a lambda here; just yield the value
},
JoinWindows.of("othertopic").within(10000L),
stringSerde, byteArraySerde, byteArraySerde)
Shouldn't the keys of both topics be the same in order to join them?
I think the Javadoc explains this:
https://kafka.apache.org/0102/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html
This might also be an interesting read:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics
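For illustration only, here is a minimal sketch, in the same old-style (0.10.x) API as the question, of a join where the keys are aligned; deriveJoinKey is hypothetical, and the point is that the ValueJoiner only fires for records whose keys are equal and whose timestamps fall within the window:
// Sketch (not the asker's exact code): re-key one side so both streams share the join key.
val rekeyedStream2 = stream2.selectKey[String]((key, value) => deriveJoinKey(key, value)) // deriveJoinKey is hypothetical

val joined = stream1.join(
  rekeyedStream2,
  (v1: Array[Byte], v2: Array[Byte]) => v1 ++ v2,   // the joiner only runs for matching keys within the window
  JoinWindows.of("join-window").within(10000L),     // same old-style window API as above
  stringSerde, byteArraySerde, byteArraySerde)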
I have tens of thousands of files in a directory demotxt, like:
demotxt/
    aa.txt
        this is aaa1
        this is aaa2
        this is aaa3
    bb.txt
        this is bbb1
        this is bbb2
        this is bbb3
        this is bbb4
    cc.txt
        this is ccc1
        this is ccc2
I would like to efficiently compute a word count for each .txt file in this directory with Spark 2.4 (Scala or Python).
# target result is:
aa.txt: (this,3), (is,3), (aaa1,1), (aaa2,1), (aaa3,1)
bb.txt: (this,4), (is,4), (bbb1,1), (bbb2,1), (bbb3,1), (bbb4,1)
cc.txt: (this,2), (is,2), (ccc1,1), (ccc2,1)
The code might look something like this:
def dealWithOneFile(path2File):
    res = wordcountFor(path2File)
    saveResultToDB(res)

sc.wholeTextFiles(rootDir).map(dealWithOneFile)
It seems that with sc.textFile(".../demotxt/") Spark will load all the files at once, which may cause memory issues; it also treats all the files as one, which is not what I want.
So I wonder how should I do this? Many Thanks!
Here is an approach; it can be done with either the DataFrame or the RDD API. Since you don't state a preference, I show the RDD version, in Scala, run on Databricks.
It's hard to explain, but it works; try it with some input.
%scala
// These are usually already in scope in a Databricks notebook, but are needed elsewhere.
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._

val paths = Seq("/FileStore/tables/fff_1.txt", "/FileStore/tables/fff_2.txt")
// Tag every line with the file it came from, then drop down to an RDD of (file, line).
val rdd = spark.read.format("text").load(paths: _*).select(input_file_name, $"value").as[(String, String)].rdd
// Split each line into words and count occurrences per (file, word) pair.
val rdd2 = rdd.flatMap(x => x._2.split("\\s+").map(y => ((x._1, y), 1)))
val rdd3 = rdd2.reduceByKey(_ + _).map { case ((file, word), count) => (file, (word, count)) }
rdd3.collect
// Group so that each file maps to all of its (word, count) pairs.
val rdd4 = rdd3.groupByKey()
rdd4.collect
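For comparison, below is a DataFrame-only sketch of the same idea (my own addition, not part of the answer above; it assumes the same paths and Spark 2.4):
import org.apache.spark.sql.functions.{input_file_name, split, explode}
import spark.implicits._

// Read all files as lines tagged with their source file, explode into words,
// and count per (file, word) pair.
val linesDf = spark.read.format("text").load(paths: _*)
  .select(input_file_name().as("file"), $"value")

val countsDf = linesDf
  .select($"file", explode(split($"value", "\\s+")).as("word"))
  .groupBy($"file", $"word")
  .count()

countsDf.show(false)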
I am trying to merge values in a Kafka Stream by grouping them by key, windowing them by session (30 seconds of inactivity), aggregating, and producing to a new topic. However, within the same window the code produces multiple aggregated records, some with null values and some with only the first few aggregated values. I would expect the aggregation to produce only one output at the end of the session window. I also tried using suppress, but it didn't help. Can someone point out what I did wrong here?
Here is my code:
KStream<String, EventLog> elogStream = builder.stream("eventlognw", Consumed.with(stringSerde, eventLogSerde));
elogStream.groupByKey()
.windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofSeconds(30)))
.aggregate(() -> EventLogList.newBuilder().setEventLogItems(new ArrayList<EventLog>()).build(), (key, value, wradAggregate) -> {
List<EventLog> elogList = wradAggregate.getEventLogItems();
if(null == elogList || elogList.isEmpty()) {
elogList = new ArrayList<>();
}
elogList.add(value);
wradAggregate.setEventLogItems(elogList);
log.info("***wradAggregate={}", null != wradAggregate ? wradAggregate.toString() : null);
return wradAggregate;
}, (aggKey, aggOne, aggTwo) -> {
List<EventLog> elogList2 = aggTwo.getEventLogItems();
elogList2.removeAll(aggOne.getEventLogItems());
aggOne.setEventLogItems(elogList2);
log.info("***aggOne={}", null != aggOne ? aggOne.toString() : null);
return aggOne;
})
.suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
//.suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(30), BufferConfig.unbounded()))
.toStream()
.map((k,v) -> KeyValue.pair(k.key(), v))
.peek((k, v) -> log.info("****After applying GroupBy and Aggregate on elogStream, key(reqId): {}, value(eventLog): {}", k, v))
//.to("eventlogagg")
Below is code for getting the list of file names in a zipped file:
def getListOfFilesInRepo(zipFileRDD : RDD[(String,PortableDataStream)]) : (List[String]) = {
val zipInputStream = zipFileRDD.values.map(x => new ZipInputStream(x.open))
val filesInZip = new ArrayBuffer[String]()
var ze : Option[ZipEntry] = None
zipInputStream.foreach(stream =>{
do{
ze = Option(stream.getNextEntry);
ze.foreach{ze =>
if(ze.getName.endsWith("java") && !ze.isDirectory()){
var fileName:String = ze.getName.substring(ze.getName.lastIndexOf("/")+1,ze.getName.indexOf(".java"))
filesInZip += fileName
}
}
stream.closeEntry()
} while(ze.isDefined)
println(filesInZip.toList.length) // print 889 (correct)
})
println(filesInZip.toList.length) // print 0 (WHY..?)
(filesInZip.toList)
}
I execute the above code in the following manner:
scala> val zipFileRDD = sc.binaryFiles("./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip")
zipFileRDD: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = ./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip BinaryFileRDD[17] at binaryFiles at <console>:25
scala> getListOfFilesInRepo(zipFileRDD)
889
0
res12: List[String] = List()
Why am I getting 0 instead of 889?
It happens because filesInZip is not shared between workers. foreach operates on a local copy of filesInZip, and when it finishes that copy is simply discarded and garbage collected. If you want to keep the results, you should use a transformation (most likely a flatMap) and return the collected, aggregated values.
def listFiles(stream: PortableDataStream): TraversableOnce[String] = ???
zipInputStream.flatMap(listFiles)
You can learn more from the Understanding closures section of the Spark programming guide.
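For illustration, one possible way to fill in that stub, reusing the asker's own loop over the zip entries (my sketch, not part of the original answer):
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical implementation of the listFiles stub above: walk the zip entries of one
// archive and return the names of the .java files it contains.
def listFiles(stream: PortableDataStream): TraversableOnce[String] = {
  val zis = new ZipInputStream(stream.open)
  val names = ArrayBuffer.empty[String]
  var entry = Option(zis.getNextEntry)
  while (entry.isDefined) {
    entry.foreach { ze =>
      if (ze.getName.endsWith(".java") && !ze.isDirectory)
        names += ze.getName.substring(ze.getName.lastIndexOf("/") + 1, ze.getName.indexOf(".java"))
    }
    zis.closeEntry()
    entry = Option(zis.getNextEntry)
  }
  zis.close()
  names
}

// Because flatMap is a transformation, the collected result comes back to the driver:
val fileNames = zipFileRDD.values.flatMap(listFiles).collect().toList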
I'm trying to take the JSON output of an analysis tool, and turn it into a Java list.
I'm doing this with Scala, and while it doesn't sound that hard, I've run into trouble.
So I was hoping the following code would do it:
def returnWarnings(json: JSONObject): java.util.List[String] = {
  var bugs = new ArrayBuffer[String]
  for (warning <- json.getJSONArray("warnings")) { //might need to add .asScala
    bugs += (warning.getString("warning_type") + ": " + warning.getString("message"))
  }
  val rval: java.util.List[String] = bugs.asJava
  rval
}
This block produces two errors when I try to compile it:
Error:(18, 42) value foreach is not a member of org.json.JSONArray
for (warning <- json.getJSONArray("warnings")){ //might need to add .asScala
^
and
Error:(21, 49) value asJava is not a member of scala.collection.mutable.ArrayBuffer[String]
val rval: java.util.List[String] = bugs.asJava
^
I don't know what's wrong with my for loop.
EDIT: with a bit more reading, I figured out what was up with the loop. see https://stackoverflow.com/a/6376083/5843840
The second error is especially baffling because, as far as I can tell, it should work. It is really similar to the code from this documentation:
scala> val jul: java.util.List[Int] = ArrayBuffer(1, 2, 3).asJava
jul: java.util.List[Int] = [1, 2, 3]
You should try the following:
import scala.collection.JavaConverters._
def returnWarnings(input: JSONObject): java.util.List[String] = {
val warningsArray = input.getJSONArray("warnings")
val output = (0 until warningsArray.length).map { i =>
val warning = warningsArray.getJSONObject(i)
warning.getString("warning_type") + ": " + warning.getString("message")
}
output.asJava
}
That final conversion could be done implicitly (without invoking .asJava), by importing scala.collection.JavaConversions._
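For reference, a sketch of that implicit variant (the name returnWarningsImplicit is mine; note that JavaConversions has been deprecated since Scala 2.12, so the explicit .asJava form above is generally preferred):
import org.json.JSONObject
import scala.collection.JavaConversions._

def returnWarningsImplicit(input: JSONObject): java.util.List[String] = {
  val warnings = input.getJSONArray("warnings")
  // The Scala IndexedSeq produced by map is converted to java.util.List implicitly.
  (0 until warnings.length).map { i =>
    val warning = warnings.getJSONObject(i)
    warning.getString("warning_type") + ": " + warning.getString("message")
  }
}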
With Java I can create an ExecutorCompletionService with an executor and a bunch of tasks. This class arranges that submitted tasks are, upon completion, placed on a queue accessible using take.
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
Does Akka have something similar for managing Futures returned by actors?
This answer is for Scala only. In Scala there are Future.sequence/Future.firstCompletedOf to compose futures; they return a new future that completes after all/one of the underlying futures has completed (which is equivalent to the examples in the CompletionService API docs). Such a solution is safer than ecs.take().get(), as there is no blocking if you use a completion callback; if you still want a blocking waiter, use Await.result. So there is no need for a CompletionService: a list of futures is flexible enough and much safer. Equivalent of the first example:
val solvers: List[() => Int] = ...
val futures = solvers.map(s => Future { s() }) // start execution
Future.sequence(futures).foreach { results =>  // runs once all of the futures have completed
  results.map(use)
}
Another example is cancelling the remaining tasks once the first one completes:
val solvers: List[Future[Int] => Int] = ... // some list of functions (tasks); the Future is used to check whether the task was interrupted
val (futures, cancels) = solvers.map(cancellableFuture).unzip // see https://stackoverflow.com/questions/16020964/cancellation-with-future-and-promise-in-scala
Future.firstCompletedOf(futures).foreach { result =>
  cancels.foreach(_()) // cancel the tasks that are still running
  use(result)
}
As for Java, Akka has an adaptation of Scala's futures: http://doc.akka.io/docs/akka/snapshot/java/futures.html
If you just want to process results sequentially as they complete, you can use an actor for that:
val futures: List[Future[Int]] = ...
futures.foreach(_ pipeTo actor) // the actor's mailbox is used as the completion queue
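A minimal sketch of such a receiving actor (my own illustration, assuming results of type Int); pipeTo delivers each future's result, or a Status.Failure, to the mailbox as it completes:
import akka.actor.{Actor, ActorSystem, Props, Status}
import akka.pattern.pipe
import scala.concurrent.Future

class ResultProcessor extends Actor {
  def receive = {
    case result: Int         => println(s"completed: $result")     // results arrive in completion order
    case Status.Failure(err) => println(s"failed: ${err.getMessage}")
  }
}

val system = ActorSystem("completion")
import system.dispatcher // execution context for the futures and for pipeTo
val processor = system.actorOf(Props[ResultProcessor], "results")

val futures: List[Future[Int]] = List(Future(1), Future(2), Future(3))
futures.foreach(_ pipeTo processor) // the mailbox plays the role of the completion queue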
To model the completion queue's behavior (which is not recommended):
import scala.concurrent._
import duration._
import scala.concurrent.ExecutionContext.Implicits.global //some execution context
class Queue[T](solvers: Seq[() => T]) extends Iterator[T] {
  // Each future carries a reference to itself so it can be removed from the set once consumed.
  case class Result(f: Future[Result], r: T)
  var futures: Set[Future[Result]] = solvers.map { s =>
    lazy val f: Future[Result] = Future { Result(f, s()) }
    f
  }.toSet
  def hasNext = futures.nonEmpty
  def next() = {
    val result = Await.result(Future.firstCompletedOf(futures.toSeq), Duration.Inf)
    futures -= result.f
    result.r
  }
}
scala> val q = new Queue(List(() => 1, () => 2, () => 3, () => 4))
q: Queue[Int] = non-empty iterator
scala> q.next
res14: Int = 2
scala> q.next
res15: Int = 1
scala> q.foreach(println)
4
3
Maybe this possible solution, without using ExecutorCompletionService, will help you:
import java.util.concurrent.atomic.AtomicLong
import java.util.concurrent._
import scala.concurrent.duration._
import scala.util._
import scala.concurrent.{ExecutionContextExecutorService, ExecutionContext, Future}
class BatchedIteratorsFactory[S,R](M: Int, timeout: Duration) {
implicit val ec = ExecutionContext.fromExecutor(Executors.newCachedThreadPool())
val throttlingQueue = new LinkedBlockingQueue[Future[R]](M) // Can't put more than M elements to the queue
val resultQueue = new LinkedBlockingQueue[Try[R]](M)
val jobCounter = new AtomicLong(0)
def iterator(input: Iterator[S])(job: S => R): Iterator[Try[R]] = {
val totalWork = Future(input.foreach { elem =>
jobCounter.incrementAndGet
throttlingQueue.put(Future { job(elem) } andThen {
case r => resultQueue.put(r); throttlingQueue.poll() // the order is important here!
})
})
new Iterator[Try[R]] {
override def hasNext: Boolean = jobCounter.get != 0 || input.hasNext
override def next(): Try[R] = {
jobCounter.decrementAndGet
Option(resultQueue.poll(timeout.toMillis, TimeUnit.MILLISECONDS)).getOrElse(
throw new TimeoutException(s"No task has been completed within ${timeout.toMillis} ms!")
)
}
}
}
}
So you can use it like this:
val job = { (elem: Int) =>
val result = elem * elem
Thread.sleep(1000L) // some possible computation...
result
}
val src = Range(1, 16).toIterator
val it = new BatchedIteratorsFactory[Int, Int](M = 3, timeout = 4 seconds)
.iterator(src)(job)
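For completeness, one way to consume the resulting iterator (each element is a Try, delivered roughly in completion order):
import scala.util.{Failure, Success}

it.foreach {
  case Success(value) => println(s"completed: $value")           // the squared number
  case Failure(error) => println(s"failed: ${error.getMessage}") // an exception thrown by the job itself
}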