How to convert TextIO exceptions into failures?
Sometimes when I use TextIO.read() I get:
org.apache.beam.sdk.Pipeline$PipelineExecutionException:
java.io.FileNotFoundException: No files matched spec:
src/test/resources/config/qqqqqqq
How can I separate these exceptions into an independent list of failures?
For example, in this code I have a file containing a list of other files, and I need to read all lines from all of those files as one list:
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);
PCollection<String> lines = pipeline
.apply(TextIO.read().from("src/test/resources/config/W-PSFV-LOG-FILE-2022-05-16_23-59-59.txt"))
.apply(MapElements.into(TypeDescriptors.strings()).via(line -> "src/test/resources/config/" + line))
.apply(TextIO.readAll());
lines.apply(Log.ofElements());
pipeline.run();
But if one of the files is broken, it throws FileNotFoundException and stops. I do not want it to stop; I want to get a list of all existing files and a list of the broken files.
I think you can use a dead letter queue to solve your problem.
Beam natively proposes error handling with TupleTags, or with the exceptionsInto and exceptionsVia methods of MapElements.
These return a WithFailures.Result structure with a PCollection of good outputs and a PCollection of failures.
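For example, here is a minimal sketch of the native approach (reusing the file-prefix mapping from the question; the KV failure type is just one possible choice):
WithFailures.Result<PCollection<String>, KV<String, String>> result = lines
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((String line) -> "src/test/resources/config/" + line)
        // Route any exception thrown by the mapping into a failures PCollection
        .exceptionsInto(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
        .exceptionsVia(ee -> KV.of(ee.element(), ee.exception().getMessage())));
PCollection<String> output = result.output();
PCollection<KV<String, String>> failures = result.failures();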
You can also use a library called Asgarde:
https://github.com/tosun-si/asgarde
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);
PCollection<String> lines = pipeline
.apply(TextIO.read().from("src/test/resources/config/W-PSFV-LOG-FILE-2022-05-16_23-59-59.txt"));
WithFailures.Result<PCollection<String>, Failure> result = CollectionComposer.of(lines)
    .apply(MapElements.into(TypeDescriptors.strings()).via(line -> "src/test/resources/config/" + line))
    .getResult();
// Gets outputs and Failure PCollections.
PCollection<String> output = result.output();
PCollection<Failure> failures = result.failures();
// Then you can sink your Failures in database, GCS file or topic if needed...
// ...
pipeline.run();
The Failure object is proposed by the Asgarde library; it carries the current input element as a String and the exception:
public class Failure implements Serializable {
    private final String pipelineStep;
    private final String inputElement;
    private final Throwable exception;
    // ...
}
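As a hedged sketch of one possible sink (the serialization and output path here are assumptions, not part of the Asgarde API), you could persist the failures as text lines:
failures
    .apply("FormatFailures", MapElements.into(TypeDescriptors.strings())
        .via(failure -> String.valueOf(failure))) // assumes toString() carries enough detail
    .apply(TextIO.write().to("failures/failure")); // hypothetical output prefix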
If you want to use this code, you have to import the Asgarde library, for example with Maven in your pom.xml file:
<dependency>
<groupId>fr.groupbees</groupId>
<artifactId>asgarde</artifactId>
<version>0.19.0</version>
</dependency>
or with Gradle:
implementation group: 'fr.groupbees', name: 'asgarde', version: '0.19.0'
PS: I am the creator of the Asgarde library. The README of the project shows many examples of applying a dead letter queue with native Beam and with the Asgarde library.
Don't hesitate to read the README file of the project: https://github.com/tosun-si/asgarde
You can use FileIO first to split the files into readable-existing and non-existing files.
PCollection<KV<String, String>> categoryAndFiles = p
.apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
// withCompression can be omitted - by default compression is detected from the filename.
.apply(FileIO.readMatches().withCompression(GZIP))
.apply(MapElements
// uses imports from TypeDescriptors
.into(kvs(strings(), strings()))
.via((ReadableFile f) -> {
try {
f.open(); // just try to open: if this succeeds, the file exists and is readable
return KV.of(
"readable-existing",
f.getMetadata().resourceId().toString());
} catch (IOException ex) {
return KV.of(
"non-existing",
f.getMetadata().resourceId().toString());
}
}));
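If you then want the two lists separately, one possible follow-up (a sketch building on the categoryAndFiles collection above) is to filter on the category key:
PCollection<String> existingFiles = categoryAndFiles
    .apply("KeepReadable", Filter.by((KV<String, String> kv) -> "readable-existing".equals(kv.getKey())))
    .apply(MapElements.into(strings()).via(KV::getValue));
PCollection<String> brokenFiles = categoryAndFiles
    .apply("KeepBroken", Filter.by((KV<String, String> kv) -> "non-existing".equals(kv.getKey())))
    .apply(MapElements.into(strings()).via(KV::getValue));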
Adapted from an example.
@Rule
public transient TestPipeline pipeline = TestPipeline.create();
@Test
public void testTransformWithIOException3() throws FileNotFoundException {
WithFailures.Result<PCollection<String>, Failure> result = pipeline
    .apply("Read", Create.of("1", "2", "3")) // PCollection<String>
    .apply("Map", new PTransform<PCollection<String>, WithFailures.Result<PCollection<String>, Failure>>() {
        @Override
public WithFailures.Result<PCollection<String>, Failure> expand(PCollection<String> input) {
return CollectionComposer.of(input)
.apply("//", MapElements.into(TypeDescriptors.strings()).via(s -> {
try {
if (s.equals("2")) throw new FileNotFoundException();
else
return s.toString();
} catch(Exception e){
throw new RuntimeException("error ");
}
}))
.getResult();
}
});
result.output().apply("out",
    MapElements.into(TypeDescriptors.strings()).via(x -> {
        System.out.println(x);
        return x;
    }));
result.failures().apply("failures",
MapElements.into(TypeDescriptors.strings()).via(x -> {
System.out.println(x);
return x.toString();
}));
pipeline.run().waitUntilFinish();
}
Related
I'm still new to the Flink CEP library, and I don't yet understand the pattern detection behavior.
In the example below, I have a Flink app that consumes data from a Kafka topic; data is produced periodically. I want to use a Flink CEP pattern to detect when a value is bigger than a given threshold.
The code is below:
public class CEPJob{
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(),
properties);
consumer.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
DataStream<String> stream = env.addSource(consumer);
// Process incoming data.
DataStream<Stock> inputEventStream = stream.map(new MapFunction<String, Stock>() {
private static final long serialVersionUID = -491668877013085114L;
@Override
public Stock map(String value) {
String[] data = value.split(":");
System.out.println("Date: " + data[0] + ", Adj Close: " + data[1]);
Stock stock = new Stock(data[0], Double.parseDouble(data[1]));
return stock;
}
});
// Create the pattern
Pattern<Stock, ?> myPattern = Pattern.<Stock>begin("first").where(new SimpleCondition<Stock>() {
private static final long serialVersionUID = -6301755149429716724L;
@Override
public boolean filter(Stock value) throws Exception {
return (value.getAdj_Close() > 140.0);
}
});
// Create a pattern stream from our warning pattern
PatternStream<Stock> myPatternStream = CEP.pattern(inputEventStream, myPattern);
// Generate alert for each matched pattern
DataStream<Stock> warnings = myPatternStream.select((Map<String, List<Stock>> pattern) -> {
Stock first = pattern.get("first").get(0);
return first;
});
warnings.print();
env.execute("CEP job");
}
}
When I run the job, pattern detection doesn't happen in real time: the warning for the pattern detected on the current record is only output after the next record is produced. It looks like printing the warning to the log is delayed, and I really don't understand how to make it output the warning at the moment it detects the pattern, without waiting for the next record. Thank you :)
Data coming from Kafka is in the string format "date:value"; a record is produced every 5 seconds.
Java version: 1.8, Scala version: 2.11.12, Flink version: 1.12.2, Kafka version: 2.3.0
The solution I found is to send a fake record (a null object, for example) to the Kafka topic every time I produce a value, and on the Flink side (in the pattern declaration) to test whether the received record is fake or not.
It seems that FlinkCEP always waits for the upcoming event before it outputs the warning. This is consistent with event-time processing: with forMonotonousTimestamps, the watermark only advances when a newer event arrives, so CEP holds a match until the watermark passes its timestamp.
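For illustration, a sketch of that check in the pattern condition (the isFake() flag on Stock is hypothetical; you would have to add it for this workaround):
Pattern<Stock, ?> myPattern = Pattern.<Stock>begin("first").where(new SimpleCondition<Stock>() {
    @Override
    public boolean filter(Stock value) {
        // Skip the fake heartbeat records; they exist only to advance the watermark
        return !value.isFake() && value.getAdj_Close() > 140.0;
    }
});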
I am trying to broadcast from one source to 2 sinks in Java. I got stuck in between; any pointer will be helpful.
public static void main(String[] args) {
ActorSystem system = ActorSystem.create("GraphBasics");
ActorMaterializer materializer = ActorMaterializer.create(system);
final Source<Integer, NotUsed> source = Source.range(1, 1000);
Sink<Integer,CompletionStage<Done>> firstSink = Sink.foreach(x -> System.out.println("first sink "+x));
Sink<Integer,CompletionStage<Done>> secondsink = Sink.foreach(x -> System.out.println("second sink "+x));
RunnableGraph.fromGraph(
GraphDSL.create(
b -> {
UniformFanOutShape<Integer, Integer> bcast = b.add(Broadcast.create(2));
b.from(b.add(source)).viaFanOut(bcast).to(b.add(firstSink)).to(b.add(secondsink));
return ClosedShape.getInstance();
}))
.run(materializer);
}
I am not that familiar with the Java API for akka-stream graphs, so I used the official doc. There are 2 errors in your snippet:
When you add the source to the graph builder, you need to get an Outlet from it. So instead of b.from(b.add(source)) there should be something like this: b.from(b.add(source).out()), according to the official doc.
You can't just call the .to method twice in a row, because .to expects something with a Sink shape, which means a kind of dead end. Instead you need to attach the 2nd sink to the bcast directly, like this:
(...).viaFanOut(bcast).to(b.add(firstSink));
b.from(bcast).to(b.add(secondSink));
All in all, the code should look like this:
ActorSystem system = ActorSystem.create("GraphBasics");
ActorMaterializer materializer = ActorMaterializer.create(system);
final Source<Integer, NotUsed> source = Source.range(1, 1000);
Sink<Integer, CompletionStage<Done>> firstSink = Sink.foreach(x -> System.out.println("first sink " + x));
Sink<Integer, CompletionStage<Done>> secondSink = Sink.foreach(x -> System.out.println("second sink " + x));
RunnableGraph.fromGraph(
GraphDSL.create(b -> {
UniformFanOutShape<Integer, Integer> bcast = b.add(Broadcast.create(2));
b.from(b.add(source).out()).viaFanOut(bcast).to(b.add(firstSink));
b.from(bcast).to(b.add(secondSink));
return ClosedShape.getInstance();
}
)
).run(materializer);
Final note: I would think twice about whether it makes sense to use the graph API. If your case is as simple as this one (just 2 sinks), you might want to just use alsoTo or alsoToMat. They let you attach multiple sinks to the flow without the need to use graphs.
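For example, a minimal sketch with alsoTo, reusing the source and sinks defined above:
// Attach firstSink as a side branch, then run the stream into secondSink
source.alsoTo(firstSink).runWith(secondSink, materializer);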
// Unit of logic I want to run in parallel
public PagesDTO convertOCRStreamToDTO(String pageId, Integer pageSequence) throws Exception {
LOG.info("Get OCR begin for pageId [{}] thread name {}",pageId, Thread.currentThread().getName());
OcrContent ocrContent = getOcrContent(pageId);
OcrDTO ocrData = populateOCRData(ocrContent.getInputStream());
PagesDTO pageDTO = new PagesDTO(pageId, pageSequence.toString(), ocrData);
return pageDTO;
}
Logic to execute convertOCRStreamToDTO(..) in parallel, then collect its results when each individual thread's execution is done:
List<PagesDTO> pageDTOList = new ArrayList<>();
//javadoc: Creates a work-stealing thread pool using all available processors as its target parallelism level.
ExecutorService newWorkStealingPool = Executors.newWorkStealingPool();
Instant start = Instant.now();
List<CompletableFuture<PagesDTO>> pendingTasks = new ArrayList<>();
List<CompletableFuture<PagesDTO>> completedTasks = new ArrayList<>();
CompletableFuture<PagesDTO> task = null;
for (InputPageDTO dcInputPageDTO : dcReqDTO.getPages()) {
String pageId = dcInputPageDTO.getPageId();
task = CompletableFuture
.supplyAsync(() -> {
try {
return convertOCRStreamToDTO(pageId, pageSequence.getAndIncrement());
} catch (HttpHostConnectException | ConnectTimeoutException e) {
LOG.error("Error connecting to Redis for pageId [{}]", pageId, e);
CaptureException e1 = new CaptureException(Error.getErrorCodes().get(ErrorCodeConstants.REDIS_CONNECTION_FAILURE),
" Connecting to the Redis failed while getting OCR for pageId ["+pageId +"] " + e.getMessage(), CaptureErrorComponent.REDIS_CACHE, e);
exceptionMap.put(pageId,e1);
} catch (CaptureException e) {
LOG.error("Error in Document Classification Engine Service while getting OCR for pageId [{}]",pageId,e);
exceptionMap.put(pageId,e);
} catch (Exception e) {
LOG.error("Error getting OCR content for the pageId [{}]", pageId,e);
CaptureException e1 = new CaptureException(Error.getErrorCodes().get(ErrorCodeConstants.TECHNICAL_FAILURE),
"Error while getting ocr content for pageId : ["+pageId +"] " + e.getMessage(), CaptureErrorComponent.REDIS_CACHE, e);
exceptionMap.put(pageId,e1);
}
return null;
}, newWorkStealingPool);
//collect all async tasks
pendingTasks.add(task);
}
//TODO: How can I avoid the unnecessary looping happening here just for the sake of waiting for the future tasks to complete?
//TODO: Looking for the best solution
while(pendingTasks.size() > 0) {
for(CompletableFuture<PagesDTO> futureTask: pendingTasks) {
if(futureTask != null && futureTask.isDone()){
completedTasks.add(futureTask);
pageDTOList.add(futureTask.get());
}
}
pendingTasks.removeAll(completedTasks);
}
//Throw the exception caught while converting the OCR stream to DTO - for any of the pageIds
for(InputPageDTO dcInputPageDTO : dcReqDTO.getPages()) {
if(exceptionMap.containsKey(dcInputPageDTO.getPageId())) {
CaptureException e = exceptionMap.get(dcInputPageDTO.getPageId());
throw e;
}
}
LOG.info("Parallel processing time taken for {} pages = {}", dcReqDTO.getPages().size(),
org.springframework.util.StringUtils.deleteAny(Duration.between(Instant.now(), start).toString().toLowerCase(), "pt-"));
Please look at the TODO items in my code above. I have two concerns I am looking for advice on:
1) I want to avoid the unnecessary looping (happening in the while loop above). What is the best way to wait for all threads to complete their async execution and then collect the results? Does anybody have advice?
2) The ExecutorService instance is created at my service bean class level, on the thinking that it will be re-used for every request, instead of creating it local to the method and shutting it down in finally. Am I doing right here, or is there any correction to my thought process?
Simply remove the while and the if and you are good:
for(CompletableFuture<PagesDTO> futureTask: pendingTasks) {
completedTasks.add(futureTask);
pageDTOList.add(futureTask.get());
}
get() (as well as join()) will wait for the future to complete before returning a value. Also, there is no need to test for null, since your list will never contain any.
You should, however, probably change the way you handle exceptions. CompletableFuture has a specific mechanism for handling exceptions and rethrowing them when get()/join() is called. You might simply want to wrap your checked exceptions in CompletionException.
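A minimal sketch of both points, assuming the same variables as in the question:
// Let failures propagate through the future instead of a side map:
task = CompletableFuture.supplyAsync(() -> {
    try {
        return convertOCRStreamToDTO(pageId, pageSequence.getAndIncrement());
    } catch (Exception e) {
        throw new CompletionException(e); // rethrown when get()/join() is called
    }
}, newWorkStealingPool);
// After all tasks are submitted, collect results without a busy-wait loop:
List<PagesDTO> pageDTOList = pendingTasks.stream()
    .map(CompletableFuture::join) // blocks until each future completes
    .collect(Collectors.toList());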
I am writing the following (with Scala 2.10 and Java 6):
import java.io._
def delete(file: File) {
if (file.isDirectory)
Option(file.listFiles).map(_.toList).getOrElse(Nil).foreach(delete(_))
file.delete
}
How would you improve it? The code seems to work, but it ignores the return value of java.io.File.delete. Can it be done more easily with scala.io instead of java.io?
With a pure Scala + Java approach:
import scala.reflect.io.Directory
import java.io.File
val directory = new Directory(new File("/sampleDirectory"))
directory.deleteRecursively()
deleteRecursively() returns false on failure.
Try this code that throws an exception if it fails:
def deleteRecursively(file: File): Unit = {
if (file.isDirectory) {
file.listFiles.foreach(deleteRecursively)
}
if (file.exists && !file.delete) {
throw new Exception(s"Unable to delete ${file.getAbsolutePath}")
}
}
You could also fold or map over the delete if you want to return a value for all the deletes.
Using Scala IO:
import java.io.IOException
import scalax.file.Path
val path = Path.fromString("/tmp/testfile")
try {
path.deleteRecursively(continueOnFailure = false)
} catch {
case e: IOException => // some file could not be deleted
}
or better, you could use a Try
val path: Path = Path ("/tmp/file")
Try(path.deleteRecursively(continueOnFailure = false))
which will either result in a Success[Int] containing the number of files deleted, or a Failure[IOException].
From http://alvinalexander.com/blog/post/java/java-io-faq-how-delete-directory-tree:
Using Apache Commons IO:
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
public void deleteDirectory(String directoryName) throws IOException {
    try {
        FileUtils.deleteDirectory(new File(directoryName));
    } catch (IOException ioe) {
        // log the exception here
        ioe.printStackTrace();
        throw ioe;
    }
}
The Scala one can just do this...
import java.io.File
import org.apache.commons.io.FileUtils
FileUtils.deleteDirectory(new File(outputFile))
Using the Java NIO.2 API:
import java.io.IOException
import java.nio.file.{Files, Paths, Path, SimpleFileVisitor, FileVisitResult}
import java.nio.file.attribute.BasicFileAttributes
def remove(root: Path): Unit = {
Files.walkFileTree(root, new SimpleFileVisitor[Path] {
override def visitFile(file: Path, attrs: BasicFileAttributes): FileVisitResult = {
Files.delete(file)
FileVisitResult.CONTINUE
}
override def postVisitDirectory(dir: Path, exc: IOException): FileVisitResult = {
Files.delete(dir)
FileVisitResult.CONTINUE
}
})
}
remove(Paths.get("/tmp/testdir"))
Really, it's a pity that the NIO.2 API has been with us for so many years and yet so few people use it, even though it is really superior to the old File API.
Expanding on Vladimir Matveev's NIO2 solution:
object Util {
import java.io.IOException
import java.nio.file.{Files, Paths, Path, SimpleFileVisitor, FileVisitResult}
import java.nio.file.attribute.BasicFileAttributes
def remove(root: Path, deleteRoot: Boolean = true): Unit =
Files.walkFileTree(root, new SimpleFileVisitor[Path] {
override def visitFile(file: Path, attributes: BasicFileAttributes): FileVisitResult = {
Files.delete(file)
FileVisitResult.CONTINUE
}
override def postVisitDirectory(dir: Path, exception: IOException): FileVisitResult = {
if (deleteRoot) Files.delete(dir)
FileVisitResult.CONTINUE
}
})
def removeUnder(string: String): Unit = remove(Paths.get(string), deleteRoot=false)
def removeAll(string: String): Unit = remove(Paths.get(string))
def removeUnder(file: java.io.File): Unit = remove(file.toPath, deleteRoot=false)
def removeAll(file: java.io.File): Unit = remove(file.toPath)
}
Using Java 6 and no dependencies, this is pretty much the only way to do it.
The problem with your function is that it returns Unit (which, by the way, I would note explicitly: def delete(file: File): Unit = {).
I took your code and modified it to return an array of (file name, deletion status) pairs.
def delete(file: File): Array[(String, Boolean)] = {
Option(file.listFiles).map(_.flatMap(f => delete(f))).getOrElse(Array()) :+ (file.getPath -> file.delete)
}
To add to Slavik Muz's answer:
def deleteFile(file: File): Boolean = {
def childrenOf(file: File): List[File] = Option(file.listFiles()).getOrElse(Array.empty).toList
@annotation.tailrec
def loop(files: List[File]): Boolean = files match {
case Nil ⇒ true
case child :: parents if child.isDirectory && child.listFiles().nonEmpty ⇒
loop((childrenOf(child) :+ child) ++ parents)
case fileOrEmptyDir :: rest ⇒
println(s"deleting $fileOrEmptyDir")
fileOrEmptyDir.delete()
loop(rest)
}
if (!file.exists()) false
else loop(childrenOf(file) :+ file)
}
This one uses java.io, but it can delete directories matching a wildcard string, whether or not they contain any content.
import java.io.File
import org.apache.commons.io.FileUtils
for (file <- new File("<path as String>").listFiles;
     if file.getName.matches("[1-9]*")) FileUtils.deleteDirectory(file)
Directory structure, e.g. A/1/, A/2/, A/300/ ... that's why the regex string [1-9]*. I couldn't find a File API in Scala which supports regexes (maybe I missed something).
This is getting a little lengthy, but here's one that combines the recursion of Garrette's solution with the NPE-safety of the original question.
def deleteFile(path: String) = {
val penultimateFile = new File(path.split('/').take(2).mkString("/"))
def getFiles(f: File): Set[File] = {
Option(f.listFiles)
.map(a => a.toSet)
.getOrElse(Set.empty)
}
def getRecursively(f: File): Set[File] = {
val files = getFiles(f)
val subDirectories = files.filter(path => path.isDirectory)
subDirectories.flatMap(getRecursively) ++ files + penultimateFile
}
getRecursively(penultimateFile).foreach(file => {
if (getFiles(file).isEmpty && file.getAbsoluteFile().exists) file.delete
})
}
This is a recursive method that cleans everything in a directory and returns the count of deleted files:
def cleanDir(dir: File): Int = {
  @annotation.tailrec
def loop(list: Array[File], deletedFiles: Int): Int = {
if (list.isEmpty) deletedFiles
else {
if (list.head.isDirectory && !list.head.listFiles().isEmpty) {
loop(list.head.listFiles() ++ list.tail ++ Array(list.head), deletedFiles)
} else {
val isDeleted = list.head.delete()
if (isDeleted) loop(list.tail, deletedFiles + 1)
else loop(list.tail, deletedFiles)
}
}
}
loop(dir.listFiles(), 0)
}
What I ended up with
def deleteRecursively(f: File): Boolean = {
if (f.isDirectory) f.listFiles match {
case files: Array[File] => files.foreach(deleteRecursively)
case null =>
}
f.delete()
}
os-lib makes it easy to delete recursively with a one-liner:
os.remove.all(os.pwd/"dogs")
os-lib uses java.nio under the hood, but doesn't expose all the Java ugliness. See the project README for more info on how to use the library.
You can do this by executing external system commands.
import sys.process._
def delete(path: String) = {
s"""rm -rf ${path}""".!!
}
I am trying to learn Machine Learning. For Information Extraction, the save files are getting populated properly with data, but the number of class labels is 0 in the NLPFeaturesData.save file, and the log looks like this:
93 #numTrainingDocs
0 #numClasses
53738 #numNullLabelInstances
9006940 #totalFeatures
C:\...\learnedModels.save #modelFile
SVMLibSvmJava #learnerName
null #learnerExecutable
-c 0.7 -t 0 -m 100 -tau 0.4 #learnerParams
I have run the following code base, but in the generated "NLPFeatureData.save" file all the class labels are 0. Could someone please help me figure out where I went wrong in machine learning?
try{
// ***************** Load GATE & its plugins [Load Learning] ***********************************
System.setProperty("gate.home", "C:\\Program Files\\GATE_Developer_7.1");
Gate.init();
Gate.getCreoleRegister().registerDirectories(new File(Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR).toURI().toURL());
Gate.getCreoleRegister().registerDirectories(new URL(FILE_WORK_PATH+"/plugins/Learning"));
// ****************** Instantiate corpus and load training documents *****************************
gate.Corpus corpus = (Corpus) Factory.createResource("gate.corpora.CorpusImpl");
FileFilter fileFilter = new FileFilter() {
public boolean accept(File pathname) {
// TODO Auto-generated method stub
return true;
}
};
corpus.populate(new URL(".../corpus"),fileFilter,"UTF-8",false);
Gate.getCreoleRegister().registerDirectories();
//Make a pipeline and add the corpus
FeatureMap pfm = Factory.newFeatureMap();
pfm.put("corpus", corpus);
pipeline = (gate.creole.SerialAnalyserController)gate.Factory.createResource("gate.creole.SerialAnalyserController", pfm);
initAnnie();
//********************************* Configure with relations config file and learning api
File configFile = new File("../learning-config.xml"); //Wherever it is
RunMode mode = RunMode.TRAINING; // or APPLICATION, etc.
FeatureMap fm = Factory.newFeatureMap();
fm.put("configFileURL", configFile.toURI().toURL());
fm.put("learningMode", mode);
gate.learning.LearningAPIMain learner = (gate.learning.LearningAPIMain)gate.Factory.createResource("gate.learning.LearningAPIMain", fm);
pipeline.add(learner);
pipeline.execute();
}catch(Exception e){
e.printStackTrace();
}
}
private static void initAnnie() throws GateException {
for(int i = 0; i < ANNIEConstants.PR_NAMES.length; i++) {
FeatureMap params = Factory.newFeatureMap(); // use default parameters
ProcessingResource pr = (ProcessingResource)
Factory.createResource(ANNIEConstants.PR_NAMES[i], params);
pipeline.add(pr);
}
}
Finally, I have resolved this problem. I added the following AnnotationSet configuration to the gate.learning.LearningAPIMain instance:
learner.setInputASName("Key");
learner.setOutputASName("Key");
Now my save files are generated in the proper format.