How to add another process after TextIO.write on dataflow pipeline - java

I created a simple Dataflow pipeline which consists of these steps:
Fetch/read data from bigquery
Change the output to csv format
Create CSV file on Google Storage
//TODO send CSV file to third party
pipeline.apply("ReadFromBigQuery",
BigQueryIO.read(new MyCustomObject1(input))
.fromQuery(myCustomQuery)
.usingStandardSql()
).apply("ConvertToCsv",
ParDo.of(new myCustomObject2())
).apply("WriteToCSV",
TextIO.write().to(fileLocation)
.withSuffix(".csv")
.withoutSharding()
.withDelimiter(new char[] {'\r', '\n'})
.withHeader(csvHeader)
);
But after step 3 (writing to Google Storage), I can't add another step to the pipeline.
How can I achieve this?

This is because TextIO.write() returns a PDone rather than a PCollection like the prior PTransforms, so nothing further can be applied to it.
One possible solution is to use a multi-output ParDo with tags in your step 2, so the same data can be routed both to the CSV write and to further processing.
final TupleTag<String> csvOutTag = new TupleTag<String>(){};
final TupleTag<TableRow> furtherProcessingTag = new TupleTag<TableRow>(){};

PCollectionTuple mixedCollection =
    bigQueryReadCollection.apply(ParDo
        .of(new DoFn<TableRow, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // Emit to the main output (the CSV line)
                c.output(c.element().toString());
                // Emit the raw row to the output tagged furtherProcessingTag
                c.output(furtherProcessingTag, c.element());
            }
        })
        .withOutputTags(csvOutTag, TupleTagList.of(furtherProcessingTag)));

// Get the output with tag csvOutTag.
mixedCollection.get(csvOutTag).apply("WriteToCSV",
    TextIO.write().to(fileLocation)
        .withSuffix(".csv")
        .withoutSharding()
        .withDelimiter(new char[] {'\r', '\n'})
        .withHeader(csvHeader));

// Get the output with tag furtherProcessingTag.
mixedCollection.get(furtherProcessingTag).apply(...);
Adjust the data types in the TupleTag declarations to match the output your further processing needs.
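For the further-processing branch itself, a minimal sketch could look like the following (the DoFn body and the ThirdPartyClient call are purely illustrative placeholders for whatever "send to third party" means in your case):
mixedCollection.get(furtherProcessingTag)
    .apply("SendToThirdParty", ParDo.of(new DoFn<TableRow, Void>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // e.g. push each row to an external service (hypothetical client)
            ThirdPartyClient.send(c.element());
        }
    }));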

Related

Get only a subset of fields from a Kafka topic using Apache Beam

Is there a way to read only specific fields of a Kafka topic?
I have a topic, say person with a schema personSchema. The schema contains many fields such as id, name, address, contact, dateOfBirth.
I want to get only id, name and address. How can I do that?
Currently I'm reading streams using Apache Beam and intend to write the data to BigQuery afterwards. I am trying to use Filter but cannot get it to work because of the Boolean return type.
Here's my code:
Pipeline pipeline = Pipeline.create();
PCollection<KV<String, Person>> kafkaStreams =
    pipeline
        .apply("read streams", dataIO.readStreams(topic))
        .apply(Filter.by(new SerializableFunction<KV<String, Person>, Boolean>() {
            @Override
            public Boolean apply(KV<String, Person> input) {
                return input.getValue().get("address").equals(true);
            }
        }));
where dataIO.readStreams is returning this:
return KafkaIO.<String, Person>read()
    .withTopic(topic)
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(PersonAvroDeserializer.class)
    .withConsumerConfigUpdates(consumer)
    .withoutMetadata();
I would appreciate suggestions for a possible solution.
You can do this with ksqlDB, which also works directly with Kafka Connect, for which there is a sink connector for BigQuery:
CREATE STREAM MY_SOURCE WITH (KAFKA_TOPIC='person', VALUE_FORMAT='AVRO');
CREATE STREAM FILTERED_STREAM AS SELECT id, name, address FROM MY_SOURCE;
CREATE SINK CONNECTOR SINK_BQ_01 WITH (
    'connector.class' = 'com.wepay.kafka.connect.bigquery.BigQuerySinkConnector',
    'topics' = 'FILTERED_STREAM',
    …
);
You can also do this by creating a new TableSchema by yourself with only the required fields. Later when you write to BigQuery, you can pass the newly created schema as an argument instead of the old one.
TableSchema schema = new TableSchema();
List<TableFieldSchema> tableFields = new ArrayList<TableFieldSchema>();

TableFieldSchema id =
    new TableFieldSchema()
        .setName("id")
        .setType("STRING")
        .setMode("NULLABLE");
tableFields.add(id);

schema.setFields(tableFields);
return schema;
I should also mention that if you are converting an AVRO record to BigQuery's TableRow at some point, you may need to implement some checks there too.
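For the projection step itself, a minimal sketch might look like this (it assumes Person exposes getId(), getName() and getAddress(); adjust to the accessors your generated Avro class actually provides):
PCollection<TableRow> rows = kafkaStreams.apply("ProjectFields",
    MapElements.into(TypeDescriptor.of(TableRow.class))
        .via((KV<String, Person> kv) -> new TableRow()
            .set("id", kv.getValue().getId())
            .set("name", kv.getValue().getName())
            .set("address", kv.getValue().getAddress())));
The resulting PCollection<TableRow> can then be written with BigQueryIO.writeTableRows() using the trimmed-down schema above.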

Side input in global window as slowly changing cache questions

Context:
We have some schema files in Cloud Storage. In our Dataflow job, we need to refer to these schema files to transform our data. These schema files change on a daily/weekly basis. Our data source is PubSub, and we window PubSub messages into fixed windows of 1 minute. The schema files we need fit well into memory; they are about 90 MB.
What I have tried:
Referring to this doc from Apache Beam, we created a side input that writes into a global window with a GenerateSequence like so:
// Creates a side input that refreshes the schema every minute
PCollectionView<Map<String, byte[]>> dataBlobView =
    pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
        .apply(Window.<Long>into(new GlobalWindows()).triggering(
                Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, Map<String, byte[]>>() {
            @ProcessElement
            public void processElement(ProcessContext ctx) throws Exception {
                byte[] avroSchemaBlob = getAvroSchema();
                byte[] fileDescriptorSetBlob = getFileDescriptorSet();
                byte[] depsBlob = getFileDescriptorDeps();

                Map<String, byte[]> dataBlobs = ImmutableMap.of(
                    "version", Longs.toByteArray(ctx.element().byteValue()),
                    "avroSchemaBlob", avroSchemaBlob,
                    "fileDescriptorSetBlob", fileDescriptorSetBlob,
                    "depsBlob", depsBlob);
                ctx.output(dataBlobs);
            }
        }))
        .apply(View.asSingleton());
"getAvroSchema", "getFileDescriptorSet" and "getFileDescriptorDeps" read files as byte[] from Cloud Storage.
However, this approach failed from the exception:
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: PCollection with more than one element accessed as a singleton view.
I then tried writing my own Combine Globally function like so:
static class GetLatestVersion implements SerializableFunction<Iterable<Map<String, byte[]>>, Map<String, byte[]>> {
    @Override
    public Map<String, byte[]> apply(Iterable<Map<String, byte[]>> versions) {
        Map<String, byte[]> result = Maps.newHashMap();
        Long maxVersion = Long.MIN_VALUE;

        for (Map<String, byte[]> version : versions) {
            Long currentVersion = Longs.fromByteArray(version.get("version"));
            logger.info("Side input version: " + currentVersion);

            if (currentVersion > maxVersion) {
                result = version;
                maxVersion = currentVersion;
            }
        }

        return result;
    }
}
But it still triggers the same exception.
I then came across this and this in the Beam email archives, and it seems like what's suggested in the Beam doc does not work; I have to use a MultiMap to avoid the exception I ran into above. With a MultiMap, I will also have to iterate through the values and apply my own logic to pick the desired (latest) value.
My questions:
Why do I still get the exception "PCollection with more than one element accessed as a singleton view" even after I globally combine everything into 1 result?
If I go with the MultiMap approach, wouldn't the job eventually run out of memory? Every day we are basically adding another 90 MB (the size of our data blob) to the MultiMap, unless Dataflow has some smart MultiMap implementation behind the scenes.
What is the recommended way to do this?
Thanks
Use .apply(View.asMap()) instead of .apply(View.asSingleton());
This is the full example:
PCollectionView<Map<String, byte[]>> dataBlobView =
    pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
        .apply(Window.<Long>into(new GlobalWindows()).triggering(
                Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, KV<String, byte[]>>() {
            @ProcessElement
            public void processElement(ProcessContext ctx) throws Exception {
                byte[] avroSchemaBlob = getAvroSchema();
                byte[] fileDescriptorSetBlob = getFileDescriptorSet();
                byte[] depsBlob = getFileDescriptorDeps();

                ctx.output(KV.of("version", Longs.toByteArray(ctx.element().byteValue())));
                ctx.output(KV.of("avroSchemaBlob", avroSchemaBlob));
                ctx.output(KV.of("fileDescriptorSetBlob", fileDescriptorSetBlob));
                ctx.output(KV.of("depsBlob", depsBlob));
            }
        }))
        .apply(View.asMap());
You can then use the map from the side input in your DoFns, as described in the documentation.
Apache Beam version 2.34.0
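For completeness, a minimal sketch of consuming that side input in a downstream ParDo could look like this (mainInput and the DoFn body are illustrative placeholders):
PCollection<String> transformed = mainInput.apply("ApplySchemas",
    ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Look up the latest blobs from the side-input map
            Map<String, byte[]> dataBlobs = c.sideInput(dataBlobView);
            byte[] avroSchemaBlob = dataBlobs.get("avroSchemaBlob");
            // ... transform c.element() using the schema blobs ...
            c.output(c.element());
        }
    }).withSideInputs(dataBlobView));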

Write Kafka Stream output to multiple directory using Apache Beam

I would like to persist the data from a Kafka topic to Google Storage using Dataflow.
I have written some sample code locally, and it works fine.
public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);

    p.apply(KafkaIO.<Long, String>read().withBootstrapServers("localhost:9092").withTopic("my-topic")
            .withKeyDeserializer(LongDeserializer.class).withValueDeserializer(StringDeserializer.class))
        .apply(Window
            .<KafkaRecord<Long, String>>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((KafkaRecord<Long, String> line) -> TextUtil.splitLine(line.getKV().getValue())))
        .apply(Filter.by((String word) -> StringUtils.isNotEmpty(word)))
        .apply(Count.perElement())
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> lineCount) -> lineCount.getKey() + ": " + lineCount.getValue()))
        .apply(TextIO.write().withWindowedWrites().withNumShards(1)
            .to("resources/temp/wc-kafka-op/wc"));

    p.run().waitUntilFinish();
}
The above code works perfectly, but I would like to save the output of each window in a separate directory,
e.g. {BasePath}/{Window}/{prefix}{Suffix}.
I could not get it working.
TextIO supports windowed writes, and you can specify how the file name is derived; see the JavaDoc for FileBasedSink.FilenamePolicy.
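A rough sketch of a per-window filename policy, assuming a Beam 2.x API (the class name and bucket paths are placeholders, and the exact method signatures should be checked against the FileBasedSink.FilenamePolicy JavaDoc for your Beam version):
class PerWindowFilenamePolicy extends FileBasedSink.FilenamePolicy {
    private final ResourceId baseDir;

    PerWindowFilenamePolicy(String baseDir) {
        this.baseDir = FileSystems.matchNewResource(baseDir, true /* isDirectory */);
    }

    @Override
    public ResourceId windowedFilename(int shardNumber, int numShards, BoundedWindow window,
            PaneInfo paneInfo, FileBasedSink.OutputFileHints outputFileHints) {
        IntervalWindow intervalWindow = (IntervalWindow) window;
        // e.g. {BasePath}/{windowStart}/wc-00000-of-00001
        String filename = String.format("wc-%05d-of-%05d%s", shardNumber, numShards,
            outputFileHints.getSuggestedFilenameSuffix());
        return baseDir
            .resolve(intervalWindow.start().toString(), StandardResolveOptions.RESOLVE_DIRECTORY)
            .resolve(filename, StandardResolveOptions.RESOLVE_FILE);
    }

    @Override
    public ResourceId unwindowedFilename(int shardNumber, int numShards,
            FileBasedSink.OutputFileHints outputFileHints) {
        throw new UnsupportedOperationException("Only windowed writes are expected here");
    }
}
It would then be plugged into the write roughly like this:
.apply(TextIO.write()
    .to(new PerWindowFilenamePolicy("gs://my-bucket/output"))
    .withTempDirectory(FileSystems.matchNewResource("gs://my-bucket/temp", true))
    .withWindowedWrites()
    .withNumShards(1));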

How to process multilevel sources in akka asynchronously

I have a list of lists which I want to process using Akka, and I want to perform an operation once all of the child lists are done processing. But "Complete" is printed before all children are completed.
Basically, I am trying to read all the sheets in an Excel workbook and then read each row of each sheet. For this I am looking to use Akka to process each sheet separately, and within each sheet to process each row separately.
Sample Code:
List<List<String>> workbook = new ArrayList<List<String>>();

List<String> Sheet1 = new ArrayList<String>();
Sheet1.add("S");
Sheet1.add("a");
Sheet1.add("d");

List<String> Sheet2 = new ArrayList<String>();
Sheet2.add("S");
Sheet2.add("a1");
Sheet2.add("d");

workbook.add(Sheet1);
workbook.add(Sheet2);

final ActorSystem system = ActorSystem.create("Sys");
final ActorMaterializer materializer = ActorMaterializer.create(system);

Source.from(workbook).map(sheet -> {
    return Source.from(sheet).runWith(Sink.foreach(data -> {
        System.out.println(data);
        Thread.sleep(1000);
    }), materializer).toCompletableFuture();
}).runWith(Sink.ignore(), materializer).whenComplete((a, b) -> {
    System.out.println("Complete");
});

system.terminate();
The Current output is:
S
S
Complete
a
a1
d
d
The Expected output is:
S
S
a
a1
d
d
Complete
Could anyone please help?
Your use of a "stream within a stream" may be overcomplicating the process.
You could instead use Flow.flatMapConcat. I can only provide an example in Scala, but hopefully it translates easily to Java:
val flattenFlow: Flow[List[String], String, NotUsed] =
  Flow[List[String]].flatMapConcat(sheet => Source(sheet))

val flattenedSource: Source[String, NotUsed] = Source(workbook).via(flattenFlow)
There is a blog post with an example of using flatMapConcat in Java, but I don't know if my guessed type Flow.of(List<String>.class) is valid code.
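For reference, a hedged Java translation of the same idea might look like the following (untested sketch using the akka.stream.javadsl API). It flattens the sheets into one linear stream and, importantly, only terminates the actor system after the stream completes, which is also why "Complete" was printed early in the original code:
Source.from(workbook)
    .flatMapConcat(sheet -> Source.from(sheet))
    .runWith(Sink.foreach(data -> System.out.println(data)), materializer)
    .whenComplete((done, failure) -> {
        System.out.println("Complete");
        system.terminate();
    });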

Java compare two csv files [closed]

So I have two CSV files I wish to compare.
Each file could be as much as 20 MB.
Each line has the key followed by the data (key,data), but the data itself is also comma-separated.
csv1.csv
KEY , DATA
AB45,12,15,65,NN
AB46,12,15,64,YY
AB47,45,85,95,YN
csv2.csv
AB45,12,15,65,NN
AB46,15,15,65,YY
AB48,65,45,60,YY
What I want to do is read both files and compare the data for each key.
I was thinking of reading each file line by line and adding the rows to a TreeMap. I could then compare the data for a given key and, if there is a difference, write it to another file.
Any advice? I am unsure how to read the files and extract just the keys and data in an efficient way.
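(For reference, a minimal sketch of the TreeMap idea described above, assuming well-formed lines and that one file fits in memory; header handling and malformed lines are left out, and the file names are the ones from the question:)
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class CsvDiff {
    public static void main(String[] args) throws Exception {
        // Load csv1.csv into a map of key -> data
        Map<String, String> first = new TreeMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("csv1.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int comma = line.indexOf(',');
                first.put(line.substring(0, comma).trim(), line.substring(comma + 1));
            }
        }
        // Stream csv2.csv and write rows whose data differs (or whose key is new)
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("csv2.csv"));
             BufferedWriter diff = Files.newBufferedWriter(Paths.get("diff.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int comma = line.indexOf(',');
                String key = line.substring(0, comma).trim();
                String data = line.substring(comma + 1);
                if (!data.equals(first.get(key))) {
                    diff.write(line);
                    diff.newLine();
                }
            }
        }
    }
}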
Use a dedicated CSV parsing library to speed things up. With uniVocity-parsers you can parse these 20 MB files in 100 ms or less. The following solution is a bit involved, to prevent loading too much data into memory. Check the tutorial I linked above; there are many ways to accomplish what you need with this library.
First we read one of the CSV files, and generate a Map:
public static void main(String... args) {
    //First we parse one file (ideally the smaller one)
    CsvParserSettings settings = new CsvParserSettings();
    //here we tell the parser to read the CSV headers
    settings.setHeaderExtractionEnabled(true);
    CsvParser parser = new CsvParser(settings);

    //Parse all data into a list.
    List<String[]> records = parser.parseAll(new File("/path/to/csv1.csv"));
    //Convert that list into a map. The first column of this input will produce the keys.
    Map<String, String[]> mapOfRecords = toMap(records);

    //This is where the magic happens.
    processFile(new File("/path/to/csv2.csv"), new File("/path/to/diff.csv"), mapOfRecords);
}
This is the code to generate a Map from the list of records:
/* Converts a list of records to a map. Uses element at index 0 as the key */
private static Map<String, String[]> toMap(List<String[]> records) {
    HashMap<String, String[]> map = new HashMap<String, String[]>();
    for (String[] row : records) {
        //column 0 will always have an ID.
        map.put(row[0], row);
    }
    return map;
}
With the map of records, we can process your second file and generate another with any updates found:
private static void processFile(final File input, final File output, final Map<String, String[]> mapOfExistingRecords) {
    //configures a new parser again
    CsvParserSettings settings = new CsvParserSettings();
    settings.setHeaderExtractionEnabled(true);

    //All parsed rows will be submitted to the following Processor. This way you won't have to store all rows in memory.
    settings.setProcessor(new RowProcessor() {
        //will write the changed rows to another file
        CsvWriter writer;

        @Override
        public void processStarted(ParsingContext context) {
            CsvWriterSettings settings = new CsvWriterSettings(); //configure at will
            writer = new CsvWriter(output, settings);
        }

        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            // Incoming rows will have the ID at index 0.
            // If the map contains the ID, we'll get the corresponding row.
            String[] existingRow = mapOfExistingRecords.get(row[0]);

            if (!Arrays.equals(row, existingRow)) {
                writer.writeRow(row);
            }
        }

        @Override
        public void processEnded(ParsingContext context) {
            writer.close();
        }
    });

    CsvParser parser = new CsvParser(settings);
    //the parse() method will submit all rows to the RowProcessor defined above.
    //All differences will be written to the output file.
    parser.parse(input);
}
This should work just fine. I hope it helps you.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
I work with a lot of CSV file comparisons for my job. I didn't know Python before I started, but I picked it up really quickly. If you want to compare CSV files quickly, Python is a wonderful way to go, and it's fairly easy to pick up if you know Java.
I modified a script I use to fit your basic use case (you'll need to modify it a bit more to do exactly what you want). It runs in a few seconds when I use it to compare CSV files with millions of rows. If you need to do this in Java, you can pretty much transfer this to Java methods; there are similar CSV libraries you can use that will replace the csv functions below.
import csv, sys, itertools

def getKeyPosition(header_row, key_value):
    counter = 0
    for header in header_row:
        if (header == key_value):
            return counter
        counter += 1

# This will create a dictionary of your rows by their key. (key_position is the column location)
def getKeyDict(csv_reader, key_position):
    key_dict = {}
    row_counter = 0
    unique_records = 0
    for row in csv_reader:
        row_counter += 1
        if row[key_position] not in key_dict:
            key_dict.update({row[key_position]: row})
            unique_records += 1
    # My use case requires a lot of checking for duplicates
    if unique_records != row_counter:
        print "Duplicate Keys in File"
    return key_dict

def main():
    f1 = open(sys.argv[1])
    f2 = open(sys.argv[2])
    f1_csv = csv.reader(f1)
    f2_csv = csv.reader(f2)
    f1_header = next(f1_csv)
    f2_header = next(f2_csv)
    f1_header_key_position = getKeyPosition(f1_header, "KEY")
    f2_header_key_position = getKeyPosition(f2_header, "KEY")
    f1_row_dict = getKeyDict(f1_csv, f1_header_key_position)
    f2_row_dict = getKeyDict(f2_csv, f2_header_key_position)

    outputFile = open("KeyDifferenceFile.csv", 'w')
    writer = csv.writer(outputFile)
    writer.writerow(f1_header)

    # Here's the logic for comparing rows
    for key, row_1 in f1_row_dict.iteritems():
        # Do whatever comparisons you need here.
        if key not in f2_row_dict:
            print "Oh no, this key doesn't exist in file 2"
        if key in f2_row_dict:
            row_2 = f2_row_dict.get(key)
            if row_1 != row_2:
                print "oh no, the two rows don't match!"
            # You can get more header keys to compare by if you want.
            data_position = getKeyPosition(f2_header, "DATA")
            row_1_data = row_1[data_position]
            row_2_data = row_2[data_position]
            if row_1_data != row_2_data:
                print "oh no, the data doesn't match!"
                # Here's how you'd write the rows
                row_to_write = []
                # Differences between the two data values (assumes numeric columns)
                for row_1_column, row_2_column in itertools.izip(row_1_data, row_2_data):
                    row_to_write.append(row_1_column - row_2_column)
                writer.writerow(row_to_write)

    # Make sure to close those files!
    f1.close()
    f2.close()
    outputFile.close()

main()
