I am currently trying to develop a Dataflow pipeline in order to replace some partitions of a partitioned table. I have a custom partition field which is a date. The input of my pipeline is a file with potentially different dates.
I developed a Pipeline :
PipelineOptionsFactory.register(BigQueryOptions.class);
BigQueryOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryOptions.class);
Pipeline p = Pipeline.create(options);
PCollection<TableRow> rows = p.apply("ReadLines", TextIO.read().from(options.getFileLocation()))
.apply("Convert To BQ Row", ParDo.of(new StringToRowConverter(options)));
ValueProvider<String> projectId = options.getProjectId();
ValueProvider<String> datasetId = options.getDatasetId();
ValueProvider<String> tableId = options.getTableId();
ValueProvider<String> partitionField = options.getPartitionField();
ValueProvider<String> columnNames = options.getColumnNames();
ValueProvider<String> types = options.getTypes();
rows.apply("Write to BQ", BigQueryIO.writeTableRows()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withCustomGcsTempLocation(options.getGCSTempLocation())
.to(new DynamicDestinations<TableRow, String>() {
@Override
public String getDestination(ValueInSingleWindow<TableRow> element) {
TableRow date = element.getValue();
String partitionDestination = (String) date.get(partitionField.get());
SimpleDateFormat from = new SimpleDateFormat("yyyy-MM-dd");
SimpleDateFormat to = new SimpleDateFormat("yyyyMMdd");
try {
partitionDestination = to.format(from.parse(partitionDestination));
LOG.info("Table destination "+partitionDestination);
return projectId.get()+":"+datasetId.get()+"."+tableId.get()+"$"+partitionDestination;
} catch(ParseException e){
e.printStackTrace();
return projectId.get()+":"+datasetId.get()+"."+tableId.get()+"_rowsWithErrors";
}
}
@Override
public TableDestination getTable(String destination) {
TimePartitioning timePartitioning = new TimePartitioning();
timePartitioning.setField(partitionField.get());
timePartitioning.setType("DAY");
timePartitioning.setRequirePartitionFilter(true);
TableDestination tableDestination = new TableDestination(destination, null, timePartitioning);
LOG.info(tableDestination.toString());
return tableDestination;
}
@Override
public TableSchema getSchema(String destination) {
return new TableSchema().setFields(buildTableSchemaFromOptions(columnNames, types));
}
})
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
);
p.run();
}
When I trigger the pipeline locally, it successfully replaces the partitions whose dates are in the input file. Nevertheless, when I deploy it on Google Cloud Dataflow and run the template with the exact same parameters, it truncates all the data, and at the end my table only contains the data from the file I wanted to upload.
Do you know why there is such a difference?
Thank you!
You set BigQueryIO.Write.CreateDisposition to CREATE_IF_NEEDED, and this is paired with BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE, so even if the table exists, it may be recreated. This is why you see your table being replaced.
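For illustration, these are the two settings in play. A minimal sketch using values from the Beam BigQueryIO API (whether this gives the per-partition behaviour you expect in a templated run still needs to be verified; 'destinations' stands for the DynamicDestinations instance shown above):
rows.apply("Write to BQ", BigQueryIO.writeTableRows()
    // CREATE_NEVER keeps the sink from (re)creating the table;
    // the partitioned table must then already exist before the job runs.
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    // WRITE_TRUNCATE replaces whatever destination the sink resolves to,
    // while WRITE_APPEND would only add rows to it.
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
    .to(destinations));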
See this document [1] for details.
[1] https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/BigQueryIO.Write.CreateDisposition#CREATE_IF_NEEDED
I am new to Spring and I am trying to do something like this. Let us say my table has the following columns:
task
is_completed
completed_at
The user will provide the following options in the query parameters.
is_completed=true or false
from_date = dd-mm-yyyy
to_date = dd-mm-yyyy
I will check for each parameter one by one and then filter the table.
In Django, I can do something like this
tasks = Task.object.all() # All the tasks will be stored in tasks
tasks = tasks.filter(is_completed=True) # completed tasks will be filtered from all tasks
tasks = tasks.filter(completed_at__gte=from_date, completed_at__lte=to_date) # completed tasks will be filtered based on the completed date
How can I achieve this with Spring Data JPA? Is there any way I can save the filtered results and query the filtered results again instead of querying the entire database?
That way I can check whether each parameter has a value and handle it like this:
if (is_completed == true) {
// filter completed tasks
}
if (from_date != null) {
// filter completed tasks that were completed on or after this date
}
if (to_date != null) {
// filter completed tasks that were completed up to this date
}
The problem with the current approach is that I have to write an SQL query for each combination of parameters. This becomes complex when there are multiple parameters.
Let's consider that this is your Task class.
class Task {
private String id;
private boolean isCompleted;
private LocalDate isCompletedAt;
public Task(String id, boolean isCompleted, LocalDate isCompletedAt) {
this.id = id;
this.isCompleted = isCompleted;
this.isCompletedAt = isCompletedAt;
}
public boolean isCompleted() {
return isCompleted;
}
public LocalDate getCompletedAt() {
return isCompletedAt;
}
@Override
public String toString() {
return "Task{" +
"id='" + id + '\'' +
", isCompleted=" + isCompleted +
", isCompletedAt=" + isCompletedAt +
'}';
}
}
Below is the code with which you can filter the data:
class TaskFilter {
public static void main(String[] args) {
// Sample user input
boolean isCompleted = true;
LocalDate fromDate = LocalDate.parse("2020-11-04");
LocalDate toDate = LocalDate.parse("2021-11-06");
// Simulate data retrieved from JPA repository eg repository.findAll()
List<Task> tasks = List.of(
new Task("1", true, LocalDate.parse("2020-10-04")),
new Task("2", false, LocalDate.parse("2010-12-02")),
new Task("3", false, LocalDate.parse("2021-04-24")),
new Task("4", true, LocalDate.parse("2021-03-12"))
);
// Create a stream on retrieved data
Stream<Task> tasksStream = tasks.stream();
// Filter that stream based on user input
if(isCompleted) {
tasksStream = tasksStream.filter(task -> task.isCompleted());
}
if(fromDate != null) {
tasksStream = tasksStream.filter(task -> task.getCompletedAt().isAfter(fromDate));
}
if(toDate != null) {
tasksStream = tasksStream.filter(task -> task.getCompletedAt().isBefore(toDate));
}
// Finally, collect the stream into a list. This is required because the stream is not executed until a terminal operation is called.
List<Task> filteredTaskList = tasksStream.collect(Collectors.toList());
System.out.println(filteredTaskList);
}
}
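If you would rather let the database do the filtering instead of loading every row into memory, the same conditional pattern can be expressed with Spring Data JPA Specifications. A minimal sketch, assuming a JPA-mapped TaskEntity with isCompleted and completedAt fields and a repository extending JpaSpecificationExecutor (these names are illustrative, not taken from your code):
import java.time.LocalDate;
import java.util.List;
import org.springframework.data.jpa.domain.Specification;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.JpaSpecificationExecutor;
interface TaskRepository extends JpaRepository<TaskEntity, String>, JpaSpecificationExecutor<TaskEntity> {
}
class TaskQueryService {
    private final TaskRepository repository;
    TaskQueryService(TaskRepository repository) {
        this.repository = repository;
    }
    List<TaskEntity> find(Boolean isCompleted, LocalDate fromDate, LocalDate toDate) {
        // Start from a no-op specification and add one predicate per provided parameter
        Specification<TaskEntity> spec = Specification.where(null);
        if (isCompleted != null) {
            spec = spec.and((root, query, cb) -> cb.equal(root.get("isCompleted"), isCompleted));
        }
        if (fromDate != null) {
            spec = spec.and((root, query, cb) ->
                    cb.greaterThanOrEqualTo(root.<LocalDate>get("completedAt"), fromDate));
        }
        if (toDate != null) {
            spec = spec.and((root, query, cb) ->
                    cb.lessThanOrEqualTo(root.<LocalDate>get("completedAt"), toDate));
        }
        // The database executes a single query containing only the requested conditions
        return repository.findAll(spec);
    }
}
This way only the matching rows are fetched, instead of filtering the full result set in memory.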
I have a Kinesis stream and I created a Firehose delivery stream that saves all the data to S3; it was saving correctly into hourly folders. Then I wrote a Firehose transformation Lambda, and after deploying it all the messages go to the same folder. I am not sure what I am missing. I have the below fields in the response from my Lambda function:
result.put("recordId", record.getRecordId());
result.put("result", "Ok");
result.put("approximateArrivalEpoch", record.getApproximateArrivalEpoch());
result.put("approximateArrivalTimestamp",record.getApproximateArrivalTimestamp());
result.put("kinesisRecordMetadata", record.getKinesisRecordMetadata());
result.put("data", Base64.getEncoder().encodeToString(jsonData.getBytes()));
Edit:
Here is my code in Java. I am using KinesisFirehoseEvent, and decoding was not needed in my case since I get a ByteBuffer from the KinesisFirehoseEvent.
public JSONObject handler(KinesisFirehoseEvent kinesisFirehoseEvent, Context context) {
final LambdaLogger logger = context.getLogger();
final JSONArray resultArray = new JSONArray();
for (final KinesisFirehoseEvent.Record record: kinesisFirehoseEvent.getRecords()) {
final byte[] data = record.getData().array();
final Optional<TestData> testData = deserialize(data, logger);
if (testData.isPresent()) {
final JSONObject jsonObj = new JSONObject();
final String jsonData = gson.toJson(testData.get());
jsonObj.put("recordId", record.getRecordId());
jsonObj.put("result", "Ok");
jsonObj.put("approximateArrivalEpoch", record.getApproximateArrivalEpoch());
jsonObj.put("approximateArrivalTimestamp", record.getApproximateArrivalTimestamp());
jsonObj.put("kinesisRecordMetadata", record.getKinesisRecordMetadata());
jsonObj.put("data", Base64.getEncoder().encodeToString
(jsonData.getBytes()));
resultArray.add(jsonObj);
}
else {
logger.log("testData not deserialized");
}
}
final JSONObject jsonFinalObj = new JSONObject();
jsonFinalObj.put("records", resultArray);
return jsonFinalObj;
}
The data your Lambda function returns is not in the correct format.
Check out the example below:
'use strict';
console.log('Loading function');
/* Stock Ticker format parser */
const parser = /^\{\"TICKER_SYMBOL\"\:\"[A-Z]+\"\,\"SECTOR\"\:"[A-Z]+\"\,\"CHANGE\"\:[-.0-9]+\,\"PRICE\"\:[-.0-9]+\}/;
exports.handler = (event, context, callback) => {
let success = 0; // Number of valid entries found
let failure = 0; // Number of invalid entries found
let dropped = 0; // Number of dropped entries
/* Process the list of records and transform them */
const output = event.records.map((record) => {
const entry = (new Buffer(record.data, 'base64')).toString('utf8');
let match = parser.exec(entry);
if (match) {
let parsed_match = JSON.parse(match);
var milliseconds = new Date().getTime();
/* Add timestamp and convert to CSV */
const result = `${milliseconds},${parsed_match.TICKER_SYMBOL},${parsed_match.SECTOR},${parsed_match.CHANGE},${parsed_match.PRICE}`+"\n";
const payload = (new Buffer(result, 'utf8')).toString('base64');
if (parsed_match.SECTOR != 'RETAIL') {
/* Dropped event, notify and leave the record intact */
dropped++;
return {
recordId: record.recordId,
result: 'Dropped',
data: record.data,
};
}
else {
/* Transformed event */
success++;
return {
recordId: record.recordId,
result: 'Ok',
data: payload,
};
}
}
else {
/* Failed event, notify the error and leave the record intact */
console.log("Failed event : "+ record.data);
failure++;
return {
recordId: record.recordId,
result: 'ProcessingFailed',
data: record.data,
};
}
/* This transformation is the "identity" transformation, the data is left intact
return {
recordId: record.recordId,
result: 'Ok',
data: record.data,
} */
});
console.log(`Processing completed. Successful records ${output.length}.`);
callback(null, { records: output });
};
The documentation below has more details on the expected return format:
https://aws.amazon.com/blogs/compute/amazon-kinesis-firehose-data-transformation-with-aws-lambda/
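Translated back to the Java shape used in the question, each returned record would carry only the three fields of the documented response (recordId, result, data). A rough, untested sketch reusing the question's variable names:
final JSONObject jsonObj = new JSONObject();
jsonObj.put("recordId", record.getRecordId());
jsonObj.put("result", "Ok");
// data must be the base64-encoded payload; no extra fields alongside these three
jsonObj.put("data", Base64.getEncoder().encodeToString(jsonData.getBytes()));
resultArray.add(jsonObj);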
Hope it helps.
I got this working using the above code only; it just looks like the stream is slow, so the data for the new hours had not arrived yet.
Hello, I want to read data from a relational database continuously in order to get new data. I have written code, but it executes the SELECT query only once instead of repeatedly.
The iterator iterates only over the result stream; it does not execute the query multiple times.
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
JDBCInputFormatBuilder inputBuilder = JDBCInputFormat.buildJDBCInputFormat()
.setDrivername(JDBCConfig.DRIVER_CLASS)
.setDBUrl(JDBCConfig.DB_URL)
.setQuery(JDBCConfig.SELECT_FROM_SOURCE)
.setRowTypeInfo(JDBCConfig.ROW_TYPE_INFO);
SingleOutputStreamOperator<Row> source = environment.createInput(inputBuilder.finish())
.keyBy(0)
.fold(null, new FoldFunction<Row, Row>(){
@Override
public Row fold(Row row1, Row row) throws Exception {
Date dt = (Date) row.getField(2);
return row;
}
});
IterativeStream<Row> iteration = source.iterate();
iteration.closeWith(iteration.filter(new FilterFunction<Row>() {
@Override
public boolean filter(Row row) throws Exception {
if(Integer.parseInt(row.getField(0).toString()) > 0) {
return true;
}
return false;
}
}));
//iteration.print();
source.writeUsingOutputFormat(JDBCOutputFormat.buildJDBCOutputFormat()
.setDrivername(JDBCConfig.DRIVER_CLASS)
.setDBUrl(JDBCConfig.DB_URL)
.setQuery(JDBCConfig.INSERT_SQL)
.setSqlTypes(new int[]{Types.BIGINT, Types.VARCHAR, Types.TIMESTAMP_WITH_TIMEZONE})
.finish());
environment.execute();
I have two CSV files. The master CSV file has around 500,000 records. The daily CSV file has 50,000 records.
The daily CSV file misses a few columns which have to be fetched from the master CSV file.
For example
DailyCSV File
id,name,city,zip,occupation
1,Jhon,Florida,50069,Accountant
MasterCSV File
id,name,city,zip,occupation,company,exp,salary
1, Jhon, Florida, 50069, Accountant, AuditFirm, 3, $5000
What I have to do is read both files, match the records by ID, and if the ID is present in the master file, fetch company, exp and salary and write them to a new CSV file.
How can I achieve this?
What I have done currently:
while (true) {
line = bstream.readLine();
lineMaster = bstreamMaster.readLine();
if (line == null || lineMaster == null)
{
break;
}
else
{
while(lineMaster != null)
readlineSplit = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
String splitId = readlineSplit[4];
String[] readLineSplitMaster =lineMaster.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
String SplitIDMaster = readLineSplitMaster[13];
System.out.println(splitId + "|" + SplitIDMaster);
//System.out.println(splitId.equalsIgnoreCase(SplitIDMaster));
if (splitId.equalsIgnoreCase(SplitIDMaster)) {
String writeLine = readlineSplit[0] + "," + readlineSplit[1] + "," + readlineSplit[2] + "," + readlineSplit[3] + "," + readlineSplit[4] + "," + readlineSplit[5] + "," + readLineSplitMaster[15]+ "," + readLineSplitMaster[16] + "," + readLineSplitMaster[17];
System.out.println(writeLine);
pstream.print(writeLine + "\r\n");
}
}
}
pstream.close();
fout.flush();
bstream.close();
bstreamMaster.close();
First of all, your current parsing approach will be painfully slow. Use a dedicated CSV parsing library to speed things up. With uniVocity-parsers you can process your 500K records in less than a second. This is how you can use it to solve your problem:
First let's define a few utility methods to read/write your files:
//opens the file for reading (using UTF-8 encoding)
private static Reader newReader(String pathToFile) {
try {
return new InputStreamReader(new FileInputStream(new File(pathToFile)), "UTF-8");
} catch (Exception e) {
throw new IllegalArgumentException("Unable to open file for reading at " + pathToFile, e);
}
}
//creates a file for writing (using UTF-8 encoding)
private static Writer newWriter(String pathToFile) {
try {
return new OutputStreamWriter(new FileOutputStream(new File(pathToFile)), "UTF-8");
} catch (Exception e) {
throw new IllegalArgumentException("Unable to open file for writing at " + pathToFile, e);
}
}
Then, we can start reading your daily CSV file, and generate a Map:
public static void main(String... args){
//First we parse the daily update file.
CsvParserSettings settings = new CsvParserSettings();
//here we tell the parser to read the CSV headers
settings.setHeaderExtractionEnabled(true);
//and to select ONLY the following columns.
//This ensures rows with a fixed size will be returned in case some records come with less or more columns than anticipated.
settings.selectFields("id", "name", "city", "zip", "occupation");
CsvParser parser = new CsvParser(settings);
//Here we parse all data into a list.
List<String[]> dailyRecords = parser.parseAll(newReader("/path/to/daily.csv"));
//And convert them to a map. ID's are the keys.
Map<String, String[]> mapOfDailyRecords = toMap(dailyRecords);
... //we'll get back here in a second.
This is the code to generate a Map from the list of daily records:
/* Converts a list of records to a map. Uses element at index 0 as the key */
private static Map<String, String[]> toMap(List<String[]> records) {
HashMap<String, String[]> map = new HashMap<String, String[]>();
for (String[] row : records) {
//column 0 will always have an ID.
map.put(row[0], row);
}
return map;
}
With the map of records, we can process your master file and generate the list of updates:
private static List<Object[]> processMasterFile(final Map<String, String[]> mapOfDailyRecords) {
//we'll put the updated data here
final List<Object[]> output = new ArrayList<Object[]>();
//configures the parser to process only the columns you are interested in.
CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true);
settings.selectFields("id", "company", "exp", "salary");
//All parsed rows will be submitted to the following RowProcessor. This way the bigger Master file won't
//have all its rows stored in memory.
settings.setRowProcessor(new AbstractRowProcessor() {
@Override
public void rowProcessed(String[] row, ParsingContext context) {
// Incoming rows from MASTER will have the ID as index 0.
// If the daily update map contains the ID, we'll get the daily row
String[] dailyData = mapOfDailyRecords.get(row[0]);
if (dailyData != null) {
//We got a match. Let's join the data from the daily row with the master row.
Object[] mergedRow = new Object[8];
for (int i = 0; i < dailyData.length; i++) {
mergedRow[i] = dailyData[i];
}
for (int i = 1; i < row.length; i++) { //starts from 1 to skip the ID at index 0
mergedRow[i + dailyData.length - 1] = row[i];
}
output.add(mergedRow);
}
}
});
CsvParser parser = new CsvParser(settings);
//the parse() method will submit all rows to the RowProcessor defined above.
parser.parse(newReader("/path/to/master.csv"));
return output;
}
Finally, we can get the merged data and write everything to another file:
... // getting back to the main method here
//Now we process the master data and get a list of updates
List<Object[]> updatedData = processMasterFile(mapOfDailyRecords);
//And write the updated data to another file
CsvWriterSettings writerSettings = new CsvWriterSettings();
writerSettings.setHeaders("id", "name", "city", "zip", "occupation", "company", "exp", "salary");
writerSettings.setHeaderWritingEnabled(true);
CsvWriter writer = new CsvWriter(newWriter("/path/to/updates.csv"), writerSettings);
//Here we write everything, and get the job done.
writer.writeRowsAndClose(updatedData);
}
This should work like a charm. Hope it helps.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
I will approach the problem in a step-by-step manner.
First, I will parse/read the master CSV file and keep its content in a HashMap, where the key is each record's unique 'id'. For the value you can store the remaining fields in a map, or simply create a Java class to hold the information.
Example of the map:
{
'1' : { 'name': 'Jhon',
'City': 'Florida',
'zip' : 50069,
....
}
}
Next, read your daily CSV file. For each row, read the 'id' and check whether that key exists in the HashMap you created earlier.
If it exists, access the information you need from the HashMap and write it to a new CSV file.
Also, you might want to consider using a third-party CSV parser to make this task easier.
If you have Maven, you can follow this example I found on the net. Otherwise you can just search for an Apache Commons CSV parser example on the internet.
http://examples.javacodegeeks.com/core-java/apache/commons/csv-commons/writeread-csv-files-with-apache-commons-csv-example/
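For reference, a rough sketch of that approach with Apache Commons CSV (the file names, output column order and header names are assumed from the samples above, and the snippet is untested):
import java.io.FileReader;
import java.io.FileWriter;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;
public class CsvJoin {
    public static void main(String[] args) throws Exception {
        // 1. Index the master file by id
        Map<String, CSVRecord> masterById = new HashMap<>();
        try (CSVParser master = CSVFormat.DEFAULT.withFirstRecordAsHeader()
                .parse(new FileReader("master.csv"))) {
            for (CSVRecord rec : master) {
                masterById.put(rec.get("id").trim(), rec);
            }
        }
        // 2. Stream the daily file, join on id and write the merged rows
        try (CSVParser daily = CSVFormat.DEFAULT.withFirstRecordAsHeader()
                    .parse(new FileReader("daily.csv"));
             CSVPrinter out = new CSVPrinter(new FileWriter("merged.csv"),
                    CSVFormat.DEFAULT.withHeader("id", "name", "city", "zip",
                            "occupation", "company", "exp", "salary"))) {
            for (CSVRecord rec : daily) {
                CSVRecord m = masterById.get(rec.get("id").trim());
                if (m != null) {
                    out.printRecord(rec.get("id"), rec.get("name"), rec.get("city"),
                            rec.get("zip"), rec.get("occupation"),
                            m.get("company").trim(), m.get("exp").trim(), m.get("salary").trim());
                }
            }
        }
    }
}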
I am trying to add a column to my DataFrame so that it serves as a unique ROW_ID for each row. So it would be something like this:
1, user1
2, user2
3, user3
...
I could have done this easily with a HashMap and an incrementing integer, but I can't do that in Spark using the map function on a DataFrame, since I can't have an integer increasing inside the map function. Is there any way I can do this by appending a column to my existing DataFrame, or in any other way?
PS: I know there is a very similar post, but that one is for Scala and not Java.
Thanks in advance
I did it by adding a new column containing UUIDs to the DataFrame.
StructType objStructType = inputDataFrame.schema();
StructField []arrStructField=objStructType.fields();
List<StructField> fields = new ArrayList<StructField>();
List<StructField> newfields = new ArrayList<StructField>();
List <StructField> listFields = Arrays.asList(arrStructField);
StructField a = DataTypes.createStructField(leftCol, DataTypes.StringType, true); // leftCol holds the name of the new UUID column
fields.add(a);
newfields.addAll(listFields);
newfields.addAll(fields);
final int size = objStructType.size();
JavaRDD<Row> rowRDD = inputDataFrame.javaRDD().map(new Function<Row, Row>() {
private static final long serialVersionUID = 3280804931696581264L;
public Row call(Row tblRow) throws Exception {
Object[] newRow = new Object[size+1];
int rowSize= tblRow.length();
for (int itr = 0; itr < rowSize; itr++)
{
if(tblRow.apply(itr)!=null)
{
newRow[itr] = tblRow.apply(itr);
}
}
newRow[size] = UUID.randomUUID().toString();
return RowFactory.create(newRow);
}
});
inputDataFrame = objsqlContext.createDataFrame(rowRDD, DataTypes.createStructType(newfields));
OK, I found the solution to this problem and I'm posting it in case someone has the same problem:
The way to do this is with zipWithIndex from JavaRDD():
df.javaRDD().zipWithIndex().map(new Function<Tuple2<Row, Long>, Row>() {
@Override
public Row call(Tuple2<Row, Long> v1) throws Exception {
return RowFactory.create(v1._1().getString(0), v1._2());
}
})
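To turn the result back into a DataFrame, the mapped RDD still needs a schema that includes the new index column. A small sketch along the lines of the snippet above (the column names, the indexedRdd variable holding the mapped JavaRDD<Row>, and the sqlContext are assumptions for illustration):
// Schema for the rows produced above: the original string column plus the generated index
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("user", DataTypes.StringType, true),
    DataTypes.createStructField("row_id", DataTypes.LongType, false)
});
DataFrame dfWithRowId = sqlContext.createDataFrame(indexedRdd, schema);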