Add row to Spark Dataframe with timestamps and id - java

I have a dataframe named timeDF which has the schema below:
root
|-- Id: long (nullable = true)
|-- Model: timestamp (nullable = true)
|-- Prevision: timestamp (nullable = true)
I want to add a new row at the end of timeDF by transforming two Calendar objects, c1 and c2, to Timestamp. I know I can first convert them to Timestamp like so:
val t1 = new Timestamp(c1.getTimeInMillis)
val t2 = new Timestamp(c2.getTimeInMillis)
However, I can't figure out how to write those variables to timeDF as a new row, and how to let Spark increment the Id column value.
Should I create a List with t1 and t2, make a temporary dataframe from that list, and then union the two dataframes? If so, how do I manage the Id column? Isn't that too much of a mess for such a simple operation?
Can someone explain this to me, please?
Thanks.

Here is a solution you can try, in a nutshell:
Ingest your file.
Create a new dataframe with your data and unionByName().
Correct the id.
Clean up.
Create the extra record
First, create the extra record from scratch. As you mix several types, I used a POJO; here is the code:
List<ModelPrevisionRecord> data = new ArrayList<>();
ModelPrevisionRecord b = new ModelPrevisionRecord(
-1L,
new Timestamp(System.currentTimeMillis()),
new Timestamp(System.currentTimeMillis()));
data.add(b);
Dataset<ModelPrevisionRecord> ds = spark.createDataset(data,
Encoders.bean(ModelPrevisionRecord.class));
timeDf = timeDf.unionByName(ds.toDF());
ModelPrevisionRecord is a very basic POJO:
package net.jgp.labs.spark.l999_scrapbook.l000;
import java.sql.Timestamp;
public class ModelPrevisionRecord {
public long getId() {
return id;
}
public void setId(long id) {
this.id = id;
}
public Timestamp getModel() {
return model;
}
public void setModel(Timestamp model) {
this.model = model;
}
public Timestamp getPrevision() {
return prevision;
}
public void setPrevision(Timestamp prevision) {
this.prevision = prevision;
}
private long id;
private Timestamp model;
private Timestamp prevision;
public ModelPrevisionRecord(long id, Timestamp model, Timestamp prevision) {
this.id = id;
this.model = model;
this.prevision = prevision;
}
}
Correct the Id
The id of the new record is -1, so the idea is to create a new column, id2, with the right id:
timeDf = timeDf.withColumn("id2",
when(
col("id").$eq$eq$eq(-1), timeDf.agg(max("id")).head().getLong(0)+1)
.otherwise(col("id")));
Clean up the dataframe
Finally, clean up your dataframe:
timeDf = timeDf.drop("id").withColumnRenamed("id2", "id");
Important notes
This solution will only work if you add one record at a time; otherwise, you will end up with duplicate ids.
You can see the whole example here: https://github.com/jgperrin/net.jgp.labs.spark/tree/master/src/main/java/net/jgp/labs/spark/l999_scrapbook/l000, it might be easier to clone...
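As an aside, here is a hedged sketch of my own (not part of the answer above): if you ever need to append several new records in one batch, the duplicate-id problem can be avoided by numbering the new rows with a window function and shifting them past the current maximum. This assumes the new rows are created with id = -1 and the lower-case column names id/model/prevision used above.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.*;

// Hedged sketch: current maximum id among the existing rows (the freshly appended rows still carry -1).
long maxId = timeDf.where(col("id").notEqual(-1)).agg(max("id")).head().getLong(0);
// Number the new rows 1..n and shift them past the current maximum.
Dataset<Row> newRows = timeDf.where(col("id").equalTo(-1))
    .withColumn("id2", row_number().over(Window.orderBy(col("model"))).cast("long").plus(maxId))
    .drop("id")
    .withColumnRenamed("id2", "id");
Dataset<Row> fixedDf = timeDf.where(col("id").notEqual(-1)).unionByName(newRows);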

If your first dataframe can be sorted by Id and you need to add rows one by one, you can find the maximum Id in your dataframe:
long max = timeDF.agg(functions.max("Id")).head().getLong(0);
and then increment it and add the new row to your dataframe with a union. To do this, follow the example below, in which age acts like the id. people.json is a file from the Spark examples.
Dataset<Row> df = spark.read().json("H:\\work\\HadoopWinUtils\\people.json");
df.show();
long max = df.agg(functions.max("age")).head().getLong(0);
List<Row> rows = Arrays.asList(RowFactory.create(max+1, "test"));
StructType schema = DataTypes.createStructType(Arrays.asList(
DataTypes.createStructField("age", DataTypes.LongType, false, Metadata.empty()),
DataTypes.createStructField("name", DataTypes.StringType, false, Metadata.empty())));
Dataset<Row> df2 = spark.createDataFrame(rows, schema);
df2.show();
Dataset<Row> df3 = df.union(df2);
df3.show();

I tried this, but I don't know why, when printing the saved table, it only keeps the last 2 rows; all the others are deleted.
This is how I initialize the delta table:
val schema = StructType(
StructField("Id", LongType, false) ::
StructField("Model", TimestampType, false) ::
StructField("Prevision", TimestampType, false) :: Nil
)
var timestampDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
val write_format = "delta"
val partition_by = "Model"
val save_path = "/mnt/path/to/folder"
val table_name = "myTable"
spark.sql("DROP TABLE IF EXISTS " + table_name)
dbutils.fs.rm(save_path, true)
timestampDF.write.partitionBy(partition_by)
.format(write_format)
.save(save_path)
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
And this is how I add a new item to it:
def addTimeToData(model: Calendar, target: Calendar): Unit = {
var timeDF = spark.read
.format("delta")
.load("/mnt/path/to/folder")
val modelTS = new Timestamp(model.getTimeInMillis)
val targetTS = new Timestamp(target.getTimeInMillis)
var id: Long = 0
if (!timeDF.head(1).isEmpty) {
id = timeDF.agg(max("Id")).head().getLong(0) + 1
}
val newTime = Arrays.asList(RowFactory.create(id, modelTS, targetTS))
val schema = StructType(
StructField("Id", LongType, false) ::
StructField("Model", TimestampType, false) ::
StructField("Prevision", TimestampType, false) :: Nil
)
var newTimeDF = spark.createDataFrame(newTime, schema)
val unionTimeDF = timeDF.union(newTimeDF)
timeDF = unionTimeDF
unionTimeDF.show
val save_path = "/mnt/datalake/Exploration/Provisionning/MeteoFrance/Timestamps/"
val table_name = "myTable"
spark.sql("DROP TABLE IF EXISTS " + table_name)
dbutils.fs.rm(save_path, true)
timeDF.write.partitionBy("Model")
.format("delta")
.save(save_path)
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
}
I'm not very familiar with delta tables, so I don't know if I can just use SQL on it to add values like so:
spark.sql("INSERT INTO 'myTable' VALUES (" + id + ", " + modelTS + ", " + previsionTS + ")");
And I don't know if just putting the timestamp variables in like that will work.
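For what it's worth, here is a hedged sketch only (it assumes a Spark/Delta version that supports SQL INSERT INTO on Delta tables, which Databricks does): the table name should not be wrapped in single quotes, and the timestamp variables would need to be passed as TIMESTAMP literals rather than concatenated raw, e.g.:
// Hedged sketch: unquoted table name, timestamps as TIMESTAMP literals.
spark.sql("INSERT INTO myTable VALUES (" + id
    + ", TIMESTAMP '" + modelTS + "'"
    + ", TIMESTAMP '" + targetTS + "')");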

Related

Filter the query result only and not the entire table

I am new to Spring and I am trying to do something like this. Let's say my table has the following columns:
task
is_completed
completed_at
The user will provide the following options in the query parameters.
is_completed=true or false
from_date = dd-mm-yyyy
to_date = dd-mm-yyyy
I will check for each parameter one by one and then filter the table.
In Django, I can do something like this
tasks = Task.object.all() # All the tasks will be stored in tasks
tasks = tasks.filter(is_completed=True) # completed tasks will be filtered from all tasks
tasks = tasks.filter(completed_at__gte=from_date, completed_at__lte=to_date) # completed tasks will be filtered based on the completed date
How can I achieve this with Spring Data JPA? Is there any way I can save the filtered results and query them again instead of querying the entire database?
That way I can check whether each parameter has a value and filter like this:
if (is_completed == true) {
// filter completed tasks
}
if (from_date != null) {
// filter completed tasks that are completed on or after this date
}
if (to_date != null) {
// filter completed tasks that are completed up to this date
}
The problem with the current approach is that I have to write an SQL query for each combination, which becomes complex when there are multiple parameters.
Let's say this is your Task class:
class Task {
private String id;
private boolean isCompleted;
private LocalDate isCompletedAt;
public Task(String id, boolean isCompleted, LocalDate isCompletedAt) {
this.id = id;
this.isCompleted = isCompleted;
this.isCompletedAt = isCompletedAt;
}
public boolean isCompleted() {
return isCompleted;
}
public LocalDate getCompletedAt() {
return isCompletedAt;
}
@Override
public String toString() {
return "Task{" +
"id='" + id + '\'' +
", isCompleted=" + isCompleted +
", isCompletedAt=" + isCompletedAt +
'}';
}
}
Below is the code with which you can filter the data:
class TaskFilter {
public static void main(String[] args) {
// Sample user input
boolean isCompleted = true;
LocalDate fromDate = LocalDate.parse("2020-11-04");
LocalDate toDate = LocalDate.parse("2021-11-06");
// Simulate data retrieved from JPA repository eg repository.findAll()
List<Task> tasks = List.of(
new Task("1", true, LocalDate.parse("2020-10-04")),
new Task("2", false, LocalDate.parse("2010-12-02")),
new Task("3", false, LocalDate.parse("2021-04-24")),
new Task("4", true, LocalDate.parse("2021-03-12"))
);
// Create a stream on retrieved data
Stream<Task> tasksStream = tasks.stream();
// Filter that stream based on user input
if(isCompleted) {
tasksStream = tasksStream.filter(task -> task.isCompleted());
}
if(fromDate != null) {
tasksStream = tasksStream.filter(task -> task.getCompletedAt().isAfter(fromDate));
}
if(toDate != null) {
tasksStream = tasksStream.filter(task -> task.getCompletedAt().isBefore(toDate));
}
// Finally, collect the stream into a list. This is required because a stream is not executed until a terminal operation is called.
List<Task> filteredTaskList = tasksStream.collect(Collectors.toList());
System.out.println(filteredTaskList);
}
}
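If you want the filtering to be pushed down to the database instead of happening in memory, here is a hedged sketch (my own addition, not part of the answer above) using Spring Data JPA Specifications. It assumes a Task JPA entity with fields isCompleted and completedAt, and a repository declared as interface TaskRepository extends JpaRepository<Task, String>, JpaSpecificationExecutor<Task>; use javax.persistence instead of jakarta.persistence on older Spring Boot versions.
import jakarta.persistence.criteria.Predicate;
import org.springframework.data.jpa.domain.Specification;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

class TaskSpecifications {
    // Builds one Specification that only adds predicates for the parameters that are present.
    static Specification<Task> withFilters(Boolean isCompleted, LocalDate fromDate, LocalDate toDate) {
        return (root, query, cb) -> {
            List<Predicate> predicates = new ArrayList<>();
            if (isCompleted != null) {
                predicates.add(cb.equal(root.get("isCompleted"), isCompleted));
            }
            if (fromDate != null) {
                predicates.add(cb.greaterThanOrEqualTo(root.<LocalDate>get("completedAt"), fromDate));
            }
            if (toDate != null) {
                predicates.add(cb.lessThanOrEqualTo(root.<LocalDate>get("completedAt"), toDate));
            }
            return cb.and(predicates.toArray(new Predicate[0]));
        };
    }
}
// Usage: List<Task> filtered = taskRepository.findAll(TaskSpecifications.withFilters(true, fromDate, toDate));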

How to get Page as result in Querydsl query with fetch or fetchResults properly?

Hi, what I am trying to achieve here is to submit Pageable data into a QueryDsl query and get the result as a Page. How can I do it properly? Here is what I have done until now.
Here is my controller:
@PostMapping("/view-latest-stock-by-product-codes")
public ResponseEntity<RequestResponseDTO<Page<StockAkhirResponseDto>>> findStockByProductCodes(
@RequestBody StockViewByProductCodesDto request) {
Page<StockAkhirResponseDto> stockAkhir = stockService.findByBulkProduct(request);
return ResponseEntity.ok(new RequestResponseDTO<>(PESAN_TAMPIL_BERHASIL, stockAkhir));
}
In my controller I submit a StockViewByProductCodesDto, which looks like this:
@Data
public class StockViewByProductCodesDto implements Serializable {
private static final long serialVersionUID = -2530161364843162467L;
@Schema(description = "Warehouse code to display", example = "GBKTJKT1", required = true)
private String warehouseCode;
@Schema(description = "Id of a branch", example = "1", required = true)
private Long branchId;
@Schema(description = "Branch code", example = "JKT", required = true)
private String branchCode;
@Schema(description = "Product codes taken from the master product", example = "[\"MCM-508\",\"TL-101\"]", required = true)
private List<String> productCodes;
@Schema(description = "Size of row per page", example = "15", required = true)
@NotNull
private int size;
@Schema(description = "Page number", example = "1", required = true)
@NotNull
private int page;
@Schema(description = "Sort by", example = "id", required = false)
private String sort;
}
And here is my service:
public Page<StockAkhirResponseDto> findByBulkProduct(StockViewByProductCodesDto request) {
String warehouseCode = request.getWarehouseCode();
Long branchId = request.getBranchId();
String branchCode = request.getBranchCode();
List<String> productCodes = request.getProductCodes();
Set<String> productCodesSet = new HashSet<String>(productCodes);
Pageable pageable = PageUtils.pageableUtils(request);
Page<StockAkhirResponseDto> stockAkhir = iStockQdslRepository.findBulkStockAkhirPage(warehouseCode, branchId, branchCode, productCodesSet, pageable);
return stockAkhir;
}
As you can see, I extract the pageable information with PageUtils.pageableUtils(request). Here is what my pageableUtils function looks like:
public static Pageable pageableUtils(RequestKeyword request) {
int page = 0;
int size = 20;
if (request.getPage() > 0) {
page = request.getPage() - 1;
}
if (request.getSize() > 0) {
size = request.getSize();
}
if (!request.getSort().isEmpty()) {
return PageRequest.of(page, size, Sort.by(request.getSort()).descending());
} else {
return PageRequest.of(page, size);
}
}
After I get the Pageable data, I submit it to my repository, which looks like this:
public Page<StockAkhirResponseDto> findBulkStockAkhirPage(String warehouseCode, Long branchId, String branchCode,
Set<String> productCodes, Pageable pageable) {
JPQLQuery<Tuple> query = new JPAQuery<>(em);
long offset = pageable.getOffset();
long limit = pageable.getPageSize();
QStock qStock = QStock.stock;
NumberExpression<Integer> totalQty = qStock.qty.sum().intValue();
query = query.select(qStock.productId, qStock.productCode, totalQty).from(qStock)
.where(qStock.warehouseCode.eq(warehouseCode), qStock.productCode.in(productCodes),
qStock.branchCode.eq(branchCode), qStock.branchId.eq(branchId))
.groupBy(qStock.productId, qStock.productCode);
query.limit(limit);
query.offset(offset);
QueryResults<Tuple> result = query.fetchResults();
long total = result.getTotal();
List<Tuple> rows = result.getResults();
List<StockAkhirResponseDto> stockAkhirDto = rows.stream()
.map(t -> new StockAkhirResponseDto(t.get(0, Long.class), t.get(1, String.class), t.get(2, Integer.class)))
.collect(Collectors.toList());
return new PageImpl<>(stockAkhirDto, pageable, total);
}
There is no error in my editor when viewing my repository and I am able to run my project, but when I execute my repository function, I get this error:
"org.hibernate.hql.internal.ast.QuerySyntaxException: expecting CLOSE,
found ',' near line 1, column 38 [select count(distinct
stock.productId, stock.productCode, stock.warehouseId,
stock.warehouseCode, stock.branchCode, stock.branchId)\nfrom
com.bit.microservices.b2b.warehouse.entity.Stock stock\nwhere
stock.warehouseCode = ?1 and stock.productCode in ?2 and
stock.branchCode = ?3 and stock.branchId = ?4]; nested exception is
java.lang.IllegalArgumentException:
org.hibernate.hql.internal.ast.QuerySyntaxException: expecting CLOSE,
found ',' near line 1, column 38 [select count(distinct
stock.productId, stock.productCode, stock.warehouseId,
stock.warehouseCode, stock.branchCode, stock.branchId)\nfrom
com.bit.microservices.b2b.warehouse.entity.Stock stock\nwhere
stock.warehouseCode = ?1 and stock.productCode in ?2 and
stock.branchCode = ?3 and stock.branchId = ?4]"
The problem is here, on this line:
QueryResults<Tuple> result = query.fetchResults();
When I execute that line, it gives me that error. I try to use fetchResults() because I want to call .getTotal() for the total.
But if I execute the query with .fetch(), it works fine, like this:
List<StockAkhirResponseDto> stockAkhirDto = query.fetch()
The SQL result is executed correctly, so what did I miss here? How do I get a Page result correctly?
Your problem could be related to an open QueryDSL issue. The documented issue has to do with the use of fetchCount, but I think it is very likely also your case.
Consider the following comment in the mentioned issue:
fetchCount() uses a COUNT function, which is an aggregate function. Your query already has aggregate functions. You cant aggregate aggregate functions, unless a subquery is used (which is not available in JPA). Therefore this use case cannot be supported.
The issue also provides a temporary solution.
Basically, the idea is to be able to perform the COUNT by creating a statement over the initial select. AFAIK this is not possible with QueryDsl, which is why the indicated workarounds access the underlying mechanisms provided by Hibernate.
Perhaps another thing that you can try to avoid the limitation is to create a database view for your query, the corresponding QueryDsl objects over it, and use these objects to perform the actual computation. I am aware that it is not an ideal solution, but it will bypass this current QueryDsl limitation.
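Since .fetch() already works for you, here is a hedged sketch of another workaround (my own suggestion, not taken from the linked issue): fetch the page content with fetch() and compute the total with a second query that only selects the grouping keys, counting the returned groups. It reuses the variables (em, qStock, warehouseCode, productCodes, branchCode, branchId, pageable) from the repository method above, and note that it materializes the group keys in memory, so it is only reasonable for a moderate number of groups.
// Page content: limit/offset were already applied to `query` above.
List<Tuple> rows = query.fetch();
// Total before paging: one row per group, so the number of groups is the total.
JPQLQuery<Tuple> groupQuery = new JPAQuery<>(em)
    .select(qStock.productId, qStock.productCode)
    .from(qStock)
    .where(qStock.warehouseCode.eq(warehouseCode), qStock.productCode.in(productCodes),
        qStock.branchCode.eq(branchCode), qStock.branchId.eq(branchId))
    .groupBy(qStock.productId, qStock.productCode);
long total = groupQuery.fetch().size();
List<StockAkhirResponseDto> stockAkhirDto = rows.stream()
    .map(t -> new StockAkhirResponseDto(t.get(0, Long.class), t.get(1, String.class), t.get(2, Integer.class)))
    .collect(Collectors.toList());
return new PageImpl<>(stockAkhirDto, pageable, total);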

How to eliminate Metadata from all columns in a Spark Table? (Java)

I have a dataframe df with four columns id, ts, lat and lon. If I run df.schema() in debug mode, I get
0 = {StructField#13126} "StructField(id,LongType,true)"
name = "id"
dataType = {LongType$#12993} "LongType"
nullable = true
metadata = {Metadata#13065} "{"encoding":"UTF-8"}"
1 = {StructField#13127} "StructField(ts,LongType,true)"
name = "timestamp"
dataType = {LongType$#12993} "LongType"
nullable = true
metadata = {Metadata#13069} "{"encoding":"UTF-8"}"
2 = {StructField#13128} "StructField(lat,DoubleType,true)"
name = "position_lat"
dataType = {DoubleType$#13034} "DoubleType"
nullable = true
metadata = {Metadata#13073} "{"encoding":"UTF-8"}"
3 = {StructField#13129} "StructField(lon,DoubleType,true)"
name = "position_lon"
dataType = {DoubleType$#13034} "DoubleType"
nullable = true
metadata = {Metadata#13076} "{"encoding":"UTF-8"}"
Now, I want to get rid of all metadata, i.e. "{"encoding":"ZSTD"}" should be replaced by "" for each column. Please note that my actual table has many columns, so the solution needs to be somewhat generic. Thank you in advance!
You can use encode("XX", "ignore").
Example:
df = data.map(lambda x: x.encode("ascii", "ignore"))
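As a more generic hedged sketch for the Java API (my own addition, not part of the answer above): every column can be re-selected under its own name with empty metadata, using the Column.as(alias, metadata) overload together with Metadata.empty():
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.Metadata;
import java.util.Arrays;
import static org.apache.spark.sql.functions.col;

// Rebuild every column with empty metadata; names and types stay unchanged.
Column[] cleaned = Arrays.stream(df.schema().fieldNames())
    .map(name -> col(name).as(name, Metadata.empty()))
    .toArray(Column[]::new);
Dataset<Row> dfWithoutMetadata = df.select(cleaned);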

Spring Data JPA BigList insert

I've been trying for two days now to store an array list with about six million entries in my Postgres database with Spring-Data-JPA.
The whole thing works, but it's very slow. I need about 27 minutes for everything.
I've already played around with the batch size, but that didn't bring much success. I also noticed that saving takes longer and longer the bigger the table gets. Is there a way to speed it up?
I've done the whole thing with SQLite before, there I only needed about 15 seconds for the same amount.
My Entity
@Data
@Entity
@Table(name = "commodity_prices")
public class CommodityPrice {
@Id
@Column( name = "id" )
@GeneratedValue( strategy = GenerationType.SEQUENCE )
private long id;
@Column(name = "station_id")
private int station_id;
@Column(name = "commodity_id")
private int commodity_id;
@Column(name = "supply")
private long supply;
@Column(name = "buy_price")
private int buy_price;
@Column(name = "sell_price")
private int sell_price;
@Column(name = "demand")
private long demand;
@Column(name = "collected_at")
private long collected_at;
public CommodityPrice( int station_id, int commodity_id, long supply, int buy_price, int sell_price, long demand,
long collected_at ) {
this.station_id = station_id;
this.commodity_id = commodity_id;
this.supply = supply;
this.buy_price = buy_price;
this.sell_price = sell_price;
this.demand = demand;
this.collected_at = collected_at;
}
}
My insert Class
@Slf4j
@Component
public class CommodityPriceHandler {
@Autowired
CommodityPriceRepository commodityPriceRepository;
@Autowired
private EntityManager entityManager;
public void inserIntoDB() {
int lineCount = 0;
List<CommodityPrice> commodityPrices = new ArrayList<>( );
StopWatch stopWatch = new StopWatch();
stopWatch.start();
try {
Reader reader = new FileReader( DOWNLOAD_SAVE_PATH + FILE_NAME_COMMODITY_PRICES );
Iterable<CSVRecord> records = CSVFormat.EXCEL.withFirstRecordAsHeader().parse( reader );
for( CSVRecord record : records ) {
int station_id = Integer.parseInt( record.get( "station_id" ) );
int commodity_id = Integer.parseInt( record.get( "commodity_id" ) );
long supply = Long.parseLong( record.get( "supply" ) );
int buy_price = Integer.parseInt( record.get( "buy_price" ) );
int sell_price = Integer.parseInt( record.get( "sell_price" ) );
long demand = Long.parseLong( record.get( "demand" ) );
long collected_at = Long.parseLong( record.get( "collected_at" ) );
CommodityPrice commodityPrice = new CommodityPrice(station_id, commodity_id, supply, buy_price, sell_price, demand, collected_at);
commodityPrices.add( commodityPrice );
if (commodityPrices.size() == 1000){
commodityPriceRepository.saveAll( commodityPrices );
commodityPriceRepository.flush();
entityManager.clear();
commodityPrices.clear();
System.out.println(lineCount);
}
lineCount ++;
}
}
catch( IOException e ) {
log.error( e.getLocalizedMessage() );
}
commodityPriceRepository.saveAll( commodityPrices );
stopWatch.stop();
log.info( "Successfully inserted " + lineCount + " lines in " + stopWatch.getTotalTimeSeconds() + " seconds." );
}
}
My application.properties
# HIBERNATE
spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect
spring.jpa.properties.hibernate.jdbc.lob.non_contextual_creation=true
spring.jpa.hibernate.ddl-auto = update
spring.jpa.properties.hibernate.jdbc.batch_size=1000
spring.jpa.properties.hibernate.order_inserts=true
While you are doing your inserts in batches, your sequence generation strategy still requires you to issue one statement for each record you insert. Thus, for a batch of 1000 records you issue 1001 statements, which is clearly not what is expected.
My recommendations:
Enable SQL logging to see what statements are sent to your db. I personally use datasource-proxy, but use anything you are happy with.
Modify your sequence generator. At a minimum, use:
@Id
@Column( name = "id" )
@GeneratedValue(generator = "com_pr_generator", strategy = GenerationType.SEQUENCE )
@SequenceGenerator(name="com_pr_generator", sequenceName = "book_seq", allocationSize=50)
private long id;
Read about the different generation strategies and fine-tune your sequence generator:
A beginner’s guide to Hibernate enhanced identifier generators
Hibernate pooled and pooled-lo identifier generators

How to transform a csv string into a Spark-ML compatible Dataset<Row> format?

I have a Dataset<Row> df that contains two columns ("key" and "value") of type string. df.printSchema(); gives me the following output:
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
The content of the value column is actually a CSV formatted line (coming from a Kafka topic), with the last entry of the line representing the class label and all the previous entries being the features (the header row is not included in the dataset):
feature0,feature1,label
0.6720004294237854,-0.4033586564886893,0
0.6659082469383558,0.07688976580256132,0
0.8086502311695247,0.564354801275521,1
Since I would like to train a classifier on this data, I need to transform this representation into a column of type dense vector containing all the feature values, and a column of type double containing the label value:
root
|-- indexedFeatures: vector (nullable = false)
|-- indexedLabel: double (nullable = false)
How can I do this, using java 1.8 and Spark 2.2.0?
Edit: I got further, but while attempting to make it work with a flexible number of feature dimensions, I got stuck again. I created a follow-up question.
A VectorAssembler (javadocs) can transform the dataset into the required format.
First, the input is split into three columns:
Dataset<FeaturesAndLabelData> featuresAndLabelData = inputDf.select("value").as(Encoders.STRING())
.flatMap(s -> {
String[] splitted = s.split(",");
if (splitted.length == 3) {
return Collections.singleton(new FeaturesAndLabelData(
Double.parseDouble(splitted[0]),
Double.parseDouble(splitted[1]),
Integer.parseInt(splitted[2]))).iterator();
} else {
// apply some error handling...
return Collections.emptyIterator();
}
}, Encoders.bean(FeaturesAndLabelData.class));
The result is then transformed by a VectorAssembler:
VectorAssembler assembler = new VectorAssembler()
.setInputCols(new String[] { "feature1", "feature2" })
.setOutputCol("indexedFeatures");
Dataset<Row> result = assembler.transform(featuresAndLabelData)
.withColumn("indexedLabel", functions.col("label").cast("double"))
.select("indexedFeatures", "indexedLabel");
The result dataframe has the required format:
+----------------------------------------+------------+
|indexedFeatures |indexedLabel|
+----------------------------------------+------------+
|[0.6720004294237854,-0.4033586564886893]|0.0 |
|[0.6659082469383558,0.07688976580256132]|0.0 |
|[0.8086502311695247,0.564354801275521] |1.0 |
+----------------------------------------+------------+
root
|-- indexedFeatures: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
FeaturesAndLabelData is a simple Java bean to make sure that the column names are correct:
public class FeaturesAndLabelData {
private double feature1;
private double feature2;
private int label;
//getters and setters...
}
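Since the stated goal is to train a classifier, here is a short hedged usage sketch (my addition; logistic regression is just an arbitrary example of an estimator) of how the resulting dataframe could be consumed:
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;

// Train on the assembled features/label columns produced above.
LogisticRegression lr = new LogisticRegression()
    .setFeaturesCol("indexedFeatures")
    .setLabelCol("indexedLabel");
LogisticRegressionModel model = lr.fit(result);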
You have different ways of achieving this.
Create a schema as per your CSV file.
public class CSVData implements Serializable {
String col1;
String col2;
long col3;
String col4;
//getters and setters
}
Then convert the file into an RDD.
JavaSparkContext sc;
JavaRDD<String> data = sc.textFile("path-to-csv-file");
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<CSVData> csv_rdd = data.map(
new Function<String, CSVData>() {
public CSVData call(String line) throws Exception {
String[] fields = line.split(",");
// assumes CSVData has a matching (String, String, long, String) constructor
CSVData sd = new CSVData(fields[0], fields[1], Long.parseLong(fields[2].trim()), fields[3]);
return sd;
}
});
Or
Create a Spark Session to read the file as a Dataset.
SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local[*]")
.getOrCreate();
//Read file
Dataset<Row> ds = spark.read().text("path-to-csv-file");
or
Dataset<Row> ds = spark.read().csv("path-to-csv-file");
ds.show();
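As a hedged continuation of the second approach (my addition, reusing the column names feature0, feature1 and label from the sample data above): once the CSV is read with a header and inferred types, a VectorAssembler, as in the first answer, produces the required format:
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Read the CSV with a header row and let Spark infer the numeric column types.
Dataset<Row> csv = spark.read()
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("path-to-csv-file");
// Assemble the feature columns into one vector and cast the label to double.
Dataset<Row> prepared = new VectorAssembler()
    .setInputCols(new String[] { "feature0", "feature1" })
    .setOutputCol("indexedFeatures")
    .transform(csv)
    .withColumn("indexedLabel", functions.col("label").cast("double"))
    .select("indexedFeatures", "indexedLabel");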
