I have a dataframe df with four columns id, ts, lat and lon. If I run df.schema() in debug mode, I get
0 = {StructField#13126} "StructField(id,LongType,true)"
name = "id"
dataType = {LongType$#12993} "LongType"
nullable = true
metadata = {Metadata#13065} "{"encoding":"UTF-8"}"
1 = {StructField#13127} "StructField(ts,LongType,true)"
name = "timestamp"
dataType = {LongType$#12993} "LongType"
nullable = true
metadata = {Metadata#13069} "{"encoding":"UTF-8"}"
2 = {StructField#13128} "StructField(lat,DoubleType,true)"
name = "position_lat"
dataType = {DoubleType$#13034} "DoubleType"
nullable = true
metadata = {Metadata#13073} "{"encoding":"UTF-8"}"
3 = {StructField#13129} "StructField(lon,DoubleType,true)"
name = "position_lon"
dataType = {DoubleType$#13034} "DoubleType"
nullable = true
metadata = {Metadata#13076} "{"encoding":"UTF-8"}"
Now, I want to get rid of all metadata, i.e. "{"encoding":"ZSTD"}" should be replaced by "" for each column. Please note that my actual table has many columns, so the solution needs to be somewhat generic. Thank you in advance!
You can use encode("XX", "ignore").
Example:
df = data.map(lambda x: x.encode("ascii", "ignore"))
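For a generic way to clear the metadata on every column, one option is to re-select each column with an empty Metadata object. Below is a minimal sketch in Java (an illustration only, assuming the standard Spark Dataset/Column API, where Column.as(alias, metadata) and Metadata.empty() exist; df is the dataframe from the question):
import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.Metadata;
import static org.apache.spark.sql.functions.col;

// Re-alias every column of df with empty metadata, then select them all.
Column[] cleaned = Arrays.stream(df.schema().fieldNames())
        .map(name -> col(name).as(name, Metadata.empty()))
        .toArray(Column[]::new);
Dataset<Row> dfWithoutMetadata = df.select(cleaned);
This keeps the data and types untouched and only drops the per-column metadata, regardless of how many columns the table has.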
I have a dataframe named timeDF which has the schema below:
root
|-- Id: long (nullable = true)
|-- Model: timestamp (nullable = true)
|-- Prevision: timestamp (nullable = true)
I want to add a new row at the end of timeDF by transforming two Calendar objects c1 & c2 to Timestamp. I know I can do it by first converting them to Timestamp like so :
val t1 = new Timestamp(c1.getTimeInMillis)
val t2 = new Timestamp(c2.getTimeInMillis)
However, I can't figure out how to then write those variables to timeDF as a new row, or how to let Spark increment the Id column value.
Should I create a List object with t1 and t2 and make a temporary dataframe from this list, then union the two dataframes? If so, how do I manage the Id column? Isn't that too much of a mess for such a simple operation?
Can someone please explain this to me?
Thanks.
Here is a solution you can try, in a nutshell:
Ingest your file.
Create a new dataframe with your data and unionByName().
Correct the id.
Clean up.
Create the extra record
First, you create the extra record from scratch. As you mix several types, I used a POJO; here is the code:
List<ModelPrevisionRecord> data = new ArrayList<>();
ModelPrevisionRecord b = new ModelPrevisionRecord(
-1L,
new Timestamp(System.currentTimeMillis()),
new Timestamp(System.currentTimeMillis()));
data.add(b);
Dataset<ModelPrevisionRecord> ds = spark.createDataset(data,
Encoders.bean(ModelPrevisionRecord.class));
timeDf = timeDf.unionByName(ds.toDF());
ModelPrevisionRecord is a very basic POJO:
package net.jgp.labs.spark.l999_scrapbook.l000;
import java.sql.Timestamp;
public class ModelPrevisionRecord {
public long getId() {
return id;
}
public void setId(long id) {
this.id = id;
}
public Timestamp getModel() {
return model;
}
public void setModel(Timestamp model) {
this.model = model;
}
public Timestamp getPrevision() {
return prevision;
}
public void setPrevision(Timestamp prevision) {
this.prevision = prevision;
}
private long id;
private Timestamp model;
private Timestamp prevision;
public ModelPrevisionRecord(long id, Timestamp model, Timestamp prevision) {
this.id = id;
this.model = model;
this.prevision = prevision;
}
}
Correct the Id
The id is -1, so the idea is to create a new column, id2, with the right id:
// Requires: import static org.apache.spark.sql.functions.*;
timeDf = timeDf.withColumn("id2",
    when(col("id").equalTo(-1),
         timeDf.agg(max("id")).head().getLong(0) + 1)
    .otherwise(col("id")));
Cleanup the dataframe
Finally, clean up your dataframe:
timeDf = timeDf.drop("id").withColumnRenamed("id2", "id");
Important notes
This solution will only work if you add one record at a time; otherwise, you will end up with duplicate ids.
You can see the whole example here: https://github.com/jgperrin/net.jgp.labs.spark/tree/master/src/main/java/net/jgp/labs/spark/l999_scrapbook/l000, it might be easier to clone...
If your first dataframe can be sorted by ID and you need to add rows one by one, you can find the maximum ID in your dataframe:
long max = timeDF.agg(functions.max("Id")).head().getLong(0);
and then increment it and add the new row to your dataframe with a union. To do this, follow the example below, in which age acts like the id. people.json is a file from the Spark examples.
Dataset<Row> df = spark.read().json("H:\\work\\HadoopWinUtils\\people.json");
df.show();
long max = df.agg(functions.max("age")).head().getLong(0);
List<Row> rows = Arrays.asList(RowFactory.create(max+1, "test"));
StructType schema = DataTypes.createStructType(Arrays.asList(
DataTypes.createStructField("age", DataTypes.LongType, false, Metadata.empty()),
DataTypes.createStructField("name", DataTypes.StringType, false, Metadata.empty())));
Dataset<Row> df2 = spark.createDataFrame(rows, schema);
df2.show();
Dataset<Row> df3 = df.union(df2);
df3.show();
I tried this, but I don't know why, when printing the saved table, it only keeps the last 2 rows; all the others are deleted.
This is how I initialize the delta table:
val schema = StructType(
StructField("Id", LongType, false) ::
StructField("Model", TimestampType, false) ::
StructField("Prevision", TimestampType, false) :: Nil
)
var timestampDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
val write_format = "delta"
val partition_by = "Model"
val save_path = "/mnt/path/to/folder"
val table_name = "myTable"
spark.sql("DROP TABLE IF EXISTS " + table_name)
dbutils.fs.rm(save_path, true)
timestampDF.write.partitionBy(partition_by)
.format(write_format)
.save(save_path)
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
And this is how I add a new item to it:
def addTimeToData(model: Calendar, target: Calendar): Unit = {
var timeDF = spark.read
.format("delta")
.load("/mnt/path/to/folder")
val modelTS = new Timestamp(model.getTimeInMillis)
val targetTS = new Timestamp(target.getTimeInMillis)
var id: Long = 0
if (!timeDF.head(1).isEmpty) {
id = timeDF.agg(max("Id")).head().getLong(0) + 1
}
val newTime = Arrays.asList(RowFactory.create(id, modelTS, targetTS))
val schema = StructType(
StructField("Id", LongType, false) ::
StructField("Model", TimestampType, false) ::
StructField("Prevision", TimestampType, false) :: Nil
)
var newTimeDF = spark.createDataFrame(newTime, schema)
val unionTimeDF = timeDF.union(newTimeDF)
timeDF = unionTimeDF
unionTimeDF.show
val save_path = "/mnt/datalake/Exploration/Provisionning/MeteoFrance/Timestamps/"
val table_name = "myTable"
spark.sql("DROP TABLE IF EXISTS " + table_name)
dbutils.fs.rm(save_path, true)
timeDF.write.partitionBy("Model")
.format("delta")
.save(save_path)
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
}
I'm not very familiar with delta tables, so I don't know if I can just use SQL on it to add values like so:
spark.sql("INSERT INTO 'myTable' VALUES (" + id + ", " + modelTS + ", " + previsionTS + ")");
And I don't know if just putting the timestamp variables in like that will work.
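For what it's worth, here is a hedged sketch (in Java) of what such a SQL insert could look like, assuming Delta Lake accepts a standard INSERT INTO ... VALUES statement with typed timestamp literals and an unquoted table name; the variable names are the ones used above and nothing here is verified in this thread:
// Sketch only: id, modelTS and previsionTS are the values from the question.
spark.sql(String.format(
        "INSERT INTO myTable VALUES (%d, timestamp'%s', timestamp'%s')",
        id, modelTS, previsionTS));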
How can I detect, in a fixed-length file, repeated occurrences of the same record type?
BeanIO detected only the first record header, without the other two documents.
What I would like is to obtain a wrapper class with the three documents and their item codes.
This is an example of fixed length txt file:
Unknown record
Unknown record
RH20210607A
RDitem1
RDitem2
Unknown record
RH20210607B
RDitem2
RDitem3
Unknown record
Unknown record
Unknown record
RH20210607C
RDitem1
RDitem4
RDitem5
I want to detect the header (RH) and detail (RD) records.
I designed a top-level group containing a list, with another group for each subgroup.
Top group class:
@Group(name = "fixedFile")
public class ListDocumentWrapper {
    @Group(minOccurs = 1, type = Document.class, collection = List.class)
    List<Document> documentList;
}
Subgroup class:
@Data
public class Document {
    @Record(order = 1, minOccurs = 1, maxOccurs = 1)
    private RH recordHeader;
    @Record(order = 2, minOccurs = 1, type = RD.class, collection = List.class)
    private List<RD> recordDetails;
}
Single record classes:
@Data
@Record
public class RH {
    @Field(ordinal = 0, required = true, length = 2, align = Align.LEFT, rid = true, literal = "RH")
    private String recordType;
    @Field(ordinal = 1, required = true, length = 8, format = "yyyyMMdd")
    private LocalDate documentDate;
    @Field(ordinal = 2, required = true, length = 1, padding = ' ', align = Align.LEFT)
    private String documentCode;
}
@Data
@Record
public class RD {
    @Field(ordinal = 0, required = true, length = 2, align = Align.LEFT, rid = true, literal = "RD")
    private String recordType;
    @Field(ordinal = 1, required = true, length = 5, padding = ' ', align = Align.LEFT)
    private String itemCode;
}
Init of BeanReader:
// create a StreamFactory
StreamFactory factory = StreamFactory.newInstance();
// define the stream layout programmatically
String streamBuilderName = "fixedFile";
factory.define(
new StreamBuilder(streamBuilderName)
.format("fixedlength")
.parser(new FixedLengthParserBuilder())
.ignoreUnidentifiedRecords()
.addGroup(ListDocumentWrapper.class)
);
BeanReader beanReader = factory.createReader(streamBuilderName, aFileReader, Locale.ITALIAN);
Thanks for the help.
I didn't find the right solution, but I found a workaround.
The goal is to make a single read and get a list of documents.
This is the workaround.
First, I didn't understand that beanReader.read() extracts the next element of the stream. So it's not necessary to create a ListDocumentWrapper; you can instead read all Document objects with a simple while loop.
Here is an extract of the code:
List<Document> documentList = new ArrayList<>();
Document documentMessage = null;
int documentIndex = 0;
while ((documentMessage = (Document) beanReader.read()) != null) {
// process the documentMessage...
log.debug("Reading documentMessage ["+documentIndex+"]... => " + documentMessage);
documentList.add(documentMessage);
documentIndex++;
}
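With this workaround, the stream definition presumably registers Document directly rather than the ListDocumentWrapper group. A minimal sketch under that assumption, reusing the builder API shown in the question:
StreamFactory factory = StreamFactory.newInstance();
factory.define(
        new StreamBuilder("fixedFile")
                .format("fixedlength")
                .parser(new FixedLengthParserBuilder())
                .ignoreUnidentifiedRecords()
                .addGroup(Document.class));
// Each call to read() then returns one Document (header plus its details).
BeanReader beanReader = factory.createReader("fixedFile", aFileReader, Locale.ITALIAN);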
I'm creating a Kotlin app. I have an issue with MediaMetadataCompat. Maybe I have to put the data into extras? I put the data like this:
audios = allAudios!!.map { audio ->
MediaMetadataCompat.Builder()
.putString(METADATA_KEY_WRITER, audio.writer._id)
.putString(METADATA_KEY_ARTIST, audio.writer.name)
.putString(METADATA_KEY_DISPLAY_SUBTITLE, audio.writer.name)
.putString(METADATA_KEY_MEDIA_ID, audio._id)
.putString(METADATA_KEY_TITLE, audio.title)
.putString(METADATA_KEY_DISPLAY_TITLE, audio.title)
.putString(METADATA_KEY_DISPLAY_ICON_URI, audio.writer.image)
.putString(METADATA_KEY_DATE, audio.createdAt)
.putString(METADATA_KEY_MEDIA_URI, audio.filePath)
.putString(METADATA_KEY_DISPLAY_DESCRIPTION, audio.description)
.build()
}
I get it back like this:
fun MediaMetadataCompat.toAudio(): Audio? {
return let {
Audio(
_id = it.description.mediaId ?: "",
title = it.description.title.toString(),
filePath = it.description.mediaUri.toString(),
description = it.description.description.toString(),
writer = User(
_id = it.description.extras?.getString("writerId").toString(),
name = it.description.subtitle.toString(),
image = it.description.iconUri.toString()
),
tags = listOf("Shit"),
listened = 1,
language = "en",
isForKids = false,
duration = 70,
createdAt = "2020:01:01"
)
}
}
It only gives me the title, icon_uri, media_uri and media_id.
Sharing a small part of my code:
Below is how I build an object of MediaMetadataCompat. I add a couple of fields there and use them in different parts of the app.
var media = MediaMetadataCompat.Builder()
.putString(MediaMetadataCompat.METADATA_KEY_MEDIA_ID, data.id.toString())
.putString(MediaMetadataCompat.METADATA_KEY_ARTIST, data.artist_name)
.putString(MediaMetadataCompat.METADATA_KEY_TITLE, data.title)
.putString(MediaMetadataCompat.METADATA_KEY_MEDIA_URI, data.audio_path)
.putString(MediaMetadataCompat.METADATA_KEY_DATE, data.track_year)
.putString(MediaMetadataCompat.METADATA_KEY_DISPLAY_ICON_URI, data.image_path)
.putLong(MediaMetadataCompat.METADATA_KEY_DURATION, data.duration.toLong())
.build()
mediaItem is an object of MediaMetadataCompat, and this is how I get the values of the fields that were added to the object:
var artist = mediaItem.bundle.getString(MediaMetadataCompat.METADATA_KEY_ARTIST)!!
var title = mediaItem.bundle.getString(MediaMetadataCompat.METADATA_KEY_TITLE)!!
var duration = mediaItem.bundle.getLong(MediaMetadataCompat.METADATA_KEY_DURATION)!!
var icon = mediaItem.bundle.getString(MediaMetadataCompat.METADATA_KEY_DISPLAY_ICON_URI)!!
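In the same spirit, the writer fields stored by the question's builder could presumably be read back with the exact keys they were stored under; a minimal sketch in Java (assuming the support-library MediaMetadataCompat API, where getString() is available on the metadata object itself; metadata here stands for the MediaMetadataCompat instance, a name introduced only for this example):
// Read custom fields back with the same METADATA_KEY_* constants used when building.
String writerId = metadata.getString(MediaMetadataCompat.METADATA_KEY_WRITER);
String writerName = metadata.getString(MediaMetadataCompat.METADATA_KEY_ARTIST);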
Hi, what I am trying to achieve here is to submit Pageable data into a QueryDSL query and get the result as a Page. How can I do this properly? Here is what I have done so far.
Here is my controller:
@PostMapping("/view-latest-stock-by-product-codes")
public ResponseEntity<RequestResponseDTO<Page<StockAkhirResponseDto>>> findStockByProductCodes(
        @RequestBody StockViewByProductCodesDto request) {
    Page<StockAkhirResponseDto> stockAkhir = stockService.findByBulkProduct(request);
    return ResponseEntity.ok(new RequestResponseDTO<>(PESAN_TAMPIL_BERHASIL, stockAkhir));
}
In my controller I submit StockViewByProductCodesDto, which looks like this:
@Data
public class StockViewByProductCodesDto implements Serializable {
    private static final long serialVersionUID = -2530161364843162467L;
    @Schema(description = "Kode gudang yang ingin di tampilkan", example = "GBKTJKT1", required = true)
    private String warehouseCode;
    @Schema(description = "id dari sebuah branch", example = "1", required = true)
    private Long branchId;
    @Schema(description = "Kode Branch", example = "JKT", required = true)
    private String branchCode;
    @Schema(description = "Kode Product yang merupakan kode yang di ambil dari master product", example = "[\"MCM-508\",\"TL-101\"]", required = true)
    private List<String> productCodes;
    @Schema(description = "Size of row per page", example = "15", required = true)
    @NotNull
    private int size;
    @Schema(description = "Page number", example = "1", required = true)
    @NotNull
    private int page;
    @Schema(description = "Sort by", example = "id", required = false)
    private String sort;
}
And here is my service:
public Page<StockAkhirResponseDto> findByBulkProduct(StockViewByProductCodesDto request) {
String warehouseCode = request.getWarehouseCode();
Long branchId = request.getBranchId();
String branchCode = request.getBranchCode();
List<String> productCodes = request.getProductCodes();
Set<String> productCodesSet = new HashSet<String>(productCodes);
Pageable pageable = PageUtils.pageableUtils(request);
Page<StockAkhirResponseDto> stockAkhir = iStockQdslRepository.findBulkStockAkhirPage(warehouseCode, branchId, branchCode, productCodesSet, pageable);
return stockAkhir;
}
As you can see, I extract the pageable information with PageUtils.pageableUtils(request). Here is what my pageableUtils function looks like:
public static Pageable pageableUtils(RequestKeyword request) {
int page = 0;
int size = 20;
if (request.getPage() > 0) {
page = request.getPage() - 1;
}
if (request.getSize() > 0) {
size = request.getSize();
}
if (!request.getSort().isEmpty()) {
return PageRequest.of(page, size, Sort.by(request.getSort()).descending());
} else {
return PageRequest.of(page, size);
}
}
After I get the Pageable data, I submit it to my repository, which looks like this:
public Page<StockAkhirResponseDto> findBulkStockAkhirPage(String warehouseCode, Long branchId, String branchCode,
Set<String> productCodes, Pageable pageable) {
JPQLQuery<Tuple> query = new JPAQuery<>(em);
long offset = pageable.getOffset();
long limit = pageable.getPageSize();
QStock qStock = QStock.stock;
NumberExpression<Integer> totalQty = qStock.qty.sum().intValue();
query = query.select(qStock.productId, qStock.productCode, totalQty).from(qStock)
.where(qStock.warehouseCode.eq(warehouseCode), qStock.productCode.in(productCodes),
qStock.branchCode.eq(branchCode), qStock.branchId.eq(branchId))
.groupBy(qStock.productId, qStock.productCode);
query.limit(limit);
query.offset(offset);
QueryResults<Tuple> result = query.fetchResults();
long total = result.getTotal();
List<Tuple> rows = result.getResults();
List<StockAkhirResponseDto> stockAkhirDto = rows.stream()
.map(t -> new StockAkhirResponseDto(t.get(0, Long.class), t.get(1, String.class), t.get(2, Integer.class)))
.collect(Collectors.toList());
return new PageImpl<>(stockAkhirDto, pageable, total);
}
There is no error in my editor when viewing this repository, and I am able to run my project, but when I execute the repository function, I get this error:
"org.hibernate.hql.internal.ast.QuerySyntaxException: expecting CLOSE,
found ',' near line 1, column 38 [select count(distinct
stock.productId, stock.productCode, stock.warehouseId,
stock.warehouseCode, stock.branchCode, stock.branchId)\nfrom
com.bit.microservices.b2b.warehouse.entity.Stock stock\nwhere
stock.warehouseCode = ?1 and stock.productCode in ?2 and
stock.branchCode = ?3 and stock.branchId = ?4]; nested exception is
java.lang.IllegalArgumentException:
org.hibernate.hql.internal.ast.QuerySyntaxException: expecting CLOSE,
found ',' near line 1, column 38 [select count(distinct
stock.productId, stock.productCode, stock.warehouseId,
stock.warehouseCode, stock.branchCode, stock.branchId)\nfrom
com.bit.microservices.b2b.warehouse.entity.Stock stock\nwhere
stock.warehouseCode = ?1 and stock.productCode in ?2 and
stock.branchCode = ?3 and stock.branchId = ?4]"
The problem is here, on this line:
QueryResults<Tuple> result = query.fetchResults();
When I execute that line, it gives me that error. I use fetchResults() because I want to call .getTotal() for the total.
But if I execute the query with .fetch(), it works fine, like this:
List<StockAkhirResponseDto> stockAkhirDto = query.fetch()
The SQL result executes correctly. What did I miss here? How do I get the Page result correctly?
Your problem could be related to an open QueryDSL issue. The documented issue has to do with the use of fetchCount, but I think it is very likely also your case.
Consider the following comment in the mentioned issue:
fetchCount() uses a COUNT function, which is an aggregate function. Your query already has aggregate functions. You cant aggregate aggregate functions, unless a subquery is used (which is not available in JPA). Therefore this use case cannot be supported.
The issue also provides a temporary solution.
Basically, the idea is to perform the COUNT by creating a statement over the initial select. AFAIK this is not possible with QueryDSL, which is why the indicated workarounds access the underlying mechanisms provided by Hibernate.
Perhaps another thing you can try to avoid the limitation is to create a database view for your query, generate the corresponding QueryDSL objects over it, and use these objects to perform the actual computation. I am aware that it is not an ideal solution, but it will bypass this current QueryDSL limitation.
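As a rough illustration of one workaround for this particular query, the page content can keep coming from .fetch() (which already works for the asker), while the total is computed with a separate count query. This is only a sketch and relies on the assumption that productCode is functionally dependent on productId, so counting distinct productId equals the number of grouped rows:
// Total for the Page: a separate count query over the same filters,
// counting the distinct group key (assumption noted above).
Long total = new JPAQuery<Void>(em)
        .select(qStock.productId.countDistinct())
        .from(qStock)
        .where(qStock.warehouseCode.eq(warehouseCode),
               qStock.productCode.in(productCodes),
               qStock.branchCode.eq(branchCode),
               qStock.branchId.eq(branchId))
        .fetchOne();
// The page content itself keeps using query.fetch(), mapped to stockAkhirDto as in the question.
return new PageImpl<>(stockAkhirDto, pageable, total == null ? 0 : total);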
I'm new to Android app development and am creating an app that has Trips, which store Locations.
I'm getting a compile error: "trip_Id column references a foreign key but it is not part of an index. This may trigger full table scans whenever parent table is modified so you are highly advised to create an index that covers this column."
I have 2 tables: Trip & Location.
I've tried indexing the tripId and the locationId in their respective classes but it doesn't solve the issue.
A Trip has its id (PK), title, description and priority.
A Location has its locationId (PK), tripId(FK), locationName and LatLng of the place.
@Entity(tableName = "location_table",
        foreignKeys = @ForeignKey(entity = Trip.class, parentColumns = "location_Id", childColumns = "trip_Id"),
        indices = {@Index(value = {"locationId"}, unique = true)})
public class Location {
    @PrimaryKey(autoGenerate = true)
    private int locationId;
    @ColumnInfo(name = "trip_Id")
    private int tripId;
    private String locationName;
    @TypeConverters(LatLngConverter.class)
    private LatLng latLng;
@Entity(tableName = "trip_table", indices = {@Index(value = {"id"}, unique = true)})
public class Trip {
    @PrimaryKey(autoGenerate = true) // with each new row SQLite will automatically increment this ID so it will be unique
    @ColumnInfo(name = "location_Id")
    private int id;
    private String title;
    private String description;
    private int priority;
I can't seem to find out what's wrong
That message is a warning; there would be other errors as well (see the second suggestion below), e.g.
Using indices = {@Index(value = {"locationId"}, unique = true), @Index(value = {"trip_Id"})}) should overcome that warning.
HOWEVER, there is no need to have an (additional) index on locationId as it is already indexed being the primary key (this would be a waste and also inefficient). So it is suggested that you use :-
indices = {@Index(value = {"trip_Id"})})
I believe that your overall issue is that you are referring to the object's variable name for the column, when, if you have @ColumnInfo(name = "????"), you should be referring to the given name,
i.e. ???? is the column name in the underlying table.
You should also be using location_Id instead of id in the Trip:-
#Entity(tableName = "trip_table", indices = {#Index(value = {"location_Id"}, unique = true)})
I had this problem when I had quite a few foreign keys. I found out (after a lot of trial and error) that I set up my indices wrong... whoops. :P
Wrong:
indices = [
Index("active_color_style_id",
"ambient_color_style_id",
"hour_hand_dimensions_id",
"minute_hand_dimensions_id",
"second_hand_dimensions_id")
]
Right:
indices = [
Index("active_color_style_id"),
Index("ambient_color_style_id"),
Index("hour_hand_dimensions_id"),
Index("minute_hand_dimensions_id"),
Index("second_hand_dimensions_id")
]
Full example:
@Entity(
tableName = "analog_watch_face_table",
foreignKeys = [
ForeignKey(
entity = WatchFaceColorStyleEntity::class,
parentColumns = ["id"],
childColumns = ["active_color_style_id"],
onDelete = NO_ACTION
),
ForeignKey(
entity = WatchFaceColorStyleEntity::class,
parentColumns = ["id"],
childColumns = ["ambient_color_style_id"],
onDelete = NO_ACTION
),
ForeignKey(
entity = WatchFaceArmDimensionsEntity::class,
parentColumns = ["id"],
childColumns = ["hour_hand_dimensions_id"],
onDelete = NO_ACTION
),
ForeignKey(
entity = WatchFaceArmDimensionsEntity::class,
parentColumns = ["id"],
childColumns = ["minute_hand_dimensions_id"],
onDelete = NO_ACTION
),
ForeignKey(
entity = WatchFaceArmDimensionsEntity::class,
parentColumns = ["id"],
childColumns = ["second_hand_dimensions_id"],
onDelete = NO_ACTION
)
],
indices = [
Index("active_color_style_id"),
Index("ambient_color_style_id"),
Index("hour_hand_dimensions_id"),
Index("minute_hand_dimensions_id"),
Index("second_hand_dimensions_id")
]
)