Why does Spark write nulls to a Delta Lake table? - java

My Java application uses Spark Structured Streaming connected to a socket server that continuously receives sensor (IoT) measurement records, each wrapped in an RDMessage object that carries the message type used for protocol control.
When messages arrive they are validated and converted into a Dataset using Encoder<RDMeasurement> measurementEncoder = Encoders.bean(RDMeasurement.class).
Although the stream is read correctly and the RDMeasurement objects are created correctly, every field of the output is written as null or zero, depending on the data type. I see this both in the Delta Lake table and in the console when I switch the sink format (.format("console")).
What did I miss here? What is going wrong?
The most significant segments of the Java code are shown below:
public final class SocketRDMeasurement {
public static void main(String[] args) throws Exception {
SparkSession spark = SparkSession
.builder()
.appName("SSSocketRDMeasurement")
.master("local[*]")
.getOrCreate();
Encoder<StringArray> stringArrayEncoder = Encoders.bean(StringArray.class);
Encoder<RDMessage> messageEncoder = Encoders.bean(RDMessage.class);
Encoder<RDMeasurement> measurementEncoder = Encoders.bean(RDMeasurement.class);
Dataset<Row> records = spark
.readStream()
.format("socket")
.option("host", host)
.option("port", port)
.load();
Dataset<String> inputReceived = records.as(Encoders.STRING());
Dataset<StringArray> input = inputReceived.as(Encoders.STRING())
.map((MapFunction<String, StringArray>) x ->
new StringArray(x),
stringArrayEncoder);
Dataset<RDMessage> messages = input.map(
(MapFunction<StringArray, RDMessage>)
r -> new RDMessage(r), messageEncoder);
Dataset<RDMeasurement> measurements = messages
.map((MapFunction<RDMessage, RDMeasurement>) r ->
new RDMeasurement(), measurementEncoder);
// The code executes without warning or error, but despite the
// objects being created correctly, the dataset is
// written with nulls/NaN
StreamingQuery query = measurements.writeStream()
.outputMode("append")
.format("delta")
.option("checkpointLocation",
"/opt/data/delta/_checkpoints/ss-socket-rd-measurement")
.start("/opt/data/delta/ss-socket-rd-measurement");
query.awaitTermination();
}
}
public class StringArray implements Serializable {
private String[] tokens;
public StringArray(String tokens) {
this.tokens = tokens.split(",");
}
// getters, setters and toString go here
}
public class RDMeasurement implements Serializable {
private String dataSourceName = null;
private double dt = 0.0;
private double t0 = 0.0;
private double endTimestamp = 0.0;
private double[] valuesArray;
public RDMeasurement() { }
public RDMeasurement(String dataSourceName, double t0,
double dt, double endTimestamp, double[] valuesArray) {
this.dataSourceName = dataSourceName;
this.t0 = t0;
this.dt = dt;
this.endTimestamp = endTimestamp;
this.valuesArray = valuesArray;
}
// getters, setters and toString go here
}
public class RDMessage implements Serializable {
String type;
RDMeasurement rdMeasurement;
public RDMessage(String type, RDMeasurement rdMeasurement) {
this.type = type;
this.rdMeasurement = rdMeasurement;
}
public RDMessage(StringArray stringArray) {
// token layout: [0]=type, [1]=dataSourceName, [2]=t0, [3]=dt, [4]=endTimestamp, [5..]=values
this(stringArray.getTokens()[0] ,
new RDMeasurement(stringArray.getTokens()[1],
Double.parseDouble(stringArray.getTokens()[2]),
Double.parseDouble(stringArray.getTokens()[3]),
Double.parseDouble(stringArray.getTokens()[4]),
toDoubleArray(5, stringArray))
);
}
private static double[] toDoubleArray(int skip, StringArray stringArray) {
double[] ret = new double[stringArray.getTokens().length - skip];
for (int i = 0; i < ret.length; i++) {
ret[i] = Double.parseDouble(stringArray.getTokens()[i + skip]);
}
return ret;
}
// getters, setters and toString go here
}
Each line of input follows the format below:
V1_start_rd_0,ds_1,1642442598.266,1.0,1642442618.266,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9.00,10.00,11.00,12.00,13.00,14.00,15.00,16.00,17.00,18.00,19.00,20.00
V1_rd_1,ds_2,1642442619.266,1.0,1642442639.266,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9.00,10.00,11.00,12.00,13.00,14.00,15.00,16.00,17.00,18.00,19.00,20.00
V1_rd_2,ds_3,1642442640.266,1.0,1642442660.266,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9.00,10.00,11.00,12.00,13.00,14.00,15.00,16.00,17.00,18.00,19.00,20.00

After refactoring my Java code and adding a debug segment, I was able to identify the error.
Here is the refactored code:
StreamingQuery query = dataStreamReader.load()
.as(Encoders.STRING())
.map((MapFunction<String, StringArray>) x -> new StringArray(x),
stringArrayEncoder)
.map((MapFunction<StringArray, RDMessage>)
r -> new RDMessage(r), messageEncoder)
.map((MapFunction<RDMessage, RDMeasurement>) e ->
e.getRdMeasurement(), measurementEncoder)
/*
.map((MapFunction<RDMeasurement, String>) e -> {
if (e.getDataSourceName() != null) {
System.out.println("•••> " + e);
}
return e.toString();
}, Encoders.STRING())
.map((MapFunction<String, RDMeasurement>) s -> new RDMeasurement(s),
measurementEncoder)
*/
.writeStream()
.outputMode("append")
.format("console")
.start();
query.awaitTermination();
The commented-out debug code above allowed me to identify the problem: the last map of the original pipeline built a new, empty RDMeasurement with new RDMeasurement() instead of taking the measurement already carried by the message via r.getRdMeasurement(), so every field kept its default value (null or zero) and was written to the sink that way.
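For reference, a minimal sketch of the corrected last stage of the original pipeline (same encoders and classes as shown above):
// Corrected last stage: reuse the measurement carried by the RDMessage
// instead of constructing an empty one.
Dataset<RDMeasurement> measurements = messages
    .map((MapFunction<RDMessage, RDMeasurement>) r -> r.getRdMeasurement(),
        measurementEncoder);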

Related

Get the number of posts in a list array from another class - but it doesn't work?

I am trying to get the number of posts from an ArrayList in one class and use it in MainActivity,
but it doesn't work.
Here is my code.
public static int countNotify;
public static List<Notification> bindNotifyData(JsonElement list)
{
List<Notification> results= new ArrayList<>();
JsonObject dataJsonObj = list.getAsJsonObject();
// get data api from Json array "updates"
JsonArray notifyJsonArray = dataJsonObj.get("updates").getAsJsonArray();
ArrayList<Notification> notifyList = new ArrayList<>();
countNotify=notifyJsonArray.size();
if(notifyJsonArray != null && notifyJsonArray.size() > 0) {
for(int i = 0; i < notifyJsonArray.size(); i++) {
JsonObject notifyJson = (JsonObject) notifyJsonArray.get(i);
Notification notification = new Notification();
notification.setContent(notifyJson.get("content").getAsString());
// Convert timestamp to Datetime
String timestamp= notifyJson.get("time").getAsString();
notification.setTime(ConvertTimestamp(timestamp));
results.add(notification);
// count numbers of the post in the list json array.
}
}
return results;
}
And in the MainActivity.class
final int count=BindFetchDataHelper.countNotify;
But the value of count always is 0
Try creating an instance of your class:
BindFetchDataHelper bindFetchDataHelper = new BindFetchDataHelper();
and then call final int count = bindFetchDataHelper.countNotify;
I had the same issue, it should work now.
EDIT
Try it like this:
public class BindFetchDataHelper {
private static int countNotify = 0;
public static int getcountNotify() {
return countNotify;
}
public static void setcountNotify(int countNotify) {
BindFetchDataHelper.countNotify = countNotify;
}
//your other functions
}
And now, to set or get the variable:
BindFetchDataHelper bindFetchDataHelper = new BindFetchDataHelper();
bindFetchDataHelper.setcountNotify(YOURVALUE); //set
int whatYouWant = bindFetchDataHelper.getcountNotify(); //get

Why updating broadcast variable sample code didn't work?

I want to update a broadcast variable every minute, so I used the sample code given by Aastha in this question:
how can I update a broadcast variable in Spark streaming?
But it doesn't work. The function updateAndGet() only runs when the streaming application starts. When I debugged my code, execution never entered updateAndGet() a second time, so the broadcast variable is not updated every minute.
Why?
Here is my sample code.
public class BroadcastWrapper {
private Broadcast<List<String>> broadcastVar;
private Date lastUpdatedAt = Calendar.getInstance().getTime();
private static BroadcastWrapper obj = new BroadcastWrapper();
private BroadcastWrapper(){}
public static BroadcastWrapper getInstance() {
return obj;
}
public JavaSparkContext getSparkContext(SparkContext sc) {
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
return jsc;
}
public Broadcast<List<String>> updateAndGet(JavaStreamingContext jsc) {
Date currentDate = Calendar.getInstance().getTime();
long diff = currentDate.getTime()-lastUpdatedAt.getTime();
if (broadcastVar == null || diff > 60000) { // Lets say we want to refresh every 1 min =
// 60000 ms
if (broadcastVar != null)
broadcastVar.unpersist();
lastUpdatedAt = new Date(System.currentTimeMillis());
// Your logic to refresh the data
// List<String> data = getRefData();
List<String> data = new ArrayList<String>();
data.add("tang");
data.add("xiao");
data.add(String.valueOf(System.currentTimeMillis()));
broadcastVar = jsc.sparkContext().broadcast(data);
}
return broadcastVar;
}
}
//Here is the computing code submit to spark streaming.
lines.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
Broadcast<List<String>> blacklist =
BroadcastWrapper.getInstance().updateAndGet(jsc);
@Override
public JavaRDD<String> call(JavaRDD<String> rdd) {
JavaRDD<String> dd=rdd.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String word) {
if (blacklist.getValue().contains(word)) {
return false;
} else {
return true;
}
}
});
return dd;
}});

Spark DataFrame aggregation

I have the following code:
public class IPCCodes {
public static class IPCCount implements Serializable {
public IPCCount(long permid, int year, int count, String ipc) {
this.permid = permid;
this.year = year;
this.count = count;
this.ipc = ipc;
}
public long permid;
public int year;
public int count;
public String ipc;
}
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("IPC codes");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
DataFrame df = sqlContext.sql("SELECT * FROM test.some_table WHERE year>2004");
JavaRDD<Row> rdd = df.javaRDD();
JavaRDD<IPCCount> map = rdd.flatMap(new FlatMapFunction<Row, IPCCount>() {
@Override
public Iterable<IPCCount> call(Row row) throws Exception {
List<IPCCount> counts = new ArrayList<>();
try {
String codes = row.getString(7);
for (String s : codes.split(",")) {
if(s.length()>4){
counts.add(new IPCCount(row.getLong(4), row.getInt(6), 1, s.substring(0, 4)));
}
}
} catch (NumberFormatException e) {
System.out.println(e.getMessage());
}
return counts;
}
});
}
}
I created a DataFrame from a Hive table and applied a flatMap function to split the IPC codes (this field is an array of strings in the Hive table). After that I need to aggregate the codes with a count per permid and year; the result table should be permid/year/ipc/count.
What is the most efficient way to do it?
If you want a DataFrame as output there is no good reason to use RDD and flatMap. As far as I can tell, everything can be handled easily using basic Spark SQL functions. Using Scala:
import org.apache.spark.sql.functions.{col, explode, length, split, substring}
val transformed = df
.select(col("permid"), col("year"),
// Split ipc and explode into multiple rows
explode(split(col("ipc"), ",")).alias("code"))
.where(length(col("code")).gt(4)) // filter
.withColumn("code", substring(col("code"), 0, 4))
transformed.groupBy(col("permid"), col("year"), col("code")).count
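Since the question uses Java, here is a rough Java version of the same approach - a sketch only, assuming a Spark release in which split, explode, length and substring are available in org.apache.spark.sql.functions (the DataFrame type matches the Spark 1.x API used in the question):
import static org.apache.spark.sql.functions.*;

DataFrame transformed = df
    .select(col("permid"), col("year"),
        // split the ipc string on "," and explode into one row per code
        explode(split(col("ipc"), ",")).alias("code"))
    .where(length(col("code")).gt(4))                  // keep only codes longer than 4 characters
    .withColumn("code", substring(col("code"), 0, 4)); // truncate to the 4-character prefix

DataFrame result = transformed
    .groupBy(col("permid"), col("year"), col("code"))
    .count();                                          // permid / year / code / count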

generating a number between a range using json

How can we generate a number within a range using JSON?
For example, we have to generate a number between 0 and 50; how can we do this in Java using JSON?
This is my Json Data
{
"rand": {
"type": "number",
"minimum": 0,
"exclusiveMinimum": false,
"maximum": 50,
"exclusiveMaximum": true
}
}
This is what I have tried in Java
public class JavaApplication1 {
public static void main(String[] args) {
try {
for (int i=0;i<5;i++)
{
FileInputStream fileInputStream = new FileInputStream("C://users/user/Desktop/V.xls");
HSSFWorkbook workbook = new HSSFWorkbook(fileInputStream);
HSSFSheet worksheet = workbook.getSheet("POI Worksheet");
HSSFRow row1 = worksheet.getRow(0);
HSSFCell cellE1 = row1.getCell((short) 4); // assumed: column E is index 4
String e1Val = cellE1.getStringCellValue();
HSSFCell cellF1 = row1.getCell((short) 5);
System.out.println("E1: " + e1Val);
JSONObject obj = new JSONObject();
obj.put("value", e1Val);
System.out.print(obj + "\n");
ObjectMapper mapper = new ObjectMapper(); // Jackson mapper, not declared in the original snippet
Map<String, Object> c_data = mapper.readValue(e1Val, Map.class);
System.out.println(c_data);
}
} catch (FileNotFoundException e) {
} catch (IOException e) {
}
}
}
The JSON data is stored in an Excel sheet, and from there I am reading it in the Java program.
Get a JSON reader like GSON.
Read the JSON into an equivalent object like:
public class rand{
private String type;
private int minimum;
private boolean exclusiveMinimum;
private int maximum;
private boolean exclusiveMaximum;
//this standard-constructor is needed for the JsonReader
public rand(){
}
//Getter for all Values
}
and after reading in your JSON you can access your data via the getter methods.
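A minimal sketch of how that could look with GSON (the RandWrapper class, the getter names, and the inline JSON string are assumptions for illustration, based on the rand class above and the JSON from the question):
import com.google.gson.Gson;
import java.util.Random;

// Wrapper for the outer object: { "rand": { ... } }
class RandWrapper {
    rand rand; // the rand class sketched above
}

public class RandFromJson {
    public static void main(String[] args) {
        String json = "{\"rand\":{\"type\":\"number\",\"minimum\":0,"
            + "\"exclusiveMinimum\":false,\"maximum\":50,\"exclusiveMaximum\":true}}";

        RandWrapper wrapper = new Gson().fromJson(json, RandWrapper.class);

        // Respect the exclusive flags when building the effective range
        int min = wrapper.rand.getMinimum() + (wrapper.rand.isExclusiveMinimum() ? 1 : 0);
        int max = wrapper.rand.getMaximum() - (wrapper.rand.isExclusiveMaximum() ? 1 : 0);

        // nextInt(bound) excludes bound, so add 1 to make max inclusive
        int value = new Random().nextInt(max - min + 1) + min;
        System.out.println(value);
    }
}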
I think that Jackson may be of help here.
I suggest that you create a data model in Java that reflects the JSON. This can be along the lines of:
// This is the root object. It contains the input data (RandomizerInput) and a
// generate-function that is used for generating new random ints.
public class RandomData {
private RandomizerInput input;
@JsonCreator
public RandomData(@JsonProperty("rand") final RandomizerInput input) {
this.input = input;
}
@JsonProperty("rand")
public RandomizerInput getInput() {
return input;
}
@JsonProperty("generated")
public int generateRandomNumber() {
int max = input.isExclusiveMaximum()
? input.getMaximum() - 1 : input.getMaximum();
int min = input.isExclusiveMinimum()
? input.getMinimum() + 1 : input.getMinimum();
return new Random().nextInt((max - min) + 1) + min;
}
}
// This is the input data (pretty much what is described in the question).
public class RandomizerInput {
private final boolean exclusiveMaximum;
private final boolean exclusiveMinimum;
private final int maximum;
private final int minimum;
private final String type;
@JsonCreator
public RandomizerInput(
@JsonProperty("type") final String type,
@JsonProperty("minimum") final int minimum,
@JsonProperty("exclusiveMinimum") final boolean exclusiveMinimum,
@JsonProperty("maximum") final int maximum,
@JsonProperty("exclusiveMaximum") final boolean exclusiveMaximum) {
this.type = type; // Not really used...
this.minimum = minimum;
this.exclusiveMinimum = exclusiveMinimum;
this.maximum = maximum;
this.exclusiveMaximum = exclusiveMaximum;
}
public int getMaximum() {
return maximum;
}
public int getMinimum() {
return minimum;
}
public String getType() {
return type;
}
public boolean isExclusiveMaximum() {
return exclusiveMaximum;
}
public boolean isExclusiveMinimum() {
return exclusiveMinimum;
}
}
To use these classes the ObjectMapper from Jackson can be used like this:
public static void main(String... args) throws IOException {
String json =
"{ " +
"\"rand\": { " +
"\"type\": \"number\", " +
"\"minimum\": 0, " +
"\"exclusiveMinimum\": false, " +
"\"maximum\": 50, " +
"\"exclusiveMaximum\": true " +
"} " +
"}";
// Create the mapper
ObjectMapper mapper = new ObjectMapper();
// Convert JSON to POJO
final RandomData randomData = mapper.readValue(json, RandomData.class);
// Either you can get the random this way...
final int random = randomData.generateRandomNumber();
// Or, you can serialize the whole thing as JSON....
String str = mapper.writeValueAsString(randomData);
// Output is:
// {"rand":{"type":"number","minimum":0,"exclusiveMinimum":false,"maximum":50,"exclusiveMaximum":true},"generated":21}
System.out.println(str);
}
The actual generation of a random number is based on this SO question.

Slow chunk response in Play 2.2

In my Play-Framework-based web application users can download all the rows of different database tables in CSV or JSON format. The tables are relatively large (100k+ rows) and I am trying to stream back the result using chunking in Play 2.2.
However, the problem is that although println statements show the rows being written to the Chunks.Out object, they never show up on the client side! If I limit the number of rows sent back it works, but there is a big delay at the beginning, which grows if I try to send back all the rows and ends in a time-out or the server running out of memory.
I use Ebean ORM and the tables are indexed and querying from psql doesn't take much time. Does anyone have any idea what might be the problem?
I appreciate your help a lot!
Here is the code for one of the controllers:
@SecureSocial.UserAwareAction
public static Result showEpex() {
User user = getUser();
if(user == null || user.getRole() == null)
return ok(views.html.profile.render(user, Application.NOT_CONFIRMED_MSG));
DynamicForm form = DynamicForm.form().bindFromRequest();
final UserRequest req = UserRequest.getRequest(form);
if(req.getFormat().equalsIgnoreCase("html")) {
Page<EpexEntry> page = EpexEntry.page(req.getStart(), req.getFinish(), req.getPage());
return ok(views.html.epex.render(page, req));
}
// otherwise chunk result and send back
final ResultStreamer<EpexEntry> streamer = new ResultStreamer<EpexEntry>();
Chunks<String> chunks = new StringChunks() {
@Override
public void onReady(play.mvc.Results.Chunks.Out<String> out) {
Page<EpexEntry> page = EpexEntry.page(req.getStart(), req.getFinish(), 0);
ResultStreamer<EpexEntry> streamer = new ResultStreamer<EpexEntry>();
streamer.stream(out, page, req);
}
};
return ok(chunks).as("text/plain");
}
And the streamer:
public class ResultStreamer<T extends Entry> {
private static ALogger logger = Logger.of(ResultStreamer.class);
public void stream(Out<String> out, Page<T> page, UserRequest req) {
if(req.getFormat().equalsIgnoreCase("json")) {
JsonContext context = Ebean.createJsonContext();
out.write("[\n");
for(T e: page.getList())
out.write(context.toJsonString(e) + ", ");
while(page.hasNext()) {
page = page.next();
for(T e: page.getList())
out.write(context.toJsonString(e) + ", ");
}
out.write("]\n");
out.close();
} else if(req.getFormat().equalsIgnoreCase("csv")) {
for(T e: page.getList())
out.write(e.toCsv(CSV_SEPARATOR) + "\n");
while(page.hasNext()) {
page = page.next();
for(T e: page.getList())
out.write(e.toCsv(CSV_SEPARATOR) + "\n");
}
out.close();
}else {
out.write("Invalid format! Only CSV, JSON and HTML can be generated!");
out.close();
}
}
public static final String CSV_SEPARATOR = ";";
}
And the model:
@Entity
@Table(name="epex")
public class EpexEntry extends Model implements Entry {
@Id
@Column(columnDefinition = "pg-uuid")
private UUID id;
private DateTime start;
private DateTime finish;
private String contract;
private String market;
private Double low;
private Double high;
private Double last;
@Column(name="weight_avg")
private Double weightAverage;
private Double index;
private Double buyVol;
private Double sellVol;
private static final String START_COL = "start";
private static final String FINISH_COL = "finish";
private static final String CONTRACT_COL = "contract";
private static final String MARKET_COL = "market";
private static final String ORDER_BY = MARKET_COL + "," + CONTRACT_COL + "," + START_COL;
public static final int PAGE_SIZE = 100;
public static final String HOURLY_CONTRACT = "hourly";
public static final String MIN15_CONTRACT = "15min";
public static final String FRANCE_MARKET = "france";
public static final String GER_AUS_MARKET = "germany/austria";
public static final String SWISS_MARKET = "switzerland";
public static Finder<UUID, EpexEntry> find =
new Finder(UUID.class, EpexEntry.class);
public EpexEntry() {
}
public EpexEntry(UUID id, DateTime start, DateTime finish, String contract,
String market, Double low, Double high, Double last,
Double weightAverage, Double index, Double buyVol, Double sellVol) {
this.id = id;
this.start = start;
this.finish = finish;
this.contract = contract;
this.market = market;
this.low = low;
this.high = high;
this.last = last;
this.weightAverage = weightAverage;
this.index = index;
this.buyVol = buyVol;
this.sellVol = sellVol;
}
public static Page<EpexEntry> page(DateTime from, DateTime to, int page) {
if(from == null && to == null)
return find.order(ORDER_BY).findPagingList(PAGE_SIZE).getPage(page);
ExpressionList<EpexEntry> exp = find.where();
if(from != null)
exp = exp.ge(START_COL, from);
if(to != null)
exp = exp.le(FINISH_COL, to.plusHours(24));
return exp.order(ORDER_BY).findPagingList(PAGE_SIZE).getPage(page);
}
@Override
public String toCsv(String s) {
return id + s + start + s + finish + s + contract +
s + market + s + low + s + high + s +
last + s + weightAverage + s +
index + s + buyVol + s + sellVol;
}
1. Most browsers wait for 1-5 KB of data before showing any results. You can check whether Play Framework actually sends data with the command curl http://localhost:9000.
2. You create the streamer twice; remove the first final ResultStreamer<EpexEntry> streamer = new ResultStreamer<EpexEntry>();
3. You use the Page class for retrieving a large data set - this is incorrect. You actually do one big initial request and then one request per iteration. This is SLOW. Use a simple findIterate().
Add this to EpexEntry (feel free to change it as you need):
public static QueryIterator<EpexEntry> all() {
return find.order(ORDER_BY).findIterate();
}
Your new stream method implementation:
public void stream(Out<String> out, QueryIterator<T> iterator, UserRequest req) {
if(req.getFormat().equalsIgnoreCase("json")) {
JsonContext context = Ebean.createJsonContext();
out.write("[\n");
while (iterator.hasNext()) {
out.write(context.toJsonString(iterator.next()) + ", ");
}
iterator.close(); // it's important to close the iterator
out.write("]\n");
out.close();
} else // csv implementation here
And your onReady method:
QueryIterator<EpexEntry> iterator = EpexEntry.all();
ResultStreamer<EpexEntry> streamer = new ResultStreamer<EpexEntry>();
streamer.stream(new BuffOut(out, 10000), iterator, req); // notice buffering here
4. Another problem is that you call Out<String>.write() too often. A call to write() means the server needs to send a new chunk of data to the client immediately, and every call to Out<String>.write() has significant overhead.
The overhead appears because the server needs to wrap the response into a chunked result - 6-7 bytes for each message (the chunked response format). Since you send small messages, the overhead is significant.
Also, the server needs to wrap your reply in a TCP packet whose size will be far from optimal.
And the server needs to perform some internal work to send a chunk, which also requires resources. As a result, download bandwidth will be far from optimal.
Here is a simple test: send 10000 lines of text, TEST0 to TEST9999, in chunks. This takes 3 seconds on my computer on average, but with buffering it takes 65 ms. The download sizes are 136 KB and 87.5 KB respectively.
Example with buffering:
Controller
public class Application extends Controller {
public static Result showEpex() {
Chunks<String> chunks = new StringChunks() {
@Override
public void onReady(play.mvc.Results.Chunks.Out<String> out) {
new ResultStreamer().stream(out);
}
};
return ok(chunks).as("text/plain");
}
}
The new BuffOut class (it's dumb, I know):
public class BuffOut {
private StringBuilder sb;
private Out<String> dst;
public BuffOut(Out<String> dst, int bufSize) {
this.dst = dst;
this.sb = new StringBuilder(bufSize);
}
public void write(String data) {
if ((sb.length() + data.length()) > sb.capacity()) {
dst.write(sb.toString());
sb.setLength(0);
}
sb.append(data);
}
public void close() {
if (sb.length() > 0)
dst.write(sb.toString());
dst.close();
}
}
This implementation has a 3-second download time and a 136 KB download size:
public class ResultStreamer {
public void stream(Out<String> out) {
for (int i = 0; i < 10000; i++) {
out.write("TEST" + i + "\n");
}
out.close();
}
}
This implementation has a 65 ms download time and an 87.5 KB download size:
public class ResultStreamer {
public void stream(Out<String> out) {
BuffOut out2 = new BuffOut(out, 1000);
for (int i = 0; i < 10000; i++) {
out2.write("TEST" + i + "\n");
}
out2.close();
}
}
