I am reading data from S3 in Parquet format and then process it as a DataFrame.
The question is how to efficiently iterate over the rows of the DataFrame. I know that collect loads the data into memory, so even though my DataFrame is not big, I would prefer to avoid loading the complete data set into memory. How can I optimize the code below?
Also, I am using indices to access the columns of the DataFrame. Can I access them by column name instead (I know the names)?
DataFrame parquetFile = sqlContext.read().parquet("s3n://"+this.aws_bucket+"/"+this.aws_key_members);
parquetFile.registerTempTable("mydata");
DataFrame eventsRaw = sqlContext.sql("SELECT * FROM mydata");
Row[] rddRows = eventsRaw.collect();
for (int rowIdx = 0; rowIdx < rddRows.length; ++rowIdx)
{
    Map<String, String> props = new HashMap<>();
    props.put("field1", rddRows[rowIdx].get(0).toString());
    props.put("field2", rddRows[rowIdx].get(1).toString());
    // further processing
}
You can use Spark's map function.
It lets you iterate over the whole DataFrame without collecting the Dataset/DataFrame:
Dataset<Row> namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19");
Dataset<String> namesDS = namesDF.map(
    (MapFunction<Row, String>) row -> "Name:" + row.getString(0),
    Encoders.STRING());
namesDS.show();
If the operations you are doing are more complex, you can factor them out into a map function:
// Map function
Row doSomething(Row row) {
    // get a column (COLUMN is a placeholder for your column name)
    String field = row.getAs(COLUMN);
    // construct a new row and add all the existing/modified columns to it
    return row;
}
This map function can then be passed to the DataFrame's map:
StructType structType = namesDF.schema();
Dataset<Row> result = namesDF.map(
    (MapFunction<Row, Row>) row -> doSomething(row),
    RowEncoder.apply(structType));
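Applied to the question's code, a rough sketch of this approach that avoids collect and reads columns by name with getAs (assuming Spark 2.x as above and that the Parquet columns are literally named field1 and field2; adjust to your schema):
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

Dataset<Row> eventsRaw = spark.sql("SELECT * FROM mydata");
Dataset<String> processed = eventsRaw.map((MapFunction<Row, String>) row -> {
    Map<String, String> props = new HashMap<>();
    // access columns by name instead of by index
    props.put("field1", row.getAs("field1").toString());
    props.put("field2", row.getAs("field2").toString());
    // further processing; return whatever per-row result you need
    return props.toString();
}, Encoders.STRING());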
Source: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
I am reading a txt file as a JavaRDD with the following command:
JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);
Now, I would like to convert this to a JavaRDD<Row>, because the txt file has two columns of integers and I want to add a schema to the rows after splitting the columns.
I tried also this:
JavaRDD<Row> rows = vertexRDD.map(line -> line.split("\t"))
But it says I cannot assign the result of map to an "Object" RDD.
How can I create a JavaRDD<Row> out of a JavaRDD<String>?
How can I apply map to the JavaRDD?
Thanks!
Creating a JavaRDD out of another is implicit when you apply a transformation such as map. Here, the RDD you create is an RDD of arrays of strings (the result of split).
To get an RDD of rows, just create a Row from each array:
JavaRDD<String> vertexRDD = ctx.textFile("");
JavaRDD<String[]> rddOfArrays = vertexRDD.map(line -> line.split("\t"));
JavaRDD<Row> rddOfRows = rddOfArrays.map(fields -> RowFactory.create(fields));
Note that if your goal is then to transform the JavaRDD<Row> to a dataframe (Dataset<Row>), there is a simpler way. You can change the delimiter option when using spark.read to avoid having to use RDDs:
Dataset<Row> dataframe = spark.read()
.option("delimiter", "\t")
.csv("your_path/file.csv");
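If you want the two integer columns typed as integers rather than strings, you can also pass an explicit schema; a minimal sketch, using placeholder column names src and dst:
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// "src" and "dst" are placeholder names; rename them to match your data
StructType schema = new StructType()
    .add("src", DataTypes.IntegerType)
    .add("dst", DataTypes.IntegerType);

Dataset<Row> dataframe = spark.read()
    .option("delimiter", "\t")
    .schema(schema)
    .csv("your_path/file.csv");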
You can define these two columns as fields of a class, and then use:
JavaRDD<Row> rows = rdd.map(new Function<ClassName, Row>() {
    @Override
    public Row call(ClassName target) throws Exception {
        return RowFactory.create(
            target.getField1(),
            target.getUsername());
    }
});
Then create the StructFields for the schema, and finally build the DataFrame:
StructType struct = DataTypes.createStructType(fields);
Dataset<Row> dataFrame = sparkSession.createDataFrame(rows, struct);
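For completeness, a minimal sketch of how the fields list used above might be built, assuming one integer and one string column whose names mirror the getters (adjust names and types to your class):
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;

// assumed column names/types matching getField1()/getUsername() above
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("field1", DataTypes.IntegerType, true));
fields.add(DataTypes.createStructField("username", DataTypes.StringType, true));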
I want to read a table from Hive and write it to a Kafka producer (batch job).
Currently, I am reading the table as a Dataset<Row> in my Java class and trying to convert it to JSON so that I can write JSON messages using KafkaProducer.
Dataset<Row> data = spark.sql("select * from tablename limit 5");
List<Row> rows = data.collectAsList();
for (Row row : rows) {
    List<String> stringList = new ArrayList<String>(Arrays.asList(row.schema().fieldNames()));
    Seq<String> row_seq = JavaConverters.asScalaIteratorConverter(stringList.iterator()).asScala().toSeq();
    Map map = (Map) row.getValuesMap(row_seq);
    JSONObject json = new JSONObject();
    json.putAll(map);
    ProducerRecord<String, String> record = new ProducerRecord<String, String>(SPARK_CONF.get("topic.name"), json.toString());
    producer.send(record);
}
I am getting a ClassCastException.
As soon as you call collectAsList(), you are no longer using Spark, just the raw Kafka Java API.
My suggestion would be to use the Spark Structured Streaming Kafka integration, which can write a Dataset to Kafka directly.
Here is an example; you need to form a DataFrame with a value column (and optionally a key), because Kafka messages are key/value pairs.
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
data.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .write()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("topic", "topic_name")
    .save();
As far as getting the data into JSON: again, collectAsList() is the wrong approach. Do not pull the data onto a single node.
You can use data.map() to convert a Dataset from one format into another.
For example, you could map each Row into a String in JSON format:
row -> "{\"f0\":" + row.get(0) + "}"
I need to port my Spark 1.6.2 code to Spark 2.2.0 in Java.
DataFrame eventsRaw = sqlContext.sql("SELECT * FROM my_data");
Row[] rddRows = eventsRaw.collect();
for (int rowIdx = 0; rowIdx < rddRows.length; ++rowIdx)
{
    Map<String, String> myProperties = new HashMap<>();
    myProperties.put("startdate", rddRows[rowIdx].get(1).toString());
    JEDIS.hmset("PK:" + rddRows[rowIdx].get(0).toString(), myProperties); // JEDIS is a Redis client for Java
}
As far as I understand, there is no DataFrame in Spark 2.2.0 for Java. Only Dataset. However, if I substitute DataFrame with Dataset, then I get Object[] instead of Row[] as output of eventsRaw.collect(). Then get(1) is marked in red and I cannot compile the code.
How can I correctly do it?
DataFrame (Scala) is just an alias for Dataset<Row>, so in Java use:
SparkSession spark;
...
Dataset<Row> eventsRaw = spark.sql("SELECT * FROM my_data");
but instead of collect you should rather use foreach (with a lazily initialized singleton connection):
eventsRaw.foreach(
(ForeachFunction<Row>) row -> ... // replace ... with appropriate logic
);
or foreachPartition (initialize connection for each partition):
eventsRaw.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    ... // replace ... with appropriate logic
});
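For example, a rough sketch of the Redis case with foreachPartition, assuming one Jedis connection is opened per partition (host and port are placeholders):
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Row;
import redis.clients.jedis.Jedis;

eventsRaw.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    // one connection per partition, created on the executor
    try (Jedis jedis = new Jedis("redis-host", 6379)) {
        while (rows.hasNext()) {
            Row row = rows.next();
            Map<String, String> myProperties = new HashMap<>();
            myProperties.put("startdate", row.get(1).toString());
            jedis.hmset("PK:" + row.get(0).toString(), myProperties);
        }
    }
});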
Is it possible to generate a histogram from a Dataset<Row> with Spark 2.1 in Java?
Convert the Dataset into a JavaRDD of a numeric type (Integer, Double, etc.) using toJavaRDD().map().
Then convert that JavaRDD to a JavaDoubleRDD using mapToDouble.
Finally, apply histogram(int bucketCount) to get the histogram of the data.
Example:
I have a table in Spark named 'nation' with an integer column 'n_nationkey'; this is how I did it:
String query = "select n_nationkey from nation";
Dataset<Row> df = spark.sql(query);
JavaRDD<Integer> jdf = df.toJavaRDD().map(row -> row.getInt(0));
JavaDoubleRDD example = jdf.mapToDouble(y -> y);
Tuple2<double[], long[]> resultsnew = example.histogram(5);
In case the column has a double type, you simply replace a few things:
JavaRDD<Double> jdf = df.toJavaRDD().map(row -> row.getDouble(0));
JavaDoubleRDD example = jdf.mapToDouble(y -> y);
Tuple2<double[], long[]> resultsnew = example.histogram(5);
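The returned Tuple2 holds the bucket boundaries and the count per bucket; a small sketch of how you might print them:
double[] buckets = resultsnew._1();
long[] counts = resultsnew._2();
for (int i = 0; i < counts.length; i++) {
    // each bucket spans [buckets[i], buckets[i + 1]) and contains counts[i] values
    System.out.println("[" + buckets[i] + ", " + buckets[i + 1] + ") -> " + counts[i]);
}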
I have a question about inserting double/float data into Cassandra with Hector:
new Double("13.45") -------> 13.468259733915328
new Float("64.13") -------> 119.87449
This happens when I insert data into Cassandra with Hector:
TestDouble ch = new TestDouble("talend_bj",
"localhost:9160");
String family = "talend_1";
ch.ensureColumnFamily(family);
List values = new ArrayList();
values.add(HFactory.createColumn("id", 2, StringSerializer.get(),
IntegerSerializer.get()));
values.add(HFactory.createColumn("name", "zhang",
StringSerializer.get(), StringSerializer.get()));
values.add(HFactory.createColumn("salary", 13.45,
StringSerializer.get(), DoubleSerializer.get()));
ch.insertSuper("14", values, "user1", family, StringSerializer.get(),
StringSerializer.get());
StringSerializer se = StringSerializer.get();
MultigetSuperSliceQuery<String, String, String, String> q = me.prettyprint.hector.api.factory.HFactory
.createMultigetSuperSliceQuery(ch.getKeyspace(), se, se, se, se);
// q.setSuperColumn("user1").setColumnNames("id","name")
q.setKeys("12", "11","13", "14");
q.setColumnFamily(family);
q.setRange("z", "z", false, 100);
QueryResult<SuperRows<String, String, String, String>> r = q
.setColumnNames("user1", "user").execute();
Iterator iter = r.get().iterator();
while (iter.hasNext()) {
    SuperRow superRow = (SuperRow) iter.next();
    SuperSlice s = superRow.getSuperSlice();
    List<HSuperColumn> superColumns = s.getSuperColumns();
    for (HSuperColumn superColumn : superColumns) {
        List<HColumn> columns = superColumn.getColumns();
        System.out.println(DoubleSerializer.get().fromBytes(((String) superColumn.getSubColumnByName("salary").getValue()).getBytes()));
    }
}
I insert 13.45, but the column value I read back is 13.468259733915328.
You should break the problem in two. After writing, if you defined part of your schema or use the ASSUME keyword on the command-line CLI, view the data in Cassandra to see if it is correct. PlayOrm has this exact unit test (PlayOrm is on top of Astyanax, not Hector) and it works just fine; notice the comparison of -200.23 in the test:
https://github.com/deanhiller/playorm/blob/master/input/javasrc/com/alvazan/test/TestColumnSlice.java
Once done, does your data in Cassandra look correct? If so, the issue is in how you read the value back in code; otherwise, it is in the writes.
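On the reading side, note that the code above deserializes the salary value as a String and then re-encodes it with getBytes() before handing it to DoubleSerializer, which will not round-trip the original 8 bytes. A rough sketch of reading the raw bytes instead, assuming the sub-column was really written with DoubleSerializer (treat the exact accessor names as an assumption about the Hector API):
// Assumption: the value was written with DoubleSerializer, so take the raw
// bytes of the sub-column and deserialize them directly as a double.
HColumn salaryColumn = superColumn.getSubColumnByName("salary");
Double salary = DoubleSerializer.get().fromByteBuffer(salaryColumn.getValueBytes());
System.out.println(salary);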