I want to read a table from Hive and write it to a Kafka producer (batch job).
Currently, I am reading the table as a Dataset<Row> in my Java class and trying to convert it to JSON so that I can write it as a JSON message using KafkaProducer.
Dataset<Row> data = spark.sql("select * from tablename limit 5");
List<Row> rows = data.collectAsList();
for(Row row: rows) {
List<String> stringList = new ArrayList<String>(Arrays.asList(row.schema().fieldNames()));
Seq<String> row_seq = JavaConverters.asScalaIteratorConverter(stringList.iterator()).asScala().toSeq();
Map map = (Map) row.getValuesMap(row_seq);
JSONObject json = new JSONObject();
json.putAll( map);
ProducerRecord<String, String> record = new ProducerRecord<String, String>(SPARK_CONF.get("topic.name"), json.toString());
producer.send(record);
}
I am getting a ClassCastException.
As soon as you call collectAsList(), you are no longer using Spark, just the raw Kafka Java API.
My suggestion would be to use the Spark Structured Streaming Kafka integration instead.
Here is an example; note that you need to form a DataFrame with a value column (and optionally a key column), because Kafka records consist of keys and values.
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
data.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .write()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("topic", "topic_name")
    .save();
As for getting the data into JSON: again, collectAsList() is the wrong approach. Do not pull the data onto a single node.
You can use data.map() to convert a Dataset from one format into another.
For example, you would map each Row into a String in JSON format:
row -> "{\"f0\":" + row.get(0) + "}"
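If you would rather not build the JSON strings by hand, a minimal batch sketch could look like this (it assumes the spark-sql-kafka-0-10 package is on the classpath; the broker list and topic name are placeholders):
import static org.apache.spark.sql.functions.*;

// Serialize every row of the Hive table to a JSON string and write it as the
// Kafka record value; no collect() and no manual KafkaProducer needed.
Dataset<Row> data = spark.sql("select * from tablename limit 5");

data.select(to_json(struct(col("*"))).alias("value"))
    .write()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("topic", "topic_name")
    .save();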
Related
Is there a way to read only specific fields of a Kafka topic?
I have a topic, say person with a schema personSchema. The schema contains many fields such as id, name, address, contact, dateOfBirth.
I want to get only id, name and address. How can I do that?
Currently I'm reading the streams using Apache Beam and intend to write the data to BigQuery afterwards. I am trying to use Filter but cannot get it to work because of the Boolean return type.
Here's my code:
Pipeline pipeline = Pipeline.create();
PCollection<KV<String, Person>> kafkaStreams =
pipeline
.apply("read streams", dataIO.readStreams(topic))
.apply(Filter.by(new SerializableFunction<KV<String, Person>, Boolean>() {
    @Override
    public Boolean apply(KV<String, Person> input) {
        return input.getValue().get("address").equals(true);
    }
}));
where dataIO.readStreams is returning this:
return KafkaIO.<String, Person>read()
.withTopic(topic)
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializer(PersonAvroDeserializer.class)
.withConsumerConfigUpdates(consumer)
.withoutMetadata();
I would appreciate suggestions for a possible solution.
You can do this with ksqlDB, which also works directly with Kafka Connect, for which there is a sink connector for BigQuery:
CREATE STREAM MY_SOURCE WITH (KAFKA_TOPIC='person', VALUE_FORMAT='AVRO');
CREATE STREAM FILTERED_STREAM AS SELECT id, name, address FROM MY_SOURCE;
CREATE SINK CONNECTOR SINK_BQ_01 WITH (
'connector.class' = 'com.wepay.kafka.connect.bigquery.BigQuerySinkConnector',
'topics' = 'FILTERED_STREAM',
…
);
You can also do this by creating a new TableSchema yourself with only the required fields. Later, when you write to BigQuery, you can pass the newly created schema as an argument instead of the old one.
TableSchema schema = new TableSchema();
List<TableFieldSchema> tableFields = new ArrayList<TableFieldSchema>();
TableFieldSchema id =
new TableFieldSchema()
.setName("id")
.setType("STRING")
.setMode("NULLABLE");
tableFields.add(id);
schema.setFields(tableFields);
return schema;
I should also mention that if you are converting an Avro record to BigQuery's TableRow at some point, you may need to implement some checks there too.
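For example, a rough sketch of such a conversion step, assuming the Kafka value is deserialized into an Avro GenericRecord whose field names match personSchema (the helper name here is made up):
import org.apache.avro.generic.GenericRecord;
import com.google.api.services.bigquery.model.TableRow;

// Copy only the fields you want to land in BigQuery; the null checks keep
// NULLABLE columns working when a field is missing.
static TableRow toFilteredTableRow(GenericRecord person) {
    TableRow row = new TableRow();
    if (person.get("id") != null) {
        row.set("id", person.get("id").toString());
    }
    if (person.get("name") != null) {
        row.set("name", person.get("name").toString());
    }
    if (person.get("address") != null) {
        row.set("address", person.get("address").toString());
    }
    return row;
}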
I am reading a txt file as a JavaRDD with the following command:
JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);
Now, I would like to convert this to a JavaRDD<Row>, because in that txt file I have two columns of integers and want to add some schema to the rows after splitting the columns.
I tried also this:
JavaRDD<Row> rows = vertexRDD.map(line -> line.split("\t"))
But it says I cannot assign the map function to an "Object" RDD.
How can I create a JavaRDD<Row> out of a JavaRDD<String>?
How can I use map on the JavaRDD?
Thanks!
Creating a JavaRDD out of another is implicit when you apply a transformation such as map. Here, the RDD you create is an RDD of arrays of strings (the result of split).
To get an RDD of rows, just create a Row from each array:
JavaRDD<String> vertexRDD = ctx.textFile("");
JavaRDD<String[]> rddOfArrays = vertexRDD.map(line -> line.split("\t"));
JavaRDD<Row> rddOfRows = rddOfArrays.map(fields -> RowFactory.create(fields));
Note that if your goal is then to transform the JavaRDD<Row> to a dataframe (Dataset<Row>), there is a simpler way. You can change the delimiter option when using spark.read to avoid having to use RDDs:
Dataset<Row> dataframe = spark.read()
.option("delimiter", "\t")
.csv("your_path/file.csv");
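Since the question mentions the two columns are integers, you can optionally also let Spark infer the types and give the columns names ("src" and "dst" here are just placeholders):
Dataset<Row> dataframe = spark.read()
    .option("delimiter", "\t")
    .option("inferSchema", "true")
    .csv("your_path/file.csv")
    .toDF("src", "dst");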
You can also define these two columns as fields of a class, and then use:
JavaRDD<Row> rows = rdd.map(new Function<ClassName, Row>() {
    @Override
    public Row call(ClassName target) throws Exception {
        return RowFactory.create(
                target.getField1(),
                target.getUsername());
    }
});
Then create the StructFields for your columns (a sketch follows the code below), and finally build the DataFrame:
StructType struct = DataTypes.createStructType(fields);
Dataset<Row> dataFrame = sparkSession.createDataFrame(rows, struct);
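For completeness, the fields list used above could be built like this (the column names and types are assumptions based on the getters in the map function):
// uses org.apache.spark.sql.types.DataTypes / StructField and java.util.ArrayList
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("field1", DataTypes.IntegerType, true));
fields.add(DataTypes.createStructField("username", DataTypes.StringType, true));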
I need to port my Spark 1.6.2 code to Spark 2.2.0 in Java.
DataFrame eventsRaw = sqlContext.sql("SELECT * FROM my_data");
Row[] rddRows = eventsRaw.collect();
for (int rowIdx = 0; rowIdx < rddRows.length; ++rowIdx)
{
Map<String, String> myProperties = new HashMap<>();
myProperties.put("startdate", rddRows[rowIdx].get(1).toString());
JEDIS.hmset("PK:" + rddRows[rowIdx].get(0).toString(), myProperties); // JEDIS is a Redis client for Java
}
As far as I understand, there is no DataFrame in Spark 2.2.0 for Java. Only Dataset. However, if I substitute DataFrame with Dataset, then I get Object[] instead of Row[] as output of eventsRaw.collect(). Then get(1) is marked in red and I cannot compile the code.
How can I correctly do it?
DataFrame (Scala) is Dataset<Row>:
SparkSession spark;
...
Dataset<Row> eventsRaw = spark.sql("SELECT * FROM my_data");
but instead of collect you should rather use foreach (with a lazily initialized singleton connection):
eventsRaw.foreach(
(ForeachFunction<Row>) row -> ... // replace ... with appropriate logic
);
or foreachPartition (initialize the connection once per partition; a fuller sketch follows the snippet below):
eventsRaw.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    ... // replace ... with appropriate logic
});
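For reference, a fuller sketch of the foreachPartition variant, reusing the Redis/Jedis logic from the question (host and port are placeholders):
import redis.clients.jedis.Jedis;

eventsRaw.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    // one connection per partition, reused for all rows in that partition
    Jedis jedis = new Jedis("redis-host", 6379);
    try {
        while (rows.hasNext()) {
            Row row = rows.next();
            Map<String, String> myProperties = new HashMap<>();
            myProperties.put("startdate", row.get(1).toString());
            jedis.hmset("PK:" + row.get(0).toString(), myProperties);
        }
    } finally {
        jedis.close();
    }
});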
I have an RDD that I need to convert into a Dataset. I tried:
Dataset<Person> personDS = sqlContext.createDataset(personRDD, Encoders.bean(Person.class));
The above line throws the error:
cannot resolve method createDataset(org.apache.spark.api.java.JavaRDD<Main.Person>, org.apache.spark.sql.Encoder<T>)
However, I can convert to a Dataset after converting to a DataFrame first. The code below works:
Dataset<Row> personDF = sqlContext.createDataFrame(personRDD, Person.class);
Dataset<Person> personDS = personDF.as(Encoders.bean(Person.class));
.createDataset() accepts RDD<T>, not JavaRDD<T>. JavaRDD is a wrapper around RDD in order to make calls from Java code easier. It holds the RDD internally, which can be accessed using .rdd(). The following can create a Dataset:
Dataset<Person> personDS = sqlContext.createDataset(personRDD.rdd(), Encoders.bean(Person.class));
In Scala you can simply call .toDS() on your RDD (after importing spark.implicits._) and you will get a Dataset.
Let me know if it helps. Cheers.
In addition to the accepted answer, if you want to create a Dataset<Row> instead of a Dataset<Person> in Java, try this:
StructType yourStruct = ...; // create your own StructType based on the individual field types
Dataset<Row> personDS = sqlContext.createDataset(personRDD.rdd(), RowEncoder.apply(yourStruct));
StructType schema = new StructType()
.add("Id", DataTypes.StringType)
.add("Name", DataTypes.StringType)
.add("Country", DataTypes.StringType);
Dataset<Row> dataSet = sqlContext.createDataFrame(yourJavaRDD, schema);
Be careful with the schema variable; it is not always easy to predict which data type you need to use, and sometimes it's better to just use StringType for all columns.
I am reading data from S3 in the parquet format, and then I process this data as a DataFrame.
The question is: how do I efficiently iterate over the rows of a DataFrame? I know that the collect method loads the data into memory, so even though my DataFrame is not big, I would prefer to avoid loading the complete data set into memory. How could I optimize the given code?
Also, I am using indices to access columns in the DataFrame. Can I access them by column name instead (I know the names)?
DataFrame parquetFile = sqlContext.read().parquet("s3n://"+this.aws_bucket+"/"+this.aws_key_members);
parquetFile.registerTempTable("mydata");
DataFrame eventsRaw = sqlContext.sql("SELECT * FROM mydata");
Row[] rddRows = eventsRaw.collect();
for (int rowIdx = 0; rowIdx < rddRows.length; ++rowIdx)
{
Map<String, String> props = new HashMap<>();
props.put("field1", rddRows[rowIdx].get(0).toString());
props.put("field2", rddRows[rowIdx].get(1).toString());
// further processing
}
You can use the map function in Spark.
You can process every row without collecting the Dataset/DataFrame:
Dataset<Row> namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19");
Dataset<String> namesDS = namesDF.map((MapFunction<Row, String>) row -> "Name:" + row.getString(0), Encoders.STRING());
namesDS.show();
You can create a map function if the operations you are doing are complex:
// Map function
Row doSomething(Row row) {
    // get a column (COLUMN is the name of the column you need)
    String field = row.getAs(COLUMN);
    // construct and return a new row with the existing/modified column values
    return RowFactory.create(/* existing and modified column values */);
}
Now this function can be passed to the DataFrame's map method:
StructType structType = namesDF.schema();
namesDF.map((MapFunction<Row, Row>) row -> doSomething(row),
    RowEncoder.apply(structType));
Source: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
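As for accessing columns by name instead of by index: yes, inside map (or any per-row code) you can use row.getAs with the column name, for example (the names here are assumed, and the columns are assumed to hold strings):
String field1 = row.getAs("field1");                      // access by column name
String field2 = row.getString(row.fieldIndex("field2"));  // or resolve the index from the name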