I have a dataframe (Dataset<Row>) with six columns. Four of them need to be grouped, and the grouped combination may repeat n times depending on the varying values in the other two columns.
My dataset looks like this:
id | batch | batch_Id | session_name | time | value
001| abc | 098 | course-I | 1551409926133 | 2.3
001| abc | 098 | course-I | 1551404747843 | 7.3
001| abc | 098 | course-I | 1551409934220 | 6.3
I tried something like below:
Dataset<Row> df2 = df.select("*")
.groupBy(col("id"), col("batch_Id"), col("session_name"))
.agg(max("time"));
I added agg to get the grouped output, but I don't know how to achieve what I need.
Help much appreciated... Thank you.
I don't think you were too far off.
Given your first dataset:
+---+-----+--------+------------+-------------+-----+
| id|batch|batch_Id|session_name| time|value|
+---+-----+--------+------------+-------------+-----+
|001| abc| 098| course-I|1551409926133| 2.3|
|001| abc| 098| course-I|1551404747843| 7.3|
|001| abc| 098| course-I|1551409934220| 6.3|
|002| def| 097| course-II|1551409926453| 2.3|
|002| def| 097| course-II|1551404747843| 7.3|
|002| def| 097| course-II|1551409934220| 6.3|
+---+-----+--------+------------+-------------+-----+
And assuming your desired output is:
+---+--------+------------+-------------+
| id|batch_Id|session_name| max(time)|
+---+--------+------------+-------------+
|002| 097| course-II|1551409934220|
|001| 098| course-I|1551409934220|
+---+--------+------------+-------------+
I would write the following code for the aggregation:
Dataset<Row> maxValuesDf = rawDf.select("*")
.groupBy(col("id"), col("batch_id"), col("session_name"))
.agg(max("time"));
And the whole app would look like:
package net.jgp.books.spark.ch13.lab900_max_value;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class MaxValueAggregationApp {
/**
* main() is your entry point to the application.
*
* @param args
*/
public static void main(String[] args) {
MaxValueAggregationApp app = new MaxValueAggregationApp();
app.start();
}
/**
* The processing code.
*/
private void start() {
// Creates a session on a local master
SparkSession spark = SparkSession.builder()
.appName("Aggregates max values")
.master("local[*]")
.getOrCreate();
// Reads a pipe-separated CSV file with a header, called courses.csv,
// and stores it in a dataframe
Dataset<Row> rawDf = spark.read().format("csv")
.option("header", true)
.option("sep", "|")
.load("data/misc/courses.csv");
// Shows at most 20 rows from the dataframe
rawDf.show(20);
// Performs the aggregation, grouping on columns id, batch_id, and
// session_name
Dataset<Row> maxValuesDf = rawDf.select("*")
.groupBy(col("id"), col("batch_id"), col("session_name"))
.agg(max("time"));
maxValuesDf.show(5);
}
}
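If the batch column (the fourth column you mention grouping on) also needs to survive the aggregation, one option is to add it to the grouping key. This is only a sketch of my assumption about the desired output:
// Assumption: batch is part of the grouping key as well; max("time") stays the aggregate.
Dataset<Row> maxValuesDf = rawDf
    .groupBy(col("id"), col("batch"), col("batch_id"), col("session_name"))
    .agg(max("time"));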
Does it help?
Related
Is there any way to connect to Hive from Spark without using "hive-site.xml"?
SparkLauncher sl = new SparkLauncher(evnProps);
sl.addSparkArg("--verbose");
sl.addAppArgs(appArgs);
sl.addFile(evnProps.get(KEY_YARN_CONF_DIR) + "/hive-site.xml");
We are passing "hive-site.xml" to SparkLauncher. I want to remove the dependency on "hive-site.xml".
Spark SQL supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse. You may need to grant write privilege to the user who starts the Spark application.
import java.io.File;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public static class Record implements Serializable {
private int key;
private String value;
public int getKey() {
return key;
}
public void setKey(int key) {
this.key = key;
}
public String getValue() {
return value;
}
public void setValue(String value) {
this.value = value;
}
}
// warehouseLocation points to the default location for managed databases and tables
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
.builder()
.appName("Java Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate();
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive");
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src");
// Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show();
// +---+-------+
// |key| value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...
// Aggregation queries are also supported.
spark.sql("SELECT COUNT(*) FROM src").show();
// +--------+
// |count(1)|
// +--------+
// | 500 |
// +--------+
// The results of SQL queries are themselves DataFrames and support all normal functions.
Dataset<Row> sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key");
// The items in DataFrames are of type Row, which lets you access each column by ordinal.
Dataset<String> stringsDS = sqlDF.map(
(MapFunction<Row, String>) row -> "Key: " + row.get(0) + ", Value: " + row.get(1),
Encoders.STRING());
stringsDS.show();
// +--------------------+
// | value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...
// You can also use DataFrames to create temporary views within a SparkSession.
List<Record> records = new ArrayList<>();
for (int key = 1; key < 100; key++) {
Record record = new Record();
record.setKey(key);
record.setValue("val_" + key);
records.add(record);
}
Dataset<Row> recordsDF = spark.createDataFrame(records, Record.class);
recordsDF.createOrReplaceTempView("records");
// Queries can then join DataFrames data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show();
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// | 2| val_2| 2| val_2|
// | 2| val_2| 2| val_2|
// | 4| val_4| 4| val_4|
// ...
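If the goal is to avoid shipping a hive-site.xml at all, the metastore location can also be passed programmatically when building the session. This is only a sketch under the assumption of a remote thrift metastore; the host, port, and warehouse path are placeholders, not values from your environment:
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  // Assumption: a remote Hive metastore reachable over thrift; replace host and port.
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  // Takes the place of the (deprecated) hive.metastore.warehouse.dir setting.
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate();
If neither hive-site.xml nor hive.metastore.uris is provided, Spark falls back to the local metastore_db behavior described above.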
I am trying to split the Dataset into different Datasets based on the contents of the Manufacturer column. It is very slow. Please suggest a way to improve the code so that it executes faster and uses less Java code.
List<Row> lsts= countsByAge.collectAsList();
for(Row lst:lsts) {
String man = lst.toString();
man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
DF.show();
}
The Code, Input and Output Datasets are as shown below.
package org.sparkexample;
import org.apache.parquet.filter2.predicate.Operators.Column;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.RelationalGroupedDataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
public class GroupBy {
public static void main(String[] args) {
System.setProperty("hadoop.home.dir", "C:\\winutils");
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder().appName("split datasets").getOrCreate();
sc.setLogLevel("ERROR");
Dataset<Row> src= sqlContext.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.load("sample.csv");
Dataset<Row> unq_manf=src.select("Manufacturer").distinct();
List<Row> lsts= unq_manf.collectAsList();
for(Row lst:lsts) {
String man = lst.toString();
man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
DF.show();
}
}
}
Input Table
+------+------------+--------------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+--------------------+---+
| 804| ael|Brush & Broom Han...|123|
| 805| ael|Wheel Brush Parts...|124|
| 813| ael| Drivers Gloves|125|
| 632| west| Pipe Wrenches|126|
| 804| bil| Masonry Brushes|127|
| 497| west| Power Tools Other|128|
| 496| west| Power Tools Other|129|
| 495| bil| Hole Saws|130|
| 499| bil| Battery Chargers|131|
| 497| west| Power Tools Other|132|
+------+------------+--------------------+---+
Output
+------------+
|Manufacturer|
+------------+
| ael|
| west|
| bil|
+------------+
+------+------------+--------------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+--------------------+---+
| 804| ael|Brush & Broom Han...|123|
| 805| ael|Wheel Brush Parts...|124|
| 813| ael| Drivers Gloves|125|
+------+------------+--------------------+---+
+------+------------+-----------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+-----------------+---+
| 632| west| Pipe Wrenches|126|
| 497| west|Power Tools Other|128|
| 496| west|Power Tools Other|129|
| 497| west|Power Tools Other|132|
+------+------------+-----------------+---+
+------+------------+----------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+----------------+---+
| 804| bil| Masonry Brushes|127|
| 495| bil| Hole Saws|130|
| 499| bil|Battery Chargers|131|
+------+------------+----------------+---+
You have two choices in this case.
First, you can collect the unique manufacturer values and then map over the resulting array:
val df = Seq(("HP", 1), ("Brother", 2), ("Canon", 3), ("HP", 5)).toDF("k", "v")
val brands = df.select("k").distinct.collect.flatMap(_.toSeq)
val BrandArray = brands.map(brand => df.where($"k" <=> brand))
BrandArray.foreach { x =>
x.show()
println("---------------------------------------")
}
You can also save the data frame based on manufacturer.
df.write.partitionBy("k").format("parquet").saveAsTable("brands")
Instead of splitting the dataset/dataframe by manufacturer, it might be optimal to write the dataframe using manufacturer as the partition key, if you need to query based on manufacturer frequently.
In case you still want separate dataframes based on one of the column values, one approach using PySpark and Spark 2.0+ could be:
from pyspark.sql import functions as F
df = spark.read.csv("sample.csv",header=True)
# collect list of manufacturers
manufacturers = df.select('manufacturer').distinct().collect()
# loop through manufacturers to filter df by manufacturers and write it separately
for m in manufacturers:
    df1 = df.where(F.col('manufacturer') == m[0])
    # optionally repartition before writing: df1 = df1.repartition(repartition_col)
    df1.write.parquet(<write_path>)  # pass a save mode via .mode(<write_mode>) if needed
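Since the question itself is in Java, here is a rough Java sketch of the partition-key suggestion; the output path is a placeholder, and src and spark refer to the variables from the question's code:
// Write once, partitioned by Manufacturer, instead of filtering per value.
src.write()
   .mode("overwrite")
   .partitionBy("Manufacturer")
   .parquet("output/items_by_manufacturer");   // placeholder path

// Later reads that filter on Manufacturer benefit from partition pruning.
Dataset<Row> ael = spark.read()
    .parquet("output/items_by_manufacturer")
    .filter("Manufacturer = 'ael'");
ael.show();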
I get some entries from a stream in the Linux terminal, assign them as lines, and break them into words. But instead of printing them out I want to save them to Cassandra.
I have a Keyspace named ks, with a table inside it named record.
I know that some code like CassandraStreamingJavaUtil.javaFunctions(words).writerBuilder("ks", "record").saveToCassandra(); has to do the job, but I guess I am doing something wrong. Can someone help?
Here is my Cassandra ks.record schema (I added these data through CQLSH)
id | birth_date | name
----+---------------------------------+-----------
10 | 1987-12-01 23:00:00.000000+0000 | Catherine
11 | 2004-09-07 22:00:00.000000+0000 | Isadora
1 | 2016-05-10 13:00:04.452000+0000 | John
2 | 2016-05-10 13:00:04.452000+0000 | Troy
12 | 1970-10-01 23:00:00.000000+0000 | Anna
3 | 2016-05-10 13:00:04.452000+0000 | Andrew
Here is my Java code:
import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
import java.util.Arrays;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.*;
public class CassandraStreaming2 {
public static void main(String[] args) {
// Create a local StreamingContext with two working threads and a batch interval of 1 second
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("CassandraStreaming");
JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(1));
// Create a DStream that will connect to hostname:port, like localhost:9999
JavaReceiverInputDStream<String> lines = sc.socketTextStream("localhost", 9999);
// Split each line into words
JavaDStream<String> words = lines.flatMap(
(FlatMapFunction<String, String>) x -> Arrays.asList(x.split(" "))
);
words.print();
//CassandraStreamingJavaUtil.javaFunctions(words).writerBuilder("ks", "record").saveToCassandra();
sc.start(); // Start the computation
sc.awaitTermination(); // Wait for the computation to terminate
}
}
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md#saving-data-to-cassandra
As per the docs, you also need to pass a RowWriter factory. The most common way to do this is to use the mapToRow(Class) API; this is the missing parameter described there.
But you have an additional problem: your code doesn't yet specify the data in a way that can be written to C*. You have a JavaDStream of only Strings, and a single String cannot be made into a Cassandra Row for your given schema.
Basically you are telling the connector
Write "hello" to CassandraTable (id, birthday, value)
Without telling it where the hello goes (what should the id be? what should the birthday be?)
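Purely as an illustration, a sketch of what that could look like: the WordRecord bean and the way id and birth_date are filled are my assumptions, since only you know how a word should map onto the ks.record schema:
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;
import org.apache.spark.api.java.function.Function;
import java.io.Serializable;
import java.util.Date;

// Hypothetical bean whose properties line up with ks.record (id, birth_date, name).
public class WordRecord implements Serializable {
    private int id;
    private Date birthDate;   // the connector maps birthDate to the birth_date column
    private String name;

    public WordRecord() {}
    public WordRecord(int id, Date birthDate, String name) {
        this.id = id;
        this.birthDate = birthDate;
        this.name = name;
    }
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public Date getBirthDate() { return birthDate; }
    public void setBirthDate(Date birthDate) { this.birthDate = birthDate; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

// In the streaming job: decide what id and birth_date mean for a word (assumed here),
// then hand the connector a RowWriterFactory via mapToRow(...).
JavaDStream<WordRecord> records = words.map(
        (Function<String, WordRecord>) word -> new WordRecord(word.hashCode(), new Date(), word));
CassandraStreamingJavaUtil.javaFunctions(records)
        .writerBuilder("ks", "record", mapToRow(WordRecord.class))
        .saveToCassandra();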
I want to save translation values in a class. Because it's so convenient, Java's Locale implementation seems like the correct key for the mapping. The problem is: if I just use HashMap<Locale, String> translations = ...; for the translations, my code will not be able to fall back when a specific locale is not available.
How can I achieve a good data structure for storing translations of an object?
Note that these translations are not translations of program elements like a user interface; imagine the class being a dictionary entry, so each instance has its own set of translations, which differ every time.
Here is an example of what the problem with a HashMap would be:
import java.util.HashMap;
import java.util.Locale;
public class Example
{
private final HashMap<Locale, String> translationsMap = new HashMap<>();
/*
* +------------------------+-------------------+-------------------+
* | Input | Expected output | Actual output |
* +------------------------+-------------------+-------------------+
* | new Locale("en") | "enTranslation" | "enTranslation" |
* | new Locale("en", "CA") | "enTranslation" | null | <-- Did not fall back
* | new Locale("de") | "deTranslation" | "deTranslation" |
* | new Locale("de", "DE") | "deTranslation" | null | <-- Did not fall back
* | new Locale("de", "AT") | "deATTranslation" | "deATTranslation" |
* | new Locale("fr") | "frTranslation" | "frTranslation" |
* | new Locale("fr", "CA") | "frTranslation" | null | <-- Did not fall back
* +------------------------+-------------------+-------------------+
*/
public String getTranslation(Locale locale)
{
return translationsMap.get(locale);
}
public void addTranslation(Locale locale, String translation)
{
translationsMap.put(locale, translation);
}
// instance initializer block
{
addTranslation(new Locale("en"), "enTranslation");
addTranslation(new Locale("de"), "deTranslation");
addTranslation(new Locale("fr"), "frTranslation");
addTranslation(new Locale("de", "AT"), "deATTranslation");
}
}
This is a little bit hacky, but it works. Using a ResourceBundle.Control, it's possible to use a standard implementation for fallbacks.
private Map<Locale, String> translations = new HashMap<>();
/** static: this instance is not modified or bound, it can be reused for multiple instances */
private static final ResourceBundle.Control CONTROL = ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_PROPERTIES);
@Nullable
public String getTranslation(@NotNull Locale locale)
{
List<Locale> localeCandidates = CONTROL.getCandidateLocales("_dummy_", locale); // Sun's implementation discards the string argument
for (Locale currentCandidate : localeCandidates)
{
String translation = translations.get(currentCandidate);
if (translation != null)
return translation;
}
return null;
}
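A quick usage sketch, assuming the snippet above is dropped into the Example class from the question:
Example entry = new Example();
// en_CA is not stored, but the candidate chain (en_CA -> en -> root) finds the plain "en" entry.
System.out.println(entry.getTranslation(new Locale("en", "CA")));  // enTranslation
System.out.println(entry.getTranslation(new Locale("de", "AT")));  // deATTranslation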
Have your classes extend ListResourceBundle.
See here: https://docs.oracle.com/javase/tutorial/i18n/resbundle/list.html
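For reference, a minimal sketch (class names and the key are illustrative, each class in its own file); ResourceBundle.getBundle then applies the standard fallback chain for you:
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

// Root bundle, used when no more specific bundle matches.
public class Translations extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        return new Object[][] { { "greeting", "enTranslation" } };
    }
}

// German bundle, named Translations_de by convention (own file).
public class Translations_de extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        return new Object[][] { { "greeting", "deTranslation" } };
    }
}

// Lookup: de_DE has no bundle of its own, so it falls back to Translations_de.
String s = ResourceBundle.getBundle("Translations", new Locale("de", "DE"))
        .getString("greeting");   // "deTranslation"

The drawback compared to the question's setup is that the translations live in code per locale rather than per dictionary entry, so this fits best when the set of translations is known at compile time.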
I'm using GNU NNTP to connect to leafnode, which is an NNTP server, on localhost. The GNU API utilizes javax.mail.Message, which comes with the following caveat:
From the Message API:
..the message number for a particular Message can change during a
session if other messages in the Folder are deleted and expunged.
So, currently, I'm using javax.mail.search to search for a known message. Unfortunately, for each search the entire folder has to be searched. I could keep the folder open and in that way speed the search a bit, but it just seems clunky.
What's an alternate approach to using javax.mail.search? This:
SearchTerm st = new MessageIDTerm(id);
List<Message> messages = Arrays.asList(folder.search(st));
works fine when the javax.mail.Folder only has a few Messages. However, for very large Folders there must be a better approach. Instead of the Message-ID header field, Xref might be preferable, but it still has the same fundamental problem of searching strings.
Here's the database, which just needs to hold enough information to find/get/search the Folders for a specified message:
mysql>
mysql> use usenet;show tables;
Database changed
+------------------+
| Tables_in_usenet |
+------------------+
| articles |
| newsgroups |
+------------------+
2 rows in set (0.00 sec)
mysql>
mysql> describe articles;
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| ID | bigint(20) | NO | PRI | NULL | auto_increment |
| MESSAGEID | varchar(255) | YES | | NULL | |
| NEWSGROUP_ID | bigint(20) | YES | MUL | NULL | |
+--------------+--------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)
mysql>
mysql> describe newsgroups;
+-----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+----------------+
| ID | bigint(20) | NO | PRI | NULL | auto_increment |
| NEWSGROUP | varchar(255) | YES | | NULL | |
+-----------+--------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
mysql>
While the schema is very simple at the moment, I plan to add complexity to it.
Messages are queried for with getMessage():
package net.bounceme.dur.usenet.model;
import java.util.*;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.mail.*;
import javax.mail.search.MessageIDTerm;
import javax.mail.search.SearchTerm;
import net.bounceme.dur.usenet.controller.Page;
public enum Usenet {
INSTANCE;
private final Logger LOG = Logger.getLogger(Usenet.class.getName());
private Properties props = new Properties();
private Folder root = null;
private Store store = null;
private List<Folder> folders = new ArrayList<>();
private Folder folder = null;
Usenet() {
LOG.fine("controller..");
props = PropertiesReader.getProps();
try {
connect();
} catch (Exception ex) {
Logger.getLogger(Usenet.class.getName()).log(Level.SEVERE, "FAILED TO LOAD MESSAGES", ex);
}
}
public void connect() throws Exception {
LOG.fine("Usenet.connect..");
Session session = Session.getDefaultInstance(props);
session.setDebug(true);
store = session.getStore(new URLName(props.getProperty("nntp.host")));
store.connect();
root = store.getDefaultFolder();
setFolders(Arrays.asList(root.listSubscribed()));
}
public List<Message> getMessages(Page page) throws Exception {
Newsgroup newsgroup = new Newsgroup(page);
LOG.fine("fetching.." + newsgroup);
folder = root.getFolder(newsgroup.getNewsgroup());
folder.open(Folder.READ_ONLY);
List<Message> messages = Arrays.asList(folder.getMessages());
LOG.fine("..fetched " + folder);
return Collections.unmodifiableList(messages);
}
public List<Folder> getFolders() {
LOG.fine("folders " + folders);
return Collections.unmodifiableList(folders);
}
private void setFolders(List<Folder> folders) {
this.folders = folders;
}
public Message getMessage(Newsgroup newsgroup, Article article) throws MessagingException {
LOG.fine("\n\ntrying.." + newsgroup + article);
String id = article.getMessageId();
Message message = null;
folder = root.getFolder(newsgroup.getNewsgroup());
folder.open(Folder.READ_ONLY);
SearchTerm st = new MessageIDTerm(id);
List<Message> messages = Arrays.asList(folder.search(st));
LOG.severe(messages.toString());
if (!messages.isEmpty()) {
message = messages.get(0);
}
LOG.info(message.getSubject());
return message;
}
}
The problem, which I'm only now realizing, is that:
...the message number for a particular Message can change during a session if other messages in the Folder are deleted and expunged.
Regardless of which particular header is used, it's something like:
Message-ID: <x1-CZwog1NTZLd68+JJY35Zrl9OqXE@gwene.org>
or
Xref: dur.bounceme.net gwene.com.economist:541
So that there's always a String which needs parsing and searching, which is quite awkward.
I do notice that MimeMessage has a very convenient getMessageID method. Unfortunately, GNU uses javax.mail.Message and not MimeMessage. Granted, it's possible to instantiate a folder and a MimeMessage, but I don't see any savings there, since from one run to another there's no guarantee that getMessageID will return the correct message.
The awkward solution I see is to maybe create a persistent folder of MimeMessages, but that seems like overkill.
Hence, using a header, either Xref or Message-ID and then parsing and searching strings...
Is there a better way?
javax.mail is a lowest-common-denominator API, and its behavior depends entirely on what the backend is. So, without knowing what you are talking to, it's not really possible to give a good answer to your question. Chances are, however, that you'll need to talk directly to whatever you're talking to and learn more about its behavior.
This might be a comment rather than an answer, but I'm thinking that pointing out that this API is just a thin layer might be enough to justify posting it as one.