Hi, I'm new to Apache Spark with Java.
Is this the correct way to do it or not?
The code works, but performance-wise it is very slow, and I don't know which approach is best for accessing the data on each loop iteration.
Dataset<Row> javaRDD = sparkSession.read().jdbc(dataBase_url, "sample", properties);
javaRDD.toDF().registerTempTable("sample");
Dataset<Row> Users = sparkSession.sql("SELECT DISTINCT FROM_USER FROM sample ");
List<Row> members = Users.collectAsList();
for (Row row : members) {
    Dataset<Row> userConversation = sparkSession.sql(
            "SELECT DESCRIPTION FROM sample WHERE FROM_USER = '" + row.getDecimal(0) + "'");
    userConversation.show();
}
Try creating a set with all the users and then executing a single query like
sparkSession.sql("SELECT DESCRIPTION FROM sample WHERE FROM_USER IN (" + usersSet + ")");
This way you execute only one query, so you pay the overhead of the DB connection only once.
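For illustration, a minimal Java sketch of that single-query idea (assuming FROM_USER is numeric, as your row.getDecimal(0) call suggests):

List<Row> members = sparkSession.sql("SELECT DISTINCT FROM_USER FROM sample").collectAsList();

// Build a comma-separated list of user ids for the IN clause.
String usersSet = members.stream()
        .map(r -> r.getDecimal(0).toPlainString())
        .collect(java.util.stream.Collectors.joining(", "));

// One query for all users instead of one query per user.
Dataset<Row> conversations = sparkSession.sql(
        "SELECT FROM_USER, DESCRIPTION FROM sample WHERE FROM_USER IN (" + usersSet + ")");
conversations.show();

Selecting FROM_USER alongside DESCRIPTION keeps the per-user grouping available without going back to the database for each user.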
Another approach, if you run Spark over HDFS and this is a one-time query, is to use a tool like Sqoop to load the SQL table into Hadoop and use the data natively in Spark.
Related
I am new to Java and Apache Spark, and I am trying to figure out how to get the values of a single column from a Dataset in Spark as a flat list.
Dataset<Row> sampleData = sparkSession.read()
        .....
        .option("query", "SELECT COLUMN1, column2 from table1")
        .load();

List<Row> columnsList = sampleData.select("COLUMN1")
        .where(sampleData.col("COLUMN1").isNotNull())
        .collectAsList();

String result = StringUtils.join(columnsList, ", ");
// The result I am getting is:
[15230321], [15306791], [15325784], [15323326], [15288338], [15322001], [15307950], [15298286], [15327223]
// What I want is:
15230321, 15306791, ...
How do I achieve this in Spark using Java?
A Spark Row can be converted to a String with Encoders:
List<String> result = sampleData.select("COLUMN1").as(Encoders.STRING()).collectAsList();
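To get the flat, comma-separated string you are after, you can then join that list (a small follow-up sketch):

String joined = String.join(", ", result);  // "15230321, 15306791, ..."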
I am posting the answer in Scala; you can convert it into Java, as there are online tools available.
Also, I am not building the String result the way you specified, because that would require creating the table and running the query as in your process. Instead, I am replicating the problem variable directly using
import org.apache.spark.sql.Row
val a = List(Row("123"),Row("222"),Row("333"))
Printing a gives me
List([123], [222], [333])
So apply a simple map operation along with the mkString method to flatten the List:
a.map(x => x.mkString(","))
gives
List(123, 222, 333), which I assume is what you expect.
Let me know if this sorts out your issue.
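For reference, roughly the same idea in Java (a sketch only, building the sample rows with RowFactory):

List<Row> a = java.util.Arrays.asList(RowFactory.create("123"), RowFactory.create("222"), RowFactory.create("333"));
List<String> flattened = a.stream()
        .map(row -> row.mkString(","))   // Row.mkString joins the row's values with the separator
        .collect(java.util.stream.Collectors.toList());
System.out.println(flattened);  // [123, 222, 333]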
I'm using Spark SQL (Spark 2.1) to read a Hive table.
The schema of the Hive table is the following (simplified to the only field relevant to my problem; the others don't matter):
Body type: binary
The body is a JSON document with multiple fields, and the one I'm interested in is an array. Each element of this array is another JSON object that contains a date.
My goal is to obtain a dataset filled with all the objects of my array whose date is later than "insert the wanted date".
To do so, I use the following code:
SparkConf conf = ...; // set the Kryo serializer and Tungsten to true
SparkSession ss = ...; // set the conf on the SparkSession

Dataset<String> dataset = creatMyDatasetWithTheGoodType(ss.sql("select * from mytable"));

Dataset<String> finalds = dataset.flatMap((FlatMapFunction<String, String>) json -> {
    List<String> l = new ArrayList<>();
    List<String> ldate = ...; // I use Jackson to obtain the array of dates; this returns a list
    for (int i = 0; i < ldate.size(); i++) {
        // if the date is ok, I add the corresponding element to l
    }
    return l.iterator();
}, Encoders.STRING());
(My code works on a small dataset; I included it to give an idea of what I am doing.)
The problem is that this Hive table has about 22 million rows.
The job ran for 14 hours and didn't finish (I killed it, but there was no error or GC overhead).
I'm running it with yarn-client, with 4 executors of 16 GB of memory each and 1 core per executor. The driver has 4 GB of memory.
I ran hdfs dfs -du hiveTableLocationPath and got about 45 GB as a result.
What can I do to tune my job ?
I recommend trying this UDTF, which allows working with JSON columns within Hive.
It then becomes possible to manipulate large JSON documents and fetch the needed data in a distributed and optimized way.
I have a comma-separated file which I want to load into memory and query as if it were a database. I've come across many concepts/names but am not sure which one is right (embedded DB, in-memory database such as Apache Ignite, etc.). How can I achieve that?
I recommend working with Apache Spark: you can load your file and then query it using Spark SQL as follows:
val df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
// Select only the "user_id" column
df.select("user_id").show()
See the link for more information.
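If you prefer Java and want to run actual SQL against the file, here is a minimal sketch (assuming Spark 2.x and a hypothetical user_id column):

SparkSession spark = SparkSession.builder()
        .appName("csv-sql")
        .master("local[*]")
        .getOrCreate();

Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", "true")
        .load("csvfile.csv");

// Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("csvtable");
spark.sql("SELECT user_id, COUNT(*) AS cnt FROM csvtable GROUP BY user_id").show();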
If you are using Apache Spark 1.6, your code would be:
HiveContext hqlContext = new HiveContext(sparkContext);
DataFrame df = hqlContext.read().format("com.databricks.spark.csv").option("inferSchema", "true")
.option("header", "true").load(csvpath);
df.registerTempTable("table_name");
And then you can query the table.
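For example (a sketch; the table and column names are placeholders):

DataFrame results = hqlContext.sql("SELECT column1 FROM table_name WHERE column2 = 'some_value'");
results.show();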
I am trying to read some data from Phoenix into Spark using its Spark plugin:
String connectionString="jdbc:phoenix:auper01-01-20-01-0.prod.vroc.com.au,auper01-02-10-01-0.prod.vroc.com.au,auper01-02-10-02-0.prod.vroc.com.au:2181:/hbase-unsecure";
Map<String, String> options2 = new HashMap<String, String>();
options2.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver");
//options2.put("dbtable", url);
options2.put("table", "VROC_SENSORDATA_3");
options2.put("zkUrl", connectionString);
DataFrame phoenixFrame2 = this.hc.read().format("org.apache.phoenix.spark")
.options(options2)
.load();
System.out.println("The phoenix table is:");
phoenixFrame2.printSchema();
phoenixFrame2.show(20, false);
But I need to do a select with a where clause. I also tried the dbtable option, which is used for a JDBC connection in Spark, but it doesn't seem to have any effect.
Based on the documentation
"In contrast, the phoenix-spark integration is able to leverage the underlying splits provided by Phoenix in order to retrieve and save data across multiple workers. All that’s required is a database URL and a table name. Optional SELECT columns can be given, as well as pushdown predicates for efficient filtering."
But it seems there is no way to parallelize the reading from Phoenix this way; it would be really inefficient to read the whole table into a Spark DataFrame and then do the filtering, yet I can't find a way to apply a where clause. Does anyone know how to apply a where clause in the code above?
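As a hedged sketch based only on the quoted documentation: since the phoenix-spark integration advertises pushdown predicates, applying a filter on the loaded DataFrame (here with a hypothetical column SENSOR_ID) should be pushed down to Phoenix rather than filtering after a full table scan:

DataFrame filtered = phoenixFrame2.filter(phoenixFrame2.col("SENSOR_ID").equalTo("ABC123"));
filtered.show(20, false);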
I want to display the data of a PostgreSQL database using CRUD in the Play framework. I looked for examples to get an idea, but didn't find any after a long time of searching on Google. Please help me with this if you can, or post a valid link about it. Thanks in advance!
I use Play 1.2.5, Java, and PostgreSQL.
I assume you want to do this in your application code at runtime.
You can execute queries using the DB plugin to search the DB metadata with native PostgreSQL queries.
Here's an example of how to get the column names of my system's DOCUMENT table:
List<String> columns = new ArrayList<>();
ResultSet result = DB.executeQuery("select column_name from INFORMATION_SCHEMA.COLUMNS where table_name = 'DOCUMENT';");
while (result.next()) {
    String column = result.getString(1);
    columns.add(column);
}
Note that this code is somewhat simplified; you should use prepared statements if anything in this query is built from data entered by a user or any other system.
Use DB.getConnection().prepareStatement() to get a PreparedStatement instance.
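For illustration, a sketch of the parameterized version (assuming Play 1.x's DB plugin and plain JDBC):

Connection conn = DB.getConnection();
PreparedStatement ps = conn.prepareStatement(
        "select column_name from INFORMATION_SCHEMA.COLUMNS where table_name = ?");
ps.setString(1, "DOCUMENT");  // the table name is now a bound parameter
ResultSet rs = ps.executeQuery();
List<String> columns = new ArrayList<>();
while (rs.next()) {
    columns.add(rs.getString(1));
}
rs.close();
ps.close();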