Replace one column's values with values from another DataFrame in Spark Java

I have a dataframe df1 of the format
+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
| A    | z    | m    |
| B    | w    | n    |
| C    | x    | o    |
| A    | z    | n    |
| A    | p    | o    |
+------+------+------+
and another dataframe df2 of the format
+------+------+
| Col1 | Col2 |
+------+------+
| 0-A  | 0-z  |
| 1-B  | 3-w  |
| 2-C  | 1-x  |
|      | 2-p  |
+------+------+
I am trying to replace the values in Col1 and Col2 of df1 with the corresponding values from df2 using Spark Java.
The resulting dataframe df3 should look like this:
+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
| 0-A  | 0-z  | m    |
| 1-B  | 3-w  | n    |
| 2-C  | 1-x  | o    |
| 0-A  | 0-z  | n    |
| 0-A  | 2-p  | o    |
+------+------+------+
In other words, every value in Col1 and Col2 of df1 should be replaced with its prefixed counterpart from df2.
Is there any way I can achieve this with the Spark Java DataFrame syntax?
My initial idea was to do the following:
String pattern1 = "\\p{L}+(?: \\p{L}+)*$";
df1 = df1.join(df2, df1.col("col1").equalTo(regexp_extract(df2.col("col1"), pattern1, 0)), "leftsemi");

Replace your last join operation with the joins below (in the Java API, use equalTo rather than Scala's ===):
df1.alias("x")
   .join(df2.alias("y").select(col("y.Col1").alias("newCol1")),
         col("x.Col1").equalTo(regexp_extract(col("newCol1"), "\\p{L}+(?: \\p{L}+)*$", 0)), "left")
   .withColumn("Col1", col("newCol1"))
   .join(df2.alias("z").select(col("z.Col2").alias("newCol2")),
         col("x.Col2").equalTo(regexp_extract(col("newCol2"), "\\p{L}+(?: \\p{L}+)*$", 0)), "left")
   .withColumn("Col2", col("newCol2"))
   .drop("newCol1", "newCol2")
   .show(false);
+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
|2-C |1-x |o   |
|0-A |0-z |m   |
|0-A |0-z |n   |
|0-A |2-p |o   |
|1-B |3-w |n   |
+----+----+----+

Related

Select row value from a different column in Spark Java using some rule

For each row I want to select a value from one of several columns, with the column chosen by a complex rule.
For example, I have this data set:
+----------+---+---+---+
| Column A | 1 | 2 | 3 |
+----------+---+---+---+
| User 1   | A | H | O |
| User 2   | B | L | J |
| User 3   | A | O | N |
| User 4   | F | S | E |
| User 5   | S | G | V |
+----------+---+---+---+
I want to get something like this:
+----------+---+---+---+---+
| Column A | 1 | 2 | 3 | F |
+----------+---+---+---+---+
| User 1   | A | H | O | O |
| User 2   | B | L | J | J |
| User 3   | A | O | N | A |
| User 4   | F | S | E | E |
| User 5   | S | G | V | S |
+----------+---+---+---+---+
The values for column F are selected using a complex rule for which the when function is not applicable. If there are 1000 columns to select from, can a UDF do this?
I already tried making a UDF that stores the name of the column to select the value from, so it can be used to access that column's row value. For example, I tried storing the row value 233 (the result of the complex rule) for row 100, then using it as a column name (column 233) to access that column's value in row 100. However, I never got it to work.
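One possible approach (a sketch, not from the original thread): pack the candidate columns into an array column and let a UDF apply the rule to the packed values. The vowel rule below is only a placeholder for the real complex rule, and pickByRule is a made-up name; on Scala 2.13 builds the UDF input type would be scala.collection.mutable.ArraySeq instead of WrappedArray.
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import scala.collection.mutable.WrappedArray;

// Placeholder rule: return the first value that starts with a vowel, else the first value.
UDF1<WrappedArray<String>, String> pickByRule = values -> {
    for (int i = 0; i < values.length(); i++) {
        String v = values.apply(i);
        if (v != null && !v.isEmpty() && "AEIOU".indexOf(v.charAt(0)) >= 0) return v;
    }
    return values.apply(0);
};
spark.udf().register("pickByRule", pickByRule, DataTypes.StringType);

// array() scales to many columns; for 1000 columns, build the Column[] from df.columns().
df.withColumn("F", callUDF("pickByRule", array(col("1"), col("2"), col("3")))).show();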

Finding duplicates from column in the rest of database with JPA

I've got a few columns in my db. I want to choose one column and return all of the records whose value in that column appears more than once, i.e. pick a column, find which of its values are duplicated, and return those records. Let's say the database looks like this:
id;col1;col2;col3;col4
'1','ab','cd','ef','1'
'2','ad','bg','ee','5'
'3','xx','bg','cc','6'
'4','vv','zz','ff','4'
'5','zz','ee','gg','4'
'6','zz','vv','zz','2'
'7','vv','aa','bb','8'
'8','ww','nn','zz','4'
'9','zz','yy','ff','9'
'10','qq','oo','ii','3'
and I want my result for col1 to look like this:
4,'vv','zz','ff',4
5,'zz','ee','gg',4
6,'zz','vv','zz',2
7,'vv','aa','bb',8
9,'zz','yy','ff',9
Here the duplicates are presented in two different ways. The first is the format you requested, with additional information: row_number() partitions each column by its values, so any row with some r > 1 carries a value that appears more than once in that column. The second is more concise, simply counting the duplicated values per column.
create table t1(
  id   varchar(10),
  col1 varchar(10),
  col2 varchar(10),
  col3 varchar(10),
  col4 varchar(10));
insert into t1 values
  ('1','ab','cd','ef','1'),
  ('2','ad','bg','ee','5'),
  ('3','xx','bg','cc','6'),
  ('4','vv','zz','ff','4'),
  ('5','zz','ee','gg','4'),
  ('6','zz','vv','zz','2'),
  ('7','vv','aa','bb','8'),
  ('8','ww','nn','zz','4'),
  ('9','zz','yy','ff','9'),
  ('10','qq','oo','ii','3');
select * from t1;
id | col1 | col2 | col3 | col4
:- | :--- | :--- | :--- | :---
1 | ab | cd | ef | 1
2 | ad | bg | ee | 5
3 | xx | bg | cc | 6
4 | vv | zz | ff | 4
5 | zz | ee | gg | 4
6 | zz | vv | zz | 2
7 | vv | aa | bb | 8
8 | ww | nn | zz | 4
9 | zz | yy | ff | 9
10 | qq | oo | ii | 3
with cte as (
  select
    id, col1, col2, col3, col4,
    row_number() over (partition by col1 order by id desc) r1,
    row_number() over (partition by col2 order by id desc) r2,
    row_number() over (partition by col3 order by id desc) r3,
    row_number() over (partition by col4 order by id desc) r4
  from t1
)
select *
from cte
where r1 > 1
   or r2 > 1
   or r3 > 1
   or r4 > 1;
id | col1 | col2 | col3 | col4 | r1 | r2 | r3 | r4
:- | :--- | :--- | :--- | :--- | -: | -: | -: | -:
6 | zz | vv | zz | 2 | 2 | 1 | 2 | 1
5 | zz | ee | gg | 4 | 3 | 1 | 1 | 2
4 | vv | zz | ff | 4 | 2 | 1 | 2 | 3
2 | ad | bg | ee | 5 | 1 | 2 | 1 | 1
select 'col1' as "column", col1 "value", count(id) "count"
from t1 group by col1 having count(id) > 1
union all
select 'col2', col2, count(id)
from t1 group by col2 having count(id) > 1
union all
select 'col3', col3, count(id)
from t1 group by col3 having count(id) > 1
union all
select 'col4', col4, count(id)
from t1 group by col4 having count(id) > 1
order by "column", "value";
column | value | count
:----- | :---- | ----:
col1 | vv | 2
col1 | zz | 3
col2 | bg | 2
col3 | ff | 2
col3 | zz | 2
col4 | 4 | 3
db<>fiddle here
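Since the question mentions JPA, either statement can be run as a native query. A minimal sketch, assuming an injected EntityManager; it uses a count(*) over (partition by ...) variation of the window-function idea above, which keeps all occurrences of a duplicated value and so matches the requested output for col1:
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import java.util.List;

public class DuplicateFinder {

    // Window-function query narrowed to col1 duplicates, keeping every occurrence.
    private static final String COL1_DUPLICATES_SQL =
        "with cte as (select id, col1, col2, col3, col4, "
      + "count(*) over (partition by col1) c1 from t1) "
      + "select id, col1, col2, col3, col4 from cte where c1 > 1 order by id";

    @PersistenceContext
    private EntityManager em;

    // Each Object[] holds one duplicated row: id, col1, col2, col3, col4.
    @SuppressWarnings("unchecked")
    public List<Object[]> findCol1Duplicates() {
        return em.createNativeQuery(COL1_DUPLICATES_SQL).getResultList();
    }
}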

Join two tables in MySQL and return multiple rows (first table contains one row per student and second table contains multiple rows)

I have two tables
table-1
| stdid | stdname |
|-------|---------|
| 1     | raghav  |
| 2     | sowmya  |
| 3     | kiran   |
table-2
| skillid | stdname | skill  |
|---------|---------|--------|
| 1       | raghav  | java   |
| 2       | raghav  | c      |
| 3       | raghav  | c++    |
| 4       | sowmya  | python |
| 5       | sowmya  | c++    |
| 6       | kiran   | c      |
I want output like:
raghav java,c,c++.
sowmya python,c++.
kiran c.
How can I join those two tables like this and store the result in an ArrayList?
Does ArrayList accept array variables? If yes, how can I approach it?
Join the tables and then aggregate by name:
SELECT t1.stdname, GROUP_CONCAT(COALESCE(t2.skill, 'NA')) AS skills
FROM Table1 t1
LEFT JOIN Table2 t2
    ON t2.stdname = t1.stdname
GROUP BY t1.stdname;
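To address the ArrayList part: yes, an ArrayList can hold arrays (e.g. ArrayList<String[]>). A minimal JDBC sketch, assuming connection details (url, user, pass) for your database and the Table1/Table2 names from the query above:
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class SkillLoader {
    // Loads one String[]{stdname, skills} per student into an ArrayList.
    public static List<String[]> loadSkills(String url, String user, String pass) throws SQLException {
        String sql = "SELECT t1.stdname, GROUP_CONCAT(COALESCE(t2.skill, 'NA')) AS skills "
                   + "FROM Table1 t1 LEFT JOIN Table2 t2 ON t2.stdname = t1.stdname "
                   + "GROUP BY t1.stdname";
        List<String[]> rows = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(url, user, pass);
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                rows.add(new String[] { rs.getString("stdname"), rs.getString("skills") });
            }
        }
        return rows;
    }
}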

Filter Dataset using where column is not a number using Spark Java API 2.2?

I'm new to the Spark Java API. I want to filter my Dataset down to the rows where the account column is numeric. My dataset ds1 is something like this:
+---------+------------+
| account | amount     |
+---------+------------+
| aaaaaa  |            |
| aaaaaa  |            |
| bbbbbb  |            |
| 123333  |            |
| 555555  |            |
| 666666  |            |
+---------+------------+
I want to return a dataset ds2 like this:
+---------+------------+
| account | amount     |
+---------+------------+
| 123333  |            |
| 555555  |            |
| 666666  |            |
+---------+------------+
I tried this, but it doesn't work for me:
ds2 = ds1.select("account").where(ds1.col("account").isNaN());
Can someone please guide me with a sample Spark expression to resolve this?
You can define a udf function to check whether the string in the account column is numeric or not:
// requires org.apache.commons.lang3.StringUtils on the classpath
UDF1<String, Boolean> checkNumeric = new UDF1<String, Boolean>() {
    @Override
    public Boolean call(final String account) throws Exception {
        return StringUtils.isNumeric(account);
    }
};
sqlContext.udf().register("numeric", checkNumeric, DataTypes.BooleanType);
and then use the callUDF function to invoke it:
df.filter(callUDF("numeric", col("account"))).show();
which should give you
+-------+------+
|account|amount|
+-------+------+
| 123333| |
| 555555| |
| 666666| |
+-------+------+
Just cast and check if the result is null (dropping the select so that both columns are kept):
ds2 = ds1.where(ds1.col("account").cast("bigint").isNotNull());
One way to do this, as a Scala equivalent:
import scala.util.Try
df.filter(r => Try(r.getString(0).toInt).isSuccess).show()
+-------+------+
|account|amount|
+-------+------+
| 123333| |
| 555555| |
| 666666| |
+-------+------+
Or the same using try/catch (also in Scala):
df.map(r => (r.getString(0), r.getString(1), {
    try { r.getString(0).toInt; true }
    catch { case _: RuntimeException => false }
  }))
  .filter(_._3 == true)
  .drop("_3")
  .show()
+------+---+
| _1| _2|
+------+---+
|123333| |
|555555| |
|666666| |
+------+---+

spark-sql - using nested query to filter data

I have a huge .csv file with several columns, but the columns of importance to me are USER_ID (user identifier), DURATION (duration of call), TYPE (incoming or outgoing), DATE, and NUMBER (mobile no.).
What I am trying to do is replace all null values in the DURATION column with the average duration of all calls of the same type by the same user (i.e. with the same USER_ID).
I have found the average as follows. In the query below I compute the average duration of all calls of the same type by the same user:
Dataset<Row> filteredData = callLogsDataSet.selectExpr(USER_ID, DURATION, TYPE, DATE, NORMALIZE_NUMBER)
/*1*/ .filter(col(USER_ID).isNotNull().and(col(TYPE).isNotNull()).and(col(NORMALIZE_NUMBER).isNotNull()).and(col(DATE).gt(0)).and(col(TYPE).isin("OUTGOING","INCOMING")))
/*2*/ .groupBy(col(USER_ID), col(TYPE), col(NORMALIZE_NUMBER))
/*3*/ .agg(sum(DURATION).alias(DURATION_IN_MIN).divide(count(col(USER_ID))));
filteredData.show() gives:
|USER_ID |type |normalized_number|(sum(duration) AS `durationInMin` / count(USER_ID))|
+--------------------------------+--------+-----------------+---------------------------------------------------+
|8a8a8a8a592b4ace01595e65901b0013|OUTGOING|+435657456354 |0.0 |
|8a8a8a8a592b4ace01595e70dcbd0016|OUTGOING|+876454354353 |48.6 |
|8a8a8a8a592b4ace01595e099764000c|INCOMING|+132445686765 |15.0 |
|8a8a8a8a592b4ace01592b4ff4b90000|INCOMING|+097645634324 |74.16666666666667 |
|8a8a8a8a592b4ace0159366a56290005|INCOMING|+134435657656 |15.0 |
|8a8a8a8a592b4ace01595e70dcbd0016|OUTGOING|+135879878543 |31.0 |
|8a8a8a8a592b4ace0159366a56290005|INCOMING|+768435245243 |11.0 |
|8a8a8a8a592b4ace01592cd8fd160003|INCOMING|+787685534523 |0.0 |
|8a8a8a8a592b4ace01595e65901b0013|OUTGOING|+098976865745 |61.5 |
|8a8a8a8a592b4ace01592b4ff4b90000|OUTGOING|+123456787644 |43.333333333333336 |
In the query below I filter the data; in step 2 I replace all null occurrences with 0:
Dataset<Row> filteredData2 = callLogsDataSet.selectExpr(USER_ID, DURATION, TYPE, DATE, NORMALIZE_NUMBER)
/*1*/ .filter(col(USER_ID).isNotNull().and(col(TYPE).isNotNull()).and(col(NORMALIZE_NUMBER).isNotNull())
.and(col(DATE).gt(0)).and(col(DURATION).gt(0)).and(col(TYPE).isin("OUTGOING","INCOMING")))
/*2*/ .withColumn(DURATION, when(col(DURATION).isNull(), 0).otherwise(col(DURATION).cast(LONG)))
/*3*/ .withColumn(DATE, col(DATE).cast(LONG).minus(col(DATE).cast(LONG).mod(ROUND_ONE_MIN)).cast(LONG))
/*4*/ .groupBy(col(USER_ID), col(DURATION), col(TYPE), col(DATE), col(NORMALIZE_NUMBER))
/*5*/ .agg(sum(DURATION).alias(DURATION_IN_MIN))
/*6*/ .withColumn(DAY_TIME, lit(""))
/*7*/ .withColumn(WEEK_DAY, lit(""))
/*8*/ .withColumn(HOUR_OF_DAY, lit(0));
filteredData2.show() gives:
|USER_ID |duration|type |date |normalized_number|durationInMin|DAY_TIME|WEEK_DAY|HourOfDay|
+--------------------------------+--------+--------+-------------+-----------------+-------------+--------+--------+---------+
|8a8a8a8a592b4ace01595e70dcbd0016|25 |INCOMING|1479017220000|+465435534353 |25 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|29 |INCOMING|1482562560000|+545765765775 |29 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|75 |OUTGOING|1483363980000|+124435665755 |75 | | |0 |
|8a8a8a8a592b4ace01595e70dcbd0016|34 |OUTGOING|1483261920000|+098865563645 |34 | | |0 |
|8a8a8a8a592b4ace01595e70dcbd0016|22 |OUTGOING|1481712180000|+232434656765 |22 | | |0 |
|8a8a8a8a592b4ace0159366a56290005|64 |OUTGOING|1482984060000|+875634521325 |64 | | |0 |
|8a8a8a8a592b4ace0159366a56290005|179 |OUTGOING|1482825060000|+876542543554 |179 | | |0 |
|8a8a8a8a592b4ace01595e65901b0013|12 |OUTGOING|1482393360000|+098634563456 |12 | | |0 |
|8a8a8a8a592b4ace01595e70dcbd0016|14 |OUTGOING|1482820860000|+1344365i8787 |14 | | |0 |
|8a8a8a8a592b4ace01592b4ff4b90000|105 |INCOMING|1478772240000|+234326886784 |105 | | |0 |
|8a8a8a8a592b4ace01592b4ff4b90000|453 |OUTGOING|1480944480000|+134435676578 |453 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|42 |OUTGOING|1483193100000|+413247687686 |42 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|41 |OUTGOING|1481696820000|+134345435645 |41 | | |0 |
Please help me combine these two queries (or use them together) to get the required result; one possible combination is sketched below. I am new to Spark and Spark SQL.
Thanks.
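One way to combine the two steps, as a sketch assuming the column-name constants from the question (USER_ID, TYPE, DURATION): compute the per-user, per-type average, left-join it back onto the call log, and coalesce null durations with that average.
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Step 1: average duration per user and call type (avg ignores null durations).
Dataset<Row> avgByUserAndType = callLogsDataSet
        .filter(col(USER_ID).isNotNull().and(col(TYPE).isin("OUTGOING", "INCOMING")))
        .groupBy(col(USER_ID), col(TYPE))
        .agg(avg(col(DURATION).cast("double")).alias("avgDuration"));

// Step 2: join the averages back and fill null durations with the per-group average
// (falling back to 0 when a group has no non-null duration at all).
Dataset<Row> filled = callLogsDataSet.alias("l")
        .join(avgByUserAndType.alias("a"),
              col("l." + USER_ID).equalTo(col("a." + USER_ID))
                  .and(col("l." + TYPE).equalTo(col("a." + TYPE))), "left")
        .withColumn(DURATION,
              coalesce(col("l." + DURATION).cast("double"), col("a.avgDuration"), lit(0.0)))
        .drop(col("a." + USER_ID))
        .drop(col("a." + TYPE))
        .drop("avgDuration");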
