For each row, I want to select a value from a different column, chosen by some complex rule.
For example, I have this data set:
+----------+---+---+---+
| Column A | 1 | 2 | 3 |
+----------+---+---+---+
| User 1   | A | H | O |
| User 2   | B | L | J |
| User 3   | A | O | N |
| User 4   | F | S | E |
| User 5   | S | G | V |
+----------+---+---+---+
I want to get something like this:
+----------+---+---+---+---+
| Column A | 1 | 2 | 3 | F |
+----------+---+---+---+---+
| User 1   | A | H | O | O |
| User 2   | B | L | J | J |
| User 3   | A | O | N | A |
| User 4   | F | S | E | E |
| User 5   | S | G | V | S |
+----------+---+---+---+---+
The values for column F are selected using a complex rule for which the when function is not applicable. If there are 1000 columns to select from, can I make a UDF do this?
I already tried making a UDF that stores the name of the column to select from as a string, so that string can then be used to access that column's value for the same row. For example, I tried storing the value 233 (the result of the complex rule) for row 100, then using it as a column name (column 233) to access that column's value for row 100. However, I never got it to work.
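For context, one approach I have seen suggested (a rough sketch, not working code of mine) is to build a map from column-name literals to the columns themselves and look the value up by name per row. This assumes Spark 2.4+ for element_at; ds, dataColumns, and selector are hypothetical names for the dataset, the candidate column names, and a string column holding each row's target column name:
import static org.apache.spark.sql.functions.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Build key/value pairs: each column's name as a literal, followed by the column itself.
List<Column> pairs = new ArrayList<>();
for (String name : dataColumns) {
    pairs.add(lit(name));
    pairs.add(col(name));
}

// element_at picks, per row, the map entry whose key equals that row's selector value.
Dataset<Row> result = ds.withColumn("F",
        element_at(map(pairs.toArray(new Column[0])), col("selector")));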
I've got a few columns in my database. I want to pick one column and then return all of the records whose values are duplicated. That is, I want to take one column and check which of its values also appear elsewhere in the table, then return those records. Let's say the database looks like this:
id;col1;col2;col3;col4
'1','ab','cd','ef','1'
'2','ad','bg','ee','5'
'3','xx','bg','cc','6'
'4','vv','zz','ff','4'
'5','zz','ee','gg','4'
'6','zz','vv','zz','2'
'7','vv','aa','bb','8'
'8','ww','nn','zz','4'
'9','zz','yy','ff','9'
'10','qq','oo','ii','3'
and I want my result for col1 to look like this:
4,'vv','zz','ff',4
5,'zz','ee','gg',4
6,'zz','vv','zz',2
7,'vv','aa','bb',8
9,'zz','yy','ff',9
Here we present the duplicates in two different ways. The first is the format you requested, with additional information. The second is more concise.
create table t1(
id varchar(10),
col1 varchar(10),
col2 varchar(10),
col3 varchar(10),
col4 varchar(10));
insert into t1 values
('1','ab','cd','ef','1'),
('2','ad','bg','ee','5'),
('3','xx','bg','cc','6'),
('4','vv','zz','ff','4'),
('5','zz','ee','gg','4'),
('6','zz','vv','zz','2'),
('7','vv','aa','bb','8'),
('8','ww','nn','zz','4'),
('9','zz','yy','ff','9'),
('10','qq','oo','ii','3');
select * from t1;
id | col1 | col2 | col3 | col4
:- | :--- | :--- | :--- | :---
1 | ab | cd | ef | 1
2 | ad | bg | ee | 5
3 | xx | bg | cc | 6
4 | vv | zz | ff | 4
5 | zz | ee | gg | 4
6 | zz | vv | zz | 2
7 | vv | aa | bb | 8
8 | ww | nn | zz | 4
9 | zz | yy | ff | 9
10 | qq | oo | ii | 3
-- rN is the row number within each group of rows sharing the same value in colN
-- (ordered by id descending), so rN > 1 means that value occurs in more than one row
with cte as(
select
id,
col1,
col2,
col3,
col4,
row_number() over
( partition by col1
order by id desc) r1,
row_number() over
( partition by col2
order by id desc) r2,
row_number() over
( partition by col3
order by id desc) r3,
row_number() over
( partition by col4
order by id desc) r4
from t1
)
select *
from cte
where
r1 > 1
or r2 > 1
or r3 > 1
or r4 > 1;
id | col1 | col2 | col3 | col4 | r1 | r2 | r3 | r4
:- | :--- | :--- | :--- | :--- | -: | -: | -: | -:
6 | zz | vv | zz | 2 | 2 | 1 | 2 | 1
5 | zz | ee | gg | 4 | 3 | 1 | 1 | 2
4 | vv | zz | ff | 4 | 2 | 1 | 2 | 3
2 | ad | bg | ee | 5 | 1 | 2 | 1 | 1
select 'col1' as "column",
col1 "value",
count(id) "count"
from t1 group by col1
having count(id)>1
union all
select 'col2',col2, count(id)
from t1 group by col2
having count(id)>1
union all
select 'col3',col3, count(id)
from t1 group by col3
having count(id)>1
union all
select 'col4',col4, count(id)
from t1 group by col4
having count(id)>1
order by "column","value";
column | value | count
:----- | :---- | ----:
col1 | vv | 2
col1 | zz | 3
col2 | bg | 2
col3 | ff | 2
col3 | zz | 2
col4 | 4 | 3
db<>fiddle here
I have two tables
table-1
| stdid | stdname |
|-------|---------|
| 1     | raghav  |
| 2     | sowmya  |
| 3     | kiran   |
table-2
| skillid | stdname | skill  |
|---------|---------|--------|
| 1       | raghav  | java   |
| 2       | raghav  | c      |
| 3       | raghav  | c++    |
| 4       | sowmya  | python |
| 5       | sowmya  | c++    |
| 6       | kiran   | c      |
I want output like
raghav java,c,c++.
sowmya python,c++.
kiran c.
How can I join those two tables like this and store the result in an ArrayList?
Does ArrayList accept array variables? If yes, how can I approach it?
Join the tables and then aggregate by name:
SELECT t1.stdname, GROUP_CONCAT(COALESCE(t2.skill, 'NA')) AS skills
FROM Table1 t1
LEFT JOIN Table2 t2
ON t2.stdname = t1.stdname
GROUP BY
t1.stdname;
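GROUP_CONCAT here is MySQL syntax; the LEFT JOIN plus COALESCE keeps students who have no skills, shown as 'NA'. As for the ArrayList part: an ArrayList can hold arrays as elements, e.g. ArrayList<String[]>. A minimal JDBC sketch of reading the result into one (the connection URL and credentials are placeholders, not taken from your setup):
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

String sql = "SELECT t1.stdname, GROUP_CONCAT(COALESCE(t2.skill, 'NA')) AS skills "
           + "FROM Table1 t1 LEFT JOIN Table2 t2 ON t2.stdname = t1.stdname "
           + "GROUP BY t1.stdname";

// Each element is a two-slot array: {stdname, comma-separated skills}.
List<String[]> rows = new ArrayList<>();
try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://localhost:3306/mydb", "user", "password");
     Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(sql)) {
    while (rs.next()) {
        rows.add(new String[] { rs.getString("stdname"), rs.getString("skills") });
    }
} // SQLException must be handled or declared by the enclosing method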
I'm new to the Spark Java API. I want to filter my Dataset where a column is not a number. My dataset ds1 is something like this:
+---------+------------+
| account | amount     |
+---------+------------+
| aaaaaa  |            |
| aaaaaa  |            |
| bbbbbb  |            |
| 123333  |            |
| 555555  |            |
| 666666  |            |
+---------+------------+
I want to return a dataset ds2 like this:
+---------+------------+
| account | amount     |
+---------+------------+
| 123333  |            |
| 555555  |            |
| 666666  |            |
+---------+------------+
I tried this but it doesn't work for me:
ds2 = ds1.select("account").where(dsFec.col("account").isNaN());
Can someone please guide me with a sample Spark expression to resolve this?
You can define a udf function to check whether the string in the account column is numeric or not:
import org.apache.commons.lang3.StringUtils; // Apache Commons Lang
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

UDF1<String, Boolean> checkNumeric = new UDF1<String, Boolean>() {
    @Override
    public Boolean call(final String account) throws Exception {
        // true only when the whole string consists of digits
        return StringUtils.isNumeric(account);
    }
};
sqlContext.udf().register("numeric", checkNumeric, DataTypes.BooleanType);
and then use the callUDF function to call it:
df.filter(callUDF("numeric", col("account"))).show();
which should give you
+-------+------+
|account|amount|
+-------+------+
| 123333| |
| 555555| |
| 666666| |
+-------+------+
Just cast and keep the rows where the result is not null:
ds1.select("account").where(ds1.col("account").cast("bigint").isNotNull());
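Note that select("account") keeps only that column; if ds2 should keep both columns, as in your example output, a small variation works (assuming the usual static import of org.apache.spark.sql.functions.col):
Dataset<Row> ds2 = ds1.where(col("account").cast("bigint").isNotNull());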
One way to do this:
Scala Equivalent:
import scala.util.Try
df.filter(r => Try(r.getString(0).toInt).isSuccess).show()
+-------+------+
|account|amount|
+-------+------+
| 123333| |
| 555555| |
| 666666| |
+-------+------+
Or you can do the same with a try/catch (written here in Scala, analogous to Java's try/catch):
df.map(r => (r.getString(0), r.getString(1), {
    try { r.getString(0).toInt; true }
    catch { case _: RuntimeException => false }
  }))
  .filter(_._3 == true)
  .drop("_3")
  .show()
+------+---+
| _1| _2|
+------+---+
|123333| |
|555555| |
|666666| |
+------+---+
I have a huge .csv file with several columns, but the columns of importance to me are USER_ID (user identifier), DURATION (duration of call), TYPE (incoming or outgoing), DATE, and NUMBER (mobile number).
What I am trying to do is replace all null values in the DURATION column with the average duration of all the calls of the same type by the same user (i.e. with the same USER_ID).
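Conceptually, I think what I need is something like the following sketch, which fills null durations from the per-(USER_ID, TYPE) average using a window function (column names are written as plain strings here rather than my constants, and this is not code I have working):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Average DURATION over all rows with the same USER_ID and TYPE (avg ignores nulls),
// then use that average only where DURATION itself is null.
WindowSpec byUserAndType = Window.partitionBy(col("USER_ID"), col("TYPE"));
Dataset<Row> filled = callLogsDataSet.withColumn("DURATION",
        coalesce(col("DURATION"), avg(col("DURATION")).over(byUserAndType)));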
So far, I have found the average as follows. In the query below I find the average duration of all the calls of the same type by the same user:
Dataset<Row> filteredData = callLogsDataSet.selectExpr(USER_ID, DURATION, TYPE, DATE, NORMALIZE_NUMBER)
/*1*/ .filter(col(USER_ID).isNotNull().and(col(TYPE).isNotNull()).and(col(NORMALIZE_NUMBER).isNotNull()).and(col(DATE).gt(0)).and(col(TYPE).isin("OUTGOING","INCOMING")))
/*2*/ .groupBy(col(USER_ID), col(TYPE), col(NORMALIZE_NUMBER))
/*3*/ .agg(sum(DURATION).alias(DURATION_IN_MIN).divide(count(col(USER_ID))));
filteredData.show() gives:
+--------------------------------+--------+-----------------+---------------------------------------------------+
|USER_ID                         |type    |normalized_number|(sum(duration) AS `durationInMin` / count(USER_ID))|
+--------------------------------+--------+-----------------+---------------------------------------------------+
|8a8a8a8a592b4ace01595e65901b0013|OUTGOING|+435657456354    |0.0                                                |
|8a8a8a8a592b4ace01595e70dcbd0016|OUTGOING|+876454354353    |48.6                                               |
|8a8a8a8a592b4ace01595e099764000c|INCOMING|+132445686765    |15.0                                               |
|8a8a8a8a592b4ace01592b4ff4b90000|INCOMING|+097645634324    |74.16666666666667                                  |
|8a8a8a8a592b4ace0159366a56290005|INCOMING|+134435657656    |15.0                                               |
|8a8a8a8a592b4ace01595e70dcbd0016|OUTGOING|+135879878543    |31.0                                               |
|8a8a8a8a592b4ace0159366a56290005|INCOMING|+768435245243    |11.0                                               |
|8a8a8a8a592b4ace01592cd8fd160003|INCOMING|+787685534523    |0.0                                                |
|8a8a8a8a592b4ace01595e65901b0013|OUTGOING|+098976865745    |61.5                                               |
|8a8a8a8a592b4ace01592b4ff4b90000|OUTGOING|+123456787644    |43.333333333333336                                 |
+--------------------------------+--------+-----------------+---------------------------------------------------+
In the query below I am filtering the data and, in step 2, replacing all the null occurrences with 0:
Dataset<Row> filteredData2 = callLogsDataSet.selectExpr(USER_ID, DURATION, TYPE, DATE, NORMALIZE_NUMBER)
/*1*/ .filter(col(USER_ID).isNotNull().and(col(TYPE).isNotNull()).and(col(NORMALIZE_NUMBER).isNotNull())
.and(col(DATE).gt(0)).and(col(DURATION).gt(0)).and(col(TYPE).isin("OUTGOING","INCOMING")))
/*2*/ .withColumn(DURATION, when(col(DURATION).isNull(), 0).otherwise(col(DURATION).cast(LONG)))
/*3*/ .withColumn(DATE, col(DATE).cast(LONG).minus(col(DATE).cast(LONG).mod(ROUND_ONE_MIN)).cast(LONG))
/*4*/ .groupBy(col(USER_ID), col(DURATION), col(TYPE), col(DATE), col(NORMALIZE_NUMBER))
/*5*/ .agg(sum(DURATION).alias(DURATION_IN_MIN))
/*6*/ .withColumn(DAY_TIME, lit(""))
/*7*/ .withColumn(WEEK_DAY, lit(""))
/*8*/ .withColumn(HOUR_OF_DAY, lit(0));
filteredData2.show() gives:
+--------------------------------+--------+--------+-------------+-----------------+-------------+--------+--------+---------+
|USER_ID                         |duration|type    |date         |normalized_number|durationInMin|DAY_TIME|WEEK_DAY|HourOfDay|
+--------------------------------+--------+--------+-------------+-----------------+-------------+--------+--------+---------+
|8a8a8a8a592b4ace01595e70dcbd0016|25      |INCOMING|1479017220000|+465435534353    |25           |        |        |0        |
|8a8a8a8a592b4ace01595e099764000c|29      |INCOMING|1482562560000|+545765765775    |29           |        |        |0        |
|8a8a8a8a592b4ace01595e099764000c|75      |OUTGOING|1483363980000|+124435665755    |75           |        |        |0        |
|8a8a8a8a592b4ace01595e70dcbd0016|34      |OUTGOING|1483261920000|+098865563645    |34           |        |        |0        |
|8a8a8a8a592b4ace01595e70dcbd0016|22      |OUTGOING|1481712180000|+232434656765    |22           |        |        |0        |
|8a8a8a8a592b4ace0159366a56290005|64      |OUTGOING|1482984060000|+875634521325    |64           |        |        |0        |
|8a8a8a8a592b4ace0159366a56290005|179     |OUTGOING|1482825060000|+876542543554    |179          |        |        |0        |
|8a8a8a8a592b4ace01595e65901b0013|12      |OUTGOING|1482393360000|+098634563456    |12           |        |        |0        |
|8a8a8a8a592b4ace01595e70dcbd0016|14      |OUTGOING|1482820860000|+1344365i8787    |14           |        |        |0        |
|8a8a8a8a592b4ace01592b4ff4b90000|105     |INCOMING|1478772240000|+234326886784    |105          |        |        |0        |
|8a8a8a8a592b4ace01592b4ff4b90000|453     |OUTGOING|1480944480000|+134435676578    |453          |        |        |0        |
|8a8a8a8a592b4ace01595e099764000c|42      |OUTGOING|1483193100000|+413247687686    |42           |        |        |0        |
|8a8a8a8a592b4ace01595e099764000c|41      |OUTGOING|1481696820000|+134345435645    |41           |        |        |0        |
+--------------------------------+--------+--------+-------------+-----------------+-------------+--------+--------+---------+
Please help me combine these two, or use them to get the required result. I am new to Spark and Spark SQL.
Thanks.