I have a CSV file that holds country names and the years in which they won the Eurovision:
country, year
Israel, 1998
Sweden, 2012
Sweden, 2015
United Kingdom, 1997
and here is my CSV-reading code (using the tototoshi scala-csv library):
import java.io.File

import com.github.tototoshi.csv._

object CountryEurovision {

  def countrEurovisionYearFile: File = new File("conf/countryEurovision.csv")

  lazy val countrEurovisionYearMap: Map[String, String] = getConvertData

  private def getConvertData: Map[String, String] = {
    implicit object CodesFormat extends CSVFormat {
      val delimiter: Char = ','
      val quoteChar: Char = '"'
      val escapeChar: Char = '"'
      val lineTerminator: String = "\r\n"
      val quoting: Quoting = QUOTE_NONNUMERIC
      val treatEmptyLineAsNil: Boolean = false
    }

    val csvDataReader = CSVReader.open(countrEurovisionYearFile, "UTF-8")(CodesFormat)
    val linesIterator = csvDataReader.iteratorWithHeaders
    val convertedData = linesIterator.map {
      row => row("Country") -> row("Year")
    }.toMap
    csvDataReader.close()
    convertedData
  }
}
Now, since country and year are not unique (a country can win in several years), when I look up Sweden:
CountryEurovision.countrEurovisionYearMap.get("Sweden")
I only get: res0: Option[String] = Some(2015)
whereas I would expect a list of years per country: even for a country with just one winning year I would get a list, and for Sweden I would get a list containing 2012 and 2015.
How can I change my setup to get that behavior?
When you transform linesIterator.map { row => row("Country") -> row("Year") } into a Map with .toMap, only the last entry for each duplicated key is kept, because each new entry overrides the previous one.
You can change this by grouping the values (years) per key (country) before building the Map, so that each key maps to a List:
linesIterator
  .map { row => row("Country") -> row("Year") }
  .toList                  // materialize first: groupBy is not defined on Iterator
  .groupBy(_._1)           // Map(France -> List((France,2008)), Sweden -> List((Sweden,1997), (Sweden,2017)))
  .mapValues(_.map(_._2))  // Map(France -> List(2008), Sweden -> List(1997, 2017))
  .toMap
which produces:
Map(France -> List(2008), Sweden -> List(1997, 2017))
This way, .get("Sweden") will return Some(List(1997, 2017)).
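For completeness, here is a minimal alternative sketch (assuming the same iteratorWithHeaders rows as in the question) that folds the rows directly into a Map[String, List[String]] without the intermediate groupBy:

val yearsByCountry: Map[String, List[String]] =
  linesIterator.foldLeft(Map.empty[String, List[String]]) { (acc, row) =>
    val country = row("Country")
    val year = row("Year")
    // append this year to whatever has already been accumulated for the country
    acc.updated(country, acc.getOrElse(country, Nil) :+ year)
  }

yearsByCountry.get("Sweden") // Some(List(2012, 2015)) with the CSV from the question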
Loading in the data:
SparkConf sc = new SparkConf().setAppName("TEST").setMaster("local[*]");
JavaSparkContext JSC = new JavaSparkContext(sc);
JavaRDD<String> stringRDDVotes = JSC.textFile("HarryPotter.csv");
I currently have this table loaded into an RDD:
ID | A  | B  | Name
1  | 23 | 50 | Harry;Potter
I want to convert it to the table below:
ID | A  | B  | Name
1  | 23 | 50 | Harry
1  | 23 | 50 | Potter
All the solutions I found use SparkSQL, which I can't use, so how would I get this result using only things like flatMap and mapToPair?
Something like this maybe?
flatMap(s -> Arrays.asList(s.split(";")).iterator())
The code above produces this:
ID | A  | B  | Name
1  | 23 | 50 | Harry
Potter
I know that in Scala it can be done like this, but I don't know how to do it with Java:
val input: RDD[String] = sc.parallelize(Seq("1,23,50,Harry;Potter"))
val csv: RDD[Array[String]] = input.map(_.split(','))
val result = csv.flatMap { case Array(s1, s2, s3, s4) => s4.split(";").map(part => (s1, s2, s3, part)) }
The first part is straightforward to convert from Scala to Java: use map to split each line by comma, which gives a JavaRDD<String[]>. Then, in flatMap, split the last element of each array (the Name column) on ";" and, using Java streams, turn each name into a new row.
Here is a complete example:
JavaRDD<String> input = JSC.parallelize(
Arrays.asList("1,23,50,Harry;Potter", "2,24,60,Hermione;Granger")
);
JavaRDD<String[]> result = input.map(line -> line.split(","))
.flatMap(r -> {
List<String> names = Arrays.asList(r[3].split(";"));
String[][] values = names.stream()
.map(name -> new String[]{r[0], r[1], r[2], name})
.toArray(String[][]::new);
return Arrays.asList(values).iterator();
});
// print the result RDD
for (String[] line : result.collect()) {
System.out.println(Arrays.toString(line));
}
// [1, 23, 50, Harry]
// [1, 23, 50, Potter]
// [2, 24, 60, Hermione]
// [2, 24, 60, Granger]
I'm building a real-time pipeline where I connect Spark Streaming with HBase. As part of this process, I have to execute a filter on an HBase table, specifically a prefix filter, since I want to match the records whose key starts with a certain string.
The table I'm filtering is called "hm_notificaciones". I can connect to the HBase shell and scan the table from the command line. Running the following command:
scan "hm_notificaciones"
I get the following records:
ROW COLUMN+CELL
46948854-20180307 column=info_oferta:id_oferta, timestamp=1520459448795, value=123456
46948854-20180312170423 column=info_oferta:id_establecimiento, timestamp=1520892403770, value=9999
46948854-20180312170423 column=info_oferta:id_oferta, timestamp=1520892390858, value=123445
46948854-20180312170536 column=info_oferta:id_establecimiento, timestamp=1520892422044, value=9239
46948854-20180312170536 column=info_oferta:id_oferta, timestamp=1520892435173, value=4432
46948854-20180313110824 column=info_oferta:id_establecimiento, timestamp=1520957374921, value=9990
46948854-20180313110824 column=info_oferta:id_oferta, timestamp=1520957362458, value=12313
I've been trying to run a prefix filter using the HBase API. I'm writing some Scala code to connect to the API and apply the filter. The following code compiles and executes, but it returns an empty result:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Scan, Table}
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.util.Bytes

import scala.collection.JavaConversions._

def scanTable(table_name: String, family: String, search_key: String) = {
  val conf: Configuration = HBaseConfiguration.create()
  val connection: Connection = ConnectionFactory.createConnection(conf)

  // This is a test to verify that I can connect to the HBase API.
  // These statements work and print all the table names in HBase.
  val admin = connection.getAdmin
  println("Listing all tablenames")
  val list_table_names = admin.listTableNames()
  list_table_names.foreach(println)

  val table: Table = connection.getTable(TableName.valueOf(table_name))
  //val htable = new HTable(conf, tableName)

  var colValueMap: Map[String, String] = Map()
  var keyColValueMap: Map[String, Map[String, String]] = Map()

  val prefix = Bytes.toBytes(search_key)
  val scan = new Scan(prefix)
  scan.addFamily(Bytes.toBytes(family))
  val prefix_filter = new PrefixFilter(prefix)
  scan.setFilter(prefix_filter)

  val scanner = table.getScanner(scan)
  for (row <- scanner) {
    val content = row.getNoVersionMap
    for (entry <- content.entrySet) {
      for (sub_entry <- entry.getValue.entrySet) {
        colValueMap += (Bytes.toString(sub_entry.getKey) -> Bytes.toString(sub_entry.getValue))
      }
      keyColValueMap += (Bytes.toString(row.getRow) -> colValueMap)
    }
  }

  // this doesn't execute
  for ((k, v) <- colValueMap) {
    printf("key: %s, value: %s\n", k, v)
  }

  // this never executes since scanner is null (or empty)
  for (result <- scanner) {
    for (cell <- result.rawCells) {
      println("Cell: " + cell + ", Value: " + Bytes.toString(cell.getValueArray, cell.getValueOffset, cell.getValueLength))
    }
  }

  scanner.close
  table.close
  connection.close
}
I've tried two approaches to print/get the data: composing a Map and iterating over the ResultScanner. However, it seems that my filter is not working, since it returns a null/empty result set.
Do you know if there is an alternative way to execute a prefix filter on HBase?
The code I'm using to test the function above is the following:
val user_key = "46948854-20181303144609"
scanTable("hm_notificaciones", "info_oferta", user_key)
The second loop will never be entered, because you have already consumed the scanner in the previous step; a ResultScanner can only be iterated once.
for (result <- scanner) {
  for (cell <- result.rawCells) {
    println("Cell: " + cell + ", Value: " + Bytes.toString(cell.getValueArray, cell.getValueOffset, cell.getValueLength))
  }
}
Use colValueMap (or keyColValueMap) to print instead. It worked for me, so check your prefix filter again: the test key "46948854-20181303144609" is not a prefix of any of the row keys shown by your scan (they all start with 46948854-201803...), which is why the scan returns nothing.
for ((k, v) <- colValueMap) {
  printf("key: %s, value: %s\n", k, v)
}
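If the goal is simply "all rows whose key starts with a given string", here is a minimal sketch (assuming an already open Connection as in the question, and an HBase client version that has Scan.setRowPrefixFilter); the scanByPrefix name and the example prefix are illustrative only:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Scan}
import org.apache.hadoop.hbase.util.Bytes

import scala.collection.JavaConversions._

// Hypothetical helper: prints every row whose key starts with `prefix`.
def scanByPrefix(connection: Connection, tableName: String, family: String, prefix: String): Unit = {
  val table = connection.getTable(TableName.valueOf(tableName))
  val scan = new Scan()
  scan.addFamily(Bytes.toBytes(family))
  // setRowPrefixFilter configures the start/stop rows for the prefix internally,
  // so no explicit PrefixFilter or start row is needed.
  scan.setRowPrefixFilter(Bytes.toBytes(prefix))

  val scanner = table.getScanner(scan)
  try {
    for (result <- scanner; cell <- result.rawCells) {
      println(Bytes.toString(result.getRow) + " -> " +
        Bytes.toString(cell.getValueArray, cell.getValueOffset, cell.getValueLength))
    }
  } finally {
    scanner.close()
    table.close()
  }
}

// e.g. match every record for the user on a given day:
// scanByPrefix(connection, "hm_notificaciones", "info_oferta", "46948854-20180312")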
How can I insert an if condition when setting the key in a Spark map-reduce job?
I want that if the input word starts with an uppercase letter, it is used as a key; otherwise it is not.
(word count example:
sample input - affa Agshs djd Dhh
sample output -
Agshs 1
Dhh 1)
You have to use filter():
sample_input.txt
affa Agshs djd Dhh
small Capital
Firstbig notFirstBig
spark-shell
val data = sc.textFile("sample_input.txt")
val filteredData = data.flatMap(line => line.split(" ")).filter( w => { w.length>0 && Character.isUpperCase(w.charAt(0)) } )
val mapout = filteredData.map(w => (w,1))
mapout.foreach(println)
output:
scala> mapout.foreach(println)
(Agshs,1)
(Firstbig,1)
(Dhh,1)
(Capital,1)
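To finish the word count from the (word, 1) pairs, a reduceByKey step can follow; a minimal sketch continuing from the mapout RDD above (with this sample input every capitalized word appears once, so each count is 1):

val counts = mapout.reduceByKey(_ + _) // sum the 1s per word
counts.collect().foreach { case (word, count) => println(s"$word $count") }
// (output order may vary)
// Agshs 1
// Dhh 1
// Capital 1
// Firstbig 1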
I'm writing a small data access library to help me use Cassandra prepared statements in a Scala program (it's not open source, but maybe one day). What I'd like to do is automatically generate a Java array for the bind statement from the case class:
com.datastax.driver.core.PreparedStatement:
public BoundStatement bind(Object... values);
So currently I have
case class Entity(foo:String, optionalBar:Option[String])
object Entity {
def toJArray(e:Entity) = { Array(e.foo, e.optionalBar.getOrElse(null)) }
}
val e1 = Entity("fred", Option("bill"))
val e2 = Entity("fred", None)
Entity.toJArray(e1)
res5: Array[String] = Array(fred, bill)
Entity.toJArray(e2)
res6: Array[String] = Array(fred, null)
The toJArray method returns an Array I can use in the bind statement. The boilerplate gets worse if there is a date, a double, or a Java enum:
new java.util.Date(createdOn)
scala.Double.box(price)
priceType.name
Is there a way of automatically generating the Array in Scala assuming the bind parameters have the same order as the case class fields?
EDIT: Thanks to @srgfed01.
Here's what I came up with (not complete), which allows me to do something like:
val customer1 = Customer( "email", "name", None, Option(new Date), OrdStatus.New)
session.execute(populate(customer1, insert))
val customer2 = Customer( "email2", "name2", Option(22), Option(new Date), OrdStatus.Rejected)
session.execute(populate(customer2, insert))
using this function
def populate(state:Product, statement:PreparedStatement): BoundStatement = {
def set(bnd:BoundStatement, i:Int, aval:Any): Unit = {
aval match {
case v:Date => bnd.setDate(i, v)
case v:Int => bnd.setInt(i, v)
case v:Long => bnd.setLong(i, v)
case v:Double => bnd.setDouble(i, v)
case v:String => bnd.setString(i, v)
case null => bnd.setToNull(i)
case _ => bnd.setString(i, aval.toString)
}
}
val bnd = statement.bind
for(i <- 0 until state.productArity) {
state.productElement(i) match {
case op: Option[_] => set(bnd, i, op.getOrElse(null))
case v => set(bnd, i, v)
}
}
bnd
}
You can use the productIterator call on your case class object:
case class Entity(foo: String, optionalBar: Option[String])
val e1 = Entity("fred", Option("bill"))
val e2 = Entity("fred", None)
def run(e: Entity): Array[Any] = e.productIterator
.map {
case op: Option[_] => op.getOrElse(null)
case v => v
}
.toArray
println(run(e1).mkString(" ")) // fred bill
println(run(e2).mkString(" ")) // fred null
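To pass the result to bind(Object... values), the elements need to be Java objects (AnyRef), so primitives such as Int or Double must be boxed; here is a minimal sketch of that adaptation (the toBindValue/toBindArray names and the enum handling are illustrative, not part of the driver API):

// Illustrative helpers: turn any case class (Product) into an Array[AnyRef]
// suitable for PreparedStatement.bind(Object... values).
def toBindValue(value: Any): AnyRef = value match {
  case Some(v)     => toBindValue(v)          // unwrap Option and convert the inner value
  case None | null => null
  case v: Enum[_]  => v.name                  // store Java enums by name
  case v           => v.asInstanceOf[AnyRef]  // asInstanceOf boxes Int, Double, etc.
}

def toBindArray(p: Product): Array[AnyRef] =
  p.productIterator.map(toBindValue).toArray

// usage, assuming the Customer case class and the prepared `insert` statement from the question:
// session.execute(insert.bind(toBindArray(customer1): _*))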
In the two SQL queries below, sql1 does not select any rows, and sql2 selects only one row, for 111#k2.com:
var ids="'111#k2.com','222#k2.com','333#k2.com','444#k2.com','555#k2.com','666#k2.com'"
val sql1 = SQL("SELECT id,point,privacy FROM `pointTable` WHERE state=1 and id in ({users})").on("users" -> ids)
sql1().map { row =>
val point = if (row[Boolean]("privacy")) { row[Double]("point").toString } else { "0" }
println(write(Map("id" -> row[String]("id"), "point" -> point)))
}
val sql2 = SQL("SELECT id,point,privacy FROM `pointTable` WHERE state=1 and id in (" + ids + ")")
sql2().map { row =>
val point = if (row[Boolean]("privacy")) { row[Double]("point").toString } else { "0" }
println(write(Map("id" -> row[String]("id"), "point" -> point)))
}
In phpMyAdmin, when I run this query manually, it returns 6 rows, so why is it not working here?
I am using Play Framework 2.2 with Scala 2.1.
That's not going to work. Passing users through on is going to escape the entire string, so it will appear as one value instead of a list. Anorm in Play 2.3 actually allows you to pass lists as parameters, but here you'll have to work around that.
val ids: List[String] = List("111#k2.com", "222#k2.com", "333#k2.com")
val indexedIds: List[(String, Int)] = ids.zipWithIndex
// Create a bunch of parameter tokens for the IN clause.. {id_0}, {id_1}, ..
val tokens: String = indexedIds.map{ case (id, index) => s"{id_${index}}" }.mkString(", ")
// Create the parameter bindings for the tokens
val parameters = indexedIds.map{ case (id, index) => (s"id_${index}" -> toParameterValue(id)) }
val sql1 = SQL(s"SELECT id,point,privacy FROM `pointTable` WHERE state=1 and id in (${tokens})")
.on(parameters: _ *)
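A minimal usage sketch, reusing the row-mapping code from the question (the generated id_0, id_1, ... tokens are bound by the parameters sequence above):

sql1().map { row =>
  val point = if (row[Boolean]("privacy")) row[Double]("point").toString else "0"
  println(write(Map("id" -> row[String]("id"), "point" -> point)))
}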