Loading in the data:
SparkConf sc = new SparkConf().setAppName("TEST").setMaster("local[*]");
JavaSparkContext JSC = new JavaSparkContext(sc);
JavaRDD<String> stringRDDVotes = JSC.textFile("HarryPotter.csv");
I currently have this table loaded into an RDD:
ID | A  | B  | Name
1  | 23 | 50 | Harry;Potter
I want to convert it to the table below:
ID | A  | B  | Name
1  | 23 | 50 | Harry
1  | 23 | 50 | Potter
All the solutions I found use SparkSQL, which I can't use, so how would I get this result using only things like flatMap and mapToPair?
Something like this maybe?
flatMap(s -> Arrays.asList(s.split(";")).iterator())
The code above produces this:
ID     | A  | B  | Name
1      | 23 | 50 | Harry
Potter |    |    |
I know that in Scala it can be done like this, but I don't know how to do it with Java:
val input: RDD[String] = sc.parallelize(Seq("1,23,50,Harry;Potter"))
val csv: RDD[Array[String]] = input.map(_.split(','))
val result = csv.flatMap { case Array(s1, s2, s3, s4) => s4.split(";").map(part => (s1, s2, s3, part)) }
The first part is quite simple to convert from Scala to Java: you only need to use map to split each line by comma, which gives a JavaRDD<String[]>. Then, using flatMap, split the last element of each array (the one corresponding to Name), and with Java streams transform each name into a new row.
Here is a complete example:
JavaRDD<String> input = JSC.parallelize(
        Arrays.asList("1,23,50,Harry;Potter", "2,24,60,Hermione;Granger")
);

JavaRDD<String[]> result = input.map(line -> line.split(","))
        .flatMap(r -> {
            // split the Name column and emit one row per name
            List<String> names = Arrays.asList(r[3].split(";"));
            String[][] values = names.stream()
                    .map(name -> new String[]{r[0], r[1], r[2], name})
                    .toArray(String[][]::new);
            return Arrays.asList(values).iterator();
        });
// print the result RDD
for (String[] line : result.collect()) {
System.out.println(Arrays.toString(line));
}
// [1, 23, 50, Harry]
// [1, 23, 50, Potter]
// [2, 24, 60, Hermione]
// [2, 24, 60, Granger]
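The question also mentions mapToPair. If you ultimately need a JavaPairRDD keyed by one of the columns, you can chain a mapToPair step onto the flattened rows. A minimal sketch building on the result RDD above, assuming you want ID (r[0]) as the key (requires scala.Tuple2 and org.apache.spark.api.java.JavaPairRDD):

JavaPairRDD<String, String[]> byId = result
        .mapToPair(r -> new Tuple2<>(r[0], r)); // key each flattened row by its ID column
// byId now holds ("1", [1, 23, 50, Harry]), ("1", [1, 23, 50, Potter]), ...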
I have the following sample data in a .txt file:
111, Sybil, 21
112, Edith, 22
113, Mathew, 30
114, Mary, 25
The required output is:
[{"number":"111","name":"Sybil","age":"21"},
 {"number":"112","name":"Edith","age":"22"},
 {"number":"113","name":"Mathew","age":"30"},
 {"number":"114","name":"Mary","age":"25"}]
Sadly, I have not gotten far because I can't seem to get the values out of each line. Instead, this is what is displayed:
[one, two, three]
private void loadFile() throws FileNotFoundException, IOException {
    File txt = new File("Users.txt");
    try (Scanner scan = new Scanner(txt)) {
        ArrayList data = new ArrayList<>();
        while (scan.hasNextLine()) {
            data.add(scan.nextLine());
            System.out.print(scan.nextLine());
        }
        System.out.print(data);
    }
}
I would appreciate any help. Thank you.
Not too sure about the requirements. If you just need to know how to get the values out, then use String.split() combined with Scanner.nextLine().
Code below:
private void loadFile() throws FileNotFoundException, IOException {
    File txt = new File("Users.txt");
    try (Scanner scan = new Scanner(txt)) {
        ArrayList<String> data = new ArrayList<>();
        while (scan.hasNextLine()) {
            // split the line by ", " into at most 3 parts
            String[] input = scan.nextLine().split(", ", 3);
            data.add(input[0]);
            data.add(input[1]);
            data.add(input[2]);
        }
        System.out.print(data);
    }
}
The output would be as below and you can further modify it yourself:
[111, Sybil, 21, 112, Edith, 22, 113, Mathew, 30, 114, Mary, 25]
However, if you need the required format as well, the closest I can get is by using a HashMap and putting each one into the ArrayList.
Code below:
private void loadFile() throws FileNotFoundException, IOException {
    File txt = new File("Users.txt");
    try (Scanner scan = new Scanner(txt)) {
        ArrayList<HashMap<String, String>> data = new ArrayList<>();
        while (scan.hasNextLine()) {
            // create a HashMap to store one line's fields in the required format
            HashMap<String, String> info = new HashMap<>();
            String[] input = scan.nextLine().split(", ", 3);
            info.put("number", input[0]);
            info.put("name", input[1]);
            info.put("age", input[2]);
            // put it inside the ArrayList
            data.add(info);
        }
        System.out.print(data);
    }
}
And the output would be:
[{number=111, name=Sybil, age=21}, {number=112, name=Edith, age=22}, {number=113, name=Mathew, age=30}, {number=114, name=Mary, age=25}]
Hope this answer helps you well.
Currently, you're skipping lines. A quote from the Scanner::nextLine documentation:
This method returns the rest of the current line, excluding any line separator at the end. The position is set to the beginning of the next line.
So you're adding one line to your list, and writing the next one to the console.
To get the data from each line, you can use the String::split method, which supports RegEx.
Example:
"line of my file".split(" ")
We can use streams to write some compact code.
First we define a record to hold our data.
Files.lines reads your file lazily, producing a stream of strings, one per line.
We call Stream#map to produce another stream, a series of string arrays. Each array has three elements, the three fields within each line.
We call map again, this time to produce a stream of Person objects. We construct each Person object by parsing each line's three fields and passing them to the constructor.
We call Stream#toList to collect those person objects into a list.
We call List#toString to generate text representing the contents of the list of person objects.
record Person(int id, String name, int age) {}

String output =
        Files
                .lines(Path.of("/path/to/Users.txt")) // Path.of requires Java 11+
                .map(line -> line.split(", "))
                .map(parts -> new Person(
                        Integer.parseInt(parts[0]),
                        parts[1],
                        Integer.parseInt(parts[2])))
                .toList()
                .toString();
If the format of the default Person#toString method does not suit you, add an override of that method to produce your desired output.
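For instance, here is a brief sketch of such an override, approximating the JSON-style output requested above (the numeric fields are quoted as strings to mirror the question's example):

record Person(int id, String name, int age) {
    @Override
    public String toString() {
        return String.format(
                "{\"number\":\"%d\",\"name\":\"%s\",\"age\":\"%d\"}",
                id, name, age);
    }
}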
I have a CSV file that holds country names and the years they won the Eurovision:
country, year
Israel, 1998
Sweden, 2012
Sweden, 2015
United Kingdom, 1997
and my CSV-reading code (using tototoshi):
object CountryEurovision {

  def countrEurovisionYearFile: File = new File("conf/countryEurovision.csv")

  lazy val countrEurovisionYearMap: Map[String, String] = getConvertData

  private def getConvertData: Map[String, String] = {
    implicit object CodesFormat extends CSVFormat {
      val delimiter: Char = ','
      val quoteChar: Char = '"'
      val escapeChar: Char = '"'
      val lineTerminator: String = "\r\n"
      val quoting: Quoting = QUOTE_NONNUMERIC
      val treatEmptyLineAsNil: Boolean = false
    }

    val csvDataReader = CSVReader.open(countrEurovisionYearFile, "UTF-8")(CodesFormat)
    val linesIterator = csvDataReader.iteratorWithHeaders
    val convertedData = linesIterator.map {
      row => row("Country") -> row("Year")
    }.toMap
    csvDataReader.close()
    convertedData
  }
}
Now, since country is not unique (a country can have several years in which it won), when I get Sweden:
CountryEurovision.countrEurovisionYearMap.get("Sweden")
I only get: res0: Option[String] = Some(2015)
whereas I would expect a list of years per country. Even a country with just one winning year should give a list, and for Sweden I would get a list containing 2012 and 2015.
How can I change my setup for that behavior?
When you transform linesIterator.map { row => row("Country") -> row("Year") } into a Map with .toMap, only the last entry is kept for each duplicated key, since each new occurrence overrides the previous one.
You can change this by grouping the values (years) per key (country) before applying toMap, so that each key maps to a List:
linesIterator
.map { row => row("Country") -> row("Year") } // List(("Sweden", 1997), ("France", 2008), ("Sweden", 2017))
.groupBy(_._1) // Map(France -> List((France,2008)), Sweden -> List((Sweden,1997), (Sweden,2017)))
.mapValues(_.map(_._2)) // Map(France -> List(2008), Sweden -> List(1997, 2017))
.toMap
which produces:
Map(France -> List(2008), Sweden -> List(1997, 2017))
This way, .get("Sweden") will return Some(List(1997, 2017)).
Let's say I have a file with lines of strings that I import into a JavaRDD. If I want to sort the strings and export them as a new file, how should I do it? The code below is my attempt, and it is not working:
JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "Spark Sort");
Configuration hadoopConfig = sparkContext.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.imp", DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl", LocalFileSystem.class.getName());
JavaRDD<String> lines = sparkContext.textFile(args[0]);
JavaRDD<String> sorted = lines.sortBy(i->i, true,1);
sorted.saveAsTextFile(args[1]);
What I mean by "not working" is that the output file is not sorted. I think the issue is with my "i -> i" code; I am not sure how to make it sort with the compare method of strings, as each "i" will be a string (I'm also not sure how to make it compare between different "i"s).
EDIT
I have modified the code as per the comments; I suspect the file was being read as one giant string.
JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "Spark Sort");
Configuration hadoopConfig = sparkContext.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.imp", DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl", LocalFileSystem.class.getName());
long start = System.currentTimeMillis();
List<String> array = buildArrayList(args[0]);
JavaRDD<String> lines = sparkContext.parallelize(array);
JavaRDD<String> sorted = lines.sortBy(i->i, true, 1);
sorted.saveAsTextFile(args[1]);
Still not sorting it :(
I did a little research; your code is correct. Here are the samples I tested:
Spark initialization
SparkSession spark = SparkSession.builder().appName("test")
        .config("spark.debug.maxToStringFields", 10000)
        .config("spark.sql.tungsten.enabled", true)
        .enableHiveSupport().getOrCreate();
JavaSparkContext jSpark = new JavaSparkContext(spark.sparkContext());
Example for RDD
//RDD
JavaRDD<String> rdd = jSpark.parallelize(Arrays.asList("z", "b", "c", "a"));
JavaRDD<String> sorted = rdd.sortBy(i -> i, true, 1);
List<String> result = sorted.collect();
result.stream().forEach(i -> System.out.println(i));
The output is
a
b
c
z
You can also use the Dataset API:
//Dataset
Dataset<String> stringDataset = spark.createDataset(Arrays.asList("z", "b", "c", "a"), Encoders.STRING());
Dataset<String> sortedDataset = stringDataset.sort(stringDataset.col(stringDataset.columns()[0]).desc()); // by default the sort order is ascending
result = sortedDataset.collectAsList();
result.stream().forEach(i -> System.out.println(i));
The output is
z
c
b
a
Your problem, I think, is that your text file has a specific line separator. If so, you can use the flatMap function to split your giant text string into individual line strings.
Here is an example with a Dataset:
//flatMap example
Dataset<String> singleLineDS= spark.createDataset(Arrays.asList("z:%b:%c:%a"), Encoders.STRING());
Dataset<String> splitedDS = singleLineDS.flatMap(i->Arrays.asList(i.split(":%")).iterator(),Encoders.STRING());
Dataset<String> sortedSplitedDs = splitedDS.sort(splitedDS.col(splitedDS.columns()[0]).desc());
result = sortedSplitedDs.collectAsList();
result.stream().forEach(i -> System.out.println(i));
So you should find which separator is used in your text file and adapt the code above for your task.
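The same split also works on the asker's JavaRDD directly. A minimal sketch, reusing the ":%" separator from the Dataset example above (substitute whatever separator your file actually uses):

JavaRDD<String> giant = jSpark.parallelize(Arrays.asList("z:%b:%c:%a"));
JavaRDD<String> lines = giant
        .flatMap(s -> Arrays.asList(s.split(":%")).iterator()); // one element per line
JavaRDD<String> sorted = lines.sortBy(i -> i, true, 1);
// sorted.collect() -> [a, b, c, z]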
How do I insert an if condition when setting a key in a Spark map-reduce?
I want that if an input word starts with an uppercase letter, it is set as a key; otherwise not
(word count example
sample input - affa Agshs djd Dhh
sample output -
Agshs 1
Dhh 1)
You have to use filter().
sample_input.txt
affa Agshs djd Dhh
small Capital
Firstbig notFirstBig
spark-shell
val data = sc.textFile("sample_input.txt")
val filteredData = data.flatMap(line => line.split(" ")).filter( w => { w.length>0 && Character.isUpperCase(w.charAt(0)) } )
val mapout = filteredData.map(w => (w,1))
mapout.foreach(println)
output:
scala> mapout.foreach(println)
(Agshs,1)
(Firstbig,1)
(Dhh,1)
(Capital,1)
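If you are writing this in Java rather than the spark-shell, a minimal equivalent sketch (assuming a JavaSparkContext named jsc; requires scala.Tuple2):

JavaRDD<String> data = jsc.textFile("sample_input.txt");
JavaPairRDD<String, Integer> mapout = data
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .filter(w -> !w.isEmpty() && Character.isUpperCase(w.charAt(0)))
        .mapToPair(w -> new Tuple2<>(w, 1));
mapout.collect().forEach(System.out::println); // (Agshs,1), (Dhh,1), ...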
Recently one of my data servers went down and a large number of video files were damaged (over 15,000 files, or more than 60TB). I wrote a script to check all files and put the results in a very big log.txt file (almost 8GB).
I wrote code to find all lines starting with "Input #0" and lines which contain "damaged", then added their line numbers to ArrayLists. Next, I need to compare those two ArrayLists and, for each number in list2, find the closest smaller line number in list1, so I can get the file names back from the log file.
For example:
if list1 contains numbers {1, 5, 45, 55, 100, 2000... etc}
and list2 contains numbers {50, 51, 53, 2010... etc} the result should be {45, 2000... etc}
This is my current code:
import java.io.*;
import java.util.*;

public class Log {
    public static void main(String[] args) throws IOException {
        ArrayList<Integer> list1 = new ArrayList<Integer>();
        ArrayList<Integer> list2 = new ArrayList<Integer>();
        File file = new File("C:\\log.txt");
        try {
            Scanner scanner = new Scanner(file);
            Scanner scanner2 = new Scanner(file);
            int lineNum = 0;
            int lineNum2 = 0;
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                String line2 = scanner.nextLine();
                lineNum++;
                lineNum2++;
                if (line.startsWith("Input #0")) {
                    list1.add(lineNum);
                }
                if (line2.contains("damaged")) {
                    list2.add(lineNum2);
                }
            }
This is what I'm getting from the code above:
list1 [5, 262, 304, 488, 523, 1189, 1796, 2503, 2722, 4052, 4201, 4230, 4298, 4312, 4559, 4887, 4903, 5067....]
list2 [1838, 1841, 1842, 1844, 1851, 1861, 1865, 1866, 1868, 1875, 1878, 1879, 1880, 1881, 1886, 1887, 1891....]
Some log data:
Input #0, mpegvideo, from '/cinegy/cinegy/VIDEO/BSF/BLOK 3 - 14. NOVHighb668ca7d201411141051110636.m2v':
.
.
.
.
.
.
Data with damage:
Input #0, mpegvideo, from '/cinegy/cinegy/VIDEO/BSF/BLOK 3 - 14. NOVHighb668ca7d201411141051110636.m2v':
.
.
.
.
.
[error 0x090010] file damaged at 16 09
[error 0x090010] file damaged at 19 15
The log for each individual file does not contain any pattern except for the first 5-6 lines or so. Both damaged and non-damaged files contain info written in 20 to 100+ lines.
So, from these numbers the first result should be number 1796.
I'm pretty much a novice in Java and I need help.
Here's a small piece of code that will do the work. I don't know if you want redundant values in the result, so I saved them both in a list and in a set; choose the one you prefer:
public static void main(String[] args) {
    int[] list1 = {5, 262, 304, 488, 523, 1189, 1796, 2503, 2722, 4052, 4201, 4230, 4298, 4312, 4559};
    int[] list2 = {1838, 1841, 1842, 1844, 1851, 1861, 1865, 1866, 1868, 1875, 1878, 1879, 1880, 1881};
    ArrayList<Integer> resultList = new ArrayList<Integer>();
    Set<Integer> resultSet = new HashSet<Integer>();
    int j = 0;
    for (int i = 0; i < list2.length; i++) {
        // advance j until list1[j] passes list2[i]; both arrays must be sorted
        for (; j < list1.length; j++) {
            if (list1[j] > list2[i])
                break;
        }
        resultList.add(list1[j - 1]);
        resultSet.add(list1[j - 1]);
    }
    System.out.println(resultList);
    System.out.println(resultSet);
}
Output:
[1796, 1796, 1796, 1796, 1796, 1796, 1796, 1796, 1796, 1796, 1796, 1796, 1796, 1796]
[1796]
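Since both lists are sorted (line numbers increase through the file), the inner scan could also be replaced by a binary search. A sketch using Collections.binarySearch on the asker's ArrayList<Integer> list1, with the same semantics (for each target from list2, take the largest value in list1 below it):

int idx = Collections.binarySearch(list1, target);
// when the key is absent, binarySearch returns -(insertionPoint) - 1
int insertionPoint = idx >= 0 ? idx : -idx - 1;
if (insertionPoint > 0) {
    resultSet.add(list1.get(insertionPoint - 1)); // largest line number below target
}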
You defined two scanners (which seems unnecessary), but you are only using one of them and calling nextLine() twice on it. It looks like that is not intended, and as a consequence the results you are getting are erroneous. It would be very helpful if you could post a sample excerpt from your log file (you can filter out the sensitive data) so that we can determine the best approach for this.
I think you should scrap your current approach because it does not seem like an efficient way to solve your problem of needing to find filenames of damaged files.
Depending on how your data looks, you can use regular expressions and possibly even extract the filenames directly into a Set.
Edit: Added some rough code that should do the job for you if you are indeed correct that each file's entry starts with "Input #0". As long as there is a pattern in the log data for each file, you should be able to extract the data you need directly instead of going through the mess of matching entries from two separate ArrayLists.
public static void main(String[] args) throws FileNotFoundException {
    Set<String> damagedFiles = new LinkedHashSet<String>();
    File file = new File("C:\\log.txt");
    Scanner scanner = new Scanner(file);
    String filename = null;
    try {
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (line.startsWith("Input #0")) {
                /* if desired, a regex lookahead could extract only the path and
                   filename instead of keeping the entire Input #0 line */
                filename = line;
            }
            if (line.contains("damaged") && filename != null) {
                damagedFiles.add(filename);
            }
        }
    } finally {
        scanner.close();
        for (String s : damagedFiles) {
            System.out.println(s);
        }
    }
}
This is the result I got when running this code on a sample log file where I named the damaged files dmg#.m2v
Input #0, mpegvideo, from '/cinegy/cinegy/VIDEO/BSF/BLOK 3 - 14. dmg1.m2v':
Input #0, mpegvideo, from '/cinegy/cinegy/VIDEO/BSF/BLOK 3 - 14. dmg2.m2v':
Input #0, mpegvideo, from '/cinegy/cinegy/VIDEO/BSF/BLOK 3 - 14. dmg3.m2v':
Input #0, mpegvideo, from '/cinegy/cinegy/VIDEO/BSF/BLOK 3 - 14. dmg4.m2v':