How to Convert .csv file to RDD<Vector>? - java

I have a CSV file with 9000+ records and the following columns:
id,Category1,Category2
How do I convert this CSV file to an RDD<Vector> so that I can find similar columns using the columnSimilarities method of Apache Spark's RowMatrix in Java?
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#RowMatrix-org.apache.spark.rdd.RDD-

You can try this (in Scala):
sparkSession.read.csv(myCsvFilePath) // you should have a DataFrame here
  .map((r: Row) => Vectors.dense(r.getInt(0), r.getInt(1), r.getInt(2))) // you should have a Dataset of Vector
  .rdd // you have your RDD[Vector]
Feel free to reach out if it doesn't work.
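Since the question asks for Java, a rough equivalent could look like the sketch below. This is only a sketch: the header/inferSchema options, the column positions, and the class/method names are my assumptions from the question's CSV layout, not something confirmed by the answer above.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToVectors {
    // Sketch: assumes the CSV layout id,Category1,Category2 from the question.
    public static JavaRDD<Vector> load(SparkSession sparkSession, String myCsvFilePath) {
        return sparkSession.read()
                .option("header", "true")       // first line holds the column names
                .option("inferSchema", "true")  // so the numeric columns come back as numbers
                .csv(myCsvFilePath)
                .toJavaRDD()
                .map((Row row) -> Vectors.dense(
                        ((Number) row.get(1)).doubleValue(),   // Category1
                        ((Number) row.get(2)).doubleValue())); // Category2
        // RowMatrix takes an RDD<Vector>, which you can get with .rdd() on the result
    }
}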

As I read it, a Vector can hold an ID and a double[] for the values.
You need to fill the Vector yourself:
List<String> lines = Files.readAllLines(Paths.get("myfile.csv"), Charset.defaultCharset());
Then you can iterate over the lines, create a Vector for each line, fill it with the parsed values, and add them to the RDD.
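A minimal end-to-end sketch of that approach in Java might look like this (the file name, the use of local mode, and parsing only the two category columns are assumptions based on the question, not tested code):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

public class ColumnSimilaritiesExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "csv-to-rdd-vector");

        JavaRDD<String> lines = sc.textFile("myfile.csv");   // hypothetical path
        String header = lines.first();                       // "id,Category1,Category2"

        JavaRDD<Vector> vectors = lines
                .filter(line -> !line.equals(header))        // skip the header row
                .map(line -> {
                    String[] cols = line.split(",");
                    // parse Category1 and Category2 as doubles; the id is not part of the vector
                    return Vectors.dense(Double.parseDouble(cols[1]),
                                         Double.parseDouble(cols[2]));
                });

        RowMatrix matrix = new RowMatrix(vectors.rdd());      // RowMatrix wants an RDD<Vector>
        long similarityCount = matrix.columnSimilarities().entries().count();
        System.out.println("non-zero similarity entries: " + similarityCount);

        sc.stop();
    }
}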

Related

Spark : How to merge the transformations

I have 1000 JSON files and I need to apply some transformations to each file, then create a single merged output file. The merge can involve overlapping operations on values (for example, the output should not contain repeated values).
If I read the files with wholeTextFiles as <filename, content> pairs and then, in the map function, parse the content as a JSON tree and perform the transformation, where and how do I merge the output?
Do I need another transformation on the resulting RDDs to merge the values, and how would that work? Can I have a shared object (a List, a Map, or an RDD?) across all map tasks, updated as part of the transformation, so that I can check for repeated values there?
P.S.: Even if the output is split into part files, I would still like to have no repetitions.
Code:
// read the files as a JavaPairRDD, which gives <filename, content> pairs
String filename = "/sample_jsons";
JavaPairRDD<String, String> distFile = sc.wholeTextFiles(filename);

// then create a JavaRDD from the content
JavaRDD<String> jsonContent = distFile.map(x -> x._2);

// apply transformations; the map function returns an ArrayList of property names
JavaRDD<ArrayList<String>> apm = jsonContent.map(
    new Function<String, ArrayList<String>>() {
        @Override
        public ArrayList<String> call(String arg0) throws Exception {
            JsonNode rootNode = mapper.readTree(arg0);
            return parseJsonAndFindKey(rootNode, "type", "rootParent");
        }
    });
This way I am able to get all first-level properties from each JSON file in an ArrayList.
Now I need a final ArrayList that is the union of all these ArrayLists, with duplicates removed. How can I achieve that?
Why do you need 1000 RDDs for 1000 json files?
Do you see any issue with merging the 1000 json files in the input stage into one RDD?
If you use a single RDD from the input stage, it shouldn't be hard to perform all the needed operations on it. For example:
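Here is a sketch of what that could look like, reusing the question's sc, mapper and parseJsonAndFindKey, and assuming a Spark 2.x flatMap (whose function returns an Iterator). It illustrates the single-RDD idea rather than being tested code:
JavaRDD<String> mergedKeys = sc.wholeTextFiles("/sample_jsons")
        .map(pair -> pair._2)                      // keep only the file content
        .flatMap(content -> {
            JsonNode rootNode = mapper.readTree(content);
            // flatten each file's ArrayList of property names into the RDD
            return parseJsonAndFindKey(rootNode, "type", "rootParent").iterator();
        })
        .distinct();                               // removes duplicates across all 1000 files

List<String> finalKeys = mergedKeys.collect();     // the union, with no repetitions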

Univocity parser - Handling lines with weird constructs

I am trying to figure out the best way to use the Univocity parser to handle a CSV log file with lines that look like the one below:
"23.62.3.74",80,"testUserName",147653,"Log Collection Device 100","31/02/15 00:05:10 GMT",-1,"10.37.255.3","TCP","destination_ip=192.62.3.74|product_id=0071|option1_type=(s-dns)|proxy_machine_ip=10.1.255.3"
As you can see this is a comma-delimited file, but the last column holds a bunch of values prefixed with their field names. My requirement is to capture values from the normal fields and, selectively, from this last big field.
I know about the master-detail row processor in Univocity, but I doubt this fits into that category. Could you point me in the right direction, please?
Note: I can handle the name-prefixed fields in rowProcessed(String[] row, ParsingContext context) if I implement a row processor, but I am looking for something native to Univocity if possible.
Thanks,
R
There's nothing native in the parser for that. Probably the easiest way to go about it is to have your RowProcessor as you mentioned.
One thing you can do to make your life easier is to use another instance of CsvParser to parse that last column:
//initialize a parser for the pipe separated bit
CsvParserSettings detailSettings = new CsvParserSettings();
detailSettings.getFormat().setDelimiter('=');
detailSettings.getFormat().setLineSeparator("|");
CsvParser detailParser = new CsvParser(detailSettings);
//here is the content of the last column (assuming you got it from the parser)
String details = "destination_ip=192.62.3.74|product_id=0071|option1_type=(s-dns)|proxy_machine_ip=10.1.255.3";
//The result will be a list of pairs
List<String[]> pairs = detailParser.parseAll(new StringReader(details));
//You can add the pairs to a map
Map<String, String> map = new HashMap<String, String>();
for (String[] pair : pairs) {
map.put(pair[0], pair[1]);
}
//this should print: {destination_ip=192.62.3.74, product_id=0071, proxy_machine_ip=10.1.255.3, option1_type=(s-dns)}
System.out.println(map);
That won't be extremely fast but at least it's easy to work with a map if that input can have random column names and values associated with them.
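For completeness, here is a minimal sketch of the RowProcessor route combined with the detail parser above. The class and field names are made up, and only the univocity calls already shown in this answer are used:
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.RowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class LogRowProcessor implements RowProcessor {

    private final CsvParser detailParser;

    public LogRowProcessor() {
        CsvParserSettings detailSettings = new CsvParserSettings();
        detailSettings.getFormat().setDelimiter('=');
        detailSettings.getFormat().setLineSeparator("|");
        this.detailParser = new CsvParser(detailSettings);
    }

    @Override
    public void processStarted(ParsingContext context) {
    }

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        // the last column holds the "name=value|name=value" details
        String details = row[row.length - 1];
        Map<String, String> detailMap = new HashMap<String, String>();
        for (String[] pair : detailParser.parseAll(new StringReader(details))) {
            detailMap.put(pair[0], pair[1]);
        }
        // pick whatever you need, e.g. detailMap.get("destination_ip"),
        // together with the normal columns in row[0..n]
    }

    @Override
    public void processEnded(ParsingContext context) {
    }
}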

Parsing a CSV file for a unique row using the new Java 8 Streams API

I am trying to use the new Java 8 Streams API (with which I am a complete newbie) to find a particular row (the one with 'Neda' in the name column) in a CSV file. Using the following article for motivation, I modified it and fixed some errors so that I could parse a file containing 3 columns: 'name', 'age' and 'height'.
name,age,height
Marianne,12,61
Julie,13,73
Neda,14,66
Julia,15,62
Maryam,18,70
The parsing code is as follows:
@Override
public void init() throws Exception {
    Map<String, String> params = getParameters().getNamed();
    if (params.containsKey("csvfile")) {
        Path path = Paths.get(params.get("csvfile"));
        if (Files.exists(path)) {
            // use the new Java 8 Streams API to read the CSV column headings
            Stream<String> lines = Files.lines(path);
            List<String> columns = lines
                .findFirst()
                .map((line) -> Arrays.asList(line.split(",")))
                .get();
            columns.forEach((l) -> System.out.println(l));

            // find the relevant sections from the CSV file;
            // we are only interested in the row with Neda's name,
            // so we need to know the index positions of the columns
            int nameIndex = columns.indexOf("name");
            int ageIndex = columns.indexOf("age");
            int heightIndex = columns.indexOf("height");

            // have to re-read the CSV file to extract the values
            lines = Files.lines(path);
            List<List<String>> values = lines
                .skip(1)
                .map((line) -> Arrays.asList(line.split(",")))
                .collect(Collectors.toList());
            values.forEach((l) -> System.out.println(l));
        }
    }
}
Is there any way to avoid re-reading the file after extracting the header line? Although this is a very small example file, I will be applying this logic to a large CSV file.
Is there a technique to use the Streams API to create a map between the extracted column names (from the first scan of the file) and the values in the remaining rows?
How can I return just one row in the form of a List<String> (instead of a List<List<String>> containing all the rows)? I would prefer to find the row as a mapping between the column names and their corresponding values (a bit like a result set in JDBC). I see a Collectors.mapMerger function that might be helpful here, but I have no idea how to use it.
Use a BufferedReader explicitly:
List<String> columns;
List<List<String>> values;
try (BufferedReader br = Files.newBufferedReader(path)) {
    String firstLine = br.readLine();
    if (firstLine == null) throw new IOException("empty file");
    columns = Arrays.asList(firstLine.split(","));
    values = br.lines()
        .map(line -> Arrays.asList(line.split(",")))
        .collect(Collectors.toList());
}
Files.lines(…) also resorts to BufferedReader.lines(…). The only difference is that Files.lines will configure the stream so that closing the stream will close the reader, which we don’t need here, as the explicit try(…) statement already ensures the closing of the BufferedReader.
Note that there is no guarantee about the state of the reader after the stream returned by lines() has been processed, but we can safely read lines before performing the stream operation.
First, your concern that this code reads the file twice is unfounded. Actually, Files.lines returns a Stream of the lines that is lazily populated, so the first part of the code only reads the first line and the second part reads the rest (it does read the first line a second time, though, even if it is ignored). Quoting its documentation:
Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.
Onto your second concern about returning just a single row. In functional programming, what you are trying to do is called filtering. The Stream API provides such a method with the help of Stream.filter. This method takes a Predicate as argument, which is a function that returns true for all the items that should be kept, and false otherwise.
In this case, we want a Predicate that would return true when the name is equal to "Neda". This could be written as the lambda expression s -> s.equals("Neda").
So in the second part of your code, you could have:
lines = Files.lines(path);
List<List<String>> values = lines
.skip(1)
.map(line -> Arrays.asList(line.split(",")))
.filter(list -> list.get(0).equals("Neda")) // keep only items where the name is "Neda"
.collect(Collectors.toList());
Note however that this does not ensure that there is only a single item where the name is "Neda", it collects all possible items into a List<List<String>>. You could add some logic to find the first item or throw an exception if no items are found, depending on your business requirement.
Note also that calling Files.lines(path) twice can be avoided by using a BufferedReader directly, as in @Holger's answer.
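Combining the two answers, a minimal sketch that reads the file once, returns a single row, and maps it to the column names could look like the following. The file name is hypothetical, and the split-on-comma approach assumes no quoted fields or embedded commas:
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FindNeda {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("people.csv");    // hypothetical file name

        try (BufferedReader br = Files.newBufferedReader(path)) {
            String headerLine = br.readLine();
            if (headerLine == null) throw new IOException("empty file");
            List<String> columns = Arrays.asList(headerLine.split(","));
            int nameIndex = columns.indexOf("name");

            // single pass over the remaining lines: keep the first row whose
            // name column equals "Neda", as a List<String>
            List<String> nedaRow = br.lines()
                    .map(line -> Arrays.asList(line.split(",")))
                    .filter(row -> row.get(nameIndex).equals("Neda"))
                    .findFirst()
                    .orElseThrow(() -> new IllegalStateException("no row for Neda"));

            // zip the header with the row to get the JDBC-result-set-like view
            Map<String, String> rowAsMap = new LinkedHashMap<>();
            for (int i = 0; i < columns.size(); i++) {
                rowAsMap.put(columns.get(i), nedaRow.get(i));
            }
            System.out.println(rowAsMap);        // {name=Neda, age=14, height=66}
        }
    }
}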
Using a CSV-processing library
Other Answers are good. But I recommend using a CSV-processing library to read your input files. As others noted, the CSV format is not as simple as it may seem. To begin with, the values may or may not be nested in quote-marks. And there are many variations of CSV, such as those used in Postgres, MySQL, Mongo, Microsoft Excel, and so on.
The Java ecosystem offers several such libraries. I use Apache Commons CSV.
The Apache Commons CSV library does not make use of streams, but you have no need for streams when a library does the scut work. The library makes easy work of looping over the rows of the file without loading a large file into memory.
create a map between the extracted column names (in the first scan of the file) to the values in the remaining rows?
Apache Commons CSV does this automatically when you call withHeader.
return just one row in the form of List
Yes, easy to do.
As you requested, we can fill a List with the 3 field values of one particular row. This List acts as a tuple.
List<String> tuple = List.of(); // Our goal is to fill this list of values from a single row. Initialize to an empty nonmodifiable list.
We specify the format we expect of our input file: standard CSV (RFC 4180), with the first row populated by column names.
CSVFormat format = CSVFormat.RFC4180.withHeader() ;
We specify the file path where to find our input file.
Path path = Path.of("/Users/basilbourque/people.csv");
We use try-with-resources syntax (see Tutorial) to automatically close our parser.
As we read each row, we check for the name being Neda. If found, we fill our tuple List with that row's field values and interrupt the looping. We use List.of to conveniently return a List object of some unknown concrete class that is unmodifiable, meaning you cannot add or remove elements.
try (
CSVParser parser =CSVParser.parse( path , StandardCharsets.UTF_8, format ) ;
)
{
for ( CSVRecord record : parser )
{
if ( record.get( "name" ).equals( "Neda" ) )
{
tuple = List.of( record.get( "name" ) , record.get( "age" ) , record.get( "height" ) );
break ;
}
}
}
catch ( FileNotFoundException e )
{
e.printStackTrace();
}
catch ( IOException e )
{
e.printStackTrace();
}
If we found success, we should see some items in our List.
if ( tuple.isEmpty() )
{
System.out.println( "Bummer. Failed to report a row for `Neda` name." );
} else
{
System.out.println( "Success. Found this row for name of `Neda`:" );
System.out.println( tuple.toString() );
}
When run:
Success. Found this row for name of Neda:
[Neda, 14, 66]
Instead of using a List as a tuple, I suggest you define a Person class to represent this data with proper data types. Our code here would return a Person instance rather than a List<String>.
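A hypothetical Person class along those lines might look like this (the field types are guesses from the sample data, and the constructor taking a CSVRecord is just one convenient option):
import org.apache.commons.csv.CSVRecord;

public class Person {
    private final String name;
    private final int age;
    private final int height;

    public Person(CSVRecord record) {
        this.name = record.get("name");
        this.age = Integer.parseInt(record.get("age"));
        this.height = Integer.parseInt(record.get("height"));
    }

    @Override
    public String toString() {
        return "Person{name=" + name + ", age=" + age + ", height=" + height + "}";
    }
}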
I know I'm responding late, but maybe it will help someone in the future.
I've made a CSV parser/writer that is easy to use thanks to its builder pattern.
For your case, you can filter the lines you want to parse using
csvLineFilter(Predicate<String>)
Hope you find it handy; here is the source code:
https://github.com/i7paradise/CsvUtils-Java8/
I've included a main class Demo.java to show how it works.

How can I efficiently read multiple json files into a Dataframe or JavaRDD?

I can use the following code to read a single JSON file, but I need to read multiple JSON files and merge them into one DataFrame. How can I do this?
DataFrame jsondf = sqlContext.read().json("/home/spark/articles/article.json");
Or is there a way to read multiple JSON files into a JavaRDD and then convert it to a DataFrame?
To read multiple inputs in Spark, use wildcards. That's going to be true whether you're constructing a dataframe or an rdd.
context.read().json("/home/spark/articles/*.json")
// or getting json out of s3
context.read().json("s3n://bucket/articles/201510*/*.json")
You can use exactly the same code to read multiple JSON files. Just pass a path-to-a-directory / path-with-wildcards instead of path to a single file.
DataFrameReader also provides a json method with the following signature:
json(jsonRDD: JavaRDD[String])
which can be used to parse JSON already loaded into JavaRDD.
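As a sketch of that second route, assuming line-delimited JSON (one complete JSON object per line) and the Spark 1.x SQLContext used in the question:
// read every matching file into a single JavaRDD<String>, one JSON object per line,
// then hand it to the DataFrameReader
JavaRDD<String> jsonLines = sc.textFile("/home/spark/articles/*.json");
DataFrame articles = sqlContext.read().json(jsonLines);
articles.printSchema();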
The function spark.read.json accepts a list of files as a parameter.
spark.read.json(list_of_json_files)
This will read all the files in the list and return a single data frame containing the information from all of them.
Using pyspark, if you have all the JSON files in the same folder, you can use df = spark.read.json('folder_path'). This will load all the JSON files inside the folder.
For reading performance, I recommend providing the schema to the DataFrame reader:
import pyspark.sql.types as T
billing_schema = T.StructType([
T.StructField('accountId', T.LongType(),True),
T.StructField('accountName',T.StringType(),True),
T.StructField('accountOwnerEmail',T.StringType(),True),
T.StructField('additionalInfo',T.StringType(),True),
T.StructField('chargesBilledSeparately',T.BooleanType(),True),
T.StructField('consumedQuantity',T.DoubleType(),True),
T.StructField('consumedService',T.StringType(),True),
T.StructField('consumedServiceId',T.LongType(),True),
T.StructField('cost',T.DoubleType(),True),
T.StructField('costCenter',T.StringType(),True),
T.StructField('date',T.StringType(),True),
T.StructField('departmentId',T.LongType(),True),
T.StructField('departmentName',T.StringType(),True),
T.StructField('instanceId',T.StringType(),True),
T.StructField('location',T.StringType(),True),
T.StructField('meterCategory',T.StringType(),True),
T.StructField('meterId',T.StringType(),True),
T.StructField('meterName',T.StringType(),True),
T.StructField('meterRegion',T.StringType(),True),
T.StructField('meterSubCategory',T.StringType(),True),
T.StructField('offerId',T.StringType(),True),
T.StructField('partNumber',T.StringType(),True),
T.StructField('product',T.StringType(),True),
T.StructField('productId',T.LongType(),True),
T.StructField('resourceGroup',T.StringType(),True),
T.StructField('resourceGuid',T.StringType(),True),
T.StructField('resourceLocation',T.StringType(),True),
T.StructField('resourceLocationId',T.LongType(),True),
T.StructField('resourceRate',T.DoubleType(),True),
T.StructField('serviceAdministratorId',T.StringType(),True),
T.StructField('serviceInfo1',T.StringType(),True),
T.StructField('serviceInfo2',T.StringType(),True),
T.StructField('serviceName',T.StringType(),True),
T.StructField('serviceTier',T.StringType(),True),
T.StructField('storeServiceIdentifier',T.StringType(),True),
T.StructField('subscriptionGuid',T.StringType(),True),
T.StructField('subscriptionId',T.LongType(),True),
T.StructField('subscriptionName',T.StringType(),True),
T.StructField('tags',T.StringType(),True),
T.StructField('unitOfMeasure',T.StringType(),True)
])
billing_df = spark.read.json('/mnt/billingsources/raw-files/202106/', schema=billing_schema)
Function json(String... paths) takes variable arguments. (documentation)
So you can change your code like this:
sqlContext.read().json(file1, file2, ...)

How do I load a text file into an array and then delete an element? Java

I currently have a MemberList.txt file which contains each member's name, address, contact number, etc. The second field (fields are separated by tabs) holds the member number which I am trying to find.
How can I load the text file into an array, then search the array for the second value of each member (i.e. array[i+1]) and compare the integer I'm given with the integer stored in that position?
I can post any relevant code which will help in answering this question. Thanks in advance!
You can use any CSV parser that supports tab as a separator, for example Apache Commons CSV (https://commons.apache.org/proper/commons-csv) with CSVFormat.TDF (tab-delimited format):
File memberList = new File("/path/to/MemberList.txt");
CSVParser parser = CSVParser.parse(memberList, StandardCharsets.UTF_8, CSVFormat.TDF);
for (CSVRecord csvRecord : parser) {
    int memberNum = Integer.parseInt(csvRecord.get(1));
    ...
}
And by the way it's not array[i+1], it's array[i][1].
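A short, self-contained sketch of the load-then-search/delete flow could look like this (the file name, the member number, and the UTF-8 charset are assumptions):
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class MemberSearch {
    public static void main(String[] args) throws IOException {
        int memberNumberToDelete = 42;   // hypothetical member number
        File memberList = new File("MemberList.txt");

        // load the whole tab-separated file into memory
        List<CSVRecord> members = new ArrayList<>();
        try (CSVParser parser = CSVParser.parse(memberList, StandardCharsets.UTF_8, CSVFormat.TDF)) {
            for (CSVRecord record : parser) {
                members.add(record);
            }
        }

        // "delete an element": drop every record whose second field matches
        members.removeIf(record -> Integer.parseInt(record.get(1)) == memberNumberToDelete);

        System.out.println(members.size() + " members remain");
    }
}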
