Spark : How to merge the transformations - java

I have 1000 JSON files. I need to apply some transformations to each file and then create a merged output file. The merge involves operations that span files (for example, the output should not contain repeated values).
So, if I read the files with wholeTextFiles as (title, content) pairs, and then in the map function parse the content as a JSON tree and perform the transformation, where and how do I merge the output?
Do I need another transformation on the resulting RDDs to merge the values, and how would this work? Can I have a shared object (a List, a Map, or an RDD?) among all map blocks, which is updated as part of the transformation, so that I can check for repeated values there?
P.S.: Even if the output is written as part files, I would still like to have no repetitions.
Code:
//read the files as JavaPairRDD , which gives <filename, content> pairs
String filename = "/sample_jsons";
JavaPairRDD<String,String> distFile = sc.wholeTextFiles(filename);
//then create a JavaRDD from the content.
JavaRDD<String> jsonContent = distFile.map(x -> x._2);
//apply transformations, the map function will return an ArrayList which would
//have property names.
JavaRDD<ArrayList<String>> apm = jsonContent.map(
new Function< String, ArrayList<String> >() {
@Override
public ArrayList<String> call(String arg0) throws Exception {
JsonNode rootNode = mapper.readTree(arg0);
return parseJsonAndFindKey(rootNode, "type", "rootParent");
}
});
So, this way I am able to get all first-level properties in an ArrayList from each JSON file.
Now I need a final ArrayList that is the union of all these ArrayLists, with duplicates removed. How can I achieve that?

Why do you need 1000 RDDs for 1000 json files?
Do you see any issue with merging the 1000 json files in the input stage into one RDD?
If you'll be using one RDD from the input stage, it shouldn't be hard to perform all the needed actions on this RDD.
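A minimal sketch of the merge step, written with plain Java streams so it runs without a cluster. The class name MergeKeys and the sample lists are made up for illustration. On the JavaRDD<ArrayList<String>> from the question, the corresponding operations would be flatMap (to flatten each list into its elements) followed by distinct(), so no shared mutable object across map tasks is needed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MergeKeys {
    // Flatten the per-file lists into one stream of names and drop duplicates.
    static List<String> mergeDistinct(List<List<String>> perFileKeys) {
        return perFileKeys.stream()
                .flatMap(List::stream)   // one element per property name
                .distinct()              // keep the first occurrence of each name
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical per-file results, standing in for the lists returned by the map step.
        List<List<String>> perFileKeys = Arrays.asList(
                Arrays.asList("id", "name"),
                Arrays.asList("name", "address"),
                Arrays.asList("id", "phone"));
        System.out.println(mergeDistinct(perFileKeys));
    }
}
```

Because the deduplication happens after flattening, it applies across all files, and the result stays duplicate-free even when it is later written out as several part files.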

Related

How to Convert .csv file to RDD<Vector>?

I have a CSV file containing following data with 9000+ records
id,Category1,Category2
How do I convert this csv file to RDD<Vector> so that I can use it to find similar column using columnSimilarities of Apache Spark in java.
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#RowMatrix-org.apache.spark.rdd.RDD-
You can try this:
sparkSession.read.csv(myCsvFilePath) // you should have a DataFrame here
.map((r: Row) => Vectors.dense(r.getInt(0), r.getInt(1), r.getInt(2))) // you should have a Dataset of Vector (the factory object is Vectors, not Vector)
.rdd // you have your RDD[Vector]
Feel free to reach out if it doesn't work.
As I read it, a Vector can hold the ID and a double[] for the values.
You need to fill the Vector.
List<String> lines = Files.readAllLines(Paths.get("myfile.csv"), Charset.defaultCharset());
Then you can iterate over the lines, create a Vector for each line, fill it with the values (you need to parse them), and collect the vectors in a list that you then turn into an RDD (for example with parallelize).

Univocity parser - Handling lines with weird constructs

I am trying to figure out the best way to use the Univocity parser to handle a CSV log file with lines that look like this:
"23.62.3.74",80,"testUserName",147653,"Log Collection Device 100","31/02/15 00:05:10 GMT",-1,"10.37.255.3","TCP","destination_ip=192.62.3.74|product_id=0071|option1_type=(s-dns)|proxy_machine_ip=10.1.255.3"
As you can see, this is a comma-delimited file, but the last column holds a bunch of values prefixed with their field names. My requirement is to capture values from the normal fields and
selectively from this last big field.
I know about the master-detail row processor in Univocity, but I doubt this fits into that category. Could you point me in the right direction, please?
Note: I can handle the name prefixed fields in rowProcessed(String[] row, ParsingContext context) if I implement a row processor but I am looking for something native to Univocity if possible?
Thanks,
R
There's nothing native in the parser for that. Probably the easiest way to go about it is to have your RowProcessor as you mentioned.
One thing you can try, to make your life easier, is to use another instance of CsvParser to parse that last column:
//initialize a parser for the pipe separated bit
CsvParserSettings detailSettings = new CsvParserSettings();
detailSettings.getFormat().setDelimiter('=');
detailSettings.getFormat().setLineSeparator("|");
CsvParser detailParser = new CsvParser(detailSettings);
//here is the content of the last column (assuming you got it from the parser)
String details = "destination_ip=192.62.3.74|product_id=0071|option1_type=(s-dns)|proxy_machine_ip=10.1.255.3";
//The result will be a list of pairs
List<String[]> pairs = detailParser.parseAll(new StringReader(details));
//You can add the pairs to a map
Map<String, String> map = new HashMap<String, String>();
for (String[] pair : pairs) {
map.put(pair[0], pair[1]);
}
//this should print something like the following (HashMap does not guarantee entry order): {destination_ip=192.62.3.74, product_id=0071, proxy_machine_ip=10.1.255.3, option1_type=(s-dns)}
System.out.println(map);
That won't be extremely fast but at least it's easy to work with a map if that input can have random column names and values associated with them.
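For comparison, here is a sketch that splits the last column with plain String operations instead of a second parser. It is not a Univocity feature; the class name DetailField is hypothetical, and it assumes that values never contain '|' and that keys never contain '='.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DetailField {
    // Split "k1=v1|k2=v2" into a map that keeps the original field order.
    static Map<String, String> parse(String details) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String pair : details.split("\\|")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip fragments without a '='
            }
            // Split at the first '=' only, so values may themselves contain '='.
            map.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return map;
    }

    public static void main(String[] args) {
        String details = "destination_ip=192.62.3.74|product_id=0071|option1_type=(s-dns)|proxy_machine_ip=10.1.255.3";
        Map<String, String> map = parse(details);
        // Selectively pick the fields of interest from the last column.
        System.out.println(map.get("product_id"));
        System.out.println(map.get("destination_ip"));
    }
}
```

This could be called from rowProcessed(String[] row, ParsingContext context) on the last element of the row.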

Parsing a CSV file for a unique row using the new Java 8 Streams API

I am trying to use the new Java 8 Streams API (for which I am a complete newbie) to parse for a particular row (the one with 'Neda' in the name column) in a CSV file. Using the following article for motivation, I modified and fixed some errors so that I could parse the file containing 3 columns - 'name', 'age' and 'height'.
name,age,height
Marianne,12,61
Julie,13,73
Neda,14,66
Julia,15,62
Maryam,18,70
The parsing code is as follows:
@Override
public void init() throws Exception {
Map<String, String> params = getParameters().getNamed();
if (params.containsKey("csvfile")) {
Path path = Paths.get(params.get("csvfile"));
if (Files.exists(path)){
// use the new java 8 streams api to read the CSV column headings
Stream<String> lines = Files.lines(path);
List<String> columns = lines
.findFirst()
.map((line) -> Arrays.asList(line.split(",")))
.get();
columns.forEach((l)->System.out.println(l));
// find the relevant sections from the CSV file
// we are only interested in the row with Neda's name
int nameIndex = columns.indexOf("name");
int ageIndex = columns.indexOf("age");
int heightIndex = columns.indexOf("height");
// we need to know the index positions of the columns
// have to re-read the csv file to extract the values
lines = Files.lines(path);
List<List<String>> values = lines
.skip(1)
.map((line) -> Arrays.asList(line.split(",")))
.collect(Collectors.toList());
values.forEach((l)->System.out.println(l));
}
}
}
Is there any way to avoid re-reading the file following the extraction of the header line? Although this is a very small example file, I will be applying this logic to a large CSV file.
Is there a technique to use the streams API to create a map between the extracted column names (from the first scan of the file) and the values in the remaining rows?
How can I return just one row in the form of List<String> (instead of a List<List<String>> containing all the rows)? I would prefer to get the row as a mapping between the column names and their corresponding values (a bit like a result set in JDBC). I see a Collectors.mapMerger function that might be helpful here, but I have no idea how to use it.
Use a BufferedReader explicitly:
List<String> columns;
List<List<String>> values;
try(BufferedReader br=Files.newBufferedReader(path)) {
String firstLine=br.readLine();
if(firstLine==null) throw new IOException("empty file");
columns=Arrays.asList(firstLine.split(","));
values = br.lines()
.map(line -> Arrays.asList(line.split(",")))
.collect(Collectors.toList());
}
Files.lines(…) also resorts to BufferedReader.lines(…). The only difference is that Files.lines will configure the stream so that closing the stream will close the reader, which we don’t need here, as the explicit try(…) statement already ensures the closing of the BufferedReader.
Note that there is no guarantee about the state of the reader after the stream returned by lines() has been processed, but we can safely read lines before performing the stream operation.
First, your concern that this code reads the file twice is unfounded. Files.lines returns a Stream of the lines that is lazily populated. So the first part of the code reads only the first line, and the second part reads the rest (it does read the first line a second time, though, even if it is skipped). Quoting its documentation:
Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.
Onto your second concern about returning just a single row. In functional programming, what you are trying to do is called filtering. The Stream API provides such a method with the help of Stream.filter. This method takes a Predicate as argument, which is a function that returns true for all the items that should be kept, and false otherwise.
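In this case, we want a Predicate that returns true when the name is equal to "Neda". For a single name string this could be written as the lambda expression s -> s.equals("Neda"); since each row here is a List<String>, the predicate is applied to the name element of the list.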
So in the second part of your code, you could have:
lines = Files.lines(path);
List<List<String>> values = lines
.skip(1)
.map(line -> Arrays.asList(line.split(",")))
.filter(list -> list.get(0).equals("Neda")) // keep only items where the name is "Neda"
.collect(Collectors.toList());
Note however that this does not ensure that there is only a single item where the name is "Neda", it collects all possible items into a List<List<String>>. You could add some logic to find the first item or throw an exception if no items are found, depending on your business requirement.
Note also that calling Files.lines(path) twice can be avoided by using a BufferedReader directly, as in @Holger's answer.
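Combining the two answers above, the following sketch reads the header once, stops at the first matching row, and returns it as a mapping from column name to value. The class and method names (FindRow, findByName) are made up for illustration, and the naive split(",") assumes no quoted fields or embedded commas.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class FindRow {
    // Read the header line once, then stream the remaining lines and stop at the first match.
    static Optional<Map<String, String>> findByName(BufferedReader br, String name) {
        String firstLine;
        try {
            firstLine = br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (firstLine == null) {
            throw new IllegalArgumentException("empty file");
        }
        List<String> columns = Arrays.asList(firstLine.split(","));
        int nameIndex = columns.indexOf("name");
        if (nameIndex < 0) {
            throw new IllegalArgumentException("no 'name' column");
        }
        return br.lines()
                .map(line -> line.split(","))
                // Ignore malformed rows, keep only rows whose name column matches.
                .filter(values -> values.length == columns.size() && values[nameIndex].equals(name))
                .findFirst() // short-circuits: later lines are not processed
                .map(values -> {
                    Map<String, String> row = new LinkedHashMap<>();
                    for (int i = 0; i < values.length; i++) {
                        row.put(columns.get(i), values[i]);
                    }
                    return row;
                });
    }

    public static void main(String[] args) throws IOException {
        // In real use the reader would come from Files.newBufferedReader(path).
        String csv = "name,age,height\nMarianne,12,61\nJulie,13,73\nNeda,14,66\nJulia,15,62\n";
        try (BufferedReader br = new BufferedReader(new StringReader(csv))) {
            System.out.println(findByName(br, "Neda"));
        }
    }
}
```

Returning an Optional makes the "no such row" case explicit instead of relying on an empty list.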
Using a CSV-processing library
Other Answers are good. But I recommend using a CSV-processing library to read your input files. As others noted, the CSV format is not as simple as it may seem. To begin with, the values may or may not be nested in quote-marks. And there are many variations of CSV, such as those used in Postgres, MySQL, Mongo, Microsoft Excel, and so on.
The Java ecosystem offers several such libraries. I use Apache Commons CSV.
The Apache Commons CSV library does not make use of streams. But you have no need for streams if you use a library to do the scut work. The library makes easy work of looping over the rows of the file, without loading a large file into memory.
create a map between the extracted column names (in the first scan of the file) to the values in the remaining rows?
Apache Commons CSV does this automatically when you call withHeader.
return just one row in the form of List
Yes, easy to do.
As you requested, we can fill List with each of the 3 field values for one particular row. This List acts as a tuple.
List < String > tuple = List.of(); // Our goal is to fill this list of values from a single row. Initialize to an empty nonmodifiable list.
We specify the format we expect of our input file: standard CSV (RFC 4180), with the first row populated by column names.
CSVFormat format = CSVFormat.RFC4180.withHeader() ;
We specify the file path where to find our input file.
Path path = Path.of("/Users/basilbourque/people.csv");
We use try-with-resources syntax (see Tutorial) to automatically close our parser.
As we read in each row, we check for the name being Neda. If found, we fill our tuple List with that row's field values, and we interrupt the looping. We use List.of to conveniently return a List object of some unknown concrete class that is unmodifiable, meaning you cannot add or remove elements from the list.
try (
CSVParser parser =CSVParser.parse( path , StandardCharsets.UTF_8, format ) ;
)
{
for ( CSVRecord record : parser )
{
if ( record.get( "name" ).equals( "Neda" ) )
{
tuple = List.of( record.get( "name" ) , record.get( "age" ) , record.get( "height" ) );
break ;
}
}
}
catch ( FileNotFoundException e )
{
e.printStackTrace();
}
catch ( IOException e )
{
e.printStackTrace();
}
If we found success, we should see some items in our List.
if ( tuple.isEmpty() )
{
System.out.println( "Bummer. Failed to report a row for `Neda` name." );
} else
{
System.out.println( "Success. Found this row for name of `Neda`:" );
System.out.println( tuple.toString() );
}
When run.
Success. Found this row for name of Neda:
[Neda, 14, 66]
Instead of using a List as a tuple, I suggest you define a Person class to represent this data with proper data types. Our code here would then return a Person instance rather than a List<String>.
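A minimal sketch of such a Person class follows. The field types are an assumption based on the sample data (whole numbers for age and height), and the factory method name fromFields is hypothetical.

```java
final class Person {
    private final String name;
    private final int age;
    private final int height;

    Person(String name, int age, int height) {
        this.name = name;
        this.age = age;
        this.height = height;
    }

    // Build a Person from the raw CSV field strings.
    // Throws NumberFormatException if age or height is not a whole number.
    static Person fromFields(String name, String age, String height) {
        return new Person(name.trim(), Integer.parseInt(age.trim()), Integer.parseInt(height.trim()));
    }

    String getName() { return name; }
    int getAge() { return age; }
    int getHeight() { return height; }

    @Override
    public String toString() {
        return "Person{name=" + name + ", age=" + age + ", height=" + height + "}";
    }

    public static void main(String[] args) {
        // Stand-ins for record.get("name"), record.get("age"), record.get("height").
        Person neda = Person.fromFields("Neda", "14", "66");
        System.out.println(neda);
    }
}
```

Inside the loop above, the List.of(...) assignment would then become a call such as Person.fromFields( record.get( "name" ) , record.get( "age" ) , record.get( "height" ) ).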
I know I'm responding late, but maybe this will help someone in the future.
I've made a CSV parser/writer that is easy to use thanks to its builder pattern.
For your case, you can filter the lines you want to parse using
csvLineFilter(Predicate<String>)
I hope you find it handy. Here is the source code:
https://github.com/i7paradise/CsvUtils-Java8/
I've included a main class, Demo.java, to show how it works.

How to efficiently handle huge JSON file, needs some ideas

This is a question about the approach, so please don't tell me to use a third-party library for this.
Recently, in a job interview, I was asked the following question:
There is a huge JSON file, structured like a database:
{
"tableName1 ":[
{"t1field1":"value1"},
{"t1field2":"value2"},
...
{"t1fieldN":"valueN"}
],
"tableName2 ":[
{"t2field1":"value1"},
{"t2field2":"value2"},
....
{"t2fieldN":"valueN"}
],
.......
.......
"tableNameN ":[
{"tNfield1":"value1"},
{"tNfield2":"value2"},
....
{"tNfieldN":"valueN"}
]
}
And the requirements are:
find a particular child node by the given child-node name, update its field's value, and then save the result to a new JSON file.
count the occurrences of a given field name and value.
For a normal-sized JSON file, I wrote a utility class to load the JSON file from local disk and parse it into a JSON object. Then I wrote two methods to handle the two requirements:
void upDateAndSaveJson(JSONObject json, String nodeName,
Map<String, Object> map, Map<String, Object> updateMap,
String outPath) {
//map saved target child-node's conditions
//updateMap saved update conditions
// first find the target child-node and update it finally save it
// ...code ...
}
int getCount(JSONObject json, Map<String, Object> map) {
//map saved field target field/value
// ...code...
}
But the interviewer asked me to think about the case where the JSON file is very large, and then to modify my code to make it more efficient.
My idea is to write a tool that splits the JSON file first. Because in the end I need to pass a JSON object to the two methods above, I already know their parameters before splitting the huge file: a Map (holding the target child node's conditions, or the target field/value) and nodeName (the child-node name).
So while loading the JSON file I compare each line read from the input stream with the target nodeName and then start counting the objects in that child node. If the limit is 100, then once 100 objects have been read I split them off into a new, smaller JSON file and remove them from the source JSON file.
Like below:
while((line = reader.readLine()) != null){
for (String nodeName : nodeNames) {
//check if its' the target node
if (line.indexOf(nodeName) != -1) {
//count the target child-node's object
//and then split to smaller JSON file
}
}
}
After that I can use multiple threads to load the smaller JSON files created previously and invoke the two methods to process each JSON object.
This is a question about the approach, so please don't tell me to use a third-party library to deal with this problem.
So is my idea feasible? Or if you have some other idea, please share it.
Thanks.

ANDROID usage of Jackson library: How to load object with indexes - range from to

I have really big JSON file for parsing and managing. My JSON file contains structure like this
[
{"id": "11040548","key1":"keyValue1","key2":"keyValue2","key3":"keyValue3","key4":"keyValue4","key5":"keyValue5","key6":"keyValue6","key7":"keyValue7","key8":"keyValue8","key9":"keyValue9","key10":"keyValue10","key11":"keyValue11","key12":"keyValue12","key13":"keyValue13","key14":"keyValue14","key15":"keyValue15"
},
{"id": "11040549","key1":"keyValue1","key2":"keyValue2","key3":"keyValue3","key4":"keyValue4","key5":"keyValue5","key6":"keyValue6","key7":"keyValue7","key8":"keyValue8","key9":"keyValue9","key10":"keyValue10","key11":"keyValue11","key12":"keyValue12","key13":"keyValue13","key14":"keyValue14","key15":"keyValue15"
},
....
{"id": "11040548","key1":"keyValue1","key2":"keyValue2","key3":"keyValue3","key4":"keyValue4","key5":"keyValue5","key6":"keyValue6","key7":"keyValue7","key8":"keyValue8","key9":"keyValue9","key10":"keyValue10","key11":"keyValue11","key12":"keyValue12","key13":"keyValue13","key14":"keyValue14","key15":"keyValue15"
}
]
My JSON file contains data about topics from a news website, and it will grow dramatically practically every day.
To parse that file I use
URL urlLinkSource = new URL(OUTBOX_URL);
urlLinkSourceReader = new BufferedReader(new InputStreamReader(
urlLinkSource.openStream(), "UTF-8"));
ObjectMapper mapper = new ObjectMapper();
List<DataContainerList> DataContainerListData = mapper.readValue(urlLinkSourceReader,new TypeReference<List<DataContainerList>>() { }); //DataContainerList contains id, key1, key2, key3..key15
My problem is that I want to load in this line
List<DataContainerList> DataContainerListData = mapper.readValue(urlLinkSourceReader,new TypeReference<List<DataContainerList>>() { });
only a range of JSON objects (just the first ten objects, then the second ten, and so on), because I need to display just 10 news items at a time in paging mode (I always know the index range of the 10 I need to display). It is wasteful to load 10,000 objects and iterate over just the first 10 of them. So my question is how I can load,
in a similar way to this one:
List<DataContainerList> DataContainerListData = mapper.readValue(urlLinkSourceReader,new TypeReference<List<DataContainerList>>() { });
only objects with indexes FROM -TO (for example from 30 to 40) without loading of all objects in the entire JSON file?
Regards
It depends on what you mean by "load object with indexes from to". There are two options:
Read everything but bind only a sublist
The solution in that case is to read the full stream and only bind values within those indexes.
You can use Jackson's streaming API and do it yourself. Parse the stream, use a counter to keep track of the current index, and then bind to POJOs only what you need.
However, this is not a good solution if your file is large and the work is done in real time.
Read only the data between those indexes
You should do that if your file is big and performance matters. Instead of having a single big file, do the pagination by splitting your JSON array into multiple files matching your ranges, and then just deserialize the content of the specific file into your list.
Hope this helps...
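The counter-based approach from the first option can be sketched in plain Java as follows. This is only an illustration of the windowing logic, not Jackson code: the class name PageReader and the Iterator/Function parameters are stand-ins. With Jackson's streaming API, the iterator would be replaced by a JsonParser advanced with nextToken(), calling skipChildren() on objects outside the window and binding only those inside it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

class PageReader {
    // Walk the elements in order, bind only indexes in [from, to),
    // and stop reading as soon as the page is complete.
    static <R, T> List<T> readPage(Iterator<R> elements, int from, int to, Function<R, T> bind) {
        List<T> page = new ArrayList<>();
        int index = 0;
        while (index < to && elements.hasNext()) {
            R raw = elements.next();       // always consumed, like advancing the parser
            if (index >= from) {
                page.add(bind.apply(raw)); // only elements inside the window are bound
            }
            index++;
        }
        return page;
    }

    public static void main(String[] args) {
        // Hypothetical raw records standing in for the JSON objects in the array.
        List<String> rawIds = Arrays.asList("11040548", "11040549", "11040550", "11040551", "11040552");
        // Page of two items starting at index 2, bound to Long values.
        List<Long> page = readPage(rawIds.iterator(), 2, 4, Long::valueOf);
        System.out.println(page);
    }
}
```

Elements before the window still have to be read, which is why this approach becomes slow for pages far into a large file, as noted above.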
