Spark: After CollectAsMap() or Collect(), every entry has same value - java

I need to read a text file and convert it to a Map.
Creating the JavaPairRDD works well.
However, when I convert the JavaPairRDD to a Map, every entry has the same value, specifically the last record in the text file.
inputFile:
1 book:3000
2 pencil:1000
3 coffee:2500
To read the text file, I used a custom Hadoop input format.
With this format, the key is the number and the value is the custom class Expense<content,price>.
JavaPairRDD<Integer, Expense> inputRDD = sc.newAPIHadoopFile(inputFile, ExpenseInputFormat.class, Integer.class, Expense.class, HadoopConf); // sc is the JavaSparkContext
inputRDD:
[1, (book,3000)]
[2, (pencil,1000)]
[3, (coffee,2500)]
However, when I do
Map<Integer,Expense> inputMap = new HashMap<Integer,Expense>(inputRDD.collectAsMap());
inputMap:
[1, (coffee,2500)]
[2, (coffee,2500)]
[3, (coffee,2500)]
As you can see, the keys are inserted correctly, but every value is the last value from the input.
I don't know why this happens.
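(One well-known cause of this exact symptom is that Hadoop RecordReaders often reuse the same key/value object for every record, and Spark's Hadoop RDD APIs warn that such records should be copied before being cached or collected; otherwise every collected entry ends up pointing at the last object read. A minimal sketch of a defensive copy before collectAsMap(), assuming a hypothetical copy constructor on Expense:)
import java.util.HashMap;
import java.util.Map;

// Copy each value before collecting: if ExpenseInputFormat reuses the same Expense
// instance for every record, collectAsMap() would otherwise see only its final state.
// (new Expense(e) assumes a hypothetical copy constructor.)
Map<Integer, Expense> inputMap = new HashMap<>(
        inputRDD.mapValues(e -> new Expense(e)).collectAsMap());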

Related

Java: data structure like a table without using a database?

I have a couple of files, one for each month, e.g.:
file-januar.csv
ID, From, To
1234, 2022-01-01, 2022-01-02
1235, 2022-07-01, 2022-08-20
file-februar.csv
ID, From, To
1234, 2022-01-01, 2022-01-02
1235, 2022-08-21, 2022-08-30
file-march.csv
...
The ID is unique within each file. If the record did not change, the files for January and February contain the same entry:
1234, 2022-01-01, 2022-01-02
If the record changed, the entries for January and February are different:
1235, 2022-07-01, 2022-08-20
1235, 2022-08-21, 2022-08-30
I need to create a single file, without any duplicates, in chronological order. My problem: I cannot use a database.
Does somebody have an idea how to create a single January-December file without any duplicates? Each file has about 10,000 rows. How can I handle it? And how can I sort it chronologically?
I would store each line in a Map<Integer, String> where the key is the unique ID (I suppose it is a number, but you could also use a string) and the value is the complete line.
Then read in each file in chronological order and store the lines into the map. Entries from later files will overwrite entries from earlier ones.
In the end you can write out the Map like this:
Set<Integer> keys = allValuesMap.keySet();
List<Integer> keyList = new ArrayList<>(keys);
Collections.sort(keyList);
for (Integer key : keyList) {
    System.out.println(allValuesMap.get(key));
}
(Of course, you would probably replace System.out with a writer to the output file.)
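Putting it together, a minimal sketch of the whole approach (the monthly file names are the ones from the question, the output file name is made up; I assume the ID is the first CSV column and that each file starts with a header row):
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeMonths {
    public static void main(String[] args) throws IOException {
        // List the files in chronological order; later files overwrite earlier entries.
        List<String> files = Arrays.asList("file-januar.csv", "file-februar.csv", "file-march.csv");

        Map<Integer, String> allValuesMap = new HashMap<>();
        for (String file : files) {
            List<String> lines = Files.readAllLines(Paths.get(file));
            for (String line : lines.subList(1, lines.size())) { // skip the header row
                int id = Integer.parseInt(line.split(",")[0].trim());
                allValuesMap.put(id, line);                      // later months win
            }
        }

        // Write the merged, duplicate-free result sorted by ID.
        List<Integer> keyList = new ArrayList<>(allValuesMap.keySet());
        Collections.sort(keyList);
        try (PrintWriter out = new PrintWriter("file-januar-dezember.csv")) {
            out.println("ID, From, To");
            for (Integer key : keyList) {
                out.println(allValuesMap.get(key));
            }
        }
    }
}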
If that doesn't work, you could try to use an in-memory database like H2.

BOM Problem for matching 2 keys' value with TreeMap.getEntryUsingComparator()

I'm trying to follow the tutorial on this page (https://www.bezkoder.com/spring-boot-upload-csv-file/) in order to insert information from a CSV file into a MySQL DB, but I got stuck on something in the CSVHelper class.
While investigating, I found that the problem is located in TreeMap.getEntryUsingComparator(), where the key does not match any of the values in the headerMap.
When I checked the variables in the debug view, I saw that the first bytes were different even though the text was the same ("Id").
The key argument ("Id") has the value [73, 100].
The headerMap entry ("Id") has the value [-1, -2, 73, 0, 100, 0].
I have checked the header in the file and there is no space. All the other headers work fine.
After changing the order of the headers, it became clear that the problem affects only the first header name: [-1, -2] is added at the beginning and 0 between the other values.
So, what do you think it could possibly be? What can I do to solve this?
Project on Github, branch dev-mysql-csv
The extra bytes at the beginning of the input were caused by the BOM (Byte Order Mark). The CSV file wasn't saved in the right format (I changed from "CSV delimited with comma" to "CSV delimited with semicolon") and then it worked.
But this works only when the separator is "," and not ";", which is very odd...
To handle the BOM there is BOMInputStream; with it I managed to get "CSV delimited with comma" working.
I tried to use withRecordSeparator(";") and withDelimiter(";") in order to make it work with ";", but it failed.
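For reference, a minimal sketch of the comma-delimited case, wrapping the upload stream in Commons IO's BOMInputStream before handing it to Commons CSV (the class and method names here are illustrative, not the exact code from the tutorial; by default BOMInputStream strips a UTF-8 BOM, so a UTF-16 file would additionally need the matching charset):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.io.input.BOMInputStream;

public class CsvWithBom {
    public static CSVParser open(InputStream is) throws IOException {
        // Strip the BOM so the first header is read as "Id" rather than "\uFEFFId".
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new BOMInputStream(is), StandardCharsets.UTF_8));
        return new CSVParser(reader,
                CSVFormat.DEFAULT
                        .withFirstRecordAsHeader()
                        .withIgnoreHeaderCase()
                        .withTrim());
    }
}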

Explode a Spark array column to multiple columns in Spark SQL

I have a column whose type, Value, is defined like below:
val Value: ArrayType = ArrayType(
  new StructType()
    .add("unit", StringType)
    .add("value", StringType)
)
and data like this
[[unit1, 25], [unit2, 77]]
[[unit2, 100], [unit1, 40]]
[[unit2, 88]]
[[unit1, 33]]
I know Spark SQL can use functions.explode to turn the data into multiple rows, but what I want is to explode it into multiple columns (or keep the single column but with two items even for rows that have only one item).
so the end result looks like below
unit1 unit2
25 77
40 100
value1 88
33 value2
How could I achieve this?
Addition after the initial post and update:
I want to get a result like this (this is closer to my final goal):
transformed-column
[[unit1, 25], [unit2, 77]]
[[unit2, 104], [unit1, 40]]
[[unit1, value1], [unit2, 88]]
[[unit1, 33],[unit2,value2]]
where value1 is the result of applying some kind of map/conversion function to [unit2, 88];
similarly, value2 is the result of applying the same map/conversion function to [unit1, 33].
I solved this problem using map_from_entries as suggested by @jxc, and then used a UDF to convert the one-item maps to two-item maps, using business logic to convert between the two units.
One thing to note: the map returned by map_from_entries is a Scala map, so if you use Java, you need to make sure the UDF method takes a Scala map.
PS: maybe I did not have to use map_from_entries; instead I could have made the UDF take an array of StructType.
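For the map_from_entries part, a minimal Java sketch (assuming a Dataset<Row> df with the Value column defined above; the UDF that applies the unit-conversion business logic is not shown):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.map_from_entries;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ExplodeToColumns {
    // Turn the array of {unit, value} structs into a map keyed by unit,
    // then read each unit out as its own column (null where the unit is missing).
    public static Dataset<Row> toColumns(Dataset<Row> df) {
        return df
                .withColumn("m", map_from_entries(col("Value")))
                .select(col("m").getItem("unit1").alias("unit1"),
                        col("m").getItem("unit2").alias("unit2"));
    }
}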

List as key for a key value store

I want to store key-value pairs in a database where the key is a list of integers or a set of integers.
My use case has the following steps:
I will get a list of integers
I will need to check if that list of integers (as a key) is already present in the DB
If this is present, I will need to pick up the value from the DB
There are certain computations I need to do if the list (or set) of integers is not already in the DB; if it is there, I just want to use the stored value and avoid the computations.
I am thinking of keeping the data in a key-value store, but I want the key to be specifically a list or set of integers.
I have thought about the options below.
Option A
Generate a unique hash for the list of integers and store that as the key in the key/value store.
Problem:
I could have hash collisions, which would break my use case. I believe there is no way to generate a hash that is unique 100% of the time.
This will not work.
If there were a way to generate a hash that is unique 100% of the time, that would be the best approach.
Option B
Create an immutable class holding a List or Set of integers and store that as the key for my key-value store.
Please share any feasible ways to achieve this.
You don’t need to do anything special:
Map<List<Integer>, String> keyValueStore = new HashMap<>();
List<Integer> key = Arrays.asList(1, 2, 3);
keyValueStore.put(key, "foo");
All JDK collections implement sensible equals() and hashCode() methods that are based solely on the contents of the collection.
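For example, continuing the snippet above, a lookup with a new but equal list finds the same entry:
// A different List instance with equal contents maps to the same entry.
System.out.println(keyValueStore.get(Arrays.asList(1, 2, 3)));         // prints "foo"
// Order matters for List equality, so a reordered list is a different key.
System.out.println(keyValueStore.containsKey(Arrays.asList(3, 2, 1))); // prints "false"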
Thank you. I would like to share some more findings.
Following up on what I mentioned in my earlier post, I tried the following.
I added the below documents in MongoDB:
db.products.insertMany([
  { mapping: [1, 2, 3],   hashKey: 'ABC123',   date: Date() },
  { mapping: [4, 5],      hashKey: 'ABC45',    date: Date() },
  { mapping: [6, 7, 8],   hashKey: 'ABC678',   date: Date() },
  { mapping: [9, 10, 11], hashKey: 'ABC91011', date: Date() },
  { mapping: [1, 9, 10],  hashKey: 'ABC1910',  date: Date() },
  { mapping: [1, 3, 4],   hashKey: 'ABC134',   date: Date() },
  { mapping: [4, 5, 6],   hashKey: 'ABC456',   date: Date() }
]);
When I now try to find a mapping, I get the expected results:
> db.products.find({ mapping: [4, 5] }).pretty();
{
    "_id" : ObjectId("5d4640281be52eaf11b25dfc"),
    "mapping" : [
        4,
        5
    ],
    "hashKey" : "ABC45",
    "date" : "Sat Aug 03 2019 19:17:12 GMT-0700 (PDT)"
}
The above gives the right result, as the mapping [4,5] (insertion order retained) is present in the DB.
> db.products.find({ mapping: [5,4] }).pretty();
The above gives no result, as expected, since the mapping [5,4] is not present in the DB (the insertion order is retained).
So it seems that "mapping" as a list works as expected.
I used Spring Data to read from a MongoDB instance running locally.
The format of the document is
{
    "_id" : 1,
    "hashKey" : "ABC123",
    "mapping" : [
        1,
        2,
        3
    ],
    "_class" : "com.spring.mongodb.document.Mappings"
}
I inserted 1.7 million records into the DB using org.springframework.boot.CommandLineRunner.
Then a query similar to my last example:
db.mappings.find({ mapping: [1, 2, 3] })
takes on average 1.05 seconds to find the mapping among the 1.7M records.
Please share any suggestions for making it faster, and how fast I can expect it to run.
I am not yet sure about create, update, and delete performance.

How to combine two JavaPairRDDs into a custom JavaPairRDD?

I have created the following JavaPairRDDs from data received from different API endpoints.
listHeaderRDD<Integer, TeachersList> -> {list_id, list_details}
e.g {1,{list_id:1,name:"abc",quantity:"2"}},
{2,{list_id:2,name:"xyz",quantity:"5"}}...
ItemsGroupListRDD<Integer, Iterable<Tuple2<Integer, TeachersListItem>>> ->
{list_id, {{item_id1,item_details1},{item_id2,item_details2}..}}
e.g {1, {{11,{item_id:11,item_name:"abc"}},{12,{item_id:12,item_name:"acv"}}}..}
{2, {{14,{item_id:14,item_name:"bnh"}},{18,{item_id:18,item_name:"hjk"}}}..}
Desired output:
teachersListRDD<TeachersList, Iterable<TeachersListItem>> -> {list_details, all_item_details}
e.g {{{list_id:1,name:"abc",quantity:"2"},{{item_id:11,item_name:"abc"},{item_id:12,item_name:"acv"}}},
{{list_id:2,name:"xyz",quantity:"5"},{{item_id:14,item_name:"bnh"},{item_id:18,item_name:"hjk"}}}
}
Basically, I want the value of the first RDD to be the key of the desired RDD, and the group of item_details from the second RDD corresponding to that list_id to be the value of the desired RDD, i.e. teachersListRDD.
I have tried different ways to do it but have been unable to get the desired output.
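For what it's worth, a minimal sketch of one possible approach: join the two RDDs on list_id, then move the list details into the key and collect the item details into the value (untested, and it assumes the TeachersList / TeachersListItem classes from the question):
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

// Join on list_id, then reshape each (list_id, (list_details, items)) pair.
JavaPairRDD<TeachersList, Iterable<TeachersListItem>> teachersListRDD =
        listHeaderRDD
                .join(ItemsGroupListRDD)
                .mapToPair(t -> {
                    TeachersList listDetails = t._2()._1();
                    List<TeachersListItem> items = new ArrayList<>();
                    for (Tuple2<Integer, TeachersListItem> item : t._2()._2()) {
                        items.add(item._2());
                    }
                    return new Tuple2<>(listDetails, (Iterable<TeachersListItem>) items);
                });
If teachersListRDD is then used in further pair operations (groupByKey, join, etc.), TeachersList should implement equals() and hashCode() and be serializable.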
