reading very wide data set performance optimization [duplicate] - java

I receive a fixed-width .txt source file from which I need to extract 20K columns.
Because of the lack of libraries for processing fixed-width files in Spark, I developed code that extracts the fields from fixed-width text files.
The code reads the text file as an RDD with
sparkContext.textFile("abc.txt")
then reads a JSON schema to get the column names and the width of each column.
In the function I read the fixed-length string and, using the start and end positions, call substring to build an array of fields.
I map the function over the RDD, convert the resulting RDD to a DataFrame, assign the column names, and write it out as Parquet.
The representative code:
import spark.implicits._

val rdd1 = spark.sparkContext.textFile("file1")

def substrString(line: String, colLengths: Seq[Int]): Seq[String] = {
  var now = 0
  val collector = new Array[String](colLengths.length)
  for (k <- colLengths.indices) {
    collector(k) = line.substring(now, now + colLengths(k))
    now = now + colLengths(k)
  }
  collector.toSeq
}

// colLengthSeq is read from another schema file that holds the column lengths
val stringArray = rdd1.map(substrString(_, colLengthSeq))

stringArray.toDF("StringCol")
  .select((0 until colCount).map(j => $"StringCol"(j).as(columnSeq(j))): _*)
  .write.mode("overwrite").parquet("""C:\home\""")
This code works fine for files with a small number of columns, but it takes a lot of time and resources with 20K columns, and the runtime grows as the number of columns increases.
Has anyone faced this kind of issue with a very large number of columns? I need suggestions on performance tuning: how can I tune this job or code?

Related

Efficient way to read very wide dataset in scala or java [duplicate]


Filling a file that needs values from 1000 other files - Java

Suppose you have this .csv that we'll name "toComplete":
[Date,stock1, stock2, ...., stockn]
[30-jun-2015,"NA", "NA", ...., "NA"]
....
[30-Jun-1994,"NA","NA",....,"NA"]
with n = 1000 and the number of rows = 5000. Each row is for a different date. That's a fairly big file and I'm not really used to handling files this size.
My goal is to fill in the "NA" values with values I'll take from other .csv files.
In fact, I have one file (also a .csv) for each stock. This means I have 1000 stock files plus my file "toComplete".
Here is what the stock files look like:
[Date, value1, value2]
[27-Jun-2015, v1, v2]
....
[14-Feb-2013,z1,z2]
There are fewer dates in each stock file than in the "toComplete" file, and each date in a stock file necessarily appears in "toComplete".
My question is: what is the best way to fill my file "toComplete"? I tried reading it line by line, but that is very slow: for every line of "toComplete" I read all 1000 stock files to fill in that line. I think there are better solutions, but I can't see them.
EDIT :
For example, to replace the "NA" in the second row and second column of "toComplete", I need to open the file stock1 and read it line by line to find the value of value1 corresponding to the date of the second row in "toComplete".
I hope it makes more sense now.
EDIT2 :
Dates are edited. For a lot of stocks I won't have values. In this example we only have dates from 14-Feb-2013 to 27-Jun-2015, which means some "NA" values will remain at the end (but that's not a problem).
I know which files to search because my files are named stock1.csv, stock2.csv, ... I put them in a single directory so I can use the .list() method.
So you have 1000 "price history" CSV files for certain stocks, containing up to 5000 days of price history each, and you want to combine the data from those files into one CSV file where each line starts with a date and the rest of the entries on the line are the up to 1000 different stock prices for that historical day?
Back-of-the-napkin calculations indicate the final file would likely contain less than 100 MB of data (less than 20 bytes per stock price means less than 20 KB per line, times 5K lines). There should be plenty of RAM in a 256/512 MB JVM to read the data you want to keep from those 1000 files into a Map whose keys are the dates and whose value for each key is another Map with 1000 stock-symbol keys and their 1000 stock values. Then write out your final file by iterating the Map(s).
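The nested-map approach described above can be sketched as follows. This is a minimal sketch: the class and method names are hypothetical, and in-memory lists of CSV lines stand in for the real files (the real code would read them with BufferedReaders and write the result with a PrintWriter).

```java
import java.util.*;

public class StockMerger {
    // Build date -> (stockName -> value) from per-stock CSV lines of the
    // form "date,value1,value2"; only value1 is kept in this sketch.
    static Map<String, Map<String, String>> buildIndex(Map<String, List<String>> stockFiles) {
        Map<String, Map<String, String>> byDate = new HashMap<>();
        for (Map.Entry<String, List<String>> e : stockFiles.entrySet()) {
            for (String line : e.getValue()) {
                String[] t = line.split(",");
                byDate.computeIfAbsent(t[0], d -> new HashMap<>())
                      .put(e.getKey(), t[1]);
            }
        }
        return byDate;
    }

    // Fill one "toComplete" row: keep the date, replace each "NA" with the
    // indexed value when one exists for that date and stock.
    static String fillRow(String row, List<String> stockNames,
                          Map<String, Map<String, String>> byDate) {
        String[] t = row.split(",");
        Map<String, String> values = byDate.getOrDefault(t[0], Collections.emptyMap());
        StringBuilder sb = new StringBuilder(t[0]);
        for (int i = 1; i < t.length; i++) {
            String v = values.get(stockNames.get(i - 1));
            sb.append(',').append(v != null ? v : t[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> stocks = new HashMap<>();
        stocks.put("stock1", Arrays.asList("27-Jun-2015,12.5,13.0"));
        stocks.put("stock2", Arrays.asList("27-Jun-2015,8.1,8.4"));
        Map<String, Map<String, String>> idx = buildIndex(stocks);
        String filled = fillRow("27-Jun-2015,NA,NA",
                                Arrays.asList("stock1", "stock2"), idx);
        System.out.println(filled); // 27-Jun-2015,12.5,8.1
    }
}
```

Each stock file is read exactly once, and each output row is produced with two map lookups per cell instead of a scan over 1000 files per row.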

extract strings from database's table

I have a database table containing source code (not in a traditional language) which I want to parse using regex. Should I proceed row by row, or should I copy all of the rows to a text file and process them with the regex there?
I think it depends on the length of the code. If the code is short, it's better to parse it all at once (read all the rows). But if the code is large, it's better to parse it chunk by chunk: say, 100 rows, then another 100 rows, and so on. It depends on parsing performance.
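The chunk-by-chunk idea might look like the sketch below. The class and method names are hypothetical, and an in-memory list of rows stands in for the database result set (in a real job each chunk would come from a JDBC ResultSet with an appropriate fetch size).

```java
import java.util.*;
import java.util.regex.*;

public class ChunkedScanner {
    // Scan rows in fixed-size chunks, collecting every regex match.
    // Processing a bounded chunk at a time keeps memory flat even
    // when the full table is too large to hold at once.
    static List<String> scanInChunks(List<String> rows, Pattern pattern, int chunkSize) {
        List<String> matches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += chunkSize) {
            List<String> chunk = rows.subList(start, Math.min(start + chunkSize, rows.size()));
            for (String row : chunk) {
                Matcher m = pattern.matcher(row);
                while (m.find()) {
                    matches.add(m.group());
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> matches = scanInChunks(
                Arrays.asList("foo bar", "baz foo qux"),
                Pattern.compile("foo"), 2);
        System.out.println(matches); // [foo, foo]
    }
}
```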

How can I improve performance of string processing with less memory?

I'm implementing this in Java.
Symbol file:
1\item1
10\item20
11\item6
15\item14
5\item5
Store data file:
10\storename1
15\storename6
15\storename9
1\storename250
1\storename15
The user will search store names using wildcards like storename?
My job is to search the store names and produce a full string using symbol data. For example:
item20-storename1
item14-storename6
item14-storename9
My approach is:
read the store data file line by line
if a line matches the search string (like storename?), push that line to an intermediate store-result file
also copy the item number of the matching store name into an ArrayList (like 10, 15)
when the ArrayList size % 100 == 0, remove duplicate item numbers using a HashSet, reducing the ArrayList size significantly
when the ArrayList size > 1000:
sort the list using Collections.sort(itemno_arraylist)
open the symbol file and read it line by line
for each line, call Collections.binarySearch(itemno_arraylist, itemno)
if it matches, push the result to an intermediate symbol-result file
continue with step 1 until EOF of the store data file
...
After all of this I combine the two result files (the symbol result file and the store result file) to produce the actual list of strings.
This approach works, but it consumes a lot of CPU time and main memory.
I want a better solution with reduced CPU time (currently 2 min) and memory (currently 80 MB). There are many collection classes available in Java; which one would give a more efficient solution for this kind of huge string-processing problem?
Any thoughts on this kind of string-processing problem in Java would be great and helpful.
Note: both files are nearly a million lines long.
Replace the two flat files with an embedded database (there are plenty of them; I have used SQLite and db4o in the past): problem solved.
So you need to replace 10\storename1 with item20-storename1 because the symbol file contains 10\item20. The obvious solution is to load the symbol file into a Map:
String[] tokens = symbolFile.readLine().split("\\\\");
map.put(tokens[0], tokens[1]);
Then read the store file line by line and replace:
String[] tokens = storeFile.readLine().split("\\\\");
output.println(map.get(tokens[0]) + '-' + tokens[1]);
This is the fastest method, though it still uses a lot of memory for the map. You can reduce the memory by storing the map in a database, but that would increase the time significantly.
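Put together, the map-based replacement might look like the sketch below. The class and method names are hypothetical, and in-memory lists of lines stand in for the two files (the real code would stream them with BufferedReaders so only the symbol map is held in memory).

```java
import java.util.*;

public class SymbolJoin {
    // Load "itemNo\itemName" lines into a lookup map.
    static Map<String, String> loadSymbols(List<String> symbolLines) {
        Map<String, String> map = new HashMap<>();
        for (String line : symbolLines) {
            String[] t = line.split("\\\\");   // split on a literal backslash
            map.put(t[0], t[1]);
        }
        return map;
    }

    // For each "itemNo\storeName" line emit "itemName-storeName".
    static List<String> join(List<String> storeLines, Map<String, String> symbols) {
        List<String> out = new ArrayList<>();
        for (String line : storeLines) {
            String[] t = line.split("\\\\");
            out.add(symbols.get(t[0]) + "-" + t[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> symbols = loadSymbols(
                Arrays.asList("10\\item20", "15\\item14"));
        System.out.println(join(
                Arrays.asList("10\\storename1", "15\\storename6"), symbols));
        // [item20-storename1, item14-storename6]
    }
}
```

Note that String.split takes a regex, so a literal backslash in the data has to be written as "\\\\" in Java source: two characters for the regex escape, each doubled for the string literal.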
If your input data file does not change frequently, parse the file once and put the data into a List of a custom class, e.g. FileStoreRecord, mapping each record in the file. Define an equals method on your custom class. Perform all subsequent steps over the List; e.g. for search, you can call the contains method, passing the search string in the form of the custom FileStoreRecord object.
If the file changes over time, you may want to refresh the List at a certain interval, or keep track of the list-creation time and compare it against the file's update timestamp before using it; if they differ, recreate the list. Another way to manage the file check is to have a thread continuously polling the file for updates and, the moment it is updated, notifying the consumer to refresh the list.
Is there any limitation to using a Map?
You can add items to a Map, and then you can search it easily.
One million records means 1M * recordSize, so it should not be a problem.
Map<Integer, Item> itemMap = new HashMap<>();
...
Item item = itemMap.get(store.getItemNo());
But the best solution would be to use a database.

java solution for hashing lines that contain variables in .csv

I have a file that represents a table recorded in .csv or a similar format. The table may include missing values.
I'm looking for a solution (preferably in Java) that processes the file incrementally, without loading everything into memory, as the file can be huge. I need to identify duplicate records in the file, with the ability to specify which columns to exclude from consideration, and then produce output that groups those duplicate records. I would add an additional value at the end with a group number and write the output in the same format (.csv), sorted by group number.
I hope an effective solution can be found with some hashing function: for example, reading all lines and storing a hash value with each line number, where the hash is calculated from the set of variables I provide as input.
Any ideas?
OK, here is the paper that holds the key to the answer: P. Gopalan and J. Radhakrishnan, "Finding Duplicates in a Data Stream".
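As a simpler complement to the streaming algorithm in that paper, the keying idea from the question can be sketched as below. This is only a sketch: the class name, the column separator, and the excluded-column indices are assumptions, the lines are held in memory rather than streamed, and the final sort by group number is omitted.

```java
import java.util.*;

public class DuplicateGrouper {
    // Key a CSV line by the columns that are NOT excluded, so lines that
    // agree on every considered column map to the same key.
    static String keyFor(String line, Set<Integer> excludedCols) {
        String[] cols = line.split(",", -1);
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < cols.length; i++) {
            if (!excludedCols.contains(i)) {
                key.append(cols[i]).append('\u0001'); // separator unlikely in data
            }
        }
        return key.toString();
    }

    // Assign each distinct key a group number in first-seen order, then
    // append the group number to every line; lines sharing a group number
    // are duplicates under the chosen columns.
    static List<String> tagGroups(List<String> lines, Set<Integer> excludedCols) {
        Map<String, Integer> groupOf = new LinkedHashMap<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String key = keyFor(line, excludedCols);
            Integer group = groupOf.get(key);
            if (group == null) {
                group = groupOf.size() + 1;
                groupOf.put(key, group);
            }
            out.add(line + "," + group);
        }
        return out;
    }

    public static void main(String[] args) {
        // Exclude column 1, so rows 0 and 2 collide and form group 1.
        List<String> tagged = tagGroups(
                Arrays.asList("a,1,x", "b,1,y", "a,2,x"),
                new HashSet<>(Arrays.asList(1)));
        System.out.println(tagged);
    }
}
```

For a file too big for memory this would become two passes: one pass storing only (key hash, line number) pairs to build the groups, and a second pass re-reading the file to emit each line with its group number.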
