Suppose you have this .csv that we'll name "toComplete":
[Date,stock1, stock2, ...., stockn]
[30-jun-2015,"NA", "NA", ...., "NA"]
....
[30-Jun-1994,"NA","NA",....,"NA"]
with n = 1000 and about 5000 rows. Each row is for a different date. That's a fairly big file and I'm not used to working with files of this size.
My goal is to fill the "NA" cells with values I'll take from other .csv files.
In fact, I have one file (also a .csv) for each stock, which means I have 1000 stock files plus my file "toComplete".
Here is what the stock files look like:
[Date, value1, value2]
[27-Jun-2015, v1, v2]
....
[14-Feb-2013,z1,z2]
There are fewer dates in each stock's file than in the "toComplete" file, and each date in a stock's file necessarily appears in "toComplete".
My question is: what is the best way to fill my file "toComplete"? I tried reading it line by line, but this is very slow: for every line of "toComplete" I re-read all 1000 stock files to fill in that line. I think there are better solutions, but I can't see them.
EDIT:
For example, to replace the "NA" in the second row and second column of "toComplete", I need to open the file stock1 and read it line by line to find the value1 corresponding to the date of the second row of "toComplete".
I hope it makes more sense now.
EDIT2:
Dates are edited. For a lot of stocks, I won't have values. In this example, we only have dates from 14-Feb-2013 to 27-Jun-2015, which means some "NA" will remain at the end (but that's not a problem).
I know which files to search because my files are named stock1.csv, stock2.csv, ... I put them all in a single directory so I can use the .list() method.
So you have 1000 "price history" CSV files for certain stocks containing up to 5000 days of price history each, and you want to combine the data from those files into one CSV file where each line starts with a date and the rest of the entries on the line are the up to 1000 different stock prices for that historical day? Back-of-the-napkin calculations indicate the final file would be on the order of 100 MB of data (less than 20 bytes per stock price means under 20 KB per line * 5K lines). A JVM with a heap of a few hundred MB should have enough RAM to read the data you want to keep from those 1000 files into a Map where the keys are the dates and the value for each key is another Map with up to 1000 stock symbol keys and their stock values. Then write out your final file by iterating the Map(s).
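For example, a minimal sketch of that approach (assuming the stock files sit in one directory and are named stock1.csv, stock2.csv, ... as described, that value1 is the column to copy, and writing to a new file toComplete_filled.csv rather than in place; adjust column indices and paths to your real layout):
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class FillToComplete {
    public static void main(String[] args) throws IOException {
        File dir = new File("stocks");                      // assumed directory holding stock1.csv ... stockN.csv
        // date -> (stock name -> value1); dates are matched as exact strings
        Map<String, Map<String, String>> byDate = new HashMap<>();

        for (File f : dir.listFiles((d, name) -> name.endsWith(".csv"))) {
            String stock = f.getName().replace(".csv", "");
            try (BufferedReader in = new BufferedReader(new FileReader(f))) {
                in.readLine();                              // skip header
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split(",");
                    byDate.computeIfAbsent(cols[0].trim(), k -> new HashMap<>())
                          .put(stock, cols[1].trim());      // cols[1] = value1
                }
            }
        }

        // rewrite "toComplete", replacing "NA" where a value is known
        List<String> out = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader("toComplete.csv"))) {
            String[] header = in.readLine().split(",");
            out.add(String.join(",", header));
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                Map<String, String> values = byDate.getOrDefault(cols[0].trim(), Collections.emptyMap());
                for (int i = 1; i < cols.length; i++) {
                    cols[i] = values.getOrDefault(header[i].trim(), cols[i]); // keep "NA" if unknown
                }
                out.add(String.join(",", cols));
            }
        }
        Files.write(Paths.get("toComplete_filled.csv"), out);
    }
}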
I am getting a fixed-width .txt source file from which I need to extract 20K columns.
Due to the lack of libraries for processing fixed-width files with Spark, I have developed code that extracts the fields from fixed-width text files.
The code reads the text file as an RDD with
sparkContext.textFile("abc.txt")
then reads the JSON schema and gets the column names and the width of each column.
In the function, I read the fixed-length string and use the start and end positions with the substring function to create the array of fields.
I map the function over the RDD.
Then I convert the above RDD to a DataFrame, map the column names, and write it to Parquet.
The representative code:
// assumes `spark` is a SparkSession and spark.implicits._ has been imported

// split one fixed-width line into its fields, using the column widths
def substrString(line: String, colLength: Seq[Int]): Seq[String] = {
  var now = 0
  val collector = new Array[String](colLength.length)
  for (k <- 0 until colLength.length) {
    collector(k) = line.substring(now, now + colLength(k))
    now = now + colLength(k)
  }
  collector.toSeq
}

// read the raw lines as an RDD[String]
val rdd1 = spark.sparkContext.textFile("file1")

// here ColLengthSeq is read from another schema file holding the column lengths
val StringArray = rdd1.map(line => substrString(line, ColLengthSeq))

// wrap each record in a single array column, split it into ColCount named columns, write Parquet
StringArray.toDF("StringCol")
  .select((0 until ColCount).map(j => $"StringCol"(j).as(column_seq(j))): _*)
  .write.mode("overwrite").parquet("c:/home/")
This code works fine with files that have a small number of columns, but it takes a lot of time and resources with 20K columns.
As the number of columns increases, the run time increases as well.
Has anyone faced such an issue with a large number of columns?
I need suggestions on performance tuning: how can I tune this job or code?
Here's my problem statement. I have 3 large text files (each < 5 GB), each with 100 million lines (records). I want to compare files 1 and 2 and update the value in file 3. A single record in each file would look like this:
File 1
PrimaryValue1|OldValue1
File 2
PrimaryValue1|NewValue1
File 3
Field1Value|Field2Value|Field3Value|OldValue1|Field5Value....|Field100Value
All I have for every record is the OldValue1, which is unique for every record. Now, using the PrimaryValue1, I need to get the NewValue1 corresponding to the OldValue1 using files 1 and 2, and then update this NewValue1 in file3 in place of OldValue1. Both OldValue and NewValue are unique for every record.
I understand that if I read all the files into memory, then I will be able to compare and replace the values. Since this could be memory-intensive, I would like to know if there are better approaches to handle this scenario.
File1 and File2 define the replacements that you have to apply to File3.
I'd do the following steps:
Read File2 into a HashMap newValues.
Read File1, and for every entry:
Look up the primary value in newValues.
If found, put the old value / new value pair into a HashMap replacement.
Delete the entry from newValues (I guess, there are no duplicates).
Read File3, and for every entry:
Do the replacements from the replacement Map.
Write the resulting line to the output.
You need memory for the replacements, but the main file can be processed sequentially, needing just one record buffer.
If you get an OutOfMemoryError, this will probably happen while building the replacement Map, before you start working on the main File3. You can then modify the program to work on manageable chunks of File2. The rest of the algorithm need not change.
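To make that concrete, here is a minimal sketch of those steps in Java (the file names are placeholders, lines are assumed to be '|'-separated as in the examples, and the old value is assumed to sit in the fourth field of File3; adjust to the real layout):
import java.io.*;
import java.util.*;

public class ReplaceOldValues {
    public static void main(String[] args) throws IOException {
        // File2: PrimaryValue|NewValue  ->  newValues: primary -> new
        Map<String, String> newValues = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("file2.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] p = line.split("\\|", 2);
                newValues.put(p[0], p[1]);
            }
        }

        // File1: PrimaryValue|OldValue  ->  replacement: old -> new
        Map<String, String> replacement = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("file1.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] p = line.split("\\|", 2);
                String newValue = newValues.remove(p[0]);   // delete the entry, assuming no duplicates
                if (newValue != null) {
                    replacement.put(p[1], newValue);
                }
            }
        }

        // File3: stream through sequentially, replacing the old value (4th field here) where known
        try (BufferedReader in = new BufferedReader(new FileReader("file3.txt"));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("file3_updated.txt")))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\\|", -1);
                fields[3] = replacement.getOrDefault(fields[3], fields[3]);
                out.println(String.join("|", fields));
            }
        }
    }
}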
I have a file that represents a table recorded in .csv or a similar format. The table may include missing values.
I am looking for a solution (preferably in Java) that would process my file incrementally, without loading everything into memory, as my file can be huge. I need to identify duplicate records in my file, being able to specify which columns I want to exclude from consideration, and then produce an output grouping those duplicate records. I would add an additional value at the end with a group number and output in the same format (.csv), sorted by group number.
I hope an effective solution can be found with some hashing function: for example, reading all lines and storing a hash value with each line number, with the hash calculated from the set of columns I provide as input.
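To make the idea concrete, a rough sketch of that hashing pass (the file name and the excluded column indices are made-up examples):
import java.io.*;
import java.util.*;

public class GroupDuplicates {
    public static void main(String[] args) throws IOException {
        Set<Integer> excluded = new HashSet<>(Arrays.asList(0, 3)); // example: columns ignored for comparison
        // hash of the considered columns -> line numbers sharing that hash
        Map<Integer, List<Long>> groups = new HashMap<>();

        try (BufferedReader in = new BufferedReader(new FileReader("table.csv"))) {
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                String[] cols = line.split(",", -1);
                List<String> considered = new ArrayList<>();
                for (int i = 0; i < cols.length; i++) {
                    if (!excluded.contains(i)) considered.add(cols[i]);
                }
                groups.computeIfAbsent(considered.hashCode(), k -> new ArrayList<>()).add(lineNo);
            }
        }

        // line numbers sharing a hash are duplicate candidates; a second pass should
        // re-read and compare those lines to rule out hash collisions
        int groupNo = 0;
        for (List<Long> lines : groups.values()) {
            if (lines.size() > 1) {
                System.out.println("group " + (++groupNo) + ": lines " + lines);
            }
        }
    }
}
Only a hash and a list of line numbers are kept per line, so the file itself is never held in memory.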
Any ideas?
Ok, here is the paper that holds the key to the answer: P. Gopalan & J. Radhakrishnan "Finding duplicates in a data stream".
Question: I have two files, one with a list of serial numbers, items, prices, and locations, and the other file has items. I would like to compare the two files and print out the number of times each item is repeated in file1, along with its serial numbers.
Text1 file will have
Text2 file will have
Output should be
So file1 is not formatted in proper order, while file2 is in order (line by line).
Since you have no apparent code or effort put into this, I'll only hint/guide you to some tools you can use.
For parsing strings: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html
For reading in from a file: http://www.roseindia.net/java/beginners/java-read-file-line-by-line.shtml
And I would recommend reading file #2 first and saving those values to an arraylist, perhaps, so you can iterate through them later on when you do your searching.
Okay, my approach to this would be:
Read file1 and file2 each into a string.
"Split" the string from file1 as well as from file2 on "," if that is what is being used.
Check for the item in every 3rd element, so the iteration would advance by 3 each time (you might need to sort both if they are not in order).
If found, store it in an Array, ArrayList, etc. Go back to step 3 if more items are present; else stop.
Even though your file1 is not well formatted, its content has some patterns that you can use to read it successfully.
Each item has all the information (i.e. serial number, name, price, location), but not in a fixed order. So you have to pay attention to and use the following patterns while you read each item from file1 (a sketch follows the list):
The serial number is always a plain integer.
The price has the $ and . characters.
The location is 2 characters long, all capitals.
And the name is a string that doesn't match any of the above.
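For illustration, a rough sketch that classifies the tokens of one file1 record by those patterns (the comma-separated example record and the exact regexes are assumptions based on the description above):
import java.util.regex.Pattern;

public class RecordClassifier {
    private static final Pattern SERIAL   = Pattern.compile("\\d+");            // plain integer
    private static final Pattern PRICE    = Pattern.compile("\\$\\d+\\.\\d+");  // has $ and .
    private static final Pattern LOCATION = Pattern.compile("[A-Z]{2}");        // 2 capital letters

    public static void main(String[] args) {
        String record = "101,pen,$1.50,NY";   // made-up example record
        String serial = null, name = null, price = null, location = null;

        // classify each token, whatever order the record uses
        for (String token : record.split(",")) {
            token = token.trim();
            if (SERIAL.matcher(token).matches())        serial = token;
            else if (PRICE.matcher(token).matches())    price = token;
            else if (LOCATION.matcher(token).matches()) location = token;
            else                                        name = token;   // anything else is the name
        }
        System.out.println(serial + " | " + name + " | " + price + " | " + location);
    }
}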
Such problems are not best solved by monolithic Java code. If you don't have a tool constraint, then the recommended way to solve this is to import the data from file1 into a database table and then run queries from your program to fetch whatever information you like. You can easily select serial numbers based on items and group them for counts based on location.
This approach will ensure that you can keep up with changing requirements, and if your files are huge you will still get good performance.
I hope you are well versed with SQL and DB tools, so I have not posted any details on them.
Use regex.
Step one: read file1, splitting it on [\d,], and store the results in a map.
Step two: read in the word from the second file; say it's "pen".
Step three: do a regex search for "pen" on each string within the map.
Step four: if the above matches, apply something like ([A-Z][A-Z],) to each matching string within the map.
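A rough sketch of steps two to four, assuming step one has already produced a map of file1 records keyed by serial number (the record strings and the item word here are made up):
import java.util.*;
import java.util.regex.*;

public class RegexSearch {
    public static void main(String[] args) {
        // step one's result is assumed: records from file1 keyed by serial number
        Map<String, String> records = new HashMap<>();
        records.put("101", "pen,$1.50,NY");
        records.put("102", "book,$3.00,CA");

        String word = "pen";                                // step two: the word read from the second file
        Pattern location = Pattern.compile("([A-Z][A-Z])"); // step four's pattern

        int count = 0;
        for (Map.Entry<String, String> e : records.entrySet()) {
            if (e.getValue().contains(word)) {              // step three: search for the word
                Matcher m = location.matcher(e.getValue());
                String loc = m.find() ? m.group(1) : "?";
                System.out.println(word + " found with serial " + e.getKey() + " at " + loc);
                count++;
            }
        }
        System.out.println(word + " appears " + count + " time(s) in file1");
    }
}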