extract strings from a database table - java

I have a database table containing source code (not a traditional language) which I want to parse using regex. Should I proceed row by row, or should I copy all of the rows to a text file and process that with the regex?

I think it depends on the length of the code. If the code is short, it's better to parse it in one pass (read all the rows at once). But if the code is large, it's better to parse it chunk by chunk: say, 100 rows, then another 100 rows, and so on. Ultimately it comes down to parsing performance.
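For the chunked approach, here is a minimal Java sketch, assuming a MySQL-style database with LIMIT/OFFSET paging; the connection string, table name (source_code), column name (code), and regex are all placeholders, and real paging would want an ORDER BY for stable results:

import java.sql.*;
import java.util.regex.*;

public class ChunkedParser {
    public static void main(String[] args) throws SQLException {
        // Compile the pattern once, outside the loop.
        Pattern pattern = Pattern.compile("\\bfoo\\b"); // placeholder regex

        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT code FROM source_code LIMIT ? OFFSET ?")) {
            int chunkSize = 100;
            for (int offset = 0; ; offset += chunkSize) {
                ps.setInt(1, chunkSize);
                ps.setInt(2, offset);
                int rows = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        Matcher m = pattern.matcher(rs.getString("code"));
                        while (m.find()) {
                            System.out.println(m.group()); // handle each match
                        }
                    }
                }
                if (rows < chunkSize) {
                    break; // last (partial) chunk processed
                }
            }
        }
    }
}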

Related

Efficient way to read very wide dataset in scala or java [duplicate]

I am getting a fixed-width .txt source file from which I need to extract 20K columns.
For lack of libraries to process fixed-width files using Spark, I have developed code which extracts the fields from fixed-width text files.
The code reads the text file as an RDD with
sparkContext.textFile("abc.txt")
then reads a JSON schema and gets the column names and the width of each column.
In the function I take the fixed-length line and, using the start and end positions, call substring to build an array of fields.
I map the function over the RDD, convert the resulting RDD to a DataFrame, map the column names, and write to Parquet.
The representative code:

val rdd1 = spark.sparkContext.textFile("file1")
import spark.implicits._

// Slice one fixed-width line into its fields using the column lengths.
def substrString(line: String, colLength: Seq[Int]): Seq[String] = {
  var now = 0
  val collector = new Array[String](colLength.length)
  for (k <- 0 until colLength.length) {
    collector(k) = line.substring(now, now + colLength(k))
    now = now + colLength(k)
  }
  collector.toSeq
}

// ColLengthSeq is read from another schema file and holds the column lengths;
// column_seq holds the column names from the JSON schema.
val stringArray = rdd1.map(substrString(_, ColLengthSeq))

stringArray.toDF("StringCol")
  .select((0 until ColCount).map(j => $"StringCol"(j) as column_seq(j)): _*)
  .write.mode("overwrite").parquet("c:\\home\\")
This code works fine with files that have a small number of columns, but it takes a lot of time and resources with 20K columns, and the runtime grows as the number of columns increases.
Has anyone faced such an issue with a large number of columns? I need suggestions on performance tuning: how can I tune this job or code?

reading very wide data set performance optimization [duplicate]


java solution for hashing lines that contain variables in .csv

I have a file that represents a table, recorded in .csv or a similar format. The table may include missing values.
I am looking for a solution (preferably in Java) that processes the file incrementally, without loading everything into memory, as the file can be huge. I need to identify duplicate records in the file, with the ability to specify which columns to exclude from consideration, and then produce output that groups those duplicate records. I would append a group number to each record and write the output in the same format (.csv), sorted by group number.
I hope an effective solution can be found with a hash function: for example, reading all lines and storing, with each line number, a hash calculated over the set of columns I provide as input.
Any ideas?
OK, here is the paper that holds the key to the answer: P. Gopalan & J. Radhakrishnan, "Finding duplicates in a data stream".
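Before going to the streaming algorithm in that paper, here is a minimal in-memory Java sketch of the hashing idea, assuming the hash map fits in RAM; the naive comma split does not handle quoted commas, and since hash collisions could group non-duplicates, real duplicates should be confirmed by comparing the lines:

import java.io.*;
import java.util.*;

public class DuplicateGrouper {
    // Hash one CSV line over the columns that are NOT excluded.
    static long keyHash(String line, Set<Integer> excluded) {
        String[] cols = line.split(",", -1); // naive split; no quote handling
        long h = 1125899906842597L;          // simple polynomial hash
        for (int i = 0; i < cols.length; i++) {
            if (excluded.contains(i)) continue;
            for (char c : cols[i].toCharArray()) h = 31 * h + c;
            h = 31 * h + ',';                // field separator in the key
        }
        return h;
    }

    public static void main(String[] args) throws IOException {
        Set<Integer> excluded = Set.of(0);   // e.g. ignore the first column
        Map<Long, List<Integer>> groups = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.csv"))) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                groups.computeIfAbsent(keyHash(line, excluded), k -> new ArrayList<>())
                      .add(lineNo++);
            }
        }
        int groupNo = 0;
        for (List<Integer> members : groups.values()) {
            if (members.size() > 1) {
                System.out.println("group " + (groupNo++) + ": lines " + members);
            }
        }
    }
}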

Faster way to use data from a String array?

Currently, I have a table being read from like so:
ps = con.prepareStatement("SELECT * FROM testrow;");
rs = ps.executeQuery();
rs.next();
// skills column format: "id,spamt,mastery;id,spamt,mastery;..."
String[] skills = rs.getString("skills").split(";");
String[] skillInfo;
for (int i = 0; i < skills.length; i++) {
    skillInfo = skills[i].split(",");
    newid.add(Integer.parseInt(skillInfo[0]));
    newspamt.add(Byte.parseByte(skillInfo[1]));
    mastery.add(Byte.parseByte(skillInfo[2]));
}
rs.close();
ps.close();
The information is saved to the database by using StringBuilder to build a single string of all the numbers that need to be stored, in the format number1,number2,number3;
I had written a test project to see if that method would be faster than using MySQL's batch method, and it beat MySQL by roughly 3 seconds. The only problem I'm facing now is reading the information back: MySQL completes the job in a few milliseconds, whereas splitting the data on the ";" character and then splitting each piece on the "," character inside a loop takes about 3 to 5 seconds.
Is there any way I can reduce the amount of time it takes to load the data using String[], or possibly another method?
Do not store serialized arrays in database fields. Use 3NF?
Do you read the information more often than you write it? If so (most likely), then optimising the write is emphasising the wrong end of the operation. Why not store the info in separate columns and thus avoid splitting (i.e. normalise your data)?
If you can't do that, can you load the data in one thread and hand it off to another thread for splitting/storing? I.e. you read the data in one thread and, for each line, pass it through (say) a BlockingQueue to another thread that splits/stores it.
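A minimal sketch of that hand-off, assuming the rows arrive as raw strings; the queue capacity and the poison-pill sentinel are illustrative choices:

import java.util.concurrent.*;

public class PipelineDemo {
    private static final String POISON = "\u0000EOF"; // sentinel marking end of input

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        // Consumer: splits each row and "stores" it (storage elided here).
        Thread consumer = new Thread(() -> {
            try {
                String row;
                while (!(row = queue.take()).equals(POISON)) {
                    String[] parts = row.split(",");
                    // ... store parts ...
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Producer: reads rows (simulated here) and enqueues them.
        for (int i = 0; i < 10; i++) {
            queue.put(i + ",2,3");
        }
        queue.put(POISON); // tell the consumer we are done
        consumer.join();
    }
}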
in the format of number1,number2,number3
Consider normalising the table, giving one number per row.
String.split uses a regular expression internally. I'm not sure how it's implemented, but chances are it is quite CPU-heavy. Try implementing your own split method that matches on a char value instead of a regular expression.
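A minimal sketch of such a char-based split (the class and method names are illustrative):

import java.util.ArrayList;
import java.util.List;

public final class FastSplit {
    // Split on a single character without compiling a regular expression.
    public static List<String> split(String s, char sep) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == sep) {
                parts.add(s.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(s.substring(start)); // trailing segment
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("1,2,3", ',')); // [1, 2, 3]
    }
}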
Drop the index while inserting; that'll make it faster.
Of course, this is only an option for a batch load, not for 500-per-second transactions.
The most obvious alternative method is to have a separate skills table with rows and columns instead of a single field of concatenated values. I know it looks like a big change if you've already got data to migrate, but it's worth the effort for so many reasons.
I recommend that instead of using the split method, you use a precompiled regular expression, especially in the loop.
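For example, a minimal sketch where both patterns are compiled once and reused for every row:

import java.util.regex.Pattern;

public class PrecompiledSplit {
    // Compiled once, reused everywhere, instead of recompiling inside String.split.
    private static final Pattern SEMI = Pattern.compile(";");
    private static final Pattern COMMA = Pattern.compile(",");

    public static void main(String[] args) {
        String data = "1,2,3;4,5,6;";
        for (String skill : SEMI.split(data)) {
            String[] skillInfo = COMMA.split(skill);
            System.out.println(skillInfo[0] + " " + skillInfo[1] + " " + skillInfo[2]);
        }
    }
}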

Working with a giant matrix with Java

I have read a related question here: link text
It was suggested there to work with a giant file and use RandomAccessFile.
My problem is that the matrix (consisting of "0"s and "1"s, not sparse) can be really huge: for example, a row size could be 10^10000. I need an efficient way to store such a matrix, and I need to work with the file (if I store the matrix in it) in the following way:
Say I have a giant file which contains sequences of numbers. Numbers within a sequence are separated by ","; the first number gives the row number, and the remaining numbers give the positions in the matrix where "1"s occur. Sequences are separated by the symbol "|". In addition, the symbol "||" divides all of the sequences into two groups. (That is a view of two matrices. Maybe it is not efficient, but I don't know a better way. Do you have any ideas?) I have to read, for example, 100 numbers from each row of the first group (extract a submatrix) and determine from them which rows I need to read from the second group.
So I need the function seek(). Would it work with such a giant file?
I am a newbie. Maybe there are efficient ways to store and read such data?
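For reference on the seek() question: RandomAccessFile.seek takes a long byte offset, so it is not limited to 2 GB (only by what the file system allows). A minimal sketch, where the file name and offset are placeholders:

import java.io.IOException;
import java.io.RandomAccessFile;

public class SeekDemo {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("matrix.dat", "r")) {
            raf.seek(5_000_000_000L);     // jump to an absolute byte offset (> 2 GB)
            byte[] buf = new byte[4096];
            int read = raf.read(buf);     // read a block starting there
            System.out.println("read " + read + " bytes");
        }
    }
}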
There are about 10^80 atoms in the observable universe. So even if you could store one bit in each atom, you would need about 10^9920 universes the same size as ours, just to store one row.
How many rows were you considering? You will need 10^9920 universes per row.
Hopefully you mean 10,000 entries and not 10^10000. Then you could use the BitSet class to store it all in RAM (or you could use something like Hadoop).
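A minimal BitSet sketch for a row of 10,000 bits (using the hoped-for size above):

import java.util.BitSet;

public class RowDemo {
    public static void main(String[] args) {
        int rowSize = 10_000;
        BitSet row = new BitSet(rowSize);      // all bits start as 0
        row.set(42);                           // put a "1" at position 42
        row.set(9_999);
        System.out.println(row.get(42));       // true
        System.out.println(row.cardinality()); // number of 1s: 2
    }
}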
