Limitation of OpenCSV Reader - Java

I used the OpenCSV Reader for Java to load my CSV file (file size: 1.47 GB (1,585,965,952 bytes)).
However, my code only manages to insert 10,950 records into the PostgreSQL database.
CSVReader csvReader = new CSVReader(new FileReader(csvFilename));
String[] row = null;
String sqlInsertCSV = "insert into ip2location_tmp_test "
        + "(ip_from, ip_to, xxxxx, "
        + "xxxxx, xxxxx, xxxxx, "
        + "xxxxx, xxxxx, xxxxx, xxxxx, xxxxx, xxxxx, xxxxx, xxxxx)"
        + " VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
while ((row = csvReader.readNext()) != null) {
    PreparedStatement insertCSV = conn.prepareStatement(sqlInsertCSV);
    insertCSV.setLong(1, Long.parseLong(row[0]));
    ....
    ....
    insertCSV.setString(14, row[13]); // usage_type
    insertCSV.executeUpdate();
}
csvReader.close();
PreparedStatement insertCSV = conn.prepareStatement(sqlInsertCSV);
insertCSV.executeUpdate();
}
Is there any limitation in OpenCSV?
I need to use the setString method to handle single quotes in values inserted into PostgreSQL.

There is no error; it just stops like that.
Hi Craig,
the COPY command needs superuser privileges. I have tried it before.

First question: you are telling us how many bytes are in the file, but how many records (lines) are in the file? If the file has 10,950 lines, then everything is working great.
Second question: why do you have the prepareStatement/executeUpdate outside the while loop? It seems to me that would try to either insert the last record twice or insert an empty record, because once you are outside the while loop you have no data (the csvReader returned null).
With the record-at-a-time method you are using, there is no limit to the number of records you can read. Check your data file for accidental carriage returns around line 10,950.
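To make the fix concrete, here is a minimal, hedged sketch of the batched version: the statement is prepared once outside the loop and rows are batched instead of calling executeUpdate per row. The three-column table is a simplified stand-in for the question's fourteen columns; the class, method, and helper names are mine, not from the original post.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BulkInsert {
    static final int BATCH_SIZE = 1000;

    // Pure helper: flush after every BATCH_SIZE buffered rows (testable without a DB).
    static boolean shouldFlush(int rowsBuffered) {
        return rowsBuffered % BATCH_SIZE == 0;
    }

    // Prepare the statement ONCE and reuse it; batch instead of one executeUpdate per row.
    static void insertAll(Connection conn, List<String[]> rows) throws SQLException {
        String sql = "insert into ip2location_tmp_test (ip_from, ip_to, usage_type) values (?,?,?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int buffered = 0;
            for (String[] row : rows) {
                ps.setLong(1, Long.parseLong(row[0]));
                ps.setLong(2, Long.parseLong(row[1]));
                ps.setString(3, row[2]); // setString still handles single quotes safely
                ps.addBatch();
                if (shouldFlush(++buffered)) {
                    ps.executeBatch();
                }
            }
            ps.executeBatch(); // flush the final partial batch
        }
    }
}
```

Reusing one PreparedStatement also lets the driver cache the parsed plan, which matters over 1.5 GB of rows.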

Related

Parse huge CSV file

I have a huge CSV file. I need to read it, validate each row, and write it to a database. After some research, I found this solution:
// configure the input format
CsvParserSettings settings = new CsvParserSettings();
// get an iterator
CsvParser parser = new CsvParser(settings);
Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();
//connect to the database and create an insert statement
Connection connection = getYourDatabaseConnectionSomehow();
final int COLUMN_COUNT = 2;
PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)");
//run batch inserts of 1000 rows per batch
int batchSize = 0;
while (it.hasNext()) {
    // get the next row from the parser and set values in your statement
    String[] row = it.next();
    // validation
    if (!row[0].matches(someRegex)) {
        badDataList.add(row);
        continue;
    }
    for (int i = 0; i < COLUMN_COUNT; i++) {
        if (i < row.length) {
            statement.setObject(i + 1, row[i]);
        } else { // row in input is shorter than COLUMN_COUNT
            statement.setObject(i + 1, null);
        }
    }
    // add the values to the batch
    statement.addBatch();
    batchSize++;
    // once 1000 rows have made it into the batch, execute it
    if (batchSize == 1000) {
        statement.executeBatch();
        batchSize = 0;
    }
}
// the last batch probably won't have 1000 rows
if (batchSize > 0) {
    statement.executeBatch();
}
// or use jOOQ's Loader API (loadArrays)
context.loadInto("book")
        .batchAfter(500)
        .loadArrays(rows); // rows: the parsed records, e.g. a List<String[]>
However, it is still too slow because everything executes in the same thread. Is there any way to do it faster with multithreading?
Instead of iterating over records one by one, use commands such as LOAD DATA INFILE that import data in bulk:
JDBC: CSV raw data export/import from/to remote MySQL database using streams (SELECT INTO OUTFILE / LOAD DATA INFILE)
Note: as @XtremeBaumer said, each database vendor has its own command for bulk importing from files.
Validation can be done with different strategies. For example, if validation is possible in SQL, you can import the data into a temporary table and then select only the valid data into the target table.
Or you can validate the data in Java code and then bulk-import the validated data instead of inserting rows one by one.
First, you should close the statement and the connection; use try-with-resources. Then check (auto-)commit transaction handling:
connection.setAutoCommit(true);
In the same category would be a database lock on the table, should the database be in use elsewhere.
Regex is slow. Instead of
if (!row[0].matches(someRegex)) {
do
private static final Pattern SKIP_PATTERN = Pattern.compile(someRegex);
...
if (!SKIP_PATTERN.matcher(row[0]).matches()) { badDataList.add(row); continue; }
so the pattern is compiled once instead of on every call to matches.
If there is a running number such as an integer ID, the batch might be faster if you keep the number in a long and use statement.setLong(...).
If a column has a small, finite domain of values, then instead of 1000 different String instances you could use an identity map that maps each string to one canonical instance. I am not sure whether these two measures help much.
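The identity-map idea could be sketched like this: a tiny manual intern cache (the class name is hypothetical), so repeated values such as country codes reuse a single String instance instead of allocating thousands of equal copies.

```java
import java.util.HashMap;
import java.util.Map;

class StringCache {
    private final Map<String, String> cache = new HashMap<>();

    // Return a canonical instance for equal strings, like String.intern()
    // but with a cache we control and can clear when the import finishes.
    String canon(String s) {
        String existing = cache.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}
```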
Multithreading seems dubious and should be a last resort. You could have one thread parse the CSV into a queue while, at the same time, another thread consumes from the queue and writes to the database.
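The queue suggestion in the last sentence could be sketched with a BlockingQueue from the JDK. The comma split stands in for the real CSV parser and the list sink stands in for the batch insert; class and method names are mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PipelinedLoader {
    private static final String[] POISON = new String[0]; // end-of-input marker

    // A producer thread parses lines into a bounded queue; the caller's thread
    // consumes them, so parsing and "inserting" overlap.
    static List<String[]> run(List<String> csvLines) {
        BlockingQueue<String[]> queue = new ArrayBlockingQueue<>(1024);
        List<String[]> sink = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (String line : csvLines) {
                    queue.put(line.split(",")); // stands in for the CSV parser
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        try {
            String[] row;
            while ((row = queue.take()) != POISON) {
                sink.add(row); // real code would addBatch()/executeBatch() here
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sink;
    }
}
```

The bounded queue provides back-pressure: if the database is the bottleneck, the parser blocks instead of filling the heap.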

Processing large number of records from a file in Java

I have a million records in a CSV file with 3 columns: id, firstName, lastName. I have to process this file in Java and validate that id is unique and firstName is not null. If id is not unique and/or firstName is null, I have to write those records to an output file with a fourth column giving the reason ("id not unique"/"firstName is NULL"). Performance should be good. Please suggest the most effective way.
You can use a collection (ArrayList) to store all the IDs and, in a loop, check whether each one already exists. If it does, write it to the file.
The code should look like this:
if (!idList.contains(id)) {
    idList.add(id);
} else {
    writer.write(id);
}
The above code should work in a loop for all the records being read from the CSV file.
You can use the OpenCSV jar for the purpose you have specified; it is under the Apache 2.0 licence.
You can download the jar from
http://www.java2s.com/Code/Jar/o/Downloadopencsv22jar.htm
Below is the code for the same:
Reader reader = Files.newBufferedReader(Paths.get(INPUT_SAMPLE_CSV_FILE_PATH));
CSVReader csvReader = new CSVReader(reader);
Writer writer = Files.newBufferedWriter(Paths.get(OUTPUT_SAMPLE_CSV_FILE_PATH));
CSVWriter csvWriter = new CSVWriter(writer);
List<String[]> list = csvReader.readAll();
for (String[] row : list) {
    // assuming the first column is the id
    String id = row[0];
    // assuming the name is the second column
    String name = row[1];
    // assuming the lastName is the third column
    String lastName = row[2];
    // put your pattern here
    if (id == null || !id.matches("pattern") || name == null || !name.matches("pattern")) {
        String[] outPutData = new String[]{id, name, lastName, "Invalid Entry"};
        csvWriter.writeNext(outPutData);
    }
}
Let me know if this works or if you need further help or clarification.
If you want good performance, you should not use ArrayList.contains(element), which (as explained here) is O(n). Instead, I suggest you use a HashSet, as HashSet.contains(element) is O(1). In short, with an ArrayList you would perform on the order of 1,000,000^2 operations, while with a HashSet you would perform about 1,000,000.
In pseudo-code (to not give away the full answer and make you find the answer on your own) I would do this:
File outputFile
String[] columns
HashSet<String> ids

for (line in file):
    columns = line.split(',')
    if (ids.contains(columns.id)):
        outputFile.append(columns.id + " is not unique")
        continue
    if (columns.name == null):
        outputFile.append("first name is null!")
        continue
    ids.add(columns.id)
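A concrete, self-contained version of that pseudo-code might look like this, assuming simple comma-separated lines of the form id,firstName,lastName with no quoting (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RecordValidator {
    // Returns the rejected lines, annotated with a 4th column giving the reason;
    // valid lines are not included in the output.
    static List<String> findInvalid(List<String> lines) {
        Set<String> seenIds = new HashSet<>();   // contains()/add() are O(1)
        List<String> invalid = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split(",", -1); // -1 keeps empty trailing fields
            String id = cols[0];
            String firstName = cols.length > 1 ? cols[1] : "";
            if (!seenIds.add(id)) {              // add() is false on duplicates
                invalid.add(line + ",id not unique");
            } else if (firstName.isEmpty()) {
                invalid.add(line + ",firstName is NULL");
            }
        }
        return invalid;
    }
}
```

Streaming the file line by line through this method (instead of readAll) keeps memory bounded by the ID set, not the whole file.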

How to optimize the import of data from a flat file to a PostgreSQL DB?

Good morning to the community. I have a question: I need to import 14 million records containing a company's client information. The flat .txt file weighs 2.8 GB. I have developed a Java program that reads the flat file line by line, processes the information, and puts it into an object that is in turn inserted into a table in the PostgreSQL database. The thing is, I have calculated that it inserts 100,000 records in 112 minutes, so I am inserting in parts.
public static void main(String[] args) {
    // PROCESSING 100,000 records in 112 minutes
    // PROCESSING 1,000,000 records in 770 minutes = 18.66 hours
    loadData(0L, 0L, 100000L);
}

/**
 * Loads a number of records depending on the input parameters.
 * @param counterInitial - initial counter, type long.
 * @param loadInitial - initial load, type long.
 * @param loadLimit - load limit, type long.
 */
private static void loadData(long counterInitial, long loadInitial, long loadLimit) {
    Session session = HibernateUtil.getSessionFactory().openSession();
    try {
        FileInputStream fstream = new FileInputStream("C:\\sppadron.txt");
        DataInputStream entrada = new DataInputStream(fstream);
        BufferedReader buffer = new BufferedReader(new InputStreamReader(entrada));
        String strLinea;
        while ((strLinea = buffer.readLine()) != null) {
            if (counterInitial > loadInitial) {
                if (counterInitial > loadLimit) {
                    break;
                }
                Sppadron spadron = new Sppadron();
                spadron.setSpId(counterInitial);
                spadron.setSpNle(strLinea.substring(0, 9).trim());
                spadron.setSpLib(strLinea.substring(9, 16).trim());
                spadron.setSpDep(strLinea.substring(16, 19).trim());
                spadron.setSpPrv(strLinea.substring(19, 22).trim());
                spadron.setSpDst(strLinea.substring(22, 25).trim());
                spadron.setSpApp(strLinea.substring(25, 66).trim());
                spadron.setSpApm(strLinea.substring(66, 107).trim());
                spadron.setSpNom(strLinea.substring(107, 143).trim());
                String cadenaGriSecDoc = strLinea.substring(143, strLinea.length()).trim();
                String[] tokensVal = cadenaGriSecDoc.split("\\s+");
                if (tokensVal.length == 5) {
                    spadron.setSpNac(tokensVal[0]);
                    spadron.setSpSex(tokensVal[1]);
                    spadron.setSpGri(tokensVal[2]);
                    spadron.setSpSec(tokensVal[3]);
                    spadron.setSpDoc(tokensVal[4]);
                } else {
                    spadron.setSpNac(tokensVal[0]);
                    spadron.setSpSex(tokensVal[1]);
                    spadron.setSpGri(tokensVal[2]);
                    spadron.setSpSec(null);
                    spadron.setSpDoc(tokensVal[3]);
                }
                try {
                    session.getTransaction().begin();
                    session.save(spadron); // insert
                    session.getTransaction().commit();
                } catch (Exception e) {
                    session.getTransaction().rollback();
                    e.printStackTrace();
                }
            }
            counterInitial++;
        }
        entrada.close();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        session.close();
    }
}
The main issue is that, as you can see in my code, when I insert the first million records the parameters are: loadData(0L, 0L, 1000000L);
and when I insert the next million records they would be: loadData(0L, 1000000L, 2000000L);
This causes the first one million records to be read and skipped all over again, and only when the counter reaches 1,000,001 does it actually begin inserting the following records. Can someone give me a more optimal suggestion for inserting the records, given that the information must be processed as shown in the code above?
See How to speed up insertion performance in PostgreSQL .
The first thing you should do is bypass Hibernate. ORMs are convenient, but you pay a price in speed for that convenience, especially with bulk operations.
You could group your inserts into reasonable sized transactions and use multi-valued inserts, using a JDBC PreparedStatement.
Personally, though, I'd use PgJDBC's support for the COPY protocol to do the inserts more directly. Unwrap your Hibernate Session object to get the underlying java.sql.Connection, get the PGConnection interface for it, call getCopyAPI() to get the CopyManager, and use copyIn to feed your data into the DB.
Since it looks like your data isn't in CSV form but fixed-width field form, what you'll need to do is start a thread that reads your data from the file, converts each datum into CSV form suitable for PostgreSQL input, and writes it to a buffer that copyIn can consume with the passed Reader. This sounds more complicated than it is, and there are lots of examples of Java producer/consumer threading implementations using java.io.Reader and java.io.Writer interfaces out there.
It's possible you may instead be able to write a filter for the Reader that wraps the underlying file reader and transforms each line. This would be much simpler than producer/consumer threading. Research it as the preferred option first.
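A hedged sketch of the conversion step only: the field widths below are borrowed from the first three substring calls in the question (not the full record layout), and the escaping is the minimal CSV quoting that COPY ... WITH (FORMAT csv) accepts. The resulting lines could be streamed to CopyManager.copyIn through any Reader, e.g. a filter Reader as suggested above.

```java
class FixedWidthToCsv {
    // Convert one fixed-width line into a COPY-friendly CSV line.
    // Illustrative widths: first three fields of 9, 7 and 3 characters.
    static String toCsvLine(String line) {
        String nle = line.substring(0, 9).trim();
        String lib = line.substring(9, 16).trim();
        String dep = line.substring(16, 19).trim();
        return escape(nle) + "," + escape(lib) + "," + escape(dep);
    }

    // Minimal CSV escaping: quote fields containing the delimiter or quotes,
    // doubling embedded quotes, as COPY's csv format expects.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }
}
```

The COPY command itself would look something like COPY sppadron_table FROM STDIN WITH (FORMAT csv) (table name assumed), issued via copyIn with a Reader over the converted stream.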

How to read data from a CSV if it contains more separators than expected?

I use CsvJDBC to read data from a CSV. I get the CSV from a web service request, so it is not loaded from a file. I adjust these properties:
Properties props = new java.util.Properties();
props.put("separator", ";"); // separator is a semicolon
props.put("fileExtension", ".txt"); // file extension is .txt
props.put("charset", "UTF-8"); // UTF-8
My sample1.txt contains this data:
code;description
c01;d01
c02;d02
My sample2.txt contains this data:
code;description
c01;d01
c02;d0;;;;;2
Deleting the headers from the CSV is optional for me, but changing the semicolon separator is not.
EDIT: My query for resultSet: SELECT * FROM myCSV
I want to read the code column in sample1.txt and sample2.txt with:
resultSet.getString(1)
and read the full description column with its many semicolons (d0;;;;;2). Is this possible with the CsvJdbc driver, or do I need to change drivers?
Thank you for any advice!
This is a problem that occurs when you have messy, invalid input, which you need to try to interpret, that's being read by a too-high-level package that only handles clean input. A similar example is trying to read arbitrary HTML with an XML parser - close, but no cigar.
You can guess where I'm going: you need to pre-process your input.
The preprocessing may be very easy if you can make some assumptions about the data - for example, if there are guaranteed to be no quoted semi-colons in the first column.
You could try SuperCSV. We have implemented such a solution in our project. More on this can be found at http://supercsv.sourceforge.net/
and
Using CsvBeanReader to read a CSV file with a variable number of columns
Finally, this problem was solved without the CsvJDBC or SuperCSV drivers. Those drivers work fine: you can query data from a CSV file and they have many features. In my case, though, I don't need to query data from the CSV. Unfortunately, the description column sometimes contains one or more semicolons, which is my separator.
First I checked the code in the answer from @Maher Abuthraa and modified it to:
private String createDescriptionFromResult(ResultSet resultSet, int columnCount) throws SQLException {
    if (columnCount > 2) {
        StringBuilder data_list = new StringBuilder();
        for (int ii = 2; ii <= columnCount; ii++) {
            data_list.append(resultSet.getString(ii));
            if (ii != columnCount)
                data_list.append(";");
        }
        // data_list has all data from all the indexes you are looking for
        return data_list.toString();
    } else {
        // use the standard way
        return resultSet.getString(2);
    }
}
The loop starts from 2 because column 1 is code and only the description column contains the extra semicolons. The CsvJdbc driver splits columns on the ; separator, so those semicolons disappear from the column data; I therefore explicitly add semicolons back into the description, except after the last column, because it is not relevant in my case.
This code works fine, but it did not solve my whole problem. When I declared two columns in the CSV header, I got an error on any row containing more than two semicolons. So I tried to ignore the headers, or to add many column names (or simply ;) to the header. In SuperCSV the ignore-headers option works fine.
My colleague's opinion was: you don't need a CSV driver at all, because what you are trying to load is not really CSV if the separator is sometimes part of the data.
I think my colleague is right, and I loaded the CSV data with the following code:
InputStream in = null;
try {
    in = new ByteArrayInputStream(csvData);
    List lines = IOUtils.readLines(in, "UTF-8");
    Iterator it = lines.iterator();
    String line = "";
    while (it.hasNext()) {
        line = (String) it.next();
        String description = null;
        String code = null;
        String[] columns = line.split(";");
        if (columns.length >= 2) {
            code = columns[0];
            String[] dest = new String[columns.length - 1];
            System.arraycopy(columns, 1, dest, 0, columns.length - 1);
            description = org.apache.commons.lang.StringUtils.join(dest, ";");
            (...)
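As a small aside on the split-then-join approach above: String.split takes a limit argument, so the description can be kept intact, extra semicolons and all, without the arraycopy/join step. A sketch (class and method names are mine):

```java
class LineParser {
    // Split into at most 2 parts: everything after the first ';'
    // stays in the description, embedded semicolons included.
    static String[] codeAndDescription(String line) {
        return line.split(";", 2);
    }
}
```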
OK, my solution is to read all the fields if there are more than 2 columns, like:
int ccc = meta.getColumnCount();
if (ccc > 2) {
    ArrayList<String> data_list = new ArrayList<String>();
    for (int ii = 1; ii < ccc; ii++) {
        data_list.add(resultSet.getString(ii));
    }
    // data_list has all data from all the indexes you are looking for
} else {
    // use the standard way
    resultSet.getString(1);
}
If the table is defined to have as many columns as there could be semi-colons in the source, ignoring the initial column definitions, then the excess semi-colons would be consumed by the database driver automatically.
The most likely reason for them to appear in the final column is because the parser returns the balance of the row to the terminator in the field.
Simply increasing the number of columns in the table to match the maximum possible in the input will avoid the need for custom parsing in the program. Try:
code;description;dummy1;dummy2;dummy3;dummy4;dummy5
c01;d01
c02;d0;;;;;2
Then, the additional ';' delimiters will be consumed by the parser correctly.

error during grouping files based on the date field

I have a large file with 10,000 rows, and each row has a date appended at the end. All the fields in a row are tab-separated. There are 10 distinct dates, randomly assigned across the 10,000 rows. I am writing Java code to write all rows with the same date into a separate file, so that each file contains the rows for that date.
I am trying to do it with string manipulation, but when I sort the rows based on the date, I get an error on the date literal: it says the literal is out of range. Here is the code I used. Please have a look and let me know if this is the right approach; if not, kindly suggest a better one. I tried changing the data type to Long, but I still get the same error. A row in the file looks something like this:
Each field is tab separated and the fields are:
business id, category, city, biz.name, longitude, state, latitude, type, date
qarobAbxGSHI7ygf1f7a_Q  ["Sandwiches","Restaurants"]  Gilbert  Jersey Mike's Subs  -111.8120071  AZ  3.5  33.3788385  business  06012010
The code is:
File f = new File(fn);
if (f.exists() && f.length() > 0) {
    BufferedReader br = new BufferedReader(new FileReader(fn));
    BufferedWriter bw = new BufferedWriter(new FileWriter("FilteredDate.txt"));
    String s = null;
    while ((s = br.readLine()) != null) {
        String[] st = s.split("\t");
        if (Integer.parseInt(st[13]) == 06012010) {
Thanks a lot for your time..
Try this:
List<String> sampleList = new ArrayList<String>();
sampleList.add("06012012");
sampleList.add("06012013");
sampleList.add("06012014");
sampleList.add("06012015");
//
//
String[] sampleArray = s.split(" ");
if (sampleArray != null) {
    String sample = sampleArray[sampleArray.length - 1];
    if (sampleList.contains(sample)) {
        stringBuilder.append(sample + "\n");
    }
}
I suggest not using split, but rather
String date = s.substring(s.lastIndexOf('\t') + 1);
(the + 1 skips the tab itself).
In any case, you take st[13] when, as far as I can see, you only have 9 columns; you might just need st[8].
One last thing: look at this post to learn what the literal 06012010 really means in Java (a leading zero makes it an octal literal).
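Since a leading zero makes 06012010 an octal int literal (and a literal with an 8 or 9 digit, such as 08152010, will not even compile), comparing dates as Strings sidesteps the problem entirely. A hedged sketch of grouping rows by the last tab-separated field; writing each group out to its own file is left out, and the names are mine:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DateGrouper {
    // Group tab-separated rows by their last field (the date),
    // comparing dates as Strings rather than int literals.
    static Map<String, List<String>> groupByDate(List<String> rows) {
        Map<String, List<String>> byDate = new HashMap<>();
        for (String row : rows) {
            String date = row.substring(row.lastIndexOf('\t') + 1);
            byDate.computeIfAbsent(date, d -> new java.util.ArrayList<>()).add(row);
        }
        return byDate; // real code would write each list to its date's file
    }
}
```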
