Parse huge CSV file

Parse huge CSV file - java

I have a huge CSV file
need to read it
validate
write to db
After research, I found this solution
//configure input format using
CsvParserSettings settings = new CsvParserSettings();
//get an interator
CsvParser parser = new CsvParser(settings);
Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();
//connect to the database and create an insert statement
Connection connection = getYourDatabaseConnectionSomehow();
final int COLUMN_COUNT = 2;
PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)");
//run batch inserts of 1000 rows per batch
int batchSize = 0;
while (it.hasNext()) {
//get next row from parser and set values in your statement
String[] row = it.next();
//validation
if (!row[0].matches(some regex)){
badDataList.add(row);
conitunue;
}
for(int i = 0; i < COLUMN_COUNT; i++){
if(i < row.length){
statement.setObject(i + 1, row[i]);
} else { //row in input is shorter than COLUMN_COUNT
statement.setObject(i + 1, null);
}
}
//add the values to the batch
statement.addBatch();
batchSize++;
//once 1000 rows made into the batch, execute it
if (batchSize == 1000) {
statement.executeBatch();
batchSize = 0;
}
}
// the last batch probably won't have 1000 rows.
if (batchSize > 0) {
statement.executeBatch();
}
// or use jook#loadArrays
context.loadInto("book")
.batchAfter(500)
.loadArrays(new ArrayList <String[]>)
However, it is still too slow because it's executing in same thread. Is there any way to do it faster with multi-threading?

Instead of iterating records one by one, use commands such as LOAD DATA INFILE that imports data in bulk:
JDBC: CSV raw data export/import from/to remote MySQL database using streams (SELECT INTO OUTFILE / LOAD DATA INFILE)
Note: As #XtremeBaumer said each database vendor has its own command for bulk importing from files.
Validation can be done with different strategies, for example if validation is possible using SQL, you can import data to a temporary table and then select valid data to target table.
Or you can validate data using Java code then use bulk import on validated data instead of importing them one by one.

First you should close statement and connection, use try-with.-resources.. Then check (auto)commit transactionality.
connection.setAutoCommit(true);
In the same category would be a database lock on the table, should the database be in use.
Regex is slow, instead:
if (!row[0].matches(some regex)) {
do
private static Pattern SKIP_PATTERN = Pattern.compile(some regex);
...
if (SKIP_PATTERN.matcher(row[0]).matches()) { continue; }
If there is a running number like an integer ID, the batch might be better by keeping the number in a long (statement.setLong(...)).
If the value is a short finite domain, instead of 1000 different String instances, you could use an identity map of string to the same string. Not sure whethe these two measures help.
Multithreading seems dubious and should be the last resource. You could write to a queue parsing the CSV and at the same time consume from it to the database.

Related

My Customer data is being truncated when added to my List [duplicate]

I am running data.bat file with the following lines:
Rem Tis batch file will populate tables
cd\program files\Microsoft SQL Server\MSSQL
osql -U sa -P Password -d MyBusiness -i c:\data.sql
The contents of the data.sql file is:
insert Customers
(CustomerID, CompanyName, Phone)
Values('101','Southwinds','19126602729')
There are 8 more similar lines for adding records.
When I run this with start > run > cmd > c:\data.bat, I get this error message:
1>2>3>4>5>....<1 row affected>
Msg 8152, Level 16, State 4, Server SP1001, Line 1
string or binary data would be truncated.
<1 row affected>
<1 row affected>
<1 row affected>
<1 row affected>
<1 row affected>
<1 row affected>
Also, I am a newbie obviously, but what do Level #, and state # mean, and how do I look up error messages such as the one above: 8152?

From #gmmastros's answer
Whenever you see the message....
string or binary data would be truncated
Think to yourself... The field is NOT big enough to hold my data.
Check the table structure for the customers table. I think you'll find that the length of one or more fields is NOT big enough to hold the data you are trying to insert. For example, if the Phone field is a varchar(8) field, and you try to put 11 characters in to it, you will get this error.

I had this issue although data length was shorter than the field length.
It turned out that the problem was having another log table (for audit trail), filled by a trigger on the main table, where the column size also had to be changed.

In one of the INSERT statements you are attempting to insert a too long string into a string (varchar or nvarchar) column.
If it's not obvious which INSERT is the offender by a mere look at the script, you could count the <1 row affected> lines that occur before the error message. The obtained number plus one gives you the statement number. In your case it seems to be the second INSERT that produces the error.

Just want to contribute with additional information: I had the same issue and it was because of the field wasn't big enough for the incoming data and this thread helped me to solve it (the top answer clarifies it all).
BUT it is very important to know what are the possible reasons that may cause it.
In my case i was creating the table with a field like this:
Select '' as Period, * From Transactions Into #NewTable
Therefore the field "Period" had a length of Zero and causing the Insert operations to fail. I changed it to "XXXXXX" that is the length of the incoming data and it now worked properly (because field now had a lentgh of 6).
I hope this help anyone with same issue :)

Some of your data cannot fit into your database column (small). It is not easy to find what is wrong. If you use C# and Linq2Sql, you can list the field which would be truncated:
First create helper class:
public class SqlTruncationExceptionWithDetails : ArgumentOutOfRangeException
{
public SqlTruncationExceptionWithDetails(System.Data.SqlClient.SqlException inner, DataContext context)
: base(inner.Message + " " + GetSqlTruncationExceptionWithDetailsString(context))
{
}
/// <summary>
/// PArt of code from following link
/// http://stackoverflow.com/questions/3666954/string-or-binary-data-would-be-truncated-linq-exception-cant-find-which-fiel
/// </summary>
/// <param name="context"></param>
/// <returns></returns>
static string GetSqlTruncationExceptionWithDetailsString(DataContext context)
{
StringBuilder sb = new StringBuilder();
foreach (object update in context.GetChangeSet().Updates)
{
FindLongStrings(update, sb);
}
foreach (object insert in context.GetChangeSet().Inserts)
{
FindLongStrings(insert, sb);
}
return sb.ToString();
}
public static void FindLongStrings(object testObject, StringBuilder sb)
{
foreach (var propInfo in testObject.GetType().GetProperties())
{
foreach (System.Data.Linq.Mapping.ColumnAttribute attribute in propInfo.GetCustomAttributes(typeof(System.Data.Linq.Mapping.ColumnAttribute), true))
{
if (attribute.DbType.ToLower().Contains("varchar"))
{
string dbType = attribute.DbType.ToLower();
int numberStartIndex = dbType.IndexOf("varchar(") + 8;
int numberEndIndex = dbType.IndexOf(")", numberStartIndex);
string lengthString = dbType.Substring(numberStartIndex, (numberEndIndex - numberStartIndex));
int maxLength = 0;
int.TryParse(lengthString, out maxLength);
string currentValue = (string)propInfo.GetValue(testObject, null);
if (!string.IsNullOrEmpty(currentValue) && maxLength != 0 && currentValue.Length > maxLength)
{
//string is too long
sb.AppendLine(testObject.GetType().Name + "." + propInfo.Name + " " + currentValue + " Max: " + maxLength);
}
}
}
}
}
}
Then prepare the wrapper for SubmitChanges:
public static class DataContextExtensions
{
public static void SubmitChangesWithDetailException(this DataContext dataContext)
{
//http://stackoverflow.com/questions/3666954/string-or-binary-data-would-be-truncated-linq-exception-cant-find-which-fiel
try
{
//this can failed on data truncation
dataContext.SubmitChanges();
}
catch (SqlException sqlException) //when (sqlException.Message == "String or binary data would be truncated.")
{
if (sqlException.Message == "String or binary data would be truncated.") //only for EN windows - if you are running different window language, invoke the sqlException.getMessage on thread with EN culture
throw new SqlTruncationExceptionWithDetails(sqlException, dataContext);
else
throw;
}
}
}
Prepare global exception handler and log truncation details:
protected void Application_Error(object sender, EventArgs e)
{
Exception ex = Server.GetLastError();
string message = ex.Message;
//TODO - log to file
}
Finally use the code:
Datamodel.SubmitChangesWithDetailException();

Another situation in which you can get this error is the following:
I had the same error and the reason was that in an INSERT statement that received data from an UNION, the order of the columns was different from the original table. If you change the order in #table3 to a, b, c, you will fix the error.
select a, b, c into #table1
from #table0
insert into #table1
select a, b, c from #table2
union
select a, c, b from #table3

on sql server you can use SET ANSI_WARNINGS OFF like this:
using (SqlConnection conn = new SqlConnection("Data Source=XRAYGOAT\\SQLEXPRESS;Initial Catalog='Healthy Care';Integrated Security=True"))
{
conn.Open();
using (var trans = conn.BeginTransaction())
{
try
{
using cmd = new SqlCommand("", conn, trans))
{
cmd.CommandText = "SET ANSI_WARNINGS OFF";
cmd.ExecuteNonQuery();
cmd.CommandText = "YOUR INSERT HERE";
cmd.ExecuteNonQuery();
cmd.Parameters.Clear();
cmd.CommandText = "SET ANSI_WARNINGS ON";
cmd.ExecuteNonQuery();
trans.Commit();
}
}
catch (Exception)
{
trans.Rollback();
}
}
conn.Close();
}

I had the same issue. The length of my column was too short.
What you can do is either increase the length or shorten the text you want to put in the database.

Also had this problem occurring on the web application surface.
Eventually found out that the same error message comes from the SQL update statement in the specific table.
Finally then figured out that the column definition in the relating history table(s) did not map the original table column length of nvarchar types in some specific cases.

I had the same problem, even after increasing the size of the problematic columns in the table.
tl;dr: The length of the matching columns in corresponding Table Types may also need to be increased.
In my case, the error was coming from the Data Export service in Microsoft Dynamics CRM, which allows CRM data to be synced to an SQL Server DB or Azure SQL DB.
After a lengthy investigation, I concluded that the Data Export service must be using Table-Valued Parameters:
You can use table-valued parameters to send multiple rows of data to a Transact-SQL statement or a routine, such as a stored procedure or function, without creating a temporary table or many parameters.
As you can see in the documentation above, Table Types are used to create the data ingestion procedure:
CREATE TYPE LocationTableType AS TABLE (...);
CREATE PROCEDURE dbo.usp_InsertProductionLocation
#TVP LocationTableType READONLY
Unfortunately, there is no way to alter a Table Type, so it has to be dropped & recreated entirely. Since my table has over 300 fields (😱), I created a query to facilitate the creation of the corresponding Table Type based on the table's columns definition (just replace [table_name] with your table's name):
SELECT 'CREATE TYPE [table_name]Type AS TABLE (' + STRING_AGG(CAST(field AS VARCHAR(max)), ',' + CHAR(10)) + ');' AS create_type
FROM (
SELECT TOP 5000 COLUMN_NAME + ' ' + DATA_TYPE
+ IIF(CHARACTER_MAXIMUM_LENGTH IS NULL, '', CONCAT('(', IIF(CHARACTER_MAXIMUM_LENGTH = -1, 'max', CONCAT(CHARACTER_MAXIMUM_LENGTH,'')), ')'))
+ IIF(DATA_TYPE = 'decimal', CONCAT('(', NUMERIC_PRECISION, ',', NUMERIC_SCALE, ')'), '')
AS field
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = '[table_name]'
ORDER BY ORDINAL_POSITION) AS T;
After updating the Table Type, the Data Export service started functioning properly once again! :)

When I tried to execute my stored procedure I had the same problem because the size of the column that I need to add some data is shorter than the data I want to add.
You can increase the size of the column data type or reduce the length of your data.

A 2016/2017 update will show you the bad value and column.
A new trace flag will swap the old error for a new 2628 error and will print out the column and offending value. Traceflag 460 is available in the latest cumulative update for 2016 and 2017:
https://support.microsoft.com/en-sg/help/4468101/optional-replacement-for-string-or-binary-data-would-be-truncated
Just make sure that after you've installed the CU that you enable the trace flag, either globally/permanently on the server:
...or with DBCC TRACEON:
https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-traceon-trace-flags-transact-sql?view=sql-server-ver15

Another situation, in which this error may occur is in
SQL Server Management Studio. If you have "text" or "ntext" fields in your table,
no matter what kind of field you are updating (for example bit or integer).
Seems that the Studio does not load entire "ntext" fields and also updates ALL fields instead of the modified one.
To solve the problem, exclude "text" or "ntext" fields from the query in Management Studio

This Error Comes only When any of your field length is greater than the field length specified in sql server database table structure.
To overcome this issue you have to reduce the length of the field Value .
Or to increase the length of database table field .

If someone is encountering this error in a C# application, I have created a simple way of finding offending fields by:
Getting the column width of all the columns of a table where we're trying to make this insert/ update. (I'm getting this info directly from the database.)
Comparing the column widths to the width of the values we're trying to insert/ update.
Assumptions/ Limitations:
The column names of the table in the database match with the C# entity fields. For eg: If you have a column like this in database:
You need to have your Entity with the same column name:
public class SomeTable
{
// Other fields
public string SourceData { get; set; }
}
You're inserting/ updating 1 entity at a time. It'll be clearer in the demo code below. (If you're doing bulk inserts/ updates, you might want to either modify it or use some other solution.)
Step 1:
Get the column width of all the columns directly from the database:
// For this, I took help from Microsoft docs website:
// https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlconnection.getschema?view=netframework-4.7.2#System_Data_SqlClient_SqlConnection_GetSchema_System_String_System_String___
private static Dictionary<string, int> GetColumnSizesOfTableFromDatabase(string tableName, string connectionString)
{
var columnSizes = new Dictionary<string, int>();
using (var connection = new SqlConnection(connectionString))
{
// Connect to the database then retrieve the schema information.
connection.Open();
// You can specify the Catalog, Schema, Table Name, Column Name to get the specified column(s).
// You can use four restrictions for Column, so you should create a 4 members array.
String[] columnRestrictions = new String[4];
// For the array, 0-member represents Catalog; 1-member represents Schema;
// 2-member represents Table Name; 3-member represents Column Name.
// Now we specify the Table_Name and Column_Name of the columns what we want to get schema information.
columnRestrictions[2] = tableName;
DataTable allColumnsSchemaTable = connection.GetSchema("Columns", columnRestrictions);
foreach (DataRow row in allColumnsSchemaTable.Rows)
{
var columnName = row.Field<string>("COLUMN_NAME");
//var dataType = row.Field<string>("DATA_TYPE");
var characterMaxLength = row.Field<int?>("CHARACTER_MAXIMUM_LENGTH");
// I'm only capturing columns whose Datatype is "varchar" or "char", i.e. their CHARACTER_MAXIMUM_LENGTH won't be null.
if(characterMaxLength != null)
{
columnSizes.Add(columnName, characterMaxLength.Value);
}
}
connection.Close();
}
return columnSizes;
}
Step 2:
Compare the column widths with the width of the values we're trying to insert/ update:
public static Dictionary<string, string> FindLongBinaryOrStringFields<T>(T entity, string connectionString)
{
var tableName = typeof(T).Name;
Dictionary<string, string> longFields = new Dictionary<string, string>();
var objectProperties = GetProperties(entity);
//var fieldNames = objectProperties.Select(p => p.Name).ToList();
var actualDatabaseColumnSizes = GetColumnSizesOfTableFromDatabase(tableName, connectionString);
foreach (var dbColumn in actualDatabaseColumnSizes)
{
var maxLengthOfThisColumn = dbColumn.Value;
var currentValueOfThisField = objectProperties.Where(f => f.Name == dbColumn.Key).First()?.GetValue(entity, null)?.ToString();
if (!string.IsNullOrEmpty(currentValueOfThisField) && currentValueOfThisField.Length > maxLengthOfThisColumn)
{
longFields.Add(dbColumn.Key, $"'{dbColumn.Key}' column cannot take the value of '{currentValueOfThisField}' because the max length it can take is {maxLengthOfThisColumn}.");
}
}
return longFields;
}
public static List<PropertyInfo> GetProperties<T>(T entity)
{
//The DeclaredOnly flag makes sure you only get properties of the object, not from the classes it derives from.
var properties = entity.GetType()
.GetProperties(System.Reflection.BindingFlags.Public
| System.Reflection.BindingFlags.Instance
| System.Reflection.BindingFlags.DeclaredOnly)
.ToList();
return properties;
}
Demo:
Let's say we're trying to insert someTableEntity of SomeTable class that is modeled in our app like so:
public class SomeTable
{
[Key]
public long TicketID { get; set; }
public string SourceData { get; set; }
}
And it's inside our SomeDbContext like so:
public class SomeDbContext : DbContext
{
public DbSet<SomeTable> SomeTables { get; set; }
}
This table in Db has SourceData field as varchar(16) like so:
Now we'll try to insert value that is longer than 16 characters into this field and capture this information:
public void SaveSomeTableEntity()
{
var connectionString = "server=SERVER_NAME;database=DB_NAME;User ID=SOME_ID;Password=SOME_PASSWORD;Connection Timeout=200";
using (var context = new SomeDbContext(connectionString))
{
var someTableEntity = new SomeTable()
{
SourceData = "Blah-Blah-Blah-Blah-Blah-Blah"
};
context.SomeTables.Add(someTableEntity);
try
{
context.SaveChanges();
}
catch (Exception ex)
{
if (ex.GetBaseException().Message == "String or binary data would be truncated.\r\nThe statement has been terminated.")
{
var badFieldsReport = "";
List<string> badFields = new List<string>();
// YOU GOT YOUR FIELDS RIGHT HERE:
var longFields = FindLongBinaryOrStringFields(someTableEntity, connectionString);
foreach (var longField in longFields)
{
badFields.Add(longField.Key);
badFieldsReport += longField.Value + "\n";
}
}
else
throw;
}
}
}
The badFieldsReport will have this value:
'SourceData' column cannot take the value of
'Blah-Blah-Blah-Blah-Blah-Blah' because the max length it can take is
16.

Kevin Pope's comment under the accepted answer was what I needed.
The problem, in my case, was that I had triggers defined on my table that would insert update/insert transactions into an audit table, but the audit table had a data type mismatch where a column with VARCHAR(MAX) in the original table was stored as VARCHAR(1) in the audit table, so my triggers were failing when I would insert anything greater than VARCHAR(1) in the original table column and I would get this error message.

I used a different tactic, fields that are allocated 8K in some places. Here only about 50/100 are used.
declare #NVPN_list as table
nvpn varchar(50)
,nvpn_revision varchar(5)
,nvpn_iteration INT
,mpn_lifecycle varchar(30)
,mfr varchar(100)
,mpn varchar(50)
,mpn_revision varchar(5)
,mpn_iteration INT
-- ...
) INSERT INTO #NVPN_LIST
SELECT left(nvpn ,50) as nvpn
,left(nvpn_revision ,10) as nvpn_revision
,nvpn_iteration
,left(mpn_lifecycle ,30)
,left(mfr ,100)
,left(mpn ,50)
,left(mpn_revision ,5)
,mpn_iteration
,left(mfr_order_num ,50)
FROM [DASHBOARD].[dbo].[mpnAttributes] (NOLOCK) mpna
I wanted speed, since I have 1M total records, and load 28K of them.

This error may be due to less field size than your entered data.
For e.g. if you have data type nvarchar(7) and if your value is 'aaaaddddf' then error is shown as:
string or binary data would be truncated

You simply can't beat SQL Server on this.
You can insert into a new table like this:
select foo, bar
into tmp_new_table_to_dispose_later
from my_table
and compare the table definition with the real table you want to insert the data into.
Sometime it's helpful sometimes it's not.
If you try inserting in the final/real table from that temporary table it may just work (due to data conversion working differently than SSMS for example).
Another alternative is to insert the data in chunks, instead of inserting everything immediately you insert with top 1000 and you repeat the process, till you find a chunk with an error. At least you have better visibility on what's not fitting into the table.

cassandra-spring ingest command doesn't work

I've set up a cassandra cluster and work with the spring-cassandra framework 1.53. (http://docs.spring.io/spring-data/cassandra/docs/1.5.3.RELEASE/reference/html/)
I want to write millions of datasets into my cassandra cluster. The solution with executeAsync works good but the "ingest" command from the spring framework sounds interesting aswell.
The ingest method takes advantage of static PreparedStatements that are only prepared once for performance. Each record in your data set is bound to the same PreparedStatement, then executed asynchronously for high performance.
My code:
List<List<?>> session_time_ingest = new ArrayList<List<?>>();
for (Long tokenid: listTokenID) {
List<Session_Time_Table> tempListSessionTimeTable = repo_session_time.listFetchAggregationResultMinMaxTime(tokenid);
session_time_ingest.add(tempListSessionTimeTable);
}
cassandraTemplate.ingest("INSERT into session_time (sessionid, username, eserviceid, contextroot," +
" application_type, min_processingtime, max_processingtime, min_requesttime, max_requesttime)" +
" VALUES(?,?,?,?,?,?,?,?,?)", session_time_ingest);
Throws exception:
`Exception in thread "main" com.datastax.driver.core.exceptions.CodecNotFoundException: Codec not found for requested operation: [varchar <-> ...tracking.Tables.Session_Time_Table]
at com.datastax.driver.core.CodecRegistry.notFound(CodecRegistry.java:679)
at com.datastax.driver.core.CodecRegistry.createCodec(CodecRegistry.java:540)
at com.datastax.driver.core.CodecRegistry.findCodec(CodecRegistry.java:520)
at com.datastax.driver.core.CodecRegistry.codecFor(CodecRegistry.java:470)
at com.datastax.driver.core.AbstractGettableByIndexData.codecFor(AbstractGettableByIndexData.java:77)
at com.datastax.driver.core.BoundStatement.bind(BoundStatement.java:201)
at com.datastax.driver.core.DefaultPreparedStatement.bind(DefaultPreparedStatement.java:126)
at org.springframework.cassandra.core.CqlTemplate.ingest(CqlTemplate.java:1057)
at org.springframework.cassandra.core.CqlTemplate.ingest(CqlTemplate.java:1077)
at org.springframework.cassandra.core.CqlTemplate.ingest(CqlTemplate.java:1068)
at ...tracking.SessionAggregationApplication.main(SessionAggregationApplication.java:68)`
I coded exactly like in the spring-cassandra doku.. I've no idea how to map the values of my object to the values cassandra expects?!

Your Session_Time_Table class is probably a mapped POJO, but ingest methods do not use POJO mapping.
Instead you need to provide a matrix where each row contains as many arguments as there are variables to bind in your prepared statement, something along the lines of:
List<List<?>> rows = new ArrayList<List<?>>();
for (Long tokenid: listTokenID) {
Session_Time_Table obj = ... // obtain a Session_Time_Table instance
List<Object> row = new ArrayList<Object>();
row.add(obj.sessionid);
row.add(obj.username);
row.add(obj.eserviceid);
// etc. for all bound variables
rows.add(row);
}
cassandraTemplate.ingest(
"INSERT into session_time (sessionid, username, eserviceid, " +
"contextroot, application_type, min_processingtime, " +
"max_processingtime, min_requesttime, max_requesttime) " +
"VALUES(?,?,?,?,?,?,?,?,?)", rows);

Limitation of OpenCSV Reader-Java

I used OpenCSV Reader - java to load my CSV File (file size:-1.47 GB (1,585,965,952 bytes)).
However, inside my coding, whenever it only manage to insert 10950 record to PostgreSQL database.
CSVReader csvReader = new CSVReader(new FileReader(csvFilename));
String[] row = null;
String sqlInsertCSV = "insert into ip2location_tmp_test
(ip_from, ip_to, xxxxx, "
+ "xxxxx, xxxxx,xxxxx, "
+ "xxxxx,xxxxx, xxxxx, xxxxx, xxxxx, xxxxx,xxxxx,xxxxx)"
+ " VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
while((row = csvReader.readNext()) != null) {
PreparedStatement insertCSV = conn.prepareStatement(sqlInsertCSV);
insertCSV.setLong(1, Long.parseLong(row[0]));
....
....
insertCSV.setString(14, row[13]); // usage_type
insertCSV.executeUpdate();
}
csvReader.close();
PreparedStatement insertCSV = conn.prepareStatement(sqlInsertCSV);
insertCSV.executeUpdate();
}
Is there any limitation of OpenCSV?
I need to use setString function to cater for single quote in PostgreSQL.

It has no error. It just stop like that
Hi Craig,
COPY command need to be super user. Has tried it before

First question: you are telling us how many bytes are in the file but how many records (lines) are in the file? If the file has 10950 lines then everything is working great .
Second question: Why do you have the prepareStatement/executeUpdate outside the while loop? It seems to me that would try to either insert the last record twice or insert a empty record because once you are outside the while you have no data because the csvReader returned null.
Using the record at a time method you are using there is no limit to the number of records you can read. Check your data file to see if there is no accidental carriage returns in the file around line 10950.

How to optimize the import of data from a flat file to BD PostgreSQL?

Good morning to the community, I have a query you happen to have to import 14 million records containing the information of clients of a company.
Flat File. Txt weighs 2.8 GB, I have developed a java program that reads the flat file line by line, deal the information and put it in an object that in turn inserted into a table in the PostgreSQL database, the subject is that I have made a calculation that 100000 records inserted in a time of 112 minutes, but the issue is that I insert parts.
public static void main(String[] args) {
// PROCESSING 100,000 records in 112 minutes
  // PROCESSING 1,000,000 records in 770 minutes = 18.66 hours
loadData(0L, 0L, 100000L);
}
/**
* Load the number of records Depending on the input parameters.
* #param counterInitial - Initial counter, type long.
* #param loadInitial - Initial load, type long.
* #param loadLimit - Load limit, type long.
*/
private static void loadData(long counterInitial, long loadInitial, long loadLimit){
Session session = HibernateUtil.getSessionFactory().openSession();
try{
FileInputStream fstream = new FileInputStream("C:\\sppadron.txt");
DataInputStream entrada = new DataInputStream(fstream);
BufferedReader buffer = new BufferedReader(new InputStreamReader(entrada));
String strLinea;
while ((strLinea = buffer.readLine()) != null){
if(counterInitial > loadInitial){
if(counterInitial > loadLimit){
break;
}
Sppadron spadron= new Sppadron();
spadron.setSpId(counterInitial);
spadron.setSpNle(strLinea.substring(0, 9).trim());
spadron.setSpLib(strLinea.substring(9, 16).trim());
spadron.setSpDep(strLinea.substring(16, 19).trim());
spadron.setSpPrv(strLinea.substring(19, 22).trim());
spadron.setSpDst(strLinea.substring(22, 25).trim());
spadron.setSpApp(strLinea.substring(25, 66).trim());
spadron.setSpApm(strLinea.substring(66, 107).trim());
spadron.setSpNom(strLinea.substring(107, 143).trim());
String cadenaGriSecDoc = strLinea.substring(143, strLinea.length()).trim();
String[] tokensVal = cadenaGriSecDoc.split("\\s+");
if(tokensVal.length == 5){
spadron.setSpNac(tokensVal[0]);
spadron.setSpSex(tokensVal[1]);
spadron.setSpGri(tokensVal[2]);
spadron.setSpSec(tokensVal[3]);
spadron.setSpDoc(tokensVal[4]);
}else{
spadron.setSpNac(tokensVal[0]);
spadron.setSpSex(tokensVal[1]);
spadron.setSpGri(tokensVal[2]);
spadron.setSpSec(null);
spadron.setSpDoc(tokensVal[3]);
}
try{
session.getTransaction().begin();
session.save(spadron); // Insert
session.getTransaction().commit();
} catch (Exception e) {
session.getTransaction().rollback();
e.printStackTrace();
}
}
counterInitial++;
}
entrada.close();
} catch (Exception e) {
e.printStackTrace();
}finally{
session.close();
}
}
The main issue is if they check my code when I insert the first million records, the parameters would be as follows: loadData (0L, 0L, 1000000L);
The issue is that when you insert the following records in this case would be the next million records would be: loadData (0L, 1000000L, 2000000L);
What will cause it to scroll back the first 100 billion of records, and then when the counter is in the value 1000001 recently will begin insert following records, someone can give me a more optimal suggestion to insert the records, knowing that it is necessary treat information as seen in previous code shown.

See How to speed up insertion performance in PostgreSQL .
The first thing you should do is bypass Hibernate. ORMs are convienient, but you pay a price in speed for that convenience, especially with bulk operations.
You could group your inserts into reasonable sized transactions and use multi-valued inserts, using a JDBC PreparedStatement.
Personally though, I'd use PgJDBC's support for the COPY protocol to do the inserts more directly. Unwrap your Hibernate Session object to get the underlying java.sql.Connection, get the PGconnection interface for it, getCopyAPI() to get the CopyManager, and use copyIn to feed your data into the DB.
Since it looks like your data isn't in CSV form but fixed-width field form, what you'll need to do is start a thread that reads your data from the file, converts each datum into CSV form suitable for PostgreSQL input, and writes it to a buffer that copyIn can consume with the passed Reader. This sounds more complicated than it is, and there are lots of examples of Java producer/consumer threading implementations using java.io.Reader and java.io.Writer interfaces out there.
It's possible you may instead be able to write a filter for the Reader that wraps the underlying file reader and transforms each line. This would be much simpler than producer/consumer threading. Research it as the preferred option first.

How to read data from CSV if contains more than excepted separators?

I use CsvJDBC for read data from a CSV. I get CSV from web service request, so not loaded from file. I adjust these properties:
Properties props = new java.util.Properties();
props.put("separator", ";"); // separator is a semicolon
props.put("fileExtension", ".txt"); // file extension is .txt
props.put("charset", "UTF-8"); // UTF-8
My sample1.txt contains these datas:
code;description
c01;d01
c02;d02
my sample2.txt contains these datas:
code;description
c01;d01
c02;d0;;;;;2
It is optional for me deleted headers from CSV. But not optional for me change semi-colon separator.
EDIT: My query for resultSet: SELECT * FROM myCSV
I want to read code column in sample1.txt and sample2.txt with:
resultSet.getString(1)
and read full description column with many semi-colons (d0;;;;;2). Is it possible with CsvJdbc driver or need to change driver?
Thank you any advice!

This is a problem that occurs when you have messy, invalid input, which you need to try to interpret, that's being read by a too-high-level package that only handles clean input. A similar example is trying to read arbitrary HTML with an XML parser - close, but no cigar.
You can guess where I'm going: you need to pre-process your input.
The preprocessing may be very easy if you can make some assumptions about the data - for example, if there are guaranteed to be no quoted semi-colons in the first column.

You could try supercsv. We have implemented such a solution in our project. More on this can be found in http://supercsv.sourceforge.net/
and
Using CsvBeanReader to read a CSV file with a variable number of columns

Finally this problem solved without a CSVJdbc or SuperCSV driver. These drivers works fine. There are possible query data form CSV file and many features content. In my case I don't need query data from CSV. Unfortunately, sometimes the description column content one or more semi-colons and which it is my separator.
First I check code in answer of #Maher Abuthraa and modified to:
private String createDescriptionFromResult(ResultSet resultSet, int columnCount) throws SQLException {
if (columnCount > 2) {
StringBuilder data_list = new StringBuilder();
for (int ii = 2; ii <= columnCount; ii++) {
data_list.append(resultSet.getString(ii));
if (ii != columnCount)
data_list.append(";");
}
// data_list has all data from all index you are looking for ..
return data_list.toString();
} else {
// use standard way
return resultSet.getString(2);
}
}
The loop started from 2, because 1 column is code and only description column content many semi-colons. The CSVJdbc driver split columns by separator ; and these semi-colons disappears from columns data. So, I explicit add semi-colons to description, except the last column, because it is not relevant in my case.
This code work fine. But not solved my all problem. When I adjusted two columns in header of CSV I get error in row, which content more than two semi-colons. So I try adjust ignore of headers or add many column name (or simple ;) to a header. In superCSV ignore of headers option work fine.
My colleague opinion was: you are don't need CSV driver, because try load CSV which not would be CSV, if separator is sometimes relevant data.
I think my colleague has right and I loaded CSV data whith following code:
InputStream in = null;
try {
in = new ByteArrayInputStream(csvData);
List lines = IOUtils.readLines(in, "UTF-8");
Iterator it = lines.iterator();
String line = "";
while (it.hasNext()) {
line = (String) it.next();
String description = null;
String code = null;
String[] columns = line.split(";");
if (columns.length >= 2) {
code = columns[0];
String[] dest = new String[columns.length - 1];
System.arraycopy(columns, 1, dest, 0, columns.length - 1);
description = org.apache.commons.lang.StringUtils.join(dest, ";");
(...)

ok.. my solution to go and read all fields if columns are more than 2 ... like:
int ccc = meta.getColumnCount();
if (ccc > 2) {
ArrayList<String> data_list = new ArrayList<String>();
for (int ii = 1; ii < ccc; ii++) {
data_list.add(resultSet.getString(i));
}
//data_list has all data from all index you are looking for ..
} else {
//use standard way
resultSet.getString(1);
}

If the table is defined to have as many columns as there could be semi-colons in the source, ignoring the initial column definitions, then the excess semi-colons would be consumed by the database driver automatically.
The most likely reason for them to appear in the final column is because the parser returns the balance of the row to the terminator in the field.
Simply increasing the number of columns in the table to match the maximum possible in the input will avoid the need for custom parsing in the program. Try:
code;description;dummy1;dummy2;dummy3;dummy4;dummy5
c01;d01
c02;d0;;;;;2
Then, the additional ';' delimiters will be consumed by the parser correctly.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Parse huge CSV file - java

Related

My Customer data is being truncated when added to my List [duplicate]

cassandra-spring ingest command doesn't work

Limitation of OpenCSV Reader-Java

How to optimize the import of data from a flat file to BD PostgreSQL?

How to read data from CSV if contains more than excepted separators?

Categories

Resources