Best strategy for Massive Insert/Update using jdbc in mssqlserver - java

Good day. I posted this question previously, but it seems I was not clear enough, so I will try to be as detailed as possible about my situation.
I need to implement a daily process that extracts data from some CSV files and, using only JDBC, inserts that data into tables in a production database.
I have to insert into 2 tables.
Tables :
Table1 (
[func] [varchar](8) NOT NULL,
[Ver] [smallint] NOT NULL,
[id] [varchar](32) NOT NULL,
[desc] [varchar](300) NOT NULL,
[value] [float] NOT NULL,
[dtcreated] [date] NOT NULL,
[dtloaded] [date] NULL,
CONSTRAINT [Table1_PK] PRIMARY KEY CLUSTERED
(
[func] ASC,
[ver] ASC,
[id] ASC,
[desc] ASC,
[dtcreated] ASC
)
);
table2 (
[id] [varchar](32) NOT NULL,
[f1] [varchar](50) NOT NULL,
[f2] [varchar](32) NOT NULL,
[f3] [varchar](6) NULL,
[f4] [varchar](3) NULL,
[f5] [varchar](3) NULL,
[f6] [varchar](32) NULL,
[DtStart] [date] NOT NULL,
[DtEnd] [date] NOT NULL,
[dtcreated] [date] NOT NULL,
[dtloaded] [date] NULL,
CONSTRAINT [table2_PK] PRIMARY KEY CLUSTERED
(
[id] ASC,
[DtStart] DESC,
[DtEnd] DESC
)
);
Table1 is 400+ GB in size with 6,500+ million records.
Table2 is 30+ GB in size with about 5 million records.
Into table1 I need to process and insert 1.5 million records.
Into table2 I need to process and update/insert 1.1 million records; this is done with a MERGE ... WHEN MATCHED query (a rough sketch is shown below).
I need to run both of these processes without interrupting normal usage of these tables.
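For context, each table2 upsert is one parameterized MERGE statement, added to a batch just like the plain inserts. The sketch below is illustrative only (it shows just a few of table2's columns), not the exact statement I run:
// Illustrative only: the real MERGE covers every column of table2.
String mergeSql =
      "MERGE INTO table2 AS t "
    + "USING (VALUES (?, ?, ?, ?)) AS s (id, DtStart, DtEnd, f1) "
    + "ON t.id = s.id AND t.DtStart = s.DtStart AND t.DtEnd = s.DtEnd "
    + "WHEN MATCHED THEN UPDATE SET t.f1 = s.f1, t.dtloaded = GETDATE() "
    + "WHEN NOT MATCHED THEN INSERT (id, f1, DtStart, DtEnd, dtcreated) "
    + "  VALUES (s.id, s.f1, s.DtStart, s.DtEnd, GETDATE());";
PreparedStatement mergeStmt = connection.prepareStatement(mergeSql);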
My code does the following:
public void processFile(String fileLocation) throws IOException, SQLException {
    try {
        SqlClient sqlClient = SqlClient.from(DriverClassName.SQLSERVER, DriverConnectionString.barra());
        Connection connection = sqlClient.getConnection();
        PreparedStatement pstmt = connection.prepareStatement(getSql());
        File file = new File(fileLocation);
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            int lnproc = 0;
            int batchCount = 0;
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split(",");
                pstmt.clearParameters();
                // ... process parts and bind them to the PreparedStatement
                pstmt.addBatch();
                batchCount++;
                if (batchCount >= batchSize) {
                    batchCount = 0;
                    try {
                        pstmt.executeBatch();
                    } catch (BatchUpdateException ex) {
                        // log the failed rows; decide whether to skip or abort
                    }
                }
            }
            try {
                pstmt.executeBatch();   // flush the last, partial batch
            } catch (BatchUpdateException ex) {
                // log the failed rows; decide whether to skip or abort
            }
        }
        connection.commit();
        connection.close();
    } catch (ClassNotFoundException | InstantiationException | IllegalAccessException e) {
        // log the driver/connection failure
    }
}
Because of the huge number of records to insert into each table, I can generate different locks on the tables that could affect the production environment.
I have done some research and there are several strategies I am considering:
create batches of at most 5k inserts and commit each one, to prevent lock escalation;
commit after every record, to prevent locks and to keep the transaction log small.
I would like to pick the brains of the community about what you think the best strategy would be in this case,
and any recommendations you can make.

After looking into it, this is the best solution I found.
First, as stated in the comments, I read the whole file and loaded it into memory in a Java structure.
After loading the file, I iterated over that structure and added each record to the batch, keeping a counter of how many items I had added.
When the counter hits 5,000, I execute and commit the batch, reset the counter to 0, and keep adding items until I either hit 5,000 again or reach the end of the iteration.
By doing this I prevent MSSQL from escalating to a table lock, and the table can still be used by the other processes and applications.
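In outline, the loading loop ended up looking like the sketch below (Record and records are placeholders for my actual in-memory structure, pstmt is the same prepared statement as above, and the parameter binding is omitted):
connection.setAutoCommit(false);
int batchCount = 0;

for (Record r : records) {               // the file has already been parsed into memory
    // bind r's fields to pstmt here
    pstmt.addBatch();
    batchCount++;
    if (batchCount == 5000) {            // stay below SQL Server's ~5,000-lock escalation threshold
        pstmt.executeBatch();
        connection.commit();             // release the locks held by this chunk
        batchCount = 0;
    }
}

if (batchCount > 0) {                    // flush and commit the last partial batch
    pstmt.executeBatch();
    connection.commit();
}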

Related

java + SQLite project. Foreign key "On Update" not updating

I am making a JavaFX app (IntelliJ, Java JDK 11) using SQLite version 3.30.1 with DB Browser for SQLite.
I have a table called "beehives" and each beehive can have diseases (stored in the table "diseases").
this is my "beehives" table:
CREATE TABLE "beehives" (
"number" INTEGER NOT NULL,
"id_apiary" INTEGER NOT NULL DEFAULT -2,
"date" DATE,
"type" TEXT,
"favorite" BOOLEAN DEFAULT 'false',
PRIMARY KEY("number","id_apiary"),
FOREIGN KEY("id_apiary") REFERENCES "apiaries"("id") ON DELETE SET NULL
);
this is my "diseases" table:
CREATE TABLE "diseases" (
"id" INTEGER NOT NULL,
"id_beehive" INTEGER NOT NULL,
"id_apiary" INTEGER NOT NULL,
"disease" TEXT NOT NULL,
"treatment" TEXT NOT NULL,
"start_treat_date" DATE NOT NULL,
"end_treat_date" DATE,
PRIMARY KEY("id"),
FOREIGN KEY("id_beehive","id_apiary") REFERENCES "beehives"("number","id_apiary") ON UPDATE CASCADE
);
this is my "apiaries" table in case you need it:
CREATE TABLE "apiaries" (
"id" INTEGER NOT NULL,
"name" TEXT NOT NULL,
"address" TEXT,
PRIMARY KEY("id")
);
Everything works fine, but when I update a beehive (for example when I update "number", which is part of the primary key in the beehives table), the diseases table does not update the number. The result is that the diseases end up disconnected: the beehive changes its "number" correctly, but the disease doesn't follow. There is no error message.
My java method that calls the update is:
public void updateBeehiveInDB(Beehives newBeehive,Beehives oldBeehive){
try {
s = "UPDATE beehives SET number=?, id_apiary=?, date=?, type=?, favorite=? WHERE number=? and id_apiary=? ";
preparedStatement = connection.prepareStatement(s);
preparedStatement.setInt(1, newBeehive.getNumber());
preparedStatement.setInt(2, newBeehive.getId_apiary());
preparedStatement.setDate(3, newBeehive.getDate());
preparedStatement.setString(4, newBeehive.getType());
preparedStatement.setBoolean(5, newBeehive.isFavorite());
preparedStatement.setInt(6, oldBeehive.getNumber());
preparedStatement.setInt(7,oldBeehive.getId_apiary());
int i = preparedStatement.executeUpdate();
} catch (SQLException e) {
e.printStackTrace();
}
}
I tried to check whether foreign keys are "on" by following the SQLite documentation here, but my English is not good enough and I am using DB Manager, so I have no idea how to check whether it is on, or how to turn it on manually.
What can I do so that the diseases "id_beehive" is updated when I update "number" in the beehives table?
The problem was that I am using a composite foreign key and I needed to implement it correctly in the other tables too, even though I was not using them yet in this new project. It was very hard to find the problem because IntelliJ normally shows all the SQL error messages, but in this case it was not showing anything. When I ran the SQL statement manually in DB Browser, I got an error message there and was able to fix it.
I also had to activate foreign key enforcement on the connection:
public Connection openConnection() {
try {
String dbPath = "jdbc:sqlite:resources/db/datab.db";
Class.forName("org.sqlite.JDBC");
SQLiteConfig config = new SQLiteConfig();
config.enforceForeignKeys(true);
connection = DriverManager.getConnection(dbPath,config.toProperties());
return connection;
} catch (ClassNotFoundException e) {
e.printStackTrace();
} catch (SQLException e) {
e.printStackTrace();
}
return null;
}
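If you need to confirm that the setting is actually active, you can query the pragma on the same connection. This small check is just an illustration (it is not part of my original code):
// Returns true when foreign key enforcement is on for this connection.
public boolean foreignKeysEnabled() throws SQLException {
    try (Statement st = connection.createStatement();
         ResultSet rs = st.executeQuery("PRAGMA foreign_keys;")) {
        return rs.next() && rs.getInt(1) == 1;
    }
}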

Optimizing Insertions into an SQLite Database with JDBC

I'm writing the backend for a java http server for a class project and I have to insert a few records into a database using jdbc. The maximum number of insertions I have at one time is currently 122, which takes a whopping 18.7s to execute, about 6.5 insertions per second. This is outrageously slow, since the server needs to be able to respond to the request that inserts the records in less than 5s, and a real server would be expected to be many times faster. I'm pretty sure that this has something to do with the code or my declaration of the table schema, but I can't seem to find the bottleneck anywhere. The table schema looks like this:
CREATE TABLE Events (
ID varchar(38) primary key,
ownerName varchar(32) not null,
personID varchar(38) not null,
latitude float not null,
longitude float not null,
country varchar(64) not null,
city varchar(128) not null,
eventType varchar(8) not null,
year int not null,
foreign key (ownerName)
references Users (userName)
on delete cascade
on update cascade,
foreign key (ID)
references People (ID)
on delete cascade
on update cascade
);
and the code to perform the insertions is the following function
public class EventAccessor {
private Connection handle;
...
public void insert(Event event) throws DataInsertException {
String query = "insert into Events(ID,ownerName,personID,latitude,longitude,country,"
+ "city,eventType,year)\nvalues(?,?,?,?,?,?,?,?,?)";
try (PreparedStatement stmt = handle.prepareStatement(query)) {
stmt.setString(1, event.getID());
stmt.setString(2, event.getUsername());
stmt.setString(3, event.getPersonID());
stmt.setDouble(4, event.getLatitude());
stmt.setDouble(5, event.getLongitude());
stmt.setString(6, event.getCountry());
stmt.setString(7, event.getCity());
stmt.setString(8, event.getType());
stmt.setInt(9, event.getYear());
stmt.executeUpdate();
} catch (SQLException e) {
throw new DataInsertException(e.getMessage(), e);
}
}
}
Where Event is a class that holds an entry for the schema and DataInsertException is a simple exception defined elsewhere in the API. I was instructed to use PreparedStatement because it's apparently safer than using a Statement, but I have the choice to switch, so if it's faster I'll gladly change the code. The function that I use to insert the 122 entries is actually a wrapper over an array of Event objects and looks like this:
void insertEvents(Event[] events) throws DataInsertException {
for (Event e : events) {
insert(e);
}
}
I'm willing to try anything to improve performance at this point.
I disabled auto-commit on the JDBC connection with connection.setAutoCommit(false) and performance increased by over 1000x. New benchmarks show that inserting 122 records completed in a mere 0.008265739 s, a rate of about 14,000 insertions per second, which is closer to what I was expecting.
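In outline, the change amounts to wrapping the existing per-row inserts in one transaction. A sketch of what the wrapper becomes (error handling simplified) looks like this:
void insertEvents(Event[] events) throws DataInsertException {
    try {
        handle.setAutoCommit(false);       // group all inserts into a single transaction
        for (Event e : events) {
            insert(e);                     // same per-row PreparedStatement insert as before
        }
        handle.commit();                   // one commit (and one fsync) for the whole batch
    } catch (SQLException e) {
        try { handle.rollback(); } catch (SQLException ignored) { }
        throw new DataInsertException(e.getMessage(), e);
    } catch (DataInsertException e) {
        try { handle.rollback(); } catch (SQLException ignored) { }
        throw e;
    } finally {
        try { handle.setAutoCommit(true); } catch (SQLException ignored) { }
    }
}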

Java JDBC - PreparedStatement executeUpdate() always returns 1

Currently I'm working on Java code that retrieves data from XML files located in various folders and then uploads both the file itself and the retrieved data to a SQL Server database. I don't want to upload any repeated XML file, but since the files can have random names I check each file using its hash before uploading. I'm uploading the files to the following table:
XMLFiles
CREATE TABLE [dbo].[XMLFiles](
[PathID] [int] NOT NULL,
[FileID] [int] IDENTITY(1,1) NOT NULL,
[XMLFileName] [nvarchar](100) NULL,
[FileSize] [int] NULL,
[FileData] [varbinary](max) NULL,
[ModDate] [datetime2](7) NULL,
[FileHash] [nvarchar](100) NULL,
CONSTRAINT [PK_XMLFiles] PRIMARY KEY CLUSTERED
(
[FileID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
the code I'm using to upload the files is the following:
public int UploadFile(String Path,int pathID) throws SQLException, SAXException, IOException {
int ID=-1;
String hash;
int len,rowCount=0;
String query;
PreparedStatement pstmt;
try {
File file = new File(Path);
hash=XMLRead.getFileChecksum(file);
FileInputStream fis = new FileInputStream(file);
len = (int) file.length();
query = (" IF NOT EXISTS "
+ " (SELECT 1"
+ " FROM XMLFiles"
+ " WHERE FileSize="+len+" AND FileHash='"+hash+"')"
+ " BEGIN"
+ " INSERT INTO XMLFiles (PathID,XMLFileName,FileSize,FileData,ModDate,FileHash) "
+ " VALUES(?,?,?,?,GETDATE(),?)"
+ " END;");
pstmt = Con.prepareStatement(query);
pstmt.setInt(1, pathID);
pstmt.setString(2, file.getName());
pstmt.setInt(3, len);
pstmt.setBinaryStream(4, fis, len);
pstmt.setString(5, hash);
rowCount=pstmt.executeUpdate();
System.out.println("ROWS AFFECTED:-"+rowCount);
if (rowCount==0){
System.out.println("THE FILE: "+file.getName()+"ALREADY EXISTS IN THE SERVER WITH THE NAME: ");
System.out.println(GetFilename(hash));
}
} catch (Exception e) {
e.printStackTrace();
}
return rowCount;
}
I'm executing the program with 28 files, 4 of which are repeats with different names. I know the code works because at the end of each execution only the 24 unique files have been uploaded. The problem is that I'm using rowCount to check whether the file was uploaded, and if the file wasn't uploaded because it was a repeat, I don't upload that file's data to the database either, like so (the following fragment illustrates the check I'm doing):
int rowCount=UploadFile(Path,pathID);
if (rowCount==1){
//UPLOAD DATA
}
The problem is that executeUpdate() in the UploadFile method always returns 1, even when no rows in the database were affected. Is there something I'm missing? I can't find anything wrong with my code. Is it the IF NOT EXISTS check that is returning the 1?
The update count returned by a SQL statement is only well-defined for a single DML statement (INSERT, UPDATE, or DELETE).
It is not defined for a SQL script.
The value is whatever the server chooses to return for a script. For MS SQL Server, it is likely the value of @@ROWCOUNT at the end of the statement / script:
Set @@ROWCOUNT to the number of rows affected or read.
Since you're executing a SELECT statement, it sets the @@ROWCOUNT value. If that finds zero rows, you then execute the INSERT statement, which overrides the @@ROWCOUNT value.
Assuming there will never be more than one row with that size/hash, you will always get a count of 1 back.
It could be that when your SELECT in the IF block finds the existing row it is counted and returned.
If there is no exception thrown you could try the INSERT without the IF NOT EXISTS check and see if this is the case. You may end up with duplicates if you do not have a key of some kind that prevents them from being inserted, or you may receive an exception if you have a key that does prevent the insert. It's worth testing to see what you get.
If it is the SELECT returning the 1, you may need to split them into two statements, and simply skip the execution of the second if the first finds a row. You can keep them in the same transaction, and essentially your db is doing two statements as currently written. It's more code, but if you do in the same transaction, it's the same effect on your database.
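If splitting does turn out to be necessary, a rough sketch of the two-statement version (reusing the variables from the question's UploadFile method and parameterizing the size/hash check) might look like this:
// Sketch: check first, then insert only if the file is new.
// Both statements run on the same connection, so they can share a transaction.
String checkSql  = "SELECT 1 FROM XMLFiles WHERE FileSize = ? AND FileHash = ?";
String insertSql = "INSERT INTO XMLFiles (PathID, XMLFileName, FileSize, FileData, ModDate, FileHash) "
                 + "VALUES (?, ?, ?, ?, GETDATE(), ?)";

boolean alreadyThere;
try (PreparedStatement check = Con.prepareStatement(checkSql)) {
    check.setInt(1, len);
    check.setString(2, hash);
    try (ResultSet rs = check.executeQuery()) {
        alreadyThere = rs.next();
    }
}

if (!alreadyThere) {
    try (PreparedStatement ins = Con.prepareStatement(insertSql)) {
        ins.setInt(1, pathID);
        ins.setString(2, file.getName());
        ins.setInt(3, len);
        ins.setBinaryStream(4, fis, len);
        ins.setString(5, hash);
        rowCount = ins.executeUpdate();   // now 1 only when a row was really inserted
    }
}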

java.sql.SQLException: Subquery returns more than 1 row

I have a program that, when executed, reads lots of words from a file and inserts them into a database; after insertion, if a word has been inserted twice, it recalculates the "IDF" using a trigger. The problem is that if I do this directly in MySQL there is no problem, but if I do it from Java it returns this error:
Exception in thread "main" java.sql.SQLException: Subquery returns more than 1 row
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1086)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4237)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4169)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2617)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2778)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2828)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2777)
at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:949)
at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:795)
at model.Consultas.altaBajaCambio(Consultas.java:29)
at model.DatosBD.agregarPalabra(DatosBD.java:23)
at search.Search.main(Search.java:36)
Java Result: 1
I assume the problem has to be with the st.execute(), since it only gives back one int, but I have searched the web for a solution and I cannot find one.
Query:
String query2 = "INSERT IGNORE INTO Search.IndiceInv (DocID, Term, TF) VALUES ("+doc+",'"+term+"',1) ON DUPLICATE KEY UPDATE `TF` = `TF` + 1;";
c.altaBajaCambio(query2);
Execution:
try (Connection con = c.getConnection()) {
if (con == null) {
System.out.println("No hay conexion");
} else {
Statement st = con.createStatement();
st.execute(query);
}
}
Database:
-- -----------------------------------------------------
-- Table `Search`.`Doc`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `Search`.`Doc` (
`DocID` INT NOT NULL,
PRIMARY KEY (`DocID`))
ENGINE = InnoDB;
-- -----------------------------------------------------
-- Table `Search`.`Term`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `Search`.`Term` (
`Term` VARCHAR(45) NOT NULL,
`IDF` INT NOT NULL,
PRIMARY KEY (`Term`))
ENGINE = InnoDB;
-- -----------------------------------------------------
-- Table `Search`.`IndiceInv`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `Search`.`IndiceInv` (
`DocID` INT NOT NULL,
`Term` VARCHAR(45) NOT NULL,
`TF` INT NOT NULL,
PRIMARY KEY (`DocID`, `Term`))
ENGINE = InnoDB;
Trigger:
delimiter //
create trigger IDF
after update on IndiceInv
for each row
begin
update Term
set IDF = (SELECT LOG((SELECT count(*) FROM Doc)/(SELECT count(DocID) FROM IndiceInv WHERE Term = new.Term)) FROM Doc, IndiceInv)
where Term = new.Term;
end;//
delimiter ;
Try running it manually:
SELECT LOG((SELECT count(*) FROM Doc)/(SELECT count(DocID) FROM IndiceInv WHERE Term = new.Term)) FROM Doc, IndiceInv
(substituting a real value for new.Term, etc.). Because of the trailing FROM Doc, IndiceInv, the outer SELECT returns one row for every row of the cross join of Doc and IndiceInv, which is why the subquery assigned to IDF returns more than one row.

Failing to load large dataset into h2 database

Here is the problem: at my company we have a large database in which we want to perform some automated operations. To test this, we got a small sample of that data, about six CSV files of 10 MB each. We want to use H2 to test the results of our program. H2 seemed to work fine with our previous CSVs, though they were at most 1,000 entries long. With any of our 10 MB files, the command
insert into myschema.mytable (select * from csvread('mycsvfile.csv'));
reports a failure because one of the records is supposedly duplicated and violates our primary key constraint.
Unique index or primary key violation: "PRIMARY_KEY_6 ON MYSCHEMA.MYTABLE(DATETIME, LARGENUMBER, KIND)"; SQL statement:
insert into myschema.mytable (select * from csvread('src/test/resources/h2/data/mycsvfile.csv')) [23001-148] 23001/23001
Breaking mycsvfile.csv into smaller pieces, I was able to see that the problem starts to appear after about 10,000 rows have been inserted (though the number varies depending on the data used). I could, however, insert more than 10,000 rows if I broke the file into pieces and ran the command on each piece individually. But even if I manage to insert all that data manually, I need an automated method to fill the database.
Since running the command does not tell me which row is causing the problem, I guessed that the culprit could be some cache in the csvread routine.
I then created a small Java program to insert the data into the H2 database manually. No matter whether I batched the commands or closed and reopened the connection every 1,000 rows, H2 reported that I was trying to insert a duplicate entry into the database.
org.h2.jdbc.JdbcSQLException: Unique index or primary key violation: "PRIMARY_KEY_6 ON MYSCHEMA.MYTABLE(DATETIME, LARGENUMBER, KIND)"; SQL statement:
INSERT INTO myschema.mytable VALUES ( '1997-10-06 01:00:00.0',25485116,1.600,0,18 ) [23001-148]
A normal search for that record using Emacs shows that it is not duplicated; the datetime column is unique across the whole dataset.
I cannot give you that data to test with, since the company sells that information, but here is what my table definition looks like.
create table myschema.mytable (
datetime timestamp,
largenumber numeric(8,0) references myschema.largenumber(largecode),
value numeric(8,3) not null,
flag numeric(1,0) references myschema.flag(flagcode),
kind smallint references myschema.kind(kindcode),
primary key (datetime, largenumber, kind)
);
This is how our csv looks like:
datetime,largenumber,value,flag,kind
1997-06-11 16:45:00.0,25485116,0.710,0,18
1997-06-11 17:00:00.0,25485116,0.000,0,18
1997-06-11 17:15:00.0,25485116,0.000,0,18
1997-06-11 17:30:00.0,25485116,0.000,0,18
And here is the Java code that fills our test database (forgive my ugly code, I got desperate :)
private static void insertFile(MyFile file) throws SQLException {
int updateCount = 0;
ResultSet rs = Csv.getInstance().read(file.toString(), null, null);
ResultSetMetaData meta = rs.getMetaData();
Connection conn = DriverManager.getConnection(
"jdbc:h2:tcp://localhost/mytestdatabase", "sa", "pass");
rs.next();
while (rs.next()) {
Statement stmt = conn.createStatement();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < meta.getColumnCount(); i++) {
if (i == 0)
sb.append("'" + rs.getString(i + 1) + "'");
else
sb.append(rs.getString(i + 1));
sb.append(',');
}
updateCount++;
if (sb.length() > 0)
sb.deleteCharAt(sb.length() - 1);
stmt.execute(String.format(
"INSERT INTO myschema.mydatabase VALUES ( %s ) ",
sb.toString()));
if (updateCount == 1000) {
conn.close();
conn = DriverManager.getConnection(
"jdbc:h2:tcp://localhost/mytestdatabase", "sa", "pass");
updateCount = 0;
}
}
if (!conn.isClosed()) {
conn.close();
}
rs.close();
}
I'll be glad to provide more information if requested.
EDIT
@Randy: I always check that the database is clean before running the command, and in my Java program I have a routine to delete all data from a file that fails to be inserted.
select * from myschema.mytable where largenumber = 25485116;
DATETIME LARGENUMBER VALUE FLAG KIND
(no rows, 8 ms)
The only thing that I can think of is that there is a trigger on the table that sets the timestamp to "now". Although that would not explain why you are successful with a few rows, it would explain why the primary key is being violated.
