I am reading a CSV file line by line and upserting the data into a MongoDB database. Reading, processing, and writing all the files takes approximately 2 minutes when the database and the files are on the same machine, but around 5 minutes when the database is on another machine in my network, and even longer on a remote machine. Can anyone please help me reduce this time? Thanks.
Here is an approach to reduce the processing time.
To read the CSV file and load it into MongoDB, use an ETL tool such as Kettle:
http://wiki.pentaho.com/display/BAD/Write+Data+To+MongoDB
This will reduce the time spent going from the CSV file to MongoDB.
The simplest way to get the data onto the remote machine is to export the data from your local db and import it on the remote machine:
https://docs.mongodb.com/v2.6/core/import-export/
Hope it helps!
I saw that you are using Java to load your Mongo database.
Recent versions of the Java driver allow bulk operations, so you can send a batch of inserts to Mongo instead of sending them one by one. This speeds up inserts into MongoDB a lot.
DBCollection collection = db.getCollection("my_collection");
List<DBObject> list = new ArrayList<>();
for (int i = 0; i < 100; i++) {
    // generate your data
    BasicDBObject obj = new BasicDBObject("key", "value");
    list.add(obj);
}
collection.insert(list); // bulk insert of 100 objects!
This has been available since MongoDB 2.6: https://docs.mongodb.com/manual/reference/method/Bulk.insert/
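Since the question mentions upserts specifically, here is a minimal sketch of a bulk upsert using the newer MongoCollection/bulkWrite API. The connection string, database, collection, and field names below are placeholders, not taken from the question:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.*;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class BulkUpsertExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> collection =
                    client.getDatabase("mydb").getCollection("my_collection");

            List<WriteModel<Document>> ops = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                // upsert by a unique key instead of inserting blindly
                ops.add(new UpdateOneModel<>(
                        Filters.eq("key", i),
                        Updates.set("value", "value-" + i),
                        new UpdateOptions().upsert(true)));
            }
            // unordered bulk writes let the server apply them without waiting on each other
            collection.bulkWrite(ops, new BulkWriteOptions().ordered(false));
        }
    }
}
Batching a few hundred to a few thousand operations per bulkWrite call usually cuts down round trips to the remote server significantly.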
Related
I have a CSV/TSV file with data and want to load that CSV data into a database. I am using Java or Python and PostgreSQL to do that (I can't change that).
The problem is that I issue an INSERT query for each row, which is not very efficient with, say, 600,000 rows. Is there a more efficient way to do it?
I was wondering if I could take several rows and build one big query and execute it on my database, but I'm not sure whether that helps at all, or whether I should divide the data into, say, 100 pieces and execute 100 queries?
If the CSV file is compatible with the format required by COPY ... FROM STDIN, then the most efficient way is to use the CopyManager API.
See this answer or this answer for example code.
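As an illustration only (the connection URL, table name, column list, and file path below are placeholders), a minimal CopyManager sketch could look like this:
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

public class CopyCsvExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
             BufferedReader reader = new BufferedReader(new FileReader("/tmp/data.csv"))) {

            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();

            // streams the whole file through a single COPY, no per-row INSERTs
            long rowsCopied = copyManager.copyIn(
                    "COPY my_table (col1, col2, col3) FROM STDIN WITH (FORMAT csv)",
                    reader);

            System.out.println("Copied " + rowsCopied + " rows");
        }
    }
}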
If your input file isn't compatible with Postgres' copy command, you will need to write the INSERT yourself. But you can speed up the process by using JDBC batching:
Something along these lines:
PreparedStatement insert = connection.prepareStatement("insert into ...");
int batchSize = 1000;
int batchRow = 0;
// iterate over the lines from the file
while (...) {
    // ... parse the line, extract the columns ...
    insert.setInt(1, ...);
    insert.setString(2, ...);
    insert.setXXX(...);
    insert.addBatch();
    batchRow++;
    if (batchRow == batchSize) {
        insert.executeBatch();
        batchRow = 0;
    }
}
insert.executeBatch(); // flush the remaining rows
Using reWriteBatchedInserts=true in your JDBC URL will improve performance even more.
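For example (host and database name are placeholders):
jdbc:postgresql://dbhost:5432/mydb?reWriteBatchedInserts=true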
Assuming the server can access the file directly, you could try using the COPY FROM command. If your CSV is not in the right format, it might still be faster to convert it into something the COPY command can handle (e.g. while copying it to a location that the server can access).
I am looking for a technical solution to query data from one database and load it into a SQL Server database using Java Spring Boot.
Mock query to get product names that were updated within a given 20-hour window:
SELECT productName, updatedtime
FROM products
WHERE updatedtime BETWEEN '2018-03-26 00:00:01' AND '2018-03-26 19:59:59';
Here is the approach we followed:
1) It is a long-running Oracle query, which takes approximately 1 hour during business hours and returns ~1 million records.
2) We have to insert/dump this result set into a SQL Server table using JDBC.
3) As far as I know, the Oracle JDBC driver supports streaming: when we iterate over the ResultSet, it loads only fetchSize rows into memory at a time.
int currentRow = 1;
while (rs.next()) {
    // read the data from the current row of the Oracle result set and accumulate it in a batch
    if (currentRow++ % BATCH_SIZE == 0) {
        // insert the whole accumulated batch into the SQL Server database
    }
}
In this case we do not need to keep the whole huge Oracle dataset in memory; we insert into SQL Server in batches of BATCH_SIZE. The only thing left to decide is where to commit on the SQL Server side.
4) The bottleneck here is the time spent waiting for the query to return data from the Oracle db, so I am planning to split the query into 10 equal parts, each covering one hour of updatedtime as shown below, so that the execution time of each query drops to roughly 10 minutes.
e.g.:
SELECT productName, updatedtime
FROM products
WHERE updatedtime BETWEEN '2018-03-26 01:00:01' AND '2018-03-26 01:59:59';
5) For that I would need 5 Oracle JDBC connections and 5 SQL Server connections (to query the data and insert it into the db), each working independently. I am new to JDBC connection pooling.
How can I set up connection pooling and close connections when they are not in use, etc.?
Please suggest any better approach to get the data from the source quickly, close to real time. Thanks in advance.
This is a typical use case for Spring Batch.
There you have the concepts of ItemReader (reading from your source db) and ItemWriter (writing into your destination db).
You can define multiple datasources, read with a fixed fetch size (JdbcCursorItemReader, for instance), and also partition the work into a grid for parallel execution.
With a quick search you can find many examples of this kind of task online.
I'm not posting a complete working example here, but a rough sketch of the wiring is shown below.
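As an illustration only, here is a minimal sketch assuming Spring Batch 4, a hypothetical Product bean (with name and updatedTime properties), a hypothetical products_copy target table, and oracleDataSource/sqlServerDataSource beans defined elsewhere (qualifier wiring not shown):
import javax.sql.DataSource;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class CopyJobConfig {

    // Streams rows from Oracle with a fixed fetch size instead of loading everything
    @Bean
    public JdbcCursorItemReader<Product> oracleReader(DataSource oracleDataSource) {
        return new JdbcCursorItemReaderBuilder<Product>()
                .name("oracleReader")
                .dataSource(oracleDataSource)
                .sql("SELECT productName, updatedtime FROM products "
                        + "WHERE updatedtime BETWEEN '2018-03-26 00:00:01' AND '2018-03-26 19:59:59'")
                .fetchSize(1000)
                .rowMapper((rs, rowNum) -> new Product(
                        rs.getString("productName"),
                        rs.getTimestamp("updatedtime")))
                .build();
    }

    // Writes each chunk to SQL Server as a single JDBC batch
    @Bean
    public JdbcBatchItemWriter<Product> sqlServerWriter(DataSource sqlServerDataSource) {
        return new JdbcBatchItemWriterBuilder<Product>()
                .dataSource(sqlServerDataSource)
                .sql("INSERT INTO products_copy (product_name, updated_time) "
                        + "VALUES (:name, :updatedTime)")
                .beanMapped() // maps Product getters onto the named parameters
                .build();
    }

    @Bean
    public Step copyStep(StepBuilderFactory steps,
                         JdbcCursorItemReader<Product> reader,
                         JdbcBatchItemWriter<Product> writer) {
        return steps.get("copyStep")
                .<Product, Product>chunk(1000) // commit to SQL Server every 1000 items
                .reader(reader)
                .writer(writer)
                .build();
    }

    @Bean
    public Job copyJob(JobBuilderFactory jobs, Step copyStep) {
        return jobs.get("copyJob").start(copyStep).build();
    }
}
To parallelize by hour as you planned, you could run several such steps with different BETWEEN ranges, or look into Spring Batch partitioning.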
My use case is that I have to run a query on an RDS instance and it returns 2 million records. Now I want to stream the result directly to disk instead of bringing it into memory and then copying it to disk.
The following statement brings all the records into memory; I want to transfer the results directly to a file on disk.
SelectQuery<Record> abc = dslContext.selectQuery().fetch();
Can anyone suggest a pointer?
Update1:
I found the following way to read it:
try (Cursor<BookRecord> cursor = create.selectFrom(BOOK).fetchLazy()) {
while (cursor.hasNext()){
BookRecord book = cursor.fetchOne();
Util.doThingsWithBook(book);
}
}
How many records does it fetch at once, and are those records brought into memory first?
Update2:
The MySQL driver fetches all the records at once by default. If the fetch size is set to Integer.MIN_VALUE, it streams one record at a time. If you want to fetch the records in batches instead, set useCursorFetch=true in the connection properties.
Related documentation: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-implementation-notes.html
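For illustration only (URL, credentials, and query are placeholders), the two options described above look like this in plain JDBC:
import java.sql.*;

public class MySqlStreamingExample {
    public static void main(String[] args) throws SQLException {
        // Option 1: stream one row at a time (forward-only, read-only, fetch size = Integer.MIN_VALUE)
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/mydb", "user", "secret");
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM book")) {
                while (rs.next()) {
                    // process each row without holding the whole result in memory
                }
            }
        }
        // Option 2: cursor-based fetching in batches instead
        //   jdbc:mysql://dbhost:3306/mydb?useCursorFetch=true
        //   stmt.setFetchSize(1000); // 1000 rows per round trip
    }
}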
Your approach using the ResultQuery.fetchLazy() method is the way to go for jOOQ to fetch records one at a time from JDBC. Note that you can use Cursor.fetchNext(int) to fetch a batch of records from JDBC as well.
There's a second thing you might need to configure, and that's the JDBC fetch size, see Statement.setFetchSize(int). This configures how many rows are fetched by the JDBC driver from the server in a single batch. Depending on your database / JDBC driver (e.g. MySQL), the default would again be to fetch all rows in one go. In order to specify the JDBC fetch size on a jOOQ query, use ResultQuery.fetchSize(int). So your loop would become:
try (Cursor<BookRecord> cursor = create
.selectFrom(BOOK)
.fetchSize(size)
.fetchLazy()) {
while (cursor.hasNext()){
BookRecord book = cursor.fetchOne();
Util.doThingsWithBook(book);
}
}
Please read your JDBC driver's manual about how it interprets the fetch size, noting that MySQL is "special".
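Since the original goal was to write the result straight to disk, here is a rough sketch (not from the answer above) that combines the lazy cursor with a buffered writer. It assumes the same context as the snippet above (create, BOOK, size), java.nio imports, and placeholder fields/output path:
try (Cursor<BookRecord> cursor = create
        .selectFrom(BOOK)
        .fetchSize(size)
        .fetchLazy();
     BufferedWriter out = Files.newBufferedWriter(Paths.get("/tmp/books.csv"))) {

    for (BookRecord book : cursor) {
        // only one fetch-size batch is held in memory at any time
        out.write(book.get(BOOK.ID) + "," + book.get(BOOK.TITLE));
        out.newLine();
    }
}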
I was inserting data into MongoDB but suddenly encountered this error. I don't know how to fix it. Is this because the maximum size was exceeded? If not, why am I getting this error, and does anyone know how to fix it? Below is the error I encountered:
Exception in thread "main" com.mongodb.MongoInternalException: DBObject of size 163745644 is over Max BSON size 16777216
I know my dataset is large, but is there any other solution?
The document you are trying to insert exceeds the maximum BSON document size, i.e. 16 MB.
Here is the reference documentation: http://docs.mongodb.org/manual/reference/limits/
To store documents larger than the maximum size, MongoDB provides the GridFS API.
The mongofiles utility makes it possible to manipulate files stored in
your MongoDB instance in GridFS objects from the command line. It is
particularly useful as it provides an interface between objects stored
in your file system and GridFS.
Ref: MongoFiles
To insert a document larger than 16 MB you need to use MongoDB's GridFS. GridFS is an abstraction layer on top of MongoDB which divides data into chunks (255 KB by default). As you are using Java, it is simple to use with the Java driver too. Here I am inserting an Elasticsearch archive (about 20 MB) into MongoDB. Sample code:
MongoClient mongo = new MongoClient("localhost", 27017);
DB db = mongo.getDB("testDB");
String newFileName = "elasticsearch-Jar";
File imageFile = new File("/home/impadmin/elasticsearch-1.4.2.tar.gz");
GridFS gfs = new GridFS(db);
//Insertion
GridFSInputFile inputFile = gfs.createFile(imageFile);
inputFile.setFilename(newFileName);
inputFile.put("name", "devender");
inputFile.put("age", 23);
inputFile.save();
//Fetch back
GridFSDBFile outputFile = gfs.findOne(newFileName);
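If you also want to write the fetched file back to the filesystem, the legacy driver's GridFSDBFile offers a writeTo method (the output path below is just a placeholder):
//Write the stored file back to disk
outputFile.writeTo("/home/impadmin/elasticsearch-copy.tar.gz");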
Find out more here.
If you want to load the file without writing Java code, you can use mongofiles from the command line, as mentioned in the other answer.
Hope that helps :)
I have to transfer around 5 million rows of data from Teradata to MySQL. Can anyone please suggest the fastest way to do this over the network, without using the filesystem? I am new to Teradata and MySQL. I want to run this transfer as a batch job on a weekly basis, so I am looking for a solution that can be fully automated. Any suggestions or hints will be greatly appreciated.
I have already written JDBC code to get the records from Teradata and insert them into MySQL, but it is very slow, so I am looking to make it more efficient. I kept the question generic because I didn't want the solution to be constrained by my implementation; along with making the existing code more efficient, I am open to other alternatives. But I don't want to use the filesystem, since that makes the scripts harder to maintain and update.
My implementation:
Getting records from Teradata:
connection = DBConnectionFactory.getDBConnection(SOURCE_DB);
statement = connection.createStatement();
rs = statement.executeQuery(QUERY_SELECT);
while (rs.next()) {
Offer offer = new Offer();
offer.setExternalSourceId(rs.getString("EXT_SOURCE_ID"));
offer.setClientOfferId(rs.getString("CLIENT_OFFER_ID"));
offer.setUpcId(rs.getString("UPC_ID"));
offers.add(offer);
}
Inserting the records into MySQL:
int count = 0;
if (isUpdated) {
for (Offer offer : offers) {
count++;
stringBuilderUpdate = new StringBuilder();
stringBuilderUpdate = stringBuilderUpdate
.append(QUERY_INSERT);
stringBuilderUpdate = stringBuilderUpdate.append("'"
+ offer.getExternalSourceId() + "'");
statement.addBatch(stringBuilderUpdate.toString());
queryBuilder = queryBuilder.append(stringBuilderUpdate
.toString() + SEMI_COLON);
if (count > LIMIT) {
countUpdate = statement.executeBatch();
LOG.info("DB update count : " + countUpdate.length);
count = 0;
}
}
    if (count > 0) {
        // Execute the remaining batch
        countUpdate = statement.executeBatch();
    }
}
Can anybody please tell me how to make this code more efficient?
Thanks
PS: Please ignore any syntax errors in the above code, as the actual code works fine. Some parts may be missing because of copy and paste.
The fastest method of importing data into MySQL is LOAD DATA INFILE, or mysqlimport, which is a command-line interface to LOAD DATA INFILE. Both load data from a file, preferably one residing on a local filesystem.
When loading a table from a text file, use LOAD DATA INFILE. This is
usually 20 times faster than using INSERT statements.
Therefore, despite the fact that you don't want to use the filesystem, I'd suggest considering dumping the data to a file, transferring it to the MySQL server, and using the above-mentioned means to load it.
All these tasks can be fully automated via scripting.
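As a rough sketch only: the steps could be combined in one Java program, assuming the MySQL connection allows LOCAL INFILE (allowLoadLocalInfile=true on the client and local_infile enabled on the server). The connection URLs, credentials, table, and column names are placeholders, and real data would need proper CSV quoting/escaping:
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.*;

public class TeradataToMySqlDump {
    public static void main(String[] args) throws Exception {
        Path csv = Files.createTempFile("offers", ".csv");

        // 1) Dump the Teradata result set to a temporary CSV file
        try (Connection td = DriverManager.getConnection(
                     "jdbc:teradata://td-host/DATABASE=mydb", "user", "secret");
             Statement stmt = td.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT EXT_SOURCE_ID, CLIENT_OFFER_ID, UPC_ID FROM offers");
             BufferedWriter out = Files.newBufferedWriter(csv)) {
            while (rs.next()) {
                // no CSV quoting/escaping here; add it if the data can contain commas
                out.write(rs.getString(1) + "," + rs.getString(2) + "," + rs.getString(3));
                out.newLine();
            }
        }

        // 2) Bulk-load the CSV into MySQL with LOAD DATA LOCAL INFILE
        try (Connection my = DriverManager.getConnection(
                     "jdbc:mysql://mysql-host:3306/mydb?allowLoadLocalInfile=true",
                     "user", "secret");
             Statement stmt = my.createStatement()) {
            stmt.execute("LOAD DATA LOCAL INFILE '" + csv.toAbsolutePath()
                    + "' INTO TABLE offers FIELDS TERMINATED BY ','");
        }

        Files.deleteIfExists(csv);
    }
}
The temporary file can live anywhere the job has write access to, and the whole program can be scheduled weekly (cron, Windows Task Scheduler, etc.) for full automation.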