I store files in a PostgreSQL database in a column of type bytea. The files can potentially exceed the allocated Java heap space, so when I try to write them to the file system I quickly run into out-of-memory issues.
I am using JDBC to perform the query and then extract the content as binary stream.
This is a simplified version of my code:
public File readContent(String contentId) {
    PreparedStatement statement = jdbcTemplate.getDataSource().getConnection()
            .prepareStatement("SELECT content from table.entry WHERE id=?");
    statement.setString(1, contentId);
    ResultSet resultSet = statement.executeQuery();
    resultSet.next();
    File file = writeToFileSystem(resultSet.getBinaryStream(1));
    resultSet.close();
    return file;
}

private File writeToFileSystem(InputStream inputStream) {
    File dir = createDirectories(Paths.get(properties.getTempFolder(), UUID.randomUUID().toString())).toFile();
    File file = new File(dir, "content.zip");
    FileUtils.copyInputStreamToFile(inputStream, file);
    return file;
}
My expectation was that this would let me stream the data from the database into the file without ever having to load it into memory entirely. This approach doesn't work, however: I still get OutOfMemoryErrors as soon as the query is executed:
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.postgresql.core.PGStream.receiveTupleV3(PGStream.java:395)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2118)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:288)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:430)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:356)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:168)
at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:116)
at sun.reflect.GeneratedMethodAccessor201.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.tomcat.jdbc.pool.StatementFacade$StatementProxy.invoke(StatementFacade.java:114)
at com.sun.proxy.$Proxy149.executeQuery(Unknown Source)
at [...].ContentRepository.readContent(ContentRepository.java:111)
Is there any way I can stream the data from the database into a file without having to increase the Java VMs available memory?
As per this mailing list discussion, you should not be using bytea for this use case:
There are two methods to store binary data in pg and they have different access methods and performance characteristics. Bytea data is expected to be shorter and is returned in whole with a ResultSet by the server. For larger data you want to use large objects which return a pointer (oid) to the actual data which you can then stream from the server at will.

This page describes some of the differences between the two and demonstrates using a pg specific api to access large objects, but getBlob/setBlob will work just fine.
See Chapter 7, Storing Binary Data, which shows example code, and Chapter 35, Large Objects, which goes into more detail:
PostgreSQL has a large object facility, which provides stream-style access to user data that is stored in a special large-object structure. Streaming access is useful when working with data values that are too large to manipulate conveniently as a whole.
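For example, if the column is migrated to a large object (an oid column), the data can be streamed straight to a file. Here is a minimal sketch along those lines; it assumes an oid column named content, reuses the query and temp folder from the question, and leaves connection handling to the surrounding code. Note that large objects must be accessed inside a transaction:

connection.setAutoCommit(false); // large objects must be read within a transaction

try (PreparedStatement statement = connection.prepareStatement(
        "SELECT content FROM table.entry WHERE id = ?")) {
    statement.setString(1, contentId);
    try (ResultSet resultSet = statement.executeQuery()) {
        if (resultSet.next()) {
            // With an oid column, getBlob() is backed by the large object API
            // and streams from the server instead of materializing a byte[]
            Blob blob = resultSet.getBlob(1);
            try (InputStream in = blob.getBinaryStream()) {
                Files.copy(in, Paths.get(properties.getTempFolder(), "content.zip"));
            }
            blob.free();
        }
    }
}
connection.commit();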
Related
I have a CSV/TSV file with data and want to load that CSV data into a database. I am using Java or Python and PostgreSQL to do that (I can't change that).
The problem is that I make an INSERT query for each row, which is not very efficient when I have, say, 600,000 rows. Is there a more efficient way to do it?
I was wondering whether I could combine multiple rows into one big query and execute it on my database, but I'm not sure that helps at all. Or should I divide the data into, say, 100 pieces and execute 100 queries?
If the CSV file is compatible with the format required by copy from stdin, then the most efficient way is to use the CopyManager API.
See this answer or this answer for example code.
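For instance, here is a minimal sketch of the CopyManager route; the JDBC URL, credentials, table name, CSV file, and COPY options are assumptions to adapt to your setup (the relevant classes are org.postgresql.PGConnection and org.postgresql.copy.CopyManager):

try (Connection connection = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
    CopyManager copyManager = connection.unwrap(PGConnection.class).getCopyAPI();
    try (Reader reader = Files.newBufferedReader(Paths.get("data.csv"))) {
        // COPY ... FROM STDIN reads the rows straight from the Reader
        long rows = copyManager.copyIn("COPY my_table FROM STDIN WITH (FORMAT csv)", reader);
        System.out.println("Copied " + rows + " rows");
    }
}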
If your input file isn't compatible with Postgres' copy command, you will need to write the INSERT yourself. But you can speed up the process by using JDBC batching:
Something along these lines:
PreparedStatement insert = connection.prepareStatement("insert into ...");
int batchSize = 1000;
int batchRow = 0;
// iterate over the lines from the file
while (...) {
    // ... parse the line, extract the columns ...
    insert.setInt(1, ...);
    insert.setString(2, ...);
    insert.setXXX(...);
    insert.addBatch();
    batchRow++;
    if (batchRow == batchSize) {
        insert.executeBatch();
        batchRow = 0;
    }
}
insert.executeBatch();
Using reWriteBatchedInserts=true in your JDBC URL will improve performance even more.
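For example, the parameter is simply appended to the connection URL (host and database name below are placeholders):

jdbc:postgresql://localhost:5432/mydb?reWriteBatchedInserts=true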
Assuming the server can access the file directly, you could try using the COPY FROM command. If your CSV is not in the right format, it might still be faster to transform it into something the COPY command can handle (e.g. while copying it to a location the server can access).
My use case is that I have to run a query on an RDS instance that returns 2 million records. Now I want to copy the result directly to disk instead of bringing it into memory and then copying it to disk.
The following statement will bring all the records into memory; I want to transfer the results directly to a file on disk.
SelectQuery<Record> abc = dslContext.selectQuery().fetch();
Can anyone suggest a pointer?
Update1:
I found the following way to read it:
try (Cursor<BookRecord> cursor = create.selectFrom(BOOK).fetchLazy()) {
    while (cursor.hasNext()) {
        BookRecord book = cursor.fetchOne();
        Util.doThingsWithBook(book);
    }
}
How many records does it fetch at once, and are those records brought into memory first?
Update2:
By default, the MySQL driver fetches all the records at once. If the fetch size is set to Integer.MIN_VALUE, it fetches one record at a time. If you want to fetch the records in batches, set useCursorFetch=true in the connection properties.
Related documentation: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-implementation-notes.html
Your approach using the ResultQuery.fetchLazy() method is the way to go for jOOQ to fetch records one at a time from JDBC. Note that you can use Cursor.fetchNext(int) to fetch a batch of records from JDBC as well.
There's a second thing you might need to configure, and that's the JDBC fetch size, see Statement.setFetchSize(int). This configures how many rows are fetched by the JDBC driver from the server in a single batch. Depending on your database / JDBC driver (e.g. MySQL), the default would again be to fetch all rows in one go. In order to specify the JDBC fetch size on a jOOQ query, use ResultQuery.fetchSize(int). So your loop would become:
try (Cursor<BookRecord> cursor = create
        .selectFrom(BOOK)
        .fetchSize(size)
        .fetchLazy()) {
    while (cursor.hasNext()) {
        BookRecord book = cursor.fetchOne();
        Util.doThingsWithBook(book);
    }
}
Please read your JDBC driver manual about how it interprets the fetch size, noting that MySQL is "special".
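For reference, here is a sketch of the two plain-JDBC options for MySQL Connector/J described above; the connection object and URL are placeholders:

// Option 1: row-by-row streaming
Statement streaming = connection.createStatement(
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
streaming.setFetchSize(Integer.MIN_VALUE); // Connector/J streams one row at a time

// Option 2: cursor-based fetching in batches
// requires useCursorFetch=true on the URL, e.g.
// jdbc:mysql://localhost:3306/mydb?useCursorFetch=true
Statement batched = connection.createStatement();
batched.setFetchSize(1000); // rows are fetched from the server 1000 at a time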
I was entering data into MongoDB but suddenly ran into this error. I don't know how to fix it. Is it due to the maximum size being exceeded? If not, why am I getting this error? Does anyone know how to fix it? Below is the error I encountered:
Exception in thread "main" com.mongodb.MongoInternalException: DBObject of size 163745644 is over Max BSON size 16777216
I know my dataset is large... but is there any other solution?
The document you are trying to insert exceeds the maximum BSON document size, i.e. 16 MB.
Here is the reference documentation: http://docs.mongodb.org/manual/reference/limits/
To store documents larger than the maximum size, MongoDB provides the GridFS API.
The mongofiles utility makes it possible to manipulate files stored in your MongoDB instance in GridFS objects from the command line. It is particularly useful as it provides an interface between objects stored in your file system and GridFS.

Ref: MongoFiles
To insert a document larger than 16 MB you need to use GridFS. GridFS is an abstraction layer on top of MongoDB that divides data into chunks (255 KB by default). As you are using Java, it's simple to use with the Java driver too. Here I am inserting an Elasticsearch archive (about 20 MB) into MongoDB. Sample code:
MongoClient mongo = new MongoClient("localhost", 27017);
DB db = mongo.getDB("testDB");
String newFileName = "elasticsearch-Jar";
File imageFile = new File("/home/impadmin/elasticsearch-1.4.2.tar.gz");
GridFS gfs = new GridFS(db);
//Insertion
GridFSInputFile inputFile = gfs.createFile(imageFile);
inputFile.setFilename(newFileName);
inputFile.put("name", "devender");
inputFile.put("age", 23);
inputFile.save();
//Fetch back
GridFSDBFile outputFile = gfs.findOne(newFileName);
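To write the fetched file back to disk, GridFSDBFile offers a writeTo method; the output path below is just an example:

outputFile.writeTo("/home/impadmin/elasticsearch-restored.tar.gz");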
Find out more here.
If you want to work with the files from the command line instead of the Java driver, you can use mongofiles as mentioned in the other answer.
Hope that helps :)
There is a table event_logs with about 16 million entries. The database is MySQL, hosted on Google Cloud.
My task is to dump this data into MongoDB. Before dumping the data I need to convert each row into a JSON document.
Table schema issues
1. There is no auto-increment id and no primary key in the table.
What I have tried
1. In Java, I streamed results in a ResultSet using the JDBC driver, but the problem is that the first 300k results came quickly, and after that it takes a long time to get results from the database server. Why?
2. I split the query into multiple queries and used LIMIT (like LIMIT 1000000, 100000), but the problem is that when the offset is a large number like 1000000 it takes a long time to get results. It looks like MySQL still scans from the beginning up to that offset and only then returns the rows.
Please suggest an efficient way to copy the data from MySQL to MongoDB.
First, you can try to set the ResultSet fetch size like this:
...
Statement statement = connection.createStatement();
statement.setFetchSize(2000); // perhaps more...
ResultSet resultSet = statement.executeQuery("YOUR QUERY");
...
Or you could just export your MySQL data (CSV/XML) and then import it using this import-export-tool.
Converting each row into a JSON document could be done after that by parsing the CSV file, as sketched below.
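For the conversion step, here is a rough sketch using the MongoDB Java driver; the collection name, column order, and naive comma split are assumptions, so adapt them to your schema (and use a proper CSV parser if your fields can contain quoted commas):

MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoCollection<Document> collection = client.getDatabase("mydb").getCollection("event_logs");

List<Document> batch = new ArrayList<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("event_logs.csv"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] cols = line.split(","); // naive split, for illustration only
        batch.add(new Document("event_type", cols[0])
                .append("payload", cols[1])
                .append("created_at", cols[2]));
        if (batch.size() == 1000) {      // insert in batches to keep memory bounded
            collection.insertMany(batch);
            batch.clear();
        }
    }
}
if (!batch.isEmpty()) {
    collection.insertMany(batch);
}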
You can also try creating a Statement with these parameters:
Statement stmt = con.createStatement(
ResultSet.TYPE_SCROLL_INSENSITIVE,
ResultSet.CONCUR_READ_ONLY);
Use Mongify, a Ruby-based application, for a super simple conversion from MySQL to MongoDB.
I'm looking for a way to stream binary data to/from a database. If possible, I'd like it to be done with Hibernate (in a database-agnostic way).
All the solutions I've found involve explicit or implicit loading of the binary data into memory as a byte[]. I need to avoid that. Let's say I want my code to be able to write a 2 GB video from the database (stored in a BLOB column) to a local file, or the other way around, using no more than 256 MB of memory. It's clearly achievable, and involves no voodoo. But I can't find a way, and for now I'm trying to avoid debugging Hibernate.
Let's look at sample code (keeping in mind -Xmx256m).
Entity class:
public class SimpleBean {
    private Long id;
    private Blob data;
    // ... skipping getters, setters and constructors.
}
Hibernate mapping fragment:
<class name="SimpleBean" table="SIMPLE_BEANS">
<id name="id" column="SIMPLE_BEAN_ID">
<generator class="increment" />
</id>
<property name="data" type="blob" column="DATA" />
</class>
Test code fragment:
Configuration cfg = new Configuration().configure("hibernate.cfg.xml");
ServiceRegistry serviceRegistry = new ServiceRegistryBuilder()
        .applySettings(cfg.getProperties())
        .buildServiceRegistry();
SessionFactory sessionFactory = cfg.buildSessionFactory(serviceRegistry);

Session session = sessionFactory.openSession();
session.beginTransaction();

File dataFile = new File("movie_1gb.avi");
long dataSize = dataFile.length();
InputStream dataStream = new FileInputStream(dataFile);

LobHelper lobHelper = session.getLobHelper();
Blob dataBlob = lobHelper.createBlob(dataStream, dataSize);

session.save(new SimpleBean(dataBlob));
session.getTransaction().commit(); // Throws java.lang.OutOfMemoryError
session.close();

dataStream.close();
sessionFactory.close();
When running that snippet I get an OutOfMemoryError. Looking at the stack trace shows that Hibernate tries to load the stream into memory and runs out (as it should). Here's the stack trace:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at org.hibernate.type.descriptor.java.DataHelper.extractBytes(DataHelper.java:183)
at org.hibernate.type.descriptor.java.BlobTypeDescriptor.unwrap(BlobTypeDescriptor.java:121)
at org.hibernate.type.descriptor.java.BlobTypeDescriptor.unwrap(BlobTypeDescriptor.java:45)
at org.hibernate.type.descriptor.sql.BlobTypeDescriptor$4$1.doBind(BlobTypeDescriptor.java:105)
at org.hibernate.type.descriptor.sql.BasicBinder.bind(BasicBinder.java:92)
at org.hibernate.type.AbstractStandardBasicType.nullSafeSet(AbstractStandardBasicType.java:305)
at org.hibernate.type.AbstractStandardBasicType.nullSafeSet(AbstractStandardBasicType.java:300)
at org.hibernate.type.AbstractSingleColumnStandardBasicType.nullSafeSet(AbstractSingleColumnStandardBasicType.java:57)
at org.hibernate.persister.entity.AbstractEntityPersister.dehydrate(AbstractEntityPersister.java:2603)
at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2857)
at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:3301)
at org.hibernate.action.internal.EntityInsertAction.execute(EntityInsertAction.java:88)
at org.hibernate.engine.spi.ActionQueue.execute(ActionQueue.java:362)
at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:354)
at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:275)
at org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:326)
at org.hibernate.event.internal.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:52)
at org.hibernate.internal.SessionImpl.flush(SessionImpl.java:1214)
at org.hibernate.internal.SessionImpl.managedFlush(SessionImpl.java:403)
at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.beforeTransactionCommit(JdbcTransaction.java:101)
at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.commit(AbstractTransactionImpl.java:175)
at ru.swemel.msgcenter.domain.SimpleBeanTest.testBasicUsage(SimpleBeanTest.java:63)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
I used Hibernate 4.1.5.SP1. The exact question is: how do I avoid loading the stream into memory when storing a blob in the database with Hibernate, and use direct streaming instead? I'd like to avoid off-topic discussion about why one would store video in a database column instead of storing it in a content repository and linking to it. Please consider it a model; the reason is irrelevant to the question.
It also seems that capabilities may differ between dialects, and Hibernate might try to load everything into memory because the underlying database doesn't support streaming blobs or something like that. If that's the case, I'd like to see some kind of comparison between the dialects with respect to how they handle blobs.
Thank you very much for your help!
For those looking for the same thing:
My bad, the code works as intended (it streams, without trying to copy everything into memory) for PostgreSQL (and probably lots of others). The inner workings of Hibernate depend on the selected dialect. The one I used in the first place overrides direct use of streams in favor of a BinaryStream backed by a byte[].
There are also no problems with performance, since it loads only the OID (a number) in the case of PostgreSQL, and probably lazy-loads the data in the case of other dialects (including the byte[] implementation). I just ran some quick tests and saw no visible difference across 10,000 loads of the entity with and without the binary data field.
Storing data in the database does seem to be slower than just saving it on disk as an external file, though. But it saves you a lot of headache with backups, the limitations of a particular file system, concurrent updates, etc. But that's off-topic.
Your solution using Hibernate's LobHelper should work, but you may need to make sure the use of streams is enforced.
Set the property hibernate.jdbc.use_streams_for_binary = true.
This is a system-level property, so it has to be set at startup (I defined it on the command line during testing):
java -Dhibernate.jdbc.use_streams_for_binary=true blobTest
You can prove it's changed in your code:
Object prop = props.get("hibernate.jdbc.use_streams_for_binary");
System.out.println("hibernate.jdbc.use_streams_for_binary" + "/" + prop);
You are storing the Blob in your POJO SimpleBean. This means that if the blob is larger than your heap space, any time you work with this object or access the data field you're going to get an OutOfMemoryError, because the entire thing is loaded into memory.
I don't think there's a way to set or get a database field using a Stream in Hibernate, and HQL supports INSERT only in the INSERT ... SELECT form.
What you may have to do is remove the data field from the SimpleBean object so that it's not held in memory when you load or save. When you need to save a blob, you can use Hibernate's save() to create the row, then use a JDBC PreparedStatement and its setBinaryStream() method. When you need to access the stream, you can use Hibernate's load() method to get a SimpleBean object, run a JDBC select to get a ResultSet, and use its getBinaryStream() method to read the blob. The docs for setBinaryStream() say:
The data will be read from the stream as needed until end-of-file is reached.
So the data won't be stored entirely in memory.
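A rough sketch of that save path, under the assumption that the data property is removed from the mapping and the file is streamed into the DATA column of SIMPLE_BEANS (names taken from the question's mapping) via plain JDBC obtained through Session.doWork(); treat it as an illustration rather than the definitive recipe:

Session session = sessionFactory.openSession();
session.beginTransaction();

// 1. Create the row via Hibernate, without the blob
SimpleBean bean = new SimpleBean();
session.save(bean);
session.flush();

// 2. Stream the file into the blob column with plain JDBC
File movie = new File("movie_1gb.avi");
session.doWork(connection -> {
    try (PreparedStatement ps = connection.prepareStatement(
            "UPDATE SIMPLE_BEANS SET DATA = ? WHERE SIMPLE_BEAN_ID = ?");
         InputStream in = new FileInputStream(movie)) {
        ps.setBinaryStream(1, in, movie.length()); // the driver reads the stream as needed
        ps.setLong(2, bean.getId());
        ps.executeUpdate();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
});

session.getTransaction().commit();
session.close();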