JSON to SSTable tool out-of-memory failure - java

The json2sstable tool supplied with Cassandra 1.2.15 fails with an out-of-memory error. Back in 2011 a similar issue was reported as a bug and fixed: https://issues.apache.org/jira/browse/CASSANDRA-2189
Either I am missing some steps in the tool configuration/usage or the bug has re-emerged. Please point out what I am missing.
Repro steps:
1) Cassandra 1.2.15, one table with a varchar key and one varchar column filled with random UUIDs, 6x10^6 records.
2) JSON file generated with the sstable2json tool (~1 GB).
3) Cassandra restarted with new configuration (new data/cache/commit dirs, new partitioner)
4) Keyspace re-created
5) json2sstable fails after several minutes of processing:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at org.codehaus.jackson.util.TextBuffer.contentsAsString(TextBuffer.java:350)
at org.codehaus.jackson.impl.Utf8StreamParser.getText(Utf8StreamParser.java:278)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:59)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapArray(UntypedObjectDeserializer.java:165)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:51)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapArray(UntypedObjectDeserializer.java:165)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:51)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:204)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
at org.codehaus.jackson.map.deser.std.ObjectArrayDeserializer.deserialize(ObjectArrayDeserializer.java:104)
at org.codehaus.jackson.map.deser.std.ObjectArrayDeserializer.deserialize(ObjectArrayDeserializer.java:18)
at org.codehaus.jackson.map.ObjectMapper._readValue(ObjectMapper.java:2695)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1294)
at org.codehaus.jackson.JsonParser.readValueAs(JsonParser.java:1368)
at org.apache.cassandra.tools.SSTableImport.importUnsorted(SSTableImport.java:344)
at org.apache.cassandra.tools.SSTableImport.importJson(SSTableImport.java:328)
at org.apache.cassandra.tools.SSTableImport.main(SSTableImport.java:547)

From the json2sstable source code, the tool loads all records from the JSON file into memory and sorts them by key:
private int importUnsorted(String jsonFile, ColumnFamily columnFamily, String ssTablePath, IPartitioner<?> partitioner) throws IOException
{
    int importedKeys = 0;
    long start = System.currentTimeMillis();
    JsonParser parser = getParser(jsonFile);
    Object[] data = parser.readValueAs(new TypeReference<Object[]>(){});
    keyCountToImport = (keyCountToImport == null) ? data.length : keyCountToImport;
    SSTableWriter writer = new SSTableWriter(ssTablePath, keyCountToImport);
    System.out.printf("Importing %s keys...%n", keyCountToImport);
    // sort by dk representation, but hold onto the hex version
    SortedMap<DecoratedKey, Map<?, ?>> decoratedKeys = new TreeMap<DecoratedKey, Map<?, ?>>();
    for (Object row : data)
    {
        Map<?, ?> rowAsMap = (Map<?, ?>) row;
        decoratedKeys.put(partitioner.decorateKey(hexToBytes((String) rowAsMap.get("key"))), rowAsMap);
        ....
According to Jonathan Ellis's comment on the CASSANDRA-2322 issue, this behavior is by design.
Thus json2sstable is not well suited for importing production-size data into Cassandra: the tool is likely to crash on large datasets.
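If the dump does not fit in the tool's heap, one workaround is to split the sstable2json output into several smaller JSON arrays and import each part separately; Cassandra has no problem serving data spread across several SSTables. Below is a minimal splitter sketch using Jackson's streaming API. It assumes Jackson 2.x (com.fasterxml) is on the classpath (the stack trace above shows the older org.codehaus.jackson packages, so adjust the package names if needed), and the dump.json / dump-part-N.json file names are made up for the example:

import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.Map;

public class SplitJsonDump {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonFactory factory = mapper.getFactory();
        JsonParser parser = factory.createParser(new File("dump.json"));
        int rowsPerChunk = 500000, rowsInChunk = 0, chunk = 0;
        JsonGenerator gen = null;
        if (parser.nextToken() != JsonToken.START_ARRAY)
            throw new IllegalStateException("expected a top-level JSON array");
        while (parser.nextToken() == JsonToken.START_OBJECT) {
            if (gen == null) {
                gen = factory.createGenerator(new File("dump-part-" + chunk++ + ".json"), JsonEncoding.UTF8);
                gen.writeStartArray();
            }
            // Stream one row at a time instead of materialising the whole dump.
            Map<?, ?> row = mapper.readValue(parser, Map.class);
            mapper.writeValue(gen, row);
            if (++rowsInChunk == rowsPerChunk) {
                gen.writeEndArray();
                gen.close();
                gen = null;
                rowsInChunk = 0;
            }
        }
        if (gen != null) {
            gen.writeEndArray();
            gen.close();
        }
        parser.close();
    }
}

Each part can then be fed to json2sstable with its own output SSTable path; the tool still sorts each part in memory, but the parts are small enough to fit in the heap.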

Related

String or binary data would be truncated Error when the data has no issues

I have an error from a working POP3 daemon that is supposed to pull email data from the server and insert it into multiple local database tables. I see this message in my log every second, which I believe is hurting database performance and using up a lot of database pool connections.
After seeing this message I checked the lengths of my column data and also added code that keeps data out of the DB if it exceeds the specified length, but the error still occurs. Oddly, executing this query separately in the database inserts the data, but running it from WAS causes problems. SQL Server 2015. There are no triggers on the table.
[SQL Error] errorCode : 8152, sqlState : 22001, message : string or binary data would be truncated
INSERT INTO t_mail_rcvinfo
(
rcvInfoId,
mailId,
rcvType,
rcvIdType,
rcvId,
sortNo,
rcvName,
device,
regUserId,
regDate,
chgUserId,
chgDate
)
VALUES
(
'CA2MLe38cc3c33863bb3b26bd8a36edeebc01',
'CA2MLe38cc3be3863bb3b26bd8a360f3fa9c7',
'TO',
'EMAIL',
'datpt#email.com',
'3',
'datpt#email.com',
'PC',
null,
'2020-03-17 12:02:07.056',
null,
'2020-03-17 12:02:07.056'
)
//Implemented Code after looking at the issue
for (int i = 0; i < list1.size(); i++) {
    MailRcvInfoVO infoVO = (MailRcvInfoVO) list1.get(i);
    if (infoVO.getRcvId().getBytes("UTF-8").length > 200) {
        rcvToIdLength = infoVO.getRcvId().getBytes("UTF-8").length;
        isValidMail = false;
        break;
    }
    if (infoVO.getRcvName().getBytes("UTF-8").length > 200) {
        rcvToNameLength = infoVO.getRcvName().getBytes("UTF-8").length;
        isValidMail = false;
        break;
    }
}
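The check above only covers rcvId and rcvName, so one diagnostic step could be to validate every string field against the width the database actually enforces before the insert. Below is a minimal sketch of such a check; the column widths in the map are assumptions for illustration and must be taken from the real DDL of t_mail_rcvinfo, and note that for NVARCHAR columns SQL Server counts characters rather than UTF-8 bytes, so a byte-length check can disagree with the server:

import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class RowLengthCheck {
    // Hypothetical widths; replace with the actual column definitions of t_mail_rcvinfo.
    private static final Map<String, Integer> MAX_LEN = new LinkedHashMap<>();
    static {
        MAX_LEN.put("rcvId", 200);
        MAX_LEN.put("rcvName", 200);
        MAX_LEN.put("rcvType", 10);
        MAX_LEN.put("device", 10);
    }

    /** Returns the name of the first field that exceeds its declared width, or null if the row fits. */
    public static String findOversizedField(Map<String, String> row) {
        for (Map.Entry<String, Integer> e : MAX_LEN.entrySet()) {
            String value = row.get(e.getKey());
            if (value != null && value.getBytes(StandardCharsets.UTF_8).length > e.getValue()) {
                return e.getKey();
            }
        }
        return null;
    }
}

Logging the field name (and the full value) whenever findOversizedField returns non-null would show which column the server is actually complaining about.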

Cassandra Exception

For my current project I'm using Cassandra to fetch data frequently: at least 30 DB requests hit every second, and each request needs to fetch at least 40,000 rows from the DB. The following is my current code; this method returns a HashMap.
public Map<String, String> loadObject(ArrayList<Integer> tradigAccountList) {
    com.datastax.driver.core.Session session;
    Map<String, String> orderListMap = new HashMap<>();
    List<ResultSetFuture> futures = new ArrayList<>();
    List<ListenableFuture<ResultSet>> Future;
    try {
        session = jdbcUtils.getCassandraSession();
        PreparedStatement statement = jdbcUtils.getCassandraPS(CassandraPS.LOAD_ORDER_LIST);
        for (Integer tradingAccount : tradigAccountList) {
            futures.add(session.executeAsync(statement.bind(tradingAccount).setFetchSize(3000)));
        }
        Future = Futures.inCompletionOrder(futures);
        for (ListenableFuture<ResultSet> future : Future) {
            for (Row row : future.get()) {
                orderListMap.put(row.getString("cliordid"), row.getString("ordermsg"));
            }
        }
    } catch (Exception e) {
    } finally {
    }
    return orderListMap;
}
My data request query is something like this,
"SELECT cliordid,ordermsg FROM omsks_v1.ordersStringV1 WHERE tradacntid = ?".
My Cassandra cluster has 2 nodes, with 32 concurrent read and 32 concurrent write threads each, and my DB schema is as follows:
CREATE TABLE omsks_v1.ordersstringv1_copy1 (
tradacntid int,
cliordid text,
ordermsg text,
PRIMARY KEY (tradacntid, cliordid)
) WITH bloom_filter_fp_chance = 0.01
AND comment = ''
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE'
AND caching = {
'keys' : 'ALL',
'rows_per_partition' : 'NONE'
}
AND compression = {
'sstable_compression' : 'LZ4Compressor'
}
AND compaction = {
'class' : 'SizeTieredCompactionStrategy'
};
My problem is that I am getting a Cassandra timeout exception. How can I optimize my code to handle all these requests?
It would be better if you attached the snippet of that exception (read/write exception). I assume you are getting a read timeout: you are trying to fetch a large data set in a single request.
For each request at least 40,000 rows need to be fetched from the DB
If you have large records and the result set is too big, Cassandra throws an exception when the results cannot be returned within the time limit set in cassandra.yaml:
read_request_timeout_in_ms
You can increase the timeout, but this is not a good option. It may resolve the issue (the exception may no longer be thrown), but it will take more time to return the result.
Solution: for a big data set you can get the result using manual pagination (range queries) with a limit.
SELECT cliordid, ordermsg FROM omsks_v1.ordersStringV1
WHERE tradacntid >= ? AND cliordid > ? LIMIT ?;
Or use a range query:
SELECT cliordid, ordermsg FROM omsks_v1.ordersStringV1
WHERE tradacntid = ? AND cliordid >= ? AND cliordid <= ?;
This will be much faster than fetching the whole result set.
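A minimal sketch of how such clustering-column paging could be driven from the Java side, using the table and column names from the question; the page size of 3000 mirrors the original setFetchSize, and binding the LIMIT with a marker assumes a CQL3 / Cassandra 2.0+ cluster:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.HashMap;
import java.util.Map;

public class ManualPaging {
    // Pages through one partition via the clustering column, keeping only one page in memory at a time.
    public static Map<String, String> loadOrders(Session session, int tradingAccount) {
        PreparedStatement ps = session.prepare(
                "SELECT cliordid, ordermsg FROM omsks_v1.ordersstringv1 "
              + "WHERE tradacntid = ? AND cliordid > ? LIMIT ?");
        Map<String, String> orders = new HashMap<>();
        String lastSeen = "";                       // '' sorts before every non-empty text value
        while (true) {
            ResultSet rs = session.execute(ps.bind(tradingAccount, lastSeen, 3000));
            int fetched = 0;
            for (Row row : rs) {
                lastSeen = row.getString("cliordid");
                orders.put(lastSeen, row.getString("ordermsg"));
                fetched++;
            }
            if (fetched < 3000) {
                break;                              // fewer rows than the limit: last page
            }
        }
        return orders;
    }
}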
You can also try reducing the fetch size, although the whole result set will still be returned. Use
public Statement setFetchSize(int fetchSize)
and check whether the exception is still thrown.
setFetchSize controls the page size, but it doesn't control the maximum number of rows returned in a ResultSet.
Another point to note: what is the size of tradigAccountList?
Too many requests at a time may also lead to a timeout. A large tradigAccountList means many read requests are issued at once (load balancing of requests is handled by Cassandra, and how many requests can be handled depends on cluster size and some other factors), which may cause this exception.
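One way to keep a large tradigAccountList from flooding the cluster is to cap the number of in-flight asynchronous reads. This is a minimal throttling sketch, not from the question: the semaphore size of 64 is an assumed value to tune, the method reference requires Java 8, and directExecutor() needs a Guava version that provides it:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;

public class ThrottledReads {
    // Caps the number of in-flight async reads; 64 is an assumed value, not taken from the question.
    public static List<ResultSetFuture> submitThrottled(Session session,
                                                        PreparedStatement statement,
                                                        List<Integer> accounts)
            throws InterruptedException {
        final Semaphore inFlight = new Semaphore(64);
        List<ResultSetFuture> futures = new ArrayList<>();
        for (Integer account : accounts) {
            inFlight.acquire();                     // wait until a slot is free
            ResultSetFuture f = session.executeAsync(statement.bind(account).setFetchSize(3000));
            // Release the slot when the query completes, successfully or not.
            f.addListener(inFlight::release, MoreExecutors.directExecutor());
            futures.add(f);
        }
        return futures;
    }
}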
Some related Links:
Cassandra read timeout
NoHostAvailableException With Cassandra & DataStax Java Driver If Large ResultSet
Cassandra .setFetchSize() on statement is not honoured

How to overcome SVMWithSGD that throws ArrayIndexOutOfBoundsException for index bigger that 5000?

In order to detect visitor demographics based on their behavior I used the SVM algorithm from Spark MLlib:
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), "labels.txt").toJavaRDD();
JavaRDD<LabeledPoint> training = data.sample(false, 0.6, 11L);
training.cache();
JavaRDD<LabeledPoint> test = data.subtract(training);
// Run training algorithm to build the model.
int numIterations = 100;
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);
// Clear the default threshold.
model.clearThreshold();
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(new SVMTestMapper(model));
Unfortunately, final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations); throws an ArrayIndexOutOfBoundsException:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4857
labels.txt is a text file whose lines have the form:
visitor criterion (is male) | list of siteId:accessCount pairs
1 27349:1 23478:1 35752:1 9704:2 27896:1 30050:2 30018:1
1 36214:1 26378:1 26606:1 26850:1 17968:2
1 21870:1 41294:1 37388:1 38626:1 10711:1 28392:1 20749:1
1 29328:1 34370:1 19727:1 29542:1 37621:1 20588:1 42426:1 30050:6 28666:1 23190:3 7882:1 35387:1 6637:1 32131:1 23453:1
I tried with a lot of data and algorithms, and as far as I can see it gives an error for site IDs bigger than 5000.
Is there any solution to overcome this, or is there another library for this problem? Or, because the data matrix is so sparse, should I use SVD?
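One thing worth checking: MLUtils.loadLibSVMFile expects the indices on each line to be one-based and in ascending order, and it infers the number of features from them. The sample lines above are not sorted (for example 30050:2 is followed by 30018:1), so the inferred dimensionality can come out too small, which would produce exactly this kind of index error. A minimal sketch, assuming that is the cause, is to sort the indices in the file and/or pass the dimensionality explicitly; the 50000 below is an assumed upper bound on the site IDs, not a value from the question:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

// Using the same JavaSparkContext sc as in the question:
int numFeatures = 50000;   // assumed upper bound on site IDs
JavaRDD<LabeledPoint> data =
        MLUtils.loadLibSVMFile(sc.sc(), "labels.txt", numFeatures).toJavaRDD();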

clearing batch preparedstatements

I have a Java application which reads files and writes to an Oracle DB row by row.
We have come across a strange error during batch insert which does not occur during sequential insert. The error is strange because it occurs only with the IBM JDK 7 on the AIX platform, and I get the error on a different row every time. My code looks like this:
prpst = conn.prepareStatement(query);
while ((line = bf.readLine()) != null) {
    numLine++;
    batchInsert(prpst, line);
    //onebyoneInsert(prpst, line);
}

private static void batchInsert(PreparedStatement prpst, String line) throws IOException, SQLException {
    prpst.setString(1, "1");
    prpst.setInt(2, numLine);
    prpst.setString(3, line);
    prpst.setString(4, "1");
    prpst.setInt(5, 1);
    prpst.addBatch();
    if (++batchedLines == 200) {
        prpst.executeBatch();
        batchedLines = 0;
        prpst.clearBatch();
    }
}

private static void onebyoneInsert(PreparedStatement prpst, String line) throws Exception {
    int batchedLines = 0;
    prpst.setString(1, "1");
    prpst.setInt(2, numLine);
    prpst.setString(3, line);
    prpst.setString(4, "1");
    prpst.setInt(5, 1);
    prpst.executeUpdate();
}
I get this error in batch insert mode:
java.sql.BatchUpdateException: ORA-01461: can bind a LONG value only for insert into a LONG column
at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10345)
I already know why this ORA error normally occurs, but that is not my case: I am nearly sure that I am not binding data that is too large for a smaller column. Maybe I am hitting some bug in the IBM JDK 7, but I could not prove it.
My question is whether there is a way to avoid this problem. One-by-one insert is not an option, because we have big files and it takes too much time.
Try with
prpst.setInt(5,new Integer(1))
What is the type of the variable numLine?
Can you share the types of the columns corresponding to the fields you set in the PreparedStatement?
Try once by processing with onebyoneInsert and share the output for that case; it might help identify the root cause.
Also print the value of numLine to the console.
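To narrow the failure down to a single input line, the executeBatch() call inside batchInsert could be wrapped like this. This is a small diagnostic sketch, not a fix; it reuses the numLine and batchedLines fields from the question, and how many entries getUpdateCounts() reports after a failure is driver-dependent:

try {
    prpst.executeBatch();
} catch (java.sql.BatchUpdateException bue) {
    // Number of statements the driver reports as processed in this batch.
    int processed = bue.getUpdateCounts().length;
    // The batch started at input line (numLine - batchedLines + 1), so this points near the bad line.
    System.err.println("Batch failed around input line " + (numLine - batchedLines + processed + 1));
    throw bue;
}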

How to provide correct arguments to setAsciiStream method?

This is my FULL test code with the main method:
public class TestSetAscii {
    public static void main(String[] args) throws SQLException, FileNotFoundException {
        String dataFile = "FastLoad1.csv";
        String insertTable = "INSERT INTO " + "myTableName" + " VALUES(?,?,?)";
        Connection conStd = DriverManager.getConnection("jdbc:xxxxx", "xxxxxx", "xxxxx");
        InputStream dataStream = new FileInputStream(new File(dataFile));
        PreparedStatement pstmtFld = conStd.prepareStatement(insertTable);
        // Until this line everything is awesome
        pstmtFld.setAsciiStream(1, dataStream, -1); // This line fails
        System.out.println("works");
    }
}
I get the "cbColDef value out of range" error
Exception in thread "main" java.sql.SQLException: [Teradata][ODBC Teradata Driver] Invalid precision: cbColDef value out of range
at sun.jdbc.odbc.JdbcOdbc.createSQLException(Unknown Source)
at sun.jdbc.odbc.JdbcOdbc.standardError(Unknown Source)
at sun.jdbc.odbc.JdbcOdbc.SQLBindInParameterAtExec(Unknown Source)
at sun.jdbc.odbc.JdbcOdbcPreparedStatement.setStream(Unknown Source)
at sun.jdbc.odbc.JdbcOdbcPreparedStatement.setAsciiStream(Unknown Source)
at file.TestSetAscii.main(TestSetAscii.java:21)
Here is the link to my FastLoad1.csv file. I guess that setAsciiStream fails because of the FastLoad1.csv file, but I am not sure.
(In my previous question I was not able to narrow down the problem that I had. Now I have shortened the code.)
It would depend on the table schema, but the third parameter of setAsciiStream is the length.
So
pstmtFld.setAsciiStream(1, dataStream, 4);
would work for a field of length 4 bytes.
But I don't think it would work as you expect in your code. Each bind should have its own separate stream.
setAsciiStream() is designed for large data values (BLOBs or long VARCHARs). It is not designed to read a CSV file line by line and split it into separate values.
Basically it just binds one of the question marks to the InputStream.
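As a minimal sketch of what per-parameter binding with explicit lengths could look like (the literal values here are made up, they are not read from FastLoad1.csv, and the three columns match the VALUES(?,?,?) statement above):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// ... with the same pstmtFld as in the question:
byte[] c1 = "AAAA".getBytes(StandardCharsets.US_ASCII);
byte[] c2 = "BBBB".getBytes(StandardCharsets.US_ASCII);
byte[] c3 = "CCCC".getBytes(StandardCharsets.US_ASCII);
pstmtFld.setAsciiStream(1, new ByteArrayInputStream(c1), c1.length);
pstmtFld.setAsciiStream(2, new ByteArrayInputStream(c2), c2.length);
pstmtFld.setAsciiStream(3, new ByteArrayInputStream(c3), c3.length);
pstmtFld.executeUpdate();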
After looking into the provided example, it looks like Teradata can handle CSV, but you have to tell it explicitly with:
String urlFld = "jdbc:teradata://whomooz/TMODE=ANSI,CHARSET=UTF8,TYPE=FASTLOADCSV";
I don't have enough reputation to comment, but I feel this info can be valuable to those navigating FastLoad via JDBC for the first time.
This code will print the full chain of stack traces and is very helpful for diagnosing problems with FastLoad:
catch (SQLException ex) {
    for ( ; ex != null; ex = ex.getNextException())
        ex.printStackTrace();
}
In the case of the code above, it works if you specify TYPE=FASTLOADCSV in the connection string, but when run multiple times it will fail due to the creation of the error tables _ERR_1 and _ERR_2. Drop these tables and clear out the destination table to run again.
