Error regarding usage of Super CSV bean reader - Java

I have the following dependency added:
<dependency>
    <groupId>net.sf.supercsv</groupId>
    <artifactId>super-csv</artifactId>
    <version>2.4.0</version>
</dependency>
private final static String[] COLS = { "col1", "col2", "col3", "col4", "col5",
        "col6", "col7", "col8", "col9", "col10", "col11",
        "col12", "col13", "col14" };
private final static String[] TEMP_COLS = { "col1", "col2", "col3", "col4", "col5",
        "col6", "col7", "col8", "col9", "col10", "col11",
        "col12", "col13" };
Below is how I build my reader:
protected CsvPreference csvPref = CsvPreference.STANDARD_PREFERENCE;
protected String encoding = "US-ASCII";
InputStream is = fs.open(path);
BufferedReader br = new BufferedReader(new InputStreamReader(is, encoding));
ICsvBeanReader csvReader = new CsvBeanReader(br, csvPref);
As part of the bean reader, I have the following code:
Selections bean = null;
try {
    bean = reader.read(Selections.class, Selections.getCols());
} catch (Exception e) {
    // bean = reader.read(Selections.class, Selections.getTempCols());
    // slf4j.error(bean.getEventCode() + bean.getProgramId());
    slf4j.error("Error Logged for bean because of COLUMNS MISMATCH");
}
In the above code, it is throwing this exception:
java.lang.IllegalArgumentException: the nameMapping array and the number of columns read should be the same size (nameMapping length = 14, columns = 13)
I am not sure what is causing this exception. It is thrown for some of the records even though all the records have 14 columns (I have verified this with a script, and I have even created a schema and uploaded the file with 14 columns). Out of 7,000,000 records, 2,100,000 have this issue.
In order to debug which record is causing the problem, I made the changes below to the code.
Selections bean = null;
try {
    bean = reader.read(Selections.class, Selections.getCols());
} catch (Exception e) {
    bean = reader.read(Selections.class, Selections.getTempCols());
    slf4j.error(bean.getEventCode() + bean.getProgramId());
    slf4j.error("Error Logged for bean because of COLUMNS MISMATCH");
}
Now, the above changes are throwing: java.lang.IllegalArgumentException: the nameMapping array and the number of columns read should be the same size (nameMapping length = 13, columns = 14)
I have no idea why the CSV reader is behaving so strangely. When the column count is not 14 it throws an exception, yet inside the exception handler, when I try to read the record again to print its details, it says the column count is 14.
Please help me debug this issue. I will add more details about the issue if needed; please let me know.

After a dive into the Super CSV source and your confirmation that you can upload the file with 14 columns correctly, I'd suggest you look for a replacement for Super CSV.
My recommendation: Check out Apache Commons CSV.
This library also supports an iterative approach, so you wouldn't need to hold 7,000,000 records in memory.
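If you try it, here is a minimal sketch of its streaming API; the file name is an assumption, and the question's own column mapping is not reproduced here:
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

try (Reader reader = Files.newBufferedReader(Paths.get("input.csv"));
     CSVParser parser = CSVFormat.DEFAULT.parse(reader)) {
    // Records are parsed lazily while iterating, so all 7,000,000 rows never sit in memory at once.
    for (CSVRecord record : parser) {
        String col1 = record.get(0);   // access fields by index (or define a header and use names)
        int columns = record.size();   // handy for logging rows whose size is not 14
        // map the record to your bean here
    }
}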

Finally I resolved the problem. It was caused by the quote character that I had set in my CSV preferences:
new CsvPreference.Builder('"', '\u0001', "\r\n").build()
My incoming data contains " as part of the data. The issue was resolved when I replaced the quote character with a character that will never be part of the incoming data.
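For reference, a sketch of that kind of change; '\u0002' here is only an example, use any character that can never occur in your data:
// The first argument to CsvPreference.Builder is the quote character; replace the double
// quote with something that never appears in the incoming data ('\u0002' is just an example).
CsvPreference pref = new CsvPreference.Builder('\u0002', '\u0001', "\r\n").build();
ICsvBeanReader csvReader = new CsvBeanReader(new BufferedReader(new InputStreamReader(is, "US-ASCII")), pref);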
I am not an expert at this; it was my own oversight and super-csv is not at fault. I believe super-csv is a decent API to explore and use.
To learn more about column quote mode, please refer to their API documentation:
https://super-csv.github.io/super-csv/apidocs/org/supercsv/quote/ColumnQuoteMode.html

Related

Processing a large number of records from a file in Java

I have a million records in a CSV file with 3 columns: id, firstName, lastName. I have to process this file in Java and validate that id is unique and firstName is not null. Where id is not unique and/or firstName is null, I have to write those records to an output file with a fourth column giving the reason ("id not unique" / "firstName is NULL"). Performance should be good. Please suggest the most effective way.
You can use a collection (ArrayList) to store all the IDs in a loop and check whether each one already exists. If it does, write it to a file.
The code should be like this:
if (!idList.contains(id)) {
    idList.add(id);
} else {
    writer.write(id);
}
The above code should work in a loop for all the records being read from the CSV file.
You can use the OpenCSV jar for the purpose you have specified. It's under the Apache 2.0 licence.
You can download the jar from
http://www.java2s.com/Code/Jar/o/Downloadopencsv22jar.htm
Below is the code for the same:
Reader reader = Files.newBufferedReader(Paths.get(INPUT_SAMPLE_CSV_FILE_PATH));
CSVReader csvReader = new CSVReader(reader);
Writer writer = Files.newBufferedWriter(Paths.get(OUTPUT_SAMPLE_CSV_FILE_PATH));
CSVWriter csvWriter = new CSVWriter(writer);
List<String[]> list = csvReader.readAll();
for (String[] row : list) {
    // assuming first column to be id
    String id = row[0];
    // assuming name to be second column
    String name = row[1];
    // assuming lastName to be third column
    String lastName = row[2];
    // put your pattern here
    if (id == null || !id.matches("pattern") || name == null || !name.matches("pattern")) {
        String[] outPutData = new String[] { id, name, lastName, "Invalid Entry" };
        csvWriter.writeNext(outPutData);
    }
}
Let me know if this works or if you need further help or clarification.
If you want a high-performance algorithm, you should not use ArrayList.contains(element), which, as explained here, has O(n) complexity. Instead I suggest you use a HashSet, as the HashSet.contains(element) operation has O(1) complexity. In short, with an ArrayList you would perform on the order of 1,000,000^2 operations, while with a HashSet you would perform on the order of 1,000,000.
In pseudo-code (to not give away the full answer and make you find the answer on your own) I would do this:
File outputFile
String[] columns
HashSet<String> ids
for (line in file):
    columns = line.split(',')
    if (ids.contains(columns.id)):
        outputFile.append(columns.id + " is not unique")
        continue
    if (columns.name == null):
        outputFile.append("first name is null!")
        continue
    ids.add(columns.id)
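For completeness, a minimal Java sketch of this idea; the file names, the comma separator and the output format are assumptions:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class CsvIdValidator {
    public static void main(String[] args) throws IOException {
        Set<String> seenIds = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.csv"));
             BufferedWriter writer = Files.newBufferedWriter(Paths.get("rejected.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",", -1);   // id, firstName, lastName
                String id = cols[0];
                String firstName = cols.length > 1 ? cols[1] : "";
                if (firstName.isEmpty()) {
                    writer.write(line + ",firstName is NULL");
                    writer.newLine();
                } else if (!seenIds.add(id)) {         // add() returns false for duplicates: O(1) check
                    writer.write(line + ",id not unique");
                    writer.newLine();
                }
            }
        }
    }
}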

How to use the SQL Server JDBC bulk copy API

I ran into some issues trying to map my column metadata with the SQL Server Bulk Copy API and SQLServerBulkCSVFileRecord. Just for test purposes I made a table consisting of only nvarchar(500) columns and added the metadata like this:
fileRecord = new SQLServerBulkCSVFileRecord(csvPath, false);
for (int i = 1; i <= colCount; i++) {
    fileRecord.addColumnMetadata(i, null, java.sql.Types.NVARCHAR, 500, 0);
}
I get the following stack trace after using the Microsoft SQL bulk copy API with JDBC, and I can't find any documentation on SQLServerBulkCSVFileRecord. I don't know what the parameters of addColumnMetadata stand for: looking at this example, I just assumed that the first parameter is the column index, the third is the data type, and the fourth is the byte count of the column(?).
com.microsoft.sqlserver.jdbc.SQLServerException: Unicode data is odd byte size for column 1. Should be even byte size.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:217)
at com.microsoft.sqlserver.jdbc.TDSTokenHandler.onEOF(tdsparser.java:251)
at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:81)
at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:36)
at com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.doInsertBulk(SQLServerBulkCopy.java:1433)
at com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.access$200(SQLServerBulkCopy.java:41)
at com.microsoft.sqlserver.jdbc.SQLServerBulkCopy$1InsertBulk.doExecute(SQLServerBulkCopy.java:666)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:6276)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1793)
at com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.sendBulkLoadBCP(SQLServerBulkCopy.java:699)
at com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.writeToServer(SQLServerBulkCopy.java:1516)
at com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.writeToServer(SQLServerBulkCopy.java:616)
I read that blank lines, non-CRLF line endings, encoding, etc. could have an impact, but I feel like I've exhausted those options.
Finally here's a little sample of my CSV file:
column1|test|1|testtest|test3
column2|test|2|testt46426est|test346
column3|test|3|test4test|test3426234
You are not specifying the pipe character as your field delimiter. Note that it needs to be escaped as "\\|" because, according to the documentation:
The delimiter specified for the CSV file should not appear anywhere in the data and should be escaped properly if it is a restricted character in Java regular expressions.
I just tried the following code and it worked for me:
String csvPath = "C:/Users/Gord/Desktop/sample.txt";
SQLServerBulkCSVFileRecord fileRecord =
new SQLServerBulkCSVFileRecord(csvPath, null, "\\|", false);
int colCount = 5;
for (int i = 1; i <= colCount; i++) {
fileRecord.addColumnMetadata(i, null, java.sql.Types.NVARCHAR, 50, 0);
}
try (SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(conn)) {
bulkCopy.setDestinationTableName("dbo.so41144967");
try {
// Write from the source to the destination.
bulkCopy.writeToServer(fileRecord);
} catch (Exception e) {
// Handle any errors that may have occurred.
e.printStackTrace();
}
}

Am I doing it the right way? Predicting stock price

I prepared a CSV file with the input data for a neural network, and a CSV file where I can test my neural network. The results are not satisfactory. I tried increasing/decreasing the size of the input data. I am probably missing something and would be glad if someone could give me some tips. Here is my Encog code:
//input data
File file = new File("path to file");
CSVFormat format = new CSVFormat('.', ',');
VersatileDataSource source = new CSVDataSource(file, false, format);
VersatileMLDataSet data = new VersatileMLDataSet(source);
data.getNormHelper().setFormat(format);
ColumnDefinition wig20OpenN = data.defineSourceColumn("wig20OpenN", 0, ColumnType.continuous);
(...)
ColumnDefinition futureClose = data.defineSourceColumn("futureClose", 81, ColumnType.continuous);
data.analyze();
data.defineSingleOutputOthersInput(futureClose);
EncogModel model = new EncogModel(data);
//TYPE_RBFNETWORK, TYPE_SVM, TYPE_NEAT, TYPE_FEEDFORWARD <- this type of method i was trying
model.selectMethod(data, MLMethodFactory.TYPE_SVM);
model.setReport(new ConsoleStatusReportable());
data.normalize();
model.holdBackValidation(0.001, true, 10);
model.selectTrainingType(data);
MLRegression bestMethod = (MLRegression)model.crossvalidate(20, true);
// Display the training and validation errors.
System.out.println( "Training error: " + model.calculateError(bestMethod, model.getTrainingDataset()));
System.out.println( "Validation error: " + model.calculateError(bestMethod, model.getValidationDataset()));
NormalizationHelper helper = data.getNormHelper();
File testingData = new File("path to testing file");
ReadCSV csv = new ReadCSV(testingData, false, format);
String[] line = new String[81];
MLData input = helper.allocateInputVector();
while(csv.next()) {
StringBuilder result = new StringBuilder();
for(int i = 0; i <81; i++){
line[i] = csv.get(i);
}
String correct = csv.get(81);
helper.normalizeInputVector(line,input.getData(),false);
MLData output = bestMethod.compute(input);
String irisChosen = helper.denormalizeOutputVectorToString(output)[0];
result.append(Arrays.toString(line));
result.append(" -> predicted: ");
result.append(irisChosen);
result.append("(correct: ");
result.append(correct);
result.append(")");
System.out.println(result.toString());
}
// Delete data file and shut down.
file.delete();
Encog.getInstance().shutdown();
What I have tried so far is changing the MLMethodFactory type, but I had problems here: only TYPE_RBFNETWORK, TYPE_SVM, TYPE_NEAT and TYPE_FEEDFORWARD work fine. For example, if I changed it to TYPE_PNN I got the following exception:
Exception in thread "main" org.encog.EncogError: Please call selectTraining first to choose how to train.
OK, I know from the documentation that I should use this method:
selectTraining(VersatileMLDataSet dataset, String trainingType, String trainingArgs)
But the String values for trainingType and trainingArgs are confusing.
And a last question: what about saving the neural network to a file after training, and loading it later to check it on the training data? I would like to have this as a separate step.
Edit: I forgot to mention that the size of the input data is 1500.
I see that you are not satisfied with your results, but they are relatively fine. I propose you consider adding scaling to your training. You have 81 columns, and in your input row I see values like 16519.1600, 2315.94, and even -0.6388282285709328. It is hard for a neural network to adjust its weights correctly for such differently scaled inputs.
P.S. Scaling here also means normalizing the columns! Books usually describe normalizing rows, but normalizing columns is also important.
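For the last part of the question (saving the trained model and loading it back later), a minimal sketch using Encog's EncogDirectoryPersistence; the file name is an assumption, and you would also need to keep the NormalizationHelper settings to normalize new data the same way:
import java.io.File;
import org.encog.ml.MLRegression;
import org.encog.persist.EncogDirectoryPersistence;

// Save the best model found by crossvalidate() to a file...
EncogDirectoryPersistence.saveObject(new File("bestMethod.eg"), bestMethod);

// ...and load it back later to evaluate it separately from training.
MLRegression loaded = (MLRegression) EncogDirectoryPersistence.loadObject(new File("bestMethod.eg"));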

How to read data from a CSV if it contains more than the expected separators?

I use CsvJdbc to read data from a CSV. I get the CSV from a web service request, so it is not loaded from a file. I adjust these properties:
Properties props = new java.util.Properties();
props.put("separator", ";"); // separator is a semicolon
props.put("fileExtension", ".txt"); // file extension is .txt
props.put("charset", "UTF-8"); // UTF-8
My sample1.txt contains this data:
code;description
c01;d01
c02;d02
My sample2.txt contains this data:
code;description
c01;d01
c02;d0;;;;;2
Deleting the headers from the CSV is optional for me, but changing the semicolon separator is not.
EDIT: My query for resultSet: SELECT * FROM myCSV
I want to read the code column in sample1.txt and sample2.txt with:
resultSet.getString(1)
and read the full description column including its semicolons (d0;;;;;2). Is this possible with the CsvJdbc driver, or do I need to change drivers?
Thank you for any advice!
This is a problem that occurs when you have messy, invalid input that you need to try to interpret, and it is being read by a too-high-level package that only handles clean input. A similar example is trying to read arbitrary HTML with an XML parser: close, but no cigar.
You can guess where I'm going: you need to pre-process your input.
The preprocessing may be very easy if you can make some assumptions about the data, for example, if there are guaranteed to be no quoted semicolons in the first column.
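For example, assuming the separator can never occur in the first (code) column, a minimal pre-processing sketch could be:
// Split on the first ';' only: everything after it is treated as the description,
// embedded semicolons included. Assumes the code column itself never contains ';'.
String line = "c02;d0;;;;;2";
String[] parts = line.split(";", 2);
String code = parts[0];                                  // "c02"
String description = parts.length > 1 ? parts[1] : "";   // "d0;;;;;2"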
You could try Super CSV. We have implemented such a solution in our project. More on this can be found at http://supercsv.sourceforge.net/
and
Using CsvBeanReader to read a CSV file with a variable number of columns
Finally I solved this problem without the CsvJdbc or Super CSV drivers. Those drivers work fine: they make it possible to query data from a CSV file and they offer many features. In my case, though, I don't need to query data from the CSV. Unfortunately, the description column sometimes contains one or more semicolons, which is my separator.
First I checked the code in the answer from #Maher Abuthraa and modified it to:
private String createDescriptionFromResult(ResultSet resultSet, int columnCount) throws SQLException {
    if (columnCount > 2) {
        StringBuilder data_list = new StringBuilder();
        for (int ii = 2; ii <= columnCount; ii++) {
            data_list.append(resultSet.getString(ii));
            if (ii != columnCount)
                data_list.append(";");
        }
        // data_list has all data from all indexes you are looking for
        return data_list.toString();
    } else {
        // use standard way
        return resultSet.getString(2);
    }
}
The loop starts from 2 because column 1 is code and only the description column contains the extra semicolons. The CsvJdbc driver splits columns by the ; separator, so those semicolons disappear from the column data. Therefore I explicitly re-add a semicolon after each piece of the description except the last one, since a trailing separator is not relevant in my case.
This code works fine, but it did not solve my whole problem. When I defined two columns in the CSV header, I got an error for any row that contains more than two semicolons. So I tried to ignore the headers, or to add extra column names (or simply extra ;) to the header. In Super CSV the option to ignore headers works fine.
My colleague's opinion was: you don't need a CSV driver at all, because you are trying to load a CSV that is not really a CSV if the separator is sometimes part of the data.
I think my colleague is right, and I loaded the CSV data with the following code:
InputStream in = null;
try {
    in = new ByteArrayInputStream(csvData);
    List<String> lines = IOUtils.readLines(in, "UTF-8");
    Iterator<String> it = lines.iterator();
    String line = "";
    while (it.hasNext()) {
        line = it.next();
        String description = null;
        String code = null;
        String[] columns = line.split(";");
        if (columns.length >= 2) {
            code = columns[0];
            String[] dest = new String[columns.length - 1];
            System.arraycopy(columns, 1, dest, 0, columns.length - 1);
            description = org.apache.commons.lang.StringUtils.join(dest, ";");
            (...)
OK, my solution is to go and read all the remaining fields when there are more than 2 columns, like:
int ccc = meta.getColumnCount();
if (ccc > 2) {
    ArrayList<String> data_list = new ArrayList<String>();
    for (int ii = 1; ii < ccc; ii++) {
        data_list.add(resultSet.getString(ii));
    }
    // data_list has all data from all indexes you are looking for
} else {
    // use standard way
    resultSet.getString(1);
}
If the table is defined to have as many columns as there could be semicolons in the source, ignoring the initial column definitions, then the excess semicolons will be consumed by the database driver automatically.
The most likely reason for them to appear in the final column is that the parser returns the balance of the row, up to the terminator, in that field.
Simply increasing the number of columns in the table to match the maximum possible in the input will avoid the need for custom parsing in the program. Try:
code;description;dummy1;dummy2;dummy3;dummy4;dummy5
c01;d01
c02;d0;;;;;2
Then, the additional ';' delimiters will be consumed by the parser correctly.

Error during grouping files based on the date field

I have a large file which has 10,000 rows, and each row has a date appended at the end. All the fields in a row are tab-separated. There are 10 distinct dates, and those 10 dates have been randomly assigned to the 10,000 rows. I am now writing Java code to write all the rows with the same date into a separate file, so that each file contains the rows for that date.
I am trying to do it using string manipulation, but when I try to sort the rows based on the date, I get an error where I mention the date: it says the literal is out of range. Here is the code that I used. Please have a look at it and let me know whether this is the right approach; if not, kindly suggest a better one. I tried changing the data type to Long, but I still get the same error. A row in the file looks something like this:
Each field is tab separated and the fields are:
business id, category, city, biz.name, longitude, state, latitude, type, date
qarobAbxGSHI7ygf1f7a_Q ["Sandwiches","Restaurants"] Gilbert Jersey
Mike's Subs -111.8120071 AZ 3.5 33.3788385 business 06012010
The code is:
File f=new File(fn);
if(f.exists() && f.length()>0)
{
BufferedReader br=new BufferedReader(new FileReader(fn));
BufferedWriter bw = new BufferedWriter(new FileWriter("FilteredDate.txt"));
String s=null;
while((s=br.readLine())!=null){
String[] st=s.split("\t");
if(Integer.parseInt(st[13])==06012010){
Thanks a lot for your time..
Try this,
List<String> sampleList = new ArrayList<String>();
sampleList.add("06012012");
sampleList.add("06012013");
sampleList.add("06012014");
sampleList.add("06012015");
//
//
String[] sampleArray = s.split(" ");
if (sampleArray != null)
{
String sample = sampleArray[sampleArray.length - 1];
if (sampleList.contains(sample))
{
stringBuilder.append(sample + "\n");
}
}
I suggest not using split, but rather using
String str = s.substring(s.lastIndexOf('\t') + 1);
In any case, you take st[13] when I see you only have 9 columns; you might just need st[8].
One last thing: look at this post to learn what 06012010 really means in Java (a leading zero makes it an octal integer literal).
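For the original grouping task, a minimal sketch that compares the date as a String (so no integer literals at all) and writes each row into a per-date file; the file names are assumptions:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class GroupRowsByDate {
    public static void main(String[] args) throws IOException {
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader br = new BufferedReader(new FileReader("input.tsv"))) {
            String row;
            while ((row = br.readLine()) != null) {
                String date = row.substring(row.lastIndexOf('\t') + 1); // last tab-separated field
                BufferedWriter bw = writers.get(date);
                if (bw == null) {
                    bw = new BufferedWriter(new FileWriter(date + ".txt"));
                    writers.put(date, bw);
                }
                bw.write(row);
                bw.newLine();
            }
        } finally {
            for (BufferedWriter bw : writers.values()) {
                bw.close();
            }
        }
    }
}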
