Split .srt file into equal chunks

Split .srt file into equal chunks - java

I am new to this and i need to split Srt(subtitle file) into multiple chunks.
For example: if i have subtitle file of a video(60 minutes). Then the subtitle file should split into 6 subtitle files having each subtitle file of 10 minutes.
i.e 6 X 10 = 60 Minutes
Need to divide into 6 chunks irrespective of minutes.
Using these each subtitle time/duration, i have to split the video into same chunks.
I am trying this code, can u please help me out how can i calculate the time and divide into chunks,
I am able to achieve how many minutes of chuck i needed.But stuck in how to read upto that chunck minutes from source file and create a new file .Then how to start the next chunk from the next 10 minutes from the source file.
import org.apache.commons.io.IOUtils;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.stream.Stream;
/**
* The class SyncSRTSubtitles reads a subtitles .SRT file and offsets all the
* timestamps with the same specific value in msec.
*
* The format of the .SRT file is like this:
*
* 123
* 00:11:23,456 --> 00:11:25,234
* subtitle #123 text here
*
*
* #author Sorinel CRISTESCU
*/
public class SyncSRTSubtitles {
/**
* Entry point in the program.
*
* #param args
* #throws IOException
*/
public static void main(String[] args) throws IOException {
/* INPUT: offset value: negative = less (-) ... positive = more (+). */
long delta = (22 * 1000L + 000); /* msec */
/* INPUT: source & destination files */
String srcFileNm = "/Users/meh/Desktop/avatar.srt";
String destFileNm = "/Users/meh/Desktop/avatar1.srt";
/* offset algorithm: START */
File outFile = new File(destFileNm);
outFile.createNewFile();
FileWriter ofstream = new FileWriter(outFile);
BufferedWriter out = new BufferedWriter(ofstream);
/* Open the file that is the first command line parameter */
FileInputStream fstream = new FileInputStream(srcFileNm);
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
// List<String> doc = IOUtils.readLines(in, StandardCharsets.UTF_8);
String strEnd = null;
long diff = 0;
String line;
String startTS1;
try (Stream<String> lines = Files.lines(Paths.get(srcFileNm))) {
line = lines.skip(1).findFirst().get();
String[] atoms = line.split(" --> ");
startTS1 = atoms[0];
}
System.out.println("bolo:" +line);
System.out.println("startTS1:" +startTS1);
String startTS = null;
String endTS = null;
/* Read File Line By Line */
while ((strLine = br.readLine()) != null) {
String[] atoms = strLine.split(" --> ");
if (atoms.length == 1) {
//out.write(strLine + "\n");
}
else {
startTS = atoms[0];
endTS = atoms[1];
// out.write(offsetTime(startTS, delta) + " --> "
// + offsetTime(endTS, delta) + "\n");
strEnd = endTS;
}
}
try {
SimpleDateFormat dateFormat = new SimpleDateFormat("hh:mm:ss");
Date parsedendDate = dateFormat.parse(strEnd);
Date parsedStartDate = dateFormat.parse(startTS1);
diff = parsedendDate.getTime() - parsedStartDate.getTime();
} catch(Exception e) { //this generic but you can control another types of exception
// look the origin of excption
}
System.out.println("strEnd");
System.out.println(strEnd);
/* Close the input streams */
in.close();
out.close();
System.out.println(diff);
long diff1 =diff/6;
System.out.println(diff1);
long diff2= (diff1*6);
System.out.println(diff2);
System.out.println((diff / 3600000) + " hour/s " + (diff % 3600000) / 60000 + " minutes");
System.out.println((diff1 / 3600000) + " hour/s " + (diff1 % 3600000) / 60000 + " minutes");
System.out.println((diff2 / 3600000) + " hour/s " + (diff2 % 3600000) / 60000 + " minutes");
/* offset algorithm: END */
System.out.println("DONE! Check the rsult oin the file: " + destFileNm);
}
/**
* Computes the timestamp offset.
*
* #param ts
* String value of the timestamp in format: "hh:MM:ss,mmm"
* #param delta
* long value of the offset in msec (positive or negative).
* #return String with the new timestamp representation.
*/
private static String offsetTime(String ts, long delta) {
long tsMsec = 0;
String atoms[] = ts.split("\\,");
if (atoms.length == 2) {
tsMsec += Integer.parseInt(atoms[1]);
}
atoms = atoms[0].split(":");
tsMsec += Integer.parseInt(atoms[2]) * 1000L; /* seconds */
tsMsec += Integer.parseInt(atoms[1]) * 60000L; /* minutes */
tsMsec += Integer.parseInt(atoms[0]) * 3600000L; /* hours */
tsMsec += delta; /* here we do the offset. */
long h = tsMsec / 3600000L;
System.out.println(h);
String result = get2digit(h, 2) + ":";
System.out.println(result);
long r = tsMsec % 3600000L;
System.out.println(r);
long m = r / 60000L;
System.out.println(m);
result += get2digit(m, 2) + ":";
System.out.println(result);
r = r % 60000L;
System.out.println(r);
long s = r / 1000L;
result += get2digit(s, 2) + ",";
result += get2digit(r % 1000L, 3);
System.out.println(result);
return result;
}
/**
* Gets the string representation of the number, adding the prefix '0' to
* have the required length.
*
* #param n
* long number to convert to string.
* #param digits
* int number of digits required.
* #return String with the required length string (3 for digits = 3 -->
* "003")
*/
private static String get2digit(long n, int digits) {
String result = "" + n;
while (result.length() < digits) {
result = "0" + result;
}
return result;
}
}
Please suggest me how can i achieve this?

You will need to parse the file twice:
once to read the last end time
second time to process all lines and
generate output files.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import org.apache.commons.io.FileUtils;
public class SplitSRTFiles {
/**
* Splits a SRT file in multiple files each containing an equal time duration.
* #param args
* [0] number of wanted chunks
* [1] source file name
* #throws IOException
*/
public static void main(String[] args) throws IOException {
int nrOfChunks = Integer.parseInt(args[0]);
File srtFile = new File(args[1]);
System.out.println("Splitting "+srtFile.getAbsolutePath()+" into "+nrOfChunks+" files.");
List<String> srcLines = FileUtils.readLines(srtFile);
long fileEndTime = lastEndTime(srcLines);
long msecsPerChunkFile = fileEndTime / nrOfChunks;
int destFileCounter = 1;
String[] fileNameParts = srtFile.getName().split("\\.");
File outFile = new File(fileNameParts[0] + destFileCounter + "." + fileNameParts[1]);
System.out.println("Writing to "+outFile.getAbsolutePath());
outFile.createNewFile();
FileWriter ofstream = new FileWriter(outFile);
BufferedWriter out = new BufferedWriter(ofstream);
for (String line : srcLines) {
String[] atoms = line.split(" --> ");
if (atoms.length > 1) {
long startTS = toMSec(atoms[0]);
// check if start time of this subtitle is after the current
// chunk
if (startTS > msecsPerChunkFile * destFileCounter) {
// close existing file ...
out.close();
ofstream.close();
// ... and start a new file
destFileCounter++;
outFile = new File(srtFile.getParent(), fileNameParts[0] + destFileCounter + "." + fileNameParts[1]);
System.out.println("Writing to "+outFile.getAbsolutePath());
outFile.createNewFile();
ofstream = new FileWriter(outFile);
out = new BufferedWriter(ofstream);
}
}
out.write(line + "/n");
}
out.close();
ofstream.close();
System.out.println("Done.");
}
/**
* Calculates the time in msec of the end time of the last subtitle of the
* file
*
* #param lines
* read from file
* #return end time in milliseconds of the last subtitle
*/
public static long lastEndTime(List lines) throws IOException {
String endTS = null;
for (String line : lines) {
String[] atoms = line.split(" --> ");
if (atoms.length > 1) {
endTS = atoms[1];
}
}
return endTS == null ? 0L : toMSec(endTS);
}
public static long toMSec(String time) {
long tsMsec = 0;
String atoms[] = time.split("\\,");
if (atoms.length == 2) {
tsMsec += Integer.parseInt(atoms[1]);
}
atoms = atoms[0].split(":");
tsMsec += Integer.parseInt(atoms[2]) * 1000L; /* seconds */
tsMsec += Integer.parseInt(atoms[1]) * 60000L; /* minutes */
tsMsec += Integer.parseInt(atoms[0]) * 3600000L; /* hours */
return tsMsec;
}
}

I have figured out a way to divide the file into chunks,
public static void main(String args[]) throws IOException {
String FilePath = "/Users/meh/Desktop/escapeplan.srt";
FileInputStream fin = new FileInputStream(FilePath);
System.out.println("size: " +fin.getChannel().size());
long abc = 0l;
abc = (fin.getChannel().size())/3;
System.out.println("6: " +abc);
System.out.println("abc: " +abc);
//FilePath = args[1];
File filename = new File(FilePath);
long splitFileSize = 0,bytefileSize=0;
if (filename.exists()) {
try {
//bytefileSize = Long.parseLong(args[2]);
splitFileSize = abc;
Splitme spObj = new Splitme();
spObj.split(FilePath, (long) splitFileSize);
spObj = null;
} catch (Exception e) {
e.printStackTrace();
}
} else {
System.out.println("File Not Found....");
}
}
public void split(String FilePath, long splitlen) {
long leninfile = 0, leng = 0;
int count = 1, data;
try {
File filename = new File(FilePath);
InputStream infile = new BufferedInputStream(new FileInputStream(filename));
data = infile.read();
System.out.println("data");
System.out.println(data);
while (data != -1) {
filename = new File("/Users/meh/Documents/srt" + count + ".srt");
//RandomAccessFile outfile = new RandomAccessFile(filename, "rw");
OutputStream outfile = new BufferedOutputStream(new FileOutputStream(filename));
while (data != -1 && leng < splitlen) {
outfile.write(data);
leng++;
data = infile.read();
}
leninfile += leng;
leng = 0;
outfile.close();
changeTimeStamp(filename, count);
count++;
}
} catch (Exception e) {
e.printStackTrace();
}
}

Related

How to avoid skipping blank rows or columns in Apache POI

When I parse the file using Apace POI, the empty rows are getting skipped and a List of String arrays with non-empty rows are being returned.
How can I tell the API, not to skip reading the empty row or columns ?
The code to read is somehow skipping the rows with no data.
XExcelFileReader fileReader = new XExcelFileReader(excelByteArray, sheetFromWhichToRead);
dataFromFile = fileReader.readRows();
is used to read the data from the class XExcelFileReader. The variable dataFromFile is a List of String array.
This is the code to read the data row:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.lang.StringUtils;
import org.apache.poi.hssf.usermodel.HSSFDateUtil;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFCellStyle;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
/**
* This class is responsible for reading the row contents for a given byte array and sheet name
*/
public class XExcelFileReader {
/**
* This is the Row Num of the current row that was read
*/
private int rowNum = 0;
/**
* The OPCPackage is the package used to laod the Input Stream used for only .xlsx Files
*/
private OPCPackage opcPkg;
/**
* These are the String Tables that are read from the Excel File
*/
private ReadOnlySharedStringsTable stringsTable;
/**
* The XML Streaming API will be used
*/
private XMLStreamReader xmlReader;
/** The styles table which has formatting information about cells. */
private StylesTable styles;
/**
* #param excelByteArray the excel byte array
* #param sheetFromWhichToRead the excel sheet from which to read
* #throws Exception the exception
*/
public XExcelFileReader(final byte[] excelByteArray, final String sheetFromWhichToRead) throws Exception {
InputStream excelStream = new ByteArrayInputStream(excelByteArray);
opcPkg = OPCPackage.open(excelStream);
this.stringsTable = new ReadOnlySharedStringsTable(opcPkg);
XSSFReader xssfReader = new XSSFReader(opcPkg);
styles = xssfReader.getStylesTable();
XMLInputFactory factory = XMLInputFactory.newInstance();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
InputStream inputStream = null;
while (iter.hasNext()) {
inputStream = iter.next();
String tempSheetName = iter.getSheetName();
if (StringUtils.isNotEmpty(tempSheetName)) {
tempSheetName = tempSheetName.trim();
if (tempSheetName.equals(sheetFromWhichToRead)) {
break;
}
}
}
xmlReader = factory.createXMLStreamReader(inputStream);
while (xmlReader.hasNext()) {
xmlReader.next();
if (xmlReader.isStartElement()) {
if (xmlReader.getLocalName().equals("sheetData")) {
break;
}
}
}
}
/**
* #return rowNum
*/
public final int rowNum() {
return rowNum;
}
/**
* #return List<String[]> List of String array which can hold the content
* #throws XMLStreamException the XMLStreamException
*/
public final List<String[]> readRows() throws XMLStreamException {
String elementName = "row";
List<String[]> dataRows = new ArrayList<String[]>();
while (xmlReader.hasNext()) {
xmlReader.next();
if (xmlReader.isStartElement()) {
if (xmlReader.getLocalName().equals(elementName)) {
rowNum++;
dataRows.add(getDataRow());
// TODO need to see if the batch Size is required
// if (dataRows.size() == batchSize)
// break;
}
}
}
return dataRows;
}
/**
* #return String [] of Row Data
* #throws XMLStreamException the XMLStreamException
*/
private String[] getDataRow() throws XMLStreamException {
List<String> rowValues = new ArrayList<String>();
while (xmlReader.hasNext()) {
xmlReader.next();
if (xmlReader.isStartElement()) {
if (xmlReader.getLocalName().equals("c")) {
CellReference cellReference = new CellReference(xmlReader.getAttributeValue(null, "r"));
// Fill in the possible blank cells!
while (rowValues.size() < cellReference.getCol()) {
rowValues.add("");
}
String cellType = xmlReader.getAttributeValue(null, "t");
String cellStyleStr = xmlReader.getAttributeValue(null, "s");
rowValues.add(getCellValue(cellType, cellStyleStr));
}
} else if (xmlReader.isEndElement() && xmlReader.getLocalName().equals("row")) {
break;
}
}
return rowValues.toArray(new String[rowValues.size()]);
}
/**
* #param cellType the cell type
* #param cellStyleStr the cell style
* #return cell content the cell value
* #throws XMLStreamException the XMLStreamException
*/
private String getCellValue(final String cellType, final String cellStyleStr) throws XMLStreamException {
String value = ""; // by default
while (xmlReader.hasNext()) {
xmlReader.next();
if (xmlReader.isStartElement()) {
if (xmlReader.getLocalName().equals("v")) {
if (cellType != null && cellType.equals("s")) {
int idx = Integer.parseInt(xmlReader.getElementText());
String s = stringsTable.getEntryAt(idx);
return new XSSFRichTextString(s).toString();
}
if (cellStyleStr != null) {
String cellValue = xmlReader.getElementText();
int styleIndex = Integer.parseInt(cellStyleStr);
XSSFCellStyle style = styles.getStyleAt(styleIndex);
short formatIndex = style.getDataFormat();
if (!isValidDouble(cellValue)) {
return cellValue;
}
double doubleVal = Double.valueOf(cellValue);
boolean isValidExcelDate = HSSFDateUtil.isInternalDateFormat(formatIndex);
Date date = null;
if (isValidExcelDate) {
date = HSSFDateUtil.getJavaDate(doubleVal);
String dateStr = dateAsString(date);
return dateStr;
}
if (!(doubleVal == Math.floor(doubleVal))) {
return Double.toString(doubleVal);
}
return cellValue;
}
else {
return xmlReader.getElementText();
}
}
} else if (xmlReader.isEndElement() && xmlReader.getLocalName().equals("c")) {
break;
}
}
return value;
}
/**
* To check whether the incoming value can be used in the Double utility method Double.valueOf() to prevent
* NumberFormatException.
* #param stringVal - String to be validated.
* #return - true if it is a valid String which can be passed into Double.valueOf method. <br/> For more information
* refer- <a>https://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#valueOf(java.lang.String)</a>
*/
private boolean isValidDouble(String stringVal) {
final String Digits = "(\\p{Digit}+)";
final String HexDigits = "(\\p{XDigit}+)";
// an exponent is 'e' or 'E' followed by an optionally
// signed decimal integer.
final String Exp = "[eE][+-]?" + Digits;
final String fpRegex = ("[\\x00-\\x20]*" + // Optional leading "whitespace"
"[+-]?(" + // Optional sign character
"NaN|" + // "NaN" string
"Infinity|" + // "Infinity" string
// A decimal floating-point string representing a finite positive
// number without a leading sign has at most five basic pieces:
// Digits . Digits ExponentPart FloatTypeSuffix
//
// Since this method allows integer-only strings as input
// in addition to strings of floating-point literals, the
// two sub-patterns below are simplifications of the grammar
// productions from the Java Language Specification, 2nd
// edition, section 3.10.2.
// Digits ._opt Digits_opt ExponentPart_opt FloatTypeSuffix_opt
"(((" + Digits + "(\\.)?(" + Digits + "?)(" + Exp + ")?)|" +
// . Digits ExponentPart_opt FloatTypeSuffix_opt
"(\\.(" + Digits + ")(" + Exp + ")?)|" +
// Hexadecimal strings
"((" +
// 0[xX] HexDigits ._opt BinaryExponent FloatTypeSuffix_opt
"(0[xX]" + HexDigits + "(\\.)?)|" +
// 0[xX] HexDigits_opt . HexDigits BinaryExponent FloatTypeSuffix_opt
"(0[xX]" + HexDigits + "?(\\.)" + HexDigits + ")" +
")[pP][+-]?" + Digits + "))" + "[fFdD]?))" + "[\\x00-\\x20]*");// Optional trailing "whitespace"
if (Pattern.matches(fpRegex, stringVal))
return true;
else {
return false;
}
}
/**
* Date as string.
* #param date the date
* #return the string
*/
public static String dateAsString(final Date date) {
String dateAsString = null;
if (date != null) {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
dateAsString = sdf.format(date);
}
return dateAsString;
}
#Override
protected final void finalize() throws Throwable {
if (opcPkg != null) {
opcPkg.close();
}
super.finalize();
}
}
I want the Empty row also to be part of the List of String array. Just that the String array will be empty.
If it is in the Excel like:
First Row
<< No Data >>
Third Row
then the List should be
dataFromFile
0 ->[First Row]
1 ->[]
2 ->[Third Row]

Here is what the Busy Developers Guide on the POI site says about reading sheets with missing rows:
for (int rowNum = rowStart; rowNum < rowEnd; rowNum++) {
Row r = sheet.getRow(rowNum);
if (r == null) {
// This whole row is empty
// Handle it as needed
continue;
}
int lastColumn = Math.max(r.getLastCellNum(), MY_MINIMUM_COLUMN_COUNT);
for (int cn = 0; cn < lastColumn; cn++) {
Cell c = r.getCell(cn, Row.RETURN_BLANK_AS_NULL);
if (c == null) {
// The spreadsheet is empty in this cell
} else {
// Do something useful with the cell's contents
}
}
}
You really can get a lot of your answers there. You shouldn't ever have to parse the XML directly. Even things that are missing from the high level usermodel usually have a way to get at it through the CT classes generated by XMLBeans.

timestamp as file name in java

I need to create a text file in the name of current timestamp in a particular directory
D:\Assignments\abassign in java.
But when i try to do that the following error comes as the file name should not contain ':'. But the timestamp contains ':'
Exception in thread "main" java.io.IOException: The filename, directory name, or volume label syntax is incorrect
at java.io.WinNTFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:883)
at abassign.Abassign.main(Abassign.java:35)
Java Result: 1
This error shows up

I use something like this:
StringBuffer fn = new StringBuffer();
fn.append(workingDirectory);
fn.append("/");
fn.append("fileNamePrefix-");
fn.append("-");
DateFormat df = new SimpleDateFormat("yyyyMMdd-HHmmss");
fn.append(df.format(new Date()));
fn.append("-fileNamePostFix.txt");
return fn.toString();
(Of course, you can drop the prefix, postfix and workingDirectory parts)

If you are on Windows your file name cannot contain :, you need to replace it by another character...
The following characters are reserved :
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
For example :
String yourTimeStamp = "01-01-2016 17:00:00";
File yourFile = new File("your directory", yourTimeStamp.replace(":", "_"));

here is the Code to generate file using time stamp
/*
* created by Sandip Bhoi 9960426568
*/
package checkapplicationisopen;
import java.io.File;
import java.io.IOException;
import java.sql.Timestamp;
import java.util.Random;
public class GeneratorUtil {
/**
*
* #param args
*/
public static void main(String args[])
{
GeneratorUtil generatorUtil=new GeneratorUtil();
generatorUtil.createNewFile("c:\\Documents", "pdf");
}
/**
*
* #param min
* #param max
* #return
*/
public static int randInt(int min, int max) {
// Usually this can be a field rather than a method variable
Random rand = new Random();
// nextInt is normally exclusive of the top value,
// so add 1 to make it inclusive
int randomNum = rand.nextInt((max - min) + 1) + min;
return randomNum;
}
/**
*
* #param prefix adding the prefix for time stamp generated id
* #return the concatenated string with id and prefix
*/
public static String getId(String prefix)
{
java.util.Date date = new java.util.Date();
String timestamp = new Timestamp(date.getTime()).toString();
String dt1 = timestamp.replace('-', '_');
String dt2 = dt1.replace(' ', '_');
String dt3 = dt2.replace(':', '_');
String dt4 = dt3.replace('.', '_');
int temp = randInt(1, 5000);
return prefix +"_"+ temp + "_" + dt4;
}
/**
*
* #param direcotory
* #param extension
*/
public void createNewFile(String direcotory,String extension)
{
try {
File file = new File(direcotory+"//"+getId("File")+"."+extension);
if (file.createNewFile()){
System.out.println("File is created!");
}else{
System.out.println("File already exists.");
}
} catch (IOException e) {
e.printStackTrace();
}
}
}

readNext() function of CSVReader not looping through all rows of csv [EDIT: How to handle erroneous CSV (remove unescaped quotes)]

FileReader fr = new FileReader(inp);
CSVReader reader = new CSVReader(fr, ',', '"');
// writer
File writtenFromWhile = new File(dliRootPath + writtenFromWhilePath);
writtenFromWhile.createNewFile();
CSVWriter writeFromWhile = new CSVWriter(new FileWriter(writtenFromWhile), ',', '"');
int insideWhile = 0;
String[] currRow = null;
while ((currRow = reader.readNext()) != null) {
insideWhile++;
writeFromWhile.writeNext(currRow);
}
System.out.println("inside While: " + insideWhile);
System.out.println("lines read (acc.to CSV reader): " + reader.getLinesRead());
The output is:
inside While: 162199
lines read (acc.to CSV reader): 256865
Even though all lines are written to the output CSV (when viewed in a text editor, Excel shows much lesser number of rows), the while loop does not iterate the same number of times as the rows in input CSV. My main objective is to implement some other logic inside while loop on each line.
I have been trying to debug since two whole days ( a bigger code) without any results.
Please explain how I can loop through while 256865 times
Reference data, complete picture:
Here is the CSV I am reading in the above snippet.
My complete program tries to separate out those records from this CSV which are not present in this CSV, based on the fields title and author (i.e if author and title is the same in 2 records, even if other fields are different, they are counted as duplicate and should not be written to the output file). Here is my complete code (the difference should be around 300000, but i get only ~210000 in the output file with my code):
//TODO ask id
/*(*
* id also there in fields getting matched (thisRow[0] is id)
* u can replace it by thisRow[fielAnd Column.get(0)] to eliminate id
*/
package mainOne;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;
public class Diff_V3 {
static String dliRootPath = "/home/gurnoor/Incoming/Untitled Folder 2/";
static String dli = "new-dli-IITG.csv";
static String oldDli = "dli-iisc.csv";
static String newFile = "newSampleFile.csv";// not used
static String unqFile = "UniqueFileFinal.csv";
static String log = "Diff_V3_log.txt";
static String splittedNewDliDir = "/home/gurnoor/Incoming/Untitled Folder 2/splitted new file";
static String splittedOldDliDir = "/home/gurnoor/Incoming/Untitled Folder 2/splitted old file";
// debug
static String testFilePath = "testFile.csv";
static int insidepopulateMapFromSplittedCSV = 0;
public static void main(String[] args) throws IOException, CustomException {
// _readSample(dliRootPath+dli, dliRootPath+newFile);
// System.out.println(areIDsunique(dliRootPath + dli, 550841) );// open
// in geany to get total no
// of lines
// TODO implement sparate function to check equals
// File filteredFile = new File(dliRootPath + "filteredFile.csv");
// filteredFile.createNewFile();
File logFile = new File(dliRootPath + log);
logFile.createNewFile();
new File(dliRootPath + testFilePath).createNewFile();
List<String> fieldsToBeMatched = new ArrayList<>();
fieldsToBeMatched.add("dc.contributor.author[]");
fieldsToBeMatched.add("dc.title[]");
filterUniqueFileds(new File(splittedNewDliDir), new File(splittedOldDliDir), fieldsToBeMatched);
}
/**
* NOTE: might remove the row where fieldToBeMatched is null
*
* #param inpfile
* #param file
* #param filteredFile
* #param fieldsToBeMatched
* #throws IOException
* #throws CustomException
*/
private static void filterUniqueFileds(File newDir, File oldDir, List<String> fieldsToBeMatched)
throws IOException, CustomException {
CSVReader reader = new CSVReader(new FileReader(new File(dliRootPath + dli)), '|');
// writer
File unqFileOp = new File(dliRootPath + unqFile);
unqFileOp.createNewFile();
CSVWriter writer = new CSVWriter(new FileWriter(unqFileOp), '|');
// logWriter
BufferedWriter logWriter = new BufferedWriter(new FileWriter(new File(dliRootPath + log)));
String[] headingRow = // allRows.get(0);
reader.readNext();
writer.writeNext(headingRow);
int headingLen = headingRow.length;
// old List
System.out.println("[INFO] reading old list...");
// CSVReader oldReader = new CSVReader(new FileReader(new
// File(dliRootPath + oldDli)));
Map<String, List<String>> oldMap = new HashMap<>();
oldMap = populateMapFromSplittedCSV(oldMap, oldDir);// populateMapFromCSV(oldMap,
// oldReader);
// oldReader.close();
System.out.println("[INFO] Read old List. Size = " + oldMap.size());
printMapToCSV(oldMap, dliRootPath + testFilePath);
// map of fieldName, ColumnNo
Map<String, Integer> fieldAndColumnNoInNew = new HashMap<>(getColumnNo(fieldsToBeMatched, headingRow));
Map<String, Integer> fieldAndColumnNoInOld = new HashMap<>(
getColumnNo(fieldsToBeMatched, (String[]) oldMap.get("id").toArray()));
// error check: did columnNo get populated?
if (fieldAndColumnNoInNew.isEmpty()) {
reader.close();
writer.close();
throw new CustomException("field to be matched not present in input CSV");
}
// TODO implement own array compare using areEqual()
// error check
// if( !Arrays.equals(headingRow, (String[]) oldMap.get("id").toArray())
// ){
// System.out.println("heading in new file, old file: \n"+
// Arrays.toString(headingRow));
// System.out.println(Arrays.toString((String[])
// oldMap.get("id").toArray()));
// reader.close();
// writer.close();
// oldReader.close();
// throw new CustomException("Heading rows are not same in old and new
// file");
// }
int noOfRecordsInOldList = 0, noOfRecordsWritten = 0, checkManually = 0;
String[] thisRow;
while ((thisRow = reader.readNext()) != null) {
// for(int l=allRows.size()-1; l>=0; l--){
// thisRow=allRows.get(l);
// error check
if (thisRow.length != headingLen) {
String error = "Line no: " + reader.getLinesRead() + " in file: " + dliRootPath + dli
+ " not read. Check manually";
System.err.println(error);
logWriter.append(error + "\n");
logWriter.flush();
checkManually++;
continue;
}
// write if not present in oldMap
if (!oldMap.containsKey(thisRow[0])) {
writer.writeNext(thisRow);
writer.flush();
noOfRecordsWritten++;
} else {
// check if all reqd fields match
List<String> twinRow = oldMap.get(thisRow[0]);
boolean writtenToOp = false;
// for (int k = 0; k < fieldsToBeMatched.size(); k++) {
List<String> newFields = new ArrayList<>(fieldAndColumnNoInNew.keySet());
List<String> oldFields = new ArrayList<>(fieldAndColumnNoInOld.keySet());
// faaltu error check
if (newFields.size() != oldFields.size()) {
reader.close();
writer.close();
CustomException up = new CustomException("something is really wrong");
throw up;
}
// for(String fieldName : fieldAndColumnNoInNew.keySet()){
for (int m = 0; m < newFields.size(); m++) {
int columnInNew = fieldAndColumnNoInNew.get(newFields.get(m)).intValue();
int columnInOld = fieldAndColumnNoInOld.get(oldFields.get(m)).intValue();
String currFieldTwin = twinRow.get(columnInOld);
String currField = thisRow[columnInNew];
if (!areEqual(currField, currFieldTwin)) {
writer.writeNext(thisRow);
writer.flush();
writtenToOp = true;
noOfRecordsWritten++;
System.out.println(noOfRecordsWritten);
break;
}
}
if (!writtenToOp) {
noOfRecordsInOldList++;
// System.out.println("[INFO] present in old List: \n" +
// Arrays.toString(thisRow) + " AND\n"
// + twinRow.toString());
}
}
}
System.out.println("--------------------------------------------------------\nDebug info");
System.out.println("old File: " + oldMap.size());
System.out.println("new File:" + reader.getLinesRead());
System.out.println("no of records in old list (present in both old and new) = " + noOfRecordsInOldList);
System.out.println("checkManually: " + checkManually);
System.out.println("noOfRecordsInOldList+checkManually = " + (noOfRecordsInOldList + checkManually));
System.out.println("no of records written = " + noOfRecordsWritten);
System.out.println();
System.out.println("inside populateMapFromSplittedCSV() " + insidepopulateMapFromSplittedCSV + "times");
logWriter.close();
reader.close();
writer.close();
}
private static void printMapToCSV(Map<String, List<String>> oldMap, String testFilePath2) throws IOException {
// writer
int i = 0;
CSVWriter writer = new CSVWriter(new FileWriter(new File(testFilePath2)), '|');
for (String key : oldMap.keySet()) {
List<String> row = oldMap.get(key);
String[] tempRow = new String[row.size()];
tempRow = row.toArray(tempRow);
writer.writeNext(tempRow);
writer.flush();
i++;
}
writer.close();
System.out.println("[hello from line 210 ( inside printMapToCSV() ) of ur code] wrote " + i + " lines");
}
private static Map<String, List<String>> populateMapFromSplittedCSV(Map<String, List<String>> oldMap, File oldDir)
throws IOException {
File defective = new File(dliRootPath + "defectiveOldFiles.csv");
defective.createNewFile();
CSVWriter defectWriter = new CSVWriter(new FileWriter(defective));
CSVReader reader = null;
for (File oldFile : oldDir.listFiles()) {
insidepopulateMapFromSplittedCSV++;
reader = new CSVReader(new FileReader(oldFile), ',', '"');
oldMap = populateMapFromCSV(oldMap, reader, defectWriter);
// printMapToCSV(oldMap, dliRootPath+testFilePath);
System.out.println(oldMap.size());
reader.close();
}
defectWriter.close();
System.out.println("inside populateMapFromSplittedCSV() " + insidepopulateMapFromSplittedCSV + "times");
return new HashMap<String, List<String>>(oldMap);
}
private static Map<String, Integer> getColumnNo(List<String> fieldsToBeMatched, String[] headingRow) {
Map<String, Integer> fieldAndColumnNo = new HashMap<>();
for (String field : fieldsToBeMatched) {
for (int i = 0; i < headingRow.length; i++) {
String heading = headingRow[i];
if (areEqual(field, heading)) {
fieldAndColumnNo.put(field, Integer.valueOf(i));
break;
}
}
}
return fieldAndColumnNo;
}
private static Map<String, List<String>> populateMapFromCSV(Map<String, List<String>> oldMap, CSVReader oldReader,
CSVWriter defectWriter) throws IOException {
int headingLen = 0;
List<String> headingRow = null;
if (oldReader.getLinesRead() > 1) {
headingRow = oldMap.get("id");
headingLen = headingRow.size();
}
String[] thisRow;
int insideWhile = 0, addedInMap = 0, doesNotContainKey = 0, containsKey = 0;
while ((thisRow = oldReader.readNext()) != null) {
// error check
// if (oldReader.getLinesRead() > 1) {
// if (thisRow.length != headingLen) {
// System.err.println("Line no: " + oldReader.getLinesRead() + " in
// file: " + dliRootPath + oldDli
// + " not read. Check manually");
// defectWriter.writeNext(thisRow);
// defectWriter.flush();
// continue;
// }
// }
insideWhile++;
if (!oldMap.containsKey(thisRow[0])) {
doesNotContainKey++;
List<String> fullRow = Arrays.asList(thisRow);
fullRow = oldMap.put(thisRow[0], fullRow);
if (fullRow == null) {
addedInMap++;
}
} else {
List<String> twinRow = oldMap.get(thisRow[0]);
boolean writtenToOp = false;
// for(String fieldName : fieldAndColumnNoInNew.keySet()){
for (int m = 0; m < headingRow.size(); m++) {
String currFieldTwin = twinRow.get(m);
String currField = thisRow[m];
if (!areEqual(currField, currFieldTwin)) {
System.err.println("do something!!!!!! DUPLICATE ID in old file");
containsKey++;
FileWriter logWriter = new FileWriter(new File((dliRootPath + log)));
System.err.println("[Skipped record] in old file. Row no: " + oldReader.getLinesRead()
+ "\nRecord: " + Arrays.toString(thisRow));
logWriter.append("[Skipped record] in old file. Row no: " + oldReader.getLinesRead()
+ "\nRecord: " + Arrays.toString(thisRow));
logWriter.close();
break;
}
}
}
}
System.out.println("inside while: " + insideWhile);
System.out.println("oldMap size = " + oldMap.size());
System.out.println("addedInMap: " + addedInMap);
System.out.println("doesNotContainKey: " + doesNotContainKey);
System.out.println("containsKey: " + containsKey);
return new HashMap<String, List<String>>(oldMap);
}
private static boolean areEqual(String field, String heading) {
// TODO implement, askSubhayan
return field.trim().equals(heading.trim());
}
/**
* Returns the first duplicate ID OR the string "unique" OR (rarely)
* totalLinesInCSV != totaluniqueIDs
*
* #param inpCSV
* #param totalLinesInCSV
* #return
* #throws IOException
*/
private static String areIDsunique(String inpCSV, int totalLinesInCSV) throws IOException {
CSVReader reader = new CSVReader(new FileReader(new File(dliRootPath + dli)), '|');
List<String[]> allRows = new ArrayList<>(reader.readAll());
reader.close();
Set<String> id = new HashSet<>();
for (String[] thisRow : allRows) {
if (thisRow[0] != null || !thisRow[0].isEmpty() || id.add(thisRow[0])) {
return thisRow[0];
}
}
if (id.size() == totalLinesInCSV) {
return "unique";
} else {
return "totalLinesInCSV != totaluniqueIDs";
}
}
/**
* writes 20 rowsof input csv into the output file
*
* #param input
* #param output
* #throws IOException
*/
public static void _readSample(String input, String output) throws IOException {
File opFile = new File(dliRootPath + newFile);
opFile.createNewFile();
CSVWriter writer = new CSVWriter(new FileWriter(opFile));
CSVReader reader = new CSVReader(new FileReader(new File(dliRootPath + dli)), '|');
for (int i = 0; i < 20; i++) {
// String[] op;
// for(String temp: reader.readNext()){
writer.writeNext(reader.readNext());
// }
// System.out.println();
}
reader.close();
writer.flush();
writer.close();
}
}

RC's comment nailed it!
If you check the java docs you will see that there are two methods in the CSVReader: getLinesRead and getRecordsRead. And they both do exactly what they say. getLinesRead returns the number of lines that was read using the FileReader. getRecordsRead returns the number of records that the CSVReader read. Keep in mind that if you have embedded new lines in the records of your file then it will take multiple line reads to get one record. So it is very conceivable to have a csv file with 100 records but taking 200 line reads to read them all.

Unescaped quotes inside a CSV cell can mess up your whole data. This might happen in a CSV if the data you are working with has been created manually. Below is a function I wrote a while back for this situation. Let me know if this is not the right place to share it.
/**
* removes quotes inside a cell/column puts curated data in
* "../CuratedFiles"
*
* #param curateDir
* #param del Csv column delimiter
* #throws IOException
*/
public static void curateCsvRowQuotes(File curateDir, String del) throws IOException {
File parent = curateDir.getParentFile();
File curatedDir = new File(parent.getAbsolutePath() + "/CuratedFiles");
curatedDir.mkdir();
for (File file : curateDir.listFiles()) {
BufferedReader bufRead = new BufferedReader(new FileReader(file));
// output
File fOp = new File(curatedDir.getAbsolutePath() + "/" + file.getName());
fOp.createNewFile();
BufferedWriter bufW = new BufferedWriter(new FileWriter(fOp));
bufW.append(bufRead.readLine() + "\n");// heading
// logs
File logFile = new File(curatedDir.getAbsolutePath() + "/CurationLogs.txt");
logFile.createNewFile();
BufferedWriter logWriter = new BufferedWriter(new FileWriter(logFile));
String thisLine = null;
int lineCount = 0;
while ((thisLine = bufRead.readLine()) != null) {
String opLine = "";
int endIndex = thisLine.indexOf("\"" + del);
String str = thisLine.substring(0, endIndex);
opLine += str + "\"" + del;
while (endIndex != (-1)) {
// leave out first " in a cell
int tempIndex = thisLine.indexOf("\"" + del, endIndex + 2);
if (tempIndex == (-1)) {
break;
}
str = thisLine.substring(endIndex + 2, tempIndex);
int indexOfQuote = str.indexOf("\"");
opLine += str.substring(0, indexOfQuote + 1);
// remove all "
str = str.substring(indexOfQuote + 1);
str = str.replace("\"", "");
opLine += str + "\"" + del;
endIndex = thisLine.indexOf("\"" + del, endIndex + 2);
}
str = thisLine.substring(thisLine.lastIndexOf("\"" + del) + 2);
if ((str != null) && str.matches("[" + del + "]+")) {
opLine += str;
}
System.out.println(opLine);
bufW.append(opLine + "\n");
bufW.flush();
lineCount++;
}
System.out.println(lineCount + " no of lines in " + file.getName());
bufRead.close();
bufW.close();
}
}

In my case, I've used csvReader.readAll() before the readNext().
Like
List<String[]> myData =csvReader.readAll();
while ((nextRecord = csvReader.readNext()) != null) {
}
So my csvReader.readNext() returns always null. Since all the values were already read by myData.
Please be caution for using readNext() and readAll() functions.

cannot find symbol: variable TestUtils

I have written a Java program that analyze .soft file of data on gene expressions and write it to txt
package il.ac.tau.cs.sw1.bioinformatics;
import org.apache.commons.math3.stat.inference.TestUtils;
import java.io.*;
import java.util.Arrays;
/**
*
* Gene Expression Analyzer
*
* Command line arguments:
* args[0] - GeoDatasetName: Gene expression dataset name (expects a corresponding
input file in SOFT format to exist in the local directory).
* args[1] - Label1: Label of the first sample subset
* args[2] - Label2: Label of the second sample subset
* args[3] - Alpha: T-test confidence level : only genes with pValue below this
threshold will be printed to output file
*
* Execution example: GeneExpressionAnalyzer GDS4085 "estrogen receptor-negative" "estrogen receptor-positive" 0.01
*
* #author software1-2014
*
*/
public class GeneExpressionAnalyzer {
public static void main(String args[]) throws IOException {
// Reads the dataset from a SOFT input file
String inputSoftFileName = args[0] + ".soft";
GeneExpressionDataset geneExpressionDataset = parseGeneExpressionFile (inputSoftFileName);
System.out.printf ("Gene expression dataset loaded from file %s. %n",inputSoftFileName);
System.out.printf("Dataset contains %d samples and %d gene probes.%n%n",geneExpressionDataset.samplesNumber, geneExpressionDataset.genesNumber);
// Writes the dataset to a tabular format
String tabularFileName = args[0] + "-Tabular.txt";
writeDatasetToTabularFile(geneExpressionDataset,tabularFileName);
System.out.printf ("Dataset saved to tabular file - %s.%n%n",tabularFileName);
// Identifies differentially expressed genes between two sample groups and writes the results to a text file
String label1 = args[1];
String label2 = args[2];
double alpha = Double.parseDouble(args[3]);
String diffGenesFileName = args[0] + "-DiffGenes.txt";
int numOfDiffGenes = writeTopDifferentiallyExpressedGenesToFile(diffGenesFileName,geneExpressionDataset, alpha, label1, label2);
System.out.printf ("%d differentially expressed genes identified using alpha of %f when comparing the two sample groups [%s] and [%s].%n",numOfDiffGenes, alpha, label1, label2);
System.out.printf ("Results saved to file %s.%n",diffGenesFileName);
}
private static float[] StringtoFloat(String[] temp) {
float[] array = new float[temp.length];
for (int i = 0; i < temp.length; i++){
array[i]= Float.parseFloat(temp[i]);
}
return array;
}
private static double[] CutToCounter(double[] array, int counter) {
if (array.length == counter){
return array;
}
double[] args = new double[counter+1];
for (int i = 0; i < args.length; i++){
args[i] = array[i];
}
return args;
}
private static int min(double[] pValues) {
double val = 2;
int index = -1;
for (int i = 0; i < pValues.length; i++){
if (pValues[i] < val && pValues[i] != 3.0){
val = pValues[i];
index = i;
}
}
return index;
}
private static String changeformat(float[] array) {
String[] args = new String[array.length];
for (int i = 0; i < array.length; i++){
args[i] = String.format("%.2f", array[i]);
}
return Arrays.toString(args);
}
/**
*
* parseGeneExpressionFile - parses the given SOFT file
*
*
* #param filename A gene expression file in SOFT format
* #return a GeneExpressionDataset object storing all data parsed from the input file
* #throws IOException
*/
public static GeneExpressionDataset parseGeneExpressionFile (String filename) throws IOException {
GeneExpressionDataset dataset = new GeneExpressionDataset();
BufferedReader buf = new BufferedReader(new FileReader(filename));
String line = buf.readLine();
String[] geneids = null;
String[] genesymbols = null;
float[][] datamatrix = null;
String[][] subsetinfo = new String[10][2];
String[][] subsetsample = new String[10][];
int i = 0;
int j = 0;
boolean bol = false;
while (line != null){
if (line.startsWith("!dataset_sample_count")){
dataset.samplesNumber = Integer.parseInt(line.substring(24));
}
else if (line.startsWith("!dataset_sample_count")){
dataset.genesNumber = Integer.parseInt(line.substring(25));
geneids = new String[dataset.genesNumber];
genesymbols = new String[dataset.genesNumber];
}
else if (line.startsWith("^SUBSET")){
subsetinfo[i][0] = line.substring(10);
i++;
}
else if (line.startsWith("!subset_sample_description")){
subsetinfo[i][1] = line.substring(22);
}
else if (line.startsWith("!subset_sample_id")){
subsetsample[i-1] = line.substring(20).split(",");
}
else if (line.startsWith("!dataset_table_begin")){
datamatrix = new float[dataset.genesNumber][dataset.samplesNumber];
}
else if (line.startsWith("ID_REF")){
String[] array1 = line.split("\t");
dataset.sampleIds = (String[]) Arrays.copyOfRange(array1, 2, array1.length);
bol = true;
}
else if (bol && !line.startsWith("!dataset_table_end")){
String[] array2 = line.split("\t");
geneids[j] = array2[0];
genesymbols[j] = array2[1];
String[] temp = (String[]) Arrays.copyOfRange(array2, 2, array2.length);
datamatrix[j] = StringtoFloat(temp);
j++;
}
}
buf.close();
dataset.geneIds = geneids;
dataset.geneSymbols = genesymbols;
dataset.dataMatrix = datamatrix;
String[] lables = new String[dataset.samplesNumber];
int k = 0;
for (String sample : dataset.sampleIds) {
for (int m = 0; m < subsetsample.length; m++) {
if (Arrays.binarySearch(subsetsample[m], sample) != -1) {
lables[k] = subsetsample[m][1];
k += 1;
} else {
continue;
}
}
}
dataset.labels = lables;
return dataset;
}
/**
* writeDatasetToTabularFile
* writes the dataset to a tabular text file
*
* #param geneExpressionDataset
* #param outputFilename
* #throws IOException
*/
public static void writeDatasetToTabularFile(GeneExpressionDataset geneExpressionDataset, String outputFilename) throws IOException {
File NewFile = new File(outputFilename);
BufferedWriter buf = new BufferedWriter(new FileWriter(NewFile));
String Lables = "\t" + "\t" + "\t" + "\t" + Arrays.toString(geneExpressionDataset.labels);
String Samples = "\t" + "\t" + "\t" + "\t" + Arrays.toString(geneExpressionDataset.sampleIds);
buf.write(Lables + "\r\n" + Samples + "\r\n");
for (int i = 0; i < geneExpressionDataset.genesNumber; i++){
buf.write(geneExpressionDataset.geneIds[i] + "\t"+ geneExpressionDataset.geneSymbols[i] + "\t" +
changeformat(geneExpressionDataset.dataMatrix[i]) + "\r\n");
}
buf.close();
}
/**
*
* writeTopDifferentiallyExpressedGenesToFile
*
* #param outputFilename
* #param geneExpressionDataset
* #param alpha
* #param label1
* #param label2
* #return numOfDiffGenes The number of differentially expressed genes detected, having p-value lower than alpha
* #throws IOException
*/
public static int writeTopDifferentiallyExpressedGenesToFile(String outputFilename,
GeneExpressionDataset geneExpressionDataset, double alpha,
String label1, String label2) throws IOException {
double pValues[] = new double[geneExpressionDataset.genesNumber];
int counter = 0;
for (int i = 0; i < pValues.length; i++){
double pval = calcTtest(geneExpressionDataset, i, label1, label2);
if (pval < alpha){
pValues[i] = pval;
counter++;
}
else{
continue;
}
}
File tofile = new File(outputFilename);
BufferedWriter buf = new BufferedWriter(new FileWriter(tofile));
int j = 0;
while (min(pValues) != -1){
String PVal = String.format("%.6f", pValues[min(pValues)]);
String gene_id = geneExpressionDataset.geneIds[min(pValues)];
String gene_symbol = geneExpressionDataset.geneSymbols[min(pValues)];
String line = String.valueOf(j) + "\t" + PVal + "\t" + gene_id + "\t" + gene_symbol;
buf.write(line + "\r\n");
pValues[min(pValues)] = 3.0;
j++;
}
buf.close();
return counter;
}
/**
*
* getDataEntriesForLabel
*
* Returns the entries in the 'data' array for which the corresponding entries in the 'labels' array equals 'label'
*
* #param data
* #param labels
* #param label
* #return
*/
public static double[] getDataEntriesForLabel(float[] data, String[] labels, String label) {
double[] array = new double[data.length];
int counter = 0;
for (int i = 0; i < data.length; i++){
if (labels[i].equals(label)){
array[counter] = data[i];
counter++;
}
else{
continue;
}
}return CutToCounter(array, counter);
}
/**
* calcTtest - returns a pValue for the t-Test
*
* Returns the p-value, associated with a two-sample, two-tailed t-test comparing the means of the input arrays
*
* //http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/inference/TTest.html#tTest(double[], double[])
*
* #param geneExpressionDataset
* #param geneIndex
* #param label1
* #param label2
* #return
*/
private static double calcTtest(GeneExpressionDataset geneExpressionDataset, int geneIndex, String label1, String label2) {
double[] sample1 = getDataEntriesForLabel(geneExpressionDataset.dataMatrix[geneIndex], geneExpressionDataset.labels, label1);
double[] sample2 = getDataEntriesForLabel(geneExpressionDataset.dataMatrix[geneIndex], geneExpressionDataset.labels, label2);
return TestUtils.tTest(sample1, sample2);
}
/**
*
* GeneExpressionDataset
* A class representing a gene expression dataset
*
* #author software1-2014
*
*/
public static class GeneExpressionDataset {
public int samplesNumber; //number of dataset samples
public int genesNumber; // number of dataset gene probes
public String[] sampleIds; //sample ids
public String[] geneIds; //gene probe ids
public String[] geneSymbols; //gene symbols
public float[][] dataMatrix; //expression data matrix
public String[] labels; //sample labels
}
}
now, it won't compile and the error message is this:
"GeneExpressionAnalyzer.java:2: error: package org.apache.commons.math3.stat.inference does not exist
import org.apach.commons.math3.stat.interference.TestUtils;
GeneExpressionAnalyzer.java:277: error: cannot find symbol
return TestUtils.tTest;
symbol: variable TestUtils
location: class GeneExpressionAnalyzer
2 errors"
I don't get what went wrong, obviously i've added the .jar file that contains the path to TestUtils.
(here it is: http://apache.spd.co.il//commons/math/binaries/commons-math3-3.2-bin.zip)
any insights?

If you are working with eclipse,
Manually download the jar file from here
After that in eclipse open the package explorer -> right click on your project
Build Path -> Configure Build Path, A window will open.
Under the Libraries tab -> click Add External JARs. select the jar file you have downloaded Click Ok.
That's all. Now the problem might be gone

Does it work from the command line?
I've shortened your class to
import org.apache.commons.math3.stat.inference.TestUtils;
import java.io.*;
import java.util.Arrays;
public class Test {
public static void main(String args[]) throws IOException {
System.out.printf ("test...");
}
}
I copied the Test.java file and the commons-math3-3.2.jar to the same directory and here is my output from the command line:
C:\temp\test>dir
Répertoire de C:\temp\test
24/04/2014 14:41 <REP> .
24/04/2014 14:41 <REP> ..
24/04/2014 14:38 1 692 782 commons-math3-3.2.jar
24/04/2014 14:41 230 Test.java
2 fichier(s) 1 693 012 octets
2 Rép(s) 23 170 342 912 octets libres
C:\temp\test>javac Test.java
Test.java:1: package org.apache.commons.math3.stat.inference does not exist
import org.apache.commons.math3.stat.inference.TestUtils;
^
1 error
C:\temp\test>javac -cp commons-math3-3.2.jar Test.java
C:\temp\test>dir
Répertoire de C:\temp\test
24/04/2014 14:41 <REP> .
24/04/2014 14:41 <REP> ..
24/04/2014 14:38 1 692 782 commons-math3-3.2.jar
24/04/2014 14:41 500 Test.class
24/04/2014 14:41 230 Test.java

Reading an Excel sheet using POI's XSSF and SAX (Event API)

I am reading an Excel sheet using POI's XSSF and SAX (Event API). The Excel sheet has thousands of rows of user information like user name, email, address, age, department etc.
I need to read each row from Excel, convert it into a User object and add this User object to a List of User objects.
I can read the Excel sheet successfully, but I am not sure at what point while reading I should create an instance of the User object and populate it with the data from the Excel sheet.
Below is my entire working code.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.BuiltinFormats;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFCellStyle;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
public class ExcelSheetParser {
enum xssfDataType {
BOOL, ERROR, FORMULA, INLINESTR, SSTINDEX, NUMBER,
}
int countrows = 0;
class XSSFSheetHandler extends DefaultHandler {
/**
* Table with styles
*/
private StylesTable stylesTable;
/**
* Table with unique strings
*/
private ReadOnlySharedStringsTable sharedStringsTable;
/**
* Destination for data
*/
private final PrintStream output;
private List<?> list = new ArrayList();
private Class clazz;
/**
* Number of columns to read starting with leftmost
*/
private final int minColumnCount;
// Set when V start element is seen
private boolean vIsOpen;
// Set when cell start element is seen;
// used when cell close element is seen.
private xssfDataType nextDataType;
// Used to format numeric cell values.
private short formatIndex;
private String formatString;
private final DataFormatter formatter;
private int thisColumn = -1;
// The last column printed to the output stream
private int lastColumnNumber = -1;
// Gathers characters as they are seen.
private StringBuffer value;
/**
* Accepts objects needed while parsing.
*
* #param styles
* Table of styles
* #param strings
* Table of shared strings
* #param cols
* Minimum number of columns to show
* #param target
* Sink for output
*/
public XSSFSheetHandler(StylesTable styles,
ReadOnlySharedStringsTable strings, int cols, PrintStream target, Class clazz) {
this.stylesTable = styles;
this.sharedStringsTable = strings;
this.minColumnCount = cols;
this.output = target;
this.value = new StringBuffer();
this.nextDataType = xssfDataType.NUMBER;
this.formatter = new DataFormatter();
this.clazz = clazz;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
if ("inlineStr".equals(name) || "v".equals(name)) {
vIsOpen = true;
// Clear contents cache
value.setLength(0);
}
// c => cell
else if ("c".equals(name)) {
// Get the cell reference
String r = attributes.getValue("r");
int firstDigit = -1;
for (int c = 0; c < r.length(); ++c) {
if (Character.isDigit(r.charAt(c))) {
firstDigit = c;
break;
}
}
thisColumn = nameToColumn(r.substring(0, firstDigit));
// Set up defaults.
this.nextDataType = xssfDataType.NUMBER;
this.formatIndex = -1;
this.formatString = null;
String cellType = attributes.getValue("t");
String cellStyleStr = attributes.getValue("s");
if ("b".equals(cellType))
nextDataType = xssfDataType.BOOL;
else if ("e".equals(cellType))
nextDataType = xssfDataType.ERROR;
else if ("inlineStr".equals(cellType))
nextDataType = xssfDataType.INLINESTR;
else if ("s".equals(cellType))
nextDataType = xssfDataType.SSTINDEX;
else if ("str".equals(cellType))
nextDataType = xssfDataType.FORMULA;
else if (cellStyleStr != null) {
// It's a number, but almost certainly one
// with a special style or format
int styleIndex = Integer.parseInt(cellStyleStr);
XSSFCellStyle style = stylesTable.getStyleAt(styleIndex);
this.formatIndex = style.getDataFormat();
this.formatString = style.getDataFormatString();
if (this.formatString == null)
this.formatString = BuiltinFormats
.getBuiltinFormat(this.formatIndex);
}
}
}
public void endElement(String uri, String localName, String name)
throws SAXException {
String thisStr = null;
// v => contents of a cell
if ("v".equals(name)) {
// Process the value contents as required.
// Do now, as characters() may be called more than once
switch (nextDataType) {
case BOOL:
char first = value.charAt(0);
thisStr = first == '0' ? "FALSE" : "TRUE";
break;
case ERROR:
thisStr = "\"ERROR:" + value.toString() + '"';
break;
case FORMULA:
// A formula could result in a string value,
// so always add double-quote characters.
thisStr = '"' + value.toString() + '"';
break;
case INLINESTR:
// TODO: have seen an example of this, so it's untested.
XSSFRichTextString rtsi = new XSSFRichTextString(value
.toString());
thisStr = '"' + rtsi.toString() + '"';
break;
case SSTINDEX:
String sstIndex = value.toString();
try {
int idx = Integer.parseInt(sstIndex);
XSSFRichTextString rtss = new XSSFRichTextString(
sharedStringsTable.getEntryAt(idx));
thisStr = '"' + rtss.toString() + '"';
} catch (NumberFormatException ex) {
output.println("Failed to parse SST index '" + sstIndex
+ "': " + ex.toString());
}
break;
case NUMBER:
String n = value.toString();
if (this.formatString != null)
thisStr = formatter.formatRawCellContents(Double
.parseDouble(n), this.formatIndex,
this.formatString);
else
thisStr = n;
break;
default:
thisStr = "(TODO: Unexpected type: " + nextDataType + ")";
break;
}
// Output after we've seen the string contents
// Emit commas for any fields that were missing on this row
if (lastColumnNumber == -1) {
lastColumnNumber = 0;
}
for (int i = lastColumnNumber; i < thisColumn; ++i)
output.print(',');
// Might be the empty string.
output.print(thisColumn +" : "+thisStr);
// Update column
if (thisColumn > -1)
lastColumnNumber = thisColumn;
} else if ("row".equals(name)) {
// Print out any missing commas if needed
if (minColumns > 0) {
// Columns are 0 based
if (lastColumnNumber == -1) {
lastColumnNumber = 0;
}
for (int i = lastColumnNumber; i < (this.minColumnCount); i++) {
output.print(',');
}
}
// We're onto a new row
output.println();
output.println(countrows++);
lastColumnNumber = -1;
}
}
/**
* Captures characters only if a suitable element is open. Originally
* was just "v"; extended for inlineStr also.
*/
public void characters(char[] ch, int start, int length)
throws SAXException {
if (vIsOpen)
value.append(ch, start, length);
}
/**
* Converts an Excel column name like "C" to a zero-based index.
*
* #param name
* #return Index corresponding to the specified name
*/
private int nameToColumn(String name) {
int column = -1;
for (int i = 0; i < name.length(); ++i) {
int c = name.charAt(i);
column = (column + 1) * 26 + c - 'A';
}
return column;
}
}
// /////////////////////////////////////
private OPCPackage xlsxPackage;
private int minColumns;
private PrintStream output;
private Class clazz;
/**
* Creates a new XLSX -> CSV converter
*
* #param pkg
* The XLSX package to process
* #param output
* The PrintStream to output the CSV to
* #param minColumns
* The minimum number of columns to output, or -1 for no minimum
*/
public ExcelSheetParser(OPCPackage pkg, PrintStream output, int minColumns, Class clazz) {
this.xlsxPackage = pkg;
this.output = output;
this.minColumns = minColumns;
this.clazz = clazz;
}
/**
* Parses and shows the content of one sheet using the specified styles and
* shared-strings tables.
*
* #param styles
* #param strings
* #param sheetInputStream
*/
public void processSheet(StylesTable styles,
ReadOnlySharedStringsTable strings, InputStream sheetInputStream)
throws IOException, ParserConfigurationException, SAXException {
InputSource sheetSource = new InputSource(sheetInputStream);
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
SAXParser saxParser = saxFactory.newSAXParser();
XMLReader sheetParser = saxParser.getXMLReader();
ContentHandler handler = new XSSFSheetHandler(styles, strings,
this.minColumns, this.output, this.clazz);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
}
/**
* Initiates the processing of the XLS workbook file to CSV.
*
* #throws IOException
* #throws OpenXML4JException
* #throws ParserConfigurationException
* #throws SAXException
*/
public void process() throws IOException, OpenXML4JException,
ParserConfigurationException, SAXException {
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(
this.xlsxPackage);
XSSFReader xssfReader = new XSSFReader(this.xlsxPackage);
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader
.getSheetsData();
int index = 0;
while (iter.hasNext()) {
InputStream stream = iter.next();
String sheetName = iter.getSheetName();
this.output.println(sheetName + " [index=" + index + "]:");
processSheet(styles, strings, stream);
stream.close();
++index;
}
}
}

What I'd probably do is start building the User object when the row starts. As you hit the cells in the row, you populate your User object. When the row ends, validate the User object, and if it's fine add it then. Because you're doing SAX parsing, you'll get the start and events for all of these, so you can attach your logic there.
I'd suggest you take a look at XLSX2CSV in the Apache POI Examples. It shows how to go about handling the different kinds of cell contents (which you'll need for populating your user object), how to do something when you reach the end of the row, as well as handling missing cells etc.

I think you can create a user object at following location in your code:
// We're onto a new row
output.println();
// Convert output to a new user object
// ....
// ....

First of all where you are saving value in thisStr variable, if this is a valid value then put this value in Map.
You should create USer object in endElement() method in
else if ("row".equals(name)) {
// use map create USER object here
}
and You can add Users object in global list and if you want to persist it then you can persist it sheet by sheet OR all data at a time.
while (iter.hasNext()) {
InputStream stream = iter.next();
String sheetName = iter.getSheetName();
this.output.println(sheetName + " [index=" + index + "]:");
processSheet(styles, strings, stream);
stream.close();
++index;
//for persisting USERS data sheet by sheet write your code here.........
}
// for persisting complete data of all sheets write your code here...
This is working for me.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Split .srt file into equal chunks - java

Related

How to avoid skipping blank rows or columns in Apache POI

timestamp as file name in java

readNext() function of CSVReader not looping through all rows of csv [EDIT: How to handle erroneous CSV (remove unescaped quotes)]

cannot find symbol: variable TestUtils

Reading an Excel sheet using POI's XSSF and SAX (Event API)

Categories

Resources