I have a peculiar problem where I have to work with data given in a spreadsheet (XLS, CSV). I will be using that data in my Java program.
The spreadsheet data is generated elsewhere and I have no control over it. A few of its columns have a system-peculiar formatting, and I need a way to choose programmatically how to convert each of these to the format I need.
The simple approaches in my project would have been to
a) read the spreadsheet and apply the transformations in place while reading,
b) read each row as a Java object, then iterate over this list and do the modifications, or
c) use some in-memory DB like H2 and apply some **user-defined functions** (I don't know how; see the sketch after this list) either while reading into memory or by transforming it later.
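For (c), H2 registers a plain static Java method as a SQL function via CREATE ALIAS. A minimal sketch of what that could look like, assuming the rows were already loaded into an in-memory table (the table, column, and FIX_DATE names are invented for illustration):

```java
import java.sql.*;

public class H2UdfSketch {
    // The static method H2 will call for each value; the fix-up logic is illustrative.
    public static String fixDate(String raw) {
        return raw == null ? null : raw.trim().replace('/', '-');
    }

    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:h2:mem:sheet");
             Statement st = c.createStatement()) {
            st.execute("CREATE TABLE rows(pn VARCHAR, request_date VARCHAR)");
            st.execute("INSERT INTO rows VALUES('A-1', '01/02/15')");
            // Register the Java method as a user-defined SQL function.
            st.execute("CREATE ALIAS FIX_DATE FOR \"H2UdfSketch.fixDate\"");
            st.execute("UPDATE rows SET request_date = FIX_DATE(request_date)");
            try (ResultSet rs = st.executeQuery("SELECT request_date FROM rows")) {
                while (rs.next()) System.out.println(rs.getString(1)); // 01-02-15
            }
        }
    }
}
```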
At this point in time, I really do not have all three options figured out in detail, so please excuse the vagueness.
Is there any other way of doing this? And more importantly, since I can have thousands of records where more than 5 columns may need to be transformed, what is the quickest approach?
First you can check whether the file is an Excel file or a CSV file.
If it is Excel, you can use Apache POI; it is really useful for parsing Excel files. In this case you can apply the transformations while reading.
A CSV file is comma-separated, so you can use the split function to parse it. Here you would collect the parsed fields into an array first and then apply the same transformations.
Performance depends on how you optimize the code. You can use Java 8 streams to process the lines and keep the code streamlined.
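A minimal sketch of the CSV path, assuming a simple comma-separated file with no quoted fields (the file name and the column-3 fix-up are invented examples):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class CsvTransform {
    public static void main(String[] args) throws IOException {
        // Stream the file line by line; the transformation is applied while reading.
        // NOTE: split(",") only works for CSV without quoted/embedded commas.
        List<String[]> rows = Files.lines(Paths.get("data.csv"))
                .map(line -> line.split(","))
                .map(CsvTransform::transform)
                .collect(Collectors.toList());
        System.out.println("Read " + rows.size() + " rows");
    }

    // Hypothetical transformation: normalize a system-peculiar date in column 3.
    private static String[] transform(String[] row) {
        if (row.length > 3) {
            row[3] = row[3].trim().replace('/', '-');
        }
        return row;
    }
}
```

For thousands of rows this is effectively I/O-bound either way, so applying the transformations during the single read pass is usually the quickest option.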
I need to find a specific string (an id or a name, for example) in one sheet of an Excel file.
This is the basic need.
Later on we need to find a user on several Excel sheets, copy the whole record identified by that code, and send it to a JTable in the frame.
Are you looking for a high-level search function or something? I don't think that exists.
As you load the sheets, you might consider just adding the interesting columns to a HashMap if you can use exact matches; otherwise, just iterate over the sheets/columns/rows and search manually.
You could create some mid-level tooling to do this: a "Sheet Indexer", perhaps, that takes a sheet and a list of columns and then lets you do lookups. Even if you have to write code to iterate over everything manually, you shouldn't worry too much about speed; the number of sheets/rows is very unlikely to get large enough to affect performance.
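A minimal sketch of such an indexer with Apache POI (the class name and the exact-match, toString-based keying are my choices, not an existing API):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;

// "Sheet Indexer": maps the text of one column to its row for exact-match lookups.
public class SheetIndexer {
    private final Map<String, Row> index = new HashMap<>();

    public SheetIndexer(Sheet sheet, int keyColumn) {
        for (Row row : sheet) {
            if (row.getCell(keyColumn) != null) {
                // toString gives the displayed value; numeric ids come out like "42.0".
                index.put(row.getCell(keyColumn).toString(), row);
            }
        }
    }

    public Row lookup(String key) {
        return index.get(key); // null if the id/name is not on this sheet
    }
}
```

Keeping one indexer per sheet makes the later multi-sheet user search a simple loop of O(1) lookups.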
We actually have a lot of tooling built around POI, including an ORM layer that lets us load from spreadsheets using annotations, just like Hibernate. We called it "son of POI", aka "poison".
I'm writing a tool to analyze stock market data. For this I download data and then save all the data corresponding to a stock as a 20x100000 double[][] array in a data.bin file on my hard drive. I know I should put it in some database, but performance-wise this is simply the best method.
Now here is my problem: I need to do updates and searches on the data:
Updates: I have to append new data to the end of the array as time progresses.
Search: I want to iterate over different data files to find a minimum or calculate moving averages etc.
I could do both of them by reading the whole file in, updating it, and writing it back, or by doing the search over a specific area... but this is somewhat overkill, since I don't need the whole data.
So my question is: is there a library (in Java) or something similar to open/read/change parts of a binary file without having to open the whole file, or to search through the file starting at a specific point?
RandomAccessFile allows seeking to a particular position in a file and updating parts of the file or adding new data to the end without rewriting everything. See the tutorial here: http://docs.oracle.com/javase/tutorial/essential/io/rafs.html
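A minimal sketch along those lines, assuming data.bin is just a flat run of doubles in Java's default big-endian encoding (the layout and method names are assumptions about your format):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class BinUpdate {
    static final int BYTES_PER_DOUBLE = 8;

    // Append one new value without touching the existing data.
    static void append(String file, double value) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.seek(raf.length()); // jump to the current end of the file
            raf.writeDouble(value);
        }
    }

    // Read only the values in [from, to); the rest of the file is never loaded.
    static double[] readSlice(String file, long from, long to) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(from * BYTES_PER_DOUBLE);
            double[] out = new double[(int) (to - from)];
            for (int i = 0; i < out.length; i++) {
                out[i] = raf.readDouble();
            }
            return out;
        }
    }
}
```

A moving average or minimum over a window then only needs readSlice over that window rather than the whole file.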
You could try looking at Random Access Files:
Tutorial: http://docs.oracle.com/javase/tutorial/essential/io/rafs.html
API: http://docs.oracle.com/javase/6/docs/api/java/io/RandomAccessFile.html
... but you will still need to figure out the exact positions you want to read in the binary file. For a flat file of doubles, value i starts at byte offset i * 8, since each double is 8 bytes.
You might want to consider moving to a database, maybe a small embedded one like H2 (http://www.h2database.com)
I have a requirement where a client uploads a spreadsheet containing thousands of rows.
Different columns of a row have different data types, and the data must comply with some validation rules, e.g.
below is a sample file structure:
(Header - Column_name,Variable_type,field_size,i/p mask,required_field,validation_Text)
(P/N,String,20,none,yes,none)
(qty,Integer,10,none,yes,none)
(Ship_From,String,20,none,yes,none)
(Request_Date,Date,MM/DD/YY,yes,none)
(Status,String,10,none,yes,Failed OR Qualified)
While reading the Excel sheet, I need to validate the data against the above constraints, and in case of any error in the data,
I need to store the error and inform the customer.
Please let me know the best possible approach that maintains the performance of the system.
Any early response will be much appreciated.
Thanks,
Ashish Gupta
If I understand your question, you would like to read a file of validation rules such as the sample above. You would like to compile the rules so that they can read a large Excel spreadsheet (or is it a CSV file?) and perhaps print out a message for every line that is deemed invalid.
It seems like a two-pass process:
1) validation and compilation of the validation file, and
2) compilation of the output of pass 1 and application of it to the Excel file.
You could approach the field validation in any of several ways, depending on your skills and inclinations.
Develop VBA code to read the validation file, then write a separate macro to validate each line.
Write a parser in your favorite language that reads in the validation file. Add some columns to the read-in Excel spreadsheet with fields such as column name (e.g., Qty), type (e.g., Integer), and required (e.g., true). Then have Excel or OpenOffice highlight invalid lines.
Have lex and yacc generate a Java or C++ parser to scan the validation file and output BNF. Then have another lex and yacc pass read in the output from the previous step and have it validate the Excel file.
You indicated POI in your tags, so I'm thinking that you will want to generate Java code.
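A rough Java sketch of the hand-written-parser route, assuming the rules keep exactly the comma-separated shape shown in the question (the i/p mask and date checks are omitted, and all class and method names here are invented):

```java
import java.util.ArrayList;
import java.util.List;

public class RowValidator {
    // One parsed line of the validation file: (name,type,size,mask,required,text)
    record Rule(String name, String type, int size, boolean required) {}

    static Rule parseRule(String line) {
        String[] f = line.replaceAll("[()]", "").split(",");
        int size = f[2].trim().matches("\\d+") ? Integer.parseInt(f[2].trim()) : 0;
        return new Rule(f[0].trim(), f[1].trim(), size,
                "yes".equalsIgnoreCase(f[4].trim()));
    }

    // Returns the errors for one data row, in rule order, for reporting to the customer.
    static List<String> validate(String[] row, List<Rule> rules) {
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < rules.size(); i++) {
            Rule r = rules.get(i);
            String cell = i < row.length ? row[i].trim() : "";
            if (cell.isEmpty()) {
                if (r.required()) errors.add(r.name() + " is required");
                continue;
            }
            if (r.size() > 0 && cell.length() > r.size())
                errors.add(r.name() + " exceeds " + r.size() + " characters");
            if ("Integer".equalsIgnoreCase(r.type()) && !cell.matches("-?\\d+"))
                errors.add(r.name() + " is not an integer");
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                parseRule("(P/N,String,20,none,yes,none)"),
                parseRule("(qty,Integer,10,none,yes,none)"));
        System.out.println(validate(new String[] {"AB-123", "ten"}, rules));
        // -> [qty is not an integer]
    }
}
```

Feeding it each row from POI and collecting the per-row error lists keeps the whole thing a single pass over the sheet.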
Of course, you could also write a one-time program to do all of this meta-compiling and compiling, but this would be a brittle process.
If you have any freedom to specify the validation file, you might want to make it an .XSD file, because there are automated tools that make scanning it much simpler. There are tools to determine whether an XML file is valid, as well as compilers that can turn it into Java.
(A thought came to mind as I was reading your validation file: how will you separate one part from another? For example, if you read in P/N, Qty, Request_Date, Ship_From, Status, P/N, is that one part with two P/Ns, or one complete part and another with several required fields missing?)
My first thought was to have Excel do this validation, as Rajah seems to suggest too. Built-in functionality and/or VBA should be able to handle these requirements.
If you need to handle this in Java, I'd go for the XML approach.
Cheers,
Wim
I heard about a friend validating his spreadsheets using JBoss Drools: http://www.jboss.org/drools
I have an XML-based Excel validator built on top of POI.
You just need to specify which data you need to validate in the Excel file; the Java API does the validation and returns an error message if the data is not valid.
E.g.:
<data rowNumber="2" columnNumber="2" dataType="string">
    <mandatory errorMessage="Name label is missing">Y</mandatory>
    <value ignoreCase="true" errorMessage="Name label value is not matching.">Name</value>
</data>
The above is a simple validation for a plain-text field; it has additional validations too. Please let me know if you are interested.
I want to make a GUI application that contains three functions as follows:
Add a record
Edit a record
Delete a record
A record contains two fields - Name and Profession
There are two restrictions for the application:
You can't use a database to store the info; you have to use a flat file.
The whole file should not be re-written for every add/delete operation.
So, my questions are:
Q1. Which file format would be better? (.xml or .csv or .txt or any other)
Q2. How can we perform the add/delete operation without the whole file being re-written?
The second part of your question is answered here: Best Way to Write Bytes in the Middle of a File in Java
As for the format, I would go with something as simple as possible. You don't want to have to deal with a bunch of markup processing, since with RandomAccessFile you will be going directly to a byte position. A fixed-width format would be good, so that based on the record number you can calculate the starting position of a record or field in the file without having to read everything before it. The fields would then be padded out to the fixed width with spaces or some other suitable character.
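A minimal sketch of that fixed-width scheme (the 20/30 widths, the ASCII padding, and the class name are illustrative choices):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedWidthStore {
    static final int NAME_LEN = 20, PROF_LEN = 30;
    static final int RECORD_LEN = NAME_LEN + PROF_LEN;

    // Pad (or truncate) a field to its fixed width so every record has the same size.
    static byte[] pad(String s, int len) {
        byte[] out = new byte[len];
        Arrays.fill(out, (byte) ' ');
        byte[] src = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, out, 0, Math.min(src.length, len));
        return out;
    }

    // Write record n in place; only RECORD_LEN bytes of the file are touched.
    static void write(RandomAccessFile raf, long n, String name, String prof) throws IOException {
        raf.seek(n * RECORD_LEN);
        raf.write(pad(name, NAME_LEN));
        raf.write(pad(prof, PROF_LEN));
    }

    static String[] read(RandomAccessFile raf, long n) throws IOException {
        raf.seek(n * RECORD_LEN);
        byte[] buf = new byte[RECORD_LEN];
        raf.readFully(buf);
        return new String[] {
                new String(buf, 0, NAME_LEN, StandardCharsets.US_ASCII).trim(),
                new String(buf, NAME_LEN, PROF_LEN, StandardCharsets.US_ASCII).trim()
        };
    }
}
```

A delete can then be an in-place overwrite of the record with a marker value, so the rest of the file is still never rewritten.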
I would go with CSV, zipped. It is both readable and editable externally.
If CSV is your choice, this can help: http://javacsv.sourceforge.net/
Did you look at this? http://sourceforge.net/projects/flatworm/
Also consider Apache Derby and HSQLDB.
Another solution is this: http://www.coyotegulch.com/products/jisp/index.html
You can reinvent the wheel, but that is only required if this is an academic assignment...
Given that the whole file must not be rewritten, I would suggest using RandomAccessFile, which allows you to read and write only the record you want.
For the file format, use a binary file with fixed-length records, e.g., Name in 20 characters, Profession in 30.
This will allow you to use the seek() method of RandomAccessFile to directly access your data.
Has anybody written any classes for reading and writing Palm Database (PDB) files in Java? (I mean on a server, not on the Palm device itself.) I tried to google, but all I got were Protein Data Bank references.
I wrote a Perl program that does it using Palm::PDB.pm, but I want to turn it into a servlet for a GWT app.
The jSyncManager project at http://www.jsyncmanager.org/ is under the LGPL and includes classes to read and write PDB files -- look in jSyncManager/API/Protocol/Util/DLPDatabase.java in its source code. It looks like the core code you need from this could be isolated from the rest of the library with a little effort.
There are a few ways that you can go about this:
Easiest but slowest: find a Perl-to-Java bridge. This will not be quick, but it will work, and it should involve the least amount of work.
Find a C++/C# implementation that you have the source to and convert it (this should be the fastest solution).
Find a Java reader... there seem to be a few listed on Google; however, I do not have any experience with these.
Depending on what your intended usage is, you might look into writing a simple reader yourself. The format is straightforward, and you only need to handle a couple of fields to parse it.
Basically there is a header for the entire file, which has a 2-byte integer at the end that specifies the number of records. So just skip your way through the bytes for all the other fields in the header, and then read that last field, which is the number of records in the file. Be aware that the PDB format writes integers with the most significant byte first (big-endian).
Following this, there will be a record header for each record, the first field of which is the actual offset into the file for the record itself. Again, be aware of the byte order.
So, now you have the offsets into the file for each record in the file, which should make it very easy to read the actual records as long as you know the format of these for the type of PDB file you are trying to read.
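A minimal sketch of that header walk, assuming the standard 78-byte PDB header (so the 2-byte record count is the last header field) followed by 8-byte record-list entries; DataInputStream conveniently reads big-endian:

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class PdbHeaderReader {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream("file.pdb"))) {
            in.readFully(new byte[76]); // name, attributes, timestamps, type/creator, etc.
            int numRecords = in.readUnsignedShort(); // last header field: record count
            long[] offsets = new long[numRecords];
            for (int i = 0; i < numRecords; i++) {
                offsets[i] = in.readInt() & 0xFFFFFFFFL; // 4-byte offset of the record
                in.readInt(); // 1 attribute byte + 3-byte unique id, not needed here
            }
            System.out.println(numRecords + " records, first at offset "
                    + (numRecords > 0 ? offsets[0] : -1));
        }
    }
}
```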
Wikipedia has a nice overview of the header formats.
Maybe JPilot can help? They must have a lot of Java code dealing with Palm OS data.