Need suggestion on reading huge XLs with validating the data

I have a requirement where a client uploads a spreadsheet containing thousands of rows.
Different columns of a row have different data types, and the data must comply with some validation rules.
Below is a sample of the rule file structure:
(Header - Column_name,Variable_type,field_size,i/p mask,required_field,validation_Text)
(P/N,String,20,none,yes,none)
(qty,Integer,10,none,yes,none)
(Ship_From,String,20,none,yes,none)
(Request_Date,Date,MM/DD/YY ,yes,none)
(Status,String,10,none,yes,Failed OR Qualified)
While reading the Excel sheet, I need to validate the data against the above constraints. If there is any error in the data,
I need to store the error and inform the customer.
Please let me know the best approach that maintains the performance of the system.
Any early responses will be much appreciated.
Thanks,
Ashish Gupta

If I understand your question, you would like to read a file of validation rules such as the sample above. You would like to compile the rules such that they would read a large Excel spreadsheet (or is it a CSV file?) and perhaps print out a message for every line that is deemed invalid.
It seems like a two-pass process:
1) Validation and compilation of the validation file and 2) Compilation of the output of pass 1 and applying it to the Excel file.
You could approach the field validation in any of several ways, depending on your skills and inclinations.
Develop VBA code to read the validation file. Then write a separate macro to validate each line.
Write a parser in your favorite language that reads in the validation file. Add some columns to the read-in Excel spreadsheet with fields such as column name (e.g., Qty), type (e.g., Integer), and required (e.g., true). Then have Excel or OpenOffice highlight invalid lines.
Have lex and yacc generate a Java or C++ parser to scan the validation file and output BNF. Then have another lex and yacc file read in the output from the previous step and have it validate the Excel file.
You indicated POI on your tag, so I'm thinking that you will want to generate Java code.
Of course, you could also write a one-time program to do all of this meta-compiling and compiling, but this would be a brittle process.
If you have any freedom to specify the validation file, you might want to make it an .XSD file, because there are automated tools that make scanning it much simpler. There are tools to determine whether an XML file is valid, as well as compilers that can turn the schema into Java.
(A thought came to mind when I was reading your validation file. How will you separate one part from another? For example, if you read in P/N, Qty, Request_Date, Ship_From, Status, P/N, is that one part with two P/N or one complete part and one with several required parts missing?)
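
For the Java route, here is a minimal sketch of how the rule file from the question could be applied to one row at a time. All class and method names are hypothetical, the cells are assumed to arrive as plain strings (for example from POI or a CSV reader), and the MM/DD/YY mask is assumed to correspond to the Java pattern "MM/dd/yy". It is an illustration under those assumptions, not a finished validator.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one line of the rule file.
final class ColumnRule {
    final String name;
    final String type;          // "String", "Integer" or "Date"
    final int maxSize;          // 0 means no size limit was given
    final String datePattern;   // only used for Date columns
    final boolean required;
    final List<String> allowed; // empty means any value is accepted

    ColumnRule(String name, String type, int maxSize, String datePattern,
               boolean required, String... allowed) {
        this.name = name;
        this.type = type;
        this.maxSize = maxSize;
        this.datePattern = datePattern;
        this.required = required;
        this.allowed = Arrays.asList(allowed);
    }

    // Returns null when the value passes, otherwise an error message.
    String check(String value) {
        if (value == null || value.trim().isEmpty()) {
            return required ? name + " is required" : null;
        }
        String v = value.trim();
        if (maxSize > 0 && v.length() > maxSize) {
            return name + " is longer than " + maxSize + " characters";
        }
        if (type.equals("Integer")) {
            try {
                Long.parseLong(v);
            } catch (NumberFormatException e) {
                return name + " is not an integer: " + v;
            }
        } else if (type.equals("Date")) {
            try {
                LocalDate.parse(v, DateTimeFormatter.ofPattern(datePattern));
            } catch (DateTimeParseException e) {
                return name + " does not match " + datePattern + ": " + v;
            }
        }
        if (!allowed.isEmpty() && !allowed.contains(v)) {
            return name + " must be one of " + allowed;
        }
        return null;
    }
}

final class RowValidator {
    // Checks each cell against the rule at the same position and collects all errors.
    static List<String> validate(int rowNumber, String[] cells, List<ColumnRule> rules) {
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < rules.size(); i++) {
            String cell = i < cells.length ? cells[i] : null;
            String error = rules.get(i).check(cell);
            if (error != null) {
                errors.add("Row " + rowNumber + ": " + error);
            }
        }
        return errors;
    }
}
```

Because each row is validated on its own and only the error messages are kept, the whole sheet never has to be held in memory, which matters for files with many thousands of rows.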

My first thought was to have Excel do this validation, as Rajah seems to suggest too. Built-in functionality and/or VBA should be able to handle these requirements.
If you need to handle this in Java, I'd go for the XML approach.
Cheers,
Wim

I heard about a friend validating his spreadsheets using JBOSS DROOLS: http://www.jboss.org/drools

I have an XML-based Excel validator built on top of POI.
You just need to specify which data to validate in the Excel file; the Java API does the validation and returns the error message if the data is not valid.
E.g.:
<data rowNumber="2" columnNumber="2" dataType="string">
<mandatory errorMessage="Name label is missing">Y</mandatory>
<value ignoreCase="true" errorMessage="Name label value is not matching.">Name</value>
</data>
The above is a simple validation for a plain text field. It supports additional validations too; please let me know if you are interested.

Related

Excel read - modify multiple column value while reading - Java, logic

I have a peculiar problem where I have to work on data given in a spreadsheet (xls, csv). I will be using that data in my Java program.
The spreadsheet data is generated elsewhere and I have no control over it. A few of its columns have system-specific formatting, and I need to be able to choose programmatically how to convert each of these to the format I need.
The simple approaches in my project would be to:
a) read the spreadsheet and apply transformations in place while reading.
b) read each row as a Java object, then iterate over the list and do the modifications.
c) use an in-memory DB like H2 and apply some **user-defined functions** (I don't know how), either while reading into memory or by transforming the data later.
At this point, I really do not have all 3 options figured out in detail, so please excuse the vagueness.
Is there any other way of doing it? More importantly, because I can have thousands of records where more than 5 columns may need to be transformed, what is the quickest approach?

First, check whether the file is an Excel workbook (xls) or a CSV file.
If it is Excel, you can use Apache POI, which is really useful for parsing Excel files. In this case you can apply the transformations while reading.
A CSV file is comma separated, so you can use the split function to parse it. In this case you cannot apply the transformations while reading; collect the rows in an array and transform them afterwards.
Performance depends on how you optimize the code. You can use Java 8 Streams to streamline the code.
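
For the CSV case, a minimal sketch of choosing a transformation per column programmatically might look like this. The class name and its methods are hypothetical, and the plain split on commas assumes there are no quoted fields containing commas.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper: maps a column index to the function that converts its values.
final class ColumnTransformer {
    private final Map<Integer, UnaryOperator<String>> byColumn = new HashMap<>();

    // Registers a conversion for one column; returns this so calls can be chained.
    ColumnTransformer register(int column, UnaryOperator<String> fn) {
        byColumn.put(column, fn);
        return this;
    }

    // Splits one line into cells and applies the registered conversions.
    // Assumption: simple CSV without quoted fields that contain commas.
    String[] transformLine(String line) {
        String[] cells = line.split(",", -1);
        for (Map.Entry<Integer, UnaryOperator<String>> entry : byColumn.entrySet()) {
            int column = entry.getKey();
            if (column < cells.length) {
                cells[column] = entry.getValue().apply(cells[column]);
            }
        }
        return cells;
    }

    // Transforms a batch of lines that has already been read.
    List<String[]> transformAll(List<String> lines) {
        List<String[]> rows = new ArrayList<>(lines.size());
        for (String line : lines) {
            rows.add(transformLine(line));
        }
        return rows;
    }
}
```

With Java 8 Streams, the same method could be used as `Files.lines(path).map(transformer::transformLine)`, so that each row is converted as it is read rather than in a second pass.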

Is there a clean way to transform text files that are not the same into a standard format

I'm pretty sure the answer I'm going to get is: "why don't you just have the text files all be the same or follow some set format". Unfortunately I do not have this option, but I was wondering if there is a way to take any text file and translate it into another text or XML file that will always look the same.
The text files pretty much have the same data, just arranged differently.
The closest I can come up with is to have an XSLT sheet for each text file, but then I have to turn around and read the file that was just created, delete it, and repeat for each text file.
So, is there a way to grab the data from text files that essentially have the same data, just stored differently, and store this data in an object that I could then reuse later on in some process?
If it were up to me, I would push for every text file to follow some predefined format, since they all pretty much contain the same data, but it's not up to me.

Odd question... You say they are text files yet mention XSLT as a possible solution. XSLT will only work if the source is XML, if that is so, please redefine the question. If you say text files I assume delimiter separated (e.g. csv), fixed length,...
There are some parsers (like smooks) out there that allow you to parse multiple formats, but it will still require you to perform the "mapping" yourself of course.
This is a typical problem in the integration world so any integration tool should offer you a solution (e.g. wso2, fuse,...).
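
To illustrate the "mapping" step without any integration tool, here is a rough sketch in plain Java. It assumes the files are either delimiter separated or fixed length, as guessed above; the interface and class names are hypothetical. Each source format gets its own small parser, and all of them produce the same field-name-to-value map, which can then be reused later in the process.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical common contract: every format-specific parser yields the same kind of record.
interface LineParser {
    Map<String, String> parse(String line);
}

// Parser for delimiter-separated lines; fieldNames gives the order used by this source.
final class DelimitedParser implements LineParser {
    private final String delimiterRegex;
    private final String[] fieldNames;

    DelimitedParser(String delimiter, String... fieldNames) {
        this.delimiterRegex = Pattern.quote(delimiter); // treat the delimiter literally
        this.fieldNames = fieldNames;
    }

    @Override
    public Map<String, String> parse(String line) {
        String[] parts = line.split(delimiterRegex, -1);
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            record.put(fieldNames[i], i < parts.length ? parts[i].trim() : "");
        }
        return record;
    }
}

// Parser for fixed-length lines; widths[i] is the column width of fieldNames[i].
final class FixedWidthParser implements LineParser {
    private final String[] fieldNames;
    private final int[] widths;

    FixedWidthParser(String[] fieldNames, int[] widths) {
        this.fieldNames = fieldNames;
        this.widths = widths;
    }

    @Override
    public Map<String, String> parse(String line) {
        Map<String, String> record = new LinkedHashMap<>();
        int position = 0;
        for (int i = 0; i < fieldNames.length; i++) {
            int end = Math.min(position + widths[i], line.length());
            String value = position < line.length() ? line.substring(position, end).trim() : "";
            record.put(fieldNames[i], value);
            position += widths[i];
        }
        return record;
    }
}
```

Supporting a new file layout then means adding one parser configuration rather than writing, reading back, and deleting an intermediate file.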

Generating dynamic forms from Excel xlsx to HTML and JavaScript

I am trying to create dynamic forms for a web application using Excel spreadsheets.
The form has some relatively advanced rules like the following:
Field A > Field B.
Field C must be shown if Check Box D is checked.
Field E is read-only and must be a sum of A and B.
Field G is sum of E and A or F and A if B is empty.
Combinations of rules.
These are just examples of some of them.
The server is implemented in and runs on Java, which I guess narrows the possible solutions. My first thought is to parse the Excel spreadsheet with all required information into XML to enable either server-side or client-side conversion. This is basically because I have found tools that work on either side.
So my question is whether anyone knows of a tool that can perform this conversion or if anyone knows of a better solution?
I have looked at https://github.com/davidmoten/xsd-forms but I am not sure it can implement all the required rules and license information is sparse.
I realize this question is quite vague but so is the task. Any help is appreciated.

I think you can use Apache's POI API for reading the Excel sheet and JAXB for generating XML from the data read from it.
You can read more details about reading Excel files using Apache's POI API over here.

Excel or text file, which one to use?

I need to suggest an input format: Excel file or text file.
Assume the input is a large number of lines, for example:
A,B,C,D....
I need to read the first string (in this case A) to identify the matching row. Should I use an Excel file and use POI to read the first cell of each row, or a text file where the tokens on each line are separated by a delimiter, and parse each line to read the first token?

Use a text file. Because computers like it more. If business requires it, rename that text file into a "csv" file and you've got an Excel file.
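
To show how little code the text-file route needs, here is a small sketch that scans lines and compares only the first comma-separated token. The class and method names are made up for illustration, and a comma delimiter is assumed from the example in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper for the "identify the matching row by its first token" task.
final class FirstTokenLookup {
    // Returns the first line whose first token equals key, or null if none matches.
    static String findRow(BufferedReader reader, String key) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            int comma = line.indexOf(',');
            // Only the text before the first delimiter is inspected; the rest is not split.
            String first = comma < 0 ? line : line.substring(0, comma);
            if (first.equals(key)) {
                return line;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader("A,B,C,D\nX,Y,Z\n"));
        System.out.println(findRow(reader, "X"));
    }
}
```

Only one line is held in memory at a time, so the size of the file does not matter much for this kind of lookup.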

If humans are going to enter the data, then use Excel. If the file is used as a communication channel between two systems, use as simple a file as possible.

If at all possible, use text file - much easier to handle/troubleshoot, easier to generate, uses less memory, does not have restrictions on number of rows, etc. In general - more predictable.
If you go with text files and you have people manually preparing those text files, and you are dealing with non-ASCII text, you better make sure everybody will send you the files in correct encoding (usually UTF-8 would be the best). This is not an issue with Excel.
The only reason to use Excel workbook would be when you need some "business-people" to produce those input files, then that input effectively becomes a user interface to your system - Excel is usually considered more user friendly than Notepad. ;-)
If you do go with Excel, make sure that the people producing those Excel files will give you the correct version (I assume you would want the "old" XLS format, not the new XLSX format).

Rule of thumb: use a text file. It's more interchangeable and way easier to handle by any other software you may need to support in a few years.
If you need humans to edit the data and you need the nicely formatted, colored display that Excel can provide, consider creating a macro that stores the data as CSV.

Palm Database (PDB) files in Java?

Has anybody written any classes for reading and writing Palm Database (PDB) files in Java? (I mean on a server, not on the Palm device itself.) I tried to google, but all I got were Protein Data Bank references.
I wrote a Perl program that does it using Palm::PDB.pm, but I want to turn it into a servlet for a GWT app.

The jSyncManager project at http://www.jsyncmanager.org/ is under the LGPL and includes classes to read and write PDB files -- look in jSyncManager/API/Protocol/Util/DLPDatabase.java in its source code. It looks like the core code you need from this could be isolated from the rest of the library with a little effort.

There are a few ways that you can go about this:
Easiest but slowest: find a Perl-to-Java bridge. This will not be quick, but it will work and it should involve the least amount of work.
Find a C++/C# implementation that you have the source for and convert it (this should be the fastest solution).
Find a Java reader. There seem to be a few listed on Google; however, I do not have any experience with them.

Depending on what your intended usage is, you might look into writing a simple reader yourself. The format is pretty simple and you only need to handle a couple of simple fields to parse it.
Basically there is a header for the entire file which has a 2-byte integer at the end that specifies the number of records. So just skip your way through the bytes for all the other fields in the header and then read the last field, which is the number of records in the file. Be aware that the PDB format writes integers with the most significant byte first.
Following this, there will be a record header for each record, the first field of which is the actual offset into the file for the record itself. Again, be aware of the byte order.
So, now you have the offsets into the file for each record in the file, which should make it very easy to read the actual records as long as you know the format of these for the type of PDB file you are trying to read.
Wikipedia has a nice overview of the header formats.
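
The steps above can be sketched in Java roughly as follows. The byte counts are assumptions taken from the commonly documented layout (76 bytes of header fields before the 2-byte record count, and 8-byte record entries of which the first 4 bytes are the offset), so check them against the format overview before relying on them. The class name is hypothetical. `DataInputStream` already reads integers most significant byte first, which matches the PDB byte order.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical reader that extracts only the record offsets from a PDB file.
final class PdbIndexReader {
    // Assumed size of all header fields that precede the 2-byte record count.
    private static final int BYTES_BEFORE_RECORD_COUNT = 76;
    // Assumed size of the rest of each record entry after its 4-byte offset
    // (1 attribute byte plus a 3-byte unique id).
    private static final int REST_OF_RECORD_ENTRY = 4;

    static long[] readRecordOffsets(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        // Skip the header fields we do not need.
        data.readFully(new byte[BYTES_BEFORE_RECORD_COUNT]);
        // Last header field: number of records, unsigned 16-bit, big-endian.
        int count = data.readUnsignedShort();
        long[] offsets = new long[count];
        for (int i = 0; i < count; i++) {
            // Offset of the record from the start of the file, unsigned 32-bit.
            offsets[i] = data.readInt() & 0xFFFFFFFFL;
            data.readFully(new byte[REST_OF_RECORD_ENTRY]);
        }
        return offsets;
    }
}
```

With the offsets in hand, record i spans from `offsets[i]` up to `offsets[i + 1]` (or the end of the file for the last record); how to interpret those bytes depends on the type of PDB file.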

Maybe JPilot can help? They must have a lot of Java code dealing with Palm OS data.
