How to skip lines with ItemReader in Spring-Batch? - java

I have a custom item reader that transforms lines from a textfile to my entity:
public class EntityItemReader extends AbstractItemStreamItemReader<MyEntity> {

    @Override
    public MyEntity read() {
        String line = delegate.read();
        // analyze the line and skip it by condition
        // line.split(...)
        // create the entity from the line values
    }
}
This is similar to the FlatFileItemReader.
The read MyEntity will then be persisted to a DB by a JDBC item writer.
Problem: sometimes I have lines that contain values that should be skipped.
BUT when I just return null inside the read() method of the reader, not only is that item skipped, the reading is terminated completely and all further lines are skipped, because a null element is the "signal" to all Spring readers that the input to be read is finished.
So: what can I do to skip specific lines by condition inside the reader, given that I cannot return null? By the nature of the reader I'm forced to return an object here.

I think the good practice for filtering lines is to do it not in the reader but in a processor (in which you can return null when you want to filter out a line).
Please see http://docs.spring.io/spring-batch/trunk/reference/html/readersAndWriters.html :
6.3.2 Filtering Records
One typical use for an item processor is to filter out records before they are passed to the ItemWriter. Filtering is an action distinct from skipping; skipping indicates that a record is invalid whereas filtering simply indicates that a record should not be written.
For example, consider a batch job that reads a file containing three different types of records: records to insert, records to update, and records to delete. If record deletion is not supported by the system, then we would not want to send any "delete" records to the ItemWriter. But, since these records are not actually bad records, we would want to filter them out, rather than skip. As a result, the ItemWriter would receive only "insert" and "update" records.
To filter a record, one simply returns "null" from the ItemProcessor. The framework will detect that the result is "null" and avoid adding that item to the list of records delivered to the ItemWriter. As usual, an exception thrown from the ItemProcessor will result in a skip.
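For example, a filtering processor for the entity from the question could look roughly like this (shouldBeSkipped is just a placeholder for whatever condition you were checking in the reader):
import org.springframework.batch.item.ItemProcessor;

public class FilteringItemProcessor implements ItemProcessor<MyEntity, MyEntity> {

    @Override
    public MyEntity process(MyEntity item) throws Exception {
        if (shouldBeSkipped(item)) {
            return null; // filtered: the item is simply not passed on to the writer
        }
        return item;
    }

    private boolean shouldBeSkipped(MyEntity item) {
        return false; // your skip condition goes here
    }
}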

I've had a similar problem in a more general case where I'm using a custom reader that is backed by an iterator over one object type and returns a new item (of a different type) for each object read. The problem is that some of those objects don't map to anything, so I'd like to return something that marks that.
Eventually I decided to define an INVALID_ITEM and return that. Another approach could be to advance the iterator in the read() method until the next valid item, returning null if hasNext() becomes false, but that is more cumbersome.
Initially I also tried to throw a custom exception and tell Spring to skip the item upon it, but it seemed to be ignored, so I gave up (and if there are too many invalid items it isn't performant anyway).
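Roughly, the iterator-advancing variant would look like this (SourceObject, MyItem and the two helper methods are placeholders for your own types and logic, not anything from the original post):
public class MappingItemReader extends AbstractItemStreamItemReader<MyItem> {

    private Iterator<SourceObject> iterator; // iterator over the backing objects

    @Override
    public MyItem read() {
        while (iterator.hasNext()) {
            SourceObject candidate = iterator.next();
            if (mapsToItem(candidate)) {     // placeholder validity check
                return mapToItem(candidate); // placeholder mapping to the item type
            }
            // otherwise keep advancing past objects that map to nothing
        }
        return null; // iterator exhausted: null signals end of input
    }

    private boolean mapsToItem(SourceObject candidate) { return true; } // your check
    private MyItem mapToItem(SourceObject candidate) { return null; }   // your mapping
}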

I do not think you can have your cake and eat it too in this case (and after reading all the comments).
My best suggestion would be (as others said) to throw a custom exception and skip on it.
You can maybe optimize your entity creation or other processing elsewhere so you don't lose so much performance.
Good luck.

We can handle it via a custom Dummy Object.
private static MyClass DUMMYMyClassObject;

private MyClass() {
    // create a blank object
}

public static MyClass getDummyMyClassObject() {
    if (DUMMYMyClassObject == null) {
        DUMMYMyClassObject = new MyClass();
    }
    return DUMMYMyClassObject;
}
And just use the following when you need to skip a record in the reader:
return MyClass.getDummyMyClassObject();
The same item can then be ignored in the processor, by checking whether the object is the blank dummy instance (or according to whatever logic you put in the private default constructor).
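For completeness, the processor side of this idea could be a simple identity check against the marker instance (just a sketch building on the snippet above):
public class DummyFilteringProcessor implements ItemProcessor<MyClass, MyClass> {

    @Override
    public MyClass process(MyClass item) throws Exception {
        if (item == MyClass.getDummyMyClassObject()) {
            return null; // the placeholder is filtered out and never reaches the writer
        }
        return item;
    }
}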

For skipping lines you can throw Exception when you want to skip some lines, like below.
My Spring batch Step
@Bean
Step processStep() {
    return stepBuilderFactory.get("job step")
            .<String, String>chunk(1000)
            .reader(ItemReader)
            .writer(DataWriter)
            .faultTolerant()              // allow Spring Batch to skip lines
            .skipLimit(1000)              // skip limit
            .skip(CustomException.class)  // skip lines when this exception is thrown
            .build();
}
My Item reader
@Bean(name = "reader")
public FlatFileItemReader<String> fileItemReader() throws Exception {
    FlatFileItemReader<String> reader = new FlatFileItemReader<String>();
    reader.setResource(resourceLoader.getResource("c://file_location/file.txt"));
    CustomLineMapper lineMapper = new CustomLineMapper();
    reader.setLineMapper(lineMapper);
    return reader;
}
My custom line mapper
public class CustomLineMapper implements LineMapper<String> {

    @Override
    public String mapLine(String line, int lineNumber) throws Exception {
        if (condition) { // put your condition here when you want to skip lines
            throw new CustomException();
        }
        return line;
    }
}
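The CustomException referenced above isn't shown in the answer; any exception type works as long as it matches the skip() configuration in the step, e.g. something as simple as:
public class CustomException extends RuntimeException {
    public CustomException() {
        super("line skipped by CustomLineMapper");
    }
}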

Related

Update first object attribute in stream

I have the following code:
public List<Data> toData(List<Signatory> signatories) {
    List<Data> data = signatories.stream().sorted().map(input -> {
        if (input == null) {
            throw new IllegalArgumentException("input cannot be null");
        }
        if (isStrictSigningOrderEnabled()) {
            return getData(input, false);
        }
        return getData(input, true);
    }).collect(Collectors.toList());
    data.stream().filter(input -> !input.getStatus().equals(SignatoryStatus.COMPLETE))
        .findFirst().ifPresent(obj -> obj.setCanSign(true));
    return data;
}
I want to avoid using the second stream and have the work it does happen as part of the first stream. Since the first stream handles Signatory objects and the second handles Data objects, to be clear: if I were to fold the second stream's behaviour into the first, what I want is that for the first Signatory object where !input.getStatus().equals(SignatoryStatus.COMPLETE) is true (Signatory and Data both have this method), I call the from(input, true) method. Is that possible?
I was thinking maybe an atomic counter? If it is not possible to do what I want, it would also be helpful if someone pointed that out.

Java Lambda: how to use Stream.reduce to accomplish this?

I have a file of records; each row begins with a timestamp and then a few fields. It implements Iterable:
@SuppressWarnings("unchecked")
@Override
public <E extends MarkedPoint> Stream<E> stream()
{
    return (Stream<E>) StreamSupport.stream(spliterator(), false);
}
I would like to implement, with the lambda expression/streams API, what is essentially not just a filter but a mapping/accumulator that merges neighboring records (stream elements coming from the Iterable) that have the same timestamp. I would need an interface something like this:
MarkedPoint prevPoint = null;

void nextPoint(MarkedPoint in, Stream<MarkedPoint> inputStream, Stream<MarkedPoint> outputStream)
{
    while (prevPoint.time == in.time)
    {
        updatePrevPoint(in);
        in = stream.next();
    }
    outputStream.emit(in);
    prevPoint = in;
}
That is rough pseudocode of what I imagine is close to how some API is supposed to be used. Can someone please point me towards the most straightforward way of implementing this stream transformation? The resulting stream will necessarily have the same number of elements as the input or fewer, since it is essentially a filter plus an optional transformation whenever records occurring at the same timestamp are encountered.
Thanks in advance
Streams don't work like that; there can be only one terminating method (consumer). What you seem to be asking for is an on-the-fly reduction with a possible consumption of the next element(s) within your class. No dice with the standard stream API.
You could first create a list of un-merged lines, then create an iterator that peeks at the next element(s) and merges them before returning the next merged element.
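As a sketch of that idea, assuming MarkedPoint exposes its timestamp as a public time field (as in the pseudocode) and a hypothetical mergeWith method that folds a same-timestamp record into the previous one:
import java.util.ArrayList;
import java.util.List;

static List<MarkedPoint> mergeNeighbours(List<MarkedPoint> unmerged) {
    List<MarkedPoint> merged = new ArrayList<>();
    for (MarkedPoint point : unmerged) {
        if (!merged.isEmpty() && merged.get(merged.size() - 1).time == point.time) {
            merged.get(merged.size() - 1).mergeWith(point); // same timestamp: fold it in
        } else {
            merged.add(point);
        }
    }
    return merged;
}
You would feed it the collected stream, e.g. mergeNeighbours(stream().collect(Collectors.toList())), and work with the merged list (or a stream over it) from there.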

process and manage Tabular data stream in java programming

I want to know how to process and manage a tabular data stream in Java.
Consider a table of records with the schema (name, age, zip-code, disease), whose records are read and processed tuple by tuple, over time, as a stream. I want to process these stream tuples and save the processed tuples with the schema (age, zip-code, disease), i.e. with the name attribute removed.
For example:
read Tuple 1 (han, 25, 12548, flu) at time t1
publish Tuple 1* (25, 12548, flu)
read Tuple 2 (alex, 27, 12544, cancer) at time t2
output Tuple 2* (27, 12544, cancer)
... and so on. Can anyone help me?
Here are some suggestions for a framework you can base your final application on.
First, make classes to represent your input and output records. We'll call them InRecord and OutRecord for the sake of discussion, but you can give them whatever names make sense for you. Give them private fields to hold the necessary data and public getter/setter methods to access the data.
Second, define an interface for an input supplier; let's call it InputSupplier for this discussion. It will need setup (open()) and tear-down (close()) methods to be called at the start and end of processing, and a getNext() method that returns the next available InRecord. You'll need to decide how it indicates end-of-input: either define that getNext() returns null when there are no more input records, or provide a hasNext() method that returns true or false to indicate whether another input record is available.
Third, define an interface for an output consumer (OutputConsumer). You'll want to have open() and close() methods, as well as an accept(OutRecord) method.
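A bare-bones sketch of those pieces might look like the following (field names and getters are illustrative choices, not requirements):
class InRecord {
    private String name;
    private int age;
    private String zipCode;
    private String disease;
    // constructor and setters omitted for brevity
    public int getAge() { return age; }
    public String getZipCode() { return zipCode; }
    public String getDisease() { return disease; }
}

class OutRecord {
    private final int age;
    private final String zipCode;
    private final String disease;
    public OutRecord(int age, String zipCode, String disease) {
        this.age = age;
        this.zipCode = zipCode;
        this.disease = disease;
    }
}

interface InputSupplier {
    void open();
    InRecord getNext(); // returns null when no more input records are available
    void close();
}

interface OutputConsumer {
    void open();
    void accept(OutRecord record);
    void close();
}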
With this infrastructure in place, you can write your processing method:
public void process(InputSupplier in, OutputConsumer out) {
    in.open();
    out.open();
    InRecord inrec;
    while ((inrec = in.getNext()) != null) {
        OutRecord outrec = new OutRecord(inrec.getAge(), inrec.getZipCode(), inrec.getDisease());
        out.accept(outrec);
    }
    out.close();
    in.close();
}
Finally, write some "dummy" I/O classes, one that implements InputSupplier and another that implements OutputConsumer. For test purposes, your input supplier can just return a few hand-created records and your output consumer could just print on the console the output records you send it.
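For example, the test doubles could be as trivial as this (purely illustrative):
class TestInput implements InputSupplier {
    private final InRecord[] records = {}; // fill with a few hand-created records
    private int next = 0;
    public void open() { next = 0; }
    public InRecord getNext() { return next < records.length ? records[next++] : null; }
    public void close() { }
}

class TestOutput implements OutputConsumer {
    public void open() { }
    public void accept(OutRecord record) { System.out.println(record); } // print to the console
    public void close() { }
}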
Then all you need is a main method to tie it all together:
public static void main(String[] args) {
    InputSupplier in = new TestInput();    // our "dummy" input supplier class
    OutputConsumer out = new TestOutput(); // our "dummy" output consumer
    process(in, out);
}
For the "real" application you'd write a "real" input supplier class, still implementing the InputSupplier interface, that can read from from a database or an Excel file or whatever input source, and an new output consumer class, still implementing the OutputConsumer interface, that can take output records and store them into whatever appropriate format. Your processing logic won't have to change, because you coded it in terms of InputSupplier and OutputConsumer interfaces. Now just tweak main a bit and you've got your final app:
public static void main(String[] args) {
    InputSupplier in = new RealInput();    // our "real" input supplier class
    OutputConsumer out = new RealOutput(); // our "real" output consumer
    process(in, out);
}

Using Spring Batch to write csv records to different SQL tables based on a field value [duplicate]

This question already has answers here:
Spring Batch - more than one writer based on field value
(2 answers)
Closed 5 years ago.
I'm working with an application where I'm required to translate various CSV files to an SQL database. Different CSV files may contain a variable number of columns but the first three will always be consistent between files.
The files are being read in just fine, and new database tables are created before the CSV is picked up as I know the combination of possible values. However, when it comes to the writer (I'm using a JdbcBatchItemWriter), I need to reference the newly created table names based on the values in these columns to determine which tables the data corresponds to - I won't know which row corresponds to which table until I look at each row.
Some code to illustrate this:
public JdbcBatchItemWriter<Form> writer(@Value("#{jobParameters}") Map<String, Object> jobParameters) {
    JdbcBatchItemWriter<Form> writer = new JdbcBatchItemWriter<Form>();
    ...
    String parameters = getParameters();
    writer.setSql(String.format("INSERT INTO tableName VALUES (%s)", parameters));
Basically, 'tableName' needs updating for each row that is processed.
It seems that ItemSqlParameterSourceProvider and ItemPreparedStatementSetter are designed for populating values in SQL query strings, but there isn't anything I can find to get the table name in as a parameter. Because I don't have access to each item at the level of the writer definition, I can't replace the value before the prepared statement is put together.
I've considered filtering items before they get there, but it's too messy for an unknown number of tables that might need to be entered into from the CSV. Any ideas?
Write your own writer that keeps a map of writers. Every time a new table name appears, you can instantiate a new writer and store it in this map.
Instantiating a JdbcBatchItemWriter on the fly is no big deal (it does not have to be a Spring bean).
public static <T> ItemWriter<T> createDbItemWriter(DataSource ds, String sql, ItemPreparedStatementSetter<T> psSetter) {
    JdbcBatchItemWriter<T> writer = new JdbcBatchItemWriter<>();
    writer.setDataSource(ds);
    writer.setSql(sql);
    writer.setItemPreparedStatementSetter(psSetter);
    writer.afterPropertiesSet();
    return writer;
}
Your writer will have to look something like this (note: this code is not tested, it is just here to give you an idea):
public class MyWriter implements ItemWriter<MyDto> {

    private final Map<String, JdbcBatchItemWriter<MyDto>> writersMaps = new HashMap<>();

    private JdbcBatchItemWriter<MyDto> getDbWriter(String tableName) {
        return writersMaps.computeIfAbsent(tableName, this::createJdbcWriter);
    }

    private JdbcBatchItemWriter<MyDto> createJdbcWriter(String tableName) {
        JdbcBatchItemWriter<MyDto> writer = new JdbcBatchItemWriter<>();
        // do your configuration
        writer.afterPropertiesSet();
        return writer;
    }

    @Override
    public void write(List<? extends MyDto> items) throws Exception {
        Map<String, List<MyDto>> groupedItems = null; // --> build a list for every targetTableName, put it in a Map
        for (Map.Entry<String, List<MyDto>> entry : groupedItems.entrySet()) {
            getDbWriter(entry.getKey()).write(entry.getValue());
        }
    }
}

designing classes for other developers to use in java

class CSVReader {

    private List<String> output;
    private InputStream input;

    public CSVReader(InputStream input) {
        this.input = input;
    }

    public void read() throws Exception {
        // do something with the input stream
        // create the output list
    }

    public List<String> getOutput() {
        return Collections.unmodifiableList(output);
    }
}
I am trying to create a simple class which will be part of a library. I would like to create code that satisfies the following conditions:
handles all potential errors or wraps them into library errors and
throws them.
creates meaningful and complete object states (no incomplete object structures).
easy to utilize by developers using the library
Now, when I evaluated the code above against the goals, I realized that I failed badly. A developer using this code would have to write something like this:
CSVReader reader = new CSVReader(new FileInputStream("test.csv"));
reader.read();
reader.getOutput();
I see the following issue straight away:
- the developer has to call read() before getOutput(). There is no way to know this intuitively, and this is probably bad design.
So, I decided to fix the code and write something like this
public List<String> getOutput() throws IOException {
    if (output == null)
        read();
    return Collections.unmodifiableList(output);
}
OR this
public List<String> getOutput() {
    if (output == null)
        throw new IncompleteStateException("invoke read() before getOutput()");
    return Collections.unmodifiableList(output);
}
OR this
public CSVReader(InputStream input) {
    read(); // throws a runtime exception
}
OR this
public List<String> read() throws IOException {
    // read and create the output list
    // return the list
}
What is a good way to achieve my goals? Should the object state always be well defined? There is never a state where "output" is not defined, so should I create the output as part of the constructor? Or should the class ensure that a created instance is always valid, by calling read() whenever it finds that "output" is not defined, and just throw a runtime exception? What is a good approach / best practice here?
I would make read() private and have getOutput() call it as an implementation detail. If the point of exposing read() is to lazy-load the file, you can do that by exposing getOutput() only:
public List<String> getOutput() {
    if (output == null) {
        try {
            output = read();
        } catch (IOException e) {
            // here you either wrap it into your own exception and declare that in the signature
            // of getOutput, or don't catch it at all and make getOutput `throws IOException`
        }
    }
    return Collections.unmodifiableList(output);
}
The advantage of this is that the interface of your class is very simple: you give me an input (via the constructor), I give you an output (via getOutput()), with no magic order of calls, while preserving lazy loading, which is nice if the file is big.
Another advantage of removing read from the public API is that you can go from lazy loading to eager loading and vice versa without affecting your clients. If you expose read you have to account for it being called in every possible state of your object (before it has loaded, while it is already loading, after it has already loaded). In short, always expose as little as possible.
So to address your specific questions:
Yes, the object state should always be well-defined. Your point that the client has no way to know an external call to read is needed is indeed a design smell.
Yes, you could call read in the constructor and eagerly load everything upfront. Deciding whether to lazy-load is an implementation detail that depends on your context; it should not matter to a client of your class.
Throwing an exception if read has not been called again puts the burden of calling things in the right, implicit order on the client, which is unnecessary since, as you note, output is never really undefined, so the implementation itself can make the risk-free decision of when to call read.
I would suggest you make your class as small as possible, dropping the getOutput() method altogether.
The idea is to have a class that reads a CSV file and returns a list representing the result. To achieve this, you can expose a single read() method that returns a List<String>.
Something like:
public class CSVReader {

    private final InputStream input;

    public CSVReader(String filename) throws FileNotFoundException {
        this.input = new FileInputStream(filename);
    }

    public List<String> read() {
        // perform the actual reading here
    }
}
You have a well-defined class, a small interface to maintain, and the instances of CSVReader are immutable.
Have getOutput() check whether the output is null (or out of date) and load it automatically if it is. This allows a user of your class not to have to care about the internal state of the class's file management.
However, you may also want to expose a read function so that the user can choose to load the file when it is convenient. If you intend the class for a concurrent environment, I would recommend doing so.
The first approach takes away some flexibility from the API: before the change the user could call read() in a context where an exception is expected, and then call getOutput() exception-free as many times as he pleases. Your change forces the user to catch a checked exception in contexts where it wasn't necessary before.
The second approach is how it should have been done in the first place: since calling read() is a prerequisite of calling getOutput(), it is a responsibility of your class to "catch" your users when they "forget" to make a call to read().
The third approach hides IOException, which may be a legitimate exception to catch. There is no way to let the user know if the exception is going to be thrown or not, which is a bad practice when designing runtime exceptions.
The root cause of your problem is that the class has two orthogonal responsibilities:
Reading a CSV, and
Storing the result of a read for later use.
If you separate these two responsibilities from each other, you would end up with a cleaner design, in which the users would have no confusion over what they must call, and in what order:
interface CSVData {
    List<String> getOutput();
}

class CSVReader {
    public static CSVData read(InputStream input) throws IOException {
        ...
    }
}
You could combine the two into a single class with a factory method:
class CSVData {

    private CSVData() { // no user instantiation
    }

    // Getting data is exception-free
    public List<String> getOutput() {
        ...
    }

    // Creating instances requires a factory call
    public static CSVData read(InputStream input) throws IOException {
        ...
    }
}
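Usage of the factory-method variant then stays a simple two-step call (the file name is purely illustrative):
try (InputStream in = new FileInputStream("test.csv")) {
    CSVData data = CSVData.read(in);
    List<String> output = data.getOutput(); // exception-free once the instance exists
} catch (IOException e) {
    // handle or wrap the failure to read the CSV
}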
