I have a Spring Batch job that extracts data from a database and writes it to a .CSV file.
I would like to add the names of the extracted columns as the header of the file, without hard-coding them in the job configuration.
Is it possible to write the header when I get the results, or is there another solution?
Thanks
fileItemWriter.setHeaderCallback(new FlatFileHeaderCallback() {
    @Override
    public void writeHeader(Writer writer) throws IOException {
        // join with the delimiter; Arrays.toString() would add brackets and spaces
        writer.write(String.join(",", names));
    }
});
names can be fetched via reflection from the domain class you created for the columns used by your RowMapper, something like below:
private String[] reflectFields() throws ClassNotFoundException {
    Class<?> domainClass = Class.forName("DomainClassName");
    Field[] fields = FieldUtils.getAllFields(domainClass);
    names = new String[fields.length];
    for (int i = 0; i < fields.length; i++) {
        names[i] = fields[i].getName();
    }
    return names;
}
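If it helps, here is a minimal sketch of wiring the callback into the writer, assuming the reflectFields() helper above is available in the same configuration class. The Customer domain class, output path and delimiter are hypothetical placeholders, not part of the original answer:
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.BeanWrapperFieldExtractor;
import org.springframework.batch.item.file.transform.DelimitedLineAggregator;
import org.springframework.core.io.FileSystemResource;

public FlatFileItemWriter<Customer> csvItemWriter() throws ClassNotFoundException {
    String[] names = reflectFields();             // column names obtained via reflection, as above

    BeanWrapperFieldExtractor<Customer> extractor = new BeanWrapperFieldExtractor<>();
    extractor.setNames(names);                    // the same names drive the body columns

    DelimitedLineAggregator<Customer> aggregator = new DelimitedLineAggregator<>();
    aggregator.setDelimiter(",");
    aggregator.setFieldExtractor(extractor);

    FlatFileItemWriter<Customer> fileItemWriter = new FlatFileItemWriter<>();
    fileItemWriter.setResource(new FileSystemResource("output.csv"));
    fileItemWriter.setLineAggregator(aggregator);
    // the header is built from the same array, so it always matches the extracted fields
    fileItemWriter.setHeaderCallback(writer -> writer.write(String.join(",", names)));
    return fileItemWriter;
}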
I have a CSV file written using the Apache Commons CSV API, and I can also read the file; however, I can't figure out how to edit a record value in the CSV file using the Apache Commons API. I need help with this.
I tried the code below and it worked exactly the way I expected.
public static void updateCsvFile(File f) throws Exception {
    CSVParser parser = new CSVParser(new FileReader(f), CSVFormat.DEFAULT);
    List<CSVRecord> list = parser.getRecords();
    parser.close();                // close the reader before deleting the file
    String edited = f.getAbsolutePath();
    f.delete();
    // NEW_LINE_SEPARATOR is a constant defined elsewhere in the class, e.g. "\n"
    CSVPrinter printer = new CSVPrinter(new FileWriter(edited), CSVFormat.DEFAULT.withRecordSeparator(NEW_LINE_SEPARATOR));
    for (CSVRecord record : list) {
        String[] s = toArray(record);
        if (s[0].equalsIgnoreCase("Actual Text")) {
            s[0] = "Replacement Text";
        }
        print(printer, s);
    }
    printer.close();
    System.out.println("CSV file was updated successfully!");
}

public static String[] toArray(CSVRecord rec) {
    String[] arr = new String[rec.size()];
    int i = 0;
    for (String str : rec) {
        arr[i++] = str;
    }
    return arr;
}

public static void print(CSVPrinter printer, String[] s) throws Exception {
    for (String val : s) {
        printer.print(val != null ? val : "");
    }
    printer.println();
}
The Apache Commons CSV API only supports reading and writing; you cannot update records in place with the provided API.
So your best option is probably to read the file into memory, make the changes, and write it out again.
If the file is bigger than the available memory, you might need a streaming approach that reads each record and writes it out before reading the next one. Naturally, you need to write to a separate file in that case.
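A minimal sketch of that streaming approach, assuming Commons CSV and reusing the placeholder values from the code above: each record is read, edited if necessary, and written to a temporary file that then replaces the original.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class StreamingCsvEdit {

    public static void editLargeCsv(Path source) throws IOException {
        Path temp = Files.createTempFile("csv-edit", ".tmp");
        try (CSVParser parser = CSVParser.parse(source.toFile(), StandardCharsets.UTF_8, CSVFormat.DEFAULT);
             CSVPrinter printer = new CSVPrinter(Files.newBufferedWriter(temp), CSVFormat.DEFAULT)) {
            for (CSVRecord record : parser) {          // records are streamed one at a time
                List<String> values = new ArrayList<>();
                record.forEach(values::add);
                if (!values.isEmpty() && values.get(0).equalsIgnoreCase("Actual Text")) {
                    values.set(0, "Replacement Text");
                }
                printer.printRecord(values);
            }
        }
        // swap the edited copy in place of the original
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
    }
}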
Hi, I have an application that reads records from HBase and writes them into text files. The HBase table has 200 regions.
I am using MultipleOutputs in the mapper class to write into multiple files, and I build the file name from the incoming records.
I am generating 40 unique file names.
I am able to get the records properly, but my problem is that when the MapReduce job finishes it creates the 40 files plus around 2k extra files with the proper name but with m-000 and so on appended.
This is because I have 200 regions and MultipleOutputs creates files per mapper, so with 200 mappers and 40 unique files per mapper it creates 40*200 files.
I don't know how to avoid this situation without a custom partitioner.
Is there any way to force records to be written only to the files they belong to, instead of being split across multiple files?
I have used a custom partitioner class and it works fine, but I don't want to use one since I am only reading from HBase and not doing a reduce operation. Also, if I have to create any extra file name, I would have to change my code as well.
Here is my mapper code:
public class DefaultMapper extends TableMapper<NullWritable, Text> {

    private Text text = new Text();
    MultipleOutputs<NullWritable, Text> multipleOutputs;
    String strName = "";

    @Override
    public void setup(Context context) throws java.io.IOException, java.lang.InterruptedException {
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
        // ... (building of `text` from the Result is omitted in the original post) ...
        String FILE_NAME = new String(value.getValue(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                Bytes.toBytes(HbaseBulkLoadMapperConstants.FILE_NAME)));
        multipleOutputs.write(NullWritable.get(), new Text(text.toString()), FILE_NAME);
        //context.write(NullWritable.get(), text);
    }
}
There is no reducer class.
This is how my output looks; ideally only one Japan.BUS.gz file should be created. The other files are also very small:
Japan.BUS-m-00193.gz
Japan.BUS-m-00194.gz
Japan.BUS-m-00195.gz
Japan.BUS-m-00196.gz
I encountered the same situation and worked out a solution for it as well.
MultipleOutputs<KEYOUT, VALUEOUT> multipleOutputs = null;
String keyToFind = new String();

public void setup(Context context) throws IOException, InterruptedException
{
    this.multipleOutputs = new MultipleOutputs<KEYOUT, VALUEOUT>(context);
}

public void map(NullWritable key, Text values, Context context) throws IOException, InterruptedException
{
    String valToFindInCol[] = values.toString().split(","); // let's say comma separated
    if (keyToFind == null || keyToFind.equals(valToFindInCol[2])) // say you need to match the element at position 2
    {
        this.multipleOutputs.write(NullWritable.get(), <valToWrite>, valToFindInCol[2]);
    }
    else
    {
        this.multipleOutputs.close();
        this.multipleOutputs = null;
        this.multipleOutputs = new MultipleOutputs<KEYOUT, VALUEOUT>(context);
        this.multipleOutputs.write(NullWritable.get(), <valToWrite>, valToFindInCol[2]);
    }
    keyToFind = valToFindInCol[2];
}
I am trying to get the summary of a CSV file, and the first line of the file is the header. Is there a way to emit the values of each column with its header name as a key-value pair from the Java code?
E.g., the input file looks like:
A,B,C,D
1,2,3,4
5,6,7,8
I want the output from the mapper to be (A,1),(B,2),(C,3),(D,4),(A,5),...
Note: I tried overriding the run function in the Mapper class to skip the first line. But as far as I know, the run function gets called for each input split and thus does not suit my need. Any help on this would really be appreciated.
This is the way my mapper looks:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] splits = line.split(",", -1);
    int length = splits.length;
    // count = 0;
    for (int i = 0; i < length; i++) {
        columnName.set(header[i]);
        context.write(columnName, new Text(splits[i] + ""));
    }
}

public void run(Context context) throws IOException, InterruptedException
{
    setup(context);
    try
    {
        if (context.nextKeyValue())
        {
            Text columnHeader = context.getCurrentValue();
            header = columnHeader.toString().split(",");
        }
        while (context.nextKeyValue())
        {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    }
    finally
    {
        cleanup(context);
    }
}
I assume that the column headers are letters and the column values are numbers.
One of the ways to achieve this, is to use DistributedCache.
Following are the steps:
Create a file containing the column headers.
In the Driver code, add this file to the distributed cache, by calling Job::addCacheFile()
In the setup() method of the mapper, access this file from the distributed cache. Parse and store the contents of the file in a columnHeader list.
In the map() method, check if the values in each record match the headers (stored in the columnHeader list). If yes, then ignore that record (because the record just contains the headers). If no, then emit the values along with the column headers.
This is what the Driver and Mapper code looks like:
Driver:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "HeaderParser");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(HeaderParserMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.addCacheFile(new URI("/in/header.txt#header.txt"));
    FileInputFormat.addInputPath(job, new Path("/in/in7.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/out/"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Driver Logic:
Copy "header.txt" (which contains just one line: A,B,C,D) to HDFS
In the Driver, add "header.txt" to distributed cache, by executing following statement:
job.addCacheFile(new URI("/in/header.txt#header.txt"));
Mapper:
public static class HeaderParserMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    String[] headerList;
    String header;

    @Override
    protected void setup(Mapper.Context context) throws IOException, InterruptedException {
        BufferedReader bufferedReader = new BufferedReader(new FileReader("header.txt"));
        header = bufferedReader.readLine();
        headerList = header.split(",");
        bufferedReader.close();
    }

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] values = line.split(",");
        if (headerList.length == values.length && !header.equals(line)) {
            for (int i = 0; i < values.length; i++) {
                context.write(new Text(headerList[i] + "," + values[i]), NullWritable.get());
            }
        }
    }
}
Mapper Logic:
Override setup() method.
Read "header.txt" (which was put in distributed cache in the Driver) in the setup() method.
In the map() method, check if the line matches the header. If yes, then ignore that line. Else, output header and values as (h1,v1), (h2,v2), (h3,v3) and (h4,v4).
I ran this program on the following input:
A,B,C,D
1,2,3,4
5,6,7,8
I got the following output (where values are matched with respective header):
A,1
A,5
B,2
B,6
C,3
C,7
D,4
D,8
The accepted answer by @Manjunath Ballur works as a good hack. But MapReduce should be used with simplicity in mind, and checking the header for every line is not the recommended way to do this.
One way to go is to write a custom InputFormat that does this work for you, as sketched below.
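A minimal sketch of that idea, assuming the usual Hadoop pattern of wrapping LineRecordReader and discarding the first line of the split that starts at byte offset 0 (the header line). The class names are hypothetical; the mapper would still get the header names themselves from somewhere like the distributed-cache file, but it no longer has to compare every record against them.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class HeaderSkippingInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new HeaderSkippingRecordReader();
    }

    static class HeaderSkippingRecordReader extends RecordReader<LongWritable, Text> {

        private final LineRecordReader delegate = new LineRecordReader();
        private boolean skipHeader;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
            // only the split that begins at offset 0 contains the header line
            skipHeader = ((FileSplit) split).getStart() == 0;
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (skipHeader) {
                skipHeader = false;
                if (!delegate.nextKeyValue()) {   // consume and discard the header
                    return false;
                }
            }
            return delegate.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

        @Override
        public Text getCurrentValue() { return delegate.getCurrentValue(); }

        @Override
        public float getProgress() throws IOException { return delegate.getProgress(); }

        @Override
        public void close() throws IOException { delegate.close(); }
    }
}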
I am currently trying to read in multiple CSV files using beanReader before taking a few columns from each and parsing them into one bean.
So far I cannot seem to parse columns from different files into one bean object. Is this even possible with ICsvBeanReader?
Yes, it's possible :) As of Super CSV 2.2.0 you can read into an existing bean (see javadoc).
The following example uses 3 readers simultaneously (operating on 3 different files) - the first reader is used to create the bean, the other 2 just update the existing bean. This approach assumes that each file has the same number of rows (and that each row number represents the same person). If they don't, but they share some unique identifier, you'll have to read all the records from the first file into memory first, then update from the second/third matching on the identifier.
I've tried to make it a little bit smart, so you don't have to hard-code the name mapping - it just nulls out the headers it doesn't know about (so that Super CSV doesn't attempt to map fields that don't exist in your bean - see the partial reading examples on the website). Of course this will only work if your file has headers - otherwise you'll just have to hard code the mapping arrays with nulls in the appropriate places.
Person bean
public class Person {

    private String firstName;
    private String sex;
    private String country;

    // getters/setters
}
Example code
public class Example {

    private static final String FILE1 = "firstName,lastName\nJohn,Smith\nSally,Jones";
    private static final String FILE2 = "age,sex\n21,male\n24,female";
    private static final String FILE3 = "city,country\nBrisbane,Australia\nBerlin,Germany";

    private static final List<String> DESIRED_HEADERS = Arrays.asList("firstName", "sex", "country");

    @Test
    public void testMultipleFiles() throws Exception {
        try (
            ICsvBeanReader reader1 = new CsvBeanReader(new StringReader(FILE1), CsvPreference.STANDARD_PREFERENCE);
            ICsvBeanReader reader2 = new CsvBeanReader(new StringReader(FILE2), CsvPreference.STANDARD_PREFERENCE);
            ICsvBeanReader reader3 = new CsvBeanReader(new StringReader(FILE3), CsvPreference.STANDARD_PREFERENCE)) {

            String[] mapping1 = getNameMappingFromHeader(reader1);
            String[] mapping2 = getNameMappingFromHeader(reader2);
            String[] mapping3 = getNameMappingFromHeader(reader3);

            Person person;
            while ((person = reader1.read(Person.class, mapping1)) != null) {
                reader2.read(person, mapping2);
                reader3.read(person, mapping3);
                System.out.println(person);
            }
        }
    }

    private String[] getNameMappingFromHeader(ICsvBeanReader reader) throws IOException {
        String[] header = reader.getHeader(true);
        // only read in the desired fields (set unknown headers to null to ignore)
        for (int i = 0; i < header.length; i++) {
            if (!DESIRED_HEADERS.contains(header[i])) {
                header[i] = null;
            }
        }
        return header;
    }
}
Output
Person [firstName=John, sex=male, country=Australia]
Person [firstName=Sally, sex=female, country=Germany]
I have some questions regarding reading and writing to CSV files (or if there is a simpler alternative).
Scenario:
I need to have a simple database of people and some basic information about them. I need to be able to add new entries and search through the file for entries. I also need to be able to find an entry and modify it (i.e change their name or fill in a currently empty field).
Now I'm not sure if a CSV reader/writer is the best route or not? I wouldn't know where to begin with SQL in Java but if anyone knows of a good resource for learning that, that would be great.
Currently I am using Super CSV; I put together a test project based around some example code:
class ReadingObjects {

    // private static UserBean userDB[] = new UserBean[2];
    private static ArrayList<UserBean> arrUserDB = new ArrayList<UserBean>();

    static final CellProcessor[] userProcessors = new CellProcessor[] {
        new StrMinMax(5, 20),
        new StrMinMax(8, 35),
        new ParseDate("dd/MM/yyyy"),
        new Optional(new ParseInt()),
        null
    };

    public static void main(String[] args) throws Exception {
        ICsvBeanReader inFile = new CsvBeanReader(new FileReader("foo.csv"), CsvPreference.EXCEL_PREFERENCE);
        try {
            final String[] header = inFile.getCSVHeader(true);
            UserBean user;
            int i = 0;
            while ((user = inFile.read(UserBean.class, header, userProcessors)) != null) {
                UserBean addMe = new UserBean(user.getUsername(), user.getPassword(), user.getTown(), user.getDate(), user.getZip());
                arrUserDB.add(addMe);
                i++;
            }
        } finally {
            inFile.close();
        }

        for (UserBean currentUser : arrUserDB) {
            if (currentUser.getUsername().equals("Klaus")) {
                System.out.println("Found Klaus! :D");
            }
        }

        WritingMaps.add();
    }
}
And a writer class:
class WritingMaps {

    public static void add() throws Exception {
        ICsvMapWriter writer = new CsvMapWriter(new FileWriter("foo.csv", true), CsvPreference.EXCEL_PREFERENCE);
        try {
            final String[] header = new String[] { "username", "password", "date", "zip", "town" };
            String test = System.getProperty("line.seperator"); // note: the property name should be "line.separator"; this variable is unused anyway

            // set up some data to write
            final HashMap<String, ? super Object> data1 = new HashMap<String, Object>();
            data1.put(header[0], "Karlasa");
            data1.put(header[1], "fdsfsdfsdfs");
            data1.put(header[2], "17/01/2010");
            data1.put(header[3], 1111);
            data1.put(header[4], "New York");
            System.out.println(data1);

            // the actual writing
            // writer.writeHeader(header);
            writer.write(data1, header);
            // writer.write(data2, header);
        } finally {
            writer.close();
        }
    }
}
Issues:
I'm struggling to get the writer to add a new line to the CSV file. Purely for human readability purposes, not such a big deal.
I'm not sure how I would add data to an existing record to modify it. (remove and add it again? Not sure how to do this).
Thanks.
Have you considered an embedded database like H2, HSQL or SQLite? They can all persist to the filesystem and you'll discover a more flexible datastore with less code.
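For example, a minimal sketch of the embedded-database route, assuming the H2 driver is on the classpath and plain JDBC; the database URL, table and column names are made up for illustration:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class PeopleDb {

    public static void main(String[] args) throws SQLException {
        // the database is persisted as a file next to the application
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./peopledb")) {
            try (Statement ddl = conn.createStatement()) {
                ddl.execute("CREATE TABLE IF NOT EXISTS person (name VARCHAR(50), town VARCHAR(50))");
            }
            // add a new entry
            try (PreparedStatement insert = conn.prepareStatement("INSERT INTO person (name, town) VALUES (?, ?)")) {
                insert.setString(1, "Klaus");
                insert.setString(2, "Berlin");
                insert.executeUpdate();
            }
            // modify an existing entry
            try (PreparedStatement update = conn.prepareStatement("UPDATE person SET town = ? WHERE name = ?")) {
                update.setString(1, "Hamburg");
                update.setString(2, "Klaus");
                update.executeUpdate();
            }
            // search for an entry
            try (PreparedStatement query = conn.prepareStatement("SELECT name, town FROM person WHERE name = ?")) {
                query.setString(1, "Klaus");
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + " lives in " + rs.getString("town"));
                    }
                }
            }
        }
    }
}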
The easiest solution is to read the file at application startup into an in-memory structure (a list of UserBean, for example), to add, remove and modify beans in that in-memory structure, and to write the whole list of UserBean back to the file when the app closes, or when the user chooses to save.
Regarding newlines when writing, the javadoc seems to indicate that the writer will take care of that. Just call write for each of your user beans, and the writer will automatically insert newlines between rows.
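A minimal sketch of that save step, assuming Super CSV's CsvBeanWriter and the UserBean from the question (the header array mirrors the one in your writer class; cell processors are omitted for brevity):
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import org.supercsv.io.CsvBeanWriter;
import org.supercsv.io.ICsvBeanWriter;
import org.supercsv.prefs.CsvPreference;

public class UserBeanStore {

    // Overwrites foo.csv with the current in-memory list; the writer inserts the
    // row separators itself, so no manual newline handling is needed.
    public static void saveAll(List<UserBean> users) throws IOException {
        final String[] header = { "username", "password", "date", "zip", "town" };
        ICsvBeanWriter writer = new CsvBeanWriter(new FileWriter("foo.csv"), CsvPreference.EXCEL_PREFERENCE);
        try {
            writer.writeHeader(header);
            for (UserBean user : users) {
                writer.write(user, header);
            }
        } finally {
            writer.close();
        }
    }
}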