Pentaho Kettle program in java to merge multiple csv files by columns


I have two CSV files, employee.csv and loan.csv.
employee.csv has four columns: empid (Integer), name (String), age (Integer), education (String).
loan.csv has three columns: loan (Double), balance (Double), empid (Integer).
I want to merge these two files into a single CSV file on the empid column, so that result.csv has the columns:
empid(Integer),
name(String),
age(Integer),
education(String),
loan(Double),
balance(Double).
I also have to achieve this using only the Kettle API from a Java program.
Can anyone please help me?

First of all, you need to create a Kettle transformation as follows:
Take two "CSV Input" steps, one for employee.csv and another for loan.csv.
Hop both inputs to a "Stream Lookup" step and look up using "empid".
Final step: add a "Text file output" step to generate the CSV output.
I have placed the .ktr code here.
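To make clear what the lookup on empid is expected to produce, here is a rough plain-Java sketch of the same logic. It does not use the Kettle API. The class name `EmpLoanLookup`, the sample rows, and the simplifications (no quoted fields, headers already stripped, last loan row wins for a repeated empid) are my own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Kettle: mimics a lookup on empid.
class EmpLoanLookup {

    // employees: rows of "empid,name,age,education"
    // loans:     rows of "loan,balance,empid"
    // returns:   rows of "empid,name,age,education,loan,balance"
    static List<String> join(List<String> employees, List<String> loans) {
        // Build the lookup table from the loan rows: empid -> "loan,balance".
        Map<String, String> loanByEmpId = new HashMap<>();
        for (String line : loans) {
            String[] f = line.split(",", -1);
            loanByEmpId.put(f[2].trim(), f[0].trim() + "," + f[1].trim());
        }

        List<String> result = new ArrayList<>();
        for (String line : employees) {
            String[] f = line.split(",", -1);
            // An employee without a loan keeps the row, with empty loan fields.
            String loan = loanByEmpId.getOrDefault(f[0].trim(), ",");
            result.add(line.trim() + "," + loan);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> employees = Arrays.asList("1,Ann,30,BSc", "2,Bob,41,MSc");
        List<String> loans = Arrays.asList("5000.0,1200.5,1");
        for (String row : join(employees, loans)) {
            System.out.println(row);
        }
    }
}
```

Every employee row is kept, which is the behaviour I would expect from a lookup; if you only want employees that have a loan, skip the rows where no match is found.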
Secondly, if you want to execute this transformation from Java, I suggest you read this blog, where I have explained how to execute a .ktr/.kjb file using Java.
Extra points:
If the names of the CSV files need to be passed as parameters from the Java code, you can do that by adding the line below:
trans.setParameterValue(parameterName, parameterValue);
where parameterName is the name of a parameter defined in the transformation
and parameterValue is the file name or location.
I have already defined the file names as parameters in the Kettle code I have shared.
Hope it helps :)

Related

Writing to a specific csv column

I have made a little JavaFX app that reads from the database and writes to a file using openCSV. Everything works so far. The file has a column whose values contain the path of an image (example: /home/user/folder1/image.jpg). What I have to do is replace every value in this column with one that contains only the last folder name and the image name (example: folder1/image.jpg). I read about this and realised that I would have to write a lot of code. Is there another way of doing this? For example, can I replace all of these values directly in the query, or can I use another library that lets me write to a specific column?
Thanks :)

Running GATK DepthOfCoverage on BAM files with multiple RG's

I am trying to run GATK DepthOfCoverage on some BAM files that I have merged from two original files (the same sample was sequenced on two lanes to maximize the number of reads). I realized after the fact that my merged file has reads with different read groups (as reflected by the RG field of each read), and that the headers of my two original files differed in their @RG fields.
I have tried running samtools reheader to add a new @RG field to the header, but when I merge the two files, each read's read group is based on the name of the BAM file it came from, not on the @RG name in the headers of the two BAM files.
For example, my two starting samples are:
27163.pe.markdup.bam
27091.pe.markdup.bam
but when I merge them using samtools merge
samtools merge merged.bam 27163.pe.markdup.bam 27091.pe.markdup.bam
The resulting merged.bam has the same @RG field in the header as only one of the two inputs, and each read has an RG tag based on the name of the file it came from, like this:
Read 1
RG:Z:27091.pe.markdup
Read 2
RG:Z:27163.pe.markdup
etc. for the rest of the reads in the BAM
Am I doing something wrong? Should I reheader each of the original files before merging? Or simply reheader after merging to something that is compatible with GATK? It seems that no matter what the @RG field in the header is before merging, the merged file will always have reads with different RGs based on the names of the two input files.
I am also not sure what GATK DepthOfCoverage wants as input in terms of read groups. Does it want a single RG for all reads? In that case, should I use something other than samtools merge?
Thanks in advance for any help you can give me.

For future reference, please see the worked out solution here:
https://www.biostars.org/p/105787/#107970
Basically, the correct procedure is to merge using Picard instead of samtools; Picard gives output that is compatible with GATK in terms of the BAM file read group vocabulary.

Read text files and write it to excel in java

I have to read a text file and write it to an already existing Excel file. The Excel file is a customized sheet with different items in different columns, and each item has its own value. These items and their values can be found in a text file, but I don't have much idea how to do this.
E.g- example.txt
Name: John
Age=24
Sex=M
Graduate=M.S
example.xlsx
Age: Sex:
Name: Graduate:
Thanks in advance :)

Just as for so many other problems that need solved, there's an Apache library for that! In this case, it's the POI library. I've only used it for very basic spreadsheet manipulation, but managed that by just following a few tutorials. I'd link to one, but I can't now remember where it was.

Please see Apache POI-HSSF library for reading and writing Excel files with Java. There are some quick guides to get you started.
This post How to read and write excel file in java might help you.

You can also create a *.csv (comma-separated values) file in Java. Just create a simple text file with a .csv extension and put your values in it like this:
Age:,24,Sex:,M,
So you just separate your values with commas (or other delimiters like ';').
Every line in this file is a row, and every delimiter separates two columns. You won't be able to add colours/styles/formatting this way, but it gives you a file that is openable and understandable even without Excel (or other spreadsheet software).
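As a rough sketch of that idea in plain Java: the snippet below turns the key/value lines from example.txt into one CSV row of label,value pairs. The class name `KeyValueToCsv` is made up for illustration, and I am assuming that values never contain commas or line breaks, so no quoting is done.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: converts "Name: John" / "Age=24" lines into one CSV row.
class KeyValueToCsv {

    static String toCsvRow(List<String> lines) {
        List<String> cells = new ArrayList<>();
        for (String line : lines) {
            // Split on the first ':' or '=' only, so values may contain either.
            String[] kv = line.split("[:=]", 2);
            if (kv.length < 2) {
                continue; // not a key/value line, skip it
            }
            cells.add(kv[0].trim() + ":"); // label cell, e.g. "Age:"
            cells.add(kv[1].trim());       // value cell, e.g. "24"
        }
        return String.join(",", cells);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Name: John", "Age=24", "Sex=M", "Graduate=M.S");
        System.out.println(toCsvRow(lines));
    }
}
```

To end up with an actual file, the returned row can be written out with something like `java.nio.file.Files.write` to a path ending in .csv.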

combine the table data in multiple csv files into one single csv file in ruby

I have data that is spread across different CSV files,
for example id,name,birthdate in one file and id,address in another.
This is just an example; the user has to specify the columns, as is done in SSIS.
What I want to do is create a combined file that has the whole content as
id,name,birthdate,address
Are there any tools available in Java/Ruby for this?
I have seen the sed solution but cannot use it, as the columns are not fixed.
In short, I want the ET function from ETL.

Do you need Java or Ruby? Have you looked at the Unix join utility instead? It's analogous to the SQL join statement, except it works on text files.

How to join 2 csv files in Java

I would like to be able to take 2 csv files as input, join them (SQL style) on a specific column, and output a new csv file which contains all file1 data, plus one of the columns of data from file2.
Any tips on what the best way to achieve this would be? Since SQL offers the join command, possibly some method of treating the CSV files as databases would work well, but I'm open to all suggestions really - the easiest wins.
All help is much appreciated!

Do some simple file IO, split each line and load it into a Set type container. Then you can do set type operations on the content of the two files:
http://www.java2s.com/Code/Java/Collections-Data-Structure/Setoperationsunionintersectiondifferencesymmetricdifferenceissubsetissuperset.htm
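A minimal sketch of that approach, assuming unquoted comma-separated lines that are already read into memory; the class and method names are made up for illustration. Note that a Set only tells you which join keys appear in both files. To carry the extra column over from file2 you would keep a Map from key to value instead.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: set operations on the join column of two CSV files.
class CsvKeySets {

    // Collects the values of one column (0-based index) from unquoted CSV lines.
    static Set<String> column(List<String> lines, int index) {
        Set<String> keys = new LinkedHashSet<>();
        for (String line : lines) {
            keys.add(line.split(",", -1)[index].trim());
        }
        return keys;
    }

    // Keys present in both sets (set intersection); the inputs are not modified.
    static Set<String> inBoth(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> file1Keys = column(Arrays.asList("1,a", "2,b", "3,c"), 0);
        Set<String> file2Keys = column(Arrays.asList("2,x", "3,y", "4,z"), 0);
        System.out.println(inBoth(file1Keys, file2Keys));
    }
}
```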

You can parse your CSV files and bind them to beans with opencsv:
http://opencsv.sourceforge.net/
Here is how to bind the entries in a CSV file to a list of beans:
http://opencsv.sourceforge.net/#javabean-integration
You can then do whatever you want with the lists of beans programmatically, such as appending one list to another or applying join-like logic.

A very simple, non-programmatic approach: import both text files into a spreadsheet, then use vlookup (or its equivalent) to look up values from one sheet into the other.

For direct manipulation of CSV files as SQL tables see:
Reading a CSV file into Java as a DB table

You might also try to use a JDBC driver for CSV files like this one:
http://sourceforge.net/projects/csvjdbc/

I have written a command-line program called gcsvsql that executes arbitrary SQL on CSV files, including multi-file joins. You can read about it here:
http://bayesianconspiracy.blogspot.com/2010/03/gcsvsql.html
There is a Google Code project for it here: http://code.google.com/p/gcsvsql/
It's written in Java/Groovy, and will run anywhere Java is available.
