Running GATK DepthOfCoverage on BAM files with multiple RG's - java

I am trying to run GATK DepthOfCoverage on some BAM files that I have merged from two original files (the same sample was sequenced on two lanes to maximize the number of reads). I realized after the fact that my merged file has reads with different read groups (as reflected by the RG tag of each read), and that the headers of my two original files differed in their @RG lines.
I have tried running samtools reheader to add a new @RG line to the header, but when I merge the two files, the read group of each read is based on the name of the BAM file it came from, not on the @RG ID in the headers of the two BAM files.
For example, my two starting samples are:
27163.pe.markdup.bam
27091.pe.markdup.bam
but when I merge them using samtools merge
samtools merge merged.bam 27163.pe.markdup.bam 27091.pe.markdup.bam
The resulting merged.bam has the same @RG line in the header as only one of the two inputs, and each read carries an RG tag based on the name of the file it came from, like this:
Read 1
RG:Z:27091.pe.markdup
Read 2
RG:Z:27163.pe.markdup
etc. for the rest of the reads in the BAM
Am I doing something wrong? Should I reheader each of the original files before merging? Or simply reheader after merging to something that is compatible with GATK? It seems like no matter what the @RG line in the header is before merging, the merged file will always have reads with different RG tags based on the names of the two input files.
I am also not sure what GATK DepthOfCoverage wants as input in terms of read groups. Does it want a single RG for all reads? In that case, should I use something other than samtools merge?
Thanks in advance for any help you can give me.

For future reference, please see the worked out solution here:
https://www.biostars.org/p/105787/#107970
Basically, the correct procedure is to merge using Picard instead of samtools, which produces output whose read group headers and tags are consistent in the way GATK expects.
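For example, a Picard-based merge along these lines should carry the @RG lines from both inputs into the merged header while keeping the RG tags on the reads consistent (a sketch only; the picard.jar path is a placeholder for your own installation):
java -jar picard.jar MergeSamFiles I=27163.pe.markdup.bam I=27091.pe.markdup.bam O=merged.bam
If a read group still needs fixing afterwards, Picard's AddOrReplaceReadGroups can rewrite the @RG header line and the RG tags on all reads in one pass.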

Related

Reading a file with multiple section headers in Apache Spark with variable section content

Is it possible to use the Spark APIs to read a large CSV file containing multiple sections with different headers? The structure of the file is as follows:
BatchCode#1
Name,Surname,Address
AA1,BBB,CCC
AA2,BBB,CCC
AA3,BBB,CCC
BatchCode#2
Name,Surname,Address,Phone
XY1,BBB,CCC,DDD
XY2,BBB,CCC,DDD
XY3,BBB,CCC,DDD
While reading the records, we need to be careful with the headers, as the format can differ between sections. The BatchCode information needs to be extracted from the section header and should become part of every record within that section. For example, the first data row of the first section should be parsed as:
Name: AA1
Surname: BBB
Address: CCC
BatchCode: 1
The following options come to mind, but I am not completely sure whether they could create significant problems:
Reading the file using wholeTextFiles. This uses a single task to read the file, but it loads the entire file into memory and could cause memory issues with large files.
Forcing Spark to read the file in a single partition using coalesce(1) on sc.textFile. I am not sure whether the original line order is always guaranteed. Once we have the file as an RDD, we would cache the header rows while reading the file and merge them with their corresponding data records.
Even if the above approaches work, would they be efficient? What would be the most efficient way?
For more complicated use cases like this, where sequential processing must be guaranteed, I have written plain Scala programs; it is too difficult otherwise. Files that originated as xls or xlsx were first converted with csvkit.
The following program works for me:
JavaPairRDD<String, PortableDataStream> binaryFiles = sc.binaryFiles(file);
PortableRecordReader reader = new PortableRecordReader();
JavaPairRDD<String, Record> fileAndLines = binaryFiles.flatMapValues(reader);
Where PortableRecordReader opens the DataInputStream, wraps it in an InputStreamReader, and uses a CSV parser to convert the lines into the intended Record objects, merging in the section header along the way.
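PortableRecordReader is not spelled out in the answer, but a minimal sketch could look like the following (assuming a Spark 1.x-style Java API where flatMapValues takes a Function returning an Iterable, and a simple Record POJO; the batch-code parsing and the naive comma split are simplified placeholders):

import org.apache.spark.api.java.function.Function;
import org.apache.spark.input.PortableDataStream;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Turns one whole file (as a PortableDataStream) into a list of Records,
// carrying the current section's BatchCode and header into every record.
public class PortableRecordReader implements Function<PortableDataStream, Iterable<Record>> {
    @Override
    public Iterable<Record> call(PortableDataStream stream) throws Exception {
        List<Record> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream.open(), StandardCharsets.UTF_8))) {
            String batchCode = null;
            String[] header = null;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("BatchCode#")) {          // section marker
                    batchCode = line.substring("BatchCode#".length());
                    header = null;                            // the next line is this section's header
                } else if (header == null) {
                    header = line.split(",");                 // column names for this section
                } else {
                    String[] values = line.split(",");        // naive CSV split, enough for the sketch
                    records.add(new Record(batchCode, header, values)); // Record is a POJO assumed to exist
                }
            }
        }
        return records;
    }
}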

Defining a manual Split algorithm for File Input

I'm new to Spark and the Hadoop ecosystem and already fell in love with it.
Right now, I'm trying to port an existing Java application over to Spark.
This Java application is structured the following way:
Read file(s) one by one with a BufferedReader with a custom Parser Class that does some heavy computing on the input data. The input files are of 1 to maximum 2.5 GB size each.
Store data in memory (in a HashMap<String, TreeMap<DateTime, List<DataObjectInterface>>>)
Write out the in-memory-datastore as JSON. These JSON files are smaller of size.
I wrote a Scala application that processes my files with a single worker, but that is obviously not the most performance I can get out of Spark.
Now to my problem with porting this over to Spark:
The input files are line-based. I usually have one message per line. However, some messages depend on preceding lines to form an actual valid message in the Parser. For example it could happen that I get data in the following order in an input file:
{timestamp}#0x033#{data_bytes} \n
{timestamp}#0x034#{data_bytes} \n
{timestamp}#0x035#{data_bytes} \n
{timestamp}#0x0FE#{data_bytes}\n
{timestamp}#0x036#{data_bytes} \n
To form an actual message out of the "composition message" 0x036, the parser also needs the lines from messages 0x033, 0x034 and 0x035. Other messages can also arrive in between this set of needed messages. Most messages can be parsed from a single line, though.
Now finally my question:
How do I get Spark to split my file correctly for my purposes? The files cannot be split arbitrarily; they must be split in a way that ensures all my messages can be parsed and the parser will not wait for input that it will never get. This means that each composition message (a message that depends on preceding lines) needs to be contained in one split.
I guess there are several ways to achieve a correct output, but I'll throw in some ideas I had as well:
Define a manual split algorithm for the file input? This would check that the last few lines of a split do not contain the start of a "big" message [0x033, 0x034, 0x035].
Split the file however Spark wants, but also carry a fixed number of lines (let's say 50, which will certainly be enough) from the previous split over into the next split. The duplicated data will be handled correctly by the Parser class and will not introduce any issues.
The second way might be easier, but I have no clue how to implement this in Spark. Can someone point me in the right direction?
Thanks in advance!
I saw your comment on my blogpost on http://blog.ae.be/ingesting-data-spark-using-custom-hadoop-fileinputformat/ and decided to give my input here.
First of all, I'm not entirely sure what you're trying to do, so help me out here: your file contains lines with the 0x033, 0x034, 0x035 and 0x036 messages, and Spark will process them separately, while these lines actually need to be processed together?
If this is the case, you shouldn't interpret this as a "corrupt split". As you can read in the blog post, Spark splits files into records that it can process separately. By default it does this by splitting records on newlines. In your case, however, your "record" is actually spread over multiple lines. So yes, you can use a custom FileInputFormat. I'm not sure this is the easiest solution, however.
You can try to solve this with a custom FileInputFormat that does the following: instead of handing out lines one by one like the default FileInputFormat does, you parse the file and keep track of the records you encounter (0x033, 0x034, etc.). In the meantime you can filter out records like 0x0FE (if you don't need them elsewhere). The result is that Spark gets all of these physical records as one logical record.
On the other hand, it might be easier to read the file line by line and map the lines onto a functional key (e.g. [object 33, 0x033], [object 33, 0x034], ...). This way you can combine the lines that belong together using the key you chose.
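As a sketch of that second idea using Spark's Java API (the key extraction and the message-assembly step are placeholders, since they depend on your actual protocol):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class GroupByMessageKey {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "group-by-message-key");

        JavaRDD<String> lines = sc.textFile("input.log");    // {timestamp}#0x033#{data_bytes} ...

        // Key every line, then group all lines that share a key so they can be parsed together.
        JavaPairRDD<String, Iterable<String>> grouped = lines
                .mapToPair(line -> new Tuple2<>(extractObjectKey(line), line))
                .groupByKey();

        // Hand each group of related lines to the existing Parser in one call.
        grouped.foreach(group -> parseMessage(group._2()));

        sc.stop();
    }

    // Placeholder: a real implementation must return the same key for all lines that belong
    // to one composition message (protocol knowledge needed); here it just returns the id field.
    private static String extractObjectKey(String line) {
        return line.split("#")[1];
    }

    // Placeholder: feed the collected lines to the existing Parser class.
    private static void parseMessage(Iterable<String> relatedLines) {
        // parser.parse(relatedLines);
    }
}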
There are certainly other options. Whichever you choose depends on your use case.

Pentaho Kettle program in java to merge multiple csv files by columns

I have two csv files employee.csv and loan.csv.
In employee.csv I have four columns i.e. empid(Integer),name(String),age(Integer),education(String).
In loan.csv I have three columns i.e. loan(Double),balance(Double),empid(Integer).
Now, I want to merge these two csv files into a single csv file by empid column.So in the result.csv file the columns should be,
empid(Integer),
name(String),
age(Integer),
education(String),
loan(Double),
balance(Double).
Also, I have to achieve this using only a Kettle API program in Java.
Can anyone please help me?
First of all, you need to create a Kettle transformation as below:
Take two "CSV Input" steps, one for employee.csv and another for loan.csv
Hop both inputs into a "Stream Lookup" step and look up on the "empid" field
Final step: use a "Text file output" step to generate the csv output
I have placed the ktr code in here.
Secondly, if you want to execute this transformation using Java, I suggest you read this blog, where I have explained how to execute a .ktr/.kjb file from Java.
Extra points:
If it's required that the names of the csv files be passed as parameters from the Java code, you can do that by adding the code below:
trans.setParameterValue(parameterName, parameterValue);
where parameterName is the name of the parameter variable
and parameterValue is the file name or location.
I have already exposed the file names as parameters in the Kettle code I have shared.
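Pulled together, running the .ktr from Java and passing the file names might look roughly like this (a sketch against the PDI Java API; the parameter names "EMPLOYEE_FILE" and "LOAN_FILE" and the file merge_csv.ktr are placeholders for whatever your transformation actually defines):

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class MergeCsvRunner {
    public static void main(String[] args) throws KettleException {
        KettleEnvironment.init();                               // initialize the Kettle runtime

        TransMeta transMeta = new TransMeta("merge_csv.ktr");   // load the transformation definition
        Trans trans = new Trans(transMeta);

        // Placeholder parameter names; they must match the parameters defined in the .ktr
        trans.setParameterValue("EMPLOYEE_FILE", "employee.csv");
        trans.setParameterValue("LOAN_FILE", "loan.csv");

        trans.execute(null);                                    // start the transformation
        trans.waitUntilFinished();                              // block until it completes

        if (trans.getErrors() > 0) {
            throw new KettleException("Transformation finished with errors");
        }
    }
}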
Hope it helps :)

Complex MapReduce configuration scenario

Consider an application that wants to use Hadoop in order to process large amounts of proprietary binary-encoded text data in approximately the following simplified MapReduce sequence:
Gets a URL to a file or a directory as input
Reads the list of the binary files found under the input URL
Extracts the text data from each of those files
Saves the text data into new, extracted plain text files
Classifies the extracted files into (sub)formats with particular characteristics (call this the "context")
Splits each of the extracted text files according to its context, if necessary
Processes each of the splits using the context of the original (unsplit) file
Submits the processing results to a proprietary data repository
The format-specific characteristics (context) identified in Step 5 are also saved in a (small) text file as key-value pairs, so that they are accessible by Step 6 and Step 7.
Splitting in Step 6 takes place using custom InputFormat classes (one per custom file format).
In order to implement this scenario in Hadoop, one could integrate Step 1 - Step 5 in a Mapper and use another Mapper for Step 7.
A problem with this approach is how to make a custom InputFormat know which extracted files to use in order to produce the splits. For example, format A may represent 2 extracted files with slightly different characteristics (e.g., different line delimiter), hence 2 different contexts, saved in 2 different files.
Based on the above, the getSplits(JobConf) method of each custom InputFormat needs to have access to the context of each file before splitting it. However, there can be (at most) 1 InputFormat class per format, so how would one correlate the appropriate set of extracted files with the correct context file?
A solution could be to use some specific naming convention for associating extracted files and contexts (or vice versa), but is there any better way?
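If the naming-convention route is taken, the correlation could be as small as deriving the context file path from each extracted file path inside getSplits() and loading its key-value pairs, along these lines (a sketch only; the ".context" suffix and the plain key=value format are assumptions, not part of the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public final class ContextLookup {

    // Maps extracted/part-A.txt to extracted/part-A.txt.context by naming convention.
    static Path contextPathFor(Path extractedFile) {
        return new Path(extractedFile.getParent(), extractedFile.getName() + ".context");
    }

    // Reads the context file as simple key=value pairs.
    static Map<String, String> loadContext(Configuration conf, Path extractedFile) throws IOException {
        Path contextPath = contextPathFor(extractedFile);
        FileSystem fs = contextPath.getFileSystem(conf);
        Map<String, String> context = new HashMap<>();
        try (FSDataInputStream in = fs.open(contextPath);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    context.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
                }
            }
        }
        return context;
    }
}

A custom InputFormat's getSplits() could then call loadContext() for each input file and either attach the context to the split it produces or stash it in the job configuration for Step 7.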
This sounds more like a Storm (stream processing) problem, with a spout that loads the list of binary files from the URL and subsequent bolts in the topology performing each of the remaining steps.

Indexing multiple files in one file

I have a program that reads from plain text files. The number of these files can be more than 5 million!
When I read them, I look them up by name; the names basically encode the x and y coordinates of a matrix, for example 440x300.txt.
Now I want to put all of them into one big file and index them.
That is, I want to know exactly at which byte, for example, 440x300.txt starts and at which byte it ends in the big file.
My first idea was to create a separate index file and save this info in it, with each line containing something like 440 x 300 150883 173553,
but looking this info up will also take a lot of time!
I want to know if there is a better way to find out where they start and end,
i.e. somehow index the files.
Please help.
By the way, I'm programming in Java.
Thanks in advance for your time.
If you only need to read these files, I would archive them in batches, e.g. using the ZIP or JAR format. These formats support naming and indexing of entries, and you can build, update and check the archives using standard tools.
It is possible to place 5 million files in one archive, but using a small number of archives may be more manageable.
BTW: as the files are text, compressing them will also make them smaller. You can try this yourself by creating a ZIP or JAR with, say, 1000 of them.
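Looking an entry up by name is then a plain java.util.zip call (a minimal sketch; "tiles.zip" and the entry name are placeholders):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class TileLookup {
    public static void main(String[] args) throws IOException {
        try (ZipFile archive = new ZipFile("tiles.zip")) {
            ZipEntry entry = archive.getEntry("440x300.txt");    // lookup by name via the ZIP central directory
            try (InputStream in = archive.getInputStream(entry)) {
                byte[] content = in.readAllBytes();              // Java 9+; use a read loop on older JDKs
                System.out.println(new String(content));
            }
        }
    }
}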
If you want to be able to do direct addressing within your file, then you have two options:
Have an index at the beginning of your file so you can look up the start/end address based on (x, y)
Make all records exactly the same size (in bytes) so you can easily compute the location of a record in your file (a sketch of this option follows the list of criteria below)
Choosing the right option should be done based on the following criteria:
Do you have records for each cell in your matrix?
Do the matrix values change?
Does the matrix dimension change?
Can the values in the matrix have a fixed byte length (i.e. are they numbers or strings)?
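If the second option (fixed-size records) fits, the address computation reduces to simple arithmetic over a RandomAccessFile; a sketch under those assumptions (the matrix width, record size and file name are placeholders):

import java.io.IOException;
import java.io.RandomAccessFile;

public class MatrixStore {
    private static final int WIDTH = 1000;       // number of columns in the matrix (assumed)
    private static final int RECORD_SIZE = 256;  // fixed size of every record in bytes (assumed)

    // Reads the record that used to live in <x>x<y>.txt from the one big file.
    static byte[] readRecord(RandomAccessFile file, int x, int y) throws IOException {
        long offset = ((long) y * WIDTH + x) * RECORD_SIZE;  // row-major addressing
        byte[] record = new byte[RECORD_SIZE];
        file.seek(offset);
        file.readFully(record);
        return record;
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("matrix.dat", "r")) {
            byte[] record = readRecord(file, 440, 300);      // the former 440x300.txt
            System.out.println(record.length + " bytes read");
        }
    }
}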
