Converting BAM to bedGraph for viewing on UCSC?

March 27, 2013, 9:48 pm

I'm trying to go from a BAM file to a representation viewable in UCSC, ideally bedGraph. I am trying to use Bedtools's genomeCoverage like this:

genomeCoverageBed -ibam accepted_hits.sorted.bam -bg -trackline -split -g ... > mytrack.bedGraph

I'm not sure what the -g argument is supposed to be or how to generate it. The documentation does not explicitly say what it should be, though it gives an example where it is some sort of BED file. I am simply looking for a bedGraph or other compact, UCSC-friendly representation that will let me visualize read densities from the BAM in UCSC.

EDIT: When I generate a bedGraph and load it into UCSC, the track is not displayed as a histogram (screenshot not preserved).

How can I make it a histogram? How can I generate the genome file for use with genomeCoverageBed? Also, is this the best way to get a UCSC-viewable file with Bedtools? To clarify, I want to visualize the BAM as a histogram. I'm not sure whether this is possible with bedGraph. Thank you.

↧

How to create a read density profile within an interval?

March 27, 2013, 9:48 pm

Hi!

I need some help: I have to create a density profile with a fixed window size of 1 kb (how many times a sequence is detected after NGS). I have to use SAM and BEDTools. I think I can use genomeCov in BEDTools, but I don't have a reference genome.

So, if anybody is able to help me...

Thanks
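
A simple count-based profile does not strictly need a reference genome. As a minimal sketch (not a BEDTools recipe), the Python below bins alignment start positions from SAM records into fixed 1 kb windows. The function names `sam_starts` and `window_counts` are hypothetical, and counting one start per mapped alignment, rather than per-base coverage, is an assumption about what "density" means here.

```python
from collections import Counter

def sam_starts(sam_lines):
    """Yield (chromosome, 0-based start) for each mapped SAM record."""
    for line in sam_lines:
        if line.startswith("@"):        # skip SAM header lines
            continue
        fields = line.split("\t")
        if int(fields[1]) & 4:          # FLAG bit 0x4: read is unmapped
            continue
        yield fields[2], int(fields[3]) - 1   # SAM POS is 1-based

def window_counts(positions, window=1000):
    """Count positions per fixed-size window, keyed by (chrom, window_start)."""
    counts = Counter()
    for chrom, pos in positions:
        counts[(chrom, (pos // window) * window)] += 1
    return counts
```

Windows that contain no reads simply do not appear in the result, so chromosome lengths are never needed.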

↧

bowtie2 mapping different number of reads to same sequence when ref-seq is part of different indexes

March 27, 2013, 9:48 pm

I am using bowtie2 to map my PE reads.

I have indexed multiple bacterial genomes together by concatenating them into a single multi-FASTA file.

bowtie2 -q -a -p 1 -x Multi -1 R100_1.fq -2 R100_2.fq -U 100_Orph.fastq -S 100.sam
samtools view -b -S 100.sam -o 100.bam
coverageBed -abam 100.bam -b BED_RefSeq >>100.cvg

The coverageBed output for genome "307679329" is:

307679329       1       25751   72      3568    25750   0.1385631

But when I index genome "307679329" separately, the coverageBed output is:

307679329       1       25751   449     8369    25750   0.3250097

Can someone explain this difference?

↧

How to count genes in genomic regions using a GTF/GFF3 and a BED file of regions

March 27, 2013, 9:48 pm

I'd like to count the number of unique genes in a GFF file that fall within each of a list of genomic regions. With bedtools I can count the number of GFF features overlapping each region, which is almost what I want, but not quite.

bedtools intersect -a regions.bed -b my.gff -c

UPDATE:

I should have made my question a bit more specific. I have a modified Ensembl-style GTF file (not a GFF) that has unique transcript IDs. This means that simply selecting unique values in the 9th column of the GTF file actually counts transcript IDs rather than genes.

To circumvent this problem I first truncated the gtf file:

cat my.gff | sed -e 's/;.*//' > delete.me.gtf

Then I ran the bedtools map command:

bedtools map -a regions.bed -b delete.me.gtf -c 9 -o count_distinct > counts.genes_in_windows.bed

I almost forgot to delete the intermediate file:

rm delete.me.gtf

There is probably a way to make this a one-liner, without the intermediate file, but I have a dissertation to write!
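
The truncation step could also be done as a stream, so that no intermediate file is written. A minimal Python sketch, assuming the first attribute in column 9 is the gene ID (as in Ensembl-style GTF files); the function name `first_attribute_only` is hypothetical:

```python
def first_attribute_only(gtf_lines):
    """Yield GTF lines whose 9th column is cut down to its first attribute."""
    for line in gtf_lines:
        if line.startswith("#"):            # skip comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:                 # skip malformed or blank lines
            continue
        # Keep only the text before the first ';' in the attribute column.
        fields[8] = fields[8].split(";", 1)[0]
        yield "\t".join(fields[:9])
```

Writing the yielded lines to standard output would let the result be piped straight into the next step instead of going through delete.me.gtf.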

↧

Discrepancy in samtools mpileup/depth and BEDTools genomeCoverageBed counts

March 27, 2013, 9:48 pm

I am getting different counts for the number of reference bases covered by aligned reads when using samtools depth/mpileup versus the BEDTools genomeCoverageBed command. I am using samtools-0.1.19 and bedtools-2.17.0.

samtools mpileup -ABQ0 -d10000000 -f ref.fas qry.bam > qry.mpileup
samtools depth -q0 -Q0 qry.bam > qry.depth

genomeCoverageBed -ibam qry.bam -g ref.genome -dz > qry.dz
wc -l qry.[dm]*
  1026779 qry.depth
  1027173 qry.dz
  1026779 qry.mpileup

Any ideas? Thanks

↧

Split a BAM file into several files containing all the alignments for x number of reads.

May 30, 2013, 4:47 am

Hi everyone!

I am struggling with annotating a very big BAM file that was mapped using TopHat. The run produced a large number of reads: ~200M. The problem is that when I try to annotate each read using a GFF file (with BEDTools intersectBed), the resulting BED file is huge: over 1.7 TB! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT department increased the $TMPDIR local disk space to 1.5 TB so I could run everything on $TMPDIR, but it is still not enough.

What I think I should do is split this BAM file into several files, maybe 15, so that each set of reads gets annotated separately on a different node. That way, I would not run out of disk space. When all the files are annotated, I can run groupBy on each and then simply sum the number of reads that each feature in the GFF received across all the files.

However, there is a slight complication: after the annotation using intersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and divides each read by that number. I.e., if a read mapped to 2 regions, each instance of the read is worth 1/2, so that it contributes only 1/2 a read to each of the features it mapped to.

Because of this, I need all the alignments in the BAM file that belong to a given read to be contained in one single file. That is to say, I can't simply split the BAM file into 15 files, because with bad luck I could end up with 2 BAM files that have the alignments of a single read split between them, making the division incorrect.

How can I instruct UNIX to count a certain number of unique reads in the BAM file, output all their alignments to a new file, and continue with the rest of the BAM file, such that every read has its n alignments contained in one single file (shared with other reads)?

Thank you!
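
One way to think about the split, sketched in Python under the assumption that the alignments have been converted to SAM text, that header lines were removed, and that all alignments of a read are adjacent (for example after sorting by read name). The function name `chunk_by_read_name` is hypothetical.

```python
from itertools import groupby

def chunk_by_read_name(sam_lines, reads_per_chunk):
    """Yield lists of SAM lines, each holding all alignments of up to
    reads_per_chunk reads; a read's alignments are never split."""
    chunk, n_reads = [], 0
    # groupby only groups adjacent lines, hence the name-grouped assumption.
    for _, group in groupby(sam_lines, key=lambda line: line.split("\t", 1)[0]):
        chunk.extend(group)
        n_reads += 1
        if n_reads == reads_per_chunk:
            yield chunk
            chunk, n_reads = [], 0
    if chunk:                     # final, possibly smaller, chunk
        yield chunk
```

Each yielded chunk could then be written to its own file, together with the original header, and annotated on a separate node.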

↧

Bedtools genomeCoverageBed usage : How to create a genome file?

May 30, 2013, 4:47 am

I am using BEDTools with the following command to get the coverage file:

$ ./genomeCoverageBed -ibam ~/GG_project/trim/ecoli.bam -g > ~/GG_project/trim/coverage

where ecoli.bam is my sorted bam file, and coverage is my output file

Where do I get the genome file, and how do I create one? Specifically, I need an ecoli.genome file.
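
For reference, the genome file that BEDTools reads is a tab-delimited list of sequence names and their lengths. The Python below is a minimal sketch that derives those lines from the reference FASTA, assuming the sequence names in the FASTA match those in the BAM header. The function name `fasta_to_genome_lines` is hypothetical.

```python
def fasta_to_genome_lines(fasta_lines):
    """Yield 'name<TAB>length' for every sequence in a FASTA stream."""
    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield f"{name}\t{length}"
            # Use the first word after '>' as the sequence name.
            name, length = line[1:].split()[0], 0
        elif name is not None:
            length += len(line)
    if name is not None:          # emit the last sequence
        yield f"{name}\t{length}"
```

The first two columns of a `samtools faidx` index (.fai) carry the same information, so they could be used instead.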

↧

memory efficient bedtools sort and merge with millions of entries?

May 30, 2013, 4:47 am

I would like to know if there is a memory-efficient way of sorting and merging a large number of BED files, each containing millions of entries, into a single BED file in which duplicated or partially overlapping entries are merged so that each region appears only once.

I have tried the following but it blows up in memory beyond the 32G I have available here:

find /my/path -name '*.bed.gz' | xargs gunzip -c | ~/src/bedtools-2.17.0/bin/bedtools sort | ~/src/bedtools-2.17.0/bin/bedtools merge | gzip -c > bed.all.gz

Any suggestions?
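
One idea that avoids loading everything at once, sketched in Python: if each input file is sorted on its own first, the files can be merged as streams, holding only one record per input in memory. This is a sketch under that assumption, not a description of how bedtools works internally; the names `parse` and `merge_sorted_beds` are hypothetical, and every input must be sorted by chromosome name (as plain text) and then numerically by start.

```python
import heapq

def parse(line):
    """Return (chrom, start, end) from the first three BED columns."""
    chrom, start, end = line.split("\t")[:3]
    return chrom, int(start), int(end)

def merge_sorted_beds(*streams):
    """Merge overlapping or touching intervals from pre-sorted BED streams."""
    current = None
    # heapq.merge lazily interleaves the already-sorted inputs.
    for chrom, start, end in heapq.merge(*(map(parse, s) for s in streams)):
        if current and chrom == current[0] and start <= current[2]:
            current[2] = max(current[2], end)   # extend the open interval
        else:
            if current:
                yield tuple(current)
            current = [chrom, start, end]
    if current:
        yield tuple(current)
```

Only the three coordinate columns are kept; any name or score columns are dropped in this sketch.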

↧

creating bed file for lncRNA using GENCODE GTF file

May 30, 2013, 4:47 am

Hi all,

I want to get a BED file of lncRNAs based on the GENCODE GTF file.

I downloaded the file "gencode.v16.long_noncoding_RNAs.gtf.gz", extracted the chr, start, and end fields from it, and then used mergeBed to merge the overlapping lncRNAs. Is that correct? I know that exon genomic positions can be merged with this kind of method.

For lncRNAs, however, I am not so sure. Is there anywhere that already offers this kind of BED file?

Actually, we should get 22444 long non-coding RNA loci transcripts; however, there are only 11817 genomic regions after the merging process.

If anyone knows the answer, could you give me some help?

↧

Determining each sample's coverage area

May 30, 2013, 4:47 am

This is my first time working with NGS data. I've got a BAM file with mapped reads for my samples and a BED file with the regions in hg19 that were targeted (using an Ion Torrent AmpliSeq panel). Are there any tools that can output something similar to this:

Sample    Amplicon  Chromosome  Start_coordinate_of_coverage  End_coordinate_of_coverage
Sample1   amp_001   chr6        1,000,000                     1,000,250
Sample2   amp_001   chr6        1,000,111                     1,000,255
Sample1   amp_002   chr6        1,000,200                     1,000,333

I basically want to know for each gene what coverage we have for each sample.

EDIT: changed column headings, I'm looking for coordinates that have coverage, not depth at each exon.

↧

How to rearrange paired end bam file?

May 30, 2013, 4:47 am

Hello all,

I have a paired-end BAM file and I want to use bedtools on it. After merging, the alignments of each read pair no longer lie next to each other, which causes problems in the bedtools step. Is there any tool available to rearrange the paired-end read alignments in a BAM file?

Thanks, Deeps

↧

Getting the average coverage from the coverage counts at each depth.

May 30, 2013, 4:47 am

Hi,

I have already read quite a few posts here about coverage, but I still have a few questions. I have a BAM file and I'm trying to find its coverage (typically something like 30X).

So, I decided to use genomeCoverageBed for my analysis. And I used the following command:

genomeCoverageBed -ibam file.bam -g ~/refs/human_g1k_v37.fasta > coverage.txt

As many are aware, the output of the file looks something like this:

genome  0   26849578  100286070  0.26773
genome  1   30938928  100286070  0.308507
genome  2   21764479  100286070  0.217024
genome  3   11775917  100286070  0.117423
genome  4   5346208   100286070  0.0533096
genome  5   2135366   100286070  0.0212927
genome  6   785983    100286070  0.00783741
genome  7   281282    100286070  0.0028048
genome  8   106971    100286070  0.00106666
genome  9   47419     100286070  0.000472837
genome  10  27403     100286070  0.000273248

To find the coverage, I multiplied col2 (depth) by col3 (number of bases in the genome with that depth) for each row and summed the products. Then I divided the sum by the genome length to get the coverage.

In this case, the sum of col2 * col3 is 150098740. Since the genome length is 100286070, the coverage is 150098740/100286070 ≈ 1.5, that is, 1.5X. I have only considered depths 0 to 10 from the file here, but you get the idea. So, is this the right way to get the physical coverage?

Note: While the output file (coverage.txt) gives individual chromosome details, I only used the genome-wide rows, i.e., those where col1 is labeled genome.
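
The arithmetic described above can be written out as a short Python sketch. It assumes the five-column histogram layout shown in the post (label, depth, number of bases at that depth, genome length, fraction); the function name `weighted_depth_sum` is hypothetical.

```python
def weighted_depth_sum(hist_lines, label="genome"):
    """Return (sum of depth * bases, genome length) for rows with this label."""
    weighted, genome_length = 0, 0
    for line in hist_lines:
        fields = line.split()
        if len(fields) < 5 or fields[0] != label:
            continue                       # skip per-chromosome rows
        depth, n_bases = int(fields[1]), int(fields[2])
        weighted += depth * n_bases
        genome_length = int(fields[3])     # same value on every row
    return weighted, genome_length

rows = """genome 0 26849578 100286070 0.26773
genome 1 30938928 100286070 0.308507
genome 2 21764479 100286070 0.217024
genome 3 11775917 100286070 0.117423
genome 4 5346208 100286070 0.0533096
genome 5 2135366 100286070 0.0212927
genome 6 785983 100286070 0.00783741
genome 7 281282 100286070 0.0028048
genome 8 106971 100286070 0.00106666
genome 9 47419 100286070 0.000472837
genome 10 27403 100286070 0.000273248""".splitlines()

total, length = weighted_depth_sum(rows)
mean_depth = total / length   # 150098740 / 100286070, roughly 1.5
```

On the full histogram, rather than only the rows for depths 0 to 10, the same function would include the contribution of the deeper positions as well.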

↧

What is the fastest method to determine the number of positions in a BAM file with >N coverage?

May 30, 2013, 4:47 am

I have two very large BAM files (high depth, human, whole genome). I have a seemingly simple question. I want to know how many positions in each are covered by at least N reads (say 20). For now I am not concerned about requiring a minimum mapping quality for each alignment or a minimum read quality for the reads involved.

Things I have considered:

samtools mpileup (then piped to awk to assess the minimum depth requirement, then piped to wc -l). This seemed slow...
samtools depth (storing the output to disk so that I can assess coverage at different cutoffs later). Even if I divide the genome into ~133 evenly sized pieces, this seems very slow...
bedtools coverage?
bedtools genomecov?
bedtools multicov?
bamtools coverage?

Any idea which of these might be fastest for this question? Something else I haven't thought of? I can use parallel processes to ensure that the performance bottleneck is disk access but want that access to be as efficient as possible. It seems that some of these tools are doing more than I need for this particular task...
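
Whichever tool produces the per-position depths, the counting step itself can be done in one pass for several cutoffs at once, so nothing needs to be stored on disk. A minimal Python sketch, assuming three-column tab-delimited input (chromosome, position, depth) such as `samtools depth` prints for a single BAM; the function name `count_positions_at_least` is hypothetical, and this says nothing about which upstream tool is fastest.

```python
def count_positions_at_least(depth_lines, thresholds):
    """Count positions whose depth is >= each threshold, in a single pass."""
    counts = dict.fromkeys(thresholds, 0)
    for line in depth_lines:
        depth = int(line.rsplit("\t", 1)[1])   # last column is the depth
        for t in thresholds:
            if depth >= t:
                counts[t] += 1
    return counts
```

Feeding the function `sys.stdin` would allow it to sit at the end of a pipe.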

↧

bedtools multicov needs an index specification option

May 30, 2013, 4:47 am

In bedtools version 2.16.2, multicov is used to compute coverage across multiple samples for a given feature file (GTF or BED).

Usage: bedtools multicov -bams aln1.bam aln2.bam ... -bed captureRegion.bed > out.coverage

The official documentation mentions that the input BAM files should be sorted and indexed, but it does not give the details. Suppose the BAM file is named sample1.bam; then the index file must be named sample1.bam.bai (not sample1.bai), otherwise multicov reports an error: indexes not found.

I think it would be better to add an option which will allow the user to specify the bam index files or the suffix used for these index files.

↧

filtering bed files by using BEDOPS

July 16, 2013, 7:23 am

Hello everyone,

I have paired-end Illumina reads, R1.fastq and R2.fastq, which I mapped as single-end reads using bowtie2 with default parameters. I performed further downstream analysis using samtools and BEDOPS, and now I have R1.bed and R2.bed. I made two sets: the first contains R1_uniquely_mapped.bed and R2_uniquely_mapped.bed, and the second contains R1_mapped_more_than_1.bed and R2_mapped_more_than_1.bed.

Because R1 and R2 belong to paired-end reads, and my restriction library has a maximum size of 2 kb, the R1 and R2 reads of a pair must lie within 2 kb of each other on the chromosome.

In theory, I expect R1.bed to look like this:

chr1  100   180   @R1_read1______1 .................
chr1  1000  1090  @R1_read2______1................

and R2.bed to look like this:

chr1  2100  2180  @R2_read1_____2.............  ## 2 kb added with respect to R1.bed
chr1  2500  2590  @R2_read______2.........  ## 1.5 kb (1500 nt) added with respect to R1.bed, because my library is at most 2 kb

How can I customize downstream tools like BEDOPS or bedtools to capture this type of read pair or alignment? How can I filter for this kind of information using BEDOPS?

All suggestions and comments are most welcome.

↧

Can Bedtools/Bedops be used to extract regions where scores are higher than a given value?

July 16, 2013, 7:23 am

I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value?

For example, let's say that I have a BED file like the following:

chr7    127471196  127472363  Pos1  12   +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  200  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  120  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  54   +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  2    -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  15   -  127477031  127478198  0,0,255
chr7    127478198  127479365  Neg3  25   -  127478198  127479365  0,0,255
chr7    127479365  127480532  Pos5  2    +  127479365  127480532  255,0,0
chr7    127480532  127481699  Neg4  9    -  127480532  127481699  0,0,255

According to the BED format's specs, the fifth column contains a score, between 0 and 1000 (alternatively, in the bedGraph format the score is on the 4th position). If I want to get all the regions that have a score higher than 20, for example, I can do an awk search:

$: awk '$5 > 20 {print}' mybedfile.bed

However, in order to use awk, I have to keep the BED file in an uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with the score search (e.g. get all regions that overlap a given region and have a score higher than a given value).
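
For comparison, the same score filter can run directly on a compressed file without unpacking it to disk. The Python sketch below uses gzip rather than the .starch format (an assumption made only to keep the example self-contained); the names `open_bed` and `filter_by_score` and the file name in the usage note are hypothetical.

```python
import gzip

def open_bed(path):
    """Open a BED file for reading as text, transparently handling .gz."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

def filter_by_score(lines, min_score, score_col=5):
    """Yield BED lines whose score column is strictly greater than min_score."""
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue                       # skip blank and header lines
        fields = line.split()
        if len(fields) < score_col:
            continue                       # no score column on this line
        if float(fields[score_col - 1]) > min_score:
            yield line
```

For example, `filter_by_score(open_bed("mybedfile.bed.gz"), 20)` mirrors the awk command above; for bedGraph input the score is in column 4, so `score_col=4` would be passed.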

↧

how to install BedTools in a user directory

July 16, 2013, 7:23 am

I am trying to install Bedtools in a user directory. However, I looked at the manual and at its makefile, and there is no argument like "--prefix" that I can change. Is there a way to install all of Bedtools in a directory that I specify? Thanks!

↧

How can I include one bed file in another bed file ?

August 7, 2013, 5:04 am

Hello, I have 2 BED files that share some common features. Let's call the first file A.bed (the bigger file) and the second B.bed (the smaller file). I would like a new BED file that contains everything in A.bed plus everything in B.bed. I don't need the intersection; what I need is more like a merge. I checked the bedtools manual but couldn't find an answer for merging 2 BED files. Can someone help?

Thanks in advance
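
If "merge" here means a plain union of the two files, with features that appear in both kept only once, it can be sketched in a few lines of Python. That reading of the question is an assumption, as is the hypothetical function name `union_bed`; overlapping but non-identical features are left untouched.

```python
from itertools import chain

def union_bed(a_lines, b_lines):
    """Return the sorted union of two BED inputs, dropping exact duplicates."""
    records = set()
    for line in chain(a_lines, b_lines):
        fields = tuple(line.rstrip("\n").split("\t"))
        if len(fields) >= 3:               # ignore blank or header lines
            records.add(fields)
    # Sort by chromosome, then numerically by start and end.
    ordered = sorted(records, key=lambda f: (f[0], int(f[1]), int(f[2]), f[3:]))
    return ["\t".join(f) for f in ordered]
```

The whole union is held in memory, which is fine for ordinary annotation files but not for very large inputs.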

↧

IntersectBed provides an empty output

August 22, 2013, 1:06 pm

Hi,

I've downloaded the recent Cygwin version 1.7.24 and am trying to run BEDTools, but I get an empty file as my output. When I run the same command line and files on a colleague's computer, also through Cygwin, I get a file containing the overlaps I seek. Is the new Cygwin not compatible with BEDTools? I've put the command line we used below:

./intersectbed -a Gene_body.bed -b EdgeR1.bed -wao > yyy.temp

Any help would be appreciated.

↧

Heatmap of read coverage around TSSs

August 22, 2013, 1:06 pm

I am trying to plot a heatmap of read density around a feature of interest (TSSs), of the kind that is very common in genomics papers, something like panel B of an example figure (image not preserved).

However, I am struggling a bit to get it to look "right". A bit of background: I have mapped ChIP-seq reads for pol2 and calculated the coverage, per nucleotide, using bedtools.

coverageBed -d -abam $bamFile -b $TSSs > $coverage.bed
# output:
chr1    67108226    67110226    uc001dct.3    16    +    1    10
chr1    67108226    67110226    uc001dct.3    16    +    2    10
chr1    67108226    67110226    uc001dct.3    16    +    3    10
chr1    67108226    67110226    uc001dct.3    16    +    4    10
chr1    67108226    67110226    uc001dct.3    16    +    5    8
chr1    67108226    67110226    uc001dct.3    16    +    6    8
chr1    67108226    67110226    uc001dct.3    16    +    7    8
chr1    67108226    67110226    uc001dct.3    16    +    8    8
chr1    67108226    67110226    uc001dct.3    16    +    9    8
chr1    67108226    67110226    uc001dct.3    16    +    10    8

Then, in R, the position in column 7 is converted to a position relative to the TSS, and the read counts are normalized to the library size. This is converted to a numeric matrix in which each row is a TSS and each column is a relative nucleotide position. For plotting, the matrix is ordered by the number of reads per TSS, and the values are log-transformed. This is the outcome:

heatmap(cov.mlog, Rowv=NA, Colv=NA, scale="column", labCol = FALSE, labRow = FALSE, col=brewer.pal(9, "Greens"), margins = c(5, 5))

(heatmap image not preserved)

Although the average coverage plot looks as one would expect for pol2, the heatmap is, well, not great. My question is: what am I doing wrong, and how can I improve it? In two papers I looked at (links not preserved), the coverage is calculated for bins of nucleotides and not per nucleotide as I am doing. Would that improve the visualization? How can I do it in bedtools? Should the sorting of the matrix be done differently?

I am aware that I could use ngsplot for this, but I am trying to avoid it because my own implementation would fit better with my other analyses.

Thank you!
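
On the binning question, the per-nucleotide depths that are already computed can be averaged into fixed-width bins before plotting, without rerunning bedtools. A minimal sketch in Python (the same idea carries over to R); the function name `bin_means` and the bin size of 5 are hypothetical, and whether binning improves the heatmap is not something this sketch can show.

```python
def bin_means(depths, bin_size):
    """Average consecutive per-nucleotide depths into bins of bin_size."""
    means = []
    for i in range(0, len(depths), bin_size):
        window = depths[i:i + bin_size]    # last bin may be shorter
        means.append(sum(window) / len(window))
    return means

# Depths for relative positions 1-10 of uc001dct.3 from the output above.
example = bin_means([10, 10, 10, 10, 8, 8, 8, 8, 8, 8], bin_size=5)
```

Applied to each row of the matrix, this reduces the number of columns by a factor of `bin_size`.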

↧