Extracting Genomic Coverage Information Across Different Samples

March 21, 2014, 1:39 pm

Hello, I have 3 BAM files that I want to compare against each other. For example, I have a reference file with 10,000 sequences and paired-end reads sequenced for 3 different samples:

1) Sample 1 is 100% the same as the reference, so we expect reads to map to all of it.
2) Sample 2 is 80% similar to the reference, so 20% of the reference sequences won't have any reads.
3) Sample 3 is 60% similar to the reference, so 40% of the reference sequences won't have any reads.

My goal is to identify which reference sequences do not have any reads mapped in Samples 2 and 3, i.e. the 20% of reference sequences missing from Sample 2 and the 40% missing from Sample 3. Also, in some cases, for a reference that is approximately 10 kb long, Sample 1 maps to the entire 10 kb, Sample 2 maps to the first 5 kb and Sample 3 maps to the last 3 kb, so I need to identify the partially covered regions of those reference sequences as well.

I have the mapped, sorted BAM files for all three samples. I am looking into using bedtools but am not sure which bedtools command will give the answer I need. I have the following commands, which might do something similar, but they output differences at every base.

genomeCoverageBed -bg -ibam sample1.bam > sample1.bedgraph

genomeCoverageBed -bg -ibam sample2.bam > sample2.bedgraph

unionBedGraphs -header -i sample1.bedgraph sample2. ...
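A minimal sketch of one way to get both lists from a single per-sample bedGraph, assuming it was written with `-bg` (which omits zero-coverage intervals, so a reference that never appears in the file has no reads). The function names, the `ref_lengths` dictionary and the toy intervals are illustrative assumptions, not part of the original post.

```python
from collections import defaultdict


def covered_spans(bedgraph_lines):
    """Collapse position-sorted bedGraph intervals into covered spans per reference."""
    spans = defaultdict(list)
    for line in bedgraph_lines:
        fields = line.split()
        # Skip blank lines, track/browser lines and comments.
        if len(fields) < 4 or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        merged = spans[chrom]
        # Input is assumed sorted, so only the last span can touch this interval.
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return spans


def summarize(ref_lengths, bedgraph_lines):
    """Return (references with no coverage, partially covered references with their spans)."""
    spans = covered_spans(bedgraph_lines)
    uncovered = [name for name in ref_lengths if name not in spans]
    partial = {
        name: [tuple(span) for span in spans[name]]
        for name in ref_lengths
        if name in spans and sum(e - s for s, e in spans[name]) < ref_lengths[name]
    }
    return uncovered, partial


# Toy data (hypothetical): seq1 is covered over its first 5 kb only, seq2 not at all.
ref_lengths = {"seq1": 10000, "seq2": 10000}
sample2 = ["seq1\t0\t3000\t12", "seq1\t3000\t5000\t7"]
uncovered, partial = summarize(ref_lengths, sample2)
print(uncovered, partial)
```

Running this once per sample gives the missing references and the covered portion of the partially mapped ones without comparing samples base by base.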

↧

Which Of The Genes Are Enriched With Repeat Elements

November 14, 2013, 3:12 am

I would like to know which of my genes are enriched with repeat elements such as LINE/SINE/ERV. I have a BAM file and the repeats in BED format. As far as I know, BAM files contain the aligned data for each short read sequence from the FASTQ file. I am trying to figure out the best approach to find which genes (+- 1000 bp) have more repeat elements. I am thinking about two approaches but am not sure which one is best:

a) Should I convert the BAM file into a BED file and then use bedtools merge, so that I can overlap the result with the repeats file using bedtools window with the -c, -l and -r options? Then I would know how many of the repeats overlap or lie near the short reads, and I could count this number for each gene. For example,

chr   start  end gene number_of_repeats
chr1 100  200  gene1 70
chr1 190  240  gene1 40
chr1 250  400  gene1 100
chr2 500  600  gene2 150

If I sort and merge them, I will get

chr1 100  240  gene1 90
chr1 250  400  gene1 100
chr2 500  600  gene2 150

So gene1 will have 190 (90 + 100) repeats and gene2 will have 150. Or b) should I count the number of repeats for each short sequence without any merging? ...

↧

Random shuffling of features leaving gene models intact

May 26, 2014, 7:02 am

I am looking for a tool that can randomly shuffle GFF features into intergenic regions while leaving the gene models 'intact', so that at least all features of a gene are placed on the same contig and related features are placed inside the interval of their parent region. bedtools shuffle doesn't seem to do that. I am trying:

shuffleBed -i genes.gff3 -excl genes.gff3 -g chromsizes.txt -f 0

This command distributes sub-features to different contigs and leads to invalid gene models. If I add -chrom, features are placed on the same contig, but not all features can be placed at all, and the resulting gene models are still not valid. Does anyone have some R code for this use case?

↧

GTF2/GFF3 "feature" types and expression analysis

April 16, 2014, 3:00 pm

Hi, I aligned a few samples using STAR to the genome provided in the Illumina iGenomes UCSC hg19 bundle (here), and I used the provided gene feature (gtf2) file as is. My aim now is to calculate gene and isoform expression levels at the same time using bedtools multicov. Using the gtf2 file produces a file containing read counts per exon. I wish to compute gene and isoform read counts too, so I converted the gtf2 file to a gff3 file using the gtf2gff3 script from SO/GAL (here).

My first question is: is it OK if the alignment is performed with the gtf2 file but reads are counted using the gff3 file, keeping in mind that the gff3 file was converted from the gtf2 file?

My second question follows. I have read both these resources (here and here) but do not understand the differences between:

exon vs CDS
transcript vs mRNA

I know that with the process I described, it is possible to retrieve gene read count by selecting only the lines where feature=gene from the bedtools multicov output. What must I do for isoforms? I am confused by the semantics. Thanks ahead of time and let me know if my post was not clear enough. ...

↧

Bedtools on Cygwin problem.

June 11, 2014, 8:26 am

Hi, I'm trying to install the latest release of bedtools via Cygwin, but there's a weird error during the build process. I know this isn't the best solution, but I do not have any other choice. Does anyone know how to fix this?

NijbroekK@UTWKS11498 /cygdrive/g/Stage_Enschede/methods/methods_Bedtoolsnew
$ make clean
 * Cleaning-up BamTools API
 * Cleaning up.

NijbroekK@UTWKS11498 /cygdrive/g/Stage_Enschede/methods/methods_Bedtoolsnew
$ make
Building BEDTools:
=========================================================
DETECTED_VERSION = v2.20.1
CURRENT_VERSION  = v2.20.1
 * Creating BamTools API
- Building in src/utils/bedFile
  * compiling bedFile.cpp
bedFile.cpp:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
 /*****************************************************************************
 ^
- Building in src/utils/BinTree
  * compiling BinTree.cpp
BinTree.cpp:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
 #include "BinTree.h"
 ^
In file included from ../../utils//FileRecordTools/FileReaders/BufferedStreamMgr.h:16:0,
                 from ../../utils//FileRecordTools/FileRecordMgr.h:19,
                 from ../../utils//FileRecordTools/FileRecordMergeMgr.h:11,
                 from ../../utils//Contexts/ContextBase.h:23,
                 from ../../utils//Contexts/ContextIntersect.h:11,
                 from BinTree.h:20,
                 from BinTree.cpp:1:
../.. ...

↧

Getting The Average Coverage From The Coverage Counts At Each Depth.

May 20, 2013, 7:52 am

Hi, I have read quite a few posts here about coverage already, but I still have a few questions. I have a BAM file and I'm trying to find its coverage (typically something like 30X). So I decided to use genomeCoverageBed for my analysis, and I used the following command:

genomeCoverageBed -ibam file.bam -g ~/refs/human_g1k_v37.fasta > coverage.txt

As many are aware, the output file looks something like this:

genome    0     26849578    100286070    0.26773
genome    1     30938928    100286070    0.308507
genome    2     21764479    100286070    0.217024
genome    3     11775917    100286070    0.117423
genome    4     5346208     100286070    0.0533096
genome    5     2135366     100286070    0.0212927
genome    6     785983      100286070    0.00783741
genome    7     281282      100286070    0.0028048
genome    8     106971      100286070    0.00106666
genome    9     47419       100286070    0.000472837
genome    10    27403       100286070    0.000273248

To find the coverage, I multiplied col2 (depth) with col3 (number of bases in genome with that depth) and then summed the entire column. Then, I divided it by genome length to get the coverage. In this case, col2 * col3 is:

And the sum is: 150098740. Since the genome length is 1002860 ...
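The calculation described above (depth times number of bases at that depth, summed, then divided by the genome length) can be sketched as follows. This is only an illustration using the rows quoted in the post, which stop at depth 10; the function name `mean_depth` is an assumption.

```python
# Histogram rows as quoted in the post: label, depth, bases at depth, genome length, fraction.
HISTOGRAM = """\
genome 0 26849578 100286070 0.26773
genome 1 30938928 100286070 0.308507
genome 2 21764479 100286070 0.217024
genome 3 11775917 100286070 0.117423
genome 4 5346208 100286070 0.0533096
genome 5 2135366 100286070 0.0212927
genome 6 785983 100286070 0.00783741
genome 7 281282 100286070 0.0028048
genome 8 106971 100286070 0.00106666
genome 9 47419 100286070 0.000472837
genome 10 27403 100286070 0.000273248
"""


def mean_depth(histogram_text, label="genome"):
    """Return (sum of depth * bases, mean depth) for the rows carrying the given label."""
    weighted_sum = 0
    genome_length = None
    for line in histogram_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != label:
            continue
        depth, n_bases, length = int(fields[1]), int(fields[2]), int(fields[3])
        weighted_sum += depth * n_bases
        genome_length = length
    if not genome_length:
        raise ValueError("no rows found for label %r" % label)
    return weighted_sum, weighted_sum / genome_length


total, mean = mean_depth(HISTOGRAM)
print(total, mean)
```

With these rows the weighted sum reproduces the 150098740 quoted in the post, and dividing by the genome length in column 4 gives a mean depth of roughly 1.5X.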

↧

bedtools: extracting no coverage regions

April 26, 2014, 10:32 am

Hello,

I am not sure if this has been answered before as I looked and couldn't find a simple answer.

I have a BAM file, and all I want is to annotate all regions with 0 coverage in BED format. Is that possible?

Thank you,

Adrian
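One possible route, sketched under the assumption that the coverage has first been written as a bedGraph that also lists zero-depth intervals (for example with the `-bga` option of genomeCoverageBed): keep only the intervals whose depth column is 0 and emit them as BED. The function name and the toy lines are illustrative.

```python
def zero_coverage_regions(bedgraph_lines):
    """Return BED3 lines for bedGraph intervals whose depth (column 4) is zero."""
    regions = []
    for line in bedgraph_lines:
        fields = line.split()
        # Skip blank lines and track/browser lines.
        if len(fields) < 4 or fields[0] in ("track", "browser"):
            continue
        if float(fields[3]) == 0:
            regions.append("\t".join(fields[:3]))
    return regions


# Toy bedGraph (hypothetical values) with zero-depth intervals included.
example = ["chr1\t0\t100\t0", "chr1\t100\t250\t3", "chr1\t250\t400\t0"]
zero_bed = zero_coverage_regions(example)
print("\n".join(zero_bed))
```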

↧

Counting Features In A Bed File

November 22, 2012, 4:02 am

I have a file in the following BED format

Chr1 1022071 1022105  +
Chr1 1022071 1022105  +
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -

I am trying to get the counts of each feature represented in this file.

mergeBed -i R5_chr.bed -n -s -d 0 > Output/R5_chr_counts.bed

I am interested in the counts of the features and I do not want to merge features by any number of base pairs. Then the output should be as follows

Chr1 1022071 1022105 2 +
Chr1 1022072 1022106 4 -

Any suggestions on how to achieve this using bedtools or in bash or awk? Thanks in advance!
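A minimal sketch of counting identical features without any merging, assuming the strand is the last field on each line as in the example above. The function names are illustrative, and this plain-Python approach is just one alternative to the bash or awk route asked about.

```python
from collections import Counter


def count_features(bed_lines):
    """Count exact duplicates of (chrom, start, end, strand)."""
    counts = Counter()
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        # Strand is taken from the last field, matching the layout in the post.
        key = (fields[0], int(fields[1]), int(fields[2]), fields[-1])
        counts[key] += 1
    return counts


def format_counts(counts):
    """Render counts as 'chrom start end count strand', sorted by position."""
    return [
        f"{chrom} {start} {end} {n} {strand}"
        for (chrom, start, end, strand), n in sorted(counts.items())
    ]


bed = [
    "Chr1 1022071 1022105  +",
    "Chr1 1022071 1022105  +",
    "Chr1 1022072 1022106  -",
    "Chr1 1022072 1022106  -",
    "Chr1 1022072 1022106  -",
    "Chr1 1022072 1022106  -",
]
lines = format_counts(count_features(bed))
print("\n".join(lines))
```

Because the key includes the strand, features at the same coordinates on opposite strands are counted separately.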

↧

How To Count Genes In Genomic Regions Using A Gtf/Gff3 And A Bed File Of Regions

February 27, 2013, 11:13 am

I'd like to count the number of unique genes in a gff file falling within a list of genomic regions. With bedtools I can count the number of regions within the gff which is almost what I want, but not quite.

bedtools intersect -a regions.bed -b my.gff -c

UPDATE:

I should have made my question a bit more specific. I have a modified Ensembl-style gtf file (not a gff) that has unique transcript IDs. This means that simply selecting unique fields in the 9th column of the gtf file actually counts transcript IDs.

To circumvent this problem I first truncated the gtf file:

sed -e 's/;.*//' my.gff > delete.me.gtf

Then I ran the bedtools map command:

bedtools map -a regions.bed -b delete.me.gtf -c 9 -o count_distinct > counts.genes_in_windows.bed

I almost forgot to delete the intermediate file:

rm delete.me.gtf

There is probably a way to make this a oneliner, without the intermediate file, but I have a dissertation to write!
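The same idea (count distinct gene IDs rather than transcript IDs per region) can be sketched without an intermediate file. This is a naive illustration that assumes the attribute column contains a `gene_id "..."` entry and that the GTF lines fit in a list; the function name and the toy records are hypothetical.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def genes_per_region(regions, gtf_lines):
    """For each (chrom, start, end) BED region, count distinct gene_ids that overlap it.

    gtf_lines must be a list (it is scanned once per region).
    """
    result = []
    for chrom, start, end in regions:
        genes = set()
        for line in gtf_lines:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9 or fields[0] != chrom:
                continue
            # GTF is 1-based inclusive; convert to 0-based half-open like BED.
            g_start, g_end = int(fields[3]) - 1, int(fields[4])
            match = GENE_ID.search(fields[8])
            if match and g_start < end and g_end > start:
                genes.add(match.group(1))
        result.append((chrom, start, end, len(genes)))
    return result


# Toy records (hypothetical): two transcripts of geneA, one of geneB, one of geneC.
gtf = [
    'chr1\tsrc\texon\t101\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "tA1";',
    'chr1\tsrc\texon\t151\t300\t.\t+\t.\tgene_id "geneA"; transcript_id "tA2";',
    'chr1\tsrc\texon\t501\t600\t.\t-\t.\tgene_id "geneB"; transcript_id "tB1";',
    'chr1\tsrc\texon\t5001\t5100\t.\t+\t.\tgene_id "geneC"; transcript_id "tC1";',
]
regions = [("chr1", 0, 1000), ("chr1", 4000, 4500)]
gene_counts = genes_per_region(regions, gtf)
print(gene_counts)
```

The nested loop is quadratic, so this only illustrates the counting logic; for large files the bedtools map command above is the practical choice.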

↧

Extract coverage per feature from a bam and bed to a file

August 24, 2014, 11:07 pm

Hi, a simple task... or it should be. I need to extract the average coverage per feature in a BAM file. I have a GenBank file and a BED file for the reference the BAM was mapped to. If I map with e.g. Geneious, I can see good, variable coverage over the reference genome. I have tried GATK (could not get it to run) and bedtools (genomecov and coverage). coverage will give me an output file, but all the features have zero coverage.

Here's the top of the .bed file:

track name="Example E.coli"
o26chr.gb 189 255 thrL gene 0 +
o26chr.gb 189 255 thrL CDS 0 +
o26chr.gb 336 2799 thrA gene 0 +
o26chr.gb 336 2799 thrA CDS 0 +
o26chr.gb 2800 3733 thrB gene 0 +
o26chr.gb 2800 3733 thrB CDS 0 +
o26chr.gb 3733 5020 thrC gene 0 +
o26chr.gb 3733 5020 thrC CDS 0 +
o26chr.gb 5233 5530 yaaX gene 0 +

Here's the top of the output from bedtools coverage -ibam file.bam -b file.bed:

o26chr.gb 1047122 1048841 poxB gene 0 - 0 0 1719 0.0000000
o26chr.gb 1047122 1048841 poxB CDS 0 - 0 0 1719 0.0000000
o26chr.gb 2096828 2097287 gene 0 + 0 0 459 0.0000000
o26chr.gb 3144900 3148635 yfaL gene 0 - 0 0 3735 0.0000000
o26chr.gb 3144900 3148635 yfaL CDS 0 - 0 0 3735 0.0000000
o26chr.gb 4194149 4194368 tdcR gene 0 + 0 0 219 0.00 ...

↧

Profile Coverage Of Rnaseq Samples?

February 14, 2013, 3:51 pm

Hi all,

I have a quick question:

How can I visualize aligned paired-end reads from RNAseq datasets in UCSC browser?

I have already mapped the reads and assembled the transcripts with Tophat/Cufflinks, but I'm not sure how to proceed to visualize the mappings.

After sorting the BAM files and fixing the mate pairs, I tried to compute the coverage using the following commands:

genomeCoverageBed -bg -split -ibam F.T0.rep2-accepted_hits-fS.bam -g ~/conversion_util/chrom.hg19.sizes > F.T0.rep2-accepted_hits-fS.bg
bedGraphToBigWig F.T0.rep2-accepted_hits-fS.bg ~/conversion_util/chrom.hg19.sizes F.T0.rep2-accepted_hits-fS.bw

But I was not able to visualize the mappings properly. Here I paste a screenshot of how it looks:

Do you know where the mistake is?

Thanks!

↧

Renaming SNPs or SNP matching

September 30, 2014, 8:22 am

This should be easy to do by now, but... we have SNP data from an Illumina exome array given to us in PLINK format. The BIM file looks like this:

1       exm2253575      0       881627  G       A
1       exm269  0       881918  A       G
1       exm340  0       888659  T       C
1       exm348  0       889238  A       G
1       exm2264981      0       894573  G       A
1       exm773  0       909238  G       C
1       exm782  0       909309  C       T
1       exm912  0       949608  A       G
1       exm991  0       977028  T       G
1       exm1024 0       978762  A       G

And I have all of the SNPs in dbSNP 138 downloaded as a large VCF file:

#CHROM POS ID REF ALT QUAL FILTER INFO
1 10019 rs376643643 TA T . . RS=376643643;RSPOS=10020;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000200;WGT=1;VC=DIV;R5;OTHERKG
1 10054 rs373328635 CAA C,CA . . RS=373328635;RSPOS=10055;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000210;WGT=1;VC=DIV;R5;OTHERKG;NOC
1 10109 rs376007522 A T . . RS=376007522;RSPOS=10109;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000100;WGT=1;VC=SNV;R5;OTHERKG
1 10139 rs368469931 A T . . RS=368469931;RSPOS=10139;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000100;WGT=1;VC=SNV;R5;OTHERKG
1 10144 rs144773400 TA T . . RS=1447734 ...
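A rough sketch of position-based renaming, assuming both files use the same genome build and chromosome naming. It matches on chromosome and position only and ignores alleles, so indels and multi-allelic sites would need extra checks. The function names are illustrative, and the VCF record used in the example is a made-up placeholder, not a real dbSNP entry.

```python
def rsid_lookup(vcf_lines):
    """Map (chrom, pos) to the ID column of each VCF record (first record wins)."""
    lookup = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, rsid = line.split()[:3]
        lookup.setdefault((chrom, int(pos)), rsid)
    return lookup


def rename_bim(bim_lines, lookup):
    """Replace the variant ID in each BIM line when its position is in the lookup."""
    renamed = []
    for line in bim_lines:
        chrom, snp_id, cm, pos, a1, a2 = line.split()
        new_id = lookup.get((chrom, int(pos)), snp_id)
        renamed.append("\t".join([chrom, new_id, cm, pos, a1, a2]))
    return renamed


# Hypothetical VCF record (placeholder ID) at the position of the first BIM entry.
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t881627\trsPLACEHOLDER\tG\tA\t.\t.\t.",
]
bim = ["1 exm2253575 0 881627 G A", "1 exm269 0 881918 A G"]
renamed = rename_bim(bim, rsid_lookup(vcf))
print("\n".join(renamed))
```

Entries with no match keep their original exm identifier, which makes unmatched SNPs easy to spot afterwards.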

↧

Getting All Reads That Align To A Region In Compact Bed Format Using Bedtools?

January 16, 2013, 2:49 pm

I'm trying to find all the reads (by name) from a BAM file that align to various regions in a bed file. Right now I can do this with bedtools using intersectBed:

intersectBed -abam reads.bam -wo -f 1 -b regions.bed -bed

From this one can parse all the read ids that land in every interval in regions.bed, but it's not very compact. Is there a way to get bedtools to natively transform this into a more compact format, e.g.

chr1 x y .... read_id1,read_id2,read_id3

where chr1 x y is a given interval in regions.bed and the comma separated read_id1,... is the list of read ids from reads.bam that fall in that interval. In this compact format, the output BED file would have at most as many entries as there are regions in regions.bed, whereas with the -wo option it can be even larger than the number of reads in reads.bam. Thanks.
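As a post-processing sketch, the -wo output can be collapsed to one line per region by grouping on the region columns. The column positions are passed in as parameters because they depend on how the reads and regions are written; the layout used in the example (6 read columns, then 3 region columns, then the overlap) is an assumption for illustration, as are the function name and toy lines.

```python
def collapse_reads(wo_lines, name_col, region_cols):
    """Group read names by region; column indexes are 0-based."""
    grouped = {}
    for line in wo_lines:
        fields = line.rstrip("\n").split("\t")
        key = tuple(fields[i] for i in region_cols)
        # dicts keep insertion order, so regions appear in first-seen order.
        grouped.setdefault(key, []).append(fields[name_col])
    return ["\t".join(key) + "\t" + ",".join(names) for key, names in grouped.items()]


# Toy -wo style lines (hypothetical layout: read BED6, region BED3, overlap length).
wo = [
    "chr1\t100\t150\tread_id1\t60\t+\tchr1\t50\t500\t50",
    "chr1\t120\t170\tread_id2\t60\t-\tchr1\t50\t500\t50",
    "chr1\t900\t950\tread_id3\t60\t+\tchr1\t800\t1000\t50",
]
compact = collapse_reads(wo, name_col=3, region_cols=(6, 7, 8))
print("\n".join(compact))
```

Regions with no overlapping reads do not appear in the result, since they never show up in the overlap output in the first place.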

↧

bedtools intersect - something wrong with chromosome numbers >= 10?

May 31, 2014, 2:28 am

Hi!

I have an alignment (.bam) of reads to the mm9 genome. I sorted it with samtools sort so that I can later use the -sorted option with bedtools. I also created a .bed file with regions of interest, in which I want to count the number of reads that mapped to them. I tried this: I converted the .bam to .bed with bedtools bamtobed and then intersected them, counting the number of hits (bedtools intersect -a regions_of_interest.bed -b alignment_sorted.bed -c -sorted > Neg2H_counts.bedgraph). The problem is that it looks fine for all chromosomes with numbers from 0 to 9 (and X), but all counts for all regions of interest on chromosomes with higher numbers (chr10, chr11, etc.) are 0. There is no biological reason for that; in fact, the highest signal should be on chr11. What could be wrong here? I am fairly new to all these tools.

UPDATE
I tried to do the same intersection with bedmap and the result is identical... So there probably is something wrong with my files - what could it be?
I also tried sorting the alignment-derived .bed file in the same way as I did the file with the regions of interest, and it doesn't help.
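One thing that may be worth checking, offered only as a possible cause: algorithms that rely on sorted input expect both files to list chromosomes in the same order, and a plain text sort puts chr10 before chr2 while a coordinate-sorted BAM usually follows its header order. The sketch below compares the order in which chromosomes first appear in two BED-like inputs; the function names and toy lines are illustrative.

```python
def chrom_order(bed_lines):
    """Return chromosomes in order of first appearance."""
    order = []
    for line in bed_lines:
        fields = line.split()
        if not fields or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        if fields[0] not in order:
            order.append(fields[0])
    return order


def same_chrom_order(a_lines, b_lines):
    """True if the chromosomes shared by both inputs appear in the same relative order."""
    a, b = chrom_order(a_lines), chrom_order(b_lines)
    shared = set(a) & set(b)
    return [c for c in a if c in shared] == [c for c in b if c in shared]


# Toy inputs (hypothetical): regions sorted as text, reads in karyotypic order.
regions = ["chr1\t10\t20", "chr10\t10\t20", "chr11\t5\t9", "chr2\t10\t20"]
reads = ["chr1\t0\t50", "chr2\t0\t50", "chr10\t0\t50", "chr11\t0\t50"]
consistent = same_chrom_order(regions, reads)
print(consistent)
```

If the check reports a mismatch, re-sorting both files with the same command, or dropping the -sorted option as a test, would show whether ordering is the culprit.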

↧

Extract Only Paired-End Reads That Map A Specific Interval

August 31, 2012, 1:23 am

Hi,

Is it possible to extract paired-end reads that map to a specific interval (from a BAM file)? I tried with intersectBed:

intersectBed -abam align.bam -b interval.gff3 -wa > result.bam

here's the result :

[image not preserved]

But I only want reads that map to the feature in bold blue (one of the paired reads is enough). For example, I don't want the reads that map either side of this feature (red arrow).

Is it possible with intersectBed or another program?

Thanks,

↧

Snps Comparison

March 27, 2012, 4:04 am

Hello,

I would like to compare SNPs from different methods:

number of SNPs
SNP positions (positions where method A has SNPs but B does not and vice versa, and positions where both have SNPs)

I would like to get an output file which contains all of the above information, and I would also like to see the differences visualized, i.e. a tool where I could load the two files which contain the SNPs and the alignment.

In which format do the SNPs have to be stored, and which tools have to be used in order to make a comparison possible?

Thank you in advance.

↧

How To Install Bedtools In A User Directory

June 25, 2013, 7:55 pm

I am trying to install Bedtools in a user directory, however I looked at the manual for its makefile, and there is no such argument like "--prefix" for me to change. Is there a way to install all Bedtools in a directory that I specify? Thanks!

↧

To Group Items In Bed Files

January 20, 2012, 5:50 pm

For example, we now have a bed file:

chr1 23455 45678
chr1 23446 45663
chr1 23449 45669
chr1 30000 31000

Is there any way to group the first three lines while leaving the last line alone? I know bedtools has the mergeBed function, which merges overlapping spans, but that will include the last line.

This may sound a pure computational question; but I'm just curious if we have available tools already to tackle such questions

thx
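One way to express "nearly identical" is a reciprocal overlap threshold: two intervals belong together only if the overlap covers most of both. The sketch below groups greedily against the first member of each group; the 0.9 threshold and the function names are assumptions chosen for illustration.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval covered by their overlap (0.0 if none)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def group_similar(intervals, min_fraction=0.9):
    """Greedily add each interval to the first group whose first member it matches."""
    groups = []
    for interval in intervals:
        for group in groups:
            if reciprocal_overlap(group[0], interval) >= min_fraction:
                group.append(interval)
                break
        else:
            groups.append([interval])
    return groups


intervals = [
    ("chr1", 23455, 45678),
    ("chr1", 23446, 45663),
    ("chr1", 23449, 45669),
    ("chr1", 30000, 31000),
]
groups = group_similar(intervals)
print(groups)
```

The last interval lies inside the others but covers only a small fraction of them, so it ends up in its own group while the first three stay together.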

↧

Creating Bed File For Lncrna Using Gencode Gtf File

May 12, 2013, 9:29 am

Hi all,

I want to get the bed file of lncRNA based on GENCODE GTF file

I downloaded the file "gencode.v16.long_noncoding_RNAs.gtf.gz" and extracted the chr, start and end info from it. Then I used mergeBed to merge the overlapping lncRNAs. Is that correct? I know we can merge exon genomic positions using this kind of method.

For lncRNA, though, I am not so sure. Is there any place that already offers this kind of BED file?

Actually, we should get 22444 long non-coding RNA loci transcripts, but I only get 11817 genomic regions after the merging process.

If anyone knows the answer, could you give me some help?

↧

Converting Bam To Bedgraph For Viewing On Ucsc?

February 21, 2013, 2:30 pm

I'm trying to go from a BAM file to a representation viewable in UCSC, ideally bedGraph. I am trying to use Bedtools's genomeCoverage like this:

genomeCoverageBed -ibam accepted_hits.sorted.bam -bg -trackline -split -g ... > mytrack.bedGraph

I'm not sure what the -g argument is supposed to be or how to generate it. The documentation does not explicitly say what it is supposed to be, though it gives an example where it is some sort of BED file. I am simply looking for a bedGraph or other UCSC-friendly compact representation that will allow me to visualize read densities from the BAM in UCSC.

EDIT: When I generate a bedGraph and put it in UCSC, I get tracks that look like this:

[image not preserved]

not a histogram. How can I make it a histogram? How can I generate the genome file for use with genomeCoverageBed? Also, is this the best way to get a UCSC-viewable file with bedtools? To clarify, I want to visualize the BAM as a histogram. I'm not sure whether this is possible with bedGraph. Thank you.
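On the genome file question: bedtools genome files are plain tab-delimited text with one chromosome name and its length per line. A minimal sketch for building one from SAM header text (such as the output of `samtools view -H`) is shown below; the function name and the header snippet are illustrative assumptions.

```python
def genome_file_lines(sam_header_text):
    """Build 'name<TAB>length' lines from the @SQ records of a SAM header."""
    lines = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each @SQ field looks like TAG:value, e.g. SN:chr1 and LN:249250621.
        tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in tags and "LN" in tags:
            lines.append(f"{tags['SN']}\t{int(tags['LN'])}")
    return lines


# Illustrative header snippet; real headers list every reference sequence.
header = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
    "@SQ\tSN:chr2\tLN:243199373\n"
    "@PG\tID:TopHat"
)
genome_lines = genome_file_lines(header)
print("\n".join(genome_lines))
```

Writing these lines to a text file gives something that can be passed as the -g argument, and taking the names from the BAM header keeps them consistent with the alignments.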

↧