How Can I Merge Intervals ?

August 7, 2012, 10:18 am

Hello everybody, I would be grateful if you could help me fix my problem. I have a table like this:

Chromosome   start   end    info1    info2
chr01        1       100    15       35
chr01        150     300    15       39
chr01        299     750    16       39

I would like to merge the intervals that overlap (lines 2 and 3) and those that are close to each other (lines 1 and 2), and in addition perform some operations based on the other columns. For example, I would like to merge the three lines above into one interval (chr01 1-750), sum the info1 column (15+15+16), and finally take the mean of the info2 column, (35+39+39)/3. The output I'd like would look like this:

chr01 1-750  46  37.66

I know that BEDTools (and the Galaxy tool too) can merge intervals, but it accepts only BED format containing just 3 columns: chr, start and end.
Thanks in advance for your help!
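
The merge described in this question can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `merge_intervals` and the `max_gap` parameter are made up here, and the value 50 is chosen only because it is the distance between lines 1 and 2 of the example table. The post does not say how close two intervals must be to count as "close".

```python
from statistics import mean

def merge_intervals(rows, max_gap=50):
    """Merge intervals on the same chromosome that overlap or lie within max_gap bases.

    rows: iterable of (chrom, start, end, info1, info2).
    Returns a list of (chrom, start, end, sum of info1, mean of info2).
    max_gap is an assumed threshold, not something stated in the post.
    """
    merged = []
    for chrom, start, end, info1, info2 in sorted(rows):
        last = merged[-1] if merged else None
        if last and last["chrom"] == chrom and start - last["end"] <= max_gap:
            # Extend the current cluster and remember the extra column values
            last["end"] = max(last["end"], end)
            last["info1"].append(info1)
            last["info2"].append(info2)
        else:
            merged.append({"chrom": chrom, "start": start, "end": end,
                           "info1": [info1], "info2": [info2]})
    return [(m["chrom"], m["start"], m["end"], sum(m["info1"]), mean(m["info2"]))
            for m in merged]

# The three rows from the question
table = [
    ("chr01", 1, 100, 15, 35),
    ("chr01", 150, 300, 15, 39),
    ("chr01", 299, 750, 16, 39),
]
for chrom, start, end, total, avg in merge_intervals(table):
    print(f"{chrom} {start}-{end}  {total}  {avg:.2f}")
```

With a smaller `max_gap` (say 10), line 1 would stay separate and only the overlapping lines 2 and 3 would be merged.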

↧

Split A Bam File Into Several Files Containing All The Alignments For X Number Of Reads.

May 2, 2013, 10:37 am

Hi everyone! I am struggling with annotating a very big .bam file that was mapped using TopHat. The run had a large number of reads: ~200M. The problem is that when I try to annotate each read using a GFF file (with BEDTools intersectBed), the BED file that is produced is huge: over 1.7TB! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT dept increased the $TMPDIR local disk space to 1.5TB so I could run everything on $TMPDIR, but it is still not enough.

What I think I should do is split this .bam file into several files, maybe 15, so that each set of reads gets annotated separately on a different node. That way, I would not run out of disk space. And when all the files are annotated, I can run groupBy on each, and then simply sum the number of reads that each feature in the GFF got across all the files.

However, there is a slight complication: after the annotation using intersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and divides each read by the number of times it mapped. I.e., if a read mapped to 2 regions, each instance of the read is worth 1/2, such that it only contributes 1/2 a read to each of the features it mapped to. Because of this, I need to have all the alignments from the .bam file that belong to each read contained in one single file. That is to say, I ...
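
The two ideas in this post (keeping all alignments of a read in the same chunk, and weighting each hit by 1/number of hits) can be sketched as follows. This is a minimal illustration, not the poster's script: the function names, the use of a CRC32 hash of the read name, and the `(read_name, feature)` pair layout are all assumptions made for the example.

```python
import zlib
from collections import Counter, defaultdict

def chunk_for_read(read_name, n_chunks=15):
    """Map a read name to a chunk index; the same name always lands in the same chunk."""
    return zlib.crc32(read_name.encode()) % n_chunks

def fractional_counts(hits):
    """hits: iterable of (read_name, feature) pairs, one per annotated alignment.

    Each read contributes 1/k to every feature it hit, where k is its number of hits.
    """
    hits = list(hits)
    per_read = Counter(read for read, _ in hits)
    totals = defaultdict(float)
    for read, feature in hits:
        totals[feature] += 1.0 / per_read[read]
    return dict(totals)

# Toy example: r1 hits two features, r2 hits one
example_hits = [("r1", "geneA"), ("r1", "geneB"), ("r2", "geneA")]
print(fractional_counts(example_hits))
```

Because the chunk index depends only on the read name, every alignment of a given read ends up in the same chunk, so the per-chunk fractional counts can be summed afterwards.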

↧

To Calculate The Exact Total Number Of Mapped Reads In Exome Regions

December 3, 2013, 7:08 am

Dear All, I have some questions. I want to do some quality control analysis on my exome data, which are mapped to the reference genome. I have the input BAM file for a sample, containing the reads that mapped to the reference genome (hg19.fa). There are about 80 million mapped reads for this sample. Now I want to calculate how many of these 80 million mapped reads fall in the exome region. For this I need to supply the exome baits BED file (probe/covered.bed) provided by the company. We used the Agilent SureSelectV4 here. So is there a one-line command that, using these three pieces of information (input.bam, hg19.fa and exome_baits.bed), calculates the total number of reads mapped to the exonic regions? In different posts I see a lot of tools being mentioned. I tried to use CalculateHsMetrics from Picard, but it needs the BED file with a header, so it is of no use for now. Then I used the GATK walker DepthOfCoverage, but there we usually get the mean number of times a base is read (for me it is 73.9), and the %_of_bases_reads above 15 times is about 70%, which is also good quality. We also get how many loci have been read more than once, which gives a histogram of cumulative read coverage at each locus. But if I just want to calculate the number of mapped reads that fall in the exome region using the input BAM file, ...
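
The counting itself can be illustrated with a small Python sketch. It assumes the reads have already been converted to (chrom, start, end) tuples in BED-style 0-based, half-open coordinates, and it counts a read as on-target if it overlaps any bait region by at least one base. Both the function names and that definition of "on target" are assumptions for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

def build_index(targets):
    """Merge target intervals per chromosome into sorted, non-overlapping lists."""
    by_chrom = defaultdict(list)
    for chrom, start, end in targets:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index

def count_on_target(reads, index):
    """Count reads that overlap at least one target by one base or more."""
    n = 0
    for chrom, start, end in reads:
        if chrom not in index:
            continue
        starts, ends = index[chrom]
        # Last merged target that starts before the read ends
        i = bisect_right(starts, end - 1) - 1
        if i >= 0 and ends[i] > start:
            n += 1
    return n

# Made-up toy data
toy_targets = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
toy_reads = [("chr1", 90, 101), ("chr1", 300, 350), ("chr1", 250, 260),
             ("chr2", 50, 80), ("chr3", 1, 10), ("chr1", 0, 100)]
print(count_on_target(toy_reads, build_index(toy_targets)))
```

Merging the targets first means each read is counted once even when it touches several overlapping baits.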

↧

Genomecoveragebed - Bedtool For Reporting Per Base Genome Coverage

February 15, 2012, 1:56 pm

Hi everyone, I need some help with the genomeCoverageBed tool. When used for finding per-base genome coverage, this tool takes the option -d. I am actually interested in finding read counts for each base within a particular intron of a gene. Let me explain in more detail to make myself clear. I used IGV to see how my alignments look and what the coverage of each base within a particular intron is. When I move my cursor in IGV to the area exactly above the base I am interested in within the coverage track, it gives me these details:

Total Count:6
A:0
C:0
G:6
T:0
N:0

Now this total count is basically the read count for the base G within that intron. This count says that 6 reads have actually covered this base position (and hence the base). Now when I use this code snippet, which finds per-base genome coverage, genomeCoverageBed -i 2-B3-1b-D303A_sorted.bed -g pombe.genome -d, it gives me around 21 as the depth for that base (i.e. G in my example). Looking closely in IGV, I figured out that this 21 is basically 21 = 6 + 15, where 6 is the number of reads that actually cover this base position and 15 is the number of reads that do not cover that base at that position; since the genomeCoverageBed tool calculates depth of feature coverage, it also includes all those reads which skip that particular base. I would provide you with an image to make it more clear. I would like to know how can I ...
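
The difference between the two counts can be reproduced with a toy model. The sketch below is an illustration only: the coordinates are made up, and each alignment is represented as a list of aligned blocks, so a spliced read has two blocks with a gap between them. Counting whole spans gives the larger number; counting only the aligned blocks gives the smaller one. The bedtools documentation describes a -split option for genomecov that treats split BAM or BED12 entries as separate intervals; whether it applies here depends on how the BED file was produced, which the post does not say.

```python
def depth_at(position, alignments, count_gaps):
    """Depth at a 0-based position.

    alignments: list of block lists; each block is (start, end), half-open,
    and the blocks of one alignment are sorted by start.
    count_gaps=True treats each alignment as one span from its first block
    start to its last block end (gaps included); False counts only the blocks.
    """
    depth = 0
    for blocks in alignments:
        if count_gaps:
            covered = blocks[0][0] <= position < blocks[-1][1]
        else:
            covered = any(start <= position < end for start, end in blocks)
        depth += covered
    return depth

# Toy data: 6 reads aligned across position 1000, 15 spliced reads jumping over it
contiguous = [[(980, 1020)]] * 6
spliced = [[(900, 950), (1100, 1150)]] * 15
reads = contiguous + spliced

print(depth_at(1000, reads, count_gaps=True))   # spans, gaps included
print(depth_at(1000, reads, count_gaps=False))  # aligned blocks only
```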

↧

Counting Features In A Bed File

November 22, 2012, 4:02 am

I have a file in the following BED format

Chr1 1022071 1022105  +
Chr1 1022071 1022105  +
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -

I am trying to get the counts of each feature represented in this file.

mergeBed -i R5_chr.bed -n -s -d 0 > Output/R5_chr_counts.bed

I am interested in the counts of the features and I do not want to merge features by any number of base pairs. Then the output should be as follows

Chr1 1022071 1022105 2 +
Chr1 1022072 1022106 4 -

Any suggestions on how to achieve this using bedtools or in bash or awk? Thanks in advance!
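
Since the goal is to count identical records without merging anything, one option outside bedtools is a plain tally. The sketch below assumes the four-column layout shown in the question (chromosome, start, end, strand as the last field); the function name is made up for the example.

```python
from collections import Counter

def count_features(lines):
    """Count identical (chrom, start, end, strand) records, keeping first-seen order."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[-1]
        counts[(chrom, start, end, strand)] += 1
    return [(c, s, e, n, st) for (c, s, e, st), n in counts.items()]

# The records from the question
bed_lines = [
    "Chr1 1022071 1022105  +",
    "Chr1 1022071 1022105  +",
    "Chr1 1022072 1022106  -",
    "Chr1 1022072 1022106  -",
    "Chr1 1022072 1022106  -",
    "Chr1 1022072 1022106  -",
]
for chrom, start, end, n, strand in count_features(bed_lines):
    print(chrom, start, end, n, strand)
```

Records are only pooled when chromosome, start, end and strand all match exactly, so no two distinct features are ever merged.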

↧

Comparative Snp Analysis

January 7, 2013, 7:33 pm

Hello, I am trying to compare the degree of A-to-G editing in a near-to-isogenic pair of cell lines. I have two biological replicates and have mapped with Bowtie and BWA, followed by a samtools mpileup | VarScan analysis. After this, I used bedtools intersect to extract variants that are not annotated in dbSNP but are in Alu repeats. Here is where I have some doubts, mainly two questions:

QUESTION 1:

In the vcf file (VarScan output),

#CHROM  POS     ID      REF     ALT     QUAL    FILTER    INFO    FORMAT  Sample1    Sample2
   chrM    73      .           G       A       PASS     DP=238  GT:GQ:DP           1/1:71:121  1/1:69:117

What exactly is the meaning of

FORMAT   Sample1    Sample2
GT:GQ:DP 1/1:71:121  1/1:69:117
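
In the VCF format, the FORMAT column lists colon-separated keys and each sample column lists the matching values in the same order. In the VCF specification GT is the genotype, GQ the genotype quality and DP the read depth for that sample. A minimal sketch of pairing them up (the helper name `parse_sample` is made up here):

```python
def parse_sample(format_field, sample_field):
    """Pair the colon-separated FORMAT keys with one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

fmt = "GT:GQ:DP"
sample1 = parse_sample(fmt, "1/1:71:121")
sample2 = parse_sample(fmt, "1/1:69:117")
print(sample1)
print(sample2)
```

Read this way, Sample1 has genotype 1/1 (both alleles are the ALT allele), genotype quality 71 and depth 121.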

QUESTION 2:

I have a higher number of editing sites "called" in sample 1 than in sample 2 in the 1st biological replicate (about 16% difference). However, this difference is reversed in the 2nd biological replicate. What is the proper way of comparing the degree of RNA editing in two different samples? Is there a quantitative procedure? I have naively compared them with bedtools intersect, using or omitting option -v. Is this the correct way to go about it?

Many thanks. G.

↧

macs and bedtools

July 4, 2014, 2:07 pm

Hello

I have MACS2 output and am now looking for peaks that are located in introns. I have a BED file with introns from UCSC for my species. Which peak file should I use for the bedtools intersection: peak summits (.bed) or narrow peaks (.bed), both from the MACS2 output?

↧

What Is The Fastest Method To Determine The Number Of Positions In A Bam File With >N Coverage?

May 21, 2013, 10:16 am

I have two very large BAM files (high depth, human, whole genome). I have a seemingly simple question. I want to know how many positions in each are covered by at least N reads (say 20). For now I am not concerned about requiring a minimum mapping quality for each alignment or a minimum read quality for the reads involved.

Things I have considered:

samtools mpileup (then piped to awk to assess the minimum depth requirement, then piped to wc -l). This seemed slow...
samtools depth (storing the output to disk so that I can assess coverage at different cutoffs later). Even if I divide the genome into ~133 evenly sized pieces, this seems very slow...
bedtools coverage?
bedtools genomecov?
bedtools multicov?
bamtools coverage?

Any idea which of these might be fastest for this question? Something else I haven't thought of? I can use parallel processes to ensure that the performance bottleneck is disk access but want that access to be as efficient as possible. It seems that some of these tools are doing more than I need for this particular task...
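
Whichever tool produces the per-base depths, the thresholding step can be done in a single streaming pass, so the depth table never has to be stored on disk. The sketch below assumes tab-separated lines of chromosome, position and depth, which is the three-column layout samtools depth writes for a single BAM; the function name is made up, and this says nothing about which upstream tool is fastest.

```python
def count_positions(depth_lines, thresholds=(20,)):
    """Count positions meeting each minimum depth in one pass.

    depth_lines: iterable of 'chrom<TAB>pos<TAB>depth' lines.
    Returns {threshold: number of positions with depth >= threshold}.
    """
    counts = {t: 0 for t in thresholds}
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        depth = int(fields[2])
        for t in thresholds:
            if depth >= t:
                counts[t] += 1
    return counts

# Made-up toy input
toy_depths = ["chr1\t1\t25\n", "chr1\t2\t19\n", "chr1\t3\t20\n"]
print(count_positions(toy_depths, thresholds=(10, 20)))
```

Passing several thresholds at once covers the "different cutoffs later" case without rereading the input; in a pipeline, `depth_lines` could simply be `sys.stdin`.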

↧

Getting Number Of Reads In Intervals With Bedtools

December 14, 2012, 3:29 pm

What is the correct way to get the total number of reads strictly contained in each interval in a GFF from a BAM file while enforcing strandedness? What I am looking for is very close to this intersectBed feature:

-c    For each entry in A, report the number of overlaps with B.
    - Reports 0 for A entries that have no overlap with B.
    - Overlaps restricted by -f and -r.

Except that I'd like the number of overlaps in A for each entry in B (i.e. the other way around). If I do:

intersectBed -abam mybam.bam -b mygff.gff -s -f 1 -wb

Then my understanding is that this will report the entry in B for each overlap with A. But I'd like each entry in B to be outputted exactly once, with the number of reads from A that are contained strictly within it. I'm not sure how to enforce strict containment here.

Is coverageBed the solution to this? Or multicov? I'm not sure how to enforce strict containment using coverageBed - it's not clear to me if that's the default from the docs. Thanks.
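
To make the desired output concrete, here is a brute-force Python sketch of "one line per feature, with the number of same-strand reads strictly contained in it". It is only a specification of the result, not a bedtools recipe: the tuple layout and function name are assumptions, and the quadratic loop would be far too slow for a real BAM file.

```python
def count_contained(features, reads):
    """For every feature, count same-strand reads lying entirely inside it.

    features and reads: lists of (chrom, start, end, strand), half-open coordinates.
    Returns one count per feature, in input order, including zeros.
    """
    counts = []
    for f_chrom, f_start, f_end, f_strand in features:
        n = sum(
            1
            for r_chrom, r_start, r_end, r_strand in reads
            if r_chrom == f_chrom and r_strand == f_strand
            and f_start <= r_start and r_end <= f_end
        )
        counts.append(n)
    return counts

# Made-up toy data
toy_features = [("chr1", 100, 200, "+"), ("chr1", 500, 600, "-")]
toy_reads = [
    ("chr1", 110, 150, "+"),  # contained, same strand
    ("chr1", 110, 150, "-"),  # contained, wrong strand
    ("chr1", 190, 210, "+"),  # overlaps the feature edge, not contained
    ("chr1", 500, 600, "-"),  # exactly the feature, counts as contained
]
print(count_contained(toy_features, toy_reads))
```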

↧

Calculating Exome Coverage

April 3, 2014, 2:00 am

*// Edit to make the post more clear (mapping done via Bowtie2). My problem is that counting exome coverage via coverageBed gives different results than via genomeCoverageBed. So I'm not sure if I'm doing something wrong, or which of the 2 methods is correct.

1) My first step is to build a .bed file of my Illumina paired-end reads, keeping only the positions that fall in targeted exon regions. I'm doing that via intersectBed -a [data.bed] -b [illuminaexonregions.bed].

2) My next step is to calculate the coverage of my new datafile via coverageBed -a [newdata.bed] -b [illuminaexonregions.bed]. I calculated some statistics:

Number of exons 214126 with a total length of 45326818

Number of matched nucleotides 10993449.0

Nucleotides/Length*100 24.253740909 % Coverage.

3) The next step was to calculate the coverage of my new datafile via genomeCoverageBed -i [newdata.bed] -g [genome.txt] -d | awk '$3>0 {print $1"\t"$2"\t"$3}'. I calculated some statistics:

Number of exons 214126 with a total length of 45326818

Number of matched nucleotides 10576907.0

Nucleotides/Length*100 23.3347661863 % Coverage.

Somehow there's a difference in matched nucleotides, which I can't explain. What am I doing wrong?

↧

Remove Intronic Regions in .BAM

May 14, 2014, 2:45 am

I have a .BAM file which contains discordantly and concordantly mapped mate-pairs. I used bedtools pairToBed to extract the mate-pairs in which both mates overlap targeted regions (Illumina target .bed file). Is it somehow possible to remove the parts of the mate-pairs that do not overlap? I couldn't find it in the bedtools manual... Can I just use intersectBed on each read for this?

Thanks!

↧

Intersectbed: Return Reads In Fraction In Input Files

September 27, 2012, 9:55 am

I have a question about intersectBed and multiple input files:

Is it possible to return reads which are present in, say, 8/10 input files, without splitting the reads into smaller intervals?
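
One way to read this question is as a tally of how many files each whole interval appears in. The sketch below takes that reading and makes a strong assumption that the post does not confirm: two reads count as "the same" only when their coordinates are identical. The function name and tuple layout are made up for the example.

```python
from collections import Counter

def present_in_at_least(files, min_files):
    """files: list of interval collections, one per input file.

    An interval is (chrom, start, end). Returns the intervals, unsplit, that
    occur in at least min_files of the inputs, matching on identical coordinates.
    """
    seen = Counter()
    for intervals in files:
        seen.update(set(intervals))  # count each file at most once per interval
    return sorted(iv for iv, n in seen.items() if n >= min_files)

# Made-up toy data: three "files"
toy_files = [
    [("chr1", 10, 50), ("chr1", 100, 150), ("chr1", 10, 50)],
    [("chr1", 10, 50), ("chr2", 5, 25)],
    [("chr1", 100, 150), ("chr1", 10, 50)],
]
print(present_in_at_least(toy_files, min_files=2))
```

Because intervals are kept as whole tuples, nothing is ever cut into sub-intervals; for 8 of 10 files, `min_files` would be 8.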

Thank you

↧

Extract Only Paired-End Reads That Map A Specific Interval

August 31, 2012, 1:23 am

Hi,

Is it possible to extract paired-end reads that map to a specific interval (from a BAM file)? I tried with intersectBed:

intersectBed -abam align.bam -b interval.gff3 -wa > result.bam

Here's the result:

[image not preserved]

But I only want reads that map to the feature in bold blue (one of the paired reads is enough). For example, I don't want the reads that map either side of this feature (red arrow).

Is it possible with intersectBed or another program?

Thanks,

↧

Creating Bed File For Lncrna Using Gencode Gtf File

May 12, 2013, 9:29 am

Hi all,

I want to get a BED file of lncRNAs based on the GENCODE GTF file.

I downloaded the file "gencode.v16.long_noncoding_RNAs.gtf.gz" and extracted the chr, start and end info from it. Then I used mergeBed to merge the overlapping lncRNAs. Is this correct? I know we can merge exon genomic positions using this kind of method.

For lncRNAs, though, I am not so sure. Is there any place that already offers this kind of BED file?

Actually, we should get 22444 long non-coding RNA loci transcripts; however, there are only 11817 genomic regions after the merging process.

If anyone knows the answer, could you give me some help?
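
As an alternative to extracting three columns and merging, the GTF records can be converted to BED one by one, keeping the gene identifier and strand so that overlapping loci stay distinct. The sketch below is a minimal illustration: the function name is made up, the sample lines are invented rather than taken from the GENCODE file, and merging overlapping records may be only part of why the region count drops.

```python
import re

def gtf_to_bed(lines, feature_type="gene"):
    """Convert GTF records of one feature type to BED6 tuples.

    GTF coordinates are 1-based and inclusive; BED is 0-based and half-open,
    so only the start is shifted by one.
    """
    bed = []
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        name = match.group(1) if match else "."
        bed.append((fields[0], int(fields[3]) - 1, int(fields[4]), name, ".", fields[6]))
    return bed

# Invented example records (identifiers are placeholders)
sample = [
    "##description: made-up example\n",
    'chr1\tHAVANA\tgene\t101\t500\t.\t+\t.\tgene_id "GENE_A"; gene_type "lincRNA";\n',
    'chr1\tHAVANA\ttranscript\t101\t500\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n',
    'chr1\tHAVANA\tgene\t400\t900\t.\t-\t.\tgene_id "GENE_B"; gene_type "antisense";\n',
]
for record in gtf_to_bed(sample):
    print("\t".join(str(x) for x in record))
```

In this toy input the two genes overlap on opposite strands; a coordinate-only merge would collapse them into one region, while the per-record conversion keeps two lines.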

↧

Problems Extracting Non-Snps From A Vcf File

January 16, 2013, 7:14 am

Hello,

In an SNP analysis, I am trying to extract those editing sites not found in the dbSNP VCF file. I have downloaded a couple of files (All SNPs and Common/Medical SNPs) from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF.

Following this, I have compared my VarScan *.vcf outputs with the SNP.vcf ones using 3 different approaches:

VarScan compare input.vcf SNP.vcf unique1 input-SNPvcf

bedtools intersect -v -a input.vcf -b SNP.vcf > input-SNP.vcf

bedops --not-element-of -1  input-sorted.bed SNP-sorted.bed > inputs-sorted-SNP.bed

In all 3 cases, the SNP-output is identical to the input.vcf/bed.

These command-lines however work when I use an alu.bed or a repeat-masker-bed.

Is it just that my analysis contains no known SNPs? I have discarded this possibility for obvious reasons.

Can somebody point to the reason for, or a solution to, this problem?

Thanks, G.

↧

Genomic Regions To Exclude Before Shuffling Intervals

November 20, 2013, 4:01 am

I want to do a permutation test: randomly reposition (shuffle) given genomic intervals and measure the intersection between the new coordinates and a specific genomic element.

Example:

Different sets of genes: protein coding, pseudogenes, ncRNA - intervals that I want to shuffle;
Genomic repeat L1 - coordinates are stable.
For every gene set shuffle intervals, intersect and measure the overlap with L1 (I am using bedtools shuffle - "reposition each feature in the input BED file on a random chromosome at a random position").

Question - Which genomic regions to exclude from the "genome" (bedtools shuffle -g option) before shuffling gene intervals?
I was going to exclude gaps in the assembly.
But what about:

All gene regions.
If I am shuffling pseudogene intervals should I exclude protein coding and ncRNA coordinates?
All non L1 Repeat masker coordinates.
As Alu, LTR and DNA transposons aren't L1, there won't be any intersection with them?

↧

To Group Items In Bed Files

January 20, 2012, 5:50 pm

For example, we now have a bed file:

chr1 23455 45678
chr1 23446 45663
chr1 23449 45669
chr1 30000 31000

Is there any way to group the first three lines while leaving the last line alone? I know BEDTools has a mergeBed function that merges overlapping spans, but that would include the last line.

This may sound like a purely computational question, but I'm just curious whether tools are already available to tackle such questions.
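
One possible reading of "group" here is: intervals whose start and end both lie close to each other belong together, while an interval nested deep inside them does not. A minimal greedy sketch under that assumption follows; the function name and the 50-base tolerance are made up for illustration.

```python
def group_similar(intervals, tolerance=50):
    """Greedy grouping: an interval joins the first group whose seed interval is on
    the same chromosome with start and end both within `tolerance` bases."""
    groups = []
    for chrom, start, end in intervals:
        for group in groups:
            g_chrom, g_start, g_end = group[0]
            if (chrom == g_chrom and abs(start - g_start) <= tolerance
                    and abs(end - g_end) <= tolerance):
                group.append((chrom, start, end))
                break
        else:
            # No existing group is close enough, so this interval seeds a new one
            groups.append([(chrom, start, end)])
    return groups

# The four records from the question
bed = [
    ("chr1", 23455, 45678),
    ("chr1", 23446, 45663),
    ("chr1", 23449, 45669),
    ("chr1", 30000, 31000),
]
for group in group_similar(bed):
    print(group)
```

Unlike a plain overlap merge, the last record starts its own group because its endpoints are far from those of the first three.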

thx

↧

Calculate reciprocal overlap for thousands of samples

July 15, 2014, 10:20 am

I have around 20k samples with BED files. How can I calculate reciprocal overlap for each segment? I want to find all segments with 50% reciprocal overlap or better.
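
For reference, the 50% reciprocal overlap test itself is small: the shared region must cover at least half of each of the two segments. The sketch below shows only that predicate, with a made-up function name and half-open (chrom, start, end) tuples as assumptions; it does not address how to scale the comparison to 20k samples. bedtools intersect offers the same criterion through its -f and -r options.

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """True if a and b share a chromosome and their overlap covers at least
    min_frac of a AND at least min_frac of b.

    a, b: (chrom, start, end) with half-open coordinates.
    """
    if a[0] != b[0]:
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return overlap >= min_frac * (a[2] - a[1]) and overlap >= min_frac * (b[2] - b[1])

# Made-up toy segments
print(reciprocal_overlap(("chr1", 0, 100), ("chr1", 50, 150)))   # half of each is shared
print(reciprocal_overlap(("chr1", 0, 100), ("chr1", 90, 1000)))  # small share of both
```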

↧