Bedtools Intersectbed

November 17, 2011, 10:15 am

Apologies if this is blatantly obvious!

I would like to compare coordinates in setA with those of setB. The output should have the same number of coordinates as setA and tell me how many nucleotides of each setA coordinate are overlapped by any coordinate in setB.

For example, a large coordinate in setA may be overlapped by two setB coordinates, but I want to know how many nucleotides of the setA coordinate are covered by both setB coordinates in total.

I know how to do this in Galaxy, as there is the handy 'Coverage' tool in 'Operate on Genomic Intervals'. However, I want to do this on the command line. I have been trying to get BEDTools to do this using 'intersectBed', but I can only seem to get just the overlapping setA coordinates (using -u), the nucleotide overlap for multiple setB coordinates on separate lines (using -wao), or a count of how many setB coordinates overlap each setA coordinate (using -c).

SetB coordinates are non-overlapping themselves, so I guess I could tally up the overlaps of those setB coordinates that overlap the same setA coordinate.
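
A minimal sketch of the tally described above, written in plain Python rather than with BEDTools. It assumes intervals are half-open (start, end) tuples on a single chromosome and, as stated in the post, that the setB intervals do not overlap each other. The function name `covered_bases` is made up for illustration.

```python
def covered_bases(set_a, set_b):
    """For each interval in set_a, sum the bases covered by intervals in set_b.

    Intervals are (start, end) tuples, half-open, all on one chromosome.
    Assumes set_b intervals do not overlap each other, so summing the
    pairwise overlaps never counts a base twice.
    """
    results = []
    for a_start, a_end in set_a:
        total = 0
        for b_start, b_end in set_b:
            overlap = min(a_end, b_end) - max(a_start, b_start)
            if overlap > 0:
                total += overlap
        results.append((a_start, a_end, total))
    return results


set_a = [(100, 200), (300, 400)]
set_b = [(90, 120), (150, 160), (390, 500)]
for row in covered_bases(set_a, set_b):
    print(*row, sep="\t")
```

The result keeps exactly one row per setA interval, including intervals with zero coverage. The nested loop is quadratic, so this is only meant to show the logic on small inputs.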

Can BEDTools do what I want, or is there another command-line way of doing it?

Thank you!

PS: I have also sent this to the BEDTools discussion, so apologies for any double postings!

↧

Intersectbed: Return Reads In Fraction In Input Files

September 27, 2012, 9:55 am

I have a question with respect to intersectBED and multiple input files:

Is it possible to return reads that are present in, say, 8 of 10 input files, without splitting the reads into smaller intervals?

Thank you

↧

Using Gnu Parallel For Bedtools

February 5, 2014, 4:36 am

I am trying to run GNU parallel on the bedtools multicov command, where the original command is

bedtools multicov -bams bam1 bam2 bam3.. -bed anon.bed  > Q1_Counst.bed

I would like to implement the above command using gnu parallel. But when I run the command below

parallel -j 25 "bedtools multicov -bams {1} -bed {2} > Q1_Counst.bed" ::: minus_1_common_sorted_q1.bam minus_2_common_sorted_q1.bam minus_3_common_sorted_q1.bam plus_1_common_sorted_q1.bam plus_2_common_sorted_q1.bam plus_3_common_sorted_q1.bam ::: '/genome/genes_exon_2.bed'

each BAM file is taken as a separate argument, so the processes that start look like

bedtools multicov -bams  bam1 -bed anon.bed  > Q1_Counst.bed
bedtools multicov -bams  bam2 -bed anon.bed  > Q1_Counst.bed
bedtools multicov -bams  bam3 -bed anon.bed  > Q1_Counst.bed

instead of one process that takes all the files as arguments. Hence Q1_Counst.bed is overwritten randomly. Could anyone help me get the exact command? My server has around 30 cores.
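
To make the difference concrete, here is a small Python sketch that only builds the two kinds of command line and does not run bedtools. The helper names and the per-BAM output naming scheme (`*_counts.bed`) are assumptions for illustration, not something from the original command.

```python
import shlex


def single_multicov_command(bams, bed):
    """One bedtools multicov call that receives every BAM file at once."""
    return ["bedtools", "multicov", "-bams", *bams, "-bed", bed]


def per_bam_jobs(bams, bed):
    """One call per BAM, each paired with its own (hypothetical) output file
    so that jobs running in parallel do not overwrite each other."""
    jobs = []
    for bam in bams:
        out = bam.rsplit(".", 1)[0] + "_counts.bed"
        jobs.append((["bedtools", "multicov", "-bams", bam, "-bed", bed], out))
    return jobs


bams = ["minus_1_common_sorted_q1.bam", "plus_1_common_sorted_q1.bam"]
bed = "/genome/genes_exon_2.bed"

print(shlex.join(single_multicov_command(bams, bed)))
for cmd, out in per_bam_jobs(bams, bed):
    print(shlex.join(cmd), ">", out)
```

With the per-BAM layout, each output file holds the counts for one BAM only, so the columns would still need to be combined afterwards to reproduce the single-call table.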

↧

Intersectbed/Coveragebed -Split Purify Exon?

September 15, 2012, 1:58 am

The all.reads.bam file contains mapped RNA-seq reads of the following kinds:

exon:exon junction
exon body
intron body
exon:intron junction

Q1: When calculating RPKM for a given RefSeq gene including reads at all positions, will the following command count only exon:exon junction reads and ignore all other reads?

coverageBED -abam all.reads.bam -b refseq.genes.BED12.bed -s -split >coverage.bed

I'm confused by the manual (page 62):

When dealing with RNA-seq reads, for example, one typically wants to only tabulate coverage for the portions of the reads that come from exons (and ignore the interstitial intron sequence). The -split command allows for such coverage to be performed.

If "-split" is set and an exon:exon read (for example, "30M3000N46M") exists in the -abam BAM file, the 3000N will NOT be wrongly intersected when running the intersectBED command. But what about the coverageBED command? I hope the 3000N will not be counted, which makes sense, and I also hope the intron body reads and other reads will NOT be ignored.

Q2: If one just wants to calculate an exon's RPKM, does it mean one should prepare a -b file that records all the exon information, and run it like this:

coverageBED -abam all.reads.bam -b ...
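
As an illustration of what splitting a spliced read means, the following Python sketch turns a CIGAR string such as the 30M3000N46M example into the reference blocks it covers. This shows the general idea only and is not the bedtools implementation. Treating a D operation as part of the surrounding block is an assumption of this sketch, and `split_blocks` is a made-up name.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_blocks(pos, cigar):
    """Return the reference blocks (0-based, half-open) covered by a read.

    M, =, X and D consume the reference and extend the current block;
    N (a skipped region, e.g. an intron) closes the block; I, S, H and P
    do not consume the reference and are ignored here.
    """
    blocks = []
    block_start = pos
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":
            ref += length
        elif op == "N":
            if ref > block_start:
                blocks.append((block_start, ref))
            ref += length
            block_start = ref
    if ref > block_start:
        blocks.append((block_start, ref))
    return blocks


print(split_blocks(1000, "30M3000N46M"))  # spliced read: two blocks
print(split_blocks(500, "76M"))           # unspliced read: one block
```

In this sketch the 3000 skipped bases never appear in any block, while a read without an N operation (for instance one lying entirely in an intron or exon body) is kept as a single block rather than dropped.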

↧

What Is The Fastest Method To Determine The Number Of Positions In A Bam File With >N Coverage?

May 21, 2013, 10:16 am

I have two very large BAM files (high depth, human, whole genome). I have a seemingly simple question. I want to know how many positions in each are covered by at least N reads (say 20). For now I am not concerned about requiring a minimum mapping quality for each alignment or a minimum read quality for the reads involved.

Things I have considered:

samtools mpileup (then piped to awk to assess the minimum depth requirement, then piped to wc -l). This seemed slow...
samtools depth (storing the output to disk so that I can assess coverage at different cutoffs later). Even if I divide the genome into ~133 evenly sized pieces, this seems very slow...
bedtools coverage?
bedtools genomecov?
bedtools multicov?
bamtools coverage?

Any idea which of these might be fastest for this question? Something else I haven't thought of? I can use parallel processes to ensure that the performance bottleneck is disk access but want that access to be as efficient as possible. It seems that some of these tools are doing more than I need for this particular task...
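
Whichever tool produces the per-position depths, the counting step itself can be done in one streaming pass without storing anything on disk. The Python sketch below assumes tab-separated lines of chromosome, position and depth (the three-column layout that samtools depth writes) and evaluates several cutoffs at once. It makes no claim about which upstream tool is fastest, and the function name is hypothetical.

```python
import io
from collections import Counter


def count_positions_at_depth(depth_lines, cutoffs):
    """Count positions whose depth is >= each cutoff in a single pass.

    depth_lines yields tab-separated lines of chrom, position, depth.
    """
    counts = Counter()
    for line in depth_lines:
        depth = int(line.rstrip("\n").split("\t")[2])
        for cutoff in cutoffs:
            if depth >= cutoff:
                counts[cutoff] += 1
    return {cutoff: counts[cutoff] for cutoff in cutoffs}


example = io.StringIO("chr1\t1\t5\nchr1\t2\t20\nchr1\t3\t37\n")
print(count_positions_at_depth(example, cutoffs=[10, 20, 30]))
```

Passing `sys.stdin` instead of the in-memory example would let the function sit at the end of a pipe, so different cutoffs can be assessed without rerunning the depth calculation for each one.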

↧

General Considerations For Genomic Overlaps?

March 26, 2014, 1:01 am

Hello, I was wondering about general considerations for performing overlaps of genomic regions and doing Monte Carlo-type statistics. Below I describe how I do it. Unfortunately I'm not fully confident that this is correct, so I'd appreciate any thoughts on this.

I have an experimental dataset (A) of 10 bp coordinates, with approx. 5,000 entries all across the genome. Then I have another experimental dataset (B) (ChIP-seq) of ~1,000 bp coordinates, with ~50,000 entries all across the genome. If I perform an overlap/intersection with BEDTools I get my overlap, e.g. 2,000 entries from A.

But I also want to find overlaps in the vicinity of the ChIP-seq peaks, so I extend the size of these peaks, e.g. by 1,000 bp on each side. There are still 50,000 entries, but the amount of the genome that is searched becomes larger, and some entries may now overlap each other. So I do the intersection of A and B again, and count entries in A only once. This gives me e.g. 3,000 entries from A.

For the simulations, I use random intervals that look like dataset B. E.g. I pick 50,000 coordinates of 1,000 bp randomly, intersect them with A, and do this 1,000 times. Then I get e.g. an average of 500 entries from A. For overlaps in the vicinity I calculate the total size of dataset B and generate random intervals of the same length and total size in bp as dataset B (size-matched sampling). I hope you can f ...
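
The procedure described above can be sketched in a few lines of Python. This is a simplified illustration under several assumptions that are not in the post: a single chromosome of length `genome_len`, each A entry reduced to a single position, random B intervals placed uniformly, and an add-one correction on the empirical p-value. All names are hypothetical.

```python
import bisect
import random


def count_a_hits(a_points, b_intervals):
    """Count A entries (single positions) that fall inside at least one
    B interval. Each A entry is counted at most once."""
    merged = []
    for start, end in sorted(b_intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [s for s, _ in merged]
    hits = 0
    for p in a_points:
        i = bisect.bisect_right(starts, p) - 1
        if i >= 0 and p < merged[i][1]:
            hits += 1
    return hits


def empirical_p(a_points, b_intervals, genome_len, n_sim=1000, seed=0):
    """Return (observed hits, fraction of simulations with >= observed hits)."""
    rng = random.Random(seed)
    observed = count_a_hits(a_points, b_intervals)
    lengths = [end - start for start, end in b_intervals]
    as_extreme = 0
    for _ in range(n_sim):
        shuffled = []
        for length in lengths:
            start = rng.randrange(0, genome_len - length + 1)
            shuffled.append((start, start + length))
        if count_a_hits(a_points, shuffled) >= observed:
            as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (as_extreme + 1) / (n_sim + 1)
```

Because the random intervals reuse the lengths of the real B intervals, extending the peaks by 1,000 bp on each side before calling `empirical_p` automatically gives a size-matched null for the vicinity analysis. Merging overlapping intervals before counting is what ensures each A entry is counted once.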

↧

GTF2/GFF3 "feature" types and expression analysis

April 16, 2014, 3:00 pm

Hi, I aligned a few samples using STAR to the genome provided in the Illumina iGenomes UCSC hg19 bundle (here), and I used the provided gene feature (gtf2) file as is. Now, my aim is to calculate the gene and isoform expression levels at the same time using bedtools multicov. Use of the gtf2 file produces a file containing read counts per exon. I wish to compute gene and isoform read counts too, so I converted the gtf2 file to a gff3 file using the gtf2gff3 script from SO/GAL (here).

My first question is: Is it OK if the alignment is performed with the gtf2 file but reads are counted using the gff3 file, keeping in mind that the gff3 file was converted from the gtf2 file?

My second question follows. I have read both these resources (here and here) but do not understand the differences between:

exon vs CDS
transcript vs mRNA

I know that with the process I described, it is possible to retrieve gene read count by selecting only the lines where feature=gene from the bedtools multicov output. What must I do for isoforms? I am confused by the semantics. Thanks ahead of time and let me know if my post was not clear enough. ...

↧

How To Find The Closest Distance From Bed Files Between Genes And Repeats That Are Upstream

January 7, 2014, 3:36 am

How can I use closestBed from bedtools to find the closest locations between two BED files? The important bit here is that I want them to be upstream and in the correct orientation.

When I use the -s option, it does not report anything (everything is -1).

Then I checked the -D a option. It returns some results, but I am not sure if they are right.

The other thing to mention is that my genes BED file (let's call it gene.bed) is organized as

chr1 123 234 +
chr1 456 789 -

rather than using a smaller end position to indicate the negative strand.

Whereas my repeats.bed file is organized as

chr1 239 456
chr3 456 987

Does bedtools get confused with this?

Which options should I use if I want to find the distance to the nearest repeat that is upstream and in the correct orientation?
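
To pin down what "upstream" means for each strand, here is a small Python sketch of the selection logic, independent of bedtools. It assumes half-open coordinates with start < end on both strands (as in the gene.bed layout above) and measures distance as the gap between the two features, which is this sketch's own convention. Since the repeats shown carry no strand column, the sketch ignores repeat orientation entirely.

```python
def closest_upstream(gene, repeats):
    """Return (repeat, distance) for the nearest repeat upstream of a gene.

    gene is (chrom, start, end, strand); repeats are (chrom, start, end).
    Upstream means lower coordinates for '+' genes and higher coordinates
    for '-' genes. Returns None if no repeat lies upstream.
    """
    chrom, start, end, strand = gene
    best = None
    for rep in repeats:
        r_chrom, r_start, r_end = rep
        if r_chrom != chrom:
            continue
        if strand == "+" and r_end <= start:
            distance = start - r_end
        elif strand == "-" and r_start >= end:
            distance = r_start - end
        else:
            continue
        if best is None or distance < best[1]:
            best = (rep, distance)
    return best


repeats = [("chr1", 10, 100), ("chr1", 239, 456), ("chr3", 456, 987)]
print(closest_upstream(("chr1", 123, 234, "+"), repeats))
print(closest_upstream(("chr1", 123, 234, "-"), repeats))
```

The same gene coordinates give different answers depending on the strand column, which is why the strand must be present in the gene file for any upstream query to be meaningful.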

↧

Tool: Bedtools: Analyzing Genomic Features

April 24, 2012, 10:54 am

All practicing bioinformaticians will face problems that require them to compare, query and select genomic features across an entire genome. As it happens, efficient interval representation and query is a surprisingly challenging problem that needs a specialized representation. The BEDTools suite contains a set of programs that support a broad range of interval analyses that involve selecting certain locations in the genome. The name reflects the original intent to process BED files, but the tools operate just as well on GFF formats. The tools are run from the command line and are available for UNIX-type systems: Linux, Mac OSX, and Cygwin (on Windows).

The link to the site is: http://code.google.com/p/bedtools/

With BEDTools one can answer questions such as:

how many reads map upstream/downstream of one or more locations in the genome?
how many reads cover a certain base in the genome?
which sections of the genome are not overlapping with target intervals?
what are the sequences specified by the coordinates?
...

The suite consists of multiple tools but for beginners the most important is ...

↧

Extract rows from BED file on the base of text content in one column

August 3, 2014, 11:02 pm

Hi,

I am a newbie with scripting so I can't find an easy solution to this question by myself and I'd like to ask for some help.

I have a long list of BED files, and for each file I want to scan them row by row and if the content of a given column contains some text I am looking for (say, for example, gene name "A") I want that full row to be copied into a new, separate bed file.

I'm looking forward to hearing your suggestions,

thanks in advance!
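
One possible approach, sketched in Python with only the standard library. It assumes tab-separated BED files, a 0-based column index, and a made-up output naming scheme (`.geneA.bed`). The demonstration writes a tiny example file into a temporary directory so that it can run on its own.

```python
import tempfile
from pathlib import Path


def extract_rows(bed_path, out_path, column, text):
    """Copy rows whose given column (0-based) contains `text` to out_path.

    Returns the number of rows written.
    """
    written = 0
    with open(bed_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > column and text in fields[column]:
                dst.write(line)
                written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    bed = Path(tmp) / "sample.bed"
    bed.write_text("chr1\t10\t20\tA\nchr1\t30\t40\tB\nchr2\t5\t9\tA\n")
    # The file list is built before any output is written.
    for path in sorted(Path(tmp).glob("*.bed")):
        out = path.with_name(path.stem + ".geneA.bed")
        print(path.name, extract_rows(path, out, column=3, text="A"))
```

Note that `text in fields[column]` is a substring match, so searching for "A" would also match a name such as "BRCA1". Replacing it with `fields[column] == text` gives an exact match instead.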

↧

Intersectbed Overlap

November 23, 2011, 9:20 am

Hi,

I have a question about intersectBed. Is it possible to extract only alignments like this:

chromosome ===============================================================
BED/BAM A               ==============              =================
BED FILE B               ============
RESULT                  ==============

But no alignments like this (even if the read overlaps 100% of the feature, I don't want to extract these reads):

chromosome ===============================================================
BED/BAM A    =========================              =================
BED FILE B               =============
RESULT

So, I only want to extract reads that have 90-95% of their sequence overlapping 90-95% of the feature.

Is that clear?
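
The criterion described here is usually called a reciprocal overlap: the shared region must cover a minimum fraction of the read and of the feature at the same time. The bedtools documentation describes the -f option combined with -r for this purpose. The Python sketch below only illustrates the criterion itself, with made-up coordinates that mimic the two diagrams and a hypothetical function name.

```python
def reciprocal_overlap(a, b, min_fraction=0.9):
    """True if the overlap covers at least min_fraction of a AND of b.

    a and b are (start, end) tuples, half-open, on the same chromosome.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return False
    return (overlap / (a[1] - a[0]) >= min_fraction
            and overlap / (b[1] - b[0]) >= min_fraction)


# Like the first diagram: read and feature are nearly the same size.
print(reciprocal_overlap((100, 200), (105, 200)))
# Like the second diagram: the read covers the whole feature but is much longer.
print(reciprocal_overlap((50, 200), (105, 200)))
```

Requiring both fractions is what rejects the second case: the feature is fully covered, but only about two thirds of the read lies inside it.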

Thanks,

↧

Bedtools on Cygwin problem.

June 11, 2014, 8:26 am

Hi, I'm trying to install the latest release of Bedtools via Cygwin, but there's a weird error during the build. I know this isn't the best solution, but I do not have another choice. Does anyone know how to fix this?

NijbroekK@UTWKS11498 /cygdrive/g/Stage_Enschede/methods/methods_Bedtoolsnew
$ make clean
 * Cleaning-up BamTools API
 * Cleaning up.

NijbroekK@UTWKS11498 /cygdrive/g/Stage_Enschede/methods/methods_Bedtoolsnew
$ make
Building BEDTools:
=========================================================
DETECTED_VERSION = v2.20.1
CURRENT_VERSION  = v2.20.1
 * Creating BamTools API
- Building in src/utils/bedFile
  * compiling bedFile.cpp
bedFile.cpp:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
 /*****************************************************************************
 ^
- Building in src/utils/BinTree
  * compiling BinTree.cpp
BinTree.cpp:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
 #include "BinTree.h"
 ^
In file included from ../../utils//FileRecordTools/FileReaders/BufferedStreamMgr.h:16:0,
                 from ../../utils//FileRecordTools/FileRecordMgr.h:19,
                 from ../../utils//FileRecordTools/FileRecordMergeMgr.h:11,
                 from ../../utils//Contexts/ContextBase.h:23,
                 from ../../utils//Contexts/ContextIntersect.h:11,
                 from BinTree.h:20,
                 from BinTree.cpp:1:
../.. ...

↧

Running BedTools on Linux Cluster: Permission Denied

August 3, 2014, 10:50 am

I have been having some problems running the BedTools binaries on a Linux cluster. I have the binaries in my own $HOME/bin directory, and when I try to run bedtools I get this error message

-bash: bedtools: Permission Denied

I followed the instructions here and still got the same error message.

Any clue what to do?

↧

How To Use Bedtools Window To Overlap Upstream For Positive Strand

November 21, 2013, 12:48 pm

Hi,

I am trying to use bedtools window. It is explained in the bedtools manual, but I am still a bit confused and thought a confirmation would be good. I have no biological background.

I have divided my BED file into two, based on the strand information (for example, posStrand.bed and negStrand.bed).

I would like to screen for overlaps of LINEs within 5000 bp upstream of the features in my posStrand.bed file.

In this case, shall I use the -l or the -r option of bedtools window?
Since all features are on the + strand, do I need to use the -sw option?

↧

Does Windowbed Extend Reads?

October 21, 2013, 10:08 am

I am using WindowBed, part of the BedTools suite, to align reads to a reference file, and I obtained a very interesting result. I am trying to rule out an analysis artifact that could be caused by extending the reads or by aligning read midpoints rather than 5' ends. It is my understanding that WindowBed aligns the 5' end of the read to the reference point, rather than mapping the read midpoint, or extending the 3' end of the read and mapping the midpoint. Am I correct in this assumption that the 5' end of the read is in fact what is being aligned?

Any help here would be appreciated. The BedTools manual, which is very good, doesn't seem to address this.

Thanks

↧

How Do You Get The Quality Score And Coverage For Every Single Position Of A Reference Assembly

January 31, 2012, 2:12 pm

Hi,

I am trying to extract the coverage and the average quality score for each position of a reference assembly in bam/sam format. I have managed to get the coverage using BEDtools

 genomeCoverageBed -ibam mybamfile.bam -g my_genome -d > my_coverage.txt

but am at a loss on how to get some measure of the quality of the base calls at each position. I was thinking that I could use the bcftools to get a variant call formatted file

samtools mpileup -uf ref.fa mybamfile.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

but this only provides the sites for which there are SNPs. Any advice greatly appreciated.
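
One way to get a quality measure at every position, not only at variant sites, is to average the base qualities in plain-text pileup output. The Python sketch below assumes the six-column text layout that samtools mpileup writes when no BCF/VCF output is requested (chromosome, position, reference base, depth, read bases, base qualities) and Phred+33 encoded qualities. The function name is hypothetical.

```python
def mean_base_quality(pileup_line, offset=33):
    """Return (chrom, pos, depth, mean quality) for one pileup text line.

    Assumes six tab-separated columns: chrom, pos, ref base, depth,
    read bases, base qualities (ASCII, Phred+offset). The mean is None
    when no base qualities are available.
    """
    fields = pileup_line.rstrip("\n").split("\t")
    chrom, pos, _ref, depth, _bases, quals = fields[:6]
    depth = int(depth)
    if depth == 0 or not quals or quals == "*":
        return chrom, int(pos), depth, None
    scores = [ord(ch) - offset for ch in quals]
    return chrom, int(pos), depth, sum(scores) / len(scores)


print(mean_base_quality("chr1\t101\tA\t3\t..,\tIII"))
```

Applied line by line to the pileup stream, this yields depth and mean base quality together, so the separate coverage file would not be needed for this purpose.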

Joseph

↧

Counting Number Of Bam Reads Directly Within Set Of Intervals With Bedtools

September 7, 2011, 1:04 am

How can I count the number of BAM reads falling directly within a set of intervals, given in GFF format? Note that I do not want reads that merely overlap the intervals, but ones that fall entirely within them.

I tried the following:

intersectBed -abam reads.bam -b exons.gff -wb -f 1

This has redundancies, so I pipe it into coverageBed as follows:

intersectBed -abam reads.bam -b exons.gff -wb -f 1 | coverageBed -abam stdin -b exons.gff

Is this correct? Thanks.
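
For reference, the distinction between "overlapping" and "falling directly within" can be written down explicitly. The Python sketch below counts, for each interval, the reads that lie entirely inside it. It works on plain (chrom, start, end) tuples with half-open coordinates rather than on BAM or GFF files, and the function name is made up.

```python
def count_contained(reads, intervals):
    """Count reads lying entirely within each interval.

    reads and intervals are (chrom, start, end) tuples, half-open.
    A read is counted once for every interval that fully contains it.
    """
    counts = {iv: 0 for iv in intervals}
    for r_chrom, r_start, r_end in reads:
        for iv in intervals:
            chrom, start, end = iv
            if r_chrom == chrom and start <= r_start and r_end <= end:
                counts[iv] += 1
    return counts


reads = [("chr1", 10, 20), ("chr1", 95, 105), ("chr1", 100, 150)]
exons = [("chr1", 0, 100), ("chr1", 100, 200)]
for exon, n in count_contained(reads, exons).items():
    print(*exon, n, sep="\t")
```

The second read spans the boundary between the two intervals, so it overlaps both but is contained in neither and is not counted.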

↧

compute normal-tumor coverage ratio from exome BAMs

July 2, 2014, 6:22 am

Could someone please suggest a quick way to compute the data ratio of uniquely mapped reads in the normal to uniquely mapped reads in the tumor, as required by VarScan in the command below? I have over 50 exome BAMs.

(normal_unique_mapped_reads/tumor_unique_mapped_reads).

java -jar VarScan.jar copynumber normal-tumor.mpileup output.basename -min-coverage 10 --data-ratio [data_ratio] --min-segment-size 20 --max-segment-size 100

↧

How To Use Bedtools To Extract Promoters From A Mouse Bed File

February 8, 2012, 12:36 pm

Hello, I would like to know how to use Bedtools to extract promoter sequences (as FASTAs) from the mouse genome (mm9) starting from a BED file.
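
The coordinate step can be sketched in Python as follows. The sketch assumes the BED file has a strand column, takes the transcription start site to be the start on the + strand and the end on the - strand, and uses an arbitrary 1,000 bp upstream window. None of these choices come from the post, and `promoter_interval` is a hypothetical name.

```python
def promoter_interval(chrom, start, end, strand, upstream=1000, downstream=0):
    """Return (chrom, start, end, strand) of the promoter for one feature.

    Input is 0-based, half-open BED coordinates. The TSS is taken to be
    `start` on '+' and `end` on '-'. The start is clamped at zero; the end
    is not clamped to the chromosome length in this sketch.
    """
    if strand == "+":
        p_start, p_end = start - upstream, start + downstream
    elif strand == "-":
        p_start, p_end = end - downstream, end + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return chrom, max(0, p_start), p_end, strand


print(*promoter_interval("chr1", 5000, 6000, "+"), sep="\t")
print(*promoter_interval("chr1", 5000, 6000, "-"), sep="\t")
```

The resulting intervals could then be written out as a BED file and passed, together with the mm9 FASTA, to a sequence extraction step such as bedtools getfasta.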

↧

Correlation Of Fpkm And Length Normalized Transcript Mapped Read Count

August 20, 2012, 10:36 am

Hello, in the process of estimating expression for a 16 human tissue dataset ("Human Body Map 2.0 GSE30611") I used different methods to estimate the expression of the genes. After mapping against hg19 genome version, I used the UCSC provided refseq annotation for hg19 to count mapped reads for ~40,000 human genes in two ways:

1. Counting with cufflinks outputs a Fragments Per Kilobase Per Million mapped fragments value (FPKM) for each transcript. The FPKM value basically accounts for library size and also the length of the transcript comprising all the annotated exons, plus some additional likelihood estimator to assign reads (see here).
2. Counting mapped reads with bedtools and dividing a transcript's mapped count by the sum of all the exon lengths. This gives a length-normalized expression estimate to compare between genes.

However, the correlation of (1.) and (2.) is always around ~0.65 between the same tissues (technically the same experiment). I would expect this correlation to be > 0.9. Below, I plotted (2.) against (1.) for all ~40,000 transcripts. It seems like normal length normalization is simply overestimating some expression. Can someone she ...
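
For comparison, the two quantities can be written side by side under the simple count-based definition of FPKM. This Python sketch deliberately ignores the likelihood-based read assignment that the post attributes to cufflinks. The function names and the example numbers are made up.

```python
def fpkm(count, transcript_length, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments,
    using the plain count-based formula."""
    return count * 1e9 / (transcript_length * total_mapped)


def length_normalized(count, transcript_length):
    """Mapped count divided by the summed exon length, as in method 2."""
    return count / transcript_length


count, length, total = 500, 2000, 20_000_000
print(fpkm(count, length, total))
print(length_normalized(count, length))
print(fpkm(count, length, total) / length_normalized(count, length))
```

Within one sample, the ratio of the two values is the constant 1e9 / total_mapped for every transcript. Under this simple definition, library-size scaling alone therefore cannot change the ranking of transcripts, so a disagreement between the two methods would have to come from elsewhere, for instance from how reads are assigned to transcripts.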

↧