Multi Thread Bedtools

December 20, 2011, 7:59 am

Hi,

Is there a multi-threaded version of bedtools, or is this feature in development?

Thanks,

↧

Question about number of reads within intervals

November 3, 2014, 8:49 am

Hi there,

This question is very basic, but I need to make sure I'm on the right track. I need to calculate the number of reads falling inside my BED intervals and the number of reads falling outside them. After reading this thread (https://www.biostars.org/p/11832/), I decided to try this command:

intersectBed -abam my_file.bam -b my_file.bed -wa -f 1 | coverageBed -abam stdin -b my_file.bed

I would like to know how this differs from using only the second part:

coverageBed -abam my_file.bam -b my_file.bed

The output is quite different for some hits.

First command output:

1 50331576 50331667 (..gene names..) 0 0 91 0.0000000
1 39845848 39846030 (..gene names..) 70 178 182 0.9780220

Second command output:

1 50331576 50331667 (..gene names..) 47 91 91 1.0000000
1 39845848 39846030 (..gene names..) 143 182 182 1.0000000

I think the first command counts only the reads falling strictly within an interval, while the second one also includes reads that partially cover the intervals. Is this true?

On the other hand, I would also like to get the number of reads falling outside the intervals. I could make a new BED file using bedtools complement, but would it be OK to use the -v option of bedtools intersect instead? Like this:

intersectBed -v -abam my_file.bam -b my_file.bed -wa -f 1 | c ...
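To make the distinction in the question concrete, here is a minimal Python sketch (not bedtools itself) that sorts reads into "strictly contained", "partially overlapping" and "outside". The tuples, coordinates and function names are made up for illustration, and half-open BED-style coordinates are assumed.

```python
# Hypothetical in-memory data; real input would come from a BAM and a BED file.

def overlaps(read, interval):
    """True if read and interval share at least one base on the same chromosome."""
    return read[0] == interval[0] and read[1] < interval[2] and interval[1] < read[2]

def contained(read, interval):
    """True if the read lies entirely inside the interval."""
    return read[0] == interval[0] and interval[1] <= read[1] and read[2] <= interval[2]

def classify(reads, intervals):
    """Count reads that are fully contained, only partially overlapping, or outside."""
    counts = {"contained": 0, "partial": 0, "outside": 0}
    for read in reads:
        if any(contained(read, iv) for iv in intervals):
            counts["contained"] += 1
        elif any(overlaps(read, iv) for iv in intervals):
            counts["partial"] += 1
        else:
            counts["outside"] += 1
    return counts

intervals = [("1", 100, 200)]
reads = [("1", 110, 160), ("1", 180, 230), ("1", 300, 350), ("2", 110, 160)]
result = classify(reads, intervals)
print(result)
```

In this toy model a read that straddles an interval boundary lands in "partial", which is the group that would explain a higher count when no containment filter is applied.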

↧

General Considerations For Genomic Overlaps?

March 26, 2014, 1:01 am

Hello,

I was wondering about general considerations for overlapping genomic regions and doing Monte Carlo-type statistics. Below I describe how I do it. Unfortunately, I'm not fully confident that this is correct, so I'd appreciate any thoughts on this.

E.g. I have an experimental dataset (A) of 10 bp coordinates, with approx. 5,000 entries all across the genome. Then I have another experimental dataset (B) (ChIP-seq) of ~1,000 bp coordinates, with ~50,000 entries all across the genome. If I perform an overlap/intersection with BEDTools I get my overlap, e.g. 2,000 entries from A.

But I also want to find overlaps in the vicinity of the ChIP-seq peaks, so I extend these peaks, e.g. by 1,000 bp on each side. There are still 50,000 entries, but the amount of the genome that is searched becomes larger, and some entries may now overlap. So I intersect A and B again, counting each entry in A only once. This gives me e.g. 3,000 entries from A.

For the simulations, I use random intervals that look like dataset B. E.g. I pick 50,000 random 1,000 bp coordinates, intersect them with A, and do this 1,000 times. Then I get e.g. an average of 500 entries from A. For overlaps in the vicinity, I calculate the total size of dataset B and generate random intervals of the same length and total size in bp as dataset B (size-matched sampling). I hope you can f ...
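The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are toy values, random intervals are placed on the same chromosome as the original B entry (the post does not say how chromosomes are chosen), and the empirical p-value with a +1 correction is one common convention, not something taken from the post.

```python
import random

def count_hits(a_intervals, b_intervals):
    """Number of A entries overlapping at least one B entry (each A counted once)."""
    hits = 0
    for chrom, start, end in a_intervals:
        if any(c == chrom and s < end and start < e for c, s, e in b_intervals):
            hits += 1
    return hits

def random_like(b_intervals, chrom_sizes, rng):
    """Re-place each B interval at a random start on its chromosome, keeping its length."""
    shuffled = []
    for chrom, start, end in b_intervals:
        length = end - start
        new_start = rng.randint(0, chrom_sizes[chrom] - length)
        shuffled.append((chrom, new_start, new_start + length))
    return shuffled

def empirical_p(a, b, chrom_sizes, n_iter=1000, seed=0):
    """Observed hit count, the simulated counts, and an empirical one-sided p-value."""
    rng = random.Random(seed)
    observed = count_hits(a, b)
    null = [count_hits(a, random_like(b, chrom_sizes, rng)) for _ in range(n_iter)]
    # The +1 in numerator and denominator avoids reporting p = 0.
    p = (1 + sum(x >= observed for x in null)) / (1 + n_iter)
    return observed, null, p

chrom_sizes = {"chr1": 10000}                        # made-up chromosome length
a = [("chr1", 100, 110), ("chr1", 5000, 5010)]       # toy dataset A
b = [("chr1", 50, 1050)]                             # toy dataset B
observed, null, p = empirical_p(a, b, chrom_sizes, n_iter=200)
print(observed, sum(null) / len(null), p)
```

For the "vicinity" analysis, the same functions could be run on the extended B intervals, so that the random sets match the extended lengths rather than the original ones.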

↧

Heatmap Of Read Coverage Around Tsss

August 20, 2013, 8:53 am

I am trying to plot a heatmap of read density around a feature of interest (TSSs), a figure very common in genomics papers, something like this (B): However, I am struggling a bit to get it to look "right". A bit of background: I have mapped ChIP-seq reads for pol2 and calculated the coverage, per nucleotide, using bedtools.

coverageBed -d -abam $bamFile -b $TSSs > $coverage.bed
# output:
chr1    67108226    67110226    uc001dct.3    16    +    1    10
chr1    67108226    67110226    uc001dct.3    16    +    2    10
chr1    67108226    67110226    uc001dct.3    16    +    3    10
chr1    67108226    67110226    uc001dct.3    16    +    4    10
chr1    67108226    67110226    uc001dct.3    16    +    5    8
chr1    67108226    67110226    uc001dct.3    16    +    6    8
chr1    67108226    67110226    uc001dct.3    16    +    7    8
chr1    67108226    67110226    uc001dct.3    16    +    8    8
chr1    67108226    67110226    uc001dct.3    16    +    9    8
chr1    67108226    67110226    uc001dct.3    16    +    10    8

Then, in R, the genomic position, in column 7, is converted to a position relative to the TSS, and the read counts are normalized to the library size. This is converted to a numeric matrix, with each row being a TSS and each column a relative nucleotide position. For plotting, the matrix is ordered by the number of reads per TSS and the values are log-transformed. This is the outcome: heatmap(cov.mlog, Rowv=NA, Colv= ...
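For readers who want to follow the matrix-building step outside R, here is a rough Python sketch of the same idea. It assumes the eight-column per-base output shown above; the library size, the pseudocount and the reversal of minus-strand rows are my own assumptions and are not stated in the post.

```python
import math
from collections import defaultdict

def coverage_matrix(lines, library_size, pseudocount=1.0):
    """Map each feature name to its log2 reads-per-million depth, one value per position."""
    by_feature = defaultdict(dict)
    for line in lines:
        fields = line.split()
        name, strand = fields[3], fields[5]
        position, depth = int(fields[6]), int(fields[7])
        by_feature[(name, strand)][position] = depth
    matrix = {}
    for (name, strand), depths in by_feature.items():
        row = [depths[p] for p in sorted(depths)]
        if strand == "-":
            row.reverse()  # assumption: orient every row 5' to 3' relative to the TSS
        # Scale to reads per million, then log2 with a pseudocount to avoid log(0).
        matrix[name] = [math.log2(d * 1e6 / library_size + pseudocount) for d in row]
    return matrix

def order_rows(matrix):
    """Feature names sorted from highest to lowest total signal."""
    return sorted(matrix, key=lambda name: sum(matrix[name]), reverse=True)

lines = [
    "chr1 67108226 67110226 uc001dct.3 16 + 1 10",
    "chr1 67108226 67110226 uc001dct.3 16 + 2 10",
    "chr1 67108226 67110226 uc001dct.3 16 + 5 8",
]
matrix = coverage_matrix(lines, library_size=1_000_000)  # hypothetical library size
row_order = order_rows(matrix)
print(row_order, matrix[row_order[0]])
```

Each row of the resulting matrix corresponds to one TSS window, which is the shape a heatmap function expects.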

↧

Discrepancy In Samtools Mpileup/Depth And Bedtools Genomecoveragebed Counts

March 27, 2013, 1:05 pm

I am getting different counts for the number of bases on reference covered by aligned reads using samtools depth/mpileup and BEDTools genomeCoverageBed commands. I am using samtools-0.1.19 and bedtools-2.17.0

samtools mpileup -ABQ0 -d10000000 -f ref.fas qry.bam > qry.mpileup
samtools depth -q0 -Q0 qry.bam > qry.depth

genomeCoverageBed -ibam qry.bam -g ref.genome -dz > qry.dz
wc -l qry.[dm]*
  1026779 qry.depth
  1027173 qry.dz
  1026779 qry.mpileup

Any ideas? Thanks

↧

How To Check Whole Genome With Bigwigsummary ?

March 30, 2012, 11:33 am

Hi,

I have a question about the bigWigSummary tool.

I have my start and end positions and my bigWig file, but I want to check the whole genome instead of going chromosome by chromosome. Is there any option to use this tool in that way?

I know that for each chromosome I have to use:

bigWigSummary -type=X bigwigfile chrN start end datapoints

I want to check from chr1 to chrX.

Thanks in advance.
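One way to avoid typing the per-chromosome call by hand is to generate it in a loop. The Python sketch below only builds the argument lists in the form quoted in the question; the file name, the chromosome sizes and the choice of "mean" as the statistic are placeholders, and nothing is executed.

```python
def summary_commands(bigwig, chrom_sizes, data_points=1, stat="mean"):
    """One bigWigSummary argument list per chromosome, spanning its full length."""
    return [
        ["bigWigSummary", f"-type={stat}", bigwig, chrom, "0", str(size), str(data_points)]
        for chrom, size in chrom_sizes.items()
    ]

# Made-up sizes; real values would come from a chromosome sizes file.
chrom_sizes = {"chr1": 1000, "chr2": 800, "chrX": 500}
commands = summary_commands("signal.bw", chrom_sizes)
for command in commands:
    print(" ".join(command))
```

Each list could then be passed to subprocess.run, and the per-chromosome values combined afterwards.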

↧

Bed File Of Mapq Sliding Window On A Bam File?

February 27, 2014, 2:01 am

There may already be a recipe for this, so I'm asking first before reinventing the wheel: I would like to create a BED file where the score is the average mapQ of the reads in the input.bam file. I think bedtools or bedops are the way to go:

http://bedtools.readthedocs.org/en/latest/content/tools/bamtobed.html
http://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/bam2bed.html

Other than simply running bamtobed/bam2bed, I would like to be able to define a sliding window size and step, say size=1000 and step=200. I would also like to generate the bam2bed information only from a list of regions in regions.bed. E.g., something like:

mapq_sliding_windows --bam input.bam --wsize 1000 --wstep 200 --regions regions.bed > mapq_sliding_windows.bed

EDITED: Thank you, Aaron, for your answer. I got it working, but it's slow for my 30x WGS BAMs:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome
bedtools makewindows -g hg19.genome -w 1000 -s 200 > hg19.windows.bed
bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam | grep -v chrM) -c 5 -o mean > ...
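To show what the window-and-average step computes, here is a small Python sketch of the same logic on toy data. It is not a replacement for the pipeline above and will not be faster on 30x WGS data; the read tuples and the clipping of the last windows at the chromosome end are assumptions made for the example.

```python
def make_windows(chrom, chrom_size, size=1000, step=200):
    """Sliding windows as half-open (chrom, start, end) tuples, clipped at the chromosome end."""
    return [(chrom, s, min(s + size, chrom_size)) for s in range(0, chrom_size, step)]

def mean_mapq(windows, reads):
    """Mean MAPQ of the reads overlapping each window; None where no read overlaps."""
    scored = []
    for chrom, w_start, w_end in windows:
        quals = [q for c, s, e, q in reads if c == chrom and s < w_end and w_start < e]
        scored.append((chrom, w_start, w_end, sum(quals) / len(quals) if quals else None))
    return scored

# Toy reads as (chrom, start, end, mapq); real values would come from bamtobed output.
reads = [("chr1", 100, 200, 60), ("chr1", 950, 1050, 20), ("chr1", 1900, 2000, 40)]
windows = make_windows("chr1", 2000)
scores = mean_mapq(windows, reads)
for row in scores:
    print(*row, sep="\t")
```

Restricting the windows to those that overlap regions.bed before scoring would be one way to cut down the work.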

↧

How To Get Annotation For Bed File From Another Bed File

November 23, 2012, 7:52 pm

Hello All,

I have a bed file (with Chr, Start, End, Name, Score and Strand)

Chr1 5678 5680 NA 7  +
Chr1 700  800  NA 8  -
Chr1 900  1200 NA 10 -

and would like to know how I can get the annotation for the name column from another BED file

Chr1 5500 6000 Gene1 x +
Chr1  500 1000 Gene2 x -

or any standard genome file format like gbk or .fna files, or for that matter another BED file? So my output file will be a BED file with Chr, Start, End, Name, Score and Strand.

Chr1 5678 5680 Gene1 7 +
Chr1 700  800  Gene2 8 -
Chr1 900  1200 Gene2 10 -

Is there an easy and standard way to do this?

Bedtools usually operates on the features themselves, but I am not sure whether annotation from one BED file can be transferred to the other based on overlapping features.

Thanks in advance!
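The desired output can be reproduced with a simple overlap lookup. The Python sketch below uses the example rows from the question and assumes that "annotation" means the name of the first gene that overlaps on the same strand; both the same-strand rule and the first-match rule are assumptions.

```python
def annotate(features, genes):
    """Replace each feature name with the name of the first overlapping same-strand gene."""
    annotated = []
    for chrom, start, end, name, score, strand in features:
        for g_chrom, g_start, g_end, g_name, _, g_strand in genes:
            same_place = g_chrom == chrom and g_start < end and start < g_end
            if same_place and g_strand == strand:
                name = g_name
                break
        annotated.append((chrom, start, end, name, score, strand))
    return annotated

features = [
    ("Chr1", 5678, 5680, "NA", 7, "+"),
    ("Chr1", 700, 800, "NA", 8, "-"),
    ("Chr1", 900, 1200, "NA", 10, "-"),
]
genes = [
    ("Chr1", 5500, 6000, "Gene1", "x", "+"),
    ("Chr1", 500, 1000, "Gene2", "x", "-"),
]
annotated = annotate(features, genes)
for row in annotated:
    print(*row)
```

Features with no overlapping gene keep their original name ("NA" here).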

↧

Tool: Bedtools: Analyzing Genomic Features

April 24, 2012, 10:54 am

All practicing bioinformaticians will face problems that require them to compare, query and select genomic features across an entire genome. As it happens, efficient interval representation and querying is a surprisingly challenging problem that needs a specialized representation. The BEDTools suite contains a set of programs that support a broad range of interval analyses that involve selecting certain locations in the genome. The name reflects the original intent to process BED files, but the tools operate just as well on GFF formats. The tools are run from the command line and are available for UNIX-type systems: Linux, Mac OSX, and Cygwin (on Windows). The link to the site is: http://code.google.com/p/bedtools/

With BEDTools one can answer questions such as:

how many reads map upstream/downstream of one or more locations in the genome?
how many reads cover a certain base in the genome?
which sections of the genome are not overlapping with target intervals?
what are the sequences specified by the coordinates?
...

The suite consists of multiple tools but for beginners the most important is ...

↧

Intersectbed Tool Generating Empty File

August 28, 2012, 10:13 am

I have used the Bedtools command intersectBed to check the overlap between two bed files. A is my INDEL file and B is my Reference file. But it is producing an empty output file. I thought the problem was that the file B is much larger than file A. But I tried changing the file order and it is still not creating any output.

Here is the reference B file (larger):

gff_seqname      0        1395    gene    0    +
gff_seqname      0        1395    exon    0    +
gff_seqname    1397    2498    gene    0    +
gff_seqname    1397    2498    exon    0    +
gff_seqname    2524    3619    gene    0    +

Here is my A file with just 51 INDELS:

NC_0077121_SODALIS_GLOSSINIDIUS_STR_MORSITANS_CHROMOSOME    174708    174713    -GCCGG:2/6
NC_0077121_SODALIS_GLOSSINIDIUS_STR_MORSITANS_CHROMOSOME    1078686    1078686    +A:105/112
NC_0077121_SODALIS_GLOSSINIDIUS_STR_MORSITANS_CHROMOSOME    1229123    1229125    -CT:800/870
NC_0077121_SODALIS_GLOSSINIDIUS_STR_MORSITANS_CHROMOSOME    1234830    1234830    +AT:134/134
NC_0077121_SODALIS_GLOSSINIDIUS_STR_MORSITANS_CHROMOSOME    1234833    1234834    -A:134/134

Here is my command:

intersectBed -a SOD_pal_BWA_GMM.PE.sorted.bam.sorted_cleaned_GMM.bam.sorted.hr.bam.raw.bed  -b sodalis_galaxy.bed  -wa -wb  >test13.bed
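One thing that stands out in the two excerpts is that the first column differs between the files ("gff_seqname" in B, a long NC_... name in A). Interval tools compare that column as a literal string, so this may be worth ruling out. It is only a guess at the cause, and the Python sketch below just checks whether two files share any sequence names, using one row from each excerpt.

```python
def chrom_names(bed_lines):
    """Set of first-column values, ignoring blank lines and comment lines."""
    return {
        line.split()[0]
        for line in bed_lines
        if line.strip() and not line.startswith("#")
    }

a_lines = ["NC_0077121_SODALIS_GLOSSINIDIUS_STR_MORSITANS_CHROMOSOME\t174708\t174713\t-GCCGG:2/6"]
b_lines = ["gff_seqname\t0\t1395\tgene\t0\t+"]
shared = chrom_names(a_lines) & chrom_names(b_lines)
print("shared sequence names:", sorted(shared))
```

An empty result means no interval in A can be matched to any interval in B, regardless of file order.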

↧

Intersectbed - Overlap Analysis Using Vcf And Bed Files

July 12, 2012, 2:04 pm

I am trying to do an overlap analysis between 200 Danish exomes (VCF courtesy: Zev) and 10 different gene regions.
I would like to know what percentage overlaps between my regions of interest (in mygenes.bed, a total of 36 lines representing the regions) and a VCF file (Danish_*.flt.vcf.gz).

I tried this command and got a result:

intersectBed -a Danish1.flt.vcf.gz -b mygenes.bed > D1result.txt

Danish1.flt.vcf.gz: here mygenes.bed: here D1overlapped.txt: here

My assumption is that the output should have no more lines than the total number of lines in the mygenes.bed file. But in many instances I am getting more than 36 lines of output. Maybe I am missing something important, or maybe another tool or option in bedtools can do this task more efficiently. Please let me know your thoughts.
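A toy model may help with the line-count expectation. If the output is driven by the A file (the VCF here), each variant that falls in a region produces a line, so the count can exceed the number of regions. The Python sketch below illustrates this with made-up coordinates; it is a simplification and not the tool's actual implementation.

```python
def intersect_pairs(a_records, b_records):
    """Every (A, B) pair that overlaps on the same chromosome."""
    return [
        (a, b)
        for a in a_records
        for b in b_records
        if a[0] == b[0] and a[1] < b[2] and b[1] < a[2]
    ]

variants = [("chr1", 99, 100), ("chr1", 149, 150), ("chr1", 499, 500)]  # toy A file
regions = [("chr1", 50, 200)]                                           # toy B file
pairs = intersect_pairs(variants, regions)
regions_hit = {b for _, b in pairs}
fraction_of_variants = len({a for a, _ in pairs}) / len(variants)
print(len(pairs), len(regions_hit), fraction_of_variants)
```

Here one region yields two output pairs, while the number of distinct regions hit is still one, which is the quantity bounded by the 36 lines of mygenes.bed.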

↧

Extract Only Paired-End Reads That Map A Specific Interval

August 31, 2012, 1:23 am

Hi,

Is it possible to extract paired-end reads that map to a specific interval (from a BAM file)? I tried with intersectBed:

intersectBed -abam align.bam -b interval.gff3 -wa > result.bam

Here's the result:

[image not preserved]

But I only want reads that map to the feature in bold blue (one of the paired reads is enough). For example, I don't want the reads that map either side of this feature (red arrow).

Is this possible with intersectBed or another program?

Thanks,

↧

Convert Bamtobed Score

February 28, 2012, 6:00 am

Hey,

Just a short question: is there a way to set the score in the BED file to "1" and not to the alignment score? The -tag and -ed arguments only use BAM alignment tags... :/

Cheers!
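If no built-in option turns up, the score column can be overwritten after the conversion. Here is a minimal Python sketch that sets column 5 of tab-separated BED lines to a constant; the example line is invented, and lines with fewer than five columns are passed through unchanged.

```python
def set_score(bed_lines, score="1"):
    """Return BED lines with the fifth column (score) replaced by a constant."""
    rewritten = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5:
            fields[4] = score
        rewritten.append("\t".join(fields))
    return rewritten

bed_lines = ["chr1\t100\t150\tread_1/1\t37\t+\n", "chr1\t200\t250\n"]  # made-up rows
fixed = set_score(bed_lines)
print("\n".join(fixed))
```

The same function could read from standard input so that it sits at the end of a pipe.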

↧

Help With Exception When Using Bedtools Coveragebed With Paired Alignment. [Resolved]

January 3, 2014, 5:32 am

I use bwa mem to align paired reads to a few hundred microbial contigs; then I sort the alignment and try to get coverage using bedtools genomecov -ibam alignments.paired.sorted.bam -bg >ranges.txt, which fails with an exception:

*** glibc detected *** bedtools: double free or corruption (out): 0x0000000001c5f270 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3d7b2750c6]
bedtools[0x45ab43]
bedtools[0x45b146]
bedtools[0x45c163]
bedtools[0x45e2ed]
bedtools[0x434c4b]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3d7b21ecdd]

If I run the same command on an unpaired alignment, everything is OK. So I am really not sure where my mistake is... maybe bedtools can't digest the paired alignment?

-- edit: works with the latest versions of these tools. Here are the ones that failed:

$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.0-r313
Contact: Heng Li <lh3@sanger.ac.uk>

$ bedtools -version
bedtools v2.16.1

↧

How To Install Bedtools In A User Directory

June 25, 2013, 7:55 pm

I am trying to install Bedtools in a user directory. However, I looked at the manual and at its makefile, and there is no argument like "--prefix" for me to change. Is there a way to install all of Bedtools in a directory that I specify? Thanks!

↧

Bedtools Intersectbed

November 17, 2011, 10:15 am

Apologies if this is blatantly obvious!

I would like to compare coordinates in setA with those of setB. The output should have the same number of coordinates as setA and tell me how many nucleotides of each setA coordinate are overlapped by any coordinate in setB.

For example, a large coordinate in setA may be overlapped by two setB coordinates, but I want to know how many nucleotides of the setA coordinate are covered by both setB coordinates in total.

I know how to do this in Galaxy, as there is the handy 'Coverage' tool in 'Operate on Genomic Intervals'. However, I want to do this on the command line. I have been trying to get BEDTools to do this using 'intersectBed', but I can only seem to get just the overlapping setA coords (using -u), or the nucleotide overlap for multiple setB coordinates on separate lines (using -wao), or a count of how many setB coordinates overlap each setA coordinate (using -c).

SetB coordinates are non-overlapping themselves, so I guess I could tally up those setB coordinates that overlap the same setA coordinate.

Can BEDTools do what I want, or is there another command line way of doing it?

Thank you!

PS I have also sent this to the BEDTools discussion list, so apologies for any double postings!
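The tally described above is easy to express directly. The following Python sketch returns one row per setA coordinate with the total number of its bases covered by setB. The coordinates are made up, and the merge step is a precaution so that the function also works if setB entries ever do overlap.

```python
def covered_bases(a, b_intervals):
    """Number of bases of interval a covered by at least one interval in b_intervals."""
    chrom, a_start, a_end = a
    # Clip each overlapping B interval to A, then merge so no base is counted twice.
    clipped = sorted(
        (max(a_start, s), min(a_end, e))
        for c, s, e in b_intervals
        if c == chrom and s < a_end and a_start < e
    )
    total = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

set_a = [("chr1", 100, 200), ("chr1", 500, 600)]
set_b = [("chr1", 90, 120), ("chr1", 150, 170), ("chr1", 900, 950)]
coverage = [(a, covered_bases(a, set_b)) for a in set_a]
for (chrom, start, end), n in coverage:
    print(chrom, start, end, n, sep="\t")
```

The output has exactly as many rows as setA, including rows with zero covered bases.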

↧

Getting Unmapped Reads: Comparing Fastq To Bam

December 4, 2011, 6:02 pm

Given a FASTQ file and a BAM file of aligned reads, is there an efficient way to get all reads that are in the original FASTQ but not in the BAM? Perhaps using bedtools. I.e.:

unmapped_script original.fastq aligned.bam > unmapped.fastq

should create an unmapped.fastq file, which is a subset of original.fastq containing only those entries that do not appear in aligned.bam.

Thank you.
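As a rough idea of what such a script could do, here is a Python sketch that filters four-line FASTQ records against a set of read names. It assumes the names of the aligned reads have already been extracted from the BAM into a set (that step is not shown), and that a trailing /1 or /2 should be ignored when comparing names.

```python
def unmapped_records(fastq_lines, aligned_names):
    """Four-line FASTQ records whose read name is not in aligned_names."""
    kept = []
    for i in range(0, len(fastq_lines) - 3, 4):
        name = fastq_lines[i][1:].split()[0]  # drop the leading '@' and any description
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        if name not in aligned_names:
            kept.append(fastq_lines[i:i + 4])
    return kept

fastq_lines = [
    "@read_a/1", "ACGT", "+", "IIII",
    "@read_b/1", "TTGA", "+", "IIII",
]
aligned_names = {"read_a"}  # hypothetical: names taken from aligned.bam beforehand
unmapped = unmapped_records(fastq_lines, aligned_names)
for record in unmapped:
    print("\n".join(record))
```

Keeping the aligned names in a set makes each lookup constant time, which matters for FASTQ files with many millions of reads.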

↧

How To Extract Scores From Bedgraph File Using Bed Tools

January 23, 2013, 1:49 am

file1

chr1 10 20 name 0 +

file2

chr1 12 14 2.5
chr1 14 15 0.5

How could I extract the average scores for file1 using file2, as below? I am trying to extract average phastCons scores (file2) for the features in file1.

chr1  10 20 name 0 + 1.5
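The value 1.5 in the desired output matches a plain mean of the overlapping scores, (2.5 + 0.5) / 2. The Python sketch below computes exactly that for the example rows; a length-weighted mean would give a different number, so which one is wanted is an assumption to check.

```python
def mean_score(feature, scored_intervals):
    """Unweighted mean score of the intervals overlapping the feature, or None if none do."""
    chrom, start, end = feature[:3]
    scores = [
        score
        for c, s, e, score in scored_intervals
        if c == chrom and s < end and start < e
    ]
    return sum(scores) / len(scores) if scores else None

feature = ("chr1", 10, 20, "name", 0, "+")
scored = [("chr1", 12, 14, 2.5), ("chr1", 14, 15, 0.5)]
average = mean_score(feature, scored)
print(*feature, average)
```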

↧

Merging/Intersecting Different Gene Annotations - Should I Extend Coordinates?

October 12, 2013, 3:47 am

I want to create a gene data set (as big as possible), so I am using several gene annotations. However, genes in different annotations overlap (they are the same gene). To reduce bias, I overlap the different annotations and, where genes overlap, keep only one gene.

Question:

To make sure such overlaps are detected, I was thinking of extending the gene coordinates. Is this necessary? If so, how big should the extension be (5 bp/100 bp)?

Example:

I want to create an lncRNA data set (in the following steps it will be used to search for genomic features).
Input:

GENCODE lncRNA annotation (version 18 - 04/09/2013);
Cabili lncRNA annotation (Cabili et al., 2011 (CSHLP)).

Workflow:

Extract GENCODE genes start/end coordinates;
Extract Cabili genes start/end coordinates;
Extend Cabili coordinates ( -/+ nbp );
Use BedTools intersect;
If genes intersect, keep the GENCODE gene (as it's the newer annotation, though this step is really subjective).

I do realize that the extension question depends on the situation and on how reliable the annotation is, but I still hope that someone can suggest something.
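The workflow above can be prototyped quickly to see how sensitive the result is to the extension size. This Python sketch keeps all GENCODE genes and adds only those Cabili genes whose extended coordinates do not overlap any GENCODE gene. Gene names and coordinates are invented, and keeping the original (unextended) Cabili coordinates in the output is my own choice.

```python
def merge_annotations(gencode, cabili, extend=0):
    """GENCODE genes plus the Cabili genes that do not overlap them after extension."""
    kept = list(gencode)
    for chrom, start, end, name in cabili:
        ext_start, ext_end = max(0, start - extend), end + extend
        clash = any(
            c == chrom and s < ext_end and ext_start < e
            for c, s, e, _ in gencode
        )
        if not clash:
            kept.append((chrom, start, end, name))  # original coordinates are kept
    return kept

gencode = [("chr1", 1000, 2000, "G1")]                                  # toy records
cabili = [("chr1", 2050, 3000, "C1"), ("chr1", 5000, 6000, "C2")]
no_extension = merge_annotations(gencode, cabili, extend=0)
with_extension = merge_annotations(gencode, cabili, extend=100)
print(len(no_extension), len(with_extension))
```

Running this over a range of extension values and plotting the number of genes kept would show whether the choice between 5 bp and 100 bp changes much in practice.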

↧

Getting Number Of Reads In Intervals With Bedtools

December 14, 2012, 3:29 pm

What is the correct way to get the total number of reads strictly contained in each interval in a GFF from a BAM file while enforcing strandedness? What I am looking for is very close to this intersectBed feature:

-c    For each entry in A, report the number of overlaps with B.
    - Reports 0 for A entries that have no overlap with B.
    - Overlaps restricted by -f and -r.

Except that I'd like the number of overlaps in A for each entry in B (i.e. the other way around). If I do:

intersectBed -abam mybam.bam -b mygff.gff -s -f 1 -wb

Then my understanding is that this will report the entry in B for each overlap with A. But I'd like each entry in B to be outputted exactly once, with the number of reads from A that are contained strictly within it. I'm not sure how to enforce strict containment here.

Is coverageBed the solution to this? Or multicov? I'm not sure how to enforce strict containment using coverageBed - it's not clear to me if that's the default from the docs. Thanks.
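To pin down what "strictly contained, same strand, one line per interval" means, here is a small Python sketch of that count. It is only a reference for the expected numbers, not a statement about what any bedtools option does by default; the intervals and reads are made up.

```python
def contained_counts(intervals, reads):
    """For each interval, count same-strand reads lying entirely inside it."""
    counts = []
    for chrom, start, end, strand in intervals:
        n = sum(
            1
            for c, s, e, read_strand in reads
            if c == chrom and read_strand == strand and start <= s and e <= end
        )
        counts.append((chrom, start, end, strand, n))
    return counts

intervals = [("chr1", 100, 500, "+"), ("chr1", 1000, 1500, "-")]
reads = [
    ("chr1", 120, 170, "+"),    # inside the first interval, same strand
    ("chr1", 300, 350, "+"),    # inside the first interval, same strand
    ("chr1", 480, 530, "+"),    # crosses the end of the first interval
    ("chr1", 200, 250, "-"),    # inside the first interval, opposite strand
    ("chr1", 1100, 1150, "-"),  # inside the second interval, same strand
]
counts = contained_counts(intervals, reads)
for row in counts:
    print(*row, sep="\t")
```

Each interval appears exactly once in the output, including intervals with a count of zero.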

↧