Using Gnu Parallel For Bedtools

February 5, 2014, 4:36 am

I am trying to run GNU parallel with the bedtools multicov function, where the original command is

bedtools multicov -bams bam1 bam2 bam3.. -bed anon.bed  > Q1_Counst.bed

I would like to implement the above command using gnu parallel. But when I run the command below

parallel -j 25 "bedtools multicov -bams {1} -bed {2} > Q1_Counst.bed" ::: minus_1_common_sorted_q1.bam minus_2_common_sorted_q1.bam minus_3_common_sorted_q1.bam plus_1_common_sorted_q1.bam plus_2_common_sorted_q1.bam plus_3_common_sorted_q1.bam ::: '/genome/genes_exon_2.bed'

each BAM file is taken as a separate argument, so the processes that start look like

bedtools multicov -bams  bam1 -bed anon.bed  > Q1_Counst.bed
bedtools multicov -bams  bam2 -bed anon.bed  > Q1_Counst.bed
bedtools multicov -bams  bam3 -bed anon.bed  > Q1_Counst.bed

instead of a single process that takes all the BAM files together. As a result, every job writes to Q1_Counst.bed and the file is overwritten unpredictably. Could anyone help me get the right command? My server has around 30 cores.
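One way to use many cores while still passing every BAM file to each multicov call is to split the BED file rather than the BAM list: one job per BED chunk, each with its own output file, concatenated afterwards. The sketch below only builds the argument lists. The helper names, the chunk file names and the `.counts` suffix are hypothetical and not from the post.

```python
from typing import List, Tuple


def split_lines(lines: List[str], n_chunks: int) -> List[List[str]]:
    """Split BED lines into contiguous chunks of near-equal size."""
    n_chunks = max(1, min(n_chunks, len(lines)))
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def multicov_jobs(bams: List[str], chunk_paths: List[str]) -> List[Tuple[List[str], str]]:
    """Build one job per BED chunk; every job receives ALL BAM files."""
    jobs = []
    for path in chunk_paths:
        argv = ["bedtools", "multicov", "-bams", *bams, "-bed", path]
        # Hypothetical naming: a separate output per job, so nothing is overwritten.
        jobs.append((argv, path + ".counts"))
    return jobs
```

Each `argv` could then be run with `subprocess.run(argv, stdout=out_file)` from a process pool, and the per-chunk outputs concatenated in chunk order to rebuild a single counts file.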

↧

Can Bedtools/Bedops Be Used To Extract Regions Where Scores Are Higher Than A Given Value?

June 21, 2013, 3:38 am

I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value? For example, let's say that I have a BED file like the following:

chr7    127471196  127472363  Pos1  12   +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  200  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  120  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  54   +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  2    -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  15   -  127477031  127478198  0,0,255
chr7    127478198  127479365  Neg3  25   -  127478198  127479365  0,0,255
chr7    127479365  127480532  Pos5  2    +  127479365  127480532  255,0,0
chr7    127480532  127481699  Neg4  9    -  127480532  127481699  0,0,255

According to the BED format specification, the fifth column contains a score between 0 and 1000 (in the bedGraph format, the score is in the 4th column instead). If I want to get all the regions that have a score higher than 20, for example, I can do an awk search:

awk '$5 > 20 {print}' mybedfile.bed

However, in order to use awk, I have to keep the BED file in an uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with th ...
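The score filter does not need an uncompressed file on disk as long as the records arrive as a stream. A minimal sketch, assuming whitespace-separated fields and a numeric score column; the function names are hypothetical. A .starch archive cannot be opened this way, but the output of `unstarch` could be piped in on standard input.

```python
import gzip
from typing import Iterable, Iterator


def filter_by_score(lines: Iterable[str], min_score: float, col: int = 5) -> Iterator[str]:
    """Yield lines whose 1-based column `col` is strictly greater than min_score."""
    for line in lines:
        # Skip blank lines and common header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if float(fields[col - 1]) > min_score:
            yield line


def open_bed(path: str):
    """Open a plain or gzip-compressed BED file as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)
```

For bedGraph input the score is in column 4, so the call would be `filter_by_score(lines, 20, col=4)`.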

↧

Tool: Bedtools: Analyzing Genomic Features

April 24, 2012, 10:54 am

All practicing bioinformaticians will face problems that require them to compare, query and select genomic features across an entire genome. As it happens, efficient interval representation and querying is a surprisingly challenging problem that needs a specialized representation. The BEDTools suite contains a set of programs that support a broad range of interval analyses that involve selecting certain locations in the genome. The name reflects the original intent to process BED files, but the tools operate just as well on GFF formats. The tools are run from the command line and are available for UNIX-type systems: Linux, Mac OS X, and Cygwin (on Windows). The link to the site is: http://code.google.com/p/bedtools/ With BEDTools one can answer questions such as:

how many reads map upstream/downstream of one or more locations in the genome?
how many reads cover a certain base in the genome?
which sections of the genome are not overlapping with target intervals?
what are the sequences specified by the coordinates?
...

The suite consists of multiple tools but for beginners the most important is ...

↧

Getting Number Of Reads In Intervals With Bedtools

December 14, 2012, 3:29 pm

What is the correct way to get the total number of reads strictly contained in each interval in a GFF from a BAM file while enforcing strandedness? What I am looking for is very close to this intersectBed feature:

-c    For each entry in A, report the number of overlaps with B.
    - Reports 0 for A entries that have no overlap with B.
    - Overlaps restricted by -f and -r.

Except that I'd like the number of overlaps in A for each entry in B (i.e. the other way around). If I do:

intersectBed -abam mybam.bam -b mygff.gff -s -f 1 -wb

Then my understanding is that this will report the entry in B for each overlap with A. But I'd like each entry in B to be outputted exactly once, with the number of reads from A that are contained strictly within it. I'm not sure how to enforce strict containment here.

Is coverageBed the solution to this? Or multicov? I'm not sure how to enforce strict containment using coverageBed - it's not clear to me if that's the default from the docs. Thanks.
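To make the requirement concrete, here is a brute-force sketch of the counting rule being asked for: one count per feature, same strand only, and a read counts only if it lies entirely inside the feature. It assumes all coordinates have already been converted to the same 0-based half-open convention (GFF itself is 1-based and inclusive). The function name and tuple layout are hypothetical, and this is not how bedtools implements it.

```python
from typing import List, Tuple

# (chrom, start, end, strand), 0-based half-open coordinates
Interval = Tuple[str, int, int, str]


def count_contained(features: List[Interval], reads: List[Interval]) -> List[int]:
    """For each feature, count same-strand reads that lie entirely inside it."""
    counts = []
    for f_chrom, f_start, f_end, f_strand in features:
        n = sum(
            1
            for r_chrom, r_start, r_end, r_strand in reads
            if r_chrom == f_chrom
            and r_strand == f_strand
            and f_start <= r_start
            and r_end <= f_end
        )
        counts.append(n)
    return counts
```

This is quadratic in the input size, so it is only suitable for checking a tool's output on a small test region.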

↧

Correlation Of Fpkm And Length Normalized Transcript Mapped Read Count

August 20, 2012, 10:36 am

Hello, in the process of estimating expression for a 16 human tissue dataset ("Human Body Map 2.0 GSE30611") I used different methods to estimate the expression of the genes. After mapping against hg19 genome version, I used the UCSC provided refseq annotation for hg19 to count mapped reads for ~40,000 human genes in two ways:

1. Counting with cufflinks, which outputs a Fragments Per Kilobase per Million mapped fragments (FPKM) value for each transcript. The FPKM value basically accounts for library size and for the length of the transcript comprising all the annotated exons, plus an additional likelihood estimator to assign reads (see here).
2. Counting mapped reads with bedtools and dividing a transcript's mapped count by the sum of all its exon lengths. This gives a length-normalized expression estimate to compare between genes.

However, the correlation of (1.) and (2.) is always around 0.65 between same tissues (technically the same experiment). I would expect this correlation to be > 0.9. Below, I plotted (2.) against (1.) for all ~40,000 transcripts. It seems like normal length normalization is simply overestimating some expression. Can someone she ...
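The two normalizations can be written side by side. This is a sketch of the textbook formulas only, with hypothetical function names; it does not reproduce the likelihood-based read assignment that cufflinks performs.

```python
from typing import Iterable


def fpkm(count: float, length_bp: int, total_mapped: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return count * 1e9 / (length_bp * total_mapped)


def length_normalized(count: float, exon_lengths: Iterable[int]) -> float:
    """Mapped count divided by the summed exon length (method 2 in the post)."""
    return count / sum(exon_lengths)
```

Within one sample, the two values differ only by the constant factor `1e9 / total_mapped` when they start from the same count and the same length. A correlation well below 1 would therefore point to differences in how reads are assigned to transcripts or in the effective lengths used, rather than to the normalization arithmetic itself.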

↧

Bed File Bedpe Format

July 29, 2011, 8:12 am

Hi,

I'm having trouble converting a BAM file to BEDPE format (-bedpe) using bedtools.

workflow:
samtools sort -n mut.bam mut.Namesorted
bamToBed -i mut.Namesorted.bam -bedpe > dilpMerged_bedpe.bed

After sorting the file by read name (option -n), I run the bamToBed command, but it gives me an error message after processing a few lines:

*ERROR: -bedpe requires BAM to be sorted/grouped by query name.

What am I doing wrong here?

Thanks
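A quick diagnostic is to check whether the read names really are grouped in the file that is passed to -bedpe. The sketch below takes an iterable of query names (for example the first column of `samtools view` output) and reports the first name that reappears after a different name has been seen. The function name is hypothetical.

```python
from typing import Iterable, Optional


def first_ungrouped_name(names: Iterable[str]) -> Optional[str]:
    """Return the first query name that reappears after another name, else None."""
    seen = set()
    previous = None
    for name in names:
        if name != previous:
            if name in seen:
                return name  # this name was seen earlier, then interrupted
            seen.add(name)
            previous = name
    return None
```

The set of seen names grows with the number of distinct reads, so for a very large BAM it may be more practical to test only the first few million records.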

↧

Bedtools subtract not dealing well with large datasets

May 28, 2014, 8:23 am

I am using bedtools subtract with large datasets and it keeps crashing, giving the following error

terminate called after throwing an instance of 'std::bad_alloc'

Is there a way to get over this problem in bedtools?

Alternatively is there any other way to find nonoverlapping regions for two bed files?

thanks
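As an illustration of an alternative way to find the parts of one interval set that the other does not cover, here is a sweep over sorted intervals. It assumes a single chromosome, 0-based half-open coordinates, and that both inputs are sorted by start. It takes lists for brevity and is a sketch, not bedtools' algorithm; the function name is hypothetical.

```python
from typing import Iterator, List, Tuple

Span = Tuple[int, int]  # (start, end), half-open


def subtract(a_intervals: List[Span], b_intervals: List[Span]) -> Iterator[Span]:
    """Yield the pieces of each A interval that no B interval covers."""
    j = 0
    for a_start, a_end in a_intervals:
        # B intervals ending at or before this A start cannot overlap
        # this or any later A interval, because A is sorted by start.
        while j < len(b_intervals) and b_intervals[j][1] <= a_start:
            j += 1
        pos = a_start
        k = j
        while k < len(b_intervals) and b_intervals[k][0] < a_end:
            b_start, b_end = b_intervals[k]
            if b_start > pos:
                yield (pos, b_start)  # uncovered gap before this B interval
            pos = max(pos, b_end)
            k += 1
        if pos < a_end:
            yield (pos, a_end)  # uncovered tail of the A interval
```

Because the pointer into B only moves forward, the same idea can be adapted to read both files as streams, one chromosome at a time, instead of holding everything in memory.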

↧

Genomecoveragebed - Bedtool For Reporting Per Base Genome Coverage

February 15, 2012, 1:56 pm

Hi everyone, I need some help with the genomeCoverageBed tool. When used to find per-base genome coverage, this tool takes the option -d. I am interested in finding read counts for each base within a particular intron of a gene. Let me explain in more detail to make myself clear. I used IGV to see how my alignments look and what the coverage of each base within a particular intron is. When I move my cursor in IGV to the area of the coverage track exactly above the base I am interested in, it gives me details such as:

Total Count:6
A:0
C:0
G:6
T:0
N:0

This total count is the read count for the base G within that intron: 6 reads actually cover this base position (and hence this base). But when I compute per-base genome coverage with

genomeCoverageBed -i 2-B3-1b-D303A_sorted.bed -g pombe.genome -d

I get a depth of around 21 for that base (G in my example). Looking closely in IGV, I figured out that 21 = 6 + 15, where 6 is the number of reads that actually cover this base position and 15 is the number of reads that span the position without covering it. Since genomeCoverageBed calculates the depth of feature coverage, it also includes all the reads that skip that particular base. I would provide you with an image to make it more clear. I would like to know how can i ...
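The gap between 6 and 21 is what one would expect if spliced reads are counted over their whole span rather than over their aligned blocks. The sketch below contrasts the two ways of counting for a single position. Each read is represented as a list of aligned blocks in 0-based half-open coordinates; the function name and this representation are assumptions made for illustration.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start, end), half-open


def depth_at(pos: int, reads: List[List[Block]], split: bool) -> int:
    """Count reads covering pos, either block by block or over the whole span."""
    depth = 0
    for blocks in reads:
        if split:
            # Only the aligned blocks count; skipped (intronic) bases do not.
            covered = any(start <= pos < end for start, end in blocks)
        else:
            # The read counts everywhere between its first and last base.
            span_start = min(start for start, _ in blocks)
            span_end = max(end for _, end in blocks)
            covered = span_start <= pos < span_end
        depth += covered
    return depth
```

A plain six-column BED record keeps only the outer span of each read, so the block structure has to survive the BAM to BED conversion (or the BAM has to be used directly) for block-wise counting to be possible.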

↧

Intersectbed Overlap

November 23, 2011, 9:20 am

Hi,

I have a question about intersectBed. Is it possible to extract only alignments like this:

chromosome ===============================================================
BED/BAM A               ==============              =================
BED FILE B               ============
RESULT                  ==============

But no alignments like this (even if the read overlaps 100% of the feature, I don't want to extract these reads):

chromosome ===============================================================
BED/BAM A    =========================              =================
BED FILE B               =============
RESULT

So, I only want to extract reads that have 90-95% of their sequence overlapping 90-95% of the feature.

Is that clear?

Thanks,
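The rule sketched in the two diagrams is a reciprocal overlap test: the overlap has to be a large fraction of the read and also a large fraction of the feature. In intersectBed this corresponds to combining -f with -r (for example -f 0.9 -r). A minimal sketch of the test itself, with a hypothetical function name and 0-based half-open coordinates:

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end), half-open


def reciprocal_overlap(a: Span, b: Span, min_frac: float) -> bool:
    """True if the overlap is at least min_frac of a AND at least min_frac of b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return False
    return overlap >= min_frac * (a[1] - a[0]) and overlap >= min_frac * (b[1] - b[0])
```

In the second diagram the read is much longer than the feature, so the overlap is a small fraction of the read and the test fails even though the feature is fully covered.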

↧

How To Explain Uneven Coverage Of A Dna Segment Obtained Via Pcr Amplification.

April 8, 2014, 9:43 am

Experiment: deep sequencing for mutants in 700nt fragment.

The DNA fragment was preamplified with primers flanking the fragment, followed by HiSeq sequencing.

Per-base coverage was calculated by coverageBed -d -abam in.bam -b ref.bed > out.cov

Observation: two distinct peaks in coverage at the ends, as in the plot below (coverage vs. position).

[Figure: coverage vs. position]

The peaks are made up of reads that contain part of the primers, and that therefore also show soft clipping at the ends.

There is a huge difference in the calculations depending on whether I include or exclude such reads.

Question: is there anyone who knows how to handle such a situation?

↧

How To Rearrange Paired End Bam File?

May 16, 2013, 10:17 am

Hello all,

I have a paired-end BAM file that I want to use with bedtools. After merging, the alignments of each read pair no longer lie next to each other, which causes problems in bedtools. Is there a tool that can rearrange the paired-end read alignments in a BAM file?

Thanks, Deeps

↧

Counting The Whole Insert Size From Paired-End Reads As Coverage

March 6, 2012, 1:46 pm

We have updated our workflows for per-base sequence coverage to use genomeCoverageBed from BAM files. However, for paired-end data it seems as though the regions between the paired-end reads are not counted.

To be clear I am not talking about using -split for not counting introns in a single read of a paired-end, instead I am looking to count the probable whole insert when the insert size is greater than the combined read length of the paired reads.

We've looked at using IRanges from Bioconductor as well but cannot tell if this would do what we want.

Is there a hidden flag in genomeCoverageBed to count the whole insert as coverage, not just the sequenced ends? Is there another program out there that would work on BAM files?

I know I can alter the SAM file before BAM conversion but this seems like something that should be coded somewhere already.
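One way to picture "whole insert as coverage" is to replace each read pair by a single interval that runs from the leftmost to the rightmost aligned base of the two mates, and then compute coverage on those intervals. A minimal sketch, assuming the mates have already been paired up and share 0-based half-open coordinates; the function name and tuple layout are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Aln = Tuple[str, int, int]  # (chrom, start, end), half-open


def fragment_intervals(pairs: Iterable[Tuple[Aln, Aln]]) -> Iterator[Aln]:
    """Yield one interval per pair, spanning both mates and the gap between them."""
    for (chrom1, start1, end1), (chrom2, start2, end2) in pairs:
        if chrom1 != chrom2:
            continue  # mates on different chromosomes have no meaningful span
        yield (chrom1, min(start1, start2), max(end1, end2))
```

The resulting intervals could be written out as BED and passed to a coverage tool in place of the individual reads. In practice one would probably also drop discordant pairs or pairs with an implausibly large span before doing so.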

↧

Remove Intronic Regions in .BAM

May 14, 2014, 2:45 am

I have a .BAM file which contains discordantly and concordantly mapped mate-pairs. I used bedtools pairToBed to extract the mate-pairs in which both mates overlap the targeted regions (Illumina target .bed file). Is it somehow possible to remove the parts of the mate-pairs that do not overlap? I couldn't find it in the bedtools manual... Can I just use intersectBed on each read for this?

Thanks!

↧

Per Base Coverage

March 9, 2012, 6:19 pm

Is there a way to obtain per-base coverage for a defined chromosome interval using a BAM file generated from Illumina single-end reads? genomeCoverageBed in Bedtools does not seem to have an option for it.

↧

Splice Junction file intersection with genome annotation

June 16, 2014, 10:38 pm

Hello, I have a tab delimited format Splice Junction file and the file looks something like this:

chr1  11212  12009  1  1  0  0  2    48
chr1  11672  12009  1  1  0  0  1    31
chr1  11845  12009  1  1  0  0  1    28
chr1  12228  12612  1  1  1  0  1    32
chr1  12722  13220  1  1  1  0  3    9
chr1  14830  14969  2  2  1  0  218  50
chr1  15039  15795  2  2  1  0  98   50
chr1  15948  16606  2  2  1  1  10   48
chr1  16766  16857  2  2  1  0  24   44
chr1  16766  16875  2  2  0  0  2    36

The task is to filter out lines in which Column 6 has value 1, Column 7 has value 1 and Column 8 has value 10 or greater. I have been going through the bedtools documentation but I am not quite sure on how to get started, I would appreciate a few pointers on how to get going. My input file is going to be in the tab delimited format and I also have the Gencode V.19 GTF file for annotation. Thanks!

*** Edit ***

Column 1: chromosome
Column 2: first base of the intron (1-based)
Column 3: last base of the intron (1-based)
Column 4: strand
Column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
Column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
Column 7: number of uniquely mapping reads crossing the junction
Column 8: number of multi-mapping reads crossing th ...
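The column test itself is a plain row filter and does not need interval arithmetic. The sketch below applies the three conditions from the post (column 6 equals 1, column 7 equals 1, column 8 at least 10). Because "filter out" can mean either keeping or discarding the matching rows, the hypothetical `keep_matching` flag covers both readings.

```python
from typing import Iterable, Iterator, List


def matches(fields: List[str]) -> bool:
    """Column 6 == 1, column 7 == 1 and column 8 >= 10 (1-based columns)."""
    return fields[5] == "1" and fields[6] == "1" and int(fields[7]) >= 10


def select_junctions(lines: Iterable[str], keep_matching: bool = True) -> Iterator[str]:
    """Yield the rows that match the conditions, or the rows that do not."""
    for line in lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed rows
        if matches(fields) == keep_matching:
            yield line
```

Among the ten example rows in the post, only the junction at chr1 15948 16606 satisfies all three conditions.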

↧

macs and bedtools

July 4, 2014, 2:07 pm

Hello

I have MACS2 output and am now looking for peaks that are located in introns. I have a BED file with introns from UCSC for my species. Which peak file should I use for the bedtools intersection: the peak summits (.bed) or the narrow peaks (.bed), both from the MACS2 output?

↧

"mask" values in a bedgraph

June 10, 2014, 9:11 am

I am trying to plot average conservation in a list of genomic features, and so far I have managed to do it successfully using a combination of the phastCons bigwig files (hg19.100way.phastCons.bw) and deepTools. However, as an extra step, I want to redo my analysis, this time removing, or masking, the conservation values in the exons.

My first step, and the easiest, was to remove all features that overlap with exons, using bedtools intersect. This worked, but seems like a crude way of doing it. So I am now trying to convert all phastCons values in exons to zero.

The question is: how to do it? Consider that I want a nice bigwig at the end to input to deepTools. Initially I converted the phastCons bigwig to bedGraph, because I thought map from bedtools would work. It did not, so I am a bit out of ideas now.
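On the bedGraph level, masking amounts to cutting each record at the exon boundaries and writing zero for the parts that fall inside an exon. A brute-force sketch for a single chromosome follows, assuming 0-based half-open coordinates and exons sorted by start; the function name and tuple layout are hypothetical.

```python
from typing import Iterator, List, Tuple

Record = Tuple[int, int, float]  # (start, end, value)
Span = Tuple[int, int]           # (start, end)


def mask_bedgraph(records: List[Record], exons: List[Span]) -> Iterator[Record]:
    """Yield bedGraph records with the value set to 0.0 wherever an exon overlaps."""
    for start, end, value in records:
        pos = start
        for ex_start, ex_end in exons:
            if ex_end <= pos or ex_start >= end:
                continue  # no overlap with the part not yet emitted
            if ex_start > pos:
                yield (pos, ex_start, value)  # unmasked stretch before the exon
                pos = ex_start
            masked_end = min(ex_end, end)
            yield (pos, masked_end, 0.0)      # masked stretch inside the exon
            pos = masked_end
        if pos < end:
            yield (pos, end, value)           # unmasked tail
```

The output stays sorted and non-overlapping as long as the input records are, which is what a bedGraph to bigWig conversion generally expects. For genome-wide data the inner loop over all exons would need replacing with a sweep or an index.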

↧

Bedtools Intersectbed

November 17, 2011, 10:15 am

Apologies if this is blatantly obvious!

I would like to compare coordinates in setA with those of setB. The output should have the same number of coordinates as setA and tell me how many nucleotides of each setA coordinate are overlapped by any coordinate in setB.

For example, a large coordinate in setA may be overlapped by two setB coordinates, but I want to know how many nucleotides of the setA coordinate are covered by both setB coordinates in total.

I know how to do this on GALAXY as there is the handy 'Coverage' tool in 'Operate on Genomic Intervals'. However, I want to do this on the command line. I have been trying to get BEDTools to do this using 'intersectBed', but I can only seem to get just the overlapping setA coords (using -u), or the nucleotide overlap for multiple setB coordinates on separate lines (using -wao), or a count of how many setB coordinates overlap each setA coordinate (using -c).

SetB coordinates are non-overlapping themselves, so I guess I could tally up those SetB coordinates that overlap the same setA coordinate.

Can BEDTools do what I want, or is there another command line way of doing what I want?

Thank you!

PS I have also sent this to the BEDTools discussion list, so apologies for any double postings!
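The tally described above can be written down directly: for each setA interval, sum the overlap lengths with every setB interval. Since setB is stated to be non-overlapping, the sum equals the number of covered nucleotides. This is a brute-force sketch for one chromosome with 0-based half-open coordinates and a hypothetical function name; the bedtools tool aimed at this kind of per-interval summary is coverageBed.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), half-open


def covered_bases(a_intervals: List[Span], b_intervals: List[Span]) -> List[int]:
    """For each A interval, return how many of its bases are overlapped by B.

    Assumes the B intervals do not overlap each other, so overlaps can be summed.
    """
    result = []
    for a_start, a_end in a_intervals:
        total = 0
        for b_start, b_end in b_intervals:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
        result.append(total)
    return result
```

The output has exactly one value per setA interval, including zeros for intervals that nothing in setB touches.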

↧

how to run subtract command in java

September 3, 2014, 2:20 pm

I want to run the subtract command from Java. Could somebody tell me how to do this?

Thank you very much.

↧

Intersectbed Provides An Empty Output

August 16, 2013, 10:53 pm

Hi,

I've downloaded the recent Cygwin version 1.7.24 and am trying to run BEDTools, but I get an empty file as my output. When I run the same command line and files on a colleague's computer, also through Cygwin, I get a file containing the overlaps I seek. Is the new Cygwin not compatible with BEDTools? I've put the command line we used below:

./intersectbed -a Gene_body.bed -b EdgeR1.bed -wao > yyy.temp

Any help would be appreciated.

↧