error with bedtools slop

April 17, 2014, 2:28 am

Hi,

I am trying to run bedtools slop on my BED file with the hg19.genome file:

bedtools slop -i H3K27me3.bed -g hg19.genome -b 30

I get the following error:

Less than the req'd two fields were encountered in the genome file (genomes/hg19.genome) at line 2. Exiting.

Any suggestions?

Thanks in advance

Samad
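
One possible cause, offered as an assumption rather than a diagnosis: bedtools expects the genome file to be tab-delimited with two columns (chromosome name and size), so a blank line, or a line that uses spaces instead of tabs, could trigger this message. A minimal sketch for locating such lines follows; the helper name check_genome_file and the demo file are hypothetical.

```shell
# Report lines of a genome file that do not have at least two
# tab-separated fields (chromosome name, chromosome size).
# check_genome_file is a hypothetical helper, not part of bedtools.
check_genome_file() {
    awk -F '\t' 'NF < 2 { printf "line %d: %d field(s): [%s]\n", NR, NF, $0; bad = 1 }
                 END { exit bad }' "$1"
}

# Small demonstration: line 2 uses a space instead of a tab.
demo_genome=$(mktemp)
printf 'chr1\t249250621\nchr2 243199373\nchr3\t198022430\n' > "$demo_genome"
check_genome_file "$demo_genome" || echo "genome file needs fixing"
```

The function exits non-zero when it finds a problem line, so it can also be used as a guard before calling bedtools slop.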

↧

To Group Items In Bed Files

January 20, 2012, 5:50 pm

For example, we now have a bed file:

chr1 23455 45678
chr1 23446 45663
chr1 23449 45669
chr1 30000 31000

Is there any way to group the first three lines while leaving the last line alone? I know bedtools has the mergeBed function, which merges overlapping spans, but that would also merge in the last line.

This may sound like a purely computational question, but I'm curious whether tools already exist to tackle it.

thx
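
One way to read the question is "group intervals whose start and end coordinates nearly coincide". A rough sketch of that idea with sort and awk is given here. The helper name group_similar and the 20 bp tolerance are assumptions chosen for illustration, and each group is compared against its first interval only.

```shell
# Assign a group label to intervals whose start and end both lie within
# `tol` bp of the first interval of the current group.
# group_similar and the default tolerance are hypothetical choices.
group_similar() {
    sort -k1,1 -k2,2n -k3,3n |
    awk -v tol="${1:-20}" '
        function absdiff(a, b,    d) { d = a - b; return d < 0 ? -d : d }
        {
            if ($1 != c0 || absdiff($2, s0) > tol || absdiff($3, e0) > tol) {
                group++                     # open a new group seeded by this interval
                c0 = $1; s0 = $2; e0 = $3
            }
            print $1, $2, $3, "group" group
        }'
}

printf 'chr1 23455 45678\nchr1 23446 45663\nchr1 23449 45669\nchr1 30000 31000\n' |
    group_similar 20
```

With the four example lines, the first three share a group label and the 30000-31000 interval gets its own, even though it lies inside the others.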

↧

How To Rearrange Paired End Bam File?

May 16, 2013, 10:17 am

Hello all,

I have a paired-end BAM file that I want to use with bedtools. After merging, the alignments of each read pair no longer lie next to each other, which causes problems in bedtools. Is there a tool available to rearrange the paired-end read alignments in a BAM file?

Thanks, Deeps

↧

Coveragebed, Depth/Breadth Of Coverage

June 17, 2011, 3:47 pm

I'm using coverageBed to calculate the depth and breadth of coverage, but I'm not sure I'm doing this right. I want to calculate the two values for each human chromosome.

For example, I've created a bed file with 1 chromosome. When I input my BAM file and the BED file, I get the following output:

chr1    0       249250621       103718897       224950839       249250621       0.9025086

I know the first 3 fields are from my chr BED file, the 4th field is the # of reads, 5th is # of bases covered, 6th is length of chromosome (redundant to field 3), and the last column is the fraction of bases covered (5th field/6th field).

So the 7th/last field gives the breadth of coverage, but I don't see a depth of coverage value. How do I get a depth of coverage?
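
Mean depth is the total number of aligned bases in a region divided by the region length, so it needs per-depth (or per-base) information that the default output does not carry. The sketch below computes it from a depth histogram. The three-column input layout (chromosome, depth, number of bases at that depth) is a hypothetical simplification, loosely modelled on what coverageBed -hist reports; the real column positions should be checked against your bedtools version.

```shell
# Compute mean depth and breadth per chromosome from a depth histogram.
# Assumed (hypothetical) input columns: chrom, depth, bases_at_depth.
depth_summary() {
    awk '{
            total[$1]  += $3                  # all bases in the region
            summed[$1] += $2 * $3             # depth-weighted base count
            if ($2 > 0) covered[$1] += $3     # bases covered at least once
         }
         END {
            for (c in total)
                printf "%s\tmean_depth=%.2f\tbreadth=%.2f\n", c, summed[c] / total[c], covered[c] / total[c]
         }'
}

# Toy example: 100 bp region with 10 bp uncovered, 50 bp at 1x, 40 bp at 5x.
printf 'chrT 0 10\nchrT 1 50\nchrT 5 40\n' | depth_summary
```

If only the default output is available, a rough estimate is (number of reads × read length) / region length, provided the read length is known.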

↧

What Is The Best Way To Run Bedtools In Parallel With Blocking

January 15, 2013, 3:27 pm

Say I am working on a server with a shared file system and 4 quad-core nodes (I/O is not an issue, 16 cores total). I want to run coverageBed across 20 files. Currently I have a shell script that does this sequentially. It is possible to just background the commands so they run in parallel, but I am not sure how to block in Bash (the next step requires counting across the files). Assuming I/O is not a bottleneck, what are ways of taking advantage of multiple nodes/cores when running bedtools (or any other sequential commands, for that matter)?

From my rudimentary understanding of parallel programming, the concept I am trying to get at is how to 'block' so that the next command after coverageBed is not executed until all coverageBed runs are done.

I was thinking of wrapping the shell commands in a Python script with a queue of coverageBed commands and a function that feeds commands 4 at a time (since the nodes are quad-core) and returns only when the queue is empty. Is there a better way of doing this?
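
On a single node, the shell's own wait builtin provides this kind of blocking: it returns only after all background jobs started by the script have exited. A minimal sketch follows; run_coverage is a hypothetical stand-in for the real coverageBed call, and the sample files are created only for the demonstration.

```shell
# Run one job per input file in the background, then block until all finish.
# run_coverage is a hypothetical stand-in for the real coverageBed command.
run_coverage() {
    # the real command would write its result here; the sketch just counts lines
    wc -l < "$1" > "$1.cov"
}

workdir=$(mktemp -d)
for i in 1 2 3 4; do
    printf 'chr1\t0\t100\n' > "$workdir/sample$i.bed"
done

for f in "$workdir"/sample*.bed; do
    run_coverage "$f" &      # "&" starts the job without waiting for it
done
wait                         # returns only after every background job has exited

# Safe to run the next step here: all .cov files are complete.
ls "$workdir"/*.cov | wc -l
```

This starts every job at once. To cap the number of simultaneous jobs, xargs with its -P option (available in GNU and BSD xargs) is a common alternative; spreading jobs across several nodes usually requires a cluster scheduler and is outside this sketch.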

↧

Raw Counts From Cufflinks Output

February 13, 2013, 2:30 am

Hi, I want to ask how to get the raw counts from the output of cufflinks. One way to do this is to use the fpkm.

raw counts = FPKM * (length of that transcript/1000) * (# of mapped reads / 1e6)

The FPKM and transcript length are in the Cufflinks FPKM tracking files. But what about the number of mapped reads?

For instance, we have a foo.bam. samtools view -c (-f|-F) flag foo.bam can do this job, but I am not quite sure which flag I should set for single-end or paired-end data.

Thanks!
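
The formula above can be written as a small helper. This is only a sketch of the arithmetic; the function name raw_count and the example numbers are hypothetical.

```shell
# raw_count FPKM LENGTH_BP MAPPED_READS
# Applies: raw counts = FPKM * (length / 1000) * (mapped reads / 1e6)
raw_count() {
    awk -v fpkm="$1" -v len="$2" -v mapped="$3" \
        'BEGIN { printf "%.1f\n", fpkm * (len / 1000) * (mapped / 1e6) }'
}

# Hypothetical transcript: FPKM 10, 2000 bp long, 5 million mapped reads.
raw_count 10 2000 5000000
```

For the denominator, samtools view -c -F 4 foo.bam counts alignments that are not flagged as unmapped. Whether that matches the number Cufflinks used (fragments rather than reads for paired-end data, handling of multi-mapped alignments) is something to verify rather than assume.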

↧

Intersectbed Provides An Empty Output

August 16, 2013, 10:53 pm

Hi,

I've downloaded the recent Cygwin version 1.7.24 and am trying to run BEDTools, but I get an empty file as my output. When I run the same command line and files on a colleague's computer, also through Cygwin, I get a file containing the overlaps I seek. Is the new Cygwin not compatible with BEDTools? I've put the command line we used below:

./intersectbed -a Gene_body.bed -b EdgeR1.bed -wao > yyy.temp

Any help would be appreciated.

↧

Random shuffling of features leaving gene models intact

May 26, 2014, 7:02 am

I am looking for a tool that can randomly shuffle gff features into intergenic regions, but leaving the gene-models 'intact', so that at least all features of a gene are placed on the same contig and related features are placed inside the interval of their parent region. Bedtools shuffle doesn't seem to do that, I am trying:

shuffleBed -i genes.gff3 -excl genes.gff3 -g chromsizes.txt -f 0

This command distributes sub-features to different contigs and leads to invalid gene models. If I add -chrom, features are placed on the same contig, but not all features can be placed at all, and the resulting gene models are still not valid. Does anyone maybe have some R code for this use case?

↧

Bedtools Compare Multiple Bed Files?

October 26, 2011, 5:27 pm

I've been comparing two BED files using the intersectBed -a -b command. I'm wondering, are there any commands in bedtools that can help compare multiple BED files?

Say I have 3 BED files (A, B, C). I want to identify those regions where any two of the three (AB, BC, AC) overlap reciprocally by 50%.

thx

Edit: I just found this post again. Maybe I didn't express myself well a couple of months ago. I mean to find the overlaps that span at least 50% of EACH of the multiple BED files. So I don't quite understand: does cat AB BC AC > ABC.common find the part that overlaps in all three?

I tried to solve the problem myself as follows:

intersectBed -a 2 -b 3 > 23
intersectBed -a 1 -b 3 > 13
intersectBed -a 1 -b 2 > 12

intersectBed -a 1 -b 23 -f 0.50|sort > 23_1
intersectBed -a 2 -b 13 -f 0.50|sort > 13_2
intersectBed -a 3 -b 12 -f 0.50|sort > 12_3

comm -1 -2 23_1 13_2 > test
comm -1 -2 test 12_3 > final_result

I don't know if I'm on the right track. thx
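
To make "reciprocal 50% overlap" concrete, here is a small sketch of the test for a single pair of intervals: the overlap must be at least the given fraction of the length of both intervals. The helper name reciprocal_overlap and the four-column input layout are hypothetical. In intersectBed itself, the -r option combined with -f 0.50 is documented as applying the fraction in both directions.

```shell
# Print pairs of intervals that overlap reciprocally by at least `frac`.
# Input columns (hypothetical layout): startA endA startB endB, same chromosome.
reciprocal_overlap() {
    awk -v frac="${1:-0.5}" '{
        lo = ($1 > $3) ? $1 : $3          # larger of the two starts
        hi = ($2 < $4) ? $2 : $4          # smaller of the two ends
        ov = hi - lo                      # overlap length (<= 0 means none)
        if (ov > 0 && ov >= frac * ($2 - $1) && ov >= frac * ($4 - $3))
            print $0, ov
    }'
}

printf '100 200 150 250\n100 200 190 400\n100 1000 100 200\n' | reciprocal_overlap 0.5
```

Only the first pair passes. The third pair shows why the reciprocal condition matters: the second interval is fully covered, but the overlap is only about 11% of the first.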

↧

Problem With Counting Mapped Reads

March 23, 2014, 5:43 pm

Hi, this is my very first experience analysing RNA-seq data. My goal is to do a differential analysis between two strains of a bacterium. So far, I have managed to align the reads and produce SAM and BAM files, but I'm having problems annotating and counting my reads. Here are the commands that I used. My reads are from SOLiD and hence in colourspace.

$ nohup solid2fastq.pl 291_01_01 291_01_01-bwa  #Convert .csfasta and .qual to .fastq

$ nohup bwa index -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta

$ nohup bwa aln -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta 291_01_01-bwa.singleF3.fastq 291_01_01-bwa.sai

$ perl -ne 'if($_ !~ m/^\S+?\t4\t/){print $_}' 291_01_01-bwa.sam > 291_01_01-bwa.sam.filtered #Drop SAM records whose FLAG is 4 (unmapped reads)

$ samtools sort 291_01_01-bwa.bam 291_01_01-bwa.bam.sorted

$ samtools index 291_01_01-bwa.bam.sorted.bam

To produce the .rpkm file:

$ java -jar ~/bin/bam2rpkm-0.06/bam2rpkm-0.06.jar  -i 291_01_01-bwa.bam.sorted.bam -f Tbrucei427_TriTrypDB-4.0.gff > 291_01_01-bwa.RPKM2.out  # i get an error here
$ERROR: Problem encountered whilst reading gtf file. Could not interpret line 'GeneDB|Tb427_01_v4 EuPathDB supercontig 1

So I tried a different method to count:

$ htseq-count -i ID 291_01_01-bwa.sam Tbrucei427_TriTrypDB-4.0.gff > 291_01_01-bwa.sam_htseq-count #still error
$Error occured when processing GFF file (line 37060 of file Tbrucei427_Tr ...

↧

Can Bedtools/Bedops Used To Extract Regions Where Scores Are Higher Than A Given Value?

June 21, 2013, 3:38 am

I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value? For example, let's say that I have a BED file like the following:

chr7    127471196  127472363  Pos1  12   +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  200  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  120  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  54   +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  2    -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  15   -  127477031  127478198  0,0,255
chr7    127478198  127479365  Neg3  25   -  127478198  127479365  0,0,255
chr7    127479365  127480532  Pos5  2    +  127479365  127480532  255,0,0
chr7    127480532  127481699  Neg4  9    -  127480532  127481699  0,0,255

According to the BED format's specs, the fifth column contains a score between 0 and 1000 (alternatively, in the bedGraph format the score is in the 4th position). If I want to get all the regions that have a score higher than 20, for example, I can do an awk search:

awk '$5 > 20 {print}' mybedfile.bed

However, in order to use awk, I have to keep the BED file in an uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with th ...
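
awk reads standard input, so the file does not have to stay uncompressed on disk: it can be decompressed on the fly and piped into the filter. A minimal sketch with gzip follows; the helper name filter_by_score and the demo file are hypothetical.

```shell
# Keep BED records whose score (5th column) exceeds a threshold.
# Reads from standard input, so the data can be decompressed on the fly.
filter_by_score() {
    awk -v min="$1" '$5 > min'
}

# Demonstration with a gzip-compressed copy of two records from the example.
demo_bed=$(mktemp)
printf 'chr7\t127471196\t127472363\tPos1\t12\t+\nchr7\t127472363\t127473530\tPos2\t200\t+\n' |
    gzip -c > "$demo_bed.gz"
gunzip -c "$demo_bed.gz" | filter_by_score 20
```

The same pattern should apply to Starch archives by replacing gunzip -c with BEDOPS's unstarch, which writes BED to standard output; that variant is untested here.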

↧

Tutorial: Piping With Samtools, Bwa And Bedtools

April 26, 2012, 4:14 pm

In this tutorial I will introduce some concepts related to Unix piping. Piping is a very useful way to avoid creating single-use intermediate files. It is assumed that bedtools, samtools, and bwa are installed. Let's begin with a typical command sequence for paired-end mapping with bwa (./ means look in the current directory only):

# -t 4 is for using 4 threads/cores
bwa aln -t 4 ./hg19.fasta ./s1_1.fastq > ./s1_1.sai
bwa aln -t 4 ./hg19.fasta ./s1_2.fastq > ./s1_2.sai
bwa sampe ./hg19.fasta ./s1_1.sai ./s1_2.sai ./s1_1.fastq ./s1_2.fastq > s1.sam

Suppose we wish to compress the SAM file to BAM, sort it, remove duplicates, and create a BED file:

samtools view -Shu s1.sam > s1.bam
samtools sort s1.bam s1_sorted
samtools rmdup -s s1_sorted.bam s1_sorted_nodup.bam
bamToBed -i s1_sorted_nodup.bam > s1_sorted_nodup.bed

The workflow above creates many files that are only used once (such as s1.bam), and we can use the Unix pipe to reduce the number of intermediate files created. The pipe operator is the character | and what it does is ...

↧

Getting Rna Sequences From Gff And Fa Files

August 24, 2013, 7:30 am

Hi. I have a folder full of .fa files, and a .gff. The GFF file contains information about which loci look like they code for RNA sequences. The .fa files contain the DNA sequences for a set of human chromosomes. I want to get all the sequences which code for RNA, as defined by the GFF file, out of the DNA in the FASTA files. I also have a file telling me which RNA types have higher priority (lincRNA is higher priority than miRNA, for example); this tells me which are more important and how I should decide between RNAs for overlapping features in the GFF.

I have been trying to code my own little program in F# that will read these files and give me each RNA read defined in the gff, and its corresponding DNA. However I am a bit confused about how it works. Do the start and end of each feature in the gff file define a character in the corresponding .fa file? Are they 1 or 0 indexed? Does it matter what strand they are ('+' or '-') for my purposes?

Ultimately my goal is to get a bunch of RNAs with their corresponding types (miRNA, lincRNA, snRNA... etc) to do some computations on.

My question is this: what is the easiest way to get it out of the data I have?

The data I am using is freely available here: http://wanglab.pcbi.upenn.edu/coral/ under the heading "Annotation packages" if anyone is interested or needs specifics.

Thank you!
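
On the coordinate question: GFF start and end positions are 1-based and inclusive, so a feature from 101 to 122 covers 22 bases, namely the 101st through 122nd characters of the sequence (ignoring the FASTA header line and line breaks). BED uses 0-based starts with an exclusive end. Strand matters too, since a '-' feature has to be reverse-complemented. A minimal conversion sketch follows; the helper name gff_to_bed and the demo record are hypothetical.

```shell
# Convert GFF features (1-based, inclusive) to BED (0-based start, exclusive end).
# Output columns: chrom, start, end, feature type, score placeholder, strand.
gff_to_bed() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { next }                       # skip GFF comment and header lines
        NF >= 8 { print $1, $4 - 1, $5, $3, ".", $7 }'
}

# Hypothetical miRNA feature covering bases 101..122 on the minus strand.
printf '##gff-version 3\nchr1\tdemo\tmiRNA\t101\t122\t.\t-\t.\tID=mir1\n' | gff_to_bed
```

The resulting intervals can then be passed to a sequence extractor. bedtools getfasta with its -s option is one candidate for strand-aware extraction, though its exact behaviour should be checked against the installed version.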

↧

How To Install Bedtools In A User Directory

June 25, 2013, 7:55 pm

I am trying to install Bedtools in a user directory, however I looked at the manual for its makefile, and there is no such argument like "--prefix" for me to change. Is there a way to install all Bedtools in a directory that I specify? Thanks!

↧

Reporting The Bam Reads Overlapping A Set Of Intervals With Bedtools

November 8, 2011, 1:51 am

I am trying to use bedtools to pull out the reads falling directly within a set of BED coordinates. While this command does it successfully:

intersectBed -abam mybam.bam -b intervals.gff -wa -wb -f 1 | coverageBed -abam stdin -b intervals.gff

I find that it loses key information that I need. I'd like to get a listing of the BAM reads -- getting at least their ID -- split by exon. In other words, all the read IDs that fall into the first interval in intervals.gff, all the read IDs that fall into the second interval in intervals.gff... ideally, it would also report the CIGAR string for these reads, but I'd settle for just the ID.

Is there a way to report these reads, such that it's easy to tell from the output which set of reads landed in a given interval in the input BED file?

Thank you.

↧

Reproduce Encode/Cshl Long Rna-Seq Data Visualization Viewed In Ucsc, But Failed? [Done]

October 5, 2012, 12:53 am

Motivation

The ENCODE data has come out, and luckily they provide both .bam and .bigwig files. So it occurred to me to try to reproduce the data visualization with BEDTools and other related tools.

Result

I'll first show the difference between my version and the official version. Top to bottom:

Black: my-version-POSitive-strand.bigwig
Blue: Official-version-POSitive-strand.bigwig
Red: Official-version-REVerse-strand.bigwig
Grey: my-version-REVerse-strand.bigwig

From the image, we can see that my version and the official version roughly share the same peaks; however, the peaks in my version are somehow masked by a uniform noise, and it drives me crazy. I know that not all bioinformatics work can be reproduced, but this case does not involve much in the way of algorithms or decisions, so I think it should be reproducible.

Data Set

The ENCODE/CSHL long RNA-seq data set can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/

Here I use the K562 chromatin subcellular fraction (Rep4) as an example:

BAM ...

↧

Samtools or Bedtools: How to filter a bam file with a bed file using strand information

June 5, 2014, 5:29 am

I would like to filter a bam file, keeping only reads overlapping with genomic intervals from a bed file. I used samtools for this:

samtools view -b -h -L bedfile.bed bamfile.bam

However the -L option does not seem to take into account the strand information.

Do you know if there is another option or way to do it that would keep strand information?

↧

Bedtools Multicov Need A Bam Index File Specification Option

May 28, 2013, 2:02 am

bedtools multicov (version 2.16.2) computes coverage across multiple samples for a given feature file (GTF or BED).

Usage: bedtools multicov -bams aln1.bam aln2.bam .. -bed captureRegion.bed > out.coverage

The official documentation mentions that the input BAM files should be sorted and indexed, but it does not give the details. Suppose the BAM file is named sample1.bam; then the index file must be named sample1.bam.bai (not sample1.bai), otherwise multicov will report an error: indexes not found.

I think it would be better to add an option that allows the user to specify the BAM index files, or the suffix used for these index files.

↧

Counting Features In A Bed File

November 22, 2012, 4:02 am

I have a file in the following BED format

Chr1 1022071 1022105  +
Chr1 1022071 1022105  +
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -

I am trying to get the counts of each feature represented in this file.

mergeBed -i R5_chr.bed -n -s -d 0 > Output/R5_chr_counts.bed

I am interested in the counts of the features, and I do not want to merge features that are any number of base pairs apart. The output should be as follows:

Chr1 1022071 1022105 2 +
Chr1 1022072 1022106 4 -

Any suggestions on how to achieve this using bedtools or in bash or awk? Thanks in advance!
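
Since only exact duplicates need counting, one option is to skip merging altogether and tally identical records with awk. A minimal sketch follows; the helper name count_features is hypothetical, and the strand is carried through from the input.

```shell
# Count identical (chrom, start, end, strand) records without merging anything.
# Output columns: chrom, start, end, count, strand.
count_features() {
    awk '{ n[$1 " " $2 " " $3 " " $4]++ }
         END { for (k in n) { split(k, f, " "); print f[1], f[2], f[3], n[k], f[4] } }' |
    sort -k1,1 -k2,2n
}

printf 'Chr1 1022071 1022105  +\nChr1 1022071 1022105  +\nChr1 1022072 1022106  -\nChr1 1022072 1022106  -\nChr1 1022072 1022106  -\nChr1 1022072 1022106  -\n' |
    count_features
```

Because the key includes the strand, two features with the same coordinates on opposite strands are counted separately.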

↧

Bedtools To Compare A Vcf File From Samtools Mpileup With Dbsnp?

December 1, 2011, 7:43 pm

Hello,

I have one big VCF file, generated by samtools mpileup, comparing 6 cell lines to see whether there are SNP differences between them.

I would like to use bedtools to intersect it. How can I do that? Do you have any scripts for this?

Thanks

↧