bowtie2 -q -a -p 1 -x Multi -1 R100_1.fq -2 R100_2.fq -U 100_Orph.fastq -S 100.sam
samtools view -b -S 100.sam -o 100.bam
coverageBed -abam 100.bam -b BED_RefSeq >>100.cvg
The coverageBed output for genome ("307679329") is:

307679329       1       25751   72      3568    25750   0.1385631

But when I index genome ("307679329") separately, the coverageBed output is:

307679329       1       25751   449     8369    25750   0.3250097

Can someone explain this difference?

↧

GTF2/GFF3 "feature" types and expression analysis

April 16, 2014, 3:00 pm


Hi, I aligned a few samples using STAR to the genome provided in the Illumina iGenomes UCSC hg19 bundle (here), using the provided gene feature (gtf2) file as is. My goal is to calculate gene and isoform expression levels at the same time using bedtools multicov. Using the gtf2 file produces a file containing read counts per exon. I want gene and isoform read counts too, so I converted the gtf2 file to a gff3 file using the gtf2gff3 script from SO/GAL (here). My first question: is it OK if the alignment is performed with the gtf2 file but reads are counted using the gff3 file, given that the gff3 file was converted from the gtf2 file? My second question: I have read both these resources (here and here) but do not understand the differences between:

exon vs CDS
transcript vs mRNA

I know that with the process I described, it is possible to retrieve gene read count by selecting only the lines where feature=gene from the bedtools multicov output. What must I do for isoforms? I am confused by the semantics. Thanks ahead of time and let me know if my post was not clear enough. ...

↧

Calculate reciprocal overlap for thousands of samples

July 15, 2014, 10:20 am


I have around 20k samples with BED files. How can I calculate reciprocal overlap for each segment? I want to find all segments with 50% reciprocal overlap or better.
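
As a starting point, here is a minimal sketch of what "50% reciprocal overlap" means for a single pair of segments: the shared bases must cover at least half of each segment. The function names, the tuple layout, and the example coordinates are assumptions for illustration. The pairwise loop is brute force and would not scale to 20k samples as written; for real data, bedtools intersect with -f 0.5 -r applies the same criterion.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions (half-open intervals)."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def reciprocal_pairs(segs_a, segs_b, min_frac=0.5):
    """Yield (a, b) pairs on the same chromosome that meet the threshold.

    Brute force for clarity only; every segment is compared with every other.
    """
    for a in segs_a:
        for b in segs_b:
            if a[0] == b[0] and reciprocal_overlap(a[1], a[2], b[1], b[2]) >= min_frac:
                yield a, b


# Hypothetical segments from two samples: (chrom, start, end).
sample1 = [("chr1", 100, 200), ("chr1", 500, 900)]
sample2 = [("chr1", 150, 250), ("chr1", 520, 560)]
hits = list(reciprocal_pairs(sample1, sample2))
print(hits)
```

Note that the second pair is rejected even though one segment lies entirely inside the other: the overlap covers all of the small segment but only 10% of the large one.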

↧

How To Install Bedtools In A User Directory

June 25, 2013, 7:55 pm


I am trying to install Bedtools in a user directory. However, I looked at the manual for its makefile, and there is no argument like "--prefix" that I can change. Is there a way to install all of Bedtools in a directory that I specify? Thanks!

↧

Bedtools: Top N Most Similar Regions When Comparing Two Bed/Wig/Bam Files?

February 13, 2012, 1:51 pm


Is there an easy way of finding, probably with bedtools, given a window size, the top N most correlated regions when comparing two bed/wig files? For example, in comparing two bed/wig/bam files that have PolII data for 2 conditions, to give the top N windows where the wiggle profiles are most similar?

↧

Intersectbed - Overlap Analysis Using Vcf And Bed Files

July 12, 2012, 2:04 pm


I am trying to do an overlap analysis between 200 Danish exomes (VCF courtesy: Zev) and 10 different gene regions.
I would like to know what percentage overlaps between my regions of interest (in mygenes.bed, a total of 36 lines representing the regions) and a VCF file (Danish_*.flt.vcf.gz).

I tried this command and got a result: intersectBed -a Danish1.flt.vcf.gz -b mygenes.bed > D1result.txt

Danish1.flt.vcf.gz: here mygenes.bed: here D1overlapped.txt: here

My assumption is that the output should have at most as many lines as mygenes.bed. But in many instances I am getting more than 36 lines of output. Maybe I am missing something important, or maybe another tool or option in bedtools can do this task more efficiently. Please let me know your thoughts.
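
One likely explanation, offered as a sketch rather than a diagnosis of these particular files: with the VCF given as -a, intersectBed reports one line per overlapping VCF record, so a region containing several variants contributes several lines. The toy example below uses made-up coordinates and a hypothetical helper, regions_hit, to show how the number of overlaps can exceed the number of regions while the fraction of regions hit stays at or below 100%. With bedtools itself, passing the BED file as -a together with -u reports each region at most once.

```python
def regions_hit(variants, regions):
    """Count the variant positions falling inside each region.

    variants: (chrom, pos) with 1-based VCF positions.
    regions:  (chrom, start, end) with 0-based, half-open BED coordinates.
    """
    counts = {region: 0 for region in regions}
    for chrom, pos in variants:
        for region in regions:
            # A 1-based position p lies in a BED interval when start < p <= end.
            if region[0] == chrom and region[1] < pos <= region[2]:
                counts[region] += 1
    return counts


# Made-up data: three regions, six variants.
regions = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 0, 50)]
variants = [("chr1", 101), ("chr1", 150), ("chr1", 200),
            ("chr1", 201), ("chr2", 50), ("chr3", 5)]

counts = regions_hit(variants, regions)
total_overlaps = sum(counts.values())  # lines a per-variant report would have
fraction_hit = sum(1 for n in counts.values() if n) / len(regions)
print(total_overlaps, fraction_hit)
```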

↧

Picking Random Genomic Positions

July 9, 2012, 5:11 am


I do have a set of TF binding coordinates and want to see if there is any significant overlap with an open chromatin annotation.

Example of TF coord:
chr1 19280 19298
chr1 245920 245938
chr2 97290 97308
chr9 752910 752938
...

Example of open chrom. coord. (UCSC track):
chr2 33031543 33032779
chr3 2304169 2304825
chr5 330899 330940
...

I have checked the intersection with Bedtools (open chrom. coord. vs TF coord. -/+ 100bp) and now I want to check the intersection between random genomic coordinates and open chrom.

The idea is to:

Pick random genomic position (from the same chromosome as TF coordinate);
-/+9bp (binding site size);
-/+ 100bp;
Run this simulation for 1000 times (TF x 1000);
Bedtools;

Any ideas how I can run this simulation to pick random genomic positions from the same chromosome? I know a little bit of bash and Perl, but won't be able to write the script by myself.
Is it possible to measure the length of every chromosome, then pick the TF's chromosome and from its length draw a random number that would represent a genomic position?

Can someone help me with the simulation and the pipeline?
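
The steps listed above can be sketched roughly as follows. This is only an outline under stated assumptions: the chromosome lengths in CHROM_SIZES are placeholders, not real assembly values, and the function names are made up for the example. Each round draws one random window per TF site on the same chromosome, sized as the binding site plus 100 bp on each side; each round's windows could then be written out as BED and intersected with the open chromatin track. bedtools shuffle with its -chrom option is an existing alternative for keeping shuffled features on their original chromosome.

```python
import random

# Hypothetical chromosome lengths for illustration only; real values would
# come from a chromosome-sizes table for the assembly in use.
CHROM_SIZES = {"chr1": 1_000_000, "chr2": 800_000, "chr9": 900_000}


def random_window(chrom, site_len, flank, rng):
    """Pick a random window of site_len plus flank on each side, inside chrom."""
    total = site_len + 2 * flank
    start = rng.randrange(CHROM_SIZES[chrom] - total + 1)
    return chrom, start, start + total


def simulate(tf_sites, n_rounds=1000, flank=100, seed=0):
    """Return n_rounds lists, each holding one random window per TF site."""
    rng = random.Random(seed)
    return [
        [random_window(chrom, end - start, flank, rng) for chrom, start, end in tf_sites]
        for _ in range(n_rounds)
    ]


tf_sites = [("chr1", 19280, 19298), ("chr2", 97290, 97308), ("chr9", 752910, 752938)]
rounds = simulate(tf_sites, n_rounds=3)
for window in rounds[0]:
    print(*window, sep="\t")
```

Fixing the seed makes a run repeatable, which helps when comparing the observed overlap against the simulated distribution.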

↧

Reporting The Bam Reads Overlapping A Set Of Intervals With Bedtools

November 8, 2011, 1:51 am


I am trying to use bedtools to pull out the reads falling directly within a set of BED coordinates. While this command does it successfully:

intersectBed -abam mybam.bam -b intervals.gff -wa -wb -f 1 | coverageBed -abam stdin -b intervals.gff

I find that it loses key information that I need. I'd like to get a listing of the BAM reads -- getting at least their ID -- split by exon. In other words, all the read IDs that fall into the first interval in intervals.gff, all the read IDs that fall into the second interval in intervals.gff... ideally, it would also report the CIGAR string for these reads, but I'd settle for just the ID.

Is there a way to report these reads, such that it's easy to tell from the output which set of reads landed in a given interval in the input BED file?

Thank you.

↧

Reproduce Encode/Cshl Long Rna-Seq Data Visualization Viewed In Ucsc, But Failed? [Done]

October 5, 2012, 12:53 am


Motivation

The ENCODE data has come out, and luckily they provide both .bam and .bigwig files. So it occurred to me to try to reproduce the data visualization with BEDtools and other related tools.

Result

I'll first upload the difference between my version and the official version. Top to bottom:

Black: my-version-POSitive-strand.bigwig
Blue: Official-version-POSitive-strand.bigwig
Red: Official-version-REVerse-strand.bigwig
Grey: my-version-REVerse-strand.bigwig

From the image, we can see that my-version data and official-version data roughly share the same peaks; however, my-version peaks are somehow masked by some uniform noise. And it drives me crazy. Note that I know not all bioinformatics work can be reproduced, but this issue does not involve many algorithms, decisions, etc. Therefore, I think it should be reproducible.

Data Set

The ENCODE/CSHL long RNA-seq data set can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/

Here I use the K562 chromatin subcellular fraction (Rep4) as an example:

BAM ...

↧

How To Check Whole Genome With Bigwigsummary ?

March 30, 2012, 11:33 am


Hi,

I have a question about the bigWigSummary tool.

I have my start and end positions and my bigWig file, but I want to check the whole genome instead of going chromosome by chromosome. Is there an option to use this tool that way?

I know that for each chromosome I have to use:

bigWigSummary -type=X bigwigfile chrN start end datapoints

I want to check from chr1 to chrX.

Thanks in Advance.

↧

Error In Bedtools Getfasta: Chromosome Not Found

February 7, 2013, 9:31 am


Hi, I am trying to use BEDtools to get some sequences from genomic coordinates, but I am getting an error saying "WARNING. chromosome (chr12) was not found in the FASTA file. Skipping." for each read that I have in my BED file. Here are some details about what I am doing. I just downloaded what I think is the latest version of BEDtools, bedtools-2.17.0. I have 2 different files (much longer than the small parts shown here). A FASTA file with all the chromosome sequences:

>chr01
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

and a BED file with my genomic coordinates (already sorted):

chr01 187814 190840
chr01 307073 310104
chr01 701047 704068
chr01 702941 705962
chr01 702952 705972
chr01 867716 870740
chr01 914064 917087
chr01 991080 994104
chr01 1039795 1042815
chr01 1058713 1061736

Then I run the command line:

bedtools getfasta -fi all.con -bed 1-13sorted2.bed -fo NewCandidates/Genomiccoordinates/1-13_1500.fa

The only thing that I get is "WARNING. chromosome (chr01) was not found in the FASTA file. Skipping.", thousands of tim ...

↧

How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns

January 27, 2013, 2:27 am


I extracted sequences with fastaFromBed and have no complaints about BEDTools, which is a really awesome thing.

However, the extracted sequences look like this:

>chr19:13985513-13985622
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG

And my input file looks like this:

>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt

I was using this:

fastaFromBed -fi input -bed seq.bed -fo output

So shouldn't those sequences be written in FASTA format (as NCBI says, "It is recommended that all lines of text be shorter than 80 characters in length"), or at least with the same line length as my input file?

What am I doing wrong that I get linearized (FASTA?) output from fastaFromBed?
What is the quickest way to wrap those linear sequences into nicely formatted fixed-width lines from the command line?
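
For the second question, re-wrapping is simple enough to sketch. The helper below, wrap_fasta, is a made-up name and the width of 60 is an arbitrary choice; it joins each record's sequence and re-emits it in fixed-width lines while leaving header lines alone. On the command line, fold -w 60 does the same wrapping, with the caveat that it would also wrap any header line longer than the width.

```python
def wrap_fasta(lines, width=60):
    """Yield FASTA lines with each record's sequence wrapped to a fixed width."""
    seq = []

    def flush():
        joined = "".join(seq)
        seq.clear()
        for i in range(0, len(joined), width):
            yield joined[i:i + width]

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            yield from flush()  # emit the previous record's sequence, if any
            yield line
        elif line:
            seq.append(line)
    yield from flush()


# Tiny made-up record, wrapped at 10 characters so the effect is visible.
linear = [">example:1-25", "ACGTACGTACGTACGTACGTACGTA"]
wrapped = list(wrap_fasta(linear, width=10))
print("\n".join(wrapped))
```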

↧

Bedtools Compare Multiple Bed Files?

October 26, 2011, 5:27 pm


I've been comparing two BED files using the intersectBed -a -b command. I'm just wondering, are there any commands in Bedtools that can help compare multiple BED files?

Say I have 3 BED files (A, B, C). I want to identify the regions where any two of the three (AB, BC, AC) overlap reciprocally by 50%.

thx

edit: I just found this post again. Maybe I didn't express myself well a couple of months ago. I mean to find the overlaps that span at least 50% of EACH of the multiple BED files. So I don't quite understand: does cat AB BC AC > ABC.common find the part that overlaps in all three?

I tried to solve the problem myself as below:

intersectBed -a 2 -b 3 > 23
intersectBed -a 1 -b 3 > 13
intersectBed -a 1 -b 2 > 12

intersectBed -a 1 -b 23 -f 0.50|sort > 23_1
intersectBed -a 2 -b 13 -f 0.50|sort > 13_2
intersectBed -a 3 -b 12 -f 0.50|sort > 12_3

comm -1 -2 23_1 13_2 > test
comm -1 -2 test 12_3 > final_result

I don't know if I'm on the right track. thx

↧

Correlation Of Fpkm And Length Normalized Transcript Mapped Read Count

August 20, 2012, 10:36 am


Hello, in the process of estimating expression for a 16 human tissue dataset ("Human Body Map 2.0 GSE30611") I used different methods to estimate the expression of the genes. After mapping against hg19 genome version, I used the UCSC provided refseq annotation for hg19 to count mapped reads for ~40,000 human genes in two ways:

1. Counting with cufflinks, which outputs a Fragments Per Kilobase Per Million mapped fragments (FPKM) value for each transcript. The FPKM value basically accounts for library size and for the length of the transcript comprising all the annotated exons, plus an additional likelihood estimator to assign reads (see here).
2. Counting mapped reads with bedtools and dividing a transcript's mapped count by the sum of all its exon lengths. This gives a length-normalized expression estimate for comparing genes.

However, the correlation of (1.) and (2.) is always around 0.65 between the same tissues (technically the same experiment). I would expect this correlation to be > 0.9. Below, I plotted (2.) against (1.) for all ~40,000 transcripts. It seems like plain length normalization is simply overestimating some expression. Can someone she ...
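
One way to reason about the two estimates, as a hedged sketch using the textbook FPKM formula and made-up numbers (this is not Cufflinks' actual estimator): if both methods assigned exactly the same reads to each transcript, FPKM and count/length would differ only by a constant factor within a sample, and their correlation would be perfect. A correlation near 0.65 would then suggest that the two methods assign reads differently, for example among overlapping isoforms, rather than that the length normalization itself is off.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


def length_normalized(count, length_bp):
    """Mapped count divided by the summed exon length."""
    return count / length_bp


# Hypothetical counts and summed exon lengths for three transcripts.
counts = {"txA": 500, "txB": 1200, "txC": 30}
lengths = {"txA": 2000, "txB": 6000, "txC": 900}
total = sum(counts.values())

# With identical counts, the ratio of the two measures is the same constant
# (1e9 / total) for every transcript.
ratios = {
    tx: fpkm(counts[tx], lengths[tx], total) / length_normalized(counts[tx], lengths[tx])
    for tx in counts
}
print(ratios)
```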

↧

Getting Unmapped Reads: Comparing Fastq To Bam

December 4, 2011, 6:02 pm


Given a FASTQ file and a BAM file of aligned reads, is there an efficient way to get all reads that are in the original FASTQ but not in the BAM, perhaps using bedtools? For example:

unmapped_script original.fastq aligned.bam > unmapped.fastq

should create an unmapped.fastq file, which is a subset of original.fastq containing only those entries that do not appear in aligned.bam

thank you.
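
A rough sketch of the filtering step, under several assumptions: the IDs of the aligned reads have already been collected into a set (for instance from the first column of samtools view output), the FASTQ uses plain 4-line records, and a trailing /1 or /2 on the read name should be ignored when matching. The function names are made up for the example. If the aligner kept unmapped reads in the BAM, samtools view -f 4 would list them directly and no comparison would be needed.

```python
def read_fastq(lines):
    """Yield (read_id, record) from an iterable of 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        record = [header.rstrip("\n")] + [next(it).rstrip("\n") for _ in range(3)]
        # ID: header without '@', up to the first whitespace, minus any /1 or /2.
        read_id = record[0][1:].split()[0]
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        yield read_id, record


def unmapped_records(fastq_lines, aligned_ids):
    """Yield FASTQ records whose read ID is absent from aligned_ids."""
    for read_id, record in read_fastq(fastq_lines):
        if read_id not in aligned_ids:
            yield record


# Made-up reads; in practice fastq_lines would be an open file handle.
fastq_lines = [
    "@r1", "ACGT", "+", "IIII",
    "@r2/1", "GGCC", "+", "IIII",
    "@r3 extra", "TTAA", "+", "IIII",
]
aligned_ids = {"r1", "r3"}
unmapped = list(unmapped_records(fastq_lines, aligned_ids))
print(unmapped)
```

Holding only the set of aligned IDs in memory keeps the FASTQ pass itself streaming.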

↧

How To Explain Uneven Coverage Of A Dna Segment Obtained Via Pcr Amplification.

April 8, 2014, 9:43 am


Experiment: deep sequencing for mutants in a 700 nt fragment.

The DNA fragment was preamplified with primers flanking the fragment, followed by HiSeq sequencing.

Per-base coverage was calculated by coverageBed -d -abam in.bam -b ref.bed > out.cov

Observation: two distinct peaks in coverage at the ends, as in the plot below (coverage vs. position).


The peaks are made up of reads containing part of the primers, which therefore also show soft clipping at their ends.

There is a huge difference in the calculations depending on whether I include or exclude such reads.

Question: does anyone know how to handle such a situation?

↧

Memory Efficient Bedtools Sort And Merge With Millions Of Entries?

May 8, 2013, 6:52 am


I would like to know if there is a memory-efficient way of sorting and merging a large number of BED files, each containing millions of entries, into a single BED file in which duplicated or partially overlapping entries are merged so that each is unique.

I have tried the following but it blows up in memory beyond the 32G I have available here:

find /my/path -name '*.bed.gz' | xargs gunzip -c | ~/src/bedtools-2.17.0/bin/bedtools sort | ~/src/bedtools-2.17.0/bin/bedtools merge | gzip -c > bed.all.gz

Any suggestions?
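
One idea, sketched under the assumption that each input can first be sorted on its own by chromosome and start (for example with GNU sort -k1,1 -k2,2n, which spills to disk rather than holding everything in memory): once every stream is sorted, the streams can be combined lazily and overlapping intervals merged in a single pass, keeping only one pending interval per stream in memory. The function name and the example intervals are made up, and this sketch treats touching intervals as mergeable.

```python
import heapq


def merge_sorted_streams(streams):
    """Merge intervals from streams that are each sorted by (chrom, start).

    heapq.merge pulls lazily from every stream, so memory use does not grow
    with the total number of intervals.
    """
    current = None
    for chrom, start, end in heapq.merge(*streams):
        if current and chrom == current[0] and start <= current[2]:
            # Overlapping or touching: extend the pending interval.
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                yield current
            current = (chrom, start, end)
    if current:
        yield current


# Two made-up, already sorted inputs standing in for sorted BED files.
file_a = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 10, 20)]
file_b = [("chr1", 100, 200), ("chr1", 400, 500), ("chr2", 15, 40)]
merged = list(merge_sorted_streams([iter(file_a), iter(file_b)]))
print(merged)
```

The chromosome order only has to be consistent across inputs; here it is plain string order, which matches what a lexical sort on the first column produces.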

↧

Can Bedtools/Bedops Be Used To Extract Regions Where Scores Are Higher Than A Given Value?

June 21, 2013, 3:38 am


I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value? For example, let's say that I have a BED file like the following:

chr7    127471196  127472363  Pos1  12   +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  200  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  120  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  54   +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  2    -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  15   -  127477031  127478198  0,0,255
chr7    127478198  127479365  Neg3  25   -  127478198  127479365  0,0,255
chr7    127479365  127480532  Pos5  2    +  127479365  127480532  255,0,0
chr7    127480532  127481699  Neg4  9    -  127480532  127481699  0,0,255

According to the BED format specs, the fifth column contains a score between 0 and 1000 (alternatively, in the bedGraph format the score is in the 4th column). If I want all the regions with a score higher than 20, for example, I can do an awk search:

awk '$5 > 20 {print}' mybedfile.bed

However, in order to use awk, I have to keep the BED file in an uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with th ...
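
For plain gzip compression at least, the file does not need to be stored uncompressed: it can be decompressed as a stream and filtered line by line (gunzip -c piped into the same awk test works the same way). Below is a minimal sketch of that idea; the function name, the temporary file, and the three sample rows are assumptions for the demo, the input is assumed to have no track or browser header lines, and the .starch format is not covered here.

```python
import gzip
import os
import tempfile


def filter_by_score(path, min_score, column=5):
    """Yield lines of a gzip-compressed BED file whose score exceeds min_score."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= column and float(fields[column - 1]) > min_score:
                yield line.rstrip("\n")


# Demo on a small temporary file holding three of the rows shown above.
rows = [
    "chr7\t127471196\t127472363\tPos1\t12\t+",
    "chr7\t127472363\t127473530\tPos2\t200\t+",
    "chr7\t127475864\t127477031\tNeg1\t2\t-",
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mybedfile.bed.gz")
    with gzip.open(path, "wt") as out:
        out.write("\n".join(rows) + "\n")
    kept = list(filter_by_score(path, 20))
print(kept)
```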

↧