How To Combine Fpkm Values From Cufflinks With Contigs From De Novo Assembly Program Velvet/Oases?

November 23, 2011, 2:32 pm

Hi all,

I am working on RNA-seq data analysis. I've finished running Tophat and Cufflinks to get FPKM values for each transcript from Illumina paired-end sequencing. In parallel, I've run Velvet to get contig sequences through de novo assembly, and Gmap to see whether the assembled sequences map to the reference genome (this reference genome is not complete for now, but somewhat useful). Now I am trying to combine all the information so that I have the sequence of each contig together with the FPKM value corresponding to that contig. Some suggested that I convert the Cufflinks and Gmap outputs to BED files and then use intersectBed to see if there's any overlap. However, I am not sure how to keep all the information in the output from Bedtools. By default, intersectBed seems to report only the overlapped region with the 'A' file as a template, so I couldn't see any information from the 'B' file. Is there any solution for this? Please let me know. I would appreciate your suggestions!
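For what it's worth, intersectBed has -wa and -wb options that write the original A and B entries for each overlap. The idea behind that kind of output can be sketched in a few lines of Python. This is only an illustration: the record layout, contig names and FPKM values below are invented, not real Cufflinks or Gmap output.

```python
# Sketch: pair every A record with every overlapping B record and keep all
# columns from both. All records below are made-up examples.

def overlaps(a, b):
    """True if two BED-style records (chrom, start, end, ...) share >= 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def intersect_keep_both(a_records, b_records):
    """Return one joined row per overlapping (A, B) pair."""
    return [a + b for a in a_records for b in b_records if overlaps(a, b)]

# Hypothetical Cufflinks transcripts with the FPKM value in the last column
cufflinks = [("chr1", 100, 500, "CUFF.1", 12.5),
             ("chr1", 900, 1200, "CUFF.2", 3.1)]
# Hypothetical genome placements of assembled contigs
contigs = [("chr1", 450, 700, "contig_7"),
           ("chr2", 100, 300, "contig_9")]

joined = intersect_keep_both(cufflinks, contigs)
for row in joined:
    print("\t".join(str(x) for x in row))
```

The nested loop is quadratic, which is fine for a toy example but not for real data; a sorted sweep or an interval tree would be the usual replacement.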

↧

GTF2/GFF3 "feature" types and expression analysis

April 16, 2014, 3:00 pm

Hi, I aligned a few samples using STAR to the genome provided in the Illumina iGenomes UCSC hg19 bundle (here) -- I used the provided gene feature (gtf2) file as is. Now, my aim is to calculate the gene and isoform expression levels using bedtools multicov (at the same time). Use of the gtf2 file produces a file containing read counts per exon. I wish to compute gene and isoform read counts too, so I converted the gtf2 file to a gff3 file using the gtf2gff3 script from SO/GAL (here). My first question is: Is it OK if the alignment is performed with the gtf2 file but the reads are counted using the gff3 file, keeping in mind that the gff3 file was converted from the gtf2 file? My second question follows. I have read both these resources (here and here) but do not understand the differences between:

exon vs CDS
transcript vs mRNA

I know that with the process I described, it is possible to retrieve gene read count by selecting only the lines where feature=gene from the bedtools multicov output. What must I do for isoforms? I am confused by the semantics. Thanks ahead of time and let me know if my post was not clear enough. ...
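One possible way to get from per-exon counts to per-isoform numbers, sketched in Python. It assumes tab-separated GTF2 exon lines that carry gene_id and transcript_id attributes, with a single read count appended as the last column; that layout and the example lines are assumptions, so check them against your own output. Note also that a read spanning two exons is counted once per exon, so this is a rough tally rather than a proper isoform quantification.

```python
import re
from collections import defaultdict

# Matches GTF2-style attributes such as: gene_id "G1"; transcript_id "T1";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

def sum_exon_counts(lines):
    """Sum per-exon read counts up to transcript and gene level."""
    per_transcript = defaultdict(int)
    per_gene = defaultdict(int)
    seen_exons = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or fields[2] != "exon":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        count = int(fields[-1])
        per_transcript[attrs["transcript_id"]] += count
        # An exon shared by several isoforms appears once per transcript;
        # count it only once towards the gene total.
        exon_key = (attrs["gene_id"], fields[0], fields[3], fields[4], fields[6])
        if exon_key not in seen_exons:
            seen_exons.add(exon_key)
            per_gene[attrs["gene_id"]] += count
    return dict(per_gene), dict(per_transcript)

# Invented example: exon 100-200 is shared by isoforms T1 and T2
example = [
    'chr1\tunknown\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\t10',
    'chr1\tunknown\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\t10',
    'chr1\tunknown\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\t5',
]
genes, transcripts = sum_exon_counts(example)
print(genes)
print(transcripts)
```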

↧

macs and bedtools

July 4, 2014, 2:07 pm

Hello

I have MACS2 output and am now looking for peaks that are situated in introns. I have a BED file with introns from UCSC for my species. Which peak file should I use for the bedtools intersection: peak summits (.bed) or narrow peaks (.bed), both from the MACS2 output?

↧

Samtools or Bedtools: How to filter a bam file with a bed file using strand information

June 5, 2014, 5:29 am

I would like to filter a bam file, keeping only reads overlapping with genomic intervals from a bed file. I used samtools for this:

samtools view -b -h -L bedfile.bed bamfile.bam

However the -L option does not seem to take into account the strand information.

Do you know if there is another option or way to do it that would keep strand information?
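One option that may be worth trying is bedtools intersect, which accepts BAM input via -abam and has a -s option that requires overlaps to be on the same strand (the BED file then needs a strand in column 6). The rule itself is easy to state; here is a minimal Python sketch of it, with invented reads and intervals, where the read strand is taken from bit 0x10 of the SAM FLAG.

```python
# Sketch of strand-aware filtering. Reads are (name, chrom, start, end, flag)
# and intervals are (chrom, start, end, strand); all values are invented.

def strand_from_flag(flag):
    """SAM FLAG bit 0x10 set means the read aligned to the reverse strand."""
    return "-" if flag & 0x10 else "+"

def same_strand_overlap(read, interval):
    _name, r_chrom, r_start, r_end, flag = read
    i_chrom, i_start, i_end, i_strand = interval
    return (r_chrom == i_chrom
            and strand_from_flag(flag) == i_strand
            and r_start < i_end and i_start < r_end)

def filter_reads(reads, intervals):
    """Keep reads that overlap at least one interval on the same strand."""
    return [r for r in reads
            if any(same_strand_overlap(r, i) for i in intervals)]

reads = [("r1", "chr1", 100, 150, 0),     # forward, inside the interval
         ("r2", "chr1", 120, 170, 16),    # reverse, same place
         ("r3", "chr1", 1000, 1050, 0)]   # forward, elsewhere
intervals = [("chr1", 90, 200, "+")]

kept = filter_reads(reads, intervals)
print([r[0] for r in kept])
```

For paired-end or strand-specific protocols the strand of interest may be that of the fragment rather than of the individual read, which this sketch does not handle.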

↧

Genomic Regions To Exclude Before Shuffling Intervals

November 20, 2013, 4:01 am

I want to do a permutation test: randomly reposition (shuffle) given genomic intervals and measure the intersection between the new coordinates and a specific genomic element.

Example:

Different sets of genes: protein coding, pseudogenes, ncRNA - intervals that I want to shuffle;
Genomic repeat L1 - coordinates are stable.
For every gene set shuffle intervals, intersect and measure the overlap with L1 (I am using bedtools shuffle - "reposition each feature in the input BED file on a random chromosome at a random position").

Question - Which genomic regions to exclude from the "genome" (bedtools shuffle -g option) before shuffling gene intervals?
I was going to exclude gaps in the assembly.
But what about:

All gene regions.
If I am shuffling pseudogene intervals should I exclude protein coding and ncRNA coordinates?
All non L1 Repeat masker coordinates.
Since Alu, LTR and DNA transposons aren't L1, there won't be any intersection with them?
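To make the effect of the excluded regions concrete, here is a toy version of the permutation test in Python. It is a sketch under strong assumptions: a single chromosome, invented coordinates, and excluded regions handled by redrawing a position until the interval lands outside them. As far as I know, bedtools shuffle takes excluded regions through its -excl option, while -g gives the chromosome sizes.

```python
import random

def overlaps_any(start, end, regions):
    """True if [start, end) overlaps any (start, end) pair in regions."""
    return any(start < r_end and r_start < end for r_start, r_end in regions)

def shuffle_interval(length, chrom_size, excluded, rng, max_tries=1000):
    """Place an interval of the given length at random, outside excluded."""
    for _ in range(max_tries):
        start = rng.randrange(0, chrom_size - length + 1)
        if not overlaps_any(start, start + length, excluded):
            return (start, start + length)
    raise RuntimeError("could not place interval outside excluded regions")

def count_hits(intervals, targets):
    """Number of intervals overlapping at least one target."""
    return sum(overlaps_any(s, e, targets) for s, e in intervals)

def permutation_p_value(intervals, targets, chrom_size, excluded,
                        n_perm=200, seed=0):
    rng = random.Random(seed)
    observed = count_hits(intervals, targets)
    at_least = 0
    for _ in range(n_perm):
        shuffled = [shuffle_interval(e - s, chrom_size, excluded, rng)
                    for s, e in intervals]
        if count_hits(shuffled, targets) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero
    return observed, (at_least + 1) / (n_perm + 1)

# Invented toy data on one 100 kb chromosome
genes = [(1000, 2000), (5000, 5600), (20000, 21000)]
l1 = [(1500, 1800), (5100, 5200), (20500, 20600), (70000, 70500)]
gaps = [(40000, 50000)]

observed, p_value = permutation_p_value(genes, l1, 100000, gaps)
print(observed, p_value)
```

Whatever is passed as `excluded` changes the null distribution, so the choice should reflect where the shuffled feature type could plausibly occur; excluding more regions concentrates the shuffled intervals into less space.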

↧

Heatmap Of Read Coverage Around Tsss

August 20, 2013, 8:53 am

I am trying to plot a heatmap of read density around a feature of interest (TSSs), as is very common in genomics papers, something like this (B). However, I am struggling a bit to get it to look "right". A bit of background: I have mapped ChIP-seq reads for pol2 and calculated the coverage, per nucleotide, using bedtools.

coverageBed -d -abam $bamFile -b $TSSs > $coverage.bed
# output:
chr1    67108226    67110226    uc001dct.3    16    +    1    10
chr1    67108226    67110226    uc001dct.3    16    +    2    10
chr1    67108226    67110226    uc001dct.3    16    +    3    10
chr1    67108226    67110226    uc001dct.3    16    +    4    10
chr1    67108226    67110226    uc001dct.3    16    +    5    8
chr1    67108226    67110226    uc001dct.3    16    +    6    8
chr1    67108226    67110226    uc001dct.3    16    +    7    8
chr1    67108226    67110226    uc001dct.3    16    +    8    8
chr1    67108226    67110226    uc001dct.3    16    +    9    8
chr1    67108226    67110226    uc001dct.3    16    +    10    8

Then in R, the genomic position, in column 7, is converted to a position relative to the TSS, and read counts are normalized to the library size. This is converted to a numeric matrix with each row being a TSS and each column the relative nucleotide position. For plotting, the matrix is ordered by number of reads per TSS, and the values are logged. This is the outcome: heatmap(cov.mlog, Rowv=NA, Colv= ...
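For comparison, the matrix-building steps described above can be sketched in Python rather than R. The assumptions here are mine: column 7 is treated as the 1-based position within each window, minus-strand windows are reversed so that every row reads 5' to 3', counts are scaled to reads per million, and a pseudocount of 1 is added before taking log2. The second TSS and the library size are invented.

```python
import math

def parse(line):
    """Pull (name, strand, position, depth) out of a coverageBed -d line."""
    f = line.split()
    return f[3], f[5], int(f[6]), int(f[7])

def coverage_matrix(lines, library_size, pseudocount=1.0):
    """Return (row names, log2 matrix), rows ordered by total signal."""
    per_tss = {}
    for name, strand, pos, depth in map(parse, lines):
        rec = per_tss.setdefault(name, {"strand": strand, "depth": {}})
        rec["depth"][pos] = depth
    scale = 1e6 / library_size          # reads per million
    rows = []
    for name, rec in per_tss.items():
        width = max(rec["depth"])
        values = [rec["depth"].get(p, 0) * scale for p in range(1, width + 1)]
        if rec["strand"] == "-":
            values.reverse()            # orient every row 5' to 3'
        logged = [math.log2(v + pseudocount) for v in values]
        rows.append((name, sum(values), logged))
    rows.sort(key=lambda r: r[1], reverse=True)
    return [r[0] for r in rows], [r[2] for r in rows]

lines = [
    "chr1\t67108226\t67110226\tuc001dct.3\t16\t+\t1\t10",
    "chr1\t67108226\t67110226\tuc001dct.3\t16\t+\t2\t10",
    "chr2\t500\t502\ttssB\t0\t-\t1\t0",      # invented minus-strand window
    "chr2\t500\t502\ttssB\t0\t-\t2\t30",
]
names, matrix = coverage_matrix(lines, library_size=2_000_000)
print(names)
print(matrix)
```

Heatmaps of this kind often look washed out because a few very high rows dominate the colour scale, so capping values at a chosen quantile before plotting is a common extra step.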

↧

Intersectbed Provides An Empty Output

August 16, 2013, 10:53 pm

Hi,

I've downloaded the recent Cygwin version 1.7.24 and am trying to run bedTools, but I get an empty file as my output. When I run the same command line and files on a colleague's computer, also through Cygwin, I get a file containing the overlaps I seek. Is the new Cygwin not compatible with BedTools? I've put the command line we used below:

./intersectbed -a Gene_body.bed -b EdgeR1.bed -wao > yyy.temp

Any help would be appreciated.

↧

Annotating Genomic Intervals

October 1, 2012, 10:36 am

How can I annotate human genomic intervals (BED file) from a ChIP-seq experiment with information such as whether the interval overlaps with a gene(s)? Upstream of a gene? Overlaps with an exon? Intron? 5kb upstream/downstream of TSS? Intergenic? Does it overlap with a DNAse I hypersensitive site?

Surely bedtools can help me with this, but I'm looking for the best workflow / data sources to use for this that will require the least amount of scripting.

Thanks.

↧

Converting Gff To Bed With Bedtools?

January 20, 2013, 1:51 pm

I use bedtools's sortBed utility to sort BED files for various operations. It takes as input GFF files as well. However, when I feed it a GFF file as in:

sortBed -i myfile.gff

it outputs it as GFF, not BED. Is there a way to make bedtools sort and then convert the result to BED? Many bedtools utilities have a -bed flag. Do I need to use a different subutility of bedtools to achieve this? thanks.
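Whichever tool ends up doing the conversion, the coordinate change is the part to get right: GFF is 1-based with an inclusive end, BED is 0-based with an exclusive end, so only the start shifts by one. A minimal Python sketch follows, assuming nine-column GFF lines; using the attribute column as the BED name and turning a "." score into 0 are arbitrary choices of mine.

```python
def gff_to_bed(line):
    """Convert one GFF line to a BED6 line (start shifts by -1, end stays)."""
    f = line.rstrip("\n").split("\t")
    chrom, _source, _feature, start, end, score, strand, _phase, attrs = f[:9]
    bed_score = score if score != "." else "0"
    return "\t".join([chrom, str(int(start) - 1), end, attrs, bed_score, strand])

def gff_lines_to_sorted_bed(lines):
    """Convert GFF lines to BED6 and sort by chromosome, then start."""
    records = [gff_to_bed(l) for l in lines
               if l.strip() and not l.startswith("#")]
    return sorted(records,
                  key=lambda r: (r.split("\t")[0], int(r.split("\t")[1])))

gff = [
    "##gff-version 3",
    "chr2\tsrc\tgene\t501\t900\t.\t-\t.\tID=g3",
    "chr1\tsrc\tgene\t1001\t2000\t.\t+\t.\tID=g2",
    "chr1\tsrc\tgene\t101\t200\t.\t+\t.\tID=g1",
]
for bed_line in gff_lines_to_sorted_bed(gff):
    print(bed_line)
```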

↧

How To Create A Read Density Profile Within A Interval?

February 22, 2013, 6:06 am

Hi!

I need some help: I have to create a density profile with a specific window size of 1 kb (how many times a sequence is detected by the NGS method). I have to use SAM and BEDtools. I think I can use genomeCov in BEDtools, but I don't have a genome reference.

So, if anybody is able to help me...

Thanks
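A windowed read count does not strictly need a genome file, because the windows can be derived from the alignment positions themselves. Below is a minimal Python sketch that bins reads by their leftmost mapped position into 1 kb windows straight from SAM text; the SAM lines are invented. If sequence lengths are needed, the @SQ header lines of a SAM file carry them in the LN field.

```python
from collections import Counter

def window_counts(sam_lines, window=1000):
    """Count mapped reads per (reference, window index) by leftmost position."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        f = line.split("\t")
        flag, rname, pos = int(f[1]), f[2], int(f[3])
        if flag & 0x4 or rname == "*":    # unmapped read
            continue
        counts[(rname, (pos - 1) // window)] += 1   # SAM POS is 1-based
    return counts

def to_bed_rows(counts, window=1000):
    """Turn window indices back into 0-based, half-open coordinates."""
    return [(rname, idx * window, (idx + 1) * window, n)
            for (rname, idx), n in sorted(counts.items())]

# Invented SAM records
sam = [
    "@SQ\tSN:contig1\tLN:5000",
    "r1\t0\tcontig1\t1\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tcontig1\t1000\t60\t50M\t*\t0\t0\t*\t*",
    "r3\t0\tcontig1\t1001\t60\t50M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
profile = to_bed_rows(window_counts(sam))
for row in profile:
    print(*row, sep="\t")
```

Windows with no reads are simply absent from the output; filling them in with zeros is where the sequence lengths become necessary.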

↧

Help With Exception When Using Bedtools Coveragebed With Paired Alignment. [Resolved]

January 3, 2014, 5:32 am

I use bwa mem to align paired reads to a few hundred microbial contigs; then I sort the alignment and try to get the coverage using bedtools genomecov -ibam alignments.paired.sorted.bam -bg >ranges.txt, which fails with an exception:

*** glibc detected *** bedtools: double free or corruption (out): 0x0000000001c5f270 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3d7b2750c6]
bedtools[0x45ab43]
bedtools[0x45b146]
bedtools[0x45c163]
bedtools[0x45e2ed]
bedtools[0x434c4b]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3d7b21ecdd]

If I run the same on an unpaired alignment, everything is OK. So I am really not sure where my mistake is... maybe bedtools doesn't digest the paired alignment?

-- edit: works with the latest versions of these tools. Here are the ones that failed:

$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.0-r313
Contact: Heng Li <lh3@sanger.ac.uk>

$ bedtools -version
bedtools v2.16.1

↧

Does Bedtools Intersect -V Consider Unmapped Reads "As Not In B"

March 26, 2012, 8:50 pm

bedtools intersect -v -abam my.bam -b myregions.gff > notinmyregions.bam

Would we see reads with 4 in the FLAG field (i.e. unmapped reads) in notinmyregions.bam?
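Independently of what bedtools does here, unmapped reads are easy to recognise: bit 0x4 of the FLAG is set. The FLAG is a bit field, so the test is a bitwise AND, not a comparison with the number 4. A small Python sketch with example flag values follows; the same filter is available on the command line as samtools view -F 4.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped

def is_unmapped(flag):
    """True if the unmapped bit is set, whatever other bits are present."""
    return bool(flag & UNMAPPED)

# Example FLAG values: 77 and 141 are the usual flags for an unmapped pair
flags = [0, 4, 16, 77, 99, 141]
mapped_only = [f for f in flags if not is_unmapped(f)]
print(mapped_only)
```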

↧

How Can I Compare And Merge Bed Files

July 22, 2012, 1:46 pm

I have three bed files with chrNo, start, end position and type. I need to compare each chrNo, start and end position of one file with the 2 other files and write the common ones to a new file. Can anyone suggest how I can do this efficiently? I wrote a simple Perl script, but as the files are huge, it takes a lot of time and thus is not feasible. Thanks in advance.

Example files:

file1.bed:

1 20 30

1 100 120

1 200 300

file2.bed:

1 2 5

1 25 34

1 200 300

file3.bed:

1 30 33

1 200 300

1 500 600

common.bed

1 30 34 --> coordinates overlapping by 5bp are considered the same, but the outermost coordinates of the 3 are taken in the common file

1 200 300

↧

Which Of The Genes Are Enriched With Repeat Elements

November 14, 2013, 3:12 am

I would like to know which of my genes are enriched with repeats of LINE/SINE/ERV etc. elements. I have a BAM file and the repeats in BED format. As far as I know, BAM files contain aligned data for each short read sequence from the fastq file. I am trying to figure out the best approach to find which genes (+- 1000 bp) have more repeat elements. I am thinking about two approaches but am not sure which one is best. Here are the approaches I was thinking of using: a) Shall I convert the BAM file into a BED file and then use bedtools merge, so that I can overlap it with the repeats file using bedtools window with the -c -l -r options? Then I know how many of the repeats overlap or are near the short reads, and I can count this number for each gene. For example,

chr   start  end gene number_of_repeats
chr1 100  200  gene1 70
chr1 190  240  gene1 40
chr1 250  400  gene1 100
chr2 500  600  gene2 150

If I sort and merge them I will get

chr1 100  240  gene1 90
chr1 250  400  gene1 100
chr2 500  600  gene2 150

So gene1 will have 190 (90 + 100) and gene2 will have 150 repeats. Or b) shall I count the number of repeats for each short sequence without any merging? ...

↧

How To Install Bedtools In A User Directory

June 25, 2013, 7:55 pm

I am trying to install Bedtools in a user directory, however I looked at the manual for its makefile, and there is no such argument like "--prefix" for me to change. Is there a way to install all Bedtools in a directory that I specify? Thanks!

↧

Getting Number Of Reads In Intervals With Bedtools

December 14, 2012, 3:29 pm

What is the correct way to get the total number of reads strictly contained in each interval in a GFF from a BAM file while enforcing strandedness? What I am looking for is very close to this intersectBed feature:

-c    For each entry in A, report the number of overlaps with B.
    - Reports 0 for A entries that have no overlap with B.
    - Overlaps restricted by -f and -r.

Except that I'd like the number of overlaps in A for each entry in B (i.e. the other way around). If I do:

intersectBed -abam mybam.bam -b mygff.gff -s -f 1 -wb

Then my understanding is that this will report the entry in B for each overlap with A. But I'd like each entry in B to be outputted exactly once, with the number of reads from A that are contained strictly within it. I'm not sure how to enforce strict containment here.

Is coverageBed the solution to this? Or multicov? I'm not sure how to enforce strict containment using coverageBed - it's not clear to me if that's the default from the docs. Thanks.
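To pin down what "strictly contained, same strand, one line per feature" means, here is the counting rule as a small Python sketch. Coordinates are assumed to be 0-based and half-open (so GFF starts would need 1 subtracted first), and the features and reads are invented. With the reads as A, -f 1 in the command above asks for the whole read to be overlapped, which is the same containment condition; what differs is that this sketch reports each feature once, including features with zero reads.

```python
def count_contained_reads(features, reads):
    """For each feature, count same-strand reads lying entirely inside it.

    features and reads are lists of (chrom, start, end, strand) tuples
    in 0-based, half-open coordinates.
    """
    counts = []
    for f_chrom, f_start, f_end, f_strand in features:
        n = sum(1 for r_chrom, r_start, r_end, r_strand in reads
                if r_chrom == f_chrom and r_strand == f_strand
                and f_start <= r_start and r_end <= f_end)
        counts.append(n)
    return counts

features = [("chr1", 100, 200, "+"),
            ("chr1", 300, 400, "-"),
            ("chr2", 0, 50, "+")]
reads = [("chr1", 100, 150, "+"),   # fully inside the first feature
         ("chr1", 180, 220, "+"),   # hangs over the end: not counted
         ("chr1", 120, 160, "-"),   # wrong strand: not counted
         ("chr1", 310, 360, "-")]   # fully inside the second feature

for feature, n in zip(features, count_contained_reads(features, reads)):
    print(*feature, n, sep="\t")
```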

↧

Calculating Exome Coverage

April 3, 2014, 2:00 am

Edit to make the post clearer (mapping done via Bowtie2). My problem is that calculating exome coverage via coverageBed gives different results than via genomeCoverageBed. So I'm not sure if I'm doing something wrong, or which of the 2 methods is correct.

1) My first step is to build a .bed file of my Illumina paired-end reads, keeping only the positions that fall in targeted exon regions. I'm doing that via intersectBed -a [data.bed] -b [illuminaexonregions.bed].

2) My next step is to calculate the coverage of my new datafile via coverageBed -a [newdata.bed] -b [illuminaexonregions.bed]. I calculated some statistics:

Number of exons 214126 with a total length of 45326818

Number of matched nucleotides 10993449.0

Nucleotides/Length*100 24.253740909 % Coverage.

3) The next step was to calculate the coverage of my new datafile via genomeCoverageBed -i [newdata.bed] -g [genome.txt] -d | awk '$3>0 {print $1"\t"$2"\t"$3}'. I calculated some statistics:

Number of exons 214126 with a total length of 45326818

Number of matched nucleotides 10576907.0

Nucleotides/Length*100 23.3347661863 % Coverage.

Somehow there's a difference in matched nucleotides, which I can't explain. What am I doing wrong?
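The two percentages above can be reproduced from the posted numbers, and the gap between them is 416542 bases. One thing that might be worth checking, and this is only a guess, is whether any target exons overlap each other: summing covered bases exon by exon counts a shared base once per exon, whereas a per-position genome track counts it once. The Python sketch below redoes the arithmetic and shows the double-counting effect on an invented pair of overlapping exons.

```python
def percent(covered, total):
    """Breadth of coverage as a percentage."""
    return covered / total * 100

total_exon_bases = 45326818      # from the post
via_coveragebed = 10993449       # from the post
via_genomecov = 10576907         # from the post

print(percent(via_coveragebed, total_exon_bases))
print(percent(via_genomecov, total_exon_bases))
print(via_coveragebed - via_genomecov)

def summed_length(intervals):
    """Total length when every interval is counted separately."""
    return sum(end - start for start, end in intervals)

def merged_length(intervals):
    """Total length when overlapping intervals are merged first."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

toy_exons = [(100, 200), (150, 250)]   # invented overlapping exons
print(summed_length(toy_exons), merged_length(toy_exons))
```

If the merged length of the exon file is smaller than 45326818, the denominators and numerators of the two methods are not measuring the same thing.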

↧

Intersectbed - Overlap Analysis Using Vcf And Bed Files

July 12, 2012, 2:04 pm

I am trying to do an overlap analysis between 200 Danish exomes (VCF courtesy: Zev) and 10 different gene regions.
I would like to know what percentage of my regions of interest (mygenes.bed, 36 lines in total representing the regions) overlaps with a VCF file (Danish_*.flt.vcf.gz).

I have tried this command and got result: intersectBed -a Danish1.flt.vcf.gz -b mygenes.bed > D1result.txt

Danish1.flt.vcf.gz: here mygenes.bed: here D1overlapped.txt: here

My assumption is that the output should have lines <= the total number of lines in the mygenes.bed file. But in many instances I am getting more than 36 lines as output. Maybe I am missing something important, or maybe another tool / option in bedtools can do this task more efficiently. Please let me know your thoughts.
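With the VCF given as -a, intersectBed writes output per overlapping variant, so the number of lines is bounded by the number of variants, not by the 36 regions. Counting per region instead (for example with the regions as -a and the -c option) gives one line per region. The Python sketch below shows that per-region view on invented data; the one detail to keep in mind is that VCF positions are 1-based while BED starts are 0-based.

```python
def variants_per_region(regions, variants):
    """Count variants falling inside each BED region.

    regions: (chrom, start, end) in 0-based, half-open BED coordinates.
    variants: (chrom, pos) with 1-based VCF positions.
    """
    counts = []
    for chrom, start, end in regions:
        n = sum(1 for v_chrom, pos in variants
                if v_chrom == chrom and start <= pos - 1 < end)
        counts.append(n)
    return counts

# Invented regions and variants
regions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 0, 50)]
variants = [("chr1", 101), ("chr1", 150), ("chr1", 200),
            ("chr1", 201), ("chr2", 51)]

counts = variants_per_region(regions, variants)
regions_hit = sum(1 for n in counts if n > 0)
print(counts)
print(regions_hit / len(regions) * 100, "% of regions contain a variant")
```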

↧

To Calculate The Exact Total Number Of Mapped Reads In Exome Regions

December 3, 2013, 7:08 am

Dear All, I have some questions. I want to do some quality control analysis on my exome data that are mapped to the reference genome. I have the input BAM file for a sample, which contains reads that mapped to the reference genome (hg19.fa). There are 80 million mapped reads for this sample. Now I want to calculate how many of these 80 million mapped reads mapped to the exome region. For this I need to supply the exome baits BED file (probe/covered.bed) provided by the company. We used the Agilent SureSelectV4 here. So is there a one-line command that uses these three pieces of information (input.bam, hg19.fa and exome_baits.bed) to calculate the total number of reads mapped to the exonic regions? In different posts I see a lot of tools being mentioned. I tried to use CalculateHsMetrics of Picard, but it needs the BED file with a header, so it is of no use for now. Then I used the GATK walker DepthOfCoverage, but there we usually get the mean number of times a base is read (for me it's 73.9) and the %_of_bases_reads above 15 times, which is about 70% and also indicates good quality. We also get how many loci have been read more than once, which gives a histogram of cumulative read coverage at each locus, but if I want to just calculate the number of mapped reads that mapped to the exome region using the input BAM file, ...

↧

Is It Possible To Filter Only Bookend Reads From A Bed File?

January 28, 2014, 3:58 am

I have a bed file with many fragments, some overlapping, some on their own and some adjacent to each other (book-ended).

I know I can group overlapping and book-ended features using bedtools like

bedtools cluster -i fragments.bed

However I was wondering if anyone knew of a way of obtaining from the input file only the fragments that contain book-ended adjacent fragments.

Any ideas?

Best regards
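In half-open BED coordinates, two fragments are book-ended when the end of one equals the start of the other on the same chromosome, so the test is a pair of set lookups. A minimal Python sketch with invented fragments is below; it assumes every fragment has end > start, and it also reports a book-ended fragment that additionally overlaps something else.

```python
def bookended_fragments(fragments):
    """Return fragments that touch another fragment end-to-start.

    fragments: (chrom, start, end) tuples, 0-based and half-open.
    """
    starts = {(chrom, start) for chrom, start, _end in fragments}
    ends = {(chrom, end) for chrom, _start, end in fragments}
    return [(chrom, start, end) for chrom, start, end in fragments
            if (chrom, end) in starts or (chrom, start) in ends]

fragments = [("chr1", 0, 100),
             ("chr1", 100, 200),    # starts where the first one ends
             ("chr1", 150, 300),    # overlaps, but not book-ended
             ("chr1", 400, 500),    # on its own
             ("chr2", 100, 200)]    # same numbers, different chromosome

for frag in bookended_fragments(fragments):
    print(*frag, sep="\t")
```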

↧