compute normal-tumor coverage ratio from exome BAMs

July 2, 2014, 6:22 am

Could someone please suggest a quick way to compute the data ratio, i.e. the number of uniquely mapped reads in the normal divided by the number of uniquely mapped reads in the tumor (normal_unique_mapped_reads/tumor_unique_mapped_reads), as required by VarScan in the command below? I have over 50 exome BAMs.

java -jar VarScan.jar copynumber normal-tumor.mpileup output.basename --min-coverage 10 --data-ratio [data_ratio] --min-segment-size 20 --max-segment-size 100
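The arithmetic behind the data ratio can be sketched as follows. This is a minimal illustration, not a VarScan utility: the read counts and sample names are made up, and it assumes the unique-read counts were obtained separately (for example with something like `samtools view -c -q 1 sample.bam`, where treating a MAPQ cutoff as "uniquely mapped" is itself an assumption that depends on the aligner).

```python
def data_ratio(normal_unique_reads: int, tumor_unique_reads: int) -> float:
    """Return normal_unique_mapped_reads / tumor_unique_mapped_reads."""
    if tumor_unique_reads <= 0:
        raise ValueError("tumor read count must be positive")
    return normal_unique_reads / tumor_unique_reads


# Hypothetical counts for two normal/tumor pairs (illustrative numbers only).
pairs = {
    "patient01": (45_000_000, 60_000_000),
    "patient02": (52_000_000, 40_000_000),
}

# One ratio per pair, ready to be passed to --data-ratio.
ratios = {name: data_ratio(n, t) for name, (n, t) in pairs.items()}

for name, ratio in ratios.items():
    print(f"{name}: --data-ratio {ratio:.4f}")
```

With over 50 BAMs, the same function could be applied in a loop over all pairs once the counts are collected in a table.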

↧

Converting Sam Files To Bam Files - Reproduce Results Nature Paper: Transcriptome Genetics Using Second Generation Sequencing In A Caucasian Population

February 9, 2012, 9:02 am

I want to reproduce the results from the following Nature paper: Transcriptome genetics using second generation sequencing in a Caucasian population (http://www.nature.com/nature/journal/vaop/ncurrent/full/nature08903.html).

I downloaded their SAM files from the group's website: http://funpopgen.unige.ch/data/ceu60

I downloaded a reference FASTA and .fai file from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/

The main problem seems to be that I cannot convert these SAM files into proper "working" BAM files, so that I can get BED files, which are the input format for FluxCapacitor (http://flux.sammeth.net/). I tried the following steps (as there is no "proper" header in the SAM files, I have to do some additional steps):

samtools view -bt human_b36_male.fa.gz.fai first.sam> first.bam
samtools sort first.bam first.bam.sorted
samtools index first.bam.sorted
samtools index aln-sorted.bam

When I the ...

↧

Bedgraph Not Displayed In Igv

January 24, 2014, 8:24 am

Hi, I am new to this and have run into a problem. I was trying to make a bedGraph file using the bedtools genomecov command. The command was:

bedtools genomecov -ibam filename.sorted.bam -g chromosome sizes.txt > O.bedgraph

The resulting bedGraph file is much smaller than expected: 500 KB instead of ~6 MB. When I load that 500 KB file into IGV, I see nothing. Please help me out.

↧

Correlation Of Fpkm And Length Normalized Transcript Mapped Read Count

August 20, 2012, 10:36 am

Hello, in the process of estimating expression for a 16-tissue human dataset ("Human Body Map 2.0 GSE30611"), I used different methods to estimate the expression of the genes. After mapping against the hg19 genome version, I used the UCSC-provided RefSeq annotation for hg19 to count mapped reads for ~40,000 human genes in two ways:

1. Counting with Cufflinks, which outputs a Fragments Per Kilobase Per Million mapped fragments (FPKM) value for each transcript. The FPKM value basically accounts for library size and for the length of the transcript comprising all the annotated exons, plus an additional likelihood estimator to assign reads (see here).
2. Counting mapped reads with bedtools and dividing each transcript's mapped count by the sum of all its exon lengths. This gives a length-normalized expression estimate for comparison between genes.

However, the correlation of (1.) and (2.) is always around ~0.65 between the same tissues (technically the same experiment). I would expect this correlation to be > 0.9. Below, I plotted (2.) against (1.) for all ~40,000 transcripts. It seems like plain length normalization is simply overestimating some expression values. Can someone she ...

↧

Filtering Bed Files By Using Bedops

June 18, 2013, 1:40 am

Hello everyone,

I have paired-end Illumina reads, R1.fastq and R2.fastq, which I mapped as single-end reads using bowtie2 with default parameters. I performed further downstream analysis with samtools and BEDOPS, and now I have R1.bed and R2.bed. I made two sets: the first contains R1_uniquely_mapped.bed and R2_uniquely_mapped.bed, and the second contains R1_mapped_more_than_1.bed and R2_mapped_more_than_1.bed.

Because R1 and R2 belong to paired-end reads, and my restriction library has a maximum size of 2 KB, the R1 and R2 reads of a pair must lie within 2 KB of each other on the chromosome.

In theory, I am assuming the following in R1.bed format:

chr1  100   180    @R1_read1______1 .................
chr1   1000  1090 @R1_read2______1................

In R2.bed format:

chr1 2100   2180 @R2_read1_____2............. ## I just added 2 KB with respect to R1.bed
chr1 2500 2590    @R2_read______2......... ## I just added 1.5 KB [1500 nt] with respect to R1.bed, because my library is <= 2 KB

How can I customize downstream tools like BEDOPS or bedtools to capture this type of read or alignment? How can I filter this type of information using BEDOPS?

All suggestions and comments are most welcome.

↧

How To Find The Nearest Gene To A Retrotransposon Insert?

April 3, 2012, 7:24 am

Hi,

I have a BED file with the position of retrotransposons in the mouse genome and I would like to find the nearest gene, the distance to that gene and whether it is on the + or - strand. There are so many different file formats for the mouse genome and many different databases to choose from, I was wondering what the best tool and what the best database to use would be.

Cheers, Joseph

↧

Question about number of reads within intervals

November 3, 2014, 8:49 am

Hi there,

This question is very basic, but I need to make sure I am on the right track. I need to calculate the number of reads falling inside my BED intervals and the number of reads falling outside them. After reading this thread (https://www.biostars.org/p/11832/), I decided to try this command:

intersectBed -abam my_file.bam -b my_file.bed -wa -f 1 | coverageBed -abam stdin -b my_file.bed

I would like to know the difference between using the previous command and using only the second part:

coverageBed -abam my_file.bam -b my_file.bed

The output is quite different for some hits.

First command output:

1 50331576 50331667 (..gene names..) 0 0 91 0.0000000
1 39845848 39846030 (..gene names..) 70 178 182 0.9780220

Second command output:

1 50331576 50331667 (..gene names..) 47 91 91 1.0000000
1 39845848 39846030 (..gene names..) 143 182 182 1.0000000

I think that with the first command I get only those reads falling strictly within an interval, while the second one also includes reads that partially cover the intervals. Is this true?

On the other hand, I would also like to get the number of reads falling outside the intervals. I could make a new BED file using bedtools complement, but would using the -v option of bedtools intersect be OK? Like this:

intersectBed -v -abam my_file.bam -b my_file.bed -wa -f 1 | c ...
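To make the effect of a minimum-overlap fraction concrete, here is a small sketch of the test that `-f` applies to each read (A feature). It is a simplified model, assuming 0-based half-open coordinates and ignoring strand and spliced alignments; the function names and the example read coordinates are hypothetical.

```python
def overlap_bp(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def passes_min_fraction(read, interval, min_fraction=1e-9) -> bool:
    """True if the overlap covers at least min_fraction of the read.

    read and interval are (start, end) tuples. The tiny default fraction
    stands in for "any overlap of at least 1 bp"; min_fraction=1.0 means
    the read must lie entirely inside the interval.
    """
    read_len = read[1] - read[0]
    shared = overlap_bp(read[0], read[1], interval[0], interval[1])
    return shared > 0 and shared >= min_fraction * read_len


interval = (50331576, 50331667)        # the 91 bp interval from the output
partial_read = (50331560, 50331660)    # hypothetical 100 bp read, partly inside
contained_read = (50331580, 50331660)  # hypothetical 80 bp read, fully inside

print(passes_min_fraction(partial_read, interval))         # any overlap
print(passes_min_fraction(partial_read, interval, 1.0))    # full containment
print(passes_min_fraction(contained_read, interval, 1.0))
```

Under this model, a read longer than the interval can never pass a fraction of 1.0, which is one possible reason a short interval reports zero reads after the filtering step.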

↧

How To Count Genes In Genomic Regions Using A Gtf/Gff3 And A Bed File Of Regions

February 27, 2013, 11:13 am

I'd like to count the number of unique genes in a gff file falling within a list of genomic regions. With bedtools I can count the number of regions within the gff which is almost what I want, but not quite.

bedtools intersect -a regions.bed -b my.gff -c

UPDATE:

I should have made my question a bit more specific. I have a modified Ensembl-style GTF file (not a GFF) that has unique transcript IDs. This means that simply selecting unique fields in the 9th column of the GTF file actually counts transcript IDs.

To circumvent this problem I first truncated the gtf file:

cat my.gff | sed -e 's/;.*//' > delete.me.gtf

Then I ran the bedtools map command:

bedtools map -a regions.bed -b delete.me.gtf -c 9 -o count_distinct > counts.genes_in_windows.bed

I almost forgot to delete the intermediate file:

rm delete.me.gtf

There is probably a way to make this a oneliner, without the intermediate file, but I have a dissertation to write!

↧

Determining Each Samples Coverage Area

May 16, 2013, 7:06 am

This is my first time working with NGS data. I've got a BAM file with mapped reads for my samples and a BED file with the regions in hg19 that were targeted (using an Ion Torrent AmpliSeq panel). Are there any tools that can output something similar to this:

Sample    Amplicon   Chromosome   Start_coordinate_of_coverage   End_coordinate_of_coverage
Sample1   amp_001    chr6         1,000,000                      1,000,250
Sample2   amp_001    chr6         1,000,111                      1,000,255
Sample1   amp_002    chr6         1,000,200                      1,000,333

I basically want to know for each gene what coverage we have for each sample.

EDIT: changed column headings, I'm looking for coordinates that have coverage, not depth at each exon.

↧

Intersectbed/Coveragebed -Split Purify Exon?

September 15, 2012, 1:58 am

The file all.reads.bam contains mapped RNA-seq reads, including:

exon:exon junction
exon body
intron body
exon:intron junction

Q1: When calculating RPKM for a given RefSeq gene, including reads from all of these positions, will the following command count only the exon:exon junction reads and ignore all other reads?

coverageBed -abam all.reads.bam -b refseq.genes.BED12.bed -s -split > coverage.bed

I'm confused by the manual (page 62):

When dealing with RNA-seq reads, for example, one typically wants to only tabulate coverage for the portions of the reads that come from exons (and ignore the interstitial intron sequence). The -split command allows for such coverage to be performed.

If "-split" is set and an exon:exon read (for example, "30M3000N46M") exists in the -abam BAM file, the 3000N will NOT be wrongly intersected when running the intersectBed command. But what about the coverageBed command? I hope the 3000N will not be counted, which makes sense, and I also hope that the intron body reads and other reads will NOT be ignored.

Q2: If one just wants to calculate exon RPKM, does it mean one should prepare a -b file recording all the exon information, and run it like this:

coverageBed -abam all.reads.bam -b ...

↧

Getting Rna Sequences From Gff And Fa Files

August 24, 2013, 7:30 am

Hi. I have a folder full of .fa files, and a .gff. The gff file contains information about which loci look like they code for RNA sequences. The .fa files contain the DNA sequences for a set of human chromosomes. I want to get all the sequences which code for RNA, as defined by the gff file, out of the DNA in the fasta files. I also have a file telling me which RNA types have higher priority (lincRNA is higher priority than miRNA, for example); this tells me which are more important and how I should decide between RNAs for overlapping reads in the gff.

I have been trying to code my own little program in F# that will read these files and give me each RNA read defined in the gff, and its corresponding DNA. However I am a bit confused about how it works. Do the start and end of each feature in the gff file define a character in the corresponding .fa file? Are they 1 or 0 indexed? Does it matter what strand they are ('+' or '-') for my purposes?
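On the indexing question: the GFF format defines start and end as 1-based and inclusive, and features on the '-' strand are read from the reverse complement of the reference. A minimal sketch of that mapping in Python (rather than F#) is shown below; it assumes the chromosome sequence is already loaded as one string with FASTA line breaks removed, and the helper names are hypothetical.

```python
# Translation table for complementing DNA (upper and lower case, N kept as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(chrom_seq: str, start: int, end: int, strand: str) -> str:
    """Extract a GFF feature from a chromosome sequence.

    GFF coordinates are 1-based and inclusive, so the Python slice is
    [start - 1:end]. Features on '-' are reverse-complemented.
    """
    subseq = chrom_seq[start - 1:end]
    return reverse_complement(subseq) if strand == "-" else subseq


chrom = "ATGCCGTA"
print(extract_feature(chrom, 2, 4, "+"))  # bases 2..4 as written
print(extract_feature(chrom, 2, 4, "-"))  # same bases, opposite strand
```

The result is still a DNA string; whether to convert T to U afterwards depends on what the downstream computation expects.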

Ultimately my goal is to get a bunch of RNAs with their corresponding types (miRNA, lincRNA, snRNA... etc) to do some computations on.

My question is this: what is the easiest way to get it out of the data I have?

The data I am using is freely available here: http://wanglab.pcbi.upenn.edu/coral/ under the heading "Annotation packages" if anyone is interested or needs specifics.

Thank you!

↧

how to get -nms for bedtools

August 10, 2014, 12:45 pm

I'd like to merge bed files and preserve the names of the merged features using bedtools -nms option.

However, this option (-nms) is deprecated in the newer bedtools.

The documentation says I can use -o option to get -nms behavior.

How do I translate the following command into the new bedtools merge syntax:

bedtools merge -i file.bed -nms
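For reference, the behavior being asked for (merge overlapping features and keep every name) can be sketched in plain Python. This is an illustration of the logic, not bedtools itself; as far as I know the newer `bedtools merge` expresses it as `-c 4 -o collapse` (or `-o distinct`), but that should be checked against the documentation of the installed version. The function name and the comma delimiter are assumptions.

```python
def merge_with_names(records, delim=","):
    """Merge overlapping or book-ended intervals and collect their names.

    records: iterable of (chrom, start, end, name) tuples.
    Returns a list of (chrom, start, end, joined_names) tuples.
    """
    merged = []
    for chrom, start, end, name in sorted(records, key=lambda r: (r[0], r[1])):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Extend the current merged interval and remember the name.
            last[2] = max(last[2], end)
            last[3].append(name)
        else:
            merged.append([chrom, start, end, [name]])
    return [(c, s, e, delim.join(names)) for c, s, e, names in merged]


features = [
    ("chr1", 100, 200, "a"),
    ("chr1", 150, 250, "b"),
    ("chr1", 300, 400, "c"),
]
for row in merge_with_names(features):
    print(*row, sep="\t")
```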

↧

Getting All Reads That Align To A Region In Compact Bed Format Using Bedtools?

January 16, 2013, 2:49 pm

I'm trying to find all the reads (by name) from a BAM file that align to various regions in a bed file. Right now I can do this with bedtools using intersectBed:

intersectBed -abam reads.bam -wo -f 1 -b regions.bed -bed

From this one can parse all the read ids that land in every interval in regions.bed, but it's not very compact. Is there a way to get bedtools to natively transform this into a more compact format, e.g.

chr1 x y .... read_id1,read_id2,read_id3

where chr1 x y is a given interval in regions.bed and the comma separated read_id1,... is the list of read ids from reads.bam that fall in that interval. In this compact format, the output BED file would have at most as many entries as there are regions in regions.bed, whereas with the -wo option it can be even larger than the number of reads in reads.bam. Thanks.
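If no native option turns out to exist, the compact format can be produced by post-processing. The sketch below assumes the region coordinates and read name have already been pulled out of each `-wo` output line into (chrom, start, end, read_id) tuples; which columns hold them depends on the exact output layout, so that parsing step is deliberately left out, and the function name is hypothetical.

```python
def collapse_reads_by_region(hits, delim=","):
    """Group read ids by region, keeping the order in which regions first appear.

    hits: iterable of (chrom, start, end, read_id) tuples.
    Returns one tab-separated line per region.
    """
    grouped = {}
    for chrom, start, end, read_id in hits:
        grouped.setdefault((chrom, start, end), []).append(read_id)
    return [
        f"{chrom}\t{start}\t{end}\t{delim.join(ids)}"
        for (chrom, start, end), ids in grouped.items()
    ]


hits = [
    ("chr1", 100, 500, "read_id1"),
    ("chr1", 100, 500, "read_id2"),
    ("chr1", 900, 1200, "read_id3"),
    ("chr1", 100, 500, "read_id3"),
]
for line in collapse_reads_by_region(hits):
    print(line)
```

Regions with no overlapping reads never appear in the input, so the output has at most as many lines as there are regions.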

↧

Bed File Of Mapq Sliding Window On A Bam File?

February 27, 2014, 2:01 am

There may already be a recipe for this, so I am asking first before reinventing the wheel: I would like to create a BED file where the score is the average MAPQ of the reads in the input.bam file. I think bedtools or BEDOPS are the way to go:

http://bedtools.readthedocs.org/en/latest/content/tools/bamtobed.html
http://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/bam2bed.html

Other than simply running bamtobed/bam2bed, I would like to be able to define a sliding window size and step, say size=1000 and step=200. I would also like to generate the bam2bed information only for a list of regions in regions.bed. E.g., something like:

mapq_sliding_windows --bam input.bam --wsize 1000 --wstep 200 --regions regions.bed > mapq_sliding_windows.bed

EDITED: Thank you, Aaron, for your answer. I got it working, but it's slow for my 30x WGS BAMs:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome
bedtools makewindows -g hg19.genome -w 1000 -s 200 > hg19.windows.bed
bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam | grep -v chrM) -c 5 -o mean > ...
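The two steps in this pipeline (building sliding windows, then averaging a score over the reads overlapping each window) can be sketched in plain Python to show what is being computed. This is a naive illustration on toy data, not a faster replacement: it assumes windows start at 0 and are truncated at the chromosome end, and the chromosome name, sizes and reads are hypothetical.

```python
def make_windows(chrom_sizes, size=1000, step=200):
    """Yield (chrom, start, end) sliding windows, truncated at the chromosome end."""
    for chrom, length in chrom_sizes.items():
        for start in range(0, length, step):
            yield chrom, start, min(start + size, length)


def mean_mapq(window, reads):
    """Mean MAPQ of reads overlapping a window, or None if there are none.

    reads: list of (chrom, start, end, mapq) tuples in half-open coordinates.
    """
    chrom, w_start, w_end = window
    scores = [
        mapq
        for r_chrom, r_start, r_end, mapq in reads
        if r_chrom == chrom and r_start < w_end and r_end > w_start
    ]
    return sum(scores) / len(scores) if scores else None


sizes = {"chrT": 1500}  # toy chromosome
reads = [("chrT", 100, 200, 60), ("chrT", 900, 1100, 20)]

for window in make_windows(sizes, size=1000, step=500):
    print(*window, mean_mapq(window, reads), sep="\t")
```

Because every read is tested against every window, this version scales poorly; it is only meant to make the window and averaging semantics explicit.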

↧

How to get the rRNA ratio from a RNAseq dataset

September 30, 2014, 9:25 am

Hello,

I want to know if there is any way to use bedtools and the miRDeep2 output BED file to get the rRNA ratio in my miRNA-seq FASTQ data. I have a GTF file, a genome.fa, and a BED file from miRDeep2. Thank you very much!

↧

Raw Counts From Cufflinks Output

February 13, 2013, 2:30 am

Hi, I want to ask how to get the raw counts from the output of Cufflinks. One way to do this is to use the FPKM:

raw counts = FPKM * (length of that transcript/1000) * (# of mapped reads / 1e6)
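The formula above can be written out as a small helper, together with its inverse as a sanity check. This is only the plain arithmetic: Cufflinks works with fragments and, as far as I understand, an effective transcript length, so counts recovered this way are approximate. The function and parameter names are made up for the example.

```python
def raw_count_from_fpkm(fpkm: float, transcript_length_bp: float, mapped_reads: float) -> float:
    """raw count = FPKM * (length / 1000) * (mapped reads / 1e6)."""
    return fpkm * (transcript_length_bp / 1000) * (mapped_reads / 1e6)


def fpkm_from_raw_count(count: float, transcript_length_bp: float, mapped_reads: float) -> float:
    """Inverse of the formula above, used here as a round-trip check."""
    return count / ((transcript_length_bp / 1000) * (mapped_reads / 1e6))


# Illustrative numbers: FPKM 10, a 2 kb transcript, 20 million mapped reads.
count = raw_count_from_fpkm(10.0, 2000, 20_000_000)
print(count)
print(fpkm_from_raw_count(count, 2000, 20_000_000))
```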

The FPKM and transcript length are in the Cufflinks FPKM tracking files. But what about the number of mapped reads?

For instance, given foo.bam, samtools view -c (-f|-F) flag foo.bam can do this job, but I am not sure which flag I should set for single-end or paired-end data.

Thanks!

↧

Does Windowbed Extend Reads?

October 21, 2013, 10:08 am

I am using WindowBed, part of the BedTools suite, to align reads to a reference file, and I obtained a very interesting result. I am trying to rule out an analysis artifact that could be caused by extending the reads or by aligning read midpoints rather than 5' ends. It is my understanding that WindowBed aligns the 5' end of the read to the reference point, rather than mapping the read midpoint, or extending the 3' end of the read and then mapping the midpoint. Am I correct in this assumption, that the 5' end of the read is in fact what is being aligned?

Any help here would be appreciated. The BedTools manual, which is very good, doesn't seem to address this.

Thanks

↧

Discrepancy In Samtools Mpileup/Depth And Bedtools Genomecoveragebed Counts

March 27, 2013, 1:05 pm

I am getting different counts for the number of reference bases covered by aligned reads when using the samtools depth/mpileup and BEDTools genomeCoverageBed commands. I am using samtools-0.1.19 and bedtools-2.17.0.

samtools mpileup -ABQ0 -d10000000 -f ref.fas qry.bam > qry.mpileup
samtools depth -q0 -Q0 qry.bam > qry.depth

genomeCoverageBed -ibam qry.bam -g ref.genome -dz > qry.dz
wc -l qry.[dm]*
  1026779 qry.depth
  1027173 qry.dz
  1026779 qry.mpileup

Any ideas? Thanks

↧

How Can I Include One Bed File In Another Bed File ?

August 5, 2013, 4:34 am

Hello, I have two BED files that share some common features. Let's call the first file A.bed (the bigger file) and the second B.bed (the smaller file). I would like to have a new BED file that includes everything in B.bed in addition to the A.bed content. I don't need the intersection; what I need is more like a merge option. I checked the bedtools manual but couldn't find an answer for merging two BED files. Can someone help?

Thanks in advance

↧

How Can I Compare And Merge Bed Files

July 22, 2012, 1:46 pm

I have three BED files with chrNo, start position, end position and type. I need to compare each chrNo, start and end position of one file with those of the two other files and write the common ones to a new file. Can anyone suggest how I can do this efficiently? I wrote a simple Perl script, but as the files are huge, it takes a lot of time and is not feasible. Thanks in advance.

Example files:

file1.bed:

1 20 30

1 100 120

1 200 300

file2.bed:

1 2 5

1 25 34

1 200 300

file3.bed:

1 30 33

1 200 300

1 500 600

common.bed

1 30 34 --> coordinates overlapping within 5 bp are considered the same, but the outermost coordinates of the three are taken in the common file

1 200 300

↧