General Considerations For Genomic Overlaps?

March 30, 2014, 8:10 am

Hello, I was wondering about general considerations for overlapping genomic regions and computing Monte Carlo-type statistics.

Below I describe how I do it. Unfortunately, I'm not fully confident that this is correct, so I'd appreciate any thoughts on this.

For example, I have an experimental dataset (A) of 10 bp intervals, with approx. 5,000 entries spread across the genome.

Then I have another experimental dataset (B) (ChIP-seq) of ~1,000 bp intervals, with ~50,000 entries spread across the genome.

If I perform overlap/intersection with BEDTools I get my overlap. E.g. 2000 entries from A.

But then I also want to find overlaps in the vicinity of the ChIP-seq peaks, so I extend these peaks, e.g. by 1,000 bp on each side. There are still 50,000 entries, but the fraction of the genome that is searched becomes larger, and some entries in B may now overlap each other.

So I do the intersection again of A and B, and count entries in A only once. This gives me e.g. 3,000 entries from A.

For the simulations, I use random intervals that look like dataset B. E.g. I pick 50,000 random 1,000 bp intervals, intersect them with A, and repeat this 1,000 times. Then I get e.g. an average of 500 entries from A.

For overlaps in the vicinity I calculate the total size of dataset B and generate random intervals of the same length and total size in bp as dataset B (size-matched sampling).

I hope you can follow this way of thinking.

So the question basically is, is this correct? And how far can I extend my intervals before the overlap becomes artificial? The largest sizes I'm overlapping are ~15% of the genome in dataset B, and this gives me almost all entries from A. This is far higher than in 1,000 simulations.

Any thoughts are appreciated, e.g. would it be better to turn it around and make the entries in A larger instead?
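The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions that are not in the post: a single chromosome of length `chrom_size`, half-open `(start, end)` intervals, and uniformly random placement with each B interval keeping its own length (so extended peaks are shuffled at their extended size). The function names are hypothetical.

```python
import bisect
import random

def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def count_a_hits(a, b):
    """Number of A intervals overlapping at least one B interval.

    Each A entry is counted once, even if several B intervals hit it.
    """
    merged = merge(b)
    starts = [m[0] for m in merged]
    hits = 0
    for a_start, a_end in a:
        # Index of the last merged B block that starts before a_end.
        i = bisect.bisect_left(starts, a_end) - 1
        if i >= 0 and merged[i][1] > a_start:
            hits += 1
    return hits

def empirical_p(a, b, chrom_size, n_sim=1000, seed=0):
    """Observed hit count and a one-sided empirical p-value."""
    rng = random.Random(seed)
    observed = count_a_hits(a, b)
    at_least = 0
    for _ in range(n_sim):
        shuffled = []
        for start, end in b:
            length = end - start
            new_start = rng.randrange(chrom_size - length + 1)
            shuffled.append((new_start, new_start + length))
        if count_a_hits(a, shuffled) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_sim + 1)
```

Because B is merged before counting, overlapping extended peaks do not inflate the count, and the same shuffling is applied at every extension size, so observed and simulated counts stay comparable.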

↧

Calculating Exome Coverage

April 6, 2014, 9:22 am

*// Edit to make the post clearer (mapping done via Bowtie2). My problem is that calculating exome coverage via coverageBed gives different results than via genomeCoverageBed, so I'm not sure whether I'm doing something wrong or which of the two methods is correct.

1) My first step is to build a .bed file of my Illumina paired-end reads, keeping only the positions that fall in targeted exon regions. I'm doing that via intersectBed -a [data.bed] -b [illuminaexonregions.bed].

2) My next step is to calculate the coverage of my new datafile via coverageBed -a [newdata.bed] -b [illuminaexonregions.bed]. I calculated some statistics:

Number of exons 214126 with a total length of 45326818

Number of matched nucleotides 10993449.0

Nucleotides/Length*100 24.253740909 % Coverage.

3) The next step was to calculate the coverage of my new datafile via genomeCoverageBed -i [newdata.bed] -g [genome.txt] -d | awk '$3>0 {print $1"\t"$2"\t"$3}'. I calculated some statistics:

Number of exons 214126 with a total length of 45326818

Number of matched nucleotides 10576907.0

Nucleotides/Length*100 23.3347661863 % Coverage.

Somehow there's a difference in matched nucleotides, which I can't explain. What am I doing wrong?
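As a cross-check that is independent of either tool, the "matched nucleotides" figure can be defined explicitly as the number of target bases covered by at least one read. The sketch below is a hypothetical pure-Python reference for small test cases, assuming half-open `(start, end)` intervals on a single chromosome; it does not reproduce what coverageBed or genomeCoverageBed do internally.

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_bases(reads, targets):
    """Target bases covered by at least one read (each base counted once)."""
    merged_targets = merge(targets)
    total = 0
    for r_start, r_end in merge(reads):
        for t_start, t_end in merged_targets:
            total += max(0, min(r_end, t_end) - max(r_start, t_start))
    return total

def breadth_percent(reads, targets):
    """Covered target bases as a percentage of total target length."""
    target_length = sum(end - start for start, end in merge(targets))
    return 100.0 * covered_bases(reads, targets) / target_length
```

Running both bedtools commands on a tiny hand-made input and comparing them with a reference like this is one way to see which of the two numbers matches the intended definition.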

↧

How to explain uneven coverage of a DNA segment obtained via PCR amplification.

April 9, 2014, 10:44 am

Experiment: deep sequencing for mutants in a 700 nt fragment.

The DNA fragment was pre-amplified with primers flanking the fragment, followed by HiSeq sequencing.

Per-base coverage was calculated by coverageBed -d -abam in.bam -b ref.bed > out.cov

Observation: two distinct peaks in coverage at the ends of the fragment.

[Plot: coverage vs. position]

The peaks are made up of reads that contain part of the primers, so these reads also show soft clipping at their ends.

There is a huge difference in the calculations depending on whether I include or exclude such reads.

Question: does anyone know how to handle such a situation?

↧

Extracting Genomic Coverage Information Across Different Samples

April 17, 2014, 6:02 am

Hello, I have 3 BAM files that I want to compare against each other. For example, I have a reference file with 10,000 sequences, and paired-end reads sequenced for 3 different samples:

1) Sample 1 is 100% the same as the reference, so we expect all reads to map to it.

2) Sample 2 is 80% similar to the reference, so 20% of the reference sequences won't have any reads.

3) Sample 3 is 60% similar to the reference, so 40% of the reference sequences won't have any reads.

My goal is to identify which reference sequences do not have any reads mapped in Samples 2 and 3, i.e. the 20% of reference sequences for Sample 2 and the 40% for Sample 3. Also, in some cases, for a reference which is approx. 10 kb long, Sample 1 maps to the entire 10 kb, Sample 2 maps to the first 5 kb and Sample 3 maps to the last 3 kb, so I need to identify the partial regions for those reference sequences as well.

I have the mapped, sorted BAM files for all three samples. I am looking into using bedtools, but I'm not sure which bedtools command will give the answer I need. I have the following commands, which might do something similar, but they output differences at every base.

genomeCoverageBed -bg -ibam sample1.bam > sample1.bedgraph

genomeCoverageBed -bg -ibam sample2.bam > sample2.bedgraph

unionBedGraphs -header -i sample1.bedgraph sample2. ...
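One way to reduce per-base output to the regions of interest is to invert the covered intervals for each reference sequence. The sketch below is a hypothetical helper, assuming the covered intervals have already been read from a bedGraph-like file into a dict keyed by reference name, and that reference lengths are known (e.g. from the BAM header). A reference with no reads at all comes out as a single gap spanning its full length.

```python
def uncovered_regions(ref_lengths, covered):
    """Return {reference: [(start, end), ...]} for regions with no reads.

    ref_lengths: {reference name: length in bp}
    covered:     {reference name: [(start, end), ...]} half-open intervals
                 with non-zero coverage; references without reads may be absent.
    """
    gaps = {}
    for ref, length in ref_lengths.items():
        pos = 0
        ref_gaps = []
        for start, end in sorted(covered.get(ref, [])):
            if start > pos:
                ref_gaps.append((pos, start))
            pos = max(pos, end)
        if pos < length:
            ref_gaps.append((pos, length))
        if ref_gaps:
            gaps[ref] = ref_gaps
    return gaps

# Example: ref_a is half covered, ref_b has no reads, ref_c is fully covered.
lengths = {"ref_a": 10000, "ref_b": 5000, "ref_c": 3000}
sample2 = {"ref_a": [(0, 5000)], "ref_c": [(0, 1500), (1500, 3000)]}
print(uncovered_regions(lengths, sample2))
```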

↧

Bowtie2 Mapping Different Number Of Reads To Same Sequence When Ref-Seq Is Part Of Different Indexes

April 17, 2014, 6:02 am

I am using bowtie2 to map my PE reads.

I have indexed multiple bacterial genomes together by combining them into a single multi-FASTA file.

bowtie2 -q -a -p 1 -x Multi -1 R100_1.fq -2 R100_2.fq -U 100_Orph.fastq -S 100.sam
samtools view -b -S 100.sam -o 100.bam
coverageBed -abam 100.bam -b BED_RefSeq >>100.cvg
coverageBed output for genome ("307679329") is:

307679329       1       25751   72      3568    25750   0.1385631

But when I index genome ("307679329") separately, the coverageBed output is:

307679329       1       25751   449     8369    25750   0.3250097

Can someone explain this difference?

↧

Correlation Of Fpkm And Length Normalized Transcript Mapped Read Count

April 17, 2014, 6:02 am

Hello, in the process of estimating expression for a 16-tissue human dataset ("Human Body Map 2.0 GSE30611"), I used different methods to estimate the expression of the genes. After mapping against the hg19 genome, I used the UCSC-provided RefSeq annotation for hg19 to count mapped reads for ~40,000 human genes in two ways:

1. Counting with cufflinks, which outputs a Fragments Per Kilobase Per Million mapped fragments (FPKM) value for each transcript. The FPKM value basically accounts for library size and for the length of the transcript, comprising all the annotated exons, plus an additional likelihood estimator to assign reads (see here).
2. Counting mapped reads with bedtools and dividing a transcript's mapped count by the sum of all its exon lengths. This gives a length-normalized expression estimate to compare between genes.

However, the correlation of (1.) and (2.) is always around ~0.65 between the same tissues (technically the same experiment). I would expect this correlation to be > 0.9. Below, I plotted (2.) against (1.) for all ~40,000 transcripts. It seems like normal length normalization is simply overestimating some expression. Can someone she ...
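For reference, the two quantities being compared can be written out as plain formulas. The sketch below uses the usual textbook FPKM definition with a raw fragment count; it does not model the likelihood-based read assignment that cufflinks performs, and the function names are hypothetical.

```python
def fpkm(count, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return count * 1e9 / (length_bp * total_mapped)

def length_normalized(count, length_bp):
    """Mapped count divided by summed exon length (method 2 in the post)."""
    return count / length_bp

# With the same count and length, the two values differ only by the constant
# factor 1e9 / total_mapped within one sample.
total = 10_000_000
for count, length in [(100, 2000), (5000, 1500), (12, 800)]:
    print(fpkm(count, length, total), length_normalized(count, length))
```

Since that factor is constant within a sample, a correlation well below 1 has to come from differences in the counts or effective lengths the two methods use, not from the normalization formula itself.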

↧

Does Bedtools Intersect -V Consider Unmapped Reads "As Not In B"

April 17, 2014, 6:02 am

bedtools intersect -v -abam my.bam -b myregions.gff > notinmyregions.bam

Would we see reads with 4 in the FLAG field (i.e. unmapped reads) in notinmyregions.bam?
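Independently of how bedtools treats such reads, the output can be checked directly: in the SAM format, "unmapped" is bit 0x4 of the FLAG field, so it has to be tested as a bit rather than by comparing the whole value to 4. A minimal sketch, with a hypothetical function name:

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped

def is_unmapped(flag):
    """True if the unmapped bit is set in a SAM FLAG value."""
    return bool(flag & UNMAPPED)

# 77 = paired + unmapped + mate unmapped + first in pair
# 99 = paired + proper pair + mate reverse strand + first in pair
for flag in (4, 77, 99):
    print(flag, is_unmapped(flag))
```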

↧

Help With Exception When Using Bedtools Coveragebed With Paired Alignment. [Resolved]

April 17, 2014, 6:02 am

I use bwa mem to align paired reads to a few hundred microbial contigs; then I sort the alignment and try to get coverage using bedtools genomecov -ibam alignments.paired.sorted.bam -bg >ranges.txt, which fails with an exception:

*** glibc detected *** bedtools: double free or corruption (out): 0x0000000001c5f270 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3d7b2750c6]
bedtools[0x45ab43]
bedtools[0x45b146]
bedtools[0x45c163]
bedtools[0x45e2ed]
bedtools[0x434c4b]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3d7b21ecdd]

If I run the same command on an unpaired alignment, everything is OK. So I am really not sure where my mistake is... maybe bedtools can't digest the paired alignment?

-- edit: works with the latest versions of these tools. Here are the ones that failed:

$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.0-r313
Contact: Heng Li <lh3@sanger.ac.uk>

$ bedtools -version
bedtools v2.16.1

↧

Split A Bam File Into Several Files Containing All The Alignments For X Number Of Reads.

April 17, 2014, 6:02 am

Hi everyone! I am struggling with annotating a very big .bam file that was mapped using TopHat. The run had a large number of reads: ~200M. The problem is that when I try to annotate each read using a GFF file (with BEDTools intersectBed), the BED file that is produced is huge: over 1.7TB! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT dept increased the $TMPDIR local disk space to 1.5TB so I could run everything on $TMPDIR, but it is still not enough.

What I think I should do is split this .BAM file into several files, maybe 15, so that each set of reads gets annotated separately on a different node. That way, I would not run out of disk space. When all the files are annotated, I can run groupBy on each, and then simply sum the number of reads that each feature in the GFF got across all the files.

However, there is a slight complication: after the annotation using intersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and divides each read by the number of times it mapped. I.e., if a read mapped to 2 regions, each instance of the read is worth 1/2, such that it contributes only 1/2 a read to each of the features it mapped to. Because of this, I need to have all the alignments from the .BAM file that belong to each read contained in one single file. That is to say, I ...
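The two requirements described above (fractional counting, and keeping all alignments of a read in the same chunk) can be illustrated with a small sketch. It assumes the read-to-feature hits have already been extracted as `(read_name, feature)` pairs; the function names and the choice of a CRC32 hash are illustrative assumptions, not part of any bedtools or samtools workflow.

```python
import zlib
from collections import Counter, defaultdict

def chunk_index(read_name, n_chunks=15):
    """Deterministically assign a read name to one of n_chunks files.

    Every alignment of the same read hashes to the same chunk.
    """
    return zlib.crc32(read_name.encode()) % n_chunks

def fractional_counts(hits):
    """Give each read a total weight of 1, split evenly over its hits."""
    per_read = Counter(read for read, _ in hits)
    counts = defaultdict(float)
    for read, feature in hits:
        counts[feature] += 1.0 / per_read[read]
    return dict(counts)

hits = [("r1", "geneA"), ("r1", "geneB"), ("r2", "geneA")]
print(fractional_counts(hits))
```

As long as no read is split across chunks, the per-chunk fractional counts can simply be summed per feature afterwards.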

↧

Bedtools Intersectbed

April 17, 2014, 6:02 am

Apologies if this is blatantly obvious!

I would like to compare coordinates in setA with those of setB. The output should have the same number of coordinates as setA and tell me how many nucleotides of each setA coordinate are overlapped by any coordinate in setB.

For example, a large coordinate in setA may be overlapped by two setB coordinates, but I want to know how many nucleotides of the setA coordinate are covered by both setB coordinates in total.

I know how to do this in Galaxy, as there is the handy 'Coverage' tool in 'Operate on Genomic Intervals'. However, I want to do this on the command line. I have been trying to get BEDTools to do this using 'intersectBed', but I can only seem to get just the overlapping setA coords (using -u), or the nucleotide overlap for multiple setB coordinates on separate lines (using -wao), or a count of how many setB coordinates overlap each setA coordinate (using -c).

SetB coordinates are non-overlapping themselves, so I guess I could tally up those setB coordinates that overlap the same setA coordinate.

Can BEDTools do what I want, or is there another command-line way of doing it?
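The tallying idea mentioned above amounts to summing the per-pair overlaps for each setA coordinate. A minimal sketch of that sum, assuming half-open `(start, end)` intervals on one chromosome and non-overlapping setB coordinates as stated in the post (the function name is hypothetical):

```python
def covered_per_a(a_intervals, b_intervals):
    """For each A interval, total bases overlapped by any B interval.

    Assumes B intervals do not overlap each other, so summing the
    pairwise overlaps never counts a base twice.
    """
    result = []
    for a_start, a_end in a_intervals:
        total = sum(
            max(0, min(a_end, b_end) - max(a_start, b_start))
            for b_start, b_end in b_intervals
        )
        result.append((a_start, a_end, total))
    return result

set_a = [(100, 200), (500, 600)]
set_b = [(90, 120), (150, 160), (300, 400)]
print(covered_per_a(set_a, set_b))
```

The output has exactly one row per setA coordinate, including those with zero overlap.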

Thank you!

PS I have also sent this to the BEDTools discussion list, so apologies for any double postings!

↧

Is It Possible To Filter Only Bookend Reads From A Bed File?

April 17, 2014, 6:02 am

I have a BED file with many fragments, some overlapping, some on their own and some adjacent to each other (book-ended).

I know I can group overlapping and book-ended features using bedtools like this:

bedtools cluster -i fragments.bed

However, I was wondering if anyone knew of a way of obtaining from the input file only the fragments that have a book-ended (directly adjacent) neighbour.
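One possible reading of "book-ended" is that a fragment's end equals another fragment's start on the same chromosome (or vice versa). Under that assumption, and with features already parsed into `(chrom, start, end)` tuples, a small sketch of the filter could look like this (the function name is hypothetical):

```python
from collections import defaultdict

def bookended(features):
    """Keep features whose start or end touches another feature's end or start."""
    starts = defaultdict(set)
    ends = defaultdict(set)
    for chrom, start, end in features:
        starts[chrom].add(start)
        ends[chrom].add(end)
    return [
        (chrom, start, end)
        for chrom, start, end in features
        if end in starts[chrom] or start in ends[chrom]
    ]

fragments = [
    ("chr1", 0, 10),    # book-ended with the next one
    ("chr1", 10, 20),
    ("chr1", 15, 30),   # overlapping, but not book-ended
    ("chr1", 50, 60),   # on its own
]
print(bookended(fragments))
```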

Any ideas?

Best regards

↧

Getting Unmapped Reads: Comparing Fastq To Bam

April 17, 2014, 6:02 am

Given a FASTQ file and a BAM file of aligned reads, is there an efficient way to get all reads that are in the original FASTQ but not in the BAM, perhaps using bedtools? I.e.:

unmapped_script original.fastq aligned.bam > unmapped.fastq

should create an unmapped.fastq file, which is a subset of original.fastq containing only those entries that do not appear in aligned.bam
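The filtering step itself is a set-membership test on read names. The sketch below assumes the names of the aligned reads have already been collected into a Python set (for instance from the first column of `samtools view` output), that FASTQ records are exactly four lines each, and that the read name is the first whitespace-delimited token after `@`. The function name is hypothetical.

```python
def unmapped_records(fastq_lines, aligned_names):
    """Yield 4-line FASTQ records whose read name is not in aligned_names."""
    lines = iter(fastq_lines)
    for header in lines:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq, plus, qual = next(lines), next(lines), next(lines)
        name = header[1:].split()[0]
        if name not in aligned_names:
            yield header + seq + plus + qual

fastq = [
    "@r1 1:N:0\n", "ACGT\n", "+\n", "IIII\n",
    "@r2 1:N:0\n", "TTTT\n", "+\n", "JJJJ\n",
]
print("".join(unmapped_records(fastq, {"r1"})), end="")
```

If the BAM also stores unaligned reads (FLAG bit 4), those names would need to be excluded when building the set.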

Thank you.

↧

How To Create A Read Density Profile Within A Interval?

April 17, 2014, 6:02 am

Hi!

I need some help: I have to create a density profile with a window size of 1 kb (how many times a sequence is detected after NGS). I have to use SAM and BEDtools. I think I can use genomeCov in BEDtools, but I don't have a genome reference.

So, if anybody is able to help me...
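Counting reads per fixed window does not strictly need the reference sequence, only the mapped positions. A minimal sketch, assuming the read start positions have already been extracted from the SAM file as `(chrom, start)` pairs (the function name is hypothetical):

```python
from collections import Counter

def window_counts(read_starts, window=1000):
    """Count reads per fixed-size window, keyed by (chrom, window start)."""
    counts = Counter()
    for chrom, start in read_starts:
        counts[(chrom, start // window * window)] += 1
    return counts

starts = [("chr1", 10), ("chr1", 999), ("chr1", 1000), ("chr2", 2500)]
for (chrom, win_start), n in sorted(window_counts(starts).items()):
    print(chrom, win_start, win_start + 1000, n)
```

Windows with no reads are simply absent from the result; listing them as zeros would require the chromosome lengths, which are normally recorded in the @SQ lines of the SAM header.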

Thanks

↧

Intersectbed Overlap

April 17, 2014, 6:02 am

Hi,

I have a question about intersectBed. Is it possible to extract only alignments like this:

chromosome ===============================================================
BED/BAM A               ==============              =================
BED FILE B               ============
RESULT                  ==============

But no alignments like this (even if the read overlaps 100% of the feature, I don't want to extract these reads):

chromosome ===============================================================
BED/BAM A    =========================              =================
BED FILE B               =============
RESULT

So, I only want to extract reads that have 90-95% of their sequence overlapping 90-95% of the feature.

Is it clear?
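The condition described here is a reciprocal overlap: the shared region must make up at least a given fraction of the read and of the feature. intersectBed has `-f` (minimum overlap fraction) and `-r` (require it reciprocally) options for this kind of test; the sketch below only spells out the arithmetic, assuming half-open `(start, end)` intervals and a hypothetical function name.

```python
def reciprocal_overlap(a, b, min_frac=0.9):
    """True if the overlap covers at least min_frac of both a and b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return False
    return (overlap >= min_frac * (a[1] - a[0])
            and overlap >= min_frac * (b[1] - b[0]))

# Read and feature of similar size: kept.
print(reciprocal_overlap((100, 200), (105, 200)))
# Read much longer than the feature it fully covers: rejected.
print(reciprocal_overlap((0, 200), (100, 200)))
```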

Thanks,

↧

Can Bedtools/Bedops Be Used To Extract Regions Where Scores Are Higher Than A Given Value?

April 17, 2014, 6:02 am

I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value? For example, let's say that I have a BED file like the following:

chr7    127471196  127472363  Pos1  12   +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  200  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  120  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  54   +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  2    -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  15   -  127477031  127478198  0,0,255
chr7    127478198  127479365  Neg3  25   -  127478198  127479365  0,0,255
chr7    127479365  127480532  Pos5  2    +  127479365  127480532  255,0,0
chr7    127480532  127481699  Neg4  9    -  127480532  127481699  0,0,255

According to the BED format's specs, the fifth column contains a score between 0 and 1000 (alternatively, in the bedGraph format the score is in the 4th column). If I want to get all the regions that have a score higher than 20, for example, I can do an awk search:

awk '$5 > 20 {print}' mybedfile.bed

However, in order to use awk, I have to keep the BED file in an uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with th ...
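For comparison, the same score filter can be applied to a compressed file without unpacking it on disk. The sketch below handles plain or gzip-compressed BED only (not the .starch format) and uses hypothetical function names; it assumes tab- or space-delimited lines with the score in column 5.

```python
import gzip

def open_bed(path):
    """Open a BED file as text, transparently handling .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)

def filter_by_score(lines, min_score, column=5):
    """Yield lines whose score column is strictly greater than min_score."""
    for line in lines:
        fields = line.split()
        if len(fields) >= column and float(fields[column - 1]) > min_score:
            yield line

example = [
    "chr7\t127471196\t127472363\tPos1\t12\t+\n",
    "chr7\t127472363\t127473530\tPos2\t200\t+\n",
    "chr7\t127478198\t127479365\tNeg3\t25\t-\n",
]
print("".join(filter_by_score(example, 20)), end="")
```

With a real file this would be used as `filter_by_score(open_bed("mybedfile.bed.gz"), 20)`, where the file name is a placeholder.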

↧

How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns

April 17, 2014, 6:02 am

I extracted sequences with fastaFromBed and have no complaints about BEDTools, which is really awesome.

However, the extracted sequences look like this:

>chr19:13985513-13985622
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG

And my input file looks like this:

>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt

I was using this:

fastaFromBed -fi input -bed seq.bed -fo output

So shouldn't those sequences be formatted as FASTA (as NCBI says, "It is recommended that all lines of text be shorter than 80 characters in length"), or at least have the same line length as my input file?

What am I doing wrong that I am getting linearized (FASTA?) output with fastaFromBed?
What is the quickest way to turn those linear sequences to nicely formatted columns using command line?
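Re-wrapping the single-line sequences is a small text transformation. A minimal sketch, assuming the records have been read as a list of lines and using a hypothetical function name; the default width of 60 is an arbitrary choice (the input file above uses 50):

```python
def wrap_fasta(lines, width=60):
    """Return FASTA lines with every sequence re-wrapped to a fixed width."""
    out = []
    seq = []

    def flush():
        joined = "".join(seq)
        seq.clear()
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()           # write out the previous record's sequence
            out.append(line)  # headers are kept unchanged
        else:
            seq.append(line)
    flush()
    return out

records = [">a", "ACGTACGTAC", ">b", "AC", "GT"]
print("\n".join(wrap_fasta(records, width=4)))
```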

↧

Reporting The Bam Reads Overlapping A Set Of Intervals With Bedtools

April 17, 2014, 6:02 am

I am trying to use bedtools to pull out the reads falling directly within a set of BED coordinates. While this command does it successfully:

intersectBed -abam mybam.bam -b intervals.gff -wa -wb -f 1 | coverageBed -abam stdin -b intervals.gff

I find that it loses key information that I need. I'd like to get a listing of the BAM reads -- getting at least their ID -- split by exon. In other words, all the read IDs that fall into the first interval in intervals.gff, all the read IDs that fall into the second interval in intervals.gff... ideally, it would also report the CIGAR string for these reads, but I'd settle for just the ID.

Is there a way to report these reads, such that it's easy to tell from the output which set of reads landed in a given interval in the input BED file?
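The desired output is essentially a grouping of reads by the interval that contains them. A minimal sketch of that grouping, assuming reads and intervals have already been parsed into tuples and that "falling directly within" means the read lies entirely inside the interval (as with `-f 1` in the command above); the function name and tuple layout are hypothetical.

```python
from collections import defaultdict

def reads_by_interval(reads, intervals):
    """Group (read_id, cigar) pairs by the interval that fully contains the read.

    reads:     (read_id, chrom, start, end, cigar) tuples
    intervals: (chrom, start, end) tuples
    """
    grouped = defaultdict(list)
    for read_id, chrom, start, end, cigar in reads:
        for i_chrom, i_start, i_end in intervals:
            if chrom == i_chrom and i_start <= start and end <= i_end:
                grouped[(i_chrom, i_start, i_end)].append((read_id, cigar))
    return dict(grouped)

exons = [("chr1", 100, 300), ("chr1", 500, 700)]
reads = [
    ("read1", "chr1", 120, 170, "50M"),
    ("read2", "chr1", 280, 330, "50M"),   # sticks out of the first exon
    ("read3", "chr1", 510, 560, "20M1I29M"),
]
print(reads_by_interval(reads, exons))
```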

Thank you.

↧

Bed File Bedpe Format

April 17, 2014, 6:02 am

Hi,

I'm having trouble converting a BAM file to BED (-bedpe) using bedtools.

workflow:
samtools sort -n mut.bam mut.Namesorted
bamToBed -i mut.Namesorted.bam -bedpe > dilpMerged_bedpe.bed

After sorting the file by read name (option -n) I run the bamToBed command, but it gives me an error message after processing a few lines:

*ERROR: -bedpe requires BAM to be sorted/grouped by query name.

What am I doing wrong here?

Thanks

↧

error with bedtools slop

April 17, 2014, 6:02 am

Hi,

I am trying to run bedtools slop on my .bed file with hg19.genome:

bedtools slop -i H3K27me3.bed -g hg19.genome -b 30

I get the following error:

Less than the req'd two fields were encountered in the genome file (genomes/hg19.genome) at line 2. Exiting.
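A bedtools genome file is expected to have a chromosome name and a size separated by a tab on each line, so one quick check is to list the lines that do not fit that shape (space-separated fields, blank lines, header text). A minimal sketch with a hypothetical function name:

```python
def bad_genome_lines(lines):
    """Return (line number, raw line) for lines lacking 'name<TAB>size'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or not fields[1].isdigit():
            problems.append((number, line))
    return problems

example = ["chr1\t249250621\n", "chr2 243199373\n", "\n"]
for number, line in bad_genome_lines(example):
    # repr() makes spaces, tabs and stray carriage returns visible.
    print(number, repr(line))
```

On a real file this would be called as `bad_genome_lines(open("hg19.genome"))`, with the path adjusted to wherever the genome file lives.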

Any suggestions?

Thanks in advance

Samad

↧

Error In Bedtools Getfasta: Chromosome Not Found

April 17, 2014, 6:02 am

Hi, I am trying to use BEDTools to get some sequences from genomic coordinates, but I get an error saying "WARNING. chromosome (chr12) was not found in the FASTA file. Skipping." for each read that I have in my BED file. Here are some details about what I am doing. I just downloaded the latest version of BEDTools (I think), bedtools-2.17.0. I have 2 different files (much longer than the small parts shown here). A FASTA file with all the chromosome sequences:

>chr01
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

A BED file with my genomic coordinates (already sorted):

chr01 187814 190840
chr01 307073 310104
chr01 701047 704068
chr01 702941 705962
chr01 702952 705972
chr01 867716 870740
chr01 914064 917087
chr01 991080 994104
chr01 1039795 1042815
chr01 1058713 1061736

And then I run the command line:

bedtools getfasta -fi all.con -bed 1-13sorted2.bed -fo NewCandidates/Genomiccoordinates/1-13_1500.fa

The only thing that I get is "WARNING. chromosome (chr01) was not found in the FASTA file. Skipping." , thousands of tim ...
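A first diagnostic for this kind of warning is to compare the chromosome names in the two files character by character, since invisible differences (trailing spaces, carriage returns, extra text in the header) are easy to miss by eye. The sketch below is a hypothetical helper that works on lists of lines; it takes the FASTA name to be the first whitespace-delimited token after `>`, which is an assumption about how the names are matched rather than a statement about bedtools internals.

```python
def fasta_names(lines):
    """Sequence names from FASTA header lines (first token after '>')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            names.add(tokens[0] if tokens else "")
    return names

def bed_chroms(lines):
    """Chromosome names from the first column of BED lines."""
    return {line.split()[0] for line in lines if line.strip()}

fasta = [">chr01 some description\n", "NNNNNNNN\n", ">chr02\n", "ACGT\n"]
bed = ["chr01\t187814\t190840\n", "chr12\t307073\t310104\n"]
missing = bed_chroms(bed) - fasta_names(fasta)
print(sorted(repr(name) for name in missing))
```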

↧