Channel: Post Feed
Viewing all 3764 articles
Browse latest View live

General Considerations For Genomic Overlaps?

Hello, I was wondering about general considerations for overlapping genomic regions and doing Monte Carlo-type statistics.

Below I describe how I do it. Unfortunately I'm not fully confident that this is correct, so I'd appreciate any thoughts on it.

For example, I have an experimental dataset (A) of 10 bp intervals, with approx. 5,000 entries spread across the genome.

Then I have another experimental dataset (B) (ChIP-seq) of ~1,000 bp intervals, with ~50,000 entries spread across the genome.

If I intersect A with B using BEDTools I get my overlap, e.g. 2,000 entries from A.

But I also want to find overlaps in the vicinity of the ChIP-seq peaks, so I extend the peaks, e.g. by 1,000 bp on each side. There are still 50,000 entries, but the portion of the genome that is searched becomes larger, and some entries may now overlap as well.

So I intersect A and B again, counting each entry in A only once. This gives me e.g. 3,000 entries from A.

For the simulations, I use random intervals that look like dataset B: I pick 50,000 intervals of 1,000 bp at random, intersect them with A, and repeat this 1,000 times. This gives me e.g. an average of 500 entries from A.

For overlaps in the vicinity, I calculate the total size of dataset B and generate random intervals of the same length and total size in bp as dataset B (size-matched sampling).
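The simulation described above could be sketched roughly as follows in Python. This is only an illustration under simplifying assumptions: a single linear "genome" of hypothetical size, half-open coordinates, and made-up toy data. The function names (count_overlapped, random_intervals, empirical_p) are hypothetical and not part of BEDTools.

```python
import random
from bisect import bisect_left


def count_overlapped(a_starts_sorted, a_len, b_intervals):
    """Count A entries (each counted once) that overlap any B interval.

    A entries are half-open intervals [s, s + a_len); B intervals are
    half-open (start, end) tuples. a_starts_sorted must be sorted.
    """
    hit = set()
    for b_start, b_end in b_intervals:
        # A overlaps B when s < b_end and s + a_len > b_start.
        lo = bisect_left(a_starts_sorted, b_start - a_len + 1)
        hi = bisect_left(a_starts_sorted, b_end)
        hit.update(range(lo, hi))
    return len(hit)


def random_intervals(n, length, genome_size, rng):
    """Draw n random intervals of a fixed length (size-matched to B)."""
    starts = (rng.randrange(genome_size - length + 1) for _ in range(n))
    return [(s, s + length) for s in starts]


def empirical_p(observed, a_starts_sorted, a_len, n_b, b_len,
                genome_size, n_sim=1000, seed=0):
    """Return (empirical p-value, mean simulated overlap count)."""
    rng = random.Random(seed)
    sims = [
        count_overlapped(a_starts_sorted, a_len,
                         random_intervals(n_b, b_len, genome_size, rng))
        for _ in range(n_sim)
    ]
    n_ge = sum(s >= observed for s in sims)
    # The +1 terms keep the estimate away from an impossible p = 0.
    return (n_ge + 1) / (n_sim + 1), sum(sims) / n_sim


# Toy data (all values hypothetical).
genome_size = 1_000_000
toy_rng = random.Random(1)
a_starts = sorted(toy_rng.randrange(genome_size - 10) for _ in range(50))
b_real = random_intervals(500, 1000, genome_size, toy_rng)
observed = count_overlapped(a_starts, 10, b_real)
p_value, mean_sim = empirical_p(observed, a_starts, 10, 500, 1000,
                                genome_size, n_sim=200)
```

For the extended peaks, the same functions would be called with the extended interval length, so that the random sets stay size-matched to whatever version of B is being tested.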

I hope you can follow this way of thinking.

So the question basically is, is this correct? And how far can I extend my intervals before the overlap becomes artificial? The largest sizes I'm overlapping are ~15% of the genome in dataset B, and this gives me almost all entries from A. This is far higher than in 1,000 simulations.

Any thoughts are appreciated, e.g. would it be better to turn it around and make the entries in A larger instead?


Calculating Exome Coverage

Edit to make the post clearer (mapping was done with Bowtie2): my problem is that calculating exome coverage via coverageBed gives a different result than via genomeCoverageBed. So I'm not sure whether I'm doing something wrong, or which of the two methods is correct.

1) My first step is to build a .bed file of my Illumina paired-end reads, keeping only the positions that fall in targeted exon regions. I do that via intersectBed -a [data.bed] -b [illuminaexonregions.bed].

2) My next step is to calculate the coverage of my new data file via coverageBed -a [newdata.bed] -b [illuminaexonregions.bed]. I calculated some statistics:

Number of exons 214126 with a total length of 45326818

Number of matched nucleotides 10993449.0

Nucleotides/Length*100 24.253740909 % Coverage.

3) The next step was to calculate the coverage of my new data file via genomeCoverageBed -i [newdata.bed] -g [genome.txt] -d | awk '$3>0 {print $1"\t"$2"\t"$3}'. I calculated some statistics:

Number of exons 214126 with a total length of 45326818

Number of matched nucleotides 10576907.0

Nucleotides/Length*100 23.3347661863 % Coverage.
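For reference, the two percentages quoted above follow directly from the "Nucleotides/Length*100" formula. A minimal Python sketch that reproduces them from the posted numbers (the function name percent_covered is just for illustration):

```python
def percent_covered(covered_bases, target_length):
    """Breadth of coverage as a percentage of the target length."""
    return 100.0 * covered_bases / target_length


target_length = 45326818  # total exon length from the post

via_coveragebed = percent_covered(10993449, target_length)
via_genomecov = percent_covered(10576907, target_length)

# Number of bases by which the two methods disagree.
difference_bases = 10993449 - 10576907
```

Printing difference_bases shows how many matched nucleotides separate the two methods, which is the quantity that needs explaining.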

Somehow there's a difference in matched nucleotides, which I can't explain. What am I doing wrong?

How to explain uneven coverage of a DNA segment obtained via PCR amplification.

Experiment: deep sequencing for mutants in a 700 nt fragment.

The DNA fragment was pre-amplified with primers flanking the fragment, followed by HiSeq sequencing.

Per-base coverage was calculated by coverageBed -d -abam in.bam -b ref.bed > out.cov

Observation: two distinct peaks in coverage at the ends, as in the plot below.

[Figure: coverage vs. position]

The peaks are made up of reads that contain part of the primers, and these reads therefore also show soft clipping at their ends.

There is a huge difference in the calculations depending on whether I include or exclude such reads.
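One way to separate such reads is to look at the soft-clip operations ("S") at either end of the CIGAR string. The following Python sketch shows only the CIGAR parsing; the function name soft_clipped_bases is hypothetical, and how to obtain the CIGAR strings (e.g. column 6 of SAM output) and what clip threshold to use are left open.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Return (left, right) soft-clipped lengths for a CIGAR string."""
    # Hard clips (H) sit outside soft clips, so drop them first.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return (0, 0)  # e.g. "*" for an unavailable CIGAR
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)


examples = ["20S80M", "80M15S", "100M", "5H10S70M12S3H"]
clips = [soft_clipped_bases(c) for c in examples]
```

Reads whose clipped lengths exceed some chosen cutoff could then be reported separately, so that coverage can be compared with and without them.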

Question: is there anyone who knows how to handle such a situation?

Extracting Genomic Coverage Information Across Different Samples

Hello, I have 3 BAM files that I want to compare against each other. For example, I have a reference file with 10,000 sequences, and paired-end reads sequenced for 3 different samples:

1) Sample 1 is 100% the same as the reference, so we expect all reads to map to it.
2) Sample 2 is 80% similar to the reference, so 20% of the reference sequences won't have any reads.
3) Sample 3 is 60% similar to the reference, so 40% of the reference sequences won't have any reads.

My goal is to identify which reference sequences do not have any reads mapped in Samples 2 and 3, i.e. the 20% of reference sequences from Sample 2 and the 40% from Sample 3. Also, in some cases, for a reference that is approx. 10 kb long, Sample 1 maps to the entire 10 kb, Sample 2 maps to the first 5 kb and Sample 3 maps to the last 3 kb, so I need to identify the partial regions for those reference sequences as well.

I have the mapped, sorted BAM files for all three samples. I am looking into using bedtools but am not sure which bedtools command will give the answer I need. I have the following commands, which might do something similar, but they output differences at every base:

genomeCoverageBed -bg -ibam sample1.bam > sample1.bedgraph
genomeCoverageBed -bg -ibam sample2.bam > sample2.bedgraph
unionBedGraphs -header -i sample1.bedgraph sample2. ...
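The goal described above (reference sequences with no reads, plus uncovered parts of partially covered references) amounts to finding the gaps between covered intervals. A minimal Python sketch, assuming the covered intervals have already been read from a bedGraph-like file into a dictionary; the function name uncovered_regions and the toy data are hypothetical:

```python
def uncovered_regions(ref_lengths, covered):
    """Return {ref: [(start, end), ...]} of regions with no coverage.

    ref_lengths maps reference name -> length in bp.
    covered maps reference name -> list of 0-based half-open
    (start, end) intervals with coverage > 0.
    """
    gaps = {}
    for ref, length in ref_lengths.items():
        pos = 0  # end of the covered stretch seen so far
        ref_gaps = []
        for start, end in sorted(covered.get(ref, [])):
            if start > pos:
                ref_gaps.append((pos, start))
            pos = max(pos, end)
        if pos < length:
            ref_gaps.append((pos, length))
        if ref_gaps:
            gaps[ref] = ref_gaps
    return gaps


# Toy data: ref1 is covered only in its first 5 kb, ref2 not at all.
ref_lengths = {"ref1": 10000, "ref2": 8000}
sample2_covered = {"ref1": [(0, 2500), (2400, 5000)]}
gaps = uncovered_regions(ref_lengths, sample2_covered)
```

A reference whose only gap spans its full length has no reads at all; shorter gaps mark the partially uncovered regions.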

Bowtie2 Mapping Different Number Of Reads To Same Sequence When Ref-Seq Is Part Of Different Indexes

I am using bowtie2 to map my paired-end (PE) reads.

I have indexed multiple bacterial genomes together by combining them into one multi-FASTA file.

bowtie2 -q -a -p 1 -x Multi -1 R100_1.fq -2 R100_2.fq -U 100_Orph.fastq -S 100.sam
samtools view -b -S 100.sam -o 100.bam
coverageBed -abam 100.bam -b BED_RefSeq >>100.cvg
The coverageBed output for genome ("307679329") is:

307679329       1       25751   72      3568    25750   0.1385631

But when I index genome ("307679329") separately, the coverageBed output is:

307679329       1       25751   449     8369    25750   0.3250097

Can someone explain this difference?

Correlation Of Fpkm And Length Normalized Transcript Mapped Read Count

Hello, in the process of estimating expression for a 16-tissue human dataset ("Human Body Map 2.0 GSE30611") I used different methods to estimate the expression of the genes. After mapping against the hg19 genome version, I used the UCSC-provided RefSeq annotation for hg19 to count mapped reads for ~40,000 human genes in two ways:
  1. Counting with cufflinks, which outputs a Fragments Per Kilobase per Million mapped fragments (FPKM) value for each transcript. The FPKM value basically accounts for library size and for the length of the transcript comprising all the annotated exons, plus some additional likelihood estimator to assign reads (see here).
  2. Counting mapped reads with bedtools and dividing a transcript's mapped count by the sum of all its exon lengths. This gives a length-normalized expression estimate that can be compared between genes.
However, the correlation of (1.) and (2.) is always around ~0.65 within the same tissue (technically the same experiment). I would expect this correlation to be > 0.9. Below, I plotted (2.) against (1.) for all ~40,000 transcripts. It seems like plain length normalization is simply overestimating some expression. Can someone she ...
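To make the two normalizations concrete, here is a minimal Python sketch of the basic FPKM arithmetic next to plain length normalization. It deliberately leaves out the likelihood-based read assignment mentioned above; the function names and the example numbers are made up for illustration.

```python
import math


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def length_normalized(count, transcript_length_bp):
    """Mapped count divided by the summed exon length."""
    return count / transcript_length_bp


# Hypothetical transcript: 100 fragments, 2,000 bp, 1 million mapped.
example_fpkm = fpkm(100, 2000, 1_000_000)
example_norm = length_normalized(100, 2000)

# Within one sample the ratio depends only on the library size.
ratio = example_fpkm / example_norm
```

Given identical counts, the two values differ only by a per-sample constant, which would not change a correlation. A correlation well below 1 therefore suggests that the two methods assign different counts to the transcripts, rather than a problem in the length normalization itself.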

Does Bedtools Intersect -V Consider Unmapped Reads "As Not In B"

bedtools intersect -v -abam my.bam -b myregions.gff > notinmyregions.bam

Would we see reads with 4 in the FLAG field, i.e. unmapped reads, in notinmyregions.bam?
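Note that the FLAG field is a bit field, so an unmapped read does not necessarily have a FLAG of exactly 4; it has the 0x4 bit set. A small Python sketch of that check (the function name is_unmapped is just for illustration):

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def is_unmapped(flag):
    """True if the 'segment unmapped' bit is set in a SAM FLAG value."""
    return bool(flag & UNMAPPED)


# 4: unmapped single-end read; 77: unmapped first mate of a pair;
# 99: properly paired, mapped first mate.
flags = {4: is_unmapped(4), 77: is_unmapped(77), 99: is_unmapped(99)}
```

The same test can be applied to column 2 of the SAM records in the output file to see whether any unmapped reads were kept.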

Help With Exception When Using Bedtools Coveragebed With Paired Alignment. [Resolved]

I use bwa mem to align paired reads to a few hundred microbial contigs; then I sort the alignment and try to get coverage using bedtools genomecov -ibam alignments.paired.sorted.bam -bg >ranges.txt, which fails with an exception:

*** glibc detected *** bedtools: double free or corruption (out): 0x0000000001c5f270 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3d7b2750c6]
bedtools[0x45ab43]
bedtools[0x45b146]
bedtools[0x45c163]
bedtools[0x45e2ed]
bedtools[0x434c4b]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3d7b21ecdd]

If I run the same on an unpaired alignment, everything is OK. So I am really not sure where my mistake is... maybe bedtools can't digest the paired alignment?

-- edit: works with the latest versions of these tools. Here are the ones that failed:

$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.0-r313
Contact: Heng Li <lh3@sanger.ac.uk>

$ bedtools -version
bedtools v2.16.1

Split A Bam File Into Several Files Containing All The Alignments For X Number Of Reads.

Hi everyone! I am struggling with annotating a very big .bam file that was mapped using TopHat. The run had a large number of reads: ~200M. The problem is that when I try to annotate each read using a GFF file (with BEDTools intersectBed), the BED file that is produced is huge: over 1.7 TB! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT department increased the $TMPDIR local disk space to 1.5 TB so I could run everything on $TMPDIR, but that is still not enough.

What I think I should do is split this .bam file into several files, maybe 15, so that each set of reads gets annotated separately on a different node. That way, I would not run out of disk space. When all the files are annotated, I can run groupBy on each, and then simply sum the number of reads that each feature in the GFF got across all the files.

However, there is a slight complication: after the annotation with intersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and divides each read by that number. I.e., if a read mapped to 2 regions, each instance of the read is worth 1/2, so that it contributes only 1/2 a read to each of the features it mapped to. Because of this, I need all the alignments from the .bam file that belong to a given read to be contained in one single file. That is to say, I ...
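The splitting requirement described above (all alignments of a read in the same chunk) and the 1/N weighting could be sketched like this in Python. The sketch assumes the records are already grouped by read name (e.g. a name-sorted file) and represents each record as a hypothetical (read_name, feature) tuple; reading the actual BAM or BED file is left out.

```python
from collections import defaultdict
from itertools import groupby


def chunk_alignments(records, reads_per_chunk):
    """Yield lists of records, never splitting one read across chunks.

    records must be grouped by read name (first tuple element).
    """
    chunk, n_reads = [], 0
    for _, group in groupby(records, key=lambda rec: rec[0]):
        chunk.extend(group)
        n_reads += 1
        if n_reads == reads_per_chunk:
            yield chunk
            chunk, n_reads = [], 0
    if chunk:
        yield chunk


def fractional_counts(chunk):
    """Give each read a total weight of 1, split over its features."""
    hits = defaultdict(list)
    for name, feature in chunk:
        hits[name].append(feature)
    counts = defaultdict(float)
    for features in hits.values():
        weight = 1.0 / len(features)
        for feature in features:
            counts[feature] += weight
    return dict(counts)


# Toy records: r1 maps to two features, r2 and r3 to one each.
records = [("r1", "geneA"), ("r1", "geneB"), ("r2", "geneA"), ("r3", "geneC")]
chunks = list(chunk_alignments(records, reads_per_chunk=2))
per_chunk = [fractional_counts(c) for c in chunks]
```

Because every read lives entirely in one chunk, the per-chunk counts can simply be summed per feature afterwards.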

Bedtools Intersectbed

Apologies if this is blatantly obvious!

I would like to compare coordinates in setA with those of setB. The output should have the same number of coordinates as setA and tell me how many nucleotides of each setA coordinate are overlapped by any coordinate in setB.

For example, a large coordinate in setA may be overlapped by two setB coordinates, but I want to know how many nucleotides of the setA coordinate are covered by both setB coordinates in total.

I know how to do this in Galaxy, as there is the handy 'Coverage' tool in 'Operate on Genomic Intervals'. However, I want to do this on the command line. I have been trying to get BEDTools to do this using 'intersectBed', but I can only seem to get just the overlapping setA coordinates (using -u), or the nucleotide overlap for multiple setB coordinates on separate lines (using -wao), or a count of how many setB coordinates overlap each setA coordinate (using -c).

SetB coordinates are non-overlapping themselves, so I guess I could tally up the overlaps of those setB coordinates that overlap the same setA coordinate.
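The tallying idea from the previous paragraph could look roughly like this in Python. The sketch assumes -wao output from two plain 3-column BED files (so the overlap in bp is the last column and the setA coordinate is the first three columns), unique setA coordinates, and non-overlapping setB coordinates. The example lines are made up.

```python
def covered_bases_per_a(wao_lines):
    """Sum the overlap column of -wao style lines per setA coordinate."""
    totals = {}
    for line in wao_lines:
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], int(fields[1]), int(fields[2]))
        # Last column: number of overlapping bp (0 when nothing overlaps).
        totals[key] = totals.get(key, 0) + int(fields[-1])
    return totals


# Hypothetical -wao lines: A fields, B fields, overlap in bp.
wao_lines = [
    "chr1\t100\t500\tchr1\t90\t150\t50",
    "chr1\t100\t500\tchr1\t400\t600\t100",
    "chr1\t1000\t1200\t.\t-1\t-1\t0",
]
totals = covered_bases_per_a(wao_lines)
```

The result has one entry per setA coordinate, including those with no overlap at all.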

Can BEDTools do what I want, or is there another command-line way of doing it?

Thank you!

PS: I have also sent this to the BEDTools discussion list, so apologies for any double postings!

Is It Possible To Filter Only Bookend Reads From A Bed File?

I have a BED file with many fragments: some overlapping, some on their own and some directly adjacent to each other (book-ended features).

I know I can group overlapping and book-ended features using bedtools like this:

bedtools cluster -i fragments.bed

However, I was wondering if anyone knew of a way of obtaining from the input file only the fragments that have a book-ended (directly adjacent) neighbour.
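As a plain-Python illustration of what "book-ended" means here: a fragment qualifies if its end equals another fragment's start on the same chromosome, or its start equals another fragment's end. This sketch works on in-memory tuples with made-up coordinates; the function name bookended is hypothetical and parsing the BED file is left out.

```python
def bookended(fragments):
    """Return fragments that directly abut another fragment.

    fragments is a list of (chrom, start, end) tuples with start < end.
    """
    starts = {(chrom, start) for chrom, start, _ in fragments}
    ends = {(chrom, end) for chrom, _, end in fragments}
    return [
        (chrom, start, end)
        for chrom, start, end in fragments
        if (chrom, end) in starts or (chrom, start) in ends
    ]


fragments = [
    ("chr1", 100, 200),  # book-ended with the next fragment
    ("chr1", 200, 300),
    ("chr1", 250, 400),  # overlapping, but not book-ended
    ("chr1", 500, 600),  # on its own
    ("chr2", 200, 300),  # same coordinates, different chromosome
]
adjacent = bookended(fragments)
```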

Any ideas?

Best regards

Getting Unmapped Reads: Comparing Fastq To Bam

Given a FASTQ file and a BAM file of aligned reads, is there an efficient way to get all reads that are in the original FASTQ but not in the BAM, perhaps using bedtools? I.e.:

unmapped_script original.fastq aligned.bam > unmapped.fastq

should create an unmapped.fastq file, a subset of original.fastq containing only those entries that do not appear in aligned.bam.
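A minimal Python sketch of the filtering step, assuming the names of the reads present in the BAM have already been collected into a set (for example from the first column of the SAM records), that each FASTQ record is exactly four lines, and that names match without /1 or /2 suffixes. The function name unmapped_fastq and the toy reads are hypothetical.

```python
def unmapped_fastq(fastq_lines, mapped_names):
    """Return the FASTQ lines of reads whose name is not in mapped_names."""
    kept = []
    record = []
    for line in fastq_lines:
        record.append(line)
        if len(record) == 4:  # header, sequence, '+', qualities
            # Read name: header without '@', up to the first whitespace.
            name = record[0][1:].split()[0]
            if name not in mapped_names:
                kept.extend(record)
            record = []
    return kept


fastq_lines = [
    "@r1 extra", "ACGT", "+", "IIII",
    "@r2", "GGCC", "+", "IIII",
]
unmapped = unmapped_fastq(fastq_lines, mapped_names={"r1"})
```

Keeping the names in a set makes each lookup cheap, at the cost of holding all mapped read names in memory.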

Thank you.

How To Create A Read Density Profile Within A Interval?

Hi!

I need some help: I have to create a density profile with a window size of 1 kb (how many times a sequence is detected after the NGS method). I have to use SAM and BEDTools; I think I can use genomeCov in BEDTools but I don't have a reference genome.
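Conceptually, a 1 kb density profile is just a count of reads per fixed-size window. A small Python sketch of that binning, assuming read start positions have already been extracted (e.g. columns 3 and 4 of the SAM records); the function name window_counts and the positions are made up.

```python
from collections import Counter


def window_counts(read_starts, window=1000):
    """Count reads per fixed-size window.

    read_starts is an iterable of (chrom, position) pairs; the key of
    each count is (chrom, window_start).
    """
    counts = Counter()
    for chrom, pos in read_starts:
        counts[(chrom, pos // window * window)] += 1
    return counts


read_starts = [("chr1", 10), ("chr1", 999), ("chr1", 1000),
               ("chr1", 2500), ("chr2", 50)]
profile = window_counts(read_starts)
```

This only needs the reference names and positions found in the alignments themselves; windows with no reads are simply absent from the result.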

So, if anybody is able to help me...

Thanks

Intersectbed Overlap

Hi,

I have a question about intersectBed. Is it possible to extract only alignments like this:

chromosome ===============================================================
BED/BAM A               ==============              =================
BED FILE B               ============
RESULT                  ==============

But not alignments like this (even if the read overlaps 100% of the feature, I don't want to extract these reads):

chromosome ===============================================================
BED/BAM A    =========================              =================
BED FILE B               =============
RESULT

So, only extract reads that have 90-95% of their sequence overlapping 90-95% of the feature.

Is it clear?
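The condition described here is a reciprocal overlap: the overlap must cover at least a given fraction of the read and at least the same fraction of the feature. A minimal Python sketch of that test, with half-open coordinates, non-empty intervals and made-up numbers (the function name is hypothetical; intersectBed has -f and -r options for this kind of requirement):

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, min_frac=0.9):
    """True if the overlap is >= min_frac of both interval lengths."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return (overlap >= min_frac * (a_end - a_start)
            and overlap >= min_frac * (b_end - b_start))


# Read and feature of similar size: kept.
similar = reciprocal_overlap(100, 200, 105, 200)
# Long read spanning the whole feature: rejected, because the
# overlap is only a small fraction of the read.
spanning = reciprocal_overlap(0, 300, 105, 200)
```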

Thanks,

N.

Can Bedtools/Bedops Used To Extract Regions Where Scores Are Higher Than A Given Value?

I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value? For example, let's say that I have a BED file like the following:

chr7 127471196 127472363 Pos1 12 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 200 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 120 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 54 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 2 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 15 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 25 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 2 + 127479365 127480532 255,0,0
chr7 127480532 127481699 Neg4 9 - 127480532 127481699 0,0,255

According to the BED format's specs, the fifth column contains a score between 0 and 1000 (alternatively, in the bedGraph format the score is in the 4th position). If I want to get all the regions that have a score higher than 20, for example, I can do an awk search:

$: awk '$5 > 20 {print}' mybedfile.bed

However, in order to use awk, I have to keep the BED file in an uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with th ...
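For comparison with the awk one-liner, the same score filter can be written in a few lines of Python. This sketch works on any iterable of lines, so it could in principle be fed from a compressed file opened in text mode (e.g. with gzip.open(path, "rt")); reading .starch archives is not covered. The function name filter_by_score is hypothetical, and the rows are the example rows from the post.

```python
def filter_by_score(lines, min_score, score_col=4):
    """Keep lines whose score column (0-based index) exceeds min_score."""
    kept = []
    for line in lines:
        fields = line.split()
        if float(fields[score_col]) > min_score:
            kept.append(line)
    return kept


bed_rows = [
    "chr7 127471196 127472363 Pos1 12 + 127471196 127472363 255,0,0",
    "chr7 127472363 127473530 Pos2 200 + 127472363 127473530 255,0,0",
    "chr7 127473530 127474697 Pos3 120 + 127473530 127474697 255,0,0",
    "chr7 127474697 127475864 Pos4 54 + 127474697 127475864 255,0,0",
    "chr7 127475864 127477031 Neg1 2 - 127475864 127477031 0,0,255",
    "chr7 127477031 127478198 Neg2 15 - 127477031 127478198 0,0,255",
    "chr7 127478198 127479365 Neg3 25 - 127478198 127479365 0,0,255",
    "chr7 127479365 127480532 Pos5 2 + 127479365 127480532 255,0,0",
    "chr7 127480532 127481699 Neg4 9 - 127480532 127481699 0,0,255",
]
high = filter_by_score(bed_rows, 20)
high_names = [row.split()[3] for row in high]
```

For bedGraph input the score is in the fourth column, so score_col=3 would be used instead.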

How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns

I extracted sequences with fastaFromBed and have no complaints about BEDTools, which is a really awesome thing.

However, the extracted sequences look like this:

>chr19:13985513-13985622
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG

And my input file looks like this:

>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt

I was using this:

fastaFromBed -fi input -bed seq.bed -fo output

So shouldn't those sequences be formatted as FASTA (as NCBI says, "It is recommended that all lines of text be shorter than 80 characters in length"), or at least have the same line length as my input file?

What am I doing wrong that I am getting linearized (FASTA?) output from fastaFromBed?
What is the quickest way to turn those linear sequences into nicely formatted columns on the command line?
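As one possible way to do the re-wrapping, here is a short Python sketch that splits each sequence line into fixed-width pieces while leaving header lines untouched. The function name wrap_fasta and the default width of 60 are arbitrary choices for illustration, not anything prescribed by BEDTools.

```python
def wrap_fasta(lines, width=60):
    """Re-wrap the sequence lines of FASTA text to a fixed width."""
    wrapped = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            wrapped.append(line)  # header lines are kept as they are
        else:
            # Slice the sequence into consecutive pieces of `width`.
            wrapped.extend(line[i:i + width]
                           for i in range(0, len(line), width))
    return wrapped


example = [">seq1", "ACGTACGTAC", ">seq2", "GGCC"]
result = wrap_fasta(example, width=4)
```

Since each input line is handled independently, this assumes every sequence is on a single line, as in the linearized output shown above.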

Reporting The Bam Reads Overlapping A Set Of Intervals With Bedtools

I am trying to use bedtools to pull out the reads falling directly within a set of BED coordinates. While this command does it successfully:

intersectBed -abam mybam.bam -b intervals.gff -wa -wb -f 1 | coverageBed -abam stdin -b intervals.gff

I find that it loses key information that I need. I'd like to get a listing of the BAM reads -- getting at least their ID -- split by exon. In other words, all the read IDs that fall into the first interval in intervals.gff, all the read IDs that fall into the second interval in intervals.gff... ideally, it would also report the CIGAR string for these reads, but I'd settle for just the ID.

Is there a way to report these reads, such that it's easy to tell from the output which set of reads landed in a given interval in the input BED file?

Thank you.

Bed File Bedpe Format

Hi,

I'm having trouble converting a BAM file into BEDPE format using bedtools.

workflow:
samtools sort -n mut.bam mut.Namesorted
bamToBed -i mut.Namesorted.bam -bedpe > dilpMerged_bedpe.bed

After sorting the file by read name (option -n) I run the bamToBed command, but it gives me an error message after running a few lines:

*ERROR: -bedpe requires BAM to be sorted/grouped by query name.

What am I doing wrong here?

Thanks

A.

Error With Bedtools Slop

Hi, 

I am trying to run bedtools slop on my .bed file with hg19.genome:

bedtools slop -i H3K27me3.bed -g hg19.genome -b 30

I get the following error:

Less than the req'd two fields were encountered in the genome file (genomes/hg19.genome) at line 2.  Exiting.

Any suggestions?

Thanks in advance

Samad

Error In Bedtools Getfasta: Chromosome Not Found

Hi, I am trying to use BEDTools to get some sequences from genomic coordinates, but I am getting an error saying "WARNING. chromosome (chr12) was not found in the FASTA file. Skipping." for each read in my BED file. Here are some details about what I am doing. I just downloaded (I think) the latest version of BEDTools, bedtools-2.17.0. I have 2 different files (much longer than the small parts shown here).

A FASTA file with all the chromosome sequences:

>chr01
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

A BED file with my genomic coordinates (already sorted):

chr01 187814 190840
chr01 307073 310104
chr01 701047 704068
chr01 702941 705962
chr01 702952 705972
chr01 867716 870740
chr01 914064 917087
chr01 991080 994104
chr01 1039795 1042815
chr01 1058713 1061736

Then I run the command:

bedtools getfasta -fi all.con -bed 1-13sorted2.bed -fo NewCandidates/Genomiccoordinates/1-13_1500.fa

The only thing that I get is "WARNING. chromosome (chr01) was not found in the FASTA file. Skipping.", thousands of tim ...