Identify Overlapping And Non Overlapping Regions For Paired-End Data

April 19, 2014, 6:20 am

File 1:

gene1            gene2
chr1    25    30    chr1    34    37
chr1    15    20    chr1    25    28
chr1    80    90    chr1    10    13

File 2:

gene1            gene2
chr1    25    30    chr1    36    39
chr1    15    20    chr1    18    20
chr1    80    90    chr1    19    22

common gene1 uniq gene2 (when we compare file 1 with file2)
chr1    15    20    chr1    25    28
chr1    80    90    chr1    10    13

common gene1 uniq gene2 (when we compare file2 with file1)

chr1    15    20    chr1    18    20
chr1    80    90    chr1    19    22

common gene1 common gene2
chr1    25    30    chr1    34    37    chr1    25    30    chr1    36    39

I was able to get the pairs that are common in both gene1 and gene2 with bedtools pairToPair, but I have a problem with the "common gene1, unique gene2" case.
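
One way the "common gene1, unique gene2" case could be sketched is a small awk join. This is only a sketch under assumptions that are not stated in the question: the files are tab-separated with no header line, gene1 must match exactly (same chromosome, start and end), gene2 counts as "common" when the intervals overlap, and coordinates are half-open as in BED. The function name common_gene1_uniq_gene2 and the file names are made up for illustration.

```bash
# Sample data copied from the question (tab-separated, no header line).
tmp=$(mktemp -d)
printf 'chr1\t25\t30\tchr1\t34\t37\nchr1\t15\t20\tchr1\t25\t28\nchr1\t80\t90\tchr1\t10\t13\n' > "$tmp/file1.bedpe"
printf 'chr1\t25\t30\tchr1\t36\t39\nchr1\t15\t20\tchr1\t18\t20\nchr1\t80\t90\tchr1\t19\t22\n' > "$tmp/file2.bedpe"

# common_gene1_uniq_gene2 A B (hypothetical helper):
# print the rows of A whose gene1 (cols 1-3) also occurs in B, while their
# gene2 (cols 4-6) overlaps none of the gene2 intervals B pairs with it.
common_gene1_uniq_gene2() {
  awk -F'\t' '
    NR == FNR {                      # first file given to awk: B
      k = $1 FS $2 FS $3
      n[k]++
      c[k, n[k]] = $4; s[k, n[k]] = $5; e[k, n[k]] = $6
      next
    }
    {                                # second file given to awk: A
      k = $1 FS $2 FS $3
      if (!(k in n)) next            # gene1 is not shared, skip the row
      hit = 0
      for (i = 1; i <= n[k]; i++)
        if (c[k, i] == $4 && s[k, i] < $6 && $5 < e[k, i]) hit = 1
      if (!hit) print                # gene1 shared, gene2 unique to A
    }' "$2" "$1"
}

common_gene1_uniq_gene2 "$tmp/file1.bedpe" "$tmp/file2.bedpe"
```

Swapping the two arguments gives the file2-versus-file1 comparison. If near matches of gene1 should also count as common, the exact-key lookup would have to be replaced by an overlap test.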

↧

error with bedtools slop

April 19, 2014, 6:20 am

Hi,

I am trying to run bedtools slop on my .bed file with hg19.genome:

bedtools slop -i H3K27me3.bed -g hg19.genome -b 30

I get the following error:

Less than the req'd two fields were encountered in the genome file (genomes/hg19.genome) at line 2. Exiting.

Any suggestions?

Thanks in advance

Samad
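
The error message points at line 2 of the genome file, so a first step could be to look for lines that do not have two tab-separated fields (for example a blank line, or a line that uses spaces instead of a tab). The sketch here assumes that is the cause, which the post does not confirm. The helper name check_genome and the sample file are made up for illustration.

```bash
# check_genome FILE (hypothetical helper): report every line that has fewer
# than two tab-separated fields; exit status is 1 if any such line exists.
check_genome() {
  awk -F'\t' 'NF < 2 { printf "line %d: %s\n", NR, $0; bad = 1 }
              END { exit bad }' "$1"
}

# Sample genome file whose second line uses a space instead of a tab.
tmp=$(mktemp -d)
printf 'chr1\t249250621\nchr2 243199373\n' > "$tmp/bad.genome"

check_genome "$tmp/bad.genome" || echo "genome file needs fixing"

# One possible repair: drop empty lines and rewrite any run of whitespace
# between the two fields as a single tab.
awk -v OFS='\t' 'NF >= 2 { $1 = $1; print }' "$tmp/bad.genome" > "$tmp/fixed.genome"
check_genome "$tmp/fixed.genome" && echo "fixed file looks fine"
```

If the file passes this check and the error persists, the cause is probably something else, such as a header line or Windows line endings.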

↧

Per Base Coverage

April 19, 2014, 6:20 am

Is there a way to obtain per-base coverage for a defined chromosome interval using a BAM file generated from Illumina single-end reads? genomeCoverageBed in BEDTools does not seem to have an option for it.
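
As a rough illustration of what "per-base coverage for an interval" means, here is an awk sketch that works on BED intervals. It assumes the alignments were first converted to BED (for instance with bedtools bamtobed), which is a step added here and not mentioned in the question; the helper name per_base_depth and the sample reads are made up. If samtools is available, its depth command with a region given through -r may be a more direct route.

```bash
# per_base_depth BED CHROM START END (hypothetical helper):
# print the depth at every 1-based position of CHROM:START-END, computed from
# BED intervals (0-based start, exclusive end).
per_base_depth() {
  awk -F'\t' -v chr="$2" -v lo="$3" -v hi="$4" '
    $1 == chr {
      from = $2 + 1; to = $3 + 0     # BED interval as 1-based inclusive range
      if (from < lo) from = lo       # clip to the requested interval
      if (to > hi) to = hi
      for (p = from; p <= to; p++) depth[p]++
    }
    END {
      for (p = lo; p <= hi; p++) printf "%s\t%d\t%d\n", chr, p, depth[p] + 0
    }' "$1"
}

# Three made-up reads; only the chr1 reads matter for the query.
tmp=$(mktemp -d)
printf 'chr1\t0\t5\nchr1\t3\t8\nchr2\t0\t10\n' > "$tmp/reads.bed"

per_base_depth "$tmp/reads.bed" chr1 3 6
```

Positions with no reads are printed with depth 0, so the output always has one line per base of the interval.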

↧

How To Install Bedtools In A User Directory

April 20, 2014, 7:15 am

I am trying to install BEDTools in a user directory. However, I looked at the manual and at the makefile, and there is no argument like "--prefix" that I can change. Is there a way to install all of BEDTools in a directory that I specify? Thanks!

↧

What Is The Best Way To Run Bedtools In Parallel With Blocking

April 20, 2014, 7:15 am

Say I am working on a server with a shared file system and 4 quad-core nodes (I/O is not an issue; 16 cores total). I want to run coverageBed across 20 files. Currently I have a shell script that does this sequentially. It is possible to just background the commands so that they run in parallel, but I am not sure how to block in bash (the next step requires counting across the files). Assuming I/O is not a bottleneck, what are ways of taking advantage of multiple nodes/cores when running bedtools (or any other sequential commands, for that matter)?

From my rudimentary understanding of parallel programming, the concept I am trying to get at is how to 'block' so that the next command after coverageBed is not executed until all coverageBed runs are done.

I was thinking of wrapping the shell commands in a Python script that keeps a queue of coverageBed commands and a function that feeds commands 4 at a time (since the nodes are quad-core); the function would only return when the queue is empty. Is there a better way of doing this?
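
For a single node, the shell builtin wait is probably the simplest way to block: it returns only when the background jobs started by the script have finished. Here is a minimal sketch in which count_lines is a made-up stand-in for one coverageBed run and the input files are dummies. It runs jobs in batches of four and does not spread work across nodes.

```bash
# Dummy inputs standing in for the real files.
tmp=$(mktemp -d)
for i in 1 2 3 4 5 6; do printf 'a\nb\n' > "$tmp/in$i.txt"; done

# count_lines is a hypothetical stand-in for one coverageBed run.
count_lines() { wc -l < "$1" > "$1.out"; }

max_jobs=4
running=0
for f in "$tmp"/in*.txt; do
  count_lines "$f" &               # start the job in the background
  running=$((running + 1))
  if [ "$running" -ge "$max_jobs" ]; then
    wait                           # block until the current batch is done
    running=0
  fi
done
wait                               # block until every remaining job is done

# Only now is it safe to run the step that needs all outputs.
total=$(cat "$tmp"/in*.txt.out | awk '{ s += $1 } END { print s }')
echo "total lines: $total"
```

Batching like this leaves cores idle while the slowest job of each batch finishes. Tools such as xargs with its -P option or GNU parallel keep a fixed number of jobs running and also block until all of them are done, so they may be a better fit than a hand-written queue.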

↧

How To Create A Read Density Profile Within A Interval?

April 20, 2014, 7:15 am

Hi!

I need some help: I have to create a density profile with a window size of 1 kb (how many times a sequence is detected after NGS). I have to use SAMtools and BEDTools. I think I can use genomeCov in BEDTools, but I don't have a genome reference.

So, if anybody is able to help me...

Thanks
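
A density profile in fixed windows can be sketched without a genome file by binning read start positions. The sketch assumes the alignments are already available as BED intervals (for example SAM converted to BAM and then to BED), which is an added step; the helper name window_counts and the sample reads are made up.

```bash
# window_counts BED [SIZE] (hypothetical helper):
# count how many reads start in each fixed-size window (default 1000 bp).
# Output columns: chrom, window start, window end, read count.
window_counts() {
  awk -F'\t' -v OFS='\t' -v w="${2:-1000}" '
    { bin = int($2 / w); n[$1 OFS bin]++ }
    END {
      for (k in n) {
        split(k, a, OFS)
        print a[1], a[2] * w, (a[2] + 1) * w, n[k]
      }
    }' "$1" | sort -k1,1 -k2,2n
}

# Four made-up reads.
tmp=$(mktemp -d)
printf 'chr1\t10\t60\nchr1\t900\t950\nchr1\t1500\t1550\nchr2\t2999\t3049\n' > "$tmp/reads.bed"

window_counts "$tmp/reads.bed" 1000
```

Windows without any read are simply absent from the output. Filling them in with zeros would require the chromosome lengths, which is the information a genome file provides.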

↧

Converting Sam Files To Bam Files - Reproduce Results Nature Paper: Transcriptome Genetics Using Second Generation Sequencing In A Caucasian Population

April 20, 2014, 7:15 am

I want to reproduce the results of the following Nature paper: Transcriptome genetics using second generation sequencing in a Caucasian population (http://www.nature.com/nature/journal/vaop/ncurrent/full/nature08903.html). I downloaded their SAM files from the group's website (http://funpopgen.unige.ch/data/ceu60), and a reference FASTA and .fai file from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/

The main problem seems to be that I am not able to convert these SAM files into proper "working" BAM files from which I can get BED files, the input format for FluxCapacitor (http://flux.sammeth.net/). I tried the following steps (as there is no "proper" header in the SAM files, I have to do some additional steps):

samtools view -bt human_b36_male.fa.gz.fai first.sam > first.bam
samtools sort first.bam first.bam.sorted
samtools index first.bam.sorted
samtools index aln-sorted.bam

When I the ...

↧

Bedtools Genomecoveragebed Usage : How To Create A Genome File?

April 20, 2014, 7:15 am

I am using BEDTOOLS and the following command to get the coverage file:

$ ./genomeCoverageBed -ibam ~/GG_project/trim/ecoli.bam -g > ~/GG_project/trim/coverage

where ecoli.bam is my sorted BAM file and coverage is my output file.

Where do I get the genome file from? How do I create a genome file? Specifically, I would need an ecoli.genome file.
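
A genome file is a two-column, tab-separated list of sequence name and length, so it can usually be derived from files that are already at hand. Two hedged sketches follow: one from a FASTA index (the .fai written by samtools faidx) and one from the @SQ lines of a SAM/BAM header (as printed by samtools view -H). The sample index and header are made up for illustration, and the file names are assumptions.

```bash
tmp=$(mktemp -d)

# Made-up inputs: a one-sequence FASTA index and a matching SAM header.
printf 'NC_000913.3\t4641652\t72\t80\t81\n' > "$tmp/ecoli.fa.fai"
printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:NC_000913.3\tLN:4641652\n' > "$tmp/header.sam"

# Option 1: the first two columns of the .fai are name and length.
cut -f1,2 "$tmp/ecoli.fa.fai" > "$tmp/ecoli.genome"

# Option 2: pull SN (name) and LN (length) out of each @SQ header line.
awk -F'\t' -v OFS='\t' '
  $1 == "@SQ" {
    name = ""; len = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^SN:/) name = substr($i, 4)
      if ($i ~ /^LN:/) len = substr($i, 4)
    }
    print name, len
  }' "$tmp/header.sam" > "$tmp/ecoli.from_header.genome"

cat "$tmp/ecoli.genome"
```

The header route has the advantage that the names are guaranteed to match the ones used in the BAM file.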

↧

Picking Random Genomic Positions

April 20, 2014, 7:15 am

I have a set of TF binding coordinates and want to see whether there is any significant overlap with an open chromatin annotation.

Example of TF coord:
chr1 19280 19298
chr1 245920 245938
chr2 97290 97308
chr9 752910 752938
...

Example of open chrom. coord. (UCSC track):
chr2 33031543 33032779
chr3 2304169 2304825
chr5 330899 330940
...

I have checked the intersection with bedtools (open chrom. coord. vs TF coord. -/+ 100bp), and now I want to check the intersection between random genomic coordinates and open chrom.

The idea is to:

Pick random genomic position (from the same chromosome as TF coordinate);
-/+9bp (binding site size);
-/+ 100bp;
Run this simulation for 1000 times (TF x 1000);
Bedtools;

Any ideas on how I can do this simulation to pick random genomic positions from the same chromosome? I know a little bit of bash and Perl, but I won't be able to write the script by myself.
Is it possible to measure the length of every chromosome, then
pick the TF's chromosome and, from its length, get a random number that would represent a genomic position?

Can someone help me with the simulation and the pipeline?
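
The random-position step could be sketched in awk along these lines. It assumes a two-column genome file with chromosome lengths (the lengths used here are toy values, not a real assembly), tab-separated TF coordinates, and a fixed seed so that runs are repeatable; the helper name random_sites is made up. Each random interval keeps the width of the TF site it was drawn for and stays on the same chromosome.

```bash
# random_sites TF_BED GENOME N SEED (hypothetical helper):
# for every TF site print N random intervals of the same width on the same
# chromosome, fully contained within the chromosome length.
random_sites() {
  awk -F'\t' -v OFS='\t' -v n="$3" -v seed="$4" '
    BEGIN { srand(seed) }
    NR == FNR { len[$1] = $2; next }   # first file: chromosome lengths
    ($1 in len) {                      # second file: TF sites
      width = $3 - $2
      span = len[$1] - width           # largest valid start coordinate
      if (span < 0) next
      for (i = 1; i <= n; i++) {
        start = int(rand() * (span + 1))
        print $1, start, start + width, "sim" i
      }
    }' "$2" "$1"
}

tmp=$(mktemp -d)
# TF sites from the question; toy chromosome lengths (assumption).
printf 'chr1\t19280\t19298\nchr1\t245920\t245938\nchr2\t97290\t97308\nchr9\t752910\t752938\n' > "$tmp/tf.bed"
printf 'chr1\t1000000\nchr2\t1000000\nchr9\t1000000\n' > "$tmp/toy.genome"

random_sites "$tmp/tf.bed" "$tmp/toy.genome" 5 42 | head -n 3
```

The -/+ 100bp padding and the overlap with the open chromatin track would then be separate steps, for which bedtools slop and bedtools intersect look like the natural candidates. bedtools shuffle with its -chrom option may also cover the randomisation itself, though that is worth checking against the bedtools documentation.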

↧

Genomecoveragebed - Bedtool For Reporting Per Base Genome Coverage

April 20, 2014, 7:15 am

Hi everyone, I need some help with the genomeCoverageBed tool. When used for finding per-base genome coverage, this tool takes the option -d. I am actually interested in finding read counts for each base within a particular intron of a gene. Let me explain in more detail to make myself clear. I used IGV to see how my alignments look and what the coverage of each base is within a particular intron. When I move my cursor in IGV to the area of the coverage track exactly above the base I am interested in, it gives me these details:

Total Count:6
A:0
C:0
G:6
T:0
N:0

This total count is the read count for the base G within that intron: it says that 6 reads actually cover this base position. When I use this command to find per-base genome coverage:

genomeCoverageBed -i 2-B3-1b-D303A_sorted.bed -g pombe.genome -d

it gives me around 21 as the depth for that base (G in my example). Looking closely in IGV, I figured out that 21 = 6 + 15, where 6 is the number of reads that actually cover this base position and 15 is the number of reads that do not cover the base at that position. Because genomeCoverageBed calculates the depth of feature coverage, it also includes all the reads that skip that particular base. I would provide you with an image to make it more clear I would like to know how can i ...

↧

Finding Overlapping Variants (I.E. Indels, Snps) Using Annovar Format.

April 20, 2014, 7:15 am

Hello,

I know that with bedtools functions (specifically intersect and window) it is possible to find overlapping features in two sets of data. The catch is that bedtools only accepts files in VCF, GFF, BED or BAM format, and I have a tool that generates its output in ANNOVAR format. My initial thought is to convert the existing VCF files to ANNOVAR format, but I am not sure whether there are tools that do the same job as described above while working on ANNOVAR files.

Thank you, Young

↧

Split A Bam File Into Several Files Containing All The Alignments For X Number Of Reads.

April 20, 2014, 7:15 am

Hi everyone! I am struggling with annotating a very big .bam file that was mapped using TopHat. The run had a large number of reads: ~200M. The problem is that when I try to annotate each read using a GFF file (with BEDTools intersectBed), the BED file that is produced is huge: over 1.7 TB! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT department increased the $TMPDIR local disk space to 1.5 TB so that I could run everything on $TMPDIR, but that is still not enough.

What I think I should do is split this .bam file into several files, maybe 15, so that each set of reads gets annotated separately on a different node. That way, I would not run out of disk space. When all the files are annotated, I can run groupBy on each and then simply sum the number of reads that each feature in the GFF got across all the files.

However, there is a slight complication: after the annotation with intersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and divides each read by the number of times it mapped. That is, if a read mapped to 2 regions, each instance of the read is worth 1/2, so that it contributes only 1/2 a read to each of the features it mapped to. Because of this, I need all the alignments from the .bam file that belong to each read to be contained in one single file. That is to say, I ...

↧

How To Find The Nearest Gene To A Retrotransposon Insert?

April 20, 2014, 7:15 am

Hi,

I have a BED file with the positions of retrotransposons in the mouse genome, and I would like to find the nearest gene, the distance to that gene, and whether it is on the + or - strand. There are so many different file formats for the mouse genome and so many different databases to choose from that I was wondering what the best tool and the best database to use would be.

Cheers, Joseph

↧

How Can I Include One Bed File In Another Bed File ?

April 20, 2014, 7:15 am

Hello, I have 2 BED files that share some common features. Let's call the first file A.bed (the bigger file) and the second B.bed (the smaller file). I would like to have a new BED file that includes everything in B.bed together with everything in A.bed. I don't need the intersection; what I need is more like a merge. I checked the bedtools manual but couldn't find an answer for merging 2 BED files. Can someone help?

Thanks in advance
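
If "merge" here means the union of both files with overlapping features collapsed, one possible sketch is to concatenate, sort and then fold overlapping or touching intervals together. The awk step is a simplified stand-in that keeps only the first three columns; the helper name union_bed and the sample files are made up. With bedtools installed, piping the sorted concatenation into bedtools merge should do the folding step.

```bash
# union_bed A B (hypothetical helper):
# concatenate two BED files, sort by chromosome and start, and collapse
# overlapping or book-ended intervals. Only columns 1-3 are kept.
union_bed() {
  cat "$1" "$2" | sort -k1,1 -k2,2n | awk -F'\t' -v OFS='\t' '
    $1 == chr && $2 + 0 <= cur_end {          # extends the current interval
      if ($3 + 0 > cur_end) cur_end = $3 + 0
      next
    }
    {                                         # starts a new interval
      if (chr != "") print chr, cur_start, cur_end
      chr = $1; cur_start = $2 + 0; cur_end = $3 + 0
    }
    END { if (chr != "") print chr, cur_start, cur_end }'
}

tmp=$(mktemp -d)
printf 'chr1\t100\t200\nchr1\t500\t600\nchr2\t50\t80\n' > "$tmp/A.bed"
printf 'chr1\t150\t250\nchr1\t700\t800\n' > "$tmp/B.bed"

union_bed "$tmp/A.bed" "$tmp/B.bed"
```

If every original feature should be kept as its own line, the cat and sort steps alone are enough and the awk step can be dropped.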

↧

Comparative Snp Analysis

April 20, 2014, 7:15 am

Hello, I am trying to compare the degree of A-to-G editing in a near-isogenic pair of cell lines. I have two biological replicates and have mapped with Bowtie and BWA, followed by a samtools mpileup | VarScan analysis. After this, I used bedtools intersect to extract variants that are not annotated in dbSNP but are in Alu repeats. Here is where I have some doubts, mainly two questions:

QUESTION 1:

In the VCF file (VarScan output),

#CHROM  POS     ID      REF     ALT     QUAL    FILTER    INFO    FORMAT  Sample1    Sample2
   chrM    73      .           G       A       PASS     DP=238  GT:GQ:DP           1/1:71:121  1/1:69:117

What exactly is the meaning of

FORMAT   Sample1    Sample2
GT:GQ:DP 1/1:71:121  1/1:69:117
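
In the VCF specification, the FORMAT column names the per-sample fields and each sample column holds the values in the same order, separated by colons: GT is the genotype, GQ the genotype quality and DP the read depth for that sample. A small sketch that pairs keys with values, using the strings from the question (the helper name explain_format is made up):

```bash
# explain_format FORMAT SAMPLE_VALUE (hypothetical helper):
# pair each colon-separated FORMAT key with the matching sample value.
explain_format() {
  awk -v keys="$1" -v vals="$2" 'BEGIN {
    n = split(keys, k, ":")
    split(vals, v, ":")
    for (i = 1; i <= n; i++) printf "%s=%s\n", k[i], v[i]
  }'
}

explain_format "GT:GQ:DP" "1/1:71:121"    # Sample1
explain_format "GT:GQ:DP" "1/1:69:117"    # Sample2
```

Read this way, Sample1 has genotype 1/1 (homozygous for the first ALT allele), genotype quality 71 and depth 121 at this site.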

QUESTION 2:

I have a higher number of editing sites "called" in sample 1 than in sample 2 in the 1st biological replicate (about a 16% difference). However, this difference is reversed in the 2nd biological replicate. What is the proper way of comparing the degree of RNA editing in two different samples? Is there a quantitative procedure? I have naively compared them with bedtools intersect, with and without the option -v. Is this the correct way to go about it?

Many thanks. G.

↧

Bedgraph Not Displayed In Igv

April 20, 2014, 7:15 am

Hi, I am new to this and have run into a problem. I was trying to make a bedGraph file using the bedtools genomecov command. The command was:

bedtools genomecov -ibam filename.sorted.bam -g chromosome sizes.txt > O.bedgraph

I got a bedGraph file that is much smaller than expected: 500 kb instead of ~6 Mb. When I load that 500 kb file into IGV, I see nothing. Please help me out.

↧

How To Count Genes In Genomic Regions Using A Gtf/Gff3 And A Bed File Of Regions

April 20, 2014, 7:15 am

I'd like to count the number of unique genes in a gff file falling within a list of genomic regions. With bedtools I can count the number of regions within the gff which is almost what I want, but not quite.

bedtools intersect -a regions.bed -b my.gff -c

UPDATE:

I should have made my question a bit more specific. I have a modified Ensembl-style GTF file (not a GFF) that has unique transcript IDs. This means that simply selecting unique values in the 9th column of the GTF file actually counts transcript IDs.

To circumvent this problem I first truncated the gtf file:

cat my.gff | sed -e 's/;.*//' > delete.me.gtf

Then I ran the bedtools map command:

bedtools map -a regions.bed -b delete.me.gtf -c 9 -o count_distinct > counts.genes_in_windows.bed

I almost forgot to delete the intermediate file:

rm delete.me.gtf

There is probably a way to make this a oneliner, without the intermediate file, but I have a dissertation to write!

↧

Is It Possible To Filter Only Bookend Reads From A Bed File?

April 20, 2014, 7:15 am

I have a BED file with many fragments: some overlapping, some on their own, and some adjacent to each other (book-ended features).

I know I can group overlapping and book-ended features using bedtools like this:

bedtools cluster -i fragments.bed

However I was wondering if anyone knew of a way of obtaining from the input file only the fragments that contain book-ended adjacent fragments.

Any ideas?

Best regards
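
One way to keep only book-ended fragments could be a two-pass awk filter: the first pass records every start and end coordinate, the second pass prints a fragment when its end equals another fragment's start, or its start equals another fragment's end, on the same chromosome. This is a sketch that assumes tab-separated BED input, ignores strand, and would also report zero-length features; the helper name bookended and the sample file are made up.

```bash
# bookended BED (hypothetical helper):
# print the fragments that touch another fragment end-to-start on the same
# chromosome. The file is read twice: once to index, once to filter.
bookended() {
  awk -F'\t' '
    NR == FNR { starts[$1, $2]; ends[$1, $3]; next }   # pass 1: index coords
    (($1, $3) in starts) || (($1, $2) in ends)         # pass 2: print matches
  ' "$1" "$1"
}

tmp=$(mktemp -d)
# 0-100 and 100-200 are book-ended; 150-300 only overlaps; 400-500 is alone;
# the chr2 fragment is on a different chromosome.
printf 'chr1\t0\t100\nchr1\t100\t200\nchr1\t150\t300\nchr1\t400\t500\nchr2\t100\t200\n' > "$tmp/fragments.bed"

bookended "$tmp/fragments.bed"
```

Because the input is passed as a file name twice, this form does not work on a pipe; the input needs to be a regular file.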

↧

How To Combine Fpkm Values From Cufflinks With Contigs From De Novo Assembly Program Velvet/Oases?

April 20, 2014, 7:15 am

Hi all,

I am working on RNA-seq data analysis. I have finished running TopHat and Cufflinks to get FPKM values for each read from Illumina paired-end sequencing. In parallel, I have also run Velvet to get contig sequences through de novo assembly, and GMAP to see whether the assembled sequences map to the reference genome (this reference genome is not complete for now, but it is somewhat useful). Now I am trying to combine all of this information so that I have the sequence of each contig together with the FPKM value corresponding to that contig. Some suggested that I convert the Cufflinks and GMAP outputs to BED files and then use intersectBed to see if there is any overlap. However, I am not sure how to keep all of the information in the output from BEDTools. By default, intersectBed seems to report the overlapping region using the 'A' file as a template, so I could not see any information from the 'B' file. Is there any solution for this? Please let me know. I would appreciate your suggestions!

↧

Discrepancy In Samtools Mpileup/Depth And Bedtools Genomecoveragebed Counts

April 21, 2014, 7:42 am

I am getting different counts for the number of bases on reference covered by aligned reads using samtools depth/mpileup and BEDTools genomeCoverageBed commands. I am using samtools-0.1.19 and bedtools-2.17.0

samtools mpileup -ABQ0 -d10000000 -f ref.fas qry.bam > qry.mpileup
samtools depth -q0 -Q0 qry.bam > qry.depth

genomeCoverageBed -ibam qry.bam -g ref.genome -dz > qry.dz
wc -l qry.[dm]*
  1026779 qry.depth
  1027173 qry.dz
  1026779 qry.mpileup

Any ideas? Thanks
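
To see which positions account for the difference, the two position lists can be compared directly. The sketch assumes that the -dz output uses zero-based positions while samtools depth uses one-based positions, so 1 is added to the bedtools coordinate before the lookup; that offset is worth checking against the documentation of the versions in use. The helper name only_in_dz and the tiny sample files are made up.

```bash
# only_in_dz DEPTH DZ (hypothetical helper):
# print the positions (converted to 1-based) that are present in the
# genomeCoverageBed -dz output but absent from the samtools depth output.
only_in_dz() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { seen[$1, $2]; next }            # samtools depth positions
    !(($1, $2 + 1) in seen) { print $1, $2 + 1, $3 }
  ' "$1" "$2"
}

tmp=$(mktemp -d)
# Made-up example: 1-based position 3 is covered only in the -dz output.
printf 'chr1\t1\t3\nchr1\t2\t3\nchr1\t4\t1\n' > "$tmp/qry.depth"
printf 'chr1\t0\t3\nchr1\t1\t3\nchr1\t2\t1\nchr1\t3\t1\n' > "$tmp/qry.dz"

only_in_dz "$tmp/qry.depth" "$tmp/qry.dz"
```

Looking at the reads that cover the reported positions in a viewer may then show what the two tools treat differently.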

↧