how to get -nms for bedtools

August 10, 2014, 12:45 pm

I'd like to merge bed files and preserve the names of the merged features using the bedtools -nms option.

However, this option (-nms) is deprecated in newer versions of bedtools.

The documentation says I can use the -o option to get the -nms behavior.

How do I translate the following command into the new bedtools merge syntax?

bedtools merge -i file.bed -nms
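
As far as I know, bedtools 2.20 and later express the old -nms behavior through the -c and -o column operations of merge. The following is a minimal sketch, not a definitive answer. It assumes the feature names sit in column 4 and the input is coordinate-sorted. The demo.bed contents are made up for illustration, and -delim ";" is included because -nms joined names with semicolons.

```shell
# Hypothetical three-feature BED file (tab-delimited, already sorted).
workdir=$(mktemp -d)
printf 'chr1\t100\t200\tgeneA\nchr1\t150\t300\tgeneB\nchr1\t500\t600\tgeneC\n' > "$workdir/demo.bed"

# -c 4 selects the name column, -o collapse lists every merged name,
# -delim ";" mimics the semicolon separator that -nms used.
if command -v bedtools >/dev/null 2>&1; then
    merged=$(bedtools merge -i "$workdir/demo.bed" -c 4 -o collapse -delim ";")
    printf '%s\n' "$merged"
else
    echo "bedtools is not installed; skipping the demo" >&2
fi
```

Swapping collapse for distinct should drop repeated names from the merged list.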

↧

Per Base Coverage

March 9, 2012, 6:19 pm

Is there a way to obtain per-base coverage for a defined chromosome interval using a BAM file generated from Illumina single-end reads? genomeCoverageBed in Bedtools does not seem to have an option for it.

↧

Converting Sam Files To Bam Files - Reproduce Results Nature Paper: Transcriptome Genetics Using Second Generation Sequencing In A Caucasian Population

February 9, 2012, 9:02 am

I want to reproduce the results of the following Nature paper: Transcriptome genetics using second generation sequencing in a Caucasian population http://www.nature.com/nature/journal/vaop/ncurrent/full/nature08903.html

I downloaded their SAM files from the group's website: http://funpopgen.unige.ch/data/ceu60

I downloaded a reference fasta and fai file from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/

The main problem seems to be that I'm not able to convert these SAM files into proper "working" BAM files so that I can get BED files, which are the input format for FluxCapacitor (http://flux.sammeth.net/). I tried the following steps (as there is no "proper" header in the SAM files, I have to do some additional steps):

samtools view -bt human_b36_male.fa.gz.fai first.sam> first.bam
samtools sort first.bam first.bam.sorted
samtools index first.bam.sorted
samtools index aln-sorted.bam

When I the ...

↧

Calculate reciprocal overlap for thousands of samples

July 15, 2014, 10:20 am

I have around 20k samples with BED files. How can I calculate reciprocal overlap for each segment? I want to find all segments with 50% reciprocal overlap or better.
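
One possible starting point is bedtools intersect with -f 0.5 -r, which keeps only pairs whose overlap covers at least half of both segments. The sketch below compares just two samples. The file names and coordinates are invented for illustration, and scaling this to 20k samples (for example by looping over pairs) is left open.

```shell
# Two hypothetical per-sample BED files.
workdir=$(mktemp -d)
printf 'chr1\t100\t200\ts1_seg1\nchr1\t300\t400\ts1_seg2\n' > "$workdir/sample1.bed"
printf 'chr1\t120\t210\ts2_seg1\nchr1\t390\t1000\ts2_seg2\n' > "$workdir/sample2.bed"

# -f 0.5 requires the overlap to cover 50% of the A segment,
# -r applies the same fraction to the B segment (reciprocal),
# -wa -wb prints both original segments for every qualifying pair.
if command -v bedtools >/dev/null 2>&1; then
    pairs=$(bedtools intersect -a "$workdir/sample1.bed" -b "$workdir/sample2.bed" -f 0.5 -r -wa -wb)
    printf '%s\n' "$pairs"
else
    echo "bedtools is not installed; skipping the demo" >&2
fi
```

Here only s1_seg1 and s2_seg1 qualify: they share 80 bases, which is 80% of the first segment and about 89% of the second, while the other pair shares only 10 bases.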

↧

Split A Bam File Into Several Files Containing All The Alignments For X Number Of Reads.

May 2, 2013, 10:37 am

Hi everyone! I am struggling with annotating a very big .bam file that was mapped using TopHat. The run had a large number of reads: ~200M. The problem is that when I try to annotate each read using a GFF file (with BEDTools intersectBed), the BED file that is produced is huge: over 1.7TB! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT dept increased the $TMPDIR local disk space to 1.5TB so I could run everything on $TMPDIR, but it is still not enough.

What I think I should do is split this .bam file into several files, maybe 15, so that each set of reads gets annotated separately on a different node. That way, I would not run out of disk space. When all the files are annotated, I can run groupBy on each and then simply sum the number of reads that each feature in the GFF got across all the files.

However, there is a slight complication: after the annotation using intersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and divides each read by the number of times it mapped. That is, if a read mapped to 2 regions, each instance of the read is worth 1/2, such that it would only contribute 1/2 a read to each of the features it mapped to. Because of this, I need to have all the alignments from the .bam file that belong to each read contained in one single file. That is to say, I ...

↧

Problems Extracting Non-Snps From A Vcf File

January 16, 2013, 7:14 am

Hello,

In an SNP analysis, I am trying to extract those editing sites not found in the dbSNP VCF files. I have downloaded a couple of files (All SNPs and Common/Medical SNPs) from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF.

Following this, I have compared my VarScan *.vcf outputs with the SNP.vcf ones using 3 different approaches:

VarScan compare input.vcf SNP.vcf unique1 input-SNPvcf

bedtools intersect -v -a input.vcf -b SNP.vcf > input-SNP.vcf

bedops --not-element-of -1  input-sorted.bed SNP-sorted.bed > inputs-sorted-SNP.bed

In all 3 cases, the SNP-output is identical to the input.vcf/bed.

These command-lines however work when I use an alu.bed or a repeat-masker-bed.

Is it just that my analysis contains no known SNPs? I have discarded that explanation for obvious reasons.

Can somebody point out the reason for this problem, or a solution to it?

Thanks, G.

↧

Extract Only Paired-End Reads That Map A Specific Interval

August 31, 2012, 1:23 am

Hi,

Is it possible to extract paired-end reads that map to a specific interval (from a bam file)? I tried with intersectBed:

intersectBed -abam align.bam -b interval.gff3 -wa > result.bam

Here's the result:

[image]

But I only want reads that map to the feature in bold blue (one of the paired reads is enough). For example, I don't want the reads that map to either side of this feature (red arrow).

Is it possible with intersectBed or another program?

Thanks,

↧

How To Explain Uneven Coverage Of A Dna Segment Obtained Via Pcr Amplification.

April 8, 2014, 9:43 am

Experiment: deep sequencing for mutants in a 700 nt fragment.

The DNA fragment was pre-amplified with primers flanking it, followed by HiSeq sequencing.

Per-base coverage was calculated with coverageBed -d -abam in.bam -b ref.bed > out.cov

Observation: two distinct peaks in coverage at the ends, as in the plot below (coverage vs. position).

[image]

The peaks are made up of reads that contain part of the primers, and these reads therefore also show soft clipping at their ends.

The calculations differ hugely depending on whether I include or exclude such reads.

Question: does anyone know how to handle such a situation?

↧

How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns

January 27, 2013, 2:27 am

I extracted sequences with fastaFromBed and have no complaints about BEDTools, which is a really awesome thing.

However, the extracted sequences look like this:

>chr19:13985513-13985622
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG

And my input file looks like this:

>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt

I was using this:

fastaFromBed -fi input -bed seq.bed -fo output

So shouldn't those sequences be formatted as FASTA (as NCBI says, "It is recommended that all lines of text be shorter than 80 characters in length"), or at least have the same line length as my input file?

What am I doing wrong that I am getting linearized (FASTA?) output with fastaFromBed?
What is the quickest way to turn those linear sequences into nicely formatted columns using the command line?
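
To my understanding, single-line sequences are still valid FASTA, so rewrapping can be handled as a post-processing step. Below is a minimal awk sketch of that step. wrap_fasta is a hypothetical helper name, the default width of 60 is an arbitrary choice, and the tiny linear.fa file exists only for the demonstration.

```shell
# wrap_fasta FILE [WIDTH]: rewrap sequence lines to WIDTH columns (default 60).
wrap_fasta() {
    awk -v width="${2:-60}" '
        /^>/ { print; next }   # header lines pass through untouched
        {
            # emit the sequence in fixed-size chunks
            for (i = 1; i <= length($0); i += width)
                print substr($0, i, width)
        }
    ' "$1"
}

# Demo on a made-up record, wrapped to 4 columns so the effect is visible.
workdir=$(mktemp -d)
printf '>chr19:1-10\nACGTACGTAC\n' > "$workdir/linear.fa"
wrap_fasta "$workdir/linear.fa" 4
```

For the output file in the question, the call would look like wrap_fasta output 50 > output.wrapped.fa, with 50 matching the line length of the input shown above.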

↧

Tool For Binning Windowbed Output For K-Means Clustering

September 24, 2013, 6:38 am

I have mapped high resolution ChIP-seq data to transcription start sites using windowBed. I now want to bin the data, in bin sizes of my choosing, relative to TSSs so that I can generate heat maps and do k-means clustering on the data.

What tools exist for doing this?

Thanks!

↧

Samtools or Bedtools: How to filter a bam file with a bed file using strand information

June 5, 2014, 5:29 am

I would like to filter a bam file, keeping only reads overlapping with genomic intervals from a bed file. I used samtools for this:

samtools view -b -h -L bedfile.bed bamfile.bam

However, the -L option does not seem to take the strand information into account.

Do you know if there is another option or way to do it that would keep strand information?
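
One option that may fit is bedtools intersect with -s, which only reports overlaps on the same strand. The sketch below shows the flag on two small BED files so the effect is easy to see. The file names and coordinates are made up, and the BED file is assumed to carry strand in column 6.

```shell
# Hypothetical reads (one per strand) and one forward-strand region.
workdir=$(mktemp -d)
printf 'chr1\t100\t150\tread_fwd\t0\t+\nchr1\t120\t170\tread_rev\t0\t-\n' > "$workdir/reads.bed"
printf 'chr1\t50\t500\tregion1\t0\t+\n' > "$workdir/regions.bed"

# -s keeps only same-strand overlaps; -u reports each A entry at most once.
if command -v bedtools >/dev/null 2>&1; then
    same_strand=$(bedtools intersect -a "$workdir/reads.bed" -b "$workdir/regions.bed" -s -u)
    printf '%s\n' "$same_strand"
else
    echo "bedtools is not installed; skipping the demo" >&2
fi

# For BAM input the call has the same shape:
#   bedtools intersect -abam bamfile.bam -b bedfile.bed -s > filtered.bam
```

Only read_fwd survives here, because read_rev lies on the opposite strand from region1.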

↧

Variant annotation using several .BED files

October 29, 2014, 2:39 pm

So, I have a data file containing several hundred variants in the following format:

CHR #  START POS  END POS  VARIANT ID
1      100        1000     rs1
1      1200       1400     rs2

I ran the latter through Annovar to get the gene each variant was in (or its nearest gene and its distance) as well as the region (intronic, exonic, etc) each variant was in. The output had the following columns:

GENE/NEARESTGENE  REGION      CHR#  STARTPOS  ENDPOS  REFALLELE  ALTALLELE  VARIANTID
SOMEGENE          exonic      1     100       1000    A          G          rs1
SOMEGENE2         intergenic  1     1200      1400    G          T          rs2

I moved some columns around to make a file with the following columns, let's call this file.txt:

CHR#  STARTPOS  ENDPOS  GENE/NEARESTGENE  REGION      VARIANTID
1     100       1000    SOMEGENE          exonic      rs1
1     1200      1400    SOMEGENE2         intergenic  rs2

Now, I have several database files - something around 10 - of promoters, TSS, enhancers, etc, all of which are in .bed format and look like the following -> let's call these database1.txt ... database10.txt

database1.bed

CHR #  STARTPOS  ENDPOS  LABEL--FOR-THE-REGION/NAME
1      500       600 ...

↧

How To Count Genes In Genomic Regions Using A Gtf/Gff3 And A Bed File Of Regions

February 27, 2013, 11:13 am

I'd like to count the number of unique genes in a gff file falling within a list of genomic regions. With bedtools I can count the number of regions within the gff which is almost what I want, but not quite.

bedtools intersect -a regions.bed -b my.gff -c

UPDATE:

I should have made my question a bit more specific. I have a modified Ensembl-style gtf file (not a gff) that has unique transcript IDs. This means that simply selecting unique fields in the 9th column of the gtf file actually counts transcript IDs rather than genes.

To circumvent this problem I first truncated the gtf file:

cat my.gff | sed -e 's/;.*//' > delete.me.gtf

Then I ran the bedtools map command:

bedtools map -a regions.bed -b delete.me.gtf -c 9 -o count_distinct > counts.genes_in_windows.bed

I almost forgot to delete the intermediate file:

rm delete.me.gtf

There is probably a way to make this a oneliner, without the intermediate file, but I have a dissertation to write!
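
A possible one-liner, offered as a sketch rather than a tested recipe: stream the truncated GTF straight into bedtools map instead of writing delete.me.gtf. It relies on bedtools accepting "-" as a stand-in for standard input, which I believe it does, and on both inputs being sorted by chromosome and start. The function names are hypothetical.

```shell
# Keep only the first attribute of column 9 (same sed expression as above).
strip_attributes() {
    sed -e 's/;.*//' "$1"
}

# Pipe the truncated GTF into bedtools map; "-b -" reads B from stdin,
# so no intermediate file has to be created or deleted.
count_genes_in_regions() {
    strip_attributes "$2" | bedtools map -a "$1" -b - -c 9 -o count_distinct
}

# Usage:
#   count_genes_in_regions regions.bed my.gff > counts.genes_in_windows.bed
```

The sed step is kept as a separate function so it can be checked on its own before bedtools is involved.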

↧

Intersectbed: Return Reads In Fraction In Input Files

September 27, 2012, 9:55 am

I have a question with respect to intersectBED and multiple input files:

Is it possible to return reads which are present in, say, 8 of 10 input files, without splitting the reads into smaller intervals?

Thank you

↧

How To Combine Fpkm Values From Cufflinks With Contigs From De Novo Assembly Program Velvet/Oases?

November 23, 2011, 2:32 pm

Hi all,

I am working on RNA-seq data analysis. I've finished running Tophat and Cufflinks to get FPKM values from Illumina paired-end sequencing. In parallel, I've run Velvet to get contig sequences through de novo assembly, and Gmap to see if the assembled sequences map to the reference genome (this reference genome is not complete for now, but somewhat useful). Now I am trying to combine all the information so that I have the sequence of each contig together with the corresponding FPKM value. Some suggested I convert the Cufflinks and Gmap outputs to BED files and then use intersectBed to see if there's any overlap. However, I am not sure how to keep all the information in the Bedtools output. By default, intersectBed seems to report the overlapping region using the 'A' file as a template, so I couldn't see any information from the 'B' file. Is there any solution for this? Please let me know. I would appreciate your suggestions!

↧

Extract coverage per feature from a bam and bed to a file

August 24, 2014, 11:07 pm

Hi, a simple task, or it should be. I need to extract the average coverage per feature in a bam file. I have a genbank and bed file for the reference the bam was mapped to. If I map with e.g. Geneious I can see good, variable coverage over the reference genome. I have tried GATK (could not get it to run) and Bedtools (genomecov and coverage). coverage will give me an output file, but all the features have zero coverage.

Here's the top of the .bed file:

track name="Example E.coli"
o26chr.gb 189 255 thrL gene 0 +
o26chr.gb 189 255 thrL CDS 0 +
o26chr.gb 336 2799 thrA gene 0 +
o26chr.gb 336 2799 thrA CDS 0 +
o26chr.gb 2800 3733 thrB gene 0 +
o26chr.gb 2800 3733 thrB CDS 0 +
o26chr.gb 3733 5020 thrC gene 0 +
o26chr.gb 3733 5020 thrC CDS 0 +
o26chr.gb 5233 5530 yaaX gene 0 +

Here's the top of the output from bedtools coverage -ibam file.bam -b file.bed:

o26chr.gb 1047122 1048841 poxB gene 0 - 0 0 1719 0.0000000
o26chr.gb 1047122 1048841 poxB CDS 0 - 0 0 1719 0.0000000
o26chr.gb 2096828 2097287 gene 0 + 0 0 459 0.0000000
o26chr.gb 3144900 3148635 yfaL gene 0 - 0 0 3735 0.0000000
o26chr.gb 3144900 3148635 yfaL CDS 0 - 0 0 3735 0.0000000
o26chr.gb 4194149 4194368 tdcR gene 0 + 0 0 219 0.00 ...

↧

To Calculate The Exact Total Number Of Mapped Reads In Exome Regions

December 3, 2013, 7:08 am

Dear All, I have some questions here. I want to do some quality control analysis on my exome data that are mapped to the reference genome. I have the input bam file for a sample, which contains the reads that mapped to the reference genome (hg19.fa). There are 80 million mapped reads for this sample. Now I want to calculate how many of these 80 million mapped reads fall in the exome region. For this I need to supply the exome baits bed file (probe/covered.bed) provided by the company. We used the Agilent SureSelectV4 here.

So is there any one-line command with which, using these three pieces of information (input.bam, hg19.fa and exome_baits.bed), I can calculate the total number of reads mapped to the exonic regions? In different posts I see a lot of tools being mentioned. I tried to use CalculateHsMetrics from Picard, but it needs the bed file with a header, so it is of no use for now. Then I used the GATK DepthOfCoverage walker, but there we usually get the mean number of times a base is read (for me it's 73.9) and the %_of_bases_reads above 15 times, which is about 70% and is also good quality. We also get how many loci have been read more than once, which gives a histogram of cumulative read coverage at each locus. But if I want to just calculate the number of mapped reads that fall in the exome region using the input bam file, ...

↧

How Do You Get The Quality Score And Coverage For Every Single Position Of A Reference Assembly

January 31, 2012, 2:12 pm

Hi,

I am trying to extract the coverage and the average quality score for each position of a reference assembly in bam/sam format. I have managed to get the coverage using BEDtools

 genomeCoverageBed -ibam mybamfile.bam -g my_genome -d > my_coverage.txt

but am at a loss on how to get some measure of the quality of the base calls at each position. I was thinking that I could use the bcftools to get a variant call formatted file

samtools mpileup -uf ref.fa mybamfile.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

but this only provides the sites for which there are SNPs. Any advice greatly appreciated.

Joseph

↧

Random shuffling of features leaving gene models intact

May 26, 2014, 7:02 am

I am looking for a tool that can randomly shuffle gff features into intergenic regions, but leaving the gene-models 'intact', so that at least all features of a gene are placed on the same contig and related features are placed inside the interval of their parent region. Bedtools shuffle doesn't seem to do that, I am trying:

shuffleBed -i genes.gff3 -excl genes.gff3 -g chromsizes.txt -f 0

This command distributes sub-features to different contigs and leads to invalid gene models. If I add -chrom, features are placed on the same contig, but some features cannot be placed at all and the resulting gene models are still not valid. Does anyone have some R code for this use case?

↧

Can Bedtools/Bedops Be Used To Extract Regions Where Scores Are Higher Than A Given Value?

June 21, 2013, 3:38 am

I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value? For example, let's say that I have a BED file like the following:

chr7    127471196  127472363  Pos1  12   +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  200  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  120  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  54   +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  2    -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  15   -  127477031  127478198  0,0,255
chr7    127478198  127479365  Neg3  25   -  127478198  127479365  0,0,255
chr7    127479365  127480532  Pos5  2    +  127479365  127480532  255,0,0
chr7    127480532  127481699  Neg4  9    -  127480532  127481699  0,0,255

According to the BED format's specs, the fifth column contains a score between 0 and 1000 (alternatively, in the bedGraph format the score is in the 4th column). If I want to get all the regions that have a score higher than 20, for example, I can do an awk search:

awk '$5 > 20 {print}' mybedfile.bed

However, in order to use awk, I have to keep the BED file in an uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with th ...
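
One way around the uncompressed-file requirement is to decompress into a pipe, so awk only ever sees a stream. The sketch below uses gzip because it is available almost everywhere. For a BEDOPS archive, the equivalent would presumably be unstarch file.starch piped into the same awk filter. The three rows are taken from the example above, trimmed to six columns.

```shell
# Build a small gzip-compressed BED file from three of the example rows.
workdir=$(mktemp -d)
{
    printf 'chr7\t127471196\t127472363\tPos1\t12\t+\n'
    printf 'chr7\t127472363\t127473530\tPos2\t200\t+\n'
    printf 'chr7\t127475864\t127477031\tNeg1\t2\t-\n'
} | gzip -c > "$workdir/mybedfile.bed.gz"

# Decompress to stdout and filter on the score column (column 5);
# no uncompressed copy is written to disk.
high=$(gzip -dc "$workdir/mybedfile.bed.gz" | awk '$5 > 20')
printf '%s\n' "$high"
```

Only the Pos2 row (score 200) passes the filter in this toy input.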

↧