Channel: Post Feed
Viewing all 3764 articles

how to get -nms for bedtools


I'd like to merge BED files and preserve the names of the merged features using the bedtools -nms option.

However, this option (-nms) is deprecated in newer versions of bedtools.

The documentation says I can use the -o option to get the -nms behavior.

How do I translate the following into the new bedtools merge syntax?

bedtools merge -i file.bed -nms
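For reference, newer bedtools releases pair -c (column) with -o (operation), so the usual replacement for -nms is believed to be `-c 4 -o collapse`. The sketch below shows that command as a comment and adds a small awk stand-in that mimics the output on a pre-sorted BED file. The function name `merge_collapse_names` is hypothetical, and the stand-in joins names with commas, which may differ from the delimiter the old -nms option produced.

```shell
# Believed equivalent in bedtools >= 2.20 (check `bedtools merge -h` on your version):
#   bedtools merge -i file.bed -c 4 -o collapse
# Use "-o distinct" instead of "collapse" to drop repeated names.

# Hypothetical awk stand-in: merges overlapping or book-ended features of a
# BED file that is already sorted by chromosome and start, and collects the
# names from column 4 into a comma-separated list.
merge_collapse_names() {
    awk 'BEGIN { OFS = "\t" }
    # Start a new merged block when the chromosome changes or there is a gap.
    NR == 1 || $1 != chrom || ($2 + 0) > (end + 0) {
        if (NR > 1) print chrom, start, end, names
        chrom = $1; start = $2; end = $3; names = $4
        next
    }
    {
        # Extend the current block and remember the feature name.
        if (($3 + 0) > (end + 0)) end = $3
        names = names "," $4
    }
    END { if (NR > 0) print chrom, start, end, names }' "$@"
}
```

Usage would be along the lines of `sort -k1,1 -k2,2n file.bed | merge_collapse_names`.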

 

 


Per Base Coverage


Is there a way to obtain per-base coverage for a defined chromosome interval using a BAM file generated from Illumina single-end reads? genomeCoverageBed in Bedtools does not seem to have an option for it.
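A rough sketch of what per-base coverage over one interval means, under some assumptions: the reads are available as BED intervals (for example from `bedtools bamtobed -i in.bam`), and the region is given as 0-based start and exclusive end. The function name `per_base_depth` is hypothetical. The commented commands are the ones usually suggested for BAM input and should be checked against the installed versions.

```shell
# Commonly suggested for BAM input (verify flags on your versions):
#   samtools depth -a -r chr1:100-200 in.bam
#   bedtools genomecov -ibam in.bam -d

# Hypothetical awk stand-in working on BED intervals read from stdin.
# Arguments: chromosome, 0-based start, exclusive end of the region.
# Output: chromosome, 1-based position, depth (one line per base).
per_base_depth() {
    awk -v chrom="$1" -v lo="$2" -v hi="$3" 'BEGIN { OFS = "\t" }
    $1 == chrom {
        # Clip each interval to the requested region before counting.
        s = (($2 + 0) > (lo + 0)) ? ($2 + 0) : (lo + 0)
        e = (($3 + 0) < (hi + 0)) ? ($3 + 0) : (hi + 0)
        for (p = s; p < e; p++) depth[p]++
    }
    END { for (p = lo + 0; p < hi + 0; p++) print chrom, p + 1, depth[p] + 0 }'
}
```

Positions with no reads are reported with depth 0, so the output always has one line per base of the region.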

Converting Sam Files To Bam Files - Reproduce Results Nature Paper: Transcriptome Genetics Using Second Generation Sequencing In A Caucasian Population

I want to reproduce the results from the following Nature paper: Transcriptome genetics using second generation sequencing in a Caucasian population http://www.nature.com/nature/journal/vaop/ncurrent/full/nature08903.html

I downloaded their SAM files from the group's website: http://funpopgen.unige.ch/data/ceu60

I downloaded a reference fasta and fai file from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/

The main problem seems to be that I am not able to convert these SAM files into proper "working" BAM files, so that I can get BED files, which are the input format for FluxCapacitor (http://flux.sammeth.net/). I tried the following steps (as there is no "proper" header in the SAM files, I have to do some additional steps):
  1. samtools view -bt human_b36_male.fa.gz.fai first.sam > first.bam
  2. samtools sort first.bam first.bam.sorted
  3. samtools index first.bam.sorted
  4. samtools index aln-sorted.bam
When I the ...

Calculate reciprocal overlap for thousands of samples


I have around 20k samples with BED files. How can I calculate reciprocal overlap for each segment? I want to find all segments with 50% reciprocal overlap or better. 
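For a single pair of files, bedtools intersect has -f (minimum overlap fraction) and -r (require it reciprocally), shown as a comment below. The awk function is a hypothetical stand-in that applies the same criterion to lines that already hold one candidate pair each. The six-column input layout and the name `reciprocal_overlap` are assumptions made for the illustration.

```shell
# Pairwise form with bedtools (file names are placeholders):
#   bedtools intersect -a sampleA.bed -b sampleB.bed -f 0.5 -r -wa -wb

# Hypothetical awk check of the same criterion. Input on stdin, one pair per line:
#   chromA startA endA chromB startB endB
# Argument: minimum reciprocal overlap fraction (default 0.5).
# Output: the passing pairs with the overlap length appended.
reciprocal_overlap() {
    awk -v min="${1:-0.5}" 'BEGIN { OFS = "\t" }
    $1 == $4 {
        s = (($2 + 0) > ($5 + 0)) ? ($2 + 0) : ($5 + 0)
        e = (($3 + 0) < ($6 + 0)) ? ($3 + 0) : ($6 + 0)
        ov = e - s
        # The overlap must cover at least the given fraction of BOTH segments.
        if (ov > 0 && ov >= min * ($3 - $2) && ov >= min * ($6 - $5)) print $0, ov
    }'
}
```

With around 20k samples, running every pair of files separately is likely to be slow. One option that might scale better is to pool all segments into a single sorted file with the sample ID in the name column and intersect that file against itself.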

Split A Bam File Into Several Files Containing All The Alignments For X Number Of Reads.

Hi everyone! I am struggling with annotating a very big .bam file that was mapped using TopHat. The run had a large number of reads: ~200M. The problem is that when I now try to annotate each read using a GFF file (with BEDTools Intersect Bed), the BED file that is made is huge: it is over 1.7TB! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT dept increased $TMPDIR local disk space to 1.5TB so I could run everything on $TMPDIR, but it is still not enough.

What I think I should do is split this .BAM file into several files, maybe 15, so that each set of reads gets annotated separately on a different node. That way, I would not run out of disk space. And when all the files are annotated, I can run groupBy on each, and then simply sum the number of reads that each feature in the GFF got across all the files.

However, there is a slight complication to this: after the annotation using IntersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and divides each read by the number of times it mapped. I.e., if a read mapped to 2 regions, each instance of the read is worth 1/2, such that it would only contribute 1/2 a read to each of the features it mapped to. Because of this, I need to have all the alignments from the .BAM file that belong to each read contained in one single file. That is to say, I ...

Problems Extracting Non-Snps From A Vcf File


Hello,

In an SNP analysis, I am trying to extract those editing sites not found in the dbSNP VCF files. I have downloaded a couple of files (All SNPs and Common/Medical SNPs) from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF.

Following this, I have compared my VarScan *.vcf outputs with the SNP.vcf ones using 3 different approaches:

VarScan compare input.vcf SNP.vcf unique1 input-SNPvcf

bedtools intersect -v -a input.vcf -b SNP.vcf > input-SNP.vcf

bedops --not-element-of -1  input-sorted.bed SNP-sorted.bed > inputs-sorted-SNP.bed

In all 3 cases, the SNP-output is identical to the input.vcf/bed.

These command-lines however work when I use an alu.bed or a repeat-masker-bed.

Is it just that my analysis contains no known SNPs? I have discarded for obvious reasons.

Can somebody point out the reason for, or a solution to, this problem?

Thanks, G.

Extract Only Paired-End Reads That Map A Specific Interval


Hi,

Is it possible to extract paired-end reads that map to a specific interval (from a BAM file)? I tried with intersectBed:

intersectBed -abam align.bam -b interval.gff3 -wa > result.bam

here's the result :

[image]

But I only want reads that map to the feature in bold blue (one of the paired reads is enough). For example, I don't want the reads that map either side of this feature (red arrow).

Is it possible with intersectBed or another program?

Thanks,

N.

How To Explain Uneven Coverage Of A Dna Segment Obtained Via Pcr Amplification.


Experiment: deep sequencing for mutants in a 700 nt fragment.

The DNA fragment was pre-amplified with primers flanking the fragment, followed by HiSeq sequencing.

Per-base coverage was calculated with: coverageBed -d -abam in.bam -b ref.bed > out.cov

Observation: two distinct peaks in coverage at the ends, as in the plot below (coverage vs. position).

[image]

The peaks are made up of reads that contain part of the primers, and these reads also show soft clipping at their ends.

There is a huge difference in the calculations depending on whether I include or exclude such reads.

Question: does anyone know how to handle such a situation?


How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns


I extracted sequences with fastaFromBed and have no complaints about BEDTools, which is a really awesome thing.

However, the extracted sequences look like this:

>chr19:13985513-13985622
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG

And my input file looks like this:

>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt

I was using this:

fastaFromBed -fi input -bed seq.bed -fo output

So shouldn't those sequences be formatted as FASTA (as NCBI says, "It is recommended that all lines of text be shorter than 80 characters in length"), or at least have the same line length as my input file?

What am I doing wrong that I am getting linearized (fasta?) output with fastaFromBed?
What is the quickest way to turn those linear sequences into nicely formatted columns using the command line?
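Single-line sequences are still valid FASTA as far as most parsers are concerned, so this is probably just how fastaFromBed writes its output. When the headers are shorter than the target width, `fold -w 60 output` may already be enough. The sketch below is a slightly more careful variant that never wraps header lines. The function name `wrap_fasta` and the default width of 60 are assumptions.

```shell
# Hypothetical helper: re-wrap the sequence lines of a FASTA stream.
# Argument: line width (default 60). Reads FASTA from stdin.
wrap_fasta() {
    awk -v w="${1:-60}" '
    /^>/ { print; next }    # header lines pass through untouched
    {
        # Emit the sequence in chunks of w characters.
        for (i = 1; i <= length($0); i += w) print substr($0, i, w)
    }'
}
```

Usage would be something like `wrap_fasta 50 < output > output.wrapped.fa` to match the 50-column input shown above.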

Tool For Binning Windowbed Output For K-Means Clustering


I have mapped high resolution ChIP-seq data to transcription start sites using windowBed. I now want to bin the data, in bin sizes of my choosing, relative to TSSs so that I can generate heat maps and do k-means clustering on the data.

What tool/s exist for doing this?

Thanks!

Samtools or Bedtools: How to filter a bam file with a bed file using strand information


Hi

I would like to filter a bam file, keeping only reads overlapping with genomic intervals from a bed file. I used samtools for this:

samtools view -b -h -L bedfile.bed bamfile.bam 

However the -L option does not seem to take into account the strand information.

Do you know if there is another option or way to do it that would keep strand information?

 

Variant annotation using several .BED files

So, I have a data file containing several hundred variants in the following format:

CHR #   START POS   END POS   VARIANT ID
1       100         1000      rs1
1       1200        1400      rs2

I ran the latter through Annovar to get the gene each variant was in (or its nearest gene and its distance) as well as the region (intronic, exonic, etc) each variant was in. The output had the following columns:

GENE/NEARESTGENE   REGION       CHR#   STARTPOS   ENDPOS   REFALLELE   ALTALLELE   VARIANTID
SOMEGENE           exonic       1      100        1000     A           G           rs1
SOMEGENE2          intergenic   1      1200       1400     G           T           rs2

I moved some columns around to make a file with the following columns, lets call this file.txt:

CHR#   STARTPOS   ENDPOS   GENE/NEARESTGENE   REGION       VARIANTID
1      100        1000     SOMEGENE           exonic       rs1
1      1200       1400     SOMEGENE2          intergenic   rs2

Now, I have several database files - something around 10 - of promoters, TSS, enhancers, etc, all of which in .bed format looking like the following -> lets call these database1.txt ... database10.txt

database1.bed

CHR #   STARTPOS   ENDPOS   LABEL--FOR-THE-REGION/NAME
1       500        600      ...

How To Count Genes In Genomic Regions Using A Gtf/Gff3 And A Bed File Of Regions


I'd like to count the number of unique genes in a gff file falling within a list of genomic regions. With bedtools I can count the number of regions within the gff which is almost what I want, but not quite.

bedtools intersect -a regions.bed -b my.gff -c

UPDATE:

I should have made my question a bit more specific. I have a modified Ensembl-style gtf file (not a gff) that has unique transcript IDs. This means that simply selecting unique fields in the 9th column of the gtf file actually counts transcript IDs.

To circumvent this problem I first truncated the gtf file:

cat my.gff | sed -e 's/;.*//' > delete.me.gtf

Then I ran the bedtools map command:

bedtools map -a regions.bed -b delete.me.gtf -c 9 -o count_distinct > counts.genes_in_windows.bed

I almost forgot to delete the intermediate file:

rm delete.me.gtf

There is probably a way to make this a oneliner, without the intermediate file, but I have a dissertation to write!
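The intermediate file can probably be avoided by piping the truncated GTF straight into bedtools. The commented one-liner assumes that this bedtools build accepts `stdin` as the -b file, which is worth checking first. The small function below only wraps the sed step so that it can be tried on its own. Its name, `first_attribute_only`, is hypothetical.

```shell
# Hypothetical one-liner (assumes bedtools reads -b from stdin on your version):
#   sed -e 's/;.*//' my.gff | bedtools map -a regions.bed -b stdin -c 9 -o count_distinct > counts.genes_in_windows.bed

# The sed step on its own: keep only the first attribute of column 9
# (the gene_id in an Ensembl-style GTF) by cutting at the first semicolon.
first_attribute_only() {
    sed -e 's/;.*//' "$@"
}
```

Calling `first_attribute_only my.gff | head` is a quick way to confirm that column 9 now holds only the gene_id before counting distinct values.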

Intersectbed: Return Reads In Fraction In Input Files


I have a question with respect to intersectBED and multiple input files:

Is it possible to return reads which are present in, say 8/10 input files, without fractioning the reads in smaller intervals?

Thank you

How To Combine Fpkm Values From Cufflinks With Contigs From De Novo Assembly Program Velvet/Oases?


Hi all,

I am working on RNA-seq data analysis. I've finished running Tophat and Cufflinks to get FPKM values for each read from Illumina paired-end sequencing. In parallel, I've run Velvet to get contig sequences through de novo assembly, and Gmap to see if the assembled sequences map to the reference genome (this reference genome is not complete for now, but somewhat useful).

Now, I am trying to combine all the information so that I have, for each contig, its sequence and the corresponding FPKM value. Some people suggested I could convert the Cufflinks and Gmap outputs to BED files and then use IntersectBed to see if there's any overlap. However, I am not sure how to keep all the information in the output from Bedtools. By default, IntersectBed seems to report the overlapped region using the 'A' file as a template, so I can't see any information from the 'B' file. Is there any solution for this? Please let me know. I would appreciate your suggestions!


Extract coverage per feature from a bam and bed to a file

Hi,

a simple task... or it should be. I need to extract the average coverage per feature in a BAM file. I have a GenBank and a BED file for the reference the BAM was mapped to. If I map with e.g. Geneious I can see good, variable coverage over the reference genome. I have tried GATK (could not get it to run) and Bedtools (genomecov and coverage). coverage will give me an output file, but all the features have zero coverage.

Here's the top of the .bed file:

track name="Example E.coli"
o26chr.gb 189 255 thrL gene 0 +
o26chr.gb 189 255 thrL CDS 0 +
o26chr.gb 336 2799 thrA gene 0 +
o26chr.gb 336 2799 thrA CDS 0 +
o26chr.gb 2800 3733 thrB gene 0 +
o26chr.gb 2800 3733 thrB CDS 0 +
o26chr.gb 3733 5020 thrC gene 0 +
o26chr.gb 3733 5020 thrC CDS 0 +
o26chr.gb 5233 5530 yaaX gene 0 +

Here's the top of the output from bedtools coverage -ibam file.bam -b file.bed:

o26chr.gb 1047122 1048841 poxB gene 0 - 0 0 1719 0.0000000
o26chr.gb 1047122 1048841 poxB CDS 0 - 0 0 1719 0.0000000
o26chr.gb 2096828 2097287 gene 0 + 0 0 459 0.0000000
o26chr.gb 3144900 3148635 yfaL gene 0 - 0 0 3735 0.0000000
o26chr.gb 3144900 3148635 yfaL CDS 0 - 0 0 3735 0.0000000
o26chr.gb 4194149 4194368 tdcR gene 0 + 0 0 219 0.00 ...

To Calculate The Exact Total Number Of Mapped Reads In Exome Regions

Dear All, I have some questions here. I want to do some quality control analysis on my exome data that are mapped to the reference genome. I have the input BAM file for a sample, which contains reads that were mapped to the reference genome (hg19.fa). There are 80 million mapped reads for this sample. Now I want to calculate how many of these 80 million mapped reads were mapped to the exome region. For this I need to supply the exome baits BED file (probe/covered.bed) provided by the company. We used the Agilent SureSelectV4 here.

So is there a one-line command that, using these three pieces of information (input.bam, hg19.fa and exome_baits.bed), can calculate the total number of mapped reads in the exonic regions? In different posts I see a lot of tools being mentioned. I tried to use CalculateHSmetrics of Picard, but it needs the BED file with a header, so it is of no use for now. Then I used the GATK walker DepthofCoverage, but there we usually get the mean number of times a base is read (for me it's 73.9) and the %_of_bases_reads above 15 times, which is about 70% and is also good quality. We also get how many loci have been read more than once, which gives a histogram of cumulative read coverage at each locus. But if I want to just calculate the number of mapped reads that got mapped in the exome region using the input bam file, ...

How Do You Get The Quality Score And Coverage For Every Single Position Of A Reference Assembly


Hi,

I am trying to extract the coverage and the average quality score for each position of a reference assembly in bam/sam format. I have managed to get the coverage using BEDtools

 genomeCoverageBed -ibam mybamfile.bam -g my_genome -d > my_coverage.txt

but am at a loss on how to get some measure of the quality of the base calls at each position. I was thinking that I could use the bcftools to get a variant call formatted file

samtools mpileup -uf ref.fa mybamfile.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

but this only provides the sites for which there are SNPs. Any advice greatly appreciated.

Joseph

Random shuffling of features leaving gene models intact


I am looking for a tool that can randomly shuffle gff features into intergenic regions, but leaving the gene-models 'intact', so that at least all features of a gene are placed on the same contig and related features are placed inside the interval of their parent region. Bedtools shuffle doesn't seem to do that, I am trying:

shuffleBed -i genes.gff3 -excl genes.gff3 -g chromsizes.txt -f 0

This command distributes sub-features to different contigs and leads to invalid gene-models. If I add -chrom, features are placed on the same contig, but not all features can be placed at all, and the resulting gene-models are still not valid. Does anyone maybe have some R-code for this use-case?

Can Bedtools/Bedops Be Used To Extract Regions Where Scores Are Higher Than A Given Value?

I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value? For example, let's say that I have a BED file like the following:

chr7 127471196 127472363 Pos1 12 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 200 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 120 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 54 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 2 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 15 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 25 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 2 + 127479365 127480532 255,0,0
chr7 127480532 127481699 Neg4 9 - 127480532 127481699 0,0,255

According to the BED format's specs, the fifth column contains a score between 0 and 1000 (alternatively, in the bedGraph format the score is in the 4th column). If I want to get all the regions that have a score higher than 20, for example, I can do an awk search:

$: awk '$5 > 20 {print}' mybedfile.bed

However, in order to use awk, I have to keep the BED file in an uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with th ...
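The awk filter does not actually need an uncompressed file on disk, because awk reads from a pipe just as well. A minimal sketch, assuming the compressed file can be streamed to stdout (for example with `gunzip -c`, or with the `unstarch` tool from BEDOPS for .starch archives). The function name `filter_by_score` is hypothetical.

```shell
# Hypothetical helper: keep BED records whose score (column 5) exceeds a threshold.
# Argument: threshold (exclusive). Reads BED from stdin.
filter_by_score() {
    awk -v min="$1" '($5 + 0) > (min + 0)'
}

# Usage sketches (file names are placeholders):
#   gunzip -c mybedfile.bed.gz | filter_by_score 20
#   unstarch mybedfile.starch  | filter_by_score 20
```

Because the filter is just another stage in a pipe, its output can also be fed directly into further bedtools or bedops steps.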