Fastafrombed Problem

April 18, 2014, 6:14 am

hi,

I tried this tool from BEDTools, but it doesn't work!

$ cat testgenome404.fa

>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1    5       10

$ ./fastaFromBed -fi testgenome404.fa -bed test.bed  -fo test.fa.out

**index file testgenome404.fa.fai not found, generating...

unable to find FASTA index entry for 'chr1'**

$ cat testgenome404.fa.fai
chr1    46      7       46      47

What is this file, testgenome404.fa.fai, and what do these numbers mean: chr1 46 7 46 47?

Why do I get this message?

unable to find FASTA index entry for 'chr1'

Thanks in advance for any help Sara
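
For reference, the five columns of a .fai line follow the samtools faidx convention: sequence name, sequence length, byte offset of the first base, bases per line, and bytes per line. The sketch below recomputes them for the first record of a FASTA file. It is only an illustration, and it assumes (this is not confirmed by the post) that the indexer treats "\n" alone as the line terminator, in which case Windows line endings would leave a stray "\r" in both the name and the counts. The function name `fai_record` is hypothetical.

```python
def fai_record(fasta_bytes: bytes):
    """Return (name, length, offset, linebases, linewidth) for the first record.

    Only b"\n" is treated as a line terminator, so a b"\r" left over from
    Windows line endings is counted as part of the name and the sequence.
    """
    lines = fasta_bytes.split(b"\n")
    header = lines[0]
    if not header.startswith(b">"):
        raise ValueError("expected a FASTA header on the first line")
    name = header[1:].split(b" ")[0].decode()
    offset = len(header) + 1  # header bytes plus its newline
    seq_lines = []
    for line in lines[1:]:
        if line.startswith(b">") or line == b"":
            break
        seq_lines.append(line)
    length = sum(len(s) for s in seq_lines)
    linebases = len(seq_lines[0]) if seq_lines else 0
    linewidth = linebases + 1 if seq_lines else 0
    return name, length, offset, linebases, linewidth


unix_style = fai_record(b">chr1\nACGT\n")
windows_style = fai_record(b">chr1\r\nACGT\r\n")
```

With Unix line endings the name comes out as "chr1"; with Windows line endings it comes out as "chr1\r", every count grows by one, and a lookup for 'chr1' would no longer match. Whether that is what happened here can only be checked on the actual file.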

↧

Problems Extracting Non-Snps From A Vcf File

April 18, 2014, 6:14 am

Hello,

In an SNP analysis, I am trying to extract the editing sites not found in the dbSNP VCF files. I have downloaded a couple of files (All SNPs and Common/Medical SNPs) from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF.

Following this, I have compared my VarScan *.vcf outputs with the SNP.vcf ones using 3 different approaches:

VarScan compare input.vcf SNP.vcf unique1 input-SNPvcf

bedtools intersect -v -a input.vcf -b SNP.vcf > input-SNP.vcf

bedops --not-element-of -1  input-sorted.bed SNP-sorted.bed > inputs-sorted-SNP.bed

In all 3 cases, the SNP-output is identical to the input.vcf/bed.

These command lines, however, work when I use an alu.bed or a RepeatMasker BED file.

Is it just that my analysis contains no known SNPs? I have discarded that possibility for obvious reasons.

Can somebody point to the reason for, or a solution to, this problem?

Thanks, G.
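
One thing worth checking in a situation like this is whether the two files name chromosomes the same way (for example "chr1" in one and "1" in the other), since interval tools compare names literally. The sketch below is only an illustration of that check, not a diagnosis of this particular case: it matches variants by chromosome and position after stripping a leading "chr", and ignores alleles. The function names are hypothetical.

```python
def normalize_chrom(chrom: str) -> str:
    """Strip a leading 'chr' so that 'chr1' and '1' compare equal."""
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def site_key(vcf_line: str):
    """Return (chromosome, position) for a tab-separated VCF data line."""
    chrom, pos = vcf_line.split("\t")[:2]
    return normalize_chrom(chrom), int(pos)


def is_data_line(line: str) -> bool:
    """True for non-empty lines that are not VCF headers."""
    return bool(line.strip()) and not line.startswith("#")


def novel_variants(input_lines, known_lines):
    """Return input VCF lines whose site is absent from the known set."""
    known = {site_key(l) for l in known_lines if is_data_line(l)}
    return [l for l in input_lines if is_data_line(l) and site_key(l) not in known]


mine = ["#CHROM\tPOS", "chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT"]
dbsnp = ["1\t100\trs1\tA\tG"]
novel = novel_variants(mine, dbsnp)
```

Without the normalization step, neither site in this toy input would match the known set, and the "filtered" output would be identical to the input.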

↧

Using Gnu Parallel For Bedtools

April 18, 2014, 6:14 am

I am trying to run GNU parallel on the bedtools multicov function, where the original command is

bedtools multicov -bams bam1 bam2 bam3.. -bed anon.bed  > Q1_Counst.bed

I would like to implement the above command using gnu parallel. But when I run the command below

parallel -j 25 "bedtools multicov -bams {1} -bed {2} > Q1_Counst.bed" ::: minus_1_common_sorted_q1.bam minus_2_common_sorted_q1.bam minus_3_common_sorted_q1.bam plus_1_common_sorted_q1.bam plus_2_common_sorted_q1.bam plus_3_common_sorted_q1.bam ::: '/genome/genes_exon_2.bed'

each BAM file is taken as a separate argument, hence the processes that start look like

bedtools multicov -bams  bam1 -bed anon.bed  > Q1_Counst.bed
bedtools multicov -bams  bam2 -bed anon.bed  > Q1_Counst.bed
bedtools multicov -bams  bam3 -bed anon.bed  > Q1_Counst.bed

instead of taking all the files as arguments to a single command. Hence Q1_Counst.bed is overwritten randomly. Could anyone help me get the exact command? My server has around 30 cores.
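
Since multicov takes all BAM files in one invocation, one possible way to use several cores is to split the BED file rather than the BAM list, and to give every job its own output file. The sketch below only builds the pieces and the argument lists; it does not run anything. The helper names and the chunk file names are hypothetical, and it assumes that multicov writes one output line per BED record, so that concatenating the per-chunk outputs in order gives the combined table.

```python
def chunk_lines(lines, n_chunks):
    """Split lines into at most n_chunks contiguous, near-equal pieces."""
    n_chunks = max(1, min(n_chunks, len(lines)))
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def multicov_argv(bams, bed_path):
    """Build the argument list for one multicov call over all BAM files."""
    return ["bedtools", "multicov", "-bams", *bams, "-bed", bed_path]


bams = ["minus_1_common_sorted_q1.bam", "plus_1_common_sorted_q1.bam"]
bed_records = ["chr1\t0\t10", "chr1\t20\t30", "chr2\t5\t15"]
jobs = [
    (multicov_argv(bams, f"chunk_{i}.bed"), f"counts_{i}.bed")  # hypothetical file names
    for i, _ in enumerate(chunk_lines(bed_records, 2))
]
```

Each entry in `jobs` pairs a command with a distinct output file, which avoids the overwriting described above.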

↧

Bedtools Multicov Need A Bam Index File Specification Option

April 18, 2014, 6:14 am

bedtools version 2.16.2 multicov is used to compute coverage across multiple samples, given a feature file (GTF or BED).

format: bedtools multicov -bams aln1.bam aln2.bam .. -bed capturRegion.bed > out.coverage

The official doc mentions that input BAM files should be sorted and indexed, but it does not give the details. Suppose the BAM file is named sample1.bam; then the index file should be named sample1.bam.bai (not sample1.bai), otherwise multicov will report an error: indexes not found.

I think it would be better to add an option which will allow the user to specify the bam index files or the suffix used for these index files.
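
Until such an option exists, a small wrapper can make sure the index carries the name described above. This is a minimal sketch with a hypothetical helper, `ensure_bam_bai`; it assumes the only alternative naming is sample1.bai next to sample1.bam, and it creates a symbolic link rather than copying the index.

```python
import os


def ensure_bam_bai(bam_path: str) -> str:
    """Make sure <bam>.bai exists, linking from <stem>.bai if necessary."""
    expected = bam_path + ".bai"
    if os.path.exists(expected):
        return expected
    alternative = os.path.splitext(bam_path)[0] + ".bai"
    if not os.path.exists(alternative):
        raise FileNotFoundError(f"no index found for {bam_path}")
    # Link instead of copying so the two names always refer to the same index.
    os.symlink(os.path.abspath(alternative), expected)
    return expected
```

Calling `ensure_bam_bai("sample1.bam")` before running multicov would return "sample1.bam.bai", creating the link first if only sample1.bai is present.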

↧

Simple Redirection, I/O Problem With Bedtools

April 18, 2014, 6:14 am

Hi guys, just a quick question. It's more of a Bash question than a bioinformatics one, with bedtools involved.

I mostly pipe the bedtools I/O. Here's a general scenario:

sed 1d fileA.bed | intersectBed -a stdin -b peaks.bed | intersectBed -u -a stdin -b fileB.bed

Now, the problem is that fileB also has a header line, which intersectBed reports as an error (makes sense, non-integer start).

How can I remove the first line (the header) of fileB on the fly in the pipe?

Thanks

↧

Filtering Bed Files By Using Bedops

April 18, 2014, 6:14 am

Hello everyone,

I have paired-end Illumina reads, R1.fastq and R2.fastq, and I mapped them as single-end reads using bowtie2 with default parameters. I performed further downstream analysis using samtools and bedops, and now I have R1.bed and R2.bed. I made two sets: the first has R1_uniquely_mapped.bed and R2_uniquely_mapped.bed, and the second has R1_mapped_more_than_1.bed and R2_mapped_more_than_1.bed.

Because R1 and R2 belong to paired-end reads, and my restriction library has a maximum size of 2 kb, the R1 and R2 pairs must lie within 2 kb of each other on the chromosome.

Theoretically, I am assuming the following in R1.bed format:

chr1  100   180    @R1_read1______1 .................
chr1   1000  1090 @R1_read2______1................

In R2.bed format,

chr1 2100   2180 @R2_read1_____2............. ## I just added 2 kb with respect to R1.bed
chr1 2500 2590    @R2_read______2......... ## I just added 1.5 kb [1500 nt] with respect to R1.bed, because my library is <= 2 kb.

How can I customize downstream tools like BEDOPS or bedtools to capture this type of read or alignment? How can I filter this type of information using bedops?

all suggestions and comments are most welcome,
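
To make the filtering criterion concrete, here is a minimal sketch of the pairing logic in plain Python. It assumes the two BED files have already been loaded into dictionaries keyed by a shared pair identifier (how to derive that identifier from the read names above is left open), and it measures the span from the leftmost start to the rightmost end of the two mates. The function name and the 2000 bp default are illustrative.

```python
def pairs_within(r1, r2, max_span=2000):
    """Keep pairs whose mates map to the same chromosome within max_span.

    r1 and r2 map a pair identifier to (chrom, start, end).
    """
    kept = []
    for pair_id, (chrom1, start1, end1) in r1.items():
        mate = r2.get(pair_id)
        if mate is None:
            continue  # the mate was not mapped
        chrom2, start2, end2 = mate
        if chrom1 != chrom2:
            continue
        left, right = min(start1, start2), max(end1, end2)
        if right - left <= max_span:
            kept.append((pair_id, chrom1, left, right))
    return kept


r1 = {"read1": ("chr1", 100, 180), "read2": ("chr1", 1000, 1090)}
r2 = {"read1": ("chr1", 2100, 2180), "read2": ("chr1", 2500, 2590)}
close_pairs = pairs_within(r1, r2)
```

With this definition of span, read1 covers 2080 bp and is dropped, while read2 covers 1590 bp and is kept. A different definition (for example start-to-start distance) would change which pairs pass.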

↧

Annotating Genomic Intervals

April 18, 2014, 6:14 am

How can I annotate human genomic intervals (BED file) from a ChIP-seq experiment with information such as whether the interval overlaps with a gene(s)? Upstream of a gene? Overlaps with an exon? Intron? 5kb upstream/downstream of TSS? Intergenic? Does it overlap with a DNAse I hypersensitive site?

Surely bedtools can help me with this, but I'm looking for the best workflow / data sources to use for this that will require the least amount of scripting.

Thanks.

↧

Changing Column Order In Bed File

April 19, 2014, 6:20 am

Here is my data with A, B, C and D columns in my bed file.

   A.     B.     C.     D.
  Chr 1.  1.    12.     +
  Chr 2.  24.   56.     +

How can I move my D column to position 1, where column A is right now?
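
A minimal sketch of the reordering, assuming tab-separated fields and a hypothetical helper name. Note that once the strand is in the first column, the result no longer follows the usual BED column order.

```python
def move_column_first(line: str, column: int = 4, sep: str = "\t") -> str:
    """Move the given 1-based column to the front, keeping the rest in order."""
    fields = line.rstrip("\n").split(sep)
    moved = fields.pop(column - 1)
    return sep.join([moved] + fields)


reordered = move_column_first("chr1\t1\t12\t+")
```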

↧

Intersect Gene Annotation With Specific Position Or Genomic Interval

April 19, 2014, 6:20 am

Hi,

I have several genomic intervals and I want to check whether they overlap known genes. I have a GTF file with the coordinates of gene exons. My idea was to use intersectBed from bedtools, but I have a small problem with short genomic intervals that overlap intron coordinates rather than exons (it does not report the gene containing the interval). Is it possible to tell intersectBed to take introns into account? Or is there another tool?

Thanks

↧

Getting All Reads That Align To A Region In Compact Bed Format Using Bedtools?

April 19, 2014, 6:20 am

I'm trying to find all the reads (by name) from a BAM file that align to various regions in a bed file. Right now I can do this with bedtools using intersectBed:

intersectBed -abam reads.bam -wo -f 1 -b regions.bed -bed

From this one can parse all the read ids that land in every interval in regions.bed, but it's not very compact. Is there a way to get bedtools to natively transform this into a more compact format, e.g.

chr1 x y .... read_id1,read_id2,read_id3

where chr1 x y is a given interval in regions.bed and the comma separated read_id1,... is the list of read ids from reads.bam that fall in that interval. In this compact format, the output BED file would have at most as many entries as there are regions in regions.bed, whereas with the -wo option it can be even larger than the number of reads in reads.bam. Thanks.
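
The compact format described above amounts to grouping read ids by interval. A minimal sketch of that grouping as a post-processing step, assuming the -wo output has already been parsed into (chrom, start, end, read_id) tuples; which columns hold those values depends on the exact output layout, so the parsing is deliberately left out. The function name is hypothetical.

```python
def collapse_reads(hits):
    """Group read ids by interval, one tab-separated line per interval."""
    grouped = {}
    for chrom, start, end, read_id in hits:
        grouped.setdefault((chrom, start, end), []).append(read_id)
    return [
        f"{chrom}\t{start}\t{end}\t{','.join(ids)}"
        for (chrom, start, end), ids in grouped.items()
    ]


hits = [("chr1", 10, 50, "read_id1"), ("chr1", 10, 50, "read_id2"), ("chr2", 5, 9, "read_id3")]
compact = collapse_reads(hits)
```

Intervals appear in the order they are first seen, and intervals with no reads are absent, so the output has at most one line per region.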

↧

Getting Unmapped Reads: Comparing Fastq To Bam

April 19, 2014, 6:20 am

Given a FASTQ file and a BAM file of aligned reads, is there an efficient way to get all the reads that are in the original FASTQ but not in the BAM, perhaps using bedtools? For example:

unmapped_script original.fastq aligned.bam > unmapped.fastq

should create an unmapped.fastq file, which is a subset of original.fastq containing only those entries that do not appear in aligned.bam
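
The set logic behind such a script can be sketched as follows. This assumes the names of the aligned reads have already been collected into a set by some other means (extracting them from the BAM is not shown), that each FASTQ record is exactly four lines, and that a trailing "/1" or "/2" on a read name should be ignored when comparing. The function name is hypothetical.

```python
def unmapped_fastq(fastq_lines, aligned_names):
    """Return the FASTQ records whose read name is not in aligned_names."""
    kept = []
    for i in range(0, len(fastq_lines) - 3, 4):
        record = fastq_lines[i:i + 4]
        name = record[0][1:].split()[0]  # drop '@' and any description
        if name.endswith(("/1", "/2")):
            name = name[:-2]  # drop the mate suffix
        if name not in aligned_names:
            kept.extend(record)
    return kept


fastq = ["@r1", "ACGT", "+", "IIII", "@r2/1", "TTTT", "+", "IIII"]
unmapped = unmapped_fastq(fastq, {"r1"})
```

One caveat: a BAM file can itself contain records for unmapped reads, so the name set should be built only from records that are actually aligned.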

thank you.

↧

Counting Number Of Bam Reads Directly Within Set Of Intervals With Bedtools

April 19, 2014, 6:20 am

How can I count the number of BAM reads falling directly within a set of intervals given in GFF format? Note that I do not want reads that merely overlap the intervals, but ones that fall entirely within them.

I tried the following:

intersectBed -abam reads.bam -b exons.gff -wb -f 1

This has redundancies, so I pipe it into coverageBed as follows:

intersectBed -abam reads.bam -b exons.gff -wb -f 1 | coverageBed -abam stdin -b exons.gff

Is this correct? Thanks.
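
For checking results on a small test case, the containment criterion itself can be written down directly. The sketch below is a brute-force reference, not a replacement for the pipeline above: it assumes reads and intervals are already in the same coordinate convention (GFF is 1-based and inclusive, so GFF intervals would need converting first), and the function name is hypothetical.

```python
def count_contained(reads, intervals):
    """Count, per interval, the reads that lie entirely inside it."""
    counts = []
    for chrom, start, end in intervals:
        n = sum(
            1
            for r_chrom, r_start, r_end in reads
            if r_chrom == chrom and start <= r_start and r_end <= end
        )
        counts.append((chrom, start, end, n))
    return counts


reads = [("chr1", 10, 20), ("chr1", 15, 35), ("chr1", 30, 40)]
exons = [("chr1", 5, 25), ("chr1", 28, 45)]
per_exon = count_contained(reads, exons)
```

The read spanning 15 to 35 overlaps both intervals but is contained in neither, so it is not counted.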

↧

How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns

April 19, 2014, 6:20 am

I extracted sequences with fastaFromBed and have no complaints about BEDTools, which is a really awesome thing.

However, the extracted sequences look like this:

>chr19:13985513-13985622
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG

And my input file looks like this:

>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt

I was using this:

fastaFromBed -fi input -bed seq.bed -fo output

So shouldn't those sequences be written in FASTA format (as NCBI says, "It is recommended that all lines of text be shorter than 80 characters in length") or at least with the same line length as my input file?

What am I doing wrong that I am getting linearized (FASTA?) output from fastaFromBed?
What is the quickest way to turn those linear sequences into nicely formatted columns using the command line?
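
Re-wrapping the sequence lines after extraction is straightforward. A minimal sketch with a hypothetical helper, assuming each record is one header line followed by one unwrapped sequence line; the width is a parameter and 60 is just an example value.

```python
def wrap_fasta(lines, width=60):
    """Re-wrap sequence lines to a fixed width, leaving headers untouched."""
    wrapped = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") or not line:
            wrapped.append(line)
        else:
            wrapped.extend(line[i:i + width] for i in range(0, len(line), width))
    return wrapped


example = wrap_fasta([">chr19:1-10", "ACGTACGTAC"], width=4)
```

Single-line sequences are still valid FASTA, so this step is cosmetic; it matters mainly for tools or viewers that expect short lines.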

↧

Bed File Bedpe Format

April 19, 2014, 6:20 am

Hi,

I'm having trouble converting a BAM file into BEDPE format using bedtools.

workflow:
samtools sort -n mut.bam mut.Namesorted
bamToBed -i mut.Namesorted.bam -bedpe > dilpMerged_bedpe.bed

After sorting the file by read name (option -n), I run the bamToBed command, but it gives me an error message after processing a few lines:

*ERROR: -bedpe requires BAM to be sorted/grouped by query name.

What am I doing wrong here?

Thanks

↧

How To Extract Scores From Bedgraph File Using Bed Tools

April 19, 2014, 6:20 am

file1

chr1 10 20 name 0 +

file2

chr1 12 14 2.5
chr1 14 15 0.5

How could I extract the average score for each region in file1 using file2, as shown below? I am trying to extract average phastCons scores (file2) for the regions in file1.

chr1  10 20 name 0 + 1.5
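
The expected value of 1.5 corresponds to a plain mean over the overlapping file2 records, (2.5 + 0.5) / 2. A minimal sketch of that calculation, with a hypothetical function name and the intervals already parsed into tuples:

```python
def mean_overlapping_score(region, scored):
    """Unweighted mean of the scores of records overlapping the region."""
    chrom, start, end = region
    hits = [
        score
        for s_chrom, s_start, s_end, score in scored
        if s_chrom == chrom and s_start < end and s_end > start
    ]
    return sum(hits) / len(hits) if hits else None


file1_region = ("chr1", 10, 20)
file2_records = [("chr1", 12, 14, 2.5), ("chr1", 14, 15, 0.5)]
average = mean_overlapping_score(file1_region, file2_records)
```

This mean treats every record equally. Weighting each score by the number of bases it covers would give a different number for the same input, so it is worth deciding which definition is wanted.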

↧

Intersectbed - Overlap Analysis Using Vcf And Bed Files

April 19, 2014, 6:20 am

I am trying to do an overlap analysis between 200 Danish exomes (VCF courtesy: Zev) and 10 different gene regions.
I would like to know what percentage of my regions of interest (mygenes.bed, 36 lines in total representing the regions) overlaps a VCF file (Danish_*.flt.vcf.gz).

I have tried this command and got result: intersectBed -a Danish1.flt.vcf.gz -b mygenes.bed > D1result.txt

Danish1.flt.vcf.gz: here mygenes.bed: here D1overlapped.txt: here

My assumption is that the output should have at most as many lines as the mygenes.bed file. But in many instances I am getting more than 36 lines of output. Maybe I am missing something important, or maybe another tool or option in bedtools can do this task more efficiently. Please let me know your thoughts.
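
With the VCF as the first file, the output is driven by variants rather than by regions, so a region containing several variants can contribute several lines. To get a per-region answer, the regions need to be counted instead. A minimal sketch of that count, with hypothetical names, assuming BED coordinates are 0-based half-open and VCF positions are 1-based:

```python
def regions_with_variants(regions, variants):
    """Return the regions that contain at least one variant position."""
    hit = []
    for chrom, start, end in regions:
        # A 1-based position p lies in a 0-based half-open [start, end) when start < p <= end.
        if any(v_chrom == chrom and start < pos <= end for v_chrom, pos in variants):
            hit.append((chrom, start, end))
    return hit


regions = [("chr1", 100, 200), ("chr1", 300, 400)]
variants = [("chr1", 150), ("chr1", 160), ("chr1", 100)]
hit = regions_with_variants(regions, variants)
percent_hit = 100.0 * len(hit) / len(regions)
```

Here two variants fall in the first region and none in the second, so a variant-driven output has two lines while only one of the two regions is hit.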

↧

Creating Bed File For Lncrna Using Gencode Gtf File

April 19, 2014, 6:20 am

Hi all,

I want to get a BED file of lncRNAs based on the GENCODE GTF file.

I downloaded the file "gencode.v16.long_noncoding_RNAs.gtf.gz" and extracted the chr, start, and end info from it; then I used mergeBed to merge the overlapping lncRNAs. Is this correct? I know we can merge exon genomic positions using this kind of method.

For lncRNA, though, I am not so sure. Is there any place that already offers this kind of BED file?

Actually, we should get 22444 long non-coding RNA loci transcripts, but there are only 11817 genomic regions after the merging process.

If anyone knows the answer, could you give me some help?
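
A drop in the number of records is what interval merging does by construction: every group of overlapping intervals collapses into one region, so several transcripts at the same locus become a single line. The sketch below shows the idea in plain Python. It is an illustration of interval merging in general, not a reimplementation of mergeBed, and in this version intervals that merely touch are merged as well.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            previous = merged[-1]
            merged[-1] = (chrom, previous[1], max(previous[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


transcripts = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500), ("chr2", 100, 200)]
loci = merge_intervals(transcripts)
```

Four input records become three regions here. Note that merging by coordinates alone ignores strand and gene identity, so two distinct genes that overlap would also be collapsed.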

↧

Determining Each Samples Coverage Area

April 19, 2014, 6:20 am

This is my first time working with NGS data. I've got a BAM file with mapped reads for my samples and a BED file with the regions in hg19 that were targeted (using an Ion Torrent AmpliSeq panel). Are there any tools that can output something similar to this:

**Sample      Amplicon           Chromosome           Start_coordinate_of_coverage             End_coordinate_of_coverage**
Sample1       amp_001                chr6                 1,000,000                                   1,000,250
Sample2       amp_001                chr6                 1,000,111                                   1,000,255
Sample1       amp_002                chr6                 1,000,200                                   1,000,333

I basically want to know for each gene what coverage we have for each sample.

EDIT: changed column headings, I'm looking for coordinates that have coverage, not depth at each exon.
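
To pin down what the requested columns would contain, here is a minimal sketch for one sample and one amplicon. It assumes the read alignments have already been converted to (chrom, start, end) tuples, and it reports only the outermost covered coordinates, so a gap in coverage inside the amplicon would go unnoticed. The function name is hypothetical.

```python
def covered_extent(amplicon, reads):
    """Return (first, last) covered coordinate among reads overlapping the amplicon."""
    chrom, start, end = amplicon
    overlapping = [
        (r_start, r_end)
        for r_chrom, r_start, r_end in reads
        if r_chrom == chrom and r_start < end and r_end > start
    ]
    if not overlapping:
        return None
    return min(s for s, _ in overlapping), max(e for _, e in overlapping)


amp_001 = ("chr6", 1000000, 1000300)
sample1_reads = [("chr6", 1000000, 1000120), ("chr6", 1000100, 1000250), ("chr6", 2000000, 2000100)]
extent = covered_extent(amp_001, sample1_reads)
```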

↧

Which Of The Genes Are Enriched With Repeat Elements

April 19, 2014, 6:20 am

I would like to know which of my genes are enriched with repeat elements such as LINE/SINE/ERV. I have a BAM file and the repeats in BED format. As far as I know, BAM files contain aligned data for each short read sequence from the FASTQ file. I am trying to figure out the best approach to find which genes (+- 1000 bp) have more repeat elements. I am thinking about two approaches, but I am not sure which one is best.

a) Shall I convert the BAM file into a BED file and then use bedtools merge, so that I can overlap the result with the repeats file using bedtools window with the -c, -l and -r options? Then I would know how many of the repeats overlap or lie near the short reads, and I could count this number for each gene. For example,

chr   start  end gene number_of_repeats
chr1 100  200  gene1 70
chr1 190  240  gene1 40
chr1 250  400  gene1 100
chr2 500  600  gene2 150

If I sort and merge them, I will get

chr1 100  240  gene1 90
chr1 250  400  gene1 100
chr2 500  600  gene2 150

So gene1 will have 190 (90 + 100) and gene2 will have 150 repeats.

b) Or shall I count the number of repeats for each short sequence without any merging? ...

↧

Tool For Binning Windowbed Output For K-Means Clustering

April 19, 2014, 6:20 am

I have mapped high resolution ChIP-seq data to transcription start sites using windowBed. I now want to bin the data, in bin sizes of my choosing, relative to TSSs so that I can generate heat maps and do k-means clustering on the data.

What tools exist for doing this?

Thanks!
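
Whatever tool is used, the binning step itself is small. The sketch below turns signal positions around one TSS into a fixed-length vector of counts, which is the row format that heat maps and k-means clustering typically expect. The function name is hypothetical, and it assumes the flank is a multiple of the bin size and that positions on the minus strand should be mirrored so that upstream is always on the left.

```python
def bin_counts(positions, tss, bin_size, flank, strand="+"):
    """Count positions per bin in the window [tss - flank, tss + flank)."""
    n_bins = (2 * flank) // bin_size
    counts = [0] * n_bins
    for pos in positions:
        # Signed distance to the TSS, oriented so upstream is negative.
        offset = pos - tss if strand == "+" else tss - pos
        if -flank <= offset < flank:
            counts[(offset + flank) // bin_size] += 1
    return counts


row = bin_counts([950, 1000, 1049, 1100], tss=1000, bin_size=50, flank=100)
```

Stacking one such row per TSS gives a matrix with one line per gene and one column per bin.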

↧