BEDTools "Segmentation fault" while working with genome.fa

March 27, 2013, 9:48 pm

I wanted to use BEDTools to extract genomic sequences (fastaFromBed).

My BED file covers all 24 chromosomes, so I want to use the whole genome (merged from the per-chromosome chromosome.fa files).

I tried:
fastaFromBed -fi genome.fa -bed all.chromosomes.bed -fo output
but got:
Segmentation fault (core dumped)

Using each chromosome.fa separately worked:
fastaFromBed -fi chromosome${i}.fa -bed all.chromosomes.bed -fo output
Of course, I get the annoying warning:
WARNING. chromosome (chr..) was not found in the FASTA file. Skipping.
But it's still better than nothing, and it is really fast.

I prefer to use BEDTools for sequence extraction, so I am wondering whether this segmentation fault can be solved. It seems that BEDTools can't handle a large genome.fa file, as I also tried nucBed and got the same error; or it might be a problem with how the genome was merged.

EDITED

This is the bed file I used for: intersectBed; closestBed; fastaFromBed ([www.box.com][1]).
There were problems only with fastaFromBed, and only when I used the whole genome.fa (~3.15 GB). As mentioned above, using each chromosome separately gave warnings, but there was no segmentation fault and the output was fine. I am wondering whether it might be a genome.fa problem (I used cat to merge the chromosomes).

EDITED#2

head genome.fa

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

cat genome.fa.fai

chr1 249250621 6 50 51
chr2 243199373 254235646 50 51
chr3 198022430 502299013 50 51
chr4 191154276 704281898 50 51
chr5 180915260 899259266 50 51
chr6 171115067 1083792838 50 51
chr7 159138663 1258330213 50 51
chr8 146364022 1420651656 50 51
chr9 141213431 1569942965 50 51
chr10 135534747 1713980672 50 51
chr11 135006516 1852226121 50 51
chr12 133851895 1989932775 50 51
chr13 115169878 2126461715 50 51
chr14 107349540 2243934998 50 51
chr15 102531392 2353431536 50 51
chr16 90354753 2458013563 50 51
chr17 81195210 2550175419 50 51
chr18 78077248 2632994541 50 51
chr19 59128983 2712633341 50 51
chr20 63025520 2772944911 50 51
chr21 48129895 2837230949 50 51
chr22 51304566 2886323449 50 51
chrX 155270560 2938654113 50 51
chrY 59373566 3097030091 50 51

genome.fa.fai was generated by BEDTools, which printed:
index file genome.fa.fai not found, generating...
Right after the index is generated, I get the segmentation fault. If BEDTools can scan the genome and generate the index file, maybe the genome is not the problem.
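
A quick sanity check on the index above, as a minimal sketch: in the five-column .fai layout shown (name, length, byte offset, bases per line, bytes per line), each offset should equal the previous offset plus the previous sequence's bytes plus the next header line. The sketch assumes every header is exactly ">" plus the name plus a newline, as head shows for chr1; all function names are hypothetical.

```python
INT32_MAX = 2**31 - 1


def expected_next_offset(offset, length, line_bases, line_bytes, next_name):
    """Byte offset where the next sequence should start, assuming full
    sequence lines (except possibly the last) and a bare '>name' header."""
    n_lines = -(-length // line_bases)  # ceiling division
    seq_bytes = length + n_lines * (line_bytes - line_bases)
    header_bytes = len(">" + next_name + "\n")
    return offset + seq_bytes + header_bytes


def check_fai(rows):
    """rows: consecutive (name, length, offset, line_bases, line_bytes).
    Returns the names whose offset disagrees with the preceding entry."""
    bad = []
    for prev, cur in zip(rows, rows[1:]):
        _name, length, offset, line_bases, line_bytes = prev
        want = expected_next_offset(offset, length, line_bases, line_bytes, cur[0])
        if want != cur[2]:
            bad.append(cur[0])
    return bad


def beyond_int32(rows):
    """Names whose byte offset does not fit in a signed 32-bit integer."""
    return [name for name, _len, offset, _lb, _lby in rows if offset > INT32_MAX]


# First three rows of the genome.fa.fai shown above.
rows = [
    ("chr1", 249250621, 6, 50, 51),
    ("chr2", 243199373, 254235646, 50, 51),
    ("chr3", 198022430, 502299013, 50, 51),
]
print(check_fai(rows))

# Rows for chr13 and chr14 from the same index.
tail = [
    ("chr13", 115169878, 2126461715, 50, 51),
    ("chr14", 107349540, 2243934998, 50, 51),
]
print(beyond_int32(tail))
```

For the rows checked here the offsets are consistent with a cleanly concatenated file. The second helper only reports which offsets exceed 2^31 - 1 (in this index, chr14 onward); whether that has anything to do with the crash is not established by this check.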

↧

N closest Genes to a given location

March 27, 2013, 9:48 pm

Hi,

This is basically an extension of the following question already asked in biostar (http://biostars.org/post/show/53561/python-finding-gene-closest-to-a-given-location/).

Let us say I have a list of genomic regions (as a bed file), and also a list of genes (as a bed file). For each genomic region I want to find the 5 (or N to be general) closest genes. How would I try to do that? Any suggestions?

Thanks!
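
One way to picture the task, as a minimal brute-force sketch rather than a bedtools recipe: for each region, rank the genes on the same chromosome by the gap between the two intervals and keep the N smallest. Intervals are assumed to be 0-based half-open as in BED; the function names and sample coordinates are made up for illustration.

```python
import heapq


def gap(a_start, a_end, b_start, b_end):
    """Bases separating two half-open intervals; 0 if they overlap or touch."""
    return max(0, b_start - a_end, a_start - b_end)


def n_closest(regions, genes, n=5):
    """regions: (chrom, start, end); genes: (chrom, start, end, name).
    Returns {region: [(gene_name, gap), ...]} with up to n genes per region."""
    by_chrom = {}
    for g in genes:
        by_chrom.setdefault(g[0], []).append(g)
    result = {}
    for chrom, start, end in regions:
        candidates = by_chrom.get(chrom, [])
        nearest = heapq.nsmallest(
            n, candidates, key=lambda g: gap(start, end, g[1], g[2])
        )
        result[(chrom, start, end)] = [
            (g[3], gap(start, end, g[1], g[2])) for g in nearest
        ]
    return result


genes = [
    ("chr1", 100, 200, "A"),
    ("chr1", 500, 600, "B"),
    ("chr1", 900, 950, "C"),
    ("chr2", 10, 20, "D"),
]
regions = [("chr1", 550, 560)]
closest = n_closest(regions, genes, n=2)
print(closest)
```

This compares every region with every gene on its chromosome, so it is only practical for modest file sizes; sorted inputs and a sweep would scale better.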

↧

intersectBED: return reads in fraction in input files

March 27, 2013, 9:48 pm

I have a question with respect to intersectBED and multiple input files:

Is it possible to return reads that are present in, say, 8 out of 10 input files, without splitting the reads into smaller intervals?

Thank you

↧

Annotating Genomic Intervals

March 27, 2013, 9:48 pm

How can I annotate human genomic intervals (BED file) from a ChIP-seq experiment with information such as whether the interval overlaps a gene (or genes)? Is upstream of a gene? Overlaps an exon? An intron? Lies within 5 kb upstream/downstream of a TSS? Is intergenic? Overlaps a DNase I hypersensitive site?

Surely bedtools can help me with this, but I'm looking for the best workflow / data sources to use for this that will require the least amount of scripting.

Thanks.

↧

Reproduce ENCODE/CSHL Long RNA-seq data visualization viewed in UCSC, but failed? [DONE]

March 27, 2013, 9:48 pm

Motivation

The ENCODE data have been released, and luckily both the .bam files and the .bigWig files are provided. So I wanted to try reproducing the data visualization with BEDTools and related tools.

Result

First, here is the difference between my version and the official version (image not included):

Top to Bottom:

Black: my-version-POSitive-strand.bigwig
Blue: Official-version-POSitive-strand.bigwig
Red: Official-version-REVerse-strand.bigwig
Grey: my-version-REVerse-strand.bigwig

From the image, my-version data and official-version data roughly share the same peaks; however, my-version peaks are masked by some uniform noise. And it drives me crazy.

Note that I know not all bioinformatics work can be reproduced, but this issue does not involve many algorithms, decisions, etc. Therefore, I think it should be reproducible.

Data Set

ENCODE/CSHL long RNA-seq Data set can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/ And here I use K562-chromatin-subcellular fraction (Rep4) to explore as an example:

BAM file ready for my Data Processing;
bigWig Positive signal file ready for uploading to UCSC;
bigwig Reverse signal file ready for uploading to UCSC.

Data Processing

BAM sort

samtools sort wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort

Genome Coverage

Following the standard BEDTools manual, I'll use the forward strand as an example; the reverse-strand signal is generated in the same way.

genomeCoverageBed -bg -ibam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort -g hg19.chromInfo -strand + >K562-Chromatin-POS-4.bedgraph

Note that I've used -strand flag to separate the two strands.

bedGraphToBigWig

The bedGraphToBigWig executable is available from the UCSC executables list.

bedGraphToBigWig K562-Chromatin-POS-4.bedgraph hg19.chromInfo K562-Chromatin-POS-4.bigwig

Upload to ftp and finally to UCSC genome browser.

Discussion

I was wondering which filtering step I've missed.

I've checked whether all the reads in the .bam file are uniquely mapped. The reads were mapped to the genome with a tool named STAR. According to the manual and common sense, a mapping quality of 255 in the .sam file means a uniquely mapped read. After checking the mapping quality, all the reads in the .bam file are uniquely mapped.

Another gene difference (image not included).

Thus, any suggestions?

↧

Counting features in a BED file

March 27, 2013, 9:48 pm

I have a file in the following BED format

Chr1 1022071 1022105  +
Chr1 1022071 1022105  +
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -

I am trying to get the counts of each feature represented in this file.

mergeBed -i R5_chr.bed -n -s -d 0 > Output/R5_chr_counts.bed

I am only interested in the counts of the features and do not want to merge features across any number of base pairs. The output should be as follows:

Chr1 1022071 1022105 2 +
Chr1 1022072 1022106 4 -

Any suggestions on how to achieve this using bedtools or in bash or awk? Thanks in advance!
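
Since only identical records need to be tallied, one option is to skip merging altogether and count exact (chromosome, start, end, strand) tuples. The following is a minimal sketch under that assumption; it treats the last whitespace-separated field as the strand, and the function name is hypothetical.

```python
from collections import Counter


def count_features(lines):
    """Count identical (chrom, start, end, strand) records and return
    sorted rows of (chrom, start, end, count, strand)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        key = (fields[0], int(fields[1]), int(fields[2]), fields[-1])
        counts[key] += 1
    return [(c, s, e, n, strand) for (c, s, e, strand), n in sorted(counts.items())]


lines = [
    "Chr1 1022071 1022105  +",
    "Chr1 1022071 1022105  +",
    "Chr1 1022072 1022106  -",
    "Chr1 1022072 1022106  -",
    "Chr1 1022072 1022106  -",
    "Chr1 1022072 1022106  -",
]
table = count_features(lines)
for row in table:
    print(*row)
```

Because the key includes the strand, two features with the same coordinates on opposite strands are counted separately.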

↧

How to get Annotation for Bed File from Another Bed File

March 27, 2013, 9:48 pm

Hello All,

I have a bed file (with Chr, Start, End, Name, Score and Strand)

Chr1 5678 5680 NA 7  +
Chr1 700  800  NA 8  -
Chr1 900  1200 NA 10 -

and would like to know, how can I get the annotation for the name column from another bed file

Chr1 5500 6000 Gene1 x +
Chr1  500 1000 Gene2 x -

or from any standard genome file format like gbk or .fna files? So my output file will be a BED file with Chr, Start, End, Name, Score and Strand.

Chr1 5678 5680 Gene1 7 +
Chr1 700  800  Gene2 8 -
Chr1 900  1200 Gene2 10 -

Is there an easy, standard way to do this?

Bedtools usually operates on the features themselves, and I am not sure whether annotation from one BED file can be transferred to another based on overlapping features.

Thanks in advance!
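
To make the intended result concrete, here is a minimal sketch of transferring names by overlap. It assumes that "annotation" means the name of the first same-strand record in the second file that overlaps the query, and that a query with no overlap keeps its original name; both choices, and the function name, are assumptions for illustration.

```python
def annotate(queries, annotations):
    """queries, annotations: BED6 tuples (chrom, start, end, name, score, strand).
    Returns the queries with the name taken from the first overlapping
    same-strand annotation, if any."""
    out = []
    for chrom, start, end, name, score, strand in queries:
        for a_chrom, a_start, a_end, a_name, _a_score, a_strand in annotations:
            same_place = a_chrom == chrom and a_strand == strand
            if same_place and start < a_end and a_start < end:
                name = a_name
                break
        out.append((chrom, start, end, name, score, strand))
    return out


queries = [
    ("Chr1", 5678, 5680, "NA", 7, "+"),
    ("Chr1", 700, 800, "NA", 8, "-"),
    ("Chr1", 900, 1200, "NA", 10, "-"),
]
annotations = [
    ("Chr1", 5500, 6000, "Gene1", "x", "+"),
    ("Chr1", 500, 1000, "Gene2", "x", "-"),
]
annotated = annotate(queries, annotations)
for row in annotated:
    print(*row)
```

The third query only partially overlaps Gene2 (900 to 1000) and is still labelled Gene2, which matches the desired output in the question.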

↧

getting number of reads in intervals with bedtools

March 27, 2013, 9:48 pm

What is the correct way to get the total number of reads strictly contained in each interval in a GFF from a BAM file while enforcing strandedness? What I am looking for is very close to this intersectBed feature:

-c    For each entry in A, report the number of overlaps with B.
    - Reports 0 for A entries that have no overlap with B.
    - Overlaps restricted by -f and -r.

Except that I'd like the number of overlaps in A for each entry in B (i.e. the other way around). If I do:

intersectBed -abam mybam.bam -b mygff.gff -s -f 1 -wb

Then my understanding is that this will report the entry in B for each overlap with A. But I'd like each entry in B to be output exactly once, with the number of reads from A that are contained strictly within it. I'm not sure how to enforce strict containment here.

Is coverageBed the solution to this? Or multicov? I'm not sure how to enforce strict containment using coverageBed - it's not clear to me if that's the default from the docs. Thanks.
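
To pin down what "strictly contained, same strand, one line per feature" means, here is a minimal sketch of the counting rule itself. It assumes reads and features have already been converted to 0-based half-open (chrom, start, end, strand) tuples (GFF coordinates are 1-based inclusive, so the start would need shifting); it does not read BAM or GFF files, and the function name is hypothetical.

```python
def count_contained(features, reads):
    """Return [(feature, n)] where n is the number of reads on the same
    chromosome and strand that lie entirely inside the feature."""
    counts = []
    for f_chrom, f_start, f_end, f_strand in features:
        n = sum(
            1
            for r_chrom, r_start, r_end, r_strand in reads
            if r_chrom == f_chrom
            and r_strand == f_strand
            and f_start <= r_start
            and r_end <= f_end
        )
        counts.append(((f_chrom, f_start, f_end, f_strand), n))
    return counts


features = [("chr1", 100, 200, "+")]
reads = [
    ("chr1", 110, 150, "+"),  # inside
    ("chr1", 90, 150, "+"),   # starts before the feature
    ("chr1", 120, 200, "+"),  # ends exactly at the feature end, still inside
    ("chr1", 110, 150, "-"),  # wrong strand
    ("chr2", 110, 150, "+"),  # wrong chromosome
]
contained = count_contained(features, reads)
print(contained)
```

Features with no contained reads still appear once, with a count of 0.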

↧

comparative SNP analysis

March 27, 2013, 9:48 pm

Hello,

I am trying to compare the degree of A-to-G editing in a near-isogenic pair of cell lines. I have two biological replicates and have mapped with Bowtie and BWA, followed by a samtools mpileup | VarScan analysis. After this, I used bedtools intersect to extract variants that are not annotated in dbSNP but are in Alu repeats. Here is where I have some doubts, mainly two questions:

QUESTION 1:

In the VCF file (VarScan output),

#CHROM  POS     ID      REF     ALT     QUAL    FILTER    INFO    FORMAT  Sample1    Sample2
   chrM    73      .           G       A       PASS     DP=238  GT:GQ:DP           1/1:71:121  1/1:69:117

What exactly is the meaning of

FORMAT   Sample1    Sample2
GT:GQ:DP 1/1:71:121  1/1:69:117

QUESTION 2:

I have a higher number of editing sites "called" in sample 1 than in sample 2 in the 1st biological replicate (about 16% difference). However, this difference is reversed in the 2nd biological replicate. What is the proper way of comparing the degree of RNA editing in two different samples? Is there a quantitative procedure? I have naively compared them with bedtools intersect, using or omitting option -v. Is this the correct way to go about it?

Many thanks. G.
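
On QUESTION 1: in the VCF format, the FORMAT column lists the keys and each sample column holds the matching colon-separated values. GT is the genotype, GQ the genotype quality, and DP the read depth for that sample; a GT of 1/1 means both alleles are the first ALT allele. A minimal sketch of decoding the row above (function names are hypothetical):

```python
import re


def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the corresponding value in a sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def genotype_alleles(gt, ref, alt):
    """Translate a GT string such as '1/1' into allele bases.
    Index 0 is REF; 1, 2, ... are the comma-separated ALT alleles."""
    alleles = [ref] + alt.split(",")
    return [alleles[int(i)] if i != "." else "." for i in re.split(r"[/|]", gt)]


sample1 = parse_sample("GT:GQ:DP", "1/1:71:121")
sample2 = parse_sample("GT:GQ:DP", "1/1:69:117")
print(sample1)
print(genotype_alleles(sample1["GT"], ref="G", alt="A"))
```

So at chrM position 73 both samples are homozygous for the A allele, with per-sample depths of 121 and 117.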

↧

What is the best way to run bedtools in parallel with blocking

March 27, 2013, 9:48 pm

Say I am working on a server with a shared file system and 4 quad-core nodes (I/O is not an issue; 16 cores total). I want to run coverageBed across 20 files. Currently I have a shell script that does this sequentially. It is possible to just background the commands so they run in parallel, but I am not sure how to block in Bash (the next step requires counting between the files). Assuming I/O is not a bottleneck, what are ways of leveraging multiple nodes/cores when running bedtools (or any other sequential commands, for that matter)?

From my rudimentary understanding of parallel programming, the concept I am trying to get at is how to 'block' so that the next command after coverageBed will not be executed until all coverageBed runs are done.

I was thinking of wrapping the shell commands in a Python script, with a queue of coverageBed commands and a function that feeds commands 4 at a time (since the nodes are quad-core); the function would only return when the queue is empty. Is there a better way of doing this?
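
The Python wrapper described above can be sketched with the standard library alone. This is a minimal illustration for a single machine: a thread pool launches at most 4 subprocesses at a time, and the function returns only after every command has exited, which is the blocking behaviour wanted before the counting step. The stand-in commands are placeholders, not coverageBed calls, and the function name is hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=4):
    """Run each command (a list of arguments) as a subprocess, at most
    max_workers at a time. Returns the exit codes in input order, and
    only returns once every command has finished."""

    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# Stand-in commands so the sketch runs anywhere; in practice each entry
# would be one coverageBed invocation writing to its own output file.
commands = [[sys.executable, "-c", "print(%d)" % i] for i in range(6)]
codes = run_all(commands)
print(codes)  # the next pipeline step would start here
```

In plain Bash, the wait builtin plays the same blocking role for backgrounded jobs. Spreading work across several nodes is a different problem and would need a scheduler or similar tool, which this sketch does not cover.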

↧

Problems extracting non-SNPs from a VCF file

March 27, 2013, 9:48 pm

Hello,

In a SNP analysis, I am trying to extract those editing sites not found in the dbSNP VCF file. I have downloaded a couple of files (All SNPs and Common/Medical SNPs) from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF.

Following this, I have compared my VarScan *.vcf outputs with the SNP.vcf ones using 3 different approaches:

VarScan compare input.vcf SNP.vcf unique1 input-SNPvcf 

bedtools intersect -v -a input.vcf -b SNP.vcf > input-SNP.vcf 

bedops --not-element-of -1  input-sorted.bed SNP-sorted.bed > inputs-sorted-SNP.bed

In all 3 cases, the SNP-filtered output is identical to the input .vcf/.bed.

These command lines do work, however, when I use an alu.bed or a RepeatMasker BED file.

Is it just that my analysis contains no known SNPs? I have discarded that possibility, for obvious reasons.

Can somebody point out the reason for, or a solution to, this problem?

Thanks, G.
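
One thing that would produce exactly this symptom is the two files using different chromosome names (for example a "chr" prefix in one and not in the other), since interval tools only compare records whose chromosome names match exactly. Whether that is the case here is an assumption; the minimal sketch below just compares the first-column names of two files. The sample lines are invented for illustration.

```python
def chrom_names(lines):
    """Set of first-column values, skipping blank lines and '#' headers."""
    return {
        line.split()[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def shared_chroms(a_lines, b_lines):
    """Chromosome names that appear in both inputs."""
    return chrom_names(a_lines) & chrom_names(b_lines)


# Hypothetical records standing in for input.vcf and SNP.vcf.
input_vcf = ["#CHROM\tPOS\tID\tREF\tALT", "chrM\t73\t.\tG\tA"]
snp_vcf = ["##fileformat=VCFv4.0", "MT\t73\trs0\tG\tA"]
shared = shared_chroms(input_vcf, snp_vcf)
print(shared)
```

An empty set means no record in one file can ever intersect a record in the other, so a "-v" style filter would return the input unchanged.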

↧

getting all reads that align to a region in compact BED format using bedtools?

March 27, 2013, 9:48 pm

I'm trying to find all the reads (by name) from a BAM file that align to various regions in a bed file. Right now I can do this with bedtools using intersectBed:

intersectBed -abam reads.bam -wo -f 1 -b regions.bed -bed

From this one can parse all the read ids that land in every interval in regions.bed, but it's not very compact. Is there a way to get bedtools to natively transform this into a more compact format, e.g.

chr1 x y .... read_id1,read_id2,read_id3

where chr1 x y is a given interval in regions.bed and the comma separated read_id1,... is the list of read ids from reads.bam that fall in that interval. In this compact format, the output BED file would have at most as many entries as there are regions in regions.bed, whereas with the -wo option it can be even larger than the number of reads in reads.bam. Thanks.
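
If post-processing is acceptable, the compact form can be built by grouping the region/read pairs. The following is a minimal sketch that assumes the region coordinates and read ID have already been pulled out of each output line into ((chrom, start, end), read_id) pairs; which columns those are in the -wo output is not assumed here, and the function name is hypothetical.

```python
def collapse_reads(pairs):
    """pairs: iterable of ((chrom, start, end), read_id).
    Returns one tab-separated line per region with a comma-separated
    list of read IDs, in first-seen order."""
    grouped = {}
    for region, read_id in pairs:
        grouped.setdefault(region, []).append(read_id)
    return [
        "\t".join(str(x) for x in region) + "\t" + ",".join(ids)
        for region, ids in grouped.items()
    ]


pairs = [
    (("chr1", 10, 20), "read_id1"),
    (("chr1", 10, 20), "read_id2"),
    (("chr1", 30, 40), "read_id3"),
]
compact = collapse_reads(pairs)
print("\n".join(compact))
```

Regions with no reads never appear among the pairs, so the output has at most one line per region, as described above.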

↧

Does bedops have a command similar to the bedtools makewindows?

March 27, 2013, 9:48 pm

With bedtools you can make genomic windows from a genome file or a bed file

input.bed

chr1    1000000 1500000
chr3    500000  900000

[prompt]$ bedtools makewindows -b input.bed -w 250000

chr1    1000000 1250000
chr1    1250000 1500000
chr3    500000  750000
chr3    750000  900000

Does the bedops suite provide a similar way to create genomic windows?
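
For reference, the windowing shown above is simple enough to express directly. This minimal sketch does not say anything about what bedops provides; it only reproduces the fixed-width, non-overlapping windows from the example, with the last window of each interval truncated at the interval end. The function name is hypothetical.

```python
def make_windows(intervals, width):
    """Yield (chrom, start, end) windows of at most `width` bases that
    tile each input interval without overlapping."""
    for chrom, start, end in intervals:
        for w_start in range(start, end, width):
            yield chrom, w_start, min(w_start + width, end)


intervals = [("chr1", 1000000, 1500000), ("chr3", 500000, 900000)]
windows = list(make_windows(intervals, 250000))
for chrom, start, end in windows:
    print(chrom, start, end, sep="\t")
```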

↧

converting GFF to BED with bedtools?

March 27, 2013, 9:48 pm

I use bedtools's sortBed utility to sort BED files for various operations. It takes as input GFF files as well. However, when I feed it a GFF file as in:

sortBed -i myfile.gff

it outputs it as GFF, not BED. Is there a way to make bedtools sort and then convert the result to BED? Many bedtools utilities have a -bed flag. Do I need to use a different subutility of bedtools to achieve this? thanks.
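
For context on what the conversion involves: GFF uses 1-based inclusive coordinates and BED uses 0-based half-open ones, so the start shifts by one and the columns are reordered. The sketch below is a minimal illustration outside bedtools; using the GFF feature type as the BED name is an arbitrary choice made here, and the function names are hypothetical.

```python
def gff_to_bed(line):
    """Convert one tab-delimited GFF line to a BED6 tuple
    (chrom, start, end, name, score, strand); None for comments/blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    seqid, _source, ftype, start, end, score, strand = fields[:7]
    return (seqid, int(start) - 1, int(end), ftype, score, strand)


def sorted_bed(gff_lines):
    """Convert all records and sort by chromosome name, start, then end."""
    records = [r for r in map(gff_to_bed, gff_lines) if r is not None]
    return sorted(records, key=lambda r: (r[0], r[1], r[2]))


gff_lines = [
    "##gff-version 3",
    "chr2\tsrc\tgene\t101\t200\t.\t-\t.\tID=g2",
    "chr1\tsrc\texon\t11\t20\t.\t+\t.\tID=e1",
]
bed = sorted_bed(gff_lines)
for row in bed:
    print(*row, sep="\t")
```

Chromosome names are sorted as plain strings here, so chr10 comes before chr2.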

↧

How to extract scores from BEDGraph file using BED tools

March 27, 2013, 9:48 pm

file1

chr1 10 20 name 0 +

file2

chr1 12 14 2.5
chr1 14 15 0.5

How could I extract the average score for each interval in file1 using file2, as shown below? I am trying to compute average phastCons scores (file2) over the file1 intervals.

chr1  10 20 name 0 + 1.5
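
The expected value of 1.5 corresponds to a plain mean over the overlapping file2 records, (2.5 + 0.5) / 2, rather than a mean weighted by how many bases each record covers. A minimal sketch of that unweighted reading (the function name is hypothetical):

```python
def mean_overlap_score(interval, scored):
    """interval: (chrom, start, end, ...); scored: (chrom, start, end, score).
    Returns the unweighted mean score of records overlapping the interval,
    or None if nothing overlaps."""
    chrom, start, end = interval[:3]
    hits = [
        score
        for s_chrom, s_start, s_end, score in scored
        if s_chrom == chrom and s_start < end and start < s_end
    ]
    return sum(hits) / len(hits) if hits else None


file1 = [("chr1", 10, 20, "name", 0, "+")]
file2 = [("chr1", 12, 14, 2.5), ("chr1", 14, 15, 0.5)]
for interval in file1:
    print(*interval, mean_overlap_score(interval, file2))
```

A per-base average would weight the first record twice as heavily (2 bases versus 1) and give a different number, so it is worth deciding which definition is wanted.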

↧

Simple redirection, I/O problem with bedtools

March 27, 2013, 9:48 pm

Hi guys, just a quick question. It's more of a Bash question than a bioinformatics one, with bedtools involved.

I mostly pipe bedtools input and output. Here's a general scenario:

sed 1d fileA.bed | intersectBed -a stdin -b peaks.bed | intersectBed -u -a stdin -b fileB.bed

Now, the problem is that fileB also has a header line, which intersectBed reports as an error (makes sense: non-integer start).

How can I remove the first line (the header) of fileB on the fly in the pipe?

Thanks

↧

How to get FASTA format using fastaFromBed OR How to turn linearized FASTA to the same length columns

March 27, 2013, 9:48 pm

I extracted sequences with fastaFromBed and have no complaints about BEDTools, which is really awesome.

However, the extracted sequences look like this:

>chr19:13985513-13985622   
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825  
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG

And my input file looks like this:

>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt

I was using this:

fastaFromBed -fi input -bed seq.bed -fo output

So shouldn't those sequences be written in FASTA format (as NCBI says, "It is recommended that all lines of text be shorter than 80 characters in length"), or at least with the same line length as my input file?

What am I doing wrong that I get linearized (FASTA?) output from fastaFromBed?
What is the quickest way to turn those linear sequences into nicely formatted columns using the command line?
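
Single-line records are still valid FASTA for most tools, but re-wrapping is easy if fixed-width lines are wanted. The following is a minimal sketch that leaves header lines alone and wraps each record's sequence to a chosen width (50 to match the input file above); the function name is hypothetical.

```python
def wrap_fasta(lines, width=50):
    """Return FASTA lines with every record's sequence re-wrapped to
    `width` characters. Works for single-line and multi-line records."""
    out, seq = [], []

    def flush():
        joined = "".join(seq)
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        seq.clear()

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            flush()          # finish the previous record
            out.append(line)
        elif line:
            seq.append(line)
    flush()                  # finish the last record
    return out


wrapped = wrap_fasta([">a   ", "ACGTACGTAC", ">b", "AAA"], width=4)
print("\n".join(wrapped))
```

On the command line, the coreutils fold utility with the -w option does a similar wrap, though it treats header lines like any other line.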

↧

Error in bedtools getfasta: chromosome not found

March 27, 2013, 9:48 pm

Hi,

I am trying to use BEDTools to get some sequences from genomic coordinates, but I get a warning saying "WARNING. chromosome (chr12) was not found in the FASTA file. Skipping." for each read in my BED file. Here are some details about what I am doing.

I just downloaded what I think is the latest version of BEDTools, bedtools-2.17.0.

I have 2 different files (much longer than the small parts shown here):

A fasta file with all the sequences of chromosomes:

>chr01
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

A BED file with my genomic coordinates (already sorted):

chr01 187814 190840
chr01 307073 310104
chr01 701047 704068
chr01 702941 705962
chr01 702952 705972
chr01 867716 870740
chr01 914064 917087
chr01 991080 994104
chr01 1039795 1042815
chr01 1058713 1061736

Then I run the command:

bedtools getfasta -fi all.con -bed 1-13sorted2.bed -fo NewCandidates/Genomiccoordinates/1-13_1500.fa

The only thing I get is "WARNING. chromosome (chr01) was not found in the FASTA file. Skipping.", thousands of times...

If someone can help me and tell me what I am doing wrong, I will be very grateful.

Thank you all in advance.
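
A useful first check for this warning is whether the BED chromosome names match the FASTA header lines character for character, since extra text after the name, trailing spaces, or Windows line endings in a header are easy to miss by eye. The sketch below is a minimal illustration of that comparison; how a given bedtools version derives names from headers is not assumed, and the header used in the example is invented.

```python
def fasta_headers(lines):
    """Full header text (everything after '>') for each FASTA record,
    with only the trailing newline removed so stray characters stay visible."""
    return [line[1:].rstrip("\n") for line in lines if line.startswith(">")]


def missing_chroms(bed_lines, headers):
    """BED chromosome names that do not exactly equal any FASTA header."""
    names = {line.split()[0] for line in bed_lines if line.strip()}
    return sorted(names - set(headers))


# Hypothetical header with a description after the name.
fasta_lines = [">chr01 hypothetical description\n", "NNNNNNNNNN\n"]
bed_lines = ["chr01 187814 190840", "chr01 307073 310104"]

headers = fasta_headers(fasta_lines)
print([repr(h) for h in headers])  # repr() exposes spaces and '\r'
print(missing_chroms(bed_lines, headers))
```

If the names differ only by such extras, cleaning the headers (or the BED names) and rebuilding the index would be the thing to try.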

↧

raw counts from cufflinks output

March 27, 2013, 9:48 pm

Hi, I want to ask how to get the raw counts from the output of cufflinks. One way to do this is to use the fpkm.

raw counts = FPKM * (length of that transcript/1000) * (# of mapped reads / 1e6)

The FPKM and length of transcript are in the cufflinks FPKM Tracking Files. But how about the # of mapped reads?

For instance, say we have foo.bam. samtools view -c (-f|-F) flag foo.bam can do this job, but I am not quite sure which flag I should set for single-end versus paired-end data.

Thanks!
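
The formula above translates directly into code. This is a minimal sketch of that arithmetic only; the numbers in the example are invented, and the result is an approximate count since FPKM values are themselves estimates.

```python
def raw_count(fpkm, transcript_length, mapped_reads):
    """Approximate count implied by an FPKM value:
    FPKM * (length in kb) * (mapped reads in millions)."""
    return fpkm * (transcript_length / 1000) * (mapped_reads / 1e6)


# Hypothetical transcript: FPKM 10, 2000 bases long, 5 million mapped reads.
estimate = raw_count(fpkm=10, transcript_length=2000, mapped_reads=5_000_000)
print(estimate)
```

For the read total, the SAM flag bit 0x4 marks unmapped reads, so samtools view -c -F 4 foo.bam counts alignment records that are mapped. Whether that total should be reads or fragments for paired-end data is the open part of the question and is not settled by this sketch.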

↧

Profile Coverage of RNAseq samples?

March 27, 2013, 9:48 pm

Hi all,

I have a quick question:

How can I visualize aligned paired-end reads from RNAseq datasets in UCSC browser?

I have already mapped the reads and assembled the transcripts with TopHat/Cufflinks, but I'm not sure how to proceed to visualize the mappings.

After sorting the BAM files and fixing the mate pairs, I tried to compute the coverage using the following commands:

genomeCoverageBed -bg -split -ibam F.T0.rep2-accepted_hits-fS.bam -g ~/conversion_util/chrom.hg19.sizes > F.T0.rep2-accepted_hits-fS.bg 
bedGraphToBigWig F.T0.rep2-accepted_hits-fS.bg ~/conversion_util/chrom.hg19.sizes F.T0.rep2-accepted_hits-fS.bw

But I was not able to visualize the mappings properly. Here is a screenshot of how it looks (image not included):

Do you know where the mistake is?

Thanks!

↧