
Bedtools "Segmentation Fault" While Working With Genome.Fa

I wanted to use BEDTools to extract genomic sequences (fastaFromBed). My BED file covers all 24 chromosomes, so I want to use the whole genome (merged from chromosome.fa). I tried:

fastaFromBed -fi genome.fa -bed all.chromosomes.bed -fo output

but got:

Segmentation fault (core dumped)

I then tried every chromosome.fa separately, and it worked:

fastaFromBed -fi chromosome${i}.fa -bed all.chromosomes.bed -fo output

Of course I get the annoying message:

WARNING. chromosome (chr..) was not found in the FASTA file. Skipping.

But it's still better than nothing and really fast. I prefer to use BEDTools for sequence extraction, so I am wondering whether it is possible to solve this segmentation fault. It seems that the large genome.fa file can't be handled by BEDTools, as I also tried nucBed and got the same thing, or it might be some genome merging problem.

EDITED: This is the bed file I used for intersectBed, closestBed and fastaFromBed ([www.box.com][1]). There were problems only with fastaFromBed and only when I tried to use the whole genome.fa (~3.15GB). As I mentioned before, I used every chromosome separately and got warnings, but there was no segmentation fault and the output was fine. I am wondering whether it might be a genome.fa problem (used cat to me ...

Tool: Bedtools: Analyzing Genomic Features

All practicing bioinformaticians will face problems that require them to compare, query and select genomic features across an entire genome. As it happens, efficient interval representation and querying is a surprisingly challenging problem that needs a specialized representation. The BEDTools suite contains a set of programs that support a broad range of interval analyses that involve selecting certain locations in the genome. The name reflects the original intent to process BED files, but the tools operate just as well on GFF formats. The tools are run from the command line and are available for UNIX-type systems: Linux, Mac OSX, and Cygwin (on Windows).

The link to the site is: http://code.google.com/p/bedtools/

With BEDTools one can answer questions such as:
  • how many reads map upstream/downstream of one or more locations in the genome?
  • how many reads cover a certain base in the genome?
  • which sections of the genome are not overlapping with target intervals?
  • what are the sequences specified by the coordinates?
  • ...
The suite consists of multiple tools but for beginners the most important is ...

How To Combine Fpkm Values From Cufflinks With Contigs From De Novo Assembly Program Velvet/Oases?

Hi all,

I am working on RNA-seq data analysis. I've finished running Tophat and Cufflinks to get FPKM values for each read from Illumina paired-end sequencing. In parallel, I've run Velvet to get contig sequences through de novo assembly, and Gmap to see whether the assembled sequences map to the reference genome (this reference genome is not complete for now, but somewhat useful). Now I am trying to combine all of this information so that I have the sequence of each contig together with the FPKM value corresponding to that contig. Some suggested that I convert the Cufflinks and Gmap outputs to BED files and then use IntersectBed to see if there's any overlap. However, I am not sure how to keep all of the information in the output from Bedtools. By default, IntersectBed seems to report the overlapped region using the 'A' file as a template, so I couldn't see any information from the 'B' file. Is there any solution for this? Please let me know. I would appreciate your suggestions!

Why does BedTools Map operation produce all dots as output?

I am using the BedTools Map operation to map the DNAse I signal of a cell type onto some chromosome regions, by computing the mean of the signal column (-c 4). The command I use is the following:

$ bedtools map -a inputFile1.bed -b inputFile2.bedgraph -c 4 -o mean 1> outputFile

In the output file, I have real values for chrom1 -> chrom9, but strangely I find all dots for the other chromosome regions:

chr1 66660 66810 0.849999999999999977796
chr1 87640 87790 0.0500000000000000027756
chr1 96520 96670 0
chr1 115600 115750 115.527272727272702468
chr1 118840 118990 3.10000000000000008882
chr1 125340 125490 0
chr1 136280 136430 .
chr1 136960 137110 .
chr1 235600 235750 39.0559633027522963289
chr1 237020 237170 1.59999999999999986677
....
chr10 134874600 134874750 .
chr10 134876820 134876970 .
chr10 134877940 134878090 .
chr10 134878160 134878310 .
chr10 134879420 134879570 .
chr10 134897500 134897650 .
chr10 134907140 134907290 .
chr10 134915640 134915790 .
chr10 134939120 134939270 .
chr10 134939280 134939430 .
chr10 134940860 134941010 .
....
...

Getting Number Of Reads In Intervals With Bedtools

What is the correct way to get the total number of reads strictly contained in each interval in a GFF from a BAM file while enforcing strandedness? What I am looking for is very close to this intersectBed feature:

-c    For each entry in A, report the number of overlaps with B.
    - Reports 0 for A entries that have no overlap with B.
    - Overlaps restricted by -f and -r.

Except that I'd like the number of overlaps in A for each entry in B (i.e. the other way around). If I do:

intersectBed -abam mybam.bam -b mygff.gff -s -f 1 -wb

Then my understanding is that this will report the entry in B for each overlap with A. But I'd like each entry in B to be outputted exactly once, with the number of reads from A that are contained strictly within it. I'm not sure how to enforce strict containment here.

Is coverageBed the solution to this? Or multicov? I'm not sure how to enforce strict containment using coverageBed - it's not clear to me if that's the default from the docs. Thanks.
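
To make the requirement concrete, here is a minimal Python sketch of what "strictly contained, same strand" counting means. It is not how bedtools does it internally: the tuple layout, the function name count_contained and the toy coordinates are all assumptions made for illustration, and the nested loop is only suitable for small inputs.

```python
def count_contained(features, reads):
    """Count, for each feature, the reads lying entirely inside it on the same strand.

    Both inputs are lists of (chrom, start, end, strand) tuples using
    0-based, half-open coordinates (a hypothetical layout for this sketch).
    """
    counts = []
    for chrom, start, end, strand in features:
        n = 0
        for r_chrom, r_start, r_end, r_strand in reads:
            same_place = r_chrom == chrom and r_strand == strand
            # Strict containment: the read may touch but not cross either boundary.
            if same_place and r_start >= start and r_end <= end:
                n += 1
        counts.append(n)
    return counts


if __name__ == "__main__":
    # Made-up coordinates, only to exercise the function.
    features = [("chr1", 100, 200, "+"), ("chr1", 300, 400, "-")]
    reads = [
        ("chr1", 110, 150, "+"),  # inside the first feature
        ("chr1", 190, 210, "+"),  # crosses the end of the first feature
        ("chr1", 120, 160, "-"),  # wrong strand for the first feature
        ("chr1", 300, 400, "-"),  # exactly fills the second feature
    ]
    print(count_contained(features, reads))
```

Each feature is reported exactly once, including features with a count of zero, which is the output shape asked for above.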

Creating Bed File For Lncrna Using Gencode Gtf File

Hi all,

I want to get a BED file of lncRNAs based on the GENCODE GTF file.

I downloaded the file "gencode.v16.long_noncoding_RNAs.gtf.gz" and extracted the chr, start and end columns from it, then used mergeBed to merge the overlapping lncRNAs. Is this correct? I know that exon genomic positions can be merged with this kind of method.

For lncRNAs I am not so sure. Is there any place that already offers this kind of BED file?

Also, we should get 22444 long non-coding RNA loci transcripts, but I get only 11817 genomic regions after the merging process.

If anyone knows the answer, could you give me some help?
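
One likely reason for the lower count is that merging collapses every group of overlapping intervals (for example several transcripts of one locus) into a single region. The Python sketch below illustrates that effect under stated assumptions: the function name merge_intervals and the toy coordinates are made up, and the rule of also joining book-ended intervals is chosen to mirror the default behaviour of bedtools merge.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals."""
    merged = []
    # Sorting by chromosome, then start, lets us merge in a single pass.
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlaps (or touches) the previous region: extend it.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]


if __name__ == "__main__":
    # Made-up transcripts: three overlapping isoforms plus one separate locus.
    transcripts = [
        ("chr1", 100, 500),
        ("chr1", 150, 450),
        ("chr1", 400, 900),
        ("chr1", 2000, 2500),
    ]
    regions = merge_intervals(transcripts)
    print(len(transcripts), "transcripts ->", len(regions), "merged regions")
    print(regions)
```

Here four input intervals become two regions, so a merged count well below the transcript count is expected whenever loci have several overlapping transcripts.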

What Is The Best Way To Run Bedtools In Parallel With Blocking

Say I am working on a server with a shared file system and 4 quad core nodes (I/O is not an issue, 16 cores total). I want to run coverageBed across 20 files. Currently I have a shell script that would do this sequentially. It is possible to just background the command so they run in parallel but I am not sure how to block in BASH. (next step requires counting between the files) Assuming I/O is not a bottleneck, what are ways of leveraging the advantage of multiple nodes/cores when running bedtools (or any other sequential commands for that matter).

From my rudimentary understanding of parallel programming, the concept I am trying to get at is how to 'block' so that the next command after coverageBed will not be executed until all coverageBed runs are done.

I was thinking of wrapping the shell commands in a Python script, with a queue of coverageBed commands and a function that feeds commands 4 at a time (since the nodes are quad core); the function would only return when the queue is empty. Is there a better way of doing this?
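
The queue idea can be sketched with Python's standard concurrent.futures module, which already provides the "feed N at a time and block until done" behaviour. This is a minimal sketch, not a tuned solution: the function name run_all is made up, it only uses the cores of one node, and the placeholder jobs stand in for the real coverageBed argument lists. In plain bash, backgrounding each command with & and then calling the wait builtin gives a similar barrier.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=4):
    """Run argument lists in parallel and block until every one has finished.

    Returns the exit codes in the same order as `commands`.
    """
    def run(cmd):
        # subprocess.run blocks this worker thread until the command exits.
        return subprocess.run(cmd).returncode

    # At most `max_workers` commands run at once; leaving the `with`
    # block waits for all submitted jobs to complete.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Placeholder jobs so the sketch runs anywhere; replace each entry
    # with the real coverageBed argument list for one of the 20 files.
    jobs = [[sys.executable, "-c", "pass"] for _ in range(20)]
    codes = run_all(jobs, max_workers=4)
    # This line is only reached after all 20 jobs are done.
    print("all succeeded:", all(code == 0 for code in codes))
```

Threads are enough here because each worker just waits on an external process; the next pipeline step can safely start after run_all returns.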

Help With Exception When Using Bedtools Coveragebed With Paired Alignment. [Resolved]

I use bwa mem to align paired reads to a few hundred microbial contigs, then I sort the alignment and try to get the coverage using bedtools genomecov -ibam alignments.paired.sorted.bam -bg >ranges.txt, which fails with an exception:

*** glibc detected *** bedtools: double free or corruption (out): 0x0000000001c5f270 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3d7b2750c6]
bedtools[0x45ab43]
bedtools[0x45b146]
bedtools[0x45c163]
bedtools[0x45e2ed]
bedtools[0x434c4b]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3d7b21ecdd]

If I run the same command on an unpaired alignment, everything is OK. So I am really not sure where my mistake is... maybe bedtools can't handle the paired alignment?

-- edit: works with the latest versions of these tools. Here are the ones that failed:

$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.0-r313
Contact: Heng Li <lh3@sanger.ac.uk>

$ bedtools -version
bedtools v2.16.1

General Considerations For Genomic Overlaps?

Hello, I was wondering about general considerations for performing overlaps of genomic regions and doing Monte Carlo-type statistics. Below I describe how I do it. Unfortunately I'm not fully confident that this is correct, so I'll appreciate any thoughts on this.

I have an experimental dataset (A) of 10 bp coordinates, with approx. 5,000 entries all across the genome. Then I have another experimental dataset (B) (ChIP-seq) of ~1,000 bp coordinates, with ~50,000 entries all across the genome. If I perform the overlap/intersection with BEDTools I get my overlap, e.g. 2,000 entries from A.

But then I also want to find overlaps in the vicinity of the ChIP-seq peaks, so I extend the size of these peaks, e.g. by 1,000 bp on each side. There are still 50,000 entries, but the amount of the genome that is searched becomes larger, and some entries may now also overlap each other. So I do the intersection of A and B again, and count entries in A only once. This gives me e.g. 3,000 entries from A.

For the simulations, I use random intervals that look like dataset B. E.g. I pick 50,000 coordinates of 1,000 bp at random, intersect them with A, and do this 1,000 times. Then I get e.g. an average of 500 entries from A. For overlaps in the vicinity, I calculate the total size of dataset B and generate random intervals of the same length and total size in bp as dataset B (size-matched sampling). I hope you can f ...

Getting Unmapped Reads: Comparing Fastq To Bam

Given a FASTQ file and a BAM file of aligned reads, is there an efficient way to get all FASTQ reads that are in the original FASTQ but not in the BAM? Perhaps using bedtools. For example:

unmapped_script original.fastq aligned.bam > unmapped.fastq

should create an unmapped.fastq file, which is a subset of original.fastq containing only those entries that do not appear in aligned.bam

thank you.
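
One way to think about the problem is as a set difference on read names. The sketch below is Python and makes several assumptions that are not from the question: the names of the aligned reads have already been extracted from the BAM into a set by some other tool, every FASTQ record spans exactly four lines, and names match exactly (no /1 or /2 suffix handling). The function name filter_unmapped is hypothetical. Note also that a BAM file can itself contain unmapped reads, so "in the BAM" and "aligned" are not always the same thing.

```python
import io


def filter_unmapped(fastq_lines, mapped_names):
    """Yield 4-line FASTQ records whose read name is not in `mapped_names`."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # The header looks like "@name optional description".
            name = record[0][1:].split()[0]
            if name not in mapped_names:
                yield record
            record = []


if __name__ == "__main__":
    # Toy input standing in for original.fastq and the aligned read names.
    fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2 extra\nTTTT\n+\nIIII\n")
    mapped = {"r1"}
    for rec in filter_unmapped(fastq, mapped):
        print("\n".join(rec))
```

Because the FASTQ is streamed record by record, only the set of aligned names has to fit in memory.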

bedtools 2.0 merge - "unable to open file or unable to determine types"

I have a sorted bedfile comprised of three columns: seqid, start, and end.

 sort -k1,1 -k2,2n tmp2.bed > tmp3.bed
1    6589256    6589207
1    11627195    12127194
1    12616616    12116617
1    18283067    18273068
1    21826932    21926931
1    28787213    28788212
1    31195434    31195483
1    39374350    39364351
1    42307024    42357023
1    47379997    47374998

 

~/tools/bedtools2/bin/bedtools merge -i tmp3.bed
Error: unable to open file or unable to determine types for file tmp3.bed

 

Probably something silly, but I'm not seeing it.

 

EDIT:

The start coordinate (second column) must come before the end coordinate (third column). It would be nice if bedtools threw an informative error for this kind of user mistake.
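
A quick pre-check for this mistake can be sketched in Python. The function name find_inverted_intervals is made up, and the sketch assumes whitespace-separated columns with integer coordinates in columns two and three; it only looks for start greater than end and does not validate anything else about the file.

```python
def find_inverted_intervals(lines):
    """Return (line_number, line) pairs where start > end in BED-like input."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        # Skip blank lines, comments and track/browser header lines.
        if len(fields) < 3 or fields[0] in ("track", "browser") or line.startswith("#"):
            continue
        start, end = int(fields[1]), int(fields[2])
        if start > end:
            bad.append((number, line.rstrip("\n")))
    return bad


if __name__ == "__main__":
    # The first three rows of the file shown above.
    rows = [
        "1    6589256    6589207",
        "1    11627195    12127194",
        "1    12616616    12116617",
    ]
    for number, line in find_inverted_intervals(rows):
        print(f"line {number}: start is after end: {line}")
```

Run on the rows above, it flags the first and third lines, which is enough to explain why the merge failed.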

 

Profile Coverage Of Rnaseq Samples?

Hi all,

I have a quick question:

How can I visualize aligned paired-end reads from RNAseq datasets in UCSC browser?

I already mapped the reads and assembled the transcripts with Tophat/Cufflinks but I'm not sure how to proceed to visualize the mappings

After sorting the BAM files and fixing the mate pairs, I tried to compute the coverage using the following commands:

genomeCoverageBed -bg -split -ibam F.T0.rep2-accepted_hits-fS.bam -g ~/conversion_util/chrom.hg19.sizes > F.T0.rep2-accepted_hits-fS.bg
bedGraphToBigWig F.T0.rep2-accepted_hits-fS.bg ~/conversion_util/chrom.hg19.sizes F.T0.rep2-accepted_hits-fS.bw

But I was not able to visualize properly the mappings. Here I paste a screenshot of how it looks like:

[Image: screenshot of the resulting track]

Do you know where is the mistake?

Thanks!

How To Find The Closest Distance From Bed Files Between Genes And Repeats That Are Upstream

How can I use closestBed from bedtools to find the closest locations between two BED files? The important bit here is that I want them to be upstream and in the correct orientation.

When I use the -s option, it does not report anything (everything is -1).

Then I checked the -D a option. It is returning some results but not sure if it is the right thing.

The other thing to mention is that my genes BED file (let's call it gene.bed) is organized as

chr1 123 234 +
chr1 456 789 -

rather than end position being smaller to indicate the negative strand.

Whereas my repeats.bed file are organized as

chr1 239 456
chr3 456 987

Does bedtools get confused with this?

Which options should I use if I want to find the distance to the nearest repeat that is upstream and in the correct orientation?

How To Get All Entries Of B With Bedtools

Hi All,

Is there any way to get all the original B entries (including the ones that have no overlap with A) from the intersectBed utility? I am trying to overlap two BED files, and I need all the entries of file B, which I pass with the -b option. I cannot switch the files, because file A (the -a option) is very, very big.

Thanks

How To Extract Scores From Bedgraph File Using Bed Tools

file1

chr1 10 20 name 0 +

file2

chr1 12 14 2.5
chr1 14 15 0.5

How could I extract the average scores for file1 using file2, as shown below? I am trying to extract the average phastCons scores (file2) for the intervals in file1.

chr1  10 20 name 0 + 1.5
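
The expected value of 1.5 corresponds to the plain mean of the scores of all file2 records that overlap the file1 interval, (2.5 + 0.5) / 2, which is also the kind of summary that the bedtools map -c 4 -o mean command quoted in another question on this page computes. The Python sketch below reproduces that arithmetic; the function name and tuple layout are assumptions, and it is an unweighted mean, so a per-base average over the interval would give a different number.

```python
def mean_overlapping_score(interval, bedgraph):
    """Mean score of bedGraph records overlapping a (chrom, start, end) interval.

    Coordinates are treated as 0-based, half-open. Returns None when
    nothing overlaps.
    """
    chrom, start, end = interval
    scores = [
        score
        for b_chrom, b_start, b_end, score in bedgraph
        # Half-open intervals overlap when each starts before the other ends.
        if b_chrom == chrom and b_start < end and b_end > start
    ]
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    file1_interval = ("chr1", 10, 20)
    file2 = [("chr1", 12, 14, 2.5), ("chr1", 14, 15, 0.5)]
    print(mean_overlapping_score(file1_interval, file2))
```

With the two example files this prints 1.5, matching the desired output line.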

Counting The Whole Insert Size From Paired-End Reads As Coverage

We have updated our workflows for per-base sequence coverage to use genomeCoverageBed on BAM files. However, for paired-end data it seems as though the regions between the paired-end reads are not counted.

To be clear, I am not talking about using -split to avoid counting introns within a single read of a pair. Instead, I am looking to count the probable whole insert when the insert size is greater than the combined read length of the paired reads.

We've looked at using IRanges from Bioconductor as well, but cannot tell if this would do what we want.

Is there a hidden flag in genomeCoverageBed to count the whole insert as coverage, not just the sequenced ends? Is there another program out there that would work on BAM files?

I know I can alter the SAM file before BAM conversion but this seems like something that should be coded somewhere already.

Picking Random Genomic Positions

I do have a set of TF binding coordinates and want to see if there is any significant overlap with an open chromatin annotation.

Example of TF coord:
chr1 19280 19298
chr1 245920 245938
chr2 97290 97308
chr9 752910 752938
...

Example of open chrom. coord. (UCSC track):
chr2 33031543 33032779
chr3 2304169 2304825
chr5 330899 330940
...

I have checked the intersection with the Bedtools (open chrom. coord vs TF coord. -/+ 100bp) and now I want to check the intersection between random genomic coordinates and open chrom.

The idea is to:

  1. Pick random genomic position (from the same chromosome as TF coordinate);
  2. -/+9bp (binding site size);
  3. -/+ 100bp;
  4. Run this simulation for 1000 times (TF x 1000);
  5. Bedtools;

Any ideas on how I can do this simulation to pick random genomic positions from the same chromosome? I know a little bit of bash and Perl, but won't be able to write the script by myself.
Is it possible to measure the length of every chromosome,
pick the TF chromosome and, from its length, get a random number which would represent a genomic position?

Can someone help me with the simulation and the pipeline.
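
The sampling step described above can be sketched in Python as follows. Everything named here is an assumption for illustration: the function random_sites, the flank parameter and, in particular, the chromosome sizes, which are made-up toy numbers and must be replaced with the real lengths for the genome build in use. Each random site keeps the chromosome and width of the TF site it replaces and is padded by the flank on both sides.

```python
import random


def random_sites(tf_sites, chrom_sizes, flank=100, rng=None):
    """Draw one random, flank-padded site per TF site on the same chromosome."""
    if rng is None:
        rng = random.Random()
    sites = []
    for chrom, start, end in tf_sites:
        width = end - start
        # Latest start that keeps the site plus both flanks on the chromosome.
        max_start = chrom_sizes[chrom] - width - flank
        new_start = rng.randint(flank, max_start)
        sites.append((chrom, new_start - flank, new_start + width + flank))
    return sites


if __name__ == "__main__":
    tf_sites = [("chr1", 19280, 19298), ("chr9", 752910, 752938)]
    # Toy sizes, NOT real chromosome lengths.
    chrom_sizes = {"chr1": 1_000_000, "chr9": 2_000_000}
    rng = random.Random(42)  # fixed seed so a run can be repeated
    # 1000 simulated data sets, each the same shape as the TF list.
    simulations = [random_sites(tf_sites, chrom_sizes, rng=rng) for _ in range(1000)]
    for chrom, start, end in simulations[0]:
        print(chrom, start, end, sep="\t")
```

Each simulated set can then be written out as a BED file and intersected with the open chromatin track in the same way as the real TF coordinates.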

Bedtools Compare Multiple Bed Files?

I've been comparing two BED files using the intersectBed -a -b command. I'm wondering whether there are any commands in Bedtools that can help compare multiple BED files.

Say I have 3 BED files (A, B, C). I want to identify the regions where any two of the three (AB, BC, AC) overlap reciprocally by 50%.

thx

edit: I just found this post again. Maybe I didn't express myself well a couple of months ago. I mean to find the overlaps that span at least 50% of EACH of the multiple BED files. So I don't quite understand: does cat AB BC AC > ABC.common mean finding the part that overlaps in all three?

I myself try to solve the problem like below:

intersectBed -a 2 -b 3 > 23
intersectBed -a 1 -b 3 > 13
intersectBed -a 1 -b 2 > 12

intersectBed -a 1 -b 23 -f 0.50|sort > 23_1
intersectBed -a 2 -b 13 -f 0.50|sort > 13_2
intersectBed -a 3 -b 12 -f 0.50|sort > 12_3

comm -1 -2 23_1 13_2 > test
comm -1 -2 test 1_3 > final result

I don't know if I'm on the right track. thx

Bedtools: Top N Most Similar Regions When Comparing Two Bed/Wig/Bam Files?

Is there an easy way of finding, probably with bedtools, given a window size, the top N most correlated regions when comparing two bed/wig files? For example, in comparing two bed/wig/bam files that have PolII data for 2 conditions, to give the top N windows where the wiggle profiles are most similar?

Problem With Counting Mapped Reads

Hi, this is my very first experience analysing RNAseq data. My goal is to do differential analysis between two strains of a bacteria. So far, I managed to align and produce SAM and BAM files. I'm having problems annotating and counting my reads. Here are the commands that I used. My reads are from SOLID and hence in colourspace.

$ nohup solid2fastq.pl 291_01_01 291_01_01-bwa #Convert .csfasta and .qual to .fastq
$ nohup bwa index -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta
$ nohup bwa aln -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta 291_01_01-bwa.singleF3.fastq 291_01_01-bwa.sai
$ perl -ne 'if($_ !~ m/^\S+?\t4\t/){print $_}' 291_01_01-bwa.sam > 291_01_01-bwa.sam.filtered #Convert to SAM file
$ samtools sort 291_01_01-bwa.bam 291_01_01-bwa.bam.sorted
$ samtools index 291_01_01-bwa.bam.sorted.bam

To produce the .rpkm file:

$ java -jar ~/bin/bam2rpkm-0.06/bam2rpkm-0.06.jar -i 291_01_01-bwa.bam.sorted.bam -f Tbrucei427_TriTrypDB-4.0.gff > 291_01_01-bwa.RPKM2.out # I get an error here
ERROR: Problem encountered whilst reading gtf file. Could not interpret line 'GeneDB|Tb427_01_v4 EuPathDB supercontig 1

So I tried a different method to count:

$ htseq-count -i ID 291_01_01-bwa.sam Tbrucei427_TriTrypDB-4.0.gff > 291_01_01-bwa.sam_htseq-count #still error
Error occured when processing GFF file (line 37060 of file Tbrucei427_Tr ...