Extracting Genomic Coverage Information Across Different Samples

Hello, I have three BAM files that I want to compare against each other. For example, I have a reference file with 10,000 sequences and paired-end reads sequenced for three different samples:

1) Sample 1 is 100% identical to the reference, so we expect reads to map to all of it.
2) Sample 2 is 80% similar to the reference, so 20% of the reference sequences won't have any reads.
3) Sample 3 is 60% similar to the reference, so 40% of the reference sequences won't have any reads.

My goal is to identify which reference sequences do not have any reads mapped in Samples 2 and 3, i.e. the 20% of reference sequences missing from Sample 2 and the 40% missing from Sample 3. Also, in some cases, for a reference that is approximately 10 kb long, Sample 1 maps to the entire 10 kb, Sample 2 maps to the first 5 kb, and Sample 3 maps to the last 3 kb, so I need to identify the partial regions for those reference sequences as well. I have the mapped, sorted BAM files for all three samples. I am looking into using bedtools but am not sure which bedtools command will give the answer I need. I have the following commands, which might do something similar, but they output differences at every base:

genomeCoverageBed -bg -ibam sample1.bam > sample1.bedgraph
genomeCoverageBed -bg -ibam sample2.bam > sample2.bedgraph
unionBedGraphs -header -i sample1.bedgraph sample2. ...
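
One possible way to get a per-reference answer rather than a per-base one is to summarise bedGraph output that includes zero-depth runs (the `-bga` option of genomecov). The sketch below assumes four-column input (reference, start, end, depth) and assumes that references without any reads appear in that output as a single zero-depth run; the function name `coverage_summary` is made up for illustration.

```shell
# Summarise per-reference coverage from `bedtools genomecov -bga` output.
# Input columns:  reference, start, end, depth (zero-depth runs included).
# Output columns: reference, length, covered bases, covered fraction.
coverage_summary() {
  awk '{
         len = $3 - $2
         total[$1] += len
         if ($4 > 0) covered[$1] += len
       }
       END {
         for (ref in total)
           printf "%s\t%d\t%d\t%.3f\n", ref, total[ref], covered[ref], covered[ref] / total[ref]
       }' | sort -k1,1
}

# Hypothetical usage (file name taken from the question):
#   bedtools genomecov -ibam sample2.bam -bga | coverage_summary > sample2.summary.txt
```

References with a covered fraction of 0 would be the ones without reads in that sample. For partially covered references, the rows of the same bedGraph with depth above zero give the covered intervals.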

Which Of The Genes Are Enriched With Repeat Elements

I would like to know which of my genes are enriched with repeat elements such as LINE/SINE/ERV. I have a BAM file and the repeats in BED format. As far as I know, a BAM file contains the alignment for each short read from the FASTQ file. I am trying to figure out the best approach to find which genes (+/- 1000 bp) have more repeat elements. I am considering two approaches but am not sure which one is better.

a) Should I convert the BAM file into a BED file and then use bedtools merge, so that I can overlap it with the repeats file using bedtools window with the -c, -l and -r options? That way I know how many of the repeats overlap or lie near the short reads, and I can then count this number for each gene. For example:

chr start end gene number_of_repeats
chr1 100 200 gene1 70
chr1 190 240 gene1 40
chr1 250 400 gene1 100
chr2 500 600 gene2 150

If I sort and merge them I will get:

chr1 100 240 gene1 90
chr1 250 400 gene1 100
chr2 500 600 gene2 150

So gene1 will have 190 (90 + 100) repeats and gene2 will have 150.

Or b) should I count the number of repeats for each short sequence without any merging? ...

Random shuffling of features leaving gene models intact

I am looking for a tool that can randomly shuffle gff features into intergenic regions, but leaving the gene-models 'intact', so that at least all features of a gene are placed on the same contig and related features are placed inside the interval of their parent region. Bedtools shuffle doesn't seem to do that, I am trying:

shuffleBed -i genes.gff3 -excl genes.gff3 -g chromsizes.txt -f 0

This command distributes sub-features to different contigs and leads to invalid gene models. If I add -chrom, features are placed on the same contig, but some features cannot be placed at all and the resulting gene models are still not valid. Does anyone have some R code for this use case?

GTF2/GFF3 "feature" types and expression analysis

Hi, I aligned a few samples using STAR to the genome provided in the Illumina iGenomes UCSC hg19 bundle (here), using the provided gene feature (gtf2) file as is. My aim is to calculate the gene and isoform expression levels at the same time using bedtools multicov. Using the gtf2 file produces a file containing read counts per exon. I wish to compute gene and isoform read counts too, so I converted the gtf2 file to a gff3 file using the gtf2gff3 script from SO/GAL (here).

My first question is: is it OK if the alignment is performed with the gtf2 file but the reads are counted using the gff3 file, keeping in mind that the gff3 file was converted from the gtf2 file?

My second question follows. I have read both of these resources (here and here) but do not understand the differences between:
  • exon vs CDS
  • transcript vs mRNA
I know that with the process I described, it is possible to retrieve gene read count by selecting only the lines where feature=gene from the bedtools multicov output.  What must I do for isoforms?  I am confused by the semantics. Thanks ahead of time and let me know if my post was not clear enough. ...
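
As a rough illustration of one way to get per-isoform numbers from the per-exon output, the sketch below sums exon counts by `transcript_id`. It assumes tab-separated GTF lines with the read count appended as the last column, and the function name `sum_by_transcript` is hypothetical. Note the limits of such a sum: a read spanning two exons of the same transcript would be counted twice, and a read in an exon shared by several isoforms would be counted for each of them.

```shell
# Sum per-exon read counts by transcript_id.
# Input:  GTF lines (tab-separated) with a read count as the last column.
# Output: transcript_id, summed count.
sum_by_transcript() {
  awk 'BEGIN { FS = OFS = "\t" }
       $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
         # Skip the 15 characters of `transcript_id "` and drop the closing quote.
         id = substr($9, RSTART + 15, RLENGTH - 16)
         total[id] += $NF
       }
       END { for (id in total) print id, total[id] }' | sort -k1,1
}

# Hypothetical usage (file names are placeholders):
#   bedtools multicov -bams sample.bam -bed genes.gtf | sum_by_transcript
```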

Bedtools on Cygwin problem.

Hi,

I'm trying to install the latest release of Bedtools via Cygwin, but there's a weird error during the process. I know this isn't the best solution, but I do not have any other choice. Perhaps someone knows how to fix this?

NijbroekK@UTWKS11498 /cygdrive/g/Stage_Enschede/methods/methods_Bedtoolsnew
$ make clean
 * Cleaning-up BamTools API
 * Cleaning up.

NijbroekK@UTWKS11498 /cygdrive/g/Stage_Enschede/methods/methods_Bedtoolsnew
$ make
Building BEDTools:
=========================================================
DETECTED_VERSION = v2.20.1
CURRENT_VERSION  = v2.20.1
 * Creating BamTools API
- Building in src/utils/bedFile
  * compiling bedFile.cpp
bedFile.cpp:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
 /*****************************************************************************
 ^
- Building in src/utils/BinTree
  * compiling BinTree.cpp
BinTree.cpp:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
 #include "BinTree.h"
 ^
In file included from ../../utils//FileRecordTools/FileReaders/BufferedStreamMgr.h:16:0,
                 from ../../utils//FileRecordTools/FileRecordMgr.h:19,
                 from ../../utils//FileRecordTools/FileRecordMergeMgr.h:11,
                 from ../../utils//Contexts/ContextBase.h:23,
                 from ../../utils//Contexts/ContextIntersect.h:11,
                 from BinTree.h:20,
                 from BinTree.cpp:1:
../.. ...

Getting The Average Coverage From The Coverage Counts At Each Depth.

Hi, I have read quite a few posts here about coverage already, but I still have a few questions. I have a BAM file and I'm trying to find its coverage (typically something like 30X), so I decided to use genomeCoverageBed for my analysis. I used the following command:

genomeCoverageBed -ibam file.bam -g ~/refs/human_g1k_v37.fasta > coverage.txt

As many are aware, the output looks something like this:

genome 0 26849578 100286070 0.26773
genome 1 30938928 100286070 0.308507
genome 2 21764479 100286070 0.217024
genome 3 11775917 100286070 0.117423
genome 4 5346208 100286070 0.0533096
genome 5 2135366 100286070 0.0212927
genome 6 785983 100286070 0.00783741
genome 7 281282 100286070 0.0028048
genome 8 106971 100286070 0.00106666
genome 9 47419 100286070 0.000472837
genome 10 27403 100286070 0.000273248

To find the coverage, I multiplied col2 (depth) by col3 (number of bases in the genome with that depth) and then summed the products. Then I divided the sum by the genome length to get the coverage. In this case, col2 * col3 is:

0
30938928
43528958
35327751
21384832
10676830
4715898
1968974
855768
426771
274030

And the sum is 150098740. Since the genome length is 1002860 ...
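
The calculation described above can be written as a small awk filter over the `genome` rows of the histogram. This is only a sketch under the assumption that the columns are label, depth, number of bases at that depth, genome size, and fraction; the function name `mean_depth` is made up.

```shell
# Mean depth from the genome-wide histogram of genomeCoverageBed:
# sum(depth * bases_at_depth) / genome_size, using only the "genome" rows.
mean_depth() {
  awk '$1 == "genome" { bases += $2 * $3; size = $4 }
       END { if (size > 0) printf "%.4f\n", bases / size }'
}

# Hypothetical usage:
#   mean_depth < coverage.txt
```

For the eleven rows quoted in the post this gives 150098740 / 100286070, about 1.4967. The listed base counts add up to slightly less than the genome size, so rows for depths above 10 presumably exist in the full file and would raise the value a little.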

bedtools: extracting no coverage regions

Hello,

I am not sure if this has been answered before as I looked and couldn't find a simple answer.

I have a BAM file, and all I want is to annotate all regions with 0 coverage in BED format. Is that possible?

Thank you,

Adrian
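
One approach that might work here: genomecov with `-bga` reports bedGraph runs including those with zero depth, so filtering for a depth of 0 leaves the uncovered regions. The sketch below only shows the filtering step; the function name and file names are placeholders.

```shell
# Keep only zero-depth runs from `bedtools genomecov -bga` output and
# print them as three-column BED (chrom, start, end).
zero_coverage_regions() {
  awk 'BEGIN { OFS = "\t" } $4 == 0 { print $1, $2, $3 }'
}

# Hypothetical usage:
#   bedtools genomecov -ibam file.bam -bga | zero_coverage_regions > zero_coverage.bed
```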

 

Counting Features In A Bed File

I have a file in the following BED format

Chr1 1022071 1022105  +
Chr1 1022071 1022105  +
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -
Chr1 1022072 1022106  -

I am trying to get the counts of each feature represented in this file.

mergeBed -i R5_chr.bed -n -s -d 0 > Output/R5_chr_counts.bed

I am interested in the counts of the features and I do not want to merge features by any number of base pairs. Then the output should be as follows

Chr1 1022071 1022105 2 +
Chr1 1022072 1022106 4 -

Any suggestions on how to achieve this using bedtools or in bash or awk? Thanks in advance!
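
A minimal awk sketch for this, assuming whitespace-separated lines with chromosome, start and end in the first three columns and the strand in the last one. Identical records are counted without merging anything, and the strand is carried over from the input. The function name `count_features` is hypothetical.

```shell
# Count identical (chrom, start, end, strand) records.
# Output columns: chrom, start, end, count, strand.
count_features() {
  awk 'BEGIN { OFS = "\t" }
       { n[$1 OFS $2 OFS $3 OFS $NF]++ }
       END {
         for (k in n) {
           split(k, f, OFS)
           print f[1], f[2], f[3], n[k], f[4]
         }
       }' | sort -k1,1 -k2,2n -k5,5
}

# Hypothetical usage:
#   count_features < R5_chr.bed > Output/R5_chr_counts.bed
```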


How To Count Genes In Genomic Regions Using A Gtf/Gff3 And A Bed File Of Regions

I'd like to count the number of unique genes in a gff file falling within a list of genomic regions. With bedtools I can count the number of regions within the gff which is almost what I want, but not quite.

bedtools intersect -a regions.bed -b my.gff -c

UPDATE:

I should have made my question a bit more specific. I have a modified Ensembl-style gtf file (not a gff) that has unique transcript IDs. This means that simply selecting unique fields in the 9th column of the gtf file actually counts transcript IDs.

To circumvent this problem I first truncated the gtf file:

cat my.gff | sed -e 's/;.*//' > delete.me.gtf

Then I ran the bedtools map command:

bedtools map -a regions.bed -b delete.me.gtf -c 9 -o count_distinct > counts.genes_in_windows.bed

I almost forgot to delete the intermediate file:

rm delete.me.gtf

There is probably a way to make this a oneliner, without the intermediate file, but I have a dissertation to write!
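
A possible way to avoid the intermediate file is to stream the truncated annotation straight into bedtools. The sketch below wraps the same sed expression in a function (the name `first_attribute_only` is made up); the piped bedtools call is shown as a comment and assumes that bedtools reads standard input when `stdin` is given as the file name and that both inputs are sorted by chromosome and start, as map expects.

```shell
# Drop everything from the first ";" onwards, so column 9 keeps only
# its first attribute (the gene_id in an Ensembl-style GTF).
first_attribute_only() {
  sed -e 's/;.*//'
}

# Hypothetical one-liner, not tested here:
#   first_attribute_only < my.gff \
#     | bedtools map -a regions.bed -b stdin -c 9 -o count_distinct \
#     > counts.genes_in_windows.bed
```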

Extract coverage per feature from a bam and bed to a file

Hi,

A simple task, or it should be. I need to extract the average coverage per feature from a BAM file. I have a GenBank file and a BED file for the reference the BAM was mapped to. If I map with, e.g., Geneious, I can see good, variable coverage over the reference genome. I have tried GATK (could not get it to run) and Bedtools (genomecov and coverage). coverage will give me an output file, but all the features have zero coverage. Here's the top of the .bed file:

track name="Example E.coli"
o26chr.gb 189 255 thrL gene 0 +
o26chr.gb 189 255 thrL CDS 0 +
o26chr.gb 336 2799 thrA gene 0 +
o26chr.gb 336 2799 thrA CDS 0 +
o26chr.gb 2800 3733 thrB gene 0 +
o26chr.gb 2800 3733 thrB CDS 0 +
o26chr.gb 3733 5020 thrC gene 0 +
o26chr.gb 3733 5020 thrC CDS 0 +
o26chr.gb 5233 5530 yaaX gene 0 +

Here's the top of the output from bedtools coverage -ibam file.bam -b file.bed:

o26chr.gb 1047122 1048841 poxB gene 0 - 0 0 1719 0.0000000
o26chr.gb 1047122 1048841 poxB CDS 0 - 0 0 1719 0.0000000
o26chr.gb 2096828 2097287 gene 0 + 0 0 459 0.0000000
o26chr.gb 3144900 3148635 yfaL gene 0 - 0 0 3735 0.0000000
o26chr.gb 3144900 3148635 yfaL CDS 0 - 0 0 3735 0.0000000
o26chr.gb 4194149 4194368 tdcR gene 0 + 0 0 219 0.00 ...

Profile Coverage Of Rnaseq Samples?

Hi all,

I have a quick question:

How can I visualize aligned paired-end reads from RNAseq datasets in UCSC browser?

I already mapped the reads and assembled the transcripts with Tophat/Cufflinks but I'm not sure how to proceed to visualize the mappings

After sorting the BAM files and fixing the mate pairs, I tried to compute the coverage using the following commands:

genomeCoverageBed -bg -split -ibam F.T0.rep2-accepted_hits-fS.bam -g ~/conversion_util/chrom.hg19.sizes > F.T0.rep2-accepted_hits-fS.bg
bedGraphToBigWig F.T0.rep2-accepted_hits-fS.bg ~/conversion_util/chrom.hg19.sizes F.T0.rep2-accepted_hits-fS.bw

But I was not able to visualize the mappings properly. Here I paste a screenshot of what it looks like:

Do you know where the mistake is?

Thanks!

Renaming SNPs or SNP matching

This should be easy to do by now, but... we have SNP data from an Illumina exome array given to us in PLINK format. The BIM file looks like this:

1 exm2253575 0 881627 G A
1 exm269 0 881918 A G
1 exm340 0 888659 T C
1 exm348 0 889238 A G
1 exm2264981 0 894573 G A
1 exm773 0 909238 G C
1 exm782 0 909309 C T
1 exm912 0 949608 A G
1 exm991 0 977028 T G
1 exm1024 0 978762 A G

And I have all of the SNPs in dbSNP 138 downloaded as a large VCF file:

#CHROM POS ID REF ALT QUAL FILTER INFO
1 10019 rs376643643 TA T . . RS=376643643;RSPOS=10020;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000200;WGT=1;VC=DIV;R5;OTHERKG
1 10054 rs373328635 CAA C,CA . . RS=373328635;RSPOS=10055;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000210;WGT=1;VC=DIV;R5;OTHERKG;NOC
1 10109 rs376007522 A T . . RS=376007522;RSPOS=10109;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000100;WGT=1;VC=SNV;R5;OTHERKG
1 10139 rs368469931 A T . . RS=368469931;RSPOS=10139;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000100;WGT=1;VC=SNV;R5;OTHERKG
1 10144 rs144773400 TA T . . RS=1447734 ...
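
Since the rest of the question is cut off, the following is only a guess at what is wanted: replacing the array IDs in the BIM file with rs IDs from the VCF by matching chromosome and position. The sketch assumes both files use the same chromosome naming and the same genome build, ignores alleles entirely, and lets the last VCF record at a position win, so indels sharing a position with a SNP could produce wrong names. The function name `rename_snps` is made up.

```shell
# Replace the variant ID (column 2) of a PLINK BIM file with the VCF ID
# found at the same chromosome and position; unmatched lines keep their ID.
# $1: VCF file (CHROM, POS, ID, ...), $2: BIM file.
rename_snps() {
  awk 'BEGIN { OFS = "\t" }
       NR == FNR { if ($0 !~ /^#/) rs[$1 ":" $2] = $3; next }
       {
         key = $1 ":" $4
         if (key in rs) $2 = rs[key]
         $1 = $1          # rebuild every line with tab separators
         print
       }' "$1" "$2"
}

# Hypothetical usage (file names are placeholders):
#   rename_snps dbsnp138.vcf data.bim > data.renamed.bim
```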

Getting All Reads That Align To A Region In Compact Bed Format Using Bedtools?

I'm trying to find all the reads (by name) from a BAM file that align to various regions in a bed file. Right now I can do this with bedtools using intersectBed:

intersectBed -abam reads.bam -wo -f 1 -b regions.bed -bed

From this one can parse all the read ids that land in every interval in regions.bed, but it's not very compact. Is there a way to get bedtools to natively transform this into a more compact format, e.g.

chr1 x y .... read_id1,read_id2,read_id3

where chr1 x y is a given interval in regions.bed and the comma separated read_id1,... is the list of read ids from reads.bam that fall in that interval. In this compact format, the output BED file would have at most as many entries as there are regions in regions.bed, whereas with the -wo option it can be even larger than the number of reads in reads.bam. Thanks.

bedtools intersect - something wrong with chromosome numbers >= 10?

Hi!

I have an alignment (.bam) of reads to the mm9 genome. I sorted it with samtools sort so that I can later use the -sorted option with bedtools. I also created a .bed file with regions of interest, in which I want to count the number of reads that mapped to them. I converted the .bam to .bed with bedtools bamtobed and then intersected the two files, counting the number of hits:

bedtools intersect -a regions_of_interest.bed -b alignment_sorted.bed -c -sorted > Neg2H_counts.bedgraph

The problem is that it looks fine for all chromosomes with numbers from 0 to 9 (and X), but all counts for all regions of interest on chromosomes with higher numbers (chr10, chr11, etc.) are 0. There is no biological reason for that; in fact, the highest signal should be on chr11. What could be wrong here? I am fairly new to all these tools.


UPDATE
I tried to do the same intersection with bedmap and the result is identical... So there probably is something wrong with my files - what could it be?
I also tried sorting the alignment-derived bed-file in the same way, as I did with the files with regions of interest and it doesn't help.
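
One thing that may be worth checking, offered only as a guess: the -sorted option expects both files to list chromosomes in the same order, and a coordinate-sorted BAM follows its header order (often chr1, chr2, ..., chr10), whereas a plain text sort puts chr10 before chr2. The two helper functions below (names made up) print the chromosome order of a BED file and re-sort a BED file by chromosome name and numeric start, so that both inputs can be brought into the same order.

```shell
# Print chromosomes in order of first appearance, to compare two files.
chrom_order() {
  cut -f1 | uniq
}

# Sort BED lines by chromosome (byte order) and then by numeric start.
sort_bed() {
  LC_ALL=C sort -k1,1 -k2,2n
}

# Hypothetical usage:
#   chrom_order < regions_of_interest.bed
#   chrom_order < alignment_sorted.bed
#   sort_bed < alignment_sorted.bed > alignment_resorted.bed
```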

Extract Only Paired-End Reads That Map A Specific Interval

Hi,

Is it possible to extract paired-end reads that map to a specific interval (from a BAM file)? I tried with intersectBed:

intersectBed -abam align.bam -b interval.gff3 -wa > result.bam

here's the result :

enter image description here

But I only want reads that map to the feature in bold blue (one read of the pair is enough). For example, I don't want the reads that map on either side of this feature (red arrow).

Is it possible with intersectBed or another program?

Thanks,

N.


Snps Comparison

Hello,

I would like to compare SNPs from different methods:

  • number of SNPs
  • SNP positions (positions where method A has SNPs but B does not, and vice versa, and positions where both have SNPs)

I would like to get an output file that contains all of the above information, and I would also like to see the differences visualized, i.e. a tool into which I could load the two files containing the SNPs together with the alignment.

In which format do the SNPs have to be stored, and which tools have to be used to make such a comparison possible?

Thank you in advance.
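
For the position comparison alone, a small awk sketch could look like the following. It assumes each method's calls are in a text file with chromosome and position in the first two columns (as in VCF), skips header lines starting with "#", and assumes the first file contains at least one record. The function name `compare_snps` and the labels are made up.

```shell
# Classify SNP positions as present in both call sets or in only one.
# $1: calls from method A, $2: calls from method B.
# Output columns: chrom, position, both | A_only | B_only.
compare_snps() {
  awk 'BEGIN { OFS = "\t" }
       /^#/ { next }
       NR == FNR { a[$1 OFS $2] = 1; next }
       { b[$1 OFS $2] = 1 }
       END {
         for (k in a) print k, ((k in b) ? "both" : "A_only")
         for (k in b) if (!(k in a)) print k, "B_only"
       }' "$1" "$2" | sort -k1,1 -k2,2n
}

# Hypothetical usage; counting the labels gives the number of SNPs per class:
#   compare_snps methodA.vcf methodB.vcf | cut -f3 | sort | uniq -c
```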

How To Install Bedtools In A User Directory

I am trying to install Bedtools in a user directory. However, I looked at the manual and at its makefile, and there is no argument such as "--prefix" that I could change. Is there a way to install all of Bedtools in a directory that I specify? Thanks!

To Group Items In Bed Files

For example, we now have a bed file:

chr1 23455 45678
chr1 23446 45663
chr1 23449 45669
chr1 30000 31000

Is there any way to group the first three lines while leaving the last line alone? I know Bedtools has the mergeBed function, which merges overlapping spans but would also include the last line.

This may sound like a purely computational question, but I'm curious whether tools already exist to tackle such questions.

thx
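
One way to express "nearly the same interval" is reciprocal overlap: two intervals are grouped only if their overlap covers a large fraction of both. The awk sketch below is a simple greedy version under that assumption; it compares each line with the first member of every existing group, and both the function name and the 0.9 default threshold are arbitrary choices for illustration.

```shell
# Assign a group number to each interval. An interval joins a group when
# its overlap with the group's first member covers at least the given
# fraction (default 0.9) of BOTH intervals; otherwise it starts a new group.
# Output columns: chrom, start, end, group.
group_by_reciprocal_overlap() {
  awk -v min="${1:-0.9}" 'BEGIN { OFS = "\t" }
       {
         id = 0
         for (g = 1; g <= ngroups; g++) {
           if ($1 != gc[g]) continue
           s = ($2 + 0 > gs[g]) ? $2 + 0 : gs[g]
           e = ($3 + 0 < ge[g]) ? $3 + 0 : ge[g]
           ov = e - s
           if (ov > 0 && ov / ($3 - $2) >= min && ov / (ge[g] - gs[g]) >= min) {
             id = g
             break
           }
         }
         if (id == 0) {
           id = ++ngroups
           gc[id] = $1; gs[id] = $2 + 0; ge[id] = $3 + 0
         }
         print $1, $2, $3, id
       }'
}
```

With the four lines from the question, the first three end up in group 1 and the last one in group 2, because the 1 kb interval covers only a small fraction of the roughly 22 kb ones.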

Creating Bed File For Lncrna Using Gencode Gtf File

Hi all,

I want to get a BED file of lncRNAs based on the GENCODE GTF file.

I downloaded the file "gencode.v16.long_noncoding_RNAs.gtf.gz" and extracted the chr, start and end information from it, then used mergeBed to merge the overlapping lncRNAs. Is this correct? I know we can merge exon genomic positions using this kind of method.

For lncRNAs I am not so sure. Is there any place that already offers this kind of BED file?

Actually, we should get 22444 long non-coding RNA loci transcripts, but there are only 11817 genomic regions after the merging process.

If anyone knows the answer, could you give me some help?

Converting Bam To Bedgraph For Viewing On Ucsc?

I'm trying to go from a BAM file to a representation viewable in UCSC, ideally bedGraph. I am trying to use Bedtools's genomeCoverage like this:

genomeCoverageBed -ibam accepted_hits.sorted.bam -bg -trackline -split -g ... > mytrack.bedGraph

I'm not sure what the -g argument is supposed to be or how to generate it. The documentation does not explicitly say what it is supposed to be, though it gives an example where it is some sort of BED file. I am simply looking for a bedGraph or other UCSC-friendly compact representation that will allow me to visualize read densities from the BAM in UCSC.

EDIT: When I generate a bedGraph and put it in UCSC, I get tracks that look like this:

enter image description here

not a histogram. How can I make it a histogram? How can I generate the genome file for use with genomeCoverageBed? Also, is this the best way to get a UCSC-viewable file with Bedtools? To clarify, I want to visualize the BAM as a histogram. I'm not sure whether this is possible with bedGraph. Thank you.
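
On the genome file: it is a two-column, tab-separated list of chromosome names and lengths. One way to build it, sketched below, is from the @SQ lines of the BAM header, which carry the name (SN) and length (LN) of every reference sequence. The function name `header_to_genome` and the output file name are placeholders.

```shell
# Turn SAM/BAM header lines into a genome file (chrom<TAB>size).
# Only @SQ lines are used; SN: holds the name and LN: the length.
header_to_genome() {
  awk 'BEGIN { FS = OFS = "\t" }
       $1 == "@SQ" {
         name = ""; len = ""
         for (i = 2; i <= NF; i++) {
           if ($i ~ /^SN:/) name = substr($i, 4)
           if ($i ~ /^LN:/) len = substr($i, 4)
         }
         if (name != "" && len != "") print name, len
       }'
}

# Hypothetical usage:
#   samtools view -H accepted_hits.sorted.bam | header_to_genome > genome.txt
```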
