Getting RNA sequences from gff and fa files

August 28, 2013, 4:04 am

Hi. I have a folder full of .fa files and a .gff file. The gff file contains information about which loci look like they code for RNA sequences. The .fa files contain the DNA sequences for a set of human chromosomes. I want to get all the sequences which code for RNA, as defined by the gff file, out of the DNA in the fasta files. I also have a file telling me which RNA types have higher priority (lincRNA is higher priority than miRNA, for example); this tells me which are more important and how I should decide between RNAs for overlapping reads in the gff.

I have been trying to code my own little program in F# that will read these files and give me each RNA read defined in the gff, and its corresponding DNA. However I am a bit confused about how it works. Do the start and end of each feature in the gff file define a character in the corresponding .fa file? Are they 1 or 0 indexed? Does it matter what strand they are ('+' or '-') for my purposes?
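For reference, a minimal sketch of the coordinate convention behind these questions: the GFF format defines start and end as 1-based, inclusive positions on the sequence named in the first column, and a feature on the '-' strand is normally reverse-complemented to obtain its transcribed sequence. The function name `extract_feature` and the toy sequence are illustrative assumptions, not part of the dataset, and Python is used here only to show the arithmetic.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_feature(chrom_seq, start, end, strand="+"):
    """Return the sequence of a GFF feature.

    GFF start/end are 1-based and inclusive, so the Python slice is
    chrom_seq[start - 1:end]. For '-' strand features the slice is
    reverse-complemented.
    """
    if start < 1 or end < start or end > len(chrom_seq):
        raise ValueError("feature coordinates fall outside the sequence")
    seq = chrom_seq[start - 1:end]
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


if __name__ == "__main__":
    toy = "AACCGGTT"  # hypothetical chromosome sequence
    print(extract_feature(toy, 3, 6, "+"))  # CCGG
    print(extract_feature(toy, 1, 3, "-"))  # GTT
```

So the strand does matter if the goal is the RNA sequence itself rather than just the genomic interval.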

Ultimately my goal is to get a bunch of RNAs with their corresponding types (miRNA, lincRNA, snRNA... etc) to do some computations on.

My question is this: what is the easiest way to get it out of the data I have?

The data I am using is freely available here: http://wanglab.pcbi.upenn.edu/coral/ under the heading "Annotation packages" if anyone is interested or needs specifics.

Thank you!

↧

Intersect gene annotation with specific position or genomic interval

August 29, 2013, 9:40 pm

Hi,

I have several genomic intervals and I want to check whether they overlap known genes. I have a gtf file with the coordinates of gene exons. My idea was to use intersectBed from bedtools, but I have a small problem with short genomic intervals that overlap intron coordinates rather than exons (it does not report the gene that contains the interval). Is it possible to tell intersectBed to take introns into account? Or is there another tool?
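One way to make intron hits visible is to collapse each gene's exons into a single span (smallest exon start to largest exon end) and intersect against those spans instead of the exons. A minimal sketch, assuming a GTF whose ninth column carries a `gene_id "..."` attribute; the function name `gene_spans` is hypothetical, and the output is BED-style (0-based start).

```python
import re

# Assumes the usual GTF attribute syntax: gene_id "NAME";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_spans(gtf_lines):
    """Collapse exon records into one (chrom, start0, end, gene_id, strand) per gene."""
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        key = (chrom, match.group(1), strand)
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    # GTF is 1-based inclusive; BED is 0-based half-open, so subtract 1 from start.
    return sorted((c, s - 1, e, g, st) for (c, g, st), (s, e) in spans.items())
```

Writing these tuples out as tab-separated lines gives a gene-level BED file that an intronic interval will overlap.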

Thanks

↧

Tool for binning windowBed output for K-means clustering

November 19, 2013, 9:11 pm

I have mapped high resolution ChIP-seq data to transcription start sites using windowBed. I now want to bin the data, in bin sizes of my choosing, relative to TSSs so that I can generate heat maps and do k-means clustering on the data.

What tool/s exist for doing this?
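The binning step itself is simple arithmetic, sketched below under stated assumptions: each read is reduced to a single position, offsets are taken relative to one TSS, and the sign is flipped on the '-' strand so that negative offsets are always upstream. The function name `bin_offsets` and its parameters are hypothetical, not an existing tool.

```python
from collections import Counter


def bin_offsets(read_positions, tss, strand="+", bin_size=50, flank=1000):
    """Count reads per bin relative to one TSS.

    Offsets are read_position - tss, negated on the '-' strand so that
    negative offsets are always upstream. Returns {bin_start_offset: count}
    for offsets in [-flank, flank).
    """
    counts = Counter()
    for pos in read_positions:
        offset = pos - tss
        if strand == "-":
            offset = -offset
        if -flank <= offset < flank:
            # Floor division keeps bins aligned for negative offsets too.
            counts[(offset // bin_size) * bin_size] += 1
    return dict(counts)
```

Collecting one such dictionary per TSS, with the same bin size and flank, gives the rows of a matrix that can be drawn as a heat map or passed to a k-means implementation.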

Thanks!

↧

Merging/Intersecting different gene annotations - should I extend coordinates?

November 19, 2013, 9:11 pm

I want to create a gene data-set (as big as possible), hence I am using several gene annotations. However, genes in different annotations overlap (they are the same gene). To reduce bias, I overlap the different annotations and, if genes overlap, keep only one of them.

Question:

To ensure this overlap I was thinking of extending the gene coordinates - is this necessary? If so, how big should the extension be (5bp/100bp)?

Example:

Want to create lncRNA data-set (in the following steps it will be used to search for genomic features).
Input:

GENCODE lncRNA annotation (version 18 - 04/09/2013);
Cabili lncRNA annotation (Cabili et al., 2011 (CSHLP)).

Workflow:

Extract GENCODE genes start/end coordinates;
Extract Cabili genes start/end coordinates;
Extend Cabili coordinates ( -/+ nbp );
Use BedTools intersect;
If genes intersect leave GENCODE gene (as it's a newer annotation (though this step is really subjective)).
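The effect of the extension step can be sketched as an overlap test with a tunable slop value. This is only an illustration of the arithmetic in BED-style (0-based, half-open) coordinates; the function name `overlaps_with_slop` is hypothetical, and strand is ignored here as an assumption.

```python
def overlaps_with_slop(a, b, slop=0):
    """True if interval a overlaps interval b after extending b by `slop` bp on each side.

    Both intervals are (chrom, start, end) in 0-based, half-open coordinates.
    """
    a_chrom, a_start, a_end = a
    b_chrom, b_start, b_end = b
    if a_chrom != b_chrom:
        return False
    return a_start < b_end + slop and b_start - slop < a_end


if __name__ == "__main__":
    gencode = ("chr1", 100, 200)  # hypothetical GENCODE gene
    cabili = ("chr1", 250, 300)   # hypothetical Cabili gene
    for slop in (0, 5, 100):
        print(slop, overlaps_with_slop(gencode, cabili, slop))
```

Running the test over a range of slop values and counting how many extra pairs each value merges is one way to see how sensitive the final gene set is to the chosen extension.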

I do realize that this extension question depends on the situation and how reliable annotation is, but still hope that someone could suggest something.

↧

Does WindowBed extend reads?

November 19, 2013, 9:11 pm

I am using WindowBed, part of the BedTools suite, to align reads to a reference file and I obtained a very interesting result. I am trying to rule out an analysis artifact that could be caused by extending the reads or by aligning read midpoints rather than 5' ends. It is my understanding that WindowBed aligns the 5' end of the read to the reference point, rather than mapping the read midpoint, or extending the 3' end of the read and mapping the midpoint. Am I correct in this assumption, that the 5' end of the read is in fact what is being aligned?

Any help here would be appreciated. The BedTools manual, which is very good, doesn't seem to address this.

Thanks

↧

Which of the genes are enriched with repeat elements

November 19, 2013, 9:11 pm

I would like to know which of my genes are enriched with repeats of LINE/SINE/ERV etc. elements.

I have a bam file and the repeats in bed format.

As far as I know, BAM files contain aligned data for each short read sequence from the fastq file. I am trying to figure out the best approach to find which genes (+- 1000 bp) have more repeat elements.

I am thinking about two approaches but am not sure which one is best. Here are the approaches I was thinking of using:

a) Shall I convert the bam file into a bed file and then use bedtools merge, so that I can overlap it with the repeats file using bedtools window with the -c, -l and -r options? Then I would know how many repeats overlap or lie near the short reads, and I could count this number for each gene.

For example,

chr   start  end gene number_of_repeats
chr1 100  200  gene1 70
chr1 190  240  gene1 40
chr1 250  400  gene1 100
chr2 500  600  gene2 150

If I sort and merge them I will get

chr1 100  240  gene1 90
chr1 250  400  gene1 100
chr2 500  600  gene2 150

So gene1 will have 190 (90 + 100) and gene2 will have 150 repeats.

b) Shall I count the number of repeats for each short sequence without any merging, so that I also get some insight into gene counts vs. number of repeats?

For example, using the same example above, I will get

for gene1 210 (70 + 40 + 100) and for gene2 150 repeats.
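The arithmetic of approach (b) can be sketched as a per-gene sum over the unmerged intervals from the example table. The tuple layout and the function name `repeats_per_gene` are assumptions made for illustration only.

```python
from collections import defaultdict


def repeats_per_gene(rows):
    """Sum the repeat counts of all intervals belonging to each gene."""
    totals = defaultdict(int)
    for _chrom, _start, _end, gene, n_repeats in rows:
        totals[gene] += n_repeats
    return dict(totals)


# The four unmerged intervals from the example above.
rows = [
    ("chr1", 100, 200, "gene1", 70),
    ("chr1", 190, 240, "gene1", 40),
    ("chr1", 250, 400, "gene1", 100),
    ("chr2", 500, 600, "gene2", 150),
]
print(repeats_per_gene(rows))  # {'gene1': 210, 'gene2': 150}
```

Note that with this approach a repeat lying under two overlapping intervals contributes to both counts, which is the main practical difference from counting after a merge.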

Am I on completely the wrong track, and should I think of a better solution?

↧

Genomic regions to exclude before shuffling intervals

November 23, 2013, 6:15 am

I want to do a permutation test: randomly reposition (shuffle) given genomic intervals and measure the intersection between the new coordinates and a specific genomic element.

Example:

Different sets of genes: protein coding, pseudogenes, ncRNA - intervals that I want to shuffle;
Genomic repeat L1 - coordinates are stable.
For every gene set shuffle intervals, intersect and measure the overlap with L1 (I am using bedtools shuffle - "reposition each feature in the input BED file on a random chromosome at a random position").

Question - Which genomic regions to exclude from the "genome" (bedtools shuffle -g option) before shuffling gene intervals?
I was going to exclude gaps in the assembly.
But what about:

All gene regions.
If I am shuffling pseudogene intervals should I exclude protein coding and ncRNA coordinates?
All non L1 Repeat masker coordinates.
As Alu, LTR and DNA transposons aren't L1, there won't be any intersection with them?

↧

how to use bedtools window to overlap upstream for positive strand

November 23, 2013, 6:15 am

Hi,

I am trying to use bedtools window. It is explained in the bedtools manual, but I am still a bit confused and thought a confirmation would be good. I also have no biological background.

I have divided my bed file into two, based on the strand information (for example, posStrand.bed and negStrand.bed).

I would like to screen for overlaps with LINEs within 5000bp upstream of the features in my posStrand.bed file.

In this case, shall I use the -l or the -r option of bedtools window?
Since all features are on the + strand, do I need to use the -sw option?
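For reference, "upstream" depends on strand: for a '+' feature it is the region to the left of the start, and for a '-' feature it is the region to the right of the end. A minimal sketch of that arithmetic in BED (0-based, half-open) coordinates; `upstream_window` is a hypothetical helper written for illustration, not a bedtools function.

```python
def upstream_window(start, end, strand, size=5000):
    """Return (win_start, win_end) of the region `size` bp upstream of a BED feature.

    Coordinates are 0-based, half-open. The window is clipped at 0 on the
    left; clipping at the chromosome end is left to the caller.
    """
    if strand == "+":
        return max(0, start - size), start
    if strand == "-":
        return end, end + size
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    print(upstream_window(10000, 12000, "+"))  # (5000, 10000)
    print(upstream_window(10000, 12000, "-"))  # (12000, 17000)
```

For a file that contains only '+' features, the upstream window is therefore always on the left-hand side of each feature.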

↧

To calculate the exact total number of mapped reads in exome regions

December 11, 2013, 5:38 pm

Dear All,

I have some questions here. I want to do some quality control analysis on my exome data, which are mapped to the reference genome. I have an input bam file for a sample, containing the reads that were mapped to the reference genome (hg19.fa); this sample has about 80 million mapped reads. Now I want to calculate how many of these 80 million mapped reads fall in the exome region. For this I need to supply the exome baits bed file (probe/covered.bed) provided by the company. We used the Agilent SureSelectV4 here.

So is there a one-line command that uses these three pieces of information (input.bam, hg19.fa and exome_baits.bed) to calculate the total number of reads mapped to the exonic regions?

In different posts I see a lot of tools being mentioned. I tried to use CalculateHsMetrics from Picard, but it needs the bed file with a header, so it is of no use for now. Then I used the GATK walker DepthOfCoverage, but there we usually get the mean number of times a base is read (for me it's 73.9) and the %_of_bases_reads above 15 times, which is about 70% and also indicates good quality. We also get how many loci have been read more than once, which gives a histogram of cumulative read coverage at each locus. But if I just want to calculate the number of mapped reads that fall in the exome region using the input bam file, reference genome and exome_baits.bed file, how shall I do it? Is there a single-line command for that?

This might be a recurrent question, but I am not getting any specific answer in the forums, so I had to post. Any suggestions?

↧

Help with exception when using bedtools coveragebed with paired alignment. [resolved]

January 13, 2014, 5:00 am

I use bwa mem to align paired reads to a few hundred microbial contigs; then I sort the alignment and try to get the coverage using bedtools genomecov -ibam alignments.paired.sorted.bam -bg >ranges.txt, which fails with an exception:

*** glibc detected *** bedtools: double free or corruption (out): 0x0000000001c5f270 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3d7b2750c6]
bedtools[0x45ab43]
bedtools[0x45b146]
bedtools[0x45c163]
bedtools[0x45e2ed]
bedtools[0x434c4b]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3d7b21ecdd]

If I run the same on an unpaired alignment, everything is OK. So I am really not sure where my mistake is... maybe bedtools can't digest the paired alignment?

-- edit: works with the latest versions of these tools. Here are the ones that failed:

$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.0-r313
Contact: Heng Li <lh3@sanger.ac.uk>

$ bedtools -version
bedtools v2.16.1

↧

how to find the closest distance from bed files between genes and repeats that are upstream

January 13, 2014, 5:00 am

How can I use closestBed from bedtools to find the closest locations between two bed files? The important bit here is that I want them to be upstream and in the correct orientation.

When I use the -s option, it does not report anything (everything is -1).

Then I checked the -D a option. It returns some results, but I am not sure if it is the right thing.

The other thing to mention is that my genes bed file (let's call it gene.bed) is organized as

chr1 123 234 +
chr1 456 789 -

rather than the end position being smaller to indicate the negative strand.

Whereas my repeats.bed file is organized as

chr1 239 456
chr3 456 987

Does bedtools get confused with this?

Which options should I use if I want to find the distance to the nearest repeat that is upstream and in the correct orientation?

↧

Bedgraph not displayed in IGV

January 28, 2014, 7:14 am

Hi, I am new to this and am facing a problem. I was trying to make a bedgraph file using the bedtools genomecov command. The command was:

bedtools genomecov -ibam filename.sorted.bam -g chromosome sizes.txt > O.bedgraph

I got a bedgraph file which is much smaller than expected: 500kb instead of ~6Mb. And when I load that 500kb file into IGV, I see nothing. Please help me out.

↧

Is it possible to filter only bookend reads from a bed file?

January 28, 2014, 7:14 am

I have a bed file with many fragments, some overlapping, some on their own and some adjacent to each other (book-ended) features.

I know I can group overlapping and book-ended features using bedtools like

bedtools cluster -i fragments.bed

However I was wondering if anyone knew of a way of obtaining from the input file only the fragments that contain book-ended adjacent fragments.
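The filter being asked for can be sketched directly: in BED coordinates (0-based, half-open), two fragments are book-ended when the end of one equals the start of the other on the same chromosome. The function name `bookended_fragments` is hypothetical, and the sketch assumes every fragment has a non-zero length.

```python
from collections import defaultdict


def bookended_fragments(fragments):
    """Return fragments that touch another fragment end-to-start on the same chromosome.

    `fragments` is an iterable of (chrom, start, end) in BED coordinates,
    so two fragments are book-ended when one's end equals the other's start.
    Input order is preserved.
    """
    fragments = list(fragments)
    starts = defaultdict(set)
    ends = defaultdict(set)
    for chrom, start, end in fragments:
        starts[chrom].add(start)
        ends[chrom].add(end)
    return [
        (chrom, start, end)
        for chrom, start, end in fragments
        if end in starts[chrom] or start in ends[chrom]
    ]
```

A fragment that merely overlaps its neighbour, or that sits alone, is dropped because none of its boundaries coincides with a boundary of the opposite kind.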

Any ideas?

Best regards

↧

Get the idea of splicing from reads mapped in RNA-seq

January 31, 2014, 8:08 am

I've got a set of 100 bam files from a public experiment. I want to get an idea of the splicing in each of them for three exons, without going into an in-depth procedure like Cufflinks or DEXSeq.

Let's say that my exons are named 1, 2 and 3, and I want to know in how many samples I have a splicing event of exon number two. Looking through the threads, I found that by using coverageBed with my bed file of the three exons I could get some idea per bam file:

coverageBed -split -abam my_alignment -b exons_to.bed

Am I correct?

I was also thinking of getting the reads mapped to the flanking positions, the end of exon 1 and the start of exon 3, with samtools.

What do you think about it? Any idea will be kindly appreciated

Thanks in advance!

↧

using GNU Parallel for bedtools

February 7, 2014, 9:36 am

I am trying to run GNU parallel on the bedtools multicov function, where the original command is

bedtools multicov -bams bam1 bam2 bam3.. -bed anon.bed  > Q1_Counst.bed

I would like to implement the above command using gnu parallel. But when I run the command below

parallel -j 25 "bedtools multicov -bams {1} -bed {2} > Q1_Counst.bed" ::: minus_1_common_sorted_q1.bam minus_2_common_sorted_q1.bam minus_3_common_sorted_q1.bam plus_1_common_sorted_q1.bam plus_2_common_sorted_q1.bam plus_3_common_sorted_q1.bam ::: '/genome/genes_exon_2.bed'

each bam file is taken as a separate argument, hence the processes that start look like

bedtools multicov -bams  bam1 -bed anon.bed  > Q1_Counst.bed
bedtools multicov -bams  bam2 -bed anon.bed  > Q1_Counst.bed
bedtools multicov -bams  bam3 -bed anon.bed  > Q1_Counst.bed

instead of all files being passed together to a single command. Hence Q1_Counst.bed is overwritten randomly. Could anyone help me get the exact command? My server has around 30 cores.

↧

identify overlapping and non overlapping regions for paired-end data

February 18, 2014, 4:31 pm

gene1            gene2
chr1    25    30    chr1    34    37
chr1    15    20    chr1    25    28
chr1    80    90    chr1    10    13

gene1            gene2
chr1    25    30    chr1    36    39
chr1    15    20    chr1    18    20
chr1    80    90    chr1    19    22

common gene1 uniq gene2 (when we compare file 1 with file2)
chr1    15    20    chr1    25    28
chr1    80    90    chr1    10    13

common gene1 uniq gene2 (when we compare file2 with file1)

chr1    15    20    chr1    18    20
chr1    80    90    chr1    19    22

common gene1 common gene2 
 chr1    25    30    chr1    34     37  chr1    25    30    chr1    36    39

The common gene1 / common gene2 case I was able to do with bedtools pairToPair, but I have a problem with common gene1 and uniq gene2.

↧

Finding overlapping variants (i.e. indels, snps) using annovar format.

February 19, 2014, 4:58 pm

Hello,

I know that using bedtools functions (specifically intersect and window), it is possible to find overlapping features in two sets of data. The catch here is that bedtools only accepts files in VCF, GFF, BED or BAM format. I have a tool that generates its output data in ANNOVAR format. My initial thought is to convert the existing VCF files to ANNOVAR, but I am not sure whether there are tools out there that do a similar job to the one described above, but using ANNOVAR files.

Thank you, Young

↧

bed file of mapQ sliding window on a bam file?

February 26, 2014, 9:54 pm

There may already be a recipe for this, so asking first before reinventing the wheel:

I would like to create a bed file where the score is the average mapQ from the reads of the input.bam file. I think bedtools or bedops are the way to go:

http://bedtools.readthedocs.org/en/latest/content/tools/bamtobed.html
http://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/bam2bed.html

But I would like to be able to define a sliding window size and step for the windows, of say, size=1000 and step=200.

I also would like to generate the bam2bed information only from a list of regions in regions.bed. E.g., something like:

mapq_sliding_windows --bam input.bam --wsize 1000 --wstep 200 --regions regions.bed > mapq_sliding_windows.bed

Anyone?
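The windowing and averaging logic of such a hypothetical `mapq_sliding_windows` command could be sketched as below. This is a naive illustration under stated assumptions: reads are already available as (start position, MAPQ) pairs for one region (how they are read from the BAM file is outside the sketch), a read is assigned to a window by its start position, and both function names are made up for the example.

```python
def sliding_windows(region_start, region_end, size=1000, step=200):
    """Yield (start, end) windows of `size` bp every `step` bp inside a region.

    Windows near the region end are truncated at region_end.
    """
    start = region_start
    while start < region_end:
        yield start, min(start + size, region_end)
        start += step


def mean_mapq_per_window(reads, region_start, region_end, size=1000, step=200):
    """Average MAPQ of reads whose start falls in each window.

    `reads` is an iterable of (start_position, mapq) pairs. Windows with no
    reads get None rather than 0, so they stay distinguishable from MAPQ 0.
    """
    reads = list(reads)
    rows = []
    for w_start, w_end in sliding_windows(region_start, region_end, size, step):
        quals = [q for pos, q in reads if w_start <= pos < w_end]
        mean = sum(quals) / len(quals) if quals else None
        rows.append((w_start, w_end, mean))
    return rows
```

Each returned tuple maps onto one BED line (chromosome, start, end, score) once the chromosome name of the region is added. The inner loop rescans all reads for every window, which is fine for a sketch but would need an index or a sorted sweep for whole-genome data.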

↧

Extracting genomic coverage information across different samples

March 30, 2014, 8:10 am

Hello,

I have 3 bam files that I want to compare against each other. For example, I have a reference file with 10,000 sequences, and paired-end reads sequenced for 3 different samples.

1) Sample 1 is 100% the same as the reference, so we expect all reads to map to it.
2) Sample 2 is 80% similar to the reference, so 20% of the reference sequences won't have any reads.
3) Sample 3 is 60% similar to the reference, so 40% of the reference sequences won't have any reads.

Now my goal is to identify which reference sequences do not have any reads mapped in Samples 2 and 3. I need to identify the 20% of reference sequences from Sample 2 and the 40% from Sample 3.

Also, in some cases, for a reference which is approx. 10kb long, sample 1 maps to the entire 10kb, sample 2 maps to the first 5kb and sample 3 maps to the last 3kb, so I need to identify the partial regions for those reference sequences as well.

I have the mapped, sorted bam files for all three samples. I am looking into using bedtools but am not sure what in bedtools will give the answer I need.

I have the following commands, which might do something similar, but they output differences at every base.

genomeCoverageBed -bg -ibam sample1.bam > sample1.bedgraph

genomeCoverageBed -bg -ibam sample2.bam > sample2.bedgraph

unionBedGraphs -header -i sample1.bedgraph sample2.bedgraph -names sample1 sample2 -g reference.fai -empty > samples1and2.txt
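Given per-sample bedGraph intervals like those produced above, the uncovered parts of one reference sequence can be found by walking the covered intervals and collecting the gaps. A minimal sketch, assuming the bedGraph lists only intervals with non-zero coverage and that the reference length is known (for example from the .fai index); the function name `uncovered_regions` is hypothetical.

```python
def uncovered_regions(ref_length, covered):
    """Return the gaps in [0, ref_length) not covered by any interval.

    `covered` is an iterable of (start, end) pairs in bedGraph coordinates
    (0-based, half-open) for one reference sequence; intervals may be
    unsorted or overlapping.
    """
    gaps = []
    cursor = 0
    for start, end in sorted(covered):
        if start > cursor:
            gaps.append((cursor, min(start, ref_length)))
        cursor = max(cursor, end)
        if cursor >= ref_length:
            break
    if cursor < ref_length:
        gaps.append((cursor, ref_length))
    return gaps


if __name__ == "__main__":
    # The 10kb example from the post: sample 2 covers the first 5kb,
    # sample 3 covers the last 3kb.
    print(uncovered_regions(10000, [(0, 5000)]))      # [(5000, 10000)]
    print(uncovered_regions(10000, [(7000, 10000)]))  # [(0, 7000)]
```

A reference sequence with no bedGraph lines at all comes back as one gap spanning its whole length, which identifies the sequences with no mapped reads in a sample.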

↧

Problem with counting mapped reads

March 30, 2014, 8:10 am

Hi, this is my very first experience analysing RNAseq data. My goal is to do differential analysis between two strains of a bacteria. So far, I have managed to align and produce SAM and BAM files. I'm having problems annotating and counting my reads. Here are the commands that I used. My reads are from SOLiD and hence in colourspace.

$ nohup solid2fastq.pl 291_01_01 291_01_01-bwa  #Convert .csfasta and .qual to .fastq

$ nohup bwa index -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta 

$ nohup bwa aln -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta 291_01_01-bwa.singleF3.fastq 291_01_01-bwa.sai 

$ perl -ne 'if($_ !~ m/^\S+?\t4\t/){print $_}' 291_01_01-bwa.sam > 291_01_01-bwa.sam.filtered #Drop unmapped reads (FLAG 4) from the SAM file

$ samtools sort 291_01_01-bwa.bam 291_01_01-bwa.bam.sorted

$ samtools index 291_01_01-bwa.bam.sorted.bam

to produce .rpkm file

$ java -jar ~/bin/bam2rpkm-0.06/bam2rpkm-0.06.jar  -i 291_01_01-bwa.bam.sorted.bam -f Tbrucei427_TriTrypDB-4.0.gff > 291_01_01-bwa.RPKM2.out  # i get an error here
$ERROR: Problem encountered whilst reading gtf file. Could not interpret line 'GeneDB|Tb427_01_v4 EuPathDB supercontig 1

so i tried different method to count

$ htseq-count -i ID 291_01_01-bwa.sam Tbrucei427_TriTrypDB-4.0.gff > 291_01_01-bwa.sam_htseq-count #still error 
$Error occured when processing GFF file (line 37060 of file Tbrucei427_TriTrypDB-4.0.gff):

need more than 1 value to unpack

and different method

$ python make_bed_from_fasta.py ~/Downloads/reference/TbruceiTreu927AnnotatedCDS_TriTrypDB-4.0.fasta > 927_reference.bed #this python script converts .fasta into .bed file since the .gff file cannot be processed
$ multiBamCov -q 30 -p -bams 291_01_01-bwa.bam.sorted.bam -bed 927_reference.bed > test_counts.txt

Now I only get 0 counts for all genes. Does this mean that there is something wrong with my alignment files, or something wrong with the counting method? It also seems that my .gff (version 3) file could not be read by htseq-count or by the Java program. I downloaded the gff file from GeneDB, and it seems that in many tutorials .gtf files are used instead. So I'm stuck at the read-counting step and I really need some help. Help, please.

↧