I would like to know which of my genes are enriched with repeats of LINE/SINE/ERV etc. elements.
I have a bam file and the repeats in bed format.
As far as I know BAM files contains aligned data for each short read sequence from the fastq file.
I am trying to figure out what is the best approach to know which genes (+- 1000 bp) have more repeats elements.
I am thinking about two approaches to implement but not sure which one is the best. here are the approaches i was thinking to use
a) Shall I convert the bam file into bed file and then use bedtools merge. So that I can overlap with the repeats file using bedtools window -c -l -r option. And I know how many of the repeats are overlapping or near by the short reads. Then count this number for each gene.
For example,
chr start end gene number_of_repeats
chr1 100 200 gene1 70
chr1 190 240 gene1 40
chr1 250 400 gene1 100
chr2 500 600 gene2 150
if i sort and merge them i will get
chr1 100 240 gene1 90
chr1 250 400 gene1 100
chr2 500 600 gene2 150
So gene1 will have 190 (90 + 100) and gene 2 will have 150 number of repeats.
Or
b) shall I count the number of repeats which for each short sequence without any merging? ...