I want to create a gene data set that is as large as possible, so I am combining several gene annotations. However, the same gene often appears in more than one annotation, with overlapping coordinates. To reduce bias, I intersect the annotations and, where genes overlap, keep only one of them.
Question:
To make sure such overlaps are detected, I was thinking of extending the gene coordinates. Is this necessary? If so, how large should the extension be (e.g. 5 bp or 100 bp)?
Example:
I want to create a lncRNA data set (in later steps it will be used to search for genomic features).
Input:
- GENCODE lncRNA annotation (version 18 - 04/09/2013);
- Cabili lncRNA annotation (Cabili et al., 2011 (CSHLP)).
Workflow:
- Extract the start/end coordinates of the GENCODE genes;
- Extract the start/end coordinates of the Cabili genes;
- Extend the Cabili coordinates (-/+ n bp);
- Run BedTools intersect;
- If genes intersect, keep the GENCODE gene (as it is the newer annotation, though this choice is admittedly subjective).
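To make the workflow above concrete, here is a minimal pure-Python sketch of the extend-then-intersect logic. It is not the BedTools implementation: the `Gene` tuple, the function names, and the toy coordinates are all hypothetical, coordinates are assumed to be BED-style (0-based, half-open), strand is ignored, and the extended end is not clipped to the chromosome length.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention, assumed)
    end: int    # 0-based, exclusive
    name: str


def extend(gene: Gene, n: int) -> Gene:
    """Pad a gene by n bp on both sides, clipping the start at 0."""
    return gene._replace(start=max(0, gene.start - n), end=gene.end + n)


def overlaps(a: Gene, b: Gene) -> bool:
    """True if two half-open intervals on the same chromosome share >= 1 bp."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def merge_annotations(gencode: List[Gene], cabili: List[Gene], n: int = 0) -> List[Gene]:
    """Keep every GENCODE gene, plus each Cabili gene whose coordinates,
    extended by n bp, do not overlap any GENCODE gene."""
    kept = list(gencode)
    for gene in cabili:
        padded = extend(gene, n)
        if not any(overlaps(padded, g) for g in gencode):
            kept.append(gene)  # store the original, unextended coordinates
    return kept


# Toy, made-up coordinates purely for illustration.
gencode = [Gene("chr1", 1000, 2000, "G1")]
cabili = [
    Gene("chr1", 1500, 2500, "C1"),  # overlaps G1 directly
    Gene("chr1", 2050, 3000, "C2"),  # starts 50 bp downstream of G1
    Gene("chr2", 1000, 2000, "C3"),  # different chromosome
]

for n in (0, 100):
    names = [g.name for g in merge_annotations(gencode, cabili, n=n)]
    print(f"extension = {n} bp -> {names}")
```

In this toy case the extension only changes the outcome for C2, which lies 50 bp from G1: it is kept with no extension and dropped with a 100 bp extension. The all-against-all comparison is quadratic, so for genome-wide annotations a sorted or indexed approach (as BedTools uses) would be preferable.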
I realize that the right extension depends on the situation and on how reliable each annotation is, but I still hope someone can suggest a reasonable starting point.