ginmappeR is an R package that translates gene or protein identifiers
between state-of-the-art biological sequence databases: CARD
(https://card.mcmaster.ca/; Alcock et al. 2023), NCBI Protein, Nucleotide
and Gene (https://www.ncbi.nlm.nih.gov/), UniProt
(https://www.uniprot.org/; ‘UniProt’ 2017) and KEGG
(https://www.kegg.jp; Kanehisa & Goto 2000). It also offers complementary
functionality, such as retrieval of NCBI identical proteins and UniProt
similar gene clusters.
Many biological sequence databases, such as NCBI, UniProt and KEGG, now offer programmatic interfaces (APIs) to access their data. Consequently, community-developed R packages are available to consume these services: rentrez (Winter 2017), UniProt.ws (Carlson & Ramos 2022) and KEGGREST (Tenenbaum & Maintainer 2022), respectively. Other databases, like the Comprehensive Antibiotic Resistance Database (CARD), offer their data as a downloadable file.
The heterogeneity and low coupling of these tools motivated us to
conceive ginmappeR, an integrated package that translates gene or protein
identifiers between the mentioned databases, making it easier for users
to work with multiple data sources in a unified and complete way.
The gene/protein identifier translation feature is bidirectional for
every cited database and translates into a 6x6 matrix (see figure below)
of functions of the form getSource2Target. For example, to translate
from CARD to UniProt, getCARD2UniProt can be used.
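The naming scheme can be sketched in base R. The snippet below only builds function names as strings and does not call ginmappeR. It assumes that the database labels seen in the CARD function names of this document (NCBIProtein, NCBINucleotide, NCBIGene, UniProt, KEGG) are also used as source labels, and that same-database pairs have no translation function; neither assumption is confirmed here. The variable names dbs and fun_names are illustrative.

```r
# Database labels as they appear in the function names of this document.
# Assumption: the same labels are used for both source and target.
dbs <- c("CARD", "NCBIProtein", "NCBINucleotide", "NCBIGene", "UniProt", "KEGG")

# Build every source/target combination following the getSource2Target pattern.
fun_names <- outer(dbs, dbs, function(src, tgt) paste0("get", src, "2", tgt))
dimnames(fun_names) <- list(source = dbs, target = dbs)

# Assumption: same-database pairs are not translations, so blank the diagonal.
diag(fun_names) <- NA_character_

# Look up the function name for a given source and target.
fun_names["CARD", "UniProt"]
```

Indexing the matrix by row (source) and column (target) gives the name of the function to look for in the package documentation.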
Additionally, features that were not available in their respective
packages, like retrieval of UniProt similar gene clusters, or that were not
easily accessible, such as NCBI identical proteins retrieval, are part of
ginmappeR's id translation implementation and are also offered as
individual functions for the user: getUniProtSimilarGenes and
getNCBIIdenticalProteins.
Finally, as previously mentioned, the considered databases offer APIs
and associated R packages, except for CARD, which is only available as a
downloadable zip file. To solve this, ginmappeR automatically downloads
CARD’s latest version and also lets the user update it through the
updateCARDDataBase function.
To illustrate the functionality of the package, we first show some id conversion examples and then examples of NCBI identical protein and UniProt similar gene cluster retrieval.
Let us take CARD ARO identifier 3003955 and map it to the other
databases, starting with the NCBI group (Protein, Nucleotide and Gene):
library(ginmappeR)
getCARD2NCBIProtein('3003955')
## [1] "CCP45647.1"
getCARD2NCBINucleotide('3003955')
## [1] "AL123456.3"
getCARD2NCBIGene('3003955')
## [1] "888575"
Now, let’s map the id to UniProt:
getCARD2UniProt('3003955')
## [1] "P9WJY5"
Finally, let’s map the id to the KEGG database:
getCARD2KEGG('3003955')
## [1] "mtu:Rv2846c"
Some of the mapping functions have parameters to obtain all possible
translations (exhaustiveMapping) or to report the percentage of identity
between the source id and each obtained id (detailedMapping). More
information is available in the code’s documentation. Let’s see an example
using these parameters:
# Note that when using exhaustiveMapping = TRUE, it returns a list instead
# of a character vector, to avoid mixing the result identifiers
getCARD2UniProt('3002372', exhaustiveMapping = TRUE, detailedMapping = TRUE)
## [[1]]
## [[1]]$DT
## [1] "Q6QJ79"
##
## [[1]]$`1.0`
## [1] "Q6QJ79" "A0A7G1KXU2" "D0UY02"
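As a rough illustration of how such a nested result might be navigated, the sketch below rebuilds the list from the output above by hand, so that it runs without network access, and then extracts its parts. The variable names res, first and all_ids are illustrative, and the reading of the "1.0" element as the 100% identity group is an assumption based on the description of detailedMapping.

```r
# Result copied from the output above; in a live session it would come from
# getCARD2UniProt('3002372', exhaustiveMapping = TRUE, detailedMapping = TRUE).
res <- list(list(DT = "Q6QJ79", `1.0` = c("Q6QJ79", "A0A7G1KXU2", "D0UY02")))

# One list entry per input identifier; here there is only one.
first <- res[[1]]

first$DT          # identifier stored under the element named DT
first[["1.0"]]    # identifiers under "1.0" (assumed: 100% identity)

# Flatten every group into a single de-duplicated character vector.
all_ids <- unique(unlist(first, use.names = FALSE))
all_ids
```

Names such as "1.0" are not syntactic R names, so they need [[ ]] with a string or backticks after $.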
All the functions in ginmappeR are vectorized, that is, they can map a vector of identifiers, for example:
getCARD2NCBIProtein(c('3003955', 'wrong_id', '3002535'))
## [1] "CCP45647.1" NA "CAA38525.1"
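Since identifiers that cannot be mapped come back as NA, it can be handy to separate them from the successful translations. The sketch below uses the output above as hard-coded data rather than calling the package; the variable names mapped, unmapped and resolved are illustrative.

```r
ids <- c('3003955', 'wrong_id', '3002535')

# Result copied from the output above; in a live session it would come from
# getCARD2NCBIProtein(ids).
mapped <- c("CCP45647.1", NA, "CAA38525.1")

# Keep track of which input produced which output.
names(mapped) <- ids

# Inputs that could not be translated.
unmapped <- names(mapped)[is.na(mapped)]

# Successful translations only, still named by their source identifier.
resolved <- mapped[!is.na(mapped)]

unmapped
resolved
```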
The R package rentrez offers access to NCBI databases, among which is Identical
Protein Groups. To make it more accessible to users, ginmappeR includes
getNCBIIdenticalProteins, which receives an NCBI identifier and returns its
identical proteins as a list of identifiers:
getNCBIIdenticalProteins('AHA80958')
## [[1]]
## [1] "WP_063864654.1" "AHA80958.1" "EKD8974449.1" "EKD8979565.1"
Through the format parameter, it is possible to obtain the results as a data frame:
result <- getNCBIIdenticalProteins('AHA80958', format = 'dataframe')
knitr::kable(result)
Id | Source | Nucleotide.Accession | Start | Stop | Strand | Protein | Protein.Name | Organism | Strain | Assembly |
---|---|---|---|---|---|---|---|---|---|---|
45721358 | RefSeq | NG_050043.1 | 1 | 861 | + | WP_063864654.1 | class A beta-lactamase SHV-172 | Klebsiella pneumoniae | 845332 | |
45721358 | INSDC | KF513177.1 | 1 | 861 | + | AHA80958.1 | beta-lactamase SHV-172 | Klebsiella pneumoniae | 845332 | |
45721358 | INSDC | ABJLVL010000001.1 | 124981 | 125841 | - | EKD8974449.1 | class A beta-lactamase SHV-172 | Klebsiella pneumoniae | NA | GCA_026265195.1 |
45721358 | INSDC | ABJLVL010000113.1 | 1755 | 2615 | + | EKD8979565.1 | class A beta-lactamase SHV-172 | Klebsiella pneumoniae | NA | GCA_026265195.1 |
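Once the result is a data frame, ordinary base R subsetting applies. The sketch below rebuilds three of the columns shown in the table by hand, so that it runs offline, and then selects the RefSeq entry; it is only an illustration of post-processing, and refseq_ids is an illustrative name.

```r
# Subset of the columns shown in the table above, typed in by hand; in a live
# session this would be the value of
# getNCBIIdenticalProteins('AHA80958', format = 'dataframe').
result <- data.frame(
  Source  = c("RefSeq", "INSDC", "INSDC", "INSDC"),
  Protein = c("WP_063864654.1", "AHA80958.1", "EKD8974449.1", "EKD8979565.1"),
  Strand  = c("+", "+", "-", "+"),
  stringsAsFactors = FALSE
)

# Protein accessions that come from RefSeq.
refseq_ids <- result$Protein[result$Source == "RefSeq"]
refseq_ids

# Number of records per source database.
table(result$Source)
```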
The function getUniProtSimilarGenes retrieves clusters of genes with
100%, 90% or 50% identity with the provided identifier. Let us try with UniProt
gene Q2A799 and 100% identity:
getUniProtSimilarGenes('Q2A799', clusterIdentity = '1.0')
## [[1]]
## [1] "B0BL11" "A0A344X7M9" "B7VEQ9"
The clusterNames argument can be used to also retrieve the cluster names.
The following call retrieves the 90% identity cluster:
getUniProtSimilarGenes('Q2A799', clusterIdentity = '0.9')
## [[1]]
## [1] "A0A173DQX0" "A0A1Y0BRE0" "Q8GKX3" "A0A1S5SJJ9" "D7GKY5"
## [6] "A0A0U3BEI9" "A0A2V4FMD8" "D7GKY3" "I3VI54" "A0A023SG55"
## [11] "A0A1B2F089" "A0A344X7M9" "B7VEQ9" "D6CJE1" "D7GKZ1"
## [16] "G1CSK5" "A0A1W6F5I4" "A0A844NVA2" "A0AAI9KXE0" "B0BL11"
## [21] "D0EW81" "D7GKY7" "Q1WLM9" "Q9RGC2" "A5LHV8"
## [26] "Q0PRG2" "U5NIQ3" "A0A0U1PYJ5" "A0AAX0APK3" "C0JBE4"
## [31] "C0LIL9" "H6V565" "A0A2S5T091" "D3VX06" "D6CI36"
## [36] "H2E8M2" "A4KZ69" "A0A5Q2V4N5" "D2KHP5" "F1B1U0"