1 Abstract

“ginmappeR” is an R package that translates gene or protein identifiers between state-of-the-art biological sequence databases: CARD (https://card.mcmaster.ca/; Alcock et al. 2023), NCBI Protein, Nucleotide and Gene (https://www.ncbi.nlm.nih.gov/), UniProt (https://www.uniprot.org/; ‘UniProt’ 2017) and KEGG (https://www.kegg.jp; Kanehisa & Goto 2000). It also offers complementary functionality, such as retrieval of NCBI identical proteins and UniProt similar gene clusters.

2 Introduction

Nowadays, biological sequence databases such as NCBI, UniProt and KEGG offer application programming interfaces (APIs) to access their data. Consequently, community-developed R packages that consume these services are available: rentrez (Winter 2017), UniProt.ws (Carlson & Ramos 2022) and KEGGREST (Tenenbaum & Maintainer 2022), respectively. Other databases, like the Comprehensive Antibiotic Resistance Database (CARD), offer their data as a downloadable file.

The heterogeneity and loose coupling of these tools motivated us to create ginmappeR, an integrated package that translates gene or protein identifiers between the mentioned databases, making it easier for users to work with multiple data sources in a unified and complete way.

3 Software features

Gene/protein identifier translation is bidirectional between the cited databases, which yields a 6x6 matrix (see figure below) of functions named following the pattern getSource2Target. For example, getCARD2UniProt translates from CARD to UniProt.
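
As a rough illustration of this naming scheme, the following base R sketch generates the candidate function names for every source and target pair. The database labels are taken from the function names used in this vignette. Blanking the diagonal is an assumption (translating a database into itself is presumed unnecessary), and not every generated name is guaranteed to be exported by the package.

```r
# Database labels as they appear in the function names of this vignette
databases <- c("CARD", "NCBIProtein", "NCBINucleotide",
               "NCBIGene", "UniProt", "KEGG")

# Build the source x target grid of candidate function names
mappingNames <- outer(
  databases, databases,
  function(source, target) paste0("get", source, "2", target)
)
dimnames(mappingNames) <- list(source = databases, target = databases)

# Assumption: there are no same-database translation functions
diag(mappingNames) <- NA

# Look up the name for a given source/target pair
mappingNames["CARD", "UniProt"]
```

Checking a candidate name with `exists()` after loading the package is one way to confirm that a given cell of the grid corresponds to an actual function.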

Additionally, some features were either not available in the respective packages (such as retrieval of UniProt similar gene clusters) or not easily accessible (such as retrieval of NCBI identical proteins). ginmappeR uses them internally for identifier translation and also offers them to the user as individual functions: getUniProtSimilarGenes and getNCBIIdenticalProteins.

Finally, as mentioned above, all the considered databases offer APIs and associated R packages except CARD, which is only available as a downloadable zip file. To address this, ginmappeR automatically downloads the latest version of CARD and lets the user update it through the updateCARDDataBase function.


4 Example

To illustrate the functionality of the package, we first show some identifier conversion examples, followed by examples of NCBI identical protein and UniProt similar gene cluster retrieval.

4.1 Identifier translation

Let us take CARD ARO identifier 3003955 and map it to the other databases starting with the NCBI group, Protein, Nucleotide and Gene:

library(ginmappeR)
getCARD2NCBIProtein('3003955')
## [1] "CCP45647.1"
getCARD2NCBINucleotide('3003955')
## [1] "AL123456.3"
getCARD2NCBIGene('3003955')
## [1] "888575"

Now, let’s map the id to UniProt:

getCARD2UniProt('3003955')
## [1] "P9WJY5"

Finally, let’s map the id to the KEGG database:

getCARD2KEGG('3003955')
## [1] "mtu:Rv2846c"

Some of the mapping functions have parameters to obtain all possible translations (exhaustiveMapping) or to report the percentage of identity between the source id and each obtained id (detailedMapping). See the package documentation for details. The following example uses both parameters:

# Note that when using exhaustiveMapping = TRUE, it returns a list instead
# of a character vector, to avoid mixing the result identifiers
getCARD2UniProt('3002372', exhaustiveMapping = TRUE, detailedMapping = TRUE)
## [[1]]
## [[1]]$DT
## [1] "Q6QJ79"
## 
## [[1]]$`1.0`
## [1] "Q6QJ79"     "A0A7G1KXU2" "D0UY02"
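
Because this kind of result is a nested list, it often needs flattening before further use. The sketch below rebuilds the structure shown above by hand, so it runs without querying any database, and collects the distinct UniProt identifiers for each input id. The variable names are illustrative only.

```r
# Hand-built copy of the output shown above (not a live query)
result <- list(
  list(DT = "Q6QJ79",
       `1.0` = c("Q6QJ79", "A0A7G1KXU2", "D0UY02"))
)

# One character vector of distinct identifiers per input id,
# dropping the element names (DT, 1.0) and repeated identifiers
uniqueIds <- lapply(result, function(x) unique(unlist(x, use.names = FALSE)))
uniqueIds[[1]]
```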

All the functions in ginmappeR are vectorized, that is, they can map a whole vector of identifiers in one call. Identifiers that cannot be translated yield NA, for example:

getCARD2NCBIProtein(c('3003955', 'wrong_id', '3002535'))
## [1] "CCP45647.1" NA           "CAA38525.1"
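
Since failed translations come back as NA in the same position as their input, inputs and outputs can be paired to find out which identifiers were not mapped. A minimal sketch, using the values from the output above rather than a live call (the data frame and its column names are illustrative):

```r
# Input ids and the result copied from the output above (not a live query)
ids <- c("3003955", "wrong_id", "3002535")
mapped <- c("CCP45647.1", NA, "CAA38525.1")

# Pair each input with its translation
mapping <- data.frame(card = ids, ncbiProtein = mapped,
                      stringsAsFactors = FALSE)

# Inputs that could not be translated, and the successful pairs only
unmapped <- mapping$card[is.na(mapping$ncbiProtein)]
successful <- mapping[!is.na(mapping$ncbiProtein), ]
unmapped
```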


4.2 NCBI identical protein retrieval

The R package rentrez offers access to NCBI databases, among them Identical Protein Groups. To make it more accessible to users, ginmappeR includes getNCBIIdenticalProteins, which receives an NCBI identifier and returns its identical proteins as a list of identifiers:

getNCBIIdenticalProteins('AHA80958')
## [[1]]
## [1] "WP_063864654.1" "AHA80958.1"     "EKD8974449.1"   "EKD8979565.1"

Through the format parameter, the results can instead be obtained as a data frame:

result <- getNCBIIdenticalProteins('AHA80958', format = 'dataframe')
knitr::kable(result)
| Id | Source | Nucleotide.Accession | Start | Stop | Strand | Protein | Protein.Name | Organism | Strain | Assembly |
|---|---|---|---|---|---|---|---|---|---|---|
| 45721358 | RefSeq | NG_050043.1 | 1 | 861 | + | WP_063864654.1 | class A beta-lactamase SHV-172 | Klebsiella pneumoniae | 845332 | |
| 45721358 | INSDC | KF513177.1 | 1 | 861 | + | AHA80958.1 | beta-lactamase SHV-172 | Klebsiella pneumoniae | 845332 | |
| 45721358 | INSDC | ABJLVL010000001.1 | 124981 | 125841 | - | EKD8974449.1 | class A beta-lactamase SHV-172 | Klebsiella pneumoniae | NA | GCA_026265195.1 |
| 45721358 | INSDC | ABJLVL010000113.1 | 1755 | 2615 | + | EKD8979565.1 | class A beta-lactamase SHV-172 | Klebsiella pneumoniae | NA | GCA_026265195.1 |

4.3 UniProt similar genes clusters

The function getUniProtSimilarGenes retrieves clusters of genes with 100%, 90% or 50% identity to the provided identifier. Let us try it with UniProt gene Q2A799 and 100% identity:

getUniProtSimilarGenes('Q2A799', clusterIdentity = '1.0')
## [[1]]
## [1] "B0BL11"     "A0A344X7M9" "B7VEQ9"

The clusterNames argument can be used to also retrieve the cluster names. Lowering clusterIdentity to '0.9' returns the larger 90% identity cluster for the same gene:

getUniProtSimilarGenes('Q2A799', clusterIdentity = '0.9')
## [[1]]
##  [1] "A0A173DQX0" "A0A1Y0BRE0" "Q8GKX3"     "A0A1S5SJJ9" "D7GKY5"    
##  [6] "A0A0U3BEI9" "A0A2V4FMD8" "D7GKY3"     "I3VI54"     "A0A023SG55"
## [11] "A0A1B2F089" "A0A344X7M9" "B7VEQ9"     "D6CJE1"     "D7GKZ1"    
## [16] "G1CSK5"     "A0A1W6F5I4" "A0A844NVA2" "A0AAI9KXE0" "B0BL11"    
## [21] "D0EW81"     "D7GKY7"     "Q1WLM9"     "Q9RGC2"     "A5LHV8"    
## [26] "Q0PRG2"     "U5NIQ3"     "A0A0U1PYJ5" "A0AAX0APK3" "C0JBE4"    
## [31] "C0LIL9"     "H6V565"     "A0A2S5T091" "D3VX06"     "D6CI36"    
## [36] "H2E8M2"     "A4KZ69"     "A0A5Q2V4N5" "D2KHP5"     "F1B1U0"
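
The two results can be compared with base R set operations. The sketch below copies the identifiers from the outputs above instead of querying UniProt, and checks that every member of the 100% identity cluster also appears in the 90% identity cluster. The variable names are illustrative only.

```r
# Identifiers copied from the two outputs above (not a live query)
cluster100 <- c("B0BL11", "A0A344X7M9", "B7VEQ9")
cluster90 <- c(
  "A0A173DQX0", "A0A1Y0BRE0", "Q8GKX3", "A0A1S5SJJ9", "D7GKY5",
  "A0A0U3BEI9", "A0A2V4FMD8", "D7GKY3", "I3VI54", "A0A023SG55",
  "A0A1B2F089", "A0A344X7M9", "B7VEQ9", "D6CJE1", "D7GKZ1",
  "G1CSK5", "A0A1W6F5I4", "A0A844NVA2", "A0AAI9KXE0", "B0BL11",
  "D0EW81", "D7GKY7", "Q1WLM9", "Q9RGC2", "A5LHV8",
  "Q0PRG2", "U5NIQ3", "A0A0U1PYJ5", "A0AAX0APK3", "C0JBE4",
  "C0LIL9", "H6V565", "A0A2S5T091", "D3VX06", "D6CI36",
  "H2E8M2", "A4KZ69", "A0A5Q2V4N5", "D2KHP5", "F1B1U0"
)

# Is the stricter cluster fully contained in the looser one?
isNested <- all(cluster100 %in% cluster90)

# Members found only at the looser 90% identity threshold
onlyIn90 <- setdiff(cluster90, cluster100)
isNested
```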


References

Alcock, B.P., Huynh, W., Chalil, R., Smith, K.W., Raphenya, A.R., Wlodarski, M.A., Edalatmand, A., Petkau, A., Syed, S.A., Tsang, K.K. & others. (2023). CARD 2023: Expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research, 51, D690–D699.
Carlson, M. & Ramos, M. (2022). UniProt.ws: R interface to UniProt web services.
Kanehisa, M. & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28, 27–30. Retrieved from http://nar.oxfordjournals.org/content/28/1/27
Tenenbaum, D. & Maintainer, B.P. (2022). KEGGREST: Client-side REST access to the Kyoto Encyclopedia of Genes and Genomes (KEGG).
UniProt: The universal protein knowledgebase. (2017). Nucleic Acids Research, 45, D158–D169.
Winter, D.J. (2017). rentrez: An R package for the NCBI eUtils API. The R Journal, 9, 520–526.