This vignette explains how to create a new biodb generic entry field, and how to parse its value for a particular database connector.
biodb 1.15.0
In this vignette we will focus on creating a new biodb field to be used inside an existing connector. biodb fields are defined for all database connectors. They are definitions of what types of data may be set inside biodb entry objects. Since they are shared by all connectors, they need to be defined without any reference to a particular database. However, many of them are linked to a particular scientific or technological domain (genetics, metabolomics, mass spectrometry, …).
An entry field is like a type definition. The definition is done at the top level of biodb, and thus it is not related to any particular connector. The definition includes: a name, a description, a class (integer, double, character, logical), a cardinality (single value or vector), a list of allowed values, a type (to group similar fields like “mass”), etc.
For a particular connector, when an entry object is created in memory, a file containing the values is obtained from the database and parsed in order to extract those values and assign them to the associated biodb entry fields inside the biodb entry object. Thus the parsing of the value of a biodb entry field is different for each connector, while the biodb entry field itself is used by several different connectors.
No biodb connector uses all available biodb entry fields. Moreover, it can happen that a connector does not implement the parsing of some of the data available inside a database. The reason is that, in most cases, the amount and diversity of the data available inside a single entry would require an excessive amount of coding. As a consequence, we often restrict our development to the subset of the available data in which we are interested.
When a particular piece of data from the database is not present inside the entries of the corresponding biodb connector, this means that no parsing has been written for it inside the connector. It could also mean that no biodb entry field is defined to handle this particular type of data. Fortunately, biodb offers you a way to correct this shortage dynamically, inside your code, by creating a new biodb entry field if necessary and defining the corresponding parsing of the data for the connector.
Follow the subsequent explanations in order to learn how to define a new parsing of a value for a connector and assign it to an existing entry field, and how to define a new entry field.
First we create a biodb instance:
mybiodb <- biodb::BiodbMain$new()
## INFO [16:27:14.037] Loading definitions from package biodb version 1.15.0.
Before creating a new field, we will look at different ways of parsing a value for an existing biodb field that is not handled by a connector.
Two connector cases will be used as examples: the ChebiExConn connector defined for the “Creating a new connector” vignette, and the CompCsvFileConn connector from the biodb package.
The ChebiExConn class implements an example connector to the ChEBI (Hastings et al. 2012) remote database. See the “Creating a new connector” vignette for the creation of this connector.
We dynamically load the definition of this connector into biodb, as explained in the “Creating a new connector” vignette:
chebiexDefFile <- system.file("extdata", "chebi_ex.yml", package='biodb')
connClass <- system.file("extdata", "ChebiExConn.R", package='biodb')
entryClass <- system.file("extdata", "ChebiExEntry.R", package='biodb')
source(connClass)
source(entryClass)
mybiodb$loadDefinitions(chebiexDefFile)
For our demonstration we will suppose this connector has been created by somebody else, and we have no access to the implementation code.
We create a connector to this database:
conn <- mybiodb$getFactory()$createConn('chebi.ex')
And get some entries:
entryIds <- c('17001', '40304', '64679')
entriesDf <- mybiodb$entriesToDataframe(conn$getEntry(entryIds))
You can see them in table 1.
accession | formula | inchi | inchikey | molecular.mass | monoisotopic.mass | name | smiles | chebi.ex.id |
---|---|---|---|---|---|---|---|---|
17001 | C9H13N5O4 | InChI=1S/C9H13N5O4/c10-9-13-7-5(8(18)14-9)12-3(1-11-7)6(17)4(16)2-15/h4,6,15-17H,1-2H2,(H4,10,11,13,14,18)/t4-,6+/m1/s1 | YQIFAMYNGGOTFB-XINAWCOVSA-N | 255.2308 | 255.0967 | 7,8-dihydroneopterin | Nc1nc2NCC(=Nc2c(=O)[nH]1)[C@H](O)[C@H](O)CO | 17001 |
40304 | C10H13N5O5 | InChI=1S/C10H13N5O5/c11-9-13-7-6(8(18)14-9)12-10(19)15(7)5-1-3(17)4(2-16)20-5/h3-5,16-17H,1-2H2,(H,12,19)(H3,11,13,14,18)/t3-,4+,5+/m0/s1 | HCAJQHYUCKICQH-VPENINKCSA-N | 283.2407 | 283.0917 | 8-hydroxy-2’-deoxyguanosine | Nc1nc2n([C@H]3C[C@H](O)[C@@H](CO)O3)c(O)nc2c(=O)[nH]1 | 40304 |
64679 | C9H18NO11P | InChI=1S/C9H18NO11P/c10-3(8(15)16)2-19-22(17,18)21-9-7(14)6(13)5(12)4(1-11)20-9/h3-7,9,11-14H,1-2,10H2,(H,15,16)(H,17,18)/t3-,4+,5+,6-,7-,9+/m0/s1 | JTBRVTASGISCGJ-RHNOWPELSA-N | 347.2131 | 347.0618 | O-(alpha-D-mannose-1-phosphoryl)-L-serine | N[C@@H](COP(O)(=O)O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]1O)C(O)=O | 64679 |
You will notice that no electrical charge is mentioned for the molecules in the table, although it is present inside the ChEBI database. Let us choose one of the entries:
id <- entryIds[[1]]
id
## [1] "17001"
And get the ChEBI web page of this entry:
conn$getEntryPageUrl(id)
## 17001
## "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=17001"
Go to this page (https://www.ebi.ac.uk/chebi/searchId.do?chebiId=17001) to check that the electrical charge information is indeed given by ChEBI (Net Charge 0).
To integrate this data inside the biodb entry, we need to extract it from the file returned by ChEBI.
When asked for an entry on its web service interface, ChEBI returns an XML file that biodb stores in its cache.
By calling the following method on your connector, you can get the path to the biodb cache file:
conn$getCacheFile(id)
## [1] "/home/biocbuild/.cache/R/biodb/chebi.ex-0c5076ac2a43d16dbce503a44b09f649/17001.xml"
If you take a look at this file with your favourite editor, you will find the following text:
<charge>0</charge>
This is the XML tag that stores the value of the electrical charge.
To extract values from XML, biodb uses the XPath query language.
In the XPath language, the expression //chebi:charge means to get the value inside the charge tag wherever it is (//) inside the tree structure of the XML.
See XPath Tutorial for an introduction to XPath.
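To make the meaning of the expression more concrete, here is a minimal sketch that evaluates //chebi:charge with the XML package outside of biodb. The XML snippet is a simplified, hypothetical stand-in for the file returned by ChEBI, and the namespace URI is a placeholder chosen for this sketch; only the `<charge>` element comes from the text above. The `chebi:` part of the expression is a namespace prefix, which has to be bound to the namespace URI declared in the XML document.

```r
library(XML)

# Hypothetical, simplified XML content. The real ChEBI file is much larger,
# and the namespace URI below is a placeholder used only for this sketch.
xmlText <- '<return xmlns="http://example.org/chebi">
  <chebiId>CHEBI:17001</chebiId>
  <charge>0</charge>
</return>'

doc <- xmlParse(xmlText, asText = TRUE)

# Bind the "chebi" prefix used in the XPath expression to the namespace URI.
ns <- c(chebi = "http://example.org/chebi")

# "//" searches for the charge element anywhere inside the tree.
charge <- xpathSApply(doc, "//chebi:charge", xmlValue, namespaces = ns)
print(charge)
```

The value is returned as a character string; converting it to the class of the target entry field is left to biodb.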
We need to give this XPath expression to the biodb instance, and specify to which entry field the extracted value must be assigned.
This is done by defining a small YAML file:
chargeParsingDefFile <- system.file("extdata", "chebi_ex_charge_parsing.yml", package='biodb')
Its content is as follows:
databases:
  chebi.ex:
    parsing.expr:
      charge: //chebi:charge
In this file we define a new parsing expression inside the parsing.expr section for the chebi.ex database connector.
The definition of the parsing expression consists of two values: the targeted biodb entry field (charge) and the XPath expression (//chebi:charge).
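If you do not have such a file at hand, the same definition can be written from R. The following sketch builds the YAML text with its nesting made explicit through indentation, checks with the yaml package that it parses into the expected structure, and writes it to a temporary file. The variable names are chosen for this example only, and the final loading step is left as a comment since it needs a running biodb instance.

```r
library(yaml)

# The definition shown above, written as an R string with explicit nesting.
defText <- "
databases:
  chebi.ex:
    parsing.expr:
      charge: //chebi:charge
"

# Parse the text to check that the structure is the expected one.
def <- yaml.load(defText)
expr <- def$databases$chebi.ex$parsing.expr$charge
print(expr)

# Write the definition to a temporary file, ready to be loaded.
defFile <- tempfile(fileext = ".yml")
writeLines(defText, defFile)
# With a biodb instance available: mybiodb$loadDefinitions(defFile)
```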
Now we just have to load this new definition:
mybiodb$loadDefinitions(chargeParsingDefFile)
Delete the existing connector:
mybiodb$getFactory()$deleteConn(conn)
## INFO [16:27:16.453] Connector "chebi.ex" deleted.
Recreate the connector and reload the same entries:
conn <- mybiodb$getFactory()$createConn('chebi.ex')
entriesDf <- mybiodb$entriesToDataframe(conn$getEntry(entryIds))
You can see in table 2 that the electrical charge is now indicated for each entry.
accession | formula | inchi | inchikey | molecular.mass | monoisotopic.mass | name | smiles | charge | chebi.ex.id |
---|---|---|---|---|---|---|---|---|---|
17001 | C9H13N5O4 | InChI=1S/C9H13N5O4/c10-9-13-7-5(8(18)14-9)12-3(1-11-7)6(17)4(16)2-15/h4,6,15-17H,1-2H2,(H4,10,11,13,14,18)/t4-,6+/m1/s1 | YQIFAMYNGGOTFB-XINAWCOVSA-N | 255.2308 | 255.0967 | 7,8-dihydroneopterin | Nc1nc2NCC(=Nc2c(=O)[nH]1)[C@H](O)[C@H](O)CO | 0 | 17001 |
40304 | C10H13N5O5 | InChI=1S/C10H13N5O5/c11-9-13-7-6(8(18)14-9)12-10(19)15(7)5-1-3(17)4(2-16)20-5/h3-5,16-17H,1-2H2,(H,12,19)(H3,11,13,14,18)/t3-,4+,5+/m0/s1 | HCAJQHYUCKICQH-VPENINKCSA-N | 283.2407 | 283.0917 | 8-hydroxy-2’-deoxyguanosine | Nc1nc2n([C@H]3C[C@H](O)[C@@H](CO)O3)c(O)nc2c(=O)[nH]1 | 0 | 40304 |
64679 | C9H18NO11P | InChI=1S/C9H18NO11P/c10-3(8(15)16)2-19-22(17,18)21-9-7(14)6(13)5(12)4(1-11)20-9/h3-7,9,11-14H,1-2,10H2,(H,15,16)(H,17,18)/t3-,4+,5+,6-,7-,9+/m0/s1 | JTBRVTASGISCGJ-RHNOWPELSA-N | 347.2131 | 347.0618 | O-(alpha-D-mannose-1-phosphoryl)-L-serine | N[C@@H](COP(O)(=O)O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]1O)C(O)=O | 0 | 64679 |
The CompCsvFileConn class implements a connector to a local CSV file database of chemical compounds, as explained in the corresponding vignette. For a database stored inside a CSV file, the data parsing is very simple. It consists in associating each biodb entry field with a column name. By default, biodb defines an association for each entry field whose name is used as a column name. The columns whose names are not the names of existing biodb entry fields are not associated, and thus you cannot access their values from biodb.
If you want to access those values, you have to define the associations manually, using the setField() method.
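The default matching rule and the effect of an explicit association can be illustrated with base R alone. This is only a sketch of the idea, not biodb's implementation: the list of known field names is a hypothetical subset, and the two-row table is a shortened copy of the example file used below, in which the column elecCharge does not carry the name of a biodb field.

```r
# Hypothetical subset of biodb entry field names, for illustration only.
knownFields <- c("accession", "formula", "charge", "name", "smiles")

# Shortened tab-separated content with one column that is not a field name.
tsvText <- "accession\telecCharge\tformula\tname
1018\t0\tC2H8AsNO3\t2-Aminoethylarsonate
1390\t0\tC8H8O2\t3,4-Dihydroxystyrene"

df <- read.table(text = tsvText, sep = "\t", header = TRUE,
                 stringsAsFactors = FALSE, quote = "", comment.char = "")

# Columns that a name-based default association would leave out.
unmatched <- setdiff(colnames(df), knownFields)
print(unmatched)

# Explicit association, written as field name -> column name.
fieldToColumn <- c(charge = "elecCharge")
idx <- match(fieldToColumn, colnames(df))
colnames(df)[idx] <- names(fieldToColumn)
print(colnames(df))
```

Once the association is declared, the values of the column become available under the name of the entry field.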
For our example we use an extract from the ChEBI database as the input CSV database file:
fileUrl <- system.file("extdata", "chebi_extract_with_unknown_column.tsv", package='biodb')
See table 3 for the content of this file.
accession | elecCharge | formula | monoisotopic.mass | molecular.mass | kegg.compound.id | name | smiles |
---|---|---|---|---|---|---|---|
1018 | 0 | C2H8AsNO3 | 168.97201 | 169.012 | C07279 | 2-Aminoethylarsonate | NCC[As](O)(O)=O |
1390 | 0 | C8H8O2 | 136.05243 | 136.148 | C06224 | 3,4-Dihydroxystyrene | Oc1ccc(C=C)cc1O |
1456 | 0 | C3H9NO2 | 91.06333 | 91.109 | C06057 | 3-aminopropane-1,2-diol | NC[C@H](O)CO |
1549 | 0 | C3H5O3R | 89.02387 | 89.070 | C03834 | 3-hydroxymonocarboxylic acid | OC([*])CC(O)=O |
1894 | 0 | C5H11NO | 101.08406 | 101.147 | C10974 | 4-Methylaminobutanal | CNCCCC=O |
1932 | 0 | C6H6NR | 92.05002 | 92.119 | C03084 | 4-Substituted aniline | Nc1ccc([*])cc1 |
In this file, the column name elecCharge will not be associated with any biodb entry field.
Indeed, the biodb entry field for the electrical charge of a molecule is charge, not elecCharge.
Let us verify that.
We first create the connector to this CSV file:
conn <- mybiodb$getFactory()$createConn('comp.csv.file', url=fileUrl)
And get the content of its entries:
entriesDf <- mybiodb$entriesToDataframe(conn$getEntry(conn$getEntryIds()))
## INFO [16:27:16.806] Loading file database "/tmp/Rtmpecc6GF/Rinst483e65d9fc697/biodb/extdata/chebi_extract_with_unknown_column.tsv".
## WARN [16:27:16.809] Column "elecCharge" does not match any biodb field.
## Warning in warn("Column \"%s\" does not match any biodb field.", colname):
## Column "elecCharge" does not match any biodb field.
See table 4 for the content of these entries.
As you can see, no charge field is listed.
accession | formula | monoisotopic.mass | molecular.mass | kegg.compound.id | name | smiles | comp.csv.file.id |
---|---|---|---|---|---|---|---|
1018 | C2H8AsNO3 | 168.97201 | 169.012 | C07279 | 2-Aminoethylarsonate | NCC[As](O)(O)=O | 1018 |
1390 | C8H8O2 | 136.05243 | 136.148 | C06224 | 3,4-Dihydroxystyrene | Oc1ccc(C=C)cc1O | 1390 |
1456 | C3H9NO2 | 91.06333 | 91.109 | C06057 | 3-aminopropane-1,2-diol | NC[C@H](O)CO | 1456 |
1549 | C3H5O3R | 89.02387 | 89.070 | C03834 | 3-hydroxymonocarboxylic acid | OC([*])CC(O)=O | 1549 |
1894 | C5H11NO | 101.08406 | 101.147 | C10974 | 4-Methylaminobutanal | CNCCCC=O | 1894 |
1932 | C6H6NR | 92.05002 | 92.119 | C03084 | 4-Substituted aniline | Nc1ccc([*])cc1 | 1932 |
Now we call the method to define the new association:
conn$setField('charge', 'elecCharge')
The first parameter is the name of the biodb entry field, the second is the name of the column inside the CSV file.
The new column will now be parsed when getting an entry. But first we must remove all entries from memory:
conn$deleteAllEntriesFromVolatileCache()
And then reload the same entries again:
entries2Df <- mybiodb$entriesToDataframe(conn$getEntry(conn$getEntryIds()))
See table 5 for the content of these entries.
A new data frame column is present, named charge.
accession | formula | monoisotopic.mass | molecular.mass | kegg.compound.id | name | smiles | charge | comp.csv.file.id |
---|---|---|---|---|---|---|---|---|
1018 | C2H8AsNO3 | 168.97201 | 169.012 | C07279 | 2-Aminoethylarsonate | NCC[As](O)(O)=O | 0 | 1018 |
1390 | C8H8O2 | 136.05243 | 136.148 | C06224 | 3,4-Dihydroxystyrene | Oc1ccc(C=C)cc1O | 0 | 1390 |
1456 | C3H9NO2 | 91.06333 | 91.109 | C06057 | 3-aminopropane-1,2-diol | NC[C@H](O)CO | 0 | 1456 |
1549 | C3H5O3R | 89.02387 | 89.070 | C03834 | 3-hydroxymonocarboxylic acid | OC([*])CC(O)=O | 0 | 1549 |
1894 | C5H11NO | 101.08406 | 101.147 | C10974 | 4-Methylaminobutanal | CNCCCC=O | 0 | 1894 |
1932 | C6H6NR | 92.05002 | 92.119 | C03084 | 4-Substituted aniline | Nc1ccc([*])cc1 | 0 | 1932 |
Sometimes parsing a value in order to set an existing biodb field is not enough: you need to get a value that does not correspond to any defined biodb field. In this case, you need to define a new field alongside your parsing expression.
For this demonstration we will again use the ChebiExConn connector example from the “Creating a new connector” vignette.
In the ChEBI database, each entry (i.e., each molecule) gets a score (a number of stars) reflecting its curation status.
This field is not present inside the current ChebiExConn connector example.
Let us see that by displaying the content of some entries:
conn <- mybiodb$getFactory()$getConn('chebi.ex')
entryIds <- c('17001', '40304', '64679')
entriesDf <- mybiodb$entriesToDataframe(conn$getEntry(entryIds))
See table 6.
accession | formula | inchi | inchikey | molecular.mass | monoisotopic.mass | name | smiles | charge | chebi.ex.id |
---|---|---|---|---|---|---|---|---|---|
17001 | C9H13N5O4 | InChI=1S/C9H13N5O4/c10-9-13-7-5(8(18)14-9)12-3(1-11-7)6(17)4(16)2-15/h4,6,15-17H,1-2H2,(H4,10,11,13,14,18)/t4-,6+/m1/s1 | YQIFAMYNGGOTFB-XINAWCOVSA-N | 255.2308 | 255.0967 | 7,8-dihydroneopterin | Nc1nc2NCC(=Nc2c(=O)[nH]1)[C@H](O)[C@H](O)CO | 0 | 17001 |
40304 | C10H13N5O5 | InChI=1S/C10H13N5O5/c11-9-13-7-6(8(18)14-9)12-10(19)15(7)5-1-3(17)4(2-16)20-5/h3-5,16-17H,1-2H2,(H,12,19)(H3,11,13,14,18)/t3-,4+,5+/m0/s1 | HCAJQHYUCKICQH-VPENINKCSA-N | 283.2407 | 283.0917 | 8-hydroxy-2’-deoxyguanosine | Nc1nc2n([C@H]3C[C@H](O)[C@@H](CO)O3)c(O)nc2c(=O)[nH]1 | 0 | 40304 |
64679 | C9H18NO11P | InChI=1S/C9H18NO11P/c10-3(8(15)16)2-19-22(17,18)21-9-7(14)6(13)5(12)4(1-11)20-9/h3-7,9,11-14H,1-2,10H2,(H,15,16)(H,17,18)/t3-,4+,5+,6-,7-,9+/m0/s1 | JTBRVTASGISCGJ-RHNOWPELSA-N | 347.2131 | 347.0618 | O-(alpha-D-mannose-1-phosphoryl)-L-serine | N[C@@H](COP(O)(=O)O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]1O)C(O)=O | 0 | 64679 |
In the XML entry content returned by the ChEBI server, this field is stored inside the entityStar element as shown here:
<entityStar>3</entityStar>
You can check that directly inside the XML content of one of the entries, as explained earlier.
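Instead of opening the file in an editor, the element can also be located from R. The sketch below uses a temporary file with hypothetical, heavily shortened content as a stand-in for the cache file; in a real session the path would be the one returned by conn$getCacheFile(id).

```r
# Stand-in for the biodb cache file, with hypothetical shortened content.
xmlFile <- tempfile(fileext = ".xml")
writeLines(paste0("<return><chebiId>CHEBI:17001</chebiId>",
                  "<entityStar>3</entityStar></return>"), xmlFile)

# Read the whole file as one string, since the XML may not be split in lines.
content <- paste(readLines(xmlFile), collapse = "\n")

# Extract the first entityStar element, tags included.
m <- regmatches(content, regexpr("<entityStar>[^<]*</entityStar>", content))
print(m)
```

A regular expression is enough for a quick visual check, but the actual parsing should rely on an XPath expression.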
To get this number of stars we define the new field and its parsing expression inside the following YAML file:
nStarsDefFile <- system.file("extdata", "chebi_ex_stars_field.yml", package='biodb')
Here is its content:
databases:
  chebi.ex:
    parsing.expr:
      n_stars: //chebi:return/chebi:entityStar

fields:
  n_stars:
    description: The ChEBI example stars indicator.
    class: integer
You already know how to define the parsing expression inside the YAML file. The XPath expression is a bit longer than for the electrical charge, but the principle is the same.
What is new is the fields section, in which we define the new fields.
The name of the field (n_stars) is used as a key inside the section.
Then several keys are used to define the field; see table 7 for a description of those keys.
Key | Description |
---|---|
alias | Other possible names of the field. |
description | A description of the field. |
class | The R class. One of integer, character, double, logical. |
type | A name of a group for related fields. Existing ones are name and mass, but you can create your own. |
card | The cardinality. Either one (single value) or many (vector). |
case.insensitive | If true then the value is case insensitive. |
forbids.duplicates | If true and the cardinality is many, no duplicate values will be accepted. |
lower.case | If true, the value will be put in lower case. |
allowed.values | If this vector is not empty, then only the values listed in this vector will be allowed for this field. |
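As an illustration of how several of these keys combine, here is a sketch of a field definition that uses more of them. The field name curation.level and all its values are invented for this example, and the definition is only parsed with the yaml package to show its structure; it is not loaded into biodb here, so treat the exact value syntax as an assumption to check against the biodb documentation.

```r
library(yaml)

# Hypothetical field definition: the field name and its values are invented
# for this sketch and do not exist in biodb.
fieldText <- "
fields:
  curation.level:
    description: A hypothetical curation level.
    class: character
    card: one
    lower.case: true
    allowed.values: [low, medium, high]
"

def <- yaml.load(fieldText)
field <- def$fields$curation.level

# Display the parsed definition as an R list.
str(field)
```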
We can now load the new definition:
mybiodb$loadDefinitions(nStarsDefFile)
Delete the existing connector:
mybiodb$getFactory()$deleteConn(conn)
## INFO [16:27:18.308] Connector "chebi.ex" deleted.
Recreate the connector and reload the same entries:
conn <- mybiodb$getFactory()$createConn('chebi.ex')
entriesDf <- mybiodb$entriesToDataframe(conn$getEntry(entryIds))
See table 8.
Now a column for the n_stars field (displayed as n.stars in the data frame) indicates the number of stars for each entry.
accession | formula | inchi | inchikey | molecular.mass | monoisotopic.mass | name | smiles | charge | n.stars | chebi.ex.id |
---|---|---|---|---|---|---|---|---|---|---|
17001 | C9H13N5O4 | InChI=1S/C9H13N5O4/c10-9-13-7-5(8(18)14-9)12-3(1-11-7)6(17)4(16)2-15/h4,6,15-17H,1-2H2,(H4,10,11,13,14,18)/t4-,6+/m1/s1 | YQIFAMYNGGOTFB-XINAWCOVSA-N | 255.2308 | 255.0967 | 7,8-dihydroneopterin | Nc1nc2NCC(=Nc2c(=O)[nH]1)[C@H](O)[C@H](O)CO | 0 | 3 | 17001 |
40304 | C10H13N5O5 | InChI=1S/C10H13N5O5/c11-9-13-7-6(8(18)14-9)12-10(19)15(7)5-1-3(17)4(2-16)20-5/h3-5,16-17H,1-2H2,(H,12,19)(H3,11,13,14,18)/t3-,4+,5+/m0/s1 | HCAJQHYUCKICQH-VPENINKCSA-N | 283.2407 | 283.0917 | 8-hydroxy-2’-deoxyguanosine | Nc1nc2n([C@H]3C[C@H](O)[C@@H](CO)O3)c(O)nc2c(=O)[nH]1 | 0 | 3 | 40304 |
64679 | C9H18NO11P | InChI=1S/C9H18NO11P/c10-3(8(15)16)2-19-22(17,18)21-9-7(14)6(13)5(12)4(1-11)20-9/h3-7,9,11-14H,1-2,10H2,(H,15,16)(H,17,18)/t3-,4+,5+,6-,7-,9+/m0/s1 | JTBRVTASGISCGJ-RHNOWPELSA-N | 347.2131 | 347.0618 | O-(alpha-D-mannose-1-phosphoryl)-L-serine | N[C@@H](COP(O)(=O)O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]1O)C(O)=O | 0 | 3 | 64679 |
Do not forget to terminate your biodb instance once you are done with it:
mybiodb$terminate()
## INFO [16:27:18.544] Closing BiodbMain instance...
## INFO [16:27:18.545] Connector "comp.csv.file" deleted.
## INFO [16:27:18.547] Connector "chebi.ex" deleted.
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biodb_1.15.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 sass_0.4.9 utf8_1.2.4
## [4] generics_0.1.3 bitops_1.0-9 stringi_1.8.4
## [7] RSQLite_2.3.7 hms_1.1.3 digest_0.6.37
## [10] magrittr_2.0.3 evaluate_1.0.1 bookdown_0.41
## [13] fastmap_1.2.0 blob_1.2.4 plyr_1.8.9
## [16] jsonlite_1.8.9 progress_1.2.3 DBI_1.2.3
## [19] BiocManager_1.30.25 httr_1.4.7 fansi_1.0.6
## [22] XML_3.99-0.17 jquerylib_0.1.4 cli_3.6.3
## [25] rlang_1.1.4 chk_0.9.2 crayon_1.5.3
## [28] dbplyr_2.5.0 bit64_4.5.2 withr_3.0.2
## [31] cachem_1.1.0 yaml_2.3.10 tools_4.5.0
## [34] memoise_2.0.1 dplyr_1.1.4 filelock_1.0.3
## [37] curl_5.2.3 vctrs_0.6.5 R6_2.5.1
## [40] BiocFileCache_2.15.0 lifecycle_1.0.4 stringr_1.5.1
## [43] bit_4.5.0 pkgconfig_2.0.3 pillar_1.9.0
## [46] bslib_0.8.0 glue_1.8.0 Rcpp_1.0.13
## [49] lgr_0.4.4 xfun_0.48 tibble_3.2.1
## [52] tidyselect_1.2.1 knitr_1.48 htmltools_0.5.8.1
## [55] rmarkdown_2.28 compiler_4.5.0 prettyunits_1.2.0
## [58] askpass_1.2.1 RCurl_1.98-1.16 openssl_2.2.2