The goal of the {dm} package and the dm class that comes with it is to make your life easier when you are dealing with data from several different tables.
Let’s take a look at the dm class.
dm
The dm class consists of a collection of tables and metadata about the tables, such as the table names, the columns, and the primary and foreign keys.
All tables in a dm must be obtained from the same data source; CSV files and spreadsheets would need to be imported to data frames in R.
Examples of dm objects
There are currently several options available for creating a dm object. The relevant functions for creating dm objects are:
dm()
as_dm()
new_dm()
dm_from_src()
To illustrate these options, we will now create the same dm in several different ways. We can use the tables from the well-known {nycflights13} package.
Create a dm object directly by providing data frames to dm():
library(nycflights13)
#> Error in library(nycflights13): there is no package called 'nycflights13'
library(dm)
dm(airlines, airports, flights, planes, weather)
#> Error in .f(.x[[i]], ...): object 'airlines' not found
dm
Start with an empty dm object that has been created with dm() or new_dm(), and add tables to that object:
library(nycflights13)
#> Error in library(nycflights13): there is no package called 'nycflights13'
library(dm)
empty_dm <- dm()
empty_dm
#> dm()
dm_add_tbl(empty_dm, airlines, airports, flights, planes, weather)
#> Error in list2(...): object 'airlines' not found
Turn a named list of tables into a dm with as_dm():
as_dm(list(airlines = airlines,
           airports = airports,
           flights = flights,
           planes = planes,
           weather = weather))
#> Error in as_dm(list(airlines = airlines, airports = airports, flights = flights, : object 'airlines' not found
src
into a dm
Squeeze all (or a subset of) tables belonging to a src object into a dm using dm_from_src():
sqlite_src <- dbplyr::nycflights13_sqlite()
#> Error in (function (cond) : error in evaluating the argument 'drv' in selecting a method for function 'dbConnect': there is no package called 'RSQLite'
flights_dm <- dm_from_src(sqlite_src)
#> Error in dm_from_src(sqlite_src): object 'sqlite_src' not found
flights_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 4
The function dm_from_src(src, table_names = NULL) includes all available tables on a source in the dm object. This means that you can use it, for example, on a Postgres database that you access via src_postgres() (with the appropriate arguments dbname, host, port, …), to produce a dm object with all the tables on the database.
Another way of creating a dm object is calling new_dm() on a list of tbl objects:
base_dm <- new_dm(list(trees = trees, mtcars = mtcars))
base_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `trees`, `mtcars`
#> Columns: 14
#> Primary keys: 0
#> Foreign keys: 0
This constructor is optimized for speed and does not perform integrity checks. Use it with caution, and validate the result with validate_dm() if necessary.
validate_dm(base_dm)
We can get the list of tables with dm_get_tables() and the src object with dm_get_src().
To pull a specific table from a dm, use tbl(). Note that this method is deprecated in favor of dm[[table_name]], as the warning in the output shows:
tbl(flights_dm, "airports")
#> Warning: `tbl.dm()` was deprecated in dm 0.2.0.
#> Use `dm[[table_name]]` instead to access a specific table.
#> # A tibble: 86 x 8
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 ALB Albany Intl 42.7 -73.8 285 -5 A America/New…
#> 2 ATL Hartsfield Jackson At… 33.6 -84.4 1026 -5 A America/New…
#> 3 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A America/Chi…
#> 4 BDL Bradley Intl 41.9 -72.7 173 -5 A America/New…
#> 5 BHM Birmingham Intl 33.6 -86.8 644 -6 A America/Chi…
#> 6 BNA Nashville Intl 36.1 -86.7 599 -6 A America/Chi…
#> 7 BOS General Edward Lawren… 42.4 -71.0 19 -5 A America/New…
#> 8 BTV Burlington Intl 44.5 -73.2 335 -5 A America/New…
#> 9 BUF Buffalo Niagara Intl 42.9 -78.7 724 -5 A America/New…
#> 10 BUR Bob Hope 34.2 -118. 778 -8 A America/Los…
#> # … with 76 more rows
But how can we use {dm} functions to manage the primary keys of the tables in a dm object?
Primary keys of dm objects
Some useful functions for managing primary key settings are:
dm_add_pk()
dm_get_all_pks()
dm_rm_pk()
dm_enum_pk_candidates()
If you created a dm object according to the examples in “Examples of dm objects”, your object does not yet have any primary keys set. So let’s add one.
We use the nycflights13 tables, i.e. flights_dm from above.
dm_has_pk(flights_dm, airports)
#> [1] TRUE
flights_dm_with_key <- dm_add_pk(flights_dm, airports, faa)
#> Error: Table `airports` already has a primary key. Use `force = TRUE` to change the existing primary key.
flights_dm_with_key
#> Error in eval(expr, envir, enclos): object 'flights_dm_with_key' not found
The dm now has a primary key:
dm_has_pk(flights_dm_with_key, airports)
#> Error in is_dm(dm): object 'flights_dm_with_key' not found
To get an overview of all tables with primary keys, use dm_get_all_pks():
dm_get_all_pks(flights_dm_with_key)
#> Error in is_dm(dm): object 'flights_dm_with_key' not found
Remove a primary key:
dm_rm_pk(flights_dm_with_key, airports) %>%
  dm_has_pk(airports)
#> Error in is_dm(dm): object 'flights_dm_with_key' not found
If you still need to get to know your data better, and it is already available in the form of a dm object, you can use the dm_enum_pk_candidates() function to find out which columns of a table are unique keys:
dm_enum_pk_candidates(flights_dm_with_key, airports)
#> Error in is_dm(dm): object 'flights_dm_with_key' not found
The flights table does not have any one-column primary key candidates:
dm_enum_pk_candidates(flights_dm_with_key, flights) %>% dplyr::count(candidate)
#> Error in is_dm(dm): object 'flights_dm_with_key' not found
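The idea behind enumerating one-column key candidates can be sketched in base R. This is only an illustration on a hypothetical toy table (`airports_toy`) with a hypothetical helper (`is_key_candidate()`), not how {dm} implements dm_enum_pk_candidates(); treating missing values as disqualifying is an assumption made for this sketch.

```r
# Hypothetical toy table standing in for a table inside a dm.
airports_toy <- data.frame(
  faa   = c("ALB", "ATL", "AUS"),
  tzone = c("America/New_York", "America/New_York", "America/Chicago"),
  stringsAsFactors = FALSE
)

# Assumption for this sketch: a column qualifies as a one-column key
# candidate if it contains no missing values and no duplicated values.
is_key_candidate <- function(tbl, col) {
  values <- tbl[[col]]
  !anyNA(values) && anyDuplicated(values) == 0L
}

# Check every column; the result is a named logical vector.
candidates <- vapply(
  names(airports_toy),
  function(col) is_key_candidate(airports_toy, col),
  logical(1)
)
candidates
```

Here `faa` qualifies because every code occurs once, while `tzone` does not because a time zone repeats.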
dm_add_pk() has a check argument. If set to TRUE, the function checks whether the column of the table given by the user is unique. For performance reasons, the default is check = FALSE. See also dm_examine_constraints() for checking all constraints in a dm.
dm_add_pk(flights_dm, airports, tzone, check = TRUE)
#> Error: (`tzone`) not a unique key of `airports`.
Useful functions for managing foreign key relations include:
dm_add_fk()
dm_get_all_fks()
dm_rm_fk()
dm_enum_fk_candidates()
Now it gets (even more) interesting: we want to define relations between different tables. With the dm_add_fk() function you can define which column of which table points to another table’s column.
This is done by choosing a foreign key from one table that will point to a primary key of another table. The primary key of the referenced table must be set with dm_add_pk(). dm_add_fk() will find the primary key column of the referenced table by itself and make the indicated column of the child table point to it.
flights_dm_with_key %>% dm_add_fk(flights, origin, airports)
#> Error in is_dm(dm): object 'flights_dm_with_key' not found
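To make the lookup step concrete, here is a minimal base R sketch of resolving the parent key from stored primary key metadata. The names `pks`, `fks` and `add_fk()` are hypothetical and the list-based storage is an assumption for illustration; this does not reflect how {dm} stores keys internally.

```r
# Hypothetical metadata: one primary key column per table.
pks <- list(airports = "faa", planes = "tailnum")

# Empty table of foreign key relations.
fks <- data.frame(
  child_table    = character(),
  child_fk_col   = character(),
  parent_table   = character(),
  parent_key_col = character(),
  stringsAsFactors = FALSE
)

# Register a relation; the parent column is looked up, not supplied.
add_fk <- function(fks, pks, child_table, child_col, parent_table) {
  parent_key <- pks[[parent_table]]
  if (is.null(parent_key)) {
    stop("Table `", parent_table, "` has no primary key set.")
  }
  rbind(fks, data.frame(
    child_table    = child_table,
    child_fk_col   = child_col,
    parent_table   = parent_table,
    parent_key_col = parent_key,
    stringsAsFactors = FALSE
  ))
}

fks <- add_fk(fks, pks, "flights", "origin", "airports")
fks
```

Because the parent column comes from the metadata, the caller only names the child column and the parent table; a parent table without a primary key is rejected.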
This will throw an error:
flights_dm %>% dm_add_fk(flights, origin, airports)
#> Error: (`origin`) is alreay a foreign key of table `flights` into table `airports`.
Let’s create a dm object with a foreign key relation to work with later on:
flights_dm_with_fk <- dm_add_fk(flights_dm_with_key, flights, origin, airports)
#> Error in is_dm(dm): object 'flights_dm_with_key' not found
What if we tried to add another foreign key relation from flights to airports to the object? Column dest might work, since it also contains airport codes:
flights_dm_with_fk %>% dm_add_fk(flights, dest, airports, check = TRUE)
#> Error in is_dm(dm): object 'flights_dm_with_fk' not found
Checks are opt-in and executed only if check = TRUE. You can still add a foreign key with the default check = FALSE. See also dm_examine_constraints() for checking all constraints in a dm.
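What such a foreign key check amounts to can be sketched in base R: every value in the child column should also occur in the parent key column. The toy tables and the helper `is_fk_candidate()` are hypothetical, and ignoring missing values in the child column is an assumption of this sketch rather than documented {dm} behavior.

```r
# Hypothetical toy tables; "XXX" is a destination missing from the parent.
flights_toy <- data.frame(
  origin = c("EWR", "JFK", "LGA"),
  dest   = c("ATL", "BOS", "XXX"),
  stringsAsFactors = FALSE
)
airports_toy <- data.frame(
  faa = c("EWR", "JFK", "LGA", "ATL", "BOS"),
  stringsAsFactors = FALSE
)

# TRUE if all non-missing child values are present in the parent key.
is_fk_candidate <- function(child, child_col, parent, parent_col) {
  child_values <- child[[child_col]]
  child_values <- child_values[!is.na(child_values)]  # assumption: NAs ignored
  all(child_values %in% parent[[parent_col]])
}

origin_ok <- is_fk_candidate(flights_toy, "origin", airports_toy, "faa")
dest_ok   <- is_fk_candidate(flights_toy, "dest", airports_toy, "faa")
```

In this toy data `origin` passes, while `dest` fails because one destination code has no matching row in the parent table.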
Get an overview of all foreign key relations with dm_get_all_fks():
dm_get_all_fks(dm_nycflights13(cycle = TRUE))
#> # A tibble: 5 x 4
#> child_table child_fk_cols parent_table parent_key_cols
#> <chr> <keys> <chr> <keys>
#> 1 flights carrier airlines carrier
#> 2 flights origin airports faa
#> 3 flights dest airports faa
#> 4 flights tailnum planes tailnum
#> 5 flights origin, time_hour weather origin, time_hour
Remove foreign key relations with dm_rm_fk() (parameter columns = NULL means that all relations will be removed, with a message):
flights_dm_with_fk %>%
  dm_rm_fk(table = flights, columns = dest, ref_table = airports) %>%
  dm_get_fk(flights, airports)
#> Error in is_dm(dm): object 'flights_dm_with_fk' not found
flights_dm_with_fk %>%
  dm_rm_fk(flights, origin, airports) %>%
  dm_get_fk(flights, airports)
#> Error in is_dm(dm): object 'flights_dm_with_fk' not found
flights_dm_with_fk %>%
  dm_rm_fk(flights, columns = NULL, airports) %>%
  dm_get_fk(flights, airports)
#> Error in is_dm(dm): object 'flights_dm_with_fk' not found
Since the primary keys are defined in the dm object, you do not usually need to provide the referenced column name of ref_table.
Another function for getting to know your data better (cf. dm_enum_pk_candidates() in “Primary keys of dm objects”) is dm_enum_fk_candidates(). Use it to get an overview of foreign key candidates that point from one table to another:
dm_enum_fk_candidates(flights_dm_with_key, weather, airports)
#> Error in is_dm(dm): object 'flights_dm_with_key' not found