Analysis on the heaviness of R package dependencies

The R packages under analysis were retrieved from CRAN/Bioconductor on 2021-10-28. There are <%=sum(!grepl('bioconductor', df$repository))%> packages from CRAN and <%=length(grep('bioconductor', df$repository))%> packages from Bioconductor (bioc version 3.14).

Measures in the table

In the DESCRIPTION file of a package denoted as P, its direct dependency packages are listed in the Depends, Imports, LinkingTo, Suggests and Enhances fields. We define the following dependency categories for package P:

  • Parent packages: the packages listed in the Depends, Imports and LinkingTo fields (package category B in the following diagram, the same as the packages in the red box).
  • Strong dependency packages: all packages found by recursively looking up parent packages (package categories A and B). They are also called upstream packages.
  • All dependency packages: all packages found by recursively looking up parent packages where, at the level of package P only, the packages in Suggests and Enhances are also treated as parents (package categories A, B, C and D). This simulates the number of strong dependencies P would have if all of its direct dependency packages were put into Depends/Imports.
  • Child packages: the packages whose parent packages include package P (package category E).
  • Downstream packages: all packages found by recursively looking up child packages (package categories E and F).
<%= paste(readLines("~/project/development/pkgndep/inst/website/dependency_diagram.svg"), collapse = "\n") %>
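
The categories above can be illustrated with a minimal sketch in base R. The package names (`P`, `A1`, `B1`, ...) and the `strong_fields` list are hypothetical toy data chosen to mirror the category letters, and `reach()` is an illustrative helper, not a function from the analysis code.

```r
# Hypothetical toy data: for each package, the packages named in its
# Depends, Imports and LinkingTo fields.
strong_fields <- list(
  P  = c("B1", "B2"),
  B1 = c("A1"),
  B2 = c("A1", "A2"),
  A1 = character(0),
  A2 = character(0),
  C1 = c("D1"),
  D1 = character(0),
  E1 = c("P"),
  F1 = c("E1")
)
# Packages named in the Suggests/Enhances fields of P (assumed).
weak_fields_of_P <- c("C1")

# Collect every package reachable from `start` by following `edges`.
reach <- function(start, edges) {
  seen <- character(0)
  queue <- start
  while (length(queue) > 0) {
    pkg <- queue[1]
    queue <- queue[-1]
    if (pkg %in% seen) next
    seen <- c(seen, pkg)
    queue <- c(queue, edges[[pkg]])
  }
  sort(seen)
}

# Reverse edges: for each package, the packages that list it as a parent.
child_edges <- sapply(names(strong_fields), function(p) {
  names(strong_fields)[vapply(strong_fields, function(x) p %in% x, logical(1))]
}, simplify = FALSE)

parents    <- strong_fields[["P"]]                          # category B
strong     <- reach(parents, strong_fields)                 # categories A and B
all_deps   <- reach(c(parents, weak_fields_of_P), strong_fields)  # A, B, C, D
children   <- child_edges[["P"]]                            # category E
downstream <- reach(children, child_edges)                  # categories E and F

print(list(parents = parents, strong = strong, all = all_deps,
           children = children, downstream = downstream))
```

The Suggests/Enhances entries are added only at the level of P; below that level only strong fields are followed, which matches the "All dependency packages" definition.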

Next we define various measures for heaviness:

  • Heaviness of a package on its child package: If package A is a parent of package P (i.e. package P strongly and directly depends on A), the heaviness of A on P is calculated as n1 - n2, where n1 is the number of strong dependency packages of P and n2 is the number of strong dependency packages of P if A is moved to Suggests. In other words, the heaviness measures the number of additional required packages that A brings to P.
  • Heaviness of a package on all its child packages: For package P, assume it has K child packages and the kth child is denoted as A_k. Denote n_1k as the number of strong dependency packages of A_k and n_2k as the number of strong dependency packages of A_k if P is moved to its Suggests. The heaviness of P on its child packages is calculated as sum(n_1k - n_2k)/K. Here the heaviness measures the average number of additional packages that P brings to its child packages.
  • Heaviness of a package on all its downstream packages: The definition is similar to the heaviness of a package on all its child packages, except that "child packages" are replaced with "downstream packages".
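
As a rough illustration of the first two measures, the sketch below computes them on hypothetical toy data in base R. It assumes that n1 and n2 count strong dependency packages (so that n1 - n2 reflects everything a parent pulls in); the names `strong_fields`, `n_strong()`, `heaviness()` and `heaviness_on_children()` are illustrative and not taken from the analysis code.

```r
# Hypothetical toy data: packages named in Depends/Imports/LinkingTo.
strong_fields <- list(
  P = c("A", "B"),
  Q = c("A"),
  A = c("X", "Y", "Z"),
  B = c("X"),
  X = character(0),
  Y = character(0),
  Z = character(0)
)

# Number of strong dependency packages of `pkg` (recursive parents).
n_strong <- function(pkg, fields) {
  seen <- character(0)
  queue <- fields[[pkg]]
  while (length(queue) > 0) {
    cur <- queue[1]
    queue <- queue[-1]
    if (cur %in% seen) next
    seen <- c(seen, cur)
    queue <- c(queue, fields[[cur]])
  }
  length(seen)
}

# Heaviness of `parent` on `child`: n1 - n2, where n2 is counted after
# dropping `parent` from the strong fields of `child` (moving it to Suggests).
heaviness <- function(parent, child, fields) {
  n1 <- n_strong(child, fields)
  moved <- fields
  moved[[child]] <- setdiff(moved[[child]], parent)
  n2 <- n_strong(child, moved)
  as.numeric(n1 - n2)
}

# Average heaviness of `pkg` over its K child packages.
heaviness_on_children <- function(pkg, fields) {
  children <- names(fields)[vapply(fields, function(x) pkg %in% x, logical(1))]
  if (length(children) == 0) return(0)
  mean(vapply(children, function(ch) heaviness(pkg, ch, fields), numeric(1)))
}

print(heaviness("A", "P", strong_fields))         # A brings A, Y, Z to P
print(heaviness("B", "P", strong_fields))         # B brings only itself (X also comes via A)
print(heaviness_on_children("A", strong_fields))  # average over children P and Q
```

Note that `X` does not count toward the heaviness of `B` on `P`, because `P` would still need `X` through `A`. The heaviness on downstream packages would follow the same pattern with the set of downstream packages in place of `children`.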

When plotting the heaviness on child packages versus the number of child packages (see the "Dependency plot" tab), note that the heaviness is an average, so a package with only a few child packages can easily obtain a large value. Thus, when the dependency table is ordered by heaviness, the packages at the top are most likely those with few child packages (you can try ordering the dependency table below by the column "Heaviness on child packages"). Despite their high heaviness, these packages have very few child packages, so their effect on other packages is small. What matters more for this analysis is to identify the packages that affect many other packages. Therefore, we adjusted the original definition of "heaviness on children" to sum(n_1k - n_2k)/(10 + K), where 10 is an empirical value that greatly decreases the heaviness for packages with a small number of children. The heaviness on downstream packages is adjusted in the same way.
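
The effect of the adjustment can be seen in a small numeric sketch. The per-child heaviness values below are made up for illustration, and `adjusted_heaviness()` is a hypothetical helper that only restates the formula sum(n_1k - n_2k)/(10 + K).

```r
# Adjusted heaviness: sum of per-child heaviness divided by (10 + K).
# `h` holds the per-child values n_1k - n_2k; `prior` is the empirical 10.
adjusted_heaviness <- function(h, prior = 10) {
  sum(h) / (prior + length(h))
}

# Made-up examples: a package with 2 children and one with 200 children.
few_children  <- c(40, 40)
many_children <- rep(30, 200)

comparison <- data.frame(
  K        = c(length(few_children), length(many_children)),
  raw      = c(mean(few_children), mean(many_children)),
  adjusted = c(adjusted_heaviness(few_children),
               adjusted_heaviness(many_children))
)
print(comparison)
```

By the raw average the package with two children ranks first, while after the adjustment the package with 200 children ranks first, because adding 10 to the denominator has little effect when K is large.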

Other measures are:

  • Gini index: the Gini index of the heaviness values from the parent packages. To avoid the scenario where the majority of the values are zero and only a few are 1, 2 is added to each heaviness value before the Gini index is calculated. If the Gini index of a package is close to 1, the package most likely has a few heavy parent packages.
  • High heaviness: packages with adjusted heaviness on child packages higher than 20.
  • Median heaviness: packages with adjusted heaviness on child packages between 10 and 20.
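
A minimal sketch of these measures in base R follows. The Gini formula used here is the standard one for sorted non-negative values and is an assumption, since the text does not state which estimator is used; `gini_index()` and `heaviness_class()` are hypothetical helpers, the "low" label is invented for illustration, and the treatment of values exactly equal to 10 or 20 is also assumed.

```r
# Gini index of parent heaviness values, with 2 added to each value first.
# Assumption: standard formula sum((2i - n - 1) * x_i) / (n * sum(x)) on sorted x.
gini_index <- function(heaviness, offset = 2) {
  x <- sort(heaviness + offset)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# Classify adjusted heaviness on child packages using the thresholds 10 and 20.
heaviness_class <- function(adjusted) {
  ifelse(adjusted > 20, "high",
         ifelse(adjusted >= 10, "median", "low"))
}

even_parents   <- c(3, 3, 3, 3)      # every parent contributes equally
skewed_parents <- c(0, 0, 0, 0, 60)  # one heavy parent dominates

print(gini_index(even_parents))
print(gini_index(skewed_parents))
print(heaviness_class(c(5, 12, 25)))
```

With equal contributions the index is zero, and it grows as a single parent accounts for more of the total, which is the pattern the Gini index column is meant to flag.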

<% improvable_str = ifelse(only_improvable, 'on', '') %> <%= as.character(knitr::kable(df2, format = "html", row.names = FALSE, escape = FALSE, table.attr = "id='dependency-table' class='table table-striped'", col.names = c(qq("Package"), "Repository", qq("Number of strong dependency packages"), qq("Number of all dependency packages"), qq("Number of parent packages"), qq("Max heaviness from parent packages"), qq("Heaviness on child packages"), qq("Number of child packages"), qq("Heaviness on downstream packages"), qq("Number of downstream packages")), align = c("l", rep("r", ncol(df2) - 1)))) %> <% if(package == "") { %> <% if(order_by == "adjusted_heaviness_on_children") order_by = "" %>
records per page, showing <%=ind[1]%> to <%=ind[length(ind)]%> of <%=nrow(df)%> packages.
<% nr = nrow(df); if(nr > records_per_page) { %> <%= page_select(page, ceiling(nr/records_per_page), qq("order_by=@{order_by}&improvable=@{improvable_str}")) %> <% } %> <% } %>
