Methodology

Tensor Eigenvalue Decomposition (EVD)

Recall the eigenvalue decomposition (EVD) of an \(N\times N\) positive semi-definite matrix \(\textbf{K}\) which has the form

\[\begin{equation*} \textbf{K} = \textbf{V}\textbf{D}\textbf{V}' = \sum_{k=1}^N{d_k}\boldsymbol{v}_k\boldsymbol{v}'_k \end{equation*}\]

where \(\textbf{V}=[\boldsymbol{v}_1,...,\boldsymbol{v}_N]\) is an orthonormal matrix (i.e., \(\textbf{V}'\textbf{V}=\textbf{I}\)) whose columns \(\boldsymbol{v}_k\) (\(k=1,...,N\)) are the eigenvectors and \(\textbf{D}=diag(d_1,...,d_N)\) is a diagonal matrix with the eigenvalues \(d_1\geq\cdots \geq d_N\geq0\).

Kronecker product

Assume that \(\textbf{K}\) can be expressed as the Kronecker product (‘\(\otimes\)’) of two symmetric positive semi-definite matrices, \(\textbf{K}_1\) and \(\textbf{K}_2\) of dimensions \(n_1\) and \(n_2\) such that \(n_1\times n_2=N\), respectively. This is

\[\begin{equation*} \textbf{K} = \textbf{K}_1\otimes\textbf{K}_2 \end{equation*}\]

Let the EVD of the two matrices in the right-hand side be \(\textbf{K}_1 = \textbf{V}_1\textbf{D}_1\textbf{V}'_1\) and \(\textbf{K}_2 = \textbf{V}_2\textbf{D}_2\textbf{V}'_2\), respectively. Using properties of Kronecker products (e.g., Searle 1982, p. 265), it can be shown that the eigenvectors and eigenvalues of \(\textbf{K}\) are Kronecker products of the eigenvectors and eigenvalues of \(\textbf{K}_1\) and \(\textbf{K}_2\). Specifically, we have that:

\[\begin{equation*} \textbf{K} = (\textbf{V}_1\otimes\textbf{V}_2)(\textbf{D}_1\otimes\textbf{D}_2)(\textbf{V}_1\otimes\textbf{V}_2)' \end{equation*}\]

where \(\textbf{V} = \textbf{V}_1\otimes\textbf{V}_2\) are the eigenvectors and \(\textbf{D} = \textbf{D}_1\otimes\textbf{D}_2\) are the eigenvalues.

Hadamard product

Consider an \(n\times n\) matrix formed as a Hadamard product (‘\(\odot\)’) of two matrices involving \(\textbf{K}_1\) and \(\textbf{K}_2\),

\[\begin{equation*} (\textbf{Z}_1\textbf{K}_1\textbf{Z}'_1)\odot(\textbf{Z}_2\textbf{K}_2\textbf{Z}'_2) \end{equation*}\]

where \(\textbf{Z}_1\) and \(\textbf{Z}_2\) are (\(n\times n_1\) and \(n\times n_2\), respectively) incidence matrices connecting rows/columns in the Hadamard with the rows/columns of \(\textbf{K}_1\) and \(\textbf{K}_2\), respectively. This Hadamard \((\textbf{Z}_1\textbf{K}_1\textbf{Z}'_1)\odot(\textbf{Z}_2\textbf{K}_2\textbf{Z}'_2)\) is a sub-matrix of the Kronecker \(\textbf{K}_1\otimes\textbf{K}_2\).

Therefore, it can be shown that

\[\begin{equation*} (\textbf{Z}_1\textbf{K}_1\textbf{Z}'_1)\odot(\textbf{Z}_2\textbf{K}_2\textbf{Z}'_2) = \tilde{\textbf{V}}\tilde{\textbf{D}}\tilde{\textbf{V}}' \end{equation*}\]

where \(\tilde{\textbf{V}} = (\textbf{Z}_1\star\textbf{Z}_2)(\textbf{V}_1\otimes\textbf{V}_2) = [\tilde{\boldsymbol{v}}_1,...,\tilde{\boldsymbol{v}}_N]\) are the eigenvectors (each of length \(n\) and not necessary orthogonal/orthonormal) and \(\tilde{\textbf{D}} = \textbf{D}_1\otimes\textbf{D}_2 = diag(\tilde{d}_1,...,\tilde{d}_N)\) are the eigenvalues.

Here, the term \(\textbf{Z}_1\star\textbf{Z}_2\) is an \(n\times N\) matrix obtained as the “face-splitting product” (aka “transposed Khatri–Rao product”) of matrices \(\textbf{Z}1\) and \(\textbf{Z}_2\), which is defined as a row-by-row Kronecker product

\[\begin{equation} \textbf{Z}_1\star\textbf{Z}_2 = \begin{pmatrix} \boldsymbol{z}_{11}\otimes\boldsymbol{z}_{12} \\ \boldsymbol{z}_{21}\otimes\boldsymbol{z}_{22}\\ \vdots \\ \boldsymbol{z}_{n1}\otimes\boldsymbol{z}_{n2} \end{pmatrix} \end{equation}\]

with \(\boldsymbol{z}_{i1}\) and \(\boldsymbol{z}_{i2}\) being the \(i^{th}\) row of \(\textbf{Z}1\) and \(\textbf{Z}_2\), respectively.

Low-rank approximation from the EVD

It holds that the sum of the \(N\) eigenvalues of the matrix \(\textbf{K}\) equals to the sum of the diagonal values of \(\textbf{K}\), this is \(\sum_{k=1}^N{d_k} = trace(\textbf{K})\).

Therefore, for \(\textbf{K}\) being a variance structure matrix, we say that the term \(d_k/\sum_{k=1}^N{d_j}\) is the proportion of the total variability of \(\textbf{K}\) associated to the \(k^{th}\) eigenvector. A low-rank approximation (\(\tilde{\textbf{K}}\)) of \(\textbf{K}\) can be formed by retaining the top \(m\) eigenvectors (with \(m<N\)), those that jointly explain a proportion \(\alpha\) of the total variance (\(0<\alpha\leq1\), e.g., \(\alpha=0.95\)). This approximation is obtained by summing over the first \(m\) terms in the EVD of \(\textbf{K}\) above described, this is \(\tilde{\textbf{K}}_{\alpha}=\sum_{k=1}^m{d_k}\boldsymbol{v}_k\boldsymbol{v}'_k\). This is also equivalent to

\[\begin{equation*} \tilde{\textbf{K}}_{\alpha} = \tilde{\textbf{V}}_{\alpha}\tilde{\textbf{D}}_{\alpha}\tilde{\textbf{V}}_{\alpha} \end{equation*}\]

where \(\tilde{\textbf{V}}_{\alpha}=[\tilde{\boldsymbol{v}}_1,...,\tilde{\boldsymbol{v}}_m]\) and \(\tilde{\textbf{D}}_{\alpha}=diag(d_1,...,d_m)\) are matrices formed with the top \(m\) eigenvectors and eigenvalues in \(\textbf{V}\) and \(\textbf{D}\), respectively, explaining a proportion α of the variance.

Algorithm

The tensorEVD algorithm derives an approximate decomposition of a Hadamard product \((\textbf{Z}_1\textbf{K}_1\textbf{Z}'_1)\odot(\textbf{Z}_2\textbf{K}_2\textbf{Z}'_2) = \tilde{\textbf{V}}\tilde{\textbf{D}}\tilde{\textbf{V}}'\) using as inputs:

  • Covariance structures: K1 and K2 of dimensions \(n_1\) and \(n_2\), respectively.
  • IDs: ID1 and ID2 are \(n\)-vectors (\(n\) here is the sample size) mapping from observations to the rows and columns of K1 and K2, respectively. (The row- and column-numbers of K1 and K2 are the unique entries of ID1 and ID2, respectively.) These IDs are used to form the incidence matrices \(\textbf{Z}_1\) and \(\textbf{Z}_1\). For instance, the matrix \(\textbf{Z}_1\textbf{K}_1\textbf{Z}'_1\) can be obtained by indexing rows and columns of K1 by ID1, this is K1[ID1,ID1].
  • Proportion of variance: alpha is a proportion (\(0<\alpha\leq1\), e.g., \(\alpha=0.95\)) to build only the eigenvectors needed to achieve such proportion of variance. A value alpha = 1 will build all the \(N\) eigenvectors.

The following script shows how to perform EVD using the tensorEVD() function

EVD = tensorEVD(K1, K2, ID1, ID2, alpha = 0.95)
ncol(EVD$vectors)             # Number of eigenvectors
sum(EVD$values)/EVD$totalVar  # Variance explained

Approximation accuracy

We measure the approximation error by calculating two metrics:

  1. Frobenius norm \(‖\textbf{A}‖_F\) (Golub and Van Loan 1996) of the matrix \(\textbf{A} = \textbf{K}-\tilde{\textbf{K}}_{\alpha}\) which is defined as the squared root of the sum of the squares of all the entries, i.e., \[\begin{equation*} ‖\textbf{A}‖_F=\sqrt{\sum_i^N\sum_j^N{a_{ij}^2}} \end{equation*}\] where \(a_{ij}\) is the entry in the \(i^{th}\) row and \(j^{th}\) column of \(\textbf{A}\).
  2. Correlation Matrix Distance \(dist(\textbf{K},\tilde{\textbf{K}}_{\alpha})\)) (CMD, Herdin et al. 2005). This is defined as \[\begin{equation*} dist(\textbf{A},\textbf{B}) = 1-\frac{trace(\textbf{A}\textbf{B})}{(‖\textbf{A}‖_F ‖\textbf{B}‖_F} \end{equation*}\] which ranges between 0 and 1.

The Frobenius norm and CMD were calculated after transforming the covariance matrices \(\textbf{K}\) and \(\tilde{\textbf{K}}_{\alpha}\) into correlation matrices.

Implementation

We benchmarked the tensorEVD() routine against the eigen() function of the ‘base’ R-package (R Core Team 2021) in terms of the computational time used to derive eigenvectors of the Hadamard \(\textbf{K}\), the accuracy of the approximation provided by tensorEVD, and the dimension of the resulting basis. Likewise, we evaluated the performance of the approximation of the Hadamard \(\textbf{K}\) provided by the tensorEVD method in Gaussian linear models in terms of variance components estimates and cross-validation prediction accuracies.

The data used in these benchmarks was generated by the Genomes-To-Fields (G2F) Initiative (Lima et al. 2023) which was curated and expanded by adding environmental covariates (EC) by Lopez-Cruz et al. (2023). We used the subset of the data corresponding to the northern testing locations that includes \(n=59069\) records for 4 traits (grain yield, anthesis, silking, and anthesis-silking interval) from \(n_G = 4344\) hybrids and \(n_E = 97\) environments.

Data analysis pipeline

We used a data analysis pipeline as shown below with folders code, data, output, parms, and source.

pipeline
├── code
│   ├── 1_simulation.R
│   ├── 2_model_components.R
│   ├── 3_ANOVA_GxE_model.R
│   ├── 4_get_variance_GxE_model.R
│   └── 5_10F_CV_GxE_model.R
├── data
├── output
├── parms
└── source
    ├── ECOV.csv
    ├── GENO.csv
    └── PHENO.csv

Genomes-to-Field data

Folder source contains the phenotypic (file PHENO.csv), SNPs (file GENO.csv), and ECs (file ECOV.csv) data from G2F. These files can be downloaded from the Figshare repository (https://doi.org/10.6084/m9.figshare.22776806).

R-scripts

Folder code contains the R-scripts to implement the sequence of analyses detailed in the next sections and can be downloaded from this link. The scripts can be downloaded using the following instruction in the command line

cd /mnt/scratch/quantgen/TENSOR_EVD/pipeline
curl https://codeload.github.com/MarcooLopez/tensorEVD/tar.gz/main | tar -xz --strip=2 tensorEVD-main/misc/code

The R-scripts were run on the MSU high-performance computing center (HPCC) (https://icer.msu.edu/hpcc/hardware) as a batch job script. The header of the scripts contains the shebang line #!/usr/bin/env Rscript in the first line followed by job requirements (e.g., memory, number of CPUs, run time) that are specified using the SLURM scheduler by adding the prefix #SBATCH at the beginning of each request instruction line, for example

#!/usr/bin/env Rscript
#SBATCH --time=03:59:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=84G

Each R-script is submitted to the HPCC using the sbatch command, for instance

cd /mnt/scratch/quantgen/TENSOR_EVD/pipeline/code
sbatch 1_simulation.R

Data subset and relationship matrices computation

The R-code below show how to obtain the subset of the data corresponding to the northern locations. Next, this data subset is used to derive a genetic (GRM, VanRaden 2008) for the \(n_G = 4344\) hybrids as \(\textbf{K}_G = \textbf{X}\textbf{X}'/trace(\textbf{X}\textbf{X}')\), where \(\textbf{X}\) is the matrix of centered SNPs (hybrids in rows, SNPs in columns). Likewise, an environmental relationship matrix (ERM) is derived for the \(n_E = 97\) environments as \(\textbf{K}_E = \textbf{W}\textbf{W}'/trace(\textbf{W}\textbf{W}')\) where \(\textbf{W}\) is the matrix of centered and scaled ECs (environments in rows, ECs in columns).

library(data.table)
setwd("/mnt/scratch/quantgen/TENSOR_EVD/pipeline")

PHENO <- read.csv("source/PHENO.csv")
GENO <- fread("source/GENO.csv", data.table=FALSE)
ECOV <- read.csv("source/ECOV.csv", row.names=1)

# Select North region
PHENO <- PHENO[PHENO$region %in% 'North',]
PHENO$year_loc <- factor(as.character(PHENO$year_loc))
PHENO$genotype <- factor(as.character(PHENO$genotype))

save(PHENO, file="data/pheno.RData")

# Calculate the GRM
ID <- GENO[,1]
GENO <- as.matrix(GENO[,-1])
rownames(GENO) <- ID
X <- scale(GENO, center=TRUE, scale=FALSE)
KG <- tcrossprod(X)
KG <- KG[levels(PHENO$genotype),levels(PHENO$genotype)]
KG <- KG/mean(diag(KG))

save(KG, file="data/GRM.RData")

# Calculate the ERM
ECOV <- ECOV[,-grep("HI30_",colnames(ECOV))]

KE <- tcrossprod(scale(ECOV))
KE <- KE[levels(PHENO$year_loc),levels(PHENO$year_loc)]
KE <- KE/mean(diag(KE))

save(KE, file="data/ERM.RData")

After running this code, R-files pheno.RData, GRM.RData, and ERM.RData with the phenotypic northern subset, GRM, and ERM, respectively, are to be saved in the folder data.

pipeline
:
├── data
:   ├── ERM.RData
    ├── GRM.RData
    └── pheno.RData

Simulation experiments

We formed Hadamard products \(\textbf{K} = (\textbf{Z}_1\textbf{K}_1\textbf{Z}'_1)\odot(\textbf{Z}_2\textbf{K}_2\textbf{Z}'_2)\) between the GRM (as \(\textbf{K}_1\)) and the ERM (as \(\textbf{K}_2\)) of various sizes by sampling hybrids (\(n_G=100,500,1000\)), environments (\(n_E=10,30,50\)), and the level of replication needed to complete a total sample size of \(n=10000,20000,30000\). Then, we factorized the resulting Hadamard product matrix using the R-base function eigen as well as using tensorEVD, deriving as many eigenvectors as needed to explain a proportion \(\alpha=0.90,0.95,0.98\) of the total variance. We implemented 10 replicates of each experiment.

The R-script 1_simulation.R was used to perform the analysis for a given combination of parameters (nG, nE, n, alpha, and replicate). To do this, first, we created an array with all combinations of the parameters. A data.frame object called JOBS containing \(3\times3\times3\times3\times10=810\) rows and the \(5\) parameters in columns, was created using the expand.grid R-function and saved in folder parms

JOBS <- expand.grid(nG = c(100,500,1000),
                    nE = c(10,30,50),
                    n = c(10000,20000,30000),
                    alpha = c(0.90,0.95,0.98),
                    replicate = 1:10)
dim(JOBS); head(JOBS)
#[1] 810   5
#    nG nE     n alpha replicate
#1  100 10 10000   0.9         1
#2  500 10 10000   0.9         1
#3 1000 10 10000   0.9         1
#4  100 30 10000   0.9         1
#5  500 30 10000   0.9         1
#6 1000 30 10000   0.9         1

save(JOBS, file="/mnt/scratch/quantgen/TENSOR_EVD/pipeline/parms/JOBS1.RData")

To perform all the \(3\times3\times3\times3\times10=810\) cases (each row of the object JOBS) we submitted the R-script for multi-job implementation by specifying a job array through the SLURM option #SBATCH --array=1-810, each value of the array is read in the R-code with the instruction Sys.getenv("SLURM_ARRAY_TASK_ID").

The full header added to the R-script containing all the job requirements is the following

#!/usr/bin/env Rscript
#SBATCH --job-name=simulation
#SBATCH --output=../log/%x_%A_%a
#SBATCH --time=03:59:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=84G
#SBATCH --constraint=intel18
#SBATCH --array=1-810

The output file simulation_results.txt generated after running the previous R-script contains at each row the results from each experiment case. The information of the experiment are given in the first columns followed by the results on computation time, approximation accuracy (measured with the Frobenius and CMD metrics), and the number of eigenvectors (associated to the \(\alpha\)-value) for the eigen and tensorEVD methods.

pipeline
:
├── output
:   └── simulation
        └── simulation_results.txt

The first rows of this file are shown below for the results presented in the manuscript (the file can be found in this link).

out <- read.csv(url("https://raw.githubusercontent.com/MarcooLopez/tensorEVD/main/inst/extdata/results_simulation.txt"))
head(out[,1:8])
##   job alpha   nG nE     n replicate nComb nPosEigen
## 1   1   0.9  100 10 10000         1  1000      1000
## 2   2   0.9  500 10 10000         1  4314      4314
## 3   3   0.9 1000 10 10000         1  6290      6290
## 4   4   0.9  100 30 10000         1  2904      2904
## 5   5   0.9  500 30 10000         1  7295      7295
## 6   6   0.9 1000 30 10000         1  8509      8509
head(out[,9:13])  # results from the eigen function
##   time_eigen Frobenius_eigen   CMD_eigen nPC_eigen pPC_eigen
## 1    255.960        148.8676 0.003611943       404 0.4040000
## 2    170.438        193.1440 0.004002020       936 0.2169680
## 3    171.677        148.1347 0.002807314      1417 0.2252782
## 4    166.713        122.9745 0.003348740       780 0.2685950
## 5    178.486        118.3712 0.003481097      1895 0.2597670
## 6    185.491        128.3336 0.003024160      2160 0.2538489
head(out[,14:18])  # results from the tensorEVD function
##   time_tensorEVD Frobenius_tensorEVD CMD_tensorEVD nPC_tensorEVD pPC_tensorEVD
## 1          0.165            146.7451   0.003268131           419     0.4190000
## 2          0.283            184.9391   0.003106224          1096     0.2540566
## 3          0.772            136.8840   0.001843963          1881     0.2990461
## 4          0.248            119.0399   0.002760687           853     0.2937328
## 5          0.911            100.7295   0.001879908          2749     0.3768334
## 6          2.056            102.6804   0.001429371          3779     0.4441180

Visualizing results

The following R-code chunks can be used to the reproduce the plots that are presented in the manuscript. The code for the plots can be found in this link and can be loaded into R using

source("https://raw.githubusercontent.com/MarcooLopez/tensorEVD/main/misc/functions.R")

Computation time

R-code below creates a plot for the EVD computation times of the eigen and tensorEVD methods (Supplementary Figure 1) and a plot for the computation time ratio (eigen/tensorEVD) (Figure 1).

# Some data edits
out$alpha <- factor(100*out$alpha)
out$nGE <- paste0("n[G]*' = '*",out$nG,"L*', '*n[E]*' = '*",out$nE,"L")
out$nGE <- factor(out$nGE, levels = unique(out$nGE))

# Reshaping the data
measure <- c("time","Frobenius","CMD","nPC","pPC")
dat <- melt_data(out, id=c("nGE","n","alpha"),
                 measure=paste0(measure,"_"),
                 value.name=measure, variable.name="method")
## Loading required namespace: reshape2
color1 <- c('90%'="skyblue1",'95%'="royalblue1",'98%'="navy")
color2 <- c(eigen="#E69F00", tensorEVD="#009E73")

# Supplementary Figure 1: Computation times of the eigen and tensorEVD
figureS1 <- make_plot(dat, x='alpha', y='time',
                      group="method", by="n", facet="nGE",
                      xlab=bquote(alpha~"x100% of variance of K"),
                      by.label="Sample size", ylab="Time (seconds) eigen",
                      ylab2="Time (seconds) tensorEVD",
                      sec.axis=TRUE, group.color=color2,
                      scales="fixed", ylim=c(0,NA))
## Loading required namespace: RColorBrewer
## Loading required namespace: ggplot2
# Figure 1: Computation time ratio (eigen/tensorEVD)
out$alpha2 <- factor(paste0(out$alpha,"%"))
out$time_ratio <- out$time_eigen/out$time_tensorEVD

figure1 <- make_plot(out, type="line", x='n', y='time_ratio',
                     group="alpha2", group.label=NULL, facet="nGE",
                     xlab="Sample size",
                     ylab="Computation time ratio (eigen/tensorEVD)",
                     group.color=color1, nSD=0, errorbar.size=0,
                     hline=1, scales="free_y", ylim=c(0,NA))
print(figure1)

Approximation accuracy

R-code below produces the approximation accuracy plots using the Frobenius norm (Figure 2) of the difference between the Hadamard matrix and the approximation, and using the CMD metric (Supplementary Figure 2), provided by the eigen and tensorEVD.

# Supplementary Figure 2: Approximation accuracy using CMD
dat$CMD <- 1000*dat$CMD
figureS2 <- make_plot(dat, x='alpha', y='CMD',
                      group="method", by="n", facet="nGE",
                      xlab=bquote(alpha~"x100% of variance of K"),
                      ylab=expression("Correlation Matrix Distance (x"~10^{-3}~")"),
                      by.label="Sample size", group.color=color2,
                      scales="free_y", ylim=c(0,NA))

# Figure 2: Approximation accuracy using Frobenious norm
figure2 <- make_plot(dat, x='alpha', y='Frobenius',
                     group="method", by="n", facet="nGE",
                     xlab=bquote(alpha~"x100% of variance of K"),
                     ylab=expression("Frobenius norm ("~abs(abs(K-hat(K)))[F]~")"),
                     by.label="Sample size", group.color=color2,
                     scales="free_y", ylim=c(0,NA))
print(figure2)

Dimension reduction

The following R-code creates the plot (Figure 3) showing the number of eigenvectors produced by the eigen and tensorEVD methods, relative to the rank of the Hadamard matrix (number of eigenvectors with positive eigenvalue).

figure3 <- make_plot(dat, x='alpha', y='pPC',
                     group="method", by="n", facet="nGE",
                     xlab=bquote(alpha~"x100% of variance of K"),
                     ylab="Number of eigenvectors/rank",
                     by.label="Sample size", group.color=color2,
                     hline=1, scales="fixed")
print(figure3)

Application in Genomic Prediction

We analyzed each trait (grain yield, anthesis, silking, and anthesis-silking interval) with a Gaussian reaction norm \(G\times E\) model (Jarquín et al. 2014) in which the trait phenotype (\(y_{ijk}\)) is modeled as the sum of the main effect of hybrid (\(G_i\)), main effect of environment (\(E_j\)), and the hybrid\(\times\)environment interaction (\(GE_{ij}\)) term, this is

\[\begin{equation*} y_{ijk} = \mu + G_i + E_j + GE_{ij} + \varepsilon_{ijk}. \end{equation*}\]

Above, \(\mu\) is an intercept and \(i\), \(j\), and \(k\) are indices for the hybrids, environment, and replicate, respectively. The term \(\varepsilon_{ijk}\) is an error term assumed to be independently and identically Gaussian distributed as \(\varepsilon_{ijk} \sim N(0,\sigma_{\varepsilon}^2)\), with \(\sigma_{\varepsilon}^2\) variance parameter associated to the error. Hybrid, environment, and interaction effects were assumed to be multivariate normally distributed with zero mean and effect-specific covariance matrices, specifically \(\textbf{G}\sim MVN(\textbf{0},\sigma_G^2 \textbf{K}_G)\), \(\textbf{E}\sim MVN(\textbf{0},\sigma_E^2 \textbf{K}_E)\), and \(\textbf{GE}\sim MVN(\textbf{0},\sigma_{GE}^2 \textbf{K})\), where

\[\begin{equation*} \textbf{K} = (\textbf{Z}_1\textbf{K}_G\textbf{Z}'_1)\odot(\textbf{Z}_2\textbf{K}_E\textbf{Z}'_2) \end{equation*}\]

is a Hadamard product between the GRM \(\textbf{K}_G\) and the ERM \(\textbf{K}_E\), and \(\sigma_G^2\), \(\sigma_E^2\), and \(\sigma_{GE}^2\) are variance parameters associated to \(\textbf{G}\), \(\textbf{E}\) and \(\textbf{GE}\), respectively.

Bayesian Ridge Regression (BRR) implementation

We used a ‘BRR’ equivalence of the above model by fitting the model

\[\begin{equation*} \boldsymbol{y} = \boldsymbol{\mu}+\textbf{X}_G\boldsymbol{\beta}_1+\textbf{X}_E\boldsymbol{\beta}_2 + \textbf{X}_{GE}\boldsymbol{\beta}_3 + \boldsymbol{\varepsilon} \end{equation*}\]

where the predictors \(\textbf{X}_G\), \(\textbf{X}_E\), and \(\textbf{X}_{GE}\) are the scaled eigenvectors of the covariance matrices \(\textbf{K}_G\), \(\textbf{K}_E\), and \(\textbf{K}\), respectively, and the regression coefficients are assumed to be distributed \(\boldsymbol{\beta}_1\sim MVN(\textbf{0},\sigma_{G}^2 \textbf{I})\), \(\boldsymbol{\beta}_2\sim MVN(\textbf{0},\sigma_{E}^2 \textbf{I})\), and \(\boldsymbol{\beta}_3\sim MVN(\textbf{0},\sigma_{GE}^2 \textbf{I})\). First, we obtained the decomposition \(\textbf{K}_G=\textbf{V}_1\textbf{D}_1\textbf{V}'_1\) and \(\textbf{K}_E=\textbf{V}_2\textbf{D}_2\textbf{D}'_2\) using the eigen R-function. Next, for a given proportion \(\alpha\) of variance, we obtained the decomposition of the Hadamard \(\tilde{\textbf{K}}_{\alpha} = \tilde{\textbf{V}}_{\alpha}\tilde{\textbf{D}}_{\alpha}\tilde{\textbf{V}}_{\alpha}\) using the eigen and tensorEVD approaches. Finally, we derived the scaled eigenvectors \(\textbf{X}_G = \textbf{V}_1\textbf{D}_1^{1/2}\), \(\textbf{X}_E = \textbf{V}_2\textbf{D}_2^{1/2}\), and \(\tilde{\textbf{X}}_{GE,\alpha} = \tilde{\textbf{V}}_{\alpha}\tilde{\textbf{D}}_{\alpha}^{1/2}\). The later was done for values \(\alpha=0.90,0.95,0.98,1.00\).

The calculation of all these matrices was carried out using the R-script 2_model_components.R for a given \(\alpha\)-value. We replicated this task 5 times to obtain an average (and SD) computing time of the decomposition, to this end, we created a job array for each combination of parameters alpha and replicate as follows

JOBS <- expand.grid(alpha = c(0.90,0.95,0.98,1.00),
                    replicate = 1:5)
dim(JOBS); head(JOBS)
#[1] 20  2
#  alpha replicate
#1  0.90         1
#2  0.95         1
#3  0.98         1
#4  1.00         1
#5  0.90         2
#6  0.95         2

save(JOBS, file="/mnt/scratch/quantgen/TENSOR_EVD/pipeline/parms/JOBS2.RData")

The script was submitted using a job array for the \(4\times5=20\) combination of parameters (each row of the object JOBS) using the following job requirements

#!/usr/bin/env Rscript
#SBATCH --job-name=comps
#SBATCH --output=../log/%x_%A_%a
#SBATCH --time=06:59:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=160G
#SBATCH --constraint=intel18
#SBATCH --array=1-20

After running the R-script, the following files are created in the folder output/genomic_prediction/model_comps.

pipeline
:
├── output
:   └── genomic_prediction
        └── model_comps
            ├── XE.RData
            ├── XG.RData
            ├── XGE_90_eigen.RData
            ├── XGE_90_tensorEVD.RData
            ├── XGE_95_eigen.RData
            ├── XGE_95_tensorEVD.RData
            ├── XGE_98_eigen.RData
            ├── XGE_98_tensorEVD.RData
            ├── XGE_100_eigen.RData
            └── timing_EVD.txt

Analysis of variance

The \(G\times E\) model was implemented as a BRR using the BLRXy() function from the ‘BGLR’ R-package (Pérez-Rodríguez and de los Campos 2022) for each combination of trait, method, and \(\alpha\)-value, with 5 replicates each to present an average (and SD) of the results. The model was run for \(\alpha=1.0\) for the eigen method only. The BLRXy() function fits the model and generates samples from the posterior distribution of the regression coefficients \(\boldsymbol{\beta}_1\), \(\boldsymbol{\beta}_2\), and \(\boldsymbol{\beta}_3\) using the Gibbs sampler. We run the BGLR with nIter=50000 and burnIn=5000 parameters.

We implemented the model using the R-script 3a_fit_GxE_model.R for a given combination of parameters (trait, method, alpha, and replicate). We created an array with all combinations of parameters as follows

JOBS <- expand.grid(trait = c("yield","anthesis","silking","ASI"),
                    method = c("eigen","tensorEVD"),
                    alpha = c(0.90,0.95,0.98,1.00),
                    replicate = 1:5)
JOBS <- JOBS[-which(JOBS$alpha==1.00 & JOBS$method=="tensorEVD"),]
dim(JOBS); head(JOBS)
#[1] 140   4
#     trait    method alpha replicate
#1    yield     eigen   0.9         1
#2 anthesis     eigen   0.9         1
#3  silking     eigen   0.9         1
#4      ASI     eigen   0.9         1
#5    yield tensorEVD   0.9         1
#6 anthesis tensorEVD   0.9         1

save(JOBS, file="/mnt/scratch/quantgen/TENSOR_EVD/pipeline/parms/JOBS3.RData")

The script was submitted using a job array for all the \((4\times2\times4\times5) - (4\times1\times1\times5)=140\) combination of parameters (each row of object JOBS). The full job requirements used are the following

#!/usr/bin/env Rscript
#SBATCH --job-name=ANOVA
#SBATCH --output=../log/%x_%A_%a
#SBATCH --time=20:59:00
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=96G
#SBATCH --constraint=intel16
#SBATCH --array=1-140

The outputs generated by the R-script are saved in folder output/genomic_prediction/ANOVA in a sub-folder corresponding to each trait, method, alpha, and replicate combination as shown below. The posterior samples for coefficients \(\boldsymbol{\beta}_1\), \(\boldsymbol{\beta}_2\), and \(\boldsymbol{\beta}_3\) are stored in files ETA_G_b.bin, ETA_E_b.bin, and ETA_GE_b.bin, respectively. For instance, the file ETA_G_b.bin is a \(q\times p\) matrix containing at each row the samples \(\hat{\boldsymbol{\beta}}_1^{(1)},\hat{\boldsymbol{\beta}}_1^{(2)},...,\hat{\boldsymbol{\beta}}_1^{(q)}\).

pipeline
:
├── output
:   └── genomic_prediction
        :
        └── ANOVA
            ├── yield
            :   ├── eigen
                :   ├── alpha_90
                    :   ├── rep_1
                        :   ├── ETA_E_b.bin
                            ├── ETA_E_varB.dat
                            ├── ETA_G_b.bin
                            ├── ETA_G_varB.dat
                            ├── ETA_GE_b.bin
                            ├── ETA_GE_varB.dat
                            ├── fm.RData
                            ├── mu.dat
                            └── varE.dat

Obtaining total variance

We used the sample files to obtain the total variance of hybrid (\(\textbf{X}_G\boldsymbol{\beta}_1^{(k)}\)), environment (\(\textbf{X}_E\boldsymbol{\beta}_2^{(k)}\)), hybrid\(\times\)environment interaction (\(\textbf{X}_{GE}\boldsymbol{\beta}_3^{(k)}\)), and error (\(\boldsymbol{\varepsilon}\)) terms in the BRR model, and reported the average across all samples. As we standardized the phenotype to have unit variance, these variances can be seen as the proportion of the phenotypic variance explained by each model component. We performed this task using the R-script 3b_get_variance_GxE_model.R using a job array for all the \(140\) jobs (each row in the object JOBS). The code will create a data.frame stored in the file VC.RData in the corresponding sub-folder in folder output/genomic_prediction/ANOVA.

Collecting results

Once the previous R-script has been run for all the jobs, the following code can be used to collect into a single table all the individual variance components results from all jobs.

setwd("/mnt/scratch/quantgen/TENSOR_EVD/pipeline")
load("parms/JOBS3.RData")
prefix <- "output/genomic_prediction/ANOVA"

out <- c()
for(k in 1:nrow(JOBS))
{
  trait <- as.vector(JOBS[k,"trait"])
  method <- as.vector(JOBS[k,"method"])
  alpha <- as.vector(JOBS[k,"alpha"])
  replicate <- as.vector(JOBS[k,"replicate"])

  suffix <- paste0(trait,"/",method,"/alpha_",100*alpha,"/rep_",replicate,"/VC.RData")
  filename <- paste0(prefix,"/",suffix)
  if(file.exists(filename)){
    load(filename)
    out <- rbind(out, VC)
  }else{
    message("File not found: '",suffix,"'")
  }
}

The first rows of this data.frame are displayed below for the results presented in the manuscript (the file can be found in this link).

out <- read.csv(url("https://raw.githubusercontent.com/MarcooLopez/tensorEVD/main/inst/extdata/results_ANOVA.txt"))
head(out)
##      trait method alpha replicate source       mean          SD
## 1    yield  eigen   0.9         1      G 0.06956522 0.004202744
## 2    yield  eigen   0.9         1      E 0.48419062 0.011957294
## 3    yield  eigen   0.9         1     GE 0.07582496 0.004068338
## 4    yield  eigen   0.9         1  Error 0.39192312 0.002409890
## 5 anthesis  eigen   0.9         1      G 0.12780639 0.004926835
## 6 anthesis  eigen   0.9         1      E 0.85284787 0.008838661

Visualizing results

R-code below can be used to create a plot showing the average proportion of phenotypic variance of grain yield explained by each model (Figure 4). The same code can be used for traits anthesis, silking, and anthesis-silking interval (Supplementary Figures 4, 5, and 6, respectively).

out$alpha <- factor(paste0(100*out$alpha,"%"), levels=c("100%","98%","95%","90%"))
out$source <- factor(out$source, levels=c("G","E","GE","Error"))

trait <- c("yield", "anthesis", "silking", "ASI")[1]

myfun <- function(x) sprintf('%.3f', x)

# Figure 4: Phenotypic variance of yield
dat <- out[out$trait==trait,]
figure4 <- make_plot(dat, x='alpha', y='mean', SD="SD",
                     group="method", facet="source",
                     xlab=bquote(alpha~"x100% of variance of K"),
                     ylab=paste0("Proportion of variance of ",trait),
                     group.color=color2, scales="free_y",
                     ylabels=myfun, text=myfun)
print(figure4)

Cross-validation

We evaluated the performance of the approximation \(\tilde{\textbf{K}}_{\alpha} = \tilde{\textbf{V}}_{\alpha}\tilde{\textbf{D}}_{\alpha}\tilde{\textbf{V}}_{\alpha}\) of the kernel \(\textbf{K}\) provided by the eigen and tensorEVD methods in the \(G\times E\) model in terms of prediction accuracy. We conducted a 10-fold cross-validation (CV) with hybrids assigned to folds. We predicted all the records of hybrids in the \(k^{th}\) fold using a model trained with all records from hybrids in the remaining 9 folds. The model was implemented for each combination of trait and method for \(\alpha=0.90,0.95,0.98\).

The folds were previously created by Lopez-Cruz et al. (2023) and are provided in the column CV_10fold of the phenotypic data file. The number records in each fold ranges between 5436 and 6277.

setwd("/mnt/scratch/quantgen/TENSOR_EVD/pipeline")
load("data/pheno.RData")
table(PHENO$CV_10fold)
#   1    2    3    4    5    6    7    8    9   10
#6180 6277 6246 5785 6160 5858 5492 5660 5436 5975

We implemented this CV using the R-script 4_10F_CV_GxE_model.R which fits the model for a given fold for a given combination of trait, method, and \(\alpha\)-value. To this end, we created an array with all combinations of parameters (trait, method, alpha, and fold) as follows

JOBS <- expand.grid(trait = c("yield","anthesis","silking","ASI"),
                    method = c("eigen","tensorEVD"),
                    alpha = c(0.90,0.95,0.98),
                    fold = 1:10)
dim(JOBS); head(JOBS)
#[1] 240   4
#     trait    method alpha fold
#1    yield     eigen   0.9    1
#2 anthesis     eigen   0.9    1
#3  silking     eigen   0.9    1
#4      ASI     eigen   0.9    1
#5    yield tensorEVD   0.9    1
#6 anthesis tensorEVD   0.9    1

save(JOBS, file="/mnt/scratch/quantgen/TENSOR_EVD/pipeline/parms/JOBS4.RData")

The script was submitted using a job array for all the \(4\times2\times3\times10=240\) combination of parameters (each row of object JOBS) using the following job requirements

#!/usr/bin/env Rscript
#SBATCH --job-name=10F_CV
#SBATCH --output=../log/%x_%A_%a
#SBATCH --time=20:59:00
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=96G
#SBATCH --constraint=intel16
#SBATCH --array=1-240

The outputs generated by the R-script are saved in folder output/genomic_prediction/10F_CV in a sub-folder corresponding to each trait, method, and alpha combination as shown below. Each file results_fold_*.RData contains a table with the predicted and observed values within each fold.

pipeline
:
├── output
:   └── genomic_prediction
        :
        └── 10F_CV
            ├── yield
            :   ├── eigen
                :   ├── alpha_90
                        ├── results_fold_1.RData
                        ├── results_fold_2.RData
                        ├── results_fold_3.RData
                        ├── results_fold_4.RData
                        ├── results_fold_5.RData
                        ├── results_fold_6.RData
                        ├── results_fold_7.RData
                        ├── results_fold_8.RData
                        ├── results_fold_9.RData
                        └── results_fold_10.RData

Within-environment prediction accuracy

The following R-code can be used to calculate the within environment (column year_loc) correlation between observed and predicted phenotypes. The code will collect first the results in files results_fold_*.RData for all folds from each job (a combination of trait, method, and alpha). The results from all jobs are to be saved in a single table.

setwd("/mnt/scratch/quantgen/TENSOR_EVD/pipeline")
source("https://raw.githubusercontent.com/MarcooLopez/tensorEVD/main/misc/functions.R")
load("parms/JOBS4.RData")
prefix <- "output/genomic_prediction/10F_CV"

dat <- c()
for(trait in levels(JOBS$trait)){
  for(method in levels(JOBS$method)){
    for(alpha in unique(JOBS$alpha)){
      out0 <- c()
      for(fold in unique(JOBS$fold)){
        suffix <- paste0(trait,"/",method,"/alpha_",100*alpha,"/results_fold_",fold,".RData")
        filename <- paste0(prefix,"/",suffix)
        if(file.exists(filename)){
          load(filename)
          out0 <- rbind(out0, out)
        }else{
          message("File not found: '",suffix,"'")
        }
      }
      tmp <- get_corr(out0, by="year_loc")
      dat <- rbind(dat, data.frame(trait,method,alpha,tmp))
    }
  }
}

# Reshaping the data
dat$trait <- factor(dat$trait, levels=levels(JOBS$trait))
out <- reshape2::dcast(dat, trait+alpha+year_loc+nRecords~method, value.var="correlation")
tmp <- reshape2::dcast(dat, trait+alpha+year_loc+nRecords~method, value.var="SE")[,levels(JOBS$method)]
colnames(tmp) <- paste0(colnames(tmp),".SE")
out <- data.frame(out, tmp)

The first rows of this table are displayed below for the results presented in the manuscript (the file can be found in this link).

out <- read.csv(url("https://raw.githubusercontent.com/MarcooLopez/tensorEVD/main/inst/extdata/results_10F_CV.txt"))
head(out)
##   trait alpha  year_loc nRecords     eigen tensorEVD   eigen.SE tensorEVD.SE
## 1 yield   0.9 2014-IAH1     1557 0.3847606 0.3837099 0.02340692   0.02341801
## 2 yield   0.9 2014-ILH1      462 0.3956203 0.3982038 0.04282128   0.04276919
## 3 yield   0.9 2014-INH1      477 0.5379250 0.5497122 0.03867916   0.03832868
## 4 yield   0.9 2014-MNH1      443 0.5814797 0.5814698 0.03874100   0.03874133
## 5 yield   0.9 2014-MOH2      495 0.5256614 0.5253639 0.03831333   0.03832160
## 6 yield   0.9 2014-NEH1      448 0.2133686 0.2117438 0.04626095   0.04627769

Visualizing results

R-code below can be used to create a plot showing the within environment prediction correlation using the tensorEVD and eigen methods for each combination of trait and \(\alpha\)-value (Figure 5).

out$trait <- factor(out$trait, levels=unique(out$trait))
out$alpha <- factor(paste0(100*out$alpha,"%"), levels=c("98%","95%","90%"))

# Figure 5: Within environment prediction correlation
rg <- range(c(out$eigen,out$tensorEVD))
if(requireNamespace("ggplot2")){
  figure5 <-  ggplot2::ggplot(out, ggplot2::aes(tensorEVD, eigen)) +
              ggplot2::geom_abline(color="gray70", linetype="dashed") +
              ggplot2::geom_point(fill="#56B4E9", shape=21, size=1.4) +
              ggplot2::facet_grid(trait ~ alpha) +
              ggplot2::theme_bw() + ggplot2::xlim(rg) + ggplot2::ylim(rg)
}

print(figure5)

References

Golub G. H., and C. F. Van Loan, 1996 Matrix computations. Johns Hopkings University, Baltimore, MD.

Henderson C. R., 1985 Best linear unbiased prediction of nonadditive genetic merits in noninbred populations. J. Anim. Sci. 60: 111–117.

Herdin M., N. Czink, H. Özcelik, and E. Bonek, 2005 Correlation Matrix Distance, a Meaningful Measure for Evaluation of Non-Stationary MIMO Channels, pp. 136–140 in IEEE 61st Vehicular Technology Conference, Stockholm, Sweden.

Jarquín D., J. Crossa, X. Lacaze, P. Du Cheyron, J. Daucourt, et al., 2014 A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor. Appl. Genet. 127: 595–607.

Lima D. C., J. D. Washburn, J. I. Varela, Q. Chen, J. L. Gage, et al., 2023 Genomes to Fields 2022 Maize genotype by Environment Prediction Competition. BMC Res. Notes 16.

Lopez-Cruz M., F. Aguate, J. Washburn, S. K. Dayane, C. Lima, et al., 2023 Leveraging Data from the Genomes to Fields Initiative to Investigate G×E in Maize in North America. Nat. Comm. (in press)

Perez-Rodriguez P., and G. de los Campos, 2022 Additions to the BGLR R-package: a new function for biobank size data and Bayesian multivariate models, pp. 1486–1489 in Proceedings of 12th World Congress on Genetics Applied to Livestock Production (WCGALP), Rotterdam.

R Core Team, 2021 R: A Language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Searle S. R., 1982 Matrix Algebra Useful for Statistics. John Wiley & Sons, Inc, New Jersey.