Sunday, 3 July 2016
Estimation of genetic parameters and their sampling variances for quantitative traits in the type 2 modified augmented design
Published Date
April 2016, Vol.4(2):107–118, doi:10.1016/j.cj.2016.01.003
Open Access, Creative Commons license, Funding information
Author
Frank M. You a
Qijian Song b
Gaofeng Jia a,c
Yanzhao Cheng a
Scott Duguid a
Helen Booker c
Sylvie Cloutier d
a Morden Research and Development Centre, Agriculture and Agri-Food Canada, Morden, MB R6M 1Y5, Canada
b Soybean Genomics and Improvement Laboratory, USDA-ARS, Beltsville, MD 20705, USA
c Crop Development Centre, University of Saskatchewan, 51 Campus Drive, Saskatoon, SK S7N 5A8, Canada
d Ottawa Research and Development Centre, Agriculture and Agri-Food Canada, Ottawa, ON K1A 0C6, Canada
Received 11 November 2015. Revised 16 December 2015. Accepted 2 February 2016. Available online 11 February 2016.
Abstract
The type 2 modified augmented design (MAD2) is an efficient unreplicated experimental design used for evaluating large numbers of lines in plant breeding and for assessing genetic variation in a population. Statistical methods and data adjustment for soil heterogeneity have been previously described for this design. In the absence of replicated test genotypes in MAD2, their total variance cannot be partitioned into genetic and error components as required to estimate heritability and genetic correlation of quantitative traits, the two conventional genetic parameters used for breeding selection. We propose a method of estimating the error variance of unreplicated genotypes that uses replicated controls, and then of estimating the genetic parameters. Using the Delta method, we also derived formulas for estimating the sampling variances of the genetic parameters. Computer simulations indicated that the proposed method for estimating genetic parameters and their sampling variances was feasible and the reliability of the estimates was positively associated with the level of heritability of the trait. A case study of estimating the genetic parameters of three quantitative traits, iodine value, oil content, and linolenic acid content, in a biparental recombinant inbred line population of flax with 243 individuals, was conducted using our statistical models. A joint analysis of data over multiple years and sites was suggested for genetic parameter estimation. A pipeline module using SAS and Perl was developed to facilitate data analysis and appended to the previously developed MAD data analysis pipeline (http://probes.pw.usda.gov/bioinformatics_tools/MADPipeline/index.html).
Keywords
Broad-sense heritability
Genetic correlation
Sampling variance
Modified augmented design
1 Introduction
In the early stages of breeding programs, a considerable number of test lines and a limited seed supply constrain the use of complete experimental designs with replications. Augmented designs, a class of unreplicated experimental designs, are a potential solution to this problem [1], [2] and [3]. The augmented design usually has control lines arranged in a standard design such as a Latin square with several replications in soil-homogeneous blocks. Then the blocks are augmented to accommodate unreplicated test lines. Since control lines are in a standard design, the block effects can be estimated to adjust the observations of the test lines, and the error effects within control lines can be used to test the significance of performance differences among lines. Lin and Poushinsky [4] and [5] proposed a modified augmented design (MAD) with two subtypes. The type 1 MAD is used for square plots [4] and the type 2 MAD (MAD2) for rectangular plots [5]. This modified design is superior to the general augmented design in systematic placement of control and test genotypes within a block to enhance adjustment for soil heterogeneity [4].
MAD2 is used largely for early evaluation of breeding lines in crops such as wheat [6] and [7], potato [8], soybean [9], barley [10] and [11], sugarcane [12] and [13], and maize [14]. It is also used in flax breeding programs in Canada for field evaluation of flax yield, seed oil components, disease resistance, and other traits of agronomic and economic importance and for purposes of QTL identification, association mapping, and genomic selection [15], [16], [17] and [18]. In genetic experiments, individuals may have adequate amounts of seed for replicated trials, but it may be impractical to accommodate hundreds of genotypes in one homogeneous block of a field, owing to soil heterogeneity. Our earlier study [19] indicated that soil heterogeneity can be sufficiently adjusted for traits in MAD2 trials, suggesting that genetic variance of traits can be determined using a MAD2 approach.
Heritability and genetic correlation are crucial genetic parameters for quantitative traits because they can be used to predict the response to selection in plant breeding. Because the theoretical statistical distributions of these genetic parameter estimators are unknown, approximate tests of significance can be performed only on the basis of sampling errors. Methods for estimating sampling variances of the genetic correlation coefficient and heritability in some replicated experimental designs have been reported [20], [21], [22], [23] and [24].
We have improved upon previous methods of MAD2 statistical analysis in adjusting for soil heterogeneity [19]. Owing to the lack of replication of test genotypes in the design, however, the total variance for test genotypes cannot be partitioned into its genetic and error components, and for this reason the method is unable to estimate genetic parameters. Here we present a method for estimating broad-sense heritability (H2) and genetic correlation coefficients (rg) of quantitative traits in the MAD2. We also derive the statistical formulas for estimating their sampling variances. We used computer simulations to evaluate the reliability of the proposed methods. As a case study using flax, we estimated the genetic parameters of three quantitative traits in a biparental recombinant inbred line (RIL) population of 243 lines.
2 Methods
2.1 Experimental design and statistical analysis
A typical MAD2 has r * c whole plots structured as a grid of r rows and c columns. Each whole plot is split into k (an odd number, usually five or seven) parallel rectangular subplots. The whole experiment has a total of r * c * k subplots. A control genotype is assigned to the central subplot of each whole plot (plot control). Two additional control genotypes serve as subplot controls randomly assigned to subplots in randomly selected whole plots with n replicates. Thus, the entire trial accommodates rck − rc − 2n test genotypes that are randomly allocated to the remaining subplots (see Fig. 1 in [19] for the field layout).
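The plot accounting in this layout can be checked with a short sketch. The function name `mad2_test_plots` is hypothetical; it only evaluates rck − rc − 2n from the description above.

```python
def mad2_test_plots(r: int, c: int, k: int, n: int) -> int:
    """Number of subplots left for test genotypes in a MAD2 grid.

    r, c: rows and columns of whole plots
    k:    subplots per whole plot (odd, so that a central subplot exists)
    n:    replicates of each of the two subplot controls
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that a central subplot exists")
    total_subplots = r * c * k
    plot_controls = r * c        # one central control per whole plot
    subplot_controls = 2 * n     # two subplot control genotypes, n replicates each
    return total_subplots - plot_controls - subplot_controls


# 10 x 10 whole plots, five subplots each, five replicates per subplot control
print(mad2_test_plots(10, 10, 5, 5))
```

For a 10 × 10 grid with five subplots per whole plot and five replicates of each subplot control, this leaves 390 subplots for test genotypes.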
Control plots are used to estimate row (R), column (C) and R × C interaction effects and to test for additive soil variation in the row and/or column directions. The two subplot controls plus one plot control are used to estimate the subplot error and test for non-additive soil variation in multiple directions across the field [9] and [19]. The test results are used to determine whether data adjustment is needed and which method of adjustment should be used. Three methods have been proposed to adjust test genotypes to reduce or remove effects due to soil heterogeneity [4], [5] and [9]. For MAD2, method 1 is used if the row or column effects or both are significant, method 3 is used if the R × C interaction is significant [5], [9] and [25] and a combined methods 1 and 3 approach is suggested in most cases [19]. A detailed statistical analysis for MAD2 trials has been described [19].
2.2 Case study
An RIL population with 243 lines derived from a cross between “CDC Bethune” and “Macbeth” (BM) was used to evaluate genetic variation. The single MAD2 trial consisted of 49 whole plots (7 × 7 grids), each split into seven parallel subplots (1.5 m × 2.0 m with a 20-cm row spacing). CDC Bethune with 49 replicates was used as the plot control, and 7 replicates of both Hanley and Macbeth served as subplot controls. Field trials with the same design were conducted at two locations in Canada (Morden, Manitoba and Kernen Farm near Saskatoon, Saskatchewan) from 2009 to 2012 [18]. Genetic parameters and their sampling variances were estimated for three traits: oil content (OIL), iodine value (IOD), and linolenic acid content (LIN). The raw phenotypic data are presented in Table S1.
2.3 Estimation of genetic parameters
Observations of test genotypes and control genotypes after statistical adjustment [19] are expected to exclude the effect of soil heterogeneity; thus, the variation among replications of each control genotype should be caused only by random errors. The adjusted dataset in the trials corresponds to that obtained from a completely random design. Because each test genotype has a single adjusted observation, the total variance among test genotypes cannot be partitioned into genetic and error variances. However, the total variance within each control genotype, which is caused by random error, can be treated as the error variance of the test genotypes because it is reasonable to assume that any error effect of test genotypes or control genotypes follows the same normal distribution $N(0, \sigma_e^2)$, where $\sigma_e^2$ is the error variance. Accordingly, the genetic variance can be estimated by subtraction of the error variance from the total variance of the test genotypes.
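Thus, the genetic correlation coefficient ($r_{g_{ij}}$), error correlation coefficient ($r_{e_{ij}}$), and phenotypic correlation coefficient ($r_{p_{ij}}$) between two traits i and j (i, j = 1, 2), and the broad-sense heritability ($H_i^2$) of any single trait, can be defined as

$$r_{g_{ij}} = \frac{\sigma_{G_{ij}}}{\sqrt{\sigma_{G_{ii}}\,\sigma_{G_{jj}}}} \tag{1}$$

$$r_{e_{ij}} = \frac{\sigma_{e_{ij}}}{\sqrt{\sigma_{e_{ii}}\,\sigma_{e_{jj}}}} \tag{2}$$

$$r_{p_{ij}} = \frac{\sigma_{P_{ij}}}{\sqrt{\sigma_{P_{ii}}\,\sigma_{P_{jj}}}} \tag{3}$$

$$H_i^2 = \frac{\sigma_{G_{ii}}}{\sigma_{P_{ii}}} \tag{4}$$

where $\sigma_{P_{ij}}$, $\sigma_{G_{ij}}$, and $\sigma_{e_{ij}}$ represent the phenotypic, genetic and error variances of single traits (i = j) or covariances of two traits (i ≠ j), respectively. Estimation of these variances and covariances is dependent on statistical models.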
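The definitions above are ratios of variance and covariance components, which the following minimal sketch evaluates. The function names and the numeric inputs are hypothetical and serve only to show the arithmetic.

```python
import math


def correlation(cov_ij: float, var_i: float, var_j: float) -> float:
    """Correlation coefficient from a covariance and the two variances.

    The same ratio applies to genetic, error and phenotypic components.
    """
    return cov_ij / math.sqrt(var_i * var_j)


def broad_sense_heritability(var_g: float, var_p: float) -> float:
    """Broad-sense heritability as genetic over phenotypic variance."""
    return var_g / var_p


# Made-up components: genetic covariance 1.2, genetic variances 4.0 and 1.0,
# phenotypic variance 8.0 for the first trait.
r_g = correlation(1.2, 4.0, 1.0)
h2 = broad_sense_heritability(4.0, 8.0)
print(r_g, h2)
```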
2.3.1 Model 1: Single trial
For a single trial with g test genotypes and t control genotypes (including main plot controls and subplot controls), the adjusted observation of any test genotype with no replication can be expressed as

$$y_i = \mu + G_i + \varepsilon_i \tag{5}$$

where $y_i \sim N(\mu, \sigma_P^2)$, $G_i \sim N(0, \sigma_G^2)$ and $\varepsilon_i \sim N(0, \sigma_e^2)$. $\sigma_P^2$, $\sigma_G^2$, and $\sigma_e^2$ are phenotypic, genetic and error variances, respectively. The error variance $\sigma_e^2$ is estimated based on t replicated control genotypes. For a given trait i (i = 1, 2), the analyses of variance and covariance are shown in Table 1.
Table 1. Analyses of variance and covariance for model 1.

| Source | DF | MS | EMS | COV | ECOV |
|---|---|---|---|---|---|
| Genotype variance and covariance analyses | | | | | |
| Genotype (G) | g − 1 | $A_{ii}$ | $\sigma_e^2 + \sigma_G^2$ | $A_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_G$ |
| Error variance and covariance analyses | | | | | |
| Control (C) | t − 1 | $B_{ii}$ | $\sigma_e^2 + n\kappa_C^2$ | $B_{ij}$ | $\mathrm{COV}_e + n\,\mathrm{COV}_C$ |
| Error | rc + 2m − t | $C_{ii}$ | $\sigma_e^2$ | $C_{ij}$ | $\mathrm{COV}_e$ |

DF: degrees of freedom; MS: mean square; EMS: expected mean square; COV: covariance; ECOV: expected covariance; g: number of genotypes; t: number of control genotypes; n: average number of replicates for each control genotype (see Formula (7) in text); r and c are the number of rows and columns, respectively; and m is the number of replicates for two subplot controls.
For the two traits i and j (i, j = 1, 2), the error, genetic and phenotypic variances and covariances can be estimated as $\hat{\sigma}_{e_{ij}}$, $\hat{\sigma}_{G_{ij}}$, and $\hat{\sigma}_{P_{ij}}$ as follows:
equation6
where n is the number of replicates and Cij and Aij are the error and genotype covariance for trait i and j in Table 1, respectively. Because the number of replicates per control genotype differs in the MAD2 design, the number of replicates used for phenotypic variance estimation as described above is estimated [26] and [27] as
equation7
where nk is the number of replicates for the kth control genotype and t is the number of control genotypes used, usually 3 in MAD2.
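As an illustration of how Table 1 is used, the sketch below solves its expected mean squares for the variance components on a plot basis: E(Aii) = σe² + σG² and E(Cii) = σe² give an error variance of Cii and a genetic variance of Aii − Cii. The effective replicate number is computed with the standard coefficient for unequal group sizes in a one-way analysis of variance; that this coincides with Formula (7) is an assumption. All function names and the example mean squares are hypothetical.

```python
def effective_replicates(reps):
    """Standard one-way ANOVA coefficient n0 for unequal group sizes.

    Assumed (not confirmed by the source) to correspond to Formula (7).
    reps: number of replicates of each control genotype.
    """
    t = len(reps)
    total = sum(reps)
    return (total - sum(n_k ** 2 for n_k in reps) / total) / (t - 1)


def model1_components(ms_genotype, ms_error):
    """Solve the EMS column of Table 1 for plot-basis variance components."""
    var_e = ms_error                  # E(C_ii) = sigma_e^2
    var_g = ms_genotype - ms_error    # E(A_ii) - E(C_ii) = sigma_G^2
    var_p = var_g + var_e             # plot-basis phenotypic variance
    return {"var_e": var_e, "var_g": var_g, "var_p": var_p, "h2": var_g / var_p}


# One plot control with 49 replicates and two subplot controls with 7 each
n0 = effective_replicates([49, 7, 7])
# Made-up mean squares for the test genotypes and the control error term
est = model1_components(ms_genotype=2.0, ms_error=0.5)
print(n0, est)
```

A negative `var_g` would indicate that the error mean square exceeded the genotype mean square, in which case the genetic parameters cannot be estimated from that trial.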
2.3.2 Model 2: Trials in multiple environments
For the joint analysis of data in multiple environments or trials with the same design (each trial from different years and sites treated as environments), the adjusted observation of any test genotype with e environments and without replication can be expressed as
$$y_{ij} = \mu + G_i + E_j + (GE)_{ij} + \varepsilon_{ij} \tag{8}$$

where $y_{ij} \sim N(\mu, \sigma_P^2)$, $G_i \sim N(0, \sigma_G^2)$, $E_j \sim N(0, \sigma_E^2)$, $(GE)_{ij} \sim N(0, \sigma_{GE}^2)$, and $\varepsilon_{ij} \sim N(0, \sigma_e^2)$. $\sigma_P^2$, $\sigma_G^2$, $\sigma_E^2$, $\sigma_{GE}^2$, and $\sigma_e^2$ are the phenotypic, genetic, environmental, genotype-by-environment interaction (G × E), and error variances, respectively. $\sigma_e^2$ is jointly estimated based on e trials with t replicated control genotypes in each trial. For a given trait i (i = 1, 2), the analyses of variance and covariance are shown in Table 2.
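Table 2. Analyses of variance and covariance for model 2.

| Source | DF | MS | EMS | COV | ECOV |
|---|---|---|---|---|---|
| Genotype variance and covariance analyses | | | | | |
| Genotype (G) | g − 1 | $A_{ii}$ | $\sigma_e^2 + \sigma_{GE}^2 + e\sigma_G^2$ | $A_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_{GE} + e\,\mathrm{COV}_G$ |
| Environment (E) | e − 1 | $B_{ii}$ | $\sigma_e^2 + \sigma_{GE}^2 + g\sigma_E^2$ | $B_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_{GE} + g\,\mathrm{COV}_E$ |
| G × E | (g − 1)(e − 1) | $C_{ii}$ | $\sigma_e^2 + \sigma_{GE}^2$ | $C_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_{GE}$ |
| Error variance and covariance analyses | | | | | |
| Control (C) | t − 1 | $D_{ii}$ | $\sigma_e^2 + en\kappa_C^2$ | $D_{ij}$ | $\mathrm{COV}_e + en\,\mathrm{COV}_C$ |
| Environment (E) | e − 1 | $E_{ii}$ | $\sigma_e^2 + tn\sigma_E^2$ | $E_{ij}$ | $\mathrm{COV}_e + tn\,\mathrm{COV}_E$ |
| C × E | (t − 1)(e − 1) | $F_{ii}$ | $\sigma_e^2 + n\sigma_{CE}^2$ | $F_{ij}$ | $\mathrm{COV}_e + n\,\mathrm{COV}_{CE}$ |
| Error | e(rc + 2m − t) | $G_{ii}$ | $\sigma_e^2$ | $G_{ij}$ | $\mathrm{COV}_e$ |

e: number of environments. See Table 1 for other notes.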
For the two traits i and j (i, j = 1, 2), the error, genetic, G × E, and phenotypic variance and covariance can be estimated as $\hat{\sigma}_{e_{ij}}$, $\hat{\sigma}_{G_{ij}}$, $\hat{\sigma}_{GE_{ij}}$, and $\hat{\sigma}_{P_{ij}}$ as follows:
equation9
where Gij, Cij, and Aij are the covariances for error, G × E, and genotype for trait i and j in Table 2, respectively. Genetic parameters can be estimated using Formulas (1)–(4) and (9).
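For model 2, the same approach of equating mean squares to their expectations in Table 2 can be sketched as follows. The function name and example values are hypothetical, and only the plot-basis phenotypic variance (the plain sum of the components) is shown, which is an assumption about the measurement unit rather than a reproduction of Formula (9).

```python
def model2_components(ms_genotype, ms_gxe, ms_error, e):
    """Solve the EMS column of Table 2 for plot-basis variance components.

    ms_genotype: A_ii, ms_gxe: C_ii, ms_error: G_ii (from the controls),
    e: number of environments.
    """
    var_e = ms_error                        # E(G_ii) = sigma_e^2
    var_ge = ms_gxe - ms_error              # E(C_ii) - E(G_ii) = sigma_GE^2
    var_g = (ms_genotype - ms_gxe) / e      # [E(A_ii) - E(C_ii)] / e = sigma_G^2
    var_p = var_g + var_ge + var_e          # plot basis (assumed)
    return {"var_e": var_e, "var_ge": var_ge, "var_g": var_g,
            "var_p": var_p, "h2": var_g / var_p}


# Made-up mean squares for eight environments
est = model2_components(ms_genotype=10.0, ms_gxe=2.0, ms_error=1.0, e=8)
print(est)
```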
2.3.3 Model 3: Trials in multiple years and sites
Specifically for the joint analysis of data in multiple years and sites, the adjusted observation of any test genotype during y years at s sites with no replication can be expressed as
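$$y_{ijk} = \mu + G_i + Y_j + (GY)_{ij} + S_k + (GS)_{ik} + (YS)_{jk} + (GYS)_{ijk} + \varepsilon_{ijk} \tag{10}$$

where $y_{ijk} \sim N(\mu, \sigma_P^2)$, $G_i \sim N(0, \sigma_G^2)$, $Y_j \sim N(0, \sigma_Y^2)$, $(GY)_{ij} \sim N(0, \sigma_{GY}^2)$, $S_k \sim N(0, \sigma_S^2)$, $(GS)_{ik} \sim N(0, \sigma_{GS}^2)$, $(YS)_{jk} \sim N(0, \sigma_{YS}^2)$, $(GYS)_{ijk} \sim N(0, \sigma_{GYS}^2)$, and $\varepsilon_{ijk} \sim N(0, \sigma_e^2)$. $\sigma_P^2$, $\sigma_G^2$, $\sigma_Y^2$, $\sigma_{GY}^2$, $\sigma_S^2$, $\sigma_{GS}^2$, $\sigma_{YS}^2$, $\sigma_{GYS}^2$, and $\sigma_e^2$ are the variances for phenotype, genotype (G), year (Y), G × Y, site (S), G × S, Y × S, G × Y × S, and error, respectively. $\sigma_e^2$ is jointly estimated based on t replicated control genotypes during y years at s sites. For a given trait i (i = 1, 2), the analyses of variance and covariance are shown in Table 3.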
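Table 3. Analyses of variance and covariance for model 3.

| Source | DF | MS | EMS | COV | ECOV |
|---|---|---|---|---|---|
| Genotype variance and covariance analyses | | | | | |
| Genotype (G) | g − 1 | $A_{ii}$ | $\sigma_e^2 + \sigma_{GYS}^2 + s\sigma_{GY}^2 + y\sigma_{GS}^2 + ys\sigma_G^2$ | $A_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_{GYS} + s\,\mathrm{COV}_{GY} + y\,\mathrm{COV}_{GS} + ys\,\mathrm{COV}_G$ |
| Year (Y) | y − 1 | $B_{ii}$ | $\sigma_e^2 + \sigma_{GYS}^2 + s\sigma_{GY}^2 + g\sigma_{YS}^2 + gs\sigma_Y^2$ | $B_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_{GYS} + s\,\mathrm{COV}_{GY} + g\,\mathrm{COV}_{YS} + gs\,\mathrm{COV}_Y$ |
| Site (S) | s − 1 | $C_{ii}$ | $\sigma_e^2 + \sigma_{GYS}^2 + y\sigma_{GS}^2 + g\sigma_{YS}^2 + gy\sigma_S^2$ | $C_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_{GYS} + y\,\mathrm{COV}_{GS} + g\,\mathrm{COV}_{YS} + gy\,\mathrm{COV}_S$ |
| G × Y | (g − 1)(y − 1) | $D_{ii}$ | $\sigma_e^2 + \sigma_{GYS}^2 + s\sigma_{GY}^2$ | $D_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_{GYS} + s\,\mathrm{COV}_{GY}$ |
| G × S | (g − 1)(s − 1) | $E_{ii}$ | $\sigma_e^2 + \sigma_{GYS}^2 + y\sigma_{GS}^2$ | $E_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_{GYS} + y\,\mathrm{COV}_{GS}$ |
| Y × S | (y − 1)(s − 1) | $F_{ii}$ | $\sigma_e^2 + \sigma_{GYS}^2 + g\sigma_{YS}^2$ | $F_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_{GYS} + g\,\mathrm{COV}_{YS}$ |
| G × Y × S | (g − 1)(y − 1)(s − 1) | $G_{ii}$ | $\sigma_e^2 + \sigma_{GYS}^2$ | $G_{ij}$ | $\mathrm{COV}_e + \mathrm{COV}_{GYS}$ |
| Error variance and covariance analyses | | | | | |
| Control (C) | t − 1 | $H_{ii}$ | $\sigma_e^2 + ysn\kappa_C^2$ | $H_{ij}$ | $\mathrm{COV}_e + ysn\,\mathrm{COV}_C$ |
| Year (Y) | y − 1 | $I_{ii}$ | $\sigma_e^2 + n\sigma_{CYS}^2 + sn\sigma_{CY}^2 + gn\sigma_{YS}^2 + gsn\sigma_Y^2$ | $I_{ij}$ | $\mathrm{COV}_e + n\,\mathrm{COV}_{CYS} + sn\,\mathrm{COV}_{CY} + gn\,\mathrm{COV}_{YS} + gsn\,\mathrm{COV}_Y$ |
| Site (S) | s − 1 | $J_{ii}$ | $\sigma_e^2 + n\sigma_{CYS}^2 + yn\sigma_{CS}^2 + gn\sigma_{YS}^2 + gyn\sigma_S^2$ | $J_{ij}$ | $\mathrm{COV}_e + n\,\mathrm{COV}_{CYS} + yn\,\mathrm{COV}_{CS} + gn\,\mathrm{COV}_{YS} + gyn\,\mathrm{COV}_S$ |
| C × Y | (t − 1)(y − 1) | $K_{ii}$ | $\sigma_e^2 + n\sigma_{CYS}^2 + sn\sigma_{CY}^2$ | $K_{ij}$ | $\mathrm{COV}_e + n\,\mathrm{COV}_{CYS} + sn\,\mathrm{COV}_{CY}$ |
| C × S | (t − 1)(s − 1) | $L_{ii}$ | $\sigma_e^2 + n\sigma_{CYS}^2 + yn\sigma_{CS}^2$ | $L_{ij}$ | $\mathrm{COV}_e + n\,\mathrm{COV}_{CYS} + yn\,\mathrm{COV}_{CS}$ |
| Y × S | (y − 1)(s − 1) | $M_{ii}$ | $\sigma_e^2 + n\sigma_{CYS}^2 + tn\sigma_{YS}^2$ | $M_{ij}$ | $\mathrm{COV}_e + n\,\mathrm{COV}_{CYS} + tn\,\mathrm{COV}_{YS}$ |
| C × Y × S | (t − 1)(y − 1)(s − 1) | $N_{ii}$ | $\sigma_e^2 + n\sigma_{CYS}^2$ | $N_{ij}$ | $\mathrm{COV}_e + n\,\mathrm{COV}_{CYS}$ |
| Error | ys(rc + 2m − t) | $O_{ii}$ | $\sigma_e^2$ | $O_{ij}$ | $\mathrm{COV}_e$ |

y: number of years; s: number of sites. See Table 1 and Table 2 for other notes.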
For the two traits i and j (i, j = 1, 2), the variances and covariances for error, G, Y, G × Y, G × S, and G × Y × S can be estimated separately as follows:
equation11
where Oij, Aij, Dij, Eij, and Gij are the covariance for error, G, G × Y, G × S, and G × Y × S for traits i and j in Table 3, respectively. Similarly, several genetic parameters can be estimated by applying Formula (11) to Formulas (1)–(4).
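The expected mean squares in Table 3 can be solved for the genotype-related components in the same way. The sketch below is an illustration derived from the EMS column, not a reproduction of Formula (11); the function name, the dictionary keys (the Table 3 letters) and the example values are hypothetical, and the phenotypic variance is again taken on a plot basis.

```python
def model3_components(ms, y, s):
    """Solve the EMS column of Table 3 for plot-basis variance components.

    ms: mean squares keyed by the Table 3 letters
        "A" (G), "D" (G x Y), "E" (G x S), "G" (G x Y x S), "O" (error).
    y, s: number of years and sites.
    """
    var_e = ms["O"]                                           # sigma_e^2
    var_gys = ms["G"] - ms["O"]                               # sigma_GYS^2
    var_gy = (ms["D"] - ms["G"]) / s                          # sigma_GY^2
    var_gs = (ms["E"] - ms["G"]) / y                          # sigma_GS^2
    var_g = (ms["A"] - ms["D"] - ms["E"] + ms["G"]) / (y * s)  # sigma_G^2
    var_p = var_g + var_gy + var_gs + var_gys + var_e         # plot basis (assumed)
    return {"var_e": var_e, "var_gys": var_gys, "var_gy": var_gy,
            "var_gs": var_gs, "var_g": var_g, "var_p": var_p,
            "h2": var_g / var_p}


# Made-up mean squares for 4 years x 2 sites
ms = {"A": 20.0, "D": 3.0, "E": 3.0, "G": 2.0, "O": 1.0}
print(model3_components(ms, y=4, s=2))
```

The genetic variance uses A − D − E + G because the G × Y and G × S terms each carry one copy of σe² + σGYS², so one copy has to be added back before dividing by ys.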
2.4 Estimation of sampling variances
The Delta method [28] and [29] was used to derive the formulas for sampling errors for several genetic parameters. General formulas for sampling errors of several genetic parameters are available [22], [24] and [30]:
equation12
We noticed that the variance and covariance estimates in Formulas (6), (9), and (11) are linear functions of moments, θ (m1, m2, …, mk):
equation13
where mi corresponds to the mean square of a variation source in Table 1, Table 2 and Table 3. Then the variance of θ in Formula (14) can be estimated [31]:
equation14
Similarly, the approximate covariance between two functions of moments θl(m1 , … , mk) (l = 1, 2) is given by [31]:
equation15
V(mi) and COV(mi, mj) in Formulas (15) and (16) can be calculated using the following formulas [32] and [33]:
equation16
where q, r, s, t = 1, 2 and df are the degrees of freedom. The denominator value df + 2 has been suggested [34] to yield unbiased estimates.
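To make the Delta-method step concrete, here is a minimal sketch for the simplest case, model 1 on a plot basis, where the heritability estimate reduces to 1 − C/A for the genotype mean square A and the error mean square C. This reduction follows from the expected mean squares in Table 1 and is an assumption, not a formula quoted from the text. The sketch uses the mean-square variance approximation 2m²/(df + 2) and treats A and C as independent because they are computed from different plots. Function names and example values are hypothetical.

```python
import math


def var_mean_square(ms, df):
    """Approximate sampling variance of a mean square, with df + 2 in the denominator."""
    return 2.0 * ms ** 2 / (df + 2)


def h2_and_se_model1(ms_genotype, ms_error, g, df_error):
    """Plot-basis heritability 1 - C/A and its Delta-method standard error."""
    a, c = ms_genotype, ms_error
    h2 = 1.0 - c / a
    v_a = var_mean_square(a, g - 1)
    v_c = var_mean_square(c, df_error)
    # First-order Taylor expansion of h(A, C) = 1 - C/A:
    # dh/dA = C / A^2 and dh/dC = -1 / A, with COV(A, C) taken as zero.
    var_h2 = (c / a ** 2) ** 2 * v_a + (1.0 / a) ** 2 * v_c
    return h2, math.sqrt(var_h2)


# 390 test genotypes; error df = rc + 2m - t = 100 + 10 - 3 = 107
h2, se = h2_and_se_model1(ms_genotype=2.0, ms_error=1.0, g=390, df_error=107)
print(h2, se)
```

With these made-up mean squares the error term dominates the sampling variance, because the error mean square rests on far fewer degrees of freedom than the genotype mean square.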
Suppose that genotype and environment are independent. By applying Formulas (14)–(16) to Formulas (6), (9), and (11), we can calculate the variances of $\hat{\sigma}_{e_{ij}}$, $\hat{\sigma}_{G_{ij}}$, and $\hat{\sigma}_{P_{ij}}$ (i, j = 1, 2; i = j or i ≠ j), and covariances of any two of them, which are finally used to estimate the variances of the correlation coefficients ($\hat{r}_g$, $\hat{r}_e$, $\hat{r}_p$) and $\hat{H}^2$.
For model 1, we derived a general formula to calculate the variances of $\hat{\sigma}_{e_{ij}}$, $\hat{\sigma}_{G_{ij}}$, and $\hat{\sigma}_{P_{ij}}$ (i, j = 1, 2; i = j or i ≠ j) and the covariances between any two of them:
equation17
where each estimate on the left-hand side is one of $\hat{\sigma}_e$, $\hat{\sigma}_G$, or $\hat{\sigma}_P$; q, r, s, t = 1, 2; n is the number of replicates of control genotypes estimated from Formula (7); dA = (g − 1) + 2 and dC = (rc + 2m − t) + 2 from Table 1; and C1 and C2 are the correction coefficients listed in Table 4 for calculation of different variances or covariances.
Table 4. Correction coefficients in Formula (17) for sampling variance estimation for model 1.
For model 2, a similar general formula was derived to calculate variances of $\hat{\sigma}_{e_{ij}}$, $\hat{\sigma}_{G_{ij}}$, and $\hat{\sigma}_{P_{ij}}$ and covariances of any two of them:
equation18
where e is the number of environments; n is the number of replicates estimated with Formula (7); dA = (g − 1) + 2, dC = (g − 1)(e − 1) + 2 and dG = e(rc + 2 m − t) + 2 in Table 2; and C1, C2, and C3 are the correction coefficients listed in Table 5 for calculation of different variances or covariances.
Table 5. Correction coefficients in Formula (18) for sampling variance estimation for model 2.
For model 3, we derived the following general formula to calculate variances of $\hat{\sigma}_{e_{ij}}$, $\hat{\sigma}_{G_{ij}}$, and $\hat{\sigma}_{P_{ij}}$ and covariances of any two of them:
equation19
where y is the number of years; s is the number of sites; n is the number of control replicates estimated with Formula (7); dA = (g − 1) + 2, dD = (g − 1)(y − 1) + 2, dE = (g − 1)(s − 1) + 2, dG = (g − 1)(y − 1)(s − 1) + 2 and dO = ys(rc + 2m − t) + 2 from Table 3; and C1, C2, C3, C4, and C5 are the correction coefficients listed in Table 6 for calculation of different variances or covariances.
Table 6. Correction coefficients in Formula (19) for sampling variance estimation for model 3.
2.5 Computer simulations
Based on the MAD2 design scheme, we simulated single MAD2 trials for estimation of two genetic parameters: $H^2$ of a trait and $r_g$ between two traits. The purposes of the simulations were to (1) validate the proposed method for estimating genetic parameters in the MAD2 trials and (2) assess the accuracy of the derived theoretical formulas of the sampling variances for the two genetic parameters. We compared the given $H^2$ and $r_g$ values with the simulated estimates $\hat{H}^2$ and $\hat{r}_g$ to determine whether these parameters were accurately estimated.
A single MAD2 trial with 10 × 10 whole plots and five subplots in each whole plot was simulated. The dataset of 390 test genotypes with one observation, one main plot control with 100 replicates, and two subplot controls with five replicates each were generated based on assumptions in Formula (5) and given values for heritability and genetic correlation of the test genotype population. All simulations were performed using R software (https://www.r-project.org/), and the R code is available upon request.
2.5.1 Broad-sense heritability (H2)
Given the $\sigma_G^2$ and $H^2$ of a trait, we can calculate the error variance as $\sigma_e^2 = \sigma_G^2 (1 - H^2)/H^2$ on a plot basis. Thus, we can simulate the effect of different error variances on the estimation of $H^2$ in MAD2. Data generation was performed as follows: (1) given the μ and $\sigma_G^2$ of a trait, we generated a set of normal random numbers for 390 test genotypes plus three control genotypes following $N(\mu, \sigma_G^2)$, corresponding to the genetic values of test and control genotypes; (2) given $H^2$, we calculated $\sigma_e^2$ and generated 100 sets of normal random numbers with $N(0, \sigma_e^2)$, corresponding to the error effects of 100 replicates; (3) we added the error effects to the genetic values to generate phenotypic values of test and control genotypes for 100 replicates, creating a matrix of 393 rows and 100 columns following $N(\mu, \sigma_P^2)$ and representing phenotypic values of the single MAD2 trial; and (4) we randomly chose 390 rows with one column each to simulate test genotypes without replication, one row with all 100 columns to simulate the plot control with 100 replicates, and two rows with five columns each to simulate two subplot controls with five replicates. For each given $H^2$ value from 0.1 to 0.9 at intervals of 0.1, a total of 1000 simulations were performed. For each, the data were analyzed using model 1 (Table 1), and $\hat{H}^2$ and its sampling error were estimated using Formulas (4) and (12). The standard deviation of $\hat{H}^2$ in 1000 samples was calculated to represent an actual sampling error (henceforth termed “simulated” sampling error) for comparison with those calculated based on Formula (12).
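The data-generation steps above can be condensed into a small sketch. The original simulations were run in R; this Python version is an independent illustration, not the authors' code. It draws only the observations that are actually used (one per test genotype, 100 for the plot control, five for each subplot control) instead of building the full 393 × 100 matrix, and it estimates heritability on a plot basis as (A − C)/A from the genotype and pooled within-control mean squares. The function name, default μ and σG², and seeds are hypothetical.

```python
import random
import statistics


def simulate_h2(h2, mu=50.0, var_g=4.0, n_test=390, reps=(100, 5, 5), seed=0):
    """Simulate one unreplicated MAD2-like trial and return the plot-basis H2 estimate."""
    rng = random.Random(seed)
    var_e = var_g * (1.0 - h2) / h2          # error variance implied by H2
    sd_g, sd_e = var_g ** 0.5, var_e ** 0.5

    # Test genotypes: genetic value plus one error draw each
    test = [rng.gauss(mu, sd_g) + rng.gauss(0.0, sd_e) for _ in range(n_test)]

    # Controls: one genetic value per control, replicated error draws
    ss_within, df_within = 0.0, 0
    for n_k in reps:
        g_val = rng.gauss(mu, sd_g)
        obs = [g_val + rng.gauss(0.0, sd_e) for _ in range(n_k)]
        mean_k = sum(obs) / n_k
        ss_within += sum((x - mean_k) ** 2 for x in obs)
        df_within += n_k - 1                 # pooled df = rc + 2m - t

    ms_genotype = statistics.variance(test)  # total variance among test genotypes
    ms_error = ss_within / df_within         # pooled within-control variance
    return (ms_genotype - ms_error) / ms_genotype


estimates = [simulate_h2(0.5, seed=i) for i in range(200)]
print(sum(estimates) / len(estimates), statistics.stdev(estimates))
```

The standard deviation of the estimates across seeds plays the role of the "simulated" sampling error described above.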
2.5.2 Genetic correlation (rg)
Given two traits (1 and 2) following $N(\mu_1, \sigma_{G1}^2)$ and $N(\mu_2, \sigma_{G2}^2)$ with $r_g$, we generated two sets of correlated random numbers to simulate genetic values of traits as follows: (1) we generated two sequences of uncorrelated standard normal distributed random numbers $X_1$ and $X_2$; (2) we defined a new variable $Y = r_g X_1 + \sqrt{1 - r_g^2}\,X_2$ that had a genetic correlation of $r_g$ with $X_1$; and (3) we transformed $X_1$ and $Y$ into two new variables following the given normal distributions: $X_1' = X_1\sigma_{G1} + \mu_1$ and $X_2' = Y\sigma_{G2} + \mu_2$. To simplify the simulation, we set the error correlation $r_e$ between the two traits to zero. We then generated two sets of independent random numbers for the error effects of the two traits. All other procedures followed the principles described above.
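A minimal sketch of this construction follows. It assumes that step (2) uses the standard linear combination Y = rg·X1 + sqrt(1 − rg²)·X2, which has unit variance and correlation rg with X1; the function names and parameter values are hypothetical.

```python
import math
import random


def correlated_genetic_values(n, r_g, mu1, sd1, mu2, sd2, seed=0):
    """Generate n pairs of genetic values with correlation r_g between traits."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        # Assumed step (2): mix two independent standard normals
        y = r_g * x1 + math.sqrt(1.0 - r_g ** 2) * x2
        # Step (3): rescale to the requested means and standard deviations
        pairs.append((x1 * sd1 + mu1, y * sd2 + mu2))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation of a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    s_aa = sum((a - mean_a) ** 2 for a, _ in pairs)
    s_bb = sum((b - mean_b) ** 2 for _, b in pairs)
    return s_ab / math.sqrt(s_aa * s_bb)


values = correlated_genetic_values(20000, 0.6, mu1=45.0, sd1=2.0, mu2=180.0, sd2=5.0)
print(pearson(values))
```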
2.5.3 Simulation of trial data from multiple years and sites
When trial data from multiple years and sites are available, both models 2 and 3 can be used for genetic parameter estimation. Model 1 can also be applied for analysis of single trials. To compare these three statistical models, we simulated trial data from four years and two sites per year that were similar to those of the case study. The same trial design and simulation procedure as the single trial were used but several major effects for years and sites, and some interaction effects, were added to the linear model (Formula (10) and Table 3). A total of eight trials were produced for a given H2. All three models were used to estimate H2.
2.6 Pipeline programs
The ANOVA and covariance analyses in Table 1, Table 2 and Table 3 were implemented using SAS software (SAS Institute Inc., Cary, USA). The results from SAS served as input to a Perl program and were further analyzed to estimate several genetic parameters and their sampling variances. A new module including a SAS and a Perl program was appended to the MAD pipeline [19].
3 Results
3.1 Computer simulations
3.1.1 Estimation of genetic parameters and their sampling errors
Given the different $H^2$ values from 0.1 to 1.0, the average estimates $\hat{H}^2$ of 1000 simulated datasets were highly correlated ($R^2$ = 0.998) with $H^2$ (Fig. 1A); both theoretical and simulated sampling errors ($s(\hat{H}^2)$) decreased with increasing $H^2$ (Fig. 1B); and the simulated $s(\hat{H}^2)$ was highly correlated with the theoretical $s(\hat{H}^2)$ (Fig. 1C). The $s(\hat{H}^2)$ values estimated from the two methods were consistent except when $H^2$ was less than 0.3. These results indicate that estimation of $H^2$ and its $s(\hat{H}^2)$ using the derived theoretical formula is reliable and that the reliability of the estimates increases with $H^2$.
Fig. 1. Simulation of broad-sense heritability ($H^2$).
(A) Simulation-based estimated $\hat{H}^2$ and its sampling error $s(\hat{H}^2)$ in relation to $H^2$. (B) Simulated and theoretical sampling errors ($s(\hat{H}^2)$) in relation to $H^2$. (C) Relationship between simulated $s(\hat{H}^2)$ and theoretical $s(\hat{H}^2)$.
Similarly, we simulated trial data for estimation of $r_g$ for values ranging from 0.1 to 0.9. $r_g$ was calculated based on the genetic covariance and variance of two traits. Considering that two traits may have different heritabilities, we generated data for 729 parameter combinations of different $r_g$ (0.1–0.9), $H_1^2$ (0.1–0.9) and $H_2^2$ (0.1–0.9), each with 1000 simulations. A significant correlation between $r_g$ and $\hat{r}_g$ ($R^2$ = 0.7242) was observed (Fig. 2A), but this relationship was more complex than that between $\hat{H}^2$ and $H^2$ (Fig. 1A). Large sampling errors were observed for any given $r_g$, which may result from the bias caused by the correlated errors of two traits ($r_e$). We also noticed that the theoretical $s(\hat{r}_g)$ was slightly higher than the simulated $s(\hat{r}_g)$ (Fig. 2B), though the theoretical $s(\hat{r}_g)$ was also highly correlated with the simulated $s(\hat{r}_g)$ (Fig. 2C).
Fig. 2. Simulation of genetic correlation coefficient ($r_g$).
(A) Estimated $\hat{r}_g$ based on simulation data in relation to given $r_g$. (B) Simulated and theoretical $s(\hat{r}_g)$ in relation to $r_g$. (C) Relationship between simulated $s(\hat{r}_g)$ and theoretical $s(\hat{r}_g)$. The dots in plots represent averages of estimates from 1000 simulations.
3.1.2 Sampling distribution of genetic parameters
Using 1000 simulations (or samples) for each given parameter or a combination of parameters, we can calculate the sampling error for each simulation and assess the sampling distribution of the parameter. Most samples appeared to be normally or near-normally distributed for $\hat{H}^2$ and $\hat{r}_g$. Fig. 3A and B show several typical examples of the sampling distributions for $\hat{H}^2$ and $\hat{r}_g$, respectively. Based on all the simulated samples in two simulation experiments, 97% and 92% of the samples for $\hat{H}^2$ and $\hat{r}_g$, respectively, were normally distributed (P > 0.05) and the remainder followed an approximate normal distribution, suggesting that the theoretically estimated sampling error of a parameter estimate can be used with a Z test to assess approximately whether the estimate differs significantly from zero.
Fig. 3. Sampling distributions of broad-sense heritability ($\hat{H}^2$) (A), and genetic correlation coefficient ($\hat{r}_g$) (B) at several parameter values.
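The approximate Z test mentioned above divides an estimate by its sampling error and refers the ratio to the standard normal distribution. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
import math


def z_test(estimate, sampling_error):
    """Approximate two-sided Z test of H0: parameter = 0."""
    z = estimate / sampling_error
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Made-up heritability estimate and sampling error
z, p = z_test(0.45, 0.15)
print(z, p)
```

The test is only as good as the normal approximation, which the simulations suggest is weakest at low heritability.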
3.1.3 Comparison of statistical models
For the joint data analysis of trials from multiple years and sites, two statistical models, model 2 (Table 2) and model 3 (Table 3), are suitable. Technically, model 1 (Table 1) can also be used for a single-trial analysis. The question was whether all three models could accurately estimate the genetic parameters when significant genotype-by-environment interaction effects were present. To compare the three statistical models for the same sets of data, we simulated trial data from four years and two sites (similar to the case study). The results showed that only model 3 produced accurate $H^2$ estimates, whereas models 2 and 1 overestimated $H^2$, especially at low $H^2$ values (Fig. 4A). The theoretically estimated sampling errors of $\hat{H}^2$ fitted the simulated ones well in all three models (Fig. 4B). The sampling errors of $\hat{H}^2$ in model 3 were higher than those in models 2 and 1. Although $\hat{H}^2$ in model 1 had the lowest sampling errors, the estimates deviated greatly from the correct values.
Fig. 4. Simulation of broad-sense heritability ($H^2$) under different statistical models, assuming significant genotype-by-environment (year and site) interaction effects.
(A) Estimated $\hat{H}^2$ in relation to $H^2$. (B) Simulated and theoretical sampling errors ($s(\hat{H}^2)$) in relation to $H^2$.
3.2 Case study
OIL, IOD and LIN are three phenotypic traits important in flax breeding for flaxseed or linseed. For the trial data of the BM population from 4 years at two sites, we first performed data adjustment using the MAD pipeline [19]. Then, using the adjusted observations, we calculated $\hat{H}^2$ (Table 7) and $\hat{r}_g$ (Table 8) for the three traits and their sampling errors on a single-plot and a genotype mean basis. Two statistical models (models 2 and 3) were applied to the same dataset. We also estimated the genetic parameters using model 1 independently for each of eight trials. Similar estimates of both parameters were obtained with model 2 and model 3, possibly because of the high heritability of the traits. As expected, higher estimates of $\hat{H}^2$ and $\hat{r}_g$ of the three traits were obtained from model 1 (Table 7 and Table 8). The sampling error estimates from model 3 were consistently higher than those from models 2 and 1 (Table 7 and Table 8), in accordance with the simulation results (Fig. 4). Because the two genetic parameters follow a normal sampling distribution (Fig. 3), we could perform an approximate Z test to determine whether the estimates of the parameters were significantly different from zero. All three traits had high and statistically significant (P < 0.01) heritability estimates. For $\hat{r}_g$, the estimates of all possible trait pairs were significant in model 2 and model 1, but the estimates of some trait pairs in model 3 were not significant because of their higher sampling errors. In addition, the estimates of $\hat{H}^2$ based on the genotype mean were larger than those based on single plots because the estimation of phenotypic variances differed (Formulas (6), (9), and (11)).
Table 7. $\hat{H}^2$ and $s(\hat{H}^2)$ for three traits (OIL, IOD and LIN) in the BM population.
a Model 3: joint analysis of 4 years × 2 sites; Model 2: joint analysis using eight environments (each site/year as an environment); and Model 1: one single trial (2012 at Morden) is shown as an example.
b Genotype mean: on an entry-mean basis; Plot: on a plot basis.
⁎⁎ Represents statistical significance at the 0.01 probability level.
Table 8. $\hat{r}_g$ and $s(\hat{r}_g)$ between three traits (OIL, IOD and LIN) in the BM population.
4 Discussion
An augmented design is usually applied by breeders to a large number of lines that are to be planted in a field of limited size. Error variance and genetic parameters may be estimated from replicated controls in unreplicated experimental designs such as MAD2. In the present study, genetic variance (covariance) was calculated based on total phenotypic variance (covariance) estimated from the test genotypes minus error variance (covariance) estimated from the control genotypes. This separate analysis approach provides approximate estimates of genetic parameters based on the MAD2 design, although it is not optimal for some cases. Our simulation results suggest that the method we propose is highly accurate for estimating H2 with the reliability of the estimates increasing with trait heritability. Estimates of rg had larger sampling errors than those of H2, indicating that the latter is less subject to environmental effects.
We derived approximate theoretical sampling error formulas for the two genetic parameters using the Delta method [28] and [29]. The theoretical sampling errors of both genetic parameters were highly consistent with the simulated sampling errors, except for a few cases at very low heritability (Fig. 1, Fig. 2 and Fig. 3). This suggests that the estimation of sampling errors for the two genetic parameters in MAD2 is reliable and that the estimated sampling errors can be used to test whether the estimated genetic parameters are significantly different from zero.
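For orientation, the generic first-order Delta-method approximations for a ratio and for a correlation-type function are given below. These are the standard textbook forms, not the specific formulas derived in this paper; the symbols A, B and C are introduced here only for illustration and denote the estimated genetic variances of two traits and their genetic covariance, with the phenotypic variance written as \sigma_P^2.

```latex
% Ratio H^2 = \sigma_G^2 / \sigma_P^2
\operatorname{Var}\!\left(\hat{H}^2\right) \approx
\left(\frac{\hat\sigma_G^2}{\hat\sigma_P^2}\right)^{2}
\left[
\frac{\operatorname{Var}(\hat\sigma_G^2)}{(\hat\sigma_G^2)^2}
+ \frac{\operatorname{Var}(\hat\sigma_P^2)}{(\hat\sigma_P^2)^2}
- \frac{2\,\operatorname{Cov}(\hat\sigma_G^2,\hat\sigma_P^2)}{\hat\sigma_G^2\,\hat\sigma_P^2}
\right]

% Correlation r_g = C / \sqrt{A B}
\operatorname{Var}(\hat r_g) \approx \hat r_g^{\,2}
\left[
\frac{\operatorname{Var}(C)}{C^2}
+ \frac{\operatorname{Var}(A)}{4A^2}
+ \frac{\operatorname{Var}(B)}{4B^2}
- \frac{\operatorname{Cov}(C,A)}{CA}
- \frac{\operatorname{Cov}(C,B)}{CB}
+ \frac{\operatorname{Cov}(A,B)}{2AB}
\right]
```

The square roots of these approximate variances are the sampling errors that enter a Z test of the estimates.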
Theoretically, the total variance of the test genotypes (the mean square Aii in Table 1) will be greater than the error variance in a single trial. Accordingly, we were able to obtain genetic variance as total variance minus the error variance. However, because a limited number of control genotypes (three in our case) were used to estimate the error variance, the latter estimate is occasionally greater than the total variance of the test genotypes as a consequence of sampling bias. This results in negative genetic variance estimates and failure to estimate genetic parameters. In our simulation, when H2 = 0.1, 22.5% of simulation data sets failed to yield estimates of genetic parameters; when H2 = 0.3, only 0.6% failed; and when H2 > 0.3, none failed. When the heritability of a trait is very low (e.g. < 0.1), the method proposed in this paper is sometimes unable to estimate genetic parameters precisely. In addition, there is some risk of misadjustment in this design if control genotypes show a different error variance or perform differently from the unreplicated entries [35]. Some alternatives have been proposed to reduce this risk, such as partially replicated (p-rep) designs, where a proportion of the test entries are replicated at each location [36], [37] and [38].
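The dependence of the failure rate on heritability can be illustrated with a small Monte Carlo sketch. It is not the simulation used in this study: the numbers of test genotypes, control plots and replicates are assumptions chosen only for illustration, no field layout or soil adjustment is modeled, and the resulting rates will not reproduce the percentages reported above.

```python
import random
from statistics import variance


def failure_rate(h2, n_test=100, n_controls=3, n_reps=5, n_sim=500, seed=1):
    """Fraction of simulated trials with a non-positive genetic variance estimate.

    Total phenotypic variance is fixed at 1, so the genetic variance is h2
    and the error variance is 1 - h2. All design sizes are hypothetical.
    """
    rng = random.Random(seed)
    sd_g = h2 ** 0.5
    sd_e = (1.0 - h2) ** 0.5
    failures = 0
    for _ in range(n_sim):
        # Unreplicated test genotypes: one plot each (genetic + error effect).
        test = [rng.gauss(0, sd_g) + rng.gauss(0, sd_e) for _ in range(n_test)]
        total_var = variance(test)
        # Replicated controls: pooled within-control variance estimates the error.
        ss_within = 0.0
        for _ in range(n_controls):
            plots = [rng.gauss(0, sd_e) for _ in range(n_reps)]
            ss_within += variance(plots) * (n_reps - 1)
        error_var = ss_within / (n_controls * (n_reps - 1))
        if total_var - error_var <= 0:
            failures += 1
    return failures / n_sim


for h2 in (0.1, 0.3, 0.5):
    print(f"H2 = {h2}: failure rate = {failure_rate(h2):.3f}")
```

Because the error variance is estimated with few degrees of freedom, its estimate is highly variable, and it exceeds the total variance more often when the true genetic variance is small.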
There are two units used to measure phenotypic variances: one based on the single plot and the other based on the genotype mean. The two measurement units will generate different estimates of H2; however, the estimation of rg is not affected because the numerator and denominator of Formula (1) for calculating rg involve only genetic components. The estimates of H2 based on the genotype mean were always larger than those based on the plot (Table 7 and Table 8) because, on a genotype mean basis, the error and interaction variance components were divided by the corresponding number of observations in the measurement unit (Formulas (6), (9), and (11)), which reduces the phenotypic variance. Because MAD2 is an unreplicated unbalanced design, each adjusted observation comes from single plots only, and estimates based on the plot may be more reasonable estimates of genetic variation.
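The effect of the measurement unit can be sketched for a multi-environment analysis of the model 2 type. The phenotypic variance forms used here are the conventional ones for one plot per genotype per environment and are an assumption; Formulas (6), (9) and (11) are not reproduced in this section. The function name and the variance component values are hypothetical.

```python
def heritability(var_g, var_ge, var_e, n_env, basis="plot"):
    """Broad-sense heritability on a plot or genotype-mean basis.

    Assumes one plot per genotype in each of n_env environments.
    """
    if basis == "plot":
        # Every component enters the phenotypic variance in full.
        var_p = var_g + var_ge + var_e
    elif basis == "mean":
        # Interaction and error are averaged over the environments.
        var_p = var_g + var_ge / n_env + var_e / n_env
    else:
        raise ValueError("basis must be 'plot' or 'mean'")
    return var_g / var_p


# Hypothetical variance components, eight environments.
h2_plot = heritability(2.0, 1.0, 1.0, n_env=8, basis="plot")
h2_mean = heritability(2.0, 1.0, 1.0, n_env=8, basis="mean")
print(f"plot basis: {h2_plot:.3f}, genotype-mean basis: {h2_mean:.3f}")
```

With the same variance components, the genotype-mean basis gives the smaller denominator and therefore the larger heritability.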
Three statistical models were considered. Because model 1 deals only with single-trial data, the genetic variance contains an undecomposable genotype-by-environment interaction and consequently H2 and rg are always overestimated (Fig. 4A, Table 7 and Table 8). For this reason, we suggest a joint analysis of trials from multiple environments (different years and/or sites) with model 2 or model 3. However, in the presence of a significant genotype-by-environment effect, H2 is generally overestimated in both models 1 and 2 (Fig. 4A). Theoretically, in model 2, the total variation of the test genotypes is partitioned into three components: G, E, and G × E (Table 2), whereas in model 3, E is further partitioned into Y, S, and Y × S, and G × E into G × Y, G × S, and G × Y × S. Hence, ANOVA of the same dataset gave identical sums of squares (SS) of G in models 2 and 3; the SS of E was equal to the sum of the SS of Y, S and Y × S; and the SS of G × E was equal to the sum of the SS of G × Y, G × S, and G × Y × S. Both models also yielded the same error variances. The two models applied different formulas (Formulas (9) and (11)) to estimate σG2, σGE2 or σGY2, σGS2, and σGYS2, which resulted in higher σG2 and lower σGE2 in model 2 than in model 3 (Fig. 5) and consequently in the overestimation of genetic parameters in model 2. However, because more of the partitioned variance components in model 3 are estimated indirectly, higher sampling errors usually ensue; this is the major reason for the higher sampling variance of the genetic parameters estimated from model 3. Model 2 yields reasonable estimation accuracy and low sampling variance. Because model 2 treats all years, sites or their combinations as environments, it can be applied when data are completely missing for a year or a site, or when only multiple years or only multiple sites are available. Thus, model 2 is a more practical and flexible statistical model for genetic parameter estimation using datasets from multiple years and sites.
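The identity between the environment SS of model 2 and the year, site and year-by-site SS of model 3 can be checked numerically. The sketch below assumes a balanced layout with the same number of plots in every year-by-site cell, which is a simplification of real MAD2 data; the function name and the cell means are hypothetical.

```python
def partition_environment_ss(cell_means, n_per_cell):
    """Split the environment SS into year, site and year-by-site SS.

    cell_means[y][s] is the mean of the environment formed by year y and
    site s; every cell is assumed to contain n_per_cell plots.
    """
    n_years = len(cell_means)
    n_sites = len(cell_means[0])
    grand = sum(sum(row) for row in cell_means) / (n_years * n_sites)
    year_means = [sum(row) / n_sites for row in cell_means]
    site_means = [
        sum(cell_means[y][s] for y in range(n_years)) / n_years
        for s in range(n_sites)
    ]
    # Model 2: each year/site combination is one environment.
    ss_e = n_per_cell * sum((m - grand) ** 2 for row in cell_means for m in row)
    # Model 3: the same variation split into Y, S and Y x S.
    ss_y = n_per_cell * n_sites * sum((m - grand) ** 2 for m in year_means)
    ss_s = n_per_cell * n_years * sum((m - grand) ** 2 for m in site_means)
    ss_ys = n_per_cell * sum(
        (cell_means[y][s] - year_means[y] - site_means[s] + grand) ** 2
        for y in range(n_years)
        for s in range(n_sites)
    )
    return ss_e, ss_y, ss_s, ss_ys


# Hypothetical environment means for 4 years x 2 sites.
means = [[40.1, 42.3], [38.7, 41.0], [43.2, 44.9], [39.5, 40.2]]
ss_e, ss_y, ss_s, ss_ys = partition_environment_ss(means, n_per_cell=250)
print(ss_e, ss_y + ss_s + ss_ys)
```

The two printed values agree up to floating-point rounding, which is the sense in which models 2 and 3 partition the same total variation.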
Fig. 5. Partition of genetic variance (A) and genotype-by-environment variance (B) in 1000 simulation replicates at H2 = 0.1 for models 2 and 3.
5 Conclusions
We have proposed an approximation method to estimate H2 and rg and their respective sampling variances for MAD2 trials. The simulation results suggest that H2 can be reliably estimated in the MAD2 trial. The sampling error estimates based on the derived theoretical formulas coincide with the simulated values and can be applied to statistical tests of estimated genetic parameters.
Acknowledgments
This work was partly supported by an A-base project funded by Agriculture and Agri-Food Canada, the TUFGEN project funded by Genome Canada and other stakeholders, and funds from the Western Grains Research Foundation. The authors thank Andrzej Walichnowski for manuscript editing.
A.R. Golparvar, M.M. Gheisari, D. Naderi, A.M. Mehrabi, A. Hadipanah, S. Salehi
Determination of the best indirect selection criteria in Iranian durum wheat (Triticum aestivum L.) genotypes under irrigated and drought stress conditions
Field evaluation of type 2 modified augmented designs for non-replicated yield trials in the early stages of a wheat breeding program
Bericht uber die Arbeitstagung 2002 der Vereinigung der Pflanzenzuchter und Saatgutkaufleute Osterreichs gehalten vom 26. bis 28 November 2002 in Gumpenstein, P. Ruckenbauer, F. Raab, R. Kern, K. Buchgraber, A. Schaumberger, 2003
Statistical analysis and field evaluation of the type 2 modified augmented design (MAD) in phenotyping of flax (Linum usitatissimum) germplasms in multiple environments
Simulation study of three adjustment methods for the modified augmented design and comparison with the balanced lattice square design soil variation, statistical models