Rice (Oryza sativa L.) is one of the most important crops in the
world and is considered one of the main annual crops in Brazil. With the
increase in population, demand has increased throughout the years, and
it is estimated that by 2050 global rice production must increase by
60 to 110% to supply the population demand (Godfray et al., 2010;
Tilman, Balzer, Hill, & Befort, 2011; Ray, Mueller, West, &
Foley, 2013). However, this will only be possible as long as genetic
variability is maintained. In Brazilian irrigated rice breeding
programs, genotypic variability is restricted (Rabelo, Guimaraes,
Pinheiro, & Silva, 2015; Streck, Aguiar, Magalhaes Junior,
Facchinello, & Oliveira, 2017); therefore, investigating the genetic
diversity of rice genotypes is critical.

The recommendation of rice cultivars for commercial plantations is
dynamic, and new cultivars are periodically recommended as substitutes
for those that are less productive or less commercially accepted (Soares
et al., 2008). New cultivar development is crucial to help increase food
availability, and the success of breeding programs relies on the
existence of genetic variability. Breeders have recommended the
formation of a base population by intercrossing superior and
genetically divergent cultivars. This is essential for the success of
breeding programs (Cruz, Ferreira, & Pessoni, 2011).

One of the first steps in the formation of a base population is to
guarantee genotypic variability through the morphological, physiological
and molecular differences of genitors, generally expressed by a
dissimilarity measure (Cruz, 2012). Genetic distance estimations are
dependent on the data set available, as well as their phenotypic,
genotypic, molecular or geographic features (Cruz et al., 2011).

Multivariate techniques such as the Unweighted Pair Group Method
with Arithmetic Mean (UPGMA), Tocher method, Principal Components
Analysis (PCA) and Canonical Variables (CV) have been used as
alternatives for the simultaneous comparison of qualitative and
quantitative traits, resulting in more precise distance estimates and
more accurate
genetic diversity predictions among genotypes (Barbosa, Viana, Quintal,
& Pereira, 2011; Preisigke et al., 2015). Another promising approach
for genetic diversity studies is the self-organizing maps (SOM) method,
which consists of a computational intelligence technique that allows for
the visualization of similar patterns and data classification based on
the distances between them (Kohonen, 2014).

SOM is a type of two-dimensional artificial neural network that
organizes data through an unsupervised learning process and preserves
neighborhood relationships using Euclidean distance. Learning begins
with the assignment of synaptic weights to the neurons. A competition
process then starts, in which each data sample is allocated to the
neuron that best represents it, called the "winner". In the cooperation
step, the winning neuron determines which other neurons are drawn
toward the sample, according to their proximity to the winner. Finally,
the neurons in this neighborhood go through the adaptation phase, in
which their weights are adjusted. After all iterations, the map
is organized in a topological structure that reflects the proximity of
the elements under study.
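
The competition, cooperation and adaptation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Matlab configuration used in this study: the function names, the rectangular grid, the Gaussian neighborhood function and the linearly decaying learning rate and radius are all hypothetical choices.

```python
import math
import random

def train_som(samples, rows, cols, iterations=1000, lr0=0.5, radius0=1.5, seed=0):
    """Train a small rectangular-grid SOM; returns {(row, col): weight vector}.

    lr0, radius0 and the linear decay are illustrative assumptions.
    """
    rng = random.Random(seed)
    dim = len(samples[0])
    # Initialization: each neuron receives random synaptic weights.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        # Learning rate and neighborhood radius shrink as training proceeds.
        frac = 1.0 - t / iterations
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        x = rng.choice(samples)
        # Competition: the winner is the neuron closest to the sample.
        winner = min(weights, key=lambda n: math.dist(weights[n], x))
        for node, w in weights.items():
            # Cooperation: influence decays with grid distance from the winner.
            grid_dist = math.dist(node, winner)
            h = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            # Adaptation: move the neuron's weights toward the sample.
            for k in range(dim):
                w[k] += lr * h * (x[k] - w[k])
    return weights

def assign(samples, weights):
    """Allocate each sample to its best-matching neuron."""
    return [min(weights, key=lambda n: math.dist(weights[n], x)) for x in samples]
```

Calling `train_som(samples, rows=3, cols=2)` followed by `assign(samples, weights)` would return, for each sample, the grid position of the neuron (group) it was allocated to.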

SOM can present a hexagonal topology, where each neuron has at most
six direct neighbors, or a quadratic (grid) topology, with at most four
direct neighbors. In addition, different arrangements can be
established, which define the number of neurons available on the map.
For example, a map with a two by three arrangement presents six neurons
arranged in two columns and three rows. This technique is widely used in
various branches of science such as engineering (Akkiraju, Keskinocak,
Murthy,
& Wu, 2001), industry (Liukkonen, Laakso, & Hiltunen, 2013) and
economics (Louis, Seret, & Baesens, 2013; Sarlin, 2013). Although
self-organizing maps are widely used, this methodology is still
relatively under-explored in plant breeding.
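
The effect of topology and arrangement on the number of direct neighbors can be checked with a short sketch. The coordinate convention is an assumption made for illustration: neurons are indexed as (row, column), and the hexagonal map uses an offset layout in which odd rows are shifted half a cell to the right. Other hexagonal layouts are possible, and the function names are hypothetical.

```python
def neighbors(node, rows, cols, topology="grid"):
    """Direct neighbors of a neuron at (row, col) on a rows x cols map."""
    r, c = node
    if topology == "grid":
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif topology == "hexagonal":
        # Assumed offset layout: odd rows are shifted half a cell to the right.
        shift = 0 if r % 2 == 0 else 1
        steps = [(0, -1), (0, 1),
                 (-1, shift - 1), (-1, shift),
                 (1, shift - 1), (1, shift)]
    else:
        raise ValueError("topology must be 'grid' or 'hexagonal'")
    # Keep only the positions that fall inside the map.
    return [(r + dr, c + dc) for dr, dc in steps
            if 0 <= r + dr < rows and 0 <= c + dc < cols]

def max_neighbors(rows, cols, topology="grid"):
    """Largest number of direct neighbors found for any neuron on the map."""
    return max(len(neighbors((r, c), rows, cols, topology))
               for r in range(rows) for c in range(cols))
```

With the grid topology, a map with a single row of five neurons gives each neuron at most two direct neighbors, while a map with three rows and two columns gives at most three.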

This work aims to test and present the self-organizing maps
technique as an alternative method to evaluate the genetic diversity in
plant breeding programs.

Material and methods

Twenty-five genotypes were evaluated (Table 1) from the irrigated
rice breeding program of the Empresa de Pesquisa Agropecuaria de Minas
Gerais (EPAMIG), in partnership with Embrapa Arroz e Feijao, of which
five were checks (Rio Grande, Ourominas, Seleta, Predileta, and
Rubelita). Experiments were carried out in lowland soils under
continuous flooding conditions in a randomized block design with three
replications in the 2012/2013 growing season, in two municipalities of
the state of Minas Gerais: Leopoldina (21°31'12"S,
42°38'43"W) and Lambari (21°58'32"S,
45°21'01"W). The experimental plots consisted of
five-meter plant rows with 0.30 m row spacing, in a total plot area of
7.5 m² and with a useful area of 3.60 m². The sowing
density was 300 seeds m⁻².

The agronomic traits evaluated were grain yield (kg/ha), plant
height (cm), flowering (days), 100-grain weight (g), grain size (length,
width and thickness), length/width ratio, panicle length, number of full
grains/panicle, full grains percentage and number of stems/m².
All cultural practices were carried out as recommended for the crop
(Borem & Nakano, 2015).

Joint analysis of variance was performed for each trait,
considering the effects of genotypes to be fixed and environments to be
random, according to Equation 1:

For the genetic diversity study, dissimilarity matrices for each
environment were obtained based on the average Euclidean distance. The
clustering method using the conventional statistical approach was the
Unweighted Pair Group Method with Arithmetic Mean (UPGMA). The Mojena
(1977) criterion was used to define the optimal number of dendrogram
groups, adopting k = 1.25. To control cluster consistency and quality,
the cophenetic correlation coefficient (CCC), the distortion between the
dissimilarity matrix and the matrix obtained after dendrogram (graphical
matrix), and the stress (adjustment precision obtained with the
projection of the dissimilarity matrix in the dendrogram) were obtained.
The CCC was
given by the correlation between the elements of the dissimilarity
matrix and the elements from the matrix produced by the phenogram
(cophenetic matrix) (Silva & Dias, 2013).
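
The two checks described above can be illustrated with a minimal sketch. It assumes the usual form of the Mojena rule, in which the cutoff equals the mean of the dendrogram fusion levels plus k times their sample standard deviation, and it computes the CCC as a Pearson correlation between matching entries of the dissimilarity and cophenetic matrices. The function names and the toy values are hypothetical and are not data from this study.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation between two equally long lists of distances."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mojena_cutoff(fusion_levels, k=1.25):
    """Assumed Mojena rule: mean fusion level + k * sample standard deviation."""
    return statistics.fmean(fusion_levels) + k * statistics.stdev(fusion_levels)

def number_of_groups(fusion_levels, k=1.25):
    """Cutting at the cutoff: each fusion above it separates one more group."""
    cutoff = mojena_cutoff(fusion_levels, k)
    return 1 + sum(1 for level in fusion_levels if level > cutoff)

# Toy example with three items (hypothetical values).
# Original distances: d(A,B) = 1, d(A,C) = 4, d(B,C) = 5.
# UPGMA joins A and B at 1, then C at (4 + 5) / 2 = 4.5.
original = [1.0, 4.0, 5.0]
cophenetic = [1.0, 4.5, 4.5]
ccc = pearson(original, cophenetic)
```

In practice, the fusion levels and the cophenetic matrix would come from the UPGMA dendrogram itself; here they are typed in by hand only to keep the example short.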

The genotypes were also clustered using self-organizing maps, an
unsupervised machine learning technique. The replication
averages of each genotype for all 11 variables in each trial
were used as inputs for this approach. No outputs were stipulated a
priori for each genotype, because this is an unsupervised technique. To
evaluate the SOM consistency and the best configuration to be used when
performing the clustering, eight scenarios were established that varied
according to the number of neurons, their arrangement and the topology
used. For each experiment, the eight scenarios were tested, for a
total of 16 analyses.

Five or six neurons were initially adopted as possible centroids of
the genotype groups to be formed. Because this technique is influenced
both by the number of neurons adopted a priori and by the
arrangement of these neurons, this study also evaluated whether any
arrangement allowed better visualization of the diversity
among the genotypes. The possible arrangements varied with the number of
neurons, and with whether that number was prime.

Thus, in scenarios for which a prime number of neurons was adopted,
such as the arrangements containing five neurons, only one by five or
the inverse could be used. In the condition of six neurons, a larger
number of possibilities could be used; however, only the arrangements of
two by three or the inverse were adopted. The decision was made not to
use the other possible arrangements because they resembled the
arrangements with five neurons.
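
The reasoning about prime and non-prime numbers of neurons amounts to listing the factor pairs of the neuron count. A minimal sketch follows, with a hypothetical function name and pairs written as (columns, rows) to follow the convention used in this text:

```python
def arrangements(n_neurons):
    """All (columns, rows) pairs whose product equals n_neurons."""
    return [(a, n_neurons // a)
            for a in range(1, n_neurons + 1)
            if n_neurons % a == 0]

five = arrangements(5)  # a prime count only allows a single line of neurons
six = arrangements(6)   # a non-prime count allows more than one shape
```

For five neurons this yields only the one by five arrangement and its inverse. For six neurons it also yields one by six and six by one, which are the single-line arrangements that were not adopted here.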

Although Kohonen (2014) affirms that the hexagonal topology allows
for better visualization of the general data structure, in addition to
minimizing the errors, it is not yet known whether one topology is more
suitable than the other for genetic diversity studies. Therefore, in
this study we tested both the grid and the hexagonal topologies.
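
The scenario count follows directly from combining the four arrangements with the two topologies, and then with the two environments. The sketch below enumerates them. The ordering it produces is arbitrary and is not meant to reproduce the scenario numbers used in the tables and figures, and the dictionary keys are hypothetical.

```python
from itertools import product

# Arrangements as (columns, rows) and topologies, as described in the text.
ARRANGEMENTS = [(1, 5), (5, 1), (2, 3), (3, 2)]
TOPOLOGIES = ["grid", "hexagonal"]
ENVIRONMENTS = ["Leopoldina", "Lambari"]

def build_scenarios():
    """One dict per SOM configuration: 4 arrangements x 2 topologies = 8."""
    return [{"columns": cols, "rows": rows,
             "neurons": cols * rows, "topology": topology}
            for (cols, rows), topology in product(ARRANGEMENTS, TOPOLOGIES)]

scenarios = build_scenarios()
# Each scenario is run once per environment: 8 x 2 = 16 analyses.
analyses = list(product(ENVIRONMENTS, scenarios))
```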

For SOM processing in the different scenarios, the standardized
average Euclidean distance was used. For the iterative process,
1000 iterations were stipulated for each scenario. The Matlab (Matlab,
2010) and GENES (Cruz, 2012) software were used to perform the
analyses.
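
As an illustration of the distance measure, the sketch below standardizes each variable and then computes the average Euclidean distance between genotypes. It assumes that standardization means centering each variable and dividing by its sample standard deviation, and that the average Euclidean distance is the square root of the mean squared difference across variables. The function names and input values are hypothetical.

```python
import math
import statistics

def standardize(data):
    """Center each variable (column) and scale it by its sample standard deviation."""
    columns = list(zip(*data))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]

def mean_euclidean(a, b):
    """Square root of the mean squared difference over all variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def distance_matrix(data):
    """Dissimilarity matrix between rows (genotypes) of standardized data."""
    z = standardize(data)
    n = len(z)
    return [[mean_euclidean(z[i], z[j]) for j in range(n)] for i in range(n)]

# Hypothetical means of three genotypes for two variables.
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
d = distance_matrix(toy)
```

Dividing by the number of variables keeps distances comparable when genotypes are described by different numbers of traits.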

Results and discussion

The F test revealed significant effects for genotypes (p < 0.05)
for the following traits: number of full grains per panicle, full grains
percentage, grain width and grain length/width ratio (Table 2). The
coefficients of variation estimated for most traits were compatible with
those obtained in other studies of rice (Silva, Silva, Guimaraes, &
Moura, 2011; Hosan, Sultana, Iftekharuddaula, Ahmed, & Mia, 2010;
Streck et al., 2017), indicating acceptable experimental precision.

The genotype x environment (G x E) interaction effect was
significant for grain yield, panicle length, grain length, width and
thickness, and the grain length/width ratio, indicating
that the genotypes behaved differently across the evaluated
environments for these traits. Therefore, the decision was made to study
the genetic diversity among the genotypes separately for each
environment, because the clustering pattern may vary according to
environmental variation.

For the municipality of Leopoldina (Figure 1A), the UPGMA
clustering presented a CCC of 0.7306, with a distortion of 2.53%
and a stress of 15.90%. For Lambari (Figure 1B), the CCC, distortion and
stress values were 0.7025, 4.02% and 20.07%, respectively, showing
adequate agreement between the dissimilarity matrices and the
dendrograms. According to the global criterion introduced by Mojena
(1977), and taking the last fusion level as reference, five different
groups were formed at the 80% level of similarity in both
dendrograms.

Although the number of groups was the same, the clustering method
grouped the genotypes differently in each environment, a result
expected given the significant G x E interaction for several of the
evaluated traits. Despite this, the genotypes BRA 041099, MGI 0712-1,
MGI 0901-5, MGI 0713-17, RUBELITA, MGI 0607-1, BRA 02706, BRA 02708,
MGI 0517-25, OUROMINAS, and RIO GRANDE remained together in a single
group, regardless of the environment (Figure 1A and B), reinforcing the
idea of greater similarity among them.

In addition to cluster analysis under the conventional biometric
approach, a spatial ordering of genotypes was also performed through
self-organizing maps, which rely on computational intelligence to
search for an optimal solution (Kohonen, 1990). Smith and
Ng (2003) affirm that it is difficult to quantify the efficiency of
clustering made from SOMs, although they were able to generate clearly
distinguishable groups.

In the experiment conducted in Leopoldina, the maps with five neurons
(scenarios one, two, three and four) showed genotypic organization with
high agreement among groups (Table 3). In this configuration, scenarios
one, three and four produced the same organization pattern for all
genotypes. Only clusters one, two and three of scenario two diverged
from the other scenarios with quadratic topologies. However, even with
this variation, there was some agreement between these groups,
because in all scenarios the genotype pairs 6 and 18, and 8 and 13, for
example, remained in the same group. The genotypes whose classification
diverged most across the maps for this environment were genotypes
16, 19, and 24.

In Lambari, the map results for the scenarios with five neurons were
similar (Table 4). In this case, genotypes 16, 21, and 22 were those
that diverged regarding the allocated groups. In general, the maps with
the one by five and five by one arrangements presented similar
classification results, mainly because they
are very simple and because the topology probably does not interfere in
these configurations. In scenarios five, six, seven and eight, the map
results showed little divergence among themselves, and in scenarios
six, seven and eight, the genotypes were allocated to identical groups.
Only groups two and three from scenario five differed from the other
scenarios, with emphasis on genotypes 21 and 23, which had a distinct
allocation pattern in other scenarios with five neurons. This result
highlights the possibility of anomalies that arise because this is an
iterative process, and because of the genetic nature of these related
and very similar genotypes.

Although different clustering techniques provide different
views of diversity, some agreement is expected among them.
Therefore, the results obtained by the self-organizing maps were
compared to those obtained by the conventional statistical approach,
with the main goal of observing the clustering behavior associated with
these techniques and evaluating their complementarity.

The genotypes were identified in each map according to the UPGMA
clustering. Genotype ordering according to the SOM technique was
consistent with the hierarchical clustering results, because the basic
structure of the UPGMA groups was preserved in each group of the maps
(Figures 2 and 3). Considering genotype arrangements and the group
neighbors, maps with five neurons organized the genotypes less
efficiently than the six-neuron arrangements in both environments, as
can be observed in Figure 2, where genotypes 8, 13 and 16 were
separated into non-neighboring groups in scenarios with five neurons,
but remained in neighboring groups in all scenarios with six neurons.

In scenarios with five neurons, each group has at most two direct
neighbors, while in the six-neuron arrays each group can have up to
three neighboring groups; therefore, the technique is able to capture
the proximity among groups and to organize the genotypes more
efficiently in each one according to the grouping carried out. In
addition, Figure 3 shows that genotypes one and nine, in scenarios two,
three, four, six and eight, were distinctly allocated to the groups
consisting of their peers according to UPGMA clustering. According to
Kohonen (2014), hexagonal topology allows for better neuron
arrangements; moreover, simpler configurations such as those used with
five neurons may distort genotype organization, especially in cases
where the material studied is not easily distinguishable.

Having a fixed number of groups that depends on the number of
neurons predefined when determining the SOM configuration can
lead to some anomalies, such as the separation of some large groups
into two or more smaller subgroups and the merging of groups. In
addition, it should be noted that the rice genotypes evaluated in these
assays are in the final stage of breeding and are genetically close,
which may lead to some difficulty in allocating these genotypes.
However, in general, the organization patterns among the rice genotypes
evaluated by the maps are complementary to the UPGMA approach, as
observed in all scenarios.

When evaluating the map arrangement in different scenarios, it was
noted in Leopoldina that a three by two arrangement with hexagonal
topology preserved the organization of UPGMA groups, a fact that was not
observed in scenario five (two by three arrangement), where the
genotypes 23 and 24 were allocated to distant groups (Figure 2). In
Lambari, the three by two and two by three arrangements with the same
topology were superior to the others because they permitted more
connections among groups. These topologies were able to organize the
genotypes into groups closer to those obtained by UPGMA clustering
(Figure 3). In particular, genotypes of the same group in the UPGMA
remained in the same group or in linked groups in the SOM. However, with
a higher number of neurons, there is a possibility of greater
variation in the allocation of genotypes, as affirmed by Kohonen
(2014).
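
One simple way to express the comparison above in numbers is to count how many pairs of genotypes from the same UPGMA group end up in the same or in directly linked SOM groups. This measure was not used in this study. The sketch below is only a hypothetical illustration, with made-up genotype labels and positions, and it assumes a grid topology in which neurons are linked when they are adjacent in a row or column.

```python
from itertools import combinations

def is_linked(a, b):
    """Same neuron, or direct neighbors on a grid (assumed adjacency rule)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) <= 1

def preserved_fraction(upgma_group, som_node):
    """Share of same-UPGMA-group pairs placed in the same or linked neurons.

    upgma_group maps genotype -> UPGMA group label.
    som_node maps genotype -> (row, col) of its SOM neuron.
    """
    pairs = [(g, h) for g, h in combinations(sorted(upgma_group), 2)
             if upgma_group[g] == upgma_group[h]]
    if not pairs:
        raise ValueError("no pair of genotypes shares a UPGMA group")
    kept = sum(1 for g, h in pairs if is_linked(som_node[g], som_node[h]))
    return kept / len(pairs)

# Hypothetical allocation of four genotypes on a 3 x 2 map.
upgma = {1: "A", 2: "A", 3: "A", 4: "B"}
som = {1: (0, 0), 2: (0, 1), 3: (2, 1), 4: (1, 1)}
fraction = preserved_fraction(upgma, som)
```

A value close to one would indicate that the map keeps the UPGMA groups together or in neighboring neurons.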

The SOM method has been shown to be an efficient way of identifying
patterns of similarity, as shown by Mwasiagi (2011), who used the SOM
technique to distinguish cotton genotypes. This author concluded that
the method was efficient for separating fine yarns from coarser ones,
and that samples dispersed on the map would be outliers,
implying irregularity of the material. Smith and Ng (2003), studying the
SOM efficiency for organizing web pages through navigation patterns,
obtained a satisfactory result and concluded that the method can be
easily incorporated; however, it still needs to be developed for
large-scale applications. A similar conclusion was reached by Fritzke
(1994), who studied the map efficiency for supervised and unsupervised
learning.

In addition to the strong agreement obtained by the SOM, this
technique presented high complementarity to the conventional
statistical approach. Thus, the organization of genetic diversity
through self-organizing maps is efficient, and the SOM technique has the
potential to be useful for genetic diversity studies in breeding
programs.

Conclusion

Self-organizing maps have the potential to be useful for genetic
diversity studies in breeding programs.

Tilman, D., Balzer, C., Hill, J., & Befort, B. (2011). Global
food demand and the sustainable intensification of agriculture.
Proceedings of the National Academy of Sciences of the United States of
America, 108(50), 20260-20264. DOI: 10.1073/pnas.1116437108

Caption: Figure 2. Arrangement of the clustering scenarios
according to the SOM for Leopoldina. Genotypes belonging to the same
group according to the UPGMA method were identified on maps using equal
colors (i.e., numbers in blue, green, red, black, and purple represent a
specific UPGMA cluster). Numbers in parentheses refer to their
respective scenarios.

Caption: Figure 3. Arrangement of the clustering scenarios
according to the SOM for Lambari. Genotypes belonging to the same group
according to the UPGMA method were identified on maps using equal colors
(i.e., numbers in blue, green, red, black, and purple represent a
specific UPGMA cluster). Numbers in parentheses refer to their
respective scenarios.