Abstract

Investor trading networks are attracting growing attention in the financial market literature. In this paper, we propose three improvements to their analysis: information aggregation, transaction bootstrapping, and investor categorization. These components can be used individually or in combination. For information aggregation, we introduce a tractable multilayer aggregation procedure to integrate security-wise and time-wise information about investor category trading networks. We use transaction bootstrapping to capture the properties of the actual data generation process and to have a more robust statistical testing procedure. Investor categorization allows for inferring constant size networks and more observations for each node, which is important especially for less liquid securities.
We apply this procedure by analyzing a unique data set of Finnish shareholders during the period 2004–2009. We find that households play a central role in investor networks, as they have the most synchronized trading. Furthermore, we observe that the window size used for averaging has a substantial effect on the number of inferred relationships. Importantly, the use of our proposed aggregation framework is not limited to the field of investor trading networks; in fact, it can be used for different non-financial applications, with both observable and inferred relationships, spanning a number of different information layers.

An investor network is a representation of a real-world complex system where institutional and private investors indirectly interact with each other by trading or owning securities. In general, network science methods allow for analyzing and gaining a clearer understanding of the intricate relationships between the components of this system, and a key advantage of such an approach is that it allows for visualizing the resulting networks [11, 12]. However, estimating investor networks is not straightforward, as links between investors are not directly observable. Instead, a link represents the abstract similarity of a pair of investors in terms of trading behavior or portfolios. Therefore, the analysis requires investor-level transaction or portfolio data and an appropriate statistical inference method for inferring such networks from the data. Even though complex network methods have begun attracting attention to investor-level data[13], many methodological challenges remain, several of which we aim to address in this paper. For our analysis, we use data from a large shareholder registry to investigate the trading networks of different investor categories.

First, a main challenge in analyzing investor trading networks is accounting for multiple securities, which leads to a multilayer network representation. What if we wanted a simple network representation that contains statistically significant relationships over multiple securities? Ever-changing investor behavior makes it difficult to infer these relationships correctly. Most likely, performing network inference for a whole period will not reveal the whole picture, as localized relationships between investor categories occurring at different periods might be diluted when we look at longer horizons. At the same time, static networks inferred over a whole period do not provide information on how node relationships evolve over time. In order to analyze the varying associations between investor categories, we use a simple, window-based analysis to recover the time-evolving networks of investor category interactions. Moreover, given a sequence of network snapshots, one might want to summarize the most important reoccurring relationships over the whole period. Therefore, we propose a multilayer aggregation approach that can address this challenge and yield a network representation over multiple securities and/or estimation periods. We also consider the influence of window size on the resulting aggregated networks[14]. As we show, this approach allows for producing robust network structures using all of the transaction data over multiple securities and estimation periods without discarding a single transaction.

Second, we can think of investor trading as a data generation process that produces observations (transaction data) based on unobservable trading mechanisms. For example, trading algorithms have specific trading rules, and household investors with more or less intuitive trading strategies can have certain (stochastic) mechanisms, which are impossible to observe directly. The point is that the data set of observable transactions is just one realization of the underlying data generation process driven by certain mechanisms. Therefore, one might wonder which data sample to use for the network inference: all the transaction data together, or one or more sub-samples of the full set of trading data. In addition, in our case, the investor category consists of many investors, and we want to prevent cases where a couple of active investors or investors who trade large volumes overshadow the behaviors of other investors in the category. In our approach, we address these problems by performing the lowest resolution bootstrapping at the investor transaction level. An empirical demonstration shows that the results clearly differ between the conventional approach of using the full data set directly and our data bootstrapping approach.

Third, the transaction data for network inference suffers from a high-dimension, low-sample size problem[15], as the number of investors exceeds the number of trading days. Estimating investor networks based on trading similarities requires long observation periods and sufficient data for each investor [8]. Since the majority of household investors are rather inactive, only a fraction of investors (the active ones) can be included in the analysis. The exclusion of inactive investors leads to the description of a sub-system; therefore, the conclusions can be difficult to generalize at the market level. In this paper, we solve this problem by assigning investors to categories according to investor attributes that are available in the data set. Such a categorization allows us to reduce the number of variables in the system significantly, but we do not exclude data, as the categories contain aggregated data from the whole system. Importantly, this approach allows for considering inactive investors and less liquid stocks with fewer trading events. The size of such a categorized network remains the same over time, whereas the size of a network of individual investors can change over time, depending on the activeness of the investors. Since investor categories are based on real attributes, we can characterize the nature of each category. This makes the system interpretable in economic and sociological terms.

To demonstrate our multilevel aggregation approach, we use an investor-level transaction data set obtained from Euroclear Finland Ltd for our analysis. It includes transactions from 2004-01-01 to 2009-12-31 of all investors that traded stocks listed on the Nasdaq OMX Helsinki Exchange. Each transaction also contains meta-data about the investors (the same data set is used, for example, in refs. [tumminello2012identification, grinblatt2000investment, berkman2014informed], while ref. [ozsoylev2013investor] uses a similar data set of trades on the Istanbul Stock Exchange). In this data set, the attributes used to categorize investors include gender, year of birth, and postal code for households and sector code for institutions. These attributes allow us to define 110 investor categories on which our analyses are performed.

Our framework is based on several building blocks, inspired by the bagged conservative causal core (BC3NET) method [18, 19], originally introduced to infer gene regulatory networks based on genomics data. The blocks consist of the following:

Investor categorization, where all investors in the analyzed data set are assigned to only 110 categories based on their economic and social attributes. (Investors → Investor Categories)

Data bootstrapping, where the analyzed data is resampled into B data sets. The advantage of the bootstrap is that it does not require any assumptions about the data distribution and it addresses the issue of finite time series. (Dataset → Resampled Datasets)

Network inference, where we apply a chosen inference technique to identify edges between investor categories for each resampled data set to produce an ensemble of networks. Any network inference method[20, 21, 22, 23, 24] that produces or can be converted into binary, non-weighted networks can be applied. Our main method choice in the results section for network inference is the conservative causal core (C3NET)[18] algorithm, for its computational efficiency (see methods section for more details). (Dataset → Network)

Aggregation, where a network ensemble is aggregated to identify significant relationships that appear across the set of networks [25]. (Network Ensemble → Aggregated Network)
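The data bootstrapping block can be illustrated with a minimal sketch. It assumes that each transaction is a `(day, category, signed_volume)` tuple with positive volumes for buys and negative volumes for sells, and that resampling draws transactions with replacement from the whole set before daily net volumes are rebuilt. The function name, the record layout, and the category labels are hypothetical; the exact resampling scheme used in the paper may differ.

```python
import random
from collections import defaultdict


def bootstrap_net_volumes(transactions, n_boot, seed=0):
    """Resample transactions with replacement and rebuild daily net volumes.

    `transactions` is a list of (day, category, signed_volume) tuples
    (an assumed encoding, not the registry's actual format).
    Returns n_boot dicts mapping (category, day) -> net traded volume.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        # Draw as many transactions as the original data set contains.
        resampled = rng.choices(transactions, k=len(transactions))
        net = defaultdict(float)
        for day, category, volume in resampled:
            net[(category, day)] += volume
        samples.append(dict(net))
    return samples


# Toy data: two hypothetical categories trading over two days.
transactions = [
    (1, "Mature Households, Helsinki", 100.0),
    (1, "Mature Households, Helsinki", -40.0),
    (1, "Non-Financial, Helsinki", -60.0),
    (2, "Mature Households, Helsinki", 25.0),
    (2, "Non-Financial, Helsinki", -25.0),
]
samples = bootstrap_net_volumes(transactions, n_boot=100)
print(len(samples))
```

Each of the resampled data sets would then be passed to the network inference step, producing one network per bootstrap sample.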

The novelty of our approach is that we can aggregate networks in the manner displayed in Fig. 1 to capture trading relationships over multiple securities and periods. Overall, we have two different layers. The first layer indicates securities and the second one indicates time. Interestingly, there are two different ways to integrate over these variables, indicated by the blue and red arrows. We show in the following that the results highlight different characteristics of the data.

For each time step t and each security, we want to extract a network. These networks cannot be directly observed, but they are estimated using the transaction data set. By bootstrapping this data set, we generate B bootstrap data sets. Network inference is applied to each of these B data sets, resulting in an ensemble of B networks. The aggregation of the B networks results in one network, indicated by (1) (see Figure 1). Adjacent networks in the main matrix are similarly inferred for other time steps and securities. Each column is an ensemble of networks that contains information about trading relationships for different securities during the same time step, while each row is a network ensemble that contains information about trading relationships in individual securities over different time steps. In the following, we first describe the security-wise integration and then the time-wise integration.

Initially, we integrate the security-wise information for each time step contained in the columns. Network (2) represents an aggregated network for time step 1 over all securities. Repeating a similar analysis for each of the T different time steps results in further networks for the corresponding cases. To combine these T networks, we perform aggregation again, resulting in one final network, indicated by (3) in the figure. The blue arrows in the figure represent the aforementioned steps. Alternatively, one can perform a time-wise integration first in a similar way. This type of integration follows the red arrows in the figure and applies the aggregation method twice, because two integration steps are required: first over time for each security, giving networks such as (4), and then over securities. This leads to the final network indicated by (5).

Interestingly, even though the final networks (3) and (5) summarize the same information, because of the different aggregation order, the captured relationships might be different, as shown in the results section.
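A toy sketch can show why the aggregation order matters. Here each network is a set of edges, and the aggregation rule is simplified to a fixed occurrence count (`min_count`), which stands in for the statistically derived threshold used in the paper. The function names and the toy grid are illustrative assumptions.

```python
from collections import Counter


def aggregate(networks, min_count):
    """Keep edges that occur in at least min_count networks of the ensemble."""
    counts = Counter(edge for net in networks for edge in net)
    return {edge for edge, c in counts.items() if c >= min_count}


def security_then_time(grid, min_sec, min_time):
    """Blue path: aggregate over securities per time step, then over time."""
    n_steps = len(grid[0])
    per_step = [aggregate([row[t] for row in grid], min_sec) for t in range(n_steps)]
    return aggregate(per_step, min_time)


def time_then_security(grid, min_sec, min_time):
    """Red path: aggregate over time per security, then over securities."""
    per_security = [aggregate(row, min_time) for row in grid]
    return aggregate(per_security, min_sec)


# grid[s][t] is the edge set for security s at time step t.
# The edge A-B appears for security 0 at step 0 and for security 1 at step 1.
edge = ("A", "B")
grid = [
    [{edge}, set()],
    [set(), {edge}],
]
st = security_then_time(grid, min_sec=2, min_time=1)
ts = time_then_security(grid, min_sec=2, min_time=1)
print(st, ts)
```

In this example the edge never appears in two securities during the same time step, so the security-first path drops it, while the time-first path keeps it because it is present in both securities at some point in time.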

Figure 1: A set of networks in the main matrix containing two information layers, time and securities (networks adjacent to network (1)). Each of these networks is a result of applying the bagged C3NET algorithm for the corresponding data sets. Two information integration approaches are possible: one-layer integration of either securities (2) or time (4), for each period and security, respectively, or a multilayer approach, where both information layers are fully integrated, leading to networks (3) and (5).

Scientific literature investigating multilayer network aggregation is scarce, as multilayer networks themselves have only recently started to gain more attention [26, 27, 28], especially in the financial area [29, 30]. The paper that most closely resembles ours regarding its topic proposes an ensemble-based network aggregation [31] method that leverages the rank-product method [32] to improve the accuracy of gene network reconstruction. However, the algorithm is intended to integrate gene networks inferred using different methods and genomics data sets. Other trivial network ensemble aggregation procedures include maximum and mean rules[33]. Another recent paper[34] proposes a method for reducing the complexity of multilayer networks by aggregating the redundant layers while retaining the pertinent information about the whole system. In practice, the goal of their method is to combine similar layers and keep dissimilar layers apart. The objective of our research is different; we are looking for the most important relationships that span multiple layers, rather than keeping information about different layers.

The main contribution of this paper to the field is two-fold. First, in terms of investor network inference, we consider investor categories, instead of individual investors, and second, in terms of network aggregation, we propose the use of a tractable multilayer and multistep aggregation procedure. Hence, our approach is aimed at integrating an ensemble of networks, resulting in a network that captures the most significant consistencies in investor relationships over multiple time snapshots and many securities. Methodologically, this framework can be used for different non-financial applications, with various network estimation methods, even for observable networks, such as social networks[35], different communication channels[36, 37, 38], transportation[39], and co-authorship[40] networks, where network estimation is not needed.

In the next section, we present the results from the method application to our data set. We begin by applying our proposed techniques to single security networks. We investigate the impact of transaction bootstrapping on the network inference problem and compare a network inferred over the whole period to an aggregated network from a set of network snapshots. Next, we investigate multiple security networks. First, we use the aggregation technique to summarize information about trading in multiple securities and then we perform a two-layer aggregation, summarizing the information given by a series of network snapshots for a set of securities.

Results

In this section, we describe the network inference and aggregation process over single and multiple securities by performing the analysis over the whole period of analysis and multiple non-overlapping sub-periods. Mutual information (MI) values are estimated from daily net volume time series for each investor group pair. We choose the C3NET algorithm for network inference from the MI estimates; however, other methods can be used. Combined with transaction bootstrapping, our method closely resembles the bagged conservative causal core (BC3NET) approach, except that in our case, the sampling is performed at the lower, transaction level.
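To make the inference step concrete, the following sketch estimates MI between daily net volume series and then keeps, for each investor group, only its strongest link. It rests on several assumptions that are not taken from the paper: net volumes are discretized to their sign, MI is computed with a simple plug-in estimator, and a fixed cutoff (`mi_cutoff`) replaces the statistical significance test. It illustrates the maximum-MI selection idea behind C3NET rather than the exact procedure described in the methods section.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in MI (in nats) between two equally long discrete sequences."""
    n = len(x)
    pxy = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def sign(v):
    """Map a net volume to -1 (net seller), 0, or 1 (net buyer)."""
    return (v > 0) - (v < 0)


def max_mi_network(series, mi_cutoff):
    """For each node, keep only its highest-MI link if it exceeds mi_cutoff."""
    names = sorted(series)
    states = {k: [sign(v) for v in series[k]] for k in names}
    edges = set()
    for a in names:
        best, best_mi = None, mi_cutoff
        for b in names:
            if b == a:
                continue
            mi = mutual_information(states[a], states[b])
            if mi > best_mi:
                best, best_mi = b, mi
        if best is not None:
            edges.add(tuple(sorted((a, best))))
    return edges


# Toy daily net volumes: A and B trade in the same direction every day,
# while C is independent of both.
series = {
    "A": [5, -3, 2, -1, 4, -2, 2, -3],
    "B": [1, -1, 7, -2, 3, -6, 1, -1],
    "C": [2, 1, -4, -1, 3, 2, -5, -2],
}
edges = max_mi_network(series, mi_cutoff=0.1)
print(edges)
```

Because every node contributes at most one link, a network built this way has at most as many links as nodes.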

Single Security Networks

Network inference

We begin our results section by comparing networks inferred using the C3NET algorithm with and without transaction bootstrapping. We perform this comparison for the most liquid security on the Helsinki stock exchange, namely Nokia. By definition, C3NET allows for establishing as many links as there are nodes in the network, if each investor group has at least one statistically significant MI estimate with some other group. In our data set from 2004-01-01 to 2009-12-31, using C3NET, we infer 95 links. Interestingly, even after completing the categorization, some investor categories do not have a sufficient number of Nokia transactions to estimate relationships. For the bootstrapped version of network inference, we perform 100 transaction sampling iterations and form a network for each of them using the C3NET algorithm. The resulting ensemble of 100 networks contains 9212 links with 1255 different relationships. As a statistical null model for our ensembles, we choose the canonical Erdős–Rényi G(n,p) model, with a fixed number of nodes and an ensemble probability of a random link (see the methods section for more details). A fully connected ensemble would have n×(n−1)/2×B=5995×100 links; therefore, the probability of having a random link in the ensemble is estimated to be p=9212/599500=1.53×10^−2. By choosing a significance level of α=0.01 and adjusting it by the number of tests we perform (1255), we conclude that a relationship must be observed in at least 9 networks for it to be considered non-randomly occurring. The bootstrapped version identifies a total of 221 relationships that are statistically significant. Hence, the topology is no longer limited to one link per node. Almost all relationships from the non-sampled C3NET network (83 out of 95) are also found in the bootstrapped version.
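The threshold calculation can be sketched as a binomial tail test. The sketch assumes that, under the null model, the number of networks containing a given link follows a Binomial(B, p) distribution, that the significance level is Bonferroni-adjusted by the number of distinct relationships, and that the threshold is the smallest count k whose strict upper tail P(X > k) does not exceed the adjusted level. This last convention is an assumption chosen because it is consistent with the threshold of 9 reported above; the test defined in the methods section may be formulated differently. The function names are hypothetical.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X > k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))


def occurrence_threshold(total_links, possible_links, ensemble_size, n_tests, alpha=0.01):
    """Smallest k with P(X > k) <= alpha / n_tests (assumed convention)."""
    p = total_links / possible_links   # probability of a random link
    adjusted = alpha / n_tests         # Bonferroni-adjusted significance level
    for k in range(ensemble_size + 1):
        if upper_tail(k, ensemble_size, p) <= adjusted:
            return k
    return ensemble_size


n_nodes = 110
pairs = n_nodes * (n_nodes - 1) // 2   # 5995 possible links per network
nokia_threshold = occurrence_threshold(
    total_links=9212,
    possible_links=pairs * 100,
    ensemble_size=100,
    n_tests=1255,
)
print(nokia_threshold)
```

A useful property of this construction is that the threshold adapts to the density of the ensemble: denser ensembles raise p and therefore require more occurrences before a link is kept.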

The two networks are depicted in sub-plots (a) and (b) of Figure 2. Both networks identify the same nodes as most connected, and the four most connected nodes represent households. Specifically, the most connected node represents mature Helsinki households, followed by the same age group of western Tavastians, then middle-aged western Tavastians, and finally, mature northern Finnish households. The most connected non-household groups in the bootstrapped version are non-financial companies from western Tavastia, Ostrobothnia, and central Finland, with six relationships each. The most connected financial insurance group is from northern Savonia, with five relationships, followed by Helsinki, with four relationships in the bootstrapped version.

Figure 2: Four networks of investor group trading relationships in the Nokia security. Investor group positions are fixed in all four plots. Node sizes depend on node degrees in each network. The first network (a) is inferred using the C3NET algorithm on the original data set. The second network (b) is inferred by bagging C3NET. For the third (c) and fourth (d) networks, the whole six-year period is divided into 12 six-month sub-periods. For each of those 12 sub-periods, a C3NET network and a bagged C3NET network are inferred. Those 12 networks are then aggregated into a final network that covers the whole six-year period. Node color legend: Households; Financial and insurance companies; Other companies; Government institutions; Non-profit organizations; Rest-World.

Figure 3: The 62 most frequently reoccurring links in 12 Nokia networks estimated over non-overlapping 6-month periods. Legend: inferred relationship; no relationship.

Time-wise network aggregation

The third and fourth networks for the Nokia security in Figures 2 (c) and (d) are obtained by aggregating two 12-network ensembles inferred from non-overlapping 6-month periods covering the whole 6-year period analyzed. As in the previous section, we compare Nokia networks inferred with and without transaction bootstrapping. The number of relationships per network varies from 63 to 82 in the non-bootstrapped ensemble and from 201 to 240 in the bootstrapped ensemble. A total of 1582 different relationships are observed throughout the 12 networks in the transaction bootstrapped network ensemble and the total number of links in the ensemble is 2665, while in the non-bootstrapped version, the numbers of relationships and links are 685 and 870, respectively. Each network contains 110 nodes, and therefore, the total possible number of links in the ensemble is equal to 12×5995=71940 and the probabilities of having a random link are estimated to be p=2665/71940≈3.70×10^−2 and p=870/71940≈1.21×10^−2. Again, by choosing the statistical significance of α=0.01 and adjusting for the number of tests performed, a link must appear at least 5 times in order to be aggregated into the final network for the bootstrapped version and 3 times for the non-bootstrapped version. From Table 1, we see that in the bootstrapped version, 62 links appear at least 5 times in the 12 networks, and Figure 3 shows the link occurrence in the ensemble. In the latter figure, we can see that some relationships are accumulated in consecutive periods while others are more scattered over time.

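| Number of occurrences | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Links | 1 | 4 | 2 | 4 | 6 | 9 | 15 | 21 | 46 | 107 | 375 | 992 |
| Cumulative | 1 | 5 | 7 | 11 | 17 | 26 | 41 | 62 | 108 | 215 | 590 | 1582 |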

Table 1: Number of link occurrences in the Nokia ensemble inferred over non-overlapping six-month periods using the bootstrapped version of C3NET. We can see that only one link appears in all 12 networks, while 992 links appear only once.

Figure 4: Link overlap in variously inferred networks for Nokia. 6y. is the network inferred using transaction bootstrapping over the whole period. All the other networks are inferred on shorter windows (1, 2, 3, 4, 6, 12, 24 months) and then aggregated into a network that covers the whole period under analysis. From the figure, we can observe that most of the relationships inferred over the longer observation windows are also found in the shorter window analyses.

From Figure 4, we can see that 46 links overlap with the bootstrapped version of C3NET for the whole period under analysis. Further, for the non-bootstrapped version, 30 relationships are inferred after time-wise aggregation. Of those 30 relationships, 26 also appear in the bootstrapped version. All nodes that have relationships are households, except for two non-financial investor groups. A visual inspection of all four networks in Figure 2 reveals that the most important set of nodes in both networks inferred from the whole transaction data set is also identified as central in networks aggregated from various time window analyses.

Multiple Security Networks

Security-wise aggregation

Here, we aim to incorporate information about investor group trading relationships in 100 securities over the whole 6-year period. We start by inferring the bootstrapped version of the C3NET network for each security. The number of inferred relationships across different securities ranges from 97 to 287, while the total number of detected relationships in the ensemble is 3388. Subsequently, for the ensemble of 100 security networks, we apply the same aggregation procedure as before. From the observed number of links in the ensemble and the total possible number of links in a fully connected ensemble of this size, we estimate the probability of random links to be p=22082/599500=3.68×10^−2. Then, for a significance level of α=0.01, we apply a Bonferroni adjustment for 3388 tests and obtain a threshold of 15 link occurrences in the ensemble, which leaves 315 links in the aggregated network. Households represent the majority of groups with relationships over multiple securities. Furthermore, the two most central nodes are the mature and middle-aged household investor groups from Helsinki, with 52 and 38 relationships, respectively. The two most central non-household investor groups are financial and non-financial companies in Helsinki, both with 13 relationships to other investor groups.

Two-level aggregation

Figure 5 panel titles: (a) Security-wise ⇒ Time-wise ⇒ (3); (b) Time-wise ⇒ Security-wise ⇒ (5).

Figure 5: Networks summarizing investor group trading similarities in 100 securities over 6 years. The starting point for both networks is a set of 1200 networks inferred for each security over 12 six-month, non-overlapping periods. The difference between the networks comes from the aggregation order. The first network is first aggregated security-wise and then time-wise, while the second network is aggregated in reverse order. Network (3) in Figure 1 represents network (a), while network (5) in the same figure represents network (b). Node color legend: Households; Financial and insurance companies; Other companies; Government institutions; Non-profit organizations; Rest-World.

In this section, we leverage the previously introduced time-wise and security-wise network aggregation procedures. Our goal is to produce a single network that can summarize the trading relationship information inferred for 100 securities over multiple time windows of various sizes. We investigate networks inferred over seven different non-overlapping time windows, namely 1, 2, 3, 4, 6, 12, and 24 months. Each security respectively has 72, 36, 24, 18, 12, 6, and 3 such networks, covering the whole 6-year period under analysis. Our starting point is a set of network ensembles inferred using the bootstrapped C3NET algorithm for 100 securities for all analyzed time window sizes. For instance, in the case of the 6-month window, we have 12 networks for each of the 100 securities, that is, an ensemble of 12×100=1200 networks (corresponding to the networks in the main matrix of Figure 1). We must also keep in mind that the aggregated network will differ depending on the order of information aggregation, that is, on whether the time-wise or the security-wise relationship information is summarized first. Accordingly, we describe the results of using both approaches and compare the final results. By performing the time-wise aggregation first, we end up with a 100-network ensemble, with one network for each security. Links in each network represent the most important reoccurring relationships in the corresponding securities. Conversely, if we start with security-wise aggregation, we end up with an ensemble of 12 networks. Each of the 12 networks contains the most important relationships that are present over multiple securities, but this might be a different set of securities in each period. Next, for the two ensembles stemming from the first aggregation procedure, we perform the final aggregation, yielding a network summarizing the relationships of investor groups in their trading behavior over 100 securities for the whole period under analysis.
However, the two final networks are not the same (see the networks in Figure 5). Table 2 compares the links and nodes in the final networks for various window sizes. For each of the seven time windows, we obtain two networks, depending on the order of the aggregation procedure; thus, together with the security-wise aggregated network for the whole period from the previous section, we compare 15 networks. Figure 6 summarizes the node degrees in all 15 final networks. Node degree sequences are highly correlated, with Spearman's correlation ranging from 0.84 to 0.99. Similar to the whole period security-wise aggregated network, the networks in Figure 5 identify mature and middle-aged household investor groups from Helsinki as the most central groups, while financial and non-financial company investor groups from Helsinki are the most central non-household investor groups.

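| Window size (months) | Nodes ST∖TS | Nodes ST∩TS | Nodes TS∖ST | Nodes Jaccard | Links ST∖TS | Links ST∩TS | Links TS∖ST | Links Jaccard |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 36 | 4 | 0.9000 | 2 | 335 | 87 | 0.7900 |
| 2 | 0 | 42 | 2 | 0.9545 | 8 | 339 | 33 | 0.8921 |
| 3 | 1 | 43 | 1 | 0.9555 | 65 | 314 | 4 | 0.8198 |
| 4 | 2 | 45 | 0 | 0.9574 | 67 | 304 | 3 | 0.8128 |
| 6 | 3 | 46 | 0 | 0.9387 | 98 | 275 | 3 | 0.7313 |
| 12 | 5 | 47 | 0 | 0.9038 | 150 | 221 | 6 | 0.5862 |
| 24 | 6 | 44 | 1 | 0.8627 | 101 | 132 | 17 | 0.5280 |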

Table 2: Summary of node and link overlap in the final networks for various window sizes. ST stands for the network where the first aggregation layer is security-wise (network (3) in Figure 1) and TS is the network where the first aggregated layer is time-wise (network (5) in Figure 1).

Figure 6: Node degree comparison in final networks aggregated over 100 securities and non-overlapping time windows covering the whole 6-year period. Labels have the form T{W}_{O}, where W stands for the window size and O stands for the aggregation order, either security-wise or time-wise first. The row with the 6 y. inference strategy stands for the network aggregated from networks that were inferred using the whole six-year data set for each security.
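The Jaccard columns of Table 2 follow directly from the three overlap counts in each row. The short sketch below shows the calculation both from counts and from explicit edge sets; the function names are illustrative. The tabulated values appear to be truncated rather than rounded to four decimals, so small differences in the last digit are to be expected.

```python
def jaccard_from_counts(only_st, both, only_ts):
    """Jaccard index |ST ∩ TS| / |ST ∪ TS| from the three overlap counts."""
    return both / (only_st + both + only_ts)


def jaccard(a, b):
    """Jaccard index of two sets (e.g. edge sets of two final networks)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Link columns of Table 2 for the 24-month window: 101, 132, 17.
links_24m = jaccard_from_counts(101, 132, 17)
# Node columns of Table 2 for the 1-month window: 0, 36, 4.
nodes_1m = jaccard_from_counts(0, 36, 4)
print(links_24m, nodes_1m)
```

The falling link-level values for the longer windows indicate that the two aggregation orders disagree more on individual links as the window grows, even though the node-level overlap stays high.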

Discussion

In this paper, we proposed some approaches to help circumvent the most common obstacles in investor network analysis. First, we extended the bootstrap aggregation approach to ensembles containing information about trading behavior in different securities and/or time windows. The advantage of the aggregation approach is that no arbitrary link-filtering threshold is needed. Instead, the algorithm adjusts this itself depending on a chosen significance level and the properties of the investigated network ensemble. We found that time-wise aggregated networks and networks inferred over the whole period significantly differed in the number of relationships inferred and the number of nodes having relationships. However, a similar set of nodes was identified as central in both cases. Security-wise aggregation revealed the investor category trading network not over one but over multiple securities. It is important to remember that the two-layer aggregation yielded different network descriptions depending on the order of information aggregation. It is worth mentioning that the aggregation of time-wise and security-wise trading relationships could be performed in a single step, in which case there would be no confusion about the aggregation order. However, in that case, the meaning of network relationships would be obscure. We would be neither certain that investor categories were similarly trading over a significant number of the same securities nor that they were trading similarly over a significantly large number of the same periods; further, the definition of a single step aggregation would be somewhere in-between, in some cases perhaps failing to meet both criteria.

Second, to the best of our knowledge, we are the first to propose the use of lowest-resolution (that is, transaction-level) bootstrapping as the means for statistically validating investor network relationships. Transaction bootstrapping also enables network inference over shorter time windows. Networks inferred at different time points can provide insight into the dynamics of these relationships. Most of the research has been focused on inferring static or time-invariant investor networks, and much less has been done to infer the dynamic relationships that are constantly evolving over time. Indeed, over the course of time, multiple interchanging processes may determine the behavior of investor categories, and such processes can be dynamic and stochastic. Therefore, investor behavior at each time point is dependent on these processes, and investor networks can undergo significant topological changes, rather than being invariant over time. Using our proposed network aggregation procedure, these network snapshots can be summarized into a single static network that covers the most important information for the whole period. Transaction bootstrapping is a viable strategy for network inference because it not only allows for assigning statistical significance to link existence but also enhances the robustness of the relationships to specific realizations of the trading outcome.

Finally, we introduced the grouping of investors into categories based on their attributes. This approach allows an analysis to be performed while discarding less information. Investor category networks based on investor attributes have not been investigated previously in the literature. The vulnerability of the investor categorization approach is that the ensuing analysis is ultimately dependent on the category definition. In practice, it is possible that investor transaction data sets do not contain meta-data about the investors; in that case, it would be impossible to assign investors to categories, or the resulting categories would be economically meaningless or difficult to interpret. One can then revert to the analysis of individual investors and use the multilayer aggregation procedure without a loss of generality.

In the results section, we observed that Helsinki households represented the most connected investor category, and this category, thus, has a central role in financial markets in terms of trading behavior. The central role of household investors has been identified in the literature[41, 42, 43, 44]. For example, according to [kaniel2008individual], households are contrarian traders (i.e., they sell when stock prices have increased and buy when prices have decreased), leading them to serve as liquidity providers to institutional investors. The contrarian nature of households is also identified in [grinblatt2000investment] using the same data set that we used in this paper.

This method can be applied in different, even non-financial fields, in order to extract the most important reoccurring relationships in multilayer networks.

Methods

Dataset. We use an investor-level transaction data set covering all trades executed on the Helsinki Stock Exchange from 2004-01-01 to 2009-12-31. The data set is composed of transactions belonging to 489,245 investors trading in 100 securities over 6 years. The analyzed security list includes the top 100 securities ranked by number of investors and transactions. Each investor in the data set is assigned to a sector group: Financial and Insurance, Government, Non-Financial, and Non-Profit companies, as well as Foreign investors and Finnish Households. Households are further divided into five age groups: Under-Aged (0,18], Young (18,30], Middle-Aged (30,50], Mature (50,64], and Retired (64,+∞). Age attributes are derived for each transaction separately, taking into account the difference between the transaction date and the year of birth of the corresponding investor. All of these groups are also distributed geographically by assigning investor postal codes to 11 regions using Table 3. Together, these assignment rules, shown in Figure 7, form 110 investor categories (10 sector and age groups in each of the 11 regions).

The data that support the findings of this study are available from Euroclear Finland Ltd. However, the data are not available from the authors because of the non-disclosure agreement signed with the data provider.

Figure 7: Each investor category has two attributes (geographical location and sector code), except for households, which also have the age attribute, which is calculated according to the year of birth and transaction date attributes. The sector code provides information about economic and sociological factors and the coloring is equivalent to that of the nodes in Figures 2 and 5. Investor categories are formed by combining sector code (and age for households) with geographical attributes; for example, the Financial and Insurance in Helsinki investor category or Retired Households in South East Finland.

Region              Postal code range
Helsinki            [0, 3000)
Rest-Uusimaa        [3000, 11000)
Eastern-Tavastia    [11000, 20000)
South-West          [20000, 30000)
Western-Tavastia    [30000, 40000)
Central-Finland     [40000, 50000)
South-East          [50000, 60000)
Ostrobothnia        [60000, 70000)
Northern-Savonia    [70000, 80000)
Eastern-Finland     [80000, 90000)
Northern-Finland    [90000, 100000)

Table 3: Postal code mapping to regions

Transaction bootstrapping. For network inference, we perform $B$ bootstrap iterations. For each bootstrap iteration, we uniformly re-sample with replacement the whole transaction data set under investigation. Then, for each sampled transaction set, we aggregate daily transaction records for each category, resulting in a net traded volume matrix $N^b$, where $b \in \{1, \dots, B\}$ and the entry $w^b_{ij}$ is the net traded volume of investor group $j$ on day $i$.
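A minimal sketch of this resampling step is given below. It assumes that each transaction is stored as a day index, an investor group index, and a signed volume (positive for buys, negative for sells); this encoding and the function names are illustrative assumptions.

```python
import numpy as np


def net_volume_matrix(days, groups, volumes, n_days, n_groups):
    """Sum signed volumes into a (n_days, n_groups) net traded volume matrix."""
    N = np.zeros((n_days, n_groups))
    # np.add.at accumulates correctly when (day, group) pairs repeat.
    np.add.at(N, (days, groups), volumes)
    return N


def bootstrap_net_volumes(days, groups, volumes, n_days, n_groups, B, seed=0):
    """Return an array of shape (B, n_days, n_groups), one matrix per bootstrap."""
    rng = np.random.default_rng(seed)
    n = len(volumes)
    matrices = []
    for _ in range(B):
        # Uniform resampling of whole transactions, with replacement.
        idx = rng.integers(0, n, size=n)
        matrices.append(
            net_volume_matrix(days[idx], groups[idx], volumes[idx],
                              n_days, n_groups))
    return np.stack(matrices)
```

Resampling whole transactions keeps the date, group, and volume of each trade together, so the dependence structure of the original data is preserved in every replicate.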

Mutual information estimation. For simplicity, we assume that the joint distribution of net traded volumes is normal. Then we can calculate the MI analytically from Pearson's correlation coefficient $\rho$ using

$$I(i,j) = -\frac{1}{2}\log\left(1-\rho^{2}\right) \qquad (1)$$
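Equation (1) can be evaluated for all pairs of investor groups at once from the correlation matrix of the net traded volumes. The sketch below assumes a matrix with days in rows and groups in columns; the function name and the small clipping constant that keeps the logarithm finite are illustrative choices.

```python
import numpy as np


def gaussian_mi_matrix(N):
    """Pairwise MI under a bivariate normal assumption, following Eq. (1)."""
    # Columns are variables (investor groups), rows are observations (days).
    rho = np.corrcoef(N, rowvar=False)
    np.fill_diagonal(rho, 0.0)  # self-information is not used
    # Clip to avoid log(0) for perfectly correlated columns.
    rho2 = np.clip(rho ** 2, 0.0, 1.0 - 1e-12)
    return -0.5 * np.log1p(-rho2)
```

Note that a group with no trading in the window has a constant column, for which the correlation is undefined; such columns would need to be handled before calling this function.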

Network inference. Using the net traded volume data N, we apply a network inference method. A specific requirement for such a method is that it is computationally efficient for handling a large bootstrap ensemble. For this reason, we use the C3NET[18] inference method. C3NET is intended to infer a significant maximum MI network. This algorithm comprises three basic steps. First, MI values are estimated for each investor category pair. Second, each MI value estimate is tested against a null hypothesis of vanishing MI. Finally, each investor group is allowed to keep a single link, which is the strongest statistically significant MI value. The resulting binary network has at most M relationships in a system of M nodes.
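The final step, in which each investor group keeps only its strongest significant link, can be sketched as follows. This is not the reference C3NET implementation; it assumes that the MI matrix and a Boolean matrix of test outcomes have already been computed, and the function name is hypothetical.

```python
import numpy as np


def strongest_significant_links(mi, significant):
    """Keep, for every node, its single strongest significant MI link.

    mi          : (M, M) symmetric matrix of MI estimates
    significant : (M, M) Boolean matrix, True where H0 was rejected
    Returns a symmetric binary adjacency matrix with at most M edges.
    """
    M = mi.shape[0]
    # Non-significant pairs and self-pairs can never be selected.
    masked = np.where(significant, mi, -np.inf)
    np.fill_diagonal(masked, -np.inf)
    adj = np.zeros((M, M), dtype=int)
    for i, j in enumerate(masked.argmax(axis=1)):
        if np.isfinite(masked[i, j]):  # node i has at least one candidate
            adj[i, j] = adj[j, i] = 1
    return adj
```

Since every node contributes at most one edge and mutual choices coincide, the number of edges cannot exceed the number of nodes.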

Null distribution of mutual information values. In order to test the statistical significance of the MI estimates, we need to procure an appropriate null distribution. Therefore, we test the following null hypothesis:

$H_0$: The MI between investor groups $i$ and $j$ is zero.

For each transaction, we resample dates, traded volumes, and categories to which those transactions are assigned, eliminating any relationship between them. Then we aggregate daily transaction records for each category, resulting in a net traded volume matrix $\tilde{N}^b$. We do this multiple times and each time we estimate MI values for pairs of investor groups. These values result in an estimate of the null distribution, which we use to find statistically significant MI values.
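One way to build such a null distribution is sketched below. It assumes that the resampling is done by independently permuting the three transaction attributes, that null MI values are pooled over all pairs, and that significance is read off as an empirical p-value; these are illustrative assumptions, as are the function names.

```python
import numpy as np


def null_mi_samples(days, groups, volumes, n_days, n_groups, n_rounds, seed=0):
    """Pooled MI values obtained after breaking all attribute dependencies."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices(n_groups, k=1)
    samples = []
    for _ in range(n_rounds):
        # Independent permutations decouple dates, groups, and volumes.
        d = rng.permutation(days)
        g = rng.permutation(groups)
        v = rng.permutation(volumes)
        N = np.zeros((n_days, n_groups))
        np.add.at(N, (d, g), v)
        with np.errstate(invalid="ignore", divide="ignore"):
            rho = np.corrcoef(N, rowvar=False)
        # Undefined correlations (constant columns) are treated as zero.
        rho2 = np.clip(np.nan_to_num(rho[iu]) ** 2, 0.0, 1.0 - 1e-12)
        samples.append(-0.5 * np.log1p(-rho2))
    return np.concatenate(samples)


def empirical_p_value(mi_value, null_samples):
    """Share of null MI values at least as large as the observed one."""
    return (1 + np.sum(null_samples >= mi_value)) / (1 + len(null_samples))
```

The add-one correction keeps the empirical p-value strictly positive, so the smallest attainable p-value is bounded by the number of null samples.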

Multiple hypothesis test correction (MTC). In order to control the family-wise error rate, we leverage the strict Bonferroni MTC procedure. We use MTC at each stage of aggregation when testing edge occurrences in the ensembles. Following the Bonferroni procedure, we adjust the chosen significance level by the number of tests we perform: $\alpha_{\text{adjusted}} = \alpha / n_{\text{tests}}$.
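As a purely illustrative calculation, suppose that every unordered pair among the 110 investor categories is tested once and that the nominal level is $\alpha = 0.05$. Both the level and the choice of one test per pair are assumptions made for this example only.

```latex
n_{\text{tests}} = \binom{110}{2} = \frac{110 \cdot 109}{2} = 5995,
\qquad
\alpha_{\text{adjusted}} = \frac{0.05}{5995} \approx 8.3 \times 10^{-6}
```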

Aggregation. Following de Matos Simoes and Emmert-Streib (2012), the aggregation procedure takes an ensemble of $N$ independent undirected binary networks $\{G_k\}_{k=1}^{N}$ as an input and gives a single network $G$ as an output. First, the network ensemble is aggregated into a weighted network, $\{G_k\}_{k=1}^{N} \rightarrow G_w$. The edge weights in the weighted network $G_w$ correspond to the number of occurrences of each edge in the ensemble. For example, the weight of an edge between investor groups $i$ and $j$ is defined as $n_{ij} = G_w(i,j) = \sum_{k=1}^{N} G_k(i,j)$, where $n_{ij}$ may assume integer values between $0$ and $N$. Next, we conduct a statistical hypothesis test to remove the need for an arbitrary link threshold parameter:

$H_0^{n_{ij}}$: The number of networks $n_{ij}$ in the ensemble with an edge between $i$ and $j$ is less than $n_0(\alpha)$, where $\alpha$ is the significance level.

If we define $p$ as the probability of two investor groups being randomly connected, then $n_{ij}$ follows a binomial distribution, $B(p, N)$. Then

$$p_{ij} = P(n \ge n_{ij}) = \sum_{n=n_{ij}}^{N} \binom{N}{n} p^{n} (1-p)^{N-n}$$

is the probability of observing by chance the link between investor groups $i$ and $j$ at least $n_{ij}$ times. The nodes in the final network $G$ are connected if $p_{ij} < \alpha$, where $\alpha$ is the significance level.

We estimate the probability $p$ that two groups are connected by chance in an ensemble of $N$ networks as the ratio of the actual number of edges in the ensemble, $\sum_{i>j,k} G_k(i,j)$, to the number of all possible links in the ensemble, $N \times n(n-1)/2$, where $n$ is the number of investor groups.
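The aggregation step can be sketched as follows, using only NumPy and the standard library. The function names are hypothetical, and taking the number of Bonferroni tests to be the number of node pairs is an assumption; the source states only that the correction is applied at each aggregation stage.

```python
import math
import numpy as np


def binom_tail(k, N, p):
    """P(n >= k) for n ~ Binomial(N, p)."""
    return sum(math.comb(N, m) * p ** m * (1 - p) ** (N - m)
               for m in range(k, N + 1))


def aggregate(ensemble, alpha=0.05):
    """Aggregate N binary symmetric networks into one binary network."""
    G = np.asarray(ensemble)
    N, n, _ = G.shape
    counts = G.sum(axis=0)                   # edge weights n_ij of G_w
    iu = np.triu_indices(n, k=1)
    # Chance connection probability: observed edges / possible edges.
    p = float(counts[iu].sum()) / (N * n * (n - 1) / 2)
    alpha_adj = alpha / len(iu[0])           # Bonferroni (assumed test count)
    out = np.zeros((n, n), dtype=int)
    for i, j in zip(*iu):
        if binom_tail(int(counts[i, j]), N, p) < alpha_adj:
            out[i, j] = out[j, i] = 1
    return out
```

An edge that never occurs has a tail probability of one and is therefore never retained, while an edge that recurs far more often than the ensemble-wide density suggests survives the test.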

Multilayer aggregation procedure. For a set of $S$ securities and a number $T$ of inference periods, we infer $S \times T$ networks $\{G_{st}\}$. Each network represents significant relationships between investor groups for different securities at different periods. If we then apply the network aggregation procedure over securities for each period $t$,

$$\{G_{st}\}_{s=1}^{S} \xrightarrow{\text{aggregation}} G_t,$$

we end up with an ensemble of networks $\{G_t\}_{t=1}^{T}$. Each of the networks $G_t$ represents significant relationships between investor groups that occur over multiple securities during period $t$. Similarly, if we apply the network aggregation procedure over time for each security $s$,

$$\{G_{st}\}_{t=1}^{T} \xrightarrow{\text{aggregation}} G_s,$$

we end up with an ensemble of networks $\{G_s\}_{s=1}^{S}$, where each of the networks $G_s$ represents the most important relationships between investor groups that reoccur over time in security $s$.

Next, we aggregate the second layer of information: $\{G_t\}_{t=1}^{T} \xrightarrow{\text{aggregation}} \tilde{G}$ and $\{G_s\}_{s=1}^{S} \xrightarrow{\text{aggregation}} \hat{G}$, respectively. Both aggregation sequences lead to unique networks.
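The two aggregation sequences can be sketched as follows for a four-dimensional array of binary networks indexed by security and period. The array layout, the function names, and the Bonferroni test count (the number of node pairs) are assumptions made for illustration.

```python
import math
import numpy as np


def aggregate(ensemble, alpha=0.05):
    """Binomial-test aggregation of an (N, n, n) stack of binary networks."""
    G = np.asarray(ensemble)
    N, n, _ = G.shape
    counts = G.sum(axis=0)
    iu = np.triu_indices(n, k=1)
    p = float(counts[iu].sum()) / (N * n * (n - 1) / 2)
    alpha_adj = alpha / len(iu[0])  # assumed number of tests
    out = np.zeros((n, n), dtype=int)
    for i, j in zip(*iu):
        k = int(counts[i, j])
        tail = sum(math.comb(N, m) * p ** m * (1 - p) ** (N - m)
                   for m in range(k, N + 1))
        if tail < alpha_adj:
            out[i, j] = out[j, i] = 1
    return out


def multilayer_aggregate(G_st, alpha=0.05):
    """G_st has shape (S, T, n, n). Returns (G_tilde, G_hat)."""
    S, T = G_st.shape[:2]
    # First layer: over securities for each period, and over time per security.
    G_t = np.stack([aggregate(G_st[:, t], alpha) for t in range(T)])
    G_s = np.stack([aggregate(G_st[s], alpha) for s in range(S)])
    # Second layer: aggregate the remaining dimension.
    return aggregate(G_t, alpha), aggregate(G_s, alpha)
```

The two outputs need not coincide, since a relationship that recurs across securities within each period is tested differently from one that recurs over time within each security.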

Tumminello, M., Aste, T., Di Matteo, T. & Mantegna, R. N. A tool for filtering information in complex systems. Proceedings of the National Academy of Sciences of the United States of America 102, 10421–10426 (2005).
