Abstract

Regression analytics is the standard approach to modeling the relationship between input and output variables, and recent trends aim to incorporate advanced regression analytics capabilities within data management systems (DMS). Linear regression queries are fundamental to exploratory analytics and predictive modeling. However, computing their exact answers leaves much to be desired in terms of efficiency and scalability. We contribute a novel predictive analytics model and an associated statistical learning methodology that are efficient, scalable, and accurate in discovering piecewise linear dependencies among variables by observing only regression queries and their answers issued to a DMS. We focus on in-DMS piecewise linear regression, specifically on predicting the answers to mean-value aggregate queries, identifying and delivering the piecewise linear dependencies between variables to regression queries, and predicting the dependent data variables within specific data subspaces defined by analysts and data scientists. Our goal is to discover, only through query–answer pairs, a piecewise linear approximation of the underlying data function that is competitive with the best piecewise linear approximation to the ground truth. Our methodology is analyzed, evaluated, and compared with exact solutions and near-perfect approximations of the underlying relationships among variables, achieving orders of magnitude improvement in analytics processing.

A major challenge in PMA is to model and learn the very local statistical information of the data subspaces analysts are interested in, e.g., local regression coefficients and local data approximation functions, and then to extrapolate such knowledge to predict this information for unexplored data subspaces [53]. Based on this abstraction of PMA, which applies widely to the aforementioned real-life applications, we focus on two important predictive analytics queries for in-DMS analytics: mean-value queries and linear regression queries.

Example 1

Consider the running example in Fig. 2. Seismologists issue a mean-value query Q1 over a 3-dim. space \((u,x_{1},x_{2}) \in \mathbb {R}^{3}\), which returns the mean value y of the feature u (seismic signal; P-wave speed) over those spatial points whose projections \((x_{1},x_{2}) \in \mathbb {D}(\mathbf {x}_{0},\theta ) \subset \mathbb {R}^{2}\) (surface longitude and latitude) lie within a disk of center \(\mathbf {x}_{0}\) and radius \(\theta \). The query Q1 is central to PMA because the average y serves as a linear sufficient statistic for the data subspace \(\mathbb {D}\), and it is the best linear predictor of the seismic signal output u over the region identified around the center point, i.e., for \((x_{1},x_{2}) \in \mathbb {D}(\mathbf {x}_{0},\theta )\) [32].
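To make Q1 concrete, here is a minimal stdlib-Python sketch of its evaluation; the in-memory list of `(x1, x2, u)` tuples and the name `mean_value_query` are illustrative stand-ins for the DMS relation and its query interface, not the paper's implementation.

```python
# A sketch of the mean-value query Q1: average the feature u over the points
# whose (x1, x2) projection falls inside the disk D(x0, theta).
import math

def mean_value_query(points, x0, theta):
    """points: iterable of (x1, x2, u); returns mean u within the disk, or None."""
    selected = [u for (x1, x2, u) in points
                if math.hypot(x1 - x0[0], x2 - x0[1]) <= theta]
    if not selected:
        return None
    return sum(selected) / len(selected)

# Example: three points, two of which fall inside the unit disk around the origin.
pts = [(0.1, 0.0, 2.0), (0.0, 0.2, 4.0), (5.0, 5.0, 100.0)]
y = mean_value_query(pts, (0.0, 0.0), 1.0)  # -> 3.0
```

Note that the selection predicate (the disk membership test) must touch every tuple, which is exactly the data-access cost the rest of the paper seeks to avoid.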

A linear regression query Q2 calculates the coefficients of a linear regression function within a defined data subspace. For example, in Fig. 2, consider geophysicists issuing queries Q2 over a 3-dim. space \((u,x_{1},x_{2}) \in \mathbb {R}^{3}\), which return the seismic primary-wave (P-wave) velocity u-intercept (\(b_{0}\)) and the coefficients \(b_{1}\) and \(b_{2}\) for \(x_{1}\) (longitude) and \(x_{2}\) (latitude), where the \(\mathbf {x} = [x_{1},x_{2}]\) points belong to a subspace \(\mathbb {D}(\mathbf {x}_{0}, \theta ) \subset \mathbb {R}^{2}\). By estimating the linear coefficients, e.g., the parameter row vector \(\mathbf {b} = [b_{0}, b_{1}, b_{2}]\), we can then interpret the relationships among the features \(\mathbf {x}\) and u and assess the statistical significance of each feature of \(\mathbf {x}\) within \(\mathbb {D}(\mathbf {x}_{0},\theta )\). The output of the Q2 query captures the dependency of u on \(\mathbf {x}\), which in our example is approximated by a 2-dim. plane \(u \approx b_{0} + b_{1}x_{1} + b_{2}x_{2}\), and quantifies how well the local linear model fits the data.

The result is the intercept \(b_{0}\) and regression coefficients \(\mathbf {b} = [b_{1}, b_{2}]\).

To evaluate queries Q1 and Q2, the system must access the data to establish the data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\), then take the average value of u in that subspace for query Q1 (e.g., the average seismic signal speed in the San Andreas, CA, region) and invoke a multivariate linear regression algorithm [32] for Q2. Queries of types Q1 and Q2 are provided by all modern PMA systems like Spark analytics [47] and MATLAB, and by DMS systems, e.g., XLeratorDB for Microsoft SQL Server and Oracle UTL_NLA.
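As a rough sketch of what such a system does for Q2, the following stdlib-only Python fits \(u \approx b_{0} + b_{1}x_{1} + b_{2}x_{2}\) by ordinary least squares over the points selected by the disk predicate, solving the normal equations \((A^{\top }A)\mathbf {b} = A^{\top }\mathbf {u}\) with a small Gaussian elimination; all names are illustrative, not a DMS API.

```python
# Sketch of evaluating Q2: select the points inside D(x0, theta), then fit
# u ~ b0 + b1*x1 + b2*x2 via the normal equations (A^T A) b = A^T u.
import math

def solve(M, v):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(M)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        b[r] = (A[r][n] - sum(A[r][c] * b[c] for c in range(r + 1, n))) / A[r][r]
    return b

def regression_query(points, x0, theta):
    """Returns [b0, b1, b2] of the OLS plane over the disk D(x0, theta)."""
    rows = [(1.0, x1, x2, u) for (x1, x2, u) in points
            if math.hypot(x1 - x0[0], x2 - x0[1]) <= theta]
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atu = [sum(r[i] * r[3] for r in rows) for i in range(3)]
    return solve(AtA, Atu)

# Noiseless points drawn from the plane u = 1 + 2*x1 - 3*x2 are recovered exactly.
pts = [(x1 / 4.0, x2 / 4.0, 1 + 2 * x1 / 4.0 - 3 * x2 / 4.0)
       for x1 in range(-3, 4) for x2 in range(-3, 4)]
b0, b1, b2 = regression_query(pts, (0.0, 0.0), 2.0)
```

Again, every tuple in the subspace is touched once to build the normal equations, which is what makes exact per-query evaluation expensive on large subspaces.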

Remark 1

Please refer to Table 2 in “Appendix” for a table of notations and symbols used in this paper.

1.1 Desiderata

We focus on in-DMS analytics with PMA using models and algorithms for the query types Q1 and Q2. The aim is to meet the following desiderata, answering these questions:

D1 Are there linear dependencies among dimensions in unexplored data subspaces and, if so, which subspaces are they?

D2 If there are data subspaces, where linear approximations fit well with high confidence, can the system provide these yet unknown linear regression models efficiently and scalably to the analysts?

D3 If in some subspaces linear approximations do not fit well w.r.t. analysts’ needs, can the system provide well-fitting models through piecewise local linear approximations?

D4 A solution must meet scalability, efficiency, and accuracy desiderata as well.

Concerning desideratum D1 We study the regression problem, a fundamental inference task that has received tremendous attention in data mining, data exploration, predictive modeling, and machine and statistical learning during the past fifty years. In a regression problem, we are given a set of n observations \((\mathbf {x}_{i},u_{i})\), where the \(u_{i}\)’s are the dependent variables (outputs) and the \(\mathbf {x}_{i}\)’s are the independent variables (inputs); for instance, refer to the 3-dim. input–output points \((\mathbf {x}, u)\) with input \(\mathbf {x} = [x_{1},x_{2}]\) and output u shown in Fig. 2 in our Example 1.

We desire to model the relationship between inputs and outputs. The typical assumption is that there exists an unknown data function g that approximately models the underlying relationship and that the dependent observations are corrupted by random noise. Specifically, we assume that there exists a family of functions \(\mathcal {L}\) such that, for some data function \(g \in \mathcal {L}\), the following generative model holds:

Let us now move a step further to provide more information about the modeled relationship function g in (1). The derivation of several local linear approximations, as opposed to a single linear approximation over the whole data space, can provide more accurate and significant insights. The key issue to note here is that a global (single) linear approximation of g interpolating among all items of the whole data space \(\mathbb {D}\) leaves, in general, much to be desired: analysts presented with a single global linear approximation might obtain an inaccurate view due to missing ‘local’ statistical dependencies within unknown local data subspaces that comprise \(\mathbb {D}\). This inevitably leads to prediction errors and approximation inaccuracies when issuing queries Q1 and Q2 to the DMS.

Example 2

Consider the input–output points in a (u, x) 2-dim. space in Fig. 3 (upper) and the actual data function \(u = g(x)\) (in red). A Q2 query issued over the data subspace \(\mathbb {D}(x_{0},\theta )\) will calculate the intercept \(b_{0}\) and slope \(b_{1}\) of the linear approximation \(u \approx \hat{g}(x) = b_{0} + b_{1}x\) (the green line l) over those \(x \in \mathbb {D}(x_{0},\theta )\). Evidently, such a line shows a very coarse and unrepresentative dependency between output u and input x, since u and x do not depend linearly on each other within the entire data subspace \(\mathbb {D}(x_{0},\theta )\). The point is that we should obtain a finer-grained and more accurate dependency between output u and input x. The principle of local linearity [26] states that local linear approximations of the underlying data function over certain data subspaces capture its nonlinearity better than a single linear fit over the entire data subspace of interest. In Fig. 3 (upper), we observe four local linear approximations \(l_{1}, \ldots , l_{4}\) in the data subspace. Therefore, it would be preferable if, as a result of query Q2, the analysts were provided with the list of local line segments \(\mathcal {S} = \{l_{1}, \ldots , l_{4}\}\), a.k.a. piecewise linear regression. These ‘local’ segments better approximate the local linearity of output u. Moreover, in Fig. 3 (lower) the underlying data function \(u = g(x_{1},x_{2})\) in the 3-dim. data space does not exhibit linearity over the entire \((x_{1},x_{2})\) plane. We can observe how the global linear relationship \(\hat{g}(x_{1},x_{2})\) cannot capture the very local statistical dependencies between \(\mathbf {x} = [x_{1},x_{2}]\) and u, which are better captured in certain data subspaces by local linear segments \(\hat{g}_{k}(x_{1},x_{2})\).
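The contrast in this example can be checked numerically. The sketch below uses \(u = |x|\) as a stand-in nonlinear data function and plain 1-dim. OLS (all names illustrative): a single global line over the subspace incurs a clearly visible MSE, while two local segments fit almost exactly.

```python
# Local-linearity principle, numerically: global line vs. two local segments.
def fit_line(xs, us):
    """Plain 1-dim. OLS: returns (intercept, slope)."""
    n = len(xs); mx = sum(xs) / n; mu = sum(us) / n
    b1 = sum((x - mx) * (u - mu) for x, u in zip(xs, us)) \
         / sum((x - mx) ** 2 for x in xs)
    return mu - b1 * mx, b1

def mse(xs, us, b0, b1):
    return sum((u - (b0 + b1 * x)) ** 2 for x, u in zip(xs, us)) / len(xs)

xs = [x / 10.0 for x in range(-10, 11)]
us = [abs(x) for x in xs]                      # nonlinear ground truth u = |x|

g0, g1 = fit_line(xs, us)                      # one global line over the subspace
global_mse = mse(xs, us, g0, g1)

left  = [(x, u) for x, u in zip(xs, us) if x < 0]
right = [(x, u) for x, u in zip(xs, us) if x >= 0]
piecewise_mse = sum(
    mse(*zip(*seg), *fit_line(*zip(*seg))) * len(seg)
    for seg in (left, right)) / len(xs)        # ~0: each segment is exactly linear
```

Here the two ‘segments’ are hand-picked at the kink; discovering such boundaries automatically is precisely what desiderata D2 and D3 ask for.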

Concerning desiderata D2 and D3 Consider the mean squared error (MSE) [26] as the measure of an estimator’s performance. Given the n samples \((\mathbf {x}_{i},u_{i})\) generated by the model in (1) with noise of zero mean \(\mathbb {E}[\epsilon _{i}] = 0\) and variance \(\mathbb {E}[\epsilon _{i}^{2}] = \sigma ^{2}\), our goal is to estimate a data function \(\hat{g}\) that is close to the true, unknown data function g with high probability over the noise terms \(\epsilon _{i}\). We measure the distance between our estimate \(\hat{g}\) and the unknown function g with the MSE:

In the general case, the data function g is nonlinear and satisfies some well-defined structural constraints; this setting has been extensively studied in a variety of contexts [11, 17, 37]. In our desiderata D2 and D3, we focus on the case where the data function g is nonlinear but can be approximated by a piecewise linear function with an unknown number K of unknown pieces (line segments). We provide the following definition of this type of data function:

The case where K is fixed (given) has received considerable attention in the research community [54]. The special case of piecewise polynomial functions (splines) has also been used in the context of inference, including density estimation, histograms, and regression [39].

Let us now denote by \(\mathcal {L}_{K}\) the space of K-piecewise linear functions. While the ground truth may be close to a piecewise linear function in certain subspaces, in general we do not assume that it exactly follows a (yet unknown) piecewise linear function. In this case, our goal is to recover a piecewise linear function that is competitive with the best piecewise linear approximation to the ground truth.
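For concreteness, one simple concrete representation of a member of \(\mathcal {L}_{K}\) (1-dim. case) stores the sorted interior breakpoints and one `(intercept, slope)` pair per piece; the class below is an illustrative sketch, not the paper’s data structure.

```python
# A K-piecewise linear function: K-1 sorted breakpoints, K coefficient pairs.
import bisect

class KPiecewiseLinear:
    def __init__(self, breakpoints, coeffs):
        assert len(coeffs) == len(breakpoints) + 1
        self.breakpoints = breakpoints     # sorted interior segment boundaries
        self.coeffs = coeffs               # (intercept, slope) per piece

    def __call__(self, x):
        k = bisect.bisect_right(self.breakpoints, x)  # piece containing x
        a, b = self.coeffs[k]
        return a + b * x

g = KPiecewiseLinear([0.0], [(0.0, -1.0), (0.0, 1.0)])   # g(x) = |x|, K = 2
```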

Formally, let us define the following problem, where we assume that the generative model in (1) representing the data function g is any arbitrary function. We define:

to be the error of the best-fit K-piecewise linear function to g, and let \(g^{*}\) be any K-piecewise linear function achieving this minimum. The central goal of desiderata D2 and D3 is then to discover \(g^{*}\), achieving an MSE as close to OPT\(_{K}\) as possible, provided that we observe only queries and their answers, without access to the actual input–output pairs \((\mathbf {x}_{i},u_{i})\).

Remark 2

If the segments of the data function g were known a priori, the segmented regression problem would immediately reduce to K independent linear regression problems. In the general case, where the locations of the segment boundaries and their corresponding coefficients are unknown, one needs to discover them using only the observed input–output pairs \((\mathbf {x}_{i},u_{i})\). Previous works addressing this problem [13, 54], while statistically efficient, are computationally slow and prohibitive for large-scale data sets: the running time for a given data subspace scales at least quadratically with the number n of data points in the queried subspace, and is thus impractical for large data subspaces, let alone the entire data space.
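The cost issue can be illustrated with the textbook dynamic program for exact 1-dim. segmented regression (a generic sketch, not the specific methods of [13, 54]): even with each per-segment error obtained in O(1) from prefix sums, filling the table costs \(O(K n^{2})\), i.e., at least quadratic in n.

```python
# Exact K-piecewise linear fit by dynamic programming over sorted points.
def segment_sse(pref, i, j):
    """Least-squares SSE of one line over points i..j (inclusive), via prefix sums."""
    n = j - i + 1
    sx, su, sxx, sxu, suu = (pref[k][j + 1] - pref[k][i] for k in range(5))
    det = n * sxx - sx * sx
    if det == 0:                       # degenerate: all x equal, fit the mean
        return suu - su * su / n
    b = (n * sxu - sx * su) / det      # slope
    a = (su - b * sx) / n              # intercept
    return max(suu - a * su - b * sxu, 0.0)

def best_k_piecewise_sse(points, K):
    """Minimum total SSE of a K-piecewise linear fit: O(K * n^2) time."""
    pts = sorted(points)
    n = len(pts)
    pref = [[0.0] * (n + 1) for _ in range(5)]   # prefix sums of x,u,x^2,xu,u^2
    for t, (x, u) in enumerate(pts):
        for k, v in enumerate((x, u, x * x, x * u, u * u)):
            pref[k][t + 1] = pref[k][t] + v
    dp = [segment_sse(pref, 0, j) for j in range(n)]        # K = 1
    for _ in range(K - 1):                                   # add one piece per pass
        dp = [min((dp[i - 1] + segment_sse(pref, i, j)
                   for i in range(1, j + 1)), default=float("inf"))
              for j in range(n)]
    return dp[n - 1]

# u = |x| sampled on [-1, 1]: two pieces suffice, one does not.
pts_demo = [(x / 10.0, abs(x) / 10.0) for x in range(-10, 11)]
sse2 = best_k_piecewise_sse(pts_demo, 2)   # ~0
sse1 = best_k_piecewise_sse(pts_demo, 1)   # clearly > 0
```

The inner `min` over all candidate boundaries is the quadratic bottleneck the remark refers to, and it must be paid per queried subspace.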

In our context, however, the analysts explore the data space only by issuing queries over specific data subspaces, thus observing only the answers of the analytics queries. Specifically, the analysts do not know before issuing a query how the data function behaves within an ad hoc defined data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\). When a query Q2 is issued, it is not known whether the data function g behaves with the same linearity throughout the entire \(\mathbb {D}(\mathbf {x}_{0},\theta )\) or not, and within which subspaces, if any, g changes its trend and u and \(\mathbf {x}\) exhibit linear dependencies. Thus, desiderata D2 and D3 focus on learning the boundaries of these local subspaces within \(\mathbb {D}(\mathbf {x}_{0},\theta )\) and, within each local subspace, discovering the linear dependency (segment) between output u and input \(\mathbf {x}\). This arms analysts and data scientists with much more accurate knowledge of how the data function \(g(\mathbf {x})\) behaves within a given data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\). Hence, decisions on further data exploration w.r.t. complex model selection and/or validation can be made by the analysts.

Concerning desideratum D4 Our motivation comes from the availability of previously issued and executed queries over large-scale datasets. In the words of [34], ‘As data grows, it may be beneficial to consider faster inferential algorithms, because the increasing statistical strength of the data can compensate for the poor algorithmic quality’; that is, it can be advantageous to sacrifice statistical prediction accuracy in order to achieve faster running times, because we can then achieve the desired error guarantee faster [35].

Our motivation rests on learning and predicting how the data function g behaves differently in certain data subspaces within a greater data subspace defined by previously issued and executed queries. The state-of-the-art methods leave much to be desired in terms of efficiency and scalability. Our key insight and contribution in this direction lie in the development of statistical learning models that deliver the above functionality for queries Q1 and Q2 in a way that is highly accurate and insensitive to the sizes of the underlying data spaces in number of data points, and thus scalable. The essence of the novel idea we put forward rests on exploiting previously executed queries and their answers obtained from the DMS/PMA system to train a model and then use that model to:

approximate the underlying data function g over the analysts’ queried data subspaces by estimating the unknown K segments and their unknown local model coefficients (PLR segments). This must be achieved based only on the issued queries and their answers, since no data access is provided to analysts by the DMS;

predict the list \(\mathcal {S}\) of the linear models (segments) that model the PLR estimator data function \(\hat{g}\) minimizing the MSE in (3). Such models best explain (fit) the underlying data function g over a given data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\);

predict the output data value \(\hat{u}\) given an unseen input \(\mathbf {x}\) based on the approximate data function \(\hat{g}\).

Remark 3

In the prediction phase, that is, after training, no access to the underlying data systems is required, thus, ensuring desideratum D4.

1.2 Challenges and organization

In Fig. 4, we show the system context within which our rationale and contributions unfold. A DMS serves analytics queries from a large user community. Over time, all users (data scientists, statisticians, analysts, applications) will have issued a large number of queries (\(\mathcal {Q} = \{\)Q\(_{1}\), Q\(_{2}\), ..., Q\(_{n}\}\)), and the system will have produced responses (e.g., \(y_{1}, y_{2}, \ldots , y_{n}\) for Q1 queries). Our key idea is to inject a novel statistical learning model and novel query processing algorithms between the users and the DMS, which monitor queries and responses and learn to associate each query with its response. After training, say after the first \(m < n\) queries \(\mathcal {T} =\{\)Q\(_{1}, \ldots ,\)Q\(_{m}\}\), then for any new/unseen query Q\(_{t}\) with \(m < t \le n\), i.e., Q\(_{t} \in \mathcal {V} = \mathcal {Q} \setminus \mathcal {T}= \{\)Q\(_{m+1}, \ldots ,\)Q\(_{n}\}\), our model approximates the data function g with an estimator function \(\hat{g}\) through a list \(\mathcal {S}\) of local linear regression coefficients (line segments) that best fits the actual and unknown function g over the query Q\(_{t}\)’s data subspace, and predicts its response \(\hat{y}_{t}\) without accessing the DMS. The efficiency and scalability benefits of our approach are evident. Computing the exact answers to queries Q1 and Q2 can be very time-consuming, especially for large data subspaces. So, if this model and these algorithms can deliver highly accurate answers, query processing times will be dramatically reduced. Scalability is also ensured for two reasons. Firstly, in the data dimension, as query Q1 and Q2 executions (after training) do not involve data accesses, even dramatic increases in DB size do not impact query execution.
Secondly, in the query-throughput dimension, avoiding DMS-internal resource utilization (which would be required if all Q1 and Q2 queries were executed over the DMS data) saves resources that can be devoted to supporting larger numbers of queries at any given point in time. Viewed from another angle, our contributions aim to exploit all the work performed by the DMS engine in answering previous queries, in order to facilitate accurate answers to future queries efficiently and scalably.
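As a toy illustration of this pipeline, the sketch below trains on observed query–answer pairs and then answers unseen queries without touching the DMS; a 1-nearest-neighbor lookup over past queries stands in for the statistical model developed in later sections, and all names are hypothetical.

```python
# Queries are vectors q = (x1, x2, theta); answers y come from the DMS during
# a training (observation) phase; afterward, predictions need no data access.
import math

class QueryAnswerModel:
    def __init__(self):
        self.memory = []                    # list of (query, answer) pairs

    def observe(self, q, y):                # training phase: watch the DMS
        self.memory.append((q, y))

    def predict(self, q):                   # after training: no DMS access
        _, y = min(self.memory, key=lambda qy: math.dist(qy[0], q))
        return y

model = QueryAnswerModel()
model.observe((0.0, 0.0, 1.0), 3.2)         # executed Q1 queries and answers
model.observe((5.0, 5.0, 1.0), 7.8)
yhat = model.predict((0.1, -0.1, 1.0))      # answered from the nearest past query
```

The point of the sketch is only the control flow: prediction cost depends on the number of observed queries, not on the number of tuples in the database.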

The research challenge arising from this rationale is the problem of segmented regression with a non-fixed design, exploiting only queries and answers and no longer accessing the underlying data. Specifically, the challenges are:

Identify the number and boundaries of the data subspaces with local linearities and deliver the local linear approximations for each identified subspace, i.e., predict the list \(\mathcal {S}\) for an unseen query Q2, so that the query Q2 need not be executed. This challenge is compounded by the fact that the boundaries of these data subspaces are unknown and cannot be determined even by scanning all of the data, which in any case would be inefficient and unscalable.

Predict the average answer y for an unseen query Q1, so that the query Q1 need not be executed.

It is worth noting that these goals cannot be achieved solely by accessing the data, as we also need to know the users’ ad hoc defined subspaces of interest. One could, in principle, provide this information a priori for all possible data subspaces of interest, i.e., consider all possible center points \(\mathbf {x}_{0}\) and all possible radii \(\theta \). However, this is clearly impractical: such knowledge is obtained only after the analysts have issued queries over the data, thus reflecting their subspaces of interest and exploration.

The paper is organized as follows: Sect. 2 reviews related work and summarizes the major contributions of this work. In Sect. 3, we formulate our problems and provide preliminaries, while Sect. 4 provides our novel statistical learning algorithms for large-scale predictive modeling. Section 5 introduces our proposed query-driven methodologies, corresponding algorithms and analyses, while in Sect. 6, we report on the piecewise linear approximation and query–answer prediction methods. The convergence analysis and the inherent computational complexity are elaborated in Sect. 7, and we provide a comprehensive performance evaluation and comparative assessment of our methodology in Sect. 8. Finally, Sect. 9 summarizes our work and discusses our future research agenda in the direction of query-driven predictive modeling.

2 Related work and contribution

2.1 Related work

Outside DMS environments, statistical packages like MATLAB and R support fitting regression functions. However, their algorithms for doing so are inefficient and hardly scalable. Moreover, they lack support for relational and declarative Q1 and Q2 queries. So, if data are already in a DMS, they would need to be moved back and forth between external analytics environments and the DMS, resulting in considerable inconvenience and performance overheads (if feasible at all for big datasets). At any rate, modern DMSs should provide analysts with rich support for PMA.

An increasing number of major database vendors include data mining and machine learning tools in their products. PostgreSQL, MySQL, MADlib (over PostgreSQL) [21] and commercial tools like Oracle Data Miner, IBM Intelligent Miner, and Microsoft SQL Server Data Mining provide SQL-like interfaces for analysts to specify regression tasks. Academic efforts include MauveDB [23], which integrates regression models into a DMS, while similarly FunctionDB [50] allows analysts to pose regression queries directly against a DMS. Also, Bismarck [27] integrates and supports in-DMS analytics, while [48] integrates and supports least squares regression models over training datasets defined by arbitrary join queries on database tables. All such works that also support Q1 and Q2 queries can serve as the DMS within Fig. 4. However, in the big data era, exact Q1 and Q2 computations leave much to be desired in efficiency and scalability, as the system must first execute the selection establishing the data subspace of each query and then access all of its tuples to compute Q1 and Q2.

Apart from the standard multivariate linear regression algorithm, i.e., ordinary least squares (OLS) for function approximation [29], the literature contains more elaborate piecewise regression algorithms, e.g., [10, 12, 18, 28], which can detect the nonlinearity of a data function g and provide multiple local linear approximations. Given an ad hoc exploration query over n points in a d-dim. space, standard OLS regression algorithms have asymptotic computational complexities of \(O(n^{2}d^{5})\) and \(O(nd^{2}+d^{3})\) [32]. Therefore, OLS algorithms suffer from poor scalability and efficiency, especially as n grows larger and/or in high-dimensional spaces, as will be quantified in Sect. 8. Such methodologies incur these overheads for every query issued, which is highly undesirable. To address this, one might consider performing the data-space analysis only once, deriving all local linear regression models for the whole data space and reusing the derived models for all queries. Indeed, a literature survey reveals several such methods, e.g., [12, 42], which identify the nonlinearity of the data function g and provide multiple local linear approximations. Unfortunately, these methods are computationally expensive and thus do not scale with the number n of data points. All of them answer queries like Q2 by going through a series of stages: partitioning the entire data space into clusters, assigning each data point to one of these clusters, and fitting a linear regression function to each cluster. However, data clustering cannot automatically guarantee that the within-cluster nonlinearity of the data function g is captured by a local linear fit. Hence, all these methods are iterative, repeating the above stages until convergence in order to minimize the residual estimation error of all approximated local linear regression functions.
For instance, the method in [10] clusters and regresses the entire data space against K clusters with a complexity of \(O(K(n^{2}d + nd^{2}))\). Similarly, the incremental adaptive controller method [20] using self-organizing structures requires \(O(n^{2}dT)\) for training. The same holds for the methods [12, 19, 20, 28] that combine iterative clustering and classification for piecewise regression, also requiring \(O(n^{2}dT)\). These costs show how expensive computing exact answers with such regression methods is. As all these methods derive regression models over the whole data space, e.g., over trillions of points, the scalability and efficiency desiderata are missed, as Sect. 8 will showcase.

Our approach accurately supports predicting the result of mean-value Q1 queries, approximating the underlying data function g based on (multiple) local linear models of regression Q2 queries, and predicting the output data values given unseen inputs by estimating the underlying data function. It does so while achieving high prediction accuracy and goodness of fit after training, without executing Q1 and Q2 and thus without accessing data. This ensures a highly efficient and scalable solution, which is independent of the data sizes, as Sect. 8 will show.

2.2 Contribution

The contribution of this work lies in efficient and scalable models and algorithms that obtain highly accurate results for mean-value and linear regression queries and for PLR-based data function approximation. This rests on learning the principal local linear approximations of the data function g. Our approach is query-driven: past issued queries are exploited to partition the queried data space into subspaces so as to minimize the induced regression error and the model fitting/approximation error. In each data subspace, we incrementally approximate the data function g based on a novel PLR approximation methodology relying only on query–answer pairs. Given a query over a data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\), our contributions are:

A PLR-based data approximation of the data function g over different unseen subspaces that best explains the underlying function within \(\mathbb {D}(\mathbf {x}_{0},\theta )\).

A statistical learning methodology for query-driven PLR approximations of data functions over multi-dimensional data spaces. This methodology indirectly extracts information about the unknown data function g only by observing and learning the mapping between aggregation queries and their answers.

A joint optimization algorithm for minimizing the PLR data approximation error and answer prediction error in light of quantizing the query space.

Convergence analyses of the methodology including variants for supporting partial and global convergence.

for all possible query points \(\mathbf {x} \in \mathbb {R}^{d}\) and query radii \(\theta \in \mathbb {R}\). To calculate the expectation in (5), we approximate the EPE by the MSE in (2) over a finite number of query–answer pairs \((y,[\mathbf {x},\theta ])\). Before finding the family of functions that minimizes the EPE in (5), we rely on the law of iterated expectations for the dependent variable y given query point \(\mathbf {x}\) and radius \(\theta \), i.e., \(\mathbb {E}[y] = \mathbb {E}[\mathbb {E}[y|\mathbf {x},\theta ]]\), whereby y breaks into two pieces, as follows:

For the proof of Theorem 1, refer to [32]. According to Theorem 1, the aggregate output y can be decomposed into a conditional expectation function \(\mathbb {E}[y|\mathbf {x},\theta ]\), hereinafter referred to as the query regression function, which is explained by \(\mathbf {x}\) and \(\theta \), and a leftover noise component which is orthogonal to (i.e., uncorrelated with) any function of \(\mathbf {x}\) and \(\theta \).
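In symbols (restating the decomposition just described, with \(\varepsilon \) denoting the noisy component):

```latex
y \;=\; \underbrace{\mathbb{E}[\,y \mid \mathbf{x},\theta\,]}_{\text{query regression function}} \;+\; \varepsilon,
\qquad \mathbb{E}[\,\varepsilon \mid \mathbf{x},\theta\,] = 0,
```

so that, by the law of iterated expectations, \(\mathbb {E}[\varepsilon \, h(\mathbf {x},\theta )] = 0\) for any function h of \(\mathbf {x}\) and \(\theta \), which is the orthogonality property stated above.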

In our context, the query regression function is a good candidate for minimizing the EPE in (5), envisaged as a local representative value of the answer y over the data subspace \(\mathbb {D}(\mathbf {x},\theta )\). Therefore, the conditional expectation function is the best predictor of the answer y given \(\mathbb {D}(\mathbf {x},\theta )\):

Remark 4

We rely on Theorems 1 and 2 to build our statistical learning methodology \(\mathcal {F}\) for estimating a query-PLR \(\hat{f}\); then, based on Theorem 1, we estimate the data-PLR \(\hat{g}\) only through \(\hat{f}\) and the query–answer pairs \((\mathbf {q},y) = ([\mathbf {x},\theta ],y)\), without accessing the actual data pairs \((\mathbf {x},u)\).

The query-PLR estimate function \(\hat{f}(\mathbf {x},\theta )\) from challenge CH1’s outcome is used for estimating the multiple local line segments (i.e., the local linear regression coefficients, intercept and slope) of the data-PLR estimate function \(\hat{g}\). This is achieved by a novel statistical learning methodology \(\mathcal {F}\), which learns from a continuous query–answer stream \(\{(\mathbf {q}_{1},y_{1}), \ldots , (\mathbf {q}_{t},y_{t})\}\) through the interactions between the users and the system. We can then formulate our problems as follows:

Problem 1

Given a finite number of query–answer pairs, approximate the query-PLR function \(\hat{f}(\mathbf {x},\theta )\) and predict the aggregate answer \(\hat{y}\) of an unseen query \(\mathbf {q} = [\mathbf {x},\theta ]\).

Problem 2

Given only the query-PLR function \(\hat{f}(\mathbf {x},\theta )\) from Problem 1, approximate the data-PLR function \(\hat{g}(\mathbf {x})\) and predict the data output \(\hat{u}\) of an unseen data input \(\mathbf {x}\).

3.3 Preliminaries

3.3.1 Incremental learning and stochastic gradient descent

Stochastic gradient descent (SGD) [14] is an optimization method for minimizing an objective function \(\mathcal {E}(\alpha )\), where \(\alpha \) is a parameter and the optimal parameter \(\alpha ^{*}\) minimizes the objective \(\mathcal {E}\). SGD converges quickly to the optimal parameter \(\alpha ^{*}\) by adjusting the current parameter estimate \(\alpha \) in the direction (the negative gradient \(-\nabla \mathcal {E}\)) that decreases \(\mathcal {E}\). SGD gradually changes the parameter \(\alpha \) upon reception of each new training sample. The standard gradient descent algorithm updates the parameter \(\alpha \) in \(\mathcal {E}(\alpha )\) as \(\varDelta \alpha = - \eta \nabla _{\alpha }\mathbb {E}[\mathcal {E}(\alpha )]\), where the expectation is approximated by evaluating the objective function \(\mathcal {E}\) and its gradient over all training pairs, and \(\eta \in (0,1)\) is a learning rate. By contrast, SGD computes the gradient of \(\mathcal {E}\) using only a single training pair at step t; that is, we incrementally optimize the objective \(\mathcal {E}\). The update of the parameter \(\alpha _{t}\) at step t is given by \(\varDelta \alpha _{t} = - \eta _{t} \nabla _{\alpha _{t}}\mathcal {E}(\alpha _{t})\). The learning rate \(\{\eta _{t}\} \in (0,1)\) is a step-size schedule, i.e., a slowly decreasing sequence of scalars that satisfy:
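(The conditions referenced here are the standard Robbins–Monro requirements \(\sum _{t}\eta _{t} = \infty \) and \(\sum _{t}\eta _{t}^{2} < \infty \).) A minimal numeric sketch of the update \(\varDelta \alpha _{t} = - \eta _{t} \nabla _{\alpha _{t}}\mathcal {E}(\alpha _{t})\) follows: estimating the mean of a synthetic stream by minimizing the per-sample objective \(\mathcal {E}(\alpha ) = \frac{1}{2}(\alpha - s)^{2}\), whose gradient is \(\alpha - s\); the schedule \(\eta _{t} = 1/(t+1)\) is one admissible choice.

```python
# SGD with a decreasing step-size schedule: one sample per update step.
import random

random.seed(0)
alpha = 0.0                              # parameter estimate
for t in range(5000):
    s = random.gauss(2.5, 1.0)           # one training sample per step
    eta = 1.0 / (t + 1)                  # decreasing learning-rate schedule
    grad = alpha - s                     # gradient of E(alpha) = (alpha - s)^2 / 2
    alpha -= eta * grad                  # delta_alpha_t = -eta_t * grad
```

With this particular schedule, the iterate equals the running sample mean, so `alpha` converges to 2.5 as samples accumulate.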

where \(F(\mathbf {x})\) is the cumulative distribution function of the vectors in \(\mathbb {R}^{d}\) and \(\mathbf {w}(\mathbf {x})\) denotes the prototype selected by the VQ \(v(\mathbf {x})\). For each random vector \(\mathbf {x}\), the optimal VQ minimizing (8) selects the best-matched prototype from the codebook w.r.t. the Euclidean distance:
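A compact sketch of this rule and of its online counterpart (competitive learning, an SGD-style update moving the winning prototype toward each incoming vector, which stochastically reduces the expected distortion); the two-cluster stream is synthetic and 1-dimensional for brevity, and a constant step size is used instead of a decreasing schedule.

```python
# Online vector quantization: winner-takes-all assignment + prototype update.
import math, random

def nearest(prototypes, x):
    """Index of the best-matched prototype w.r.t. Euclidean distance."""
    return min(range(len(prototypes)), key=lambda j: math.dist(prototypes[j], x))

random.seed(1)
protos = [[-2.0], [2.0]]                  # two 1-dim. prototypes (the codebook)
for _ in range(2000):
    # synthetic stream: two well-separated clusters around -5 and +5
    c = -5.0 if random.random() < 0.5 else 5.0
    x = [random.gauss(c, 0.5)]
    j = nearest(protos, x)                # winner w(x)
    protos[j][0] += 0.05 * (x[0] - protos[j][0])   # move winner toward x
```

After the stream is consumed, the two prototypes sit near the two cluster centers, i.e., near the distortion-minimizing codebook for this input distribution.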

4 Solution fundamentals

4.1 Methodology overview

We first proceed with a solution of Problem 1 to approximate the query function f through a query-PLR function \(\hat{f}\). Then, we use the approximate \(\hat{f}\) to address Problem 2 to approximate the data function g by a data-PLR function \(\hat{g}\).

where \(n_{\theta }(\mathbf {x})\) varies depending on the location of the query point \(\mathbf {x}\) in the input data space \(\mathbb {R}^{d}\) and on the query radius \(\theta \). Moreover, the nonlinearity of the function g over certain subspaces propagates to f by the definition of the aggregate answer y in (4). Hence, we must identify those data subspaces where the data function g behaves almost linearly, which should be reflected in the approximation of f by \(\hat{f}\). This provides the key insight for approximating both functions f and g through a family of PLR functions by learning the unknown finite set of local linear functions. We call these local linear functions local linear mappings (LLMs) and derive the corresponding query-space LLMs and data-space LLMs for the query function f and the data function g, respectively.

In Problem 1, we approximate the query function \(f(\mathbf {x},\theta )\) with a set of query-space LLMs (or query-LLMs), each of which is constrained to a local region of the query space \(\mathbb {Q}\) defined by similar queries w.r.t. the \(L_{2}\) distance. Similar queries are those with similar centers \(\mathbf {x}\) and similar radii \(\theta \). Our general idea for those query-space LLMs is the quantization of the query space \(\mathbb {Q}\) into a finite number of query subspaces \(\mathbb {Q}_{k}\) such that the query function f can be linearly approximated by a query-LLM \(f_{k}, k = 1, \ldots , K\), i.e., the k-th PLR segment. Those query subspaces may be rather large in areas of the query vectorial space \(\mathbb {Q}\) where the query function f indeed behaves approximately linearly, and must be smaller where this is not the case. The total number K of such query subspaces depends on the desired approximation (goodness of fit) and the query–answer prediction accuracy, and may be limited by the number of issued queries, since over-fitting might otherwise occur.

Fundamentally, we incrementally quantize the query space \(\mathbb {Q}\) over a series of issued queries through quantization vectors, hereinafter referred to as query prototypes, in \(\mathbb {Q}\). Then, we associate each query subspace \(\mathbb {Q}_{k}\) with a query-LLM \(f_{k}\) in the query–answer space, where the query function f behaves approximately linear.

In Problem 2, each query subspace \(\mathbb {Q}_{k}\) is, in principle, associated with a data subspace \(\mathbb {D}_{k}\), i.e., for a query \(\mathbf {q} \in \mathbb {Q}_{k} \subset \mathbb {R}^{d+1}\), its corresponding query point \(\mathbf {x} \in \mathbb {D}_{k} \subset \mathbb {R}^{d}\). This implies that the input vector \(\mathbf {x}\) (of the query \(\mathbf {q}\)) is constrained to be drawn only from the k-th data subspace \(\mathbb {D}_{k}\). Based on this association, we use the query-LLM \(f_{k}\) to estimate the data-LLM \(g_{k}\), i.e., to estimate the local intercept and slope of the data function g over the k-th data subspace \(\mathbb {D}_{k}\).

As will be analyzed and elaborated later, the query vector \(\mathbf {q}_{0} = [\mathbf {x}_{0},\theta _{0}]\) and the gradient of the query function f at \(\mathbf {q}_{0}\) are not randomly selected. Instead, the proposed methodology \(\mathcal {F}\) attempts to find the query vector \(\mathbf {q}_{0}\) and the gradient vector of f at \(\mathbf {q}_{0}\) which satisfy the following optimization properties:

(OP3) extraction of the data-LLM estimator from the query-LLM estimator such that the query-driven PLR \(\hat{g}\) fits well the underlying data function g, as stated in challenges CH2 and CH3.

As it will be proved later by Theorems 7, 8, and 9, we require a query-LLM \(f_{k}\) to derive from a specific Taylor approximation around the local expectation query \(\mathbb {E}[\mathbf {q}] = [\mathbb {E}[\mathbf {x}],\mathbb {E}[\theta ]]\) of queries \(\mathbf {q} \in \mathbb {Q}_{k}\):

Up to now, our challenge is for each query-LLM \(f_{k}\) to estimate the parameter \(\alpha _{k} = (y_{k},\mathbf {b}_{k},\mathbf {w}_{k})\) in light of minimizing the EPE as stated in OP1 by the following constrained optimization problem:

Remark 5

It is worth mentioning that the constraint \(y_{k} = f_{k}(\mathbf {x}_{k},\theta _{k}), \forall k \in [K]\) in the optimization problem (15) requires that, in each query subspace \(\mathbb {Q}_{k}\), the corresponding query-LLM \(f_{k}\) refers to a (hyper)plane that minimizes the EPE. Moreover, given a query \(\mathbf {q}\) with query point \(\mathbf {x} = \mathbb {E}[\mathbf {x}|\mathbf {x} \in \mathbb {D}(\mathbf {x}_{k},\theta _{k})]\), i.e., the centroid of the corresponding data subspace, and radius \(\theta = \mathbb {E}[\theta |\mathbf {x} \in \mathbb {D}(\mathbf {x}_{k},\theta _{k}), \mathbf {q} \in \mathbb {Q}_{k}]\), i.e., the mean radius of all the queries from \(\mathbb {Q}_{k}\), the constraint secures that \(f_{k}\) supports OP2 and OP3.

However, we need to further optimize \(\alpha _{k}\) to satisfy also the optimization properties OP2 and OP3.

4.3 Our statistical learning methodology

Our statistical learning methodology \(\mathcal {F}\) departs from the optimization problem in (15) to additionally support the optimization properties OP2 and OP3. Our methodology is formally based on a joint optimization problem of optimal quantization and regression. This is achieved by incrementally identifying within-subspaces linearities in the query space and then estimating therein the query-LLM coefficients such that we preserve the optimization properties OP1, OP2, and OP3.

4.3.1 Joint quantization–regression optimization for query-LLMs

Firstly, we should identify the subspaces \(\mathbb {Q}_{k}\), i.e., determine their prototypes \(\mathbf {w}_{k}\), their number K, and their coefficients \(y_{k}\) and \(\mathbf {b}_{k}\), in which the query function f can be well approximated by LLMs. We identify the prototypes \(\mathbf {w}_{k}\) (associated with \(\mathbb {Q}_{k}, k \in [K]\)) by incrementally partitioning the query space \(\mathbb {Q} = \cup _{k=1}^{K}\mathbb {Q}_{k}\). Before elaborating on our methodology, we provide an illustrative example on query space quantization.

The introduction of the query space quantization before predicting the query’s answer, i.e., regression of aggregate answer y on the query vector \(\mathbf {q}\), raises a natural fundamental question:

Question: Since query quantization loses information and thus likely degrades the prediction performance of the query function approximation \(\hat{f}\), would it not be better to always proceed with regression based on the original, un-quantized query vectors?

Answer: One response to this question is that a VQ can be considered part of the regression estimate function \(\hat{f}\). The overall goal is not purely regression, i.e., query–answer prediction using the query function f, but also PLR fitting of the underlying data function g. The VQ yields several benefits: it constructs the query prototypes \(\{\mathbf {w}_{k} = [\mathbf {x}_{k},\theta _{k}]\}_{k=1}^{K}\) of the query-LLMs \(f_{k}\), thus minimizing the EQE (OP2), and it supports constructing the intercepts and slopes \(\{(y_{k},\mathbf {b}_{k})\}_{k=1}^{K}\), which are needed to minimize the EPE (OP1) and to derive the data-LLMs \(g_{k}\) (OP3). Moreover, based on Theorem 7, the query prototypes \(\mathbf {w}_{k}\) converge to the optimal vector prototypes only when adopted by the VQ; specifically, by an incrementally growing AVQ, as elaborated later. Estimating the query prototypes \(\mathbf {w}_{k}\) provides a methodology not suggested by the regression/prediction goal alone, which nonetheless allows one to treat prediction performance as the more important criterion and may eventually yield better regression algorithms. In our case, however, the goal is to satisfy the optimization properties OP1, OP2, and OP3 simultaneously with one model. This can be viewed as jointly designing a VQ and a PLR-based predictor that yields performance close to that achievable by an optimal PLR-based predictor operating on the original answer–query pairs and input–output data pairs, as shown in our performance evaluation in Sect. 8.

Given a finite and unknown number of query prototypes K and a VQ \(v(\mathbf {q})\) over the query space, the query quantization performance measured by mean squared distortion error \(\mathcal {J}\) is given by:

In parallel, within each \(\mathbb {Q}_{k}\), we incrementally estimate the PLR coefficients \((y_{k},\mathbf {b}_{k})\) of each query-LLM \(f_{k}\). These coefficients are learned only from similar query–answer pairs whose queries belong to the query subspace \(\mathbb {Q}_{k}\).

We propose a hybrid model by partitioning \(\mathbb {Q}\) into K (unknown) subspaces \(\mathbb {Q}_{k}\), i.e., unsupervised learning of the \(\mathbf {w}_{k}\) to minimize the EQE, and supervised learning of the coefficients \(y_{k}\) and \(\mathbf {b}_{k}\) to minimize the EPE. The idea is that each query subspace \(\mathbb {Q}_{k}\) associates the LLM \(f_{k}\) with the query prototype \(\mathbf {w}_{k}\), as shown in Fig. 6 (see Example 4), conditioned on the result of the VQ. In other words, the regression performance is provided by the conditional EPE \(\mathcal {H}\):

which is the lower bound achievable if the regression is chosen to minimize the prediction error derived by the k-th query-LLM \(f_{k}\), which corresponds to the closest query prototype \(\mathbf {w}_{k}\), i.e., the VQ chooses \(k = v(\mathbf {q})\) such that the query prototype \(\mathbf {w}_{k}\) is the winner prototype.

The objective function in (21) corresponds to the optimal partitioning of the query space into K partitions, each with a prototype (OP2). The objective function in (22) corresponds to a conditional EPE, conditioned on the k-th query prototype \(\mathbf {w}_{k}\) being the closest to the query \(\mathbf {q}\) (OP1). The constraints ensure the derivation of the data-LLM \(\hat{g}_{k}\) from the query-LLM \(f_{k}\), as it will be shown later (OP3).

The quantization of the query space \(\mathbb {Q}\) operates as a mechanism to project an unseen query \(\mathbf {q}\) to the closest query subspace \(\mathbb {Q}_{k}\) w.r.t. \(L_{2}\) distance from the prototype \(\mathbf {w}_{k}\), wherein we learn the dependency between the aggregate answer y with the query point \(\mathbf {x}\) and radius \(\theta \).
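This projection-then-regression step can be sketched as follows, assuming illustrative prototypes and query-LLM coefficients and the local linear form \(y_{k} + \mathbf {b}_{k}(\mathbf {q}-\mathbf {w}_{k})^{\top }\):

```python
import math

def predict(q, prototypes, intercepts, slopes):
    """Project the unseen query q onto the closest prototype w_k (VQ step),
    then evaluate the associated query-LLM: y = y_k + <b_k, q - w_k>.
    All numeric values below are illustrative assumptions."""
    k = min(range(len(prototypes)), key=lambda i: math.dist(q, prototypes[i]))
    w, y_k, b_k = prototypes[k], intercepts[k], slopes[k]
    return y_k + sum(b * (qi - wi) for b, qi, wi in zip(b_k, q, w))

prototypes = [(0.0, 0.1), (1.0, 0.2)]   # query prototypes w_k = [x_k, theta_k]
intercepts = [1.0, 2.0]                 # intercepts y_k
slopes = [(0.5, 0.0), (1.0, 0.0)]       # slopes b_k
```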

Example 4

Figure 6 depicts the association from the query space to the 3D data space. A query prototype \(\mathbf {w}_{j}\), a disk on the input space \((x_{1},x_{2})\), is now associated with the query-LLM \(f_{j}(\mathbf {x},\theta )\) and its corresponding regression plane \(u_{j} = f_{j}(\mathbf {x},\theta _{j})\) on the data space \((u, x_{1}, x_{2})\), which approximates the actual data function \(u = g(x_{1},x_{2}) = x_{1}(x_{2}+1)\). Note, in each local plane, we learn the local intercept \(y_{j}\) and slope \(\mathbf {b}_{j}\) where \(\mathbf {x}_{j}\) is the representative of the data subspace \(\mathbb {D}_{j}\) (see Theorems 7, 8 and 9).

4.3.2 Data-LLM function derivation from query-LLM function

Concerning Problem 1, the prediction of the aggregate output \(\hat{y}\) of an unseen query \(\mathbf {q}\) is provided by neighboring query-LLM functions \(f_{k}\), as will be elaborated later. Concerning Problem 2, we derive the linear data-LLM function \(g_{k}\) (intercept and slope) between output u and input \(\mathbf {x}\) over the data subspace \(\mathbb {D}\) given the query-LLM function \(f_{k}\). Then, we approximate the PLR estimate of data function g by interpolating many data-LLMs.

Based on Theorems 1 and 2, we obtain that the data output \(u = g(\mathbf {x}) = \mathbb {E}[u|\mathbf {x}] + \epsilon \). In that context, we can approximate the data function \(g(\mathbf {x})\) over the data subspace \(\mathbb {D}_{k}\), i.e., the PLR segment \(g_{k}\) from the corresponding query-LLM function \(f_{k}\) conditioned on the mean radius \(\theta _{k}\).
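The "condition on the mean radius" idea can be sketched as follows, with a hypothetical one-dimensional query-LLM (the precise derivation of \(g_{k}\) is stated in Theorem 3):

```python
def make_data_llm(f_k, theta_k):
    """Derive a data-LLM g_k from a query-LLM f_k by fixing the radius
    argument at the mean radius theta_k of the query subspace Q_k."""
    return lambda x: f_k(x, theta_k)

# Hypothetical query-LLM f_k(x, theta) = 2.0 + 1.5*x + 0.5*theta
f_k = lambda x, theta: 2.0 + 1.5 * x + 0.5 * theta
g_k = make_data_llm(f_k, theta_k=0.2)   # linear in x once theta is fixed
```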

Theorem 3

The data function \(g(\mathbf {x})\) in the data subspace \(\mathbb {D}_{k}\) is approximated by the linear regression function:

Example 5

We provide the following visualization in Fig. 7 to better explain and provide insights into the derivation of the data-LLMs from the query-LLMs. Specifically, Fig. 7 interprets the mapping methodology \(\mathcal {F}\) from the query-LLMs to the data-LLMs after obtaining the optimal values for the parameters that satisfy the optimization properties OP1, OP2, and OP3. We observe three regression planes in the query–answer space \((x,\theta ,y)\), which are approximated by the three query-LLMs \(f_{1},f_{2}\) and \(f_{3}\). This indicates the PLR approximation of the query function f. Now, focus on the regression plane \(f_{k}(x,\theta )\) along with the query prototype \(\mathbf {w}_{k} = [x_{k},\theta _{k}]\). The corresponding data-LLM function \(g_{k}(x)\) for those data inputs \(x \in \mathbb {D}(x_{k},\theta _{k})\) derives from the query-LLM \(f_{k}(x,\theta _{k})\) since, as proved in Theorem 7, the radius \(\theta _{k}\) is the expected radius of all the queries \(\mathbf {q}\) with \(\mathbf {w}_{k} = \mathbb {E}[\mathbf {q}|v(\mathbf {q}) = k]\), i.e., \(\theta _{k} = \mathbb {E}[\theta |v(\mathbf {q}) = k]\). The data-LLM is represented by the linear regression approximation \(g_{k}(x)\) lying on the regression plane defined by the query-LLM \(f_{k}\). We obtain the PLR data approximation \(\hat{g}\) over the whole data input space by following the data-LLMs \(g_{k}\) over the planes defined by the query-LLMs \(f_{k}\), as illustrated in the inner plot in Fig. 7. As we move from one query-LLM \(f_{k-1}\) to the next one \(f_{k}\), we derive the corresponding data-LLMs \(g_{k-1}\) and \(g_{k}\) by setting \(\theta = \theta _{k-1}\) and \(\theta =\theta _{k}\) in the query-LLM definitions (linear models) such that \(\theta _{k-1} = \mathbb {E}[\theta |v(\mathbf {q}) = k-1]\) and \(\theta _{k} = \mathbb {E}[\theta |v(\mathbf {q}) = k]\), respectively. Hence, based on this trajectory, we derive the PLR estimate \(\hat{g}\) of the underlying data function g.

Remark 6

It is worth noting that, based on Theorem 3, the approximation of the data function g is achieved using only the knowledge extracted from answer–query pairs, without accessing the data points.

Fig. 7 (Example 5): The three query-LLMs \(f_{1}, f_{2}, f_{3}\) as three-dimensional planes in the query space and their corresponding derived data-LLMs \(g_{1}, g_{2}, g_{3}\) as line segments over the data subspace (inner plot)

5 Query-driven statistical learning methodology

In this section we propose our query-driven statistical learning algorithm for our methodology through which all the query-LLM parameters \(\alpha _{k}\) minimize both (21) and (22). Then, we provide the PLR approximation error bound of the PLR estimate functions \(f_{k}\) of query function f and the impact of our VQ algorithm in this error.

Let us focus on the EQE \(\mathcal {J}\) in (21) and liaise with Example 3 (Fig. 5). We seek the best possible approximation of a random query \(\mathbf {q}\) out of the set \(\{\mathbf {w}_{k}\}_{k=1}^{K}\) of finite K query prototypes. We consider the closest neighbor projection of query \(\mathbf {q}\) to a query prototype \(\mathbf {w}_{j}\), which represents the j-th query subspace \(\mathbb {Q}_{j} \subset \{ \mathbf {q} \in \mathbb {Q}: ||\mathbf {q}-\mathbf {w}_{j} ||_{2} = \min _{k} ||\mathbf {q}-\mathbf {w}_{k} ||_{2} \}\). We incrementally minimize the objective function \(\mathcal {J}\) with the presence of a random query \(\mathbf {q}\) and update the winning prototype \(\mathbf {w}_{j}\) accordingly. However, the number of the query subspaces and, thus, of the query prototypes \(K>0\), is completely unknown and not necessarily constant. The key problem is to decide on an appropriate K value. A variety of AVQ methods exists in the literature; they are, however, not suitable for an incremental implementation, because K must be supplied in advance.

We propose a conditionally growing AVQ algorithm under the \(L_{2}\) distance, in which the prototypes are sequentially updated with the incoming queries and their number grows adaptively, i.e., the number K increases if a criterion holds true. Given that K is not available a priori, our VQ minimizes the objective \(\mathcal {J}\) with respect to a threshold value \(\rho \). This threshold determines the current number of prototypes K. Initially, the query space has a unique (random) prototype, i.e., \(K=1\). Upon the arrival of a query \(\mathbf {q}\), our algorithm first finds the winning query prototype \(\mathbf {w}_{j}\) and then updates the prototype \(\mathbf {w}_{j}\) only if the condition \(||\mathbf {q}-\mathbf {w}_{j}||_{2} \le \rho \) holds true. Otherwise, the query \(\mathbf {q}\) is considered a new prototype, thus increasing the value of K by one. Through this conditional quantization, our VQ algorithm lets the random queries self-determine the resolution of the quantization. Evidently, a high \(\rho \) value results in a coarse query space quantization (i.e., a low-resolution partition), while low \(\rho \) values yield a fine-grained quantization of the query space. The parameter \(\rho \) is associated with the stability–plasticity dilemma, a.k.a. vigilance in Adaptive Resonance Theory [30]. In our case, the vigilance \(\rho \) represents a threshold of similarity between queries and prototypes, thus guiding our VQ algorithm in determining whether a new query prototype should be formed.
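The conditionally growing AVQ can be sketched as follows; the learning-rate-based winner update is an illustrative assumption (the actual update rules are derived in Theorem 4):

```python
import math

def grow_avq(queries, rho, eta=0.1):
    """Sketch of the conditionally growing AVQ under the L2 distance.
    Start with the first query as the sole prototype (K = 1); each new
    query either moves its winning prototype toward it (if within the
    vigilance rho) or is adopted as a new prototype. The learning rate
    eta and the exact update form are illustrative assumptions here."""
    prototypes = [list(queries[0])]
    for q in queries[1:]:
        j = min(range(len(prototypes)), key=lambda k: math.dist(q, prototypes[k]))
        if math.dist(q, prototypes[j]) <= rho:
            # move the winner toward the query
            prototypes[j] = [w + eta * (qi - w) for w, qi in zip(prototypes[j], q)]
        else:
            prototypes.append(list(q))  # K grows by one
    return prototypes
```

For instance, a stream of queries concentrated around two regions with a vigilance larger than the within-region spread yields exactly two prototypes.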

Remark 7

To give a physical meaning to the vigilance parameter \(\rho \), we express it through a set of coefficient percentages \(a_{i} \in (0,1)\) and \(a_{\theta } \in (0,1)\) of the value ranges of each dimension \(x_{i}\) of the query center \(\mathbf {x} \in \mathbb {R}^{d}\) and of the radius \(\theta \), respectively. Then, we obtain \(\rho = ||[a_{1}, \ldots , a_{d}] ||_{2} + a_{\theta }\), and if we let \(a_{i} = a_{\theta }= a \in (0,1), \forall i\), the vigilance parameter is rewritten as:

$$\begin{aligned} \rho= & {} a(d^{1/2}+1) \end{aligned}$$

(23)

A high quantization coefficient a value over high-dimensional data results in a low number of prototypes and vice versa.
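As a small worked example of Eq. (23):

```python
import math

def vigilance(a, d):
    """rho = a * (sqrt(d) + 1), Eq. (23): the L2 norm of the coefficient
    vector [a, ..., a] over d dimensions plus the radius coefficient
    a_theta = a."""
    return a * (math.sqrt(d) + 1.0)
```

For example, a quantization coefficient a = 0.1 over d = 4 dimensions gives \(\rho = 0.1(2+1) = 0.3\); increasing a enlarges \(\rho \) and hence coarsens the quantization.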

Let us now focus on the EPE \(\mathcal {H}\) in (22) and liaise with Examples 3 and 4 (Figs. 5 and 6). The objective function \(\mathcal {H}\) is conditioned on the winning query-prototype index \(j = \arg \underset{k}{\min } ||\mathbf {q} - \mathbf {w}_{k}||_{2}\), i.e., it is guided by the VQ \(v(\mathbf {q}) = j\). Our target is to incrementally learn the query-LLM coefficients offset \(y_{j}\) and slope \(\mathbf {b}_{j}\) of the LLM function \(f_{j}\), which are associated with the winning query prototype \(\mathbf {w}_{j} \in \mathbb {Q}_{j}\) for a random query \(\mathbf {q}\).

The update rules for the optimization parameter \(\alpha \) in our SGD-based dual EQE/EPE optimization of the objective functions \(\mathcal {J}\) and \(\mathcal {H}\) are provided in Theorem 4.

Theorem 4

Given a query–answer pair \((\mathbf {q},y)\) and its winning query prototype \(\mathbf {w}_{j}\), the optimization parameter \(\alpha \) converges to the optimal parameter \(\alpha ^{*}\) if it is updated as:

Proof

We adopt SGD to minimize both (21) and (22). \(\mathcal {J}\) and \(\mathcal {H}\) are minimized by updating \(\alpha = \{y_{k},\mathbf {w}_{k},\mathbf {b}_{k}\}\) in the negative direction of their sum of gradients. We obtain the set of update rules:

The provided training Algorithm 1 processes random query–answer pairs one at a time from a training set \(\mathcal {T} = \{(\mathbf {q},y)\}\); see also Fig. 4. In the initialization phase of the training algorithm, there is only one query prototype \(\mathbf {w}_{1}\), i.e., \(K = 1\), which corresponds to the first query, while the associated query-LLM coefficients \(\mathbf {b}_{1}\) and \(y_{1}\) are initialized to \(\mathbf {0}\) and 0, respectively. For the t-th random pair \((\mathbf {q}_{t},y_{t})\) and onwards with \(t \ge 2\), the algorithm either updates the closest prototype to \(\mathbf {q}_{t}\) (out of the K prototypes so far) if their \(L_{2}\) distance is less than \(\rho \), or adds a new prototype, increasing K by one, and then initializes the new LLM coefficients. The algorithm stops updating the query prototypes and query-LLM coefficients at the first step t where \(\max (\varGamma _{t}^{\mathcal {J}},\varGamma _{t}^{\mathcal {H}}) \le \gamma \). From that time onwards, the algorithm returns the parameter set \(\alpha \) and no further modification is performed, i.e., the algorithm has converged.
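A compact sketch of this training pass follows; the LMS-style gradient steps and the learning rate are illustrative stand-ins for the update rules of Theorem 4, and the stopping criterion on \(\varGamma _{t}^{\mathcal {J}},\varGamma _{t}^{\mathcal {H}}\) is omitted for brevity:

```python
import math

def train(pairs, rho, eta=0.05):
    """Sketch of the training loop: one pass over query-answer pairs,
    combining the growing AVQ with SGD-style updates of each query-LLM's
    intercept y_k and slope b_k. The update rules here are standard
    LMS-style assumptions, not the exact rules of Theorem 4."""
    q0, y0 = pairs[0]
    W = [list(q0)]                 # query prototypes w_k
    Y = [y0]                       # intercepts y_k
    B = [[0.0] * len(q0)]          # slopes b_k
    for q, y in pairs[1:]:
        j = min(range(len(W)), key=lambda k: math.dist(q, W[k]))
        if math.dist(q, W[j]) <= rho:
            W[j] = [w + eta * (qi - w) for w, qi in zip(W[j], q)]
            err = y - (Y[j] + sum(b * (qi - w) for b, qi, w in zip(B[j], q, W[j])))
            Y[j] += eta * err
            B[j] = [b + eta * err * (qi - w) for b, qi, w in zip(B[j], q, W[j])]
        else:                      # new prototype: K grows by one
            W.append(list(q)); Y.append(y); B.append([0.0] * len(q))
    return W, Y, B
```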

Through the incremental training of the parameters set \(\alpha = \{(y_{k},\mathbf {b}_{k}, \mathbf {w}_{k})\}_{k=1}^{K}\), each query-LLM function \(f_{k}\) has estimated its parameters. The PLR approximation error bound for the LLM function \(f_{k}\) around the query prototype \(\mathbf {w}_{k}\) depends on the dimension d and curvature (second derivative) of the function \(f_{k}\) in the query subspace \(\mathbb {Q}_{k}\) as provided in Theorem 5. The approximation depends on the resolution of quantization K. Notably, the more prototypes K, the better the approximation of the query function f is achieved by query-LLMs, as proved in Theorem 6.

Proof

Upon a random query and the quantization of the query space \(\mathbb {Q}\) into K LLMs, each with a query prototype \(\mathbf {w}_{k}\), the derived approximation error of f through all \(f_{k}, k \in [K]\), is

where \(\lambda _{k}\) is the conditional approximation error bound given that \(\mathbf {q}\) is assigned to prototype \(\mathbf {w}_{k}\) and \(P(\mathbf {w}_{k})\) is the prior probability of \(\mathbf {w}_{k}\). Provided that all \(\mathbf {w}_{k}\) are equiprobable for being assigned to queries, i.e., \(P(\mathbf {w}_{k}) = \frac{1}{K}, \forall k\), then:

6 Data and query functions approximation and prediction

In this section we propose an algorithm that uses the query-LLM functions to approximate the PLR data function g over a data subspace given the corresponding data-LLM functions and an algorithm to predict the aggregate answer y of an unseen query based on the query-LLM functions.

Our algorithms entail the use of the previously trained query-LLM functions from the training query–answer pairs in the training set \(\mathcal {T}\) to predict aggregate answers to unseen queries Q1 and Q2 from the test set \(\mathcal {V}\); see also Fig. 4. We adopt the principle of nearest-neighbors regression for prediction [32]. The notion of neighborhood here is materialized by the overlapping of an unseen query with the query prototypes in the quantized space \(\mathbb {Q}\) (see Example 4, Fig. 6). By Definition 7, the queries \(\mathbf {q} = [\mathbf {x}, \theta ]\) and \(\mathbf {q}' = [\mathbf {x}', \theta ']\) overlap if the condition \(\mathcal {A}(\mathbf {q},\mathbf {q}') = \texttt {TRUE}\) holds. To quantify a degree of overlapping between those queries, represented as hyper-spheres in the \((d+1)\)-dim. space, we require that the two spheres partially intersect. Let us define the ratio between the \(L_{2}\) distance of the centers of the data subspaces \(\mathbb {D}(\mathbf {x},\theta )\) and \(\mathbb {D}(\mathbf {x}', \theta ')\) over the sum of their radii, i.e., \(\frac{||\mathbf {x}-\mathbf {x}' ||_{2}}{\theta +\theta '}\). This ratio takes values in [0, 1] in the case of overlapping, with a value of unity when both spheres just meet each other. In the concentric case, the degree of overlapping should also take into consideration the remaining area from this perfect inclusion. We define the degree of overlapping for two queries as the normalized ratio \(\delta (\mathbf {q},\mathbf {q}') \in [0,1]\):
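The raw ratio and the induced overlap test can be sketched as follows (the normalized degree \(\delta (\mathbf {q},\mathbf {q}')\), which also handles the concentric case, is not reproduced here):

```python
import math

def center_radius_ratio(x, theta, x2, theta2):
    """Ratio ||x - x'||_2 / (theta + theta') between two range queries
    viewed as spheres: 0 for concentric spheres, 1 when they just touch,
    and > 1 when they are disjoint."""
    return math.dist(x, x2) / (theta + theta2)

def queries_overlap(x, theta, x2, theta2):
    """The spheres (partially) intersect iff the ratio is at most 1."""
    return center_radius_ratio(x, theta, x2, theta2) <= 1.0
```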

In the case where \(\mathcal {W}(\mathbf {q}) \equiv \emptyset \), we extrapolate the similarity of the query \(\mathbf {q}\) with the closest query prototype to associate the answer with the estimation \(\hat{y}\) derived only from the query-LLM function \(f_{j}(\mathbf {q},\theta )\) with the query prototype \(\mathbf {w}_{j}\) being closest to the query \(\mathbf {q}\). Through this projection, i.e., \(j = \arg \min _{k \in [K]}||\mathbf {q}-\mathbf {w}_{k}||_{2}\), we get the local slope and intercept of the local mapping of query \(\mathbf {q}\) onto the aggregate answer y.

(Case 2) be contained or contain a data subspace \(\mathbb {D}_{k}\), or

(Case 3) be outside of any data subspace \(\mathbb {D}_{k}\).

In Cases 1 and 2, the algorithm returns the derived data-LLMs of the data function g interpolating over the overlapping data subspaces, using the corresponding query-LLMs, as proved in Theorem 3. In Case 3, the best possible linear approximation of the data function g is returned through the extrapolation of the data subspace \(\mathbb {D}_{k}\) whose query prototype \(\mathbf {w}_{k}\) is closest to the query \(\mathbf {q}\). For Cases 1 and 2, we exploit the neighborhood \(\mathcal {W}(\mathbf {q})\) of the query \(\mathbf {q} = [\mathbf {x},\theta ]\). For Case 3, we select the data-LLM function, which corresponds to the query-LLM function with the closest query prototype \(\mathbf {w}_{j}\) to query \(\mathbf {q}\) since, in this case, \(\mathcal {W}(\mathbf {q}) \equiv \emptyset \).

The PLR approximation of the data function \(g(\mathbf {x})\) over the data subspace \(\mathbb {D}(\mathbf {x},\theta )\) involves both the radius \(\theta \) and the query center \(\mathbf {x}\) using their similarity with radius \(\theta _{k}\) and the point \(\mathbf {x}_{k}\), respectively, from the \(\mathcal {W}(\mathbf {q})\). For the Cases 1 and 2, the set of the data-LLMs for a PLR approximation of data function \(g(\mathbf {x})\) is provided directly from those query-LLMs \(f_{k}\), whose query prototype \(\mathbf {w}_{k} \in \mathcal {W}(\mathbf {q})\). That is, for \(\mathbf {x} \in \mathbb {D}_{k}(\mathbf {x}_{k},\theta _{k})\), we obtain:

The PLR approximation of the data function is shown in Algorithm 3, which returns the set of the data-LLM functions \(\mathcal {S}\) defined over the data subspace \(\mathbb {D}(\mathbf {x},\theta )\) for a given unseen query \(\mathbf {q} = [\mathbf {x},\theta ]\). Note that depending on the query radius \(\theta \) and the overlapping neighborhood set \(\mathcal {W}(\mathbf {q})\), we obtain: \(1 \le |\mathcal {S}| \le K\), where \(| \mathcal {S}|\) is the cardinality of the set \(\mathcal {S}\).

Remark 8

Figure 8(upper) shows how the data function \(u = g(x)\) is accurately approximated by \(K=6\) data-LLMs (green interpolating local lines) compared with the global linear regression function (REG in red) over the data subspace \(\mathbb {D}(0.5,0.5)\). We also illustrate the K linear models derived by the actual PLR data approximation algorithm [44], i.e., the best possible PLR data approximation should we have access to that data subspace, which corresponds to OPT\(_{K}\) in (3). Unlike our model, PLR needs access to the data and is thus very expensive; specifically, it involves a forward/backward iterative approach to produce the multiple linear models [44]. Our model, instead, incrementally derives the data-LLMs based on the optimization problems in (21) and (22). Note that the derived data-LLMs are highly accurate.

7 Convergence analysis and complexity

7.1 Global convergence analysis

In this section we show that our stochastic joint optimization algorithm is asymptotically stable. Concerning the objective function \(\mathcal {J}\) in (21), the query prototypes \(\mathbf {w}_{k} = [\mathbf {x}_{k},\theta _{k}]\) converge to the centroids (mean vectors) of the query subspaces \(\mathbb {Q}_{k}\). This convergence reflects the capability of our proposed AVQ algorithm to partition the query space into the prototypes of the query subspaces. The query subspaces naturally represent the (hyper)spheres of the data subspaces that the analysts are interested in, as accessed via their query centers \(\mathbf {x}_{k}\) and radii \(\theta _{k}\), \(\forall k\).

Concerning the objective function \(\mathcal {H}\) in (22), the approximation coefficients (slope and intercept) in Theorem 3 converge, too. This convergence refers to the linear regression coefficients that would have been derived had we been able to fit a linear regression function over each data subspace \(\mathbb {D}_{k}\), given access to the data.

We provide two convergence theorems for the coefficients \(y_{k}\) and \(\mathbf {b}_{k}\) of the query-LLM \(f_{k}\). Firstly, we focus on the aggregate answer prediction \(y = y_{k} + \mathbf {b}_{k}(\mathbf {q}-\mathbf {w}_{k})^{\top }\). Given that the query prototype \(\mathbf {w}_{k}\) has converged, i.e., \(\mathbf {w}_{k} = \mathbb {E}[\mathbf {q}|\mathbb {Q}_{k}]\) from Theorem 7, then the expected aggregate value \(\mathbb {E}[y|\mathbb {Q}_{k}]\) converges to the \(y_{k}\) coefficient of the query-LLM \(f_{k}\). This also reflects our assignments of the statistical mapping \(\mathcal {F}\) of the local expectation query \(\mathbf {w}_{k}\) to the mean of the query-LLM \(f_{k}\), i.e., \(f_{k}(\mathbb {E}[\mathbf {x}_{k}|\mathbb {Q}_{k}],\mathbb {E}[\theta _{k}|\mathbb {Q}_{k}]) = \mathbb {E}[y|\mathbb {Q}_{k}]\). This refers to the local associative convergence of coefficient \(y_{k}\) given a query \(\mathbf {q} \in \mathbb {Q}_{k}\). In other words, the convergence of the query subspace enforces also convergence in the output domain.

Proof

Based on the law of total expectation, for the LLM coefficient \(\mathbf {b}_{k}\) we obtain \(\mathbb {E}[\varDelta \mathbf {b}_{k}] = \int _{\mathbb {R}}\mathbb {E}[\varDelta \mathbf {b}_{k}| y]p(y)dy\). By using the update rule in Theorem 4, we write for the conditional expectation term \(\mathbb {E}[\varDelta \mathbf {b}_{k}| y]\):

By solving \(\mathbb {E}[\varDelta \mathbf {b}_{k}] = \mathbf {0}\), which implies that \(\mathbf {b}_{k}\) is constant with probability 1, we obtain that \(\mathbf {b}_{k} = \varvec{\beta }_{k}\). This refers to the population normal equations for the multivariate linear regression model within the subspace \(\mathbb {Q}_{k}\times \mathbb {R}\). \(\square \)

7.2 Partial convergence analysis

The entire statistical learning model runs in two phases: the training phase and the prediction phase. In the training phase, the query-LLM prototypes \((\mathbf {w}_{k},\mathbf {b}_{k}, y_{k})\), \(k \in [K]\), are updated upon the observation of a query–answer pair \((\mathbf {q},y)\) until their convergence w.r.t. the global stopping criterion in (24). In the prediction phase, the model proceeds with the mean-value prediction of the aggregate answer \(\hat{y}\), the PLR data approximation of the data function g and the output data value prediction \(\hat{u}\), without execution of any incoming query after convergence at \(t^{*}\). The major requirement for the model to transit from the training to the prediction phase is the triggering of the global stopping criterion at \(t^{*}\) w.r.t. a fixed \(\gamma >0\) convergence threshold.

Let us now provide an insight into this global criterion. Model convergence means that, on average over all the trained query-LLM prototypes, the improvement brought by a new incoming query–answer pair is no longer as significant as it was at the early stage of the training phase. The rate of updating the prototypes, which is reflected by the norms of the difference vectors between \((\mathbf {w}_{k,t},\mathbf {b}_{k,t}, y_{k,t})\) and \((\mathbf {w}_{k,t-1},\mathbf {b}_{k,t-1}, y_{k,t-1})\) at observations t and \(t-1\), respectively, decreases as the number of query–answer pairs increases, i.e., \(t \rightarrow \infty \); refer also to the convergence analysis in Sect. 7.
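Tracking such an update magnitude between consecutive steps can be sketched as follows (an illustrative max-norm variant; the exact aggregation used in criterion (24) may differ):

```python
import math

def update_magnitude(prev, curr):
    """Largest L2 change across all query-LLM parameter tuples between
    steps t-1 and t; training may stop once this stays below a
    convergence threshold gamma. Illustrative max-norm variant."""
    return max(math.dist(p, c) for p, c in zip(prev, curr))

# Hypothetical parameter snapshots at steps t-1 and t
prev = [(0.50, 0.10), (1.00, 0.20)]
curr = [(0.50, 0.11), (1.00, 0.20)]
```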

In a real-world setting, however, we cannot obtain an infinite number of training pairs to ensure convergence. Instead, we are sequentially provided a finite number of training pairs \((\mathbf {q}_{t},y_{t})\) from a finite training set \(\mathcal {T}\). We obtain model convergence given that there are enough pairs in the set \(\mathcal {T}\) such that the criterion in (24) is satisfied. More interestingly, we have observed that some of the query-LLM prototypes, say \(L < K\) of them, converge with fewer training query–answer pairs than the \(|\mathcal {T}|\) provided pairs. Specifically, for those L prototypes, which represent certain data subspaces \(\mathbb {D}_{\ell }\) and query subspaces \(\mathbb {Q}_{\ell }\), \(\ell =1, \ldots , L\), the convergence criterion \(\max \{\varGamma _{\ell }^{\mathcal {J}},\varGamma _{\ell }^{\mathcal {H}}\}_{t} \le \gamma \) holds true for some \(t < t^{*}\), where \(t^{*}\) corresponds to the last observed training pair at which the entire model has globally converged, given a fixed \(\gamma \) convergence threshold. In this case, we introduce the concept of partial convergence: at least a subset of the query-LLM prototypes has already converged w.r.t. \(\gamma \) at an earlier stage than the entire model (i.e., the entire set of parameters). Interestingly, those L query-LLM prototypes transit from their training phase to the prediction phase. The partial convergence on those data subspaces is due to the fact that relatively more queries were issued to those data subspaces than to others up to the t-th observation with \(t < t^{*}\). Moreover, by construction of our model, only a relatively small subset of query-LLM prototypes is required for mean-value prediction and PLR data approximation (refer to the overlapping set \(\mathcal {W}\) in Sect. 6).
Hence, exploiting the flexibility of partial convergence, we can proceed with prediction and data approximation for incoming queries issued over those data subspaces whose corresponding query-LLM prototypes have partially converged, while the entire model is still in its training phase, i.e., it has not yet globally converged.
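The per-prototype convergence test described above can be sketched in a few lines; this is a minimal Python sketch, assuming each prototype's parameters \((\mathbf {w}_{k},\mathbf {b}_{k},y_{k})\) are flattened into a single vector and convergence is declared when the norm of the update drops below the threshold \(\gamma \) (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def split_by_convergence(prev_params, curr_params, gamma=0.01):
    """Split prototype indices into (converged, unconverged) by comparing
    the norm of each prototype's parameter change against the threshold
    gamma. prev_params/curr_params: lists of 1-D arrays, one per
    query-LLM prototype, at observations t-1 and t."""
    converged, unconverged = [], []
    for k, (p0, p1) in enumerate(zip(prev_params, curr_params)):
        if np.linalg.norm(p1 - p0) <= gamma:
            converged.append(k)
        else:
            unconverged.append(k)
    return converged, unconverged
```

The converged indices correspond to prototypes that may transit to the prediction phase early, while the rest keep training.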

The advantage of this methodology is that we deliver predicted answers to the analysts’ queries without imposing the execution delay for those queries. Evidently, we obtain the flexibility either to proceed with the query execution after the prediction, to further refine the converged data subspace, or not. In both options, the analysts ‘do not need to wait’ for the system to first execute the query before being delivered the answers. This motivated us to introduce a progressive predictive analytics or intermediate phase, where some parts of the model can, after their local convergence, provide predicted answers to the analysts without waiting for the entire model to converge.

The research challenge in supporting the progressive analytics phase arises when some of the involved query-LLM prototypes have not yet converged, while others have locally converged. Specifically, assume that at the t-th observation (with \(t < t^{*}\)) there are L query-LLM prototypes that have converged and the query \(\mathbf {q}_{t} = (\mathbf {x}_{t},\theta _{t})\) arrives at the system (note: \(L < K\) at observation t). The overlapping set \(\mathcal {W}(\mathbf {q}_{t})\) consists of \(\ell \le L\) query prototypes \(\mathbf {w}_{i}\), \(i = 1, \ldots , \ell \), which have converged, and \(\kappa \le K-L\) query prototypes \(\mathbf {w}_{j}\), \(j = 1, \ldots , \kappa \), which have not yet converged, i.e., \(\mathcal {W}(\mathbf {q}_{t}) = \{\mathbf {w}_{i}\} \cup \{\mathbf {w}_{j}\}\). In this case, the mean-value prediction and the PLR data approximation over the data subspace \(\mathbb {D}(\mathbf {x}_{t},\theta _{t})\) involve \(\ell +\kappa \) prototypes such that:
\(\mathcal {W}(\mathbf {q}_{t}) = \mathcal {C}(\mathbf {q}_{t}) \cup \mathcal {U}(\mathbf {q}_{t}),\)

where \(\mathcal {C}(\mathbf {q}_{t})\) denotes the converged and \(\mathcal {U}(\mathbf {q}_{t})\) the not-yet-converged query prototypes of the overlapping set,
with \(\ell = |\mathcal {C}(\mathbf {q}_{t})|\) and \(\kappa = |\mathcal {U}(\mathbf {q}_{t})|\). We adopt a convergence voting/consensus scheme for supporting this intermediate phase between the training and prediction phases, in light of delivering either predicted or actual answers to the analysts.

Case A If the consensual ratio \(\frac{\ell }{\ell + \kappa } \ge r\), i.e., more than a fraction r of the query prototypes in \(\mathcal {W}(\mathbf {q}_{t})\) have locally converged, with \(r \in (0.5, 1)\), then two options are available:

Case A.I The model predicts and delivers to the analysts the answer based only on those \(\ell \) query prototypes that have converged and, then, executes the query in order to update the \(\kappa \) not-yet-converged query prototypes, aligning them with the model convergence mode. In this case, the analysts are delivered a predicted answer whose degree of confidence is regulated through the consensual ratio r. The mean-value prediction and PLR data approximation are achieved as described in Algorithms 2 and 3 by replacing \(\mathcal {W}(\mathbf {q})\) with the locally converged query prototypes \(\mathcal {C}(\mathbf {q})\) in (32). After the query execution, the query-LLM prototypes from the unconverged set \(\mathcal {U}(\mathbf {q})\) in (32) are updated as described in Algorithm 1. Obviously, if the consensual ratio \(\frac{\ell }{\ell + \kappa } = 1\), then there is no such intermediate phase.

Case A.II The model predicts and delivers to the analysts the answer based only on those \(\ell \) query prototypes that have converged, and does not execute the query; thus, no update is performed for the \(\kappa \) unconverged query prototypes. The mean-value prediction and PLR data approximation are achieved as described in Algorithms 2 and 3 by replacing \(\mathcal {W}(\mathbf {q})\) with the locally converged query prototypes \(\mathcal {C}(\mathbf {q})\) in (32). This obviously delays the global convergence and reduces the number of queries executed for convergence. This option is preferable only when most of the incoming queries focus on specific data subspaces and not on the entire data space. In other words, there is no point in waiting for the entire model to globally converge in order to transit from the training phase to the prediction phase, if most of the queries are issued over very specific data subspaces. In the extreme case, the model could significantly delay its convergence if more than 50% of the query prototypes are involved in the overlapping sets of all incoming queries. To alleviate this, our model creates new prototypes (incrementally) only when there is at least some interest in a specific data subspace, as discussed in Sect. 5, adopting the principles of adaptive resonance theory [30].

Case B Otherwise, i.e., if the consensual ratio \(\frac{\ell }{\ell + \kappa } < r\), the model acts as usual in the training phase: it first executes the query and delivers the actual answer to the analyst, and then, based on this actual answer, it updates the prototypes as discussed in Sect. 5.
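The voting scheme of Cases A.I, A.II and B reduces to a small decision routine; a hedged Python sketch, where the `refine` flag selects between options A.I and A.II (names and return labels are illustrative):

```python
def consensus_action(n_conv, n_unconv, r=0.7, refine=True):
    """Decide how to serve a query in the intermediate phase.
    Case A (ratio >= r): predict from the converged prototypes only,
    then either execute the query to update the unconverged ones (A.I)
    or skip execution entirely (A.II).
    Case B (ratio < r): execute the query and train as usual."""
    ratio = n_conv / (n_conv + n_unconv)
    if ratio >= r:
        return "predict+update" if (refine and n_unconv > 0) else "predict-only"
    return "execute+train"
```

Note that when the ratio equals 1 there is nothing left to update, so the routine degenerates to pure prediction, matching the remark in Case A.I.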

Algorithm 4 shows the partial convergence methodology of the model transition from the training phase to the intermediate phase, and then to the prediction phase.

For any observation \(t < t_{\top }\) the model is purely in the training phase, while for any observation \(t_{\top } \le t < t^{*}\) the model is in the intermediate phase, i.e., the prediction and/or training phase depending on the consensual ratio at the t-th observation (Cases A and/or B). At \(t \ge t^{*}\) the model transits to the pure prediction phase. Figure 9 illustrates the activation of the training, intermediate and prediction phases over the observation time axis and the landmarks \(t_{\top }\) and \(t^{*}\). The landmark \(t_{\top }\) denotes the minimum number of training pairs the model requires before delivering predicted and/or actual answers to the analysts w.r.t. Cases A and B, while only predicted answers are delivered after \(t^{*}\) training pairs.
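The phase transitions over the observation axis can be written down directly; a trivial sketch assuming the two landmarks are known:

```python
def model_phase(t, t_top, t_star):
    """Map observation index t to the model's phase, given the landmarks
    t_top (first partial convergence) and t_star (global convergence)."""
    if t < t_top:
        return "training"
    if t < t_star:
        return "intermediate"
    return "prediction"
```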

Remark 9

The prediction performance of the model in the intermediate phase is at most that of the model in the pure prediction phase. This is attributed to the predicted answers being based on partial convergence w.r.t. the consensual threshold r, where only the converged fraction (at least r) of the query-LLM prototypes from the overlapping set \(\mathcal {W}(\mathbf {q})\) is used for prediction given an unseen query \(\mathbf {q}\). The prediction performance is a non-decreasing function of the number of observations t with \(t_{\top } \le t \le t^{*}\), as will be shown in our performance evaluation in Sect. 8.

The landmarks \(t_{\top }\) and \(t^{*}\) for model transition from the training to the intermediate phase, and from the intermediate to the prediction phase, respectively

7.3 Computational complexity

In this section, we report on the computational complexity of our model during the training and prediction phases. In the global convergence mode, the model ‘waits’ for the triggering of the criterion in (24) to transit from the training to the prediction phase. Under SGD over the minimization objectives \(\mathcal {J}\) and \(\mathcal {H}\), with the hyperbolic learning schedule in (7), our model requires \(O(1/\gamma )\) training pairs [15] to reach the convergence threshold \(\gamma \). This means that the residual difference between the objective function value \(\mathcal {J}^{t^{*}}\) after \(t^{*}\) pairs and the optimal value \(\mathcal {J}^{*}\), i.e., with the optimal query-LLM parameters, asymptotically decreases exponentially, also known as linear convergence [25]. In this mode, there is a clear separation between the training and prediction phases, while the upper bound on the expected excess difference \(\mathbb {E}[\mathcal {J}^{t}-\mathcal {J}^{*}]\) after t training pairs is \(O\left( \sqrt{\frac{\log t}{t}}\right) \) [52], given the hyperbolic learning schedule in (7).

In the prediction phase, which is the operational mode of our model, given a mean-value query Q1 or a linear regression query Q2, we require O(dK) time to calculate the neighborhood set \(\mathcal {W}\) and deliver the query-LLM functions, respectively, i.e., independent of the data size, thus achieving scalability. We also require O(dK) space to store the query prototypes and the query-LLM coefficients. The derivation of the data-LLMs is then O(1), given that we have identified the query-LLMs for a given linear regression query.
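The O(dK) cost is simply that of scanning the K d-dimensional prototypes once per query. A sketch of that scan, assuming for illustration that the overlapping set collects the prototypes closest to the query center (the paper's Eq. (27) may use a different membership rule, and `top` is an illustrative parameter, not from the paper):

```python
import numpy as np

def overlapping_set(q_center, prototypes, top=3):
    """O(dK) sketch of retrieving the overlapping set W(q): compute the
    Euclidean distance from the query center to each of the K d-dim.
    prototypes (one pass, O(dK)) and keep the closest `top` indices."""
    dist = np.linalg.norm(prototypes - q_center, axis=1)  # K distances
    return np.argsort(dist)[:top]
```

No data access is involved: only the K prototypes (O(dK) space) are touched, which is what makes the prediction phase independent of the dataset size.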

8 Performance evaluation

8.1 Performance metrics

The proposed methodology deals with two major statistical learning components: prediction of the aggregate answer and data output, and data function approximation over data subspaces. To evaluate the performance of our model in light of these components, we assess the model's predictability and goodness of fit, respectively.

Predictability refers to the capability of a model to predict an output given an unseen input, i.e., an input–output pair not provided during the model’s training phase. Measures of prediction focus on the differences between predicted values and actually observed values. Goodness of fit describes how well a model fits a set of observations provided during the model’s training phase. It provides an understanding of how well the selected independent variables (input) explain the variability in the dependent (output) variable. Measures of goodness of fit summarize the discrepancy between the actual/observed values during training and the values approximated under the model in question.

We compare our statistical methodology against its ground-truth counterparts: the multivariate linear regression model over data subspaces, hereafter referred to as REG, and the piecewise linear regression model (PLR) over data subspaces, both of which have full access to the data. Note that the PLR data approximation is the optimal multiple linear modeling over data subspaces we can obtain, because it is constructed by accessing the data. Hence, we demonstrate how effectively our data-LLMs approximate the ground-truth data function g and the optimal PLR data approximation. Specifically, we compare against the REG model using the PostgreSQL DMS and MATLAB, and against the PLR model using the ARESLab (MATLAB) toolbox for building PLR models based on the multivariate adaptive regression splines method in [44]. We show that our model is scalable and efficient, as (or even more) accurate than the REG model w.r.t. predictability and goodness of fit, and close to the accuracy obtained by the optimal PLR model. Our model is dramatically more scalable and efficient as, unlike the REG and PLR models, it does not need access to the data, yielding up to six orders of magnitude faster query execution.

8.1.1 Predictability

Predictability in the query space The Mean-Value Accuracy (A1 metric) refers to the prediction of the average value \(\hat{y}\) given an unseen Q1 query \(\mathbf {q} = [\mathbf {x},\theta ]\). Based on the EPE in (5), the A1 metric is the root-mean-square error (RMSE):
\(e = \left( \frac{1}{M}\sum _{i=1}^{M}(y_{i} - \hat{y}_{i})^{2}\right) ^{1/2},\) with the predicted answer \(\hat{y}_{i}\) derived from the query-LLM functions \(f_{k}\), \(k \in \mathcal {W}(\mathbf {q}_{i})\),
where \(\mathcal {W}(\mathbf {q})\) is the overlapping set for query \(\mathbf {q}\) defined in (27). Note that the query-LLM function \(f_{k}(\mathbf {x},\theta _{k})\) provides the intercept \(y_{k}\) and slope \(\mathbf {b}_{X,k}\) over the data input space by setting the radius \(\theta = \theta _{k}\) in the function \(f_{k}\). For a given data input \(\mathbf {x}\), the A2 metric is the RMSE of the predicted output \(\hat{u}\) over M unseen inputs:
\(e' = \left( \frac{1}{M}\sum _{i=1}^{M}(u_{i} - \hat{u}_{i})^{2}\right) ^{1/2}, \quad s = \frac{\sum _{i=1}^{M}(u_{i} - \hat{u}_{i})^{2}}{\sum _{i=1}^{M}(u_{i} - \bar{u})^{2}}, \quad R^{2} = 1 - s,\)

which define the A2 metric, the Fraction of Variance Unexplained (FVU) and the Coefficient of Determination (CoD),
respectively. The FVU metric indicates how closely the approximation of the data function g over a data subspace \(\mathbb {D}\) matches the actual function g over that subspace. If the FVU value is greater than 1, the explanatory input variable \(\mathbf {x}\) does not convey any information about the output u, in the sense that the predictions \(\hat{u}\) do not covary with the actual output u. In this case, the data approximation function is a bad fit. The approximation is considered good when the FVU metric assumes a low value, less than 1. Given an unseen query \(\mathbf {q}\) over the data subspace \(\mathbb {D}(\mathbf {x},\theta )\), we measure the FVU and CoD metrics for the REG and PLR models, and the average FVU value \(s = \frac{1}{|\mathcal {S}|}\sum _{\ell =1}^{|\mathcal {S}|}s_{\ell }\) of the FVUs \(s_{\ell }\) (and CoDs) corresponding to the set of data-LLM functions \(\mathcal {S}: |\mathcal {S}| \ge 1\) derived by our Algorithm 3. In our experimental evaluation and comparative assessment of our model against the PLR and REG models (pair-wise), for each performance metric we adopted the paired-sample two-tailed Student t test with a 95% confidence interval, i.e., significance level 0.05.
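The three accuracy metrics admit direct implementations; a minimal sketch using the standard RMSE, FVU and CoD definitions (the paper's exact estimators may differ in notation):

```python
import numpy as np

def rmse(y, y_hat):
    """Root-mean-square error over M (answer, prediction) pairs."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def fvu(u, u_hat):
    """Fraction of Variance Unexplained: residual sum of squares
    over total sum of squares around the mean."""
    u, u_hat = np.asarray(u, float), np.asarray(u_hat, float)
    return float(np.sum((u - u_hat) ** 2) / np.sum((u - u.mean()) ** 2))

def cod(u, u_hat):
    """Coefficient of Determination R^2 = 1 - FVU."""
    return 1.0 - fvu(u, u_hat)
```

A predictor that always returns the sample mean attains FVU = 1 (CoD = 0), which is why FVU values above 1 signal that the approximation is worse than predicting the mean.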

8.2 Experimental setup

8.2.1 Real and synthetic datasets

Our goal is to evaluate accuracy, in terms of predictability and goodness of fit, as well as efficiency and scalability over real and synthetic datasets. For accuracy, using the A1, A2, FVU, and CoD metrics, we intentionally sought multivariate real data functions g that exhibit extreme nonlinearity in many data subspaces. For this reason, to assess the A1 metric for Q1 queries, the A2 metric for data output predictions, and the FVU and CoD metrics for Q2 queries, we used two real datasets, R1 from [45] and R3 from [16], and a synthetic dataset referred to as R2.

Real datasets The real dataset R1 consists of 6-dim. feature vectors corresponding to the concentration levels of 6 gases, namely Ethanol (E1), Ethylene (E2), Ammonia (A1), Acetaldehyde (A2), Acetone (A3), and Toluene (T), derived from chemical sensors. The sensor measurements of R1 were gathered over 36 months in a gas delivery platform facility at the ChemoSignals Laboratory in the BioCircuits Institute (BCI), University of California. We expand the size of R1 by adding extra 6-dim. vectors with Gaussian noise; in total, the R1 dataset thus contains \(15 \cdot 10^{6}\) multi-dimensional data vectors of gas concentration levels. With the R1 dataset we wished to delve into accuracy issues, and this dataset was chosen because its data exhibit nonlinear relationships among features. All d-dim. real-valued vectors are scaled and normalized in [0,1] (\(d \in \{1, \ldots , 6\}\)), with significant nonlinear dependencies among the features, evidenced by a high FVU = 4.68. This indicates that a linear approximation over the entire data space is of no avail, making this a challenging dataset for our approach. Figure 10 shows the R1 scatter plot matrix for all gas concentrations (before scaling and normalization), depicting the dependencies between gases and the corresponding histograms of each dimension. We obtain significant correlations among many gases; indicatively, E1 with A1, A2 and A3 with Pearson correlation coefficients 0.41, 0.23, and 0.98 (\(p < 0.05\)), respectively, and E2 with A2 with correlation 0.36 (\(p < 0.05\)). Further analyzing the R1 dataset, the first three Principal Components (PCs) explain 99.73% of the total variance (73.57%, 23.94%, and 2.22%, respectively) and are used for the Q1 and Q2 analytics queries (prediction of the mean-value and model fitting).

The R3 real dataset contains environmental sensor data used for data-driven predictive models of the energy use of appliances [16]. The data include measurements of temperature and humidity sensors from a wireless network in a house located in Stambruges, Belgium. The sensors read contextual data every 10 min for about 4.5 months. The house temperature and humidity conditions were monitored with an in-house ZigBee wireless sensor network built with XBee radios, Atmega328P micro-controllers and DHT-22 sensors. The digital DHT-22 sensors have an accuracy of ± 0.5 Celsius for temperature and ± 1% for relative humidity. The environmental parameters are: temperature in kitchen area (T1), humidity in kitchen area (H1), temperature in living room area (T2), humidity in living room area (H2), temperature in laundry room area (T3), and humidity in laundry room area (H3). The real dataset R3 consists of 6-dim. feature vectors corresponding to the above-mentioned six contextual parameters. We expand the size of R3 by adding extra 6-dim. vectors with Gaussian noise; in total, the R3 dataset thus contains \(10 \cdot 10^{6}\) multi-dimensional data vectors of temperature and relative humidity in different areas of the house. All d-dim. real-valued vectors are scaled and normalized in [0,1] (\(d \in \{1, \ldots , 6\}\)), with significant nonlinear dependencies among the features, evidenced by an FVU = 7.32, indicating that a single linear approximation over the entire data space is not an option. Figure 11 shows the R3 scatter plot matrix for all dimensions (before scaling and normalization), along with their dependencies and histograms. The first four Principal Components (PCs) explain 99.08% of the total variance (67.82%, 19.60%, 9.29%, and 2.37%, respectively) and are used for the Q1 and Q2 analytics queries (prediction of the mean-value and model fitting).

R3 Dataset Scatter Plot Matrix: Each cell plots the contextual/environmental parameter level of dimension X against dimension Y, with: temperature in kitchen (T1), humidity in kitchen (H1), temperature in living room (T2), humidity in living room (H2), temperature in laundry room (T3), and humidity in laundry room area (H3); the diagonal plots the histogram of each environmental parameter

Synthetic dataset To further evaluate scalability and efficiency along with accuracy, we use a big synthetic dataset derived from a benchmark function, ensuring significant nonlinearity. The R2 synthetic dataset of input–output pairs \((u,\mathbf {x})\) contains \(10^{10}\) d-dim. real data generated by the Rosenbrock function [46] \(u = g(\mathbf {x})\), \(d \in \{1, \ldots , 6\}\). This is a popular benchmark function for testing nonlinear, gradient-based optimization algorithms. It has a global minimum inside a long, narrow, parabolic-shaped flat valley, where convergence to the global minimum is extremely non-trivial [46]. We adopt the Rosenbrock function \(u = g(\mathbf {x}) = \sum _{i=1}^{d-1}100(x_{i+1}-x_{i}^{2})^{2} + (1-x_{i})^{2}\), \(\mathbf {x} = [x_{1}, \ldots , x_{d}]\), with attribute domain \(|x_{i}| \le 10\) and global minimum 0 at \(x_{i}=1, \forall i\). Obviously, there is no linear dependency among the features in the data space, evidenced by an FVU = 12.45. In addition, we generate \(10^{10}\) vectors by adding noise \(\epsilon \sim \mathcal {N}(0,1)\) to each dimension. For illustration purposes, Fig. 12 shows the R2 dataset of the Rosenbrock function \(u = g(x_{1}, x_{2})\) with two \((d=2)\) variables \(x_{1}\) and \(x_{2}\) and its corresponding PLR approximation through \(K=23\) LLMs.
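The R2 generation can be reproduced at small scale; a sketch that evaluates the Rosenbrock function row-wise and perturbs each input dimension with \(\mathcal {N}(0,1)\) noise (the \(10^{10}\)-vector scale is of course not reproduced here, and the helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def rosenbrock(X):
    """u = sum_{i=1}^{d-1} 100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2, per row."""
    X = np.atleast_2d(np.asarray(X, float))
    return np.sum(100.0 * (X[:, 1:] - X[:, :-1] ** 2) ** 2
                  + (1.0 - X[:, :-1]) ** 2, axis=1)

def make_r2(n, d):
    """n inputs uniform in [-10, 10]^d, each dimension perturbed by N(0, 1)."""
    X = rng.uniform(-10.0, 10.0, size=(n, d)) + rng.normal(0.0, 1.0, size=(n, d))
    return X, rosenbrock(X)
```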

System implementation The real datasets R1 and R3 and the synthetic dataset R2 are stored in a PostgreSQL server with 2x Intel Xeon E5645, 96 GB RAM, and a Seagate Constellation 1 TB HD with 32 MB cache. We use R1, R2, and R3 to assess the query Q1 prediction accuracy of the aggregate answer y (A1 metric), corresponding to the average of the dimensions of the first three and four PCs in R1 and R3, respectively, and to the average data output u of the Rosenbrock function in R2. The PLR approximation of the data function g for Q2 queries over the R1, R2, and R3 datasets is conducted over data dimensions \(d \in \{2,3,5,6\}\) for the metrics FVU, CoD, and the data output prediction accuracy metric A2. For scalability and efficiency, our method is compared against PostgreSQL with a B-tree index on the input vector \(\mathbf {x}\) (\(d \in \{2,5,6\} \) over Q1 queries, \(d =2\) over Q2 queries) and MATLAB (\(d=5\) and \(d=6\)) over Q2 queries, using the regress function on the server and the ARESLab tool for PLR data approximation.

(Upper) R2 Synthetic Dataset of the Rosenbrock function \(u = g(x_{1}, x_{2})\) with two \((d=2)\) variables \(x_{1}\) and \(x_{2}\); (lower) the PLR approximation of the Rosenbrock function (\(d=2\)) through \(K=23\) LLMs and the corresponding contour plot

8.2.2 Query workloads, training and testing sets

Query workload First, we generate certain query workloads to train and test our model. The random queries \(\mathbf {q} = [\mathbf {x},\theta ]\) with centers \(\mathbf {x}\) and radii \(\theta \) over the data subspaces are generated with uniformly distributed centers \(\mathbf {x} \in [0,1]^{d}\) for the R1 and R3 datasets and \(\mathbf {x} \in [-10,10]^{d}\) for the R2 dataset (recall that the data vectors in R1 and R3 are scaled and normalized in [0,1]). That is, the query centers can appear uniformly at random over the entire data space defined by the domains of the datasets' dimensions, \(d \in \{1,\ldots ,6\}\). The query radius \(\theta \) affects the training time and the prediction quality (both predictability and goodness of fit). In brief, a larger (smaller) \(\theta \) implies shorter (longer) training times, as will be elaborated later. For each query, the radius \(\theta \sim \mathcal {N}(\mu _{\theta },\sigma ^{2}_{\theta })\) is generated from a Gaussian distribution with mean \(\mu _{\theta }\) and variance \(\sigma ^{2}_{\theta }\). We set a random radius \(\theta \sim \mathcal {N}(0.1, 0.01)\) for the R1 and R3 datasets and \(\theta \sim \mathcal {N}(1, 0.25)\) for the R2 dataset, covering \(\sim \)20% of each feature's data range; the justification for this setting is discussed later. Section 8.7 provides an extensive experimental and theoretical analysis of the impact of \(\theta \) on the model performance. Based on this setup, we generated random queries \(\mathbf {q}\) issued over the data spaces of the R1, R2 and R3 datasets. We use these queries for training and testing our models as follows.
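This workload generation is straightforward to sketch; here with the R1/R3 setting (centers uniform in \([0,1]^{d}\), \(\theta \sim \mathcal {N}(0.1, 0.01)\), i.e., \(\sigma _{\theta } = 0.1\)). Clipping radii to stay positive is our assumption, since the text does not state how non-positive Gaussian draws are handled:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_queries(n, d, lo, hi, mu_theta, var_theta):
    """Random range queries q = [x, theta]: centers uniform in [lo, hi]^d
    and radii theta ~ N(mu_theta, var_theta), clipped to stay positive."""
    centers = rng.uniform(lo, hi, size=(n, d))
    radii = np.clip(rng.normal(mu_theta, var_theta ** 0.5, size=n), 1e-6, None)
    return centers, radii

# R1/R3 setting: x in [0,1]^d, theta ~ N(0.1, 0.01)
C, T = make_queries(1000, 3, 0.0, 1.0, 0.1, 0.01)
```

For R2, the same call with `lo=-10.0, hi=10.0, mu_theta=1.0, var_theta=0.25` reproduces the second setting.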

Training and testing sets We describe how we generate the training and testing query–answer sets from the above-mentioned query workload methodology. To train our model, we generate training files \(\mathcal {T}\) consisting of random queries \(\mathbf {q}\), as described above, along with their actual aggregate answers y obtained by executing them. To test the performance of our models, we generate different testing files \(\mathcal {V}\), dedicated only to predictions, containing random queries of various sizes: \(|\mathcal {T}| \in \{10^{3}, \ldots , 10^{4}\}\) and \(M = |\mathcal {V}| \in \{10^{3}, \ldots , 2\cdot 10^{4}\}\), respectively. Specifically, the training sets \(\mathcal {T}\) and testing sets \(\mathcal {V}\) contain pairs of queries and answers, i.e., \((\mathbf {q},y)\), where the queries were executed over the R1, R2, and R3 datasets (see also Fig. 4). We adopted cross-validation [51] to evaluate all the predictive models, partitioning the original query–answer sets into a training set to train the models and a test set to evaluate them. We use 10-fold cross-validation, where the original query–answer set is randomly partitioned into 10 equally sized subsets. Of the 10 subsets, a single subset is retained as the validation set for testing the models, and the remaining 9 subsets are used as training data. The cross-validation process is then repeated 10 times (the folds), with each of the 10 subsets used exactly once as the validation data. The 10 results from the folds are then averaged to produce a single estimate of the above-mentioned performance metrics.
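The 10-fold protocol can be sketched in a few lines of standard Python (no external libraries); the fold assignment below strides over a shuffled index:

```python
import random

def k_fold(pairs, k=10, seed=0):
    """Yield (train, test) splits of the query-answer pairs: shuffle once,
    cut into k near-equal folds, and let each fold serve as the test set
    exactly once while the remaining k-1 folds form the training set."""
    idx = list(range(len(pairs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = [pairs[j] for j in folds[i]]
        train = [pairs[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Averaging a performance metric over the 10 resulting (train, test) splits yields the single estimate reported.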

8.3 Model training and convergence

We train our model with the training set \(\mathcal {T}\) and then evaluate and compare it with the ground truths REG (PostgreSQL and MATLAB) and PLR (MATLAB) on the testing set \(\mathcal {V}\), examining the statistical significance in accuracy with a paired-sample two-tailed Student t test at significance level 0.05. Note that the \(\mathcal {T}\) and \(\mathcal {V}\) sets contain explicitly different queries, as discussed in Sect. 8.2.2. The granularity of quantization for our model is tuned by the percentage coefficient \(a \in [0.05, 1]\) involved in the vigilance parameter \(\rho = a(d^{1/2}+1)\) (see Sect. 5 and Remark 7). Specifically, a quantization coefficient \(a=1\) corresponds to the generation of only one prototype, i.e., \(K=1\) (coarse quantization), while any value \(a <1\) (fine-grained quantization) corresponds to a variable number of prototypes \(K>1\), depending on the underlying (unknown) data distribution. The default value of the quantization percentage coefficient is \(a=0.25\) in our experiments. The model adapts its parameters/prototypes in a stochastic manner every time a new query–answer pair \((\mathbf {q},y) \in \mathcal {T}\) is presented. We set the model convergence threshold \(\gamma = 0.01\) in Algorithm 1 to transit from the training phase to the prediction phase. Moreover, the hyperbolic learning rate schedule for the t-th training pair is \(\eta _{t} = (1+t)^{-1}\) [14] for the stochastic training of the prototypes as query–answer pairs \((\mathbf {q},y)\) are retrieved (one at a time) from the training set \(\mathcal {T}\). Notably, to compare fairly against the optimal PLR data approximation, we set its maximum number of automatically discovered linear models (in the forward building phase of the PLR algorithm) equal to K and the generalized cross-validation penalty per PLR knot to 3, as suggested in [44].
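The training settings above reduce to two small formulas: the vigilance \(\rho = a(d^{1/2}+1)\) controlling the quantization granularity, and the hyperbolic schedule \(\eta _{t} = (1+t)^{-1}\) driving the stochastic updates. A sketch of both; the gradient depends on the objectives \(\mathcal {J}\) and \(\mathcal {H}\) and is therefore passed in as an opaque argument here:

```python
import numpy as np

def vigilance(a, d):
    """rho = a * (sqrt(d) + 1): a = 1 yields a single coarse prototype,
    a < 1 yields finer-grained quantization (more prototypes)."""
    return a * (d ** 0.5 + 1.0)

def sgd_step(w, grad, t):
    """One stochastic update of a prototype parameter vector w using the
    hyperbolic learning rate eta_t = (1 + t)^-1."""
    return w - (1.0 / (1.0 + t)) * np.asarray(grad, float)
```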
The proposed statistical methodology requires training and prediction phases. We also introduce the intermediate phase of Sect. 7.2, which is controlled by the consensual threshold \(r = 0.7\) of the partially converged query prototypes involved in queries. That is, the model starts providing predictions in the intermediate phase when 70% of the involved query prototypes have converged. We first examine the global convergence of the training phase of our model and then study the partial convergence variant w.r.t. the landmarks \(t_{\top }\) and \(t^{*}\) and its impact on the required number of training pairs and on accuracy. Table 1 shows the experimental parameters and their range/default values.

Figure 13 examines the termination criterion \(\varGamma = \max (\varGamma ^{\mathcal {J}}, \varGamma ^{\mathcal {H}})\) of the training Algorithm 1 against the number of training pairs \((\mathbf {q},y)\) in the training set \(\mathcal {T}\) for \(d \in \{2,5\}\) over the R1 and R2 datasets with quantization coefficient \(a = 0.25\); similar results are obtained for the R3 dataset. The training phase terminates at the first instance \(t^{*}\) at which \(\varGamma \le \gamma \), obtained for \(|\mathcal {T}| \approx 5300\) training pairs. The total average training time, which includes both the Q1 execution time and the model update time, is (0.41, 0.36, 2.38) h for the R1, R3, and R2 datasets, respectively. This should not be perceived as overhead of our approach, as 99.62% of the training time is devoted to executing the queries over the DMS/statistical system, which we cannot avoid anyway, even in the typical case shown in Fig. 4. Any traditional approach would thus also pay 99.62% of this cost. This only affects how early our approach switches to using the trained model instead of executing the queries against the system. Our experiments show that excellent-quality results can be produced using a reasonable number of past executed queries for training purposes. Obviously, this can be tuned by setting different values of the model convergence threshold \(\gamma \). We set \(\gamma = 0.01\), where \(\varGamma \) is (stochastically) trapped around 0.0046, with deviation 0.0023 in R1 and 0.0012 in R3. In R2, \(\varGamma \) is strictly less than \(\gamma \) for \(|\mathcal {T}| > 5300\).

Figure 14 shows the relation between the percentage of training pairs \(|\mathcal {T}|\)% required for a specific percentage of query prototypes to partially converge in the intermediate phase, for \(d \in \{2,5\}\) over the R1 dataset with quantization coefficient \(a = 0.25\). Specifically, we observe that with only 35% of the training pairs, i.e., with almost 1800 query–answer pairs (landmark \(t_{\top } \approx 1800\)), we obtain convergence of 70–80% of the query-LLM prototypes. This indicates that the model has partially converged to a great extent, w.r.t. the number of query-LLM prototypes, with a relatively small number of training pairs. In this case, the intermediate phase is deemed of high importance for delivering predictions to the analysts while the model is still in a ‘quasi-training’ mode. The model converges at a high rate as more training pairs from \(\mathcal {T}\) are observed after the convergence of 70% of the query-LLM prototypes. This suggests setting the consensual threshold for the intermediate phase to \(r = 0.7\). However, during this phase, the predictions delivered to the analysts have to be assessed w.r.t. prediction accuracy, as will be discussed in Sect. 8.4.

Figure 15 (upper) shows the evolution of the joint objective functions \(\mathcal {J}\) and \(\mathcal {H}\), and Fig. 15 (lower) shows the evolution of the norm difference of each individual query prototype from the current average, i.e., the individual convergence criterion, against the percentage of training pairs. We observe that after 37% of the training pairs of \(\mathcal {T}\), all the prototypes start to transit from their training phase to the prediction phase, while minimizing their deviations from the average convergence trend of the model. This indicates the flexibility of our model in being in the prediction phase, thus proceeding with predictions for those data subspaces whose corresponding query prototypes have already converged, and, at the same time, being in the training phase for those query prototypes that have not yet converged, thus remaining in a learning mode. After the global convergence, i.e., when the consensual ratio reaches unity, the model transits entirely into the prediction phase, thus achieving significantly fast query execution without accessing the data. This signifies the scalability of our query-driven approach.

Relation between the percentage of the training pairs \(|\mathcal {T}|\)% used for a specific percentage of query prototypes K % to partially converge given the intermediate phase for \(d \in \{2,5\}\)

(Upper) Evolution of the joint objective functions \(\mathcal {J}\) and \(\mathcal {H}\) and (lower) evolution of the difference of the individual convergence criterion per prototype against the percentage of number of training pairs \(|\mathcal {T}|\); dataset R1; \(a = 0.25\)

8.4 Evaluation of Q1 query: predictability and scalability

Figures 16 and 18 (upper) show the RMSE e of the predicted answer y (A1 metric) against the resolution of the quantization coefficient a over the R1, R2, and R3 datasets, respectively, for Q1 queries using the generated testing set \(\mathcal {V}\) (see Sect. 8.2.2). For different values of the quantization coefficient a, our model identifies subspaces where the function \(f(\mathbf {x},\theta )\) behaves almost linearly; thus, the query-LLM functions approximate such regions with high accuracy. Interestingly, by quantizing the query space with a small coefficient a, i.e., a fine-grained resolution of the query space, high accuracy is achieved (low RMSE e values) despite obvious nonlinear dependencies among dimensions. This is due to the fact that with small values of a, we focus on very specific query subspaces where linear approximations suffice to approximate the query function f. This is expected to result in a high number of prototypes K, in order to build many query-LLM functions capturing all the possible nonlinearities of the query function f, as will be discussed below.

Figure 23 (lower) shows the number of query prototypes formed during the query space quantization. Indicatively, we obtain \(K = 450\) prototypes for quantization coefficient \(a=0.25\). This indicates that only 450 prototypes are required to accurately predict the aggregate answer y for data dimension \(d=5\); that is, 450 query-LLM functions are required to capture the curvature of the query function f over the query subspaces. As the quantization coefficient \(a \rightarrow 1\), we quantize the query function f into fewer query-LLM approximations, thus yielding higher RMSE values, as expected, due to the coarse approximation of the function f.

Figures 17 and 18(lower) show the robustness of our model w.r.t. predictability for various testing set sizes \(|\mathcal {V}|\) over the R1, R2, and R3 datasets, respectively. Once the LLM model has converged, it provides a low and constant prediction error in terms of RMSE for different data dimensions d, indicating the robustness of the training and convergence phase of the proposed model. This means that, after transiting into the prediction phase, the model can accurately predict the aggregate answer y via the identified and optimized query-LLM functions; thus, no query processing or data access is needed in that phase.

To assess the efficiency and scalability of the mean-value prediction query Q1, Fig. 24(upper) shows in log scale the average Q1 execution time over the dataset R2 for LLM (with quantization coefficient \(a=0.25\)), corresponding to \(K=92\) and \(K=450\) query prototypes for dimensions \(d \in \{2,5\}\), respectively. Our method requires just 0.18 ms per query over massive data spaces in its prediction phase, offering up to five orders of magnitude speedup (0.18 ms vs up to \(10^{5}\) ms/query). This is expected, since the LLM-based model predicts Q1's outputs without executing the query over the data during prediction, while achieving high prediction accuracy.

We now examine the impact of the model's partial convergence on predictability, i.e., when the model is in the intermediate phase between the training and the prediction phases. Figure 19(upper) shows the partial RMSE \(\tilde{e}\) of the predicted aggregate answer y (A1 metric) during the intermediate phase of the model and the achieved RMSE e during the prediction phase against the percentage of training pairs for consensual threshold \(r = 0.7\) over dimension \(d \in \{2, 5\}\) for the dataset R1. Similar results are obtained from the R2 and R3 datasets. Specifically, the partial RMSE \(\tilde{e}\) is obtained only from the converged query prototypes during the intermediate phase, as described in Case A.I in Sect. 7.2 for \(r=0.7\); that is, from those query prototypes for which any additional training pair \((\mathbf {q},y)\) does not significantly move the prototype in the query space. We observe the predictive capability of our model w.r.t. the number of training pairs: with almost 35% for \(d=2\) (and 45% for \(d=5\)) of the observed training pairs, the model achieves an RMSE value close to that obtained in the full prediction phase, i.e., after observing 100% of the training pairs from \(\mathcal {T}\). This indicates the flexibility of our model to proceed with accurate predictions even while in the intermediate phase, where some of the query prototypes are still in training mode until the model entirely converges.
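Serving predictions from converged prototypes only can be sketched as follows. This is a deliberately simplified illustration: it uses per-prototype constant answers and a hypothetical `last_moves` bookkeeping array, not the exact mechanism of Sect. 7.2:

```python
import numpy as np

def partial_rmse(test_q, test_y, W, y_proto, last_moves, gamma=0.01):
    """Partial RMSE over test queries routed to converged prototypes only.

    W: (K, d+1) query prototypes; y_proto[k]: prototype k's stored answer;
    last_moves[k]: magnitude of prototype k's most recent update.
    Returns (RMSE over served queries, fraction of queries served).
    """
    errs, served = [], 0
    for q, y in zip(test_q, test_y):
        j = int(np.argmin(np.linalg.norm(W - q, axis=1)))  # winner prototype
        if last_moves[j] < gamma:   # converged: answer without data access
            errs.append((y - y_proto[j]) ** 2)
            served += 1
    rmse = float(np.sqrt(np.mean(errs))) if errs else float("nan")
    return rmse, served / len(test_y)
```

Queries landing on still-training prototypes would, in this sketch, fall back to regular query execution.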

More interestingly, Fig. 19(lower) shows the efficiency of our model in achieving high prediction accuracy even during the intermediate phase described above. While in the intermediate phase, the model can provide RMSE values close to those at the end of the training phase, with 70% of the prototypes converged after observing 37% of the training pairs from the training set \(\mathcal {T}\). This demonstrates the fast convergence of the model and its immediate applicability for delivering predictions to analysts and real-time predictive analytics applications while not yet fully converged.

Remark 10

The RMSE in Fig. 19(lower) is normalized in [0,1] for comparison with the percentages of converged prototypes and training pairs.

8.5 Evaluation of Q2 query: PLR data approximation and scalability

We evaluate the Q2 queries by using our query-/data-LLMs model against the REG and PLR models and show the statistical significance of the derived accuracy metrics. The linear/nonlinear behavior of the data function g is interpreted through the variance-explanation and model-fitting metrics: the fraction of variance unexplained (FVU) and the coefficient of determination (CoD), against the quantization resolution coefficient a and the number of model prototypes K. Figures 20 and 21(upper) show the sum of squared residuals SSR between the actual and predicted answers for the data-LLMs and REG models with \(d \in \{2,5,6\}\) over the datasets R1, R2, and R3 with \(p < 0.05\).

Figures 22(upper) and 21(lower) show that the fraction of variance unexplained of the function approximation is FVU < 1 for our model, while for the REG model we obtain FVU > 1, \(p < 0.05\). This indicates the capability of our model to capture the nonlinearities of the underlying data function g over all the data subspaces, compared with the REG model provided in modern DMSs. Specifically, as the query space quantization becomes coarse, i.e., a low resolution with \(a \rightarrow 1\), our model approaches the FVU of the REG model, since the quantization results in a small number of data-LLM functions. This is because we force our model to generate fewer data-LLM functions, which thus cannot effectively capture the nonlinearities of the data function g over all possible data subspaces. Indicatively, we obtain only one data-LLM when \(a = 1\), i.e., a single global linear model approximates the data function g. As expected, the optimal PLR model achieves the lowest FVU by capturing the nonlinearity of the data function g with multiple linear basis functions. For quantization coefficient \(a < 1\), we achieve a low FVU, and our model effectively captures the nonlinearity of the data function g by autonomously deriving multiple data-LLMs; this is very close to the actual PLR approximation for quantization coefficient \(a < 0.1\) with \(p<0.05\). As the Rosenbrock function g is nonlinear, our model analyzes it into data-LLMs and provides a fine-grained explanation of the behavior of the data function g. This cannot be achieved by the in-DMS REG model, since the Rosenbrock function cannot be expressed by a 'global' line within the entire data space \(\mathbb {D}(\mathbf {x},\theta )\). The PLR model shows statistically superior FVU performance (\(p < 0.05\)), while being dramatically inefficient compared to our model, as shown in Fig. 24(lower). Our model, on the other hand, conditionally quantizes the data function g into data-LLMs, thus providing the list \(\mathcal {S}\) of local lines that significantly better explain the data subspace, without accessing the data, and delivering highly accurate model fitting.

In Fig. 22(lower) and Fig. 21(lower), the data function g in the R1 and R3 datasets does not behave linearly in all the random data subspaces. This is evidenced by the FVU metric of the REG model, which is relatively close to or over 1 for \(d=2\), \(d=5\), and \(d=6\) with \(p<0.05\). This information is unknown a priori to analysts; hence, results obtained using the REG model would be fraught with approximation errors, indicating 'bad' model fitting. It is worth noting that the average number of data-LLM functions returned to the analysts for all the issued testing queries in the testing set \(\mathcal {V}\) is \(|\mathcal {S}| = 4.62\) per query with variance 3.88. This denotes the nonlinear behavior of the data function g and the fine-grained and accurate explanation of the function g within a specific data subspace \(\mathbb {D}(\mathbf {x},\theta )\) per query \(\mathbf {q} = [\mathbf {x},\theta ]\). Here, the PLR model achieves the lowest FVU value, i.e., the best model fitting as expected (\(p<0.05\)), but note that this is also achieved by our data-LLM functions with a quantization coefficient \(a < 0.1\).

Figure 23(upper) shows the coefficient of determination CoD \(R^{2}\) for the LLM, REG, and PLR models over the R1 dataset (similar results are obtained for datasets R2 and R3) at a significance level of 5%. A positive value of \(R^{2}\) close to 1 indicates that a linear approximation is a good fit for the unknown data function g. In contrast, a value of \(R^{2}\) close to 0 and, especially, a negative value of \(R^{2}\) indicate a significantly bad fit, signaling inaccuracies in the function approximation. In our case, with \(K>60\) query prototypes, our model achieves a high and positive \(R^{2}\), indicating that it better explains the random queried data subspaces \(\mathbb {D}(\mathbf {x},\theta )\) than the current in-DMS REG model over exactly the same data subspaces, with \(p<0.05\). The REG model achieves low \(R^{2}\) values, including negative ones; thus, it is inappropriate for predictions and function approximation. This indicates that the underlying data function g highly exhibits nonlinearities, which are not known to the analysts a priori. By adopting our model, the analysts progressively learn the underlying data function g and, via the derived data-LLM functions, capture the hidden nonlinearities over the queried data subspaces. Such subspaces could never be known to the analysts unless exhaustive data access and exploration took place. This capability is only provided by our model. Notably, as the quantization coefficient \(a \rightarrow 0\), our model significantly increases the coefficient of determination \(R^{2}\), indicating a better capture of the specificities of the underlying data function g and thus providing more accurate linear models. Again, the data-access-exhaustive PLR model achieves the highest CoD values, however at the cost of high inefficiency; see Fig. 24(lower). Regardless, note that our model can match the PLR's CoD value by simply increasing K, i.e., the granularity of the query space quantization.
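For reference, the FVU and CoD metrics used in this section relate through \(R^{2} = 1 - \mathrm{FVU}\), with negative \(R^{2}\) meaning the fit is worse than predicting the mean. A minimal sketch of their computation (the function names are ours):

```python
import numpy as np

def fvu(y_true, y_pred):
    """Fraction of variance unexplained: SSR over total variance of y."""
    ssr = np.sum((y_true - y_pred) ** 2)              # sum of squared residuals
    sst = np.sum((y_true - np.mean(y_true)) ** 2)     # total sum of squares
    return ssr / sst

def cod(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - FVU (negative => worse than mean)."""
    return 1.0 - fvu(y_true, y_pred)
```

A constant mean predictor gives FVU = 1 and \(R^{2} = 0\), which is why FVU > 1 for the REG model signals a fit worse than the trivial constant approximation.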

Figures 24(lower) and 26(lower) show the Q2 execution time over the datasets R2 and R3, respectively, for data-LLM (\(a=0.25\), i.e., \(K=(92,450)\) for \(d=(2,5)\)) through Algorithm 3, the REG model from PostgreSQL (\(d=2\)) (REG-DBMS), the REG model from MATLAB (\(d=5\)) (REG-MATLAB), and the optimal PLR, against dataset size. The derived results are statistically significant with \(p<0.05\). Our model is highly scalable (note the flat curves) in both datasets and highly efficient, achieving 0.56 ms/query and 0.78 ms/query (even for massive datasets)–up to six orders of magnitude better than the REG and PLR models for the R2 and R3 datasets, respectively. The full picture is that our model provides ultimate scalability (by being independent of the size of the dataset) and many orders of magnitude higher efficiency, while ensuring great goodness of fit (CoD, FVU), similar to that of PLR.

The PLR model with data sampling techniques could also be considered as an effective efficiency–accuracy trade-off. Figure 24, however, shows the efficiency limitations of such an approach. The PLR model, even over a very small random sample of size \(10^6\) (0.01% of the \(10^{10}\) dataset), is shown to be \(>3\) orders of magnitude less efficient than our model. Also, recall that PLR here is implemented in MATLAB, with all data in memory, hiding the performance costs of a full in-DBMS implementation of the selection operator (computing the data subspace), the sampling of the data space, and the PLR algorithm (whose performance is shown in Fig. 24). All of this is in stark contrast to the O(1) cost of our model. Finally, note that, to our knowledge, PLR is not currently implemented within DMSs, regardless of its cost.

8.6 Data output predictability

We compare our data-LLM functions against the in-RDBMS REG and PLR models for providing accurate data output predictions w.r.t. the A2 metric, i.e., the data prediction performance at a statistical significance level of 5%. We use our data-LLM functions to provide data output u predictions over unseen data subspaces \(\mathbb {D}(\mathbf {x},\theta )\) using (31) and (29) for prediction. Figures 25 and 26(upper) show the RMSE v for the LLM, REG, and PLR models over the R1, R2, and R3 datasets against the testing set size \(|\mathcal {V}|\).

The LLM model can successfully predict the data output u, being statistically robust w.r.t. the number of testing pairs \(|\mathcal {V}|\) (\(p<0.05\)), and attains comparable or even lower prediction error than the REG model. This denotes that our model, by fusing different data-LLM functions that better capture the characteristics of the underlying data function g, provides better data output u prediction than a 'global' REG model over random queried data subspaces \(\mathbb {D}\). Evidently, the PLR model achieves the lowest RMSE value by actually accessing the data and capturing the actual nonlinearity of the data function g through linear models. However, this is achieved with relatively high computational complexity, higher than that of the REG model, including a data-access process of polynomial cost [44]. Note that the data output prediction times for the LLM, REG, and PLR models in this experiment are the same as those presented in Fig. 24: the LLM model executes our Algorithm 2 by replacing \(\theta =\theta _{k}\) in (30), \(\forall \mathbf {w}_{k} \in \mathcal {W}(\mathbf {q})\); the REG model creates the linear approximation over the data space \(\mathbb {D}\); and PLR adaptively finds the best linear models for data fitting in each prediction request.
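The fusion of data-LLM functions for predicting the data output u can be sketched as a kernel-weighted blend of local linear models. This is an illustrative simplification, not the exact rule in (29)–(31); the Gaussian weighting and bandwidth h are our assumptions:

```python
import numpy as np

def predict_output(x, theta, prototypes, coeffs, intercepts, h=0.5):
    """Predict data output u at x by fusing local linear models (LLMs).

    Each prototype w_k carries a local line u = b_k + a_k . x. Predictions
    are blended with Gaussian kernel weights based on the distance of the
    query q = [x, theta] to each prototype.
    """
    q = np.append(x, theta)
    d2 = np.sum((prototypes - q) ** 2, axis=1)   # squared distances to prototypes
    w = np.exp(-d2 / (2 * h ** 2))
    w /= w.sum()                                 # normalize the kernel weights
    local = intercepts + coeffs @ x              # per-LLM linear predictions
    return float(w @ local)
```

The blend degrades gracefully: a query near one prototype is answered essentially by that prototype's local line, while a query between prototypes receives an interpolated prediction, all without touching the data.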

Overall, the proposed LLM model, through the training, intermediate, and prediction phases, achieves statistically significant scalability and accuracy performance compared with the in-RDBMS REG model and the data-access-intensive PLR model (\(p<0.05\)). The scalability of the proposed model in the predictive analytics era is achieved by predicting the query answers and delivering to analysts the statistical behavior of the underlying data function without accessing the raw data and without processing/executing the analytics queries, as opposed to the data-driven REG and PLR models in the literature.

8.7 Impact of radius \(\theta \)

In this section we examine the impact of the query radius \(\theta \) on the predictive and scalability performance of our query-driven approach. Consider that the query function f is approximated by some function \(\hat{f}\). Then, the expected prediction error (EPE) in (5) for a random query \(\mathbf {q}\) with actual aggregate answer y is decomposed as:

$$\mathrm{EPE}(\mathbf {q}) = \underbrace{\bigl(f(\mathbf {x},\theta ) - \mathbb {E}[\hat{f}(\mathbf {x},\theta )]\bigr)^{2}}_{\text{squared bias}} + \underbrace{\mathrm{Var}\bigl(\hat{f}(\mathbf {x},\theta )\bigr)}_{\text{variance}} + \underbrace{\sigma ^{2}}_{\text{irreducible error}}$$

The first term is the squared bias, i.e., the difference between the true function f and the expected value of the estimate \(\mathbb {E}[\hat{f}]\), where the expectation averages over the randomness in the dataset. This term will most likely increase with the radius \(\theta \), which implies an increase in the number of input data points \(n_{\theta }(\mathbf {x})\) from the dataset \(\mathcal {B}\). The second term is the variance, which decreases as the query radius \(\theta \) increases. The third term is the irreducible error, i.e., the noise in the true relationship, with variance \(\sigma ^{2} = \mathrm{Var}(\epsilon )\), which cannot fundamentally be reduced by any model. The \(\theta \) value controls the influence that each neighboring query point has on the aggregate answer prediction. As the radius \(\theta \) varies, there is a bias–variance trade-off. An increase in radius \(\theta \) results in a smoother aggregate answer prediction but increases the bias. On the other hand, the variance goes to zero since, for instance, with a high radius \(\theta \), the prediction \(\hat{f}(\mathbf {x},\theta ) \approx \mathbb {E}[y]\), i.e., unconditioned on the query \(\mathbf {q}\). Imagine queries whose radii include all data points \(\mathbf {x}_{i}\) in the entire data space. In that case, we obtain a constant aggregate answer for each issued query, i.e., the aggregate answer \(y = 1/n\sum _{i=1}^{n}u_{i}\), which is the average of all data outputs \(u_{i}\); thus, there is no need to predict the aggregate answer y. By selecting a radius \(\theta \) such that \(n_{\theta }(\mathbf {x}) \approx n\), the aggregate answer prediction is a relatively smooth function of the query \(\mathbf {q}\), but has little to do with the actual positions of the data vectors \(\mathbf {x}_{i}\) over the data space. Evidently, the variance contribution to the expected error is then small. On the other hand, the prediction for a particular query \(\mathbf {q}\) is systematically biased toward the population response, regardless of any evidence for local variation in the data subspace \(\mathbb {D}(\mathbf {x},\theta )\). The other extreme is to select a radius \(\theta \) such that \(n_{\theta }(\mathbf {x}) = 1\) for all queries. We can then expect less bias; in this case, the bias goes to zero as \(n \rightarrow \infty \) [32]. Finally, given the true model and infinite data, we should be able to reduce both the bias and variance terms to 0. That is, as the number of input data points n and \(n_{\theta }(\mathbf {x}) \rightarrow \infty \) such that \(\frac{n_{\theta }(\mathbf {x})}{n} \rightarrow 0\), then \(\hat{f}(\mathbf {x},\theta ) \rightarrow \mathbb {E}[y|\mathbf {x},\theta ]\). However, in the real world, with approximate models and finite data, the radius \(\theta \) governs the trade-off between minimizing the bias and minimizing the variance.
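The bias–variance behavior of a radius-\(\theta \) mean estimator can be demonstrated empirically. The following small simulation sketch is ours; the data-generating function, noise level, and sample sizes are arbitrary illustrative choices:

```python
import numpy as np

def bias_variance(theta, x0=0.5, n=400, trials=200, seed=0):
    """Empirically decompose the error of a radius-theta mean estimator.

    Ground truth g(x) = sin(4x); outputs u = g(x) + noise. The estimator
    answers a mean-value query at x0 by averaging u over |x - x0| <= theta.
    Returns (squared bias, variance) of the estimate across resampled datasets.
    """
    rng = np.random.default_rng(seed)
    g = lambda x: np.sin(4 * x)
    est = []
    for _ in range(trials):
        x = rng.uniform(0.0, 1.0, n)           # fresh dataset each trial
        u = g(x) + rng.normal(0.0, 0.5, n)     # noisy data outputs
        mask = np.abs(x - x0) <= theta         # data subspace of the query
        est.append(u[mask].mean())             # mean-value aggregate answer
    est = np.array(est)
    return (est.mean() - g(x0)) ** 2, est.var()
```

Comparing a narrow radius (e.g., \(\theta = 0.05\)) with a wide one (e.g., \(\theta = 0.45\)) shows the wide radius lowering the variance while inflating the squared bias, mirroring the decomposition above.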

We experiment with different mean values \(\mu _{\theta }\) of the radius \(\theta \sim \mathcal {N}(\mu _{\theta },\sigma ^{2}_{\theta })\) with a fixed variance \(\sigma ^{2}_{\theta }\) to examine the impact on the model training, the quality of the aggregate answer prediction, and the PLR approximation of the underlying data function g. We examine the number of training pairs \(|\mathcal {T}|\) that our method requires to reach the convergence threshold \(\gamma = 0.01\). We also examine the impact of the radius \(\theta \) on the RMSE and CoD metrics. Hence, three factors (\(|\mathcal {T}|\), RMSE, and CoD) are influenced by the radius \(\theta \). We experiment with mean radius \(\mu _{\theta } \in \{0.01, \ldots , 0.99\}\) over the R1 dataset (similar results are obtained for the R2 and R3 datasets). Consider queries with a high radius \(\theta \) drawn from a Gaussian \(\mathcal {N}(\mu _{\theta },\sigma ^{2}_{\theta })\) with a high mean radius \(\mu _{\theta }\). Then, the radius \(\theta \) nearly covers the entire input data range and the aggregate answer y is close to the average value of the output u for all queries, i.e., \(n_{\theta }\) contains all \(\mathbf {x}\) input data points. In this case, all query prototypes \(\mathbf {w}_{k}\) correspond to constant query-LLM functions \(f_{k}(\mathbf {x},\theta ) \approx y_{k} = y\), where the aggregate answer \(y = \mathbb {E}[u]\) is unconditioned on \(\mathbf {x}\) and the radius \(\theta \). Hence, the training and convergence of all LLMs is trivial, since there is no specificity to be extracted from each query-LLM function \(f_{k}\). Our method converges with a low number of training pairs \(|\mathcal {T}|\), as shown in Fig. 27(lower). On the other hand, a small \(\theta \) value amounts to learning 'meticulously' all the specificities of all LLMs. In this case, our method requires a relatively high number of training query–answer pairs \(|\mathcal {T}|\) to converge; see Fig. 27(lower).

In terms of accuracy, the higher the radius \(\theta \), the lower the RMSE e becomes. With a high radius \(\theta \), all query-LLM functions reduce to constant functions, with the extreme case \(f_{k} \approx \mathbb {E}[u], \forall k\), as discussed above; thus, the RMSE \(e = \sqrt{\frac{1}{M}\sum (y_{i} - \hat{y}_{i})^{2}} \rightarrow 0\) with \(y_{i} \approx \mathbb {E}[u]\), due to the fact that \(n_{\theta }\) contains all input data and, thus, \(\hat{y}_{i} \approx \mathbb {E}[u]\) (see Fig. 27(upper) with \(|\mathcal {T}|= 5359\) training pairs required for convergence w.r.t. \(\gamma = 0.01\)). However, this comes at the expense of a low CoD \(R^{2}\), since the data function g is approximated 'solely' by a constant function \(g(\mathbf {x}) \approx \mathbb {E}[u]\) (see Fig. 27(lower)). When the radius \(\theta \) is small, we attempt to estimate the query function f over \((\mathbf {x},\theta )\) and, thus, approximate the data function \(g(\mathbf {x})\). This, however, requires many training query–answer pairs \(|\mathcal {T}|\); see Fig. 27(lower). Overall, there is a trade-off between the number of training pairs \(|\mathcal {T}|\) and the approximation and accuracy capability. To obtain a quality approximation, the CoD metric should be strictly greater than zero. This is achieved by setting the mean radius value \(\mu _{\theta } < (0.4, 0.5)\) for \(d \in (5,2)\). Then, we can balance the RMSE and the training time (number of training pairs \(|\mathcal {T}|\)), as shown in Figs. 27 and 28. In addition, there is a trade-off between training effort and predictability. As shown in Figs. 27 and 28, and as explained above, a low \(\mu _{\theta }\) value results in a high RMSE and a high training effort in terms of \(|\mathcal {T}|\) size. By combining those trade-offs, for a reasonable training effort, to achieve low RMSE and high goodness of fit, i.e., a high positive CoD value, we set the mean radius value \(\mu _{\theta } = 0.1\), which corresponds to \(\sim \) 20% of the data range for \(\sigma ^{2}_{\theta } = 0.01\). Finally, Fig. 29 shows the impact (trajectory) of the mean radius \(\mu _{\theta }\) on the training set size \(|\mathcal {T}|\), the prediction accuracy w.r.t. RMSE e, and the goodness of fit w.r.t. the CoD \(R^{2}\) metric for \(d \in (2,5)\) over R1; we obtain similar results for the R2 and R3 datasets.

9 Conclusions and future plans

We focused on the inferential tasks of piecewise linear regression and predictive modeling, which are central to in-DMS predictive analytics. We introduced an investigation route whereby answers from previously executed aggregate and regression queries are exploited to train novel statistical learning models, which discover and approximate the unknown underlying data function with piecewise linear regression planes, predict future mean-value query answers, and predict the data output. We contribute a statistical learning methodology that yields highly accurate answers and data function approximations based only on query–answer pairs, avoiding data access after the model training phase. The performance evaluation and comparative assessment revealed very promising results.

Our methodology is shown to be highly accurate; extremely efficient in computing query results (with sub-millisecond latencies even for massive datasets, yielding up to six orders of magnitude improvement over the exact answers produced by piecewise linear regression and global linear approximation models); and scalable, as predictions during query processing do not require access to the DMS engine, thus being insensitive to dataset sizes.

Our plans for future work focus on developing a framework that can dynamically and optimally switch between the training/intermediate phases and the query prediction phase as analysts' interests shift between data subspaces. Moreover, the framework under development is expected to cope with nonlinear approximations by evolving and expanding the fundamental representatives of both data and query subspaces, to support robust query subspace adaptation and to deal with data spaces with online data updates.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. This work is funded by the EU H2020 GNFUV Project RAWFIE–OC2–EXP–SCI (Grant#645220), under the EC FIRE+ initiative.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright information

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
