3
The objective of this presentation is to introduce basic tools for handling missing data in the CountrySTAT and FAOSTAT domains. They are based on a simple, user-friendly approach. The CountrySTAT agricultural production domain was used as a basis to develop and test imputation and validation methodologies that could assist standardisation across the different statistical domains present at FAO.

4
Data are missing for different reasons: 1) the value has not been measured (for example, it was forgotten); 2) the value was measured but lost; 3) the value was measured but is considered unusable (outliers, etc.); 4) the value was measured but is unavailable.

6
In a dataset, data can be:
1) Missing completely at random (MCAR): the events that lead to any particular data item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. $P(r \mid Y_{observed}, Y_{missing}) = P(r)$
2) Missing at random (MAR): the missingness is related to an observed variable, but not to the value of the variable that has missing data. $P(r \mid Y_{observed}, Y_{missing}) = P(r \mid Y_{observed})$
3) Not missing at random (NMAR): the data are neither MCAR nor MAR, so the probability of missingness does not simplify. $P(r \mid Y_{observed}, Y_{missing}) = P(r \mid Y_{observed}, Y_{missing})$
4) Censored and truncated data.
Data are usually MCAR or MAR.
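To make the difference between MCAR and MAR concrete, here is a minimal simulation sketch. The function names, the area threshold, and the probabilities are illustrative assumptions, not part of the CountrySTAT methodology.

```python
import random

def mcar_mask(n, p, rng):
    """MCAR: every value is missing with the same probability p,
    independently of any observed or unobserved quantity."""
    return [rng.random() < p for _ in range(n)]

def mar_mask(observed, threshold, p_low, p_high, rng):
    """MAR: the probability that a value is missing depends only on a
    fully observed variable (here: below or above a threshold)."""
    return [rng.random() < (p_low if x < threshold else p_high)
            for x in observed]

rng = random.Random(0)
area = [20.0, 35.0, 60.0, 80.0, 45.0, 90.0]  # hypothetical observed areas

# Production values missing purely by chance (MCAR).
print(mcar_mask(len(area), 0.3, rng))
# Production values missing more often when the observed area is small (MAR).
print(mar_mask(area, 50.0, 0.6, 0.1, rng))
```

In the MAR case the missingness of production can be explained by the area values alone, which is what allows regression-type imputation to work.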

11
A linear trend is assumed to exist between the start and end points of gaps in the time series. Let $y_0, y_1, \ldots, y_{t-l}$ denote the data points with values obtained from official sources before the gap, and $y_{t+r}, y_{t+r+1}, \ldots, y_m$ the data points with official values after the gap. The imputed values are obtained by linear interpolation between $y_{t-l}$ and $y_{t+r}$: $\hat{y}_k = y_{t-l} + \frac{k - (t-l)}{l + r}\,(y_{t+r} - y_{t-l})$ for $t-l < k < t+r$.
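The linear-trend assumption described above can be sketched in a few lines of Python. This is a minimal illustration: the function name and the sample series are hypothetical, and gaps at the very start or end of the series are left untouched because they have no second endpoint.

```python
from typing import List, Optional

def impute_linear(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps (None) by linear interpolation between the last
    observed value before the gap and the first observed value after it."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i - 1                 # last observed index before the gap
            j = i
            while j < n and filled[j] is None:
                j += 1                    # j: first observed index after the gap
            if start >= 0 and j < n:      # only interior gaps are filled
                step = (filled[j] - filled[start]) / (j - start)
                for k in range(i, j):
                    filled[k] = filled[start] + step * (k - start)
            i = j
        else:
            i += 1
    return filled

production = [100.0, 110.0, None, None, 140.0, 150.0]  # hypothetical series
print(impute_linear(production))
# [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
```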

17
The methods used are based on regression imputation and use the EM algorithm: 1) Yield estimation: estimate yield using an ARIMA model; 2) Linear regression: use a linear regression between production $P_t$ and area $A_t$, including a trend; 3) ARIMA model: estimate $P_t$ and $A_t$ using an ARIMA model; 4) Spline regression: estimate $P_t$ and $A_t$ using a spline.

22
• The ARIMA model must first be identified, for example ARIMA(0,1,1): $Y_t = Y_{t-1} + \alpha + \varepsilon_t - \theta_1 \varepsilon_{t-1}$.
• Use the relation between production and area.
• Treat these variables as time series and impute using the EM algorithm.
• R package mtsdi.
• Impute $P_t$ and $A_t$ using the ARIMA model.
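To show how the ARIMA(0,1,1) equation produces a value for the next period, here is a minimal sketch of the one-step-ahead recursion. It assumes that $\alpha$ and $\theta_1$ are already known and that the initial shock is zero. In practice these parameters are estimated from the data, and this sketch does not reproduce the EM-based procedure of the mtsdi package.

```python
def arima011_forecast(y, alpha, theta):
    """One-step-ahead forecast for Y_t = Y_{t-1} + alpha + e_t - theta * e_{t-1}.

    The shocks e_t are recovered recursively from the observed series,
    starting from an assumed initial shock of zero.
    """
    eps = 0.0
    for prev, cur in zip(y, y[1:]):
        # Rearranged model equation: e_t = Y_t - Y_{t-1} - alpha + theta * e_{t-1}
        eps = cur - prev - alpha + theta * eps
    # The expected future shock is zero, so only the last shock contributes.
    return y[-1] + alpha - theta * eps

area = [10.0, 13.0]  # hypothetical series
print(arima011_forecast(area, alpha=2.0, theta=0.5))
# 14.5
```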

23
• A form of interpolation in which the interpolant is a special type of piecewise polynomial called a spline.
• For each interval, a polynomial function is estimated that fits the data well.
• Spline interpolation is preferred over polynomial interpolation because the interpolation error can be made small even when low-degree polynomials are used for the spline.
• R package mtsdi.
• Impute $P_t$ and $A_t$ using spline regression.
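The piecewise-polynomial idea can be illustrated with a natural cubic spline written in plain Python. This is a sketch of spline interpolation in general, not of the mtsdi implementation. The function name and the sample years and values are hypothetical.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing. On each interval [xs[j], xs[j+1]] the
    spline is ys[j] + b[j]*dx + c[j]*dx**2 + d[j]*dx**3 with dx = x - xs[j].
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = (3.0 / h[i] * (ys[i + 1] - ys[i])
                    - 3.0 / h[i - 1] * (ys[i] - ys[i - 1]))
    # Forward sweep of the tridiagonal system for the second-order coefficients.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[n] = 0 is the "natural" boundary condition.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(x):
        # Locate the interval containing x, clamped to the first/last interval.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate

# Hypothetical area series with the year 2002 missing.
years = [2000, 2001, 2003, 2004]
area = [50.0, 54.0, 61.0, 63.0]
spline = natural_cubic_spline(years, area)
print(spline(2002))  # imputed value for the missing year
```

The spline passes exactly through the observed points, so only the missing years receive new values.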

34
• In the three test cases, the relative errors are lower for the spline method in most cases when more than 10% of the data are missing.
• The ARIMA method is better suited when less than 10% of the data in the dataset are missing.
• The above tests use only two variables for the same crop (area and production). If the share of missing data exceeds 40%, it would be appropriate to use a third, correlated control variable.
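A comparison of this kind can be made by hiding known values, imputing them, and measuring the relative error. The sketch below assumes the mean absolute relative error as the metric. The exact formula used in the tests is not given in the slides, and the numbers are hypothetical.

```python
def mean_relative_error(true_values, imputed_values):
    """Mean of |imputed - true| / |true| over the points where true != 0."""
    errors = [abs(i - t) / abs(t)
              for t, i in zip(true_values, imputed_values)
              if t != 0]
    return sum(errors) / len(errors)

# Hypothetical hidden values and the values an imputation method produced.
hidden = [100.0, 200.0]
imputed = [110.0, 180.0]
print(mean_relative_error(hidden, imputed))
```

Running the same comparison for each method and each share of missing data gives the kind of ranking summarised above.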