Skip to contents

Introduction

We start with a simple case, where we use data_stationary included in our package, which is generated with the following data generation process: \[ Y_t = 40+0.5Y_{t-1}-1.5X_t-0.5X_{t-1}-C_t+v_t, v_t \sim \text{i.i.d }\mathcal{N}(0,1) \]

Generate Missing Index

Then using the function generate.missing_index() provided in our package, we create a MNAR missing index, and create the columns for corresponding lagged outcomes.

data = data_stationary#using data_stationary
index_MNAR = generate.missing_index(type = "MNAR", 
                                    n=length(data$y),
                                    param = list(data = data.frame(y=data$y,x=data$x),
                                                 MNAR.type = "increasing",
                                                 coeff = c(0.1,-0.3,0),
                                                 MNAR.drawplot = c(TRUE, "y"))
                                    )$missing_index
#> The missing indicators for MNAR 'increasing' are generated!

#> The missing rate for MNAR is 0.46

data$y[index_MNAR] = NA
data$y_1 = c(NA,data$y[1:999])
data$x_1 = c(NA,data$x[1:999])
data$c_1 = c(NA,data$c[1:999])
data = data[c(2:1000),]
data_space_SSMimpute = data
kable(head(data_space_SSMimpute))
Date y x c y_1 x_1 c_1
2 2020-02-07 NA 10.73859 8.716357 23.43100 9.859805 7.950641
3 2020-02-08 NA 10.24018 7.791593 NA 10.738591 8.716357
4 2020-02-09 23.46471 11.26975 5.322497 NA 10.240184 7.791593
5 2020-02-10 20.95864 10.66030 7.306048 23.46471 11.269747 5.322497
6 2020-02-11 23.51839 10.56685 7.330904 20.95864 10.660300 7.306048
7 2020-02-12 22.40664 11.38117 6.250145 23.51839 10.566847 7.330904
imputeTS::ggplot_na_distribution(data_space_SSMimpute$y, color_missing = "pink",color_missing_border = "pink", alpha_missing = 0.1)

# Start Imputation

Based on our prior knowledge(usually we need an additional step for data exploration, see: exploration of state-space model for a detailed introduction), we set up the formula and all the learning parameter in ss_param and cpt_learning_param


formula="y~y_1+x+x_1+c"
formula_var=unlist(strsplit(unlist(strsplit(formula,"~"))[2],"+",fixed=T))

ss_param=list(inits=c(log(0.25),log(1)),m0=c(40,0.1,-0.1,-0.1,-1),C0=diag(rep(10^3),5), AR1_coeffi=NULL,rw_coeffi="intercept", v_cp_param=NULL, w_cp_param=NULL,max_iteration=100)

result_SSMimpute1=SSMimpute_unanimous_cpts(data_ss_ori=data_space_SSMimpute,formula_var,ss_param_temp=ss_param,
                                                         initial_imputation_option="StructTS",
                                                         estimate_convergence_cri=0.01,
                                                         lik_convergence_cri=0.01,
                                                         stepsize_for_newpart=1/3,
                                                         max_iteration=100,
                                                         cpt_learning_param=list(cpt_method="mean",burnin=1/10,mergeband=20,convergence_cri=10),
                                                         cpt_initial_guess_option="ignore",
                                                         dlm_option="smooth",m=5,seed=1,printFlag=F)

Analysis of Result

Now we take a close look at the result generated from SSMimpute_unanimous_cpts()

Estimated coefficient:

#kable(result_statespace_SSMimpute1$result_convergence)
#kable(result_statespace_SSMimpute1$result_convergence_mp)
kable(result_SSMimpute1$result_convergence_mp_addV)
Estimate Std.Error
(Intercept) 41.8260426 1.5690795
y_1 0.4607595 0.0256214
x -1.4920563 0.0625198
x_1 -0.5985293 0.0901322
c -0.9763831 0.0530130

We can see SSMimpute_unanimous_cpts() successfully uncover the data generating process.

We also plot the imputed missing value versus the ground truth:

#result_statespace_SSMimpute1$estimated_cpts
data_na = result_SSMimpute1$data_temp
length(data_na$y_1)
#> [1] 999

data_temp = result_SSMimpute1$data_temp
missing_part=which(is.na(data_temp$y))[which(is.na(data_temp$y))<nrow(data_temp)]
data_temp$y_1[missing_part+1]=result_SSMimpute1$y_final
#imputeTS::ggplot_na_distribution(data_stationary$y[1:999], color_missing = "pink",color_missing_border = "pink", alpha_missing = 0.9)
imputeTS::ggplot_na_imputations(x_with_na = data_space_SSMimpute$y_1, x_with_imputations = data_na$y_1,x_with_truth = data_stationary$y[1:999])