Title: | LP Smoothed Inference and Graphics |
---|---|
Description: | Classical tests of goodness-of-fit aim to validate the conformity of a postulated model to the data under study. In their standard formulation, however, they do not allow exploring how the hypothesized model deviates from the truth, nor do they provide any insight into how the rejected model could be improved to better fit the data. To overcome these shortcomings, we establish a comprehensive framework for goodness-of-fit which naturally integrates modeling, estimation, inference and graphics. This package implements deviance tests and comparison density plots for LP smoothed inference, where the letter L denotes nonparametric methods based on quantiles and P stands for polynomials. Simulation methods are used to perform variance estimation, inference and post-selection adjustments. Algeri S. and Zhang X. (2020) <arXiv:2005.13011>. |
Authors: | Xiangyu Zhang <[email protected]>, Sara Algeri <[email protected]> |
Maintainer: | Xiangyu Zhang <[email protected]> |
License: | GPL-3 |
Version: | 0.1.3 |
Built: | 2024-10-27 05:05:14 UTC |
Source: | https://github.com/cran/LPsmooth |
Constructs the CD-plot and computes the deviance test for exhaustive goodness-of-fit.
CDplot(data,m=4,g,par0=NULL,range=NULL,lattice=NULL,selection=TRUE,criterion="BIC", B=1000,samplerG=NULL,h=NULL,samplerH=NULL,R=500,ylim=c(0,2),CD.plot=TRUE)
data |
A data vector. See details. |
m |
If |
g |
Function corresponding to the parametric start. See details. |
par0 |
A vector of starting values for the parameters of g. |
range |
Interval corresponding to the support of the continuous data distribution. |
lattice |
Support of the discrete data distribution. |
selection |
A logical argument indicating if model selection should be performed. See details. |
criterion |
If |
B |
A positive integer corresponding to the number of bootstrap replicates. |
samplerG |
A function corresponding to the random sampler for the parametric start |
h |
Instrumental probability function. If |
samplerH |
A function corresponding to the random sampler for the instrumental probability function |
R |
A positive integer corresponding to the size of the grid of equidistant points at which the comparison densities are evaluated. The default is 500. |
ylim |
If |
CD.plot |
A logical argument indicating if the comparison density plot should be displayed or not. The default is TRUE. |
The argument data collects the data for which we want to test if its distribution corresponds to the one of the postulated model specified in the argument g.
If the parametric start is fully known, it must be specified in a way that it takes x as the only argument. If the parametric start is not fully known, it must be specified in a way that it takes arguments x and par, with par corresponding to the vector of unknown parameters. The latter are estimated numerically via maximum likelihood estimation and par0 specifies the initial values of the parameters to be used in the optimization.
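The two ways of specifying the parametric start can be sketched in base R as follows. This is only an illustration under stated assumptions: the exponential model, the simulated data, and the names negloglik and g_hat are hypothetical, and the optimizer used here (optim) is not necessarily the one the package calls internally.

```r
# Sketch only: base R, not the package's internal code.
set.seed(1)
x <- rexp(200, rate = 2)                       # simulated data (illustrative)

# Parametric start with an unknown parameter: takes x and par.
g <- function(x, par) dexp(x, rate = par[1])

# Negative log-likelihood to be minimised over par.
negloglik <- function(par, x) -sum(log(g(x, par)))

par0 <- 1                                      # starting value, playing the role of par0
fit <- optim(par0, negloglik, x = x,
             method = "Brent", lower = 1e-6, upper = 50)
fit$par                                        # numerical maximum likelihood estimate

# Fully specified version obtained by plugging in the estimate: takes x only.
g_hat <- function(x) g(x, fit$par)
```

For the exponential model the maximum likelihood estimate has the closed form 1/mean(x), which gives a quick check of the numerical result.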
The value m determines the smoothness of the estimated comparison density, with smaller values of m leading to smoother estimates.
If selection=TRUE, the largest coefficient estimates are selected according to either the AIC or BIC criterion as described in Algeri and Zhang, 2020 (see also Ledwina, 1994 and Mukhopadhyay, 2017). The resulting estimator is the one in Gajek's formulation with orthonormal basis corresponding to LP score functions (see Algeri and Zhang, 2020 and Gajek, 1986).
Deviance |
Value of the deviance test statistic. |
p_value |
P-value of the deviance test. |
Sara Algeri and Xiangyu Zhang
Algeri S. and Zhang X. (2020). Exhaustive goodness-of-fit via smoothed inference and graphics. arXiv:2005.13011.
Gajek, L. (1986). On improving density estimators which are not bona fide functions. The Annals of Statistics, 14(4):1612–1618.
Ledwina, T. (1994). Data-driven version of Neyman's smooth test of fit. Journal of the American Statistical Association, 89(427):1000–1005.
Mukhopadhyay, S. (2017). Large-scale mode identification and data-driven sciences. Electronic Journal of Statistics, 11(1):215–240.
d_hat, find_h_disc, find_h_cont.
data<-rbinom(50,size=20,prob=0.5)
g<-function(x)dpois(x,10)/(ppois(20,10)-ppois(0,10))
samplerG<-function(n){
  xx<-rpois(n*3,10)
  xxx<-sample(xx[xx<=20],n)
  return(xxx)
}
CDplot(data,m=4,g,par0=NULL,range=NULL,lattice=seq(0,20),
       selection=FALSE,criterion="BIC",B=10,samplerG,R=300,ylim=c(0,2))
Estimates the comparison density for continuous and discrete data.
d_hat(data,m=4,g,range=NULL,lattice=NULL,selection=TRUE,criterion="BIC")
data |
A data vector. See details. |
m |
If |
g |
Function corresponding to the parametric start. See details. |
range |
Interval corresponding to the support of the continuous data distribution. |
lattice |
Support of the discrete data distribution. |
selection |
A logical argument indicating if model selection should be performed. See details. |
criterion |
If |
The argument data collects the data for which we want to test if its distribution corresponds to the one of the postulated model specified in the argument g. The parametric start is assumed to be fully specified and takes x as the only argument.
The value m determines the smoothness of the estimated comparison density, with smaller values of m leading to smoother estimates.
If selection=TRUE, the largest coefficient estimates are selected according to either the AIC or BIC criterion as described in Algeri and Zhang, 2020 (see also Ledwina, 1994 and Mukhopadhyay, 2017). The resulting estimator is the one in Gajek's formulation with orthonormal basis corresponding to LP score functions (see Algeri and Zhang, 2020 and Gajek, 1986).
LPj |
Estimates of the coefficients. |
du |
Function corresponding to the estimated comparison density in the u domain corresponding to the probability integral transformation. |
dx |
Function corresponding to the estimated comparison density in the x domain. |
f |
Function corresponding to the estimated probability function of the data. |
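How the u domain, the x domain and the probability function relate can be sketched in base R. The sketch assumes the usual definition of the comparison density, d(u) = f(G^{-1}(u)) / g(G^{-1}(u)) with G the distribution function of g, so that f(x) = g(x) d(G(x)). The densities and all object names below are illustrative choices; nothing here calls d_hat, which estimates d from data rather than using a known f.

```r
# Illustrative sketch in base R; f is known here only because we choose it.
f <- function(x) dnorm(x)              # density generating the data
g <- function(x) dt(x, df = 10)        # postulated model (parametric start)
G_inv <- function(u) qt(u, df = 10)    # quantile function of g

d_x <- function(x) f(x) / g(x)         # comparison density in the x domain
d_u <- function(u) d_x(G_inv(u))       # same function in the u domain, u = G(x)

# A comparison density integrates to one over (0, 1); midpoint-rule check.
u_grid <- (seq_len(10000) - 0.5) / 10000
approx_integral <- mean(d_u(u_grid))

# f is recovered from g and the comparison density: f(x) = g(x) * d(G(x)).
x0 <- 1.3
f_rebuilt <- g(x0) * d_u(pt(x0, df = 10))
```

A comparison density close to one everywhere indicates that g already describes the data well; departures from one show where and how g fails.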
Sara Algeri and Xiangyu Zhang
Algeri S. and Zhang X. (2020). Exhaustive goodness-of-fit via smoothed inference and graphics. arXiv:2005.13011.
Gajek, L. (1986). On improving density estimators which are not bona fide functions. The Annals of Statistics, 14(4):1612–1618.
Ledwina, T. (1994). Data-driven version of Neyman's smooth test of fit. Journal of the American Statistical Association, 89(427):1000–1005.
Mukhopadhyay, S. (2017). Large-scale mode identification and data-driven sciences. Electronic Journal of Statistics, 11(1):215–240.
library("LPBkg")
#Example discrete
data<-rbinom(1000,size=20,prob=0.5)
g<-function(x)dpois(x,10)/(ppois(20,10)-ppois(0,10))
ddhat<-d_hat(data,m=4,g, range=NULL,lattice=seq(0,20),
             selection=TRUE,criterion="BIC")
xx<-seq(0,20)
ddhat$dx(xx)
ddhat$LPj
#Example continuous
data<-rnorm(1000,0,1)
g<-function(x)dt(x,10)
ddhat<-d_hat(data,m=4,g, range=c(-100,100),
             selection=TRUE,criterion="AIC")
uu<-seq(0,1,length=10)
ddhat$du(uu)
ddhat$LPj
Computes the probability mass function of a mixture of the negative binomials.
dmixnegbinom(x, pis, size, probs)
x |
A scalar or vector of non-negative integer values. |
pis |
A vector collecting the mixture weights. See details. |
size |
A positive value corresponding to the target for number of successful trials. |
probs |
A vector collecting the probabilities of success for each mixture component. |
The argument pis is a vector with length equal to the number of components in the mixture. The vector pis must sum up to one, e.g. c(0.7, 0.2, 0.1).
All the negative binomials contributing to the mixture are assumed to have the same size.
Value of the probability mass function of the mixture of negative binomials evaluated at x.
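As a rough illustration of what such a mixture computes, the following base R sketch forms the weighted sum of negative binomial probability mass functions with a common size. The helper name mix_nb_pmf is hypothetical, and the sketch assumes that size and probs follow the parameterization of stats::dnbinom; it is not the package's implementation.

```r
# Hypothetical helper, not the package's implementation.
mix_nb_pmf <- function(x, pis, size, probs) {
  stopifnot(length(pis) == length(probs), abs(sum(pis) - 1) < 1e-8)
  # One column per component, each evaluated at every x.
  comp <- sapply(probs, function(p) dnbinom(x, size = size, prob = p))
  # Weighted sum across components.
  as.vector(matrix(comp, nrow = length(x)) %*% pis)
}

xx <- 0:30
pmf <- mix_nb_pmf(xx, pis = c(0.4, 0.6), size = 25, probs = c(0.6, 0.7))
```

Because the weights sum to one, the mixture is itself a probability mass function on the non-negative integers.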
Xiangyu Zhang and Sara Algeri
rmixtruncnorm, dmixtruncnorm, rmixnegbinom, find_h_disc.
xx<-seq(0,30,length=31)
dmixnegbinom(xx,pis=c(0.4,0.6),size=25,probs=c(0.6,0.7))
Computes the probability density function of a mixture of truncated normals.
dmixtruncnorm(x, pis, means, sds, range)
x |
A scalar or vector of real values. |
pis |
A vector collecting the mixture weights. See details. |
means |
A vector collecting the means of the mixture components. |
sds |
A vector collecting the standard deviations of the mixture components. |
range |
Interval corresponding to the support of each truncated normal contributing to the mixture. See details. |
The argument pis is a vector with length equal to the number of components in the mixture. The vector pis must sum up to one, e.g. c(0.7, 0.2, 0.1).
The argument range is an interval corresponding to the support of each truncated normal contributing to the mixture. In other words, all the truncated normals contributing to the mixture are assumed to have the same range.
Value of the probability density function of the mixture of truncated normals evaluated at x.
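A minimal base R sketch of such a density is given below, assuming each component is a normal density renormalized to the common interval. The helper names dtnorm1 and mix_tn_pdf are hypothetical; the package itself may rely on the truncnorm package instead.

```r
# Density of one normal truncated to [a, b], base R only (hypothetical helper).
dtnorm1 <- function(x, mean, sd, a, b) {
  inside <- x >= a & x <= b
  dens <- dnorm(x, mean, sd) / (pnorm(b, mean, sd) - pnorm(a, mean, sd))
  ifelse(inside, dens, 0)
}

# Weighted sum of truncated normals sharing the same range (hypothetical helper).
mix_tn_pdf <- function(x, pis, means, sds, range) {
  out <- numeric(length(x))
  for (k in seq_along(pis)) {
    out <- out + pis[k] * dtnorm1(x, means[k], sds[k], range[1], range[2])
  }
  out
}

# The mixture should integrate to one over the common range.
total <- integrate(mix_tn_pdf, lower = 0, upper = 30,
                   pis = c(0.3, 0.6, 0.1), means = c(3, 6, 25),
                   sds = c(3, 4, 10), range = c(0, 30))$value
```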
Sara Algeri and Xiangyu Zhang
rmixtruncnorm, dmixnegbinom, rmixnegbinom, find_h_cont.
xx<-seq(0,30,length=10)
dmixtruncnorm(xx,pis=c(0.3,0.6,0.1),means=c(3,6,25),sds=c(3,4,10),range=c(0,30))
Finds the optimal instrumental density h to be used in the bidirectional acceptance sampling.
find_h_cont(data,g,dhat,range=NULL,M_0=NULL,par0=NULL,lbs,ubs,check.plot=TRUE, ylim.f=c(0,2),ylim.d=c(0,2),global=FALSE)
data |
A data vector. |
g |
Function corresponding to the parametric start or postulated model. See details. |
dhat |
Function corresponding to the estimated comparison density in the x domain. |
range |
Interval corresponding to the support of the continuous data distribution. |
M_0 |
Starting point for optimization. See details. |
par0 |
A vector of starting values of the parameters to be estimated. See details. |
lbs |
A vector of the lower bounds of the parameters to be estimated. |
ubs |
A vector of the upper bounds of the parameters to be estimated. |
check.plot |
A logical argument indicating if the plot comparing the densities involved should be displayed or not. The default is TRUE. |
ylim.f |
If check.plot=TRUE, the range of the y-axis of the plot for the probability density functions. |
ylim.d |
If check.plot=TRUE, the range of the y-axis of the plot for the comparison densities. |
global |
A logical argument indicating if a global optimization is needed to find the instrumental probability function |
The parametric start specified in g is assumed to be fully specified and takes x as the only argument. The argument dhat is the estimated comparison density in the x domain. We usually get the argument dhat by means of the function d_hat within our package.
The value M_0 and the vector par0 are used in the optimization that finds the optimal instrumental density h. Usually, we choose M_0 to be the central point of the range. For example, if the range is from 0 to 30, we choose 15 as the starting point. The choice of M_0 is not expected to substantially affect the accuracy of the solution. The vector par0 collects initial values for the parameters which characterize the instrumental density. For instance, if h is a mixture of p truncated normals, the first p-1 elements of par0 correspond to the starting values for the first p-1 mixture weights. The following p elements are the initial values for the means of the p truncated normals contributing to the mixture. Finally, the last p elements of par0 correspond to the starting values for the standard deviations of the p truncated normals contributing to the mixture.
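The layout of par0 for a mixture of p truncated normals can be made concrete with a small base R sketch. The helper unpack_par0 is hypothetical and exists only to illustrate the ordering described above; recovering the last weight from the sum-to-one constraint is an assumption about how the remaining weight is obtained.

```r
# Hypothetical helper illustrating the layout of par0 for p components.
unpack_par0 <- function(par0, p) {
  stopifnot(length(par0) == 3 * p - 1)
  w <- par0[seq_len(p - 1)]                    # first p-1 mixture weights
  list(
    pis   = c(w, 1 - sum(w)),                  # last weight from the sum-to-one constraint
    means = par0[(p - 1) + seq_len(p)],        # next p elements: means
    sds   = par0[(2 * p - 1) + seq_len(p)]     # last p elements: standard deviations
  )
}

# Two components: one weight, two means, two standard deviations.
start <- unpack_par0(c(0.3, -17, 1, 10, 15), p = 2)
```

With p components, par0 therefore has length 3p-1, and the bound vectors lbs and ubs follow the same ordering.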
The argument global controls whether a global optimization is used. A local method reduces the optimization time, but the solution is particularly sensitive to the choice of par0. Conversely, setting global=TRUE leads to a more accurate result.
Mstar |
The reciprocal of the acceptance rate. |
pis |
The optimal set of mixture weights. |
means |
The optimal mean vector. |
sds |
The optimal set of standard deviations. |
h |
Function corresponding to the optimal instrumental density. |
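The interpretation of Mstar as the reciprocal of the acceptance rate can be illustrated with ordinary accept-reject sampling. This sketch is a plain textbook rejection sampler, not the bidirectional variant used by the package, and the target f, the instrumental density h and the bound M are illustrative choices.

```r
# Plain accept-reject sampling in base R (not the bidirectional variant).
set.seed(42)
f <- function(x) dbeta(x, 2, 2)     # target density (illustrative)
h <- function(x) dunif(x)           # instrumental density (illustrative)
M <- 1.5                            # sup of f/h on [0, 1], attained at x = 0.5

n_prop <- 1e5
x <- runif(n_prop)                  # proposals drawn from h
accept <- runif(n_prop) <= f(x) / (M * h(x))
acc_rate <- mean(accept)            # should be close to 1 / M
sample_f <- x[accept]               # accepted draws follow f
```

An instrumental density that tracks the target closely gives a bound near one, so fewer proposals are wasted.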
Sara Algeri and Xiangyu Zhang
Algeri S. and Zhang X. (2020). Exhaustive goodness-of-fit via smoothed inference and graphics. arXiv:2005.13011.
d_hat, find_h_disc, rmixtruncnorm, dmixtruncnorm.
library("truncnorm")
library("LPBkg")
L=0
U=30
range=c(L,U)
set.seed(12395)
meant=-15
sdt=15
n=300
data<-rtruncnorm(n,a=L,b=U,mean=meant,sd=sdt)
poly2_num<-function(x){4.576-0.317*x+0.00567*x^2}
poly2_den<-integrate(poly2_num,lower=L,upper=U)$value
g<-function(x){poly2_num(x)/poly2_den}
ddhat<-d_hat(data,m=2,g, range=c(L,U), selection=FALSE)$dx
lb=c(0,-20,0,0,0)
ub=c(1,10,rep(30,3))
par0=c(0.3,-17,1,10,15)
find_h_cont(data,g,ddhat,range,M_0=10,par0,lb,ub,ylim.f=c(0,0.25),ylim.d=c(-1,2))
Finds the optimal probability mass function h to be used in the bidirectional acceptance sampling.
find_h_disc(data,g,dhat,lattice=NULL,M_0=NULL,size,par0=NULL,check.plot=TRUE, ylim.f=c(0,2),ylim.d=c(0,2),global=FALSE)
data |
A data vector. |
g |
Function corresponding to the parametric start or postulated model. See details. |
dhat |
Function corresponding to the estimated comparison density in the x domain. |
lattice |
Support of the discrete data distribution. |
size |
A positive value corresponding to the target for number of successful trials. |
M_0 |
Starting point for optimization. See details. |
par0 |
A vector of starting values of the parameters to be estimated. See details. |
check.plot |
A logical argument indicating if the plot comparing the densities involved should be displayed or not. The default is TRUE. |
ylim.f |
If check.plot=TRUE, the range of the y-axis of the plot for the probability density functions. |
ylim.d |
If check.plot=TRUE, the range of the y-axis of the plot for the comparison densities. |
global |
A logical argument indicating if a global optimization is needed to find the instrumental probability function |
The parametric start specified in g is assumed to be fully specified and takes x as the only argument. The argument dhat is the estimated comparison density in the x domain. We usually get the argument dhat by means of the function d_hat within our package.
The value M_0 and the vector par0 are used in the optimization that finds the optimal instrumental probability mass function h. Usually, we choose M_0 to be the central point of the lattice. For example, if the lattice ranges from 0 to 30, we could choose 15 as the starting point. The choice of M_0 is not expected to substantially affect the accuracy of the solution. The vector par0 collects initial values for the parameters which characterize the instrumental probability mass function. For instance, if h is a mixture of p negative binomials, the first p-1 elements of par0 correspond to the starting values for the first p-1 mixture weights. The following p elements are the initial values for the probabilities of success of the p negative binomials contributing to the mixture.
The argument global controls whether a global optimization is used. A local method reduces the optimization time, but the solution is particularly sensitive to the choice of par0. Conversely, setting global=TRUE leads to a more accurate result.
Mstar |
The reciprocal of the acceptance rate. |
pis |
The optimal set of mixture weights. |
probs |
The optimal set of probabilities of success. |
h |
Function corresponding to the optimal instrumental probability mass function. |
Xiangyu Zhang and Sara Algeri
Algeri S. and Zhang X. (2020). Exhaustive goodness-of-fit via smoothed inference and graphics. arXiv:2005.13011.
d_hat, find_h_cont, rmixtruncnorm, dmixtruncnorm.
lattice=seq(0,20,length=21)
n=200
data<-rbinom(n,size=20,prob=0.5)
g<-function(x)dpois(x,10)/(ppois(20,10)-ppois(0,10))
ddhat<-d_hat(data,m=1,g=g,lattice=lattice,selection=TRUE)$dx
find_h_disc(data=data,g=g,dhat=ddhat,lattice,M_0=10,size=15,par0=c(0.3,0.5,0.6),
            check.plot=TRUE,ylim.f=c(0,0.4),ylim.d=c(-3,2.5),global=FALSE)
Generates random samples from a mixture of negative binomials.
rmixnegbinom(n, pis, size, probs)
n |
Size of the random sample. |
pis |
A vector collecting the mixture weights. See details. |
size |
A positive value corresponding to the target for number of successful trials. See details. |
probs |
A vector collecting the probabilities of success for every mixture component. |
The argument pis is a vector with length equal to the number of components in the mixture. The vector pis must sum up to one, e.g. c(0.7, 0.2, 0.1).
All the negative binomials contributing to the mixture are assumed to have the same size.
A vector collecting the random sample of size n from the specified mixture of negative binomials.
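One common way to draw from such a mixture is to sample a component label for each observation and then draw from the selected negative binomial. The base R sketch below follows that two-step scheme; the helper name r_mix_nb is hypothetical, the parameterization is assumed to match stats::rnbinom, and the package's own sampler may be implemented differently.

```r
# Hypothetical two-step sampler: pick a component, then draw from it.
set.seed(7)
r_mix_nb <- function(n, pis, size, probs) {
  # Component label for each draw, chosen with probabilities pis.
  k <- sample(seq_along(pis), size = n, replace = TRUE, prob = pis)
  # rnbinom is vectorised over prob, so each draw uses its own component.
  rnbinom(n, size = size, prob = probs[k])
}

x <- r_mix_nb(1e4, pis = c(0.3, 0.6, 0.1), size = 2, probs = c(0.3, 0.4, 0.2))
mean(x)   # compare with the mixture mean sum(pis * size * (1 - probs) / probs)
```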
Xiangyu Zhang and Sara Algeri
rmixtruncnorm, dmixtruncnorm, dmixnegbinom, find_h_disc.
rmixnegbinom(n=100,pis=c(0.3,0.6,0.1),size=2,probs=c(0.3,0.4,0.2))
Generates random samples from a mixture of truncated normals.
rmixtruncnorm(n, pis, means, sds, range)
n |
Size of the random sample. |
pis |
A vector collecting the mixture weights. See details. |
means |
A vector collecting the means of the mixture components. |
sds |
A vector collecting the standard deviations of the mixture components. |
range |
Interval corresponding to the support of each truncated normal contributing to the mixture. See details. |
The argument pis is a vector with length equal to the number of components in the mixture. The vector pis must sum up to one, e.g. c(0.7, 0.2, 0.1).
The argument range is an interval corresponding to the support of each truncated normal contributing to the mixture.
A vector collecting the random sample of size n from the specified mixture of truncated normals.
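A sampler of this kind can be sketched in base R by choosing a component for each draw and then inverting the normal distribution function restricted to the common range. The helper name r_mix_tn is hypothetical and the inverse-CDF approach is an assumption made for illustration; the package may instead delegate to truncnorm::rtruncnorm.

```r
# Hypothetical sampler: component label first, then inverse-CDF truncation.
set.seed(11)
r_mix_tn <- function(n, pis, means, sds, range) {
  k <- sample(seq_along(pis), size = n, replace = TRUE, prob = pis)
  # Probability mass of each selected normal below the two endpoints.
  lo <- pnorm(range[1], means[k], sds[k])
  hi <- pnorm(range[2], means[k], sds[k])
  # Uniform draw between lo and hi, mapped back through the normal quantile.
  qnorm(lo + runif(n) * (hi - lo), means[k], sds[k])
}

x <- r_mix_tn(1000, pis = c(0.5, 0.5), means = c(3, 6), sds = c(3, 4),
              range = c(0, 30))
range(x)   # all draws fall inside the truncation interval
```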
Sara Algeri and Xiangyu Zhang
dmixtruncnorm, dmixnegbinom, rmixnegbinom, find_h_cont.
rmixtruncnorm(n=10,pis=c(0.5,0.5),means=c(3,6),sds=c(3,4),range=c(0,30))