Inference¶
Sampling¶
-
pymc3.sampling.
sample
(draws=500, step=None, init='auto', n_init=200000, start=None, trace=None, chain=0, njobs=1, tune=500, nuts_kwargs=None, step_kwargs=None, progressbar=True, model=None, random_seed=-1, live_plot=False, discard_tuned_samples=True, live_plot_kwargs=None, **kwargs)¶ Draw samples from the posterior using the given step methods.
Multiple step methods are supported via compound step methods.
Parameters: - draws (int) – The number of samples to draw. Defaults to 500. The number of tuned samples are discarded by default. See discard_tuned_samples.
- step (function or iterable of functions) – A step function or collection of functions. If there are variables without a step methods, step methods for those variables will be assigned automatically.
- init (str {'ADVI', 'ADVI_MAP', 'MAP', 'NUTS', 'auto', None}) –
Initialization method to use. Only works for auto-assigned step methods.
- ADVI: Run ADVI to estimate starting points and diagonal covariance matrix. If njobs > 1 it will sample starting points from the estimated posterior, otherwise it will use the estimated posterior mean.
- ADVI_MAP: Initialize ADVI with MAP and use MAP as starting point.
- MAP: Use the MAP as starting point.
- NUTS: Run NUTS to estimate starting points and covariance matrix. If njobs > 1 it will sample starting points from the estimated posterior, otherwise it will use the estimated posterior mean.
- auto : Auto-initialize, if possible. Currently only works when NUTS is auto-assigned as step method (default).
- None: Do not initialize.
- n_init (int) – Number of iterations of initializer If ‘ADVI’, number of iterations, if ‘nuts’, number of draws.
- start (dict) – Starting point in parameter space (or partial point) Defaults to trace.point(-1)) if there is a trace provided and model.test_point if not (defaults to empty dict).
- trace (backend, list, or MultiTrace) – This should be a backend instance, a list of variables to track, or a MultiTrace object with past values. If a MultiTrace object is given, it must contain samples for the chain number chain. If None or a list of variables, the NDArray backend is used. Passing either “text” or “sqlite” is taken as a shortcut to set up the corresponding backend (with “mcmc” used as the base name).
- chain (int) – Chain number used to store sample in backend. If njobs is greater than one, chain numbers will start here.
- njobs (int) – Number of parallel jobs to start. If None, set to number of cpus in the system - 2.
- tune (int) – Number of iterations to tune, if applicable (defaults to 500). These samples will be drawn in addition to samples and discarded unless discard_tuned_samples is set to True.
- nuts_kwargs (dict) –
Options for the NUTS sampler. See the docstring of NUTS for a complete list of options. Common options are
- target_accept: float in [0, 1]. The step size is tuned such that we approximate this acceptance rate. Higher values like 0.9 or 0.95 often work better for problematic posteriors.
- max_treedepth: The maximum depth of the trajectory tree.
- step_scale: float, default 0.25 The initial guess for the step size scaled down by 1/n**(1/4).
If you want to pass options to other step methods, please use step_kwargs.
- step_kwargs (dict) – Options for step methods. Keys are the lower case names of the step method, values are dicts of keyword arguments. You can find a full list of arguments in the docstring of the step methods. If you want to pass arguments only to nuts, you can use nuts_kwargs.
- progressbar (bool) – Whether or not to display a progress bar in the command line. The bar shows the percentage of completion, the sampling speed in samples per second (SPS), and the estimated remaining time until completion (“expected time of arrival”; ETA).
- model (Model (optional if in with context)) –
- random_seed (int or list of ints) – A list is accepted if more if njobs is greater than one.
- live_plot (bool) – Flag for live plotting the trace while sampling
- live_plot_kwargs (dict) – Options for traceplot. Example: live_plot_kwargs={‘varnames’: [‘x’]}
- discard_tuned_samples (bool) – Whether to discard posterior samples of the tune interval.
Returns: trace (pymc3.backends.base.MultiTrace) – A MultiTrace object that contains the samples.
Examples
>>> import pymc3 as pm ... n = 100 ... h = 61 ... alpha = 2 ... beta = 2
>>> with pm.Model() as model: # context management ... p = pm.Beta('p', alpha=alpha, beta=beta) ... y = pm.Binomial('y', n=n, p=p, observed=h) ... trace = pm.sample(2000, tune=1000, njobs=4) >>> pm.df_summary(trace) mean sd mc_error hpd_2.5 hpd_97.5 p 0.604625 0.047086 0.00078 0.510498 0.694774
-
pymc3.sampling.
iter_sample
(draws, step, start=None, trace=None, chain=0, tune=None, model=None, random_seed=-1)¶ Generator that returns a trace on each iteration using the given step method. Multiple step methods supported via compound step method returns the amount of time taken.
Parameters: - draws (int) – The number of samples to draw
- step (function) – Step function
- start (dict) – Starting point in parameter space (or partial point) Defaults to trace.point(-1)) if there is a trace provided and model.test_point if not (defaults to empty dict)
- trace (backend, list, or MultiTrace) – This should be a backend instance, a list of variables to track, or a MultiTrace object with past values. If a MultiTrace object is given, it must contain samples for the chain number chain. If None or a list of variables, the NDArray backend is used.
- chain (int) – Chain number used to store sample in backend. If njobs is greater than one, chain numbers will start here.
- tune (int) – Number of iterations to tune, if applicable (defaults to None)
- model (Model (optional if in with context)) –
- random_seed (int or list of ints) – A list is accepted if more if njobs is greater than one.
Example
for trace in iter_sample(500, step): ...
-
pymc3.sampling.
sample_ppc
(trace, samples=None, model=None, vars=None, size=None, random_seed=None, progressbar=True)¶ Generate posterior predictive samples from a model given a trace.
Parameters: - trace (backend, list, or MultiTrace) – Trace generated from MCMC sampling
- samples (int) – Number of posterior predictive samples to generate. Defaults to the length of trace
- model (Model (optional if in with context)) – Model used to generate trace
- vars (iterable) – Variables for which to compute the posterior predictive samples. Defaults to model.observed_RVs.
- size (int) – The number of random draws from the distribution specified by the parameters in each sample of the trace.
Returns: samples (dict) – Dictionary with the variables as keys. The values corresponding to the posterior predictive samples.
-
pymc3.sampling.
init_nuts
(init='ADVI', njobs=1, n_init=500000, model=None, random_seed=-1, progressbar=True, **kwargs)¶ Initialize and sample from posterior of a continuous model.
This is a convenience function. NUTS convergence and sampling speed is extremely dependent on the choice of mass/scaling matrix. In our experience, using ADVI to estimate a diagonal covariance matrix and using this as the scaling matrix produces robust results over a wide class of continuous models.
Parameters: - init (str {'ADVI', 'ADVI_MAP', 'MAP', 'NUTS'}) – Initialization method to use. * ADVI : Run ADVI to estimate posterior mean and diagonal covariance matrix. * ADVI_MAP: Initialize ADVI with MAP and use MAP as starting point. * MAP : Use the MAP as starting point. * NUTS : Run NUTS and estimate posterior mean and covariance matrix.
- njobs (int) – Number of parallel jobs to start.
- n_init (int) – Number of iterations of initializer If ‘ADVI’, number of iterations, if ‘metropolis’, number of draws.
- model (Model (optional if in with context)) –
- progressbar (bool) – Whether or not to display a progressbar for advi sampling.
- **kwargs (keyword arguments) – Extra keyword arguments are forwarded to pymc3.NUTS.
Returns: - start (pymc3.model.Point) – Starting point for sampler
- nuts_sampler (pymc3.step_methods.NUTS) – Instantiated and initialized NUTS sampler object
Step-methods¶
NUTS¶
-
class
pymc3.step_methods.hmc.nuts.
NUTS
(vars=None, Emax=1000, target_accept=0.8, gamma=0.05, k=0.75, t0=10, adapt_step_size=True, max_treedepth=10, on_error='summary', **kwargs)¶ A sampler for continuous variables based on Hamiltonian mechanics.
NUTS automatically tunes the step size and the number of steps per sample. A detailed description can be found at [1], “Algorithm 6: Efficient No-U-Turn Sampler with Dual Averaging”.
Nuts provides a number of statistics that can be accessed with trace.get_sampler_stats:
- mean_tree_accept: The mean acceptance probability for the tree that generated this sample. The mean of these values across all samples but the burn-in should be approximately target_accept (the default for this is 0.8).
- diverging: Whether the trajectory for this sample diverged. If there are any divergences after burnin, this indicates that the results might not be reliable. Reparametrization can often help, but you can also try to increase target_accept to something like 0.9 or 0.95.
- energy: The energy at the point in phase-space where the sample was accepted. This can be used to identify posteriors with problematically long tails. See below for an example.
- energy_change: The difference in energy between the start and the end of the trajectory. For a perfect integrator this would always be zero.
- max_energy_change: The maximum difference in energy along the whole trajectory.
- depth: The depth of the tree that was used to generate this sample
- tree_size: The number of leafs of the sampling tree, when the sample was accepted. This is usually a bit less than 2 ** depth. If the tree size is large, the sampler is using a lot of leapfrog steps to find the next sample. This can for example happen if there are strong correlations in the posterior, if the posterior has long tails, if there are regions of high curvature (“funnels”), or if the variance estimates in the mass matrix are inaccurate. Reparametrisation of the model or estimating the posterior variances from past samples might help.
- tune: This is True, if step size adaptation was turned on when this sample was generated.
- step_size: The step size used for this sample.
- step_size_bar: The current best known step-size. After the tuning samples, the step size is set to this value. This should converge during tuning.
References
[1] Hoffman, Matthew D., & Gelman, Andrew. (2011). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Parameters: - vars (list of Theano variables, default all continuous vars) –
- Emax (float, default 1000) – Maximum energy change allowed during leapfrog steps. Larger deviations will abort the integration.
- target_accept (float (0,1), default .8) – Try to find a step size such that the average acceptance probability across the trajectories are close to target_accept. Higher values for target_accept lead to smaller step sizes.
- step_scale (float, default 0.25) – Size of steps to take, automatically scaled down by 1/n**(1/4). If step size adaptation is switched off, the resulting step size is used. If adaptation is enabled, it is used as initial guess.
- gamma (float, default .05) –
- k (float (5,1) default .75) – scaling of speed of adaptation
- t0 (int, default 10) – slows initial adaptation
- adapt_step_size (bool, default=True) – Whether step size adaptation should be enabled. If this is disabled, k, t0, gamma and target_accept are ignored.
- max_treedepth (int, default=10) – The maximum tree depth. Trajectories are stoped when this depth is reached.
- integrator (str, default "leapfrog") – The integrator to use for the trajectories. One of “leapfrog”, “two-stage” or “three-stage”. The second two can increase sampling speed for some high dimensional problems.
- scaling (array_like, ndim = {1,2}) – The inverse mass, or precision matrix. One dimensional arrays are interpreted as diagonal matrices. If is_cov is set to True, this will be interpreded as the mass or covariance matrix.
- is_cov (bool, default=False) – Treat the scaling as mass or covariance matrix.
- on_error ({'summary', 'warn', 'raise'}, default='summary') –
How to report problems during sampling.
- summary: Print one warning after sampling.
- warn: Print individual warnings as soon as they appear.
- raise: Raise an error on the first problem.
- potential (Potential, optional) – An object that represents the Hamiltonian with methods velocity, energy, and random methods. It can be specified instead of the scaling matrix.
- model (pymc3.Model) – The model
- kwargs (passed to BaseHMC) –
Notes
The step size adaptation stops when self.tune is set to False. This is usually achieved by setting the tune parameter if pm.sample to the desired number of tuning steps.
Metropolis¶
-
class
pymc3.step_methods.metropolis.
Metropolis
(vars=None, S=None, proposal_dist=None, scaling=1.0, tune=True, tune_interval=100, model=None, mode=None, **kwargs)¶ Metropolis-Hastings sampling step
Parameters: - vars (list) – List of variables for sampler
- S (standard deviation or covariance matrix) – Some measure of variance to parameterize proposal distribution
- proposal_dist (function) – Function that returns zero-mean deviates when parameterized with S (and n). Defaults to normal.
- scaling (scalar or array) – Initial scale factor for proposal. Defaults to 1.
- tune (bool) – Flag for tuning. Defaults to True.
- tune_interval (int) – The frequency of tuning. Defaults to 100 iterations.
- model (PyMC Model) – Optional model for sampling step. Defaults to None (taken from context).
- mode (string or Mode instance.) – compilation mode passed to Theano functions
-
class
pymc3.step_methods.metropolis.
BinaryMetropolis
(vars, scaling=1.0, tune=True, tune_interval=100, model=None)¶ Metropolis-Hastings optimized for binary variables
Parameters: - vars (list) – List of variables for sampler
- scaling (scalar or array) – Initial scale factor for proposal. Defaults to 1.
- tune (bool) – Flag for tuning. Defaults to True.
- tune_interval (int) – The frequency of tuning. Defaults to 100 iterations.
- model (PyMC Model) – Optional model for sampling step. Defaults to None (taken from context).
-
static
competence
(var)¶ BinaryMetropolis is only suitable for binary (bool) and Categorical variables with k=1.
-
class
pymc3.step_methods.metropolis.
BinaryGibbsMetropolis
(vars, order='random', model=None)¶ A Metropolis-within-Gibbs step method optimized for binary variables
-
static
competence
(var)¶ BinaryMetropolis is only suitable for Bernoulli and Categorical variables with k=2.
-
static
-
class
pymc3.step_methods.metropolis.
CategoricalGibbsMetropolis
(vars, proposal='uniform', order='random', model=None)¶ A Metropolis-within-Gibbs step method optimized for categorical variables. This step method works for Bernoulli variables as well, but it is not optimized for them, like BinaryGibbsMetropolis is. Step method supports two types of proposals: A uniform proposal and a proportional proposal, which was introduced by Liu in his 1996 technical report “Metropolized Gibbs Sampler: An Improvement”.
-
static
competence
(var)¶ CategoricalGibbsMetropolis is only suitable for Bernoulli and Categorical variables.
-
static
Slice¶
-
class
pymc3.step_methods.slicer.
Slice
(vars=None, w=1.0, tune=True, model=None, **kwargs)¶ Univariate slice sampler step method
Parameters: - vars (list) – List of variables for sampler.
- w (float) – Initial width of slice (Defaults to 1).
- tune (bool) – Flag for tuning (Defaults to True).
- model (PyMC Model) – Optional model for sampling step. Defaults to None (taken from context).
Hamiltonian Monte Carlo¶
-
class
pymc3.step_methods.hmc.hmc.
HamiltonianMC
(vars=None, path_length=2.0, step_rand=<function unif>, **kwargs)¶ Parameters: - vars (list of theano variables) –
- path_length (float, default=2) – total length to travel
- step_rand (function float -> float, default=unif) – A function which takes the step size and returns an new one used to randomize the step size at each iteration.
- step_scale (float, default=0.25) – Initial size of steps to take, automatically scaled down by 1/n**(1/4).
- scaling (array_like, ndim = {1,2}) – The inverse mass, or precision matrix. One dimensional arrays are interpreted as diagonal matrices. If is_cov is set to True, this will be interpreded as the mass or covariance matrix.
- is_cov (bool, default=False) – Treat the scaling as mass or covariance matrix.
- potential (Potential, optional) – An object that represents the Hamiltonian with methods velocity, energy, and random methods. It can be specified instead of the scaling matrix.
- model (pymc3.Model) – The model
- **kwargs (passed to BaseHMC) –
Variational¶
OPVI¶
Variational inference is a great approach for doing really complex, often intractable Bayesian inference in approximate form. Common methods (e.g. ADVI) lack from complexity so that approximate posterior does not reveal the true nature of underlying problem. In some applications it can yield unreliable decisions.
Recently on NIPS 2017 OPVI framework was presented. It generalizes variational inverence so that the problem is build with blocks. The first and essential block is Model itself. Second is Approximation, in some cases \(log Q(D)\) is not really needed. Necessity depends on the third and forth part of that black box, Operator and Test Function respectively.
Operator is like an approach we use, it constructs loss from given Model, Approximation and Test Function. The last one is not needed if we minimize KL Divergence from Q to posterior. As a drawback we need to compute \(loq Q(D)\). Sometimes approximation family is intractable and \(loq Q(D)\) is not available, here comes LS(Langevin Stein) Operator with a set of test functions.
Test Function has more unintuitive meaning. It is usually used with LS operator and represents all we want from our approximate distribution. For any given vector based function of \(z\) LS operator yields zero mean function under posterior. \(loq Q(D)\) is no more needed. That opens a door to rich approximation families as neural networks.
References
- Rajesh Ranganath, Jaan Altosaar, Dustin Tran, David M. Blei Operator Variational Inference https://arxiv.org/abs/1610.09033 (2016)
-
class
pymc3.variational.opvi.
ObjectiveFunction
(op, tf)¶ Helper class for construction loss and updates for variational inference
Parameters: - op (
Operator
) – OPVI Functional operator - tf (
TestFunction
) – OPVI TestFunction
-
random
(size=None)¶ Posterior distribution from initial latent space
Parameters: size (int) – number of samples from distribution Returns: posterior space (theano)
-
score_function
(sc_n_mc=None, more_replacements=None, fn_kwargs=None)¶ Compiles scoring function that operates which takes no inputs and returns Loss
Parameters: - sc_n_mc (int) – number of scoring MC samples
- more_replacements – Apply custom replacements before compiling a function
- fn_kwargs (dict) – arbitrary kwargs passed to theano.function
Returns: theano.function
-
step_function
(obj_n_mc=None, tf_n_mc=None, obj_optimizer=<function adagrad_window>, test_optimizer=<function adagrad_window>, more_obj_params=None, more_tf_params=None, more_updates=None, more_replacements=None, total_grad_norm_constraint=None, score=False, fn_kwargs=None)¶ Step function that should be called on each optimization step.
Generally it solves the following problem:
\[\mathbf{\lambda^{*}} = \inf_{\lambda} \sup_{\theta} t(\mathbb{E}_{\lambda}[(O^{p,q}f_{\theta})(z)])\]Parameters: - obj_n_mc (int) – Number of monte carlo samples used for approximation of objective gradients
- tf_n_mc (int) – Number of monte carlo samples used for approximation of test function gradients
- obj_optimizer (function (loss, params) -> updates) – Optimizer that is used for objective params
- test_optimizer (function (loss, params) -> updates) – Optimizer that is used for test function params
- more_obj_params (list) – Add custom params for objective optimizer
- more_tf_params (list) – Add custom params for test function optimizer
- more_updates (dict) – Add custom updates to resulting updates
- total_grad_norm_constraint (float) – Bounds gradient norm, prevents exploding gradient problem
- score (bool) – calculate loss on each step? Defaults to False for speed
- fn_kwargs (dict) – Add kwargs to theano.function (e.g. {‘profile’: True})
- more_replacements (dict) – Apply custom replacements before calculating gradients
Returns: theano.function
-
updates
(obj_n_mc=None, tf_n_mc=None, obj_optimizer=<function adagrad_window>, test_optimizer=<function adagrad_window>, more_obj_params=None, more_tf_params=None, more_updates=None, more_replacements=None, total_grad_norm_constraint=None)¶ Calculates gradients for objective function, test function and then constructs updates for optimization step
Parameters: - obj_n_mc (int) – Number of monte carlo samples used for approximation of objective gradients
- tf_n_mc (int) – Number of monte carlo samples used for approximation of test function gradients
- obj_optimizer (function (loss, params) -> updates) – Optimizer that is used for objective params
- test_optimizer (function (loss, params) -> updates) – Optimizer that is used for test function params
- more_obj_params (list) – Add custom params for objective optimizer
- more_tf_params (list) – Add custom params for test function optimizer
- more_updates (dict) – Add custom updates to resulting updates
- more_replacements (dict) – Apply custom replacements before calculating gradients
- total_grad_norm_constraint (float) – Bounds gradient norm, prevents exploding gradient problem
Returns: ObjectiveUpdates
- op (
-
class
pymc3.variational.opvi.
Operator
(approx)¶ Base class for Operator
Parameters: approx ( Approximation
) – an approximation instanceNotes
For implementing Custom operator it is needed to define
Operator.apply()
method-
OBJECTIVE
¶ alias of
ObjectiveFunction
-
apply
(f)¶ Operator itself
\[(O^{p,q}f_{\theta})(z)\]Parameters: f ( TestFunction
or None) – function that takes z = self.input and returns same dimensional outputReturns: TensorVariable – symbolically applied operator
-
-
class
pymc3.variational.opvi.
Approximation
(local_rv=None, model=None, cost_part_grad_scale=1, scale_cost_to_minibatch=False, random_seed=None, **kwargs)¶ Base class for approximations.
Parameters: - local_rv (dict[var->tuple]) – mapping {model_variable -> local_variable (\(\mu\), \(\rho\))} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details
- model (
Model
) – PyMC3 model for inference - cost_part_grad_scale (float or scalar tensor) – Scaling score part of gradient can be useful near optimum for archiving better convergence properties. Common schedule is 1 at the start and 0 in the end. So slow decay will be ok. See (Sticking the Landing; Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016) for details
- scale_cost_to_minibatch (bool, default False) – Scale cost to minibatch instead of full dataset
- random_seed (None or int) – leave None to use package global RandomStream or other valid value to create instance specific one
Notes
Defining an approximation needs custom implementation of the following methods:
.create_shared_params(**kwargs)
- Returns {dict|list|theano.shared}
.random_global(size=None, no_rand=False)
- Generate samples from posterior. If no_rand==False: sample from MAP of initial distribution. Returns TensorVariable
.log_q_W_global(z)
- It is needed only if used with operator that requires \(logq\) of an approximation Returns Scalar
You can also override the following methods:
._setup(**kwargs)
Do some specific stuff having kwargs before callingApproximation.create_shared_params()
.check_model(model, **kwargs)
Do some specific check for model having kwargs
kwargs mentioned above are supplied as additional arguments for
Approximation
There are some defaults class attributes for approximation classes that can be optionally overridden.
initial_dist_name
string that represents name of the initial distribution. In most cases if will be uniform or normalinitial_dist_map
float where initial distribution has maximum density
References
- Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016 Sticking the Landing: A Simple Reduced-Variance Gradient for ADVI approximateinference.org/accepted/RoederEtAl2016.pdf
- Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. stat, 1050, 1.
-
apply_replacements
(node, deterministic=False, include=None, exclude=None, more_replacements=None)¶ Replace variables in graph with variational approximation. By default, replaces all variables
Parameters: - node (Theano Variables (or Theano expressions)) – node or nodes for replacements
- deterministic (bool) – whether to use zeros as initial distribution if True - zero initial point will produce constant latent variables
- include (list) – latent variables to be replaced
- exclude (list) – latent variables to be excluded for replacements
- more_replacements (dict) – add custom replacements to graph, e.g. change input source
Returns: node(s) with replacements
-
check_model
(model, **kwargs)¶ Checks that model is valid for variational inference
-
construct_replacements
(include=None, exclude=None, more_replacements=None)¶ Construct replacements with given conditions
Parameters: - include (list) – latent variables to be replaced
- exclude (list) – latent variables to be excluded for replacements
- more_replacements (dict) – add custom replacements to graph, e.g. change input source
Returns: dict – Replacements
Returns: {dict|list|theano.shared}
-
initial
(size, no_rand=False, l=None)¶ Initial distribution for constructing posterior
Parameters: - size (int) – number of samples
- no_rand (bool) – return zeros if True
- l (int) – length of sample, defaults to latent space dim
Returns: tt.TensorVariable – sampled latent space
-
log_q_W_global
(z)¶ log_q_W samples over q for global vars
-
log_q_W_local
(z)¶ log_q_W samples over q for local vars Gradient wrt mu, rho in density parametrization can be scaled to lower variance of ELBO
-
logq
(z)¶ Total logq for approximation
-
random
(size=None, no_rand=False)¶ Implements posterior distribution from initial latent space
Parameters: - size (scalar) – number of samples from distribution
- no_rand (bool) – whether use deterministic distribution
Returns: posterior space (theano)
-
random_fn
¶ Implements posterior distribution from initial latent space
Parameters: - size (int) – number of samples from distribution
- no_rand (bool) – whether use deterministic distribution
Returns: posterior space (numpy)
-
random_global
(size=None, no_rand=False)¶ Implements posterior distribution from initial latent space
Parameters: - size (scalar) – number of samples from distribution
- no_rand (bool) – whether use deterministic distribution
Returns: global posterior space
-
random_local
(size=None, no_rand=False)¶ Implements posterior distribution from initial latent space
Parameters: - size (scalar) – number of samples from distribution
- no_rand (bool) – whether use deterministic distribution
Returns: local posterior space
-
sample
(draws=1, include_transformed=False)¶ Draw samples from variational posterior.
Parameters: - draws (int) – Number of random samples.
- include_transformed (bool) – If True, transformed variables are also sampled. Default is False.
Returns: trace (
pymc3.backends.base.MultiTrace
) – Samples drawn from variational posterior.
-
sample_node
(node, size=100, more_replacements=None)¶ Samples given node or nodes over shared posterior
Parameters: - node (Theano Variables (or Theano expressions)) –
- size (scalar) – number of samples
- more_replacements (dict) – add custom replacements to graph, e.g. change input source
Returns: sampled node(s) with replacements
-
scale_grad
(inp)¶ Rescale gradient of input
References
- Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016
- Sticking the Landing: A Simple Reduced-Variance Gradient for ADVI approximateinference.org/accepted/RoederEtAl2016.pdf
-
seed
(random_seed=None)¶ Reinitialize RandomStream used by this approximation
Parameters: random_seed (int) – New random seed
-
to_flat_input
(node)¶ Replaces vars with flattened view stored in self.input
-
view
(space, name, reshape=True)¶ Construct view on a variable from flattened space
Parameters: - space (matrix or vector) – space to take view of variable from
- name (str) – name of variable
- reshape (bool) – whether to reshape variable from vectorized view
Returns: (reshaped) slice of matrix – variable view
Inference¶
-
class
pymc3.variational.inference.
ADVI
(local_rv=None, model=None, cost_part_grad_scale=1, scale_cost_to_minibatch=False, random_seed=None, start=None)¶ Automatic Differentiation Variational Inference (ADVI)
This class implements the meanfield ADVI, where the variational posterior distribution is assumed to be spherical Gaussian without correlation of parameters and fit to the true posterior distribution. The means and standard deviations of the variational posterior are referred to as variational parameters.
For explanation, we classify random variables in probabilistic models into three types. Observed random variables \({\cal Y}=\{\mathbf{y}_{i}\}_{i=1}^{N}\) are \(N\) observations. Each \(\mathbf{y}_{i}\) can be a set of observed random variables, i.e., \(\mathbf{y}_{i}=\{\mathbf{y}_{i}^{k}\}_{k=1}^{V_{o}}\), where \(V_{k}\) is the number of the types of observed random variables in the model.
The next ones are global random variables \(\Theta=\{\theta^{k}\}_{k=1}^{V_{g}}\), which are used to calculate the probabilities for all observed samples.
The last ones are local random variables \({\cal Z}=\{\mathbf{z}_{i}\}_{i=1}^{N}\), where \(\mathbf{z}_{i}=\{\mathbf{z}_{i}^{k}\}_{k=1}^{V_{l}}\). These RVs are used only in AEVB.
The goal of ADVI is to approximate the posterior distribution \(p(\Theta,{\cal Z}|{\cal Y})\) by variational posterior \(q(\Theta)\prod_{i=1}^{N}q(\mathbf{z}_{i})\). All of these terms are normal distributions (mean-field approximation).
\(q(\Theta)\) is parametrized with its means and standard deviations. These parameters are denoted as \(\gamma\). While \(\gamma\) is a constant, the parameters of \(q(\mathbf{z}_{i})\) are dependent on each observation. Therefore these parameters are denoted as \(\xi(\mathbf{y}_{i}; \nu)\), where \(\nu\) is the parameters of \(\xi(\cdot)\). For example, \(\xi(\cdot)\) can be a multilayer perceptron or convolutional neural network.
In addition to \(\xi(\cdot)\), we can also include deterministic mappings for the likelihood of observations. We denote the parameters of the deterministic mappings as \(\eta\). An example of such mappings is the deconvolutional neural network used in the convolutional VAE example in the PyMC3 notebook directory.
This function maximizes the evidence lower bound (ELBO) \({\cal L}(\gamma, \nu, \eta)\) defined as follows:
\[\begin{split}{\cal L}(\gamma,\nu,\eta) & = \mathbf{c}_{o}\mathbb{E}_{q(\Theta)}\left[ \sum_{i=1}^{N}\mathbb{E}_{q(\mathbf{z}_{i})}\left[ \log p(\mathbf{y}_{i}|\mathbf{z}_{i},\Theta,\eta) \right]\right] \\ & - \mathbf{c}_{g}KL\left[q(\Theta)||p(\Theta)\right] - \mathbf{c}_{l}\sum_{i=1}^{N} KL\left[q(\mathbf{z}_{i})||p(\mathbf{z}_{i})\right],\end{split}\]where \(KL[q(v)||p(v)]\) is the Kullback-Leibler divergence
\[KL[q(v)||p(v)] = \int q(v)\log\frac{q(v)}{p(v)}dv,\]\(\mathbf{c}_{o/g/l}\) are vectors for weighting each term of ELBO. More precisely, we can write each of the terms in ELBO as follows:
\[\begin{split}\mathbf{c}_{o}\log p(\mathbf{y}_{i}|\mathbf{z}_{i},\Theta,\eta) & = & \sum_{k=1}^{V_{o}}c_{o}^{k} \log p(\mathbf{y}_{i}^{k}| {\rm pa}(\mathbf{y}_{i}^{k},\Theta,\eta)) \\ \mathbf{c}_{g}KL\left[q(\Theta)||p(\Theta)\right] & = & \sum_{k=1}^{V_{g}}c_{g}^{k}KL\left[ q(\theta^{k})||p(\theta^{k}|{\rm pa(\theta^{k})})\right] \\ \mathbf{c}_{l}KL\left[q(\mathbf{z}_{i}||p(\mathbf{z}_{i})\right] & = & \sum_{k=1}^{V_{l}}c_{l}^{k}KL\left[ q(\mathbf{z}_{i}^{k})|| p(\mathbf{z}_{i}^{k}|{\rm pa}(\mathbf{z}_{i}^{k}))\right],\end{split}\]where \({\rm pa}(v)\) denotes the set of parent variables of \(v\) in the directed acyclic graph of the model.
When using mini-batches, \(c_{o}^{k}\) and \(c_{l}^{k}\) should be set to \(N/M\), where \(M\) is the number of observations in each mini-batch. This is done with supplying total_size parameter to observed nodes (e.g.
Normal('x', 0, 1, observed=data, total_size=10000)
). In this case it is possible to automatically determine appropriate scaling for \(logp\) of observed nodes. Interesting to note that it is possible to have two independent observed variables with different total_size and iterate them independently during inference.For working with ADVI, we need to give
The probabilistic model
model with three types of RVs (observed_RVs, global_RVs and local_RVs).
(optional) Minibatches
The tensors to which mini-bathced samples are supplied are handled separately by using callbacks in
Inference.fit()
method that change storage of shared theano variable or bypymc3.generator()
that automatically iterates over minibatches and defined beforehand.(optional) Parameters of deterministic mappings
They have to be passed along with other params to
Inference.fit()
method as more_obj_params argument.
For more information concerning training stage please reference
pymc3.variational.opvi.ObjectiveFunction.step_function()
Parameters: - local_rv (dict[var->tuple]) – mapping {model_variable -> local_variable (\(\mu\), \(\rho\))} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details
- model (
pymc3.Model
) – PyMC3 model for inference - cost_part_grad_scale (scalar) – Scaling score part of gradient can be useful near optimum for archiving better convergence properties. Common schedule is 1 at the start and 0 in the end. So slow decay will be ok. See (Sticking the Landing; Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016) for details
- scale_cost_to_minibatch (bool) – Scale cost to minibatch instead of full dataset, default False
- random_seed (None or int) – leave None to use package global RandomStream or other valid value to create instance specific one
- start (Point) – starting point for inference
References
- Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2016). Automatic Differentiation Variational Inference. arXiv preprint arXiv:1603.00788.
- Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016 Sticking the Landing: A Simple Reduced-Variance Gradient for ADVI approximateinference.org/accepted/RoederEtAl2016.pdf
- Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. stat, 1050, 1.
-
class
pymc3.variational.inference.
FullRankADVI
(local_rv=None, model=None, cost_part_grad_scale=1, scale_cost_to_minibatch=False, gpu_compat=False, random_seed=None, start=None)¶ Full Rank Automatic Differentiation Variational Inference (ADVI)
Parameters: - local_rv (dict[var->tuple]) – mapping {model_variable -> local_variable (\(\mu\), \(\rho\))} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details
- model (
pymc3.Model
) – PyMC3 model for inference - cost_part_grad_scale (scalar) – Scaling score part of gradient can be useful near optimum for archiving better convergence properties. Common schedule is 1 at the start and 0 in the end. So slow decay will be ok. See (Sticking the Landing; Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016) for details
- scale_cost_to_minibatch (bool, default False) – Scale cost to minibatch instead of full dataset
- random_seed (None or int) – leave None to use package global RandomStream or other valid value to create instance specific one
- start (Point) – starting point for inference
References
- Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2016). Automatic Differentiation Variational Inference. arXiv preprint arXiv:1603.00788.
- Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016 Sticking the Landing: A Simple Reduced-Variance Gradient for ADVI approximateinference.org/accepted/RoederEtAl2016.pdf
- Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. stat, 1050, 1.
-
classmethod
from_advi
(advi, gpu_compat=False)¶ Construct FullRankADVI from ADVI
Parameters: advi ( ADVI
) –Other Parameters: gpu_compat (bool) – use GPU compatible version or not Returns: FullRankADVI
-
classmethod
from_full_rank
(full_rank)¶ Construct FullRankADVI from FullRank approximation
Parameters: full_rank ( FullRank
) – approximation to start withReturns: FullRankADVI
-
classmethod
from_mean_field
(mean_field, gpu_compat=False)¶ Construct FullRankADVI from MeanField approximation
Parameters: mean_field ( MeanField
) – approximation to start withOther Parameters: gpu_compat (bool) – use GPU compatible version or not Returns: FullRankADVI
-
class
pymc3.variational.inference.
SVGD
(n_particles=100, jitter=0.01, model=None, kernel=<pymc3.variational.test_functions.RBF object>, temperature=1, scale_cost_to_minibatch=False, start=None, histogram=None, random_seed=None, local_rv=None)¶ Stein Variational Gradient Descent
This inference is based on Kernelized Stein Discrepancy it’s main idea is to move initial noisy particles so that they fit target distribution best.
Algorithm is outlined below
- Input: A target distribution with density function \(p(x)\)
- and a set of initial particles \({x^0_i}^n_{i=1}\)
Output: A set of particles \({x_i}^n_{i=1}\) that approximates the target distribution.
\[\begin{split}x_i^{l+1} &\leftarrow x_i^{l} + \epsilon_l \hat{\phi}^{*}(x_i^l) \\ \hat{\phi}^{*}(x) &= \frac{1}{n}\sum^{n}_{j=1}[k(x^l_j,x) \nabla_{x^l_j} logp(x^l_j)+ \nabla_{x^l_j} k(x^l_j,x)]\end{split}\]Parameters: - n_particles (int) – number of particles to use for approximation
- jitter (float) – noise sd for initial point
- model (
pymc3.Model
) – PyMC3 model for inference - kernel (callable) – kernel function for KSD \(f(histogram) -> (k(x,.), \nabla_x k(x,.))\)
- temperature (float) – parameter responsible for exploration, higher temperature gives more broad posterior estimate
- scale_cost_to_minibatch (bool, default False) – Scale cost to minibatch instead of full dataset
- start (Point) – initial point for inference
- histogram (
Empirical
) – initialize SVGD with given Empirical approximation instead of default initial particles - random_seed (None or int) – leave None to use package global RandomStream or other valid value to create instance specific one
- start – starting point for inference
References
- Qiang Liu, Dilin Wang (2016) Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm arXiv:1608.04471
- Yang Liu, Prajit Ramachandran, Qiang Liu, Jian Peng (2017) Stein Variational Policy Gradient arXiv:1704.02399
-
class
pymc3.variational.inference.
ASVGD
(approx=<class 'pymc3.variational.approximations.FullRank'>, local_rv=None, kernel=<pymc3.variational.test_functions.RBF object>, temperature=1, model=None, **kwargs)¶ Amortized Stein Variational Gradient Descent
This inference is based on Kernelized Stein Discrepancy it’s main idea is to move initial noisy particles so that they fit target distribution best.
Algorithm is outlined below
Input: Parametrized random generator \(R_{\theta}\)
Output: \(R_{\theta^{*}}\) that approximates the target distribution.
\[\begin{split}\Delta x_i &= \hat{\phi}^{*}(x_i) \\ \hat{\phi}^{*}(x) &= \frac{1}{n}\sum^{n}_{j=1}[k(x_j,x) \nabla_{x_j} logp(x_j)+ \nabla_{x_j} k(x_j,x)] \\ \Delta_{\theta} &= \frac{1}{n}\sum^{n}_{i=1}\Delta x_i\frac{\partial x_i}{\partial \theta}\end{split}\]Parameters: - approx (
Approximation
) – - local_rv (dict[var->tuple]) – mapping {model_variable -> local_variable (\(\mu\), \(\rho\))} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details
- kernel (callable) – kernel function for KSD \(f(histogram) -> (k(x,.), \nabla_x k(x,.))\)
- temperature (float) – parameter responsible for exploration, higher temperature gives more broad posterior estimate
- model (
Model
) – - kwargs (kwargs for
Approximation
) –
References
- Dilin Wang, Yihao Feng, Qiang Liu (2016) Learning to Sample Using Stein Discrepancy http://bayesiandeeplearning.org/papers/BDL_21.pdf
- Dilin Wang, Qiang Liu (2016) Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning arXiv:1611.01722
- Yang Liu, Prajit Ramachandran, Qiang Liu, Jian Peng (2017) Stein Variational Policy Gradient arXiv:1704.02399
-
fit
(n=10000, score=None, callbacks=None, progressbar=True, obj_n_mc=30, **kwargs)¶ Performs Amortized Stein Variational Gradient Descent
Parameters: - n (int) – number of iterations
- score (bool) – evaluate loss on each iteration or not
- callbacks (list[function : (Approximation, losses, i) -> None]) – calls provided functions after each iteration step
- progressbar (bool) – whether to show progressbar or not
- obj_n_mc (int) – sample n particles for Stein gradient
- kwargs (kwargs) – additional kwargs for
ObjectiveFunction.step_function()
Returns: Approximation
- approx (
-
class
pymc3.variational.inference.
Inference
(op, approx, tf, local_rv=None, model=None, op_kwargs=None, **kwargs)¶ Base class for Variational Inference
Communicates Operator, Approximation and Test Function to build Objective Function
Parameters: - op (Operator class) –
- approx (Approximation class or instance) –
- tf (TestFunction instance) –
- local_rv (dict) – mapping {model_variable -> local_variable} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details
- model (Model) – PyMC3 Model
- op_kwargs (dict) – kwargs passed to
Operator
- kwargs (kwargs) – additional kwargs for
Approximation
-
fit
(n=10000, score=None, callbacks=None, progressbar=True, **kwargs)¶ Performs Operator Variational Inference
Parameters: - n (int) – number of iterations
- score (bool) – evaluate loss on each iteration or not
- callbacks (list[function : (Approximation, losses, i) -> None]) – calls provided functions after each iteration step
- progressbar (bool) – whether to show progressbar or not
- kwargs (kwargs) – additional kwargs for
ObjectiveFunction.step_function()
Returns: Approximation
-
pymc3.variational.inference.
fit
(n=10000, local_rv=None, method='advi', model=None, random_seed=None, start=None, inf_kwargs=None, **kwargs)¶ Handy shortcut for using inference methods in functional way
Parameters: - n (int) – number of iterations
- local_rv (dict[var->tuple]) – mapping {model_variable -> local_variable (\(\mu\), \(\rho\))} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details
- method (str or
Inference
) – string name is case insensitive in {‘advi’, ‘fullrank_advi’, ‘advi->fullrank_advi’, ‘svgd’, ‘asvgd’} - model (
pymc3.Model
) – PyMC3 model for inference - random_seed (None or int) – leave None to use package global RandomStream or other valid value to create instance specific one
- inf_kwargs (dict) – additional kwargs passed to
Inference
- start (Point) – starting point for inference
Other Parameters: - frac (float) – if method is ‘advi->fullrank_advi’ represents advi fraction when training
- kwargs (kwargs) – additional kwargs for
Inference.fit()
Returns: Approximation
Approximations¶
-
class
pymc3.variational.approximations.
MeanField
(local_rv=None, model=None, cost_part_grad_scale=1, scale_cost_to_minibatch=False, random_seed=None, **kwargs)¶ Mean Field approximation to the posterior where spherical Gaussian family is fitted to minimize KL divergence from True posterior. It is assumed that latent space variables are uncorrelated that is the main drawback of the method
Parameters: - local_rv (dict[var->tuple]) – mapping {model_variable -> local_variable (\(\mu\), \(\rho\))} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details
- model (
pymc3.Model
) – PyMC3 model for inference - start (Point) – initial mean
- cost_part_grad_scale (scalar) – Scaling score part of gradient can be useful near optimum for archiving better convergence properties. Common schedule is 1 at the start and 0 in the end. So slow decay will be ok. See (Sticking the Landing; Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016) for details
- scale_cost_to_minibatch (bool) – Scale cost to minibatch instead of full dataset, default False
- random_seed (None or int) – leave None to use package global RandomStream or other valid value to create instance specific one
References
- Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016 Sticking the Landing: A Simple Reduced-Variance Gradient for ADVI approximateinference.org/accepted/RoederEtAl2016.pdf
-
log_q_W_global
(z)¶ log_q_W samples over q for global vars
-
class
pymc3.variational.approximations.
FullRank
(local_rv=None, model=None, cost_part_grad_scale=1, scale_cost_to_minibatch=False, gpu_compat=False, random_seed=None, **kwargs)¶ Full Rank approximation to the posterior where Multivariate Gaussian family is fitted to minimize KL divergence from True posterior. In contrast to MeanField approach correlations between variables are taken in account. The main drawback of the method is computational cost.
Parameters: - local_rv (dict[var->tuple]) – mapping {model_variable -> local_variable (\(\mu\), \(\rho\))} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details
- model (PyMC3 model for inference) –
- start (Point) – initial mean
- cost_part_grad_scale (float or scalar tensor) – Scaling score part of gradient can be useful near optimum for archiving better convergence properties. Common schedule is 1 at the start and 0 in the end. So slow decay will be ok. See (Sticking the Landing; Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016) for details
- scale_cost_to_minibatch (bool, default False) – Scale cost to minibatch instead of full dataset
- random_seed (None or int) – leave None to use package global RandomStream or other valid value to create instance specific one
Other Parameters: gpu_compat (bool) – use GPU compatible version or not
References
- Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016 Sticking the Landing: A Simple Reduced-Variance Gradient for ADVI approximateinference.org/accepted/RoederEtAl2016.pdf
-
classmethod
from_mean_field
(mean_field, gpu_compat=False)¶ Construct FullRank from MeanField approximation
Parameters: mean_field ( MeanField
) – approximation to start withOther Parameters: gpu_compat (bool) – use GPU compatible version or not Returns: FullRank
-
log_q_W_global
(z)¶ log_q_W samples over q for global vars
-
class
pymc3.variational.approximations.
Empirical
(trace, local_rv=None, scale_cost_to_minibatch=False, model=None, random_seed=None, **kwargs)¶ Builds Approximation instance from a given trace, it has the same interface as variational approximation
Parameters: - trace (
MultiTrace
) – Trace storing samples (e.g. from step methods) - local_rv (dict[var->tuple]) – Experimental for Empirical Approximation mapping {model_variable -> local_variable (\(\mu\), \(\rho\))} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details
- scale_cost_to_minibatch (bool) – Scale cost to minibatch instead of full dataset, default False
- model (
pymc3.Model
) – PyMC3 model for inference - random_seed (None or int) – leave None to use package global RandomStream or other valid value to create instance specific one
Examples
>>> with model: ... step = NUTS() ... trace = sample(1000, step=step) ... histogram = Empirical(trace[100:])
-
classmethod
from_noise
(size, jitter=0.01, local_rv=None, start=None, model=None, random_seed=None, **kwargs)¶ Initialize Histogram with random noise
Parameters: - size (int) – number of initial particles
- jitter (float) – initial sd
- local_rv (dict) – mapping {model_variable -> local_variable} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details
- start (Point) – initial point
- model (
pymc3.Model
) – PyMC3 model for inference - random_seed (None or int) – leave None to use package global RandomStream or other valid value to create instance specific one
- kwargs (other kwargs passed to init) –
Returns:
-
histogram
¶ Shortcut to flattened Trace
-
histogram_logp
¶ Symbolic logp for every point in trace
- trace (
-
pymc3.variational.approximations.
sample_approx
(approx, draws=100, include_transformed=True)¶ Draw samples from variational posterior.
Parameters: - approx (
Approximation
) – Approximation to sample from - draws (int) – Number of random samples.
- include_transformed (bool) – If True, transformed variables are also sampled. Default is True.
Returns: trace (class:pymc3.backends.base.MultiTrace) – Samples drawn from variational posterior.
- approx (
Operators¶
-
class
pymc3.variational.operators.
KL
(approx)¶ Operator based on Kullback Leibler Divergence
\[KL[q(v)||p(v)] = \int q(v)\log\frac{q(v)}{p(v)}dv\]
-
class
pymc3.variational.operators.
KSD
(approx, temperature=1)¶ Operator based on Kernelized Stein Discrepancy
- Input: A target distribution with density function \(p(x)\)
- and a set of initial particles \(\{x^0_i\}^n_{i=1}\)
Output: A set of particles \(\{x_i\}^n_{i=1}\) that approximates the target distribution.
\[\begin{split}x_i^{l+1} \leftarrow \epsilon_l \hat{\phi}^{*}(x_i^l) \\ \hat{\phi}^{*}(x) = \frac{1}{n}\sum^{n}_{j=1}[k(x^l_j,x) \nabla_{x^l_j} logp(x^l_j)+ \nabla_{x^l_j} k(x^l_j,x)]\end{split}\]Parameters: approx ( Empirical
) – Empirical Approximation used for inferenceReferences
- Qiang Liu, Dilin Wang (2016) Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm arXiv:1608.04471
-
OBJECTIVE
¶ alias of
KSDObjective