Data

pymc3.data.get_data(filename)

Returns a BytesIO object for a package data file.

Parameters:	filename (str) – file to load
Returns:	BytesIO of the data
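Because the return value is an in-memory binary stream, it can be passed to anything that accepts a file-like object. The following is a minimal sketch of that pattern, assuming a CSV payload. The bytes are made up and stand in for whatever get_data would return, so the example does not depend on any particular packaged file.

```python
import csv
import io

# Hypothetical stand-in for the BytesIO object returned by get_data(filename).
payload = io.BytesIO(b"x,y\n1,2.5\n3,4.5\n")

# A BytesIO yields bytes, so wrap it to decode text before parsing it as CSV.
text = io.TextIOWrapper(payload, encoding="utf-8", newline="")
rows = list(csv.DictReader(text))
```

Libraries that read binary file objects directly can take the BytesIO object without the text wrapper.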

class pymc3.data.GeneratorAdapter(generator)

Helper class that infers the data type of a generator by looking at its first item, while preserving the order of the items in the resulting generator.
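The idea of inspecting the first item without losing it can be sketched in plain Python. This illustrates the general technique only and is not the class's actual implementation; peek_first is a hypothetical helper name.

```python
import itertools


def peek_first(generator):
    """Return (first_item, generator yielding the same items in the same order)."""
    iterator = iter(generator)
    first = next(iterator)  # consume one item so it can be inspected
    # Chain the inspected item back in front so nothing is lost or reordered.
    return first, itertools.chain([first], iterator)


first, restored = peek_first(i * 0.5 for i in range(4))
inferred_type = type(first).__name__
items = list(restored)
```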

class pymc3.data.Minibatch(data, batch_size=128, dtype=None, broadcastable=None, name='Minibatch', random_seed=42, update_shared_f=None, in_memory_size=None)

A multidimensional minibatch that is a pure TensorVariable.

Parameters:

data (ndarray) – initial data
batch_size (int or List[int|tuple(size, random_seed)]) – batch size for inference; a random seed is needed for the child random generators
dtype (str) – cast data to a specific type
broadcastable (tuple[bool]) – broadcastable pattern to use; defaults to (False, ) * ndim
name (str) – name for tensor, defaults to “Minibatch”
random_seed (int) – random seed that is used by default
update_shared_f (callable) – returns an ndarray that is stored in the underlying shared variable; use it to change the source of minibatches programmatically
in_memory_size (int or List[int|slice|Ellipsis]) – data size for storing in theano.shared

shared: shared tensor – used for storing data

minibatch: minibatch tensor – used for training

Examples

Consider we have data
>>> import numpy as np
>>> import pymc3 as pm
>>> import theano
>>> from pymc3 import Minibatch
>>> data = np.random.rand(100, 100)

If we want a 1d slice of size 10, we do
>>> x = Minibatch(data, batch_size=10)

Note that your data is cast to floatX if it is not of integer type, but you can still pass the dtype kwarg to Minibatch.

In case we want 10 sampled rows and columns, [(size, seed), (size, seed)], it is
>>> x = Minibatch(data, batch_size=[(10, 42), (10, 42)], dtype='int32')
>>> assert str(x.dtype) == 'int32'

or, more simply, with the default random seed = 42, [size, size]
>>> x = Minibatch(data, batch_size=[10, 10])

x is a regular TensorVariable that supports any math
>>> assert x.eval().shape == (10, 10)

You can pass it to your desired model
>>> with pm.Model() as model:
...     mu = pm.Flat('mu')
...     sd = pm.HalfNormal('sd')
...     lik = pm.Normal('lik', mu, sd, observed=x, total_size=(100, 100))

Then you can perform regular Variational Inference out of the box
>>> with model:
...     approx = pm.fit()
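The total_size argument tells the model how large the full data set is, so that the log-likelihood of a minibatch can be rescaled to the full data. A rough sketch of the arithmetic, assuming the scaling factor is the product of total/batch ratios over the dimensions; minibatch_scaling is a hypothetical helper written for illustration and is not a PyMC3 function.

```python
def minibatch_scaling(total_size, batch_shape):
    """Product of total/batch ratios, one ratio per dimension (assumed rule)."""
    scale = 1.0
    for total, batch in zip(total_size, batch_shape):
        scale *= total / batch
    return scale


# 10 x 10 minibatch drawn from 100 x 100 data, as in the model above.
scale_2d = minibatch_scaling((100, 100), (10, 10))
# 10 sampled rows with all 100 columns kept.
scale_1d = minibatch_scaling((100, 100), (10, 100))
```

Under this assumption each observation in the 10 x 10 minibatch stands in for 100 observations of the full data, while in the row-only case each stands in for 10.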

Notably, Minibatch has shared and minibatch attributes that you can use later
>>> x.set_value(np.random.laplace(size=(100, 100)))

Minibatches are then drawn from the new storage, since this directly affects x.shared. The following does the same thing but is less convenient
>>> x.shared.set_value(pm.floatX(np.random.laplace(size=(100, 100))))

The programmatic way to change the storage is as follows; partial is imported for simplicity
>>> from functools import partial
>>> datagen = partial(np.random.laplace, size=(100, 100))
>>> x = Minibatch(datagen(), batch_size=10, update_shared_f=datagen)
>>> x.update_shared()

To be more concrete about how we get a minibatch, here is a demo

1) create a shared variable
>>> shared = theano.shared(data)

2) create a random slice of size 10
>>> ridx = pm.tt_rng().uniform(size=(10,), low=0, high=data.shape[0]-1e-10).astype('int64')

3) take that slice
>>> minibatch = shared[ridx]

That's it. Next you can use this minibatch somewhere else. Note that this implementation does not require a fixed shape for the shared variable. Feel free to use that if needed.
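The same three steps can be mimicked eagerly with NumPy, which makes the index trick easier to inspect. This is an analogy under stated assumptions, not the Theano graph itself: a plain array stands in for the shared variable and np.random.RandomState stands in for pm.tt_rng().

```python
import numpy as np

rng = np.random.RandomState(42)

# 1) the "storage" (a plain array standing in for the shared variable)
data = rng.rand(100, 100)

# 2) draw 10 floats in [0, n_rows - 1e-10) and truncate them to integer row
#    indices; the small offset keeps every index at most n_rows - 1
ridx = rng.uniform(low=0, high=data.shape[0] - 1e-10, size=(10,)).astype("int64")

# 3) take that slice; rows are sampled with replacement, so repeats are possible
minibatch = data[ridx]
```

In the Theano version the indices are redrawn each time the graph is evaluated, whereas here they are drawn once.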

Suppose you need some replacements in the graph, e.g. to change the minibatch to testdata
>>> node = x ** 2  # arbitrary expressions on minibatch x
>>> testdata = pm.floatX(np.random.laplace(size=(1000, 10)))

Then you should create a dict with replacements
>>> replacements = {x: testdata}
>>> rnode = theano.clone(node, replacements)
>>> assert (testdata ** 2 == rnode.eval()).all()

To replace the minibatch with its shared variable, you should do the same thing. The minibatch variable is accessible as an attribute, as is the shared variable associated with it
>>> replacements = {x.minibatch: x.shared}
>>> rnode = theano.clone(node, replacements)

For more complex slices, some more code is needed, and it may not look as clear
>>> moredata = np.random.rand(10, 20, 30, 40, 50)

The default total_size that can be passed to a PyMC3 random node is then (10, 20, 30, 40, 50), but it can be written less verbosely in some cases

1) Advanced indexing, total_size = (10, Ellipsis, 50)
>>> x = Minibatch(moredata, [2, Ellipsis, 10])

We take a slice only for the first and last dimensions
>>> assert x.eval().shape == (2, 20, 30, 40, 10)

2) Skipping a particular dimension, total_size = (10, None, 30)
>>> x = Minibatch(moredata, [2, None, 20])
>>> assert x.eval().shape == (2, 20, 20, 40, 50)

3) Mixing all of the above, total_size = (10, None, 30, Ellipsis, 50)
>>> x = Minibatch(moredata, [2, None, 20, Ellipsis, 10])
>>> assert x.eval().shape == (2, 20, 20, 40, 10)
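The way a batch_size specification maps onto the minibatch shape can be checked with a small pure-Python helper. This is a sketch inferred from the three cases above; minibatch_shape is a hypothetical name and not part of PyMC3.

```python
def minibatch_shape(data_shape, batch_size):
    """Predict the minibatch shape for a per-dimension batch_size spec.

    An int (or a (size, seed) tuple) shrinks a dimension, None keeps it whole,
    and Ellipsis keeps every dimension in the gap it spans.
    """
    spec = [batch_size] if isinstance(batch_size, int) else list(batch_size)
    if Ellipsis in spec:
        pos = spec.index(Ellipsis)
        head, tail = spec[:pos], spec[pos + 1:]
        gap = len(data_shape) - len(head) - len(tail)
        spec = head + [None] * gap + tail
    else:
        # Dimensions not mentioned in the spec are left untouched.
        spec = spec + [None] * (len(data_shape) - len(spec))

    shape = []
    for dim, size in zip(data_shape, spec):
        if isinstance(size, tuple):  # (size, random_seed)
            size = size[0]
        shape.append(dim if size is None else size)
    return tuple(shape)


full = (10, 20, 30, 40, 50)
case_1 = minibatch_shape(full, [2, Ellipsis, 10])
case_2 = minibatch_shape(full, [2, None, 20])
case_3 = minibatch_shape(full, [2, None, 20, Ellipsis, 10])
```

The three results match the shapes asserted in cases 1) to 3).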