Data
pymc3.data.get_data(filename)
Returns a BytesIO object for a package data file.
Parameters: filename (str) – file to load
Returns: BytesIO of the data
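As a rough illustration of consuming the kind of BytesIO object described above, here is a minimal sketch. It does not call get_data itself: the in-memory bytes stand in for a package data file, and read_csv_rows is a hypothetical helper introduced only for this example.

```python
import csv
import io


def read_csv_rows(buffer):
    """Decode a binary buffer as UTF-8 text and parse it as CSV rows."""
    text = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
    return list(csv.reader(text))


# Stand-in for the BytesIO returned for a package data file (assumed content).
fake_file = io.BytesIO(b"x,y\n1,2\n3,4\n")
rows = read_csv_rows(fake_file)
```

The same BytesIO object can usually be handed to any reader that accepts a binary file-like object.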
class pymc3.data.GeneratorAdapter(generator)
Helper class that infers the data type of a generator by looking at its first item, while preserving the order of the resulting generator.
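The idea of inspecting the first item without disturbing the order can be sketched in plain Python. This is not the GeneratorAdapter implementation; peek_first and squares are hypothetical names used only to illustrate the technique.

```python
import itertools


def peek_first(generator):
    """Return the first item and an iterator that still yields every item in order."""
    iterator = iter(generator)
    first = next(iterator)
    # Put the consumed item back in front of the remaining ones.
    return first, itertools.chain([first], iterator)


def squares():
    for i in range(4):
        yield float(i * i)


first, restored = peek_first(squares())
item_type = type(first).__name__  # type inferred from the first item only
items = list(restored)            # the full sequence, original order kept
```

Chaining the consumed item back onto the iterator is what keeps the sequence intact for later consumers.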
class pymc3.data.Minibatch(data, batch_size=128, dtype=None, broadcastable=None, name='Minibatch', random_seed=42, update_shared_f=None, in_memory_size=None)
Multidimensional minibatch that is a pure TensorVariable.
Parameters:
- data (ndarray) – initial data
- batch_size (int or List[int|tuple(size, random_seed)]) – batch size for inference; a random seed is needed for the child random generators
- dtype (str) – cast data to a specific type
- broadcastable (tuple[bool]) – change the broadcastable pattern, which defaults to (False, ) * ndim
- name (str) – name for the tensor, defaults to “Minibatch”
- random_seed (int) – random seed that is used by default
- update_shared_f (callable) – returns an ndarray that will be carefully stored to the underlying shared variable; you can use it to change the source of minibatches programmatically
- in_memory_size (int or List[int|slice|Ellipsis]) – data size for storing in theano.shared
Attributes:
- shared (shared tensor) – Used for storing data
- minibatch (minibatch tensor) – Used for training
Examples
Consider we have data
>>> import numpy as np
>>> import pymc3 as pm
>>> from pymc3 import Minibatch
>>> data = np.random.rand(100, 100)
If we want a 1d slice of size 10, we do
>>> x = Minibatch(data, batch_size=10)
Note that your data is cast to floatX if it is not of integer type, but you can still pass the dtype kwarg to Minibatch.
In case we want 10 sampled rows and columns, [(size, seed), (size, seed)], it is
>>> x = Minibatch(data, batch_size=[(10, 42), (10, 42)], dtype='int32')
>>> assert str(x.dtype) == 'int32'
or, simpler, with the default random seed = 42, [size, size]
>>> x = Minibatch(data, batch_size=[10, 10])
x is a regular TensorVariable that supports any math
>>> assert x.eval().shape == (10, 10)
You can pass it to your desired model
>>> with pm.Model() as model:
...     mu = pm.Flat('mu')
...     sd = pm.HalfNormal('sd')
...     lik = pm.Normal('lik', mu, sd, observed=x, total_size=(100, 100))
Then you can perform regular Variational Inference out of the box
>>> with model:
...     approx = pm.fit()
A notable thing is that Minibatch has shared and minibatch attributes that you can use later
>>> x.set_value(np.random.laplace(size=(100, 100)))
and minibatches will then be drawn from the new storage; this directly affects x.shared. The same thing can be done, less conveniently, with
>>> x.shared.set_value(pm.floatX(np.random.laplace(size=(100, 100))))
The programmatic way to change storage is as follows; partial is imported for simplicity
>>> from functools import partial
>>> datagen = partial(np.random.laplace, size=(100, 100))
>>> x = Minibatch(datagen(), batch_size=10, update_shared_f=datagen)
>>> x.update_shared()
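The refresh pattern behind update_shared_f can be sketched without Theano. The class below is a hypothetical stand-in (RefreshableStorage is not part of PyMC3) that only mirrors the idea: keep a callable around and call it whenever the stored data should be replaced.

```python
from functools import partial

import numpy as np


class RefreshableStorage:
    """Hypothetical stand-in for a shared variable paired with a refresh callable."""

    def __init__(self, data, update_f=None):
        self.value = np.asarray(data)
        self.update_f = update_f

    def update(self):
        # Replace the stored array with whatever the callable returns.
        if self.update_f is None:
            raise NotImplementedError("no update function was supplied")
        self.value = np.asarray(self.update_f())


rng = np.random.RandomState(0)
datagen = partial(rng.laplace, size=(100, 100))
storage = RefreshableStorage(datagen(), update_f=datagen)
before = storage.value.copy()
storage.update()  # storage now holds a fresh draw of the same shape
```

Because the callable takes no arguments, functools.partial is a convenient way to bind the generator's parameters up front.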
To be more concrete about how we get a minibatch, here is a demo.
1) create a shared variable
>>> shared = theano.shared(data)
2) create a random slice of size 10
>>> ridx = pm.tt_rng().uniform(size=(10,), low=0, high=data.shape[0]-1e-10).astype('int64')
3) take that slice
>>> minibatch = shared[ridx]
That's done. Next you can use this minibatch somewhere else. You can see that the implementation does not require a fixed shape for the shared variable. Feel free to use that if needed.
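The three steps above have a direct NumPy analogue, which may help in seeing what the symbolic version computes. This is a sketch with a plain array in place of the shared variable and a NumPy RandomState in place of pm.tt_rng(); the seed is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.RandomState(42)

# 1) plain array standing in for the shared variable
data = rng.rand(100, 100)

# 2) draw floats in [0, n_rows - 1e-10) and truncate them to integer row indices;
#    the small offset keeps the largest index at n_rows - 1
ridx = rng.uniform(low=0, high=data.shape[0] - 1e-10, size=(10,)).astype("int64")

# 3) take those rows; indices are sampled with replacement, so rows may repeat
minibatch = data[ridx]
```

Unlike this eager version, the Theano expression is re-evaluated on every call, so each evaluation yields a new random slice.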
Suppose you need some replacements in the graph, e.g. change minibatch to testdata
>>> node = x ** 2  # arbitrary expressions on minibatch x
>>> testdata = pm.floatX(np.random.laplace(size=(1000, 10)))
Then you should create a dict with replacements
>>> replacements = {x: testdata}
>>> rnode = theano.clone(node, replacements)
>>> assert (testdata ** 2 == rnode.eval()).all()
To replace the minibatch with its shared variable, you should do the same thing. The minibatch variable is accessible as an attribute, as is the shared variable associated with it
>>> replacements = {x.minibatch: x.shared}
>>> rnode = theano.clone(node, replacements)
For more complex slices, some more code is needed, which may seem less clear
>>> moredata = np.random.rand(10, 20, 30, 40, 50)
The default total_size that can be passed to a PyMC3 random node is then (10, 20, 30, 40, 50), but it can be less verbose in some cases
1) Advanced indexing, total_size = (10, Ellipsis, 50)
>>> x = Minibatch(moredata, [2, Ellipsis, 10])
We take a slice only for the first and last dimensions
>>> assert x.eval().shape == (2, 20, 30, 40, 10)
2) Skipping a particular dimension, total_size = (10, None, 30)
>>> x = Minibatch(moredata, [2, None, 20])
>>> assert x.eval().shape == (2, 20, 20, 40, 50)
3) Mixing it all, total_size = (10, None, 30, Ellipsis, 50)
>>> x = Minibatch(moredata, [2, None, 20, Ellipsis, 10])
>>> assert x.eval().shape == (2, 20, 20, 40, 10)
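To make the batch_size conventions above easier to check by hand, here is a small pure-Python sketch that predicts the minibatch shape from a data shape and a batch_size specification. minibatch_shape is a hypothetical helper written for this illustration, not a PyMC3 function; it assumes the rules shown in the examples (an int or a (size, seed) tuple resizes a dimension, None keeps it, Ellipsis keeps all dimensions it spans, and unspecified trailing dimensions are kept).

```python
def minibatch_shape(data_shape, batch_size):
    """Predict the minibatch shape for a given data shape and batch_size spec."""
    spec = [batch_size] if isinstance(batch_size, int) else list(batch_size)
    ndim = len(data_shape)
    if Ellipsis in spec:
        # Expand Ellipsis into as many "keep this dimension" entries as needed.
        pos = spec.index(Ellipsis)
        before, after = spec[:pos], spec[pos + 1:]
        spec = before + [None] * (ndim - len(before) - len(after)) + after
    else:
        # Dimensions not mentioned at the end are left untouched.
        spec = spec + [None] * (ndim - len(spec))
    shape = []
    for dim, item in zip(data_shape, spec):
        if item is None:
            shape.append(dim)
        elif isinstance(item, tuple):
            shape.append(item[0])  # (size, random_seed): only the size matters here
        else:
            shape.append(item)
    return tuple(shape)


full = (10, 20, 30, 40, 50)
shape_a = minibatch_shape(full, [2, Ellipsis, 10])
shape_b = minibatch_shape(full, [2, None, 20])
shape_c = minibatch_shape(full, [2, None, 20, Ellipsis, 10])
```

The three results match the shapes asserted in the examples above, which is a quick way to sanity-check a specification before building the tensor.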
- data (