Data

pymc3.data.get_data(filename)
Returns a BytesIO object for a package data file.
Parameters: filename : str
    file to load
Returns: BytesIO
    BytesIO of the data
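The behavior described above can be sketched with the standard library alone. Note that `get_data_sketch` and its `data_dir` parameter are illustrative names, not part of the PyMC3 API; the real helper resolves the path inside the installed package:

```python
import io
import os
import tempfile

def get_data_sketch(filename, data_dir):
    """Read a data file from data_dir and return it wrapped in a BytesIO.

    Simplified stand-in for pymc3.data.get_data, which resolves the
    path inside the installed pymc3 package instead of taking data_dir."""
    with open(os.path.join(data_dir, filename), "rb") as f:
        return io.BytesIO(f.read())

# demo on a temporary file standing in for package data
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "demo.csv"), "w") as f:
        f.write("a,b\n1,2\n")
    buf = get_data_sketch("demo.csv", d)
    content = buf.read()
```

The returned BytesIO can be handed to anything that accepts a file-like object, e.g. `pandas.read_csv`.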

class pymc3.data.GeneratorAdapter(generator)
Helper class that infers the data type of a generator by looking at its first item, while preserving the order of the resulting generator.
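The idea of inferring a dtype from the first item while keeping the order intact can be sketched in plain Python. This is a simplified illustration of the technique, not the actual GeneratorAdapter implementation:

```python
import itertools
import numpy as np

def peek_dtype(gen):
    """Infer the dtype of a generator's items by consuming its first item,
    then chain that item back in front so the rebuilt generator still
    yields every item in the original order."""
    first = next(gen)                    # consume the first item to inspect it
    dtype = np.asarray(first).dtype      # infer dtype from that item
    restored = itertools.chain([first], gen)  # put the consumed item back
    return dtype, restored

gen = (np.full(3, i, dtype="float64") for i in range(4))
dtype, restored = peek_dtype(gen)
items = list(restored)
```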

class pymc3.data.Minibatch(**kwargs)
Multidimensional minibatch that is a pure TensorVariable.
Parameters: data : ndarray
    initial data
batch_size : int or List[int|tuple(size, random_seed)]
    batch size for inference; a random seed is needed for child random generators
dtype : str
    cast data to a specific type
broadcastable : tuple[bool]
    change the broadcastable pattern, which defaults to (False,) * ndim
name : str
    name for the tensor, defaults to "Minibatch"
random_seed : int
    random seed that is used by default
update_shared_f : callable
    returns an ndarray that will be stored in the underlying shared variable; you can use it to change the source of minibatches programmatically
in_memory_size : int or List[int|slice|Ellipsis]
    data size for storing in theano.shared
Notes
Below is a common use case of Minibatch within variational inference. Importantly, we need to make PyMC3 "aware" that a minibatch is being used in inference; otherwise we will get the wrong logp for the model. To do so, we need to pass the total_size parameter to the observed node, which correctly scales the density of the model logp that is affected by Minibatch. See more in the examples below.
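The effect of total_size can be sketched with plain numpy. This is an illustration of the rescaling only, assuming a Normal log-density; in PyMC3 the scaling happens inside the model's logp:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)   # full dataset
batch = data[:100]             # one minibatch

def normal_logp(x, mu=0.0, sd=1.0):
    # elementwise log-density of a Normal(mu, sd)
    return -0.5 * np.log(2 * np.pi * sd ** 2) - 0.5 * ((x - mu) / sd) ** 2

batch_logp = normal_logp(batch).sum()
# total_size rescales the minibatch contribution by total/batch, so that
# in expectation it matches the full-data log-likelihood
scaled_logp = batch_logp * (len(data) / len(batch))
full_logp = normal_logp(data).sum()
```

Without the rescaling, the likelihood of 100 points would be weighed against the full prior, biasing the posterior toward the prior.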
Examples
Consider we have data:
>>> data = np.random.rand(100, 100)

If we want a 1d slice of size 10, we do:
>>> x = Minibatch(data, batch_size=10)

Note that your data is cast to floatX if it is not of an integer type, but you can still pass the dtype kwarg to Minibatch.
If we want 10 sampled rows and columns, [(size, seed), (size, seed)], it is:
>>> x = Minibatch(data, batch_size=[(10, 42), (10, 42)], dtype='int32')
>>> assert str(x.dtype) == 'int32'

Or, simpler, with the default random seed = 42, [size, size]:
>>> x = Minibatch(data, batch_size=[10, 10])
x is a regular TensorVariable that supports any math:
>>> assert x.eval().shape == (10, 10)

You can pass it to your desired model:
>>> with pm.Model() as model:
...     mu = pm.Flat('mu')
...     sd = pm.HalfNormal('sd')
...     lik = pm.Normal('lik', mu, sd, observed=x, total_size=(100, 100))

Then you can perform regular variational inference out of the box:
>>> with model:
...     approx = pm.fit()
A notable thing is that Minibatch has shared and minibatch attributes you can use later:
>>> x.set_value(np.random.laplace(size=(100, 100)))

Minibatches will then come from the new storage; this directly affects x.shared. The same thing can be done, less conveniently, with:
>>> x.shared.set_value(pm.floatX(np.random.laplace(size=(100, 100))))

A programmatic way to change the storage is as follows (partial is imported for simplicity):
>>> from functools import partial
>>> datagen = partial(np.random.laplace, size=(100, 100))
>>> x = Minibatch(datagen(), batch_size=10, update_shared_f=datagen)
>>> x.update_shared()
To be more concrete about how we get a minibatch, here is a demo:
1) create a shared variable
>>> shared = theano.shared(data)
2) create a random slice of size 10
>>> ridx = pm.tt_rng().uniform(size=(10,), low=0, high=data.shape[0] - 1e-16).astype('int64')
3) take that slice
>>> minibatch = shared[ridx]
That's it. Next you can use this minibatch somewhere else. You can see that the implementation does not require a fixed shape for the shared variable. Feel free to use that if needed.
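The three steps of the demo can be mimicked with numpy alone (uniform sampling with replacement, mirroring the uniform draw above; here `shared` is just an array, not a theano shared variable):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.random((100, 100))

# 1) the "shared" storage is just the array itself in this sketch
shared = data
# 2) draw 10 random row indices, as the uniform draw in the demo does
ridx = rng.integers(low=0, high=shared.shape[0], size=10)
# 3) take that slice along the first dimension
minibatch = shared[ridx]
```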
Suppose you need some replacements in the graph, e.g. to change the minibatch to test data:
>>> node = x ** 2  # arbitrary expressions on minibatch x
>>> testdata = pm.floatX(np.random.laplace(size=(1000, 10)))

Then you should create a dict with the replacements:
>>> replacements = {x: testdata}
>>> rnode = theano.clone(node, replacements)
>>> assert (testdata ** 2 == rnode.eval()).all()

To replace the minibatch with its shared variable you do the same thing. The minibatch variable is accessible as an attribute, as is the shared variable associated with the minibatch:
>>> replacements = {x.minibatch: x.shared}
>>> rnode = theano.clone(node, replacements)
For more complex slices some more code is needed, which can seem less clear:
>>> moredata = np.random.rand(10, 20, 30, 40, 50)

The default total_size that can be passed to a PyMC3 random node is then (10, 20, 30, 40, 50), but it can be less verbose in some cases.
1) Advanced indexing, total_size = (10, Ellipsis, 50):
>>> x = Minibatch(moredata, [2, Ellipsis, 10])

We take a slice only for the first and last dimension:
>>> assert x.eval().shape == (2, 20, 30, 40, 10)

2) Skipping a particular dimension, total_size = (10, None, 30):
>>> x = Minibatch(moredata, [2, None, 20])
>>> assert x.eval().shape == (2, 20, 20, 40, 50)

3) Mixing all of that, total_size = (10, None, 30, Ellipsis, 50):
>>> x = Minibatch(moredata, [2, None, 20, Ellipsis, 10])
>>> assert x.eval().shape == (2, 20, 20, 40, 10)
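The semantics of int, None, and Ellipsis entries in the slicing spec can be reproduced with a numpy-only helper. `random_slices` is a hypothetical name used for illustration; this is not the PyMC3 implementation, which builds a symbolic theano graph instead:

```python
import numpy as np

def random_slices(arr, spec, rng):
    """Apply a Minibatch-style slicing spec to a numpy array:
    an int takes a random slice of that size along the axis, None keeps
    the axis untouched, and Ellipsis keeps every axis between the head
    and the tail of the spec."""
    if Ellipsis in spec:
        pos = spec.index(Ellipsis)
        head, tail = spec[:pos], spec[pos + 1:]
    else:
        head, tail = list(spec), []
    out = arr
    # head entries align with the leading axes
    for axis, s in enumerate(head):
        if s is not None:
            out = np.take(out, rng.integers(0, out.shape[axis], size=s), axis=axis)
    # tail entries align with the trailing axes
    for offset, s in enumerate(reversed(tail)):
        axis = out.ndim - 1 - offset
        if s is not None:
            out = np.take(out, rng.integers(0, out.shape[axis], size=s), axis=axis)
    return out

rng = np.random.default_rng(0)
moredata = rng.random((10, 20, 30, 40, 50))
x1 = random_slices(moredata, [2, Ellipsis, 10], rng)       # first and last axes
x2 = random_slices(moredata, [2, None, 20, Ellipsis, 10], rng)
```

Indexing one axis at a time with np.take gives orthogonal slicing, matching the shapes asserted in the examples above.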
Attributes
shared : shared tensor
    Used for storing data
minibatch : minibatch tensor
    Used for training
clone()
Return a new Variable like self.
Returns: Variable instance
    A new Variable instance (or subclass instance) with no owner or index.
Notes
Tags are copied to the returned instance.
Name is copied to the returned instance.
