Data

pymc3.data.get_data(filename)

Returns a BytesIO object for a package data file.

Parameters:

filename : str

file to load

Returns:

BytesIO of the data
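A minimal usage sketch (assuming pandas is installed and that the example file 'radon.csv' ships with the package data; adjust the filename to whichever bundled file you need):

>>> import pandas as pd
>>> import pymc3 as pm
>>> raw = pm.get_data('radon.csv')  # BytesIO, usable wherever a file handle is accepted
>>> df = pd.read_csv(raw)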

class pymc3.data.GeneratorAdapter(generator)

Helper class that infers the data type of a generator by looking at its first item, while preserving the order of the resulting generator.
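This class is typically used behind the scenes when a generator is passed as data. As a rough sketch of its behavior (the batches generator below is purely illustrative), it yields the first item, cast to floatX for float data, followed by the remaining items in order:

>>> import numpy as np
>>> from pymc3.data import GeneratorAdapter
>>> def batches():
...     while True:
...         yield np.random.rand(10, 5)
>>> gen = GeneratorAdapter(batches())
>>> assert next(gen).shape == (10, 5)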

class pymc3.data.Minibatch(**kwargs)

Multidimensional minibatch that is a pure TensorVariable.

Parameters:

data : ndarray

initial data

batch_size : int or List[int|tuple(size, random_seed)]

batch size for inference; a random seed is needed for child random generators

dtype : str

cast data to specific type

broadcastable : tuple[bool]

change the broadcastable pattern, which defaults to (False,) * ndim

name : str

name for tensor, defaults to "Minibatch"

random_seed : int

random seed that is used by default

update_shared_f : callable

returns an ndarray that will be carefully stored in the underlying shared variable; you can use it to change the source of minibatches programmatically

in_memory_size : int or List[int|slice|Ellipsis]

data size for storing in theano.shared

Notes

Below is a common use case of Minibatch within variational inference. Importantly, we need to make PyMC3 "aware" that a minibatch is being used in inference. Otherwise, we will get the wrong \(\log p\) for the model. To do so, we need to pass the total_size parameter to the observed node, which correctly scales the density of the model's \(\log p\) that is affected by Minibatch. See more in the examples below.

Examples

Suppose we have the following data:

>>> data = np.random.rand(100, 100)

If we want a 1d slice of size 10, we do:

>>> x = Minibatch(data, batch_size=10)

Note that your data is cast to floatX if it is not of an integer type, but you can still pass the dtype kwarg to Minibatch.

In case we want 10 sampled rows and columns, [(size, seed), (size, seed)], it is:

>>> x = Minibatch(data, batch_size=[(10, 42), (10, 42)], dtype='int32')
>>> assert str(x.dtype) == 'int32'

Or, more simply, with the default random seed of 42, [size, size]:

>>> x = Minibatch(data, batch_size=[10, 10])

x is a regular TensorVariable that supports any math:

>>> assert x.eval().shape == (10, 10)

You can pass it to your desired model:

>>> with pm.Model() as model:
...     mu = pm.Flat('mu')
...     sd = pm.HalfNormal('sd')
...     lik = pm.Normal('lik', mu, sd, observed=x, total_size=(100, 100))

Then you can perform regular variational inference out of the box:

>>> with model:
...     approx = pm.fit()

Notably, Minibatch has shared and minibatch attributes that you can use later:

>>> x.set_value(np.random.laplace(size=(100, 100)))

Minibatches will then come from the new storage; this directly affects x.shared. The same thing can be done, less conveniently, with:

>>> x.shared.set_value(pm.floatX(np.random.laplace(size=(100, 100))))

A programmatic way to change the storage is as follows (partial is imported for simplicity):

>>> from functools import partial
>>> datagen = partial(np.random.laplace, size=(100, 100))
>>> x = Minibatch(datagen(), batch_size=10, update_shared_f=datagen)
>>> x.update_shared()

To be more concrete about how we get a minibatch, here is a demo:

1) create a shared variable

>>> shared = theano.shared(data)

2) create a random slice of size 10

>>> ridx = pm.tt_rng().uniform(size=(10,), low=0, high=data.shape[0]-1e-10).astype('int64')

3) take that slice

>>> minibatch = shared[ridx]

That's it. You can now use this minibatch somewhere else. Note that the implementation does not require a fixed shape for the shared variable; feel free to use that if needed.

Suppose you need some replacements in the graph, e.g. to change the minibatch to test data:

>>> node = x ** 2  # arbitrary expression on minibatch x
>>> testdata = pm.floatX(np.random.laplace(size=(1000, 10)))

Then you should create a dict with the replacements:

>>> replacements = {x: testdata}
>>> rnode = theano.clone(node, replacements)
>>> assert (testdata ** 2 == rnode.eval()).all()

To replace the minibatch with its shared variable you do the same thing. The minibatch variable is accessible as an attribute, as is the shared variable associated with it:

>>> replacements = {x.minibatch: x.shared}
>>> rnode = theano.clone(node, replacements)

For more complex slices some more code is needed, which can look less clear:

>>> moredata = np.random.rand(10, 20, 30, 40, 50)

The default total_size that can be passed to a PyMC3 random node is then (10, 20, 30, 40, 50), but it can be less verbose in some cases.

1) Advanced indexing, total_size = (10, Ellipsis, 50)

>>> x = Minibatch(moredata, [2, Ellipsis, 10])

We take slices only for the first and last dimensions:

>>> assert x.eval().shape == (2, 20, 30, 40, 10)

2) Skipping a particular dimension, total_size = (10, None, 30)

>>> x = Minibatch(moredata, [2, None, 20])
>>> assert x.eval().shape == (2, 20, 20, 40, 50)

3) Mixing it all, total_size = (10, None, 30, Ellipsis, 50)

>>> x = Minibatch(moredata, [2, None, 20, Ellipsis, 10])
>>> assert x.eval().shape == (2, 20, 20, 40, 10)
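As an illustrative sketch (this model is not part of the original examples), the matching total_size is then passed to the observed node so that the minibatch logp is rescaled correctly:

>>> with pm.Model():
...     mu = pm.Flat('mu')
...     pm.Normal('obs', mu, 1, observed=x,
...               total_size=(10, None, 30, Ellipsis, 50))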

Attributes

shared : shared tensor

Used for storing data

minibatch : minibatch tensor

Used for training

clone()

Return a new Variable like self.

Returns:

Variable instance

A new Variable instance (or subclass instance) with no owner or index.

Notes

Tags are copied to the returned instance.

Name is copied to the returned instance.
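For illustration (this snippet is not part of the original docstring), cloning a Minibatch variable yields an ownerless variable with the same name:

>>> x = Minibatch(np.random.rand(100, 100), batch_size=10)
>>> c = x.clone()
>>> assert c.owner is None
>>> assert c.name == x.name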