Vectors and Variables
Our imaginary chicken farmer from the previous section has gotten more sophisticated: she's testing three different types of feed to see which one produces more eggs.
Counting More Chickens
The egg dataset now includes a 0
, 1
, or 2
for each day that indicates which type of feed they got.
val eggs = List[(Int, Long)]((0,31), (2,47), (0,35), (2,40), (0,33), (2,44), (0,30), (2,46), (0,33), (0,30), (2,36), (2,54), (1,45), (1,39), (2,62), (2,54), (1,30), (2,40), (2,48), (1,33), (0,40), (2,38), (0,31), (2,46), (1,41), (1,42), (0,39), (1,29), (0,28), (1,36), (2,46), (2,33), (2,41), (2,48), (1,32), (0,24), (1,34), (2,48), (1,52), (1,37), (0,28), (0,37), (2,51), (2,44), (1,40), (0,41), (0,36), (1,44), (0,32), (0,31), (0,31), (0,32), (0,33), (1,27), (0,40), (2,45), (2,40), (1,46), (0,35), (2,46), (0,34), (1,41), (0,38), (0,34), (2,46), (1,44), (2,49), (2,39), (1,41), (2,37), (1,29), (0,29), (2,41), (2,46), (1,42), (1,34), (1,32), (1,35), (0,32), (1,40), (1,37), (1,38), (1,42), (1,38), (1,36), (0,38), (0,41), (1,51), (1,40))
/*
eggs: List[(Int, Long)] = List(
(0, 31L),
(2, 47L),
(0, 35L),
(2, 40L),
...
*/
As before, we'll create a lambda
that captures the baseline egg-laying rate for the flock.
val lambda = Gamma(0.5, 100).latent
/*
lambda: Real = Real(0.00, Infinity)
*/
This time, however, we'll also create a vector of 3 random variables that represent the egg-laying rate for each of the 3 different feeds. We want these to be able to scale the baseline rate up or down a small amount. There are a lot of different modeling choices we could make here, but in this case we'll start by defining random variables that represent the log of those rates, normally distributed around the log of the baseline, with a small standard deviation.
val logFeeds = Normal(lambda.log, 0.1).latentVec(3)
/*
logFeeds: Vec[Real] = Vec(Real(-Infinity, Infinity), Real(-Infinity, Infinity), Real(-Infinity, Infinity))
*/
This gives us back a Vec[Real]
, which is a type defined by Rainier. It's similar to a Scala Vector[Real]
, but adapted to work better with random variables.
Like a regular Vector
, we can transform it using map
. Here, we want to bring these random variables back out of log-space:
val feeds = logFeeds.map(_.exp)
/*
feeds: Vec[Real] = Vec(Real(0.00, Infinity), Real(0.00, Infinity), Real(0.00, Infinity))
*/
Again, a sanity check on the bounds: everything in the Vec
is now >= 0, so we haven't screwed up our ability to use these as rates.
Multivariate Data
We'd now like to construct a model that associates these three different (but related) feed
rates with their corresponding observations. We'll do that in two different ways.
- First, we'll split the data into three parts, and construct three separate models, which we'll then merge.
- Next, we'll use use a variation of
Model.observe
that accepts vectors of likelihoods, one for each observation.
It's worth noting that we can just keep reusing the same feeds
random variables as we construct each of these models. That's because everything in Rainier is immutable - there's no global context that you're modifying when you create parameters or make observations. That makes things a lot easier to reason about, especially when messing around in a REPL or a notebook.
Merging Models
For any single type of feed, creating a model looks exactly as it did in the previous section. For example, we could filter to just type 0:
val eggs0 = eggs.filter{case (f, _) => f == 0}.map{case (_, c) => c}
val model0 = Model.observe(eggs0, Poisson(feeds(0)))
/*
eggs0: List[Long] = List(
31L,
35L,
33L,
30L,
...
model0: Model = Model[2]
*/
You'll notice that this is a Model[2]
, because it has two parameters: the baseline lambda
, and then the feeds(0)
which is derived from it.
Similarly, we can do this for all the feed types at once by first grouping the data by feed type, and then mapping over the groups:
val models = eggs.groupBy(_._1).toList.map{
case (i, data) =>
val counts = data.map(_._2)
Model.observe(counts, Poisson(feeds(i)))
}
/*
models: List[Model] = List(Model[2], Model[2], Model[2])
*/
Now we have three different Model[2]
objects, each of them referencing the same lambda
but a different element of feeds
.
Finally, we can use Model
's merge
method to combine all three models into a single joint model for all three feed types.
val mergedModel = models.reduce{(m1, m2) => m1.merge(m2)}
/*
mergedModel: Model = Model[4]
*/
This model has 4 parameters, as it should.
There's nothing wrong with building the model this way in this situation, where you have a categorical independent variable. However, if you had a continuous covariate, this wouldn't work (or rather, it would work, but it would require building a separate Model
for each data point, which scales very badly), so it's good to see an alternative way of solving the same problem.
Vector Likelihoods
So far we've used Model.observe
with the following signature: observe[Y](ys: Seq[Y], likelihood: Distribution[Y])
. That is: a sequence of observations, and a single distribution that provides the likelihood function for all of them.
Another option is observe[Y](ys: Seq[Y], likelihoods: Vec[Distribution[Y])
. In this case, rather than assuming that each Y
has an identical likelihood distribution, we pass in a whole Vec
of Distribution
objects, one for each value in ys
.
To make use of it, we first have to separate our data into our independent variables and dependent variable.
val eggFeeds = eggs.map(_._1)
val eggCounts = eggs.map(_._2)
/*
eggFeeds: List[Int] = List(
0,
2,
0,
2,
...
eggCounts: List[Long] = List(
31L,
47L,
35L,
40L,
...
*/
Next, we want to create a Vec
from our independent variables (that is, eggFeeds
), and map over them to create a Poisson
for each one. The code is quite straightforward.
val feedsVec = Vec.from(eggFeeds)
val poissonVec = feedsVec.map{i: Real => Poisson(feeds(i))}
/*
feedsVec: Vec[Real] = Vec(
Real(0.00),
Real(2.00),
Real(0.00),
Real(2.00),
...
poissonVec: Vec[Poisson] = Vec(
Distribution(),
Distribution(),
Distribution(),
Distribution(),
...
*/
What happens here, though, is a little bit strange: when we pass it to Vec.from
, our regular integer data gets transformed into a Vec[Real]
. In fact, any combination of numbers, tuples, lists, and maps will get converted into its Real
-based equivalent: so for example a Seq[Map[String,Double]]
will become a Vec[Map[String,Real]]
. Anything that can't be converted like this cannot be used to construct a Vec
.
For a clue as to why, check out the next line: the i
that we're using to index feeds
is, of course, a Real
. It represents some row from eggFeeds
, but we're not given any insight into which row it is, or any ability to act differently for some rows than others; and similarly, by indexing into feeds
with a Real
, we know we are receiving some element of feeds
, but not which element. All of that helps with vectorizing the computation and helping things scale nicely.
If you didn't entirely follow that last bit, that's ok. The short version is that as long as you can transform your data into primitive data structures like lists and maps of numbers, and then stuff them into a Vec
, Rainier will be happy.
Finally, we'll build the model itself.
val vecModel = Model.observe(eggCounts, poissonVec)
/*
vecModel: Model = Model[4]
*/
Hooray! We're back to a Model[4]
that is, mathematically, the same as the mergedModel
we produced before (but using a much more general technique).