2. First example: a sine

We will go through a simple example fit to introduce how the module works. First, import the modules:

import lsqfitgp as lgp
import numpy as np
import gvar

I suppose you know about numpy, but you may not know about gvar. It is a module for automatic linear correlation tracking: it propagates Gaussian errors and their correlations through calculations for you. We will see how to use it in the example; if you want to know more, it has good documentation.
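To get a feeling for what gvar does, here is a minimal illustration (not needed for the fit below):

a = gvar.gvar(1.0, 0.1)    # a Gaussian variable with mean 1 and standard deviation 0.1
b = 2 * a + 1              # errors propagate automatically through arithmetic
print(b)                   # 3.00(20)
print(gvar.evalcorr(np.array([a, b])))  # all entries 1: a and b are perfectly correlated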

Now we generate some fake data. Let’s make a sine:[1]

x = np.linspace(-5, 5, 11) # 11 numbers uniformly spaced from -5 to 5
y = np.sin(x)

We won’t add errors to the data for now. What we are going to do in the next steps is pretend we don’t know the data comes from a sine, and let the fit guess a function that passes through the data.

The first step in using a Gaussian process is choosing a kernel function. Say \(f\) is the unknown function we want to model. The kernel function \(k(x, x')\) specifies the covariance we expect a priori between \(f(x)\) and \(f(x')\):

\[k(x, x') = \operatorname{Cov}[f(x), f(x')].\]

This means that the a priori variance of \(f(x)\) is \(\operatorname{Var}[f(x)] = k(x, x)\), and the a priori correlation between \(f(x)\) and \(f(x')\) is

\[\begin{split}\operatorname{Corr}[f(x), f(x')] &= \frac {\operatorname{Cov}[f(x), f(x')]} {\sqrt{\operatorname{Var}[f(x)] \operatorname{Var}[f(x')]}} \\ &= \frac {k(x, x')} {\sqrt{k(x, x) k(x', x')}}.\end{split}\]

Again, in other words: \(\sqrt{k(x, x)}\) expresses how uncertain you are about the function value at \(x\) before seeing the data, and \(k(x, x')\) expresses how strongly you think the value at point \(x\) is linked with the value at point \(x'\).

lsqfitgp allows you to specify arbitrary functions as kernels; however, not all functions are valid kernels, so it is convenient to use one of those already provided by the module. We will use a quite common kernel, the exponential quadratic:

\[k(x, x') = \exp \left( -\frac 12 (x - x')^2 \right)\]

which is available as ExpQuad:

kernel = lgp.ExpQuad()

Note the parentheses: ExpQuad is a class, and we create an instance by calling it.
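As a quick sanity check of what this kernel implies, we can evaluate the formula by hand with numpy (this bypasses lsqfitgp entirely; it is only for illustration): the prior standard deviation at any point is \(\sqrt{k(x, x)} = 1\), and the correlation decays with the distance between the points.

def expquad_by_hand(x1, x2):
    return np.exp(-0.5 * (x1 - x2) ** 2)

print(expquad_by_hand(0.0, 0.0))  # 1.0: prior variance 1 at every point
print(expquad_by_hand(0.0, 1.0))  # ~0.61: nearby points are strongly correlated
print(expquad_by_hand(0.0, 3.0))  # ~0.011: distant points are almost uncorrelated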

Now we build a GP object, providing as argument the kernel we want to use, and tell it we want to evaluate the process on the array of points x we created before:

gproc = lgp.GP(kernel).addx(x, 'foo')

So, we add points using the addx() method of GP objects. It would not be very useful to ask only for the process values at the data points, where we already know the answer, so we add another array of points, more finely spaced and more extended:

xpred = np.linspace(-10, 10, 200)
gproc = gproc.addx(xpred, 'bar')

When adding x and xpred, we also provided two strings to addx(), 'foo' and 'bar'. These have no special meaning; they are just labels the GP object uses to distinguish the various arrays you add to it. You can think of them as variable names.

When calling addx the second time, we did gproc = gproc.addx(...), instead of just gproc.addx(...). This is necessary because GP objects are immutable: each operation returns a new GP object, leaving the original as it is. This may cause some easy-to-fix coding errors, but it prevents messier ones. In Python this paradigm is employed, e.g., by strings.
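Here is a small check of this behaviour (illustrative only):

gp_a = lgp.GP(lgp.ExpQuad())
gp_b = gp_a.addx(x, 'foo')   # returns a new GP object
print(gp_a is gp_b)          # False: gp_a is left untouched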

Now, we ask gproc to compute the a posteriori function values on xpred by using the known values on x:

ypred = gproc.predfromdata({'foo': y}, 'bar')

So predfromdata() takes two arguments: the first is a Python dictionary with labels as keys and arrays of function values as values; the second is the label of the points on which we want the estimate.

Now we make a plot of everything:

from matplotlib import pyplot as plt

fig, ax = plt.subplots(num='lsqfitgp example')

ax.set_title('Gaussian process on a sine')
ax.set_xlabel('x')
ax.set_ylabel('y')

ax.plot(x, y, marker='.', linestyle='', color='black', label='data')

ypred_mean = gvar.mean(ypred)
ax.plot(xpred, ypred_mean, label='posterior mean')

ax.legend()

fig.savefig('sine1.png')

To plot ypred, we did ypred_mean = gvar.mean(ypred). This is because the output of predfromdata() is not an array of numbers, but an array of gvar.GVar objects. GVar variables represent Gaussian distributions; gvar.mean() extracts the mean of the distribution. Let’s make an error band for the fit by computing the standard deviations with gvar.sdev():

ypred_sdev = gvar.sdev(ypred)
bottom = ypred_mean - ypred_sdev
top = ypred_mean + ypred_sdev
ax.fill_between(xpred, bottom, top, alpha=0.3, color='blue')

fig.savefig('sine2.png')
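By the way, since the posterior is stored as GVar variables, gvar also keeps track of the correlations between different prediction points, not only their individual uncertainties. For example (purely illustrative):

print(gvar.evalcov(ypred[:2]))   # 2x2 posterior covariance of f(xpred[0]) and f(xpred[1])
print(gvar.evalcorr(ypred[:2]))  # nearby prediction points are strongly correlated a posteriori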

Look at the result. First, consider the line giving the mean. Between the data points, it goes smoothly from one point to the next, and it reasonably approximates a sine. However, outside the region covered by the data it goes to zero. This is due to something I did not mention when introducing kernels: apart from the variance and correlations, a Gaussian process also has an a priori mean. lsqfitgp always assumes that the a priori mean is zero, so, going far from the data, the a posteriori mean goes back to zero too.

It would be easy to provide an arbitrary prior mean by first subtracting it from the data, and then adding it back to the result. However, I encourage the reader to avoid using prior means at all while learning Gaussian processes. You should always think about the links and correlations between the points and the general properties of the function instead of focusing on what value or specific functional form you expect, and let the data choose the values and shape.
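For reference, here is a minimal sketch of that subtract-and-add-back trick, keeping in mind the advice above. The names prior_mean, gp_m, resid are made up for illustration; this is not dedicated lsqfitgp functionality, just ordinary array arithmetic around the fit.

def prior_mean(x):
    return x / 5           # hypothetical a priori mean, chosen only for illustration

gp_m = lgp.GP(lgp.ExpQuad()).addx(x, 'foo').addx(xpred, 'bar')
resid = gp_m.predfromdata({'foo': y - prior_mean(x)}, 'bar')  # fit the residuals
ypred_with_mean = resid + prior_mean(xpred)                   # add the mean back to the posterior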

Now look at the band. It is a band of \(\pm1\sigma\) around the posterior mean (\(\sigma\) conventionally stands for the standard deviation). For the Gaussian distribution, this interval has a probability of 68 %. A \(\pm2\sigma\) interval would have a probability of 95 %, so the fit expects the function to stay within a band twice as wide as the one plotted most of the time, and inside the one plotted about 2/3 of the time. So, the summary of the figure is this: close to the points, the function is probably a sine; as soon as we move away from the datapoints, the fit gives up and just returns the prior.
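If you want to double check those coverage probabilities, they follow directly from the Gaussian distribution; a quick verification (for illustration, assuming scipy is installed):

from scipy.special import erf

print(erf(1 / np.sqrt(2)))  # ~0.683: probability of falling within 1 sigma
print(erf(2 / np.sqrt(2)))  # ~0.954: probability of falling within 2 sigma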

Instead of trying to interpret what functions the fit result is representing, we can sample some plausible functions from the posterior distribution. gvar has a function for doing that, gvar.raniter():

for ypred_sample in gvar.raniter(ypred, 4):
    ax.plot(xpred, ypred_sample, color='red', alpha=0.3)

fig.savefig('sine3.png')

What do we learn from this? First, the sampled functions are a bit rougher than the mean line, even between the datapoints. To have a clear idea of the result, you should always plot some samples instead of looking at the mean.

Second, outside the datapoints the sampled functions oscillate with peaks that are a bit too narrow compared to the peaks of the sine. This is not visible from the error band alone, so again: to have a clear idea of the result, plot some samples instead of looking only at the error band.

Now, to try to get a smoother result, we want to increase the “wavelength” of the kernel to match that of the sine. To do that, we just need to rescale the input points prior to feeding them to the kernel. We could do that by hand by dividing x and xpred by some factor, but all Kernel objects provide a convenient parameter to do it automatically, scale. So let’s do everything again, rescaling by a factor of 3:

gp = (lgp
    .GP(lgp.ExpQuad(scale=3))
    .addx(x, 'foo')
    .addx(xpred, 'bar')
)
ypred = gp.predfromdata({'foo': y}, 'bar')

ax.cla() # clear the plot

ax.plot(x, y, '.k')
for sample in gvar.raniter(ypred, 5):
    ax.plot(xpred, sample, 'r', alpha=0.3)

fig.savefig('sine4.png')

This is a bit more similar to the sine. Between the datapoints, the samples all follow the same path, and they continue to oscillate in a reasonable way also outside the datapoints. But what about further away? Let’s make a plot from -20 to 20:

xpred_long = np.linspace(-20, 20, 200)
gp = gp.addx(xpred_long, 'baz')

ypred = gp.predfromdata({'foo': y}, 'baz')

ax.cla()

ax.plot(x, y, '.k')
for sample in gvar.raniter(ypred, 5):
    ax.plot(xpred_long, sample, 'r', alpha=0.3)

fig.savefig('sine5.png')

So, after an oscillation or so, it still goes back to the prior, as it should; it just takes about three times longer than before, because we set scale=3.
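As a sanity check of what scale does, we can again evaluate the kernel formula by hand (illustration only, under the assumption that scale divides the inputs before applying the unit-scale kernel):

def expquad_scaled(x1, x2, scale=3):
    return np.exp(-0.5 * ((x1 - x2) / scale) ** 2)

print(expquad_scaled(0.0, 3.0))  # ~0.61: at scale=3, distance 3 correlates like distance 1 did before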

Finally, let’s see how to add errors to the data. First, we generate some Gaussian distributed errors with standard deviation 0.2:

rng = np.random.default_rng([2023, 8, 23, 16, 28])
yerr = 0.2 * rng.standard_normal(len(y))
y += yerr

Then, we build an array of GVar variables from y:

gy = gvar.gvar(y, 0.2 * np.ones(len(y)))

Now gy represents an array of independent Gaussian distributions with mean y and standard deviation 0.2. We then repeat everything as before, but using gy instead of y:

ypred = gp.predfromdata({'foo': gy}, 'baz')

ax.cla()

ax.errorbar(x, gvar.mean(gy), yerr=gvar.sdev(gy), fmt='.k', capsize=2)
for sample in gvar.raniter(ypred, 5):
    ax.plot(xpred_long, sample, 'r', alpha=0.3)

fig.savefig('sine6.png')

The behaviour makes sense: the samples are no longer forced to pass exactly through the datapoints.

You should now know all the basics. You can experiment with other kernels; the available ones are listed in Predefined kernels.
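For example, swapping in a different kernel only changes the GP construction line; everything else stays the same. (The kernel name below is an assumption; check the Predefined kernels page for the exact list.)

gp_alt = lgp.GP(lgp.Matern32(scale=3)).addx(x, 'foo').addx(xpred_long, 'baz')  # hypothetical kernel choice
ypred_alt = gp_alt.predfromdata({'foo': gy}, 'baz')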

Footnotes