Fitting microscopy data in parallel across your CPU cores or on a GPU can dramatically speed up processing and analysis. The tooling to achieve this is not straightforward, however, so I thought writing a tutorial would be a good record of the process for myself as well as others. In this post I will stick to what I am familiar with: python’s concurrent.futures module and the Gpufit C++ library (called from python).
Contents
Introduction
Solid state defects are often employed as quantum sensors, where the well defined physical properties of the system (energy levels, optical dynamics) are utilised to measure local quantities, for example magnetic or temperature fields.
Sensing protocols generally measure the defect photoluminescence (PL) as a function of some independent parameter ($\xi$
) that depends on the particular pulse sequence.
A quantity of interest is then extracted by a fit to the curve.
The most basic example is optically detected magnetic resonance (ODMR), where PL is acquired as a function of microwave driving frequency. Local static fields can be determined from the position of resonance positions in the resultant ODMR spectrum (Fig. 1).
Quantum microscopy extends this principle by acquiring the measurement curve or spectrum at different locations in space to form a data cube - a three dimensional array ${\bf A}_{ijk}$
where for example $j$
and $k$
index the position and $i$
the independent sweep parameter $\xi$
, e.g. mw frequency.
This imaging can be achieved by moving the sample relative to the measurement location (‘scanning quantum microscopy’), or by addressing and measuring an ensemble of defects in parallel onto a camera1 (‘widefield quantum microscopy’).
Either way, a large dataset can be formed, requiring a fit of each pixel’s PL curve as a function of $i$
.
Some of our datasets can be over 5 GiB (2048×2048 pixel image, 100×2 sig/ref acquisitions at 8 byte storage), which at two milliseconds per fit/pixel can take two hours to process in series.
However this problem is what computer people call ’embarrassingly parallel’ - each task (pixel) can be executed independently of the others, and thus is ripe for parallelization.
In this tutorial we will look at a few parallelization techniques to dramatically reduce this processing time: multiprocessing (using more than one of your CPU cores in parallel) and GPU processing (using a graphics card which is optimized for this type of problem).
Before we get into the nitty gritty I will note that Volk et al. from the Harvard and MIT Earth Science departments have published an open-source MATLAB package ‘QDMlab’ for NV-centre ODMR datasets, which may be a ready-made solution to your problem2.
The first two methods use SciPy’s least_squares function, which is implemented in C/Fortan (depending on the fit method, both much faster than raw Python). This function is well tested, accurate and fast for a single fit. If you have a large number of data points, or a large number of parameters, the recently released JAXFit can provide an order of magnitude speedup over Scipyfit. The motivating use-case of JAXFit is complex fits of a large 2D array. Our datasets are 3D, however (a stack of 2D images). Gpufit is optimized for this use case, however it requires a bit of finessing - which is what this tutorial is for.
Initial framework
Let’s start by preparing a simulated demonstration (ODMR) data cube. Note I have uploaded the final script3. The nitty gritty code is hidden here:
Preparing initial state
import numpy as np
import numpy.random
import matplotlib.pyplot as plt
import matplotlib as mpl
from mpl_toolkits.axes_grid1 import make_axes_locatable, axes_size
import cycler
import scipy.optimize
import itertools
import warnings
import tqdm
import tqdm.contrib
import concurrent.futures
import psutil
import os
import sys
import pygpufit.gpufit as gf
def set_style():
"""sets the Defective Club matplotlib stylesheet"""
rcparams_dict = {
"axes.prop_cycle": cycler.cycler(
color=[
"#7239b3ff",
"#008fd5",
"#fc4f30",
"#e5ae38",
"#6d904f",
"#8b8b8b",
]
),
"lines.linewidth": 2,
"lines.markersize": 3,
"xtick.labelsize": 10,
"xtick.major.size": 4,
"xtick.direction": "in",
"xtick.top": True,
"ytick.labelsize": 10,
"ytick.direction": "in",
"ytick.major.size": 4,
"ytick.right": True,
"legend.fontsize": "small",
"legend.loc": "lower left",
"font.size": 8.0,
"font.family": "monospace",
"image.cmap": "plasma",
}
for optn, val in rcparams_dict.items():
if isinstance(val, (list, tuple)):
val = tuple(val)
try:
mpl.rcParams[optn] = val
except KeyError:
warn(
f"mpl rcparams key '{optn}' not recognised as a valid rc parameter."
)
def _add_colorbar(im, fig, ax, aspect=20, pad_fraction=1, **kwargs):
"""helped function, adds colorbar to 'im' returned by 'imshow'"""
divider = make_axes_locatable(ax)
width = axes_size.AxesY(ax, aspect=1.0 / aspect)
pad = axes_size.Fraction(pad_fraction, width)
cax = divider.append_axes("right", size=width, pad=pad)
cbar = fig.colorbar(im, cax=cax, **kwargs)
tick_locator = mpl.ticker.MaxNLocator(nbins=5)
cbar.locator = tick_locator
cbar.update_ticks()
cbar.ax.get_yaxis().labelpad = 15
cbar.ax.linewidth = 0.5
return cbar
def bounds_semicircle(pos, centre, radius, lw=2, right=True):
"""semicircle (+x or -x depending on 'right') shaped feature.
(only the boundary of the feature, at linewidth 'lw')"""
return (np.abs(radius - np.linalg.norm(pos - centre)) < lw) and (
((pos[1] - centre[1] > 0) and right)
or ((pos[1] - centre[1] < 0) and not right)
)
def get_freqs(num_freqs, freq_span):
"""return mw frequencies we 'sample'"""
return np.linspace(2870 - freq_span // 2, 2870 + freq_span // 2, num_freqs)
def get_data_cube(
heights,
widths,
bg_zeeman,
ft_zeeman,
ft_width_dif,
ft_height_dif,
cam_shape,
ft_centre,
ft_rad,
ft_linewidth,
num_freqs,
freq_span,
noise_sigma,
):
"""
generate a 2-peak lorentzian data cube:
two cocentric semicircles of opposite magnetic field"""
heights = np.array(heights)
widths = np.array(widths)
raw_cube = np.empty((num_freqs, *cam_shape))
freqs = get_freqs(num_freqs, freq_span)
for ypos in range(cam_shape[0]):
for xpos in range(cam_shape[1]):
pos = np.array([ypos, xpos])
heights_res = heights + ft_height_dif
widths_res = widths + ft_width_dif
if bounds_semicircle(
pos, ft_centre, ft_rad, lw=ft_linewidth, right=True
):
l_freq = 2870 - bg_zeeman // 2 - ft_zeeman // 2
r_freq = 2870 + bg_zeeman // 2 + ft_zeeman // 2
elif bounds_semicircle(
pos, ft_centre, ft_rad, lw=ft_linewidth, right=False
):
l_freq = 2870 - bg_zeeman // 2 + ft_zeeman // 2
r_freq = 2870 + bg_zeeman // 2 - ft_zeeman // 2
else:
l_freq = 2870 - bg_zeeman // 2
r_freq = 2870 + bg_zeeman // 2
heights_res = heights
widths_res = widths
raw_cube[:, ypos, xpos] = (
lorentz(freqs, [widths[0], l_freq, heights[0]])
+ lorentz(freqs, [widths[1], r_freq, heights[1]])
+ 1.0
)
rng = numpy.random.default_rng(12345) # fix generator for reproducibility
# add gaussian noise
data_cube = raw_cube + rng.normal(
0.0, noise_sigma, size=np.shape(raw_cube)
)
return data_cube
Now let’s fit the average spectrum with a simple least squares algorithm.
For all of the examples here you can inject your own data if you provide the measured frequencies as an array (freqs
), and shape your (normalised) PL data in the same way as data_cube
(indexed as [freq/tau, y, x]
).
You could also change the model to fit relaxometry data - nothing is particularly specific here.
def lorentz(x, params):
"""lorentzian defined by fwhm, peak position, height"""
w, x0, A = params
R = ((x - x0) / (w / 2)) ** 2 # w/2 to convert from fwhm to hwhm
return A * (1 / (1 + R))
def model(params, x):
"""the physics-informed model we want to fit the data to"""
return 1.0 + lorentz(x, params[:3]) + lorentz(x, params[3:])
def resid(params, xs, pl):
"""residual: difference between measured data 'pl' and model"""
return pl - model(params, xs)
def scipyfit_avg(freqs, data_cube, guess, npts=1000):
"""
fit the average spectrum with scipy
freqs: array of mw frequencies
data_cube: array of PL, indexed as [freq, y, x]
guess: array of initial guesses on parameters
"""
avg_spectrum = np.nanmean(data_cube, axis=(1, 2))
fit_params = scipy.optimize.least_squares(
resid, guess, args=(freqs, avg_spectrum)
).x
fit_freqs = np.linspace(min(freqs), max(freqs), npts)
return fit_freqs, model(fit_params, fit_freqs)
scipy_guess = (20, 2820, -0.015, 20, 2920, -0.015)
fit_freqs, sp_avg_fit = scipyfit_avg(freqs, data_cube, scipy_guess)
plt.plot(freqs, np.nanmean(data_cube, axis=(1,2)), "o")
plt.plot(fit_freqs, sp_avg_fit, "--")
The resultant figure is shown below.
Fitting image data
Ok, so we’ve confirmed that works, let’s map that process across all of our pixels. I’ve written this in a way that seems a bit convoluted, but this functional form is easier to parallelise.
def fit_wrapper(resid, guess, freqs, locator, fit_optns):
"""return the location of the fit, alongside the result"""
fitres = scipy.optimize.least_squares(
resid, guess, args=(freqs, locator[2]), **fit_optns
)
return (tuple(locator[:2]), fitres.x, fitres.jac)
def locator_generator(cube):
"""yields spectrum, while keeping track of pixel location (y,x)"""
_, len_y, len_x = np.shape(cube)
for y in range(len_y):
for x in range(len_x):
yield [y, x, cube[:, y, x]]
def get_images(results, image_shape, guess):
"""converts scipy wrapped results into images"""
result_imgs = [np.full(image_shape, np.nan) for _ in enumerate(guess)]
for pixel_loc, pixel_params, _ in results:
for i, _ in enumerate(guess):
result_imgs[i][pixel_loc] = pixel_params[i]
return result_imgs
def serial_scipyfit_image(freqs, data_cube, guess):
"""fit the full data cube in series with scipy"""
num_pixels = np.prod(data_cube.shape[1:3])
series_results = list(
tqdm.contrib.tmap( # map operation with progress-bar
fit_wrapper, # 'do the fit'
itertools.repeat(resid), # resid is repeated for each element in map
itertools.repeat(guess),
itertools.repeat(freqs),
locator_generator(data_cube), # extact spectrum and pixel location
itertools.repeat({"method": "trf"}),
ascii=True, # this and below are just progress-bar options
mininterval=1,
total=num_pixels,
unit="px",
disable=False,
desc="series",
)
)
# series_results will be in shape: [(y, x), best_fit_params_array, jac_at_solution]
return get_images(series_results, data_cube.shape[1:3], guess)
serial_res = serial_scipyfit_image(freqs, data_cube, scipy_guess)
plot_fit_images(
get_bnv(serial_res[1], serial_res[4], bg_zeeman),
get_avg_fwhm(serial_res[0], serial_res[3]),
)
I ran the above for 100x100 pixels, which on my machine took 1m18s (128px/s) with a gaussian noise of $\sigma=7.5e-3$
(for a contrast of 1e-2 that’s pretty high). Let’s see how much faster we can do with simple parallelization.
We use the concurrent.futures
module which separates the pixels into groups of CHUNKSIZE
pixels, stores these in a pool, and whenever a thread becomes available (up to maximum THREADS
), it hands it one of these chunks to process.
This method isn’t as clean as in some other languages due to python’s issue with the global interpreter lock, but it will give us some speed up.
def parallel_scipyfit_image(freqs, data_cube, guess, chunksize, threads):
"""fit the full data cube in parallel with scipy"""
num_pixels = np.prod(data_cube.shape[1:3])
with concurrent.futures.ProcessPoolExecutor(
max_workers=threads
) as executor:
# note that the change to a parallel map is very simple if we
# write our code in the functional paradigm!
pool_results = list(
tqdm.tqdm(
executor.map(
fit_wrapper,
itertools.repeat(resid),
itertools.repeat(guess),
itertools.repeat(freqs),
locator_generator(data_cube),
itertools.repeat({"method": "trf"}),
chunksize=chunksize,
),
ascii=True,
mininterval=1,
total=num_pixels,
unit="px",
disable=False,
desc="pool ",
)
)
# pool_results will be in shape: [(y, x), best_fit_params_array, jac_at_solution]
return get_images(pool_results, data_cube.shape[1:3], guess)
parallel_res = parallel_scipyfit_image(
freqs, data_cube, scipy_guess, chunksize, threads
)
We’ve achieved a moderate speed up to 26s total fitting time, now at 372px/s. Not fantastic, but significant at long fit times. You could provide the Jacobian, or JIT compile the model with Numba to decrease the fit time. A faster programming language, such as MATLAB or Julia, would have reduced overheads on the parallelization. We can do even better though - let’s look at the work from Przybylski et al.4.
Gpufit
Gpufit is a C++ package that implements the Levenberg-Marquardt algorithm for CUDA-compatible (NVIDIA GPU) devices. It also contains a single-threaded CPU implementation, Cpufit. There are a few complexities to setting up Gpufit. Whilst binaries are available for Windows, they assume a certain generation of graphics card, and do not come with the required fit models for quantum microscopy. I’ll attempt to write down all of the steps required install and customize Gpufit here, both for Windows and Linux. First I’ll go through the installation/compilation process, then how to customize Gpufit to your liking.
Official Cpufit support is still a bit of a work-in-progress (see my pull request here) on Windows, but works well on linux for me. If you don’t have a GPU and just want the Cpufit library, you can skip anything to do with hardware compatibility and driver updates, but you will need to install the CUDA toolkit so the build script doesn’t complain (maybe we can fix that in the future).
Initial installation
The official install instructions are here, if I miss anything. Before you get started with installation, check all of the compatibility requirements and documentation here as you need to install everything in the right order!
First of all, if you want to fit on the GPU, it must be a NVIDIA CUDA-compatible card! Write down what NVIDIA calls the ‘compute capability’, and check this page to determine what versions of the SDK (‘CUDA Toolkit’) it thus supports. The latter link can also be used to identify the CUDA-compatibility. The toolkits can be found here, alongside documentation. The installation guide in the documentation will tell you what OS version (for Windows) and compiler version will work with each toolkit.
Not done yet! Before proceeding, update your driver (e.g. device manager → display adapters), and check the version (NVIDIA control panel → system info). Now check compatibility between your driver version, compute compatibility and hardware generation (link to NVIDIA docs). You’ll also probably want to make sure you aren’t using the GPU for graphics output5.
So in summary you need compatibility between: GPU card, compute compatibility, SDK/toolkit, OS version, compiler and driver version. If you get any of these wrong, you’ll need to start all over again after doing a hard uninstall of each library, etc.
Ok now time to install the dependencies, in this order:
- Install python and git, as I said above ensure GPU device driver is updated
- Clone the git repo (or my branch)
- Install your compiler (on Windows: Microsoft Visual Studio, not to be confused with Microsoft Visual Studio Code! ; on *nix: GCC/G++)
- Install CUDA toolkit
- Install cmake
- Install boost for testing
Windows build
First, identify the directory which contains the Gpufit source code (for example, on a Windows computer the Gpufit source code may be stored in C:\src\gpufit
).
Next, create a build directory outside the source code source directory (e.g. C:\src\gpufit-build
).
Finally, run cmake to configure and generate the compiler input files.
The following commands, executed from the command prompt, assume that the cmake executable (e.g. C:\Program Files\CMake\bin\cmake.exe
) is automatically found via the PATH environment variable (if not, the full path to cmake.exe must be specified).
This example also assumes that the source and build directories have been set up as specified above.
cd C:\src\gpufit-build
cmake -G "Visual Studio 12 2013 Win64" C:\src\Gpufit
I then open up the cmake gui (which will auto-populate fields from this previous cmake run) to edit some more things:
- set
_USE_CBLAS
flag to be true (if you get errors when building try False → sometimes gpufit gets the name of the cuBLAS dll incorrect) - add
BOOST_ROOT
variable to wherever you installed/unpacked BOOST
After configuring and generating the solution files using cmake, go to the desired build directory and open Gpufit.sln using Visual Studio (or the cmake gui has a button to open it for you).
Select the Debug
or Release
build options, as appropriate.
Select the build target ALL_BUILD
, and build this target.
If the build process completes without errors, the Gpufit binary files will be created in the corresponding Debug
or Release
folders in the build directory.
The unit tests can be executed by building the target RUN_TESTS
or by starting the created executables in the output directory from the command line. I recommend you run some of the tests.
Python install:
- Ensure wheel installed (
pip install wheel
) before the above compilation (or just run it again) - Install with:
pip install --force-reinstall <path-to-.whl-file>
(often inGpufit-build\Release\dist\
orGpufit-build\pyGpufit\dist\
)
Linux build
# cd ~/src
# git clone https://github.com/gpufit/Gpufit.git Gpufit
# OR something like:
# git clone https://github.com/casparvitch/Gpufit && git checkout QSL
mkdir Gpufit-build
cd Gpufit-build
# I required gcc-7
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_C_COMPILER=gcc-7 \
-DCMAKE_CXX_COMPILER=g++-7 -USE_CUBLAS ../Gpufit
make
make tests # to run tests
cd pyGpufit/dist
pip install --force-reinstall pyGpufit-*-py2.py3-none-any.whl
Customization
Alright but what if you want to add your own models? I’ll point to the changes I made in my branch, which I think is easier to understand than some snippets pasted here. The overall process is:
- Add a new file and function to the
models
directory. The sweep parameter is stored inuser_info
, and you need to specify the value of your function and its partial derivatives. Example. - Add an enum for your model to
constants.h
, I recommend a negative number so you can rebase on future functions added to main. This change is necessary for Cpufit also, as Cpufit imports this header. Example. - Include your model and add two switch cases in
models.cuh
. Example. - Write a test case in
Consistency.cpp
(put it here so you can add a Cpufit test as well). Example. Or add a python test as I did here.
Cpufit
Unforunately you need to define the Cpufit models separately.
The key difference is that instead of point_index
being an argument to our model now we iterate over it directly.
Here’s the process:
- Define the model_ID in
constants.h
as in the above section. - Add
LMFitCPP::calc_values_foo
andLMFitCPP::calc_derivatives_foo
functions, and a switch case tocalc_curve_values
inlm_fit_cpp.cpp
. Example. - Add private declarations for the above functions in
lm_fit.h
. Example. - Add case to
FitInterface::set_number_of_parameters
in interface.cpp, specifying number of parameters. Example. - Add test cases as above.
Other capabilities
I just wanted to list two other possibilities using Gpufit:
-
Fitting a vector model is possibility, follow the example here.
-
Fitting a model without an explicit derivative, see here.
Results
Let’s fit the same data_cube
as above with Cpufit. The relevant snippet is below.
def gpufit_reshape(data_cube):
"""
data_cube -> data_cube_reshaped, data_cube_positions, for gpufit/cpufit
returns <reshaped_array>, <pixel position of each row in <reshaped_array>>
"""
data_cube_reshaped = [] # (number_pixels, number_freqs)
data_cube_positions = [] # pixel indices, matches shape of the above
for y, x, spec in locator_generator(data_cube):
data_cube_reshaped.append(spec)
data_cube_positions.append(
(y, x)
) # make sure this is a tuple for indexing!
data_cube_positions = np.array(data_cube_positions, dtype=np.int32)
data_cube_reshaped = np.array(data_cube_reshaped, dtype=np.float32)
return data_cube_reshaped, data_cube_positions
def cpufit_image(freqs, data_cube, guess, bounds):
"""fit the full data cube in series with cpufit"""
num_pixels = np.prod(data_cube.shape[1:3])
guess_params = np.repeat([guess], repeats=num_pixels, axis=0).astype(
np.float32
) # (number_pixels, number_params)
constraint_types = np.array(
[gf.ConstraintType.LOWER_UPPER for i in range(guess_params.shape[1])],
dtype=np.int32,
)
constraints = np.tile(bounds, (num_pixels, 1)).astype(np.float32)
data_cube_reshaped, data_cube_positions = gpufit_reshape(data_cube)
cpufit_results, fit_states, _, _, execution_time = gf.fit_constrained(
data_cube_reshaped,
None,
gf.ModelID.LORENTZ8_LINEAR,
guess_params,
tolerance=1e-12,
max_number_iterations=50,
# to compare with/without, set these two to None
constraints=constraints,
constraint_types=constraint_types,
user_info=freqs.astype(np.float32),
parameters_to_fit=np.array(
[1 if i < 8 else 0 for i, _ in enumerate(guess)], np.int32
), # yes this is required!
platform="cpu", # or swap this for 'gpu' if you have a gpu
)
print(f"cpufit execution time (s): {execution_time}", flush=True)
# extract data from cpufit to same shape as fit_wrapper returns
cpufit_params = []
for params, pos in zip(cpufit_results, data_cube_positions):
cpufit_params.append([tuple(pos), params, None])
return get_images(cpufit_params, data_cube.shape[1:3], guess)
# [c, m, width_0, freq_0, amp_0, width_1, ...]
cpufit_guess = [1.0, 0.0, 20.0, 2820.0, -0.015, 20.0, 2920.0,-0.015]
[cpufit_guess.append(0) for _ in range(6 * 3)]
# order: low, high for each param
bounds = [0.9, 1.1, -1e-3, 1e-3, 10, 30, 2790, 2850, -0.1, -0.0001, 10,
30, 2890, 2950, -0.1, -0.0001]
[bounds.append(np.nan) for _ in range(6 * 3 * 2)]
cpufit_res = cpufit_image(freqs, data_cube, cpufit_guess, bounds)
We’re now down to circa three seconds - much faster! Note I’ve added constraints on the parameters here, as Cpufit can go a bit more wild than Scipyfit (Scipyfit can take bounds too - an exercise for the reader). To properly test the speed of Cpufit we’ll need a much bigger dataset, and it will be interesting to see how it scales compared to Gpufit. Gpufit has never fit a dataset in more than a few of seconds for me - incredible! Also note that Cpufit is single-threaded. I haven’t looked into it, but I can imagine Cpufit could be even faster if it was parallelized. It would still be slower than Gpufit, but much easier to install.
Outlook
GPU programming can be a little bit annoying currently, I believe this is largely a NVIDIA problem. Hopefully this article is helpful for others that want to get started with Gpufit in our field. It would be nice to have a standard processing library for the widefield community with our pooled resources, although we’re probably too dilute currently.
I do intend to come back and benchmark the different fit methods, including Gpufit, on a real experimental dataset with 8 peaks, but to do that the issues with my PR will need to be resolved. Specifically I need to be able to compile with the Cpufit API exposed on a Windows machine - the only machines I have access to with a GPU are Windows.
Update 2023-03-20: QDMpy from Michael Volk/Harvard is available from Github6. I named my codebase QDMPy - something something great minds.
Update 2023-09-04: Cpufit API has been exposed to python see here. I will follow up with some benchmarking, though preliminary tests show Cpufit is much ($10\times$
) faster than the above SciPy version.
Update 2023-09-11: I have done a simple benchmark on an 8 peak dataset. With ~1.4Mpx gpufit takes 20s, cpufit 89s and the parallel scipyfit over 10m! With far less pixels (40kpx) gpufit still wins with 640ms, cpufit 2.4s and scipyfit 22s. Getting the GPU setup working can be a little annoying, and cpufit isn’t that much slower, so I’d recommend going that route. You still need to download the CUDA toolkit and compile etc. but it’s easier to get up and running.
-
Note that there is some cross-talk between sensors, i.e. each photosite at the camera is picking up signal from a distribution of defects and defect locations. This subtle effect can cause moderate errors if not taken into consideration during processing; see our recent paper. ↩︎
-
QDMlab paper and github repository. QDMlab extends the Gpufit library with useful routines such as automated image alignment and geomagnetic net moment source reconstruction. NB: requires MATLAB 2019b or later, the Signal Processing Toolbox, the Computer Vision Toolbox and the Optimization Toolbox. ↩︎
-
The full python script can be downloaded here. My branch of Gpufit can be found here. ↩︎
-
Gpufit paper, github repository and documentation. ↩︎
-
I don’t know if this is true, but in all of our lab computers I believe the GPU is not being used for graphics. If you know any details of how this works let me know! ↩︎