Basic Optimizations with H2MM_C#

Basic Analysis#

The base steps of analysis with H²MM are:

Load data
Make initial model
Optimize model
Viterbi analysis to classify photons and assess quality of optimization

Note

Steps 2-4 are usually repeated multiple times with different numbers of states and result compared to select the ideal model

Step 1: Load Data#

Now that the data is imported, we can start to perform the basic analysis of H²MM, for which we will need some data to analyze. So we will load some data with the following lines of code:

# This cell will only be used if you export your data to an ascii file,
# if you are using a software package such as FRETBursts, a different
# function (for FRETBursts, look at our other Jupyter Notebooks, it is
# the data_sort function)
def load_txtdata(filename):
    color = list()
    times = list()
    with open(filename,'r') as f:
        for i, line in enumerate(f):
            if i % 2 == 0:
                times.append(np.array([int(x) for x in line.split()],dtype=np.int64))
            else:
                color.append(np.array([int(x) for x in line.split()],dtype=np.uint8))
    return color, times

color2, times2 = load_txtdata('sample_data_2det.txt')

Done loading data

Data Organization#

So what does this data look like? It comes in 2 lists of arrays. One list contains the detectors (indices) of the data. The second, the times of the data.

color2 contains the detectors
times2 contains the times of the data

Both are lists of 1-D numpy arrays.

Note

The size of the detector list and time list, and of the arrays within must match. This is because each data point is a pair of numbers: the index of the detector, and the absolute time of the data point. Each array (element of the two lists) is a sequence of data that is to be analyzed as a separate instance of data that can be represented by the same hidden Markov model. Therefore the length of color2 must be the same as the length of times2 Similarly, the length of the first array in color2 must be the same as the length of the first array in times2. The lengths of the second arrays in color2 and times2 must also match, and so on. There is no need however for the lengths of the first and second arrays in color2 and times2 to match, as sequences of data points need not be the same length.

For the next step, it is important that the number of detectors is known, as this affects the initial model to be used. In the example loaded above, there are 2 detectors- 0 and 1.

Note

Indices must start from 0 and go in integer ascending order. I.E. if there are 2 detectors, you must use indices 0 and 1. Similarly, if there are 3 indices you must use indices 0, 1 and 2. And so on.

>>> color2[0], times2[0]
(array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0],
       dtype=uint64),
array([12763662, 12766593, 12769260, 12769677, 12775221, 12781399,
        12783890, 12795124, 12795391, 12796727, 12797800, 12798471,
        12798527, 12799163, 12799626, 12802748, 12808470, 12811473,
        12826887, 12841103, 12842716, 12842791, 12844203, 12844446,
        12845687, 12846645, 12847506, 12848115, 12850099, 12853216,
        12853259, 12853526, 12853632, 12853845, 12855027, 12858436,
        12858750, 12860935, 12861560, 12861894, 12869843], dtype=uint64))

Step 2: Create an Initial Model#

H²MM is fundamentally an optimization, which means that an initial, unoptimized model must be used as a starting point to initiate the optimization.

These models are represented by the h2mm_model in H2MM_C H2MM_C provides a simple function for creating initial models: factory_h2mm_model(). All you need to do is specify the number of states and the number of detectors.

Note

The number of states is arbitrary, as we will show later. You should compare several different numbers of states. The number of detectors, however is determined by your data. For the file sample_data_2det.txt there are 2 detectors.

So let’s make the initial model. The syntax is: hm.factory_h2mm_model(nstate: int, ndet: int):

So let’s make an initial model:

model_3s2d = hm.facotry_h2mm_model(3,2) # 3 states, 2 detectors

Step 3: Optimize the Model#

Now that we have an initial model, we can optimize it against the data. For this we use the h2mm_model.optimize() method.

The syntax is

model.optimize(indexes: list[numpy.ndarray], times: list[numpy.ndarray])

So let’s optimize the model:

model_3s2d.optimize(color2, times2)
model_3s2d

The model converged after 2846 iterations

nstate: 3, ndet: 2, nphot: 239701, niter: 2846, loglik: -120029.31040593554 converged state: 0x27
prior:
0.2272268169697541, 0.3356458088268423, 0.43712737420340364
trans:
0.9999737418742817, 2.625812450388196e-05, 1.2144162651673179e-12
1.4552276266225378e-05, 0.9999710725953526, 1.4375128381104806e-05
1.1159734391275203e-06, 1.0270624307327267e-05, 0.9999886134022535
obs:
0.3071558790693713, 0.6928441209306287
0.7830415401641657, 0.21695845983583428
0.8905661980180328, 0.10943380198196716

Examining the optimized model#

The core of a h2mm_model is composed of 3 arrays. In the following definition, N will indicate the number of states, and M the number of detectors.

h2mm_model.prior: the initial probability vector, of shape N, which represents the probability of a sequence beginning in each state.
h2mm_model.trans: the transition probability matrix, of shape N by N, the probability of a transition happenign from state i to state j. .. note:
```
The units of the transition probability matrix are in units of the clock-rate of the data
```
h2mm_model.obs: the emission probability matrix, of shame N by M. The probability that if the system is in state i, the detector j will be observed.

Note

All the above arrays are row stochastic meaning that the sum of the values in each row sum to 1. This is because these are all probabilities and each row enumerates all possibilities, thus their total probability must be 1. Calculate values accordingly

These are all printed as part of the reper for h2mm_model

There are a number of other secondary parameters stored in the h2mm_model object. The size of the matricies can be accessed with

h2mm_model.nstate the number of states
h2mm_model.ndet the number of detectors

Others are specific to the course of the optimization

h2mm_model.niter The number of iterations during the optimization
h2mm_model.nphot The number of data points in the input data on which the model was optimized

There are the statistical parameters that reflect the likelihood of the model:

h2mm_model.loglik The loglikelihood of the model. This represents the value which is being optimized.
h2mm_model.bic The Bayes Information Criterion, a statistical discriminator. \(BIC = k\ln(n) - 2\ln(LL)\) where \(k = nstate^2 - (ndet - 1)nstate - 1\), this is used for comparing loglikelihoods of different models with different numbers of states, but optimized against the same data.
h2mm_modle.k The number of free parameters in the model.

Step 4: Viterbi Analysis#

The final step in the H²MM analysis of a single model is to find the most likely state-path through the data. This is done using the Viterbi algorithm.

For this H2MM_C has the viterbi_path() function. This function returns the most likely state path as well as a few other useful values. These values are:

path: The most likely state of each data point
scale: the likelihood of each state assignment in path, i.e. the posterior probability
loglik: The loglikelihood of each trajectory, returned as a numpy array.
ICL: The Integrated Complete Likelihood, a statistical discriminator, which is essentially the \(BIC\), but for the most likely path instead of all possible paths.

The syntax is:

hm.viterbi_path(model: H2MM_C.h2mm_model, indexes: list[numpy.ndarray], times: list[numpy.ndarray])

path_3s2d, ll_3s2d, icl_3s2d = hm.viterbi_path(model_3s3d, color2, times2)
path_3s2d[0]

array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],

dtype=uint64)

Loop Based Analysis#

Since we need to compare different models, it is generally more usefull to run the analysis in a loop, and then compare the different optimizations.

In particular, the ICL is very useful in identifying the best model.

We will also now use a data set with 3 detectors, so that we can see how different detector numbers affect the analysis code.

# load the data
color3 = list()
times3 = list()

i = 0
with open('sample_data_3det.txt','r') as f:
    for line in f:
        if i % 2 == 0:
            times3.append(np.array([int(x) for x in line.split()],dtype='Q'))
        else:
            color3.append(np.array([int(x) for x in line.split()],dtype='L'))
        i += 1

For this we will start with a simple loop, optimizing for 1 to 4 states (there’s no such thing as a 0 state model)

# initiate lists
models = list()
paths = list()
scales = list()
logliks = list()
ICL = list()

# loop from 1 to 4 states
for i in range(1,5):
    # make initial model with i states
    model_temp = hm.factory_h2mm_model(i, 3) # i states, 3 detectors
    # optimize model
    model_temp.optimize(color3, times3)
    # calculate ideal state path
    path_temp, scale_temp, ll_temp, icl = hm.viterbi_path(model_temp, color3, times3)
    # append results to lists
    models.append(model_temp)
    paths.append(path_temp)
    scales.append(scale_temp)
    logliks.append(ll_temp)
    ICL.append(icl)
models

The model converged after 2 iterations

The model converged after 83 iterations

The model converged after 191 iterations

Optimization reached maximum number of iterations

[nstate: 1, ndet: 3, nphot: 436084, niter: 2, loglik: -436870.2664181456 converged state: 0x27

prior:
1.0
trans:
1.0
obs:
0.40807734289723996, 0.14158969372873118, 0.45033296337402884,
nstate: 2, ndet: 3, nphot: 436084, niter: 83, loglik: -416808.46054400457 converged state: 0x27
prior:
0.28420054140898077, 0.7157994585910192
trans:
0.9999702727327167, 2.9727267283272445e-05
1.361494142422072e-05, 0.9999863850585758
obs:
0.15584402506221606, 0.2649819949425368, 0.579173979995247
0.5300103898756764, 0.08194016213702787, 0.38804944798729574,
nstate: 3, ndet: 3, nphot: 436084, niter: 191, loglik: -409379.1470199655 converged state: 0x27
prior:
0.25991614498503546, 0.4943766688551101, 0.2457071861598543
trans:
0.99997620100061, 2.1091530647357298e-05, 2.7074687426012416e-06
8.392831156401812e-06, 0.9999812080804615, 1.0399088382201267e-05
6.290351439278574e-06, 4.466205603341952e-05, 0.9999490475925273
obs:
0.14570425274135482, 0.2934431598919638, 0.5608525873666814
0.44173952215070467, 0.08763105601610255, 0.47062942183319284
0.8414287299383637, 0.07852868414074801, 0.08004258592088825,
nstate: 4, ndet: 3, nphot: 436084, niter: 3600, loglik: -408229.8870830556 converged state: 0x47
prior:
0.6260350194522536, 0.03065208094489406, 0.19098128141081058, 0.15233161819204177
trans:
0.755356522950715, 0.10496527792311883, 0.1396781991261662, 2.5473360035781177e-93
1.9483774568912126e-05, 0.9999676455004775, 1.0873612638551622e-05, 1.9971123150751066e-06
4.55729580226709e-06, 6.533151970496898e-06, 0.9999789904946441, 9.919057583121052e-06
2.89221723510831e-06, 2.6479397449634016e-06, 3.798997443069558e-05, 0.9999564698685893
obs:
0.6182798684674422, 0.3817201315325577, 2.4250145435136427e-68
0.14426145606515778, 0.2895270449307156, 0.5662114990041266
0.4416386060438939, 0.0870081949493097, 0.4713531990067965
0.8409805356960829, 0.07645731068492255, 0.08256215361899454]

Now that several models have been calculated, it is important to select the ideal model, so you do not use over- or under-fit models that do not properly represent the data.

There are two primary criterion for this:

A statistical discriminator like the ICL or BIC
Reasonableness of the model based on prior knowledge of the system.

So let’s plot the ICL to see what the best model is:

states = np.arange(1,5)
plt.scatter(states, ICL)
plt.xlabel("States")
plt.ylabel("ICL")
plt.title("ICL")

Text(0.5, 1.0, ‘ICL’)

states = np.arange(1,5)
plt.scatter(states, [model.bic for model in models])
plt.xlabel("States")
plt.ylabel("BIC")
plt.title("BIC")

Text(0.5, 1.0, ‘BIC’)

So it looks like the ideal model (minimum ICL) is the 4 state model.

But since this is also the most number of states optimized, a 5 state model might be even better. Even if the 4 state model is best, seeing that the ICL of the 5 state model increases, it is still best to optimize it so that we can be assured that the 4 state model is best. And if the 5 state model turns out to have an even smaller ICL, we should go on and optimize a 6 state model.

model_temp = hm.factory_h2mm_model(5,3)
model_temp.optimize(color3, times3)
path_temp, scale_temp, ll_temp, icl = hm.viterbi_path(model_temp, color3, times3)
models.append(model_temp)
paths.append(path_temp)
scales.append(scale_temp)
logliks.append(ll_temp)
ICL.append(icl)

Optimization reached maximum number of iterations

states = [model.nstate for model in models]
bic = [model.bic for model in models]

fig, ax = plt.subplots(1,2, figsize=(12,5))

ax[0].scatter(states, ICL)
ax[0].set_xlabel("States")
ax[0].set_ylabel("ICL")
ax[0].set_title("ICL")

ax[1].scatter(states, bic)
ax[1].set_xlabel("States")
ax[1].set_ylabel("BIC")
ax[1].set_title("BIC")

Text(0.5, 1.0, ‘BIC’)

Indeed, it looks like the 4 state model is the ideal model, so let’s examine it:

models[3]

nstate: 4, ndet: 3, nphot: 436084, niter: 3600, loglik: -408229.8870830556 converged state: 0x47
prior:
0.6260350194522536, 0.03065208094489406, 0.19098128141081058, 0.15233161819204177
trans:
0.755356522950715, 0.10496527792311883, 0.1396781991261662, 2.5473360035781177e-93
1.9483774568912126e-05, 0.9999676455004775, 1.0873612638551622e-05, 1.9971123150751066e-06
4.55729580226709e-06, 6.533151970496898e-06, 0.9999789904946441, 9.919057583121052e-06
2.89221723510831e-06, 2.6479397449634016e-06, 3.798997443069558e-05, 0.9999564698685893
obs:
0.6182798684674422, 0.3817201315325577, 2.4250145435136427e-68
0.14426145606515778, 0.2895270449307156, 0.5662114990041266
0.4416386060438939, 0.0870081949493097, 0.4713531990067965
0.8409805356960829, 0.07645731068492255, 0.08256215361899454

User Defined Initial Models#

Usually the factory_h2mm_model() function is sufficient for generating initial models, however, sometimes it is preferable to define the parameters of the initial model more manually.

This is done like so:

# define input prior, trans and obs matrices
prior = np.array([0.5,0.5])
trans = np.array([[0.99999, 0.00001], [0.00001, 0.99999]])
obs = np.array([[0.2, 0.3, 0.5], [0.4, 0.2, 0.4]])

# make the initial model
model_user = hm.h2mm_model(prior, trans, obs)

We can optimize this just as before:

model_user.optimize(color3, times3)
model_user

nstate: 2, ndet: 3, nphot: 436084, niter: 86, loglik: -416808.4605439878 converged state: 0x27
prior:
0.2842005858570078, 0.7157994141429922
trans:
0.9999702726711638, 2.9727328836273114e-05
1.3614977930264637e-05, 0.9999863850220698
obs:
0.15584402059957542, 0.2649819669048419, 0.5791740124955829
0.5300104525521383, 0.08194014608492564, 0.38804940136293614

Basic Optimizations with H2MM_C#

Basic Analysis#

Step 1: Load Data#

Data Organization#

Step 2: Create an Initial Model#

Step 3: Optimize the Model#

Examining the optimized model#

Step 4: Viterbi Analysis#

Loop Based Analysis#

User Defined Initial Models#

h2mm_model.evaluate() and Error Analysis#

`h2mm_model.evaluate()` and Error Analysis#