Simulations Tutorial#

See also

This can also be viewed as a Jupyter Notebook Download H2MM_Simulation_Tutorial.ipynb

Download data file here: sample_data_3det.txt

The other side of H2MM_C is a set of functions for producing simulated data.

The simulation functions of H2MM_C are starte with sim_. Within H2MM there are several “levels” of data:

  • times, treated as fixed points, part of normal data

  • state of each data point, these can be simulated from a h2mm_model and a set of times

  • index of each data point- the observable of H2MM, these can be simulated with times, a h2mm_model and (optionally) states

Note

These simulates are based purely on hidden Markov modeling. They are NOT molecular, fluorescence or other sort of simulation. Use another package if you want to simulate something while explicitly handling such complexities in the simulation

First, let’s get our imports out of the way.

import os
import numpy as np
from matplotlib import pyplot as plt

import H2MM_C as hm

Basic Simulation#

The most likely simulation function you will use is sim_phtraj_from_times(). This function takes a h2mm_model object and an array of arrival times (equivalent to 1 element of the times list given to EM_H2MM_C()/h2mm_model.optimize(), and nearly all the other non-simulation functions/methods, and returns a set of states and indices (the latter equivalent to one element of the indexes list.

So, first lets generate a random distribution of times:

time = np.cumsum(np.random.exponential(100, size=50).astype(int))

Then a model (we’ll make a rough approximation of teh model from the 3 detector setup from Basic Optimizations with H2MM_C

# define the arrays
prior = np.array([0.63, 0.03, 0.19, 0.15])
trans = np.array([[0.9997, 0.0001, 0.0001, 0.0001],
                  [2e-5, 1-3.2e-5, 1e-5, 2e-6],
                  [5e-6, 7e-6, 1-2.2e-5, 1e-5],
                  [3e-6, 3e-6, 4e-5, 1-4.6e-5]])
obs = np.array([[0.62, 0.37, 0.01],
                [0.14, 0.29, 0.57],
                [0.44, 0.09, 0.47],
                [0.84, 0.08, 0.08]])

# make the model
sim_model = hm.h2mm_model(prior, trans, obs)

And finally, simulate the data to get an array of states and indices. These are simulations producing the resutls of Viterbi and the data point indices respectively.

simstate, simcolor = hm.sim_phtraj_from_times(sim_model, time)
simstate, simcolor
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0], dtype=uint32),
array([1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
1, 1, 0, 1, 1, 1], dtype=uint32))

Recreating Data: Multiple Arrays#

Now, if we want to simulate a full data set, you will want to repeat the previous process many times, which means using a loop.

# initiate lists
simtimes = list()
simstates = list()
simcolors = list()

# loop to create each set
for _ in range(1000):
    # generate new time array
    simtime = np.cumsum(np.random.exponential(100, size=np.random.randint(50,150)).astype(int))
    # simulate data
    simstate, simcolor = hm.sim_phtraj_from_times(sim_model, simtime)
    # append arrays to lists
    simtimes.append(simtime)
    simstates.append(simstate)
    simcolors.append(simcolor)

Using Existing Times#

Another strategy, which can be a way to check the reasonableness of a model, is to use the actual times of an experiment, and the resulting h2mm_model, and then compare the real data to the simulated data.

Note

Markov processes are inherently random, you will need to come up with some metric along which to compare. You will certainly not get the same indices in the simulated data and real data. The ratios of the indices should be similar however- use you knowledge about the system to figure out a legitimate way to compare.

So let’s load the times from the 3 detector data, and simulate the data from that:

##############################################################
# The code here is just for loading the data
# load the data
# color3 = list() # to save memory, we will not load this
times3 = list()

i = 0
with open('sample_data_3det.txt','r') as f:
    for line in f:
        if i % 2 == 0:
            times3.append(np.array([int(x) for x in line.split()],dtype='Q'))
        # No need to load the color, so comment it out
#         else:
#             color3.append(np.array([int(x) for x in line.split()],dtype='L'))
        i += 1
# End of data loading segment
##############################################################

# initiate arrays
simstates3 = list()
simcolors3 = list()

# loop over each time array and simulate each set
for tm3 in times3:
    # conduct the simulation
    simstate3, simcolor3 = hm.sim_phtraj_from_times(sim_model, tm3)
    # append arrays to lists
    simstates3.append(simstate3)
    simcolors3.append(simcolor3)

Simulating from Components#

It is also possible to simulate first states, and then with a separate function, simulate the times. If you only want the state path, you can just do the first step, and then save the memory, computational time, and code complexity, or maybe you already have the state path, so just to the second half.

This is done with the sim_sparsestatepath() and sim_phtraj_from_state() functions.

Note

sim_sparsestatepath() uses only the h2mm_model.prior and h2mm_model.trans arrays to calculate the states, because indices are not calcualted, the h2mm_model.obs array will have no influence on the results.

Conversely, sim_phtraj_from_state() already has the states from input, and therefore the only important array for its simulation is the h2mm_model.obs array.

Simulating a State Path

simtime = np.cumsum(np.random.exponential(100, size=50).astype(int))

statepath = hm.sim_sparsestatepath(sim_model, simtime)
statepath
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2], dtype=uint32)

Simulating Data from States

color = hm.sim_phtraj_from_state(sim_model, statepath)
color
array([0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 2,
1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 0, 0, 2, 2, 0, 2, 1, 0, 0, 2, 2, 0,
0, 2, 2, 2, 0, 0], dtype=uint32)

Setting the Random Seed for Reproducibility#

Since these simulations are based on a random number generator, results will be different each time. However, if repeatability is desired, the seed of the random number generator can be set with the keyword argument seed. The same syntax is used across all three simulation functions.

Note

The random seed is persitent, so it should only be set once. Each time the seed is set, the counter on the random number generator is reset.

simpath, simcolor = hm.sim_phtraj_from_times(sim_model, time, seed=100)
simpath, simcolor
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0], dtype=uint32),
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,
0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 1, 1], dtype=uint32))