emic.inference
Epsilon-machine inference algorithms.
inference
Inference module for epsilon-machine reconstruction.
CSSR
dataclass
CSSR(config: CSSRConfig)
Bases: Generic[A]
Causal State Splitting Reconstruction algorithm (corrected version).
This implementation follows the three-phase structure from the original paper:
Phase I (Initialization): Build suffix tree and initialize with all histories in one state.
Phase II (Sufficiency): For each history length L from 1 to L_max and each history h of length L, let s be the length-(L-1) suffix of h; if P(next | h) differs significantly from P(next | state(s)), create a new state for h or reassign it.
Phase III (Recursion/Determinism): Ensure transitions are deterministic by merging states that would receive the same history under extension.
Reference
Shalizi, C.R. & Klinkner, K.L. (2004). "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences"
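Examples:
A hypothetical usage sketch modeled on the other algorithms' examples (the emic.inference.cssr import path is assumed from the source layout, and no output is asserted):
>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.cssr import CSSR, CSSRConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(10000)(source))
>>>
>>> cssr = CSSR(CSSRConfig(max_history=5))
>>> result = cssr.infer(sequence)  # InferenceResult with the inferred machine and diagnostics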
infer
infer(
sequence: Iterable[A],
alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]
Infer epsilon-machine from sequence.
Source code in src/emic/inference/cssr/algorithm.py
__rrshift__
__rrshift__(source: Iterable[A]) -> InferenceResult[A]
CSSRConfig
dataclass
CSSRConfig(
max_history: int,
significance: float = 0.05,
min_count: int = 5,
test: str = "chi2",
max_iterations: int = 1000,
post_merge: bool = True,
merge_significance: float | None = None,
)
Configuration for the CSSR algorithm.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_history | int | Maximum history length L to consider |
| significance | float | Significance level for statistical tests (lower = fewer states) |
| min_count | int | Minimum observations for a history to be considered |
| test | str | Statistical test type ("chi2", "ks", "g") |
| max_iterations | int | Maximum iterations for convergence |
| post_merge | bool | Enable post-convergence state merging for minimality |
| merge_significance | float \| None | Significance for post-merge (None = use significance) |
Examples:
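A hypothetical configuration sketch using the defaults shown in the signature above:
>>> config = CSSRConfig(max_history=5)
>>> config.significance
0.05
>>> config.test
'chi2'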
__post_init__
Validate configuration parameters.
Source code in src/emic/inference/cssr/config.py
CSM
dataclass
CSM(config: CSMConfig)
Bases: Generic[A]
Causal State Merging algorithm.
Infers an epsilon-machine from an observed sequence by:
1. Creating a state for each unique history of the specified length
2. Computing pairwise distances between state predictive distributions
3. Iteratively merging the closest pair until the threshold is exceeded
This is a bottom-up approach that complements CSSR's top-down splitting.
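A minimal sketch of the merge criterion (illustrative only; the distance function and threshold handling inside emic may differ):
```python
import math

def kl(p: dict[str, float], q: dict[str, float], eps: float = 1e-12) -> float:
    """Kullback-Leibler divergence D(p || q) between next-symbol distributions."""
    return sum(pv * math.log((pv + eps) / (q.get(s, 0.0) + eps))
               for s, pv in p.items() if pv > 0)

# Predictive distributions of two candidate states over symbols {"0", "1"}.
state_a = {"0": 0.50, "1": 0.50}
state_b = {"0": 0.52, "1": 0.48}

merge_threshold = 0.05  # cf. CSMConfig.merge_threshold
should_merge = kl(state_a, state_b) < merge_threshold
```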
Reference
This implementation follows the state-merging approach described in computational mechanics literature as an alternative to CSSR.
Examples:
>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.csm import CSM, CSMConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(10000)(source))
>>>
>>> csm = CSM(CSMConfig(history_length=5))
>>> result = csm.infer(sequence)
>>> len(result.machine.states)
2
infer
infer(
sequence: Iterable[A],
alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]
Infer epsilon-machine from sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | Iterable[A] | The observed sequence of symbols. | required |
| alphabet | frozenset[A] \| None | The set of possible symbols (inferred if not provided). | None |
Returns:
| Type | Description |
|---|---|
| InferenceResult[A] | InferenceResult containing the inferred machine and diagnostics. |
Raises:
| Type | Description |
|---|---|
| InsufficientDataError | If sequence is too short for inference. |
Source code in src/emic/inference/csm/algorithm.py
__rrshift__
__rrshift__(source: Iterable[A]) -> InferenceResult[A]
CSMConfig
dataclass
CSMConfig(
history_length: int,
merge_threshold: float = 0.05,
distance_metric: DistanceMetric = "kl",
min_count: int = 5,
hierarchical: bool = False,
)
Configuration for the Causal State Merging (CSM) algorithm.
CSM is a bottom-up approach to epsilon-machine inference. Unlike CSSR, which splits states, CSM starts with the finest possible partition (each unique history of length L is a state) and merges states based on predictive equivalence.
Attributes:
| Name | Type | Description |
|---|---|---|
| history_length | int | Fixed history length for initial partition. All unique histories of this length become initial states. |
| merge_threshold | float | Distance threshold for merging. Two states are merged if their predictive distributions are closer than this threshold. Lower values = more states, higher values = fewer states. |
| distance_metric | DistanceMetric | Metric for comparing predictive distributions: 'kl' = Kullback-Leibler divergence (asymmetric), 'hellinger' = Hellinger distance (symmetric), 'tv' = total variation distance (symmetric), 'chi2' = chi-squared distance (symmetric). |
| min_count | int | Minimum observations required for a history to be considered. |
| hierarchical | bool | If True, return the full merge hierarchy (dendrogram). Useful for visualization and choosing the merge threshold. |
Examples:
>>> config = CSMConfig(history_length=5, merge_threshold=0.05)
>>> config.history_length
5
>>> config.distance_metric
'kl'
__post_init__
Validate configuration parameters.
Source code in src/emic/inference/csm/config.py
BSI
dataclass
BSI(config: BSIConfig)
Bases: Generic[A]
Bayesian Structural Inference for epsilon-machine learning.
BSI uses Bayesian inference with Gibbs sampling to learn epsilon-machines. It provides uncertainty quantification and automatic model selection via marginal likelihood (evidence).
The algorithm places Dirichlet priors on transition probabilities and uses Gibbs sampling to explore the posterior over state assignments.
Implements the InferenceAlgorithm protocol.
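A minimal illustration of the Dirichlet-prior idea (NumPy sketch; not the package's internal sampler):
```python
import numpy as np

alpha_prior = 1.0                 # cf. BSIConfig.alpha_prior (symmetric concentration)
n_states, n_symbols = 3, 2
rng = np.random.default_rng(0)

# One Dirichlet draw per state gives a candidate row of transition probabilities;
# Gibbs sampling alternates such draws with resampling of state assignments.
rows = rng.dirichlet([alpha_prior] * n_symbols, size=n_states)
assert np.allclose(rows.sum(axis=1), 1.0)  # each row is a probability vector
```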
Reference
Strelioff, C.C. & Crutchfield, J.P. (2014). "Bayesian Structural Inference for Hidden Processes"
Examples:
>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.bsi import BSI, BSIConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(5000)(source))
>>>
>>> bsi = BSI(BSIConfig(max_states=5, n_samples=100))
>>> result = bsi.infer(sequence)
>>> len(result.machine.states) <= 5
True
infer
infer(
sequence: Iterable[A],
alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]
Infer an epsilon-machine from the sequence using BSI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | Iterable[A] | Input sequence of symbols. | required |
| alphabet | frozenset[A] \| None | Set of possible symbols. Inferred from data if None. | None |
Returns:
| Type | Description |
|---|---|
| InferenceResult[A] | InferenceResult containing epsilon-machine and diagnostics. |
Raises:
| Type | Description |
|---|---|
| InsufficientDataError | If sequence is too short. |
Source code in src/emic/inference/bsi/algorithm.py
__rrshift__
__rrshift__(source: Iterable[A]) -> InferenceResult[A]
BSIConfig
dataclass
BSIConfig(
max_states: int = 10,
max_history: int = 5,
alpha_prior: float = 1.0,
n_samples: int = 1000,
burnin: int = 200,
thin: int = 1,
seed: int | None = None,
)
Configuration for Bayesian Structural Inference.
BSI uses Bayesian inference to learn epsilon-machines with natural uncertainty quantification. It can automatically select the number of states via model evidence.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_states | int | Maximum number of states to consider. |
| max_history | int | Maximum history length for likelihood computation. |
| alpha_prior | float | Dirichlet concentration for transition priors. Higher values = more uniform priors. |
| n_samples | int | Number of MCMC samples to draw. |
| burnin | int | Number of burn-in samples to discard. |
| thin | int | Thinning factor (keep every n-th sample). |
| seed | int \| None | Random seed for reproducibility. |
Examples:
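A hypothetical configuration sketch based on the defaults in the signature above:
>>> config = BSIConfig(max_states=5, n_samples=100)
>>> config.alpha_prior
1.0
>>> config.burnin
200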
__post_init__
Validate configuration parameters.
Source code in src/emic/inference/bsi/config.py
Spectral
dataclass
Spectral(config: SpectralConfig)
Bases: Generic[A]
Spectral Learning algorithm for epsilon-machine inference.
Uses SVD of Hankel matrices to learn observable operator representations, then extracts the epsilon-machine from the learned model.
This is a polynomial-time algorithm that is statistically consistent (converges to the true model with enough data).
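A minimal sketch of the Hankel-matrix idea (illustrative NumPy code; the actual construction in emic may use longer histories and different normalization):
```python
import numpy as np
from collections import Counter

seq = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1] * 100
prefixes = suffixes = [(0,), (1,)]            # length-1 histories / futures
pair_counts = Counter(zip(seq, seq[1:]))      # adjacent (prefix, suffix) pairs

# Hankel matrix of joint prefix/suffix probabilities.
H = np.array([[pair_counts[(p[0], s[0])] for s in suffixes] for p in prefixes], dtype=float)
H /= H.sum()

# Rank selection: keep singular values above max_sv * rank_threshold.
singular_values = np.linalg.svd(H, compute_uv=False)
rank = int(np.sum(singular_values > singular_values[0] * 1e-3))
```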
Reference
Hsu, D., Kakade, S.M., & Zhang, T. (2012). "Spectral Algorithm for Learning Hidden Markov Models"
Examples:
>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.spectral import Spectral, SpectralConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(10000)(source))
>>>
>>> spectral = Spectral(SpectralConfig(max_history=5))
>>> result = spectral.infer(sequence)
>>> len(result.machine.states) >= 2 # Should find ~2 states
True
infer
infer(
sequence: Iterable[A],
alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]
Infer epsilon-machine from sequence using spectral methods.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | Iterable[A] | The observed sequence of symbols. | required |
| alphabet | frozenset[A] \| None | The set of possible symbols (inferred if not provided). | None |
Returns:
| Type | Description |
|---|---|
| InferenceResult[A] | InferenceResult containing the inferred machine and diagnostics. |
Raises:
| Type | Description |
|---|---|
| InsufficientDataError | If sequence is too short for inference. |
Source code in src/emic/inference/spectral/algorithm.py
__rrshift__
__rrshift__(source: Iterable[A]) -> InferenceResult[A]
Support: sequence >> Spectral(config).
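A hypothetical one-liner using this operator (reusing the sequence from the example above):
>>> result = sequence >> Spectral(SpectralConfig(max_history=5))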
SpectralConfig
dataclass
SpectralConfig(
max_history: int | None = None,
rank_threshold: float = 0.001,
rank: int | None = None,
regularization: float = 1e-06,
min_count: int | None = None,
)
Configuration for the Spectral Learning algorithm.
Spectral learning uses SVD/eigendecomposition of Hankel matrices to learn observable operator representations, then extracts the epsilon-machine.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_history | int \| None | Maximum history length for Hankel matrix rows/columns. If None (default), automatically adapts based on sample size and alphabet size. Larger values capture longer-range correlations but require more data to estimate reliably. |
| rank_threshold | float | Relative singular value threshold for rank selection. Singular values below max_sv * threshold are discarded. Lower values are more conservative (keep fewer singular values), which can prevent overfitting but may miss structure. Higher values keep more singular values but may include noise. If the inferred machine has too many states, try lowering this; if it has too few states, try raising it or use an explicit rank. |
| rank | int \| None | Fixed rank (overrides threshold if set). Use this when you know the true number of hidden states. |
| regularization | float | Regularization for numerical stability. |
| min_count | int \| None | Minimum observations for a history to be included. If None (default), automatically adapts based on sample size. |
Examples:
>>> # Default configuration (adaptive)
>>> config = SpectralConfig()
>>> config.max_history is None # Will be set adaptively
True
__post_init__
Validate configuration parameters.
Source code in src/emic/inference/spectral/config.py
get_effective_params
Get effective max_history and min_count for given data size.
The heuristic is based on ensuring the Hankel matrix is well-populated:
- We need approximately |Σ|^L unique histories of length L
- Each history needs min_count occurrences
- So we need n >= |Σ|^L * min_count * constant
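An illustrative reading of this heuristic (the constant and the exact formula here are assumptions, not emic's actual code):
```python
def effective_max_history(n_samples: int, alphabet_size: int,
                          min_count: int = 5, c: int = 4) -> int:
    """Largest L with alphabet_size**L * min_count * c <= n_samples (toy version)."""
    L = 1
    while alphabet_size ** (L + 1) * min_count * c <= n_samples:
        L += 1
    return L

print(effective_max_history(10_000, alphabet_size=2))  # -> 8 with these assumed constants
```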
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Number of samples in the sequence. | required |
| alphabet_size | int | Size of the alphabet. | required |
Returns:
| Type | Description |
|---|---|
| tuple[int, int] | Tuple of (effective_max_history, effective_min_count). |
Source code in src/emic/inference/spectral/config.py
NSD
dataclass
NSD(config: NSDConfig)
Bases: Generic[A]
Neural State Discovery for epsilon-machine learning.
NSD learns causal states by embedding histories into a vector space and clustering. This implementation uses a simplified approach without neural networks: it creates embeddings from next-symbol distributions and uses k-means clustering to discover states.
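A minimal sketch of the embedding step (illustrative only; emic's actual feature construction and clustering may differ):
```python
from collections import Counter, defaultdict

seq = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1] * 100
k = 2  # history window length, cf. NSDConfig.history_length

# Empirical next-symbol counts for every length-k history.
next_counts: defaultdict[tuple[int, ...], Counter] = defaultdict(Counter)
for i in range(len(seq) - k):
    next_counts[tuple(seq[i:i + k])][seq[i + k]] += 1

# Each history's embedding is its next-symbol probability vector; histories
# with similar vectors would be clustered (e.g. by k-means) into one state.
embeddings = {h: tuple(c[s] / sum(c.values()) for s in (0, 1))
              for h, c in next_counts.items()}
```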
Reference
This is a simplified version inspired by representation learning approaches to computational mechanics.
Examples:
>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.nsd import NSD, NSDConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(5000)(source))
>>>
>>> nsd = NSD(NSDConfig(max_states=5))
>>> result = nsd.infer(sequence)
>>> len(result.machine.states) <= 5
True
infer
infer(
sequence: Iterable[A],
alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]
Infer an epsilon-machine from the sequence using NSD.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | Iterable[A] | Input sequence of symbols. | required |
| alphabet | frozenset[A] \| None | Set of possible symbols. Inferred from data if None. | None |
Returns:
| Type | Description |
|---|---|
| InferenceResult[A] | InferenceResult containing epsilon-machine and discovery info. |
Raises:
| Type | Description |
|---|---|
| InsufficientDataError | If sequence is too short. |
Source code in src/emic/inference/nsd/algorithm.py
__rrshift__
__rrshift__(source: Iterable[A]) -> InferenceResult[A]
NSDConfig
dataclass
NSDConfig(
max_states: int = 10,
history_length: int = 10,
embedding_dim: int = 16,
n_iterations: int = 100,
convergence_threshold: float = 0.0001,
seed: int | None = None,
)
Configuration for Neural State Discovery.
NSD learns state representations using a neural network approach. This is a simplified implementation that uses clustering on learned embeddings rather than full neural network training.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_states | int | Maximum number of states to discover. |
| history_length | int | Length of history window for state embedding. |
| embedding_dim | int | Dimensionality of state embeddings. |
| n_iterations | int | Number of clustering iterations. |
| convergence_threshold | float | Threshold for convergence detection. |
| seed | int \| None | Random seed for reproducibility. |
Examples:
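A hypothetical configuration sketch based on the defaults in the signature above:
>>> config = NSDConfig(max_states=5)
>>> config.history_length
10
>>> config.embedding_dim
16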
__post_init__
Validate configuration parameters.
Source code in src/emic/inference/nsd/config.py
InferenceAlgorithm
Bases: Protocol[A]
Protocol for epsilon-machine inference algorithms.
Any algorithm that can infer an epsilon-machine from a sequence of symbols satisfies this protocol.
Examples:
>>> class MyAlgorithm:
...     def infer(self, sequence, alphabet=None):
...         ...  # Return InferenceResult
infer
infer(
sequence: Iterable[A],
alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]
Infer an epsilon-machine from the given sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | Iterable[A] | The observed symbols | required |
| alphabet | frozenset[A] \| None | Known alphabet (inferred from sequence if None) | None |
Returns:
| Type | Description |
|---|---|
| InferenceResult[A] | InferenceResult containing the machine and diagnostics |
Source code in src/emic/inference/protocol.py
InferenceResult
dataclass
InferenceResult(
machine: EpsilonMachine[A],
sequence_length: int,
max_history_used: int,
num_histories_considered: int,
converged: bool = True,
iterations: int | None = None,
warnings: tuple[str, ...] = tuple(),
)
Bases: Generic[A]
The result of epsilon-machine inference.
Contains the inferred machine plus diagnostics and quality metrics.
Attributes:
| Name | Type | Description |
|---|---|---|
| machine | EpsilonMachine[A] | The inferred epsilon-machine |
| sequence_length | int | Number of symbols in the input sequence |
| max_history_used | int | Maximum history length considered |
| num_histories_considered | int | Total number of histories analyzed |
| converged | bool | Whether the algorithm converged |
| iterations | int \| None | Number of iterations (if applicable) |
| warnings | tuple[str, ...] | Any warnings generated during inference |
summary
Return a human-readable summary.
Source code in src/emic/inference/result.py
__rshift__
__rshift__(func: Callable[[EpsilonMachine[A]], T]) -> T
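A hypothetical chaining sketch (assuming, from the signature, that the operator applies the given function to the inferred machine):
>>> num_states = result >> (lambda machine: len(machine.states))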
InferenceError
InsufficientDataError
Bases: InferenceError
Raised when sequence is too short for reliable inference.
Source code in src/emic/inference/errors.py
explain
Return a human-readable explanation.
Source code in src/emic/inference/errors.py
NonConvergenceError
Bases: InferenceError
Raised when algorithm fails to converge.
Source code in src/emic/inference/errors.py
explain
Return a human-readable explanation.