emic.inference

Epsilon-machine inference algorithms.

inference

Inference module for epsilon-machine reconstruction.

CSSR dataclass

CSSR(config: CSSRConfig)

Bases: Generic[A]

Causal State Splitting Reconstruction algorithm (corrected version).

This implementation follows the three-phase structure from the original paper:

Phase I (Initialization): Build suffix tree and initialize with all histories in one state.

Phase II (Sufficiency): For each history length L from 1 to L_max, and for each history h of length L, let s be the suffix of h of length L-1; if P(next | h) differs significantly from P(next | state(s)), create a new state or reassign h.

Phase III (Recursion/Determinism): Ensure transitions are deterministic by merging states that would receive the same history under extension.

Reference

Shalizi, C.R. & Klinkner, K.L. (2004). "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences"
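
A minimal usage sketch in the style of the other algorithms' examples on this page; the import path emic.inference.cssr is assumed by analogy with emic.inference.csm and is not confirmed here.

from emic.sources.synthetic.golden_mean import GoldenMeanSource
from emic.sources.transforms.take import TakeN
from emic.inference.cssr import CSSR, CSSRConfig  # assumed module path

source = GoldenMeanSource(p=0.5, _seed=42)
sequence = list(TakeN(10000)(source))

# The Golden Mean process has two causal states; CSSR should recover them
# with a modest history length and a strict significance level.
cssr = CSSR(CSSRConfig(max_history=5, significance=0.01))
result = cssr.infer(sequence)
print(result.summary())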

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer epsilon-machine from sequence.

Source code in src/emic/inference/cssr/algorithm.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """Infer epsilon-machine from sequence."""
    symbols = list(sequence)
    n = len(symbols)

    min_required = self.config.min_count * (self.config.max_history + 1) * 2
    if n < min_required:
        raise InsufficientDataError(
            required=min_required,
            provided=n,
            algorithm="CSSR",
        )

    if alphabet is None:
        alphabet = frozenset(symbols)

    # Phase I: Build suffix tree
    suffix_tree: SuffixTree[A] = SuffixTree(
        max_depth=self.config.max_history, alphabet=alphabet
    )
    suffix_tree.build_from_sequence(symbols)

    # Phase II: Level-by-level sufficiency testing
    partition = self._sufficiency_phase(suffix_tree, alphabet)

    # Phase III: Ensure determinism (merge equivalent states)
    partition = self._determinism_phase(partition, suffix_tree, alphabet)

    # Build machine
    machine = self._build_machine(partition, suffix_tree, alphabet)

    return InferenceResult(
        machine=machine,
        sequence_length=n,
        max_history_used=self.config.max_history,
        num_histories_considered=len(suffix_tree),
        converged=True,
        iterations=self.config.max_history,
    )

__rrshift__

__rrshift__(source: Iterable[A]) -> InferenceResult[A]

Support: sequence >> CSSR(config).

Source code in src/emic/inference/cssr/algorithm.py
def __rrshift__(self, source: Iterable[A]) -> InferenceResult[A]:
    """Support: sequence >> CSSR(config)."""
    alphabet = getattr(source, "alphabet", None)
    return self.infer(source, alphabet=alphabet)
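
A short sketch of the pipeline form, reusing the hypothetical sequence from the example above; the operator delegates to infer, picking up an alphabet attribute from the source when one is present.

# Equivalent to CSSR(CSSRConfig(max_history=5)).infer(sequence)
result = sequence >> CSSR(CSSRConfig(max_history=5))
print(len(result.machine.states))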

CSSRConfig dataclass

CSSRConfig(
    max_history: int,
    significance: float = 0.05,
    min_count: int = 5,
    test: str = "chi2",
    max_iterations: int = 1000,
    post_merge: bool = True,
    merge_significance: float | None = None,
)

Configuration for the CSSR algorithm.

Attributes:

max_history (int): Maximum history length L to consider.
significance (float): Significance level for statistical tests (lower = fewer states).
min_count (int): Minimum observations for a history to be considered.
test (str): Statistical test type ("chi2", "ks", "g", or "proportion").
max_iterations (int): Maximum iterations for convergence.
post_merge (bool): Enable post-convergence state merging for minimality.
merge_significance (float | None): Significance level for post-merge (None = use significance).

Examples:

>>> config = CSSRConfig(max_history=5, significance=0.01)
>>> config.max_history
5
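
A sketch of a non-default configuration using only the attributes documented above; the specific values are illustrative, not recommendations.

config = CSSRConfig(
    max_history=6,
    significance=0.005,  # stricter test, fewer spurious states
    test="g",            # G-test instead of the default chi-squared
    min_count=10,        # require more evidence per history
    post_merge=True,     # merge equivalent states after convergence
)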

__post_init__

__post_init__() -> None

Validate configuration parameters.

Source code in src/emic/inference/cssr/config.py
def __post_init__(self) -> None:
    """Validate configuration parameters."""
    if self.max_history < 1:
        msg = f"max_history must be >= 1, got {self.max_history}"
        raise ValueError(msg)
    if not (0 < self.significance < 1):
        msg = f"significance must be in (0, 1), got {self.significance}"
        raise ValueError(msg)
    if self.min_count < 1:
        msg = f"min_count must be >= 1, got {self.min_count}"
        raise ValueError(msg)
    if self.test not in ("chi2", "ks", "g", "proportion"):
        msg = f"test must be 'chi2', 'ks', 'g', or 'proportion', got {self.test}"
        raise ValueError(msg)
    if self.merge_significance is not None and not (0 < self.merge_significance < 1):
        msg = f"merge_significance must be in (0, 1), got {self.merge_significance}"
        raise ValueError(msg)

CSM dataclass

CSM(config: CSMConfig)

Bases: Generic[A]

Causal State Merging algorithm.

Infers an epsilon-machine from an observed sequence by:

1. Creating a state for each unique history of the specified length
2. Computing pairwise distances between state predictive distributions
3. Iteratively merging the closest pair until the threshold is exceeded

This is a bottom-up approach that complements CSSR's top-down splitting.

Reference

This implementation follows the state-merging approach described in computational mechanics literature as an alternative to CSSR.

Examples:

>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.csm import CSM, CSMConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(10000)(source))
>>>
>>> csm = CSM(CSMConfig(history_length=5))
>>> result = csm.infer(sequence)
>>> len(result.machine.states)
2

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer epsilon-machine from sequence.

Parameters:

sequence (Iterable[A]): The observed sequence of symbols. Required.
alphabet (frozenset[A] | None): The set of possible symbols (inferred if not provided). Default: None.

Returns:

InferenceResult[A]: InferenceResult containing the inferred machine and diagnostics.

Raises:

InsufficientDataError: If sequence is too short for inference.

Source code in src/emic/inference/csm/algorithm.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """
    Infer epsilon-machine from sequence.

    Args:
        sequence: The observed sequence of symbols.
        alphabet: The set of possible symbols (inferred if not provided).

    Returns:
        InferenceResult containing the inferred machine and diagnostics.

    Raises:
        InsufficientDataError: If sequence is too short for inference.
    """
    # Convert to list for multiple passes
    symbols = list(sequence)
    n = len(symbols)

    # Check minimum data requirement
    min_required = self.config.min_count * (self.config.history_length + 1) * 2
    if n < min_required:
        raise InsufficientDataError(
            required=min_required,
            provided=n,
            algorithm="CSM",
        )

    # Infer alphabet if not provided
    if alphabet is None:
        alphabet = frozenset(symbols)

    # Collect history statistics
    history_stats = self._collect_history_stats(symbols)

    # Initialize partition: each valid history is its own state
    partition, state_to_histories = self._initialize_partition(history_stats)

    if not partition:
        # No valid histories, create trivial machine
        return self._build_trivial_result(alphabet, n)

    # Compute initial distances between all state pairs
    distances = self._compute_all_distances(partition, state_to_histories, history_stats)

    # Merge history for hierarchical option
    merge_history: list[MergeInfo] = []

    # Iteratively merge closest pairs
    num_merges = 0
    while len(partition) > 1:
        # Find closest pair
        min_dist = float("inf")
        best_pair: tuple[str, str] | None = None

        for (s1, s2), dist in distances.items():
            if dist < min_dist:
                min_dist = dist
                best_pair = (s1, s2)

        if best_pair is None or min_dist > self.config.merge_threshold:
            break

        # Merge the pair
        s1, s2 = best_pair
        new_state = self._merge_states(s1, s2, partition, state_to_histories, history_stats)
        num_merges += 1

        if self.config.hierarchical:
            merge_history.append(MergeInfo(s1, s2, new_state, min_dist))

        # Update distances
        distances = self._update_distances_after_merge(
            s1, s2, new_state, partition, state_to_histories, history_stats, distances
        )

    # Build machine from final partition
    machine = self._build_machine(partition, state_to_histories, history_stats, alphabet)

    return InferenceResult(
        machine=machine,
        sequence_length=n,
        max_history_used=self.config.history_length,
        num_histories_considered=len(history_stats),
        converged=True,
        iterations=num_merges,
    )

__rrshift__

__rrshift__(source: Iterable[A]) -> InferenceResult[A]

Support: sequence >> CSM(config).

Source code in src/emic/inference/csm/algorithm.py
def __rrshift__(self, source: Iterable[A]) -> InferenceResult[A]:
    """Support: sequence >> CSM(config)."""
    alphabet = getattr(source, "alphabet", None)
    return self.infer(source, alphabet=alphabet)

CSMConfig dataclass

CSMConfig(
    history_length: int,
    merge_threshold: float = 0.05,
    distance_metric: DistanceMetric = "kl",
    min_count: int = 5,
    hierarchical: bool = False,
)

Configuration for the Causal State Merging (CSM) algorithm.

CSM is a bottom-up approach to epsilon-machine inference. Unlike CSSR, which splits states, CSM starts with the finest possible partition (each unique history of length L is a state) and merges states based on predictive equivalence.

Attributes:

history_length (int): Fixed history length for the initial partition. All unique histories of this length become initial states.
merge_threshold (float): Distance threshold for merging. Two states are merged if their predictive distributions are closer than this threshold. Lower values = more states, higher values = fewer states.
distance_metric (DistanceMetric): Metric for comparing predictive distributions:
    - 'kl': Kullback-Leibler divergence (asymmetric)
    - 'hellinger': Hellinger distance (symmetric)
    - 'tv': Total variation distance (symmetric)
    - 'chi2': Chi-squared distance (symmetric)
min_count (int): Minimum observations required for a history to be considered.
hierarchical (bool): If True, return the full merge hierarchy (dendrogram). Useful for visualization and for choosing the merge threshold.

Examples:

>>> config = CSMConfig(history_length=5, merge_threshold=0.05)
>>> config.history_length
5
>>> config.distance_metric
'kl'
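
A sketch of a non-default configuration using only the attributes documented above; values are illustrative.

config = CSMConfig(
    history_length=4,
    merge_threshold=0.02,         # stricter merging keeps more states
    distance_metric="hellinger",  # symmetric alternative to KL
    hierarchical=True,            # keep the full merge dendrogram
)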

__post_init__

__post_init__() -> None

Validate configuration parameters.

Source code in src/emic/inference/csm/config.py
def __post_init__(self) -> None:
    """Validate configuration parameters."""
    if self.history_length < 1:
        msg = f"history_length must be >= 1, got {self.history_length}"
        raise ValueError(msg)
    if self.merge_threshold <= 0:
        msg = f"merge_threshold must be > 0, got {self.merge_threshold}"
        raise ValueError(msg)
    if self.distance_metric not in ("kl", "hellinger", "tv", "chi2"):
        msg = f"distance_metric must be 'kl', 'hellinger', 'tv', or 'chi2', got {self.distance_metric}"
        raise ValueError(msg)
    if self.min_count < 1:
        msg = f"min_count must be >= 1, got {self.min_count}"
        raise ValueError(msg)

BSI dataclass

BSI(config: BSIConfig)

Bases: Generic[A]

Bayesian Structural Inference for epsilon-machine learning.

BSI uses Bayesian inference with Gibbs sampling to learn epsilon-machines. It provides uncertainty quantification and automatic model selection via marginal likelihood (evidence).

The algorithm places Dirichlet priors on transition probabilities and uses Gibbs sampling to explore the posterior over state assignments.

Implements the InferenceAlgorithm protocol.

Reference

Strelioff, C.C. & Crutchfield, J.P. (2014). "Bayesian Structural Inference for Hidden Processes"

Examples:

>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.bsi import BSI, BSIConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(5000)(source))
>>>
>>> bsi = BSI(BSIConfig(max_states=5, n_samples=100))
>>> result = bsi.infer(sequence)
>>> len(result.machine.states) <= 5
True

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer an epsilon-machine from the sequence using BSI.

Parameters:

sequence (Iterable[A]): Input sequence of symbols. Required.
alphabet (frozenset[A] | None): Set of possible symbols. Inferred from data if None. Default: None.

Returns:

InferenceResult[A]: InferenceResult containing the epsilon-machine and diagnostics.

Raises:

InsufficientDataError: If sequence is too short.

Source code in src/emic/inference/bsi/algorithm.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """
    Infer an epsilon-machine from the sequence using BSI.

    Args:
        sequence: Input sequence of symbols.
        alphabet: Set of possible symbols. Inferred from data if None.

    Returns:
        InferenceResult containing epsilon-machine and diagnostics.

    Raises:
        InsufficientDataError: If sequence is too short.
    """
    symbols = list(sequence)
    n = len(symbols)

    min_required = self.config.max_history * 10
    if n < min_required:
        raise InsufficientDataError(
            required=min_required,
            provided=n,
            algorithm="BSI",
        )

    if alphabet is None:
        alphabet = frozenset(symbols)

    # Initialize RNG
    rng = random.Random(self.config.seed)

    # Build suffix tree for counting
    suffix_counts = self._build_suffix_counts(symbols)

    # Try different numbers of states, compute evidence
    best_machine: EpsilonMachine[A] | None = None
    best_evidence = float("-inf")

    for n_states in range(1, self.config.max_states + 1):
        machine, evidence = self._infer_with_n_states(
            symbols, alphabet, suffix_counts, n_states, rng
        )
        if evidence > best_evidence:
            best_evidence = evidence
            best_machine = machine

    assert best_machine is not None

    return InferenceResult(
        machine=best_machine,
        sequence_length=n,
        max_history_used=self.config.max_history,
        num_histories_considered=len(suffix_counts),
    )

__rrshift__

__rrshift__(source: Iterable[A]) -> InferenceResult[A]

Support: sequence >> BSI(config).

Source code in src/emic/inference/bsi/algorithm.py
def __rrshift__(self, source: Iterable[A]) -> InferenceResult[A]:
    """Support: sequence >> BSI(config)."""
    alphabet = getattr(source, "alphabet", None)
    return self.infer(source, alphabet=alphabet)

BSIConfig dataclass

BSIConfig(
    max_states: int = 10,
    max_history: int = 5,
    alpha_prior: float = 1.0,
    n_samples: int = 1000,
    burnin: int = 200,
    thin: int = 1,
    seed: int | None = None,
)

Configuration for Bayesian Structural Inference.

BSI uses Bayesian inference to learn epsilon-machines with natural uncertainty quantification. It can automatically select the number of states via model evidence.

Attributes:

max_states (int): Maximum number of states to consider.
max_history (int): Maximum history length for likelihood computation.
alpha_prior (float): Dirichlet concentration for transition priors. Higher values = more uniform priors.
n_samples (int): Number of MCMC samples to draw.
burnin (int): Number of burn-in samples to discard.
thin (int): Thinning factor (keep every n-th sample).
seed (int | None): Random seed for reproducibility.

Examples:

>>> config = BSIConfig(max_states=5, n_samples=500)
>>> config.max_states
5
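
A sketch of a configuration that tightens the MCMC settings documented above; values are illustrative.

config = BSIConfig(
    max_states=4,
    n_samples=2000,  # more posterior samples
    burnin=500,      # discard early, unconverged samples
    thin=2,          # keep every second sample to reduce autocorrelation
    seed=0,          # reproducible runs
)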

__post_init__

__post_init__() -> None

Validate configuration parameters.

Source code in src/emic/inference/bsi/config.py
def __post_init__(self) -> None:
    """Validate configuration parameters."""
    if self.max_states < 1:
        msg = f"max_states must be >= 1, got {self.max_states}"
        raise ValueError(msg)
    if self.max_history < 1:
        msg = f"max_history must be >= 1, got {self.max_history}"
        raise ValueError(msg)
    if self.alpha_prior <= 0:
        msg = f"alpha_prior must be > 0, got {self.alpha_prior}"
        raise ValueError(msg)
    if self.n_samples < 1:
        msg = f"n_samples must be >= 1, got {self.n_samples}"
        raise ValueError(msg)
    if self.burnin < 0:
        msg = f"burnin must be >= 0, got {self.burnin}"
        raise ValueError(msg)
    if self.thin < 1:
        msg = f"thin must be >= 1, got {self.thin}"
        raise ValueError(msg)

Spectral dataclass

Spectral(config: SpectralConfig)

Bases: Generic[A]

Spectral Learning algorithm for epsilon-machine inference.

Uses SVD of Hankel matrices to learn observable operator representations, then extracts the epsilon-machine from the learned model.

This is a polynomial-time algorithm that is statistically consistent (converges to the true model with enough data).

Reference

Hsu, D., Kakade, S.M., & Zhang, T. (2012). "A Spectral Algorithm for Learning Hidden Markov Models"

Examples:

>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.spectral import Spectral, SpectralConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(10000)(source))
>>>
>>> spectral = Spectral(SpectralConfig(max_history=5))
>>> result = spectral.infer(sequence)
>>> len(result.machine.states) >= 2  # Should find ~2 states
True

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer epsilon-machine from sequence using spectral methods.

Parameters:

sequence (Iterable[A]): The observed sequence of symbols. Required.
alphabet (frozenset[A] | None): The set of possible symbols (inferred if not provided). Default: None.

Returns:

InferenceResult[A]: InferenceResult containing the inferred machine and diagnostics.

Raises:

InsufficientDataError: If sequence is too short for inference.

Source code in src/emic/inference/spectral/algorithm.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """
    Infer epsilon-machine from sequence using spectral methods.

    Args:
        sequence: The observed sequence of symbols.
        alphabet: The set of possible symbols (inferred if not provided).

    Returns:
        InferenceResult containing the inferred machine and diagnostics.

    Raises:
        InsufficientDataError: If sequence is too short for inference.
    """
    symbols = list(sequence)
    n = len(symbols)

    if alphabet is None:
        alphabet = frozenset(symbols)

    alphabet_list = sorted(alphabet, key=str)

    # Get effective parameters (may be adaptive)
    max_history, min_count = self.config.get_effective_params(
        n_samples=n, alphabet_size=len(alphabet_list)
    )

    # Check minimum data requirement
    min_required = min_count * (2 * max_history + 1) * 2
    if n < min_required:
        raise InsufficientDataError(
            required=min_required,
            provided=n,
            algorithm="Spectral",
        )

    # Step 1: Build Hankel matrices
    hankel = build_hankel_matrices(
        symbols=symbols,
        alphabet=alphabet_list,
        max_history=max_history,
        min_count=min_count,
    )

    # Early exit for degenerate cases
    if hankel.is_empty:
        machine = build_trivial_machine(symbols, alphabet)
        return InferenceResult(
            machine=machine,
            sequence_length=n,
            max_history_used=max_history,
            num_histories_considered=0,
            converged=True,
            iterations=1,
        )

    # Step 2: Perform SVD and determine rank
    svd = compute_svd(
        hankel.H,
        rank=self.config.rank,
        rank_threshold=self.config.rank_threshold,
    )

    # Step 3: Extract observable operators
    op_result = extract_operators(
        H_x=hankel.H_x,
        svd=svd,
        alphabet=alphabet_list,
        regularization=self.config.regularization,
    )

    # Step 4: Convert to epsilon-machine
    # Pass SVD and H for proper b_1 and b_inf computation
    machine = build_machine_from_operators(
        operators=op_result.operators,
        alphabet=alphabet_list,
        symbols=symbols,
        svd=svd,
        H=hankel.H,
    )

    return InferenceResult(
        machine=machine,
        sequence_length=n,
        max_history_used=max_history,
        num_histories_considered=len(hankel.histories),
        converged=True,
        iterations=op_result.rank,
    )

__rrshift__

__rrshift__(source: Iterable[A]) -> InferenceResult[A]

Support: sequence >> Spectral(config).

Source code in src/emic/inference/spectral/algorithm.py
def __rrshift__(self, source: Iterable[A]) -> InferenceResult[A]:
    """Support: sequence >> Spectral(config)."""
    alphabet = getattr(source, "alphabet", None)
    return self.infer(source, alphabet=alphabet)

SpectralConfig dataclass

SpectralConfig(
    max_history: int | None = None,
    rank_threshold: float = 0.001,
    rank: int | None = None,
    regularization: float = 1e-06,
    min_count: int | None = None,
)

Configuration for the Spectral Learning algorithm.

Spectral learning uses SVD/eigendecomposition of Hankel matrices to learn observable operator representations, then extracts the epsilon-machine.

Attributes:

max_history (int | None): Maximum history length for Hankel matrix rows/columns. If None (default), automatically adapts based on sample size and alphabet size. Larger values capture longer-range correlations but require more data to estimate reliably.
rank_threshold (float): Relative singular value threshold for rank selection. Singular values below max_sv * threshold are discarded. Higher values are more conservative (keep fewer singular values), which can prevent overfitting but may miss structure; lower values keep more singular values but may include noise. If the inferred machine has too many states, try raising this; if it has too few, try lowering it or use an explicit rank.
rank (int | None): Fixed rank (overrides the threshold if set). Use this when you know the true number of hidden states.
regularization (float): Regularization for numerical stability.
min_count (int | None): Minimum observations for a history to be included. If None (default), automatically adapts based on sample size.

Examples:

>>> # Default configuration (adaptive)
>>> config = SpectralConfig()
>>> config.max_history is None  # Will be set adaptively
True
>>> # Fixed history length for large data
>>> config = SpectralConfig(max_history=10)
>>> # Fixed rank when you know the true model
>>> config = SpectralConfig(rank=2)

__post_init__

__post_init__() -> None

Validate configuration parameters.

Source code in src/emic/inference/spectral/config.py
def __post_init__(self) -> None:
    """Validate configuration parameters."""
    if self.max_history is not None and self.max_history < 1:
        msg = f"max_history must be >= 1 or None, got {self.max_history}"
        raise ValueError(msg)
    if self.rank_threshold <= 0 or self.rank_threshold >= 1:
        msg = f"rank_threshold must be in (0, 1), got {self.rank_threshold}"
        raise ValueError(msg)
    if self.rank is not None and self.rank < 1:
        msg = f"rank must be >= 1, got {self.rank}"
        raise ValueError(msg)
    if self.regularization < 0:
        msg = f"regularization must be >= 0, got {self.regularization}"
        raise ValueError(msg)
    if self.min_count is not None and self.min_count < 1:
        msg = f"min_count must be >= 1 or None, got {self.min_count}"
        raise ValueError(msg)

get_effective_params

get_effective_params(
    n_samples: int, alphabet_size: int
) -> tuple[int, int]

Get effective max_history and min_count for given data size.

The heuristic is based on ensuring the Hankel matrix is well-populated:

- We need approximately |Σ|^L unique histories of length L
- Each history needs min_count occurrences
- So we need n >= |Σ|^L * min_count * constant

Parameters:

n_samples (int): Number of samples in the sequence. Required.
alphabet_size (int): Size of the alphabet. Required.

Returns:

tuple[int, int]: Tuple of (effective_max_history, effective_min_count).

Source code in src/emic/inference/spectral/config.py
def get_effective_params(self, n_samples: int, alphabet_size: int) -> tuple[int, int]:
    """
    Get effective max_history and min_count for given data size.

    The heuristic is based on ensuring the Hankel matrix is well-populated:
    - We need approximately |Σ|^L unique histories of length L
    - Each history needs min_count occurrences
    - So we need n >= |Σ|^L * min_count * constant

    Args:
        n_samples: Number of samples in the sequence.
        alphabet_size: Size of the alphabet.

    Returns:
        Tuple of (effective_max_history, effective_min_count).
    """
    if self.max_history is not None and self.min_count is not None:
        return (self.max_history, self.min_count)

    # Adaptive heuristic based on sample size and alphabet.
    # We use L = floor(log(n/50) / log(|A|)), giving a history length
    # such that the expected count per cell is at least 50.
    safety_factor = 50

    if alphabet_size < 2:
        alphabet_size = 2  # Avoid log(1) = 0

    if self.max_history is not None:
        effective_max_history = self.max_history
    else:
        # L such that |Σ|^L * safety_factor <= n
        # L <= log(n / safety_factor) / log(|Σ|)
        max_L = math.log(max(1, n_samples / safety_factor)) / math.log(alphabet_size)
        effective_max_history = max(2, min(10, int(max_L)))

    if self.min_count is not None:
        effective_min_count = self.min_count
    else:
        # Adaptive min_count: higher for small data, lower for large
        # Scale roughly as sqrt(n) / 10, capped between 2 and 10
        effective_min_count = max(2, min(10, int(math.sqrt(n_samples) / 10)))

    return (effective_max_history, effective_min_count)
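
A small worked example of the adaptive heuristic, following the arithmetic in the source above; the expected values are derived from that code, not from a separate run.

from emic.inference.spectral import SpectralConfig

config = SpectralConfig()  # max_history=None, min_count=None -> adaptive
# With n = 10_000 binary symbols:
#   max_history = max(2, min(10, int(log(10_000 / 50) / log(2)))) = 7
#   min_count   = max(2, min(10, int(sqrt(10_000) / 10)))         = 10
max_history, min_count = config.get_effective_params(n_samples=10_000, alphabet_size=2)
print(max_history, min_count)  # expected: 7 10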

NSD dataclass

NSD(config: NSDConfig)

Bases: Generic[A]

Neural State Discovery for epsilon-machine learning.

NSD learns causal states by embedding histories into a vector space and clustering. This implementation uses a simplified approach without neural networks: it creates embeddings from next-symbol distributions and uses k-means clustering to discover states.

Reference

This is a simplified version inspired by representation learning approaches to computational mechanics.

Examples:

>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.nsd import NSD, NSDConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(5000)(source))
>>>
>>> nsd = NSD(NSDConfig(max_states=5))
>>> result = nsd.infer(sequence)
>>> len(result.machine.states) <= 5
True

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer an epsilon-machine from the sequence using NSD.

Parameters:

sequence (Iterable[A]): Input sequence of symbols. Required.
alphabet (frozenset[A] | None): Set of possible symbols. Inferred from data if None. Default: None.

Returns:

InferenceResult[A]: InferenceResult containing the epsilon-machine and discovery info.

Raises:

InsufficientDataError: If sequence is too short.

Source code in src/emic/inference/nsd/algorithm.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """
    Infer an epsilon-machine from the sequence using NSD.

    Args:
        sequence: Input sequence of symbols.
        alphabet: Set of possible symbols. Inferred from data if None.

    Returns:
        InferenceResult containing epsilon-machine and discovery info.

    Raises:
        InsufficientDataError: If sequence is too short.
    """
    symbols = list(sequence)
    n = len(symbols)

    min_required = self.config.history_length * 10
    if n < min_required:
        raise InsufficientDataError(
            required=min_required,
            provided=n,
            algorithm="NSD",
        )

    if alphabet is None:
        alphabet = frozenset(symbols)

    symbols_list = sorted(alphabet)  # type: ignore[type-var]
    n_symbols = len(symbols_list)
    sym_to_idx = {s: i for i, s in enumerate(symbols_list)}

    # Initialize RNG
    rng = random.Random(self.config.seed)

    # Build history embeddings based on predictive distributions
    histories, embeddings = self._build_embeddings(symbols, symbols_list, sym_to_idx, n_symbols)

    if len(histories) == 0:
        # Fallback for very short sequences
        machine = self._build_trivial_machine(alphabet, symbols_list)
        return InferenceResult(
            machine=machine,
            sequence_length=n,
            max_history_used=self.config.history_length,
            num_histories_considered=0,
        )

    # Find optimal number of states using gap statistic heuristic
    best_k = 1
    best_score = float("-inf")

    for k in range(1, min(self.config.max_states + 1, len(histories) + 1)):
        labels, _, inertia = self._kmeans(embeddings, k, rng)
        # Simplified model selection: prefer fewer states unless significantly better
        score = -inertia - 0.5 * k * math.log(len(histories))
        if score > best_score:
            best_score = score
            best_k = k

    # Run final clustering with best k
    labels, _, _ = self._kmeans(embeddings, best_k, rng)

    # Build machine from clustering
    machine = self._build_machine(
        symbols, histories, labels, symbols_list, sym_to_idx, best_k, alphabet
    )

    return InferenceResult(
        machine=machine,
        sequence_length=n,
        max_history_used=self.config.history_length,
        num_histories_considered=len(histories),
    )

__rrshift__

__rrshift__(source: Iterable[A]) -> InferenceResult[A]

Support: sequence >> NSD(config).

Source code in src/emic/inference/nsd/algorithm.py
def __rrshift__(self, source: Iterable[A]) -> InferenceResult[A]:
    """Support: sequence >> NSD(config)."""
    alphabet = getattr(source, "alphabet", None)
    return self.infer(source, alphabet=alphabet)

NSDConfig dataclass

NSDConfig(
    max_states: int = 10,
    history_length: int = 10,
    embedding_dim: int = 16,
    n_iterations: int = 100,
    convergence_threshold: float = 0.0001,
    seed: int | None = None,
)

Configuration for Neural State Discovery.

NSD learns state representations using a neural network approach. This is a simplified implementation that uses clustering on learned embeddings rather than full neural network training.

Attributes:

max_states (int): Maximum number of states to discover.
history_length (int): Length of the history window for state embedding.
embedding_dim (int): Dimensionality of state embeddings.
n_iterations (int): Number of clustering iterations.
convergence_threshold (float): Threshold for convergence detection.
seed (int | None): Random seed for reproducibility.

Examples:

>>> config = NSDConfig(max_states=5, history_length=10)
>>> config.history_length
10

__post_init__

__post_init__() -> None

Validate configuration parameters.

Source code in src/emic/inference/nsd/config.py
def __post_init__(self) -> None:
    """Validate configuration parameters."""
    if self.max_states < 1:
        msg = f"max_states must be >= 1, got {self.max_states}"
        raise ValueError(msg)
    if self.history_length < 1:
        msg = f"history_length must be >= 1, got {self.history_length}"
        raise ValueError(msg)
    if self.embedding_dim < 1:
        msg = f"embedding_dim must be >= 1, got {self.embedding_dim}"
        raise ValueError(msg)
    if self.n_iterations < 1:
        msg = f"n_iterations must be >= 1, got {self.n_iterations}"
        raise ValueError(msg)
    if self.convergence_threshold <= 0:
        msg = f"convergence_threshold must be > 0, got {self.convergence_threshold}"
        raise ValueError(msg)

InferenceAlgorithm

Bases: Protocol[A]

Protocol for epsilon-machine inference algorithms.

Any algorithm that can infer an epsilon-machine from a sequence of symbols satisfies this protocol.

Examples:

>>> class MyAlgorithm:
...     def infer(self, sequence, alphabet=None):
...         ...  # Return InferenceResult
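
A typing-only sketch showing that any of the algorithms above satisfies the protocol structurally; the module paths emic.inference.protocol and emic.inference.cssr are assumed from the source locations shown on this page.

from emic.inference.protocol import InferenceAlgorithm  # assumed path
from emic.inference.cssr import CSSR, CSSRConfig        # assumed path

algo: InferenceAlgorithm = CSSR(CSSRConfig(max_history=3))
result = algo.infer([0, 1, 1, 0, 1] * 200)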

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer an epsilon-machine from the given sequence.

Parameters:

sequence (Iterable[A]): The observed symbols. Required.
alphabet (frozenset[A] | None): Known alphabet (inferred from sequence if None). Default: None.

Returns:

InferenceResult[A]: InferenceResult containing the machine and diagnostics.

Source code in src/emic/inference/protocol.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """
    Infer an epsilon-machine from the given sequence.

    Args:
        sequence: The observed symbols
        alphabet: Known alphabet (inferred from sequence if None)

    Returns:
        InferenceResult containing the machine and diagnostics
    """
    ...

InferenceResult dataclass

InferenceResult(
    machine: EpsilonMachine[A],
    sequence_length: int,
    max_history_used: int,
    num_histories_considered: int,
    converged: bool = True,
    iterations: int | None = None,
    warnings: tuple[str, ...] = tuple(),
)

Bases: Generic[A]

The result of epsilon-machine inference.

Contains the inferred machine plus diagnostics and quality metrics.

Attributes:

machine (EpsilonMachine[A]): The inferred epsilon-machine.
sequence_length (int): Number of symbols in the input sequence.
max_history_used (int): Maximum history length considered.
num_histories_considered (int): Total number of histories analyzed.
converged (bool): Whether the algorithm converged.
iterations (int | None): Number of iterations (if applicable).
warnings (tuple[str, ...]): Any warnings generated during inference.

summary

summary() -> str

Return a human-readable summary.

Source code in src/emic/inference/result.py
def summary(self) -> str:
    """Return a human-readable summary."""
    return (
        f"Inferred ε-machine:\n"
        f"  States: {len(self.machine.states)}\n"
        f"  Alphabet size: {len(self.machine.alphabet)}\n"
        f"  Sequence length: {self.sequence_length}\n"
        f"  Max history: {self.max_history_used}\n"
        f"  Converged: {self.converged}\n"
    )

__rshift__

__rshift__(func: Callable[[EpsilonMachine[A]], T]) -> T

Pipeline operator for chaining analysis.

Allows: result >> analyze

Source code in src/emic/inference/result.py
def __rshift__(self, func: Callable[[EpsilonMachine[A]], T]) -> T:
    """
    Pipeline operator for chaining analysis.

    Allows: result >> analyze
    """
    return func(self.machine)
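
A short sketch of chaining, reusing a hypothetical result from any of the algorithms above; the operator simply applies the callable to result.machine.

def count_states(machine) -> int:  # illustrative analysis callable
    return len(machine.states)

n_states = result >> count_states  # equivalent to count_states(result.machine)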

InferenceError

Bases: Exception

Base class for inference errors.

explain

explain() -> str

Return a human-readable explanation.

Source code in src/emic/inference/errors.py
def explain(self) -> str:
    """Return a human-readable explanation."""
    return str(self)

InsufficientDataError

InsufficientDataError(
    required: int, provided: int, algorithm: str = "unknown"
)

Bases: InferenceError

Raised when sequence is too short for reliable inference.

Source code in src/emic/inference/errors.py
def __init__(
    self,
    required: int,
    provided: int,
    algorithm: str = "unknown",
) -> None:
    super().__init__(
        f"Insufficient data for {algorithm}: need {required} symbols, got {provided}"
    )
    self.required = required
    self.provided = provided
    self.algorithm = algorithm

explain

explain() -> str

Return a human-readable explanation.

Source code in src/emic/inference/errors.py
def explain(self) -> str:
    """Return a human-readable explanation."""
    return (
        f"The sequence you provided has {self.provided} symbols, "
        f"but the {self.algorithm} algorithm needs at least {self.required} "
        f"to produce reliable results. Try:\n"
        f"  • Using a longer sequence\n"
        f"  • Reducing max_history parameter\n"
        f"  • Reducing min_count parameter (less reliable)"
    )
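
A minimal handling sketch; the module paths are assumed from the source locations shown on this page.

from emic.inference.cssr import CSSR, CSSRConfig         # assumed path
from emic.inference.errors import InsufficientDataError  # assumed path

short_sequence = [0, 1] * 20  # 40 symbols: too short for max_history=8
try:
    CSSR(CSSRConfig(max_history=8)).infer(short_sequence)
except InsufficientDataError as err:
    print(err.explain())  # suggests a longer sequence or a smaller max_history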

NonConvergenceError

NonConvergenceError(iterations: int, tolerance: float)

Bases: InferenceError

Raised when algorithm fails to converge.

Source code in src/emic/inference/errors.py
def __init__(self, iterations: int, tolerance: float) -> None:
    super().__init__(f"Algorithm did not converge after {iterations} iterations")
    self.iterations = iterations
    self.tolerance = tolerance

explain

explain() -> str

Return a human-readable explanation.

Source code in src/emic/inference/errors.py
def explain(self) -> str:
    """Return a human-readable explanation."""
    return (
        f"The algorithm did not stabilize after {self.iterations} iterations. "
        f"This might indicate:\n"
        f"  • The process has complex structure requiring more iterations\n"
        f"  • The significance level ({self.tolerance}) is too sensitive\n"
        f"  • The data contains anomalies\n"
        f"Try increasing max_iterations or significance level."
    )