emic.inference

Epsilon-machine inference algorithms.

inference

Inference module for epsilon-machine reconstruction.

CSSR dataclass

CSSR(config: CSSRConfig)

Bases: Generic[A]

Causal State Splitting Reconstruction algorithm (corrected version).

This implementation follows the three-phase structure from the original paper:

Phase I (Initialization): Build suffix tree and initialize with all histories in one state.

Phase II (Sufficiency): For each history length L from 1 to L_max, and for each history h of length L, let s be the suffix of h of length L-1; if P(next | h) differs significantly from P(next | state(s)), create a new state or reassign h.

Phase III (Recursion/Determinism): Ensure transitions are deterministic by merging states that would receive the same history under extension.

Reference

Shalizi, C.R. & Klinkner, K.L. (2004). "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences"
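
A minimal usage sketch in the style of the other algorithms' examples on this page; the import path emic.inference.cssr is assumed by analogy with emic.inference.csm and is not confirmed here.

from emic.sources.synthetic.golden_mean import GoldenMeanSource
from emic.sources.transforms.take import TakeN
from emic.inference.cssr import CSSR, CSSRConfig  # assumed module path

source = GoldenMeanSource(p=0.5, _seed=42)
sequence = list(TakeN(10000)(source))

# The Golden Mean process has two causal states; CSSR should recover them
# with a modest history length and a strict significance level.
cssr = CSSR(CSSRConfig(max_history=5, significance=0.01))
result = cssr.infer(sequence)
print(result.summary())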

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer epsilon-machine from sequence.

Source code in src/emic/inference/cssr/algorithm.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """Infer epsilon-machine from sequence."""
    symbols = list(sequence)
    n = len(symbols)

    min_required = self.config.min_count * (self.config.max_history + 1) * 2
    if n < min_required:
        raise InsufficientDataError(
            required=min_required,
            provided=n,
            algorithm="CSSR",
        )

    if alphabet is None:
        alphabet = frozenset(symbols)

    # Phase I: Build suffix tree
    suffix_tree: SuffixTree[A] = SuffixTree(
        max_depth=self.config.max_history, alphabet=alphabet
    )
    suffix_tree.build_from_sequence(symbols)

    # Phase II: Level-by-level sufficiency testing
    partition = self._sufficiency_phase(suffix_tree, alphabet)

    # Phase III: Ensure determinism (merge equivalent states)
    partition = self._determinism_phase(partition, suffix_tree, alphabet)

    # Build machine
    machine = self._build_machine(partition, suffix_tree, alphabet)

    return InferenceResult(
        machine=machine,
        sequence_length=n,
        max_history_used=self.config.max_history,
        num_histories_considered=len(suffix_tree),
        converged=True,
        iterations=self.config.max_history,
    )

__rrshift__

__rrshift__(source: Iterable[A]) -> InferenceResult[A]

Support: sequence >> CSSR(config).

Source code in src/emic/inference/cssr/algorithm.py
def __rrshift__(self, source: Iterable[A]) -> InferenceResult[A]:
    """Support: sequence >> CSSR(config)."""
    alphabet = getattr(source, "alphabet", None)
    return self.infer(source, alphabet=alphabet)
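
A short sketch of the pipeline form, reusing the hypothetical sequence from the example above; the operator delegates to infer, picking up an alphabet attribute from the source when one is present.

# Equivalent to CSSR(CSSRConfig(max_history=5)).infer(sequence)
result = sequence >> CSSR(CSSRConfig(max_history=5))
print(len(result.machine.states))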

CSSRConfig dataclass

CSSRConfig(
    max_history: int,
    significance: float = 0.05,
    min_count: int = 5,
    test: str = "chi2",
    max_iterations: int = 1000,
    post_merge: bool = True,
    merge_significance: float | None = None,
)

Configuration for the CSSR algorithm.

Attributes:

max_history (int): Maximum history length L to consider.
significance (float): Significance level for statistical tests (lower = fewer states).
min_count (int): Minimum observations for a history to be considered.
test (str): Statistical test type ("chi2", "ks", "g", or "proportion").
max_iterations (int): Maximum iterations for convergence.
post_merge (bool): Enable post-convergence state merging for minimality.
merge_significance (float | None): Significance level for post-merge (None = use significance).

Examples:

>>> config = CSSRConfig(max_history=5, significance=0.01)
>>> config.max_history
5
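
A sketch of a non-default configuration using only the attributes documented above; the specific values are illustrative, not recommendations.

config = CSSRConfig(
    max_history=6,
    significance=0.005,  # stricter test, fewer spurious states
    test="g",            # G-test instead of the default chi-squared
    min_count=10,        # require more evidence per history
    post_merge=True,     # merge equivalent states after convergence
)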

__post_init__

__post_init__() -> None

Validate configuration parameters.

Source code in src/emic/inference/cssr/config.py
def __post_init__(self) -> None:
    """Validate configuration parameters."""
    if self.max_history < 1:
        msg = f"max_history must be >= 1, got {self.max_history}"
        raise ValueError(msg)
    if not (0 < self.significance < 1):
        msg = f"significance must be in (0, 1), got {self.significance}"
        raise ValueError(msg)
    if self.min_count < 1:
        msg = f"min_count must be >= 1, got {self.min_count}"
        raise ValueError(msg)
    if self.test not in ("chi2", "ks", "g", "proportion"):
        msg = f"test must be 'chi2', 'ks', 'g', or 'proportion', got {self.test}"
        raise ValueError(msg)
    if self.merge_significance is not None and not (0 < self.merge_significance < 1):
        msg = f"merge_significance must be in (0, 1), got {self.merge_significance}"
        raise ValueError(msg)

CSM dataclass

CSM(config: CSMConfig)

Bases: Generic[A]

Causal State Merging algorithm.

Infers an epsilon-machine from an observed sequence by:

1. Creating a state for each unique history of the specified length
2. Computing pairwise distances between state predictive distributions
3. Iteratively merging the closest pair until the threshold is exceeded

This is a bottom-up approach that complements CSSR's top-down splitting.

Reference

This implementation follows the state-merging approach described in computational mechanics literature as an alternative to CSSR.

Examples:

>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.csm import CSM, CSMConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(10000)(source))
>>>
>>> csm = CSM(CSMConfig(history_length=5))
>>> result = csm.infer(sequence)
>>> len(result.machine.states)
2

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer epsilon-machine from sequence.

Parameters:

sequence (Iterable[A]): The observed sequence of symbols. Required.
alphabet (frozenset[A] | None): The set of possible symbols (inferred if not provided). Default: None.

Returns:

InferenceResult[A]: InferenceResult containing the inferred machine and diagnostics.

Raises:

InsufficientDataError: If sequence is too short for inference.

Source code in src/emic/inference/csm/algorithm.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """
    Infer epsilon-machine from sequence.

    Args:
        sequence: The observed sequence of symbols.
        alphabet: The set of possible symbols (inferred if not provided).

    Returns:
        InferenceResult containing the inferred machine and diagnostics.

    Raises:
        InsufficientDataError: If sequence is too short for inference.
    """
    # Convert to list for multiple passes
    symbols = list(sequence)
    n = len(symbols)

    # Check minimum data requirement
    min_required = self.config.min_count * (self.config.history_length + 1) * 2
    if n < min_required:
        raise InsufficientDataError(
            required=min_required,
            provided=n,
            algorithm="CSM",
        )

    # Infer alphabet if not provided
    if alphabet is None:
        alphabet = frozenset(symbols)

    # Collect history statistics
    history_stats = self._collect_history_stats(symbols)

    # Initialize partition: each valid history is its own state
    partition, state_to_histories = self._initialize_partition(history_stats)

    if not partition:
        # No valid histories, create trivial machine
        return self._build_trivial_result(alphabet, n)

    # Compute initial distances between all state pairs
    distances = self._compute_all_distances(partition, state_to_histories, history_stats)

    # Merge history for hierarchical option
    merge_history: list[MergeInfo] = []

    # Iteratively merge closest pairs
    num_merges = 0
    while len(partition) > 1:
        # Find closest pair
        min_dist = float("inf")
        best_pair: tuple[str, str] | None = None

        for (s1, s2), dist in distances.items():
            if dist < min_dist:
                min_dist = dist
                best_pair = (s1, s2)

        if best_pair is None or min_dist > self.config.merge_threshold:
            break

        # Merge the pair
        s1, s2 = best_pair
        new_state = self._merge_states(s1, s2, partition, state_to_histories, history_stats)
        num_merges += 1

        if self.config.hierarchical:
            merge_history.append(MergeInfo(s1, s2, new_state, min_dist))

        # Update distances
        distances = self._update_distances_after_merge(
            s1, s2, new_state, partition, state_to_histories, history_stats, distances
        )

    # Build machine from final partition
    machine = self._build_machine(partition, state_to_histories, history_stats, alphabet)

    return InferenceResult(
        machine=machine,
        sequence_length=n,
        max_history_used=self.config.history_length,
        num_histories_considered=len(history_stats),
        converged=True,
        iterations=num_merges,
    )

__rrshift__

__rrshift__(source: Iterable[A]) -> InferenceResult[A]

Support: sequence >> CSM(config).

Source code in src/emic/inference/csm/algorithm.py
def __rrshift__(self, source: Iterable[A]) -> InferenceResult[A]:
    """Support: sequence >> CSM(config)."""
    alphabet = getattr(source, "alphabet", None)
    return self.infer(source, alphabet=alphabet)

CSMConfig dataclass

CSMConfig(
    history_length: int,
    merge_threshold: float = 0.05,
    distance_metric: DistanceMetric = "kl",
    min_count: int = 5,
    hierarchical: bool = False,
)

Configuration for the Causal State Merging (CSM) algorithm.

CSM is a bottom-up approach to epsilon-machine inference. Unlike CSSR, which splits states, CSM starts with the finest possible partition (each unique history of length L is a state) and merges states based on predictive equivalence.

Attributes:

history_length (int): Fixed history length for the initial partition. All unique histories of this length become initial states.
merge_threshold (float): Distance threshold for merging. Two states are merged if their predictive distributions are closer than this threshold. Lower values = more states, higher values = fewer states.
distance_metric (DistanceMetric): Metric for comparing predictive distributions:
    - 'kl': Kullback-Leibler divergence (asymmetric)
    - 'hellinger': Hellinger distance (symmetric)
    - 'tv': Total variation distance (symmetric)
    - 'chi2': Chi-squared distance (symmetric)
min_count (int): Minimum observations required for a history to be considered.
hierarchical (bool): If True, return the full merge hierarchy (dendrogram). Useful for visualization and for choosing the merge threshold.

Examples:

>>> config = CSMConfig(history_length=5, merge_threshold=0.05)
>>> config.history_length
5
>>> config.distance_metric
'kl'
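
A sketch of a non-default configuration using only the attributes documented above; values are illustrative.

config = CSMConfig(
    history_length=4,
    merge_threshold=0.02,         # stricter merging keeps more states
    distance_metric="hellinger",  # symmetric alternative to KL
    hierarchical=True,            # keep the full merge dendrogram
)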

__post_init__

__post_init__() -> None

Validate configuration parameters.

Source code in src/emic/inference/csm/config.py
def __post_init__(self) -> None:
    """Validate configuration parameters."""
    if self.history_length < 1:
        msg = f"history_length must be >= 1, got {self.history_length}"
        raise ValueError(msg)
    if self.merge_threshold <= 0:
        msg = f"merge_threshold must be > 0, got {self.merge_threshold}"
        raise ValueError(msg)
    if self.distance_metric not in ("kl", "hellinger", "tv", "chi2"):
        msg = f"distance_metric must be 'kl', 'hellinger', 'tv', or 'chi2', got {self.distance_metric}"
        raise ValueError(msg)
    if self.min_count < 1:
        msg = f"min_count must be >= 1, got {self.min_count}"
        raise ValueError(msg)

BSI dataclass

BSI(config: BSIConfig)

Bases: Generic[A]

Bayesian Structural Inference for epsilon-machine learning.

BSI uses Bayesian inference with Gibbs sampling to learn epsilon-machines. It provides uncertainty quantification and automatic model selection via marginal likelihood (evidence).

The algorithm places Dirichlet priors on transition probabilities and uses Gibbs sampling to explore the posterior over state assignments.

Implements the InferenceAlgorithm protocol.

Reference

Strelioff, C.C. & Crutchfield, J.P. (2014). "Bayesian Structural Inference for Hidden Processes"

Examples:

>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.bsi import BSI, BSIConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(5000)(source))
>>>
>>> bsi = BSI(BSIConfig(max_states=5, n_samples=100))
>>> result = bsi.infer(sequence)
>>> len(result.machine.states) <= 5
True

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer an epsilon-machine from the sequence using BSI.

Parameters:

sequence (Iterable[A]): Input sequence of symbols. Required.
alphabet (frozenset[A] | None): Set of possible symbols. Inferred from data if None. Default: None.

Returns:

InferenceResult[A]: InferenceResult containing the epsilon-machine and diagnostics.

Raises:

InsufficientDataError: If sequence is too short.

Source code in src/emic/inference/bsi/algorithm.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """
    Infer an epsilon-machine from the sequence using BSI.

    Args:
        sequence: Input sequence of symbols.
        alphabet: Set of possible symbols. Inferred from data if None.

    Returns:
        InferenceResult containing epsilon-machine and diagnostics.

    Raises:
        InsufficientDataError: If sequence is too short.
    """
    symbols = list(sequence)
    n = len(symbols)

    min_required = self.config.max_history * 10
    if n < min_required:
        raise InsufficientDataError(
            required=min_required,
            provided=n,
            algorithm="BSI",
        )

    if alphabet is None:
        alphabet = frozenset(symbols)

    # Initialize RNG
    rng = random.Random(self.config.seed)

    # Build suffix tree for counting
    suffix_counts = self._build_suffix_counts(symbols)

    # Try different numbers of states, compute evidence
    best_machine: EpsilonMachine[A] | None = None
    best_evidence = float("-inf")

    for n_states in range(1, self.config.max_states + 1):
        machine, evidence = self._infer_with_n_states(
            symbols, alphabet, suffix_counts, n_states, rng
        )
        if evidence > best_evidence:
            best_evidence = evidence
            best_machine = machine

    assert best_machine is not None

    return InferenceResult(
        machine=best_machine,
        sequence_length=n,
        max_history_used=self.config.max_history,
        num_histories_considered=len(suffix_counts),
    )

__rrshift__

__rrshift__(source: Iterable[A]) -> InferenceResult[A]

Support: sequence >> BSI(config).

Source code in src/emic/inference/bsi/algorithm.py
def __rrshift__(self, source: Iterable[A]) -> InferenceResult[A]:
    """Support: sequence >> BSI(config)."""
    alphabet = getattr(source, "alphabet", None)
    return self.infer(source, alphabet=alphabet)

BSIConfig dataclass

BSIConfig(
    max_states: int = 10,
    max_history: int = 5,
    alpha_prior: float = 1.0,
    n_samples: int = 1000,
    burnin: int = 200,
    thin: int = 1,
    seed: int | None = None,
)

Configuration for Bayesian Structural Inference.

BSI uses Bayesian inference to learn epsilon-machines with natural uncertainty quantification. It can automatically select the number of states via model evidence.

Attributes:

max_states (int): Maximum number of states to consider.
max_history (int): Maximum history length for likelihood computation.
alpha_prior (float): Dirichlet concentration for transition priors. Higher values = more uniform priors.
n_samples (int): Number of MCMC samples to draw.
burnin (int): Number of burn-in samples to discard.
thin (int): Thinning factor (keep every n-th sample).
seed (int | None): Random seed for reproducibility.

Examples:

>>> config = BSIConfig(max_states=5, n_samples=500)
>>> config.max_states
5
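
A sketch of a configuration that tightens the MCMC settings documented above; values are illustrative.

config = BSIConfig(
    max_states=4,
    n_samples=2000,  # more posterior samples
    burnin=500,      # discard early, unconverged samples
    thin=2,          # keep every second sample to reduce autocorrelation
    seed=0,          # reproducible runs
)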

__post_init__

__post_init__() -> None

Validate configuration parameters.

Source code in src/emic/inference/bsi/config.py
def __post_init__(self) -> None:
    """Validate configuration parameters."""
    if self.max_states < 1:
        msg = f"max_states must be >= 1, got {self.max_states}"
        raise ValueError(msg)
    if self.max_history < 1:
        msg = f"max_history must be >= 1, got {self.max_history}"
        raise ValueError(msg)
    if self.alpha_prior <= 0:
        msg = f"alpha_prior must be > 0, got {self.alpha_prior}"
        raise ValueError(msg)
    if self.n_samples < 1:
        msg = f"n_samples must be >= 1, got {self.n_samples}"
        raise ValueError(msg)
    if self.burnin < 0:
        msg = f"burnin must be >= 0, got {self.burnin}"
        raise ValueError(msg)
    if self.thin < 1:
        msg = f"thin must be >= 1, got {self.thin}"
        raise ValueError(msg)

Spectral dataclass

Spectral(config: SpectralConfig)

Bases: Generic[A]

Spectral Learning algorithm for epsilon-machine inference.

Uses SVD of Hankel matrices to learn observable operator representations, then extracts the epsilon-machine from the learned model.

This is a polynomial-time algorithm that is statistically consistent (converges to the true model with enough data).

Reference

Hsu, D., Kakade, S.M., & Zhang, T. (2012). "A Spectral Algorithm for Learning Hidden Markov Models"

Examples:

>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.spectral import Spectral, SpectralConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(10000)(source))
>>>
>>> spectral = Spectral(SpectralConfig(max_history=5))
>>> result = spectral.infer(sequence)
>>> len(result.machine.states) >= 2  # Should find ~2 states
True

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer epsilon-machine from sequence using spectral methods.

Parameters:

sequence (Iterable[A]): The observed sequence of symbols. Required.
alphabet (frozenset[A] | None): The set of possible symbols (inferred if not provided). Default: None.

Returns:

InferenceResult[A]: InferenceResult containing the inferred machine and diagnostics.

Raises:

InsufficientDataError: If sequence is too short for inference.

Source code in src/emic/inference/spectral/algorithm.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """
    Infer epsilon-machine from sequence using spectral methods.

    Args:
        sequence: The observed sequence of symbols.
        alphabet: The set of possible symbols (inferred if not provided).

    Returns:
        InferenceResult containing the inferred machine and diagnostics.

    Raises:
        InsufficientDataError: If sequence is too short for inference.
    """
    symbols = list(sequence)
    n = len(symbols)

    if alphabet is None:
        alphabet = frozenset(symbols)

    alphabet_list = sorted(alphabet, key=str)

    # Get effective parameters (may be adaptive)
    max_history, min_count = self.config.get_effective_params(
        n_samples=n, alphabet_size=len(alphabet_list)
    )

    # Check minimum data requirement
    min_required = min_count * (2 * max_history + 1) * 2
    if n < min_required:
        raise InsufficientDataError(
            required=min_required,
            provided=n,
            algorithm="Spectral",
        )

    # Step 1: Build Hankel matrices
    hankel = build_hankel_matrices(
        symbols=symbols,
        alphabet=alphabet_list,
        max_history=max_history,
        min_count=min_count,
    )

    # Early exit for degenerate cases
    if hankel.is_empty:
        machine = build_trivial_machine(symbols, alphabet)
        return InferenceResult(
            machine=machine,
            sequence_length=n,
            max_history_used=max_history,
            num_histories_considered=0,
            converged=True,
            iterations=1,
        )

    # Step 2: Perform SVD and determine rank
    svd = compute_svd(
        hankel.H,
        rank=self.config.rank,
        rank_threshold=self.config.rank_threshold,
    )

    # Step 3: Extract observable operators
    op_result = extract_operators(
        H_x=hankel.H_x,
        svd=svd,
        alphabet=alphabet_list,
        regularization=self.config.regularization,
    )

    # Step 4: Convert to epsilon-machine
    # Pass SVD and H for proper b_1 and b_inf computation
    machine = build_machine_from_operators(
        operators=op_result.operators,
        alphabet=alphabet_list,
        symbols=symbols,
        svd=svd,
        H=hankel.H,
    )

    return InferenceResult(
        machine=machine,
        sequence_length=n,
        max_history_used=max_history,
        num_histories_considered=len(hankel.histories),
        converged=True,
        iterations=op_result.rank,
    )

__rrshift__

__rrshift__(source: Iterable[A]) -> InferenceResult[A]

Support: sequence >> Spectral(config).

Source code in src/emic/inference/spectral/algorithm.py
def __rrshift__(self, source: Iterable[A]) -> InferenceResult[A]:
    """Support: sequence >> Spectral(config)."""
    alphabet = getattr(source, "alphabet", None)
    return self.infer(source, alphabet=alphabet)

SpectralConfig dataclass

SpectralConfig(
    max_history: int | None = None,
    rank_threshold: float = 0.001,
    rank: int | None = None,
    regularization: float = 1e-06,
    min_count: int | None = None,
)

Configuration for the Spectral Learning algorithm.

Spectral learning uses SVD/eigendecomposition of Hankel matrices to learn observable operator representations, then extracts the epsilon-machine.

Attributes:

max_history (int | None): Maximum history length for Hankel matrix rows/columns. If None (default), automatically adapts based on sample size and alphabet size. Larger values capture longer-range correlations but require more data to estimate reliably.
rank_threshold (float): Relative singular value threshold for rank selection. Singular values below max_sv * threshold are discarded. Higher values are more conservative (keep fewer singular values), which can prevent overfitting but may miss structure; lower values keep more singular values but may include noise. If the inferred machine has too many states, try raising this; if it has too few, try lowering it or use an explicit rank.
rank (int | None): Fixed rank (overrides the threshold if set). Use this when you know the true number of hidden states.
regularization (float): Regularization for numerical stability.
min_count (int | None): Minimum observations for a history to be included. If None (default), automatically adapts based on sample size.

Examples:

>>> # Default configuration (adaptive)
>>> config = SpectralConfig()
>>> config.max_history is None  # Will be set adaptively
True
>>> # Fixed history length for large data
>>> config = SpectralConfig(max_history=10)
>>> # Fixed rank when you know the true model
>>> config = SpectralConfig(rank=2)

__post_init__

__post_init__() -> None

Validate configuration parameters.

Source code in src/emic/inference/spectral/config.py
def __post_init__(self) -> None:
    """Validate configuration parameters."""
    if self.max_history is not None and self.max_history < 1:
        msg = f"max_history must be >= 1 or None, got {self.max_history}"
        raise ValueError(msg)
    if self.rank_threshold <= 0 or self.rank_threshold >= 1:
        msg = f"rank_threshold must be in (0, 1), got {self.rank_threshold}"
        raise ValueError(msg)
    if self.rank is not None and self.rank < 1:
        msg = f"rank must be >= 1, got {self.rank}"
        raise ValueError(msg)
    if self.regularization < 0:
        msg = f"regularization must be >= 0, got {self.regularization}"
        raise ValueError(msg)
    if self.min_count is not None and self.min_count < 1:
        msg = f"min_count must be >= 1 or None, got {self.min_count}"
        raise ValueError(msg)

get_effective_params

get_effective_params(
    n_samples: int, alphabet_size: int
) -> tuple[int, int]

Get effective max_history and min_count for given data size.

The heuristic is based on ensuring the Hankel matrix is well-populated:

- We need approximately |Σ|^L unique histories of length L
- Each history needs min_count occurrences
- So we need n >= |Σ|^L * min_count * constant

Parameters:

n_samples (int): Number of samples in the sequence. Required.
alphabet_size (int): Size of the alphabet. Required.

Returns:

tuple[int, int]: Tuple of (effective_max_history, effective_min_count).

Source code in src/emic/inference/spectral/config.py
def get_effective_params(self, n_samples: int, alphabet_size: int) -> tuple[int, int]:
    """
    Get effective max_history and min_count for given data size.

    The heuristic is based on ensuring the Hankel matrix is well-populated:
    - We need approximately |Σ|^L unique histories of length L
    - Each history needs min_count occurrences
    - So we need n >= |Σ|^L * min_count * constant

    Args:
        n_samples: Number of samples in the sequence.
        alphabet_size: Size of the alphabet.

    Returns:
        Tuple of (effective_max_history, effective_min_count).
    """
    if self.max_history is not None and self.min_count is not None:
        return (self.max_history, self.min_count)

    # Adaptive heuristic based on sample size and alphabet.
    # We use L = floor(log(n/50) / log(|A|)), giving a history length
    # such that the expected count per cell is at least 50.
    safety_factor = 50

    if alphabet_size < 2:
        alphabet_size = 2  # Avoid log(1) = 0

    if self.max_history is not None:
        effective_max_history = self.max_history
    else:
        # L such that |Σ|^L * safety_factor <= n
        # L <= log(n / safety_factor) / log(|Σ|)
        max_L = math.log(max(1, n_samples / safety_factor)) / math.log(alphabet_size)
        effective_max_history = max(2, min(10, int(max_L)))

    if self.min_count is not None:
        effective_min_count = self.min_count
    else:
        # Adaptive min_count: higher for small data, lower for large
        # Scale roughly as sqrt(n) / 10, capped between 2 and 10
        effective_min_count = max(2, min(10, int(math.sqrt(n_samples) / 10)))

    return (effective_max_history, effective_min_count)
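
A small worked example of the adaptive heuristic, following the arithmetic in the source above; the expected values are derived from that code, not from a separate run.

from emic.inference.spectral import SpectralConfig

config = SpectralConfig()  # max_history=None, min_count=None -> adaptive
# With n = 10_000 binary symbols:
#   max_history = max(2, min(10, int(log(10_000 / 50) / log(2)))) = 7
#   min_count   = max(2, min(10, int(sqrt(10_000) / 10)))         = 10
max_history, min_count = config.get_effective_params(n_samples=10_000, alphabet_size=2)
print(max_history, min_count)  # expected: 7 10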

NSD dataclass

NSD(config: NSDConfig)

Bases: Generic[A]

Neural State Discovery for epsilon-machine learning.

NSD learns causal states by embedding histories into a vector space and clustering. This implementation uses a simplified approach without neural networks: it creates embeddings from next-symbol distributions and uses k-means clustering to discover states.

Reference

This is a simplified version inspired by representation learning approaches to computational mechanics.

Examples:

>>> from emic.sources.synthetic.golden_mean import GoldenMeanSource
>>> from emic.sources.transforms.take import TakeN
>>> from emic.inference.nsd import NSD, NSDConfig
>>>
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> sequence = list(TakeN(5000)(source))
>>>
>>> nsd = NSD(NSDConfig(max_states=5))
>>> result = nsd.infer(sequence)
>>> len(result.machine.states) <= 5
True

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer an epsilon-machine from the sequence using NSD.

Parameters:

sequence (Iterable[A]): Input sequence of symbols. Required.
alphabet (frozenset[A] | None): Set of possible symbols. Inferred from data if None. Default: None.

Returns:

InferenceResult[A]: InferenceResult containing the epsilon-machine and discovery info.

Raises:

InsufficientDataError: If sequence is too short.

Source code in src/emic/inference/nsd/algorithm.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """
    Infer an epsilon-machine from the sequence using NSD.

    Args:
        sequence: Input sequence of symbols.
        alphabet: Set of possible symbols. Inferred from data if None.

    Returns:
        InferenceResult containing epsilon-machine and discovery info.

    Raises:
        InsufficientDataError: If sequence is too short.
    """
    symbols = list(sequence)
    n = len(symbols)

    min_required = self.config.history_length * 10
    if n < min_required:
        raise InsufficientDataError(
            required=min_required,
            provided=n,
            algorithm="NSD",
        )

    if alphabet is None:
        alphabet = frozenset(symbols)

    symbols_list = sorted(alphabet)  # type: ignore[type-var]
    n_symbols = len(symbols_list)
    sym_to_idx = {s: i for i, s in enumerate(symbols_list)}

    # Initialize RNG
    rng = random.Random(self.config.seed)

    # Build history embeddings based on predictive distributions
    histories, embeddings = self._build_embeddings(symbols, symbols_list, sym_to_idx, n_symbols)

    if len(histories) == 0:
        # Fallback for very short sequences
        machine = self._build_trivial_machine(alphabet, symbols_list)
        return InferenceResult(
            machine=machine,
            sequence_length=n,
            max_history_used=self.config.history_length,
            num_histories_considered=0,
        )

    # Find optimal number of states using gap statistic heuristic
    best_k = 1
    best_score = float("-inf")

    for k in range(1, min(self.config.max_states + 1, len(histories) + 1)):
        labels, _, inertia = self._kmeans(embeddings, k, rng)
        # Simplified model selection: prefer fewer states unless significantly better
        score = -inertia - 0.5 * k * math.log(len(histories))
        if score > best_score:
            best_score = score
            best_k = k

    # Run final clustering with best k
    labels, _, _ = self._kmeans(embeddings, best_k, rng)

    # Build machine from clustering
    machine = self._build_machine(
        symbols, histories, labels, symbols_list, sym_to_idx, best_k, alphabet
    )

    return InferenceResult(
        machine=machine,
        sequence_length=n,
        max_history_used=self.config.history_length,
        num_histories_considered=len(histories),
    )

__rrshift__

__rrshift__(source: Iterable[A]) -> InferenceResult[A]

Support: sequence >> NSD(config).

Source code in src/emic/inference/nsd/algorithm.py
def __rrshift__(self, source: Iterable[A]) -> InferenceResult[A]:
    """Support: sequence >> NSD(config)."""
    alphabet = getattr(source, "alphabet", None)
    return self.infer(source, alphabet=alphabet)

NSDConfig dataclass

NSDConfig(
    max_states: int = 10,
    history_length: int = 10,
    embedding_dim: int = 16,
    n_iterations: int = 100,
    convergence_threshold: float = 0.0001,
    seed: int | None = None,
)

Configuration for Neural State Discovery.

NSD learns state representations using a neural network approach. This is a simplified implementation that uses clustering on learned embeddings rather than full neural network training.

Attributes:

max_states (int): Maximum number of states to discover.
history_length (int): Length of the history window for state embedding.
embedding_dim (int): Dimensionality of state embeddings.
n_iterations (int): Number of clustering iterations.
convergence_threshold (float): Threshold for convergence detection.
seed (int | None): Random seed for reproducibility.

Examples:

>>> config = NSDConfig(max_states=5, history_length=10)
>>> config.history_length
10

__post_init__

__post_init__() -> None

Validate configuration parameters.

Source code in src/emic/inference/nsd/config.py
def __post_init__(self) -> None:
    """Validate configuration parameters."""
    if self.max_states < 1:
        msg = f"max_states must be >= 1, got {self.max_states}"
        raise ValueError(msg)
    if self.history_length < 1:
        msg = f"history_length must be >= 1, got {self.history_length}"
        raise ValueError(msg)
    if self.embedding_dim < 1:
        msg = f"embedding_dim must be >= 1, got {self.embedding_dim}"
        raise ValueError(msg)
    if self.n_iterations < 1:
        msg = f"n_iterations must be >= 1, got {self.n_iterations}"
        raise ValueError(msg)
    if self.convergence_threshold <= 0:
        msg = f"convergence_threshold must be > 0, got {self.convergence_threshold}"
        raise ValueError(msg)

InferenceAlgorithm

Bases: Protocol[A]

Protocol for epsilon-machine inference algorithms.

Any algorithm that can infer an epsilon-machine from a sequence of symbols satisfies this protocol.

Examples:

>>> class MyAlgorithm:
...     def infer(self, sequence, alphabet=None):
...         ...  # Return InferenceResult
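
A typing-only sketch showing that any of the algorithms above satisfies the protocol structurally; the module paths emic.inference.protocol and emic.inference.cssr are assumed from the source locations shown on this page.

from emic.inference.protocol import InferenceAlgorithm  # assumed path
from emic.inference.cssr import CSSR, CSSRConfig        # assumed path

algo: InferenceAlgorithm = CSSR(CSSRConfig(max_history=3))
result = algo.infer([0, 1, 1, 0, 1] * 200)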

infer

infer(
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]

Infer an epsilon-machine from the given sequence.

Parameters:

sequence (Iterable[A]): The observed symbols. Required.
alphabet (frozenset[A] | None): Known alphabet (inferred from sequence if None). Default: None.

Returns:

InferenceResult[A]: InferenceResult containing the machine and diagnostics.

Source code in src/emic/inference/protocol.py
def infer(
    self,
    sequence: Iterable[A],
    alphabet: frozenset[A] | None = None,
) -> InferenceResult[A]:
    """
    Infer an epsilon-machine from the given sequence.

    Args:
        sequence: The observed symbols
        alphabet: Known alphabet (inferred from sequence if None)

    Returns:
        InferenceResult containing the machine and diagnostics
    """
    ...

InferenceResult dataclass

InferenceResult(
    machine: EpsilonMachine[A],
    sequence_length: int,
    max_history_used: int,
    num_histories_considered: int,
    converged: bool = True,
    iterations: int | None = None,
    warnings: tuple[str, ...] = tuple(),
)

Bases: Generic[A]

The result of epsilon-machine inference.

Contains the inferred machine plus diagnostics and quality metrics.

Attributes:

machine (EpsilonMachine[A]): The inferred epsilon-machine.
sequence_length (int): Number of symbols in the input sequence.
max_history_used (int): Maximum history length considered.
num_histories_considered (int): Total number of histories analyzed.
converged (bool): Whether the algorithm converged.
iterations (int | None): Number of iterations (if applicable).
warnings (tuple[str, ...]): Any warnings generated during inference.

summary

summary() -> str

Return a human-readable summary.

Source code in src/emic/inference/result.py
def summary(self) -> str:
    """Return a human-readable summary."""
    return (
        f"Inferred ε-machine:\n"
        f"  States: {len(self.machine.states)}\n"
        f"  Alphabet size: {len(self.machine.alphabet)}\n"
        f"  Sequence length: {self.sequence_length}\n"
        f"  Max history: {self.max_history_used}\n"
        f"  Converged: {self.converged}\n"
    )

__rshift__

__rshift__(func: Callable[[EpsilonMachine[A]], T]) -> T

Pipeline operator for chaining analysis.

Allows: result >> analyze

Source code in src/emic/inference/result.py
def __rshift__(self, func: Callable[[EpsilonMachine[A]], T]) -> T:
    """
    Pipeline operator for chaining analysis.

    Allows: result >> analyze
    """
    return func(self.machine)
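
A short sketch of chaining, reusing a hypothetical result from any of the algorithms above; the operator simply applies the callable to result.machine.

def count_states(machine) -> int:  # illustrative analysis callable
    return len(machine.states)

n_states = result >> count_states  # equivalent to count_states(result.machine)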

InferenceError

Bases: Exception

Base class for inference errors.

explain

explain() -> str

Return a human-readable explanation.

Source code in src/emic/inference/errors.py
def explain(self) -> str:
    """Return a human-readable explanation."""
    return str(self)

InsufficientDataError

InsufficientDataError(
    required: int, provided: int, algorithm: str = "unknown"
)

Bases: InferenceError

Raised when sequence is too short for reliable inference.

Source code in src/emic/inference/errors.py
def __init__(
    self,
    required: int,
    provided: int,
    algorithm: str = "unknown",
) -> None:
    super().__init__(
        f"Insufficient data for {algorithm}: need {required} symbols, got {provided}"
    )
    self.required = required
    self.provided = provided
    self.algorithm = algorithm

explain

explain() -> str

Return a human-readable explanation.

Source code in src/emic/inference/errors.py
def explain(self) -> str:
    """Return a human-readable explanation."""
    return (
        f"The sequence you provided has {self.provided} symbols, "
        f"but the {self.algorithm} algorithm needs at least {self.required} "
        f"to produce reliable results. Try:\n"
        f"  • Using a longer sequence\n"
        f"  • Reducing max_history parameter\n"
        f"  • Reducing min_count parameter (less reliable)"
    )
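
A minimal handling sketch; the module paths are assumed from the source locations shown on this page.

from emic.inference.cssr import CSSR, CSSRConfig         # assumed path
from emic.inference.errors import InsufficientDataError  # assumed path

short_sequence = [0, 1] * 20  # 40 symbols: too short for max_history=8
try:
    CSSR(CSSRConfig(max_history=8)).infer(short_sequence)
except InsufficientDataError as err:
    print(err.explain())  # suggests a longer sequence or a smaller max_history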

NonConvergenceError

NonConvergenceError(iterations: int, tolerance: float)

Bases: InferenceError

Raised when algorithm fails to converge.

Source code in src/emic/inference/errors.py
def __init__(self, iterations: int, tolerance: float) -> None:
    super().__init__(f"Algorithm did not converge after {iterations} iterations")
    self.iterations = iterations
    self.tolerance = tolerance

explain

explain() -> str

Return a human-readable explanation.

Source code in src/emic/inference/errors.py
def explain(self) -> str:
    """Return a human-readable explanation."""
    return (
        f"The algorithm did not stabilize after {self.iterations} iterations. "
        f"This might indicate:\n"
        f"  • The process has complex structure requiring more iterations\n"
        f"  • The significance level ({self.tolerance}) is too sensitive\n"
        f"  • The data contains anomalies\n"
        f"Try increasing max_iterations or significance level."
    )