emic.sources

Data sources for epsilon-machine inference.

sources

Source protocol and implementations for sequence generation.

Public API
  • SequenceSource (Protocol)
  • SeededSource (Protocol)
  • StochasticSource (Base class)
  • GoldenMeanSource
  • EvenProcessSource
  • BiasedCoinSource
  • PeriodicSource
  • SequenceData
  • TakeN
  • SkipN
  • BitFlipNoise

SequenceSource

Bases: Protocol[A_co]

A source of symbols for epsilon-machine inference.

Any object that is iterable over symbols and knows its alphabet satisfies this protocol.

Examples:

>>> class MySource:
...     @property
...     def alphabet(self) -> frozenset[int]:
...         return frozenset({0, 1})
...     def __iter__(self):
...         yield from [0, 1, 0, 1]
>>> source: SequenceSource[int] = MySource()
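
A minimal sketch of what structural conformance means at run time. It does not import emic: `SourceLike` is a hypothetical stand-in that mirrors the two members documented here, and `symbols_outside_alphabet` is an illustrative helper, not part of the library.

```python
from collections.abc import Iterator
from typing import Protocol, TypeVar, runtime_checkable

A_co = TypeVar("A_co", covariant=True)


@runtime_checkable
class SourceLike(Protocol[A_co]):
    """Hypothetical stand-in mirroring the documented SequenceSource members."""

    @property
    def alphabet(self) -> frozenset[A_co]: ...

    def __iter__(self) -> Iterator[A_co]: ...


class MySource:
    """Conforms structurally: it does not inherit from the protocol."""

    @property
    def alphabet(self) -> frozenset[int]:
        return frozenset({0, 1})

    def __iter__(self) -> Iterator[int]:
        yield from [0, 1, 0, 1]


def symbols_outside_alphabet(source: SourceLike[int]) -> list[int]:
    """Return emitted symbols that the declared alphabet does not contain."""
    return [s for s in source if s not in source.alphabet]


src = MySource()
conforms = isinstance(src, SourceLike)  # checks member presence only, not types
stray = symbols_outside_alphabet(src)   # empty when the alphabet is honest
```

A plain list is iterable but has no `alphabet`, so it would not pass the same `isinstance` check.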

alphabet property

alphabet: frozenset[A_co]

The set of possible symbols.

__iter__

__iter__() -> Iterator[A_co]

Yield symbols from the source.

Source code in src/emic/sources/protocol.py
def __iter__(self) -> Iterator[A_co]:
    """Yield symbols from the source."""
    ...

SeededSource

Bases: SequenceSource[A_co], Protocol[A_co]

A source that can be seeded for reproducibility.

Extends SequenceSource with seed management for stochastic sources.

seed property

seed: int | None

The random seed, if set.

with_seed

with_seed(seed: int) -> SeededSource[A_co]

Return a new source with the given seed.

Source code in src/emic/sources/protocol.py
def with_seed(self, seed: int) -> "SeededSource[A_co]":
    """Return a new source with the given seed."""
    ...

StochasticSource dataclass

StochasticSource(
    _alphabet: frozenset[A] = _empty_frozenset(),
    _seed: int | None = None,
    _rng: Random = Random(),
)

Bases: Generic[A]

Base class for stochastic process sources.

Handles random state management and provides common functionality. Not frozen because it maintains RNG state.

Subclasses should:
  1. Set _alphabet in __post_init__
  2. Implement __iter__ to yield symbols
  3. Implement with_seed to return a properly typed copy

Attributes:

Name       Type          Description
_alphabet  frozenset[A]  The set of possible symbols
_seed      int | None    The random seed (None for unseeded)
_rng       Random        The random number generator

Examples:

>>> class MyStochasticSource(StochasticSource[int]):
...     def __post_init__(self):
...         super().__post_init__()
...         object.__setattr__(self, '_alphabet', frozenset({0, 1}))
...     def __iter__(self):
...         while True:
...             yield self._rng.choice([0, 1])

alphabet property

alphabet: frozenset[A]

The set of possible symbols.

seed property

seed: int | None

The random seed, if set.

__post_init__

__post_init__() -> None

Initialize RNG with seed if provided.

Source code in src/emic/sources/base.py
def __post_init__(self) -> None:
    """Initialize RNG with seed if provided."""
    if self._seed is not None:
        self._rng.seed(self._seed)

with_seed

with_seed(seed: int) -> Self

Return a new source with the given seed.

Subclasses should override to return the correct type.

Parameters:

Name Type Description Default
seed int

The random seed to use.

required

Returns:

Type Description
Self

A new source instance with the given seed.

Source code in src/emic/sources/base.py
def with_seed(self, seed: int) -> Self:
    """
    Return a new source with the given seed.

    Subclasses should override to return the correct type.

    Args:
        seed: The random seed to use.

    Returns:
        A new source instance with the given seed.
    """
    raise NotImplementedError("Subclasses must implement with_seed")

__iter__

__iter__() -> Iterator[A]

Yield symbols from the source.

Subclasses must implement this method.

Source code in src/emic/sources/base.py
def __iter__(self) -> Iterator[A]:
    """
    Yield symbols from the source.

    Subclasses must implement this method.
    """
    raise NotImplementedError("Subclasses must implement __iter__")

__rshift__

__rshift__(transform: object) -> object

Pipeline operator for composing sources with transforms.

Usage

source >> TakeN(1000) >> CSSR()

Parameters:

Name Type Description Default
transform object

A callable that accepts this source.

required

Returns:

Type Description
object

The result of applying the transform to this source.

Source code in src/emic/sources/base.py
def __rshift__(self, transform: object) -> object:
    """
    Pipeline operator for composing sources with transforms.

    Usage:
        source >> TakeN(1000) >> CSSR()

    Args:
        transform: A callable that accepts this source.

    Returns:
        The result of applying the transform to this source.
    """
    if callable(transform):
        return transform(self)
    return NotImplemented
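
A self-contained sketch of the pipeline operator's behaviour, using toy classes rather than emic itself. `PipeMixin`, `CountUp`, and `Take` are hypothetical names; only the body of `__rshift__` is taken from the listing above.

```python
from dataclasses import dataclass
from itertools import islice


class PipeMixin:
    """Hypothetical mixin reproducing the documented __rshift__ body."""

    def __rshift__(self, transform: object) -> object:
        if callable(transform):
            return transform(self)
        return NotImplemented


@dataclass(frozen=True)
class CountUp(PipeMixin):
    """Toy infinite source yielding start, start + 1, ... (illustrative only)."""

    start: int = 0

    def __iter__(self):
        n = self.start
        while True:
            yield n
            n += 1


@dataclass(frozen=True)
class Take:
    """Toy transform: a callable that materialises the first n symbols."""

    n: int

    def __call__(self, source) -> tuple:
        return tuple(islice(source, self.n))


taken = CountUp() >> Take(3)  # equivalent to Take(3)(CountUp())

try:
    CountUp() >> 5            # 5 is not callable, so __rshift__ returns NotImplemented
    failed = False
except TypeError:
    failed = True             # Python converts the NotImplemented result into TypeError
```

Chaining a second `>>` only works if the value returned by the first transform also defines `__rshift__`; the toy `Take` returns a plain tuple, so it ends the chain.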

GoldenMeanSource dataclass

GoldenMeanSource(
    _alphabet: frozenset[int] = (
        lambda: frozenset({0, 1})
    )(),
    _seed: int | None = None,
    _rng: Random = Random(),
    p: float = 0.5,
)

Bases: StochasticSource[int]

The Golden Mean Process.

A binary process where consecutive 1s are forbidden. After emitting a 1, the next symbol must be 0.

State machine

A --0 (p)--> A
A --1 (1-p)--> B
B --0 (1.0)--> A

Parameters:

Name  Type   Description                                            Default
p     float  Probability of emitting 0 from state A (default: 0.5)  0.5

Statistical properties
  • Entropy rate: h = H(p) / (2 - p), where H is the binary entropy (h = 2/3 bits per symbol for p=0.5)
  • Statistical complexity: C_μ ≈ 0.918 bits for p=0.5
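
Both quantities can be derived from the state machine. Solving π_A = p·π_A + π_B and π_B = (1-p)·π_A gives the stationary probability π_A = 1/(2-p). Only state A is random, so the entropy rate is π_A·H(p), and the statistical complexity is the entropy of (π_A, π_B). A short numerical sketch follows; the function names are illustrative, not part of emic.

```python
from math import log2


def binary_entropy(q: float) -> float:
    """H(q) in bits, with 0 * log2(0) taken as 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)


def golden_mean_stats(p: float) -> tuple[float, float]:
    """Return (entropy rate, statistical complexity) in bits for 0 < p < 1."""
    pi_a = 1 / (2 - p)            # stationary probability of state A
    h = pi_a * binary_entropy(p)  # state B is deterministic and adds nothing
    c_mu = binary_entropy(pi_a)   # entropy of the stationary state distribution
    return h, c_mu


h, c_mu = golden_mean_stats(0.5)  # h is 2/3, c_mu is about 0.918
```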

Examples:

>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> it = iter(source)
>>> symbols = [next(it) for _ in range(100)]
>>> # No consecutive 1s in output
>>> '11' not in ''.join(map(str, symbols))
True

true_machine property

true_machine: EpsilonMachine[int]

Return the known epsilon-machine for this process.

The Golden Mean process has exactly 2 causal states.

Returns:

Type Description
EpsilonMachine[int]

The epsilon-machine that generates this process.

__post_init__

__post_init__() -> None

Validate parameters and initialize RNG.

Source code in src/emic/sources/synthetic/golden_mean.py
def __post_init__(self) -> None:
    """Validate parameters and initialize RNG."""
    super().__post_init__()
    if not (0 < self.p < 1):
        msg = f"p must be in (0, 1), got {self.p}"
        raise ValueError(msg)

__iter__

__iter__() -> Iterator[int]

Generate symbols from the Golden Mean process.

Yields:

Type Description
int

Symbols from {0, 1} with no consecutive 1s.

Source code in src/emic/sources/synthetic/golden_mean.py
def __iter__(self) -> Iterator[int]:
    """
    Generate symbols from the Golden Mean process.

    Yields:
        Symbols from {0, 1} with no consecutive 1s.
    """
    state = "A"
    while True:
        if state == "A":
            if self._rng.random() < self.p:
                yield 0
                state = "A"
            else:
                yield 1
                state = "B"
        else:  # state == 'B'
            yield 0
            state = "A"

with_seed

with_seed(seed: int) -> GoldenMeanSource

Return a new source with the given seed.

Parameters:

Name Type Description Default
seed int

The random seed to use.

required

Returns:

Type Description
GoldenMeanSource

A new GoldenMeanSource with the given seed.

Source code in src/emic/sources/synthetic/golden_mean.py
def with_seed(self, seed: int) -> GoldenMeanSource:
    """
    Return a new source with the given seed.

    Args:
        seed: The random seed to use.

    Returns:
        A new GoldenMeanSource with the given seed.
    """
    return GoldenMeanSource(p=self.p, _seed=seed)

EvenProcessSource dataclass

EvenProcessSource(
    _alphabet: frozenset[int] = (
        lambda: frozenset({0, 1})
    )(),
    _seed: int | None = None,
    _rng: Random = Random(),
    p: float = 0.5,
)

Bases: StochasticSource[int]

The Even Process.

A binary process in which 1s appear only in runs of even length. After emitting the first 1 of a pair, the process must emit a second 1 before it can emit a 0.

State machine

A --0 (p)--> A
A --1 (1-p)--> B
B --1 (1.0)--> A

Parameters:

Name  Type   Description                                            Default
p     float  Probability of emitting 0 from state A (default: 0.5)  0.5

Examples:

>>> source = EvenProcessSource(p=0.5, _seed=1)
>>> it = iter(source)
>>> symbols = [next(it) for _ in range(100)]
>>> # Count runs of 1s - all should be even length
>>> s = ''.join(map(str, symbols))
>>> import re
>>> all(len(run) % 2 == 0 for run in re.findall('1+', s))
True
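
One caveat when running this kind of check on a fixed-length window: the window can end between the two 1s of a pair, so the final run may look odd even though the process is behaving correctly. The sketch below ignores a run that touches the end of the window. It does not import emic; `even_process` is a stand-in generator written from the state machine above, and `complete_runs_are_even` is a hypothetical helper.

```python
import re
from itertools import islice
from random import Random


def even_process(p: float, rng: Random):
    """Stand-in generator following the documented state machine."""
    state = "A"
    while True:
        if state == "A" and rng.random() < p:
            yield 0                 # stay in A
        elif state == "A":
            yield 1                 # first 1 of a pair
            state = "B"
        else:
            yield 1                 # forced second 1 of the pair
            state = "A"


def complete_runs_are_even(symbols: list[int]) -> bool:
    """Check runs of 1s, skipping a final run cut off by the window edge."""
    s = "".join(map(str, symbols))
    runs = [m.group() for m in re.finditer(r"1+", s) if m.end() < len(s)]
    return all(len(run) % 2 == 0 for run in runs)


window = list(islice(even_process(0.5, Random(1)), 101))
ok = complete_runs_are_even(window)
```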

true_machine property

true_machine: EpsilonMachine[int]

Return the known epsilon-machine for this process.

The Even process has exactly 2 causal states.

Returns:

Type Description
EpsilonMachine[int]

The epsilon-machine that generates this process.

__post_init__

__post_init__() -> None

Validate parameters and initialize RNG.

Source code in src/emic/sources/synthetic/even_process.py
def __post_init__(self) -> None:
    """Validate parameters and initialize RNG."""
    super().__post_init__()
    if not (0 < self.p < 1):
        msg = f"p must be in (0, 1), got {self.p}"
        raise ValueError(msg)

__iter__

__iter__() -> Iterator[int]

Generate symbols from the Even process.

Yields:

Type Description
int

Symbols from {0, 1} where runs of 1s have even length.

Source code in src/emic/sources/synthetic/even_process.py
def __iter__(self) -> Iterator[int]:
    """
    Generate symbols from the Even process.

    Yields:
        Symbols from {0, 1} where runs of 1s have even length.
    """
    state = "A"
    while True:
        if state == "A":
            if self._rng.random() < self.p:
                yield 0
                state = "A"
            else:
                yield 1
                state = "B"
        else:  # state == 'B'
            yield 1
            state = "A"

with_seed

with_seed(seed: int) -> EvenProcessSource

Return a new source with the given seed.

Parameters:

Name Type Description Default
seed int

The random seed to use.

required

Returns:

Type Description
EvenProcessSource

A new EvenProcessSource with the given seed.

Source code in src/emic/sources/synthetic/even_process.py
def with_seed(self, seed: int) -> EvenProcessSource:
    """
    Return a new source with the given seed.

    Args:
        seed: The random seed to use.

    Returns:
        A new EvenProcessSource with the given seed.
    """
    return EvenProcessSource(p=self.p, _seed=seed)

BiasedCoinSource dataclass

BiasedCoinSource(
    _alphabet: frozenset[int] = (
        lambda: frozenset({0, 1})
    )(),
    _seed: int | None = None,
    _rng: Random = Random(),
    p: float = 0.5,
)

Bases: StochasticSource[int]

Independent identically distributed binary source.

The simplest stochastic process: each symbol is independent. The epsilon-machine has exactly one state.

Parameters:

Name  Type   Description                               Default
p     float  Probability of emitting 1 (default: 0.5)  0.5

Statistical properties
  • Entropy rate: h = H(p) = -p*log(p) - (1-p)*log(1-p)
  • Statistical complexity: C_μ = 0 (no memory needed)
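
For an IID source the entropy rate equals the single-symbol entropy, so it can be estimated from symbol frequencies alone. A minimal sketch, assuming base-2 logarithms (bits) and drawing samples with the same rule as __iter__ rather than importing emic; `plugin_entropy` is an illustrative name.

```python
from collections import Counter
from math import log2
from random import Random


def plugin_entropy(symbols: list[int]) -> float:
    """Entropy in bits of the empirical symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


p = 0.7
rng = Random(42)
sample = [1 if rng.random() < p else 0 for _ in range(10_000)]  # same rule as __iter__

estimate = plugin_entropy(sample)
exact = -p * log2(p) - (1 - p) * log2(1 - p)  # H(0.7), about 0.881 bits
```

The estimate approaches the exact value as the sample grows; for short samples the plug-in estimator tends to read slightly low.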

Examples:

>>> source = BiasedCoinSource(p=0.7, _seed=42)
>>> it = iter(source)
>>> symbols = [next(it) for _ in range(1000)]
>>> # Should be roughly 70% ones
>>> 0.65 < sum(symbols) / len(symbols) < 0.75
True

true_machine property

true_machine: EpsilonMachine[int]

Return the known epsilon-machine for this process.

An IID process has exactly 1 causal state.

Returns:

Type Description
EpsilonMachine[int]

The epsilon-machine that generates this process.

__post_init__

__post_init__() -> None

Validate parameters and initialize RNG.

Source code in src/emic/sources/synthetic/biased_coin.py
def __post_init__(self) -> None:
    """Validate parameters and initialize RNG."""
    super().__post_init__()
    if not (0 <= self.p <= 1):
        msg = f"p must be in [0, 1], got {self.p}"
        raise ValueError(msg)

__iter__

__iter__() -> Iterator[int]

Generate IID symbols.

Yields:

Type Description
int

Symbols from {0, 1} drawn independently.

Source code in src/emic/sources/synthetic/biased_coin.py
def __iter__(self) -> Iterator[int]:
    """
    Generate IID symbols.

    Yields:
        Symbols from {0, 1} drawn independently.
    """
    while True:
        yield 1 if self._rng.random() < self.p else 0

with_seed

with_seed(seed: int) -> BiasedCoinSource

Return a new source with the given seed.

Parameters:

Name Type Description Default
seed int

The random seed to use.

required

Returns:

Type Description
BiasedCoinSource

A new BiasedCoinSource with the given seed.

Source code in src/emic/sources/synthetic/biased_coin.py
def with_seed(self, seed: int) -> BiasedCoinSource:
    """
    Return a new source with the given seed.

    Args:
        seed: The random seed to use.

    Returns:
        A new BiasedCoinSource with the given seed.
    """
    return BiasedCoinSource(p=self.p, _seed=seed)

PeriodicSource dataclass

PeriodicSource(pattern: tuple[A, ...])

Bases: Generic[A]

A deterministic periodic process.

Repeats a fixed pattern indefinitely. The epsilon-machine has N states (one per position in pattern).

Parameters:

Name     Type           Description                        Default
pattern  tuple[A, ...]  The repeating sequence of symbols  required

Statistical properties
  • Entropy rate: h = 0 (deterministic)
  • Statistical complexity: C_μ = log2(N) bits, where N = len(pattern)

Examples:

>>> source = PeriodicSource(pattern=(0, 1, 0))
>>> it = iter(source)
>>> [next(it) for _ in range(9)]
[0, 1, 0, 0, 1, 0, 0, 1, 0]
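
The same output can be reproduced with the standard library, which also makes the complexity figure easy to check. This sketch takes the one-state-per-position picture at face value, which presumes the pattern is not itself a repetition of a shorter block.

```python
from itertools import cycle, islice
from math import log2

pattern = (0, 1, 0)

# itertools.cycle repeats the pattern forever, like the documented __iter__.
first_nine = list(islice(cycle(pattern), 9))

# With one causal state per pattern position, each visited equally often,
# the statistical complexity is log2(N) bits.
c_mu = log2(len(pattern))  # about 1.585 bits for N = 3
```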

alphabet property

alphabet: frozenset[A]

The set of symbols in the pattern.

true_machine property

true_machine: EpsilonMachine[A]

Return the known epsilon-machine for this process.

A periodic process with period N has exactly N causal states, arranged in a cycle.

Returns:

Type Description
EpsilonMachine[A]

The epsilon-machine that generates this process.

__post_init__

__post_init__() -> None

Validate pattern and set alphabet.

Source code in src/emic/sources/synthetic/periodic.py
def __post_init__(self) -> None:
    """Validate pattern and set alphabet."""
    if len(self.pattern) == 0:
        msg = "Pattern must be non-empty"
        raise ValueError(msg)
    object.__setattr__(self, "_alphabet", frozenset(self.pattern))

__iter__

__iter__() -> Iterator[A]

Generate symbols by repeating the pattern.

Yields:

Type Description
A

Symbols from the pattern, cycling indefinitely.

Source code in src/emic/sources/synthetic/periodic.py
def __iter__(self) -> Iterator[A]:
    """
    Generate symbols by repeating the pattern.

    Yields:
        Symbols from the pattern, cycling indefinitely.
    """
    i = 0
    n = len(self.pattern)
    while True:
        yield self.pattern[i]
        i = (i + 1) % n

__rshift__

__rshift__(transform: object) -> object

Pipeline operator for composing with transforms.

Source code in src/emic/sources/synthetic/periodic.py
def __rshift__(self, transform: object) -> object:
    """Pipeline operator for composing with transforms."""
    if callable(transform):
        return transform(self)
    return NotImplemented

SequenceData dataclass

SequenceData(
    symbols: tuple[A, ...],
    _alphabet: frozenset[A] | None = None,
)

Bases: Generic[A]

A finite sequence of observed symbols.

Wraps empirical data for use in inference pipelines. Immutable to ensure data integrity.

Parameters:

Name       Type                 Description                                                 Default
symbols    tuple[A, ...]        The sequence of observed symbols                            required
_alphabet  frozenset[A] | None  Optional explicit alphabet (inferred from symbols if None)  None

Examples:

>>> data = SequenceData(symbols=(0, 1, 0, 1, 0))
>>> list(data)
[0, 1, 0, 1, 0]
>>> len(data)
5
>>> data.alphabet
frozenset({0, 1})

alphabet property

alphabet: frozenset[A]

The set of possible symbols.

If an explicit alphabet was provided, returns that. Otherwise, returns the set of symbols observed in the data.
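
The fallback rule matters when a sample happens to miss a symbol. A minimal sketch of the rule as described, using a hypothetical stand-in class (`DataSketch`) rather than the library's own implementation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataSketch:
    """Hypothetical stand-in for the documented alphabet rule."""

    symbols: tuple
    _alphabet: Optional[frozenset] = None

    @property
    def alphabet(self) -> frozenset:
        # An explicit alphabet wins; otherwise fall back to the observed symbols.
        if self._alphabet is not None:
            return self._alphabet
        return frozenset(self.symbols)


inferred = DataSketch(symbols=(0, 0, 0)).alphabet
declared = DataSketch(symbols=(0, 0, 0), _alphabet=frozenset({0, 1})).alphabet
```

Here `inferred` contains only 0, while `declared` keeps the full binary alphabet even though 1 never appears in the data.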

__iter__

__iter__() -> Iterator[A]

Iterate over symbols.

Source code in src/emic/sources/empirical/sequence_data.py
def __iter__(self) -> Iterator[A]:
    """Iterate over symbols."""
    return iter(self.symbols)

__len__

__len__() -> int

Number of symbols.

Source code in src/emic/sources/empirical/sequence_data.py
def __len__(self) -> int:
    """Number of symbols."""
    return len(self.symbols)

__rshift__

__rshift__(transform: object) -> object

Pipeline operator for composing with transforms.

Source code in src/emic/sources/empirical/sequence_data.py
def __rshift__(self, transform: object) -> object:
    """Pipeline operator for composing with transforms."""
    if callable(transform):
        return transform(self)
    return NotImplemented

from_string staticmethod

from_string(s: str) -> SequenceData[str]

Create from a string (each character is a symbol).

Parameters:

Name Type Description Default
s str

A string where each character is treated as a symbol.

required

Returns:

Type Description
SequenceData[str]

A SequenceData containing the characters.

Examples:

>>> data = SequenceData.from_string("AABBA")
>>> list(data)
['A', 'A', 'B', 'B', 'A']

Source code in src/emic/sources/empirical/sequence_data.py
@staticmethod
def from_string(s: str) -> SequenceData[str]:
    """
    Create from a string (each character is a symbol).

    Args:
        s: A string where each character is treated as a symbol.

    Returns:
        A SequenceData containing the characters.

    Examples:
        >>> data = SequenceData.from_string("AABBA")
        >>> list(data)
        ['A', 'A', 'B', 'B', 'A']
    """
    return SequenceData(tuple(s))

from_binary_string staticmethod

from_binary_string(s: str) -> SequenceData[int]

Create from a binary string like "01010".

Parameters:

Name Type Description Default
s str

A string of '0' and '1' characters.

required

Returns:

Type Description
SequenceData[int]

A SequenceData containing integers 0 and 1.

Raises:

Type Description
ValueError

If string contains non-binary characters.

Examples:

>>> data = SequenceData.from_binary_string("01010")
>>> list(data)
[0, 1, 0, 1, 0]

Source code in src/emic/sources/empirical/sequence_data.py
@staticmethod
def from_binary_string(s: str) -> SequenceData[int]:
    """
    Create from a binary string like "01010".

    Args:
        s: A string of '0' and '1' characters.

    Returns:
        A SequenceData containing integers 0 and 1.

    Raises:
        ValueError: If string contains non-binary characters.

    Examples:
        >>> data = SequenceData.from_binary_string("01010")
        >>> list(data)
        [0, 1, 0, 1, 0]
    """
    for c in s:
        if c not in ("0", "1"):
            msg = f"Expected binary string, got character '{c}'"
            raise ValueError(msg)
    return SequenceData(tuple(int(c) for c in s), _alphabet=frozenset({0, 1}))

TakeN dataclass

TakeN(n: int)

Bases: Generic[A]

Take the first N symbols from a source.

Converts an infinite source into a finite SequenceData. Useful for sampling from stochastic sources.

Parameters:

Name  Type  Description                Default
n     int   Number of symbols to take  required

Examples:

>>> from emic.sources import GoldenMeanSource, TakeN
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> data = TakeN[int](100)(source)
>>> len(data)
100
>>> isinstance(data, SequenceData)
True

__call__

__call__(source: SequenceSource[A]) -> SequenceData[A]

Apply the transform to a source.

Parameters:

Name Type Description Default
source SequenceSource[A]

The source to take symbols from.

required

Returns:

Type Description
SequenceData[A]

A SequenceData containing the first n symbols.

Source code in src/emic/sources/transforms/take.py
def __call__(self, source: SequenceSource[A]) -> SequenceData[A]:
    """
    Apply the transform to a source.

    Args:
        source: The source to take symbols from.

    Returns:
        A SequenceData containing the first n symbols.
    """
    symbols = tuple(islice(source, self.n))
    return SequenceData(symbols, _alphabet=source.alphabet)
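
Because the listing relies on itertools.islice, the result holds at most n symbols: a source that ends early yields a shorter result rather than an error. A small standard-library sketch of that behaviour, with hypothetical helper names and no emic imports:

```python
from itertools import islice


def take_first(source, n: int) -> tuple:
    """Collect at most n symbols; stops early if the source runs dry."""
    return tuple(islice(source, n))


def naturals():
    """Toy infinite source: 0, 1, 2, ..."""
    k = 0
    while True:
        yield k
        k += 1


from_infinite = take_first(naturals(), 4)  # exactly 4 symbols
from_short = take_first([7, 8], 4)         # only 2 symbols available
```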

SkipN dataclass

SkipN(n: int)

Bases: Generic[A]

Skip the first N symbols (burn-in period).

Useful for allowing a process to reach stationarity before collecting data.

Parameters:

Name  Type  Description                Default
n     int   Number of symbols to skip  required

Examples:

>>> from emic.sources import GoldenMeanSource, SkipN, TakeN
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> # Skip first 1000 symbols, then take 100
>>> skipped = SkipN[int](1000)(source)
>>> data = TakeN[int](100)(skipped)
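
The burn-in idea can be sketched with the standard library alone. This is not the library's _SkippedSource, whose code is not shown here; `skip_first` is a hypothetical helper and `itertools.count()` stands in for an infinite source.

```python
from itertools import count, islice


def skip_first(source, n: int):
    """Return a lazy iterator over source with its first n symbols dropped."""
    return islice(source, n, None)


burned_in = skip_first(count(), 1000)  # nothing is consumed until iteration
window = list(islice(burned_in, 3))    # first symbols after the burn-in
```

The skip is lazy: the first 1000 symbols are only generated and discarded when the first symbol is requested.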

__call__

__call__(source: SequenceSource[A]) -> _SkippedSource[A]

Apply the transform to a source.

Parameters:

Name Type Description Default
source SequenceSource[A]

The source to skip symbols from.

required

Returns:

Type Description
_SkippedSource[A]

A new source that skips the first n symbols.

Source code in src/emic/sources/transforms/skip.py
def __call__(self, source: SequenceSource[A]) -> _SkippedSource[A]:
    """
    Apply the transform to a source.

    Args:
        source: The source to skip symbols from.

    Returns:
        A new source that skips the first n symbols.
    """
    return _SkippedSource(source, self.n)

BitFlipNoise dataclass

BitFlipNoise(flip_prob: float, seed: int | None = None)

Bases: Generic[A]

Apply bit-flip (binary symmetric channel) noise to a source.

Each symbol is independently replaced with a random symbol from the alphabet with probability flip_prob. This models observation noise where the true underlying symbol is corrupted before being observed.

Parameters:

Name       Type        Description                                     Default
flip_prob  float       Probability of flipping each symbol (0 to 0.5)  required
seed       int | None  Random seed for reproducibility                 None

Examples:

>>> from emic.sources import GoldenMeanSource
>>> from emic.sources.transforms import BitFlipNoise, TakeN
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> # Add 5% observation noise
>>> noisy = BitFlipNoise[int](flip_prob=0.05, seed=123)(source)
>>> data = TakeN[int](1000)(noisy)

Note

For binary alphabets, flip_prob=0.5 produces random noise independent of the input. For larger alphabets, flipped symbols are drawn uniformly at random from the alphabet.
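
For intuition, here is a textbook binary symmetric channel on 0/1 symbols, where each bit is inverted with probability flip_prob. This is a sketch under that assumption only: the library's _NoisySource is not shown here and may choose replacement symbols differently, and `flip_bits` is a hypothetical name.

```python
from random import Random


def flip_bits(bits, flip_prob: float, seed=None):
    """Textbook binary symmetric channel: invert each 0/1 bit with probability flip_prob."""
    rng = Random(seed)
    for b in bits:
        yield 1 - b if rng.random() < flip_prob else b


clean = [0, 1] * 5000
noisy = list(flip_bits(clean, flip_prob=0.05, seed=123))

# Fraction of positions where the observed bit differs from the true bit.
error_rate = sum(c != n for c, n in zip(clean, noisy)) / len(clean)

untouched = list(flip_bits(clean, flip_prob=0.0))  # zero noise leaves the input intact
```

With 10,000 symbols the measured error rate should land close to the requested 5%.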

__post_init__

__post_init__() -> None

Validate flip probability.

Source code in src/emic/sources/transforms/noise.py
def __post_init__(self) -> None:
    """Validate flip probability."""
    if not 0 <= self.flip_prob <= 0.5:
        raise ValueError(f"flip_prob must be in [0, 0.5], got {self.flip_prob}")

__call__

__call__(source: SequenceSource[A]) -> _NoisySource[A]

Apply the noise transform to a source.

Parameters:

Name Type Description Default
source SequenceSource[A]

The source to add noise to.

required

Returns:

Type Description
_NoisySource[A]

A new source with observation noise applied.

Source code in src/emic/sources/transforms/noise.py
def __call__(self, source: SequenceSource[A]) -> _NoisySource[A]:
    """
    Apply the noise transform to a source.

    Args:
        source: The source to add noise to.

    Returns:
        A new source with observation noise applied.
    """
    return _NoisySource(source, self.flip_prob, self.seed)