emic.sources
Data sources for epsilon-machine inference.
sources
Source protocol and implementations for sequence generation.
Public API
- SequenceSource (Protocol)
- SeededSource (Protocol)
- StochasticSource (Base class)
- GoldenMeanSource
- EvenProcessSource
- BiasedCoinSource
- PeriodicSource
- SequenceData
- TakeN
- SkipN
- BitFlipNoise
SequenceSource
Bases: Protocol[A_co]
A source of symbols for epsilon-machine inference.
Any object that is iterable over symbols and knows its alphabet satisfies this protocol.
Examples:
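A minimal sketch of what satisfying this protocol might look like. The names SequenceSourceSketch and Alternating are hypothetical stand-ins, and exposing the alphabet as a read-only alphabet attribute is an assumption inferred from SequenceData.alphabet; emic's actual protocol definition may differ.

```python
from collections.abc import Iterator
from typing import Protocol, TypeVar, runtime_checkable

A_co = TypeVar("A_co", covariant=True)


@runtime_checkable
class SequenceSourceSketch(Protocol[A_co]):
    """Stand-in protocol: iterable over symbols and aware of its alphabet."""

    @property
    def alphabet(self) -> frozenset[A_co]: ...

    def __iter__(self) -> Iterator[A_co]: ...


class Alternating:
    """Hypothetical source that emits 0, 1, 0, 1, ... forever."""

    @property
    def alphabet(self) -> frozenset[int]:
        return frozenset({0, 1})

    def __iter__(self) -> Iterator[int]:
        symbol = 0
        while True:
            yield symbol
            symbol = 1 - symbol


source = Alternating()
it = iter(source)
first_four = [next(it) for _ in range(4)]
```

Because the protocol is structural, Alternating does not need to inherit from anything; having the two members is enough.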
SeededSource
Bases: SequenceSource[A_co], Protocol[A_co]
A source that can be seeded for reproducibility.
Extends SequenceSource with seed management for stochastic sources.
with_seed
with_seed(seed: int) -> SeededSource[A_co]
StochasticSource
dataclass
StochasticSource(
    _alphabet: frozenset[A] = _empty_frozenset(),
    _seed: int | None = None,
    _rng: Random = Random(),
)
Bases: Generic[A]
Base class for stochastic process sources.
Handles random state management and provides common functionality. Not frozen because it maintains RNG state.
Subclasses should:
1. Set _alphabet in __post_init__
2. Implement __iter__ to yield symbols
3. Implement with_seed to return a properly typed copy
Attributes:
| Name | Type | Description |
|---|---|---|
| _alphabet | frozenset[A] | The set of possible symbols |
| _seed | int \| None | The random seed (None for unseeded) |
| _rng | Random | The random number generator |
Examples:
>>> from emic.sources import StochasticSource
>>> class MyStochasticSource(StochasticSource[int]):
...     def __post_init__(self):
...         super().__post_init__()
...         # Declare the binary alphabet for this source
...         object.__setattr__(self, '_alphabet', frozenset({0, 1}))
...     def __iter__(self):
...         while True:
...             yield self._rng.choice([0, 1])
__post_init__
with_seed
Return a new source with the given seed.
Subclasses should override to return the correct type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seed | int | The random seed to use. | required |
Returns:
| Type | Description |
|---|---|
| Self | A new source instance with the given seed. |
Source code in src/emic/sources/base.py
__iter__
Yield symbols from the source.
Subclasses must implement this method.
__rshift__
Pipeline operator for composing sources with transforms.
Usage
source >> TakeN(1000) >> CSSR()
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| transform | object | A callable that accepts this source. | required |
Returns:
| Type | Description |
|---|---|
| object | The result of applying the transform to this source. |
Source code in src/emic/sources/base.py
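The pipeline operator described above can be sketched as a one-line delegation: source >> transform evaluates transform(source). The classes below (Pipeable, Counter, FirstN) are hypothetical toys written for this illustration and are not part of emic.

```python
from itertools import islice


class Pipeable:
    """Hypothetical mixin: `source >> transform` calls `transform(source)`."""

    def __rshift__(self, transform):
        return transform(self)


class Counter(Pipeable):
    """Toy infinite source yielding 0, 1, 2, ..."""

    def __iter__(self):
        n = 0
        while True:
            yield n
            n += 1


class FirstN:
    """Toy transform standing in for TakeN: returns the first n symbols."""

    def __init__(self, n):
        self.n = n

    def __call__(self, source):
        return tuple(islice(source, self.n))


result = Counter() >> FirstN(3)
```

Chaining a second >> only works when the intermediate result also defines __rshift__; a plain tuple, as returned by this toy FirstN, does not.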
GoldenMeanSource
dataclass
GoldenMeanSource(
    _alphabet: frozenset[int] = (
        lambda: frozenset({0, 1})
    )(),
    _seed: int | None = None,
    _rng: Random = Random(),
    p: float = 0.5,
)
Bases: StochasticSource[int]
The Golden Mean Process.
A binary process where consecutive 1s are forbidden. After emitting a 1, the next symbol must be 0.
State machine
A --0 (p)--> A
A --1 (1-p)--> B
B --0 (1.0)--> A
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| p | float | Probability of emitting 0 from state A (default: 0.5) | 0.5 |
Statistical properties
- Entropy rate: h = H(p) / (2 - p), where H is the binary entropy (h = 2/3 bits per symbol for p=0.5)
- Statistical complexity: C_μ ≈ 0.918 bits for p=0.5
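The values quoted for p=0.5 can be reproduced from the stationary distribution of the two-state machine. The sketch below is a hand derivation, not code from emic; the function names are hypothetical. State A is occupied with probability 1 / (2 - p), only state A makes a random choice, and the statistical complexity is the entropy of the state distribution.

```python
from math import log2


def binary_entropy(q: float) -> float:
    """H(q) in bits, using the convention 0 * log2(0) = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)


def golden_mean_stats(p: float) -> tuple[float, float]:
    """Return (entropy rate, statistical complexity) in bits."""
    pi_a = 1.0 / (2.0 - p)        # stationary probability of state A
    h = pi_a * binary_entropy(p)  # state B is deterministic, so it adds nothing
    c_mu = binary_entropy(pi_a)   # entropy of the distribution over {A, B}
    return h, c_mu


h, c_mu = golden_mean_stats(0.5)
```

For p=0.5 this gives an entropy rate of 2/3 bits per symbol and a statistical complexity of about 0.918 bits.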
Examples:
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> it = iter(source)
>>> symbols = [next(it) for _ in range(100)]
>>> # No consecutive 1s in output
>>> '11' not in ''.join(map(str, symbols))
True
true_machine
property
true_machine: EpsilonMachine[int]
Return the known epsilon-machine for this process.
The Golden Mean process has exactly 2 causal states.
Returns:
| Type | Description |
|---|---|
| EpsilonMachine[int] | The epsilon-machine that generates this process. |
__post_init__
Validate parameters and initialize RNG.
__iter__
Generate symbols from the Golden Mean process.
Yields:
| Type | Description |
|---|---|
| int | Symbols from {0, 1} with no consecutive 1s. |
Source code in src/emic/sources/synthetic/golden_mean.py
with_seed
with_seed(seed: int) -> GoldenMeanSource
Return a new source with the given seed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seed | int | The random seed to use. | required |
Returns:
| Type | Description |
|---|---|
| GoldenMeanSource | A new GoldenMeanSource with the given seed. |
Source code in src/emic/sources/synthetic/golden_mean.py
EvenProcessSource
dataclass
EvenProcessSource(
    _alphabet: frozenset[int] = (
        lambda: frozenset({0, 1})
    )(),
    _seed: int | None = None,
    _rng: Random = Random(),
    p: float = 0.5,
)
Bases: StochasticSource[int]
The Even Process.
A binary process where 1s must appear in runs of even length. Once a run of 1s begins, the process must emit a second 1 before it can emit a 0.
State machine
A --0 (p)--> A
A --1 (1-p)--> B
B --1 (1.0)--> A
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| p | float | Probability of emitting 0 from state A (default: 0.5) | 0.5 |
Examples:
>>> import re
>>> source = EvenProcessSource(p=0.5, _seed=1)
>>> it = iter(source)
>>> symbols = [next(it) for _ in range(100)]
>>> s = ''.join(map(str, symbols))
>>> # Every run of 1s that is terminated by a 0 should have even length
>>> # (the final run may be cut short by the 100-symbol sample boundary)
>>> all(len(run) % 2 == 0 for run in re.findall(r'1+(?=0)', s))
True
true_machine
property
true_machine: EpsilonMachine[int]
Return the known epsilon-machine for this process.
The Even process has exactly 2 causal states.
Returns:
| Type | Description |
|---|---|
| EpsilonMachine[int] | The epsilon-machine that generates this process. |
__post_init__
Validate parameters and initialize RNG.
__iter__
Generate symbols from the Even process.
Yields:
| Type | Description |
|---|---|
| int | Symbols from {0, 1} where runs of 1s have even length. |
Source code in src/emic/sources/synthetic/even_process.py
with_seed
with_seed(seed: int) -> EvenProcessSource
Return a new source with the given seed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seed | int | The random seed to use. | required |
Returns:
| Type | Description |
|---|---|
| EvenProcessSource | A new EvenProcessSource with the given seed. |
Source code in src/emic/sources/synthetic/even_process.py
BiasedCoinSource
dataclass
BiasedCoinSource(
    _alphabet: frozenset[int] = (
        lambda: frozenset({0, 1})
    )(),
    _seed: int | None = None,
    _rng: Random = Random(),
    p: float = 0.5,
)
Bases: StochasticSource[int]
Independent identically distributed binary source.
The simplest stochastic process: each symbol is independent. The epsilon-machine has exactly one state.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| p | float | Probability of emitting 1 (default: 0.5) | 0.5 |
Statistical properties
- Entropy rate: h = H(p) = -p*log2(p) - (1-p)*log2(1-p)
- Statistical complexity: C_μ = 0 (no memory needed)
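As a quick sanity check of the entropy-rate formula, the sketch below samples an IID coin with the standard library and compares the plug-in entropy estimate with H(p). It does not use emic; the sampling rule (emit 1 when a uniform draw falls below p) is an assumption chosen to match the parameter description above.

```python
import random
from math import log2


def binary_entropy(q: float) -> float:
    """H(q) in bits, using the convention 0 * log2(0) = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)


p = 0.7
rng = random.Random(42)
symbols = [1 if rng.random() < p else 0 for _ in range(10_000)]

freq = sum(symbols) / len(symbols)  # empirical probability of a 1
h_true = binary_entropy(p)          # about 0.881 bits per symbol
h_est = binary_entropy(freq)        # plug-in estimate from the sample
```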
Examples:
>>> source = BiasedCoinSource(p=0.7, _seed=42)
>>> it = iter(source)
>>> symbols = [next(it) for _ in range(1000)]
>>> # Should be roughly 70% ones
>>> 0.65 < sum(symbols) / len(symbols) < 0.75
True
true_machine
property
true_machine: EpsilonMachine[int]
Return the known epsilon-machine for this process.
An IID process has exactly 1 causal state.
Returns:
| Type | Description |
|---|---|
| EpsilonMachine[int] | The epsilon-machine that generates this process. |
__post_init__
Validate parameters and initialize RNG.
__iter__
Generate IID symbols.
Yields:
| Type | Description |
|---|---|
| int | Symbols from {0, 1} drawn independently. |
with_seed
with_seed(seed: int) -> BiasedCoinSource
Return a new source with the given seed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seed | int | The random seed to use. | required |
Returns:
| Type | Description |
|---|---|
| BiasedCoinSource | A new BiasedCoinSource with the given seed. |
Source code in src/emic/sources/synthetic/biased_coin.py
PeriodicSource
dataclass
Bases: Generic[A]
A deterministic periodic process.
Repeats a fixed pattern indefinitely. The epsilon-machine has N states (one per position in pattern).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pattern | tuple[A, ...] | The repeating sequence of symbols | required |
Statistical properties
- Entropy rate: h = 0 (deterministic)
- Statistical complexity: C_μ = log2(N) bits, where N = len(pattern)
Examples:
>>> source = PeriodicSource(pattern=(0, 1, 0))
>>> it = iter(source)
>>> [next(it) for _ in range(9)]
[0, 1, 0, 0, 1, 0, 0, 1, 0]
true_machine
property
true_machine: EpsilonMachine[A]
Return the known epsilon-machine for this process.
A periodic process with period N has exactly N causal states, arranged in a cycle.
Returns:
| Type | Description |
|---|---|
| EpsilonMachine[A] | The epsilon-machine that generates this process. |
__post_init__
Validate pattern and set alphabet.
__iter__
Generate symbols by repeating the pattern.
Yields:
| Type | Description |
|---|---|
| A | Symbols from the pattern, cycling indefinitely. |
Source code in src/emic/sources/synthetic/periodic.py
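The behaviour described above can be approximated with the standard library alone. This sketch uses itertools.cycle rather than emic's PeriodicSource, and the complexity value assumes the pattern is not itself a repetition of a shorter block (otherwise the true period, and hence the state count, would be smaller than len(pattern)).

```python
from itertools import cycle, islice
from math import log2

pattern = (0, 1, 0)

# Repeat the pattern indefinitely and look at the first three periods.
symbols = list(islice(cycle(pattern), 9))

# One causal state per phase of the cycle, each equally likely.
c_mu = log2(len(pattern))
```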
__rshift__
SequenceData
dataclass
Bases: Generic[A]
A finite sequence of observed symbols.
Wraps empirical data for use in inference pipelines. Immutable to ensure data integrity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| symbols | tuple[A, ...] | The sequence of observed symbols | required |
| _alphabet | frozenset[A] \| None | Optional explicit alphabet (inferred from symbols if None) | None |
Examples:
>>> data = SequenceData(symbols=(0, 1, 0, 1, 0))
>>> list(data)
[0, 1, 0, 1, 0]
>>> len(data)
5
>>> data.alphabet
frozenset({0, 1})
alphabet
property
The set of possible symbols.
If an explicit alphabet was provided, returns that. Otherwise, returns the set of symbols observed in the data.
__iter__
__len__
__rshift__
from_string
staticmethod
from_string(s: str) -> SequenceData[str]
Create from a string (each character is a symbol).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| s | str | A string where each character is treated as a symbol. | required |
Returns:
| Type | Description |
|---|---|
| SequenceData[str] | A SequenceData containing the characters. |
Examples:
Source code in src/emic/sources/empirical/sequence_data.py
from_binary_string
staticmethod
from_binary_string(s: str) -> SequenceData[int]
Create from a binary string like "01010".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| s | str | A string of '0' and '1' characters. | required |
Returns:
| Type | Description |
|---|---|
| SequenceData[int] | A SequenceData containing integers 0 and 1. |
Raises:
| Type | Description |
|---|---|
| ValueError | If string contains non-binary characters. |
Examples:
Source code in src/emic/sources/empirical/sequence_data.py
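The validation contract described above (integers out, ValueError on anything other than '0' or '1') can be sketched with a small stand-alone helper. parse_binary is a hypothetical function for illustration; it returns a plain tuple rather than a SequenceData, and the error message is invented.

```python
def parse_binary(s: str) -> tuple[int, ...]:
    """Convert a string of '0'/'1' characters into a tuple of ints."""
    bad = set(s) - {"0", "1"}
    if bad:
        # Reject the whole input rather than silently skipping characters.
        raise ValueError(f"non-binary characters: {sorted(bad)}")
    return tuple(int(ch) for ch in s)


symbols = parse_binary("01010")
```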
TakeN
dataclass
Bases: Generic[A]
Take the first N symbols from a source.
Converts an infinite source into a finite SequenceData. Useful for sampling from stochastic sources.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of symbols to take | required |
Examples:
>>> from emic.sources import GoldenMeanSource, SequenceData, TakeN
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> data = TakeN[int](100)(source)
>>> len(data)
100
>>> isinstance(data, SequenceData)
True
__call__
__call__(source: SequenceSource[A]) -> SequenceData[A]
Apply the transform to a source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | SequenceSource[A] | The source to take symbols from. | required |
Returns:
| Type | Description |
|---|---|
| SequenceData[A] | A SequenceData containing the first n symbols. |
Source code in src/emic/sources/transforms/take.py
SkipN
dataclass
Bases: Generic[A]
Skip the first N symbols (burn-in period).
Useful for allowing a process to reach stationarity before collecting data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of symbols to skip | required |
Examples:
>>> from emic.sources import GoldenMeanSource, SkipN, TakeN
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> # Skip first 1000 symbols, then take 100
>>> skipped = SkipN[int](1000)(source)
>>> data = TakeN[int](100)(skipped)
__call__
__call__(source: SequenceSource[A]) -> _SkippedSource[A]
Apply the transform to a source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | SequenceSource[A] | The source to skip symbols from. | required |
Returns:
| Type | Description |
|---|---|
| _SkippedSource[A] | A new source that skips the first n symbols. |
Source code in src/emic/sources/transforms/skip.py
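The burn-in pattern (skip, then take) maps directly onto itertools.islice. The helpers below are hypothetical stand-ins for SkipN and TakeN that work on any iterable and return plain iterators and tuples; they are not emic's implementation.

```python
from collections.abc import Iterable, Iterator
from itertools import count, islice


def skip_n(source: Iterable[int], n: int) -> Iterator[int]:
    """Lazily drop the first n symbols, then pass the rest through."""
    return islice(iter(source), n, None)


def take_n(source: Iterable[int], n: int) -> tuple[int, ...]:
    """Materialise the next n symbols as a finite tuple."""
    return tuple(islice(source, n))


# count() stands in for an infinite source: 0, 1, 2, ...
data = take_n(skip_n(count(), 1000), 5)
```

Skipping is lazy: nothing is consumed until the first symbol is requested, so the skipped stream can still be infinite.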
BitFlipNoise
dataclass
Bases: Generic[A]
Apply bit-flip (binary symmetric channel) noise to a source.
Each symbol is independently replaced with a random symbol from the
alphabet with probability flip_prob. This models observation noise
where the true underlying symbol is corrupted before being observed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| flip_prob | float | Probability of flipping each symbol (0 to 0.5) | required |
| seed | int \| None | Random seed for reproducibility | None |
Examples:
>>> from emic.sources import GoldenMeanSource
>>> from emic.sources.transforms import BitFlipNoise, TakeN
>>> source = GoldenMeanSource(p=0.5, _seed=42)
>>> # Add 5% observation noise
>>> noisy = BitFlipNoise[int](flip_prob=0.05, seed=123)(source)
>>> data = TakeN[int](1000)(noisy)
Note
For binary alphabets, flip_prob=0.5 produces random noise independent of the input. For larger alphabets, flipped symbols are drawn uniformly at random from the alphabet.
__post_init__
__call__
__call__(source: SequenceSource[A]) -> _NoisySource[A]
Apply the noise transform to a source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | SequenceSource[A] | The source to add noise to. | required |
Returns:
| Type | Description |
|---|---|
| _NoisySource[A] | A new source with observation noise applied. |