Working with Real Data¶
A practical guide for applying epsilon-machine inference to empirical sequences.
Prerequisites: Familiarity with emic basics, your data in symbolic form
Reading time: 25-35 minutes
Table of Contents¶
- The Empirical Challenge
- Preparing Your Data
- Alphabet Design
- Sample Size Considerations
- Choosing an Algorithm
- Handling Noise
- Validation Strategies
- Complete Workflow Example
- Domain-Specific Tips
- Troubleshooting
1. The Empirical Challenge¶
1.1 Synthetic vs Real Data¶
Synthetic data (like GoldenMeanSource) is:
- Clean, noise-free
- Exactly follows the generative model
- Infinite in principle
Real data is:
- Noisy and messy
- May not follow any simple model
- Finite (often too short)
1.2 What Can Go Wrong¶
| Problem | Symptom | Cause |
|---|---|---|
| Too many states | Cμ explodes | Noise, undersampling |
| Too few states | Miss structure | Insufficient history |
| Unstable results | Changes with seed | Insufficient data |
| Non-unifilar machine | Ambiguous transitions | Algorithm failure |
1.3 The Goal¶
Transform your real-world observations into a useful epsilon-machine that captures the underlying structure without overfitting to noise.
2. Preparing Your Data¶
2.1 From Raw to Symbolic¶
emic expects data as a sequence of integers (Sequence[int]). Your first task is mapping raw observations to symbols.
from emic.sources import SequenceData
# Option 1: Already integers
data = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
source = SequenceData(data)
# Option 2: Categorical labels
raw = ["A", "B", "A", "A", "B", "C", "A"]
mapping = {"A": 0, "B": 1, "C": 2}
data = [mapping[x] for x in raw]
source = SequenceData(data)
# Option 3: Continuous → discretized
import numpy as np
continuous = np.random.randn(1000)
data = (continuous > 0).astype(int).tolist() # Binary
source = SequenceData(data)
2.2 Handling Missing Values¶
Don't include missing values in your sequence! Options:
Option A: Exclude (simple but breaks continuity)
Option B: Impute (introduces bias)
from statistics import mode
common = mode([x for x in raw if x is not None])
data = [common if x is None else x for x in raw]
Option C: Segment (preserves structure)
# Split into contiguous segments, analyze separately
segments = []
current = []
for x in raw:
if x is None:
if current:
segments.append(current)
current = []
else:
current.append(x)
if current:
segments.append(current)
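If you go with Option C, each sufficiently long segment can then be analyzed on its own. A minimal sketch, assuming the segments already contain integer symbols and using the same Spectral calls shown later in this guide; MIN_SEGMENT_LENGTH is an illustrative cutoff:
from emic.inference import Spectral, SpectralConfig
from emic.analysis import analyze

MIN_SEGMENT_LENGTH = 500  # illustrative cutoff; tune for your data
for i, segment in enumerate(segments):
    if len(segment) < MIN_SEGMENT_LENGTH:
        continue  # too short to support inference on its own
    result = Spectral(SpectralConfig()).infer(segment)
    summary = analyze(result.machine)
    print(f"Segment {i}: length={len(segment)}, states={summary.num_states}, "
          f"Cμ={summary.statistical_complexity:.3f}")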
2.3 Stationarity Check¶
Epsilon-machines assume the process is stationary—statistics don't change over time.
Quick check: Compare first half to second half:
import numpy as np
data = [...] # Your data
n = len(data)
first_half = data[:n//2]
second_half = data[n//2:]
# Compare symbol frequencies
from collections import Counter
freq1 = Counter(first_half)
freq2 = Counter(second_half)
print("First half:", dict(freq1))
print("Second half:", dict(freq2))
# These should be similar
If non-stationary: Consider segmenting data or using a sliding window.
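For a finer-grained look than the half-split, a sliding-window sketch (standard library only) that prints symbol proportions per window so drift is easy to spot; the window and step sizes are illustrative:
from collections import Counter

def windowed_frequencies(data, window=1000, step=500):
    """Print symbol proportions for overlapping windows to reveal drift."""
    for start in range(0, max(1, len(data) - window + 1), step):
        chunk = data[start:start + window]
        counts = Counter(chunk)
        props = {sym: round(c / len(chunk), 3) for sym, c in sorted(counts.items())}
        print(f"[{start}:{start + window}] {props}")

windowed_frequencies(data)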
3. Alphabet Design¶
3.1 Size Matters¶
| Alphabet Size | Needed Data | Notes |
|---|---|---|
| 2 symbols | ~1,000+ | Works well |
| 4 symbols | ~5,000+ | Good balance |
| 8 symbols | ~20,000+ | Getting harder |
| 16+ symbols | 100,000+ | Often too sparse |
Rule of thumb: Expect to need \(O(|\mathcal{A}|^{L_{\max}})\) samples where \(|\mathcal{A}|\) is alphabet size and \(L_{\max}\) is max history length.
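A quick worked example of this rule of thumb (order-of-magnitude only):
# The number of distinct histories to estimate grows as |A|^L_max
for alphabet_size in (2, 4, 8):
    for max_history in (3, 5):
        histories = alphabet_size ** max_history
        print(f"|A|={alphabet_size}, L_max={max_history}: ~{histories} distinct histories")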
3.2 Strategies for Continuous Data¶
Threshold binning:
def threshold_bin(values, thresholds):
"""Map continuous values to discrete bins."""
bins = [0] * len(values)
for i, v in enumerate(values):
for j, thresh in enumerate(thresholds):
if v > thresh:
bins[i] = j + 1
return bins
# Example: Binary by median
median = np.median(values)
data = threshold_bin(values, [median])
# Example: Terciles
t1, t2 = np.percentile(values, [33, 67])
data = threshold_bin(values, [t1, t2])
Symbolization methods:
- SAX (Symbolic Aggregate approXimation): Popular for time series
- Quantile binning: Equal probability per symbol (see the sketch below)
- Domain-specific: Use natural categories when available
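A minimal quantile-binning sketch using numpy; values stands in for your continuous series, and the helper name is illustrative:
import numpy as np

def quantile_bin(values, n_symbols=4):
    """Assign each value to one of n_symbols equal-probability bins (0 .. n_symbols-1)."""
    edges = np.quantile(values, np.linspace(0, 1, n_symbols + 1)[1:-1])
    return np.digitize(values, edges).tolist()

values = np.random.randn(5000)  # placeholder for your continuous data
data = quantile_bin(values, n_symbols=4)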
3.3 Choosing Number of Symbols¶
Start small and increase only if needed:
- Binary (2 symbols): Try this first
- Check results: Does the machine capture expected structure?
- Increase if needed: Add symbols if structure seems missing
- Stop when: Adding symbols doesn't change Cμ/hμ meaningfully
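A hedged sketch of this loop, reusing the quantile_bin helper from the previous section and assuming a continuous series values:
from emic.inference import Spectral, SpectralConfig
from emic.analysis import analyze

# Grow the alphabet until Cμ and hμ stop changing meaningfully
for n_symbols in (2, 3, 4):
    symbols = quantile_bin(values, n_symbols=n_symbols)
    result = Spectral(SpectralConfig()).infer(symbols)
    summary = analyze(result.machine)
    print(f"{n_symbols} symbols: states={summary.num_states}, "
          f"Cμ={summary.statistical_complexity:.3f}, hμ={summary.entropy_rate:.3f}")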
4. Sample Size Considerations¶
4.1 How Much Data?¶
Minimum requirements (rough guidelines):
| States | Max History | Minimum N |
|---|---|---|
| 2-3 | 3 | 1,000 |
| 3-5 | 4 | 5,000 |
| 5-10 | 5 | 20,000 |
| 10+ | 6+ | 50,000+ |
4.2 Diagnosing Insufficient Data¶
Signs you need more data:
- Cμ keeps increasing with max_history
- Many states with similar predictive distributions
- Results change significantly with different random seeds
- Many rare transitions (< 5 occurrences)
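The last sign can be checked directly by counting how often each length-L history occurs (standard library only; the 5-occurrence threshold mirrors the guideline above):
from collections import Counter

def rare_history_fraction(data, history_length=4, threshold=5):
    """Fraction of observed length-L histories seen fewer than `threshold` times."""
    histories = Counter(
        tuple(data[i:i + history_length])
        for i in range(len(data) - history_length + 1)
    )
    rare = sum(1 for count in histories.values() if count < threshold)
    return rare / len(histories)

print(f"Rare histories: {rare_history_fraction(data):.1%}")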
4.3 Convergence Check¶
Run inference with increasing data sizes:
from emic.sources import SequenceData
from emic.inference import Spectral, SpectralConfig
from emic.analysis import analyze
data = [...] # Your full dataset
sizes = [500, 1000, 2000, 5000, 10000, len(data)]
for n in sizes:
if n > len(data):
continue
source = SequenceData(data[:n])
result = Spectral(SpectralConfig()).infer(source.get_data())
summary = analyze(result.machine)
print(f"N={n:6d}: States={summary.num_states}, Cμ={summary.statistical_complexity:.3f}")
Good sign: Values stabilize as N increases.
Bad sign: Values keep changing significantly.
5. Choosing an Algorithm¶
5.1 Algorithm Overview¶
| Algorithm | Best For | Pros | Cons |
|---|---|---|---|
| Spectral | General use | Fast, robust, no χ² | Less interpretable |
| CSSR | Understanding structure | Classic, well-studied | Sensitive to params |
| CSM | Clean data | Exact theory | Computationally expensive |
| BSI | Uncertainty quantification | Bayesian posteriors | Slow, complex |
| NSD | Manifold structure | Novel approach | Experimental |
5.2 Recommended Workflow¶
from emic.inference import Spectral, SpectralConfig, CSSR, CSSRConfig
# Step 1: Quick exploration with Spectral
result1 = Spectral(SpectralConfig(num_states=None)).infer(data) # Auto-detect
print(f"Spectral found {result1.machine.num_states} states")
# Step 2: Refine with CSSR for interpretability
config = CSSRConfig(
max_history=5,
significance_level=0.01,
post_merge=True,
)
result2 = CSSR(config).infer(data)
print(f"CSSR found {result2.machine.num_states} states")
# Step 3: Compare
from emic.analysis import analyze
s1, s2 = analyze(result1.machine), analyze(result2.machine)
print(f"Spectral: Cμ={s1.statistical_complexity:.3f}, hμ={s1.entropy_rate:.3f}")
print(f"CSSR: Cμ={s2.statistical_complexity:.3f}, hμ={s2.entropy_rate:.3f}")
5.3 When Algorithms Disagree¶
If Spectral and CSSR give very different results:
- Spectral finds more states: May be overfitting, try setting num_states explicitly
- CSSR finds more states: May need post-merging or lower significance level
- Both unstable: Likely insufficient data
6. Handling Noise¶
6.1 Types of Noise¶
Observation noise: The process emits the right symbol, but errors creep in when you record it. Example: sensor measurement error.
Process noise: The underlying process is itself stochastic. Example: random events affecting transitions.
Structural noise: Symbols are occasionally swapped or corrupted in the stored sequence. Example: transmission errors.
6.2 Robustness Strategies¶
Strategy 1: Lower significance level
# More conservative splitting (fewer spurious states)
config = CSSRConfig(max_history=4, significance_level=0.001)
Strategy 2: Post-merging
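For example, keep the usual splitting sensitivity but merge states with indistinguishable predictive distributions afterwards; post_merge is the same flag used in the workflow below:
from emic.inference import CSSRConfig

# Merge near-duplicate states after the splitting pass
config = CSSRConfig(max_history=4, significance_level=0.01, post_merge=True)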
Strategy 3: Spectral with fixed states
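If domain knowledge suggests roughly how many causal states to expect, pinning num_states keeps noise from inflating the machine; 3 is just an illustrative value:
from emic.inference import Spectral, SpectralConfig

# Fix the number of states instead of auto-detecting
result = Spectral(SpectralConfig(num_states=3)).infer(data)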
Strategy 4: Validation on held-out data
# Use subset for inference, test on rest
train_data = data[:len(data)//2]
test_data = data[len(data)//2:]
# Infer on train, evaluate on test (see validation section)
6.3 Simulating Noise Robustness¶
Test your pipeline's robustness:
from emic.sources import SequenceData, BitFlipNoise, TakeN
from emic.inference import Spectral, SpectralConfig
from emic.analysis import analyze
original_data = [...] # Your clean data (if you have it)
for noise_level in [0.0, 0.01, 0.02, 0.05, 0.10]:
# Add noise
source = SequenceData(original_data) >> BitFlipNoise(flip_prob=noise_level, seed=42)
noisy_data = list(source)
# Infer
result = Spectral(SpectralConfig()).infer(noisy_data)
summary = analyze(result.machine)
print(f"Noise {noise_level:.0%}: States={summary.num_states}, "
f"Cμ={summary.statistical_complexity:.3f}")
7. Validation Strategies¶
7.1 Log-Likelihood on Held-Out Data¶
The most rigorous validation: Does the inferred machine predict unseen data well?
import math
from emic.types import EpsilonMachine
def compute_log_likelihood(machine: EpsilonMachine, data: list[int]) -> float:
"""Compute per-symbol log-likelihood of data under machine."""
# Start from stationary distribution
state_probs = dict(machine.stationary_distribution)
log_lik = 0.0
for symbol in data:
# Probability of symbol given current state distribution
p_symbol = 0.0
new_state_probs = {}
for state, state_prob in state_probs.items():
for (s, sym), (next_state, trans_prob) in machine.transitions.items():
if s == state and sym == symbol:
p_symbol += state_prob * trans_prob
new_state_probs[next_state] = new_state_probs.get(next_state, 0.0) + state_prob * trans_prob
if p_symbol > 0:
log_lik += math.log2(p_symbol)
# Normalize for next step
total = sum(new_state_probs.values())
state_probs = {s: p/total for s, p in new_state_probs.items()}
else:
# Impossible transition - large penalty
log_lik -= 100
break
return log_lik / len(data) # Per-symbol
# Usage
train_data = data[:len(data)*3//4]
test_data = data[len(data)*3//4:]
result = Spectral(SpectralConfig()).infer(train_data)
ll = compute_log_likelihood(result.machine, test_data)
print(f"Test log-likelihood: {ll:.3f} bits/symbol")
Interpretation:
- Higher (less negative) is better
- Compare to the entropy rate: a well-fit machine should score close to -hμ
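A quick follow-up check, comparing the held-out score from the snippet above to the training machine's entropy rate:
from emic.analysis import analyze

summary = analyze(result.machine)
print(f"Held-out log-likelihood: {ll:.3f} bits/symbol")
print(f"Negative entropy rate:   {-summary.entropy_rate:.3f} bits/symbol")
# A held-out score far below -hμ suggests the machine overfits the training half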
7.2 Cross-Validation¶
For more robust estimates:
import numpy as np
def k_fold_cv(data: list[int], k: int = 5) -> list[float]:
"""K-fold cross-validation for epsilon-machine inference."""
n = len(data)
fold_size = n // k
scores = []
for i in range(k):
test_start = i * fold_size
test_end = (i + 1) * fold_size if i < k - 1 else n
train_data = data[:test_start] + data[test_end:]
test_data = data[test_start:test_end]
result = Spectral(SpectralConfig()).infer(train_data)
ll = compute_log_likelihood(result.machine, test_data)
scores.append(ll)
return scores
scores = k_fold_cv(data, k=5)
print(f"CV scores: {scores}")
print(f"Mean: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
7.3 Bootstrap Stability¶
Check if results are stable across resamples:
import random
def bootstrap_stability(data: list[int], n_bootstrap: int = 20, block_size: int = 100):
    """Assess stability of the inferred machine via a block bootstrap.

    Resampling individual symbols would destroy the serial dependence the
    machine is supposed to capture, so contiguous blocks are resampled with
    replacement and concatenated instead.
    """
    results = []
    n = len(data)
    block_size = min(block_size, n)
    for seed in range(n_bootstrap):
        rng = random.Random(seed)
        # Draw contiguous blocks until the resample matches the original length
        resampled = []
        while len(resampled) < n:
            start = rng.randint(0, n - block_size)
            resampled.extend(data[start:start + block_size])
        resampled = resampled[:n]
        result = Spectral(SpectralConfig()).infer(resampled)
        summary = analyze(result.machine)
        results.append({
            'states': summary.num_states,
            'c_mu': summary.statistical_complexity,
            'h_mu': summary.entropy_rate,
        })
    # Report variability across bootstrap replicates
    states = [r['states'] for r in results]
    c_mus = [r['c_mu'] for r in results]
    print(f"States: {min(states)}-{max(states)}, mode={max(set(states), key=states.count)}")
    print(f"Cμ: {np.mean(c_mus):.3f} ± {np.std(c_mus):.3f}")
8. Complete Workflow Example¶
Here's a complete example analyzing DNA sequence data:
"""
Example: Analyzing a DNA sequence for hidden structure.
"""
from collections import Counter
from emic.sources import SequenceData
from emic.inference import Spectral, SpectralConfig, CSSR, CSSRConfig
from emic.analysis import analyze
from emic.output import diagram
# =============================================================================
# Step 1: Load and convert data
# =============================================================================
# Simulated DNA sequence (in practice, load from FASTA file)
dna = "ATGCGATCGATCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGGAATTCCGG" * 50
print(f"Sequence length: {len(dna)}")
# Map to integers
mapping = {"A": 0, "T": 1, "G": 2, "C": 3}
data = [mapping[base] for base in dna]
# =============================================================================
# Step 2: Basic statistics
# =============================================================================
counts = Counter(data)
print(f"Symbol frequencies: {dict(counts)}")
total = sum(counts.values())
print(f"Proportions: {[counts[i]/total for i in range(4)]}")
# =============================================================================
# Step 3: Initial exploration with Spectral
# =============================================================================
source = SequenceData(data)
result = Spectral(SpectralConfig(num_states=None)).infer(source.get_data())
summary = analyze(result.machine)
print(f"\nSpectral results:")
print(f" States: {summary.num_states}")
print(f" Cμ = {summary.statistical_complexity:.4f} bits")
print(f" hμ = {summary.entropy_rate:.4f} bits/symbol")
# =============================================================================
# Step 4: Refine with CSSR
# =============================================================================
config = CSSRConfig(
max_history=6,
significance_level=0.01,
post_merge=True,
)
result2 = CSSR(config).infer(source.get_data())
summary2 = analyze(result2.machine)
print(f"\nCSSR results:")
print(f" States: {summary2.num_states}")
print(f" Cμ = {summary2.statistical_complexity:.4f} bits")
print(f" hμ = {summary2.entropy_rate:.4f} bits/symbol")
# =============================================================================
# Step 5: Validate
# =============================================================================
# Train/test split
train_data = data[:len(data)*3//4]
test_data = data[len(data)*3//4:]
result_train = Spectral(SpectralConfig()).infer(train_data)
ll = compute_log_likelihood(result_train.machine, test_data)  # helper from Section 7.1
print(f"Held-out log-likelihood: {ll:.3f} bits/symbol")
# =============================================================================
# Step 6: Visualize
# =============================================================================
mermaid = diagram(result2.machine)
print(f"\nState diagram:\n{mermaid}")
9. Domain-Specific Tips¶
9.1 Genomic Sequences¶
- Alphabet: Use 4 symbols (A, T, G, C) or 2 (purine/pyrimidine); see the sketch after this list
- History: Codons suggest history length of 3 or 6
- Stationarity: Gene regions may differ from intergenic
- Sample size: Genes are short—may need to pool similar sequences
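A minimal sketch of the purine/pyrimidine reduction mentioned above, using the same mapping pattern as elsewhere in this guide; dna stands in for your sequence string:
# Collapse A/T/G/C to purine (A, G) vs pyrimidine (C, T)
purine_pyrimidine = {"A": 0, "G": 0, "C": 1, "T": 1}
data = [purine_pyrimidine[base] for base in dna]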
9.2 Financial Time Series¶
- Discretization: Returns → {down, flat, up} or finer (see the sketch after this list)
- Stationarity: Use rolling windows, regime detection
- History: Daily patterns suggest history 5-20
- Noise: High—use conservative parameters
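A minimal sketch of the {down, flat, up} discretization from the first bullet; the function name and the flat_band width are illustrative choices:
import numpy as np

def discretize_returns(prices, flat_band=0.001):
    """Map a price series to symbols: 0 = down, 1 = flat, 2 = up."""
    log_returns = np.diff(np.log(prices))
    symbols = np.ones(len(log_returns), dtype=int)   # default: flat
    symbols[log_returns > flat_band] = 2             # up
    symbols[log_returns < -flat_band] = 0            # down
    return symbols.tolist()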
9.3 Natural Language¶
- Alphabet: Characters (26-52) or word categories
- History: Long dependencies—may need L_max > 10
- Size: Need millions of characters for good estimates
- Structure: Rich—expect many states
9.4 Sensor/IoT Data¶
- Discretization: Based on operating thresholds
- Missing values: Common—segment carefully
- Noise: Often high—consider smoothing first
- Periodicity: Check for diurnal/weekly patterns
9.5 Neuroscience (Spike Trains)¶
- Alphabet: Binary (spike/no-spike) or multi-unit
- Binning: Careful choice of time bin width (see the sketch after this list)
- Stationarity: Task-related changes
- Length: Often limited—pool across trials
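A minimal sketch of the time-binning step from the second bullet; bin_width is the key choice and the helper name is illustrative:
import numpy as np

def bin_spike_train(spike_times, bin_width=0.01, duration=None):
    """Convert spike times (seconds) into a binary sequence: 1 = at least one spike in the bin."""
    if duration is None:
        duration = max(spike_times)
    edges = np.arange(0.0, duration + bin_width, bin_width)
    counts, _ = np.histogram(spike_times, bins=edges)
    return (counts > 0).astype(int).tolist()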
10. Troubleshooting¶
10.1 "Too Many States"¶
Symptoms: 10+ states, Cμ seems too high
Causes and fixes:
1. Noise: Lower the significance level, use post_merge
2. Overfitting: Reduce max_history
3. True complexity: The process may genuinely be complex
# Diagnostic: after CSSR, check how many histories map to each causal state;
# states supported by only a few histories are likely noise artifacts.
10.2 "Too Few States"¶
Symptoms: 1-2 states when more expected
Causes and fixes:
1. Insufficient history: Increase max_history
2. Over-merging: Increase the significance level
3. Data issues: Check for errors in preprocessing
10.3 "Unstable Results"¶
Symptoms: Different runs give different answers
Causes and fixes:
1. Insufficient data: Get more data or reduce complexity
2. Near decision boundaries: States may be on the edge of merging
3. Stochastic algorithms: Fix random seeds, average over runs
10.4 "Non-Unifilar Machine"¶
Symptoms: Transitions are ambiguous
Causes and fixes:
1. Algorithm issue: Try a different algorithm
2. True ambiguity: The process may not admit a finite-state εM
3. Numerical issues: Check for very small probabilities
10.5 "Takes Too Long"¶
Symptoms: Inference doesn't complete
Causes and fixes:
1. Too much data: Subsample
2. Large alphabet: Reduce the symbol count
3. Long history: Reduce max_history
4. Wrong algorithm: Spectral is the fastest
Summary Checklist¶
Before running epsilon-machine inference on real data:
- [ ] Data is a sequence of integers (0, 1, 2, ...)
- [ ] Missing values handled appropriately
- [ ] Stationarity checked (or data segmented)
- [ ] Alphabet size is reasonable (≤ 8 ideally)
- [ ] Sample size is adequate for expected complexity
- [ ] Algorithm chosen based on data characteristics
- [ ] Validation strategy planned (train/test, CV)
Key parameters to tune:
- max_history: Start at 5, adjust based on results
- significance_level: 0.01 for noisy data, 0.05 for clean
- post_merge: True for noisy data
- num_states (Spectral): Set if you have domain knowledge
When in doubt:
1. Start with Spectral (fast, robust)
2. Verify with CSSR (interpretable)
3. Bootstrap for stability
4. Report confidence intervals
Happy analyzing! 🔬