Skip to content

CSSR: A Complete Guide

Understanding the Causal State Splitting Reconstruction algorithm step-by-step.

Prerequisites: Basic probability, hypothesis testing concepts

Reading time: 40-50 minutes


Table of Contents

  1. What is CSSR?
  2. The Core Idea: Predictive Equivalence
  3. Step 1: Building the Suffix Tree
  4. Step 2: Computing Predictive Distributions
  5. Step 3: Statistical Testing
  6. Step 4: State Splitting and Merging
  7. Worked Example: The Golden Mean Process
  8. Configuration Guide
  9. Common Pitfalls and Solutions
  10. Further Reading

1. What is CSSR?

1.1 The Algorithm in One Sentence

CSSR discovers the causal states of a process by finding groups of histories that make identical predictions about the future.

1.2 Historical Context

CSSR was introduced by Cosma Shalizi and Kristina Klinkner in 2004, building on the theoretical foundations of computational mechanics developed by James Crutchfield and colleagues since the 1980s.

The name stands for Causal State Splitting Reconstruction: - Causal States: The minimal sufficient statistics for prediction - Splitting: The algorithm starts with all histories in one group and splits them - Reconstruction: We're reconstructing the hidden epsilon-machine from observed data

1.3 Why CSSR?

Compared to other approaches:

Method Pros Cons
CSSR Well-studied, interpretable, consistent Sensitive to sample size
Spectral Faster, higher accuracy Less interpretable
BSI Uncertainty quantification Computationally expensive

CSSR remains the reference algorithm in the field because:

  • It directly implements the definition of causal states
  • Every step has clear statistical justification
  • Results are interpretable in terms of histories and predictions

2. The Core Idea: Predictive Equivalence

2.1 What Makes Two Histories "The Same"?

Consider watching a binary sequence from the Golden Mean process (no consecutive 1s allowed):

Observed: ... 0 1 0 0 1 0 1 0 0 ...

Now, suppose you've just seen: - History A: ...0 1 0 - History B: ...0 0 0

Are these histories "the same" for prediction purposes?

Yes! After seeing either history, the last symbol is 0, and from the Golden Mean machine: - After a 0, the next symbol is 0 or 1 with some probabilities - The specific earlier history doesn't matter—only that we ended with 0

2.2 The Formal Definition

Two histories \(h\) and \(h'\) are predictively equivalent if:

\[P(X_{t+1}, X_{t+2}, \ldots | H_t = h) = P(X_{t+1}, X_{t+2}, \ldots | H_t = h')\]

In words: the probability distribution over all possible futures is identical.

2.3 Causal States as Equivalence Classes

A causal state is an equivalence class of histories that are all predictively equivalent to each other.

For the Golden Mean: - State A: All histories ending in 0 → {0, 00, 10, 010, 100, ...} - State B: All histories ending in 1 → {1, 01, 001, 101, ...}

The epsilon-machine has exactly as many states as there are distinct predictive equivalence classes.

2.4 Why This Definition?

This definition is special because causal states are:

  1. Sufficient: Knowing your causal state is enough to make optimal predictions
  2. Minimal: No simpler representation preserves all predictive information
  3. Unique: There's exactly one set of causal states for any process

3. Step 1: Building the Suffix Tree

3.1 What is a Suffix Tree?

A suffix tree stores all observed subsequences (suffixes) of the data, organized by their structure.

Given the sequence: 0 1 0 0 1

The suffix tree contains: - Length 1: 0 (seen 3×), 1 (seen 2×) - Length 2: 01 (seen 2×), 10 (seen 1×), 00 (seen 1×) - Length 3: 010 (seen 1×), 100 (seen 1×), 001 (seen 1×) - And so on...

3.2 Why Suffixes?

Suffixes represent histories—the symbols leading up to the current moment. We group histories by their suffix because:

If two histories share the same suffix of length \(L\), and \(L\) is long enough to capture all memory, then they're predictively equivalent.

The parameter max_history (often called \(L\)) sets how far back we look.

3.3 Building the Tree in emic

from emic.sources import GoldenMeanSource, TakeN
from emic.inference import CSSR, CSSRConfig

# Generate data
source = GoldenMeanSource(p=0.5, _seed=42)
data = TakeN(10_000)(source)

# CSSR with max_history=5
config = CSSRConfig(max_history=5)
result = CSSR(config).infer(data)

Internally, CSSR builds a suffix tree containing all length-1 through length-5 suffixes, counting how often each occurs and what symbol follows.

3.4 Counts in the Suffix Tree

For each suffix \(s\), we track: - \(N(s)\): How many times suffix \(s\) appears - \(N(s, x)\): How many times symbol \(x\) follows suffix \(s\)

These counts let us estimate the predictive distribution:

\[\hat{P}(x | s) = \frac{N(s, x)}{N(s)}\]

4. Step 2: Computing Predictive Distributions

4.1 From Counts to Probabilities

For each suffix (potential history), we compute the empirical next-symbol distribution.

Example from 10,000 Golden Mean symbols:

Suffix N(suffix) N(→0) N(→1) P(0|suffix) P(1|suffix)
0 6,666 3,333 3,333 0.500 0.500
1 3,333 3,333 0 1.000 0.000
00 3,333 1,667 1,666 0.500 0.500
10 3,333 1,667 1,666 0.500 0.500
01 3,333 3,333 0 1.000 0.000

Notice: - All suffixes ending in 0 have distribution ≈ (0.5, 0.5) - All suffixes ending in 1 have distribution ≈ (1.0, 0.0)

4.2 The Key Question

Given these empirical distributions, CSSR must decide:

Are suffixes 0, 00, 10, 010, etc. truly from the same causal state, or do the small differences in their empirical distributions indicate they should be separate states?

This requires statistical hypothesis testing.


5. Step 3: Statistical Testing

5.1 The Null Hypothesis

For any two suffixes \(s\) and \(s'\), we test:

\[H_0: P(\cdot | s) = P(\cdot | s')\]

"The true predictive distributions are identical."

5.2 The Chi-Squared Test

CSSR uses the chi-squared test for homogeneity. Given: - Suffix \(s\) with counts \((n_{s,0}, n_{s,1}, \ldots)\) - Suffix \(s'\) with counts \((n_{s',0}, n_{s',1}, \ldots)\)

The test statistic is:

\[\chi^2 = \sum_{x \in \mathcal{A}} \frac{(O_x - E_x)^2}{E_x}\]

where \(O_x\) are observed counts and \(E_x\) are expected counts under \(H_0\).

5.3 Example Calculation

Compare suffix 0 (1000 observations: 500→0, 500→1) with suffix 00 (500 observations: 250→0, 250→1):

Both have the same proportions (50/50), so the chi-squared statistic is 0.

Compare suffix 0 with suffix 1 (500 observations: 500→0, 0→1):

The proportions differ dramatically, giving a large chi-squared value that exceeds the critical threshold.

5.4 The Significance Level

The significance parameter (α) controls how strict we are:

config = CSSRConfig(
    max_history=5,
    significance=0.001,  # Very strict: require strong evidence to split
)
Significance Effect
0.001 Conservative: fewer states, may under-split
0.01 Balanced: typical choice
0.05 Liberal: more states, may over-split

Rule of thumb: Use smaller significance (0.001) with large datasets, larger significance (0.01-0.05) with small datasets.


6. Step 4: State Splitting and Merging

6.1 The CSSR Iteration

CSSR proceeds in rounds, increasing the history length from 1 to max_history:

Round L = 1: 1. Group all length-1 suffixes by their predictive distribution 2. Suffixes 0 and 1 have different distributions → 2 initial states

Round L = 2: 1. For each length-2 suffix, check if it's consistent with its "parent" state 2. Is 00's distribution consistent with state containing 0? Yes (both ≈ 50/50) 3. Is 01's distribution consistent with state containing 1? Yes (both ≈ 100/0) 4. No splits needed

Round L = 3, 4, 5: 1. Same process—check each longer suffix against its state 2. For Golden Mean, no further splits needed

Final result: 2 causal states

6.2 When Splitting Occurs

Splitting happens when a suffix's distribution is significantly different from its parent state's distribution.

Example: The Even Process (1s must come in pairs)

Suffix Distribution Interpretation
0 (0.5, 0.5) Can emit 0 or 1
1 (0.0, 1.0) Must emit 1 (completing pair)
01 (0.5, 0.5) Back to neutral
11 (0.5, 0.5) Pair complete, back to neutral

At first, 0 and 01 and 11 seem similar. But: - After 0, we might see 0 or 1 - After 01, we might see 0 or 1 - After 11, we might see 0 or 1

But wait—what about 10? It never occurs! This reveals the structure.

6.3 Post-Merge Phase

After the main iteration, CSSR optionally performs a merge phase:

  1. Compare all pairs of final states
  2. If two states have statistically indistinguishable distributions, merge them
  3. This catches "over-splitting" caused by insufficient data
config = CSSRConfig(
    max_history=5,
    significance=0.001,
    post_merge=True,  # Enable merging (default)
    merge_significance=0.05,  # Less strict for merging
)

7. Worked Example: The Golden Mean Process

Let's trace CSSR step-by-step on 1,000 symbols from the Golden Mean process.

7.1 The Data

from emic.sources import GoldenMeanSource, TakeN

source = GoldenMeanSource(p=0.5, _seed=42)
data = list(TakeN(1000)(source))
print(''.join(map(str, data[:50])))
# Output: 01010010010100101010010101010010010101001010010100

7.2 Suffix Counts (L=1)

Suffix Count →0 →1 P(0) P(1)
0 618 303 315 0.49 0.51
1 381 381 0 1.00 0.00

Chi-squared test between 0 and 1: χ² = 423.7 (p < 0.001)

Result: Create 2 states - State A = {0} - State B = {1}

7.3 Suffix Counts (L=2)

Suffix Count →0 →1 P(0) P(1) Parent
00 175 83 92 0.47 0.53 A (0)
10 443 220 223 0.50 0.50 A (0)
01 381 381 0 1.00 0.00 B (1)

Compare 00 vs State A distribution: χ² = 0.42 (p = 0.52) → Consistent, no split Compare 10 vs State A distribution: χ² = 0.12 (p = 0.73) → Consistent, no split Compare 01 vs State B distribution: χ² = 0.00 (p = 1.00) → Consistent, no split

Result: Still 2 states (no changes)

7.4 Continuing to L=5

Each round, we check longer suffixes against their states. The Golden Mean has memory of only 1 symbol, so longer suffixes never reveal new structure.

7.5 Final Machine

from emic.inference import CSSR, CSSRConfig
from emic.analysis import analyze

config = CSSRConfig(max_history=5, significance=0.001)
result = CSSR(config).infer(data)

print(f"States: {len(result.machine.states)}")  # 2
print(f"Converged: {result.converged}")  # True

summary = analyze(result.machine)
print(f"Cμ = {summary.statistical_complexity:.4f}")  # ≈ 0.918

The inferred machine:

State A (ending in 0): P(0→A) = 0.5, P(1→B) = 0.5
State B (ending in 1): P(0→A) = 1.0

This exactly matches the theoretical Golden Mean machine!


8. Configuration Guide

8.1 Essential Parameters

from emic.inference import CSSR, CSSRConfig

config = CSSRConfig(
    max_history=5,      # How far back to look
    significance=0.001, # Chi-squared test threshold
)

8.2 max_history (L)

The maximum history length to consider.

Value Effect Use When
3-4 Fast, may miss long-range structure Exploratory analysis
5-6 Good balance Most applications
7-10 Captures long memory Complex processes, large data
>10 Very slow, needs huge data Rarely needed

Rule of thumb: Start with max_history=5. Increase if you get 1 state when expecting more.

8.3 significance (α)

The p-value threshold for statistical tests.

Value Effect Use When
0.001 Few states (conservative) Large data (N > 10,000)
0.01 Balanced Medium data (1,000-10,000)
0.05 More states (liberal) Small data (N < 1,000)

Rule of thumb: Use 0.001 by default; increase if under-splitting.

8.4 min_count

Minimum observations required for a suffix to be considered.

config = CSSRConfig(
    max_history=5,
    significance=0.001,
    min_count=5,  # Require at least 5 observations
)

This prevents unreliable estimates from rare suffixes.

8.5 post_merge

Whether to merge statistically indistinguishable states after convergence.

config = CSSRConfig(
    max_history=6,
    significance=0.001,
    post_merge=True,        # Enable merging
    merge_significance=0.05, # Less strict threshold for merging
)

Tip: Use post_merge=True when you get more states than expected.


9. Common Pitfalls and Solutions

9.1 Too Many States

Symptom: CSSR returns 10 states when you expect 2.

Causes: 1. Insufficient data: Random fluctuations create spurious differences 2. Over-strict significance: α = 0.001 splits on small differences 3. max_history too large: Creates rare suffixes with noisy estimates

Solutions:

# Option 1: More data
data = TakeN(100_000)(source)

# Option 2: Less strict testing
config = CSSRConfig(significance=0.01)

# Option 3: Enable post-merge
config = CSSRConfig(post_merge=True, merge_significance=0.1)

# Option 4: Increase min_count
config = CSSRConfig(min_count=10)

9.2 Too Few States (Just 1)

Symptom: CSSR returns 1 state for a complex process.

Causes: 1. max_history too small: Doesn't capture the memory 2. Over-liberal significance: α = 0.05 doesn't split different distributions 3. Not enough data: Estimates are too noisy to detect differences

Solutions:

# Option 1: Increase history length
config = CSSRConfig(max_history=8)

# Option 2: Stricter testing
config = CSSRConfig(significance=0.001)

# Option 3: More data
data = TakeN(50_000)(source)

9.3 Non-Convergence

Symptom: result.converged = False

Causes: 1. Process has very long memory (exceeds max_history) 2. Process is non-stationary 3. Alphabet is too large (creates sparse counts)

Solutions:

# Check the iteration count
print(f"Iterations: {result.iterations}")

# Try larger history
config = CSSRConfig(max_history=10)

# Or accept the approximate result
if not result.converged:
    print("Warning: Results may be approximate")

9.4 The Even Process Trap

The Even Process is notoriously difficult for CSSR because: - It has 2 true states - But finite samples often suggest 3-4 states - The constraint (even runs of 1s) creates subtle correlations

Solution:

config = CSSRConfig(
    max_history=6,
    significance=0.01,
    post_merge=True,
    merge_significance=0.1,  # Aggressive merging
)


10. Further Reading

10.1 Original Papers

  1. Shalizi & Klinkner (2004): "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences"
  2. The original CSSR paper
  3. Technical details and proofs

  4. Shalizi & Crutchfield (2001): "Computational Mechanics: Pattern and Prediction, Structure and Simplicity"

  5. Theoretical foundations
  6. Causal state definition and properties

10.2 Reference Implementation

The canonical CSSR implementation is available at:

The emic implementation follows the same algorithmic structure but with modern Python idioms and additional features like post-merging.

  • Spectral Learning: Matrix-based approach, often more accurate
  • See: Spectral Learning Deep Dive
  • CSM (Causal State Merging): Bottom-up instead of top-down
  • BSI (Bayesian Structural Inference): Bayesian approach with uncertainty

10.3 Comparison with CSSR

For a comparison of all algorithms in emic:

from emic.sources import GoldenMeanSource, TakeN
from emic.inference import CSSR, CSM, Spectral, CSSRConfig, CSMConfig, SpectralConfig
from emic.analysis import analyze

source = GoldenMeanSource(p=0.5, _seed=42)
data = TakeN(10_000)(source)

algorithms = [
    ("CSSR", CSSR(CSSRConfig(max_history=5))),
    ("CSM", CSM(CSMConfig(history_length=5))),
    ("Spectral", Spectral(SpectralConfig())),
]

for name, algo in algorithms:
    result = algo.infer(data)
    summary = analyze(result.machine)
    print(f"{name}: {summary.num_states} states, Cμ = {summary.statistical_complexity:.4f}")

11. Algorithm Correctness and Benchmarks

11.1 Validation Against Shalizi (2004)

The emic CSSR implementation has been validated against the benchmarks published in the original Shalizi & Klinkner (2004) paper. The paper reports that with parameters \(N = 10^4\), \(\alpha = 10^{-3}\), and \(L_{\max} \geq 3\), CSSR achieves 100% correct reconstruction for the Even Process (2 states) and related processes.

emic benchmark results (January 2026):

Process Sample Size Expected States emic States Status
Even Process 10,000 2 2 ✓ PASS
Golden Mean 10,000 2 2 ✓ PASS
Biased Coin 100,000 1 1 ✓ PASS

11.2 Sample Size Sensitivity

CSSR performance depends strongly on sample size:

Sample Size Typical Correctness Notes
N < 1,000 60-70% High variance, may over-split
N = 10,000 90-100% Shalizi benchmark threshold
N ≥ 100,000 ~100% Reliable reconstruction

Recommendation: Use at least N = 10,000 samples for reliable inference.

11.3 Comparison with Other Algorithms

Quick benchmark on canonical processes:

Algorithm N=1,000 N=10,000 N=100,000
CSSR 67% 89% 100%
Spectral 100% 100% 100%
CSM 56% 78% 89%

Key finding: Spectral Learning consistently outperforms CSSR and CSM on correctness, especially at smaller sample sizes. However, CSSR remains the reference algorithm due to its interpretability and direct implementation of causal state definitions.

11.4 Implementation Details

The emic CSSR implementation includes several enhancements over naive implementations:

  1. Synchronizing history detection: Identifies histories that uniquely determine causal state vs. ambiguous histories that mix multiple states.

  2. Proportion-based tolerance checks: Uses 15% tolerance on proportions before applying chi-squared tests, preventing over-splitting on IID processes.

  3. Lenient merge thresholds: Uses minimum 10% significance for merging to avoid over-splitting from statistical fluctuations with large samples.

  4. Adaptive thresholds: The algorithm adjusts sensitivity based on sample statistics rather than using fixed thresholds.

These enhancements ensure the implementation matches published benchmark performance while remaining robust across different process types.


Summary

CSSR is a top-down algorithm that:

  1. Builds a suffix tree of observed histories
  2. Computes predictive distributions for each suffix
  3. Tests whether suffixes have statistically different predictions
  4. Splits suffixes into separate states when they differ
  5. Merges over-split states in post-processing

The key parameters are: - max_history: How much memory to consider - significance: How strict the statistical tests are - post_merge: Whether to clean up over-splitting

CSSR directly implements the definition of causal states, making it the reference algorithm for epsilon-machine inference.