T is an experimental programming language for declarative, functional manipulation of tabular data. Inspired by R’s tidyverse and OCaml’s semantic rigor, T is designed to make data analysis explicit, inspectable, and pipeline-oriented.
Unlike traditional scripting languages, T is built from the ground up to support human–LLM collaborative programming, where humans specify intent and constraints, and language tools (including LLMs) generate localized, mechanical code.
Status: Pre-alpha; under active design and implementation.
Documentation
Getting Started
- Getting Started Guide — first steps with T
- Installation Guide — detailed setup with Nix
- Language Overview — types, syntax, functions, and standard library
- Numerical Arrays — tutorial on N-dimensional arrays and linear algebra
User Guides
- API Reference — complete function reference by package
- Data Manipulation Examples — practical examples with core data verbs
- Pipeline Tutorial — step-by-step guide to T's pipeline model
- Comprehensive Examples — real-world analysis patterns
- Error Handling Guide — error patterns and recovery strategies
Advanced Topics
- Reproducibility Guide — Nix integration and reproducible workflows
- LLM Collaboration — intent blocks and AI-assisted development
- Quotation & Metaprogramming — capturing and generating code
- Statistical Formulas — formula syntax for modeling
- Performance — Arrow backend and optimization
Developer Resources
- Architecture — language design and implementation
- Contributing Guide — how to contribute to T
- Development Guide — building, testing, and debugging
- Package Development Guide — creating and publishing T packages
Reference & Support
- FAQ — frequently asked questions
- Troubleshooting — common issues and solutions
- Changelog — version history and roadmap
- Code of Ethics — standards of conduct
Design Goals
- Data analysis as explicit pipelines
- First-class tabular data (DataFrame-centric)
- Expression-oriented, functional style
- Explicit semantics (no hidden rules, no implicit NA propagation)
- Minimal OCaml core, extensible via packages
- Deterministic execution and inspectable errors
- REPL-first exploratory workflow
- LLM-friendly structure and tooling
LLM-Native by Design
T treats large language models as first-class collaborators, not magic code generators. The language and tooling are designed to make LLM-generated code:
- Local rather than global
- Constrained rather than free-form
- Inspectable rather than opaque
- Correctable rather than brittle
Humans define intent, assumptions, and invariants. LLMs generate localized code. T enforces semantics and correctness.
Intent Blocks
T supports intent blocks: structured comments that encode analytical goals, assumptions, and checks in a machine-readable way.
-- intent:
--   goal: "Estimate approval as a function of age and income"
--   assumptions:
--     - age is approximately linear
--     - missing income is non-random
--   checks:
--     - no negative income
--     - at least 100 observations per group
Intent blocks are preserved by tooling, version-controlled with code, and used as stable regeneration boundaries for LLM-assisted workflows.
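To show what "machine-readable" means in practice, here is an illustrative parser sketch in Python. It is not part of T's tooling; the field names (goal, assumptions, checks) simply follow the example above.

```python
# Illustrative sketch: extracting a T intent block from "--" comment lines.
# Not T's tooling; field names follow the README example.

def parse_intent(lines):
    """Extract goal, assumptions, and checks from an intent block."""
    intent = {"goal": None, "assumptions": [], "checks": []}
    section = None
    for raw in lines:
        text = raw.lstrip("- ").strip()  # drop the comment/list prefix
        if text.startswith("intent:"):
            continue
        if text.startswith("goal:"):
            intent["goal"] = text[len("goal:"):].strip().strip('"')
        elif text.startswith("assumptions:"):
            section = "assumptions"
        elif text.startswith("checks:"):
            section = "checks"
        elif section and text:
            intent[section].append(text)
    return intent

block = [
    "-- intent:",
    '--   goal: "Estimate approval as a function of age and income"',
    "--   assumptions:",
    "--     - age is approximately linear",
    "--     - missing income is non-random",
    "--   checks:",
    "--     - no negative income",
    "--     - at least 100 observations per group",
]
print(parse_intent(block)["goal"])
# Estimate approval as a function of age and income
```

Because the block is plain structured comments, any tool in the chain (formatter, version control, an LLM regenerating a step) can read and preserve it without understanding T itself.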
Pipelines
Pipelines are T’s core execution model. Each pipeline is a DAG of named nodes with explicit dependencies, cacheable results, and inspectable outputs. T supports multi-runtime execution, allowing nodes to run in T, R, or Python with automatic data interchange via Arrow and PMML.
pipeline analysis {
raw = { read_csv("data.csv") }
cleaned = {
raw |> filter($age > 18)
|> mutate($income_k = $income / 1000)
}
model = node(
command = <{
# In R
fit <- lm(approval ~ age + income_k, data = cleaned)
fit
}>,
runtime = "R",
serializer = "pmml"
)
}
Pipelines enable local reasoning, reproducibility, and safe regeneration of individual steps without rewriting entire scripts.
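The execution model — named nodes, explicit dependencies, cacheable results — can be sketched in a few lines of Python. This is an illustrative model of the semantics, not T's implementation; node names mirror the example above.

```python
# Illustrative model of T's pipeline semantics: a DAG of named nodes
# with explicit dependencies and memoized (cacheable) results.

class Pipeline:
    def __init__(self):
        self.nodes = {}  # name -> (dependencies, function)
        self.cache = {}  # name -> computed result

    def node(self, name, deps, fn):
        self.nodes[name] = (deps, fn)

    def run(self, name):
        if name in self.cache:              # reuse a cached result
            return self.cache[name]
        deps, fn = self.nodes[name]
        args = [self.run(d) for d in deps]  # resolve dependencies first
        self.cache[name] = fn(*args)
        return self.cache[name]

p = Pipeline()
p.node("raw", [], lambda: [{"age": 25, "income": 50000},
                           {"age": 15, "income": 0}])
p.node("cleaned", ["raw"], lambda rows: [
    {**r, "income_k": r["income"] / 1000} for r in rows if r["age"] > 18
])
print(p.run("cleaned"))
# [{'age': 25, 'income': 50000, 'income_k': 50.0}]
```

Because each node names its dependencies explicitly, regenerating one step invalidates only that node's cache entry and its downstream nodes.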
Language Features
- R-style lambdas: \(x) x + 1
- Conditional pipe operator: |> (short-circuits on error)
- Maybe-pipe operator: ?|> (forwards errors for recovery)
- Dollar-prefix NSE: select($name, $age), filter($age > 30), summarize($total = sum($amount))
- Named arguments: lm(data = df, formula = y ~ x)
- NA handling: mean(data, na_rm = true) — all aggregation functions support na_rm
- CSV I/O: read_csv(path, sep = ";", skip_lines = 2) and write_csv(df, path, sep = ";")
- Python-style comprehensions: [x * x for x in xs if x > 2]
- Dictionary literals: [name: "Alice", age: 30]
- Multi-language orchestration: run R or Python code as pipeline nodes
- Model interchange: seamless PMML import of R/Python models into T
- Errors as values, not exceptions
- Actionable error messages with name suggestions, type conversion hints, and function signatures
- Explicit, typed missing values
Pipe Operators
T provides two pipe operators with different error-handling semantics:
Conditional Pipe: |>
The standard pipe passes the left-hand value as the first argument to the right-hand function. If the left-hand value is an error, the pipeline short-circuits and the error is returned without calling the function.
5 |> double -- 10
error("boom") |> double -- Error (short-circuited)
Maybe-Pipe: ?|>
The maybe-pipe always forwards the left-hand value — including errors — to the right-hand function. This enables explicit error recovery patterns.
-- Recover from errors:
handle = \(x) if (is_error(x)) "recovered" else x
error("boom") ?|> handle -- "recovered"
-- Chain recovery with normal processing:
recovery = \(x) if (is_error(x)) 0 else x
increment = \(x) x + 1
error("fail") ?|> recovery |> increment -- 1
Together, |> and ?|> enable
Railway-Oriented Programming in T: errors flow through pipelines as
explicit values, and recovery logic is composable.
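For readers coming from other languages, the error-handling semantics of the two operators can be modeled in Python. This is an illustrative sketch, not T's implementation; Err stands in for T's errors-as-values.

```python
# Illustrative model of T's pipe semantics. Err stands in for T's
# errors-as-values; pipe models |>, maybe_pipe models ?|>.

class Err:
    def __init__(self, msg):
        self.msg = msg

def is_error(x):
    return isinstance(x, Err)

def pipe(x, fn):        # |>  : short-circuits on error
    return x if is_error(x) else fn(x)

def maybe_pipe(x, fn):  # ?|> : always forwards, even errors
    return fn(x)

double = lambda x: x * 2
recovery = lambda x: 0 if is_error(x) else x
increment = lambda x: x + 1

print(pipe(5, double))                                     # 10
print(is_error(pipe(Err("boom"), double)))                 # True (short-circuited)
print(pipe(maybe_pipe(Err("fail"), recovery), increment))  # 1
```

The last line mirrors the T example above: ?|> lets the recovery lambda see the error, and the result then flows through the normal |> stage.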
Numerical Backend
T’s numerical stack is layered:
- Tabular layer: Apache Arrow for columnar data and zero-copy interoperability
- Interchange layer: PMML for high-fidelity cross-language model transfer
- Compute layer: Owl for linear algebra, optimization, and statistics
- Fallback layer: Selective C bindings (e.g. LAPACK, GSL) when needed
This approach prioritizes fast development, explicit semantics, and safe defaults, while leaving room for future performance upgrades.
Standard Packages
- core: functional utilities (map, sum, seq)
- stats: statistical primitives (mean, sd, quantile, cor, lm)
- colcraft: DataFrame operations (select, filter, mutate, group_by, summarize) and window functions (row_number, min_rank, dense_rank, lag, lead, cumsum, etc.)
Packages are part of the standard library and loaded by default. Each function lives in its own file.
Missing Value Handling
T uses explicit NA values with type tags. NA does not propagate
implicitly — operations on NA produce errors by default:
mean([1, NA, 3])  -- Error: NA encountered
sum([1, NA, 3])   -- Error: NA encountered
To skip NA values, use the na_rm = true parameter:
mean([1, NA, 3], na_rm = true)   -- 2.0
sum([1, NA, 3], na_rm = true)    -- 4
sd([2, NA, 4, 9], na_rm = true)  -- 3.61
cor(x, y, na_rm = true)          -- pairwise deletion
All aggregation functions (mean, sum, sd,
quantile, cor) support the na_rm parameter.
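The contract — error by default, skip only when asked — can be modeled outside T. This is an illustrative Python sketch; None stands in for T's typed NA.

```python
# Illustrative model of T's NA contract: aggregations error on NA by
# default and skip NA only when na_rm is set. None stands in for NA.

NA = None

def t_mean(xs, na_rm=False):
    if not na_rm and any(x is NA for x in xs):
        raise ValueError("NA encountered")
    vals = [x for x in xs if x is not NA]
    return sum(vals) / len(vals)

print(t_mean([1, NA, 3], na_rm=True))  # 2.0
```

The point of the default is that silent NA propagation (as in many dynamic languages) cannot mask a data-quality problem: the caller must opt in to skipping.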
Window Functions and NA
Window functions compute values across a set of rows without collapsing them. All window functions handle NA gracefully:
- Ranking (row_number, min_rank, dense_rank, cume_dist, percent_rank, ntile): NA positions get NA rank; ranks computed only among non-NA values
- Offset (lag, lead): NA values pass through unchanged
- Cumulative (cumsum, cummin, cummax, cummean, cumall, cumany): NA propagates to all subsequent values (matching R)
row_number([3, NA, 1])  -- Vector[2, NA, 1]
cumsum([1, NA, 3])      -- Vector[1, NA, NA]
lag([1, NA, 3])         -- Vector[NA, 1, NA]
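Two of the rules above — offset functions passing NA through, cumulative functions poisoning the rest of the window — can be sketched in Python. This is an illustrative model of the stated semantics, not T's implementation; None stands in for NA.

```python
# Illustrative model of two window-function NA rules:
# lag shifts values and passes NA through unchanged;
# cumsum propagates NA to all subsequent values (matching R).

NA = None

def lag(xs, n=1):
    return [NA] * n + xs[:-n]      # shifted-in positions are NA

def cumsum(xs):
    out, total, saw_na = [], 0, False
    for x in xs:
        if x is NA or saw_na:      # NA poisons the rest of the window
            saw_na = True
            out.append(NA)
        else:
            total += x
            out.append(total)
    return out

print(lag([1, NA, 3]))     # [None, 1, None]
print(cumsum([1, NA, 3]))  # [1, None, None]
```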
Alpha Roadmap
The alpha version of T targets a complete, end-to-end workflow:
- Stable core syntax and interpreter
- Arrow-backed DataFrames
- DAG-based pipelines with Nix-managed caching
- Multi-runtime support (T, R, Python)
- Core data verbs (select, filter, group_by, summarize)
- Grouped operations (group_by |> mutate, group_by |> summarize)
- Window functions with NA handling (row_number, lag, lead, cumsum, etc.)
- PMML model interchange with tidy statistics bridge
- Basic statistics and modeling with NA handling (na_rm parameter)
- Intent blocks and tooling hooks
- REPL and CLI
Performance tuning, GPUs, and distributed execution are explicitly out of scope for alpha.
Project Structure
.
├── flake.nix
├── ast.ml
├── parser.ml
├── lexer.ml
├── eval.ml
├── repl.ml
├── pipeline.ml
├── dataframe.ml
└── packages/
├── core/
├── stats/
└── colcraft/
Building
nix develop
t repl
Contributing
Contributions focus on clarity, explicit semantics, and small, reviewable changes. Packages live in-repo during early development.
License: EUPL v1.2.