T is an experimental programming language for declarative, functional manipulation of tabular data. Inspired by R’s tidyverse and OCaml’s semantic rigor, T is designed to make data analysis explicit, inspectable, and pipeline-oriented.
Unlike traditional scripting languages, T is built from the ground up to support human–LLM collaborative programming, where humans specify intent and constraints, and language tools (including LLMs) generate localized, mechanical code.
Status: Pre-alpha. Actively designed and implemented.
T treats large language models as first-class collaborators, not magic code generators. The language and tooling are designed to keep LLM-generated code localized and checkable.
Humans define intent, assumptions, and invariants. LLMs generate localized code. T enforces semantics and correctness.
T supports intent blocks: structured comments that encode analytical goals, assumptions, and checks in a machine-readable way.
-- intent:
--   goal: "Estimate approval as a function of age and income"
--   assumptions:
--     - age is approximately linear
--     - missing income is non-random
--   checks:
--     - no negative income
--     - at least 100 observations per group
Intent blocks are preserved by tooling, version-controlled with code, and used as stable regeneration boundaries for LLM-assisted workflows.
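To make "machine-readable" concrete, here is a minimal Python sketch of how a tool might read an intent block. It is not T's actual tooling: `parse_intent` is a hypothetical helper, and the comment layout it assumes (one key per line, list items starting with `- `) is inferred from the example above.

```python
def parse_intent(source):
    """Parse the first `-- intent:` comment block into a dict (hypothetical helper)."""
    intent = {}
    current_key = None
    in_block = False
    for raw in source.splitlines():
        line = raw.strip()
        if not line.startswith("--"):
            if in_block:
                break  # the block ends at the first non-comment line
            continue
        body = line[2:].strip()
        if body == "intent:":
            in_block = True
            continue
        if not in_block or not body:
            continue
        if body.startswith("- "):
            # A list item belongs to the most recent key that opened a list.
            if isinstance(intent.get(current_key), list):
                intent[current_key].append(body[2:].strip())
            continue
        key, _, value = body.partition(":")
        current_key = key.strip()
        value = value.strip()
        # A key with an inline value is a scalar; a bare key opens a list.
        intent[current_key] = value.strip('"') if value else []
    return intent


EXAMPLE = """-- intent:
--   goal: "Estimate approval as a function of age and income"
--   assumptions:
--     - age is approximately linear
--     - missing income is non-random
--   checks:
--     - no negative income
--     - at least 100 observations per group
raw |> filter(age > 18)
"""

parsed = parse_intent(EXAMPLE)
print(parsed["goal"])
print(parsed["checks"])
```

Because the block is ordinary comment text, a parser like this can locate the goal, assumptions, and checks without understanding the code that follows it.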
Pipelines are T’s core execution model. Each pipeline is a DAG of named nodes with explicit dependencies, cacheable results, and inspectable outputs.
pipeline analysis {
  raw = { read_csv("data.csv") }
  cleaned = {
    raw |> filter(age > 18)
        |> mutate(income_k = income / 1000)
  }
  model = {
    cleaned |> lm(approval ~ age + income_k)
  }
}
Pipelines enable local reasoning, reproducibility, and safe regeneration of individual steps without rewriting entire scripts.
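The idea of a DAG of named, cacheable nodes can be modeled in a few lines of Python. This is a sketch of the concept only, not T's evaluator: the `Pipeline` class, its `node`, `run`, and `invalidate` methods, and the in-memory rows are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter


class Pipeline:
    """Toy model of a pipeline: named nodes, explicit dependencies, cached results."""

    def __init__(self):
        self.nodes = {}  # name -> (dependency names, function)
        self.cache = {}  # name -> computed result

    def node(self, name, deps, fn):
        self.nodes[name] = (tuple(deps), fn)

    def run(self):
        graph = {name: set(deps) for name, (deps, _) in self.nodes.items()}
        # Visit nodes so that every dependency is computed before its dependents.
        for name in TopologicalSorter(graph).static_order():
            if name in self.cache:
                continue  # reuse the cached result
            deps, fn = self.nodes[name]
            self.cache[name] = fn(*(self.cache[d] for d in deps))
        return self.cache

    def invalidate(self, name):
        """Drop a node's cached result and those of everything downstream of it."""
        self.cache.pop(name, None)
        for other, (deps, _) in self.nodes.items():
            if name in deps:
                self.invalidate(other)


calls = []  # records which nodes actually executed


def raw():
    calls.append("raw")
    return [{"age": 17, "income": 20000}, {"age": 42, "income": 55000}]


def cleaned(rows):
    calls.append("cleaned")
    return [dict(r, income_k=r["income"] / 1000) for r in rows if r["age"] > 18]


def summary(rows):
    calls.append("summary")
    return sum(r["income_k"] for r in rows) / len(rows)


p = Pipeline()
p.node("raw", [], raw)
p.node("cleaned", ["raw"], cleaned)
p.node("summary", ["cleaned"], summary)

results = p.run()
p.invalidate("cleaned")  # e.g. after regenerating only the cleaning step
p.run()
print(calls)
```

After invalidating `cleaned`, the second run recomputes `cleaned` and `summary` but reuses the cached `raw`, which is the kind of local regeneration the pipeline model is meant to allow.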
- Lambdas: \(x) x + 1
- Pipes: |> (short-circuits on error) and ?|> (forwards errors for recovery)
- Named arguments: lm(data = df, formula = y ~ x)
- NA handling: mean(data, na_rm = true); all aggregation functions support na_rm
- CSV I/O: read_csv(path, sep = ";", skip_lines = 2) and write_csv(df, path, sep = ";")
- List comprehensions: [x * x for x in xs if x > 2]
- Dictionary literals: {name: "Alice", age: 30}

T provides two pipe operators with different error-handling semantics:
The standard pipe |> passes the left-hand value as the first argument to the right-hand function. If the left-hand value is an error, the pipeline short-circuits and the error is returned without calling the function.
5 |> double -- 10
error("boom") |> double -- Error (short-circuited)
The maybe-pipe ?|> always forwards the left-hand value, including errors, to the right-hand function. This enables explicit error-recovery patterns.
-- Recover from errors:
handle = \(x) if (is_error(x)) "recovered" else x
error("boom") ?|> handle -- "recovered"
-- Chain recovery with normal processing:
recovery = \(x) if (is_error(x)) 0 else x
increment = \(x) x + 1
error("fail") ?|> recovery |> increment -- 1
Together, |> and ?|> enable
Railway-Oriented Programming in T: errors flow through pipelines as
explicit values, and recovery logic is composable.
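The difference between the two pipes can be modeled in Python as two small functions. This is a sketch of the semantics described above, not T's implementation: `TError`, `pipe`, and `maybe_pipe` are hypothetical names standing in for T's error values, `|>`, and `?|>`.

```python
class TError:
    """Stand-in for a T error value (hypothetical name)."""

    def __init__(self, message):
        self.message = message

    def __repr__(self):
        return f"Error({self.message!r})"


def is_error(value):
    return isinstance(value, TError)


def pipe(value, fn):
    """Model of |>: if value is an error, return it without calling fn."""
    return value if is_error(value) else fn(value)


def maybe_pipe(value, fn):
    """Model of ?|>: always call fn, even when value is an error."""
    return fn(value)


def double(x):
    return x * 2


def recovery(x):
    return 0 if is_error(x) else x


def increment(x):
    return x + 1


boom = TError("boom")

r1 = pipe(5, double)                                        # 5 |> double
r2 = pipe(boom, double)                                     # error("boom") |> double
r3 = pipe(maybe_pipe(TError("fail"), recovery), increment)  # error("fail") ?|> recovery |> increment

print(r1, r2, r3)
```

In this model the error short-circuits through `pipe` untouched, while `maybe_pipe` hands it to `recovery`, which turns it back into an ordinary value that the next `pipe` can process.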
T’s numerical stack is layered:
This approach prioritizes fast development, explicit semantics, and safe defaults, while leaving room for future performance upgrades.
- core (map, sum, seq)
- stats (mean, sd, quantile, cor, lm)
- colcraft: data verbs (select, filter, mutate, group_by, summarize) and window functions (row_number, min_rank, dense_rank, lag, lead, cumsum, etc.)

Packages are part of the standard library and loaded by default. Each function lives in its own file.
T uses explicit NA values with type tags. NA does not propagate
implicitly — operations on NA produce errors by default:
mean([1, NA, 3]) -- Error: NA encountered
sum([1, NA, 3]) -- Error: NA encountered
To skip NA values, use the na_rm = true parameter:
mean([1, NA, 3], na_rm = true) -- 2.0
sum([1, NA, 3], na_rm = true) -- 4
sd([2, NA, 4, 9], na_rm = true) -- 3.61
cor(x, y, na_rm = true) -- pairwise deletion
All aggregation functions (mean, sum, sd,
quantile, cor) support the na_rm parameter.
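The error-by-default behavior and the na_rm opt-out can be modeled in Python as follows. This is a sketch under stated assumptions, not T's implementation: `None` stands in for NA, `NAError` and `_values` are hypothetical names, `total` plays the role of T's sum (renamed to avoid shadowing Python's built-in), and sd is assumed to be the sample standard deviation (n - 1 denominator), which is consistent with the 3.61 shown above.

```python
import math

NA = None  # stand-in for T's NA value


class NAError(ValueError):
    """Raised when NA is encountered and na_rm is false (hypothetical name)."""


def _values(xs, na_rm):
    """Return the values to aggregate, or fail loudly on NA."""
    if na_rm:
        return [x for x in xs if x is not NA]
    if any(x is NA for x in xs):
        raise NAError("NA encountered")
    return list(xs)


def mean(xs, na_rm=False):
    vals = _values(xs, na_rm)
    return sum(vals) / len(vals)


def total(xs, na_rm=False):
    return sum(_values(xs, na_rm))


def sd(xs, na_rm=False):
    # Assumption: sample standard deviation with an n - 1 denominator.
    vals = _values(xs, na_rm)
    m = sum(vals) / len(vals)
    return math.sqrt(sum((v - m) ** 2 for v in vals) / (len(vals) - 1))


def cor(xs, ys, na_rm=False):
    pairs = list(zip(xs, ys))
    if na_rm:
        # Pairwise deletion: drop a pair if either side is NA.
        pairs = [(x, y) for x, y in pairs if x is not NA and y is not NA]
    elif any(x is NA or y is NA for x, y in pairs):
        raise NAError("NA encountered")
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


try:
    mean([1, NA, 3])
    failed = False
except NAError:
    failed = True

print(failed, mean([1, NA, 3], na_rm=True), total([1, NA, 3], na_rm=True))
```

Routing every aggregation through one helper keeps the default strict, so skipping missing values is always a visible choice at the call site.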
Window functions compute values across a set of rows without collapsing them. All window functions handle NA gracefully:
- Ranking functions (row_number, min_rank, dense_rank, cume_dist, percent_rank, ntile): NA positions get NA rank; ranks are computed only among non-NA values
- Offset functions (lag, lead): NA values pass through unchanged
- Cumulative functions (cumsum, cummin, cummax, cummean, cumall, cumany): NA propagates to all subsequent values (matching R)

row_number([3, NA, 1]) -- Vector[2, NA, 1]
cumsum([1, NA, 3]) -- Vector[1, NA, NA]
lag([1, NA, 3]) -- Vector[NA, 1, NA]
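The three NA behaviors described above can be modeled with one Python function per family. This is a sketch, not T's implementation: `None` stands in for NA, plain lists stand in for T vectors, and breaking rank ties by position in `row_number` is an assumption.

```python
NA = None  # stand-in for T's NA value


def row_number(xs):
    """Rank non-NA values in ascending order; NA positions keep an NA rank."""
    # sorted() is stable, so ties are broken by position (an assumption here).
    order = sorted((i for i, x in enumerate(xs) if x is not NA), key=lambda i: xs[i])
    ranks = [NA] * len(xs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def lag(xs, n=1):
    """Shift values right by n, padding with NA; existing NA values move unchanged."""
    return ([NA] * n + list(xs))[:len(xs)]


def cumsum(xs):
    """Running sum; once an NA is seen, every later value is NA."""
    out, running, seen_na = [], 0, False
    for x in xs:
        if seen_na or x is NA:
            seen_na = True
            out.append(NA)
        else:
            running += x
            out.append(running)
    return out


print(row_number([3, NA, 1]))
print(cumsum([1, NA, 3]))
print(lag([1, NA, 3]))
```

Each function returns a list of the same length as its input, which is what distinguishes window functions from aggregations that collapse rows.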
The alpha version of T targets a complete, end-to-end workflow:
- Data manipulation verbs (select, filter, group_by, summarize)
- Grouped operations (group_by |> mutate, group_by |> summarize)
- Window functions (row_number, lag, lead, cumsum, etc.)
- Explicit NA handling (na_rm parameter)

Performance tuning, GPUs, and distributed execution are explicitly out of scope for alpha.
.
├── flake.nix
├── ast.ml
├── parser.ml
├── lexer.ml
├── eval.ml
├── repl.ml
├── pipeline.ml
├── dataframe.ml
└── packages/
    ├── core/
    ├── stats/
    └── colcraft/
nix develop
t repl
Contributions focus on clarity, explicit semantics, and small, reviewable changes. Packages live in-repo during early development.
License: EUPL v1.2.