20 Exporting Chunks
20.1 Concepts
Why we standardize export formats
Different enterprise platforms ingest and index text differently, but the governance requirement is the same:
- the exact source must be traceable,
- outputs must be reviewable and reproducible,
- and the exported dataset must support audit and reprocessing.
Standardizing on a platform-neutral chunk record (the output of fnd-chunking.qmd) gives us a stable “source of truth”. From there, we generate platform-specific exports without changing the underlying meaning or provenance.
Gemini vs. Foundry: what we’re optimizing for
- Gemini Enterprise workflows often benefit from JSONL because:
  - it is line-delimited and stream-friendly,
  - it is easy to generate and inspect,
  - it is common in model ingestion pipelines.
- Palantir Foundry / Army Vantage workflows often benefit from Parquet because:
  - columnar storage is efficient for analytics and indexing,
  - the schema is explicit and enforced,
  - it is commonly used for Foundry datasets and pipelines.
“Export is not transformation”
Exports should be lossless with respect to:
- document identity (`document_sha256`, `document_id`)
- location/provenance pointers (page range, element span)
- audit fields (run IDs, tool versions, parameters)
- chunk content (`text`)
Any platform-specific reshaping should be:
- mechanical (renaming fields, adding nested structure), and
- fully documented and reproducible.
Source provenance must remain first-class
Exports must preserve the ability to answer, for any retrieved chunk:
- Which file/version? (sha256)
- Where in the file? (page range, element span)
- Which processing run and policy? (run metadata + parameters)
If a target platform cannot ingest certain metadata directly, we still export it as fields so it can be stored alongside the content or referenced externally.
20.2 Export schemas (recommended)
Baseline chunk record (source of truth)
We assume input chunks come from chunks.jsonl as produced by fnd-chunking.qmd.
At minimum, each record includes:
- Document provenance: `document_sha256`, `source_relpath`, `source_path`
- Location pointers: `page_number_min`, `page_number_max`, `element_index_start`, `element_index_end`
- Chunk identity: `chunk_id`, `chunk_index`
- Content: `text`
- Audit: `run_id`, `pipeline_version`, `semantic_chunker_params`, `software_versions`
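To make this concrete, here is a minimal sketch of one baseline record; all values are illustrative placeholders, not real output:

```json
{
  "chunk_id": "c-000123",
  "chunk_index": 7,
  "document_sha256": "9f86d081884c7d65...",
  "source_relpath": "policies/example-policy.pdf",
  "source_path": "C:/corpus/policies/example-policy.pdf",
  "page_number_min": 12,
  "page_number_max": 13,
  "element_index_start": 140,
  "element_index_end": 158,
  "text": "...chunk content...",
  "run_id": "2026-04-23_semantic-chunk",
  "pipeline_version": "1.0.0",
  "semantic_chunker_params": {"threshold_type": "percentile"},
  "software_versions": {"python": "3.11.8"}
}
```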
Gemini JSONL export (recommended structure)
Because “Gemini ingestion” can vary by organization and toolchain, we recommend a conservative, flexible JSON object per line:
- `id` (use `chunk_id`)
- `text`
- `metadata`: object containing all provenance and audit fields
Example (one line per chunk):
{"id":"<chunk_id>","text":"...","metadata":{...}}
This keeps the chunk content and provenance together while remaining broadly compatible with JSON-based pipelines.
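Expanded, one exported line might look like the following (pretty-printed here for readability; the actual export keeps each record on a single line, and all values are illustrative):

```json
{
  "id": "c-000123",
  "text": "...chunk content...",
  "metadata": {
    "chunk_id": "c-000123",
    "document_sha256": "9f86d081884c7d65...",
    "source_relpath": "policies/example-policy.pdf",
    "page_number_min": 12,
    "page_number_max": 13,
    "run_id": "2026-04-23_semantic-chunk"
  }
}
```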
Foundry Parquet export (recommended structure)
For Foundry, keep a flat, typed table with columns such as:
- `chunk_id` (string)
- `document_sha256` (string)
- `source_relpath` (string)
- `page_number_min` (int)
- `page_number_max` (int)
- `element_index_start` (int)
- `element_index_end` (int)
- `chunk_index` (int)
- `text` (string)
- `char_count` (int)
- `approx_token_count` (int)
- `run_id` (string)
- `chunked_at_utc` (timestamp-like string)
- `pipeline_version` (string)
- `embedding_model_name` (string)
- `semantic_chunker_params_json` (string JSON)
- `software_versions_json` (string JSON)
- `qa_flags` (string[] or JSON string, depending on downstream tooling)
This structure is analytics-friendly and plays well with schema enforcement.
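If you want types enforced at write time rather than inferred by pandas, a sketch using an explicit pyarrow schema might look like this (column list abbreviated; names and types are assumptions to adapt to your actual records):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Abbreviated, hypothetical schema for the Foundry export. Declaring types
# up front makes schema drift (e.g., a page number arriving as a string)
# fail loudly at export time instead of downstream.
FOUNDRY_SCHEMA = pa.schema([
    ("chunk_id", pa.string()),
    ("document_sha256", pa.string()),
    ("source_relpath", pa.string()),
    ("page_number_min", pa.int32()),
    ("page_number_max", pa.int32()),
    ("chunk_index", pa.int32()),
    ("text", pa.string()),
    ("run_id", pa.string()),
    ("semantic_chunker_params_json", pa.string()),
    ("qa_flags", pa.list_(pa.string())),
])

def write_typed_parquet(rows: list[dict], path: str) -> None:
    # pyarrow raises if a value cannot be represented in the declared type.
    table = pa.Table.from_pylist(rows, schema=FOUNDRY_SCHEMA)
    pq.write_table(table, path)
```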
20.3 Workflow overview
Inputs:
- `runs/<chunk_run>/chunks.jsonl`
- `runs/<chunk_run>/chunk-run-metadata.json` (recommended to retain)
Outputs:
- `gemini/chunks.jsonl` (Gemini-friendly JSONL)
- `foundry/chunks.parquet` (Foundry-friendly Parquet)
- `export-run-metadata.json` (audit trail)
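Assuming the configuration in the script below, the export directory is laid out as:

```
exports/
  <RUN_ID>/
    export-run-metadata.json
    gemini/
      chunks.jsonl
    foundry/
      chunks.parquet
```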
20.4 Prerequisites
Python packages:
- pandas
- pyarrow (required for Parquet)
- tqdm (optional)

Installation (example):

```bash
python -m pip install pandas pyarrow tqdm
```
20.5 Runnable script: export JSONL (Gemini) + Parquet (Foundry)
What this script does
- Reads `chunks.jsonl`
- Writes:
  - `exports/<RUN_ID>/gemini/chunks.jsonl`
  - `exports/<RUN_ID>/foundry/chunks.parquet`
- Writes `export-run-metadata.json` to document:
  - inputs
  - export schema decisions
  - tool versions
Script
```{python}
#| eval: false
# Save as scripts/export_chunks.py and run:
# python scripts/export_chunks.py
#
# Assumes you have already generated chunks.jsonl via fnd-chunking.qmd.
from __future__ import annotations
import datetime as dt
import json
import platform
import sys
from pathlib import Path
from typing import Any, Dict, List, Optional
import pandas as pd
try:
    from tqdm import tqdm
except ImportError:
    # tqdm is optional (see prerequisites); fall back to a pass-through.
    def tqdm(iterable, **kwargs):
        return iterable
# ---------------------------
# Configuration (edit these)
# ---------------------------
CHUNK_RUN_DIR = Path(r"CHANGE_ME\runs\2026-04-23_semantic-chunk").resolve()
OUTPUT_ROOT = Path("exports").resolve()
RUN_ID = f"{dt.datetime.utcnow().date().isoformat()}_chunk-export"
# Output filenames
GEMINI_JSONL_NAME = "chunks.jsonl"
FOUNDRY_PARQUET_NAME = "chunks.parquet"
# ---------------------------
# Helpers
# ---------------------------
def utc_now_iso() -> str:
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated.
    return dt.datetime.now(dt.timezone.utc).isoformat()
def _safe_import_version(pkg: str) -> Optional[str]:
try:
mod = __import__(pkg)
return getattr(mod, "__version__", None)
except Exception:
return None
def load_jsonl(path: Path) -> List[Dict[str, Any]]:
rows: List[Dict[str, Any]] = []
with path.open("r", encoding="utf-8") as f:
for line in f:
line = line.strip()
if not line:
continue
rows.append(json.loads(line))
return rows
# ---------------------------
# Export transforms
# ---------------------------
def to_gemini_jsonl_record(chunk: Dict[str, Any]) -> Dict[str, Any]:
"""
Gemini-friendly JSONL record.
Conservative structure:
- id: chunk_id
- text: chunk text
- metadata: everything else
"""
chunk_id = chunk.get("chunk_id")
text = chunk.get("text")
# Keep provenance and audit fields together, but exclude the main text.
metadata = {k: v for k, v in chunk.items() if k not in ("text",)}
return {
"id": chunk_id,
"text": text,
"metadata": metadata,
}
def flatten_for_foundry(chunk: Dict[str, Any]) -> Dict[str, Any]:
"""
Flatten nested dict fields into JSON strings for columnar storage.
Keeps table schema stable and explicit.
"""
out = dict(chunk)
# Normalize nested objects to JSON strings for Parquet friendliness
if isinstance(out.get("semantic_chunker_params"), dict):
out["semantic_chunker_params_json"] = json.dumps(out["semantic_chunker_params"], ensure_ascii=False, sort_keys=True)
del out["semantic_chunker_params"]
if isinstance(out.get("software_versions"), dict):
out["software_versions_json"] = json.dumps(out["software_versions"], ensure_ascii=False, sort_keys=True)
del out["software_versions"]
# Normalize qa_flags: keep list if you want list-typed parquet,
# or encode as JSON if downstream tooling prefers strings only.
# Here we keep list if present; pyarrow can store list<string>.
# If you prefer string: uncomment JSON conversion.
#
# if isinstance(out.get("qa_flags"), list):
# out["qa_flags_json"] = json.dumps(out["qa_flags"], ensure_ascii=False)
# del out["qa_flags"]
return out
def main() -> None:
chunks_path = CHUNK_RUN_DIR / "chunks.jsonl"
chunk_run_meta_path = CHUNK_RUN_DIR / "chunk-run-metadata.json"
for p in [chunks_path, chunk_run_meta_path]:
if not p.exists():
raise FileNotFoundError(f"Missing expected input: {p}")
export_dir = OUTPUT_ROOT / RUN_ID
gemini_dir = export_dir / "gemini"
foundry_dir = export_dir / "foundry"
gemini_dir.mkdir(parents=True, exist_ok=False)
foundry_dir.mkdir(parents=True, exist_ok=False)
# Read chunks
chunks = load_jsonl(chunks_path)
if not chunks:
raise RuntimeError(f"No chunks found in: {chunks_path}")
# Write export run metadata (audit trail)
export_run_meta = {
"run_id": RUN_ID,
"created_at_utc": utc_now_iso(),
"inputs": {
"chunk_run_dir": str(CHUNK_RUN_DIR),
"chunks_jsonl": str(chunks_path),
"chunk_run_metadata": str(chunk_run_meta_path),
},
"outputs": {
"gemini_jsonl": str(gemini_dir / GEMINI_JSONL_NAME),
"foundry_parquet": str(foundry_dir / FOUNDRY_PARQUET_NAME),
},
"packages": {
"pandas": _safe_import_version("pandas"),
"pyarrow": _safe_import_version("pyarrow"),
"tqdm": _safe_import_version("tqdm"),
},
"export_policies": {
"gemini_jsonl_structure": {"id": "chunk_id", "text": "text", "metadata": "all_other_fields"},
"foundry_parquet_flattening": [
"semantic_chunker_params -> semantic_chunker_params_json",
"software_versions -> software_versions_json",
],
"losslessness_goal": "Do not drop provenance/audit fields; only reshape.",
},
}
(export_dir / "export-run-metadata.json").write_text(json.dumps(export_run_meta, indent=2), encoding="utf-8")
# 1) Gemini JSONL
gemini_out_path = gemini_dir / GEMINI_JSONL_NAME
with gemini_out_path.open("w", encoding="utf-8") as f:
for c in tqdm(chunks, desc="Writing Gemini JSONL"):
rec = to_gemini_jsonl_record(c)
f.write(json.dumps(rec, ensure_ascii=False) + "\n")
# 2) Foundry Parquet
flat_rows = [flatten_for_foundry(c) for c in chunks]
df = pd.DataFrame(flat_rows)
# Optional: enforce column ordering (helps reviewers)
preferred_cols = [
"chunk_id",
"document_sha256",
"document_id",
"source_relpath",
"source_path",
"chunk_index",
"text",
"page_number_min",
"page_number_max",
"element_index_start",
"element_index_end",
"element_count",
"char_count",
"approx_token_count",
"run_id",
"chunked_at_utc",
"pipeline_stage",
"pipeline_version",
"embedding_model_name",
"semantic_chunker_params_json",
"software_versions_json",
"qa_flags",
]
cols = [c for c in preferred_cols if c in df.columns] + [c for c in df.columns if c not in preferred_cols]
df = df[cols]
foundry_out_path = foundry_dir / FOUNDRY_PARQUET_NAME
df.to_parquet(foundry_out_path, index=False)
print("Done.")
print(f"Export directory: {export_dir}")
print(f"Gemini JSONL: {gemini_out_path}")
print(f"Foundry Parquet: {foundry_out_path}")
if __name__ == "__main__":
    main()
```

20.6 Verification checklist (release blockers)
1) Row counts match (lossless export)
The number of records in `chunks.jsonl`, the number of lines in `gemini/chunks.jsonl`, and the number of rows in `foundry/chunks.parquet` must be identical.
2) Provenance fields preserved
For a sample of rows/records, verify the following are present and consistent across the input and both exports: `document_sha256`, `source_relpath`, the page range (`page_number_min`/`page_number_max`), the element span (`element_index_start`/`element_index_end`), and `run_id`.
3) JSONL validity
Every non-empty line of the Gemini export must parse as a standalone JSON object with `id`, `text`, and `metadata` keys.
4) Parquet schema review
Confirm that column names and types match the recommended Foundry schema and that nested objects were flattened to `*_json` string columns as documented.
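A minimal sketch covering all four checks (EXPORT_DIR and CHUNKS_JSONL are placeholders to adjust for your run):

```python
# Hypothetical verification helper for the four release-blocker checks.
import json
from pathlib import Path

import pandas as pd

EXPORT_DIR = Path("exports") / "CHANGE_ME_RUN_ID"
CHUNKS_JSONL = Path("runs") / "CHANGE_ME_CHUNK_RUN" / "chunks.jsonl"

# 1) Count input records.
with CHUNKS_JSONL.open("r", encoding="utf-8") as f:
    n_in = sum(1 for line in f if line.strip())

# 3) Every non-empty Gemini line must parse as standalone JSON.
with (EXPORT_DIR / "gemini" / "chunks.jsonl").open("r", encoding="utf-8") as f:
    gemini_rows = [json.loads(line) for line in f if line.strip()]

# 4) Load the Parquet export and review the schema by eye.
df = pd.read_parquet(EXPORT_DIR / "foundry" / "chunks.parquet")
print(df.dtypes)

# 1) Row counts must match exactly (lossless export).
assert n_in == len(gemini_rows) == len(df), (n_in, len(gemini_rows), len(df))

# 2) Spot-check provenance fields on a sample record.
meta = gemini_rows[0]["metadata"]
for field in ("document_sha256", "source_relpath", "page_number_min", "run_id"):
    assert meta.get(field) is not None, f"missing provenance field: {field}"

print("Verification checks passed.")
```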
20.7 Operational guidance (platform flexibility)
Importing into Gemini workflows
Because Gemini Enterprise ingestion pipelines differ by tenant configuration, we recommend keeping the JSONL format conservative:
- `id` for stable chunk identity
- `text` for chunk content
- `metadata` for everything needed to cite and audit
If a Gemini workflow requires a different shape (e.g., specific keys like document, uri, or source), add a thin export adapter that reshapes fields without dropping provenance.
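For instance, if a tenant's ingestion expected document and uri keys, a thin adapter might look like this (the target key names are hypothetical; the point is that renaming is mechanical and metadata travels intact):

```python
from typing import Any, Dict

def to_custom_gemini_record(chunk: Dict[str, Any]) -> Dict[str, Any]:
    """Reshape a baseline chunk record into a hypothetical tenant-specific
    shape. Fields are renamed mechanically; nothing is dropped."""
    return {
        "id": chunk["chunk_id"],
        "document": chunk["text"],           # hypothetical key for content
        "uri": chunk.get("source_relpath"),  # hypothetical key for source
        "metadata": {k: v for k, v in chunk.items() if k != "text"},
    }
```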
Importing into Foundry / Army Vantage
Foundry typically ingests Parquet cleanly and allows:
- filtering by provenance fields,
- joining back to other datasets (e.g., document catalog),
- enforcing schema consistency in pipelines.
Treat document_sha256 as the primary join key back to the document manifest and any higher-level document registry.
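For instance, a pandas join back to a hypothetical document catalog might look like this (the catalog path and its columns are illustrative):

```python
import pandas as pd

chunks = pd.read_parquet("foundry/chunks.parquet")
catalog = pd.read_parquet("document_catalog.parquet")  # hypothetical dataset

# document_sha256 identifies the exact file version, so the join is exact.
# validate="many_to_one" asserts the catalog is unique per document.
joined = chunks.merge(
    catalog[["document_sha256", "title", "publication_date"]],  # illustrative
    on="document_sha256",
    how="left",
    validate="many_to_one",
)
```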