20  Exporting Chunks

20.1 Concepts

Why we standardize export formats

Different enterprise platforms ingest and index text differently, but the governance requirement is the same:

  • the exact source must be traceable,
  • outputs must be reviewable and reproducible,
  • and the exported dataset must support audit and reprocessing.

Standardizing on a platform-neutral chunk record (the output of fnd-chunking.qmd) gives us a stable “source of truth”. From there, we generate platform-specific exports without changing the underlying meaning or provenance.

Gemini vs. Foundry: what we’re optimizing for

  • Gemini Enterprise workflows often benefit from JSONL because:
    • it is line-delimited and stream-friendly,
    • it is easy to generate and inspect,
    • it is common in model ingestion pipelines.
  • Palantir Foundry / Army Vantage workflows often benefit from Parquet because:
    • columnar storage is efficient for analytics and indexing,
    • the schema is explicit and enforced,
    • it is commonly used for Foundry datasets and pipelines.

“Export is not transformation”

Exports should be lossless with respect to:

  • document identity (document_sha256, document_id)
  • location/provenance pointers (page range, element span)
  • audit fields (run IDs, tool versions, parameters)
  • chunk content (text)

Any platform-specific reshaping should be:

  • mechanical (renaming fields, adding nested structure), and
  • fully documented and reproducible.
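
One way to make the losslessness goal testable, assuming the conservative id/text/metadata record shape used for the Gemini export later in this chapter, is a key-coverage check along these lines:

# Sketch of a losslessness check: every field of the original chunk must survive
# either as the exported "text" or inside the exported "metadata".
from typing import Any, Dict

def assert_lossless(original: Dict[str, Any], exported: Dict[str, Any]) -> None:
    surviving_keys = set(exported["metadata"]) | {"text"}
    missing = set(original) - surviving_keys
    if missing:
        raise ValueError(f"Export dropped fields: {sorted(missing)}")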

Source provenance must remain first-class

Exports must preserve the ability to answer, for any retrieved chunk:

  1. Which file/version? (sha256)
  2. Where in the file? (page range, element span)
  3. Which processing run and policy? (run metadata + parameters)

If a target platform cannot ingest certain metadata directly, we still export it as fields so it can be stored alongside the content or referenced externally.
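
As a concrete illustration, consider a single exported record (field names follow the chunk schema used by the export script later in this chapter; the values are hypothetical):

# Hypothetical exported record; values are placeholders, field names follow the
# chunk schema used in this chapter.
record = {
    "id": "chunk-0007",
    "text": "example chunk text",
    "metadata": {
        "document_sha256": "<sha256-of-source-file>",   # 1) which file/version
        "document_id": "<stable-document-id>",
        "page_number_min": 12,                          # 2) where in the file
        "page_number_max": 13,
        "element_index_start": 84,
        "element_index_end": 97,
        "run_id": "2026-04-23_semantic-chunk",          # 3) which processing run
    },
}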


20.3 Workflow overview

Inputs:

  • runs/<chunk_run>/chunks.jsonl
  • runs/<chunk_run>/chunk-run-metadata.json (recommended to retain)

Outputs:

  • gemini/chunks.jsonl (Gemini-friendly JSONL)
  • foundry/chunks.parquet (Foundry-friendly Parquet)
  • export-run-metadata.json (audit trail)


20.4 Prerequisites

Python packages:

  • pandas
  • pyarrow (required for Parquet)
  • tqdm (optional; the example script below imports it for progress bars)

Installation (example):

python -m pip install pandas pyarrow tqdm


20.5 Runnable script: export JSONL (Gemini) + Parquet (Foundry)

What this script does

  • Reads chunks.jsonl
  • Writes:
    • exports/<RUN_ID>/gemini/chunks.jsonl
    • exports/<RUN_ID>/foundry/chunks.parquet
  • Writes export-run-metadata.json to document:
    • inputs
    • export schema decisions
    • tool versions

Script

#| eval: false

# Save as scripts/export_chunks.py and run:
#   python scripts/export_chunks.py
#
# Assumes you have already generated chunks.jsonl via fnd-chunking.qmd.

from __future__ import annotations

import datetime as dt
import json
import platform
import sys
from pathlib import Path
from typing import Any, Dict, List, Optional

import pandas as pd
from tqdm import tqdm


# ---------------------------
# Configuration (edit these)
# ---------------------------

CHUNK_RUN_DIR = Path(r"CHANGE_ME\runs\2026-04-23_semantic-chunk").resolve()

OUTPUT_ROOT = Path("exports").resolve()
RUN_ID = f"{dt.datetime.now(dt.timezone.utc).date().isoformat()}_chunk-export"

# Output filenames
GEMINI_JSONL_NAME = "chunks.jsonl"
FOUNDRY_PARQUET_NAME = "chunks.parquet"


# ---------------------------
# Helpers
# ---------------------------

def utc_now_iso() -> str:
    return dt.datetime.now(dt.timezone.utc).isoformat()

def _safe_import_version(pkg: str) -> Optional[str]:
    try:
        mod = __import__(pkg)
        return getattr(mod, "__version__", None)
    except Exception:
        return None

def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows: List[Dict[str, Any]] = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows


# ---------------------------
# Export transforms
# ---------------------------

def to_gemini_jsonl_record(chunk: Dict[str, Any]) -> Dict[str, Any]:
    """
    Gemini-friendly JSONL record.
    Conservative structure:
      - id: chunk_id
      - text: chunk text
      - metadata: everything else
    """
    chunk_id = chunk.get("chunk_id")
    text = chunk.get("text")

    # Keep provenance and audit fields together, but exclude the main text.
    metadata = {k: v for k, v in chunk.items() if k not in ("text",)}

    return {
        "id": chunk_id,
        "text": text,
        "metadata": metadata,
    }

def flatten_for_foundry(chunk: Dict[str, Any]) -> Dict[str, Any]:
    """
    Flatten nested dict fields into JSON strings for columnar storage.
    Keeps table schema stable and explicit.
    """
    out = dict(chunk)

    # Normalize nested objects to JSON strings for Parquet friendliness
    if isinstance(out.get("semantic_chunker_params"), dict):
        out["semantic_chunker_params_json"] = json.dumps(out["semantic_chunker_params"], ensure_ascii=False, sort_keys=True)
        del out["semantic_chunker_params"]

    if isinstance(out.get("software_versions"), dict):
        out["software_versions_json"] = json.dumps(out["software_versions"], ensure_ascii=False, sort_keys=True)
        del out["software_versions"]

    # Normalize qa_flags: keep list if you want list-typed parquet,
    # or encode as JSON if downstream tooling prefers strings only.
    # Here we keep list if present; pyarrow can store list<string>.
    # If you prefer string: uncomment JSON conversion.
    #
    # if isinstance(out.get("qa_flags"), list):
    #     out["qa_flags_json"] = json.dumps(out["qa_flags"], ensure_ascii=False)
    #     del out["qa_flags"]

    return out


def main() -> None:
    chunks_path = CHUNK_RUN_DIR / "chunks.jsonl"
    chunk_run_meta_path = CHUNK_RUN_DIR / "chunk-run-metadata.json"

    for p in [chunks_path, chunk_run_meta_path]:
        if not p.exists():
            raise FileNotFoundError(f"Missing expected input: {p}")

    export_dir = OUTPUT_ROOT / RUN_ID
    gemini_dir = export_dir / "gemini"
    foundry_dir = export_dir / "foundry"
    gemini_dir.mkdir(parents=True, exist_ok=False)
    foundry_dir.mkdir(parents=True, exist_ok=False)

    # Read chunks
    chunks = load_jsonl(chunks_path)
    if not chunks:
        raise RuntimeError(f"No chunks found in: {chunks_path}")

    # Write export run metadata (audit trail)
    export_run_meta = {
        "run_id": RUN_ID,
        "created_at_utc": utc_now_iso(),
        "inputs": {
            "chunk_run_dir": str(CHUNK_RUN_DIR),
            "chunks_jsonl": str(chunks_path),
            "chunk_run_metadata": str(chunk_run_meta_path),
        },
        "outputs": {
            "gemini_jsonl": str(gemini_dir / GEMINI_JSONL_NAME),
            "foundry_parquet": str(foundry_dir / FOUNDRY_PARQUET_NAME),
        },
        "packages": {
            "pandas": _safe_import_version("pandas"),
            "pyarrow": _safe_import_version("pyarrow"),
            "tqdm": _safe_import_version("tqdm"),
        },
        "export_policies": {
            "gemini_jsonl_structure": {"id": "chunk_id", "text": "text", "metadata": "all_other_fields"},
            "foundry_parquet_flattening": [
                "semantic_chunker_params -> semantic_chunker_params_json",
                "software_versions -> software_versions_json",
            ],
            "losslessness_goal": "Do not drop provenance/audit fields; only reshape.",
        },
    }
    (export_dir / "export-run-metadata.json").write_text(json.dumps(export_run_meta, indent=2), encoding="utf-8")

    # 1) Gemini JSONL
    gemini_out_path = gemini_dir / GEMINI_JSONL_NAME
    with gemini_out_path.open("w", encoding="utf-8") as f:
        for c in tqdm(chunks, desc="Writing Gemini JSONL"):
            rec = to_gemini_jsonl_record(c)
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

    # 2) Foundry Parquet
    flat_rows = [flatten_for_foundry(c) for c in chunks]
    df = pd.DataFrame(flat_rows)

    # Optional: enforce column ordering (helps reviewers)
    preferred_cols = [
        "chunk_id",
        "document_sha256",
        "document_id",
        "source_relpath",
        "source_path",
        "chunk_index",
        "text",
        "page_number_min",
        "page_number_max",
        "element_index_start",
        "element_index_end",
        "element_count",
        "char_count",
        "approx_token_count",
        "run_id",
        "chunked_at_utc",
        "pipeline_stage",
        "pipeline_version",
        "embedding_model_name",
        "semantic_chunker_params_json",
        "software_versions_json",
        "qa_flags",
    ]
    cols = [c for c in preferred_cols if c in df.columns] + [c for c in df.columns if c not in preferred_cols]
    df = df[cols]

    foundry_out_path = foundry_dir / FOUNDRY_PARQUET_NAME
    df.to_parquet(foundry_out_path, index=False)

    print("Done.")
    print(f"Export directory: {export_dir}")
    print(f"Gemini JSONL: {gemini_out_path}")
    print(f"Foundry Parquet: {foundry_out_path}")


if __name__ == "__main__":
    main()

20.6 Verification checklist (release blockers)

1) Row counts match (lossless export)

2) Provenance fields preserved

For a sample of rows/records, verify that document_sha256, document_id, page_number_min/page_number_max, element_index_start/element_index_end, and run_id are present and consistent between the source chunks and both exports.

3) JSONL validity

4) Parquet schema review
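
A minimal verification sketch covering checks 1, 3, and 4, assuming the output layout written by the export script above (adjust the run IDs and paths for your environment):

# Quick post-export checks: row counts, JSONL validity, and Parquet schema.
# Paths below are examples following the layout written by scripts/export_chunks.py.
import json
from pathlib import Path

import pyarrow.parquet as pq

export_dir = Path("exports") / "2026-04-23_chunk-export"            # example run id
source_chunks = Path("runs") / "2026-04-23_semantic-chunk" / "chunks.jsonl"

def count_jsonl(path: Path) -> int:
    # Raises if any non-empty line is not valid JSON (check 3).
    n = 0
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)
                n += 1
    return n

n_source = count_jsonl(source_chunks)
n_gemini = count_jsonl(export_dir / "gemini" / "chunks.jsonl")

parquet_file = pq.ParquetFile(export_dir / "foundry" / "chunks.parquet")
n_foundry = parquet_file.metadata.num_rows

# Check 1: lossless export means all three counts agree.
assert n_source == n_gemini == n_foundry, (n_source, n_gemini, n_foundry)

# Check 4: review column names and types against the expected schema.
print(parquet_file.schema_arrow)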


20.7 Operational guidance (platform flexibility)

Importing into Gemini workflows

Because Gemini Enterprise ingestion pipelines differ by tenant configuration, we recommend keeping the JSONL format conservative:

  • id for stable chunk identity
  • text for chunk content
  • metadata for everything needed to cite and audit

If a Gemini workflow requires a different shape (e.g., specific keys like document, uri, or source), add a thin export adapter that reshapes fields without dropping provenance.
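
For example, a thin adapter might look like the sketch below; the target key names (document, uri, source, content) are placeholders, so confirm the actual shape required by your tenant's ingestion pipeline:

# Hypothetical adapter: reshape the conservative id/text/metadata record into a
# tenant-specific shape without dropping any provenance fields.
from typing import Any, Dict

def to_tenant_record(rec: Dict[str, Any]) -> Dict[str, Any]:
    meta = rec["metadata"]
    return {
        "document": meta.get("document_id"),       # placeholder key names
        "uri": meta.get("source_relpath"),
        "source": meta.get("document_sha256"),
        "content": rec["text"],
        "metadata": meta,                          # keep everything for audit
    }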

Importing into Foundry / Army Vantage

Foundry typically ingests Parquet cleanly and allows:

  • filtering by provenance fields,
  • joining back to other datasets (e.g., document catalog),
  • enforcing schema consistency in pipelines.

Treat document_sha256 as the primary join key back to the document manifest and any higher-level document registry.
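
For instance, joining chunks back to a document manifest might look like the sketch below; the manifest path and its columns are assumptions, so substitute your actual document registry:

# Hypothetical join of exported chunks to a document manifest on document_sha256.
import pandas as pd

chunks = pd.read_parquet("exports/2026-04-23_chunk-export/foundry/chunks.parquet")
manifest = pd.read_parquet("manifests/document-manifest.parquet")   # assumed path/schema

joined = chunks.merge(
    manifest,
    on="document_sha256",
    how="left",
    validate="many_to_one",   # many chunks per document, one manifest row each
)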


20.8 Definition of done (for export work)