20  Exporting Chunks

20.1 Concepts

Why we standardize export formats

Different enterprise platforms ingest and index text differently, but the governance requirement is the same:

  • the exact source must be traceable,
  • outputs must be reviewable and reproducible,
  • and the exported dataset must support audit and reprocessing.

Standardizing on a platform-neutral chunk record (the output of fnd-chunking.qmd) gives us a stable “source of truth”. From there, we generate platform-specific exports without changing the underlying meaning or provenance.

Gemini vs. Foundry: what we’re optimizing for

  • Gemini Enterprise workflows often benefit from JSONL because:
    • it is line-delimited and stream-friendly,
    • it is easy to generate and inspect,
    • it is common in model ingestion pipelines.
  • Palantir Foundry / Army Vantage workflows often benefit from Parquet because:
    • columnar storage is efficient for analytics and indexing,
    • the schema is explicit and enforced,
    • it is commonly used for Foundry datasets and pipelines.

“Export is not transformation”

Exports should be lossless with respect to:

  • document identity (document_sha256, document_id)
  • location/provenance pointers (page range, element span)
  • audit fields (run IDs, tool versions, parameters)
  • chunk content (text)

Any platform-specific reshaping should be:

  • mechanical (renaming fields, adding nested structure), and
  • fully documented and reproducible.
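
One way to make the losslessness goal testable, assuming the conservative id/text/metadata record shape used for the Gemini export later in this chapter, is a key-coverage check along these lines:

# Sketch of a losslessness check: every field of the original chunk must survive
# either as the exported "text" or inside the exported "metadata".
from typing import Any, Dict

def assert_lossless(original: Dict[str, Any], exported: Dict[str, Any]) -> None:
    surviving_keys = set(exported["metadata"]) | {"text"}
    missing = set(original) - surviving_keys
    if missing:
        raise ValueError(f"Export dropped fields: {sorted(missing)}")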

Source provenance must remain first-class

Exports must preserve the ability to answer, for any retrieved chunk:

  1. Which file/version? (sha256)
  2. Where in the file? (page range, element span)
  3. Which processing run and policy? (run metadata + parameters)

If a target platform cannot ingest certain metadata directly, we still export it as fields so it can be stored alongside the content or referenced externally.
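
As a concrete illustration, consider a single exported record (field names follow the chunk schema used by the export script later in this chapter; the values are hypothetical):

# Hypothetical exported record; values are placeholders, field names follow the
# chunk schema used in this chapter.
record = {
    "id": "chunk-0007",
    "text": "example chunk text",
    "metadata": {
        "document_sha256": "<sha256-of-source-file>",   # 1) which file/version
        "document_id": "<stable-document-id>",
        "page_number_min": 12,                          # 2) where in the file
        "page_number_max": 13,
        "element_index_start": 84,
        "element_index_end": 97,
        "run_id": "2026-04-23_semantic-chunk",          # 3) which processing run
    },
}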


20.3 Workflow overview

Inputs:

  • runs/<chunk_run>/chunks.jsonl
  • runs/<chunk_run>/chunk-run-metadata.json (recommended to retain)

Outputs:

  • gemini/chunks.jsonl (Gemini-friendly JSONL)
  • foundry/chunks.parquet (Foundry-friendly Parquet)
  • export-run-metadata.json (audit trail)


20.4 Prerequisites

Python packages:

  • pandas
  • pyarrow (required for Parquet)
  • tqdm (optional; the example script below imports it for progress bars)

Installation (example):

python -m pip install pandas pyarrow tqdm


20.5 Runnable script: export JSONL (Gemini) + Parquet (Foundry)

What this script does

  • Reads chunks.jsonl
  • Writes:
    • exports/<RUN_ID>/gemini/chunks.jsonl
    • exports/<RUN_ID>/foundry/chunks.parquet
  • Writes export-run-metadata.json to document:
    • inputs
    • export schema decisions
    • tool versions

Script

#| eval: false

# Save as scripts/export_chunks.py and run:
#   python scripts/export_chunks.py
#
# Assumes you have already generated chunks.jsonl via fnd-chunking.qmd.

from __future__ import annotations

import datetime as dt
import json
import platform
import sys
from pathlib import Path
from typing import Any, Dict, List, Optional

import pandas as pd
from tqdm import tqdm


# ---------------------------
# Configuration (edit these)
# ---------------------------

CHUNK_RUN_DIR = Path(r"CHANGE_ME\runs\2026-04-23_semantic-chunk").resolve()

OUTPUT_ROOT = Path("exports").resolve()
RUN_ID = f"{dt.datetime.now(dt.timezone.utc).date().isoformat()}_chunk-export"

# Output filenames
GEMINI_JSONL_NAME = "chunks.jsonl"
FOUNDRY_PARQUET_NAME = "chunks.parquet"


# ---------------------------
# Helpers
# ---------------------------

def utc_now_iso() -> str:
    return dt.datetime.now(dt.timezone.utc).isoformat()

def _safe_import_version(pkg: str) -> Optional[str]:
    try:
        mod = __import__(pkg)
        return getattr(mod, "__version__", None)
    except Exception:
        return None

def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows: List[Dict[str, Any]] = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows


# ---------------------------
# Export transforms
# ---------------------------

def to_gemini_jsonl_record(chunk: Dict[str, Any]) -> Dict[str, Any]:
    """
    Gemini-friendly JSONL record.
    Conservative structure:
      - id: chunk_id
      - text: chunk text
      - metadata: everything else
    """
    chunk_id = chunk.get("chunk_id")
    text = chunk.get("text")

    # Keep provenance and audit fields together, but exclude the main text.
    metadata = {k: v for k, v in chunk.items() if k not in ("text",)}

    return {
        "id": chunk_id,
        "text": text,
        "metadata": metadata,
    }

def flatten_for_foundry(chunk: Dict[str, Any]) -> Dict[str, Any]:
    """
    Flatten nested dict fields into JSON strings for columnar storage.
    Keeps table schema stable and explicit.
    """
    out = dict(chunk)

    # Normalize nested objects to JSON strings for Parquet friendliness
    if isinstance(out.get("semantic_chunker_params"), dict):
        out["semantic_chunker_params_json"] = json.dumps(out["semantic_chunker_params"], ensure_ascii=False, sort_keys=True)
        del out["semantic_chunker_params"]

    if isinstance(out.get("software_versions"), dict):
        out["software_versions_json"] = json.dumps(out["software_versions"], ensure_ascii=False, sort_keys=True)
        del out["software_versions"]

    # Normalize qa_flags: keep list if you want list-typed parquet,
    # or encode as JSON if downstream tooling prefers strings only.
    # Here we keep list if present; pyarrow can store list<string>.
    # If you prefer string: uncomment JSON conversion.
    #
    # if isinstance(out.get("qa_flags"), list):
    #     out["qa_flags_json"] = json.dumps(out["qa_flags"], ensure_ascii=False)
    #     del out["qa_flags"]

    return out


def main() -> None:
    chunks_path = CHUNK_RUN_DIR / "chunks.jsonl"
    chunk_run_meta_path = CHUNK_RUN_DIR / "chunk-run-metadata.json"

    for p in [chunks_path, chunk_run_meta_path]:
        if not p.exists():
            raise FileNotFoundError(f"Missing expected input: {p}")

    export_dir = OUTPUT_ROOT / RUN_ID
    gemini_dir = export_dir / "gemini"
    foundry_dir = export_dir / "foundry"
    gemini_dir.mkdir(parents=True, exist_ok=False)
    foundry_dir.mkdir(parents=True, exist_ok=False)

    # Read chunks
    chunks = load_jsonl(chunks_path)
    if not chunks:
        raise RuntimeError(f"No chunks found in: {chunks_path}")

    # Write export run metadata (audit trail)
    export_run_meta = {
        "run_id": RUN_ID,
        "created_at_utc": utc_now_iso(),
        "inputs": {
            "chunk_run_dir": str(CHUNK_RUN_DIR),
            "chunks_jsonl": str(chunks_path),
            "chunk_run_metadata": str(chunk_run_meta_path),
        },
        "outputs": {
            "gemini_jsonl": str(gemini_dir / GEMINI_JSONL_NAME),
            "foundry_parquet": str(foundry_dir / FOUNDRY_PARQUET_NAME),
        },
        "packages": {
            "pandas": _safe_import_version("pandas"),
            "pyarrow": _safe_import_version("pyarrow"),
            "tqdm": _safe_import_version("tqdm"),
        },
        "export_policies": {
            "gemini_jsonl_structure": {"id": "chunk_id", "text": "text", "metadata": "all_other_fields"},
            "foundry_parquet_flattening": [
                "semantic_chunker_params -> semantic_chunker_params_json",
                "software_versions -> software_versions_json",
            ],
            "losslessness_goal": "Do not drop provenance/audit fields; only reshape.",
        },
    }
    (export_dir / "export-run-metadata.json").write_text(json.dumps(export_run_meta, indent=2), encoding="utf-8")

    # 1) Gemini JSONL
    gemini_out_path = gemini_dir / GEMINI_JSONL_NAME
    with gemini_out_path.open("w", encoding="utf-8") as f:
        for c in tqdm(chunks, desc="Writing Gemini JSONL"):
            rec = to_gemini_jsonl_record(c)
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

    # 2) Foundry Parquet
    flat_rows = [flatten_for_foundry(c) for c in chunks]
    df = pd.DataFrame(flat_rows)

    # Optional: enforce column ordering (helps reviewers)
    preferred_cols = [
        "chunk_id",
        "document_sha256",
        "document_id",
        "source_relpath",
        "source_path",
        "chunk_index",
        "text",
        "page_number_min",
        "page_number_max",
        "element_index_start",
        "element_index_end",
        "element_count",
        "char_count",
        "approx_token_count",
        "run_id",
        "chunked_at_utc",
        "pipeline_stage",
        "pipeline_version",
        "embedding_model_name",
        "semantic_chunker_params_json",
        "software_versions_json",
        "qa_flags",
    ]
    cols = [c for c in preferred_cols if c in df.columns] + [c for c in df.columns if c not in preferred_cols]
    df = df[cols]

    foundry_out_path = foundry_dir / FOUNDRY_PARQUET_NAME
    df.to_parquet(foundry_out_path, index=False)

    print("Done.")
    print(f"Export directory: {export_dir}")
    print(f"Gemini JSONL: {gemini_out_path}")
    print(f"Foundry Parquet: {foundry_out_path}")


if __name__ == "__main__":
    main()

20.6 Verification checklist (release blockers)

1) Row counts match (lossless export)

2) Provenance fields preserved

For a sample of rows/records, verify that document_sha256, document_id, page_number_min/page_number_max, element_index_start/element_index_end, and run_id are present and consistent between the source chunks and both exports.

3) JSONL validity

4) Parquet schema review
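
A minimal verification sketch covering checks 1, 3, and 4, assuming the output layout written by the export script above (adjust the run IDs and paths for your environment):

# Quick post-export checks: row counts, JSONL validity, and Parquet schema.
# Paths below are examples following the layout written by scripts/export_chunks.py.
import json
from pathlib import Path

import pyarrow.parquet as pq

export_dir = Path("exports") / "2026-04-23_chunk-export"            # example run id
source_chunks = Path("runs") / "2026-04-23_semantic-chunk" / "chunks.jsonl"

def count_jsonl(path: Path) -> int:
    # Raises if any non-empty line is not valid JSON (check 3).
    n = 0
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)
                n += 1
    return n

n_source = count_jsonl(source_chunks)
n_gemini = count_jsonl(export_dir / "gemini" / "chunks.jsonl")

parquet_file = pq.ParquetFile(export_dir / "foundry" / "chunks.parquet")
n_foundry = parquet_file.metadata.num_rows

# Check 1: lossless export means all three counts agree.
assert n_source == n_gemini == n_foundry, (n_source, n_gemini, n_foundry)

# Check 4: review column names and types against the expected schema.
print(parquet_file.schema_arrow)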


20.7 Operational guidance (platform flexibility)

Importing into Gemini workflows

Because Gemini Enterprise ingestion pipelines differ by tenant configuration, we recommend keeping the JSONL format conservative:

  • id for stable chunk identity
  • text for chunk content
  • metadata for everything needed to cite and audit

If a Gemini workflow requires a different shape (e.g., specific keys like document, uri, or source), add a thin export adapter that reshapes fields without dropping provenance.
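
For example, a thin adapter might look like the sketch below; the target key names (document, uri, source, content) are placeholders, so confirm the actual shape required by your tenant's ingestion pipeline:

# Hypothetical adapter: reshape the conservative id/text/metadata record into a
# tenant-specific shape without dropping any provenance fields.
from typing import Any, Dict

def to_tenant_record(rec: Dict[str, Any]) -> Dict[str, Any]:
    meta = rec["metadata"]
    return {
        "document": meta.get("document_id"),       # placeholder key names
        "uri": meta.get("source_relpath"),
        "source": meta.get("document_sha256"),
        "content": rec["text"],
        "metadata": meta,                          # keep everything for audit
    }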

Importing into Foundry / Army Vantage

Foundry typically ingests Parquet cleanly and allows:

  • filtering by provenance fields,
  • joining back to other datasets (e.g., document catalog),
  • enforcing schema consistency in pipelines.

Treat document_sha256 as the primary join key back to the document manifest and any higher-level document registry.
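
For instance, joining chunks back to a document manifest might look like the sketch below; the manifest path and its columns are assumptions, so substitute your actual document registry:

# Hypothetical join of exported chunks to a document manifest on document_sha256.
import pandas as pd

chunks = pd.read_parquet("exports/2026-04-23_chunk-export/foundry/chunks.parquet")
manifest = pd.read_parquet("manifests/document-manifest.parquet")   # assumed path/schema

joined = chunks.merge(
    manifest,
    on="document_sha256",
    how="left",
    validate="many_to_one",   # many chunks per document, one manifest row each
)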


20.8 Definition of done (for export work)