update to store and curate data correctly

2026-04-03 15:43:19 +02:00
parent 3d7f92e20f
commit 37de34cc5e
7 changed files with 199 additions and 7 deletions
@@ -90,6 +90,49 @@ Or with `just`:
just export-dataset
```
Curate a high-quality training subset (single file):
```sh
python -m server.DatasetCurator --input good_moves-2026-04-03.jsonl --output data/dataset/best_moves.jsonl
```
Curate from multiple JSONL sources (repeat `--input`):
```sh
python -m server.DatasetCurator \
--input good_moves-2026-04-03.jsonl \
--input good_moves-2026-04-04.jsonl \
--output data/dataset/best_moves.jsonl
```
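A repeatable flag like this is commonly built with `argparse` and `action="append"`. The sketch below is only an illustration of that pattern; the parser name and help texts are made up, and it is not taken from `server.DatasetCurator` itself:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only mirrors the --input/--output flags shown above.
    parser = argparse.ArgumentParser(prog="curator-sketch")
    parser.add_argument(
        "--input",
        action="append",  # each occurrence of --input adds one entry to a list
        required=True,
        help="input file, folder, or glob (may be repeated)",
    )
    parser.add_argument("--output", required=True, help="curated JSONL file")
    return parser


args = build_parser().parse_args(
    [
        "--input", "good_moves-2026-04-03.jsonl",
        "--input", "good_moves-2026-04-04.jsonl",
        "--output", "data/dataset/best_moves.jsonl",
    ]
)
print(args.input)
```

With `action="append"`, `args.input` is always a list, so the single-file and multi-file invocations go through the same code path.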
Curate from a folder or a quoted glob pattern:
```sh
python -m server.DatasetCurator --input data/dataset --output data/dataset/best_moves.jsonl
python -m server.DatasetCurator --input "good_moves-*.jsonl" --output data/dataset/best_moves.jsonl
```
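One way a tool can accept a file, a folder, or a glob through the same flag is sketched below. `resolve_inputs` is a hypothetical helper, and the assumption that a folder means "every `*.jsonl` file directly inside it" is ours, not something the curator documents:

```python
import glob
from pathlib import Path


def resolve_inputs(specs: list[str]) -> list[Path]:
    """Expand file, folder, and glob specs into a de-duplicated list of files."""
    found: list[Path] = []
    seen: set[Path] = set()
    for spec in specs:
        path = Path(spec)
        if path.is_dir():
            # Assumption: a folder contributes its top-level *.jsonl files only.
            matches = sorted(path.glob("*.jsonl"))
        elif path.is_file():
            matches = [path]
        else:
            # Anything else is treated as a glob pattern (may match nothing).
            matches = sorted(Path(m) for m in glob.glob(spec))
        for match in matches:
            key = match.resolve()
            if key not in seen:  # the same file may be reachable via two specs
                seen.add(key)
                found.append(match)
    return found
```

Quoting the pattern on the command line matters for this approach: it keeps the shell from expanding the glob, so the program receives the pattern as a single argument.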
Append mode (keeps existing curated rows and deduplicates against them):
```sh
python -m server.DatasetCurator --input "good_moves-*.jsonl" --output data/dataset/best_moves.jsonl --append
```
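For readers who want a mental model of "append and deduplicate", here is a minimal sketch. It assumes two rows are duplicates when their JSON content is identical regardless of key order; the real curator may use a different key (for example a position or move identifier), and `row_key` and `append_unique` are hypothetical names:

```python
import json
from pathlib import Path


def row_key(row: dict) -> str:
    # Assumption: identity is the whole row, serialized with sorted keys.
    return json.dumps(row, sort_keys=True, separators=(",", ":"))


def append_unique(inputs: list[Path], output: Path) -> int:
    """Append rows from inputs to output, skipping rows already present."""
    seen: set[str] = set()
    if output.exists():
        # Load keys of the rows that are already curated.
        with output.open(encoding="utf-8") as existing:
            for line in existing:
                if line.strip():
                    seen.add(row_key(json.loads(line)))

    written = 0
    output.parent.mkdir(parents=True, exist_ok=True)
    with output.open("a", encoding="utf-8") as out:
        for source in inputs:
            with source.open(encoding="utf-8") as handle:
                for line in handle:
                    if not line.strip():
                        continue  # ignore blank lines
                    row = json.loads(line)
                    key = row_key(row)
                    if key in seen:
                        continue  # duplicate of an existing or earlier row
                    seen.add(key)
                    out.write(json.dumps(row) + "\n")
                    written += 1
    return written
```

The function returns the number of rows actually appended, which makes it easy to log how many new rows a run contributed.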
Archive processed input files after curation (add `--archive-dir` to choose the archive folder):
```sh
python -m server.DatasetCurator --input "good_moves-*.jsonl" --output data/dataset/best_moves.jsonl --append --archive-input
python -m server.DatasetCurator --input "good_moves-*.jsonl" --output data/dataset/best_moves.jsonl --append --archive-input --archive-dir data/dataset/archive
```
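Archiving can be pictured as moving each processed file into a target folder once curation has succeeded. The sketch below is an assumption about the general approach, not the curator's actual behavior; `archive_inputs` and its collision-renaming scheme are hypothetical:

```python
import shutil
from pathlib import Path


def archive_inputs(inputs: list[Path], archive_dir: Path) -> list[Path]:
    """Move processed input files into archive_dir and return their new paths."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved: list[Path] = []
    for source in inputs:
        target = archive_dir / source.name
        counter = 1
        # Assumption: never overwrite an archived file; add a numeric suffix instead.
        while target.exists():
            target = archive_dir / f"{source.stem}.{counter}{source.suffix}"
            counter += 1
        shutil.move(str(source), str(target))
        moved.append(target)
    return moved
```

Moving the files only after the output has been written keeps a failed run from losing its inputs, and it stops a later glob such as `"good_moves-*.jsonl"` from picking up the same files again.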
Or with `just`:
```sh
just curate-dataset
just curate-dataset append=true
just curate-dataset append=true archive=true archive_dir=data/dataset/archive
```
To store compact dataset-only records (JSONL) and skip full per-game JSON files:
```sh