update to store and curate data correctly
@@ -90,6 +90,49 @@ Or with `just`:
just export-dataset
```

Curate a high-quality training subset (single file):

```sh
python -m server.DatasetCurator --input good_moves-2026-04-03.jsonl --output data/dataset/best_moves.jsonl
```

Curate from multiple JSONL sources (repeat `--input`):

```sh
python -m server.DatasetCurator \
  --input good_moves-2026-04-03.jsonl \
  --input good_moves-2026-04-04.jsonl \
  --output data/dataset/best_moves.jsonl
```

Curate from a folder or a glob pattern:

```sh
python -m server.DatasetCurator --input data/dataset --output data/dataset/best_moves.jsonl
python -m server.DatasetCurator --input "good_moves-*.jsonl" --output data/dataset/best_moves.jsonl
```
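
For readers wondering how a single `--input` value can stand for a file, a folder, or a glob, here is a minimal sketch of one way to expand it. It is not taken from `server.DatasetCurator`: the function name `resolve_inputs`, the `exclude` parameter, and the rule that a folder means "every `*.jsonl` directly inside it" are all assumptions made for illustration. The `exclude` argument is there because the folder example writes `best_moves.jsonl` into the same directory it reads from.

```python
from glob import glob
from pathlib import Path


def resolve_inputs(specs, exclude=None):
    """Expand file, folder, and glob values into a sorted list of unique files.

    Hypothetical helper, not the curator's real code.
    """
    skip = Path(exclude).resolve() if exclude else None
    found = set()
    for spec in specs:
        path = Path(spec)
        if path.is_dir():
            # Assumption: a folder contributes the *.jsonl files directly in it.
            candidates = path.glob("*.jsonl")
        elif path.is_file():
            candidates = [path]
        else:
            # Anything else is treated as a glob such as "good_moves-*.jsonl".
            candidates = (Path(match) for match in glob(str(spec)))
        for candidate in candidates:
            # Skip directories and the output file itself.
            if candidate.is_file() and candidate.resolve() != skip:
                found.add(candidate)
    return sorted(found)
```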

Append mode (keeps existing curated rows and deduplicates against them):

```sh
python -m server.DatasetCurator --input "good_moves-*.jsonl" --output data/dataset/best_moves.jsonl --append
```
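
To make "keeps existing rows and deduplicates against them" concrete, here is a small sketch of that behavior. It is only an illustration under stated assumptions: `append_unique` and `row_key` are hypothetical names, and treating two rows as duplicates when their full JSON content matches is a guess. The real curator may compare specific fields instead.

```python
import json
from pathlib import Path


def row_key(row):
    # Assumption: rows are duplicates when their JSON content is identical.
    return json.dumps(row, sort_keys=True)


def append_unique(new_rows, output_path):
    """Append rows not already in the output file; return how many were added."""
    output = Path(output_path)
    seen = set()
    if output.exists():
        # Load the keys of rows that were curated on earlier runs.
        with output.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    seen.add(row_key(json.loads(line)))
    output.parent.mkdir(parents=True, exist_ok=True)
    added = 0
    # Assumes an existing output file already ends with a newline.
    with output.open("a", encoding="utf-8") as handle:
        for row in new_rows:
            key = row_key(row)
            if key not in seen:
                seen.add(key)
                handle.write(json.dumps(row) + "\n")
                added += 1
    return added
```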

Archive processed input files after curation:

```sh
python -m server.DatasetCurator --input "good_moves-*.jsonl" --output data/dataset/best_moves.jsonl --append --archive-input
python -m server.DatasetCurator --input "good_moves-*.jsonl" --output data/dataset/best_moves.jsonl --append --archive-input --archive-dir data/dataset/archive
```
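
As a rough picture of what archiving inputs could involve, the sketch below moves each processed file into an archive folder. The helper name `archive_inputs` and the numeric-suffix rule for name collisions are assumptions, not the curator's actual behavior. The default directory used when `--archive-dir` is omitted is not stated here, so the sketch takes the directory as a required argument.

```python
import shutil
from pathlib import Path


def archive_inputs(input_paths, archive_dir):
    """Move processed input files into archive_dir; return their new paths.

    Hypothetical helper, not the curator's real code.
    """
    target = Path(archive_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for source in map(Path, input_paths):
        destination = target / source.name
        counter = 1
        # Assumption: never overwrite an earlier archive with the same name.
        while destination.exists():
            destination = target / f"{source.stem}.{counter}{source.suffix}"
            counter += 1
        shutil.move(str(source), str(destination))
        moved.append(destination)
    return moved
```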

Or with `just`:

```sh
just curate-dataset
just curate-dataset append=true
just curate-dataset append=true archive=true archive_dir=data/dataset/archive
```

To store compact dataset-only records (JSONL) and skip full per-game JSON files:

```sh