Do Not Train on Chaos: Why LoRA Datasets Need Copy-Only Data Prep
LoRA quality begins before training starts. If source images move, captions drift, and experiments overwrite each other, the model is trained on data the team can no longer explain.
LoRA training often gets described as a model problem: choose a rank, pick a learning rate, train for enough steps, and compare the outputs. Those choices matter, but they are not the first reliability problem. The first problem is whether the dataset is stable enough to trust.
In a product-image workflow, the training data came from multiple places: existing curated examples, raw render exports, generated candidates promoted after review, and synthetic product views produced by tooling. The temptation was obvious. Merge the new images into the current folder, rename whatever needs renaming, patch captions as you go, and keep moving.
That is how training data becomes unknowable.
A LoRA dataset is not a scratch folder. It is an experiment record.
Copy-Only Is Not Paranoia
The safest rule was simple: never move source images. Always copy them into the new dataset location. Never delete training data during preparation. Prefer multiple copies over accidental loss. Keep original datasets untouched during experiments.
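As a minimal sketch of that rule, assuming a flat folder per dataset version: the helper below copies files into a directory that must not exist yet, refuses to overwrite on a name collision, and records where each copy came from. The function names, the manifest.json file, and its fields are illustrative assumptions, not part of the workflow described here.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copy_into_dataset(sources: list[Path], dataset_dir: Path) -> list[dict]:
    """Copy each source file into a new dataset_dir; never move, overwrite, or delete."""
    # exist_ok=False: refuse to reuse a directory, so older versions stay untouched.
    dataset_dir.mkdir(parents=True, exist_ok=False)
    manifest = []
    for src in sources:
        dest = dataset_dir / src.name
        if dest.exists():
            raise FileExistsError(f"name collision, refusing to overwrite: {dest}")
        shutil.copy2(src, dest)  # copies data and metadata; the source stays in place
        manifest.append(
            {"source": str(src), "copy": dest.name, "sha256": sha256_of(dest)}
        )
    # Hypothetical provenance record: one entry per copied file.
    (dataset_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Calling it a second time with the same target directory raises FileExistsError, which is the point: a new batch gets a new directory rather than a quiet edit to an old one.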
That may sound inefficient to anyone used to normal file hygiene, where duplicates are clutter. In training workflows, the duplicates are the safeguard. Storage is cheap compared with the cost of losing provenance. If a later LoRA looks better or worse, the team needs to know exactly which images and captions were used, where they came from, and what changed from the previous set.
Copy-only prep gives each dataset a stable identity. A version such as hardware-design-aesthetic-v2 can reuse images from v1, add raw turntable renders, and include newly written captions without mutating the older experiment. That makes comparison possible.
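One way to make that comparison concrete is to fingerprint each version by file content and diff the results. The sketch below assumes each dataset is a flat directory of files; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(dataset_dir: Path) -> dict[str, str]:
    """Map each file name in dataset_dir to the SHA-256 of its contents."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(dataset_dir.iterdir())
        if p.is_file()
    }


def diff_datasets(old_dir: Path, new_dir: Path) -> dict[str, list[str]]:
    """Report which files were added, removed, or changed between two versions."""
    old, new = fingerprint(old_dir), fingerprint(new_dir)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Same name but different bytes, e.g. a rewritten caption file.
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }
```

Because both directories are only read, the diff can be run at any time without touching either experiment.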
Captions Are Part Of The Dataset
A LoRA dataset is not just images. It is image-caption pairs. If the captions are inconsistent, too vague, or derived from filenames rather than visible features, the model learns noise from the text side even when the images are good.
The stronger captioning style was literal and constrained. Start with the unique trigger token. Avoid category words that would drag the model toward unwanted associations. Describe only visible features: view angle, body color, finish, silhouette, top component, transparent reservoir, screen, seams, buttons, and distinctive shapes. Keep the vocabulary consistent across the set.
This kind of captioning is not creative writing. It is schema design for visual learning. The point is to make the dataset legible to the model and auditable to the team.
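Treating captions as a schema means they can be linted. The following is a rough sketch, not the team's actual tooling: the trigger token "hwdsgn" and the banned word list are invented for illustration, and a real set would define its own.

```python
import re
from collections import Counter

TRIGGER = "hwdsgn"  # hypothetical trigger token
BANNED = frozenset({"phone", "gadget", "appliance"})  # hypothetical category words


def tokens(caption: str) -> list[str]:
    """Lowercase a caption and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def caption_problems(caption: str, trigger: str = TRIGGER,
                     banned: frozenset = BANNED) -> list[str]:
    """Return a list of rule violations for one caption (empty if clean)."""
    problems = []
    words = tokens(caption)
    if not words or words[0] != trigger.lower():
        problems.append("does not start with trigger token")
    hits = sorted(set(words) & banned)
    if hits:
        problems.append("banned category words: " + ", ".join(hits))
    return problems


def rare_terms(captions: list[str], min_count: int = 2) -> list[str]:
    """List words that appear in fewer than min_count captions.

    Rare words are not always errors, but they often reveal vocabulary
    drift such as "gray" in one caption and "grey" in another.
    """
    counts = Counter(w for c in captions for w in set(tokens(c)))
    return sorted(w for w, n in counts.items() if n < min_count)
```

The output is meant for a human reviewer: a flagged caption is a prompt to look at the image again, not an automatic rewrite.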
Independent Datasets Beat Mystery Merges
When a new batch of source images arrives, the useful default is a new independent dataset. Merge only when the team explicitly decides to merge. Otherwise, every experiment should have a name, a source list, and its own complete image-caption set.
This Is Business Risk, Not Tidiness
Bad dataset operations do not just annoy engineers. They waste review cycles, GPU time, and stakeholder patience. When a LoRA produces poor images, the team needs to know whether the problem was the training settings, the base model, the trigger token, the runtime, the source images, or the captions.
If the dataset is messy, every diagnosis becomes guesswork. The team cannot reproduce the winning run. It cannot explain the failing one. It cannot safely hand the workflow to a non-technical user.
The practical rule is blunt: before you optimize LoRA training, make the dataset boring. Copy sources. Version sets. Pair captions. Validate counts. Keep old experiments untouched.
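The "pair captions, validate counts" step can be sketched as a read-only check. It assumes each image has a sidecar .txt caption with the same file stem, a common convention in LoRA tooling but an assumption here; the extension list and function name are also illustrative.

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}  # assumed image formats


def check_pairs(dataset_dir: Path) -> dict:
    """Count image-caption pairs and list anything unpaired or empty."""
    files = [p for p in dataset_dir.iterdir() if p.is_file()]
    caption_files = [p for p in files if p.suffix.lower() == ".txt"]
    images = {p.stem for p in files if p.suffix.lower() in IMAGE_EXTS}
    captions = {p.stem for p in caption_files}
    return {
        "pairs": len(images & captions),
        "images_without_caption": sorted(images - captions),
        "captions_without_image": sorted(captions - images),
        # A caption file that exists but contains only whitespace.
        "empty_captions": sorted(
            p.stem for p in caption_files
            if not p.read_text(encoding="utf-8").strip()
        ),
    }
```

A dataset is ready for training under this check when the pair count matches the expected image count and the three lists are empty.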
A clean dataset will not guarantee a good LoRA. But a chaotic dataset can make every result untrustworthy.