mdonigian committed on
Commit
78cbdf8
·
verified ·
1 Parent(s): 2f62234

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,154 @@
1
+ ---
2
+ base_model: microsoft/unixcoder-base
3
+ language:
4
+ - code
5
+ library_name: transformers
6
+ pipeline_tag: text-classification
7
+ tags:
8
+ - code
9
+ - multi-task
10
+ - structured-data
11
+ - code-quality
12
+ - content-type
13
+ - regression
14
+ - classification
15
+ - starcoderdata
16
+ datasets:
17
+ - bigcode/starcoderdata
18
+ ---
19
+
20
+ # StarCoderData Code Classifier
21
+
22
+ Multi-task code classification model for filtering large-scale code datasets. Built to aggressively curate training data for a 1B-parameter code model focused on structured data.
23
+
24
+ ## Model Details
25
+
26
+ - **Base model:** microsoft/unixcoder-base (125M params)
27
+ - **Architecture:** Shared encoder + three task-specific linear heads
28
+ - **Training data:** 191,776 code samples from bigcode/starcoderdata, labeled by GPT-5-nano (Batch API)
29
+ - **Languages:** Python, JavaScript, TypeScript, Java, Go, Rust, SQL, Shell (up to 25K per language)
30
+ - **Training:** 3 epochs, batch size 16, lr 2e-5, AMP (bf16), torch.compile
31
+
32
+ ## Tasks
33
+
34
+ | Task | Type | Output | Loss |
35
+ |------|------|--------|------|
36
+ | Code Quality (1-5) | Regression | Float | MSE |
37
+ | Structured Data Relevance (0-3) | Regression | Float | MSE |
38
+ | Content Type (9 classes) | Classification | Softmax | CrossEntropy |
39
+
40
+ ## Test Set Performance
41
+
42
+ ### Code Quality (1-5 scale)
43
+
44
+ | Metric | Score |
45
+ |--------|-------|
46
+ | MAE | 0.598 |
47
+ | Rounded Accuracy | 55.3% |
48
+ | Spearman r | 0.575 |
49
+
50
+ | Level | Precision | Recall | F1 |
51
+ |-------|-----------|--------|----|
52
+ | 1 - Broken/gibberish | 0.72 | 0.35 | 0.47 |
53
+ | 2 - Functional but poor | 0.28 | 0.23 | 0.25 |
54
+ | 3 - Decent | 0.49 | 0.67 | 0.56 |
55
+ | 4 - Good | 0.68 | 0.63 | 0.65 |
56
+ | 5 - Excellent | 0.08 | 0.00 | 0.01 |
57
+
58
+ ### Structured Data Relevance (0-3 scale)
59
+
60
+ | Metric | Score |
61
+ |--------|-------|
62
+ | MAE | 0.421 |
63
+ | Rounded Accuracy | 66.7% |
64
+ | Spearman r | 0.807 |
65
+
66
+ | Level | Precision | Recall | F1 |
67
+ |-------|-----------|--------|----|
68
+ | 0 - None | 0.73 | 0.70 | 0.71 |
69
+ | 1 - Minor | 0.43 | 0.48 | 0.45 |
70
+ | 2 - Significant | 0.74 | 0.75 | 0.75 |
71
+ | 3 - Primary focus | 0.67 | 0.59 | 0.63 |
72
+
73
+ ### Content Type (9 classes)
74
+
75
+ | Metric | Score |
76
+ |--------|-------|
77
+ | Accuracy | 87.5% |
78
+ | Macro F1 | 0.678 |
79
+
80
+ | Type | Precision | Recall | F1 | Support |
81
+ |------|-----------|--------|----|---------|
82
+ | library | 0.89 | 0.92 | 0.91 | 7,990 |
83
+ | application | 0.83 | 0.75 | 0.79 | 3,404 |
84
+ | test | 0.92 | 0.93 | 0.93 | 1,818 |
85
+ | config | 0.77 | 0.68 | 0.72 | 309 |
86
+ | tutorial | 0.56 | 0.37 | 0.45 | 227 |
87
+ | data | 0.45 | 0.59 | 0.51 | 129 |
88
+ | generated | 0.66 | 0.49 | 0.56 | 316 |
89
+ | script | 0.90 | 0.93 | 0.91 | 4,970 |
90
+ | other | 0.75 | 0.20 | 0.32 | 15 |
91
+
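As a rough illustration of how the regression metrics above could be computed, the sketch below derives MAE and rounded accuracy from float predictions. It assumes that "rounded accuracy" means rounding each prediction to the nearest level, clamping it to the label range, and comparing it with the integer label. The exact procedure used for this model is not stated here, and the function names are hypothetical.

```python
import math


def to_level(pred: float, lo: int, hi: int) -> int:
    """Round half up to the nearest level, then clamp into [lo, hi]."""
    return max(lo, min(hi, math.floor(pred + 0.5)))


def mae(preds, labels) -> float:
    """Mean absolute error between float predictions and integer labels."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def rounded_accuracy(preds, labels, lo: int, hi: int) -> float:
    """Share of predictions whose rounded, clamped level equals the label."""
    hits = sum(to_level(p, lo, hi) == y for p, y in zip(preds, labels))
    return hits / len(labels)


# Toy quality predictions on the 1-5 scale (made-up numbers, not model output).
preds = [3.4, 2.6, 4.1, 0.7]
labels = [3, 2, 4, 1]

quality_mae = mae(preds, labels)
quality_acc = rounded_accuracy(preds, labels, lo=1, hi=5)
print(f"MAE: {quality_mae:.3f}  rounded accuracy: {quality_acc:.1%}")
```

In this toy data the 2.6 prediction rounds to level 3 against a label of 2, which is the same kind of 2-3 boundary miss that the per-level table shows for the real model.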
92
+ ## Why These Scores Are Acceptable
93
+
94
+ This model is designed as a **coarse filter**, not a precise labeler. The intended workflow is:
95
+
96
+ 1. Run this model on the full StarCoderData (~250B tokens)
97
+ 2. Apply threshold filters (e.g., quality >= 3 AND structured_data >= 2)
98
+ 3. Train a 1B parameter model on the filtered subset
99
+
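The threshold step in the workflow above can be sketched as follows. This is a minimal illustration, assuming the thresholds are applied directly to the raw float outputs of the two regression heads. The `Scores` record and `passes_filter` helper are hypothetical names, not part of the released code.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical per-file record holding the two regression outputs."""
    path: str
    quality: float          # predicted on the 1-5 scale
    structured_data: float  # predicted on the 0-3 scale


def passes_filter(s: Scores, min_quality: float = 3.0,
                  min_structured: float = 2.0) -> bool:
    """Keep a file only if it clears both thresholds (quality >= 3 AND structured_data >= 2)."""
    return s.quality >= min_quality and s.structured_data >= min_structured


# Made-up scores for three files.
scored = [
    Scores("a.py", quality=3.6, structured_data=2.4),
    Scores("b.py", quality=2.9, structured_data=2.8),  # fails the quality gate
    Scores("c.py", quality=4.2, structured_data=1.1),  # fails the structured-data gate
]

kept = [s.path for s in scored if passes_filter(s)]
print(kept)
```

Comparing raw floats rather than rounded levels is a design choice: a cutoff of 3.0 is stricter than rounding, which would admit anything from 2.5 upward.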
100
+ For this filtering use case, what matters is **rank ordering**, not exact classification:
101
+
102
+ - **Structured data (Spearman 0.81):** The model's strongest dimension. It reliably separates code with heavy structured data usage (APIs, schemas, serialization) from code without it. The two levels that pass the structured_data >= 2 filter score 0.75 F1 (level 2) and 0.63 F1 (level 3), so the filtered subset should be rich in structured data patterns.
103
+
104
+ - **Quality (Spearman 0.58):** The weakest dimension, but still useful for filtering. The model struggles most with the quality 2-3 boundary (poor vs. decent) and virtually ignores quality 5 (only 1.2% of training data). However, for the intended filter of quality >= 3, the model has moderate precision at levels 3-4 (0.49-0.68). The key insight: false positives at the boundary (quality-2 code scored as 3) are tolerable because the structured data filter provides a second gate. Code that passes both filters is less likely to be low quality.
105
+
106
+ - **Content type (87.5% accuracy):** Strong performance on the high-volume categories that matter most for filtering: library (0.91 F1), script (0.91 F1), test (0.93 F1), and application (0.79 F1). The weaker categories (tutorial, data, generated, other) have low support: together they account for 687 of the 19,178 test samples, about 3.6%. Even with lower recall on these rare types, the model will still flag enough of them for filtering decisions.
107
+
108
+ - **Errors are mostly small, not catastrophic.** A quality MAE of 0.60 means predictions are off by less than one level on average. A file scored as quality 4 is therefore far more likely to be quality 3-5 than quality 1, which is the behavior threshold-based filtering needs.
109
+
110
+ ## How to Improve
111
+
112
+ The primary bottleneck is **training data volume and class balance**, not model capacity:
113
+
114
+ 1. **Scale up the GPT-5-nano labeling set.** The current model was trained on 192K labeled samples (~$20 via Batch API). Doubling to 400K samples (~$40) would particularly help quality levels 2 and 5, where the model struggles most. Level 5 (excellent code) had only 2,345 training examples — far too few for the model to learn the pattern.
115
+
116
+ 2. **Oversample rare classes.** Content types like tutorial (2,197 samples), data (1,413), and generated (2,975) are underrepresented. A targeted labeling run that specifically seeks out these types (e.g., filtering by filename patterns like `*.generated.*` or `tutorial*`) would improve recall on rare types without relabeling the entire dataset.
117
+
118
+ 3. **Increase max token length.** The current model uses 512 tokens, but code files often need more context to assess quality. Increasing to 1024 tokens, the most UniXcoder supports (`max_position_embeddings` is 1026 in `config.json`), would give the model more signal, particularly for quality assessment where style and documentation patterns emerge over longer spans. Inputs longer than that would need chunking or a different encoder.
119
+
120
+ 4. **Add a second training round with hard examples.** After running inference on the full StarCoderData, sample files where the model is least confident (prediction near the decision boundary, e.g., quality between 2.5 and 3.5) and send those to GPT-5-nano for labeling. Training on these hard cases would sharpen the model's performance exactly where filtering decisions are made.
121
+
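The hard-example selection described in step 4 could look roughly like the sketch below. It assumes predictions are available as `(path, predicted_quality)` pairs and that "least confident" simply means a score inside the 2.5-3.5 band. The function name, the fixed seed, and the sample data are illustrative assumptions.

```python
import random


def select_hard_examples(scored, low=2.5, high=3.5, k=2, seed=0):
    """Sample up to k files whose predicted quality lies in [low, high].

    scored: iterable of (path, predicted_quality) pairs.
    """
    near_boundary = [path for path, q in scored if low <= q <= high]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(near_boundary, min(k, len(near_boundary)))


# Made-up predictions; only b.py, c.py and e.py fall inside the band.
scored = [
    ("a.py", 4.4),
    ("b.py", 2.7),
    ("c.py", 3.1),
    ("d.py", 1.2),
    ("e.py", 3.5),
]

hard = select_hard_examples(scored, k=2)
print(hard)
```

The selected paths would then be sent for relabeling and added to the training set for the second round.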
122
+ ## Usage
123
+
124
+ ```python
125
+ from train_starcoderdata import CodeClassifierModel, load_model
126
+ from transformers import AutoTokenizer
127
+ import torch
128
+
129
+ model_dir = "models/starcoderdata-classifier"
130
+ tokenizer = AutoTokenizer.from_pretrained(model_dir)
131
+ model = load_model(model_dir)
132
+ model.eval()
133
+
134
+ code = "def hello(): print('world')"
135
+ inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
136
+ with torch.no_grad():
137
+     quality, structured_data, content_type_logits = model(
138
+         inputs["input_ids"], inputs["attention_mask"]
139
+     )
140
+
141
+ print(f"Quality: {quality.item():.1f}")
142
+ print(f"Structured Data: {structured_data.item():.1f}")
143
+ print(f"Content Type: {['library','application','test','config','tutorial','data','generated','script','other'][content_type_logits.argmax()]}")
144
+ ```
145
+
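Instead of hard-coding the class list, the label names and score ranges can be taken from `label_config.json`. The sketch below is one possible way to do that, using an inline copy of the relevant fields so it runs on its own. Clamping the regression outputs to the configured ranges is an assumption, not something the model is known to do, and `decode_outputs` is a hypothetical helper.

```python
import json

# Inline copy of the relevant fields from label_config.json (in practice,
# load the file from the model directory instead).
LABEL_CONFIG = json.loads("""
{
  "content_types": ["library", "application", "test", "config", "tutorial",
                    "data", "generated", "script", "other"],
  "quality_range": [1, 5],
  "structured_data_range": [0, 3]
}
""")


def clamp(value, bounds):
    """Restrict value to the closed interval given by bounds = [lo, hi]."""
    lo, hi = bounds
    return max(lo, min(hi, value))


def decode_outputs(quality, structured_data, content_type_logits, cfg=LABEL_CONFIG):
    """Turn raw head outputs (plain floats and a list of logits) into labels."""
    best = max(range(len(content_type_logits)), key=content_type_logits.__getitem__)
    return {
        "quality": clamp(quality, cfg["quality_range"]),
        "structured_data": clamp(structured_data, cfg["structured_data_range"]),
        "content_type": cfg["content_types"][best],
    }


# Made-up raw outputs; the largest logit is at index 2 ("test").
result = decode_outputs(5.7, -0.2, [0.1, 0.3, 2.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0])
print(result)
```

With the tensors from the usage snippet, the inputs would be `quality.item()`, `structured_data.item()`, and `content_type_logits.squeeze(0).tolist()`.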
146
+ ## Files
147
+
148
+ - `config.json`, `model.safetensors` — UniXcoder encoder weights (HuggingFace format)
149
+ - `classifier_heads.pt` — Quality, structured data, and content type head weights
150
+ - `tokenizer.json`, `tokenizer_config.json` — Tokenizer
151
+ - `label_config.json` — Label definitions and task types
152
+ - `test_metrics.json` — Full test set metrics
153
+ - `training_history.csv` — Per-epoch training/validation metrics
154
+ - `checkpoint.pt` — Full training checkpoint (for resume)
checkpoint.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a2aa8bda24f4129cdebea06b3a254d582859bb335b9a5f15011f3a7e23a17106
3
+ size 1506793407
classifier_heads.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:427dbb95238f4428e390a2887f716229fd4090d59695f5848a327669315b8f11
3
+ size 36993
config.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "add_cross_attention": false,
3
+ "architectures": [
4
+ "RobertaModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 0,
8
+ "classifier_dropout": null,
9
+ "dtype": "float32",
10
+ "eos_token_id": 2,
11
+ "gradient_checkpointing": false,
12
+ "hidden_act": "gelu",
13
+ "hidden_dropout_prob": 0.1,
14
+ "hidden_size": 768,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 3072,
17
+ "is_decoder": false,
18
+ "layer_norm_eps": 1e-05,
19
+ "max_position_embeddings": 1026,
20
+ "model_type": "roberta",
21
+ "num_attention_heads": 12,
22
+ "num_hidden_layers": 12,
23
+ "output_past": true,
24
+ "pad_token_id": 1,
25
+ "position_embedding_type": "absolute",
26
+ "tie_word_embeddings": true,
27
+ "transformers_version": "5.1.0",
28
+ "type_vocab_size": 10,
29
+ "use_cache": true,
30
+ "vocab_size": 51416
31
+ }
label_config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "content_types": [
3
+ "library",
4
+ "application",
5
+ "test",
6
+ "config",
7
+ "tutorial",
8
+ "data",
9
+ "generated",
10
+ "script",
11
+ "other"
12
+ ],
13
+ "quality_range": [
14
+ 1,
15
+ 5
16
+ ],
17
+ "structured_data_range": [
18
+ 0,
19
+ 3
20
+ ],
21
+ "num_content_types": 9,
22
+ "tasks": {
23
+ "quality": "regression",
24
+ "structured_data": "regression",
25
+ "content_type": "classification"
26
+ }
27
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:83a946ad4457ba05d0fb29ea4ab7524872021277a02cd131ce9f01a79b2231cd
3
+ size 503741264
test_metrics.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "loss": 1.3610109102735926,
3
+ "quality_mae": 0.598224937915802,
4
+ "quality_mse": 0.6799516677856445,
5
+ "quality_rounded_acc": 0.552508082177495,
6
+ "quality_spearman": 0.5751793155320672,
7
+ "structured_data_mae": 0.4209136366844177,
8
+ "structured_data_mse": 0.3287431299686432,
9
+ "structured_data_rounded_acc": 0.6668578579622484,
10
+ "structured_data_spearman": 0.8065141417877283,
11
+ "content_type_accuracy": 0.8753258942538326,
12
+ "content_type_macro_f1": 0.6776609396601136,
13
+ "combined_mae": 0.5095692873001099
14
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,16 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "backend": "tokenizers",
4
+ "bos_token": "<s>",
5
+ "cls_token": "<s>",
6
+ "eos_token": "</s>",
7
+ "errors": "replace",
8
+ "is_local": false,
9
+ "mask_token": "<mask>",
10
+ "model_max_length": 1000000000000000019884624838656,
11
+ "pad_token": "<pad>",
12
+ "sep_token": "</s>",
13
+ "tokenizer_class": "RobertaTokenizer",
14
+ "trim_offsets": true,
15
+ "unk_token": "<unk>"
16
+ }
training_history.csv ADDED
@@ -0,0 +1,4 @@
1
+ epoch,train_loss,val_loss,val_quality_mae,val_quality_mse,val_quality_rounded_acc,val_quality_spearman,val_structured_data_mae,val_structured_data_mse,val_structured_data_rounded_acc,val_structured_data_spearman,val_content_type_accuracy,val_content_type_macro_f1,val_combined_mae
2
+ 1,1.9338482332460847,1.448065088578718,0.6309319138526917,0.7191991209983826,0.522108666179998,0.553817859073393,0.45338281989097595,0.35454854369163513,0.6431848993638544,0.79499171167133,0.8623944102617582,0.6234108515025513,0.5421573668718338
3
+ 2,1.3254653735931108,1.3402525051073595,0.6033948659896851,0.6770005226135254,0.5472937741161747,0.5729696168341741,0.42503005266189575,0.32683703303337097,0.6662842840755032,0.8050313624661184,0.8754823234956721,0.6905006297568385,0.5142124593257904
4
+ 3,1.1299075901272126,1.3563477016171384,0.6024025678634644,0.6801310777664185,0.5458337678590051,0.5730787938638631,0.4216791093349457,0.33049410581588745,0.6724893106684743,0.8050240322085739,0.8755866096568985,0.6923274552582193,0.512040838599205