Model Analytics
Performance metrics and evaluation results for the patent classification model trained on 99,826 USPTO patents.
| Metric | Value | Note |
|---|---|---|
| Training Set | 79,861 | 80% of 99,826 |
| Test Set | 19,965 | 20% holdout |
| Feature Dimensions | 100,000 | TF-IDF vectors |
| Best Accuracy | 76.4% | Section level |
[Chart: Precision, Recall, and F1 by section]
| Section | Name | Precision | Recall | F1 Score |
|---|---|---|---|---|
| A | Human Necessities | 74% | 76% | 75% |
| B | Operations & Transport | 71% | 69% | 70% |
| C | Chemistry & Metallurgy | 82% | 84% | 83% |
| D | Textiles & Paper | 88% | 86% | 87% |
| E | Fixed Constructions | 79% | 77% | 78% |
| F | Mechanical Engineering | 75% | 73% | 74% |
| G | Physics | 68% | 70% | 69% |
| H | Electricity | 72% | 74% | 73% |
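The F1 column can be reproduced from the precision and recall columns, since F1 is their harmonic mean. The following is a minimal Python check using values copied from the table. The macro average at the end is an added convenience for illustration, not a figure reported on this page.

```python
# Precision and recall per CPC section, copied from the table (in percent).
SECTION_METRICS = {
    "A": (74, 76),
    "B": (71, 69),
    "C": (82, 84),
    "D": (88, 86),
    "E": (79, 77),
    "F": (75, 73),
    "G": (68, 70),
    "H": (72, 74),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_by_section = {s: f1_score(p, r) for s, (p, r) in SECTION_METRICS.items()}

# Unweighted mean over the 8 sections (not reported in the table).
macro_f1 = sum(f1_by_section.values()) / len(f1_by_section)

for section, f1 in f1_by_section.items():
    print(f"{section}: F1 = {f1:.1f}%")
print(f"Macro-averaged F1 = {macro_f1:.1f}%")
```

Each computed value rounds to the F1 figure shown in the corresponding table row.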
Projected accuracy improvements with larger training datasets and advanced model architectures.
[Chart series: Section, Class, Subclass]
Current position: roughly 100K patents (99,826) with TF-IDF + LinearSVC, at 76.4% section-level accuracy. With 500K+ patents and transformer models (DistilBERT/GPT), section-level accuracy is projected to reach 92% or higher. Production-scale training would require GPU infrastructure (AWS SageMaker).
Methodology
Data Source
99,826 issued patents from the USPTO PatentsView database (2024), balanced across all 8 CPC sections via stratified sampling.
Feature Engineering
TF-IDF vectorization with 100,000 features, word-level unigrams and bigrams, sublinear term frequency, and L2 normalization.
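To make the weighting terms concrete, here is a small pure-Python sketch of word-level unigrams and bigrams, sublinear term frequency (1 + ln tf), and L2 normalization. It is an illustration only, not the production vectorizer. The whitespace tokenizer, the smoothed IDF formula, and the two sample sentences are assumptions, and the 100,000-feature cap is omitted.

```python
import math
from collections import Counter


def word_ngrams(text: str) -> list[str]:
    """Lowercased word unigrams and bigrams (whitespace tokenization for brevity)."""
    tokens = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Sparse TF-IDF vectors with sublinear TF and L2 normalization."""
    term_counts = [Counter(word_ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for counts in term_counts for term in counts)

    vectors = []
    for counts in term_counts:
        weights = {}
        for term, tf in counts.items():
            sublinear_tf = 1.0 + math.log(tf)  # dampens repeated terms
            # Smoothed IDF (assumed variant; the source does not specify one).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            weights[term] = sublinear_tf * idf
        # L2 normalization: scale each document vector to unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


# Hypothetical sample texts, not taken from the patent dataset.
docs = [
    "a battery cell with a lithium anode",
    "a woven textile fabric with a coated yarn",
]
vectors = tfidf_vectors(docs)
```

With L2 normalization every document vector has unit length, so long and short patents contribute on the same scale to a linear classifier.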
Model Architecture
LinearSVC with calibrated probability estimates, 80/20 train/test split, and hierarchical classification at Section, Class, and Subclass levels.
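A section-level version of this setup might look like the sketch below. It assumes scikit-learn, which the name LinearSVC suggests but the page does not state. The toy corpus, the sigmoid calibration via `CalibratedClassifierCV`, `cv=3`, and `random_state=0` are illustrative choices rather than documented settings. Class and Subclass levels are not shown.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus (hypothetical); the real input is the 99,826-patent set.
texts = [f"polymer catalyst compound sample {i}" for i in range(10)] + [
    f"circuit voltage transistor sample {i}" for i in range(10)
]
labels = ["C"] * 10 + ["H"] * 10  # CPC section letters

# 80/20 split, stratified so both sets keep the same section balance.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

model = make_pipeline(
    TfidfVectorizer(
        max_features=100_000,  # feature cap from the methodology
        ngram_range=(1, 2),    # word-level unigrams and bigrams
        sublinear_tf=True,     # use 1 + ln(tf)
        norm="l2",
    ),
    # LinearSVC has no predict_proba; the wrapper adds calibrated probabilities.
    CalibratedClassifierCV(LinearSVC(), cv=3),
)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)  # one row per test document
print("Section-level accuracy:", model.score(X_test, y_test))
```

One common way to extend this to the Class and Subclass levels is to fit a separate model per level, or per parent section, but the page does not say which approach was used.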