Model Analytics

Performance metrics and evaluation results for the patent classification model, built on a dataset of 99,826 USPTO patents.

Training Set

79,861

80% of 99,826

Test Set

19,965

20% holdout

Feature Dimensions

100,000

TF-IDF vectors

Best Accuracy

76.4%

Section level

Multi-Level Classification Accuracy (chart)

Per-Section Performance (chart: Precision, Recall, and F1 by section)

Section-Level Performance Detail

Section	Name	Precision	Recall	F1 Score
A	Human Necessities	74%	76%	75%
B	Operations & Transport	71%	69%	70%
C	Chemistry & Metallurgy	82%	84%	83%
D	Textiles & Paper	88%	86%	87%
E	Fixed Constructions	79%	77%	78%
F	Mechanical Engineering	75%	73%	74%
G	Physics	68%	70%	69%
H	Electricity	72%	74%	73%
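The F1 column can be cross-checked against the precision and recall columns, since F1 is the harmonic mean of the two. The sketch below copies the percentages from the table above; the names `SECTION_SCORES`, `f1_score`, and `macro_f1` are illustrative and not taken from the project's code.

```python
# Precision and recall per CPC section, copied from the table above (percent).
SECTION_SCORES = {
    "A": (74, 76),
    "B": (71, 69),
    "C": (82, 84),
    "D": (88, 86),
    "E": (79, 77),
    "F": (75, 73),
    "G": (68, 70),
    "H": (72, 74),
}


def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 per section, rounded to whole percent as in the table.
f1_by_section = {
    section: round(f1_score(p, r)) for section, (p, r) in SECTION_SCORES.items()
}

# Unweighted (macro) average of the unrounded per-section F1 values.
macro_f1 = sum(f1_score(p, r) for p, r in SECTION_SCORES.values()) / len(SECTION_SCORES)

print(f1_by_section)
print(f"macro F1: {macro_f1:.1f}%")
```

Rounded to whole percentages, the recomputed values match the F1 Score column for all eight sections.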

Accuracy Scaling Projection

Projected accuracy improvements with larger training datasets and advanced model architectures.

(Chart: projected accuracy at the Section, Class, and Subclass levels)

Current position: 100K patents with TF-IDF + LinearSVC. With 500K+ patents and transformer models (DistilBERT/GPT), section-level accuracy is projected to reach 92%+. Production-scale training would require GPU infrastructure (AWS SageMaker).

Methodology

Data Source

99,826 issued patents from the USPTO PatentsView database (2024), balanced across all 8 CPC sections via stratified sampling.
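One way to balance a corpus across sections is to group records by section and draw the same number from each group. The sketch below is a minimal illustration using only the standard library. The function `stratified_sample`, the per-section quota, and the toy records are all hypothetical, and the actual sampling procedure used for this dataset is not documented here.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for label in sorted(strata):
        group = strata[label]
        # A stratum smaller than the quota contributes everything it has.
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample


# Toy, deliberately imbalanced corpus of (patent_id, cpc_section) pairs.
sizes = {"A": 90, "B": 40, "C": 35, "D": 12, "E": 20, "F": 30, "G": 60, "H": 55}
toy_patents = [(f"{s}{i}", s) for s, n in sizes.items() for i in range(n)]

balanced = stratified_sample(toy_patents, key=lambda r: r[1], per_stratum=10)
counts = Counter(section for _, section in balanced)
print(counts)
```

With a fixed seed the draw is reproducible, which makes it easier to regenerate the same dataset later.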

Feature Engineering

TF-IDF vectorization with 100,000 features, word-level unigrams and bigrams, sublinear term frequency, and L2 normalization.
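The settings listed above map directly onto scikit-learn's `TfidfVectorizer`. The following is a sketch under the assumption that scikit-learn is the library in use, which this page does not confirm; the three sample documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Configuration mirroring the description above (library choice is an assumption).
vectorizer = TfidfVectorizer(
    max_features=100_000,  # cap the vocabulary at 100,000 terms
    analyzer="word",       # word-level tokens
    ngram_range=(1, 2),    # unigrams and bigrams
    sublinear_tf=True,     # replace tf with 1 + log(tf)
    norm="l2",             # scale each document vector to unit length
)

# Hypothetical sample texts, not real patent abstracts.
docs = [
    "A rotary engine with a cooling fin assembly",
    "A pharmaceutical composition comprising a polymer",
    "A semiconductor device with a gate electrode",
]

X = vectorizer.fit_transform(docs)

# With L2 normalization, every non-empty row has Euclidean norm 1.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(X.shape, row_norms)
```

On a small corpus the vocabulary stays far below the 100,000 cap; the limit only takes effect once the number of distinct unigrams and bigrams exceeds it.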

Model Architecture

LinearSVC with calibrated probability estimates, evaluated on an 80/20 train/test split, with hierarchical classification at the Section, Class, and Subclass levels.
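LinearSVC does not produce probabilities on its own, so a calibration wrapper is needed to obtain them. The sketch below assumes scikit-learn and uses `CalibratedClassifierCV` for that step, together with a stratified 80/20 split. The toy corpus, the two-section setup, and `cv=2` are illustrative choices only, and the hierarchical Class and Subclass stages are not shown.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical text, not patent data): 10 documents per section.
chem_terms = ["polymer", "catalyst", "alloy", "solvent", "compound"]
elec_terms = ["circuit", "voltage", "antenna", "transistor", "battery"]
texts, labels = [], []
for i in range(10):
    texts.append(f"{chem_terms[i % 5]} {chem_terms[(i + 1) % 5]} process")
    labels.append("C")
    texts.append(f"{elec_terms[i % 5]} {elec_terms[(i + 1) % 5]} device")
    labels.append("H")

# 80/20 split, stratified so both sections appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# TF-IDF features feeding a LinearSVC wrapped in a probability calibrator.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    CalibratedClassifierCV(LinearSVC(), cv=2),
)
model.fit(X_train, y_train)

probabilities = model.predict_proba(X_test)  # one column per section
accuracy = model.score(X_test, y_test)
print(probabilities.shape, accuracy)
```

A hierarchical setup could repeat the same pattern at each level, for example by training one Class-level model per Section, though that design is an assumption rather than something this page specifies.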