Model Analytics

Performance metrics and evaluation results for the patent classification model built from 99,826 USPTO patents (79,861 for training, 19,965 held out for testing).

Training Set

79,861

80% of 99,826

Test Set

19,965

20% holdout

Feature Dimensions

100,000

TF-IDF vectors

Best Accuracy

76.4%

Section level

Multi-Level Classification Accuracy
[Chart: accuracy at the Section, Class, and Subclass levels; axis from 0% to 100%.]

Per-Section Performance
[Chart: Precision, Recall, and F1 for sections A through H; axis from 0 to 100.]

Section-Level Performance Detail

Section  Name                     Precision  Recall  F1 Score
A        Human Necessities        74%        76%     75%
B        Operations & Transport   71%        69%     70%
C        Chemistry & Metallurgy   82%        84%     83%
D        Textiles & Paper         88%        86%     87%
E        Fixed Constructions      79%        77%     78%
F        Mechanical Engineering   75%        73%     74%
G        Physics                  68%        70%     69%
H        Electricity              72%        74%     73%

Accuracy Scaling Projection

Projected accuracy improvements with larger training datasets and advanced model architectures.

[Chart: projected accuracy at the Section, Class, and Subclass levels for training-set sizes of 1.6K, 10K, 100K, 500K, and 1M+ patents; axis from 0% to 100%.]

Current position: about 100K patents (99,826) with TF-IDF + LinearSVC. With 500K+ patents and transformer models (DistilBERT/GPT), section-level accuracy is projected to reach 92%+. Production-scale training would require GPU infrastructure (AWS SageMaker).

Methodology

Data Source

99,826 issued patents from the USPTO PatentsView database (2024), balanced across all 8 CPC sections via stratified sampling.
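As a rough illustration of balancing across sections, the sketch below draws the same number of records from each CPC section. The page does not give the actual sampling code or the per-section counts (99,826 is not divisible by 8, so the sections cannot be exactly equal), so the record layout, the `balanced_sample` helper, and the toy data are all assumptions.

```python
import random
from collections import defaultdict

CPC_SECTIONS = "ABCDEFGH"


def balanced_sample(records, per_section, seed=0):
    """Draw up to `per_section` records from each CPC section.

    `records` is assumed to be a list of dicts with a "section" key.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    by_section = defaultdict(list)
    for record in records:
        by_section[record["section"]].append(record)

    sample = []
    for section in CPC_SECTIONS:
        pool = by_section.get(section, [])
        k = min(per_section, len(pool))  # a small section contributes all it has
        sample.extend(rng.sample(pool, k))
    rng.shuffle(sample)  # avoid section-ordered output
    return sample


# Hypothetical, deliberately imbalanced toy data: 10 records in A up to 45 in H.
toy_records = [
    {"id": f"{section}-{i}", "section": section}
    for offset, section in enumerate(CPC_SECTIONS)
    for i in range(10 + 5 * offset)
]

sample = balanced_sample(toy_records, per_section=10)
```

Capping each section at the same count trades away some data from the larger sections in exchange for an even class distribution.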

Feature Engineering

TF-IDF vectorization with 100,000 features, word-level unigrams and bigrams, sublinear term frequency, and L2 normalization.
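These settings map fairly directly onto scikit-learn's `TfidfVectorizer`. The page does not name the library, so treat the following as a minimal sketch under that assumption. The `build_vectorizer` helper and the toy abstracts are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def build_vectorizer(max_features=100_000):
    """TF-IDF settings mirroring the description above."""
    return TfidfVectorizer(
        max_features=max_features,  # cap the vocabulary at 100,000 terms
        analyzer="word",            # word-level tokens
        ngram_range=(1, 2),         # unigrams and bigrams
        sublinear_tf=True,          # use 1 + log(tf) instead of raw tf
        norm="l2",                  # scale each document vector to unit length
    )


# Hypothetical toy abstracts, only to show the call pattern.
toy_abstracts = [
    "a battery cell with a lithium anode",
    "a lithium battery charging circuit",
    "a woven textile fabric with coated fibers",
]

vectorizer = build_vectorizer()
X = vectorizer.fit_transform(toy_abstracts)  # sparse matrix, one row per abstract

print(X.shape)
print("lithium battery" in vectorizer.vocabulary_)  # bigrams are part of the vocabulary
```

On a small corpus the vocabulary stays far below the 100,000 cap; `max_features` only takes effect once the number of distinct unigrams and bigrams exceeds it.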

Model Architecture

LinearSVC with calibrated probability estimates, 80/20 train/test split, and hierarchical classification at Section, Class, and Subclass levels.
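One plausible way to assemble such a model in scikit-learn is sketched below. `LinearSVC` has no `predict_proba` of its own, so the sketch wraps it in `CalibratedClassifierCV`; the page does not say which calibration method was used, nor how the three levels are combined, so both are assumptions here. The `cpc_levels` and `build_classifier` helpers and the toy corpus are hypothetical, and only the Section level is trained.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def cpc_levels(code):
    """Split a CPC subclass symbol such as 'G06F' into its three levels."""
    return {"section": code[0], "class": code[:3], "subclass": code[:4]}


def build_classifier():
    """TF-IDF features feeding a LinearSVC wrapped for probability calibration."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        # The wrapper fits a calibrator on cross-validated decision scores;
        # cv=3 is an arbitrary choice for this toy example.
        CalibratedClassifierCV(LinearSVC(), cv=3),
    )


# Hypothetical toy corpus: (CPC subclass, abstract text).
medical_terms = ["tablet", "capsule", "vaccine", "antibody", "inhaler",
                 "ointment", "syringe", "implant", "catheter", "bandage"]
computing_terms = ["cache", "compiler", "database", "kernel", "scheduler",
                   "router", "firmware", "register", "thread", "network"]
toy = [("A61K", f"medical {t} for treating a patient") for t in medical_terms]
toy += [("G06F", f"processor {t} executing stored instructions") for t in computing_terms]

texts = [text for _, text in toy]
sections = [cpc_levels(code)["section"] for code, _ in toy]

# 80/20 split, stratified so both sections appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, sections, test_size=0.2, stratify=sections, random_state=0
)

model = build_classifier()
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)  # one column per section label
predictions = model.predict(X_test)
```

Under the same assumptions, the Class and Subclass levels could be handled by fitting the same pipeline on `cpc_levels(code)["class"]` and `cpc_levels(code)["subclass"]` labels.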