Model extraction attacks steal ML model functionality by systematically querying a model API and using the responses to train a substitute model that replicates the original's behavior.
A model extraction attack involves an adversary systematically querying a machine learning model's API to collect input-output pairs, then using those pairs to train a substitute model that replicates the original's decision boundaries. This effectively steals the intellectual property embedded in the model's learned parameters and training investment.
Attackers send carefully chosen queries to the target model's API and record the returned predictions, including confidence scores where the API exposes them. Techniques such as active learning help select the queries expected to yield the most information per call. The collected input-output pairs then train a substitute model that can closely approximate the original's behavior with far fewer API calls than naive sampling would require.
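The query-and-train loop described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `target_api` is a hypothetical stand-in for a remote endpoint (a two-feature logistic model with secret weights), the attacker samples queries uniformly at random rather than with active learning, and the substitute is a plain logistic regression fitted by gradient descent on the returned confidence scores.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def target_api(x):
    """Hypothetical black-box endpoint: returns the confidence for class 1."""
    # Secret parameters; the attacker never reads these directly.
    return sigmoid(2.0 * x[0] - 1.0 * x[1] + 0.5)


def extract(n_queries=500, epochs=200, lr=0.5, seed=0):
    """Query the target, then fit a logistic-regression substitute."""
    rng = random.Random(seed)
    queries = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(n_queries)]
    scores = [target_api(q) for q in queries]  # collected input-output pairs

    w0 = w1 = b = 0.0
    for _ in range(epochs):
        g0 = g1 = gb = 0.0
        for (x0, x1), y in zip(queries, scores):
            # Gradient of cross-entropy against the soft label y.
            err = sigmoid(w0 * x0 + w1 * x1 + b) - y
            g0 += err * x0
            g1 += err * x1
            gb += err
        w0 -= lr * g0 / n_queries
        w1 -= lr * g1 / n_queries
        b -= lr * gb / n_queries
    return w0, w1, b


def fidelity(params, n_test=1000, seed=1):
    """Fraction of fresh inputs where substitute and target agree on the label."""
    w0, w1, b = params
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_test):
        x = (rng.uniform(-3, 3), rng.uniform(-3, 3))
        agree += (target_api(x) > 0.5) == (w0 * x[0] + w1 * x[1] + b > 0)
    return agree / n_test


if __name__ == "__main__":
    substitute = extract()
    print("label agreement with target:", fidelity(substitute))
```

The sketch succeeds easily because the substitute has the same form as the target and the API returns full confidence scores. Real targets are far more complex, and an attacker would typically need a better query strategy and many more calls.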
Consequences include theft of the intellectual property in proprietary models, which can represent millions in training investment; white-box adversarial attacks crafted against the extracted copy; circumvention of usage-based API pricing; loss of competitive advantage; exposure of training data characteristics; and potential evasion of ML-based security controls.
Models are more vulnerable when APIs return detailed prediction outputs (full probability distributions rather than just top-1 labels), allow unlimited queries without rate limiting, serve complex models that may be approximated with simpler architectures, and lack monitoring for anomalous query patterns indicating extraction attempts.
Defenses include rate limiting API queries, returning only top-k predictions without confidence scores, adding controlled noise to outputs (prediction perturbation), implementing watermarking to detect stolen models, monitoring for extraction-pattern query sequences, requiring authentication, and applying differential privacy to model responses.
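Two of the defenses listed above, output truncation and rate limiting, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: `harden_response` and `SlidingWindowLimiter` are hypothetical names, predictions are assumed to arrive as a dict mapping label to probability, and the caller supplies the current time in seconds.

```python
from collections import defaultdict, deque


def harden_response(probs, top_k=1):
    """Return only the top-k labels, ranked by score, with the scores removed."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:top_k]


class SlidingWindowLimiter:
    """Allow at most max_queries per client within any window_seconds span."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_id -> timestamps of allowed queries

    def allow(self, client_id, now):
        timestamps = self._history[client_id]
        # Forget queries that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_queries:
            return False
        timestamps.append(now)
        return True


if __name__ == "__main__":
    print(harden_response({"cat": 0.7, "dog": 0.2, "fox": 0.1}, top_k=2))
    limiter = SlidingWindowLimiter(max_queries=2, window_seconds=10)
    print([limiter.allow("client-a", t) for t in (0, 1, 2, 10)])
```

Dropping the scores reduces the information leaked per query, and the limiter raises the time cost of collecting a large query set. Neither stops a patient attacker on its own, which is why the defenses are usually layered.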
Model watermarking embeds a verifiable signature into an ML model by training it to produce specific outputs for specially crafted trigger inputs. If the model is extracted, the watermark can carry over to the stolen copy, and the copy's responses to the trigger inputs then serve as evidence of theft. Transfer is not guaranteed, so robust watermarking schemes are designed to survive fine-tuning and distillation.
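Verifying a watermark amounts to checking how often a suspect model reproduces the secret trigger outputs. The sketch below assumes a model is any callable that maps an input to a label; the function names, the trigger strings, and the 0.9 threshold are all hypothetical choices for illustration.

```python
def watermark_match_rate(model, triggers, expected_labels):
    """Fraction of trigger inputs for which the model returns the secret label."""
    if not triggers:
        raise ValueError("at least one trigger input is required")
    hits = sum(1 for x, y in zip(triggers, expected_labels) if model(x) == y)
    return hits / len(triggers)


def looks_watermarked(model, triggers, expected_labels, threshold=0.9):
    """Treat a match rate at or above the threshold as evidence of the watermark."""
    return watermark_match_rate(model, triggers, expected_labels) >= threshold


if __name__ == "__main__":
    # Hypothetical trigger set known only to the model owner.
    triggers = ["trigger-0", "trigger-1", "trigger-2", "trigger-3"]
    expected = ["zebra", "kiwi", "zebra", "lamp"]
    secret = dict(zip(triggers, expected))

    def suspect_copy(x):
        return secret.get(x, "other")  # reproduces the watermark behavior

    def unrelated_model(x):
        return "other"  # never saw the triggers

    print(looks_watermarked(suspect_copy, triggers, expected))
    print(looks_watermarked(unrelated_model, triggers, expected))
```

In practice the threshold should be set so that an independently trained model is very unlikely to reach it by chance, which depends on the number of triggers and the number of classes.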
Model extraction enables more powerful adversarial attacks. Once an attacker has a local copy of the model, they can run white-box attacks against it, using its gradients to craft adversarial examples that often transfer to the original model. This converts a black-box attack scenario into something close to a white-box one, significantly increasing attack effectiveness.
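The transfer step can be shown with a toy example. The sketch assumes a linear target that exposes labels only, an extracted substitute whose weights are close to but not equal to the target's, and a single fast-gradient-sign step computed on the substitute. All names and numbers are hypothetical.

```python
import math


def target_label(x):
    """Hypothetical black-box target: the attacker sees only this label."""
    return 1 if 2.0 * x[0] - 1.0 * x[1] + 0.5 > 0 else 0


# Assumed outcome of an earlier extraction: close to the target, not identical.
SUB_W = (1.8, -0.9)
SUB_B = 0.4


def substitute_label(x):
    return 1 if SUB_W[0] * x[0] + SUB_W[1] * x[1] + SUB_B > 0 else 0


def fgsm_on_substitute(x, label, eps):
    """One gradient-sign step that raises the substitute's loss on `label`."""
    # For a logistic model the input gradient of the loss is (p - y) * w,
    # so its sign is -sign(w) when y == 1 and +sign(w) when y == 0.
    direction = -1.0 if label == 1 else 1.0
    return tuple(
        xi + eps * direction * math.copysign(1.0, wi) for xi, wi in zip(x, SUB_W)
    )


if __name__ == "__main__":
    x = (0.5, 0.5)
    x_adv = fgsm_on_substitute(x, target_label(x), eps=0.5)
    print("original:", x, "target label", target_label(x))
    print("adversarial:", x_adv, "target label", target_label(x_adv))
```

The perturbation is computed without any further query to the target, yet it flips the target's label because the substitute's decision boundary is close to the original's.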
Detection relies on monitoring API usage for anomalous patterns: high query volumes, systematically distributed input samples, queries clustered near decision boundaries, input distributions that do not match normal user behavior, and sequential queries that appear to probe model behavior. A classifier can also be trained to label query streams as benign or as extraction attempts.
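Two of these signals, query volume and concentration near the decision boundary, lend themselves to a simple heuristic check. The sketch below assumes a binary classifier, a per-client list of the class-1 confidences returned in one monitoring window, and illustrative thresholds that would need tuning on real traffic. The function name is hypothetical.

```python
def extraction_signals(confidences, max_queries=1000, margin=0.1,
                       max_boundary_fraction=0.3):
    """Flag one client's query window using volume and boundary-probing heuristics."""
    n = len(confidences)
    # A confidence close to 0.5 means the input sat near the decision boundary.
    near_boundary = sum(1 for c in confidences if abs(c - 0.5) <= margin)
    boundary_fraction = near_boundary / n if n else 0.0
    return {
        "high_volume": n > max_queries,
        "boundary_probing": boundary_fraction > max_boundary_fraction,
        "boundary_fraction": boundary_fraction,
    }


if __name__ == "__main__":
    typical_user = [0.95, 0.90, 0.05, 0.99]
    suspicious = [0.52, 0.48, 0.55, 0.90]
    print(extraction_signals(typical_user))
    print(extraction_signals(suspicious))
```

Fixed thresholds like these produce false positives on legitimately ambiguous inputs, so the flags are better used to prioritize review or to feed a learned classifier than to block clients outright.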