
scikit-learn

Scikit-learn is an open-source machine learning library for the Python programming language, providing simple and efficient tools for data mining and analysis that are accessible to non-experts and reusable across various applications. It features a wide array of supervised and unsupervised learning algorithms, including support vector machines, random forests, gradient boosting, k-means clustering, principal component analysis (PCA), and other dimensionality reduction techniques, all built on top of the NumPy and SciPy scientific computing libraries to ensure high performance and consistency. Designed with an emphasis on ease of use, clean and uniform application programming interfaces (APIs), and extensive documentation, scikit-learn supports tasks such as classification, regression, clustering, model selection, and preprocessing, making it a cornerstone for both academic research and industrial applications in data science. Originally initiated in 2007 as a Google Summer of Code project by David Cournapeau, scikit-learn's first public release occurred on February 1, 2010, led by developers including Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, and Vincent Michel from the French National Institute for Research in Digital Science and Technology (Inria). Since then, it has evolved into a community-driven project with contributions from a global team of over 50 core developers and thousands of participants through regular coding sprints, funded in part by organizations such as Inria and NumFOCUS. The library is distributed under the permissive 3-Clause BSD license, allowing broad commercial and academic use while requiring preservation of copyright notices. As of late 2025, the stable version is 1.7.2, with ongoing development toward version 1.8, reflecting its commitment to regular release cycles and compatibility with Python 3.10 and later. Scikit-learn's modular design enables seamless integration with other Python ecosystems, such as pandas for data manipulation and matplotlib or seaborn for visualization, facilitating end-to-end workflows from data preparation to model deployment. Key strengths include its focus on medium-scale problems, robust cross-validation tools for model evaluation, and utilities for handling imbalanced datasets and missing values, which have made it one of the most popular machine learning libraries worldwide, with millions of downloads and citations in thousands of research papers.

Introduction

Overview

Scikit-learn is an open-source machine learning library for the Python programming language that implements a variety of classical algorithms for data mining and analysis. It is built on top of the NumPy, SciPy, and matplotlib libraries, enabling efficient numerical and scientific computing while providing tools for visualization. Designed primarily for predictive modeling, scikit-learn supports core machine learning tasks such as classification, regression, clustering, and dimensionality reduction, making it suitable for both educational and practical applications in data science. The library's primary goals emphasize simplicity, efficiency, and modularity, allowing users to construct sophisticated models using a consistent and intuitive API with minimal code. This approach facilitates rapid prototyping and experimentation, while its modular structure promotes reusability across different projects. Scikit-learn excels in handling classical machine learning workflows, from preprocessing to model evaluation and deployment, without the overhead of deep learning frameworks. As of November 2025, the current stable version is 1.7.2, released on September 9, 2025, with active development underway toward version 1.8, which includes enhancements for CPU and memory efficiency in various estimators. Within the Python ecosystem, scikit-learn serves as a foundational tool for building machine learning pipelines, bridging research prototyping with production-ready systems and integrating seamlessly with libraries like pandas and Jupyter. Scikit-learn enjoys widespread adoption in the data science community, evidenced by over 58,000 GitHub stars and its inclusion among the top machine learning tools in 2025 industry lists. Its accessibility has made it a staple for practitioners and researchers tackling real-world problems in fields ranging from finance to healthcare.

Design Philosophy

Scikit-learn's design philosophy centers on providing a consistent and intuitive interface for machine learning tasks, encapsulated in the core methods of estimators: fit for training on data, predict for generating outputs from new inputs, and transform for modifying data representations. This uniform API applies across all modules, enabling users to interchange algorithms seamlessly without altering surrounding code, which promotes predictability and reduces the learning curve across diverse applications. The philosophy draws from experiences in developing scalable scientific software, prioritizing simplicity in method signatures to handle both basic and composite objects uniformly. A key emphasis lies in readability and minimal dependencies to ensure broad accessibility for beginners and experts alike. By leveraging Python's expressiveness alongside NumPy and SciPy for numerical operations, scikit-learn maintains clean, vectorized code that avoids unnecessary complexity, while optional dependencies like matplotlib support visualization without mandating them for core functionality. This approach fosters an environment where users can prototype models rapidly, with the library's implementation in Cython for performance-critical parts ensuring efficiency without sacrificing Pythonic readability. The modular design further embodies this philosophy by allowing pipeline construction through composable components, such as Pipeline and FeatureUnion, which chain preprocessing, estimation, and prediction steps without requiring advanced programming expertise. Users can build end-to-end workflows declaratively, integrating transformers and estimators to handle data flows intuitively. Commitment to reproducibility underpins the library's reliability, achieved through utilities like the random_state parameter, which seeds random number generators to yield consistent results across runs, and built-in cross-validation tools that systematically partition data for robust model evaluation. These features mitigate variability in stochastic algorithms, supporting scientific rigor in experimentation. Over time, scikit-learn's philosophy has evolved from facilitating quick research prototyping—via its extensible, duck-typed API—to enabling production deployment, exemplified by integration with Joblib for parallelism through the n_jobs parameter, which distributes computations across cores for scalable processing of large datasets. This progression maintains the library's foundational simplicity while accommodating real-world demands for efficiency and deployment.
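
The following is a minimal sketch of this uniform interface, using a synthetic dataset and illustrative hyperparameters that are not prescribed by the library; the same fit/predict pattern applies whether an estimator stands alone or sits at the end of a pipeline.
python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A transformer and a predictor compose into one estimator-like object
# that exposes the same fit/predict methods as its final step.
pipe = Pipeline([
    ("scale", StandardScaler()),    # transform step: fit/transform
    ("clf", LogisticRegression()),  # final step: fit/predict
])
pipe.fit(X_train, y_train)
print(pipe.predict(X_test)[:5])
print(f"held-out accuracy: {pipe.score(X_test, y_test):.3f}")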

History

Origins and Early Development

Scikit-learn originated in 2007 as a Google Summer of Code project led by David Cournapeau, aimed at extending the SciPy library with tools to support the burgeoning field of machine learning in Python. This initiative addressed the need for accessible statistical and ML capabilities within the scientific Python ecosystem, where researchers increasingly required efficient tools beyond basic numerical computing. Cournapeau's work laid the groundwork by prototyping algorithms that could leverage SciPy's infrastructure, marking the project's humble beginnings as part of the broader "scikits" effort to modularize extensions for SciPy, along with contributions from Matthieu Brucher as part of his thesis. By 2010, the project gained momentum under the leadership of Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, and Bertrand Thirion, primarily affiliated with Inria (the French National Institute for Research in Digital Science and Technology). These contributors formalized scikit-learn as a dedicated package, releasing its first public version on February 1, 2010, which introduced a unified interface for a range of machine learning algorithms. The motivation stemmed from the rapid growth in machine learning applications across domains such as biology, physics, and web technologies, where a consistent, user-friendly API was essential to enable non-experts to apply state-of-the-art methods without delving into low-level implementations. The early releases emphasized simplicity and interoperability, building directly on NumPy for array handling and SciPy for optimized computations in linear algebra and sparse data structures. Early development faced challenges in harmonizing diverse existing libraries, such as wrapping the C++-based LIBSVM for support vector machines while minimizing overhead—achieving about 40% less overhead compared to prior interfaces. Integration with NumPy and SciPy required careful design to ensure seamless data flow, as Python's array operations sometimes limited efficiency in iterative algorithms. Version 0.9, released in September 2011, incorporated new modules for manifold learning and probabilistic models such as Dirichlet process mixtures, solidifying scikit-learn's role as a robust toolkit. Through these efforts, the project evolved from a summer coding experiment into a foundational resource, fostering community contributions while maintaining high standards for documentation and test coverage exceeding 80%.

Major Releases and Evolution

Scikit-learn's development from 2016 onward has been marked by a series of major releases that introduced key algorithmic advancements, improved usability, and enhanced compatibility with modern ecosystems, reflecting the library's maturation into a robust tool for practitioners. Version 0.18, released in September 2016, represented a significant step forward with the introduction of multi-layer perceptron (MLP) classifiers and regressors for basic neural network support, alongside major enhancements to model selection through the new sklearn.model_selection module, which replaced older cross-validation and grid search utilities for more consistent and flexible hyperparameter tuning. These changes improved ensemble method handling, such as better sample weighting in tree-based estimators, laying groundwork for scalable predictive modeling. Subsequent releases built on this foundation, with version 0.20 in September 2018 enhancing preprocessing capabilities, including native support for missing values in scalers and the ColumnTransformer for applying different transformations to subsets of features, streamlining workflows for heterogeneous datasets. Version 0.21 in May 2019 introduced histogram-based gradient boosting classifiers and regressors (HistGradientBoostingClassifier and HistGradientBoostingRegressor), which offered faster training on large datasets by binning features into histograms, outperforming traditional gradient boosting for numerical data with over 10,000 samples. By 2021, version 0.24 further refined preprocessing with improved error handling in check_array for sparse DataFrames and fixes to ColumnTransformer feature naming, alongside new hyperparameter tuning classes like HalvingGridSearchCV for more efficient searches on large parameter spaces. The milestone version 1.0, released in September 2021, emphasized long-term stability by enforcing keyword-only arguments in public methods (per SLEP009), reducing API breakage risks and signaling production readiness after years of consistent development. It introduced features like spline transformers for non-linear feature engineering and broader support for feature names in estimators, including integration with pandas outputs, while deprecating outdated datasets like load_boston. In 2023, version 1.3 advanced data handling with check_array natively supporting pandas DataFrames containing extension arrays and numeric object columns, enabling seamless use of pandas for input validation without conversion overhead. This release also added metadata routing for advanced estimator customization and a skip_parameter_validation option to boost performance in trusted environments. Version 1.5, released in May 2024, focused on scalability improvements, particularly in PCA, where a new "covariance_eigh" solver provided up to 10x faster computation and reduced memory usage for datasets with many more samples than features, including sparse inputs. These optimizations extended to QuantileTransformer for denser array subsampling, making these transformations viable for larger-scale applications. The 1.7 series, culminating in version 1.7.2 on September 9, 2025, prioritized reliability with bug fixes such as resolving convergence issues in multi-class settings and validating transformer outputs in FeatureUnion, while adding support for newer Python versions (including Python 3.14 as of 1.7.2) and experimental free-threaded builds for better concurrency.
Over this period, scikit-learn evolved toward greater production readiness by incorporating interpretability tools like partial dependence plots (introduced in version 0.21) for visualizing feature impacts and probability calibration methods via CalibratedClassifierCV to align predicted probabilities with true outcomes, essential for deployment in risk-sensitive domains. Community feedback, primarily through GitHub issues and pull requests, drove iterative improvements, including deprecations in version 1.2 such as warnings for outdated parameter passing, ensuring backward compatibility while pruning legacy elements. For deep learning integrations, community wrappers like skorch expose scikit-learn-style APIs for PyTorch models, and similar extensions bridge to TensorFlow and Keras, allowing hybrid workflows without abandoning the library's estimator paradigm. Each release's changelog underscores these impacts, with version 1.7 highlighting array API compliance for future-proofing against evolving standards.

Technical Implementation

Core Architecture

Scikit-learn's core architecture is built around an object-oriented design that emphasizes a uniform interface for machine learning workflows, enabling seamless integration of diverse components. At its foundation lies the estimator paradigm, which standardizes the behavior of all objects—referred to as estimators—through a consistent API. This ensures that estimators, whether for fitting models, transforming data, or predicting outcomes, adhere to predictable methods like fit, predict, and transform, facilitating interoperability and composability across the library. Central to this paradigm are base classes such as BaseEstimator and TransformerMixin. The BaseEstimator class provides essential functionality for parameter management, including get_params and set_params methods, which allow estimators to be inspected and configured dynamically—crucial for tools like grid search and pipelines. By inheriting from BaseEstimator, custom or built-in estimators gain compatibility with scikit-learn's meta-estimators, ensuring they can be nested or optimized without breaking the API. Complementing this, TransformerMixin adds the fit_transform method to classes, optimizing the common pattern of fitting and transforming data in a single call, as seen in preprocessors like StandardScaler. Together, these classes enforce a uniform contract where all estimators implement fit(X, y=None) as the primary training method, with optional extensions for prediction (predict) or transformation (transform), promoting code reusability and reducing errors in complex workflows. Composition is a key architectural feature, achieved through utilities like Pipeline and ColumnTransformer. This enables chaining of multiple steps—such as preprocessors, feature selectors, and models—into a single, cohesive estimator, preventing data leakage by ensuring transformations are applied consistently during training and prediction. For instance, a Pipeline object takes a sequence of named steps, executing them sequentially: all but the final step must be transformers, while the last can be any estimator, exposing its methods (e.g., predict) to the overall pipeline. The ColumnTransformer extends this by applying different transformations to subsets of features, supporting heterogeneous data types like dense arrays, sparse matrices, and pandas DataFrames, thus streamlining preprocessing in real-world applications. This modular composition not only simplifies workflows but also allows joint optimization of hyperparameters across steps via tools like GridSearchCV. Parallelization is integrated natively via the joblib library, providing efficient multi-core support without requiring users to manage threads or processes manually. Estimators expose an n_jobs parameter to control parallelism, spawning workers for computationally intensive operations such as cross-validation folds in cross_val_score or hyperparameter search in GridSearchCV. Joblib's backend (defaulting to loky for process-based parallelism) handles memory sharing through memory-mapped arrays for large datasets, mitigating overhead and enabling scalable performance on multi-core systems. This architecture avoids oversubscription by coordinating with lower-level threading in dependencies like NumPy and SciPy, ensuring robust resource management across the library. Error handling and input validation are enforced through utility functions in the sklearn.utils.validation module, promoting consistency and robustness in estimators.
Functions like check_array validate and coerce input data to the expected format—ensuring arrays with finite values, supporting sparse matrices, and enforcing minimum dimensions—raising informative errors for inconsistencies such as non-numeric data or mismatched shapes. Similarly, check_X_y extends this to supervised inputs, verifying that the feature matrix X and target y align in length and type, while validate_data (introduced in version 1.6) integrates validation directly into an estimator's fit method, automatically setting attributes like n_features_in_. These check_* functions are invoked within estimator implementations to catch issues early, maintaining the integrity of the API and preventing downstream failures in pipelines or cross-validation. Extension mechanisms allow for customization and third-party integrations by leveraging inheritance and standardized tags. Developers create custom estimators by subclassing BaseEstimator (and relevant mixins like ClassifierMixin), implementing required methods while using check_estimator to verify compliance with scikit-learn conventions, including input validation and API consistency. Estimator tags, defined via the __sklearn_tags__ attribute since version 1.6, declare capabilities such as support for partial fitting or sparse inputs, enabling meta-estimators to route data appropriately. For third-party extensions, the library's public API supports compatible estimators from packages like Intel's oneAPI Data Analytics Library (oneDAL), which accelerate algorithms while preserving the core interface, as well as integrations through compatible formats like ONNX for model export. This design fosters an ecosystem where extensions enhance scikit-learn without altering its foundational API.
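
The sketch below illustrates these conventions with a hypothetical custom transformer (the ClipTransformer class name and its clipping behavior are invented for illustration): it inherits from BaseEstimator and TransformerMixin, validates inputs with check_array, and composes with built-in estimators inside a Pipeline.
python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_array, check_is_fitted

class ClipTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that clips features to percentile bounds."""

    def __init__(self, lower=1.0, upper=99.0):
        # Store constructor arguments unchanged so get_params/set_params work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = check_array(X)  # validate: 2-D, numeric, finite
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self  # fit always returns self

    def transform(self, X):
        check_is_fitted(self)  # raise if fit has not been called
        X = check_array(X)
        return np.clip(X, self.low_, self.high_)

# The custom transformer composes like any built-in estimator.
pipe = Pipeline([("clip", ClipTransformer()),
                 ("clf", LogisticRegression(max_iter=1000))])
X = np.random.RandomState(0).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
pipe.fit(X, y)
print(pipe.predict(X[:5]))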

Dependencies and Performance

Scikit-learn depends on NumPy for efficient array handling and linear algebra operations, SciPy for scientific algorithms and sparse data structures, joblib for parallelism in estimators, and threadpoolctl for controlling thread usage in underlying BLAS libraries like OpenBLAS or MKL to prevent oversubscription and optimize linear algebra performance. These dependencies ensure robust numerical computations and scalability across various hardware configurations. Optional dependencies enhance specific functionalities, such as matplotlib for plotting results and visualizations in examples and benchmarks. Performance optimizations in scikit-learn include the use of Cython to compile speed-critical components, such as the nearest neighbors search in KNeighborsClassifier, into optimized code for faster execution compared to pure Python. Additionally, integration with SciPy's sparse matrix formats, including CSR and CSC representations, supports efficient handling of high-dimensional sparse data—common in text processing—by storing only non-zero elements, which can reduce memory usage and computation time for predictions on datasets with over 90% sparsity. Scalability features enable processing datasets larger than available memory through out-of-core learning via partial_fit methods in estimators like SGDClassifier, allowing incremental fitting on data streams without loading everything into memory, as demonstrated in out-of-core text classification pipelines where feature extraction dominates runtime. For distributed environments, scikit-learn integrates with Dask through joblib's backend, enabling parallel execution across clusters for tasks like hyperparameter tuning with RandomizedSearchCV, scaling CPU-bound workloads while maintaining the familiar API. As of version 1.7.2, released in September 2025, scikit-learn requires Python 3.10 or newer, with support up to Python 3.14; earlier versions like 1.3 supported Python 3.8, but this was dropped starting with 1.4 to align with evolving ecosystem standards and security updates. Benchmarking shows that the n_jobs parameter, which leverages joblib for multi-core parallelism, can reduce training times in cross-validation or grid search by a factor close to the number of CPU cores available—for instance, on a multi-core machine, fitting a RandomForestClassifier with n_jobs=-1 utilizes all cores to parallelize tree construction, yielding near-linear speedups for independent subtasks. Bulk predictions further benefit, achieving one to two orders of magnitude lower latency than sequential single-sample calls.
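
A brief sketch of these scalability features follows; the batch size, synthetic data, and estimator settings are illustrative assumptions rather than recommended values.
python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Out-of-core learning: feed the estimator one mini-batch at a time with
# partial_fit, so the full dataset never has to reside in memory at once.
sgd = SGDClassifier(loss="log_loss", random_state=0)
classes = np.unique(y)  # all classes must be declared on the first call
for start in range(0, len(X), 1_000):
    batch = slice(start, start + 1_000)
    sgd.partial_fit(X[batch], y[batch], classes=classes)
print(f"incremental model accuracy: {sgd.score(X, y):.3f}")

# Multi-core parallelism: n_jobs=-1 asks joblib to use all available cores
# to build the forest's trees concurrently.
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X, y)
print(f"forest accuracy: {forest.score(X, y):.3f}")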

Core Features

Supervised Learning

Scikit-learn provides a comprehensive suite of algorithms for supervised learning, encompassing both classification and regression tasks. These methods enable predictive modeling from labeled data, where the goal is to learn a mapping from input features to output labels or continuous values. The library implements efficient, scalable estimators that integrate seamlessly with its preprocessing utilities, allowing users to apply techniques like feature scaling prior to model fitting.

Classification Algorithms

Logistic Regression in scikit-learn models the probability of a binary or multiclass outcome using the logistic function, where the probability p is given by p = \frac{1}{1 + \exp(-z)} and z = \mathbf{w} \cdot \mathbf{x} + b, with \mathbf{w} as the weight vector and b as the bias term. It supports regularization through L1 (Lasso) and L2 (Ridge) penalties, controlled by the penalty parameter and the inverse regularization strength C, which helps prevent overfitting in high-dimensional settings. Support Vector Machines (SVMs), implemented via classes like SVC and LinearSVC, construct hyperplanes to separate classes with maximum margin, employing kernel tricks such as the Radial Basis Function (RBF) kernel defined as k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2) to handle nonlinear data. The C parameter balances margin maximization against classification errors, while gamma tunes the kernel's curvature for RBF. These classifiers support both dense and sparse inputs and scale to multiclass problems using one-versus-one or one-versus-rest strategies. Decision Trees, via DecisionTreeClassifier, build hierarchical structures based on feature splits that minimize impurity measures such as the Gini index or entropy (information gain). For classification, the Gini criterion computes node impurity as \sum_{k=1}^K p_k (1 - p_k), where p_k is the proportion of class k in the node, promoting splits that create purer child nodes. Random Forests extend this by averaging multiple decision trees trained on bootstrapped subsets and random feature selections, reducing variance and improving generalization; the n_estimators parameter controls tree count, and max_features limits split candidates, defaulting to "sqrt" for classification. Gradient Boosting, particularly through HistGradientBoostingClassifier, sequentially fits decision trees to residuals of previous models, optimizing arbitrary differentiable loss functions like log-loss for classification. It uses histogram binning (default 255 bins per feature) for faster training on large datasets, natively handling missing values and supporting early stopping before max_iter iterations are reached. This implementation draws from LightGBM-style gradient-boosted decision trees, offering performance comparable to specialized libraries while integrating with scikit-learn's ecosystem.
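
A compact sketch comparing these classifiers on a built-in dataset; the dataset choice and hyperparameter values are illustrative, not recommendations.
python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Linear models and SVMs benefit from standardized features.
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)),
    "svc_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
    "tree": DecisionTreeClassifier(criterion="gini", random_state=0),
    "forest": RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0),
    "hist_gb": HistGradientBoostingClassifier(max_iter=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:>8}: {scores.mean():.3f} +/- {scores.std():.3f}")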

Regression Algorithms

Linear Regression, implemented as LinearRegression, fits a linear model by minimizing the ordinary least squares objective \min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|^2, yielding coefficients interpretable as feature impacts. It assumes a linear relationship between features and target and can overfit without regularization. Ridge Regression addresses overfitting via L2 regularization, minimizing \|\mathbf{y} - X\mathbf{w}\|^2 + \alpha \|\mathbf{w}\|^2, where alpha shrinks coefficients toward zero without inducing sparsity. It performs well with correlated features and supports solvers like 'cholesky' for exact solutions. Lasso Regression, conversely, uses L1 regularization, \|\mathbf{y} - X\mathbf{w}\|^2 + \alpha \|\mathbf{w}\|_1, inducing sparsity for automatic feature selection by setting some coefficients to exactly zero. The alpha hyperparameter in both controls the regularization strength. Support Vector Regression (SVR) estimates continuous values within an ε-insensitive tube, using kernels like RBF for nonlinearity; the epsilon parameter defines the margin of tolerance, and C penalizes deviations beyond it. Random Forest Regressor aggregates tree predictions via averaging, employing MSE as the default split criterion to minimize variance, with parameters like n_estimators and max_features available for tuning.
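
The regularized linear models differ only in their penalty term, which the following sketch makes concrete on synthetic data; the alpha values shown are arbitrary illustrations.
python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic problem where only 5 of 20 features are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty shrinks coefficients
lasso = Lasso(alpha=5.0).fit(X, y)    # L1 penalty zeroes some coefficients

print("non-zero coefficients:")
print("  OLS  :", np.sum(ols.coef_ != 0))    # typically all 20
print("  Ridge:", np.sum(ridge.coef_ != 0))  # still all 20, but smaller
print("  Lasso:", np.sum(lasso.coef_ != 0))  # sparse: typically close to 5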

Evaluation Metrics

For classification, scikit-learn offers metrics such as accuracy, which measures the proportion of correct predictions; precision, the ratio of true positives to predicted positives; recall, the ratio of true positives to actual positives; and F1-score, their harmonic mean, particularly useful for imbalanced datasets. These are computed via functions like accuracy_score, precision_score, recall_score, and f1_score in the sklearn.metrics module. In regression tasks, Mean Squared Error (MSE) quantifies average squared residuals; Mean Absolute Error (MAE) uses absolute differences for robustness to outliers; and the R² score indicates the proportion of variance explained, ranging from negative infinity to 1. Functions like mean_squared_error, mean_absolute_error, and r2_score facilitate these evaluations, often used in cross-validation to assess generalization.
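
A short sketch of these metric functions applied to small, hypothetical prediction arrays chosen purely for illustration.
python
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, r2_score,
                             recall_score)

# Hypothetical classification labels and predictions
y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))

# Hypothetical regression targets and predictions
y_true_r = [3.0, -0.5, 2.0, 7.0]
y_pred_r = [2.5, 0.0, 2.0, 8.0]
print("MSE:", mean_squared_error(y_true_r, y_pred_r))
print("MAE:", mean_absolute_error(y_true_r, y_pred_r))
print("R2 :", r2_score(y_true_r, y_pred_r))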

Hyperparameter Tuning

Scikit-learn's GridSearchCV performs exhaustive grid search over specified hyperparameter values, using cross-validation to select the best combination for supervised estimators, as in tuning C and gamma for SVMs. It integrates with any scorer, such as 'accuracy' for classification, and outputs the optimal model via best_estimator_. RandomizedSearchCV samples randomly from parameter distributions (e.g., via scipy.stats), allowing efficient exploration of large spaces with n_iter trials, ideal for models like Random Forests where n_estimators and max_depth vary. Both tools support parallel execution and are essential for optimizing supervised models without manual trial-and-error.
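
The sketch below tunes an SVM with GridSearchCV and a random forest with RandomizedSearchCV; the grids and distributions are illustrative choices, not tuned recommendations.
python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Exhaustive search over a small C/gamma grid for an RBF SVM.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]},
    scoring="accuracy", cv=5, n_jobs=-1,
)
grid.fit(X, y)
print("best SVM params:", grid.best_params_, grid.best_score_)

# Random sampling from integer distributions for a random forest.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300),
                         "max_depth": randint(2, 20)},
    n_iter=10, scoring="accuracy", cv=5, random_state=0, n_jobs=-1,
)
search.fit(X, y)
print("best forest params:", search.best_params_, search.best_score_)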

Unsupervised Learning and Preprocessing

Scikit-learn provides a suite of tools for unsupervised learning, which focuses on discovering inherent structures in data without labeled targets, and for preprocessing, which prepares raw data for effective modeling by handling transformations, feature scaling, and missing values. These modules enable tasks such as identifying clusters, reducing dimensionality for visualization or efficiency, detecting anomalies, and standardizing features to meet algorithmic assumptions. The library's implementations are optimized for integration within pipelines, emphasizing scalability and consistency with NumPy and SciPy arrays. Clustering algorithms in scikit-learn partition data into groups based on similarity, with key implementations including K-Means, hierarchical agglomerative clustering, and DBSCAN. K-Means assigns samples to a predefined number of clusters by iteratively updating centroids to minimize the within-cluster sum of squared distances, formulated as minimizing \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2, where C_k is the set of points in cluster k and \mu_k is its centroid. To select the optimal number of clusters k, the elbow method evaluates inertia (the sum of squared distances to centroids) across varying k values, identifying the "elbow" point where improvements diminish. Hierarchical clustering builds a hierarchy by successively merging or splitting clusters using linkage criteria such as Ward linkage, which minimizes intra-cluster variance; complete linkage, which uses maximum inter-point distances; average linkage, based on mean distances; or single linkage, using minimum distances. DBSCAN excels at discovering clusters of arbitrary shape by designating core points with at least min_samples neighbors within distance eps, then expanding clusters from these points while labeling sparse regions as noise. Dimensionality reduction techniques in scikit-learn compress high-dimensional data while preserving structure, aiding visualization and computational efficiency. Principal component analysis (PCA) achieves this by performing eigenvalue decomposition on the covariance matrix \Sigma = \frac{1}{n} X^T X of the centered data, selecting eigenvectors corresponding to the largest eigenvalues to maximize projected variance. This linear method centers the data and optionally whitens components to unit variance, with variants like IncrementalPCA supporting large datasets via online updates. For non-linear visualization, t-distributed Stochastic Neighbor Embedding (t-SNE) preserves local neighborhoods by minimizing the Kullback–Leibler divergence between high- and low-dimensional similarities, though it is computationally intensive and suited for 2D/3D projections. Preprocessing utilities standardize and transform features to ensure compatibility with downstream algorithms, addressing issues like scale disparities and categorical data. StandardScaler performs z-score normalization by subtracting the mean \mu and dividing by the standard deviation \sigma, yielding X_{\text{scaled}} = \frac{X - \mu}{\sigma}, which centers data at zero mean and unit variance to benefit distance-based methods. OneHotEncoder converts categorical variables into binary vectors, creating one column per category (with options to drop one for sparsity reduction) while handling unknown categories via strategies like infrequent replacement. SimpleImputer (the successor to the deprecated Imputer) fills missing values using strategies such as mean, median, or constant substitution, enabling robust handling of incomplete datasets. PolynomialFeatures expands features by generating all polynomial combinations up to a specified degree (e.g., degree=2 includes squares and interactions like x_1 x_2), introducing non-linearity without altering the original data.
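
A condensed sketch tying several of these pieces together: a scaling step, K-Means and DBSCAN clustering, and a PCA projection; the number of clusters, eps, and component count are illustrative values.
python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize features so distance-based methods treat them equally.
X_scaled = StandardScaler().fit_transform(X)

# K-Means with a fixed k; inertia_ is the within-cluster sum of squares.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print("inertia:", kmeans.inertia_)

# DBSCAN finds arbitrarily shaped clusters; label -1 marks noise points.
db = DBSCAN(eps=0.8, min_samples=5).fit(X_scaled)
print("DBSCAN labels found:", set(db.labels_))

# PCA projects onto the two directions of largest variance.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
print("projected shape:", X_2d.shape)
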
Anomaly detection identifies outliers as deviations from normal patterns, with scikit-learn supporting Isolation Forest and One-Class SVM for unsupervised scenarios. Isolation Forest isolates anomalies via an ensemble of isolation trees, where each tree randomly partitions the data; anomalies require fewer splits (shorter path lengths) due to their sparsity, with the anomaly score derived from the average path length across trees. Key parameters include n_estimators for tree count and contamination for the estimated outlier proportion. One-Class SVM fits a boundary around normal data in a kernel-induced space (default RBF), classifying points outside this boundary as anomalies, controlled by nu, which bounds the outlier fraction.
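
A brief sketch of both detectors on synthetic data with injected outliers; the contamination and nu values are illustrative assumptions.
python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X_outliers = rng.uniform(low=-6, high=6, size=(15, 2))
X = np.vstack([X_normal, X_outliers])

# Isolation Forest: anomalies are isolated with fewer random splits.
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
iso_labels = iso.fit_predict(X)  # -1 = anomaly, 1 = inlier
print("Isolation Forest flagged:", np.sum(iso_labels == -1))

# One-Class SVM: fits a boundary around the bulk of the data (RBF kernel).
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
svm_labels = ocsvm.fit_predict(X)
print("One-Class SVM flagged:", np.sum(svm_labels == -1))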

Usage and Examples

Installation and Basic Workflow

Scikit-learn can be installed on systems running Python 3.10 or later, with the recommended method being pip, which automatically handles core dependencies such as NumPy (version 1.22.0 or later), SciPy (1.8.0 or later), joblib (1.2.0 or later), and threadpoolctl (3.1.0 or later). To install via pip, users first create a virtual environment using python -m venv sklearn-env, activate it (e.g., source sklearn-env/bin/activate on Linux/macOS or sklearn-env\Scripts\activate on Windows), and then run pip install -U scikit-learn. Alternatively, for managing environments and dependencies more robustly, especially in scientific workflows, installation via conda is supported by creating an environment with conda create -n sklearn-env -c conda-forge scikit-learn and activating it using conda activate sklearn-env. Verification of the installation involves checking the package details with python -m pip show scikit-learn or conda list scikit-learn, followed by importing the library in Python via import sklearn and displaying versions with sklearn.show_versions(). The basic workflow in scikit-learn follows a standardized end-to-end process for machine learning tasks, beginning with data loading, which integrates seamlessly with libraries like pandas for handling tabular data in DataFrame format. Next, the dataset is split into training and testing subsets using the train_test_split function from sklearn.model_selection, which randomly partitions features (X) and targets (y) while supporting stratification to maintain class distributions. Model training occurs by calling the .fit() method on an estimator instance with the training data, such as fitting a classifier or regressor to learn patterns from the features and labels. Predictions are then generated on unseen data using the .predict() method, which applies the learned model to new inputs. Finally, model performance is evaluated through the .score() method, which computes metrics like accuracy for classification or R² for regression directly on the test set. For more robust assessment beyond a single train-test split, scikit-learn provides cross-validation utilities such as KFold for general partitioning into k folds and StratifiedKFold for classification tasks to ensure balanced class representation across folds, typically integrated with cross_validate to compute scores like mean accuracy over multiple iterations. To streamline workflows and prevent data leakage—where information from the test set influences training—users construct pipelines using make_pipeline from sklearn.pipeline, which chains preprocessing steps (e.g., scaling) with estimators in a single object that applies transformations consistently during fitting and prediction. Best practices in scikit-learn emphasize reproducibility by setting the random_state parameter to a fixed integer (e.g., 42) in functions like train_test_split and in estimators, ensuring consistent results across runs given the same data. For datasets with imbalanced classes, the class_weight parameter in classifiers (e.g., 'balanced') automatically adjusts weights inversely proportional to class frequencies, mitigating bias toward majority classes during training.
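
The end-to-end workflow described above can be condensed into a short sketch; the dataset, test split size, and estimator are illustrative choices.
python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load data and hold out a stratified test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# A pipeline keeps scaling inside the training data, preventing leakage.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Cross-validation on the training portion for a more robust estimate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(model, X_train, y_train, cv=cv, scoring="accuracy")
print("mean CV accuracy:", results["test_score"].mean())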

Practical Code Examples

Scikit-learn provides a rich set of tools for implementing machine learning workflows through practical, executable code. The following examples illustrate key usage patterns, drawing from the library's built-in datasets and estimators. These snippets are self-contained and can be run in a Python environment with scikit-learn installed, assuming necessary dependencies like NumPy, SciPy, and matplotlib are available.

Example 1: Iris Classification with LogisticRegression

The Iris dataset, a multiclass classification benchmark consisting of 150 samples across three species with four features (sepal length, sepal width, petal length, and petal width), serves as an ideal starting point for demonstrating supervised learning. To classify the species using logistic regression, first load the dataset, split it into training and testing sets, fit the model, generate predictions, and evaluate performance with a confusion matrix visualization.
python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Compute and visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix for Iris Classification')
plt.show()
This code achieves near-perfect accuracy on the test set, with the confusion matrix highlighting any misclassifications, such as occasional confusion between versicolor and virginica due to overlapping features.

Example 2: California Housing Regression with RandomForestRegressor

For regression tasks, the California housing dataset offers 20,640 samples of median house values in California districts, with eight numerical features such as median income and house age. This example replaces the deprecated Boston housing dataset but follows a similar workflow: load the data, apply feature scaling, perform cross-validation to assess model performance, fit a random forest regressor, and plot feature importances to identify influential predictors. Random forests ensemble multiple decision trees for robust predictions, reducing variance through averaging.
python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load the California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Create a pipeline with scaling and the regressor
pipeline = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42))

# Perform cross-validation to score the model
scores = cross_val_score(pipeline, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Mean MSE: {-scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

# Fit the pipeline on the full data for feature importances
pipeline.fit(X, y)
feature_importances = pipeline.named_steps['randomforestregressor'].feature_importances_

# Plot feature importances
plt.figure(figsize=(10, 6))
indices = np.argsort(feature_importances)[::-1]
plt.bar(range(X.shape[1]), feature_importances[indices])
plt.xticks(range(X.shape[1]), np.array(housing.feature_names)[indices], rotation=45)
plt.title('Feature Importances in California Housing Regression')
plt.tight_layout()
plt.show()
Cross-validation yields a mean squared error of around 0.35, indicating good predictive power, with median income (MedInc) emerging as the most important feature.

Example 3: Digit Clustering with KMeans on MNIST Subset

Unsupervised clustering on the digits dataset (a subset of MNIST with 1,797 grayscale images of handwritten digits 0-9, each 8x8 pixels or 64 features) demonstrates KMeans for grouping similar samples. To determine the optimal number of clusters (k=10 for digits), compute an elbow plot of inertia (within-cluster sum of squares) across k values, then fit the model and evaluate using the silhouette score, which measures cluster cohesion and separation (range: -1 to 1, higher is better).
python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

# Load the digits dataset
digits = load_digits()
X = digits.data

# Elbow method to find optimal k
inertias = []
sil_scores = []
K = range(2, 20)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
    sil_scores.append(silhouette_score(X, kmeans.labels_))

# Plot elbow curve
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(K, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')

# Plot silhouette scores
plt.subplot(1, 2, 2)
plt.plot(K, sil_scores, 'bx-')
plt.xlabel('k')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for k')
plt.tight_layout()
plt.show()

# Fit KMeans with k=10 and compute silhouette
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
kmeans.fit(X)
silhouette_avg = silhouette_score(X, kmeans.labels_)
print(f'Silhouette Score for k=10: {silhouette_avg:.4f}')
The elbow plot shows a bend around k=10, and the silhouette score peaks near 0.19, confirming reasonable clustering quality despite some digit ambiguities like 4 and 9. For visualization, reduce dimensions with PCA and plot the clusters.
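
A minimal sketch of such a visualization follows, refitting KMeans so the snippet stands alone; the colormap and figure size are arbitrary choices.
python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Cluster in the original 64-dimensional space, then project to 2-D for plotting.
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10).fit(X)
X_2d = PCA(n_components=2).fit_transform(X)

plt.figure(figsize=(7, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=kmeans.labels_, cmap='tab10', s=10)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('KMeans clusters of digits projected onto two principal components')
plt.show()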

Integration Example: Pipeline with TF-IDF and SVM for Text Classification on 20 Newsgroups

The 20 Newsgroups dataset contains approximately 18,000 newsgroup posts across 20 topics, suitable for text classification. A pipeline integrates text vectorization (using TF-IDF for term frequency-inverse document frequency weighting), optional scaling for feature normalization, and a support vector machine (SVM via LinearSVC) to classify topics efficiently on sparse high-dimensional data. This automates preprocessing and fitting, ensuring consistency.
python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load a subset of the 20 Newsgroups dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))
X_train, y_train = newsgroups_train.data, newsgroups_train.target
X_test, y_test = newsgroups_test.data, newsgroups_test.target

# Create pipeline: TF-IDF vectorizer (includes L2 normalization akin to scaling), followed by SVM (LinearSVC)
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LinearSVC(random_state=42)
)

# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Evaluate
print(classification_report(y_test, y_pred, target_names=categories))
Note that for sparse TF-IDF features, StandardScaler is typically omitted because TF-IDF already produces L2-normalized vectors; LinearSVC handles these unscaled inputs well. This pipeline achieves around 90% accuracy on the subset, with precision and recall varying by category.

Error-Prone Pitfalls: Demonstrating Data Leakage in Code and Fixes

Data leakage occurs when information from the test set inadvertently influences training, leading to inflated performance estimates; a common instance is applying preprocessing like scaling to the entire dataset before splitting. This violates the principle that models should only use training data for transformations available at prediction time. Consider a flawed approach on the Iris dataset:
python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

# Pitfall: Scaling the entire dataset before splitting (leakage)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses test data info in fit
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f'Leaky accuracy: {accuracy_score(y_test, y_pred):.4f}')  # Overly optimistic
This yields unrealistically high accuracy (e.g., 1.0000) because the scaler incorporates test set statistics. The fix uses a Pipeline to apply scaling only within cross-validation folds or on the training data:
python
from sklearn.pipeline import make_pipeline

# Correct: Pipeline ensures scaling is fit only on train
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(f'Correct accuracy: {accuracy_score(y_test, y_pred):.4f}')  # Realistic ~0.98
Pipelines prevent leakage by deferring transformations to the fitting step, ensuring test data remains unseen during preprocessing. Always validate splits and use tools like cross_val_score for robust evaluation.

Applications

Industry Sectors

Scikit-learn has found widespread adoption in the finance and insurance sectors for tasks such as fraud detection and credit risk modeling, leveraging algorithms like Isolation Forest to identify anomalous transactions in real time. For instance, financial institutions utilize scikit-learn's ensemble methods to enhance fraud detection systems by isolating outliers in transaction data, achieving higher accuracy in flagging suspicious activities compared to traditional rule-based approaches. In insurance, the library supports predictive modeling for claims processing, where regression techniques such as Tweedie regression are applied to forecast claim amounts based on historical datasets, enabling insurers to streamline claims handling and reduce processing times. A notable example is a UK-based lending platform that employs scikit-learn for credit risk modeling and fraud detection, integrating its algorithms to evaluate loan applications and mitigate financial risks. In retail and e-commerce, scikit-learn facilitates customer segmentation and personalized recommendations through clustering and recommendation techniques, helping businesses optimize inventory and boost sales. Retailers apply clustering algorithms such as K-Means from scikit-learn to group customers based on purchasing behavior and demographics, allowing for targeted marketing campaigns that improve customer retention. For product recommendations, travel platforms such as Booking.com leverage scikit-learn in recommendation engines that suggest hotels and destinations based on user preferences, enhancing user experience and increasing conversion rates in competitive e-commerce environments. The media and marketing industries utilize scikit-learn for content personalization and audience targeting, particularly through clustering for playlist curation. Spotify employs scikit-learn for music recommendations, analyzing user preferences to generate cohesive playlists that drive user engagement and content discovery. Technology companies integrate scikit-learn into internal tools for experimentation and A/B testing, supporting data-driven decision-making in product development. For example, firms use scikit-learn's cross-validation utilities to evaluate experiment outcomes in A/B tests, ensuring robust statistical analysis of user behavior metrics like engagement and retention. In areas like software optimization, the library aids feature engineering and preprocessing, allowing teams to build scalable models for tasks such as anomaly detection in system logs. In healthcare, scikit-learn contributes to pharmaceutical research by enabling regression models for drug response prediction, accelerating the identification of effective treatments. Researchers apply linear and ridge regression from scikit-learn to analyze genomic and chemical data, forecasting patient responses to drugs and prioritizing candidates for clinical trials based on predictive accuracy. This approach has been instrumental in precision medicine initiatives, where models trained on datasets like the Genomics of Drug Sensitivity in Cancer help simulate therapeutic outcomes, reducing development timelines and costs.

Research and Academia

Scikit-learn plays a pivotal role in academic machine learning education, serving as a foundational tool in university curricula worldwide. For instance, at Stanford University, the Iris dataset bundled with scikit-learn is commonly featured in machine learning courses like CS229 to illustrate classification tasks, enabling students to implement supervised learning algorithms from scratch or using library functions. The library's intuitive API and built-in datasets facilitate hands-on learning of core concepts such as model fitting, evaluation, and hyperparameter tuning, making it ideal for introductory and advanced syllabi. Its integration into course projects allows students to experiment with real-world data without the overhead of low-level implementations. In research, scikit-learn has been extensively cited, underscoring its impact on scientific publications. The seminal paper introducing the library, "Scikit-learn: Machine Learning in Python" by Pedregosa et al., has garnered over 88,000 citations as of 2025, reflecting its widespread adoption in peer-reviewed studies across disciplines. In neuroscience, researchers at Inria have leveraged scikit-learn for analyzing brain signals, particularly in functional magnetic resonance imaging (fMRI) studies. Techniques such as dimensionality reduction and support vector machines enable decoding of neural patterns and brain-behavior correlations, as demonstrated in applications for neuroimaging data processing. Open research contributions further extend scikit-learn's utility through community-driven extensions and tools for reproducible analysis. The scikit-learn-contrib organization hosts packages like scikit-learn-extra, which provide additional algorithms that do not yet meet the core library's inclusion criteria but support specialized research needs. Additionally, scikit-learn's compatibility with Jupyter notebooks promotes reproducible workflows, allowing researchers to document experiments inline with code, visualizations, and results, as seen in numerous open-source notebooks and repositories. Educational resources bolster scikit-learn's academic footprint, with official tutorials and massive open online courses (MOOCs) emphasizing practical implementation. The Inria-developed scikit-learn MOOC, hosted on the FUN platform, offers free, self-paced modules covering predictive modeling, hyperparameter tuning, and model evaluation through interactive notebooks and quizzes. Platforms like Coursera feature specialized courses, such as "Introduction to Data Science and scikit-learn in Python," where learners apply the library in hands-on labs for tasks like classification and clustering. These resources democratize access to machine learning education, fostering skills in reproducible analysis. Scikit-learn's influence extends to competitive academic settings, notably Kaggle competitions, where it serves as a standard for establishing baselines, appearing in over 80% of winning solutions analyzed through 2025. Participants frequently use its preprocessing and ensemble methods to prototype models before integrating advanced techniques, highlighting its role in rapid experimentation and validation. This usage reinforces scikit-learn's status as a standard tool in empirical studies and student-led competitions.

Community and Recognition

Contributors and Governance

The scikit-learn project is sustained by a dedicated open-source community, with core contributors forming specialized teams that handle maintenance, documentation, communication, and contributor support. The active core contributors include approximately 19 maintainers such as Alexandre Gramfort, Andreas Mueller, and Jérémie du Boisberranger, alongside smaller teams for documentation (e.g., Arturo Amor), contributor experience (e.g., Virgil Chan, Juan Carlos Alfaro Jiménez), and communication (e.g., Lauren Burke-McCarthy, François Goupil). Emeritus contributors, including early leaders like Fabian Pedregosa and Mathieu Blondel, have transitioned to advisory roles after significant foundational work, while the broader community encompasses over 3,000 total contributors who have submitted code, documentation, or issue reports via GitHub as of November 2025. Governance operates under a meritocratic, consensus-seeking framework that emphasizes community input while ensuring efficient decision-making. All core contributors possess equal voting rights, and proposals are discussed publicly through GitHub issues and pull requests, the project mailing list, or in-person/virtual sprints. For contentious or major changes, such as API modifications, a Scikit-Learn Enhancement Proposal (SLEP) is required, needing approval from at least two core contributors and no objections; unresolved matters escalate to the Technical Committee (TC) after one month, where a two-thirds majority vote among its members—currently including Thomas Fan, Alexandre Gramfort, and others—resolves the issue. This structure replaced earlier informal processes to better scale with community growth, as detailed in SLEP 020. Contributions follow rigorous guidelines to uphold code quality and accessibility. Code must conform to PEP 8 standards, with inline comments for clarity and no alterations to unrelated files; new features require unit tests via pytest targeting at least 90% coverage, executable locally or in continuous integration. Documentation enhancements use Sphinx for docstrings, user guides, and examples, built and previewed during development. Pull requests are submitted via forks, labeled for ease (e.g., "good first issue" for newcomers), and merged only after review by two core contributors, promoting collaborative refinement. Funding supports development through a mix of nonprofit grants and industry partnerships, enabling paid contributor time and infrastructure. As of 2025, key sources include NumFOCUS, the Inria Foundation Consortium and its corporate members, and backers such as probabl.ai, NVIDIA, Microsoft, Quansight Labs, the Chan Zuckerberg Initiative, and Tidelift. Historical funding from Google, the Paris-Saclay Center for Data Science, and partner universities has also been instrumental in early scaling. In-kind donations from Anaconda and other companies further aid operations. Diversity and inclusion are prioritized through targeted programs to broaden participation. Mentored internship initiatives provide paid opportunities for underrepresented individuals to contribute under guidance from core developers, as exemplified by participant experiences shared in project updates. Annual coding sprints, numbering over 50 since 2010, incorporate onboarding sessions and mentoring to lower barriers for newcomers from varied backgrounds, fostering an equitable environment, with ongoing support from initiatives like the Chan Zuckerberg Initiative for equity in open-source scientific software.

Awards and Milestones

Scikit-learn has received several formal recognitions for its contributions to open-source software. In 2019, core developers Gaël Varoquaux, Bertrand Thirion, Loïc Estève, Olivier Grisel, and Alexandre Gramfort were awarded the Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize for their work on scikit-learn, highlighting its role in democratizing statistical learning and fostering collaboration between research and industry. Key milestones underscore the project's growth and stability. The release of version 1.0 on September 24, 2021, marked a significant achievement, signifying a stable and mature API after years of development and community feedback. In 2012, scikit-learn was selected for the Google Summer of Code program, supporting three student projects that enhanced features like sparse linear models and dictionary learning. More recently, in August 2025, the project completed the Secure Open Source Training, adopting best practices for security in open-source development. The library's impact extends to high-profile scientific work and broad adoption. Scikit-learn has been used in machine learning challenges analyzing particle physics data, such as the 2014 Higgs Boson Machine Learning Challenge, where its techniques aided particle classification. Surveys indicate strong usage among Python machine learning practitioners, with scikit-learn consistently ranking as the most popular framework; for instance, in the 2022 Kaggle State of Machine Learning and Data Science Survey, it was used by over 40% of respondents, far outpacing alternatives. This widespread adoption reflects its accessibility and reliability in both research and industry.

    We're happy to announce the 1.7.2 release. This release contains a few bug fixes and is the first version supporting Python 3.14.
  28. [28]
    5.1. Partial Dependence and Individual Conditional Expectation plots
    Partial dependence plots (PDP) and individual conditional expectation (ICE) plots can be used to visualize and analyze interaction between the target response.Partial_dependence · Partial Dependence and... · Permutation feature importanceMissing: evolution production readiness PyTorch TensorFlow
  29. [29]
    Version 1.2.2 - Scikit-learn
    Fix A deprecation warning is raised when using the base_estimator__ prefix to set parameters of the estimator used in calibration.CalibratedClassifierCV . # ...
  30. [30]
    Developing scikit-learn estimators
    This chapter details how to develop objects that safely interact with scikit-learn pipelines and model selection tools.Apis Of Scikit-Learn Objects · Estimators · Rolling Your Own Estimator
  31. [31]
    BaseEstimator — scikit-learn 1.7.2 documentation
    Base class for all estimators in scikit-learn. Inheriting from this class provides default implementations of: setting and getting parameters used by ...
  32. [32]
  33. [33]
    7.1. Pipelines and composite estimators - Scikit-learn
    Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data.
  34. [34]
  35. [35]
  36. [36]
    check_array
    ### Summary of `check_array` Function
  37. [37]
    check_X_y — scikit-learn 1.7.2 documentation
    Input validation for standard estimators. Checks X and y for consistent length, enforces X to be 2D and y 1D. By default, X is checked to be non-empty and ...
  38. [38]
    validate_data — scikit-learn 1.7.2 documentation
    Validate input data and set or check feature names and counts of the input. This helper function should be used in an estimator that requires input validation.
  39. [39]
    check_estimator — scikit-learn 1.7.2 documentation
    This function will run an extensive test-suite for input validation, shapes, etc, making sure that the estimator complies with scikit-learn conventions.
  40. [40]
    9.2. Computational Performance — scikit-learn 1.7.2 documentation
    Scipy provides sparse matrix data structures which are optimized for storing sparse data. The main feature of sparse formats is that you don't store zeros ...
  41. [41]
  42. [42]
  43. [43]
    Scikit-Learn & Joblib — dask-ml 2025.1.1 documentation
    Many Scikit-Learn algorithms are written for parallel execution using Joblib, which natively provides thread-based and process-based parallelism.
  44. [44]
    1. Supervised learning — scikit-learn 1.7.2 documentation
    Release History · Glossary · Development · FAQ · Support · Related Projects · Roadmap · Governance · About us · GitHub. 1.7.2 (stable). 1.8.dev0 (dev)1.7.2 ( ...
  45. [45]
    1.1. Linear Models — scikit-learn 1.7.2 documentation
    The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the features.Ordinary Least Squares and... · 1.2. Linear and Quadratic... · LinearRegressionMissing: philosophy | Show results with:philosophy
  46. [46]
    LogisticRegression — scikit-learn 1.7.2 documentation
    This class implements regularized logistic regression using the 'liblinear' library, 'newton-cg', 'sag', 'saga' and 'lbfgs' solvers.LogisticRegressionCV · OneVsRestClassifier · Probability Calibration curves
  47. [47]
    1.4. Support Vector Machines
    ### Summary of Support Vector Machines in scikit-learn
  48. [48]
  49. [49]
    1.10. Decision Trees — scikit-learn 1.7.2 documentation
    Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the ...DecisionTreeClassifier · Plot the decision surface of... · Decision Tree RegressionMissing: philosophy | Show results with:philosophy
  50. [50]
  51. [51]
    1.11. Ensembles: Gradient boosting, random forests, bagging, voting ...
    Gradient Tree Boosting or Gradient Boosted Decision Trees (GBDT) is a generalization of boosting to arbitrary differentiable loss functions, see the seminal ...Plot the decision surfaces of... · GradientBoostingClassifier
  52. [52]
    RandomForestClassifier — scikit-learn 1.7.2 documentation
    A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the ...RandomForestRegressor · Comparing Random Forests...
  53. [53]
    HistGradientBoostingClassifier — scikit-learn 1.7.2 documentation
    Histogram-based Gradient Boosting Classification Tree. This estimator is much faster than GradientBoostingClassifier for big datasets (n_samples >= 10 000).
  54. [54]
  55. [55]
  56. [56]
  57. [57]
  58. [58]
  59. [59]
    RandomForestRegressor — scikit-learn 1.7.2 documentation
    A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the ...
  60. [60]
    3.4. Metrics and scoring: quantifying the quality of predictions
    We want to give some guidance, inspired by statistical decision theory, on the choice of scoring functions for supervised learning.F1_score · Accuracy_score · Roc_auc_score · 3.5. Validation curvesMissing: philosophy | Show results with:philosophy
  61. [61]
  62. [62]
  63. [63]
  64. [64]
    3.2. Tuning the hyper-parameters of an estimator - Scikit-learn
    This is the best practice for evaluating the performance of a model with grid search. See Sample pipeline for text feature extraction and evaluation for an ...GridSearchCV · Tuning the decision threshold... · RandomizedSearchCV
  65. [65]
    GridSearchCV — scikit-learn 1.7.2 documentation
    Exhaustive search over specified parameter values for an estimator. Important members are fit, predict. GridSearchCV implements a “fit” and a “score” method.Hyper-parameter optimization · ParameterGrid · RandomizedSearchCVMissing: algorithms | Show results with:algorithms
  66. [66]
  67. [67]
    2.3. Clustering — scikit-learn 1.7.2 documentation
    The algorithm can also be understood through the concept of Voronoi diagrams. First the Voronoi diagram of the points is calculated using the current centroids.SpectralClustering · AgglomerativeClustering · 2.4. Biclustering · KMeansMissing: philosophy | Show results with:philosophy
  68. [68]
    2.5. Decomposing signals in components (matrix factorization problems)
    ### Summary of Dimensionality Reduction in scikit-learn
  69. [69]
    7.3. Preprocessing data
    ### Summary of Preprocessing Utilities in scikit-learn
  70. [70]
    2.7. Novelty and Outlier Detection
    ### Summary of Anomaly Detection: Isolation Forest and One-Class SVM
  71. [71]
    [PDF] k-means++: The Advantages of Careful Seeding - Stanford CS Theory
    Abstract. The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster.
  72. [72]
    Isolation Forest | IEEE Conference Publication
    This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiles normal points. To our best knowledge ...
  73. [73]
    [PDF] Support Vector Method for Novelty Detection
    Suppose you are given some dataset drawn from an underlying probabil- ity distribution P and you want to estimate a "simple" subset S of input.
  74. [74]
    Getting Started — scikit-learn 1.7.2 documentation
    Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting.
  75. [75]
    load_iris — scikit-learn 1.7.2 documentation
    The iris dataset is a classic and very easy multi-class classification dataset. Classes. 3. Samples per class. 50. Samples total.
  76. [76]
    Confusion matrix
    ### Summary of Content
  77. [77]
    fetch_california_housing — scikit-learn 1.7.2 documentation
    Gallery examples: Comparing Random Forests and Histogram Gradient Boosting ... Load the California housing dataset (regression). Samples total. 20640.
  78. [78]
  79. [79]
    A demo of K-Means clustering on the handwritten digits data
    This demo compares K-means initialization strategies (k-means++, random, PCA) on handwritten digits data, using a benchmark to measure performance.
  80. [80]
  81. [81]
    fetch_20newsgroups — scikit-learn 1.7.2 documentation
    Load the filenames and data from the 20 newsgroups dataset (classification). ... Sample pipeline for text feature extraction and evaluation. Semi-supervised ...Missing: tfidf code
  82. [82]
    Classification of text documents using sparse features - Scikit-learn
    This is an example showing how scikit-learn can be used to classify documents by topics using a Bag of Words approach.Loading And Vectorizing The... · Analysis Of A Bag-Of-Words... · Model Without Metadata...
  83. [83]
  84. [84]
    IsolationForest — scikit-learn 1.7.2 documentation
    The IsolationForest 'isolates' observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of ...
  85. [85]
    Tweedie regression on insurance claims - Scikit-learn
    This example illustrates the use of Poisson, Gamma and Tweedie regression on the French Motor Third-Party Liability Claims dataset, and is inspired by an R ...Missing: AXA | Show results with:AXA
  86. [86]
    What is scikit-learn? (Updated 2025) - igmGuru
    Oct 7, 2025 · Learn what is Scikit-learn, its key features, and how it's used in Python to build machine learning models and analyze data efficiently.Scikit-Learn's Features · Scikit-Learn's Components · Scikit-Learn Use Cases
  87. [87]
    KMeans — scikit-learn 1.7.2 documentation
    For an example of how to use the different init strategies, see A demo of K-Means clustering on the handwritten digits data. For an evaluation of the impact ...Demonstration of k-means... · Silhouette analysis · K_means · MiniBatchKMeans
  88. [88]
    What Is Scikit-learn and How Is It Used in AI? - Dataquest
    Apr 26, 2024 · Scikit-learn is a collection of tools that allow you to quickly build and deploy machine learning models in Python.
  89. [89]
    3.1. Cross-validation: evaluating estimator performance - Scikit-learn
    The function cross_val_predict has a similar interface to cross_val_score , but returns, for each element in the input, the prediction that was obtained for ...Cross_val_score · KFold · StratifiedShuffleSplit · Receiver Operating...
  90. [90]
    Establishing predictive machine learning models for drug responses ...
    Jun 13, 2025 · This study delves into drug response profiles as predictors in precision medicine ... We conducted a comparison of regression models using the Sci ...
  91. [91]
    Comparative analysis of regression algorithms for drug response ...
    Jan 13, 2025 · We compared and evaluated the performance of 13 representative regression algorithms using Genomics of Drug Sensitivity in Cancer (GDSC) dataset.
  92. [92]
    Machine Learning - Projects - CS229
    Can we use some Machine Learning libraries such as scikit-learn or are we expected to implement them from scratch? You can use any library for the project.
  93. [93]
    Stanford CS229 Machine Learning Models - GitHub
    Each class has a similar structure to Scikit-Learn with the fit and predict methods. There are some examples on how to use the algorithms in the notebooks (src ...
  94. [94]
    (PDF) Scikit-learn: Machine Learning in Python - ResearchGate
    Aug 7, 2025 · Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.
  95. [95]
    Machine learning for neuroimaging with scikit-learn - PubMed - NIH
    Feb 21, 2014 · Machine learning, using scikit-learn, is used for neuroimaging to model high-dimensional data, relate brain images to observations, and uncover ...Missing: neuroscience fMRI
  96. [96]
    [PDF] Machine learning for NeuroImaging data analysis - Hal-Inria
    Jan 20, 2025 · This chapter is centered on functional Magnetic Res- onance Imaging (fMRI), because the literature in this area is quite rich, and because it ...<|separator|>
  97. [97]
    scikit-learn-contrib - GitHub
    A collection of scikit-learn compatible utilities that implement methods born out of the materials science and chemistry communities.Missing: extensions | Show results with:extensions
  98. [98]
    Examples — scikit-learn 1.7.2 documentation
    This is the gallery of examples that showcase how scikit-learn can be used. Some examples demonstrate the use of the API in general and some demonstrate ...Comparing different clustering · Classifier comparison · Decision Trees · Clustering
  99. [99]
    Introduction — Scikit-learn course - GitHub Pages
    This course is an in-depth introduction to predictive modeling with scikit-learn. Step-by-step and didactic lessons introduce the fundamental methodological ...First model with scikit-learn · Linear regression using scikit... · Quiz Intro.01Missing: edX | Show results with:edX
  100. [100]
    Kaggle Winning Solutions: AI Trends & Insights
    Abstract: This work presents a comprehensive analysis of over 150 Kaggle competitions held between 2020 and 2025, encompassing nearly 700 winning solutions ...<|control11|><|separator|>
  101. [101]
    Scikit-learn governance and decision-making
    This document establishes a decision-making structure that takes into account feedback from all members of the community and strives to find consensus.Roles And Responsibilities · Core Contributors · Decision Making ProcessMissing: steering | Show results with:steering
  102. [102]
  103. [103]
    Contributing — scikit-learn 1.7.2 documentation
    The preferred way to contribute to scikit-learn is to fork the main repository on GitHub, then submit a “pull request” (PR). In the first few steps, we explain ...Submitting A Bug Report Or A... · Contributing Code · Pull Request Checklist
  104. [104]
    scikit-learn , a success story for machine learning free software | Inria
    Jan 9, 2020 · The achievement of being the third most used free software for machine learning in the world, scikit-learn has been a huge success.
  105. [105]
    3 Google summer of code for scikit-learn and more…
    The scikit-learn got 3 students accepted for the Google summer of code. Imanuel Bayer will work on making our sparse linear models, for ...
  106. [106]
    scikit-learn Completes the GitHub Secure Open Source Training
    Aug 16, 2025 · scikit-learn Completes the GitHub Secure Open Source Training. 2025-08-16 3 minute read. Author: Author Icon Reshama Shaikh. SummaryPermalink.
  107. [107]
    [PDF] Higgs Boson Discovery with Boosted Trees
    Soon after the discovery, Peter Higgs and François. Englert was acknowledged by the 2013 Nobel Prize in physics. The next step for physicists is to discover ...
  108. [108]
    2022 Kaggle Machine Learning & Data Science Survey
    'The percentage of Doctoral degree in the education column is 11.36 %.' In ... Education and Finance have the most Data Science and Machine Learning adoption.