Friday, March 13, 2026

A Gentle Introduction to Principal Component Analysis (PCA) in Python



Image by the author | Ideogram

Principal component analysis (PCA) is one of the most popular techniques for dimensionality reduction. It is a vital data transformation step in many scenarios and industries, such as image processing, finance, genetics, and machine learning applications where the data contain many features that need to be analyzed more efficiently.

The reasons dimensionality reduction techniques like PCA matter are diverse, and three of them stand out:

  • Efficiency: Reducing the number of features lowers the computational cost of data-intensive processes, such as training advanced machine learning models.
  • Interpretability: By projecting the data into a low-dimensional space while preserving its key patterns and properties, it becomes easier to interpret and visualize in 2D and 3D, sometimes yielding insights directly from the visualization.
  • Noise reduction: High-dimensional data often contain redundant or noisy features that, once detected by methods like PCA, can be eliminated while preserving (or even improving) the effectiveness of subsequent analyses.

Hopefully, by now I have convinced you of the practical importance of PCA when dealing with complex data. If so, read on, because the rest of this article is hands-on: we will learn how to use PCA in Python.

How to Apply Principal Component Analysis in Python

Thanks to supporting libraries like scikit-learn, which provide an abstracted implementation of the PCA algorithm, applying it to data is relatively straightforward, provided the data are numerical, already preprocessed, and free of missing values, with feature values standardized to avoid variance-related problems. This matters because PCA is a deeply statistical method that relies on feature variances to determine the principal components: new features, derived from the original ones, that are orthogonal to each other.

We will start our step-by-step PCA example in Python by importing the necessary libraries, loading the MNIST dataset of low-resolution handwritten digit images, and putting it into a pandas DataFrame:

import pandas as pd
from torchvision import datasets

# Download the MNIST training split (60,000 handwritten digit images)
mnist_data = datasets.MNIST(root="./data", train=True, download=True)

data = []
for img, label in mnist_data:
    # Flatten each 28x28 PIL image into a list of 784 pixel intensities
    img_array = list(img.getdata())
    data.append([label] + img_array)

columns = ["label"] + [f"pixel_{i}" for i in range(28*28)]
mnist_data = pd.DataFrame(data, columns=columns)

In the MNIST dataset, each instance is a 28×28 square image, for a total of 784 pixels, each holding a numeric value for its gray level, from 0 for black (no intensity) to 255 for white (maximum intensity). These data must first be unrolled into a single one-dimensional array, rather than the original two-dimensional 28×28 grid. This process, called flattening, takes place in the code above, yielding a final DataFrame with a total of 785 variables: one for each of the 784 pixels, plus the label, an integer between 0 and 9 indicating the digit originally written in the image.

The MNIST dataset | Source: TensorFlow
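To make the flattening step concrete, here is a minimal sketch using a hypothetical NumPy array in place of a real MNIST image (the array contents are made up for illustration):

```python
import numpy as np

# A hypothetical 28x28 grayscale image, all black except one white stroke
image = np.zeros((28, 28), dtype=np.uint8)
image[14, :] = 255  # a horizontal white line across row 14

# Flattening turns the 2D pixel grid into a single 784-element vector,
# which is exactly what each row of the DataFrame above stores
flat = image.reshape(-1)
print(flat.shape)  # (784,)
```

Pixel (row, col) of the grid lands at index row * 28 + col of the flattened vector, so the spatial layout is recoverable, even though PCA itself ignores it.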

In this example, we will not need the label (useful for other use cases, such as image classification), but we will assume it may come in handy for future analyses, so we separate it from the image pixel features into a new variable:

X = mnist_data.drop('label', axis=1)

y = mnist_data.label

Although we will not apply a supervised learning technique after PCA, we will assume we might do so in future analyses, which is why we split the dataset into training (80%) and test (20%) subsets. There is another reason for doing this; let me explain it a bit later.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state=42)

Preprocessing the data before feeding it to the PCA algorithm is as important as applying the algorithm itself. In our example, preprocessing entails scaling the original pixel intensities in the MNIST dataset so that each feature has mean 0 and standard deviation 1, ensuring all features contribute equally to the variance computations and none dominates the others. To do this, we will use the StandardScaler class from sklearn.preprocessing, which standardizes numerical features:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Note the use of fit_transform on the training data, while transform is used on the test data instead. This is the other reason for splitting the data into training and test sets beforehand: in data transformations such as standardizing numerical attributes, the transformations applied to the training and test sets must be consistent. The fit_transform method is used on the training data because it computes the necessary statistics that will guide the transformation process from the training set (fitting), and then applies the transformation. Meanwhile, the transform method is used on the test data, applying the same transformation "learned" from the training data to the test set. This ensures that the model sees the test data on the same scale as the training data, maintaining consistency and avoiding problems such as data leakage or bias.
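A tiny sketch makes the fit_transform versus transform distinction tangible. The numbers below are hypothetical stand-ins for pixel features, chosen so the training mean is easy to see:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data: training mean is 10
X_train_demo = np.array([[0.0], [10.0], [20.0]])
X_test_demo = np.array([[10.0], [30.0]])

scaler = StandardScaler()
Z_train = scaler.fit_transform(X_train_demo)  # learns mean/std from training data
Z_test = scaler.transform(X_test_demo)        # reuses those same statistics

print(scaler.mean_)    # [10.] -- computed from the training set only
print(Z_train.mean())  # ~0: the training data is centered by construction
print(Z_test[0, 0])    # 0.0: the test value 10 equals the *training* mean
```

If we had mistakenly called fit_transform on the test set too, the test value 10 would have been centered around the test mean of 20 instead, giving a negative score and an inconsistent scale between the two sets.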

Now we can apply the PCA algorithm. The scikit-learn PCA implementation accepts one important argument: n_components. When set to a float between 0 and 1, this hyperparameter specifies the fraction of the total variance to retain. Values closer to 1 keep more components and capture more of the variance in the original data, while values closer to 0 keep fewer components, applying a more aggressive dimensionality reduction strategy. For example, setting n_components to 0.95 means retaining just enough components to capture 95% of the variance in the original data, which can be an appropriate trade-off between reducing the data's dimensionality and preserving most of its information. If applying this setting significantly reduces the data's dimensionality, it means many of the original features did not carry much statistically relevant information.

from sklearn.decomposition import PCA

pca = PCA(n_components = 0.95)
X_train_reduced = pca.fit_transform(X_train_scaled)

X_train_reduced.shape

Using the shape attribute of the dataset obtained after PCA, we can see that its dimensionality has been drastically reduced from 784 features to just 325, while retaining 95% of the variance in the original data.

Is this a good result? The answer largely depends on the downstream application or the type of analysis you want to perform on the reduced data. For example, if you want to build a classifier of digit images, you could build two classification models: one trained on the original, high-dimensional dataset, and one trained on the reduced dataset. If there is no significant loss of classification accuracy in the second classifier, good news: you have obtained a faster classifier (dimensionality reduction usually implies greater training and inference efficiency) with classification performance similar to what you would get from the original data.
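The comparison just described can be sketched in a few lines. To keep the example self-contained and fast, it uses scikit-learn's built-in digits dataset (8×8 images, 64 features) as a lightweight stand-in for MNIST, and logistic regression as an arbitrary choice of classifier; neither is prescribed by the article:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 8x8 digit images: a small stand-in for the MNIST workflow above
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Model 1: classifier trained on the original 64 features
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
baseline.fit(X_tr, y_tr)

# Model 2: same classifier, but after PCA keeps 95% of the variance
reduced = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                        LogisticRegression(max_iter=2000))
reduced.fit(X_tr, y_tr)

print(f"Accuracy, original features: {baseline.score(X_te, y_te):.3f}")
print(f"Accuracy, after PCA:         {reduced.score(X_te, y_te):.3f}")
```

Using pipelines here also guarantees the fit/transform discipline discussed earlier: the scaler and PCA are fitted on training data only, and the test set merely passes through the learned transformations.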

Wrapping Up

This article illustrated, step by step and with the aid of Python, how to apply the PCA algorithm from scratch, starting with a high-dimensional dataset of handwritten digit images.

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning, and LLMs. He trains and guides others in harnessing AI in the real world.
