4.2. Principal Component Analysis

• Principal component analysis (PCA)

• Technique used to emphasize variation and bring out strong patterns in a dataset.

• It's often used to make data easy to explore and visualize.

• Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.

4.2.2. 2D example

First, consider a dataset in only two dimensions, like (height, weight). This dataset can be plotted as points in a plane. But if we want to tease out variation, PCA finds a new coordinate system in which every point has a new (x,y) value. The axes don't actually mean anything physical; they're combinations of height and weight called "principal components" that are chosen to give one axes lots of variation.

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

pca = PCA(n_components=2)
pca.fit(X)
# PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
#     svd_solver='auto', tol=0.0, whiten=False)

pca.explained_variance_ratio_
# [ 0.99244...  0.00755...]

pca = PCA(n_components=2, svd_solver='full')

pca.fit(X)
# PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
#     svd_solver='full', tol=0.0, whiten=False)

pca.explained_variance_ratio_
# [ 0.99244...  0.00755...]

pca = PCA(n_components=1, svd_solver='arpack')

pca.fit(X)
# PCA(copy=True, iterated_power='auto', n_components=1, random_state=None,
#     svd_solver='arpack', tol=0.0, whiten=False)

pca.explained_variance_ratio_
# [ 0.99244...]


4.2.5. PCA dla zbioru Iris

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets

features = iris.data
labels = iris.target

pca = decomposition.PCA(n_components=3)
pca.fit(features)
features = pca.transform(features)

plt.clf()  # doctest: +SKIP

fig = plt.figure(1, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

plt.cla()  # doctest: +SKIP

for name, label in [('Setosa', 0), ('Versicolor', 1), ('Virginica', 2)]:
ax.text3D(
features[labels == label, 0].mean(),
features[labels == label, 1].mean() + 1.5,
features[labels == label, 2].mean(), name,
horizontalalignment='center',
bbox=dict(alpha=0.5, edgecolor='w', facecolor='w'))

# Reorder the labels to have colors matching the cluster results
labels = np.choose(labels, [1, 2, 0]).astype(np.float)
ax.scatter(features[:, 0], features[:, 1], features[:, 2], c=labels, edgecolor='k')

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])

plt.show()  # doctest: +SKIP


PCA dla zbioru Iris:

4.2.6. Assignments

4.2.6.1. PCA dla zbioru Pima Indian Diabetes

• Assignment: PCA dla zbioru Pima Indian Diabetes

• Complexity: medium

• Lines of code: 30 lines

• Time: 21 min

English:

...

Polish:
1. Przeprowadź analizę PCA dla zbioru Indian Pima

2. Uruchom doctesty - wszystkie muszą się powieść