Table of Contents

Module: kMeans Bio/Tools/Clustering/kMeans.py

This provides code for doing k-Means clustering of data.

k-Means is an algorithm for unsupervised clustering of data.

Glossary:
    cluster - A group of closely related data points.
    centroid - A vector "in the middle" of a cluster.

Functions:
    cluster - Cluster a list of data points.

Distance Functions:
    euclidean_dist - The Euclidean distance between two points.
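
As an illustration of the distance function listed above, here is a minimal sketch of a Euclidean distance between two numeric vectors. The name euclidean_dist_sketch and the length check are assumptions made for this example; the module's own euclidean_dist may differ in detail.

```python
import math


def euclidean_dist_sketch(a, b):
    """Return the Euclidean distance between two equal-length vectors.

    Hypothetical helper for illustration only, not the module's code.
    """
    if len(a) != len(b):
        # Assumption: mismatched lengths are treated as an error.
        raise ValueError("vectors must have the same length")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Any function with the same two-vector signature could be passed as distance_fn in its place.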

Functions   
_find_closest_centroid
cluster
first_k_points_as_centroids
random_centroids
  _find_closest_centroid 
_find_closest_centroid (
        vector,
        centroids,
        distance_fn,
        )

_find_closest_centroid(vector, centroids, distance_fn) -> index of closest centroid
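
The lookup described above can be sketched as a linear scan over the centroids. This is an illustrative version, not the module's code: the name find_closest_centroid_sketch is hypothetical, ties are assumed to go to the lowest index, and math.dist (Python 3.8+) stands in for the distance callback.

```python
import math


def find_closest_centroid_sketch(vector, centroids, distance_fn):
    """Return the index of the centroid nearest to vector.

    Hypothetical helper; assumes centroids is non-empty and that
    ties are resolved in favour of the earliest centroid.
    """
    best_index = None
    best_dist = None
    for index, centroid in enumerate(centroids):
        dist = distance_fn(vector, centroid)
        # Strict comparison keeps the first centroid on a tie.
        if best_dist is None or dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Usage: the point (1, 1) is nearer to the first centroid.
closest = find_closest_centroid_sketch((1, 1), [(0, 0), (5, 5)], math.dist)
```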

  cluster 
cluster (
        data,
        k,
        distance_fn=euclidean_dist,
        init_centroids_fn=random_centroids,
        max_iterations=1000,
        update_fn=None,
        )

cluster(data, k[, distance_fn][, init_centroids_fn][, max_iterations][, update_fn]) -> (centroids, clusters) or None

Organize data into k clusters. Returns a tuple (centroids, clusters), where clusters is a list of cluster assignments in the range 0 to k-1 whose items correspond, in order, to the data points. If the algorithm does not converge within max_iterations (default 1000), returns None.

data is a list of data points, which are vectors of numbers. distance_fn is a callback function that calculates the distance between two vectors; by default, the Euclidean distance is used. init_centroids_fn is a callback function that takes (data, k) and returns the list of initial centroids; by default, random_centroids is used. If update_fn is specified, it is called at the beginning of every iteration and passed the iteration number, the cluster centroids, and the current cluster assignments.

Exceptions   
ValueError
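
To make the description of cluster concrete, here is a minimal sketch of a k-means loop with the same overall contract: it returns (centroids, clusters) on convergence and None if max_iterations is exhausted. It is not the module's implementation. The name kmeans_sketch and the seed parameter are hypothetical, math.dist (Python 3.8+) stands in for the default distance, and the condition that raises ValueError is an assumption, since the documentation does not say when the real function raises it.

```python
import math
import random


def kmeans_sketch(data, k, distance_fn=math.dist, max_iterations=1000,
                  update_fn=None, seed=None):
    """Cluster data into k groups; return (centroids, clusters) or None."""
    # Assumption: an out-of-range k is what triggers ValueError.
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of data points")

    # Start from k randomly chosen data points.
    rng = random.Random(seed)
    centroids = [list(point) for point in rng.sample(list(data), k)]
    clusters = None  # no assignments exist before the first iteration

    for iteration in range(max_iterations):
        if update_fn is not None:
            # Called at the start of every iteration; clusters is None
            # on the first call in this sketch.
            update_fn(iteration, centroids, clusters)

        # Assignment step: label each point with its nearest centroid.
        new_clusters = [
            min(range(k), key=lambda i: distance_fn(point, centroids[i]))
            for point in data
        ]
        if new_clusters == clusters:
            return centroids, clusters  # assignments stable: converged
        clusters = new_clusters

        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, c in zip(data, clusters) if c == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return None  # did not converge within max_iterations


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
result = kmeans_sketch(points, 2, seed=0)
```

Because the function can return None, callers should check the result before unpacking it into centroids and clusters.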

  first_k_points_as_centroids 
first_k_points_as_centroids ( data,  k )

first_k_points_as_centroids(data, k) -> list of centroids

Picks the first k points as the initial centroids. This is not a good method unless the data is in random order, but it is deterministic, which is useful for debugging.
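
A minimal sketch of this initialization strategy, assuming data supports slicing. The name first_k_points_sketch is hypothetical, and copying each point into a new list is a choice made for this example so that later centroid updates cannot alter the input data.

```python
def first_k_points_sketch(data, k):
    """Return copies of the first k data points as initial centroids."""
    return [list(point) for point in data[:k]]


initial = first_k_points_sketch([(1, 2), (3, 4), (5, 6)], 2)
```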

  random_centroids 
random_centroids ( data,  k )

random_centroids(data, k) -> list of centroids

Return a list of data points to serve as the initial centroids. The centroids are k randomly chosen data points.
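
One way to sketch this strategy is with random.sample, which draws k distinct positions from the data without replacement. The name random_centroids_sketch and the optional rng parameter are hypothetical and exist only to make the example reproducible; whether the module samples with or without replacement is not stated in this documentation.

```python
import random


def random_centroids_sketch(data, k, rng=random):
    """Return copies of k randomly chosen data points.

    Assumption: points are drawn without replacement, so k must not
    exceed len(data) (random.sample raises ValueError otherwise).
    """
    return [list(point) for point in rng.sample(list(data), k)]


sample_data = [(0, 0), (1, 1), (2, 2), (3, 3)]
picked = random_centroids_sketch(sample_data, 2, random.Random(42))
```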



This document was automatically generated on Tue Jul 31 12:07:03 2001 by HappyDoc version r1_3