Golf the K-means algorithm

10

1

K-means is a standard unsupervised clustering algorithm, which, given a set of "points" and a number of clusters K, will assign each "point" to one of K clusters.

Pseudo-Code of K-means

Note that there are many variants of K-means. You have to implement the algorithm I am describing below. You may have some variation on the algorithm or use built-ins as long as you would get the same result as this algorithm given the same initial points.

In this challenge, all inputs will be points on the 2D plane (each point is represented by its coordinates in x and y).

Inputs: K, the number of clusters
        P, the set of points

Choose K points of P uniformly at random
Each chosen point is the initial centroid of its cluster

Loop:
     For each point in P:
         Assign to the cluster whose centroid is the nearest (Euclidean distance)
         In case of a tie, any of the tied cluster can be chosen

     Recompute the centroid of each cluster:
         Its x coordinate is the average of all x's of the points in the cluster
         Its y coordinate is the average of all y's of the points in the cluster

Until the clusters don't change from one iteration to the next

Output: the set of clusters    

Inputs and Outputs

  • You may take K and P through STDIN, or as a function argument, etc.
  • P and the points in P can be represented using any structure that is natural for set/lists in your language of choice.
  • K is a strictly positive integer.
  • You may assume that inputs are valid.
  • There will always be at least K points in P.
  • You may output the clusters to STDOUT, return them from a function, etc.
  • The ordering of the clusters and the ordering inside the clusters is unimportant. -You can either return groups of points to represent clusters, or each point labeled with an identifier for the cluster (e.g. an integer).

Test cases

Since the resulting clusters depend on which points were initially chosen, you may not all get the same results (or the same result each time you run your code).

Therefore, only take the output as an example output.

Input:
  K = 1
  P = [[1,2.5]]
Output:
  [[[1,2.5]]]

Input:
  K = 3
  P = [[4,8], [15,16], [23,42], [-13.37,-12.1], [666,-666]]
Output:
  [[[666,-666]],[[-13.37,-12.1],[4,8]],[[15,16],[23,42]]]

Input:
  K = 2
  P = [[1,1], [1,1], [1,1]]
Output:
  [[[1,1]],[[1,1],[1,1]]]

Scoring

This is , so the shortest answer in bytes wins.

Fatalize

Posted 2016-05-04T06:54:36.350

Reputation: 32 976

Are built-ins allowed when the results are indistinguishable from your algorithm? – Martin Ender – 2016-05-04T07:03:35.060

@MartinBüttner if you can justify that given the same initial points, it converges to the same result, yes. – Fatalize – 2016-05-04T07:07:44.773

Woud it also be acceptable to output labels of membership of a cluster for each point? (E.g. all points of the first cluster have label 1, all points of the second one have label 2 e.t.c) – flawr – 2016-05-04T11:03:39.597

@flawr Yes, this is acceptable. – Fatalize – 2016-05-04T11:07:57.760

Degenerate test case: K=2, P = [[1,1], [1,1], [1,1]]. – Peter Taylor – 2016-05-04T11:54:24.733

Answers

4

Matlab, 25 bytes

@(x,k)kmeans(x,k,'S','u')

Given a n x 2 matrix (one row per point e.g. [[4,8]; [15,16]; [23,42]; [-13.37,-12.1]; [666,-666]]) this functions returns a list of labels for each input point.

flawr

Posted 2016-05-04T06:54:36.350

Reputation: 40 560

5

C++, 479 474 bytes

Only ~20x as much as Matlab!

Golfed

#define V vector<P>
#define f float
struct P{f x,y,i=0;f d(P&p){return(p.x-x)*(p.x-x)+(p.y-y)*(p.y-y);}f n(P&p){return i?x/=i,y/=i,d(p):x=p.x,y=p.y,0;}f a(P&p){x+=p.x,y+=p.y,i++;}};P z;int l(P a,P b){return a.d(z)<b.d(z);}f m(f k,V&p){f s=p.size(),i,j=0,t=1;V c(k),n=c,d;for(random_shuffle(p.begin(),p.end());j<k;c[j].i=j++)c[j]=p[j];for(;t;c=n,n=V(k)){for(i=0;i<s;i++)d=c,z=p[i],sort(d.begin(),d.end(),l),j=d[0].i,p[i].i=j,n[j].a(p[i]);for(j=t=0;j<k;j++)t+=n[j].n(c[j]);}}

The input/output to the algorithm is a set of points (struct P) with x and y; and the output is the same set, with their i's set to indicate the index of the output cluster that the point ends in.

That extra i is also used to identify the clusters. In the main loop, the closest centroid to each point is found by sorting a copy of the current centroids by proximity to that point.

This handles degenerate cases (empty clusters) by keeping the corresponding centroids' previous position (see definition of P::n, which also returns distance-to-previous-centroid). A few chars could be saved by assuming that these will not crop up.

Ungolfed, with main

#include <cstdio>
#include <ctime>
#include <cstdlib>
#include <vector>
#include <algorithm>
using namespace std;

#define V vector<P>
#define f float
struct P{
    f x,y,i=0;
    f d(P&p){return(p.x-x)*(p.x-x)+(p.y-y)*(p.y-y);} // distance squared
    f n(P&p){return i?x/=i,y/=i,d(p):x=p.x,y=p.y,0;} // normalize-or-reset
    f a(P&p){x+=p.x,y+=p.y,i++;}                     // add coordinates
};
P z;int l(P a,P b){return a.d(z)<b.d(z);}            // closer-to-z comparator 
f m(f k,V&p){
    f s=p.size(),i,j=0,t=1;V c(k),n=c,d;
    for(random_shuffle(p.begin(),p.end());j<k;c[j].i=j++)
        c[j]=p[j];                                // initial random assignment
    for(;t;c=n,n=V(k)){                           
        for(i=0;i<s;i++)                          // assign to clusters
            d=c,z=p[i],sort(d.begin(),d.end(),l),
            j=d[0].i,p[i].i=j,n[j].a(p[i]);       // and add those coords
        for(j=t=0;j<k;j++)t+=n[j].n(c[j]);        // normalize & count changes
    }        
}

int main(int argc, char **argv) {
    srand((unsigned long)time(0));

    int k;
    V p;
    sscanf(argv[1], "%d", &k);
    printf("Input:\n");
    for (int i=2,j=0; i<argc; i+=2, j++) {
        P n;
        sscanf(argv[i], "%f", &(n.x));
        sscanf(argv[i+1], "%f", &(n.y));
        p.push_back(n);
        printf("%d : %f,%f\n", j, p[j].x, p[j].y);
    }

    m(k,p);
    printf("Clusters:\n");
    for (int q=0; q<k; q++) {
        printf("%d\n", q);
        for (unsigned int i=0; i<p.size(); i++) {
            if (p[i].i == q) printf("\t%f,%f (%d)\n", p[i].x, p[i].y, i);
        }
    }
    return 0;
}

tucuxi

Posted 2016-05-04T06:54:36.350

Reputation: 583

I know I might be late in this comment, but could you define a macro #define R p){return and change the second argument of l to p so you could use it three times total? – Zacharý – 2017-06-06T19:25:40.420

4

J, 60 54 bytes

p=:[:(i.<./)"1([:+/&.:*:-)"1/
]p](p(+/%#)/.[)^:_(?#){]

Defines a helper verb p which takes a list of points and centroids and classifies each point by the index of the nearest centroid. Then, it uses that to repeat the process of choosing new centroid by taking the averages of the points in each cluster until it converges, and then to partition the points for output.

Usage

The value of k is given as an integer on the LHS. The list of points is given as a 2d array on the RHS. Here it is specified as a list of points which is reshaped into a 2d array of 5 x 2. The output will be the label for which cluster each point belongs in the same order as the input.

If you desire to use a fixed seed for reproducible results, replace the ? with a ?. at (?#).

   p =: [:(i.<./)"1([:+/&.:*:-)"1/
   f =: ]p](p(+/%#)/.[)^:_(?#){]
   3 f (5 2 $ 4 8 15 16 23 42 _13.37 _12.1 666 _666)
0 1 1 0 2

Explanation

[:(i.<./)"1([:+/&.:*:-)"1/  Input: points on LHS, centroids on RHS
           (          )"1/  Form a table between each point and centroid and for each
                     -        Find the difference elementwise
            [:     *:         Square each
              +/&.:           Reduce using addition
                              Apply the inverse of square (square root) to that sum
[:(     )"1                 For each row of that table
     <./                      Reduce using min
   i.                         Find the index of the minimum in that row
                            Returns a list of indices for each point that shows
                            which centroid it belongs to

]p](p(+/%#)/.[)^:_(?#){]  Input: k on LHS, points on RHS
                    #     Count the number of points
                   ?      Choose k values in the range [0, len(points))
                          without repetition
                       ]  Identity function, get points
                      {   Select the points at the indices above
  ]                       Identity function, get points
   (         )^:_         Repeat until convergence
    p                       Get the labels for each point
             [              Identity function, get points
           /.               Partition the points using the labels and for each
      +/                      Take the sums of points elementwise
         #                    Get the number of points
        %                     Divide sum elementwise by the count
                            Return the new values as the next centroids
]                         Identity function, get points
 p                        Get the labels for each point and return it

miles

Posted 2016-05-04T06:54:36.350

Reputation: 15 654

I would give +1, but I'm scared that breaking your 3k will curse me. – NoOneIsHere – 2016-07-17T21:30:07.927

3

CJam (60 bytes)

{:Pmr<1/2P,#{:z{_:+\,/}f%:C,{P{C\f{.-Yf#:+}_:e<#1$=},\;}%}*}

This is a function which takes its input in the form k p on the stack. It assumes that the points are represented with doubles, not ints. It doesn't implicitly assume anything about the dimension of the points, so it would cluster equally well in 6-D Euclidean space as in the specified 2-D.

Online demo

Peter Taylor

Posted 2016-05-04T06:54:36.350

Reputation: 41 901

2

Mathematica 14 12 bytes

Since built-ins are allowed, this should do it.

FindClusters

Example

FindClusters[{{4, 8}, {15, 16}, {23, 42}, {-13.37, -12.1}, {666, -666}}, 3]

{{{4, 8}, {-13.37, -12.1}}, {{15, 16}, {23, 42}}, {{666, -666}}}

DavidC

Posted 2016-05-04T06:54:36.350

Reputation: 24 524

You don't need the brackets. f = FindClusters, f[something]. – NoOneIsHere – 2016-07-17T21:30:51.817

ok, thanks I wasn't sure. – DavidC – 2016-07-17T22:45:58.190

1

Jelly, 24 bytes

_ÆḊ¥þ³i"Ṃ€$
ẊḣµÇÆmƙ³µÐLÇ

Try it online!

Uses features that were implemented after this challenge was posted. Supposedly, this is no longer non-competing.

Explanation

_ÆḊ¥þ³i"Ṃ€$  Helper link. Input: array of points
             (Classify) Given a size-k array of points, classifies
             each point in A to the closet point in the size-k array
    þ        Outer product with
     ³       All points, P
   ¥         Dyadic chain
_              Subtract
 ÆḊ            Norm
          $  Monadic chain
      i"     Find first index, vectorized
        Ṃ€   Minimum each

ẊḣµÇÆmƙ³µÐLÇ  Main link. Input: array of points P, integer k
  µ           Start new monadic chain
Ẋ               Shuffle P
 ḣ              Take the first k
        µ     Start new monadic chain
   Ç            Call helper (Classify)
      ƙ         Group with those values the items of
       ³        All points, P
    Æm            Take the mean of each group
         ÐL   Repeat that until the results converge
           Ç  Call helper (Classify)

miles

Posted 2016-05-04T06:54:36.350

Reputation: 15 654

1

R, 273 bytes

function(K,P,C=P[sample(nrow(P),K),]){while(T){D=C
U=sapply(1:nrow(P),function(i)w(dist(rbind(P[i,],C))[1:K]))
C=t(sapply(1:K,function(i)colMeans(P[U==i,,drop=F])))
T=isTRUE(all.equal(C,D))}
cbind(U,P)}
w=function(x,y=seq_along(x)[x==min(x)])"if"(length(y)>1,sample(y,1),y)

Try it online!

Takes Pas a matrix, with x and ycoordinates in the first and second column respectively. Returns P with a first column added which indicates cluster index (integer).

I had to redefine w by copying the source from nnet::which.is.max to conform to the requirement that cluster is chosen randomly in case of ties. Otherwise I would use which.minfrom base for a total of 210 bytes. There is still room for golfing but I did not want to obfuscate it to much to give others a chance to spot possible issues within my code.

JayCe

Posted 2016-05-04T06:54:36.350

Reputation: 2 655

0

Julia 213 bytes

function f(p,k)
A=0
P=size(p,1)
c=p[randperm(P)[1:k],:]
while(true)
d=[norm(c[i]-p[j]) for i in 1:k, j in 1:P]
a=mapslices(indmin,d,1)
a==A&&return a
A=a
c=[mean(p[vec(a.==i),:],1) for i in 1:k]
end
end

Returns an array of the same length as p, with integers indicating which cluster the corresponding element of p belongs to.

I think there is still a fair bit of scope to optimize the character count down.

(Of course I could just use the Clustering.jl package to do it trivially)

Lyndon White

Posted 2016-05-04T06:54:36.350

Reputation: 1 021