Clustering Introduction:-
Clustering is one of the most popular techniques available in Machine learning field. This allows the system to group numurous entities into separate clusters/groups based on certain characteristics/features of the entities. Clustering is a widely used technique in many grouping problems like grouping similar news articles, blogs, emails, malwares etc based on their feature set. Clustering webpages will help indexing the webdata more effectively. This technique will enable the system to automatically find new categories (clusters) with out human interpretation. There are many clustering techniques to cluster the entities like K-means, X-means, Hierarchical clustering etc.
Problem:-
Consider a big clothes showroom and assume that they have not properly organized various dress materials like kids wear, fancy, formal, party, ethnic, sports etc separately and dumped all kinds of wears altogether or scattered in the whole showroom. Then any customer entering the showroom might throughly check the whole showroom in-order to pick his/her required selections, then the customer might end up rounding and rounding the whole showroom to select his dress material. What-if these dress materials are categorized into various sections and each section is devoted for a particuar type like kids wear, formal, party, ethnic, sports etc then the customer can directly enter into his/her section of interest and can complete his/her shopping efficiently.
Solution:-
Clustering techniques will help in identifying the similarities between the entities and cluster them into various categories. These different clusters will enable the system to mine the common patterns/rules from these clusters.
Clustering in Mahout:-
Mahout has support for various clustering techniques implemented in a distributed passion. Distributed/parallel implementations will directly relate to drastic improvement in the performance of the system as well as overcoming the limitation of limiting the input data size based on hardware in stand alone implementations.
Major clustering techniques available in Mahout 0.5 are,
- K-means - Exclusive clustering
- Fuzzy K-means - Overlapping clustering
- Dirichlet - Probabilistic clustering etc
K-means in Mahout:-
Suppose to say, we have to cluster the list of documents based on the content of each document. Then the following steps are involved,
1) Vectorizing:- The documents should be converted into a format which the algorithm can understand. Each document will be converted into a vector and all these vectors should be passed as input to the clustering algorithm (K-means). Each vector is a collection of similar values and each vector will have a dimension of its own.
The useful feature-set in clustering the text documents are TF-IDF (Term Frequency and Inverse Document Frequency)
- Term Frequency:- The number of occurrences of a term in document
- Inverse Document Frequency:- The uniqueness of a term in document. i.e., the unique terms associated with a given document
a) Calculating weights with TF-IDF:- Assigning weights to terms based on TF-IDF features will give more accuracy to the clustering algorithm rather than using the plain TF-IDF feature-set. Before we dig into details let us see what IDF means,
For example, say if a document has ‘n’ words w1, w2, w3…wn with frequencies f1, f2, f3…fn respectively,
Then, TFi of wi = fi
Then, IDFi = 1/DFi where DFi is the number of documents the word presents in (say word wi).
To normalize the above equation we will multiply the equation with ‘N’ (which is number of documents)
So, IDFi = N/DFi, therefore the weight Wi is TFi * IDFi = TFi * N/DFi
In order to give give more priority to TF we will apply logarithm to the above equation, Thus the final formula for calculating weight for each term is,
Wi = TFi * log ( N / DFi )
Applying the above equation the weights for each term will be calculated and the vectors for each document will be created, finally the input data for the algorithm will look something like this,
ant apple bat boll story …..
Doc1 2.85 0.0 3.6 5.87 1.897 …..
Doc2 0.0 1.23 0.0 6.87 4.5343 ….
……
From the above we can conclude that the words are represented as columns and documents as rows, so each vector signifies each row in the matrix. 0.0 represents that that particular word is not present in that document.
2) Selecting distance measure:- Distance measure plays a vital role in identifying the similarity between two different vectors. And there are many built-in distance measures available in Mahout clustering section. they are,
- Euclidean Distance Measure
- Squared Euclidean Distance Measure
- Cosine Distance Measure
- Tanimoto Distance Measure
- Manhattan Distance Measure
Each one of the above has its own way of calculating distances between the vectors usually Tanimoto distance measure will be the better choice for calculating the distances between the text documents. Also Tanimoto distance measure is a combination of both Euclidean and Cosine distance measures. But it is suggested that each application has various requirements to tuning the application by changing the distance measures and evaluating the clusters generated is preferable approach.
3) Tuning features:- Also tuning some features offered by Mahout will lead the system to generate more accurate results like, N-gram collocations, Normalization, selecting appropriate “K�? value etc.
4) After above all running the K-means over vectorized data and certain other features as mentioned above using the Mahout command-line commands or Mahout API we can generate the clustered results.
5) The generated results can be viewed using the Mahout command line or API.
Clustering can also be applied with recommendation algorithms where cluster-based recommendation will give us best results. Also clustering can be applied in combination with classification which will derive more efficient classification systems.
Very nice introduction to clustering algorithms.
One question — How we can leverage Mahout algorithms for real-time data. As far as I understood, clustering is a 3 way process:
1. Vector generation
2. Creating inputs.
3. Running cluster algorithm
Is there some kind of best practices we can follow to make clustering effective/fast ?
@Praveenesh Kumar
1) Mahout is a pack of many different algorithms. We should throughly understand the data and the problem before we choose any learning algorithm. Mahout has many built-in features which are useful in converting raw data to the required format. Also we can develop our own infrastructure to do that.
2) Best practices are not common for all kinds of data/problem, so it is always a good approach to tune the different parameters and evaluate the generated results and finally pick the best one among them. For general problems we can depend on some well defined approaches.
@mbalija
Thanks for the explanation. However, I understand the fact, tuning and choosing the right algorithm is a must and also a tricky thing to do.
My question is more focused on the iterative nature of the algorithms.
Given the fact, that mahout jobs when run on hadoop, will run as map-reduce jobs, and the more the iterations will be, the more time it will take(no matter how much tuning we do – there is a cost of M/R job execution that will always be there)
So I was just wondering, how people are tackling this situation. How can we use mahout clustering algorithms or any iterative algorithms in real time situations ?
Like mahout’s K-means clustering/LDA generally takes 2-5 minutes depending on the iterations( even on good hardware), so my question was more biased towards using iterative based algorithms in real time.
I understand the fact, that hadoop/mahout may not be build for the supporting real time scenarios, but as we are looking for making enterprise based applications on hadoop, I am just wondering are there some cool ways/work arounds to do the above things in effective manner or its still the long way to go.