In business applications, cases that require clustering 10,000K variables are rare; you typically deal with hundreds, or at most several thousand, demographic/transactional variables.
But think of microarray analysis, which may involve tens of thousands of arrays, or think of face recognition. In most cases each face image is a rectangular matrix with several hundred rows and columns or more, so their product is easily of magnitude 100K+. For example, suppose you have N people, each with a face image under X different expressions, and you want to cluster these N*X images for different purposes, say by person or by expression. The matrix representation of this tensor is very large. Of course, you can work with a series of thin-SVD-reduced pseudo-images, but either way you will have to work on a very large matrix with many columns.
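To make the "thin-SVD-reduced pseudo-images" idea concrete, here is a minimal numpy sketch; the image count, the 320-by-320 size, and rank 50 are illustrative assumptions, not values from the thread:

```python
import numpy as np

# Illustrative sizes only: 200 face images, each 320x320 pixels, so each
# flattened image has 320*320 = 102,400 features (the "100K+" magnitude).
n_images, h, w = 200, 320, 320
rank = 50  # target rank for the thin SVD reduction

images = np.random.rand(n_images, h * w)  # stand-in for real face data

# Economy (thin) SVD: U is n_images x n_images, Vt is n_images x (h*w).
U, s, Vt = np.linalg.svd(images, full_matrices=False)

# Keep only the leading `rank` components: each 102,400-dim image becomes
# a rank-dimensional "pseudo-image" you can feed to a clustering routine.
reduced = U[:, :rank] * s[:rank]

print(images.shape, "->", reduced.shape)  # (200, 102400) -> (200, 50)
```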
On the other hand, when you use a kernel method, which involves inner products between your original observations, you end up with a square matrix whose dimension is in the tens of thousands whenever you have that many observations.
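As a minimal sketch of why this happens (the sizes and the RBF kernel choice are illustrative assumptions):

```python
import numpy as np

# Illustrative sizes: n observations, p features. A kernel method replaces
# the n-by-p data matrix with an n-by-n matrix of pairwise similarities,
# so the result grows with n^2 no matter how many columns you started with.
n, p, gamma = 1000, 5000, 1e-4

X = np.random.rand(n, p)
sq_norms = (X ** 2).sum(axis=1)

# ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * <x_i, x_j>
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
K = np.exp(-gamma * np.maximum(sq_dists, 0.0))  # RBF kernel (Gram) matrix

print(K.shape)  # (1000, 1000) -- with tens of thousands of rows, this
                # kernel matrix itself becomes the computational bottleneck
```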
As for interpretation, that is decided case by case, depending on the application.
Well, that is my $0.02. If there are any mistakes, please don't hesitate to point them out.

Author: shiyiming  Time: 2010-2-20 12:59  Subject: Re: How should we cluster data with 10 million columns?

I misread, sorry. oloolo's claim that only 30K columns can be loaded is incorrect.
Data on the order of a 50,000K-by-10,000K matrix is an extra-large dataset and needs special methods to handle.
While storage nowadays is not a big issue, in such a case even using the Gram (cross-product) matrix won't help the computation, because the resulting matrix will still be 10,000K-by-10,000K (a dense matrix of doubles at that size alone takes roughly 800 TB to store), and the computational hurdle will be prohibitive on most affordable machines (<$500K cost).
One workaround is randomized matrix approximation. For example, select a portion of the columns with given probabilities and run a matrix factorization on the Gram matrix of the selected columns, which will be very small if the number of columns selected is of manageable magnitude. Obtain a linear transformation from that factorization and project the full data into the new space. In some cases you can also keep only a portion of the rows, again sampled with probabilities of your choosing, and there are also algorithms that do element-wise selection. Each selection scheme differs from the rest and has its own properties and application contexts. A minimal sketch of the column-sampling variant follows below.
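Here is a hedged numpy sketch of that column-sampling idea, in the spirit of norm-based randomized column selection; the sizes, the squared-norm sampling probabilities, and the helper name `column_sample_projection` are illustrative assumptions, not anything prescribed in the thread:

```python
import numpy as np

def column_sample_projection(X, c, k, seed=0):
    """Sample c columns with probability proportional to squared column
    norms, factor the small c-by-c Gram matrix, and project all rows of
    X onto the leading k directions of the sampled subspace."""
    rng = np.random.default_rng(seed)

    # 1. Sampling probabilities proportional to squared column norms.
    col_norms = (X ** 2).sum(axis=0)
    probs = col_norms / col_norms.sum()
    idx = rng.choice(X.shape[1], size=c, replace=True, p=probs)

    # 2. Rescale sampled columns so the sketch is unbiased in expectation.
    C = X[:, idx] / np.sqrt(c * probs[idx])

    # 3. Factor the small c-by-c Gram matrix instead of the full p-by-p one.
    G = C.T @ C
    _, eigvecs = np.linalg.eigh(G)    # eigenvalues come back ascending
    top = eigvecs[:, ::-1][:, :k]     # leading k eigenvectors

    # 4. Linear transformation: every row lands in a k-dimensional space
    #    that any standard clustering algorithm can handle.
    return C @ top

X = np.random.rand(1000, 20_000)              # stand-in for a huge matrix
Z = column_sample_projection(X, c=500, k=20)
print(X.shape, "->", Z.shape)                 # (1000, 20000) -> (1000, 20)
```

The point of the scheme is that the factorization now costs on the order of c^3 for the sampled c columns rather than p^3 for all p columns, which is what makes the approach feasible at the 10,000K-column scale.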
Author: shiyiming  Time: 2010-3-2 14:18  Subject: Re: How should we cluster data with 10 million columns?

Hmm, before clustering, and given the purpose of your grouping, you can actually first run some variable-screening steps (e.g., the VARCLUS method). After cutting the variables down, the resulting matrix should no longer be very large.

Author: shiyiming  Time: 2010-3-2 15:48  Subject: Re: How should we cluster data with 10 million columns?

You should first look into the algorithm VARCLUS uses; for a 5,000K-by-1,000K matrix it is practically infeasible. VARCLUS is based on principal factor analysis (PFA), and PFA in turn rests on the SVD, whose time complexity is roughly O(v^3), where v is the number of variables. VARCLUS takes somewhat longer than principal factor analysis itself, so by this reckoning, variable clustering on a matrix with 1,000K columns is essentially infeasible. You might first try running a PCA on a 10K-by-10K matrix in SAS on your own PC and see how long it takes.
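The post suggests trying this in SAS; here is a rough numpy equivalent of that timing experiment (the sizes are illustrative), which makes the O(v^3) growth visible by doubling v:

```python
import time
import numpy as np

# Time the eigendecomposition at the core of PCA for growing v and watch
# the roughly cubic scaling: each doubling of v costs about 8x the time.
for v in (1000, 2000, 4000):
    X = np.random.rand(2 * v, v)
    cov = X.T @ X                    # v-by-v covariance-like matrix
    t0 = time.perf_counter()
    np.linalg.eigh(cov)              # the dominant O(v^3) step of PCA
    print(f"v={v}: {time.perf_counter() - t0:.2f}s")

# Extrapolating the cubic trend to v = 1,000K columns puts the cost many
# orders of magnitude beyond a desktop machine, matching the post above.
```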