How would we cluster data with 10 million columns?

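shiyiming · posted on 2010-2-8 23:12:35

SAS 9.1 and later support more than 32K variables, but the count is still limited by pagesize and by the operating system, so it cannot grow arbitrarily. Data at the scale you describe cannot be handled on a PC, at least not under 32-bit Windows. For data of 10 million rows by 5 million columns, I think you should use a distributed cluster and write your own code rather than use SAS. Google's own PageRank and other data mining run on matrices with hundreds of millions of columns and tens of billions of rows, all with proprietary code written in-house.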

Why do you call what I said armchair theorizing? What is your basis?

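Also, if you need to work with the angle between vectors, just compute that angle yourself. But with vectors this high-dimensional, more than 5 million features, I'm afraid the angles will not be distinguishable.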
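
To make the point about angles concrete, here is a minimal Python sketch (standard library only). It computes the cosine of the angle between two vectors and then measures how widely the angles vary across random pairs as the dimension grows. The function names are hypothetical, and the i.i.d. Gaussian vectors are an assumption chosen for illustration; real data with strong structure may separate better than this suggests.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle_spread(dim, pairs=200, seed=0):
    """Smallest and largest angle (in degrees) over random vector pairs.

    Assumption: entries are i.i.d. standard Gaussian, which is only a
    stand-in for real feature vectors.
    """
    rng = random.Random(seed)
    angles = []
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Clamp against rounding so acos never sees a value outside [-1, 1].
        c = max(-1.0, min(1.0, cosine_similarity(u, v)))
        angles.append(math.degrees(math.acos(c)))
    return min(angles), max(angles)

if __name__ == "__main__":
    for dim in (3, 100, 2000):
        lo, hi = angle_spread(dim)
        print(f"dim={dim:5d}  angles range from {lo:6.1f} to {hi:6.1f} degrees")
```

Under this assumption the range of angles narrows toward 90 degrees as the dimension increases, which is the effect the post warns about.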

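shiyiming · posted on 2010-2-9 10:06:22

oloolo, many thanks! Of all the replies, only your last one matches what I meant. Indeed, I also suspected that SAS cannot handle data of such high dimension. Your mention of Google's PageRank matches my own estimate exactly. Would you be willing to share what you know about Google's PageRank? Would you be willing to leave your QQ number? Thanks again!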

shiyiming · posted on 2010-2-11 01:22:53

I don't work at Google, so I don't know many of the details.

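shiyiming · posted on 2010-2-11 18:45:46

Just passing through with a question: in real business work, is there ever a need to cluster 10 million variables directly? Even if it is technically supported, how would the results be interpreted and applied?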

shiyiming · posted on 2010-2-17 02:18:56

to em_2002

In business applications, cases that call for clustering 10000K variables are rare; there you usually deal with hundreds, or maybe several thousand, demographic/transactional variables.

But think of microarray analysis, which may involve tens of thousands of arrays, or think of face recognition. In most cases, each face image is a rectangular matrix of several hundred rows and columns or more, and their product is easily of the magnitude of 100K+. For example, suppose there are N people, each with a face image under X different expressions, and you want to cluster these N*X images for different purposes, say by person or by expression. The matrix representation of the tensor is very large. Of course, you can work with a series of thin-SVD-reduced pseudo images. But in any case you will have to work on a very large matrix with many columns.

On the other hand, when you use a kernel method, which involves inner products within your original matrix, you get a square matrix whose dimension is in the tens of thousands when the number of original observations is of that magnitude.
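
As a small illustration of why a kernel method produces a square matrix sized by the number of observations, the Python sketch below builds a Gram matrix from a toy data set. The function names and the toy data are hypothetical, and the plain inner product is used as the kernel only as an example.

```python
def linear_kernel(x, y):
    """Plain inner product, used here as the simplest possible kernel."""
    return sum(a * b for a, b in zip(x, y))

def gram_matrix(rows, kernel=linear_kernel):
    """n-by-n matrix of kernel values between every pair of observations."""
    n = len(rows)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = kernel(rows[i], rows[j])
            gram[i][j] = value
            gram[j][i] = value  # a kernel matrix is symmetric
    return gram

if __name__ == "__main__":
    # Toy data: 4 observations with 6 features each.
    data = [
        [1, 0, 2, 0, 0, 1],
        [0, 1, 0, 3, 0, 0],
        [1, 1, 0, 0, 2, 0],
        [0, 0, 1, 0, 0, 4],
    ]
    g = gram_matrix(data)
    print(len(g), "x", len(g[0]))  # size depends on observations, not features
```

The result has one row and one column per observation regardless of how many features there are, so tens of thousands of observations give a matrix with hundreds of millions of entries.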

As for interpretation, that is case by case, depending on the applications.

Well, that is my $0.02. If there are any mistakes, please don't hesitate to point them out.

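shiyiming · posted on 2010-2-20 12:59:28

Sorry, I misread. What oloolo said, that only 30,000 columns can be loaded, is not correct.

In the end, do you want to cluster the variables (5 million columns) or the observations (10 million rows)? For the variables, the transpose that oloolo suggested is a good approach.

Take clustering the observations as an example: there are 10 million observations to cluster.
For the cosine of the angle, you can use the cosine formula to turn the data into a matrix of pairwise cosine values between vectors (10 million by 10 million), and that kind of data can be used for clustering. It's just that with data this large, I don't know how long it would take.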
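
A back-of-the-envelope calculation shows how large that pairwise matrix would be. The Python sketch below assumes 8-byte floating-point values (an assumption, not something stated in the thread); the function name is hypothetical.

```python
def pairwise_matrix_bytes(n_obs, bytes_per_value=8, symmetric=True):
    """Bytes needed to store pairwise similarities between n_obs observations.

    With symmetric=True only the strict upper triangle is counted,
    since cos(i, j) == cos(j, i) and the diagonal is always 1.
    """
    if symmetric:
        entries = n_obs * (n_obs - 1) // 2
    else:
        entries = n_obs * n_obs
    return entries * bytes_per_value

if __name__ == "__main__":
    n = 10_000_000  # 10 million observations
    full = pairwise_matrix_bytes(n, symmetric=False)
    half = pairwise_matrix_bytes(n)
    print(f"full matrix:    {full / 1e12:.0f} TB")
    print(f"upper triangle: {half / 1e12:.0f} TB")
```

Under these assumptions the full matrix needs 8 x 10^14 bytes, about 800 TB, and exploiting symmetry still leaves about 400 TB, before any clustering has been done.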

shiyiming · posted on 2010-2-27 02:42:11

to lqq316

Data like a 50000K by 10000K matrix is an extra-large database and needs special methods to handle.
While storage is not a big issue nowadays, in such a case even using the covariance matrix won't help the computation, because the resulting matrix will still be 1000K-by-1000K, and the computational hurdle will be prohibitive on most affordable machines (<$500K cost).

One workaround would be to use randomized matrix approximation. For example, select a portion of the columns with given probabilities and conduct the matrix factorization on the covariance matrix of the selected columns, which will be very small if the number of columns selected is of manageable magnitude. Obtain the linear transformation matrix from the matrix factorization and project the full data into this new space. In some cases, you can also keep only a portion of the rows, with probabilities of your choosing. There are also algorithms that do element-wise selection. Each selection scheme is different from the rest and possesses its own unique properties and application contexts.
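
The column-sampling idea can be sketched in a few dozen lines of Python (standard library only). This is one possible reading of the steps described, not the poster's actual code: sampling probabilities proportional to squared column norm are an assumption (the post only says "given probability"), power iteration with deflation stands in for a proper SVD routine, and all function names are hypothetical.

```python
import math
import random

def sample_columns(a, c, rng):
    """Draw c columns with probability proportional to squared column norm."""
    n, p = len(a), len(a[0])
    col_sq = [sum(a[i][j] ** 2 for i in range(n)) for j in range(p)]
    total = sum(col_sq)
    probs = [s / total for s in col_sq]
    picks = rng.choices(range(p), weights=probs, k=c)
    # Rescale each picked column so that C C^T is an unbiased estimate of A A^T.
    return [[a[i][j] / math.sqrt(c * probs[j]) for j in picks] for i in range(n)]

def top_eigenpairs(s, k, iters=300, seed=0):
    """Top-k eigenpairs of a small symmetric PSD matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    m = len(s)
    s = [row[:] for row in s]  # work on a copy; deflation modifies it
    pairs = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(s[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        sv = [sum(s[i][j] * v[j] for j in range(m)) for i in range(m)]
        lam = sum(v[i] * sv[i] for i in range(m))
        pairs.append((lam, v))
        for i in range(m):  # remove the component just found
            for j in range(m):
                s[i][j] -= lam * v[i] * v[j]
    return pairs

def randomized_projection(a, c, k, seed=0):
    """Project all columns of a (n x p) onto k directions learned from c sampled columns."""
    rng = random.Random(seed)
    n, p = len(a), len(a[0])
    cmat = sample_columns(a, c, rng)
    # Small c x c cross-product matrix of the sampled columns.
    small = [[sum(cmat[i][r] * cmat[i][t] for i in range(n)) for t in range(c)]
             for r in range(c)]
    pairs = top_eigenpairs(small, k)
    lam_max = pairs[0][0]
    basis = []
    for lam, v in pairs:
        if lam <= 1e-9 * lam_max:
            continue  # skip numerically empty directions
        sigma = math.sqrt(lam)
        basis.append([sum(cmat[i][r] * v[r] for r in range(c)) / sigma
                      for i in range(n)])
    # Result is (at most) k x p: every original column gets k coordinates.
    return [[sum(u[i] * a[i][j] for i in range(n)) for j in range(p)] for u in basis]
```

The expensive factorization only ever touches the c-by-c matrix, while the full data is visited once for the norms and once for the final projection. The reduced k-by-p output could then be handed to an ordinary clustering routine.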

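shiyiming · posted on 2010-3-2 14:18:19

Hmm, before clustering, depending on the purpose of your grouping, you could first do some variable selection (for example with the varclus method). Once the variables have been reduced, the resulting matrix should no longer be very large.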

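shiyiming · posted on 2010-3-2 15:48:50

You should first check the algorithm VARCLUS uses; for a matrix of 5 million by 1 million columns, it is practically infeasible.
VARCLUS is based on principal factor analysis (PFA). PFA in turn is based on SVD, whose time complexity is roughly O(v^3), where v is the number of variables. VARCLUS takes somewhat longer than principal factor analysis, so by this reckoning variable clustering on a matrix with 1 million columns is basically infeasible. You might first run a PCA in SAS on a 10,000-by-10,000 matrix on your own PC and see how long it takes.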
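
The O(v^3) argument can be turned into a rough extrapolation. The Python sketch below scales a measured run time by the cube of the ratio of variable counts; the 60-second figure is a purely hypothetical measurement for the 10,000-variable test, not a benchmark, and the function name is made up for illustration.

```python
def cubic_scaling_estimate(measured_seconds, measured_v, target_v):
    """Extrapolate run time assuming cost grows as v**3 (v = number of variables)."""
    return measured_seconds * (target_v / measured_v) ** 3

if __name__ == "__main__":
    # Hypothetical: suppose the 10,000-variable PCA took 60 seconds.
    seconds = cubic_scaling_estimate(60.0, 10_000, 1_000_000)
    print(f"estimated time for 1,000,000 variables: {seconds / 86_400:.0f} days")
```

Going from 10,000 to 1,000,000 variables multiplies the cost by 100^3 = 10^6, so even a one-minute test run would extrapolate to almost two years, ignoring memory limits entirely.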

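Also, many analyses cannot be carried out by using variable clustering to extract a subset of "important" variables. Examples include time-series gene test samples; credit card transaction data over a long period; electricity consumption and forecasts at 15-minute intervals over a year for millions of residents [this is an important component of the smart grid]; and the coordinates of customers walking through a department store, together with their dwell times and corresponding spending [the data volume for female customers is especially large, haha]. The focus of these analyses is the structure of the whole matrix rather than a few "significant" variables, so as much information as possible should be retained. Although RSVD only randomly samples part of the variables for the SVD, the complete matrix is still projected in the end into the orthogonal space computed from the small sample. In this respect it resembles a probabilistic interpolation algorithm, and the original information is preserved as far as possible.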

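Oblique factor analysis is still quite useful in some economic analyses, especially because it is highly interpretable, but compared with modern data mining methods I personally feel it is not of much use.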

Damn, I really have too much free time, still posting at 2 a.m. Moderators, how about a reward?

shiyiming · posted on 2010-3-3 09:13:23

I already have a solution. Many thanks to everyone above; your replies gave me a lot of inspiration.
