BAT面试官最喜欢问的问题之一:算法Kmeans优化算法有?

人工智能
KMeans算法的主要缺点有:
1)需要人工预先确定初始K值,且该值和真实的数据不一定能够吻合。
2)K均值只能收敛到局部最优,效果受到初始值的影响很大。
3)容易受到噪声的影响。
4)样本只能被划分到单一的类簇中。
Kmeans算法改进模型主要有Kmeans++和ISODATA算法
Kmeans++的主要是对K的选取进行优化,
假设已经选取了n个初始聚类中心,则在选取第n+1个聚类中心时,距离当前n个聚类中心越远的点会有更高的概率会被选为第n+1个聚类中心。在选取第一个聚类中心时同样通过随机的方法。当选择完初始点后,Kmeans++后续的计算都和经典的Kmeans算法相同,这也是对初始值选择进行改进的方法的共同点。
当K值的大小不确定时,可以使用ISODATA算法。ISODATA算法的全称是迭代自组织数据分析法。在Kmeans算法中,聚类个数K的值需要预先人为的确定,并在整个过程中无法更改。而当遇到高纬度、海量的数据集时,人们往往很难准确的估计出K的大小,ISODATA算法针对这个问题进行了改进,它的思想也很直观,当属于某个类别的样本数过少时,就把该类别踢掉;当属于某个类别的样本数过多、分散程度较大时,就把该类分成两个子类。ISODATA算法在Kmeans算法的基础上增加了两个操作,一个是分裂操作,对应着增加聚类中心数,二是合并操作,对应着减少聚类中心数。ISODATA虽然对Kmeans算法进行了优化,但它也有缺点,就是需要确定以下这些参数:
a 预期的聚类中心数K,在ISODATA运行过程中聚类中心可以变化,K是一个用户制定的参考值,该算法在聚类中心数目变动范围也由其决定。一般情况下,最终输出的聚类中心数据常见范围是从K的一半到两倍K。
b 每个类所要求的最少样本数目N,如果分裂后悔导致某个子类别所包含样本数目小于阈值,就不会对该类别进行分裂操作。
c 最大方差S,用于控制某个类别中样本的分散程度,当样本的分散程度超过这个阈值,分裂后满足a,进行分裂操作。
d 两个聚类中心之间所允许最小距离D,如果两个类靠的非常近,小于该阈值时,则对两个类进行合并操作。
{"weixin":{"label":"微信","name":"weixin","selected":true,"value":true,"sortid":"1","shareid":"weixin","sharetitle":"分享到微信","event":"shareToWeiXin","lang":"shareWeb_WeiXin"},"copy":{"label":"复制网址","name":"copy","selected":true,"value":true,"sortid":"2","shareid":"copy","sharetitle":"复制网址","event":"copy_url","lang":"shareWeb_Copy"},"qq":{"label":"QQ好友","name":"qq","selected":true,"value":false,"sortid":"1","shareid":"qq","sharetitle":"分享到QQ","event":"shareToQQ","lang":"shareWeb_QQ"},"sina_weibo":{"label":"新浪微博","name":"sina_weibo","selected":true,"value":true,"sortid":"4","shareid":"sina_weibo","sharetitle":"分享到新浪微博","event":"shareToSinaWB","lang":"shareWeb_SinaWeiBo"},"qq_zone":{"label":"QQ空间","name":"qq_zone","selected":true,"value":true,"sortid":"5","shareid":"qq_zone","sharetitle":"分享到QQ空间","event":"shareToQzone","lang":"shareWeb_QQZone"},"renren":{"label":"人人网","name":"renren","selected":true,"value":true,"sortid":"7","shareid":"renren","sharetitle":"分享到人人网","event":"shareToRenren","lang":"shareWeb_RenRen"},"douban":{"label":"豆瓣网","name":"douban","selected":true,"value":true,"sortid":"8","shareid":"douban","sharetitle":"分享到豆瓣网","event":"shareToDouban","lang":"shareWeb_DouBan"},"baidu_tieba":{"label":"百度贴吧","name":"baidu_tieba","selected":true,"value":true,"sortid":"10","shareid":"baidu_tieba","sharetitle":"分享到百度贴吧","event":"shareToTieba","lang":"shareWeb_TieBa"},"Facebook":{"label":"Facebook","name":"Facebook","selected":true,"value":true,"sortid":"11","shareid":"Facebook","sharetitle":"分享到FaceBook","event":"shareToFacebook","lang":"shareWeb_Facebook"},"Twitter":{"label":"Twitter","name":"Twitter","selected":true,"value":true,"sortid":"12","shareid":"Twitter","sharetitle":"分享到Twitter","event":"shareToTwitter","lang":"shareWeb_Twitter"},"LinkedIn":{"label":"LinkedIn","name":"LinkedIn","selected":true,"value":true,"sortid":"13","shareid":"LinkedIn","sharetitle":"分享到linkedIn","event":"shareToLinkedin","lang":"shareWeb_Linkedin"},"whatsapp":{"label":"whatsapp","name":"whatsapp","selected":true,"value":true,"sortid":"15","shareid":"whatsapp","sharetitle":"分享到whatsapp","event":"shareToWhatsapp","lang":"shareWeb_whatsapp"},"line":{"label":"line","name":"line","selected":true,"value":true,"sortid":"15","shareid":"line","sharetitle":"分享到line","event":"shareToLine","lang":"shareWeb_line"},"qq_weibo":{"label":"腾讯微博","name":"qq_weibo","selected":true,"value":true,"sortid":"3","shareid":"qq_weibo","sharetitle":"分享到腾讯微博","event":"shareToQQwb","lang":"shareWeb_QQWeiBo"},"peopleBlog":{"label":"人民微博","name":"propleBlog","selected":true,"value":true,"sortid":"14","shareid":"propleBlog","sharetitle":"分享到人民微博","event":"shareToPeopleBlog","lang":"shareWeb_peopleBlog"}}