NLP and Data Mining Fundamentals
Basics:
- SSE (Sum of Squared Errors)
- SAE (Sum of Absolute Errors)
- SRE (Sum of Relative Errors)
- MSE (Mean Squared Error)
- RMSE (Root Mean Squared Error)
- RRSE (Root Relative Squared Error)
- MAE (Mean Absolute Error)
- RAE (Relative Absolute Error)
- MRSE (Mean Relative Squared Error)
- Expectation & Variance
- Standard Deviation (square root of the variance; coincides with the RMSE only when deviations are measured from the mean)
- CP (Conditional Probability)
- JP (Joint Probability)
- MP (Marginal Probability)
- Bayes' Formula
- CC (Correlation Coefficient)
- Quantile
- Covariance / Covariance Matrix
- GD (Gradient Descent)
- SGD (Stochastic Gradient Descent)
- LMS (Least Mean Squares)
- LSM (Least Squares Method)
- NE (Normal Equation)
- MLE (Maximum Likelihood Estimation)
- QP (Quadratic Programming)
- L1/L2 Regularization (plus other variants, such as the L2.5 regularization that has recently drawn attention)
- Eigenvalue
- Eigenvector
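As a rough illustration of the first few error measures in the list above, the sketch below computes SSE, SAE, MSE, RMSE, and MAE from paired observations and predictions. The function name `error_metrics` and the sample values are assumptions made for this example, not part of the original list.

```python
import math


def error_metrics(y_true, y_pred):
    """Return SSE, SAE, MSE, RMSE and MAE for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    sse = sum(r * r for r in residuals)    # sum of squared errors
    sae = sum(abs(r) for r in residuals)   # sum of absolute errors
    mse = sse / n                          # mean squared error
    return {
        "SSE": sse,
        "SAE": sae,
        "MSE": mse,
        "RMSE": math.sqrt(mse),            # root mean squared error
        "MAE": sae / n,                    # mean absolute error
    }


# Hypothetical data: residuals are 1, 0, -2, -1.
metrics = error_metrics([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 8.0])
print(metrics)
```

The relative variants (SRE, RAE, RRSE) divide by a baseline error, and conventions for that baseline differ between sources, so they are left out of this sketch.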
Common Distributions:
Discrete Distributions:
- Bernoulli Distribution / Binomial Distribution
- Negative Binomial Distribution
- Multinomial Distribution
- Geometric Distribution
- Hypergeometric Distribution
- Poisson Distribution
Continuous Distributions:
- Uniform Distribution
- Normal Distribution / Gaussian Distribution
- Exponential Distribution
- Lognormal Distribution
- Gamma Distribution
- Beta Distribution
- Dirichlet Distribution
- Rayleigh Distribution
- Cauchy Distribution
- Weibull Distribution
The Three Major Sampling Distributions:
- Chi-square Distribution
- t-distribution
- F-distribution
Data Pre-processing:
- Missing Value Imputation
- Discretization
- Mapping
- Normalization / Standardization
Sampling:
- Simple Random Sampling
- Offline Sampling (offline equal-probability sampling of K items)
- Online Sampling (online equal-probability sampling of K items)
- Ratio-based Sampling (proportional random sampling)
- Acceptance-rejection Sampling
- Importance Sampling
- MCMC (Markov Chain Monte Carlo sampling: Metropolis-Hastings & Gibbs)
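One common way to draw K items with equal probability from a stream of unknown length is reservoir sampling. The sketch below assumes that this is what the online sampling entry above refers to; the list itself does not name an algorithm, and `reservoir_sample` is a hypothetical helper name.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep k items from an iterable so that every item is equally likely to be kept."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Seeded generator so that repeated runs pick the same items.
sample = reservoir_sample(range(1000), 5, random.Random(0))
print(sample)
```

The stream is read once and only K items are held in memory, which is why this approach suits the online case; when the whole data set is already available, `random.sample` from the standard library does the same job offline.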
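Clustering:
- K-Means, K-Medoids
- Bisecting K-Means
- FK-Means
- Canopy
- Spectral-KMeans (spectral clustering)
- GMM-EM (Gaussian Mixture Model fitted with the Expectation-Maximization algorithm)
- K-Prototypes
- CLARANS (partition-based)
- BIRCH (hierarchical)
- CURE (hierarchical)
- STING (grid-based)
- CLIQUE (density-based and grid-based)
- The density-based clustering algorithm published in Science in 2014, among others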
Clustering Effectiveness Evaluation:
- Purity
- RI (Rand Index)
- ARI (Adjusted Rand Index)
- NMI (Normalized Mutual Information)
- F-measure
Classification & Regression:
- LR (Linear Regression)
- LR (Logistic Regression)
- SR (Softmax Regression, i.e. multiclass logistic regression)
- GLM (Generalized Linear Model)
- RR (Ridge Regression, L2-regularized least squares regression), LASSO (Least Absolute Shrinkage and Selection Operator, L1-regularized least squares regression)
- DT (Decision Tree)
- RF (Random Forest)
- GBDT (Gradient Boosting Decision Tree)
- CART (Classification And Regression Tree)
- KNN (K-Nearest Neighbors)
- SVM (Support Vector Machine, including SVC for classification and SVR for regression)
- CBA (Classification Based on Association Rules)
- KF (Kernel Function):
  - Polynomial Kernel Function
  - Gaussian Kernel Function
  - Radial Basis Function (RBF)
  - String Kernel Function
- NB (Naive Bayes)
- BN (Bayesian Network / Bayesian Belief Network / Belief Network)
- LDA (Linear Discriminant Analysis / Fisher Linear Discriminant)
- EL (Ensemble Learning):
  - Boosting
  - Bagging
  - Stacking
- AdaBoost (Adaptive Boosting)
- MEM (Maximum Entropy Model)
Classification Effectiveness Evaluation:
- Confusion Matrix
- Precision
- Recall
- Accuracy
- F-score
- ROC Curve
- AUC (Area Under the ROC Curve)
- Lift Curve
- KS Curve
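Several of the measures above follow directly from the four cells of a binary confusion matrix. The sketch below is a minimal illustration under assumed conditions: binary labels, a hypothetical helper named `binary_metrics`, and made-up label vectors.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Derive precision, recall, accuracy and F1 from a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1   # true positive
        elif p == positive:
            fp += 1   # false positive
        elif t == positive:
            fn += 1   # false negative
        else:
            tn += 1   # true negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "recall": recall,
            "accuracy": accuracy, "F1": f1}


# Hypothetical labels: 3 true positives, 1 false positive,
# 1 false negative and 3 true negatives.
scores = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0])
print(scores)
```

Returning 0.0 when a denominator is zero is a choice made for this sketch; some libraries raise a warning or return an undefined value instead.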
PGM (Probabilistic Graphical Models):
- BN (Bayesian Network / Bayesian Belief Network / Belief Network)
- MC (Markov Chain)
- MEM (Maximum Entropy Model)
- HMM (Hidden Markov Model)
- MEMM (Maximum Entropy Markov Model)
- CRF (Conditional Random Field)
- MRF (Markov Random Field)
- Viterbi (Viterbi algorithm)
NN (Neural Network):
- ANN (Artificial Neural Network)
- SNN (Static Neural Network)
- BP (Error Back Propagation)
- HN (Hopfield Network)
- DNN (Dynamic Neural Network)
- RNN (Recurrent Neural Network)
- SRN (Simple Recurrent Network)
- ESN (Echo State Network)
- LSTM (Long Short-Term Memory)
- CW-RNN (Clockwork Recurrent Neural Network, ICML 2014), among others
Deep Learning:
- Auto-encoder
- SAE (Stacked Auto-encoders):
  - Sparse Auto-encoders
  - Denoising Auto-encoders
  - Contractive Auto-encoders
- RBM (Restricted Boltzmann Machine)
- DBN (Deep Belief Network)
- CNN (Convolutional Neural Network)
- Word2Vec (word embedding learning model)
Dimensionality Reduction:
- LDA (Linear Discriminant Analysis / Fisher Linear Discriminant)
- PCA (Principal Component Analysis)
- ICA (Independent Component Analysis)
- SVD (Singular Value Decomposition)
- FA (Factor Analysis)
Text Mining:
- VSM (Vector Space Model)
- Word2Vec (word embedding learning model)
- TF (Term Frequency)
- TF-IDF (Term Frequency-Inverse Document Frequency)
- MI (Mutual Information)
- ECE (Expected Cross Entropy)
- QEMI (quadratic information entropy)
- IG (Information Gain)
- IGR (Information Gain Ratio)
- Gini (Gini Index)
- χ² Statistic (chi-squared statistic)
- TEW (Text Evidence Weight)
- OR (Odds Ratio)
- N-Gram Model
- LSA (Latent Semantic Analysis)
- PLSA (Probabilistic Latent Semantic Analysis)
- LDA (Latent Dirichlet Allocation)
- SLM (Statistical Language Model)
- NPLM (Neural Probabilistic Language Model)
- CBOW (Continuous Bag-of-Words Model)
- Skip-gram (Skip-gram Model)
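To make the TF and TF-IDF entries above concrete, the sketch below weights each term by its relative frequency in a document times the logarithm of the inverse document frequency. This is the plain unsmoothed variant, chosen here as an assumption; libraries often add smoothing or normalization. The function name `tf_idf` and the toy corpus are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized.
corpus = [["data", "mining", "data"], ["text", "mining"], ["neural", "network"]]
weights = tf_idf(corpus)
print(weights[0])
```

In this variant a term that occurs in every document gets weight zero, which is the intended effect: such a term does not help tell documents apart.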
Association Mining:
- Apriori
- FP-growth (Frequent Pattern Growth, which mines a frequent-pattern tree)
- MSApriori (Multiple Support-based Apriori)
- gSpan (Graph-based Substructure Pattern Mining, frequent subgraph mining)
Sequential Pattern Analysis:
- AprioriAll
- SPADE
- GSP (Generalized Sequential Patterns)
- PrefixSpan
Forecasting:
- LR (Linear Regression)
- SVR (Support Vector Regression)
- ARIMA (Autoregressive Integrated Moving Average Model)
- GM (Grey Model)
- BPNN (Back-Propagation Neural Network)
- SRN (Simple Recurrent Network)
- LSTM (Long Short-Term Memory)
- CW-RNN (Clockwork Recurrent Neural Network)
- ……
Link Analysis:
- HITS (Hyperlink-Induced Topic Search)
- PageRank
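PageRank can be approximated by power iteration over the link graph. The sketch below is a simplified version under stated assumptions: the graph is a small dictionary of outgoing links, the damping factor is the commonly quoted 0.85, and pages without outgoing links spread their rank evenly over all pages. The function name and the example graph are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links maps each node to the list of nodes it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            for target in targets:
                # Each node passes an equal share of its rank to every target.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Hypothetical three-page graph.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

A fixed iteration count keeps the sketch short; a practical implementation would more likely stop once the change between iterations falls below a tolerance.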
Recommendation Engine:
- SVD
- Slope One
- DBR (Demographic-based Recommendation)
- CBR (Content-based Recommendation)
- CF (Collaborative Filtering)
- UCF (User-based Collaborative Filtering Recommendation)
- ICF (Item-based Collaborative Filtering Recommendation)
Similarity & Distance Measures:
- Euclidean Distance
- Chebyshev Distance
- Minkowski Distance
- Standardized Euclidean Distance
- Mahalanobis Distance
- Cos (Cosine Similarity)
- Hamming Distance / Edit Distance
- Jaccard Distance
- Correlation Coefficient Distance
- Information Entropy
- KL (Kullback-Leibler Divergence, also called Relative Entropy)
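A few of the measures above fit in a line or two each. The sketch below is a plain-Python illustration with assumed function names and inputs; it leaves out the measures that need extra statistics, such as the Mahalanobis distance and its covariance matrix.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    return minkowski(a, b, 2)


def chebyshev(a, b):
    """Largest coordinate-wise difference (the limit of Minkowski as p grows)."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a, b):
    """One minus the ratio of shared elements to all elements of two sets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


print(euclidean([0, 0], [3, 4]), chebyshev([0, 0], [3, 4]))
print(hamming("karolin", "kathrin"), jaccard_distance({1, 2, 3}, {2, 3, 4}))
```

Note that `cosine_similarity` is a similarity rather than a distance; one minus its value is often used where a distance is needed.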
Optimization:
Unconstrained Optimization:
- Cyclic Variable Methods
- Variable Simplex Methods
- Newton Methods
- Quasi-Newton Methods
- Conjugate Gradient Methods
Constrained Optimization:
- Approximation Programming Methods
- Penalty Function Methods
- Multiplier Methods
- Heuristic Algorithm
- SA (Simulated Annealing)
- GA (Genetic Algorithm)
- ACO (Ant Colony Optimization)
Feature Selection:
- Mutual Information
- Document Frequency
- Information Gain
- Chi-squared Test
- Gini (Gini Index)
Outlier Detection:
- Statistics-based
- Density-based
- Clustering-based
Learning to Rank:
- Pointwise:
  - McRank
- Pairwise:
  - RankingSVM
  - RankNet
  - FRank
  - RankBoost
- Listwise:
  - AdaRank
  - SoftRank
  - LambdaMART
Tools:
- MPI
- Hadoop ecosystem
- Spark
- igraph
- BSP
- Weka
- Mahout
- Scikit-learn
- PyBrain
- Theano
Reposted from: https://www.cnblogs.com/baiboy/p/dm1.html