ML之NB: Text classification on the news text dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), the perceptron, and other algorithms
Contents

Text classification on the news text dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), the perceptron, and other algorithms
Design approach
Output
Core code
相關(guān)文章
ML之NB:基于news新聞文本數(shù)據(jù)集利用純統(tǒng)計法、kNN、樸素貝葉斯(高斯/多元伯努利/多項式)、線性判別分析LDA、感知器等算法實現(xiàn)文本分類預(yù)測
ML之NB:基于news新聞文本數(shù)據(jù)集利用純統(tǒng)計法、kNN、樸素貝葉斯(高斯/多元伯努利/多項式)、線性判別分析LDA、感知器等算法實現(xiàn)文本分類預(yù)測實現(xiàn)
Text classification on the news text dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), the perceptron, and other algorithms
設(shè)計思路
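As a rough, non-authoritative sketch of the preprocessing pipeline implied by the output below — read the news CSV with pandas, segment each article with jieba, then build a gensim Dictionary, a bag-of-words corpus, and a TF-IDF model — something like the following could be used. The column names content, doc_words, corpus, and tfidf match the printed DataFrames; the CSV file name and the Chinese-token filter are assumptions for illustration:

import re
import jieba
import pandas as pd
from gensim import corpora, models

# Load the raw news data (file name assumed for illustration).
data_frame = pd.read_csv('news_data.csv')
print(data_frame.info())

# Keep only Chinese tokens, mirroring the chinese_pattern shown in the output.
chinese_pattern = re.compile('[\u4e00-\u9fff]+')

def cut_words(text):
    """Segment one document with jieba and drop non-Chinese tokens."""
    return [w for w in jieba.cut(str(text)) if chinese_pattern.match(w)]

data_frame['doc_words'] = data_frame['content'].apply(cut_words)

# gensim dictionary (46351 unique tokens in the log), bag-of-words, TF-IDF.
dictionary = corpora.Dictionary(data_frame['doc_words'])
data_frame['corpus'] = data_frame['doc_words'].apply(dictionary.doc2bow)
tfidf_model = models.TfidfModel(list(data_frame['corpus']))
data_frame['tfidf'] = data_frame['corpus'].apply(lambda bow: tfidf_model[bow])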
輸出結(jié)果
Dataset used in the code: https://download.csdn.net/download/qq_41185868/13757777
F:\Program Files\Python\Python36\lib\site-packages\gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1293 entries, 0 to 1292
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1293 non-null   int64
 1   content     1292 non-null   object
 2   id          1293 non-null   int64
 3   tags        1293 non-null   object
 4   time        1293 non-null   object
 5   title       1293 non-null   object
dtypes: int64(2), object(4)
memory usage: 60.7+ KB
None
   Unnamed: 0                                            content  \
0           0  牽動人心的雄安新區規劃細節內容和出臺時間表敲定。日前,北京商報記者從業內獲悉,京津冀協同發...
1           1  去年以來,多個城市先后發布了多項樓市調控政策。在限購、限貸甚至限售的政策“組合拳”下,房地產...
2           2  在今年中國國際自行車展上,上海鳳凰自行車總裁王朝陽表示,共享單車的到來把我們打懵了,影響更是...
3           3  25家上市銀行迎來了一年一度的“分紅季”,21世紀經濟報道記者根據公開信息梳理發現,25家銀...
4           4  說起卷餅,大家其實并不陌生,這個來自中原的傳統美食,發展至今也衍生出各種各樣的種類,卷邊的制...
                    id                                 tags  \
0  6428905748545732865  ['財經', '白洋淀', '城市規劃', '徐匡迪', '太行山']
1  6428954136200855810  ['財經', '碧桂園', '萬科集團', '投資', '廣州恒大']
2  6420576443738784002   ['財經', '自行車', '鳳凰', '王朝陽', '汽車展覽']
3  6429007290541031681  ['財經', '銀行', '工商銀行', '興業銀行', '交通銀行']
4  6397481672254619905    ['財經', '小吃', '裝修', '市場營銷', '手工藝']
                  time                    title
0  2017-06-07 22:52:55  雄安新區規劃“骨架”敲定,方案有望9月底出爐
1  2017-06-08 08:01:13      “紅五月”不紅 房企資金鏈壓力攀升
2  2017-05-16 12:03:00     鳳凰自行車總裁:共享單車把我們打懵了
3  2017-06-08 07:00:00    25家銀行分紅季派出3536億“大紅包”
4  2017-03-15 07:03:22    五萬以下的小本餐飲項目,卷餅賺錢最穩
chinese_pattern re.compile('[\\u4e00-\\u9fff]+')
Building prefix dict from F:\File_Jupyter\實用代碼\naive_bayes(簡單貝葉斯)\jieba_dict\dict.txt.big ...
Loading model from cache C:\Users\niu\AppData\Local\Temp\jieba.ue3752d4e13420d2dc6b66831a5a4ab13.cache
Loading model cost 1.326 seconds.
Prefix dict has been built succesfully.
dictionary <class 'gensim.corpora.dictionary.Dictionary'> Dictionary(46351 unique tokens: ['一個', '一個個', '一舉一動', '一些', '一體']...)
<class 'method'> <bound method Dictionary.doc2bow of <gensim.corpora.dictionary.Dictionary object at 0x000001BDC62291D0>>
F:\Program Files\Python\Python36\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)
   Unnamed: 0                                            content  \
0           0  牽動人心的雄安新區規劃細節內容和出臺時間表敲定。日前,北京商報記者從業內獲悉,京津冀協同發...
1           1  去年以來,多個城市先后發布了多項樓市調控政策。在限購、限貸甚至限售的政策“組合拳”下,房地產...
2           2  在今年中國國際自行車展上,上海鳳凰自行車總裁王朝陽表示,共享單車的到來把我們打懵了,影響更是...
                    id                                 tags  \
0  6428905748545732865  ['財經', '白洋淀', '城市規劃', '徐匡迪', '太行山']
1  6428954136200855810  ['財經', '碧桂園', '萬科集團', '投資', '廣州恒大']
2  6420576443738784002   ['財經', '自行車', '鳳凰', '王朝陽', '汽車展覽']
                  time                   title  \
0  2017-06-07 22:52:55  雄安新區規劃“骨架”敲定,方案有望9月底出爐
1  2017-06-08 08:01:13      “紅五月”不紅 房企資金鏈壓力攀升
2  2017-05-16 12:03:00     鳳凰自行車總裁:共享單車把我們打懵了
                                           doc_words  \
0  [牽動人心, 雄安, 新區, 規劃, 細節, 內容, 出臺, 時間表, 敲定, 日前, 北京...
1  [去年, 以來, 多個, 城市, 先后, 發布, 多項, 樓市, 調控, 政策, 限購, 限...
2  [今年, 中國, 國際, 自行車, 展上, 上海, 鳳凰, 自行車, 總裁, 王, 朝陽, ...
                                              corpus  \
0  [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2...
1  [(0, 1), (3, 3), (13, 1), (17, 1), (41, 1), (5...
2  [(15, 1), (53, 1), (167, 1), (262, 1), (396, 1...
                                               tfidf
0  [(0, 0.005554342859788116), (1, 0.007470250835...
1 [(0, 0.002081356679198299), (3, 0.012288034179... 2 [(15, 0.057457146244872616), (53, 0.0543395377... after abs 4.7683716e-07 foo: (1293, 1293) dis2TSNE_Visual: (1293, 2) {'養(yǎng)生': 0, '科技': 1, '財經(jīng)': 2, '游戲': 3, '育兒': 4, '汽車': 5} data_frame.keyword_index: 1 379 2 287 5 283 4 148 3 141 0 55 Name: keyword_index, dtype: int64Unnamed: 0 content \ 0 0 牽動人心的雄安新區(qū)規(guī)劃細節(jié)內(nèi)容和出臺時間表敲定。日前,北京商報記者從業(yè)內(nèi)獲悉,京津冀協(xié)同發(fā)... 1 1 去年以來,多個城市先后發(fā)布了多項樓市調(diào)控政策。在限購、限貸甚至限售的政策“組合拳”下,房地產(chǎn)... 2 2 在今年中國國際自行車展上,上海鳳凰自行車總裁王朝陽表示,共享單車的到來把我們打懵了,影響更是... id tags \ 0 6428905748545732865 ['財經(jīng)', '白洋淀', '城市規(guī)劃', '徐匡迪', '太行山'] 1 6428954136200855810 ['財經(jīng)', '碧桂園', '萬科集團', '投資', '廣州恒大'] 2 6420576443738784002 ['財經(jīng)', '自行車', '鳳凰', '王朝陽', '汽車展覽'] time title \ 0 2017-06-07 22:52:55 雄安新區(qū)規(guī)劃“骨架”敲定,方案有望9月底出爐 1 2017-06-08 08:01:13 “紅五月”不紅 房企資金鏈壓力攀升 2 2017-05-16 12:03:00 鳳凰自行車總裁:共享單車把我們打懵了 doc_words \ 0 [牽動人心, 雄安, 新區(qū), 規(guī)劃, 細節(jié), 內(nèi)容, 出臺, 時間表, 敲定, 日前, 北京... 1 [去年, 以來, 多個, 城市, 先后, 發(fā)布, 多項, 樓市, 調(diào)控, 政策, 限購, 限... 2 [今年, 中國, 國際, 自行車, 展上, 上海, 鳳凰, 自行車, 總裁, 王, 朝陽, ... corpus \ 0 [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2... 1 [(0, 1), (3, 3), (13, 1), (17, 1), (41, 1), (5... 2 [(15, 1), (53, 1), (167, 1), (262, 1), (396, 1... tfidf visual01 visual02 \ 0 [(0, 0.005554342859788116), (1, 0.007470250835... -65.903542 -14.433964 1 [(0, 0.002081356679198299), (3, 0.012288034179... -29.659267 -14.811647 2 [(15, 0.057457146244872616), (53, 0.0543395377... -22.118195 -48.148167 keyword_index 0 2 1 2 2 2 Childcare,label_category_ID_pos.tfidf)[:20]: ['孩子', '家長', '教育', '學習', '男孩子', '成績', '爸爸', '分享', '幫助', '方法', '小學', '數(shù)學', '交流', '男孩', '媽媽', '成長', '父母', '懂', '免費', '翼航'] Childcare,label_category_ID_neg.tfidf)[:20]: [] train_index MatrixSimilarity<646 docs, 46329 features> hot_words shape: 6 300 {0: {1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 18009, 7258, 4697, 7260, 16989, 3674, 91, 87, 16993, 18020, 616, 4714, 5228, 40044, 1646, 4720, 3185, 15986, 34928, 5236, 113, 34936, 6777, 126, 15999, 127, 4737, 40067, 5252, 643, 4739, 13444, 8840, 1157, 133, 4749, 3219, 10388, 17562, 5278, 46239, 5287, 3751, 167, 680, 6827, 4784, 16048, 16050, 180, 46260, 16054, 6839, 4792, 2743, 4789, 17083, 16060, 4790, 16062, 43200, 5315, 46276, 46279, 17098, 6860, 5836, 16081, 43219, 1237, 1750, 15575, 8921, 2266, 6877, 12511, 12512, 21216, 226, 4834, 6884, 16101, 4838, 742, 2280, 2281, 227, 7915, 6886, 6893, 2798, 6894, 5870, 4849, 242, 1779, 4852, 21215, 44791, 4864, 3329, 258, 4865, 4866, 44805, 4877, 21264, 4882, 274, 8986, 8987, 796, 32029, 4382, 21277, 4896, 1825, 801, 3363, 36644, 1830, 4393, 36138, 303, 815, 4401, 12594, 21299, 7986, 820, 310, 1337, 21307, 4411, 317, 33598, 5953, 17730, 5954, 10050, 17733, 17734, 25927, 21320, 17739, 4939, 21324, 4942, 33615, 6885, 16210, 6071, 18261, 5976, 860, 16740, 16745, 2922, 4969, 17263, 6512, 33649, 16242, 2419, 17775, 373, 1398, 880, 1916, 17276, 16255, 1920, 43394, 3974, 4999, 396, 8080, 16788, 18325, 1942, 16279, 1433, 43418, 36252, 17311, 43425, 16802, 7585, 15959, 7594, 36268, 4525, 7597, 5551, 6063, 36272, 36275, 4533, 16309, 18358, 36280, 1465, 441, 7611, 16825, 16829, 4538, 2488, 2495, 8129, 4545, 4547, 16836, 4549, 7621, 1484, 1997, 11214, 1999, 16846, 16847, 
4563, 7636, 14293, 7638, 4567, 16855, 17369, 16861, 478, 16351, 18400, 17377, 993, 9699, 5085, 6111, 7645, 6119, 6124, 17903, 1011, 4597, 6646, 16376, 6138, 16891, 16892, 7165, 4606}, 1: {0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 12391, 28267, 12396, 109, 9836, 12399, 11884, 12401, 12400, 12403, 627, 117, 629, 9847, 628, 17020, 637, 9855, 639, 12418, 643, 1668, 133, 3715, 14470, 1160, 12424, 11912, 9867, 33420, 10376, 655, 12433, 148, 150, 3735, 1176, 12440, 154, 21659, 1180, 3742, 10399, 11936, 1185, 31904, 675, 13472, 167, 1704, 7337, 11946, 171, 172, 8876, 8878, 2734, 1200, 1709, 2226, 8877, 180, 1155, 697, 12475, 189, 8894, 1215, 1218, 4291, 708, 709, 3271, 2760, 6354, 2771, 1748, 213, 3798, 727, 730, 20187, 44767, 225, 2786, 2787, 13028, 1765, 1254, 13543, 26344, 740, 11497, 1771, 3819, 13549, 11502, 751, 1775, 752, 242, 21743, 12524, 759, 11511, 2809, 2812, 35581, 257, 8962, 771, 259, 15623, 1288, 3849, 12048, 1810, 786, 788, 3862, 793, 7450, 798, 24862, 7458, 12579, 31524, 31523, 7459, 1322, 810, 25391, 12081, 1329, 820, 3386, 1850, 9023, 319, 835, 9029, 325, 4424, 330, 12107, 13134, 846, 3409, 3924, 1878, 854, 344, 11609, 5978, 1883, 11612, 343, 11615, 358, 4457, 362, 875, 1385, 1900, 4462, 3439, 12144, 369, 3438, 1396, 38773, 28025, 2428, 13305, 13183, 12161, 12674, 1922, 34690, 2438, 1926, 13193, 907, 9100, 911, 13204, 1431, 10135, 2456, 44956, 925, 413, 32670, 1952, 928, 23455, 5540, 1956, 1447, 12200, 1448, 1452, 8109, 12205, 1965, 9651, 2486, 5559, 1464, 956, 1982, 959, 3522, 12235, 976, 3025, 10194, 1491, 12244, 465, 30675, 5585, 472, 470, 10714, 475, 3027, 478, 1503, 479, 5089, 483, 2532, 995, 9190, 5607, 1512, 1513, 9703, 10728, 494, 1518, 1520, 2545, 1007, 1524, 501, 503, 1017, 1534}, 2: {0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 3146, 1100, 26701, 1614, 1102, 592, 3577, 35410, 2639, 2644, 3159, 25688, 1626, 91, 3162, 1119, 608, 21089, 1634, 102, 2662, 31848, 2665, 11881, 27242, 12907, 1131, 1132, 15388, 2672, 3185, 1138, 627, 43124, 2675, 113, 1657, 2682, 3194, 127, 3715, 1668, 133, 3717, 135, 2696, 3209, 1162, 1158, 1676, 2701, 11916, 1167, 138, 1169, 148, 2710, 1174, 152, 1177, 22167, 26779, 21659, 157, 158, 1183, 30880, 1185, 26784, 2209, 2724, 3232, 672, 167, 4256, 8876, 685, 4269, 1202, 2226, 691, 1205, 3253, 1207, 2231, 2242, 4291, 14026, 27340, 1740, 1231, 14032, 24273, 3284, 1749, 213, 727, 217, 730, 2266, 14044, 1246, 1248, 225, 1254, 742, 745, 3819, 14060, 12013, 750, 1775, 242, 1780, 1268, 759, 760, 249, 33536, 1281, 261, 262, 2311, 1290, 267, 37132, 5902, 1810, 7958, 39191, 280, 793, 43813, 1318, 807, 295, 45354, 1324, 28461, 1838, 28462, 815, 1329, 820, 1333, 317, 2366, 39743, 832, 2365, 45378, 835, 330, 1356, 845, 334, 1359, 4433, 4438, 854, 14168, 1370, 1883, 1372, 1371, 860, 863, 3935, 3937, 1378, 11618, 3426, 870, 358, 3942, 361, 874, 362, 875, 28010, 3438, 2416, 369, 880, 14196, 886, 4472, 1403, 894, 895, 2432, 385, 904, 905, 27528, 907, 909, 911, 1431, 409, 1433, 925, 1950, 415, 928, 413, 13731, 3494, 20902, 937, 1452, 942, 1968, 1973, 1464, 1977, 956, 34240, 
3009, 32706, 14278, 3015, 456, 1993, 973, 975, 976, 465, 466, 1491, 14290, 2512, 1494, 472, 475, 480, 3554, 995, 2532, 3048, 1513, 23529, 3564, 494, 498, 500, 501, 503, 1017, 3070}, 3: {1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 6731, 9293, 31823, 2133, 9303, 601, 91, 43615, 608, 9314, 10338, 25709, 1646, 10349, 6257, 7794, 27763, 11381, 9337, 7801, 637, 3709, 639, 11391, 9345, 7299, 3715, 1668, 41606, 11401, 11402, 4233, 9868, 10893, 142, 5259, 9872, 25744, 25741, 148, 10389, 34455, 3735, 8345, 8857, 154, 10396, 1178, 7839, 10399, 8554, 1704, 10409, 9900, 10412, 2734, 14512, 10416, 7858, 9394, 9904, 6325, 2232, 1721, 38589, 8894, 6336, 1220, 9925, 11461, 3271, 9420, 719, 14544, 2773, 3286, 3287, 214, 20187, 9438, 26335, 6048, 13534, 226, 3811, 19172, 1766, 2280, 36585, 14575, 2801, 9457, 10993, 10485, 23797, 759, 27896, 5882, 8443, 23803, 1790, 767, 8962, 9476, 7433, 6924, 2316, 2318, 3853, 14608, 4371, 9494, 8983, 6425, 793, 362, 6433, 7458, 2339, 810, 1835, 8493, 6447, 1329, 28466, 44855, 9527, 1338, 10044, 317, 3390, 10047, 41280, 31554, 2372, 9029, 11592, 9547, 3916, 9042, 10066, 3925, 343, 10072, 5978, 860, 8030, 10079, 10593, 9572, 2916, 9061, 3430, 6501, 4969, 10089, 30571, 10603, 11117, 9582, 10607, 6505, 14193, 28529, 14707, 7197, 369, 11639, 23929, 894, 1919, 3459, 11652, 2438, 10631, 907, 10642, 9109, 2454, 14743, 2456, 29594, 11164, 6559, 9631, 3999, 1951, 14754, 14756, 31653, 9638, 31654, 33704, 45984, 3500, 31661, 1453, 1455, 9645, 9649, 41394, 9651, 9652, 10165, 30718, 2999, 31672, 1982, 9662, 44483, 11205, 2505, 5581, 10704, 465, 977, 31699, 9172, 4053, 9174, 31703, 4567, 470, 10714, 475, 5076, 478, 480, 23008, 9186, 30692, 9190, 9703, 10216, 491, 30699, 1005, 2542, 31726, 1007, 494, 25586, 10222, 18417, 10736, 8178, 3064, 1529, 509, 1534}, 4: {0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 601, 7258, 91, 5722, 5214, 4703, 608, 3679, 2143, 101, 6758, 5224, 616, 7277, 2158, 4723, 5236, 6267, 1660, 637, 639, 4737, 4739, 5252, 133, 1668, 4606, 23688, 5768, 17035, 2188, 5772, 38034, 5779, 3220, 6805, 2199, 1688, 5273, 154, 155, 1694, 4767, 5280, 5278, 5284, 1191, 1704, 167, 3754, 5802, 5290, 3751, 3247, 5296, 3257, 5818, 5823, 3265, 708, 5318, 5830, 4294, 1738, 5841, 5330, 4825, 4316, 734, 6369, 5349, 4838, 4326, 2280, 4329, 46315, 6380, 29660, 44269, 5871, 5873, 242, 7927, 759, 760, 2812, 1277, 8448, 3329, 4866, 2304, 4869, 5382, 7430, 3848, 3339, 2318, 782, 3857, 5906, 26513, 788, 2841, 7450, 4382, 1825, 7458, 801, 37156, 4393, 810, 7979, 3886, 815, 4911, 4401, 7986, 1329, 820, 5942, 3896, 8506, 2874, 317, 5441, 835, 5445, 5958, 6578, 5964, 5965, 4942, 8016, 8024, 344, 4952, 860, 1884, 29533, 8545, 8037, 3430, 6504, 7017, 2922, 4457, 362, 5998, 2928, 373, 374, 2935, 1398, 8057, 6011, 6015, 32127, 384, 4994, 8579, 4996, 8072, 396, 6541, 5006, 6540, 5009, 1938, 1427, 7571, 2965, 1942, 6039, 1940, 7574, 2970, 409, 7068, 7575, 8606, 5014, 5018, 7585, 5017, 6561, 7588, 1447, 3497, 6058, 5547, 1965, 6065, 4529, 21939, 4531, 6069, 5043, 5559, 7096, 1465, 6074, 3515, 4533, 6077, 5054, 
7103, 448, 6080, 6076, 4547, 8132, 4552, 4555, 1484, 39372, 39374, 4561, 6611, 5078, 470, 1496, 5081, 472, 7131, 4572, 7133, 5598, 5086, 4576, 4577, 6111, 478, 4580, 1508, 480, 1503, 5096, 1506, 4584, 23019, 493, 494, 498, 5108, 18935, 1529, 6138, 7163, 10238, 5119}, 5: {0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91, 14940, 9308, 14937, 14943, 608, 6755, 1124, 13924, 14950, 5219, 14947, 9325, 3697, 14961, 11893, 14968, 12408, 15485, 637, 5247, 1668, 1157, 23172, 647, 15492, 15498, 5773, 19087, 13969, 9362, 15506, 1681, 148, 11926, 1176, 2713, 155, 1180, 15517, 1692, 20124, 10401, 19105, 675, 674, 19109, 167, 1704, 11946, 15019, 12458, 1709, 682, 9091, 2224, 15025, 20656, 176, 180, 7858, 12982, 15031, 15543, 41136, 14013, 2239, 1729, 708, 9413, 21700, 712, 15562, 15051, 2765, 15057, 15061, 9942, 15063, 21718, 22747, 15068, 15069, 32475, 13535, 15583, 15074, 227, 19683, 2789, 1766, 13542, 13036, 2799, 752, 3312, 13552, 242, 26867, 1268, 15618, 759, 2809, 763, 28924, 2812, 10495, 2817, 2818, 14083, 769, 259, 15622, 2823, 1288, 8962, 15109, 19720, 15629, 19213, 3345, 786, 788, 280, 25375, 2337, 15650, 804, 15653, 3366, 807, 2349, 15151, 7984, 1329, 21810, 820, 12602, 1338, 317, 11582, 5953, 2370, 835, 323, 15688, 1864, 15693, 854, 13142, 344, 15705, 4955, 860, 23899, 11615, 863, 15199, 15711, 13155, 15205, 872, 4457, 15722, 362, 15724, 875, 3438, 15215, 369, 883, 19828, 24437, 374, 29179, 9593, 19834, 15227, 894, 19326, 13186, 35203, 2436, 15749, 389, 19847, 15750, 19849, 2438, 1922, 6028, 909, 15752, 2446, 13200, 2448, 409, 21923, 9644, 14766, 22959, 14771, 23989, 12728, 9145, 14778, 14779, 3000, 12733, 7102, 3007, 9665, 14786, 12226, 2498, 14789, 8645, 15301, 15305, 15818, 461, 976, 5585, 977, 1489, 15358, 472, 1496, 42457, 2524, 478, 19422, 480, 15330, 15843, 20452, 26084, 6631, 14827, 492, 15343, 3571, 14836, 15348, 19446, 14839, 11765, 1017, 14843, 14844, 14846}} word_bagNum shape: 6 50 {0: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960], 1: [0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613], 2: [0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651], 3: [1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284], 4: [0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 
2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740], 5: [0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91]} after all_words, word_bag shape: 6 300 {0: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 1: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 
4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 2: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 3: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 
51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 4: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 5: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 
9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91]} features_data_frame.shape: (6, 255) 0 30 1 185 2 139 3 66 4 69 5 157 class_Proportion: [0.04643962848297214, 0.28637770897832815, 0.21517027863777088, 0.1021671826625387, 0.10681114551083591, 0.24303405572755418] test_data_frame.head(2) Unnamed: 0 content \ 854 854 據(jù)Mobileexpose報道,華碩已經(jīng)正式向媒體發(fā)出邀請,定于6月14日在臺灣舉辦記者會,... 101 101 6月6日,王者榮耀猴三棍重做引起王者峽谷一陣軒然大波,畢竟這個強勢的猴子已經(jīng)陪伴我們好幾個... id tags \ 854 6429089676803440897 ['科技', '華碩', '華碩ZenFone', '臺灣', '手機'] 101 6429098400347586818 ['游戲', '猴子', '王者榮耀', '黃忠', '游戲'] time title \ 854 2017-06-07 10:11:00 華碩ZenFone AR宣布本月發(fā)售 101 2017-06-07 10:39:20 猴子重做之后是加強還是削弱?狂到站對面泉水拿雙殺 doc_words \ 854 [報道, 華碩, 已經(jīng), 正式, 媒體, 發(fā)出, 邀請, 定于, 月, 日, 臺灣, 舉辦,... 101 [月, 日, 王者, 榮耀, 猴三棍, 重, 做, 引起, 王者, 峽谷, 一陣, 軒然大波... corpus \ 854 [(142, 1), (362, 1), (472, 1), (475, 1), (494,... 101 [(0, 2), (68, 3), (133, 1), (184, 1), (226, 1)... tfidf visual01 visual02 \ 854 [(142, 0.13953435619531032), (362, 0.046441336... 21.684397 -30.567736 101 [(0, 0.012838015508020575), (68, 0.04742284222... 67.188065 21.183245 keyword_index 854 1 101 3 print the first sample Unnamed: 0 854 content 據(jù)Mobileexpose報道,華碩已經(jīng)正式向媒體發(fā)出邀請,定于6月14日在臺灣舉辦記者會,... id 6429089676803440897 tags ['科技', '華碩', '華碩ZenFone', '臺灣', '手機'] time 2017-06-07 10:11:00 title 華碩ZenFone AR宣布本月發(fā)售 doc_words [報道, 華碩, 已經(jīng), 正式, 媒體, 發(fā)出, 邀請, 定于, 月, 日, 臺灣, 舉辦,... corpus [(142, 1), (362, 1), (472, 1), (475, 1), (494,... tfidf [(142, 0.13953435619531032), (362, 0.046441336... visual01 21.6844 visual02 -30.5677 keyword_index 1 Name: 854, dtype: object test_data_frame.iloc[0].corpus: [(142, 1), (362, 1), (472, 1), (475, 1), (494, 1), (530, 1), (872, 1), (909, 1), (1254, 1), (1312, 1), (1878, 1), (2577, 1), (2783, 1), (2979, 1), (3697, 1), (5508, 1), (9052, 1), (12204, 1), (12256, 1), (12591, 1), (12936, 1), (12991, 1), (13128, 1), (13194, 1), (13244, 1), (13317, 1), (31670, 1), (31683, 1), (33417, 1)] [1.45708072e-43 1.78656934e-66 7.12148875e-63 1.71090490e-534.71385662e-54 2.08405934e-64] [-35.34436300647761, -16.431856044032266, -20.267559000416433, -22.405433968586664, -27.97121661401147, -18.05089965903481] F:\File_Jupyter\實用代碼\naive_bayes(簡單貝葉斯)\TextClassPrediction_kNN_NB_LDA_P.py:346: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value insteadSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copytest_data_frame['predicted_class'] = test_data_frame['corpus'].apply(predict_text_ByMax) #預(yù)測所有測試文檔 predict all test documents test_data_frame: Unnamed: 0 content \ 854 854 據(jù)Mobileexpose報道,華碩已經(jīng)正式向媒體發(fā)出邀請,定于6月14日在臺灣舉辦記者會,... 
101 101 6月6日,王者榮耀猴三棍重做引起王者峽谷一陣軒然大波,畢竟這個強勢的猴子已經(jīng)陪伴我們好幾個... 738 738 騙子往往都很會講故事,比如以下這些硅谷騙局:驗血公司Theranos,號稱只要從指尖抽幾滴血... 511 511 專訪 Whyd 創(chuàng)始人 孟崨在學校,他是最調(diào)皮,卻又成績最好的學生,讓老師頭疼不已。在公司,... 725 725 據(jù)介紹,喜馬拉雅FM會員月費為18元,年度會員188元,價格與視頻網(wǎng)站會員價格相仿。在會員福... ... ... ... 805 805 每經(jīng)記者 王海慜 每經(jīng)編輯 葉峰今日盤中,昨日領(lǐng)漲的中小創(chuàng)出現(xiàn)休整,而昨日暫時休整的一批龍頭... 448 448 中國人買什么都喜歡大的,房子要買面積大的、手機要買屏大的,買車自然也是要挑選空間大的。拋開拉... 782 782 中證網(wǎng)訊 (記者 徐金忠)6月7日,國能電動汽車瑞典有限公司(NEVS)亮相CES亞洲消費電... 1264 1264 目前日系豪華品牌謳歌已經(jīng)開啟了國產(chǎn)之路,在推出CDX車型后,謳歌在國內(nèi)的知名度一度飆升。CD... 1195 1195 近日有爆料稱,樂視位于北京達美中心的辦公地因未及時繳納辦公地費用已被停止物業(yè)一切服務(wù);物業(yè)公... id tags \ 854 6429089676803440897 ['科技', '華碩', '華碩ZenFone', '臺灣', '手機'] 101 6429098400347586818 ['游戲', '猴子', '王者榮耀', '黃忠', '游戲'] 738 6413133652368982274 ['科技', '廚衛(wèi)電器', '榨汁機', '小家電', '硅谷'] 511 6428827159980867842 ['科技', '智能家居', '音箱', '蘋果公司', '法國'] 725 6428841852455354625 ['科技', '喜馬拉雅山', '科技'] ... ... ... 805 6429151552733069569 ['財經(jīng)', '財經(jīng)'] 448 6415852634885341441 ['汽車', 'SUV', '國產(chǎn)車', '概念車', '汽車用品'] 782 6428858665063383297 ['科技', '新能源汽車', '電動汽車', '新能源', '經(jīng)濟'] 1264 6427822755417194753 ['汽車', '日本汽車', '謳歌汽車', 'SUV', '空調(diào)'] 1195 6429093420292210945 ['科技', '樂視', '科技'] time title \ 854 2017-06-07 10:11:00 華碩ZenFone AR宣布本月發(fā)售 101 2017-06-07 10:39:20 猴子重做之后是加強還是削弱?狂到站對面泉水拿雙殺 738 2017-04-26 10:41:39 絕!他用一臺榨汁機騙了8億 511 2017-06-08 11:06:00 他的智能音箱一上市,蘋果公司就推出了HomePod 725 2017-06-07 18:37:00 喜馬拉雅FM推出“付費會員”,當天召集超221萬名會員 ... ... ... 805 2017-06-08 14:30:00 盤中近20家龍頭白馬股集體創(chuàng)下歷史新高 448 2017-05-03 18:37:20 別瞎找了!10萬左右尺寸最大的SUV都在這里了 782 2017-06-07 19:12:00 倡導移動出行新概念 NEVS兩款概念量產(chǎn)車亮相 1264 2017-06-08 09:54:40 居然還有一款車,最低配和中高配看不出差別? 1195 2017-06-08 10:45:00 樂視被爆未及時繳物業(yè)費,員工或?qū)⒈蛔柚惯M大樓辦公 doc_words \ 854 [報道, 華碩, 已經(jīng), 正式, 媒體, 發(fā)出, 邀請, 定于, 月, 日, 臺灣, 舉辦,... 101 [月, 日, 王者, 榮耀, 猴三棍, 重, 做, 引起, 王者, 峽谷, 一陣, 軒然大波... 738 [騙子, 往往, 很會, 講故事, 以下, 硅谷, 騙局, 驗血, 公司, 號稱, 指尖, ... 511 [專訪, 創(chuàng)始人, 孟, 崨, 學校, 最, 調(diào)皮, 卻, 成績, 最好, 學生, 老師, ... 725 [據(jù)介紹, 喜馬拉雅, 會員, 月費, 元, 年度, 會員, 元, 價格, 視頻, 網(wǎng)站, ... ... ... 805 [每經(jīng), 記者, 王海, 慜, 每經(jīng), 編輯, 葉峰, 今日, 盤中, 昨日, 領(lǐng)漲, 中小... 448 [中國, 人買, 喜歡, 房子, 買, 面積, 手機, 買, 屏大, 買車, 自然, 挑選,... 782 [中證網(wǎng), 訊, 記者, 徐金忠, 月, 日, 國, 電動汽車, 瑞典, 有限公司, 亮相,... 1264 [目前, 日系, 豪華, 品牌, 謳歌, 已經(jīng), 開啟, 國產(chǎn), 路, 推出, 車型, 后,... 1195 [近日, 爆料, 稱, 樂視, 位于, 北京, 達美, 中心, 辦公地, 因未, 及時, 繳... corpus \ 854 [(142, 1), (362, 1), (472, 1), (475, 1), (494,... 101 [(0, 2), (68, 3), (133, 1), (184, 1), (226, 1)... 738 [(0, 2), (45, 1), (48, 1), (133, 2), (155, 1),... 511 [(0, 10), (13, 2), (14, 2), (20, 1), (45, 1), ... 725 [(30, 1), (102, 1), (142, 1), (154, 1), (189, ... ... ... 805 [(113, 1), (167, 1), (169, 1), (214, 1), (258,... 448 [(4, 2), (8, 1), (14, 1), (51, 6), (53, 2), (6... 782 [(15, 2), (30, 1), (53, 7), (93, 1), (143, 1),... 1264 [(0, 1), (20, 1), (51, 1), (176, 1), (225, 1),... 1195 [(57, 1), (111, 1), (191, 1), (361, 1), (476, ... tfidf visual01 visual02 \ 854 [(142, 0.13953435619531032), (362, 0.046441336... 21.684397 -30.567736 101 [(0, 0.012838015508020575), (68, 0.04742284222... 67.188065 21.183245 738 [(0, 0.008984009118453712), (45, 0.01791359767... -22.855194 -11.270862 511 [(0, 0.04361196171462796), (13, 0.028607388065... -22.198786 12.217076 725 [(30, 0.05815947983270004), (102, 0.0450585853... 26.268911 21.240065 ... ... ... ... 805 [(113, 0.030899018921031703), (167, 0.02103003... -66.232071 0.221611 448 [(4, 0.04071064284477513), (8, 0.0235138776022... 
41.836094 -44.539528 782 [(15, 0.03392075672049564), (30, 0.03003603467... -26.810091 -29.602842 1264 [(0, 0.009883726180653873), (20, 0.04080153677... 36.279522 -52.474297 1195 [(57, 0.09668298763559263), (111, 0.1255406499... -6.373239 16.101738 keyword_index predicted_class 854 1 1 101 3 3 738 1 1 511 1 2 725 1 1 ... ... ... 805 2 2 448 5 5 782 1 1 1264 5 5 1195 1 1 [647 rows x 13 columns] SModel_CS_acc_score: 0.7047913446676971 300 label_category_ID 2 一個 一些 概念 經(jīng)營 補貼 股市 增持 成本 乳業(yè) 萬噸 train_data_frame.corpus[0] [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2), (6, 1), (7, 1), (8, 2), (9, 1), (10, 3), (11, 1), (12, 2), (13, 2), (14, 2), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 2), (21, 1), (22, 2), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2), (30, 3), (31, 4), (32, 3), (33, 1), (34, 1), (35, 1), (36, 7), (37, 1), (38, 1), (39, 2), (40, 3), (41, 1), (42, 1), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 2), (50, 4), (51, 21), (52, 3), (53, 7), (54, 1), (55, 2), (56, 1), (57, 4), (58, 2), (59, 1), (60, 5), (61, 1), (62, 1), (63, 1), (64, 2), (65, 1), (66, 3), (67, 1), (68, 2), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 2), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 1), (83, 4), (84, 7), (85, 2), (86, 3), (87, 1), (88, 9), (89, 1), (90, 1), (91, 8), (92, 3), (93, 1), (94, 4), (95, 1), (96, 2), (97, 1), (98, 7), (99, 1), (100, 2), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 2), (110, 1), (111, 2), (112, 1), (113, 1), (114, 1), (115, 1), (116, 1), (117, 1), (118, 1), (119, 1), (120, 1), (121, 2), (122, 1), (123, 1), (124, 1), (125, 1), (126, 5), (127, 1), (128, 4), (129, 1), (130, 1), (131, 1), (132, 2), (133, 2), (134, 1), (135, 5), (136, 1), (137, 1), (138, 3), (139, 1), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1), (145, 2), (146, 1), (147, 1), (148, 2), (149, 4), (150, 1), (151, 1), (152, 2), (153, 2), (154, 1), (155, 3), (156, 1), (157, 1), (158, 1), (159, 1), (160, 1), (161, 2), (162, 1), (163, 1), (164, 1), (165, 2), (166, 1), (167, 3), (168, 1), (169, 1), (170, 3), (171, 3), (172, 1), (173, 2), (174, 1), (175, 1), (176, 2), (177, 5), (178, 1), (179, 1), (180, 1), (181, 1), (182, 1), (183, 1), (184, 4), (185, 1), (186, 1), (187, 1), (188, 1), (189, 3), (190, 1), (191, 14), (192, 2), (193, 2), (194, 2), (195, 1), (196, 3), (197, 1), (198, 1), (199, 11), (200, 6), (201, 1), (202, 1), (203, 2), (204, 1), (205, 8), (206, 2), (207, 2), (208, 2), (209, 1), (210, 1), (211, 1), (212, 1), (213, 1), (214, 1), (215, 1), (216, 3), (217, 1), (218, 1), (219, 2), (220, 2), (221, 1), (222, 1), (223, 1), (224, 1), (225, 17), (226, 1), (227, 1), (228, 1), (229, 1), (230, 1), (231, 1), (232, 2), (233, 1), (234, 1), (235, 3), (236, 1), (237, 1), (238, 2), (239, 1), (240, 1), (241, 1), (242, 1), (243, 2), (244, 2), (245, 1), (246, 1), (247, 2), (248, 2), (249, 2), (250, 1), (251, 1), (252, 2), (253, 1), (254, 1), (255, 1), (256, 1), (257, 1), (258, 3), (259, 3), (260, 1), (261, 3), (262, 2), (263, 1), (264, 1), (265, 6), (266, 1), (267, 3), (268, 1), (269, 1), (270, 3), (271, 2), (272, 1), (273, 2), (274, 1), (275, 1), (276, 5), (277, 1), (278, 4), (279, 4), (280, 25), (281, 2), (282, 2), (283, 2), (284, 7), (285, 1), (286, 1), (287, 2), (288, 2), (289, 1), (290, 1), (291, 1), (292, 1), (293, 3), (294, 2), (295, 1), (296, 3), (297, 1), (298, 3), (299, 2), (300, 1), (301, 1), (302, 1), (303, 2), (304, 1), (305, 1), (306, 1), (307, 2), (308, 2), (309, 1), (310, 1), (311, 1), (312, 1), 
(313, 1), (314, 1), (315, 1), (316, 7), (317, 2), (318, 2), (319, 1), (320, 1), (321, 1), (322, 1), (323, 1), (324, 1), (325, 4), (326, 1), (327, 2), (328, 1), (329, 1), (330, 3), (331, 3), (332, 1), (333, 2), (334, 2), (335, 1), (336, 1), (337, 2), (338, 1), (339, 1), (340, 1), (341, 1), (342, 1), (343, 1), (344, 2), (345, 1), (346, 1), (347, 2), (348, 1), (349, 2), (350, 5), (351, 2), (352, 3), (353, 1), (354, 4), (355, 1), (356, 1), (357, 2), (358, 4), (359, 2), (360, 2), (361, 1), (362, 9), (363, 2), (364, 2), (365, 1), (366, 1), (367, 7), (368, 1), (369, 4), (370, 2), (371, 1), (372, 1), (373, 1), (374, 1), (375, 1), (376, 1), (377, 1), (378, 2), (379, 1), (380, 3), (381, 1), (382, 2), (383, 1), (384, 3), (385, 26), (386, 1), (387, 1), (388, 1), (389, 3), (390, 1), (391, 2), (392, 1), (393, 4), (394, 4), (395, 4), (396, 2), (397, 1), (398, 40), (399, 2), (400, 4), (401, 1), (402, 1), (403, 2), (404, 1), (405, 1), (406, 2), (407, 1), (408, 1), (409, 3), (410, 1), (411, 1), (412, 2), (413, 7), (414, 4), (415, 2), (416, 1), (417, 1), (418, 1), (419, 3), (420, 1), (421, 1), (422, 1), (423, 1), (424, 1), (425, 1), (426, 1), (427, 2), (428, 1), (429, 1), (430, 1), (431, 1), (432, 5), (433, 1), (434, 1), (435, 1), (436, 1), (437, 1), (438, 1), (439, 1), (440, 1), (441, 1), (442, 1), (443, 3), (444, 3), (445, 2), (446, 5), (447, 1), (448, 1), (449, 1), (450, 4), (451, 1), (452, 2), (453, 2), (454, 1), (455, 4), (456, 1), (457, 1), (458, 1), (459, 2), (460, 1), (461, 1), (462, 5), (463, 2), (464, 1), (465, 5), (466, 74), (467, 2), (468, 1), (469, 1), (470, 2), (471, 22), (472, 2), (473, 1), (474, 1), (475, 2), (476, 2), (477, 2), (478, 2), (479, 1), (480, 1), (481, 1), (482, 1), (483, 2), (484, 1), (485, 1), (486, 2), (487, 1), (488, 2), (489, 1), (490, 1), (491, 1), (492, 4), (493, 1), (494, 2), (495, 4), (496, 2), (497, 1), (498, 1), (499, 1), (500, 1), (501, 5), (502, 1), (503, 13), (504, 4), (505, 3), (506, 1), (507, 7), (508, 1), (509, 1), (510, 1), (511, 1), (512, 1), (513, 1), (514, 2), (515, 1), (516, 3), (517, 4), (518, 1), (519, 1), (520, 1), (521, 1), (522, 1), (523, 1), (524, 1), (525, 1), (526, 2), (527, 2), (528, 1), (529, 1), (530, 1), (531, 1), (532, 1), (533, 1), (534, 1), (535, 2), (536, 5), (537, 2), (538, 1), (539, 1), (540, 1), (541, 7), (542, 1), (543, 1), (544, 1), (545, 2), (546, 1), (547, 3), (548, 2), (549, 1), (550, 1), (551, 2), (552, 1), (553, 2), (554, 1), (555, 1), (556, 2), (557, 1), (558, 2), (559, 5), (560, 2), (561, 1), (562, 1), (563, 1), (564, 1), (565, 1), (566, 1), (567, 7), (568, 2), (569, 1), (570, 2), (571, 1), (572, 1), (573, 1), (574, 4), (575, 1), (576, 2), (577, 2), (578, 1), (579, 2), (580, 1), (581, 1), (582, 1), (583, 2), (584, 1), (585, 1), (586, 1), (587, 4), (588, 1), (589, 4), (590, 2), (591, 1), (592, 1), (593, 1), (594, 2), (595, 1), (596, 1), (597, 1), (598, 1), (599, 1), (600, 1), (601, 1), (602, 1), (603, 1), (604, 1), (605, 1), (606, 1), (607, 1), (608, 2), (609, 1), (610, 2), (611, 1), (612, 1), (613, 11), (614, 1), (615, 1), (616, 3), (617, 1), (618, 1), (619, 1), (620, 1), (621, 1), (622, 1), (623, 1), (624, 32), (625, 2), (626, 1), (627, 8), (628, 1), (629, 3), (630, 3), (631, 1), (632, 1), (633, 4), (634, 1), (635, 1), (636, 2), (637, 1), (638, 3), (639, 2), (640, 1), (641, 1), (642, 1), (643, 3), (644, 5), (645, 4), (646, 1), (647, 1), (648, 3), (649, 1), (650, 1), (651, 1), (652, 1), (653, 1), (654, 1), (655, 2), (656, 1), (657, 7), (658, 1), (659, 2), (660, 1), (661, 2), (662, 1), (663, 1), (664, 1), (665, 1), (666, 1), (667, 
1), (668, 4), (669, 1), (670, 1), (671, 3), (672, 1), (673, 1), (674, 2), (675, 1), (676, 1), (677, 1), (678, 1), (679, 1), (680, 2), (681, 2), (682, 1), (683, 1), (684, 1), (685, 3), (686, 1), (687, 1), (688, 1), (689, 1), (690, 4), (691, 1), (692, 2), (693, 3), (694, 1), (695, 2), (696, 1), (697, 1), (698, 2), (699, 1), (700, 1), (701, 4), (702, 1), (703, 1), (704, 2), (705, 1), (706, 1), (707, 1), (708, 1), (709, 2), (710, 1), (711, 3), (712, 1), (713, 1), (714, 4), (715, 1), (716, 1), (717, 1), (718, 2), (719, 1), (720, 1), (721, 2), (722, 1), (723, 1), (724, 4), (725, 1), (726, 1), (727, 1), (728, 1), (729, 2), (730, 12), (731, 2), (732, 1), (733, 2), (734, 3), (735, 1), (736, 26), (737, 1), (738, 5), (739, 1), (740, 2), (741, 5), (742, 2), (743, 3), (744, 3), (745, 2), (746, 1), (747, 3), (748, 2), (749, 2), (750, 2), (751, 1), (752, 1), (753, 2), (754, 1), (755, 1), (756, 1), (757, 1), (758, 1), (759, 4), (760, 1), (761, 1), (762, 1), (763, 1), (764, 1), (765, 2), (766, 1), (767, 1), (768, 1), (769, 2), (770, 8), (771, 2), (772, 4), (773, 1), (774, 8), (775, 3), (776, 1), (777, 1), (778, 3), (779, 1), (780, 1), (781, 1), (782, 5), (783, 2), (784, 2), (785, 1), (786, 4), (787, 1), (788, 1), (789, 1), (790, 1), (791, 1), (792, 1), (793, 4), (794, 1), (795, 1), (796, 1), (797, 5), (798, 3), (799, 5), (800, 3), (801, 1), (802, 1), (803, 1), (804, 1), (805, 2), (806, 2), (807, 2), (808, 1), (809, 1), (810, 1), (811, 1), (812, 1), (813, 1), (814, 1), (815, 3), (816, 1), (817, 2), (818, 1), (819, 1), (820, 11), (821, 1), (822, 1), (823, 2), (824, 3), (825, 1), (826, 1), (827, 1), (828, 1), (829, 1), (830, 3), (831, 4), (832, 46), (833, 1), (834, 1), (835, 2), (836, 2), (837, 1), (838, 1), (839, 2), (840, 2), (841, 1), (842, 1), (843, 2), (844, 2), (845, 2), (846, 1), (847, 1), (848, 2), (849, 1), (850, 1), (851, 1), (852, 3), (853, 1), (854, 1), (855, 6), (856, 1), (857, 1), (858, 1)]
[33. 74. 73. 31. 47. 48.] <class 'numpy.ndarray'>
SModel_acc_score: 0.8114374034003091
kNNC_acc_score: 0.8160741885625966
GNBC_acc_score: 0.6352395672333848
MNBC_acc_score: 0.6352395672333848
BNBC_acc_score: 0.29675425038639874
LDAC_acc_score: 0.8238021638330757
PerceptronC_acc_score: 0.8222565687789799
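For reference, here is a minimal, hedged sketch of the model-comparison step that would produce accuracy scores like those above (the pure-statistics similarity model, SModel, is omitted). Converting the gensim TF-IDF vectors to a dense matrix with corpus2dense and the X_train/X_test variable names are assumptions; the six scikit-learn estimators, the train/test DataFrames, and the keyword_index label column match the printed output:

from gensim.matutils import corpus2dense
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

num_terms = len(dictionary)
# corpus2dense returns a (num_terms, num_docs) matrix, so transpose to docs x terms.
X_train = corpus2dense(list(train_data_frame['tfidf']), num_terms).T
X_test = corpus2dense(list(test_data_frame['tfidf']), num_terms).T
y_train = train_data_frame['keyword_index']
y_test = test_data_frame['keyword_index']

classifiers = {
    'kNNC': KNeighborsClassifier(),
    'GNBC': GaussianNB(),
    'MNBC': MultinomialNB(),
    'BNBC': BernoulliNB(),
    'LDAC': LinearDiscriminantAnalysis(),
    'PerceptronC': Perceptron(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name + '_acc_score:', accuracy_score(y_test, clf.predict(X_test)))

One plausible reading of the scores: BernoulliNB binarizes its input (threshold 0 by default), reducing the TF-IDF features to presence/absence and discarding all weight information, which is consistent with it trailing the other models by a wide margin.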
Core code
# Module-level imports these classes rely on (abridged from sklearn/naive_bayes.py, ~0.23):
import numpy as np
from sklearn.preprocessing import binarize
from sklearn.utils import check_X_y, check_array, column_or_1d
from sklearn.utils.extmath import safe_sparse_dot
from sklearn.utils.multiclass import _check_partial_fit_first_call
from sklearn.utils.validation import (_check_sample_weight,
                                      _deprecate_positional_args,
                                      check_non_negative)

class GaussianNB Found at: sklearn.naive_bayes

class GaussianNB(_BaseNB):
    """Gaussian Naive Bayes (GaussianNB)

    Can perform online updates to model parameters via :meth:`partial_fit`.
    For details on algorithm used to update feature means and variance online,
    see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:
    http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

    Read more in the :ref:`User Guide <gaussian_naive_bayes>`.

    Parameters
    ----------
    priors : array-like of shape (n_classes,)
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    var_smoothing : float, default=1e-9
        Portion of the largest variance of all features that is added to
        variances for calculation stability.

        .. versionadded:: 0.20

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes,)
        number of training samples observed in each class.

    class_prior_ : ndarray of shape (n_classes,)
        probability of each class.

    classes_ : ndarray of shape (n_classes,)
        class labels known to the classifier

    epsilon_ : float
        absolute additive value to variances

    sigma_ : ndarray of shape (n_classes, n_features)
        variance of each feature per class

    theta_ : ndarray of shape (n_classes, n_features)
        mean of each feature per class

    Examples
    --------
    >>> import numpy as np
    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    >>> Y = np.array([1, 1, 1, 2, 2, 2])
    >>> from sklearn.naive_bayes import GaussianNB
    >>> clf = GaussianNB()
    >>> clf.fit(X, Y)
    GaussianNB()
    >>> print(clf.predict([[-0.8, -1]]))
    [1]
    >>> clf_pf = GaussianNB()
    >>> clf_pf.partial_fit(X, Y, np.unique(Y))
    GaussianNB()
    >>> print(clf_pf.predict([[-0.8, -1]]))
    [1]
    """

    @_deprecate_positional_args
    def __init__(self, *, priors=None, var_smoothing=1e-9):
        self.priors = priors
        self.var_smoothing = var_smoothing

    def fit(self, X, y, sample_weight=None):
        """Fit Gaussian Naive Bayes according to X, y

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.

        y : array-like of shape (n_samples,)
            Target values.

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

            .. versionadded:: 0.17
               Gaussian Naive Bayes supports fitting with *sample_weight*.

        Returns
        -------
        self : object
        """
        X, y = self._validate_data(X, y)
        y = column_or_1d(y, warn=True)
        return self._partial_fit(X, y, np.unique(y), _refit=True,
                                 sample_weight=sample_weight)

    def _check_X(self, X):
        return check_array(X)

    @staticmethod
    def _update_mean_variance(n_past, mu, var, X, sample_weight=None):
        """Compute online update of Gaussian mean and variance.

        Given starting sample count, mean, and variance, a new set of
        points X, and optionally sample weights, return the updated mean and
        variance. (NB - each dimension (column) in X is treated as independent
        -- you get variance, not covariance).

        Can take scalar mean and variance, or vector mean and variance to
        simultaneously update a number of independent Gaussians.

        See Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:
        http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

        Parameters
        ----------
        n_past : int
            Number of samples represented in old mean and variance. If sample
            weights were given, this should contain the sum of sample
            weights represented in old mean and variance.

        mu : array-like of shape (number of Gaussians,)
            Means for Gaussians in original set.

        var : array-like of shape (number of Gaussians,)
            Variances for Gaussians in original set.

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

        Returns
        -------
        total_mu : array-like of shape (number of Gaussians,)
            Updated mean for each Gaussian over the combined set.

        total_var : array-like of shape (number of Gaussians,)
            Updated variance for each Gaussian over the combined set.
        """
        if X.shape[0] == 0:
            return mu, var

        # Compute (potentially weighted) mean and variance of new datapoints
        if sample_weight is not None:
            n_new = float(sample_weight.sum())
            new_mu = np.average(X, axis=0, weights=sample_weight)
            new_var = np.average((X - new_mu) ** 2, axis=0,
                                 weights=sample_weight)
        else:
            n_new = X.shape[0]
            new_var = np.var(X, axis=0)
            new_mu = np.mean(X, axis=0)

        if n_past == 0:
            return new_mu, new_var

        n_total = float(n_past + n_new)

        # Combine mean of old and new data, taking into consideration
        # (weighted) number of observations
        total_mu = (n_new * new_mu + n_past * mu) / n_total

        # Combine variance of old and new data, taking into consideration
        # (weighted) number of observations. This is achieved by combining
        # the sum-of-squared-differences (ssd)
        old_ssd = n_past * var
        new_ssd = n_new * new_var
        total_ssd = (old_ssd + new_ssd +
                     (n_new * n_past / n_total) * (mu - new_mu) ** 2)
        total_var = total_ssd / n_total

        return total_mu, total_var

    def partial_fit(self, X, y, classes=None, sample_weight=None):
        """Incremental fit on a batch of samples.

        This method is expected to be called several times consecutively
        on different chunks of a dataset so as to implement out-of-core
        or online learning.

        This is especially useful when the whole dataset is too big to fit in
        memory at once.

        This method has some performance and numerical stability overhead,
        hence it is better to call partial_fit on chunks of data that are
        as large as possible (as long as fitting in the memory budget) to
        hide the overhead.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        y : array-like of shape (n_samples,)
            Target values.

        classes : array-like of shape (n_classes,), default=None
            List of all the classes that can possibly appear in the y vector.
            Must be provided at the first call to partial_fit, can be omitted
            in subsequent calls.

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

            .. versionadded:: 0.17

        Returns
        -------
        self : object
        """
        return self._partial_fit(X, y, classes, _refit=False,
                                 sample_weight=sample_weight)

    def _partial_fit(self, X, y, classes=None, _refit=False,
                     sample_weight=None):
        """Actual implementation of Gaussian NB fitting.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        y : array-like of shape (n_samples,)
            Target values.

        classes : array-like of shape (n_classes,), default=None
            List of all the classes that can possibly appear in the y vector.
            Must be provided at the first call to partial_fit, can be omitted
            in subsequent calls.

        _refit : bool, default=False
            If true, act as though this were the first time we called
            _partial_fit (ie, throw away any past fitting and start over).

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

        Returns
        -------
        self : object
        """
        X, y = check_X_y(X, y)
        if sample_weight is not None:
            sample_weight = _check_sample_weight(sample_weight, X)

        # If the ratio of data variance between dimensions is too small, it
        # will cause numerical errors. To address this, we artificially
        # boost the variance by epsilon, a small fraction of the standard
        # deviation of the largest dimension.
        self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()

        if _refit:
            self.classes_ = None

        if _check_partial_fit_first_call(self, classes):
            # This is the first call to partial_fit:
            # initialize various cumulative counters
            n_features = X.shape[1]
            n_classes = len(self.classes_)
            self.theta_ = np.zeros((n_classes, n_features))
            self.sigma_ = np.zeros((n_classes, n_features))
            self.class_count_ = np.zeros(n_classes, dtype=np.float64)

            # Initialise the class prior
            # Take into account the priors
            if self.priors is not None:
                priors = np.asarray(self.priors)
                # Check that the provide prior match the number of classes
                if len(priors) != n_classes:
                    raise ValueError('Number of priors must match number of'
                                     ' classes.')
                # Check that the sum is 1
                if not np.isclose(priors.sum(), 1.0):
                    raise ValueError('The sum of the priors should be 1.')
                # Check that the prior are non-negative
                if (priors < 0).any():
                    raise ValueError('Priors must be non-negative.')
                self.class_prior_ = priors
            else:
                # Initialize the priors to zeros for each class
                self.class_prior_ = np.zeros(len(self.classes_),
                                             dtype=np.float64)
        else:
            if X.shape[1] != self.theta_.shape[1]:
                msg = "Number of features %d does not match previous data %d."
                raise ValueError(msg % (X.shape[1], self.theta_.shape[1]))
            # Put epsilon back in each time
            self.sigma_[:, :] -= self.epsilon_

        classes = self.classes_

        unique_y = np.unique(y)
        unique_y_in_classes = np.in1d(unique_y, classes)

        if not np.all(unique_y_in_classes):
            raise ValueError("The target label(s) %s in y do not exist in the "
                             "initial classes %s" %
                             (unique_y[~unique_y_in_classes], classes))

        for y_i in unique_y:
            i = classes.searchsorted(y_i)
            X_i = X[y == y_i, :]

            if sample_weight is not None:
                sw_i = sample_weight[y == y_i]
                N_i = sw_i.sum()
            else:
                sw_i = None
                N_i = X_i.shape[0]

            new_theta, new_sigma = self._update_mean_variance(
                self.class_count_[i], self.theta_[i, :], self.sigma_[i, :],
                X_i, sw_i)

            self.theta_[i, :] = new_theta
            self.sigma_[i, :] = new_sigma
            self.class_count_[i] += N_i

        self.sigma_[:, :] += self.epsilon_

        # Update if only no priors is provided
        if self.priors is None:
            # Empirical prior, with sample_weight taken into account
            self.class_prior_ = self.class_count_ / self.class_count_.sum()

        return self

    def _joint_log_likelihood(self, X):
        joint_log_likelihood = []
        for i in range(np.size(self.classes_)):
            jointi = np.log(self.class_prior_[i])
            n_ij = -0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))
            n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
                                 (self.sigma_[i, :]), 1)
            joint_log_likelihood.append(jointi + n_ij)

        joint_log_likelihood = np.array(joint_log_likelihood).T
        return joint_log_likelihood

class MultinomialNB Found at: sklearn.naive_bayes

class MultinomialNB(_BaseDiscreteNB):
    """Naive Bayes classifier for multinomial models

    The multinomial Naive Bayes classifier is suitable for classification with
    discrete features (e.g., word counts for text classification). The
    multinomial distribution normally requires integer feature counts. However,
    in practice, fractional counts such as tf-idf may also work.

    Read more in the :ref:`User Guide <multinomial_naive_bayes>`.

    Parameters
    ----------
    alpha : float, default=1.0
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    fit_prior : bool, default=True
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like of shape (n_classes,), default=None
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes,)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    class_log_prior_ : ndarray of shape (n_classes,)
        Smoothed empirical log probability for each class.

    classes_ : ndarray of shape (n_classes,)
        Class labels known to the classifier

    coef_ : ndarray of shape (n_classes, n_features)
        Mirrors ``feature_log_prob_`` for interpreting MultinomialNB
        as a linear model.

    feature_count_ : ndarray of shape (n_classes, n_features)
        Number of samples encountered for each (class, feature)
        during fitting. This value is weighted by the sample weight when
        provided.

    feature_log_prob_ : ndarray of shape (n_classes, n_features)
        Empirical log probability of features
        given a class, ``P(x_i|y)``.

    intercept_ : ndarray of shape (n_classes,)
        Mirrors ``class_log_prior_`` for interpreting MultinomialNB
        as a linear model.

    n_features_ : int
        Number of features of each sample.

    Examples
    --------
    >>> import numpy as np
    >>> rng = np.random.RandomState(1)
    >>> X = rng.randint(5, size=(6, 100))
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> from sklearn.naive_bayes import MultinomialNB
    >>> clf = MultinomialNB()
    >>> clf.fit(X, y)
    MultinomialNB()
    >>> print(clf.predict(X[2:3]))
    [3]

    Notes
    -----
    For the rationale behind the names `coef_` and `intercept_`, i.e.
    naive Bayes as a linear classifier, see J. Rennie et al. (2003),
    Tackling the poor assumptions of naive Bayes text classifiers, ICML.

    References
    ----------
    C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
    Information Retrieval. Cambridge University Press, pp. 234-265.
    https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
    """

    @_deprecate_positional_args
    def __init__(self, *, alpha=1.0, fit_prior=True, class_prior=None):
        self.alpha = alpha
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    def _more_tags(self):
        return {'requires_positive_X': True}

    def _count(self, X, Y):
        """Count and smooth feature occurrences."""
        check_non_negative(X, "MultinomialNB (input X)")
        self.feature_count_ += safe_sparse_dot(Y.T, X)
        self.class_count_ += Y.sum(axis=0)

    def _update_feature_log_prob(self, alpha):
        """Apply smoothing to raw counts and recompute log probabilities"""
        smoothed_fc = self.feature_count_ + alpha
        smoothed_cc = smoothed_fc.sum(axis=1)
        self.feature_log_prob_ = (np.log(smoothed_fc) -
                                  np.log(smoothed_cc.reshape(-1, 1)))

    def _joint_log_likelihood(self, X):
        """Calculate the posterior log probability of the samples X"""
        return (safe_sparse_dot(X, self.feature_log_prob_.T) +
                self.class_log_prior_)

class BernoulliNB Found at: sklearn.naive_bayes

class BernoulliNB(_BaseDiscreteNB):
    """Naive Bayes classifier for multivariate Bernoulli models.

    Like MultinomialNB, this classifier is suitable for discrete data. The
    difference is that while MultinomialNB works with occurrence counts,
    BernoulliNB is designed for binary/boolean features.

    Read more in the :ref:`User Guide <bernoulli_naive_bayes>`.

    Parameters
    ----------
    alpha : float, default=1.0
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    binarize : float or None, default=0.0
        Threshold for binarizing (mapping to booleans) of sample features.
        If None, input is presumed to already consist of binary vectors.

    fit_prior : bool, default=True
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like of shape (n_classes,), default=None
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    class_log_prior_ : ndarray of shape (n_classes)
        Log probability of each class (smoothed).

    classes_ : ndarray of shape (n_classes,)
        Class labels known to the classifier

    feature_count_ : ndarray of shape (n_classes, n_features)
        Number of samples encountered for each (class, feature)
        during fitting. This value is weighted by the sample weight when
        provided.

    feature_log_prob_ : ndarray of shape (n_classes, n_features)
        Empirical log probability of features given a class, P(x_i|y).

    n_features_ : int
        Number of features of each sample.

    Examples
    --------
    >>> import numpy as np
    >>> rng = np.random.RandomState(1)
    >>> X = rng.randint(5, size=(6, 100))
    >>> Y = np.array([1, 2, 3, 4, 4, 5])
    >>> from sklearn.naive_bayes import BernoulliNB
    >>> clf = BernoulliNB()
    >>> clf.fit(X, Y)
    BernoulliNB()
    >>> print(clf.predict(X[2:3]))
    [3]

    References
    ----------
    C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
    Information Retrieval. Cambridge University Press, pp. 234-265.
    https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

    A. McCallum and K. Nigam (1998). A comparison of event models for naive
    Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for
    Text Categorization, pp. 41-48.

    V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with
    naive Bayes -- Which naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).
    """

    @_deprecate_positional_args
    def __init__(self, *, alpha=1.0, binarize=.0, fit_prior=True,
                 class_prior=None):
        self.alpha = alpha
        self.binarize = binarize
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    def _check_X(self, X):
        X = super()._check_X(X)
        if self.binarize is not None:
            X = binarize(X, threshold=self.binarize)
        return X

    def _check_X_y(self, X, y):
        X, y = super()._check_X_y(X, y)
        if self.binarize is not None:
            X = binarize(X, threshold=self.binarize)
        return X, y

    def _count(self, X, Y):
        """Count and smooth feature occurrences."""
        self.feature_count_ += safe_sparse_dot(Y.T, X)
        self.class_count_ += Y.sum(axis=0)

    def _update_feature_log_prob(self, alpha):
        """Apply smoothing to raw counts and recompute log probabilities"""
        smoothed_fc = self.feature_count_ + alpha
        smoothed_cc = self.class_count_ + alpha * 2
        self.feature_log_prob_ = (np.log(smoothed_fc) -
                                  np.log(smoothed_cc.reshape(-1, 1)))

    def _joint_log_likelihood(self, X):
        """Calculate the posterior log probability of the samples X"""
        n_classes, n_features = self.feature_log_prob_.shape
        n_samples, n_features_X = X.shape

        if n_features_X != n_features:
            raise ValueError("Expected input with %d features, got %d instead"
                             % (n_features, n_features_X))

        neg_prob = np.log(1 - np.exp(self.feature_log_prob_))
        # Compute  neg_prob . (1 - X).T  as  sum(neg_prob) - X . neg_prob
        jll = safe_sparse_dot(X, (self.feature_log_prob_ - neg_prob).T)
        jll += self.class_log_prior_ + neg_prob.sum(axis=1)

        return jll
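For a quick sanity check of the three classes excerpted above, here is a tiny usage example on random word-count data (unrelated to the news corpus), following the doctest examples in the scikit-learn docstrings:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))   # toy word-count features
y = np.array([1, 2, 3, 4, 5, 6])

for Model in (GaussianNB, MultinomialNB, BernoulliNB):
    clf = Model().fit(X, y)
    print(Model.__name__, clf.predict(X[2:3]))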
Summary
The above is the full content of "ML之NB: Text classification on the news text dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), the perceptron, and other algorithms", collected and organized for you by 生活随笔. We hope the article helps you solve the problems you encounter.