Academic Frontier Trend Analysis: Task 01

Published: 2024/3/26

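1. Task Description

  • Task topic: paper count statistics, i.e. counting the number of papers published in each computer science subfield during 2019;
  • Task content: understanding the problem, then reading the data with Pandas and computing the statistics;
  • Task outcome: learning the basic operations of Pandas.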

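2. Dataset

Each record in the dataset has the following fields:

  • id: the arXiv ID, which can be used to access the paper;
  • submitter: the person who submitted the paper;
  • authors: the authors of the paper;
  • title: the title of the paper;
  • comments: additional information such as page and figure counts;
  • journal-ref: information about the journal in which the paper was published;
  • doi: the digital object identifier, https://www.doi.org;
  • report-no: the report number;
  • categories: the categories (tags) the paper belongs to in the arXiv system;
  • license: the license of the paper;
  • abstract: the abstract of the paper;
  • versions: the version history of the paper;
  • authors_parsed: the parsed author information.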

3. arXiv Paper Categories

An excerpt of the mapping from category codes to category names:

'astro-ph': 'Astrophysics',
'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
'astro-ph.EP': 'Earth and Planetary Astrophysics',
'astro-ph.GA': 'Astrophysics of Galaxies',
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',

4. Implementation and Walkthrough

4.1 Importing packages and reading the raw data

    # Import the required packages
    import seaborn as sns              # plotting
    from bs4 import BeautifulSoup      # parsing the scraped arXiv pages
    import re                          # regular expressions, for matching string patterns
    import requests                    # network connections and HTTP requests
    import json                        # reading the data, which is stored as JSON
    import pandas as pd                # data processing and analysis
    import matplotlib.pyplot as plt    # plotting

    # Read the data
    data = []  # initialise
    # Advantages of the with statement: 1. the file handle is closed automatically;
    # 2. it is closed even if an exception is raised while reading
    with open("arxiv-metadata-oai-snapshot.json", 'r') as f:
        for line in f:
            data.append(json.loads(line))

    data = pd.DataFrame(data)  # convert the list to a DataFrame so it can be analysed with pandas
    data.shape                 # show the size of the data
    >>> (1796911, 14)

In (1796911, 14), 1796911 is the total number of records and 14 is the number of columns: the fields described in Section 2, plus update_date.
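The loading loop above keeps every field of every record in memory. As a minimal sketch, assuming only a few fields are needed for the later statistics, the same line-by-line parsing can drop the rest while reading. The helper name `read_records`, its `keep` parameter, and the two sample records are hypothetical and serve only as illustration.

```python
import io
import json


def read_records(lines, keep=("id", "categories", "update_date")):
    """Parse JSON-lines input, keeping only the fields named in `keep`."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        paper = json.loads(line)
        records.append({key: paper.get(key) for key in keep})
    return records


# Two made-up records standing in for the real snapshot file.
sample = io.StringIO(
    '{"id": "0704.0001", "title": "A", "categories": "hep-ph", "update_date": "2008-11-26"}\n'
    '{"id": "0704.0002", "title": "B", "categories": "math.CO cs.CG", "update_date": "2008-12-13"}\n'
)
records = read_records(sample)
print(len(records), sorted(records[0]))
```

The resulting list of dictionaries can be passed to pd.DataFrame in the same way as the full records.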

    data.head()

(index) | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed
0 | 0704.0001 | Pavel Nadolsky | C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-... | Calculation of prompt diphoton production cros... | 37 pages, 15 figures; published version | Phys.Rev.D76:013009,2007 | 10.1103/PhysRevD.76.013009 | ANL-HEP-PR-07-12 | hep-ph | None | A fully differential calculation in perturba... | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | 2008-11-26 | [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...
1 | 0704.0002 | Louis Theran | Ileana Streinu and Louis Theran | Sparsity-certifying Graph Decompositions | To appear in Graphs and Combinatorics | None | None | None | math.CO cs.CG | http://arxiv.org/licenses/nonexclusive-distrib... | We describe a new algorithm, the $(k,\ell)$-... | [{'version': 'v1', 'created': 'Sat, 31 Mar 200... | 2008-12-13 | [[Streinu, Ileana, ], [Theran, Louis, ]]
2 | 0704.0003 | Hongjun Pan | Hongjun Pan | The evolution of the Earth-Moon system based o... | 23 pages, 3 figures | None | None | None | physics.gen-ph | None | The evolution of Earth-Moon system is descri... | [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... | 2008-01-13 | [[Pan, Hongjun, ]]
3 | 0704.0004 | David Callan | David Callan | A determinant of Stirling cycle numbers counts... | 11 pages | None | None | None | math.CO | None | We show that a determinant of Stirling cycle... | [{'version': 'v1', 'created': 'Sat, 31 Mar 200... | 2007-05-23 | [[Callan, David, ]]
4 | 0704.0005 | Alberto Torchinsky | Wael Abu-Shammala and Alberto Torchinsky | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | None | Illinois J. Math. 52 (2008) no.2, 681-689 | None | None | math.CA math.FA | None | In this paper we show how to compute the $\L... | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | 2013-10-15 | [[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]

4.2 Data preprocessing

For a column of strings, describe() reports four statistics:

  • count: the number of elements in the column;
  • unique: the number of distinct values in the column;
  • top: the most frequent value in the column;
  • freq: the number of times the most frequent value occurs.

    data['categories'].describe()
    count      1796911
    unique       62055
    top       astro-ph
    freq         86914
    Name: categories, dtype: object

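This result shows that there are 1,796,911 records and 62,055 distinct values of the categories field (each value is a combination of one or more categories). The most common value is astro-ph, i.e. Astrophysics, which occurs 86,914 times.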
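To make the four statistics concrete, here is a small sketch on a made-up column of four category strings (the values are illustrative only, not taken from the dataset). Note that a multi-category string such as "math.CO cs.CG" counts as one distinct value.

```python
import pandas as pd

# Four made-up entries; "astro-ph" appears twice.
s = pd.Series(
    ["astro-ph", "cs.CV", "astro-ph", "math.CO cs.CG"],
    name="categories",
    dtype=object,
)
summary = s.describe()
print(summary)
```

Here count is 4, unique is 3, top is "astro-ph" and freq is 2.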

Because some papers belong to more than one category, we next determine how many individual categories appear in this dataset.

    # All individual categories
    unique_categories = set([i for l in [x.split(' ') for x in data['categories']] for i in l])
    len(unique_categories)
    >>> 176
    unique_categories
    >>> {'acc-phys', 'adap-org', 'alg-geom', ... 'stat.OT', 'stat.TH', 'supr-con'}

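Here the split function breaks each multi-category string on " " (a space) into a list, the for loops flatten those lists into individual categories, and set removes the duplicates, leaving every individual paper category.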
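The split, flatten and deduplicate steps can be traced on a tiny made-up list. This sketch uses only the standard library; the extra `Counter` step is an addition for illustration and shows how often each individual category occurs.

```python
from collections import Counter

# Made-up stand-in for the "categories" column.
categories_column = ["hep-ph", "math.CO cs.CG", "cs.CG", "math.CO"]

# Split each entry on spaces and flatten the resulting lists.
labels = [label for entry in categories_column for label in entry.split(" ")]

unique_categories = set(labels)   # duplicates removed
label_counts = Counter(labels)    # occurrences per individual category

print(sorted(unique_categories))
print(label_counts.most_common())
```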

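The result shows 176 paper categories, whereas the official website lists 153. Some categories in the data therefore do not appear on the website, but this does not affect the computer science papers we are interested in.

The task asks us to analyse papers from 2019 onward, so we first preprocess the date field to obtain the papers of all categories from 2019 onward: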

    data['update_date']  # dtype is object: the dates are stored as strings
    0          2008-11-26
    1          2008-12-13
    2          2008-01-13
    3          2007-05-23
    4          2013-10-15
                  ...
    1796906    2009-10-30
    1796907    2016-11-18
    1796908    2009-10-30
    1796909    2009-10-30
    1796910    2009-10-30
    Name: update_date, Length: 1796911, dtype: object

    # Convert update_date from a str such as 2019-02-20 to datetime and extract the year
    data['year'] = pd.to_datetime(data['update_date']).dt.year
    # Keep the rows whose year is 2019 or later
    data = data[data['year'] >= 2019]
    data.shape  # 395123 records remain
    >>> (395123, 15)
    data.reset_index(drop=True, inplace=True)  # renumber the rows
    data.head()  # inspect the result

(index) | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed | year
0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... | 2019
1 | 0704.0342 | Patrice Ntumba Pungu | B. Dugmore and PP. Ntumba | Cofibrations in the Category of Frolicher Spac... | 27 pages | None | None | None | math.AT | None | Cofibrations are defined in the category of ... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Dugmore, B., ], [Ntumba, PP., ]] | 2019
2 | 0704.0360 | Zaqarashvili | T.V. Zaqarashvili and K Murawski | Torsional oscillations of longitudinally inhom... | 6 pages, 3 figures, accepted in A&A | None | 10.1051/0004-6361:20077246 | None | astro-ph | None | We explore the effect of an inhomogeneous ma... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Zaqarashvili, T. V., ], [Murawski, K, ]] | 2019
3 | 0704.0525 | Sezgin Ayg\"un | Sezgin Aygun, Ismail Tarhan, Husnu Baysal | On the Energy-Momentum Problem in Static Einst... | This submission has been withdrawn by arXiv ad... | Chin.Phys.Lett.24:355-358,2007 | 10.1088/0256-307X/24/2/015 | None | gr-qc | None | This paper has been removed by arXiv adminis... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | 2019-10-21 | [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... | 2019
4 | 0704.0535 | Antonio Pipino | Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... | The Formation of Globular Cluster Systems in M... | 32 pages (referee format), 9 figures, ApJ acce... | Astrophys.J.665:295-305,2007 | 10.1086/519546 | None | astro-ph | None | The most massive elliptical galaxies show a ... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | 2019-08-19 | [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... | 2019
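The date filtering can be checked on a small made-up frame. One point worth keeping in mind is that update_date is the date a record was last updated, which is why the rows shown above carry 2007 identifiers such as 0704.0297 together with update dates in 2019. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Made-up records: one old paper, one old paper updated in 2019, one new paper.
toy = pd.DataFrame(
    {
        "id": ["0704.0001", "0704.0297", "2001.00001"],
        "update_date": ["2008-11-26", "2019-08-19", "2020-01-02"],
    }
)

# Parse the date strings and extract the year as an integer column.
toy["year"] = pd.to_datetime(toy["update_date"]).dt.year

# Keep rows from 2019 onward and renumber them from 0.
recent = toy[toy["year"] >= 2019].reset_index(drop=True)
print(recent)
```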

We now have all papers from 2019 onward. Next we select the papers in the computer science field:

    # Scrape all the categories
    website_url = requests.get('https://arxiv.org/category_taxonomy').text  # fetch the text of the page
    soup = BeautifulSoup(website_url, 'lxml')  # parse the page; the lxml parser is used for speed
    root = soup.find('div', {'id': 'category_taxonomy_list'})  # locate the entry-point tag
    tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)  # collect the tags

For background on the requests library, see https://www.jianshu.com/p/fb6ee6cc5c1c. Its core calls are:

    import requests
    requests.get(url)      # a GET request
    requests.post(url)     # a POST request; the full form is requests.post(url, data={...request body as a dict...})
    requests.put(url)
    requests.delete(url)
    requests.head(url)
    requests.options(url)

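For background on Beautiful Soup, see https://cuiqingcai.com/1319.html.

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into 4 types:

  • Tag: an individual HTML tag; its content can be retrieved simply by writing soup followed by the tag name.
  • NavigableString: the text inside a tag, obtained with .string; its type is NavigableString.
  • BeautifulSoup: the BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a special Tag, and its type, name and attributes can be retrieved in the same way.
  • Comment: a special kind of NavigableString whose output does not include the comment delimiters.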
    # Initialise the str and list variables
    level_1_name = ""
    level_2_name = ""
    level_2_code = ""
    level_1_names = []
    level_2_codes = []
    level_2_names = []
    level_3_codes = []
    level_3_names = []
    level_3_notes = []

    # Process each tag
    for t in tags:
        if t.name == "h2":
            level_1_name = t.text
            level_2_code = t.text
            level_2_name = t.text
        elif t.name == "h3":
            raw = t.text
            # Regular expression: pattern (.*)\((.*)\); replacement "\2"; input string raw
            level_2_code = re.sub(r"(.*)\((.*)\)", r"\2", raw)
            level_2_name = re.sub(r"(.*)\((.*)\)", r"\1", raw)
        elif t.name == "h4":
            raw = t.text
            level_3_code = re.sub(r"(.*) \((.*)\)", r"\1", raw)
            level_3_name = re.sub(r"(.*) \((.*)\)", r"\2", raw)
        elif t.name == "p":
            notes = t.text
            level_1_names.append(level_1_name)
            level_2_names.append(level_2_name)
            level_2_codes.append(level_2_code)
            level_3_names.append(level_3_name)
            level_3_codes.append(level_3_code)
            level_3_notes.append(notes)

    # Build a DataFrame from the collected information
    df_taxonomy = pd.DataFrame({
        'group_name': level_1_names,
        'archive_name': level_2_names,
        'archive_id': level_2_codes,
        'category_name': level_3_names,
        'categories': level_3_codes,
        'category_description': level_3_notes
    })
    df_taxonomy.head()

(index) | group_name | archive_name | archive_id | category_name | categories | category_description
0 | Computer Science | Computer Science | Computer Science | Artificial Intelligence | cs.AI | Covers all areas of AI except Vision, Robotics...
1 | Computer Science | Computer Science | Computer Science | Hardware Architecture | cs.AR | Covers systems organization and hardware archi...
2 | Computer Science | Computer Science | Computer Science | Computational Complexity | cs.CC | Covers models of computation, complexity class...
3 | Computer Science | Computer Science | Computer Science | Computational Engineering, Finance, and Science | cs.CE | Covers applications of computer science to the...
4 | Computer Science | Computer Science | Computer Science | Computational Geometry | cs.CG | Roughly includes material in ACM Subject Class...
    # Group by "group_name" and "archive_name". Note that groupby returns a new object
    # and does not modify df_taxonomy, so the table below is shown in its original order.
    df_taxonomy.groupby(["group_name", "archive_name"])
    df_taxonomy

(index) | group_name | archive_name | archive_id | category_name | categories | category_description
0 | Computer Science | Computer Science | Computer Science | Artificial Intelligence | cs.AI | Covers all areas of AI except Vision, Robotics...
1 | Computer Science | Computer Science | Computer Science | Hardware Architecture | cs.AR | Covers systems organization and hardware archi...
2 | Computer Science | Computer Science | Computer Science | Computational Complexity | cs.CC | Covers models of computation, complexity class...
3 | Computer Science | Computer Science | Computer Science | Computational Engineering, Finance, and Science | cs.CE | Covers applications of computer science to the...
4 | Computer Science | Computer Science | Computer Science | Computational Geometry | cs.CG | Roughly includes material in ACM Subject Class...
... | ... | ... | ... | ... | ... | ...
150 | Statistics | Statistics | Statistics | Computation | stat.CO | Algorithms, Simulation, Visualization
151 | Statistics | Statistics | Statistics | Methodology | stat.ME | Design, Surveys, Model Selection, Multiple Tes...
152 | Statistics | Statistics | Statistics | Machine Learning | stat.ML | Covers machine learning papers (supervised, un...
153 | Statistics | Statistics | Statistics | Other Statistics | stat.OT | Work in statistics that does not fit into the ...
154 | Statistics | Statistics | Statistics | Statistics Theory | stat.TH | stat.TH is an alias for math.ST. Asymptotics, ...

155 rows × 6 columns

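A note on the regular expressions in the code above: re.sub is used to replace the part of a string that matches a pattern.

    raw = 'Astrophysics(astro-ph)'
    re.sub(r"(.*)\((.*)\)", r"\2", raw)
    >>> 'astro-ph'

The arguments are:

  • pattern: the regular expression, of the form "any characters" + "(" + "any characters" + ")".
  • repl: the replacement string, here the content of the 2nd capture group.
  • string: the original string to search and replace in, here the raw scraped text.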
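As a variation on the re.sub calls, the same pattern can be matched once and both capture groups read from the match object. This is a sketch rather than the original code: the helper name `split_name_and_code` is hypothetical. It also makes one behaviour explicit: when the pattern does not match, re.sub returns the input unchanged, so a heading without parentheses passes through as-is.

```python
import re

PATTERN = re.compile(r"(.*)\((.*)\)")


def split_name_and_code(raw):
    """Return (name, code) for 'Name(code)'; (raw, None) if there are no parentheses."""
    match = PATTERN.fullmatch(raw)
    if match is None:
        return raw, None
    return match.group(1).strip(), match.group(2)


print(split_name_and_code("Astrophysics(astro-ph)"))
print(split_name_and_code("Computer Science"))

# re.sub leaves a non-matching string untouched.
print(re.sub(r"(.*)\((.*)\)", r"\2", "Computer Science"))
```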

An online regular expression tester that may be useful: https://tool.oschina.net/regex/
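The tag-walking logic of this section can also be tried offline. The sketch below swaps BeautifulSoup for the standard library's html.parser and runs on a made-up HTML fragment. The fragment only imitates the h2/h3/h4/p nesting that the loop relies on, so the real page structure is an assumption here, and the class name `TaxonomyParser` is hypothetical.

```python
import re
from html.parser import HTMLParser

HEADING = re.compile(r"(.*)\((.*)\)")


class TaxonomyParser(HTMLParser):
    """Collect one row per <p>, remembering the most recent h2/h3/h4 headings."""

    def __init__(self):
        super().__init__()
        self.tag = None
        self.group = self.archive_name = self.archive_id = ""
        self.category_name = self.category_id = ""
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3", "h4", "p"):
            self.tag = tag

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.tag = None

    def handle_data(self, data):
        text = data.strip()
        if self.tag is None or not text:
            return
        match = HEADING.fullmatch(text)
        if self.tag == "h2":    # a group heading resets the archive as well
            self.group = self.archive_name = self.archive_id = text
        elif self.tag == "h3" and match:    # "Name (code)"
            self.archive_name = match.group(1).strip()
            self.archive_id = match.group(2)
        elif self.tag == "h4" and match:    # "code (Name)"
            self.category_id = match.group(1).strip()
            self.category_name = match.group(2)
        elif self.tag == "p":   # the description closes one row
            self.rows.append({
                "group_name": self.group,
                "archive_name": self.archive_name,
                "archive_id": self.archive_id,
                "category_name": self.category_name,
                "categories": self.category_id,
                "category_description": text,
            })


SAMPLE_HTML = """
<h2>Computer Science</h2>
<h4>cs.AI (Artificial Intelligence)</h4>
<p>Covers all areas of AI.</p>
<h2>Physics</h2>
<h3>Astrophysics (astro-ph)</h3>
<h4>astro-ph.GA (Astrophysics of Galaxies)</h4>
<p>Phenomena pertaining to galaxies.</p>
"""

parser = TaxonomyParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
for row in rows:
    print(row)
```

The list of dictionaries in `rows` has the same keys as the columns of df_taxonomy, so it could be passed to pd.DataFrame directly.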

4.3 Data analysis and visualisation

We first look at how the papers are distributed across the top-level groups:

    # Merge the two DataFrames on their shared column "categories", count the papers
    # per "group_name" (the count is stored in the "id" column), and sort the result.
    _df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()
    _df

(index) | group_name | id
0 | Physics | 79985
1 | Mathematics | 51567
2 | Computer Science | 40067
3 | Statistics | 4054
4 | Electrical Engineering and Systems Science | 3297
5 | Quantitative Biology | 1994
6 | Quantitative Finance | 826
7 | Economics | 576
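The merge, deduplicate and count chain can be traced on two made-up frames. The sketch also shows a property of merging on the raw categories string: the join requires an exact match, so a paper whose categories value lists several categories (here "math.CO cs.CG") finds no taxonomy row and is left out of the counts. That may be why the counts above add up to fewer than the 395,123 filtered papers, though this is an inference, not something the original states. The frames `papers` and `taxonomy` are hypothetical.

```python
import pandas as pd

papers = pd.DataFrame(
    {
        "id": ["p1", "p2", "p3"],
        "categories": ["cs.AI", "math.CO cs.CG", "cs.CV"],
    }
)
taxonomy = pd.DataFrame(
    {
        "categories": ["cs.AI", "cs.CV", "math.CO"],
        "group_name": ["Computer Science", "Computer Science", "Mathematics"],
    }
)

# Left merge keeps every paper; unmatched papers get NaN in group_name.
merged = papers.merge(taxonomy, on="categories", how="left")

# groupby drops NaN keys by default, so unmatched papers are not counted.
counts = (
    merged.drop_duplicates(["id", "group_name"])
    .groupby("group_name")
    .agg({"id": "count"})
    .sort_values(by="id", ascending=False)
    .reset_index()
)
print(merged)
print(counts)
```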

Next we visualise the table above as a pie chart:

    fig = plt.figure(figsize=(15,12))
    explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1)
    plt.pie(_df["id"], labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
    plt.tight_layout()  # tight_layout adjusts the subplot parameters automatically so the plot fills the figure area
    plt.show()

For the parameters of pie, see https://www.cnblogs.com/biyoulin/p/9565350.html. The signature listed there, which comes from an older Matplotlib release, is:

    def pie(x, explode=None, labels=None, colors=None, autopct=None,
            pctdistance=0.6, shadow=False, labeldistance=1.1, startangle=None,
            radius=None, counterclock=True, wedgeprops=None, textprops=None,
            center=(0, 0), frame=False, rotatelabels=False, hold=None, data=None)
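The explode tuple in the pie chart code is written by hand, with one offset per slice. As a sketch of one alternative, assuming the goal is simply to pull out the small slices so their labels do not overlap, the offsets can be derived from each slice's share. The helper `explode_small_slices` and its `threshold` and `offset` parameters are hypothetical.

```python
def explode_small_slices(values, threshold=0.05, offset=0.2):
    """Return one offset per slice: `offset` if its share is below `threshold`, else 0."""
    total = sum(values)
    return tuple(offset if value / total < threshold else 0.0 for value in values)


# Paper counts per group, taken from the table above.
counts = [79985, 51567, 40067, 4054, 3297, 1994, 826, 576]
explode = explode_small_slices(counts)
print(explode)
```

The resulting tuple can be passed as the explode argument of plt.pie in place of the hand-written one. The three largest groups stay in place and the five smaller ones are offset equally.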

Next we count the papers from 2019 onward in each computer science subfield:

    # Again we use merge to join the two DataFrames on their shared column "categories",
    # then query for the group we want, and finally count and pivot the data:
    group_name = "Computer Science"
    cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
    cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year", values="id")

category_name | 2019 | 2020
Artificial Intelligence | 558 | 757
Computation and Language | 2153 | 2906
Computational Complexity | 131 | 188
Computational Engineering, Finance, and Science | 108 | 205
Computational Geometry | 199 | 216
Computer Science and Game Theory | 281 | 323
Computer Vision and Pattern Recognition | 5559 | 6517
Computers and Society | 346 | 564
Cryptography and Security | 1067 | 1238
Data Structures and Algorithms | 711 | 902
Databases | 282 | 342
Digital Libraries | 125 | 157
Discrete Mathematics | 84 | 81
Distributed, Parallel, and Cluster Computing | 715 | 774
Emerging Technologies | 101 | 84
Formal Languages and Automata Theory | 152 | 137
General Literature | 5 | 5
Graphics | 116 | 151
Hardware Architecture | 95 | 159
Human-Computer Interaction | 420 | 580
Information Retrieval | 245 | 331
Logic in Computer Science | 470 | 504
Machine Learning | 177 | 538
Mathematical Software | 27 | 45
Multiagent Systems | 85 | 90
Multimedia | 76 | 66
Networking and Internet Architecture | 864 | 783
Neural and Evolutionary Computing | 235 | 279
Numerical Analysis | 40 | 11
Operating Systems | 36 | 33
Other Computer Science | 67 | 69
Performance | 45 | 51
Programming Languages | 268 | 294
Robotics | 917 | 1298
Social and Information Networks | 202 | 325
Software Engineering | 659 | 804
Sound | 7 | 4
Symbolic Computation | 44 | 36
Systems and Control | 415 | 133

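The results show that Computer Vision and Pattern Recognition is the CS subcategory with by far the most papers, well ahead of every other CS subcategory, and its count rose from 5,559 in 2019 to 6,517 in 2020. In addition, Computation and Language (2,153), Cryptography and Security (1,067) and Robotics (917) each had more than, or close to, 1,000 papers in 2019, which matches our expectations.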
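The group, count and pivot chain that produces the year-by-category table can be followed on a small made-up frame. The `change` column is an addition for illustration and gives the year-over-year difference that the discussion above reads off by eye. The frame and its values are hypothetical.

```python
import pandas as pd

# Made-up papers: Robotics has 1 paper in 2019 and 2 in 2020; Databases has 1 in each year.
cats = pd.DataFrame(
    {
        "id": ["p1", "p2", "p3", "p4", "p5"],
        "year": [2019, 2019, 2020, 2020, 2020],
        "category_name": ["Robotics", "Databases", "Robotics", "Robotics", "Databases"],
    }
)

table = (
    cats.groupby(["year", "category_name"])["id"]
    .count()                      # papers per (year, category)
    .reset_index()
    .pivot(index="category_name", columns="year", values="id")
)
table["change"] = table[2020] - table[2019]  # growth from 2019 to 2020
print(table)
```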

總結

以上是生活随笔為你收集整理的学术前沿趋势分析Task01的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯,歡迎將生活随笔推薦給好友。