
[Competition Algorithm Learning] Academic Frontier Trend Analysis: Paper Data Statistics

Published: 2023/12/15


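Task 1: Paper Data Statistics

1.1 Task description

Task topic: paper counting, i.e. counting the number of papers in each computer science field for the whole of 2019;
Task content: understand the competition problem, then read the data with Pandas and compute the statistics;
Task outcome: learn the basic operations of Pandas;
Reference material: the joyful-pandas project from the open-source organization Datawhale.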

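1.2 Dataset introduction

Dataset source: dataset link;
The dataset has the following fields:
- id: arXiv ID, which can be used to access the paper;
- submitter: the person who submitted the paper;
- authors: the authors of the paper;
- title: the title of the paper;
- comments: number of pages, figures and other information about the paper;
- journal-ref: information about the journal in which the paper was published;
- doi: digital object identifier, https://www.doi.org;
- report-no: report number;
- categories: the categories or tags of the paper in the arXiv system;
- license: the license of the article;
- abstract: the abstract of the paper;
- versions: the versions of the paper;
- update_date: the update date of the record (used below to extract the year);
- authors_parsed: the parsed author information.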
Sample record:

{
  "id": "0704.0001",
  "submitter": "Pavel Nadolsky",
  "authors": "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  "title": "Calculation of prompt diphoton production cross sections at Tevatron and LHC energies",
  "comments": "37 pages, 15 figures; published version",
  "journal-ref": "Phys.Rev.D76:013009,2007",
  "doi": "10.1103/PhysRevD.76.013009",
  "report-no": "ANL-HEP-PR-07-12",
  "categories": "hep-ph",
  "license": null,
  "abstract": " A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced sensitivity to the signal can be obtained with judicious selection of events.",
  "versions": [
    {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
    {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"}
  ],
  "update_date": "2008-11-26",
  "authors_parsed": [
    ["Balázs", "C.", ""],
    ["Berger", "E. L.", ""],
    ["Nadolsky", "P. M.", ""],
    ["Yuan", "C. -P.", ""]
  ]
}

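1.3 arXiv paper categories

From the official arXiv site we looked up the category names and their descriptions.

Links: the Subject Classifications part of section 5.3 of https://arxiv.org/help/api/user-manual, or https://arxiv.org/category_taxonomy. Part of the 153 paper categories is listed below: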

'astro-ph': 'Astrophysics', 'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics', 'astro-ph.EP': 'Earth and Planetary Astrophysics', 'astro-ph.GA': 'Astrophysics of Galaxies', 'cs.AI': 'Artificial Intelligence', 'cs.AR': 'Hardware Architecture', 'cs.CC': 'Computational Complexity', 'cs.CE': 'Computational Engineering, Finance, and Science', 'cs.CV': 'Computer Vision and Pattern Recognition', 'cs.CY': 'Computers and Society', 'cs.DB': 'Databases', 'cs.DC': 'Distributed, Parallel, and Cluster Computing', 'cs.DL': 'Digital Libraries', 'cs.NA': 'Numerical Analysis', 'cs.NE': 'Neural and Evolutionary Computing', 'cs.NI': 'Networking and Internet Architecture', 'cs.OH': 'Other Computer Science', 'cs.OS': 'Operating Systems',

1.4 Implementation and walkthrough

1.4.1 Import the packages and read the raw data

# Import the required packages
import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # parsing the page scraped from arXiv
import re  # regular expressions, for matching string patterns
import requests  # network access: send HTTP requests and fetch a page by URL
import json  # reading the data, which is in JSON format
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting

The package versions used here (Python 3.7.4) are:

seaborn: 0.9.0
BeautifulSoup: 4.8.0
requests: 2.22.0
json: 0.8.5
pandas: 0.25.1
matplotlib: 3.1.1

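# Read the data
data = []  # initialize

# Advantages of the with statement: 1. the file handle is closed automatically;
# 2. the file is still closed if an exception is raised while reading
with open("arxiv-metadata-oai-snapshot.json", 'r') as f:
    for line in f:
        data.append(json.loads(line))

data = pd.DataFrame(data)  # convert the list to a DataFrame so it can be analyzed with pandas
data.shape  # show the size of the data

Output: (1778381, 14)

Here 1778381 is the total number of records and 14 is the number of features, corresponding to the 14 kinds of paper information described in section 1.2.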
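Loading every field of every record can take a lot of memory. The following is a minimal sketch, assuming only a few fields are needed later; the helper name `load_records`, its `limit` parameter and the small stand-in file are hypothetical and not part of the original tutorial.

```python
import json
import os
import tempfile

import pandas as pd


def load_records(path, fields, limit=None):
    """Read a JSON Lines file, keeping only `fields` from each record (hypothetical helper)."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            record = json.loads(line)
            rows.append({key: record.get(key) for key in fields})
    return pd.DataFrame(rows, columns=fields)


# Tiny stand-in file so the sketch runs without the real snapshot.
sample = [
    {"id": "0704.0001", "categories": "hep-ph", "update_date": "2008-11-26", "title": "A"},
    {"id": "0704.0002", "categories": "math.CO cs.CG", "update_date": "2008-12-13", "title": "B"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for rec in sample:
        tmp.write(json.dumps(rec) + "\n")

demo = load_records(tmp.name, ["id", "categories", "update_date"])
os.remove(tmp.name)
print(demo.shape)
```

With the real file, the same call would take the snapshot path instead of the temporary file; `limit` is handy for trying the rest of the pipeline on a small slice first.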

data.head()  # show the first five rows of the data

1.4.2 Data preprocessing

First, a rough look at the category information of the papers:

'''
count: the number of elements in the column;
unique: the number of distinct values in the column;
top: the most frequent value in the column;
freq: the number of times the most frequent value appears;
'''
data["categories"].describe()

count      1778381
unique       61371
top       astro-ph
freq         86914
Name: categories, dtype: object

The result shows 1778381 records and 61371 distinct category values. A paper can carry several categories, so a paper labelled cs.AI and cs.MM and a paper labelled cs.AI and cs.OS count as different values; this is only a rough count. The most frequent value is astro-ph, i.e. Astrophysics, which appears 86914 times.
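To make the four fields concrete, here is a small sketch on made-up data (the four category strings are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Four toy category strings; two of them are identical.
cats = pd.Series(
    ["astro-ph", "cs.AI cs.MM", "astro-ph", "cs.AI cs.OS"],
    name="categories",
)

summary = cats.describe()
print(summary)
```

The two multi-category strings share cs.AI but are still counted as two different values, which is why `unique` overstates the number of real categories.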

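Because some papers have more than one category, we next determine how many distinct categories appear in this dataset.

# All distinct categories
unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
len(unique_categories)
unique_categories

Here the split function breaks each multi-category string on " " (a space) into a list, the for loops flatten those lists into individual categories, and the set type removes the duplicates, leaving all distinct paper categories.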
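The same flattening can also be written with pandas string methods. This is a sketch on toy data, assuming pandas 0.25 or later (where `Series.explode` is available); it is an alternative formulation, not the code used in the tutorial.

```python
import pandas as pd

toy = pd.DataFrame({"categories": ["cs.AI cs.MM", "cs.AI cs.OS", "astro-ph"]})

# Nested iteration, equivalent to the comprehension in the tutorial.
unique_by_set = set(c for cats in toy["categories"] for c in cats.split(" "))

# pandas-only form: split each string into a list, then emit one row per label.
exploded = toy["categories"].str.split(" ").explode()
unique_by_explode = set(exploded.unique())

print(sorted(unique_by_set))
print(exploded.value_counts())
```

The exploded form has the side benefit that `value_counts()` gives the number of papers per individual label.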

176
{'acc-phys', 'adap-org', 'alg-geom', 'ao-sci', 'astro-ph', 'astro-ph.CO', 'astro-ph.EP', 'astro-ph.GA', 'astro-ph.HE', 'astro-ph.IM', 'astro-ph.SR', 'atom-ph', 'bayes-an', 'chao-dyn', 'chem-ph', 'cmp-lg', 'comp-gas', 'cond-mat', 'cond-mat.dis-nn', 'cond-mat.mes-hall', 'cond-mat.mtrl-sci', 'cond-mat.other', 'cond-mat.quant-gas', 'cond-mat.soft', 'cond-mat.stat-mech', 'cond-mat.str-el', 'cond-mat.supr-con', 'cs.AI', 'cs.AR', 'cs.CC', 'cs.CE', 'cs.CG', 'cs.CL', 'cs.CR', 'cs.CV', 'cs.CY', 'cs.DB', 'cs.DC', 'cs.DL', 'cs.DM', 'cs.DS', 'cs.ET', 'cs.FL', 'cs.GL', 'cs.GR', 'cs.GT', 'cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS', 'cs.PF', 'cs.PL', 'cs.RO', 'cs.SC', 'cs.SD', 'cs.SE', 'cs.SI', 'cs.SY', 'dg-ga', 'econ.EM', 'econ.GN', 'econ.TH', 'eess.AS', 'eess.IV', 'eess.SP', 'eess.SY', 'funct-an', 'gr-qc', 'hep-ex', 'hep-lat', 'hep-ph', 'hep-th', 'math-ph', 'math.AC', 'math.AG', 'math.AP', 'math.AT', 'math.CA', 'math.CO', 'math.CT', 'math.CV', 'math.DG', 'math.DS', 'math.FA', 'math.GM', 'math.GN', 'math.GR', 'math.GT', 'math.HO', 'math.IT', 'math.KT', 'math.LO', 'math.MG', 'math.MP', 'math.NA', 'math.NT', 'math.OA', 'math.OC', 'math.PR', 'math.QA', 'math.RA', 'math.RT', 'math.SG', 'math.SP', 'math.ST', 'mtrl-th', 'nlin.AO', 'nlin.CD', 'nlin.CG', 'nlin.PS', 'nlin.SI', 'nucl-ex', 'nucl-th', 'patt-sol', 'physics.acc-ph', 'physics.ao-ph', 'physics.app-ph', 'physics.atm-clus', 'physics.atom-ph', 'physics.bio-ph', 'physics.chem-ph', 'physics.class-ph', 'physics.comp-ph', 'physics.data-an', 'physics.ed-ph', 'physics.flu-dyn', 'physics.gen-ph', 'physics.geo-ph', 'physics.hist-ph', 'physics.ins-det', 'physics.med-ph', 'physics.optics', 'physics.plasm-ph', 'physics.pop-ph', 'physics.soc-ph', 'physics.space-ph', 'plasm-ph', 'q-alg', 'q-bio', 'q-bio.BM', 'q-bio.CB', 'q-bio.GN', 'q-bio.MN', 'q-bio.NC', 'q-bio.OT', 'q-bio.PE', 'q-bio.QM', 'q-bio.SC', 'q-bio.TO', 'q-fin.CP', 'q-fin.EC', 'q-fin.GN', 'q-fin.MF', 'q-fin.PM', 'q-fin.PR', 'q-fin.RM', 'q-fin.ST', 'q-fin.TR', 'quant-ph', 'solv-int', 'stat.AP', 'stat.CO', 'stat.ME', 'stat.ML', 'stat.OT', 'stat.TH', 'supr-con'}

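The result shows 176 paper categories in total, which is more than the 153 listed in the Subject Classifications part of section 5.3 of https://arxiv.org/help/api/user-manual or at https://arxiv.org/category_taxonomy. This means the dataset contains some categories that do not appear on the official site, which is a small detail worth noting. It does not affect the computer science papers, though: they still fall into the 40 categories below, and the categories extracted from the raw data correspond one-to-one with those from the official site.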

'cs.AI': 'Artificial Intelligence', 'cs.AR': 'Hardware Architecture', 'cs.CC': 'Computational Complexity', 'cs.CE': 'Computational Engineering, Finance, and Science', 'cs.CG': 'Computational Geometry', 'cs.CL': 'Computation and Language', 'cs.CR': 'Cryptography and Security', 'cs.CV': 'Computer Vision and Pattern Recognition', 'cs.CY': 'Computers and Society', 'cs.DB': 'Databases', 'cs.DC': 'Distributed, Parallel, and Cluster Computing', 'cs.DL': 'Digital Libraries', 'cs.DM': 'Discrete Mathematics', 'cs.DS': 'Data Structures and Algorithms', 'cs.ET': 'Emerging Technologies', 'cs.FL': 'Formal Languages and Automata Theory', 'cs.GL': 'General Literature', 'cs.GR': 'Graphics', 'cs.GT': 'Computer Science and Game Theory', 'cs.HC': 'Human-Computer Interaction', 'cs.IR': 'Information Retrieval', 'cs.IT': 'Information Theory', 'cs.LG': 'Machine Learning', 'cs.LO': 'Logic in Computer Science', 'cs.MA': 'Multiagent Systems', 'cs.MM': 'Multimedia', 'cs.MS': 'Mathematical Software', 'cs.NA': 'Numerical Analysis', 'cs.NE': 'Neural and Evolutionary Computing', 'cs.NI': 'Networking and Internet Architecture', 'cs.OH': 'Other Computer Science', 'cs.OS': 'Operating Systems', 'cs.PF': 'Performance', 'cs.PL': 'Programming Languages', 'cs.RO': 'Robotics', 'cs.SC': 'Symbolic Computation', 'cs.SD': 'Sound', 'cs.SE': 'Software Engineering', 'cs.SI': 'Social and Information Networks', 'cs.SY': 'Systems and Control',

Our task is to analyze papers from 2019 onward, so we first preprocess the time feature to obtain all papers, of every category, from 2019 onward:

data["year"] = pd.to_datetime(data["update_date"]).dt.year  # convert update_date from a str such as 2019-02-20 to datetime and extract the year
del data["update_date"]  # drop the update_date feature; it has served its purpose
data = data[data["year"] >= 2019]  # keep the rows whose year is 2019 or later and discard the rest
# data.groupby(['categories','year'])  # group by categories, and within the same category by year
data.reset_index(drop=True, inplace=True)  # renumber the index
data  # view the result
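The year extraction and filter can be checked on a few made-up rows first. A minimal sketch, with toy ids and dates that are not from the dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "update_date": ["2008-11-26", "2019-02-20", "2020-07-01"],
})

# Parse the date strings, then keep only the integer year.
toy["year"] = pd.to_datetime(toy["update_date"]).dt.year

# Boolean mask keeps 2019 and later; reset_index renumbers the rows from 0.
recent = toy[toy["year"] >= 2019].reset_index(drop=True)
print(recent)
```

Note that the condition `>= 2019` keeps every later year as well; to restrict the analysis to 2019 alone, the comparison would be `== 2019`.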

At this point we have all papers from 2019 onward. Next we pick out all the articles in the computer science field:

# Scrape all the categories
website_url = requests.get('https://arxiv.org/category_taxonomy').text  # fetch the text of the web page
soup = BeautifulSoup(website_url, 'lxml')  # parse the page; the lxml parser is used here for speed
root = soup.find('div', {'id': 'category_taxonomy_list'})  # find the entry tag for the taxonomy list
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)  # read the tags

# Initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

# Walk through the tags
for t in tags:
    if t.name == "h2":
        level_1_name = t.text
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)", r"\2", raw)  # regex: pattern (.*)\((.*)\); replacement "\2"; string being processed: raw
        level_2_name = re.sub(r"(.*)\((.*)\)", r"\1", raw)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)", r"\1", raw)
        level_3_name = re.sub(r"(.*) \((.*)\)", r"\2", raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)

# Build a DataFrame from the information above
df_taxonomy = pd.DataFrame({
    'group_name': level_1_names,
    'archive_name': level_2_names,
    'archive_id': level_2_codes,
    'category_name': level_3_names,
    'categories': level_3_codes,
    'category_description': level_3_notes
})

# Group by "group_name" and, within each group, by "archive_name"
# (the result is not assigned, so df_taxonomy itself is unchanged)
df_taxonomy.groupby(["group_name", "archive_name"])
df_taxonomy

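A note on the regular expression operations in the code above: we use re.sub to replace the matches in a string.

'''
pattern: the pattern string of the regular expression.
repl: the replacement string; it may also be a function.
string: the original string to be searched and replaced.
count: the maximum number of replacements after matching; the default 0 replaces all matches.
flags: the matching mode used at compile time, in numeric form.
pattern, repl and string are required arguments.
'''
re.sub(pattern, repl, string, count=0, flags=0)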

An example:

import re

phone = "2004-959-559 # This is a phone number"

# Remove the comment
num = re.sub(r'#.*$', "", phone)
print("Phone number : ", num)

# Remove everything that is not a digit
num = re.sub(r'\D', "", phone)
print("Phone number : ", num)

Output:

Phone number :  2004-959-559
Phone number :  2004959559

For more details, see: https://www.runoob.com/python3/python3-reg-expressions.html

For our code:

re.sub(r"(.*)\((.*)\)", r"\2", raw)
# raw = Astrophysics(astro-ph)
# output = astro-ph

The corresponding arguments are:

The pattern string has the form "any characters" + "(" + "any characters" + ")".
The replacement string repl is the content of the second group.
The original string to be searched and replaced is the raw scraped text.
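The two capture groups can be tried on sample strings without scraping anything. A minimal sketch; the second sample string ("cs.AI (Artificial Intelligence)") is an assumed example of the code-then-name form handled by the variant pattern with a space, not text copied from the arXiv page.

```python
import re

# Name followed by a code in parentheses: group 1 = name, group 2 = code.
ARCHIVE_PATTERN = r"(.*)\((.*)\)"
raw = "Astrophysics(astro-ph)"
archive_name = re.sub(ARCHIVE_PATTERN, r"\1", raw)
archive_code = re.sub(ARCHIVE_PATTERN, r"\2", raw)

# Variant with a space before the parenthesis: group 1 = code, group 2 = name.
CATEGORY_PATTERN = r"(.*) \((.*)\)"
raw_category = "cs.AI (Artificial Intelligence)"  # assumed sample text
category_code = re.sub(CATEGORY_PATTERN, r"\1", raw_category)
category_name = re.sub(CATEGORY_PATTERN, r"\2", raw_category)

# re.match exposes both groups at once instead of calling re.sub twice.
groups = re.match(CATEGORY_PATTERN, raw_category).groups()

print(archive_name, archive_code)
print(category_code, category_name)
```

The parentheses that should be matched literally are escaped as `\(` and `\)`; the unescaped ones define the groups that `\1` and `\2` refer to.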

We also recommend an online regular expression tester: https://tool.oschina.net/regex/

1.4.3 Data analysis and visualization

First, let us look at how the papers are distributed across the top-level groups:

_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()

_df

We use the merge function to join the two dataframes on their shared attribute "categories", then count the papers per "group_name"; the counts are placed in the "id" column and sorted in descending order.
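One detail worth checking in this step: the merge key is the full `categories` string, while the taxonomy holds one code per row, so a paper labelled with several categories finds no match and drops out of the per-group count. The sketch below shows this on toy data and compares it with splitting the labels first. The helper `count_by_group`, the toy frames and the explode-based variant are illustrative assumptions, not the tutorial's method.

```python
import pandas as pd

papers = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "categories": ["cs.AI", "cs.AI cs.CV", "hep-ph"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "cs.CV", "hep-ph"],
    "group_name": ["Computer Science", "Computer Science", "Physics"],
})


def count_by_group(frame):
    """Same chain as in the tutorial, applied to `frame` (hypothetical helper)."""
    return (
        frame.merge(taxonomy, on="categories", how="left")
        .drop_duplicates(["id", "group_name"])
        .groupby("group_name")
        .agg({"id": "count"})
        .sort_values(by="id", ascending=False)
        .reset_index()
    )


# As in the tutorial: "cs.AI cs.CV" matches no taxonomy row, so p2 is not counted.
as_is = count_by_group(papers)

# Variant: one row per label before merging; drop_duplicates then counts p2 once per group.
exploded = papers.assign(categories=papers["categories"].str.split(" ")).explode("categories")
per_label = count_by_group(exploded)

print(as_is)
print(per_label)
```

Whether multi-category papers should count toward each of their groups is a modelling choice; the sketch only makes the difference between the two readings visible.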

The result is as follows:

Next we visualize the result above with a pie chart:

fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1)
plt.pie(_df["id"], labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()

The result is as follows:

Next we count the papers from 2019 onward in each computer science subfield:

group_name = "Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year", values="id")

Again we use the merge function to join the two dataframes on their shared feature categories, and then query for the group we want. We then count and pivot the data to obtain the following result:
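The query-then-pivot chain can be followed on a handful of made-up rows. In this sketch the frame `merged` stands in for the output of the merge, and the `pd.crosstab` call is an alternative shown for comparison rather than the tutorial's code.

```python
import pandas as pd

# Stand-in for the merged frame: one row per paper with its year, group and category.
merged = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4", "p5"],
    "year": [2019, 2019, 2020, 2020, 2020],
    "group_name": ["Computer Science"] * 4 + ["Physics"],
    "category_name": ["Robotics", "Robotics", "Robotics", "Databases", "Optics"],
})

group_name = "Computer Science"
cats = merged.query("group_name == @group_name")  # @ refers to the Python variable

# Count rows per (year, category), then spread the years into columns.
table = (
    cats.groupby(["year", "category_name"])
    .count()
    .reset_index()
    .pivot(index="category_name", columns="year", values="id")
)

# Alternative: crosstab counts directly and fills missing combinations with 0.
alt = pd.crosstab(cats["category_name"], cats["year"])

print(table)
print(alt)
```

In the pivoted table a category with no papers in a given year shows up as NaN, whereas `crosstab` reports 0 for the same cell.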

The result shows that Computer Vision and Pattern Recognition is the CS subcategory with the most papers, far ahead of the other CS subcategories, and its paper count is still growing year by year. In addition, Computation and Language, Cryptography and Security, and Robotics each had more than, or close to, 1000 papers in 2019, which matches our expectations.
