
GNNs Have Great Potential: Learning and Applying Them, Starting Here


Author: 十方

When GNNs come up, which algorithms do you think of? And whichever they are, how many have you actually used? Most of us have read GNN papers but still lack hands-on experience, and for recommender systems in particular it is not obvious how to apply these models. So let's start with the DeepWalk paper: first the theory, then a hands-on walkthrough, and finally the applications.

GNN Background

At its core, a GNN learns a representation for every node in a network. These latent representations encode each node's "social" relations in the graph, turning discrete node ids into dense vectors that can later be used for classification or regression, or as features for downstream tasks. DeepWalk treats the paths produced by random walks as "sentences" and learns a representation for each "word" (node) in them. The DeepWalk paper reports F1-score gains of up to 10% when labeled data is sparse, and in some experiments it achieves better results with less training data. The paper's canonical example is Zachary's Karate network: nodes that belong to the same community in the input graph end up close together in the learned 2D embedding.

DeepWalk

Problem definition: the task is to classify every node in a social network. The graph is G = (V, E), where V is the set of nodes and E the set of edges. Part of the data is labeled, G_L = (V, E, X, Y), where X holds the node features and Y the class labels. Traditional machine learning learns the mapping from X to Y directly and ignores the dependencies between nodes. DeepWalk instead captures the graph's topology and learns a feature vector for every node in an unsupervised way, so the learned representations are independent of the label distribution.

Random walks: pick a root node and "randomly" walk out a path. Since adjacent nodes tend to be similar, this strategy mines the community structure of the network. Random walks are easy to parallelize, so different parts of a graph can be explored simultaneously, and when the graph changes slightly, the existing paths need not be recomputed. Just like word frequencies in natural language, the node visit frequencies produced by random walks follow a power law, which is exactly why NLP machinery can be applied to the walk sequences almost unchanged.

So after the random walks, the node vectors are learned simply by optimizing the skip-gram objective from the DeepWalk paper:

\min_{\Phi} \; -\log \Pr\bigl(\{v_{i-w}, \dots, v_{i+w}\} \setminus v_i \mid \Phi(v_i)\bigr)

where \Phi maps each node v_i to its embedding vector and w is the window size.

This objective is exactly skip-gram: each node is used to predict the nodes to its left and right in a walk. When skip-gram is trained on text, the corpus is a collection of sentences; here we extract the graph's "sentences" instead.
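For instance, a few short walks over a toy graph already form a corpus that skip-gram can consume directly (the node ids below are invented for illustration):

# each walk is one "sentence"; each node id plays the role of a word
corpus = [
    ['1', '3', '4', '3', '2'],
    ['2', '1', '5', '1', '3'],
    ['4', '3', '1', '2', '1'],
]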

The algorithm itself is simple: shuffle all nodes (to speed up convergence), then iterate over them, starting a random walk from each node, and fit the node vectors with skip-gram on the resulting sequences; repeat. Note that "random" here means uniformly at random. Some graphs also come with walk-like "by-products", such as the order in which a user browses web pages, which can be fed to the model directly.

Next, let's look at the core DeepWalk code:

# Source: https://github.com/phanein/deepwalk
# (imports restored from the repository so the snippet is self-contained)
import logging
import random
from collections import defaultdict
from collections.abc import Iterable
from itertools import permutations
from time import time

from six import iterkeys
from six.moves import zip_longest
from scipy.io import loadmat
from scipy.sparse import issparse

logger = logging.getLogger("deepwalk")


# Serializing the walks: one walk per line, node ids separated by spaces
with open(f, 'w') as fout:
    for walk in graph.build_deepwalk_corpus_iter(G=G,                      # the graph
                                                 num_paths=num_paths,      # walks per node
                                                 path_length=path_length,  # nodes per walk
                                                 alpha=alpha,              # restart probability
                                                 rand=rand):
        fout.write(u"{}\n".format(u" ".join(v for v in walk)))


class Graph(defaultdict):
    """Efficient basic implementation of nx-style graphs.

    The class extends defaultdict: a graph is simply a dict whose keys are
    nodes and whose values are lists of adjacent nodes.
    """

    def __init__(self):
        super(Graph, self).__init__(list)

    def nodes(self):
        return self.keys()

    def adjacency_iter(self):
        return self.iteritems()

    def subgraph(self, nodes={}):
        # Extract the subgraph induced by `nodes`
        subgraph = Graph()
        for n in nodes:
            if n in self:
                subgraph[n] = [x for x in self[n] if x in nodes]
        return subgraph

    def make_undirected(self):
        # The graph is undirected, so v in self[u] must imply u in self[v]
        t0 = time()
        for v in list(self):
            for other in self[v]:
                if v != other:
                    self[other].append(v)
        t1 = time()
        logger.info('make_undirected: added missing edges {}s'.format(t1 - t0))
        self.make_consistent()
        return self

    def make_consistent(self):
        # Sort adjacency lists and drop duplicates
        t0 = time()
        for k in iterkeys(self):
            self[k] = list(sorted(set(self[k])))
        t1 = time()
        logger.info('make_consistent: made consistent in {}s'.format(t1 - t0))
        self.remove_self_loops()
        return self

    def remove_self_loops(self):
        # A node must not be connected to itself
        removed = 0
        t0 = time()
        for x in self:
            if x in self[x]:
                self[x].remove(x)
                removed += 1
        t1 = time()
        logger.info('remove_self_loops: removed {} loops in {}s'.format(removed, (t1 - t0)))
        return self

    def check_self_loops(self):
        for x in self:
            for y in self[x]:
                if x == y:
                    return True
        return False

    def has_edge(self, v1, v2):
        # Is there an edge between the two nodes?
        if v2 in self[v1] or v1 in self[v2]:
            return True
        return False

    def degree(self, nodes=None):
        # Degree of a single node, or of every node in an iterable
        if isinstance(nodes, Iterable):
            return {v: len(self[v]) for v in nodes}
        else:
            return len(self[nodes])

    def order(self):
        "Returns the number of nodes in the graph"
        return len(self)

    def number_of_edges(self):
        "Returns the number of edges in the graph"
        return sum([self.degree(x) for x in self.keys()]) / 2

    def number_of_nodes(self):
        "Returns the number of nodes in the graph"
        return self.order()

    # The core routine
    def random_walk(self, path_length, alpha=0, rand=random.Random(), start=None):
        """ Returns a truncated random walk.

            path_length: Length of the random walk.
            alpha: probability of restarts.
            start: the start node of the random walk.
        """
        G = self
        if start:
            path = [start]
        else:
            # Sampling is uniform w.r.t V, and not w.r.t E
            path = [rand.choice(list(G.keys()))]

        while len(path) < path_length:
            cur = path[-1]
            if len(G[cur]) > 0:
                if rand.random() >= alpha:
                    path.append(rand.choice(G[cur]))  # step to a uniformly random neighbor
                else:
                    path.append(path[0])  # with probability alpha, restart at the origin
            else:
                break
        return [str(node) for node in path]

# TODO add build_walks in here

def build_deepwalk_corpus(G, num_paths, path_length, alpha=0,
                          rand=random.Random(0)):
    # This mirrors the algorithm flow in the paper
    walks = []
    nodes = list(G.nodes())
    for cnt in range(num_paths):  # outer loop: how many passes ("epochs") over the nodes
        rand.shuffle(nodes)       # shuffle the nodes to speed up convergence
        for node in nodes:        # every node produces one walk per pass
            walks.append(G.random_walk(path_length, rand=rand, alpha=alpha, start=node))
    return walks

def build_deepwalk_corpus_iter(G, num_paths, path_length, alpha=0,
                               rand=random.Random(0)):
    # Streaming variant of the above
    nodes = list(G.nodes())
    for cnt in range(num_paths):
        rand.shuffle(nodes)
        for node in nodes:
            yield G.random_walk(path_length, rand=rand, alpha=alpha, start=node)

def clique(size):
    return from_adjlist(permutations(range(1, size + 1)))

# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python
def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return zip_longest(*[iter(iterable)] * n, fillvalue=padvalue)

def parse_adjacencylist(f):
    adjlist = []
    for l in f:
        if l and l[0] != "#":
            introw = [int(x) for x in l.strip().split()]
            row = [introw[0]]
            row.extend(set(sorted(introw[1:])))
            adjlist.extend([row])
    return adjlist

def parse_adjacencylist_unchecked(f):
    adjlist = []
    for l in f:
        if l and l[0] != "#":
            adjlist.extend([[int(x) for x in l.strip().split()]])
    return adjlist

def load_adjacencylist(file_, undirected=False, chunksize=10000, unchecked=True):
    if unchecked:
        parse_func = parse_adjacencylist_unchecked
        convert_func = from_adjlist_unchecked
    else:
        parse_func = parse_adjacencylist
        convert_func = from_adjlist

    adjlist = []
    t0 = time()
    total = 0
    with open(file_) as f:
        for idx, adj_chunk in enumerate(map(parse_func, grouper(int(chunksize), f))):
            adjlist.extend(adj_chunk)
            total += len(adj_chunk)
    t1 = time()
    logger.info('Parsed {} edges with {} chunks in {}s'.format(total, idx, t1 - t0))

    t0 = time()
    G = convert_func(adjlist)
    t1 = time()
    logger.info('Converted edges to graph in {}s'.format(t1 - t0))

    if undirected:
        t0 = time()
        G = G.make_undirected()
        t1 = time()
        logger.info('Made graph undirected in {}s'.format(t1 - t0))

    return G

def load_edgelist(file_, undirected=True):
    G = Graph()
    with open(file_) as f:
        for l in f:
            x, y = l.strip().split()[:2]
            x = int(x)
            y = int(y)
            G[x].append(y)
            if undirected:
                G[y].append(x)
    G.make_consistent()
    return G

def load_matfile(file_, variable_name="network", undirected=True):
    mat_variables = loadmat(file_)
    mat_matrix = mat_variables[variable_name]
    return from_numpy(mat_matrix, undirected)

def from_networkx(G_input, undirected=True):
    G = Graph()
    for idx, x in enumerate(G_input.nodes()):
        for y in iterkeys(G_input[x]):
            G[x].append(y)
    if undirected:
        G.make_undirected()
    return G

def from_numpy(x, undirected=True):
    G = Graph()
    if issparse(x):
        cx = x.tocoo()
        for i, j, v in zip(cx.row, cx.col, cx.data):
            G[i].append(j)
    else:
        raise Exception("Dense matrices not yet supported.")
    if undirected:
        G.make_undirected()
    G.make_consistent()
    return G

def from_adjlist(adjlist):
    G = Graph()
    for row in adjlist:
        node = row[0]
        neighbors = row[1:]
        G[node] = list(sorted(set(neighbors)))
    return G

def from_adjlist_unchecked(adjlist):
    G = Graph()
    for row in adjlist:
        node = row[0]
        neighbors = row[1:]
        G[node] = neighbors
    return G
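A minimal usage sketch tying these pieces together (the file name graph.edgelist is hypothetical; any whitespace-separated "src dst" edge list works):

G = load_edgelist("graph.edgelist", undirected=True)
print(G.number_of_nodes(), G.number_of_edges())

walks = build_deepwalk_corpus(G,
                              num_paths=10,    # number of walks started from every node
                              path_length=40,  # nodes per walk
                              alpha=0,         # never restart
                              rand=random.Random(0))
# each walk is a list of node-id strings -- exactly the "sentences" skip-gram expects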

As for skip-gram itself, you can simply use the gensim toolkit:

import logging
from multiprocessing import cpu_count

from gensim.models import Word2Vec
from gensim.models.word2vec import Vocab  # note: this targets the old pre-4.0 gensim API

logger = logging.getLogger("deepwalk")


class Skipgram(Word2Vec):
    """A subclass to allow more customization of the Word2Vec internals."""

    def __init__(self, vocabulary_counts=None, **kwargs):
        self.vocabulary_counts = None

        kwargs["min_count"] = kwargs.get("min_count", 0)
        kwargs["workers"] = kwargs.get("workers", cpu_count())
        kwargs["size"] = kwargs.get("size", 128)     # embedding dimension
        kwargs["sentences"] = kwargs.get("sentences", None)
        kwargs["window"] = kwargs.get("window", 10)  # context window over each walk
        kwargs["sg"] = 1                             # skip-gram
        kwargs["hs"] = 1                             # hierarchical softmax

        if vocabulary_counts is not None:
            self.vocabulary_counts = vocabulary_counts

        super(Skipgram, self).__init__(**kwargs)
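For reference, a minimal sketch of the same training step against the current gensim (4.x) API, where the `size` parameter was renamed `vector_size` and the walks from build_deepwalk_corpus are passed in directly as sentences (the node id '42' below is illustrative):

from gensim.models import Word2Vec

# `walks` is the corpus produced by build_deepwalk_corpus above
model = Word2Vec(sentences=walks,
                 vector_size=128,  # embedding dimension
                 window=10,        # context window over each walk
                 min_count=0,      # keep every node, even rarely visited ones
                 sg=1,             # skip-gram
                 hs=1,             # hierarchical softmax, as in the paper
                 workers=4)

vec = model.wv['42']                        # embedding of node "42"
print(model.wv.most_similar('42', topn=5))  # closest nodes in embedding space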

Applications

In recommendation scenarios, whether for products or ads, users and items naturally form a bipartite graph through click, conversion, and purchase behavior. Running random walks on this graph and learning a vector for every node yields embeddings that generalize well even in settings short on features and labels, via user2user or item2item similarity. The extracted vectors can also feed downstream two-tower retrieval models or ranking models. If a social network is available, mining person-to-person relations to produce features for downstream tasks is another good option, and these embeddings are also worth trying as extra features in recommendation competitions. A sketch of the item2item idea follows.
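A hedged sketch of item2item under invented data (the sessions, item ids, and helper below are all illustrative, not a production recipe): build an item-item graph from consecutive clicks in user sessions, run DeepWalk-style walks on it, and train skip-gram on the walks.

import random
from collections import defaultdict
from gensim.models import Word2Vec

# Hypothetical click sessions: each list is the item sequence one user clicked
sessions = [['i1', 'i2', 'i3'], ['i2', 'i3', 'i5'], ['i1', 'i5', 'i2']]

# Connect items that were clicked consecutively within the same session
graph = defaultdict(list)
for s in sessions:
    for a, b in zip(s, s[1:]):
        graph[a].append(b)
        graph[b].append(a)

def random_walk(start, length, rng=random.Random(0)):
    # uniform random walk, truncated when a node has no neighbors
    path = [start]
    while len(path) < length and graph[path[-1]]:
        path.append(rng.choice(graph[path[-1]]))
    return path

# several passes over all items, one walk per item per pass
walks = [random_walk(node, 10) for _ in range(5) for node in list(graph)]
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1)
print(model.wv.most_similar('i2'))  # item2item: nearest items to i2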

