metapath2vec: Representation Learning for Heterogeneous Networks


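Preface

Over the weekend I set myself a goal of writing two blog posts (more precisely, at least two). Yesterday I finished one, an introduction to the Deep Interest Network (DIN) with a source code walkthrough. Today I was busy buying groceries, washing and cooking them, and exercising, so the daytime slipped by in a blur. It is now about 20 minutes to midnight, so here is a quick post~

With 10 minutes left before midnight it suddenly occurred to me … I can write a little now, leave a placeholder, and fill it in later. That way I only need to set one small new goal: it completes this weekend's goal and also pushes me to work harder in the future. Two birds with one stone. How clever of me.

(Added 2020-08-16) I am finally back to fill in the gap left by last week's goal. I also finished two more posts this week (an introduction and source code walkthrough for the AFM network, and another for the Product-based Neural Network (PNN)). Not bad at all.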

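A quick plug

You can search for “珍妮的算法之路” or “world4458” in WeChat to follow my official account. You can also have a look at my Zhihu column, PoorMemory-机器学习, where future posts will be published as well.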

metapath2vec

Paper information

Title: metapath2vec: Scalable Representation Learning for Heterogeneous Networks
Paper: https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf
Code: https://ericdongyx.github.io/metapath2vec/m2v.html
Published: KDD 2017
Authors: Yuxiao Dong, Nitesh V. Chawla, Ananthram Swami
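Affiliations: Microsoft Research (Yuxiao Dong), University of Notre Dame (Nitesh V. Chawla), Army Research Laboratory (Ananthram Swami)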

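Core ideas

The algorithm proposed in this paper targets representation learning on heterogeneous graphs. A heterogeneous network is one in which the number of node types or edge types is greater than 1, that is, the network contains several types of nodes or several types of edges.

On a heterogeneous network, walks generated by a plain Random Walk tend to be biased toward highly visible node types, and those types then dominate learning: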
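
As a minimal sketch of this definition (the function name `is_heterogeneous` and the toy data are made up for illustration, they are not from the paper or its code), a graph can be checked for heterogeneity by counting distinct node and edge type labels:

```python
def is_heterogeneous(node_type, edge_type):
    """True if the graph has more than one node type or more than one edge type.

    node_type: dict mapping node -> type label
    edge_type: dict mapping (u, v) edge -> relation label
    """
    return len(set(node_type.values())) > 1 or len(set(edge_type.values())) > 1


# Made-up toy data: two authors and one paper, a single relation type.
node_type = {"a1": "Author", "a2": "Author", "p1": "Paper"}
edge_type = {("a1", "p1"): "writes", ("a2", "p1"): "writes"}
print(is_heterogeneous(node_type, edge_type))  # True: two node types

# A plain co-author graph has one node type and one edge type.
print(is_heterogeneous({"a1": "Author", "a2": "Author"},
                       {("a1", "a2"): "coauthor"}))  # False
```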

However, Sun et al. demonstrated that heterogeneous random walks are biased to highly visible types of nodes—those with a dominant number of paths—and concentrated nodes—those with a governing percentage of paths pointing to a small set of nodes

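So the paper takes node types into account during the walk and proposes a walk based on meta-paths.

A meta-path is a predefined pattern of node types (the order in which the node type changes along a walk). For example, suppose we build a graph from papers (Paper), authors (Author), and organizations (Organization), and set the meta-path to "APA". The walk must then visit an Author node first, then a Paper node, and then an Author node again. Meta-paths are usually chosen to be symmetric, meaning the first and last node types are the same, so that when the walk reaches the last node of the meta-path it can start over from the beginning, somewhat like a closed loop.

In short, a meta-path predefines a rule. On a heterogeneous graph the walk has to follow this rule, so choosing the next node depends on node or edge types. The walking operation itself is essentially the same as in Random Walk; the difference is that the meta-path takes node and edge types into account.

Walking along the meta-path produces a set of walk sequences, and the skip-gram model from word2vec can then be used to learn node embeddings. metapath2vec uses negative sampling directly to sample nodes and learn the embeddings. metapath2vec++ argues that metapath2vec ignores node types during negative sampling, so it brings node types into negative sampling: the softmax is normalized only over nodes of the same type, so each node type gets its own distribution.

The actual implementation borrows heavily from the Word2Vec C/C++ code. LINE (see my post introducing the LINE graph embedding algorithm with a source code walkthrough) is likewise a modification of the Word2Vec code.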
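
To make the "APA" pattern concrete, here is a minimal sketch that unrolls a symmetric meta-path into the node type expected at each position of a walk. The helper name `expected_types` is hypothetical, and representing a meta-path as a string with one letter per node type is an assumption made for brevity.

```python
def expected_types(meta_path, walk_length):
    """Return the node type expected at each position of a walk.

    Assumes a symmetric meta-path such as "APA": the last type equals the
    first, so the pattern repeats with period len(meta_path) - 1.
    """
    if meta_path[0] != meta_path[-1]:
        raise ValueError("meta-path must start and end with the same type")
    period = len(meta_path) - 1
    return [meta_path[i % period] for i in range(walk_length)]


print(expected_types("APA", 7))
# ['A', 'P', 'A', 'P', 'A', 'P', 'A']
```

Because the first and last types coincide, the final "A" of one pass doubles as the first "A" of the next, which is why the period is one less than the length of the meta-path.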
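
The difference between the two variants can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the function names are made up, and negatives are drawn uniformly, whereas word2vec-style code normally samples from a smoothed frequency distribution.

```python
import random


def skipgram_pairs(walk, window):
    """Return (center, context) pairs within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


def sample_negatives(nodes, node_type, context, k, same_type_only, rng=random):
    """Draw k negative nodes for one context node.

    same_type_only=False mimics metapath2vec: candidates are all nodes.
    same_type_only=True mimics metapath2vec++: candidates are restricted to
    nodes whose type matches the context node.
    """
    pool = [n for n in nodes if n != context]
    if same_type_only:
        pool = [n for n in pool if node_type[n] == node_type[context]]
    if not pool:
        return []
    return [rng.choice(pool) for _ in range(k)]


# Made-up toy data.
node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P"}
nodes = sorted(node_type)

print(skipgram_pairs(["a1", "p1", "a2"], window=1))
print(sample_negatives(nodes, node_type, "a2", 3, same_type_only=False))
print(sample_negatives(nodes, node_type, "a2", 3, same_type_only=True))
```

With `same_type_only=True` the negatives for an author context are always authors, which corresponds to normalizing over one node type at a time.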

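Core ideas in detail

Meta-path-based random walk

A meta-path scheme $\mathcal{P}$ has the form

$$V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{l-1}} V_l$$

where $V_i$ denotes a node type and $R_i$ denotes the relation between node types $V_i$ and $V_{i+1}$. The transition probability at step $i$ of the walk is defined in the paper as

$$p(v^{i+1} \mid v_t^i, \mathcal{P}) = \begin{cases} \dfrac{1}{|N_{t+1}(v_t^i)|} & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) = t+1 \\ 0 & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) \neq t+1 \\ 0 & (v^{i+1}, v_t^i) \notin E \end{cases}$$

where $v_t^i \in V_t$ means that the node $v_t^i$ has type $V_t$, and $N_{t+1}(v_t^i)$ denotes the neighbours of $v_t^i$ whose type is $V_{t+1}$, i.e. $v^{i+1} \in V_{t+1}$.

As you can see, unlike Random Walk, each step has to check whether the type of the next node matches what the meta-path scheme prescribes.

The authors provide code for generating meta-path walks at https://ericdongyx.github.io/metapath2vec/m2v.html; I won't go into much detail here.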
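
The transition rule boils down to: move uniformly at random to a neighbour of the type the meta-path requires next, and never to anything else. A minimal sketch, assuming the graph is stored as a dict of neighbour sets and node types as a dict of labels (both representations and the function name are my own choices for illustration):

```python
def transition_prob(graph, node_type, current, candidate, next_type):
    """Probability of stepping from `current` to `candidate` in a meta-path walk.

    graph:      dict mapping node -> set of neighbour nodes
    node_type:  dict mapping node -> type label
    next_type:  the type the meta-path scheme requires at the next step
    """
    typed_neighbours = [n for n in graph[current] if node_type[n] == next_type]
    if candidate in typed_neighbours:
        return 1.0 / len(typed_neighbours)
    # Either not a neighbour at all, or a neighbour of the wrong type.
    return 0.0


# Made-up toy author/paper graph.
graph = {
    "a1": {"p1", "p2"},
    "a2": {"p1"},
    "p1": {"a1", "a2"},
    "p2": {"a1"},
}
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P"}

print(transition_prob(graph, node_type, "a1", "p1", "P"))  # 0.5
print(transition_prob(graph, node_type, "a1", "a2", "P"))  # 0.0, not a neighbour
print(transition_prob(graph, node_type, "p1", "a1", "P"))  # 0.0, wrong type
```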
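
Putting the type check into the walk itself might look like the sketch below. It is a simplified illustration rather than the authors' script: `metapath_walk` is a hypothetical name, the meta-path is a string with one letter per node type, and the walk simply stops early if no neighbour of the required type exists.

```python
import random


def metapath_walk(graph, node_type, start, meta_path, walk_length, rng=random):
    """Generate one walk of at most `walk_length` nodes guided by `meta_path`."""
    if node_type[start] != meta_path[0]:
        raise ValueError("start node type must match the first meta-path type")
    period = len(meta_path) - 1  # a symmetric meta-path repeats with this period
    walk = [start]
    while len(walk) < walk_length:
        next_type = meta_path[len(walk) % period]
        # Only neighbours of the required type are eligible for the next step.
        candidates = sorted(n for n in graph[walk[-1]] if node_type[n] == next_type)
        if not candidates:  # dead end: no neighbour of the required type
            break
        walk.append(rng.choice(candidates))
    return walk


# Made-up toy author/paper graph.
graph = {
    "a1": {"p1", "p2"},
    "a2": {"p1"},
    "p1": {"a1", "a2"},
    "p2": {"a1"},
}
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P"}

walk = metapath_walk(graph, node_type, "a1", "APA", walk_length=7)
print(walk)
print([node_type[n] for n in walk])  # alternates A, P, A, P, ...
```

A plain random walk would drop the `node_type[n] == next_type` filter and pick among all neighbours.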

import sys
import os
import random
from collections import Counter

class MetaPathGenerator:
    def __init__(self):
        self.id_author = dict()
        self.id_conf = dict()
        self.author_coauthorlist = dict()
        self.conf_authorlist = dict()
        self.author_conflist = dict()
        self.paper_author = dict()
        self.author_paper = dict()
        self.conf_paper = dict()
        self.paper_conf = dict()

    def read_data(self, dirpath):
        with open(dirpath + "/id_author.txt") as adictfile:
            for line in adictfile:
                toks = line.strip().split("\t")
                if len(toks) == 2:
                    self.id_author[toks[0]] = toks[1].replace(" ", "")

        #print "#authors", len(self.id_author)

        with open(dirpath + "/id_conf.txt") as cdictfile:
            for line in cdictfile:
                toks = line.strip().split("\t")
                if len(toks) == 2:
                    newconf = toks[1].replace(" ", "")
                    self.id_conf[toks[0]] = newconf

        #print "#conf", len(self.id_conf)

        with open(dirpath + "/paper_author.txt") as pafile:
            for line in pafile:
                toks = line.strip().split("\t")
                if len(toks) == 2:
                    p, a = toks[0], toks[1]
                    if p not in self.paper_author:
                        self.paper_author[p] = []
                    self.paper_author[p].append(a)
                    if a not in self.author_paper:
                        self.author_paper[a] = []
                    self.author_paper[a].append(p)

        with open(dirpath + "/paper_conf.txt") as pcfile:
            for line in pcfile:
                toks = line.strip().split("\t")
                if len(toks) == 2:
                    p, c = toks[0], toks[1]
                    self.paper_conf[p] = c 
                    if c not in self.conf_paper:
                        self.conf_paper[c] = []
                    self.conf_paper[c].append(p)

        sumpapersconf, sumauthorsconf = 0, 0
        conf_authors = dict()
        for conf in self.conf_paper:
            papers = self.conf_paper[conf]
            sumpapersconf += len(papers)
            for paper in papers:
                if paper in self.paper_author:
                    authors = self.paper_author[paper]
                    sumauthorsconf += len(authors)

        print("#confs  ", len(self.conf_paper))
        print("#papers ", sumpapersconf, "#papers per conf ", sumpapersconf // len(self.conf_paper))
        print("#authors", sumauthorsconf, "#authors per conf", sumauthorsconf // len(self.conf_paper))


    def generate_random_aca(self, outfilename, numwalks, walklength):
        for conf in self.conf_paper:
            self.conf_authorlist[conf] = []
            for paper in self.conf_paper[conf]:
                if paper not in self.paper_author: continue
                for author in self.paper_author[paper]:
                    self.conf_authorlist[conf].append(author)
                    if author not in self.author_conflist:
                        self.author_conflist[author] = []
                    self.author_conflist[author].append(conf)
        #print "author-conf list done"

        outfile = open(outfilename, 'w')
        for conf in self.conf_authorlist:
            conf0 = conf
            for j in range(0, numwalks):  # number of walks per conference
                conf = conf0  # restart each walk from the starting conference
                outline = self.id_conf[conf0]
                for i in range(0, walklength):
                    authors = self.conf_authorlist[conf]
                    numa = len(authors)
                    authorid = random.randrange(numa)
                    author = authors[authorid]
                    outline += " " + self.id_author[author]
                    confs = self.author_conflist[author]
                    numc = len(confs)
                    confid = random.randrange(numc)
                    conf = confs[confid]
                    outline += " " + self.id_conf[conf]
                outfile.write(outline + "\n")
        outfile.close()


#python py4genMetaPaths.py 1000 100 net_aminer output.aminer.w1000.l100.txt
#python py4genMetaPaths.py 1000 100 net_dbis   output.dbis.w1000.l100.txt

dirpath = "net_aminer" 
# OR 
dirpath = "net_dbis"

numwalks = int(sys.argv[1])
walklength = int(sys.argv[2])