Original text:
MIND: Microsoft News Recommendation Dataset
Implement news recommendation methods
News recommendation is an important technique for personalized news services. Compared with product and movie recommendation, which have been studied comprehensively, research on news recommendation is much more limited, mainly because of the lack of a high-quality benchmark dataset.
The MIND dataset for news recommendation was collected from anonymized behavior logs of the Microsoft News website. It contains 1 million randomly sampled users who had at least 5 news clicks during the 6 weeks from October 12 to November 22, 2019. To protect user privacy, each user was de-linked from the production system when securely hashed into an anonymized ID. The news click behaviors of these users during this period were also collected and formatted into impression logs. The impression logs from the last week are used for testing, and the logs from the fifth week for training. For samples in the training set, the click behaviors from the first four weeks are used to construct the news click history for user modeling. Within the training data, the samples from the last day of the fifth week are used as the validation set. This dataset is a small version of MIND (MIND-small), built by randomly sampling 50,000 users and their behavior logs. MIND-small contains only the training and validation sets.
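Translation:
MIND: Microsoft News Recommendation Dataset
Implementing news recommendation methods
News recommendation is a key technique for personalized news services. Product and movie recommendation have been studied extensively, but research on news recommendation is more limited, mainly because a high-quality benchmark dataset has been missing.
The MIND news recommendation dataset was collected from anonymized behavior logs of the Microsoft News website. It randomly samples 1 million users who each clicked on at least 5 news items in the 6 weeks from October 12 to November 22, 2019. To protect user privacy, each user was securely hashed into an anonymized ID and de-linked from the production system. The news click behaviors of these users over the same period were also collected and formatted into impression logs. The impression logs from the final (sixth) week are used for testing, and those from the fifth week for training. For samples in the training set, the click behaviors from the first four weeks are used to build the news click history for user modeling. Within the training data, the samples from the last day of the fifth week serve as the validation set. This dataset is a small version of MIND (MIND-small), obtained by randomly sampling 50,000 users and their behavior logs. MIND-small contains only the training and validation sets.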
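The week-based split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text gives the collection window and the role of each week, but not exact cutoffs, so the sketch assumes weeks are consecutive 7-day blocks starting on October 12 and that splits change at calendar-day boundaries. The function name `assign_split` and the label strings are hypothetical.

```python
from datetime import date

# Collection window as stated in the dataset description.
START = date(2019, 10, 12)
END = date(2019, 11, 22)


def assign_split(day: date) -> str:
    """Map a calendar day to 'history', 'train', 'valid', or 'test'.

    Assumption: weeks are consecutive 7-day blocks counted from START.
    """
    if not START <= day <= END:
        raise ValueError(f"{day} is outside the collection window")
    offset = (day - START).days   # 0 .. 41
    week = offset // 7 + 1        # 1 .. 6
    if week <= 4:
        return "history"          # weeks 1-4: click history for user modeling
    if week == 5:
        # Last day of week 5 is held out as the validation set.
        return "valid" if offset % 7 == 6 else "train"
    return "test"                 # week 6


if __name__ == "__main__":
    for d in (date(2019, 10, 12), date(2019, 11, 9),
              date(2019, 11, 15), date(2019, 11, 16)):
        print(d.isoformat(), assign_split(d))
```

Under these assumptions, weeks 1 to 4 run from October 12 to November 8, training covers November 9 to 14, validation is November 15, and testing covers November 16 to 22.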
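You can download the dataset from the official website; I have also shared a copy on Baidu Netdisk. Follow my official account and reply "2021011503" to get the download link.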
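Once the files are downloaded, a single impression-log record can be parsed roughly as follows. The column layout is not given in this post; the sketch assumes the tab-separated `behaviors.tsv` layout from the dataset's public documentation (impression ID, user ID, time, space-separated click history, space-separated `newsID-label` candidates with label 1 for clicked and 0 for not clicked). The sample line, its IDs, and the names `Impression` and `parse_behavior_line` are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class Impression(NamedTuple):
    impression_id: str
    user_id: str
    time: str
    history: List[str]                  # previously clicked news IDs
    candidates: List[Tuple[str, int]]   # (news ID, 1 = clicked / 0 = not clicked)


def parse_behavior_line(line: str) -> Impression:
    """Parse one tab-separated impression record (assumed 5-column layout)."""
    imp_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    candidates = []
    for item in impressions.split():
        # Split on the last hyphen so the label is separated from the news ID.
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    # An empty history field yields an empty list.
    return Impression(imp_id, user_id, time, history.split(), candidates)


if __name__ == "__main__":
    # Made-up record, not taken from the dataset.
    sample = "1\tU0001\t11/15/2019 8:00:00 AM\tN1 N2 N3\tN4-1 N5-0"
    print(parse_behavior_line(sample))
```

The sketch uses only the standard library so that it runs without extra dependencies; for the full files, a dataframe library may be more convenient.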