NLP Is Too Cutthroat, So I Went Off to Study Proteins~

罗蓁蓁, 2023-07-25



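Why "words" are left out: a word is, in essence, a piece of information with a simple meaning that recurs at high frequency, while a sentence is information that, after successive disambiguation by many words, ends up carrying a directed meaning. From the genetic perspective, large fragments correspond to sentences, and the sub-segments of those fragments play the role of words; a codon (every three nucleotides) corresponds to one amino acid, so it is in essence still a letter. From the protein perspective, the fairly regular sheets and helices that hydrogen bonds produce in secondary structure can be regarded as words, and only a protein capable of carrying out a specific function qualifies as a sentence.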
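To make the "letters versus words" analogy concrete, here is a minimal Python sketch. It splits a DNA string into codons, maps each codon to a one-letter amino acid, and then cuts the resulting protein string into overlapping k-mers as a crude stand-in for "words". The function names, the partial codon table (only ten entries of the standard genetic code), and the choice of k-mers as words are assumptions made for illustration; they are not the author's definitions.

```python
from typing import List

# Partial codon table for illustration only: ten entries from the
# standard genetic code, not all 64 codons. "*" marks a stop codon.
CODON_TABLE = {
    "ATG": "M", "GCT": "A", "GCC": "A", "AAA": "K", "AAG": "K",
    "GAA": "E", "GAG": "E", "CTG": "L", "TGG": "W", "TAA": "*",
}


def split_codons(dna: str) -> List[str]:
    """Split a DNA string into non-overlapping triplets.

    Any trailing one or two nucleotides that do not form a full
    codon are dropped.
    """
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def translate(dna: str) -> str:
    """Map each codon to an amino-acid letter ("X" if not in the table)."""
    return "".join(CODON_TABLE.get(codon, "X") for codon in split_codons(dna))


def kmers(seq: str, k: int = 3) -> List[str]:
    """Return overlapping k-mers, a hypothetical choice of 'word' unit."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    protein = translate("ATGGCTAAAGAACTG")
    print(protein)         # amino-acid "letters"
    print(kmers(protein))  # overlapping 3-letter "words"
```

Each codon collapses to a single symbol, which is why the text treats it as a letter rather than a word. Any word-like unit has to be built at a higher level, for example from k-mers as above or from secondary-structure elements as the text suggests.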


References

  1. Theoretical foundation. The idea is important, but the argument is not well made:
    Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler-Doty, M., & Grzybowski, B. A. (2014). Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angewandte Chemie International Edition, 53(31), 8108–8112.
  2. Review. The table linking NLP methods to application areas is quite valuable:
    Öztürk, H., Özgür, A., Schwaller, P., Laino, T., & Ozkirimli, E. (2020). Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discovery Today, 25(4), 689–705.
  3. The first to propose the concepts of Protein Vector (ProtVec) and Gene Vector (GeneVec):
    Asgari, E., & Mofrad, M. R. K. (2015). Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE, 10(11), 1–15.
  4. Combining proteins with word embeddings:
    Bepler, T., & Berger, B. (2019). Learning protein sequence embeddings using information from structure. 7th International Conference on Learning Representations, ICLR 2019, 1–17.
  5. Although the comic treats the Seq2Seq model that Schwaller published in 2018 (accepted by a journal and performing well, see 6) as the first successful application of this method to biomolecules, papers in this area usually cite this one as where the whole story begins. It was homework by two Korean high school students, and getting this far is really impressive:
    Nam, J., & Kim, J. (2016). Linking the neural machine translation and the prediction of organic chemistry reactions. arXiv preprint arXiv:1612.09529.
  6. The best Seq2Seq work:
    Schwaller, P., Gaudin, T., Lanyi, D., Bekas, C., & Laino, T. (2018). "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical Science, 9(28), 6091–6098.
  7. Another fairly valuable Seq2Seq paper:
    Karimi, M., Wu, D., Wang, Z., & Shen, Y. (2019). DeepAffinity: Interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks. Bioinformatics, 35(18), 3329–3338.
  8. A BERT application with a beautiful title and a beautiful intro, but content that is not very striking:
    Vig, J., Madani, A., Varshney, L. R., Xiong, C., Socher, R., & Rajani, N. F. (2020). BERTology meets biology: Interpreting attention in protein language models. arXiv preprint arXiv:2006.15222.
