NLP Is Too Cutthroat, So I Went to Study Proteins~


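Why "words" were left out: in essence, a word is a piece of information with a simple meaning that can recur at high frequency, while a sentence is information that, through successive disambiguation across many words, ends up carrying a directed meaning. From the gene perspective, large fragments correspond to sentences, and the sub-segments of those fragments play the role of words; a codon (every three nucleotides) corresponds to one amino acid, so it is essentially still a letter. From the protein perspective, the fairly regular sheets and helices in secondary structure, which are formed by hydrogen bonds, can be regarded as words, and only a protein that can perform a specific function qualifies as a sentence.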
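The codon-as-letter point can be made concrete with a minimal sketch: reading a DNA string three nucleotides at a time and mapping each triplet to a one-letter amino acid code. The function names `split_codons` and `translate` are hypothetical, and the table below is a deliberately tiny subset of the standard genetic code, chosen only for illustration.

```python
# A tiny subset of the standard genetic code, for illustration only.
# A full table has 64 codons; anything missing here is reported as "X".
CODON_TABLE = {
    "ATG": "M",                          # methionine (also the start codon)
    "GCT": "A", "GCC": "A",              # alanine
    "AAA": "K",                          # lysine
    "GGT": "G",                          # glycine
    "TGG": "W",                          # tryptophan
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}


def split_codons(dna):
    """Split a DNA string into non-overlapping triplets, dropping an incomplete tail."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def translate(dna):
    """Map each codon to one amino-acid letter and stop at the first stop codon."""
    protein = []
    for codon in split_codons(dna.upper()):
        amino_acid = CODON_TABLE.get(codon, "X")  # "X" marks a codon not in the subset
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(split_codons("ATGGCTAAATGGTAA"))
    print(translate("ATGGCTAAATGGTAA"))
```

Each triplet yields exactly one output symbol, which is why the codon behaves like a letter of the protein alphabet rather than like a word with its own meaning.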


References

Theoretical foundation. The idea is important, but the argument is not well made:
Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler‐Doty, M., & Grzybowski, B. A. (2014). Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angewandte Chemie International Edition, 53(31), 8108-8112.
A review; its table linking NLP methods to application areas is quite valuable:
Öztürk, H., Özgür, A., Schwaller, P., Laino, T., & Ozkirimli, E. (2020). Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discovery Today, 25(4), 689-705.
The first to propose the concepts of Protein Vector (ProtVec) and Gene Vector (GeneVec):
Asgari, E., & Mofrad, M. R. K. (2015). Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE, 10(11), 1–15.
Combining proteins with word embeddings:
Bepler, T., & Berger, B. (2019). Learning protein sequence embeddings using information from structure. 7th International Conference on Learning Representations, ICLR 2019, 1–17.
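Although the comic treats the Seq2Seq model published by Schwaller in 2018 (accepted by a journal and performing well; see the Schwaller et al. (2018) entry) as the first successful application of this method to biomolecules, papers in this area generally cite the following one as where the whole story begins. It was a homework project by two Korean high school students, and getting this far is truly impressive: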
Nam, J., & Kim, J. (2016). Linking the neural machine translation and the prediction of organic chemistry reactions. arXiv preprint arXiv:1612.09529.
The best Seq2Seq:
Schwaller, P., Gaudin, T., Lanyi, D., Bekas, C., & Laino, T. (2018). "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical Science, 9(28), 6091-6098.
Another fairly valuable Seq2Seq paper:
Karimi, M., Wu, D., Wang, Z., & Shen, Y. (2019). DeepAffinity: Interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks. Bioinformatics, 35(18), 3329–3338.
A BERT application with a beautiful title and a beautiful intro, but content that is not especially impressive:
Vig, J., Madani, A., Varshney, L. R., Xiong, C., Socher, R., & Rajani, N. F. (2020). Bertology meets biology: Interpreting attention in protein language models. arXiv preprint arXiv:2006.15222.
