A Fast Python Script for Deduplicating Wordlists with Tens of Millions of Entries

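I hope you wake up every day to sunshine, and that a few words, looks, or gestures from other people never spoil your mood. Live well, and good things will come your way.

---- Top comment on NetEase Cloud Music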

I. Download link

https://github.com/teamssix/quchong

II. Clone the Python script to your machine

git clone https://github.com/teamssix/quchong.git


III. Usage

1. The script requires a Python 2 environment.

2. Put the files to be deduplicated alongside the Python script.

3. Create a few files with duplicate content and place them in a directory separate from the Python script, here /root/123.


4. Edit the Python script

#coding=utf-8

  

import sys, re, os

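def file_merge():
    input_path = "/root/123/"  # set this to your own directory; note the trailing "/"
    # Outputs of earlier runs live in the same directory, so skip them when merging
    skip = ('合并所有文件.txt', '去重所有文件.txt')
    # os.listdir returns every file name in the directory;
    # os.path.join combines each name with the directory into a full path
    whole_file = [os.path.join(input_path, name)
                  for name in os.listdir(input_path) if name not in skip]
    content = []
    # Open each path and collect all of its lines with readlines
    for w in whole_file:
        with open(w, 'rb') as f:
            content.extend(f.readlines())
    # The output file lives in the input directory and is created if it does not exist
    # ('合并所有文件.txt' means "all files merged")
    output_path = os.path.join(input_path, '合并所有文件.txt')
    # Write the merged content to the file
    with open(output_path, 'wb') as f:
        f.writelines(content)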

  

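def getDictList(dict_path):
    # An entry is a maximal run of word characters and the listed punctuation;
    # whitespace and every other character act as separators
    regx = r'''[\w\~`\!\@\#\$\%\^\&\*\(\)\_\-\+\=\[\]\{\}\:\;\,\.\/\<\>\?]+'''
    # The whole file is read into memory at once
    with open(dict_path) as f:
        data = f.read()
        return re.findall(regx, data)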

  

def rmdp(dictList):
    # A set keeps one copy of each entry; the original order is not preserved
    return list(set(dictList))

  

def fileSave(dictRmdp, out):
    # Open with 'w' so a re-run overwrites the old result instead of appending to it
    with open(out, 'w') as f:
        for line in dictRmdp:
            f.write(line + '\n')

  

def main():
    # Input: the merged file written by file_merge(); output: the deduplicated list
    # ('去重所有文件.txt' means "all files deduplicated")
    dict_path = '/root/123/合并所有文件.txt'
    out = '/root/123/去重所有文件.txt'
    try:
        dictList = getDictList(dict_path)
        dictRmdp = rmdp(dictList)
        fileSave(dictRmdp, out)
    except Exception as e:
        print 'error:', e
        sys.exit(1)

    

if __name__ == '__main__':

    file_merge()

    main()
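The script above loads every file into memory before deduplicating, and set() does not keep the original order. For lists with tens of millions of entries, a streaming variant may be easier on memory. The following is only a rough sketch, not part of the quchong repository: it assumes Python 3, treats each non-empty line as one entry (instead of using the regular expression above), and the names merge_and_dedupe, demo, and deduped.out are made up for illustration. The set of already-seen entries still has to fit in memory.

```python
import os
import tempfile


def merge_and_dedupe(input_dir, output_path):
    """Stream every regular file in input_dir into output_path, keeping only
    the first occurrence of each non-empty line. Returns the lines written."""
    seen = set()
    written = 0
    output_abs = os.path.abspath(output_path)
    with open(output_path, "wb") as out:
        # sorted() makes the processing order of the files predictable
        for name in sorted(os.listdir(input_dir)):
            path = os.path.join(input_dir, name)
            # Skip sub-directories and the output file itself
            if not os.path.isfile(path) or os.path.abspath(path) == output_abs:
                continue
            with open(path, "rb") as src:
                for raw in src:  # one line at a time, never the whole file
                    entry = raw.strip()
                    if entry and entry not in seen:
                        seen.add(entry)
                        out.write(entry + b"\n")
                        written += 1
    return written


def demo():
    """Build two small sample files in a temporary directory and dedupe them."""
    work_dir = tempfile.mkdtemp()
    with open(os.path.join(work_dir, "a.txt"), "wb") as f:
        f.write(b"admin\nroot\nadmin\n")
    with open(os.path.join(work_dir, "b.txt"), "wb") as f:
        f.write(b"root\n123456\n\n")
    out_path = os.path.join(work_dir, "deduped.out")
    count = merge_and_dedupe(work_dir, out_path)
    with open(out_path, "rb") as f:
        return count, f.read().split()


if __name__ == "__main__":
    print(demo())
```

Because entries are written the first time they are seen, the output keeps the order in which they first appear across the input files, and the output file can safely sit in the input directory since it is skipped during the merge.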