如何对大文件进行排序

排序问题虽然有了解，很多只是基于理论上，实践比较少，刚好最近遇到个问题：

!image-20220420153810671

这个以前遇到过类似的问题，假如你有一个20GB的文件，对文件的内容排序，内存限制在 1G，你要怎么排序

以前有做过总结，就是分割文件，逐个文件先排好序，最后然后再合并，说起来很简单，但做起来就。。。。

这次题目有分割文件，我下载了这个压缩包，解压看有1000个 20M的文件

!image-20220420153113356

思路：将这个一千个文件，文件名读取出来，按照条件逐个抽出来，并且排序


void read_lryic(char *path)
{
    DIR *dir = opendir(path); //打开目录文件
    cout << path << endl;
    struct dirent *entry;
    int i = 0;
    while ((entry = readdir(dir)) != 0)
    {
        if (strstr(entry->d_name, ".txt"))
        {
            filename.push_back(entry->d_name);
            printf(" %d %s\n", i++, entry->d_name);
        }
    }
}

现在文件名有了，进行筛选排序，这里用map来存放临时的数据，key 为时间戳，value 为对应的数据，为什么用map，map是会自动自动排序，根据key的值，而这里的没条数据都是 json，用了一个jsoncpp 库（如果用 go 估计已经写完了），原本想直接解析字符串，发现有点麻烦，还是用库吧，提取出时间戳，作为key，提取完一个文件后，重新写入到新的文件里


void sortfile(){
    Json::Reader reader;
    Json::Value obj;
    //读文件名字
    read_lryic("/home/hrp/Cpp/recore/");
    string file;
    for (auto str : filename){
        file = "/home/hrp/Cpp/recore/" + str;
        std::ifstream fIn(file);
        int i = 0;
        if (fIn)
        {
            std::string str;
            while (std::getline(fIn, str))
            {
                if(str.find("张三") != string::npos){

                    //std::cout << i++ << endl;
                    reader.parse(str, obj);
                    long timestamp = obj["timestamp"].asInt64();
                    ret[timestamp] = str;
                }
            }
            cout << file << endl;
        }
        else
        {
            std::cout << "Open file faild." << std::endl;
        }
        ofstream out1(file);
        for(auto i=ret.begin();i!=ret.end();i++){
            //cout << i->second;
            out1 << i->second ;
            out1 << "\n" ;
        }
        ret.clear();
        out1.close();
    }
}

现在完成按照条件完成每个文件的排序了

接下来就是归并，如何归并呢，现在有1000个文件，逐个遍历这1000个文件，每次都提取一行来，一次遍历提取了1000行数据，此时可以用大顶堆的特性，就是1000个数据放进这个大顶堆，堆顶肯定是最大数，按照这个特性，先读取一千条，放在大顶堆了，每次 pop出堆顶的（肯定最大），然后要补上一个，读哪个文件的呢，肯定是pop出去哪条数据所在的文件，所以现在就是要去那个文件里再读一条数据，以此类推，说是这样说，但是这题还有其他附加条件（根据时间分类写）


clock_t start,end;//定义clock_t变量
   start = clock();  //开始时间

   Json::Reader reader;
   Json::Value obj;

   map<string,ifstream*> filesmap;
   map<string,ofstream*> outfile;

   sortfile();
   time_t tt;
   struct tm *ttime;
//用 map 保存文件的读写流，这里是读
   for(auto file: filename){
       auto temp = file;
       file = "/home/hrp/Cpp/recore/" + temp;
       filesmap.emplace(file, new ifstream(file));
   }
//这里用c++ 的优先队列 来模拟堆，插入的数据  为 时间戳和文件的字符名字
  	priority_queue<pair<long, string>> topTime;
   map<long,string> Temp;
   string data;
//先读取 全部文件的 第一条数据，push到 优先队列中
   for(auto iter = filesmap.begin(); iter != filesmap.end(); iter++) {
       std::getline(*iter->second, data);

       reader.parse(data, obj);
       long timestamp = obj["timestamp"].asInt64();
       pair<long,string> b(timestamp,iter->first);

       topTime.push(b);
       Temp[timestamp] = data;
   }
// 现在 对队列操作，此时队列是 出一个，进一个，维持平衡，没有进，就一直出，知道 队列为空位止
// Temp 维护 这1000个数据，动态增减
   while (!topTime.empty()){
       // cout  << Temp[topTime.top()] << endl;
       //时间戳转 日期
       struct tm *ttime;
       tt = topTime.top().first/1000;
       ttime = localtime(&tt);
       char formatime[64];
       strftime(formatime, 64, "%Y-%m-%d", ttime);
       string strtime(formatime);
	//这里是写文件，根据时间戳 转换的日期来 写文件
       // outfile   key 为文件名 value 对应的读写流
       // 为 0 ，说明第一次，创建文件
       if(outfile.count(strtime)  == 0){

           outfile.emplace(strtime,new ofstream(strtime));
           if (!outfile[strtime]->is_open())
               cout << strtime << "error" << endl;
           *outfile[strtime] << Temp[topTime.top().first]+ "\n";
           Temp.erase(topTime.top().first);

       }else{
           //根据 日期写入 数据
           *outfile[strtime] << Temp[topTime.top().first] + "\n";
           Temp.erase(topTime.top().first);

       }
	// top是最大的，此时pop出去，记录下 pop去的是哪个文件
       auto TopTemp = topTime.top();
       topTime.pop();

	//如果 读取到了，更行 优先队列的数据
       if(std::getline(*filesmap[TopTemp.second], data)){
           reader.parse(data, obj);
           long timestamp = obj["timestamp"].asInt64();
           pair<long,string> b(timestamp,TopTemp.second);
           Temp[timestamp] = data;
           topTime.push(b);
       }

   }
//最后 要关闭文件，才会保存
   for(auto iter = outfile.begin(); iter != outfile.end(); iter++) {
       iter->second->close();
   }
   end = clock();   //结束时间
   cout<<"time = "<<double(end-start)/CLOCKS_PER_SEC<<"s"<<endl;  //输出时间（单位：ｓ）

运行的机器是：腾讯云服务器 4核 4g

!耗时 117 .s

!内存使用不大，500M 以内解决，就是 cpu 占用率高

!排序好的

总结：

认清了知道和做到的差距！！！！！
有些api 日常少使用，很不收悉，都是在查查查
要先有整体思路然后细化问题，逐个解决

代码整体设计不是很好，后续在改进，用多线程多路归并？

菜单

分享

如何对大文件进行排序

如何对大文件进行排序

评论

A2A 初理解：让 AI Agent 真正“互相协作”的通用协议

slow op的排查手段（更新中）

asan内存检测

模型即芯片：AI 推理新分叉

rclone拷贝桶对象失败定位过程

训练初了解：把大模型看成一个复杂函数（通俗版）

vector扩容

智能指针是线程安全的？

ceph中 RBD 使用

cas 无锁编程