How to Extract and Delete Duplicate Data in MongoDB

This article explains how to extract and delete duplicate data in MongoDB.

When you try to add a unique constraint with createIndex(), it cannot be added if duplicate data already exists. This article explains a concrete way to extract and delete duplicate data in that situation.

Preparing sample data

With the following data registered in bulk by the mongo command, we will look at how to delete the duplicate data ({ item: "pen" }).

> db.orders.insertMany([
  { datetime: ISODate("2021-12-15T12:00:00+09:00"), item: "pen", amount: 70 },
  { datetime: ISODate("2021-12-15T12:00:00+09:00"), item: "note", amount: 80 },
  { datetime: ISODate("2021-12-15T12:00:00+09:00"), item: "eraser", amount: 100 },
  { datetime: ISODate("2021-11-13T12:00:00+09:00"), item: "pen", amount: 20 },
  { datetime: ISODate("2021-11-02T12:00:00+09:00"), item: "pen", amount: 20 },
  { datetime: ISODate("2021-10-23T12:00:00+09:00"), item: "pen", amount: 30 },
  { datetime: ISODate("2021-10-18T12:00:00+09:00"), item: "pen", amount: 10 }
]);

Extracting duplicate data

To extract only duplicate data, use $group in aggregate(). During extraction, $$ROOT can be used to push the original data as-is into the items array.

Also, because aggregate() is limited to 100 MB of in-memory processing, specify the allowDiskUse option when you need to process more data than that at once.

Code (index.js)

var MongoClient = require("mongodb").MongoClient;
var URL = "mongodb://localhost:27017";
 
MongoClient.connect(URL, (err, client) => {
  if (err) {
    console.log(err);
    return;
  };
 
  var db = client.db("sample");
 
  db.collection("orders").aggregate([
    // Extract duplicate data.
    {
      $group: {
        _id: "$item",
        items: { $push: "$$ROOT" },
        count: { $sum: 1 }
      }
    },
    {
      $match: { count: { $gt: 1 } }
    }
  ]).toArray().then((docs) => {
    console.log(docs);
  }).catch((err) => {
    console.log(err);
  }).then(() => {
    client.close();
  });
});

Run and result

> node .\index.js
[ { _id: 'pen',
    items: [ [Object], [Object], [Object], [Object], [Object] ],
    count: 5 } ]

The execution result shows [Object], but when you stop in the debugger, you can inspect the contents with JSON.stringify().

Deleting duplicate data

Although we say “delete duplicate data,” one record should remain. This section introduces two ways to decide which data to keep and delete the rest.

  • Keep the top data after sorting
  • Keep data by specifying ObjectId

Keep the top data after sorting

First, this is a method of sorting and keeping the data with the largest or smallest value.

In the sample code, the data is sorted at L.14, and at L.30 the first record is removed from the deletion target before deletion.

Code (index.js)

var MongoClient = require("mongodb").MongoClient;
var URL = "mongodb://localhost:27017";
 
MongoClient.connect(URL, (err, client) => {
  if (err) {
    console.log(err);
    return;
  };
 
  var db = client.db("sample");
 
  db.collection("orders").aggregate([
    {
      $sort: { amount: -1 }
    },
    {
      $group: {
        _id: "$item",
        targets: { $push: "$_id" },
        count: { $sum: 1 }
      }
    },
    {
      $match: { count: { $gt: 1 } }
    },
  ]).toArray().then((docs) => {
    console.log(JSON.stringify(docs));
    var procs = [];
    for (var doc of docs) {
      doc.targets.shift();
      procs[procs.length] = db.collection("orders").deleteMany({
        _id: { $in: doc.targets }
      });
    }
    return Promise.all(procs);
  }).then((results) => {
    console.log("Remove dupulicate data.");
  }).catch((err) => {
    console.log(err);
  }).then(() => {
    client.close();
  });
});

Run and result

> node .\index.js
Remove dupulicate data.

Keep data by specifying ObjectId

In practice, it is often more common to check the data first, specify the data you want to keep, and delete everything else. This method specifies the ObjectId of the data to keep and deletes the rest. You can specify excluded data by adding values separated by commas to the array on L.32 of the sample code below.

Code (index.js)

var MongoClient = require("mongodb").MongoClient;
var ObjectId = require("mongodb").ObjectId;
var URL = "mongodb://localhost:27017";
 
MongoClient.connect(URL, (err, client) => {
  if (err) {
    console.log(err);
    return;
  };
 
  var db = client.db("sample");
 
  db.collection("orders").aggregate([
    // Extract duplicate data.
    {
      $group: {
        _id: "$item",
        items: { $push: "$$ROOT" },
        count: { $sum: 1 }
      }
    },
    {
      $match: { count: { $gt: 1 } }
    },
    // Get only duplicate data.
    {
      $unwind: "$items"
    },
    // Remove the remaining data from the deletion targets.
    {
      $match: {
        "items._id": { $nin: [new ObjectId("5a5091c4ff3735d50c439d41")] }
      }
    },
    // Collect the ObjectIds to delete.
    {
      $group: {
        _id: null,
        targets: { $push: "$items._id" }
      }
    }
  ]).toArray().then((docs) => {
    console.log(JSON.stringify(docs));
    var procs = [];
    for (var doc of docs) {
      procs[procs.length] = db.collection("orders").deleteMany({
        _id: { $in: doc.targets }
      });
    }
    return Promise.all(procs);
  }).then((results) => {
    console.log("Remove dupulicate data.");
  }).catch((err) => {
    console.log(err);
  }).then(() => {
    client.close();
  });
});

Run and result

> node .\index.js
Remove dupulicate data.

Conclusion

This article explained how to extract and delete duplicate data in MongoDB. The key points are as follows.

  • Extract data with $group in aggregate().
  • Delete data by using aggregate() and then deleteMany().