overleaf-cep/services/web/scripts/delete-orphaned-docs

Delete Orphaned Docs

Because of the large number of documents and projects, orphaned docs have to be detected from bulk exports of the raw data. A doc counts as orphaned when its project id appears in neither projects nor deletedProjects.

Exporting Data Files

Follow the directions in google-ops/README.md for exporting data from mongo and copying the files to your local machine.

Exporting docs

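Run the following doc export command to export all doc ids and their associated project ids in batches of 10,000,000. The command below exports the first batch; for each subsequent batch, increase --skip by 10,000,000 and write to a new output file.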

mongoexport --uri $READ_ONLY_MONGO_CONNECTION_STRING --collection docs --fields '_id,project_id' --skip 0 --limit 10000000 --type=csv --out docs.00000000.csv

This will produce files like:

_id,project_id
ObjectId(5babb6f864c952737a9a4c32),ObjectId(5b98bba5e2f38b7c88f6a625)
ObjectId(4eecaffcbffa66588e000007),ObjectId(4eecaffcbffa66588e00000d)
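
The export command above covers a single batch. A minimal sketch of looping over all batches might look like the following. The function names, the stop condition (a batch with no data rows) and the offset-based file naming are assumptions made for illustration, not part of the documented procedure.

```shell
#!/usr/bin/env bash
BATCH_SIZE=10000000

# Zero-padded output name for the batch starting at the given offset
# (assumption: file names follow the docs.00000000.csv pattern shown above).
batch_file_name() {
  printf 'docs.%08d.csv' "$1"
}

# Export batch after batch until one comes back without any data rows.
export_all_docs() {
  local skip=0 out
  while :; do
    out="$(batch_file_name "$skip")"
    mongoexport --uri "$READ_ONLY_MONGO_CONNECTION_STRING" --collection docs \
      --fields '_id,project_id' --skip "$skip" --limit "$BATCH_SIZE" \
      --type=csv --out "$out" || return 1
    # Assumption: an empty batch yields no file or at most a header line.
    if [ ! -f "$out" ] || [ "$(wc -l < "$out")" -le 1 ]; then
      rm -f "$out"
      break
    fi
    skip=$((skip + BATCH_SIZE))
  done
}
```

Calling export_all_docs with READ_ONLY_MONGO_CONNECTION_STRING set would write docs.00000000.csv, docs.10000000.csv and so on into the current directory.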

Concatenate these into a single file: cat docs.*csv > all-docs-doc_id-project_id.csv

For object ids the script will accept either plain hex strings or the ObjectId(...) format used by mongoexport.
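
To illustrate that the two id formats carry the same information, here is a small sketch that strips the ObjectId(...) wrapper from exported rows. The helper name strip_objectid_wrapper is hypothetical, and this is not how the script itself parses ids; since both formats are accepted, this step is optional.

```shell
# Strip the ObjectId(...) wrapper, leaving the bare 24-character hex id.
strip_objectid_wrapper() {
  sed -E 's/ObjectId\(([0-9a-fA-F]{24})\)/\1/g'
}

echo 'ObjectId(5babb6f864c952737a9a4c32),ObjectId(5b98bba5e2f38b7c88f6a625)' | strip_objectid_wrapper
# prints: 5babb6f864c952737a9a4c32,5b98bba5e2f38b7c88f6a625
```

If the doc export is rewritten this way, the project exports would need the same treatment, because the later comparison steps match ids as plain text.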

Exporting Projects

Export the project ids from both the projects and deletedProjects collections:

mongoexport --uri $READ_ONLY_MONGO_CONNECTION_STRING --collection projects --fields '_id' --type=csv --out projects.csv
mongoexport --uri $READ_ONLY_MONGO_CONNECTION_STRING --collection deletedProjects --fields 'project._id' --type=csv --out deleted-projects.csv

Concatenate these: cat projects.csv deleted-projects.csv > all-projects-project_id.csv

Processing Exported Data

Create a unique sorted list of project ids from docs

cut -d, -f 2 all-docs-doc_id-project_id.csv | sort | uniq > all-docs-project_ids.sorted.uniq.csv

Create a unique sorted list of project ids from projects

sort all-projects-project_id.csv | uniq > all-projects-project_id.sorted.uniq.csv

Create list of project ids in docs but not in projects

comm --check-order -23 all-docs-project_ids.sorted.uniq.csv all-projects-project_id.sorted.uniq.csv > orphaned-doc-project_ids.csv

Create list of doc ids with project ids not in projects

grep -F -f orphaned-doc-project_ids.csv all-docs-doc_id-project_id.csv > orphaned-doc-doc_id-project_id.csv
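
The four processing steps above can be tried end to end on a tiny made-up data set before running them on the real exports. The sample ids below are invented and shortened for readability, and the LC_ALL=C setting is an added assumption so that sort and comm agree on ordering.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C  # assumption: byte-wise collation keeps sort and comm consistent

cd "$(mktemp -d)"

# Made-up sample: three docs spread over two projects.
cat > all-docs-doc_id-project_id.csv <<'EOF'
ObjectId(d1),ObjectId(p1)
ObjectId(d2),ObjectId(p1)
ObjectId(d3),ObjectId(p2)
EOF

# Only project p1 still exists, so the doc in p2 is orphaned.
cat > all-projects-project_id.csv <<'EOF'
ObjectId(p1)
EOF

# Same commands as in the steps above.
cut -d, -f 2 all-docs-doc_id-project_id.csv | sort | uniq > all-docs-project_ids.sorted.uniq.csv
sort all-projects-project_id.csv | uniq > all-projects-project_id.sorted.uniq.csv
comm --check-order -23 all-docs-project_ids.sorted.uniq.csv all-projects-project_id.sorted.uniq.csv > orphaned-doc-project_ids.csv
grep -F -f orphaned-doc-project_ids.csv all-docs-doc_id-project_id.csv > orphaned-doc-doc_id-project_id.csv

cat orphaned-doc-doc_id-project_id.csv
# prints: ObjectId(d3),ObjectId(p2)
```

The comm -23 step keeps only lines unique to the first file, which is why both inputs must be sorted the same way.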

Run doc deleter

node delete-orphaned-docs orphaned-doc-doc_id-project_id.csv

Commit Changes

By default the script only prints the list of project ids and doc ids to be deleted. To actually delete docs, run it with the --commit argument.

Selecting Input Lines to Process

The --limit and --offset arguments select which input lines to process. Each line holds one doc, so a single project will often span multiple lines. Deletion is based on project id, however: once any doc for a project is deleted, all docs for that project are deleted, even if some of its input lines fall outside the selected range.
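
Because deletion works per project, it can help to preview which project ids a given slice of the input touches before running the script on it. The sketch below is a hypothetical helper, not part of the script. It assumes that --offset skips that many lines from the top of the file and that --limit caps the number of lines read; check the script itself before relying on that.

```shell
# Print the distinct project ids found in a slice of the input file.
# Assumption: offset is a zero-based count of lines to skip, limit a line count.
preview_slice_projects() {
  local input="$1" offset="$2" limit="$3"
  tail -n "+$((offset + 1))" "$input" | head -n "$limit" | cut -d, -f 2 | sort -u
}

# Example: project ids touched by the first 1000 input lines
# preview_slice_projects orphaned-doc-doc_id-project_id.csv 0 1000
```

Every project id printed for a slice would have all of its docs deleted by a committed run over that slice, including docs listed on lines outside it.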