# Delete Orphaned Docs
Because of the large number of documents and projects, orphaned docs must be
detected using bulk exports of the raw data.
## Exporting Data Files
Follow the directions in `google-ops/README.md` for exporting data from mongo
and copying the files to your local machine.
### Exporting docs
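Run the following doc export command to export all doc ids and their associated
project ids in batches of 10,000,000. Repeat it for each batch, increasing
`--skip` by 10,000,000 and changing the output file name each time.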
```
mongoexport --uri "$READ_ONLY_MONGO_CONNECTION_STRING" --collection docs --fields '_id,project_id' --skip 0 --limit 10000000 --type=csv --out docs.00000000.csv
```
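As a rough sketch of the batching, the loop below prints one export command per batch without running anything. The helper name `print_doc_export_commands`, the batch count passed to it, and the convention of naming each file after its zero-padded `--skip` value are assumptions for illustration, not something this README specifies.

```shell
# Print (do not run) one mongoexport command per batch.
# ASSUMPTIONS: the caller knows how many batches are needed, and output
# files are named after the zero-padded --skip value.
print_doc_export_commands() (
  batches=$1
  batch_size=${2:-10000000}
  i=0
  while [ "$i" -lt "$batches" ]; do
    skip=$((i * batch_size))
    printf 'mongoexport --uri "$READ_ONLY_MONGO_CONNECTION_STRING" --collection docs --fields _id,project_id --skip %d --limit %d --type=csv --out docs.%08d.csv\n' \
      "$skip" "$batch_size" "$skip"
    i=$((i + 1))
  done
)

# Show the commands for the first three batches.
print_doc_export_commands 3
```

Review the printed commands before running them against the read-only connection.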
This will produce files like:
```
_id,project_id
ObjectId(5babb6f864c952737a9a4c32),ObjectId(5b98bba5e2f38b7c88f6a625)
ObjectId(4eecaffcbffa66588e000007),ObjectId(4eecaffcbffa66588e00000d)
```
Concatenate these into a single file: `cat docs.*csv > all-docs-doc_id-project_id.csv`

For object ids, the script accepts either plain hex strings or the `ObjectId(...)`
format used by mongoexport.
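To illustrate the two id formats, the sketch below strips the `ObjectId(...)` wrapper with `sed`. This step is optional, since the script accepts both forms, and `normalize_object_ids` is a hypothetical helper name, not part of the script.

```shell
# Strip the ObjectId(...) wrapper, leaving plain 24-character hex ids.
# normalize_object_ids is a hypothetical helper, shown only to illustrate
# that the two formats carry the same id.
normalize_object_ids() {
  sed -E 's/ObjectId\(([0-9a-fA-F]{24})\)/\1/g'
}

# One line in mongoexport format, one already in plain hex.
printf '%s\n' \
  'ObjectId(5babb6f864c952737a9a4c32),ObjectId(5b98bba5e2f38b7c88f6a625)' \
  '4eecaffcbffa66588e000007,4eecaffcbffa66588e00000d' |
  normalize_object_ids
```

Both output lines come out as plain `doc_id,project_id` hex pairs. The second line passes through unchanged.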
### Exporting Projects
Export project ids from both the `projects` and `deletedProjects` collections:
```
mongoexport --uri "$READ_ONLY_MONGO_CONNECTION_STRING" --collection projects --fields '_id' --type=csv --out projects.csv
mongoexport --uri "$READ_ONLY_MONGO_CONNECTION_STRING" --collection deletedProjects --fields 'project._id' --type=csv --out deleted-projects.csv
```
Concatenate these: `cat projects.csv deleted-projects.csv > all-projects-project_id.csv`
## Processing Exported Data
### Create a unique sorted list of project ids from docs
```
cut -d, -f 2 all-docs-doc_id-project_id.csv | sort | uniq > all-docs-project_ids.sorted.uniq.csv
```
### Create a unique sorted list of project ids from projects
```
sort all-projects-project_id.csv | uniq > all-projects-project_id.sorted.uniq.csv
```
### Create list of project ids in docs but not in projects
```
comm --check-order -23 all-docs-project_ids.sorted.uniq.csv all-projects-project_id.sorted.uniq.csv > orphaned-doc-project_ids.csv
```
### Create list of doc ids with project ids not in projects
```
grep -F -f orphaned-doc-project_ids.csv all-docs-doc_id-project_id.csv > orphaned-doc-doc_id-project_id.csv
```
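The four processing steps can be tried end to end on a tiny data set before running them on the real exports. In the sketch below all ids are made up, the sample files carry no CSV header line, and `LC_ALL=C` is an added assumption so that `sort` and `comm` agree on ordering. `--check-order` is left out only so the sketch also runs where `comm` lacks that option.

```shell
# Worked example of the processing steps on invented sample data.
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# doc_id,project_id pairs; project 2222... has no matching project below.
cat > all-docs-doc_id-project_id.csv <<'EOF'
aaaaaaaaaaaaaaaaaaaaaaa1,111111111111111111111111
aaaaaaaaaaaaaaaaaaaaaaa2,222222222222222222222222
aaaaaaaaaaaaaaaaaaaaaaa3,222222222222222222222222
aaaaaaaaaaaaaaaaaaaaaaa4,333333333333333333333333
EOF

# Ids of existing and deleted projects, deliberately unsorted.
cat > all-projects-project_id.csv <<'EOF'
333333333333333333333333
111111111111111111111111
EOF

# Project ids referenced by docs, sorted and de-duplicated.
cut -d, -f 2 all-docs-doc_id-project_id.csv | sort | uniq > all-docs-project_ids.sorted.uniq.csv
# Known project ids, sorted and de-duplicated.
sort all-projects-project_id.csv | uniq > all-projects-project_id.sorted.uniq.csv
# Lines found only in the first file: project ids with docs but no project.
comm -23 all-docs-project_ids.sorted.uniq.csv all-projects-project_id.sorted.uniq.csv > orphaned-doc-project_ids.csv
# Doc lines that mention any orphaned project id.
grep -F -f orphaned-doc-project_ids.csv all-docs-doc_id-project_id.csv > orphaned-doc-doc_id-project_id.csv

cat orphaned-doc-doc_id-project_id.csv
```

With this sample, the final file contains only the two docs belonging to project `222222222222222222222222`.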
## Run doc deleter
```
node delete-orphaned-docs orphaned-doc-doc_id-project_id.csv
```
### Commit Changes
By default, the script only prints the list of project ids and doc ids to be
deleted. To actually delete docs, run it with the `--commit` argument.
### Selecting Input Lines to Process
The `--limit` and `--offset` arguments select which input lines to process.
There is one doc per line, so a single project will often span multiple lines.
Deletion is based on project id, however, so if one doc for a project is deleted,
all of that project's docs are deleted, even if not all of its input lines are
processed.
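To see which projects would be affected beyond a chosen slice, a sketch like the one below lists project ids that appear both inside and outside the selected lines. It assumes that `--offset N` skips the first N input lines and that `--limit M` processes at most the next M lines. That reading is not confirmed by this README, so check the script before relying on it. `projects_split_by_slice` is a hypothetical helper, and the ids are placeholders.

```shell
# List project ids that occur inside the selected slice AND on lines
# outside it.
# ASSUMPTION: --offset N skips the first N lines and --limit M processes at
# most the next M lines; verify against the script itself.
projects_split_by_slice() (
  file=$1 offset=$2 limit=$3
  awk -F, -v offset="$offset" -v limit="$limit" '
    { in_slice = (NR > offset && NR <= offset + limit) }
    in_slice  { inside[$2] = 1 }
    !in_slice { outside[$2] = 1 }
    END { for (p in inside) if (p in outside) print p }
  ' "$file" | sort
)

# Placeholder input: project p1 spans two lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
d1,p1
d2,p1
d3,p2
d4,p3
EOF

# Only the first line is selected, but p1 also owns the second line.
projects_split_by_slice "$sample" 0 1
```

Any project id printed here has docs outside the slice that would still be deleted along with the ones inside it.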