SPEC 023 — File Storage
| Field | Value |
|---|---|
| Status | DRAFT |
| Priority | P0 — Launch-Critical |
| Backend | equa-server/modules/file-storage/ |
| Frontend | equa-web/src/service/lib/http-client.ts (postMultipart()) |
1. Feature Purpose
File Storage is the foundational module that all file-handling features in Equa depend on. It manages upload, storage, retrieval, and deduplication of files using AWS S3 as the backing store and a content-addressed hashing scheme for deduplication. Every document uploaded to Data Rooms (SPEC 008), agreement PDFs (SPEC 005), generated reports (SPEC 014), and Google Drive synced files (SPEC 016) ultimately flows through this module. A secondary Microsoft file storage integration handles files managed through the Microsoft 365 connector (SPEC 017).
2. Current State (Verified)
2.1 Storage Backend
| Detail | Value |
|---|---|
| Primary store | AWS S3 |
| Addressing | Content-addressed by file hash (SHA-256) |
| Deduplication | HashedFiles table stores unique content hashes; multiple Files rows can reference the same hash |
| Upload limit | AWS_S3_UPLOAD_SIZE_LIMIT_MB (default 10 MB) |
| Static URL | STATIC_FILE_URL — base URL for public/signed file access |
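The content-addressed scheme above can be sketched as follows. This is a minimal illustration, not the module's actual code; the `contentHash` helper name is an assumption, and only the hashing step is shown because the exact S3 key format is undocumented (see Business Rules).

```typescript
import { createHash } from "node:crypto";

// Compute the SHA-256 content address for an uploaded buffer.
// Identical content always yields the same hex digest, which is what
// makes HashedFiles-based deduplication possible.
function contentHash(data: Buffer): string {
  return createHash("sha256").update(data).digest("hex");
}
```

Because the address is derived purely from content, renaming a file or re-uploading it under a different owner cannot change its hash.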
2.2 Upload Flow
| Detail | Value |
|---|---|
| Frontend method | postMultipart() in equa-web/src/service/lib/http-client.ts |
| HTTP client | Axios with Content-Type: multipart/form-data |
| Progress tracking | onUploadProgress callback for UI progress bars |
| Server processing | Compute hash → check HashedFiles → upload to S3 if new → create Files record |
2.3 Microsoft File Storage
| Detail | Value |
|---|---|
| Module path | equa-server/modules/file-storage/ (shared module, MS-specific adapters) |
| Integration | Files from Microsoft 365 (SharePoint, OneDrive) stored via same S3 pipeline |
| Metadata | Original Microsoft file metadata preserved in Files record |
2.4 Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| AWS_S3_UPLOAD_SIZE_LIMIT_MB | No | 10 | Maximum upload size in megabytes |
| STATIC_FILE_URL | Yes | — | Base URL for serving stored files |
| AWS credentials | Yes | — | Standard AWS credential chain (env vars, IAM role, etc.) |
3. Data Model
Files
| Column | Type | Constraints |
|---|---|---|
| id | uuid | PK |
| hash | Hash | NOT NULL, FK → HashedFiles — content-addressed reference |
| filename | varchar | NOT NULL — original filename |
| url | varchar | NOT NULL — S3 object URL |
| extension | varchar | nullable — file extension (e.g. pdf, docx) |
| contentType | varchar | nullable — MIME type (e.g. application/pdf) |
| size | number | NOT NULL — file size in bytes |
| owner | uuid | FK → Users, NOT NULL |
HashedFiles
| Column | Type | Constraints |
|---|---|---|
| hash | Hash | PK — SHA-256 content hash |
| size | number | NOT NULL — file size in bytes (for quick size lookups without joining Files) |
Relationships
- Files → HashedFiles: Many-to-one. Multiple `Files` records (different filenames, owners, or contexts) can reference the same `HashedFiles` row when content is identical.
- Files → Users: Many-to-one via `owner`. Tracks who uploaded the file.
- DirectoryItems → Files: Many-to-one via `DirectoryItems.file` (SPEC 008). Data Room entries reference Files.
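The two tables and their relationship can be mirrored as TypeScript shapes. These interfaces are illustrative, inferred from the columns above; the actual entity definitions in equa-server may differ in naming and types.

```typescript
// One row per unique content hash; shared by all Files with identical content.
interface HashedFile {
  hash: string; // SHA-256 hex digest, primary key
  size: number; // bytes, denormalized for quick size lookups
}

// One row per upload; carries per-upload metadata, points at shared content.
interface FileRecord {
  id: string;           // uuid, PK
  hash: string;         // FK -> HashedFile.hash
  filename: string;     // original filename
  url: string;          // S3 object URL
  extension?: string;   // e.g. "pdf", "docx"
  contentType?: string; // MIME type, e.g. "application/pdf"
  size: number;         // bytes
  owner: string;        // FK -> Users.id
}
```

Note that per-upload fields (filename, owner) live on `FileRecord`, while `HashedFile` holds only what is intrinsic to the content itself.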
4. API Endpoints
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /api/v1/files/upload | Yes | Upload a file (multipart/form-data) |
| POST | /api/v1/organizations/:id/files/upload | Yes | Upload a file scoped to an organization |
| GET | /api/v1/files/:fileId | Yes | Get file metadata |
| GET | /api/v1/files/:fileId/download | Yes | Download file content (streams from S3) |
| DELETE | /api/v1/files/:fileId | Yes | Delete file record (orphan cleanup handles S3) |
| GET | /api/v1/organizations/:id/files | Yes | List files for an organization |
Upload Processing Pipeline
- Client sends `multipart/form-data` via `postMultipart()` (Axios)
- Server receives file stream, computes SHA-256 hash
- Check `HashedFiles` for existing hash
- If hash exists → skip S3 upload (dedup), create new `Files` record pointing to existing hash
- If hash is new → upload to S3, create `HashedFiles` row, create `Files` record
- Return `Files` record with `id`, `url`, and metadata
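The pipeline steps above can be sketched end to end. This is an in-memory stand-in, assuming hypothetical names: the `Map` replaces the `HashedFiles` table, the array replaces actual AWS SDK uploads, and `handleUpload` is not the real handler.

```typescript
import { createHash, randomUUID } from "node:crypto";

const hashedFiles = new Map<string, number>(); // stand-in for HashedFiles: hash -> size
const s3Uploads: string[] = [];                // hashes actually "uploaded" to S3

interface StoredFile { id: string; hash: string; filename: string; size: number; owner: string; }

function handleUpload(data: Buffer, filename: string, owner: string): StoredFile {
  // Step 1-2: receive content, compute SHA-256.
  const hash = createHash("sha256").update(data).digest("hex");
  // Step 3-5: dedup check; only new content reaches S3.
  if (!hashedFiles.has(hash)) {
    s3Uploads.push(hash);
    hashedFiles.set(hash, data.length);
  }
  // Step 6: a Files record is always created, even for duplicate content.
  return { id: randomUUID(), hash, filename, size: data.length, owner };
}
```

Uploading the same bytes twice therefore produces two `Files` records but a single S3 object, which is exactly the dedup property the acceptance criteria test.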
5. Frontend Components
| Component | Path | Description |
|---|---|---|
| postMultipart | service/lib/http-client.ts | Axios wrapper for multipart file uploads with progress callback |
| ProgressBar | (shared component) | Upload progress indicator driven by onUploadProgress |
Frontend Behavior
- Multipart upload — `postMultipart(url, formData, onProgress)` sends files as `multipart/form-data` with the `Content-Type` header set automatically by Axios.
- Progress tracking — `onUploadProgress` receives Axios progress events with `loaded` and `total` bytes for percentage calculation.
- Size validation — Frontend validates file size against `AWS_S3_UPLOAD_SIZE_LIMIT_MB` before initiating upload; over-limit files show an error toast.
- Multiple files — Batch uploads send files sequentially (not in parallel) to avoid overwhelming the server; each file gets its own progress bar.
- Retry — No automatic retry on upload failure; users see an error and can retry manually.
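The client-side size check and progress math described above can be sketched as pure helpers. The Axios call itself is omitted; these function names are illustrative, and the `{loaded, total}` pair mirrors the shape of an Axios progress event.

```typescript
const UPLOAD_LIMIT_MB = 10; // mirrors the AWS_S3_UPLOAD_SIZE_LIMIT_MB default

// Pre-flight check run before postMultipart() is ever called.
function withinUploadLimit(sizeBytes: number, limitMb = UPLOAD_LIMIT_MB): boolean {
  return sizeBytes <= limitMb * 1024 * 1024;
}

// Percentage fed to the progress bar from an onUploadProgress event.
function uploadPercent(loaded: number, total: number): number {
  return total > 0 ? Math.round((loaded / total) * 100) : 0;
}
```

Guarding `total > 0` matters because Axios may report an unknown total early in an upload.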
6. Business Rules
- Content-addressed deduplication — Files with identical content (same SHA-256 hash) are stored once in S3; additional uploads create new `Files` metadata records pointing to the same `HashedFiles` entry.
- Upload size limit — Files exceeding `AWS_S3_UPLOAD_SIZE_LIMIT_MB` (default 10 MB) are rejected at the server with a 413 response. The frontend also validates before upload.
- Hash as foreign key — `Files.hash` references `HashedFiles.hash`; the `HashedFiles` row must exist before the `Files` row is created.
- Owner tracking — Every file upload records the authenticated user as `Files.owner` for attribution and access control.
- S3 key structure — S3 object keys include the content hash to enable direct dedup lookup; the exact key format is `{prefix}/{hash}/(unknown)`.
- Soft delete — Deleting a `Files` record does not immediately remove the S3 object; orphaned `HashedFiles` entries (no remaining `Files` references) are cleaned up by a background job.
- MIME type detection — Content type is determined from the file extension and validated against the file's magic bytes where possible.
- Static file URL — `STATIC_FILE_URL` is prepended to file paths for client-facing URLs; this allows CDN or proxy configuration without changing stored paths.
- Microsoft file passthrough — Files originating from Microsoft 365 (via SPEC 017) flow through the same upload pipeline; Microsoft-specific metadata is preserved in the `Files` record.
- No direct S3 access — Clients never interact with S3 directly; all uploads and downloads are proxied through the API server for access control.
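The static-file-URL rule (and the related risk of `STATIC_FILE_URL` changing) suggests resolving the full URL at response time from a stored relative path. A minimal sketch, assuming a hypothetical `resolveFileUrl` helper and a relative stored-path format:

```typescript
// Join STATIC_FILE_URL with a stored relative path, tolerating stray
// slashes on either side so a config change can't produce "//" URLs.
function resolveFileUrl(staticFileUrl: string, storedPath: string): string {
  const base = staticFileUrl.replace(/\/+$/, "");
  const path = storedPath.replace(/^\/+/, "");
  return `${base}/${path}`;
}
```

Because only relative paths are persisted, repointing `STATIC_FILE_URL` at a CDN or proxy requires no data migration.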
7. Acceptance Criteria
- File upload via multipart form-data succeeds and returns file metadata with ID and URL
- Uploading the same file content twice creates two `Files` records but only one `HashedFiles` row / S3 object
- Files exceeding the size limit are rejected with a 413 error and a descriptive message
- Frontend validates file size before upload and shows an error toast for over-limit files
- Upload progress bar reflects actual upload progress via `onUploadProgress`
- File download streams content from S3 with correct `Content-Type` and `Content-Disposition`
- File metadata endpoint returns correct filename, size, extension, content type, and owner
- Deleting a file removes the `Files` record; the S3 object persists if other records reference the same hash
- Organization-scoped file listing returns only files belonging to that organization
- `STATIC_FILE_URL` is correctly prepended to file URLs in API responses
- Microsoft-sourced files are stored and retrievable through the same file endpoints
- File owner is correctly set to the authenticated user on upload
8. Risks
| Risk | Impact | Mitigation |
|---|---|---|
| S3 bucket misconfiguration (public access) | Data exposure of all stored files | Audit bucket policies; block public access; use signed URLs with expiry |
| Upload size limit bypass via chunked encoding | Over-limit files consume storage | Enforce size limit at reverse proxy (nginx/ALB) and application layer |
| Hash collision (SHA-256) | Extremely unlikely but would serve wrong file content | Collision probability is cryptographically negligible (birthday bound ≈ 2^128 hashes); no practical mitigation needed |
| Orphan S3 objects accumulate | Wasted storage costs | Run periodic orphan cleanup job comparing HashedFiles references to S3 inventory |
| No multipart/resumable upload for large files | Uploads fail on slow connections near the 10 MB limit | Consider S3 presigned URL uploads or chunked multipart for files >5 MB |
| STATIC_FILE_URL change breaks existing URLs | Stored file URLs become unreachable | Use relative paths internally; resolve full URL at response time |
| Concurrent upload of identical content | Race condition creating duplicate HashedFiles rows | Use INSERT ... ON CONFLICT (upsert) on hash column |
| No virus/malware scanning | Malicious files stored and served | Add ClamAV or S3 Object Lambda scanning before persisting |
| File deletion without cascade awareness | Orphaned DirectoryItems pointing to deleted Files | Enforce referential integrity; check for references before delete |