
SPEC 023 — File Storage

| Field | Value |
|---|---|
| Status | DRAFT |
| Priority | P0 — Launch-Critical |
| Backend | `equa-server/modules/file-storage/` |
| Frontend | `equa-web/src/service/lib/http-client.ts` (`postMultipart()`) |

1. Feature Purpose

File Storage is the foundational module that all file-handling features in Equa depend on. It manages upload, storage, retrieval, and deduplication of files using AWS S3 as the backing store and a content-addressed hashing scheme for deduplication. Every document uploaded to Data Rooms (SPEC 008), agreement PDFs (SPEC 005), generated reports (SPEC 014), and Google Drive synced files (SPEC 016) ultimately flows through this module. A secondary Microsoft file storage integration handles files managed through the Microsoft 365 connector (SPEC 017).

2. Current State (Verified)

2.1 Storage Backend

| Detail | Value |
|---|---|
| Primary store | AWS S3 |
| Addressing | Content-addressed by file hash (SHA-256) |
| Deduplication | `HashedFiles` table stores unique content hashes; multiple `Files` rows can reference the same hash |
| Upload limit | `AWS_S3_UPLOAD_SIZE_LIMIT_MB` (default 10 MB) |
| Static URL | `STATIC_FILE_URL` — base URL for public/signed file access |
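
The content-addressing scheme above can be sketched in a few lines, assuming a Node.js backend with the built-in `node:crypto` module (the actual hashing code in `equa-server` may differ):

```typescript
import { createHash } from "node:crypto";

// SHA-256 hex digest of the file bytes — the dedup key stored in HashedFiles.
// Identical content always yields the same hash, so a second upload of the
// same bytes maps to the existing HashedFiles row and S3 object.
function contentHash(data: Buffer): string {
  return createHash("sha256").update(data).digest("hex");
}
```

Because the hash depends only on content, renaming a file or re-uploading it under a different owner produces a new `Files` row but no new S3 object.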

2.2 Upload Flow

| Detail | Value |
|---|---|
| Frontend method | `postMultipart()` in `equa-web/src/service/lib/http-client.ts` |
| HTTP client | Axios with `Content-Type: multipart/form-data` |
| Progress tracking | `onUploadProgress` callback for UI progress bars |
| Server processing | Compute hash → check `HashedFiles` → upload to S3 if new → create `Files` record |

2.3 Microsoft File Storage

| Detail | Value |
|---|---|
| Module path | `equa-server/modules/file-storage/` (shared module, MS-specific adapters) |
| Integration | Files from Microsoft 365 (SharePoint, OneDrive) stored via the same S3 pipeline |
| Metadata | Original Microsoft file metadata preserved in the `Files` record |

2.4 Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `AWS_S3_UPLOAD_SIZE_LIMIT_MB` | No | 10 | Maximum upload size in megabytes |
| `STATIC_FILE_URL` | Yes | — | Base URL for serving stored files |
| AWS credentials | Yes | — | Standard AWS credential chain (env vars, IAM role, etc.) |
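
A minimal sketch of how these variables could be read with the required/default semantics from the table (the loader function and its name are hypothetical, not the actual server code):

```typescript
// Hypothetical config loader mirroring the table above.
interface FileStorageConfig {
  uploadSizeLimitMb: number; // AWS_S3_UPLOAD_SIZE_LIMIT_MB, optional, default 10
  staticFileUrl: string;     // STATIC_FILE_URL, required
}

function loadFileStorageConfig(env: Record<string, string | undefined>): FileStorageConfig {
  const staticFileUrl = env.STATIC_FILE_URL;
  if (!staticFileUrl) {
    throw new Error("STATIC_FILE_URL is required"); // fail fast at startup
  }
  return {
    uploadSizeLimitMb: Number(env.AWS_S3_UPLOAD_SIZE_LIMIT_MB ?? "10"),
    staticFileUrl,
  };
}
```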

3. Data Model

Files

| Column | Type | Constraints |
|---|---|---|
| `id` | uuid | PK |
| `hash` | Hash | NOT NULL, FK → `HashedFiles` — content-addressed reference |
| `filename` | varchar | NOT NULL — original filename |
| `url` | varchar | NOT NULL — S3 object URL |
| `extension` | varchar | nullable — file extension (e.g. `pdf`, `docx`) |
| `contentType` | varchar | nullable — MIME type (e.g. `application/pdf`) |
| `size` | number | NOT NULL — file size in bytes |
| `owner` | uuid | FK → `Users`, NOT NULL |

HashedFiles

| Column | Type | Constraints |
|---|---|---|
| `hash` | Hash | PK — SHA-256 content hash |
| `size` | number | NOT NULL — file size in bytes (for quick size lookups without joining `Files`) |

Relationships

  • Files → HashedFiles: Many-to-one. Multiple Files records (different filenames, owners, or contexts) can reference the same HashedFiles row when content is identical.
  • Files → Users: Many-to-one via owner. Tracks who uploaded the file.
  • DirectoryItems → Files: Many-to-one via DirectoryItems.file (SPEC 008). Data Room entries reference Files.
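
The two tables and their many-to-one relationship can be sketched as TypeScript types (illustrative shapes derived from the columns above, not the actual ORM entities):

```typescript
interface HashedFile {
  hash: string; // SHA-256 hex digest, primary key
  size: number; // bytes, duplicated here for size lookups without joining Files
}

interface FileRecord {
  id: string;           // uuid
  hash: string;         // FK → HashedFile.hash
  filename: string;     // original filename
  url: string;          // S3 object URL
  extension?: string;   // e.g. "pdf", "docx"
  contentType?: string; // e.g. "application/pdf"
  size: number;         // bytes
  owner: string;        // FK → Users.id
}

// The dedup win in one line: distinct S3 objects vs. metadata rows.
function storedObjectCount(files: FileRecord[]): number {
  return new Set(files.map((f) => f.hash)).size;
}
```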

4. API Endpoints

| Method | Path | Auth | Description |
|---|---|---|---|
| POST | `/api/v1/files/upload` | Yes | Upload a file (`multipart/form-data`) |
| POST | `/api/v1/organizations/:id/files/upload` | Yes | Upload a file scoped to an organization |
| GET | `/api/v1/files/:fileId` | Yes | Get file metadata |
| GET | `/api/v1/files/:fileId/download` | Yes | Download file content (streams from S3) |
| DELETE | `/api/v1/files/:fileId` | Yes | Delete file record (orphan cleanup handles S3) |
| GET | `/api/v1/organizations/:id/files` | Yes | List files for an organization |
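
Client-side path construction for these routes can be sketched as plain helpers (hypothetical names; the real client builds URLs inside `http-client.ts`):

```typescript
// Hypothetical path builders matching the endpoint table above.
const filePath = (fileId: string) =>
  `/api/v1/files/${encodeURIComponent(fileId)}`;

const fileDownloadPath = (fileId: string) =>
  `${filePath(fileId)}/download`;

const orgFilesPath = (orgId: string) =>
  `/api/v1/organizations/${encodeURIComponent(orgId)}/files`;

const orgUploadPath = (orgId: string) =>
  `${orgFilesPath(orgId)}/upload`;
```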

Upload Processing Pipeline

  1. Client sends multipart/form-data via postMultipart() (Axios)
  2. Server receives file stream, computes SHA-256 hash
  3. Check HashedFiles for existing hash
  4. If hash exists → skip S3 upload (dedup), create new Files record pointing to existing hash
  5. If hash is new → upload to S3, create HashedFiles row, create Files record
  6. Return Files record with id, url, and metadata
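
The six steps above can be sketched end to end; `Db` and `ObjectStore` are hypothetical adapter interfaces standing in for the real database and S3 client:

```typescript
import { createHash } from "node:crypto";

interface Db {
  hashExists(hash: string): Promise<boolean>;
  insertHashedFile(hash: string, size: number): Promise<void>;
  insertFile(meta: {
    hash: string;
    filename: string;
    owner: string;
    size: number;
  }): Promise<{ id: string; url: string }>;
}

interface ObjectStore {
  putObject(key: string, body: Buffer): Promise<void>;
}

async function handleUpload(
  db: Db,
  store: ObjectStore,
  body: Buffer,
  filename: string,
  owner: string
): Promise<{ id: string; url: string }> {
  const hash = createHash("sha256").update(body).digest("hex"); // step 2
  if (!(await db.hashExists(hash))) {                           // step 3
    await store.putObject(hash, body);                          // step 5: new content
    await db.insertHashedFile(hash, body.length);
  }
  // step 4 falls through here: existing hash → S3 upload skipped
  return db.insertFile({ hash, filename, owner, size: body.length }); // step 6
}
```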

5. Frontend Components

| Component | Path | Description |
|---|---|---|
| `postMultipart` | `service/lib/http-client.ts` | Axios wrapper for multipart file uploads with progress callback |
| `ProgressBar` | (shared component) | Upload progress indicator driven by `onUploadProgress` |

Frontend Behavior

  • Multipart upload — `postMultipart(url, formData, onProgress)` sends files as `multipart/form-data` with the `Content-Type` header set automatically by Axios.
  • Progress tracking — `onUploadProgress` receives Axios progress events with `loaded` and `total` bytes for percentage calculation.
  • Size validation — Frontend validates file size against `AWS_S3_UPLOAD_SIZE_LIMIT_MB` before initiating upload; over-limit files show an error toast.
  • Multiple files — Batch uploads send files sequentially (not in parallel) to avoid overwhelming the server; each file gets its own progress bar.
  • Retry — No automatic retry on upload failure; users see an error and can retry manually.
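
The size check and progress math described above reduce to two pure helpers (hypothetical names; the real equa-web logic lives alongside `postMultipart`):

```typescript
// Client-side size check mirroring AWS_S3_UPLOAD_SIZE_LIMIT_MB (default 10 MB).
function isWithinSizeLimit(sizeBytes: number, limitMb = 10): boolean {
  return sizeBytes <= limitMb * 1024 * 1024;
}

// Percentage for the progress bar from an Axios-style progress event
// ({ loaded, total } in bytes); guards against a missing total.
function uploadPercent(loaded: number, total?: number): number {
  return total && total > 0 ? Math.round((loaded / total) * 100) : 0;
}
```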

6. Business Rules

  1. Content-addressed deduplication — Files with identical content (same SHA-256 hash) are stored once in S3; additional uploads create new `Files` metadata records pointing to the same `HashedFiles` entry.
  2. Upload size limit — Files exceeding `AWS_S3_UPLOAD_SIZE_LIMIT_MB` (default 10 MB) are rejected at the server with a 413 response. The frontend also validates before upload.
  3. Hash as foreign key — `Files.hash` references `HashedFiles.hash`; the `HashedFiles` row must exist before the `Files` row is created.
  4. Owner tracking — Every file upload records the authenticated user as `Files.owner` for attribution and access control.
  5. S3 key structure — S3 object keys include the content hash to enable direct dedup lookup; the exact key format is {prefix}/{hash}/(unknown).
  6. Soft delete — Deleting a `Files` record does not immediately remove the S3 object; orphaned `HashedFiles` entries (no remaining `Files` references) are cleaned up by a background job.
  7. MIME type detection — Content type is determined from the file extension and validated against the file’s magic bytes where possible.
  8. Static file URL — `STATIC_FILE_URL` is prepended to file paths for client-facing URLs; this allows CDN or proxy configuration without changing stored paths.
  9. Microsoft file passthrough — Files originating from Microsoft 365 (via SPEC 017) flow through the same upload pipeline; Microsoft-specific metadata is preserved in the `Files` record.
  10. No direct S3 access — Clients never interact with S3 directly; all uploads and downloads are proxied through the API server for access control.
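
Rule 8 amounts to joining a stored relative path onto `STATIC_FILE_URL` at response time; a minimal sketch (the function name is hypothetical):

```typescript
// Resolve a client-facing URL from STATIC_FILE_URL and a stored relative
// path, tolerating stray slashes on either side of the join.
function resolveFileUrl(staticFileUrl: string, storedPath: string): string {
  const base = staticFileUrl.replace(/\/+$/, "");
  const path = storedPath.replace(/^\/+/, "");
  return `${base}/${path}`;
}
```

Storing only the relative path and resolving the full URL at response time is also what makes a later `STATIC_FILE_URL` change (CDN or proxy swap) safe.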

7. Acceptance Criteria

  • File upload via multipart form-data succeeds and returns file metadata with ID and URL
  • Uploading the same file content twice creates two Files records but only one HashedFiles / S3 object
  • Files exceeding the size limit are rejected with a 413 error and descriptive message
  • Frontend validates file size before upload and shows error toast for over-limit files
  • Upload progress bar reflects actual upload progress via onUploadProgress
  • File download streams content from S3 with correct Content-Type and Content-Disposition
  • File metadata endpoint returns correct filename, size, extension, content type, and owner
  • Deleting a file removes the Files record; S3 object persists if other records reference the same hash
  • Organization-scoped file listing returns only files belonging to that organization
  • STATIC_FILE_URL is correctly prepended to file URLs in API responses
  • Microsoft-sourced files are stored and retrievable through the same file endpoints
  • File owner is correctly set to the authenticated user on upload

8. Risks

| Risk | Impact | Mitigation |
|---|---|---|
| S3 bucket misconfiguration (public access) | Data exposure of all stored files | Audit bucket policies; block public access; use signed URLs with expiry |
| Upload size limit bypass via chunked encoding | Over-limit files consume storage | Enforce size limit at reverse proxy (nginx/ALB) and application layer |
| Hash collision (SHA-256) | Extremely unlikely, but would serve wrong file content | Finding a collision requires on the order of 2^128 hash operations (birthday bound); no practical mitigation needed |
| Orphan S3 objects accumulate | Wasted storage costs | Run periodic orphan cleanup job comparing `HashedFiles` references to S3 inventory |
| No multipart/resumable upload for large files | Uploads fail on slow connections near the 10 MB limit | Consider S3 presigned URL uploads or chunked multipart for files >5 MB |
| `STATIC_FILE_URL` change breaks existing URLs | Stored file URLs become unreachable | Use relative paths internally; resolve full URL at response time |
| Concurrent upload of identical content | Race condition creating duplicate `HashedFiles` rows | Use `INSERT ... ON CONFLICT` (upsert) on `hash` column |
| No virus/malware scanning | Malicious files stored and served | Add ClamAV or S3 Object Lambda scanning before persisting |
| File deletion without cascade awareness | Orphaned `DirectoryItems` pointing to deleted `Files` | Enforce referential integrity; check for references before delete |
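
The upsert mitigation for the concurrent-upload race can be sketched as a parameterized statement, assuming a Postgres-style client with a `query(text, params)` method (the helper name is illustrative):

```typescript
// ON CONFLICT makes concurrent inserts of the same hash a no-op instead of
// a unique-constraint error, so racing uploads both proceed to create their
// Files rows against a single HashedFiles entry.
const UPSERT_HASHED_FILE = `
  INSERT INTO "HashedFiles" (hash, size)
  VALUES ($1, $2)
  ON CONFLICT (hash) DO NOTHING
`;

interface SqlClient {
  query(text: string, params: unknown[]): Promise<unknown>;
}

async function ensureHashedFile(client: SqlClient, hash: string, size: number): Promise<void> {
  await client.query(UPSERT_HASHED_FILE, [hash, size]);
}
```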