Starting the Ingest Process

Discover how to create documents from uploaded files and initiate the ingestion process using the Vulgate API.

Creating a document and initiating an ingest job

To start an ingest job, either for a new document built from previously uploaded files or for an existing document you want to reprocess, send a request to the following endpoint:

POST /api/jobs

Request schema

The request body must include exactly one of files or document_id:

Field	Type	Default / Required	Description
`files`	`array`	Conditional*	Array of file objects, each with an `id` (string, from the upload step) and an optional `source_url` (string, recorded as the document’s source). Required if `document_id` is not provided.
`document_id`	`string` (UUID)	Conditional*	UUID of an existing document to reprocess. Required if `files` is not provided. The new job reuses the document’s previous files and settings unless you override them.
`team`	`string`	Key’s team	Team slug. Optional when your API key is scoped to a team (the key’s team is used); if provided, it must match the key’s team.
`ingest_mode`	`string`	`"standard"`	The processing tier: `"standard"` or `"pro"`. Determines the extraction engine, supported file types, and credit cost.
`pipeline`	`string`	`"tei"`	The structuring pipeline. Omit this field — the default TEI pipeline is correct for all new integrations.
`model`	`string`	—	Deprecated — ignored. The processing model is fixed by `ingest_mode`; see Models.
`batch_mode`	`boolean`	`false`	Only honored on tiers that support batched processing (currently `"pro"`); silently coerced to `false` otherwise. Batched jobs are cheaper but may take up to 24 hours to complete.
`scope`	`string`	`"private"`	Document visibility: `"private"` (only the uploader) or `"organization"` (all team members). Setting `"organization"` requires an org-manager role.
`document_collections`	`array`	—	Array of collection UUIDs to add the new document to at creation time. Ignored when reprocessing.
`document_metadata`	`object`	—	Metadata to set on the new document at creation time; see Document metadata. Ignored when reprocessing.
`audio_options`	`object`	—	Options for audio/video files; see Audio options.

* Either files or document_id must be provided, but not both.
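The either/or rule above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the Vulgate API: the helper name build_job_body and its parameters are hypothetical, and the remaining options are passed through unvalidated.

```python
from __future__ import annotations

from typing import Any, Optional


def build_job_body(
    file_ids: Optional[list[str]] = None,
    document_id: Optional[str] = None,
    **options: Any,
) -> dict[str, Any]:
    """Build a POST /api/jobs body (hypothetical helper, not an official client)."""
    # Exactly one of the two must be supplied: both or neither is an error.
    if bool(file_ids) == bool(document_id):
        raise ValueError("provide exactly one of file_ids or document_id")
    # Pass-through fields such as ingest_mode or scope are not validated here.
    body: dict[str, Any] = dict(options)
    if file_ids:
        body["files"] = [{"id": file_id} for file_id in file_ids]
    else:
        body["document_id"] = document_id
    return body
```

For example, build_job_body(file_ids=["<file-id-from-upload>"], ingest_mode="pro") yields a body for the new-document flow, while build_job_body(document_id="<existing-document-uuid>") yields one for reprocessing.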

The created document’s document_format is set automatically on the server from the first uploaded file’s content type; this field is not part of the request body.

Response

The response contains the created document_id and job_id:

{ "document_id": "doc-xyz789", "job_id": "job-abc123" }

Errors

Status	Cause
`400`	Request body failed validation (e.g. unknown `ingest_mode` value, both `files` and `document_id` provided).
`401`	Missing/invalid API key, or `team` does not match the key’s team.
`500`	Job creation failed; `error.message` describes the cause — e.g. a tier not available to your team, non-PDF/image files sent to the `pro` tier, or no remaining ingest credits.
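A client might turn these statuses into readable messages along the following lines. This is a sketch under assumptions: the source only states that error.message is present on 500 responses, so reading the same shape from 400 and 401 bodies is a guess, and the function name and labels are illustrative.

```python
import json

# Labels paraphrase the error table above; they are not returned by the API.
STATUS_LABELS = {
    400: "request validation failed",
    401: "authentication or team mismatch",
    500: "job creation failed",
}


def describe_job_error(status: int, body_text: str) -> str:
    """Summarize a failed POST /api/jobs response (hypothetical helper)."""
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        payload = None
    # Assumption: errors look like {"error": {"message": "..."}} for every status.
    error = payload.get("error") if isinstance(payload, dict) else None
    message = error.get("message") if isinstance(error, dict) else None
    label = STATUS_LABELS.get(status, f"unexpected status {status}")
    return f"{label}: {message or '(no message)'}"
```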

Example

curl -X POST "https://vulgate.ai/api/jobs" \
  -H "Authorization: Bearer $VULGATE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "files": [{ "id": "<file-id-from-upload>" }],
    "ingest_mode": "pro",
    "scope": "organization",
    "document_metadata": { "language": "la", "publisher": "Typis Vaticanis" }
  }'
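The same request can be sketched in Python using only the standard library. The function names make_job_request and submit_job are hypothetical; the URL, headers, and body fields mirror the curl example above.

```python
import json
import urllib.request

API_URL = "https://vulgate.ai/api/jobs"  # same endpoint as the curl example


def make_job_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /api/jobs request."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_job(api_key: str, body: dict) -> dict:
    """Send the request and return the parsed JSON response."""
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(make_job_request(api_key, body)) as response:
        return json.load(response)
```

Calling submit_job(os.environ["VULGATE_API_KEY"], {"files": [{"id": "<file-id-from-upload>"}], "ingest_mode": "pro"}) would then return a dictionary with document_id and job_id, as described under Response.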

Processing tiers

The ingest_mode field selects the processing tier. Two tiers are available:

Tier	`ingest_mode`	Credits/page	Accepted file types	Description
Standard	`"standard"`	1	PDF, images, audio, video, Office documents, HTML	Fast OCR-based extraction for typical documents with tables, headings, and mixed layouts. The default.
Pro	`"pro"`	5	PDF and images only	Vision-model extraction with maximum depth for degraded scans and the most complex document structures.

Notes:

If ingest_mode is omitted, standard is used.
All files in a pro job must be PDFs or images (application/pdf or image/*); other content types are rejected with an error.
Other ingest_mode values exist in the schema for legacy and internal use. Requests using a tier that is not enabled for your team fail with the error "This ingest mode is no longer available for new uploads."
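Because the pro tier rejects anything other than PDFs and images, it can be worth checking content types locally before creating a job. The sketch below is illustrative only: the helper names are hypothetical, the server remains the authority on what it accepts, and the credit estimate simply multiplies the per-page figures from the table above (how audio and video are counted is not covered here).

```python
from __future__ import annotations

# Credits per page, taken from the tier table above.
CREDITS_PER_PAGE = {"standard": 1, "pro": 5}


def pro_tier_accepts(content_type: str) -> bool:
    """True for application/pdf or image/* (parameters such as charset are ignored)."""
    base = content_type.split(";", 1)[0].strip().lower()
    return base == "application/pdf" or base.startswith("image/")


def rejected_for_pro(content_types: list[str]) -> list[str]:
    """Return the content types that a pro job would be expected to reject."""
    return [ct for ct in content_types if not pro_tier_accepts(ct)]


def estimate_credits(pages: int, ingest_mode: str = "standard") -> int:
    """Rough page-based estimate; raises KeyError for tiers not in the table."""
    return CREDITS_PER_PAGE[ingest_mode] * pages
```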

Models

You no longer choose a model directly — each processing tier maps to a fixed, curated processing engine, and the model request field is ignored. Audio and video files are always routed to speech-to-text transcription regardless of tier.

Pipelines

The pipeline field selects how extracted content is structured:

Value	Description
`"tei"`	Default. Structures the document into TEI/XML — chapters, sections, paragraphs, and footnotes. Use this for all new integrations.
`"default"`	Legacy pipeline, kept for reprocessing old documents. Do not use for new documents.
`"documentai"`	Legacy Google Document AI pipeline, kept for reprocessing old documents. Do not use for new documents.

Omit the field unless you have a specific reason not to: new jobs default to "tei".

Document metadata

The optional document_metadata object sets metadata on the newly created document. All fields are optional:

Field	Type	Description
`language`	`string`	Language code (e.g. `"en"`, `"la"`).
`license`	`string`	License.
`publisher`	`string`	Publisher name.
`publication_place`	`string`	Place of publication.
`publication_date`	`string`	Publication date as text.
`publication_date_year`	`number \| null`	Publication year.
`publication_date_month`	`number \| null`	Publication month (1–12).
`publication_date_day`	`number \| null`	Publication day (1–31).
`publication_date_precision`	`string \| null`	Precision of the date (e.g. `"year"`, `"month"`, `"day"`).
`edition`	`string`	Edition.
`citation_doctype`	`string`	Document type used in citations.
`citation_container_title`	`string`	Container title (journal, series) used in citations.
`categories`	`string[]`	Category tags.

document_metadata only applies when a new document is created (files flow). Reprocessing with document_id ignores it — use PATCH /api/documents/{document_id} to change metadata on an existing document.
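The structured date fields can be filled from whatever parts of the date are known. The following sketch assumes that publication_date_precision should name the finest component supplied; the source lists "year", "month", and "day" only as examples, so treat that mapping, and the helper name, as assumptions.

```python
from __future__ import annotations

from typing import Optional


def publication_date_fields(
    year: int, month: Optional[int] = None, day: Optional[int] = None
) -> dict[str, object]:
    """Build the publication_date_* fields of document_metadata (hypothetical helper)."""
    if day is not None and month is None:
        raise ValueError("a day requires a month")
    if month is not None and not 1 <= month <= 12:
        raise ValueError("month must be in 1-12")
    if day is not None and not 1 <= day <= 31:
        raise ValueError("day must be in 1-31")
    # Assumption: precision names the finest component that was provided.
    precision = "day" if day is not None else "month" if month is not None else "year"
    return {
        "publication_date_year": year,
        "publication_date_month": month,  # None serializes to JSON null
        "publication_date_day": day,
        "publication_date_precision": precision,
    }
```

The result can be merged into a larger object, for example {"language": "la", **publication_date_fields(1592)}.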

Audio options

For audio and video files, audio_options controls the processing performed:

Field	Type	Description
`audio_options.transcription`	`boolean`	Transcribe speech to text.
`audio_options.music_analysis`	`boolean`	Analyze musical content.

Both fields are required when audio_options is present.
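Since a partial audio_options object is invalid, a client can check it before submitting. This is a small sketch with a hypothetical helper name; it only mirrors the two rules stated above (both fields present, both boolean).

```python
from __future__ import annotations

REQUIRED_AUDIO_FIELDS = ("transcription", "music_analysis")


def audio_option_problems(options: dict) -> list[str]:
    """List the problems found in an audio_options object; empty means it looks valid."""
    problems = []
    for field in REQUIRED_AUDIO_FIELDS:
        if field not in options:
            problems.append(f"missing {field}")
        elif not isinstance(options[field], bool):
            problems.append(f"{field} must be a boolean")
    return problems
```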

Monitoring job progress

Poll the jobs endpoint to track a job until it finishes:

GET /api/jobs?job_id={job_id}

Parameter	Type	Description
`job_id`	`string`	Filter to specific job IDs. May be repeated.
`document_id`	`string`	Filter to a document’s jobs.
`status`	`string`	`"any"` (default), `"incomplete"`, or a specific status value.
`team`	`string`	Team slug. Optional when the API key is team-scoped.
`page`	`number`	Page number (18 jobs per page; default `1`).

The response contains a data array of job rows (including status), plus count and pageCount. A job typically moves through pending → processing → processed (ready to finalize). If processing fails, the status is error and the row's status_text describes the failure.
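A polling loop over these statuses might look like the sketch below. It makes several assumptions: fetch_status is a hypothetical callback that performs the GET and returns the row's status and status_text, the 5-second interval and 120-poll limit are arbitrary, and the base URL is taken from the curl example for POST /api/jobs.

```python
from __future__ import annotations

import time
from typing import Callable
from urllib.parse import urlencode

BASE_URL = "https://vulgate.ai/api/jobs"


def jobs_url(job_ids: list[str], page: int = 1) -> str:
    """Build the query URL; job_id is repeated once per ID."""
    return f"{BASE_URL}?{urlencode({'job_id': job_ids, 'page': page}, doseq=True)}"


def wait_for_job(
    fetch_status: Callable[[], tuple[str, str]],
    interval: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job is processed; raise on error or after max_polls."""
    for attempt in range(max_polls):
        status, status_text = fetch_status()
        if status == "processed":
            return status
        if status == "error":
            raise RuntimeError(status_text)
        # pending, processing, or any other status: wait and try again.
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"job not processed after {max_polls} polls")
```

Batched jobs may take up to 24 hours, so a longer interval and a higher poll limit would be more appropriate when batch_mode is enabled.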

Finalizing the document

Creating a job starts processing. Once the job reaches the processed status, finalize the document to generate its searchable parts (embeddings) and publish it:

POST /api/jobs/{job_id}/complete

Until this step runs, the document stays unpublished with no parts and is not returned by search. The request body is optional; include metadata fields (the same ones accepted by PATCH /api/documents/{document_id}) to set them at finalize time.

Response

{ "data": { "id": "doc-xyz789" }, "error": null }

The full ingest sequence is therefore: upload → POST /api/jobs → poll GET /api/jobs until processed → POST /api/jobs/{job_id}/complete.
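The sequence after the upload step can be sketched end to end as follows. The call parameter is a hypothetical function (method, path, body) that performs the HTTP request and returns the parsed JSON; it stands in for whatever HTTP client you use. The upload itself is outside this section and is assumed to have already produced the file IDs.

```python
from __future__ import annotations

import time
from typing import Any, Callable, Optional
from urllib.parse import quote

# Hypothetical transport: (method, path, json_body) -> parsed JSON response.
ApiCall = Callable[[str, str, Optional[dict]], dict]


def run_ingest(
    call: ApiCall,
    file_ids: list[str],
    poll_interval: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Create a job, wait for it to be processed, then finalize the document."""
    created = call("POST", "/api/jobs", {"files": [{"id": fid} for fid in file_ids]})
    job_id = created["job_id"]
    status_path = f"/api/jobs?job_id={quote(job_id, safe='')}"
    for _ in range(max_polls):
        rows = call("GET", status_path, None)["data"]
        # Assumption: an empty data array means the job row is not visible yet.
        status = rows[0]["status"] if rows else "pending"
        if status == "processed":
            break
        if status == "error":
            raise RuntimeError(rows[0].get("status_text") or "job failed")
        sleep(poll_interval)
    else:
        raise TimeoutError(f"job {job_id} not processed after {max_polls} polls")
    finalized = call("POST", f"/api/jobs/{quote(job_id, safe='')}/complete", None)
    if finalized.get("error"):
        raise RuntimeError(str(finalized["error"]))
    return finalized["data"]["id"]
```

Injecting call and sleep keeps the sketch testable without network access; in real use, call would attach the Authorization header shown in the curl example.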