Document Vision Reader

Use gpt-4o-mini image input to extract structured fields from a form, screenshot, or poster.

Problem

Teams often receive information as images. Manually copying fields slows the workflow and creates mistakes.

Users

Students submitting forms, staff checking event materials, and teams processing public screenshots.

Why this track

This track practices the multimodal input covered in the curriculum, using a service path that is currently verified to work. It does not require Azure AI Vision by default.

Architecture

Stay minimal. 5-6 nodes. Each arrow is one network hop.

  • image: Public or synthetic image
  • encoder: Data URI encoder
  • model: gpt-4o-mini vision input
  • schema: Extraction JSON schema
  • review: Human confirmation screen
  • log: Safe demo log

Edges

  • image -> encoder: read bytes
  • encoder -> model: image_url content
  • model -> schema: json_object
  • schema -> review: editable fields
  • review -> log: confirmed summary only
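
The first two edges (read bytes, then pass the result as image_url content) can be sketched as below. This is a minimal sketch, assuming the image is already in memory as bytes; the helper name `to_data_uri` and the default MIME type are illustrative and not taken from the reference repository.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI (hypothetical helper)."""
    if not image_bytes:
        raise ValueError("image_bytes is empty")
    # base64 output is pure ASCII, so decoding to str is safe
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string is what goes into the `url` field of the image_url content part.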

Prompt Pack

Starting prompts. Iterate. Move the system prompt into prompts/system.md so it can be versioned.

system
Extract only visible information from the image. Return JSON with title, detected_fields, missing_fields, confidence, and next_step. Use null when a field is not visible. Do not infer private identity details.
user
Image: sample student project form screenshot.
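
One way to wire the versioned system prompt into the request is sketched below. The function names `load_system_prompt` and `build_messages` are hypothetical; only the prompts/system.md path and the message shape come from this page.

```python
from pathlib import Path


def load_system_prompt(path: str = "prompts/system.md") -> str:
    """Read the versioned system prompt from disk (hypothetical helper)."""
    return Path(path).read_text(encoding="utf-8").strip()


def build_messages(system_prompt: str, user_text: str, data_uri: str) -> list:
    """Assemble a system message plus a user message with text and image parts."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "text", "text": user_text},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ]},
    ]
```

Keeping the prompt text out of the code means a prompt change shows up as its own diff. JSON mode expects the prompt itself to ask for JSON, which the system prompt above does.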

Code Snippet

The pattern shape. Read it, run the matching scaffold, then adapt the idea for your own team.

python
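import json

# Assumes `client` is an already configured chat client and `deployment`
# names the gpt-4o-mini deployment; both are set up outside this snippet.
response = client.chat.completions.create(
    model=deployment,
    messages=[{"role": "user", "content": [
        {"type": "text", "text": EXTRACTION_PROMPT},
        # data_uri is the image encoded as a base64 data URI
        {"type": "image_url", "image_url": {"url": data_uri}},
    ]}],
    response_format={"type": "json_object"},  # request a single JSON object
)
# The reply content is a JSON string; parse it into a dict
fields = json.loads(response.choices[0].message.content)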
# ... your turn: add a confirmation screen before saving

Reference: src/techniques/vision_multimodal/ in halla-ai/hackathon-sample-2026
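
A JSON object reply is not guaranteed to contain every key the system prompt asks for. The sketch below normalizes the parsed reply to the five keys named in the prompt pack and uses None for anything absent. The name `parse_extraction` is an assumption for illustration, not part of the reference scaffold.

```python
import json

# Keys requested by the system prompt in the Prompt Pack
EXPECTED_KEYS = ("title", "detected_fields", "missing_fields", "confidence", "next_step")


def parse_extraction(raw: str) -> dict:
    """Parse model output, keep only expected keys, and use None for absent ones."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: data.get(key) for key in EXPECTED_KEYS}
```

Dropping unexpected keys keeps the review screen limited to fields the team has agreed to show.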

Demo Screens

Three screens that prove the prototype works.

1. Image upload: the user selects a small public or synthetic image.

2. Extracted JSON: detected fields, missing fields, confidence, and next step.

3. Confirm fields: the user edits or rejects the extraction before it enters the project flow.
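
The logic behind the third screen can be sketched independently of any UI framework. This is a minimal sketch, assuming the screen hands back a dict of edited values and an accept/reject flag; `confirm_extraction` is a hypothetical name.

```python
from typing import Optional


def confirm_extraction(fields: dict, edits: dict, accepted: bool) -> Optional[dict]:
    """Apply reviewer edits to a copy of the fields; return None if rejected."""
    if not accepted:
        return None
    confirmed = dict(fields)  # copy so the raw extraction stays unchanged
    confirmed.update(edits)
    return confirmed
```

Only the returned dict, never the raw model output, would then move on to the log or the project flow.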

Azure budget

Use the existing gpt-4o-mini deployment. Keep images small and test with 3-5 examples; do not batch private documents.

Pitfalls

  • Symptom: image_parse_error. Cause: invalid image bytes. Fix: test with a valid small PNG first.
  • Symptom: model fills missing fields. Cause: prompt rewards completion. Fix: require null for unseen data.
  • Symptom: private data appears in demo. Cause: real ID or student record image. Fix: use synthetic images.
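
For the first pitfall, a cheap local check before submission can catch files that are not PNGs at all. The sketch below tests the standard 8-byte PNG signature and an assumed size cap. The 1 MB limit and the name `check_png` are illustrative choices, not service requirements, and a passing signature does not prove the rest of the file is intact.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed first 8 bytes of every PNG file
MAX_BYTES = 1_000_000  # assumed demo limit to keep images small


def check_png(image_bytes: bytes, max_bytes: int = MAX_BYTES) -> None:
    """Raise ValueError if the bytes do not start like a PNG or exceed the cap."""
    if not image_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG: signature bytes missing")
    if len(image_bytes) > max_bytes:
        raise ValueError(f"image is {len(image_bytes)} bytes, limit is {max_bytes}")
```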

Possible Extensions

If you finish the 1-day path early, use one question below to make the project more original.

  • How would your version compare extraction against a hand-labeled answer?
  • How would you reject unsafe images before model submission?
  • How would you combine image extraction with the evaluation harness?
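
As one possible starting point for the first question, extraction can be scored field by field against a hand-labeled answer. This is a sketch under the assumption that both sides are flat dicts and that exact equality is an acceptable match rule; `field_accuracy` is a hypothetical name.

```python
def field_accuracy(extracted: dict, labeled: dict) -> float:
    """Fraction of hand-labeled fields whose extracted value matches exactly."""
    if not labeled:
        raise ValueError("labeled answer has no fields")
    # A key missing from the extraction counts as None, matching the null rule
    matches = sum(1 for key, value in labeled.items() if extracted.get(key) == value)
    return matches / len(labeled)
```

Exact matching is strict: differences in case or whitespace count as errors, so a team may want to normalize values before comparing.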