
Product Requirements Document

Product Information

Title: Avaaz
Change History:

Date Version Author Description
2025-12-04 0.6.0 Internal (Codex) Tightened health-check requirements and ensured all endpoints and criteria are testable.
2025-12-04 0.5.0 Internal (Codex) Added requirements from README/architecture; ensured testable, learner-facing feature coverage.
2025-12-04 0.4.0 Internal (Codex) Simplified language, reduced redundancy, clarified non-functional requirements.
2025-12-03 0.3.0 Internal (Codex) Reinforced mobile-first learner flow, clarified spoken-skill focus and the A1–B2 oral practice scope.
2025-12-03 0.2.0 Internal (Codex) Clarified A1–B2 scope; added curriculum/exam, authoring, persistence, and health requirements.
2025-12-03 0.1.0 Internal (Codex) Initial PRD drafted from README.md and docs/architecture.md.

Date: 2025-12-04
Status: Draft

Product Overview

Avaaz is a mobile and web app with a conversational AI tutor. It teaches speaking skills through structured, interactive, voice-first lessons that adapt to each learner's pace and performance. Avaaz supports CEFR levels A1–B2, with a primary goal of B2 oral exam readiness and confident real-life conversation in the destination country.

Avaaz combines a CEFR-aligned curriculum with real-time AI conversation to deliver low-latency speech-to-speech practice across devices. Learners primarily use native iOS and Android apps. Instructors, coordinators, and administrators use a responsive web portal to manage curricula, reporting, and settings.

Problem Statement:
Adult immigrants and other language learners struggle to achieve confident speaking ability in their target language, especially at the B2 level required for exams, citizenship, or professional roles. Existing solutions (apps, textbooks, group classes) emphasize passive skills (reading, vocabulary drills, grammar) that do not directly translate into fluent speech. Avaaz intentionally keeps reading and writing as contextual supports only—every lesson, scenario, and assessment is designed around spoken interaction, pronunciation, fluency, and comprehension. Human tutors are expensive, scarce in many regions, and difficult to scale, leaving learners underprepared for real-life conversations and high-stakes oral exams.

Product Vision:
To be the trusted AI speaking coach for immigrants and global learners. Avaaz should feel like a human tutor—natural dialogue, rich corrective feedback, and realistic scenarios—while scaling to thousands of learners. Avaaz will measurably improve speaking confidence and B2 exam readiness by:

  • Reducing learners' anxiety in real conversations.
  • Increasing B2 oral exam pass rates.
  • Shortening the time required to progress from A1 → A2 → B1 → B2 speaking proficiency.

User and Audience

Avaaz serves adults learning a new language for migration, work, and social integration, with an initial focus on English → Norwegian Bokmål. Learners use the mobile apps or web app to practice speaking; text and transcripts are supporting aids, not the main focus. Instructors, coordinators, and administrators use a web portal to manage curricula, monitor cohorts, and configure settings. Learners can start from A1 and progress through A2 and B1 up to B2; the main goal is to help them reach and pass the B2 oral exam while adding value at each stage.

Personas:

  • Adult Immigrant Exam Candidate (Primary Persona):

    • Age 20–45; recently moved to a new country (e.g., Norway).
    • Needs to pass a B2 oral exam for residency, citizenship, or professional accreditation.
    • Has limited time (work, family duties) and mixed confidence speaking with natives.
    • Uses a mid-range phone or laptop; often learns on evenings and weekends.
    • Pain points: insufficient speaking practice, fear of making mistakes, difficulty accessing affordable tutors, and lack of clear feedback on exam readiness.
  • Working Professional Needing Workplace Fluency (Secondary Persona):

    • Already employed or seeking employment; needs to operate in the target language at work (meetings, clients, daily conversations).
    • Wants targeted practice around workplace scenarios (e.g., stand-ups, 1:1s, presentations).
    • Pain points: embarrassed about accent/fluency, no safe space to practice, needs domain-specific vocabulary and politeness strategies.
  • Language School / Program Coordinator (Secondary Persona):

    • Manages groups of learners in a language school, NGO, or integration program.
    • Wants a scalable speaking practice tool that complements classes and provides data on learner progress.
    • Pain points: limited classroom time, uneven speaking opportunities for students, lack of granular speaking analytics.

User Scenarios:

  • Daily commute micro-lessons (Primary Persona):
    On the bus after work, a learner starts a 10-minute speaking session on “Small Talk at the Workplace.” Avaaz adapts prompts based on mistakes, gives immediate pronunciation and grammar feedback, and updates progress toward the learner's target level.

  • Mock B2 oral exam before test day (Primary Persona):
    A week before the exam, the learner runs a full mock oral exam in “Exam Mode.” Avaaz simulates an examiner with timed sections, tracks key speaking skills, and produces an exam-style report with an estimated CEFR level and clear improvement suggestions.

  • Preparing for workplace interactions (Working Professional):
    Before a performance review, a learner practices the “Performance Review Conversation” scenario. Avaaz role-plays manager and colleague, uses realistic workplace language, and coaches polite but assertive phrasing and cultural norms.

  • Program-wide monitoring (Program Coordinator):
    An instructor encourages all students to complete three speaking sessions per week. The coordinator reviews dashboards (e.g., minutes spoken, estimated CEFR band, completion of key scenarios) to spot learners who need support and to report impact to stakeholders.

Functional Requirements (Features)

This section describes the core capabilities required for a production-grade Avaaz full-stack application. Each feature is expressed via user stories with acceptance criteria and dependencies.

1. Voice-First Conversational Lessons

  1. User Story: Real-Time Voice Tutoring

    • As a learner between A1 and B2 (with a special focus on B1–B2 exam preparation),
    • I want to speak with an AI tutor using my microphone in near real time,
    • so that I can practice spontaneous spoken interaction and receive immediate feedback.

    Acceptance Criteria:

    • Learner can start a voice session from mobile or web with one tap/click.
    • Audio is streamed via WebRTC with end-to-end latency low enough to support natural turn-taking (target < 250 ms one-way).
    • AI tutor responds using synthesized voice and on-screen text.
    • Transcription and persistence of audio and text follow the persistent conversation and transcript requirements described below.
    • If the microphone or network fails, the app displays an actionable error and offers a retry or text-only fallback.

    Dependencies: LiveKit server (signaling + media), LLM realtime APIs (OpenAI Realtime, Gemini Live), Caddy reverse proxy, WebRTC-capable browsers/mobile clients, backend session orchestration.

  2. User Story: Adaptive Conversational Flow

    • As a learner with uneven skills,
    • I want to receive dynamically adjusted prompts and scaffolding,
    • so that conversations stay challenging but not overwhelming.

    Acceptance Criteria:

    • AI tutor adjusts prompt complexity based on recent performance (e.g., error rate, hesitation, completion rate) and current CEFR level (A1–B2).
    • System can slow down, rephrase, or switch to simplified questions when the learner struggles.
    • System can increase complexity (longer turns, follow-up questions, abstract topics) when the learner performs well.
    • Explanations include level-appropriate grammar focus (e.g., simple present and basic word order at A1, more complex clause structures and connectors at B1–B2).
    • When explaining, the AI tutor supplements speech with visual and textual aids (images, tables, short written examples) where appropriate.
    • Changes in difficulty are logged for analytics.

    Dependencies: Backend lesson/lesson-state models, LLM prompt engineering and agent logic, PostgreSQL + pgvector for storing session metrics.
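
    The adaptive behavior described above could start from a simple heuristic over recent turn metrics. A minimal sketch; the signals and thresholds are illustrative placeholders, not tuned or agreed values:

```python
# Hypothetical heuristic for adjusting prompt difficulty from recent session
# metrics; signals and thresholds are illustrative, not product decisions.
from dataclasses import dataclass


@dataclass
class TurnMetrics:
    error_rate: float        # share of turns with notable errors (0..1)
    hesitation_ratio: float  # pause time / speaking time (0..1)
    completion_rate: float   # share of prompts answered fully (0..1)


def adjust_difficulty(current: int, m: TurnMetrics) -> int:
    """Return the new difficulty step (1 = simplest); the caller logs the change for analytics."""
    if m.error_rate > 0.4 or m.completion_rate < 0.5:
        return max(1, current - 1)   # slow down, rephrase, simplify
    if m.error_rate < 0.1 and m.hesitation_ratio < 0.2 and m.completion_rate > 0.9:
        return current + 1           # longer turns, follow-ups, more abstract topics
    return current
```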

  3. User Story: Comprehensive Speaking Feedback

    • As a learner preparing for real conversations and exams,
    • I want to receive detailed feedback on my speaking, not just pronunciation and grammar,
    • so that I understand my strengths and weaknesses across all key speaking skills.

    Acceptance Criteria:

    • After a lesson or mock exam, the system can display or generate scores or qualitative ratings for fluency, pronunciation, grammar, vocabulary, and coherence.
    • Feedback includes at least 2–3 concrete examples from the session (e.g., misused word, unclear phrasing, hesitation).
    • Feedback format is consistent across sessions and mock exams so results are comparable over time.
    • Learner can view previous feedback reports from a “History” or equivalent section.

    Dependencies: Conversation transcription, scoring and analysis models, feedback formatting logic, persistent storage for feedback reports.

2. CEFR-Aligned Curriculum & Real-Life Scenarios

  1. User Story: Structured CEFR-Aligned Path

    • As a motivated learner starting anywhere between A1 and B2,
    • I want to follow a clear sequence of speaking lessons mapped to CEFR descriptors,
    • so that I can track my progress toward B2 and avoid gaps in my skills.

    Acceptance Criteria:

    • Curriculum is structured into levels (A1, A2, B1, B2) and modules (e.g., “Everyday Life,” “Workplace,” “Public Services”).
    • Each speaking lesson includes goals, target CEFR descriptors, example prompts, and success criteria.
    • Learner can see which lessons are completed, in progress, or locked.
    • The system records completion, time spent, and estimated performance for each lesson.

    Dependencies: Backend curriculum models and APIs, frontend curriculum navigation views, content authoring workflow (internal or admin UI), PostgreSQL for storing lesson metadata.

  2. User Story: Immigrant-Focused Real-Life Scenarios

    • As a newly arrived immigrant,
    • I want to practice conversations that match my daily life (e.g., at the doctor, at school, at work, at public offices),
    • so that I feel confident handling real interactions in my new country.

    Acceptance Criteria:

    • Library of scenario templates linked to CEFR levels and contexts (workplace, healthcare, school, housing, etc.).
    • For each scenario, the AI tutor can role-play multiple participants (e.g., nurse, receptionist, colleague).
    • Visual cues (images, documents, forms) can be shown where relevant.
    • Scenarios are localizable (e.g., cultural norms, common phrases) per destination country.
    • Scenarios can be designed to emphasize key oral communication purposes seen in official exams: self-presentation, describing pictures or situations, exchanging information, expressing opinions, and arguing for or against a statement.
    • Scenario templates support both individual and pair/role-play modes, with configurable durations and turn-taking rules.

    Dependencies: Media storage for images/documents, LLM prompt templates by scenario, localization framework, content governance and review processes.

  3. User Story: Curriculum Model with Multi-Skill Objectives

    • As a curriculum designer,
    • I want to model learning objectives for each level (A1–B2) across reception, production, interaction, and mediation skills,
    • so that I can align the digital curriculum with established language frameworks and reuse it across languages.

    Acceptance Criteria:

    • For each CEFR level (A1–B2), the curriculum can capture objectives for listening/reading (reception), speaking/writing (production), interaction (dialogue and conversations), and mediation (explaining and rephrasing).
    • Lessons and mock exams reference one or more of these objectives, enabling coverage analysis and reporting.
    • Objectives and mappings are configurable per language pair and per country-specific curriculum where applicable.

    Dependencies: Backend curriculum data model, admin tooling for curriculum management, reporting/analytics based on objectives.
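
    A minimal sketch of how these objectives might be modeled, assuming Pydantic; entity and field names are illustrative, not the final schema:

```python
# Indicative model of per-level objectives across the four skill groups,
# configurable per language pair; names and fields are illustrative only.
from enum import Enum

from pydantic import BaseModel


class SkillGroup(str, Enum):
    RECEPTION = "reception"      # listening / reading
    PRODUCTION = "production"    # speaking / writing
    INTERACTION = "interaction"  # dialogue and conversations
    MEDIATION = "mediation"      # explaining and rephrasing


class Objective(BaseModel):
    cefr_level: str          # "A1" .. "B2"
    skill_group: SkillGroup
    descriptor: str          # e.g. "Can describe everyday routines in simple terms"
    language_pair: str       # e.g. "en-nb" for English -> Norwegian Bokmål


class Lesson(BaseModel):
    title: str
    cefr_level: str
    objective_ids: list[int]  # links enabling coverage analysis and reporting
```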

  4. User Story: Accent and Cultural Adaptation

    • As a learner moving to a specific country or region,
    • I want to practice with local accents and culturally appropriate language,
    • so that my speech sounds natural and polite in real life.

    Acceptance Criteria:

    • Lessons and scenarios can be tagged with destination country/region and typical dialect or accent.
    • AI tutor can switch between at least one default accent and one local accent where the target language supports it.
    • Scenarios include common cultural norms and politeness strategies (e.g., formal vs informal address) that the tutor can explain on request.
    • Coordinators or admins can choose which regional variants are enabled for their learners.

    Dependencies: Content localization by region, voice configuration options for accents, cultural notes in curriculum content, admin configuration UI or settings.

3. Mock Oral Exam Mode & Assessment

  1. User Story: Full B2 Mock Exam

    • As a learner preparing for a B2 oral exam,
    • I want to take a timed mock exam that follows the official exam structure,
    • so that I know what to expect and can benchmark my readiness.

    Acceptance Criteria:

    • System supports predefined exam templates (sections, timings, types of prompts) for levels A1–A2, A2–B1, and B1–B2, based on local exam formats where applicable.
    • Exam templates can include warm-up tasks that are not scored, as well as scored tasks.
    • Each exam part can be configured as individual or pair conversation, and as one of several task types: self-presentation, describing a picture or situation, speaking about a familiar topic, exchanging views, expressing opinions, and taking a position on a statement with arguments.
    • During the exam, the system enforces timing (visible countdown) and turn-taking rules.
    • At the end, the learner receives an exam-like report with an estimated CEFR level and component scores (fluency, pronunciation, vocabulary, grammar, coherence).
    • Report is saved and viewable later in the “Results” or “History” section.
    • The system can optionally present a small number of stretch tasks from the next higher level to detect learners whose skills may exceed the nominal exam level.

    Dependencies: Assessment rubric definitions, scoring models (LLM-based + heuristic), backend report generation, persistent storage of exam sessions and scores.
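
    One way the exam templates above could be represented, assuming Pydantic; task-type names and fields are illustrative only:

```python
# Indicative data model for configurable exam templates (level bands, parts,
# task types, timing, pairing, and scoring); not a final schema.
from enum import Enum

from pydantic import BaseModel


class TaskType(str, Enum):
    SELF_PRESENTATION = "self_presentation"
    DESCRIBE_PICTURE = "describe_picture"
    FAMILIAR_TOPIC = "familiar_topic"
    EXCHANGE_VIEWS = "exchange_views"
    EXPRESS_OPINION = "express_opinion"
    ARGUE_POSITION = "argue_position"


class ExamPart(BaseModel):
    title: str
    task_type: TaskType
    duration_seconds: int            # enforced with a visible countdown
    pair_conversation: bool = False  # individual vs. pair/role-play
    scored: bool = True              # warm-up parts can set this to False


class ExamTemplate(BaseModel):
    level_band: str                      # "A1-A2", "A2-B1", or "B1-B2"
    parts: list[ExamPart]
    include_stretch_tasks: bool = False  # optional tasks from the next level up
```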

  2. User Story: Performance Summaries After Each Session

    • As a learner who just completed a session,
    • I want to see a concise summary of what I did well and what to improve,
    • so that I can focus my next practice and see my progress over time.

    Acceptance Criteria:

    • Post-session screen shows key strengths, common errors, and 2–3 prioritized recommendations.
    • Summary highlights examples from the conversation (e.g., misused prepositions, pronunciation errors).
    • Learner can share or export summaries (e.g., PDF or link) where allowed.
    • Summaries contribute to longitudinal analytics (trends by skill over time).

    Dependencies: Conversation transcription, error detection pipeline, LLM feedback processing, analytics storage and querying.

4. Multilingual Scaffolding & Integrated Translation

  1. User Story: Localized UI and Instructions

    • As a learner with limited proficiency in the target language,
    • I want to see the app's UI and core instructions in my native or preferred language,
    • so that I am not blocked by interface comprehension while focusing on speaking practice.

    Acceptance Criteria:

    • App supports multiple UI languages with a clear selector during onboarding and in settings.
    • Static text (menus, buttons, error messages) is localized.
    • Critical flows (onboarding, subscription, exam mode) are fully localized.
    • Default UI language is inferred from locale but can always be overridden by the user.

    Dependencies: Localization/i18n system on frontend and backend, translations management process, design support for longer text variants.

  2. User Story: On-Demand Translations During Practice

    • As a low-confidence speaker,
    • I want to quickly translate AI prompts or my own utterances between my language and the target language,
    • so that I can stay engaged rather than getting stuck on unknown words.

    Acceptance Criteria:

    • In-session controls allow optional translations of AI messages and user messages.
    • Translation support is clearly marked and can be disabled by instructors (to reduce over-reliance).
    • Translation usage is logged for analytics (e.g., frequency by user, session).
    • Translations are fast enough to not break conversational flow.

    Dependencies: LLM-based or external translation APIs, usage limits and cost management, UI surface in chat and transcripts.

5. Progress Tracking, Gamification, and Analytics

  1. User Story: Personal Progress Dashboard

    • As a learner targeting a CEFR speaking level (A1–B2),
    • I want to see my progress over time across key skills,
    • so that I stay motivated, know where to focus, and ultimately reach my target level (often B2).

    Acceptance Criteria:

    • Dashboard shows time spent speaking, session count, streaks, and estimated CEFR band over time.
    • Learner can view trends in specific skill dimensions (fluency, pronunciation, grammar, vocabulary).
    • Streaks, badges, and milestones are clearly displayed, with rules explained.
    • Data refreshes near-real-time after a session.

    Dependencies: Analytics database structures, data aggregation jobs, frontend charts, privacy/consent handling.

  2. User Story: Program-Level Reporting (Secondary Persona)

    • As a coordinator of a small group of learners,
    • I want to see anonymized or per-learner usage and progress,
    • so that I can measure impact and intervene early for learners who are falling behind.

    Acceptance Criteria:

    • Secure, role-based access for coordinators/instructors.
    • Metrics include active learners, sessions per week, minutes spoken, and average skill trends.
    • Simple export (CSV or PDF) for reporting.
    • Data access respects privacy settings and relevant regulations.

    Dependencies: Role-based access control, reporting queries, secure data storage and anonymization, UI components for analytics.

  3. User Story: Gamified Challenges and Rewards

    • As a learner who struggles to keep a regular speaking habit,
    • I want to earn streaks, badges, and other rewards when I practice,
    • so that I feel motivated to return and build a long-term habit.

    Acceptance Criteria:

    • System tracks daily and weekly speaking activity and calculates streaks based on defined rules (e.g., at least one completed session per day).
    • Learners can unlock badges or milestones based on clear criteria (e.g., total minutes spoken, number of sessions, mock exams completed).
    • Gamification status (streaks, badges, milestones) is visible in the dashboard and updates after each session.
    • All streak and badge rules are documented in-app so they can be tested and verified.

    Dependencies: Analytics and event tracking, gamification rules engine or logic, frontend components to display streaks and badges.
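
    A minimal sketch of the documented streak rule ("at least one completed session per day"); the function name and signature are illustrative:

```python
# Sketch of the streak rule: count consecutive days (ending today) that have
# at least one completed session.
from datetime import date, timedelta


def current_streak(session_days: set[date], today: date) -> int:
    """Number of consecutive days up to and including today with >= 1 session."""
    streak = 0
    day = today
    while day in session_days:
        streak += 1
        day -= timedelta(days=1)
    return streak


# Example: sessions on each of the last three days -> streak of 3.
today = date(2025, 12, 4)
days = {today, today - timedelta(days=1), today - timedelta(days=2)}
assert current_streak(days, today) == 3
```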

6. User Accounts, Authentication, and Subscription Management

  1. User Story: Account Creation and Sign-In

    • As a new learner,
    • I want to create an account using my email and password (and optionally social login),
    • so that my progress, preferences, and subscriptions are stored securely.

    Acceptance Criteria:

    • Email + password registration with verification flow.
    • Login with JWT-based sessions; secure password hashing in storage.
    • Basic account management (confirm email, change email, password, profile data).
    • Session expiry and logout behaviors are clearly implemented.

    Dependencies: FastAPI Users (or equivalent auth library), PostgreSQL user table/schema, email service for verification, frontend auth flows.

  2. User Story: Subscription Plans and Billing

    • As a serious learner,
    • I want to choose a subscription plan that fits my needs (e.g., free tier, standard, premium),
    • so that I can access the right level of usage and features.

    Acceptance Criteria:

    • Plan definitions (e.g., “Spark,” “Glow,” etc.) with clearly described limits (minutes per month, features like mock exam mode).
    • Billing integrated with Stripe (or similar) for recurring subscriptions.
    • System enforces plan limits gracefully (e.g., warn at 80% usage, block after limit with clear upgrade options).
    • Admin tooling to manage plans and handle refunds/adjustments.

    Dependencies: Payment service integration (Stripe), secure webhook handling, backend plan enforcement, accounting/ledger storage.
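
    A hedged sketch of plan-limit enforcement ("warn at 80% usage, block after the limit"); plan names and minute quotas are placeholders, not the final pricing:

```python
# Hypothetical plan-limit check implementing "warn at 80%, block at 100%".
from dataclasses import dataclass
from enum import Enum


class UsageState(str, Enum):
    OK = "ok"
    WARN = "warn"        # >= 80% of the monthly quota used
    BLOCKED = "blocked"  # quota exhausted; prompt to upgrade


PLAN_MINUTES = {"spark": 60, "glow": 300, "premium": 1000}  # illustrative quotas


@dataclass
class UsageCheck:
    state: UsageState
    minutes_used: float
    minutes_allowed: int


def check_usage(plan: str, minutes_used: float) -> UsageCheck:
    allowed = PLAN_MINUTES[plan]
    if minutes_used >= allowed:
        state = UsageState.BLOCKED
    elif minutes_used >= 0.8 * allowed:
        state = UsageState.WARN
    else:
        state = UsageState.OK
    return UsageCheck(state=state, minutes_used=minutes_used, minutes_allowed=allowed)


# Example: 55 of 60 minutes used on the "spark" plan -> WARN (over the 80% threshold).
assert check_usage("spark", 55).state is UsageState.WARN
```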

7. Cross-Device Learning Continuity

  1. User Story: Seamless Device Switching

    • As a learner who uses both phone and laptop,
    • I want to continue my learning across devices without losing progress,
    • so that I can practice whenever and wherever it's convenient.

    Acceptance Criteria:

    • Sessions, progress, and settings are stored server-side and synced across devices.
    • Resume-last-lesson feature available on login.
    • PWA support on mobile for near-native experience and offline access to limited features (where feasible).
    • Conflict-handling behaviors are defined (e.g., two devices active at once).

    Dependencies: Next.js PWA configuration, centralized state in backend, device/session tracking, secure token handling.

  2. User Story: Consistent Tutor Experience Across Devices

    • As a learner who sometimes uses headphones and sometimes speakers,
    • I want to have a consistent AI tutor voice and behavior on all my devices,
    • so that my listening practice is predictable and comfortable.

    Acceptance Criteria:

    • Tutor voice selection (gender, regional accent) is stored in the user profile and applied to all new sessions on any device.
    • When a learner changes voice settings on one device, the change is reflected on other devices within one session or logout/login cycle.
    • At least two distinct tutor voice options are available at launch; more can be added later without breaking existing settings.
    • A simple test script or admin view can confirm which voice configuration is currently active for a given user.

    Dependencies: Voice provider configuration, user profile settings for voice, frontend settings UI, backend APIs for voice preference storage and retrieval.

8. AI-Assisted Curriculum and Lesson Authoring

  1. User Story: Instructor-Designed Lessons with AI Support

    • As a language instructor or admin,
    • I want to design and manage lessons for each CEFR level (A1–B2) with AI support, using documents or images as the basis for lessons,
    • so that I can efficiently create high-quality, curriculum-aligned speaking practice tailored to my learners.

    Acceptance Criteria:

    • Instructors can upload documents and images (e.g., forms, articles, exam prompts, everyday photos) into the system.
    • The backend parses and indexes uploaded material via a document processing and embedding pipeline so that AI can reference it during lessons.
    • Instructors can select target level(s), objectives, and exam formats when creating or editing a lesson.
    • AI suggests lesson structures, prompts, and example dialogues that instructors can review and modify before publishing.
    • Lessons are stored with metadata (level, skills, topics, exam parts) and become available in the learner curriculum and mock exams.

    Dependencies: Document upload and processing services, LLM-based content generation, instructor/admin UI, PostgreSQL + pgvector storage.
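
    A minimal sketch of the chunk-and-embed step of the ingestion pipeline, assuming the openai Python SDK; the embedding model and chunk size are illustrative choices, not project decisions:

```python
# Sketch of the ingestion step: split uploaded text into chunks and embed
# them for pgvector storage.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chunk_text(text: str, max_chars: int = 800) -> list[str]:
    """Naive fixed-size chunking; a real pipeline would split on sentences/sections."""
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]


def embed_chunks(chunks: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in response.data]


# The resulting vectors would be written to the pgvector-backed Embedding/Document
# table so the AI tutor can reference the material during lessons.
```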

  2. User Story: Learner-Generated Lessons from Uploaded Material

    • As a learner,
    • I want to upload documents or images that are relevant to my life or exams and have the AI tutor use them as the basis of a lesson,
    • so that my practice feels directly useful and is adapted to my current level (A1–B2).

    Acceptance Criteria:

    • Learners can upload files (e.g., work documents, letters from authorities, school forms, pictures from daily life) from web or mobile.
    • System detects or uses the learner's current CEFR level to adapt the conversation difficulty and grammar focus appropriately.
    • AI tutor uses the uploaded material as shared context (e.g., refers to specific sections of a document or objects in an image) during the lesson.
    • Uploaded content is stored securely, scoped to the learner/account or organization according to configuration and privacy requirements.

    Dependencies: Same document ingestion pipeline as instructor authoring, user-facing upload UI, LLM prompts conditioned on user level and uploaded context.

9. Persistent Conversations, Transcripts, and Tutor Greetings

  1. User Story: Persistent Conversation History and Context Loading

    • As a returning learner,
    • I want to have my previous conversations, transcripts, and progress persisted and used to initialize new lessons,
    • so that the AI tutor can pick up where we left off and provide a sense of continuity.

    Acceptance Criteria:

    • Audio from both the user and the AI tutor is always transcribed and stored persistently with each session (subject to retention and privacy policies).
    • Each session stores metadata including date, mode (lesson, mock exam, free conversation), level, topics, and key performance indicators.
    • The backend exposes an endpoint (e.g., /sessions/default) that returns or creates a persistent conversational session containing historical summaries and progress context.
    • When a user starts a new lesson, the AI tutor's context includes a short summary of recent sessions plus key goals and challenges.

    Dependencies: Session and transcript storage in PostgreSQL + pgvector, summarization logic in backend LLM services, session management API, LiveKit session orchestration.
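
    A minimal sketch of the get-or-create behavior behind /sessions/default, with an in-memory dict standing in for Postgres and for the LLM-generated recap; all helper names are illustrative:

```python
# Sketch of /sessions/default: return or create a persistent conversational
# session carrying summary context for the tutor. Storage is faked here.
import uuid

from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter(prefix="/sessions", tags=["sessions"])

_fake_store: dict[str, dict] = {}  # user_id -> session record (stand-in for Postgres)


class DefaultSession(BaseModel):
    session_id: str
    recent_summary: str       # short recap of recent sessions, used in the greeting
    current_level: str        # e.g. "B1"
    suggested_next_step: str


@router.get("/default", response_model=DefaultSession)
async def default_session(user_id: str = "demo-user") -> DefaultSession:
    # In the real API the user comes from the JWT auth dependency, and the
    # summary comes from stored transcripts plus an LLM summarization step.
    record = _fake_store.setdefault(user_id, {
        "session_id": str(uuid.uuid4()),
        "recent_summary": "No previous sessions yet.",
        "current_level": "A1",
        "suggested_next_step": "Start the first 'Everyday Life' lesson.",
    })
    return DefaultSession(**record)
```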

  2. User Story: Contextual Greeting on Login

    • As a returning learner,
    • I want to hear a short spoken greeting from the AI tutor that reminds me where I left off previously,
    • so that I immediately know what I was working on and can resume with confidence.

    Acceptance Criteria:

    • After login and reconnecting to the conversational session, the AI tutor greets the user verbally and gives a brief, level-appropriate summary of their most recent activity and suggested next step.
    • Greeting content is generated from stored summaries and progress records, not from scratch each time.
    • Learners can adjust how much historical detail is included (e.g., “short summary only” vs. “more detailed recap”).

    Dependencies: Same as for persistent conversation history; frontend behavior to play greeting early in the session and surface a text version of the summary.

10. Health Checks and Admin Observability

  1. User Story: Backend Health Check Endpoint

    • As a platform operator,
    • I want to have standard health check endpoints for liveness, readiness, and detailed status,
    • so that CI/CD pipelines, uptime monitors, and dashboards can verify that the API is running, ready, and healthy.

    Acceptance Criteria:

    • Backend exposes three unauthenticated endpoints:
      • GET /health/live returns HTTP 200 and body "live" when the process is running.
      • GET /health/ready returns HTTP 200 and body "ready" when critical dependencies are OK, and HTTP 503 otherwise.
      • GET /health returns a JSON body with an overall status field and per-component checks, and uses HTTP 200 for "pass" and HTTP 503 for "fail".
    • Health endpoints are used in deployment pipelines and monitoring (e.g., https://api.<domain>/health, /health/ready, /health/live).
    • All three endpoints are lightweight enough to be polled on the order of seconds without impacting users.

    Dependencies: FastAPI route for health checks, integration with basic internal dependency checks (DB, LiveKit, LLM connectivity where feasible).
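
    A minimal sketch of the three endpoints above, assuming FastAPI; the dependency probe is a placeholder for real checks against Postgres, LiveKit, and the LLM providers:

```python
# Liveness, readiness, and detailed health routes (sketch).
from fastapi import APIRouter, Response

router = APIRouter(tags=["health"])


async def check_components() -> dict[str, str]:
    # Placeholder: real checks would probe Postgres, LiveKit, and LLM APIs.
    return {"database": "pass", "livekit": "pass", "llm": "pass"}


@router.get("/health/live")
async def live() -> Response:
    # Liveness: the process is running and able to serve requests.
    return Response(content="live", media_type="text/plain")


@router.get("/health/ready")
async def ready() -> Response:
    checks = await check_components()
    ok = all(v == "pass" for v in checks.values())
    return Response(
        content="ready" if ok else "not ready",
        media_type="text/plain",
        status_code=200 if ok else 503,
    )


@router.get("/health")
async def health(response: Response) -> dict:
    checks = await check_components()
    status = "pass" if all(v == "pass" for v in checks.values()) else "fail"
    response.status_code = 200 if status == "pass" else 503
    return {"status": status, "checks": checks}
```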

  2. User Story: Admin Health Dashboard in Frontend

    • As an admin or operator,
    • I want to view a dashboard in the frontend showing the health of core components,
    • so that I can quickly detect and diagnose issues without logging into servers directly.

    Acceptance Criteria:

    • Frontend provides an admin-only view that aggregates health data for frontend, backend, database, LiveKit, and external LLM APIs.
    • Dashboard polls or subscribes to backend health endpoints and visualizes status (e.g., up/down, latency, last check time).
    • Critical issues are highlighted and optionally surfaced as alerts/notifications.

    Dependencies: Backend health endpoints and metrics, role-based access control, frontend admin UI components.

Non-Functional Requirements (Technical)

Frontend Requirements

Supported Browsers/Devices:

  • Desktop: Latest 2 versions of Chrome, Firefox, Safari, and Edge.
  • Mobile: Latest 2 major versions of iOS and Android (Safari/Chrome), including PWA install support.
  • Minimum viewport: responsive layouts down to 360px width.

Design/UI:

  • Voice-first interaction prioritized: microphone and conversation views are obvious and usable with one hand on mobile.
  • Consistent brand identity for Avaaz across web and mobile (colors, typography, logo).
  • Dark and light modes preferred for accessibility and comfort.
  • UI strings and user-facing backend error messages are localized; language selection is a first-class setting.
  • UI components built in React/Next.js with Tailwind or equivalent utility-first styling, following design system guidelines (buttons, forms, cards, modals).

Backend & Database Requirements

API Specifications:

  • RESTful JSON APIs served by a FastAPI backend for core operations: authentication, user management, lessons, sessions, progress, subscriptions.
  • Real-time endpoints for voice and agent control:
    • LiveKit signaling endpoints (/sessions/default, /sessions/default/token and equivalents).
    • WebSocket or WebRTC connections from backend to LLM realtime APIs (OpenAI Realtime, Gemini Live).
  • API documentation exposed via OpenAPI/Swagger (and/or equivalent documentation tooling).
  • All APIs versioned (e.g., /api/v1/...) with change management.
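
A hedged sketch of the /sessions/default/token endpoint, assuming the livekit-api Python SDK; environment variable names, the auth stub, and room naming are illustrative:

```python
# Sketch of a LiveKit access-token endpoint for joining the tutor session.
import os

from fastapi import APIRouter, Depends
from livekit import api  # from the livekit-api package (assumed)

router = APIRouter(prefix="/sessions", tags=["sessions"])


async def get_current_user_id() -> str:
    # Placeholder for the real JWT-based auth dependency.
    return "demo-user"


@router.get("/default/token")
async def issue_token(user_id: str = Depends(get_current_user_id)) -> dict:
    token = (
        api.AccessToken(os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"])
        .with_identity(user_id)
        .with_grants(api.VideoGrants(room_join=True, room=f"session-{user_id}"))
        .to_jwt()
    )
    return {"token": token, "url": os.environ["LIVEKIT_URL"]}
```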

Database Schema (High-Level):

  • PostgreSQL with pgvector for semantic search and embeddings.
  • Core entities include (indicative, not exhaustive):
    • User (profile, locale, level, subscription plan).
    • Session (conversation metadata, timestamps, mode, links to transcripts).
    • Lesson / Scenario (curriculum structure, CEFR mapping).
    • ProgressSnapshot (aggregated metrics per user over time).
    • Subscription / Payment (plan, billing status, Stripe references).
    • Embedding / Document (semantic chunks for search and content retrieval).
  • Migrations managed via Alembic, with reproducible dev/prod schemas.
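
An indicative sketch of a few core entities, assuming SQLAlchemy 2.x and the pgvector SQLAlchemy integration; table and column names mirror the list above but are not a final schema:

```python
# Indicative ORM sketch of core entities with a pgvector embedding column.
from datetime import datetime

from pgvector.sqlalchemy import Vector
from sqlalchemy import ForeignKey, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    email: Mapped[str] = mapped_column(String(320), unique=True)
    locale: Mapped[str] = mapped_column(String(16), default="en")
    cefr_level: Mapped[str] = mapped_column(String(2), default="A1")  # A1–B2
    plan: Mapped[str] = mapped_column(String(32), default="free")


class Session(Base):
    __tablename__ = "sessions"
    id: Mapped[int] = mapped_column(primary_key=True)
    user_id: Mapped[int] = mapped_column(ForeignKey("users.id"))
    mode: Mapped[str] = mapped_column(String(32))  # lesson | mock_exam | free
    started_at: Mapped[datetime]
    transcript: Mapped[str | None]


class DocumentChunk(Base):
    __tablename__ = "document_chunks"
    id: Mapped[int] = mapped_column(primary_key=True)
    user_id: Mapped[int] = mapped_column(ForeignKey("users.id"))
    content: Mapped[str]
    embedding = mapped_column(Vector(1536))  # illustrative dimensionality
```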

Security:

  • All traffic between client and server encrypted via HTTPS (Caddy as reverse proxy with automatic TLS).
  • Authentication via JWT or session tokens implemented with FastAPI Users (or equivalent), with configurable token lifetimes and refresh flows.
  • Passwords stored with modern hashing algorithms (e.g., Argon2, bcrypt).
  • Role-based access control (e.g., learner, coordinator, admin) for sensitive features (analytics, content management).
  • Strict input validation and output encoding following OWASP best practices.
  • Secrets stored securely (e.g., environment variables, secret manager), never hard-coded in the repository.
  • Rate limiting, abuse detection, and monitoring around critical endpoints.

Deployment & CI/CD

  • Production deployment uses Docker-based stacks on a single VPS, with a separate infra stack (caddy, gitea, gitea-runner) and app stack (frontend, backend, postgres, livekit) defined in version-controlled Compose files.
  • Caddy terminates TLS for all public domains and routes traffic to the correct internal services (frontend, backend, LiveKit, Gitea) over a shared Docker network.
  • A Gitea Actions-based CI pipeline runs on each feature branch and pull request, executing backend/frontend tests, static analysis, and image builds, and must pass before merge to main.
  • A tag-based CD pipeline (tags matching v* on main) builds production images and redeploys the app stack on the VPS in a controlled way, minimizing downtime.
  • CI/CD workflows are themselves versioned in the repository so changes to validation or deployment steps are reviewable and reproducible.

Performance, Reliability, and Scalability

Performance:

  • Target median API response time: < 200 ms for standard JSON endpoints under normal load.
  • Voice interaction round-trip (user speaks → AI responds) tuned for natural conversation with minimal perceived delay; target < 1.5 seconds for most responses.
  • System supports multiple concurrent sessions per LiveKit instance and scales horizontally as needed.
  • Efficient use of LLM realtime APIs with streaming responses and graceful handling of network jitter.

Reliability & Availability:

  • Initial production target availability: ≥ 99.5%, with a path to ≥ 99.9% as usage grows.
  • Health checks for all containers (frontend, backend, LiveKit, Postgres) integrated with Docker Compose and any orchestration layer, plus the explicit backend /health endpoint and frontend admin dashboard described above.
  • Graceful degradation: if LLM APIs or LiveKit are temporarily unavailable, the system shows clear messaging to learners and surfaces status indicators in the admin dashboard.
  • Regular automated backups of PostgreSQL and configuration; tested restore procedures.

Scalability:

  • Docker-based deployment on a production VPS, with clear separation between infra stack (Caddy, Gitea) and app stack (frontend, backend, LiveKit, Postgres).
  • Horizontal scaling supported for stateless services (frontend, backend, LiveKit) and vertical scaling for PostgreSQL as needed.
  • Efficient connection pooling for database access.
  • Architecture designed to move from single VPS to managed services or Kubernetes in future without large rewrites.

Technical Specifications:

  • Frontend: Next.js (React, TypeScript), Tailwind or equivalent, PWA enabled; communicates with backend via HTTPS and LiveKit via WebRTC.
  • Backend: FastAPI (Python), Uvicorn/Gunicorn, Pydantic for validation, structured services for LLM, payments, and documents.
  • Real-time/Media: LiveKit server for WebRTC signaling and media; integration with LiveKit Agent framework for AI tutor.
  • Database: PostgreSQL + pgvector; migrations via Alembic.
  • LLM Providers: OpenAI Realtime API, Google Gemini Live API (WebSocket/WebRTC).
  • Infra: Caddy reverse proxy, Docker Compose for local and production stacks, Gitea + Actions for CI/CD.
  • Testing/Quality: Pytest, Hypothesis, httpx for API testing, Ruff and Pyright for linting and static analysis, ESLint for frontend.

Accessibility:

  • Compliance target: WCAG 2.1 AA for web UI.
  • All key actions accessible via keyboard and screen readers.
  • Sufficient color contrast and scalable font sizes.
  • Voice-first design complemented by transcripts and captions; learners can read as well as listen.
  • Consideration for hearing- or speech-impaired users where feasible (e.g., text-only practice, adjustable speech rate).

Metrics & Release Plan

Success Metrics (KPIs):

  • Learning Outcomes:
    • ≥ 60% of learners who complete a defined program (e.g., 30+ speaking sessions) report increased speaking confidence.
    • ≥ 50% of learners who use Avaaz consistently (e.g., 3+ sessions/week for 8 weeks) pass the B2 oral exam on their first or second attempt.
  • Engagement:
    • Weekly active learners (WAL) growth rate.
    • Median speaking minutes per active learner per week.
    • Retention (e.g., 4-week and 12-week).
  • Product Quality:
    • Average session rating / NPS for speaking sessions.
    • Error rates and crash-free sessions on mobile/web.
    • Latency metrics for voice interactions.

Timeline & Milestones:

  • Phase 1 – Foundation (M0–M2):
    • Implement core architecture (backend, frontend, LiveKit, LLM integrations).
    • Basic authentication, user accounts, and minimal speaking session flow.
    • Internal alpha with team and close collaborators.
  • Phase 2 – Beta Learning Experience (M3–M4):
    • CEFR-aligned curriculum MVP, immigrant-focused scenarios, post-session summaries.
    • Progress dashboard and early gamification (streaks, minutes).
    • Invite-only beta with small learner cohorts; collect qualitative and quantitative feedback.
  • Phase 3 – Exam & Scale Readiness (M5–M6):
    • Mock B2 exam mode, robust assessment reports.
    • Subscription plans and billing.
    • Production hardening (observability, backups, reliability SLOs).
    • Public launch in initial target market(s).

Release Criteria:

  • Core features of voice-first lessons, CEFR-aligned curriculum, post-session feedback, and at least one full mock exam template are stable and usable.
  • User authentication, subscription management, and payment flows validated in staging and production.
  • System meets agreed performance thresholds (latency, error rates) under expected early-production load.
  • No open critical security vulnerabilities; penetration testing and reviews completed for auth, payments, and data storage.
  • Documentation available for learners (help center) and internal teams (runbooks, API docs).

Potential Risks & Assumptions:

  • Risks:
    • Dependence on external LLM realtime APIs and their SLAs, pricing, and model changes.
    • WebRTC and audio performance may vary across networks and devices, impacting perceived quality.
    • Assessment accuracy (CEFR-level estimates) may not initially match human examiner judgments, affecting learner trust.
    • Regulatory or data privacy constraints (e.g., storing voice data, cross-border data flows) may impact certain markets.
  • Assumptions:
    • Learners have access to a smartphone or laptop with a microphone and stable-enough internet for audio sessions.
    • LLM providers continue to support low-latency realtime APIs suitable for spoken dialogue.
    • Target institutions and exam boards accept AI-supported practice tools as preparation, even if they do not formally endorse them.
    • Initial go-to-market focuses on a limited set of language pairs (e.g., English → Norwegian Bokmål) with potential expansion later.