Laravel RAG Architecture

Here's the complete step-by-step implementation:


Step 1 — Project setup & dependencies

composer require pgvector/pgvector     # pgvector support for PHP
composer require openai-php/laravel    # optional: Groq is OpenAI-compatible
composer require anthropic-php/client  # optional: fallback LLM SDK
# Note: the LlmService below calls Groq and Anthropic through Laravel's Http client,
# so the two SDK packages above aren't strictly required.
pip install sentence-transformers      # on your Python sidecar (see Step 3)
# .env
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile
ANTHROPIC_API_KEY=sk-ant-...
EMBEDDING_SERVICE_URL=http://localhost:8001  # your Python sidecar
VECTOR_DIMENSIONS=1024

Step 2 — Database migration

// database/migrations/xxxx_create_document_chunks_table.php
public function up(): void
{
    DB::statement('CREATE EXTENSION IF NOT EXISTS vector');

    Schema::create('document_chunks', function (Blueprint $table) {
        $table->id();
        $table->foreignId('document_id')->constrained()->cascadeOnDelete();
        $table->string('document_title');
        $table->text('content');
        $table->integer('chunk_index');
        $table->integer('token_count');
        $table->json('metadata')->nullable();
        $table->timestamps();
    });

    // Add the vector column separately (pgvector syntax)
    DB::statement('ALTER TABLE document_chunks ADD COLUMN embedding vector(1024)');

    // HNSW index — best for production (fast approximate search)
    DB::statement('
        CREATE INDEX document_chunks_embedding_hnsw
        ON document_chunks
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    ');
}
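
The ingestion job and controller later in this guide reference App\Models\Document and App\Models\DocumentChunk but never show them. A minimal sketch of both models, assuming a documents table with title, raw_text, language and status columns (these are the fields the job reads and updates):

// app/Models/Document.php
namespace App\Models;

use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Eloquent\Relations\HasMany;

class Document extends Model
{
    protected $fillable = ['title', 'raw_text', 'language', 'status'];

    public function chunks(): HasMany
    {
        return $this->hasMany(DocumentChunk::class);
    }
}

// app/Models/DocumentChunk.php
namespace App\Models;

use Illuminate\Database\Eloquent\Model;

class DocumentChunk extends Model
{
    protected $guarded = [];

    protected $casts = ['metadata' => 'array'];
}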

Step 3 — Embedding sidecar (Python FastAPI)

bge-m3 runs best as a small Python service. This keeps Laravel clean.

# embedding_service/main.py
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
from typing import List
import uvicorn

app = FastAPI()
model = SentenceTransformer("BAAI/bge-m3")

class EmbedRequest(BaseModel):
    texts: List[str]
    is_query: bool = False   # bge-m3 has separate query/passage modes

@app.post("/embed")
def embed(req: EmbedRequest):
    instruction = "Represent this sentence for searching relevant passages: " if req.is_query else ""
    inputs = [instruction + t for t in req.texts] if req.is_query else req.texts
    vectors = model.encode(inputs, normalize_embeddings=True).tolist()
    return {"embeddings": vectors}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8001)
# Run it
pip install fastapi uvicorn sentence-transformers
python embedding_service/main.py

On GPU: add device="cuda" to SentenceTransformer(...). Inference drops from ~800ms to ~40ms per batch.
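
Before wiring this into Laravel, a quick sanity check is worth doing: post a couple of texts and confirm you get back 1024-dimensional vectors. A sketch using Laravel's Http client (run from php artisan tinker; the URL matches EMBEDDING_SERVICE_URL from Step 1):

use Illuminate\Support\Facades\Http;

$response = Http::post('http://localhost:8001/embed', [
    'texts'    => ['hello world', 'مرحبا بالعالم'],
    'is_query' => false,
]);

count($response->json('embeddings'));    // 2 vectors back
count($response->json('embeddings.0'));  // 1024 floats each (bge-m3's dimension)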


Step 4 — Laravel services

app/Services/EmbeddingService.php

namespace App\Services;

use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Cache;

class EmbeddingService
{
    private string $url;

    public function __construct()
    {
        $this->url = config('services.embedding.url', env('EMBEDDING_SERVICE_URL'));
    }

    public function embedPassages(array $texts): array
    {
        return $this->call($texts, isQuery: false);
    }

    public function embedQuery(string $text): array
    {
        $cacheKey = 'emb_query_' . md5($text);
        return Cache::remember($cacheKey, 3600, fn () =>
            $this->call([$text], isQuery: true)[0]
        );
    }

    private function call(array $texts, bool $isQuery): array
    {
        $response = Http::timeout(30)->post("{$this->url}/embed", [
            'texts'    => $texts,
            'is_query' => $isQuery,
        ]);

        // throw() raises on 4xx/5xx instead of silently returning null
        return $response->throw()->json('embeddings');
    }
}

app/Services/DocumentChunker.php

namespace App\Services;

class DocumentChunker
{
    private int $chunkSize;
    private int $overlap;

    public function __construct(int $chunkSize = 512, int $overlap = 64)
    {
        $this->chunkSize = $chunkSize;
        $this->overlap   = $overlap;
    }

    public function chunk(string $text): array
    {
        // Split by sentence boundary first, then merge to chunkSize tokens
        $sentences = preg_split('/(?<=[.!?؟])\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
        $chunks    = [];
        $buffer    = [];
        $tokenCount = 0;

        foreach ($sentences as $sentence) {
            $tokens = $this->estimateTokens($sentence);

            if ($tokenCount + $tokens > $this->chunkSize && !empty($buffer)) {
                $chunks[] = implode(' ', $buffer);
                // Keep overlap: last N words
                $words  = explode(' ', implode(' ', $buffer));
                $buffer = array_slice($words, -$this->overlap);
                $tokenCount = count($buffer);
            }

            $buffer[] = $sentence;
            $tokenCount += $tokens;
        }

        if (!empty($buffer)) {
            $chunks[] = implode(' ', $buffer);
        }

        return $chunks;
    }

    private function estimateTokens(string $text): int
    {
        // Rough estimate: 1 token ≈ 4 chars for Latin, 2 chars for Arabic
        return (int) (mb_strlen($text) / 3.5);
    }
}
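
A quick illustration of how the chunker behaves, assuming $longText holds the extracted document text (a sketch, not part of the pipeline):

$chunker = new \App\Services\DocumentChunker(chunkSize: 512, overlap: 64);

$chunks = $chunker->chunk($longText);

// Each chunk is roughly 512 estimated tokens, and consecutive chunks share
// about the last 64 words so a sentence's context isn't cut off at a boundary.
foreach ($chunks as $i => $chunk) {
    dump($i, mb_substr($chunk, 0, 80) . '…');
}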

app/Services/VectorSearchService.php

namespace App\Services;

use Illuminate\Support\Facades\DB;

class VectorSearchService
{
    public function search(array $queryEmbedding, int $topK = 5, ?int $documentId = null): array
    {
        $vector = '[' . implode(',', $queryEmbedding) . ']';

        $query = DB::table('document_chunks')
            ->select([
                'id', 'document_id', 'document_title',
                'content', 'chunk_index', 'metadata',
                DB::raw("1 - (embedding <=> '{$vector}'::vector) AS similarity"),
            ])
            // whereRaw: where() would quote this expression as a column name
            ->whereRaw("1 - (embedding <=> '{$vector}'::vector) >= ?", [0.35]);

        if ($documentId) {
            $query->where('document_id', $documentId);
        }

        return $query
            ->orderByDesc('similarity')
            ->limit($topK)
            ->get()
            ->toArray();
    }
}
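
If recall looks low with the HNSW index, pgvector lets you widen the candidate search at query time: hnsw.ef_search defaults to 40, and raising it per connection trades a little latency for better recall. A sketch of where that call would go:

use Illuminate\Support\Facades\DB;

// Run before the vector query on the same connection (pgvector's default is 40).
DB::statement('SET hnsw.ef_search = 100');

$results = app(\App\Services\VectorSearchService::class)->search($embedding, topK: 5);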

app/Services/LlmService.php

namespace App\Services;

use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;

class LlmService
{
    public function complete(string $systemPrompt, array $messages, int $maxTokens = 1024): string
    {
        try {
            return $this->groq($systemPrompt, $messages, $maxTokens);
        } catch (\Throwable $e) {
            Log::warning('Groq failed, falling back to Claude', ['error' => $e->getMessage()]);
            return $this->claude($systemPrompt, $messages, $maxTokens);
        }
    }

    private function groq(string $system, array $messages, int $maxTokens): string
    {
        $response = Http::withToken(config('services.groq.key'))
            ->timeout(20)
            ->post('https://api.groq.com/openai/v1/chat/completions', [
                'model'       => config('services.groq.model', 'llama-3.3-70b-versatile'),
                'max_tokens'  => $maxTokens,
                'temperature' => 0.2,
                'messages'    => array_merge(
                    [['role' => 'system', 'content' => $system]],
                    $messages
                ),
            ]);

        // throw() turns 4xx/5xx responses into exceptions so complete() falls back to Claude
        return $response->throw()->json('choices.0.message.content');
    }

    private function claude(string $system, array $messages, int $maxTokens): string
    {
        $response = Http::withHeaders([
            'x-api-key'         => config('services.anthropic.key'),
            'anthropic-version' => '2023-06-01',
        ])->timeout(30)->post('https://api.anthropic.com/v1/messages', [
            'model'      => 'claude-haiku-4-5-20251001',
            'max_tokens' => $maxTokens,
            'system'     => $system,
            'messages'   => $messages,
        ]);

        return $response->throw()->json('content.0.text');
    }
}

app/Services/PromptBuilder.php

namespace App\Services;

class PromptBuilder
{
    public function system(): string
    {
        return <<<PROMPT
        You are a precise, helpful assistant. Answer ONLY using the provided context chunks.
        If the answer is not in the context, say "I don't have enough information to answer this."
        Always cite which source (document title + chunk) you used.
        Respond in the same language as the user's question (Arabic or English).
        PROMPT;
    }

    public function buildUserMessage(string $question, array $chunks): string
    {
        $context = collect($chunks)
            ->map(fn ($c, $i) =>
                "[{$i}] Source: {$c->document_title}\n{$c->content}\nSimilarity: " . round($c->similarity, 3)
            )
            ->implode("\n\n---\n\n");

        return <<<MSG
        Context:
        {$context}

        ---
        Question: {$question}
        MSG;
    }
}

Step 5 — Ingestion job

// app/Jobs/IngestDocumentJob.php
namespace App\Jobs;

use App\Models\Document;
use App\Models\DocumentChunk;
use App\Services\DocumentChunker;
use App\Services\EmbeddingService;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class IngestDocumentJob implements ShouldQueue
{
    // SerializesModels re-fetches the Document by ID when the queued job actually runs
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(public Document $document) {}

    public function handle(DocumentChunker $chunker, EmbeddingService $embedder): void
    {
        $text   = $this->document->raw_text;
        $chunks = $chunker->chunk($text);

        // Batch embed for efficiency
        $embeddings = $embedder->embedPassages($chunks);

        $records = [];
        foreach ($chunks as $i => $chunk) {
            $vector = '[' . implode(',', $embeddings[$i]) . ']';
            $records[] = [
                'document_id'    => $this->document->id,
                'document_title' => $this->document->title,
                'content'        => $chunk,
                'chunk_index'    => $i,
                'token_count'    => (int)(mb_strlen($chunk) / 3.5),
                'embedding'      => $vector, // raw SQL insert below
                'metadata'       => json_encode(['lang' => $this->document->language]),
                'created_at'     => now(),
                'updated_at'     => now(),
            ];
        }

        // Insert the row first, then set the embedding with an explicit ::vector cast
        // (the query builder can't bind a vector column directly)
        foreach ($records as $record) {
            $embedding = $record['embedding'];
            unset($record['embedding']);
            $id = \DB::table('document_chunks')->insertGetId($record);
            \DB::statement(
                "UPDATE document_chunks SET embedding = ?::vector WHERE id = ?",
                [$embedding, $id]
            );
        }

        $this->document->update(['status' => 'indexed']);
    }
}
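
Dispatching the job once a document record exists, as a sketch (the column names follow the fields the job reads: title, raw_text, language, status):

use App\Jobs\IngestDocumentJob;
use App\Models\Document;

$document = Document::create([
    'title'    => 'Employee handbook',
    'raw_text' => $extractedText,   // plain text already extracted from the upload
    'language' => 'en',
    'status'   => 'pending',
]);

IngestDocumentJob::dispatch($document);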

Step 6 — Controller

// app/Http/Controllers/RagController.php
namespace App\Http\Controllers;

use App\Services\{EmbeddingService, VectorSearchService, PromptBuilder, LlmService};
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Cache;

class RagController extends Controller
{
    public function __construct(
        private EmbeddingService  $embedder,
        private VectorSearchService $search,
        private PromptBuilder     $prompt,
        private LlmService        $llm,
    ) {}

    public function ask(Request $request)
    {
        $request->validate(['question' => 'required|string|max:2000']);

        $question   = $request->input('question');
        $cacheKey   = 'rag_answer_' . md5($question);

        if ($cached = Cache::get($cacheKey)) {
            return response()->json(array_merge($cached, ['cached' => true]));
        }

        $embedding = $this->embedder->embedQuery($question);
        $chunks    = $this->search->search($embedding, topK: 5);

        if (empty($chunks)) {
            return response()->json(['answer' => "No relevant information found.", 'sources' => []]);
        }

        $userMessage = $this->prompt->buildUserMessage($question, $chunks);
        $answer      = $this->llm->complete(
            $this->prompt->system(),
            [['role' => 'user', 'content' => $userMessage]]
        );

        $result = [
            'answer'  => $answer,
            'sources' => collect($chunks)->map(fn ($c) => [
                'title'      => $c->document_title,
                'chunk'      => $c->chunk_index,
                'similarity' => round($c->similarity, 3),
                'excerpt'    => mb_substr($c->content, 0, 200) . '…',
            ])->toArray(),
        ];

        Cache::put($cacheKey, $result, 3600);

        return response()->json($result);
    }
}

Step 7 — Routes & config

// routes/api.php
use App\Http\Controllers\{DocumentController, RagController};

Route::post('/ask', [RagController::class, 'ask'])->middleware('throttle:30,1');
Route::post('/documents', [DocumentController::class, 'store']);

// config/services.php (additions)
'groq' => [
    'key'   => env('GROQ_API_KEY'),
    'model' => env('GROQ_MODEL', 'llama-3.3-70b-versatile'),
],
'anthropic' => [
    'key' => env('ANTHROPIC_API_KEY'),
],
'embedding' => [
    'url' => env('EMBEDDING_SERVICE_URL', 'http://localhost:8001'),
],
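
A minimal feature test sketch for the /ask route wired above (it assumes the embedding sidecar is reachable and at least one document is indexed in the test database; the question text is illustrative):

// tests/Feature/RagAskTest.php
namespace Tests\Feature;

use Tests\TestCase;

class RagAskTest extends TestCase
{
    public function test_ask_returns_answer_and_sources(): void
    {
        $response = $this->postJson('/api/ask', [
            'question' => 'What does the handbook say about remote work?',
        ]);

        $response->assertOk()
                 ->assertJsonStructure(['answer', 'sources']);
    }
}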

Project structure

app/
├── Http/Controllers/
│   ├── RagController.php
│   └── DocumentController.php
├── Jobs/
│   └── IngestDocumentJob.php
├── Models/
│   ├── Document.php
│   └── DocumentChunk.php
├── Services/
│   ├── EmbeddingService.php
│   ├── DocumentChunker.php
│   ├── VectorSearchService.php
│   ├── PromptBuilder.php
│   └── LlmService.php
embedding_service/
└── main.py                 ← Python FastAPI bge-m3
database/migrations/
└── xxxx_create_document_chunks_table.php

LLM recommendation for your stack

Model                             Cost (input)   Speed        Quality     Best for
llama-3.3-70b-versatile on Groq   $0.00008/1K    ~200 tok/s   Good        Default, high volume
claude-haiku-4-5                  $0.001/1K      Fast         Very good   Fallback, complex reasoning
gemini-2.0-flash                  $0.0001/1K     Fast         Very good   Alternative if on Google Cloud

The LlmService above already implements the Groq-primary → Claude-fallback pattern. llama-3.3-70b-versatile on Groq currently offers one of the best cost/speed ratios available: at these list prices it is roughly 10× cheaper than GPT-4o-mini on input tokens, and noticeably faster. For complex multi-step reasoning, or when Groq is unavailable, Claude Haiku is the right safety net.


What to implement next

The next logical steps are:

1. Conversation history: store messages in a conversations table and pass the last 4 turns into LlmService.
2. A reranker: a second pass with a cross-encoder model after the vector search.
3. A document management API: a DocumentController with file upload that dispatches IngestDocumentJob.

Let me know which piece you want to build next.