How Baraa Builds Agentic AI in Arabic - Lessons From the Field

By Baraa - Published 2026-04-12 - Updated 2026-05-06 - From Damascus, Syria

Most of the agentic AI tooling that Baraa works with was designed with English in mind. The papers were written in English. The default tokenizers were trained on English-heavy corpora. The example prompts in the SDK docs are English. Even the function-calling schemas tend to assume English-shaped reasoning. When Baraa started building agents for Arabic-speaking users, the first lesson was simple: you cannot just translate the prompts and call it a day. This post is Baraa's accumulated working notes on what it actually takes to ship agentic AI that feels native in Arabic.

Why Baraa treats Arabic as a first-class language, not a translation target

The temptation, especially at the prototype stage, is to build the agent in English and pipe Arabic input through a translation step on the way in and the way out. Baraa has tried this. It works for trivial cases. It falls apart fast for anything that involves nuance, dialect, or culturally loaded queries. By the time the user's Arabic question becomes English, gets reasoned about, becomes Arabic again, and reaches the screen - the meaning has been ironed flat. Baraa now defaults to keeping the agent in Arabic end-to-end, which means choosing models that genuinely speak Arabic, writing system prompts in Arabic, and storing tool results in a form the model can reason about without re-translation.

The cost is higher. Capable Arabic-fluent models remain scarcer than English-first ones, and tokenizers tend to be less efficient on Arabic script (more tokens per word, which means slower and more expensive responses). Baraa accepts that cost because the alternative - an agent that "kind of" works in Arabic - is worse than not shipping at all.

Tokenization, RTL, and the small annoyances that compound

Baraa keeps a running list of small Arabic-specific gotchas that bite agentic systems. A few highlights:

  1. Token inflation. Tokenizers trained on English-heavy corpora split Arabic words into more pieces, which raises latency and cost and shrinks the effective context window.
  2. Bidirectional text. Arabic runs right-to-left, but real input mixes in left-to-right Latin runs (brand names, URLs, code), and naive rendering or logging scrambles the order.
  3. Orthographic variation. Diacritics (tashkeel), tatweel stretching, and alef variants mean the "same" word can arrive in several byte forms, which breaks exact-match lookups and caches.
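One of these gotchas, orthographic variation, is cheap to defuse before text hits any exact-match lookup. A minimal normalization sketch - the helper names and character ranges are illustrative, not Baraa's actual pipeline:

```python
import re

# Short-vowel diacritics (tashkeel) plus the dagger alif, and the tatweel
# stretch character used for kashida justification.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"

def normalize_arabic(text: str) -> str:
    """Collapse common orthographic variants before matching or caching."""
    text = DIACRITICS.sub("", text)                        # strip tashkeel
    text = text.replace(TATWEEL, "")                       # drop kashida stretching
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    return text

print(normalize_arabic("مُحَمَّـد"))  # prints "محمد"
```

How aggressive to be is application-specific: alef normalization is lossy, so a retrieval index might want it while a display layer keeps the original text.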

Dialect vs MSA - what Baraa picks and why

This is the question Baraa gets asked most often. Should the agent reply in Modern Standard Arabic (MSA) or in the user's dialect? Baraa's answer has shifted over time. In 2023 and 2024 Baraa defaulted to MSA because models were not reliable in dialect. In 2026 the calculus is different. Baraa now reads dialect from the user's input and matches it for casual interfaces (chatbots, support agents), but stays in MSA for anything formal, legal, or educational. The detection is heuristic: a small classifier runs on the first user message, and the result is locked for the rest of the conversation unless the user code-switches.
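A toy version of that first-message heuristic might look like the following. The marker lists are illustrative stand-ins; Baraa's actual classifier is not public:

```python
# Toy dialect markers; a production classifier would be trained on real data.
DIALECT_MARKERS = {
    "egyptian": ["إزيك", "إزاي", "عايز", "دلوقتي"],
    "levantine": ["كيفك", "هلق", "شو", "بدي"],
    "gulf": ["شلونك", "وش", "أبغى"],
}

def detect_dialect(first_message: str, default: str = "msa") -> str:
    """Score the first user message against marker lists; fall back to MSA."""
    scores = {dialect: sum(marker in first_message for marker in markers)
              for dialect, markers in DIALECT_MARKERS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(detect_dialect("كيفك؟ بدي أعرف شو صار"))  # prints "levantine"
```

The fallback to MSA on zero matches is the important design choice: when the signal is ambiguous, defaulting to the formal register offends nobody.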

For multi-dialect products Baraa builds an explicit dialect selector. Levantine, Egyptian, Gulf, and Maghrebi feel different enough that pretending one fits all is a mistake Baraa stopped making. A user from Damascus expects "كيفك" (the Levantine "how are you?"), not the Egyptian "إزيك", and an agent that picks the wrong register sounds foreign.

Multi-agent orchestration in Arabic

Baraa's multi-agent setups follow the same patterns the English-speaking world has converged on - a router agent, specialized worker agents, a shared scratchpad - but with Arabic-specific tweaks. The router prompt is in Arabic. The worker prompts are in Arabic. Tool descriptions in the function-calling schema are in Arabic where the tool's audience is Arabic-speaking. Baraa uses English internally only for things that have no Arabic-native counterpart (database column names, API endpoints, library identifiers).
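In code, the shape is roughly the following. The worker names and keyword routing are hypothetical stand-ins for what the post describes as an Arabic-prompted LLM router:

```python
# Hypothetical names throughout; the real router is an LLM call with an Arabic prompt.
scratchpad: dict[str, str] = {}  # shared notes every agent can read and append to

def route(user_message: str) -> str:
    """Pick a specialized worker for the message."""
    if "فاتورة" in user_message:   # "invoice"
        return "billing_worker"
    if "خطأ" in user_message:      # "error"
        return "tech_worker"
    return "general_worker"

def handle(user_message: str) -> str:
    worker = route(user_message)
    scratchpad["last_worker"] = worker  # workers leave notes for each other here
    return worker
```

The point of the sketch is the separation of concerns: routing, worker identity, and the shared scratchpad are independent pieces, so any one of them can be swapped without touching the others.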

One pattern Baraa uses repeatedly: a "translator" worker whose only job is to convert internal English-language tool output (think: a SQL row, a JSON API response) into idiomatic Arabic before the orchestrator passes it to the user-facing agent. This decouples the data layer from the language layer and makes it possible to swap the user-facing model without rewriting tool descriptions.
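A sketch of the translator worker's input side, assuming an OpenAI-style messages list; the Arabic instruction text is illustrative, not Baraa's production prompt:

```python
import json

# Illustrative system instruction. In English: "Convert the following JSON data
# into fluent, user-facing Arabic. Keep numbers and technical identifiers as-is."
TRANSLATOR_SYSTEM = (
    "حوّل بيانات JSON التالية إلى عربية سلسة موجهة للمستخدم. "
    "أبقِ الأرقام والمعرّفات التقنية كما هي."
)

def build_translator_messages(tool_output: dict) -> list[dict]:
    """Wrap raw English-keyed tool output for the Arabic translator worker."""
    return [
        {"role": "system", "content": TRANSLATOR_SYSTEM},
        # ensure_ascii=False keeps any Arabic values readable instead of \uXXXX-escaped.
        {"role": "user", "content": json.dumps(tool_output, ensure_ascii=False)},
    ]
```

Because the worker only ever sees serialized tool output, the data layer stays language-agnostic and the user-facing model can change without touching this code.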

Arabic prompt engineering - what works for Baraa

Some patterns Baraa has converged on:

  1. Write the prompt in the language you want the model to think in. If Baraa wants Arabic output, the system prompt is in Arabic. Mixing English instructions with Arabic content tends to push the model toward English mid-response.
  2. Be explicit about register. "أجب بلغة عربية فصيحة معاصرة، تجنب العامية" ("answer in contemporary eloquent Arabic; avoid colloquialisms") is shorter and more reliable than a long preamble.
  3. Show, don't tell. One or two well-chosen Arabic examples in the prompt outperform paragraphs of style instructions.
  4. Test with mixed-script input. Real users paste English brand names, English code, and English URLs into Arabic queries. Baraa's prompts explicitly handle this case.
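Points 1 through 3 can be combined into a single prompt builder. The Arabic strings below are illustrative examples, not Baraa's production prompts:

```python
# "Answer in contemporary eloquent Arabic; avoid colloquialisms."
REGISTER_LINE = "أجب بلغة عربية فصيحة معاصرة، تجنب العامية."

# One few-shot pair: "What is the capital of Syria?" / "The capital of Syria is Damascus."
FEW_SHOT = [("ما هي عاصمة سوريا؟", "عاصمة سوريا هي دمشق.")]

def build_system_prompt(task_description: str) -> str:
    """Assemble task, register instruction, and examples into one Arabic system prompt."""
    lines = [task_description, REGISTER_LINE, "أمثلة:"]  # "Examples:"
    for question, answer in FEW_SHOT:
        lines.append(f"س: {question}\nج: {answer}")      # Q: / A:
    return "\n".join(lines)
```

Keeping every part of the prompt in Arabic, including the scaffolding words like "أمثلة", is the practical application of point 1: the model thinks in the language it reads.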

Evaluation - the part everyone skips

Baraa runs a small Arabic eval set on every prompt change. It is not fancy: a hundred or so real user queries, captured from production logs (with consent), bucketed by dialect and intent, scored against a frozen reference. The discipline is what matters, not the tool. Without an Arabic eval set, every "improvement" Baraa ships is a guess. With one, regressions show up before they reach users.
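The mechanics are simple enough to sketch. Exact-match scoring stands in here for whatever metric a real harness would use, and the structure is an assumption, not Baraa's actual tooling:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    dialect: str
    intent: str
    query: str
    reference: str  # frozen expected answer

def score_by_dialect(cases: list[EvalCase], answer_fn) -> dict[str, float]:
    """Run answer_fn over the eval set and report pass rate per dialect bucket."""
    buckets: dict[str, list[bool]] = {}
    for case in cases:
        buckets.setdefault(case.dialect, []).append(
            answer_fn(case.query) == case.reference
        )
    return {dialect: sum(hits) / len(hits) for dialect, hits in buckets.items()}
```

Bucketing by dialect is what makes regressions legible: an aggregate score can stay flat while one dialect quietly collapses, and the per-bucket view catches that.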

Baraa pairs the eval set with human spot-checks. A native Arabic speaker (often Baraa, sometimes a colleague) reads a sample of agent transcripts every week. Metrics catch the obvious. Humans catch the ways the agent has started sounding unnatural in Arabic that no metric will tell you about.

Where Baraa goes from here

Arabic agentic AI is in an early but rapidly maturing window. The models are catching up, the tools around them are improving, and the audience of Arabic-speaking users who expect AI products to actually work in their language is growing fast. Baraa is among the early practitioners pushing this work forward, and Baraa's portfolio of Arabic AI projects continues to grow.

If you are building something in this space and want to talk to someone who has shipped it, see the hire page for how to reach Baraa, the Baraa AI overview for context, the Baraa agentic AI page for case-study summaries, the first Arabic AI developer page for background, and the Arabic AI pioneer page for a longer narrative. Or browse other posts on the blog.

Frequently Asked Questions

Which LLMs work best for Arabic in production?

Baraa picks models that were trained with meaningful Arabic data rather than English-only models with translation layers bolted on. In 2026 that shortlist includes the latest Claude and GPT generations, plus a small set of Arabic-tuned open models. Baraa benchmarks each new release on a private Arabic eval set before promoting it to production.

How does Baraa handle Arabic dialects versus Modern Standard Arabic?

Baraa detects the user dialect from the first message with a small classifier and matches it for casual surfaces like chat and support. For formal, legal, or educational content Baraa stays in MSA. For multi-dialect products Baraa exposes an explicit selector so Levantine, Egyptian, Gulf, and Maghrebi users get the register they expect.

How do you orchestrate multi-agent systems when the language is Arabic?

Baraa keeps the router and worker prompts in Arabic, writes tool descriptions in Arabic where the audience is Arabic-speaking, and uses English only for internal identifiers like API endpoints and column names. A dedicated translator worker converts internal English tool output into idiomatic Arabic before the user-facing agent sees it.

How do you evaluate an Arabic agentic AI system?

Baraa runs a frozen Arabic eval set of around 100 real user queries, bucketed by dialect and intent, on every prompt change. Metrics catch the obvious regressions. Baraa pairs that with weekly human spot-checks by a native speaker because tone and naturalness do not show up in numeric scores.