How we build Azure SRE Agent with agentic workflows
The Challenge: Ops is critical but takes time from innovation

Microsoft operates always-on, mission-critical production systems at extraordinary scale. Thousands of services, millions of deployments, and constant change are the reality of modern cloud engineering. These systems power organizations around the globe, including our own, with extremely low tolerance for downtime. While operations work like incident investigation, response and recovery, and remediation is essential, it is also disruptive to innovation. For engineers, operational toil often means being pulled away from feature work to diagnose alerts, sift through logs, correlate metrics across systems, or respond to incidents at any hour. On-call rotations and manual investigations slow teams down and lead to burnout. What's more, in the era of AI, demand for operational excellence has spiked to new heights. It became clear that traditional, human-only processes couldn't meet the scale and complexity of system maintenance, especially now that AI has increased code-shipping velocity exponentially. At the same time, we needed to integrate with an AI landscape that continues to evolve at a breakneck pace. New models, new tooling, and new best practices are released constantly, fragmenting the ecosystem across different platforms for observability, DevOps, incident management, and security. Beyond simply automating tasks, we needed an adaptable approach that could integrate with existing systems and improve over time. Microsoft needed a fundamentally different way to perform operations: one that reduced toil, accelerated response, and gave engineers time to focus on building great products.

The Solution: How we built Azure SRE Agent using agentic workflows

To address these challenges, Microsoft built Azure SRE Agent, an AI-powered operations agent that serves as an always-on SRE partner for engineers.
In practice, Azure SRE Agent continuously observes production environments to detect and investigate incidents. It reasons across signals like logs, metrics, code changes, and other deployment records to perform root cause analysis. It supports engineers from triage to resolution, and it operates at a range of autonomy levels, from assistive investigation to automated remediation proposals. Everything occurs within governance guardrails and human approval checks grounded in role-based access controls and clear escalation paths. What's more, Azure SRE Agent learns from past incidents, outcomes, and human feedback to improve over time. But just as important as what was built is how it was built. Azure SRE Agent was created using the agentic workflow approach: building agents with agents. Rather than treating AI as a bolt-on tool, Microsoft embedded specialized agents across the entire software development lifecycle (SDLC) to collaborate with developers, from planning through operations. The diagram above outlines the agents used at each stage of development. They come together to form a full lifecycle:

- Plan & Code: Agents support spec-driven development to unlock faster inner-loop cycles for developers and even product managers. With AI, we can not only draft spec documentation that defines feature requirements for UX and software development agents, but also create prototypes and check code into staging, enabling PMs, UX, and engineering to rapidly iterate on and improve code even for early-stage merges.
- Verify, Test & Deploy: Agents for code quality review, security, evaluation, and deployment work together to shift left on quality and security issues. They also continuously assess reliability, ensure performance, and enforce consistent release best practices.
- Operate & Optimize: Azure SRE Agent handles ongoing operational work, from investigating alerts to assisting with remediation, and even resolves some issues autonomously.
Moreover, it learns continuously over time; we even provide Azure SRE Agent with its own specialized instance of Azure SRE Agent to maintain itself and catalyze feedback loops. While agents surface insights, propose actions, mitigate issues, and suggest long-term code or IaC fixes autonomously, humans remain in the loop for oversight, approval, and decision-making when required. This combination of autonomy and governance proved critical for safe operations at scale. We also designed Azure SRE Agent to integrate across existing systems. Our team uses custom agents, Model Context Protocol (MCP) and Python tools, telemetry connections, incident management platforms, code repositories, knowledge sources, and business process and operational tools to add intelligence on top of established workflows rather than replacing them. Built this way, Azure SRE Agent is not just a new tool but a new operational system. And at Microsoft's scale, transformative systems lead to transformative outcomes.

The Impact: Reducing toil at enterprise scale

The impact of Azure SRE Agent is felt most clearly in day-to-day operations. By automating investigations and assisting with remediation, the agent reduces the burden on on-call engineers and accelerates time to resolution. Internally at Microsoft over the last nine months, we've seen:

- 35,000+ incidents handled autonomously by Azure SRE Agent.
- 50,000+ developer hours saved by reducing manual investigation and response work.
- Reduced on-call burden and faster time-to-mitigation during incidents.

To share a couple of specific cases, the Azure Container Apps and Azure App Service product group teams have had tremendous success with Azure SRE Agent. Engineers for Azure Container Apps responded overwhelmingly positively (89%) to the root cause analysis (RCA) results from Azure SRE Agent, which covered over 90% of incidents.
Meanwhile, Azure App Service has brought its time-to-mitigation for live-site incidents (LSIs) down to 3 minutes, a drastic improvement from the 40.5-hour average with human-only activity. And this impact is felt in the developer experience. When we asked developers how the agent has changed ops work, one of our engineers had this to say:

"[It's] been a massive help in dealing with quota requests which were being done manually at first. I can also say with high confidence that there have been quite a few CRIs that the agent was spot on / gave the right RCA / provided useful clues that helped navigate my initial investigation in the right direction RATHER than me having to spend time exploring all different possibilities before arriving at the correct one. Since the Agent/AI has already explored all different combinations and narrowed it down to the right one, I can pick the investigation up from there and save me countless hours of logs checking." - Software Engineer II, Microsoft Engineering

Beyond the impact of the agent itself, the agentic workflow process has completely redefined how we build.

Key learnings: Agentic workflow process and impact

It's easy to think of agents as another form of advanced automation, but it's important to understand that Azure SRE Agent is also a collaborative tool. Engineers can prompt the agent during their investigations to surface relevant context (logs, metrics, and related code changes) and propose actions far faster than traditional troubleshooting allows. What's more, they can extend it for data analysis and dashboarding. Engineers can then focus on the agent's findings, approving actions or intervening when necessary. The result is a human-AI partnership that scales operations expertise without sacrificing control.
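To make that collaboration concrete, here is a minimal sketch of the kind of custom Python tool an engineer might expose to such an agent (for example via MCP): it condenses raw request and error counts into a signal the agent can reason over. The function name, data shape, and threshold are illustrative assumptions, not the SRE Agent's actual interface.

```python
# Hypothetical diagnostic tool of the kind an agent could call during an
# investigation. All names and thresholds here are illustrative.
from dataclasses import dataclass

@dataclass
class ErrorSample:
    minute: int       # minutes since the start of the window
    requests: int
    errors: int

def error_rate_summary(samples: list[ErrorSample], threshold: float = 0.05) -> dict:
    """Aggregate raw request/error counts into a compact signal."""
    total_req = sum(s.requests for s in samples)
    total_err = sum(s.errors for s in samples)
    rate = (total_err / total_req) if total_req else 0.0
    # Minutes whose per-minute error rate exceeds the threshold.
    breaching = [s.minute for s in samples
                 if s.requests and s.errors / s.requests > threshold]
    return {
        "error_rate": round(rate, 4),
        "breaching_minutes": breaching,
        "alert": rate > threshold,
    }
```

A tool like this keeps the agent's context small: instead of ingesting raw telemetry, it receives a summary it can cite when proposing an action.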
While the process took time and experimentation to refine, the payoff has been extraordinary; our team has been building high-quality features faster than ever since we introduced specialized agents for each stage of the SDLC. While these results were achieved inside Microsoft, the underlying patterns are broadly applicable. First, building agents with agents is essential to scaling: manual development quickly became a bottleneck, and agents dramatically accelerated inner-loop iteration through code generation, review, debugging, security fixes, and more. Specialization also matters, because generic agents plateau quickly; real impact comes from agents equipped with domain-specific skills, context, and access to the right tools and data. Microsoft also learned to integrate deeply with existing systems, embedding agents into established telemetry, workflows, and platforms rather than attempting to replace them. Throughout this process, maintaining tight human-in-the-loop governance proved critical. Autonomy had to be balanced with clear approval boundaries, role-based access, and safety checks to build trust. Finally, teams learned to invest in continuous feedback and evaluation, using ongoing measurement to improve agents over time and to understand where automation added value versus where human judgment should remain central.

Want to learn more?

Azure SRE Agent is one example of how agentic workflows can transform both product development and operations at scale. Teams at Microsoft are on a mission to lead the industry by example, not just share results. We invite you to take the practical learnings from this blog and apply the same principles in your own environments.

- Discover more about Azure SRE Agent
- Learn about agents in DevOps tools and processes
- Read best practices on agent management with Azure

An AI led SDLC: Building an End-to-End Agentic Software Development Lifecycle with Azure and GitHub
This is due to the inevitable move towards fully agentic, end-to-end SDLCs. We may not yet be at a point where software engineers are managing fleets of agents creating the billion-dollar AI abstraction layer, but (as I will evidence in this article) we are certainly on the precipice of such a world. Before we dive into the reality of agentic development today, let me examine two very different modules from university and their relevance in an AI-first development environment.

Manual requirements translation. At university I dedicated two whole years to a unit called "Systems Design". This was one of my favourite units, primarily focused on requirements translation. Often, I would receive a scenario between "The Proprietor" and "The Proprietor's wife", who seemed to be in a never-ending cycle of new product ideas. These tasks would be analysed, broken down, manually refined, and then mapped to some kind of early-stage application architecture (potentially some pseudo-code and a UML diagram or two). The big intellectual effort in this exercise was taking human intention and turning it into something tangible to build from (the classic BA task). Today, by the time I have opened Notepad and started to decipher requirements, an agent can already have created a comprehensive requirements list, a service blueprint, and a code scaffold to start the process (*cough* Spec Kit *cough*).

Manual debugging. Need I say any more? Old-school debugging with print()s and breakpoints is dead. I spent countless hours learning to debug in a classroom and then later with my own software, stepping through execution line by line, reading through logs, and understanding what to look for; where correlation did and didn't mean causation.
I think back to my year at IBM as a fresh-faced intern on a cloud engineering team, where around 50% of my time was spent debugging different issues until each was sufficiently "narrowed down", and then reading countless Stack Overflow posts to figure out the actual change I would need to make to a PowerShell script or Jenkins pipeline. Already in Azure, with the emergence of SRE agents, that debug process looks entirely different. The debug process for software, even more so:

#terminallastcommand WHY IS THIS NOT RUNNING?
#terminallastcommand Review these logs and surface errors relating to XYZ.

As I said: breakpoints are dead, for now at least.

Caveat: Is this a good thing?

One more deviation from the main core of the article, if you would be so kind (if you are not as kind, skip to the implementation walkthrough below). Is this actually a good thing? Is a software engineering degree now worthless? What if I love printf()? I don't know, is my answer today at the start of 2026. Two things worry me: one theoretical and one very real. To start with the theoretical: today AI takes a significant amount of the "donkey work" away from developers, and the list that "donkey work" encapsulates is certainly growing. How does this impact cognitive load at both ends of the spectrum? On one end, humans are left with the complicated parts not yet within an agent's remit. This could have quite an impact on our ability to perform tasks. If we are constantly dealing with the complex and advanced, when do we have time to re-root ourselves in the foundations? Will we see an increase in developer burnout? How do technical people perform without the mundane or routine tasks? I often hear people who have been in the industry for years discuss how simple infrastructure, computing, development, etc. were 20 years ago, almost with a longing to return to a world where today's zero-trust, globally replicated architectures were a twinkle in an architect's eye.
Is constantly working on only the most complex problems a good thing? At the other end of the spectrum, what if AI tooling and agents outperform our wildest expectations? Suddenly, AI tools and agents are picking up more and more of today's complicated and advanced tasks. Will developers, architects, and organisations lose some ability to innovate? Fundamentally, we are not talking about artificial general intelligence when we say AI; we are talking about incredibly complex predictive models that can augment the existing ideas they are built upon but are not, in themselves, innovators. Put simply, in the words of Scott Hanselman: "spicy auto-complete". Does increased reliance on these agents in more and more of our business processes remove the opportunity for innovative ideas? For example, if agents were football managers, would we ever have graduated from Neil Warnock and Mick McCarthy football to Pep? Would every agent just augment a 'lump it long and hope' approach? We hear about learning loops, but can those learning loops evolve into "innovation loops"? Past the theoretical and the game of 20 questions, the very real concern I have comes off the back of some data shared recently on Stack Overflow traffic. We can see in the diagram below that Stack Overflow traffic has dipped significantly since the release of GitHub Copilot in October 2021, and as the product has matured that trend has only accelerated. Data from 12 months ago suggests that Stack Overflow has lost 77% of new questions compared to 2022. Stack Overflow democratises access to problem-solving (I have to be careful not to talk in the past tense here), but I will admit I cannot remember the last time I was reviewing Stack Overflow or furiously searching through solutions vaguely similar to my own issue. This causes some concern over the data available to train models in the future. Today, models can be grounded in real, tested scenarios built by developers in anger.
What happens with this question drop when API schemas change, when the technology built for today is old and deprecated, and the dataset is stale and never returning to its peak? How do we mitigate this impact? There is potential for some closed-loop continuous improvement in the future, but is that a scalable solution? I am unsure. So, back to the question: "Is this a good thing?" It's great today; the long-term impacts are yet to be seen. If we think that AGI may never be achieved, or is at least a very distant horizon, then understanding the foundations of your technical discipline is still incredibly important. Developers will not only be the managers of their fleet of agents, but also the janitors mopping up the mess when there is an accident (albeit likely mopping with AI-augmented tooling).

An AI First SDLC Today – The Reality

Enough reflection and nostalgia (I don't think that's why you clicked the article); let's start building something. For the rest of this article I will be building an AI-led, agent-powered software development lifecycle. The example is an AI-generated weather dashboard. It's a simple example, but if agents can generate, test, deploy, observe, and evolve this application, it proves that the process can likely scale to more complex domains, today and into the future. Let's start with the entry point: the problem statement that we will build from.

"As a user I want to view real-time weather data for my city so that I can plan my day."

We will use this as the single input for our AI-led SDLC. This is what we will pass to Spec Kit, watching our app and subsequent features get built in front of our eyes. The goal is that we will:

- Use Spec Kit to get going and move from a textual idea to requirements and a scaffold.
- Use a coding agent to implement our plan.
- Use a quality agent to assess the output and the quality of the code.
- Use GitHub Actions that not only host the agents (abstracted away) but also handle the build and deployment.
- Deploy an SRE agent to proactively monitor the application and open issues automatically.

The end-to-end flow that we will review through this article is the following:

Step 1: Spec-driven development - Spec first, code second

A big piece of realising an AI-led SDLC today relies on spec-driven development (SDD). One of the best summaries of SDD that I have seen is "version control for your thinking". Instead of huge specs that go stale, buried in a knowledge repository somewhere, SDD makes them a first-class citizen within the SDLC. Architectural decisions, business logic, and intent are captured and versioned as a product evolves: an executable artefact that evolves with the project. In 2025, GitHub released the open-source Spec Kit, a tool that places the specification at the centre of the engineering process. Specs drive the implementation, checklists, and task breakdowns, steering an agent towards the end goal. This article from GitHub does a great job explaining the basics, so if you'd like to learn more it's a great place to start (https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/). In short, Spec Kit generates requirements, a plan, and tasks to guide a coding agent through an iterative, structured development process. Through the Spec Kit constitution, organisational standards and tech-stack preferences are adhered to throughout each change. I did notice one (likely intentional) gap in functionality that currently keeps Spec Kit from anchoring a fully autonomous SDLC: the implement stage is designed to run within an IDE or client coding agent. You can now, in the IDE, toggle between implementing tasks locally or with an agent in the cloud. That is great, but it still requires you to drive from the IDE.
Thinking about this in the context of an AI-led SDLC (where we are pushing tasks from Spec Kit to a coding agent outside of my own desktop), it was clear that a bridge was needed. As a result, I used Spec Kit to create the Spec-to-issue tool. This allows us to take the tasks and plan generated by Spec Kit, parse the important parts, and automatically create a GitHub issue, with the option to auto-assign the coding agent. From the perspective of an autonomous AI-led SDLC, Spec Kit really is the entry point that triggers the flow. How Spec Kit is surfaced to users will vary depending on the organisation and the context of the users. For the rest of this demo I use Spec Kit to create a weather app calling out to the OpenWeather API, and then add additional features with new specs. With one simple prompt of /speckit.specify "Application feature/idea/change", I suddenly had a really clear breakdown of the tasks and plan required to reach my desired end state, while respecting the context and preferences I had previously set in my Spec Kit constitution. I had specified a desire for test-driven development, required a certain level of coverage, and mandated that all solutions be Azure-native. The real benefit here, compared to prompting the coding agent directly, is that breaking one large task down into small, measurable, clearly defined components improves the coding agent's ability to perform them by a considerable degree. We can see an example below of not just creating a whole application, but another spec iterating on an existing application to add a feature. We can see the result of the spec creation, the issue in our GitHub repo, and, most importantly for the next step, that our coding agent, GitHub Copilot, has been assigned automatically.
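The bridge can be sketched in a few lines: parse the checklist Spec Kit writes into tasks.md, then create an issue via the GitHub REST API. The tasks.md line format and the assignee handle below are assumptions based on the article; the real Spec-to-issue tool may differ.

```python
# Sketch of a Spec-to-issue bridge. The "- [ ] T001 ..." checklist format
# and the "copilot" assignee are assumed for illustration.
import json
import re
import urllib.request

TASK_RE = re.compile(r"^- \[ \] (T\d+)\s+(.*)")

def parse_tasks(tasks_md: str) -> list[tuple[str, str]]:
    """Extract (task-id, description) pairs from a Spec Kit style checklist."""
    return [m.groups() for line in tasks_md.splitlines()
            if (m := TASK_RE.match(line.strip()))]

def build_issue_payload(feature: str, tasks: list[tuple[str, str]]) -> dict:
    """Turn parsed tasks back into a checklist the coding agent can tick off."""
    body = "\n".join(f"- [ ] **{tid}** {desc}" for tid, desc in tasks)
    return {
        "title": f"Implement: {feature}",
        "body": body,
        "assignees": ["copilot"],  # auto-assign the coding agent (assumed handle)
    }

def create_issue(repo: str, token: str, payload: dict) -> None:
    """POST the issue via the GitHub REST API (repo = 'owner/name')."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on a non-2xx response
```

Keeping the parsing separate from the API call makes the interesting part (what ends up in the issue) easy to test without a network.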
Step 2: GitHub Coding Agent - Iterative, autonomous software creation

Speaking of coding agents: GitHub Copilot's coding agent is an autonomous agent in GitHub that can take a scoped development task and work on it in the background using the repository's context. It can make code changes and produce concrete outputs like commits and pull requests for a developer to review. The developer stays in control by reviewing, requesting changes, or taking over at any point. This does the heavy lifting in our AI-led SDLC. We have already seen great success with customers who have adopted the coding agent for carrying out menial tasks to save developers time. These coding agents can work in parallel with human developers and with each other. In our example, the coding agent creates a new branch for its changes and opens a PR, which it works on as it ticks off the various tasks generated from our spec. One huge positive of the coding agent that sets it apart from similar solutions is its transparency in decision-making and actions taken. The monitoring and observability built directly into the feature mean that the agent's "thinking" is easily visible: the iterations and steps being taken can be viewed in full sequence in the Agents tab. Furthermore, the workflow run the agent executes in is transparently available in the Actions tab, meaning problems can be assessed very quickly. Once the coding agent is finished, it has run the required tests and, in the case of a UI change, even goes as far as calling the Playwright MCP server and screenshotting the change to showcase in the PR. We are then asked to review the change. In this demo, I also created a GitHub Action that is triggered when a PR review is requested: it creates the required resources in Azure and surfaces the (in this case) Azure Container Apps revision URL, making it even smoother for the human in the loop to evaluate the changes.
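A rough sketch of what that review helper does: build the revision-scoped preview URL and post it back to the PR as a comment. The FQDN pattern shown here is an assumption for illustration; in the real workflow the URL would come from the deployment step's own output, and the comment would go to GitHub's issue-comments endpoint.

```python
# Hypothetical PR preview helper. The revision FQDN pattern and comment
# wording are illustrative, not taken from the article's actual Action.
def revision_url(app: str, revision_suffix: str, env_domain: str) -> str:
    """Assumed revision-scoped FQDN: <app>--<suffix>.<environment default domain>."""
    return f"https://{app}--{revision_suffix}.{env_domain}"

def review_comment(pr_number: int, url: str) -> dict:
    """Payload for POST /repos/{owner}/{repo}/issues/{pr_number}/comments."""
    return {"body": f"Preview for PR #{pr_number} is live: {url}"}
```

The reviewer then gets a clickable link in the PR thread instead of hunting through deployment logs.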
Just like any normal PR, if changes are required, comments can be left; when they are, the coding agent can pick them up and action what is needed. It's also worth noting that for any manual intervention here, GitHub Codespaces works very well for making minor changes or testing on an agent's branch. We can even see that the unit tests specified in our spec have been executed by our coding agent. The pattern used here (Spec Kit -> coding agent) overcomes one of the biggest challenges we see with the coding agent. Unlike an IDE-based coding agent, the GitHub.com coding agent is left to its own iterations and implementation without input until the PR review. This can lead to subpar performance, especially compared to IDE agents, which receive constant input and interruption. The concise and considered breakdown generated by Spec Kit provides the structure and foundation for the agent to execute on; very little is left to the coding agent's interpretation.

Step 3: GitHub Code Quality Review - Human in the loop with agent assistance

GitHub Code Quality is a feature (currently in preview) that proactively identifies code quality risks and opportunities for enhancement, both in PRs and through repository scans. Findings are surfaced within a PR and in repo-level scoreboards. This means PRs can now extend existing static code analysis: Copilot can action CodeQL, PMD, and ESLint scanning on top of the new, in-context code quality findings and autofixes. Furthermore, we receive a summary of the actual changes made. This helps the human in the loop understand what changes have been made and whether enhancements or improvements are required. Thinking about this in the context of review coverage: one of the challenges in already-lean development teams is finding the time to give proper credence to PRs. Now, with AI-assisted quality scanning, we can be more confident in our overall evaluation and test coverage.
I would expect that use of these tools alongside existing human review processes will increase repository code quality and reduce uncaught errors. The data points support this too: the Qodo 2025 AI Code Quality report showed that usage of AI code reviews increased reported quality improvements to 81% (from 55%). A similar 2026 study from Atlassian Rovo Dev showed that 38.7% of comments left by AI agents in code reviews led to additional code fixes. LLMs in their current form are never going to achieve 100% accuracy; however, these are still considerable, significant gains in one of the most important (and often neglected) parts of the SDLC. With a significant number of software supply chain attacks recently, it is also not a stretch to imagine that many projects could benefit from "independently" (I use this term loosely) reviewed and summarised PRs and commits. In the future this could potentially be handled by a specialist sub-agent during a PR or merge, focused on identifying malicious code hidden within otherwise normal contributions; case in point, the "near-miss" XZ Utils attack.

Step 4: GitHub Actions for build and deploy - No agents here, just deterministic automation

This step will be our briefest, as the idea of CI/CD and automation needs no introduction. It is worth noting that while I am sure there are additional opportunities for using agents within a build and deploy pipeline, I have not investigated them. I often speak with customers about deterministic and non-deterministic business process automation, and the importance of distinguishing between the two. Some processes were made deterministic because that is all that was available at the time; the number of conditions required to handle N possible flows just did not scale. Now, those processes can be non-deterministic.
Good examples include IVR decision trees in customer service, or hard-coded sales routines that try to retain a customer regardless of context; these would benefit from less determinism in their execution. However, some processes remain best as deterministic flows: financial transactions, policy engines, document ingestion. While all of these may be part of an AI solution in the future (possibly as a tool an agent calls, or as part of a larger agent-based orchestration), the processes themselves are deterministic for a reason. Just because we could have dynamic decision-making doesn't mean we should. Infrastructure deployment and CI/CD pipelines are, in my opinion, a good example of this. We could have an agent decide which service best fits our codebase and which region we should deploy to, but do we really want to, and do the benefits outweigh the potential negatives? In this process flow we use a deterministic GitHub Action to deploy our weather application into our development environment and then promote it through the environments until we reach production, where we want to ensure the application is running smoothly. We also use an Action, as mentioned above, to deploy and surface our agent's changes. In Azure Container Apps we can do this in a secure sandbox called a dynamic session, ensuring strong isolation of what is essentially untrusted code. Enterprises often view building AI applications as requiring a completely new path to production. While certain processes are indeed new (evaluation, model deployment, and so on), many of our traditional SDLC principles are just as relevant as ever, CI/CD pipelines being a great example: checked-in code that is predictably deployed alongside the services required to run tests or promote through environments. Whether you are deploying a Java calculator app or a multi-agent customer service bot, CI/CD even in this new world is non-negotiable.
We can see that our geolocation feature is running on our Azure Container Apps revision, and we can begin to evaluate whether we agree with Copilot that all the feature requirements have been met. In this case they have. If they hadn't, we'd just jump into the PR and add a new comment with "@copilot" requesting our changes.

Step 5: SRE Agent - Proactive agentic day-two operations

The SRE Agent service on Azure is an operations-focused agent that continuously watches a running service using telemetry such as logs, metrics, and traces. When it detects incidents or reliability risks, it can investigate signals, correlate likely causes, and propose or initiate response actions such as opening issues, creating runbook-guided fixes, or escalating to an on-call engineer. It effectively automates parts of day-two operations while keeping humans in control of approval and remediation. It can run in two different permission models: one with a reader role that can temporarily assume user permissions for approved actions when they are identified; the other a privileged level that allows it to autonomously take approved actions on resources and resource types within the resource groups it is monitoring. In our example, our SRE agent could take actions to ensure our container app runs as intended: restarting pods, changing traffic allocations, and alerting on secret expiry. The SRE agent can also perform detailed debugging to save human SREs time, summarising the issue and fixes tried so far, and narrowing down potential root causes to reduce time to resolution, even for the most complex issues. My initial concern with these types of autonomous fixes (be it VPA on Kubernetes or an SRE agent across your infrastructure) is always that they can very quickly mask problems, or become an anti-pattern where you have drift between your IaC and what is actually running in Azure. One of my favourite features of SRE agents is sub-agents.
Sub-agents can be created to handle very specific tasks that the primary SRE agent can leverage. Examples include alerting, report generation, and potentially third-party integrations or tooling that require a more concise context. In my example, I created a GitHub sub-agent to be called by the primary agent after every issue it resolves. When called, the GitHub sub-agent creates an issue summarising the origin, context, and resolution. This really brings us full circle: we can then assign that issue to our coding agent to implement the fix before proceeding with the rest of the cycle; for example, a change where a port is incorrect in some Bicep, or where min scale has been adjusted because of latency observed by the SRE agent. These are quick fixes that can easily be implemented by a coding agent, creating an autonomous feedback loop with human review.

Conclusion

The journey through this AI-led SDLC demonstrates that it is possible, with today's tooling, to improve any existing SDLC with AI assistance, evolving beyond simply using a chat interface in an IDE. By combining Spec Kit and spec-driven development, autonomous coding agents, AI-augmented quality checks, deterministic CI/CD pipelines, and proactive SRE agents, we see an emerging ecosystem where human creativity and oversight guide an increasingly capable fleet of collaborative agents. As with all AI solutions we design today, I remind myself that "this is as bad as it gets". If the last two years are anything to go by, the rate of change in this space means this article may look very different in 12 months. I imagine Spec-to-issue will no longer be required as a bridge, as native solutions evolve to make this process even smoother. There are also some areas of an AI-led SDLC not included in this post, such as reviewing the inner-loop process or the use of existing enterprise patterns and blueprints.
I also did not review the use of third-party plugins or tools available through GitHub. These would make for an interesting expansion of the demo. We also did not look at the creation of custom coding agents, which could be hosted in Microsoft Foundry; this is especially pertinent with the recent announcement of Anthropic models now being available to deploy in Foundry. Does today's tooling mean that developers, QAs, and engineers are no longer required? Absolutely not (and if I am honest, I can't see that changing any time soon). However, it is clear that in the next 12 months, enterprises who reshape their SDLC (and any other business process) to become one augmented by agents will innovate faster, learn faster, and deliver faster, leaving organisations who resist this shift struggling to keep up.
Announcing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent—your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.
Announcing a flexible, predictable billing model for Azure SRE Agent
Billing for Azure SRE Agent will start on September 1, 2025. Announced at Microsoft Build 2025, Azure SRE Agent is a pre-built AI agent for root cause analysis, uptime improvement, and operational cost reduction. Learn more about the billing model and example scenarios.
An update to the active flow billing model for Azure SRE Agent
Earlier today, we announced that Azure SRE Agent now supports multiple AI model providers, starting with Anthropic. To support multi-model choice, and make active usage costs easier to understand, we’re updating how active flow usage is measured, effective April 15, 2026. At a glance What’s changing Active flow billing moves from time-based to token-based usage. You’ll be billed based on the tokens consumed when SRE Agent is actively doing work (for example, investigating an incident, responding to an alert, or helping in chat). Each model provider has its own published rate (AAUs per million tokens), so you can choose the model provider that fits your scenario and budget. What stays the same Azure Agent Unit (AAU) remains the billing unit. Always-on flow pricing is unchanged: 4 AAUs per agent-hour Your bill continues to have two components: a fixed always-on component plus a variable active flow component. What you need to do For most customers, no action is required. Your existing agents continue running. For the latest information on the AAU rates by model provider and estimates of example consumption scenarios, please refer to the pricing documentation. Why we’re making this change In reliability operations, different tasks can look very different: a quick health check isn’t the same as a multi-step investigation across logs, deployments, and metrics. With multi-model provider support, token consumption varies by model provider and by task complexity. Moving active flow billing to a token-based model provides a more direct, transparent connection between the work being performed and the active usage you’re billed for; especially as we expand model options over time. How token-based active flow helps More predictable costs for common tasks Simple interactions typically use fewer tokens. More complex investigations use more. With token-based billing, the relationship between task complexity and active usage is clearer. 
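As an illustration of the arithmetic, here is a minimal sketch of the two-component bill. The always-on rate (4 AAUs per agent-hour) comes from this announcement; the AAU-per-million-tokens rate used below is a made-up placeholder, since actual rates are published per model provider in the pricing documentation.

```python
# Sketch only: the always-on rate (4 AAUs per agent-hour) is from this
# announcement; the token rate used in the example is a HYPOTHETICAL placeholder.
ALWAYS_ON_AAU_PER_AGENT_HOUR = 4

def estimate_monthly_aau(agent_hours, active_tokens_millions, aau_per_million_tokens):
    """Total AAUs = fixed always-on component + token-based active flow component."""
    always_on = ALWAYS_ON_AAU_PER_AGENT_HOUR * agent_hours
    active_flow = active_tokens_millions * aau_per_million_tokens
    return always_on + active_flow

# One agent running a ~730-hour month that consumed 50M active tokens at an
# assumed 10 AAUs per million tokens: 2920 always-on + 500 active = 3420 AAUs.
total = estimate_monthly_aau(agent_hours=730,
                             active_tokens_millions=50,
                             aau_per_million_tokens=10)
```

Note that the fixed component is unaffected by model choice; only the second term changes with the provider's published rate.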
More flexibility as we add models You choose the provider; we select the best model for the job. As model providers release newer models and we adopt them, we publish updated AAU-per-token rates so you always know what you're paying. See the current rates in the pricing documentation. Spending controls stay in place You can still set a monthly AAU allocation limit in Settings → Agent consumption in the SRE Agent portal. When you reach your active flow limit, your agent continues to run, but pauses chat and autonomous actions until the next month. You can adjust your limit at any time. Next steps For most customers, this change requires no action. Your always-on billing is unchanged, your existing agents continue running, and your AAU meter remains the same. The billing change affects only how active flow usage is measured and calculated. If you're currently using SRE Agent and want to understand the new pricing in detail – including AAU rates per model, example consumption scenarios for light, medium, and heavy workloads, and guidance on setting spending limits – please visit the pricing documentation for the latest information. NOTE: The pricing section in product documentation is your authoritative source for current rates until the pricing page is updated. Questions or feedback on the new billing model? Use the Feedback & issues link in the SRE Agent portal or reach out through the Azure SRE Agent community. Additional resources Product documentation: https://un5mzpanxv5t0qg.julianrbryant.com/sreagent/docs Self-paced hands-on labs: https://un5mzpanxv5t0qg.julianrbryant.com/sreagent/lab Technical videos and demos: https://un5mzpanxv5t0qg.julianrbryant.com/sreagent/youtube Azure SRE Agent home page: https://un5gmtkzgjgpdgnw3w.julianrbryant.com/sreagent Azure SRE Agent on X: https://un5v3pg.julianrbryant.com/azuresreagent
Azure SRE Agent now supports multiple model providers, including Anthropic Claude
Today, Azure SRE Agent adds model provider selection—choose between Azure OpenAI and Anthropic to match the right AI provider to your incident workflow. SRE Agent has saved over 20,000 engineering hours by pulling together logs, deployments, and signals into a single investigation thread, turning scattered data into clearer mitigation steps. Customers like Ecolab have seen daily alerts drop by up to 75%. Choose your model provider: Azure OpenAI or Anthropic Azure SRE Agent has always used Azure OpenAI. Now, Anthropic is also available as a model provider, with Claude Opus 4.6 as the baseline model. Different reliability tasks demand different reasoning capabilities. A quick health check isn't the same as a multi-hour root cause investigation spanning dozens of log streams, deployment histories, and correlated metrics. With model provider selection, you can match the provider to the complexity of the work. When you select Anthropic, Azure SRE Agent automatically routes tasks to the right model for the job. Claude Opus 4.6 is the primary model, bringing a large context window and extended reasoning capabilities well suited for complex, multi-step investigations where the agent needs to retain and connect information across many signals before proposing next steps. Why does this matter for operations teams? Complex incidents are where Azure SRE Agent's value is highest—and where model provider choice matters most. When your agent is correlating logs across services, reviewing deployment history, analyzing a metrics anomaly, and proposing a mitigation runbook, stronger long-context reasoning can improve the quality and consistency of the investigation thread. Model provider selection is also foundational to where Azure SRE Agent is heading. With model provider abstraction, in the future you will be able to select any new providers that become available without changing how your agent works; your existing configuration and setup carry over automatically.
The goal: give you the right provider for the job and the flexibility to tune the agent to your operational needs. Get started To use Anthropic Claude, create a new agent and select Anthropic as your model provider during setup. If you're new to Azure SRE Agent, start with the Getting Started guide to create an agent, connect it to your logs or resources, and run your first investigation. Questions or feedback? Use the Feedback & issues link in the SRE Agent portal or reach out through the Azure SRE Agent community. Additional resources Product documentation: https://un5mzpanxv5t0qg.julianrbryant.com/sreagent/docs Self-paced hands-on labs: https://un5mzpanxv5t0qg.julianrbryant.com/sreagent/lab Technical videos and demos: https://un5mzpanxv5t0qg.julianrbryant.com/sreagent/youtube Azure SRE Agent home page: https://un5gmtkzgjgpdgnw3w.julianrbryant.com/sreagent Azure SRE Agent on X: https://un5v3pg.julianrbryant.com/azuresreagent
Announcing AWS with Azure SRE Agent: Cross-Cloud Investigation using the brand new AWS DevOps Agent
Overview Connect Azure SRE Agent to AWS services using the official AWS MCP server. Query AWS documentation, execute any of the 15,000+ AWS APIs, run operational workflows, and kick off incident investigations through AWS DevOps Agent, which is now generally available. The AWS MCP server connects Azure SRE Agent to AWS documentation, APIs, regional availability data, pre-built operational workflows (Agent SOPs), and AWS DevOps Agent for incident investigation. When connected, the proxy exposes 23 MCP tools organized into four categories: documentation and knowledge, API execution, guided workflows, and DevOps Agent operations. How it works The MCP Proxy for AWS runs as a local stdio process that SRE Agent spawns via uvx . The proxy handles AWS authentication using credentials you provide as environment variables. No separate infrastructure or container deployment is needed. In the portal, you use the generic MCP server (User provided connector) option with stdio transport. Key capabilities Area Capabilities Documentation Search all AWS docs, API references, and best practices; retrieve pages as markdown API execution Execute authenticated calls across 15,000+ AWS APIs with syntax validation and error handling Agent SOPs Pre-built multi-step workflows following AWS Well-Architected principles Regional info List all AWS regions, check service and feature availability by region Infrastructure Provision VPCs, databases, compute instances, storage, and networking resources Troubleshooting Analyze CloudWatch logs, CloudTrail events, permission issues, and application failures Cost management Set up billing alerts, analyze resource usage, and review cost data DevOps Agent Start AWS incident investigations, read root cause analyses, get remediation recommendations, and chat with AWS DevOps Agent Note: The AWS MCP Server is free to use. You pay only for the AWS resources consumed by API calls made through the server. All actions respect your existing IAM policies. 
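As a rough sketch of that launch sequence, the helper below builds the argv for the stdio proxy. The endpoint pattern (aws-mcp.&lt;region&gt;.api.aws) is taken from this article's architecture section; the helper itself is illustrative, not part of the product.

```python
# Illustrative only: builds the argv that SRE Agent would spawn for the stdio
# proxy. The endpoint pattern (aws-mcp.<region>.api.aws) is an assumption
# based on this article's architecture section.
def build_proxy_command(region="us-east-1"):
    endpoint = "https://aws-mcp.{}.api.aws/mcp".format(region)
    return ["uvx", "mcp-proxy-for-aws@latest", endpoint,
            "--metadata", "AWS_REGION={}".format(region)]

# SRE Agent spawns this argv as a child process and exchanges JSON-RPC over
# stdin/stdout, with AWS credentials supplied as environment variables
# (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) rather than on the command line.
```

Passing credentials via the process environment, not the argv, keeps secrets out of process listings and shell history.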
Prerequisites Azure SRE Agent resource deployed in Azure AWS account with IAM credentials configured uv package manager installed on the SRE Agent host (used to run the MCP proxy via uvx ) IAM permissions: aws-mcp:InvokeMcp , aws-mcp:CallReadOnlyTool , and optionally aws-mcp:CallReadWriteTool Step 1: Create AWS access keys The AWS MCP server authenticates using AWS access keys (an Access Key ID and a Secret Access Key). These keys are tied to an IAM user in your AWS account. You create them in the AWS Management Console. Navigate to IAM in the AWS Console Sign in to the AWS Management Console In the top search bar, type IAM and select IAM from the results (Direct URL: https://un5kxttrqq5vj5dmhkx794rnk0.julianrbryant.com/iam/ ) In the left sidebar, select Users (Direct URL: https://un5kxttrqq5vj5dmhkx794rnk0.julianrbryant.com/iam/home#/users ) Create a dedicated IAM user Create a dedicated user for SRE Agent rather than reusing a personal account. This makes it easy to scope permissions and rotate keys independently. 
Select Create user Enter a descriptive user name (e.g., sre-agent-mcp ) Do not check "Provide user access to the AWS Management Console" (this user only needs programmatic access) Select Next Select Attach policies directly Select Create policy (opens in a new tab) and paste the following JSON in the JSON editor: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "aws-mcp:InvokeMcp", "aws-mcp:CallReadOnlyTool", "aws-mcp:CallReadWriteTool" ], "Resource": "*" } ] } Select Next, give the policy a name (e.g., SREAgentMCPAccess ), and select Create policy Back on the Create user tab, select the refresh button in the policy list, search for SREAgentMCPAccess , and check it Select Next > Create user Generate access keys After the user is created, generate the access keys that SRE Agent will use: From the Users list, select the user you just created (e.g., sre-agent-mcp ) Select the Security credentials tab Scroll down to the Access keys section Select Create access key For the use case, select Third-party service Check the confirmation checkbox and select Next Optionally add a description tag (e.g., Azure SRE Agent ) and select Create access key Copy both values immediately: Value Example format Where you'll use it Access Key ID <your-access-key-id> Connector environment variable AWS_ACCESS_KEY_ID Secret Access Key <your-secret-access-key> Connector environment variable AWS_SECRET_ACCESS_KEY Important: The Secret Access Key is shown only once on this screen. If you close the page without copying it, you must delete the key and create a new one. Select Download .csv file as a backup, then store the file securely and delete it after configuring the connector. Tip: For production use, also add service-specific IAM permissions for the AWS APIs you want SRE Agent to call. The MCP permissions above grant access to the MCP server itself, but individual API calls (e.g., ec2:DescribeInstances , logs:GetQueryResults ) require their own IAM actions. 
Start broad for testing, then scope down using the principle of least privilege. Required permissions summary Permission Description Required? aws-mcp:InvokeMcp Base access to the AWS MCP server Yes aws-mcp:CallReadOnlyTool Read operations (describe, list, get, search) Yes aws-mcp:CallReadWriteTool Write operations (create, update, delete resources) Optional Step 2: Add the MCP connector Connect the AWS MCP server to your SRE Agent using the portal. The proxy runs as a local stdio process that SRE Agent spawns via uvx . It handles SigV4 signing using the AWS credentials you provide as environment variables. Determine the AWS MCP endpoint for your region The AWS MCP server has regional endpoints. Choose the one matching your AWS resources: AWS Region MCP Endpoint URL us-east-1 (default) https://un5mythm4u4a2u6g8rta21e6kezz8b3fqq2142r.julianrbryant.coms/mcp us-west-2 https://un5mythm4u4a2u6g8rt4g1e6keyf8b3fqq2142r.julianrbryant.coms/mcp eu-west-1 https://un5mythm4u4a2u6gw3c1bpa7n5raphjqh6h7gkqd.julianrbryant.coms/mcp Note: Without the --metadata AWS_REGION=<region> argument, operations default to us-east-1 . You can always override the region in your query. 
Using the Azure portal In Azure portal, navigate to your SRE Agent resource Select Builder > Connectors Select Add connector Select MCP server (User provided connector) and select Next Configure the connector with these values: Field Value Name aws-mcp Connection type stdio Command uvx Arguments mcp-proxy-for-aws@latest https://un5mythm4u4a2u6g8rta21e6kezz8b3fqq2142r.julianrbryant.coms/mcp --metadata AWS_REGION=us-west-2 Environment variables AWS_ACCESS_KEY_ID=<your-access-key-id> , AWS_SECRET_ACCESS_KEY=<your-secret-access-key> Select Next to review Select Add connector This is equivalent to the following MCP client configuration used by tools like Claude Desktop or Amazon Kiro CLI: { "mcpServers": { "aws-mcp": { "command": "uvx", "args": [ "mcp-proxy-for-aws@latest", "https://un5mythm4u4a2u6g8rta21e6kezz8b3fqq2142r.julianrbryant.coms/mcp", "--metadata", "AWS_REGION=us-west-2" ] } } } Important: Store the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY securely. In the portal, environment variables for connectors are stored encrypted. For production deployments, consider using a dedicated IAM user with scoped-down permissions (see Step 1). Never commit credentials to source control. Tip: If your SRE Agent host already has AWS credentials configured (e.g., via aws configure or an instance profile), the proxy will pick them up automatically from the environment. In that case, you can omit the explicit AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. Note: After adding the connector, the agent service initializes the MCP connection. This may take up to 30 seconds as uvx downloads the proxy package on first run (~89 dependencies). If the connector does not show Connected status after a minute, see the Troubleshooting section below. Step 3: Add an AWS skill Skills give agents domain knowledge and best practices for specific tool sets. 
Create an AWS skill so your agent knows how to troubleshoot AWS services, provision infrastructure, and follow operational workflows. Tip: Why skills over subagents? Skills inject domain knowledge into the main agent's context, so it can use AWS expertise without handing off to a separate agent. Conversation context stays intact and there's no handoff latency. Use a subagent when you need full isolation with its own system prompt and tool restrictions. Navigate to Builder > Skills Select Add skill Paste the following skill configuration:

api_version: azuresre.ai/v1
kind: SkillConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: aws_infrastructure_operations
  display_name: AWS Infrastructure & Operations
  description: |
    AWS infrastructure and operations: EC2, EKS, Lambda, S3, RDS, CloudWatch,
    CloudTrail, IAM, VPC, and others. Also covers AWS DevOps Agent for
    incident investigation, root cause analysis, and remediation. Use for
    querying AWS resources, investigating issues, provisioning infrastructure,
    searching documentation, running AWS API calls via the AWS MCP server, and
    coordinating investigations between Azure SRE Agent and AWS DevOps Agent.
  instructions: |
    ## Overview
    The AWS MCP Server is a managed remote MCP server that gives AI assistants
    authenticated access to AWS services. It combines documentation access,
    authenticated API execution, and pre-built Agent SOPs in a single interface.

    **Authentication:** Handled automatically by the MCP Proxy for AWS, running
    as a local stdio process. All actions respect existing IAM policies
    configured in the connector environment variables.

    **Regional endpoints:** The MCP server has regional endpoints. The proxy is
    configured with a default region; you can override by specifying a region
    in your queries (e.g., "list my EC2 instances in eu-west-1").

    ## Searching Documentation
    Use aws___search_documentation to find information across all AWS docs.

    ## Executing AWS API Calls
    Use aws___call_aws to execute authenticated AWS API calls. The tool handles
    SigV4 signing and provides syntax validation.

    ## Using Agent SOPs
    Use aws___retrieve_agent_sop to find and follow pre-built workflows. SOPs
    provide step-by-step guidance following AWS Well-Architected principles.

    ## Regional Operations
    Use aws___list_regions to see all available AWS regions and
    aws___get_regional_availability to check service support in specific regions.

    ## AWS DevOps Agent Integration
    The AWS MCP server includes tools for AWS DevOps Agent:
    - aws___list_agent_spaces / aws___create_agent_space: Manage AgentSpaces
    - aws___create_investigation: Start incident investigations (5-8 min async)
    - aws___get_task: Poll investigation status
    - aws___list_journal_records: Read root cause analysis
    - aws___list_recommendations / aws___get_recommendation: Get remediation steps
    - aws___start_evaluation: Run proactive infrastructure evaluations
    - aws___create_chat / aws___send_message: Chat with AWS DevOps Agent

    ## Troubleshooting
    | Issue | Solution |
    |-------|----------|
    | Access denied errors | Verify IAM policy includes aws-mcp:InvokeMcp and aws-mcp:CallReadOnlyTool |
    | API call fails | Check IAM policy includes the specific service action |
    | Wrong region results | Specify the region explicitly in your query |
    | Proxy connection error | Verify uvx is installed and the proxy can reach aws-mcp.region.api.aws |
  mcp_connectors:
    - aws-mcp

Select Save Note: The mcp_connectors: - aws-mcp at the bottom links this skill to the connector you created in Step 2. The skill's instructions teach the agent how to use the 23 AWS MCP tools effectively. Step 4: Test the integration Open a new chat session with your SRE Agent and try these example prompts to verify the connection is working. Quick verification Start with this simple test to confirm the AWS MCP proxy is connected and authenticating correctly: What AWS regions are available?
If the agent returns a list of regions, the connection is working. If you see authentication errors, go back and verify the IAM credentials and permissions from Step 1. Documentation and knowledge Search AWS documentation for EKS best practices for production clusters What AWS regions support Amazon Bedrock? Read the AWS documentation page about S3 bucket policies Infrastructure queries List all my running EC2 instances in us-east-1 Show me the details of my EKS cluster named "production-cluster" What Lambda functions are deployed in my account? CloudWatch and monitoring What CloudWatch alarms are currently in ALARM state? Show me the CPU utilization metrics for my RDS instance over the last 24 hours Search CloudWatch Logs for errors in the /aws/lambda/my-function log group Troubleshooting workflows My EC2 instance i-0abc123 is not reachable. Help me troubleshoot. My Lambda function is timing out. Walk me through the investigation. Find an Agent SOP for troubleshooting EKS pod scheduling failures Cross-cloud scenarios My Azure Function is failing when calling AWS S3. Check if there are any S3 service issues and review the bucket policy for "my-data-bucket". Compare the health of my AWS EKS cluster with my Azure AKS cluster. AWS DevOps Agent investigations List all available AWS DevOps Agent spaces in my account Create an AWS DevOps Agent investigation for the high error rate on my Lambda function "order-processor" in us-west-2 Start a chat with AWS DevOps Agent about my EKS cluster performance Cross-agent investigation (Azure SRE Agent + AWS DevOps Agent) My application is failing across both Azure and AWS. Start an AWS DevOps Agent investigation for the AWS side while you check Azure Monitor for errors on the Azure side. Then combine the findings into a unified root cause analysis. What's New: AWS DevOps Agent Integration The AWS MCP server now includes full integration with AWS DevOps Agent, which recently became generally available. 
This means Azure SRE Agent can start autonomous incident investigations on AWS infrastructure and get back root cause analyses and remediation recommendations — all within the same chat session. Available tools by category AgentSpace management Tool Description aws___list_agent_spaces Discover available AgentSpaces aws___get_agent_space Get AgentSpace details including ARN and configuration aws___create_agent_space Create a new AgentSpace for investigations Investigation lifecycle Tool Description aws___create_investigation Start an incident investigation (async, 5-8 min) aws___get_task Poll investigation task status aws___list_tasks List investigation tasks with filters aws___list_journal_records Read root cause analysis journal aws___list_executions List execution runs for a task aws___list_recommendations Get prioritized mitigation recommendations aws___get_recommendation Get full remediation specification Proactive evaluations Tool Description aws___start_evaluation Start an evaluation to find preventive recommendations aws___list_goals List evaluation goals and criteria Real-time chat Tool Description aws___create_chat Start a real-time chat session with AWS DevOps Agent aws___list_chats List recent chat sessions aws___send_message Send a message and get a streamed response Cross-Agent Investigation Workflow With the AWS MCP server connected, SRE Agent can run parallel investigations across both clouds. 
Here's how the cross-agent workflow works: Start an AWS investigation: Ask SRE Agent to create an AWS DevOps Agent investigation for the AWS-side symptoms Investigate Azure in parallel: While the AWS investigation runs (5-8 minutes), SRE Agent uses its native tools to check Azure Monitor, Log Analytics, and resource health Read AWS results: When the investigation completes, SRE Agent reads the journal records and recommendations Correlate findings: SRE Agent combines both sets of findings into a single root cause analysis with remediation steps for both clouds Common cross-cloud scenarios: Azure app calling AWS services: Investigate Azure Function errors that correlate with AWS API failures Hybrid deployments: Check AWS EKS clusters alongside Azure AKS clusters during multi-cloud outages Data pipeline issues: Trace data flow across Azure Event Hubs and AWS Kinesis or SQS Agent-to-agent investigation: Start an AWS DevOps Agent investigation for the AWS side while Azure SRE Agent checks Azure resources in parallel Architecture The integration uses a stdio proxy architecture. SRE Agent spawns the proxy as a child process, and the proxy forwards requests to the AWS MCP endpoint:

Azure SRE Agent
    |
    | stdio (local process)
    v
mcp-proxy-for-aws (spawned via uvx)
    |
    | Authenticated HTTPS requests
    v
AWS MCP Server (aws-mcp.<region>.api.aws)
    |
    |--- Authenticated AWS API calls --> AWS Services
    |      (EC2, S3, CloudWatch, EKS, Lambda, etc.)
    |
    '--- DevOps Agent API calls ------> AWS DevOps Agent
           |-- AgentSpaces (workspaces)
           |-- Investigations (async root cause analysis)
           |-- Recommendations (remediation specs)
           '-- Chat sessions (real-time interaction)

Troubleshooting Authentication and connectivity issues Error Cause Solution 403 Forbidden IAM user lacks MCP permissions Add aws-mcp:InvokeMcp, aws-mcp:CallReadOnlyTool to the IAM policy 401 Unauthorized Invalid or expired AWS credentials Rotate access keys and update the connector environment variables Proxy fails to start uvx not installed or not on PATH Install uv on the SRE Agent host Connection timeout Proxy cannot reach the AWS MCP endpoint Verify outbound HTTPS (port 443) is allowed to aws-mcp.<region>.api.aws Connector added but tools not available MCP connections are initialized at agent startup Redeploy or restart the agent service from the Azure portal Slow first connection uvx downloads ~89 dependencies on first run Wait up to 30 seconds for the initial connection API and permission issues Error Cause Solution AccessDenied on API call IAM user lacks the service-specific permission Add the required IAM action (e.g., ec2:DescribeInstances) to the user's policy CallReadWriteTool denied Write permission not granted Add aws-mcp:CallReadWriteTool to the IAM policy Wrong region data Proxy configured for a different region Update the AWS_REGION metadata in the connector arguments, or specify the region in your query API not found Newly released or unsupported API Use aws___suggest_aws_commands to find the correct API name Verify the connection Test that the proxy can authenticate by opening a new chat session and asking: What AWS regions are available?
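For the "Connection timeout" row specifically, a quick way to confirm outbound reachability from the host is a plain TCP probe on port 443. This snippet is a hypothetical diagnostic, not part of the product, and assumes the aws-mcp.&lt;region&gt;.api.aws endpoint pattern from the architecture section.

```python
# Hypothetical diagnostic: checks that outbound HTTPS (port 443) to the
# regional AWS MCP endpoint is allowed, per the troubleshooting table above.
import socket

def can_reach_mcp_endpoint(region="us-east-1", timeout=3.0):
    host = "aws-mcp.{}.api.aws".format(region)
    try:
        # A successful TCP handshake rules out DNS resolution and firewall issues.
        with socket.create_connection((host, 443), timeout=timeout):
            return True
    except OSError:
        return False
```

A False result points at DNS or egress filtering rather than IAM, which narrows the table above to the "Connection timeout" row.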
Re-authorize the integration If you encounter persistent authentication issues: Navigate to the IAM console Select the user created in Step 1 Navigate to Security credentials > Access keys Deactivate or delete the old access key Create a new access key Update the connector environment variables in the SRE Agent portal with the new credentials Related content AWS MCP Server documentation MCP Proxy for AWS on GitHub AWS MCP Server tools reference AWS DevOps Agent documentation AWS DevOps Agent GA announcement AWS IAM documentation
Get started with Datadog MCP server in Azure SRE Agent
Overview The Datadog MCP server is a cloud-hosted bridge between your Datadog organization and Azure SRE Agent. Once configured, it enables real-time interaction with logs, metrics, APM traces, monitors, incidents, dashboards, and other Datadog data through natural language. All actions respect your existing Datadog RBAC permissions. The server uses Streamable HTTP transport with two custom headers ( DD_API_KEY and DD_APPLICATION_KEY ) for authentication. Azure SRE Agent connects directly to the Datadog-hosted endpoint—no npm packages, local proxies, or container deployments are required. The SRE Agent portal includes a dedicated Datadog MCP server connector type that pre-populates the required header keys for streamlined setup. Key capabilities Area Capabilities Logs Search and analyze logs with SQL-based queries, filter by facets and time ranges Metrics Query metric values, explore available metrics, get metric metadata and tags APM Search spans, fetch complete traces, analyze trace performance, compare traces Monitors Search monitors, validate configurations, inspect monitor groups and templates Incidents Search and get incident details, view timeline and responders Dashboards Search and list dashboards by name or tag Hosts Search hosts by name, tags, or status Services List services and map service dependencies Events Search events including monitor alerts, deployments, and custom events Notebooks Search and retrieve notebooks for investigation documentation RUM Search Real User Monitoring events for frontend observability This is the official Datadog-hosted MCP server (Preview). The server exposes 16+ core tools with additional toolsets available for alerting, APM, Database Monitoring, Error Tracking, feature flags, LLM Observability, networking, security, software delivery, and Synthetic tests. Tool availability depends on your Datadog plan and RBAC permissions. 
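The connection shape described above can be sketched as follows. The regional site domains are assumptions based on Datadog's standard site URLs (confirm against the endpoint table in Step 2), and the helper is illustrative rather than part of the product.

```python
# Sketch: endpoint URL plus the two custom headers the connector sends.
# Site domains below are ASSUMPTIONS; verify against the Step 2 table.
DATADOG_SITES = {
    "US1": "api.datadoghq.com",
    "US3": "api.us3.datadoghq.com",
    "US5": "api.us5.datadoghq.com",
    "EU1": "api.datadoghq.eu",
    "AP1": "api.ap1.datadoghq.com",
    "AP2": "api.ap2.datadoghq.com",
}

def connector_config(region, api_key, app_key):
    """Streamable HTTP endpoint and auth headers for the Datadog MCP server."""
    return {
        "url": "https://{}/api/unstable/mcp-server/mcp".format(DATADOG_SITES[region]),
        "headers": {
            "DD_API_KEY": api_key,          # identifies the organization
            "DD_APPLICATION_KEY": app_key,  # user identity and MCP Read/Write scopes
        },
    }
```

Because both headers travel with every request, rotating either key in the connector settings takes effect immediately without redeploying the agent.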
Prerequisites Azure SRE Agent resource deployed in Azure Datadog organization with an active plan Datadog user account with appropriate RBAC permissions API key: Created from Organization Settings > API Keys Application key: Created from Organization Settings > Application Keys with MCP Read and/or MCP Write permissions Your organization must be allowlisted for the Datadog MCP server Preview Step 1: Create API and Application keys The Datadog MCP server requires two credentials: an API key (identifies your organization) and an Application key (authenticates the user and defines permission scope). Both are created in the Datadog portal. Create an API key Log in to your Datadog organization (use your region-specific URL if applicable—e.g., app.datadoghq.eu for EU1) Select your account avatar in the bottom-left corner of the navigation bar Select Organization Settings In the left sidebar, select API Keys (under the Access section) Direct URL: https://un5my6r2gjytnf5rv7p2eefq.julianrbryant.com/organization-settings/api-keys Select + New Key in the top-right corner Enter a descriptive name (e.g., sre-agent-mcp ) Select Create Key Copy the key value immediately—it is shown only once. If lost, you must create a new key. Tip: API keys are organization-level credentials. Any Datadog Admin or user with the API Keys Write permission can create them. The API key alone does not grant data access—it must be paired with an Application key.
Create an Application key From the same Organization Settings page, select Application Keys in the left sidebar Direct URL: https://un5my6r2gjytnf5rv7p2eefq.julianrbryant.com/organization-settings/application-keys Select + New Key in the top-right corner Enter a descriptive name (e.g., sre-agent-mcp-app ) Select Create Key Copy the key value immediately—it is shown only once Add MCP permissions to the Application key After creating the Application key, you must grant it the MCP-specific scopes: In the Application Keys list, locate the key you just created Select the key name to open its detail panel In the detail panel, find the Scopes section and select Edit Search for MCP in the scopes search box Check MCP Read to enable read access to Datadog data via MCP tools Optionally check MCP Write if your agent needs to create or modify resources (e.g., feature flags, Synthetic tests) Select Save If you don't see the MCP Read or MCP Write scopes, your organization may not be enrolled in the Datadog MCP server preview. Contact your Datadog account representative to request access. Required permissions summary Permission Description Required? MCP Read Read access to Datadog data via MCP tools (logs, metrics, traces, monitors, etc.) Yes MCP Write Write access for mutating operations (creating feature flags, editing Synthetic tests, etc.) Optional For production use, create keys from a service account rather than a personal account. Navigate to Organization Settings > Service Accounts to create one. This ensures the integration continues to work if team members leave the organization. Apply the principle of least privilege—grant only MCP Read unless write operations are needed. Use scoped Application keys to restrict access to only the permissions your agent needs. This limits blast radius if a key is compromised. Step 2: Add the MCP connector Connect the Datadog MCP server to your SRE Agent using the portal. 
The portal includes a dedicated Datadog connector type that pre-populates the required configuration. Determine your regional endpoint Select the endpoint URL that matches your Datadog organization's region: Region Endpoint URL US1 (default) https://un5pcer2gjytnf5rv7p2eefq.julianrbryant.com/api/unstable/mcp-server/mcp US3 https://un5pcer2gg0m6tygzbck2khhexreqn8.julianrbryant.com/api/unstable/mcp-server/mcp US5 https://un5pcer2gg0m6regzbck2khhexreqn8.julianrbryant.com/api/unstable/mcp-server/mcp EU1 https://un5pcer2gjytnf5rv7p2eeb4cym0.julianrbryant.com/api/unstable/mcp-server/mcp AP1 https://un5pcer2gjgr3amfhkkfx80cbue68pde.julianrbryant.com/api/unstable/mcp-server/mcp AP2 https://un5pcer2gjgr3amchkkfx80cbue68pde.julianrbryant.com/api/unstable/mcp-server/mcp Using the Azure portal In Azure portal, navigate to your SRE Agent resource Select Builder > Connectors Select Add connector Select Datadog MCP server and select Next Configure the connector: Field Value Name datadog-mcp Connection type Streamable-HTTP (pre-selected) URL https://un5pcer2gjytnf5rv7p2eefq.julianrbryant.com/api/unstable/mcp-server/mcp (change for non-US1 regions) Authentication Custom headers (pre-selected, disabled) DD_API_KEY Your Datadog API key DD_APPLICATION_KEY Your Datadog Application key Select Next to review Select Add connector The Datadog connector type pre-populates both header keys ( DD_API_KEY and DD_APPLICATION_KEY ) and sets the authentication method to "Custom headers" automatically. The default URL is the US1 endpoint—update it if your organization is in a different region. Once the connector shows Connected status, the Datadog MCP tools are automatically available to your agent. You can verify by checking the tools list in the connector details. Step 3: Create a Datadog subagent (optional) Create a specialized subagent to give the AI focused Datadog observability expertise and better prompt responses. 
Navigate to Builder > Subagents Select Add subagent Paste the following YAML configuration: api_version: azuresre.ai/v1 kind: AgentConfiguration metadata: owner: your-team@contoso.com version: "1.0.0" spec: name: DatadogObservabilityExpert display_name: Datadog Observability Expert system_prompt: | You are a Datadog observability expert with access to logs, metrics, APM traces, monitors, incidents, dashboards, hosts, services, and more via the Datadog MCP server. ## Capabilities ### Logs - Search logs using facets, tags, and time ranges with `search_datadog_logs` - Perform SQL-based log analysis with `analyze_datadog_logs` for aggregations, grouping, and statistical queries - Correlate log entries with traces and metrics ### Metrics - Query metric time series with `get_datadog_metric` - Get metric metadata, tags, and context with `get_datadog_metric_context` - Discover available metrics with `search_datadog_metrics` ### APM (Application Performance Monitoring) - Fetch complete traces with `get_datadog_trace` - Search distributed traces and spans with `search_datadog_spans` - Analyze service-level performance and latency patterns - Map service dependencies with `search_datadog_service_dependencies` ### Monitors & Alerting - Search monitors by name, tag, or status with `search_datadog_monitors` - Investigate triggered monitors and alert history - Correlate monitor alerts with underlying metrics and logs ### Incidents - Search incidents with `search_datadog_incidents` - Get incident details, timeline, and responders with `get_datadog_incident` - Correlate incidents with monitors, logs, and traces ### Infrastructure - Search hosts by name, tag, or status with `search_datadog_hosts` - List and discover services with `search_datadog_services` - Search dashboards with `search_datadog_dashboards` - Search events (monitor alerts, deployments) with `search_datadog_events` ### Notebooks - Search notebooks with `search_datadog_notebooks` - Retrieve notebook content with 
`get_datadog_notebook` ### Real User Monitoring - Search RUM events for frontend performance data with `search_datadog_rum_events` ## Best Practices When investigating incidents: - Start with `search_datadog_incidents` or `get_datadog_incident` for context - Check related monitors with `search_datadog_monitors` - Correlate with `search_datadog_logs` and `get_datadog_metric` for root cause - Use `get_datadog_trace` to inspect request flows for latency issues - Check `search_datadog_hosts` for infrastructure-level problems When analyzing logs: - Use `analyze_datadog_logs` for SQL-based aggregation queries - Use `search_datadog_logs` for individual log retrieval and filtering - Include time ranges to narrow results and reduce response size - Filter by service, host, or status to focus on relevant data When working with metrics: - Use `search_datadog_metrics` to discover available metric names - Use `get_datadog_metric_context` to understand metric tags and metadata - Use `get_datadog_metric` to query actual metric values with time ranges When handling errors: - If access is denied, explain which RBAC permission is needed - Suggest the user verify their Application key has `MCP Read` or `MCP Write` - For large traces that appear truncated, note this is a known limitation mcp_connectors: - datadog-mcp handoffs: [] Select Save The mcp_connectors field references the connector name you created in Step 2. This gives the subagent access to all tools provided by the Datadog MCP server. Step 4: Add a Datadog skill (optional) Skills provide contextual knowledge and best practices that help agents use tools more effectively. Create a Datadog skill to give your agent expertise in log queries, metric analysis, and incident investigation workflows. 
Navigate to Builder > Skills Select Add skill Paste the following skill configuration: api_version: azuresre.ai/v1 kind: SkillConfiguration metadata: owner: your-team@contoso.com version: "1.0.0" spec: name: datadog_observability display_name: Datadog Observability description: | Expertise in Datadog's observability platform including logs, metrics, APM, monitors, incidents, dashboards, hosts, and services. Use for searching logs, querying metrics, investigating incidents, analyzing traces, inspecting monitors, and navigating Datadog data via the Datadog MCP server. instructions: | ## Overview Datadog is a cloud-scale observability platform for logs, metrics, APM traces, monitors, incidents, infrastructure, and more. The Datadog MCP server enables natural language interaction with your organization's Datadog data. **Authentication:** Two custom headers—`DD_API_KEY` (API key) and `DD_APPLICATION_KEY` (Application key with MCP permissions). All actions respect existing RBAC permissions. **Regional endpoints:** The MCP server URL varies by Datadog region (US1, US3, US5, EU1, AP1, AP2). Ensure the connector URL matches your organization's region. ## Searching Logs Use `search_datadog_logs` for individual log retrieval and `analyze_datadog_logs` for SQL-based aggregation queries. 
**Common log search patterns:** ``` # Errors from a specific service service:payment-api status:error # Logs from a host in the last hour host:web-prod-01 # Logs containing a specific trace ID trace_id:abc123def456 # Errors with a specific HTTP status @http.status_code:500 service:api-gateway # Logs from a Kubernetes pod kube_namespace:production kube_deployment:checkout-service ``` **SQL-based log analysis with `analyze_datadog_logs`:** ```sql -- Count errors by service in the last hour SELECT service, count(*) as error_count FROM logs WHERE status = 'error' GROUP BY service ORDER BY error_count DESC -- Average response time by endpoint SELECT @http.url_details.path, avg(@duration) as avg_duration FROM logs WHERE service = 'api-gateway' GROUP BY @http.url_details.path ``` ## Querying Metrics Use `search_datadog_metrics` to discover metrics, `get_datadog_metric_context` for metadata, and `get_datadog_metric` for time series data. **Common metric patterns:** ``` # System metrics system.cpu.user, system.mem.used, system.disk.used # Container metrics docker.cpu.usage, kubernetes.cpu.requests # Application metrics trace.servlet.request.hits, trace.servlet.request.duration # Custom metrics app.payment.processed, app.queue.depth ``` Always specify a time range when querying metrics to avoid retrieving excessive data. ## Investigating Traces Use `get_datadog_trace` for complete trace details and `search_datadog_spans` for span-level queries. **Trace investigation workflow:** 1. Search for slow or errored spans with `search_datadog_spans` 2. Get the full trace with `get_datadog_trace` using the trace ID 3. Identify the bottleneck service or operation 4. Correlate with `search_datadog_logs` using the trace ID 5. Check related metrics with `get_datadog_metric` ## Working with Monitors Use `search_datadog_monitors` to find monitors by name, tag, or status. 
**Common monitor queries:** ``` # Find all triggered monitors Search for monitors with status "Alert" # Find monitors for a specific service Search for monitors tagged with service:payment-api # Find monitors by name Search for monitors matching "CPU" or "memory" ``` ## Incident Investigation Workflow For structured incident investigation: 1. `search_datadog_incidents` — find recent or active incidents 2. `get_datadog_incident` — get full incident details and timeline 3. `search_datadog_monitors` — check which monitors triggered 4. `search_datadog_logs` — search for errors around the incident time 5. `get_datadog_metric` — check key metrics for anomalies 6. `get_datadog_trace` — inspect request traces for latency or errors 7. `search_datadog_hosts` — verify infrastructure health 8. `search_datadog_service_dependencies` — map affected services ## Working with Dashboards and Notebooks - Use `search_datadog_dashboards` to find dashboards by title or tag - Use `search_datadog_notebooks` and `get_datadog_notebook` for investigation notebooks that document past analyses ## Toolsets The Datadog MCP server supports toolsets via the `?toolsets=` query parameter on the endpoint URL. 
Available toolsets: | Toolset | Description | |---------|-------------| | `core` | Logs, metrics, traces, dashboards, monitors, incidents, hosts, services, events, notebooks (default) | | `alerting` | Monitor validation, groups, and templates | | `apm` | Trace analysis, span search, Watchdog insights, performance investigation | | `dbm` | Database Monitoring query plans and samples | | `error-tracking` | Error Tracking issues across RUM, Logs, and Traces | | `feature-flags` | Creating, listing, and updating feature flags | | `llmobs` | LLM Observability spans | | `networks` | Cloud Network Monitoring, Network Device Monitoring | | `onboarding` | Guided Datadog setup and configuration | | `security` | Code security scanning, security signals, findings | | `software-delivery` | CI Visibility, Test Optimization | | `synthetics` | Synthetic test management | To enable additional toolsets, append `?toolsets=core,apm,alerting` to the connector URL. ## Troubleshooting | Issue | Solution | |-------|----------| | 401/403 errors | Verify API key and Application key are correct and active | | No data returned | Check that Application key has `MCP Read` permission | | Wrong region | Ensure the connector URL matches your Datadog organization's region | | Truncated traces | Large traces may be truncated; this is a known limitation | | Tool not found | The tool may require a non-default toolset; update the connector URL | | Write operations fail | Verify Application key has `MCP Write` permission | mcp_connectors: - datadog-mcp Select Save Reference the skill in your subagent Update your subagent configuration to include the skill: spec: name: DatadogObservabilityExpert skills: - datadog_observability mcp_connectors: - datadog-mcp Step 5: Test the integration Open a new chat session with your SRE Agent Try these example prompts: Log analysis Search for error logs from the payment-api service in the last hour Analyze logs to count errors by service over the last 24 hours Find all 
logs with HTTP 500 status from the api-gateway in the last 30 minutes Show me the most recent logs from host web-prod-01 Metrics investigation What is the current CPU usage across all production hosts? Show me the request rate and error rate for the checkout-service over the last 4 hours What metrics are available for the payment-api service? Get the p99 latency for the api-gateway service in the last hour APM and trace analysis Find the slowest traces for the checkout-service in the last hour Get the full trace details for trace ID abc123def456 What services depend on the payment-api? Search for errored spans in the api-gateway service from the last 30 minutes Monitor and alerting workflows Show me all monitors currently in Alert status Find monitors related to the database-primary host What monitors are tagged with team:platform? Search for monitors matching "disk space" or "memory" Incident investigation Show me all active incidents from the last 24 hours Get details for incident INC-12345 including the timeline What monitors triggered during the last production incident? Correlate the most recent incident with related logs and metrics Infrastructure and dashboards Search for hosts tagged with env:production and team:platform List all dashboards related to "Kubernetes" or "EKS" What services are running in the production environment? Show me recent deployment events for the checkout-service Available tools Core toolset (default) The core toolset is included by default and provides essential observability tools. 
Tool Description search_datadog_logs Search logs by facets, tags, and time ranges analyze_datadog_logs SQL-based log analysis for aggregations and statistical queries get_datadog_metric Query metric time series with rollup and aggregation get_datadog_metric_context Get metric metadata, tags, and related context search_datadog_metrics List and discover available metrics get_datadog_trace Fetch a complete distributed trace by trace ID search_datadog_spans Search APM spans by service, operation, or tags search_datadog_monitors Search monitors by name, tag, or status get_datadog_incident Get incident details including timeline and responders search_datadog_incidents List and search incidents search_datadog_dashboards Search dashboards by title or tag search_datadog_hosts Search hosts by name, tag, or status search_datadog_services List and search services search_datadog_service_dependencies Map service dependency relationships search_datadog_events Search events (monitor alerts, deployments, custom events) get_datadog_notebook Retrieve notebook content by ID search_datadog_notebooks Search notebooks by title or tag search_datadog_rum_events Search Real User Monitoring events Alerting toolset Enable with ?toolsets=core,alerting on the connector URL. Tool Description validate_datadog_monitor Validate monitor configuration before creation get_datadog_monitor_templates Get monitor configuration templates search_datadog_monitor_groups Search monitor groups and their statuses APM toolset Enable with ?toolsets=core,apm on the connector URL. Tool Description apm_search_spans Advanced span search with APM-specific filters apm_explore_trace Interactive trace exploration and analysis apm_trace_summary Get a summary analysis of a trace apm_trace_comparison Compare two traces side by side apm_analyze_trace_metrics Analyze aggregated trace metrics and trends Database Monitoring toolset Enable with ?toolsets=core,dbm on the connector URL. 
Tool Description search_datadog_dbm_plans Search database query execution plans search_datadog_dbm_samples Search database query samples and statistics Error Tracking toolset Enable with ?toolsets=core,error-tracking on the connector URL. Tool Description search_datadog_error_tracking_issues Search error tracking issues across RUM, Logs, and Traces get_datadog_error_tracking_issue Get details of a specific error tracking issue Feature Flags toolset Enable with ?toolsets=core,feature-flags on the connector URL. Tool Description list_datadog_feature_flags List feature flags create_datadog_feature_flag Create a new feature flag update_datadog_feature_flag_environment Update feature flag settings for an environment LLM Observability toolset Enable with ?toolsets=core,llmobs on the connector URL. Tool Description LLM Observability spans Query and analyze LLM Observability span data Networks toolset Enable with ?toolsets=core,networks on the connector URL. Tool Description Cloud Network Monitoring tools Analyze cloud network traffic and dependencies Network Device Monitoring tools Monitor and troubleshoot network devices Security toolset Enable with ?toolsets=core,security on the connector URL. Tool Description datadog_code_security_scan Run code security scanning datadog_sast_scan Run Static Application Security Testing datadog_secrets_scan Scan for secrets and credentials in code Software Delivery toolset Enable with ?toolsets=core,software-delivery on the connector URL. Tool Description search_datadog_ci_pipeline_events Search CI pipeline execution events get_datadog_flaky_tests Identify flaky tests in CI pipelines Synthetics toolset Enable with ?toolsets=core,synthetics on the connector URL. Tool Description get_synthetics_tests List and get Synthetic test configurations edit_synthetics_tests Edit Synthetic test settings synthetics_test_wizard Guided wizard for creating Synthetic tests Toolsets The Datadog MCP server organizes tools into toolsets. 
By default, only the core toolset is enabled. To enable additional toolsets, append the ?toolsets= query parameter to the connector URL. Syntax https://un5pcer2gjytnf5rv7p2eefq.julianrbryant.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting Examples Use case URL suffix Default (core only) No suffix needed Core + APM analysis ?toolsets=core,apm Core + Alerting + APM ?toolsets=core,alerting,apm Core + Database Monitoring ?toolsets=core,dbm Core + Security scanning ?toolsets=core,security Core + CI/CD visibility ?toolsets=core,software-delivery All toolsets ?toolsets=core,alerting,apm,dbm,error-tracking,feature-flags,llmobs,networks,onboarding,security,software-delivery,synthetics [!TIP] Only enable the toolsets you need. Each additional toolset increases the number of tools exposed to the agent, which can increase token usage and may impact response quality. Start with core and add toolsets as needed. Updating the connector URL To add toolsets after initial setup: Navigate to Builder > Connectors Select the datadog-mcp connector Update the URL field to include the ?toolsets= parameter Select Save Troubleshooting Authentication issues Error Cause Solution 401 Unauthorized Invalid API key or Application key Verify both keys are correct and active in Organization Settings 403 Forbidden Missing RBAC permissions Ensure the Application key has MCP Read and/or MCP Write permissions Connection refused Wrong regional endpoint Verify the connector URL matches your Datadog organization's region "Organization not allowlisted" Preview access not granted Contact Datadog support to request MCP server Preview access Data and permission issues Error Cause Solution No data returned Insufficient permissions or wrong time range Verify Application key permissions; try a broader time range Tool not found Tool belongs to a non-default toolset Add the required toolset to the ?toolsets= parameter in the connector URL Truncated trace data Trace exceeds size limit Large traces are 
truncated for context window efficiency; query specific spans instead Write operation failed Missing MCP Write permission Add MCP Write permission to the Application key Metric not found Wrong metric name or no data in time range Use search_datadog_metrics to discover available metric names Verify the connection Test the server endpoint directly: curl -I "https://un5pcer2gjytnf5rv7p2eefq.julianrbryant.com/api/unstable/mcp-server/mcp" \ -H "DD_API_KEY: <your_api_key>" \ -H "DD_APPLICATION_KEY: <your_application_key>" Expected response: 200 OK confirms authentication is working. Re-authorize the integration If you encounter persistent issues: Navigate to Organization Settings > Application Keys in Datadog Revoke the existing Application key Create a new Application key with the required MCP Read / MCP Write permissions Update the connector in the SRE Agent portal with the new key Limitations Limitation Details Preview only The Datadog MCP server is in Preview and not recommended for production use Allowlisted organizations Only organizations that have been allowlisted by Datadog can access the MCP server Large trace truncation Responses are optimized for LLM context windows; large traces may be truncated Unstable API path The endpoint URL contains /unstable/ indicating the API may change without notice Toolset availability Some toolsets may not be available depending on your Datadog plan and features enabled Regional endpoints You must use the endpoint matching your organization's region; cross-region queries are not supported Security considerations How permissions work RBAC-scoped: All actions respect the RBAC permissions associated with the API and Application keys Key-based: Access is controlled through API key (organization-level) and Application key (user or service account-level) Permission granularity: MCP Read enables read operations; MCP Write enables mutating operations Admin controls Datadog administrators can: - Create and revoke API and Application keys 
in Organization Settings - Assign granular RBAC permissions ( MCP Read , MCP Write ) to Application keys - Use service accounts to decouple access from individual user accounts - Monitor MCP tool usage through the Datadog Audit Trail - Scope Application keys to limit the blast radius of compromised credentials The Datadog MCP server can read sensitive operational data including logs, metrics, and traces. Use service accounts with scoped Application keys, grant only the permissions your agent needs, and monitor the Audit Trail for unusual activity. Related content Datadog MCP Server documentation Datadog API and Application keys Datadog RBAC permissions Datadog Audit Trail Datadog regional sites MCP integration overview Build a custom subagent

Shared Agent Context: How We Are Tackling Partner Agent Collaboration
Your Azure SRE agent detects a spike in error rates. It triages with cloud-native telemetry, but the root cause trail leads into a third-party observability platform your team also runs. The agent can't see that data. A second agent can, one that speaks Datadog or Dynatrace or whatever your team chose. The two agents talk to each other using protocols like MCP or directly via an API endpoint and come up with a remediation. The harder question is what happens to the conversation afterward. TL;DR Two AI agents collaborate on incidents using two communication paths: a direct real-time channel (MCP) for fast investigation, and a shared memory layer that writes to systems your team already uses, like PagerDuty, GitHub Issues, or ServiceNow. No new tools to adopt. No ephemeral conversations that vanish when the incident closes. The problem Most operational AI agents work in isolation. Your cloud monitoring agent doesn't have access to your third-party observability stack. Your Datadog specialist doesn't know what your Azure resource topology looks like. When an incident spans both, a human has to bridge the gap manually. At 2 AM. With half the context missing. And even when two agents do exchange information directly, the conversation is ephemeral. The investigation ends, the findings disappear. The next on-call engineer sees a resolved alert with no record of what was tried, what was found, or why the remediation worked. The next agent that hits the same pattern starts over from scratch. What we needed was somewhere for both agents to persist their findings, somewhere humans could see it too. And we really didn't want to force teams onto a new system just to get there. Two communication paths Direct agent-to-agent (real-time) During an active investigation, the primary agent calls the partner agent directly. The partner runs whatever domain-specific analysis it's good at (log searches, span analysis, custom metric queries) and returns findings in real time. 
This is the fast path. The direct channel uses MCP, so any partner agent can plug in without custom integration work. The primary agent doesn't need to understand the internals of Datadog or Dynatrace. It asks questions, gets answers. Shared memory (durable) After the direct exchange, both agents write their actions and findings to external systems that humans already use. This is the durable path, the one that creates audit trails and makes handoffs work. The shared memory backends are systems your team already has open during an incident: Backend What gets written Good fit for Incident platform (e.g., PagerDuty) Timeline notes, on-call handoff context Teams with alerting-centric workflows Issue tracker (e.g., GitHub Issues) Code-level findings, root cause analysis, action comments Teams with dev workflow integration ITSM system (e.g., ServiceNow) Work notes, ITSM-compliant audit trail Enterprise IT, regulated industries The important thing: this doesn't require a new system. Agents write to whatever your team already uses. How it works Step Actor What happens Path 1 Alert source Monitoring fires an alert — 2 Primary agent Receives alert, triages, starts investigating with native tools Internal 3 Primary agent Calls partner agent for domain-specific analysis (third-party logs, spans) Direct via MCP or API 4 Partner agent Runs analysis, returns findings in real time Direct via MCP or API 5 Primary agent Correlates partner findings with native data, runs remediation Internal 6 Both agents Write findings, actions, and resolution to external systems Shared memory via existing sources 7 Agent or human Verifies resolution, closes incident Shared memory via existing sources Steps 3 through 5 happen in real time over the direct channel. Nothing gets written to shared memory until the investigation has actual results. The investigation is fast; the record-keeping is thorough. 
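The sequence above can be sketched in a few lines of code. This is an illustrative stub, not the actual agent implementation: the classes and strings are hypothetical stand-ins, but the ordering matters and matches the table, with direct calls during the investigation and shared-memory writes only afterward.

```python
# Illustrative sketch of the two-path flow: real-time direct calls while
# investigating, then a single batch of shared-memory writes at the end.

class PartnerAgent:
    def analyze(self, question: str) -> str:
        # Domain-specific deep-dive (e.g., third-party log or span search).
        return f"findings for: {question}"

def handle_alert(alert: str, partner: PartnerAgent, shared_memory: list) -> str:
    # Steps 2-3: triage natively, then call the partner over the direct channel.
    native = f"native triage of {alert}"
    enrichment = partner.analyze(f"deep-dive on {alert}")  # real time, nothing persisted yet
    # Step 5: correlate partner findings with native data and remediate.
    resolution = f"remediated {alert} using {enrichment}"
    # Step 6: only now do both sides write to shared memory (append-only).
    shared_memory.append({"actor": "primary", "entry": native})
    shared_memory.append({"actor": "partner", "entry": enrichment})
    shared_memory.append({"actor": "primary", "entry": resolution})
    return resolution

memory: list = []
result = handle_alert("high-error-rate", PartnerAgent(), memory)
```

Note that `shared_memory` stays empty until the investigation has actual results, which is exactly why write latency to external systems never slows the fast path.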
Who does what In this system the primary agent owns the full incident lifecycle: detection, triage, investigation, remediation, closure. The partner agent gets called when the primary agent needs to see into a part of the stack it can't access natively. It does the specialized deep-dive, returns what it found, and the primary agent takes it from there. Both agents write to shared memory and the primary agent acts on the proposed next steps. Primary agent Partner agent Communication Calls partner directly; writes to shared memory after Responds to calls; writes enrichment to shared memory Scope Full lifecycle Domain-specific deep-dive Tools Cloud-native monitoring, CLI, runbooks, issue trackers Third-party observability APIs Typical share ~80% of investigation + all remediation ~20%, specialized enrichment Why shared context should live where humans already work If your agent writes its findings to a system nobody checks, you've built a very expensive diary. Write them to a GitHub Issue, a ServiceNow ticket, a Jira epic, or whatever your team actually monitors, and the dynamics change: humans can participate without changing their workflow. Your team already watches these systems. When an agent posts its reasoning and pending decisions to a place engineers already check, anyone can review or correct it using the tools they know. Comments, reactions, status updates. No custom approval UI. The collaboration features built into your workflow tool become the oversight mechanism for free. That persistence pays off in a second way. Every entry the agent writes is a record that future runs can search. Instead of context that disappears when a conversation ends, you accumulate operational history. How was this incident type handled last time? What did the agent try? What did the human override? That history is retrievable by both people and agents through the same interface, without spinning up a separate vector database. 
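One way to picture the durable layer is as an append-only store with pluggable backends. The sketch below is a simplification under our own naming, not the product's internals: the "backends" are plain callables standing in for PagerDuty, GitHub Issues, or ServiceNow connectors, so swapping one is a configuration change rather than a code change.

```python
# Sketch of a backend-agnostic, append-only shared-context store.

class SharedContext:
    def __init__(self, backends: dict):
        self.backends = backends   # name -> write callable (connector stand-ins)
        self.history = []          # append-only local record

    def append(self, actor: str, note: str) -> None:
        entry = {"actor": actor, "note": note}
        self.history.append(entry)           # never overwritten or deleted
        for write in self.backends.values():
            write(entry)                     # mirror to every configured system

    def search(self, term: str) -> list:
        # Future runs (human or agent) retrieve prior investigations here.
        return [e for e in self.history if term in e["note"]]

github_issue: list = []  # stands in for comments on a GitHub Issue
ctx = SharedContext({"github": github_issue.append})
ctx.append("primary-agent", "root cause: connection pool exhaustion")
ctx.append("partner-agent", "third-party spans show p99 latency spike")
past = ctx.search("connection pool")
```

Because writes are additive and mirrored into systems humans already watch, the same entries serve as both the audit trail and the retrievable operational history.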
You could build a dedicated agent database for all this. But nobody will look at it. Teams already have notifications, permissions, and audit trails configured in their existing tools. A purpose-built system means a new UI to learn, new permissions to manage, and one more thing competing for attention. Store context where people already look and you skip all of that. The best agent memory is the one your team is already reading. Design principles A few opinions that came out of watching real incidents: Investigate first, persist second. The primary agent calls the partner directly for real-time analysis. Both agents write to shared memory only after findings are collected. Investigation speed should never be bottlenecked by writes to external systems. Humans see everything through shared context. The direct path is agent-to-agent only, but the shared context layer is where humans can see the full picture and step in. Agents don't bypass human visibility. Append-only. Both agents' writes are additive. No overwrites, no deletions. You can always reconstruct the full history of an investigation. Backend-agnostic. Swapping PagerDuty for ServiceNow, or adding GitHub Issues alongside either one, is a connector config change. What this actually gets you The practical upside is pretty simple: investigations aren't waiting on writes to external systems, nothing is lost when the conversation ends, and the next on-call engineer picks up where the last one left off instead of starting over. Every action from both agents shows up in the systems humans already look at. Adding a new partner agent or a new shared memory backend is a connector change. The architecture doesn't care which specific tools your team chose. The fast path is for investigation. The durable path is for everything else.

HTTP Triggers in Azure SRE Agent: From Jira Ticket to Automated Investigation
Introduction Many teams run their observability, incident management, ticketing, and deployment on platforms outside of Azure—Jira, Opsgenie, Grafana, Zendesk, GitLab, Jenkins, Harness, or homegrown internal tools. These are the systems where alerts fire, tickets get filed, deployments happen, and operational decisions are made every day. HTTP Triggers make it easy to connect any of them to Azure SRE Agent—turning events from any platform into automated agent actions with a simple HTTP POST. No manual copy-paste, no context-switching, no delay between detection and response. In this blog, we'll demonstrate by connecting Jira to SRE Agent—so that every new incident ticket automatically triggers an investigation, and the agent posts its findings back to the Jira ticket when it's done. The Scenario: Jira Incident → Automated Investigation Your team manages production applications backed by Azure PostgreSQL Flexible Server. You use Jira for incident tracking. Today, when a P1 or P2 incident is filed, your on-call engineer has to manually triage—reading through the ticket, checking dashboards, querying logs, correlating recent deployments—before they can even begin working on a fix. Some teams have Jira automations that route or label tickets, but the actual investigation still starts with a human. HTTP Triggers let you bring SRE Agent directly into that existing workflow. Instead of adding another tool for engineers to check, the agent meets them where they already work. Jira ticket created → SRE Agent automatically investigates → Agent writes findings back to Jira The on-call engineer opens the Jira ticket and the investigation is already there—root cause analysis, evidence from logs and metrics, and recommended next steps—posted as a comment by the agent. Here's how to set this up. 
Architecture Overview Here's the end-to-end flow we'll build: Jira — A new issue is created in your project Logic App — The Jira connector detects the new issue, and the Logic App calls the SRE Agent HTTP Trigger, using Managed Identity for authentication HTTP Trigger — The agent prompt is rendered with the Jira ticket details (key, summary, priority, etc.) via payload placeholders Agent Investigation — The agent uses Jira MCP tools to read the ticket and search related issues, queries Azure logs, metrics, and recent deployments, then posts its findings back to the Jira ticket as a comment How HTTP Triggers Work Every HTTP Trigger you create in Azure SRE Agent exposes a unique webhook URL: https://<your-agent>.<instance>.azuresre.ai/api/v1/httptriggers/trigger/<trigger-id> When an external system sends a POST request to this URL with a JSON payload, the SRE Agent: Validates the trigger exists and is enabled Renders your agent prompt by injecting payload values into {payload.X} placeholders Creates a new investigation thread (or reuses an existing one) Executes the agent with the rendered prompt—autonomously or in review mode Records the execution in the trigger's history for auditing Payload Placeholders The real power of HTTP Triggers is in payload placeholders. When you configure a trigger, you write an agent prompt with {payload.X} tokens that get replaced at runtime with values from the incoming JSON. For example, a prompt like: Investigate Jira incident {payload.key}: {payload.summary} (Priority: {payload.priority}) Gets rendered with actual incident data before the agent sees it, giving it immediate context to begin investigating. If your prompt doesn't use any placeholders, the raw JSON payload is automatically appended to the prompt, so the agent always has access to the full context regardless. 
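The rendering behavior described above is straightforward to re-implement for testing your own templates. The sketch below is our own illustrative code, not the agent's actual implementation: it substitutes `{payload.X}` tokens from the incoming JSON and falls back to appending the raw payload when the template has no placeholders.

```python
import json
import re

# Illustrative re-implementation of the prompt-rendering behavior:
# inject payload values into {payload.X} tokens; if the template uses no
# placeholders, append the raw JSON so the agent still has full context.

def render_prompt(template: str, payload: dict) -> str:
    tokens = re.findall(r"\{payload\.([A-Za-z0-9_]+)\}", template)
    if not tokens:
        # No placeholders: hand the agent the whole payload for context.
        return template + "\n\nPayload:\n" + json.dumps(payload, indent=2)
    rendered = template
    for name in tokens:
        rendered = rendered.replace("{payload." + name + "}", str(payload.get(name, "")))
    return rendered

prompt = render_prompt(
    "Investigate Jira incident {payload.key}: {payload.summary} (Priority: {payload.priority})",
    {"key": "OPS-142", "summary": "DB connection errors", "priority": "P1"},
)
```

Running this against a sample Jira payload is a quick way to sanity-check a trigger prompt before wiring up the real webhook.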
## Thread Modes

HTTP Triggers support two thread modes:

- **New Thread** (recommended for incidents): Every trigger invocation creates a fresh investigation thread, giving each incident its own isolated workspace
- **Same Thread**: All invocations share a single thread, building up a continuous conversation—useful for accumulating alerts from a single source

## Authenticating External Platforms

The HTTP Trigger endpoint is secured with Azure AD authentication, ensuring only authorized callers can create agent investigation threads. Every request requires a valid bearer token scoped to the SRE Agent's data plane.

External platforms like Jira send standard HTTP webhooks and don't natively acquire Azure AD tokens. To bridge this, you can use any Azure service that supports Managed Identity as an intermediary—this approach means zero secrets to store or rotate in the external platform. Common options include:

| Approach | Best For |
| --- | --- |
| Azure Logic Apps | Native connectors for many platforms, no code required, visual workflow designer |
| Azure Functions | Simple relay with ~15 lines of code, clean URL for any webhook source |
| API Management (APIM) | Enterprise environments needing rate limiting, IP filtering, or API key management |

All three support Managed Identity and can transparently acquire the Azure AD token before forwarding requests to the SRE Agent HTTP Trigger. In this walkthrough, we'll use Azure Logic Apps with the built-in Jira connector.

## Step-by-Step: Connecting Jira to SRE Agent

### Prerequisites

- An Azure SRE Agent resource deployed in your subscription
- A Jira Cloud project with API token access
- An Azure subscription for the Logic App

### Step 1: Set Up the Jira MCP Connector

First, let's give the SRE Agent the ability to interact with Jira directly.
In your agent's MCP Tool settings, add the Jira connector:

| Setting | Value |
| --- | --- |
| Package | `mcp-atlassian` (npm, version 2.0.0) |
| Transport | STDIO |

Configure these environment variables:

| Variable | Value |
| --- | --- |
| `ATLASSIAN_BASE_URL` | https://un5hhzzj7vgd6wtq8kvc69m1cr.julianrbryant.com |
| `ATLASSIAN_EMAIL` | Your Jira account email |
| `ATLASSIAN_API_TOKEN` | Your Jira API token |

Once the connector is added, select the specific MCP tools you want the agent to use. The connector provides 18 Jira tools out of the 80 available. For our incident investigation workflow, the key tools include:

- `jira-mcp_read_jira_issue` — Read details from a Jira issue by issue key
- `jira-mcp_search_jira_issues` — Search for Jira issues using JQL (Jira Query Language)
- `jira-mcp_add_jira_comment` — Add a comment to a Jira issue (post investigation findings back)
- `jira-mcp_list_jira_projects` — List available Jira projects
- `jira-mcp_create_jira_issue` — Create a new Jira issue

This gives the SRE Agent bidirectional access to Jira—it can read ticket details, fetch comments, query related issues, and post investigation findings back as comments on the original ticket. This closes the loop so your on-call engineers see the agent's analysis directly in Jira without switching tools.

### Step 2: Create the HTTP Trigger

Navigate to **Builder → HTTP Triggers** in the SRE Agent UI and click **Create**.
| Setting | Value |
| --- | --- |
| Name | `jira-incident-handler` |
| Agent Mode | Autonomous |
| Thread Mode | New Thread (one investigation per incident) |
| Sub-Agent (optional) | Select a specialized incident response agent |

**Agent Prompt:**

```
A new Jira incident has been filed that requires investigation:

Jira Ticket: {payload.key}
Summary: {payload.summary}
Priority: {payload.priority}
Reporter: {payload.reporter}
Description: {payload.description}
Jira URL: {payload.ticketUrl}

Investigate this incident by:
1. Identifying the affected Azure resources mentioned in the description
2. Querying recent metrics and logs for anomalies
3. Checking for recent deployments or configuration changes
4. Providing a structured analysis with Root Cause, Evidence, and Recommended Actions

Once your investigation is complete, use the Jira MCP tools to post a summary
of your findings as a comment on the original ticket ({payload.key}).
```

After saving, enable the trigger and open the trigger detail view. Copy the **Trigger URL**—you'll need it for the Logic App.

### Step 3: Create the Azure Logic App

In the Azure Portal, create a new Logic App:

| Setting | Value |
| --- | --- |
| Type | Consumption (Multi-tenant, Stateful) |
| Name | `jira-sre-agent-bridge` |
| Region | Same region as your SRE Agent (e.g., East US 2) |
| Resource Group | Same resource group as your SRE Agent (recommended for simplicity) |

### Step 4: Enable Managed Identity

In the Logic App → **Identity** → **System assigned**:

1. Set **Status** to **On**
2. Click **Save**

### Step 5: Assign the SRE Agent Admin Role

Navigate to your SRE Agent resource → **Access control (IAM)** → **Add role assignment**:

| Setting | Value |
| --- | --- |
| Role | SRE Agent Admin |
| Assign to | Managed Identity → select your Logic App |

This grants the Logic App's Managed Identity the data-plane permissions needed to invoke HTTP Triggers.

> **Important:** The Contributor role alone is not sufficient. Contributor covers the Azure control plane, but SRE Agent uses a separate data plane with its own RBAC. The SRE Agent Admin role provides the required data-plane permissions.
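At this point you can sanity-check the trigger by POSTing a payload yourself (with a bearer token for the SRE Agent data plane), before wiring up the Logic App. A payload shaped like the following fully populates the prompt from Step 2. The field names match the prompt's placeholders; the values are illustrative, and the URL is a placeholder:

```json
{
  "key": "KAN-16",
  "summary": "Elevated API response times on Listings Service",
  "priority": "High",
  "reporter": "Vineela Suri",
  "description": "P2: end users experience slow or unresponsive listing pages.",
  "ticketUrl": "https://<your-jira-site>/browse/KAN-16"
}
```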
### Step 6: Create the Jira Connection

Open the Logic App designer. When adding the Jira trigger, it will prompt you to create a connection:

| Setting | Value |
| --- | --- |
| Connection name | `jira-connection` |
| Jira instance | https://un5hhzzj7vgd6wtq8kvc69m1cr.julianrbryant.com |
| Email | Your Jira email |
| API Token | Your Jira API token |

### Step 7: Configure the Logic App Workflow

Switch to the Logic App **Code view** and paste this workflow definition:

```json
{
  "definition": {
    "$schema": "https://un5m3fhw8z5h0qdu3c1dm9geqrc9hn8.julianrbryant.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#",
    "contentVersion": "1.0.0.0",
    "triggers": {
      "When_a_new_issue_is_created_(V2)": {
        "recurrence": { "interval": 3, "frequency": "Minute" },
        "splitOn": "@triggerBody()",
        "type": "ApiConnection",
        "inputs": {
          "host": {
            "connection": { "name": "@parameters('$connections')['jira']['connectionId']" }
          },
          "method": "get",
          "path": "/v2/new_issue_trigger/search",
          "queries": {
            "X-Request-Jirainstance": "https://uhqpmzkhqzvq3axup6vru3b4883kr829vcgha.julianrbryant.com",
            "projectKey": "YOUR_PROJECT_ID"
          }
        }
      }
    },
    "actions": {
      "Call_SRE_Agent_HTTP_Trigger": {
        "runAfter": {},
        "type": "Http",
        "inputs": {
          "uri": "https://uhqpmzkhqzvq3axvmewtr1jy1e543fk5jk214gg.julianrbryant.com/api/v1/httptriggers/trigger/YOUR-TRIGGER-ID",
          "method": "POST",
          "headers": { "Content-Type": "application/json" },
          "body": {
            "key": "@{triggerBody()?['key']}",
            "summary": "@{triggerBody()?['fields']?['summary']}",
            "priority": "@{triggerBody()?['fields']?['priority']?['name']}",
            "reporter": "@{triggerBody()?['fields']?['reporter']?['displayName']}",
            "description": "@{triggerBody()?['fields']?['description']}",
            "ticketUrl": "@{concat('https://uhqpmzkhqzvq3axup6vru3b4883kr829vcgha.julianrbryant.com/browse/', triggerBody()?['key'])}"
          },
          "authentication": {
            "type": "ManagedServiceIdentity",
            "audience": "https://un5mzz58vj2d6fpk.julianrbryant.com"
          }
        }
      }
    },
    "outputs": {},
    "parameters": {
      "$connections": { "type": "Object", "defaultValue": {} }
    }
  },
  "parameters": {
    "$connections": {
      "type": "Object",
      "value": {
        "jira": {
          "id": "/subscriptions/YOUR-SUB/providers/Microsoft.Web/locations/YOUR-REGION/managedApis/jira",
          "connectionId": "/subscriptions/YOUR-SUB/resourceGroups/YOUR-RG/providers/Microsoft.Web/connections/jira",
          "connectionName": "jira"
        }
      }
    }
  }
}
```

Replace the `YOUR-*` placeholders with your actual values. To find your Jira project ID, navigate to `https://un5hhzzj7vgd6wtq8kvc69m1cr.julianrbryant.com/rest/api/3/project/YOUR-PROJECT-KEY` in your browser and find the `"id"` field in the JSON response.

The critical piece is the `authentication` block:

```json
"authentication": {
  "type": "ManagedServiceIdentity",
  "audience": "https://un5mzz58vj2d6fpk.julianrbryant.com"
}
```

This tells the Logic App to automatically acquire an Azure AD token for the SRE Agent data plane and attach it as a Bearer token. No secrets, no expiration management, no manual token refresh.

After pasting the JSON and clicking **Save**, switch back to the **Designer** view. The Logic App automatically generates the visual workflow from the code—you'll see the Jira trigger ("When a new issue is created (V2)") connected to the HTTP action ("Call SRE Agent HTTP Trigger") as a two-step flow, with all the field mappings and authentication settings already configured.

## What Happens Inside the Agent

When the HTTP Trigger fires, the SRE Agent receives a fully contextualized prompt with all the Jira incident data injected:

```
A new Jira incident has been filed that requires investigation:

Jira Ticket: KAN-16
Summary: Elevated API Response Times — PostgreSQL Table Lock Causing Request Blocking on Listings Service
Priority: High
Reporter: Vineela Suri
Description: Severity: P2 — High. Affected Service: Production API (octopets-prod-postgres). Impact: End users experience slow or unresponsive listing pages.
Jira URL: https://un5hhzzj7vgd6wtq8kvc69m1cr.julianrbryant.com/browse/KAN-16

Investigate this incident by:
1. Identifying the affected Azure resources mentioned in the description
2. Querying recent metrics and logs for anomalies
...
```

The agent then uses its configured tools to investigate—Azure CLI to query metrics, Kusto to analyze logs, and the Jira MCP connector to read the ticket for additional context. Once the investigation is complete, the agent posts its findings as a comment directly on the Jira ticket, closing the loop without any manual copy-paste.

Each execution is recorded in the trigger's history with timestamp, thread ID, success status, duration, and an AI-generated summary—giving you full observability into your automated investigation pipeline.

## Extending to Other Platforms

The pattern we built here works for any external platform that isn't natively supported by SRE Agent. The core architecture stays the same:

> External Platform → Auth Bridge (Managed Identity) → SRE Agent HTTP Trigger

You only need to swap the inbound side of the bridge. For example:

| External Platform | Auth Bridge Configuration |
| --- | --- |
| Jira | Logic App with Jira V2 connector (polling) |
| OpsGenie | Logic App with OpsGenie connector, or Azure Function relay receiving OpsGenie webhooks |
| Datadog | Azure Function relay or APIM policy receiving Datadog webhook notifications |
| Grafana | Azure Function relay or APIM policy receiving Grafana alert webhooks |
| Splunk | APIM with webhook endpoint and Managed Identity forwarding |
| Custom / Internal tools | Logic App HTTP trigger, Azure Function relay, or APIM—any service that supports Managed Identity |

The SRE Agent HTTP Trigger and the Managed Identity authentication remain the same regardless of the source platform. You configure the trigger once, set up the auth bridge, and connect as many external sources as needed. Each trigger can have its own tailored prompt, sub-agent, and thread mode optimized for the type of incoming event.
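For platforms without a Logic Apps connector, the Azure Function relay option mentioned above can be sketched as follows. This is an illustrative sketch, not official code: the inbound field names (`id`, `title`, `priority`, `link`), the `TRIGGER_URL` value, and the helper names are all assumptions for a Datadog-style webhook, and the Managed Identity call only works inside Azure, so it is shown in comments.

```python
import json

# Assumed placeholders: replace with your trigger's real URL from Step 2.
TRIGGER_URL = "https://<your-agent>.<instance>.azuresre.ai/api/v1/httptriggers/trigger/<trigger-id>"
AUDIENCE = "https://un5mzz58vj2d6fpk.julianrbryant.com"  # SRE Agent data-plane audience (see Step 7)

def to_trigger_payload(event: dict) -> dict:
    """Map an inbound Datadog-style webhook body onto the fields the agent
    prompt expects. The source field names are assumptions for illustration;
    adjust them to your platform's actual webhook schema."""
    return {
        "key": str(event.get("id", "unknown")),
        "summary": event.get("title", "(no title)"),
        "priority": event.get("priority", "P3"),
        "ticketUrl": event.get("link", ""),
    }

def build_forward_request(payload: dict, token: str):
    """Assemble the authenticated POST for the SRE Agent HTTP Trigger."""
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    return TRIGGER_URL, headers, json.dumps(payload)

# Inside an Azure Function handler, the relay would roughly be:
#
#   from azure.identity import ManagedIdentityCredential
#   token = ManagedIdentityCredential().get_token(AUDIENCE + "/.default").token
#   url, headers, body = build_forward_request(to_trigger_payload(req.get_json()), token)
#   # ...POST body to url with headers, return 202 to the webhook source...
```

The two pure helpers keep the mapping and request assembly testable outside Azure; only the token acquisition and the HTTP call depend on the Function runtime.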
## Key Takeaways

HTTP Triggers extend Azure SRE Agent's reach to any external platform:

- **Connect What You Use:** If your incident platform isn't natively supported, HTTP Triggers provide the integration point—no code changes to SRE Agent required
- **Secure by Design:** Azure AD authentication with Managed Identity keeps the data plane protected while making integration straightforward through standard Azure services
- **Bidirectional with MCP:** Combine HTTP Triggers (inbound) with MCP connectors (outbound) for full round-trip integration—receive incidents automatically and post findings back to the source platform
- **Full Observability:** Every trigger execution is recorded with timestamps, thread IDs, duration, and AI-generated summaries
- **Flexible Context Injection:** Payload placeholders let you craft precise investigation prompts from incident data, while raw payload passthrough ensures the agent always has full context

## Getting Started

HTTP Triggers are available now in the Azure SRE Agent platform:

1. **Create a Trigger:** Navigate to Builder → HTTP Triggers → Create, and define your agent prompt with `{payload.X}` placeholders
2. **Set Up an Auth Bridge:** Use Logic Apps, Azure Functions, or APIM with Managed Identity to handle Azure AD authentication
3. **Connect Your Platform:** Point your external platform at the bridge and create a test event

Within minutes, you'll have an automated pipeline that turns every incident ticket into an AI-driven investigation.

## Learn More

- HTTP Triggers Documentation
- Agent Hooks Blog Post — Governance controls for automated investigations
- YAML Schema Reference
- SRE Agent Getting Started Guide

Ready to extend your SRE Agent to platforms it doesn't support natively? Set up your first HTTP Trigger today at sre.azure.com.