Addteq – DevOps experts and partners of both Atlassian and Dynatrace – is building tighter DevOps use case integrations between Dynatrace and the Atlassian tool suite. Addteq also educates our joint customers and helps them understand how our tools can optimize their end-to-end delivery processes.
Not only are they building these integrations and educating you on DevOps best practices, but they also use Dynatrace to monitor their own internal DevOps toolchain such as JIRA, Confluence, Bitbucket and Bamboo, to name a few!
On a recent status call, Himanshu Chhetri (CTO) and Sukhbir Dhillon (CEO) mentioned how well the Dynatrace AI worked for them after they deployed the Dynatrace OneAgent on their internal servers. Dynatrace automatically identified the root cause of a slowdown in their JIRA and Confluence instances before their developers were heavily impacted by the problem. This speaks to the proactive nature of Dynatrace.
#1 – The Dynatrace Problem Ticket
Every time Dynatrace detects an anomaly in your environment it creates a problem ticket. Thanks to the Dynatrace JIRA Integration, which is currently being extended by Addteq, the problem ticket automatically created a JIRA ticket in Addteq’s JIRA instance. This ticket triggered their own internal problem resolution workflow.
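To make the ticket hand-off more concrete, here is a minimal sketch of how a problem notification could be mapped to a JIRA issue payload. It is not the actual Dynatrace JIRA Integration: the input field names (`id`, `title`, `state`, `impacted_services`, `url`), the project key `OPS` and the example values are all assumptions made for illustration. Only the outer `fields` / `project` / `issuetype` / `summary` / `description` shape follows the JIRA REST API's create-issue format.

```python
import json


def build_jira_issue(problem: dict, project_key: str = "OPS") -> dict:
    """Map a problem notification to a JIRA create-issue payload.

    The keys read from `problem` are hypothetical names chosen for this
    sketch, not the real notification schema.
    """
    impacted = ", ".join(problem.get("impacted_services", [])) or "unknown"
    return {
        "fields": {
            "project": {"key": project_key},  # assumed project key
            "issuetype": {"name": "Bug"},     # assumed issue type
            "summary": f"[Dynatrace {problem['id']}] {problem['title']}",
            "description": (
                f"State: {problem['state']}\n"
                f"Impacted services: {impacted}\n"
                f"Details: {problem['url']}"
            ),
        }
    }


# Illustrative notification; every value here is made up for the example.
example_problem = {
    "id": "P-123",
    "title": "Failure rate increase",
    "state": "OPEN",
    "impacted_services": ["JIRA", "Confluence", "TokenService"],
    "url": "https://example.invalid/problem/P-123",
}

issue = build_jira_issue(example_problem)
print(json.dumps(issue, indent=2))
```

The sketch only builds the payload. Sending it to a JIRA instance, and deciding which workflow the new ticket enters, is left out on purpose because those details depend on the individual setup.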
The Dynatrace Problem ticket indicated that six services were impacted, including JIRA, Confluence and some shared services such as the TokenService, which experienced a very high failure rate:
#2 – Problem Evolution
One of the features that gets people excited about the Dynatrace AI is that Dynatrace correlates all events that happen on all dependent components into a single problem ticket. Instead of having to look at events from your log monitoring, infrastructure monitoring, application performance monitoring and end user monitoring tools, you get all this information in a single spot: Dynatrace!
And the one view in the Dynatrace Web UI that shows all these events along a timeline is the Problem evolution view that gives us a time-lapse option to “replay” the chain of events. Here is the problem evolution for their problem:
#3 – Automatic Deployment Detection
If you look at the distribution of events (top right), you can see that this problem went through two phases. Each phase shows a spike of events: one shortly after midnight, the second shortly before 2 AM.
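The idea of reading phases off an event distribution can be sketched in a few lines. This is not how Dynatrace builds its view; it is a simple illustration that groups event timestamps into 30-minute buckets and reports the buckets that hold a cluster of events. The timestamps, the placeholder date, the bucket size and the `min_events` threshold are all assumptions.

```python
from collections import Counter
from datetime import datetime


def bucket_start(ts: datetime, minutes: int = 30) -> datetime:
    """Round a timestamp down to the start of its bucket."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def spike_buckets(timestamps, min_events: int = 3):
    """Return the bucket start times that contain at least `min_events` events."""
    counts = Counter(bucket_start(ts) for ts in timestamps)
    return sorted(start for start, n in counts.items() if n >= min_events)


# Made-up event times on a placeholder date: one cluster shortly after
# midnight, one isolated event, and one cluster shortly before 2 AM.
day = (2000, 1, 1)
events = [
    datetime(*day, 0, 5), datetime(*day, 0, 7),
    datetime(*day, 0, 12), datetime(*day, 0, 20),
    datetime(*day, 1, 10),
    datetime(*day, 1, 45), datetime(*day, 1, 50),
    datetime(*day, 1, 52), datetime(*day, 1, 55),
]

for start in spike_buckets(events):
    print(start.strftime("%H:%M"))
```

With this sample data, the isolated event at 01:10 falls below the threshold, so only the two clusters are reported as phases.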
The first batch of events is all related to a restart and a redeploy of the jira-install service. Turns out that this is “normal.” Well – kind of normal. Digging through the automatically detected deployment events shows us that every time they restart that service, we see high CPU and error log messages, resulting in some of the failures we can observe in the other dependent services:
#4 – Python process gone ROGUE!
The second batch of events is related to the real problem. Turns out that the server addteq-crowd, a Linux machine hosting Atlassian Crowd, runs out of CPU. Crowd is the single sign-on service that is used by all other Atlassian tools such as JIRA and Confluence. If this service is impacted, everyone else is impacted too.
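Why a problem in Crowd ripples outward can be illustrated with a small dependency walk. The sketch below is an assumption-laden toy, not Dynatrace's dependency model: the `depends_on` map only encodes that JIRA and Confluence use Crowd (as described above), and `reporting-app` is a hypothetical service added to show a transitive dependency.

```python
from collections import deque


def impacted_by(failing: str, depends_on: dict) -> set:
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the map: for each service, who depends on it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk from the failing service through its dependents.
    seen = {failing}
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen - {failing}


# Toy dependency map; "reporting-app" is hypothetical.
depends_on = {
    "JIRA": ["Crowd"],
    "Confluence": ["Crowd"],
    "reporting-app": ["JIRA"],
    "Crowd": [],
}

print(sorted(impacted_by("Crowd", depends_on)))
```

A slowdown in a leaf service such as the hypothetical `reporting-app` impacts nothing else in this map, while a slowdown in Crowd reaches every service listed.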
Looking closer at this Linux machine shows us that it is not Crowd itself, which runs in the Tomcat container, that uses all the CPU. Turns out it is the Python-based app called duplicity which is used for file and directory backups:
Duplicity runs on every one of Addteq’s hosts but only runs into high CPU on addteq-crowd. This can easily be seen in the Dynatrace Process Group overview for Duplicity, which shows the resource consumption of all instances of Duplicity across all hosts where it runs:
Tip: Process Group Detection is a key capability in Dynatrace. The automatic detection works extremely well but can always be customized to your special needs. To learn more check out Mike Kopp’s blog on Enhanced Process Group Detection.
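The per-host comparison of Duplicity instances described above can be approximated with a simple outlier check. This is only a sketch of the idea, not the Dynatrace algorithm: it flags hosts whose CPU usage for one process group exceeds a multiple of the median across all hosts. The host names other than addteq-crowd, all CPU values and the `factor` threshold are made up for illustration.

```python
from statistics import median


def cpu_outliers(cpu_by_host: dict, factor: float = 3.0) -> list:
    """Return hosts whose CPU usage is more than `factor` times the median."""
    baseline = median(cpu_by_host.values())
    return sorted(host for host, cpu in cpu_by_host.items() if cpu > factor * baseline)


# Made-up CPU percentages for one process group across hosts.
# Only the host name "addteq-crowd" comes from the story above.
duplicity_cpu = {
    "host-a": 2.0,
    "host-b": 3.5,
    "host-c": 2.5,
    "addteq-crowd": 85.0,
}

print(cpu_outliers(duplicity_cpu))
```

Using the median rather than the mean keeps the baseline stable even when one instance is far off, which is exactly the situation in this example.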
Actions based on Problem Detection
Our friends from Addteq weren’t aware that Duplicity had an issue on a single machine and didn’t know the actual impact it had on Atlassian Crowd, which in turn impacted all the other services. Because Dynatrace automatically analyzes all this data and is aware of all these dependencies, it can identify problems that we would normally not even have thought of.
It’s great to see a partner like Addteq not only building integrations, and thereby extending the Dynatrace ecosystem, but also “walking the talk”: they actively use Dynatrace to ensure their systems run optimally and their employees are not impacted by any Python scripts gone rogue! 😊
If you want to try Dynatrace yourself, simply sign up for our Dynatrace SaaS Trial. If you want to learn more about how to optimize JIRA and Confluence in particular, read my blog on “Optimizing Atlassian JIRA and Confluence Productivity with Dynatrace”.
This syndicated content is provided by Dynatrace and was originally posted at https://www.dynatrace.com/news/blog/dynatrace-ai-in-action-rogue-python-script-impacting-atlassian-devops-tools/