August 16th, 2024

SREBench Competition

Parity is hosting the SREBench Leaderboard Race, allowing participants to compare their incident response times against its AI, which has a 70% success rate. The competition ends on August 23, 2024.


Parity is hosting a competition called the SREBench Leaderboard Race, where participants can compare their incident response times against an AI developed by Parity. The AI has a 70% success rate and a mean time to resolution (MTTR) of 2 minutes. The competition culminates on August 23, 2024, with the top human participant winning a $100 Amazon gift card. Interested individuals can learn more about Parity by emailing the founders or booking a meeting with them.

- Parity's AI has a 70% success rate and an MTTR of 2 minutes.

- The competition allows users to compare their performance against AI.

- The top human participant will receive a $100 Amazon gift card.

- The event concludes on August 23, 2024.

- Participants can contact Parity for more information or to engage with the founders.

12 comments
By @guessmyname - 6 months
For the user registration (if you want to get the Amazon gift card) you send a request like this:

  POST /api/trpc/submitUser?batch=1 HTTP/2.0
  Host: sreben.ch
  Cookie: <COOKIE>
  Trpc-Accept: application/jsonl
  Content-Type: application/json
  Referer: https://sreben.ch/race
  
  {"0":{"name":"<USERNAME>","email":"<EMAIL>","role":"<JOB_TITLE>","company":"<COMPANY>"}}
Then, requests like this will grade your answers:

  POST /api/trpc/gradeOutput?batch=1 HTTP/2
  Host: sreben.ch
  Cookie: <COOKIE>
  Trpc-Accept: application/jsonl
  Content-Type: application/json
  Referer: https://sreben.ch/race
  
  {"0":{"userRootCause":"<YOUR_ANSWER_HERE>","testNumber":<TEST_NUMBER>}}
Especially useful if the “Submit Root Cause” button doesn’t work for you either.

Also, make sure to type the entire error message, e.g. “ERROR Application performance degraded due to CPU throttling” instead of simply “CPU throttling”; otherwise you’ll get a "partially_correct" grade.
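
If the web UI gives you trouble, a rough curl equivalent of the grading request above might look like this (a sketch inferred from the captured request, not a documented API; <COOKIE> and <TEST_NUMBER> are placeholders from your own session):

  # Hedged sketch of the gradeOutput call as a curl command; <COOKIE> and
  # <TEST_NUMBER> are placeholders taken from your own browser session.
  curl 'https://sreben.ch/api/trpc/gradeOutput?batch=1' \
    -H 'Cookie: <COOKIE>' \
    -H 'Trpc-Accept: application/jsonl' \
    -H 'Content-Type: application/json' \
    -H 'Referer: https://sreben.ch/race' \
    --data-raw '{"0":{"userRootCause":"ERROR Application performance degraded due to CPU throttling","testNumber":<TEST_NUMBER>}}'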

By @dgl - 6 months
Not sure whether there are lots of people trying out commands right now (is it backed by a real k8s cluster?), but some commands are taking over 10 seconds to run. Not really a fair "benchmark" when the system's speed is variable.

I also only got "partially_correct" for some, not sure whether it wanted more detail or just didn't like how I phrased things. Neat though.

                      Success Rate        MTTR (Mean time to Resolution)
  YOU:                50.00 %              1.80 min
  PARITY AI SRE:      70 %                 2 min
At least I'm faster than an AI?
By @whynotkeithberg - 6 months
What about your AI not being able to understand answers exactly identical to its own? Or the 3 strawberry user with a 33333% success rate that you didn't remove until I said something.

But I think the bigger thing is... Your supposed AI not understanding answers that are literally identical to the ones it considers correct. Pretty weak.

By @yjftsjthsd-h - 6 months
What am I missing? Of course AI is faster than a human; the problem is that I don't trust it to not break things itself.
By @freeplay - 6 months
I'm wondering how they're determining a correct answer. I know, for sure, that one of my answers was correct but it was marked incorrect. I'm wondering if I need to include specific keywords in my answer? How detailed do I need to be?
By @lagichikool - 6 months
Should be called k8sbench?

Kubernetes is so bad and the questions asked here are such a good example of why.

  Response: sudo apt-get purge kube*
By @andrewguenther - 6 months
Hardly fair when most commands don't work and you can't copy paste...
By @writtenAnswer - 6 months
I've never used Kubernetes in my life before, and I was able to beat the AI benchmark. Neat tool/game, made me learn some super basic kubernetes cli lol
By @gtirloni - 6 months
Free training data?
By @deathanatos - 6 months
~So like, what am I missing?~ (edit: I'm not missing anything; an AI still can't do my job.)

  Pod is stuck in 'ContainerCreating' state and never starts.
  $ kubectl get po -A
  NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE
  default       my-app-5d8d6f6d6f-abcde                   1/1     Running   0          2d
  default       my-app-5d8d6f6d6f-fghij                   1/1     Running   0          2d
  kube-system   coredns-558bd4d5db-xyz12                  1/1     Running   0          5d
  kube-system   coredns-558bd4d5db-xyz34                  1/1     Running   0          5d
  kube-system   etcd-minikube                             1/1     Running   0          5d
  kube-system   kube-apiserver-minikube                   1/1     Running   0          5d
  kube-system   kube-controller-manager-minikube          1/1     Running   0          5d
  kube-system   kube-proxy-abcde                          1/1     Running   0          5d
  kube-system   kube-scheduler-minikube                   1/1     Running   0          5d
  kube-system   storage-provisioner                       1/1     Running   0          5d

  Your root cause: no pod is stuck in ContainerCreating?
  Grade: incorrect
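For what it's worth, a pod that really is stuck in ContainerCreating would normally be triaged with something like the following (standard kubectl, nothing to do with the simulator; the pod and namespace names are hypothetical placeholders):

  # Standard first look at a pod stuck in ContainerCreating
  # (<POD_NAME> and <NAMESPACE> are hypothetical placeholders).
  kubectl describe pod <POD_NAME> -n <NAMESPACE>               # Events section surfaces image pull, volume mount, or CNI errors
  kubectl get events -n <NAMESPACE> --sort-by=.lastTimestamp   # recent cluster events, newest last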
My other problems were similarly confounding¹. One was "one machine seems loaded, but not others." All the pods had a node affinity to a single node tacked onto their specs, but that's only "partially correct"? And the last one is "Application components in different pods cannot communicate", but nothing is running except nginx, which would never communicate with itself.
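
A quick way to confirm that kind of pinning outside the simulator (a hedged sketch using standard kubectl; the pod name is a hypothetical placeholder):

  # Where did the pods actually land, and is an affinity block pinning them?
  kubectl get pods -o wide                                   # NODE column: are they all on the same node?
  kubectl get pod <POD_NAME> -o yaml | grep -A10 affinity    # show any affinity block on the spec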

We're generating the problems, and answers, with an AI, aren't we?

I've thrown a few real-world problems at LLMs, and they have floundered on them, to the point of not even being able to emit coherent output. I've had utterly incoherent responses, like "add this label to the pod" where the label was in Chinese, etc.

Edit: played again. Got the same node affinity problem. Same answer, but this time it was correct. Oh yeah, AI comin' for my job /s.

Also no alias k=kubectl, no up/down arrows to repeat or edit commands, no tab completion, no common shortcuts, and the site blocks you from copy/pasting pod names (or anything else). Like, yeah, if these are the conditions your SREs work under then I bet an AI can beat them? Might as well tie their hands behind their backs while we're at it.
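
For reference, the conveniences it strips away amount to a few lines of standard interactive setup (taken from the usual kubectl bash-completion instructions):

  # Typical day-to-day setup the simulator doesn't allow (bash)
  alias k=kubectl
  source <(kubectl completion bash)
  complete -o default -F __start_kubectl k   # make completion work for the alias too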

¹I suppose it matches real life, in that the reported problem is often utterly divorced from reality, and it takes 2–3 rounds with the reporter to make sense of what it is they're trying to report in the first place. But I can't interrogate the problem statement in this "simulator".