Here’s my story about hosting Reflection 70B on @hyperbolic_labs:
On Sep 3, Matt Shumer reached out to us, saying he wanted to release a 70B LLM that he expected to be the top open-source model (far ahead of the 405B), and he asked if we were interested in hosting it. At the time, I thought it was a fine-tuned model that surpassed the 405B in certain areas like writing, and since we always want people to have easy access to open-source models, we agreed to host it.
Two days later, on the morning of Sep 5, Matt made the announcement, claiming the model outperformed closed-source models across several benchmarks, and uploaded the first version to Hugging Face. We downloaded and tested the model, but I didn’t see the <thinking> tags featured in his demo, so I messaged him on X to let him know. Later, I saw his tweet saying there was an issue with the tokenizer in the Hugging Face repo (x.com/mattshumer_/status/183…), so we patiently waited.
I woke up at 6 AM PST on Sep 6 and found a DM sent around 3 AM PST from Sahil Chaudhary, founder of Glaive AI. He told me the Reflection-70B weights had been reuploaded and were ready for deployment. I didn’t know him before, and that was the only message I received from him. At around 6:30 AM, I was added to a Slack channel with Matt to help streamline communication. I focused on deploying the model, and by around 9 AM our API was live. Our tests showed that the <thinking> and <reflection> tags were finally appearing as expected, so we announced it.
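For anyone curious what that check looked like in practice, here’s a minimal sketch, assuming an OpenAI-compatible chat endpoint; the base URL, model id, and system prompt below are placeholders, not necessarily the exact ones we used:

```python
# Minimal sanity check after deployment: send a prompt and confirm the
# response contains the <thinking>/<reflection> tags the model is supposed
# to emit. Endpoint, model id, and system prompt are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.hyperbolic.xyz/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="mattshumer/Reflection-Llama-3.1-70B",  # placeholder model id
    messages=[
        # Approximation of the published Reflection system prompt
        {"role": "system", "content": "You are a world-class AI system, capable of complex reasoning and reflection."},
        {"role": "user", "content": "How many r's are in the word strawberry?"},
    ],
)

text = resp.choices[0].message.content
# The behavior we were verifying: reasoning wrapped in <thinking> and
# self-corrections in <reflection> before the final answer.
print("<thinking>" in text, "<reflection>" in text)
```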
After we released the model, a few people commented that our API performed worse than Matt’s internal demo website (though I kept getting error codes on their website, so I couldn’t compare the results myself), so we dug into everything to make sure it wasn’t a problem on our side. At 7 PM, Matt posted in the Slack channel, saying that with our API “definitely something's a little off”, and asked if we could expose a raw completions endpoint so he could manually build prompts to diagnose the issue. I set that up for him within the hour. There was no response from Matt until the next evening, when he told us they were focusing on a retrain, which quite surprised me.
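For context on why a raw completions endpoint helps here: it lets the caller hand-assemble the full prompt string (system prompt, chat template, special tokens) instead of relying on the server’s chat template, so prompt-formatting issues can be separated from serving issues. A rough sketch, again assuming an OpenAI-compatible /v1/completions route, with placeholder URL and model id:

```python
# Raw completions call: the prompt below is hand-built with the Llama 3.1
# chat-template tokens, so any formatting problem is in the prompt itself,
# not in the serving stack's templating.
import requests

prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a world-class AI system, capable of complex reasoning and reflection."
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is 2 + 2?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

resp = requests.post(
    "https://api.hyperbolic.xyz/v1/completions",  # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "mattshumer/Reflection-Llama-3.1-70B",  # placeholder model id
        "prompt": prompt,
        "max_tokens": 512,
        "temperature": 0.7,
    },
)
print(resp.json()["choices"][0]["text"])
```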
On Sunday morning, Sep 8, Matt told us they would have the retrained weights uploaded to HF later that day and asked if we could host them once they were ready. I said yes and waited for the new weights to be uploaded. Several hours later, someone on X pointed out that a ref_70_e3 model had been uploaded to HF, so I asked Matt if that was the one. He said it should be, and a while later he asked us to host it, so I quickly did. I notified @ArtificialAnlys and got on a call with their co-founder George in the afternoon. He told me the benchmark results were not good, much worse than what they had seen from the internal API, and they later posted the results: x.com/ArtificialAnlys/status….
Matt told us that day that they had hosted the “OG weights” themselves and could give us access if we wanted to host them. I replied, “We will wait for the open-source one since we only host open-source models.”
Since then, I’ve asked Matt several times when they plan to release the initial weights, but I haven’t received any response. Over 30 hours have passed, and at this point I believe we should take down the Reflection API and allocate our GPUs to more useful models, once some people (@ikristoph) finish their benchmarking (not sure if it’s still useful).
I was emotionally drained by this because we spent so much time and energy on it, so I tweeted about what my face looked like over the weekend. But after Reflecting, I don’t regret hosting it. It helped the community identify the issues more quickly.
I don’t want to guess what might have happened, but I think the key reflection is: Attention is not all you need.
Sep 10, 2024 · 10:05 PM UTC