How our ever-expanding world inspired the next big thing in product management software
Asset growth that’s no longer linear, with terabytes of growth in a day no longer unusual. A user base that spans multiple time zones, languages and cultures. Increasing visitor numbers, with ever-shortening deadlines and ever-increasing expectations of speed, uptime and resilience.
I suspect these are trends all too familiar to most systems administrators, IT managers, dev managers, CIOs and CTOs. At Sun Branding Solutions we’ve been providing a SaaS offering called Odin to the world’s best-known retailers and brands for well over a decade, and to its credit it’s kept up remarkably well with changes in the way packaging is created, designed and delivered when you consider its roots. Originally built back in the mid-2000s, it’s followed a roadmap of technologies that will be familiar to just about anyone reading this who’s done any kind of brownfield development – VB6, Classic ASP, ASP.Net Web Forms, ASP.Net MVC, jQuery and so on. It actually works pretty well – it’s still very much alive and in use today for thousands of product launches a day across the USA, UK and mainland EU.
What Odin didn’t do very well, though, was stack up against the growth challenges we face. A few numbers help to put those into context:
- Most of our Odin instances are single-tenant, and each of them handles about 1m hits/day from about 30,000 unique visitors. That load pattern is concentrated pretty heavily around the early afternoon – about 50% of a day’s traffic falls into the three-hour period between 1pm and 4pm.
- Asset growth between the various Odins was around 500GB/month in January 2014. By December 2014 it’d grown to about 1TB/month, and by December 2015 we were spiking at over 1TB/week. At that rate of growth, we expect to be dealing with well over 1TB/day by the end of 2016. Crucially from an infrastructure planning perspective, that growth is driven partly by an increase in business (which is good, because more business = more money, which pays for the infrastructure) but also partly by an increase in the size of assets associated with each packaging item (which means no increase in revenue). That meant we had to look at some slightly non-traditional methods of storing those assets, where we could scale the storage without a matching increase in cost.
- Odin doesn’t scale horizontally – or at least, not easily and not well. A reliance on state means it’s hard to scale the front-end web server into a web farm, load balancing is prohibitively difficult and any sort of HA environment was very, very hard to deliver. That’s increasingly unacceptable in a world where 24/7 uptime is expected and where users work different hours and days depending on where they are in the world.
So… welcome to SUNrise.
From the outset, we had a few fairly ambitious design goals for SUNrise.
- It needed to scale massively. Our definition of “massive” isn’t quite the same as Nasdaq’s (though it isn’t far off – they’re producing about 500TB/year, and we expect to catch up by the end of this year), but in practical terms it means catering for
- 500,000 concurrent visitors
- 100m hits/day (spread roughly evenly throughout the day at around 4m/hour)
- Individual file upload limits of 200GB/file
- Total asset storage limits of 1PB/customer (to put that into context, one of our competitors is hoping to support 200 concurrent users and ‘multiple terabytes’ of storage by 2020)
- It needed to be highly available, and low latency. That means multiple web front-ends running in multiple continents. At the same time, we wanted to be able to easily move those instances around the world to suit load patterns – there’s little point having 20 webservers in the UK at 9pm, but that’s right in the middle of the day in Arkansas.
- It needed to meet compliance and regulatory requirements for our customers – that means compliance with the EU’s data protection regulations for example, along with encryption at rest and all the other security stuff that you’d expect.
- It needed to be fast. The nature of what SUNrise (and Odin before it) does is quite heavily process intensive, and that needs to happen asynchronously as much as possible. That means a queue-based architecture and a scalable set of “worker jobs” which would fire up on demand.
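The queue-based pattern behind that last goal can be sketched with nothing more than a job queue and a pool of workers sized to the backlog. This is a minimal illustration using Python’s standard library (the real system uses Azure Queue Storage and worker roles; the `process` function and job shape here are hypothetical stand-ins):

```python
import queue
import threading

def process(job):
    # Hypothetical stand-in for a process-intensive task
    # (e.g. generating a preview of an uploaded asset).
    return job["id"], job["payload"].upper()

def worker(jobs, results):
    # Each worker drains the queue until it's empty, then exits --
    # analogous to worker instances that spin up on demand and
    # scale back down once the backlog clears.
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        results.put(process(job))
        jobs.task_done()

jobs, results = queue.Queue(), queue.Queue()
for i in range(5):
    jobs.put({"id": i, "payload": f"asset-{i}"})

# Size the worker pool to the backlog (capped), mimicking on-demand scaling.
workers = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(min(3, jobs.qsize()))]
for t in workers:
    t.start()
for t in workers:
    t.join()

print(sorted(results.queue))
```

The important property is that the web tier only ever enqueues work; everything expensive happens asynchronously, so front-end latency stays flat regardless of how heavy the processing load gets.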
It’s been an interesting journey.
We settled on Microsoft Azure; we’re a Microsoft Gold Application Development Partner, so it was the logical choice – but we did consider other offerings too. AWS wasn’t really an option for us because they’re seen as a competitor by many of our retail clients, and Google’s offering didn’t really seem to have the vision or roadmap we needed. Azure fits well with our .Net background, as you’d expect, but when you add in Powershell scripting and the commercial flexibility with Microsoft it was a fairly easy choice.
Azure Queue and Blob storage powers the queue-based processing and our main application filestore, and we use SQL Database (as a service) rather than SQL-on-VMs for the backing database. We’ve learned a few things along the way about that – not least that SQL Database scaling has some challenges of its own (it behaves quite differently to a traditional SQL Server VM, especially in terms of connection latency, and that’s meant a mindset change for devs). It’s also very much a scale-out rather than scale-up model – which is great, and definitely the way forward, but another cause of some tooth-grinding on the devs’ part.
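The biggest part of that mindset change is treating connection failures as transient and routine rather than fatal: a managed database will drop connections far more often than a dedicated VM, so every call needs retry logic. A simplified sketch of the retry-with-backoff pattern (the `flaky_query` function is a simulated stand-in, not a real database call):

```python
import random
import time

def with_retries(op, max_attempts=5, base_delay=0.01, transient=(ConnectionError,)):
    """Retry an operation on transient failures with exponential backoff.

    Managed cloud databases shed connections transiently far more often
    than a dedicated SQL Server VM, so callers retry instead of failing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except transient:
            if attempt == max_attempts:
                raise
            # Back off exponentially, with jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))

# Simulated flaky connection: fails twice, then succeeds.
attempts = {"n": 0}
def flaky_query():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "rows"

print(with_retries(flaky_query))  # succeeds on the third attempt
```

In .NET this is usually handled by a reusable retry policy rather than hand-written loops, but the shape of the logic is the same.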
Like a lot of Azure early adopters we started out with Web Roles and Worker Roles, and have migrated over to Web Apps as they’ve matured as an offering – another learning curve for devs, who have to get out of some old (bad!) habits like using local file storage for temporary data, or relying on state. They provide the scaling we were looking for out of the box – at its simplest, “add some more instances when CPU averages more than X percent, and take them away again when it drops back down” – and with Traffic Manager sat in front of them we’re able to route traffic to the closest nodes and rely on Azure’s fast back-haul under the hood. Perhaps most importantly, they don’t need the emulator which took an hour to start up each time you hit Debug in Visual Studio!
The worker roles are still there; we use those for performing all kinds of ‘offline’ tasks like ripping files, and they have similar scaling properties which allow us to adjust for demand. Initially we ran our own Lucene instances for search, but administering and debugging it was just hard enough to be too time-consuming – we’ve now migrated to Azure Search (in for a penny, in for a pound!) which works extremely well (though the documentation could perhaps do with a spruce-up…).
Right at the heart of SUNrise is a hand-rolled workflow engine – which might be a surprise to those of you thinking “is there anything in the Microsoft stack these guys didn’t use?” by now. We did look at Workflow Foundation (and got as far as building a few PoCs with it) and some other off-the-shelf workflow engines, but none of them did quite what we wanted, and it’s so core to SUNrise that we didn’t want to compromise too early on. That too runs as a set of worker roles backed by Azure Queue Storage, scales dynamically and lets us handle events from simple task completion through to SAP integration within a user-defined (in a web-based editor) workflow.
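At its core, an engine like that is a state machine whose transition table is supplied by the user and whose events arrive from a queue. A minimal sketch of the idea (the workflow states, events and the `Workflow` class here are invented for illustration, not SUNrise’s actual model):

```python
class Workflow:
    """A minimal event-driven workflow engine: transitions are a
    user-definable mapping of (state, event) -> next state, and events
    are handled one at a time, as a stand-in for messages arriving
    via a queue."""

    def __init__(self, transitions, start):
        self.transitions = transitions
        self.state = start
        self.history = [start]

    def handle(self, event):
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"event {event!r} not valid in state {self.state!r}")
        self.state = self.transitions[key]
        self.history.append(self.state)

# Hypothetical packaging-approval workflow definition.
transitions = {
    ("draft", "submit"): "in_review",
    ("in_review", "approve"): "approved",
    ("in_review", "reject"): "draft",
    ("approved", "dispatch"): "sent_to_erp",  # e.g. an SAP integration step
}

wf = Workflow(transitions, start="draft")
for event in ["submit", "reject", "submit", "approve", "dispatch"]:
    wf.handle(event)

print(wf.state)  # sent_to_erp
```

Keeping the transition table as data is what makes the workflow user-definable: the web-based editor only has to edit that mapping, not the engine itself.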
Did we achieve our goals?
So far, so good! We’ve got automated load testing hammering the site with hundreds of thousands of simulated users, uploading lots of large files and completing thousands of work items a minute, and in failover testing we’ve proved the concept of a distributed network of web and worker roles with seamless recovery from the loss of a whole region.
We’ve got an exciting roadmap ahead of us with lots more goodies to add, but it’s fair to say that SUNrise is really a next-generation solution when you compare it to the competitive marketplace.
We’re pretty pleased with it – why not drop us a line and give it a go for yourself, or check out our webcast first?