Architected Availability: 2013

Monday, November 4, 2013

Getting the Most From AWS Ephemeral Volumes

I mentioned in How to Use LVM and LUKS with EBS Volumes that I used LVM and LUKS for the ephemeral volumes at Lucid Software. I have made the associated scripts public on GitHub, with an Apache 2.0 license. I'll take just a minute to describe what the script is and how we use it at Lucid.

How to Use LVM and LUKS with EBS Volumes

A while back, I had posted my findings on encryption at rest using LUKS. Circling back, here's the procedure I used. Although I was operating on Ubuntu 12.04 and EBS volumes, this same procedure can be used in many different scenarios.

Cloud Connect 2013

Another month, another conference. I am currently in Chicago, IL, getting ready to present at Cloud Connect Chicago. I'll be presenting Case Study: Lucidchart's Migration to VPC.

Zabbix vs Graphite

Monitoring solutions have been around for some time, but I still haven't found the perfect one. I first implemented Zabbix for Lucidchart in late 2011, and, just a few months ago, I installed Graphite. I'd like to take you through my decision process so you can find the right monitoring tool for your needs.

Comparing these two products is not easy, because they were designed to do different things. Zabbix was meant to be a server monitoring solution, while Graphite is more of a data collection and reporting tool. What I'd really like to see is a merger of the two tools, but that probably won't happen anytime soon.

PDF Service Memory Leaks

One of the most attractive features of Lucidchart is the direct mapping of pixels from screen to page. An essential part of this process is our PDF generator. JSON render data goes in and a PDF or an image comes out. Though it sounds simple, it contains 13k lines of Scala code, heavily uses Akka actors to gather and render fonts, images, and pages, depends on 8 internally maintained jars and 83 others, and is responsible for generating 50k PDFs and images a day (1.5M per month, 18.25M per year) at its current load. This is anything but a simple service.

Keeping this service running smoothly is a high priority. On July 8, a code release to the Lucidchart editor uncovered several issues with the PDF service. More specifically, the new image manager allowed users to retrieve images from Facebook, Flickr, and Dropbox. With these changes, our robust system fell on its face. PDF JVMs were crashing hundreds of times a day, causing those servers to be terminated and replaced with new ones. It wreaked havoc on our users and our uptime.

Memory Leak in Opscode Chef Daemon

At Lucidchart, we use Opscode's Chef to manage all of our servers, and have since early 2012. We've used the same version of Ruby and Chef the whole time - no upgrades, downgrades, or new modules. Out of nowhere, every one of our servers started running out of memory. It caused our production site to run slowly, servers to fail health checks, and the ops team to scramble.

Northeast PHP Conference 2013 Sessions

In addition to the strangeloop workshop, I've been accepted to speak at the 2013 Northeast PHP Conference. The Northeast PHP Conference is a web developer conference focused on PHP, Web Technology, and UX. The conference is in Boston, Massachusetts, on Aug 16-18, 2013.

Encryption at Rest using LUKS

At past companies, encryption at rest was done at the application layer. Only part of the data had to be encrypted, so code was inserted into the model that would encrypt the sensitive data before inserting into the database and decrypt after retrieval. This approach worked, and had no impact on the database - the hardest layer to scale.

At Lucidchart, we have failed to close large sales due to lack of encryption. Large companies want to make sure that their proprietary information is transmitted and stored using industry standard encryption. I took on the task of finding a method of encryption that made the most sense for our use case and added little overhead to our systems. After a lot of testing, benchmarking, and evaluating, I came to the conclusion that encrypting the disks on our database servers using LUKS was, and still is, the best solution.

Strange Loop 2013 Workshop

I've been accepted as a speaker at strangeloop, a programmer conference that focuses on emerging languages, concurrent and distributed systems, new database technologies, front-end web, and mobile apps. The conference is in St Louis, Missouri, on September 18-20, 2013.

My workshop is entitled Your First Scala Web Application using Play! 2.1. Here's the quick blurb I submitted.

Failover with PHP & MySQL using IPTables

Not too long ago, I had to replace a MySQL server in our production environment. Its replacement would have the same CNAME, and so no configuration change was needed. This particular server is in a master-master replication set with another database server. The application is set up with automated failover between the two servers. While the failover was automated, I didn't want to just shut down the database server because its slave may miss some writes. I needed a way to fake shutdown to the web servers, but keep communication open between database servers.

Looking back, it would have been a lot easier to change configuration, wait 30 minutes for it to propagate, replace the MySQL instance, change the configuration again, and wait 30 minutes for it to come back up. At the time, however, I thought of a simpler solution. I would use iptables to reject all traffic from web servers to this one database server. Once the connection was rejected, the failover would take place, and the replication would continue to work.

Not only did my plan not work, it caused about 5 minutes of downtime on our production servers.

Storing UUIDs with Anorm

I am not a big fan of Object Relational Mappers (ORMs). It is too easy to turn a simple query into a massive CPU-eating hydra - complete with 7 outer joins and no index usage. When I write code, I try to write it in such a way that the worst things to do are the hardest. That way, when I'm in a rush, I still write good code.

Enough about my rant: enter Anorm. I've been developing software on the Scala / Play stack for over a year now. Anorm is currently the supported database connector. It is a thin layer on top of JDBC that takes out all of the ugliness, but leaves the manual SQL writing goodness.

One of the things that Anorm does not do is read and write UUIDs. In 200 lines of copy-paste code, you'll have Anorm reading and writing UUIDs like a pro. A side effect of this code is that Anorm will also be able to handle byte arrays.

"In" Clause with Anorm

At Lucidchart, two coding styles have been adopted because Anorm doesn't support the 'in' clause.

Code that allows for, or is conducive to, SQL injection.

Example:
SQL(""" SELECT * FROM users WHERE id IN (%s) """.format(userIds.mkString(",")))
Code that is hard to follow and duplicate.

Example:
val params = List(1, 2, 3)
// results in List("userId0", "userId1", "userId2")
val paramsList = for (i <- 0 until params.size) yield ("userId" + i)
// produces "id in ({userId0},{userId1},{userId2})"
SQL(""" SELECT * FROM users WHERE id IN ({%s}) """.format(paramsList.mkString("},{")))
  // produces Map("userId0" -> 1, "userId1" -> 2, ...)
  .on(paramsList.zip(params))

Ugly much? The madness ends here with code samples.
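As a rough illustration of the idea (not the code from the full post), the placeholder bookkeeping from the second style can be pulled into one small helper. The names InClause and placeholders are hypothetical, and the sketch deliberately stops short of calling Anorm: it only builds the "{name}" fragment and the name/value pairs that would then be bound as parameters.

```scala
object InClause {
  // Hypothetical helper: returns the "{name0},{name1},..." fragment for an
  // IN clause together with the (name, value) pairs that should be bound.
  // Values are never spliced into the SQL text, only placeholder names are.
  def placeholders[A](prefix: String, values: Seq[A]): (String, Seq[(String, A)]) = {
    // An empty list would produce "IN ()", which is not valid SQL.
    require(values.nonEmpty, "an IN clause needs at least one value")
    val named = values.zipWithIndex.map { case (value, i) => (prefix + i, value) }
    val fragment = named.map { case (name, _) => "{" + name + "}" }.mkString(",")
    (fragment, named)
  }
}
```

For example, InClause.placeholders("userId", List(1, 2, 3)) yields the fragment "{userId0},{userId1},{userId2}" plus the pairs to bind, so the query text can be assembled as "SELECT * FROM users WHERE id IN (" + fragment + ")" without ever formatting user input into it.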

UUIDs in MySQL

Auto-incremented numbers can be so nice for unique IDs in a database; they're guaranteed to be unique, can range to fit your needs (tinyint to unsigned bigint), and are easy to use. When working in a sharded environment, they may not fit the bill anymore; you either lose global uniqueness or you lose ease of use (depending on your sharding mechanism). One very good alternative is to use UUIDs.

UUIDs solve the global uniqueness and ease of use problems at the same time, but introduce another problem in the database layer - memory footprint. A UUID is a 36-character ASCII string representing a 16-byte value. I've seen them stored as CHAR(36), VARCHAR(36), etc. The problem with storing them in their ASCII form is that the index will grow beyond the capacity of the database server faster than when they're stored in a binary form. Memory will fill faster, queries will hit the disk more frequently, and it will ultimately cost more money to save the exact same information. It's not worth it.
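To make the size difference concrete, here is a minimal sketch of packing a UUID into its 16-byte binary form and back, using only the JDK. The object name UuidBytes is made up for illustration, and the assumption is that the bytes would land in a fixed-width binary column (for example BINARY(16)) rather than a 36-character text column.

```scala
import java.nio.ByteBuffer
import java.util.UUID

object UuidBytes {
  // Pack the two 64-bit halves of the UUID into a 16-byte array
  // (most significant bits first).
  def toBytes(uuid: UUID): Array[Byte] = {
    val buffer = ByteBuffer.allocate(16)
    buffer.putLong(uuid.getMostSignificantBits())
    buffer.putLong(uuid.getLeastSignificantBits())
    buffer.array()
  }

  // Rebuild the UUID from the 16 bytes written by toBytes.
  def fromBytes(bytes: Array[Byte]): UUID = {
    require(bytes.length == 16, "a UUID is exactly 16 bytes")
    val buffer = ByteBuffer.wrap(bytes)
    val mostSignificant = buffer.getLong()
    val leastSignificant = buffer.getLong()
    new UUID(mostSignificant, leastSignificant)
  }
}
```

The text form of any UUID is 36 characters, while UuidBytes.toBytes always returns 16 bytes, so every indexed value is less than half the size.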

Best AWS re:Invent Session

I recently came across a post about James Hamilton, the star AWS employee who keeps the internet alive while living on a boat, and I thought I would write a blurb about his sage wisdom bestowed upon me during the best session I attended at the AWS re:Invent conference.

OpenVPN Configuration for VPC on Ubuntu 12.04

One of my more popular posts so far has been OpenVPN with Amazon VPC. I'm guessing that the lack of a step by step process or, at the very least, the configuration files, has left you annoyed. In this post, I'll give step by step instructions and configuration files to you, so that you can have all the goodness that is a private VPN on VPC.

While the servers that I ran this on are Ubuntu 12.04, it wouldn't surprise me if the configuration and setup worked on various versions of Ubuntu, Debian, Fedora, CentOS, and more.

VPC Migration: Post Mortem

All done! Every last one of the servers is running inside of Amazon's VPC. For the most part, everything went as expected. There are just a few loose ends I'd like to note.

OpenVPN with Amazon VPC

Although I did not initially plan to set up a VPN between Lucidchart's office and the newly set up VPC, I changed my mind before I even migrated the first server.

The reasoning is simple. I don't want our services to be publicly accessible; however, our office needs access to those services. The services I'm talking about include git, chef, apt, jenkins, and more.

These services are not the only issue. Imagine a problem in production that requires manual debugging. I would have to tunnel through the NAT instance manually just to debug the problem server. When I'm having any issue in production, the last thing I want is an extra step.

VPC Migration: DNS

Shortly after starting the migration to VPC, I ran into an unexpected issue with DNS. I hope to give a more informed view than I received prior to starting the migration.

VPC Migration: NATs & Bandwidth Bottleneck

I ran into an unexpected issue during the migration to VPC over the weekend. The NAT instances, all of which are t1.micro size, could not handle the network traffic between the web servers and the backend servers. Our traffic backed up to the point that requests started timing out. The disastrous result was downtime.

VPC Migration: Setup

In my last post, I gave reasons for and against the move to VPC. I have now set up my VPC and hope to help some trouble-bound soul avoid the same mistakes.

The first thing to do in a migration to Amazon's Virtual Private Cloud is to set up the subnets, routes, gateways, and NAT instances. My intention in this post is to lay out the steps and generic principles contained in our VPC setup. The setup that I chose is not perfect for all situations. I assume you know how to operate the AWS console and are familiar with basic EC2 and networking concepts.

VPC Migration: Planning

I'm looking at moving all of Lucidchart's servers into Amazon's VPC. This is no small task, nor should it be approached without a good plan and collective knowledge.

I will be recording the migration to VPC during the transition. As part of that, here are the advantages and disadvantages of moving to VPC, and my plan to do it.

Disk Failures and Service Interruptions

I am currently employed as the chief architect at Lucidchart. In my spare time (literally) I am also the ops guy. All of our servers are running on Amazon's EC2 cloud. Using the cloud is amazing and frustrating at the same time. Managing hardware, using tape drives, and co-location facilities are all nightmares; on the other hand, so are service outages, network failures, and ephemeral storage drives.

As the CTO of Amazon, Werner Vogels, says, "Everything fails all the time." I would like to give a report of one such failure: how it happened, what was affected, how we got through it, what I did because of it, and how I'll never have to deal with it again.

Monday, November 4, 2013

Wednesday, October 23, 2013

Friday, August 23, 2013

Friday, August 2, 2013

Thursday, July 18, 2013

Friday, June 21, 2013

Wednesday, June 19, 2013

Thursday, May 23, 2013

Wednesday, May 8, 2013

Monday, April 22, 2013

Wednesday, April 17, 2013

Wednesday, March 27, 2013

Monday, February 25, 2013

Thursday, February 21, 2013

Tuesday, January 29, 2013

Friday, January 25, 2013

Thursday, January 24, 2013

Wednesday, January 23, 2013

Tuesday, January 22, 2013

Saturday, January 19, 2013

Saturday, January 5, 2013