What was happening in this Let's Encrypt CAA bug?

Question

Recently, Let's Encrypt shared this bug that occured in their systems leading to certificate issues to their clients. They describe the bug this way:

The bug: when a certificate request contained N domain names that needed CAA rechecking, Boulder would pick one domain name and check it N times. What this means in practice is that if a subscriber validated a domain name at time X, and the CAA records for that domain at time X allowed Let’s Encrypt issuance, that subscriber would be able to issue a certificate containing that domain name until X+30 days, even if someone later installed CAA records on that domain name that prohibit issuance by Let’s Encrypt.

Does this mean that when a user had more than one domain name that needed CA rechecking, Let's Encrypt would only check the first domain? Was the issue that certificates were issued on domains that were not owned by the user getting the certificate?

score 4 · Accepted Answer · edited Mar 28 '20 at 21:30

Let's Encrypt issued less secure certificates based on the bug. It's more of a race condition than anything else.

A CAA record is an optional DNS record that restricts which certificate providers are allowed to issue certificates for a domain. So if the subscriber validates a domain name at time X and the CAA records for the domain allowed Let's Encrypt issuance during the time the subscriber validated their domain, the subscriber would have 30 days from the validation date to issue valid certs. So if someone later added CAA records that prohibit LE certs to be issued anytime between the time X + 30 days, the subscriber would be able to issue certs ignoring the CAA record.

If you look at the PR that had the bug, this is where it was first triggered when the NewAuthorizationSchema feature flag was enabled in prod.

The problem code is a common mistake in Go, see the relevant code for boulder-ra:

// authz2ModelMapToPB converts a mapping of domain name to authz2Models into a
// protobuf authorizations map
func authz2ModelMapToPB(m map[string]authz2Model) (*sapb.Authorizations, error) {
    resp := &sapb.Authorizations{}
    for k, v := range m {
        // Make a copy of k because it will be reassigned with each loop.
        kCopy := k
        authzPB, err := modelToAuthzPB(&v)
        if err != nil {
            return nil, err
        }
        resp.Authz = append(resp.Authz, &sapb.Authorizations_MapElement{Domain: &kCopy, Authz: authzPB})
    }
    return resp, nil
}

Which is the issue: taking a reference to a loop iterator variable They failed to handle the second loop iterator variable v properly.

And in turn, failed to consider two important fields in this function: IdentifierValue and RegistrationID

func modelToAuthzPB(am *authzModel) (*corepb.Authorization, error) {
    expires := am.Expires.UTC().UnixNano()
    id := fmt.Sprintf("%d", am.ID)
    status := uintToStatus[am.Status]
    pb := &corepb.Authorization{
        Id: &id,
        Status: &status,
        Identifier: &am.IdentifierValue,
        RegistrationID: &am.RegistrationID,
        Expires: &expires,
    }

Boulder (specifically boulder-ra) determines that a given FQDN requires CAA rechecking and it uses the Identifier field from the authorization object, which was incorrect due to the above commit. So the way that the identifier field is handled, it would be the same for all values in a single map. So for instance, if you had multiple authorizations that required CAA rechecking, boulder-ra would recheck only one FQDN and not the others.

A bad bug indeed.

What was happening in this Let's Encrypt CAA bug?

1 Answers1