Using hack_url_re to Auto Detect Website Spoofing Vulnerabilities

Andrew Lee

Phishing attacks often spoof websites in order to steal passwords, tricking users into entering credentials into a website that looks identical to the one they routinely access. To avoid such trickery, account holders can entrust their passwords to password managers like Okta’s SWA plugin, which are not fooled by visual similarity. However, if the code for identifying the website contains any flaws, attackers could exploit them to continue stealing passwords. Recognizing this threat, the Okta Research and Exploitation (REX) team has created a tool, hack_url_re, to automatically find vulnerabilities to such attacks so that they may be remediated.

Password managers need to identify the account domain in order to submit credentials to the correct websites. They sometimes apply pattern matching with regular expressions[1], abbreviated as “regex” and sometimes simply referred to as “patterns”, to website addresses (URLs) in order to identify them. Regexes are easy to use but difficult to apply correctly in a security context, especially to URLs. A common mistake concerns the dot (".") character, which parses as a regex wildcard. For example, the regex represented by the string “a.” matches “aa”, “ab”, “ac”, and many others. To indicate a literal dot, the developer must prefix it with a backslash ("\")[2]. If a regex contains a mistake, the app may misidentify websites. For example, `^https://m.example\.com$` is supposed to accept only `https://m.example.com`, but it also accepts `https://mxexample.com`. An attacker could buy `mxexample.com` and a certificate for it, then masquerade as `example.com` to the password manager. The password manager would mistake the attacker's site for the legitimate one and send it the user's passwords.
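To make the pitfall concrete, here is a minimal Python demonstration of the flawed pattern above, using the standard `re` module:

```python
import re

# The unescaped dot is a wildcard: this pattern is meant to accept only
# "https://m.example.com", but it also accepts an attacker-purchasable domain.
flawed = re.compile(r"^https://m.example\.com$")
print(bool(flawed.match("https://m.example.com")))  # True, as intended
print(bool(flawed.match("https://mxexample.com")))  # True: spoofable!

# Escaping the dot removes the wildcard behavior.
fixed = re.compile(r"^https://m\.example\.com$")
print(bool(fixed.match("https://mxexample.com")))   # False
```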

The hack_url_re tool is an automated solution for analyzing URL regexes in order to discover vulnerable ones that should be patched. It compiles the regex and a set of attack definitions into logical constraints and uses the Z3 solver to search for attacks. It concludes one of the following:

  • "sat", meaning there is an attack, in which case it emits a "witness".
  • "unsat", meaning there is no attack—within the attack definitions given to it.
  • "unknown"; in the current version of the tool, this has not been observed. In practice, the "unknown" conclusion has only showed up when there was an error in the constraints.

Along with the conclusion, hack_url_re emits a "witness" if it finds a weakness, which can easily be used to check the validity of the issue and to construct an actual proof-of-concept attack. The "witness" is also useful for fixing the regex to prevent the attack. In practice, the false positive rate has been negligible.

Technical Background on Domain Names

The Fully Qualified Domain Name (FQDN) is the part of a URL after the protocol and before the path. The root domain comprises the top-level domain and the second-level domain[3]. For example, in “http://sub2.sub1.example.com/some/path”, the FQDN is “sub2.sub1.example.com”, and the root domain is “example.com”. Each root domain is registered by some entity; for example, okta.com is registered by Okta, and any of its subdomains are also owned by Okta. In hack_url_re, the attack definitions include being able to match a large number of domains against a regex, but hack_url_re only considers it a vulnerability if the attacker can control the root domain. It would be easy to change the tool to be strict about the entire FQDN, but this would increase the false-positive rate.
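As a rough illustration of these terms, here is a naive Python sketch that extracts the FQDN and a two-level root domain from a URL (it ignores country-code TLDs like `example.co.uk`; see footnote [3]):

```python
from urllib.parse import urlparse

def fqdn_and_root(url):
    """Return the FQDN and a naive two-level root domain for a URL."""
    fqdn = urlparse(url).hostname  # the part after the protocol, before the path
    return fqdn, ".".join(fqdn.split(".")[-2:])  # last two labels

print(fqdn_and_root("http://sub2.sub1.example.com/some/path"))
# ('sub2.sub1.example.com', 'example.com')
```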

Developments

The automated hacker translates the regex and attack definitions into logical constraints and applies the Microsoft Z3 Prover to find solutions. Z3 is a Satisfiability Modulo Theories (SMT) solver, meaning it tries to find solutions to the constraints with respect to theories; in this case, a theory defining the properties of byte strings. If the constraints cannot be solved, then no attack is possible within the given attack definitions.
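For a taste of what such constraints look like, here is a hand-written Z3 encoding (via the Python bindings) of the flawed pattern from the introduction. This is only an illustration; it is not how hack_url_re's own compiler builds its constraints:

```python
from z3 import (Solver, String, StringVal, InRe, Re, Concat,
                AllChar, ReSort, StringSort, sat)

# Encode ^https://m.example\.com$ as a Z3 regex; the unescaped dot
# becomes an "any character" term.
any_char = AllChar(ReSort(StringSort()))
flawed = Concat(Re("https://m"), any_char, Re("example.com"))

url = String("url")
s = Solver()
s.add(InRe(url, flawed))
s.add(url != StringVal("https://m.example.com"))  # demand an unintended match

if s.check() == sat:
    print(s.model()[url])  # e.g. "https://m!example.com": a spoofing witness
```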

Generally, if a straightforward translation results in poor performance, the constraints can be changed in a way that preserves satisfiability as much as possible. Preserving satisfiability means that if there is an attack under the new constraints, then there is an attack under the original ones (soundness), and if there is an attack under the original constraints, then there is an attack under the new ones (completeness).

The first step is to parse the regex. Thankfully, Blukat29’s open-source regex-crossword-solver already had a regex parser that needed only a few modifications to support the regex features we required, such as arbitrarily located carets (“^”) and dollars (“$”), non-capturing groups (“(?:)”), and unions with empty patterns (e.g., “(a|)”). In fact, hack_url_re was inspired by and started as a fork of regex-crossword-solver, but after the parsing step, little of the original logic remains.

Next, the tool preprocesses the AST to make it more convenient and faster to analyze. The simplifying transformations include replacing Kleene Star quantifiers with a finite number of choices and reducing the size of union patterns[4]. This greatly improves efficiency without affecting the validity of results. In technical terms, the simplification leads to a new set of constraints that is equisatisfiable with the original set.

Then, the tool compiles the AST into logical rules on strings. The compilation is straightforward except when handling the global constraints arising from Dollar and Star. This involves associating constraints with string variables representing solutions to subexpressions of the regex. These constraints trigger whenever the corresponding string variable shows up in the solution string for the entire regex.

Next is to define what a root domain is and what an attack is. To give a taste of what a logical definition of a root domain looks like, the following shows the main logic defining a root domain with two levels[5]:

[[{"fid":"516151","view_mode":"default","fields":{"format":"default","field_file_image_alt_text[und][0][value]":false,"field_file_image_title_text[und][0][value]":false},"type":"media","field_deltas":{"3":{"format":"default","field_file_image_alt_text[und][0][value]":false,"field_file_image_title_text[und][0][value]":false}},"attributes":{"class":"media-element file-default","data-delta":"3"}}]]

The initial attack strategy was to find N (in practice, 10) different root domains that match the regex. If the regex accepts many different root domains, then it is likely to accept a domain that the developer did not intend. This method simply asks Z3 to re-solve for a new root domain until it produces N root domains or fails to produce a new one. This strategy is still used in the “full attack”.
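A sketch of that loop with the Python Z3 bindings, assuming a solver already loaded with the compiled regex and root-domain constraints:

```python
from z3 import StringVal, sat

def find_n_root_domains(solver, root, n=10):
    """Re-solve repeatedly, blocking each root domain found, until n
    distinct root domains are collected or the constraints run dry.
    `root` is the string variable holding the root domain."""
    found = []
    while len(found) < n and solver.check() == sat:
        value = solver.model()[root].as_string()
        found.append(value)
        solver.add(root != StringVal(value))  # block it; force a new one
    return found
```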

With the “find N root domains” attack definition, and without any simplifications, the tool originally ran very slowly. Therefore, simplifying transformations were introduced that preserve the soundness and completeness of results: if there is an attack, the simplified method will still find it, and any attack it finds is one the original method would also have found.

The main speed-up came from bounding Kleene Star ('*') and Kleene Plus ('+'). The bounds are configurable, but in practice the quantifiers were replaced with a constraint accepting just two choices: 0 or 4 repetitions for Star, and 1 or 5 for Plus. The number 4 was chosen for Kleene Star because 4 is the minimum length needed to demonstrate control over the FQDN if the regex allows an arbitrary prefix: the protocol requires at least 1 character, the “//” separator between the protocol and the FQDN needs 2, and the FQDN itself needs at least 1 character, for a 4-character minimum.
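A sketch of the bounding transformation over a simple tuple-based AST (this representation is illustrative, not hack_url_re's actual AST):

```python
STAR_CHOICES = (0, 4)  # '*' becomes 0 or 4 repetitions
PLUS_CHOICES = (1, 5)  # '+' becomes 1 or 5 repetitions

def repeat(node, n):
    return ("empty",) if n == 0 else ("concat", [node] * n)

def bound_quantifiers(node):
    """Replace unbounded Star/Plus with a union of fixed repetition counts."""
    kind = node[0]
    if kind in ("star", "plus"):
        inner = bound_quantifiers(node[1])
        choices = STAR_CHOICES if kind == "star" else PLUS_CHOICES
        return ("union", [repeat(inner, n) for n in choices])
    if kind in ("concat", "union"):
        return (kind, [bound_quantifiers(child) for child in node[1]])
    return node  # literals, character sets, anchors, ...
```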

Character sets containing many characters were also reduced. If you are only interested in 10 solutions, it doesn't matter whether you include "a" to "z" or just "a" to "j".[6]

Quick Attack

Another speed-up comes at the cost of completeness. That is, it does not find all attacks that the unapproximated method can find, but since the speed-up can be severalfold, it is very useful as a quick check for a common class of vulnerabilities. Furthermore, the loss of completeness can be made up later by searching only for attacks not covered by the incomplete method.

This “quick attack” approximation replaces character sets and modifies attack definitions while preserving the validity of almost any attack it finds. This method does introduce some false positives that have, in practice, corresponded to unexploitable issues which we decided were still worth fixing (examples are provided below). The wildcard (".") was replaced with a configurable illegal URL character, which in practice was set to "<". This character was excluded from all other character sets. For example, the pattern "[^a]"[7] did not accept "<" under the modified ruleset.
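A sketch of this substitution over the same illustrative AST, including the complemented-set case from the footnote:

```python
TAINT = "<"  # a configurable illegal URL character

def taint_wildcards(node):
    """Turn wildcards into the taint character and exclude it from
    every other character set."""
    kind = node[0]
    if kind == "dot":                      # the "." wildcard
        return ("lit", TAINT)
    if kind == "charset":                  # ("charset", {accepted chars})
        return ("charset", node[1] - {TAINT})
    if kind == "negset":                   # ("negset", {rejected chars}), e.g. [^a]
        return ("negset", node[1] | {TAINT})
    if kind in ("concat", "union"):
        return (kind, [taint_wildcards(child) for child in node[1]])
    if kind in ("star", "plus"):
        return (kind, taint_wildcards(node[1]))
    return node
```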

The “quick attack” strategy searches for solutions to the modified constraints where wildcard taint markers appear in the protocol or domain, and reports these as vulnerabilities. That is, if the regex accepts a string with the taint character, "<", in the FQDN, then by replacing the taint character with a valid URL character, one can construct a valid attack. This is because if a regex pattern accepts "<" in the FQDN, then either the pattern itself contains a literal "<", or it contains a wildcard. In the first case, the pattern has invalid URL literals and should be fixed. In the latter, it is very likely vulnerable to a real attack. Here are a few examples:

  • In "https://m<example.com", if "<" is replaced with "y", then the attacker can buy "myexample.com" and steal passwords intended for "example.com".
  • In "https://sub<<<<.example.com", the attacker can replace "<<<<" with ".co/", so that ".example.com" becomes part of the path, and the domain becomes "sub.co".
  • In "<<<<<<<<<<<<<https://.example.com", the attacker can replace "<<<<<<<<<<<<<" with "https://a.co/", so that ".example.com" becomes part of the path, and the domain becomes "sub.co". (Note that this attack usually means the regex did not have a "^" at the beginning.)

Here are some examples of attacks that have low exploitability or are unexploitable in the real world, but are worthwhile to fix:

  • In "https://sub<.example.com", the attacker can replace "<" with "/", to change the FQDN to "sub". This isn't a public domain one can buy8, nor would it be easy to obtain a certificate for it, if the attacker spoofs DNS. Still, a mistake could be made, allowing an attacker to register and certify "sub" as a domain, or a CA for an internal corporate network could issue a cert for "sub", since internal domains. Therefore the tool reports this as an attack.
  • In “https://<.example.com”, the attacker can replace “<” with “/” again, making the FQDN empty. Though this is not exploitable, and it would be easy to modify the tool to not report it as an issue, it still clearly indicates a mistake that was deemed worth noting.

A crucial feature of the tool is that it will report the attack string it found as a "witness", so the user can validate the attack, possibly using a simple script. (Note that when reporting witnesses, the tool replaces all "<" with a skull emoji to help the wildcards stand out.)
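Such a validation script might look like this (a sketch, not the tool's own validator):

```python
import re

SKULL = "\N{SKULL}"  # the emoji hack_url_re substitutes for taint markers

def validate_witness(pattern, witness, replacement="x"):
    """Replace the taint markers in a reported witness with a plausible
    URL character and confirm the original regex still accepts it."""
    return re.search(pattern, witness.replace(SKULL, replacement)) is not None

# The flawed pattern from the introduction, with a hypothetical witness:
print(validate_witness(r"^https://m.example\.com$",
                       f"https://m{SKULL}example.com"))
# True: "https://mxexample.com" is accepted, confirming the attack.
```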

With all wildcard-based attacks out of the way, we may want to find the attacks that the quick attack misses. For example, the pattern "^https://[a-z]example\.com$" accepts domains like "aexample.com" but makes no use of dot wildcards. In practice, issues that do not arise from misuse of dots are rare, but it is still best to be safe and look for them. Therefore, after searching for dot-based attacks, the tool falls back to the original "find N root domains" method. However, this time it does not consider dots at all, replacing them all with the empty string. In this way, it does not need to consider attacks already covered by the quick attack.

Caveats

There are a few regex and URL features that were not used in the tested regexes, so they are not (yet) implemented:

  • Basic auth
  • Lookaheads and lookbehinds
  • Group references

Performance and Accuracy

The quick method can check about 100 regexes per hour using 6 threads on a MacBook Pro. The full attack's runtime varies greatly depending on the input, but it can take about 4 to 10 times longer.

The false positive rate was negligible during testing on real URL regexes.

The Okta REX team is sharing the experience of creating the tool and making the source code available to inspire and help inform the development of new automated security tools. We hope it may also be more directly useful for others who face similar problems with identifying websites using regular expressions.

Footnotes

[1] A regex is a pattern that can be applied to strings to see if the string matches the pattern. For example, the regex encoded by "[abc]-xyz" accepts the strings "a-xyz", "b-xyz", and "c-xyz". For more details, see https://en.wikipedia.org/wiki/Regular_expression.

[2] This is called "escaping" the dot: the backslash lets the dot escape its usual parsing rules, so it is treated literally.

[3] There is also an optional DNS root, which appears as an extra dot at the end of the FQDN. This is an obscure part of the FQDN definition. Additionally, it is possible for the root domain, the domain registered to an entity, to include a third-level domain if the top-level domain is a country code, for example, in `example.co.uk`. Attacks involving three-level root domains are implemented in hack_url_re, but are not discussed in detail in this blog post, as handling them is a straightforward extension of handling two-level root domains.

[4] Union patterns are those that accept various options. For example, “(a|b|c)” accepts “a”, “b”, or “c”.

[5] The actual code is somewhat more complicated in order to also handle three-level root domains.

[6] Assuming no regex lookaheads.

[7] The regex pattern “[^a]” accepts all characters besides “a”. The “^” is a meta-character indicating set complement, i.e., “[^abc]” can be read as “a character that is not ‘a’, ‘b’, or ‘c’”.

[8] Public domains require a dot and a TLD: "example.com" is valid, but "example" is not. Furthermore, as of this writing, there is no valid single-letter TLD, although there is no rule against one. Domain naming rules are complex, and in the name of efficiency and simplicity, the tool refrains from defining them in fine detail.

Andrew Lee
Security Engineer

Andrew Lee is a security engineer on Okta’s Research and Exploitation (REX) team, where he works with developers to create secure solutions by employing a mix of manual and automated techniques. He is curious about the evolution of all complex systems—from the living to the artificial—and how they develop defenses against exploitation. With a focus on automated reasoning and cryptography, Andrew’s research explores methods that distinguish between the benign and the malicious.