/ DOTNET

Fare (xeger) - generate string that matches regex pattern in C#

Matching text using regex patterns is extremely useful. But sometimes we need to invert this process and create text which will match the regex. The solution has already been written.

When I faced this problem for the first time, and quick search on the internet didn’t bring up anything useful, I thought:

I will implement simple generator. I know basic rules of how regex patterns work, I just need to invert them with correct random value.

Hey, I’m a software developer, so called Not Invented Here Syndrome is our frequent problem :) Looking through the internet I can tell that I’m not the only one who had the same idea about generating regex inputs.

So I started implementation, but after a while I realized that I don’t have enough time to satisfy all cases, so I focused only on basics, which could be enough for my problem.

Luckily I decided to repeat internet research, this time putting bigger effort into it. There were no many articles around this topic, but this time I found solution which was easy to use and better than my own implementation.

Fare project

Fare is a .NET port of libraries appreciated by Java community: dk.brics.automaton and xeger. The first library has focus in nondeterministic and deterministtic finite automaton. On top of it is built xeger, which is meant to be opposite of regex - it uses finite state machine for creating text which match passed regular expression.

In Java world you need them both. In .NET we have them under one project called Fare, which stands for Finite Automata and Regular Expressions.

It’s convenient library known by many companies as well as open source projects, for example AutoFixture or WireMock.Net.

Fare Hello world example

You can download Fare from nuget, it works for .NET Framework and for .NET Core. For example, I’m writing following instructions with Fare version 2.1.1 on Linux Mint 19 Tara in .NET Core 2.2 console application.

You just need to create Xeger object with your regex pattern (as string) and with Random object.

string regex = "t.m";

Xeger xeger = new Xeger(regex, new Random());

Console.WriteLine($"Input text matching regex: '{regex}' is: '{xeger.Generate()}'");

/* Example output:
Input text matching regex: 't.m' is: 't8m'
*/

Basically that’s it. In above example input will be random every time. To see how generated inputs might differ, let’s replace t.m pattern with t.*m pattern and loop generating ten times. As you will see, with .* pattern results might be any size, including 0 (see last input in my example, it’s real value copied from my console output).

string regex = "t.*m";

Xeger xeger = new Xeger(regex, new Random());

for(var i=0; i<10; i++)
    Console.WriteLine($"Input text matching regex: '{regex}' is: '{xeger.Generate()}'");
                
/* Example output:
Input text matching regex: 't.*m' is: 'trmYwmm'
Input text matching regex: 't.*m' is: 'tsm!tw-m:s}mm'
Input text matching regex: 't.*m' is: 't}}*3molmrxBGmmss-mm'
Input text matching regex: 't.*m' is: 'tLy4mmmm'
Input text matching regex: 't.*m' is: 'tCkm=!?iI|mm"LmmV}wm"63mmnb.G+mxzumNm`wn[m'
Input text matching regex: 't.*m' is: 'tux@mByyQ~8vxm'
Input text matching regex: 't.*m' is: 'tn|mmu npmj/~w#mmmmmm9mm'
Input text matching regex: 't.*m' is: 't}Zm'
Input text matching regex: 't.*m' is: 'tomX}8kkV{j)x<S}_mTmm'
Input text matching regex: 't.*m' is: 'tm'
*/

Deterministic Fare input

When we pass new Random() to Xeger constructor, Fare generates random input. Usually that’s what we want, but sometimes we need to have the same input for given regex every time (that was my production case). How to achieve that? You can simply pass constant seed to dotnet Random constructor, for example 0 :) That’s it, now under the same runtime your input generation sequence will be the same.

string regex = "t.*m";

Xeger xeger = new Xeger(regex, new Random(0)); // Note zero in Random constructor

for(var i=0; i<10; i++)
    Console.WriteLine($"Input text matching regex: '{regex}' is: '{xeger.Generate()}'");
                                
/* Output sequence in my environment will be every run the same:
Input text matching regex: 't.*m' is: 'tm'
Input text matching regex: 't.*m' is: 't~6x~bm^m'
Input text matching regex: 't.*m' is: 'tmoBlz=~z5mC1zvcmmqm'
Input text matching regex: 't.*m' is: 't|lym'
Input text matching regex: 't.*m' is: 'teDsmmmmy6m'
Input text matching regex: 't.*m' is: 'tmmmm'
Input text matching regex: 't.*m' is: 't~pm'
Input text matching regex: 't.*m' is: 't|5mmrw5|ommxmpNlo`x~G^wym'
Input text matching regex: 't.*m' is: 't%+u=mm~mrmmkm*mmmlU/w7\OyncmrDdx<lnm'
Input text matching regex: 't.*m' is: 't$X`w`mm:m~o~m'
*/

Benchmarks

Usually fare is quite fast and should be sufficient for production requirements. I have created a few benchmarks, with usual process priority, just for orientation. As you will see, reusing xeger is really fast (and generation time is repeatable) and creating xeger requires acceptable time as well. But remember that xeger with complex regex needs much more time than simple one, for example creating xeger with regex matching email address (taken from regextester.com) takes for me 50 milliseconds: ^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$.

Measured operation Preparation Mean time
new Xeger(_regex, _random); _random = new Random(); _regex = "t"; 274.6 ns (nanos.)
new Xeger(_regex, _random); _random = new Random(); _regex = "t.m"; 105.9 us (micros.)
new Xeger(_regex, _random); _random = new Random(); _regex = "t.*m"; 160.7 us (micros.)
new Xeger(_regex, _random); _random = new Random(); _regex = "alice|tom"; 343.6 us (micros.)
new Xeger(_regex, _random); _random = new Random();
_regex = "[a-z.!#$%&'*+=?^_`{|}~-]+@.*(?:[a-z]{0,61}.*)?(?:\\..*(?:[a-z]{0,61}[a-z])?)*";
9.042 ms (millis.)
_xeger.Generate(); _xeger = new Xeger("t", new Random()); 650.5 ns (nanos.)
for(int i=0; i<1000; i++) _xeger.Generate(); _xeger = new Xeger("t", new Random()); 656.0 us (micros.)
_xeger.Generate(); _xeger = new Xeger("t.m", new Random()); 1.420 us (micros.)
for(int i=0; i<1000; i++) _xeger.Generate(); _xeger = new Xeger("t.m", new Random()); 1.437 ms (millis.)
_xeger.Generate(); _xeger = new Xeger("t.*m", new Random()); 9.989 us (micros.)
for(int i=0; i<1000; i++) _xeger.Generate(); _xeger = new Xeger("t.*m", new Random()); 10.05 ms (millis.)
_xeger.Generate(); _xeger = new Xeger("alice|tom", new Random()); 2.217 us (micros.)
for(int i=0; i<1000; i++) _xeger.Generate(); _xeger = new Xeger("alice|tom", new Random()); 2.183 ms (millis.)
_xeger.Generate(); _xeger = new Xeger("[a-z.!#$%&'*+=?^_`{|}~-]+@.*(?:[a-z]{0,61}.*)?(?:\\..*(?:[a-z]{0,61}[a-z])?)*"
, new Random());
8.435 us (micros.)
for(int i=0; i<1000; i++) _xeger.Generate(); _xeger = new Xeger("[a-z.!#$%&'*+=?^_`{|}~-]+@.*(?:[a-z]{0,61}.*)?(?:\\..*(?:[a-z]{0,61}[a-z])?)*"
, new Random());
8.160 ms (millis.)

BenchmarkDotNet=v0.11.4, OS=linuxmint 19 Intel Core i7-4702MQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.2.105 [Host] : .NET Core 2.2.3 (CoreCLR 4.6.27414.05, CoreFX 4.6.27414.05), 64bit RyuJIT

Downsides of Fare

Pattern with unspecified length might take long time to generate

Sometimes Fare wants to generate extremally long phrase for pattern which doesn’t specify length. I have seen cases for which xeger needed 10minutes to generate! Unfortunately for now there is no way to configure it in Fare (but it’s open source, you can improve it 😺). You can assume that it’s nondeterministic, because you’ll never know for which regex and with which random (or not random) value it will happen. What I do? Before I pass pattern to Xeger, I remove every * character. Asterisk means 0 or more characters, so this way it will always be one, which is correct, but of course it’s trade-off for more limited results. I know that there are still other possible regexs of unknown length, for example \w{3,} or \d+ but in my case it’s enough, I don’t expect such patterns. If it’s not enough in your case, then you should process other possibilities too, to remove every option which takes any length, or not to remove, but to provide simple values for those basic rules. Or fork Fare and allow to configure this behaviour in it :)

Always try..catch, and be prepared for bugs and exceptions

Unfortunately Fare has bugs. Sometimes it crashes, even with simple patterns. For example creating xeger with pattern <a.*b> will cause System.ArgumentException: 'a.*b' not found. Once I have seen a bug, which allowed creating not valid input. It turned out that Xeger treated multiplied start/end of line marks like regular characters. Input would be fine, but it had ^ or $ character at the beginning and/or at the end. Such pattern came to my system only once, but this particular one was important to me, so I changed my code to check whether generated input is valid and if it’s not, then try to fix it with trimming every ^ and $ characters. I know that it’s naive, but if it doesn’t help I just skip the pattern, it is allowed in my case (in general) - not every regex need to be inverted, occasionally I can skip one and it’s fine. You can reproduce this bug this way: new Xeger("^0$|^1$", new Random()) (in my environment it’s always 0$ or ^1).

You may find unsupported pattern

As you know regex is wide and you might find construction not implemented by xeger. But majority of patterns are satisfied and to be honest - I can’t tell which are not. There is no documentation in Fare project, and the one in Java’s xeger is not up to date, because I have checked some limitations they have mentioned and those not supported patterns (as they say) work properly… So there are probably some not implemented cases, but it’s minority :)

Summary

So to summarise, fare is a great library and does the job but… not entire job. You need to add your own few lines of code and get prepared for exceptional situations and bugs. In my case it’s enough and loosing input occasionally is acceptable in my situation. If it’s not acceptable in yours, I still recommend using Fare but with fixing/extending it first. That way or another it’s cheaper to start from theirs code base than writing everything from scratch. They’re open source so you can share your work with pull request, to help others and bring yourself a bit of splendour :)

tometchy

Tometchy

Passionate focused on agile software development and decentralized systems

Read More