Thursday, September 8, 2022

[SOLVED] Create a file of a specific size with random printable strings in bash

September 08, 2022 bash, random, string

Issue

I want to create a file of a specific size containing only printable strings in bash.

My first thought was to use /dev/urandom:

dd if=/dev/urandom of=/tmp/file bs=1M count=100
  100+0 records in
  100+0 records out
  104857600 bytes (105 MB, 100 MiB) copied, 10,3641 s, 10,1 MB/s

file /tmp/file && du -h /tmp/file
  /tmp/file: data
  101M  /tmp/file

This leaves me with a file of my desired size, but one that does not contain only printable strings.

Now, I can use strings to create a file only containing printable strings.

cat /tmp/file | strings > /tmp/file.txt
file /tmp/file.txt && du -h /tmp/file.txt 
  /tmp/file.txt: ASCII text
  7,0M  /tmp/file.txt

This leaves me with a file containing only printable strings, but with the wrong file size.

TL;DR

How can I create a file of a specific size, containing only printable strings, in bash?

Solution

The correct way is to use a transformation such as base64 to convert the random bytes to characters. That does not discard any of the randomness from the source; it only converts it to another form.
For a file of (slightly more than) 1 megabyte:

dd if=/dev/urandom bs=786438 count=1 | base64 > /tmp/file

The resulting file will contain characters in the range A–Za–z0–9 and +/=.
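One way to confirm this, sketched here under the assumption that standard coreutils (`tr`, `wc`, `head`, `base64`, `mktemp`) are available, is to delete every expected character and count what remains. The helper name `count_unexpected` is made up for this illustration. Newlines are treated as expected because GNU base64 wraps its output into lines by default.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the bytes in a file that are NOT part of the
# base64 alphabet (A-Z, a-z, 0-9, +, /, =) and are not newlines.
count_unexpected() {
    LC_ALL=C tr -d 'A-Za-z0-9+/=\n' < "$1" | wc -c
}

# Small demonstration on a throwaway file.
demo=$(mktemp)
head -c 3000 /dev/urandom | base64 > "$demo"
count_unexpected "$demo"
rm -f "$demo"
```

A count of 0 means that nothing outside the listed characters (and newlines) is present.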

The Size section below explains why the file is a little bigger, and how to get an exact size.

You could add a filter to translate from that list to some other list (of the same size or less) with tr.

cat /tmp/file | tr 'A-Za-z0-9+/' 'a-z0-9A-Z$%'

I have left the = out of the translation because, for a uniform random distribution, it is better not to map it: = is only padding that, when present at all, appears at the very end of the output, so it carries no randomness.
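As a quick check of how such a mapping behaves, the sketch below wraps a 64-to-64 translation in a function (the name `remap` is invented here). It leaves = out of both sets, so padding passes through untouched. Both sets list exactly 64 characters, so each base64 symbol maps to one distinct output symbol.

```shell
#!/usr/bin/env bash
# Hypothetical filter: map the 64 base64 symbols onto another set of
# 64 characters. '=' is absent from both sets, so tr leaves it unchanged.
remap() {
    LC_ALL=C tr 'A-Za-z0-9+/' 'a-z0-9A-Z$%'
}

# A-Z map to a-z, '+' maps to '$', '/' maps to '%', '=' is untouched.
printf 'ABC+/=\n' | remap    # prints: abc$%=
```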

Size

The file grows by a factor of 4/3 relative to the number of bytes read from /dev/urandom. That is because each byte (256 possible values) is re-encoded using only 64 different characters: 6 bits are taken from the stream of bytes to encode each character. When 4 characters have been encoded (6*4=24 bits), exactly three bytes have been consumed (8*3=24 bits).
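The 3-bytes-to-4-characters grouping can be seen on tiny inputs. The following is only an illustrative sketch; the function name `encoded_len` is an assumption of this example, and it computes the encoded length without counting any newlines that base64 adds.

```shell
#!/usr/bin/env bash
# 3 input bytes become exactly 4 base64 characters, with no padding.
printf 'abc' | base64    # YWJj

# 1 or 2 leftover bytes are padded with '=' to complete a group of 4.
printf 'a' | base64      # YQ==
printf 'ab' | base64     # YWI=

# Hypothetical helper: encoded length (newlines not counted) for n input
# bytes, i.e. 4 * ceil(n / 3).
encoded_len() {
    echo $(( ( ($1 + 2) / 3 ) * 4 ))
}

encoded_len 786438       # 1048584
```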

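So the number of bytes read should be a multiple of 3 (to avoid = padding), and the encoded size is then a multiple of 4. Working with a file size that is a multiple of 12 keeps both the file size and the read size (3/4 of it) whole numbers.
Rather than trying to hit exactly 1024 bytes (1k) or 1024*1024 = 1,048,576 bytes (1M), neither of which is a multiple of 3 (and GNU base64 also inserts a newline after every 76 characters by default), we can produce a file a little bigger and truncate it (if such precision is needed):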

wanted_size=$((1024*1024))
file_size=$(( ((wanted_size/12)+1)*12 ))
read_size=$((file_size*3/4))

echo "wanted=$wanted_size file=$file_size read=$read_size"

dd if=/dev/urandom bs=$read_size count=1 | base64 > /tmp/file

truncate -s "$wanted_size" /tmp/file

The last step to truncate to the exact value is optional.
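The same steps can be collected into a function. This is a minimal sketch, not part of the original answer: the name `make_printable_file` is hypothetical, and it assumes GNU coreutils, using `head -c` in place of `dd` and `base64 -w 0` to switch off line wrapping so that no newlines are mixed into the output.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: make_printable_file SIZE PATH
# Writes exactly SIZE bytes of base64 characters to PATH.
# Assumes GNU coreutils (head -c, base64 -w, truncate).
make_printable_file() {
    local wanted_size=$1 out=$2
    # Round up to the next multiple of 12, as in the steps above.
    local file_size=$(( (wanted_size / 12 + 1) * 12 ))
    local read_size=$(( file_size * 3 / 4 ))
    # -w 0 disables line wrapping, so the output is one long line.
    head -c "$read_size" /dev/urandom | base64 -w 0 > "$out"
    # Cut the slightly oversized file down to the requested size.
    truncate -s "$wanted_size" "$out"
}

# Demonstration on a throwaway file: request 1 MiB and print the size.
demo=$(mktemp)
make_printable_file $((1024*1024)) "$demo"
wc -c < "$demo"
rm -f "$demo"
```

Because the read size is a multiple of 3, no = padding is produced, and the truncation only removes a few surplus characters from the end.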

Randomness generation

As you are going to extract so many random values, read them from /dev/urandom and not from /dev/random; otherwise your app may be blocked for a long time and the rest of the computer will be left without randomness.

I recommend installing the package haveged, if that is possible:

haveged uses HAVEGE (HArdware Volatile Entropy Gathering and Expansion) to maintain a 1M pool of random bytes used to fill /dev/random whenever the supply of random bits in /dev/random falls below the low water mark of the device.

Answered By - user8017719

Answer Checked By - Pedro (WPSolving Volunteer)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0