Thursday, November 30, 2023

[SOLVED] Find all unique occurrences of first 3 char, count number of occurrences and write output to a file

November 30, 2023 grep, powershell, regex, switch-statement, windows

Issue

I have a text file with over 250 million lines. Each line has a 3 digit area code followed by a comma and a 7 digit number.

Sample Input File:
201,2220000
201,5551212
/> 310,5552481
376,1239876
443,0002222
572,8880099
...

I would like to generate an output file which lists each unique area code and the number of occurrences of that area code (only looking at the first 3 characters of each line).

Example output (area code, count):
201, 44556
202, 34529
...

I am working in a Windows 10 environment.

After considerable research, I was able to use the Switch function with regex in PowerShell to achieve something very close. The problem with this solution is that I need to know which area codes I am looking for (and I don't know all the area codes listed in this file).

I would like to modify the solution such that it finds all unique area codes and then run the code.

Here's what I have tried:

Say, I want to search for the following four area codes: 201,202,203,205
My text file is datafile.txt

$count1 = 0
$count2 = 0
$count3 = 0
$count4 = 0
switch -File C:\datafile.txt -Exact -Regex { '201\S{8}' { ++$count1 } }
Write-Output "Area Code 201: $($count1)" | Format-Table | Out-File "C:\summary.txt" -append
switch -File C:\datafile.txt -Exact -Regex { '202\S{8}' { ++$count2 } }
Write-Output "Area Code 202: $($count2)" | Format-Table | Out-File "C:\summary.txt" -append
switch -File C:\datafile.txt -Exact -Regex { '203\S{8}' { ++$count3 } }
Write-Output "Area Code 203: $($count3)" | Format-Table | Out-File "C:\summary.txt" -append
switch -File C:\datafile.txt -Exact -Regex { '205\S{8}' { ++$count4 } }
Write-Output "Area Code 204: $($count4)" | Format-Table | Out-File "C:\summary.txt" -append

This code generates the file summary.txt and appends the counts to the area codes. However, I think this is inefficient as:

I need to know all the area codes that are in this datafile.
I have to add 3 lines of code for every additional area code.

Would appreciate any help improving this code or for using an alternate solution (I found a thread on Stackoverflow that uses grep https://stackoverflow.com/questions/61229157/using-regex-in-grep-for-windows-command-line, but it has the same limitation - you need to know what string you are searching for.

Solution

Assuming I understood correctly, there is no regex needed here, just .SubString(0, 3) to get the first 3 characters from each line and a hashtable to ensure unique codes and efficiency.

Indeed, switch -File is awesome for this task and should be used to read your file. Otherwise, for simplicity and also keeping it efficient, you could use File.ReadLines.

$map = @{ }
switch -File path\to\source\file.txt {
    Default {
        $map[$_.Substring(0, 3)] += 1
    }
}

$map.GetEnumerator() | ForEach-Object {
    [pscustomobject]@{
        Code  = $_.Key
        Count = $_.Value
    }
} | Export-Csv path\to\resultOfUniqueCodes.csv -NoTypeInformation

Answered By - Santiago Squarzon

Answer Checked By - Clifford M. (WPSolving Volunteer)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Thursday, November 30, 2023

[SOLVED] Find all unique occurrences of first 3 char, count number of occurrences and write output to a file

Issue

Solution

Popular Posts

Labels