Issue
I have a text file with over 250 million lines. Each line has a 3 digit area code followed by a comma and a 7 digit number.
Sample Input File:
201,2220000
201,5551212
/>
310,5552481
376,1239876
443,0002222
572,8880099
...
I would like to generate an output file which lists each unique area code and the number of occurrences of that area code (only looking at the first 3 characters of each line).
Example output (area code, count):
201, 44556
202, 34529
...
I am working in a Windows 10 environment.
After considerable research, I was able to use the Switch function with regex in PowerShell to achieve something very close. The problem with this solution is that I need to know which area codes I am looking for (and I don't know all the area codes listed in this file).
I would like to modify the solution such that it finds all unique area codes and then run the code.
Here's what I have tried:
- Say, I want to search for the following four area codes: 201,202,203,205
- My text file is datafile.txt
$count1 = 0
$count2 = 0
$count3 = 0
$count4 = 0
switch -File C:\datafile.txt -Exact -Regex { '201\S{8}' { ++$count1 } }
Write-Output "Area Code 201: $($count1)" | Format-Table | Out-File "C:\summary.txt" -append
switch -File C:\datafile.txt -Exact -Regex { '202\S{8}' { ++$count2 } }
Write-Output "Area Code 202: $($count2)" | Format-Table | Out-File "C:\summary.txt" -append
switch -File C:\datafile.txt -Exact -Regex { '203\S{8}' { ++$count3 } }
Write-Output "Area Code 203: $($count3)" | Format-Table | Out-File "C:\summary.txt" -append
switch -File C:\datafile.txt -Exact -Regex { '205\S{8}' { ++$count4 } }
Write-Output "Area Code 204: $($count4)" | Format-Table | Out-File "C:\summary.txt" -append
This code generates the file summary.txt and appends the counts to the area codes. However, I think this is inefficient as:
- I need to know all the area codes that are in this datafile.
- I have to add 3 lines of code for every additional area code.
Would appreciate any help improving this code or for using an alternate solution (I found a thread on Stackoverflow that uses grep https://stackoverflow.com/questions/61229157/using-regex-in-grep-for-windows-command-line, but it has the same limitation - you need to know what string you are searching for.
Solution
Assuming I understood correctly, there is no regex needed here, just .SubString(0, 3)
to get the first 3 characters from each line and a hashtable to ensure unique codes and efficiency.
Indeed, switch -File
is awesome for this task and should be used to read your file. Otherwise, for simplicity and also keeping it efficient, you could use File.ReadLines
.
$map = @{ }
switch -File path\to\source\file.txt {
Default {
$map[$_.Substring(0, 3)] += 1
}
}
$map.GetEnumerator() | ForEach-Object {
[pscustomobject]@{
Code = $_.Key
Count = $_.Value
}
} | Export-Csv path\to\resultOfUniqueCodes.csv -NoTypeInformation
Answered By - Santiago Squarzon Answer Checked By - Clifford M. (WPSolving Volunteer)