Friday, May 27, 2022

[SOLVED] Extract JSON from a Large Text File

Issue

Introduction

Hi! I'm trying to extract JSON from a 300K-line text file that mixes plain-text output with JSON from HTTP results. At that size, pulling the JSON out by hand isn't practical.

Problem

I don't have much choice: I probably need to fix it with a command-line tool. Here's what the file looks like:

[2K  100.00% - C: 164148 / 164149 - S: 263 - F: 3686 - dhcp-140-247-148-215.fas.harvard.edu:443 - id3.sshws.me
[2K  100.00% - C: 164149 / 164149 - S: 263 - F: 3686 - public-1300503051.cos.ap-shanghai.myqcloud.com:443 - id3.sshws.me
[2K
[
  {
    "Request": {
      "ProxyHost": "pro.ant.design",
      "ProxyPort": 443,
      "Bug": "pro.ant.design",
      "Method": "HEAD",
      "Target": "id3.sshws.me",
      "Payload": "GET wss://pro.ant.design/ HTTP/1.1[crlf]Host: [host][crlf]Upgrade: websocket[crlf][crlf]"
    },
    "ResponseLine": [
      "HTTP/1.1 101 Switching Protocol",
      "Server: cloudflare"
    ]
  },
  {
    "Request": {
      "ProxyHost": "industrialtech.ft.com",
      "ProxyPort": 443,
      "Bug": "industrialtech.ft.com",
      "Method": "HEAD",
      "Target": "id3.sshws.me",
      "Payload": "GET wss://industrialtech.ft.com/ HTTP/1.1[crlf]Host: [host][crlf]Upgrade: websocket[crlf][crlf]"
    },
    "ResponseLine": [
      "HTTP/1.1 101 Switching Protocol",
      "Server: cloudflare"
    ]
  }
]

There are several problems with using a regex here:

  • The file contains multiple JSON objects

  • The text that isn't part of the JSON also contains [ and :
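
To make the second point concrete, here is a small sketch showing that a pattern anchored on [ alone cannot separate the progress lines from the JSON array opener. The file name overlap.txt and the sample host are made up for illustration.

```shell
# Hypothetical two-line sample: one progress line and the JSON array opener.
printf '%s\n' '[2K  100.00% - C: 1 / 1 - host.example:443' '[' > overlap.txt

# Both lines start with "[", so this pattern alone cannot tell them apart.
starts_with_bracket=$(grep -c '^\[' overlap.txt)

# Only the JSON opener is a "[" with nothing after it on the line.
bare_bracket=$(grep -c '^\[$' overlap.txt)

echo "start with [: $starts_with_bracket, bare [: $bare_bracket"
```

What distinguishes the two kinds of line is the character that follows the bracket, not the bracket itself.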

I ran into these problems when trying a sed regex:

sed '/^[/,/^]/!d'

Solution

You can remove all lines that start with [ followed by a non-whitespace character. The JSON array's opening [ sits alone on its line, so it is kept:

sed '/^\[[^[:space:]]/d' file > newfile

Details:

  • ^ - start of a line
  • \[ - [ char
  • [^[:space:]] - any single non-whitespace character.
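
As a quick check, the command can be tried on a small sample before running it on the full file. This is only a sketch: sample.txt, clean.json, and the host names are hypothetical, and the sample assumes the progress lines literally begin with [ as shown in the question.

```shell
# Hypothetical sample: two progress lines, the bare "[2K" line, then a tiny JSON array.
cat > sample.txt <<'EOF'
[2K  100.00% - C: 1 / 2 - S: 0 - F: 0 - host-a.example:443 - target.example
[2K  100.00% - C: 2 / 2 - S: 0 - F: 0 - host-b.example:443 - target.example
[2K
[
  {
    "ProxyPort": 443
  }
]
EOF

# Delete every line that starts with "[" followed by a non-whitespace character.
# The lone "[" and the indented JSON lines do not match, so they pass through.
sed '/^\[[^[:space:]]/d' sample.txt > clean.json

cat clean.json
```

If the result looks right, the output can be passed to a JSON parser of your choice to confirm that it is valid.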


Answered By - Wiktor Stribiżew
Answer Checked By - Mary Flores (WPSolving Volunteer)