---
title: Initial parsing work on large JSON corpus
layout: post
---
Yesterday, I wrote about the challenge of working out which texts in a corpus have decent OCR and, then, which texts they actually are. This morning, I put together a small script that makes a first attempt at this. The script is included below for anybody who is interested.
The basic steps are:
1. Read the CSV and build a dictionary mapping identifiers to titles, stripping all punctuation from each title, discarding titles that end up shorter than three words or six characters, and truncating the rest to their first five words.
2. Read the JSON files from disk one at a time, applying the same stripping transform to the first ten elements of each file.
3. Check whether any transformed title appears within the transformed text of one of those ten elements.
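
To make the stripping transform in steps 1 and 2 concrete, here is a minimal Python 3 sketch. The names `normalise` and `title_key` are hypothetical, and the sample titles are invented for illustration. The character class mirrors the one in the script, but whitespace is collapsed with `split`/`join`, which is an assumption about the intent rather than the script's exact behaviour.

```python
import re

# Hypothetical helpers; the patterns mirror the ones used in the script.
PUNCTUATION = re.compile(r"[.?()\[\],;:'!*“]")
BRACKETED = re.compile(r"\[.+?\]")


def normalise(text):
    """Drop bracketed notes and punctuation, lowercase, and collapse whitespace."""
    text = BRACKETED.sub("", text)
    text = PUNCTUATION.sub("", text)
    return " ".join(text.lower().split())


def title_key(title, max_words=5, min_words=3, min_chars=6):
    """Return the first max_words words of a normalised title, or None if it is too short."""
    cleaned = normalise(title)
    words = cleaned.split(" ") if cleaned else []
    if len(words) < min_words or len(cleaned) < min_chars:
        return None
    return " ".join(words[:max_words])


# Invented sample titles, not taken from the corpus.
print(title_key("The History of England, [microform] from the Accession of James II."))
# the history of england from
print(title_key("Poems."))
# None
```

Truncating to five words keeps the substring test tolerant of OCR noise later in a long title, at the cost of more false positives for titles that share an opening.
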

This currently yields about 15 good titles out of every 1,000 JSON files, a hit rate of roughly 1.5%. That said, these JSON files are not all in English, and many of them have bad OCR.

In any case, I'll continue to refine this and expand the filters as safely as I can.

    # coding=UTF-8
    # Note: this script targets Python 2 (print statement, dict.iteritems).
    import csv
    import re
    import glob
    import json


    def load_csv():
        titles = {}
        # load the CSV file of titles
        with open('/home/martin/Mounts/THREETB/Corpus/book-list.csv', 'rb') as csvfile:
            csv_file = csv.reader(csvfile, delimiter=',', quotechar='"')
            # iterate over the CSV
            for row in csv_file:
                # extract a potential title, drop bracketed notes and substitute out all punctuation
                titles[row[0]] = re.sub('[\.\?\(\)\]\[,;:\'!\*!“]', '', re.sub('\[.+?\]', '', row[7])).lower().replace('  ', ' ').strip()
                # remove any titles here that are either blank or less than three words long or less than 6 chars total
                if titles[row[0]] == '' or len(titles[row[0]].split(' ')) < 3 or len(titles[row[0]]) < 6:
                    del titles[row[0]]
                if row[0] in titles:
                    try:
                        # shorten title to first five words
                        titles[row[0]] = ' '.join(titles[row[0]].split(' ')[0:5])
                    except IndexError:
                        # title was short
                        pass
        return titles


    def parse_json(folder, titles):
        directory_to_parse = '/home/martin/Mounts/THREETB/Corpus/json/{0}'.format(folder)
        jsons = glob.glob('{0}/*.json'.format(directory_to_parse))
        ret = {}
        file_counter = 0
        for json_file in jsons:
            with open(json_file, 'rb') as json_file_handle:
                file_counter += 1
                if file_counter == 1000:
                    print "Processed 1,000 JSON files"
                    file_counter = 0
                loaded_json = json.load(json_file_handle)
                # check the first ten entries of the JSON
                try:
                    for x in range(0, 10):
                        subbed_text = re.sub('[\.\?\(\)\]\[,;:\'!\*!“]', '', re.sub('\[.+?\]', '', str(loaded_json[x]))).lower().replace('  ', ' ').strip()
                        should_break = False
                        for key, title in titles.iteritems():
                            if title in subbed_text:
                                print '[{0}] {1}: {2}'.format(key, title, json_file)
                                ret[key] = json_file
                                should_break = True
                                # remove the key to avoid duplicates
                                del titles[key]
                                break
                        if should_break:
                            break
                except IndexError:
                    # if we arrive here it's a short JSON
                    pass
                except:
                    # any other failure on this file is silently skipped
                    pass
        return ret


    if __name__ == '__main__':
        titles = load_csv()
        for folder_name in range(0, 25):
            parse_json(str(folder_name).zfill(4), titles)
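
Since the script depends on local paths, the matching step can also be tried in isolation on in-memory data. The following is a rough Python 3 sketch, not the script itself: `match_titles` is a hypothetical name, the sample identifiers and documents are invented, and plain lowercasing stands in for the full stripping transform. Unlike the script, it works on a copy of the titles rather than deleting from the caller's dictionary.

```python
def match_titles(titles, documents, max_elements=10, normalise=str.lower):
    """Map each identifier to the first document whose opening elements contain its title.

    titles: dict of identifier -> already-normalised title
    documents: dict of file name -> list of elements (the parsed JSON array)
    """
    remaining = dict(titles)  # copy, so the caller's dictionary is left untouched
    matches = {}
    for name, elements in documents.items():
        # slicing copes with arrays shorter than max_elements without raising
        for element in elements[:max_elements]:
            text = normalise(str(element))
            hit = next((key for key, title in remaining.items() if title in text), None)
            if hit is not None:
                matches[hit] = name
                del remaining[hit]  # each title is matched at most once
                break  # stop at the first hit and move on to the next document
    return matches


# Invented sample data for illustration only.
sample_titles = {
    "000001": "the history of england from",
    "000002": "a tale of two cities",
}
sample_documents = {
    "a.json": ["Title page", "THE HISTORY OF ENGLAND FROM the accession"],
    "b.json": ["illegible", "x"],
}
print(match_titles(sample_titles, sample_documents))
# {'000001': 'a.json'}
```
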