Dewey Decimal Classification

Fun with DDC

Intro🔗

Most of us have probably set a foot in a library with the intent to find a book. You'd probably have noticed that the books are 'sorted' by topic. Maybe you even saw a number like '745.2' that covers a category of books. That is the Dewey Decimal Classification

For the fontter project I needed a structured way to use sa classification.

This post is a quick summary of the steps that I took.

First step: download the file🔗

I found this list that I could use as a starting point. It is probably outdated, but for my needs it was just sufficient.

We will use the aiohttp client to download the page and store it locally.

file: 1_prepare.sh

#!/bin/bash

pip install bs4

file: download_file.py

#!/usr/bin/env python3

import os
import asyncio

import aiohttp

IN_FILE = './tmp/ddc.html'
URL = 'http://bpeck.com/references/ddc/ddc.htm'

async def download_file(url: str) -> str:
    "aiohttp client"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            if resp.status == 200:
                content = await resp.text()
                return content

async def read_or_create_file(url: str, in_file: str) -> str:
    "make a local cache, to work from (during development)"
    if os.path.exists(in_file):
        with open(in_file, 'r') as file_h:
            content = file_h.read()
    else:
        content = await download_file(url)
        with open(in_file, 'w') as file_h:
            file_h.write(content)
    return content

async def main(loop: any) -> None:
    content = await read_or_create_file(URL, IN_FILE)

if __name__ == '__main__':
    LOOP = asyncio.get_event_loop()
    LOOP.run_until_complete(main(LOOP))

source ./1_prepare.sh
./download.py

Second step: parse the html🔗

We will use beautifull soup to extract the data in the html tables and save it in a json file.

file: parse_file.py

#!/usr/bin/env python3

import json
import asyncio
from typing import Dict

from bs4 import BeautifulSoup

from download_file import read_or_create_file, IN_FILE, URL

OUT_FILE = './tmp/ddc.json'

def split_ddc(summary: str) -> str:
    parts = summary.split(' ')
    return parts[0], ' '.join(parts[1:])

def parse(content) -> Dict[str, str]:
    ddict = {}
    soup = BeautifulSoup(content, 'html.parser')
    for table in soup('table', cellpadding="2", width="99%", border="1"):
        rows = table.findAll('tr')
        summary = rows[0].p.contents[0].contents[0].string
        dec, subject = split_ddc(summary)
        ddict[dec] = subject
        for cell in rows[1].findAll('td'):
            dec = ""
            subject = ""
            size = len(cell.contents)
            if size == 1:
                dec = cell.contents[0].contents[0].string
                subject = cell.contents[0].contents[2].string
            elif size == 3:
                dec = cell.contents[0].contents[0].string
                subject = cell.contents[2].contents[0].string
            else:
                pass
            ddict[dec] = subject
        for cell in rows[2].findAll('td'):
            for item in cell.findAll('font'):
                summary = item.contents[0].string
                dec, subject = split_ddc(summary)
                ddict[dec] = subject
            for p_item in cell.findAll('p'):
                for item in p_item.findAll('font'):
                    summary = item.contents[0].string
                    dec, subject = split_ddc(summary)
                    ddict[dec] = subject
    return ddict

async def main(loop: any) -> None:
    content = await read_or_create_file(URL, IN_FILE)
    ddc = parse(content)
    with open(OUT_FILE, 'w') as file_h:
        file_h.write(json.dumps(ddc))

if __name__ == '__main__':
    LOOP = asyncio.get_event_loop()
    LOOP.run_until_complete(main(LOOP))

pip install bs4
export PYTHONPATH=`pwd`/..
touch __init__.py
./parse_file.py

Now we have the file saved as ./tmp/ddc.json

Third step: prepare the database table.🔗

file: create_table.sql

CREATE TABLE IF NOT EXISTS ddc(
    ddc_cid         VARCHAR( 8 )    PRIMARY KEY,
    ddc_name        VARCHAR( 256 )                                  NOT NULL
);

Fourth step: insert the data into the table🔗

We will use asyncpg to extract the json rows and insert them in the database table.

file: import_data.py

#!/usr/bin/env python3

import re
import json
import asyncio
import asyncpg

from parse_file import OUT_FILE

CONF = {
    'database': 'a-database',
    'user': 'a-user',
    'password': 'a-password',
}

async def create_record(conn: asyncpg.Connection, cid: str, name: str) -> None:
    sql = """INSERT INTO ddc
        ( ddc_cid, ddc_name )
    VALUES ( $1, $2 )
    RETURNING *
    """
    transaction = conn.transaction()
    await transaction.start()
    try:
        row = await conn.fetchrow(sql, cid, name)
    except:
        await transaction.rollback()
        raise
    finally:
        await transaction.commit()

async def gen_ddc(conn: asyncpg.Connection) -> None:
    with open(OUT_FILE, 'r') as file_h:
        text = file_h.read()
        data = json.loads(text)
        pattern = re.compile(r"\d{3}(\.\d+)*")
        for cid, name in data.items():
            if pattern.match(cid):
                await create_record(conn, cid, name)

async def main(loop: any) -> None:
    db_pool = await asyncpg.create_pool(**CONF, loop=loop)
    conn = await db_pool.acquire()

    await gen_ddc(conn)

    await conn.close()
    await db_pool.release(conn)
    await db_pool.close()

if __name__ == '__main__':
    LOOP = asyncio.get_event_loop()
    LOOP.run_until_complete(main(LOOP))

pip install asyncpg
./import_data.py

Nice, now we can query it,

SELECT *
FROM ddc
WHERE ddc_cid IN ('700', '740', '745', '745.2')
ORDER BY ddc_cid;
 ddc_cid |         ddc_name
---------+---------------------------
 700     | The arts
 740     | Drawing & decorative arts
 745     | Decorative arts
 745.2   | Industrial art and design
(4 rows)

Fifth step: create a view🔗

Just a query is not enough. I want the whole display of the hiarchy in one row, even if that means one badass query.

file: create_view.sql

CREATE OR REPLACE VIEW ddc_v AS
    WITH first AS (
        SELECT LEFT(ddc_cid, 1) AS a,
                ddc_cid,
                ddc_name
        FROM ddc
        WHERE ddc_cid IN ('0', '100', '200', '300', '400', '500', '600', '700', '800', '900')
        ORDER BY ddc_cid
    ), second AS (
        SELECT LEFT(ddc_cid, 1) AS a,
                LEFT(ddc_cid, 2) AS b,
                ddc_cid,
                ddc_name
        FROM ddc
        WHERE ddc_cid ~ '0$' AND ddc_cid !~ '\.'
        ORDER BY ddc_cid
    ), third AS (
        SELECT LEFT(ddc_cid, 1) AS a,
                LEFT(ddc_cid, 2) AS b,
                LEFT(ddc_cid, 3) AS c,
                ddc_cid,
                ddc_name
        from ddc
        WHERE ddc_cid !~ '\.'
        ORDER BY ddc_cid
    ), fourth AS (
        SELECT LEFT(ddc_cid, 1) AS a,
                LEFT(ddc_cid, 2) AS b,
                LEFT(ddc_cid, 3) AS c,
                array_to_string(regexp_matches(ddc_cid, '\d*\.(\d+)'), '') AS d,
                ddc_cid,
                ddc_name
        from ddc
        WHERE ddc_cid ~ '.'
        ORDER BY ddc_cid
    )
    SELECT
        t4.ddc_cid,
        COALESCE(  t4.ddc_name, t3.ddc_name, t2.ddc_name, t1.ddc_name) AS name,
        t1.ddc_name AS t1_name,
        t2.ddc_name AS t2_name,
        t3.ddc_name AS t3_name,
        t4.ddc_name AS t4_name,
        CAST( CAST( t1.a AS INT) * 100 AS TEXT ) AS a,
        CAST( CAST( t2.b AS INT) *  10 AS TEXT ) AS b,
        t3.c AS c,
        t4.d AS d
    FROM First t1
        INNER JOIN second t2 ON t2.a = t1.a
        INNER JOIN third t3 ON t3.b = t2.b
        LEFT OUTER JOIN fourth t4 ON t4.c = t3.c
    ORDER BY t1.a, t2.b, t3.c, t4.d ASC NULLS FIRST

Some interesting things to note here are:

  • Common Table Expressions (CTE) are used so structure the code
  • PostgreSQL regular expression are used in the WHERE clauses:
    • second: WHERE ddc_cid ~ '0$' AND ddc_cid !~ '\.'
    • third: WHERE ddc_cid !~ '\.'
    • fourth: WHERE ddc_cid ~ '.'
  • column 'fourth.d' uses these two interesting functions array_to_string(regexp_matches(ddc_cid, '\d*\.(\d+)'), '')
SELECT *
FROM ddc_v
WHERE ddc_cid = '745.2';
 ddc_cid |           name            | t1_name  |          t2_name          |     t3_name     |          t4_name          |  a  |  b  |  c  | d
---------+---------------------------+----------+---------------------------+-----------------+---------------------------+-----+-----+-----+---
 745.2   | Industrial art and design | The arts | Drawing & decorative arts | Decorative arts | Industrial art and design | 700 | 740 | 745 | 2
(1 row)

What can be improved?🔗

  • [ ] The parsing routine is a bit ugly.
  • [ ] The first level of hiarchy has a IN condition, this could be a set of VALUES or a SEQUENCE.
  • [ ] The aliases of the column; 'a', 'b', 'c', 'd' are not so well chosen.