Reddit NER for Place Names

Published

October 31, 2023

A fine-tuned bert-base-uncased model for named entity recognition, trained on wnut_17 with 1,001 additional comments from Reddit. The model is intended solely for place name extraction from social media text, so all other entity types have been removed.
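As a rough illustration of what removing the other entity types could look like, the sketch below maps every non-location BIO tag to "O". It assumes WNUT-17-style tag names ("B-location", "I-location"); the helper name keep_locations_only is hypothetical and this is not necessarily how the repository does it.

```python
from typing import List

# Tags kept by the filter; assumes WNUT-17-style BIO labels for locations.
KEEP = {"B-location", "I-location"}


def keep_locations_only(tags: List[str]) -> List[str]:
    """Replace every non-location tag with "O", leaving location spans intact."""
    return [tag if tag in KEEP else "O" for tag in tags]


tokens = ["Alice", "flew", "from", "New", "York", "to", "Leeds"]
tags = ["B-person", "O", "O", "B-location", "I-location", "O", "B-location"]

# Print each token next to its filtered tag.
for token, tag in zip(tokens, keep_locations_only(tags)):
    print(f"{token}\t{tag}")
```

Because location spans keep their B-/I- prefixes, span boundaries are unchanged; only the label inventory shrinks.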

GitHub README

This GitHub repository contains the code for the NER model that identifies place names in Reddit comments. The model is hosted on the Hugging Face Model Hub, allowing for easy use in Python.
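A minimal sketch of using a Hub-hosted token classification model from Python with the transformers pipeline API. The model ID below is a placeholder, not the real repository name, and the "location" label and 0.5 score threshold are assumptions; check the model's config for its actual label names.

```python
from typing import Dict, List

# Hypothetical model ID: substitute the real repository name from the Hub.
MODEL_ID = "your-username/your-place-name-ner-model"


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def place_names(entities: List[Dict], label: str = "location", min_score: float = 0.5) -> List[str]:
    """Pick place names out of aggregated pipeline output.

    `label` and `min_score` are assumptions, not values taken from the model card.
    """
    return [
        e["word"]
        for e in entities
        if e["entity_group"].lower() == label and e["score"] >= min_score
    ]


# Sample data in the shape the pipeline returns when aggregation is enabled.
sample = [
    {"entity_group": "location", "word": "leeds", "score": 0.98, "start": 14, "end": 19},
    {"entity_group": "location", "word": "york", "score": 0.31, "start": 24, "end": 28},
]
print(place_names(sample))
```

With a real model ID in place, the two helpers combine as place_names(load_tagger()("I moved from Leeds to York")).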

Training was monitored using DagsHub and MLflow.

Reproduce Model

To retrain the model locally using the WNUT_17 corpus:

python -m src.train --dataset "wnut_17"

To train the model on the CoNLL03, CoNLLpp, or OntoNotes 5 corpus instead, pass the corresponding dataset name:

python -m src.train --dataset "tner/ontonotes5"  # or "conllpp", or "conll2003"

Note that dvc repro rebuilds the model reproducibly and uploads it to Hugging Face; I use it when building future versions.

Project layout

src
├── common
│   └── utils.py  # utility functions
├── pl_data 
│   ├── conll_dataset.py  # reader for conll format
│   ├── datamodule.py  # generic datamodule
│   ├── jsonl_dataset.py  # reader for doccano jsonl format
│   └── test_dataset.py  # reader for testing dataset
├── pl_metric
│   └── seqeval_f1.py  # F1 metric
├── pl_module
│   └── ger_model.py  # model implementation
└── train.py  # training script
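To make the role of a doccano JSONL reader more concrete, here is a minimal sketch that converts character-offset annotations into token-level BIO tags. It assumes each line holds a "text" field and a "label" field of [start, end, label] triples, which is one common doccano export shape, and it uses plain whitespace tokenisation. The function names are hypothetical and are not taken from jsonl_dataset.py.

```python
import json
import re
from typing import Iterable, Iterator, List, Tuple


def spans_to_bio(text: str, spans: List[List]) -> Tuple[List[str], List[str]]:
    """Whitespace-tokenise `text` and derive BIO tags from [start, end, label] spans."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if their character ranges overlap.
            if match.start() < end and match.end() > start:
                # The first overlapping token begins at or before the span start.
                tag = ("B-" if match.start() <= start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


def read_doccano_jsonl(lines: Iterable[str], label_key: str = "label") -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, tags) for each non-empty JSONL line; `label_key` is an assumption."""
    for line in lines:
        if line.strip():
            record = json.loads(line)
            yield spans_to_bio(record["text"], record.get(label_key, []))


example = ['{"text": "Trains from New York to Leeds", "label": [[12, 20, "location"], [24, 29, "location"]]}']
parsed = list(read_doccano_jsonl(example))
print(parsed[0])
```

A real reader would also need to align these word-level tags with the BERT subword tokeniser, which this sketch leaves out.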

DVC pipeline

stages:
  train:
    cmd: python -m src.train
    deps:
    - data/doccano_annotated.jsonl
    - src/train.py
    outs:
    - logs
    frozen: false
  upload:
    cmd: python -m src.train --upload=true
    deps:
    - data/doccano_annotated.jsonl
    - src/train.py
    frozen: true
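Both stages call the same module and differ only in the --upload flag. The sketch below shows one way a script could accept --dataset together with a --upload=true style boolean using argparse. It is a guess at the interface rather than the contents of src/train.py; in particular, the default dataset of wnut_17 and the accepted boolean spellings are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse values such as "true" or "false" so that --upload=true is accepted."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI mirroring the flags used in the commands and pipeline above."""
    parser = argparse.ArgumentParser(prog="src.train")
    parser.add_argument("--dataset", default="wnut_17")  # default is an assumption
    parser.add_argument("--upload", type=str_to_bool, default=False)
    return parser


# Parse the arguments the upload stage would pass.
args = build_parser().parse_args(["--upload=true"])
print(args.dataset, args.upload)
```

Parsing the flag as an explicit boolean rather than using store_true keeps the --upload=true form valid.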

DVC DAG

flowchart TD
    node1["train"]
    node2["upload"]
