| --- |
| datasets: |
| - ai4privacy/pii-masking-200k |
| base_model: |
| - distilbert/distilroberta-base |
| pipeline_tag: token-classification |
| tags: |
| - distil |
| - pii |
| - security |
| - shield |
| - small |
| - cpu |
| - fast |
| - open |
| - open-source |
| - lh-tech |
| - bert |
| - roberta |
| --- |
| # 🛡️ Shield 82M |
Welcome to Shield 82M, a lightweight token-classification model that detects and masks personally identifiable information (PII) in multilingual text.
|
|
| ## Classes |
The model recognizes the following PII classes:
| ```plaintext |
| ['O', 'ACCOUNTNAME', 'ACCOUNTNUMBER', 'AGE', 'AMOUNT', 'BIC', 'BITCOINADDRESS', 'BUILDINGNUMBER', 'CITY', 'COMPANYNAME', 'COUNTY', 'CREDITCARDCVV', 'CREDITCARDISSUER', 'CREDITCARDNUMBER', 'CURRENCY', 'CURRENCYCODE', 'CURRENCYNAME', 'CURRENCYSYMBOL', 'DATE', 'DOB', 'EMAIL', 'ETHEREUMADDRESS', 'EYECOLOR', 'FIRSTNAME', 'GENDER', 'HEIGHT', 'IBAN', 'IP', 'IPV4', 'IPV6', 'JOBAREA', 'JOBTITLE', 'JOBTYPE', 'LASTNAME', 'LITECOINADDRESS', 'MAC', 'MASKEDNUMBER', 'MIDDLENAME', 'NEARBYGPSCOORDINATE', 'ORDINALDIRECTION', 'PASSWORD', 'PHONEIMEI', 'PHONENUMBER', 'PIN', 'PREFIX', 'SECONDARYADDRESS', 'SEX', 'SSN', 'STATE', 'STREET', 'TIME', 'URL', 'USERAGENT', 'USERNAME', 'VEHICLEVIN', 'VEHICLEVRM', 'ZIPCODE'] |
| ``` |
|
|
## Base model
This model is fine-tuned from [distilroberta-base](https://huggingface.co/distilbert/distilroberta-base), an 82M-parameter distilled version of RoBERTa.
|
|
## Examples
The model reaches an accuracy of ~96% (0.961206) on the validation split; see the per-epoch metrics under Training details.
Here are a few examples:
| ### Test with name, email and phone |
| ```plaintext |
| Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678. |
| Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE]. |
| ``` |
| ### Basic test |
| ```plaintext |
| Original: I live in Cambridge |
| Protected: I live in [ADDRESS] |
| ``` |
| ### French test (multilingual) |
| ```plaintext |
| Original: Mon e-mail est jean.dupont@example.fr et mon téléphone est +33 6 12 34 56 78. |
| Protected: Mon e-mail est [EMAIL] et mon téléphone est [PHONE]. |
| ``` |
|
|
| ## Quickstart |
To use this model, just download `use.py` from this repository and run it:
| ```bash |
| mkdir Shield-82M |
| cd Shield-82M |
| wget https://huggingface.co/LH-Tech-AI/Shield-82M/resolve/main/use.py |
| python3 use.py |
| ``` |
|
|
| This outputs something like: |
```plaintext
| Loading Shield-82M from LH-Tech-AI/Shield-82M... |
| |
| Loading weights: 100% |
| 103/103 [00:00<00:00, 773.65it/s, Materializing param=roberta.encoder.layer.5.output.dense.weight] |
| |
| Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678. |
| Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE]. |
| ``` |
|
|
To run it on your own text, adjust this line in `use.py`:
| ```python |
| sample = "My name is John Doe. Email: john@example.com. Phone: +49 123 45678." |
| ``` |
|
|
| ## Training data |
| This model was trained on the first 20,000 samples of [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k/tree/main) for 3 epochs. |
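
Selecting that subset with the datasets library looks roughly like this (a sketch that assumes the dataset's default `train` split; check the dataset card for the exact schema):
```python
from datasets import load_dataset

# Sketch: take the first 20,000 rows used for fine-tuning.
train_subset = load_dataset("ai4privacy/pii-masking-200k", split="train").select(range(20000))
print(train_subset)
```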
|
|
| ## Training details |
| - Epochs: 3 |
- Max length: 512
| - Base model: [distilroberta-base](https://huggingface.co/distilbert/distilroberta-base) |
| - Data: first 20,000 samples of [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k/tree/main) |
| - GPU: 2x Kaggle T4 |
| - Training time: 06:38 min |
| - Engine: HF Transformers |
|
|
The following table shows the per-epoch training metrics:
|
|
| | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy | |
| | :--- | :--- | :--- | :--- | :--- | :--- | :--- | |
| | 1 | 1.048266 | 0.250184 | 0.904065 | 0.932844 | 0.918229 | 0.949456 | |
| | 2 | 0.257664 | 0.193614 | 0.939548 | 0.949651 | 0.944572 | 0.959521 | |
| | 3 | 0.199425 | 0.181754 | 0.939833 | 0.952215 | 0.945983 | 0.961206 | |
|
|
You can find the full training code in `train.ipynb`; it runs on 2x Kaggle T4 GPUs in about 7 minutes.
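
For orientation, here is a rough sketch of the setup implied by the parameters above; it is illustrative only, the batch size and other unlisted settings are assumptions, and the exact code is in `train.ipynb`.
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, TrainingArguments

# Illustrative setup mirroring the listed hyperparameters (not the exact train.ipynb code).
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilroberta-base",
    num_labels=57,  # length of the class list above ('O' + 56 PII classes); assumes no B-/I- prefixing
)

args = TrainingArguments(
    output_dir="shield-82m",
    num_train_epochs=3,              # as listed above
    per_device_train_batch_size=16,  # assumption: batch size is not stated in the card
)
# Tokenization with max_length=512, label alignment, and the Trainer call
# are in train.ipynb; evaluation was run once per epoch (see the table above).
```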