Malaysian Address Normalization with Python (Quick Start)

5 min readMay 16, 2021

A Sample Use Case to get Started with Malaya NLTK.

A quick start to
1 get started with Malaya NLTK.
2 get started with Malaysian address normalization using Python.

Malaya NLTK greatly simplifies Malaysian address normalization logic versus 5 years ago when I first built one.

Notes:
1 You will need to expand the images to have a better view.
2 Non-comprehensive quick start. If you are interested to know more, feel free to get connected.

##########################
# Start of Code
####################################################
# Introduction
##########################
# Getting Started with Malaya, 
# a Natural-Language-Toolkit library for bahasa Malaysia, 
# powered by Deep Learning Tensorflow.
# Malaya is the brainchild of Mr Husein Zolkepli
########################### Address Normalization (Standardization) forms the basis of household sizing in marketing# to facilitate analysis on household composition
# from opportunity sizing of products and services, down to category spent.# Knowing/ having inferred the age composition of each household,
# marketeers can then have a better view on the life stages of each family
# and therefore provide greater context and personalization.# Other than that, behavioural analysis such as spending patterns, lifestyle, affluence index etc
# further augment the context of the analysis
##########################

##########################
# setup
##########################
# installation guide:
# https://malaya.readthedocs.io/en/latest/Installation.html
# pip install malaya
##########################
# import libraries
%%time
import malaya
import pandas as pd
import string
import numpy as np 
import re
##########################
def probability(sentence_piece: bool = False, **kwargs):
    """
    Train a Probability Spell Corrector.
Parameters
    ----------
    sentence_piece: bool, optional (default=False)
        if True, reduce possible augmentation states using sentence piece.
Returns
    -------
    result: malaya.spell.Probability class
    """
##########################
prob_corrector = malaya.spell.probability()
symspell_corrector = malaya.spell.symspell()
corrector = malaya.spell.probability()
normalizer = malaya.normalize.normalizer(corrector)
##########################

##########################
# Sample Data
##########################
# sample data
# source: https://www.bestrandoms.com/random-address-in-my
# for simplicity, address is split by Address1, Postcode and State
# we will only show normalization example for Address1 and State
##########################
df = pd.DataFrame({
    'ID':[
        'abc123',
        'bcd234',
        'cde345',
        'def456',
        'efg567',
        'fgh678',
        'ghi789',
        'hij890',
        'ijk901',
        'jkl012',
        'klm123'
    ], 
    'Address1':[
        '5, Jalan 2/2, Taman Bunga',
        '2, Jln 2, Kampng Baru',
        'No 3, Jalan Dua, Tmn Perdana',
        'NO 18-1, Jalan 13/5 Taman Melati Gombak',
        '29E Jln Tandang Seksyen 51 Petaling Jaya',
        'Lorong Hulu Balang 30B, Taman Sentosa,',
        '73 Jln Yam Tuan Seremban',
        'Menara CITIBANK, Jln Ampang, Kul',
        '57-B Jalan Ss22/19 Damansara Jaya',
        '87, Lebuh Macallum, George Town, 10300',
        'Lrg Panggung, City Centre, 50000'
    ],
    'Postcode':[
        '88811',
        '88800',
        '58700',
        '53100',
        '46050',
        '42100',
        '70000',
        '50450',
        '47400',
        '10300',
        '50000'
    ],
    'State':[
        'KL',
        'KUALA LUMPUR',
        'KL.',
        'Wilayah Persekutuan KL',
        'Selangor',
        'Selngor',
        'Negeri Sembilan',
        'Kuala Lumpur',
        'Selangor',
        'Pulau Pinang',
        'KL'
    ]
})
##########################

##########################
# Abbreviation dictionary
##########################
# sample shortform/abbreviation dictionary
# this helps us to immediately convert common short hands
# in actual, you will need more, eg those for tingkat, medan etcshortform_dict = {
    'tmn': 'taman',
    'kg': 'kampung',
    'jln': 'jalan',
    'jln.': 'jalan',
    'lrg': 'lorong',
    'st': 'street'
}
##########################
# functions to be used
##########################
# remove punctuations
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '').lower()
    return text# change to lowercase
def apply_lowercase(text):
    text = text.lower()
    return text# endcomma stripping
def remove_endcomma(text):
    try:
        return text.rstrip(',')
    except:
        return text# Malaya normalizer
def applymalaya_normalizer(cell):
    try:
        return list(normalizer.normalize(cell).items())[0][1]
    except:
        return cell# Malaya Spellcorrector. we opt for the simplest spell corrector
def applymalaya_symspell_corrector(cell):
    try:
        return symspell_corrector.correct(cell)
    except:
        return cell# Malaya word2num    
def applymalaya_word2num(cell):
    try:
        return malaya.word2num.word2num(cell)
    except:
        return cell
##########################

# switch to lower case, then split with space as delimiter
# '2, Jln 2, Kampng Baru' --> '2, jln 2, kampng baru'
# Kampng --> kampng
# '2, jln 2, kampng baru' --> '2,' + 'jln' + '2,' + 'kampng' + 'baru'df["Address1_Update1"] = df['Address1'].apply(apply_lowercase)
dfx1 = pd.concat([df['ID'],df['Address1'],df['Postcode'],df['State'],df['Address1_Update1'],df['Address1_Update1'].str.split(' ', expand=True)], axis=1)
dfx1

# strip the comma at the end
# for example, the comma at the end of jalan 2/2 will be strippedcolumn_list=dfx1.columns.tolist()[5:]
dfx2 = dfx1.apply(lambda x: x.apply(remove_endcomma) if x.name in column_list else x)
dfx2

# applying abbreviations mappings
# for example 'jln' --> 'jalan'
# applying these visible ones, even though Malaya NLTK is capable of thisdfx3 = dfx2.apply(lambda x: x.map(shortform_dict).replace(np.nan, x, regex=True) if x.name in column_list else x)
dfx3

# word2num is applied to facilitate geocoding exercises (not covered here)
# 'Jalan Dua' --> 'Jalan 2'dfx4 = dfx3.apply(lambda x: x.apply(applymalaya_word2num) if x.name in column_list else x)
dfx4

# spell corrector
# 'kampng' --> 'kampung'
# luckily 'kampng' --> 'kmpg' did not materialize HAHAdfx5 = dfx4.apply(lambda x: x.apply(applymalaya_symspell_corrector) if x.name in column_list else x)
dfx5

# random exercise to normalize state
# you can opt for a dictionary to streamline this further
# for example, 'wilayah persekutuan kuala lumpur' --> 'kuala lumpur'dfx5['combined'] = dfx5[column_list].apply(lambda row: ' '.join(row.values.astype(str)), axis=1).str.replace('nan','').str.replace('None','')
dfx5["State_Normalized"]=dfx5["State"].apply(remove_punctuations).apply(applymalaya_normalizer).str.lower().apply(applymalaya_symspell_corrector)
dfx5

col_dict = {'State_Normalized': 'Normalized_State', 'combined': 'Normalized_Address'}  
dfx5.columns = [col_dict.get(x, x) for x in dfx5.columns]
dfx5

# quick comparison of pre-normalized versus normalized addressdfx5[["ID", "Address1","Postcode","State","Normalized_Address","Normalized_State"]]
dfx5['Pre_Normalized_Address'] = dfx5['Address1'] + ' ' + dfx5['Postcode'] + ' ' + dfx5['State']
dfx5['Normalized_Address'] = dfx5['Normalized_Address'] + ' ' + dfx5['Postcode'] + ' ' + dfx5['Normalized_State']
from IPython.display import display
display(dfx5[["ID", "Pre_Normalized_Address","Normalized_Address"]].style.set_properties(**{
        'width': '500px',
        'max-width': '500px'
    }))
##########################
# End of Code
##########################

References:

https://malaya.readthedocs.io/en/latest/
https://www.bestrandoms.com/random-address-in-my
https://www.google.com.my/maps for the image

Mr Husein Zolkepli is the brainchild of Malaya NLTK library. Do support Malaya and Malay-Dataset development on BuyMeACoffee: https://www.buymeacoffee.com/huseinzolkepli

Malaysian Address Normalization with Python (Quick Start)

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by EeLianChong

No responses yet