Analyzing Montgomery County crimes to find useful patterns

Montgomery County is a county in the U.S. state of Maryland, located on the East Coast.

Improving various aspects of social life calls for both proactive and reactive analysis. With this in mind, I will look for patterns by performing time, location, and crime classification analysis. For this project I will rely heavily on graph visualization, since a picture is worth a thousand words.

Here are some conclusion highlights:

  1. Time analysis:
    1. Most crimes are committed on Tuesday
    2. On a 24-hour basis, most crimes are committed between 7 a.m. and 11 p.m.
    3. October has the highest crime count
  2. Classification analysis:
    1. Violent/non-violent crime rates are fairly even: 42.8%/57.2%
  3. Location analysis:
    1. The cities with the highest crime counts are Silver Spring, Rockville, and Gaithersburg
    2. Most crimes happen in the street, at a residence, or in a parking lot
    3. The Silver Spring Police District has the highest crime rate

Dataset for this project: here

In [26]:
import pandas as pd
import numpy as np

crimes = pd.read_csv("MontgomeryCountyCrime2013.csv")
crimes.head()
Out[26]:
Incident ID CR Number Dispatch Date / Time Class Class Description Police District Name Block Address City State Zip Code ... Sector Beat PRA Start Date / Time End Date / Time Latitude Longitude Police District Number Location Address Number
0 200939101 13047006 10/02/2013 07:52:41 PM 511 BURG FORCE-RES/NIGHT OTHER 25700 MT RADNOR DR DAMASCUS MD 20872.0 ... NaN NaN NaN 10/02/2013 07:52:00 PM NaN NaN NaN OTHER NaN 25700.0
1 200952042 13062965 12/31/2013 09:46:58 PM 1834 CDS-POSS MARIJUANA/HASHISH GERMANTOWN GUNNERS BRANCH RD GERMANTOWN MD 20874.0 ... M 5M1 470.0 12/31/2013 09:46:00 PM NaN NaN NaN 5D NaN NaN
2 200926636 13031483 07/06/2013 09:06:24 AM 1412 VANDALISM-MOTOR VEHICLE MONTGOMERY VILLAGE OLDE TOWNE AVE GAITHERSBURG MD 20877.0 ... P 6P3 431.0 07/06/2013 09:06:00 AM NaN NaN NaN 6D NaN NaN
3 200929538 13035288 07/28/2013 09:13:15 PM 2752 FUGITIVE FROM JUSTICE(OUT OF STATE) BETHESDA BEACH DR CHEVY CHASE MD 20815.0 ... D 2D1 11.0 07/28/2013 09:13:00 PM NaN NaN NaN 2D NaN NaN
4 200930689 13036876 08/06/2013 05:16:17 PM 2812 DRIVING UNDER THE INFLUENCE BETHESDA BEACH DR SILVER SPRING MD 20815.0 ... D 2D3 178.0 08/06/2013 05:16:00 PM NaN NaN NaN 2D NaN NaN

5 rows × 22 columns

Exploring the data

Each row in the dataset represents a committed crime. The data contains location information and crime classification, as well as various timestamps. In examining the dataset, our goals are to:

  1. Find the columns that hold meaningful, granular information and have a minimum of missing values (a quick way to rank columns by completeness is sketched right after this list).
  2. Perform the necessary data cleaning/manipulation.
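Before inspecting columns one by one, the share of missing values per column can be ranked directly by taking the column-wise mean of isnull(). This is just a sketch of the idea, reusing the crimes dataframe loaded above; the notebook works through the columns step by step below.

In [ ]:
# sketch: rank columns by the share of missing values (highest first)
missing_share = crimes.isnull().mean().sort_values(ascending=False)
print(missing_share.head(10))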
In [27]:
crimes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23369 entries, 0 to 23368
Data columns (total 22 columns):
Incident ID               23369 non-null int64
CR Number                 23369 non-null int64
Dispatch Date / Time      23369 non-null object
Class                     23369 non-null int64
Class Description         23369 non-null object
Police District Name      23369 non-null object
Block Address             23369 non-null object
City                      23369 non-null object
State                     23369 non-null object
Zip Code                  23339 non-null float64
Agency                    23369 non-null object
Place                     23369 non-null object
Sector                    23323 non-null object
Beat                      23361 non-null object
PRA                       23363 non-null float64
Start Date / Time         23369 non-null object
End Date / Time           13191 non-null object
Latitude                  23208 non-null float64
Longitude                 23208 non-null float64
Police District Number    23369 non-null object
Location                  23208 non-null object
Address Number            23237 non-null float64
dtypes: float64(5), int64(3), object(14)
memory usage: 3.9+ MB

We need to convert Start Date / Time, End Date / Time, and Dispatch Date / Time to datetime format.
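For reference, the conversion itself is one pd.to_datetime call per column. The sketch below converts all three columns and passes an explicit format string inferred from the displayed values (e.g. 10/02/2013 07:52:41 PM); the format argument is optional but speeds up parsing. Later in the notebook only Dispatch Date / Time is converted, since that is the column the time analysis uses.

In [ ]:
# sketch: convert the three timestamp columns to datetime
# the format string is inferred from the displayed sample values and is optional
for col in ['Dispatch Date / Time', 'Start Date / Time', 'End Date / Time']:
    crimes[col] = pd.to_datetime(crimes[col], format='%m/%d/%Y %I:%M:%S %p')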

Examine the number of rows and columns in the dataframe.

In [28]:
crimes.shape
Out[28]:
(23369, 22)

Examining missing values.

In [29]:
crimes.isnull().sum()
Out[29]:
Incident ID                   0
CR Number                     0
Dispatch Date / Time          0
Class                         0
Class Description             0
Police District Name          0
Block Address                 0
City                          0
State                         0
Zip Code                     30
Agency                        0
Place                         0
Sector                       46
Beat                          8
PRA                           6
Start Date / Time             0
End Date / Time           10178
Latitude                    161
Longitude                   161
Police District Number        0
Location                    161
Address Number              132
dtype: int64
  1. The main takeaway here is that End Date / Time has a high number of missing values, so it can't be used for our analysis
  2. A deeper look into the columns that have missing values is needed; we also have to determine which columns will provide useful insight
In [30]:
# preview the first non-null values for each column that has missing data
columns_with_missing = ['Zip Code', 'Sector', 'Beat', 'PRA', 'Latitude', 'Longitude', 'Location', 'Address Number']

for col in columns_with_missing:
    not_null = crimes[col].notnull()
    print(col + "\n", crimes[col][not_null].head(), "\n")
Zip Code
 0    20872.0
1    20874.0
2    20877.0
3    20815.0
4    20815.0
Name: Zip Code, dtype: float64 

Sector
 1    M
2    P
3    D
4    D
5    P
Name: Sector, dtype: object 

Beat
 1    5M1
2    6P3
3    2D1
4    2D3
5    6P1
Name: Beat, dtype: object 

PRA
 1    470.0
2    431.0
3     11.0
4    178.0
5    444.0
Name: PRA, dtype: float64 

Latitude
 10    39.105561
13    39.064334
14    39.067335
15    39.017814
16    39.178862
Name: Latitude, dtype: float64 

Longitude
 10   -77.144617
13   -76.968985
14   -77.124027
15   -77.047689
16   -77.267406
Name: Longitude, dtype: float64 

Location
 10    (39.105560882140779, -77.144617133574968)
13     (39.064334220776551, -76.96898520383327)
14    (39.067334736049553, -77.124027420153752)
15     (39.017814078946948, -77.04768926351224)
16    (39.178862442227761, -77.267405973712243)
Name: Location, dtype: object 

Address Number
 0     25700.0
10      600.0
11     9200.0
13     2100.0
14     2200.0
Name: Address Number, dtype: float64 

Column analysis:

  1. Zip Code, Sector, Beat, and Address Number can't be used for our analysis, as these details are meaningless to a casual reader.
  2. Columns like Dispatch Date / Time, Class Description, City, Start Date / Time, and Police District Number will be useful.
  3. Latitude and Longitude are not an obvious choice, at least from a comprehensibility standpoint, but will be useful for map visualization later on.
  4. The number of NaN values in the End Date / Time column is quite high, so I will use Dispatch Date / Time instead.
  5. I will exclude rows with missing Latitude/Longitude values from the dataset.
In [31]:
#exclude lat&lon missing values
lat_null=crimes['Latitude'].notnull()
lon_null=crimes['Latitude'].notnull()
crimes=crimes[(lat_null==True) & (lon_null==True)]
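A quick sanity check that the filter worked, plus an optional trimmed view containing only the columns identified as useful above. This is just a sketch: the notebook keeps working with the full dataframe, and crimes_small is a hypothetical name.

In [ ]:
# sanity check: no missing coordinates should remain after the filter
print(crimes[['Latitude', 'Longitude']].isnull().sum())
print(crimes.shape)

# optional: a trimmed working view with only the columns used in the analysis (hypothetical)
useful_columns = ['Dispatch Date / Time', 'Class Description', 'City',
                  'Start Date / Time', 'Police District Number',
                  'Latitude', 'Longitude']
crimes_small = crimes[useful_columns].copy()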

Time analysis

The aim here is to spot time-related patterns in crime. The time analysis is structured around these questions:

  1. On what day of the week are the most crimes committed? (Monday, Tuesday, etc.)
  2. During what time of day are the most crimes committed?
  3. During what month are the most crimes committed?

The first step is to convert the Dispatch Date / Time column from an object type to a datetime type.
Dispatch Date / Time: the actual date and time an officer was dispatched.

In [32]:
# convert 'Dispatch Date / Time' to datetime format
crimes['Dispatch Date / Time'] = pd.to_datetime(crimes['Dispatch Date / Time'])

# derive weekday, hour and month features from the dispatch timestamp
# (.dt.weekday_name was removed in newer pandas; .dt.day_name() is the current equivalent)
crimes['Dispatch Day of the Week'] = crimes['Dispatch Date / Time'].dt.day_name()
crimes['Dispatch Hour'] = crimes['Dispatch Date / Time'].dt.hour
crimes['Dispatch Month'] = crimes['Dispatch Date / Time'].dt.month
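A quick look at the derived columns is an easy sanity check before plotting; just a sketch, previewing the columns created in the cell above.

In [ ]:
# sanity check: preview the derived time features
crimes[['Dispatch Date / Time', 'Dispatch Day of the Week',
        'Dispatch Hour', 'Dispatch Month']].head()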
In [33]:
# import plotting modules
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
init_notebook_mode()
import cufflinks as cf

# crime counts per day of the week
dow = crimes['Dispatch Day of the Week'].value_counts().copy()

# converting the Series to a DataFrame (kept for reference;
# the plot below uses the Series directly)
dow_df = pd.DataFrame({'Dispatch Day of the Week': dow.index, 'Counts': dow}).reset_index(drop=True)

# plotting
dow.iplot(theme='pearl', filename='crime_distrib_per_day', title='Crime distribution per day of the week',
          xTitle='Day', yTitle='Count', world_readable=False)
dow
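value_counts() orders the days by frequency, which is what the ranking conclusion above relies on; if a calendar-ordered x-axis is preferred, the same series can simply be reindexed before plotting. A small sketch reusing dow from the cell above (dow_ordered and the filename are hypothetical):

In [ ]:
# optional: reindex the weekday counts into calendar order before plotting
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']
dow_ordered = dow.reindex(weekday_order)

dow_ordered.iplot(theme='pearl', filename='crime_distrib_per_day_ordered',
                  title='Crime distribution per day of the week (calendar order)',
                  xTitle='Day', yTitle='Count', world_readable=False)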