I manage computer networks, work that does not normally involve data science terms like regression, machine learning and R. However, data is everywhere, and one can still apply data science techniques to glean new insights from age-old data sources like syslog. In this blog post I outline how I used my newfound data science knowledge (Coursera/JHU Data Science Specialization) to extract insights from syslog data.
Problem: We were migrating to a new virtual private network (VPN) technology (for the Cisco network folks, moving from the Cisco IPSEC client to the Cisco AnyConnect client). VPN-related information such as username, IP address and the VPN client in use would help with gathering migration metrics, VPN usage trends and tech support information.
Solution: Using simple Unix ‘grep’ commands I winnowed my syslog data down to the relevant weekly VPN-related logs. A sample of three lines of VPN-related syslog data is shown below (I have anonymized private information).
Sep 3 13:18:08 10.10.10.1 Sep 03 2015 18:18:23 firewall-f03 : %ASA-5-713120: Group = IPSEC, Username = user1, IP = 86.65.197.212, PHASE 2 COMPLETED (msgid=575c7b2f)
Sep 3 14:43:24 10.10.10.1 Sep 03 2015 19:43:39 firewall-f03 : %ASA-6-716059: Group <AnyConnect> User <user2> IP <216.58.218.196> AnyConnect session resumed connection from IP <216.58.218.196>.
Sep 4 10:24:34 10.10.10.1 Sep 04 2015 15:24:34 firewall-f03 : %ASA-5-713120: Group = IPSEC, Username = user3 IP = 208.87.150.50, PHASE 2 COMPLETED (msgid=575c7b2f)
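The grep commands themselves are not shown in this post. As a rough R sketch of the same filtering step, assuming the two ASA message IDs that appear in the sample (713120 and 716059) are the ones of interest; the variable names and the unrelated log line are made up for illustration:

```r
# Sample input: two VPN lines from the post plus one made-up unrelated line
raw_lines <- c(
  "Sep 3 13:18:08 10.10.10.1 Sep 03 2015 18:18:23 firewall-f03 : %ASA-5-713120: Group = IPSEC, Username = user1, IP = 86.65.197.212, PHASE 2 COMPLETED (msgid=575c7b2f)",
  "Sep 3 14:43:24 10.10.10.1 Sep 03 2015 19:43:39 firewall-f03 : %ASA-6-716059: Group <AnyConnect> User <user2> IP <216.58.218.196> AnyConnect session resumed connection from IP <216.58.218.196>.",
  "Sep 3 13:20:00 10.10.10.1 some unrelated message (made-up line)"
)

# Message IDs assumed to mark VPN session events
vpn_ids <- c("713120", "716059")

# Build a pattern like "%ASA-[0-9]-(713120|716059):"
pattern <- paste0("%ASA-[0-9]-(", paste(vpn_ids, collapse = "|"), "):")

# Keep only the lines that carry one of those message IDs
vpn_lines <- raw_lines[grepl(pattern, raw_lines)]
print(length(vpn_lines))
```

In practice raw_lines would come from readLines() on the syslog file rather than from a hard-coded vector.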
This raw VPN-related syslog data was further processed using the R code described below. The code may not work as-is on your logs, as it is specific to my syslog format. It’s not elegant R code; however, it does what I need it to do. The gist of the code is to clean the syslog data so that the fields I need (‘Group’, ‘Username’ and ‘IP’) are at consistent positions on every line of the log. Here is the flow of the code:
A: R code to read the raw syslog data into a data table, vpnData.
vpnData <- data.table(readLines("temp2.txt", n = -1L, skipNul = TRUE, encoding = "UTF-8"))
B: Function to clean each line of syslog.
replaceExpressions <- function(x) UseMethod("replaceExpressions", x)
replaceExpressions.PlainTextDocument <- replaceExpressions.character <- function(x) {
  # Strip angle brackets, colons and the filler words "from" and "connection"
  x <- str_replace_all(x, "<|>|:|from|connection", "")
  x <- str_replace_all(x, "=", "")
  x <- str_replace_all(x, ",", "")
  # Collapse the runs of spaces left behind by the removals above
  x <- str_replace_all(x, " +", " ")
  x <- str_trim(x)
  return(x)
}
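To see why the cleaning puts the fields at consistent positions, here is a small base-R sketch of the same steps applied to the two sample formats. It uses gsub and trimws in place of the stringr calls so it runs without extra packages; clean_line and extract_fields are hypothetical helper names, not part of the original code.

```r
# Base-R stand-in for the cleaning function above (assumed equivalent)
clean_line <- function(x) {
  x <- gsub("<|>|:|from|connection", "", x)
  x <- gsub("=|,", "", x)
  x <- gsub(" +", " ", x)  # collapse runs of spaces
  trimws(x)
}

# Split the cleaned line on spaces and read the values relative to "Group"
extract_fields <- function(line) {
  tokens <- strsplit(clean_line(line), " ")[[1]]
  i <- match("Group", tokens)
  tokens[i + c(1, 3, 5)]  # group name, username, IP address
}

ipsec_line <- "Sep 3 13:18:08 10.10.10.1 Sep 03 2015 18:18:23 firewall-f03 : %ASA-5-713120: Group = IPSEC, Username = user1, IP = 86.65.197.212, PHASE 2 COMPLETED (msgid=575c7b2f)"
anyconnect_line <- "Sep 3 14:43:24 10.10.10.1 Sep 03 2015 19:43:39 firewall-f03 : %ASA-6-716059: Group <AnyConnect> User <user2> IP <216.58.218.196> AnyConnect session resumed connection from IP <216.58.218.196>."

print(extract_fields(ipsec_line))
print(extract_fields(anyconnect_line))
```

Both formats end up as "Group", value, "Username" or "User", value, "IP", value, which is why the offsets +1, +3 and +5 work for either client.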
C: The R code listed below reads each row of the data table vpnData and picks out the ‘Group’, ‘Username’ and ‘IP’ values. This portion of the code calls a function named freegeoip, which takes an IP address as an argument and returns a number of geoip parameters; my code uses the latitude and longitude returned for the IP address. The five values (Group, Username, IP, Longitude and Latitude) are appended to a new data table, vpnTable, using rbindlist.
The ‘freegeoip’ function was written by Andrew; the code and the database service holding the geoip information are listed in the references section. The ‘freegeoip’ function and the database service are released under the GPL v2 and Creative Commons licenses respectively.
vpnTable <- data.table()
for (i in seq_len(nrow(vpnData))) {
  dd <- as.vector(as.matrix(vpnData[i,]))
  #print(as.vector(replaceExpressions(dd)))
  dd <- replaceExpressions(dd)
  dd <- str_split(dd, " ")
  dd <- dd[[1]]
  # 'Group' anchors the fields: group name at +1, username at +3, IP at +5
  indx <- match("Group", dd)
  geoip <- freegeoip(dd[indx+5])
  dd2 <- data.table(Group = dd[indx+1], User = dd[indx+3], IP = dd[indx+5], Long = geoip$longitude, Lat = geoip$latitude)
  vpnTable <- rbindlist(list(vpnTable, dd2))
  rm(dd2)
}
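The loop above relies on fixed token offsets and grows the table one row at a time. As an alternative sketch (not the method used in this post), the three fields can be captured directly with one regular expression and the rows bound once at the end. The pattern and the names field_pattern and parse_vpn_line are assumptions for illustration; \w+ assumes usernames without dots or hyphens, and the geoip lookup is left out.

```r
# Assumed pattern: "Group", "User"/"Username" and "IP", each followed by its value
field_pattern <- "Group\\W+(\\w+)\\W+User[a-z]*\\W+(\\w+)\\W+IP\\W+([0-9.]+)"

# Return a one-row data frame for a VPN line, or NULL if the line does not match
parse_vpn_line <- function(line) {
  m <- regmatches(line, regexec(field_pattern, line))[[1]]
  if (length(m) == 0) return(NULL)
  data.frame(Group = m[2], User = m[3], IP = m[4], stringsAsFactors = FALSE)
}

sample_lines <- c(
  "Sep 3 13:18:08 10.10.10.1 Sep 03 2015 18:18:23 firewall-f03 : %ASA-5-713120: Group = IPSEC, Username = user1, IP = 86.65.197.212, PHASE 2 COMPLETED (msgid=575c7b2f)",
  "Sep 3 14:43:24 10.10.10.1 Sep 03 2015 19:43:39 firewall-f03 : %ASA-6-716059: Group <AnyConnect> User <user2> IP <216.58.218.196> AnyConnect session resumed connection from IP <216.58.218.196>.",
  "a line without VPN fields (made up for illustration)"
)

# Parse every line, drop the non-matching ones, then bind all rows in one call
rows <- Filter(Negate(is.null), lapply(sample_lines, parse_vpn_line))
vpn_fields <- do.call(rbind, rows)
print(vpn_fields)
```

Lines that do not contain the fields are skipped rather than producing NA rows, which the offset-based loop would do when match() finds no ‘Group’ token.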
D: The ‘freegeoip’ function written by Andrew.
freegeoip <- function(ip, format = ifelse(length(ip)==1,'list','dataframe'))
{
  if (1 == length(ip))
  {
    # a single IP address
    require(rjson)
    url <- paste(c("http://freegeoip.net/json/", ip), collapse='')
    ret <- fromJSON(readLines(url, warn=FALSE))
    if (format == 'dataframe')
      ret <- data.frame(t(unlist(ret)))
    return(ret)
  } else {
    # several IP addresses: look each one up and stack the rows
    ret <- data.frame()
    for (i in 1:length(ip))
    {
      r <- freegeoip(ip[i], format="dataframe")
      ret <- rbind(ret, r)
    }
    return(ret)
  }
}
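Because the loop in step C makes one HTTP request per log line, the same IP address may be looked up many times, and a single failed request stops the whole run. A minimal sketch of a wrapper that caches results per IP and returns NA coordinates on error is shown below. make_cached_lookup and stub_lookup are hypothetical names; the stub stands in for freegeoip so the example runs offline.

```r
# Wrap any lookup function so each IP is queried at most once
make_cached_lookup <- function(lookup) {
  cache <- new.env()
  function(ip) {
    if (!exists(ip, envir = cache, inherits = FALSE)) {
      # On error, fall back to NA coordinates instead of stopping
      result <- tryCatch(
        lookup(ip),
        error = function(e) list(longitude = NA_real_, latitude = NA_real_)
      )
      assign(ip, result, envir = cache)
    }
    get(ip, envir = cache, inherits = FALSE)
  }
}

# Offline stand-in for freegeoip: counts calls and fails for one input
calls <- 0
stub_lookup <- function(ip) {
  calls <<- calls + 1
  if (ip == "bad") stop("lookup failed")
  list(longitude = 2.35, latitude = 48.86)
}

geo <- make_cached_lookup(stub_lookup)
first  <- geo("86.65.197.212")  # calls the stub
second <- geo("86.65.197.212")  # served from the cache
failed <- geo("bad")            # error is turned into NA coordinates
print(calls)
```

With the real function this would be used as geo <- make_cached_lookup(freegeoip). Note that a failed lookup is cached too, so it is not retried within the same run.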
The output from the code is a data table, as shown below.
> vpnTable
        Group  User             IP     Long    Lat
1:      IPSEC user1  86.65.197.212    2.350 48.860
2: AnyConnect user2 216.58.218.196 -122.058 37.419
3:      IPSEC user3  208.87.150.50 -118.417 33.919
Visual presentations for management consumption can now be produced using ggplot. I used ggplot to plot the connection count for each ‘Group’ to see weekly trends (in my case I should see IPSEC going down and AnyConnect going up).
# geom_bar counts the rows per Group; geom_histogram needs a continuous x in ggplot2 2.0 and later
p2 <- ggplot(data = vpnTable, aes(x = Group)) +
  geom_bar(fill="blue1", colour="orangered") +
  labs(x = "VPN Type", y = "Conn's") +
  ggtitle("Total Conn's per VPN Type")
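The numbers behind the plot can also be checked directly. A minimal base-R sketch, using the three sample rows as stand-in data (the vpn data frame here is illustrative, not the real weekly table):

```r
# Stand-in for the Group column of vpnTable
vpn <- data.frame(Group = c("IPSEC", "AnyConnect", "IPSEC"),
                  stringsAsFactors = FALSE)

# Connections per VPN type, the same counts the plot displays
conn_counts <- table(vpn$Group)
print(conn_counts)

# Share of each VPN type in percent, handy for tracking the migration week to week
conn_share <- round(100 * prop.table(conn_counts), 1)
print(conn_share)
```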
One plot that may interest your network security team is the source IPs on a world map. This is what I did with just three lines of code.
mapWorld <- borders("world")
mp <- ggplot() + mapWorld
mp + geom_point(data = vpnTable, aes(x = Long, y = Lat), color="red", size = 2)
There you go; you now have a visual of the source IPs connecting to your VPN. If your company is a domestic shop and you see users (red dots) in other countries, you may want to get network security involved and investigate a bit more.
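The same check can be done in a table rather than by eye. The sketch below assumes the table also keeps a country code per row in a hypothetical Country column (the code in this post stores only longitude and latitude); the country values and the home_country setting are illustrative.

```r
# Stand-in table with an assumed Country column
vpn <- data.frame(
  User    = c("user1", "user2", "user3"),
  IP      = c("86.65.197.212", "216.58.218.196", "208.87.150.50"),
  Country = c("FR", "US", "US"),   # illustrative values
  stringsAsFactors = FALSE
)

home_country <- "US"  # assumption: the company operates in one country

# Rows whose source IP resolves outside the home country
foreign <- vpn[vpn$Country != home_country, ]
print(foreign)
```

Geolocation by IP is approximate, so rows flagged this way are a starting point for investigation rather than proof of anything.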
References
R Libraries used
library("ggmap")
library("maptools")
library("maps")
library("data.table")
library("stringr")