Saturday, October 24, 2015

Data is everywhere

I manage computer networks, work that does not normally involve data science terms like regression, machine learning and R. However, data is everywhere, and data science techniques can still be applied to glean new insights from age-old data sources like syslog. In this blog post I outline how I used my newfound data science knowledge (Coursera/JHU Data Science Specialization) to extract information and gain insights from syslog data.

Problem: We were migrating to a new virtual private network (VPN) technology (for the Cisco network folks, moving from the Cisco IPSEC client to the Cisco AnyConnect client). VPN-related information such as the username, IP address and VPN client in use would help with gathering migration metrics, VPN usage trends and tech support information.

Solution: Using simple Unix ‘grep’ commands I winnowed down my syslog data to the relevant weekly VPN-related logs. A sample of three lines of VPN-related syslog data is shown below (I have anonymized private information).

Sep  3 13:18:08 10.10.10.1 Sep 03 2015 18:18:23 firewall-f03 : %ASA-5-713120: Group = IPSEC, Username = user1, IP = 86.65.197.212, PHASE 2 COMPLETED (msgid=575c7b2f)
Sep  3 14:43:24 10.10.10.1 Sep 03 2015 19:43:39 firewall-f03 : %ASA-6-716059: Group <AnyConnect> User <user2> IP <216.58.218.196> AnyConnect session resumed connection from IP <216.58.218.196>.
Sep  4 10:24:34 10.10.10.1 Sep 04 2015 15:24:34 firewall-f03 : %ASA-5-713120: Group = IPSEC, Username = user3 IP = 208.87.150.50, PHASE 2 COMPLETED (msgid=575c7b2f)
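
The same winnowing could also be done inside R instead of with grep. Below is a minimal sketch, assuming the two message IDs seen in the sample above (713120 and 716059) are the only ones of interest. The function name filterVpnLines and the shortened sample vector are made up for illustration, and the middle line is an invented non-VPN line.

```r
# Hypothetical helper: keep only syslog lines that carry one of the two
# VPN message IDs from the sample above. Other setups may need other IDs.
filterVpnLines <- function(lines) {
  grep("%ASA-[0-9]-(713120|716059):", lines, value = TRUE)
}

sampleLines <- c(
  "firewall-f03 : %ASA-5-713120: Group = IPSEC, Username = user1, IP = 86.65.197.212, PHASE 2 COMPLETED (msgid=575c7b2f)",
  "firewall-f03 : an unrelated message (made-up line for illustration)",
  "firewall-f03 : %ASA-6-716059: Group <AnyConnect> User <user2> IP <216.58.218.196> AnyConnect session resumed connection from IP <216.58.218.196>."
)

vpnLines <- filterVpnLines(sampleLines)
print(length(vpnLines))
```

Filtering on the message ID rather than on words like "Group" keeps unrelated messages that happen to mention a group out of the data.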

This raw VPN-related syslog data was further processed using the R code described below. The code may not be reusable as is, since it is specific to my syslog format. It’s not elegant R code, but it does what I need it to do. The gist of the code is to clean the syslog data so that the information I need (‘Group’, ‘Username’ and ‘IP’) is at a consistent location on every line of the log. Here is the flow of the code.

A: R code to read the raw syslog data into a data table, vpnData.

vpnData <- data.table(readLines("temp2.txt", n = -1L, skipNul = TRUE, encoding = "UTF-8"))

B: Function to clean each line of syslog.

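# Generic cleaner; dispatches on the class of x
replaceExpressions <- function(x) UseMethod("replaceExpressions", x)
replaceExpressions.PlainTextDocument <- replaceExpressions.character  <- function(x) {
  # Drop angle brackets, colons and the filler words "from" and "connection"
  x <- str_replace_all(x, "<|>|:|from|connection", "")
  # Drop "=" and "," so both log formats become space-separated label/value pairs
  x <- str_replace_all(x, "=", "")
  x <- str_replace_all(x, ",", "")
  # Collapse the double spaces left behind, then trim the ends
  x <- str_replace_all(x, "  ", " ")
  x <- str_trim(x)
  return(x)
}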
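
To see why this cleaning puts the values at fixed offsets from the word "Group", here is a small base R sketch. It mirrors the substitutions above with gsub so that it runs without stringr. The name cleanLine is made up for illustration, and the two inputs are the message portions of the sample lines shown earlier.

```r
# Illustrative base R stand-in for the cleaning function above.
cleanLine <- function(x) {
  x <- gsub("<|>|:|from|connection", "", x)
  x <- gsub("=", "", x, fixed = TRUE)
  x <- gsub(",", "", x, fixed = TRUE)
  x <- gsub("  ", " ", x, fixed = TRUE)
  trimws(x)
}

ipsecMsg <- "%ASA-5-713120: Group = IPSEC, Username = user1, IP = 86.65.197.212, PHASE 2 COMPLETED (msgid=575c7b2f)"
anyconnectMsg <- "%ASA-6-716059: Group <AnyConnect> User <user2> IP <216.58.218.196> AnyConnect session resumed connection from IP <216.58.218.196>."

ipsecTokens <- strsplit(cleanLine(ipsecMsg), " ", fixed = TRUE)[[1]]
anyconnectTokens <- strsplit(cleanLine(anyconnectMsg), " ", fixed = TRUE)[[1]]

# In both formats the values sit 1, 3 and 5 tokens after "Group".
ipsecValues <- ipsecTokens[match("Group", ipsecTokens) + c(1, 3, 5)]
anyconnectValues <- anyconnectTokens[match("Group", anyconnectTokens) + c(1, 3, 5)]
print(ipsecValues)
print(anyconnectValues)
```

One caveat of this approach: the plain-word substitutions would also alter a username that happens to contain "from" or "connection".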

C: The R code below reads each row of the data table vpnData and picks out the ‘Group’, ‘Username’ and ‘IP’ values. This portion of the code calls a function named freegeoip, which takes an IP address as an argument and returns a number of geoip parameters. My code uses two of them, the latitude and longitude for the IP address. The five values (Group, Username, IP, Longitude and Latitude) are appended to a new data table, vpnTable, using rbindlist.

The ‘freegeoip’ function was written by Andrew; the code and the database service holding the geoip information are listed in the references section. The ‘freegeoip’ function and the database service are released under the GPL v2 and Creative Commons licenses respectively.
  
vpnTable <- data.table()

# Process one syslog line at a time
for (i in seq_len(nrow(vpnData))) {
  dd <- as.vector(as.matrix(vpnData[i,]))
  # Clean the line, then split it into space-separated tokens
  dd <- replaceExpressions(dd)
  dd <- str_split(dd, " ")
  dd <- dd[[1]]
  # The values of interest sit at fixed offsets from the word "Group"
  indx <- match("Group", dd)
  # Look up the coordinates for the IP address
  geoip <- freegeoip(dd[indx+5])
  dd2 <- data.table(Group = dd[indx+1], User = dd[indx+3], IP = dd[indx+5], Long = geoip$longitude, Lat = geoip$latitude)
  vpnTable <- rbindlist(list(vpnTable, dd2))
}
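
The loop above makes one geoip lookup per log line, so a user who reconnects often triggers the same lookup many times. One possible refinement, sketched here under assumptions, is to look up each distinct IP once and map the results back onto the rows. The names lookupUnique and fakeLookup are made up for illustration. fakeLookup is an offline stand-in that returns fixed coordinates, not the real service.

```r
# Counter used only to show how many lookups were made.
callCount <- 0

# Hypothetical offline stand-in for a geoip lookup function.
fakeLookup <- function(ip) {
  callCount <<- callCount + 1
  list(longitude = 0, latitude = 0)
}

# Look up each distinct IP once, then expand the result back to one row per input.
lookupUnique <- function(ips, lookupFn) {
  uniqueIps <- unique(ips)
  coords <- lapply(uniqueIps, lookupFn)
  geo <- data.frame(
    IP = uniqueIps,
    Long = vapply(coords, function(g) as.numeric(g$longitude), numeric(1)),
    Lat = vapply(coords, function(g) as.numeric(g$latitude), numeric(1)),
    stringsAsFactors = FALSE
  )
  geo[match(ips, geo$IP), ]
}

ips <- c("86.65.197.212", "216.58.218.196", "86.65.197.212")
result <- lookupUnique(ips, fakeLookup)
print(result)
print(callCount)
```

Passing the lookup function as an argument makes it easy to swap the stand-in for a real lookup once the rest of the pipeline is working.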
  
D: The ‘freegeoip’ function written by Andrew.

freegeoip <- function(ip, format = ifelse(length(ip)==1,'list','dataframe'))
{
  if (1 == length(ip))
  {
    # a single IP address
    require(rjson)
    url <- paste(c("http://freegeoip.net/json/", ip), collapse='')
    ret <- fromJSON(readLines(url, warn=FALSE))
    if (format == 'dataframe')
      ret <- data.frame(t(unlist(ret)))
    return(ret)
  } else {
    ret <- data.frame()
    for (i in 1:length(ip))
    {
      r <- freegeoip(ip[i], format="dataframe")
      ret <- rbind(ret, r)
    }
    return(ret)
  }
}

The code produces a data table like the one shown below.

> vpnTable
        Group  User             IP     Long    Lat
1:      IPSEC user1  86.65.197.212    2.350 48.860
2: AnyConnect user2 216.58.218.196 -122.058 37.419
3:      IPSEC user3  208.87.150.50 -118.417 33.919

Visual presentations using ggplot can now be produced for management consumption. I used ggplot to plot the number of connections for each ‘Group’ to get weekly trends (in my case I should see IPSEC going down and AnyConnect going up).

# Group is a discrete variable, so the rows are counted with geom_bar;
# geom_histogram requires a continuous x in ggplot2 2.0 and later.
p2 <- ggplot(data = vpnTable, aes(x = Group)) +
  geom_bar(fill="blue1", colour="orangered") +
  labs(x = "VPN Type", y = "Conn's") +
  ggtitle("Total Conn's per VPN Type")
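
The numbers behind such a chart can be checked without plotting. A minimal sketch, using only the three sample rows from the output table above (the variable names are made up for illustration):

```r
# Sample rows copied from the output table above.
vpnSample <- data.frame(
  Group = c("IPSEC", "AnyConnect", "IPSEC"),
  User = c("user1", "user2", "user3"),
  stringsAsFactors = FALSE
)

# Connections per VPN type, as counts and as percentages of the total.
connCounts <- table(vpnSample$Group)
connShare <- round(100 * prop.table(connCounts), 1)
print(connCounts)
print(connShare)
```

Tracking the share per week, rather than the raw count, makes the migration trend easier to compare across weeks with different amounts of VPN traffic.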

One plot that may interest your network security team shows the source IPs on a world map. I did this with just three lines of code.

mapWorld <- borders("world")
mp <- ggplot() + mapWorld
mp + geom_point(data = vpnTable, aes(x = Long, y = Lat), color="red", size = 2)

There you go; you now have a visual of the source IPs connecting to your VPN. If your company is a domestic shop and you see users (red dots) in other countries, you may want to get network security involved and investigate a bit more.
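
The same check can be roughed out in code. The sketch below flags rows whose coordinates fall outside a bounding box. The box is an assumption: a crude rectangle around the contiguous United States, chosen only because two of the three sample points fall there. A rectangle also covers parts of neighbouring countries, so treat the result as a starting point for investigation rather than a verdict.

```r
# Sample rows copied from the output table above.
vpnSample <- data.frame(
  User = c("user1", "user2", "user3"),
  Long = c(2.350, -122.058, -118.417),
  Lat = c(48.860, 37.419, 33.919),
  stringsAsFactors = FALSE
)

# Assumed, approximate bounding box for the "home" region (illustrative values).
homeBox <- list(longMin = -125, longMax = -66, latMin = 24, latMax = 50)

insideHome <- vpnSample$Long >= homeBox$longMin & vpnSample$Long <= homeBox$longMax &
  vpnSample$Lat >= homeBox$latMin & vpnSample$Lat <= homeBox$latMax

# Rows that fall outside the box and may deserve a closer look.
outsiders <- vpnSample[!insideHome, ]
print(outsiders)
```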

References


R Libraries used

library("ggmap")
library("maptools")
library("maps")
library("data.table")
library("stringr")