CMIE's firm databases use a fine-grained product code to identify
each product. Each firm is also allocated a product code based on
its predominant activities. I like to reconstruct a coarser
classification from this, one that suits my tastes, using the
following R function:
cmie.14.industries <- function(s) {
  ## 8-digit prefixes: manufacturing sectors, diversified firms, IT services
  values.8 <- c("Food", "Textiles",
                "Chemicals", "NonMetalMin",
                "Metals", "Machinery",
                "TransportEq", "MiscManuf",
                "Diversified", "Serv.IT")
  names(values.8) <- c("01010101", "01010102",
                       "01010103", "01010104",
                       "01010105", "01010106",
                       "01010107", "01010108",
                       "01010109", "01010408")
  ## 6-digit prefixes: construction, other services, mining, electricity
  values.6 <- c("Serv.Construction", "Serv.Other",
                "Mining", "Electricity")
  names(values.6) <- c("010106", "010104", "010102",
                       "010103")
  if (is.na(s)) {return(NA)}
  ## Try the most specific prefix first, then fall back to shorter ones
  leading8 <- substr(s, 1, 8)
  attempt <- values.8[leading8]
  if (!is.na(attempt)) {return(attempt)}
  leading6 <- substr(s, 1, 6)
  attempt <- values.6[leading6]
  if (!is.na(attempt)) {return(attempt)}
  leading4 <- substr(s, 1, 4)
  if (leading4 == "0102") {return("Serv.Finance")}
  return("MISTAKE")
}
This maps each firm into one of 14 coarse categories. Here are some
examples of this in action:
> cmie.14.industries("0102090000000000")
"Serv.Finance"
> cmie.14.industries("0101041502000000")
"Serv.Other"
> cmie.14.industries("0101010601010000")
"Machinery"
So in short, the function cmie.14.industries() maps a string like
"0101010601010000" into one of 14 broad industry names such as
"Machinery".
Faced with a file of roughly 48,000 firm-years, it seems at first
blush that this function has to be run 48,000 times. For a given
firm, the classification can change over time, so it isn't just a
matter of doing this once per firm. Here is one simple way to do it:
badway <- function(task) {
  ## Pre-allocate the result vector, then classify each element in turn
  result <- rep("", length(task))
  for (i in 1:length(task)) {
    result[i] <- cmie.14.industries(task[i])
  }
  result
}
This is just a loop that runs over everything in the supplied vector
and calls cmie.14.industries() for each element. The only concession
to efficiency is that the empty vector result is allocated ahead of
time.
This proves to be quite slow. None of the standard R vectorisation
ideas offer much relief.
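For instance, an sapply()-based version (a sketch of my own, not one
of the functions timed below) merely hides the loop;
cmie.14.industries() still gets called once per element, so it does
essentially the same work as badway():

sapply.way <- function(task) {
  ## sapply() hides the loop, but the classification function is still
  ## invoked once for every element of task; unname() tidies away the
  ## names that sapply() attaches to the result.
  unname(sapply(task, cmie.14.industries))
}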
The key idea for obtaining a leap in performance was that while I
had to run through 48,000 firm-years, the industry codes actually
attain only a modest list of possibilities. This makes possible a
table lookup:
goodway <- function(task) {
  ## Classify each distinct product code just once...
  possibilities <- unique(task)
  values <- rep("", length(possibilities))
  for (i in 1:length(possibilities)) {
    values[i] <- cmie.14.industries(possibilities[i])
  }
  ## ...then use the distinct codes as names, so that indexing by the
  ## full task vector amounts to a table lookup
  names(values) <- possibilities
  values[task]
}
For a problem of size 1000, this works out to be roughly 15 times
faster:
> load("task.rda")
> length(task)
[1] 1000
> system.time(res1 <- badway(task))
user system elapsed
0.030 0.000 0.031
> system.time(res2 <- goodway(task))
user system elapsed
0.002 0.000 0.002
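As a quick sanity check (not part of the original transcript), the
two results can be compared; goodway() returns a vector carrying
names from the lookup table, so those need stripping first:

## Should pass: both functions classify every firm-year identically.
stopifnot(identical(res1, unname(res2)))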
This is just a demo with a 1000-element task. In my production
situation, the performance difference is even greater, since badway()
calls cmie.14.industries() 48,000 times while goodway() only calls it
a few hundred times.
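The same table-lookup idea can also be expressed with match() rather
than name-based indexing. The sketch below (the name goodway2 is
mine, not from the post) classifies each distinct code once and then
maps every element of the task onto its position in that list:

goodway2 <- function(task) {
  possibilities <- unique(task)
  ## Classify each distinct product code exactly once
  labels <- unlist(lapply(possibilities, cmie.14.industries),
                   use.names = FALSE)
  ## Map each element of task to its position among the distinct
  ## codes, then pick out the corresponding label
  labels[match(task, possibilities)]
}

Like goodway(), this calls cmie.14.industries() only once per
distinct code, so the gain is the same; it just trades the named
vector for an integer index from match().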