Practical example of creating a blueprint and derived cost-effectiveness analysis: Targeting the Telecommunications Industry
As an example, let us apply the proposed methodology to an industrial scenario: the targeting use case in telecommunications. Targeting deals with choosing the right set of customers for a promotion or offer, with the goal of minimizing the number of contacts (and the related “customer fatigue”) while maximizing the redemption rate, that is, the probability that customers use the promotion (for example, by redeeming a coupon or purchasing a discounted product). Efficient and effective targeting can increase a company’s profitability, as it reduces the cost of a campaign, attracts customers to the shops, and induces them to spend more. The targeting solution works as follows:
- The technology collects data from different company data sources such as calls, SMSs, and Web logs.
- The ETL functions aggregate incoming data, building a complete profile for each customer.
- A machine learning algorithm (clustering) is trained on user profile data to identify similarities across users and products and to group customers with similar purchasing behavior (behavioral customer segmentation).
- The clusters obtained are very likely to represent a high number of customers and therefore, whenever a new customer comes into the system, there is no need to recompute the clusters, but we can just assign the new profile to an existing cluster, based on a minimum distance assessment. This operation is far quicker than the re-computation of the clusters and allows the company to reassign customers to clusters frequently (e.g. once a week), while performing re-clustering only seldom (e.g. once a year).
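The minimum-distance assignment described in the last step can be sketched as follows. This is a minimal illustration, not the implementation used in the case study: the Euclidean distance, the three profile features, and all centroid values are assumptions introduced for the example.

```python
import math

def assign_to_cluster(profile, centroids):
    """Return the index of the centroid closest to `profile`.

    Euclidean distance is an assumption; the text only asks for a
    minimum-distance assessment.
    """
    distances = [math.dist(profile, c) for c in centroids]
    return min(range(len(centroids)), key=distances.__getitem__)

# Hypothetical centroids over (call_minutes, sms_number, internet_size).
# The values are invented for illustration only.
centroids = [
    (5.0, 40.0, 0.2),
    (60.0, 5.0, 1.5),
    (20.0, 10.0, 8.0),
]

# A new customer profile is assigned without recomputing the clusters.
new_customer = (55.0, 8.0, 1.0)
cluster = assign_to_cluster(new_customer, centroids)
print(cluster)
```

Each assignment costs one distance computation per centroid, which is why it can be repeated weekly for all customers, while re-clustering is run only rarely.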
The attached figures show the initial architecture for the targeting use case, before and after, built by selecting components from the overall integrated blueprint explained in the Generic Big Data Analytics Blueprint pipeline.
Next methodological steps: Sizing the architecture, understanding critical activities and sizing the data
Having presented the overall schema, let us follow the other steps of the methodology to size the architecture. The second step of the methodology is to understand the critical activities, which will be the main focus of the architectural sizing, as they are data- and/or processing-intensive. From the architecture in the attached figure, we can clearly distinguish two critical activities: the Extract-Transform-Load (ETL) function and the clustering function. The ETL functions are designed to execute simple operations, such as aggregating daily (or weekly) data for each customer. The problem is the huge amount of data that they have to deal with in a telecommunications company with millions of customers. For example, counting the minutes spent on calls by each customer is a simple sum across the different calls for that customer, but it becomes demanding as the same operation has to be repeated for millions of customers (the size of a medium telco company is 10 M customers).
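The per-customer sum mentioned above can be sketched as follows. This is an illustrative, single-machine version of the aggregation. The sample records are invented, and we assume that the duration field of a call record is expressed in seconds, which the case study does not state.

```python
from collections import defaultdict

# Hypothetical call records: (contract_id, duration_in_seconds).
calls = [
    ("C001", 120),
    ("C002", 45),
    ("C001", 300),
    ("C003", 60),
    ("C002", 15),
]

def total_call_minutes(records):
    """Sum the call duration of each customer and convert it to minutes."""
    totals = defaultdict(float)
    for contract_id, duration_s in records:
        totals[contract_id] += duration_s / 60.0
    return dict(totals)

minutes = total_call_minutes(calls)
print(minutes)
```

The logic is a single pass over the records. The cost comes from the volume: the same pass has to cover every call of every customer, which is why this step is typically distributed across a cluster.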
We have two clustering functions in the targeting use case: the main function, which, as already stated, evaluates the clusters and calculates the centroid of each cluster based on a machine learning algorithm (e.g. k-means), and the function that assigns each customer to the closest cluster by evaluating the distance between his/her profile and the various centroids. Let us assume that we use k-means as the clustering algorithm for the first function. The necessary processing capacity depends on whether the number of clusters is known in advance. When it is not, a technique that can be adopted is to look for the number of clusters k that maximizes both the inter-cluster difference and the intra-cluster cohesion (that is, it minimizes the distance between members of the same cluster). This solution requires running the k-means algorithm for all the possible values of k or, by exploiting previous experience and knowledge, only for fewer but more promising values.
However, this procedure can be executed once or a few times a year, balancing the high computational requirements of k-means against the accuracy of the clusters. In an industry such as telecommunications, clusters are not likely to change over a short period of time. If customer behavior changes, it is likely that changes can be accommodated by assigning the customer to a different cluster.
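The search for k can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the case study: the silhouette coefficient is our choice of a score that rewards well-separated and compact clusters, the farthest-first initialization is used only to make the example reproducible, and the data points and candidate values of k are invented.

```python
import math

def farthest_first(points, k):
    """Deterministic initialization: repeatedly pick the point farthest from the chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels

def silhouette(points, labels):
    """Mean of (b - a) / max(a, b): a = mean distance within the own cluster, b = to the nearest other cluster."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Invented two-feature profiles forming three visible groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
    (9.0, 9.0), (9.2, 8.9), (8.8, 9.3), (9.1, 9.1),
    (1.0, 6.0), (0.9, 6.2), (1.2, 5.8), (1.1, 6.1),
]
candidate_ks = [2, 3, 4, 5]  # the "promising" values to try
scores = {k: silhouette(points, kmeans(points, k)) for k in candidate_ks}
best_k = max(scores, key=scores.get)
print(best_k)
```

Every candidate k costs one full k-means run, so restricting the candidates to a few promising values directly reduces the processing capacity needed for this step.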
Sizing the data: We assume that the data result from the integration of three databases: calls, SMSs, and Web logs. These databases could have the following schemas (based on our case study analysis):
Call (Contract_id, SIM, phone_id, plan, call_type, call_destination_type, call_destination_id, duration, time, date)
Sms (Contract_id, SIM, phone_id, sms_destination_type, sms_destination_id, content, time, date)
Packet (client_id, socket_src, socket_dst, timestamp_begin, timestamp_end)
Each row in these databases could be around 200 bytes. A small-medium telecommunications company has 5 million customers (large telecom companies can reach 30 million customers in one country). Two other important inputs are the average number of calls per customer, which is around 7/day, and the average number of SMSs sent by each customer, which we estimated to be 10/day.
Assessing the costs is another step of the methodology, and it starts from the data volumes. We estimate that a small-medium company can gather around 3 GB of call data and 5 GB of SMS data each day. The processing functions should keep information about the daily usage statistics for each customer in the three categories (calls, SMS and internet usage), possibly divided into time slots. Ideally, assuming six time slots (0-4, 4-8, 8-12, 12-16, 16-20, and 20-24), we want to create the following profile for each customer: client_id, (call_minutes, sms_number and internet_size) for the six time slots, and a total weekly aggregation, which accounts in total for 90 to 100 bytes. By considering the total number of customers, we derived that 0.5 GB of data is generated each week and 185 GB each year.
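The size of one customer profile and of one full set of profiles can be checked with a short calculation. This is a sketch under stated assumptions: the field widths (a 64-bit client_id and 32-bit values) are hypothetical and are chosen only because they fall in the 90 to 100 byte range given above, and the total is read as one extra value per metric.

```python
CUSTOMERS = 5_000_000   # small-medium operator, from the text
TIME_SLOTS = 6          # 0-4, 4-8, 8-12, 12-16, 16-20, 20-24
METRICS = 3             # call_minutes, sms_number, internet_size
ID_BYTES = 8            # assumption: 64-bit client_id
VALUE_BYTES = 4         # assumption: 32-bit value for each metric

# Six time slots plus one total for each metric.
values_per_profile = METRICS * (TIME_SLOTS + 1)
profile_bytes = ID_BYTES + values_per_profile * VALUE_BYTES

# One profile per customer, expressed in decimal gigabytes.
snapshot_gb = CUSTOMERS * profile_bytes / 1e9

print(profile_bytes, snapshot_gb)
```

With these assumed widths, a profile takes 92 bytes and one set of profiles for all customers takes a little under 0.5 GB.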
Understanding the processing performance is one of the last steps of the methodology. As far as the processing time is concerned, the data preparation procedure takes most of the time. From the benchmarks of the aggregating procedures on Spark, it takes around 2 hours of daily processing to produce aggregated results for each day (in the machine configuration provided in Liu 2015, which used the HiBench benchmark available from the DataBench Toolbox), and two more hours of processing at the end of each week to compute the mean values. Finally, 8 hours of processing are needed to compute the minimum distance between all the customers and the cluster centroids, which adds up to a total of 24 hours of processing each week. We are not considering the time spent running k-means, since we assume it is performed only twice a year.
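The weekly total can be reproduced from the three figures above. The short calculation below uses only the durations given in the text; the yearly figure simply assumes 52 identical weeks and, as in the text, excludes the k-means runs.

```python
DAILY_AGGREGATION_H = 2   # aggregation run every day
WEEKLY_MEAN_H = 2         # mean values computed at the end of the week
ASSIGNMENT_H = 8          # minimum distance between customers and centroids

weekly_hours = 7 * DAILY_AGGREGATION_H + WEEKLY_MEAN_H + ASSIGNMENT_H
yearly_hours = 52 * weekly_hours  # assumption: 52 identical weeks

print(weekly_hours, yearly_hours)
```

The daily aggregation accounts for 14 of the 24 weekly hours, which is consistent with data preparation being the most time-consuming procedure.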
As an example, given these premises, we have estimated the scaling function for the number of cores and the costs with various Amazon EC2 instances, all running the Spark framework (through Amazon EMR) and using Amazon S3 as storage. At the time of writing, the costs vary significantly, from around $14,000 to $130,000 (an order of magnitude), depending on the AWS instance type, given the same overall resources of a cluster. This suggests that a careful benchmarking of the solution on different machines could provide significant economic benefits. These estimates have been presented to the telecommunications company that participated in the case-study analysis. Based on cost considerations, we have proposed the application of this framework to select a smaller group of “targeted” customers to be called by the company, in place of sending an SMS, which is the company's current approach. This would increase the redemption rate (calls have a greater redemption rate than SMSs) at affordable costs. The company confirmed that this change in approach could also reduce churn and that the results of the sizing and cost analysis were realistic and useful for this type of strategic consideration.
These considerations were based on a cost-benefit analysis of the advantages that a telecommunications organization can gain. A telecommunications company cannot call all its customers to present an offer, as the call center is an expensive channel and costs would soar. A promotion has limited returns, which would not compensate for the costs. On the other hand, the company can call its competitors’ customers to invite them to churn with an interesting offer, as in this case costs are offset by a much larger customer lifetime value. Smarter targeting and customer selection reduces costs by over 90% if the set of targeted customers does not exceed 10% of the overall number of customers: accurately selecting the customers to be called drastically lowers the number of calls, reducing expenses from 15 M€ to 400 K€ yearly (for a medium-size telecommunications company). Targeting through SMSs, by contrast, would not involve the same cost for the company: text messages are very cheap for a firm that owns its infrastructure. The benefit of a smarter promotion campaign through SMSs would be higher customer satisfaction, given the lower number of texts received and the consequently lower customer fatigue.
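The size of the saving follows directly from the two yearly figures quoted above. The helper below is a hypothetical name introduced for this illustration, and the amounts are taken from the text.

```python
def cost_reduction(baseline, targeted):
    """Fraction of the baseline cost saved by the targeted campaign."""
    return 1.0 - targeted / baseline

# Yearly call-center expenses from the text: 15 million versus 400 thousand.
reduction = cost_reduction(15_000_000, 400_000)
print(f"{reduction:.1%}")  # prints 97.3%
```

The result is consistent with the claim of a cost reduction of over 90%.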
It should be noted that costs reported are limited to processing and exclude data management and replication costs. The overall costs of the architecture reported are higher, as different components typically require separate machines. For example, the database has a dedicated machine/cluster. In this example, data are roughly 150 GB/year, but targeting may require a longer time frame and, in turn, involve additional hardware. Companies may decide to keep promotional information on a separate machine, for performance reasons, further increasing the necessary hardware.
However, our estimates provide an order of magnitude for the hardware necessary for the most processing-intensive functions and, hence, for the whole architecture. This is sufficient to support the identification of opportunities and the related strategic decisions. A pilot should then be performed to verify costs and to carry out an accurate software selection with benchmarking.
More information about this example can be found in DataBench deliverable D4.3.