Financial Data Science Applications

Finance was one of the first industries to develop and apply data science methods. What is financial data science? Kaplan Schweser defines financial data science as combining “the traditions of econometrics with the technological components of data science. Financial data science uses machine learning, predictive and prescriptive analytics to provide robust possibilities for understanding financial data and solving related problems.”

Finance Applications

Data science has been applied to many areas of the finance industry. One that people interact with every day (but may not realize it) is financial fraud prevention. If your credit card information is stolen, the company will flag the charge and alert you about a possible fraudulent purchase. But how does the company know whether a purchase was completed by the actual owner of the card or whether it is fraud? Machine learning algorithms have become an integral part of this area of finance because they can quickly handle large quantities of varied data and effectively predict credit card fraud.

When many people hear “financial data science,” the first thought that comes to mind is algorithmic trading. Algorithmic trading uses data science algorithms to identify profitable trading strategies, and it has been adopted by many financial institutions, including hedge funds, pension funds, and even banks. According to trality.com, “more than 60% of the top-performing hedge fund managers made use of algorithmic trading.”

While many applications of financial data science exist, one final example is loan or credit card approval. This application is often used as an easy introductory model in financial data science courses, while slightly more complicated versions are applied by financial corporations. The goal of these models is to predict whether a loan or credit card should be approved when an individual applies. Financial institutions make money from the interest earned on the loan, while the primary risk is that the individual will not pay back the money they are lent. Various datasets and data science models, such as logistic regression, decision trees, and random forests, can be used to help the company decide whether the loan should be approved.
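As a rough illustration of how such a model might be built, here is a minimal Python sketch using scikit-learn’s logistic regression; the file name loans.csv and the column names (income, credit_score, debt_to_income, approved) are hypothetical placeholders, not a real dataset.

# Minimal sketch of a loan-approval classifier (hypothetical data and columns).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("loans.csv")                        # hypothetical applicant data
X = df[["income", "credit_score", "debt_to_income"]]
y = df["approved"]                                   # 1 = approved, 0 = denied

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

The same pipeline could swap in a decision tree or random forest; the point is that a simple, well-evaluated classifier is enough to support the approval decision.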

Financial Data Professional® (FDP) Charter

With the continued growth in financial data science, standardized professional exams have been created to give individuals with the applicable skills and knowledge a way to designate themselves as masters of the content. One such designation is the FDP Charter. The FDP Institute, FDP exam, and FDP charter are run by the Chartered Alternative Investment Analyst Association (CAIA). For more information about the FDP Institute and the exam, check out their website at: https://fdpinstitute.org/

Why pursue Data Science?

What is Data Science?

Looking at some of the most prestigious universities in the USA, here is what they say. According to Stanford Data Science, “Data science is the practice of extracting knowledge and insight from data in a reproducible manner, and effectively communicating the results.”

Penn State Data Science says, “Data Sciences is a field that explores the methods, systems, and processes used to extract knowledge from data and turn these insights into discoveries, decisions, and actions.”

UC Berkeley states, “Data science combines computational and inferential reasoning to draw conclusions based on data about some aspect of the real world.”

Finally, the Johns Hopkins Data Science Lab emphasizes the importance of data science as follows: “Data science is a fundamental way of thinking in many areas of science, business, and government. We believe all people should be able to develop literacy, fluency, and skill in data science so they can make sense of the data they encounter in their personal and professional lives.” From these statements, one can infer that data science is in high demand today.

Data Science as an Opportunity!

With advances in digital technology, more data is being produced than ever before, in a variety of formats, with numerical, text, and video data among the most common. This data contains useful information for businesses, governments, and policymakers. Since data by itself is not information, the relevant information must be extracted through proper preprocessing and analysis. As data proliferates and becomes readily available, stakeholders can extract useful information from it and make informed decisions. This is where the data scientist comes in. In this competitive market, data scientists extract knowledge and insight from data and effectively communicate the results to policymakers, who then make policies or informed decisions with the support of data science. No matter where you are, whether in high school, just starting university, or in graduate school, do not miss the chance to build your data science skills.

Data Science as an Investment!

Every student spends a good amount of money to earn a college or graduate degree. Since every rational thinker expects a strong return on their investment, investing in data science skills and degrees is among the best investment alternatives available at the moment. Here is the justification. Job openings are advertised constantly on LinkedIn, Glassdoor, Y Combinator, Analytics Jobs, and Big Data Jobs. In addition, individual company websites, for example, Amazon, Google, Apple, Salesforce, Oracle, and Microsoft, to name a few, post openings for data science-related jobs. Furthermore, the 2021 U.S. News Best Jobs Rankings placed Data Scientist second among technology jobs and Statistician second among business jobs. Glassdoor also lists Data Scientist second on its list of the top 50 jobs in the USA for 2021 and cites a median starting salary of $113,736. The data also indicate significant growth in these jobs, with continued growth expected in the near future. Therefore, walking out of the university with a data science degree can help you land a high-paying job with top-level job satisfaction.

Data Science is Fun!

Think about what you love most. Is it Sports? Marketing? Technology? Public Health? Biology? Chemistry? Agriculture? Finance? No matter the field, with data science skills you can find a job that fits your area of interest. It is basically learning the most enjoyable material and adding data science skills so you can learn even more and do more in that field. Finally, isn’t it awesome to get paid a comfortable salary while doing what you love most? In conclusion, a data scientist can practice their favorite field with their data science skills and domain expertise.

Data Science can transform the quality of life!

Data science can be transformative for first-generation and marginalized groups. The tremendous opportunity and wide range of job possibilities that come with a data science degree can lead to a high-paying job and, in turn, financial freedom. According to the U.S. Bureau of Labor Statistics, demand for data science skills will drive a 33.8 percent rise in employment in the field through 2026. In addition, the job satisfaction score for data scientists is 4.1 out of 5, according to Glassdoor’s 2021 research. Furthermore, work arrangements in data science are often flexible, making it easier to balance work and family, which is another plus. Overall, it can transform the quality of life.

Split Your Data

You have some data and are ready to build a model.  Hooray!  Before you build a model, you use scikit-learn’s train_test_split.  That’s a good start, but I’ve seen data scientists make lots of mistakes with splitting data.  Here’s how this can easily go wrong, so you can avoid these pitfalls.

Set Your Seed

You split your data, build your model, review your results, and everything looks great.  You run your code again and everything changes!  Set your seed for train_test_split (and while you are at it, any other methods you use where a seed can be set) so that your work is reproducible.  This allows you and anyone else to run your code and get the same results.
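As a minimal sketch, assuming X and y already hold your features and labels, fixing the random_state argument is all it takes:

from sklearn.model_selection import train_test_split

# Setting random_state makes the split (and everything downstream) reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)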

Check for the Same Entity

Another way your data split can go wrong is when you have an entity – a customer, a patient, a subject, etc. – show up in your data multiple times.  For example, you are working on hospital admissions data, and some patients are admitted multiple times during the window of time your data covers.  These aren’t duplicate records, but if your model trains on records for a patient and then sees that same patient in the test data, your model may already know the “answer” for that patient.

The solution is to split your data so that each entity ends up either entirely in the training data or entirely in the test data.  For our example with patients, we could randomly split into train and test by patient, rather than by admission record.  This ensures that our test metrics are trustworthy and that the model hasn’t “cheated” by already seeing the answer for the entity during training.
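One way to do this in scikit-learn is GroupShuffleSplit; this sketch assumes a pandas DataFrame df with a patient_id column alongside features X and labels y (hypothetical names):

from sklearn.model_selection import GroupShuffleSplit

# Every admission for a given patient lands entirely in train or entirely in test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=df["patient_id"]))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]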

Feature Engineer After Splitting

The order in which you do things matters.  Perform your data splitting before you begin feature engineering.  Your model should know nothing about your test data.  If you are going to engineer features from derived values, such as a column mean, calculate those values after you have split your data, and compute them from your training data only.

Violating this rule is another example of data leakage.  It can make the metrics on your test data look better than they really should, because your model has incorporated some information from the test data.  When you deploy the model on data that truly has never been seen before, this often results in a drop in performance.

Note that this also includes data augmentation.  For example, if you are creating more training data for an image classifier by rotating and cropping the images, do this to your training data after the split has occurred, not before.
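As a small sketch of the right ordering, the scaling statistics below are computed from the training data only and then applied unchanged to the test data (assuming X_train and X_test come from an earlier split):

from sklearn.preprocessing import StandardScaler

# Fit (compute means and standard deviations) on the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...then apply the same transformation to the test data, with no refitting.
X_test_scaled = scaler.transform(X_test)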

Time is Important

Using time series data in your modeling can be tricky.  Usually with time series data, your model incorporates some prior values of that data.  For example, you are predicting the number of orders placed today given the number of orders for each of the seven prior days along with some other information.  When splitting this data, you do not want to perform a random split.  If you do, some of the values you want to predict are known features in your training data – orders for October 7th will be predicted from your test data, but that value is a feature in predicting orders for October 8th, which is in your training data.

Instead, split your data based on time.  The most recent data should be your test set.  This mimics what will happen when you start using your model in the real world – your model doesn’t know anything about future values, and instead only knows about what has occurred up through training.  This keeps you from a situation where your metrics on the test data look great but your model performs poorly in production, because during training it had effectively already seen the values it was tested on.
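A minimal sketch of such a split, assuming a pandas DataFrame df with one row per day and an order_date column (hypothetical names), might look like this:

# Sort chronologically, train on the earlier days, test on the most recent ones.
df = df.sort_values("order_date")
cutoff = int(len(df) * 0.8)
train_df = df.iloc[:cutoff]   # oldest 80% of days
test_df = df.iloc[cutoff:]    # most recent 20% of days

For cross-validation on time series, scikit-learn’s TimeSeriesSplit applies the same chronological idea across multiple folds.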

Summary

I hope this guidance helps you to avoid some of the common errors made with splitting your data.  Happy modeling!

Transformational Data Science

One data scientist I follow on LinkedIn is Eric Weber, Head of Data Product for Yelp. In a recent post, he wrote, ‘An interesting problem may have no business value, while a boring problem may be transformational for your company.’ This has been my experience and it really stuck with me, although I might substitute the word ‘solution’ for ‘problem.’ I have often labeled complex modeling as ‘interesting,’ and basic exploratory data analysis (EDA) or basic descriptive statistics as ‘boring.’ As a more experienced data scientist now, I’ve changed my view substantially. 

I think it’s easy to assume that a complex solution has more value. It takes more time to implement, right? So it must be better? That often isn’t the case. In business, it’s not just the solution itself; other ecosystem factors come into play, such as time to market, existing solutions in the market from competitors, explainability requirements, and others. As a trainee, it’s important to shed assumptive labels for different problem-solving approaches. More complex doesn’t always mean better. More complex doesn’t always equate to interesting, either.

For data scientists entering the job market, I would encourage you to think about the ‘why’ behind your solutions for different projects. Often, it might be because that was the appropriate model for the type of data you had available. But if you had it to do over again, would you implement a different solution? What is the incremental value gained by a more complex model over a simpler solution that might take less time? Talking through this reasoning is one of the hallmarks of an experienced data scientist.

Several years ago, I was watching Food Network, and a baker mentioned that a very elaborate cake was very easy to make because you could hide your errors very easily. Basic cakes with little decoration were much more difficult. I feel the same way about data science—doing the basics is very hard because you must be disciplined. You must do basics somewhat flawlessly. And that is indeed transformational. 

Artificial Intelligence’s Ability to Autonomously Adapt to Changing Demands

One key aspect of robust artificial intelligence is the ability to autonomously adapt to changing task demands or even learning completely new tasks as the need arises. Many data science and machine learning approaches anticipate little-to-no change in task demands over time. In the most typical cases, distinct statistical or machine learning models are assigned to learn entirely separate tasks. Such an approach requires manual intervention when task demands shift over time or when new tasks arise.

Past work in psychology and neuroscience has developed theories on how humans and animals overcome such limitations: one key component is contextualized learning supported by working memory. Computational simulations and neuroimaging studies of working memory function helped establish its neural basis as residing in the interactions between two brain systems: the prefrontal cortex and the mesolimbic dopamine system.

However, much of the complexity associated with the biological details in such models has recently been removed, exposing the core computational mechanisms of task-switching behavior. These core mechanisms can be integrated into deep-learning models: a powerful learning framework that has demonstrated tremendous success in a wide variety of domain tasks in recent years (https://www.pnas.org/content/117/48/30033).

Work by two students in the Department of Computer Science, David Ludwig and Lucas Remedios, under the direction of Data Science Institute affiliate Dr. Joshua L. Phillips, has helped bridge the gap between these fields by integrating autonomous task-switching mechanisms inspired by human working memory into one of the most popular deep-learning frameworks, TensorFlow/Keras (https://www.tensorflow.org/).

Lucas was an undergraduate who helped to develop the initial framework under funding from the MTSU Undergraduate Research and Creative Activity program, and later David, a graduate student working on a master’s thesis, completed the framework and tested it against a range of different tasks with differing types of data. The framework proposes two complementary mechanisms which allow the deep-learning models to adapt to different tasks over time: a context layer which can swap out task-specific context representations (analogous to the prefrontal cortex in the brain) and a method for computing context loss which can be used to decide when to switch or update context representations (analogous to the mesolimbic dopamine system).
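For a rough sense of what the first mechanism might look like in TensorFlow/Keras, the sketch below keeps one learnable context vector per task and concatenates the currently active one onto the layer’s input. This is an illustrative toy with invented names, not the published framework; the authors’ actual code is linked below.

import tensorflow as tf

class ContextLayer(tf.keras.layers.Layer):
    # Illustrative sketch only: one learnable context vector per task,
    # with the active one appended to every input example.
    def __init__(self, num_contexts, context_dim, **kwargs):
        super().__init__(**kwargs)
        self.num_contexts = num_contexts
        self.context_dim = context_dim
        # Index of the currently active task context (not updated by gradients).
        self.active = tf.Variable(0, trainable=False, dtype=tf.int32)

    def build(self, input_shape):
        # One task-specific context representation per task.
        self.contexts = self.add_weight(
            name="contexts",
            shape=(self.num_contexts, self.context_dim),
            initializer="zeros",
            trainable=True)

    def call(self, inputs):
        # Broadcast the active context across the batch and concatenate it.
        batch = tf.shape(inputs)[0]
        ctx = tf.gather(self.contexts, self.active)
        ctx = tf.broadcast_to(ctx, tf.stack([batch, self.context_dim]))
        return tf.concat([inputs, ctx], axis=-1)

    def switch_context(self, new_index):
        # An external controller (analogous to the "context loss" signal)
        # would call this when the current context no longer fits the task.
        self.active.assign(new_index)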

A manuscript describing their work was peer-reviewed and recently accepted at the 33rd IEEE International Conference on Tools with Artificial Intelligence (https://ictai.computer.org/), and code for all models and experiments is freely available online (https://github.com/DLii-Research/context-learning).

(This paper was one of 550 papers submitted to the 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence and one of 110 accepted as full papers. It won the Bourbakis-Ramamoorthy Best Paper Award at the conference on Tuesday, November 2, 2021.)

Ludwig, D., Remedios, L., and Phillips, J. L. (in press). A neurobiologically-inspired deep learning framework for autonomous context learning. In Proceedings of the IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI 2021).

The Interplay between Data Sciences and Actuarial Sciences

I am Vajira Manathunga, an actuarial science faculty member at MTSU. My research areas include predictive analytics, computational statistics, and mathematical modeling. The purpose of this essay is to discuss the intertwined nature of data science and actuarial science. With the emergence of data science and big data in the last decade, actuarial science saw changes in the industry as never before. The Society of Actuaries (SOA) and the Casualty Actuarial Society (CAS) understood the challenges posed by data science, and they formally introduced data science into actuarial science through new exams.

What is an actuary? The Institute and Faculty of Actuaries in the UK defines the profession this way: “Actuaries are problem solvers and strategic thinkers, who use their mathematical skills to help measure the probability and risk of future events. They use these skills to predict the financial impact of these events on a business and their clients.” Actuaries are therefore interdisciplinary in nature and should be able to draw on skills from various fields such as mathematics, statistics, finance, economics, business, communication, and computer science.

Why do actuaries need data science in their toolset? In today’s world, the volume of data is booming every second as customer information pours from digital platforms, smartphones, and smart sensors into company databases. These data contain more personalized information that can be used to provide customized products and services. Therefore, data science skills are becoming a valuable tool for any actuary. Data science unifies statistics and data analysis techniques, which allows actuaries to work with big data sets and develop predictive models with millions of input variables. For instance, in auto insurance, machine learning techniques are applied to accelerometer and gyroscope data to identify driver risk. In health insurance, neural network methods are used to identify cancer or heart disease. Similarly, in life insurance, predictive models are used to provide real-time quotes based on underwriting variables.

Traditionally, actuaries focused on statistical analysis and database skills such as SQL. However, as data have become unstructured, shifting from numbers to text, images, and video, other data science skills are needed. Unstructured data may contain new information such as social media posts, text documents, and various other forms of data. If analyzed correctly, this data may give insurance companies a unique perspective on their customer base and the broader market.

So, what does the future look like for actuaries? Some predict, “The next insurance leaders will use bots, not brokers, and AI, not actuaries.” According to this view, actuaries will no longer be relevant in the insurance field. However, contrary to this belief, others think actuaries are uniquely positioned to adapt and evolve, applying data science techniques to mortality modeling, automobile insurance, healthcare insurance, catastrophe risk analysis, claim reserving, valuation, and life and non-life insurance pricing.

No matter what, it is clear that actuaries must have data science skills to compete in the future. The actuarial science program at MTSU received the 2020 CAS University Award for integrating data science into the actuarial science curriculum, providing hands-on training to students through data-driven projects and activities, and contributing research to actuarial science. Our program explicitly offers ACSI 6110, ACSI 5530/4530, and ACSI 4600, which allow students to learn the interdependent nature of actuarial science and data science.

The LAST Last-Mile of Data

I was wrong about the “last mile of data.”

Over a decade ago, this was a term I started using to express the challenges of the data world. In my consulting practice, I had seen how many organizations struggle to bridge the gap between their massive data investments and the minds and actions of decision-makers. Here’s what I said at the time:

This critical bridge between data warehouses and the communication of insights to decision-makers is often weak or missing. Your investments and meticulous efforts to create a central infrastructure can become worthless without effective delivery to end-users. “But how about my reporting interface?” you wonder. That’s a creaky and narrow bridge to rely on for the last mile of business intelligence.

When I talked about “the last mile,” I emphasized the need to better visualize data and communicate insights.

But I was missing something. You can deliver data in ways that are intuitive, friendly, simple, and useful…but it still needs to be sold.

Sold?! It is an ugly word for data people. It reeks of manipulation and bias.

But if you want people to use your data, you need to change behaviors and assumptions. You need to convince your audience that it is worth their attention.

For example, we’ve been working with a global manufacturer committed to becoming more data fluent and data-driven across their worldwide operations. They have invested in data warehouse efforts and designed thoughtful new dashboards. Fortunately, our client realized that “build it and they will come” is a fantasy. Instead, we’ve helped them with a comprehensive plan to ensure their data has an impact. Here are a few of the steps:

● We trained a cohort of evangelists in data storytelling to improve the quality of the data products;
● We developed an internal communications campaign to go alongside their data product rollouts, explaining the value and purpose of each solution;
● We created a support structure and tutorials to ensure that data product users fully understand each data product;
● We gathered feedback and updated their data products based on user needs.

More than anything, the data leadership team recognized that technology and design are not the complete answer. They also need to change the culture and attitude of the organization.

This is the true LAST Last-Mile of Data, the changing of minds to ensure your data gets used.

Zach Gemignani is the CEO of Juice Analytics and a collaborator with the Data Science Institute.

MTSU’s Data Science Institute to assist in opioid abuse study

Cynthia Chafin, CHHS associate director of community programs, is serving as lead principal investigator on the grant, with community and public health assistant professor Kahler Stone as co-principal investigator and with biology professor Ryan Otter and the Data Science Institute providing data expertise and support. MTSU graduate student Chipper Smith, a Wilson County resident and public health student, will assist with grant activities, as he did as a project assistant on the earlier planning grant.

For more details on this study, go to MTSUNews.