Course Policies for Using AI

Overview

One of the difficult issues we face this fall in higher education is determining what policies to use for AI in our classrooms. The use of generative AI can effectively break some of the ways we evaluate our students. If, for example, students are graded on creative writing, the use of generative AI by some students could give them an unfair advantage over others.

To understand how this issue is being addressed elsewhere, I looked at the recently updated publication policies of several scientific journals and societies, including Science, Nature, Cambridge University Press, Elsevier, IEEE, the American Chemical Society, the American Astronomical Society, the Association for Computing Machinery, and the American Physical Society.

The guidelines they used were instructive. They ranged from very restrictive to open to usage if the contribution was acknowledged in the paper. No publisher allowed AI to be a co-author, and all of them required the human authors to be responsible for the content being presented.

Based on these guidelines, and with the help of ChatGPT (GPT-4), I’ve put together a few template policies that might be suitable for higher-education classrooms. These templates don’t address the ways we might change our courses, but they make clear to students what is considered ethical within a given course. The policies range from prohibiting AI entirely, like the Science publication standard, to being open to its use within some guidelines. There can’t be a single standard, but I hope this helps you think about how you want to handle AI in your classes.

You are free to adapt and adopt these as needed for your classes. 

I’ve also included an example of how students might use generative AI to create study materials for their classes.

Using AI to Create Study Materials

Using the transcript of a lecture recorded in Panopto last year, I created useful materials for students, including:

  • A lecture outline
  • A three-paragraph lecture summary
  • Sample lecture questions using multiple choice and essay formats
  • A vocabulary list
  • Sample data tables
  • A list of images and figures to study
  • A timeline of events discussed in the lecture
  • A list of common misconceptions

These materials could be generated by either faculty or students using a generative AI with a sufficiently large context window. (I used Claude 2 so I could load the entire transcript into the system.) Please note: you need to review the material before you use or distribute it. Some of the questions had multiple right answers, and the explanations were occasionally misguided.

 

Template Policies

Policy 1 – Use of AI is Prohibited

The use of AI-generated content, including text, images, code, figures, and any other material, is strictly prohibited for any work submitted in this class. This includes using such content for homework, papers, code, or other creative works. This restriction encompasses both the creation and revision of work by AI. Violation of this policy will be considered academic misconduct and will be dealt with accordingly. The use of basic word-processing AI systems, including grammar and spelling checkers, need not be disclosed in this class.

 

Policy 2 – Use of AI is Permitted with Explicit Disclosure

The use of AI-generated content, including text, images, code, figures, and other materials, is allowed in this class unless otherwise noted in a specific assignment. However, any use of this content must be explicitly disclosed in all academic work. You may use AI tools to aid content generation and revision within these guidelines. All work must comply with MTSU’s policy on academic honesty, and students must ensure the originality of their own work. The use of basic word-processing AI systems, including grammar and spelling checkers, need not be disclosed in this class.

 

Policy 3 – Controlled Use

The controlled use of AI-generated content in this class is permitted provided that it follows MTSU’s policy on academic honesty and the guidelines on research integrity. Generative AI will not be considered an author, but rather a tool that assists students in their work. Students bear the ultimate responsibility for the originality, integrity, and accuracy of the work for this course. All use of generative AI must be declared and explained and must not violate the plagiarism policies of the campus or this course. Use of basic word-processing AI systems, including grammar and spelling checkers, need not be disclosed.

 

Policy 4 – Go for it!

Because we recognize its potential for enhancing the educational process, the use of AI-generated content is welcome in this class. However, the use of AI tools must be acknowledged just like the use of any other software package. (Note: because of their widespread usage, AI systems used for grammar and spelling checks need not be acknowledged.) Because generative AI can reproduce existing work without citations, students are still responsible for ensuring the originality, integrity, and accuracy of their work. Violations of academic honesty standards, including plagiarism, are prohibited under the MTSU academic honesty policy.

 

AI Authorship on Scientific Papers – A Snapshot (August 3, 2023)

This is a compilation of the guidelines given to authors regarding the use of AI-written text. The policies vary from simple disclosure in a cover letter to a complete ban on such text in Science journals. This document is not meant to be complete; it quotes elements of the new AI policies I was able to find online, and these policies may change.

From Science:

Artificial intelligence (AI). Text generated from AI, machine learning, or similar algorithmic tools cannot be used in papers published in Science journals, nor can the accompanying figures, images, or graphics be the products of such tools, without explicit permission from the editors. In addition, an AI program cannot be an author of a Science journal paper. A violation of this policy constitutes scientific misconduct.

https://www.science.org/content/page/science-journals-editorial-policies?adobe_mc=MCMID%3D79730734082570706754102817179663373464%7CMCORGID%3D242B6472541199F70A4C98A6%2540AdobeOrg%7CTS%3D1675352420#image-and-text-integrity

From Elsevier:

Authorship implies responsibilities and tasks that can only be attributed to and performed by humans. Each (co-) author is accountable for ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved and authorship requires the ability to approve the final version of the work and agree to its submission. Authors are also responsible for ensuring that the work is original, that the stated authors qualify for authorship, and the work does not infringe third party rights.

Elsevier will monitor developments around generative AI and AI-assisted technologies and will adjust or refine this policy should it be appropriate. More information about our authorship policy can be viewed here: https://www.elsevier.com/about/policies/publishing-ethics.

https://www.elsevier.com/about/policies/publishing-ethics/the-use-of-ai-and-ai-assisted-writing-technologies-in-scientific-writing

From Cambridge University Press:

AI Contributions to Research Content

AI use must be declared and clearly explained in publications such as research papers, just as we expect scholars to do with other software, tools, and methodologies.

AI does not meet the Cambridge requirements for authorship, given the need for accountability. AI and LLM tools may not be listed as an author on any scholarly work published by Cambridge.

Authors are accountable for the accuracy, integrity, and originality of their research papers, including for any use of AI.

Any use of AI must not breach Cambridge’s plagiarism policy. Scholarly works must be the author’s own, and not present others’ ideas, data, words or other material without adequate citation and transparent referencing.

Please note, individual journals may have more specific requirements or guidelines for upholding this policy.

https://www.cambridge.org/core/services/authors/publishing-ethics/research-publishing-ethics-guidelines-for-journals/authorship-and-contributorship#ai-contributions-to-research-content

From IEEE:

Guidelines for Artificial Intelligence (AI)-Generated Text

The use of artificial intelligence (AI)–generated text in an article shall be disclosed in the acknowledgements section of any paper submitted to an IEEE Conference or Periodical. The sections of the paper that use AI-generated text shall have a citation to the AI system used to generate the text.

https://journals.ieeeauthorcenter.ieee.org/become-an-ieee-journal-author/publishing-ethics/guidelines-and-policies/submission-and-peer-review-policies/

From ACM:

Generative AI tools and technologies, such as ChatGPT, may not be listed as authors of an ACM published Work. The use of generative AI tools and technologies to create content is permitted but must be fully disclosed in the Work. For example, the authors could include the following statement in the Acknowledgements section of the Work: ChatGPT was utilized to generate sections of this Work, including text, tables, graphs, code, data, citations, etc.). If you are uncertain about the need to disclose the use of a particular tool, err on the side of caution, and include a disclosure in the acknowledgements section of the Work.

Basic word processing systems that recommend and insert replacement text, perform spelling or grammar checks and corrections, or systems that do language translations are to be considered exceptions to this disclosure requirement and are generally permitted and need not be disclosed in the Work. As the line between Generative AI tools and basic word processing systems like MS-Word or Grammarly becomes blurred, this Policy will be updated.

https://www.acm.org/publications/policies/new-acm-policy-on-authorship

From the American Chemical Society:

Science publishing is not an exception to the trend of growing use of artificial intelligence and large language models like ChatGPT. The use of AI tools is not a negative thing per se, but like all aspects of publishing research, transparency and accountability regarding their use are critical for maintaining the integrity of the scholarly record. It is impossible to predict how AI will develop in the coming years, but there is still value in establishing some basic principles for its use in preprints.

After consultation with ChemRxiv’s Scientific Advisory Board, ChemRxiv has made the two following adjustments to its selection criteria to cover the use of AI by our authors:

AI tools cannot be listed as an author, as they do not possess the ability to fundamentally review the final draft, give approval for its submission, or take accountability for its content. All co-authors of the text, however, will be accountable for the final content and should carefully check for any errors introduced through the use of an AI tool.

The use of AI tools, including the name of the tool and how it was used, should be divulged in the text of the preprint. This note could be in the Materials and Methods, a statement at the end of the manuscript, or another location that works best for the format of the preprint.

Some authors have already used AI language tools to help polish or draft the text of their work, and others have studied their effectiveness in handling chemistry concepts. See some recent preprints related to ChatGPT here.

ChemRxiv authors are welcome to use such tools ethically and responsibly in accordance with our policy. If you have any questions about the use of AI tools in preparing your preprint, please view our Policies page and the author FAQs or contact our team at curator@chemrxiv.org.

https://axial.acs.org/publishing/new-chemrxiv-policy-on-the-use-of-ai-tools

From the American Astronomical Society:

With this in mind, we offer two editorial guidelines for the use of chatbots in preparing manuscripts for submission to one of the journals of the AAS. First, these programs are not, in any sense, authors of the manuscript. They cannot explain their reasoning or be held accountable for the contents of the manuscript. They are a tool. Responsibility for the accuracy (or otherwise) of the submission remains with the (human) author or authors. Second, since their use can affect the contents of a manuscript more profoundly than, for example, the use of Microsoft Word or even the more sophisticated Grammarly, we expect authors to acknowledge their use and cite them as they would any other significant piece of software. Citing commercial software in the same style as scholarly citations may present difficulties. We urge authors to use whatever sources are most useful to readers, i.e., as detailed a description of the software as possible and/or a link to the software itself. Although these programs will surely evolve substantially in the near future, we think these guidelines should cover their use for years to come.

https://aas.org/posts/news/2023/03/use-chatbots-writing-scientific-manuscripts

From the American Physical Society Physical Review Journals:

Appropriate Use of AI-Based Writing Tools

Large Language Models, such as ChatGPT, are rapidly evolving, and the Physical Review Journals continue to observe their uses in creating and modifying text.

Authors and Referees may use ChatGPT and similar AI-based writing tools exclusively to polish, condense, or otherwise lightly edit their writing. As always, authors must take full responsibility for the contents of their manuscripts; similarly, referees must take full responsibility for the contents of their reports.

An AI-based writing tool does not meet the criteria for authorship because it is neither accountable nor can it take responsibility for a research paper’s contents. A writing tool should, therefore, not be listed as an author but could be listed in the Acknowledgments.

Authors should disclose the use of AI tools to editors in their Cover Letter and (if desired) within the paper itself. Referees should disclose the use of AI tools to editors when submitting a report. These disclosures will help editors understand how researchers use the tools in preparing manuscripts or other aspects of the peer review process.

To protect the confidentiality of peer-reviewed materials, referees should not upload the contents of submitted manuscripts into external AI-assistance tools.

https://journals.aps.org/authors/ai-based-writing-tools

 

Dr. John Wallin is the Director of the Computational and Data Science Ph.D. Program at Middle Tennessee State University.

AI-assisted Coding: Exploring GitHub Copilot, GitHub Copilot Labs, and the OpenAI API

Since our introduction to GPT in January, there has been an influx of new AI-based coding tools on the market, developed not only by large corporations but also by individual users who are leveraging AI engines in their applications. In this talk, we will discuss recent advancements in coding tools and APIs that can be used in your work. We will provide an introductory demo on accessing the OpenAI API from your applications. Additionally, we will cover other tools available on the market, such as LangChain, Hugging Face, exhuman, and Pinecone. Join us as we explore the AI-assisted coding landscape and how these tools can transform how we develop software and work.

Click here to watch the video. 

CDS Seminar: “Environmental Data Science: Recent projects from the Data Science Institute”

This seminar was given by Dr. Ryan Otter, the Director of MTSU’s Data Science Institute and a Professor of Biology.

The field of environmental science has lagged behind other disciplines in the adoption of data engineering and data science tools. In this presentation, multiple projects, either currently being developed or recently completed in the Data Science Institute at MTSU, will be described. Project 1: Thresholds of Toxicological Concern (TTCs), a new approach methodology that can predict a conservative threshold value for chemicals with little or no information available. Project 2: Multi-Sensor Data System, a data platform built to ingest varying files into a data lake, process them into a data warehouse, build unique data marts, and serve up the results in minutes. Project 3: Streams of Data, a web platform for client interactions.

Watch the video here.

 

CDS Seminar “The Broader Implications of AI in Basic and Applied Sciences: Featuring ChatGPT, GitHub Copilot, and DALL-E 2*”

A Panel Discussion with Faculty Members from the Data Science Program

Keith Gamble (Economics), Ryan Otter (Biology), Qiang Wu (Mathematics), Josh Phillips (CS), and John Wallin (Physics)

Artificial intelligence (AI) has made significant strides in recent years, and its impact on the basic and applied sciences has been substantial. AI systems such as ChatGPT, GitHub Copilot, and DALL-E 2 have shown exceptional performance in natural language processing and content generation and are being used to assist scientists and researchers in a variety of fields. However, as AI becomes more integrated into scientific research and data analysis, it is important to consider the broader implications of its use. This panel discussion will bring together experts from a variety of fields to explore the potential benefits and challenges of AI in the basic and applied sciences. Topics will include the ways in which AI is being used to assist scientists and researchers, the impact of AI on the scientific process, the ethical and societal implications of AI in the sciences, and how academic programs may need to be redesigned to adapt to the integration of AI in the field. Join us for a thought-provoking conversation about the future of AI in the basic and applied sciences, and specifically about the impact of ChatGPT, GitHub Copilot, and DALL-E 2 on the field.

Watch the video here

A quick note on predictive modeling process

In predictive modeling, we mainly focus on regression and classification problems. The regression problem predicts the value of a continuous dependent variable using one or several independent variables, known as predictors. On the other hand, the classification problem focuses on predicting a categorical target variable using several continuous or categorical predictors. Several statistical and machine learning methods are available in the literature to work on these problems. Some methods assume a specific functional relationship between the predictors and target variable before modeling, known as parametric methods. In contrast, other methods do not assume the nature of a functional relationship between the predictors and target variable beforehand and let the data learn the relationship between them, known as nonparametric methods.
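As a rough sketch of the parametric/nonparametric distinction, assuming scikit-learn is available: a linear regression assumes a functional form up front, while a k-nearest-neighbors regressor lets the data determine the relationship. The toy data below is made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy training data following y = 2x exactly
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Parametric: assumes a linear relationship before seeing the data
param = LinearRegression().fit(X, y)

# Nonparametric: makes no assumption; predicts from nearby observations
nonparam = KNeighborsRegressor(n_neighbors=1).fit(X, y)

# The parametric model extrapolates along its fitted line (predicts 12 at x=6);
# the nearest-neighbor model can only echo the closest training point (10).
print(param.predict([[6.0]]))
print(nonparam.predict([[6.0]]))
```

The contrast at x = 6 shows the trade-off: the parametric model generalizes beyond the training range when its assumed form is right, while the nonparametric model stays tied to the observed data.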

When applying any class of methods to data, one usually focuses on building a model that achieves optimal performance scores on unseen data. For regression modeling, for instance, one maximizes the coefficient of determination or minimizes the root mean squared error on the test data. Similarly, for classification modeling, one generally maximizes the accuracy, sensitivity, or specificity scores on the test data. For unbalanced data, which metrics you seek to optimize depends on the overall goal. In general, the goal is to make sure that the developed model generalizes well to unseen data. In fact, training these machine learning models narrows down to the problem of finding the best values of the parameters. Hence, there is an optimization procedure in the model-building process, no matter what techniques you use.
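The metrics above are simple enough to compute by hand. Here is a minimal sketch with made-up predictions, using only the standard library:

```python
import math

# Regression: root mean squared error between actuals and predictions
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
print(round(rmse, 4))  # 0.6455

# Classification: accuracy, sensitivity (true positive rate),
# and specificity (true negative rate) from a toy confusion count
labels = [1, 1, 0, 0, 1]
preds  = [1, 0, 0, 1, 1]
tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 0)
accuracy = (tp + tn) / len(labels)       # 0.6
sensitivity = tp / labels.count(1)       # 2/3 of positives caught
specificity = tn / labels.count(0)       # 1/2 of negatives caught
print(accuracy, sensitivity, specificity)
```

For an unbalanced data set, this is exactly why accuracy alone can mislead: a model can score high accuracy while sensitivity on the rare class stays poor.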

For instance, you minimize the sum of squared errors if you fit a least-squares regression. Likewise, you minimize the misclassification error, Gini impurity, or entropy if you perform classification analysis using tree-based models. While working with these problems, one is always interested in achieving global optimality. Roughly speaking, the global optimum is the optimal value the function can take among all possible values in the domain, whereas a local optimum is the optimal value the function can take in a neighborhood. The global optimum is guaranteed when we work with a convex function whose constraints form a convex set. By a convex region, we mean a region that contains the line segment joining any two points in that region. This class of optimization problems is known as convex optimization problems. However, there are other classes of optimization problems where the global optimum cannot be guaranteed. Therefore, we need to pay attention to possible nonconvex problems while working with predictive models.
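The convex/nonconvex distinction can be demonstrated with a tiny gradient-descent sketch (the functions and step sizes are chosen purely for illustration). On a convex loss, every starting point reaches the same global minimum; on a nonconvex loss, the starting point decides which local minimum you land in:

```python
def grad_descent(f_grad, x0, lr=0.1, steps=200):
    """Plain gradient descent on a one-dimensional function."""
    x = x0
    for _ in range(steps):
        x -= lr * f_grad(x)
    return x

# Convex: f(x) = (x - 3)^2 has a single global minimum at x = 3.
# Any starting point converges there.
print(round(grad_descent(lambda x: 2 * (x - 3), x0=-10.0), 3))  # 3.0
print(round(grad_descent(lambda x: 2 * (x - 3), x0=10.0), 3))   # 3.0

# Nonconvex: f(x) = x^4 - 2x^2 has two minima, at x = -1 and x = 1.
# The start determines which local minimum gradient descent finds.
g = lambda x: 4 * x**3 - 4 * x
print(round(grad_descent(g, x0=-0.5, lr=0.01), 3))  # -1.0
print(round(grad_descent(g, x0=0.5, lr=0.01), 3))   # 1.0
```

This is the practical risk with nonconvex models such as neural networks: the optimizer may settle in a local minimum rather than the global one.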

In addition, software libraries typically use iterative methods to fit these machine learning models. As a result, many iterations are needed to minimize the loss functions and obtain the best parameter values. Since several solvers may be available, having a basic knowledge of solvers and their limitations can be helpful. Some methods also converge faster than others; for example, first-order methods tend to be slower than second-order methods. Changing the solver or increasing the number of iterations can help fix convergence issues. Finally, although computation time may not be an issue while working with small data, the time difference can be significant when performing grid-search cross-validation on a larger data set.
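As a concrete sketch, assuming scikit-learn is available: logistic regression exposes both the solver choice and the iteration budget, which are the two knobs mentioned above for fixing convergence warnings. The synthetic data here is generated only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A small synthetic classification problem
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The default lbfgs solver may emit a ConvergenceWarning if max_iter is
# too low; raising max_iter (or switching solvers, e.g. to "liblinear"
# or "saga") is the usual remedy.
clf = LogisticRegression(solver="lbfgs", max_iter=5000).fit(X, y)

print(clf.n_iter_)        # iterations the solver actually used
print(clf.score(X, y))    # training accuracy
```

Inside a grid-search cross-validation, this fit is repeated for every parameter combination and fold, which is where solver speed starts to matter.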

Predictive modeling is a vast area; much care needs to be taken to have an effective ready-to-go model. We may talk more about a specific case in another post.

Computational Papyrology: Using Data Science to Understand Ancient Manuscripts 

In the 1920s, archaeologists from Oxford University discovered over 600 crates of documents buried outside the Egyptian city of Oxyrhynchus. Since their discovery, fewer than 10% of these manuscripts have been analyzed and published. Using a combination of crowdsourcing and data science, the project team is developing the software infrastructure to speed up this process. Thus far, the team has created algorithms to find the consensus of human transcriptions, developed a training/testing sample of characters, applied neural networks to automatically identify characters, and used sequence alignment to determine whether a text belongs to the known literature of the ancient Greeks. The most difficult questions involve cases where the crowdsourced identifications disagree with the neural network’s classification. In this talk, Dr. John Wallin will review the project’s status and discuss the role of ambiguity in the data set.

This seminar was held on February 1, 2022, via Zoom. Click here to watch the program.

 

Internship experience becomes an opportunity after graduation for one student

(Below is the transcript of a recent podcast between Dr. Keith Gamble and student Robert Mepham regarding real-world experiences in data science. You can also listen here.)

(Professor Gamble) Before we talk about your summer internship in data science at Homesite Insurance in Boston, tell us more about yourself.

(Robert) Hi, my name’s Robert Mepham, currently a junior at MTSU. I expect to be graduating in December of 2022.

So I only have one more semester after this current one, double majoring in business administration and data science.

(Professor Gamble) Excellent. Why did you choose MTSU?

(Robert) I’m actually from Chicago, Illinois, and my parents, they were getting ready to retire. They knew, right around when I was graduating high school, that they wanted to leave the state of Illinois. They picked Tennessee, specifically Middle Tennessee, and I wanted to be far enough away from them to where, you know, I’d be living on my own, but still close enough where I could go back on weekends and stuff. They’re over in Cookeville where they got a nice house and everything now. MTSU was roughly an hour and a half away from there. So it’s close enough to them, but also far enough to give me some distance. And yeah, I just fell in love with the school.

When I first came to Murfreesboro, when I think I was like 10 or 11 years old, the school was totally different back then. And then, eight years later, when I came for a campus tour, I thought I was just sort of doing my due diligence. I came here, and I was totally blown away with how much it had changed from what I remembered back then. There’s been a ton of new building.

(Professor Gamble) Right. The data science major. It’s still new, so what did you major in when you started at MTSU?

(Robert) When I came to MTSU, I was originally an information systems major. I chose that because I’ve always been into computers, into technology. I’ve always been into data. But at the time, they didn’t have data science as a major, and so information systems was sort of where I felt I could apply my skills best going toward a potential career.

(Professor Gamble) Excellent. What brought you to the data science major?

(Robert) Oh, honestly, I don’t know if there was one thing that particularly brought me to the major. I knew ever since back in high school, I’ve always been pretty good at math, and I started getting into sports analytics. And when I learned that, of course, people do that for a living, I asked some questions of the people I knew. I shot off a bunch of emails to people who I knew ran, you know, sports websites and stuff. So like, Hey, how did you get into this? And there were like 50 different answers, and no one had the right thing. And I asked them if you could go back and do it again, what would you major in in college that would prepare you for what you do now? 99 percent of them said data science, even though they had gone to college back in like the 80s and data science wasn’t the thing back then, or it wasn’t called what it is now.

But if they could do it over again, they would choose data science. I knew that’s where I wanted to be, somewhere in sports analytics, and I think it kind of just jumped out to me, and I’ve loved it ever since I started.

(Professor Gamble) Awesome. So let’s talk more now about your internship. That was Homesite Insurance. What does that company do?

(Robert) It’s actually a very large insurance company out of Boston. So I like to say that the dirty little secret of the insurance industry is that a lot of the insurance companies you see advertising on TV, they’re not so much insurance companies. They’re more just marketing firms.

And so without naming specific ones, I mean, you know the type of insurance commercials you see on TV; they will give their website or their phone numbers and their agents in their commercials. They’ll take the policy, but they don’t actually underwrite and service the policy. They pass it off for a commission fee to multiple companies, one of them being Homesite Insurance. And so, yeah, Homesite really just underwrites. I think they have over three billion in written premium as of this year. So it’s a fairly decent-sized insurance company, and they do underwriting work for at least 15 or 16 different advertising-type insurance companies.

(Professor Gamble) What was your role?

(Robert) So I worked in the commercial operations strategy and analytics department, so that’s for the commercial division of the company. You’ve got homeowner’s insurance and everything, but this was for commercial insurance: businesses getting insured, whether it be for workers’ comp, inventory, property insurance, or commercial auto insurance. That division of the company, and then strategy and analytics. It was actually a new department at the company.

When I started, my manager, Alex, had actually just gotten a promotion and was told to start this new department because they knew data science is an ever-expanding field, and they knew that the way their company was operating, as it was staged to grow, wasn’t really going to be efficient. They knew they were going to be wasting a lot of man-hours doing certain processes in Excel, you know, by hand and on paper, that they could automate, given some software engineers and some data scientists.

And so our goal was pretty much every week we would go and consult different groups around the company, different departments and different teams. And we would say, what are you guys wasting the most time on, on a weekly basis? And then we would monitor them for a couple of weeks, see how they did it.

Usually it consisted of, well, we’ll write a SQL query to pull this data from our database. We’ll go ahead and put that in Excel, and we’ll do this list of 100 things to it. And then we send out this file to a vendor, to our manager, or whatever. And of course, as you know, in data science, if you have a strict enough set of rules about what your end product is, you can go ahead and get that down to a 30-second script. And so that was basically my job. The entire past summer and fall working for Homesite was developing those sorts of solutions for the company.

(Professor Gamble) You mentioned SQL as a tool. What other tools did you learn in your coursework that you applied?

 

(Robert) So as far as programming languages, SQL was a big one, especially because that’s their data management language. But Python and R were two others that I used on pretty much a daily basis, and it was basically down to whatever the manager of the specific team that we were consulting wanted. Whatever the group lead wanted: if they wanted a solution written in R, we would have to figure out how to write it in R, or if they wanted it written in Python and, like, automated through IWC, we would have to write it in Python and work on getting an IWC repository hosted for it.

So yeah, just learning how to code, and not so much learning the specifics of how to write this sort of algorithm or this sort of structure, but also just understanding how to translate a set of work requirements or project requirements. And knowing, here’s generally what I’m going to have to do as far as coding, and then going and Googling it, you know, going out and saying, you know, doing such and such in Python, and then being able to scroll through Stack Overflow or something to find something close to what you need and manipulate it a little bit for your specific deployment.

(Professor Gamble) What at the end of your internship were you most interested in learning more about? What did you feel like you needed more education in to help you in the next role?

(Robert) I guess what I wanted to learn at the end of my internship, like you said, would be deploying models. I know in classes and on the job you learn how to, you know, manipulate the data: extract, transform, load it, and train any number of models, whether classification or regression, to predict some sort of target variable. We learn all that, and then it’s, OK, that’s great to do it for the training dataset or the one question you had. But in the business context, usually it’s a situation where they want that report run every single week or every month or every quarter, or they want it to be part of the insurance quoting system. So when someone goes in to get an insurance quote, the system has to be able to take in all this data on them and provide them with a quote, like a monthly premium quote. And that’s based on a whole bunch of different risk factors and risk tables and a bunch of other variables that we’re able to look up on them based on whatever data they give us. And so being able to develop a model that takes in continuous data from streams like that and gives continuous output, that was sort of where I wanted to learn more. I didn’t feel unprepared; I felt curious, because I knew how to do all the training of the model and, you know, the reports on how effective my model is, here’s the AUC and the ROC curve for my model. But now it’s, OK, that’s great; let’s get that working in the business context on an ongoing basis.

(Professor Gamble) What is next for you this summer and beyond?

(Robert) So I’m actually going back, I think, as of right now. So it’s mid-March, and I’m fairly certain I’m going back to Homesite. They’ve asked me to come back. I still have a couple of other job offers on the table for this summer, but it looks like the best one for me in terms of where I want to be for the summer and what I want to do is going to be going back to Homesite, continuing with the same team I was working with last summer. So that’s for this summer, and then beyond.

Like I said, I really want to work in sports analytics. You know, data science, as I’m sure a lot of people who have taken any sort of data science class here at MTSU know, can be applied in literally any sort of business, whether it’s retail, insurance, academia, or whatever it is; data science can be applied anywhere. And so it kind of just depends on what you are passionate about as an individual.

And for me, that’s sports, right. I can run insurance numbers all day, but I’m never going to be super passionate about, you know, look at how much written premium we’re going to have in August of 2024. It doesn’t get me going the way that, you know, certain sports topics do. And so I’m not really sure how I get my foot in the door in that industry, whether it would be to go get a master’s in data science or sports management or something like that, or to hopefully maybe land a gig after graduation with some sort of professional sports team.

(Professor Gamble) Boston, the home of the insurance company, is a major hub of professional sports activity. How did you like living in Boston?

(Robert) So I actually did not live in Boston this past summer; it was remote work. I was able to visit Boston for a week, which was really cool. Their office is right downtown across from the Garden, where the Celtics and the Bruins play. So it was interesting to be down there for the week. I was put up in a hotel for the week by the company and got that nine-to-five experience, going out on business lunches and everything with my manager and other people. So it was really cool to have that one-week experience. And then for the rest of the summer, it was virtual, you know, distance and stuff. So I was working out of my bedroom; you know, I have a desk in the corner of my bedroom, and that’s where I did most of my work, and it was pretty cool.

(Professor Gamble) Thank you for interviewing with me.

 

What should every data scientist know when working with ZIP Codes?

Datasets often contain ZIP code fields, making it tempting for data scientists to organize data and develop models based on ZIP codes. However, ZIP codes present a significant challenge. When considering a ZIP code, many think of a well-bounded area contained perfectly within another geographic space (such as a city, congressional district, or census tract). However, this is often not the case.

To comprehend the complexities, one must first understand what ZIP codes are and how they work. Modern ZIP codes were implemented in 1963 when the United States Postal Service (USPS) adopted the Zoning Improvement Plan (ZIP) to expand the postal zones established in 1943. Like an automotive VIN, the five-digit ZIP code encodes meaningful information. The first digit indicates a region or group of states that make up a zone. The United States is separated into ten such zones. The following two digits represent the sectional center facility, a mail sorting facility located within one of the ten zones. The last two digits represent the post office or delivery area. In 1983, the USPS added four additional digits, commonly referred to as plus four (stylized as +4), to identify specific streets and street directions. Essentially all of this was done to improve mail delivery, not to create convenient boundaries for data analysis. In fact, if plotted correctly, ZIP codes would appear as lines (representing delivery networks) and points (representing large buildings, campuses, and post office boxes).
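The digit structure described above is easy to pull apart in code. Here is a minimal sketch; the `parse_zip` helper and the sample codes are hypothetical, introduced only to illustrate the USPS digit positions:

```python
def parse_zip(code: str) -> dict:
    """Split a ZIP or ZIP+4 string into its USPS components."""
    zip5, _, plus4 = code.partition("-")
    return {
        "national_zone": zip5[0],       # one of ten regional zones
        "sectional_center": zip5[1:3],  # mail sorting facility within the zone
        "delivery_area": zip5[3:5],     # post office or delivery area
        "plus4": plus4 or None,         # street-level detail, added in 1983
    }

print(parse_zip("37132-0001"))
print(parse_zip("90210"))
```

Note that the parsed components describe a delivery network, not a geographic boundary, which is exactly why treating ZIP codes as polygons is risky.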

Data scientists have encountered other challenges related to ZIP codes. For instance, large portions of the United States, particularly in Alaska and Nevada, do not have an assigned ZIP code. The USPS does not assign ZIP codes to remote regions that do not receive mail. Another challenge is that ZIP codes may change, particularly in response to new construction and new delivery routes. To make things even more confusing – if they weren’t already – some ZIP codes are not stationary. For instance, 96620-2820 is the ZIP+4 for the 5,500+ crew (ship’s company and air wing) aboard the nuclear aircraft supercarrier USS Nimitz.

Data scientists should know that ZIP codes do not always fall within state boundaries (or even within the borders of the states in a zone). There are over 100 cases where ZIP codes cross state lines. Even if ZIP codes could be plotted with well-defined boundaries, these would not align with other political boundaries such as county or municipal borders. And ZIP codes certainly do not align with United States Census tracts, block groups, or blocks.

To help facilitate the relationship between census data and ZIP codes, the United States Census Bureau created ZIP Code Tabulation Areas (ZCTAs), which are therefore often included in census data. However, ZCTAs are far from precise. A block’s ZCTA is the most frequent ZIP code (the mode ZIP code of all mailing addresses) within the block. If there is no identifiable most frequent ZIP code, the block is assigned the ZCTA of the neighboring block with the longest shared border. ZCTAs have other limitations as well: they do not include all ZIP codes, particularly those of large buildings, campuses, or post office boxes. Also, keep in mind that ZCTAs use only the five-digit ZIP code, not ZIP+4 codes.

Data scientists with access to address data are advised to geocode the addresses to geographic coordinates. Or, if the data has geographic coordinates, start there. Next, a vector overlay operation can determine the relevant census tract or block, congressional district, or political jurisdiction. Doing so presents an opportunity for more precise analyses. Unfortunately, if no addresses are associated with the data, ZCTAs may be the best option to crosswalk from ZIP codes to more meaningful boundaries. Also, some municipalities and for-profit groups provide demographic data collected (or aggregated) to ZIP codes.
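The vector overlay step reduces, at its core, to testing whether a geocoded point falls inside a boundary polygon. Here is a minimal point-in-polygon sketch (the ray-casting method) with a made-up rectangular "tract"; real analyses would use a GIS library such as GeoPandas, which also handles edge cases like points exactly on a boundary:

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count boundary crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the point's latitude
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:       # crossing lies to the right of the point
                inside = not inside
    return inside

tract = [(0, 0), (4, 0), (4, 4), (0, 4)]  # toy census-tract boundary
print(point_in_polygon(2, 2, tract))  # True
print(point_in_polygon(5, 2, tract))  # False
```

Applied at scale, this is how each geocoded address gets assigned to its census tract, congressional district, or political jurisdiction.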

If you encounter a situation that requires linking data with ZIP codes to data without ZIP codes, proceed cautiously and be aware of your limitations.

What Makes Spatial Data Special Data?

Data scientists work with a wide variety of data. Some of that data likely includes street addresses or coordinates (e.g., latitude and longitude). However, most data scientists have not explored spatial data’s true capabilities (and complexities). There are benefits to working with a simplified view of reality known as spatial data models. Let’s consider two primary spatial data models – discrete and continuous – and learn when to use them. Discrete spatial data denotes known locations with a known boundary (such as the political border of the State of Tennessee). In contrast, continuous spatial data is estimated and does not have a known border (such as where the ocean temperature is 59 degrees).

Discrete data is stored using vectors (e.g., points, lines, or polygons). Point data can represent where a soil test sample originated, the precise location of study trees, or where an animal was tagged. Line data often represent streams, delivery routes, wildlife migration paths, or streets. Polygon data are closed shapes and frequently represent lakes, forests, or cities.

Vector data are commonly stored as ESRI Shapefiles, consisting of a .shp file and several sidecar files (i.e., .dbf, .shx, .prj) that are all kept together in the same folder. While ESRI Shapefiles are an old and somewhat outdated standard, most public data sources provide geospatial data in this format. Therefore, a data scientist is very likely to encounter ESRI Shapefiles. Alternatively, vector data may be stored using proprietary ESRI File Geodatabases (.gdb). More recently, vector data are available through an open and non-proprietary file type known as a GeoPackage (.gpkg). Using Python GeoPandas – which uses the Fiona file handler powered by GDAL (the Geospatial Data Abstraction Library) – all of these file types can be read and explored.

While discrete data are intuitive, working with continuous data adds complexity. Continuous or thematic spatial data can include representations of noise pollution, terrain elevations, precipitation, or wind speed. Data-collecting sensors cannot easily be placed on a perfect grid. Therefore, such information is calculated between discrete data points. Continuous data are frequently represented using raster files. Much like a digital image where each pixel represents a color, the ‘pixels’ of raster files contain data representing values such as a water temperature.

Common raster file types include Erdas Imagine files (consisting of an .img file and an .xml sidecar file), open and non-proprietary GeoPackage (.gpkg) geodatabases, or open standard GeoTIFF (.tif) files. Public data sources often share raster data using Erdas Imagine files, while up-to-date satellite-based optical and radar imagery is now more frequently available in GeoTIFF formats. Data scientists can explore these raster file types using the Rasterio Python library, which relies on GDAL.

We hope that this simplified overview encourages data scientists to go beyond the traditional bounds and focus on a new world of possibilities available through the exploration of spatial data. Once familiar with vector and raster data, data scientists can explore indoor mapping spatial file types (e.g., Apple Venue Format or Revit BIM), three-dimensional spatial files (Collada or Trimble Sketchup), or multitemporal spatial file formats (Network Common Data Form or Hierarchical Data Format).

Spatial data science is truly special data science.