Weekly Question #6: Complete by October 19, 2017

Leave your response as a comment on this post by the beginning of class on October 19, 2017. Remember, it only needs to be three or four sentences. For these weekly questions, I’m mainly interested in your opinions, not so much particular “facts” from the class!If you sign in using your AccessNet ID and password you won’t have to fill in the name, email and captcha fields when you leave your comment.

Here is the question:

In class we discussed several examples of how data can be stored in different formats in two different sources (i.e., a date can be represented mm/dd/yyyy or dd/mm/yyyy). Give your own example of a scenario, other than the ones we discussed in class, where data could be represented two different ways. How would you reconcile this conflict when combining the data sources?

Related

55 Responses to Weekly Question #6: Complete by October 19, 2017

Time is an example of data that is able to be entered into a database in two different ways. Most of the world uses the 24-hour clock, but places like the U.S. use the 12-hour clock. For example, 15:00 and 3:00 pm both mean the same thing. It’s important to keep the time values consistent to avoid confusion. It would be best to use the universal 24-hour clock, but this must be used throughout the entire database if you decide to implement it.

Time is another data format the can vary among data sources. One data source may use the 24-hour clock while another may represent time on a 12-hour clock. The second data source will show 5:30 PM as 5:30 PM and the first will just show 17:30. To reconcile the differences I would change the 24-hour clock to the 12-hour clock since this format is easier to read and understand.

I’m actually working on an issue with this on a website project me and a couple of friends are working on – there are two methods of displaying a post ID that someone makes on the website. There is a 64-bit “Snowflake” value that is almost guaranteed to be unique and avoid duplicates, due to the fact that it’s 20 digits long, and a 32-bit, base 64 (1234567890 and the alphabet, both capitalized and not capitalized, ! and ?), 6 digit long “Short ID” that is displayed to the user. The Snowflake value is stored in the back end database while the Short ID gets stored in the user’s cache – the reconciling comes from trying to make the Short ID’s compression algorithm as close to lossless as possible, so that the Snowflake value can be restored from the Short ID.

A good example of data that can be represented in multiple different ways is Currency. Different countries have unique ways of formatting currency, and the formats vary drastically. While a thousand dollars in the US format looks like this: “1,000.00” , one thousand of the Swedish Krona looks like this: “1 000,00”. In order to reconcile a conflict where you are trying to join a databases with multiple currency formats, an established data standard would need to be established in order to avoid confusion and problems within the database.

Phone number is another example that can be entered in different format. For example, a database might key in phone number as ‘xxxxxxxxxx’ in a database or as ‘xxx-xxx-xxxx’. This two different ways of keying in data can become a problem if a person wants to combine or use both database. In my opinion, it is best to let the data type to be VARCHAR(12) so that both format can be used universally.

One example of two different formats is the metric system vs. the american measuring system. Something may be measured in inches on one database and centimeters on the other. Deciding between the two may depend on where the company is now based. Changing the measurements might require going through each record and converting the measurement to the other form. The other option would be to leave them as they are, but if you are a company such as Lowes or Home Depot and this database is a part of your website, it may confuse and frustrate customers leaving the burden of converting units to them.

An example of a scenario where data could be represented two different ways would be time. In the United States we use the 12 hour clock, but most other countries use the 24 hour clock. 1:00 PM is the same as 13:00. The 12 hour clock would be difficult to use in a database, because you would have to include if it is AM or PM. For databases you should probably use a 24 hour clock if you are recording times.

An ecample of a scenario where data could be represented two different ways would be months. For example, you could write out September or you could just use the number 9. I think to make it easier for a database, I would reconcile so all months would go by numbers. This would help with spelling errors as well.

Zip codes tend to have two different formats. One of the formats is a five digit number, and the other is a five digit number, a hyphen, and four more digits after that. The most common zip code format that I see is the five digit number. An easy way to reconcile this conflict when combining the different data sources would be to just show the first five digits of the zip code. This would remove the last four digits on the nine digit zip code, but would not necessarily change the zip code itself.

Time can be stored in a few different ways (whether it’s clocking in or out, or how many seconds it took to complete something).
If, for example, it took a machine 4 minutes and 5 seconds to finish a task, the database can store it as 04:05 or 4:05 (it will depend on the INT).
If, for example, employees are clocking in and clocking out, let’s say at 1:30PM, the database can store that time as 01:30, 1:30, or 13:30.

At work, I am doing a lot of cost basis exceptions for advisors. An advisor reaches out to our team and requests updated share amounts for tax purposes. The biggest problem I have come across is the way in which companies display their account numbers. For example Mass Mutual displays their account number as (xx-xxxxxx) and Wells Fargo’s account numbers are (xxxx-xxxx). It’s very tedious placing the dash in the appropriate place for each company. To solve this problem, one would have to design our database to either remove the dashes all together or create a macro to place the dashes into the appropriate places on the excel spreadsheet we use.

Product/materials weights can be labeled differently depending on where in the world the data is being input. Weight would usually be described in pounds in the USA, but might be described in grams or kilograms in the U.K. A way to reconcile this difference in data inputs could be to run one set of data through a program which automatically converts the unit of measurement to the other wholesale so that every entry from the one geographic location is converted into the unit of measurement used in the other location.

An example of something that can be labeled differently from one database to another would be distance. As we all know there is the metric system and the standard system. If you are transferring data from databases in two different continents of the world such as the United States and Germany, you will have one measuring distance in meters, kilometers, etc. and another in feet, inches, and miles. One way to reconcile this is to have a formula that you multiply each value by in order to convert the data from either metric to standard or standard to metric. This would be a tedious process, but easily applied to massive amounts of data at once.

Another example of data could be stored differently in two databases is different email addresses. I know that I have multiple accounts for several websites using my temple email address and my personal email address. For example, I had an account for Hulu under my personal email address and an account for Spotify under my student email address. So when the companies partnered up I was not able to get the free Hulu with student because I already had an account.

I suppose phone numbers could be in different formats. For example a company dealing with American phone numbers could write 215-555-1234 versus 2155551234. If you are an international business, you also have to deal with the country code, and I have seen that written as either +276 or (+276). In the former’s case, one would just have to remove or add hyphens. For the latter’s, one would just have to add or remove the parenthesis.

An example of a piece of data that can be stored two ways is a phone number with an extension. This could be stored in one column with a larger varchar limit or in two columns. When combining this data, these two field could be joined into one, or you could increase the varchar limit of one of the columns before joining the two tables.

A social security number can be represented in two ways. First is the way it is usually presented: xxx-xx-xxxx. Websites for some places (such as banks) that require your social security number will ask that you type it out as above with the dashes, while some places only ask that you enter it is xxxxxxxxx. The simplest way to handle this, I imagine, would be to enter a command that removes all “-” characters from the social security number entries in a database.

One way data could be represented in two different ways would be with weight. In the United States, we use pounds as a measurement of weight, but in other countries weight is measured in kilograms. You could reconcile this conflict by maybe having a feature in the database that will convert the weight to the appropriate metric system based on what country the information is entered for.

Another scenario of how data can be represented different ways in a database is attributes that have a “Yes” or “No” answer to them that could be represented by binary. For example, when joining customers from two different movie theater databases, an attribute can be whether the customer is a premium member. In a table this can be represented by the variable characters datatype or just binary. I would reconcile the conflict by coding all Yes values as 1’s and No values as 0’s.

An example of how data can be stored in two different formats is a phone number. One way phone numbers can be stored is through dashes for example 123-456-7890. But, its perfectly ok to not use the dashes and it will still make the same phone call for example 1234567890. I would store them as xxx-xxx-xxxx because you can then use the area code as a indicator of where the call is being made.

Measurements can be stored in different formats. For example, a measurement can be entered using either the imperial system(inch, ft, miles) or the metric system(km,cm). When entering a measurement into a database in the United States it would make sense to use the imperial system, but if a Company is international it might be difficult to know which measurement system to use.

One example where data could be represented in two different ways is the abbreviation for Thursday. In a monday through friday work schedule, Company A can expressed Thursday as Th, while Company B can expressed Thursday as R. To reconcile this conflict when combining the data sources, we can just simply update the record and set R as Th or Th as R to have the record unified.

An example of data that could be stored in two different formats could be student grades. One data source might represent a student’s grade in a class using the letter grade, while another source might use the specific number grade. For example, while one database would show a student’s grade as 95, the other would show it as an A. I would reconcile this issue so that all grades are shown in number format to know the exact score and remove any confusion.

Another example is temperature, which can be stored in 3 different unit; Fahrenheit, Celsius, and Kelvin. A way to reconcile this is to have a converter to convert the measurement to the specific unit.

Naming customs of different countries would cause confusion in a data source where names are entered in a single field. In the United States, it is customary to list the given name first followed by the family name or surname. In some other countries, it is customary to list the surname or family name as the first name followed by the given name as the last name. In order to reconcile this conflict, a standard should be established and there should be two separate fields, one clearly labeled for the surname (not labeled as “Last Name”) and one for the given name (not labeled as “First Name”).

An example of how data can be stored in two different formats is the representation of exponential terms. One way exponential terms can be stored is using circumflex where the term would look like (10^6) , the other way is using the letter (e) where the exponential terms would look like (e6). One way to reconcile this conflict would be changing the (10^) format to the (e) format.

In class we discussed several examples of how data can be stored in different formats in two different sources (i.e., a date can be represented mm/dd/yyyy or dd/mm/yyyy). Give your own example of a scenario, other than the ones we discussed in class, where data could be represented two different ways. How would you reconcile this conflict when combining the data sources?

Data can be represented in a myriad of different ways. Looking back to the data visualization lessons from before, and Tuesday’s lecture about dimensional representations, a plethora of paramount examples can be given.
For example, one paragon can be that of data representation of a company’s financial health. Companies use different representations of their data, ranging from bar graphs to conceptual map of company performance. Albeit both can be a clear and explicit representation of consolidated data, problems may occur especially when data from different sources are to be consolidated. A solution to this may include that of ubiquitous adaptation of data sources to be prevalent and similar to each other, thus resulting in better data representation.

One common conflict while storing the same kind of data with different formats include percentages. When storing data involving percentages the two formats that many use are (1) using the whole number followed by the percent symbol e,g. (9%) and, (2) using decimals e.g. 0.09. This conflict lead to many confusions for instance, in my high school, teachers would distort the data of student grades by recording the grade in different formats which would lead to a lot of confusion. This can be avoided by setting a standard on the database as to how the percentage should be recorded and everyone should follow the standard.

An example where data can be represented in different ways would be time since there are two different ways of representing the time. Here in the U.S. we use a 12 hour clock, but the military, as well as other countries, use the 24 hour clock. So for example 1:00 pm could also be represented as 13:00 hours. In order to reconcile this when combining the data I would probably go with the 24 hour clock to represent all of the time data because it is easier for everyone to read.

A database of cars could list their items with brand and model, such as Tesla Model S. If this database were to be combined with an existing one that had individual columns for brand and model, I would use a formula to separate the brand from the model.

There might be some discrepancies concerning how the way customer or patient gender is entered into the system. In some systems it might simply be “F” or “M” while some will use the full word “Female” or Male” (or in both cases, there might be the option to input ‘Other’ or ‘Prefer not to answer’ for example)
I think many companies wouldn’t have as many difficulties with this compared to the importance of recording customer phone numbers and addresses, for example, but if a company were responsible for something like health records, it might be something that could cause an issue.

One instance where data can be represented in two different ways is with measurement units. For example, the U.S. typically uses miles as a unit of measurement, whereas many foreign countries use kilometers or some other metric unit. To reconcile this conflict when combining data sources, I would convert all measurements to the metric system since it is the more commonly used system worldwide.

When storing data for clothing stores the data could be stored in different formats. Shirts, for example, can be recorded as “SM” or “Small” and “XL” or “Extra Large”. You could reconcile this conflict by allowing the businesses to choose their preference. In order to make sure there are no shipping errors or errors in the system identifying a product as small may be more concise than “SM”.

Letters(capital and lower case) are an example of data that can be stored in two formats. Unless a field being typed into is case sensitive, a string of letters could be entered and the format would not matter. Although the proper format is required any time the text is meant to be read/published, there are often situations where a code that’s meant to be redeemed is displayed as case sensitive, but when entered online the capitalization may not be required.

One piece of information that may be formatted differently is salary. Some databases could store salary in terms of thousands of dollars or in just plain dollars. Commas may also be used in some databases. The best way to fix this issue would be to remove commas and have salary listed in pure dollars.

The same can be stored in different formats in different data sources – the most common example that comes to mind is time. Time could be as specific as down to the second or as broad as he hour, but still mean the same thing. You would need to reconcile this conflict when combining the data sources in order to properly analyze the data. You would need to run a query to change the format of the data to look the same way as the other data in order for analysis to work.
Another example I thought of is in MS Access where a data format can be “number” or “short text” for Zip Code. Although the data may look the same, they are not the same type so that would need to be reconciled for analysis as well.

An example of different formats for measurements of data can be percentages and decimals for numbers. Sometimes you will see percentages for easier visibility while other times you’ll see decimals for more accurate readings. When combining databases, it makes sense to try and keep a consistent measurement instead of using both constantly or you could round numbers up to make readings more easily attainable.

Currencies and payments can be viewed and converted in multiple ways. In the EU and UK monetary amounts use commas as Americans would use periods. For example, XXX,XXX,00 instead of XXX.XXX.00. If you have data being input in two different formats, there’ll likely be duplications and major confusion within the database.

An example of where data is stored in two different ways is shoe sizes. Depending on the company or shoe brand, shoes are marked either as US standard or EU standard. In addition, different countries have different ways of measuring shoe sizes. A way to reconcile this conflict is to convert your data into on one standard sizing measurement so that data becomes consistent and unified.

A scenario where data can be entered into a database in different ways could be the 12 hour format and 24 hour format for time. Another case where data can be entered into a database in different ways could also be entering customer’s phone’s numbers. A firm only serving one neighborhood of customers could only take house numbers and realized they’re all the same so they don’t include the area code and the other is to enter all ten digits. Reconciling this difference when putting the two databases together is to format the time data to a 12 hour format since its more universal and easier to understand.

Another example of data that can be entered in two different ways would be time. It can be represented using a 24-hour clock (military time) or a 12-hour clock. For example 15:00 is the same thing as 3:00 pm. You could reconcile this conflict by deciding which one to use from the beginning when creating the database.

One example of data that can be represented in different ways is time. For example, 1:00 pm can also be presented as 13:00 in military times. Also, in Excel, yes can be represented as 1, and no can be represented as 0.

Shoe size is certainly an example of how data can be stored in two different formats. In the US, and the UK, shoe size is measured in barleycorn; whereas, in Asia, it’s a bit different. In Korea, it is in millimeter, and in Japan, it’s in centimeter. The best way to reconcile this conflict would be to have a universal standard when it comes to shoe size measurements. It’ll be hard to combine these data sources, as we’ll constantly have to convert one to the other, etc.

At my internship over the summer, the issue of phone number data entry came up. Some employees were entering phone numbers with periods, some with a dash, others with just spaces or no separators at all. This company’s solution was to write how they wanted the number entered underneath the entry box.

A data type that can be represented in two different ways is the temperature. The United States uses Fahrenheit while many other countries across the world use Celsius. A way to reconcile this conflict would be to make the system understand the relationship between Fahrenheit and Celsius. For example, the database to know that 0 degree Celsius = 32 degree Fahrenheit. The database would either need to put the temperature in all of Fahrenheit or in Celsius, and if one is imputed as the other, the database should be able to convert it to the type of degree that they are using.

An example of data that can be stored in different formats and sources would be social security numbers. There are times where people only need to enter the last 4 digits of their SSN. Other times like when applying for FAFSA, you have to enter your entire SSN to validate who you are.

Shoe sizing is a form of data that does not have a universal standard. Different countries represent sizes using different scales. If a company exports their shoes internationally they may need to convert sizing into one standard sizing chart to reconcile the data and include the conversions for international orders.

I think one of the most classic ways of displaying data in different formats is the imperial vs metric systems. Specifically, the ways that we measure temperature, the Fahrenheit (imperial) is at freezing at 32 degrees, whereas with the metric (celsius) we see freezing point at a clean 0 degrees.

An example of data that can be stored in different formats and sources would be temperature. In the U.S., we use Fahrenheit as our main temperature system. For other countries, they use Celsius. Fahrenheit is a lot greater than Celsius, but they both interpret the same thing.

Whenever I’m buying clothes from a non-US site, I have to convert their sizing to US sizing. I saw the site was using 42/44/46/48/50/52 for tee shirts and I’ve only ever seen tee shirts go by XS/S/M/L/XL. I think to combat this problem would be to use number sizing. But then the problem of units would then come in to play…

An example of data that can be represented in two different ways would be miles per hour and kilometers per hour. In the US we use miles per hour and literally everywhere else in the world Kilometers per hour is used. Maybe it would be easier if we all agreed to use one method rather than get all confused by using two different methods. Or maybe the US should acknowledge that we should catch up with the program standardize all of our methods that are conflicted, but we know that will not happen.

Data for time can be stored in two different formats. Time can be recorded for every 12 hours with AM and PM or 24 hours which does not consist of AM and PM. For example the 12 hour format would be 5:00 P.M although if this was in the 24 hour format it would be just 17:00 hours into the day. A way to work with this would be to just use the extended format which is every 24 hours because when using the format for every 12 hours AM or PM also has to be recorded.

An example of the data that is represented in two different ways is phone number some programs will want 10 digits (XXXXXXXXXX), phone number format [(XXX)XXXXXXX] and others would like phone number spaced out [(XXX) XXX-XXXX]

An example of the data that is represented in two different ways is phone number some programs will want 10 digits (XXXXXXXXXX), phone number format [(XXX)XXXXXXX] and others would like phone number spaced out [(XXX) XXX-XXXX].