Data Science Interview Questions

Q. What do you mean by the term Data Science?

Ans: Data Science is the extraction of knowledge from large volumes of structured or unstructured data. It is a continuation of the fields of data mining and predictive analytics, and is also known as knowledge discovery and data mining.

Q. Explain the term botnet?
Ans: A botnet is a network of compromised machines, each running a bot that is typically installed by a Trojan and controlled over an IRC channel.

Q. What is Data Visualization?
Ans: Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context.

Q. How can you define data cleaning as a critical part of the process?
Ans: Cleaning data to the point where you can work with it is a huge amount of work. If you are trying to reconcile many sources of data that you do not control, such as flight data from different providers, it can take 80% of your time.
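To make the cleaning step concrete, here is a minimal, hypothetical sketch in base R; the data frame and column names are invented for illustration.

```r
# Hypothetical raw data with common problems: duplicates, missing values,
# inconsistent capitalisation, and a numeric column stored as text.
flights <- data.frame(
  carrier = c("AA", "aa", "DL", "DL", NA),
  delay   = c("12", "12", "5", "5", "30"),
  stringsAsFactors = FALSE
)

flights$carrier <- toupper(flights$carrier)    # standardise labels
flights$delay   <- as.numeric(flights$delay)   # fix the column type
flights <- flights[!duplicated(flights), ]     # drop exact duplicate rows
flights <- flights[!is.na(flights$carrier), ]  # drop rows with a missing key

str(flights)  # inspect the cleaned data set
```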

Q. Point out 7 ways in which Data Scientists use statistics?
Ans:
1. Design and interpret experiments to inform product decisions.
2. Build models that predict signal, not noise.
3. Turn big data into the big picture.
4. Understand user retention, engagement, conversion, and leads.
5. Give your users what they want.
6. Estimate intelligently.
7. Tell the story with the data.

Q. Differentiate between data modeling and database design?
Ans: Data modeling: in software engineering, data modeling is the process of creating a data model for an information system by applying formal data modeling techniques. Database design: database design is the process of producing a detailed data model of a database; the term can be used to describe many different parts of the design of an overall database system.

Q. Describe in brief the data science process flowchart?
Ans:
1. Data is collected from sensors in the environment.
2. Data is "cleaned" or processed to produce a data set (typically a data table) usable for analysis.
3. Exploratory data analysis and statistical modeling are performed.
4. A data product is built, such as a program retailers use to recommend new purchases based on purchase history; it may also create data and feed it back into the environment.

Q. What do you understand by the term hash table collision?
Ans: A hash table (hash map) is a data structure used to implement an associative array, a structure that maps keys to values. Ideally, the hash function assigns each key to a unique bucket, but sometimes two keys generate an identical hash, causing both to point to the same bucket. This is known as a hash collision (see the sketch below this group of questions).

Q. Compare and contrast R and SAS?
Ans: SAS is commercial software, whereas R is open source and can be downloaded by anyone. SAS is easy to learn and offers a straightforward path for people who already know SQL, whereas R is a programming language, so even simple procedures require writing code.

Q. What do you understand by the letter 'R'?
Ans: R is a language and environment for statistical computing and graphics. It is a GNU project, similar to the S language and environment, which was developed at Bell Laboratories.
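As a rough illustration of a collision and one common resolution (chaining), here is a toy, hypothetical hash table in base R; the hash function and key names are invented for the example and are not how production hash maps are implemented.

```r
# Toy hash table using chaining, to show two keys sharing one bucket.
n_buckets <- 4

hash_index <- function(key) {
  sum(utf8ToInt(key)) %% n_buckets + 1   # toy hash: sum of character codes
}

ht_put <- function(ht, key, value) {
  i <- hash_index(key)
  ht[[i]][[key]] <- value                # chaining: each bucket is a named list
  ht
}

ht_get <- function(ht, key) {
  ht[[hash_index(key)]][[key]]           # look up the key inside its bucket
}

ht <- vector("list", n_buckets)
ht <- ht_put(ht, "ab", 1)
ht <- ht_put(ht, "ba", 2)                # "ab" and "ba" collide: same character sum
ht_get(ht, "ab")                         # 1 -- both keys coexist in the same bucket
ht_get(ht, "ba")                         # 2
```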

Q. What does the R environment include?
Ans:
1. A suite of operators for calculations on arrays, in particular matrices.
2. An effective data handling and storage facility.
3. A large, coherent, integrated collection of intermediate tools for data analysis.
4. Graphical facilities for data analysis and display, either on-screen or on hardcopy.
5. A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities (illustrated in the sketch below).
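The last point can be seen in a short, self-contained snippet; the function and variable names below are invented for illustration.

```r
# A conditional, a user-defined recursive function, a loop, and matrix operators.
factorial_rec <- function(n) {
  if (n <= 1) 1 else n * factorial_rec(n - 1)  # conditional + recursion
}

squares <- numeric(5)
for (i in 1:5) {                               # loop
  squares[i] <- i^2
}

m <- matrix(1:6, nrow = 2)                     # 2 x 3 matrix
cat("5! =", factorial_rec(5), "\n")            # output facility
print(m %*% t(m))                              # matrix multiplication (2 x 2 result)
print(squares)
```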

Q. What are the applied Machine Learning process steps?
Ans:

1. Problem definition: understand and clearly describe the problem being solved.
2. Analyze data: understand the information available that will be used to develop a model.
3. Prepare data: define and expose the structure in the dataset.
4. Evaluate algorithms: develop a robust test harness and a baseline accuracy from which to improve, and spot-check algorithms (a minimal sketch of this step follows the list).
5. Improve results: refine the best-performing models to make them more accurate.
6. Present results: describe the problem and solution so that they can be understood by third parties.
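For step 4, here is a minimal sketch in base R of comparing a naive baseline against a simple model on a held-out test set; the built-in mtcars data and the 70/30 split are arbitrary choices for illustration.

```r
# Hold out a test set, then compare a mean-only baseline with a linear model.
set.seed(42)
idx   <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

baseline_pred <- rep(mean(train$mpg), nrow(test))   # always predict the training mean
model         <- lm(mpg ~ wt + hp, data = train)    # a simple candidate model
model_pred    <- predict(model, newdata = test)

cat("baseline RMSE:", rmse(test$mpg, baseline_pred), "\n")
cat("model RMSE:   ", rmse(test$mpg, model_pred), "\n")  # lower if the model adds signal
```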

Q. Compare multivariate, univariate, and bivariate analysis?
Ans:
Multivariate: multivariate analysis focuses on the results of observations of many different variables for a number of objects.
Univariate: univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive; the key fact is that only one variable is involved.
Bivariate: bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted X and Y) for the purpose of determining the empirical relationship between them.

Q. What is a hypothesis in machine learning?
Ans: The hypothesis space used by a machine learning system is the set of all hypotheses that might possibly be returned by it. It is typically defined by a hypothesis language, possibly in conjunction with a language bias.

Q. Differentiate between uniform and skewed distributions?
Ans:
Uniform distribution: a uniform distribution, sometimes also known as a rectangular distribution, is a distribution that has constant probability over its support.
Skewed distribution: in probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, negative, or even undefined, and its qualitative interpretation is complicated.

Q. What do you understand by the term transformation in data acquisition?
Ans: The transformation process allows you to consolidate, cleanse, and integrate data, so that data from heterogeneous sources can be arranged semantically.

Q. What do you understand by the term normal distribution?
Ans: It is a function that shows the distribution of many random variables as a symmetrical bell-shaped graph.

Q. What is data acquisition?
Ans: It is the process of measuring an electrical or physical phenomenon such as voltage, current, temperature, pressure, or sound with a computer. A DAQ system comprises sensors, DAQ measurement hardware, and a computer with programmable software.

Q. What is data collection?
Ans: Data collection is the process of gathering and measuring information on variables of interest in a systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.

Q. What do you understand by the term use case?
Ans: A use case is a methodology used in system analysis to identify, clarify, and organize system requirements. A use case consists of a set of possible sequences of interactions between systems and users in a particular environment, related to a particular defined goal.

Q. What are sampling and a sampling distribution?
Ans:
Sampling: sampling is the process of choosing units (e.g., people or organizations) from a population of interest so that, by studying the sample, we can fairly generalize our results back to the population from which they were chosen.
Sampling distribution: the sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic over all possible samples of a given size from the same population.
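To illustrate the last answer, here is a small simulation in base R that approximates the sampling distribution of the mean when drawing from a skewed (exponential) population; the population size, sample size, and number of repetitions are arbitrary choices.

```r
# Approximate the sampling distribution of the sample mean by repeated sampling.
set.seed(1)
population <- rexp(100000, rate = 1)   # a right-skewed population with mean 1

sample_means <- replicate(2000, mean(sample(population, size = 30)))

mean(sample_means)   # close to the population mean
sd(sample_means)     # close to sd(population) / sqrt(30)
hist(sample_means,   # roughly bell-shaped even though the population is skewed
     main = "Sampling distribution of the mean (n = 30)",
     xlab = "sample mean")
```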
Q. What is linear regression?
Ans: In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory (independent) variables denoted X. The case of one explanatory variable is known as simple linear regression (a short example appears at the end of this set of questions).

Q. Differentiate between extrapolation and interpolation?
Ans: Extrapolation is an estimate of a value based on extending a known sequence of values or facts beyond the range that is certainly known. Interpolation is an estimate of a value between two known values in a list of values.

Q. How is the expected value different from the mean value?
Ans: There is no difference; these are two names for the same thing. They are mostly used in different contexts, though: we speak of the expected value of a random variable and the mean of a sample, population, or probability distribution.

Q. Differentiate between systematic and cluster sampling?
Ans:
Systematic sampling: systematic sampling is a statistical method involving the selection of elements from an ordered sampling frame. The most common form of systematic sampling is an equal-probability method.
Cluster sampling: a cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.

Q. What are the advantages of systematic sampling?
Ans:
1. It is easier to perform in the field, especially if a proper frame is not available.
2. It regularly provides more information per unit cost than simple random sampling, in the sense of smaller variances.

Q. What do you understand by the term threshold limit value?
Ans: The threshold limit value (TLV) of a chemical substance is a level to which it is believed a worker can be exposed day after day for a working lifetime without adverse effects on his or her health.

Q. Differentiate between a validation set and a test set?
Ans: Validation set: a set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network. Test set: a set of examples used only to assess the performance of a fully specified classifier.

Q. How can R and Hadoop be used together?
Ans: The most common way to link R and Hadoop is to use HDFS (potentially managed by Hive or HBase) as the long-term store for all data, and to use MapReduce jobs (potentially submitted from Hive, Pig, or Oozie) to encode, enrich, and sample data sets from HDFS into R. Data analysts can then perform complex modeling exercises on a subset of the prepared data in R.
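Here is a minimal base R sketch that ties together the linear regression and interpolation questions; it uses the built-in cars data set, and the speed value being predicted is an arbitrary choice.

```r
# Simple linear regression: model stopping distance as a function of speed.
fit <- lm(dist ~ speed, data = cars)
summary(fit)$coefficients                     # intercept and slope estimates

# Interpolation with the model: predict within the observed range of speeds.
predict(fit, newdata = data.frame(speed = 12))

# Plain linear interpolation between known points, independent of the model.
approx(cars$speed, cars$dist, xout = 12, ties = mean)$y
```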

